donderdag 31 december 2009

GPFS : Tuning recommendations

A few words on important GPFS tunables.

Pagepool & SeqDiscardThreshold
GPFS does not use the operating system's regular file buffer cache (e.g. non-computational memory in AIX) but implements its own caching mechanism. GPFS uses pinned computational memory to maintain its file buffer cache, called the pagepool, which caches user file data and file system metadata. The default pagepool size is 64 MB, which is too small for many applications. Applications that re-use files frequently and perform sequential reads benefit from the pagepool. Non-DIO writes are also staged in the pagepool; for a sequential write operation, write-behind improves overall performance. For random I/O, GPFS cannot use read-ahead or write-behind techniques and has to rely on striping for improved performance.

A parameter that affects how data is cached in the pagepool is SeqDiscardThreshold, which controls how much sequentially read data GPFS tries to keep in the pagepool. The default value is 1 MB, which means that if a file larger than 1 MB is read sequentially, GPFS will not keep its data in the pagepool. For applications in which large files are often re-read by multiple processes, setting this tunable to a higher value can improve performance.
It should also be noted that NSD servers do not cache anything for their NSD clients. If both NSD client A and NSD client B request the same file from an NSD server, the NSD server will read the data from disk twice. As a result, increasing the pagepool on an NSD server has no effect for its NSD clients.
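As a rough sketch (assuming GPFS 3.x mmchconfig syntax; the exact spelling and unit handling of the sequential-discard tunable vary per release, so check the mmlsconfig output first), both tunables could be raised like this:

# Enlarge the pinned file buffer cache and keep sequentially read files
# of up to 16 MB in it (values are purely illustrative)
mmchconfig pagepool=1G
mmchconfig seqDiscardThreshold=16777216
mmlsconfig | egrep -i "pagepool|seqdiscard"

Depending on the GPFS level, the new pagepool size may only take effect after the GPFS daemon has been restarted on the affected nodes.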

Block size
This is one of the most important things to think about when designing a GPFS file system. After creating the file system with a specific block size, there is no way back other than recreating the file system with the new block size. Choosing the optimal GPFS block size is not a straightforward exercise since it depends on several other factors:
  • Physical disk block size
  • LUN segment size, which is the maximum amount of data that is written or read from a disk per operation before the next disk in the array is used
  • Application block size (e.g. the database block size for an RDBMS application)
The following example calculates the (theoretical) optimal GPFS block size, without taking the application block size into account:

RAID 5 (4+1), 128 KB LUN segment size = 512 KB LUN stripe size.
As a result, a GPFS block size of 512 KB (or a multiple of it) would be a good choice.
A GPFS block size of 256 KB will almost certainly lead to reduced performance because the disk subsystem would have to read the remaining 256 KB of the 512 KB stripe in order to calculate parity for a write operation. Summarized, the operations for both 256 KB and 512 KB block sizes on the disk subsystem would look like this:
  1. GPFS write (256 KB) = Write LUN Segment #1 (128 KB) + Write LUN Segment #2 (128 KB) + Read LUN Segment #3 (128 KB) + Read LUN Segment #4 (128 KB) + Calculate Parity + Write LUN segment #5 (128 KB)
  2. GPFS write (512 KB) = Write LUN Segment #1 (128 KB) + Write LUN Segment #2 (128 KB) + Write LUN Segment #3 (128 KB) + Write LUN Segment #4 (128 KB) + Calculate Parity + Write LUN segment #5 (128 KB)
Considering the possible use of a write cache on the disk subsystem, operation 1 is certainly more costly than operation 2.
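As a sketch (the mount point, device name, and NSD descriptor file below are made up), the file system for this example would be created with a 512 KB block size:

# Create the file system with a 512 KB block size; this choice is permanent
mmcrfs /gpfs1 gpfs1 -F /tmp/gpfs1.nsd -B 512K
# Confirm the block size afterwards
mmlsfs gpfs1 -B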

Split data / metadata
Splitting data and metadata is one of the most underestimated design decisions.
The actual division can be changed online, so unlike changing the GPFS file system block size, there is no downtime involved. If metadata (inodes + indirect blocks) cannot be accessed fast enough, overall performance will degrade severely. Metadata access can be compared to a seek operation on a normal hard drive. Generally, it's a good idea to do the following (an example NSD descriptor sketch follows the list):
  • Metadata (RAID 1 + Enable read/write cache on the disk subsystem)
    Try to keep as much metadata as possible in the cache of the disk subsystem so that every node that looks for it will find it in the cache.
    Write operations on metadata are generally random and small. Moreover, these write operations should be as fast as possible.
    As a result, using the write cache for metadata will be very beneficial.
    Finally, since metadata is more read from than written to, RAID 1 is a better (though more costly) backend for metadata than RAID 5.

  • Data (RAID 5 + Disable read/write cache on the disk subsystem)
    Try to protect metadata in the cache as much as possible by disabling caching on data.
    As a result, nodes that are reading a lot of data don't thrash the cache, which can be used efficiently for nodes that need access to metadata instead. Sometimes it can even be beneficial to dedicate a disk controller to metadata LUNs (caching enabled on that controller).
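The sketch below shows how such a split is expressed when the NSDs are created; the disk names, NSD server names, and failure groups are made up, and the colon-separated descriptor format is the GPFS 3.x one (verify the field order for your release):

# NSD descriptor file: a mirrored (RAID 1) LUN for metadata only,
# RAID 5 LUNs for data only
# Format: DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup
hdisk4:nsdserv1:nsdserv2:metadataOnly:1
hdisk5:nsdserv1:nsdserv2:dataOnly:2
hdisk6:nsdserv1:nsdserv2:dataOnly:2

The file is then passed to mmcrnsd -F and to mmcrfs -F; changing the usage of a disk later is done online with mmchdisk, followed by a rebalance with mmrestripefs.
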
maxFilesToCache & maxStatCache
As already stated, the pagepool is GPFS's file buffer cache in pinned computational memory, which caches user file data and file system metadata. GPFS uses regular computational memory, on the other hand, to maintain its inode and stat cache (user file metadata). The inode cache (controlled by the maxFilesToCache tunable, default 1000) contains copies of inodes for open files and for some recently used files that are no longer open. Storing a file's inode in cache permits faster re-access to that file. The stat cache (controlled by the maxStatCache tunable, default 4 * maxFilesToCache) contains enough information to open the file and satisfy a stat() call. It is intended to help functions such as ls -l, du, and certain backup programs that scan entire directories looking for modification times and file sizes. However, a stat cache entry does not contain enough information to read from or write to the file since it does not contain the indirect block references (unlike a regular inode). A stat cache entry consumes significantly less memory than a full inode.

It is possible that the number of currently open files is larger than the size of the inode cache. In that case, when a node wants to read a file whose inode is not in the inode cache, the inode first needs to be retrieved from disk. Therefore, it is very important that metadata can be accessed as fast as possible (see above).
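A hedged sketch of raising both caches (the values are arbitrary examples, not recommendations):

# Cache up to 4000 full inodes and 16000 compact stat entries per node;
# on the GPFS levels of this era the change takes effect after a daemon restart
mmchconfig maxFilesToCache=4000
mmchconfig maxStatCache=16000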

prefetchThreads & worker1Threads & maxMBpS
The prefetchThreads tunable (default 72) controls the maximum number of threads dedicated to prefetching data for files that are read sequentially, or to handling sequential write-behind. The worker1Threads tunable (default 48), on the other hand, controls the maximum number of concurrent file operations at any one time. If there are more requests than worker1Threads, the excess will wait until a previous request has finished. Its primary use is for random read or write requests that cannot be prefetched, random I/O requests, or small file activity. The maximum value of prefetchThreads plus worker1Threads is 550 (64-bit kernels) or 164 (32-bit kernels). These values sometimes need tuning, e.g. in an Oracle RAC environment. Oracle does not need many prefetchThreads, since Oracle does its own prefetching and does not use the GPFS pagepool (Oracle uses DIO to access files on a GPFS file system). However, Oracle does need a high number of worker1Threads to allow as many Oracle AIO threads as possible to work in parallel.

The maxMBpS tunable (default 150) is used for estimating the amount of I/O that can be triggered by sequential read-ahead and write-behind. Setting this value higher than the default provides more parallelism if there are many LUNs. By lowering this value, the load on the disk subsystem can be limited artificially. Setting this value too high usually does not cause problems because of other limiting factors, such as the size of the pagepool and the number of prefetch threads.
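For example, a sketch for a DIO-heavy Oracle RAC node could look as follows (the values are illustrative; keep the sum of prefetchThreads and worker1Threads below the limits mentioned above):

# Fewer prefetch threads (Oracle prefetches itself), more worker threads for
# concurrent random I/O, and a higher bandwidth estimate for the disk subsystem
mmchconfig prefetchThreads=48
mmchconfig worker1Threads=128
mmchconfig maxMBpS=400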

zaterdag 19 december 2009

AIX : Naming resolution

Following a recent conversation with IBM L2 support concerning general naming resolution, the relevant configuration files and environment variables (and how they relate to each other) are explained below.

AIX supports several mechanisms for naming resolution of hosts, networks, protocols, services, netgroups and rpc:
  • dns - Domain Name Service
  • nis - Network Information Service
  • nis+ - Network Information Service Plus
  • local - Local naming service. Searches the files in the /etc directory to resolve names
  • nis_ldap - Provides naming resolution for host, networks, protocols, rpc, services, and netgroups. This mechanism works with any directory server which stores entity data using a schema defined in RFC 2307. Although the name of the mechanism is nis_ldap, this mechanism does not use or require any NIS services!
AIX can be configured to use a combination of the above services for naming resolution. There is a sequential order that AIX follows to use these services. The default ordering can be overridden in several ways:
  • NSORDER environment variable
  • /etc/netsvc.conf configuration file
  • /etc/irs.conf configuration file
NSORDER
NSORDER is an environment variable that can be used to specify the order for resolving host names to addresses (gethostbyname) and vice versa (gethostbyaddr). NSORDER overrides the host settings in the netsvc.conf and irs.conf files. The supported mechanisms for NSORDER are bind, nis, local, which is also the default order.
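For example, to prefer the local /etc/hosts file and fall back to DNS for the current session only:

# NSORDER only affects host name resolution and overrides netsvc.conf and irs.conf
export NSORDER=local,bind
host somehost        # somehost is a placeholder host name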

/etc/netsvc.conf
The netsvc.conf file specifies the sequential order for resolving host names and aliases. It should be noted that sendmail ONLY uses netsvc.conf for resolution of host names and aliases. Other configuration files or environment variables are not consulted. The environment variable NSORDER overrides the host settings in the netsvc.conf file, which in turn overrides the host settings in the irs.conf file.
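A minimal example of the file (the same local-then-DNS order as above):

# /etc/netsvc.conf -- consult /etc/hosts first, then the DNS name server
hosts = local , bind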

/etc/irs.conf
The irs.conf file is used to control the order of mechanisms that the resolver libraries use in searching for network-related data, including the resolving of host names, networks, services, protocols, and netgroups. The default order for resolving host names and networks is dns, nis, local. The default order for resolving services, protocols, and netgroups is nis, local. The order defined in irs.conf will override the default values. The settings in the netsvc.conf configuration file override the settings in the irs.conf file. The NSORDER environment variable overrides the settings in the irs.conf and netsvc.conf files.
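A small sketch, assuming the classic map/mechanism/option syntax of irs.conf:

# /etc/irs.conf -- host lookups via DNS, falling through to /etc/hosts on failure;
# services and protocols resolved from the local /etc files
hosts     dns   continue
hosts     local
services  local
protocols local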

Note:
AIX offers two LDAP naming services, ldap and nis_ldap. The ldap naming service uses the IBM-specific schema and supports host name resolution only. The nis_ldap naming service, available since AIX 5.2, uses the RFC 2307 schema and supports name resolution of hosts, services, networks, protocols, and netgroups.

Summary

Service     Precedence
hosts       NSORDER, /etc/netsvc.conf, /etc/irs.conf
networks    /etc/irs.conf
protocols   /etc/irs.conf
services    /etc/irs.conf
netgroups   /etc/irs.conf

dinsdag 8 december 2009

NIM : Replication issue

Introduction:
Since AIX 5.3 TL5, Network Installation Manager supports replication of NIM objects from the NIM master to the alternate NIM master (APAR IY81860). Apparently, this feature does not function properly.

Impacted:
- All AIX versions up till now
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61

Details:
The setup consists of the following two nodes:
master (NIM master) and alternate (Alternate NIM master)

When issuing a regular sync operation on the NIM master, the operation is successful:
# nim -Fo sync alternate
...
nim_master_recover Complete

When issuing a sync operation on the NIM master with the replicate option (this will copy all resources that are not present on the alternate NIM master), the following error is observed.
# nim -Fo sync -a replicate=yes alternate
...
nim_master_recover Complete
error replicating resources: unable to /usr/lpp/bos.sysmgt/nim/methods/c_rsh master
Finished Replicating NIM resources
...
Finished checking SPOTs
nim_master_recover Complete

The replicate operation fails because of the broken c_rsh utility.
Further debugging of c_rsh on the NIM master revealed several ODM lookups just before the error.
# truss /usr/lpp/bos.sysmgt/nim/methods/c_rsh master date 2>&1 | grep objrepos
...
statx("/etc/objrepos/nim_object", 0x2FF1FD70, 76, 0) = 0
kopen("/etc/objrepos/nim_object", O_RDONLY) = 5
kopen("/etc/objrepos/nim_attr", O_RDONLY) = 5
kopen("/etc/objrepos/nim_attr.vc", O_RDONLY) = 6
...

Resolution:
Following PMR 25293.300.624, the IBM lab stated that a failing ODM lookup is the root cause of the issue. As a result, a particular data structure is not populated and a signal 11 (segmentation fault) occurs when a string is copied into this structure.

APAR IZ66255 was created to address this issue.

dinsdag 24 november 2009

VIO : Client path failover

Following a recent discussion with IBM L2 support, all parameters that affect VIO client path failover will be explained briefly.

# lsattr -El hdisk0
...
algorithm           fail_over
hcheck_interval     60
hcheck_mode         nonactive
...

Currently, MPIO on the VIO client only supports failover from one VSCSI client adapter to another (fail_over algorithm). Load balancing over multiple VSCSI client adapters is currently not supported.
The health check interval for each disk using MPIO should be configured so that the path status is updated automatically. Specifying hcheck_mode=nonactive means that health check commands are sent down paths that have no active I/O, including paths with a state of "Failed". The hcheck_interval attribute defines how often the health check is performed. In the client partition, the hcheck_interval for virtual SCSI devices is set to 0 by default, which means health checking is disabled.
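A sketch of enabling health checking on a vSCSI MPIO disk (hdisk0 is just an example device):

# -P defers the change to the next reboot, which is needed while the disk is in use
chdev -l hdisk0 -a hcheck_interval=60 -a hcheck_mode=nonactive -P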

# lsattr -El vscsi2
vscsi_err_recov        fast_fail
vscsi_path_to            30


vscsi_path_to, when enabled, allows the virtual client adapter driver to determine the health of the VIO Server to improve and expedite path failover processing.
A value of 0 (the default) disables it, while any other value defines the number of seconds the VSCSI client adapter will wait for commands issued to the VSCSI server adapter that have not been serviced in the meantime. If that time is exceeded, the VSCSI client adapter attempts the commands again and waits up to 60 seconds before it fails the outstanding requests. An error is written to the error log and, if MPIO is used, another path to the disk is tried to service the requests. Therefore, this parameter should only be set for MPIO installations with dual VIO servers.

Similar to the fc_error_recov attribute of real FC adapters, the vscsi_err_recov attribute is used by the VSCSI adapter driver. When this parameter is set to fast_fail, the VIO client adapter sends a FAST_FAIL datagram to the VIO server and subsequently fails the I/O immediately rather than after a delay. This may help to improve MPIO failover.

vscsi_err_recov has been added since AIX 5.3 TL9 (APAR IZ28537) and AIX 6.1 TL2 (APAR IZ28554).
It requires VIO server 2.1.
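A sketch of setting both adapter attributes on a vSCSI client adapter (vscsi0 is an example; as noted above, only do this in dual-VIOS MPIO configurations):

# -P defers the change to the next reboot since the adapter is normally in use
chdev -l vscsi0 -a vscsi_err_recov=fast_fail -a vscsi_path_to=30 -P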

woensdag 20 mei 2009

OpenSSH : Kerberos user principal name incorrect on AIX

Introduction:
Currently, it is observed that password based Kerberos authentication in OpenSSH does not function properly on AIX. Even though AIX can authenticate a user via Kerberos (using the KRB5/KRB5A load module), OpenSSH cannot.

Impacted:
- OpenSSH <= 5.2p1

Details:
This issue is caused by the fact that an AIX user has two attributes which OpenSSH doesn't take into account when forming the principal name of the user (attributes auth_name and auth_domain). If an AIX user, myuser, has the attributes auth_name=someone and auth_domain=SOMEWHERE, then the Kerberos principal name would be someone@SOMEWHERE instead of myuser@DEFAULTREALM. By employing the auth_domain attribute, requests are sent to the SOMEWHERE realm instead of the default realm DEFAULTREALM, which is listed in the libdefaults section of the krb5.conf configuration file.

The following can be seen in the OpenSSH code (auth-krb5.c on line 88):

problem = krb5_parse_name(authctxt->krb5_ctx,authctxt->pw->pw_name,&authctxt->krb5_user);

Since authctxt->pw->pw_name contains only the user name (without a realm), the default realm will automatically be appended according to the documentation of the krb5_parse_name call. Since this isn't the correct realm name (the realm from the overriding auth_domain attribute is the correct one), Kerberos authentication will fail. If the auth_domain attribute is not set, the default realm name will be used.
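To check whether a user actually carries these overriding attributes (myuser as in the example above):

# If both attributes come back empty, the default realm from krb5.conf applies
lsuser -a auth_name auth_domain myuser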

Resolution:
- Bugzilla item # 1583 was created to address this issue. The item contains a patch to the source which solves the issue.

woensdag 6 mei 2009

Samba : DFS does not work on AIX

Introduction:
Currently, there is a minor bug in Samba which makes DFS unusable on AIX.

Impacted:
- Samba <= 3.3.4

Details:
The issue is caused by the behaviour of the readlink system call on AIX. If the buffer cannot contain the entire symbolic link, the ERANGE error is returned. Other UNIX and Linux systems never return an error if the buffer is too small; instead, only part of the symbolic link is written into the buffer.

In msdfs.c, the character array 'link_target_buf' is defined with size 7 (size of "msdfs:" + 1). Since the DFS link is larger than that, the readlink system call on AIX returns ERANGE. In order to resolve this issue, the array should be of size PATH_MAX (defined in /usr/include/sys/limits.h).

A proposed patch looks like:

--- msdfs.c 2009-05-06 08:36:00.000000000 +0200
+++ msdfs.new.c 2009-05-06 08:36:44.000000000 +0200
@@ -400,11 +400,15 @@
      char **pp_link_target,
      SMB_STRUCT_STAT *sbufp)
 {
   SMB_STRUCT_STAT st;
   int referral_len = 0;
+#ifdef AIX
+  char link_target_buf[PATH_MAX];
+#else
   char link_target_buf[7];
+#endif
   size_t bufsize = 0;
   char *link_target = NULL;

   if (pp_link_target) {
      bufsize = 1024;


Resolution:
- Bugzilla item # 6330 was created to address this issue.

zondag 3 mei 2009

OpenSSH : Server option PrintLastLog does not work on AIX

Introduction:
Currently, the OpenSSH server option "PrintLastLog" does not work on AIX. The last login time is always displayed, regardless of the option.

Impacted:
- OpenSSH <= 5.2p1

Details:
When browsing the source, several functions in loginrec.c were found which solely handle the processing of the last login info (login_get_lastlog, getlast_entry).
Since AIX does not natively provide a lastlog facility, the configure script sets the DISABLE_LASTLOG define. A small code snippet from getlast_entry in loginrec.c shows this:

#if defined(DISABLE_LASTLOG)
   /* On some systems we shouldn't even try to obtain last login
    * time, e.g. AIX */
   return (0);


On the other hand, when issuing the AIX loginsuccess() call (which writes a new login record), the last login record can be retrieved by that very same call.
Looking at port-aix.c, the following can be seen:

if (loginsuccess((char *)user, (char *)host, (char *)ttynm, &msg) == 0) {
   success = 1;
   if (msg != NULL && loginmsg != NULL && !msg_done) {
      debug("AIX/loginsuccess: msg %s", msg);
      buffer_append(loginmsg, msg, strlen(msg));
      xfree(msg);
      msg_done = 1;
   }
}


Pointer "msg" points to the new last login info for the user and it always appended to the loginmsg buffer. The buffer_append call should only be called if options.print_lastlog is set.

Resolution:
- Bugzilla item # 1595 was created to address this issue. The item contains patches to the source which solve the issue.

maandag 20 april 2009

EtherChannel : Issue with backup virtual adapter

Introduction:
Currently, there is something very odd going on when using EtherChannel (Network Interface Backup mode) if the backup adapter is a virtual adapter. PMR 68839.300.624 clarified that it is by design that the backup virtual adapter receives traffic, even though it is in backup mode. However, this introduces an additional problem: although the backup virtual adapter receives traffic, it does not reply to it. It is the primary channel that responds, which creates an unbalanced situation on the physical network, resulting in flooding.

Impacted:
- All AIX versions up till now
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61
- POWER5(+) firmware <= SF240_320

Details:


Consider that the ARP tables on client LPAR A and server LPAR B are empty, as well as the MAC table on the Ethernet switch. Client LPAR A wishes to send data to LPAR B.

LPAR A:
ent1 (Virtual Ethernet) - MAC address 22:f1:30:00:70:06
en1 - IP address 10.226.32.145

LPAR B:
ent3 (EtherChannel in NIB mode (Active/Passive)) - MAC address 00:14:5e:c6:46:80
en3 - IP address 10.226.32.139
Primary Channel: ent2 (Physical Ethernet)
Backup Channel: ent1 (Virtual Ethernet)

VIO:
ent3 (Shared Ethernet Adapter) - MAC address 00:14:5e:48:2c:7a
Physical Ethernet: ent2 - MAC address 00:14:5e:48:2c:7a
Virtual Ethernet: ent1 - MAC address 22:f1:30:00:30:06

Source IP address: 10.226.32.145
Destination IP address: 10.226.32.139

Source MAC address: 22:f1:30:00:70:06
Destination MAC address: unknown

Since client LPAR A does not know the destination MAC address of server LPAR B, client LPAR A broadcasts an ARP request (Who has 10.226.32.139, tell 10.226.32.145) on the internal Layer 2 PHYP switch. Even though the EtherChannel on server LPAR B is in primary channel mode, the PHYP delivers this packet to the backup Virtual Ethernet adapter of the EtherChannel and also delivers the broadcast to the SEA for bridging. As a result, the MAC table on the physical switch is updated with the MAC address of client LPAR A, located on physical port X. Server LPAR B forms a unicast reply but sends it via the primary channel to the Ethernet switch. The Ethernet switch receives the unicast reply on port Y and links the source MAC address of server LPAR B to port Y in its MAC table. Since the frame carries a destination MAC address that has a valid MAC table entry on the physical switch, it is delivered to port X and is ultimately received by client LPAR A through the SEA. Client LPAR A updates its ARP table with the MAC address of server LPAR B.

Now client LPAR A can start communicating with server LPAR B since it knows the destination MAC address. The PHYP keeps delivering the packets via the backup Virtual Ethernet adapter of the EtherChannel. After the TTL of the MAC table entry for client LPAR A expires, flooding is observed on the physical switch: the switch acts as a simple repeater for all communication from server LPAR B to client LPAR A, sending it to all trunk ports and access ports defined in the same VLAN. Of course, the frames are also forwarded to port X (it is in the same VLAN) and are ultimately received by client LPAR A through the SEA.

When client LPAR A is sending jumbo frames (data) to server LPAR B, approximately 2 Mbit/s of TCP ACK flooding was observed. It gets really bad when the direction is reversed and server LPAR B sends data to client LPAR A: all data is then flooded on the switch, and only the TCP ACKs are delivered via the backup Virtual Ethernet adapter.

According to IBM, this is working as designed and a DCR was created to address this issue.

Resolution:
- Reduce the ARP table TTL on the LPARs (arpt_killc network tunable; see the sketch below) OR
- Increase MAC table TTL on the physical switch OR
- Replace Virtual Ethernet adapter by a Physical Ethernet adapter for the EtherChannel backup channel.
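A sketch of the first option (arpt_killc is expressed in minutes; the value chosen here is only an example):

# Lower the ARP entry lifetime from the default 20 minutes to 5 minutes;
# -p also records the value in /etc/tunables/nextboot so it survives a reboot
no -p -o arpt_killc=5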

woensdag 1 april 2009

Quorum active or not?

AIX 5.3 TL7 introduces concurrent quorum changes on a volume group. Prior to that level, a quorum change only becomes active after a varyoff/varyon operation on that specific volume group. This also means that, whenever only the ODM value has been changed, there is no easy way to know whether quorum is currently active or not, since lsvg displays the values of ODM attributes, not real-time values.
Fortunately, there is a way to figure out whether quorum is active or not. It involves debugging the running kernel using kdb. The procedure is as follows:

- Determine the major number of the volume group in /dev and convert it to hexadecimal. For example, rootvg always has a major number of 10 (hexadecimal A) and all logical volumes get a sequential minor number starting at 1.

# ls -al /dev/rootvg
crw-rw---- 1 root system 10, 0 Apr 24 2008 /dev/rootvg


- List the device switch table entry for the volume group, based on the hexadecimal major number, and track the effective address of the volgrp structure in memory (dsdptr)

# echo 'devsw 0xA' | kdb
The specified kernel file is a 64-bit kernel
Preserving 1402949 bytes of symbol table
First symbol __mulh
START END
0000000000001000 0000000003DDF050 start+000FD8
F00000002FF47600 F00000002FFDC920 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F100070F00000000 F100070F10000000 pvproc+000000
F100070F10000000 F100070F18000000 pvthread+000000
PFT:
PVT:
id....................0002
raddr.....000000000A000000 eaddr.....F200800040000000
size..............00080000 align.............00001000
valid..1 ros....0 fixlmb.1 seg....0 wimg...2
(0)> devsw 0xA
Slot address F1000100101AA500
MAJOR: 00A
   open: 04165624
   close: 04164EC8
   read: 04164738
   write: 04164638
   ioctl: 04162960
   strategy: 04180E9C
   ttys: 00000000
   select: .nodev (00196AE4)
   config: 041588F8
   print: .nodev (00196AE4)
   dump: 04181E68
   mpx: .nodev (00196AE4)
   revoke: .nodev (00196AE4)
   dsdptr: F100010032BA2000
   selptr: 00000000
   opts: 0000012A DEV_DEFINED DEV_MPSAFE DEV_EXTBUF


- Determine the flags attribute of the volgrp structure. The last bit is about quorum (1 -> quorum disabled)

# echo 'volgrp F100010032BA2000' | kdb | grep flags | awk '{print $4}'
00000001
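The three steps can be strung together in a small, unsupported ksh sketch (it assumes the ls -l and kdb output formats shown above):

# Print the volgrp flags word for a volume group; the last bit shows the
# real-time quorum state (1 -> quorum disabled)
VG=rootvg
MAJOR=$(ls -l /dev/$VG | awk '{print $5}' | tr -d ',')
HEXMAJOR=$(printf "0x%X" $MAJOR)
DSDPTR=$(echo "devsw $HEXMAJOR" | kdb | awk '/dsdptr/ {print $2}')
echo "volgrp $DSDPTR" | kdb | grep flags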

maandag 30 maart 2009

TCP issue in IBM NAS 1.4.0.8

Introduction:
Ever since IBM NAS (Network Authentication Service) 1.4.0.8 with TCP support (RFC1510 compliant) was released, an issue was found in the TCP reception of fragmented payloads. As a result, TCP connections will never be closed properly (they remain in the TCP state CLOSE_WAIT) and pose an mbuf depletion threat.

Impacted:
- IBM Network Authentication Service 1.4.0.8

Details:
NAS server: MTU 1500 bytes, IP x.y.z.u
NAS client: MTU 576 bytes, IP a.b.c.d

When the NAS client is in a LAN segment with MTU 576 bytes, the TCP issue on the server occurs. Once the client gets a cross realm ticket from an Active Directory domain controller, a service ticket is requested from the NAS server. The following tcpdump trace shows the TGS exchange:

(1)09:40:54.892621 IP a.b.c.d.1250 > x.y.z.u.88: S 1586082305:1586082305(0) win 64512 <mss 536,nop,nop,sackOK>
(2)09:40:54.892816 IP x.y.z.u.88 > a.b.c.d.1250: S 3658439259:3658439259(0) ack 1586082306 win 65535 <mss 1460>
(3)09:40:54.893145 IP a.b.c.d.1250 > x.y.z.u.88: . ack 1 win 64856
(4)09:40:54.893338 IP a.b.c.d.1250 > x.y.z.u.88: . 1:537(536) ack 1 win 64856
(5)09:40:54.893471 IP a.b.c.d.1250 > x.y.z.u.88: . 537:1073(536) ack 1 win 64856
(6)09:40:54.893743 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 1073 win 65535
(7)09:40:54.894292 IP a.b.c.d.1250 > x.y.z.u.88: . 1073:1609(536) ack 1 win 64856
(8)09:40:54.894310 IP a.b.c.d.1250 > x.y.z.u.88: . 1609:2145(536) ack 1 win 64856
(9)09:40:54.894320 IP a.b.c.d.1250 > x.y.z.u.88: P 2145:2307(162) ack 1 win 64856
(10)09:40:55.070688 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 2307 win 65535
(11)09:40:59.878565 IP a.b.c.d.1250 > x.y.z.u.88: . 2307:2843(536) ack 1 win 64856
(12)09:40:59.878649 IP a.b.c.d.1250 > x.y.z.u.88: . 2843:3379(536) ack 1 win 64856
(13)09:40:59.878658 IP a.b.c.d.1250 > x.y.z.u.88: . 3379:3915(536) ack 1 win 64856
(14)09:40:59.878720 IP a.b.c.d.1250 > x.y.z.u.88: . 3915:4451(536) ack 1 win 64856
(15)09:40:59.884118 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4451 win 65535
(16)09:40:59.884567 IP a.b.c.d.1250 > x.y.z.u.88: P 4451:4613(162) ack 1 win 64856
(17)09:41:00.084446 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4613 win 65535
(18)09:41:04.878515 IP a.b.c.d.1250 > x.y.z.u.88: F 4613:4613(0) ack 1 win 64856
(19)09:41:04.878592 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4614 win 65535

(1)First step in the TCP handshake in which the NAS client sends a SYN packet with TCP sequence number 1586082305, a TCP window size of 64512 bytes and a maximum TCP payload size (MSS) of 536 bytes.
(2)The NAS server replies by acknowledging the SYN packet from the NAS client and sending his own SYN packet with TCP sequence number 3658439259, a TCP window size of 65535 bytes and a maximum TCP payload size (MSS) of 1460 bytes.
(3)The NAS client acknowledges the SYN packet of the NAS server. The connection is now in the TCP state ESTABLISHED on both sides. The maximum TCP payload size (MSS) will be 536 bytes.
(4)The NAS client wants to send his TGS-REQ packet, but it has a total TCP payload of 2306 bytes. The large size of this payload can be explained by the inclusion of the PAC in the user's TGT. Due to the large payload, TCP fragmentation needs to be done. Since the agreed MSS size is 536 bytes, 5 fragments need to be transmitted.
(5)The NAS client sends the second fragment
(6)The NAS server acknowledges the first two fragments.
(7-8-9) The NAS client sends the remaining three fragments.
(10)The NAS server acknowledges the reception and reassembly of the remaining fragments. Normally, the NAS server should start sending the TGS-REP now but refuses to do so.
(11)After a 5 second timeout, the NAS client hasn't received the TGS-REP from the NAS server and starts retransmitting the first fragment of the TGS-REQ.
(12-13-14)The NAS client retransmits fragments #2,#3 and #4.
(15)The NAS server acknowledges the reception of the first 4 fragments.
(16)The NAS client sends his final fragment.
(17)The NAS server acknowledges the reception and reassembly of the remaining fragments. Once again, the NAS server doesn't start sending the TGS-REP.
(18)After an additional 5 second wait interval, the NAS client gives up and performs an active close on his end by sending a FIN packet to the NAS server. The NAS client is now in the TCP state FIN_WAIT_1.
(19)The NAS server acknowledges the FIN of the NAS client. The NAS server is now in the TCP state CLOSE_WAIT and the NAS client is now in the TCP state FIN_WAIT_2. Normally, the NAS server should now send a FIN packet to the NAS client for closing the TCP connection, but refuses to do so.


As a result, netstat on the NAS server shows TCP connections stuck in the TCP state CLOSE_WAIT. As long as the NAS server is active, those TCP connections will never be freed and pose a potential mbuf depletion threat.
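The stuck connections are easy to count on the server (port 88 is the Kerberos port):

# Every hit is a client connection the KDC never closed on its side
netstat -an | grep "\.88 " | grep -c CLOSE_WAIT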

After further investigation, the following truss output of the NAS server revealed the problem.
0.0000: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) (sleeping...)
0.0000: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 0
0.7132: yield() =
0.7136: thread_waitact(400) = 1
1.7665: naccept(75, 0x2FF21938, 0x2FF2198C) = 99
1.7669: ngetsockname(99, 0x2FF21998, 0x2FF21990) = 0
1.7673: kfcntl(99, F_GETFL, 0x00000000) = 6
1.7680: kfcntl(99, F_SETFL, 0x00000006) = 0
1.7684: kioctl(99, -2147195266, 0x10038260, 0x00000000) = 0
1.7688: setsockopt(99, 65535, 128, 0x10038268, 8) = 0
1.7691: __libc_sbrk(0x00000000) = 0x215E9520
1.7697: thread_setmystate(0x00000000, 0x2FF210B0) = 0
1.7700: mprotect(0x216C8000, 4096, 0) = 0
1.7704: thread_twakeup(3473645, 268435456) = 0
1.7707: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 268435456
= 1
1.7715: thread_setmystate(0x216E13D0, 0x216E16D8) = 0
1.7720: yield() =
1.7724: thread_waitact(400) = 1
1.7727: yield() =
3.7745: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) (sleeping...)
3.7745: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 1
kread(99, "\0\0\b ?", 4) = 4
4.7437: _select(100, 0x216DFAC8, 0x216E0AC8, 0x00000000,
0x00000000) = 1
kread(99, " l82\b 082\b ? ?030201".., 2302) = 532
4.7464: kthread_ctl(2, 0x00000000) = 0
4.7467: thread_setmystate_fast(0x4000000C, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x00000158, 0x00000000, 0x00000000)
= 0x00000000
4.7472: thread_setmystate_fast(0x4000000D, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x103500ED, 0x103500ED, 0x00000000)
= 0x00000000
4.7477: thread_setmystate_fast(0x4000000C, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x00000176, 0x00000000, 0x00000000)
= 0x00000000
4.7481: thread_setmystate_fast(0x4000000D, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x103500ED, 0x103500ED, 0x00000000)
= 0x00000000
4.7486: sigprocmask(0, 0xF08C77A8, 0x20366CEC) = 0
4.7489: thread_setmystate(0x203665F8, 0x00000000) = 0
4.7492: thread_tsleep(0, 0x20009100, 0x00000000, 0x00000000) = 0
5.7808: mprotect(0x216C8000, 4096, 3) = 0
5.7813: yield() =
...

The kread calls in the truss output show that of the requested 2302 additional bytes (remember, the TGS-REQ is 2306 bytes) only 532 bytes were read because of fragmentation. After that, NAS doesn't even attempt to read the remaining fragments. It just freaks out and doesn't even properly close the socket, keeping the connections on the server in the TCP state CLOSE_WAIT.

Resolution:
NAS L3 support states that they will provide a fix for this issue, which will be incorporated in the next version of NAS (1.4.0.9).

zondag 29 maart 2009

Identifying memory leaks in AIX

Dynamic memory allocation happens at run time rather than at the creation of the process. While giving more flexibility to the programmer, it also requires a lot more housekeeping. In large programs, memory leaks are a very common issue, with very unpleasant side effects. Side effects of poor dynamic memory allocation management can include:
- malloc returns with errno set to ENOMEM
- process working segment is growing over time (detected with either ps gv or svmon)
- core dump of the process which has malloc in the stack trace

While a growing process working segment is an indication of a memory leak, it doesn't necessarily mean there is one. It might be perfectly normal for a process to allocate additional memory during its lifetime. However, at some point in time, dynamic memory has to be freed by the application. The following graphs show both a normal and an abnormal memory evolution of a process (X axis: time, Y axis: allocated memory).
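A crude way to plot such an evolution yourself is to sample the process size from the shell (column positions are taken from the default AIX ps v layout, and 1234567 is a placeholder PID):

# Print a timestamp and the virtual size (SIZE, in 1 KB units) every minute
while true; do
   echo "$(date) $(ps v 1234567 | tail -1 | awk '{print $6}')"
   sleep 60
done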

Since dynamic memory gets allocated on the process heap, it is automatically cleaned up when the process terminates. This also means that a memory leak isn't that harmful for a process with a short lifetime; for daemons, however, it is potentially much more harmful!

How can memory leaks be tracked down? There are commercial products available that examine the memory allocation of processes (e.g. IBM Rational PurifyPlus), but AIX provides a subsystem capable of doing this out of the box. Let's consider the following C code, which has a few memory leaks.

#include <stdio.h>
#include <stdlib.h>
void routineA(){
   char *test=malloc(4);
   fprintf(stdout,"routineA\n");
   fprintf(stdout,"pointer residing at address %p\n",&test);
   fprintf(stdout,"value of pointer %p\n",test);
}
void routineB(){
   char *test=malloc(4);
   fprintf(stdout,"routineB\n");
   fprintf(stdout,"pointer residing at address %p\n",&test);
   fprintf(stdout,"value of pointer %p\n",test);
   free(test);
}
void routineC(){
   char *test=malloc(8);
   fprintf(stdout,"routineC\n");
   fprintf(stdout,"pointer residing at address %p\n",&test);
   fprintf(stdout,"value of pointer %p\n",test);
}
int main(){
   char *test=malloc(4);
   fprintf(stdout,"main\n");
   fprintf(stdout,"pointer residing at address %p\n",&test);
   fprintf(stdout,"value of pointer %p\n",test);
   routineA();
   routineB();
   routineC();
}


Here we can clearly see that the memory allocations in main, routineA, and routineC never get freed. Using the malloc debug subsystem, we are also made aware of this. Moreover, even if we don't have the source, the malloc debug subsystem will give us the stack trace.

#export MALLOCTYPE=debug
#export MALLOCDEBUG=report_allocations
#./memleak
main
pointer residing at address 2ff22b10
value of pointer 2000eff8
routineA
pointer residing at address 2ff22ac0
value of pointer 20010ff8
routineB
pointer residing at address 2ff22ac0
value of pointer 20012ff8
routineC
pointer residing at address 2ff22ac0
value of pointer 20012ff8
Current allocation report:

   Allocation #0: 0x2000EFF8
      Allocation size: 0x4
      Allocation traceback:
      0xD03EA170 malloc
      0xD036C260 init_malloc
      0xD036D434 malloc
      0x10000540 main

   Allocation #1: 0x20010FF8
      Allocation size: 0x4
      Allocation traceback:
      0xD03EA170 malloc
      0x10000360 routineA
      0x1000058C main
      0x100001C4 __start

   Allocation #2: 0x20012FF8
      Allocation size: 0x8
      Allocation traceback:
      0xD03EA170 malloc
      0x100003FC routineC
      0x10000594 main
      0x100001C4 __start

   Total allocations: 3.

The malloc debug subsystem states there are three memory leaks in the program. The first one is in the main routine (at address 0x2000EFF8, size 4 bytes), the second one is in routineA (at address 0x20010FF8, size 4 bytes) and the last one is located in routineC (at address 0x20012FF8, size 8 bytes).

Whenever you wish to open a PMR for a memory leak, be sure to add the malloc trace as well. If it's not a known issue, you will be redirected to L3 support quite fast :)

zaterdag 28 maart 2009

Memory leak KRB5A & libkrb5.a

Introduction:
Currently, there is a memory leak in both the KRB5A load module and the libkrb5.a library in AIX. The KRB5A load module is shipped with AIX, whereas the libkrb5.a library is shipped with the krb5.client.rte fileset in NAS (Network Authentication Service), which is IBM's version of Kerberos.

Impacted:
- All AIX versions up till now
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61
- IBM Network Authentication Service <= 1.4.0.8

Details:
# lsuser -a SYSTEM sidsmig
sidsmig SYSTEM=KRB5A
# cat /usr/lib/security/methods.cfg | grep -ip KRB5A
KRB5A:
    program = /usr/lib/security/KRB5A
    program_64 = /usr/lib/security/KRB5A_64
    options = authonly


The following C test program was used in PMR 69409.300.624

#include <stdio.h>
#include <stdlib.h>
#include <usersec.h>

int main(int argc,char** argv){
   while(1){
      int reenter;
      char* msg;
      authenticate(argv[1], argv[2], &reenter, &msg);
      if(msg) {
         free(msg);
         break;
      }
   }
}


An increasing process working segment could be noticed with either ps gv or svmon -P when user sidsmig authenticates to the system.

Resolution:
- APAR IZ43820 was created to address this issue.

vrijdag 27 maart 2009

First post

There always has to be a first post, wouldn't you think? Well, this is it!
I've never used a blog before, so I have no clue how things will work out in the end. I mainly chose to start this blog to keep a record of what I am currently working on and to archive the things I have worked on (from a professional point of view) so that other people can benefit from it. You'll mainly see AIX (IBM's version of UNIX) related things here.

See you around!

Miguel