Thursday, 31 December 2009

GPFS : Tuning recommendations

A few words on important GPFS tunables.

Pagepool & seqDiscardThreshold
GPFS does not use the regular file buffer cache of the operating system (e.g. non-computational memory in AIX) but uses its own mechanism to implement caching. GPFS uses pinned computational memory to maintain its file buffer cache, called the pagepool, which is used to cache user file data and file system metadata. The default pagepool size is 64 MB, which is too small for many applications. Applications that re-use files a lot and perform sequential reads will benefit from the pagepool. Non-DIO writes are also done to the pagepool. For a sequential write operation to the pagepool, write-behind will improve overall performance. For random I/O, GPFS cannot use read-ahead or write-behind techniques and has to rely on striping for improved performance.

A parameter that affects how data is cached in the pagepool is seqDiscardThreshold, which controls how much sequentially read data GPFS tries to keep in the pagepool. The default value is 1 MB, which means that if a file greater than 1 MB is read sequentially, GPFS will not keep its data in the pagepool. For applications in which large files are often re-read by multiple processes, setting this tunable to a higher value can improve performance.
It should also be noted that NSD servers do not cache anything for their NSD clients. If both NSD client A and NSD client B request the same file from an NSD server, the NSD server will get the data twice from disk. As a result, increasing the pagepool on an NSD server would have no effect for its clients.
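
As a sketch, both tunables can be set cluster-wide with mmchconfig. The values below are purely illustrative and assume that seqDiscardThreshold is specified in bytes; depending on the GPFS level, a change may require a restart of the GPFS daemon on the affected nodes.

# Sketch only - size the pagepool against the available (pinned) memory.
mmchconfig pagepool=1G
# Keep sequentially read files of up to 64 MB in the pagepool
# (assumption: value given in bytes).
mmchconfig seqDiscardThreshold=67108864
# Verify the active settings.
mmlsconfig | egrep "pagepool|seqDiscardThreshold"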

Block size
This is one of the most important things to think about when designing a GPFS file system. After creating the file system with a specific block size, there is no way back other than recreating the file system with the new block size. Choosing the optimal GPFS block size is not a straightforward exercise since it depends on several other factors:
  • Physical disk block size
  • LUN segment size, which is the maximum amount of data that is written or read from a disk per operation before the next disk in the array is used
  • Application block size (e.g. the DB block size for an RDBMS application)
The following example calculates the (theoretical) optimal GPFS block size, without taking the application block size into account:

RAID 5 4+1, 128 KB LUN segment size = 512 KB LUN stripe size.
As a result, a GPFS block size of 512 KB (or a multiple of) would be good.
A GPFS block size of 256 KB will almost certainly lead to reduced performance because the disk subsystem would have to read the remaining 256 KB of the 512 KB stripe in order to calculate parity for a write operation. Summarized, the operations on the disk subsystem for both the 256 KB and 512 KB block sizes would look like:
  1. GPFS write (256 KB) = Write LUN Segment #1 (128 KB) + Write LUN Segment #2 (128 KB) + Read LUN Segment #3 (128 KB) + Read LUN Segment #4 (128 KB) + Calculate Parity + Write LUN segment #5 (128 KB)
  2. GPFS write (512 KB) = Write LUN Segment #1 (128 KB) + Write LUN Segment #2 (128 KB) + Write LUN Segment #3 (128 KB) + Write LUN Segment #4 (128 KB) + Calculate Parity + Write LUN segment #5 (128 KB)
Even considering the possible use of a write cache on the disk subsystem, operation 1 is clearly more costly than operation 2 because of the extra read operations needed to calculate parity.
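
As a sketch, a file system matching the 512 KB stripe size above could be created as follows. The device name gpfs1 and the descriptor file /tmp/gpfs1.desc are assumptions, and the exact mmcrfs syntax differs between GPFS releases.

# Create the file system with a 512 KB block size (matches the LUN stripe size).
mmcrfs /gpfs1 gpfs1 -F /tmp/gpfs1.desc -B 512K -A yes
# The block size cannot be changed afterwards; verify it with mmlsfs.
mmlsfs gpfs1 -B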

Split data / metadata
Splitting data and metadata is one of the most underestimated design decisions.
The actual division can be changed online, so unlike changing the GPFS file system block size, there is no downtime involved. If metadata (inodes + indirect data blocks) cannot be accessed fast enough, overall performance will degrade severely. Metadata access can be compared to a seek operation on a normal hard drive. Generally, it's a good idea to do the following (a sample disk descriptor layout follows the list):
  • Metadata (RAID 1 + Enable read/write cache on the disk subsystem)
    Try to keep as much metadata as possible in the cache of the disk subsystem so that every node that looks for it will find it in the cache.
    Write operations on metadata are generally random and small. Moreover, these write operations should be as fast as possible.
    As a result, using the write cache for metadata will be very beneficial.
    Finally, since metadata is more read from than written to, RAID 1 is a better (though more costly) backend for metadata than RAID 5.

  • Data (RAID 5 + Disable read/write cache on the disk subsystem)
    Try to protect metadata in the cache as much as possible by disabling caching on data.
    As a result, nodes that are reading a lot of data don't thrash the cache, which can be used efficiently for nodes that need access to metadata instead. Sometimes it can even be beneficial to dedicate a disk controller to metadata LUNs (caching enabled on that controller).
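
A sketch of how such a split could look in the colon-delimited disk descriptor format used by the GPFS 3.x commands (mmcrnsd/mmcrfs); the hdisk names, NSD server names and failure groups are assumptions.

# DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup
# RAID 1 LUNs carrying metadata only
hdisk10:nsdserv1:nsdserv2:metadataOnly:1
hdisk11:nsdserv1:nsdserv2:metadataOnly:2
# RAID 5 LUNs carrying data only
hdisk20:nsdserv1:nsdserv2:dataOnly:1
hdisk21:nsdserv1:nsdserv2:dataOnly:2
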
maxFilesToCache & maxStatCache
As already stated, the pagepool is GPFS's file buffer cache in pinned computational memory, which caches user file data and file system metadata. On the other hand, GPFS uses regular computational memory to maintain its inode and stat cache (user file metadata). The inode cache (controlled by the maxFilesToCache tunable, default 1000) contains copies of inodes for open files and for some recently used files that are no longer open. Storing a file's inode in cache permits faster re-access to that file. The stat cache (controlled by the maxStatCache tunable, default 4 * maxFilesToCache) contains enough information to open the file and satisfy a stat() call. It is intended to help functions such as ls -l, du, and certain backup programs that scan entire directories looking for modification times and file sizes. However, a stat cache entry does not contain enough information to read from or write to the file, since it does not contain the indirect block references (unlike a regular inode). A stat cache entry consumes significantly less memory than a full inode.

It is possible that the number of currently open files is larger than the size of the inode cache. In that case, if a node wants to read a file whose inode is not in the inode cache, the inode has to be retrieved from disk first. Therefore, it's very important that metadata can be accessed as fast as possible (see above).
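
As a sketch, both caches could be enlarged on nodes that work with many files. The values are illustrative only (keeping the default 1:4 ratio), memory usage grows with both settings, and the change only takes effect after the GPFS daemon is restarted on the affected nodes.

# Illustrative values only.
mmchconfig maxFilesToCache=3000,maxStatCache=12000
mmlsconfig | egrep "maxFilesToCache|maxStatCache"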

prefetchThreads & worker1Threads & maxMBpS
The prefetchThreads tunable (default 72) controls the maximum possible number of threads dedicated to prefetching data for files that are read sequentially, or to handling sequential write-behind. On the other hand, the worker1Threads tunable (default 48) controls the maximum number of concurrent file operations. If there are more requests than the number of worker1Threads, the excess will wait until a previous request has finished. Its primary use is for random read or write requests that cannot be prefetched, random I/O requests, or small file activity. The maximum value of prefetchThreads plus worker1Threads is 550 (64-bit kernels) or 164 (32-bit kernels). These values sometimes need tuning, e.g. in an Oracle RAC environment. Oracle does not need many prefetchThreads, since Oracle does its own prefetching and does not use the GPFS pagepool (Oracle uses DIO to access files on a GPFS file system). However, Oracle does need a high number of worker1Threads to allow as many Oracle AIO threads as possible to work in parallel.

The maxMBpS tunable (default 150) is used for estimating the amount of I/O triggered by sequential read-ahead and write-behind. Setting this value higher than the default will give more parallelism if there are many LUNs. By lowering this value, the load on the disk subsystem can be limited artificially. Setting this value too high usually does not cause problems because of other limiting factors, such as the size of the pagepool and the number of prefetch threads.
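
A sketch for an Oracle-style DIO workload along the lines described above; the values are illustrative only, and prefetchThreads plus worker1Threads may not exceed 550 on 64-bit kernels.

# Favour worker1Threads over prefetchThreads for DIO-heavy random workloads.
mmchconfig prefetchThreads=72,worker1Threads=450
# Allow more parallel sequential I/O when there are many LUNs.
mmchconfig maxMBpS=400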

Saturday, 19 December 2009

AIX : Naming resolution

Following a recent conversation with IBM L2 support concerning general naming resolution, this post explains the relevant configuration files and environment variables (and how they relate).

AIX supports several mechanisms for naming resolution of hosts, networks, protocols, services, netgroups and rpc:
  • dns - Domain Name Service
  • nis - Network Information Service
  • nis+ - Network Information Service Plus
  • local - Local naming service. Searches the files in the /etc directory to resolve names
  • nis_ldap - Provides naming resolution for host, networks, protocols, rpc, services, and netgroups. This mechanism works with any directory server which stores entity data using a schema defined in RFC 2307. Although the name of the mechanism is nis_ldap, this mechanism does not use or require any NIS services!
AIX can be configured to use a combination of the above services for naming resolution. There is a sequential order that AIX follows to use these services. The default ordering can be overridden in several ways:
  • NSORDER environment variable
  • /etc/netsvc.conf configuration file
  • /etc/irs.conf configuration file

NSORDER
NSORDER is an environment variable that can be used to specify the order for resolving host names to addresses (gethostbyname) and vice versa (gethostbyaddr). NSORDER overrides the host settings in the netsvc.conf and irs.conf files. The supported mechanisms for NSORDER are bind, nis, local, which is also the default order.
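
For example, to resolve via the local /etc/hosts file first and fall back to DNS for the current shell session only (the hostname is just an example):

# Per-session override: local files first, then DNS.
export NSORDER=local,bind
host somehost.example.com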

/etc/netsvc.conf
The netsvc.conf file specifies the sequential order for resolving host names and aliases. It should be noted that sendmail ONLY uses netsvc.conf for resolution of host names and aliases. Other configuration files or environment variables are not consulted. The environment variable NSORDER overrides the host settings in the netsvc.conf file, which in turn overrides the host settings in the irs.conf file.
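
A minimal /etc/netsvc.conf sketch that resolves hosts locally first and then via DNS, and aliases via local files:

# /etc/netsvc.conf
hosts = local, bind
aliases = files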

/etc/irs.conf
The irs.conf file is used to control the order of mechanisms that the resolver libraries use in searching for network-related data, including the resolving of host names, networks, services, protocols, and netgroups. The default order for resolving host names and networks is dns, nis, local. The default order for resolving services, protocols, and netgroups is nis, local. The order defined in irs.conf will override the default values. The settings in the netsvc.conf configuration file override the settings in the irs.conf file. The NSORDER environment variable overrides the settings in the irs.conf and netsvc.conf files.
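
A minimal /etc/irs.conf sketch; each line names a map, a mechanism, and optionally the keyword continue to fall through to the next entry:

# /etc/irs.conf
hosts    local   continue
hosts    dns
services nis     continue
services local
netgroup nis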

Note:
AIX offers two LDAP naming services, ldap and nis_ldap. The ldap naming service uses the IBM-specific schema and supports host name resolution only. The nis_ldap naming service, implemented since AIX 5.2, uses the RFC 2307 schema and supports name resolution of hosts, services, networks, protocols, and netgroups.

Summary

Service     Precedence
hosts       NSORDER, /etc/netsvc.conf, /etc/irs.conf
networks    /etc/irs.conf
protocols   /etc/irs.conf
services    /etc/irs.conf
netgroups   /etc/irs.conf

Tuesday, 8 December 2009

NIM : Replication issue

Introduction:
Since AIX 5.3 TL5, Network Installation Manager supports replication of NIM objects from the NIM master to the alternate NIM master (APAR IY81860). Apparently, this feature does not function properly.

Impacted:
- All AIX versions to date
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61

Details:
The setup consists of the following two nodes:
master (NIM master) and alternate (Alternate NIM master)

When issuing a regular sync operation on the NIM master, the operation is successful:
# nim -Fo sync alternate
...
nim_master_recover Complete

When issuing a sync operation on the NIM master with the replicate option (this will copy all resources that are not present on the alternate NIM master), the following error is observed.
# nim -Fo sync -a replicate=yes alternate
...
nim_master_recover Complete
error replicating resources: unable to /usr/lpp/bos.sysmgt/nim/methods/c_rsh master
Finished Replicating NIM resources
...
Finished checking SPOTs
nim_master_recover Complete

The replicate operation fails because of the broken c_rsh utility.
Further debugging of c_rsh on the NIM master showed that there are several ODM lookups prior to the error.
# truss /usr/lpp/bos.sysmgt/nim/methods/c_rsh master date 2>&1 | grep objrepos
...
statx("/etc/objrepos/nim_object", 0x2FF1FD70, 76, 0) = 0
kopen("/etc/objrepos/nim_object", O_RDONLY) = 5
kopen("/etc/objrepos/nim_attr", O_RDONLY) = 5
kopen("/etc/objrepos/nim_attr.vc", O_RDONLY) = 6
...

Resolution:
Following PMR 25293.300.624, the IBM lab stated that a failing ODM lookup is the root cause of the issue. As a result, a particular data structure is not populated and a signal 11 (segmentation fault) occurs when trying to copy a string into this structure.

APAR IZ66255 was created to address this issue.