Thursday, December 31, 2009

GPFS : Tuning recommendations

A few words on important GPFS tunables.

Pagepool & seqDiscardThreshold
GPFS does not use the regular file buffer cache of the operating system (e.g. non-computational memory in AIX) but implements its own caching mechanism. GPFS maintains its file buffer cache, called the pagepool, in pinned computational memory and uses it to cache user file data and file system metadata. The default pagepool size is 64 MB, which is too small for many applications. Applications that frequently re-use files and perform sequential reads benefit most from the pagepool. Non-DIO writes also go through the pagepool; for sequential writes, write-behind from the pagepool improves overall performance. For random I/O, GPFS cannot use read-ahead or write-behind techniques, but it can still rely on striping for improved performance.
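
As a sketch, the pagepool could be enlarged with mmchconfig along the following lines. The 1 GB and 2 GB values and the node names are only placeholders, not recommendations; make sure the nodes can actually afford that much pinned memory.

    # Check the current pagepool size in the cluster configuration
    mmlsconfig | grep -i pagepool

    # Raise the pagepool to 1 GB on all nodes; -i makes the change take
    # effect immediately and persist across GPFS restarts
    mmchconfig pagepool=1G -i

    # Or raise it only on selected, memory-rich nodes (placeholder names)
    mmchconfig pagepool=2G -i -N node01,node02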

A parameter that affects how data is cached in the pagepool is seqDiscardThreshold, which tells GPFS when to stop keeping sequentially read data in the pagepool. The default value is 1 MB, meaning that if a file larger than 1 MB is read sequentially, GPFS will not keep its data in the pagepool once it has been consumed. For applications in which large files are frequently re-read by multiple processes, setting this tunable to a higher value can improve performance.
It should also be noted that NSD servers do not cache anything for their NSD clients. If both NSD client A and NSD client B request the same file from an NSD server, the NSD server reads the data from disk twice. As a result, increasing the pagepool on an NSD server would have no effect in that respect.
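
If an application really does re-read large files and profits from keeping them cached, the threshold could be raised along these lines. This is only a sketch: the attribute is spelled seqDiscardThreshhold in some GPFS tuning documents, the value is assumed to be in bytes, and both should be verified against the documentation for your release.

    # Keep sequentially read files of up to 64 MB in the pagepool
    # (value assumed to be in bytes; verify the exact attribute name
    # and unit for your GPFS release before applying)
    mmchconfig seqDiscardThreshhold=67108864 -i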

Block size
This is one of the most important things to think about when designing a GPFS file system. Once the file system has been created with a specific block size, there is no way back other than recreating it with the new block size. Choosing the optimal GPFS block size is not a straightforward exercise, since it depends on several other factors:
  • Physical disk block size
  • LUN segment size, which is the maximum amount of data that is written or read from a disk per operation before the next disk in the array is used
  • Application block size (e.g. the database block size for an RDBMS)
The following example calculates the (theoretical) optimal GPFS block size, without taking the application block size into account:

RAID 5 (4+1), 128 KB LUN segment size = 512 KB LUN stripe size.
As a result, a GPFS block size of 512 KB (or a multiple of it) would be a good choice.
A GPFS block size of 256 KB will almost certainly lead to reduced performance, because the disk subsystem would have to read the remaining 256 KB of the 512 KB stripe in order to calculate parity for a write operation. Summarized, the operations on the disk subsystem for 256 KB and 512 KB block sizes would look like this:
  1. GPFS write (256 KB) = Write LUN Segment #1 (128 KB) + Write LUN Segment #2 (128 KB) + Read LUN Segment #3 (128 KB) + Read LUN Segment #4 (128 KB) + Calculate Parity + Write LUN Segment #5 (parity, 128 KB)
  2. GPFS write (512 KB) = Write LUN Segment #1 (128 KB) + Write LUN Segment #2 (128 KB) + Write LUN Segment #3 (128 KB) + Write LUN Segment #4 (128 KB) + Calculate Parity + Write LUN Segment #5 (parity, 128 KB)
Even taking a possible write cache on the disk subsystem into account, operation 1 is clearly more costly than operation 2.
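
As a sketch, such a file system could then be created as follows, assuming the NSDs have already been defined in a descriptor file; the device name, mount point and file names are placeholders, and the exact mmcrfs syntax varies somewhat between GPFS releases:

    # Create the file system with a 512 KB block size (-B); the block size
    # cannot be changed later without recreating the file system
    mmcrfs /gpfs/fs1 fs1 -F nsd.desc -B 512K -A yes

    # Verify the block size of an existing file system
    mmlsfs fs1 -B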

Split data / metadata
Splitting data and metadata is one of the most underestimated design questions.
The actual division can be changed online, so unlike changing the GPFS file system block size, there is no downtime involved. If metadata (inodes + indirect blocks) cannot be accessed fast enough, overall performance will degrade severely. Metadata access can be compared to a seek operation on a normal hard drive. Generally, it's a good idea to do the following (a disk-descriptor sketch follows the list):
  • Metadata (RAID 1 + Enable read/write cache on the disk subsystem)
    Try to keep as much metadata as possible in the cache of the disk subsystem so that every node that looks for it will find it in the cache.
    Write operations on metadata are generally random and small. Moreover, these write operations should be as fast as possible.
    As a result, using the write cache for metadata will be very beneficial.
    Finally, since metadata is read more often than it is written, RAID 1 is a better (though more costly) backend for metadata than RAID 5.

  • Data (RAID 5 + Disable read/write cache on the disk subsystem)
    Try to protect metadata in the cache as much as possible by disabling caching on data.
    As a result, nodes that are reading a lot of data don't thrash the cache, which can be used efficiently for nodes that need access to metadata instead. Sometimes it can even be beneficial to dedicate a disk controller to metadata LUNs (caching enabled on that controller).
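
A minimal sketch of what such a split could look like at the NSD level. The NSD and server names are placeholders, and the descriptor field layout (DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup) may differ slightly between GPFS releases; the essential part is the DiskUsage field (metadataOnly vs. dataOnly).

    # nsd.desc: RAID 1 LUNs carry only metadata, RAID 5 LUNs only data
    meta_nsd1:nsdserv1:nsdserv2:metadataOnly:1
    meta_nsd2:nsdserv1:nsdserv2:metadataOnly:2
    data_nsd1:nsdserv1:nsdserv2:dataOnly:1
    data_nsd2:nsdserv1:nsdserv2:dataOnly:2

    # The usage of an existing disk (placeholder name) can be changed
    # online; a restripe may be needed afterwards to migrate the
    # data/metadata accordingly
    mmchdisk fs1 change -d "old_nsd1:::dataOnly"
    mmrestripefs fs1 -r
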
maxFilesToCache & maxStatCache
As already stated, the pagepool is GPFS's file buffer cache in pinned computational memory, caching user file data and file system metadata. In addition, GPFS uses regular computational memory to maintain its inode and stat cache (user file metadata). The inode cache (controlled by the maxFilesToCache tunable, default 1000) contains copies of inodes for open files and for some recently used files that are no longer open. Keeping a file's inode in cache permits faster re-access to that file. The stat cache (controlled by the maxStatCache tunable, default 4 * maxFilesToCache) contains just enough information to open the file and satisfy a stat() call. It is intended to help commands such as ls -l and du, and certain backup programs that scan entire directories looking for modification times and file sizes. However, a stat cache entry does not contain enough information to read from or write to the file, since it lacks the indirect block references that a regular inode holds. A stat cache entry also consumes significantly less memory than a full inode.

It is possible for the number of currently open files to be larger than the size of the inode cache. In that case, when a node wants to read a file whose inode is not in the inode cache, the inode first has to be retrieved from disk. This is yet another reason why metadata must be accessible as fast as possible (see above).
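
A sketch of how these caches could be enlarged on nodes that handle very large numbers of files. The values are purely illustrative, and since each cached inode costs considerably more memory than a stat cache entry, they should be sized against the memory that is actually available.

    # Cache up to 10000 inodes and 40000 stat entries per node
    # (illustrative values; these typically take effect only after the
    # GPFS daemon has been restarted on the affected nodes)
    mmchconfig maxFilesToCache=10000,maxStatCache=40000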

prefetchThreads & worker1Threads & maxMBpS
The prefetchThreads tunable (default 72) controls the maximum number of threads dedicated to prefetching data for files that are read sequentially, or to handling sequential write-behind. The worker1Threads tunable (default 48), on the other hand, controls the maximum number of concurrent file operations; if there are more requests than worker1Threads, the excess waits until a previous request has finished. Its primary use is for random read or write requests that cannot be prefetched and for small file activity. The sum of prefetchThreads and worker1Threads may not exceed 550 on 64-bit kernels or 164 on 32-bit kernels. These values sometimes need tuning, e.g. in an Oracle RAC environment. Oracle does not need many prefetchThreads, since it does its own prefetching and does not use the GPFS pagepool (Oracle uses DIO to access files on a GPFS file system). However, Oracle does need a high number of worker1Threads to allow as many Oracle AIO threads as possible to work in parallel.
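
For the Oracle RAC case just described, the balance could be shifted towards worker1Threads, for example as follows. The numbers and the node class name are placeholders, and the sum has to stay below the per-kernel maximum mentioned above.

    # Favour concurrent DIO operations over prefetching on the Oracle
    # nodes (placeholder node class); a GPFS daemon restart on those
    # nodes is normally required before these values take effect
    mmchconfig prefetchThreads=72,worker1Threads=400 -N oracle_nodes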

The maxMBpS tunable (default 150) is used to estimate the amount of I/O that can be triggered by sequential read-ahead and write-behind. Setting this value higher than the default achieves more parallelism when there are many LUNs; lowering it can artificially limit the load a node places on the disk subsystem. Setting it too high usually does not cause problems, because other factors, such as the size of the pagepool and the number of prefetch threads, act as limits.
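
A sketch with an illustrative value; whether the change can be applied immediately or requires a daemon restart depends on the GPFS release.

    # Tell GPFS this node can drive roughly 800 MB/s of sequential I/O,
    # scaling up the amount of prefetch/write-behind work it schedules
    mmchconfig maxMBpS=800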

1 comment:

  1. Needs to be updated for GPFS 3.5. The default pagepool is now 1 GB. It should also mention that data will now fit in the inode itself when the data is small enough.
