12 December 2008
The debate about the optimum shared_buffers size in PostgreSQL is clearly far from over. However, I have not seen any results where the buffer cache and the FS cache were tuned in unison.
Because PostgreSQL knows which buffers are in use and when to flush the dirty ones, it is ideal to give it as much memory as possible. Using a secondary cache (the FS cache), as in a 'traditional' PostgreSQL configuration, just adds workload to the system: the secondary cache needs a process to wake up and flush its buffers (pdflush), has to manage its own LRU, and likely holds many duplicate buffers that waste your precious RAM.
The point I am trying to stress is that the PostgreSQL cache *and* the FS cache must be tuned together. In this case, I remove the FS cache entirely using direct I/O, and then take over that space with the fastest cache in the stack: the PostgreSQL buffer cache.
So here is a bit of testing to show the results.
The testing was performed on a set of nice new HP DL185 G5s with 14 72GB 15k SAS drives, quad-core Opteron processors, and 32GB of main memory. The OS is SUSE Linux Enterprise Server 10 SP1 (x86_64), kernel 2.6.16.46-0.12-smp. The filesystem is Veritas VxFS. The version of PostgreSQL tested is 8.2.5.
The test itself is a set of custom pgbench scripts. These scripts were developed using actual data from production PostgreSQL log files and pgfouine. The access pattern is about 70% read, 15% update, and 15% insert, with no delete activity. This matches a very specific workload on a subset of our production databases. The workload is random, and the total data size is far larger than the cache size, which guarantees a good workout of the physical disks and the caches in front of them. The WAL logs and data were on the same VxFS volume.
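The production scripts themselves aren't published here, but a minimal sketch of a similar 70/15/15 mix, written for modern pgbench with its per-script weights and using hypothetical table and column names, might look like this:

# Hypothetical transaction scripts approximating a 70% read / 15% update / 15% insert mix.
# Table names, column names, and the key range are placeholders, not the production schema.

cat > read.sql <<'EOF'
\set id random(1, 100000000)
SELECT * FROM accounts WHERE id = :id;
EOF

cat > update.sql <<'EOF'
\set id random(1, 100000000)
UPDATE accounts SET balance = balance + 1 WHERE id = :id;
EOF

cat > insert.sql <<'EOF'
\set id random(1, 100000000)
INSERT INTO history (account_id, delta) VALUES (:id, 1);
EOF

# Recent pgbench versions accept per-script weights with the file@weight syntax;
# older versions can only approximate the mix by repeating -f options.
pgbench -n -c 36 -T 600 \
  -f read.sql@70 -f update.sql@15 -f insert.sql@15 \
  mydb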
The test runs were gathered with pgd and graphed. Between each test the filesystem was unmounted and remounted so each run started with a cold FS cache. I ran each test three times and averaged the results.
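A rough sketch of the per-run procedure (device, mount point, data directory, and script names are placeholders, and the mount options change per test) would be:

#!/bin/sh
# Per-run procedure sketch: remount the VxFS volume between runs so each run
# starts with a cold FS cache, then average the reported tps across three runs.
DEV=/dev/vx/dsk/datadg/pgvol     # placeholder device
MNT=/pgdata                      # placeholder mount point

for run in 1 2 3; do
    pg_ctl -D "$MNT/data" stop -m fast
    umount "$MNT"
    mount -t vxfs "$DEV" "$MNT"          # add the mount options for the test being run
    pg_ctl -D "$MNT/data" start -w
    pgbench -n -c 36 -T 600 \
      -f read.sql@70 -f update.sql@15 -f insert.sql@15 mydb \
      | tee "run_${run}.out"
done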
Test#1:
VxFS buffered I/O
shared_buffers=500MB
disk controller cache=512MB write, 0MB read
500GB total data size
36 concurrent backends
Test#2:
VxFS buffered I/O
shared_buffers=2GB
disk controller cache=512MB write, 0MB read
500GB total data size
36 concurrent backends
Test#3:
VxFS with convosync=direct,mincache=direct (mount sketch below)
shared_buffers=20GB
disk controller cache=512MB write, 0MB read
500GB total data size
36 concurrent backends
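For Test #3, the direct I/O mount and the corresponding postgresql.conf change would look roughly like the following; the device name and mount point are placeholders, and the values are simply those from the test setup above:

# Mount the data volume with VxFS direct I/O so reads and writes bypass the FS cache
# (device name and mount point are placeholders).
mount -t vxfs -o convosync=direct,mincache=direct /dev/vx/dsk/datadg/pgvol /pgdata

# In postgresql.conf, hand the memory the FS cache would have used to PostgreSQL:
#   shared_buffers = 20GB
# shared_buffers requires a restart, and the kernel shared memory limits
# (kernel.shmmax / kernel.shmall) must be raised to match before starting.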
Conclusion:
A modest gain can be had by combining a comparatively very large shared_buffers setting with direct I/O. Configured in this manner, the PostgreSQL cache scales quite nicely up to at least a 20GB cache size.
Further gains could likely be achieved by separating the WAL logs onto their own volume with WAL-specific VxFS tunables, and by tuning other PostgreSQL parameters to take advantage of the larger cache (bgwriter tunables, etc.).
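As an untested sketch of that WAL separation (device names, mount points, and the data directory are placeholders, and the WAL-specific VxFS mount options are deliberately left open):

# Move the WAL onto its own VxFS volume so it can be mounted with its own options.
# All names below are placeholders; pg_xlog is the WAL directory in 8.2.
pg_ctl -D /pgdata/data stop -m fast
mount -t vxfs /dev/vx/dsk/datadg/walvol /pgwal   # add WAL-specific VxFS options here
mv /pgdata/data/pg_xlog /pgwal/pg_xlog
ln -s /pgwal/pg_xlog /pgdata/data/pg_xlog
pg_ctl -D /pgdata/data start -w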