|   +--dd
|   +--fdtree
|   +--fs
|   +--loader
|   +--memcached
|   +--metadata
|   +--network
|   +--perflog
|   |   +--cron-perflog
|   |   +--sample-output
|   +--postgres
|   +--postmark
|   +--raid
|   +--ram
|   +--scaling
|   +--swap

Performance tuning

  • When doing I/O, make your buffer size a multiple of the system block or pagesize.
  • Use mmap whenever possible.
  • Use swap across multiple drives: with equal priorities, Linux will automatically parallelize swap access, and you're not going through regular I/O channels when writing there.
  • Does Postgres communicate with the outside world via webserver? If so, don't use a heavyweight webserver if you don't need one. Lighttpd works fine; I use it on my workstation.
  • Use "noatime" when mounting your drives.
  • Use "-i 65536" to make one inode for every 64k of space. You save lots of room, fsck runs faster, and you don't end up with tons of inodes you don't need. In the example below, /mir01 is an entire 1.5Tb drive with default inode setup. /mir[23] is a second 1.5Tb drive split in half, with "-i 65536". I get an extra 26 Gb of space on /mir[23]:
    me% df -i /mir*
    Filesystem         Inodes    IUsed      IFree IUse%  Mounted on
    /dev/sda2       362774528     4899  362769629    1%  /mir01
    /dev/sdb6        11216896    88540   11128356    1%  /mir02
    /dev/sdb7        11216896   229548   10987348    3%  /mir03

    me% df -m /mir*
    Filesystem      1M-blocks      Used     Avail  Use%  Mounted on
    /dev/sda2         1372701      8089   1293759    1%  /mir01
    /dev/sdb6          699594     88930    596644   13%  /mir02
    /dev/sdb7          699601    233211    452370   35%  /mir03
  • Put as many drives in your system as you can. More spindles = faster.
  • Consider short-stroking your drives: use the outer regions closer to the edge, and don't bother making filesystems on the inner part.
  • What type of workload does your DB do? If it's transaction-based, your schema and indexing should not be the same as for a warehouse (mostly query) load. For a warehouse you want to use big buffers, big pages and process as much data as possible for each step (handle batches). Transaction systems are typically optimized to handle many small, unrelated data bits (handle one thing well at a time). It's difficult to integrate transaction and analysis oriented tasks on the same box.
  • If you're doing mainly key-value stuff, use redis instead. Just about every DB (including MySQL and Postgres) uses some sort of Btree setup where the tree leaves hold the data. The more leaves you hit, the slower you go, no matter what else is going on. Places like Craigslist use lots of redis servers in combination with MySQL.
  • Avoid using Western Digital "Green" drives because their energy-saving stuff is really aggressive; you'll hear the drives spinning up and down several times each minute, which not only trashes your performance but wears them out faster. There's a Windows program you can run to fix that, but why bother?
  • You could probably get a big win if your DB is in the 100-Gb range by putting the whole thing on an SSD filesystem.
  • Use either ext4 or XFS for your filesystem type. Google moved from ext2 in January 2010, and after evaluating everything that was out there, decided on ext4 only because they could migrate their stuff in place. Their second choice was XFS; it excells at large files, and over the last few years has vastly improved for small files.
  • Use fdtree, postmark, and dd tests for your benchmarks.
  • What's being stored in the DB? Metadata (owner, modification time, id, path, size, etc) is fine, but BLOBs are a pain in the ass and they rarely scale in a nice way. Logging is no better; most logging ends up being information that you rarely need to reference, so you're writing lots of data that eats up disk space and, more importantly, space in the DB buffer pool. Worse, log tables often grow without bound. This is where partitioning and tablespaces will definitely help.
  • If you don't already have a caching layer like memcached, plan for one. Instead of writing code that directly queries the database for data that rarely changes or data you expect to re-use frequently, either within or across user sessions, hide that logic behind something that checks the cache first. Doing so provides a layer of separation that makes it far easier to begin making use of a caching layer (since you'll have a single library of functions to update), even if you don't actually have memcached running from the start.
  • Have your vendor show you EXPLAIN output for the queries you'll be running most often. People trust the query optimizer even when they shouldn't. For example, one assumption is that all rows cost roughly the same amount of time to read, which is not always true. If the table you're using holds (say) Apache log data, it's probably already sorted by timestamp in the table and on disk (assuming minimal fragmentation), but the DB might assume it'll read fewer rows if it checks a different index. In that case you'll be seeking all over your drive.
  • One problem that might slow things down without leaving an obvious trace is excessive use of disk-based temporary tables. See the temp_buffers option in postgresql.conf to see how much room you can allocate for them without touching the disk. You can make RAM-only temporary tables by creating a tablespace on a RAM disk (/dev/shm works). This reduces the amount of disk IO, but I don't think you can do this without a physical disk write; the DB engine will flush the table list to stable storage when you create the temporary table.
  • PG v8.3 introduced a GUC parameter temp_tablespaces that allows selection of the tablespace(s) in which to store temp tables and temporary files. This is a list to allow spreading the load across multiple tablespaces; temp files are not stored in per-database pgsql_tmp/ directories anymore, but per-tablespace directories. If your temp tables are small enough, put those on a RAM disk.

Performance tuning Tue, 20 Mar 2012 22:31:20 -0400