+--server-tuning
| +--dd
| +--fdtree
| +--fs
| +--loader
| +--memcached
| +--metadata
| +--network
| +--perflog
| | +--cron-perflog
| | +--sample-output
| +--postgres
| +--postmark
| +--raid
| +--ram
| +--scaling
| +--swap
| Performance tuning
- When doing I/O, make your buffer size a multiple of the system block
size or page size.
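For example, reading with a page-sized buffer ("bigfile" is a stand-in
name; PAGESIZE is typically 4096 on x86 Linux):
me% dd if=bigfile of=/dev/null bs=$(getconf PAGESIZE)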
- Use mmap whenever possible; it avoids an extra copy between kernel
and user space and lets the page cache do the buffering.
- Use swap across multiple drives: with equal priorities, Linux will
automatically parallelize swap access, and swap writes bypass the
filesystem layer rather than going through regular file I/O.
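For example, in /etc/fstab (device names are made up; any equal "pri="
values work):
/dev/sda2  none  swap  sw,pri=10  0  0
/dev/sdb2  none  swap  sw,pri=10  0  0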
- Does Postgres communicate with the outside world via a webserver? If
so, don't use a heavyweight webserver if you don't need one. Lighttpd
works fine; I use it on my workstation.
- Use "noatime" when mounting your drives.
- Use "-i 65536" to make one inode for every 64k of space. You save lots
of room, fsck runs faster, and you don't end up with tons of inodes you
don't need. In the example below, /mir01 is an entire 1.5Tb drive with
default inode setup. /mir[23] is a second 1.5Tb drive split in half,
with "-i 65536". I get an extra 26 Gb of space on /mir[23]:
me% df -i /mir*
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda2 362774528 4899 362769629 1% /mir01
/dev/sdb6 11216896 88540 11128356 1% /mir02
/dev/sdb7 11216896 229548 10987348 3% /mir03
me% df -m /mir*
Filesystem 1M-blocks Used Avail Use% Mounted on
/dev/sda2 1372701 8089 1293759 1% /mir01
/dev/sdb6 699594 88930 596644 13% /mir02
/dev/sdb7 699601 233211 452370 35% /mir03
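The command that created /mir02 would have looked something like this
(-t ext4 is an assumption; older e2fsprogs spell it mkfs.ext4):
me% mke2fs -t ext4 -i 65536 /dev/sdb6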
- Put as many drives in your system as you can. More spindles = faster.
- Consider short-stroking your drives: use only the outer tracks at the
start of the drive, which have the highest transfer rates and shortest
seeks, and don't bother making filesystems on the slower inner part.
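For example, with parted, to use only the first (fastest) 30% of a
drive (device name and percentage are made up):
me% parted /dev/sdc mklabel gpt
me% parted /dev/sdc mkpart primary 0% 30%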
- What type of workload does your DB do? If it's transaction-based, your
schema and indexing should not be the same as for a warehouse (mostly
query) load. For a warehouse you want to use big buffers, big pages
and process as much data as possible for each step (handle batches).
Transaction systems are typically optimized to handle many small,
unrelated data bits (handle one thing well at a time). It's difficult
to integrate transaction- and analysis-oriented tasks on the same box.
- If you're doing mainly key-value stuff, use Redis instead. Just about
every DB (including MySQL and Postgres) uses some sort of B-tree setup
where the tree leaves hold the data. The more leaves you hit, the slower
you go, no matter what else is going on. Places like Craigslist use
lots of Redis servers in combination with MySQL.
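For example, a Redis round trip (key and value are made up):
me% redis-cli set user:42:email alice@example.com
OK
me% redis-cli get user:42:email
"alice@example.com"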
- Avoid using Western Digital "Green" drives because their energy-saving
stuff is really aggressive; you'll hear the drives spinning up and down
several times each minute, which not only trashes your performance but
wears them out faster. There's a Windows program you can run to fix that,
but why bother?
- You could probably get a big win if your DB is in the 100 GB range by
putting the whole thing on an SSD filesystem.
- Use either ext4 or XFS for your filesystem type. Google moved from ext2
in January 2010, and after evaluating everything that was out there,
decided on ext4 only because they could migrate their stuff in place.
Their second choice was XFS; it excels at large files, and over the last
few years has vastly improved for small files.
- Use fdtree, postmark, and dd tests for your benchmarks.
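A typical dd throughput test (file name and sizes are made up;
conv=fdatasync makes dd flush the data to disk before reporting):
me% dd if=/dev/zero of=/mir02/testfile bs=1M count=4096 conv=fdatasync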
- What's being stored in the DB? Metadata (owner, modification time, id,
path, size, etc) is fine, but BLOBs are a pain in the ass and they rarely
scale in a nice way. Logging is no better; most logging ends up being
information that you rarely need to reference, so you're writing lots
of data that eats up disk space and, more importantly, space in the DB
buffer pool. Worse, log tables often grow without bound. This is where
partitioning and tablespaces will definitely help.
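For example, to move a hypothetical log table onto its own tablespace
(names and path are made up; the directory must already exist and be
owned by the postgres user):
me% psql -d mydb -c "CREATE TABLESPACE logspace LOCATION '/mir03/pg_logspace'"
me% psql -d mydb -c "ALTER TABLE access_log SET TABLESPACE logspace"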
- If you don't already have a caching layer like memcached, plan for one.
Instead of writing code that directly queries the database for data that
rarely changes or data you expect to re-use frequently, either within
or across user sessions, hide that logic behind something that checks
the cache first. Doing so provides a layer of separation that makes it
far easier to begin making use of a caching layer (since you'll have a
single library of functions to update), even if you don't actually have
memcached running from the start.
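A minimal cache-aside sketch, assuming memcached on localhost:11211, a
hypothetical "mydb" database, and a netcat flavor that supports -q;
real code would live in your data-access library, not a shell script:
#!/bin/sh
key="user_count"
# Ask memcached first ("get" returns VALUE/data/END on a hit, END on a miss).
val=$(printf 'get %s\r\n' "$key" | nc -q1 localhost 11211 | sed -n 2p | tr -d '\r')
if [ -z "$val" ]; then
    # Cache miss: hit the DB, then cache the answer for 300 seconds.
    val=$(psql -At -d mydb -c 'SELECT count(*) FROM users')
    printf 'set %s 0 300 %s\r\n%s\r\n' "$key" "${#val}" "$val" \
        | nc -q1 localhost 11211 >/dev/null
fi
echo "$val"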
- Have your vendor show you EXPLAIN output for the queries you'll be running
most often. People trust the query optimizer even when they shouldn't.
For example, one assumption is that all rows cost roughly the same amount
of time to read, which is not always true. If the table you're using
holds (say) Apache log data, it's probably already sorted by timestamp
in the table and on disk (assuming minimal fragmentation), but the
DB might assume it'll read fewer rows if it checks a different index.
In that case you'll be seeking all over your drive.
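For example (table and column names are made up; note that EXPLAIN
ANALYZE runs the query for real, so be careful on production boxes):
me% psql -d mydb -c "EXPLAIN ANALYZE SELECT * FROM apache_log
    WHERE ts >= '2010-01-01' AND ts < '2010-01-02'"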
- One problem that might slow things down without leaving an obvious trace
is excessive use of disk-based temporary tables. See the temp_buffers
option in postgresql.conf to see how much room you can allocate for them
without touching the disk. You can make RAM-only temporary tables by
creating a tablespace on a RAM disk (/dev/shm works). This reduces the
amount of disk I/O, but I don't think you can do this without a physical
disk write; the DB engine will flush the table list to stable storage
when you create the temporary table.
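A sketch of both pieces (names and sizes are made up; CREATE TABLESPACE
needs superuser, and /dev/shm is cleared on reboot, so the directory
has to be recreated at boot):
# in postgresql.conf, room for temp tables before spilling to disk:
temp_buffers = 64MB
me% mkdir /dev/shm/pg_ramts
me% chown postgres:postgres /dev/shm/pg_ramts
me% psql -d mydb -c "CREATE TABLESPACE ramts LOCATION '/dev/shm/pg_ramts'"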
- PG v8.3 introduced a GUC parameter temp_tablespaces that allows selection
of the tablespace(s) in which to store temp tables and temporary files.
This is a list so you can spread the load across multiple tablespaces;
temp files are no longer stored in per-database pgsql_tmp/ directories,
but in per-tablespace directories. If your temp tables are small enough,
put those on a RAM disk.
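Continuing the RAM-disk sketch above (tablespace name is made up; you
can also SET temp_tablespaces per session):
# in postgresql.conf:
temp_tablespaces = 'ramts'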
|