Message-ID: <4D1FE347.8040008@hardwarefreak.com>
Date: Sat, 01 Jan 2011 20:30:31 -0600
From: Stan Hoeppner
To: debian-user@lists.debian.org
Subject: Re: PostgreSQL+ZFS
References: <4D1E6013.2010900@atifceylan.com> <4D1F5543.3010108@hardwarefreak.com> <201101011416.37472.bss@iguanasuicide.net>
In-Reply-To: <201101011416.37472.bss@iguanasuicide.net>

Boyd Stephen Smith Jr. put forth on 1/1/2011 2:16 PM:

> Is your problem with RAID5 or the SSDs?

RAID 5.

> Sudden disk failure can occur with SSDs, just like with magnetic
> media.

This is not true.  The failure modes and rates of SSDs are the same as
those of other solid state components, such as system boards, HBAs,
and PCI RAID cards, even CPUs (although SSDs are far more reliable
than CPUs due to the lack of heat generation).  SSDs have only two
basic things in common with mechanical disk drives: permanent data
storage and a block device interface.  SSDs, as the first two letters
of the acronym tell us, have more in common with the other integrated
circuit components in a system.

Can an SSD fail?  Sure.  So can a system board.  But how often do your
system boards fail?  *That* is the comparison you should be making WRT
SSD failure rates and modes, *not* a comparison of SSDs with HDDs.

> If you are going to use them in a production environment they should
> be RAIDed like any disk.

I totally disagree.  See above.  However, if one is that concerned
about SSD failure, instead of spending the money required to RAID
(verb) one's db storage SSDs simply for fault recovery, I would
recommend freezing and snapshotting the filesystem to a sufficiently
large SATA drive, then running differential backups of the snapshot to
the tape silo.  Remember, you don't _need_ RAID with SSDs to get
performance.  Mirroring one's boot/system device is about the only
RAID scenario I'd ever recommend for SSDs, and even here I don't feel
it's necessary.

> RAID 5 on SSDs is sort of odd though.
> RAID 5 is really a poor man's RAID; yet, SSDs cost quite a bit more
> than magnetic media for the same amount of storage.

Any serious IT professional needs to throw out his old storage cost
equation.  Size doesn't matter and hasn't for quite some time.
Everyone has more storage than they can possibly ever use.  Look how
many free* providers (Gmail) are offering _unlimited_ storage.  The
storage cost equation should no longer be based on capacity (and
should never have been, IMO), but on capability.  The disk drive
manufacturers have falsely convinced buyers over the last decade that
size is _the_ criterion on which to base purchasing decisions.  This
couldn't be further from the truth.  Mechanical drives have become so
cavernous that most users never come close to using the available
capacity, not even 25% of it.

SSDs actually cost *less* than HDDs with the equation people should be
using, which is based on _capability_.  It goes something like this,
and yields not dollars but an absolute number, where a higher score is
better:

storage_value = ((IOPS + throughput) / unit_cost) + (MTBF / 1M) - power_per_year

Power_per_year depends on local utility rates, which vary widely by
locale.  For this comparison I'll use a kWh price of $0.12,
approximately the PG&E average rate in California.

For a Seagate 146GB 15k RPM SAS drive ($170):
http://www.newegg.com/Product/Product.aspx?Item=N82E16822148558

storage_value = ((274 + 142) / 170) + (1.6) - 110
storage_value = -106

For an OCZ Vertex 2 160GB SATA II SSD ($330):
http://www.newegg.com/Product/Product.aspx?Item=N82E16820227686

storage_value = ((50000 + 250) / 330) + (2.0) - 18
storage_value = 136

Notice that the mechanical drive ended up with a substantial negative
score, and that the SSD finished 242 points ahead due to its massively
superior IOPS.  This is because in today's high energy cost world,
performance is much more costly to obtain with mechanical drives.  The
Seagate drive above represents the highest performance mechanical
drive available.
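The scoring arithmetic above can be reproduced with a quick shell
sketch (the storage_value helper name is mine; it only restates the
equation as given):

```shell
# storage_value = ((IOPS + throughput) / unit_cost) + (MTBF / 1M) - power_per_year
# args: IOPS  throughput(MB/s)  unit_cost($)  MTBF(millions of hours)  power_per_year($)
storage_value() {
    awk -v iops="$1" -v tput="$2" -v cost="$3" -v mtbf="$4" -v power="$5" \
        'BEGIN { printf "%.0f\n", (iops + tput) / cost + mtbf - power }'
}

storage_value 274   142 170 1.6 110   # Seagate 15k SAS: prints -106
storage_value 50000 250 330 2.0  18   # OCZ Vertex 2:    prints 136
```

The $110 and $18 power_per_year inputs follow from watts times hours
times rate: at $0.12/kWh and 24x7 operation, $110/year corresponds to
roughly 105 W of burdened draw and $18/year to roughly 17 W (these
wattages are back-calculated from the yearly figures, not taken from
the datasheets).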
It costs $170 (bare drive) to acquire, but $110 per year to operate in
a 24x7 enterprise environment.  Two years of energy consumption will
exceed the acquisition cost.  By contrast, running the SSD costs a
much more reasonable $18 per year, and it would take 18 years of
energy consumption to surpass its acquisition cost.

As the published MTBF ratings of the two devices are so similar, 1.6
vs 2 million hours, they have almost zero impact on the final scores.
Ironically, the SSD is actually slightly _larger_ in capacity than the
mechanical drive in this case, as the SSDs fall between 120GB and
160GB, and I chose the larger, pricier option to give the mechanical
drive more of a chance.  It doesn't matter.  The SSD could cost $2000
and it would still win by a margin of 115, for two reasons: 182 times
the IOPS performance and 1/6th the power consumption.

For the vast majority of enterprise/business workloads, IOPS and power
consumption are far more relevant than total storage space, especially
for transactional database systems.  The above equation bears this
out.

> SSDs intended as HD replacements support more read/write cycles per
> block than you will use for many decades, even if you were using all
> the disk I/O the entire time.

Yep.  Most SSDs will, regardless of price.

> SSDs intended as HD replacements are generally faster than magnetic
> media, though it varies based on manufacturer and workload.

All of the currently shipping decent quality SSDs outrun a 15k SAS
drive in every performance category.  You'd have to buy a really low
end consumer model, such as the cheap A-Datas and Kingstons, to get
less streaming throughput than a SAS drive.  And, obviously, every
SSD, even the el cheapos, runs IOPS circles around the fastest
mechanicals.  But if we're talking strictly a business environment,
one is going to be buying higher end SSD models.  And you don't have
to go all that far up the price scale either.
The major price factor in SSDs is no longer performance, now that
there are so many great controller chips available, but size.  The
more flash chips in the device, the higher the cost.  The high
performance controller chips (SandForce et al) no longer have that
much bearing on price.

> I see little to no problem using SSDs in a production environment.

Me neither. :)

> Some people just hate on RAID 5.  It is fine for its intended
> purpose, which is LOTS of storage with some redundancy on identical
> (or near-identical) drives.  I've run (and recovered) it on 3-6
> drives.

It's fine in two categories:

1.  You never suffer power failure or a system crash
2.  Your performance needs are meager

Most SOHO setups do fine with RAID 5.  For any application that stores
large volumes of rarely changing data it's fine.  For any application
that performs constant random IO, such as a busy mail server or db
server, you should use RAID 10.

> However, RAID 1/0 is vastly superior in terms of reliability and
> speed.  It costs a bit more for the same amount of usable space, but
> it is worth it.

Absolutely agree on both counts, except in one particular case: with
the same drive count, RAID 5 can usually outperform RAID 10 in
streaming read performance, though not by much.  RAID 5 reads require
no parity calculations, so you get almost the entire stripe's worth of
spindle performance.  Where RAID 10 really shines is in mixed
workloads.  Throw a few random writes into the streaming RAID 5
workload mentioned above and things slow down quite dramatically.
RAID 10 doesn't suffer from this.  Its performance is fairly
consistent even with simultaneous streaming and random workloads.

> I suggest you use RAID 1/0 on your SSDs, quite a few RAID 1/0
> implementations will work with 3 drives.  RAID 1/0 should be a
> little more performant and a little less CPU intensive than RAID 5
> for transaction logs.
> As far as file system, I think ext3 would be fine for this workload,
> although it would probably be worth it to benchmark against ext4 to
> see if it gives any improvement.

Again, RAID isn't necessary for SSDs.

Also, I really, really wish people would stop repeating this crap
about mdraid's various extra "RAID 10" *layouts* being RAID 10!  They
are NOT RAID 10!  There is only one RAID 10, and the name and
description have been with us for over 15 years, LONG before Linux had
a software RAID layer.  Also, it's not called "RAID 1+0" or "RAID
1/0".  It is simply called "RAID 10", again, for 15+ years now.  It
requires 4 or more disks, in an even number.  RAID 10 is a stripe
across multiple mirrored pairs.  Period.  There is no other definition
of RAID 10.  All of Neil's "layouts" that do not meet the above
description _are not RAID 10_, no matter what he, or anyone else,
decided to call them!!

Travel through your time machine back to 1995-2000 and go into the
BIOS firmware menu of a Mylex, AMI, Adaptec, or DPT PCI RAID
controller.  They all say RAID 10, and they all used the same
"layout": hardware sector mirroring of two disks, and striping of
filesystem blocks across those mirrored pairs.

/end RAID 10 nomenclature rant

--
Stan

Message-ID: <4D200417.9030407@hardwarefreak.com>
Date: Sat, 01 Jan 2011 22:50:31 -0600
From: Stan Hoeppner
To: debian-user@lists.debian.org
Subject: Re: PostgreSQL+ZFS
References: <4D1E6013.2010900@atifceylan.com> <4D1F5543.3010108@hardwarefreak.com> <4D1FA384.4090407@atifceylan.com>
In-Reply-To: <4D1FA384.4090407@atifceylan.com>

Atif CEYLAN put forth on 1/1/2011 3:58 PM:

> On 01/01/2011 06:24 PM, Stan Hoeppner wrote:
>> How much data?  Total GB?
> ~300 GB

>> Are you currently short of space?
> no, don't need more space.

Perfect. :)

>> Are you currently short of IOPS capacity?
> yes

Got it.

>> How many concurrent transactions?
> minimum 100-200 transactions, maximum 800-1000 concurrent
> transactions.

>> What types of transactions?
> usually update and insert

Write heavy.

>> What is a "large postgresql database system"?  What exactly do you
>> mean by this?  Does large mean heavy transaction load?  Or does it
>> simply mean lots of data housed?  Or is it simply BS?
> heavy transaction load.

Cool.  If you don't need more than 300GB of space, the answer is easy.
Get one of these 120,000 random write IOPS, 360GB RevoDrive PCIe x4
cards and put everything on it: db files, transaction logs, all of it.

For less than $1200 USD you'll get the IOPS performance of an 800
disk, 15k RPM RAID 10 fiber channel SAN array from EMC costing about
$2 million USD.  Your latency will be an order of magnitude lower,
though, because the flash is connected directly to your PCIe bus.  The
only things such a SAN setup would have that you won't are dozens of
terabytes of space and more link throughput, neither of which you
need.  You only need the additional IOPS, not the space, so you save
$2 million and get superior performance to boot.  This is the true
power and economy of SSD technology, and how its price should be
evaluated: not dollars per gigabyte, but dollars per IOPS and dollars
per watt.  The $$ spent on the electric bill for a year of running
that EMC array, with its many racks of disk trays, would buy you
dozens of these RevoDrive cards.

http://www.newegg.com/Product/Product.aspx?Item=N82E16820227662
http://www.ocztechnology.com/products/solid-state-drives/pci-express/revodrive/ocz-revodrive-x2-pci-express-ssd-.html

* 120,000 4k random write IOPS (overkill)
* 400 MB/s sustained write throughput (overkill)
* PCI Express x4 interface

This is not a drive, but a PCB solution.  Supreme reliability, just
like a motherboard.  No mirroring or RAID required.  Simply snapshot
the filesystem and dump it to tape or D2D using differential backup.
This card works fine with Linux if you have a recent kernel, even
though OCZ targets the desktop with this model.
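The 800-disk equivalence above can be sanity-checked with one line of
arithmetic, assuming roughly 150 random IOPS per 15k spindle (a common
planning figure; the per-spindle number is my assumption, not from the
post):

```shell
# spindles needed to match 120,000 random IOPS at ~150 IOPS per 15k disk
echo $(( 120000 / 150 ))   # prints 800
```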
The 512GB Z-Drive card they target at "servers and workstations" has
only 1/10th the write IOPS capability of the RevoDrive 380, and is
$600 more expensive.  As far as I can tell the Z-Drive has no
advantage, except possibly official technical support.

Also, I recommend using the XFS filesystem due to its superior direct
IO performance with databases.  Configure PGSQL to use direct IO.
When you make the XFS filesystem, consume the entire drive, creating
36 allocation groups.  Refer to "man mkfs.xfs".  This will maximize
parallel IOPS throughput to the SSD.

Buy this card and do these things, and you will be absolutely stunned
by the performance you get out of it.  This storage card with XFS on
top should easily handle 100,000 inserts _per second_, if you have
enough CPU horsepower to drive that load.

If you go this route, please let us know how well it works for you.
I'm sure many here would be eager to know.  Well, others besides
myself. ;)

--
Stan
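[Editorial note: the mkfs.xfs invocation the post describes, with the
whole device as one filesystem split into 36 allocation groups, might
look like the following sketch; the device name is a placeholder, and
defaults vary by xfsprogs version, so check "man mkfs.xfs" first.]

```shell
# Create XFS across the entire device with 36 allocation groups,
# so independent writers can allocate in parallel.
# /dev/sdb is a hypothetical device name; substitute your own.
mkfs.xfs -f -d agcount=36 /dev/sdb
```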