Message-ID: <4D1FE347.8040008@hardwarefreak.com>
Date: Sat, 01 Jan 2011 20:30:31 -0600
From: Stan Hoeppner
To: debian-user@lists.debian.org
Subject: Re: PostgreSQL+ZFS
References: <4D1E6013.2010900@atifceylan.com> <4D1F5543.3010108@hardwarefreak.com> <201101011416.37472.bss@iguanasuicide.net>
In-Reply-To: <201101011416.37472.bss@iguanasuicide.net>

Boyd Stephen Smith Jr. put forth on 1/1/2011 2:16 PM:

> Is your problem with RAID5 or the SSDs?

RAID 5.

> Sudden disk failure can occur with SSDs, just like with magnetic
> media.

This is not true.  The failure modes and rates of SSDs are the same as
those of other solid state components, such as system boards, HBAs,
and PCI RAID cards, even CPUs (although SSDs are far more reliable
than CPUs due to the lack of heat generation).  SSDs have only two
basic things in common with mechanical disk drives: permanent data
storage and a block device interface.  SSDs, as the first two letters
of the acronym tell us, have more in common with the other integrated
circuit components in a system.

Can an SSD fail?  Sure.  So can a system board.  But how often do your
system boards fail?  *That* is the comparison you should be making WRT
SSD failure rates and modes, *not* a comparison of SSDs with HDDs.

> If you are going to use them in a production environment they should
> be RAIDed like any disk.

I totally disagree.  See above.  However, if one is that concerned
about SSD failure, instead of spending the money required to RAID
(verb) one's db storage SSDs simply for fault recovery, I would
recommend freezing and snapshotting the filesystem to a sufficiently
large SATA drive, then running differential backups of the snapshot to
the tape silo.  Remember, you don't _need_ RAID with SSDs to get
performance.  Mirroring one's boot/system device is about the only
RAID scenario I'd ever recommend for SSDs, and even here I don't feel
it's necessary.

> RAID 5 on SSDs is sort of odd though.
> RAID 5 is really a poor man's RAID; yet, SSDs cost quite a bit more
> than magnetic media for the same amount of storage.

Any serious IT professional needs to throw out his old storage cost
equation.  Size doesn't matter and hasn't for quite some time.
Everyone has more storage than they can possibly ever use.  Look how
many free* providers (Gmail) are offering _unlimited_ storage.  The
storage cost equation should no longer be based on capacity (and
should never have been, IMO), but on capability.  The disk drive
manufacturers have falsely convinced buyers over the last decade that
size is _the_ criterion on which to base purchasing decisions.  This
couldn't be further from the truth.  Mechanical drives have become so
cavernous that most users never come close to using the available
capacity, not even 25% of it.

SSDs actually cost *less* than HDDs with the equation people should be
using, which is based on _capability_.  It goes something like this,
and yields not dollars but an absolute number, where a higher score is
better:

storage_value = ((IOPS + throughput) / unit_cost) + (MTBF / 1M) - power_per_year

Power_per_year depends on local utility rates, which vary widely by
locale.  For this comparison I'll use a kWh price of $0.12,
approximately the PG&E average rate in California.

For a Seagate 146GB 15k RPM SAS drive ($170):
http://www.newegg.com/Product/Product.aspx?Item=N82E16822148558

storage_value = ((274 + 142) / 170) + (1.6) - 110
storage_value = -106

For an OCZ Vertex 2 160GB SATA II SSD ($330):
http://www.newegg.com/Product/Product.aspx?Item=N82E16820227686

storage_value = ((50000 + 250) / 330) + (2.0) - 18
storage_value = 136

Notice that the mechanical drive ended up with a substantial negative
score, and that the SSD finished 242 points ahead due to its massively
superior IOPS.  This is because in today's high energy cost world,
performance is much more costly to obtain with mechanical drives.  The
Seagate drive above represents the highest performance mechanical
drive available.
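The scoring arithmetic above can be reproduced with a quick shell
sketch (the storage_value helper name is mine; it only restates the
equation as given):

```shell
# storage_value = ((IOPS + throughput) / unit_cost) + (MTBF / 1M) - power_per_year
# args: IOPS  throughput(MB/s)  unit_cost($)  MTBF(millions of hours)  power_per_year($)
storage_value() {
    awk -v iops="$1" -v tput="$2" -v cost="$3" -v mtbf="$4" -v power="$5" \
        'BEGIN { printf "%.0f\n", (iops + tput) / cost + mtbf - power }'
}

storage_value 274   142 170 1.6 110   # Seagate 15k SAS: prints -106
storage_value 50000 250 330 2.0  18   # OCZ Vertex 2:    prints 136
```

The $110 and $18 power_per_year inputs follow from watts times hours
times rate: at $0.12/kWh and 24x7 operation, $110/year corresponds to
roughly 105 W of burdened draw and $18/year to roughly 17 W (these
wattages are back-calculated from the yearly figures, not taken from
the datasheets).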
It costs $170 (bare drive) to acquire, but $110 per year to operate in
a 24x7 enterprise environment.  Two years of energy consumption will
exceed the acquisition cost.  By contrast, running the SSD costs a
much more reasonable $18 per year, and it would take 18 years of
energy consumption to surpass its acquisition cost.

As the published MTBF ratings of the two devices are so similar, 1.6
vs 2 million hours, they have almost zero impact on the final scores.
Ironically, the SSD is actually slightly _larger_ in capacity than the
mechanical drive in this case, as the SSDs fall between 120GB and
160GB, and I chose the larger, pricier option to give the mechanical
drive more of a chance.  It doesn't matter.  The SSD could cost $2000
and it would still win by a margin of 115, for two reasons: 182 times
the IOPS performance and 1/6th the power consumption.

For the vast majority of enterprise/business workloads, IOPS and power
consumption are far more relevant than total storage space, especially
for transactional database systems.  The above equation bears this
out.

> SSDs intended as HD replacements support more read/write cycles per
> block than you will use for many decades, even if you were using all
> the disk I/O the entire time.

Yep.  Most SSDs will, regardless of price.

> SSDs intended as HD replacements are generally faster than magnetic
> media, though it varies based on manufacturer and workload.

All of the currently shipping decent quality SSDs outrun a 15k SAS
drive in every performance category.  You'd have to buy a really low
end consumer model, such as the cheap A-Datas and Kingstons, to get
less streaming throughput than a SAS drive.  And, obviously, every
SSD, even the el cheapos, runs IOPS circles around the fastest
mechanicals.  But if we're talking strictly a business environment,
one is going to be buying higher end SSD models.  And you don't have
to go all that far up the price scale either.
The major price factor in SSDs is no longer performance, now that
there are so many great controller chips available, but size.  The
more flash chips in the device, the higher the cost.  The high
performance controller chips (SandForce et al) no longer have that
much bearing on price.

> I see little to no problem using SSDs in a production environment.

Me neither. :)

> Some people just hate on RAID 5.  It is fine for its intended
> purpose, which is LOTS of storage with some redundancy on identical
> (or near-identical) drives.  I've run (and recovered) it on 3-6
> drives.

It's fine in two categories:

1.  You never suffer power failure or a system crash
2.  Your performance needs are meager

Most SOHO setups do fine with RAID 5.  For any application that stores
large volumes of rarely changing data it's fine.  For any application
that performs constant random IO, such as a busy mail server or db
server, you should use RAID 10.

> However, RAID 1/0 is vastly superior in terms of reliability and
> speed.  It costs a bit more for the same amount of usable space, but
> it is worth it.

Absolutely agree on both counts, except in one particular case: with
the same drive count, RAID 5 can usually outperform RAID 10 in
streaming read performance, though not by much.  RAID 5 reads require
no parity calculations, so you get almost the entire stripe's worth of
spindle performance.  Where RAID 10 really shines is in mixed
workloads.  Throw a few random writes into the streaming RAID 5
workload mentioned above and things slow down quite dramatically.
RAID 10 doesn't suffer from this.  Its performance is fairly
consistent even with simultaneous streaming and random workloads.

> I suggest you use RAID 1/0 on your SSDs, quite a few RAID 1/0
> implementations will work with 3 drives.  RAID 1/0 should be a
> little more performant and a little less CPU intensive than RAID 5
> for transaction logs.
> As far as file system, I think ext3 would be fine for this workload,
> although it would probably be worth it to benchmark against ext4 to
> see if it gives any improvement.

Again, RAID isn't necessary for SSDs.

Also, I really, really wish people would stop repeating this crap
about mdraid's various extra "RAID 10" *layouts* being RAID 10!  They
are NOT RAID 10!  There is only one RAID 10, and the name and
description have been with us for over 15 years, LONG before Linux had
a software RAID layer.  Also, it's not called "RAID 1+0" or "RAID
1/0".  It is simply called "RAID 10", again, for 15+ years now.  It
requires 4 or more disks, in an even number.  RAID 10 is a stripe
across multiple mirrored pairs.  Period.  There is no other definition
of RAID 10.  All of Neil's "layouts" that do not meet the above
description _are not RAID 10_, no matter what he, or anyone else,
decided to call them!!

Travel through your time machine back to 1995-2000 and go into the
BIOS firmware menu of a Mylex, AMI, Adaptec, or DPT PCI RAID
controller.  They all say RAID 10, and they all used the same
"layout": hardware sector mirroring of two disks, and striping of
filesystem blocks across those mirrored pairs.

/end RAID 10 nomenclature rant

--
Stan

Message-ID: <4D200417.9030407@hardwarefreak.com>
Date: Sat, 01 Jan 2011 22:50:31 -0600
From: Stan Hoeppner
To: debian-user@lists.debian.org
Subject: Re: PostgreSQL+ZFS
References: <4D1E6013.2010900@atifceylan.com> <4D1F5543.3010108@hardwarefreak.com> <4D1FA384.4090407@atifceylan.com>
In-Reply-To: <4D1FA384.4090407@atifceylan.com>

Atif CEYLAN put forth on 1/1/2011 3:58 PM:

> On 01/01/2011 06:24 PM, Stan Hoeppner wrote:
>> How much data?  Total GB?
> ~300 GB

>> Are you currently short of space?
> no, don't need more space.

Perfect. :)

>> Are you currently short of IOPS capacity?
> yes

Got it.

>> How many concurrent transactions?
> minimum 100-200 transactions, maximum 800-1000 concurrent
> transactions.

>> What types of transactions?
> usually update and insert

Write heavy.

>> What is a "large postgresql database system"?  What exactly do you
>> mean by this?  Does large mean heavy transaction load?  Or does it
>> simply mean lots of data housed?  Or is it simply BS?
> heavy transaction load.

Cool.  If you don't need more than 300GB of space, the answer is easy.
Get one of these 120,000 random write IOPS, 360GB RevoDrive PCIe x4
cards and put everything on it: db files, transaction logs, all of it.

For less than $1200 USD you'll get the IOPS performance of an 800
disk, 15k RPM RAID 10 fiber channel SAN array from EMC costing about
$2 million USD.  Your latency will be an order of magnitude lower,
though, because the flash is connected directly to your PCIe bus.  The
only things such a SAN setup would have that you won't are dozens of
terabytes of space and more link throughput, neither of which you
need.  You only need the additional IOPS, not the space, so you save
$2 million and get superior performance to boot.  This is the true
power and economy of SSD technology, and how its price should be
evaluated: not dollars per gigabyte, but dollars per IOPS and dollars
per watt.  The $$ spent on the electric bill for a year of running
that EMC array, with its many racks of disk trays, would buy you
dozens of these RevoDrive cards.

http://www.newegg.com/Product/Product.aspx?Item=N82E16820227662
http://www.ocztechnology.com/products/solid-state-drives/pci-express/revodrive/ocz-revodrive-x2-pci-express-ssd-.html

* 120,000 4k random write IOPS (overkill)
* 400 MB/s sustained write throughput (overkill)
* PCI Express x4 interface

This is not a drive, but a PCB solution.  Supreme reliability, just
like a motherboard.  No mirroring or RAID required.  Simply snapshot
the filesystem and dump it to tape or D2D using differential backup.
This card works fine with Linux if you have a recent kernel, even
though OCZ targets the desktop with this model.
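The 800-disk equivalence above can be sanity-checked with one line of
arithmetic, assuming roughly 150 random IOPS per 15k spindle (a common
planning figure; the per-spindle number is my assumption, not from the
post):

```shell
# spindles needed to match 120,000 random IOPS at ~150 IOPS per 15k disk
echo $(( 120000 / 150 ))   # prints 800
```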
The 512GB Z-Drive card they target at "servers and workstations" has
only 1/10th the write IOPS capability of the RevoDrive 380, and is
$600 more expensive.  As far as I can tell the Z-Drive has no
advantage, except possibly official technical support.

Also, I recommend using the XFS filesystem due to its superior direct
IO performance with databases.  Configure PGSQL to use direct IO.
When you make the XFS filesystem, consume the entire drive, creating
36 allocation groups.  Refer to "man mkfs.xfs".  This will maximize
parallel IOPS throughput to the SSD.

Buy this card and do these things, and you will be absolutely stunned
by the performance you get out of it.  This storage card with XFS on
top should easily handle 100,000 inserts _per second_, if you have
enough CPU horsepower to drive that load.

If you go this route, please let us know how well it works for you.
I'm sure many here would be eager to know.  Well, others besides
myself. ;)

--
Stan
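[Editorial note: the mkfs.xfs invocation the post describes, with the
whole device as one filesystem split into 36 allocation groups, might
look like the following sketch; the device name is a placeholder, and
defaults vary by xfsprogs version, so check "man mkfs.xfs" first.]

```shell
# Create XFS across the entire device with 36 allocation groups,
# so independent writers can allocate in parallel.
# /dev/sdb is a hypothetical device name; substitute your own.
mkfs.xfs -f -d agcount=36 /dev/sdb
```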