Discussion:
New server setup
Niels Kristian Schjødt
2013-03-01 11:43:17 UTC
Hi, I'm going to set up a new server for my PostgreSQL database, and I am considering one of these: http://www.hetzner.de/hosting/produkte_rootserver/poweredge-r720 with four SAS drives in a RAID 10 array. Do any of you have particular comments/pitfalls/etc. to mention about the setup? My application is very write-heavy.
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Craig James
2013-03-01 15:28:30 UTC
I can only tell you our experience with Dell from several years ago. We
bought two Dell servers similar to (somewhat larger than) the model you're
looking at. We'll never buy from them again.

Advantages: They work. They haven't failed.

Disadvantages:

Performance sucks. Dell costs far more than "white box" servers we buy
from a "white box" supplier (ASA Computers). ASA gives us roughly double
the performance for the same price. We can buy exactly what we want from
ASA.

Dell did a disk-drive "lock in." The RAID controller won't spin up a
non-Dell disk. They wanted roughly four times the price for their disks
compared to buying the exact same disks on Amazon. If a disk went out
today, it would probably cost even more because that model is obsolete
(luckily, we bought a couple spares). I think they abandoned this policy
because it caused so many complaints, but you should check before you buy.
This was an incredibly stupid RAID controller design.

Dell tech support doesn't know what they're talking about when it comes to
RAID controllers and serious server support. You're better off with a
white-box solution, where you can buy the exact parts recommended in this
group and get technical advice from people who know what they're talking
about. Dell basically doesn't understand Postgres.

They boast excellent on-site service, but for the price of their computers
and their service contract, you can buy two servers from a white-box
vendor. Our white-box servers have been just as reliable as the Dell
servers -- no failures.

I'm sure someone in Europe can recommend a good vendor for you.

Craig James
Niels Kristian Schjødt
2013-03-04 11:20:49 UTC
Thanks both of you for your input.

Earlier I discussed my extremely high I/O wait with you here on the mailing list, and I have tried a lot of tweaks to the postgresql config, the WAL directory location, and the kernel, but unfortunately the problem persists. I think it eventually comes down to just bad hardware (currently two 7200rpm disks in a software RAID 1), so changing to four 15000rpm SAS disks in a RAID 10 will probably change a lot - don't you think? However, we run a lot of background processing, sometimes 300 connections to the db. So my question is: should I also get something like pgpool2 set up at the same time? Is it, from your experience, likely to increase my throughput a lot more if I had a connection pool of e.g. 20 connections, instead of 300 concurrent ones directly?
Kevin Grittner
2013-03-05 16:34:04 UTC
Post by Niels Kristian Schjødt
So my question is, should I also get something like pgpool2 setup
at the same time? Is it, from your experience, likely to increase
my throughput a lot more, if I had a connection pool of eg. 20
connections, instead of 300 concurrent ones directly?
In my experience, it can make a big difference.  If you are just
using the pooler for this reason, and don't need any of the other
features of pgpool, I suggest pgbouncer.  It is a simpler, more
lightweight tool.
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Scott Marlowe
2013-03-05 17:10:21 UTC
I second the pgbouncer rec.
Niels Kristian Schjødt
2013-03-05 17:11:48 UTC
Thanks, that is actually what I just ended up doing yesterday. Any suggestions on how to tune pgbouncer?

BTW, I have just bumped into an issue that actually caused me to disable pgbouncer again. My web application queries the database with a per-request search_path. This is because I use schemas to provide country-based separation of my data (e.g. English, German, Danish data in different schemas). I have pgbouncer set up for transactional behavior (pool_mode = transaction) - however, some of my colleagues complained that it sometimes didn't return data from the schema set in the search_path. Would you by any chance have an idea what is going wrong?

#################### pgbouncer.ini
[databases]
production =

[pgbouncer]

logfile = /var/log/pgbouncer/pgbouncer.log
pidfile = /var/run/pgbouncer/pgbouncer.pid
listen_addr = localhost
listen_port = 6432
unix_socket_dir = /var/run/postgresql
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
admin_users = postgres
pool_mode = transaction
server_reset_query = DISCARD ALL
max_client_conn = 500
default_pool_size = 20
reserve_pool_size = 5
reserve_pool_timeout = 10
#####################
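[Editor's note: not an authoritative diagnosis, but a likely cause given the config above: with pool_mode = transaction, consecutive transactions from one client can run on different server connections, so a plain SET search_path (a session-level setting) issued at the start of a request may land on one backend while later queries run on another. A sketch of a workaround is to scope the setting to each transaction with SET LOCAL, which lasts exactly as long as the unit pgbouncer hands out; the schema and table names here are illustrative, not from the thread:]

```sql
-- Hypothetical schema/table names. SET LOCAL reverts at COMMIT/ROLLBACK,
-- so it always applies to the same server connection as the queries
-- inside the transaction, even under transaction pooling.
BEGIN;
SET LOCAL search_path TO german, public;
SELECT * FROM cars;  -- resolved against the "german" schema
COMMIT;
```

Session pooling avoids the issue entirely, at the cost of pool efficiency.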
Benjamin Krajmalnik
2013-03-05 18:03:44 UTC
Set it to use session pooling. I had a similar issue after moving one of the components of our app to transaction pooling, which introduced undesired behavior.


Niels Kristian Schjødt
2013-03-05 18:27:32 UTC
Okay, thanks - but hey, if I set it to session pooling, then the documentation says: "default_pool_size: In session pooling it needs to be the number of max clients you want to handle at any moment". So as I understand it, I would have to set default_pool_size to 300 if I have up to 300 client connections? And then how would the pooler help my performance - wouldn't that be exactly like having the 300 clients connect directly to the database?

-NK


Jeff Janes
2013-03-05 21:59:14 UTC
On Tue, Mar 5, 2013 at 10:27 AM, Niels Kristian Schjødt <
Post by Niels Kristian Schjødt
Okay, thanks - but hey - if I put it at session pooling, then it says in
the documentation: "default_pool_size: In session pooling it needs to be
the number of max clients you want to handle at any moment". So as I
understand it, is it true that I then have to set default_pool_size to 300
if I have up to 300 client connections?
If those 300 client connections are all long-lived, then yes, you need that
many in the pool. If they are short-lived connections, then you can have a
lot fewer, as any beyond default_pool_size will simply block until an
existing connection is freed up and re-assigned - which won't take long
if they are short-lived connections.
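[Editor's note: the blocking behavior Jeff describes can be sketched with a toy semaphore model - this is not pgbouncer itself, just the queueing pattern: 300 short-lived clients share 20 "connections", and everyone still completes because each slot is freed quickly.]

```python
import threading
import time

POOL_SIZE = 20       # stands in for default_pool_size
CLIENTS = 300        # stands in for 300 client connections
pool = threading.Semaphore(POOL_SIZE)
completed = []

def client():
    with pool:               # blocks if all 20 slots are busy
        time.sleep(0.001)    # a short-lived "query"
    completed.append(1)

threads = [threading.Thread(target=client) for _ in range(CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(completed))  # all 300 clients finish despite only 20 slots
```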


Post by Niels Kristian Schjødt
And then what would the pooler then help on my performance - would that
just be exactly like having the 300 clients connect directly to the
database???
It would probably be even worse than having 300 clients connected
directly. There would be no point in using a pooler under those conditions.


Cheers,

Jeff
Gregg Jaskiewicz
2013-03-09 17:53:19 UTC
In my recent experience, PgPool2 performs pretty badly as a pooler. I'd
avoid it if possible, unless you depend on its other features.
It simply doesn't scale.
--
GJ
Wales Wang
2013-03-01 17:05:00 UTC
Please choose PCI-E flash storage for a write-heavy app.

Wales
Greg Smith
2013-03-10 15:58:47 UTC
The Dell PERC H710 (actually an LSI controller) works fine for
write-heavy workloads on a RAID 10, as long as you order it with a
battery backup unit module. Someone must install the controller
management utility and do three things, however:

1) Make sure the battery-backup unit is working.

2) Configure the controller so that the *disk* write cache is off.

3) Set the controller cache to "write-back when battery is available".
That will use the cache when it is safe to do so, and if not it will
bypass it. That will make the server slow down if the battery fails,
but it won't ever become unsafe at writing.
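[Editor's note: as a sketch, the three steps above map onto LSI's MegaCli utility (the H710 is LSI-based). Flag spellings vary between MegaCli versions, so treat these as a starting point and verify against your controller's documentation rather than running them blind:]

```shell
MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll         # 1) confirm the battery-backup unit is healthy
MegaCli64 -LDSetProp -DisDskCache -LAll -aAll    # 2) turn the on-disk write cache off
MegaCli64 -LDSetProp WB -LAll -aAll              # 3a) controller cache in write-back mode...
MegaCli64 -LDSetProp NoCachedBadBBU -LAll -aAll  # 3b) ...falling back to write-through if the BBU fails
```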

See http://wiki.postgresql.org/wiki/Reliable_Writes for more information
about this topic. If you'd like some consulting help with making sure
the server is working safely and as fast as it should be, 2ndQuadrant
does offer a hardware benchmarking service to do that sort of thing:
http://www.2ndquadrant.com/en/hardware-benchmarking/ I think we're even
generating those reports in German now.
--
Greg Smith 2ndQuadrant US ***@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Gregg Jaskiewicz
2013-03-12 21:41:08 UTC
Post by Niels Kristian Schjødt
Hi, I'm going to setup a new server for my postgresql database, and I am
considering one of these: http://www.hetzner.de/hosting/produkte_rootserver/poweredge-r720 with four SAS drives in a RAID 10 array. Has any of you any particular
comments/pitfalls/etc. to mention on the setup? My application is very
write heavy.
The Dell PERC H710 (actually a LSI controller) works fine for write-heavy
workloads on a RAID 10, as long as you order it with a battery backup unit
module. Someone must install the controller management utility and do
We're going to go with either HP or IBM (customer's preference, etc).
1) Make sure the battery-backup unit is working.
2) Configure the controller so that the *disk* write cache is off.
3) Set the controller cache to "write-back when battery is available".
That will use the cache when it is safe to do so, and if not it will bypass
it. That will make the server slow down if the battery fails, but it won't
ever become unsafe at writing.
See http://wiki.postgresql.org/wiki/Reliable_Writes for more information about this topic. If you'd like some consulting help
with making sure the server is working safely and as fast as it should be,
2ndQuadrant does offer a hardware benchmarking service to do that sort of
thing: http://www.2ndquadrant.com/en/hardware-benchmarking/ I think we're even generating those reports in German now.
Thanks Greg. I will follow the advice there, and also the advice in your book. I
do always make sure they order battery-backed cache (or flash-based, which
seems to be what people use these days).

I think the subject of using external help with setting things up did come up,
but more around connection pooling than the hardware itself (in short:
pgpool2 is crap, so we will go with a DNS-based solution and the apps
connecting directly to nodes).
I will let my clients (I'm doing this on contract) know that there's an
option to get you guys to help us. Mind you, this database is rather small
in the grand scheme of things (30-40GB) - just possibly a lot of occasional
writes.

We wouldn't need German, but proper English (i.e. British English) would
always be nice ;)


Whilst on the hardware subject: someone mentioned throwing SSDs into the
mix, i.e. combining spinning HDs with SSDs - apparently some RAID cards can
use small-ish (80GB+) SSDs as external caches. Any experiences with that?


Thanks !
--
GJ
John Lister
2013-03-13 15:33:37 UTC
Post by Gregg Jaskiewicz
Whilst on the hardware subject, someone mentioned throwing ssd into
the mix. I.e. combining spinning HDs with SSD, apparently some raid
cards can use small-ish (80GB+) SSDs as external caches. Any
experiences with that ?
The new LSI/Dell cards do this (e.g. the H710 mentioned in an earlier
post). It is easy to set up, and it seems to be supported on all versions
of Dell's cards even if the docs say it isn't. It worked well in the
limited testing I did; I have since switched to pretty much all SSD
drives in my current setup.

These cards also supposedly support enhanced performance with just SSDs
(CTIO) by playing with the cache settings, but to be honest I haven't
noticed any difference, and I'm not entirely sure the feature is enabled,
as there is no indication that CTIO is actually active and working.

John
Greg Jaskiewicz
2013-03-13 15:50:51 UTC
SSDs have a much shorter life than spinning drives, so what do you do when one inevitably fails in your system?
John Lister
2013-03-13 16:15:31 UTC
Post by Greg Jaskiewicz
SSDs have a much shorter life than spinning drives, so what do you do when one inevitably fails in your system?
Define much shorter? I accept they have a limited number of writes, but
that depends on load. You can actively monitor the drive's "health" level
in terms of wear using SMART, and it is relatively straightforward to
calculate an estimate of life based on average use; for me that works
out at somewhere in excess of 5 years. Experience tells me that spinning
drives have a habit of failing in that time frame as well :( and in 5
years I'll probably be replacing the server anyway.

I also overprovisioned the drives by about an extra 13%, giving me 20%
spare capacity when adding in the 7% manufacturer spare space. Currently
my drives have written about 4TB of data each and show 0% wear (these
are 160GB drives). I actively monitor the wear level and plan to replace
the drives when it gets low. For a comparison of write levels see
http://www.xtremesystems.org/forums/showthread.php?271063-SSD-Write-Endurance-25nm-Vs-34nm,
which shows that the 320 series reported hitting its wear limit at 190TB
(for a drive 1/4 the size of mine) but actually managed nearer 700TB
before the drive failed.
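[Editor's note: the life estimate John describes works out roughly like this. The 190TB rated-endurance figure is from the linked endurance test; the 50GB/day write rate is an assumed example - substitute your own, e.g. from SMART's total-writes counter:]

```python
# Rough SSD lifetime estimate from rated endurance and average write load.
rated_endurance_tb = 190      # vendor-rated write limit (from the linked test)
avg_writes_gb_per_day = 50    # assumed workload; measure yours via SMART

days = rated_endurance_tb * 1024 / avg_writes_gb_per_day
print(round(days / 365, 1))   # ~10.7 years at this write rate
```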

I've mixed two different manufacturers in my RAID 10 pairs to mitigate
against both drives of a pair failing at the same time, whether due to a
firmware bug or to being full. In addition, when I was setting the box up
I did some performance testing against the drives, using different
combinations for each test - the aim being to pre-load each drive
differently so they don't wear out simultaneously.

If you do go for RAID 10, make sure the drives have power-fail
protection, i.e. a capacitor or battery on the drive.

John
Steve Crawford
2013-03-13 19:23:18 UTC
Post by John Lister
Post by Greg Jaskiewicz
SSDs have much shorter life then spinning drives, so what do you do
when one inevitably fails in your system ?
Define much shorter? I accept they have a limited no of writes, but
that depends on load. You can actively monitor the drives "health"
level...
What concerns me more than wear is this:

InfoWorld Article:
http://www.infoworld.com/t/solid-state-drives/test-your-ssds-or-risk-massive-data-loss-researchers-warn-213715

Referenced research paper:
https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault

Kind of messes with the "D" in ACID.

Cheers,
Steve
Karl Denninger
2013-03-13 19:38:27 UTC
One potential way around this is to run ZFS as the underlying filesystem
and use the SSDs as cache drives. If they lose data due to a power
problem it is non-destructive.

Short of that, you cannot use an SSD on a machine where silent corruption
is unacceptable UNLESS you know it has a supercap or similar IN THE DISK
that guarantees the on-drive cache can be flushed in the event of a
power failure. A battery-backed controller cache DOES NOTHING to
alleviate this risk. If you violate this rule and the power goes off,
you must EXPECT silent and possibly catastrophic data corruption.

Only a few (and they're expensive!) SSD drives have said protection. If
yours does not, the only SAFE option is, as I described above, using
them as ZFS cache devices.
--
-- Karl Denninger
/The Market Ticker ®/ <http://market-ticker.org>
Cuda Systems LLC
CSS
2013-03-13 19:47:14 UTC
Post by Steve Crawford
Post by Greg Jaskiewicz
SSDs have much shorter life then spinning drives, so what do you do when one inevitably fails in your system ?
Define much shorter? I accept they have a limited no of writes, but that depends on load. You can actively monitor the drives "health" level...
http://www.infoworld.com/t/solid-state-drives/test-your-ssds-or-risk-massive-data-loss-researchers-warn-213715
https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault
Kind of messes with the "D" in ACID.
Have a look at this:

http://blog.2ndquadrant.com/intel_ssd_now_off_the_sherr_sh/

I'm not sure which other SSDs offer this, but Intel's newest entry will, and it's attractively priced.

Another way we leverage SSDs, which can be more reliable in the face of total SSD meltdown, is to use them as ZFS Intent Log caches. All the sync writes get handled on the SSDs. We deploy them as mirrored vdevs, so if one fails, we're OK; if both fail, we're really slow until someone can replace them. On modest hardware, I was able to get about 20K TPS out of pgbench with the SSDs configured as ZIL and four 10K Raptors as the spinning disks.

In either case, the amount of money you'd have to spend on the two dozen or so SAS drives (and the controllers, enclosure, etc.) that would equal a few pairs of SSDs in random I/O performance is non-trivial, even if you plan on proactively retiring your SSDs every year.
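[Editor's note: for reference, adding a mirrored SSD log (ZIL) vdev to an existing pool as Charles describes is a one-liner; the pool and device names here are illustrative, not from the thread:]

```shell
# Mirrored log vdev: sync writes land on the SSD pair before the spinning disks.
zpool add tank log mirror /dev/ada1 /dev/ada2
zpool status tank   # a "logs" section should now list the mirror
```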

Just another take on the issue..

Charles
John Lister
2013-03-13 20:05:52 UTC
Post by Steve Crawford
Post by John Lister
Post by Greg Jaskiewicz
SSDs have much shorter life then spinning drives, so what do you do
when one inevitably fails in your system ?
Define much shorter? I accept they have a limited no of writes, but
that depends on load. You can actively monitor the drives "health"
level...
http://www.infoworld.com/t/solid-state-drives/test-your-ssds-or-risk-massive-data-loss-researchers-warn-213715
When I read this, they didn't name the drives that failed - or those that
passed. But I'm assuming the failed ones were standard consumer SSDs,
while the two good ones were either enterprise drives or had caps. The
reason I say this is that, by the nature of their operation, SSDs
cache/store information in RAM while writing it to the flash and to
handle the mappings etc. of real to virtual sectors; if they lose power,
it is this that is lost, causing at best corruption if not complete loss
of the drive. Enterprise drives (and some consumer ones, such as the
320s) have either capacitors or battery backup to allow the drive to
shut down safely. There have been various reports, both on this list and
elsewhere, showing that these drives successfully survive repeated power
failures.

A bigger concern is the state of the firmware in these drives, which
until recently was more likely to trash your drive - fortunately things
seem to be becoming more stable with age now.

John
David Boreham
2013-03-13 20:16:16 UTC
Post by Steve Crawford
http://www.infoworld.com/t/solid-state-drives/test-your-ssds-or-risk-massive-data-loss-researchers-warn-213715
https://www.usenix.org/conference/fast13/understanding-robustness-ssds-under-power-fault
Kind of messes with the "D" in ACID.
It is somewhat surprising to discover that many SSD products are not
durable under sudden power loss (what were they thinking!?, and ...why
doesn't anyone care??).

However, there is a set of SSD types known to be designed to address
power loss events that have been tested by contributors to this list.
Use only those devices and you won't see this problem. SSDs do have a
wear-out mechanism but wear can be monitored and devices replaced in
advance of failure. In practice longevity is such that most machines
will be in the dumpster long before the SSD wears out. We've had
machines running with several hundred wps constantly for 18 months using
Intel 710 drives and the wear level SMART value is still zero.
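[Editor's note: the wear check David mentions can be scripted with smartmontools. The device path and attribute name are examples - Intel drives expose Media_Wearout_Indicator, a normalized value that starts at 100 and counts down - and the sample line below is illustrative, not captured from a real drive:]

```shell
# On a live system: smartctl -A /dev/sda | grep -i wearout
# Here we parse a sample attribute line; field 4 is the normalized value.
line="233 Media_Wearout_Indicator 0x0032   097   097   000    Old_age   Always       -       0"
echo "$line" | awk '{print 100 - $4}'   # approximate percent worn (prints 3)
```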

In addition, like any electronics module (CPU, memory, NIC), an SSD can
fail so you do need to arrange for valuable data to be replicated.
As with old school disk drives, firmware bugs are a concern so you might
want to consider what would happen if all the drives of a particular
type all decided to quit working at the same second in time (I've only
seen this happen myself with magnetic drives, but in theory it could
happen with SSD).
Mark Kirkwood
2013-03-14 03:29:13 UTC
Just going through this now with a vendor. They initially assured us
that the drives had "end to end protection" so we did not need to worry.
I had to post stripdown pictures from Intel's S3700, showing obvious
capacitors attached to the board before I was taken seriously and
actually meaningful specifications were revealed. So now I'm demanding
to know:

- chipset (and version)
- original manufacturer (for re-badged ones)
- power off protection *explicitly* mentioned
- show me the circuit board (and where are the capacitors)

Seems like you gotta push 'em!

Cheers

Mark
David Boreham
2013-03-14 03:39:51 UTC
Permalink
Post by Mark Kirkwood
Just going through this now with a vendor. They initially assured us
that the drives had "end to end protection" so we did not need to
worry. I had to post stripdown pictures from Intel's s3700, showing
obvious capacitors attached to the board before I was taken seriously
and actually meaningful specifications were revealed. So now I'm
- chipset (and version)
- original manufacturer (for re-badged ones)
- power off protection *explicitly* mentioned
- show me the circuit board (and where are the capacitors)
In addition to the above, I only use drives where I've seen compelling
evidence that plug pull tests have been done and passed (e.g. done by
someone on this list or in-house here). I also like to have a high
level of confidence in the firmware development group. This results in a
very small set of acceptable products :(
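Plug-pull testing of the sort described above is usually done with a tool like diskchecker.pl; the core idea can be sketched as follows - write sequenced, checksummed records with an fsync after each, cut power mid-run, then verify that every record the writer believed durable actually survived. This is an illustrative sketch, not the actual tool:

```python
import os
import struct
import zlib

RECORD = struct.Struct("<IQ")  # crc32, sequence number (12 bytes)

def write_records(path, count):
    """Append `count` sequenced, checksummed records, fsyncing each one.
    Once fsync returns, the record is supposed to be durable on disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    try:
        for seq in range(count):
            payload = struct.pack("<Q", seq)
            rec = struct.pack("<I", zlib.crc32(payload)) + payload
            os.write(fd, rec)
            os.fsync(fd)  # a drive that lies about this will fail the test
    finally:
        os.close(fd)

def verify(path):
    """Return the number of intact, in-order records; a gap or corruption
    after power loss means the drive dropped acknowledged writes."""
    good = 0
    with open(path, "rb") as f:
        while True:
            rec = f.read(RECORD.size)
            if len(rec) < RECORD.size:
                break
            crc, seq = RECORD.unpack(rec)
            if crc != zlib.crc32(struct.pack("<Q", seq)) or seq != good:
                break
            good += 1
    return good
```

In a real plug-pull run the writer reports its last fsynced sequence number over the network before the power is cut, and `verify` must find at least that many records after reboot.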
Bruce Momjian
2013-03-14 18:54:24 UTC
Permalink
Post by Niels Kristian Schjødt
Hi, I'm going to setup a new server for my postgresql database, and I
am considering one of these: http://www.hetzner.de/hosting/
produkte_rootserver/poweredge-r720 with four SAS drives in a RAID 10
array. Has any of you any particular comments/pitfalls/etc. to mention
on the setup? My application is very write heavy.
The Dell PERC H710 (actually an LSI controller) works fine for write-heavy
workloads on a RAID 10, as long as you order it with a battery backup unit
module. Someone must install the controller management utility and do
the following:
1) Make sure the battery-backup unit is working.
2) Configure the controller so that the *disk* write cache is off.
We're going to go with either HP or IBM (customer's preference, etc).
Only use SSDs with a BBU cache, and don't set SSD caches to
write-through because an SSD needs to cache the write to avoid wearing
out the chips early, see:

http://momjian.us/main/blogs/pgblog/2012.html#August_3_2012
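For directly attached SATA drives, the on-disk write-cache state mentioned above can be inspected and changed from the OS with hdparm (drives behind a RAID controller need the vendor's management utility instead); the device name below is a placeholder:

```shell
# Show the current write-caching state of the drive
hdparm -W /dev/sda

# Turn the on-drive write cache off, relying on the controller's BBU cache
hdparm -W0 /dev/sda
```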
--
Bruce Momjian <***@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
Mark Kirkwood
2013-03-14 21:37:55 UTC
Permalink
Post by Bruce Momjian
Only use SSDs with a BBU cache, and don't set SSD caches to
write-through because an SSD needs to cache the write to avoid wearing
http://momjian.us/main/blogs/pgblog/2012.html#August_3_2012
I'm not convinced about the need for a BBU with SSDs - you *can* use them
without one; you just need to make sure about suitable longevity and
the presence of (proven) power-off protection (as discussed previously).
It is worth noting that using unproven or SSD known to be lacking power
off protection with a BBU will *not* save you from massive corruption
(or device failure) upon unexpected power loss.

Also, in terms of performance, the faster PCIe SSD do about as well by
themselves as connected to a RAID card with BBU. In fact they will do
better in some cases (the faster SSD can get close to the max IOPS many
RAID cards can handle...so more than a couple of 'em plugged into one
card will be throttled by its limitations).

Cheers

Mark
Mark Kirkwood
2013-03-14 21:47:07 UTC
Permalink
Post by Mark Kirkwood
Also, in terms of performance, the faster PCIe SSD do about as well by
themselves as connected to a RAID card with BBU.
Sorry - I meant to say "the faster **SAS** SSD do...", since you can't
currently plug PCIe SSD into RAID cards (confusingly, some of the PCIe
guys actually have RAID card firmware on their boards...Intel 910 I think).

Cheers

Mark
Bruce Momjian
2013-03-14 22:34:49 UTC
Permalink
Post by Mark Kirkwood
Post by Bruce Momjian
Only use SSDs with a BBU cache, and don't set SSD caches to
write-through because an SSD needs to cache the write to avoid wearing
http://momjian.us/main/blogs/pgblog/2012.html#August_3_2012
I not convinced about the need for BBU with SSD - you *can* use them
without one, just need to make sure about suitable longevity and
also the presence of (proven) power off protection (as discussed
previously). It is worth noting that using unproven or SSD known to
be lacking power off protection with a BBU will *not* save you from
massive corruption (or device failure) upon unexpected power loss.
I don't think any drive that corrupts on power-off is suitable for a
database, but for non-db uses, sure, I guess they are OK, though you
have to be pretty money-constrained to like that tradeoff.
--
Bruce Momjian <***@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
Mark Kirkwood
2013-03-14 22:53:37 UTC
Permalink
Post by Bruce Momjian
I don't think any drive that corrupts on power-off is suitable for a
database, but for non-db uses, sure, I guess they are OK, though you
have to be pretty money-constrainted to like that tradeoff.
Agreed - really *all* SSD should have capacitor (or equivalent) power
off protection...the fact that it's a feature present on only a handful
of drives is...disappointing.
Rick Otten
2013-03-15 18:06:02 UTC
Permalink
Post by Mark Kirkwood
I not convinced about the need for BBU with SSD - you *can* use them
without one, just need to make sure about suitable longevity and also
the presence of (proven) power off protection (as discussed
previously). It is worth noting that using unproven or SSD known to be
lacking power off protection with a BBU will *not* save you from
massive corruption (or device failure) upon unexpected power loss.
I don't think any drive that corrupts on power-off is suitable for a database, but for non-db uses, sure, I guess they are OK, though you have to be pretty money-constrained to like that tradeoff.
Wouldn't mission critical databases normally be configured in a high availability cluster - presumably with replicas running on different power sources?

If you lose power to a member of the cluster (or even the master), you would have new data coming in and stuff to do long before it could come back online - corrupted disk or not.

I find it hard to imagine configuring something that is too critical to be restorable from periodic backup and NOT putting it in a (synchronous) cluster. I'm not sure what all the fuss over whether an SSD might come back after a hard server failure is really about. You should architect the solution so you can lose the server, throw it away, and never bring it back online again. Native streaming replication is fairly straightforward to configure. Asynchronous multimaster (albeit with some synchronization latency) is also fairly easy to configure using third-party tools such as SymmetricDS.

Agreed that adding a supercap doesn't sound like a hard thing for a hardware manufacturer to do, but I don't think it should necessarily be a showstopper for being able to take advantage of some awesome I/O performance opportunities.
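For reference, the native streaming replication mentioned above needed only a few settings in the 9.x era; a minimal sketch, in which the host name and user are placeholders:

```ini
# postgresql.conf on the primary (PostgreSQL 9.2-era parameters)
wal_level = hot_standby
max_wal_senders = 3

# postgresql.conf on the standby, to allow read-only queries
hot_standby = on

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replicator'
```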
Bruce Momjian
2013-03-15 18:55:08 UTC
Permalink
Post by Rick Otten
Post by Bruce Momjian
I don't think any drive that corrupts on power-off is suitable for a
database, but for non-db uses, sure, I guess they are OK, though you
have to be pretty money-constrained to like that tradeoff.
Wouldn't mission critical databases normally be configured in a high
availability cluster - presumably with replicas running on different
power sources?
If you lose power to a member of the cluster (or even the master), you
would have new data coming in and stuff to do long before it could
come back online - corrupted disk or not.
I find it hard to imagine configuring something that is too critical
to be able to be restored from periodic backup to NOT be in a
(synchronous) cluster. I'm not sure all the fuss over whether an SSD
might come back after a hard server failure is really about. You
should architect the solution so you can lose the server and throw
it away and never bring it back online again. Native streaming
replication is fairly straightforward to configure. Asynchronous
multimaster (albeit with some synchronization latency) is also fairly
easy to configure using third party tools such as SymmetricDS.
Agreed that adding a supercap doesn't sound like a hard thing for
a hardware manufacturer to do, but I don't think it should be a
necessarily be showstopper for being able to take advantage of some
awesome I/O performance opportunities.
Do you want to recreate the server if it loses power over an extra $100
per drive?
--
Bruce Momjian <***@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
Scott Marlowe
2013-03-15 19:14:17 UTC
Permalink
Post by Rick Otten
Post by Mark Kirkwood
I not convinced about the need for BBU with SSD - you *can* use them
without one, just need to make sure about suitable longevity and also
the presence of (proven) power off protection (as discussed
previously). It is worth noting that using unproven or SSD known to be
lacking power off protection with a BBU will *not* save you from
massive corruption (or device failure) upon unexpected power loss.
I don't think any drive that corrupts on power-off is suitable for a database, but for non-db uses, sure, I guess they are OK, though you have to be pretty money-constrained to like that tradeoff.
Wouldn't mission critical databases normally be configured in a high availability cluster - presumably with replicas running on different power sources?
I've worked in high-end data centers where certain failures resulted
in ALL power being lost - more than once. Relying on never losing
power to keep your data from getting corrupted is not a good idea. Now,
if the replicas are geographically separate, you're maybe OK.
Mark Kirkwood
2013-03-16 08:47:34 UTC
Permalink
Post by Rick Otten
Post by Mark Kirkwood
I not convinced about the need for BBU with SSD - you *can* use them
without one, just need to make sure about suitable longevity and also
the presence of (proven) power off protection (as discussed
previously). It is worth noting that using unproven or SSD known to be
lacking power off protection with a BBU will *not* save you from
massive corruption (or device failure) upon unexpected power loss.
I don't think any drive that corrupts on power-off is suitable for a database, but for non-db uses, sure, I guess they are OK, though you have to be pretty money-constrained to like that tradeoff.
Wouldn't mission critical databases normally be configured in a high availability cluster - presumably with replicas running on different power sources?
If you lose power to a member of the cluster (or even the master), you would have new data coming in and stuff to do long before it could come back online - corrupted disk or not.
I find it hard to imagine configuring something that is too critical to be able to be restored from periodic backup to NOT be in a (synchronous) cluster. I'm not sure all the fuss over whether an SSD might come back after a hard server failure is really about. You should architect the solution so you can lose the server and throw it away and never bring it back online again. Native streaming replication is fairly straightforward to configure. Asynchronous multimaster (albeit with some synchronization latency) is also fairly easy to configure using third party tools such as SymmetricDS.
Agreed that adding a supercap doesn't sound like a hard thing for a hardware manufacturer to do, but I don't think it should be a necessarily be showstopper for being able to take advantage of some awesome I/O performance opportunities.
A somewhat extreme point of view. I note that the MongoDB guys added
journaling for single-server reliability a while ago - an admission that
while in *theory* lots of semi-reliable nodes can be eventually
consistent, it is a lot less hassle if individual nodes are as reliable
as possible. That is what this discussion is about.

Regards

Mark
David Boreham
2013-03-14 23:37:54 UTC
Permalink
Post by Mark Kirkwood
I not convinced about the need for BBU with SSD - you *can* use them
without one, just need to make sure about suitable longevity and also
the presence of (proven) power off protection (as discussed
previously). It is worth noting that using unproven or SSD known to be
lacking power off protection with a BBU will *not* save you from
massive corruption (or device failure) upon unexpected power loss.
I think it probably depends on the specifics of the deployment, but for
us the fact that the BBU isn't required in order to achieve high write
tps with SSDs is one of the key benefits -- the power, cooling and space
savings over even a few servers are significant. In our case we only
have one or two drives per server so no need for fancy drive string
arrangements.
Post by Mark Kirkwood
Also, in terms of performance, the faster PCIe SSD do about as well by
themselves as connected to a RAID card with BBU. In fact they will do
better in some cases (the faster SSD can get close to the max IOPS
many RAID cards can handle...so more than a couple of 'em plugged into
one card will be throttled by its limitations).
You might want to evaluate the performance you can achieve with a
single-SSD (use several for capacity by all means) before considering a
RAID card + SSD solution.
Again I bet it depends on the application but our experience with the
older Intel 710 series is that their performance out-runs the CPU, at
least under our PG workload.
David Rees
2013-03-21 00:44:36 UTC
Permalink
You might want to evaluate the performance you can achieve with a single-SSD
(use several for capacity by all means) before considering a RAID card + SSD
solution.
Again I bet it depends on the application but our experience with the older
Intel 710 series is that their performance out-runs the CPU, at least under
our PG workload.
How many people are using a single enterprise grade SSD for production
without RAID? I've had a few consumer grade SSDs brick themselves -
but are the enterprise grade SSDs, like the new Intel S3700 which you
can get in sizes up to 800GB, reliable enough to run as a single drive
without RAID1? The performance of one is definitely good enough for
most medium sized workloads without the complexity of a BBU RAID and
multiple spinning disks...

-Dave
David Boreham
2013-03-21 01:04:42 UTC
Permalink
Post by David Rees
You might want to evaluate the performance you can achieve with a single-SSD
(use several for capacity by all means) before considering a RAID card + SSD
solution.
Again I bet it depends on the application but our experience with the older
Intel 710 series is that their performance out-runs the CPU, at least under
our PG workload.
How many people are using a single enterprise grade SSD for production
without RAID? I've had a few consumer grade SSDs brick themselves -
but are the enterprise grade SSDs, like the new Intel S3700 which you
can get in sizes up to 800GB, reliable enough to run as a single drive
without RAID1? The performance of one is definitely good enough for
most medium sized workloads without the complexity of a BBU RAID and
multiple spinning disks...
You're replying to my post, but I'll raise my hand again :)

We run a bunch of single-socket 1U, short-depth machines (Supermicro
chassis) using 1x Intel 710 drives (we'd use S3700 in new deployments
today). The most recent of these have 128GB of RAM and an E5-2620
hex-core CPU, and dissipate less than 150W at full load.

Couldn't be happier with the setup. We have 18 months up time with no
drive failures, running at several hundred wps 7x24. We also write tens
of GB of log files every day that are rotated, so the drives are getting
beaten up on bulk data overwrites too.

There is of course a non-zero probability of some unpleasant firmware
bug afflicting the drives (as with regular spinning drives), and
initially we deployed a "spare" 10k HD in the chassis, spun-down, that
would allow us to re-jigger the machines without SSD remotely (the data
center is 1000 miles away). We never had to do that, and later
deployments omitted the HD spare. We've also considered mixing SSD from
two vendors for firmware-bug-diversity, but so far we only have one
approved vendor (Intel).
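Longevity claims like the ones above can be sanity-checked with simple arithmetic; a rough sketch, where the endurance rating and write rate are made-up illustrative numbers, not Intel 710 specifications:

```python
def years_until_wearout(tbw_rating_tb, gb_written_per_day,
                        write_amplification=1.2):
    """Rough drive-life estimate from the vendor's endurance rating
    (total TB written) and the host write rate. Write amplification is
    workload-dependent; 1.2 here is just an assumed figure."""
    host_tb_per_year = gb_written_per_day * 365 / 1024
    return tbw_rating_tb / (host_tb_per_year * write_amplification)

# e.g. a drive rated for 1000 TB written, absorbing 50 GB/day of host
# writes, lasts decades - consistent with "dumpster before wear-out"
print(round(years_until_wearout(1000, 50), 1))  # ~46.8 years
```

At that rate the chassis really will be in the dumpster long before the flash wears out, which matches the zero SMART wear reading reported above.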
Karl Denninger
2013-03-21 01:46:07 UTC
Permalink
Post by David Rees
You might want to evaluate the performance you can achieve with a single-SSD
(use several for capacity by all means) before considering a RAID card + SSD
solution.
Again I bet it depends on the application but our experience with the older
Intel 710 series is that their performance out-runs the CPU, at least under
our PG workload.
How many people are using a single enterprise grade SSD for production
without RAID? I've had a few consumer grade SSDs brick themselves -
but are the enterprise grade SSDs, like the new Intel S3700 which you
can get in sizes up to 800GB, reliable enough to run as a single drive
without RAID1? The performance of one is definitely good enough for
most medium sized workloads without the complexity of a BBU RAID and
multiple spinning disks...
-Dave
Two is one, one is none.
:-)

--
Karl Denninger
/The Market Ticker ®/ <http://market-ticker.org>
Cuda Systems LLC
Scott Marlowe
2013-03-21 01:56:46 UTC
Permalink
Post by David Rees
You might want to evaluate the performance you can achieve with a single-SSD
(use several for capacity by all means) before considering a RAID card + SSD
solution.
Again I bet it depends on the application but our experience with the older
Intel 710 series is that their performance out-runs the CPU, at least under
our PG workload.
How many people are using a single enterprise grade SSD for production
without RAID? I've had a few consumer grade SSDs brick themselves -
but are the enterprise grade SSDs, like the new Intel S3700 which you
can get in sizes up to 800GB, reliable enough to run as a single drive
without RAID1? The performance of one is definitely good enough for
most medium sized workloads without the complexity of a BBU RAID and
multiple spinning disks...
I would still at least run two in software RAID-1 for reliability.
Mark Kirkwood
2013-03-21 02:26:07 UTC
Permalink
Post by David Rees
You might want to evaluate the performance you can achieve with a single-SSD
(use several for capacity by all means) before considering a RAID card + SSD
solution.
Again I bet it depends on the application but our experience with the older
Intel 710 series is that their performance out-runs the CPU, at least under
our PG workload.
How many people are using a single enterprise grade SSD for production
without RAID? I've had a few consumer grade SSDs brick themselves -
but are the enterprise grade SSDs, like the new Intel S3700 which you
can get in sizes up to 800GB, reliable enough to run as a single drive
without RAID1? The performance of one is definitely good enough for
most medium sized workloads without the complexity of a BBU RAID and
multiple spinning disks...
If you are using Intel S3700 or 710's you can certainly use a pair setup
in software RAID1 (so avoiding the need for RAID cards and BBU etc).

I'd certainly feel happier with 2 drives :-). However, a setup using
replication across a number of hosts - each with a single SSD - is going
to be OK.

Regards

Mark
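The software RAID1 suggested above is a couple of commands with md on Linux; a hedged sketch, in which the device names, filesystem, and mount point are all placeholders:

```shell
# Create a two-drive mirror from the SSDs (device names are placeholders)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

# Put a filesystem on the mirror and mount it for the PostgreSQL data dir
mkfs.ext4 /dev/md0
mount -o noatime /dev/md0 /var/lib/pgsql
```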