Risk of data corruption/loss?

Discussion:

Niels Kristian Schjødt

2013-03-13 15:24:03 UTC

I'm considering the following setup:

- Master server with a battery-backed RAID controller and 4 SAS disks in RAID 0 - so NO mirroring here, due to max performance requirements.
- Slave server set up with streaming replication on 4 HDDs in RAID 10. The setup will be done with synchronous_commit=off and synchronous_standby_names = ''

So as you might have noticed, there is clearly a risk of data loss, which is acceptable, since our data is not very crucial. However, I have quite a hard time figuring out whether there is a risk of total data corruption across both servers in this setup. E.g. something goes wrong on the master and the WAL files get corrupted. Will the slave then apply the WAL files INCLUDING the corruption (e.g. an unfinished transaction etc.), or will it automatically stop restoring at the point just BEFORE the corruption, so my only loss is data AFTER the corruption?

Hope my question is clear

--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Jeff Janes

2013-03-13 17:13:28 UTC

On Wed, Mar 13, 2013 at 8:24 AM, Niels Kristian Schjødt <

Post by Niels Kristian Schjødt
- Master server with battery back raid controller with 4 SAS disks in a
RAID 0 - so NO mirroring here, due to max performance requirements.
- Slave server setup with streaming replication on 4 HDD's in RAID 10. The
setup will be done with synchronous_commit=off and
synchronous_standby_names = ''

Out of curiosity, in the presence of BB controller, is
synchronous_commit=off getting you additional performance?

It depends on where the corruption happens. WAL is checksummed, so the
slave will detect a mismatch and stop applying records. However, if the
corruption happens in RAM before the checksum is taken, the checksum will
match and it will attempt to apply the records.
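
To illustrate the distinction, here is a minimal Python sketch. It is a toy model, not PostgreSQL's actual WAL record format: `make_record`, `verify_record`, and `flip_bit` are hypothetical helpers, and `zlib.crc32` merely stands in for the CRC the server computes over each record.

```python
import struct
import zlib


def make_record(payload: bytes) -> bytes:
    """Prefix the payload with a CRC-32 of its contents (toy record format)."""
    return struct.pack("<I", zlib.crc32(payload)) + payload


def verify_record(record: bytes) -> bool:
    """Return True if the stored CRC matches the payload that follows it."""
    (stored_crc,) = struct.unpack("<I", record[:4])
    return zlib.crc32(record[4:]) == stored_crc


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with the lowest bit of one byte flipped."""
    damaged = bytearray(data)
    damaged[index] ^= 0x01
    return bytes(damaged)


payload = b"UPDATE accounts SET balance = 100"

# Case 1: the record is damaged AFTER the checksum was computed
# (for example on disk or in transit). Verification fails.
record = make_record(payload)
damaged_on_disk = flip_bit(record, 10)
detected_after = not verify_record(damaged_on_disk)

# Case 2: the payload is damaged in memory BEFORE the checksum is computed.
# The checksum covers the already-damaged bytes, so verification passes.
damaged_in_ram = flip_bit(payload, 6)
record_from_bad_ram = make_record(damaged_in_ram)
detected_before = not verify_record(record_from_bad_ram)

print("corruption after checksum detected:", detected_after)
print("corruption before checksum detected:", detected_before)
```

In the first case the reader of the record can refuse it; in the second case nothing downstream can tell the record apart from a valid one.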

Cheers,

Jeff

Niels Kristian Schjødt

2013-03-13 17:34:19 UTC

Post by Niels Kristian Schjødt
- Master server with battery back raid controller with 4 SAS disks in a RAID 0 - so NO mirroring here, due to max performance requirements.
- Slave server setup with streaming replication on 4 HDD's in RAID 10. The setup will be done with synchronous_commit=off and synchronous_standby_names = ''
Out of curiosity, in the presence of BB controller, is synchronous_commit=off getting you additional performance?

Time will show :-)


Joshua Berkus

2013-03-13 21:18:38 UTC

Niels,

Post by Niels Kristian Schjødt
- Master server with battery back raid controller with 4 SAS disks in
a RAID 0 - so NO mirroring here, due to max performance
requirements.
- Slave server setup with streaming replication on 4 HDD's in RAID
10. The setup will be done with synchronous_commit=off and
synchronous_standby_names = ''

I'd be concerned that, assuming you're making the master high-risk for performance reasons, the standby would not keep up.

Well, in general RAID 1 really just protects you from HDD failure, not the more subtle types of corruption which occur onboard an HDD. So in that respect, you haven't increased your chances of data corruption at all; if the master loses a disk, it should just stop operating, and a simple check that all WALs are 16MB on the standby would do the rest. I'd be more concerned that you're likely to be yanking and completely rebuilding the master server every 4 or 5 months.
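
A rough sketch of such a size check in Python, assuming uncompressed segments at the default 16 MB size and the usual 24-hex-digit segment file names. `find_bad_segments` and the directory passed on the command line are hypothetical names for illustration, not part of any PostgreSQL tool.

```python
import os
import re
import sys

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size (assumption)
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")  # e.g. 000000010000000000000001


def find_bad_segments(wal_dir, expected_size=WAL_SEGMENT_SIZE):
    """Return (name, size) for every WAL segment whose size is unexpected."""
    bad = []
    for name in sorted(os.listdir(wal_dir)):
        # Skip anything that is not a plain segment file
        # (history files, backup labels, subdirectories).
        if not WAL_NAME.match(name):
            continue
        size = os.path.getsize(os.path.join(wal_dir, name))
        if size != expected_size:
            bad.append((name, size))
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    problems = find_bad_segments(sys.argv[1])
    for name, size in problems:
        print(f"{name}: {size} bytes, expected {WAL_SEGMENT_SIZE}")
    # Non-zero exit status so a cron job or monitoring check can alert.
    sys.exit(1 if problems else 0)
```

This only catches truncated or oversized segments. It says nothing about the contents of a segment, and it would need adjusting if the archive is compressed or the server was built with a non-default segment size.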
