Discussion:
Do I have a hardware or a software problem?
(too old to reply)
Niels Kristian Schjødt
2012-12-10 22:51:58 UTC
Permalink
#### Pitch ######################################################################################
I previously posted this question http://archives.postgresql.org/pgsql-performance/2012-11/msg00289.php about a performance issue with an update query.
The question evolved into a more general discussion about my setup, and about a lot of I/O wait that I was encountering. Since then, I have gotten a whole lot more familiar with measuring things, and now I "just" need some experienced eyes to judge which direction I should go in - do I have a hardware issue, or a software issue - and what action should I take?

##### My setup #############################################################################
The use case:
At night time we are doing a LOT of data maintenance, and hence the load on the database is very different from the daytime. However, we would like to be able to do some of it in the daytime; it is simply too "heavy" on the database as it is right now. The stats shown below are from one of those "heavy" load periods.

Hardware:
- 32GB RAM
- 8-core Xeon E3-1245 processor
- Two SEAGATE ST33000650NS drives (called sdc and sdd in the stats) in a software RAID1 array (called md2 in the stats)
- Two INTEL SSDSC2CW240A3 SSD drives (called sda and sdb in the stats) in a software RAID1 (called md3 in the stats)

Software:
PostgreSQL 9.2 running on 64-bit Ubuntu 12.04 with kernel 3.2

Configuration:
# postgresql.conf (a shortlist of everything changed from the default)
data_directory = '/var/lib/postgresql/9.2/main'
hba_file = '/etc/postgresql/9.2/main/pg_hba.conf'
ident_file = '/etc/postgresql/9.2/main/pg_ident.conf'
external_pid_file = '/var/run/postgresql/9.2-main.pid'
listen_addresses = '192.168.0.2, localhost'
port = 5432
max_connections = 300
unix_socket_directory = '/var/run/postgresql'
wal_level = hot_standby
synchronous_commit = off
archive_mode = on
archive_command = 'rsync -a %p ***@192.168.0.4:/var/lib/postgresql/9.2/wals/%f </dev/null'
max_wal_senders = 1
wal_keep_segments = 32
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d '
datestyle = 'iso, mdy'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 22GB
work_mem = 160MB
wal_buffers = 4MB
checkpoint_segments = 100
shared_buffers = 4GB
checkpoint_timeout = 10min

The kernel has been tweaked like so:
vm.dirty_ratio = 10
vm.dirty_background_ratio = 1
kernel.shmmax = 8589934592
kernel.shmall = 17179869184
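For reference, settings like these typically go in /etc/sysctl.conf and can also be applied on the fly with sysctl, for example:

sudo sysctl -w vm.dirty_background_ratio=1
sudo sysctl -w vm.dirty_ratio=10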

The pg_xlog folder has been moved onto the SSD array (md3), and symlinked back into the postgres dir.
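Roughly speaking, the move looks like this (just a sketch - /ssd as the mount point for md3 is an assumed example, and the cluster has to be stopped while pg_xlog is moved):

sudo service postgresql stop
sudo mv /var/lib/postgresql/9.2/main/pg_xlog /ssd/pg_xlog
sudo ln -s /ssd/pg_xlog /var/lib/postgresql/9.2/main/pg_xlog
sudo chown -h postgres:postgres /var/lib/postgresql/9.2/main/pg_xlog
sudo service postgresql start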

##### The stats ###############################################################
These are the typical observations/stats I see in one of these periods:

1)
At top level this is what I see in new relic:
https://rpm.newrelic.com/public/charts/6ewGRle6bmc

2)
When the database is loaded like this, I see a lot of queries taking up to 1,000 times as long as they would when the database is not loaded so heavily.

3)
sudo iostat -dmx (typical usage)
Linux 3.2.0-33-generic (master-db) 12/10/2012 _x86_64_ (8 CPU)

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 6.52 3.59 26.61 0.22 0.74 65.49 0.01 0.40 0.77 0.35 0.14 0.43
sdb 0.00 8.31 0.03 28.38 0.00 0.97 69.63 0.01 0.52 0.27 0.52 0.15 0.43
sdc 1.71 46.01 34.83 116.62 0.56 4.06 62.47 1.90 12.57 21.81 9.81 1.89 28.66
sdd 1.67 46.14 34.89 116.49 0.56 4.06 62.46 1.58 10.43 21.66 7.07 1.89 28.60
md1 0.00 0.00 0.00 0.00 0.00 0.00 2.69 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.11 0.24 0.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 72.99 161.95 1.11 4.06 45.10 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 0.05 32.32 0.00 0.74 47.00 0.00 0.00 0.00 0.00 0.00 0.00

4)
sudo iotop -oa (running for about a minute or so)
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
292 be/4 root 0.00 B 0.00 B 0.00 % 99.33 % [md2_raid1]
2815 be/4 postgres 19.51 M 25.90 M 0.00 % 45.49 % postgres: autovacuum worker process production
32553 be/4 postgres 45.74 M 9.38 M 0.00 % 37.89 % postgres: user production 192.168.0.3(58866) UPDATE
32570 be/4 postgres 6.91 M 35.02 M 0.00 % 16.71 % postgres: user production 192.168.0.3(35547) idle
32575 be/4 postgres 4.06 M 43.90 M 0.00 % 16.62 % postgres: user production 192.168.0.3(35561) SELECT
31673 be/4 postgres 4.14 M 52.16 M 0.00 % 16.24 % postgres: user production 192.168.0.3(39112) idle
32566 be/4 postgres 4.73 M 44.95 M 0.00 % 15.66 % postgres: user production 192.168.0.3(35531) idle
32568 be/4 postgres 4.50 M 33.84 M 0.00 % 14.62 % postgres: user production 192.168.0.3(35543) SELECT
32573 be/4 postgres 3.20 M 34.44 M 0.00 % 13.98 % postgres: user production 192.168.0.3(35559) idle
31590 be/4 postgres 3.23 M 29.72 M 0.00 % 13.90 % postgres: user production 192.168.0.3(50690) idle in transaction
32577 be/4 postgres 5.09 M 25.54 M 0.00 % 13.63 % postgres: user production 192.168.0.3(35563) idle
32565 be/4 postgres 2.06 M 35.93 M 0.00 % 13.41 % postgres: user production 192.168.0.3(35529) SELECT
32546 be/4 postgres 4.48 M 36.49 M 0.00 % 13.39 % postgres: user production 192.168.0.3(56927) UPDATE waiting
32569 be/4 postgres 3.50 M 26.75 M 0.00 % 12.82 % postgres: user production 192.168.0.3(35545) INSERT
31671 be/4 postgres 4.58 M 24.45 M 0.00 % 12.76 % postgres: user production 192.168.0.3(34841) idle in transaction
32551 be/4 postgres 3.26 M 31.77 M 0.00 % 12.06 % postgres: user production 192.168.0.3(58864) idle in transaction
32574 be/4 postgres 5.32 M 32.92 M 0.00 % 11.70 % postgres: user production 192.168.0.3(35560) idle
32572 be/4 postgres 3.00 M 32.66 M 0.00 % 11.66 % postgres: user production 192.168.0.3(35558) UPDATE
32560 be/4 postgres 5.12 M 25.89 M 0.00 % 11.52 % postgres: user production 192.168.0.3(33886) SELECT
32567 be/4 postgres 4.66 M 36.47 M 0.00 % 11.44 % postgres: user production 192.168.0.3(35534) SELECT
32571 be/4 postgres 2.86 M 31.27 M 0.00 % 11.31 % postgres: user production 192.168.0.3(35557) SELECT
32552 be/4 postgres 4.38 M 28.75 M 0.00 % 11.09 % postgres: user production 192.168.0.3(58865) idle in transaction
32554 be/4 postgres 3.69 M 30.21 M 0.00 % 10.90 % postgres: user production 192.168.0.3(58870) UPDATE
339 be/3 root 0.00 B 2.29 M 0.00 % 9.81 % [jbd2/md2-8]
32576 be/4 postgres 3.37 M 19.91 M 0.00 % 9.73 % postgres: user production 192.168.0.3(35562) idle
32555 be/4 postgres 3.09 M 31.96 M 0.00 % 9.02 % postgres: user production 192.168.0.3(58875) SELECT
27548 be/4 postgres 0.00 B 97.12 M 0.00 % 7.41 % postgres: writer process
31445 be/4 postgres 924.00 K 27.35 M 0.00 % 7.11 % postgres: user production 192.168.0.1(34536) idle
31443 be/4 postgres 2.54 M 4.56 M 0.00 % 6.32 % postgres: user production 192.168.0.1(34508) idle
31459 be/4 postgres 1480.00 K 21.36 M 0.00 % 5.63 % postgres: user production 192.168.0.1(34543) idle
1801 be/4 postgres 1896.00 K 10.89 M 0.00 % 5.57 % postgres: user production 192.168.0.3(34177) idle
32763 be/4 postgres 1696.00 K 6.95 M 0.00 % 5.33 % postgres: user production 192.168.0.3(57984) SELECT
1800 be/4 postgres 2.46 M 5.13 M 0.00 % 5.24 % postgres: user production 192.168.0.3(34175) SELECT
1803 be/4 postgres 1816.00 K 9.09 M 0.00 % 5.16 % postgres: user production 192.168.0.3(34206) idle
32578 be/4 postgres 2.57 M 11.62 M 0.00 % 5.06 % postgres: user production 192.168.0.3(35564) SELECT
31440 be/4 postgres 3.02 M 4.04 M 0.00 % 4.65 % postgres: user production 192.168.0.1(34463) idle
32605 be/4 postgres 1844.00 K 11.82 M 0.00 % 4.49 % postgres: user production 192.168.0.3(40399) idle
27547 be/4 postgres 0.00 B 0.00 B 0.00 % 3.93 % postgres: checkpointer process
31356 be/4 postgres 1368.00 K 3.27 M 0.00 % 3.93 % postgres: user production 192.168.0.1(34450) idle
32542 be/4 postgres 1180.00 K 6.05 M 0.00 % 3.90 % postgres: user production 192.168.0.3(56859) idle
32523 be/4 postgres 1088.00 K 4.33 M 0.00 % 3.59 % postgres: user production 192.168.0.3(48164) idle
32606 be/4 postgres 1964.00 K 6.94 M 0.00 % 3.51 % postgres: user production 192.168.0.3(40426) SELECT
31466 be/4 postgres 1596.00 K 3.11 M 0.00 % 3.47 % postgres: user production 192.168.0.1(34550) idle
32544 be/4 postgres 1184.00 K 4.25 M 0.00 % 3.38 % postgres: user production 192.168.0.3(56861) idle
31458 be/4 postgres 1088.00 K 1528.00 K 0.00 % 3.33 % postgres: user production 192.168.0.1(34541) idle
31444 be/4 postgres 884.00 K 4.23 M 0.00 % 3.27 % postgres: user production 192.168.0.1(34510) idle
32522 be/4 postgres 408.00 K 2.98 M 0.00 % 3.27 % postgres: user production 192.168.0.5(38361) idle
32762 be/4 postgres 1156.00 K 5.28 M 0.00 % 3.20 % postgres: user production 192.168.0.3(57962) idle
32582 be/4 postgres 1084.00 K 3.38 M 0.00 % 2.86 % postgres: user production 192.168.0.5(43104) idle
31353 be/4 postgres 2.04 M 3.02 M 0.00 % 2.82 % postgres: user production 192.168.0.1(34444) idle
31441 be/4 postgres 700.00 K 2.68 M 0.00 % 2.64 % postgres: user production 192.168.0.1(34465) idle
31462 be/4 postgres 980.00 K 3.50 M 0.00 % 2.57 % postgres: user production 192.168.0.1(34547) idle
32709 be/4 postgres 428.00 K 3.23 M 0.00 % 2.56 % postgres: user production 192.168.0.5(34323) idle
685 be/4 postgres 748.00 K 3.59 M 0.00 % 2.41 % postgres: user production 192.168.0.3(34911) idle
683 be/4 postgres 728.00 K 3.19 M 0.00 % 2.38 % postgres: user production 192.168.0.3(34868) idle
32765 be/4 postgres 464.00 K 3.76 M 0.00 % 2.21 % postgres: user production 192.168.0.3(58074) idle
32760 be/4 postgres 808.00 K 6.18 M 0.00 % 2.16 % postgres: user production 192.168.0.3(57958) idle
1912 be/4 postgres 372.00 K 3.03 M 0.00 % 2.16 % postgres: user production 192.168.0.5(33743) idle
31446 be/4 postgres 1004.00 K 2.09 M 0.00 % 2.16 % postgres: user production 192.168.0.1(34539) idle
31460 be/4 postgres 584.00 K 2.74 M 0.00 % 2.10 % postgres: user production 192.168.0.1(34545) idle

5) vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 573424 321080 27124 28504352 0 0 143 618 0 4 2 0 91 7
0 1 573424 320764 27124 28504496 0 0 104 15654 3788 4961 1 0 85 14
0 1 573424 320684 27124 28504616 0 0 276 12736 4099 5374 0 1 84 15
0 1 573424 319672 27124 28504900 0 0 80 7746 3624 4949 2 0 82 16
0 1 573424 319180 27124 28504972 0 0 36 12489 3653 4761 2 0 86 12
0 1 573424 318184 27132 28505000 0 0 8 10482 3413 4898 0 0 87 13
0 1 573424 318424 27132 28505016 0 0 0 9564 2746 4290 0 0 87 13
0 1 573424 318308 27168 28505016 0 0 36 10562 1895 2149 0 0 87 12
0 3 573424 318208 27168 28505020 0 0 84 18529 3035 3265 1 0 85 14
0 1 573424 318732 27176 28505080 0 0 84 14574 2986 3231 0 0 84 16
0 2 573424 317588 27176 28505184 0 0 4 6681 1991 2207 2 1 86 12
0 1 573424 316852 27176 28505260 0 0 76 7670 2910 3996 2 1 85 13
0 1 573424 316632 27184 28505256 0 0 0 7186 2661 3740 0 0 87 12
0 1 573424 316720 27188 28505260 0 0 0 2590 1731 2474 0 0 88 12
0 1 573424 314252 27192 28505696 0 0 460 11612 1757 2431 0 0 82 18
0 2 573424 313504 27192 28505724 0 0 0 19656 1775 2099 0 0 83 17
0 3 573424 313300 27196 28505780 0 0 188 6237 2746 3193 2 0 80 17
0 2 573424 312736 27200 28506348 0 0 804 18466 5014 6430 2 1 75 23
2 35 573424 307564 27200 28509920 0 0 3912 16280 14377 15470 14 3 28 56
0 5 573424 282848 27208 28533964 0 0 7484 27580 22017 25938 17 3 17 63
1 5 573424 221100 27208 28563360 0 0 2852 3120 19639 28664 12 5 52 31
0 4 573428 229912 26704 28519184 0 4 1208 5890 13976 20851 13 3 56 28
0 2 573448 234680 26672 28513632 0 20 0 17204 1694 2636 0 0 71 28
3 7 573452 220836 26644 28525548 0 4 1540 36370 27928 36551 17 5 50 27
1 3 573488 234380 26556 28517416 0 36 584 19066 8275 9467 3 2 60 36
0 1 573488 234496 26556 28517852 0 0 56 47429 3290 4310 0 0 79 20

6) sudo lsof - a hell of a lot of output, I can post it if anyone is interested :-)

#### Notes and thoughts ##############################################################################

As you can see, even though I have moved the pg_xlog folder to the SSD array (md3), by far the largest amount of writes still goes to the regular HDDs (md2), which puzzles me - what can that be?
From stat 3) (the iostat) I notice that the SSDs don't seem to be anywhere near fully utilized - maybe something other than just pg_xlog could be moved here?
I have no idea whether the amount of reads/writes is within what my kind of hardware should be able to handle, or far beyond it.
In stat 4) (the iotop) it says that the RAID array (md2) is the most "waiting" part - does that look like a root cause, or more like a symptom of some other bottleneck?
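Regarding the md2 writes, one way to break down who is doing the flushing (a sketch against the standard statistics view) is:

psql -d production -c "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_clean, buffers_backend FROM pg_stat_bgwriter;"

If buffers_backend dominates, the backends themselves are forced to write dirty pages out, rather than the checkpointer or background writer doing it.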

Thanks for taking the time to look at my data! :-)
Evgeny Shishkin
2012-12-10 23:00:51 UTC
Permalink
Post by Niels Kristian Schjødt
The pg_xlog folder has been moved onto the SSD array (md3), and symlinked back into the postgres dir.
Actually, you should move xlog to the rotating drives, since WAL is written sequentially, and move everything else to the SSDs, because of their random I/O pattern.
Jeff Janes
2012-12-10 23:58:29 UTC
Permalink
On Mon, Dec 10, 2012 at 2:51 PM, Niels Kristian Schjødt
Post by Niels Kristian Schjødt
synchronous_commit = off
The pg_xlog folder has been moved onto the SSD array (md3), and symlinked
back into the postgres dir.
With synchronous_commit = off, or with large transactions, there is
probably no advantage to moving those to SSD.
Post by Niels Kristian Schjødt
2)
When the database is loaded like this, I see a lot of queries talking up to
1000 times as long, as they would when the database is not loaded so
heavily.
What kinds of queries are they? single-row look-ups, full table scans, etc.
Post by Niels Kristian Schjødt
#### Notes and thoughts
##############################################################################
As you can see, even though I have moved the pg_xlog folder to the SSD array
(md3) the by far largest amount of writes still goes to the regular HDD's
(md2), which puzzles me - what can that be?
Every row you insert or non-HOT update has to do maintenance on all
indexes of that table. If the rows are not inserted/updated in index
order, every row inserted or updated dirties a randomly
scattered 8KB block for each of the indexes. If you have lots of indexes
per table, that adds up fast.
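A quick way to see how many of those updates are HOT, and therefore skip index maintenance entirely (a sketch using the standard statistics views):

psql -d production -c "SELECT relname, n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables ORDER BY n_tup_upd DESC LIMIT 10;"

Tables where n_tup_hot_upd is only a small fraction of n_tup_upd are the ones generating most of that scattered index I/O.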

The fact that there is much more writing than reading tells me that
most of your indexes are in RAM. The amount of index you are rapidly
reading and dirtying is large enough to fit in RAM, but is not large
enough to fit in shared_buffers + kernel's dirty-buffer comfort level.
So you are redirtying the same blocks over and over, PG is
desperately dumping them to the kernel (because shared_buffers is too
small to hold them) and the kernel is desperately dumping them to
disk, because vm.dirty_background_ratio is so low. There is little
opportunity for write-combining, because they don't sit in memory long
enough to accumulate neighbors.

How big are your indexes?

You could really crank up shared_buffers or vm.dirty_background_ratio,
but doing so might cause problems with checkpoints stalling and
latency spikes. That would probably not be a problem during the
night, but could be during the day.

Rather than moving maintenance to the day and hoping it doesn't
interfere with normal operations, I'd focus on making night-time
maintenance more efficient, for example by dropping indexes (either
just at night, or if some indexes are not useful, just get rid of them
altogether), or cranking up shared_buffers at night, or maybe
partitioning, or looking into pg_bulkload.
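For the drop-at-night variant, a minimal sketch (the index and table names here are made up for illustration):

psql -d production -c "DROP INDEX IF EXISTS cars_updated_at_idx;"
# ... run the nightly maintenance ...
psql -d production -c "CREATE INDEX CONCURRENTLY cars_updated_at_idx ON cars (updated_at);"

CREATE INDEX CONCURRENTLY avoids blocking writes while it rebuilds, at the cost of taking longer than a plain CREATE INDEX.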
Post by Niels Kristian Schjødt
From stat 3) (the iostat) I notice that the SSD's doesn't seem to be
something near fully utilized - maybe something else than just pg_xlog could
be moved her?
I don't know how big each disk is, or how big your various categories
of data are. Could you move everything to SSD? Could you move all
your actively updated indexes there?
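Moving individual indexes (or tables) can be done with a tablespace on the SSD array - a sketch, assuming a /ssd/pg_indexes directory owned by postgres and a made-up index name:

psql -d production -c "CREATE TABLESPACE ssd_space LOCATION '/ssd/pg_indexes';"
psql -d production -c "ALTER INDEX cars_updated_at_idx SET TABLESPACE ssd_space;"

ALTER ... SET TABLESPACE physically rewrites the index on the new storage and locks it while doing so, so it is best done off-peak.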

Or, more fundamentally, it looks like you spent too much on CPUs (86%
idle) and not nearly enough on disks. Maybe you can fix that for less
money than it will cost you in your optimization time to make the best
of the disks you already have.

Cheers,

Jeff
Niels Kristian Schjødt
2012-12-11 10:04:46 UTC
Permalink
Post by Jeff Janes
On Mon, Dec 10, 2012 at 2:51 PM, Niels Kristian Schjødt
Post by Niels Kristian Schjødt
synchronous_commit = off
The pg_xlog folder has been moved onto the SSD array (md3), and symlinked
back into the postgres dir.
With synchronous_commit = off, or with large transactions, there is
probably no advantage to moving those to SSD.
Post by Niels Kristian Schjødt
2)
When the database is loaded like this, I see a lot of queries talking up to
1000 times as long, as they would when the database is not loaded so
heavily.
What kinds of queries are they? single-row look-ups, full table scans, etc.
Well, mostly they are updates, like the one shown in the previous question I referenced.
Post by Jeff Janes
Post by Niels Kristian Schjødt
#### Notes and thoughts
##############################################################################
As you can see, even though I have moved the pg_xlog folder to the SSD array
(md3) the by far largest amount of writes still goes to the regular HDD's
(md2), which puzzles me - what can that be?
Every row you insert or non-HOT update has to do maintenance on all
indexes of that table. If the rows are not inserted/updated in index
order, every row inserted or updated dirties a randomly
scattered 8KB block for each of the indexes. If you have lots of indexes
per table, that adds up fast.
The fact that there is much more writing than reading tells me that
most of your indexes are in RAM. The amount of index you are rapidly
reading and dirtying is large enough to fit in RAM, but is not large
enough to fit in shared_buffers + kernel's dirty-buffer comfort level.
Maybe I should mention that I never see more than 5GB at most out of my total 32GB being in use on the server… Can I somehow utilize more of it?
Post by Jeff Janes
So you are redirtying the same blocks over and over, PG is
desperately dumping them to the kernel (because shared_buffers is too
small to hold them) and the kernel is desperately dumping them to
disk, because vm.dirty_background_ratio is so low. There is little
opportunity for write-combining, because they don't sit in memory long
enough to accumulate neighbors.
How big are your indexes?
This is a size list of all my indexes: 117 MB, 118 MB, 11 MB, 12 MB, 12 MB, 12 MB, 12 MB, 140 MB, 15 MB, 15 MB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 kB, 16 MB, 16 MB, 176 kB, 176 kB, 17 MB, 18 MB, 19 MB, 23 MB, 240 kB, 24 MB, 256 kB, 25 MB, 25 MB, 26 MB, 26 MB, 27 MB, 27 MB, 27 MB, 27 MB, 280 MB, 2832 kB, 2840 kB, 288 kB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 28 MB, 29 MB, 29 MB, 3152 kB, 3280 kB, 32 kB, 32 MB, 32 MB, 3352 kB, 3456 kB, 34 MB, 36 MB, 3744 kB, 3776 kB, 37 MB, 37 MB, 3952 kB, 400 kB, 408 kB, 40 kB, 40 kB, 40 kB, 416 kB, 416 kB, 42 MB, 432 kB, 4520 kB, 4720 kB, 47 MB, 48 kB, 496 kB, 49 MB, 512 kB, 52 MB, 52 MB, 5304 kB, 5928 kB, 6088 kB, 61 MB, 6224 kB, 62 MB, 6488 kB, 64 kB, 6512 kB, 71 MB, 72 kB, 72 kB, 8192 bytes, 8400 kB, 88 MB, 95 MB, 98 MB
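For reference, a list like that can be produced with something along these lines (a sketch using the catalog views):

psql -d production -c "SELECT c.relname, pg_size_pretty(pg_relation_size(c.oid)) FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE c.relkind = 'i' AND n.nspname NOT IN ('pg_catalog', 'pg_toast') ORDER BY pg_relation_size(c.oid) DESC;"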
Post by Jeff Janes
You could really crank up shared_buffers or vm.dirty_background_ratio,
but doing so might cause problems with checkpoints stalling and
latency spikes. That would probably not be a problem during the
night, but could be during the day.
What do you have in mind here? Tweaking what parameters to what values?
Post by Jeff Janes
.
Rather than moving maintenance to the day and hoping it doesn't
interfere with normal operations, I'd focus on making night-time
maintenance more efficient, for example by dropping indexes (either
just at night, or if some indexes are not useful, just get rid of them
altogether), or cranking up shared_buffers at night, or maybe
partitioning or look into pg_bulkload.
Post by Niels Kristian Schjødt
From stat 3) (the iostat) I notice that the SSD's doesn't seem to be
something near fully utilized - maybe something else than just pg_xlog could
be moved her?
I don't know how big each disk is, or how big your various categories
of data are. Could you move everything to SSD? Could you move all
your actively updated indexes there?
With tablespaces, you mean?
Post by Jeff Janes
Or, more fundamentally, it looks like you spent too much on CPUs (86%
idle) and not nearly enough on disks. Maybe you can fix that for less
money than it will cost you in your optimization time to make the best
of the disks you already have.
The SSDs I use are 240GB each, which will grow too small within a few months - so how does moving the whole data dir onto four of those in a RAID5 array sound?
Craig Ringer
2012-12-11 13:29:00 UTC
Permalink
Post by Niels Kristian Schjødt
Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
For an update-mostly workload it probably won't do you tons of good so
long as all your indexes fit in RAM. You're clearly severely
bottlenecked on disk I/O not RAM.
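As for the 5GB figure: it is worth checking how that is measured. With free -m, memory used by the Linux page cache shows up under "cached" rather than as process memory, so the box may well be using most of the 32GB for caching already. A quick check (a sketch):

free -m
# "cached" is the page cache; the "-/+ buffers/cache" line shows what applications actually hold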
Post by Niels Kristian Schjødt
The SSD's I use a are 240Gb each which will grow too small within a
few months - so - how does moving the whole data dir onto four of
those in a RAID5 array sound?
Not RAID 5!

Use a RAID10 of four or six SSDs.
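If you do stay on software RAID as you have now, that would be something along these lines (a sketch with hypothetical device names):

sudo mdadm --create /dev/md4 --level=10 --raid-devices=4 /dev/sde /dev/sdf /dev/sdg /dev/sdh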
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Niels Kristian Schjødt
2012-12-11 13:35:09 UTC
Permalink
Post by Craig Ringer
Post by Niels Kristian Schjødt
Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
For an update-mostly workload it probably won't do you tons of good so
long as all your indexes fit in RAM. You're clearly severely
bottlenecked on disk I/O not RAM.
Post by Niels Kristian Schjødt
The SSD's I use a are 240Gb each which will grow too small within a
few months - so - how does moving the whole data dir onto four of
those in a RAID5 array sound?
Not RAID 5!
Use a RAID10 of four or six SSDs.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hehe, got it - did you have a look at the SSDs I am considering building it from? http://ark.intel.com/products/66250/Intel-SSD-520-Series-240GB-2_5in-SATA-6Gbs-25nm-MLC
Are they suitable, do you think?
Evgeny Shishkin
2012-12-11 17:15:59 UTC
Permalink
Post by Niels Kristian Schjødt
Post by Craig Ringer
Post by Niels Kristian Schjødt
Maybe I should mention, that I never see more than max 5Gb out of my total 32Gb being in use on the server… Can I somehow utilize more of it?
For an update-mostly workload it probably won't do you tons of good so
long as all your indexes fit in RAM. You're clearly severely
bottlenecked on disk I/O not RAM.
Post by Niels Kristian Schjødt
The SSD's I use a are 240Gb each which will grow too small within a
few months - so - how does moving the whole data dir onto four of
those in a RAID5 array sound?
Not RAID 5!
Use a RAID10 of four or six SSDs.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hehe got it - did you have a look at the SSD's I am considering building it of? http://ark.intel.com/products/66250/Intel-SSD-520-Series-240GB-2_5in-SATA-6Gbs-25nm-MLC
Are they suitable do you think?
I am not Craig, but I use them in production in a RAID10 array now.
Niels Kristian Schjødt
2012-12-11 18:54:28 UTC
Permalink
And what is your experience so far?
Evgeny Shishkin
2012-12-11 19:11:14 UTC
Permalink
Post by Niels Kristian Schjødt
And what is your experience so far?
Increased TPS by a factor of 10; the database is no longer a limiting factor for the application.
And it is cheaper than brand-name rotating drives.
Niels Kristian Schjødt
2012-12-11 22:41:37 UTC
Permalink
Are you using a hardware-based RAID controller with them?
Evgeny Shishkin
2012-12-11 22:44:24 UTC
Permalink
Post by Niels Kristian Schjødt
Are you using a hardware based raid controller with them?
Yes, of course. Hardware RAID with cache and BBU is a must. You can't get fast fsync without it.
Also, mdadm is a pain in the ass and is suitable only on Amazon and other cloud setups.
Craig Ringer
2012-12-12 01:03:14 UTC
Permalink
On Dec 12, 2012, at 2:41 AM, Niels Kristian Schjødt
Post by Niels Kristian Schjødt
Are you using a hardware based raid controller with them?
Yes, of course. Hardware raid with cache and bbu is a must. You can't
get fast fsync without it.
Most SSDs should offer fairly fast fsync without a hardware RAID
controller, as they do write-back caching. The trick is to find ones
that do write-back caching safely, so you don't get severe data
corruption on power-loss.

A HW RAID controller is an absolute must for rotating magnetic media,
though.
Also mdadm is a pain in the ass and is suitable only on amazon and other cloud shit.
I've personally been pretty happy with mdadm. I find the array
portability it offers very useful, so I don't need to buy a second RAID
controller just in case my main controller dies and I need a compatible
one to get the array running again. If you don't need a BBU for safe
write-back caching then mdadm has advantages over hardware RAID.

I'll certainly use mdadm over onboard fakeraid solutions or low-end
hardware RAID controllers. I suspect a mid- to high end HW RAID unit
will generally win.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Evgeny Shishkin
2012-12-12 01:17:14 UTC
Permalink
Post by Evgeny Shishkin
Post by Niels Kristian Schjødt
Are you using a hardware based raid controller with them?
Yes, of course. Hardware raid with cache and bbu is a must. You can't get fast fsync without it.
Most SSDs should offer fairly fast fsync without a hardware RAID controller, as they do write-back caching. The trick is to find ones that do write-back caching safely, so you don't get severe data corruption on power-loss.
Actually most low-end SSDs don't do write caching; they do not have enough RAM for that. SandForce, for example.
A HW RAID controller is an absolute must for rotating magnetic media, though.
Post by Evgeny Shishkin
Also mdadm is a pain in the ass and is suitable only on amazon and other cloud shit.
I've personally been pretty happy with mdadm. I find the array portability it offers very useful, so I don't need to buy a second RAID controller just in case my main controller dies and I need a compatible one to get the array running again. If you don't need a BBU for safe write-back caching then mdadm has advantages over hardware RAID.
If we are talking about a dedicated database machine with SSD drives, why wouldn't anybody buy a hardware RAID for about $500-700?
I'll certainly use mdadm over onboard fakeraid solutions or low-end hardware RAID controllers. I suspect a mid- to high end HW RAID unit will generally win.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer
2012-12-12 01:29:02 UTC
Permalink
Post by Evgeny Shishkin
Actually most of low-end SSDs don't do write caching, they do not have
enough ram for that. Sandforce for example.
Or, worse, some of them do limited write caching but don't protect their
write cache from power loss. Instant data corruption!

I would be extremely reluctant to use low-end SSDs for a database server.
Post by Evgeny Shishkin
If we are talking about dedicated machine for database with ssd
drives, why would anybody don't by hardware raid for about 500-700$?
I'd want to consider whether the same money is better spent on faster,
higher quality SSDs with their own fast write caches.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Evgeny Shishkin
2012-12-12 01:44:24 UTC
Permalink
Post by Evgeny Shishkin
Actually most of low-end SSDs don't do write caching, they do not have enough ram for that. Sandforce for example.
Or, worse, some of them do limited write caching but don't protect their write cache from power loss. Instant data corruption!
I would be extremely reluctant to use low-end SSDs for a database server.
Post by Evgeny Shishkin
If we are talking about dedicated machine for database with ssd drives, why would anybody don't by hardware raid for about 500-700$?
I'd want to consider whether the same money is better spent on faster, higher quality SSDs with their own fast write caches.
High-quality SSDs cost $5-7 per GB, consumer-grade around $1, high-end around $11.
The new Intel DC S3700 is $2-3 per GB as far as I remember.

So far, for more than a year already, I have been buying consumer SSDs with a $300-400 HW RAID. Cost-effective and fast, maybe not very safe, but so far so good. All data protection measures in PostgreSQL are on, of course.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer
2012-12-12 02:02:00 UTC
Permalink
Post by Evgeny Shishkin
So far, more than a year already, i bought consumer ssds with 300-400$
hw raid. Cost effective and fast, may be not very safe, but so far so
good. All data protection measures from postgresql are on, of course.
You're aware that many low end SSDs lie to the RAID controller about
having written data, right? Even if the RAID controller sends a flush
command, the SSD might cache the write in non-durable cache. If you're
using such SSDs and you lose power, data corruption is extremely likely,
because your SSDs are essentially ignoring fsync.

Your RAID controller's BBU won't save you, because once the disks tell
the RAID controller the data has hit durable storage, the RAID
controller feels free to flush it from its battery backed cache. If the
disks are lying...

The only solid way to find out whether this is an issue with your SSDs is
to do plug-pull testing.
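A cheap smoke test before the real plug-pull (a sketch - pg_test_fsync ships in the postgresql contrib package; the binary path and the target directory are assumptions):

/usr/lib/postgresql/9.2/bin/pg_test_fsync -f /ssd/fsync_test.out
# tens of thousands of fsyncs/second from a drive without a protected write cache
# almost certainly means it is acknowledging writes it has not made durable

That is only a hint, though; the plug-pull test remains the definitive check.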
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Evgeny Shishkin
2012-12-12 02:13:22 UTC
Permalink
Post by Evgeny Shishkin
So far, more than a year already, i bought consumer ssds with 300-400$ hw raid. Cost effective and fast, may be not very safe, but so far so good. All data protection measures from postgresql are on, of course.
You're aware that many low end SSDs lie to the RAID controller about having written data, right? Even if the RAID controller sends a flush command, the SSD might cache the write in non-durable cache. If you're using such SSDs and you lose power, data corruption is extremely likely, because your SSDs are essentially ignoring fsync.
Your RAID controller's BBU won't save you, because once the disks tell the RAID controller the data has hit durable storage, the RAID controller feels free to flush it from its battery backed cache. If the disks are lying...
The only solid way to find out if this is an issue with your SSDs is to do plug-pull testing and find out.
Yes, I am aware of this issue. I have never experienced it, neither on the Intel 520 nor the OCZ Vertex 3.
Have you heard of such failures on this list?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Boreham
2012-12-12 02:15:33 UTC
Permalink
Post by Evgeny Shishkin
Yes, i am aware of this issue. Never experienced this neither on intel
520, no ocz vertex 3.
Have you heard of them on this list?
People have done plug-pull tests and reported the results on the list
(sometime in the past couple of years).

But you don't need to do the test to know these drives are not safe.
They're unsafe by design.
Evgeny Shishkin
2012-12-12 02:20:29 UTC
Permalink
Post by Evgeny Shishkin
Yes, i am aware of this issue. Never experienced this neither on intel 520, no ocz vertex 3.
Have you heard of them on this list?
People have done plug-pull tests and reported the results on the list (sometime in the past couple of years).
But you don't need to do the test to know these drives are not safe. They're unsafe by design.
Oh, there is no 100% safe system. The only way to be sure is to read data back.
Everything about system design is a tradeoff between cost and risk.
David Boreham
2012-12-12 02:26:45 UTC
Permalink
Post by Evgeny Shishkin
Oh, there is no 100% safe system.
In this case we're discussing specifically "safety in the event of power
loss shortly after the drive indicates to the controller that it has
committed a write operation". Some drives do provide 100% safety against
this event, and they don't cost much more than those that don't.
Evgeny Shishkin
2012-12-12 02:38:30 UTC
Permalink
Post by Evgeny Shishkin
Oh, there is no 100% safe system.
In this case we're discussing specifically "safety in the event of power loss shortly after the drive indicates to the controller that it has committed a write operation". Some drives do provide 100% safety against this event, and they don't cost much more than those that don't.
Which drives would you recommend, besides the Intel 320 and 710?
David Boreham
2012-12-12 02:41:07 UTC
Permalink
Post by Evgeny Shishkin
Which drives would you recommend? Besides intel 320 and 710.
Those are the only drive types we have deployed in servers at present
(almost all 710, but we have some 320 for less mission-critical
machines). The new DC-S3700 Series looks nice too, but isn't yet in the
sales channel:
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html
Evgeny Shishkin
2012-12-12 02:49:12 UTC
Permalink
Post by David Boreham
Post by Evgeny Shishkin
Which drives would you recommend? Besides intel 320 and 710.
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html
Yeah, the S3700 looks promising, but the SATA interface is a limiting factor for this drive.
I'm looking towards the SMART SSD: http://www.storagereview.com/smart_storage_systems_optimus_sas_enterprise_ssd_review

but I haven't heard of it anywhere else.
David Boreham
2012-12-12 03:05:13 UTC
Permalink
Post by Evgeny Shishkin
Yeah, s3700 looks promising, but sata interface is limiting factor for this drive.
I'm looking towards SMART ssd
http://www.storagereview.com/smart_storage_systems_optimus_sas_enterprise_ssd_review
What don't you like about SATA?

I prefer to avoid SAS drives if possible due to the price premium for
dubious benefits besides vague hand-waving "enterprise-ness" promises.
Evgeny Shishkin
2012-12-12 03:11:43 UTC
Permalink
Post by David Boreham
Post by Evgeny Shishkin
Yeah, s3700 looks promising, but sata interface is limiting factor for this drive.
I'm looking towards SMART ssd http://www.storagereview.com/smart_storage_systems_optimus_sas_enterprise_ssd_review
What don't you like about SATA ?
I prefer to avoid SAS drives if possible due to the price premium for dubious benefits besides vague hand-waving "enterprise-ness" promises.
Quoting http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review

Intel makes the case that the S3700 is the ideal drive for entry, mainstream and performance enterprise computing including HPC use cases. The claim is bold, largely because of the decision to go with a SATA interface, which has several limitations in the enterprise. The SATA interface tops out at a queue depth 32 (SAS scales as high as 256 in most cases) which means that when requests go above that level average and peak latency spike as we saw in all of our workloads.

Another huge advantage of SAS is the ability to offer dual-port modes for high availability scenarios, where there are two controllers interfacing with the same drive at the same time. In the event one goes offline, the connection with the SSD is not lost, as it would with a standard SATA interface without additional hardware. Some SAS drives also offer wide-port configurations used to increase total bandwidth above a single-link connection. While the Intel SSD DC S3700 against other SATA competitors is very fast, the story changes when you introduce the latest MLC and SLC-based SAS SSDs, which can cope better with increased thread and queue levels.

We picked the primary post-preconditioning sections of our benchmarks after each SSD had reached steady-state. For the purposes of this section, we added the Intel SSD DC S3700 onto the throughput charts of the newest SAS high-performance SSDs. There are also significant latency differences at higher queue depths that play a significant factor, but for the sake of easy comparison we stick with raw I/O speed across varying thread and queue counts.

In a 100% 4K random write or random read scenario, the Intel SSD DC 3700 performs quite well up against the high-end SAS competition, with the second fastest 4K steady-state speed. When you switch focus to read throughput at a heavy 16T/16Q load it only offers 1/2 to 1/3 the performance of SSDs in this category.

David Boreham
2012-12-12 03:19:35 UTC
Permalink
Post by Evgeny Shishkin
Quoting
http://www.storagereview.com/intel_ssd_dc_s3700_series_enterprise_ssd_review
Heh. A fine example of the kind of hand-waving of which I spoke ;)

Higher performance is certainly a benefit, although at present we can't
saturate even a single 710 series drive (the application, CPU, OS, etc
is the bottleneck). Similarly while dual-porting certainly has its uses,
it is not something I need.
Mark Kirkwood
2012-12-12 03:07:31 UTC
Permalink
Post by David Boreham
Post by Evgeny Shishkin
Which drives would you recommend? Besides intel 320 and 710.
Those are the only drive types we have deployed in servers at present
(almost all 710, but we have some 320 for less mission-critical
machines). The new DC-S3700 Series looks nice too, but isn't yet in
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3700-series.html
+1

The s3700 is probably the one to get (when it is available). I'd opt for
the 710 if you need something now. I'd avoid the 320 - we have
encountered the firmware bug whereby you get an 8MB (yes 8MB) capacity
after powerdown with a depressingly large number of them (they were
updated to the latest firmware too).

Regards

Mark
Craig Ringer
2012-12-12 02:47:51 UTC
Permalink
Post by Evgeny Shishkin
Yes, I am aware of this issue. I've never experienced it on either the
Intel 520 or the OCZ Vertex 3.
I wouldn't trust either of those drives. The 520 doesn't have Intel's "
Enhanced Power Loss Data Protection"; it's going to lose its buffers if
it loses power. Similarly, the Vertex 3 doesn't have any kind of power
protection. See:

http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html
http://ark.intel.com/products/family/56572/Intel-SSD-500-Family

http://www.ocztechnology.com/res/manuals/OCZ_SSD_Breakdown_Q2-11_1.pdf

The only way I'd use those for a production server would be if I had
synchronous replication running to another machine with trustworthy,
durable storage - and if I didn't mind some downtime to restore the
corrupt DB from the replica after power loss.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Niels Kristian Schjødt
2012-12-12 16:22:10 UTC
Permalink
Well, in fact I do (as you can see from my configuration). I have a similar server running with hot standby replication, and it runs two 3 TB HDDs in a RAID1 array.

So - is it still very bad if I choose to put four intel 520 disks in a RAID10 array on the other production server?
Post by Craig Ringer
Post by Evgeny Shishkin
Yes, I am aware of this issue. I've never experienced it on either the Intel 520 or the OCZ Vertex 3.
http://www.intel.com/content/www/us/en/solid-state-drives/ssd-320-series-power-loss-data-protection-brief.html
http://ark.intel.com/products/family/56572/Intel-SSD-500-Family
http://www.ocztechnology.com/res/manuals/OCZ_SSD_Breakdown_Q2-11_1.pdf
The only way I'd use those for a production server would be if I had synchronous replication running to another machine with trustworthy, durable storage - and if I didn't mind some downtime to restore the corrupt DB from the replica after power loss.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer
2012-12-12 23:26:36 UTC
Permalink
Post by Niels Kristian Schjødt
Well, in fact I do (as you can see from my configuration). I have a
similar server running with hot standby replication, and it runs two
3 TB HDDs in a RAID1 array.
So - is it still very bad if I choose to put four intel 520 disks in a
RAID10 array on the other production server?
So long as you have it recording to a synchronous replica on another
machine and you're fully prepared to accept the small risk that you'll
have total and unrecoverable data corruption on that server, with the
corresponding downtime while you rebuild it from the replica, it should
be OK.

Alternatively, you could use PITR with a base backup, shipping WAL to another
machine or a reliable HDD, so you can recover all but the last
checkpoint_timeout minutes of data from the base backup + WAL. There's a
small window of data loss that way, but you don't need a second machine
as a streaming replication follower. barman is worth checking out
as a management tool for PITR backups.
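A minimal sketch of the moving parts, with hypothetical host names and
paths (and untested as written - do a trial restore before relying on it):

# postgresql.conf on the primary: ship each completed WAL segment off-box
archive_mode = on
archive_command = 'rsync -a %p backuphost:/backups/wal/%f'

# shell, as the postgres user: take a periodic base backup
pg_basebackup -D /backups/base/$(date +%F) -Fp -X fetch

# recovery.conf on the machine you restore onto: replay WAL from the archive
restore_command = 'cp /backups/wal/%f %p'

barman automates essentially this cycle (base backups, WAL archiving,
retention).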

If the data is fairly low-value you could even just take nightly backups
and accept the risk of losing some data.

--
Craig Ringer
Rosser Schwarz
2012-12-12 01:47:28 UTC
Permalink
Post by Evgeny Shishkin
Actually most low-end SSDs don't do write caching; they do not have
enough RAM for that.
AIUI, *all* SSDs do write-caching of a sort: writes are actually flushed to
the NAND media by erasing, and then overwriting the erased space, and
erasing is done in fixed-size blocks, usually much larger than a
filesystem's pages. The drive's controller accumulates writes in an
on-board cache until it has an "erase block"'s worth of them, which are
then flushed. From casual searching, a common erase block size is 256
kbytes, while filesystem-level pages are usually 4k.
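(As a rough worked example under those assumptions: 256 kB / 4 kB = 64
filesystem pages per erase block, so rewriting a single dirty 4k page in
place can mean relocating up to 64 pages' worth of data, which is exactly
why the controller accumulates writes until it can fill a whole erase
block.)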

Most low-end (and even many mid-range) SSDs, including Sandforce-based
drives, don't offer any form of protection (e.g., supercaps, as featured on
the Intel 320 and 710-series drives) for the data in that write cache,
however, which may be what you're thinking of. I wouldn't let one of those
anywhere near one of my servers, unless it was a completely disposable,
load-balanced slave, and probably not even then.

rls
--
:wq
Scott Marlowe
2012-12-15 00:22:14 UTC
Permalink
On Dec 12, 2012, at 2:41 AM, Niels Kristian Schjødt
Are you using a hardware based raid controller with them?
Yes, of course. Hardware raid with cache and bbu is a must. You can't get
fast fsync without it.
Most SSDs should offer fairly fast fsync without a hardware RAID controller,
as they do write-back caching. The trick is to find ones that do write-back
caching safely, so you don't get severe data corruption on power-loss.
A HW RAID controller is an absolute must for rotating magnetic media,
though.
Also mdadm is a pain in the ass and is suitable only on amazon and other cloud shit.
I've personally been pretty happy with mdadm. I find the array portability
it offers very useful, so I don't need to buy a second RAID controller just
in case my main controller dies and I need a compatible one to get the array
running again. If you don't need a BBU for safe write-back caching then
mdadm has advantages over hardware RAID.
I'll certainly use mdadm over onboard fakeraid solutions or low-end hardware
RAID controllers. I suspect a mid- to high end HW RAID unit will generally
win.
Also for sequential throughput md RAID is usually faster than most
RAID controllers, even the high end Areca and LSI ones.
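For reference, building such an md array is a one-liner; the device names
below are purely illustrative (check lsblk first), and the array also needs
to be recorded in mdadm.conf to come back cleanly after a reboot:

mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
cat /proc/mdstat                                  # watch the initial resync
mdadm --detail --scan >> /etc/mdadm/mdadm.conf    # persist the array definition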
Jeff Janes
2012-12-11 17:25:21 UTC
Permalink
On Tue, Dec 11, 2012 at 2:04 AM, Niels Kristian Schjødt
Post by Niels Kristian Schjødt
Post by Jeff Janes
The fact that there is much more writing than reading tells me that
most of your indexes are in RAM. The amount of index you are rapidly
reading and dirtying is large enough to fit in RAM, but is not large
enough to fit in shared_buffers + kernel's dirty-buffer comfort level.
Maybe I should mention that I never see more than about 5 GB out of my total 32 GB being in use on the server… Can I somehow utilize more of it?
What tool do you use to determine that? Is that on top of the 4GB
shared_buffers, or does it include it?

How big is your entire data set? Maybe all your data fits in 5GB
(believable, as all your indexes listed below sum to < 2.5GB) so there
is no need to use more.

Or maybe you have hit a bug in the 3.2 kernel. At least one of those
has been frequently discussed.
Post by Niels Kristian Schjødt
Post by Jeff Janes
You could really crank up shared_buffers or vm.dirty_background_ratio,
but doing so might cause problems with checkpoints stalling and
latency spikes. That would probably not be a problem during the
night, but could be during the day.
What do you have in mind here? Tweaking what parameters to what values?
I'd set shared_buffers to 20GB (or 10GB, if that will hold all of your
data) and see what happens. And probably increase checkpoint_timeout
and checkpoint_segments about 3x each. Also, turn on log_checkpoints
so you can see what kinds of problem those changes may be causing
there (i.e. long sync times). Preferably you do this on some kind of
pre-production or test server.
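In postgresql.conf terms that experiment would look something like the
following (illustrative values only, not a recommendation):

shared_buffers = 20GB        # or 10GB if that already holds the whole data set
checkpoint_segments = 300    # roughly 3x whatever is set now
checkpoint_timeout = 30min   # likewise roughly 3x
log_checkpoints = on         # so long sync times show up in the server log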

But if your database is growing so rapidly that it soon won't fit on
240GB, then cranking up shared_buffers won't do for long. If you can
get your tables and all of their indexes clustered together, then you
can do the updates in an order that makes IO more efficient. Maybe
partitioning would help.
Post by Niels Kristian Schjødt
Post by Jeff Janes
I don't know how big each disk is, or how big your various categories
of data are. Could you move everything to SSD? Could you move all
your actively updated indexes there?
With table spaces you mean?
Yes. Or moving everything to SSD if it fits, then you don't have go
through and separate objects.

The UPDATE you posted in a previous thread looked like the table
blocks might also be getting dirtied in a fairly random order, which
means the table blocks are in the same condition as the index blocks
so maybe singling out the indexes isn't warranted.

Cheers,

Jeff
Niels Kristian Schjødt
2012-12-12 16:46:11 UTC
Permalink
Post by Jeff Janes
On Tue, Dec 11, 2012 at 2:04 AM, Niels Kristian Schjødt
Post by Niels Kristian Schjødt
Post by Jeff Janes
The fact that there is much more writing than reading tells me that
most of your indexes are in RAM. The amount of index you are rapidly
reading and dirtying is large enough to fit in RAM, but is not large
enough to fit in shared_buffers + kernel's dirty-buffer comfort level.
Maybe I should mention that I never see more than about 5 GB out of my total 32 GB being in use on the server… Can I somehow utilize more of it?
What tool do you use to determine that? Is that on top of the 4GB
shared_buffers, or does it include it?
Okay, I might not have made myself clear; I was talking about "physical" memory utilization. Here are the stats:
free -m
             total       used       free     shared    buffers     cached
Mem:         32075      25554       6520          0         69      22694
-/+ buffers/cache:       2791      29284
Swap:         2046        595       1451
Post by Jeff Janes
How big is your entire data set? Maybe all your data fits in 5GB
(believable, as all your indexes listed below sum to < 2.5GB) so there
is no need to use more.
It doesn't. We are a search engine for used cars, and there are quite a lot of those out there :-) However, my indexes are almost all partial indexes, which means they only cover cars that are still for sale, so in that sense the indexes themselves don't really grow, but the tables do.
Post by Jeff Janes
Or maybe you have hit a bug in the 3.2 kernel. At least one of those
has been frequently discussed.
Might be true - but likely?
Post by Jeff Janes
Post by Niels Kristian Schjødt
Post by Jeff Janes
You could really crank up shared_buffers or vm.dirty_background_ratio,
but doing so might cause problems with checkpoints stalling and
latency spikes. That would probably not be a problem during the
night, but could be during the day.
What do you have in mind here? Tweaking what parameters to what values?
I'd set shared_buffers to 20GB (or 10GB, if that will hold all of your
I had that before; Shaun suggested that I change it to 4GB, as he was talking about strange behavior when it is larger than that on 12.04. But I can say that there has not been any notable difference between having it at 4GB and at 8GB.
Post by Jeff Janes
data) and see what happens. And probably increase checkpoint_timeout
and checkpoint_segments about 3x each. Also, turn on log_checkpoints
so you can see what kinds of problem those changes may be causing
there (i.e. long sync times). Preferably you do this on some kind of
pre-production or test server.
But if your database is growing so rapidly that it soon won't fit on
240GB, then cranking up shared_buffers won't do for long. If you can
get your tables and all of their indexes clustered together, then you
can do the updates in an order that makes IO more efficient. Maybe
partitioning would help.
Can you explain a little more about this, or provide me a good link?
Post by Jeff Janes
Post by Niels Kristian Schjødt
Post by Jeff Janes
I don't know how big each disk is, or how big your various categories
of data are. Could you move everything to SSD? Could you move all
your actively updated indexes there?
With table spaces you mean?
Yes. Or moving everything to SSD if it fits, then you don't have go
through and separate objects.
The UPDATE you posted in a previous thread looked like the table
blocks might also be getting dirtied in a fairly random order, which
means the table blocks are in the same condition as the index blocks
so maybe singling out the indexes isn't warranted.
Cheers,
Jeff
Jeff Janes
2012-12-13 15:10:37 UTC
Permalink
On Wed, Dec 12, 2012 at 8:46 AM, Niels Kristian Schjødt
Post by Niels Kristian Schjødt
Post by Jeff Janes
On Tue, Dec 11, 2012 at 2:04 AM, Niels Kristian Schjødt
Post by Niels Kristian Schjødt
Maybe I should mention that I never see more than about 5 GB out of my total 32 GB being in use on the server… Can I somehow utilize more of it?
What tool do you use to determine that? Is that on top of the 4GB
shared_buffers, or does it include it?
free -m
             total       used       free     shared    buffers     cached
Mem:         32075      25554       6520          0         69      22694
-/+ buffers/cache:       2791      29284
Swap:         2046        595       1451
I don't know how you get 5 GB from that, though. You have 22 GB of
cached file-system data, which for your purposes probably counts as being
utilized. Although we don't know how much of this is postgres
data files, chances are it is a pretty good chunk.
Post by Niels Kristian Schjødt
Post by Jeff Janes
How big is your entire data set? Maybe all your data fits in 5GB
(believable, as all your indexes listed below sum to < 2.5GB) so there
is no need to use more.
It doesn't. We are a search engine for used cars, and there are quite a lot of those out there :-)
But how big? More than 22GB? (you can use \l+ in psql, or du -s on
the data directory)
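For example, either of these gives a quick answer (the path assumes the
stock Debian/Ubuntu layout for 9.2):

du -sh /var/lib/postgresql/9.2/main
psql -c "SELECT pg_size_pretty(pg_database_size(current_database()));"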
Post by Niels Kristian Schjødt
However, my indexes are almost all partial indexes, which means they only cover cars that are still for sale, so in that sense the indexes themselves don't really grow, but the tables do.
So maybe this reverses things. If your car table is huge and the
active cars are scattered randomly among all the inactive ones, then
updating random active cars is going to generate a lot of randomly
scattered writing which can't be combined into sequential writes.

Do you have plans for archiving cars that are no longer for sale? Why
do you keep them around in the first place, i.e. what types of queries
do you do on inactive ones?

Unfortunately you currently can't use CLUSTER with partial indexes,
otherwise that might be a good idea. You could build a full index on
whatever it is you use as the criterion for the partial indexes,
cluster on that, and then drop it.
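Spelled out, that workaround is roughly the following (table and column
names invented for the example; note that CLUSTER takes an exclusive lock
and needs roughly the table's size again in free disk space):

CREATE INDEX cars_for_sale_full_idx ON cars (for_sale); -- full index on the partial-index predicate column
CLUSTER cars USING cars_for_sale_full_idx;              -- rewrite the table in that order
ANALYZE cars;
DROP INDEX cars_for_sale_full_idx;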

But the table would eventually become unclustered again, so if this
works you might want to implement partitioning between active and
inactive partitions so as to maintain the clustering.
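On 9.2 that would be the usual inheritance-based partitioning, sketched
here with invented names (rows have to be routed into the right child by a
trigger or by the application):

CREATE TABLE cars_active   (CHECK (for_sale))     INHERITS (cars);
CREATE TABLE cars_inactive (CHECK (NOT for_sale)) INHERITS (cars);
-- when a car is sold, move its row from cars_active to cars_inactive
-- inside one transaction, so the clustering by sale status is maintained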
Post by Niels Kristian Schjødt
Post by Jeff Janes
Post by Niels Kristian Schjødt
Post by Jeff Janes
You could really crank up shared_buffers or vm.dirty_background_ratio,
but doing so might cause problems with checkpoints stalling and
latency spikes. That would probably not be a problem during the
night, but could be during the day.
What do you have in mind here? Tweaking what parameters to what values?
I'd set shared_buffers to 20GB (or 10GB, if that will hold all of your
I had that before; Shaun suggested that I change it to 4GB, as he was talking about strange behavior when it is larger than that on 12.04. But I can say that there has not been any notable difference between having it at 4GB and at 8GB.
It is almost an all or nothing thing. If you need 16 or 20GB, just
going from 4 to 8 isn't going to show much difference. If you can
test this easily, I'd just set it to 24 or even 28GB and run the bulk
update. I don't think you'd want to run a server permanently at those
settings, but it is an easy way to rule in or out different theories
about what is going on.
Post by Niels Kristian Schjødt
Post by Jeff Janes
But if your database is growing so rapidly that it soon won't fit on
240GB, then cranking up shared_buffers won't do for long. If you can
get your tables and all of their indexes clustered together, then you
can do the updates in an order that makes IO more efficient. Maybe
partitioning would help.
Can you explain a little more about this, or provide me a good link?
If all your partial btree indexes are using the same WHERE clause,
then your indexes are already clustered together in a sense--a partial
index is kind of like a composite index with the WHERE clause as the
first column.
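For example (invented names), these two behave quite similarly for queries
that include the predicate:

CREATE INDEX cars_price_partial_idx ON cars (price) WHERE for_sale; -- partial: only cars for sale are indexed
CREATE INDEX cars_price_full_idx ON cars (for_sale, price);         -- composite: predicate column leads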

So the trick would be to get the table to be clustered on the same
thing--either by partitioning or by the CLUSTER command, or something
equivalent to those. I don't know of a good link, other than the
documentation (which is more about how to do it, rather than why you
would want to or how to design it)

Cheers,

Jeff