Andre
2013-02-24 14:08:03 UTC
Hi,
Since upgrading our hardware, OS and Postgres, we have been experiencing server stalls under certain conditions. During a stall (up to 2 minutes) all CPUs show 100% system time, and all Postgres processes show BIND in top.
Usually the server has a load of < 0.5 (12 cores) with up to 30 connections and 200-400 tps.
Here is top -H during the stall:
Threads: 279 total, 25 running, 254 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 99.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
This is under normal circumstances:
Threads: 274 total, 1 running, 273 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.2 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
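A minimal sketch of how the two top summaries above could be told apart automatically, for example when logging top output to catch the next stall. The function names and the 90% threshold are assumptions chosen for illustration, not part of the setup described here.

```python
import re

# Matches one "<value> <field>" pair from the summary line printed by top,
# e.g. "%Cpu(s): 0.2 us, 99.8 sy, 0.0 ni, 0.0 id, ..."
CPU_FIELD = re.compile(r"(\d+(?:\.\d+)?)\s+(us|sy|ni|id|wa|hi|si|st)\b")


def parse_cpu_line(line):
    """Return a dict such as {'us': 0.2, 'sy': 99.8, ...} from a %Cpu(s) line."""
    return {name: float(value) for value, name in CPU_FIELD.findall(line)}


def is_system_time_stall(line, sys_threshold=90.0):
    """True when system time reaches the threshold (90% is an assumed cutoff)."""
    return parse_cpu_line(line).get("sy", 0.0) >= sys_threshold


# The two summary lines quoted in this post.
stalled = "%Cpu(s): 0.2 us, 99.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st"
normal = "%Cpu(s): 0.2 us, 0.2 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st"

print(is_system_time_stall(stalled))
print(is_system_time_stall(normal))
```

Feeding it lines captured from top in batch mode would give a timestamped record of when system time spikes, which can then be lined up against the Postgres log.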
iostat shows under 0.3% load on the drives.
The stalls are mostly reproducible when the server is under its normal load and 20-40 new processes then start executing SQL statements.
Deactivating HT seems to have reduced the frequency and length of the stalls.
The log shows entries for slow BINDs (8 seconds):
... LOG: duration: 8452.654 ms bind pdo_stmt_00000001: SELECT [20 columns selected] FROM users WHERE users.USERID=$1 LIMIT 1
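To collect all such entries rather than spotting them by eye, something like the following could pull the duration and statement name out of each slow BIND line. It is only a sketch: the pattern assumes the message layout shown above, ignores whatever log_line_prefix precedes "LOG:", and the helper name is made up for illustration.

```python
import re

# Matches the message part of a slow-bind log entry, e.g.
# "LOG: duration: 8452.654 ms bind pdo_stmt_00000001: SELECT ..."
SLOW_BIND = re.compile(
    r"LOG:\s+duration:\s+(\d+(?:\.\d+)?)\s+ms\s+bind\s+([^:\s]+):\s*(.*)"
)


def parse_slow_bind(line):
    """Return (duration_ms, statement_name, statement) or None if no match."""
    match = SLOW_BIND.search(line)
    if match is None:
        return None
    return float(match.group(1)), match.group(2), match.group(3)


# The entry quoted in this post, with its prefix elided as in the original.
entry = (
    "... LOG: duration: 8452.654 ms bind pdo_stmt_00000001: "
    "SELECT [20 columns selected] FROM users WHERE users.USERID=$1 LIMIT 1"
)

duration_ms, name, statement = parse_slow_bind(entry)
print(duration_ms, name)
```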
I have tried to create a test case, but even starting 200 client processes that execute prepared statements does not reproduce this behaviour on a nearly idle server. It only stalls under the normal workload.
Hardware details:
2x Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
64 GB RAM
Postgres version: 9.2.2 and 9.2.3
Linux: OpenSUSE 12.2 with Kernel 3.4.6
Postgres config:
max_connections = 200
effective_io_concurrency = 3
max_wal_senders = 2
wal_keep_segments = 2048
max_locks_per_transaction = 500
default_statistics_target = 100
checkpoint_completion_target = 0.9
maintenance_work_mem = 1GB
effective_cache_size = 60GB
work_mem = 384MB
wal_buffers = 8MB
checkpoint_segments = 64
shared_buffers = 15GB
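As a rough sanity check on the memory settings above, the sketch below adds shared_buffers to one work_mem allocation per connection and compares the total with the 64 GB of RAM. This is back-of-the-envelope arithmetic only: work_mem applies per sort or hash operation, so a single query can use several multiples of it, and nothing here is meant to imply that memory pressure is the cause of the stalls. The function name and the "one allocation per connection" default are assumptions.

```python
MB_PER_GB = 1024

# Values taken from the configuration and hardware details in this post.
shared_buffers_mb = 15 * MB_PER_GB
work_mem_mb = 384
ram_mb = 64 * MB_PER_GB


def rough_memory_mb(connections, allocations_per_connection=1):
    """shared_buffers plus work_mem for every connection (a simplification)."""
    return shared_buffers_mb + connections * allocations_per_connection * work_mem_mb


# max_connections = 200 versus the roughly 30 connections seen in practice.
for connections in (200, 30):
    total_gb = rough_memory_mb(connections) / MB_PER_GB
    print(f"{connections} connections: {total_gb:.2f} GB of {ram_mb // MB_PER_GB} GB RAM")
```

With all 200 connections each using one work_mem allocation the total would exceed physical RAM, while at the usual 30 connections it stays well below it.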
This might be related to this topic: http://www.postgresql.org/message-id/CANQNgOquOGH7AkqW6ObPafrgxv=J3WsiZg-NgVvbki-***@mail.gmail.com (Poor performance after update from SLES11 SP1 to SP2)
I believe the old server was OpenSUSE 11.x.
Thanks for any hint on how to fix this or diagnose the problem.
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance