Discussion:
High CPU usage / load average after upgrading to Ubuntu 12.04
Dan Kogan
2013-02-12 17:25:48 UTC
Permalink
Hello,

We upgraded from Ubuntu 11.04 to Ubuntu 12.04 and almost immediately observed increased CPU usage and a significantly higher load average on our database server.
At the time we were on Postgres 9.0.5. We decided to upgrade to Postgres 9.2 to see if that resolves the issue, but unfortunately it did not.

Just for illustration purposes, below are a few links to CPU and load graphs from before and after the upgrade.

https://s3.amazonaws.com/iqtell.ops/Load+Average+Post+Upgrade.png
https://s3.amazonaws.com/iqtell.ops/Load+Average+Pre+Upgrade.png

https://s3.amazonaws.com/iqtell.ops/Server+CPU+Post+Upgrade.png
https://s3.amazonaws.com/iqtell.ops/Server+CPU+Pre+Upgrade.png

We also tried tweaking kernel parameters as mentioned here - http://www.postgresql.org/message-id/***@optionshouse.com, but have not seen any improvement.


Any advice on how to trace what could be causing the change in CPU usage and load average is appreciated.
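
(For reference, a rough sketch of the kind of commands typically used for this sort of tracing; it assumes the sysstat and linux-tools packages are installed, and exact flags may vary by version:)

top -c            # which processes are burning CPU, with full command lines
vmstat 1          # user vs. system time, run queue length, context switches per second
sar -u -q 1 30    # 30 one-second samples of CPU utilization and load average
perf top          # rough view of kernel vs. userspace hot spots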

Our postgres version is:

PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

OS:

Linux ip-10-189-175-25 3.2.0-37-virtual #58-Ubuntu SMP Thu Jan 24 15:48:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Hardware (this is an Amazon EC2 High-Memory Quadruple Extra Large instance):

8 core Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
68 GB RAM
RAID10 with 8 drives using xfs
Drives are EBS with provisioned IOPS, with 1000 iops each

Postgres Configuration:

archive_command = rsync -a %p slave:/var/lib/postgresql/replication_load/%f
archive_mode = on
checkpoint_completion_target = 0.9
checkpoint_segments = 64
checkpoint_timeout = 30min
default_text_search_config = pg_catalog.english
external_pid_file = /var/run/postgresql/9.2-main.pid
lc_messages = en_US.UTF-8
lc_monetary = en_US.UTF-8
lc_numeric = en_US.UTF-8
lc_time = en_US.UTF-8
listen_addresses = *
log_checkpoints=on
log_destination=stderr
log_line_prefix = %t [%p]: [%l-1]
log_min_duration_statement =500
max_connections=300
max_stack_depth=2MB
max_wal_senders=5
shared_buffers=4GB
synchronous_commit=off
unix_socket_directory=/var/run/postgresql
wal_keep_segments=128
wal_level=hot_standby
work_mem=8MB


Thanks,
Dan
Dan Kogan
2013-02-12 19:59:16 UTC
Permalink
Thanks for the reply. We are still using postgresql-9.0-801.jdbc4.jar. It seemed to us that this is more related to the OS than to the JDBC version, as we had the issue before we upgraded to 9.2.
It might still be worth a try.

Just out of curiosity, has anyone else experienced performance issues (or even tried) with the 9.0 JDBC driver against a 9.2 server?

Dan

From: Eric Haertel [mailto:***@groupon.com]
Sent: Tuesday, February 12, 2013 12:52 PM
To: Dan Kogan
Cc: pgsql-***@postgresql.org
Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04

I don't know if it helps, but after updating from 8.4 to 9.1 I had extreme problems with my local tests until I changed the JDBC driver to the proper version. I'm not sure whether the load occurred on the client or the server side, as the local integration tests ran on my machine.

2013/2/12 Dan Kogan <***@iqtell.com<mailto:***@iqtell.com>>
Hello,

We upgraded from Ubuntu 11.04 to Ubuntu 12.04 and almost immediately observed increased CPU usage and a significantly higher load average on our database server.
At the time we were on Postgres 9.0.5. We decided to upgrade to Postgres 9.2 to see if that resolves the issue, but unfortunately it did not.

Just for illustration purposes, below are a few links to cpu and load graphs pre and post upgrade.

https://s3.amazonaws.com/iqtell.ops/Load+Average+Post+Upgrade.png
https://s3.amazonaws.com/iqtell.ops/Load+Average+Pre+Upgrade.png

https://s3.amazonaws.com/iqtell.ops/Server+CPU+Post+Upgrade.png
https://s3.amazonaws.com/iqtell.ops/Server+CPU+Pre+Upgrade.png

We also tried tweaking kernel parameters as mentioned here - http://www.postgresql.org/message-id/***@optionshouse.com, but have not seen any improvement.


Any advice on how to trace what could be causing the change in CPU usage and load average is appreciated.

Our postgres version is:

PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

OS:

Linux ip-10-189-175-25 3.2.0-37-virtual #58-Ubuntu SMP Thu Jan 24 15:48:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Hardware (this is an Amazon EC2 High-Memory Quadruple Extra Large instance):

8 core Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
68 GB RAM
RAID10 with 8 drives using xfs
Drives are EBS with provisioned IOPS, with 1000 iops each

Postgres Configuration:

archive_command = rsync -a %p slave:/var/lib/postgresql/replication_load/%f
archive_mode = on
checkpoint_completion_target = 0.9
checkpoint_segments = 64
checkpoint_timeout = 30min
default_text_search_config = pg_catalog.english
external_pid_file = /var/run/postgresql/9.2-main.pid
lc_messages = en_US.UTF-8
lc_monetary = en_US.UTF-8
lc_numeric = en_US.UTF-8
lc_time = en_US.UTF-8
listen_addresses = *
log_checkpoints=on
log_destination=stderr
log_line_prefix = %t [%p]: [%l-1]
log_min_duration_statement =500
max_connections=300
max_stack_depth=2MB
max_wal_senders=5
shared_buffers=4GB
synchronous_commit=off
unix_socket_directory=/var/run/postgresql
wal_keep_segments=128
wal_level=hot_standby
work_mem=8MB


Thanks,
Dan



--

Eric Härtel
Senior Software Developer

Tel.: +49 (0) 30 240 20 40 35

Mobile: +49 (0) 174 43 38 614
Email: ***@groupon.com<mailto:***@groupon.de>






Groupon GmbH & Co. Service KG | Oberwallstraße 6 | 10117 Berlin
General partner with unlimited liability: Groupon Verwaltungs GmbH, HRB 131594 B
Managing Directors: Mark S. Hoyt | Bradley Downes | Daniel Köllner
Registered with the Amtsgericht Charlottenburg, Berlin, HRA 45265 B | VAT ID No. DE 279 803 459
Dan Kogan
2013-02-13 01:28:41 UTC
Permalink
Hi Will,

Yes, I think we've seen some discussions on that. Our servers are hosted on Amazon EC2 and upgrading the kernel does not seem so straightforward.
We did a benchmark using pgbench on 3.5 vs 3.2 and saw an improvement. Unfortunately our production server would not boot off 3.5 so we had to revert back to 3.2.

At this point we are contemplating whether it's better to go back to 11.04 or upgrade to 12.10 (which comes with kernel version 3.5).
Any thoughts on that would be appreciated.

Dan

From: Will Ferguson [mailto:***@northplains.com]
Sent: Tuesday, February 12, 2013 5:20 PM
To: Dan Kogan; pgsql-***@postgresql.org
Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04

Hey Dan,

If I recall correctly, there were some discussions on here related to performance issues with the 3.2 kernel. I'm away at the moment so can't dig them out, but there has been much discussion lately about kernel performance problems in 3.2 which don't seem to be present in 3.4. I'll see if I can find them when I'm next at my desk.

Will


Sent from Samsung Mobile



-------- Original message --------
From: Dan Kogan <***@iqtell.com<mailto:***@iqtell.com>>
Date:
To: pgsql-***@postgresql.org<mailto:pgsql-***@postgresql.org>
Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04

Thanks for the reply. We are still using postgresql-9.0-801.jdbc4.jar. It seemed to us that this is more related to the OS than to the JDBC version, as we had the issue before we upgraded to 9.2.
It might still be worth a try.

Just out of curiosity, has anyone else experienced performance issues (or even tried) with the 9.0 jdbc driver against 9.2 server?

Dan

From: Eric Haertel [mailto:***@groupon.com]
Sent: Tuesday, February 12, 2013 12:52 PM
To: Dan Kogan
Cc: pgsql-***@postgresql.org<mailto:pgsql-***@postgresql.org>
Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04

I don't know if it helps, but after updating from 8.4 to 9.1 I had extreme problems with my local tests until I changed the JDBC driver to the proper version. I'm not sure whether the load occurred on the client or the server side, as the local integration tests ran on my machine.

2013/2/12 Dan Kogan <***@iqtell.com<mailto:***@iqtell.com>>
Hello,

We upgraded from Ubuntu 11.04 to Ubuntu 12.04 and almost immediately observed increased CPU usage and a significantly higher load average on our database server.
At the time we were on Postgres 9.0.5. We decided to upgrade to Postgres 9.2 to see if that resolves the issue, but unfortunately it did not.

Just for illustration purposes, below are a few links to cpu and load graphs pre and post upgrade.

https://s3.amazonaws.com/iqtell.ops/Load+Average+Post+Upgrade.png
https://s3.amazonaws.com/iqtell.ops/Load+Average+Pre+Upgrade.png

https://s3.amazonaws.com/iqtell.ops/Server+CPU+Post+Upgrade.png
https://s3.amazonaws.com/iqtell.ops/Server+CPU+Pre+Upgrade.png

We also tried tweaking kernel parameters as mentioned here - http://www.postgresql.org/message-id/***@optionshouse.com, but have not seen any improvement.


Any advice on how to trace what could be causing the change in CPU usage and load average is appreciated.

Our postgres version is:

PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit

OS:

Linux ip-10-189-175-25 3.2.0-37-virtual #58-Ubuntu SMP Thu Jan 24 15:48:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Hardware (this is an Amazon EC2 High-Memory Quadruple Extra Large instance):

8 core Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
68 GB RAM
RAID10 with 8 drives using xfs
Drives are EBS with provisioned IOPS, with 1000 iops each

Postgres Configuration:

archive_command = rsync -a %p slave:/var/lib/postgresql/replication_load/%f
archive_mode = on
checkpoint_completion_target = 0.9
checkpoint_segments = 64
checkpoint_timeout = 30min
default_text_search_config = pg_catalog.english
external_pid_file = /var/run/postgresql/9.2-main.pid
lc_messages = en_US.UTF-8
lc_monetary = en_US.UTF-8
lc_numeric = en_US.UTF-8
lc_time = en_US.UTF-8
listen_addresses = *
log_checkpoints=on
log_destination=stderr
log_line_prefix = %t [%p]: [%l-1]
log_min_duration_statement =500
max_connections=300
max_stack_depth=2MB
max_wal_senders=5
shared_buffers=4GB
synchronous_commit=off
unix_socket_directory=/var/run/postgresql
wal_keep_segments=128
wal_level=hot_standby
work_mem=8MB


Thanks,
Dan



--

Eric Härtel
Senior Software Developer

Tel.: +49 (0) 30 240 20 40 35

Mobile: +49 (0) 174 43 38 614
Email: ***@groupon.com<mailto:***@groupon.de>






Groupon GmbH & Co. Service KG | Oberwallstraße 6 | 10117 Berlin
General partner with unlimited liability: Groupon Verwaltungs GmbH, HRB 131594 B
Managing Directors: Mark S. Hoyt | Bradley Downes | Daniel Köllner
Registered with the Amtsgericht Charlottenburg, Berlin, HRA 45265 B | VAT ID No. DE 279 803 459
Josh Berkus
2013-02-13 19:24:41 UTC
Permalink
On 02/12/2013 05:28 PM, Dan Kogan wrote:
> Hi Will,
>
> Yes, I think we've seen some discussions on that. Our servers are hosted on Amazon EC2 and upgrading the kernel does not seem so straightforward.
> We did a benchmark using pgbench on 3.5 vs 3.2 and saw an improvement. Unfortunately our production server would not boot off 3.5 so we had to revert back to 3.2.
>
> At this point we are contemplating whether it's better to go back to 11.04 or upgrade to 12.10 (which comes with kernel version 3.5).
> Any thoughts on that would be appreciated.

I have a machine running the same version of Ubuntu. I'll run some
tests and tell you what I find.


--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Josh Berkus
2013-02-14 00:25:52 UTC
Permalink
On 02/13/2013 11:24 AM, Josh Berkus wrote:
> On 02/12/2013 05:28 PM, Dan Kogan wrote:
>> Hi Will,
>>
>> Yes, I think we've seen some discussions on that. Our servers are hosted on Amazon EC2 and upgrading the kernel does not seem so straightforward.
>> We did a benchmark using pgbench on 3.5 vs 3.2 and saw an improvement. Unfortunately our production server would not boot off 3.5 so we had to revert back to 3.2.
>>
>> At this point we are contemplating whether it's better to go back to 11.04 or upgrade to 12.10 (which comes with kernel version 3.5).
>> Any thoughts on that would be appreciated.
>
> I have a machine running the same version of Ubuntu. I'll run some
> tests and tell you what I find.

So I'm running a pgbench. However, I don't really have anything to
compare the stats I'm seeing against. CPU usage and load average were high
(load 7.9), but that was on -j 8 -c 32, with a TPS of 8500.
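
(For reference, a run along those lines would look roughly like the following; the database name and duration are made up, and the scale factor is the one Dan mentions later in the thread:)

createdb pgbench_db
pgbench -i -s 3600 pgbench_db           # one-time initialization of the test database
pgbench -c 32 -j 8 -T 300 pgbench_db    # 32 clients, 8 worker threads, 5 minutes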

What numbers are you seeing, exactly?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Dan Kogan
2013-02-14 01:30:38 UTC
Permalink
Just to be clear - I was describing the current situation in our production.

We were running pgbench on different Ubuntu versions today. I don't have a 12.04 setup at the moment, but I do have 12.10, which seems to be performing about the same as 12.04 in our tests with pgbench.
Running pgbench with 8 jobs and 32 clients resulted in a load average of about 15 and a TPS of 51350.

Question - how many cores does your server have? Ours has 8 cores.

Thanks,
Dan

-----Original Message-----
From: pgsql-performance-***@postgresql.org [mailto:pgsql-performance-***@postgresql.org] On Behalf Of Josh Berkus
Sent: Wednesday, February 13, 2013 7:26 PM
To: pgsql-***@postgresql.org
Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04

On 02/13/2013 11:24 AM, Josh Berkus wrote:
> On 02/12/2013 05:28 PM, Dan Kogan wrote:
>> Hi Will,
>>
>> Yes, I think we've seen some discussions on that. Our servers are hosted on Amazon EC2 and upgrading the kernel does not seem so straightforward.
>> We did a benchmark using pgbench on 3.5 vs 3.2 and saw an improvement. Unfortunately our production server would not boot off 3.5 so we had to revert back to 3.2.
>>
>> At this point we are contemplating whether it's better to go back to 11.04 or upgrade to 12.10 (which comes with kernel version 3.5).
>> Any thoughts on that would be appreciated.
>
> I have a machine running the same version of Ubuntu. I'll run some
> tests and tell you what I find.

So I'm running a pgbench. However, I don't really have anything to compare the stats I'm seeing against. CPU usage and load average were high (load 7.9), but that was on -j 8 -c 32, with a TPS of 8500.

What numbers are you seeing, exactly?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Josh Berkus
2013-02-14 18:38:09 UTC
Permalink
On 02/13/2013 05:30 PM, Dan Kogan wrote:
> Just to be clear - I was describing the current situation in our production.
>
> We were running pgbench on different Ubuntu versions today. I don't have a 12.04 setup at the moment, but I do have 12.10, which seems to be performing about the same as 12.04 in our tests with pgbench.
> Running pgbench with 8 jobs and 32 clients resulted in load average of about 15 and TPS was 51350.

What size database?

>
> Question - how many cores does your server have? Ours has 8 cores.

32

I suppose I could throw multiple pgbenches at it. I just don't see the
load numbers as unusual, but I don't have a similar pre-12.04 server to
compare with.


--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Dan Kogan
2013-02-14 20:41:26 UTC
Permalink
We used a scale factor of 3600.
Yeah, maybe other people see a similar load average; we were not sure.
However, we saw a clear difference right after the upgrade.
We are trying to determine whether it makes sense for us to go back to 11.04, or whether there is something here we are missing.

-----Original Message-----
From: pgsql-performance-***@postgresql.org [mailto:pgsql-performance-***@postgresql.org] On Behalf Of Josh Berkus
Sent: Thursday, February 14, 2013 1:38 PM
To: pgsql-***@postgresql.org
Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04

On 02/13/2013 05:30 PM, Dan Kogan wrote:
> Just to be clear - I was describing the current situation in our production.
>
> We were running pgbench on different Ubuntu versions today. I don't have a 12.04 setup at the moment, but I do have 12.10, which seems to be performing about the same as 12.04 in our tests with pgbench.
> Running pgbench with 8 jobs and 32 clients resulted in load average of about 15 and TPS was 51350.

What size database?

>
> Question - how many cores does your server have? Ours has 8 cores.

32

I suppose I could throw multiple pgbenches at it. I just don't see the load numbers as unusual, but I don't have a similar pre-12.04 server to compare with.


--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Josh Berkus
2013-02-14 23:57:50 UTC
Permalink
On 02/14/2013 12:41 PM, Dan Kogan wrote:
> We used scale factor of 3600.
> Yeah, maybe other people see similar load average, we were not sure.
> However, we saw a clear difference right after the upgrade.
> We are trying to determine whether it makes sense for us to go to 11.04 or maybe there is something here we are missing.

Well, I'm seeing a higher system % on CPU than I expect (around 15% on
each core), and a MUCH higher context-switch rate than I expect (up to 500K).
Is that anything like you're seeing?
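
(For comparison purposes, the context-switch rate can be watched during a run with either of these; both report it as a per-second figure in the "cs" / "cswch/s" column:)

vmstat 1
sar -w 1 30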

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Dan Kogan
2013-02-15 04:32:15 UTC
Permalink
Yes, we are seeing a higher system % on the CPU; not sure how to quantify it in terms of % right now - will check into that tomorrow.
We were not checking the context-switch numbers during our benchmark; will check that tomorrow as well.

-----Original Message-----
From: pgsql-performance-***@postgresql.org [mailto:pgsql-performance-***@postgresql.org] On Behalf Of Josh Berkus
Sent: Thursday, February 14, 2013 6:58 PM
To: pgsql-***@postgresql.org
Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04

On 02/14/2013 12:41 PM, Dan Kogan wrote:
> We used scale factor of 3600.
> Yeah, maybe other people see similar load average, we were not sure.
> However, we saw a clear difference right after the upgrade.
> We are trying to determine whether it makes sense for us to go to 11.04 or maybe there is something here we are missing.

Well, I'm seeing a higher system % on CPU than I expect (around 15% on each core), and a MUCH higher context-switch than I expect (up to 500K).
Is that anything like you're seeing?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Scott Marlowe
2013-02-15 04:47:45 UTC
Permalink
If you run your benchmarks for more than a few minutes I highly
recommend enabling sysstat service data collection, then you can look
at it after the fact with sar. VERY useful stuff both for
benchmarking and post mortem on live servers.
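
(On Debian/Ubuntu, enabling that usually amounts to something like the following; file locations can differ between distributions:)

sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo service sysstat restart
sar -q        # later: load average / run-queue history from today's collected data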

On Thu, Feb 14, 2013 at 9:32 PM, Dan Kogan <***@iqtell.com> wrote:
> Yes, we are seeing higher system % on the CPU, not sure how to quantify in terms of % right now - will check into that tomorrow.
> We were not checking the context switch numbers during our benchmark, will check that tomorrow as well.
>
> -----Original Message-----
> From: pgsql-performance-***@postgresql.org [mailto:pgsql-performance-***@postgresql.org] On Behalf Of Josh Berkus
> Sent: Thursday, February 14, 2013 6:58 PM
> To: pgsql-***@postgresql.org
> Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04
>
> On 02/14/2013 12:41 PM, Dan Kogan wrote:
>> We used scale factor of 3600.
>> Yeah, maybe other people see similar load average, we were not sure.
>> However, we saw a clear difference right after the upgrade.
>> We are trying to determine whether it makes sense for us to go to 11.04 or maybe there is something here we are missing.
>
> Well, I'm seeing a higher system % on CPU than I expect (around 15% on each core), and a MUCH higher context-switch than I expect (up to 500K).
> Is that anything like you're seeing?
>
> --
> Josh Berkus
> PostgreSQL Experts Inc.
> http://pgexperts.com
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
> --
> Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance



--
To understand recursion, one must first understand recursion.


Josh Berkus
2013-02-15 18:26:08 UTC
Permalink
On 02/14/2013 08:47 PM, Scott Marlowe wrote:
> If you run your benchmarks for more than a few minutes I highly
> recommend enabling sysstat service data collection, then you can look
> at it after the fact with sar. VERY useful stuff both for
> benchmarking and post mortem on live servers.

Well, background sar, by default on Linux, only collects every 30min.
For a benchmark run, you want to generate your own sar file, for example:

sar -o hddrun2.sar -A 10 90 &

which says "collect all stats every 10 seconds and write them to the
file hddrun2.sar for 15 minutes"
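
(The saved file can then be read back after the run, e.g.:)

sar -u -f hddrun2.sar    # CPU utilization for the captured interval
sar -d -f hddrun2.sar    # per-device I/O activity
sar -w -f hddrun2.sar    # context switches per second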


--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Scott Marlowe
2013-02-15 18:52:34 UTC
Permalink
On Fri, Feb 15, 2013 at 11:26 AM, Josh Berkus <***@agliodbs.com> wrote:
> On 02/14/2013 08:47 PM, Scott Marlowe wrote:
>> If you run your benchmarks for more than a few minutes I highly
>> recommend enabling sysstat service data collection, then you can look
>> at it after the fact with sar. VERY useful stuff both for
>> benchmarking and post mortem on live servers.
>
> Well, background sar, by default on Linux, only collects every 30min.
> For a benchmark run, you want to generate your own sar file, for example:

On all my machines (debian and ubuntu) it collects every 5.

> sar -o hddrun2.sar -A 10 90 &
>
> which says "collect all stats every 10 seconds and write them to the
> file hddrun2.sar for 15 minutes"

Not a bad idea, esp. when benchmarking.


Josh Berkus
2013-02-19 00:19:19 UTC
Permalink
So, our drop in performance is now clearly due to pathological OS
behavior during checkpoints. Still trying to pin down what's going on,
but it's not system load; it's clearly related to the IO system.

Anyone else see this? I'm getting it both on 3.2 and 3.4. We're using
LSI Megaraid.


--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Josh Berkus
2013-02-19 00:39:53 UTC
Permalink
Scott,

> So do you have generally slow IO, or is it fsync behavior etc?

All tests except pgbench show this system as superfast. Bonnie++ and dd
tests are good (200 to 300MB/s), and test_fsync shows 14K ops/second.
Basically it has no issues until a checkpoint kicks in, at which time the
entire system basically halts for the duration of the checkpoint.
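
(The dd test here is the usual sequential-write check, something along these lines; the target path and size are examples:)

dd if=/dev/zero of=/data/ddtest bs=8k count=1000000 oflag=direct    # ~8GB sequential write, bypassing the page cache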

For that matter, if I run a pgbench and halt it just before checkpoint
kicks in, I get around 12000TPS, which is what I'd expect on this system.

At this point, we've tried 3.2.0.26, 3.2.0.27, 3.4.0, and tried updating
the RAID driver, and changing the IO scheduler. Nothing seems to affect
the behavior. Testing using Ext4 (instead of XFS) next.
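
(Changing the scheduler at runtime is the usual sysfs poke, e.g.; the device name below is an example:)

cat /sys/block/sdd/queue/scheduler                 # lists available schedulers, current one in brackets
echo deadline > /sys/block/sdd/queue/scheduler     # switch to deadline (as root)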


--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


jnelson+ (Jon Nelson)
2013-02-19 00:46:52 UTC
Permalink
On Mon, Feb 18, 2013 at 6:39 PM, Josh Berkus <***@agliodbs.com> wrote:
> Scott,
>
>> So do you have generally slow IO, or is it fsync behavior etc?
>
> All tests except pgBench show this system as superfast. Bonnie++ and DD
> tests are good (200 to 300mb/s), and test_fsync shows 14K/second.
> Basically it has no issues until checkpoint kicks in, at which time the
> entire system basically halts for the duration of the checkpoint.
>
> For that matter, if I run a pgbench and halt it just before checkpoint
> kicks in, I get around 12000TPS, which is what I'd expect on this system.
>
> At this point, we've tried 3.2.0.26, 3.2.0.27, 3.4.0, and tried updating
> the RAID driver, and changing the IO scheduler. Nothing seems to affect
> the behavior. Testing using Ext4 (instead of XFS) next.

Did you try turning barriers on or off *manually* (explicitly)? With
LSI and barriers *on* and ext4 I had less-optimal performance. With
Linux MD or (some) 3Ware configurations I had no performance hit.

--
Jon


Josh Berkus
2013-02-19 00:51:50 UTC
Permalink
> Did you try turning barriers on or off *manually* (explicitly)? With
> LSI and barriers *on* and ext4 I had less-optimal performance. With
> Linux MD or (some) 3Ware configurations I had no performance hit.

They're off in fstab.

/dev/sdd1 on /data type xfs (rw,noatime,nodiratime,nobarrier)


--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Scott Marlowe
2013-02-19 03:41:26 UTC
Permalink
On Mon, Feb 18, 2013 at 5:39 PM, Josh Berkus <***@agliodbs.com> wrote:
> Scott,
>
>> So do you have generally slow IO, or is it fsync behavior etc?
>
> All tests except pgBench show this system as superfast. Bonnie++ and DD
> tests are good (200 to 300mb/s), and test_fsync shows 14K/second.
> Basically it has no issues until checkpoint kicks in, at which time the
> entire system basically halts for the duration of the checkpoint.

I assume you've made attempts at write levelling to reduce the impact of
checkpoints etc.


Mark Kirkwood
2013-02-19 04:28:00 UTC
Permalink
On 19/02/13 13:39, Josh Berkus wrote:
> Scott,
>
>> So do you have generally slow IO, or is it fsync behavior etc?
> All tests except pgBench show this system as superfast. Bonnie++ and DD
> tests are good (200 to 300mb/s), and test_fsync shows 14K/second.
> Basically it has no issues until checkpoint kicks in, at which time the
> entire system basically halts for the duration of the checkpoint.
>
> For that matter, if I run a pgbench and halt it just before checkpoint
> kicks in, I get around 12000TPS, which is what I'd expect on this system.
>
> At this point, we've tried 3.2.0.26, 3.2.0.27, 3.4.0, and tried updating
> the RAID driver, and changing the IO scheduler. Nothing seems to affect
> the behavior. Testing using Ext4 (instead of XFS) next.
>
>

Might be worth looking at your vm.dirty_ratio, vm.dirty_background_ratio
and friends settings. We managed to choke up a system with 16x SSD by
leaving them at their defaults...
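
(The current values are quick to check, e.g.:)

sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes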

Cheers

Mark




Josh Berkus
2013-02-19 17:51:16 UTC
Permalink
On 02/18/2013 08:28 PM, Mark Kirkwood wrote:
> Might be worth looking at your vm.dirty_ratio, vm.dirty_background_ratio
> and friends settings. We managed to choke up a system with 16x SSD by
> leaving them at their defaults...

Yeah? Any settings you'd recommend specifically? What did you use on
the SSD system?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Mark Kirkwood
2013-02-19 23:17:22 UTC
Permalink
On 20/02/13 06:51, Josh Berkus wrote:
> On 02/18/2013 08:28 PM, Mark Kirkwood wrote:
>> Might be worth looking at your vm.dirty_ratio, vm.dirty_background_ratio
>> and friends settings. We managed to choke up a system with 16x SSD by
>> leaving them at their defaults...
> Yeah? Any settings you'd recommend specifically? What did you use on
> the SSD system?
>

We set:

vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 1073741824
vm.dirty_ratio = 0
vm.dirty_bytes = 2147483648

i.e. 1G for dirty_background and 2G for dirty. We didn't spend much time
afterwards fiddling with the sizes. I'm guessing we could have
made them bigger - however the SSDs were happier constantly writing a
few G than being handed (say) 50G of buffers to write at once. The
system has 512G of RAM and 32 cores (no hyperthreading).
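
(To apply settings like these persistently they would typically go into /etc/sysctl.conf, or a file under /etc/sysctl.d/, and be loaded with sysctl -p; the file name below is an example:)

# /etc/sysctl.d/30-dirty-writeback.conf
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 1073741824
vm.dirty_ratio = 0
vm.dirty_bytes = 2147483648

sudo sysctl -p /etc/sysctl.d/30-dirty-writeback.conf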

regards

Mark


Josh Berkus
2013-02-19 23:24:31 UTC
Permalink
On 02/19/2013 09:51 AM, Josh Berkus wrote:
> On 02/18/2013 08:28 PM, Mark Kirkwood wrote:
>> Might be worth looking at your vm.dirty_ratio, vm.dirty_background_ratio
>> and friends settings. We managed to choke up a system with 16x SSD by
>> leaving them at their defaults...
>
> Yeah? Any settings you'd recommend specifically? What did you use on
> the SSD system?
>

NM, I tested lowering dirty_background_ratio, and it didn't help,
because checkpoints are kicking in before pdflush ever gets there.

So the issue seems to be that if you have this combination of factors:

1. large RAM
2. many/fast CPUs
3. a database which fits in RAM but is larger than the RAID controller's
WB cache
4. pg_xlog on the same volume as pgdata

... then you'll see checkpoint "stalls", and spread checkpoints will
actually make them worse by making the stalls longer.

Moving pg_xlog to a separate partition makes this better. Making
bgwriter more aggressive helps a bit more on top of that.
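
(Moving pg_xlog is normally just a stop / move / symlink sequence; the paths below assume the stock Ubuntu 9.2 layout and an example mount point for the new volume:)

sudo service postgresql stop
sudo mv /var/lib/postgresql/9.2/main/pg_xlog /wal/pg_xlog    # mv keeps the postgres ownership
sudo ln -s /wal/pg_xlog /var/lib/postgresql/9.2/main/pg_xlog
sudo service postgresql start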

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Mark Kirkwood
2013-02-19 23:37:38 UTC
Permalink
On 20/02/13 12:24, Josh Berkus wrote:
>
> NM, I tested lowering dirty_background_ratio, and it didn't help,
> because checkpoints are kicking in before pdflush ever gets there.
>
> So the issue seems to be that if you have this combination of factors:
>
> 1. large RAM
> 2. many/fast CPUs
> 3. a database which fits in RAM but is larger than the RAID controller's
> WB cache
> 4. pg_xlog on the same volume as pgdata
>
> ... then you'll see checkpoint "stalls" and spread checkpoint will
> actually make them worse by making the stalls longer.
>
> Moving pg_xlog to a separate partition makes this better. Making
> bgwriter more aggressive helps a bit more on top of that.
>

We have pg_xlog on a pair of PCIe SSDs. We are also running the deadline
IO scheduler.

Regards

Mark


Scott Marlowe
2013-02-20 03:15:23 UTC
Permalink
On Tue, Feb 19, 2013 at 4:24 PM, Josh Berkus <***@agliodbs.com> wrote:
> ... then you'll see checkpoint "stalls" and spread checkpoint will
> actually make them worse by making the stalls longer.

Wait, if they're spread enough then there won't be a checkpoint, so to
speak. Are you saying that spreading them out means that they still
kind of pile up, even with say a completion target of 1.0 etc?


Josh Berkus
2013-02-20 22:44:34 UTC
Permalink
On 02/19/2013 07:15 PM, Scott Marlowe wrote:
> On Tue, Feb 19, 2013 at 4:24 PM, Josh Berkus <***@agliodbs.com> wrote:
>> ... then you'll see checkpoint "stalls" and spread checkpoint will
>> actually make them worse by making the stalls longer.
>
> Wait, if they're spread enough then there won't be a checkpoint, so to
> speak. Are you saying that spreading them out means that they still
> kind of pile up, even with say a completion target of 1.0 etc?

I'm saying that spreading them makes things worse, because they get
intermixed with the fsyncs for the WAL and cause commits to stall. I
tried setting checkpoint_completion_target = 0.0 and throughput got
about 10% better.

I'm beginning to think that checkpoint_completion_target should be 0.0,
by default.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Josh Berkus
2013-02-21 03:14:10 UTC
Permalink
> Sounds to me like your IO system is stalling on fsyncs or something
> like that. On machines with plenty of IO cranking up completion
> target usually smooths things out.

It certainly seems like it does. However, I can't demonstrate the issue
using any simpler tool than pgbench ... even running four test_fsyncs in
parallel didn't show any issues, nor do standard FS testing tools.
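
(i.e. something along these lines; the binary ships as pg_test_fsync in contrib on 9.2, and the target directory is an example:)

for i in 1 2 3 4; do pg_test_fsync -f /data/fsync_test_$i & done; wait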

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Jeff Frost
2013-02-25 22:30:00 UTC
Permalink
On 02/20/13 19:14, Josh Berkus wrote:
>> Sounds to me like your IO system is stalling on fsyncs or something
>> like that. On machines with plenty of IO cranking up completion
> target usually smooths things out.
> It certainly seems like it does. However, I can't demonstrate the issue
> using any simpler tool than pgbench ... even running four test_fsyncs in
> parallel didn't show any issues, nor do standard FS testing tools.
>

We were really starting to think that the system had an IO problem that we
couldn't tickle with any synthetic tools. Then one of our other customers who
upgraded to Ubuntu 12.04 LTS and is also experiencing issues came across the
following LKML thread regarding pdflush on 3.0+ kernels:

https://lkml.org/lkml/2012/10/9/210

So, I went and built a couple custom kernels with this patch removed:

https://patchwork.kernel.org/patch/825212/

and the bad behavior stopped. Best performance was with a 3.5 kernel with
the patch removed.



--
Jeff Frost <***@pgexperts.com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/



Jeff Janes
2013-02-26 21:30:42 UTC
Permalink
On Fri, Feb 15, 2013 at 10:52 AM, Scott Marlowe <***@gmail.com>wrote:

> On Fri, Feb 15, 2013 at 11:26 AM, Josh Berkus <***@agliodbs.com> wrote:
> > On 02/14/2013 08:47 PM, Scott Marlowe wrote:
> >> If you run your benchmarks for more than a few minutes I highly
> >> recommend enabling sysstat service data collection, then you can look
> >> at it after the fact with sar. VERY useful stuff both for
> >> benchmarking and post mortem on live servers.
> >
> > Well, background sar, by default on Linux, only collects every 30min.
> > For a benchmark run, you want to generate your own sar file, for example:
>
> On all my machines (debian and ubuntu) it collects every 5.
>

All of mine were 10, but once I figured out that I needed to edit /etc/cron.d/sysstat
they are now set to every 1 minute.
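
(The relevant crontab line ends up looking roughly like this; the debian-sa1 wrapper name is Debian/Ubuntu specific:)

# /etc/cron.d/sysstat -- collect activity data every minute instead of every 10
* * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1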

sar has some remarkably opaque documentation, but I'm glad I tracked that
down.

Cheers,

Jeff
Scott Marlowe
2013-02-26 21:46:32 UTC
Permalink
On Tue, Feb 26, 2013 at 2:30 PM, Jeff Janes <***@gmail.com> wrote:
> On Fri, Feb 15, 2013 at 10:52 AM, Scott Marlowe <***@gmail.com>
> wrote:
>>
>> On Fri, Feb 15, 2013 at 11:26 AM, Josh Berkus <***@agliodbs.com> wrote:
>> > On 02/14/2013 08:47 PM, Scott Marlowe wrote:
>> >> If you run your benchmarks for more than a few minutes I highly
>> >> recommend enabling sysstat service data collection, then you can look
>> >> at it after the fact with sar. VERY useful stuff both for
>> >> benchmarking and post mortem on live servers.
>> >
>> > Well, background sar, by default on Linux, only collects every 30min.
>> > For a benchmark run, you want to generate your own sar file, for
>> > example:
>>
>> On all my machines (debian and ubuntu) it collects every 5.
>
>
> All of mine were 10, but once I figured out to edit /etc/cron.d/sysstat they
> are now every 1 minute.

Oh yeah, it's every 10 minutes, on the 5s. I too need to go to 1-minute intervals.

> sar has some remarkably opaque documentation, but I'm glad I tracked that
> down.

It's so incredibly useful. When a machine is acting up, getting
it back online is often more important than fixing it right then, and most
of the system-state information is lost on the reboot / fix.


Merlin Moncure
2013-02-14 14:07:48 UTC
Permalink
On Tue, Feb 12, 2013 at 11:25 AM, Dan Kogan <***@iqtell.com> wrote:
> Hello,
>
>
>
> We upgraded from Ubuntu 11.04 to Ubuntu 12.04 and almost immediately
> observed increased CPU usage and significantly higher load average on our
> database server.
>
> At the time we were on Postgres 9.0.5. We decided to upgrade to Postgres
> 9.2 to see if that resolves the issue, but unfortunately it did not.
>
>
>
> Just for illustration purposes, below are a few links to cpu and load graphs
> pre and post upgrade.
>
>
>
> https://s3.amazonaws.com/iqtell.ops/Load+Average+Post+Upgrade.png
>
> https://s3.amazonaws.com/iqtell.ops/Load+Average+Pre+Upgrade.png
>
>
>
> https://s3.amazonaws.com/iqtell.ops/Server+CPU+Post+Upgrade.png
>
> https://s3.amazonaws.com/iqtell.ops/Server+CPU+Pre+Upgrade.png
>
>
>
> We also tried tweaking kernel parameters as mentioned here -
> http://www.postgresql.org/message-id/***@optionshouse.com, but
> have not seen any improvement.
>
>
>
>
>
> Any advice on how to trace what could be causing the change in CPU usage and
> load average is appreciated.
>
>
>
> Our postgres version is:
>
>
>
> PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro
> 4.6.3-1ubuntu5) 4.6.3, 64-bit
>
>
>
> OS:
>
>
>
> Linux ip-10-189-175-25 3.2.0-37-virtual #58-Ubuntu SMP Thu Jan 24 15:48:03
> UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>
>
>
> Hardware (this is an Amazon EC2 High-Memory Quadruple Extra Large instance):
>
>
>
> 8 core Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
>
> 68 GB RAM
>
> RAID10 with 8 drives using xfs
>
> Drives are EBS with provisioned IOPS, with 1000 iops each
>
>
>
> Postgres Configuration:
>
>
>
> archive_command = rsync -a %p slave:/var/lib/postgresql/replication_load/%f
>
> archive_mode = on
>
> checkpoint_completion_target = 0.9
>
> checkpoint_segments = 64
>
> checkpoint_timeout = 30min
>
> default_text_search_config = pg_catalog.english
>
> external_pid_file = /var/run/postgresql/9.2-main.pid
>
> lc_messages = en_US.UTF-8
>
> lc_monetary = en_US.UTF-8
>
> lc_numeric = en_US.UTF-8
>
> lc_time = en_US.UTF-8
>
> listen_addresses = *
>
> log_checkpoints=on
>
> log_destination=stderr
>
> log_line_prefix = %t [%p]: [%l-1]
>
> log_min_duration_statement =500
>
> max_connections=300
>
> max_stack_depth=2MB
>
> max_wal_senders=5
>
> shared_buffers=4GB
>
> synchronous_commit=off
>
> unix_socket_directory=/var/run/postgresql
>
> wal_keep_segments=128
>
> wal_level=hot_standby
>
> work_mem=8MB

Does your application have a lot of concurrency? History has shown
that postgres is highly sensitive to changes in the O/S scheduler
(which changes a lot from release to release).

also check this:
zone reclaim (http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
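
(zone_reclaim_mode can be checked and, if need be, disabled at runtime, e.g.:)

cat /proc/sys/vm/zone_reclaim_mode        # 0 means zone reclaim is off
echo 0 > /proc/sys/vm/zone_reclaim_mode   # as root, if it is not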

merlin


Dan Kogan
2013-02-14 17:27:13 UTC
Permalink
Thanks for the info.
Our application does have a lot of concurrency. We checked the zone reclaim parameter and it is turned off (that was the default; we did not have to change it).

Dan

-----Original Message-----
From: Merlin Moncure [mailto:***@gmail.com]
Sent: Thursday, February 14, 2013 9:08 AM
To: Dan Kogan
Cc: pgsql-***@postgresql.org
Subject: Re: [PERFORM] High CPU usage / load average after upgrading to Ubuntu 12.04

On Tue, Feb 12, 2013 at 11:25 AM, Dan Kogan <***@iqtell.com> wrote:
> Hello,
>
>
>
> We upgraded from Ubuntu 11.04 to Ubuntu 12.04 and almost immediately
> observed increased CPU usage and significantly higher load average on
> our database server.
>
> At the time we were on Postgres 9.0.5. We decided to upgrade to
> Postgres
> 9.2 to see if that resolves the issue, but unfortunately it did not.
>
>
>
> Just for illustration purposes, below are a few links to cpu and load
> graphs pre and post upgrade.
>
>
>
> https://s3.amazonaws.com/iqtell.ops/Load+Average+Post+Upgrade.png
>
> https://s3.amazonaws.com/iqtell.ops/Load+Average+Pre+Upgrade.png
>
>
>
> https://s3.amazonaws.com/iqtell.ops/Server+CPU+Post+Upgrade.png
>
> https://s3.amazonaws.com/iqtell.ops/Server+CPU+Pre+Upgrade.png
>
>
>
> We also tried tweaking kernel parameters as mentioned here -
> http://www.postgresql.org/message-id/***@optionshouse.com
> , but have not seen any improvement.
>
>
>
>
>
> Any advice on how to trace what could be causing the change in CPU
> usage and load average is appreciated.
>
>
>
> Our postgres version is:
>
>
>
> PostgreSQL 9.2.2 on x86_64-unknown-linux-gnu, compiled by gcc
> (Ubuntu/Linaro
> 4.6.3-1ubuntu5) 4.6.3, 64-bit
>
>
>
> OS:
>
>
>
> Linux ip-10-189-175-25 3.2.0-37-virtual #58-Ubuntu SMP Thu Jan 24
> 15:48:03 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>
>
>
> Hardware (this is an Amazon EC2 High-Memory Quadruple Extra Large instance):
>
>
>
> 8 core Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
>
> 68 GB RAM
>
> RAID10 with 8 drives using xfs
>
> Drives are EBS with provisioned IOPS, with 1000 iops each
>
>
>
> Postgres Configuration:
>
>
>
> archive_command = rsync -a %p
> slave:/var/lib/postgresql/replication_load/%f
>
> archive_mode = on
>
> checkpoint_completion_target = 0.9
>
> checkpoint_segments = 64
>
> checkpoint_timeout = 30min
>
> default_text_search_config = pg_catalog.english
>
> external_pid_file = /var/run/postgresql/9.2-main.pid
>
> lc_messages = en_US.UTF-8
>
> lc_monetary = en_US.UTF-8
>
> lc_numeric = en_US.UTF-8
>
> lc_time = en_US.UTF-8
>
> listen_addresses = *
>
> log_checkpoints=on
>
> log_destination=stderr
>
> log_line_prefix = %t [%p]: [%l-1]
>
> log_min_duration_statement =500
>
> max_connections=300
>
> max_stack_depth=2MB
>
> max_wal_senders=5
>
> shared_buffers=4GB
>
> synchronous_commit=off
>
> unix_socket_directory=/var/run/postgresql
>
> wal_keep_segments=128
>
> wal_level=hot_standby
>
> work_mem=8MB

does your application have a lot of concurrency? history has shown that postgres is highly sensitive to changes in the o/s scheduler (which changes a lot from release to release).

also check this:
zone reclaim (http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)

merlin

