Discussion:
Two Necessary Kernel Tweaks for Linux Systems
Shaun Thomas
2013-01-02 21:46:25 UTC
Permalink
Hey everyone!

After much testing and hair-pulling, we've confirmed two kernel settings
that should always be modified in production Linux systems. Especially
new ones with the completely fair scheduler (CFS) as opposed to the O(1)
scheduler.

If you want to follow along, these are:

/proc/sys/kernel/sched_migration_cost
/proc/sys/kernel/sched_autogroup_enabled

Which correspond to sysctl settings:

kernel.sched_migration_cost
kernel.sched_autogroup_enabled
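
To peek at what your system currently uses (a quick sketch; works on any
box where sysctl exposes these keys):

# Print the current values (500000 ns is the usual migration cost
# default; autogroup may be 0 or 1 depending on the distro):
sysctl kernel.sched_migration_cost kernel.sched_autogroup_enabled

# Or read them straight from /proc:
cat /proc/sys/kernel/sched_migration_cost
cat /proc/sys/kernel/sched_autogroup_enabled

If sysctl reports the keys as unknown, your kernel simply doesn't have
them (more on that further down the thread).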

What do these settings do?
--------------------------

* sched_migration_cost

The migration cost is the total time the scheduler will consider a
migrated process "cache hot" and thus less likely to be re-migrated. By
default, this is 0.5ms (500000 ns), and as the size of the process table
increases, it eventually causes the scheduler to break down. On our
systems, after a smooth degradation with increasing connection count,
system CPU spiked from 20% to 70% sustained and TPS was cut by 5-10x once
we crossed some invisible connection count threshold. For us, that was a
pgbench run with 900 or more clients.

The migration cost should be increased, almost universally, on server
systems with many processes. This means services like PostgreSQL or
Apache would benefit from higher migration costs. We've had good
luck with a setting of 5ms (5000000 ns) instead.

When the breakdown occurs, system CPU (as obtained from sar) increases
from 20% on a heavy pgbench (scale 3500 on a 72GB system) to over 70%,
and %nice/%user is cut by half or more. A higher migration cost
essentially eliminates this artificial throttle.

* sched_autogroup_enabled

This is a relatively new patch which Linus lauded back in late 2010. It
basically groups tasks by TTY so perceived responsiveness is improved.
But on server systems, large daemons like PostgreSQL are going to be
launched from the same pseudo-TTY, and be effectively choked out of CPU
cycles in favor of less important tasks.

The default setting is 1 (enabled) on some platforms. By setting this to
0 (disabled), we saw an outright 30% performance boost on the same
pgbench test. A fully cached scale 3500 database on a 72GB system went
from 67k TPS to 82k TPS with 900 client connections.
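
To actually try both changes, something along these lines should do it
(run as root; the values are the ones described above, and the runtime
changes revert on reboot):

# Apply at runtime:
sysctl -w kernel.sched_migration_cost=5000000
sysctl -w kernel.sched_autogroup_enabled=0

# Equivalent via /proc:
echo 5000000 > /proc/sys/kernel/sched_migration_cost
echo 0 > /proc/sys/kernel/sched_autogroup_enabled

# To persist across reboots, add these two lines to /etc/sysctl.conf
# (or a file under /etc/sysctl.d/) and reload with "sysctl -p":
#   kernel.sched_migration_cost = 5000000
#   kernel.sched_autogroup_enabled = 0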

Total Benefit
-------------

At higher connection counts, such as on systems that can't use pooling
or that make extensive use of prepared queries, these settings can
massively affect performance. At 900 connections, our test systems were
at 17k TPS unaltered, but 85k TPS after these two modifications. Even
with this performance boost, we still had 40% CPU free instead of 0%. In
effect, the logarithmic performance of the new scheduler is returned to
normal under large process tables.

Some systems will have a higher "cracking" point than others. The effect
is amplified when a system is under high memory pressure, so running a
lot of expensive queries across a high number of concurrent connections
is the easiest way to replicate these results.

Admins migrating from older systems (RHEL 5.x) may find this especially
shocking, because the old O(1) scheduler was too "stupid" to have these
advanced features, so it was impossible to trigger this kind of behavior.

There's probably still a little room for improvement here, since 30-40%
CPU is still unclaimed in our larger tests. I'd like to see the drop from
ideal performance (175k TPS at 24 connections) reduced further. But these
kernel tweaks are rarely discussed anywhere, it seems. There doesn't
seem to be any consensus on how these (and other) scheduler settings
should be modified under different usage scenarios.

I just figured I'd share, since we found this info so beneficial.
--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
***@optionshouse.com

Richard Neill
2013-01-03 00:47:22 UTC
Permalink
Dear Shaun,

Thanks for that - it's really interesting to know.
Post by Shaun Thomas
Hey everyone!
After much testing and hair-pulling, we've confirmed two kernel
settings that should always be modified in production Linux systems.
Especially new ones with the completely fair scheduler (CFS) as
opposed to the O(1) scheduler.
Does it apply to all types of production system, or just to certain
workloads?

For example, what happens when there are only one or two concurrent
processes? (i.e. there are always several more CPU cores than there are
actual connections).
Post by Shaun Thomas
* sched_autogroup_enabled
This is a relatively new patch which Linus lauded back in late 2010.
It basically groups tasks by TTY so perceived responsiveness is
improved. But on server systems, large daemons like PostgreSQL are
going to be launched from the same pseudo-TTY, and be effectively
choked out of CPU cycles in favor of less important tasks.
I've got several production servers using Postgres: I'd like to squeeze
a bit more performance out of them, but in all cases, one (sometimes
two) CPU cores are (sometimes) maxed out, but there are always several
cores permanently idling. So does this apply here?

Thanks for your advice,

Richard
Merlin Moncure
2013-01-07 19:22:13 UTC
Permalink
Post by Shaun Thomas
Hey everyone!
After much testing and hair-pulling, we've confirmed two kernel settings
that should always be modified in production Linux systems. Especially new
ones with the completely fair scheduler (CFS) as opposed to the O(1)
scheduler.
[cut]
Post by Shaun Thomas
I just figured I'd share, since we found this info so beneficial.
This is fantastic info.

Vlad, you might want to check this out and see if it has any impact on
your high CPU case... via:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html

merlin
Andrea Suisani
2013-01-08 08:29:59 UTC
Permalink
Post by Shaun Thomas
Hey everyone!
After much testing and hair-pulling, we've confirmed two kernel settings that
should always be modified in production Linux systems. Especially new ones with
the completely fair scheduler (CFS) as opposed to the O(1) scheduler.
[cut]
Post by Shaun Thomas
I just figured I'd share, since we found this info so beneficial.
I just want to confirm that on our relatively small
test server those tweaks give us a 25% performance boost!

Really appreciated Shaun.

thanks
Andrea
Andrea Suisani
2013-01-08 15:16:12 UTC
Permalink
Post by Andrea Suisani
Post by Shaun Thomas
Hey everyone!
After much testing and hair-pulling, we've confirmed two kernel settings that
should always be modified in production Linux systems. Especially new ones with
the completely fair scheduler (CFS) as opposed to the O(1) scheduler.
[cut]
Post by Shaun Thomas
I just figured I'd share, since we found this info so beneficial.
I just want to confirm that on our relatively small
test server those tweaks give us a 25% performance boost!
12.5% sorry for the typo...
Post by Andrea Suisani
Really appreciated Shaun.
thanks
Andrea
Midge Brown
2013-01-08 18:25:52 UTC
Permalink
The kernel on our Linux system doesn't appear to have these two settings according to the list provided by sysctl -a. Please pardon my ignorance, but should I add them?

We have PostgreSQL 9.0 on Linux 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Thanks,
Midge

----- Original Message -----
From: Shaun Thomas
To: pgsql-***@postgresql.org
Sent: Wednesday, January 02, 2013 1:46 PM
Subject: [PERFORM] Two Necessary Kernel Tweaks for Linux Systems


[cut]
Shaun Thomas
2013-01-08 18:28:25 UTC
Permalink
Post by Midge Brown
The kernel on our Linux system doesn't appear to have these two
settings according to the list provided by sysctl -a. Please pardon
my ignorance, but should I add them?
Sorry if I wasn't more clear. These only apply to Linux systems with the
Completely Fair Scheduler, as opposed to the O(1) scheduler. For all
intents and purposes, this means 3.0 kernels and above.

With a 2.6 kernel, you're fine.

Effectively these changes fix what is basically a performance regression
compared to older kernels.
--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
***@optionshouse.com

Scott Marlowe
2013-01-08 18:31:13 UTC
Permalink
Post by Shaun Thomas
Post by Midge Brown
The kernel on our Linux system doesn't appear to have these two
settings according to the list provided by sysctl -a. Please pardon
my ignorance, but should I add them?
Sorry if I wasn't more clear. These only apply to Linux systems with the
Completely Fair Scheduler, as opposed to the O(1) scheduler. For all intents
and purposes, this means 3.0 kernels and above.
With a 2.6 kernel, you're fine.
Effectively these changes fix what is basically a performance regression
compared to older kernels.
What's the comparison of these settings versus say going to the NOP scheduler?
Shaun Thomas
2013-01-08 18:36:50 UTC
Permalink
Post by Scott Marlowe
What's the comparison of these settings versus say going to the NOP scheduler?
Assuming you actually meant NOP and not the NOOP I/O scheduler, I don't
know. These CPU scheduler tweaks are all I could dig up, and googling
for NOP by itself or combined with Linux terms is tremendously unhelpful.
--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
***@optionshouse.com

Scott Marlowe
2013-01-08 19:04:36 UTC
Permalink
Post by Shaun Thomas
Post by Scott Marlowe
What's the comparison of these settings versus say going to the NOP scheduler?
Assuming you actually meant NOP and not the NOOP I/O scheduler, I don't
know. These CPU scheduler tweaks are all I could dig up, and googling for
NOP by itself or combined with Linux terms is tremendously unhelpful.
Assembly language on the brain. of course I meant NOOP.
Shaun Thomas
2013-01-08 19:32:14 UTC
Permalink
Post by Scott Marlowe
Assembly language on the brain. of course I meant NOOP.
Ok, in that case, these are completely separate things. For IO
scheduling, there's the Completely Fair Queue (CFQ), NOOP, Deadline, and
so on.

For process scheduling, at least recently, there's Completely Fair
Scheduler or nothing. So far as I can tell, there is no alternative
process scheduler. Just as I can't find an alternative memory manager
that I can tell to stop flushing my freaking active file cache due to
phantom memory pressure. ;)

The tweaks I was discussing in this thread effectively do two things:

1. Stop process grouping by TTY.

On servers, TTY grouping really is a net performance loss, especially on
heavily forked apps like PG. System % is about 5% lower with grouping
since the scheduler is doing less work, but at the cost of less spreading
across available CPUs. Our systems see a 30% performance hit with
grouping enabled; others may see more or less.

2. Less aggressive process scheduling.

The O(log N) scheduler heuristics collapse at high process counts for
some reason, causing the scheduler to spend more and more time planning
CPU assignments until it spirals completely out of control. I've seen
this behavior on 3.0 kernels straight to 3.5, so it looks like an
inherent weakness of CFS. By increasing migration cost, we make the
scheduler do less work less often, so that weird 70+% system CPU spike
vanishes.

My guess is the increased migration cost basically offsets the point at
which the scheduler would freak out. I've tested up to 2000 connections,
and it responds fine, whereas before we were seeing flaky results as
early as 700 connections.

My guess as to why this is? I think it's due to VSZ as perceived by the
scheduler. To swap processes, it also has to preload L2 and L3 cache for
the assigned process. As the number of PG connections increase, all with
their own VSZ/RSS allocations, the scheduler has more thinking to do. At
a point when the sum of VSZ/RSS eclipses the amount of available RAM,
the scheduler loses nearly all decision-making ability and craps its pants.

This would also explain why I'm seeing something similar with memory. At
high connection counts, even though %used is fine and we have over 40GB
free for caching, VSZ/RSS are both way bigger than available cache, so
memory pressure causes kswapd to continuously purge the active cache
pool into inactive, and inactive into free, all while the device
attempts to fill the active pool. It's an IO feedback loop, and it shows
up around the same number of connections that used to make the process
scheduler die. Too much of a coincidence, in my opinion.

But unlike the process scheduler, there are no good knobs to turn that
will fix the memory manager's behavior. At least, not in 3.0, 3.2, or
3.4 kernels.

But I freely admit I'm just speculating based on observed behavior. I
know neither jack, nor squat about internal kernel mechanics. Anyone who
actually *isn't* talking out of his ass is free to interject. :)
--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
***@optionshouse.com

AJ Weber
2013-01-08 20:05:56 UTC
Permalink
I checked, and both of these settings exist on my CentOS 6.x host
(2.6.32-279.5.1.el6.x86_64).

However, autogroup_enabled was already set to 0. (The
migration_cost was set to the 0.5ms default noted in the OP.) So I
don't know if this is strictly limited to kernel 3.0.

Is there an "easy" way to tell what scheduler my OS is using?

-AJ
Post by Shaun Thomas
Post by Scott Marlowe
Assembly language on the brain. of course I meant NOOP.
Ok, in that case, these are completely separate things. For IO
scheduling, there's the Completely Fair Queue (CFQ), NOOP, Deadline,
and so on.
[cut]
Shaun Thomas
2013-01-08 21:48:38 UTC
Permalink
Post by AJ Weber
Is there an "easy" way to tell what scheduler my OS is using?
Unfortunately not. I looked again, and it seems that CFS was merged into
2.6.23. Anything before that is probably safe, but the vendor may have
backported it. If you don't see the settings I described, you probably
don't have it.
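
A quick way to check, for anyone unsure (just a sketch):

# If both files show up, the CFS knobs from this thread are present:
ls /proc/sys/kernel/sched_migration_cost \
   /proc/sys/kernel/sched_autogroup_enabled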

So I guess Midge had 2.6.18, which predates the merge in 2.6.23.

I honestly don't understand the Linux kernel sometimes. A process
scheduler swap is a *gigantic* functional change, and it's in a dot
release. I vastly prefer PostgreSQL's approach...
--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
***@optionshouse.com

Alan Hodgson
2013-01-08 23:24:33 UTC
Permalink
Post by Shaun Thomas
Post by AJ Weber
Is there an "easy" way to tell what scheduler my OS is using?
Unfortunately not. I looked again, and it seems that CFS was merged into
2.6.23. Anything before that is probably safe, but the vendor may have
backported it. If you don't see the settings I described, you probably
don't have it.
So I guess Midge had 2.6.18, which predates the merge in 2.6.23.
I honestly don't understand the Linux kernel sometimes. A process
scheduler swap is a *gigantic* functional change, and it's in a dot
release. I vastly prefer PostgreSQL's approach...
Red Hat also selectively backports major functionality into their enterprise
kernels. If you're running RHEL or a clone like CentOS, the reported kernel
version has little bearing on what may or may not be in your kernel.

They're very well tested and stable, so there's nothing wrong with them, per
se, but you can't just say oh, you have version xxx, you don't have this
functionality.
Boszormenyi Zoltan
2013-01-14 14:28:48 UTC
Permalink
Post by Shaun Thomas
Post by AJ Weber
Is there an "easy" way to tell what scheduler my OS is using?
Unfortunately not. I looked again, and it seems that CFS was merged into 2.6.23.
Anything before that is probably safe, but the vendor may have backported it. If you
don't see the settings I described, you probably don't have it.
So I guess Midge had 2.6.18, which predates the merge in 2.6.23.
I honestly don't understand the Linux kernel sometimes. A process scheduler swap is a
*gigantic* functional change, and it's in a dot release. I vastly prefer PostgreSQL's
approach...
The kernel version numbering is different.
A point release in 2.6.x is 2.6.x.y.
This changed in 3.x, where a point release is 3.x.y.

Best regards,
Zoltán Böszörményi
--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/
Henri Philipps
2013-01-10 08:51:26 UTC
Permalink
Hi,

we also hit this performance barrier a while ago, when migrating a
database on a big server (48 core Opteron, 512GB RAM) from Kernel
2.6.32 to 3.2 (both kernels from Debian packages). The system load was
getting very high, as you also observed (don't know the exact numbers
right now).

After some investigation I found out that the reason for the high
system load was that the postgresql processes were migrating from core
to core at very high rates. So the behaviour of the CFS scheduler must
have changed in this regard between the 2.6.32 and 3.2 kernels.

You can easily see this if you look at how much time the
migration kernel threads spend on the CPU (ps ax | grep migration). A
look into /proc/sched_debug can also give you some more insight into
the scheduler behaviour.

On NUMA systems the scheduler tries to migrate processes to the nodes
on which they have the best memory locality. But on a big database one
process is typically reading randomly from a dataset which is spread
across all nodes. On newer kernels the CFS scheduler seems to try more
aggressively to migrate processes to other cores. I don't know if it
is for better load balancing or for better memory locality. But
process migrations consume a lot of resources.

I had to change sched_migration_cost from 500000 (0.5ms) to 100000000
(100ms). This means the scheduler only considers a task for
migration if the task has been running for at least 100ms instead of
0.5ms. This solved the problem for us - the migration kernel threads
didn't have to do much work anymore, and thus the system load went down
again.
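
For reference, the change itself is just the one sysctl (again, only a
sketch; pick a value that suits your own workload):

# 100ms instead of the 0.5ms default
sysctl -w kernel.sched_migration_cost=100000000

# Watch how much CPU time the migration kernel threads accumulate
ps -eo pid,time,comm | grep migration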

A general problem is, that the CFS scheduler has a lot of changes
between all kernel versions, so it is really hard to predict which
regressions you can hit when going to another kernel version.
Scheduling on NUMA systems is also very complex.

An interesting dissertation showing the inconsistent behaviour of the
CFS scheduler:
http://research.cs.wisc.edu/adsl/Publications/meehean-thesis11.pdf

Some parameters which could also be considered for systematic benchmarking are

sched_latency_ns
sched_min_granularity_ns

I guess that higher numbers could improve performance too on systems
with many cores and many connections.
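
If someone wants to benchmark them, reading and changing them is the
same sysctl dance (the numbers below are arbitrary placeholders, not
recommendations):

# Current values:
sysctl kernel.sched_latency_ns kernel.sched_min_granularity_ns

# Example of raising them:
sysctl -w kernel.sched_latency_ns=60000000
sysctl -w kernel.sched_min_granularity_ns=8000000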

Thanks for starting this interesting thread!

Henri
Shaun Thomas
2013-01-10 15:53:25 UTC
Permalink
Post by Henri Philipps
http://research.cs.wisc.edu/adsl/Publications/meehean-thesis11.pdf
Wow, that was pretty interesting. It looks like for servers, the O(1)
scheduler is much better even with the assignment bug he identified, and
BFS responds better to varying load than CFS.

It's too bad the paper is so old and only considers the 2.6 kernel. I'd
love to see this type of research applied to the latest.
Post by Henri Philipps
sched_latency_ns
sched_min_granularity_ns
I guess that higher numbers could improve performance too on systems
with many cores and many connections.
I messed around with these a bit. Settings 10x smaller and 10x larger
didn't do anything appreciable that I noticed. Performance metrics were
within variance of my earlier tests. Only autogrouping and migration
cost had any appreciable effect.

I'm glad we weren't the only ones who ran into this, too. You settled on
a much higher setting than we did, but the end result was the same. I
wonder how prevalent this will become as more servers are switched over
to newer kernels in the next couple of years. Hopefully more people
start complaining so they fix it. :)
--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
***@optionshouse.com

AJ Weber
2013-01-23 16:53:57 UTC
Permalink
I have a server that is IO-bound right now (it's 4 cores, and top
indicates the use rarely hits 25%, but the Wait spikes above 25-40%
regularly). The server is running postgresql 9.0 and tomcat 6. As I
have mentioned in a previous thread, I can't alter the hardware to add
disks unfortunately, so I'm going to try and move postgresql off this
application server to its own host, but this is a production
environment, so in the meantime...

Is it possible that some spikes in IO could be attributable to the
autovacuum process? Is there a way to check this theory?

Would it be advisable (or even permissible to try/test) to disable
autovacuum, and schedule a manual vacuumdb in the middle of the night,
when this server is mostly-idle?

Thanks for any tips. I'm in a bit of a jam with my limited hardware.

-AJ
Evgeniy Shishkin
2013-01-23 19:03:25 UTC
Permalink
Post by AJ Weber
Is it possible that some spikes in IO could be attributable to the
autovacuum process? Is there a way to check this theory?
Try iotop.
Jeff Janes
2013-01-23 19:13:48 UTC
Permalink
Post by AJ Weber
I have a server that is IO-bound right now (it's 4 cores, and top indicates
the use rarely hits 25%, but the Wait spikes above 25-40% regularly).
How long do the spikes last?
Post by AJ Weber
The server is running postgresql 9.0 and tomcat 6. As I have mentioned in a
previous thread, I can't alter the hardware to add disks unfortunately, so
I'm going to try and move postgresql off this application server to its own
host, but this is a production environment, so in the meantime...
Is it possible that some spikes in IO could be attributable to the
autovacuum process? Is there a way to check this theory?
set log_autovacuum_min_duration to 0 or some positive number, and see
if the vacuums correlate with periods of io stress (from sar or
vmstat, for example--the problem is that sar only takes snapshots
every 10 minutes, which is too coarse if the spikes are short).
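
For example (the data directory path here is just a placeholder):

# In postgresql.conf:
#   log_autovacuum_min_duration = 0
# then reload -- this setting doesn't need a full restart:
pg_ctl -D /path/to/data reload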
Post by AJ Weber
Would it be advisable (or even permissible to try/test) to disable
autovacuum, and schedule a manual vacuumdb in the middle of the night, when
this server is mostly-idle?
Scheduling a manual vacuum should be fine (but keep in mind that
vacuum has very different default cost_delay settings than autovacuum
does. If the server is completely idle that shouldn't matter, but if
it is only mostly idle, you might want to throttle the IO a bit). But
I certainly would not disable autovacuum without further evidence. If
a table only needs to be vacuumed once a day and you preemptively do
it at 3a.m., then autovac won't bother to do it itself during the day.
So there is no point, but much risk, in also turning autovac off.

Cheers,

Jeff
AJ Weber
2013-01-23 22:48:10 UTC
Permalink
Post by Jeff Janes
Post by AJ Weber
I have a server that is IO-bound right now (it's 4 cores, and top indicates
the use rarely hits 25%, but the Wait spikes above 25-40% regularly).
How long do the spikes last?
From what I can gather, a few seconds to a few minutes.
Post by Jeff Janes
Post by AJ Weber
The server is running postgresql 9.0 and tomcat 6. As I have mentioned in a
previous thread, I can't alter the hardware to add disks unfortunately, so
I'm going to try and move postgresql off this application server to its own
host, but this is a production environment, so in the meantime...
Is it possible that some spikes in IO could be attributable to the
autovacuum process? Is there a way to check this theory?
set log_autovacuum_min_duration to 0 or some positive number, and see
if the vacuums correlate with periods of io stress (from sar or
vmstat, for example--the problem is that sar only takes snapshots
every 10 minutes, which is too coarse if the spikes are short).
I used iotop last time it was going crazy, and there were 5 postgres
procs at the top of the list (and virtually nothing else) all doing a
SELECT. So I'm also going to restart the DB this weekend with
log-min-duration enabled. Could also be some misbehaving queries...

Is there a skinny set of instructions on loading pg_stat_statements? Or
should I just log them and review them from there?
Post by Jeff Janes
Post by AJ Weber
Would it be advisable (or even permissible to try/test) to disable
autovacuum, and schedule a manual vacuumdb in the middle of the night, when
this server is mostly-idle?
Scheduling a manual vacuum should be fine (but keep in mind that
vacuum has very different default cost_delay settings than autovacuum
does. If the server is completely idle that shouldn't matter, but if
it is only mostly idle, you might want to throttle the IO a bit). But
I certainly would not disable autovacuum without further evidence. If
a table only needs to be vacuumed once a day and you preemptively do
it at 3a.m., then autovac won't bother to do it itself during the day.
So there is no point, but much risk, in also turning autovac off.
If I set autovacuum_max_workers = 1, will that effectively single-thread
it so I don't have two running at once? Maybe that'll mitigate disk
contention a little at least?
Post by Jeff Janes
Cheers,
Jeff
Alvaro Herrera
2013-01-24 02:03:04 UTC
Permalink
Post by AJ Weber
Post by Jeff Janes
Scheduling a manual vacuum should be fine (but keep in mind that
vacuum has very different default cost_delay settings than autovacuum
does. If the server is completely idle that shouldn't matter, but if
it is only mostly idle, you might want to throttle the IO a bit). But
I certainly would not disable autovacuum without further evidence. If
a table only needs to be vacuumed once a day and you preemptively do
it at 3a.m., then autovac won't bother to do it itself during the day.
So there is no point, but much risk, in also turning autovac off.
If I set autovacuum_max_workers = 1, will that effectively
single-thread it so I don't have two running at once? Maybe that'll
mitigate disk contention a little at least?
If you have a single one, it will go three times as fast, because the
vacuum cost limit is shared out among the active workers. If you want
to make the whole thing go slower (i.e. cause less impact on your I/O
system when running), crank up autovacuum_vacuum_cost_delay.
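
As a sketch (values and data directory path are only illustrative):

# In postgresql.conf:
#   autovacuum_max_workers = 1           # needs a restart to change
#   autovacuum_vacuum_cost_delay = 50ms  # higher = gentler on the disks
# then reload/restart as appropriate:
pg_ctl -D /path/to/data restart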
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Jeff Janes
2013-01-24 03:06:14 UTC
Permalink
Post by AJ Weber
Is there a skinny set of instructions on loading pg_stat_statements? Or
should I just log them and review them from there?
Make sure you have installed contrib. (How you do that depends on how you
installed PostgreSQL in the first place. If you installed from source, then
just follow "sudo make install" with "cd contrib; sudo make install")


Then, just change postgresql.conf so that

shared_preload_libraries = 'pg_stat_statements'

And restart the server.

Then in psql run

create extension pg_stat_statements ;
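
Once it's loaded, a quick way to see the heaviest statements (one
possible query; "yourdb" is a placeholder):

psql -d yourdb -c "SELECT calls, total_time, query
                     FROM pg_stat_statements
                    ORDER BY total_time DESC LIMIT 10;"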

Cheers,

Jeff

Kevin Grittner
2013-01-23 22:17:00 UTC
Permalink
Post by AJ Weber
Is it possible that some spikes in IO could be attributable to
the autovacuum process? Is there a way to check this theory?
Taking a look at the ps aux listing, pg_stat_activity, and pg_locks
should help establish a cause, or at least rule out a number of
possibilities. There is a known issue with autovacuum when it tries
to reduce the size of a table which is found to be larger than it
currently needs to be while other transactions try to access the
table. This issue will be fixed in the next minor release for 9.0
and above. If this is the issue, a manual VACUUM ANALYZE will fix
things -- at least for a while.

-Kevin