Discussion:
backend suddenly becomes slow, then remains slow
Andrew Dunstan
2012-12-14 18:40:04 UTC
One of my clients has an odd problem. Every so often a backend will
suddenly become very slow. The odd thing is that once this has happened
it remains slowed down, for all subsequent queries. Zone reclaim is off.
There is no IO or CPU spike, no checkpoint issues or stats timeouts, no
other symptom that we can see. The problem was a lot worse than it is
now; two steps have alleviated it mostly, but not completely: much
less aggressive autovacuuming and reducing the maximum lifetime of
backends in the connection pooler to 30 minutes.

It's got us rather puzzled. Has anyone seen anything like this?

cheers

andrew
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Tom Lane
2012-12-14 19:56:00 UTC
Post by Andrew Dunstan
One of my clients has an odd problem. Every so often a backend will
suddenly become very slow. The odd thing is that once this has happened
it remains slowed down, for all subsequent queries. Zone reclaim is off.
There is no IO or CPU spike, no checkpoint issues or stats timeouts, no
other symptom that we can see. The problem was a lot worse than it is
now; two steps have alleviated it mostly, but not completely: much
less aggressive autovacuuming and reducing the maximum lifetime of
backends in the connection pooler to 30 minutes.
It's got us rather puzzled. Has anyone seen anything like this?

Maybe the kernel is auto-nice'ing the process once it's accumulated X
amount of CPU time?
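
One way to test this hypothesis is to compare the nice value of a slow
backend with that of a healthy one. The following is only a minimal
sketch, assuming a Unix host and Python 3.3 or later; the helper name
nice_value is illustrative, and the PID of the backend would have to be
looked up separately (for example with ps).

```python
import os


def nice_value(pid: int) -> int:
    """Return the current nice value of the process with this PID (Unix only)."""
    return os.getpriority(os.PRIO_PROCESS, pid)


if __name__ == "__main__":
    # Assumption: in practice, substitute the PID of the slow backend.
    # Here the script inspects itself so that it runs anywhere.
    pid = os.getpid()
    print(f"pid={pid} nice={nice_value(pid)}")
```

If a slow backend reports a higher nice value than a freshly started
one, the scheduler has deprioritized it; if the values match, this
explanation can probably be ruled out.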

regards, tom lane
Andrew Dunstan
2012-12-14 20:16:19 UTC
Post by Tom Lane
Post by Andrew Dunstan
One of my clients has an odd problem. Every so often a backend will
suddenly become very slow. The odd thing is that once this has happened
it remains slowed down, for all subsequent queries. Zone reclaim is off.
There is no IO or CPU spike, no checkpoint issues or stats timeouts, no
other symptom that we can see. The problem was a lot worse than it is
now; two steps have alleviated it mostly, but not completely: much
less aggressive autovacuuming and reducing the maximum lifetime of
backends in the connection pooler to 30 minutes.
It's got us rather puzzled. Has anyone seen anything like this?
Maybe the kernel is auto-nice'ing the process once it's accumulated X
amount of CPU time?

That was my initial thought, but the client said not. We'll check again.

cheers

andrew
Jeff Janes
2012-12-27 04:03:33 UTC
On Fri, Dec 14, 2012 at 10:40 AM, Andrew Dunstan wrote:
Post by Andrew Dunstan
One of my clients has an odd problem. Every so often a backend will
suddenly become very slow. The odd thing is that once this has happened it
remains slowed down, for all subsequent queries. Zone reclaim is off. There
is no IO or CPU spike, no checkpoint issues or stats timeouts, no other
symptom that we can see.

By "no spike", do you mean that the system as a whole is not using an
unusual amount of IO or CPU, or that this specific slow back-end is not
using an unusual amount?

Could you strace it and see what it is doing?

Post by Andrew Dunstan
The problem was a lot worse than it is now; two steps have alleviated it
mostly, but not completely: much less aggressive autovacuuming and reducing
the maximum lifetime of backends in the connection pooler to 30 minutes.

Do you have a huge number of tables? Maybe over the course of a long-lived
connection, it touches enough tables to bloat the relcache / syscache. I
don't know how the autovac would be involved in that, though.


Cheers,

Jeff
Andrew Dunstan
2012-12-27 17:43:31 UTC
Post by Jeff Janes
On Fri, Dec 14, 2012 at 10:40 AM, Andrew Dunstan wrote:
Post by Andrew Dunstan
One of my clients has an odd problem. Every so often a backend will
suddenly become very slow. The odd thing is that once this has happened it
remains slowed down, for all subsequent queries. Zone reclaim is off. There
is no IO or CPU spike, no checkpoint issues or stats timeouts, no other
symptom that we can see.
By "no spike", do you mean that the system as a whole is not using an
unusual amount of IO or CPU, or that this specific slow back-end is
not using an unusual amount?

Both, really.

Post by Jeff Janes
Could you strace it and see what it is doing?

Not very easily, because it's a pool connection and we've lowered the
pool session lifetime as part of the amelioration :-) So it's not
happening very much any more.

Post by Jeff Janes
Post by Andrew Dunstan
The problem was a lot worse than it is now; two steps have alleviated it
mostly, but not completely: much less aggressive autovacuuming and reducing
the maximum lifetime of backends in the connection pooler to 30 minutes.
Do you have a huge number of tables? Maybe over the course of a
long-lived connection, it touches enough tables to bloat the relcache
/ syscache. I don't know how the autovac would be involved in that,
though.

Yes, we do indeed have a huge number of tables. This seems a plausible
thesis.

cheers

andrew
Jeff Janes
2012-12-28 00:11:23 UTC
Post by Andrew Dunstan
Post by Jeff Janes
Do you have a huge number of tables? Maybe over the course of a
long-lived connection, it touches enough tables to bloat the relcache /
syscache. I don't know how the autovac would be involved in that, though.
Yes, we do indeed have a huge number of tables. This seems a plausible
thesis.

All of the syscaches have compiled-in, hard-coded numbers of buckets, at
most 2048. Once those are exceeded, the resulting collision resolution
becomes essentially linear. It is not hard to exceed 2048 tables by a
substantial multiple, and even less hard to exceed 2048 columns (summed
over all tables).
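
To illustrate the effect, here is a toy model (not PostgreSQL's actual
cache code) that spreads random keys over a fixed 2048 buckets and
reports how long the collision chains get. The function name
chain_stats and the use of random 32-bit hash values are assumptions
made for the sketch.

```python
import random
from collections import Counter


def chain_stats(n_keys: int, n_buckets: int = 2048, seed: int = 0):
    """Hash n_keys random keys into n_buckets fixed buckets.

    Returns (mean, longest) chain length over the non-empty buckets.
    """
    rng = random.Random(seed)
    # Stand-in for a hash function: a random 32-bit value per key.
    counts = Counter(rng.getrandbits(32) % n_buckets for _ in range(n_keys))
    lengths = list(counts.values())
    return sum(lengths) / len(lengths), max(lengths)


if __name__ == "__main__":
    for n in (1_000, 20_000, 200_000):
        mean, longest = chain_stats(n)
        print(f"{n:>7} entries: mean chain {mean:6.1f}, longest {longest}")
```

With the bucket count fixed, the mean chain length grows in proportion
to the number of cached entries, so each lookup has to walk a
proportionally longer list.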

I don't know why syscache doesn't use dynahash: whether it is older than
dynahash and was never converted out of inertia, or whether there are extra
features that don't fit the dynahash API. If the former, then converting
them to use dynahash should give automatic resizing for free. Maybe that
conversion should be a To Do item?
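
For comparison, this is roughly what automatic resizing buys. The class
below is a generic chained hash table that doubles its bucket array
whenever the load factor exceeds 1; it is a sketch of the general idea
only and makes no claim about how dynahash is implemented internally.
All names in it are hypothetical.

```python
class ResizingChainedTable:
    """Chained hash table that doubles its bucket count as it fills up."""

    def __init__(self, n_buckets: int = 8, max_load: float = 1.0):
        self.buckets = [[] for _ in range(n_buckets)]
        self.size = 0
        self.max_load = max_load

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # overwrite an existing entry
                return
        chain.append((key, value))
        self.size += 1
        if self.size > self.max_load * len(self.buckets):
            self._grow()

    def get(self, key, default=None):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return default

    def _grow(self):
        # Stop-and-rehash doubling: redistribute every entry over 2x buckets.
        new_n = len(self.buckets) * 2
        new_buckets = [[] for _ in range(new_n)]
        for chain in self.buckets:
            for k, v in chain:
                new_buckets[hash(k) % new_n].append((k, v))
        self.buckets = new_buckets

    def longest_chain(self) -> int:
        return max(len(chain) for chain in self.buckets)


if __name__ == "__main__":
    table = ResizingChainedTable()
    for oid in range(10_000):
        table.put(oid, f"rel_{oid}")
    print(len(table.buckets), table.longest_chain())
```

Because the bucket array keeps pace with the number of entries, chains
stay short no matter how many tables a long-lived backend touches, at
the cost of occasional rehashing work.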



Cheers,

Jeff
