backend suddenly becomes slow, then remains slow

Andrew Dunstan

2012-12-14 18:40:04 UTC

One of my clients has an odd problem. Every so often a backend will
suddenly become very slow. The odd thing is that once this has happened
it remains slowed down, for all subsequent queries. Zone reclaim is off.
There is no IO or CPU spike, no checkpoint issues or stats timeouts, no
other symptom that we can see. The problem was a lot worse than it is
now, but two steps have mostly, though not completely, alleviated it:
much less aggressive autovacuuming and reducing the maximum lifetime of
backends in the connection pooler to 30 minutes.

It's got us rather puzzled. Has anyone seen anything like this?

cheers

andrew

--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Tom Lane

2012-12-14 19:56:00 UTC

Post by Andrew Dunstan
One of my clients has an odd problem. Every so often a backend will
suddenly become very slow. The odd thing is that once this has happened
it remains slowed down, for all subsequent queries. Zone reclaim is off.
There is no IO or CPU spike, no checkpoint issues or stats timeouts, no
other symptom that we can see. The problem was a lot worse than it is
now, but two steps have mostly, though not completely, alleviated it:
much less aggressive autovacuuming and reducing the maximum lifetime of
backends in the connection pooler to 30 minutes.
It's got us rather puzzled. Has anyone seen anything like this?

Maybe the kernel is auto-nice'ing the process once it's accumulated X
amount of CPU time?
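
One way to test this hypothesis is to read the nice value of the slow backend and compare it with a healthy one. The following is only a minimal sketch, assuming a Unix host with Python 3.3 or later; the helper name `nice_value` is made up for illustration, and the PID would have to be replaced with that of the backend in question (for example, as reported by `pg_backend_pid()` in that session).

```python
import os


def nice_value(pid: int) -> int:
    """Return the scheduling nice value of a process (Unix only).

    A value of 0 is the usual default; a larger value means lower priority.
    """
    return os.getpriority(os.PRIO_PROCESS, pid)


# PID 0 means "the calling process"; it is used here only so the sketch
# runs as-is. Substitute the PID of the slow backend.
print(nice_value(0))
```

If a slow backend reports a higher nice value than a freshly started one, that would point toward the scheduler rather than PostgreSQL itself.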

regards, tom lane

Andrew Dunstan

2012-12-14 20:16:19 UTC

Post by Tom Lane

Maybe the kernel is auto-nice'ing the process once it's accumulated X
amount of CPU time?

That was my initial thought, but the client said not. We'll check again.

cheers

andrew

Jeff Janes

2012-12-27 04:03:33 UTC

On Fri, Dec 14, 2012 at 10:40 AM, Andrew Dunstan wrote:

Post by Andrew Dunstan
One of my clients has an odd problem. Every so often a backend will
suddenly become very slow. The odd thing is that once this has happened
it remains slowed down, for all subsequent queries. Zone reclaim is off.
There is no IO or CPU spike, no checkpoint issues or stats timeouts, no
other symptom that we can see.

By "no spike", do you mean that the system as a whole is not using an
unusual amount of IO or CPU, or that this specific slow back-end is not
using an unusual amount?

Could you strace it and see what it is doing?

Post by Andrew Dunstan
The problem was a lot worse than it is now, but two steps have mostly,
though not completely, alleviated it: much less aggressive autovacuuming
and reducing the maximum lifetime of backends in the connection pooler
to 30 minutes.

Do you have a huge number of tables? Maybe over the course of a long-lived
connection, it touches enough tables to bloat the relcache / syscache. I
don't know how the autovac would be involved in that, though.

Cheers,

Jeff

Andrew Dunstan

2012-12-27 17:43:31 UTC

Post by Jeff Janes
On Fri, Dec 14, 2012 at 10:40 AM, Andrew Dunstan wrote:

Post by Andrew Dunstan
One of my clients has an odd problem. Every so often a backend will
suddenly become very slow. The odd thing is that once this has happened
it remains slowed down, for all subsequent queries. Zone reclaim is off.
There is no IO or CPU spike, no checkpoint issues or stats timeouts, no
other symptom that we can see.

By "no spike", do you mean that the system as a whole is not using an
unusual amount of IO or CPU, or that this specific slow back-end is
not using an unusual amount?

both, really.

Post by Jeff Janes
Could you strace it and see what it is doing?

Not very easily, because it's a pool connection and we've lowered the
pool session lifetime as part of the amelioration :-) So it's not
happening very much any more.

Post by Jeff Janes

Post by Andrew Dunstan
The problem was a lot worse than it is now, but two steps have mostly,
though not completely, alleviated it: much less aggressive autovacuuming
and reducing the maximum lifetime of backends in the connection pooler
to 30 minutes.

Do you have a huge number of tables? Maybe over the course of a
long-lived connection, it touches enough tables to bloat the relcache
/ syscache. I don't know how the autovac would be involved in that,
though.

Yes, we do indeed have a huge number of tables. This seems a plausible
thesis.

cheers

andrew

Jeff Janes

2012-12-28 00:11:23 UTC

Post by Andrew Dunstan

Post by Jeff Janes
Do you have a huge number of tables? Maybe over the course of a
long-lived connection, it touches enough tables to bloat the relcache /
syscache. I don't know how the autovac would be involved in that, though.

Yes, we do indeed have a huge number of tables. This seems a plausible
thesis.

All of the syscaches have hard-coded, compiled-in numbers of buckets, at
most 2048, and once the number of entries exceeds that, the resulting
collision resolution becomes essentially linear. It is not hard to exceed
2048 tables by a substantial multiple, and even easier to exceed 2048
columns (summed over all tables).
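
The effect of a fixed bucket count can be illustrated with a toy model. The sketch below is not PostgreSQL's catcache code: the function name, the sequential integer keys, and the perfectly even spread of entries across buckets are all assumptions made for illustration. It only shows how the average chain scan grows once the entry count passes the bucket count.

```python
def average_probes(num_keys: int, num_buckets: int) -> float:
    """Average number of chain entries scanned per successful lookup
    in a chained hash table whose bucket count never grows."""
    chain_lengths = [0] * num_buckets
    for key in range(num_keys):
        # Sequential integers spread evenly over the buckets, which is
        # the best case; a real hash function would do no better.
        chain_lengths[key % num_buckets] += 1
    # Finding the i-th entry of a chain costs i comparisons, so a chain
    # of length n costs n * (n + 1) / 2 comparisons over all its entries.
    total = sum(n * (n + 1) // 2 for n in chain_lengths)
    return total / num_keys


# 2048 buckets is the maximum mentioned above; the key counts are made up.
print(average_probes(2048, 2048))         # 1.0
print(average_probes(100 * 2048, 2048))   # 50.5
```

With 100 times as many entries as buckets, each lookup scans about 50 entries on average even in this best case, so the cost per lookup scales with the number of cached entries rather than staying constant.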

I don't know why syscache doesn't use dynahash: whether it is older than
dynahash and was never converted out of inertia, or whether there are
extra features that don't fit the dynahash API. If the former, then
converting the syscaches to use dynahash should give automatic resizing
for free. Maybe that conversion should be a To Do item?

Cheers,

Jeff