Discussion:
Fighting the planner >:-(
Casey Allen Shobe
2013-02-01 17:11:53 UTC
So where I'm working, a performance issue was identified that affected many
functions: the (SQL language) functions took an int argument and used it in
a WHERE clause against a column (config_id) that was stored in varchar
format, leading to inefficient casting when the query was parameterized. We
could work around that with (select $3::text) instead of just $3, but since
the data is actually all numbers under 65k, we altered the data type of the
column to smallint rather than editing a boatload of functions with a hacky
workaround.
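
To illustrate the shape of the problem, the functions look something like
this (all names here are invented stand-ins, not our real schema):

-- config_id stored as varchar, but the function parameter arrives as int
CREATE TABLE demo_config (config_id varchar, val text);
CREATE FUNCTION get_vals(p_config_id int)
RETURNS SETOF text LANGUAGE sql STABLE AS $$
    SELECT val FROM demo_config
    WHERE config_id = $1::text;   -- the cast that planned badly when
                                  -- the query was parameterized
    -- the workaround would have been:
    -- WHERE config_id = (SELECT $1::text)
$$;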

For most functions, this fixed the problem.

However, it had a drastically-negative impact on the query in question,
which was originally taking 2 minutes, 45 seconds. After adding a couple
indexes with the config_id still as a varchar, that time is reduced down to
42 seconds. However when the data type is smallint, the query runs for
many hours - I let it run for 4.5 hours yesterday before cancelling it.

It's pretty clear that the planner is making horrid misestimates and
picking a terrible plan. I would appreciate any advice for getting this
into a better state.

Here are the explain plans:

When config_id is a varchar, it executes in 42 seconds:
http://explain.depesz.com/s/wuf

When config_id is a smallint, it runs too long to complete, but clearly the
plan is bad:
http://explain.depesz.com/s/u5P

Here is the query, along with rowcounts and the schema of every table
involved in the query:
http://pgsql.privatepaste.com/c66fd497c9

PostgreSQL version is 8.4, and most of our GUCs are at their defaults.

Thanks in advance for any suggestions.
--
Casey Allen Shobe
***@shobe.info
Casey Allen Shobe
2013-02-01 17:54:09 UTC
My apologies - I included the wrong version of the query before... during
testing I had tried deparameterizing a few of the input parameters. I also
accidentally left out the schema for the network_config_tot2 table from the
initial paste.

Here is an updated paste, which shows the correct query in a prepare
statement. The explain plans are from explain execute hewitt_test (...):
http://pgsql.privatepaste.com/00c582c840

Here is the correct explain plan for this statement (still bad):
http://explain.depesz.com/s/c46
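
(For reference, the test pattern is the usual prepare/explain dance - shown
here with a placeholder query, since the real statement is in the paste:)

PREPARE hewitt_test_demo (smallint) AS
    SELECT count(*) FROM pg_class WHERE relpages > $1;  -- placeholder query
EXPLAIN EXECUTE hewitt_test_demo (10);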
--
Casey Allen Shobe
***@shobe.info
Richard Huxton
2013-02-01 18:50:13 UTC
Post by Casey Allen Shobe
My apologies - I included the wrong version of the query before... during
testing I had tried deparameterizing a few of the input parameters. I also
accidentally left out the schema for the network_config_tot2 table from
the initial paste.
Here is an updated paste, which shows the correct query in a prepare
statement:
http://pgsql.privatepaste.com/00c582c840
Here is the correct explain plan for this statement (still bad):
http://explain.depesz.com/s/c46
Three quick observations before the weekend.

1. You said config_id was now "smallint" in your email, but it reads
"int" in the pastes above.
Doesn't matter much which, but just checking we've got the right pastes.

2. The total estimated cost of both queries is about the same
(477,225.19 for the varchar, 447,623.86 for the int).
This suggests something about your configuration doesn't match the
performance of your machine, since presumably the int version is taking
at least twice as long as the varchar one.

3. Interestingly, the config_id search on both plans seems to be using a
Bitmap Index, so I'm not sure that's the root cause. However, the
varchar version seems to have a literal string it's matching against. If
you've manually substituted in a literal value, that could be skewing
the tests.

And two things for you to try if you would:

1. Can you just check and see if any of the row estimates are horribly
off for any particular clause in the query?

2. You mention your config settings are mostly at default. What's your
work_mem and can you increase it? You can issue a SET for the current
session, no need to change it globally. If you've got the RAM try
doubling it, then double it again. See what happens to your plan then.

--
Richard Huxton
Archonet Ltd
Casey Allen Shobe
2013-02-01 19:18:47 UTC
Post by Richard Huxton
1. You said config_id was now "smallint" in your email, but it reads "int"
in the pastes above.
Doesn't matter much which, but just checking we've got the right pastes.
You have the correct pastes. I did an alter to int in an attempt to see if
that made any difference; it didn't. It takes a couple of hours to do that
alter, so I didn't convert it back to smallint.
Post by Richard Huxton
2. The total estimated cost of both queries is about the same (477,225.19
for the varchar, 447,623.86 for the int).
This suggests something about your configuration doesn't match the
performance of your machine, since presumably the int version is taking at
least twice as long as the varchar one.
Configuration is pretty standard. As for the machine, it's a VM in an ESXi
setup, with dedicated resources. The disk is very fast and is barely
touched. One CPU sits at 100% for as long as I let the query run. There
is 18GB RAM and PostgreSQL is the only service running on the machine.
Post by Richard Huxton
3. Interestingly, the config_id search on both plans seems to be using a
Bitmap Index, so I'm not sure that's the root cause. However, the varchar
version seems to have a literal string it's matching against. If you've
manually substituted in a literal value, that could be skewing the tests.
That's why I sent the follow-up re-parameterizing everything - the explain
plans are on prepared statements with the parameterization. If I just put
the parameter values directly into the query and run it straight, it's
fast.

Post by Richard Huxton
1. Can you just check and see if any of the row estimates are horribly off
for any particular clause in the query?
Yes, they are - particularly the places where the estimate is rows=1.
Post by Richard Huxton
2. You mention your config settings are mostly at default. What's your
work_mem and can you increase it? You can issue a SET for the current
session, no need to change it globally. If you've got the RAM try doubling
it, then double it again. See what happens to your plan then.
It's 21861kB. I tried setting it to 192MB and re-preparing the same
statement. Here's the explain execute: http://explain.depesz.com/s/pZ0,
which looks identical to before.
--
Casey Allen Shobe
***@shobe.info
Casey Allen Shobe
2013-02-01 21:02:26 UTC
Rhodiumtoad on IRC helped me figure out how to move part of the query into
a CTE in order to work around the planner problem. This is a hack, but it
brings the time down from many hours to 17 seconds - better than it was
even with the better plan in the first place! For some reason it actually
gets another 2 seconds faster by putting it in a SQL function rather than
using prepare/execute.

Hopefully some improvements to the planner can come from this information?

Here is the CTE version of the query:
http://pgsql.privatepaste.com/2f7fd3f669
...and here is its explain analyze: http://explain.depesz.com/s/5ml
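
For anyone curious why this helps: in 8.4 a CTE is an optimization fence -
it is planned and materialized on its own, so the planner can no longer
push the parameterized condition down into the bad nested-loop plan. A toy
illustration of the shape (made-up query, not the real one):

PREPARE cte_demo (int) AS
WITH pre AS (
    -- the CTE is planned and materialized independently of the outer query
    SELECT oid, relname FROM pg_class WHERE relpages > $1
)
SELECT count(*)
FROM pre
JOIN pg_attribute a ON a.attrelid = pre.oid;
EXPLAIN EXECUTE cte_demo (10);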
--
Casey Allen Shobe
***@shobe.info
Виктор Егоров
2013-02-01 21:12:43 UTC
Post by Casey Allen Shobe
Hopefully some improvements to the planner can come from this information?
http://pgsql.privatepaste.com/2f7fd3f669
...and here is it's explain analyze: http://explain.depesz.com/s/5ml
Estimated rows for ‘hewitt_1_0_factors_precalc_new’ are 1000x less than
actual, and for ‘census_user’ the estimate is 100x less than actual.

How many rows are in those tables and what is your statistics target?
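
You can check the global setting with:

SHOW default_statistics_target;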
--
Victor Y. Yegorov
Casey Allen Shobe
2013-02-01 21:18:34 UTC
Post by Виктор Егоров
Estimated rows for ‘hewitt_1_0_factors_precalc_new’ are 1000x less than
actual, and for ‘census_user’ the estimate is 100x less than actual.
How many rows are in those tables and what is your statistics target?
Rowcounts are shown in the earlier paste link, but apparently I forgot to
include the census table - hewitt_1_0_factors_precalc_new has 4,135,890
rows, and census_user has 1,846,439 rows.

Statistics target is the default at 100.
--
Casey Allen Shobe
***@shobe.info
Виктор Егоров
2013-02-01 21:55:21 UTC
Post by Casey Allen Shobe
Rowcounts are shown in the earlier paste link, but apparently I forgot to
include the census table - hewitt_1_0_factors_precalc_new has 4,135,890
rows, and census_user has 1846439 rows.
Statistics target is the default at 100.
I would try the following:
ALTER TABLE hewitt_1_0_factors_precalc_new SET STATISTICS 1000;
ALTER TABLE census_user SET STATISTICS 500;
ALTER TABLE census_output SET STATISTICS 500;
and analyze them afterwards. I hope I guessed the ‘census_output’ name
correctly.

And could you kindly share the plan after:
SET enable_nestloop TO off;
--
Victor Y. Yegorov
Casey Allen Shobe
2013-02-01 22:03:34 UTC
I'll get back to you on this Monday - I'm heading home for the week now.

However, I was unable to adjust the statistics target using that command:

alter table opportunity.census_user set statistics 500;
ERROR: syntax error at or near "statistics"
LINE 1: alter table opportunity.census_user set statistics 500;
^
--
Casey Allen Shobe
***@shobe.info
Виктор Егоров
2013-02-01 22:17:54 UTC
Post by Casey Allen Shobe
alter table opportunity.census_user set statistics 500;
ERROR: syntax error at or near "statistics"
LINE 1: alter table opportunity.census_user set statistics 500;
I'm sorry for this, my bad. Try the following:

ALTER TABLE census_user ALTER parent_id SET STATISTICS 500;
ALTER TABLE census_user ALTER stakeholder_code SET STATISTICS 500;

Do the same for all the columns in the ‘hewitt_1_0_factors_precalc_new_pkey’
index, setting the target to 1000. I would also update the target for the
columns in this filter:

Filter: (((h.discount_type)::text = ANY ('{"Avg Comp",Blue}'::text[]))
        AND ((h.data_type)::text = 'Historical'::text)
        AND ((h.source)::text = 'Hewitt 1.0'::text)
        AND ((h.service_catg_scheme)::text = '11+3'::text))
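
Assuming the alias ‘h’ in that filter is hewitt_1_0_factors_precalc_new,
that would be:

ALTER TABLE hewitt_1_0_factors_precalc_new ALTER discount_type SET STATISTICS 1000;
ALTER TABLE hewitt_1_0_factors_precalc_new ALTER data_type SET STATISTICS 1000;
ALTER TABLE hewitt_1_0_factors_precalc_new ALTER source SET STATISTICS 1000;
ALTER TABLE hewitt_1_0_factors_precalc_new ALTER service_catg_scheme SET STATISTICS 1000;
ANALYZE hewitt_1_0_factors_precalc_new;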
--
Victor Y. Yegorov