Discussion:
how to help the planner
Marty Frasier
2013-03-28 15:59:05 UTC
Greetings,

We've been using PostgreSQL for a few years. This is my first post here
and my first real dive into query plans.


A description of what you are trying to achieve and what results you expect:
Query results from nested joins of a table. The results are correct - the
query just takes a long time with the selected plan.

PostgreSQL version number you are running:
PostgreSQL 9.2.3 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit

How you installed PostgreSQL:

yum, using PGDG repo (package
postgresql92-server-9.2.3-2PGDG.rhel6.x86_64 and friends)


Changes made to the settings in the postgresql.conf file:
DateStyle = ISO, MDY
default_tablespace = esc_data
default_text_search_config = pg_catalog.english
effective_cache_size = 24GB
lc_messages = en_US.UTF-8
lc_monetary = en_US.UTF-8
lc_numeric = en_US.UTF-8
lc_time = en_US.UTF-8
listen_addresses = 0.0.0.0
log_connections = on
log_destination = stderr
log_disconnections = on
log_line_prefix = %t %c
log_rotation_age = 1d
log_timezone = US/Eastern
logging_collector = on
maintenance_work_mem = 96MB
max_connections = 100
search_path = "$user", esc_funcs, public
shared_buffers = 8GB
TimeZone = US/Eastern
track_functions = all
track_io_timing = on

Operating system and version:
Red Hat Enterprise Linux Server release 6.4 (Santiago)

What program you're using to connect to PostgreSQL:

Java (JDBC) and psql


Is there anything relevant or unusual in the PostgreSQL server logs?:

no

The issue is similar on PostgreSQL 9.0.5 on x86_64-pc-linux-gnu, compiled
by GCC gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3 on Ubuntu 10.04 64-bit
although we're doing troubleshooting on our new RHEL server.

We have a particular query that takes about 75 minutes to complete. The
selected execution plan estimates 1 row for several of the outermost
results, so it picks nested loop joins to resolve them. That turns out to be
a bad choice since the actual row counts are in the thirty to fifty thousand
range.
original selected plan: http://explain.depesz.com/s/muR
SQL: http://pastebin.com/f40Xp0JM

I set enable_nestloop=false to hint to the planner not to use nested loops.
That resulted in a 13 second runtime. It appears this plan was considered
originally, but its estimated cost was higher than that of the plan above.
enable_nestloop=false: http://explain.depesz.com/s/mAa
SQL: http://pastebin.com/CgcSe7r6
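
For reference, a sketch of how the override can be scoped so it doesn't leak
into other work (a session-level SET works too; SET LOCAL confines it to one
transaction):

  -- sketch: disable nested loops for a single transaction only, so the
  -- rest of the session keeps the default planner behaviour
  BEGIN;
  SET LOCAL enable_nestloop = off;
  -- run EXPLAIN (ANALYZE, BUFFERS) on the problem query here
  COMMIT;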

We tried rewriting the query using WITH clauses. That took 82 seconds, but
the planner estimated it would take much longer.
using with clauses: http://explain.depesz.com/s/GEZ
SQL: http://pastebin.com/ZRvRK2TV
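
Worth noting for anyone reading along (my understanding of 9.2 behaviour,
not specific to our SQL): a WITH clause is an optimization fence - the CTE
is planned and materialized on its own, and outer WHERE clauses are not
pushed down into it, so the plan shape and the estimates can change a lot.
A generic illustration with made-up names:

  -- hypothetical tables/columns, purely to illustrate the fence:
  -- in 9.2 the filter on student_id is NOT pushed down into the CTE,
  -- so all_scores is scanned and materialized in full first
  WITH all_scores AS (
      SELECT student_id, test, score
      FROM some_large_table
  )
  SELECT *
  FROM all_scores
  WHERE student_id = 42;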

We have been looking into the issue to the best of our ability but can't
figure out how to help the planner. I've looked at the planner source a bit
and see where the row count is clamped to 1 if it's <= 1. I haven't found
where it's computed yet, but I presume the planner was unable to determine
the result set row count and defaulted to 1.

I've run ANALYZE manually and tried it with default_statistics_target=10000
to see if that helped. It didn't.
The table is static - no new rows are being added and there is no other
load on the database.
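
Roughly the shape of what was run (a sketch - the higher target was set per
session so it only affected that ANALYZE, e.g. on the main table involved):

  -- sketch: raise the statistics sample size for this session only,
  -- re-analyze, then restore the default
  SET default_statistics_target = 10000;
  ANALYZE analysis.iteration__student__test__year;
  RESET default_statistics_target;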

schema dump: http://pastebin.com/pUU0BJvr

What can we do to help the planner estimate better?

Thanks in advance,
Marty Frasier
Tom Lane
2013-03-28 16:18:46 UTC
Post by Marty Frasier
We've been using postgreSQL for a few years. This is my first post here
and first real dive into query plans.
One quick thought is that it's probably worth cranking up
join_collapse_limit and/or from_collapse_limit, since the number of
relations in the query is considerably more than the default values of
those limits. This will make planning take longer but possibly find
better plans. I'm not sure it will help a lot, since most of the
problem is evidently bad rowcount estimates, but it might help.
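
(For concreteness, something along these lines per session - the exact
values are not critical as long as they exceed the number of relations:)

  -- sketch: let the planner consider join orders across all relations in
  -- the query instead of stopping at the default limit of 8
  SET from_collapse_limit = 50;
  SET join_collapse_limit = 50;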

Also it seems like the major rowcount failing is in the estimate for the
t12 subquery. I can't tell why that particular combination of WHERE
clauses is giving it such a hard time --- is there something odd about
the distribution of 'cahsee_ela' tests? Why is that particular subquery
grouped over school/student when all the others are grouped over just
student?

regards, tom lane
Marty Frasier
2013-03-28 20:45:26 UTC
Tom,

I cranked (join|from)_collapse_limit up to 50, then 500 just to exclude the
limits completely, and attempted the query both times. The planner came up
with an estimate close to the other estimates (1,944,276) and I stopped
actual execution after some length of time.

The t12 subquery is grouped differently because that particular test can be
valid at multiple schools per student.

I had set default_statistics_target to 10000 for the session and analyzed
prior to the earlier runs to allow the planner to have the best stats it
could. I've since looked at it a little more closely, setting
default_statistics_target back to the default of 100 and re-running ANALYZE
on that database.

The value 'cahsee_ela' occurs 75,000 times in the column
analysis.iteration__student__test__year.test, which totals 11M rows. It
ranks about 60th of 91 values in frequency.
By setting statistics=1000 on the column 'test', the MCV list in pg_stats
contains all 91 distinct values (there are no nulls) and there is no
histogram_bounds value for the column. From the MCV list: cahsee_ela =
0.00658, which is accurate.
I think that should give the planner good info on the selectivity of the
WHERE clause. It appears from the var_eq_const function that it will use
that exact value when found. It doesn't seem to help the outcome, though,
as it had good stats before. I just understand it a little better now -
which is good.
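
(For the record, the per-column change was along these lines:)

  -- sketch: per-column statistics target, then re-analyze so pg_stats
  -- is rebuilt with the larger sample
  ALTER TABLE analysis.iteration__student__test__year
      ALTER COLUMN test SET STATISTICS 1000;
  ANALYZE analysis.iteration__student__test__year;

  -- inspect the resulting MCV list and frequencies
  SELECT most_common_vals, most_common_freqs
  FROM pg_stats
  WHERE schemaname = 'analysis'
    AND tablename  = 'iteration__student__test__year'
    AND attname    = 'test';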

Do you have any suggestions where to probe next?
I see some statistics hooks mentioned in the source code but don't know how
to take advantage of them or whether they would be of use.
I suppose the answer could eventually be that we have to reorganize our
queries?

Thanks,
Marty
Igor Neyman
2013-03-29 18:23:06 UTC
Marty,

When you change from/join_collapse_limit, pay attention to the Genetic Query Optimizer settings; I believe it's "on" by default (geqo = on).
Specifically, look at the geqo_threshold parameter (default is 12).
AFAIK, if you don't intend to use the Genetic Query Optimizer, geqo_threshold should be higher than your collapse limit, e.g. if you want to set the collapse limit to 50, and you think you may join 50 tables, then also increase geqo_threshold to at least 51.
Otherwise GEQO will come into play unexpectedly.
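
(Something like this at the session level, if you go to 50:)

  -- sketch: keep GEQO from kicking in below the exhaustive-search limits
  SET from_collapse_limit = 50;
  SET join_collapse_limit = 50;
  SET geqo_threshold = 51;   -- must exceed the collapse limits
  -- or simply disable the genetic optimizer for the session:
  -- SET geqo = off;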

Besides this, try to play with these parameters (according to your original message you keep them at default):

#seq_page_cost = 1.0 # measured on an arbitrary scale
random_page_cost = 2.0 # same scale as above (default 4.0)
cpu_tuple_cost = 0.05 # same scale as above (default 0.01)
cpu_index_tuple_cost = 0.05 # same scale as above (default 0.005)
cpu_operator_cost = 0.0075 # same scale as above (default 0.0025)

Start with cpu_tuple_cost, increasing it from default 0.01 to 0.03-0.05.
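
(These can also be tried per session before touching postgresql.conf, e.g.:)

  -- sketch: experiment with the cost constants in one session only, then
  -- re-run EXPLAIN (ANALYZE) on the problem query and compare the plans
  SET cpu_tuple_cost = 0.03;
  SET random_page_cost = 2.0;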

Regards,
Igor Neyman


Stephen Frost
2013-03-28 22:13:11 UTC
Marty,
Post by Marty Frasier
We have a particular query that takes about 75 minutes to complete. The
selected execution plan estimates 1 row from several of the outermost
results so picks nested loop join resolutions. That turns out to be a bad
choice since actual row counts are in the thirty to fifty thousand range.
I've seen exactly this behaviour and it's led to many cases where we've
had to simply disable nest loop for a given query. They're usually in
functions, so that turns out to be workable without having to deal with
application changes. Still, it totally sucks.
Post by Marty Frasier
I haven't found where
it's set yet but presume it was unable to determine the result set row
count and defaulted to 1.
No.. There's no 'default to 1', afaik. The problem seems to simply be
that PG ends up estimating the number of rows coming back very poorly.
I'm actually suspicious that the number it's coming up with is much
*smaller* than one and is then clamped back to '1' as a minimum instead
of being rounded down to zero. I did see one query that moved to a nested
loop plan from a more sensible plan when upgrading from 9.0 to
9.2, but there were plans even under 9.0 that were similarly bad.

The one thing I've not had a chance to do yet is actually build out a
test case I can share that demonstrates this bad behaviour. If
that's something you could provide, it would absolutely help us in
understanding and perhaps solving this issue.
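
(Even a synthetic skeleton would help - made-up table and data, small enough
to post, in roughly this shape. As written it almost certainly will not
reproduce the misestimate; the point is just the format: create the data,
ANALYZE, then show the EXPLAIN ANALYZE output where the estimate is rows=1
but the actual count is in the tens of thousands.)

  -- hypothetical skeleton of a self-contained, shareable test case
  CREATE TABLE t_scores (student_id int, school_id int, test text, score int);
  INSERT INTO t_scores
      SELECT g % 50000, g % 300, 'cahsee_ela', g % 100
      FROM generate_series(1, 75000) AS g;
  ANALYZE t_scores;

  EXPLAIN ANALYZE
  SELECT school_id, student_id, max(score)
  FROM t_scores
  WHERE test = 'cahsee_ela'
  GROUP BY school_id, student_id;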

Thanks!

Stephen