slow bitmap heap scans on pg 9.2

Discussion:

(too old to reply)

Steve Singer

2013-04-10 13:49:55 UTC

I'm encountering an issue where PG 9.2.4 (we also see this with 9.2.3)
is picking a plan involving a bitmap heap scan that turns out to be much
slower than a nested-loop plan using indexes.

The planner picks the hashjoin plan by default (see attached files)

Bitmap Heap Scan on public.table_b_2 b (cost=172635.99..9800225.75
rows=8435754 width=10) (actual t
ime=9132.194..1785196.352 rows=9749680 loops=1)
Recheck Cond: ((b.organization_id = 3) AND
(b.year = 2013) AND (b.month = 3))
Rows Removed by Index Recheck: 313195667
Filter: (b.product_id = 2)

Is the part that seems be causing the problem (or at least taking most
of the time, other than the final aggregation)

If I set enable_hashjoin=false and enable_mergejoin=false I get the
nestedloop join plan.

table_b is 137 GB plus indexes each on is around 43 GB
table_a is 20 GB

random_page_cost = 2.0
effective_cache_size = 3500MB
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025
work_mem = 64MB
shared_buffers = 300MB (for this output, I've also had it at 2GB)

If I bump cpu_tuple_cost to the 10-20 range it will pick the nested loop
join for some date ranges but not all. cpu_tuple_cost of 20 doesn't
sound like an sane value.

This database used to run 8.3 where it picked the nested-loop join. We
used pg_upgrade to migrate to 9.2

Any ideas why the bitmap heap scan is much slower than the planner expects?

Steve

k***@rice.edu

2013-04-10 13:56:57 UTC

Permalink

Post by Steve Singer
I'm encountering an issue where PG 9.2.4 (we also see this with
9.2.3) is picking a plan involving a bitmap heap scan that turns out
to be much slower than a nested-loop plan using indexes.
The planner picks the hashjoin plan by default (see attached files)
Bitmap Heap Scan on public.table_b_2 b (cost=172635.99..9800225.75
rows=8435754 width=10) (actual t
ime=9132.194..1785196.352 rows=9749680 loops=1)
Recheck Cond: ((b.organization_id = 3)
AND (b.year = 2013) AND (b.month = 3))
Rows Removed by Index Recheck: 313195667
Filter: (b.product_id = 2)
Is the part that seems be causing the problem (or at least taking
most of the time, other than the final aggregation)
If I set enable_hashjoin=false and enable_mergejoin=false I get the
nestedloop join plan.
table_b is 137 GB plus indexes each on is around 43 GB
table_a is 20 GB
random_page_cost = 2.0
effective_cache_size = 3500MB
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025
work_mem = 64MB
shared_buffers = 300MB (for this output, I've also had it at 2GB)
If I bump cpu_tuple_cost to the 10-20 range it will pick the nested
loop join for some date ranges but not all. cpu_tuple_cost of 20
doesn't sound like an sane value.
This database used to run 8.3 where it picked the nested-loop join.
We used pg_upgrade to migrate to 9.2
Any ideas why the bitmap heap scan is much slower than the planner expects?
Steve

Hi Steve,

The one thing that stands out to me is that you are working with 200GB of
data on a machine with 4-8GB of ram and you have the random_page_cost set
to 2.0. That is almost completely uncached and I would expect a value of
10 or more to be closer to reality.

Regards,
Ken

--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Steve Singer

2013-04-10 15:56:32 UTC

Permalink

Post by k***@rice.edu
Hi Steve,
The one thing that stands out to me is that you are working with 200GB of
data on a machine with 4-8GB of ram and you have the random_page_cost set
to 2.0. That is almost completely uncached and I would expect a value of
10 or more to be closer to reality.

Setting random_page_cost to 15 makes the planner choose the nested-loop
plan (at least the date range I tried).

I thought that the point of effective cache size was to tell the planner
high likely it is for a random page to be in cache. With 200GB of data
for this query and an effective cache size of 3.5 GB I would have
expected that to be accounted for.

Post by k***@rice.edu
Regards,
Ken

--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

k***@rice.edu

2013-04-10 17:49:52 UTC

Permalink

Post by Steve Singer

Setting random_page_cost to 15 makes the planner choose the
nested-loop plan (at least the date range I tried).
I thought that the point of effective cache size was to tell the
planner high likely it is for a random page to be in cache. With
200GB of data for this query and an effective cache size of 3.5 GB I
would have expected that to be accounted for.

For random_page_cost to be that low, the database would need to be
mostly cached. 3.5GB is almost 100X too small to do that unless your
query exhibits a large amount of locality of reference. Values for
random_page_cost between 10 and 20 are very reasonable for disk-bound
I/O scenarios.

Regards,
Ken

--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Jeff Janes

2013-04-10 18:15:34 UTC

Permalink

Post by Steve Singer

e_c_s tells it how likely it is to still be in cache the second (and
subsequent) time the page is visited during the *same query*. It doesn't
tell it how likely it is to be in cache the first time it is needed in a
given query. (Also, e_c_s is irrelevant for bitmap scans, as they
inherently hit every block only once)

Also, it doesn't tell how expensive it is to bring it into cache when it is
needed. That is what random_page_cost is for. If you tell that those
fetches are going to be cheap, then it doesn't matter so much how many of
them it is going to have to do.

Cheers,

Jeff

Jeff Janes

2013-04-10 18:06:16 UTC

Permalink

I think the index recheck means your bitmap is overflowing (i.e. needing
more space than work_mem) and so keeping only the pages which have at least
one match, which means all rows in those pages need to be rechecked. How
many rows does the table have? You might be essentially doing a seq scan,
but with the additional overhead of the bitmap machinery. Could you do
"explain (analyze,buffers)", preferably with track_io_timing set to on?

Cheers,

Jeff

Steve Singer

2013-04-10 23:54:09 UTC

Permalink

Post by Jeff Janes
I think the index recheck means your bitmap is overflowing (i.e. needing
more space than work_mem) and so keeping only the pages which have at
least one match, which means all rows in those pages need to be
rechecked. How many rows does the table have? You might be essentially
doing a seq scan, but with the additional overhead of the bitmap
machinery. Could you do "explain (analyze,buffers)", preferably with
track_io_timing set to on?

table_b has 1,530,710,469 rows

Attached is the output with track_io_timings and buffers.

Post by Jeff Janes
Cheers,
Jeff

Steve Singer

2013-04-11 14:20:11 UTC

Permalink

Post by Steve Singer

table_b has 1,530,710,469 rows
Attached is the output with track_io_timings and buffers.

I've done some more testing with a random_page_cost=20.

This gives me the nested-loop plan for the various date ranges I've tried.

However table_a_2 and table_b_2 are actually partition tables. This
query only needs to look at a single partition. When I run this same
query against a different partition (a smaller partition, but still
bigger than cache) it picks hash join plan involving a seq scan of
table_b but no bitmap index scan. On this partition the hash-join
plans tend to take 15 minutes versus 2 minutes when I disable hashjoin
plans. Bumping random_page_cost higher doesn't fix this.

I think the reason why it is picking the hash join based plans is
because of

Index Scan using table_b_1_ptid_orgid_ym_unq on table_b_1 b
(cost=0.00..503.86 rows=1 width=10) (actual time=0.016..0.017 rows=1
loops=414249)
Index Cond: ((a.id = a_id) AND (organization_id =
2) AND (year = 2013) AND (month = 3))
Filter: (product_id = 1)

I think we are over-estimating the cost of the index scans in the inner
loop. This seems similar to what was discussed a few months ago
http://www.postgresql.org/message-id/092a01cdd230$ff6143c0$fe23cb40$@foo.me.uk

This version of PG should have 3e9960e9d935e7e applied. I am trying to
get the database copied to a machine where I can easily switch PG
versions and test this against something prior to that commit and also
against a 9.3 build.

Steve

Post by Steve Singer

Post by Jeff Janes
Cheers,
Jeff