Discussion:
hash join vs nested loop join
Huan Ruan
2012-12-12 04:25:48 UTC
Permalink
Hello All

While investigating switching to Postgres, we came across a query plan that
uses a hash join and is a lot slower than a nested loop join.

I don't understand why the optimiser chooses the hash join in favor of the
nested loop. What can I do to get the optimiser to make a better decision
(nested loop in this case)? I have run analyze on both tables.

The query is,

/*
smalltable has about 48,000 records.
bigtable has about 168,000,000 records.
invtranref is char(10) and is the primary key for both tables
*/
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on
bigtable.invtranref = smalltable.invtranref

The hash join plan is,

"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)"
" Output: smalltable.invtranref, smalltable.itbatchref,
smalltable.trantype, smalltable.trandate, smalltable.invprodref,
smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref,
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref,
bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units,
bigtable.tranamount, bigtable.createmode, bigtable.transtat,
bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat,
bigtable.seqnum, bigtable.transign"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" -> Seq Scan on public.invtran bigtable (cost=0.00..4730787.28
rows=168121728 width=108)"
" Output: bigtable.itbatchref, bigtable.invtranref,
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref,
bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units,
bigtable.tranamount, bigtable.createmode, bigtable.transtat,
bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat,
bigtable.seqnum, bigtable.transign"
" -> Hash (cost=1078.61..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref,
smalltable.trantype, smalltable.trandate, smalltable.invprodref,
smalltable.invheadref"
" -> Seq Scan on public.im_match_table smalltable
(cost=0.00..1078.61 rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref,
smalltable.trantype, smalltable.trandate, smalltable.invprodref,
smalltable.invheadref"

The nested loop join plan is,

"Nested Loop (cost=0.00..12888684.07 rows=48261 width=171)"
" Output: smalltable.invtranref, smalltable.itbatchref,
smalltable.trantype, smalltable.trandate, smalltable.invprodref,
smalltable.invheadref, bigtable.itbatchref, bigtable.invtranref,
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref,
bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units,
bigtable.tranamount, bigtable.createmode, bigtable.transtat,
bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat,
bigtable.seqnum, bigtable.transign"
" -> Seq Scan on public.im_match_table smalltable (cost=0.00..1078.61
rows=48261 width=63)"
" Output: smalltable.invtranref, smalltable.itbatchref,
smalltable.trantype, smalltable.trandate, smalltable.invprodref,
smalltable.invheadref"
" -> Index Scan using pk_invtran on public.invtran bigtable
(cost=0.00..267.03 rows=1 width=108)"
" Output: bigtable.itbatchref, bigtable.invtranref,
bigtable.invheadref, bigtable.feeplanref, bigtable.invprodref,
bigtable.trantype, bigtable.trandate, bigtable.pricedate, bigtable.units,
bigtable.tranamount, bigtable.createmode, bigtable.transtat,
bigtable.sysversion, bigtable.sysuser, bigtable.rectype, bigtable.recstat,
bigtable.seqnum, bigtable.transign"
" Index Cond: (bigtable.invtranref = smalltable.invtranref)"

The version is PostgreSQL 9.2.0 on x86_64-unknown-linux-gnu, compiled by
gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit. Server specs are:

- Centos, ext4
- 24GB memory
- 6 cores hyper-threaded (Intel(R) Xeon(R) CPU E5645).
- raid 10 on 4 sata disks

Config changes are

- shared_buffers = 6GB
- effective_cache_size = 18GB
- work_mem = 10MB
- maintenance_work_mem = 3GB

Many Thanks
Huan
Evgeny Shishkin
2012-12-12 04:33:33 UTC
Permalink
Post by Huan Ruan
Hello All
While investigating switching to Postgres, we come across a query plan that uses hash join and is a lot slower than a nested loop join.
I don't understand why the optimiser chooses the hash join in favor of the nested loop. What can I do to get the optimiser to make a better decision (nested loop in this case)? I have run analyze on both tables.
The optimiser thinks the nested loop is more expensive because of the point PK lookups, which are random I/O.
Can you set random_page_cost to 2 or 3 and try again?
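For example, just in your session (the default random_page_cost is 4):

SET random_page_cost = 2;   -- session-local; does not affect other connections
EXPLAIN
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on
bigtable.invtranref = smalltable.invtranref;
RESET random_page_cost;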
Evgeny Shishkin
2012-12-12 04:59:04 UTC
Permalink
Post by Evgeny Shishkin
Optimiser thinks that nested loop is more expensive, because of point PK lookups, which a random io.
Can you set random_page_cost to 2 or 3 and try again?
Hi Evgeny
Thanks for the quick reply. Setting random_page_cost to 3 doesn't make a difference, but setting it to 2 makes the optimiser choose the nested loop. However, with such a small penalty for random I/O, I'm worried this setting will make other small queries incorrectly use an index when a sequential scan would be better. I understand random I/O is expensive, but in this case the optimiser already knows the big table is really big; shouldn't it conclude that a sequential scan will be slower than an index lookup? Scanning 170 million records vs an index lookup of 50,000 records. Any thoughts?
Yes, this is the most common issue for me.
Usually you just have to find the right combination of random and seq scan costs, shared_buffers and effective_cache_size.
If some of the queries work well with another value of, say, random_page_cost, then, since it is a per-session parameter, you can SET it in your session before the query. But over time your table may change in size and distribution and everything breaks. Not to mention the general ugliness from the application standpoint.
Maybe somebody more experienced can help.
Also you can set different costs per tablespace.
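For example (the tablespace names here are made up; ALTER TABLESPACE ... SET accepts seq_page_cost and random_page_cost):

ALTER TABLESPACE fast_ssd_space SET (random_page_cost = 1.5);
ALTER TABLESPACE slow_sata_space SET (random_page_cost = 6);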
Post by Huan Ruan
Thanks
Huan
Added CC.
Jeff Janes
2012-12-12 16:28:13 UTC
Permalink
Post by Huan Ruan
Hello All
While investigating switching to Postgres, we came across a query plan that
uses a hash join and is a lot slower than a nested loop join.
I don't understand why the optimiser chooses the hash join in favor of the
nested loop. What can I do to get the optimiser to make a better decision
(nested loop in this case)? I have run analyze on both tables.
The query is,
/*
smalltable has about 48,000 records.
bigtable has about 168,000,000 records.
invtranref is char(10) and is the primary key for both tables
*/
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on
bigtable.invtranref = smalltable.invtranref
..
Post by Huan Ruan
" -> Index Scan using pk_invtran on public.invtran bigtable (cost=0.00..267.03 rows=1 width=108)"
This looks like the same large-index over-penalty as discussed in the
recent thread "[PERFORM] Slow query: bitmap scan troubles".

Back-patching the log(npages) change is starting to look like a good idea.

Cheers,

Jeff
Huan Ruan
2012-12-13 00:56:21 UTC
Permalink
Post by Jeff Janes
This looks like the same large-index over-penalty as discussed in the
recent thread "[PERFORM] Slow query: bitmap scan troubles".
Back-patching the log(npages) change is starting to look like a good idea.
Cheers,
Jeff
Thanks for the information Jeff. That does seem to be related.
Kevin Grittner
2012-12-12 23:47:32 UTC
Permalink
Post by Huan Ruan
is a lot slower than a nested loop join.
Giving actual numbers is more useful than terms like "a lot". Even
better is to provide the output of EXPLAIN ANALYZE rather than
just EXPLAIN. This shows estimates against actual numbers, and gives
timings. For more suggestions see this page:

http://wiki.postgresql.org/wiki/SlowQueryQuestions
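For example, the following shows estimated versus actual row counts, timings,
and buffer hit/read counts in one go:

EXPLAIN (ANALYZE, BUFFERS)
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on
bigtable.invtranref = smalltable.invtranref;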
Post by Huan Ruan
I don't understand why the optimiser chooses the hash join in
favor of the nested loop. What can I do to get the optimiser to
make a better decision (nested loop in this case)? I have run
analyze on both tables.
Config changes are
 - shared_buffers = 6GB
 - effective_cache_size = 18GB
 - work_mem = 10MB
 - maintenance_work_mem = 3GB
As already suggested, there was a change made in 9.2 which may have
over-penalized nested loops using index scans. This may be fixed in
the next minor release.

Also, as already suggested, you may want to reduce
random_page_cost, to bring it in line with the actual cost relative to
seq_page_cost based on your cache hit ratio.

Additionally, I just routinely set cpu_tuple_cost higher than the
default of 0.01. I find that 0.03 to 0.05 better models the actual
relative cost of processing a tuple.

-Kevin
Huan Ruan
2012-12-13 01:10:24 UTC
Permalink
Hi Kevin
Post by Kevin Grittner
Post by Huan Ruan
is a lot slower than a nested loop join.
Giving actual numbers is more useful than terms like "a lot". Even
better is to provide the output of EXPLAIN ANALYZZE rather than
just EXPLAIN. This shows estimates against actual numbers, and give
http://wiki.postgresql.org/wiki/SlowQueryQuestions
You are right. I realised my information wasn't accurate; I was a bit slack
and cancelled the slower run. The full outputs are

Hash 1st run

"QUERY PLAN"
"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171) (actual
time=2182.450..88158.645 rows=48257 loops=1)"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" Buffers: shared hit=3950 read=3046219"
" -> Seq Scan on invtran bigtable (cost=0.00..4730787.28 rows=168121728
width=108) (actual time=0.051..32581.052 rows=168121657 loops=1)"
" Buffers: shared hit=3351 read=3046219"
" -> Hash (cost=1078.61..1078.61 rows=48261 width=63) (actual
time=21.751..21.751 rows=48261 loops=1)"
" Buckets: 8192 Batches: 1 Memory Usage: 4808kB"
" Buffers: shared hit=596"
" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61
rows=48261 width=63) (actual time=0.007..8.299 rows=48261 loops=1)"
" Buffers: shared hit=596"
"Total runtime: 88162.417 ms"

Hash 2nd run (after disconnect and reconnect)


"QUERY PLAN"
"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171) (actual
time=2280.390..87934.540 rows=48257 loops=1)"
" Hash Cond: (bigtable.invtranref = smalltable.invtranref)"
" Buffers: shared hit=3982 read=3046187"
" -> Seq Scan on invtran bigtable (cost=0.00..4730787.28 rows=168121728
width=108) (actual time=0.052..32747.805 rows=168121657 loops=1)"
" Buffers: shared hit=3383 read=3046187"
" -> Hash (cost=1078.61..1078.61 rows=48261 width=63) (actual
time=62.161..62.161 rows=48261 loops=1)"
" Buckets: 8192 Batches: 1 Memory Usage: 4808kB"
" Buffers: shared hit=596"
" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61
rows=48261 width=63) (actual time=0.006..8.209 rows=48261 loops=1)"
" Buffers: shared hit=596"
"Total runtime: 87938.584 ms"

NL 1st run

"QUERY PLAN"
"Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
time=0.056..551.438 rows=48257 loops=1)"
" Buffers: shared hit=242267"
" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61
rows=48261 width=63) (actual time=0.009..7.353 rows=48261 loops=1)"
" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..133.65
rows=1 width=108) (actual time=0.010..0.010 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"
" Buffers: shared hit=241671"
"Total runtime: 555.336 ms"

NL 2nd run (after disconnect and reconnect)

"QUERY PLAN"
"Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
time=0.058..554.215 rows=48257 loops=1)"
" Buffers: shared hit=242267"
" -> Seq Scan on im_match_table smalltable (cost=0.00..1078.61
rows=48261 width=63) (actual time=0.009..7.416 rows=48261 loops=1)"
" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..133.65
rows=1 width=108) (actual time=0.010..0.010 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"
" Buffers: shared hit=241671"
"Total runtime: 558.095 ms"
Post by Kevin Grittner
Post by Huan Ruan
I don't understand why the optimiser chooses the hash join in
favor of the nested loop. What can I do to get the optimiser to
make a better decision (nested loop in this case)? I have run
analyze on both tables.
Config changes are
- shared_buffers = 6GB
- effective_cache_size = 18GB
- work_mem = 10MB
- maintenance_work_mem = 3GB
As already suggested, there was a change made in 9.2 which may have
over-penalized nested loops using index scans. This may be fixed in
the next minor release.
Will keep this in mind.
Post by Kevin Grittner
Also, as already suggested, you may want to reduce random_page
cost, to bring it in line with the actual cost relative to
seq_page_cost based on your cache hit ratio.
Additionally, I just routinely set cpu_tuple_cost higher than the
default of 0.01. I find that 0.03 to 0.05 better models the actual
relative cost of processing a tuple.
I originally reduced random_page_cost to 2 to achieve the nested loop join.
Now that I have set cpu_tuple_cost to 0.05 and reset random_page_cost back to
4, I can also achieve a nested loop join.

I'm still new to Postgres, but I'm worried that a random_page_cost of 2 is
too low, so maybe increasing cpu_tuple_cost is a better choice. All this
tuning probably also depends on the possible fix mentioned above. Can you see
any obvious issues with the other memory settings I changed?

Thanks for your help.

Cheers
Huan
Post by Kevin Grittner
-Kevin
Kevin Grittner
2012-12-13 15:26:24 UTC
Permalink
Hash 1st run
"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)
(actual time=2182.450..88158.645 rows=48257 loops=1)"
" -> Seq Scan on invtran bigtable (cost=0.00..4730787.28
rows=168121728 width=108) (actual time=0.051..32581.052
rows=168121657 loops=1)"
194 nanoseconds per row suggests 100% cache hits.
NL 1st run
"Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
time=0.056..551.438 rows=48257 loops=1)"
" -> Index Scan using pk_invtran on invtran bigtable
(cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010
rows=1 loops=48261)"
10 microseconds per index scan (each index scan requiring multiple
"random" accesses) also suggests 100% cache hits.
I originally reduced random_page_cost to 2 to achieve the nested
loop join. Now I set cpu_tuple_cost to 0.05 and reset
random_page_cost back to 4, I can also achieve a nested loop
join.
I'm still new in Postgres, but I'm worried about random_page_cost
being 2 is too low, so maybe increasing cpu_tuple_cost is a
better choice.
If these are typical of what you would expect in production, then
the fact that with default cost factors the costs are barely
different (by 0.6%) for actual run times which differ by two orders
of magnitude (the chosen plan is 160 times slower) means that the
modeling of cost factors is off by a lot.

If you expect the active portion of your database to be fully
cached like this, it makes sense to reduce random_page_cost to be
equal to seq_page_cost. But that only adjusts the costs by at most
a factor of four, and we've established that in the above query
they're off by a factor of 160. To help make up the difference, it
makes sense to de-emphasize page access compared to cpu-related
costs by reducing both page costs to 0.1. Combined, these
adjustments still can't compensate for how far off the estimate
was.

In my experience default cpu_tuple_cost is understated compared to
other cpu-related costs, so I would do the above *plus* a boost to
cpu_tuple_cost. Personally, I have never seen a difference between
plans chosen with that set to 0.03 and 0.05, so I can't say where
in that range is the ideal value; you should feel free to
experiment if there is a query which seems to be choosing a bad
plan. If the above results really do represent cache hit levels you
expect in production, the combination of the above changes should
come reasonably close to modeling costs realistically, resulting in
better plan choice.

If you don't expect such high cache hit ratios in production, you
probably don't want to go so low with page costs.
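As a concrete starting point (a sketch only, assuming the near-100% cache hit
rate holds; tune against your real workload):

# postgresql.conf (or per-session SET) -- illustrative values only
seq_page_cost = 0.1
random_page_cost = 0.1
cpu_tuple_cost = 0.03   # anywhere from 0.03 to 0.05 has chosen the same plans for me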
Post by Huan Ruan
- shared_buffers = 6GB
- effective_cache_size = 18GB
- work_mem = 10MB
- maintenance_work_mem = 3GB
Can you see any obvious issues with the other memory settings I
changed?
I might bump up work_mem to 20MB to 60MB, as long as you're not
going crazy with max_connections. I would probably take
maintenance_work_mem down to 1GB to 2GB -- you can have several of
these allocations at one time, and you don't want to blow away your
cache. (I think it might actually be adjusted down to 2GB
internally anyway; but I would need to check.)
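Something along these lines, for instance:

# postgresql.conf -- rough starting values, adjust to your workload
work_mem = 40MB               # somewhere in the 20MB to 60MB range, watching max_connections
maintenance_work_mem = 1GB    # 1GB to 2GB; several of these can be live at once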

-Kevin
Huan Ruan
2012-12-13 23:51:27 UTC
Permalink
Hi Kevin

Again, many thanks for your time and help.
Post by Kevin Grittner
Hash 1st run
"Hash Join (cost=1681.87..6414169.04 rows=48261 width=171)
(actual time=2182.450..88158.645 rows=48257 loops=1)"
" -> Seq Scan on invtran bigtable (cost=0.00..4730787.28
rows=168121728 width=108) (actual time=0.051..32581.052
rows=168121657 loops=1)"
194 nanoseconds per row suggests 100% cache hits.
NL 1st run
"Nested Loop (cost=0.00..6451637.88 rows=48261 width=171) (actual
time=0.056..551.438 rows=48257 loops=1)"
" -> Index Scan using pk_invtran on invtran bigtable
(cost=0.00..133.65 rows=1 width=108) (actual time=0.010..0.010
rows=1 loops=48261)"
10 microseconds per index scan (each index scan requiring multiple
"random" accesses) also suggests 100% cache hits.
Interesting to see how you derived 100% cache hits. I assume by 'cache' you
mean the pg shared buffer plus the OS cache? Because the table is 23GB but
the shared buffer is only 6GB. Even then, I'm not completely convinced
because the total RAM is just 24GB, part of which will have to be used for
other data and indexes.

I read somewhere that a pg shared buffer that's too big can hurt performance
and it's better to just leave it to the OS cache. I'm not sure why, but for
now I just configured the shared buffer to be 1/4 of the total RAM.
Post by Kevin Grittner
I originally reduced random_page_cost to 2 to achieve the nested
loop join. Now I set cpu_tuple_cost to 0.05 and reset
random_page_cost back to 4, I can also achieve a nested loop
join.
I'm still new in Postgres, but I'm worried about random_page_cost
being 2 is too low, so maybe increasing cpu_tuple_cost is a
better choice.
If these are typical of what you would expect in production, then
the fact that with default cost factors the costs are barely
different (by 0.6%) for actual run times which differ by two orders
of magnitude (the chosen plan is 160 times slower) means that the
modeling of cost factors is off by a lot.
If you expect the active portion of your database to be fully
cached like this, it makes sense to reduce random_page_cost to be
equal to seq_page_cost. But that only adjusts the costs by at most
a factor of four, and we've established that in the above query
they're off by a factor of 160. To help make up the difference, it
makes sense to de-emphasize page access compared to cpu-related
costs by reducing both page costs to 0.1. Combined, these
adjustments still can't compensate for how far off the estimate
was.
In my experience default cpu_tuple_cost is understated compared to
other cpu-related costs, so I would do the above *plus* a boost to
cpu_tuple_cost. Personally, I have never seen a difference between
plans chosen with that set to 0.03 and 0.05, so I can't say where
in that range is the ideal value; you should feel free to
experiment if there is a query which seems to be choosing a bad
plan. If the above results really do represent cache hit levels you
expect in production, the combination of the above changes should
come reasonably close to modeling costs realistically, resulting in
better plan choice.
In production, about 60% of the database would fit in RAM, but roughly
speaking, all of the active data we need should fit in cache entirely. On the
test server I'm playing with now, RAM is only 8% of the database size.
Nonetheless, I will play with these parameters as you suggested.

I was wondering: on our production server, where effective_cache_size will be
much bigger, will pg then assume that most data is cached anyway and lean
towards a nested loop join rather than a scan for a hash join?

Even on a test server where the cache hit rate is much smaller, for a big
table like this, under what circumstances would a hash join perform better
than a nested loop join, though?
Post by Kevin Grittner
If you don't expect such high cache hit ratios in production, you
probably don't want to go so low with page costs.
Post by Huan Ruan
- shared_buffers = 6GB
- effective_cache_size = 18GB
- work_mem = 10MB
- maintenance_work_mem = 3GB
Can you see any obvious issues with the other memory settings I
changed?
I might bump up work_mem to 20MB to 60MB, as long as you're not
going crazy with max_connections. I would probably take
maintenance_work_mem down to 1GB to 2GB -- you can have several of
these allocations at one time, and you don't want to blow away your
cache. (I think it might actually be adjusted down to 2GB
internally anyway; but I would need to check.)
Yes, I had bumped up work_mem to 80MB yesterday to speed up another big
group-by query. I assumed this memory is only used if the query needs it and
is released as soon as the query finishes, so it won't be too much of an
issue as long as I don't have too many concurrent sorting queries running
(which is true in our production). Is this correct?

I increased maintenance_work_mem initially to speed up index creation when I
first pumped in the data. In the production environment we don't create
indexes at run time, so I think only vacuum and analyze will consume this
memory?

Thanks
Huan
Post by Kevin Grittner
-Kevin
Kevin Grittner
2012-12-14 00:54:43 UTC
Permalink
Post by Huan Ruan
Interesting to see how you derived 100% cache hits. I assume by 'cache' you
mean the pg shared buffer plus the OS cache? Because the table is 23GB but
the shared buffer is only 6GB. Even then, I'm not completely convinced
because the total RAM is just 24GB, part of which will have to be used for
other data and indexes.
Well, you can't have more than a few disk hits, which typically
take something like 10 ms each, and still get an average less than 200
nanoseconds.
Post by Huan Ruan
I read somewhere that a pg shared buffer that's too big can hurt the
performance and it's better just leave it to the OS cache. I'm not sure why
but for now, I just configured the shared buffer to be 1/4 of the total RAM.
PostgreSQL goes through the OS and its filesystems, unlike some
other products. The right balance of memory in the PostgreSQL
shared buffers versus what is left for a combination of OS caching
and other memory allocations can be hard to determine. 25% is a
good starting point, but your best performance might be higher or
lower. It's a good idea to try incremental adjustments using your
actual workload. Just remember you need to allow enough for several
maintenance_work_mem allocations, about one work_mem allocation per
max_connections setting, plus a reasonable OS cache size.
Post by Huan Ruan
I was wondering on our production server where the effetive_cache_size will
be much bigger, will pg then guess that probably most data is cached anyway
therefore leaning towards nested loop join rather than a scan for hash join?
Once effective_cache_size is larger than your largest index, its
exact value doesn't matter all that much.
Post by Huan Ruan
Even on a test server where the cache hit rate is much smaller, for a big
table like this, under what circumstances, will a hash join perform better
than nested loop join though?
With a low cache hit rate, that would generally be when the number
of lookups into the table exceeds about 10% of the table's rows.
Post by Huan Ruan
Yes, I had bumped up work_mem yesterday to speed up another big group by
query. I used 80MB. I assumed this memory will only be used if the query
needs it and will be released as soon as it's finished, so it won't be too
much an issue as long as I don't have too many concurrently sorting queries
running (which is true in our production). Is this correct?
Each connection running a query can allocate one work_mem
allocation per plan node (depending on node type), which will be
freed after the query completes. A common "rule of thumb" is to
plan on peaks of max_connections allocations of work_mem.
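As a rough sanity check (max_connections is not shown in this thread, so the
100 below is a made-up value):

SHOW max_connections;  -- say 100 (hypothetical)
SHOW work_mem;         -- 80MB in your current configuration
-- rule-of-thumb peak: 100 * 80MB = ~8GB of potential sort/hash memory,
-- on top of shared_buffers (6GB) and any maintenance_work_mem allocations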
Post by Huan Ruan
I increased maintenance_work_mem initially to speed up the index creation
when I first pump in the data. In production environment, we don't do run
time index creation, so I think only the vacuum and analyze will consume
this memory?
You'll probably be creating indexes from time to time. Figure an
occasional one of those plus up to one allocation per autovacuum
worker (and you probably shouldn't go below three of those).

-Kevin
Huan Ruan
2012-12-14 04:46:44 UTC
Permalink
Post by Kevin Grittner
With a low cache hit rate, that would generally be when the number
of lookups into the table exceeds about 10% of the table's rows.
So far, my main performance issue comes down to this pattern where Postgres
chooses a hash join that's slower than a nested loop indexed join. By
changing those cost parameters, this query works as expected now, but there
are others that fall into the same category and appear to be harder to
convince the optimiser about.

I'm still a bit worried about this query, as Postgres gets the record count
right and knows the index is a primary key index, so it knows it's matching
0.05m out of 170m records (0.03%), yet it still chooses the sequential scan.
Hopefully this is just related to that big index penalty bug introduced in
9.2.
Kevin Grittner
2012-12-14 13:48:30 UTC
Permalink
Post by Kevin Grittner
Post by Kevin Grittner
With a low cache hit rate, that would generally be when the number
of lookups into the table exceeds about 10% of the table's rows.
So far, my main performance issue comes down to this pattern where
Postgres chooses hash join that's slower than a nest loop indexed join. By
changing those cost parameters, this query works as expected now, but there
are others fall into the same category and appear to be harder to convince
the optimiser.
I'm still a bit worried about this query as Postgres gets the record count
right, and knows the index is a primary key index, therefore it knows it's
0.05m out of 170m records (0.03%) but still chooses the sequential scan.
Hopefully this is just related to that big index penalty bug introduced in
9.2.
Quite possibly, but it could be any of a number of other things,
like a type mismatch. It might be best to rule out other causes. If
you post the new query and EXPLAIN ANALYZE output, along with the
settings you have now adopted, someone may be able to spot
something. It wouldn't hurt to repeat OS and hardware info with it
so people have it handy for reference.

-Kevin
Huan Ruan
2012-12-19 01:55:04 UTC
Permalink
Post by Kevin Grittner
Quite possibly, but it could be any of a number of other things,
like a type mismatch. It might be best to rule out other causes. If
you post the new query and EXPLAIN ANALYZE output, along with the
settings you have now adopted, someone may be able to spot
something. It wouldn't hurt to repeat OS and hardware info with it
so people have it handy for reference.
Sorry for the late reply. To summarise,

The version is PostgreSQL 9.2.0 on x86_64-unknown-linux-gnu, compiled by
gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit. Server specs are:

- Centos, ext4
- 24GB memory
- 6 cores hyper-threaded (Intel(R) Xeon(R) CPU E5645).
- raid 10 on 4 sata disks

Config changes are


- shared_buffers = 6GB
- work_mem = 80MB
- maintenance_work_mem = 3GB
- effective_cache_size = 22GB
- seq_page_cost = 0.1
- random_page_cost = 0.1
- cpu_tuple_cost = 0.05
- geqo = off

The query is,

explain (analyze, buffers)
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on bigtable.invtranref = smalltable.invtranref

The result is,

"QUERY PLAN"
"Nested Loop (cost=0.00..341698.92 rows=48261 width=171) (actual
time=0.042..567.980 rows=48257 loops=1)"
" Buffers: shared hit=242267"
" -> Seq Scan on im_match_table smalltable (cost=0.00..2472.65
rows=48261 width=63) (actual time=0.006..8.230 rows=48261 loops=1)"
" Buffers: shared hit=596"
" -> Index Scan using pk_invtran on invtran bigtable (cost=0.00..6.98
rows=1 width=108) (actual time=0.010..0.011 rows=1 loops=48261)"
" Index Cond: (invtranref = smalltable.invtranref)"
" Buffers: shared hit=241671"
"Total runtime: 571.662 ms"
Kevin Grittner
2012-12-19 13:16:37 UTC
Permalink
Post by Huan Ruan
explain (analyze, buffers)
SELECT
 *
FROM IM_Match_Table smalltable
 inner join invtran bigtable on bigtable.invtranref = smalltable.invtranref
Well, one table or the other will need to be read in full, and you
would normally want that one to be the small table. When there is
no ORDER BY clause, the fastest way to do that will normally be a
seqscan. So that part of the query is as it should be. The only
question is whether the access to the big table is using the
fastest technique.

If you want to see what the planner's second choice would have
been, you could run:

SET enable_indexscan = off;

on a connection and try the explain again. If you don't like that
one, you might be able to disable another node type and see what
you get. If one of the other alternatives is faster, that would
suggest that adjustments are needed to the costing factors;
otherwise, it just takes that long to read hundreds of thousands of
rows in one table and look for related data for each of them in
another table.
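In full, something like:

SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT
*
FROM IM_Match_Table smalltable
inner join invtran bigtable on bigtable.invtranref = smalltable.invtranref;
RESET enable_indexscan;
-- if the second choice turns out to be a bitmap scan, also try
-- SET enable_bitmapscan = off; to see the next alternative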
Post by Huan Ruan
"Nested Loop (cost=0.00..341698.92 rows=48261 width=171) (actual
time=0.042..567.980 rows=48257 loops=1)"
Frankly, at 12 microseconds per matched pair of rows, I think
you're doing OK.

-Kevin
Huan Ruan
2012-12-20 01:02:17 UTC
Permalink
Post by Kevin Grittner
Frankly, at 12 microseconds per matched pair of rows, I think
you're doing OK.
This plan is the good one; I want the index scan nested loop join, and it is
only achieved after making all these cost factor changes. Before that, it was
a hash join and was very slow.

However, I'm worried about the config changes being too 'extreme', i.e. both
sequential I/O and random I/O having the same cost, and it being only 0.1.
So I was wondering more about why I have to make such dramatic changes to
convince the optimiser to use an NL join instead of a hash join. And also,
I'm not sure yet what impact these changes will have on other queries, e.g.
will a query that's fine with a hash join now choose an NL join and run
slower?
Kevin Grittner
2012-12-20 14:06:29 UTC
Permalink
Post by Kevin Grittner
Post by Kevin Grittner
Frankly, at 12 microseconds per matched pair of rows, I think
you're doing OK.
This plan is the good one, I want the indexscan nested loop join and this
is only achieved after making all these costing factors change. Before
that, it was hash join and was very slow.
However, I'm worried about the config changes being too 'extreme', i.e.
both sequential I/O and random I/O have the same cost and being only 0.1.
So, I was more wondering why I have to make such dramatic changes to
convince the optimiser to use NL join instead of hash join. And also, I'm
not sure what impact will these changes have on other queries yet. e.g.
will a query that's fine with hash join now choose NL join and runs slower?
I understand the concern, but PostgreSQL doesn't yet have a knob to
turn for "cache hit ratio". You essentially need to build that into
the page costs. Since your cache hit ratio (between shared buffers
and the OS) is so high, the cost of page access relative to CPU
costs has declined and there isn't any effective difference between
sequential and random access. As the level of caching changes, you
may need to adjust. In one production environment where there was
significant caching, but far enough from 100% to matter, we tested
various configurations and found the fastest plans being chosen
with seq_page_cost = 0.3 and random_page_cost = 0.5. Tune to your
workload.

-Kevin
Huan Ruan
2012-12-20 21:16:24 UTC
Permalink
Post by Huan Ruan
Post by Kevin Grittner
Post by Kevin Grittner
Frankly, at 12 microseconds per matched pair of rows, I think
you're doing OK.
This plan is the good one, I want the indexscan nested loop join and this
is only achieved after making all these costing factors change. Before
that, it was hash join and was very slow.
However, I'm worried about the config changes being too 'extreme', i.e.
both sequential I/O and random I/O have the same cost and being only 0.1.
So, I was more wondering why I have to make such dramatic changes to
convince the optimiser to use NL join instead of hash join. And also, I'm
not sure what impact will these changes have on other queries yet. e.g.
will a query that's fine with hash join now choose NL join and runs
slower?
I understand the concern, but PostgreSQL doesn't yet have a knob to
turn for "cache hit ratio". You essentially need to build that into
the page costs. Since your cache hit ratio (between shared buffers
and the OS) is so high, the cost of page access relative to CPU
costs has declined and there isn't any effective difference between
sequential and random access. As the level of caching changes, you
may need to adjust. In one production environment where there was
significant caching, but far enough from 100% to matter, we tested
various configurations and found the fastest plans being chosen
with seq_page_cost = 0.3 and random_page_cost = 0.5. Tune to your
workload.
Thanks Kevin. I think I now have some ideas to try on the production server
when we switch.
