Alex Vinnik
2013-01-03 22:54:10 UTC
Hi everybody,
I have implemented my first app using PG DB and thought for a minute(may be
two) that I know something about PG but below problem totally destroyed my
confidence :). Please help me to restore it.
Here is simple join query. It runs just fine on MS SQL 2008 and uses
all available indexes using even bigger overall dataset.
select visits.id, views.id
from visits join views on visits.id = views.visit_id
where visits.created_at >= '11/15/2012' and visits.created_at <
'11/16/2012'
Quick performance stat
MS SQL: 1 second, 264K rows
PG: 158 seconds, 264K rows
Explain plan from both DBs
PG QUERY PLAN
Hash Join (cost=12716.17..1101820.09 rows=248494 width=8)
Hash Cond: (views.visit_id = visits.id)
-> Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)
-> Hash (cost=10549.16..10549.16 rows=132081 width=4)
-> Index Scan using visits_created_at_index on visits
(cost=0.00..10549.16 rows=132081 width=4)
Index Cond: ((created_at >= '2012-11-15 00:00:00'::timestamp
without time zone) AND (created_at < '2012-11-16 00:00:00'::timestamp
without time zone))
schemaname | tablename | indexname | tablespace |
indexdef
------------+-----------+---------------------------------+------------+------------------------------------------------------------------------------------------
public | views | views_pkey | |
CREATE UNIQUE INDEX views_pkey ON views USING btree (id)
public | views | views_visit_id_index | |
CREATE INDEX views_visit_id_index ON views USING btree (visit_id)
MS SQL Query plan
'11/16/2012'
|--Parallelism(Gather Streams)
|--Nested Loops(Inner Join, OUTER REFERENCES:([visits].[id],
[Expr1006]) OPTIMIZED WITH UNORDERED PREFETCH)
|--Index Seek(OBJECT:([visits].[test]),
SEEK:([visits].[created_at] >= '2012-11-15 00:00:00.000' AND
[visits].[created_at] < '2012-11-16 00:00:00.000') ORDERED FORWARD)
|--Index Seek(OBJECT:([views].[views_visit_id_index]),
SEEK:([views].[visit_id]=[raw_visits].[id]) ORDERED FORWARD)
It is clear that PG does full table scan "Seq Scan on views
(cost=0.00..819136.56 rows=17434456 width=8)"
Don't understand why PG doesn't use views_visit_id_index in that query but
rather scans whole table. One explanation I have found that when resulting
dataset constitutes ~15% of total number of rows in the table then seq scan
is used. In this case resulting dataset is just 1.5% of total number of
rows. So it must be something different. Any reason why it happens and how
to fix it?
Postgres 9.2
Ubuntu 12.04.1 LTS
shared_buffers = 4GB the rest of the settings are default ones
Thanks
-Alex
I have implemented my first app using PG DB and thought for a minute(may be
two) that I know something about PG but below problem totally destroyed my
confidence :). Please help me to restore it.
Here is simple join query. It runs just fine on MS SQL 2008 and uses
all available indexes using even bigger overall dataset.
select visits.id, views.id
from visits join views on visits.id = views.visit_id
where visits.created_at >= '11/15/2012' and visits.created_at <
'11/16/2012'
Quick performance stat
MS SQL: 1 second, 264K rows
PG: 158 seconds, 264K rows
Explain plan from both DBs
PG QUERY PLAN
Hash Join (cost=12716.17..1101820.09 rows=248494 width=8)
Hash Cond: (views.visit_id = visits.id)
-> Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)
-> Hash (cost=10549.16..10549.16 rows=132081 width=4)
-> Index Scan using visits_created_at_index on visits
(cost=0.00..10549.16 rows=132081 width=4)
Index Cond: ((created_at >= '2012-11-15 00:00:00'::timestamp
without time zone) AND (created_at < '2012-11-16 00:00:00'::timestamp
without time zone))
schemaname | tablename | indexname | tablespace |
indexdef
------------+-----------+---------------------------------+------------+------------------------------------------------------------------------------------------
public | views | views_pkey | |
CREATE UNIQUE INDEX views_pkey ON views USING btree (id)
public | views | views_visit_id_index | |
CREATE INDEX views_visit_id_index ON views USING btree (visit_id)
MS SQL Query plan
'11/16/2012'
|--Parallelism(Gather Streams)
|--Nested Loops(Inner Join, OUTER REFERENCES:([visits].[id],
[Expr1006]) OPTIMIZED WITH UNORDERED PREFETCH)
|--Index Seek(OBJECT:([visits].[test]),
SEEK:([visits].[created_at] >= '2012-11-15 00:00:00.000' AND
[visits].[created_at] < '2012-11-16 00:00:00.000') ORDERED FORWARD)
|--Index Seek(OBJECT:([views].[views_visit_id_index]),
SEEK:([views].[visit_id]=[raw_visits].[id]) ORDERED FORWARD)
It is clear that PG does full table scan "Seq Scan on views
(cost=0.00..819136.56 rows=17434456 width=8)"
Don't understand why PG doesn't use views_visit_id_index in that query but
rather scans whole table. One explanation I have found that when resulting
dataset constitutes ~15% of total number of rows in the table then seq scan
is used. In this case resulting dataset is just 1.5% of total number of
rows. So it must be something different. Any reason why it happens and how
to fix it?
Postgres 9.2
Ubuntu 12.04.1 LTS
shared_buffers = 4GB the rest of the settings are default ones
Thanks
-Alex