Discussion:
Improving index usage on very non selective index
(too old to reply)
m***@gmail.com
2014-02-23 00:10:23 UTC
Permalink
I'm having a very wierd dataset, where several records from large table has any data at all, but when they do it's hundredths of thousands of records.
I'm trying to select only records that have data but I'm having some problems with index usage. I know you cannot usually "force" postgresql to use certain index but in this case it works.

\d frames
Table "public.frames"
Column | Type | Modifiers
------------+-----------------------------+-----------------------------------------------------
id | integer | not null default nextval('frames_id_seq'::regclass)
match_id | integer |
timestamp | timestamp without time zone |
timecode | integer |
created_at | timestamp without time zone |
updated_at | timestamp without time zone |
file_id | character varying(255) |
identified | integer | default 0
Indexes:
"frames_pkey" PRIMARY KEY, btree (id)
"index_frames_on_match_id" btree (match_id)
"index_frames_on_timestamp_and_match_id" btree ("timestamp", match_id)

SELECT COUNT(*) FROM frames; => 832526

SELECT matches.id, count(frames.id) FROM matches LEFT JOIN frames ON frames.match_id = matches.id GROUP BY matches.id HAVING count(frames.id) > 0;
id | count
----+--------
31 | 123363
28 | 121475
24 | 110155
21 | 108258
22 | 106837
25 | 89182
26 | 87104
27 | 86152
(8 rows)

SELECT matches.id, count(frames.id) FROM matches LEFT JOIN frames ON frames.match_id = matches.id GROUP BY matches.id HAVING count(frames.id) = 0;
....
(568 rows)


Two solutions I've found would be:

SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
Time: 11697,645 ms

or

SELECT DISTINCT "matches".* FROM "matches" INNER JOIN "frames" ON "frames"."match_id" = "matches"."id"
Time: 879,325 ms

Neither query seems to use index on match_id in frames table. It's understendable since normally it's not very selective, unfortunately here it would be really helpful. As:

SET enable_seqscan = OFF;
SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
Time: 1,239 ms

EXPLAIN for queries:

EXPLAIN for: SELECT DISTINCT "matches".* FROM "matches" INNER JOIN "frames" ON "frames"."match_id" = "matches"."id"

QUERY PLAN
-----------------------------------------------------------------------------
HashAggregate (cost=59253.47..59256.38 rows=290 width=155)
-> Hash Join (cost=6.26..33716.73 rows=785746 width=155)
Hash Cond: (frames.match_id = matches.id)
-> Seq Scan on frames (cost=0.00..22906.46 rows=785746 width=4)
-> Hash (cost=4.45..4.45 rows=145 width=155)
-> Seq Scan on matches (cost=0.00..4.45 rows=145 width=155)
(6 rows)

EXPLAIN for: SELECT "matches".* FROM "matches" WHERE (EXISTS (SELECT id FROM frames WHERE frames.match_id = matches.id LIMIT 1))
QUERY PLAN
----------------------------------------------------------------------------
Seq Scan on matches (cost=0.00..41.17 rows=72 width=155)
Filter: (SubPlan 1)
SubPlan 1
-> Limit (cost=0.00..0.25 rows=1 width=4)
-> Seq Scan on frames (cost=0.00..24870.83 rows=98218 width=4)
Filter: (match_id = matches.id)
(6 rows)


SET enable_seqscan = OFF;

EXPLAIN SELECT "matches".* FROM "matches" WHERE (SELECT true FROM frames WHERE frames.match_id = matches.id LIMIT 1);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on matches (cost=10000000000.00..10000000118.37 rows=72 width=155)
Filter: (SubPlan 1)
SubPlan 1
-> Limit (cost=0.00..0.79 rows=1 width=0)
-> Index Scan using index_frames_on_match_id on frames (cost=0.00..81762.68 rows=104066 width=0)
Index Cond: (match_id = matches.id)
(6 rows)


Any suggestions how to tweek the query to use index here? Maybe other ways to chec for existance of recrs that would execute closer to 1ms I get out of index then 11s ?

PS. I did run ANALYZE, VACUM ANALYZE, all the steps normally suggested to improve index usage. I did read http://wiki.postgresql.org/wiki/SlowQueryQuestions if you need any more data I'd be happy provide it.
m***@gmail.com
2014-02-23 00:55:25 UTC
Permalink
I've noticed that removing LIMIT 1 is actually speeding up query a lot in EXISTS variant.

It's still much slower (100 times) then the index only version though:

SET enable_seqscan = ON;
SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id);
Time: 1,263 ms
SET enable_seqscan = ON;
SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id);
Time: 163,803 ms

SET enable_seqscan = ON;
EXPLAIN SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id);
QUERY PLAN
------------------------------------------------------------------------------------
Nested Loop (cost=25455.58..25457.90 rows=8 width=155)
-> HashAggregate (cost=25455.58..25455.66 rows=8 width=4)
-> Seq Scan on frames (cost=0.00..23374.26 rows=832526 width=4)
-> Index Scan using matches_pkey on matches (cost=0.00..0.27 rows=1 width=155)
Index Cond: (id = frames.match_id)
(5 rows)

SET enable_seqscan = OFF;
EXPLAIN SELECT "matches".* FROM "matches" WHERE EXISTS (SELECT true FROM frames WHERE frames.match_id = matches.id);
QUERY PLAN
------------------------------------------------------------------------------------------------------
Nested Loop Semi Join (cost=0.00..51363.92 rows=8 width=155)
-> Index Scan using matches_pkey on matches (cost=0.00..16.68 rows=145 width=155)
-> Index Scan using index_frames_on_match_id on frames (cost=0.00..51272.64 rows=104066 width=4)
Index Cond: (match_id = matches.id)
(4 rows)

Time: 0,809 ms

Loading...