Discussion:
Why is the number of rows different in actual and estimated?
AI Rumman
2012-12-13 22:12:22 UTC
Why is the number of rows different in actual and estimated?
The default_statistics_target is set to 100.


explain analyze
select *
FROM (
    SELECT
        entity.id AS "con_s_id", entity.setype AS "con_s_setype",
        con_details.salutation AS "con_s_salutationtype", con_details.firstname AS "con_s_firstname",
        con_details.phone AS "con_s_phone", con_details.lastname AS "con_s_lastname",
        con_details.accountid AS "con_s_account_id_entityid", con_details.mobile AS "con_s_mobile",
        con_details.title AS "con_s_title", con_details.donotcall AS "con_s_donotcall",
        con_details.fax AS "con_s_fax", con_details.department AS "con_s_department",
        con_details.email AS "con_s_email", con_details.yahooid AS "con_s_yahooid",
        con_details.emailoptout AS "con_s_emailoptout", con_details.reportsto AS "con_s_con__id_entityid",
        con_details.reference AS "con_s_reference", entity.smownerid AS "con_s_assigned_user_id_entityid",
        CASE WHEN entity.owner_type='U' THEN users.user_name ELSE groups.groupname END AS "con_s_assigned_user_id_name",
        CASE WHEN entity.owner_type='U' THEN users.first_name || ' ' || users.last_name ELSE groups.groupname END AS "con_s_assigned_user_id",
        CASE WHEN entity.owner_type='U' THEN 'Users' ELSE 'Groups' END AS "con_s_assigned_user_id_linkmodule",
        entity.modifiedtime AS "con_s_modifiedtime", con_details.notify_owner AS "con_s_notify_owner",
        entity.createdtime AS "con_s_createdtime", entity.description AS "con_s_description",
        con_details.imagename AS "con_s_imagename"
    FROM con_details
    INNER JOIN entity ON con_details.con_id = entity.id
    LEFT JOIN groups ON groups.groupid = entity.smownerid
    LEFT JOIN users ON entity.smownerid = users.id
    WHERE entity.setype = 'con_s' AND entity.deleted = 0
      AND (((con_details.email ILIKE '%@%')))
) con_s_base
INNER JOIN con_scf ON con_s_base."con_s_id" = con_scf.con_id
INNER JOIN con_subdetails ON con_s_base."con_s_id" = con_subdetails.con_subscriptionid
INNER JOIN customerdetails ON con_s_base."con_s_id" = customerdetails.customerid
INNER JOIN con_address ON con_s_base."con_s_id" = con_address.con_addressid


Nested Loop  (cost=18560.97..26864.83 rows=24871 width=535) (actual time=1335.157..8492.414 rows=157953 loops=1)
  ->  Hash Left Join  (cost=18560.97..26518.91 rows=116 width=454) (actual time=1335.117..6996.585 rows=205418 loops=1)
        Hash Cond: (entity.smownerid = users.id)
        ->  Hash Left Join  (cost=18547.22..26503.57 rows=116 width=419) (actual time=1334.354..6671.442 rows=205418 loops=1)
              Hash Cond: (entity.smownerid = groups.groupid)
              ->  Nested Loop  (cost=18546.83..26502.72 rows=116 width=398) (actual time=1334.314..6385.664 rows=205418 loops=1)
                    ->  Nested Loop  (cost=18546.83..26273.40 rows=774 width=319) (actual time=1334.272..5025.175 rows=205418 loops=1)
                          ->  Hash Join  (cost=18546.83..24775.02 rows=5213 width=273) (actual time=1334.238..3666.748 rows=205420 loops=1)
                                Hash Cond: (con_subdetails.con_subscriptionid = entity.id)
                                ->  Index Scan using con_subdetails_pkey on con_subdetails  (cost=0.00..4953.41 rows=326040 width=29) (actual time=0.019..350.736 rows=327328 loops=1)
                                ->  Hash  (cost=18115.71..18115.71 rows=34489 width=244) (actual time=1334.147..1334.147 rows=205420 loops=1)
                                      Buckets: 4096  Batches: 1  Memory Usage: 19417kB
                                      ->  Hash Join  (cost=9337.97..18115.71 rows=34489 width=244) (actual time=418.054..1156.453 rows=205420 loops=1)
                                            Hash Cond: (customerdetails.customerid = entity.id)
                                            ->  Seq Scan on customerdetails  (cost=0.00..4752.46 rows=327146 width=13) (actual time=0.021..176.389 rows=327328 loops=1)
                                            ->  Hash  (cost=6495.65..6495.65 rows=227386 width=231) (actual time=417.839..417.839 rows=205420 loops=1)
                                                  Buckets: 32768  Batches: 1  Memory Usage: 16056kB
                                                  ->  Index Scan using entity_setype_idx on entity  (cost=0.00..6495.65 rows=227386 width=231) (actual time=0.033..253.880 rows=205420 loops=1)
                                                        Index Cond: ((setype)::text = 'con_s'::text)
                          ->  Index Scan using con_address_pkey on con_address  (cost=0.00..0.27 rows=1 width=46) (actual time=0.003..0.004 rows=1 loops=205420)
                                Index Cond: (con_addressid = entity.id)
                    ->  Index Scan using con_scf_pkey on con_scf  (cost=0.00..0.28 rows=1 width=79) (actual time=0.003..0.004 rows=1 loops=205418)
                          Index Cond: (con_id = entity.id)
              ->  Hash  (cost=0.34..0.34 rows=4 width=25) (actual time=0.016..0.016 rows=4 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 1kB
                    ->  Index Scan using groups_pkey on groups  (cost=0.00..0.34 rows=4 width=25) (actual time=0.008..0.012 rows=4 loops=1)
        ->  Hash  (cost=9.00..9.00 rows=380 width=39) (actual time=0.746..0.746 rows=380 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 28kB
              ->  Index Scan using users_pkey on users  (cost=0.00..9.00 rows=380 width=39) (actual time=0.014..0.440 rows=380 loops=1)
  ->  Index Scan using con_details_pkey on con_details  (cost=0.00..0.29 rows=1 width=85) (actual time=0.004..0.004 rows=1 loops=205418)
        Index Cond: (con_id = entity.id)
        Filter: ((email)::text ~~* '%@%'::text)
Total runtime: 8573.237 ms
Andrew Dunstan
2012-12-13 22:36:26 UTC
Post by AI Rumman
Why is the number of rows different in actual and estimated?
Isn't that in the nature of estimates? An estimate is a heuristic guess
at the number of rows it will find for the given query or part of a
query. It's not uncommon for estimates to be out by several orders of
magnitude. Guaranteeing estimates within bounded accuracy and in a given
short amount of time (you don't want your planning time to overwhelm
your execution time) isn't possible.

cheers

andrew
Evgeny Shishkin
2012-12-13 22:40:35 UTC
Post by AI Rumman
Why is the number of rows different in actual and estimated?
Isn't that in the nature of estimates? An estimate is a heuristic guess at the number of rows it will find for the given query or part of a query. It's not uncommon for estimates to be out by several orders of magnitude. Guaranteeing estimates within bounded accuracy and in a given short amount of time (you don't want your planning time to overwhelm your execution time) isn't possible.
The main question, I think, is what to do with it.

The problem starts here:

->  Hash Join  (cost=9337.97..18115.71 rows=34489 width=244) (actual time=418.054..1156.453 rows=205420 loops=1)
      Hash Cond: (customerdetails.customerid = entity.id)
      ->  Seq Scan on customerdetails  (cost=0.00..4752.46 rows=327146 width=13) (actual time=0.021..176.389 rows=327328 loops=1)
      ->  Hash  (cost=6495.65..6495.65 rows=227386 width=231) (actual time=417.839..417.839 rows=205420 loops=1)
            Buckets: 32768  Batches: 1  Memory Usage: 16056kB
            ->  Index Scan using entity_setype_idx on entity  (cost=0.00..6495.65 rows=227386 width=231) (actual time=0.033..253.880 rows=205420 loops=1)
                  Index Cond: ((setype)::text = 'con_s'::text)
->  Index Scan using con_address_pkey on con_address  (cost=0.00..0.27 rows=1 width=46) (actual time=0.003..0.004 rows=1 loops=205420)

As you can see, the access-method estimates are OK; it is the join result-set estimate that is wrong.

How to deal with it?

Maybe a hack with a CTE can help, but is there a way to improve the statistics correlation?
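
For illustration, a sketch of that CTE hack, under the assumption that a CTE on this era of PostgreSQL is an optimization fence (it is materialized and the planner cannot reorder joins across it); the column list is abbreviated here:

WITH con_s_base AS (
    SELECT entity.id AS con_s_id, con_details.email
    FROM con_details
    INNER JOIN entity ON con_details.con_id = entity.id
    WHERE entity.setype = 'con_s' AND entity.deleted = 0
      AND con_details.email ILIKE '%@%'   -- same filters as the original inner query
)
SELECT *
FROM con_s_base
INNER JOIN con_scf ON con_s_base.con_s_id = con_scf.con_id
INNER JOIN con_subdetails ON con_s_base.con_s_id = con_subdetails.con_subscriptionid
INNER JOIN customerdetails ON con_s_base.con_s_id = customerdetails.customerid
INNER JOIN con_address ON con_s_base.con_s_id = con_address.con_addressid;

The fence only stops the planner from rearranging joins across it; the outer joins are still planned from an estimate of the CTE's size, so this may or may not help.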
Claudio Freire
2012-12-13 22:42:05 UTC
Post by AI Rumman
Why is the number of rows different in actual and estimated?
Isn't that in the nature of estimates? An estimate is a heuristic guess at
the number of rows it will find for the given query or part of a query. It's
not uncommon for estimates to be out by several orders of magnitude.
Guaranteeing estimates within bounded accuracy and in a given short amount
of time (you don't want your planning time to overwhelm your execution time)
isn't possible.
Nested Loop (cost=18560.97..26864.83 rows=24871 width=535) (actual time=1335.157..8492.414 rows=157953 loops=1)
-> Hash Left Join (cost=18560.97..26518.91 rows=116 width=454) (actual time=1335.117..6996.585 rows=205418 loops=1)
It usually is due to some unrecognized correlation between the joined tables.
-> Hash Join (cost=9337.97..18115.71 rows=34489 width=244) (actual time=418.054..1156.453 rows=205420 loops=1)
Hash Cond: (customerdetails.customerid = entity.id)
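
To illustrate the independence assumption behind such misestimates, here is a tiny single-table analogue on a hypothetical table (not the OP's schema); the same multiplication of per-column selectivities happens across a join:

CREATE TABLE t (a int, b int);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 100000) i;  -- a and b perfectly correlated
ANALYZE t;
EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
-- estimates roughly 10 rows (0.01 * 0.01 * 100000) but returns 1000, because
-- the planner multiplies the two selectivities as if the columns were independent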
Andrew Dunstan
2012-12-13 23:09:42 UTC
Post by Evgeny Shishkin
-> Hash Join (cost=9337.97..18115.71 rows=34489 width=244) (actual time=418.054..1156.453 rows=205420 loops=1)
Hash Cond: (customerdetails.customerid = entity.id)
Well, it looks like it's choosing a join order that's quite a bit
different from the way the query is expressed, so the OP might need to
play around with forcing the join order some.
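
One hedged way to experiment with that, assuming the written join order is in fact the better one, is to stop the planner from reordering explicit JOINs for the session:

SET join_collapse_limit = 1;   -- planner keeps explicit JOINs in the order written
-- re-run the EXPLAIN ANALYZE above and compare the plans
RESET join_collapse_limit;     -- restore the default afterwards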


cheers

andrew
Evgeny Shishkin
2012-12-13 23:13:29 UTC
Post by Evgeny Shishkin
-> Hash Join (cost=9337.97..18115.71 rows=34489 width=244) (actual time=418.054..1156.453 rows=205420 loops=1)
Hash Cond: (customerdetails.customerid = entity.id)
Well, it looks like it's choosing a join order that's quite a bit different from the way the query is expressed, so the OP might need to play around with forcing the join order some.
OP joins 8 tables, and I suppose join_collapse_limit is set to the default of 8. I thought PostgreSQL's optimizer is not MySQL's.
Tom Lane
2012-12-13 23:36:54 UTC
Post by Evgeny Shishkin
Well, it looks like it's choosing a join order that's quite a bit different from the way the query is expressed, so the OP might need to play around with forcing the join order some.
OP joins 8 tables, and I suppose join_collapse_limit is set to the default of 8. I thought PostgreSQL's optimizer is not MySQL's.
It's not obvious to me that there's anything very wrong with the plan.
An 8-way join that produces 150K rows is unlikely to run in milliseconds
no matter what the plan. The planner would possibly have done the last
join step differently if it had had a better rowcount estimate, but even
if that were free the query would still have been 7 seconds (vs 8.5).

regards, tom lane
Evgeny Shishkin
2012-12-13 23:50:19 UTC
Post by Tom Lane
Post by Evgeny Shishkin
Well, it looks like it's choosing a join order that's quite a bit different from the way the query is expressed, so the OP might need to play around with forcing the join order some.
OP joins 8 tables, and i suppose join collapse limit is set to default 8. I thought postgresql's optimiser is not mysql's.
It's not obvious to me that there's anything very wrong with the plan.
An 8-way join that produces 150K rows is unlikely to run in milliseconds
no matter what the plan. The planner would possibly have done the last
join step differently if it had had a better rowcount estimate, but even
if that were free the query would still have been 7 seconds (vs 8.5).
Maybe in this case it is. I once wrote to this list regarding a similar problem - joining 4 tables, the result-set estimate was off by 2257 times - 750 ms vs. less than 1 ms. Unfortunately the question was not accepted to the list.

I spoke to Bruce Momjian about that problem at a local conference; he said shit happens :)
Claudio Freire
2012-12-14 00:38:53 UTC
Post by Evgeny Shishkin
Post by Tom Lane
Post by Evgeny Shishkin
OP joins 8 tables, and i suppose join collapse limit is set to default 8. I thought postgresql's optimiser is not mysql's.
It's not obvious to me that there's anything very wrong with the plan.
An 8-way join that produces 150K rows is unlikely to run in milliseconds
no matter what the plan. The planner would possibly have done the last
join step differently if it had had a better rowcount estimate, but even
if that were free the query would still have been 7 seconds (vs 8.5).
Maybe in this case it is. I once wrote to this list regarding a similar problem - joining 4 tables, the result-set estimate was off by 2257 times - 750 ms vs. less than 1 ms. Unfortunately the question was not accepted to the list.
I spoke to Bruce Momjian about that problem at a local conference; he said shit happens :)
I think it's more likely a missing FK constraint.
AI Rumman
2012-12-14 19:01:50 UTC
Does an FK constraint help to improve performance? Or is it only
for maintaining data integrity?
Post by Evgeny Shishkin
Post by Evgeny Shishkin
Post by Tom Lane
Post by Evgeny Shishkin
OP joins 8 tables, and i suppose join collapse limit is set to default
8. I thought postgresql's optimiser is not mysql's.
Post by Evgeny Shishkin
Post by Tom Lane
It's not obvious to me that there's anything very wrong with the plan.
An 8-way join that produces 150K rows is unlikely to run in milliseconds
no matter what the plan. The planner would possibly have done the last
join step differently if it had had a better rowcount estimate, but even
if that were free the query would still have been 7 seconds (vs 8.5).
May be in this case it is. I once wrote to this list regarding similar
problem - joining 4 tables, result set are off by 2257 times - 750ms vs
less then 1ms. Unfortunately the question was not accepted to the list.
Post by Evgeny Shishkin
I spoke to Bruce Momjian about that problem on one local conference, he
said shit happens :)
I think it's more likely a missing FK constraint.
Claudio Freire
2012-12-14 19:12:56 UTC
Does an FK constraint help to improve performance? Or is it only
for maintaining data integrity?
I'm not entirely sure it's taken into account (I think it is), but an FK
would tell the planner that every non-null value will produce a row.
It seems to think a large portion of the non-null values don't.
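
A quick way to check that hypothesis directly against the join that is misestimated here (names taken from the plan above):

SELECT count(*) AS orphan_rows
FROM customerdetails cd
WHERE NOT EXISTS (SELECT 1 FROM entity e WHERE e.id = cd.customerid);  -- rows that would not produce a join match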
Kevin Grittner
2012-12-14 19:14:27 UTC
Post by AI Rumman
Post by Claudio Freire
I think it's more likely a missing FK constraint.
Does an FK constraint help to improve performance? Or is it only
for maintaining data integrity?
I'm not aware of any situation where adding a foreign key
constraint would improve performance.

-Kevin
Tom Lane
2012-12-14 19:22:05 UTC
Post by Kevin Grittner
Post by AI Rumman
Does an FK constraint help to improve performance? Or is it only
for maintaining data integrity?
I'm not aware of any situation where adding a foreign key
constraint would improve performance.
There's been talk of teaching the planner to use the existence of FK
constraints to improve plans, but I don't believe any such thing is
in the code today.

regards, tom lane
AI Rumman
2012-12-14 19:28:50 UTC
So I am going to change join_collapse_limit and from_collapse_limit to 20.

Do I need to set geqo_threshold to greater than 20? It is now 12 (the default).

And could you let me know why the GEQO optimizer is not working well in this case?
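
For reference, a per-session sketch of the change being proposed above; the values are just the ones under discussion, not recommendations, and keeping geqo_threshold above the collapse limits means the exhaustive planner, rather than GEQO, is still used:

SET from_collapse_limit = 20;
SET join_collapse_limit = 20;
SET geqo_threshold = 21;   -- GEQO only kicks in for FROM lists with at least this many items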
Post by Tom Lane
Post by Kevin Grittner
Post by AI Rumman
Does FK Constraint help to improve performance? Or it is only
for maintaining data integrity?
I'm not aware of any situation where adding a foreign key
constraint would improve performance.
There's been talk of teaching the planner to use the existence of FK
constraints to improve plans, but I don't believe any such thing is
in the code today.
regards, tom lane
Claudio Freire
2012-12-14 20:10:18 UTC
Post by Tom Lane
Post by Kevin Grittner
Post by AI Rumman
Does FK Constraint help to improve performance? Or it is only
for maintaining data integrity?
I'm not aware of any situation where adding a foreign key
constraint would improve performance.
There's been talk of teaching the planner to use the existence of FK
constraints to improve plans, but I don't believe any such thing is
in the code today.
That made me look at the code.

So, eqjoinsel_inner in selfuncs.c would need those smarts. Cool.

Anyway, reading the code, I think I can now spot the possible issue
behind all of this.

Selectivity is decided based on the number of distinct values on both
sides, and the table's name "entity" makes me think it's a table that
is reused for several things. That could be a problem, since that
inflates distinct values, feeding misinformation to the planner.

Rather than a generic "entity" table, perhaps it would be best to
separate the different entities into different tables. Failing that,
maybe if you have an "entity type" kind of column, you could try
refining the join condition to filter by that kind; hopefully there's
an index over the entity kind and the planner can use more accurate MCV
data.
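
A hedged sketch of how to inspect what the planner currently believes about those columns, and how to raise the per-column statistics target if the MCV data is too coarse (column names taken from the query in this thread):

SELECT tablename, attname, n_distinct, null_frac, most_common_vals
FROM pg_stats
WHERE tablename = 'entity'
  AND attname IN ('id', 'setype', 'smownerid');

ALTER TABLE entity ALTER COLUMN setype SET STATISTICS 1000;  -- up from default_statistics_target = 100
ANALYZE entity;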
AI Rumman
2012-12-14 20:25:42 UTC
Yes, I do have a column in the entity table, setype, whose values are 'Contacts', 'Candidate', etc.
I have an index on it as well.
Are you suggesting making different tables for Contacts, Candidates, etc.?
Post by Claudio Freire
Post by Tom Lane
Post by Kevin Grittner
Post by AI Rumman
Does FK Constraint help to improve performance? Or it is only
for maintaining data integrity?
I'm not aware of any situation where adding a foreign key
constraint would improve performance.
There's been talk of teaching the planner to use the existence of FK
constraints to improve plans, but I don't believe any such thing is
in the code today.
That made me look the code.
So, eqjoinsel_inner in selfuncs.c would need those smarts. Cool.
Anyway, reading the code, I think I can now spot the possible issue
behind all of this.
Selectivity is decided based on the number of distinct values on both
sides, and the table's name "entity" makes me think it's a table that
is reused for several things. That could be a problem, since that
inflates distinct values, feeding misinformation to the planner.
Rather than a generic "entity" table, perhaps it would be best to
separate them different entities into different tables. Failing that,
maybe if you have an "entity type" kind of column, you could try
refining the join condition to filter by that kind, hopefully there's
an index over entity kind and the planner can use more accurate MCV
data.
Claudio Freire
2012-12-14 20:32:07 UTC
Post by AI Rumman
Are you suggesting making different tables for Contacts, Candidates, etc.?
Yes
Kevin Grittner
2012-12-14 20:34:03 UTC
Post by Claudio Freire
Selectivity is decided based on the number of distinct values on
both sides, and the table's name "entity" makes me think it's a
table that is reused for several things. That could be a problem,
since that inflates distinct values, feeding misinformation to
the planner.
Rather than a generic "entity" table, perhaps it would be best to
separate the different entities into different tables.
I missed that; good catch. Good advice.

Don't try to build a "database within a database" by having one
table for different types of data, with a code to sort them out.
EAV is a seriously bad approach for every situation where I've seen
someone try to use it. I was about to say it's like trying to drive
a nail with a pipe wrench, then realized it's more like putting a
bunch of hammers in a bag and swinging the bag at the nail.

-Kevin
AI Rumman
2012-12-14 22:12:39 UTC
Post by Kevin Grittner
Post by Claudio Freire
Selectivity is decided based on the number of distinct values on
both sides, and the table's name "entity" makes me think it's a
table that is reused for several things. That could be a problem,
since that inflates distinct values, feeding misinformation to
the planner.
Rather than a generic "entity" table, perhaps it would be best to
separate them different entities into different tables.
I missed that; good catch. Good advice.
Don't try to build a "database within a database" by having one
table for different types of data, with a code to sort them out.
EAV is a seriously bad approach for every situation where I've seen
someone try to use it. I was about to say it's like trying to drive
a nail with a pipe wrench, then realized it's more like putting a
bunch of hammers in a bag and swinging the bag at the nail.
-Kevin
The ENTITY table has 2164493 rows with data as follows:

type | count
-----------------------+--------
Contacts | 327352
Candidate | 34668
Emailst | 33604
Calendar | 493956
Contacts Image | 7
PriceBooks | 2
Notes Attachment | 17
SalesOrder | 6
Acc | 306832
...
..
(29 rows)

Do you think partitioning will improve the overall performance of the
application, where all the queries join with this table?
Kevin Grittner
2012-12-14 22:47:39 UTC
 type | count
-----------------------+--------
 Contacts | 327352
 Candidate | 34668
 Emailst | 33604
 Calendar | 493956
 Contacts Image | 7
 PriceBooks | 2
 Notes Attachment | 17
 SalesOrder | 6
 Acc | 306832
...
..
(29 rows)
Do you think partitioning will improve the overall performance of
the application, where all the queries join with this table?
I would not consider putting contacts, calendars, and sales orders
in separate tables as "partitioning". It is normalizing. That will
be useful if you happen to discover, for instance, that the data
elements needed or relationships to other types of data for a
calendar don't exactly match those for a contact image or a sales
order.

And yes, I would expect that using separate tables for
fundamentally different types of data would improve performance. If
some of these objects (like contacts and candidates) have common
elements, you might want to either have both inherit from a common
Person table, or (usually better, IMO) have both reference rows in
a Person table. The latter is especially important if a contact can
be a candidate and you want to be able to associate them.
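
A minimal DDL sketch of the "reference rows in a Person table" approach; all table and column names here are hypothetical, not taken from the OP's schema:

CREATE TABLE person (
    person_id  bigserial PRIMARY KEY,
    first_name text,
    last_name  text
);

CREATE TABLE contact (
    contact_id bigserial PRIMARY KEY,
    person_id  bigint NOT NULL REFERENCES person (person_id),
    email      text
);

CREATE TABLE candidate (
    candidate_id bigserial PRIMARY KEY,
    person_id    bigint NOT NULL REFERENCES person (person_id),
    applied_on   date
);

-- the same person can then be both a contact and a candidate, which is the
-- association mentioned above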

-Kevin