Stefan Andreatta
2012-12-29 20:57:04 UTC
I have encountered serious underestimation of the number of distinct values
when values are not evenly distributed but clustered within a column. I think
this problem might be relevant to many real-world use cases, and I wonder
whether there is a good workaround or possibly a programmatic solution that
could be implemented.
Thanks for your help!
Stefan
*The Long Story:*
When Postgres collects statistics, it estimates the number of distinct
values for every column (see pg_stats.n_distinct). This is an important
input for the planner's selectivity estimates and can therefore have a
great influence on the resulting query plan.
My Problem:
When I collected statistics on some columns that have rather high
selectivity but are nowhere near unique, I consistently got n_distinct
values that are far too low (easily by orders of magnitude). Worse still,
the estimates did not really improve until I analyzed the whole table.
I tested this for Postgres 9.1 and 9.2. An artificial test-case is
described at the end of this mail.
Some Analysis of the Problem:
Unfortunately, it is not trivial to estimate the total number of distinct
values from a sample. As far as I could find out, Postgres uses an
algorithm based on the number of values that are found only once in the
sample taken by ANALYZE. I found references to Good-Turing frequency
estimation
(http://encodestatistics.org/publications/statistics_and_postgres.pdf)
and to a paper by Haas & Stokes, Computer Science, 1996
(http://researcher.ibm.com/researcher/files/us-phaas/jasa3rj.pdf). The
latter reference comes from Josh Berkus in a 2005 discussion on the
Postgres performance list (see e.g.
http://grokbase.com/t/postgresql/pgsql-performance/054kztf8pf/bad-n-distinct-estimation-hacks-suggested
for the whole discussion). The formula given there for the total number of
distinct values is:
n*d / (n - f1 + f1*n/N)
where f1 is the number of values that occur only once in the sample, n is
the number of rows sampled, d is the number of distinct values found in
the sample, and N is the total number of rows in the table.
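To get a feeling for how this estimator behaves, here is a minimal sketch
with made-up sample figures (an illustration of the formula only, not the
planner's actual code). With many singletons the estimate extrapolates far
beyond the sample (roughly 43,000 for the first row); with f1 = 0 it
collapses to d (2,900 for the second row):
-- Illustration only: n = rows sampled, d = distincts in the sample,
-- f1 = values seen exactly once, n_total = total rows in the table.
SELECT f1,
       round(n * d / (n - f1 + f1 * n / n_total)) AS estimated_n_distinct
FROM (VALUES (3000.0, 2900.0, 2800.0, 10000000.0),
             (3000.0, 2900.0,    0.0, 10000000.0)) AS s(n, d, f1, n_total);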
Now, the 2005 discussion goes into great detail on the advantages and
disadvantages of this algorithm, particularly with small sample sizes, and
several alternatives are discussed. I do not know whether anything has been
changed since then, but I do know that the specific problem I focus on here
still persists.
When the number of values that are found only once in the sample (f1)
becomes zero, the whole term equals d, that is, n_distinct is estimated
to be just the number of distincts found in the sample.
This is basically fine, as it should only happen when the sample has
really covered more or less all distinct values. However, we have a
sampling problem here: for maximum efficiency, Postgres samples not random
rows but rows from random pages. If the distribution of the values is not
random but clustered (that is, equal values tend to be close together), we
run into problems: when each sampled page covers multiple adjacent rows,
the probability that any value from a clustered distribution is sampled
only once is very low.
So, under these circumstances, the estimate for n_distinct will always be
close to the number of distinct values found in the sample, even if every
value in fact appears only a few times in the table.
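As a rough cross-check of this reasoning (runnable once the test table from
the test case below exists; the cast through point is just a convenient
hack to pull the block number out of ctid), one can count how often a value
of a clustered column appears exactly once on the heap page that holds it.
With about five adjacent copies per value, that should hardly ever happen
except at page boundaries:
-- Per (page, value) pair of a clustered column: how often is the value
-- present exactly once on that page?
SELECT sum(CASE WHEN cnt = 1 THEN 1 ELSE 0 END) AS singleton_page_value_pairs,
       count(*)                                 AS all_page_value_pairs
FROM (SELECT (ctid::text::point)[0] AS page,
             clustered_ordered_2000k AS val,
             count(*) AS cnt
      FROM test_1
      GROUP BY 1, 2) AS per_page_value;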
Relevance:
I think this is not just an unfortunate corner case but a very common
situation. Imagine two tables that are filled continually over time, where
the second table references the first - for example, some objects and
multiple properties for each. The foreign key column of the second table
will then have many distinct values but a highly clustered distribution.
It is probably not helpful if the planner significantly underestimates the
high selectivity of that foreign key column.
Workarounds:
There are workarounds: manually setting the column statistics, or using an
extremely high statistics target so that the whole table gets analyzed.
However, these workarounds do not seem elegant and may be impractical.
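For completeness, this is roughly what the two workarounds look like on the
test table defined below (the column is just one example): the first pins
n_distinct so that subsequent ANALYZE runs keep it, the second raises the
per-column statistics target to its maximum so that far more of the table
gets sampled:
ALTER TABLE test_1 ALTER COLUMN clustered_ordered_2000k SET (n_distinct = 2000000);
-- or, alternatively, sample much more of the table for this column:
ALTER TABLE test_1 ALTER COLUMN clustered_ordered_2000k SET STATISTICS 10000;
ANALYZE test_1;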
Questions:
A) Did I find the correct reason for my problem? Specifically, does
Postgres really estimate n_distinct as described above?
B) Are there any elegant workarounds?
C) What could be a programmatic solution to this problem? I think it might
be possible to use the number of values that are found on only one page
(vs. found only once at all) as f1 - a rough illustration follows below.
Or could the number of distinct values be calculated using some completely
different approach?
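To illustrate what the alternative f1 from (C) would measure, here is a
sketch computed over the whole test table (defined below) rather than
inside the sampling code; it contrasts the number of values found only once
overall with the number of values found on only one heap page, again using
the ctid/point hack for the block number:
SELECT sum(CASE WHEN total_rows = 1 THEN 1 ELSE 0 END) AS f1_single_row,
       sum(CASE WHEN pages = 1 THEN 1 ELSE 0 END)      AS f1_single_page
FROM (SELECT clustered_random_2000k AS val,
             count(*) AS total_rows,
             count(DISTINCT (ctid::text::point)[0]) AS pages
      FROM test_1
      GROUP BY clustered_random_2000k) AS per_value;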
Test Case:
For an artificial test case, let's create a table and fill it with 10
million rows (approx. 1,300 MB required). There is an ID column with unique
values and 4 groups of 3 columns each, with an average of:
- 5 rows per value (x_2000k = 2,000,000 distinct values)
- 25 rows per value (x_400k = 400,000 distinct values)
- 125 rows per value (x_80k = 80,000 distinct values).
The 4 groups of columns show different distributions:
- clustered and ordered (e.g. 1,1,1,2,2,2,3,3,3): clustered_ordered_x
- clustered but random values (e.g. 2,2,2,7,7,7,4,4,4): clustered_random_x
- uniform (e.g. 1,2,3,1,2,3,1,2,3): uniform_x
- random (e.g. well, random, you know): random_x
Here we go:
CREATE UNLOGGED TABLE test_1
    (id BIGINT,
     clustered_ordered_2000k BIGINT, clustered_ordered_400k BIGINT,
     clustered_ordered_80k BIGINT,
     clustered_random_2000k BIGINT, clustered_random_400k BIGINT,
     clustered_random_80k BIGINT,
     uniform_2000k BIGINT, uniform_400k BIGINT, uniform_80k BIGINT,
     random_2000k BIGINT, random_400k BIGINT, random_80k BIGINT);

WITH q1 AS (SELECT generate_series(1,10000000) AS i, random() AS r),
     q AS (SELECT q1.i, q1.r,
                  trunc(sub_2000k.r * 10000000) AS r_2000k,
                  trunc(sub_400k.r * 10000000) AS r_400k,
                  trunc(sub_80k.r * 10000000) AS r_80k
           FROM q1
           JOIN q1 AS sub_2000k ON sub_2000k.i - 1 = trunc((q1.i - 1) / 5)
           JOIN q1 AS sub_400k  ON sub_400k.i - 1  = trunc((q1.i - 1) / 25)
           JOIN q1 AS sub_80k   ON sub_80k.i - 1   = trunc((q1.i - 1) / 125)
           ORDER BY q1.i)
INSERT INTO test_1
    SELECT q.i,
           trunc((q.i + 4) / 5), trunc((q.i + 24) / 25), trunc((q.i + 124) / 125),
           q.r_2000k, q.r_400k, q.r_80k,
           trunc(q.i % 2000000), trunc(q.i % 400000), trunc(q.i % 80000),
           trunc(q.r * 2000000), trunc(q.r * 400000), trunc(q.r * 80000)
    FROM q;
Now let's query the real numbers of distinct values:
SELECT colname, distincts FROM
    (SELECT 'id' AS colname, COUNT(DISTINCT id) AS distincts FROM test_1 UNION
     SELECT 'clustered_ordered_2000k' AS colname, COUNT(DISTINCT clustered_ordered_2000k) AS distincts FROM test_1 UNION
     SELECT 'clustered_ordered_400k' AS colname, COUNT(DISTINCT clustered_ordered_400k) AS distincts FROM test_1 UNION
     SELECT 'clustered_ordered_80k' AS colname, COUNT(DISTINCT clustered_ordered_80k) AS distincts FROM test_1 UNION
     SELECT 'clustered_random_2000k' AS colname, COUNT(DISTINCT clustered_random_2000k) AS distincts FROM test_1 UNION
     SELECT 'clustered_random_400k' AS colname, COUNT(DISTINCT clustered_random_400k) AS distincts FROM test_1 UNION
     SELECT 'clustered_random_80k' AS colname, COUNT(DISTINCT clustered_random_80k) AS distincts FROM test_1 UNION
     SELECT 'uniform_2000k' AS colname, COUNT(DISTINCT uniform_2000k) AS distincts FROM test_1 UNION
     SELECT 'uniform_400k' AS colname, COUNT(DISTINCT uniform_400k) AS distincts FROM test_1 UNION
     SELECT 'uniform_80k' AS colname, COUNT(DISTINCT uniform_80k) AS distincts FROM test_1 UNION
     SELECT 'random_2000k' AS colname, COUNT(DISTINCT random_2000k) AS distincts FROM test_1 UNION
     SELECT 'random_400k' AS colname, COUNT(DISTINCT random_400k) AS distincts FROM test_1 UNION
     SELECT 'random_80k' AS colname, COUNT(DISTINCT random_80k) AS distincts FROM test_1) AS sub
ORDER BY colname;
colname | distincts
-------------------------+-----------
clustered_ordered_2000k | 2000000
clustered_ordered_400k | 400000
clustered_ordered_80k | 80000
clustered_random_2000k | 1811948
clustered_random_400k | 391881
clustered_random_80k | 79681
id | 10000000
random_2000k | 1986619
random_400k | 400000
random_80k | 80000
uniform_2000k | 2000000
uniform_400k | 400000
uniform_80k | 80000
-> So we got what we asked for.
As the rows of this table are not very wide, we decrease the statistics
target; otherwise about a quarter of the table would get sampled and the
effect would be less clear:
SET default_statistics_target = 10;
ANALYZE VERBOSE test_1;
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'test_1'
ORDER BY attname;
attname | n_distinct | correlation
-------------------------+------------+-------------
clustered_ordered_2000k | 51487 | 1
clustered_ordered_400k | 9991 | 1
clustered_ordered_80k | 3752 | 1
clustered_random_2000k | 51487 | 0.00938534
clustered_random_400k | 9991 | 0.00373309
clustered_random_80k | 3748 | -0.0461863
id | -1 | 1
random_2000k | -0.310305 | 0.00140735
random_400k | 289890 | 0.00140921
random_80k | 71763 | 0.00142101
uniform_2000k | -0.310305 | 0.209842
uniform_400k | -0.101016 | 0.0259991
uniform_80k | 74227 | 0.0154193
-> Estimates for the random and uniform distributions are really good, but
for the clustered distributions they are off by a factor of 20 to 40.
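As an optional cross-check before cleaning up, the estimates can be
converted to absolute numbers (pg_stats stores scaled n_distinct values as
a negative fraction of the row count), which makes the underestimation
factor explicit:
SELECT s.attname,
       round(CASE WHEN s.n_distinct >= 0 THEN s.n_distinct
                  ELSE -s.n_distinct * c.reltuples END) AS estimated_distincts
FROM pg_stats s
JOIN pg_class c ON c.relname = s.tablename
WHERE s.tablename = 'test_1'
ORDER BY s.attname;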
And finally, clean up:
DROP TABLE test_1;