Discussion:
Simple join doesn't use index
Alex Vinnik
2013-01-03 22:54:10 UTC
Hi everybody,

I have implemented my first app using a PG DB and thought for a minute (maybe
two) that I know something about PG, but the problem below totally destroyed my
confidence :). Please help me to restore it.

Here is a simple join query. It runs just fine on MS SQL 2008 and uses
all available indexes, even with a bigger overall dataset.

select visits.id, views.id
from visits join views on visits.id = views.visit_id
where visits.created_at >= '11/15/2012' and visits.created_at <
'11/16/2012'

Quick performance stat

MS SQL: 1 second, 264K rows
PG: 158 seconds, 264K rows

Explain plan from both DBs

PG QUERY PLAN
 Hash Join  (cost=12716.17..1101820.09 rows=248494 width=8)
   Hash Cond: (views.visit_id = visits.id)
   ->  Seq Scan on views  (cost=0.00..819136.56 rows=17434456 width=8)
   ->  Hash  (cost=10549.16..10549.16 rows=132081 width=4)
         ->  Index Scan using visits_created_at_index on visits  (cost=0.00..10549.16 rows=132081 width=4)
               Index Cond: ((created_at >= '2012-11-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-11-16 00:00:00'::timestamp without time zone))

 schemaname | tablename |      indexname       | tablespace |                            indexdef
------------+-----------+----------------------+------------+--------------------------------------------------------------------
 public     | views     | views_pkey           |            | CREATE UNIQUE INDEX views_pkey ON views USING btree (id)
 public     | views     | views_visit_id_index |            | CREATE INDEX views_visit_id_index ON views USING btree (visit_id)

MS SQL Query plan
  |--Parallelism(Gather Streams)
       |--Nested Loops(Inner Join, OUTER REFERENCES:([visits].[id], [Expr1006]) OPTIMIZED WITH UNORDERED PREFETCH)
            |--Index Seek(OBJECT:([visits].[test]), SEEK:([visits].[created_at] >= '2012-11-15 00:00:00.000' AND [visits].[created_at] < '2012-11-16 00:00:00.000') ORDERED FORWARD)
            |--Index Seek(OBJECT:([views].[views_visit_id_index]), SEEK:([views].[visit_id]=[raw_visits].[id]) ORDERED FORWARD)

It is clear that PG does a full table scan: "Seq Scan on views
(cost=0.00..819136.56 rows=17434456 width=8)"

I don't understand why PG doesn't use views_visit_id_index in this query but
instead scans the whole table. One explanation I found is that when the
resulting dataset constitutes more than ~15% of the total number of rows in the
table, a seq scan is used. In this case the resulting dataset is just 1.5% of
the total number of rows, so it must be something different. Any idea why this
happens and how to fix it?

Postgres 9.2
Ubuntu 12.04.1 LTS
shared_buffers = 4GB; the rest of the settings are the defaults

Thanks
-Alex
Jeremy Harris
2013-01-03 23:11:40 UTC
Post by Alex Vinnik
I have implemented my first app using PG DB and thought for a minute(may be
two) that I know something about PG but below problem totally destroyed my
confidence :). Please help me to restore it.
https://wiki.postgresql.org/wiki/SlowQueryQuestions
--
Jeremy
Stefan Andreatta
2013-01-04 04:33:09 UTC
Post by Alex Vinnik
Don't understand why PG doesn't use views_visit_id_index in that query
but rather scans whole table. One explanation I have found that when
resulting dataset constitutes ~15% of total number of rows in the table
then seq scan is used. In this case resulting dataset is just 1.5% of
total number of rows. So it must be something different. Any reason why
it happens and how to fix it?
But does the query planner know the same? If you added the EXPLAIN
ANALYZE output of the query and something like:

SELECT tablename AS table_name, attname AS column_name,
null_frac, avg_width, n_distinct, correlation
FROM pg_stats
WHERE tablename in ('views', 'visits');

.. one could possibly tell a bit more.
Post by Alex Vinnik
Postgres 9.2
Ubuntu 12.04.1 LTS
shared_buffers = 4GB the rest of the settings are default ones
There is more than just this one memory-related value that needs to be
changed for optimal performance. E.g. effective_cache_size can have a
direct effect on the use of nested loops. See:

http://www.postgresql.org/docs/9.2/static/runtime-config-query.html
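
For example, you can experiment per session without touching postgresql.conf
(the 8GB here is only a guess at how much RAM is really available for caching
on your box):

SET effective_cache_size = '8GB';   -- session-local, reverts on disconnect
EXPLAIN ANALYZE
SELECT visits.id, views.id
FROM visits JOIN views ON visits.id = views.visit_id
WHERE visits.created_at >= '2012-11-15' AND visits.created_at < '2012-11-16';

If the plan flips from the hash join to a nested loop over
views_visit_id_index, that is a strong hint the cost settings, not the
indexes, are the problem.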

Regards,
Stefan
AJ Weber
2013-01-04 21:31:31 UTC
Hi all,

I have a table that has about 73mm rows in it and growing. Running
9.0.x on a server that unfortunately is a little I/O constrained. Some
(maybe) pertinent settings:
default_statistics_target = 50
maintenance_work_mem = 512MB
constraint_exclusion = on
effective_cache_size = 5GB
work_mem = 18MB
wal_buffers = 8MB
checkpoint_segments = 32
shared_buffers = 2GB

The server has 12GB RAM, 4 cores, but is shared with a big webapp
running in Tomcat -- and I only have a RAID1 disk to work on. Woes me...

Anyway, this table is going to continue to grow, and it's used
frequently (Read and Write). From what I read, this table is a
candidate to be partitioned for performance and scalability. I have
tested some scripts to build the "inherits" tables with their
constraints and the trigger/function to perform the work.
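
(For anyone following along, a minimal sketch of that kind of inheritance
setup -- table and column names here are illustrative, not the real schema:)

CREATE TABLE order_details_p1 (
    CHECK (order_num >= 0 AND order_num < 10000000)
) INHERITS (order_details);

CREATE OR REPLACE FUNCTION order_details_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.order_num >= 0 AND NEW.order_num < 10000000 THEN
        INSERT INTO order_details_p1 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'order_num % has no partition', NEW.order_num;
    END IF;
    RETURN NULL;   -- row is routed to the child, not stored in the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_details_insert_trig
    BEFORE INSERT ON order_details
    FOR EACH ROW EXECUTE PROCEDURE order_details_insert();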

Am I doing the right thing by partitioning this? If so, and I can
afford some downtime, is dumping the table via pg_dump and then loading
it back in the best way to do this?

Should I run a cluster or vacuum full after all is done?

Is there a major benefit if I can upgrade to 9.2.x in some way that I
haven't realized?

Finally, if anyone has any comments about my settings listed above that
might help improve performance, I thank you in advance.

-AJ
Jeff Janes
2013-01-05 04:03:03 UTC
Post by AJ Weber
Hi all,
I have a table that has about 73mm rows in it and growing.
How big is the table in MB? Its indexes?

...
Post by AJ Weber
The server has 12GB RAM, 4 cores, but is shared with a big webapp running
in Tomcat -- and I only have a RAID1 disk to work on. Woes me...
By a RAID1 disk, do you mean two disks in a RAID1 configuration, or a
single RAID1 composed of an unspecified number of disks?

Often spending many thousands of dollars in DBA time can save you from
having to buy many hundreds of dollars in hard drives. :) On the other
hand, often you end up having to buy the extra disks anyway after all.
Post by AJ Weber
Anyway, this table is going to continue to grow, and it's used frequently
(Read and Write).
Are all rows in the table read and written with equal vigor, or are there
hot rows and cold rows that can be recognized based on the row's values?
Post by AJ Weber
From what I read, this table is a candidate to be partitioned for
performance and scalability. I have tested some scripts to build the
"inherits" tables with their constraints and the trigger/function to
perform the work.
Am I doing the right thing by partitioning this?
Probably not. Or at least, you haven't given us the information to know.
Very broadly speaking, well-implemented partitioning makes bulk loading
and removal operations take less IO, but makes normal operations take more
IO, or if lucky leaves it unchanged. There are exceptions, but unless you
can identify a very specific reason to think you might have one of those
exceptions, then you probably don't.

Do you have a natural partitioning key? That is, is there a column (or
expression) which occurs as a selective component in the where clause of
almost all of your most IO-consuming SQL and DML? If so, you might benefit
from partitioning on it. (But in that case, you might be able to get most
of the benefits of partitioning, without the headaches of it, just by
revamping your indexes to include that column/expression as their leading
field).
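
(As a hypothetical sketch -- table, column, and index names are made up:)

-- if almost every heavy query filters on some column "hot_key",
-- make it the leading field of the index those queries use:
CREATE INDEX big_table_hotkey_created_idx ON big_table (hot_key, created_at);
-- queries with "WHERE hot_key = ... AND created_at >= ..." can then use this
-- index without touching the rest of the table's blocks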

If you don't have a good candidate partitioning key, then partitioning will
almost surely make things worse.

Post by AJ Weber
If so, and I can afford some downtime, is dumping the table via pg_dump
and then loading it back in the best way to do this?
To do efficient bulk loading into a partitioned table, you need to
specifically target each partition, rather than targeting with a trigger.
That pretty much rules out pg_dump, AFAIK, unless you are going to parse
the dump file(s) and rewrite them.
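
(I.e., something along these lines rather than INSERTs through the parent's
trigger -- file and table names are made up:)

-- load each child table directly, bypassing the routing trigger on the parent
COPY order_details_p1 FROM '/path/to/order_details_p1.csv' WITH (FORMAT csv);
COPY order_details_p2 FROM '/path/to/order_details_p2.csv' WITH (FORMAT csv);
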
Post by AJ Weber
Should I run a cluster or vacuum full after all is done?
Probably not. If a cluster after the partitioning would be beneficial,
there would be a pretty good chance you could do a cluster *instead* of the
partitioning and get the same benefit.

If you do some massive deletes from the parent table as part of populating
the children, then a vacuum full of the parent could be useful. But if you
dump the parent table, truncate it, and reload it as partitioned tables,
then vacuum full would probably not be useful.

Really, you need to identify your most resource-intensive queries before
you can make any reasonable decisions.
Post by AJ Weber
Is there a major benefit if I can upgrade to 9.2.x in some way that I
haven't realized?
If you have specific queries that are misoptimized and so are generating
more IO than they need to, then upgrading could help. On the other hand,
it could also make things worse, if a currently well optimized query
becomes worse.

But, instrumentation has improved in 9.2 from 9.0, so upgrading would make
it easier to figure out just which queries are really bad and have the most
opportunity for improvement. A little well informed optimization might
obviate the need for either partitioning or more hard drives.
Post by AJ Weber
Finally, if anyone has any comments about my settings listed above that
might help improve performance, I thank you in advance.
Your default statistics target seemed low. Without knowing the nature of
your most resource intensive queries or how much memory tomcat is using, it
is hard to say more.

Cheers,

Jeff
AJ Weber
2013-01-06 15:27:01 UTC
All fair questions...

Thank you for your detailed response!
Post by AJ Weber
Hi all,
I have a table that has about 73mm rows in it and growing.
How big is the table in MB? Its indexes?
Not sure on this. Will see if pgAdmin tells me.
Post by AJ Weber
...
The server has 12GB RAM, 4 cores, but is shared with a big webapp
running in Tomcat -- and I only have a RAID1 disk to work on.
Woes me...
By a RAID1 disk, do you mean two disks in a RAID1 configuration, or a
single RAID1 composed of an unspecified number of disks?
Often spending many thousands of dollars in DBA time can save you from
having to buy many hundreds of dollars in hard drives. :) On the
other hand, often you end up having to buy the extra disks anyway
afterall.
I mean I have two disks in a RAID1 configuration. The server is
currently in a whitebox datacenter and I have zero control over the
hardware, so adding disks is unfortunately out of the question. I
completely understand the comment, and would love to have a larger SAN
available to me that I could configure...I just don't and have no way of
getting one anytime soon.
Post by AJ Weber
Anyway, this table is going to continue to grow, and it's used
frequently (Read and Write).
Are all rows in the table read and written with equal vigor, or are
there hot rows and cold rows that can be recognized based on the row's
values?
No, I could probably figure out a way to set up an "archive" or "older"
section of the data that is updated much less frequently. Deletes are
rare. Inserts/Updates "yes". Select on existing rows -- very frequent.
Post by AJ Weber
From what I read, this table is a candidate to be partitioned for
performance and scalability. I have tested some scripts to build
the "inherits" tables with their constraints and the
trigger/function to perform the work.
Am I doing the right thing by partitioning this?
Probably not. Or at least, you haven't given us the information to
know. Very broadly speaking, well-implemented partitioning makes bulk
loading and removal operations take less IO, but makes normal
operations take more IO, or if lucky leaves it unchanged. There are
exceptions, but unless you can identify a very specific reason to
think you might have one of those exceptions, then you probably don't.
I know you can't believe everything you read, but I thought I saw some
metrics about when a table's size exceeds some fraction of available
RAM, or when it approaches 100mm rows, it's a big candidate for
partitioning.
Post by AJ Weber
Do you have a natural partitioning key? That is, is there a column
(or expression) which occurs as a selective component in the where
clause of almost all of your most io consuming SQL and DML? If so,
you might benefit from partitioning on it. (But in that case, you
might be able to get most of the benefits of partitioning, without the
headaches of it, just by revamping your indexes to include that
column/expression as their leading field).
If you don't have a good candidate partitioning key, then partitioning
will almost surely make things worse.
The table is a "detail table" to its master records. That is, it's like
an order-details table where it will have 1-n rows joined to the
master ("order") table on the order-id. So I can partition it based on
the order number pretty easily (which is a bigint, btw).
Post by AJ Weber
If so, and I can afford some downtime, is dumping the table via
pg_dump and then loading it back in the best way to do this?
To do efficient bulk loading into a partitioned table, you need to
specifically target each partition, rather than targeting with a
trigger. That pretty much rules out pg_dump, AFAIK, unless you are
going to parse the dump file(s) and rewrite them.
Should I run a cluster or vacuum full after all is done?
Probably not. If a cluster after the partitioning would be
beneficial, there would be a pretty good chance you could do a cluster
*instead* of the partitioning and get the same benefit.
I did try clustering the table on the PK (which is actually 4 columns),
and it appeared to help a bit. I was hoping partitioning was going to
help me even more.
Post by AJ Weber
If you do some massive deletes from the parent table as part of
populating the children, then a vacuum full of the parent could be
useful. But if you dump the parent table, truncate it, and reload it
as partitioned tables, then vacuum full would probably not be useful.
Really, you need to identify your most resource-intensive queries
before you can make any reasonable decisions.
Is there a major benefit if I can upgrade to 9.2.x in some way
that I haven't realized?
If you have specific queries that are misoptimized and so are
generating more IO than they need to, then upgrading could help. On
the other hand, it could also make things worse, if a currently well
optimized query becomes worse.
Is there some new feature or optimization you're thinking about with
this comment? If so, could you please just send me a link and/or
feature name and I'll google it myself?
Post by AJ Weber
But, instrumentation has improved in 9.2 from 9.0, so upgrading would
make it easier to figure out just which queries are really bad and
have the most opportunity for improvement. A little well informed
optimization might obviate the need for either partitioning or more
hard drives.
This is interesting too. I obviously would like the best available
options to tune the database and the application. Is this detailed in
the release notes somewhere, and what tools could I use to take
advantage of this? (Are there new/improved details included in the
EXPLAIN statement or something?)
Post by AJ Weber
Finally, if anyone has any comments about my settings listed above
that might help improve performance, I thank you in advance.
Your default statistics target seemed low. Without knowing the nature
of your most resource intensive queries or how much memory tomcat is
using, it is hard to say more.
Tomcat uses 4G of RAM, plus we have nginx in front using a little and
some other, smaller services running on the server in addition to the
usual Linux gamut of processes.
Jeff Janes
2013-01-08 15:26:06 UTC
Post by AJ Weber
All fair questions...
Thank you for your detailed response!
Post by AJ Weber
Hi all,
I have a table that has about 73mm rows in it and growing.
How big is the table in MB? Its indexes?
Not sure on this. Will see if pgAdmin tells me.
It probably does, but from psql command line, you can do \d+ and \di+
Post by AJ Weber
Post by AJ Weber
Anyway, this table is going to continue to grow, and it's used frequently
(Read and Write).
Are all rows in the table read and written with equal vigor, or are
there hot rows and cold rows that can be recognized based on the row's
values?
No, I could probably figure out a way to setup an "archive" or "older"
section of the data that is updated much less frequently.
So the data that delineates this does not exist in that table, but it does
exist someplace, either just in your head, or in the column of a higher
level table?
Post by AJ Weber
Deletes are rare. Inserts/Updates "yes". Select on existing rows -- very
frequent.
If you have little control over your storage and are already IO bound, and
the tables are growing rapidly, you may need to rethink that "deletes are
rare" bit. So the inserts and updates do target a hot part, while the
selects are evenly spread?

In that case, it is very important to know if the slow part are the
selects, or the insert and deletes. If the selects are slow, and the hot
rows for selects can't be gathered together into a hot partition, then
after clustering they will still be slow as the disk will still have to
seek all over the place (massive data-mining type selects might be an
exception to that, but I wouldn't count on it).
Post by AJ Weber
Post by AJ Weber
From what I read, this table is a candidate to be partitioned for
performance and scalability. I have tested some scripts to build the
"inherits" tables with their constraints and the trigger/function to
perform the work.
Am I doing the right thing by partitioning this?
Probably not. Or at least, you haven't given us the information to
know. Very broadly speaking, well-implemented partitioning makes bulk
loading and removal operations take less IO, but makes normal operations
take more IO, or if lucky leaves it unchanged. There are exceptions, but
unless you can identify a very specific reason to think you might have one
of those exceptions, then you probably don't.
I know you can't believe everything you read, but I thought I saw some
metrics about when a table's size exceeds some fraction of available RAM,
or when it approaches 100mm rows, it's a big candidate for partitioning.
I think it is a matter of semantics. A small table is poor candidate for
partitioning even if it has an excellent key to use for partitioning. A
large table could be a good candidate up until you realize it doesn't have
a good key to use, at which point it stops being a good candidate (in my
opinion).
Post by AJ Weber
Post by AJ Weber
Should I run a cluster or vacuum full after all is done?
Probably not. If a cluster after the partitioning would be beneficial,
there would be a pretty good chance you could do a cluster *instead* of the
partitioning and get the same benefit.
I did try clustering the table on the PK (which is actually 4 columns),
and it appeared to help a bit. I was hoping partitioning was going to help
me even more.
Was the order_num (from the parent table) the leading field of the 4 column
PK? If not, you might want to reorder the PK so that it is the leading
field and cluster again. Or if reordering the PK columns is not
convenient, make a new index on the order_num and cluster on that (perhaps
dropping the index after the cluster, if it no longer serves a purpose)
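
(Roughly -- index and table names are illustrative:)

CREATE INDEX order_details_order_num_idx ON order_details (order_num);
CLUSTER order_details USING order_details_order_num_idx;  -- rewrites the table, takes an exclusive lock
ANALYZE order_details;
-- DROP INDEX order_details_order_num_idx;  -- optional, if the reordered PK covers it
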
Post by AJ Weber
Post by AJ Weber
Is there a major benefit if I can upgrade to 9.2.x in some way that I
haven't realized?
If you have specific queries that are misoptimized and so are generating
more IO than they need to, then upgrading could help. On the other hand,
it could also make things worse, if a currently well optimized query
becomes worse.
Is there some new feature or optimization you're thinking about with
this comment? If so, could you please just send me a link and/or feature
name and I'll google it myself?
The main things I am thinking of are the "fudge factor" for large indexes,
currently being discussed on both the performance and hackers mailing lists,
which was made overly aggressive in 9.2 and so can make it choose worse plans;
and the "allow the planner to generate custom plans for specific parameter
values even when using prepared statements" item from the 9.2 release notes,
which can allow it to choose better plans. But surely there are other changes
as well, which amount to corner cases and so are hard to discuss in the
abstract. Which is why instrumentation is important: there isn't much point in
worrying about possibly changed plans until you've identified the queries that
are important to worry about.
Post by AJ Weber
But, instrumentation has improved in 9.2 from 9.0, so upgrading would
make it easier to figure out just which queries are really bad and have the
most opportunity for improvement. A little well informed optimization
might obviate the need for either partitioning or more hard drives.
This is interesting too. I obviously would like the best available
options to tune the database and the application. Is this detailed in the
release notes somewhere, and what tools could I use to take advantage of
this? (Are there new/improved details included in the EXPLAIN statement or
something?)
track_io_timing is new, and it exposes new data in EXPLAIN (ANALYZE,
BUFFERS) as well as in other places. You might not want to turn this on
permanently, as it can affect performance (but you can test with
pg_test_timing, as outlined in the docs, to see how large the effect is
likely to be). Also, EXPLAIN ANALYZE now displays the number of rows removed
by filters, which may or may not be useful to you.
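
(For example, in a 9.2 session -- the query here is just a placeholder, and
setting track_io_timing per session requires superuser:)

SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM some_big_table WHERE some_key = 42;
-- plan nodes then include "I/O Timings: read=..." alongside the buffer hit/read counts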

Most exciting I think are the improvements to the contrib module
pg_stat_statements. That would be my first recourse, to find out which of
your statements are taking the most time (and/or IO). I try to install and
configure this for all of my databases now as a matter of course.
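
(The setup is roughly as follows -- the ordering and LIMIT are just one way to
slice the results:)

# postgresql.conf (needs a server restart)
shared_preload_libraries = 'pg_stat_statements'

-- then, in the database:
CREATE EXTENSION pg_stat_statements;

-- worst offenders by total time
SELECT query, calls, total_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;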

See the 9.2 release notes (with links therein to the rest of the
documentation) for discussion of these.

Cheers,

Jeff
AJ Weber
2013-01-08 16:45:49 UTC
Post by Jeff Janes
It probably does, but from psql command line, you can do \d+ and \di+
\d+ doesn't appear to display any size information.
Post by Jeff Janes
If you have little control over your storage and are already IO bound,
and the tables are growing rapidly, you may need to rethink that
"deletes are rare" bit. So the inserts and updates do target a hot
part, while the selects are evenly spread?
In that case, it is very important to know if the slow part are the
selects, or the insert and deletes. If the selects are slow, and the
hot rows for selects can't be gathered together into a hot partition,
then after clustering they will still be slow as the disk will still
have to seek all over the place (massive data-mining type selects
might be an exception to that, but I wouldn't count on it).
Since order_num is sequential, I could partition on it in broad
(sequential) ranges. That would put all recent/new rows in one
table-partition that would be a fraction of the size of the overall
(unpartitioned) table. I guess that would require manual maintenance
over time (to switch to another, new partition as each grows).
Post by Jeff Janes
I think it is a matter of semantics. A small table is poor candidate
for partitioning even if it has an excellent key to use for
partitioning. A large table could be a good candidate up until you
realize it doesn't have a good key to use, at which point it stops
being a good candidate (in my opinion).
My first idea to evenly-partition the table was to use the order_num and
do a "mod" on it with the number of tables I wanted to use. That would
yield a partition-table number of 0-mod, and all rows for the same order
would stay within the same partition-table. However, you're right in
thinking that a search for orders could -- really WOULD -- require
retrieving details from multiple partitions, probably increasing IO. So
maybe the sequential partitioning (if at all) is better, just more
maintenance down-the-road.
Post by Jeff Janes
Was the order_num (from the parent table) the leading field of the 4
column PK? If not, you might want to reorder the PK so that it is the
leading field and cluster again. Or if reordering the PK columns is
not convenient, make a new index on the order_num and cluster on that
(perhaps dropping the index after the cluster, if it no longer serves
a purpose)
Yes, the order_num is the first column in the PK, and our main browse
queries use, at a minimum, the first 2-3 columns in that PK in their
where-clause.

Many thanks again for all the input!
-AJ
Jeff Janes
2013-01-08 17:21:42 UTC
Post by AJ Weber
Post by Jeff Janes
It probably does, but from psql command line, you can do \d+ and \di+
\d+ doesn't appear to display any size information.
It does if you use it without an argument, to display all the tables
in the search path:

jjanes=# \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------------+-------+--------+---------+-------------
public | pgbench_accounts | table | jjanes | 128 MB |
public | pgbench_branches | table | jjanes | 40 kB |
public | pgbench_history | table | jjanes | 0 bytes |
public | pgbench_tellers | table | jjanes | 40 kB |
(4 rows)

It rather annoys me that you actually get less information (no size,
no owner) when you use \d+ on a named table. I don't know if there is
a reason for that feature, or if it was just an oversight.
Post by AJ Weber
Post by Jeff Janes
If you have little control over your storage and are already IO bound, and
the tables are growing rapidly, you may need to rethink that "deletes are
rare" bit. So the inserts and updates do target a hot part, while the
selects are evenly spread?
In that case, it is very important to know if the slow part are the
selects, or the insert and deletes. If the selects are slow, and the hot
rows for selects can't be gathered together into a hot partition, then after
clustering they will still be slow as the disk will still have to seek all
over the place (massive data-mining type selects might be an exception to
that, but I wouldn't count on it).
Since order_num is sequential, I could partition on it in broad (sequential)
ranges. That would put all recent/new rows in one table-partition that
would be a fraction of the size of the overall (unpartitioned) table. I
guess that would require manual maintenance over-time (to switch to another,
new partition as each grows).
Yep. If your selects are concentrated in those recent/new, this could
be very useful. But, if your selects are not concentrated on the
recent/new rows, the benefit would be small.
Post by AJ Weber
Post by Jeff Janes
I think it is a matter of semantics. A small table is poor candidate for
partitioning even if it has an excellent key to use for partitioning. A
large table could be a good candidate up until you realize it doesn't have a
good key to use, at which point it stops being a good candidate (in my
opinion).
My first idea to evenly-partition the table was to use the order_num and do
a "mod" on it with the number of tables I wanted to use. That would yield a
partition-table number of 0-mod,
The current constraint exclusion code is quite simple-minded and
doesn't know how to make use of check constraints that use the mod
function, so the indexes of all partitions would have to be searched
for each order_num-driven query, even though we know the data could
only exist in one of them. The constraint exclusion code does
understand check constraints that involve ranges.

There could still be some benefit as the table data would be
concentrated, even if the index data is not.
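
(To be concrete -- names and boundaries are illustrative -- constraints of this
shape can be used for exclusion:)

CREATE TABLE order_details_p1 (
    CHECK (order_num >= 0        AND order_num < 10000000)
) INHERITS (order_details);
CREATE TABLE order_details_p2 (
    CHECK (order_num >= 10000000 AND order_num < 20000000)
) INHERITS (order_details);

-- with constraint_exclusion = partition, a query such as
--   SELECT * FROM order_details WHERE order_num = 15000123;
-- only has to look at order_details_p2, whereas a CHECK (mod(order_num, 10) = 3)
-- style constraint would not be used for pruning.
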
Post by AJ Weber
and all rows for the same order would stay
within the same partition-table.
But usually a given order_num would only be of interest for a fraction
of a second before moving on to some other order_num of interest, so
by the time the relevant partition becomes fully cached, it would no
longer be hot. Or, if the partitions were small enough, you could
assume that all rows would be dragged into memory when the first one
was requested because they lay so close to each other. But it is not
feasible to have a large enough number of partitions to make that
happen. But if the table is clustered, this is exactly what you would
get--the trouble would be keeping it clustered. If most of the
line-items are inserted at the same time as each other, they probably
should be fairly well clustered to start with.

Cheers,

Jeff
Tom Lane
2013-01-08 17:51:40 UTC
Post by Jeff Janes
Post by AJ Weber
\d+ doesn't appear to display any size information.
It does if you use it without an argument, to display all the tables
jjanes=# \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------------+-------+--------+---------+-------------
public | pgbench_accounts | table | jjanes | 128 MB |
public | pgbench_branches | table | jjanes | 40 kB |
public | pgbench_history | table | jjanes | 0 bytes |
public | pgbench_tellers | table | jjanes | 40 kB |
(4 rows)
It rather annoys me that you actually get less information (no size,
no owner) when you use \d+ on a named table. I don't know if there is
a reason for that feature, or if it was just an oversight.
This is actually an abbreviation for \dtisv+, which is a completely
different command from "\d table". You can use something like
"\dt+ table-pattern" to get a display of the above form for a subset
of tables. I agree it ain't too consistent.

regards, tom lane
AJ Weber
2013-01-08 18:04:55 UTC
Post by Jeff Janes
It does if you use it without an argument, to display all the tables
jjanes=# \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------------+-------+--------+---------+-------------
public | pgbench_accounts | table | jjanes | 128 MB |
public | pgbench_branches | table | jjanes | 40 kB |
public | pgbench_history | table | jjanes | 0 bytes |
public | pgbench_tellers | table | jjanes | 40 kB |
(4 rows)
It rather annoys me that you actually get less information (no size,
no owner) when you use \d+ on a named table. I don't know if there is
a reason for that feature, or if it was just an oversight.
That is rather peculiar. Sorry for that.
Table in question is 9284MB
(Parent table is 621MB)
Post by Jeff Janes
The current constraint exclusion code is quite simple-minded and
doesn't know how to make use of check constraints that use the mod
function, so the indexes of all partitions would have to be searched
for each order_num-driven query, even though we know the data could
only exist in one of them. The constraint exclusion codes does
understand check constraints that involve ranges.
Hmm. That's a bit of a limitation I didn't know about. I assume it
doesn't understand the percent (mod operator) just the same as not
understanding the MOD() function? Either way, I guess this strategy
does not pan out.
Post by Jeff Janes
There could still be some benefit as the table data would be
concentrated, even if the index data is not.
I'm reaching way, way back in my head, but I think _some_ RDBMS I worked
with previously had a way to "cluster" the rows around a single one of
the indexes on the table, thus putting the index and the row-data
"together" and reducing the number of IO's to retrieve the row if that
index was used. Am I understanding that PG's "cluster" is strictly to
group like rows together logically -- table data only, not to coordinate
the table row with the index upon which you clustered them?
Post by Jeff Janes
Post by AJ Weber
and all rows for the same order would stay
within the same partition-table.
But usually a given order_num would only be of interest for a fraction
of a second before moving on to some other order_num of interest, so
by the time the relevant partition become fully cached, it would no
longer be hot. Or, if the partitions were small enough, you could
assume that all rows would be dragged into memory when the first one
was requested because they lay so close to each other. But it is not
feasible to have a large enough number of partitions to make that
happen. But if the table is clustered, this is exactly what you would
get--the trouble would be keeping it clustered. If most of the
line-items are inserted at the same time as each other, they probably
should be fairly well clustered to start with.
Does decreasing the fillfactor to, say, 90 help keep it clustered in between
the times when I could shut down the app and perform a (re-)cluster on the
overall table? Problem is, with a table that size, and the hardware I'm
"blessed with", the cluster takes quite a bit of time. :(
Jeff Janes
2013-01-14 18:24:51 UTC
Post by Jeff Janes
The current constraint exclusion code is quite simple-minded and
doesn't know how to make use of check constraints that use the mod
function, so the indexes of all partitions would have to be searched
for each order_num-driven query, even though we know the data could
only exist in one of them. The constraint exclusion codes does
understand check constraints that involve ranges.
Hmm. That's a bit of a limitation I didn't know about. I assume it doesn't
understand the percent (mod operator) just the same as not understanding the
MOD() function? Either way, I guess this strategy does not pan-out.
Yes, it doesn't detect either. It would use the constraint if you formulate
every equality query with an extra restriction: "where id=1234567 and
mod(id,100)=67" or whatever.

(But I was surprised that % and mod() are not recognized as being
equivalent. If you specify it one way in the check constraint, you
need to use the same "spelling" in the where clause)
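
(So, re-using the numbers above and a hypothetical table name, the pairing has
to look like this:)

-- child partition constraint, written with mod():
--   CHECK (mod(id, 100) = 67)
-- the query must then spell it the same way for exclusion to kick in:
SELECT * FROM orders WHERE id = 1234567 AND mod(id, 100) = 67;
-- writing "id % 100 = 67" instead would not be matched against that constraint
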
Post by Jeff Janes
There could still be some benefit as the table data would be
concentrated, even if the index data is not.
I'm reaching way, way back in my head, but I think _some_ RDBMS I worked
with previously had a way to "cluster" the rows around a single one of the
indexes on the table, thus putting the index and the row-data "together" and
reducing the number of IO's to retrieve the row if that index was used.
In Oracle this is called an "index organized table" or IOT (or it was
at one point; they have the habit of renaming most of their features
with each release). I don't know what other RDBMS call it.
Supporting secondary indexes when the table data could move around was
quite intricate/weird.

PG doesn't have this index-organized-table feature--it has been
discussed but I don't know of any currently active effort to add it.

There is another feature, sometimes called clustering, in which the
rows from different tables can be mingled together in the same block.
So both the parent order and the child order_line_item that have the
same order_num (i.e. the join column) would be in the same block. So
once you query for a specific order and did the necessary IO, the
corresponding order_line_item rows would already be in memory. I
thought this was interesting, but I don't know how often it was
actually used.
Am
I understanding that PG's "cluster" is strictly to group like rows together
logically -- table data only, not to coordinate the table row with the index
upon which you clustered them?
They are coordinated in a sense. Not as one single structure, but as
two structures in parallel.
Post by Jeff Janes
Post by AJ Weber
and all rows for the same order would stay
within the same partition-table.
But usually a given order_num would only be of interest for a fraction
of a second before moving on to some other order_num of interest, so
by the time the relevant partition become fully cached, it would no
longer be hot. Or, if the partitions were small enough, you could
assume that all rows would be dragged into memory when the first one
was requested because they lay so close to each other. But it is not
feasible to have a large enough number of partitions to make that
happen. But if the table is clustered, this is exactly what you would
get--the trouble would be keeping it clustered. If most of the
line-items are inserted at the same time as each other, they probably
should be fairly well clustered to start with.
Does decreasing the fill to like 90 help keep it clustered in-between times
that I could shutdown the app and perform a (re-) cluster on the overall
table? Problem is, with a table that size, and the hardware I'm "blessed
with", the cluster takes quite a bit of time. :(
Probably not. If the data starts out clustered and gets updated a
lot, lowering the fill factor might be able to prevent some
de-clustering due to row migration. But when you insert new rows, PG
makes no effort to put them near existing rows with the same key. (In
a hypothetical future in which that did happen, lowering the fill
factor would then probably help)

Cheers,

Jeff
Merlin Moncure
2013-01-08 00:13:26 UTC
Post by Alex Vinnik
Don't understand why PG doesn't use views_visit_id_index in that query but
rather scans whole table. One explanation I have found that when resulting
dataset constitutes ~15% of total number of rows in the table then seq scan
is used. In this case resulting dataset is just 1.5% of total number of
rows. So it must be something different. Any reason why it happens and how
to fix it?
Postgres 9.2
Ubuntu 12.04.1 LTS
shared_buffers = 4GB the rest of the settings are default ones
<snip>

It happens because you lied to the database...heh. In particular, the
'effective_cache_size' setting which defaults to 128mb. That probably
needs to be much, much larger. Basically postgres is figuring the
cache is much smaller than the data and starts to favor sequential
plans once you hit a certain threshold. If you had a server with only
say 256mb ram, it probably *would* be faster.

SQL server probably uses all kinds of crazy native unportable kernel
calls to avoid having to make a similar .conf setting. Or maybe it
just assumes infinite cache size...dunno.

merlin
Jeff Janes
2013-01-09 04:34:11 UTC
Post by Alex Vinnik
Hi everybody,
I have implemented my first app using PG DB and thought for a minute(may
be two) that I know something about PG but below
problem totally destroyed my confidence :). Please help me to restore it.
Here is simple join query. It runs just fine on MS SQL 2008 and uses
all available indexes using even bigger overall dataset.
select visits.id, views.id
from visits join views on visits.id = views.visit_id
where visits.created_at >= '11/15/2012' and visits.created_at <
'11/16/2012'
Quick performance stat
MS SQL: 1 second, 264K rows
If it returns 264K rows in 1 second, then it must have all data in memory.
Which prompts a few questions:

Is *all* data in memory, or is it just the data needed for this particular
query because you already ran it recently with the same date range?
Post by Alex Vinnik
PG: 158 seconds, 264K rows
Does the PG machine have enough memory to hold all the data in RAM? If so,
does it actually have all the data in RAM? That is, is the cache already
warm? Starting from scratch it can take a long time for the cache to warm
up naturally. And finally, if all the data is in RAM, does PG know this?

For the last question, the answer is "no", since you are using default
settings. You need to lower random_page_cost and probably also
seq_page_cost in order to trick PG into thinking the data is in RAM. Of course
if you do this when the data is in fact not in RAM, the result could be
catastrophically bad plans. (And when I tried to replicate your situation
on anemic hardware, indeed the hash join often was faster than the nested
loop on both indexes.)
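
(If you want to test that theory, do it per session first -- the numbers here
are only plausible starting points for a mostly-cached dataset:)

SET effective_cache_size = '8GB';   -- a guess; use whatever genuinely reflects your cache
SET random_page_cost = 1.5;
SET seq_page_cost = 1.0;
-- then re-run EXPLAIN (ANALYZE, BUFFERS) on the join in the same session and compare plans
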
Post by Alex Vinnik
Explain plan from both DBs
PG QUERY PLAN
Hash Join (cost=12716.17..1101820.09 rows=248494 width=8)
Hash Cond: (views.visit_id = visits.id)
-> Seq Scan on views (cost=0.00..819136.56 rows=17434456 width=8)
This cost estimate is probably due mostly to seq_page_cost and
cpu_tuple_cost, which at their defaults mean the table has ~645,000 blocks
(819136 - 17434456/100) and each block has ~30 rows.

But you are returning 248,494 rows, or roughly 1 / 2.5 of a row per block.
Let's say you need to fetch 200,000 blocks (in random order) to get those
rows. Since at default settings fetching 200,000 random blocks is
considered as expensive as fetching 800,000 sequential blocks, the index
scan you want already looks more expensive than the sequential scan. But,
if you want to use the index scan, you also have to fetch the index
blocks, which a sequential scan does not need to do. There are probably
about 50,000 index blocks, but each one has to be fetched about 5 times
(248,494/50,000). Because your effective_cache_size is so low, PG assumes
the next time it needs to fetch the same block, it will no longer be in
memory and so needs to be fetched again at full random_page_cost.
Post by Alex Vinnik
It is clear that PG does full table scan "Seq Scan on views
(cost=0.00..819136.56 rows=17434456 width=8)"
Don't understand why PG doesn't use views_visit_id_index in that query but
rather scans whole table. One explanation I have found that when resulting
dataset constitutes ~15% of total number of rows in the table then seq scan
is used.
I don't know where you found that rule of thumb, but it would probably be more
accurate if it were given in terms of the percentage of the table's
*blocks* scanned, rather than *rows*.
Post by Alex Vinnik
In this case resulting dataset is just 1.5% of total number of rows.
Since there are about 30 rows per block, scanning 1.5% of the rows means
scanning somewhat less than 45% of the blocks, assuming the rows are
randomly distributed over the blocks. And they are scanned in a less
efficient way.
Post by Alex Vinnik
Postgres 9.2
You are probably getting hit hard by the overly-large "fudge factor"
penalty for scans of large indexes, which has seen much discussion recently
in regard to 9.2.
Post by Alex Vinnik
Ubuntu 12.04.1 LTS
shared_buffers = 4GB the rest of the settings are default ones
The default effective_cache_size is almost certainly wrong, and if the
analogy to MSSQL is correct, then random_page_cost almost certainly is
as well.

Another poster referred you to the wiki page for suggestions on how to
report slow queries, particularly using EXPLAIN (analyze, buffers) rather
than merely EXPLAIN. In this case, I would also try setting
enable_hashjoin=off and enable_mergejoin=off in the session, in order to
force the planner to use the plan you think you want, so we can see what PG
thinks of that one.
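
(Concretely, in one session:)

SET enable_hashjoin = off;
SET enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT visits.id, views.id
FROM visits JOIN views ON visits.id = views.visit_id
WHERE visits.created_at >= '2012-11-15' AND visits.created_at < '2012-11-16';
RESET enable_hashjoin;
RESET enable_mergejoin;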

Cheers,

Jeff
Alex Vinnik
2013-01-09 15:49:43 UTC
Guys, thanks a lot for your input. It is very valuable for us. We plan to
set up a separate dev server similar to the production one, copy all the data
there, and try your suggestions, as we really don't want to do this on the
production server. I also noticed that IO utilization jumps to 100% when
running this query, so it is a bit scary to make those changes in production
directly. Will report back on the progress and findings.
Merlin Moncure
2013-01-09 15:53:08 UTC
Post by Alex Vinnik
Guys, thanks a lot for your input. It is very valuable for us. We plan to
fix a separate dev server similar to production one, copy all data there and
try you suggestions as we really don't want to do it on production server. I
also noticed that IOPS jumps to 100% when running this query. So it is a bit
scary to make those changes in production directly. Will report back on the
progress and findings.
nothing wrong with that, but keep in mind you can tweak
'effective_cache_size' for a single session with 'set' command;

merlin
Alex Vinnik
2013-01-28 23:43:51 UTC
It sure turned out that the default settings are not a good fit. Setting
random_page_cost to 1.0 made the query run in 2.6 seconds, and I clearly see
that indexes are being used in the explain plan and IO utilization is close
to 0.

QUERY PLAN
 Sort  (cost=969787.23..970288.67 rows=200575 width=8) (actual time=2176.045..2418.162 rows=241238 loops=1)
   Sort Key: visits.id, views.id
   Sort Method: external sort  Disk: 4248kB
   ->  Nested Loop  (cost=0.00..950554.81 rows=200575 width=8) (actual time=0.048..1735.357 rows=241238 loops=1)
         ->  Index Scan using visits_created_at_index on visits  (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..178.591 rows=136021 loops=1)
               Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp without time zone))
         ->  Index Scan using views_visit_id_index on views  (cost=0.00..11.33 rows=12 width=8) (actual time=0.004..0.006 rows=2 loops=136021)
               Index Cond: (visit_id = visits.id)
 Total runtime: 2635.169 ms

However I noticed that sorting is done on disk ("external sort Disk:
4248kB"), which prompted me to take a look at work_mem. But it turned out
that a small increase to 4MB from the default 1MB turns off index usage and
the query gets 10x slower. IO utilization jumped to 100% from literally
nothing, so back to square one...

QUERY PLAN
 Sort  (cost=936642.75..937144.19 rows=200575 width=8) (actual time=33200.762..33474.443 rows=241238 loops=1)
   Sort Key: visits.id, views.id
   Sort Method: external merge  Disk: 4248kB
   ->  Hash Join  (cost=6491.17..917410.33 rows=200575 width=8) (actual time=7156.498..32723.221 rows=241238 loops=1)
         Hash Cond: (views.visit_id = visits.id)
         ->  Seq Scan on views  (cost=0.00..832189.95 rows=8768395 width=8) (actual time=0.100..12126.342 rows=8200704 loops=1)
         ->  Hash  (cost=5459.16..5459.16 rows=82561 width=4) (actual time=353.683..353.683 rows=136021 loops=1)
               Buckets: 16384  Batches: 2 (originally 1)  Memory Usage: 4097kB
               ->  Index Scan using visits_created_at_index on visits  (cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..175.051 rows=136021 loops=1)
                     Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp without time zone))
 Total runtime: 33698.000 ms

Basically PG is going through all views again and not using "Index Scan
using views_visit_id_index on views". It looks like setting work_mem confuses
the planner somehow. Any idea what can be done to get the sort done in memory?
I suspect it would make the query even faster. Thanks -Alex
Filip Rembiałkowski
2013-01-29 00:55:10 UTC
Post by Alex Vinnik
It sure turned out that default settings are not a good fit.
do you know pgtune?
it's a good tool for starters, if you want a fast postgres and don't really
want to learn what's behind the scenes.

random_page_cost=1 might not be what you really want.
It would mean that random reads are as fast as sequential reads, which
is probably true only for SSDs.


Filip
Alex Vinnik
2013-01-29 14:24:10 UTC
Post by Filip Rembiałkowski
Post by Alex Vinnik
It sure turned out that default settings are not a good fit.
do you know pgtune?
it's a good tool for starters, if you want a fast postgres and don't
really want to learn what's behind the scenes.
Yeah.. I came across pgtune but noticed that the latest version is dated
2009-10-29 (http://pgfoundry.org/frs/?group_id=1000416), which is kind of
outdated. The tar file has settings for pg 8.3. Is it still relevant?
Post by Filip Rembiałkowski
random_page_cost=1 might be not what you really want.
it would mean that random reads are as fast as as sequential reads, which
probably is true only for SSD
What random_page_cost would be more appropriate for an EC2 EBS Provisioned
IOPS volume that can handle 2,000 IOPS?
Post by Filip Rembiałkowski
Filip
Filip Rembiałkowski
2013-01-29 16:19:19 UTC
Post by Alex Vinnik
Post by Filip Rembiałkowski
do you know pgtune?
it's a good tool for starters, if you want a fast postgres and don't
really want to learn what's behind the scenes.
Yeah.. I came across pgtune but noticed that latest version dated
2009-10-29 http://pgfoundry.org/frs/?group_id=1000416 which is kind of
outdated. Tar file has settings for pg 8.3. Is still relevant?
Yes, I'm sure it will not do anything bad to your config.
Post by Alex Vinnik
Post by Filip Rembiałkowski
random_page_cost=1 might be not what you really want.
it would mean that random reads are as fast as as sequential reads, which
probably is true only for SSD
What randon_page_cost would be more appropriate for EC2 EBS Provisioned
volume that can handle 2,000 IOPS?
I'd say: don't guess. Measure.
Use any tool that can test sequential disk block reads versus random disk
block reads.
bonnie++ is quite popular.
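
For example (the mount point is a placeholder; use a test size of at least
twice RAM so the cache doesn't mask the disk):

bonnie++ -d /mnt/ebs-test -s 16384 -n 0
# -d: directory on the volume under test, -s: test file size in MB, -n 0: skip the small-file tests
# run as a regular user (bonnie++ refuses to run as root without -u)
# then compare the sequential block read rate with the random seeks/sec figure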



Filip
Dan Fairs
2013-02-04 21:14:58 UTC
Yeah.. I came across pgtune but noticed that latest version dated 2009-10-29 http://pgfoundry.org/frs/?group_id=1000416 which is kind of outdated. Tar file has settings for pg 8.3. Is still relevant?
Yes, I'm sure it will not do anything bad to your config.
Apologies for leaping in a little late, but I note the version on Github has been updated much more recently:

https://github.com/gregs1104/pgtune

Cheers,
Dan
--
Dan Fairs | ***@gmail.com | @danfairs | secondsync.com

Ben Chobot
2013-01-29 17:39:07 UTC
Post by Filip Rembiałkowski
random_page_cost=1 might be not what you really want.
it would mean that random reads are as fast as as sequential reads, which probably is true only for SSD
What randon_page_cost would be more appropriate for EC2 EBS Provisioned volume that can handle 2,000 IOPS?
For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1 is exactly what you want.
Alex Vinnik
2013-01-29 18:59:10 UTC
Post by Filip Rembiałkowski
random_page_cost=1 might be not what you really want.
Post by Filip Rembiałkowski
it would mean that random reads are as fast as as sequential reads, which
probably is true only for SSD
What randon_page_cost would be more appropriate for EC2 EBS Provisioned
volume that can handle 2,000 IOPS?
For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1
is exactly what you want.
Well... after some experimentation it turned out that random_page_cost=0.6
gives me a fast query

QUERY PLAN
 Sort  (cost=754114.96..754510.46 rows=158199 width=8) (actual time=1839.324..2035.405 rows=209401 loops=1)
   Sort Key: visits.id, views.id
   Sort Method: quicksort  Memory: 15960kB
   ->  Nested Loop  (cost=0.00..740453.38 rows=158199 width=8) (actual time=0.048..1531.592 rows=209401 loops=1)
         ->  Index Scan using visits_created_at_index on visits  (cost=0.00..5929.82 rows=115492 width=4) (actual time=0.032..161.488 rows=131311 loops=1)
               Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp without time zone))
         ->  Index Scan using views_visit_id_index on views  (cost=0.00..6.26 rows=10 width=8) (actual time=0.003..0.005 rows=2 loops=131311)
               Index Cond: (visit_id = visits.id)
 Total runtime: 2234.142 ms

random_page_cost=0.7 slows it down 16 times

 Sort  (cost=804548.42..804943.92 rows=158199 width=8) (actual time=37011.337..37205.449 rows=209401 loops=1)
   Sort Key: visits.id, views.id
   Sort Method: quicksort  Memory: 15960kB
   ->  Merge Join  (cost=15871.37..790886.85 rows=158199 width=8) (actual time=35673.602..36714.056 rows=209401 loops=1)
         Merge Cond: (visits.id = views.visit_id)
         ->  Sort  (cost=15824.44..16113.17 rows=115492 width=4) (actual time=335.486..463.085 rows=131311 loops=1)
               Sort Key: visits.id
               Sort Method: quicksort  Memory: 12300kB
               ->  Index Scan using visits_created_at_index on visits  (cost=0.00..6113.04 rows=115492 width=4) (actual time=0.034..159.326 rows=131311 loops=1)
                     Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp without time zone))
         ->  Index Scan using views_visit_id_visit_buoy_index on views  (cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316 rows=5145902 loops=1)
 Total runtime: 37407.174 ms

I am totally puzzled now...
Merlin Moncure
2013-01-29 19:35:13 UTC
Post by Filip Rembiałkowski
Post by Alex Vinnik
Post by Filip Rembiałkowski
random_page_cost=1 might be not what you really want.
it would mean that random reads are as fast as as sequential reads, which
probably is true only for SSD
What randon_page_cost would be more appropriate for EC2 EBS Provisioned
volume that can handle 2,000 IOPS?
For EC2 Provisioned IOPS volumes - not standard EBS - random_page_cost=1
is exactly what you want.
Well... after some experimentation it turned out that random_page_cost=0.6
gives me fast query
QUERY PLAN
Sort (cost=754114.96..754510.46 rows=158199 width=8) (actual
time=1839.324..2035.405 rows=209401 loops=1)
Sort Key: visits.id, views.id
Sort Method: quicksort Memory: 15960kB
-> Nested Loop (cost=0.00..740453.38 rows=158199 width=8) (actual
time=0.048..1531.592 rows=209401 loops=1)
-> Index Scan using visits_created_at_index on visits
(cost=0.00..5929.82 rows=115492 width=4) (actual time=0.032..161.488
rows=131311 loops=1)
Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp
without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp
without time zone))
-> Index Scan using views_visit_id_index on views (cost=0.00..6.26
rows=10 width=8) (actual time=0.003..0.005 rows=2 loops=131311)
Index Cond: (visit_id = visits.id)
Total runtime: 2234.142 ms
random_page_cost=0.7 slows it down 16 times
Sort (cost=804548.42..804943.92 rows=158199 width=8) (actual
time=37011.337..37205.449 rows=209401 loops=1)
Sort Key: visits.id, views.id
Sort Method: quicksort Memory: 15960kB
-> Merge Join (cost=15871.37..790886.85 rows=158199 width=8) (actual
time=35673.602..36714.056 rows=209401 loops=1)
Merge Cond: (visits.id = views.visit_id)
-> Sort (cost=15824.44..16113.17 rows=115492 width=4) (actual
time=335.486..463.085 rows=131311 loops=1)
Sort Key: visits.id
Sort Method: quicksort Memory: 12300kB
-> Index Scan using visits_created_at_index on visits
(cost=0.00..6113.04 rows=115492 width=4) (actual time=0.034..159.326
rows=131311 loops=1)
Index Cond: ((created_at >= '2013-01-15
00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16
00:00:00'::timestamp without time zone))
-> Index Scan using views_visit_id_visit_buoy_index on views
(cost=0.00..757596.22 rows=6122770 width=8) (actual time=0.017..30765.316
rows=5145902 loops=1)
Something is awry here. pg is doing an index scan via
views_visit_id_visit_buoy_index with no matching condition. What's
the definition of that index? The reason why the random_page_cost
adjustment is working is that you are highly penalizing sequential
type scans so that the database is avoiding the merge (sort A, sort B,
stepwise compare).

SQL server is doing a nestloop/index scan, just like the faster pg
plan, but is a bit faster because it's parallelizing.

merlin
Alex Vinnik
2013-01-29 20:48:50 UTC
index definition
CREATE INDEX views_visit_id_visit_buoy_index ON views USING btree
(visit_id, visit_buoy)
Jeff Janes
2013-01-29 23:15:36 UTC
Permalink
On Mon, Jan 28, 2013 at 4:55 PM, Filip Rembiałkowski
Post by Filip Rembiałkowski
Post by Alex Vinnik
It sure turned out that default settings are not a good fit.
do you know pgtune?
it's a good tool for starters, if you want a fast postgres and don't really
want to learn what's behind the scenes.
random_page_cost=1 might not be what you really want.
it would mean that random reads are as fast as sequential reads, which
is probably true only for SSDs
Or that the "reads" are cached and coming from RAM, which is almost
surely the case here.
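
Whether the "reads" are in fact being served from PostgreSQL's own cache can
be checked from the statistics views; a minimal sketch, using the table
names from this thread:

-- *_hit = blocks found in shared_buffers; *_read = blocks requested from
-- the OS (which may still come out of the kernel's page cache).
SELECT relname,
       heap_blks_hit, heap_blks_read,
       idx_blks_hit,  idx_blks_read
FROM pg_statio_user_tables
WHERE relname IN ('visits', 'views');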

Cheers,

Jeff
Merlin Moncure
2013-01-29 02:31:32 UTC
Permalink
Post by Alex Vinnik
It sure turned out that default settings are not a good fit. Setting
random_page_cost to 1.0 made the query run in 2.6 seconds, and I clearly see
that indexes are being used in the explain plan and IO utilization is close
to 0.
QUERY PLAN
Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual
time=2176.045..2418.162 rows=241238 loops=1)
Sort Key: visits.id, views.id
Sort Method: external sort Disk: 4248kB
-> Nested Loop (cost=0.00..950554.81 rows=200575 width=8) (actual
time=0.048..1735.357 rows=241238 loops=1)
-> Index Scan using visits_created_at_index on visits
(cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..178.591
rows=136021 loops=1)
Index Cond: ((created_at >= '2012-12-15 00:00:00'::timestamp
without time zone) AND (created_at < '2012-12-16 00:00:00'::timestamp
without time zone))
-> Index Scan using views_visit_id_index on views
(cost=0.00..11.33 rows=12 width=8) (actual time=0.004..0.006 rows=2
loops=136021)
Index Cond: (visit_id = visits.id)
Total runtime: 2635.169 ms
4248kB") which prompted me to take a look at work_mem. But it turned out
that small increase to 4MB from default 1MB turns off index usage and query
gets x10 slower. IO utilization jumped to 100% from literally nothing. so
back to square one...
QUERY PLAN
Sort (cost=936642.75..937144.19 rows=200575 width=8) (actual
time=33200.762..33474.443 rows=241238 loops=1)
Sort Key: visits.id, views.id
Sort Method: external merge Disk: 4248kB
-> Hash Join (cost=6491.17..917410.33 rows=200575 width=8) (actual
time=7156.498..32723.221 rows=241238 loops=1)
Hash Cond: (views.visit_id = visits.id)
-> Seq Scan on views (cost=0.00..832189.95 rows=8768395 width=8)
(actual time=0.100..12126.342 rows=8200704 loops=1)
-> Hash (cost=5459.16..5459.16 rows=82561 width=4) (actual
time=353.683..353.683 rows=136021 loops=1)
Buckets: ... Batches: ... Memory Usage: 4097kB
-> Index Scan using visits_created_at_index on visits
(cost=0.00..5459.16 rows=82561 width=4) (actual time=0.032..175.051
rows=136021 loops=1)
Index Cond: ((created_at >= '2012-12-15
00:00:00'::timestamp without time zone) AND (created_at < '2012-12-16
00:00:00'::timestamp without time zone))
Total runtime: 33698.000 ms
Basically PG is going through all views again and not using "Index Scan
using views_visit_id_index on views". Looks like setting work_mem confuses
the planner somehow. Any idea what can be done to do the sorting in memory?
I suspect it would make the query even faster. Thanks -Alex
Hm, what happens when you set work_mem a fair amount higher (say, 64MB)?
You can set it for one session with "set work_mem='64MB';" as opposed to
changing it for the entire server in postgresql.conf.

merlin
Alex Vinnik
2013-01-29 14:41:50 UTC
Permalink
Setting work_mem to 64MB triggers an in-memory sort, but look what happens
with the views lookup: PG goes through all the records there ("Seq Scan on
views") instead of using the visit_id index, and I only have a subset of the
real data to play with. I can imagine what the cost would be running it
against a bigger dataset. Something else is in play here that makes the
planner take this route. Any ideas how to gain more insight into the
planner's inner workings?

QUERY PLAN
Sort (cost=960280.46..960844.00 rows=225414 width=8) (actual
time=23328.040..23537.126 rows=209401 loops=1)
Sort Key: visits.id, views.id
Sort Method: quicksort Memory: 15960kB
-> Hash Join (cost=8089.16..940238.66 rows=225414 width=8) (actual
time=6622.072..22995.890 rows=209401 loops=1)
Hash Cond: (views.visit_id = visits.id)
-> Seq Scan on views (cost=0.00..831748.05 rows=8724205 width=8)
(actual time=0.093..10552.306 rows=6995893 loops=1)
-> Hash (cost=6645.51..6645.51 rows=115492 width=4) (actual
time=307.389..307.389 rows=131311 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 4617kB
-> Index Scan using visits_created_at_index on visits
(cost=0.00..6645.51 rows=115492 width=4) (actual time=0.040..163.151
rows=131311 loops=1)
Index Cond: ((created_at >= '2013-01-15
00:00:00'::timestamp without time zone) AND (created_at < '2013-01-16
00:00:00'::timestamp without time zone))
Total runtime: 23733.045 ms
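
One blunt but useful way to look into the planner's choices is to disable a
plan node type for the session and compare the resulting plan and its cost
estimate against the original; a sketch, for diagnostics only:

SET enable_seqscan = off;   -- session-only; not something to leave on in production
-- re-run EXPLAIN (ANALYZE, BUFFERS) on the join query and compare the costs
RESET enable_seqscan;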
Merlin Moncure
2013-01-29 16:41:47 UTC
Permalink
Post by Alex Vinnik
Setting work_mem to 64MB triggers an in-memory sort, but look what happens
with the views lookup: PG goes through all the records there ("Seq Scan on
views") instead of using the visit_id index, and I only have a subset of the
real data to play with. I can imagine what the cost would be running it
against a bigger dataset. Something else is in play here that makes the
planner take this route. Any ideas how to gain more insight into the
planner's inner workings?
Did you set effective_cache_size as noted upthread?
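
effective_cache_size can also be tried per session before committing a value
to postgresql.conf; a sketch (the 8GB figure is purely illustrative; size it
to the RAM actually available for caching):

SHOW effective_cache_size;          -- often still at the small default
SET effective_cache_size = '8GB';   -- session-only planner hint; no memory is allocated
-- re-run EXPLAIN (ANALYZE, BUFFERS) on the join query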

merlin
Jeff Janes
2013-01-29 20:06:50 UTC
Permalink
Post by Alex Vinnik
It sure turned out that default settings are not a good fit. Setting
random_page_cost to 1.0 made the query run in 2.6 seconds, and I clearly see
that indexes are being used in the explain plan and IO utilization is close
to 0.
QUERY PLAN
Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual
time=2176.045..2418.162 rows=241238 loops=1)
Sort Key: visits.id, views.id
Sort Method: external sort Disk: 4248kB
What query are you running? The query you originally showed us should
not be doing this sort in the first place.

Cheers,

Jeff
Alex Vinnik
2013-01-29 20:43:17 UTC
Permalink
Post by Jeff Janes
Post by Alex Vinnik
Sort Key: visits.id, views.id
Sort Method: external sort Disk: 4248kB
What query are you running? The query you originally showed us should
not be doing this sort in the first place.
Cheers,
Jeff
Here is the query

select visits.id, views.id
from visits join views on visits.id = views.visit_id
where visits.created_at >= '01/15/2013' and visits.created_at < '01/16/2013'
order by visits.id, views.id;

The original query didn't have an order by clause.

Here is the query plan w/o order by:
Merge Join (cost=18213.46..802113.80 rows=182579 width=8) (actual
time=144443.693..145469.499 rows=209401 loops=1)
Merge Cond: (visits.id = views.visit_id)
-> Sort (cost=18195.47..18523.91 rows=131373 width=4) (actual
time=335.496..464.929 rows=131311 loops=1)
Sort Key: visits.id
Sort Method: quicksort Memory: 12300kB
-> Index Scan using visits_created_at_index on visits
(cost=0.00..7026.59 rows=131373 width=4) (actual time=0.037..162.047
rows=131311 loops=1)
Index Cond: ((created_at >= '2013-01-15 00:00:00'::timestamp
without time zone) AND (created_at < '2013-01-16 00:00:00'::timestamp
without time zone))
-> Index Scan using views_visit_id_visit_buoy_index on views
(cost=0.00..766120.99 rows=6126002 width=8) (actual
time=18.960..140565.130 rows=4014837 loops=1)
Total runtime: 145664.274 ms
Jeff Janes
2013-02-02 16:39:42 UTC
Permalink
It sure turned out that default settings are not a good fit. Setting
random_page_cost to 1.0 made the query run in 2.6 seconds, and I clearly see
that indexes are being used in the explain plan and IO utilization is close to 0.
This is not surprising. The default settings are aimed at quite small
servers, while you seem to have a rather substantial one. Have you done
anything yet to fix effective_cache_size?
QUERY PLAN
Sort (cost=969787.23..970288.67 rows=200575 width=8) (actual
time=2176.045..2418.162 rows=241238 loops=1)
4248kB")
As far as pgsql is concerned, it is using the disk. But the kernel is
probably buffering that to the extent that the disk is being touched rather
little. So I wouldn't worry about it.
which prompted me to take a look at work_mem. But it turned out that a
small increase to 4MB from the default 1MB turns off index usage and the
query gets 10x slower. IO utilization jumped to 100% from literally nothing,
so back to square one...
QUERY PLAN
Sort (cost=936642.75..937144.19 rows=200575 width=8) (actual
time=33200.762..33474.443 rows=241238 loops=1)
And why should the IO utilization have jumped? Is everything in memory, or
is it not? You should run your EXPLAINs with (ANALYZE, BUFFERS), and you
should also turn on track_io_timing, at least in the local session; that
will give us some insights.
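
Concretely, that would be something like the following (track_io_timing
needs superuser privileges when set per session; the date range is the one
used in this thread):

SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT visits.id, views.id
FROM visits
JOIN views ON visits.id = views.visit_id
WHERE visits.created_at >= '2013-01-15'
  AND visits.created_at <  '2013-01-16'
ORDER BY visits.id, views.id;
-- each plan node will then report "Buffers: shared hit=... read=..." and
-- "I/O Timings: read=..." lines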

If everything is in memory, then why is the seq scan taking so long? If
not, then why is the nested loop such a good idea? (In my hands, when
everything does *not* fit in memory, the nested loop is very very bad)

You seem to have a bit of an infatuation with Dec 15th, running that one
query over and over and over. Why? If the real live query is not just for
that one day, then you should test with different days, not just that one
day repeatedly. (And if your real query really is like the movie "Groundhog
Day", you should probably cluster or partition with that in mind.)
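
If the workload really is keyed on created_at like that, the clustering
suggestion might look like this (index names taken from earlier in the
thread; CLUSTER takes an exclusive lock and rewrites the table, so it needs
a maintenance window):

CLUSTER visits USING visits_created_at_index;
CLUSTER views  USING views_visit_id_index;
ANALYZE visits;
ANALYZE views;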

Anyway, there was an issue introduced in 9.2.0 and to be removed in 9.2.3
which over-penalized nested loops that had large indexes on the inner side.
Since your different plans are so close to each other in estimated cost, I
think this issue would be enough to tip it into the seq scan. Also, your
poor setting of effective_cache_size might also be enough to tip it. And
both combined, almost certainly are.

But ultimately, I think you are optimizing for a case that does not
actually exist.

Cheers,

Jeff