Discussion:
[PERFORM] Picking out the most recent row using a time stamp column
(too old to reply)
Dave Crooke
2011-02-24 19:55:31 UTC
Permalink
Hi foks

This is an old chestnut which I've found a number of online threads for, and
never seen a clever answer to. It seems a common enough idiom that there
might be some slicker way to do it, so I thought I might inquire with this
august group if such a clever answer exists ....

Consider the following table

create table data
(id_key int,
time_stamp timestamp without time zone,
value double precision);

create unique index data_idx on data (id_key, time_stamp);

with around 1m rows, with 3500 or so distinct values of id_key.

I need to find the most recent value for each distinct value of id_key.
There is no elegant (that I know of) syntax for this, and there are two ways
I've typically seen it done:

1. Use a dependent subquery to find the most recent time stamp, i.e.

select
a.id_key, a.time_stamp, a.value
from
data a
where
a.time_stamp=
(select max(time_stamp)
from data b
where a.id_key=b.id_key)

2. Define a temporary table / view with the most recent time stamp for each
key, and join against it:

select
a.id_key, a.time_stamp, a.value
from
data a,
(select id_key, max(time_stamp) as mts
from data group by id_key) b
where
a.id_key=b.id_key and a.time_stamp=b.mts

I've found that for my data set, PG 8.4.2 selects the "obvious" / "do it as
written" plan in each case, and that method 2. is much quicker (2.6 sec vs.
2 min on my laptop) ....

Is there a more elegant way to write this, perhaps using PG-specific
extensions?

Cheers
Dave
Merlin Moncure
2011-02-24 20:11:29 UTC
Permalink
Post by Dave Crooke
Hi foks
This is an old chestnut which I've found a number of online threads for, and
never seen a clever answer to. It seems a common enough idiom that there
might be some slicker way to do it, so I thought I might inquire with this
august group if such a clever answer exists ....
Consider the following table
create table data
   (id_key int,
    time_stamp timestamp without time zone,
    value double precision);
create unique index data_idx on data (id_key, time_stamp);
with around 1m rows, with 3500 or so distinct values of id_key.
I need to find the most recent value for each distinct value of id_key.
There is no elegant (that I know of) syntax for this, and there are two ways
1. Use a dependent subquery to find the most recent time stamp, i.e.
select
   a.id_key, a.time_stamp, a.value
from
   data a
where
  a.time_stamp=
     (select max(time_stamp)
      from data b
      where a.id_key=b.id_key)
2. Define a temporary table / view with the most recent time stamp for each
select
   a.id_key, a.time_stamp, a.value
from
   data a,
   (select id_key, max(time_stamp) as mts
    from data group by id_key) b
where
   a.id_key=b.id_key and a.time_stamp=b.mts
I've found that for my data set, PG 8.4.2 selects the "obvious" / "do it as
written" plan in each case, and that method 2. is much quicker (2.6 sec vs.
2 min on my laptop) ....
Is there a more elegant way to write this, perhaps using PG-specific
extensions?
one pg specific method that a lot of people overlook for this sort of
problem is custom aggregates.

create or replace function maxfoo(foo, foo) returns foo as
$$
select case when $1.t > $2.t then $1 else $2 end;
$$ language sql immutable;

create aggregate aggfoo(foo)
(
sfunc=maxfoo,
stype=foo
);

create table foo(id int, t timestamptz default now());
insert into foo values (1);
insert into foo values (1);

select (f).* from (select aggfoo(foo) as f from foo group by id) q;

postgres=# select (f).* from (select aggfoo(foo) as f from foo group by id) q;
id | t
----+----------------------------
1 | 2011-02-24 14:01:20.051-06
(1 row)


where this approach can be useful is when you have a very complicated
aggregation condition that can be awkward to express in a join.

merlin
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Kevin Grittner
2011-02-24 20:18:55 UTC
Permalink
Post by Dave Crooke
create table data
(id_key int,
time_stamp timestamp without time zone,
value double precision);
create unique index data_idx on data (id_key, time_stamp);
I need to find the most recent value for each distinct value of id_key.
Well, unless you use timestamp WITH time zone, you might not be able
to do that at all. There are very few places where timestamp
WITHOUT time zone actually makes sense.
Post by Dave Crooke
There is no elegant (that I know of) syntax for this
How about this?:

select distinct on (id_key) * from data order by id_key, time_stamp;
Post by Dave Crooke
select
a.id_key, a.time_stamp, a.value
from
data a
where
a.time_stamp=
(select max(time_stamp)
from data b
where a.id_key=b.id_key)
Rather than the above, I typically find this much faster:

select
a.id_key, a.time_stamp, a.value
from
data a
where not exists
(select * from data b
where b.id_key=a.id_key and b.time_stamp > a.time_stamp)

-Kevin
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Merlin Moncure
2011-02-24 21:14:59 UTC
Permalink
On Thu, Feb 24, 2011 at 2:18 PM, Kevin Grittner
Post by Kevin Grittner
Post by Dave Crooke
create table data
   (id_key int,
    time_stamp timestamp without time zone,
    value double precision);
create unique index data_idx on data (id_key, time_stamp);
I need to find the most recent value for each distinct value of id_key.
Well, unless you use timestamp WITH time zone, you might not be able
to do that at all.  There are very few places where timestamp
WITHOUT time zone actually makes sense.
Post by Dave Crooke
There is no elegant (that I know of) syntax for this
select distinct on (id_key) * from data order by id_key, time_stamp;
Post by Dave Crooke
select
   a.id_key, a.time_stamp, a.value
from
   data a
where
  a.time_stamp=
     (select max(time_stamp)
      from data b
      where a.id_key=b.id_key)
select
  a.id_key, a.time_stamp, a.value
from
  data a
where not exists
 (select * from data b
  where b.id_key=a.id_key and b.time_stamp > a.time_stamp)
hm. not only is it faster, but much more flexible...that's definitely
the way to go.

merlin
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Florian Weimer
2011-02-26 20:54:50 UTC
Permalink
Post by Kevin Grittner
Post by Dave Crooke
create table data
(id_key int,
time_stamp timestamp without time zone,
value double precision);
create unique index data_idx on data (id_key, time_stamp);
I need to find the most recent value for each distinct value of id_key.
Well, unless you use timestamp WITH time zone, you might not be able
to do that at all. There are very few places where timestamp
WITHOUT time zone actually makes sense.
I don't think PostgreSQL keeps track of actual time zone values, just
as it doesn't keep track of the character encoding of TEXT columns.
Unless suppressed with WITHOUT TIME ZONE, PostgreSQL makes up some
time zone on demand. This makes TIMESTAMP WITH TIME ZONE not that
useful, and it's often to use TIMESTAMP WITHOUT TIME ZONE with times
in UTC.
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Michael Glaesemann
2011-02-24 20:21:49 UTC
Permalink
Post by Dave Crooke
Is there a more elegant way to write this, perhaps using PG-specific
extensions?
SELECT DISTINCT ON (data.id_key)
data.id_key, data.time_stamp, data.value
FROM data
ORDER BY data.id_key, data.time_stamp DESC;

Michael Glaesemann
grzm seespotcode net
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Kevin Grittner
2011-02-24 20:24:15 UTC
Permalink
Post by Michael Glaesemann
SELECT DISTINCT ON (data.id_key)
data.id_key, data.time_stamp, data.value
FROM data
ORDER BY data.id_key, data.time_stamp DESC;
Dang! I forgot the DESC in my post! Thanks for showing the
*correct* version.

-Kevin
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Dave Crooke
2011-02-24 23:38:37 UTC
Permalink
Thanks to all .... I had a tickling feeling at the back of my mind that
there was a neater answer here. For the record, times (all from in-memory
cached data, averaged over a bunch of runs):

Dependent subquery = 117.9 seconds
Join to temp table = 2.7 sec
DISTINCT ON = 2.7 sec

So the DISTINCT ON may not be quicker, but it sure is tidier.

Cheers
Dave
Post by Kevin Grittner
Post by Michael Glaesemann
SELECT DISTINCT ON (data.id_key)
data.id_key, data.time_stamp, data.value
FROM data
ORDER BY data.id_key, data.time_stamp DESC;
Dang! I forgot the DESC in my post! Thanks for showing the
*correct* version.
-Kevin
Josh Berkus
2011-02-25 00:20:00 UTC
Permalink
Post by Dave Crooke
Thanks to all .... I had a tickling feeling at the back of my mind that
there was a neater answer here. For the record, times (all from
Dependent subquery = 117.9 seconds
Join to temp table = 2.7 sec
DISTINCT ON = 2.7 sec
But wait, there's more! You haven't tested the Windowing Function
solution. I'll bet it's even faster.


SELECT id_key, time_stamp, value
FROM (
SELECT id_key, time_stamp, value,
row_number()
OVER ( PARTITION BY id_key
ORDER BY time_stamp DESC)
as ranking
FROM thetable
) as filtered_table
WHERE ranking = 1
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Shaun Thomas
2011-02-25 00:58:33 UTC
Permalink
Post by Josh Berkus
SELECT id_key, time_stamp, value
FROM (
SELECT id_key, time_stamp, value,
row_number()
OVER ( PARTITION BY id_key
ORDER BY time_stamp DESC)
as ranking
FROM thetable
) as filtered_table
WHERE ranking = 1
Why did you use row_number instead of rank?

I am now curious how the speed compares though. I still think the
DISTINCT ON will be faster, but it would be a great surprise.
--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
***@peak6.com

______________________________________________

See http://www.peak6.com/email_disclaimer.php
for terms and conditions related to this email
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Josh Berkus
2011-02-25 01:52:19 UTC
Permalink
Post by Shaun Thomas
Why did you use row_number instead of rank?
Because I assumed he only wanted one row in the event of ties.

Hmmm, although with that schema, there won't be ties. So it's pretty
much arbitrary then.
Post by Shaun Thomas
I am now curious how the speed compares though. I still think the
DISTINCT ON will be faster, but it would be a great surprise.
Hopefully we'll find out! The windowing functions are usually much
faster for me. I think in 9.0 or 9.1 someone replumbed DISTINCT ON to
use a bunch of the window function internals, at which point it'll cease
to matter.
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Dave Johansen
2011-02-25 16:50:40 UTC
Permalink
Post by Dave Crooke
Thanks to all .... I had a tickling feeling at the back of my mind that
there was a neater answer here. For the record, times (all from in-memory
Dependent subquery = 117.9 seconds
Join to temp table = 2.7 sec
DISTINCT ON = 2.7 sec
So the DISTINCT ON may not be quicker, but it sure is tidier.
Cheers
Dave
I'm using 8.3.3 and I have a similar sort of setup and just thought I'd add
another point of reference, here's the timing from doing the same sort of
queries on my dataset of ~700,000 records with ~10,000 unique "id_key"s.

I also added a 4th version that uses a permanent table that's auto-populated
by a trigger with the rid of the most recent entry from the main table, so
it's a simple join to get the latest entries.

Dependent subquery = (killed it after it ran for over 10 minutes)
Join on temp table = 1.5 seconds
DISTINCT ON = 2.9 seconds
Join on auto-populated table = 0.8 seconds

Dave
Dave Crooke
2011-02-25 20:45:23 UTC
Permalink
Hi Dave

Yes, 100% the best solution .... I did the same thing a while back, I just
have a separate copy of the data in a "latest" table and the Java code just
runs a second SQL statement to update it when writing a new record (I've
never been a trigger fan).

I found myself looking at the "find the latest" query again though in the
process of building a "demo mode" into our application, which will replay a
finite set of data on a rolling loop by moving it forward in time, and also
has to simulate the continuous updating of the "latest" table so the the
business logic will be appropriately fooled.

My next tweak will be to cache the "latest" table in the Java layer ;-)

Cheers
Dave
Post by Dave Johansen
Post by Dave Crooke
Thanks to all .... I had a tickling feeling at the back of my mind that
there was a neater answer here. For the record, times (all from in-memory
Dependent subquery = 117.9 seconds
Join to temp table = 2.7 sec
DISTINCT ON = 2.7 sec
So the DISTINCT ON may not be quicker, but it sure is tidier.
Cheers
Dave
I'm using 8.3.3 and I have a similar sort of setup and just thought I'd add
another point of reference, here's the timing from doing the same sort of
queries on my dataset of ~700,000 records with ~10,000 unique "id_key"s.
I also added a 4th version that uses a permanent table that's
auto-populated by a trigger with the rid of the most recent entry from the
main table, so it's a simple join to get the latest entries.
Dependent subquery = (killed it after it ran for over 10 minutes)
Join on temp table = 1.5 seconds
DISTINCT ON = 2.9 seconds
Join on auto-populated table = 0.8 seconds
Dave
Dave Johansen
2011-02-26 13:44:28 UTC
Permalink
Post by Dave Crooke
Hi Dave
Yes, 100% the best solution .... I did the same thing a while back, I just
have a separate copy of the data in a "latest" table and the Java code just
runs a second SQL statement to update it when writing a new record (I've
never been a trigger fan).
I found myself looking at the "find the latest" query again though in the
process of building a "demo mode" into our application, which will replay a
finite set of data on a rolling loop by moving it forward in time, and also
has to simulate the continuous updating of the "latest" table so the the
business logic will be appropriately fooled.
My next tweak will be to cache the "latest" table in the Java layer ;-)
Cheers
Dave
Our application has what sounds like a similar functionality that we call
"playback". The way that we did it was to have a schema called "playback"
with identical tables to those that we want to have repopulated. All the
other tables exist in only the "public" schema and then we don't have to do
any duplication of that data. Then during playback it just runs a query to
copy from the "public" table to the "playback" table and the trigger will
populate the "latest" table in the "playback" schema automatically just like
when the program is running normally and populating the "public" version.

The secret sauce comes in by setting "SET search_path TO playback, public;"
because then your application runs all the same queries to get the data and
doesn't have to know that anything different is going on other than the copy
coperation that it's doing. It's nice because it takes all of the data
management burden off of the application and then allows the database to do
the hard work for you. It's obviously not the perfect solution but it wasn't
too hard to setup and we've really liked the way it works.

Dave
Josh Berkus
2011-02-26 21:06:00 UTC
Permalink
Dave,

Why not test the windowing version I posted?
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Dave Johansen
2011-02-26 21:38:05 UTC
Permalink
Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff
wasn't added til 8.4.
Dave
Post by Josh Berkus
Dave,
Why not test the windowing version I posted?
--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Dave Johansen
2013-04-03 15:25:26 UTC
Permalink
Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff wasn't added til 8.4.
Dave
Post by Josh Berkus
Dave,
Why not test the windowing version I posted?
We finally have moved over to 8.4 and so I just wanted to post the time comparison numbers to show the times on 8.4 as well. This is also a newer data set with ~700k rows and ~4k distinct id_key values.

1) Dependent subquery
SELECT a.id_key, a.time_stamp, a.value FROM data AS a WHERE a.time_stamp = (SELECT MAX(time_stamp) FROM data AS b WHERE a.id_key = b.id_key);
8.3.3: Killed it after a few minutes
8.4.13: Killed it after a few minutes

2) Join against temporary table
SELECT a.id_key, a.time_stamp, a.value FROM data AS a JOIN (SELECT id_key, MAX(time_stamp) AS max_time_stamp FROM data GROUP BY id_key) AS b WHERE a.id_key = b.id_key AND a.time_stamp = b.max_time_stamp;
8.3.3: 1.4 s
8.4.13: 0.5 s

3) DISTINCT ON:
SELECT DISTINCT ON (id_key) id_key, time_stamp, value FROM data ORDER BY id_key, time_stamp DESC;
Without Index:
8.3.3: 34.1 s
8.4.13: 98.7 s
With Index (data(id_key, time_stamp DESC)):
8.3.3: 3.4 s
8.4.13: 1.3 s

4) Auto-populated table
SELECT id_key, time_stamp, value FROM data WHERE rid IN (SELECT rid FROM latestdata);
8.3.3: 0.2 s
8.4.13: 0.06 s

5) Windowing
SELECT id_key, time_stamp, value FROM (SELECT id_key, time_stamp, value, row_number() OVER (PARTITION BY id_key ORDER BY time_stamp DESC) AS ranking FROM data) AS a WHERE ranking=1;
8.3.3: N/A
8.4.13: 1.6 s

So the auto-populated table (#4) is the fastest by an order of magnitude, but the join against the temporary table (#2) is the next best option based on speed and doesn't require the extra multi-column index that DISTINCT ON (#3) does.

On a related note though, is there a way to make the multi-column index used in the DISTINCT ON more efficient. Based on the results, it appears that the multi-column index is actually a single index with the ordering of the tree based on the first value and then the second value. Is there a way to make it be a "multi-level index"? What I mean is that the first value is basically a tree/hash that then points to the second index because if that's possible then that would probably make the DISTINCT ON (#3) version as fast or faster than the auto-populated table (#4). Is there a way to create an index like that in postgres?

Thanks,
Dave
Dave Johansen
2013-04-05 16:54:06 UTC
Permalink
Post by Dave Johansen
Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff
wasn't added til 8.4.
Dave
Post by Josh Berkus
Dave,
Why not test the windowing version I posted?
We finally have moved over to 8.4 and so I just wanted to post the
time comparison numbers to show the times on 8.4 as well. This is also
a newer data set with ~700k rows and ~4k distinct id_key values.

1) Dependent subquery
SELECT a.id_key, a.time_stamp, a.value FROM data AS a WHERE
a.time_stamp = (SELECT MAX(time_stamp) FROM data AS b WHERE a.id_key =
b.id_key);
8.3.3: Killed it after a few minutes
8.4.13: Killed it after a few minutes

2) Join against temporary table
SELECT a.id_key, a.time_stamp, a.value FROM data AS a JOIN (SELECT
id_key, MAX(time_stamp) AS max_time_stamp FROM data GROUP BY id_key)
AS b WHERE a.id_key = b.id_key AND a.time_stamp = b.max_time_stamp;
8.3.3: 1.4 s
8.4.13: 0.5 s

3) DISTINCT ON:
SELECT DISTINCT ON (id_key) id_key, time_stamp, value FROM data ORDER
BY id_key, time_stamp DESC;
Without Index:
8.3.3: 34.1 s
8.4.13: 98.7 s
With Index (data(id_key, time_stamp DESC)):
8.3.3: 3.4 s
8.4.13: 1.3 s

4) Auto-populated table
SELECT id_key, time_stamp, value FROM data WHERE rid IN (SELECT rid
FROM latestdata);
8.3.3: 0.2 s
8.4.13: 0.06 s

5) Windowing
SELECT id_key, time_stamp, value FROM (SELECT id_key, time_stamp,
value, row_number() OVER (PARTITION BY id_key ORDER BY time_stamp
DESC) AS ranking FROM data) AS a WHERE ranking=1;
8.3.3: N/A
8.4.13: 1.6 s

So the auto-populated table (#4) is the fastest by an order of
magnitude, but the join against the temporary table (#2) is the next
best option based on speed and doesn't require the extra multi-column
index that DISTINCT ON (#3) does.

On a related note though, is there a way to make the multi-column
index used in the DISTINCT ON more efficient. Based on the results, it
appears that the multi-column index is actually a single index with
the ordering of the tree based on the first value and then the second
value. Is there a way to make it be a "multi-level index"? What I mean
is that the first value is basically a tree/hash that then points to
the second index because if that's possible then that would probably
make the DISTINCT ON (#3) version as fast or faster than the
auto-populated table (#4). Is there a way to create an index like that
in postgres?

Thanks,
Dave
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Merlin Moncure
2013-04-05 18:40:16 UTC
Permalink
Post by Dave Johansen
Post by Dave Johansen
Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff
wasn't added til 8.4.
Dave
Post by Josh Berkus
Dave,
Why not test the windowing version I posted?
We finally have moved over to 8.4 and so I just wanted to post the
time comparison numbers to show the times on 8.4 as well. This is also
a newer data set with ~700k rows and ~4k distinct id_key values.
1) Dependent subquery
SELECT a.id_key, a.time_stamp, a.value FROM data AS a WHERE
a.time_stamp = (SELECT MAX(time_stamp) FROM data AS b WHERE a.id_key =
b.id_key);
8.3.3: Killed it after a few minutes
8.4.13: Killed it after a few minutes
2) Join against temporary table
SELECT a.id_key, a.time_stamp, a.value FROM data AS a JOIN (SELECT
id_key, MAX(time_stamp) AS max_time_stamp FROM data GROUP BY id_key)
AS b WHERE a.id_key = b.id_key AND a.time_stamp = b.max_time_stamp;
8.3.3: 1.4 s
8.4.13: 0.5 s
SELECT DISTINCT ON (id_key) id_key, time_stamp, value FROM data ORDER
BY id_key, time_stamp DESC;
8.3.3: 34.1 s
8.4.13: 98.7 s
8.3.3: 3.4 s
8.4.13: 1.3 s
4) Auto-populated table
SELECT id_key, time_stamp, value FROM data WHERE rid IN (SELECT rid
FROM latestdata);
8.3.3: 0.2 s
8.4.13: 0.06 s
5) Windowing
SELECT id_key, time_stamp, value FROM (SELECT id_key, time_stamp,
value, row_number() OVER (PARTITION BY id_key ORDER BY time_stamp
DESC) AS ranking FROM data) AS a WHERE ranking=1;
8.3.3: N/A
8.4.13: 1.6 s
I would also test:

*) EXISTS()

SELECT a.id_key, a.time_stamp, a.value FROM data
WHERE NOT EXISTS
(
SELECT 1 FROM data b
WHERE
a.id_key = b.id_key
and b.time_stamp > a.time_stamp
);

*) custom aggregate (this will not be the fastest option but is a good
technique to know -- it can be a real life saver when selection
criteria is complex)

CREATE FUNCTION agg_latest_data(data, data) returns data AS
$$
SELECT CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
$$ LANGUAGE SQL IMMUTABLE;

CREATE AGGREGATE latest_data (
SFUNC=agg_latest_data,
STYPE=data
);

SELECT latest_data(d) FROM data d group by d.id_key;

the above returns the composite, not the fields, but that can be worked around.

merlin
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Dave Johansen
2013-04-08 21:10:53 UTC
Permalink
Post by Merlin Moncure
Post by Dave Johansen
Post by Dave Johansen
Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff
wasn't added til 8.4.
Dave
Post by Josh Berkus
Dave,
Why not test the windowing version I posted?
We finally have moved over to 8.4 and so I just wanted to post the
time comparison numbers to show the times on 8.4 as well. This is also
a newer data set with ~700k rows and ~4k distinct id_key values.
1) Dependent subquery
SELECT a.id_key, a.time_stamp, a.value FROM data AS a WHERE
a.time_stamp = (SELECT MAX(time_stamp) FROM data AS b WHERE a.id_key =
b.id_key);
8.3.3: Killed it after a few minutes
8.4.13: Killed it after a few minutes
2) Join against temporary table
SELECT a.id_key, a.time_stamp, a.value FROM data AS a JOIN (SELECT
id_key, MAX(time_stamp) AS max_time_stamp FROM data GROUP BY id_key)
AS b WHERE a.id_key = b.id_key AND a.time_stamp = b.max_time_stamp;
8.3.3: 1.4 s
8.4.13: 0.5 s
SELECT DISTINCT ON (id_key) id_key, time_stamp, value FROM data ORDER
BY id_key, time_stamp DESC;
8.3.3: 34.1 s
8.4.13: 98.7 s
8.3.3: 3.4 s
8.4.13: 1.3 s
4) Auto-populated table
SELECT id_key, time_stamp, value FROM data WHERE rid IN (SELECT rid
FROM latestdata);
8.3.3: 0.2 s
8.4.13: 0.06 s
5) Windowing
SELECT id_key, time_stamp, value FROM (SELECT id_key, time_stamp,
value, row_number() OVER (PARTITION BY id_key ORDER BY time_stamp
DESC) AS ranking FROM data) AS a WHERE ranking=1;
8.3.3: N/A
8.4.13: 1.6 s
*) EXISTS()
SELECT a.id_key, a.time_stamp, a.value FROM data
WHERE NOT EXISTS
(
SELECT 1 FROM data b
WHERE
a.id_key = b.id_key
and b.time_stamp > a.time_stamp
);
I tried this and it was slow:
8.3.3: 674.4 s
8.4.13: 40.4 s
Post by Merlin Moncure
*) custom aggregate (this will not be the fastest option but is a good
technique to know -- it can be a real life saver when selection
criteria is complex)
CREATE FUNCTION agg_latest_data(data, data) returns data AS
$$
SELECT CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
$$ LANGUAGE SQL IMMUTABLE;
CREATE AGGREGATE latest_data (
SFUNC=agg_latest_data,
STYPE=data
);
SELECT latest_data(d) FROM data d group by d.id_key;
the above returns the composite, not the fields, but that can be worked around.
My real table actually returns/needs all the values from the row so I
didn't feel like messing with aggregate stuff.

Thanks,
Dave
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Dave Johansen
2013-04-05 15:18:00 UTC
Permalink
Post by Josh Berkus
Dave,
Why not test the windowing version I posted?
We finally have moved over to 8.4 and so I just wanted to post the time comparison numbers to show the times on 8.4 as well. This is also a newer data set with ~700k rows and ~4k distinct id_key values.

1) Dependent subquery
SELECT a.id_key, a.time_stamp, a.value FROM data AS a WHERE a.time_stamp = (SELECT MAX(time_stamp) FROM data AS b WHERE a.id_key = b.id_key);
8.3.3: Killed it after a few minutes
8.4.13: Killed it after a few minutes

2) Join against temporary table
SELECT a.id_key, a.time_stamp, a.value FROM data AS a JOIN (SELECT id_key, MAX(time_stamp) AS max_time_stamp FROM data GROUP BY id_key) AS b WHERE a.id_key = b.id_key AND a.time_stamp = b.max_time_stamp;
8.3.3: 1.4 s
8.4.13: 0.5 s

3) DISTINCT ON:
SELECT DISTINCT ON (id_key) id_key, time_stamp, value FROM data ORDER BY id_key, time_stamp DESC;
Without Index:
8.3.3: 34.1 s
8.4.13: 98.7 s
With Index (data(id_key, time_stamp DESC)):
8.3.3: 3.4 s
8.4.13: 1.3 s

4) Auto-populated table
SELECT id_key, time_stamp, value FROM data WHERE rid IN (SELECT rid FROM latestdata);
8.3.3: 0.2 s
8.4.13: 0.06 s

5) Windowing
SELECT id_key, time_stamp, value FROM (SELECT id_key, time_stamp, value, row_number() OVER (PARTITION BY id_key ORDER BY time_stamp DESC) AS ranking FROM data) AS a WHERE ranking=1;
8.3.3: N/A
8.4.13: 1.6 s

So the auto-populated table (#4) is the fastest by an order of magnitude, but the join against the temporary table (#2) is the next best option based on speed and doesn't require the extra multi-column index that DISTINCT ON (#3) does.

On a related note though, is there a way to make the multi-column index used in the DISTINCT ON more efficient. Based on the results, it appears that the multi-column index is actually a single index with the ordering of the tree based on the first value and then the second value. Is there a way to make it be a "multi-level index"? What I mean is that the first value is basically a tree/hash that then points to the second index because if that's possible then that would probably make the DISTINCT ON (#3) version as fast or faster than the auto-populated table (#4). Is there a way to create an index like that in postgres?

Thanks,
Dave
Kevin Grittner
2011-02-26 21:52:44 UTC
Permalink
Post by Florian Weimer
Post by Kevin Grittner
Well, unless you use timestamp WITH time zone, you might not be
able to do that at all. There are very few places where timestamp
WITHOUT time zone actually makes sense.
I don't think PostgreSQL keeps track of actual time zone values,
True -- TIMESTAMP WITH TIME ZONE is always stored in UTC, which makes
it part of a consistent time stream. If you use TIMESTAMP WITHOUT
TIME ZONE, then unless you go to a lot of trouble you have a gap in
your time line in the spring and an overlap in autumn. With enough
work you can dance around that, but it's a heck of lot easier when
you can count on the UTC storage.

It sounds like you've successfully managed to find a way to dance
around it, so it might not be worth trying to refactor now; but I'd
bet your code would be simpler and more robust if you worked with the
data type intended to represent a moment in the stream of time
instead of constantly trying to pin the WITHOUT TIME ZONE to a time
zone (UTC) explicitly.

-Kevin
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Loading...