Dave Crooke
2011-02-24 19:55:31 UTC
Hi foks
This is an old chestnut which I've found a number of online threads for, and
never seen a clever answer to. It seems a common enough idiom that there
might be some slicker way to do it, so I thought I might inquire with this
august group if such a clever answer exists ....
Consider the following table
create table data
(id_key int,
time_stamp timestamp without time zone,
value double precision);
create unique index data_idx on data (id_key, time_stamp);
with around 1m rows, with 3500 or so distinct values of id_key.
I need to find the most recent value for each distinct value of id_key.
There is no elegant (that I know of) syntax for this, and there are two ways
I've typically seen it done:
1. Use a dependent subquery to find the most recent time stamp, i.e.
select
a.id_key, a.time_stamp, a.value
from
data a
where
a.time_stamp=
(select max(time_stamp)
from data b
where a.id_key=b.id_key)
2. Define a temporary table / view with the most recent time stamp for each
key, and join against it:
select
a.id_key, a.time_stamp, a.value
from
data a,
(select id_key, max(time_stamp) as mts
from data group by id_key) b
where
a.id_key=b.id_key and a.time_stamp=b.mts
I've found that for my data set, PG 8.4.2 selects the "obvious" / "do it as
written" plan in each case, and that method 2. is much quicker (2.6 sec vs.
2 min on my laptop) ....
Is there a more elegant way to write this, perhaps using PG-specific
extensions?
Cheers
Dave
This is an old chestnut which I've found a number of online threads for, and
never seen a clever answer to. It seems a common enough idiom that there
might be some slicker way to do it, so I thought I might inquire with this
august group if such a clever answer exists ....
Consider the following table
create table data
(id_key int,
time_stamp timestamp without time zone,
value double precision);
create unique index data_idx on data (id_key, time_stamp);
with around 1m rows, with 3500 or so distinct values of id_key.
I need to find the most recent value for each distinct value of id_key.
There is no elegant (that I know of) syntax for this, and there are two ways
I've typically seen it done:
1. Use a dependent subquery to find the most recent time stamp, i.e.
select
a.id_key, a.time_stamp, a.value
from
data a
where
a.time_stamp=
(select max(time_stamp)
from data b
where a.id_key=b.id_key)
2. Define a temporary table / view with the most recent time stamp for each
key, and join against it:
select
a.id_key, a.time_stamp, a.value
from
data a,
(select id_key, max(time_stamp) as mts
from data group by id_key) b
where
a.id_key=b.id_key and a.time_stamp=b.mts
I've found that for my data set, PG 8.4.2 selects the "obvious" / "do it as
written" plan in each case, and that method 2. is much quicker (2.6 sec vs.
2 min on my laptop) ....
Is there a more elegant way to write this, perhaps using PG-specific
extensions?
Cheers
Dave