Discussion:
Large Table - Slow Window Functions (Better Approach?)
(too old to reply)
Jeff Adams - NOAA Affiliate
2013-03-11 14:27:17 UTC
Permalink
Greetings,



I have a large table (~90 million rows) containing vessel positions. In
addition to a column that contains the location information (the_geom), the
table also contains two columns that are used to uniquely identify the
vessel (mmsi and name) and a column containing the Unix time (epoch) at
which the position information was logged. I frequently need to assign
records to vessel transits. To do this, I currently create a CTE that uses
a Window function (partitioning the data by mmsi and name ordered by epoch)
to examine the time that has elapsed between successive position reports
for individual vessels. For every position record for a vessel (as
identified using mmsi and name), if the time elapsed between the current
position record and the previous record (using the lag function) is less
than or equal to 2 hours, I assign the record a value of 0 to a CTE column
named tr_index. If the time elapsed is greater than 2 hours, I assign the
record a value of 1 to the tr_index column. I then use the CTE to generate
transit numbers by summing the values in the tr_index field across a Window
that also partitions the data by mmsi and name and is ordered by epoch.
This works, but is very slow (hours). The table is indexed (multi-column
index on mmsi, name and index on epoch). Does anyone see a way to get what
I am after in a more efficient manner. What I am after is an assignment of
transit number to vessels' position records based on whether the records
were within two hours of each other. The SQL that I used is provided below.
Any advice would be greatly appreciated...



WITH

cte_01 AS

(

SELECT

a.id,

a.mmsi,

a.name,

a.epoch,

a.the_geom

CASE

WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1

ELSE 0

END AS tr_index

FROM table a

WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)

)





SELECT

a.id,

a.mmsi,

a.name,

a.epoch,

a.the_geom,

1 + sum(a.tr_index) OVER w AS transit,

a.active

FROM cte_01 a

WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
--
Jeff
Pavel Stehule
2013-03-11 15:03:38 UTC
Permalink
Hello

you can try procedural solution - use a cursor over ordered data in
plpgsql and returns table

Regards

Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
Greetings,
I have a large table (~90 million rows) containing vessel positions. In
addition to a column that contains the location information (the_geom), the
table also contains two columns that are used to uniquely identify the
vessel (mmsi and name) and a column containing the Unix time (epoch) at
which the position information was logged. I frequently need to assign
records to vessel transits. To do this, I currently create a CTE that uses a
Window function (partitioning the data by mmsi and name ordered by epoch) to
examine the time that has elapsed between successive position reports for
individual vessels. For every position record for a vessel (as identified
using mmsi and name), if the time elapsed between the current position
record and the previous record (using the lag function) is less than or
equal to 2 hours, I assign the record a value of 0 to a CTE column named
tr_index. If the time elapsed is greater than 2 hours, I assign the record a
value of 1 to the tr_index column. I then use the CTE to generate transit
numbers by summing the values in the tr_index field across a Window that
also partitions the data by mmsi and name and is ordered by epoch. This
works, but is very slow (hours). The table is indexed (multi-column index on
mmsi, name and index on epoch). Does anyone see a way to get what I am after
in a more efficient manner. What I am after is an assignment of transit
number to vessels' position records based on whether the records were within
two hours of each other. The SQL that I used is provided below. Any advice
would be greatly appreciated...
WITH
cte_01 AS
(
SELECT
a.id,
a.mmsi,
a.name,
a.epoch,
a.the_geom
CASE
WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
ELSE 0
END AS tr_index
FROM table a
WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
)
SELECT
a.id,
a.mmsi,
a.name,
a.epoch,
a.the_geom,
1 + sum(a.tr_index) OVER w AS transit,
a.active
FROM cte_01 a
WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
--
Jeff
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Jeff Adams - NOAA Affiliate
2013-03-11 15:20:07 UTC
Permalink
Pavel,

Thanks for the response. I have not yet had the opportunity to use cursors,
but am now curious. Could you perhaps provide a bit more detail as to what
the implementation of your suggested approach would look like?
Post by Pavel Stehule
Hello
you can try procedural solution - use a cursor over ordered data in
plpgsql and returns table
Regards
Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
Greetings,
I have a large table (~90 million rows) containing vessel positions. In
addition to a column that contains the location information (the_geom),
the
Post by Jeff Adams - NOAA Affiliate
table also contains two columns that are used to uniquely identify the
vessel (mmsi and name) and a column containing the Unix time (epoch) at
which the position information was logged. I frequently need to assign
records to vessel transits. To do this, I currently create a CTE that
uses a
Post by Jeff Adams - NOAA Affiliate
Window function (partitioning the data by mmsi and name ordered by
epoch) to
Post by Jeff Adams - NOAA Affiliate
examine the time that has elapsed between successive position reports for
individual vessels. For every position record for a vessel (as identified
using mmsi and name), if the time elapsed between the current position
record and the previous record (using the lag function) is less than or
equal to 2 hours, I assign the record a value of 0 to a CTE column named
tr_index. If the time elapsed is greater than 2 hours, I assign the
record a
Post by Jeff Adams - NOAA Affiliate
value of 1 to the tr_index column. I then use the CTE to generate transit
numbers by summing the values in the tr_index field across a Window that
also partitions the data by mmsi and name and is ordered by epoch. This
works, but is very slow (hours). The table is indexed (multi-column
index on
Post by Jeff Adams - NOAA Affiliate
mmsi, name and index on epoch). Does anyone see a way to get what I am
after
Post by Jeff Adams - NOAA Affiliate
in a more efficient manner. What I am after is an assignment of transit
number to vessels' position records based on whether the records were
within
Post by Jeff Adams - NOAA Affiliate
two hours of each other. The SQL that I used is provided below. Any
advice
Post by Jeff Adams - NOAA Affiliate
would be greatly appreciated...
WITH
cte_01 AS
(
SELECT
a.id,
a.mmsi,
a.name,
a.epoch,
a.the_geom
CASE
WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
ELSE 0
END AS tr_index
FROM table a
WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
)
SELECT
a.id,
a.mmsi,
a.name,
a.epoch,
a.the_geom,
1 + sum(a.tr_index) OVER w AS transit,
a.active
FROM cte_01 a
WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
--
Jeff
--
Jeffrey D. Adams
Contractor
OAI, Inc.
In support of:
National Marine Fisheries Service
Office of Protected Resources
1315 East West Hwy, Building SSMC3
Silver Spring, MD 20910-3282
phone: (301) 427-8434
fax: (301) 713-0376
Pavel Stehule
2013-03-11 15:34:17 UTC
Permalink
Post by Jeff Adams - NOAA Affiliate
Pavel,
Thanks for the response. I have not yet had the opportunity to use cursors,
but am now curious. Could you perhaps provide a bit more detail as to what
the implementation of your suggested approach would look like?
an example:

$$
DECLARE
r record;
prev_r record;

BEGIN
FOR r IN SELECT * FROM a ORDER BY epoch, mmsi
LOOP
IF prev_r IS NOT NULL THEN
/* do some counting */
prev_r contains previous row, r contains current row
do some
RETURN NEXT .. /* return data in defined order */
END IF;
prev_r = r;
END LOOP;


Probably slow part of your query is sorting - first can be accelerated
by index, but second (as CTE result cannot) - you can try increase
work_mem ??

Regards

Pavel
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Hello
you can try procedural solution - use a cursor over ordered data in
plpgsql and returns table
Regards
Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
Greetings,
I have a large table (~90 million rows) containing vessel positions. In
addition to a column that contains the location information (the_geom), the
table also contains two columns that are used to uniquely identify the
vessel (mmsi and name) and a column containing the Unix time (epoch) at
which the position information was logged. I frequently need to assign
records to vessel transits. To do this, I currently create a CTE that uses a
Window function (partitioning the data by mmsi and name ordered by epoch) to
examine the time that has elapsed between successive position reports for
individual vessels. For every position record for a vessel (as identified
using mmsi and name), if the time elapsed between the current position
record and the previous record (using the lag function) is less than or
equal to 2 hours, I assign the record a value of 0 to a CTE column named
tr_index. If the time elapsed is greater than 2 hours, I assign the record a
value of 1 to the tr_index column. I then use the CTE to generate transit
numbers by summing the values in the tr_index field across a Window that
also partitions the data by mmsi and name and is ordered by epoch. This
works, but is very slow (hours). The table is indexed (multi-column index on
mmsi, name and index on epoch). Does anyone see a way to get what I am after
in a more efficient manner. What I am after is an assignment of transit
number to vessels' position records based on whether the records were within
two hours of each other. The SQL that I used is provided below. Any advice
would be greatly appreciated...
WITH
cte_01 AS
(
SELECT
a.id,
a.mmsi,
a.name,
a.epoch,
a.the_geom
CASE
WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
ELSE 0
END AS tr_index
FROM table a
WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
)
SELECT
a.id,
a.mmsi,
a.name,
a.epoch,
a.the_geom,
1 + sum(a.tr_index) OVER w AS transit,
a.active
FROM cte_01 a
WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
--
Jeff
--
Jeffrey D. Adams
Contractor
OAI, Inc.
National Marine Fisheries Service
Office of Protected Resources
1315 East West Hwy, Building SSMC3
Silver Spring, MD 20910-3282
phone: (301) 427-8434
fax: (301) 713-0376
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Jeff Adams - NOAA Affiliate
2013-03-11 15:48:17 UTC
Permalink
Thanks again. The sorting does appear to be the issue. I will test out your
cursor idea...
Post by Jeff Adams - NOAA Affiliate
Post by Jeff Adams - NOAA Affiliate
Pavel,
Thanks for the response. I have not yet had the opportunity to use
cursors,
Post by Jeff Adams - NOAA Affiliate
but am now curious. Could you perhaps provide a bit more detail as to
what
Post by Jeff Adams - NOAA Affiliate
the implementation of your suggested approach would look like?
$$
DECLARE
r record;
prev_r record;
BEGIN
FOR r IN SELECT * FROM a ORDER BY epoch, mmsi
LOOP
IF prev_r IS NOT NULL THEN
/* do some counting */
prev_r contains previous row, r contains current row
do some
RETURN NEXT .. /* return data in defined order */
END IF;
prev_r = r;
END LOOP;
Probably slow part of your query is sorting - first can be accelerated
by index, but second (as CTE result cannot) - you can try increase
work_mem ??
Regards
Pavel
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Hello
you can try procedural solution - use a cursor over ordered data in
plpgsql and returns table
Regards
Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
Greetings,
I have a large table (~90 million rows) containing vessel positions.
In
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
addition to a column that contains the location information
(the_geom),
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
the
table also contains two columns that are used to uniquely identify the
vessel (mmsi and name) and a column containing the Unix time (epoch)
at
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
which the position information was logged. I frequently need to assign
records to vessel transits. To do this, I currently create a CTE that uses a
Window function (partitioning the data by mmsi and name ordered by epoch) to
examine the time that has elapsed between successive position reports for
individual vessels. For every position record for a vessel (as identified
using mmsi and name), if the time elapsed between the current position
record and the previous record (using the lag function) is less than
or
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
equal to 2 hours, I assign the record a value of 0 to a CTE column
named
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
tr_index. If the time elapsed is greater than 2 hours, I assign the record a
value of 1 to the tr_index column. I then use the CTE to generate transit
numbers by summing the values in the tr_index field across a Window
that
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
also partitions the data by mmsi and name and is ordered by epoch.
This
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
works, but is very slow (hours). The table is indexed (multi-column index on
mmsi, name and index on epoch). Does anyone see a way to get what I am after
in a more efficient manner. What I am after is an assignment of
transit
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
number to vessels' position records based on whether the records were within
two hours of each other. The SQL that I used is provided below. Any advice
would be greatly appreciated...
WITH
cte_01 AS
(
SELECT
a.id,
a.mmsi,
a.name,
a.epoch,
a.the_geom
CASE
WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
ELSE 0
END AS tr_index
FROM table a
WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
)
SELECT
a.id,
a.mmsi,
a.name,
a.epoch,
a.the_geom,
1 + sum(a.tr_index) OVER w AS transit,
a.active
FROM cte_01 a
WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
--
Jeff
Jeff Adams - NOAA Affiliate
2013-03-12 13:08:31 UTC
Permalink
So, I tested out the cursor approach, and it still chugs along for hours.
If the result set is large (and the available memory to process small),
does it matter what goes on within the cursor. Will it still choke trying
assemble and spit out the large result set?

On Mon, Mar 11, 2013 at 11:48 AM, Jeff Adams - NOAA Affiliate <
Post by Jeff Adams - NOAA Affiliate
Thanks again. The sorting does appear to be the issue. I will test out
your cursor idea...
Post by Jeff Adams - NOAA Affiliate
Post by Jeff Adams - NOAA Affiliate
Pavel,
Thanks for the response. I have not yet had the opportunity to use
cursors,
Post by Jeff Adams - NOAA Affiliate
but am now curious. Could you perhaps provide a bit more detail as to
what
Post by Jeff Adams - NOAA Affiliate
the implementation of your suggested approach would look like?
$$
DECLARE
r record;
prev_r record;
BEGIN
FOR r IN SELECT * FROM a ORDER BY epoch, mmsi
LOOP
IF prev_r IS NOT NULL THEN
/* do some counting */
prev_r contains previous row, r contains current row
do some
RETURN NEXT .. /* return data in defined order */
END IF;
prev_r = r;
END LOOP;
Probably slow part of your query is sorting - first can be accelerated
by index, but second (as CTE result cannot) - you can try increase
work_mem ??
Regards
Pavel
Post by Jeff Adams - NOAA Affiliate
On Mon, Mar 11, 2013 at 11:03 AM, Pavel Stehule <
Post by Pavel Stehule
Hello
you can try procedural solution - use a cursor over ordered data in
plpgsql and returns table
Regards
Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
Greetings,
I have a large table (~90 million rows) containing vessel positions.
In
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
addition to a column that contains the location information
(the_geom),
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
the
table also contains two columns that are used to uniquely identify
the
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
vessel (mmsi and name) and a column containing the Unix time (epoch)
at
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
which the position information was logged. I frequently need to
assign
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
records to vessel transits. To do this, I currently create a CTE that uses a
Window function (partitioning the data by mmsi and name ordered by epoch) to
examine the time that has elapsed between successive position reports for
individual vessels. For every position record for a vessel (as identified
using mmsi and name), if the time elapsed between the current
position
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
record and the previous record (using the lag function) is less than
or
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
equal to 2 hours, I assign the record a value of 0 to a CTE column
named
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
tr_index. If the time elapsed is greater than 2 hours, I assign the record a
value of 1 to the tr_index column. I then use the CTE to generate transit
numbers by summing the values in the tr_index field across a Window
that
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
also partitions the data by mmsi and name and is ordered by epoch.
This
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
works, but is very slow (hours). The table is indexed (multi-column index on
mmsi, name and index on epoch). Does anyone see a way to get what I
am
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
after
in a more efficient manner. What I am after is an assignment of
transit
Post by Jeff Adams - NOAA Affiliate
Post by Pavel Stehule
Post by Jeff Adams - NOAA Affiliate
number to vessels' position records based on whether the records were within
two hours of each other. The SQL that I used is provided below. Any advice
would be greatly appreciated...
WITH
cte_01 AS
(
SELECT
a.id,
a.mmsi,
a.name,
a.epoch,
a.the_geom
CASE
WHEN ((a.epoch - lag(a.epoch) OVER w) / 60) > 120 THEN 1
ELSE 0
END AS tr_index
FROM table a
WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
)
SELECT
a.id,
a.mmsi,
a.name,
a.epoch,
a.the_geom,
1 + sum(a.tr_index) OVER w AS transit,
a.active
FROM cte_01 a
WINDOW w AS (PARTITION BY a.mmsi, a.name ORDER BY a.epoch)
--
Jeff
--
Jeffrey D. Adams
Contractor
OAI, Inc.
In support of:
National Marine Fisheries Service
Office of Protected Resources
1315 East West Hwy, Building SSMC3
Silver Spring, MD 20910-3282
phone: (301) 427-8434
fax: (301) 713-0376
Victor Yegorov
2013-03-12 13:36:54 UTC
Permalink
Post by Jeff Adams - NOAA Affiliate
Greetings,
I have a large table (~90 million rows) containing vessel positions.
...
Could you kindly provide a script to create the table and populate it with
several sample
rows, please? Also, provide the desired output for the sample rows.

It would be good to take a look on the “EXPLAIN ANALYZE” output of your
query,
please, share http://explain.depesz.com/ link.
--
Victor Y. Yegorov
Loading...