Mark Davidson
2013-04-05 15:51:37 UTC
Hi All,
Hoping someone can help me out with some performance issues I'm having with
an INDEX on my database. I've got a database with a data table containing
~55,000,000 rows of point data and an area table containing ~3,500 rows of
polygon data. A user queries the data by selecting which areas they want to
view and applying other filters such as datetime range and which datasets
to query. This all worked fine while the intersection of the data rows with
the areas was being done on the fly with PostGIS ST_Intersects. However, as
the data table grew we decided it would make sense to offload that
processing: rather than calculating the intersection for each row on the
fly, we pre-calculate it and store the result in a join table. This
produced a table data_area containing ~250,000,000 rows, which simply has
two columns recording which data rows intersect which areas. We were
expecting this to give a significant performance improvement to query time,
but the query seems to spend a very long time scanning that index. I'm
thinking there must be something wrong with my setup or the query itself,
as I'm sure Postgres can perform better than this. I've tried restructuring
the query, changing config settings, and doing maintenance like VACUUM, but
nothing has helped.
Hope that introduction is clear enough and makes sense; if anything is
unclear, please let me know.
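For reference, the join table is populated with something along these lines
(a sketch of the idea, not the exact batch job we run; in practice it is
done in chunks rather than one statement):

```sql
-- Pre-compute which data points fall inside which areas, so queries
-- can join on data_area instead of calling ST_Intersects at read time.
INSERT INTO data_area (data_id, area_id)
SELECT d.id, a.id
FROM data d
JOIN area a ON ST_Intersects(d.point, a.polygon);
```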
I'm using PostgreSQL 9.1.4 on x86_64-unknown-linux-gnu, compiled by
gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5.1) 4.4.3, 64-bit on Ubuntu 12.04 which
was installed using apt.
Here is the structure of my database tables
CREATE TABLE data
(
id bigserial NOT NULL,
datasetid integer NOT NULL,
readingdatetime timestamp without time zone NOT NULL,
depth double precision NOT NULL,
readingdatetime2 timestamp without time zone,
depth2 double precision,
value double precision NOT NULL,
uploaddatetime timestamp without time zone,
description character varying(255),
point geometry,
point2 geometry,
CONSTRAINT "DATAPRIMARYKEY" PRIMARY KEY (id ),
CONSTRAINT enforce_dims_point CHECK (st_ndims(point) = 2),
CONSTRAINT enforce_dims_point2 CHECK (st_ndims(point2) = 2),
CONSTRAINT enforce_geotype_point CHECK (geometrytype(point) =
'POINT'::text OR point IS NULL),
CONSTRAINT enforce_geotype_point2 CHECK (geometrytype(point2) =
'POINT'::text OR point2 IS NULL),
CONSTRAINT enforce_srid_point CHECK (st_srid(point) = 4326),
CONSTRAINT enforce_srid_point2 CHECK (st_srid(point2) = 4326)
);
CREATE INDEX data_datasetid_index ON data USING btree (datasetid );
CREATE INDEX data_point_index ON data USING gist (point );
CREATE INDEX "data_readingDatetime_index" ON data USING btree
(readingdatetime );
ALTER TABLE data CLUSTER ON "data_readingDatetime_index";
CREATE TABLE area
(
id serial NOT NULL,
"areaCode" character varying(10) NOT NULL,
country character varying(250) NOT NULL,
"polysetID" integer NOT NULL,
polygon geometry,
CONSTRAINT area_primary_key PRIMARY KEY (id ),
CONSTRAINT polyset_foreign_key FOREIGN KEY ("polysetID")
REFERENCES polyset (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT enforce_dims_area CHECK (st_ndims(polygon) = 2),
CONSTRAINT enforce_geotype_area CHECK (geometrytype(polygon) =
'POLYGON'::text OR polygon IS NULL),
CONSTRAINT enforce_srid_area CHECK (st_srid(polygon) = 4326)
);
CREATE INDEX area_polygon_index ON area USING gist (polygon );
CREATE INDEX "area_polysetID_index" ON area USING btree ("polysetID" );
ALTER TABLE area CLUSTER ON "area_polysetID_index";
CREATE TABLE data_area
(
data_id integer NOT NULL,
area_id integer NOT NULL,
CONSTRAINT data_area_pkey PRIMARY KEY (data_id , area_id ),
CONSTRAINT data_area_area_id_fk FOREIGN KEY (area_id)
REFERENCES area (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT data_area_data_id_fk FOREIGN KEY (data_id)
REFERENCES data (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
);
Here is the query I'm running; the result of its EXPLAIN can be found here:
http://explain.depesz.com/s/1yu
SELECT *
FROM data d
JOIN data_area da ON da.data_id = d.id
LEFT JOIN area a ON da.area_id = a.id
WHERE d.datasetid IN (5634,5635,5636,5637,5638,5639,5640,5641,5642)
  AND da.area_id IN (1, 2, 3 .... 9999)
  AND readingdatetime BETWEEN '1990-01-01' AND '2013-01-01'
  AND depth BETWEEN 0 AND 99999;
If you look at the explain, roughly 97% of the time is spent on the index
scan for the JOIN against data_area.
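In case it matters, the plan linked above was produced with EXPLAIN ANALYZE
run over the same query, along these lines (BUFFERS output can be added too
if that would help diagnose it):

```sql
-- ANALYZE executes the query and reports actual timings per plan node;
-- BUFFERS additionally shows shared-buffer hits vs. disk reads.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM data d
JOIN data_area da ON da.data_id = d.id
LEFT JOIN area a ON da.area_id = a.id
WHERE d.datasetid IN (5634,5635,5636,5637,5638,5639,5640,5641,5642)
  AND da.area_id IN (1, 2, 3 .... 9999)
  AND readingdatetime BETWEEN '1990-01-01' AND '2013-01-01'
  AND depth BETWEEN 0 AND 99999;
```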
Hardware
- CPU: Intel(R) Xeon(R) CPU E5420 ( 8 Cores )
- RAM: 16GB
Config Changes
I'm using the base Ubuntu config apart from the following changes:
- shared_buffers set to 2GB
- work_mem set to 1GB
- maintenance_work_mem set to 512MB
- effective_cache_size set to 8GB
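In postgresql.conf those settings look like this (everything else is at the
Ubuntu package defaults):

```
shared_buffers = 2GB
work_mem = 1GB
maintenance_work_mem = 512MB
effective_cache_size = 8GB
```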
Think that covers everything; hope this has enough detail for someone to be
able to help. If there is anything I've missed, please let me know and I'll
add any further info needed. Any input on optimising the table structure,
the query, or anything else I can do to sort this issue would be most
appreciated.
Thanks in advance,
Mark Davidson