Discussion: INDEX Performance Issue
Mark Davidson
2013-04-05 15:51:37 UTC
Hi All,

Hoping someone can help me out with some performance issues I'm having with
an INDEX on my database. I've got a database with a data table containing
~55,000,000 rows of point data and an area table containing ~3,500 rows of
polygon data. A user queries the data by selecting which areas they want to
view, along with other filters such as datetime and which datasets to query.
This all works fine, and previously the intersection of the data rows with
the areas was computed on the fly with PostGIS ST_Intersects. However, as
the data table grew we decided it would make sense to offload that
processing: rather than calculating the intersection for each row on the
fly, we pre-calculate it and store the result in a join table. This produced
a table data_area containing ~250,000,000 rows, with just two columns
recording which data rows intersect which areas. We were expecting this to
give a significant improvement in query time, but the query seems to spend a
very long time on the index scan. I'm thinking there must be something wrong
with my setup or the query itself, as I'm sure Postgres can perform better.
I've tried restructuring the query, changing config settings and doing
maintenance like VACUUM, but nothing has helped.

Hope that introduction is clear enough and makes sense. If anything is
unclear, please let me know.

I'm using PostgreSQL 9.1.4 on x86_64-unknown-linux-gnu, compiled by
gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5.1) 4.4.3, 64-bit on Ubuntu 12.04 which
was installed using apt.

Here is the structure of my database tables

CREATE TABLE data
(
id bigserial NOT NULL,
datasetid integer NOT NULL,
readingdatetime timestamp without time zone NOT NULL,
depth double precision NOT NULL,
readingdatetime2 timestamp without time zone,
depth2 double precision,
value double precision NOT NULL,
uploaddatetime timestamp without time zone,
description character varying(255),
point geometry,
point2 geometry,
CONSTRAINT "DATAPRIMARYKEY" PRIMARY KEY (id ),
CONSTRAINT enforce_dims_point CHECK (st_ndims(point) = 2),
CONSTRAINT enforce_dims_point2 CHECK (st_ndims(point2) = 2),
CONSTRAINT enforce_geotype_point CHECK (geometrytype(point) =
'POINT'::text OR point IS NULL),
CONSTRAINT enforce_geotype_point2 CHECK (geometrytype(point2) =
'POINT'::text OR point2 IS NULL),
CONSTRAINT enforce_srid_point CHECK (st_srid(point) = 4326),
CONSTRAINT enforce_srid_point2 CHECK (st_srid(point2) = 4326)
);

CREATE INDEX data_datasetid_index ON data USING btree (datasetid );
CREATE INDEX data_point_index ON data USING gist (point );
CREATE INDEX "data_readingDatetime_index" ON data USING btree
(readingdatetime );
ALTER TABLE data CLUSTER ON "data_readingDatetime_index";

CREATE TABLE area
(
id serial NOT NULL,
"areaCode" character varying(10) NOT NULL,
country character varying(250) NOT NULL,
"polysetID" integer NOT NULL,
polygon geometry,
CONSTRAINT area_primary_key PRIMARY KEY (id ),
CONSTRAINT polyset_foreign_key FOREIGN KEY ("polysetID")
REFERENCES polyset (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT enforce_dims_area CHECK (st_ndims(polygon) = 2),
CONSTRAINT enforce_geotype_area CHECK (geometrytype(polygon) =
'POLYGON'::text OR polygon IS NULL),
CONSTRAINT enforce_srid_area CHECK (st_srid(polygon) = 4326)
);

CREATE INDEX area_polygon_index ON area USING gist (polygon );
CREATE INDEX "area_polysetID_index" ON area USING btree ("polysetID" );
ALTER TABLE area CLUSTER ON "area_polysetID_index";

CREATE TABLE data_area
(
data_id integer NOT NULL,
area_id integer NOT NULL,
CONSTRAINT data_area_pkey PRIMARY KEY (data_id , area_id ),
CONSTRAINT data_area_area_id_fk FOREIGN KEY (area_id)
REFERENCES area (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT data_area_data_id_fk FOREIGN KEY (data_id)
REFERENCES data (id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
);
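
For reference, a pre-calculation like the one described above could be done
along these lines. The table and column names match the schema, but this
exact statement is my guess at the approach, not necessarily what was
actually run:

```sql
-- Hypothetical sketch: populate the join table from the spatial intersect.
-- (The thread does not show the statement actually used.)
INSERT INTO data_area (data_id, area_id)
SELECT d.id, a.id
FROM data d
JOIN area a ON ST_Intersects(d.point, a.polygon);
```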

Here is the query I'm running; the result of its EXPLAIN can be found
here: http://explain.depesz.com/s/1yu

SELECT * FROM data d
JOIN data_area da ON da.data_id = d.id
LEFT JOIN area a ON da.area_id = a.id
WHERE d.datasetid IN (5634,5635,5636,5637,5638,5639,5640,5641,5642)
  AND da.area_id IN (1, 2, 3 .... 9999)
  AND (readingdatetime BETWEEN '1990-01-01' AND '2013-01-01')
  AND depth BETWEEN 0 AND 99999;

If you look at the explain, 97% of the time is spent on the index scan for
the JOIN of data_area.

Hardware

- CPU: Intel(R) Xeon(R) E5420 (8 cores)
- RAM: 16GB

Config Changes

I'm using the base Ubuntu config apart from the following changes

- shared_buffers set to 2GB
- work_mem set to 1GB
- maintenance_work_mem set to 512MB
- effective_cache_size set to 8GB

I think that covers everything; hopefully this has enough detail for someone
to be able to help. If there is anything I've missed, please let me know and
I'll add any more info needed. Any input on optimising the table structure,
the query, or anything else I can do to sort this issue would be most
appreciated.

Thanks in advance,

Mark Davidson
Kevin Grittner
2013-04-05 16:37:26 UTC
   CONSTRAINT data_area_pkey PRIMARY KEY (data_id , area_id ),
So the only index on this 250-million-row table starts with the ID
of the point, but you are joining to it by the ID of the area.
That requires a sequential scan of all 250 million rows.  Switch
the order of the columns in the primary key, add a unique index
with the columns switched, or add an index on just the area ID.

Perhaps you thought that the foreign key constraints would create
indexes?  (They don't.)
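
In SQL, the three options would look something like this (the index names
here are illustrative, not from the thread):

```sql
-- Option 1: switch the column order in the primary key
ALTER TABLE data_area DROP CONSTRAINT data_area_pkey;
ALTER TABLE data_area
  ADD CONSTRAINT data_area_pkey PRIMARY KEY (area_id, data_id);

-- Option 2: keep the existing PK, add a unique index with the order switched
CREATE UNIQUE INDEX data_area_area_data_index
  ON data_area (area_id, data_id);

-- Option 3: simply index the area ID on its own
CREATE INDEX data_area_area_id_index ON data_area (area_id);
```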

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Mark Davidson
2013-04-07 10:03:28 UTC
Hi Kevin

Thanks for your response. I tried what you suggested, so the table now has a
primary key of 'CONSTRAINT data_area_pkey PRIMARY KEY (area_id, data_id)'
and I've added the index 'CREATE INDEX data_area_data_id_index ON data_area
USING btree (data_id);'. Unfortunately it hasn't resulted in an improvement
in query performance. Here is the explain: http://explain.depesz.com/s/tDL
I think there is no performance increase because it's now not using the
primary key, just the index on data_id. Have I done what you suggested
correctly? Any other suggestions?

Thanks very much for your help,

Mark
Greg Williamson
2013-04-07 10:21:20 UTC
Did you run ANALYZE on the table after creating the index?

GW
Kevin Grittner
2013-04-07 15:15:42 UTC
Post by Greg Williamson
Post by Mark Davidson
Thanks for your response. I tried doing what you suggested so
that table now has a primary key of
' CONSTRAINT data_area_pkey PRIMARY KEY(area_id , data_id ); '
and I've added the INDEX of
'CREATE INDEX data_area_data_id_index ON data_area USING btree (data_id );'
Yeah, that is what I was suggesting.
Post by Greg Williamson
Post by Mark Davidson
unfortunately it hasn't resulted in an improvement of the query
performance.
Did you run analyze on the table after creating the index ?
That probably isn't necessary.  Statistics are normally on relations
and columns; there are only certain special cases where an ANALYZE
is needed after an index build, like if the index is on an
expression rather than a list of columns.
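
As a purely illustrative example of such a case (this index is not part of
Mark's schema), an index on an expression gets its own statistics, which a
subsequent ANALYZE collects:

```sql
-- Index on an expression rather than a plain column list:
CREATE INDEX data_reading_day_index
  ON data (date_trunc('day', readingdatetime));
-- ANALYZE then gathers statistics for the indexed expression.
ANALYZE data;
```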

Mark, what happens if you change that left join to a normal (inner)
join?  Since you're doing an inner join to data_area and that has a
foreign key to area, there should always be a match anyway, right?
The optimizer doesn't recognize that, so it can't start from the
area and just match to the appropriate points.
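
That is, the same query as before with only the join type changed:

```sql
SELECT *
FROM data d
JOIN data_area da ON da.data_id = d.id
JOIN area a ON da.area_id = a.id   -- was: LEFT JOIN area a
WHERE d.datasetid IN (5634,5635,5636,5637,5638,5639,5640,5641,5642)
  AND da.area_id IN (1, 2, 3 .... 9999)
  AND readingdatetime BETWEEN '1990-01-01' AND '2013-01-01'
  AND depth BETWEEN 0 AND 99999;
```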
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Mark Davidson
2013-04-07 22:22:32 UTC
Takes a little longer with the INNER join unfortunately, about ~3.5
minutes; here is the query plan: http://explain.depesz.com/s/EgBl

With the JOIN there might not be a match, if the data does not fall within
one of the areas selected in the IN clause. So, just to be absolutely clear:
if we have data id 10 that falls in areas (1, 5, 8, 167) but the user is
querying areas (200 ... 500), no match in area would be found.

Is it worth considering adding additional statistics on any of the columns?
And/or additional indexes, or a different type of index? Would it be worth
restructuring the query to start with the areas and join the data to that?
Mark Davidson
2013-04-08 17:02:59 UTC
Been trying to progress with this today. I decided to set up the database on
my local machine to try a few things, and I'm getting much more sensible
results and a totally different query plan:
http://explain.depesz.com/s/KGd. In this case the query took about a
minute, though it sometimes takes around 80 seconds.

The config is exactly the same between the two databases, and the databases
themselves are identical, with all the same indexes on the tables.

The server has 2 x Intel Xeon E5420 CPUs running at 2.5GHz, 16GB RAM, and
the database is just on a SATA HDD, a Western Digital WD5000AAKS.
My desktop has a single i5-3570K running at 3.4GHz, 16GB RAM, and the
database is on a SATA HDD, a Western Digital WD1002FAEX-0

Could anyone offer any reasoning as to why the plan would be so different
across the two machines? I would have thought that the server would perform
a lot better since it has more cores, or is Postgres more affected by CPU
speed? Could anyone suggest a way to benchmark the machines for their
Postgres performance?

Thanks again for everyone's input,

Mark
Vasilis Ventirozos
2013-04-08 17:19:21 UTC
Hello Mark,
PostgreSQL currently doesn't support parallel query, so a faster CPU, even
one with fewer cores, will be faster for a single query. For benchmarking
you can try pgbench, which you will find in contrib.
The execution plan may be different because of different statistics: did
you ANALYZE both databases when you compared the execution plans?
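
For example, something along these lines against a scratch database (the
database name "bench" is just an example):

```shell
# Create and populate the standard pgbench tables in database "bench"
pgbench -i bench
# Run the TPC-B-like test: 10 clients, 10000 transactions each
pgbench -c 10 -t 10000 bench
```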

Vasilis Ventirozos


Mark Davidson
2013-04-08 19:31:02 UTC
Thanks for your response, Vasilis. I've run pgbench on both machines with
`./pgbench -c 10 -t 10000 pgbench`, getting 99.800650 tps on my local
machine and 23.825332 tps on the server, so quite a significant difference.
Could this purely be down to the CPU clock speed, or is something else
likely causing the issue?
I have run ANALYZE on both databases and tried the queries a number of
times on each to make sure the results are consistent; they are.
Vasilis Ventirozos
2013-04-08 20:02:58 UTC
-c 10 means 10 clients, so that should take advantage of all your cores
(see below):

%Cpu0 : 39.3 us, 21.1 sy, 0.0 ni, 38.7 id, 0.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 38.0 us, 25.0 sy, 0.0 ni, 26.0 id, 4.2 wa, 0.0 hi, 6.8 si, 0.0 st
%Cpu2 : 39.3 us, 20.4 sy, 0.0 ni, 39.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 40.0 us, 18.7 sy, 0.0 ni, 40.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 13.9 us, 7.1 sy, 0.0 ni, 79.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 13.1 us, 8.4 sy, 0.0 ni, 78.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 14.8 us, 6.4 sy, 0.0 ni, 78.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 15.7 us, 6.7 sy, 0.0 ni, 77.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

I am pasting the results of the same test on an i7-2600 with 16GB and a
SATA3 SSD, and the results from a VM with 2 cores and a normal 7200 rpm HDD.

-- DESKTOP
***@Disorder ~ $ pgbench -c 10 -t 10000 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1713.338111 (including connections establishing)
tps = 1713.948478 (excluding connections establishing)

-- VM

***@pglab1:~/postgresql-9.2.4/contrib/pgbench$ ./pgbench -c 10 -t
10000 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1118.976496 (including connections establishing)
tps = 1119.180126 (excluding connections establishing)

I am assuming that you didn't populate your pgbench db with the default
values; if you tell me how you did, I will be happy to re-run the test and
see the differences.
Mark Davidson
2013-04-08 20:18:52 UTC
Wow, my results are absolutely appalling compared to both of those, which is
really interesting. Are you running Postgres 9.2.4 on both instances? Any
specific configuration changes?
There must be something up with my setup for me to be getting such a low
tps compared with you.
Post by Vasilis Ventirozos
-c 10 means 10 clients so that should take advantage of all your cores
(see bellow)
%Cpu0 : 39.3 us, 21.1 sy, 0.0 ni, 38.7 id, 0.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 38.0 us, 25.0 sy, 0.0 ni, 26.0 id, 4.2 wa, 0.0 hi, 6.8 si, 0.0 st
%Cpu2 : 39.3 us, 20.4 sy, 0.0 ni, 39.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 40.0 us, 18.7 sy, 0.0 ni, 40.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 13.9 us, 7.1 sy, 0.0 ni, 79.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 13.1 us, 8.4 sy, 0.0 ni, 78.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 14.8 us, 6.4 sy, 0.0 ni, 78.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 15.7 us, 6.7 sy, 0.0 ni, 77.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
i am pasting you the results of the same test on a i7-2600 16gb with a
sata3 SSD and the results from a VM with 2 cores and a normal 7200 rpm hdd
-- DESKTOP
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1713.338111 (including connections establishing)
tps = 1713.948478 (excluding connections establishing)
-- VM
10000 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1118.976496 (including connections establishing)
tps = 1119.180126 (excluding connections establishing)
i am assuming that you didn't populate your pgbench db with the default
values , if you tell me how you did i will be happy to re run the test and
see the differences.
Post by Mark Davidson
Thanks for your response Vasillis. I've run pgbench on both machines
`./pgbench -c 10 -t 10000 pgbench` getting 99.800650 tps on my local
machine and 23.825332 tps on the server so quite a significant difference.
Could this purely be down to the CPU clock speed or is it likely
something else causing the issue?
I have run ANALYZE on both databases and tried the queries a number of
times on each to make sure the results are consistent, this is the case.
Post by Vasilis Ventirozos
Hello Mark,
PostgreSQL currently doesn't support parallel query so a faster cpu even
if it has less cores would be faster for a single query, about benchmarking
you can try pgbench that you will find in the contrib,
the execution plan may be different because of different statistics,
have you analyzed both databases when you compared the execution plans ?
Vasilis Ventirozos
Been trying to progress with this today. Decided to setup the database
on my local machine to try a few things and I'm getting much more sensible
results and a totally different query plan
http://explain.depesz.com/s/KGd in this case the query took about a
minute but does sometimes take around 80 seconds.
The config is exactly the same between the two database. The databases
them selves are identical with all indexes the same on the tables.
The server has an 2 x Intel Xeon E5420 running at 2.5Ghz each, 16GB RAM
and the database is just on a SATA HDD which is a Western Digital
WD5000AAKS.
My desktop has a single i5-3570K running at 3.4Ghz, 16GB RAM and the
database is running on a SATA HDD which is a Western Digital WD1002FAEX-0
Could anyone offer any reasoning as to why the plan would be so
different across the two machines? I would have thought that the server
would perform a lot better since it has more cores or is postgres more
affected by the CPU speed? Could anyone suggest a way to bench mark the
machines for their postgres performance?
Thanks again for everyones input,
Mark
Post by Mark Davidson
Takes a little longer with the INNER join, unfortunately: about
3.5 minutes. Here is the query plan http://explain.depesz.com/s/EgBl.
With the JOIN there might not be a match if the data does not fall
within one of the areas selected in the IN query.
So if data id (10) falls in areas ( 1, 5, 8, 167 )
but the user is querying areas ( 200 ... 500 ), no match in area
would be found, just to be absolutely clear.
Is it worth considering adding additional statistics on any of the
columns, and/or additional indexes or a different type of index? Would it
be worth restructuring the query to start with areas and work to join data
to that?
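As a sketch of the "start with areas" idea (hypothetical: the IN list, dataset id, and date range below are placeholders, with table and column names taken from the schema posted upthread; the actual query was never shown in full):

```sql
-- Drive the query from the small area table (~3,500 rows) and fan out
-- through the pre-computed data_area join table, instead of starting
-- from the ~55M-row data table.
SELECT d.*
FROM area a
JOIN data_area da ON da.area_id = a.id
JOIN data d       ON d.id = da.data_id
WHERE a.id IN (200, 201, 202)
  AND d.datasetid = 1
  AND d.readingdatetime BETWEEN '2013-01-01' AND '2013-02-01';
```

With up-to-date statistics the planner should pick this order by itself, but writing the query this way makes the intent explicit when comparing plans.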
Post by Mark Davidson
Post by Greg Williamson
Post by Mark Davidson
Thanks for your response. I tried doing what you suggested, so
that table now has a primary key of
'CONSTRAINT data_area_pkey PRIMARY KEY (area_id, data_id);'
and I've added the INDEX
'CREATE INDEX data_area_data_id_index ON data_area USING btree
(data_id);'
Yeah, that is what I was suggesting.
Post by Greg Williamson
Post by Mark Davidson
unfortunately it hasn't resulted in an improvement of the query
performance.
Did you run ANALYZE on the table after creating the index?
That probably isn't necessary. Statistics are normally on relations
and columns; there are only certain special cases where an ANALYZE
is needed after an index build, like if the index is on an
expression rather than a list of columns.
Mark, what happens if you change that left join to a normal (inner)
join? Since you're doing an inner join to data_area and that has a
foreign key to area, there should always be a match anyway, right?
The optimizer doesn't recognize that, so it can't start from the
area and just match to the appropriate points.
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
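To make the suggested join change concrete, the two shapes might look roughly like this (a sketch only; the full original query was not posted in this part of the thread, and the column names are guessed from the schema given earlier):

```sql
-- LEFT JOIN shape: the planner must keep every data_area row even when
-- no area matches, which limits how it can reorder the joins.
SELECT d.*
FROM data d
JOIN data_area da ON da.data_id = d.id
LEFT JOIN area a  ON a.id = da.area_id
WHERE da.area_id IN (200, 201, 202);

-- INNER JOIN shape: the foreign key from data_area.area_id to area.id
-- guarantees a matching row, so this is equivalent, and it frees the
-- planner to start from area and work outward.
SELECT d.*
FROM data d
JOIN data_area da ON da.data_id = d.id
JOIN area a       ON a.id = da.area_id
WHERE a.id IN (200, 201, 202);
```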
Vasilis Ventirozos
2013-04-08 20:28:08 UTC
Permalink
Post by Mark Davidson
Wow my results are absolutely appalling compared to both of those which is
really interesting. Are you running postgres 9.2.4 on both instances? Any
specific configuration changes?
Thinking there must be something up with my setup to be getting such a low
tps compared with you.
Both installations are 9.2.4 and both DBs have absolutely default
configurations. I can't really explain why there is so much difference
between our results; I can only imagine it's the initialization. That's why I
asked how you populated your pgbench database (scale factor / fill factor).

Vasilis Ventirozos
Post by Mark Davidson
Post by Vasilis Ventirozos
-c 10 means 10 clients, so that should take advantage of all your cores
(see below)
%Cpu0 : 39.3 us, 21.1 sy, 0.0 ni, 38.7 id, 0.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 38.0 us, 25.0 sy, 0.0 ni, 26.0 id, 4.2 wa, 0.0 hi, 6.8 si, 0.0 st
%Cpu2 : 39.3 us, 20.4 sy, 0.0 ni, 39.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 40.0 us, 18.7 sy, 0.0 ni, 40.0 id, 1.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 13.9 us, 7.1 sy, 0.0 ni, 79.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 13.1 us, 8.4 sy, 0.0 ni, 78.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 14.8 us, 6.4 sy, 0.0 ni, 78.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 15.7 us, 6.7 sy, 0.0 ni, 77.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
I am pasting the results of the same test on an i7-2600 with 16GB and a
SATA3 SSD, and the results from a VM with 2 cores and a normal 7200rpm HDD
-- DESKTOP
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1713.338111 (including connections establishing)
tps = 1713.948478 (excluding connections establishing)
-- VM
10000 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
tps = 1118.976496 (including connections establishing)
tps = 1119.180126 (excluding connections establishing)
I am assuming that you didn't populate your pgbench db with the default
values; if you tell me how you did, I will be happy to re-run the test and
see the differences.
Post by Mark Davidson
Thanks for your response, Vasilis. I've run pgbench on both machines with
`./pgbench -c 10 -t 10000 pgbench`, getting 99.800650 tps on my local
machine and 23.825332 tps on the server, so quite a significant difference.
Could this purely be down to the CPU clock speed, or is it likely
something else causing the issue?
I have run ANALYZE on both databases and tried the queries a number of
times on each to make sure the results are consistent, which is the case.
Mark Davidson
2013-04-08 20:57:04 UTC
Permalink
Sorry Vasilis, I missed you asking that. I just did './pgbench -i pgbench'
and didn't specifically set any values. I can try some specific ones if you
can suggest any.
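For reference, a non-default initialization along the lines Vasilis hints at might look like this (a sketch only; scale factor 100 is an arbitrary example, not a value recommended anywhere in the thread):

```shell
# Initialize pgbench at scale factor 100: ~10,000,000 rows in
# pgbench_accounts, roughly 1.5GB on disk, so results depend less on
# everything fitting in cache.
./pgbench -i -s 100 pgbench

# Re-run the same read-write test as before:
./pgbench -c 10 -t 10000 pgbench

# And a select-only run, which stresses reads rather than WAL fsyncs:
./pgbench -c 10 -t 10000 -S pgbench
```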
Jeff Janes
2013-04-08 20:39:23 UTC
Permalink
Post by Mark Davidson
Thanks for your response Vasillis. I've run pgbench on both machines
`./pgbench -c 10 -t 10000 pgbench` getting 99.800650 tps on my local
machine and 23.825332 tps on the server so quite a significant difference.
These results are almost certainly being driven by how fast your machines
can fsync the WAL data. The type of query you originally posted does not
care about that at all, so these results are not useful to you. You could
run "pgbench -S", which gets closer to the nature of the work
your original query does (but still not all that close).

Cheers,

Jeff
Mark Davidson
2013-04-08 21:01:03 UTC
Permalink
Hi Jeff,

I've tried this test using the -S flag: './pgbench -c 4 -j 2 -T 600 -S
pgbench'

Desktop gives me

./pgbench -c 4 -j 2 -T 600 -S pgbench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 2
duration: 600 s
number of transactions actually processed: 35261835
tps = 58769.715695 (including connections establishing)
tps = 58770.258977 (excluding connections establishing)

Server

./pgbench -c 4 -j 2 -T 600 -S pgbench
starting vacuum...end.
transaction type: SELECT only
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 2
duration: 600 s
number of transactions actually processed: 22642303
tps = 37737.157641 (including connections establishing)
tps = 37738.167325 (excluding connections establishing)