Discussion:
Very slow UPDATE statement on 40 million rows
Florian Schröck
2013-02-15 14:30:55 UTC
Hi!
I'm new to this mailing list and new to Postgres as well. This is about
our own backup software (Java): we switched the database from MySQL to
PostgreSQL and need some help.
The backup database holds all files from the server. On my testing
platform there are about 40 million rows, and the database grew to
about 70 GB (the same database was about 11 GB with MySQL, but that is
another issue).

Before each backup job, there is a reset to get a consistent state of
all files:
UPDATE BackupFiles
   SET nTapeNr=0, nAFIOCounter=0, nBlockCounter=0,
       cStatus='NEW'::StatusT, bOnSetBlue=false, bOnSetYellow=false,
       nLastBackupTS='0001-01-01 00:00:00'
 WHERE cStatus='NEW' OR cStatus='WRITING' OR cStatus='ONTAPE';

Explain analyze: http://explain.depesz.com/s/8y5
The statement takes 60-90 minutes. In MySQL the same statement takes
about 2 minutes on the same hardware.
I tried to optimize the settings, but so far without success.

Can we optimize this update statement somehow? Do you have any other ideas?

Any help is appreciated! Thank you!

*My current config:*
shared_buffers = 2GB
work_mem = 16MB
wal_buffers = 16MB
checkpoint_segments = 45
random_page_cost = 2.0
effective_cache_size = 6GB

*HW:*
2x Intel E5405 @ 2.00GHz
8 GB RAM
3ware 9650SE-16ML RAID Controller, all caches enabled
DB is on a RAID6 with 14x 1TB (since the DB partition on the RAID1 was
too small)

*SW:*
Debian Squeeze 6.0 with Kernel 3.5.4
Postgres 8.4.13 (standard Debian package)

*Table:*
+---------------+-----------------------------+----------------------------------------------------------------------+
| Column        | Type                        | Modifiers                                                            |
+---------------+-----------------------------+----------------------------------------------------------------------+
| _rowid        | bigint                      | not null default nextval('backupfiles__rowid_seq'::regclass)         |
| cfilename     | bytea                       | not null                                                             |
| nfilesize     | bigint                      | not null default 0::bigint                                           |
| nfilectimets  | timestamp without time zone | not null default '1970-01-01 00:00:00'::timestamp without time zone  |
| ntapenr       | integer                     | not null default 0                                                   |
| nafiocounter  | bigint                      | not null default 0::bigint                                           |
| nblockcounter | bigint                      | not null default 0::bigint                                           |
| cstatus       | statust                     | not null default 'NEW'::statust                                      |
| bonsetblue    | boolean                     | default false                                                        |
| bonsetyellow  | boolean                     | default false                                                        |
| nlastbackupts | timestamp without time zone | not null default '1970-01-01 00:00:00'::timestamp without time zone  |
+---------------+-----------------------------+----------------------------------------------------------------------+
Indexes:
    "backupfiles_pkey" PRIMARY KEY, btree (_rowid)
    "cfilename_index" btree (cfilename)
    "cstatus_index" btree (cstatus)
    "nfilectimets_index" btree (nfilectimets)
    "ntapenr_index" btree (ntapenr)


*Example row:*
-[ RECORD 1 ]-+----------------------------
_rowid        | 1
cfilename     | /dicom/log/datadir_init.log
nfilesize     | 1790
nfilectimets  | 2013-01-30 14:02:48
ntapenr       | 0
nafiocounter  | 0
nblockcounter | 0
cstatus       | NEW
bonsetblue    | f
bonsetyellow  | f
nlastbackupts | 0001-01-01 00:00:00
--
Mit freundlichen Grüßen
Best regards

Florian Schröck
IT Services

aycan Digitalsysteme GmbH
Innere Aumühlstr. 5
97076 Würzburg . Germany
Tel. +49 (0)9 31. 270 40 88
Fax +49 (0)9 31. 270 40 89
mailto:***@aycan.de
http://www.aycan.de

Geschäftsführer: Dipl.-Ing. Stephan Popp
Sitz der Gesellschaft: Würzburg
Eingetragen beim Amtsgericht Würzburg unter HRB 6043
Ust-Id Nr. DE 190658226

Kevin Grittner
2013-02-15 14:59:17 UTC
Post by Florian Schröck
UPDATE BackupFiles
   SET nTapeNr=0, nAFIOCounter=0, nBlockCounter=0,
       cStatus='NEW'::StatusT, bOnSetBlue=false,
       bOnSetYellow=false, nLastBackupTS= '0001-01-01 00:00:00'
   WHERE cStatus='NEW' OR cStatus='WRITING' OR cStatus='ONTAPE';
Explain analyze: http://explain.depesz.com/s/8y5
The statement takes 60-90 minutes.
The EXPLAIN ANALYZE at that URL shows a runtime of 3 minutes and 41
seconds.
Post by Florian Schröck
I tried to optimize the settings but until now without success.
Can we optimize this update statement somehow? Do you have any
other ideas?
Are there any rows which would already have the values that you are
setting?  If so, it would be faster to skip those by using this
query:

UPDATE BackupFiles
  SET nTapeNr=0, nAFIOCounter=0, nBlockCounter=0,
      cStatus='NEW'::StatusT, bOnSetBlue=false,
      bOnSetYellow=false, nLastBackupTS= '0001-01-01 00:00:00'
  WHERE (cStatus='NEW' OR cStatus='WRITING' OR cStatus='ONTAPE')
    AND (nTapeNr <> 0 OR nAFIOCounter <> 0 OR nBlockCounter <> 0
         OR cStatus <> 'NEW'::StatusT
         OR bOnSetBlue IS DISTINCT FROM false
         OR bOnSetYellow IS DISTINCT FROM false
         OR nLastBackupTS <> '0001-01-01 00:00:00');

Another way to accomplish this is with the
suppress_redundant_updates_trigger() trigger function:

http://www.postgresql.org/docs/9.2/interactive/functions-trigger.html
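As a minimal sketch of that approach (the trigger name is illustrative;
the docs suggest a name that sorts last alphabetically so it fires after
any other per-row triggers):

```sql
-- Attach the built-in trigger so UPDATEs that would leave a row
-- unchanged are silently skipped instead of writing a new row version.
CREATE TRIGGER z_suppress_redundant
    BEFORE UPDATE ON BackupFiles
    FOR EACH ROW
    EXECUTE PROCEDURE suppress_redundant_updates_trigger();
```

Note this still scans and re-checks every matched row on each UPDATE; it
only avoids the write for rows that would not change.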
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-performance mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Florian Schröck
2013-02-15 15:32:10 UTC
Hello Kevin,
skipping the rows that don't need the update solved the problem!
Your query took only 1 minute. :)

Thank you so much for the fast response, have a great weekend!

PS: When you switch to "TEXT" on the explain URL you can see the final
runtime, which was 66 minutes with the original statement.

Best regards,
Florian
Merlin Moncure
2013-02-19 20:04:22 UTC
Post by Florian Schröck
Hello Kevin,
not updating every row which doesn't need the update solved the problem!
Your query took only 1 minute. :)
Thank you so much for the fast response, have a great weekend!
PS: When you switch to "TEXT" on the explain URL you can see the final
runtime which was 66 minutes with the original statement.
Best regards,
Florian
If the number of rows you actually update is not very large relative
to the size of the table, just for fun, try this:


CREATE OR REPLACE FUNCTION BakupFilesCandidateReset(BackupFiles)
RETURNS BOOL AS
$$
  SELECT ($1).cStatus IN ('NEW', 'WRITING', 'ONTAPE')
         AND
         (($1).nTapeNr, ($1).nAFIOCounter, ($1).nBlockCounter,
          ($1).cStatus, ($1).bOnSetBlue, ($1).bOnSetYellow,
          ($1).nLastBackupTS)
         IS DISTINCT FROM  /* simple != will suffice if values are never null */
         (0, 0, 0, 'NEW'::StatusT, false, false, '0001-01-01 00:00:00');
$$ LANGUAGE SQL IMMUTABLE;

CREATE INDEX ON BackupFiles (BakupFilesCandidateReset(BackupFiles))
    WHERE BakupFilesCandidateReset(BackupFiles);

SELECT * FROM BackupFiles WHERE BakupFilesCandidateReset(BackupFiles);
UPDATE BackupFiles SET ... WHERE BakupFilesCandidateReset(BackupFiles);
etc.

The idea here is to maintain a partial boolean index representing the
candidate records to update. Plus, it's nifty. This is the basic
mechanism that can serve as the foundation for very fast push/pull queues.
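A usage sketch, reusing the function and index names from the snippet
above: the planner will only consider the partial index when the query's
WHERE clause matches the index predicate, so the reset should repeat the
predicate verbatim and becomes an index scan over just the candidate rows.

```sql
-- The WHERE clause must match the partial-index predicate exactly;
-- only rows flagged as candidates are visited.
EXPLAIN ANALYZE
UPDATE BackupFiles
   SET nTapeNr=0, nAFIOCounter=0, nBlockCounter=0,
       cStatus='NEW'::StatusT, bOnSetBlue=false, bOnSetYellow=false,
       nLastBackupTS='0001-01-01 00:00:00'
 WHERE BakupFilesCandidateReset(BackupFiles);
```

The index shrinks as rows are reset, since updated rows no longer
satisfy the predicate and drop out of the partial index.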

merlin
Florian Schröck
2013-02-26 08:01:14 UTC
Hello Merlin,
thanks for the feedback. I forwarded this to my developer; it is an
interesting approach.
--
Best regards

Florian Schröck