Steven Crandell
2013-03-01 09:52:02 UTC
Recently I moved my ~600GB / ~15K TPS database from a
48 ***@2.0GHz server with 512GB of RAM on 15K RPM disks
to a newer server with
64 ***@2.2GHz and 1TB of RAM on 15K RPM disks.
The move was from v9.1.4 to v9.1.8 (we eventually also tested v9.1.4 on
the new hardware) and was done via base backup followed by slave promotion.
All postgres configurations were matched exactly, as were system and kernel
parameters.
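For what it's worth, the settings comparison was done roughly along these
lines (hostnames here are just placeholders):

  # dump every runtime setting from each server and diff the results
  psql -h old-db -Atc "SELECT name, setting FROM pg_settings ORDER BY name" > old.settings
  psql -h new-db -Atc "SELECT name, setting FROM pg_settings ORDER BY name" > new.settings
  diff old.settings new.settings
  # kernel side: compare the full sysctl dump from both machines
  diff <(ssh old-db sysctl -a | sort) <(ssh new-db sysctl -a | sort)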
On the first day that this server saw production load levels, it absolutely
fell on its face. We ran an exhaustive battery of tests, including failing
over to the new (hardware-matched) slave, only to find the problem happening
there as well.
After several engineers all confirmed that every postgres and system
setting matched, we eventually migrated back onto the original hardware
using exactly the same methods and settings that had been used while the
data was on the new hardware. As soon as we brought the DB live on the
older (supposedly slower) hardware, everything started running smoothly
again.
As far as we were able to gather in the frantic moments of downtime,
hundreds of queries were hanging while trying to COMMIT. This in turn
caused new queries to back up as they waited for locks, and so on.
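What we were watching during the incident was roughly the following (from
memory, so treat it as a sketch; column names are the 9.1 ones):

  # group backends by what they are running and whether they wait on a lock
  psql -c "SELECT waiting, current_query, count(*)
           FROM pg_stat_activity
           GROUP BY 1, 2
           ORDER BY 3 DESC;"
  # during the event the top rows were hundreds of backends sitting in COMMIT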
Prior to failing back to the original hardware, we found interesting posts
from people who had run into problems similar to ours because of NUMA, and
several of them reported that they had solved the problem by setting
vm.zone_reclaim_mode = 0
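Changing it is just a sysctl tweak, roughly:

  # check the current value (non-zero means zone reclaim is enabled)
  sysctl vm.zone_reclaim_mode
  # disable it for the running kernel ...
  sysctl -w vm.zone_reclaim_mode=0
  # ... and persist it across reboots
  echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf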
Unfortunately we experienced the exact same problems even after turning
zone_reclaim_mode off. We did extensive testing of the I/O on the new
hardware (both the data and log arrays) before it was put into service and
have done even more comprehensive testing since it came out of service.
The short version is that the disks on the new hardware are faster than the
disks on the old server. In one test run we even set the server to write
WALs to shared memory instead of to the log LV, just to help rule out I/O
problems, and saw only a marginal improvement in overall TPS numbers.
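That test was nothing fancy -- roughly the following, with the paths being
placeholders for our actual layout:

  # stop postgres, move pg_xlog aside, and put a tmpfs in its place
  pg_ctl -D /pgdata stop
  mv /pgdata/pg_xlog /pgdata/pg_xlog.disk
  mkdir /pgdata/pg_xlog
  mount -t tmpfs -o size=16g tmpfs /pgdata/pg_xlog
  cp -a /pgdata/pg_xlog.disk/. /pgdata/pg_xlog/
  pg_ctl -D /pgdata start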
At this point we are extremely confident that if we have a configuration
problem, it is not with any of the usual postgresql.conf/sysctl.conf
suspects. We are pretty sure that the problem is being caused by the
hardware in some way, but that it is not the result of a hardware failure
(e.g. a degraded array, failed RAID card self-tests, or what have you).
Given that we're dealing with new hardware and the fact that this still
acts a lot like a NUMA issue, are there other settings we should be
adjusting to deal with possible performance problems associated with NUMA?
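To illustrate the sort of thing we mean, these are examples of the
NUMA-side levers we're aware of (beyond zone_reclaim_mode), not things
we've tuned yet:

  # show NUMA topology and how much memory sits on each node
  numactl --hardware
  # per-node counters; a growing numa_miss count means allocations are
  # landing off their preferred node
  numastat
  # start postgres with memory interleaved across all nodes
  numactl --interleave=all pg_ctl -D /pgdata start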
Does this sound like something else entirely?
Any thoughts appreciated.
thanks,
Steve