Taking our cue from Peter Zaitsev’s article on MySQL Performance Bottlenecks, I’m going to talk a little about the PostgreSQL perspective of removing bottlenecks caused by high traffic.

Many stacks these days are implemented by trusting the Object Relational Mapper (ORM) to do the right thing with PostgreSQL while critical business logic is built on the application server side. For the most part this works out quite well, but over time one must revisit the database server as the system scales. As reliable as PostgreSQL is, it can experience slowdowns as traffic increases.

There are many ways of addressing performance bottlenecks, but for the sake of discussion here are several ways we can look at this issue:

  • Tuning Performance Parameters
  • Session Connections
  • Autovacuum: Basic
  • Autovacuum: Advanced
  • Bloat
  • Data Hotspots
  • Competing Application Processes
  • Replication Latency
  • Server Environment

About Monitoring and Statistics

Before we get too deep into the weeds, a word about monitoring: contrary to what some people believe, all those metrics, values, and observations are not pulled out of thin air. A rigorous approach to metrics monitoring is essential for identifying bottlenecks and their remedies, and all the best advice in the world means nothing without context. To facilitate the process we’ve introduced a new tool called pg_stat_monitor.

Think of it as pg_stat_statements on steroids: it offers more granularity and, most significantly, introduces the concept of buckets, otherwise known as moving averages.

A more detailed blog post dedicated exclusively to this new technology is forthcoming. In the meantime, here is an abridged set of instructions for installing the extension into your database.

Method 1: The Percona Distribution For PostgreSQL
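A minimal sketch, assuming a Red Hat-compatible host and version 13 of the Percona Distribution for PostgreSQL; package names vary by distribution and major version:

    # enable the Percona repositories and the PostgreSQL 13 distribution
    sudo yum install https://repo.percona.com/yum/percona-release-latest.noarch.rpm
    sudo percona-release setup ppg-13

    # install the server packages along with the pg_stat_monitor extension
    sudo yum install percona-postgresql13-server percona-pg-stat-monitor13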


Method 2: Compile and Install (community postgres repository)
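A sketch assuming the community PostgreSQL packages and development headers are already installed and pg_config is on the PATH:

    # build the extension against the installed server using PGXS
    git clone https://github.com/percona/pg_stat_monitor.git
    cd pg_stat_monitor
    make USE_PGXS=1
    sudo make USE_PGXS=1 install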

Creating The Extension
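The extension loads via shared_preload_libraries, so the library must be registered first; the CREATE EXTENSION call only succeeds after the restart described in the next step:

    -- register the library so it is loaded at server start
    ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_monitor';

    -- then, once the server has been restarted, in the target database:
    CREATE EXTENSION pg_stat_monitor;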

Restart The Server
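Assuming a systemd-managed host; the service name differs between packagings (e.g. postgresql-13 on RHEL-style systems, postgresql@13-main on Debian/Ubuntu), and the database name below is a placeholder:

    sudo systemctl restart postgresql-13

    # quick sanity check that the statistics view is available
    psql -d mydb -c "SELECT count(*) FROM pg_stat_monitor;"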

About “Complexity” and “Potential Impact”

Complexity refers to the level of difficulty in implementing a particular solution, whereas potential impact gives you an idea of how much the mitigation could ideally improve your system’s performance. Sometimes, because of a system’s age, its type, its technical debt, and so on, accurately describing the complexity and the potential impact can be problematic. At the end of it all, given highly complex environments, it’s your judgment that makes the final call.

Categories

  • Complexity
    • Low
    • Medium
    • High
    • Low-Medium-High
  • Potential Impact
    • Low
    • Medium
    • High
    • Low-Medium-High

Tuning Performance Parameters

Complexity: Low
Potential Impact: High

There was a time, not too long ago, when a modern version of postgres could still run on an i386. Although the default settings have since been updated, the parameters are still set to use the least amount of resources upon installation.

These settings are the easiest to set and are normally updated when the service is first installed. Not adjusting these values can result in high CPU and IO:

  • effective_cache_size ~ 50 to 75% of total system RAM
  • shared_buffers ~ 1/4 – 1/3 total system RAM
  • work_mem ~ 10MB
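As an illustration only, here is what those starting points might look like in postgresql.conf on a hypothetical dedicated 16 GB host; your own values should come from the measurements discussed below:

    # postgresql.conf -- illustrative values for a dedicated 16 GB server
    effective_cache_size = 12GB     # roughly 75% of total system RAM
    shared_buffers       = 4GB      # roughly 25% of total system RAM
    work_mem             = 10MB     # applied per sort/hash operation, per session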

So let’s talk about these variables.

The recommended range for effective_cache_size is a typical starting point; it can be set more precisely by referring to “top”, i.e. the sum of free and cached RAM.

Setting the shared buffers is an interesting puzzle. There are two ways of looking at this setting: if you have a small database, you can set shared_buffers high enough that the database essentially becomes RAM-resident. Otherwise, you size it so that the tables and indexes used most often by the service remain in RAM (the old 80/20 rule). One-third of system RAM used to be the recommended setting, but over time it was dropped to one-quarter as machines acquired more RAM, because there is such a thing as assigning too much to shared buffers. Too little RAM means more CPU work and higher IO; you’ll know the shared buffer setting is too high when CPU load and IO performance reach a plateau.


Another factor to consider is the OS cache; given enough RAM, Linux will cache tables and indexes and can, depending upon various settings, make PostgreSQL believe it is reading from disk when it is actually reading from RAM. Performance improves, but at the expense of redundancy: the same pages often end up both in the shared buffers and in the OS cache, which is another reason to avoid making the shared buffer cache too large. For the perfectionists among us, take a look at the pg_buffercache extension, which reports cache usage in real time (TIP: look at its pg_buffercache view).
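For instance, a query along these lines (adapted from the pg_buffercache documentation) shows which relations in the current database occupy the most shared buffers:

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- top ten relations by number of cached 8kB buffers
    SELECT c.relname, count(*) AS buffers
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
    WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                                WHERE datname = current_database()))
    GROUP BY c.relname
    ORDER BY buffers DESC
    LIMIT 10;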

Setting work_mem too low guarantees poor performance because sorts are processed as temporary files on disk. On the other hand, although setting it high doesn’t hurt performance by itself, it does risk starving the server of RAM if too many connections are active at any one time: a chunk of RAM up to work_mem can be consumed by each and every sort operation. You’ll need to do a bit of arithmetic, counting the times RAM is claimed by each query and session. TIP: use EXPLAIN ANALYZE to see where sort operations are carried out; by varying the value within a given session you can see exactly when a sort spills to disk.
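A sketch of that technique, using a hypothetical table t; the plan output shown in the comments is illustrative only:

    -- lower work_mem for this session only and observe the sort method
    SET work_mem = '1MB';
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t ORDER BY created_at;
    --   Sort Method: external merge  Disk: 52112kB    <- spilled to disk

    SET work_mem = '128MB';
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t ORDER BY created_at;
    --   Sort Method: quicksort  Memory: 41563kB       <- stayed in RAM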

One can learn quite a bit by benchmarking the system at discrete settings of these variables.

Session Connections: Managing

Complexity: Low
Potential Impact: Low-Medium-High

High traffic is often typified by large numbers of connections made over short intervals of time. Too many connections block processes, delay query responses, and can even cause session errors. Without exception handling, the root cause may not be easy to determine without digging into the postgres logs.

An easy fix is increasing the number of connections:

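For example (the value is illustrative; each additional slot consumes memory and, potentially, work_mem):

    -- written to postgresql.auto.conf; a server restart is required
    ALTER SYSTEM SET max_connections = 500;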

Alternatively, a more sophisticated approach is connection pooling. There are numerous solutions, but the most commonly used technology is PgBouncer. Among its many capabilities, PgBouncer can regulate connection sessions using one of three modes:

  • Session pooling: The most polite method. When a client connects, a server connection will be assigned to it for the whole duration the client stays connected. When the client disconnects, the server connection will be put back into the pool. This is the default method.
  • Transaction pooling: A server connection is assigned to a client only during a transaction. When PgBouncer notices that the transaction is over, the server connection will be put back into the pool.
  • Statement pooling: The most aggressive method. The server connection will be put back into the pool immediately after a query completes. Multi-statement transactions are disallowed in this mode as they would break.
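A minimal pgbouncer.ini sketch, assuming a database named mydb and an already-populated userlist file:

    [databases]
    mydb = host=127.0.0.1 port=5432 dbname=mydb

    [pgbouncer]
    listen_addr       = 0.0.0.0
    listen_port       = 6432
    auth_type         = md5
    auth_file         = /etc/pgbouncer/userlist.txt
    pool_mode         = transaction   ; session | transaction | statement
    max_client_conn   = 1000          ; client connections PgBouncer will accept
    default_pool_size = 20            ; server connections per database/user pair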

The Secure Sockets Layer (SSL) is another consideration. When configured with SSL certificates, PostgreSQL’s default behavior encourages all connecting sessions to use SSL, consequently consuming more CPU than unencrypted sessions. One can configure the host-based authentication rules in pg_hba.conf so that ordinary client sessions do not use SSL, reserving it for administrative tasks by the superuser and for streaming replication, as sketched below.
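A pg_hba.conf sketch of that idea; the database names, users, addresses, and auth method are placeholders:

    # TYPE       DATABASE      USER         ADDRESS          METHOD
    # ordinary application sessions: skip the TLS overhead
    hostnossl    mydb          app_user     10.0.1.0/24      scram-sha-256
    # administrative access and streaming replication must use TLS
    hostssl      all           postgres     10.0.2.0/24      scram-sha-256
    hostssl      replication   replicator   10.0.3.0/24      scram-sha-256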

Autovacuum: Basic

Complexity: Medium
Potential Impact: Low-Medium

Multi-Version Concurrency Control (MVCC) is one of the foundational principles making PostgreSQL such a popular DBMS solution. One of the unfortunate downsides, however, is that every updated or deleted record leaves behind a dead tuple which must eventually be purged. An improperly tuned autovacuum process, the mechanism that deals with dead tuples, reduces performance; the busier the server, the more significant the effect.

One manages the autovacuum daemon using these three (3) parameters:

  • autovacuum_max_workers: Increasing the number of autovacuum workers from its default of three (3) means more processes are available to vacuum the data cluster, which is especially useful when confronted with a large number of very large tables. Ideally one would create one worker per CPU; workers should never exceed the number of CPUs, but too many can cause spikes in CPU usage, so one typically sets the value somewhere between these two limits. This is a balancing act between maximizing autovacuum efficiency and overall system performance.
  • maintenance_work_mem: The greater the value, the more efficient the vacuuming. Keep in mind there is a law of diminishing returns: too great a value is at best a waste of RAM and at worst can exhaust the available RAM for the entire database system.
  • autovacuum_freeze_max_age: This parameter mitigates transaction ID (TXID) wraparound. The older the age, the less often it runs, reducing system load. But as with all the autovacuum parameters mentioned thus far, there is a caveat: delay it too long and you risk running out of txids before the process completes, forcing the server to shut down in order to protect data integrity. Determining the correct value requires trending the oldest txid age against the autovacuum process, e.g. by querying pg_stat_activity for wraparound activity (see the sketch after this list).
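An illustrative starting point for postgresql.conf, followed by the kind of queries used to trend wraparound; the numbers are placeholders to validate against your own workload:

    # postgresql.conf -- illustrative autovacuum settings
    autovacuum_max_workers    = 6           # never more than the number of CPUs
    maintenance_work_mem      = 1GB         # used per worker, so budget RAM accordingly
    autovacuum_freeze_max_age = 500000000   # default is 200 million

    -- trend how close each database is to txid wraparound
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

    -- watch for anti-wraparound autovacuums in flight
    SELECT pid, query
    FROM pg_stat_activity
    WHERE query ILIKE '%to prevent wraparound%';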

Beware of over-committing RAM and CPU. The higher the values initially set, the greater the risk that resources are exhausted as system load increases; set too high, you can experience a sudden drop in performance once load exceeds a certain point. TIP: similar to the allocation of RAM for work_mem, one can either perform some arithmetic or benchmark the environment in order to set these values optimally.

Autovacuum: Advanced

Complexity: High
Potential Impact: High

Because of the amount of effort involved, one should consider this method when the database system risks pushing the host to its physical limits, and excessive bloat is identified as an issue.

Editing the autovacuum runtime parameters in postgresql.conf is the most common method of controlling its behavior for a data cluster. Unfortunately, this one-size-fits-all approach may not work well over the long term, especially as a system scales.

Table Storage Parameters: Often there are tables whose activity represents a significant share of the total data cluster churn. Adjusting the various autovacuum parameters on a table-by-table basis is an excellent way of mitigating hyperactive relations without resorting to a manual invocation of VACUUM, which can be significantly impactful on the system.

Tune individual tables using an ALTER TABLE … SET command along the lines of the sketch below.
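A hedged example for a hypothetical high-churn table named orders_queue; the thresholds are illustrative only:

    -- vacuum this hot table far more aggressively than the cluster-wide defaults
    ALTER TABLE orders_queue SET (
        autovacuum_vacuum_scale_factor  = 0.01,   -- vacuum after ~1% of rows change
        autovacuum_analyze_scale_factor = 0.02,
        autovacuum_vacuum_cost_limit    = 1000    -- let the worker do more per round
    );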

Bloat

Complexity: Low
Potential Impact: Medium-High

Over time, even with the best of intentions, performance can degrade due to inadequate vacuuming policies, causing excessive bloat that even tuning the autovacuum daemon and manually invoking VACUUM will not easily resolve. For these cases, the pg_repack extension comes to the rescue.

pg_repack: rebuild and reorganize tables and indexes under production conditions
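A minimal sketch of its use; the database and table names are placeholders, and the extension must exist in the target database before the client utility is run:

    # install the extension in the target database
    psql -d mydb -c "CREATE EXTENSION pg_repack;"

    # rebuild a single bloated table online, holding only brief locks
    pg_repack --dbname=mydb --table=public.orders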

Data Hotspots

Complexity: High
Potential Impact: Low-Medium-High

Similar to MySQL hot spots, identifying and resolving hot spots in PostgreSQL relies upon extensive knowledge of the data flow and can, at its most extreme, require refactoring the system’s architecture.

Here are a few of the more popular mitigation techniques:

  • Indexes: Confirming that the columns used as query criteria have indexes assigned to them is a major opportunity for improving query performance. Another technique is to query the various catalogs and monitoring views to confirm that SQL commands are actually hitting indexed columns (see the sketch after this list). TIP: use tools such as the pg_stat_statements extension and pgbadger to determine query performance.
  • Heap Only Tuples (HOT): There is such a thing as too many indexes. Reduce potential bloat, and table size, by removing unused indexes, i.e. those attached to columns that are not used as part of the WHERE clause in SELECT queries.
  • Table Partitioning: Nothing affects performance quite like a table that is several times larger than the average table size. Splitting a large table into smaller partitions can, for example, increase query performance when the data is partitioned by date. And because only one autovacuum worker is allowed to process a single table, breaking it into many smaller partitions allows more than one worker to vacuum it in parallel. Another advantage is that data purges become far more efficient: truncating a single partition is much faster than deleting a large number of rows from one super-sized table.
  • Parallel Querying: Introduced in recent versions of postgres, parallel query allows multiple CPUs to process a single query where before it was just one processor per query.
  • De-Normalization: Depending upon the specifics, one can improve performance by merging columns from several tables into a single, larger table. Performance gains come from reducing the number of joins the planner must resolve, at the expense of increased data redundancy. Ponder this option carefully before using it!
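As a sketch of the index-related items above, this catalog query lists indexes that have never been scanned since statistics were last reset; treat them as candidates for removal only after careful review:

    SELECT schemaname,
           relname       AS table_name,
           indexrelname  AS index_name,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY pg_relation_size(indexrelid) DESC;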


Competing Application Processes

Complexity: Low
Potential Impact: High

PHP + Java + Python applications: Avoid running applications and postgres on the same host. Back in the old days, one could easily combine a web service and an RDBMS on the same machine since their resource usage was complementary. Beware of applications based on these languages: they can consume large amounts of RAM, especially during garbage collection, which then competes with the database system and reduces its overall performance.

Replication Latency

Complexity: Low
Potential Impact: High

async vs sync: The most recent versions of postgres support logical and streaming replication in both synchronous and asynchronous modes. Although the default replication mode is asynchronous, one must consider the implications of using synchronous replication, especially over network connections with less-than-ideal latency.
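An illustrative primary-side configuration that makes a single, named standby synchronous while any others remain asynchronous; the standby name is a placeholder:

    # postgresql.conf on the primary
    synchronous_standby_names = 'FIRST 1 (standby1)'  # commit waits for this standby
    synchronous_commit        = on                    # or remote_apply for stricter guarantees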

Server Environment

Last, but not least, are considerations of the most basic sort, i.e. make the host bigger and better. Let’s review what each of the following resources brings by way of performance enhancements to PostgreSQL:

  • RAM: The more the better, it allows us to assign more RAM to queries and increase the number of individual sessions. More RAM means more of the database is cached thereby optimizing IO.
  • CPU: More CPUs mean more forked processes, i.e. more vacuuming, more session connections, etc.
  • HDD: Size and speed optimizations
    • increases the permitted size of the database
    • overall query performance improves because of faster IO especially when operations such as merge sorts spill onto the disk
  • Disk Partitioning:
    • Breaking the data cluster across multiple partitions increases the number of IO channels and isolates the distinct operations postgres carries out at the same time. For example, one can put indexes and tables on separate partitions with different performance characteristics (see the tablespace sketch after this list).
    • Temporary session tables and operations such as merge sorts can be dedicated to a single high-speed partition, or rotated across multiple partitions, in order to improve IO.
    • Logging can be isolated on its own partition so that, should it run out of space, the RDBMS itself is not affected.
    • WAL logs, similar to regular logging, can have their own partition for this write-intensive operation. If it runs out of space, as can happen when log shipping and the connection to the standby breaks, the database’s integrity is still assured since the tables are located elsewhere.
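In PostgreSQL these separations are usually expressed with tablespaces; a hedged sketch using hypothetical mount points and table names:

    -- map separately mounted volumes to tablespaces
    CREATE TABLESPACE fast_ssd  LOCATION '/mnt/nvme/pgdata';
    CREATE TABLESPACE temp_disk LOCATION '/mnt/scratch/pgtemp';

    -- place a hot index on the fast volume
    CREATE INDEX orders_created_at_idx ON orders (created_at) TABLESPACE fast_ssd;

    -- in postgresql.conf: send temporary tables and on-disk sorts to the scratch volume
    -- temp_tablespaces = 'temp_disk'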

