Robert Rode


Personal blog about coding, energy and data engineering.


Operating an Entire Company on a Minimal Two-Core PostgreSQL Instance: Query Optimization Insights, Part 1

In an era when the costs of cloud services continue to soar, I want to outline how our data team has managed to run the infrastructure of our energy trading company through its first year on only two PostgreSQL database instances - each equipped with just two CPU cores and four GB of RAM. Despite this minimal use of resources, internal business processes and analytics dashboards operate with satisfactory performance. But how is this feasible?

Database administrators often respond to performance problems by demanding a bigger budget for more powerful database instances. As our company grew, the utilization of our database regularly hit limits that would normally have led to noticeable performance degradation. These bottlenecks required quick decisions, which often resulted in a short-term increase in resources. In every case, however, appropriate countermeasures allowed us to return to a smaller database instance promptly. From a business perspective, a balance had to be found between:

  1. The ongoing costs that could arise from short-term performance degradation of the database (for example, power traders making poorer decisions because market information arrives late),
  2. The one-time costs for developer hours needed to implement countermeasures, and
  3. The ongoing costs for scaling up database resources.

The costs under point 1 can quickly reach several thousand to tens of thousands of euros per day. That is why we have relied from the start on maintaining adequate performance reserves and redundancy. Balancing points 2 and 3, on the other hand, is more difficult. We have often chosen to invest in developer hours, and here are the reasons:

  1. The optimization of queries, schemas, and indexes builds valuable knowledge about how data can be efficiently structured, stored, and retrieved (for example, by asking questions like: "How do I partition my table?").
  2. Fine-tuning the database configuration and infrastructure increases the understanding of database technology. Many problems can only be properly understood and solved when the underlying software architecture is known (such as: "How does PostgreSQL store data internally?").
  3. Problems in the areas mentioned above often cannot be permanently fixed by increasing resources. So, the problem is merely postponed, creating "technical debt".

I claim that most of the databases currently in use could be operated with a fraction of the resources. In my five-part article series, I want to show how the most common performance bottlenecks can be eliminated through optimization of queries, tuning of indexes, partitioning, tuning of the database server, and the use of read-only replicas.

Part 1: Optimization of Queries

Bad queries are the root of all evil. Even the most performant database servers will buckle if users write poor queries. But what are bad queries and how can they be avoided?

SQL is not easy to learn. Inexperienced developers with little knowledge of the database structure cannot apply all the aspects that need to be considered in order to write efficient queries. Even with perfect knowledge of the database structure, it is not always easy to identify the most performant way of querying data - there are simply too many different ways to write SQL queries. Therefore, the first and most important point to avoid inefficient queries is:

1. Identify Inefficient Queries through Monitoring

The module pg_stat_statements provides an excellent basis for identifying queries that are particularly frequently executed and that heavily burden the database instance.

The module ships with PostgreSQL by default and can be activated by modifying postgresql.conf:

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'

compute_query_id = on
pg_stat_statements.max = 10000
pg_stat_statements.track = all

After activating the module, a restart of the database server is required, as the module needs additional shared memory.
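
In addition, the extension has to be created in each database in which the statistics view is to be queried:

-- Makes the pg_stat_statements view and functions available in the current database
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;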

Subsequently, a simple query reveals which queries take up the most execution time, thus providing initial guidance for optimization potential. In addition, the attributes shared_blks_hit and shared_blks_read can be used to determine how often the shared cache is used for a query. Based on these two attributes, a hit rate for the shared cache can be calculated. We will go into more detail on the meaning of this rate later in the article "Tuning the Database Server."

SELECT
  query,
  calls,
  total_exec_time,
  rows,
  100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM
  pg_stat_statements
ORDER BY
  total_exec_time DESC
LIMIT
  5;

This SQL command allows identifying the most critical queries for performance and thus represents a valuable first step in the optimization of the database.
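
When verifying the effect of an optimization, it can also help to start from a clean slate. The statistics collected so far can be cleared with the reset function provided by the module:

-- Discards all statistics gathered so far by pg_stat_statements
SELECT pg_stat_statements_reset();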

2. Understanding Queries with EXPLAIN ANALYZE

PostgreSQL offers an extremely powerful tool to trace the execution of a query in detail: EXPLAIN ANALYZE.

The use of

EXPLAIN statement

provides the execution plan for a query without actually executing it. This gives insight into the paths the database planner envisions for the query, which indexes would be used, and how data flows through the different phases of the plan.

In contrast,

EXPLAIN ANALYZE statement

not only actually executes the query but also gathers runtime statistics, which allow for a precise analysis of each operation in the execution plan. This gives insight into the actual costs of a query and can pinpoint performance bottlenecks accurately.

This in-depth analysis reveals not only how long each component of the plan took to complete its operation but also how many rows were processed and whether the database's estimates matched the actual circumstances.
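
As a small sketch, using the table and column from the example plan shown further below, such an analysis could be run as follows; the BUFFERS option additionally reports the shared-cache hits and reads per plan node:

-- ANALYZE executes the query; BUFFERS adds block-level cache statistics
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM my_table
WHERE a_column = 'Value';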

The database's query estimates are crucial for selecting the most efficient execution plan. PostgreSQL's planner uses statistics about the data to predict how many rows an operation will return, which join method is best, or how selective a filter is. These estimates are based on data such as the number of rows, value distribution, and other statistical information collected over time. They are described in the PostgreSQL documentation chapter "Statistics Used by the Planner".

EXPLAIN ANALYZE reveals whether these estimates were accurate. If the actual runtime statistics differ significantly from the estimates, this points to a potential inefficiency, for example because the table statistics are outdated or too coarse to reflect the actual value distribution.

A precise analysis of EXPLAIN ANALYZE results helps to identify these discrepancies. Here's an example:

Seq Scan on my_table  (cost=0.00..36.50 rows=1230 width=40) (actual time=0.011..0.492 rows=1234 loops=1)
  Filter: (a_column = 'Value')
  Rows Removed by Filter: 123
Planning Time: 0.233 ms
Execution Time: 0.530 ms

In this example, rows=1230 indicates the estimated number of rows that meet the condition, while the actual rows=1234 is the number of rows that actually met it during execution - an indicator of fairly accurate estimates. If the estimated and actual values diverge significantly, this can be a sign that the database statistics need to be updated to improve the planner's accuracy. In addition, the level of detail of the statistics can be increased: for example, most_common_vals, the list of the most frequently occurring values within a column, can be extended beyond the default of 100 entries. This allows the planner to make more accurate estimates.
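
A minimal sketch, reusing the table and column from the plan above: the per-column statistics target can be raised, after which the statistics have to be rebuilt.

-- Collect more detailed statistics for this column (the default target is 100)
ALTER TABLE my_table ALTER COLUMN a_column SET STATISTICS 500;

-- Rebuild the statistics so the planner can use the additional detail
ANALYZE my_table;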

3. Optimizing Queries

In addition to checking the accuracy of estimates, the execution of EXPLAIN ANALYZE can uncover various other causes for inefficiencies in database queries and be used as a basis for optimization:

Suboptimal Join Strategies:

If suboptimal methods (Nested Loop, Hash Join, Merge Join, etc.) are chosen for joins, this can lead to unnecessarily high resource consumption.

Countermeasures:

  1. Update table statistics with ANALYZE.
  2. Create or optimize indexes for the join criteria (see the sketch after this list).
  3. Adjust PostgreSQL's join-related configuration parameters (see the "Query Planning" section of the PostgreSQL documentation).
  4. Shift logic to the application level to reduce database load.
  5. Adjust server settings such as work_mem (see Part 4: Tuning the Database Server).
  6. Partition large tables (see Part 3: Partitioning).
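
As a sketch of countermeasures 1 and 2, assuming two hypothetical tables orders and customers that are joined on customer_id:

-- Refresh the planner statistics for both join partners
ANALYZE orders;
ANALYZE customers;

-- Index the join key so the planner can choose an index-based join path
CREATE INDEX IF NOT EXISTS idx_orders_customer_id ON orders (customer_id);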

Unused or Missing Indexes:

When queries are executed without using indexes, this often leads to full table scans (sequential scans) and significantly affects performance.

Countermeasures:

  1. Create indexes for frequently used query criteria.
  2. Consider the use of Index-Only Scans where possible (see the sketch below).
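
A minimal sketch of countermeasure 2, assuming a hypothetical trades table: a covering index (PostgreSQL 11 and later) stores all referenced columns in the index, so the query can be answered without touching the table heap, provided the visibility map is reasonably up to date (e.g. after VACUUM).

-- price is stored in the index but is not part of the search key
CREATE INDEX IF NOT EXISTS idx_trades_traded_at
    ON trades (traded_at) INCLUDE (price);

-- Can be served by an Index Only Scan
SELECT traded_at, price
FROM trades
WHERE traded_at >= now() - interval '1 day';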

Costly Sorting and Aggregation Operations:

A high resource expenditure for sorting and aggregation can affect query performance.

Countermeasures:

  1. Create indexes for sorting keys to enable index sorting.
  2. Use Materialized Views to pre-calculate and store aggregation results (see the sketch after this list).
  3. Rethink query structure to avoid excessive sorting and aggregation.
  4. Adjust the work_mem parameter to optimize sorting and aggregation operations in memory (see Part 4: Tuning the Database Server).
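
As a sketch of countermeasure 2, assuming a hypothetical measurements table from which dashboards repeatedly aggregate daily values:

-- Pre-compute the aggregation once instead of on every dashboard request
CREATE MATERIALIZED VIEW daily_consumption AS
SELECT meter_id,
       date_trunc('day', measured_at) AS day,
       sum(value_kwh)                 AS consumption_kwh
FROM measurements
GROUP BY meter_id, date_trunc('day', measured_at);

-- Refresh periodically, e.g. from a scheduled job
REFRESH MATERIALIZED VIEW daily_consumption;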

Filter Inefficiencies:

Inefficiently applied filters slow down query execution and waste valuable resources.

Countermeasures:

  1. Optimize query logic to ensure that filter conditions are applied early.
  2. Improve database schemas, e.g., through normalization, to promote efficient filtering.
  3. Use partial indexes for specific filter conditions (see the sketch after this list).
  4. Review filters on frequently used columns and create or optimize indexes if necessary.
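
A minimal sketch of countermeasure 3, assuming a hypothetical orders table in which most queries only look at open orders:

-- The index only covers rows matching the filter, keeping it small and cheap to maintain
CREATE INDEX IF NOT EXISTS idx_orders_open_created_at
    ON orders (created_at)
    WHERE status = 'open';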

Summary

Armed with the right knowledge, developers and database administrators can strategically adjust query structures and database indexing to prevent performance losses and maximize efficiency. Identifying, understanding, and optimizing inefficient queries is key to the long-term optimal operation of a database. PostgreSQL provides us with all the tools we need for this purpose, including pg_stat_statements and EXPLAIN ANALYZE. Besides these, a multitude of other tools are available. We must not forget that 20 years ago only a fraction of the performance of today's hardware was available, and despite these constraints, performant database-backed applications were developed.

2023-11-15