
Operating an Entire Company on a Minimal Two-Core PostgreSQL Instance: Index Tuning, Part 2

Optimizing PostgreSQL is a relevant and controversial topic ("How much resources should I invest in optimization vs. hardware?"): The first part of the series, "Query Optimization Insights", made it to the top 4 posts on news.ycombinator.com and sparked a lively discussion with over 170 comments, which I highly recommend to any PostgreSQL enthusiast.

An interesting point of debate that has essentially created two camps concerns the economic trade-off between investing in development resources to optimize database usage and investing in hardware resources. In my opinion, this trade-off cannot be resolved universally. When implementing a time-sensitive project where money plays only a secondary role or the costs of delayed implementation are very high, I am willing to accumulate "technical debt" and (temporarily) compensate for poor schema design, queries, etc., with hardware resources. In all other cases, however, I see myself in a marathon, not a sprint. I aim to gradually build a development team that is intimately familiar with, and can effectively utilize, its core tools, which undoubtedly include the database. Once this knowledge is in place, it carries over to all future software projects. Understanding how to efficiently process, store, and retrieve data is also a catalyst for innovations that go far beyond operating a PostgreSQL database. A good example is understanding the limits of SQL databases, that is, knowing when not to use this technology. Familiar with these limits, I can design systems that combine different storage solutions for different types of data, or develop my own storage solution tailored specifically to my use case.

After addressing the identification of inefficient queries and corresponding countermeasures in the first part of my five-part series, I'd like to focus on tuning indexes in this second part.

Part 2: Index Tuning

PostgreSQL offers many ways to optimize, but why is fine-tuning so important? The answer lies not only in the obvious need to make applications faster and more responsive. It's also about getting the best value out of existing resources, thereby reducing costs. The key? Efficient use of indexes!

Indexes provide a way to significantly speed up database queries. They enable efficient searching - like finding a needle in a haystack - and make handling tables with millions or even billions of rows manageable. Besides searching, indexes can also represent unique constraints or sort data. However, if used improperly, they can bloat the database, slow down write operations, and strain the CPU. Indexes have a direct impact on operational costs and the efficiency of the IT infrastructure. With increasing amounts of data and growing demands on data processing, the right indexing strategy is not a luxury but a necessity for any business that wants to make optimal use of its databases. But how do you find the balance between building and managing indexes and the potential overhead they can bring? This blog post sheds light on the art and science behind indexing in PostgreSQL, showing how database performance can be maximized through strategic index tuning.

Statistics on Indexes

Statistics on indexes help estimate how much storage space indexes consume and how intensively an index can be used. Below, I present two examples of queries that I frequently use for an initial analysis in my day-to-day work.

Using this query

SELECT
    relname AS table_name,
    pg_size_pretty(pg_total_relation_size(relid)) AS "Total Size",
    pg_size_pretty(pg_indexes_size(relid))        AS "Index Size",
    pg_size_pretty(pg_relation_size(relid))       AS "Actual Size"
FROM pg_catalog.pg_statio_user_tables
WHERE relname = 'table_name'
ORDER BY pg_total_relation_size(relid) DESC;

we can create an overview of the total size of a table (including indexes) alongside the size of the data and of the indexes themselves. These statistics help us identify excessively large indexes. In some cases, indexes occupy more storage space than the actual data. This can happen, for example, with a composite index that covers several columns and stores the combination of values from all of them: if each column is wide or has many distinct values, the combined index can become very large. A large index is not necessarily bad practice. If a large table is analyzed in different ways, it may make sense to maintain several indexes, each optimized for individual queries.

The usage of a table's indexes can be analyzed with the following query:

SELECT
    t.schemaname,
    t.tablename,
    c.reltuples::bigint                            AS num_rows,
    pg_size_pretty(pg_relation_size(c.oid))        AS table_size,
    psai.indexrelname                              AS index_name,
    pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
    CASE WHEN i.indisunique THEN 'Y' ELSE 'N' END  AS "unique",
    psai.idx_scan                                  AS number_of_scans,
    psai.idx_tup_read                              AS tuples_read,
    psai.idx_tup_fetch                             AS tuples_fetched,
    i2.indexdef
FROM
    pg_tables t
    LEFT JOIN pg_class c ON t.tablename = c.relname
    LEFT JOIN pg_index i ON c.oid = i.indrelid
    LEFT JOIN pg_stat_all_indexes psai ON i.indexrelid = psai.indexrelid
    LEFT JOIN pg_indexes i2 ON psai.indexrelname = i2.indexname
WHERE
    t.schemaname NOT IN ('pg_catalog', 'information_schema')
    AND t.tablename = 'table_name'
ORDER BY 1, 2;

This overview provides insights into the size of individual indexes via index_size. Furthermore, the values number_of_scans, tuples_read, and tuples_fetched are relevant for assessing how the indexes are used. Indexes with a very low number_of_scans are rarely chosen by the query planner and can often be removed without significant performance loss.
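
As a starting point for such a cleanup, a sketch like the following lists non-unique indexes that have not been scanned since the statistics were last reset, ordered by size (keep in mind that these counters are cumulative and can be reset with pg_stat_reset()):

SELECT
    s.schemaname,
    s.relname       AS table_name,
    s.indexrelname  AS index_name,
    pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size,
    s.idx_scan      AS number_of_scans
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0          -- never chosen since the last statistics reset
  AND NOT i.indisunique       -- keep indexes that enforce uniqueness constraints
ORDER BY pg_relation_size(s.indexrelid) DESC;

But how does the query planner decide which indexes are most advantageous on the path to finding the desired data? To understand this, let's first look at the options available to us.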

Index Scan vs. Sequential Scan

Imagine a hypothetical phone book that contains 260 entries per letter, totaling 6760 entries. If we're looking for Mrs. Zylker, who happens to be the last entry in the phone book, we would have to go through all 6760 entries until we find her phone number in the last line of the book. Since we're only looking for a single entry, meaning the selectivity of our search is very high, it would be extremely inefficient to check each line individually, i.e., to perform a Sequential Scan.

An index works like a shortcut to find an element with specific characteristics in the database much faster. The names in a phone book are sorted alphabetically. At the beginning of the book, there is a short overview indicating from which page number each initial letter can be found. This index only includes 26 entries and therefore requires little time to read the index and determine the page number for the desired initial letter. The time-saving achieved by using an Index Scan to jump directly to the desired initial letter, without having to look at all the other entries in our further search, is very large compared to the time spent reading the index. So, when we look for Mrs. Zylker using the index, we only need to go through the 26 entries of the index and then 260 entries for the letter Z, a total of 286 entries until we come across her name. Thus, compared to the Sequential Scan, we have saved more than 95% of the time.

In general, a Sequential Scan is advantageous when we want to retrieve a large number of database entries with a query of low selectivity. An Index Scan shows its strengths when we want to retrieve a small number of database entries with a query of high selectivity.

Selectivity and Query Planner Statistics

The selectivity of a filter within a database query determines what fraction of all entries in a table is returned. For a query with low selectivity that uses an index, the database would have to iterate over large parts of the index to collect all the matching entries. Since every index lookup has a cost, these costs can outweigh the benefit of using the index to locate each individual entry, and the index ends up slowing the query down.

Therefore, for each query, the database's query planner must decide individually whether using an index promises time savings. An objective measurement would require executing the query once with and once without the index, which is obviously not feasible for every query in production (although developers can do exactly that manually when optimizing a query). Hence, the cost must be estimated as accurately as possible before the query is executed.
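
If you want to make this comparison yourself during development, you can temporarily discourage sequential scans for the current session and compare the two execution plans. A minimal sketch, reusing the access_log table that appears later in this post as an illustrative example:

EXPLAIN ANALYZE SELECT * FROM access_log WHERE client_ip = inet '10.1.2.3';

SET enable_seqscan = off;  -- strongly discourage sequential scans for this session only
EXPLAIN ANALYZE SELECT * FROM access_log WHERE client_ip = inet '10.1.2.3';
RESET enable_seqscan;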

PostgreSQL periodically performs a statistical analysis of the contents of all columns of a table via ANALYZE. This involves sampling a certain number of random rows from the table to build, among other things, a list of the most common values per column. The size of this list can be adjusted via the default_statistics_target parameter, whose default value is 100. In total, 300 * default_statistics_target rows are sampled. The factor of 300 was chosen because statistical sampling theory suggests that roughly this many samples are needed per histogram boundary for the estimated boundaries to have an acceptable level of uncertainty.

The statistics are used, among other things, by the query planner to estimate in which order tables should be joined and whether Sequential Scans or Index Scans should be performed. Inaccurate statistics can therefore lead to inefficient execution plans; analyzing the plan with EXPLAIN ANALYZE helps identify them. If a column has a large number of frequently occurring values, it might make sense to increase the default_statistics_target parameter. An increase for individual columns can be implemented with

ALTER TABLE example ALTER COLUMN postcode SET STATISTICS 300;

If this is not sufficient, "extended statistics" can be created to increase accuracy further.
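
For example, if two columns in the example table are statistically correlated, say a hypothetical city column and the postcode column, a functional-dependency statistic helps the planner estimate combined filters on both columns more accurately (a sketch; the column names are assumptions):

CREATE STATISTICS example_city_postcode_stats (dependencies)
    ON city, postcode FROM example;
ANALYZE example;  -- populate the new statistics object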

The trade-off between the performance gain from more efficient execution plans and the longer time needed to gather the statistics usually comes out positive for large databases. PostgreSQL also offers a variety of other parameters for tuning the query planner. A particularly interesting one is random_page_cost, which describes the cost of retrieving "non-sequentially-fetched disk pages". Thanks to the excellent random-read performance of modern hardware with NVMe drives, this parameter can generally be set very low (e.g., 1.1); a smaller value relative to seq_page_cost makes the planner more willing to choose Index Scans.
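
On an instance with fast NVMe storage, lowering this cost estimate globally might look like the following sketch (the value 1.1 is an assumption to be validated against your workload; the setting can also be applied per tablespace or per session):

ALTER SYSTEM SET random_page_cost = 1.1;  -- assumes fast random reads (NVMe/SSD)
SELECT pg_reload_conf();                  -- apply the change without a restart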

Index Types

The documentation of PostgreSQL provides an excellent overview of all types of indexes. In the following sections, I will particularly focus on edge cases where the default index is not sufficient.

B-Tree

Fundamentally, it can be said that in 99% of cases a B-Tree, which is also the default type, is the best choice. Whenever the data type is sortable and is filtered with one of the operators <, <=, =, >=, >, this index can be used.
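
For completeness, a minimal sketch with a hypothetical orders table: a plain CREATE INDEX without a USING clause creates a B-Tree, which then supports both equality and range filters as well as sorting on the column:

CREATE INDEX orders_created_at_idx ON orders (created_at);

-- Can use the index for the range filter and for the ORDER BY
SELECT * FROM orders
WHERE created_at >= '2024-01-01'
ORDER BY created_at;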

Hash Index

Hash indexes in PostgreSQL are based on hash table techniques and are especially efficient for exact match searches, i.e., equality queries. This type of index is specifically designed to speed up queries that exclusively use the equality operator.

Hash indexes are generally less flexible than B-Tree indexes and are used less frequently in practice, as B-Tree indexes also support range queries and sorting in addition to equality queries, and often offer similar performance for equality queries. However, under certain circumstances, especially with very large datasets, hash indexes can offer advantages in terms of storage efficiency and speed.
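
A hash index is created by naming the access method explicitly. A small sketch with a hypothetical page table whose long url column is only ever filtered with the equality operator:

CREATE INDEX page_url_hash_idx ON page USING hash (url);

SELECT * FROM page WHERE url = 'https://example.com/pricing';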

GiST

GiST (Generalized Search Tree) is a versatile index system in PostgreSQL that supports the establishment of search trees for complex data types using advanced operators. It is also possible to use GiST indexes with custom data types and operators by implementing an operator class that provides necessary methods for the index. This allows developers to tailor GiST indexes to their specific needs.

Here is a brief (non-exhaustive) list of data types supported by GiST:
- Geometric Data Types (point, box, circle, polygon)
  - Example: Determine all coordinates that lie within a circle.
- Network Addresses (inet)
  - Example: Check if a specific IP address lies within a network range.
- Text Search (tsvector, tsquery)
  - Example: Find texts containing two defined keywords.
- Range Types (tsrange, numrange, ...)
  - Example: Check if a specific point in time lies within a defined period.
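
As a small sketch of the range-type case, assume a hypothetical reservation table with a tsrange column named period; a GiST index accelerates overlap queries using the && operator:

CREATE INDEX reservation_period_gist_idx ON reservation USING gist (period);

-- Find all reservations that overlap the first week of March 2024
SELECT * FROM reservation
WHERE period && tsrange('2024-03-01', '2024-03-08', '[)');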

GIN

GIN (Generalized Inverted Index) is a specialized index in PostgreSQL designed for data types where a single data element can encompass multiple values, such as arrays or documents intended for full-text search. GIN indexes are particularly efficient with operators that check for the presence of, or a match with, elements in a set. While B-Tree indexes are ideal for discrete values and point queries, GIN indexes excel for data structures with multiple components, such as those found in jsonb documents, full-text search fields, or arrays.

Here are some use cases where GIN indexes are beneficial:
- Full-Text Search: When indexing tsvector columns to speed up queries using tsquery objects.
  - Example: Search documents containing a specific set of search terms.
- JSONB Data Types: To speed up queries searching for specific keys or value pairs within a jsonb object.
  - Example: Search all entries that have a specific key-value pair in a JSONB field.
- Array Data Types: For quickly determining if an array contains a specific value.
  - Example: Find all rows where an array field contains a specific value.
- Composite Types: In certain cases, a GIN index can also be useful for composite types if they can be interpreted as a set of values.
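
A brief sketch of the jsonb case, assuming a hypothetical event table with a jsonb column named payload:

CREATE INDEX event_payload_gin_idx ON event USING gin (payload);

-- The containment operator @> is supported by the default jsonb_ops operator class
SELECT * FROM event
WHERE payload @> '{"status": "failed"}';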

Generally, GIN indexes should be used when queries primarily aim to find or verify elements within sets, rather than to retrieve ranges or sorted values. They prove particularly useful where the set of values for a single data element is large and queries are unpredictable. Although GIN indexes may be slower when updating indexed columns, they offer a significant performance advantage for queries.

BRIN

BRIN stands for Block Range Index and is a type of index in PostgreSQL that is particularly suitable for very large tables where the data is physically stored sorted according to the indexed value. BRIN indexes store summary information about the value ranges in consecutive blocks (the so-called "pages") of the data table. Therefore, they are especially well-suited for scenarios where there is a natural correlation between the physical location of the data and the indexed values.

The main advantage of BRIN indexes lies in their storage efficiency: they require significantly less storage space than B-Tree, GiST, or GIN indexes, as they only retain coarse information about the data blocks. This makes them an excellent option for high-volume data sets in which the physical row order correlates strongly with the indexed values, such as timestamps in a log table that are inserted in chronological order.
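
A minimal sketch, assuming a hypothetical append-only sensor_log table whose rows are inserted in chronological order of recorded_at:

CREATE INDEX sensor_log_recorded_at_brin_idx ON sensor_log USING brin (recorded_at);

-- Time-window queries can skip all block ranges whose summaries do not match
SELECT * FROM sensor_log
WHERE recorded_at >= '2024-03-01' AND recorded_at < '2024-03-02';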

Partial Indexes

Partial indexes prove particularly useful when only a small, well-defined subset of a table's rows ever needs to be found through the index. If a dataset contains a very large number of identical values, we can actively exclude these values from indexing using a partial index. Keeping these common values in the index would only add write and storage overhead without offering a speed advantage when querying the data. The PostgreSQL documentation provides an example of a logging table that stores, among other things, the client IP. Most requests come from the local IP range 192.168.100.0/24. However, when querying the logs, we are solely interested in accesses from external IPs. Therefore, we can exclude the internal IP range, which makes up the majority of all log entries, from indexing:

CREATE INDEX access_log_client_ip_ix ON access_log (client_ip)
WHERE NOT (client_ip >= inet '192.168.100.0' AND
           client_ip <= inet '192.168.100.255');

In addition to selectivity, the frequency of use can also serve as a criterion for employing a partial index. If we mostly use data from the last seven days for analysis in our time series table, we can create an index that is limited to this narrow time interval. For our web application, which stores 100,000 registered users, of whom only 5,000 are active, we could create special indexes just for the active users:

CREATE INDEX user_email_partial_idx ON "user" (email)  -- "user" must be quoted: it is a reserved keyword
WHERE (active = true);

Exclusion Constraints

Exclusion constraints expand the concept of uniqueness and allow for the definition of more complex restrictions. I often use them to represent time-dependent many-to-many relationships. My relation table contains, in addition to the IDs of the connected elements, a valid_from and a valid_to field, which describe the validity period of the relationship. I want to ensure that, at any point in time, at most one valid connection exists between the same two elements. This can be achieved with exclusion constraints:

ALTER TABLE relation_x_y ADD CONSTRAINT relation_x_y_excl_overlap EXCLUDE USING gist (
    x_id WITH =,
    y_id WITH =,
    tsrange(valid_from, valid_to, '()') WITH &&
);
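
Note that mixing plain scalar columns (x_id and y_id compared with =) and a range type in a single GiST exclusion constraint requires the btree_gist extension, which provides GiST operator classes for ordinary data types:

CREATE EXTENSION IF NOT EXISTS btree_gist;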

The structure of the exclusion constraint:
- x_id WITH = and y_id WITH = mean that two rows are only considered to conflict if both IDs are identical.
- tsrange(valid_from, valid_to, '()') WITH && additionally requires that the validity periods of the two rows overlap.

In summary, this exclusion constraint ensures that no two entries in the table exist where x_id and y_id are the same and their defined validity periods overlap. This prevents a double, simultaneously valid connection between two elements.

The operators used in the exclusion constraint must be commutative, meaning that the result of the comparison must not depend on the order of the operands. In addition, the optional [ WHERE ( predicate ) ] clause allows the constraint to be applied to only part of the table by specifying a condition, analogous to the partial indexes described earlier.
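
For example, if the relation table uses soft deletes, the constraint could be restricted to rows that are still live (the deleted_at column is a hypothetical addition):

ALTER TABLE relation_x_y ADD CONSTRAINT relation_x_y_excl_overlap_live EXCLUDE USING gist (
    x_id WITH =,
    y_id WITH =,
    tsrange(valid_from, valid_to, '()') WITH &&
) WHERE (deleted_at IS NULL);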

Summary

In this blog post, we dived into the details of indexing and its optimization in PostgreSQL, a topic of great importance for both developers and database administrators. By tuning indexes and utilizing specific index types like B-Tree, Hash, GiST, GIN, and BRIN, queries can be significantly accelerated and database performance optimized. We've seen how vital it is to assess the selectivity and usage of indexes to effectively support the query planner. The discussion on partial indexes and exclusion constraints has also shown how targeted restrictions can further improve the performance and integrity of database queries. Understanding and correctly applying these techniques are crucial for the long-term performance and scalability of database systems. Writing efficient queries means not only getting fast responses to data requests but also conserving resources and reducing costs.

In the next blog post, we will tackle the topic of partitioning, another important strategy for optimizing PostgreSQL databases. We will explore how dividing a database into smaller parts can enhance query efficiency and simplify data management.

2024-03-28