Optimize PostgreSQL Performance and Reclaim Space with Vacuuming

GeekGuy
Jan 6, 2023
8 min read

In PostgreSQL, vacuuming is a maintenance task that helps to optimize database performance and reclaim space. It involves removing deleted or outdated rows from tables and indexes and updating statistics used by the query planner. This process is necessary to prevent the accumulation of unnecessary data, known as “dead rows,” which can take up significant space and slow down queries.

Multi-version concurrency control (MVCC)

To maintain consistency and prevent data loss due to concurrent updates, PostgreSQL employs multi-version concurrency control (MVCC). PostgreSQL and other database management systems use MVCC to ensure consistent reads and prevent data loss from concurrent updates. PostgreSQL is achieved by storing multiple versions of each row within the database, allowing transactions to access a consistent data snapshot.

In PostgreSQL, each row within a table is assigned a transaction ID referred to as an “xmin”. This ID signifies the transaction that inserted the row. When a row is updated or deleted, it is not immediately removed from the table. Instead, a new version of the row is inserted with a new transaction ID, while the old version is marked as “dead” with a transaction ID called an “xmax”.

When a transaction reads a row, it utilizes the xmin and xmax values to determine whether the row is visible to the transaction. If the xmin value is greater than the transaction’s “snapshot” (a record of the transaction IDs that were in progress when the transaction began), the row is not visible to the transaction. If the xmax value is equal to the transaction’s ID, the row has been deleted by the transaction and is also not visible. In all other cases, the row is visible to the transaction.

This allows transactions to access a consistent data snapshot, as they can only see rows committed when the transaction began. It also prevents data loss due to concurrent updates, as conflicting updates result in inserting new row versions rather than overwriting the existing data.

Although MVCC incurs some overhead in terms of storage and performance due to the need to maintain multiple versions of each row, it is a crucial feature of PostgreSQL and other database systems that support concurrent updates.

This allows multiple versions of each row to be stored, enabling transactions to access a consistent data snapshot. However, this can result in the accumulation of dead rows as rows are updated or deleted.

Vacuuming

The vacuum process in PostgreSQL helps maintain the database’s performance and space efficiency by removing rows that are no longer needed.

These rows, known as “dead rows,” accumulate because PostgreSQL uses MVCC to allow multiple transactions to access the same data simultaneously without conflicts. During the vacuum process, tables and indexes are scanned, and these dead rows are removed, helping to reclaim space and improve query performance. It is essential to run a vacuum to keep the database running smoothly periodically. MVCC stores multiple versions of each row, so dead rows are not immediately removed when a row is updated or deleted.

The vacuum process removes dead rows and updates statistics used by the query planner to more accurately estimate the number of rows returned by a query and choose the most efficient execution plan. There are two types of vacuum in PostgreSQL: VACUUM and ANALYZE. VACUUM removes dead rows and updates statistics, while ANALYZE only updates statistics. It is generally recommended to run both VACUUM and ANALYZE together.

The vacuum process can be initiated manually using SQL commands or automated using the autovacuum background process, which runs based on configurable thresholds such as the number of dead rows or live rows in a table. In PostgreSQL 15, the vacuum process has been optimized to make it more efficient and faster to vacuum large tables. It includes improvements such as the ability to vacuum multiple partitions of a table in parallel, vacuum indexes concurrently, and skip vacuuming indexes not affected by an update.

From a technical perspective, the vacuum process in PostgreSQL 15 involves several components. The vacuum daemon (autovacuum) initiates vacuum operations based on configurable thresholds. The vacuum worker executes the actual vacuum operation, scanning the table or index and removing dead rows while updating statistics.

Autovacuum, is enabled by default in PostgreSQL and can be configured using several parameters in postgresql.conf file. PostgreSQL has several settings related to vacuum that can be configured to control how the vacuum process runs. You can find the following settings in postgresql.conf file and include:

autovacuum: This setting enables or disables the autovacuum background process. By default, autovacuum is enabled.
autovacuum_vacuum_threshold: This setting determines the minimum number of dead rows that must be present in a table before it is vacuumed. The default value is 50.
autovacuum_analyze_threshold: This setting determines the minimum number of live rows that must be present in a table before it is analyzed. The default value is 50.
autovacuum_vacuum_scale_factor: This setting is a multiplier that determines how many dead rows are needed to trigger a vacuum based on the table size. The default value is 0.2.
autovacuum_analyze_scale_factor: This setting is a multiplier that determines how many live rows are needed to trigger an analyze based on the size of the table. The default value is 0.1.
autovacuum_vacuum_cost_delay: This setting determines the time (in milliseconds) the autovacuum will wait before starting a vacuum operation. The default value is 20.
autovacuum_vacuum_cost_limit: This setting determines the maximum number of rows that can be vacuumed in a single vacuum operation. The default value is 200.

Here is an example of configuring some of the vacuum settings in postgresql.conf file:

autovacuum = on 

autovacuum_vacuum_threshold = 100 

autovacuum_analyze_threshold = 100

autovacuum_vacuum_scale_factor = 0.5

autovacuum_analyze_scale_factor = 0.2

autovacuum_vacuum_cost_delay = 50

autovacuum_vacuum_cost_limit = 500

In this example, autovacuum is enabled, and the thresholds for vacuum and analyze are set to 100. The scale factors for vacuum and analyze are set to 0.5 and 0.2, respectively, which means that a vacuum will be triggered when there are 50 dead rows per 1,000 live rows in the table (0.5 x 100), and an analyze will be triggered when there are 20 live rows per 1,000 rows in the table (0.2 x 100). The vacuum cost delay is set to 50 milliseconds, and the vacuum cost limit is set to 500 rows, which means that the autovacuum will wait 50 milliseconds before starting a vacuum operation and will vacuum a maximum of 500 rows at a time.

It is important to configure these settings to ensure that vacuum and analyze properly are running effectively and not causing too much load on the database. It is also a good idea to monitor the activity of the autovacuum and manually vacuum tables that are not adequately maintained by the autovacuum.

Autovacuum can also be configured on a per-table basis using the autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit parameters in the table’s storage parameters. These parameters control how aggressively autovacuum vacuums the table, with a lower cost delay causing the vacuum to run more frequently and a higher cost limit allowing more rows to be vacuumed at once.

Parallel vacuuming

Parallel vacuum is a feature in PostgreSQL that allows the vacuum process to be run concurrently on multiple cores or processors, improving the performance of the vacuum operation. This can be especially useful for vacuuming large tables, as it allows the vacuum process to use multiple CPUs to scan and process the table in parallel.

The parallel vacuum was introduced in PostgreSQL 13 as an experimental feature and made generally available in PostgreSQL 14. To use a parallel vacuum, you need to set the “max_parallel_workers_per_gather” parameter in the postgresql.conf configuration file to a value greater than one. To enable parallel processing, specify the “PARALLEL” option when running a vacuum or analyze command.

For example, to run a parallel vacuum on a table named “foo_table”, you can use the following command:

VACUUM (PARALLEL, ANALYZE) foo_table;

You can also specify the “PARALLEL” option when running a vacuum or analyze command on an entire schema or database:

VACUUM (PARALLEL, ANALYZE) schema_name.*; VACUUM (PARALLEL, ANALYZE);

Note: Keep in mind that parallel vacuum can increase the load on the database server, as it requires multiple CPU cores to be used simultaneously. You should carefully monitor the performance of your database when using a parallel vacuum and adjust the “max_parallel_workers_per_gather” parameter as needed to find the optimal setting for your workload.

Transaction wraparound

Transaction wraparound is a phenomenon that can occur in PostgreSQL when the maximum transaction ID (TXID) has been reached, and the system wraps around to reuse old transaction IDs. This can cause problems if there are still rows in the database with a higher transaction ID than the current maximum, as they will be considered “dead” and removed by the vacuum process.

To understand how transaction-wraparound works, it is essential to understand how PostgreSQL manages transaction IDs. PostgreSQL transactions are assigned a unique transaction ID called a “xid.” The xid is a 32-bit integer, meaning it has a maximum value of 2^32-1, or 4,294,967,295. When this maximum value is reached, the system wraps around and reuses old xids.

To prevent transaction wraparound from causing problems, it is important to run the vacuum process and remove dead rows regularly. You can use the following query to check for tables at risk of transaction wraparound.

Vacuum statistics

To view the vacuum history for all tables in the current schema:

This query retrieves vacuum statistics for all tables in the “public” schema in the current database. This query can help monitor the status of the vacuum process and identify tables that may need to be vacuumed or analyzed. For example, if a table has many dead rows or has not been vacuumed or analyzed recently. In that case, it may be worth running a manual vacuum or analyzing operations to improve the performance of the database.

SELECT

    n.nspname as schema_name,

    c.relname as table_name,

    c.reltuples as row_count,

    c.relpages as page_count,

    s.n_dead_tup as dead_row_count,

    s.last_vacuum,

    s.last_autovacuum,

    s.last_analyze,

    s.last_autoanalyze

FROM pg_class c

JOIN pg_namespace n ON n.oid = c.relnamespace

LEFT JOIN pg_stat_user_tables s ON s.relid = c.oid

WHERE c.relkind = 'r' AND n.nspname = 'public';

-[ RECORD 1 ]----+------------------------------

schema_name      | public

table_name       | pgbench_accounts

row_count        | 9.999965e+06

page_count       | 163935

dead_row_count   | 41705

last_vacuum      | 2022-12-25 16:00:36.231734+00

last_autovacuum  | 

last_analyze     | 2022-12-25 16:00:18.90299+00

last_autoanalyze | 

-[ RECORD 2 ]----+------------------------------

schema_name      | public

table_name       | pgbench_branches

row_count        | 100

page_count       | 1

dead_row_count   | 41

last_vacuum      | 2022-12-25 16:00:44.722317+00

last_autovacuum  | 

last_analyze     | 2022-12-25 16:00:16.254529+00

last_autoanalyze | 2022-12-25 16:01:45.957663+00

To view the list of tables that have been modified since the last vacuum:

SELECT

    n.nspname as schema_name,

    c.relname as table_or_index_name,

    c.relkind as table_or_index,

    c.reltuples as row_count,

    s.last_vacuum,

    s.last_autovacuum,

    s.last_analyze,

    s.last_autoanalyze

FROM pg_class c

JOIN pg_namespace n ON n.oid = c.relnamespace

LEFT JOIN pg_stat_user_tables s ON s.relid = c.oid

WHERE (c.relkind = 'r' or c.relkind = 'i')
AND (s.last_vacuum < s.last_autovacuum OR s.last_vacuum < s.last_analyze);

To view the list of tables and indexes that have a high number of dead rows:

SELECT

    n.nspname as schema_name,

    c.relname as table_name,

    c.reltuples as row_count,

    s.n_dead_tup as dead_row_count

FROM pg_class c

JOIN pg_namespace n ON n.oid = c.relnamespace

LEFT JOIN pg_stat_user_tables s ON s.relid = c.oid

WHERE c.relkind = 'r' AND s.n_dead_tup > 0;




SELECT

    n.nspname as schema_name,

    c.relname as index_name,

    s.n_dead_tup as dead_row_count

FROM pg_class c

JOIN pg_namespace n ON n.oid = c.relnamespace

LEFT JOIN pg_stat_user_tables s ON s.relid = c.oid

    WHERE (c.relkind = ''r'' or c.relkind = ''i'')

    AND (s.last_vacuum < s.last_autovacuum OR s.last_vacuum < s.last_analyze)') AS t(

        schema_name text,

        table_or_index_name text,

        table_or_index char(1),

        row_count bigint,

        last_vacuum timestamp,

        last_autovacuum timestamp,

        last_analyze timestamp,

        last_autoanalyze timestamp

    );

You can write the above query for all the databases using dblink. This will give information about all the databases.

SELECT *

FROM dblink('host=<host> port=<port> dbname=<database> user=<username> password=<password>',

    'SELECT

        n.nspname as schema_name,

        c.relname as table_or_index_name,

        c.relkind as table_or_index,

        c.reltuples as row_count,

        s.last_vacuum,

        s.last_autovacuum,

        s.last_analyze,

        s.last_autoanalyze

    FROM pg_class c

    JOIN pg_namespace n ON n.oid = c.relnamespace

    LEFT JOIN pg_stat_user_tables s ON s.relid = c.oid

    WHERE (c.relkind = ''r'' or c.relkind = ''i'')

    AND (s.last_vacuum < s.last_autovacuum OR s.last_vacuum < s.last_analyze)') AS t(

        schema_name text,

        table_or_index_name text,

        table_or_index char(1),

        row_count bigint,

        last_vacuum timestamp,

        last_autovacuum timestamp,

        last_analyze timestamp,

        last_autoanalyze timestamp

    );

Note: You need to replace the placeholders <host>, <port>, <database>, <username>, and <password> with the actual values for your server.

Conclusion

It is important to properly configure the autovacuum to avoid overloading the database with too many vacuums. It is also a good idea to monitor the activity of the autovacuum and manually vacuum tables that are not adequately maintained by the autovacuum.

In summary, vacuuming is an essential maintenance task in PostgreSQL that helps to reclaim space and improve performance. t is generally recommended to run the vacuum regularly, either manually using the VACUUM SQL command or automatically using the autovacuum background process. This helps to ensure that the database remains efficient and free of dead rows and reduces the risk of transaction wraparound. Autovacuum is a convenient way to automate this process, but it is important to configure and monitor it to ensure it runs effectively and adequately.