How NOT EXISTS beat EXISTS by 26x — index size and heap fetches
Two queries. Same result. One is 26x faster — and it reads 53x fewer disk pages.
The difference is one word: NOT.
When CREATE INDEX CONCURRENTLY or REINDEX INDEX CONCURRENTLY fails, Postgres leaves behind an invalid index. Many engineers assume these indexes are harmless placeholders waiting to be cleaned up. After all, the planner won't use them for queries, right?
Wrong. Invalid indexes are far from harmless. They continue to consume resources, generate I/O, block optimizations, and even cause lock contention — all while providing zero query performance benefit.
In this article, we'll demonstrate — with real evidence — the hidden costs of invalid indexes that every Postgres administrator should understand.
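Before we go further, note that such indexes are easy to spot: the catalog marks them with pg_index.indisvalid = false. A minimal query to list them, together with the disk space they still occupy:

-- list invalid indexes and the space they still consume
select
  indexrelid::regclass as index_name,
  indrelid::regclass as table_name,
  pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_index
where not indisvalid;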
Your API is slowing down. You check your database and find 42 indexes on your users table. Which ones can you safely drop? How much performance are they costing you? Let's look at what actually happens in Postgres when you have too many indexes.
If you're a backend or full-stack engineer, you probably don't want to become an indexing expert — you just want your API fast and stable, without babysitting pg_stat_user_indexes.
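Still, one query is worth keeping around as a first-pass heuristic for unused indexes (assuming statistics haven't been reset recently, and remembering that idx_scan counters are per-node, so replicas need to be checked too):

-- indexes never scanned since the last stats reset, largest first
select
  schemaname,
  relname,
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) as index_size
from pg_stat_user_indexes
where idx_scan = 0
order by pg_relation_size(indexrelid) desc;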
Index maintenance includes multiple activities: dropping unused indexes, dropping redundant indexes, and rebuilding indexes on a regular basis to get rid of index bloat (and of course, keeping autovacuum well tuned).
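For the dropping and rebuilding parts, the non-blocking variants are what we want in production; a sketch, with hypothetical index names:

-- drop an unused index without blocking reads or writes
drop index concurrently public.my_unused_index;

-- rebuild a bloated index in place (Postgres 12+)
reindex index concurrently public.my_bloated_index;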
There are many reasons why we need to keep our index set lean, and some of them are tricky.
Suppose you need to create a replica for a 1 TiB database. You have a fast server with NVMe storage and 75 Gbps network, but pg_basebackup typically delivers only 300-500 MiB/s due to its single-threaded architecture — regardless of how powerful your hardware is (though PG18 brings a surprise we'll discuss later).
The solution: replace pg_basebackup with pgBackRest and leverage parallel processing for significantly faster replica creation, reaching ≈97% of line rate on a 75 Gbps link.
Note: This is an R&D-style exploration focused on performance benchmarking on idle systems, not a production-ready automation guide. Many considerations important for production environments (monitoring, retry logic, integration with orchestration tools, etc.) are intentionally omitted to focus on the core performance characteristics.
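For orientation, the core of the approach looks roughly like this (the stanza name is hypothetical, and it assumes pgBackRest is already configured with a stanza and a repository):

# take a full backup using 16 parallel processes
pgbackrest --stanza=main --type=full --process-max=16 backup

# restore it onto the future replica, again in parallel
pgbackrest --stanza=main --process-max=16 --delta restore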
In #PostgresMarathon 2-009 and #PostgresMarathon 2-010, we explored why execution 6 causes a lock explosion when building a generic plan for partitioned tables — the planner must lock all 52 relations because it can't prune without parameter values.
Today we'll test what actually happens with different plan_cache_mode settings.
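For reference, the knob accepts three values (Postgres 12+); a quick sketch of how we'll switch between them:

set plan_cache_mode = 'auto';               -- default: 5 custom plans, then consider a generic one
set plan_cache_mode = 'force_custom_plan';  -- always plan with actual parameter values
set plan_cache_mode = 'force_generic_plan'; -- always build and use the generic plan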
In #PostgresMarathon 2-009, we focused on the Lock Manager's behavior when dealing with prepared statements and partitioned tables, and observed a lock explosion in our simple synthetic example: from 8 locks (custom plans) during the first 5 calls, to 52 locks (building the generic plan) in the 6th call, to 13 locks (using the cached generic plan) in the 7th and subsequent calls. We left off with open questions.
Let's dig deeper.
In #PostgresMarathon 2-008, we discovered that prepared statements can dramatically reduce LWLock:LockManager contention by switching from planner locks (which lock everything) to executor locks (which lock only what's actually used). Starting with execution 7, we saw locks drop from 6 (table + 5 indexes) to just 1 (table only).
There we tested only a simple, unpartitioned table. What happens if the table is partitioned?
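To make the question concrete, here is a minimal sketch of such a setup (illustrative names, not the exact schema used in the series):

-- a range-partitioned table plus a prepared statement filtering on the partition key
create table events (id int8, created_at timestamptz)
partition by range (created_at);

create table events_2025_01 partition of events
for values from ('2025-01-01') to ('2025-02-01');

create table events_2025_02 partition of events
for values from ('2025-02-01') to ('2025-03-01');

-- without a concrete $1, the planner cannot prune partitions
prepare q (timestamptz) as select from events where created_at = $1;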
As discussed in #PostgresMarathon 2-002, for a simple SELECT from a table, Postgres locks the table and all of its indexes with AccessShareLock at planning time. Here's a simple demo as a reminder (let me be a bit weird here and save some bytes when typing SQL):
test=# create table t();
CREATE TABLE
test=# create index on t((1));
CREATE INDEX
test=# create index on t((1));
CREATE INDEX
test=# create index on t((1));
CREATE INDEX
test=# create index on t((1));
CREATE INDEX
test=# create index on t((1));
CREATE INDEX
test=# \d t
Table "public.t"
Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
Indexes:
"t_expr_idx" btree ((1))
"t_expr_idx1" btree ((1))
"t_expr_idx2" btree ((1))
"t_expr_idx3" btree ((1))
"t_expr_idx4" btree ((1))
test=#
test=# begin; explain select from t;
BEGIN
QUERY PLAN
-----------------------------------------------------
Seq Scan on t (cost=0.00..39.10 rows=2910 width=0)
(1 row)
test=*# select relation::regclass, mode from pg_locks where pid = pg_backend_pid();
relation | mode
-------------+-----------------
t_expr_idx2 | AccessShareLock
pg_locks | AccessShareLock
t_expr_idx3 | AccessShareLock
t_expr_idx4 | AccessShareLock
t_expr_idx | AccessShareLock
t_expr_idx1 | AccessShareLock
t | AccessShareLock
| ExclusiveLock
(8 rows)
test=*#
– indeed, all indexes are locked.
To mitigate this, we can simply use prepared statements. Let's create one:
prepare test_query (int) as select from t;
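From here, the natural next step is a sketch like the following (per the 2-008 result quoted above, repeat the execution: from the 7th call, the cached generic plan locks only the table):

begin;
execute test_query(1);
select relation::regclass, mode
from pg_locks
where pid = pg_backend_pid();
rollback;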
#PostgresMarathon 2-007: Should we worry about pg_blocking_pids()'s observer effect?
Many years ago, when developing complex automated procedures for a large company, I realized that my automation needed monitoring components, including ones that understand heavyweight lock contention – for example, to recognize situations when a poorly designed change is blocked by something like autovacuum running in transaction ID wraparound prevention mode (in that mode, it doesn't yield to anybody).
This led me to pg_blocking_pids() and analysis described in "Useful queries to analyze PostgreSQL lock trees (a.k.a. lock queues)".
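For readers who haven't met it: pg_blocking_pids(pid) returns the array of PIDs blocking a given session, which enables queries like this minimal variant (the analysis linked above builds full lock trees on top of the same idea):

-- blocked sessions and the PIDs blocking them
select
  pid,
  pg_blocking_pids(pid) as blocked_by,
  state,
  left(query, 60) as query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0;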
The setting max_locks_per_transaction is mysterious; it is a good illustration of Socrates' "I know that I know nothing". And that is the main fact to memorize about max_locks_per_transaction: don't try to remember the details. Unless you touch this setting often, you'll forget them (I do). Instead, let's rely on the docs:
The shared lock table has space for max_locks_per_transaction objects (e.g., tables) per server process or prepared transaction; hence, no more than this many distinct objects can be locked at any one time. This parameter limits the average number of object locks used by each transaction; individual transactions can lock more objects as long as the locks of all transactions fit in the lock table. This is not the number of rows that can be locked; that value is unlimited. The default, 64, has historically proven sufficient, but you might need to raise this value if you have queries that touch many different tables in a single transaction, e.g., query of a parent table with many children. This parameter can only be set at server start.
When running a standby server, you must set this parameter to have the same or higher value as on the primary server. Otherwise, queries will not be allowed in the standby server.
Unless you read academic papers every day, you'll need a few minutes to understand what's written here.
Let's take even more time and understand every single detail, and also cover what's not covered in this description.
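One number worth computing right away: per the docs, the shared lock table has room for max_locks_per_transaction * (max_connections + max_prepared_transactions) object locks in total. On a live server:

-- total capacity of the shared lock table
select
  current_setting('max_locks_per_transaction')::int
  * (current_setting('max_connections')::int
     + current_setting('max_prepared_transactions')::int)
  as shared_lock_table_capacity;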