Skip to main content

How NOT EXISTS beat EXISTS by 26x — index size and heap fetches

· 4 min read
Maxim Boguk
Postgres Expert
Nikolay Samokhvalov
CEO & Founder

Two queries. Same result. One is 26x faster — and it reads 53x fewer disk pages. The difference is one word: NOT.

The setup

A classic soft-delete pattern: a posts table with a boolean deleted flag. Most posts are active — 2% are soft-deleted.

create table posts (
post_id bigint primary key,
deleted boolean not null default false,
pad text not null -- wide rows, ~200 bytes each
);

-- two partial indexes:
create unique index posts_not_deleted_id_key
on posts (post_id) where not deleted; -- 98%: ~49M rows

create unique index posts_deleted_id_key
on posts (post_id) where deleted; -- 2%: ~1M rows

Table: 50 million rows, 15 GiB on disk. Indexes:

IndexCoversSize
posts_not_deleted_id_key49M active rows1,050 MiB
posts_deleted_id_key1M deleted rows22 MiB

The query: find all post_tag rows for a given tag list where the post is not deleted.

Two ways to write it

Query 1 — filter for active posts directly:

select pt.*
from post_tag pt
where pt.tag_id = any($1)
and exists (
select from posts
where posts.post_id = pt.post_id
and not deleted
);

Query 2 — exclude deleted posts:

select pt.*
from post_tag pt
where pt.tag_id = any($1)
and not exists (
select from posts
where posts.post_id = pt.post_id
and deleted
);

Same result. Different index. Dramatically different performance.

Benchmark results

Environment: Postgres 18.3, 8 dedicated cores, 32 GiB RAM, cold OS page cache before each run (drop_caches + PG restart), dirty visibility map (10% of rows updated — normal for any active production table), work_mem = 4MB (default), shared_buffers = 128MB (default).

Execution time and buffer reads: NOT EXISTS vs EXISTS

tag_idsQ1: EXISTS(NOT deleted)Q2: NOT EXISTS(deleted)Speedup
50 (~25k rows)4,984 ms376 ms13x
250 (~125k rows)18,777 ms724 ms26x
1,000 (~500k rows)50,791 ms1,907 ms27x

Buffer reads:

tag_idsQ1 readsQ2 readsRatio
5051,9524,14713x
250251,9647,43534x
1,0001,001,93418,76553x

Q2 reads 13–53x fewer pages from disk. Both queries use Nested Loop — work_mem=4MB prevents the hash table from fitting in memory for either query.

Why

Two factors stack on top of each other, both pointing the same direction.

Factor 1: index size

posts_deleted_id_key is 22 MiB — it fits entirely in shared_buffers after the first access. Every subsequent lookup is a memory hit.

posts_not_deleted_id_key is 1,050 MiB — it does not fit. Each of the thousands of point lookups in a nested loop hits a different index page. On a 15 GiB table that is not in memory, that means random I/O on every single lookup.

Factor 2: "not found in index" requires no heap fetch

This is the key insight.

When Q2 looks up a post in posts_deleted_id_key (the deleted index) and finds nothing — it's done. No row in the index means no row in the table. No heap fetch. No visibility check. The visibility map doesn't matter at all.

When Q1 looks up a post in posts_not_deleted_id_key (the active index) and finds a row — Postgres must verify the row is visible to the current transaction. If the page is not all-visible in the visibility map — normal for any actively updated table — a heap fetch is required. That is a random read into a 15 GiB table that doesn't fit in RAM.

Heap fetches measured:

Q1: EXISTS(NOT deleted)Q2: NOT EXISTS(deleted)
50 tag_ids26,938 heap fetches576
250 tag_ids132,246 heap fetches2,830
1,000 tag_ids526,367 heap fetches10,801

Q1 heap-fetches nearly every active post — 27k to 526k random reads into a wide table that doesn't fit in memory.

Q2 heap-fetches only the deleted posts (those it actually found in the small index). For 98% of lookups, it simply doesn't find anything — and stops there.

The takeaway

"Not found in the index" is free. "Found" requires a heap fetch.

When 98% of your lookups are for the majority case (active posts), you want those lookups to return "not found" against the small index — not "found" against the large one.

Use NOT EXISTS with a partial index on the minority (deleted, disabled, flagged) — not EXISTS with a partial index on the majority.

This pattern was first identified by Maxim Boguk on a 400 GiB production table — where the speedup was approximately 50x.

Reproduce it

-- setup (~3-5 min for 50M rows on modern hardware)
create table posts (
post_id bigint primary key,
deleted boolean not null default false,
pad text not null default repeat('x', 200)
);
insert into posts (post_id, deleted)
select g, (random() < 0.02) from generate_series(1, 50000000) g;

create unique index posts_not_deleted_id_key on posts (post_id) where not deleted;
create unique index posts_deleted_id_key on posts (post_id) where deleted;

-- dirty the VM (simulate an active table)
update posts set pad = repeat('y', 200) where post_id % 10 = 0;
analyze posts;

-- run both queries with EXPLAIN (ANALYZE, BUFFERS)
-- look at: Execution Time, Heap Fetches, Buffers: shared read