How to use OpenAI APIs right from Postgres to implement semantic search and GPT chat

Today we will implement RAG (Retrieval Augmented Generation) right in Postgres:
- Load the full Postgres commit history into a Postgres table.
- Using plpython3u (N/A on some managed services, such as RDS), start calling OpenAI APIs right from Postgres.
- For each commit, generate OpenAI embeddings and store them in the "vector" format (pgvector).
- Use semantic search to find commits, sped up by pgvector's HNSW index.
- Finally, "talk to the commit history" via OpenAI GPT-4.
⚠️ Warning: this approach doesn't scale well, so it's not recommended for larger production clusters. Consider it either for fun or for small projects/services only.
(Inspired by: @jonatasdp's Tweet)
Preparations
First, install extensions:
create extension if not exists plpython3u;
create extension if not exists vector;
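The HNSW index used below requires pgvector 0.5.0 or newer, so it may be worth confirming the installed version first (a quick sanity check):
select extversion from pg_extension where extname = 'vector';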
Then, we'll need an OpenAI API key. Let's store it as a custom GUC defined for our current database (⚠️ Warning: this is not the safest way to store it; we're using this method for simplicity):
do $$ begin
  execute(
    format(
      'alter database %I set openai.api_key = %L',
      current_database(),
      'sk-XXXXXXXXXXXX'
    )
  );
end $$;
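Since "alter database ... set" takes effect for new sessions, reconnect and make sure the key is readable (a quick check):
select current_setting('openai.api_key', true);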
Import commit history from Git
Create a table where we'll store Git commits:
psql -X \
-c "create table commit_logs (
id bigserial primary key,
hash text not null,
created_at timestamptz not null,
authors text,
message text,
embedding vector(1536)
)" \
-c "create unique index on commit_logs(hash)"
Now, clone the Postgres repo from GitHub or GitLab and dump the full commit history to a CSV file (taking care of escaping double quotes in the commit messages):
git clone https://gitlab.com/postgres/postgres.git
cd postgres
git log --all --pretty=format:'%H,%ad,%an,🐘%s %b🐘🐍' --date=iso \
| tr '\n' ' ' \
| sed 's/"/""/g' \
| sed 's/🐘/"/g' \
| sed 's/🐍/\n/g' \
> commit_logs.csv
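Before loading, a quick sanity check of the generated file (the line count should be close to the number of commits):
wc -l commit_logs.csv
head -n 3 commit_logs.csv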
Load the commit history from CSV to the table:
psql -Xc "copy commit_logs(hash, created_at, authors, message)
from stdin
with delimiter ',' csv" \
< commit_logs.csv
psql -X \
-c "update commit_logs set hash = trim(hash)" \
-c "vacuum full analyze commit_logs"
As of July 2025, this will yield ~96k rows, covering more than 10k days of the Postgres development history – more than 27 years – the first commit was made in July 1996!
Create and store embeddings
Here is a function that we'll use to get vectors for each commit entry from the OpenAI API using plpython3u (the "u" here means "untrusted" – such functions are allowed to talk to the external world):
create or replace function openai_get_embedding(
  content text,
  api_key text default current_setting('openai.api_key', true)
) returns vector(1536)
as $$
  import requests

  # Call the OpenAI embeddings API (newlines are replaced to keep the input on one line)
  response = requests.post(
    'https://api.openai.com/v1/embeddings',
    headers={ 'Authorization': f'Bearer {api_key}' },
    json={
      'model': 'text-embedding-3-small',
      'input': content.replace("\n", " ")
    }
  )

  if response.status_code >= 400:
    raise Exception(f"Failed. Code: {response.status_code}")

  # The returned Python list is converted to the pgvector "vector" type
  return response.json()['data'][0]['embedding']
$$ language plpython3u;
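A quick smoke test (assuming the API key GUC is set in the current session) – this should return 1536, matching the declared dimensionality:
select vector_dims(openai_get_embedding('test'));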
Once it's created, start obtaining and storing vectors. We'll do it in small batches, to avoid long-running transactions – not to lose large data volumes in case of failure and not to block concurrent sessions, if any:
with scope as (
  select hash
  from commit_logs
  where embedding is null
  order by id
  limit 5
), upd as (
  update commit_logs
  set embedding = openai_get_embedding(
    content := format(
      'Date: %s. Hash: %s. Message: %s',
      created_at,
      hash,
      message
    )
  )
  where hash in (
    select hash
    from scope
  )
  returning *
)
select
  count(embedding) as cnt_vectorized,
  max(upd.id) as latest_upd_id,
  round(
    max(upd.id)::numeric * 100 /
    (select max(c.id) from commit_logs as c),
    2
  )::text || '%' as progress,
  1 / count(*) as err_when_done
from upd
\watch .1
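While the loop is running, progress can also be monitored from another session, for example:
select
  count(*) filter (where embedding is not null) as vectorized,
  count(*) as total
from commit_logs;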
This process might take a significant amount of time, perhaps more than an hour, so prepare to wait. Also, this is where we start paying OpenAI for API use (although embedding creation is very cheap, and you'll pay roughly ~$1 here, see pricing).
Once it's done – or earlier, with partial results – you can start using it.
Semantic search
Here's a simple example of using semantic search to find the most relevant commits. The concept is straightforward:
- First, we call the OpenAI API to "vectorize" the text of our request.
- Then, we use pgvector's similarity search to find the K nearest neighbors.
We will use the HNSW index, considered one of the best approaches today (although originally described in 2016); it has been adopted by many DBMSes. In pgvector, it was added in version 0.5.0.
Note that this is an ANN index – "approximate nearest neighbors" – so, for the sake of speed, it is allowed to produce non-exact results, unlike regular indexes in Postgres.
Create an index:
psql -Xc "create index on commit_logs
using hnsw (embedding vector_cosine_ops)"
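If needed, the build parameters and the query-time recall/speed trade-off can be set explicitly – the values below are pgvector's documented defaults (a sketch, adjust to your data):
create index on commit_logs
using hnsw (embedding vector_cosine_ops)
with (m = 16, ef_construction = 64);
set hnsw.ef_search = 40;  -- per-session; higher values improve recall at the cost of speed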
Now, in psql, perform the search:
select openai_get_embedding('psql features') as q_vector \gset
select
  created_at,
  format(
    'https://gitlab.com/postgres/postgres/-/commit/%s',
    left(hash, 8)
  ),
  left(message, 150),
  authors,
  1 - (embedding <=> :'q_vector') as similarity
from commit_logs
order by embedding <=> :'q_vector'
limit 5 \gx
If the index is created, the second query should be very fast. You can check the plan and details of execution using EXPLAIN (ANALYZE, BUFFERS). Our dataset is tiny (<100k rows), so the search should take ~1 ms, with buffer hit/read numbers of ~1000 or less. There are a few tuning options for the indexes pgvector offers; check out its README.
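For example, reusing the q_vector psql variable set above (look for an index scan over the hnsw index in the plan):
explain (analyze, buffers)
select created_at
from commit_logs
order by embedding <=> :'q_vector'
limit 5;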
Here is an example result for the query psql "\watch" limited count of loops:
postgres-# limit 5 \gx
┌─[ RECORD 1 ]────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ created_at │ 2025-05-29 20:58:57-07 │
│ format │ https://gitlab.com/postgres/postgres/-/commit/13cb16e7 │
│ left │ Implement comprehensive improvements to pg_restore object tracking - Add command-line options for file paths and disable flag - Capture actual Postgre │
│ authors │ Nikolay Samokhvalov │
│ similarity │ -0.025786986265972756 │
├─[ RECORD 2 ]────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ created_at │ 2025-05-04 11:38:42-07 │
│ format │ https://gitlab.com/postgres/postgres/-/commit/2448c7a9 │
│ left │ doc: Put some psql documentation pieces back into alphabetical order │
│ authors │ Peter Eisentraut │
│ similarity │ -0.03374895125754507 │
├─[ RECORD 3 ]────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ created_at │ 2025-05-29 20:05:54-07 │
│ format │ https://gitlab.com/postgres/postgres/-/commit/64512a7b │
│ left │ Improve test and add documentation for pg_restore object tracking - Simplify test to use single table instead of complex schema - Use unique index cor │
│ authors │ Nikolay Samokhvalov │
│ similarity │ -0.037387386782650855 │
├─[ RECORD 4 ]────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ created_at │ 2025-05-23 14:05:35-07 │
│ format │ https://gitlab.com/postgres/postgres/-/commit/4050aedd │
│ left │ Implement complete WAL metrics functionality for table statistics This commit adds comprehensive WAL tracking to PostgreSQL table-level statistics, en │
│ authors │ Nikolay Samokhvalov │
│ similarity │ -0.050729119237905795 │
├─[ RECORD 5 ]────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ created_at │ 2025-05-20 14:31:00-07 │
│ format │ https://gitlab.com/postgres/postgres/-/commit/a6060f1c │
│ left │ pg_dump: Fix array literals in fetchAttributeStats(). Presently, fetchAttributeStats() builds array literals by treating the elements as SQL identifie │
│ authors │ Nathan Bossart │
│ similarity │ -0.0509084177885033 │
└────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
"Talk to the Git history" using OpenAI APIs right from Postgres
Finally, create two more functions and pose questions about the Postgres commit history:
create or replace function openai_gpt_call(
  question text,
  data_to_embed text,
  model text default 'gpt-4.1-nano-2025-04-14',
  token_limit int default 4096,
  api_key text default current_setting('openai.api_key', true)
) returns text
as $$
  import requests

  prompt = """Be terse. Discuss only Postgres and its commits.
For commits, mention timestamp and hash.
CONTEXT (git commits):
---
%s
---
QUESTION: %s
""" % (data_to_embed[:2000], question)

  ### W: error handling here is minimal (no timeouts or retries)
  response = requests.post(
    'https://api.openai.com/v1/chat/completions',
    headers={ 'Authorization': f'Bearer {api_key}' },
    json={
      'model': model,
      'messages': [
        {"role": "user", "content": prompt}
      ],
      'max_tokens': token_limit,
      'temperature': 0
    }
  )

  if response.status_code >= 400:
    raise Exception(f"Failed. Code: {response.status_code}. Response: {response.text}")

  return response.json()['choices'][0]['message']['content']
$$ language plpython3u;
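A direct smoke test of this function, with a hand-written context string instead of real search results (the commit line below is made up purely for illustration):
select openai_gpt_call(
  question := 'What does this commit do?',
  data_to_embed := 'Created: 2025-01-01, hash: 00000000, message: Example commit message, committer: Example Author'
);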
create or replace function openai_chat(
  in question text,
  in model text default 'gpt-4.1-nano-2025-04-14',
  out answer text
) as $$
  with q as materialized (
    select openai_get_embedding(
      question
    ) as emb
  ), find_entries as (
    select
      format(
        e'Created: %s, hash: %s, message: %s, committer: %s\n',
        created_at,
        left(hash, 8),
        message,
        authors
      ) as entry
    from commit_logs
    where embedding <=> (select emb from q) < 0.8
    order by embedding <=> (select emb from q)
    limit 10 -- adjust if needed
  )
  select openai_gpt_call(
    question := openai_chat.question,
    data_to_embed := string_agg(entry, e'\n'),
    model := openai_chat.model
  )
  from find_entries;
$$ language sql;
Now, just ask using openai_chat(...), for example:
nik=# \a
Output format is unaligned.
nik=# select openai_chat('tell me about fixes of CREATE INDEX CONCURRENTLY – when, who, etc.');
openai_chat
There are two notable fixes for CREATE INDEX CONCURRENTLY. The first one was made on 2016-02-16 18:43:03+00 by Tom Lane, with the commit hash a65313f2. The commit improved the documentation about CREATE INDEX CONCURRENTLY, clarifying the description of which transactions will block a CREATE INDEX CONCURRENTLY command from proceeding.
The second fix was made on 2012-11-29 02:25:27+00, with the commit hash 3c840464. This commit fixed assorted bugs in CREATE/DROP INDEX CONCURRENTLY. The issue was that the pg_index state for an index that's reached the final pre-drop stage was the same as the state for an index just created by CREATE INDEX CONCURRENTLY. This was fixed by adding an additional boolean column "indislive" to pg_index.
(1 row)
Note that it uses the models text-embedding-3-small for embeddings and gpt-4.1-nano-2025-04-14 for completion API calls by default. By the time you're reading this, there might be newer, better models – see pricing and models.
A few final notes
As mentioned above, this approach – calling an external API from Postgres – doesn't scale well. It is good for fast prototyping but should not be used in projects where significant TPS growth is expected (otherwise, as the workload grows, CPU saturation, idle-in-transaction spikes, etc. might cause significant performance issues and even outages).
Another disadvantage of this approach is that plpython3u is not available in some managed Postgres services (e.g., RDS).
Finally, when working in an SQL context, it is quite easy to unintentionally issue API calls in a loop. This might cause excessive expenses; to avoid it, carefully check the execution plans (see the sketch below).
For some people, such code is harder to debug.
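To illustrate the "calls in a loop" risk: a seemingly innocent query like the one below would invoke the OpenAI API once per row – roughly 96k calls for our table – inside a single long transaction:
-- ⚠️ anti-pattern sketch: one API call per row; don't run it as is
select openai_get_embedding(message) as emb
from commit_logs;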
Nonetheless, the concepts described above might work well in certain cases – just keep in mind these nuances and avoid inefficient moves. And if concerns are too high, the code calling APIs should be moved from plpython3u to outside Postgres.