
Building a RAG Retrieval Service: pgvector, Embedding Migrations, and Provenance Tracking

6 min read · Humza Tareen

Tags: RAG · pgvector · Embeddings · PostgreSQL · FastAPI

In an AI evaluation platform, the ability to learn from past evaluations is what separates a useful tool from a demo. The RAG retrieval service provides that learning loop — when a new evaluation comes in, it can find similar past evaluations, their scores, and their outcomes. The secret ingredient: pgvector on PostgreSQL.

How the RAG Pipeline Works

The retrieval flow is elegant in its simplicity. When evaluation data arrives, it gets embedded using a configurable model (currently text-embedding-3-large). Those embeddings are stored in PostgreSQL with pgvector, creating a searchable vector space of all past evaluations.

When new tasks arrive, the system finds similar past evaluations via cosine similarity. Those results inform scoring calibration and rubric selection — if a similar task scored poorly with one rubric, we might try a different approach. The magic happens in a single SQL query:

SELECT 
  evaluation_id,
  score,
  outcome,
  1 - (embedding <=> $1::vector) AS similarity
FROM evaluations
WHERE embedding <=> $1::vector < 0.3
ORDER BY embedding <=> $1::vector
LIMIT 10;

The <=> operator computes cosine distance, and pgvector's IVFFlat index makes this fast even with millions of evaluations.
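For intuition, cosine distance is 1 minus the cosine similarity of the two vectors. A minimal Python sketch (an illustration, not the service's code) of what `<=>` computes:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector's <=> operator computes it: 1 - cos(theta)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Identical directions give distance 0; orthogonal vectors give distance 1.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # → 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```

This is why the query's `WHERE ... < 0.3` clause reads as "keep only evaluations whose similarity exceeds 0.7."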

The Midnight Deprecation Crisis

The embedding model text-embedding-004 was deprecated overnight. Every query started returning 404. No warning, no grace period — just broken.

We needed to make the model configurable via EMBED_MODEL and EMBED_DIMENSIONS environment variables, but we couldn't afford downtime. The migration strategy had three parts:

First, we added side-by-side columns in the database — embedding_v1 and embedding_v2 — so we could roll back if the new model performed worse. Then we built a batch re-embedding pipeline that processed documents in chunks, updating embedding_v2 while queries continued using embedding_v1.
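The backfill loop can be sketched in a few lines. This is a minimal illustration, not the production pipeline: `embed_batch` is a hypothetical stand-in for the embedding API call, and the rows are plain dicts standing in for database records.

```python
from typing import Callable, Iterator

def chunked(items: list, size: int) -> Iterator[list]:
    """Yield successive fixed-size chunks of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def backfill_v2(rows: list[dict],
                embed_batch: Callable[[list[str]], list[list[float]]],
                batch_size: int = 100) -> None:
    """Fill embedding_v2 for rows that lack it, batch by batch,
    while readers keep using embedding_v1 untouched."""
    pending = [r for r in rows if r.get("embedding_v2") is None]
    for batch in chunked(pending, batch_size):
        vectors = embed_batch([r["text"] for r in batch])
        for row, vec in zip(batch, vectors):
            row["embedding_v2"] = vec  # in the real service: a batched UPDATE
```

Because only `embedding_v2` is written, a crash mid-backfill is harmless: queries still read `embedding_v1`, and the next run picks up where the last one stopped.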

Finally, we added a feature flag to switch between columns. When we were confident the new embeddings worked, we flipped the flag. Zero downtime. The service now supports any embedding model via configuration — change the env var, re-embed, and switch.
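The flag itself can be as small as an environment variable naming the active column. A hypothetical sketch (the variable name `EMBED_ACTIVE_COLUMN` is an assumption, not the service's actual flag):

```python
import os

# Only known column names may reach SQL; never interpolate untrusted input.
ALLOWED_COLUMNS = {"embedding_v1", "embedding_v2"}

def active_embedding_column() -> str:
    """Read the feature flag deciding which embedding column queries use."""
    col = os.environ.get("EMBED_ACTIVE_COLUMN", "embedding_v1")
    if col not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown embedding column: {col}")
    return col

def similarity_query(col: str) -> str:
    """Build the retrieval query against the active column."""
    return (f"SELECT evaluation_id, score, outcome, "
            f"1 - ({col} <=> $1::vector) AS similarity "
            f"FROM evaluations ORDER BY {col} <=> $1::vector LIMIT 10")
```

Validating against an allow-list matters here: column names can't be bound as query parameters, so they must never come from unchecked input.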

Provenance Tracking for Calibration

CLI evaluation agents needed to know where data came from to calibrate their scoring. If a similar evaluation scored 0.8, was that from a production run or a test? Was it from the same agent model or a different one?

I added provenance and source_type fields to the retrieval endpoint. The insight: optional return fields save bandwidth when you don't need lineage data, but they're critical for calibration workflows. The endpoint accepts an include_provenance query parameter — when false, we skip those columns entirely.
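The column selection can be sketched as a simple conditional list (a hypothetical helper, not the service's actual code):

```python
BASE_COLUMNS = ["evaluation_id", "score", "outcome"]
PROVENANCE_COLUMNS = ["provenance", "source_type"]

def retrieval_columns(include_provenance: bool) -> list[str]:
    """Select only the columns the caller asked for; skipping the
    provenance columns keeps lineage data out of both query and response."""
    cols = list(BASE_COLUMNS)
    if include_provenance:
        cols += PROVENANCE_COLUMNS
    return cols
```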

This small change enabled downstream services to build calibration logic that learns from similar evaluations while respecting their context.

The Null Serialization Trap

The response included "provenance": null which broke downstream consumers expecting the field to be absent, not null. In microservice architectures, API contracts matter — a field that's null vs. a field that's missing sends different signals.

The fix was simple but illustrative: response_model_exclude_none=True in the FastAPI endpoint. Now, when provenance isn't requested, the field doesn't appear in the JSON at all. When it is requested but unavailable, consumers still see an explicit null, and in that case null is the expected signal.
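The effect of excluding None-valued fields at serialization time can be sketched in plain Python (a shallow illustration of the behavior, not FastAPI's implementation):

```python
def exclude_none(payload: dict) -> dict:
    """Drop top-level keys whose value is None, mirroring what FastAPI's
    response_model_exclude_none=True does when serializing a response."""
    return {k: v for k, v in payload.items() if v is not None}

row = {"evaluation_id": "e1", "score": 0.8, "provenance": None}
print(exclude_none(row))  # → {'evaluation_id': 'e1', 'score': 0.8}
```

Consumers that check `"provenance" in response` now get a clean "field absent" signal instead of an ambiguous null.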

This taught me that API design isn't just about what you include — it's about what you exclude, and how you signal absence vs. unavailability.

Infrastructure

The service runs on Cloud SQL PostgreSQL 17 with the pgvector extension. We use an IVFFlat index with cosine similarity, tuned with 100 lists for our dataset size. The index gets rebuilt daily via Cloud Scheduler to maintain query performance as new evaluations arrive.
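The pgvector documentation's rule of thumb for sizing IVFFlat is `lists ≈ rows / 1000` up to about a million rows, and `sqrt(rows)` beyond that. A small helper to compute it (an illustration, not the service's code):

```python
import math

def ivfflat_lists(row_count: int) -> int:
    """Suggested IVFFlat `lists` per pgvector's rule of thumb:
    rows/1000 below ~1M rows, sqrt(rows) above."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return int(math.sqrt(row_count))

print(ivfflat_lists(100_000))  # → 100, matching the 100 lists used here
```

The daily rebuild matters for the same reason the formula does: IVFFlat clusters are computed from the data present at index build time, so as evaluations accumulate, a periodic rebuild keeps the clustering (and recall) representative.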

Deployment is straightforward: Cloud Run with autoscaling based on request volume. The embedding model runs via OpenAI's API (configurable to any compatible provider), and the service handles rate limiting and retries internally.