
Zero-Downtime Embedding Migration: Switching Models in a Production RAG System

Humza Tareen · 7 min read
RAG AI Embeddings PostgreSQL Python

Our embedding model was deprecated overnight, and every RAG query in our AI evaluation platform started returning 404s. This wasn't just one feature breaking; it was the platform's entire learning system going down. The RAG service powers continuous learning from past evaluations, similarity search for task recommendations, and context retrieval for agent prompts. We had 48 hours to migrate to a new model, re-embed every document, and not break a single production workflow.

Step 1: Make the Model Configurable

Why environment variables matter: you ship the config change, not a code change. Two environment variables:

import os

EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-large")
EMBED_DIMENSIONS = int(os.getenv("EMBED_DIMENSIONS", "768"))

This is what makes the difference between a 2-day migration and a 2-week one. If the model name were hardcoded in 20 files, we'd need to change code, test, deploy, and hope nothing broke. With env vars, we change the config, deploy once, and flip the switch. The code doesn't care which model it's using; it just reads the config.
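To make the abstraction concrete, here is a minimal sketch of the kind of embedding helper this enables. It assumes an OpenAI-style client exposing embeddings.create() (the `dimensions` parameter lets text-embedding-3 models return shortened vectors); the `embed` function and `client` argument are illustrative, not our actual code:

```python
import os

# Same two knobs as above; repeated here so the sketch is self-contained.
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-large")
EMBED_DIMENSIONS = int(os.getenv("EMBED_DIMENSIONS", "768"))


def embed(texts, client):
    """Embed a batch of texts with whatever model the environment selects.

    `client` is assumed to be an OpenAI-style client exposing
    embeddings.create(model=..., input=..., dimensions=...);
    swap in your provider's SDK as needed.
    """
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        dimensions=EMBED_DIMENSIONS,
    )
    return [item.embedding for item in response.data]
```

Because nothing outside this function ever mentions a model name, switching providers is a config change, not a refactor.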

Step 2: Add New Columns (Don't Replace)

The CONCURRENTLY keyword is critical. Without it, PostgreSQL locks the table during index creation. With millions of documents, that lock could last hours—production reads would freeze. Here's what we ran:

ALTER TABLE documents
ADD COLUMN embedding_v2 vector(768);

CREATE INDEX CONCURRENTLY idx_documents_embedding_v2
ON documents USING ivfflat (embedding_v2 vector_cosine_ops)
WITH (lists = 100);

What happens without CONCURRENTLY: table locks, production freeze, users see timeouts. With CONCURRENTLY: index builds in the background, production reads continue uninterrupted, zero downtime. The trade-off is that concurrent index creation takes longer and uses more resources, but it's worth it for production systems.
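One gotcha if you script this step: CREATE INDEX CONCURRENTLY refuses to run inside a transaction block, and drivers like psycopg2 wrap statements in a transaction by default, so the connection must be in autocommit mode. A hedged sketch of what a migration script might look like (the `run_migration` function is illustrative; also note that ivfflat computes its clusters from existing rows, so building the index after the backfill can give better recall):

```python
def run_migration(conn):
    """Add the v2 column and build its index without blocking reads.

    CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
    so the connection must be in autocommit mode first.
    """
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute(
        "ALTER TABLE documents ADD COLUMN IF NOT EXISTS embedding_v2 vector(768);"
    )
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_documents_embedding_v2 "
        "ON documents USING ivfflat (embedding_v2 vector_cosine_ops) "
        "WITH (lists = 100);"
    )
```

Running the same DDL from psql works too; the autocommit requirement only bites when the migration is driven from application code.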

Step 3: Batch Re-embedding

Batch processing is essential. Don't embed one document at a time: that's slow and hits rate limits. Don't embed all documents in one giant transaction: that holds locks for too long. Batch size matters, too. Too small and you make too many API calls; too large and you hold database transactions open too long. We settled on batches of 1000 documents as the sweet spot.

Rate limiting against the embedding API: we added exponential backoff when hitting rate limits. Progress tracking: we logged every 10,000 documents so we could monitor progress. Per-batch commits: each batch of 1000 documents was committed independently. If the process crashed at document 50,000, we didn't lose the first 49,000. We could resume from checkpoint 50,000.
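Putting the three pieces together (batching, backoff, per-batch commits), the loop looks roughly like this. It's a sketch, not our production code: `fetch_batch` and `embed_batch` are hypothetical stand-ins for the data-access and embedding-API layers, and RuntimeError stands in for the API's real rate-limit exception:

```python
import time

BATCH_SIZE = 1000
LOG_EVERY = 10_000


def reembed_all(conn, fetch_batch, embed_batch, max_retries=5):
    """Re-embed documents batch by batch, committing after each batch.

    fetch_batch(offset, size) -> list of (doc_id, text);
    embed_batch(texts) -> list of vectors. Both are stand-ins for
    your own data access and embedding-API code.
    """
    done = 0
    while True:
        batch = fetch_batch(done, BATCH_SIZE)
        if not batch:
            break
        texts = [text for _, text in batch]
        # Exponential backoff when the embedding API rate-limits us.
        for attempt in range(max_retries):
            try:
                vectors = embed_batch(texts)
                break
            except RuntimeError:  # stand-in for the API's rate-limit error
                time.sleep(2 ** attempt)
        else:
            raise RuntimeError(f"giving up after {max_retries} retries at offset {done}")
        cur = conn.cursor()
        for (doc_id, _), vec in zip(batch, vectors):
            cur.execute(
                "UPDATE documents SET embedding_v2 = %s WHERE id = %s",
                (vec, doc_id),
            )
        conn.commit()  # checkpoint: a crash loses at most one batch
        done += len(batch)
        if done % LOG_EVERY == 0:
            print(f"re-embedded {done} documents")
```

The per-batch commit is what makes the process resumable: after a crash, restarting from the last committed offset re-does at most one batch of work.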

Step 4: Feature Flag Switch

The feature flag pattern is what made zero-downtime possible. Deploy with USE_V2_EMBEDDINGS=false. The code checks the flag:

import os

def get_embedding_column():
    if os.getenv("USE_V2_EMBEDDINGS", "false").lower() == "true":
        return "embedding_v2"
    return "embedding"

Verify everything works with v1 still active. Flip to true. If anything breaks, flip back instantly. No code deployment needed—just change the environment variable. This pattern is what separates production-ready migrations from risky ones.
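For context, here's a sketch of how that flag might feed into query construction; `similarity_query` and its table/column names are illustrative, and the flag-reading function is repeated so the sketch runs on its own:

```python
import os


def get_embedding_column():
    if os.getenv("USE_V2_EMBEDDINGS", "false").lower() == "true":
        return "embedding_v2"
    return "embedding"


def similarity_query(top_k=5):
    """Build the nearest-neighbour query against whichever column is live.

    The column name comes from our own flag, never from user input,
    so interpolating it into the SQL string is safe here; the query
    vector itself is still passed as a bound parameter (%s).
    """
    col = get_embedding_column()
    return (
        f"SELECT id, content FROM documents "
        f"ORDER BY {col} <=> %s LIMIT {top_k}"
    )
```

Every read path goes through one function, so the flag flip changes the whole system's behavior atomically.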

Step 5: Validation

We compared search results between old and new embeddings. Average overlap was 82%—different models produce different embeddings, but the top results were comparable. What 82% overlap means practically: for a given query, if the old model returned documents A, B, C, D, E as the top 5 results, the new model returned A, B, D, E, F. Four out of five results were the same. The fifth was different, but still relevant. This validation gave us confidence that the migration wouldn't break user workflows.
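The overlap metric itself is simple: the fraction of document ids shared between the two ranked top-k lists. A minimal sketch (the function name is ours, not from the platform's code):

```python
def top_k_overlap(old_results, new_results):
    """Fraction of top-k document ids shared between two ranked result lists."""
    old_set, new_set = set(old_results), set(new_results)
    return len(old_set & new_set) / max(len(old_set), 1)
```

For the example above, the old top 5 (A, B, C, D, E) and the new top 5 (A, B, D, E, F) share four ids, giving an overlap of 0.8.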

Lessons Learned

Always abstract the embedding provider. Two env vars saved us from a multi-file refactor: instead of touching 20 call sites, testing each change, and hoping nothing broke, we changed config and flipped a switch.

Add model version tracking to stored vectors. We didn't. We should have. If we had a model_version column, we could have queried "which documents were embedded with which model" and migrated selectively. Without it, we had to re-embed everything.
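With that hypothetical model_version column, the selective migration query could be as small as this sketch (the function and column are what we wish we had, not what existed):

```python
def docs_needing_migration(conn, target_version):
    """Return ids of documents not yet embedded with the target model.

    Assumes a model_version column stored alongside each vector;
    IS DISTINCT FROM also catches rows where the version is NULL.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT id FROM documents WHERE model_version IS DISTINCT FROM %s",
        (target_version,),
    )
    return [row[0] for row in cur.fetchall()]
```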

Side-by-side columns > in-place replacement. The rollback story is instant. If v2 embeddings cause issues, flip the feature flag back to v1. The old embeddings are still there, still indexed, still usable. In-place replacement would have required restoring from backup.

Dry-run everything. Our validation caught 3 queries with low overlap that needed investigation. We ran the same queries against both embedding versions, compared results, and identified edge cases before flipping the switch. This proactive validation prevented production issues.

Total impact: 48 hours, zero downtime, zero data loss. The entire platform's learning system migrated to a new embedding model without breaking a single production workflow. The key was planning for failure: feature flags for instant rollback, concurrent index creation for zero downtime, batch processing with checkpoints for resilience, and validation to catch issues before they reach production.