Building the Core AI Evaluation Engine: AlloyDB Fixes, Autoscaling, and a Headless API
The evaluation engine is the brain of the platform. Every AI agent evaluation — task creation, execution orchestration, LLM-based scoring, result aggregation — flows through this service. Over four months, I was the primary contributor, and the work taught me more about production infrastructure than any textbook could.
The Architecture
┌─────────────────────────────────────────────────────────┐
│              Evaluation Engine (Cloud Run)              │
│             FastAPI + SQLAlchemy + Pydantic             │
├────────────────────────────┬────────────────────────────┤
│ Agent Plane (GKE)          │ Judge Pipeline (Cloud Run) │
│ Claude, GPT, Gemini        │ Multi-LLM scoring          │
├────────────────────────────┼────────────────────────────┤
│ AlloyDB (PostgreSQL 17)    │ Firestore (realtime)       │
│ Cloud Tasks                │ Pub/Sub                    │
│ Cloud Scheduler            │ GCS buckets                │
└────────────────────────────┴────────────────────────────┘
The AlloyDB SSL Crisis
This was a detective story. Our evaluation tasks — which run 5-30 minutes — were randomly failing at a 5% rate with "SSL connection has been closed unexpectedly" errors. The failures were intermittent, making them hard to reproduce.
First, we added retries. That was symptom treatment: it helped, but didn't fix the root cause. Then we suppressed errors on session close to surface the real signal. During database pool alignment work we discovered a Firestore bug (scores writing to the wrong database), but that turned out to be a separate issue.
We implemented targeted retries for transient SSL disconnects, hardened sessions, and added better logging. The breakthrough came when we discovered the timing mismatch: AlloyDB kills idle connections at ~10 minutes, but SQLAlchemy's connection pool recycled at 15 minutes. That 5-minute window meant stale connections were being handed out.
The fix was two lines of config:

```python
# Before
engine = create_engine(
    DATABASE_URL,
    pool_size=10,
)

# After
engine = create_engine(
    DATABASE_URL,
    pool_size=10,
    pool_recycle=180,    # Recycle at 180s, well inside AlloyDB's ~10-minute (600s) idle timeout
    pool_pre_ping=True,  # Verify the connection is alive before handing it out
)
```
Result: 5% failure rate → 0.01%. It took multiple iterations over two weeks to understand one infrastructure quirk.
The Firestore Default Database Trap
Firestore has a (default) database that silently catches everything you don't explicitly route elsewhere. We had multiple databases for different services, but several code paths were writing to (default) without anyone realizing.
The fix required updating every Firestore reference across backend and frontend. Here's the difference:
```python
from google.cloud import firestore

# Implicit - writes go to the (default) database
db = firestore.Client()
collection = db.collection('scores')

# Explicit - writes go to the intended database
db = firestore.Client(database='evaluations-db')
collection = db.collection('scores')
```
The same fix extended to the React frontend, where real-time listeners were reading from the wrong database and showing stale UI data. The lesson: explicit is better than implicit, especially in distributed systems.
Continuous Learning Pipeline
This pipeline connects evaluation results back to the RAG service so future evaluations can learn from past outcomes. The gap: scored evaluations weren't flowing back into pgvector for similarity search.
When an evaluation completes, it now gets embedded and stored in the RAG service. Future evaluations can query "show me similar tasks and how they scored" to inform rubric selection and calibration. This closes the learning loop — the platform gets smarter with each evaluation.
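The loop can be sketched in miniature. This is a toy, in-memory stand-in for the real flow — in production the embeddings come from an embedding model and live in pgvector, and `embed`, `record_evaluation`, and `similar_evaluations` are hypothetical names, not the platform's actual API:

```python
import math

# In-memory stand-in for the pgvector-backed store: (embedding, task_text, score)
_store = []

def embed(text):
    # Toy embedding: character-frequency vector (placeholder for a real model)
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def record_evaluation(task_text, score):
    # On evaluation completion: embed the task and store it alongside its score
    _store.append((embed(task_text), task_text, score))

def similar_evaluations(task_text, k=3):
    # "Show me similar tasks and how they scored"
    query = embed(task_text)
    ranked = sorted(_store, key=lambda row: cosine(query, row[0]), reverse=True)
    return [(text, score) for _, text, score in ranked[:k]]
```

The same shape applies with real components: swap the list for a pgvector table and the toy `embed` for a model call, and the similarity query becomes an `ORDER BY embedding <=> query` in SQL.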
The Headless API
Why it matters: external teams can now trigger evaluations as part of their CI/CD pipelines. Decoupling evaluation from the web UI opened up integration possibilities.
Before, evaluations were UI-only. Now, any service can POST to /api/v1/evaluations with a task definition and get back evaluation results. Teams integrate agent evaluation into their release process — run evaluations on every commit, gate deployments on score thresholds, track performance over time.
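A CI/CD integration might look like the following sketch. The endpoint path comes from the post; the payload field names, the `aggregate_score` response key, and the bearer-token auth are assumptions:

```python
import json
import urllib.request

def trigger_evaluation(base_url, task_definition, api_token):
    """POST a task definition to the headless API and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/evaluations",
        data=json.dumps({"task": task_definition}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def gate_deployment(results, threshold=0.8):
    # CI/CD gate: pass only if the aggregate score clears the threshold
    return results.get("aggregate_score", 0.0) >= threshold
```

In a pipeline, a failing `gate_deployment` check would fail the build, blocking the release until agent performance recovers.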
The headless API transformed the product from a single-purpose tool into a platform.
KEDA Autoscaling on GKE
The goal: scale agent execution pods based on Pub/Sub message backlog. The gotcha: idleReplicaCount was misconfigured, preventing scale-to-zero, and warm replicas weren't being maintained.
After fixing the KEDA ScaledObject configuration, the cluster properly scales from zero to dozens of nodes based on queue depth. When evaluations queue up, pods spin up. When the queue empties, pods scale down to zero, saving costs.
Autoscaling sounds simple until you realize it's balancing latency (warm replicas) vs. cost (scale-to-zero). The configuration is a compromise between these forces.
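A corrected ScaledObject might look like this sketch. The deployment name, subscription name, and replica counts are placeholders; note that KEDA only accepts 0 for idleReplicaCount, which is what allows scale-to-zero while minReplicaCount keeps a warm replica during active periods:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-executor
spec:
  scaleTargetRef:
    name: agent-executor        # placeholder Deployment name
  idleReplicaCount: 0           # scale to zero when the queue is fully idle
  minReplicaCount: 1            # keep one warm replica while work is flowing
  maxReplicaCount: 40
  triggers:
    - type: gcp-pubsub
      metadata:
        subscriptionName: eval-tasks-sub   # placeholder subscription
        mode: SubscriptionSize
        value: "5"              # target ~5 undelivered messages per replica
```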
Security Audit Findings
Critical findings: Firestore rules allowing unauthenticated access, production Kubernetes using :latest tags, excessive TypeScript any usage, conflicting HPA and KEDA autoscaling.
The Firestore rules were the most concerning — anyone could read or write evaluation data. We locked down rules to require authentication and proper IAM roles. The :latest tags meant deployments weren't reproducible — we moved to semantic versioning.
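The locked-down rules followed roughly this shape — a sketch, with collection structure simplified to a catch-all match; real rules would scope per collection and check roles, not just authentication:

```
rules_version = '2';
service cloud.firestore {
  match /databases/{database}/documents {
    // Deny-by-default posture: only authenticated users may touch evaluation data
    match /{document=**} {
      allow read, write: if request.auth != null;
    }
  }
}
```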
The conflicting autoscaling was subtle: both HPA and KEDA were trying to scale the same pods, causing thrashing. We disabled HPA and let KEDA handle it.