Building an AI Evaluation Platform on GCP: Architecture of a Multi-Cluster System

Humza Tareen · 10 min read
GCP Cloud Run GKE Pub/Sub Cloud Tasks PostgreSQL Python FastAPI

The Problem

How do you evaluate whether an AI coding agent is actually good at writing code?

Not with a single benchmark. Not with a leaderboard. You need a platform that can run thousands of evaluation tasks concurrently, execute untrusted code in sandboxed environments, judge outputs using multiple LLMs (GPT-4o, Claude Sonnet, Gemini), track agent performance over time with continuous learning, and scale to zero when idle.

Over the past 6 months, I built this platform. This article is an honest look at the architecture, the decisions that worked, the ones that didn't, and the production incidents that taught me the most.

The Architecture

Service Map

The platform consists of ten core services, most deployed as auto-scaling Cloud Run services (the heavier execution workloads run on GKE; more on that split below). The main ones:

Core Evaluation Pipeline:

  • Evaluation Engine — The brain. Manages task lifecycle, agent orchestration, and result aggregation.
  • Auto-Rating Service — Scores LLM outputs using configurable evaluation criteria. OpenAI-compatible API with LiteLLM proxy for model-agnostic scoring.
  • RAG Retrieval Service — Continuous learning from past evaluations using pgvector similarity search on PostgreSQL.
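As a rough sketch of what the RAG service's similarity lookup can look like (the table and column names below are illustrative, not the platform's actual schema), pgvector exposes distance operators you can order by:

```python
def build_similarity_query(table: str, k: int = 5) -> str:
    """Build a pgvector nearest-neighbour query.

    <=> is pgvector's cosine-distance operator; %s is a psycopg-style
    placeholder for the query embedding (a vector literal).
    """
    return (
        f"SELECT id, content, embedding <=> %s AS distance "
        f"FROM {table} "
        f"ORDER BY embedding <=> %s "
        f"LIMIT {k}"
    )

# Hypothetical usage with a psycopg cursor (not executed here):
# cur.execute(build_similarity_query("eval_memories"), (query_vec, query_vec))
print(build_similarity_query("eval_memories", k=3))
```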

Execution & Training:

  • RL Training Arena — Gym-style reinforcement learning environment. Agents execute in Cloud Build sandboxes, earn rewards based on test results.
  • Preprocessor Service — 15-step task preparation pipeline. Analyzes repositories, generates verifiers, establishes baselines.
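Structurally, the preprocessor's 15-step pipeline is an ordered list of steps threading a shared task context. A minimal sketch of that shape (the step names here are illustrative, not the service's real ones):

```python
from typing import Callable

Step = Callable[[dict], dict]

def run_pipeline(steps: list[Step], ctx: dict) -> dict:
    """Run each step in order, threading the task context through."""
    for step in steps:
        ctx = step(ctx)
    return ctx

# Illustrative steps; the real pipeline chains 15 of these.
def clone_repo(ctx):   return {**ctx, "repo_cloned": True}
def gen_verifier(ctx): return {**ctx, "verifier": f"test_{ctx['task_id']}.py"}
def baseline(ctx):     return {**ctx, "baseline_passed": True}

result = run_pipeline([clone_repo, gen_verifier, baseline], {"task_id": "t42"})
print(result)
```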

Platform Services:

  • Auth Gateway — Authentication (Google OIDC, JWT, API keys), authorization, and API gateway with Kong integration.
  • Notification Service — Event-driven delivery with Cloud Tasks, exponential backoff, webhook dispatch.
  • Workflow Orchestration — HITL workflows with WebSocket support for real-time updates.
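Cloud Tasks applies the notification service's exponential backoff server-side through the queue's retry configuration; the sketch below only illustrates the shape of the resulting schedule (the base and cap are illustrative, not our actual config):

```python
def backoff_schedule(attempts: int, base: float = 1.0, cap: float = 300.0) -> list[float]:
    """Delay in seconds before each retry: base * 2^n, capped."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]

print(backoff_schedule(6))  # → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
```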

Infrastructure Layer

┌──────────────────────────────────────────────────┐
│          Cloud Run (multiple services)           │
│  Auto-scaling: 0-100 instances per service       │
│  Regions: us-central1, us-west1, asia-southeast1 │
├──────────────────────────────────────────────────┤
│       GKE (several clusters, auto-scaling)       │
│  - CLI Eval: n2-standard-8 + n2-highmem-4        │
│  - OSWorld: e2-medium + n2-standard-4            │
├──────────────────────────────────────────────────┤
│                    Data Layer                    │
│  - AlloyDB (PostgreSQL, Regional HA)             │
│  - Multiple Firestore databases                  │
│  - Redis (Memorystore) for caching               │
│  - pgvector for embedding similarity search      │
├──────────────────────────────────────────────────┤
│              Event & Orchestration               │
│  - Dozens of Pub/Sub topics (with DLQs)          │
│  - Cloud Tasks for reliable async execution      │
│  - Cloud Scheduler for daily automation          │
│  - GCS buckets for artifacts & results           │
├──────────────────────────────────────────────────┤
│                  Observability                   │
│  - Structured logging with correlation IDs       │
│  - Error rate alerts (>5% threshold)             │
│  - Cloud Monitoring dashboards                   │
└──────────────────────────────────────────────────┘

Decision 1: Cloud Run vs. GKE

We use both, and the split is deliberate:

Cloud Run for stateless request-response services (API gateway, rating, notifications). They scale to zero, bill per request, and require zero Kubernetes knowledge to deploy.

GKE for long-running, stateful workloads (agent execution, RL training). These need persistent connections, GPU access, and fine-grained resource control.

The mistake we almost made: putting everything on GKE because "it's more powerful." GKE is more powerful, but also more expensive when idle. Our notification service gets maybe 100 requests/hour off-peak. On GKE, that's a node running 24/7 for nothing.

Decision 2: AlloyDB + Firestore (Not Either/Or)

We use PostgreSQL (AlloyDB) for relational data — evaluation results, user data, task configurations. We use Firestore for real-time state — agent run progress, live dashboard updates, trajectory logging.

Why both? Evaluation results need ACID transactions, complex JOINs, and structured queries. Agent execution progress needs sub-second writes from multiple concurrent agents, real-time listeners for the frontend, and schema flexibility.

The gotcha: Firestore has a (default) database that catches everything if you're not careful. We had evaluation data routing to the default database instead of the tenant-specific one. Fix: explicit database references everywhere, never rely on defaults.
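One way to enforce this is a tiny helper that refuses to fall back to the default database. The naming scheme below is hypothetical, not our actual tenant layout:

```python
def tenant_database_id(tenant: str) -> str:
    """Map a tenant to its dedicated Firestore database id, never '(default)'."""
    if not tenant or tenant == "(default)":
        raise ValueError("explicit tenant database required; refusing '(default)'")
    return f"evals-{tenant}"

# Hypothetical usage (google-cloud-firestore accepts an explicit
# `database` argument; needs credentials, so not executed here):
# from google.cloud import firestore
# db = firestore.Client(database=tenant_database_id("acme"))
print(tenant_database_id("acme"))  # → evals-acme
```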

The Hardest Bug: AlloyDB SSL Connection Drops

This was the bug that taught me the most about cloud infrastructure.

Symptoms: Long-running evaluation tasks (15+ minutes) would fail with "SSL connection has been closed unexpectedly." Not every time — roughly 5% of tasks.

Root cause: AlloyDB terminates idle SSL connections. Our SQLAlchemy connection pool was set to recycle connections every 900 seconds (15 minutes). But AlloyDB was killing them at ~10 minutes of inactivity.

Fix:

from sqlalchemy import create_engine

engine = create_engine(
    DATABASE_URL,
    pool_recycle=180,      # Recycle well before AlloyDB's ~10-minute idle timeout
    pool_pre_ping=True,    # Validate each connection on checkout
    pool_size=5,           # Don't hold more connections than you need
    max_overflow=10,       # Allow bursts without a stampede
)

We went from ~5% random task failures to 0.01% after this change.

Idempotency: The Pattern That Saved Us

Cloud Tasks delivers messages at-least-once. That means your handler WILL be called multiple times for the same task. Our pattern: compute a deterministic ID from the task payload, use INSERT ... ON CONFLICT DO NOTHING, return success even if the row already exists.
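A minimal sketch of the pattern, using SQLite so it runs standalone (the platform uses PostgreSQL, where INSERT ... ON CONFLICT DO NOTHING behaves the same way); the table and payload are illustrative:

```python
import hashlib
import json
import sqlite3

def task_id(payload: dict) -> str:
    """Deterministic ID: hash of the canonical (sorted-key) JSON payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def handle(conn: sqlite3.Connection, payload: dict) -> bool:
    """Process a task at most once; redeliveries become no-ops."""
    tid = task_id(payload)
    cur = conn.execute(
        "INSERT INTO processed_tasks (id) VALUES (?) ON CONFLICT (id) DO NOTHING",
        (tid,),
    )
    conn.commit()
    return cur.rowcount == 1  # True only on the first delivery

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_tasks (id TEXT PRIMARY KEY)")
payload = {"eval_id": "e1", "agent": "a1"}
print(handle(conn, payload))  # first delivery does the work
print(handle(conn, payload))  # redelivery is a no-op
```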

This one pattern eliminated an entire category of bugs across the platform.

What I'd Do Differently

Connection pooling from day one. Would have avoided the AlloyDB crisis entirely. We learned the hard way that managed databases have their own connection lifecycle, and if you don't configure your pool correctly, you'll hit SSL connection drops under load. Setting pool_recycle and pool_pre_ping from the start would have saved us weeks of debugging.

Explicit database references everywhere. Would have prevented the Firestore trap described above: evaluation data silently routing to the (default) database instead of the tenant-specific one. The fix was simple (pass an explicit database reference in every client), but we should have made it a convention from day one.

Security audit earlier, not after 6 months of shipping. We accumulated security debt silently. SQL injection vulnerabilities, hardcoded credentials, PII in logs—all things that would have been caught in a systematic audit. The two-day audit we did eventually uncovered critical vulnerabilities across the entire platform. Doing it at month 2 instead of month 6 would have saved us from potential production incidents.

Event-driven first, REST as the exception. We retrofitted events onto some services that should have been event-driven from the start. The notification service started as a REST API, then we added Pub/Sub subscribers. The RAG service started synchronous, then we added async processing. Starting with events would have made the architecture cleaner and more scalable from day one.
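Retrofitting was mostly mechanical: a Pub/Sub push subscription POSTs a JSON envelope with a base64-encoded payload, so each REST handler only needed a thin decoding layer in front of its existing logic. A minimal sketch of that decoding (envelope shape per the Pub/Sub push format; the event body is illustrative):

```python
import base64
import json

def decode_push_envelope(envelope: dict) -> dict:
    """Extract and decode the message data from a Pub/Sub push request body."""
    message = envelope["message"]
    data = base64.b64decode(message["data"]).decode("utf-8")
    return json.loads(data)

# Illustrative envelope, as Pub/Sub would POST it to the endpoint:
envelope = {
    "message": {
        "data": base64.b64encode(b'{"event": "eval.finished", "task": "t42"}').decode(),
        "messageId": "123",
    },
    "subscription": "projects/p/subscriptions/s",
}
print(decode_push_envelope(envelope))  # → {'event': 'eval.finished', 'task': 't42'}
```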

Tech stack: Python (FastAPI, SQLAlchemy, Pydantic), TypeScript (React), GCP (Cloud Run, GKE, Cloud Tasks, Pub/Sub, AlloyDB, Firestore, Cloud Build, GCS, Cloud Scheduler, Secret Manager, Cloud Monitoring), PostgreSQL, Redis, Docker, GitHub Actions.