
Building an Automated LLM Scoring Service: From Prototype to Production

·10 min read·Humza Tareen
Cloud Run · Cloud Tasks · Redis · LiteLLM · Python · FastAPI

When an AI agent writes code, someone has to judge whether it's good. In our platform, that "someone" is often another LLM. The Auto-Rater takes an agent's output, runs it through evaluation strategies (direct scoring, chain-of-thought analysis, multi-model debate), and produces structured quality assessments. I took this service from a fragile prototype to a production system that handles thousands of evaluations daily.

The v1 Reality Check

The prototype worked, but production exposed its fragility. Hardcoded model names meant changing a model broke the whole service. No idempotency meant Cloud Tasks retries created duplicate evaluations. print() statements everywhere were invisible in Cloud Run's structured logging. The Redis client had no timeouts, so connection loss meant hangs forever. The database pool was sized for a laptop, not production load.

v1 was a proof of concept. v2 needed to be production-ready.

Making Cloud Tasks Idempotent

The most impactful change. Cloud Tasks delivers at-least-once, meaning your handler can be invoked more than once for the same task. Without idempotency, every retry creates a duplicate evaluation.

The fix: deterministic UUIDs generated from the payload hash, with INSERT ... ON CONFLICT DO NOTHING:

import hashlib
import json
import uuid

def generate_evaluation_id(payload: dict) -> str:
    """Deterministic ID from payload hash."""
    payload_str = json.dumps(payload, sort_keys=True)
    # uuid.UUID needs 32 hex digits; fewer raises ValueError
    payload_hash = hashlib.sha256(payload_str.encode()).hexdigest()[:32]
    return str(uuid.UUID(payload_hash))

async def create_evaluation(payload: dict):
    eval_id = generate_evaluation_id(payload)
    
    # Idempotent insert - if ID exists, do nothing
    await db.execute(
        """
        INSERT INTO evaluations (id, payload, status)
        VALUES (:id, :payload, 'pending')
        ON CONFLICT (id) DO NOTHING
        """,
        {"id": eval_id, "payload": json.dumps(payload)}
    )

Zero duplicates since deployment. The same payload always generates the same ID, so retries are harmless.
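A quick property check of the hashing scheme (taking the first 32 hex digits of the digest, which `uuid.UUID` requires): the ID is invariant to key order in the payload, and changes with any change to the payload contents.

```python
import hashlib
import json
import uuid

def evaluation_id(payload: dict) -> str:
    # Canonical JSON (sorted keys) -> SHA-256 -> first 32 hex digits as a UUID
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return str(uuid.UUID(digest[:32]))

a = evaluation_id({"task": "score", "run": 7})
b = evaluation_id({"run": 7, "task": "score"})   # same payload, keys reordered
c = evaluation_id({"task": "score", "run": 8})   # one value changed
```

Because the ID is a pure function of the payload, a Cloud Tasks redelivery carries the same ID, and the ON CONFLICT clause turns the retry into a no-op.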

The Logging Overhaul

Replacing print() with structured logging sounds simple. In practice, it meant replacing every logging call, adding env-driven config (different log levels per environment), threading correlation IDs from the X-Correlation-ID header, wiring up cross-service correlation (trace a request from API → auto-rater → notification service), and stripping PII.

Structured logs look like this:

{
  "timestamp": "2026-02-08T10:23:45Z",
  "level": "INFO",
  "service": "auto-rater",
  "correlation_id": "req-abc123",
  "evaluation_id": "eval-xyz789",
  "message": "Starting chain-of-thought scoring",
  "model": "gpt-4o",
  "strategy": "cot"
}

Now we can trace a request across services, filter by evaluation ID, and search logs by correlation ID. Debugging production issues went from "search for a string" to "query structured data."
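A minimal sketch of a formatter that emits entries like the one above, using stdlib logging plus a contextvar for the per-request correlation ID. The names (`JsonFormatter`, the `fields` extra) are illustrative, not the service's actual code:

```python
import json
import logging
import sys
from contextvars import ContextVar

# Set once per request from the X-Correlation-ID header in middleware;
# every log line emitted during that request then inherits it.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object Cloud Run can index."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "auto-rater",
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        }
        # Extra structured fields (evaluation_id, model, strategy, ...)
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("auto_rater")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

Callers pass structured fields via `extra={"fields": {"evaluation_id": ..., "strategy": "cot"}}`, and the formatter merges them into the JSON line.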

Model-Agnostic Scoring with LiteLLM

v2 standardizes on an OpenAI-compatible response envelope, with LiteLLM as a pass-through layer. Any model available through LiteLLM works without code changes. Want to switch from GPT-4o to Claude Sonnet for scoring? Change one config value.

# Before - hardcoded
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[...]
)

# After - configurable via LiteLLM
response = litellm.completion(
    model=os.getenv("SCORING_MODEL", "gpt-4o"),
    messages=[...]
)

LiteLLM handles the differences between providers (OpenAI, Anthropic, Google) behind a unified interface. The service is now model-agnostic — swap models without touching code.
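One way the "change one config value" idea extends to multiple strategies is a per-strategy lookup with env overrides. This is a hypothetical sketch; the env var names, defaults, and model IDs are illustrative, not the service's actual configuration:

```python
import os

# Hypothetical per-strategy defaults (model IDs in LiteLLM's
# provider-prefixed style); all names here are illustrative.
_DEFAULT_MODELS = {
    "direct": "gpt-4o",
    "cot": "gpt-4o",
    "debate": "anthropic/claude-3-5-sonnet-20241022",
}

def scoring_model(strategy: str) -> str:
    """Resolve the model for a strategy: env override first, then default."""
    return os.getenv(f"SCORING_MODEL_{strategy.upper()}",
                     _DEFAULT_MODELS.get(strategy, "gpt-4o"))
```

The resolved name drops straight into `litellm.completion(model=scoring_model("cot"), ...)`, so swapping the debate judge means setting one environment variable on the Cloud Run service.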

The VPC Networking Saga

A 3-day debugging journey. Cloud Run couldn't connect to Redis (Memorystore). Added socket timeouts (without them, connections hang forever). Added VPC connector. Then discovered Direct VPC Egress was conflicting with the connector — Cloud Run has two mutually exclusive VPC networking modes and the error message doesn't tell you that.

Fix: removed VPC connector since infrastructure had already moved to Direct VPC Egress. The lesson: Cloud Run networking is subtle. When something doesn't work, check if you're mixing modes.

Redis State Recovery

Evaluations take 1-30 minutes. The frontend polls Redis for progress. We discovered state updates were logging "success" even when the Redis key no longer existed: HSET silently recreates a missing key and only reports how many fields were newly created, so a write against an expired hash "succeeds" while every other field is gone.

Rewrote state management to verify return values and create recovery records for lost state. Now, if Redis state disappears (expiration, eviction, crash), we can reconstruct it from recovery records.
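The verified write can be sketched like this. `update_state` takes any redis-py-shaped client (a tiny in-memory stand-in is included for illustration), and the recovery-record format is hypothetical:

```python
import json
import time
from collections import defaultdict

class MemoryStore:
    """In-memory stand-in for a redis-py client (illustration only)."""
    def __init__(self):
        self.hashes = defaultdict(dict)
        self.lists = defaultdict(list)

    def exists(self, key: str) -> int:
        return 1 if key in self.hashes else 0

    def hset(self, key: str, field: str, value: str) -> int:
        created = 0 if field in self.hashes[key] else 1
        self.hashes[key][field] = value
        return created  # like Redis: number of fields newly created

    def rpush(self, key: str, value: str) -> int:
        self.lists[key].append(value)
        return len(self.lists[key])

def update_state(store, key: str, field: str, value: str,
                 creating: bool = False) -> bool:
    """Write one progress field and detect a hash that silently vanished.

    HSET recreates a missing key, so a plain write "succeeds" against a
    hash that expired mid-evaluation while every other field is lost.
    (In production the exists/hset pair should run in a pipeline or Lua
    script to close the race window.)
    """
    existed = bool(store.exists(key))
    store.hset(key, field, value)
    if existed or creating:
        return True
    # State was lost (expiry, eviction, crash): append a recovery record
    # so the full evaluation state can be reconstructed later.
    store.rpush("state:recovery", json.dumps(
        {"key": key, "field": field, "value": value, "ts": time.time()}))
    return False
```

A caller that gets `False` back knows the hash was rebuilt from scratch and can replay the recovery records instead of trusting a one-field hash.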

Production Tuning

Cloud Run defaults to 80 concurrent requests per container, and each request holds a database connection: 80 × 10 instances = 800 potential connections against a Cloud SQL tier that supports ~100. The fix:

# Cloud Run service config
containerConcurrency: 10

# Uvicorn config
uvicorn.workers: 1

# SQLAlchemy pool config
pool_size: 5
max_overflow: 10

Now each instance needs at most 10 connections (one per in-flight request, with the SQLAlchemy pool capped at pool_size 5 + max_overflow 10 = 15), so 10 instances top out around 100 connections: at the limit only if every instance saturates at once, with steady-state usage far lower. The trade-off: lower concurrency means more instances, but database stability is worth it.
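The capacity arithmetic can be sketched as a back-of-envelope helper, assuming one Uvicorn worker per container so that request concurrency bounds connection demand (the function name is illustrative):

```python
def worst_case_connections(instances: int, concurrency: int,
                           pool_size: int, max_overflow: int) -> int:
    """Upper bound on simultaneous DB connections across all instances.

    With one worker per container, an instance never needs more
    connections than its request concurrency, and SQLAlchemy caps the
    pool at pool_size + max_overflow.
    """
    per_instance = min(concurrency, pool_size + max_overflow)
    return instances * per_instance
```

With the tuned settings, `worst_case_connections(10, 10, 5, 10)` gives 100: right at the Cloud SQL ceiling, but only if every instance saturates simultaneously.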