
Incident Response: Runbooks, CLI Agent Debugging, and Sandbox Fixes

10 min read · Humza Tareen
Incident Response · GKE · Python · CLI Debugging

When Production Breaks at Scale

When you're running an AI evaluation platform that orchestrates agent execution across GKE clusters, "something broke" isn't a useful statement. You need structured response playbooks, diagnostic tools, and — most importantly — the discipline to investigate before guessing.

This week I built the incident response infrastructure and debugged a chain of silent failures that were causing agents to fail in production without any useful error signals.

The Incident Runbook

I wrote a comprehensive incident runbook with severity classification, triage checklists, and over 70 diagnostic commands tailored to our GCP stack. Not generic "check the logs" advice — specific gcloud and kubectl commands that tell you exactly where to look.

The runbook covers five severity levels, each with a decision tree: Is it a single task failure? A cluster-wide issue? A database connectivity problem? The goal was to make incident response reproducible — any engineer on the team can follow the same diagnostic path and reach the same conclusion.
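The shape of that decision tree can be sketched in a few lines. Everything here is illustrative, not the runbook's actual structure: the real runbook has five levels, condensed to three for brevity, and the names are invented.

```python
from enum import Enum

class Severity(Enum):
    SEV1 = 1  # cluster-wide issue or database connectivity problem
    SEV2 = 2  # degraded service, multiple tasks affected
    SEV3 = 3  # single task failure

def classify(single_task_only: bool, cluster_wide: bool, db_down: bool) -> Severity:
    # Condensed decision tree: worst conditions first, so any engineer
    # following it reaches the same conclusion from the same inputs.
    if cluster_wide or db_down:
        return Severity.SEV1
    if single_task_only:
        return Severity.SEV3
    return Severity.SEV2
```

The point is less the code than the property it encodes: triage becomes a pure function of observable symptoms, which is what makes it reproducible across engineers.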

Alongside the runbook, I added GitHub issue templates for bug reports and RCA (root cause analysis) prompts. Every incident now generates a structured post-mortem that feeds back into the runbook.

The Silent Agent Failures

The platform supports multiple AI agent frameworks, including a CLI-based agent framework that wraps LLM API calls. Agents started failing in production, but the failures were invisible. Here's why:

The CLI tools exit with code 0 even when the upstream LLM API returns a 4xx or 5xx error. Our evaluation engine checked the exit code, saw "success," and moved on. But the output contained error messages that should have failed the task.

# Before: trusting exit codes
if process.returncode == 0:
    return TaskResult(status="passed", output=stdout)

# After: inspecting output for API errors
import re

ERROR_PATTERNS = [
    r"HTTP [45]\d{2}",   # 4xx/5xx only; plain \d{3} would match "HTTP 200"
    r"rate.?limit",
    r"APIError",
    r"quota exceeded",
]

def has_error_patterns(text: str) -> bool:
    return any(re.search(p, text, re.IGNORECASE) for p in ERROR_PATTERNS)

if process.returncode == 0 and has_error_patterns(stdout):
    return TaskResult(status="failed", output=stdout,
                      reason="API error detected in output")

The Sandbox Environment Gap

Agent execution happens in sandboxed containers on GKE. The sandbox had two problems: missing development tools and shallow git clones that broke diff operations.

Agents that needed to analyze pull request diffs couldn't — because git clone --depth 1 doesn't fetch PR refs. When the GitHub API returned 404 for a PR's files endpoint, the fallback was supposed to clone and diff locally, but the shallow clone didn't have the commits to diff against.

The fix was straightforward: fetch the specific PR refs and ensure the sandbox environment included the tools agents actually needed. But diagnosing it required tracing through three layers — the evaluation engine, the sandbox provisioner, and the git clone logic.
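A minimal sketch of the ref-fetch side of that fix, relying on GitHub's standard pull/&lt;number&gt;/head ref convention. The helper name and the local ref layout are illustrative, not the platform's actual code:

```python
import subprocess

def fetch_pr_refs(repo_dir: str, pr_number: int) -> None:
    """Fetch a PR's head ref explicitly, since a shallow clone
    (git clone --depth 1) never brings PR refs down on its own."""
    refspec = f"pull/{pr_number}/head:refs/remotes/origin/pr-{pr_number}"
    subprocess.run(
        ["git", "-C", repo_dir, "fetch", "origin", refspec],
        check=True,
    )
```

If the diff target is the PR's base branch, that branch may need to be fetched (or the shallow history deepened) the same way before the local diff has both sides to compare.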

Case-Insensitive Key Normalization

One of the subtler bugs: agent configuration keys were case-sensitive in PostgreSQL. The admin UI stored Agent-CLI-model-v1 while the worker loaded agent-cli-model-v1. Config changes made through the UI silently targeted a different row than the one the worker was reading.

The fix normalized all agent keys to lowercase on write and on read, with a migration to deduplicate any existing mismatches.
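The normalization itself is nearly a one-liner, sketched here with a hypothetical helper; the deduplication migration is the part this snippet doesn't show:

```python
def normalize_agent_key(key: str) -> str:
    # Applied on every write and every read, so the admin UI and the
    # worker always address the same PostgreSQL row.
    return key.strip().lower()
```

With this in place, normalize_agent_key("Agent-CLI-model-v1") and normalize_agent_key("agent-cli-model-v1") resolve to the same key, closing the UI/worker mismatch.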

The Streaming Flag Discovery

CLI agents needed a --stream CLI flag, but we were passing it as an environment variable AGENT_STREAM=true. The variable reached the subprocess, but the CLI framework doesn't read environment variables for API request parameters — it only accepts CLI flags.

This required a broader fix: a model-agnostic extra_params system that could pass arbitrary parameters to any agent framework, with per-framework normalization logic to convert them into the right format (CLI flags, env vars, or JSON config).
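A sketch of what the CLI-flag normalizer in such a system might look like. The function name and its conventions (underscores to hyphens, booleans as bare flags) are assumptions for illustration, not the platform's actual API:

```python
from typing import Dict, List

def to_cli_flags(extra_params: Dict[str, object]) -> List[str]:
    """Convert a model-agnostic extra_params dict into CLI flags
    for frameworks that only accept parameters on the command line."""
    flags: List[str] = []
    for key, value in extra_params.items():
        flag = "--" + key.replace("_", "-")
        if value is True:                      # booleans become bare flags
            flags.append(flag)
        elif value is False or value is None:  # disabled params are dropped
            continue
        else:
            flags.extend([flag, str(value)])
    return flags
```

For example, {"stream": True, "max_tokens": 1024} becomes ["--stream", "--max-tokens", "1024"]. Env-var and JSON-config normalizers follow the same per-framework pattern, so callers never need to know how a given framework wants its parameters delivered.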

What I'd Do Differently

The incident runbook should have existed from day one. We spent two weeks debugging issues that structured diagnostic commands would have identified in minutes. The lesson: invest in observability infrastructure before you need it, not after production is on fire.

Silent failures — exit code 0 with error output — are more dangerous than loud crashes. Every external tool integration should validate output content, not just return codes.