Building an RL Training Arena for AI Code Agents on GCP
How do you teach an AI to get better at coding? Not with more training data — with structured feedback. Give it a task, let it try, measure the result, and let it learn. I built the environment that makes this possible.
The Gym-Style API
We modeled the API after OpenAI Gym because RL researchers already know this interface. Three endpoints capture the full RL loop: create episode, execute step, compute reward. The API abstracts away the complexity of code execution, sandboxing, and state management. Here's what it looks like:
POST /episodes
{
  "task_id": "fix-bug-in-sort-function",
  "agent_id": "claude-sonnet-4"
}

POST /episodes/{episode_id}/steps
{
  "action": "edit_file",
  "file": "src/sort.py",
  "code": "def sort(arr): ..."
}

GET /episodes/{episode_id}/reward
{
  "reward": 0.75,
  "tests_passed": 3,
  "total_tests": 4
}
This interface lets researchers focus on agent behavior, not infrastructure. They don't need to know about Cloud Build, Redis, or Firestore—they just need to understand episodes, steps, and rewards.
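The three endpoints above can be driven by a small client. This is a sketch, not the project's actual SDK: the base URL and response fields are assumed from the examples, and the HTTP transport is injectable so the loop can be exercised without a live server.

```python
import json
from urllib import request as urlrequest


class ArenaClient:
    """Minimal client for the episode/step/reward API (hypothetical base URL)."""

    def __init__(self, base_url, transport=None):
        self.base_url = base_url.rstrip("/")
        # transport(method, url, body) -> dict; injectable so tests run offline
        self.transport = transport or self._http

    def _http(self, method, url, body):
        data = json.dumps(body).encode() if body is not None else None
        req = urlrequest.Request(
            url, data=data, method=method,
            headers={"Content-Type": "application/json"},
        )
        with urlrequest.urlopen(req) as resp:
            return json.load(resp)

    def create_episode(self, task_id, agent_id):
        return self.transport("POST", f"{self.base_url}/episodes",
                              {"task_id": task_id, "agent_id": agent_id})

    def step(self, episode_id, action):
        return self.transport("POST",
                              f"{self.base_url}/episodes/{episode_id}/steps",
                              action)

    def reward(self, episode_id):
        return self.transport("GET",
                              f"{self.base_url}/episodes/{episode_id}/reward",
                              None)
```

A training loop then reduces to create, step until done, read the reward: the same shape as any Gym rollout.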
Sandboxing Untrusted Code
Agents execute arbitrary code. You need isolation. Cloud Build provides containers with no network, hard timeouts, resource limits, and read-only filesystems. The trade-off: roughly 10 seconds of startup latency per execution. Worth it for guaranteed isolation. Each agent execution happens in an isolated Cloud Build step. No cross-contamination between evaluations, no way for one agent's code to affect another's environment.
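The post doesn't show the Cloud Build configuration, but the same isolation properties can be expressed as a local Docker invocation (which is also how the local adapter from the architecture section could work). Image name, limits, and the entrypoint are illustrative assumptions; the `timeout` wrapper assumes coreutils inside the image.

```python
def sandbox_cmd(image, workdir, timeout_s=300, mem="512m", cpus="1"):
    """Build a `docker run` command with the isolation properties described
    above. Values here are illustrative defaults, not production config."""
    return [
        "docker", "run", "--rm",
        "--network", "none",            # no network access
        "--read-only",                  # read-only root filesystem
        "--memory", mem,                # hard memory cap
        "--cpus", cpus,                 # CPU quota
        "-v", f"{workdir}:/workspace:ro",  # task files mounted read-only
        image,
        "timeout", str(timeout_s),      # hard wall-clock limit
        "python", "/workspace/run.py",
    ]
```

The flags map one-to-one onto the guarantees listed above: no network, hard timeout, resource limits, read-only filesystem.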
State Management Tiers
We use three storage tiers, each chosen for its access pattern.

Redis holds ephemeral episode state: fast read/write, perfect for the active episode that needs sub-second updates. Episodes are short-lived, but they need to be fast.

Firestore stores persistent trajectories: training data that needs to persist long-term and be queryable. Researchers need to analyze agent behavior over time, compare trajectories, and extract patterns.

Cloud SQL holds task configuration: relational data that needs JOINs and complex queries. Task definitions, verifiers, baseline results are all structured data that belongs in a relational database.
The point isn't the list, it's why each tier exists. Redis isn't just "fast": it's for data that doesn't need to survive a restart. Firestore isn't just "persistent": it's for data that needs real-time listeners and flexible schemas. Cloud SQL isn't just "relational": it's for data that needs ACID guarantees and complex queries.
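The routing rule is simple enough to state in code. A sketch, with plain dicts standing in for the Redis, Firestore, and Cloud SQL clients; the category names are assumptions, not the production schema.

```python
from dataclasses import dataclass, field


@dataclass
class StateRouter:
    """Illustrative mapping from data category to storage tier."""
    episode_cache: dict = field(default_factory=dict)  # Redis: ephemeral, restart-safe to lose
    trajectories: dict = field(default_factory=dict)   # Firestore: long-lived, queryable
    task_config: dict = field(default_factory=dict)    # Cloud SQL: relational, ACID

    def store_for(self, kind):
        # One place in the codebase decides which tier owns which data.
        return {
            "episode": self.episode_cache,
            "trajectory": self.trajectories,
            "task": self.task_config,
        }[kind]

    def put(self, kind, key, value):
        self.store_for(kind)[key] = value

    def get(self, kind, key):
        return self.store_for(kind).get(key)
```

Centralizing the decision means a tier can be swapped (or a fake substituted in tests) without touching callers.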
The Preprocessor Pipeline
Think of it as the task chef. Raw ingredients go in—a GitHub repository, a task description, maybe some baseline tests. A fully prepared training environment comes out. The 15-step preprocessing workflow analyzes repositories, generates verifiers, establishes baselines, creates test harnesses, and configures reward functions. Dynamic verifier and reward configuration generation means each task gets a custom evaluation setup. Repository analysis extracts dependencies, identifies test patterns, and builds execution contexts. Baseline test execution establishes ground truth—what should pass before the agent even starts.
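The shape of that workflow is a context dict flowing through an ordered list of steps. A condensed sketch: the real pipeline has 15 steps, and the two steps shown here (and their input keys) are illustrative, not the production code.

```python
def run_pipeline(context, steps):
    """Tiny pipeline runner: each step reads the context dict and returns
    updates to merge back in. The real workflow chains 15 such steps."""
    for step in steps:
        context.update(step(context) or {})
    return context


def extract_dependencies(ctx):
    # Repository analysis (illustrative): parse a requirements.txt already
    # loaded into the context.
    lines = ctx.get("requirements", "").splitlines()
    deps = [l.strip() for l in lines if l.strip() and not l.startswith("#")]
    return {"deps": deps}


def establish_baseline(ctx):
    # Ground truth before the agent starts: record which tests pass on the
    # untouched repository.
    results = ctx.get("test_results", {})
    return {"baseline": dict(results)}
```

Because every step has the same signature, steps like verifier generation or reward configuration slot in without changing the runner.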
Reward Shaping
Here's the reward formula:
reward = (tests_passed / total_tests) * base_reward
         + style_bonus
         - time_penalty
         - error_penalty
Getting the reward shaping right took three iterations. Too sparse—agents don't learn because they rarely get positive feedback. Too dense—agents game the metric, optimizing for the reward function instead of solving the task. Just right—balances test passage, style, time, and errors in a way that encourages genuine problem-solving.
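As code, the formula is a one-liner plus guards. The coefficients below are illustrative: the post gives the formula, not the constants.

```python
def compute_reward(tests_passed, total_tests, *, base_reward=1.0,
                   style_bonus=0.0, elapsed_s=0.0, errors=0,
                   time_penalty_per_s=0.001, error_penalty_each=0.05):
    """Shaped reward: test passage scaled by base_reward, plus a style bonus,
    minus time and error penalties. Coefficient values are assumptions."""
    if total_tests == 0:
        return 0.0  # no verifier, no signal
    reward = (tests_passed / total_tests) * base_reward
    reward += style_bonus
    reward -= elapsed_s * time_penalty_per_s
    reward -= errors * error_penalty_each
    return reward
```

With 3 of 4 tests passing and no bonuses or penalties, this reproduces the 0.75 from the API example above.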
Architecture Choice: Hexagonal
Domain logic is decoupled from GCP infrastructure. Switching Cloud Build for local Docker? Change the adapter, not the domain. This isn't academic—it made testing actually possible. We can run the entire RL loop locally without spinning up Cloud Build containers. The domain doesn't know about GCP. It knows about code execution, state management, and reward computation. The infrastructure adapters handle the details.
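The port/adapter split can be sketched with a `typing.Protocol`. The names are illustrative, not the project's real interfaces; the fake adapter shows how the RL loop runs locally with no containers and no GCP.

```python
from typing import Protocol


class CodeExecutor(Protocol):
    """Port: the domain's view of code execution. Cloud Build, local Docker,
    and in-process fakes are all adapters behind this interface."""
    def run(self, code: str, timeout_s: int) -> dict: ...


class FakeExecutor:
    """In-process adapter for local testing. `exec` on agent output is only
    safe here because tests feed it trusted fixtures, not untrusted code."""
    def run(self, code: str, timeout_s: int = 60) -> dict:
        scope: dict = {}
        exec(code, scope)
        defined = [k for k in scope if not k.startswith("__")]
        return {"ok": True, "defined": defined}


def evaluate_step(executor: CodeExecutor, code: str) -> dict:
    """Domain logic: depends on the port, never on a GCP client library."""
    return executor.run(code, timeout_s=60)
```

Swapping in the Cloud Build adapter changes only the object passed to `evaluate_step`; the domain code is untouched.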