Building an RL Training Arena: Gym-Style API and Cloud Build Sandboxing
How do you train an AI coding agent to get better at writing code? Not with fine-tuning on more data — with reinforcement learning. Give it a coding task, let it try, measure whether the code passes tests, and use that signal to improve. The RL Training Arena provides this environment — a Gym-style RL system built on GCP that lets agents learn from structured feedback.
The Executor: A Gym for Code Agents
The executor service exposes a Gym-style API that RL researchers already understand. Create an episode, execute steps, compute rewards. The API shape follows OpenAI Gym conventions:
# Create a new training episode
POST /episodes
{
  "task_id": "task-123",
  "agent_config": {...}
}
→ { "episode_id": "ep-456" }

# Execute an action step
POST /episodes/ep-456/steps
{
  "action": "write_file('src/main.py', '...')"
}
→ { "observation": "...", "done": false }

# Get reward signal
GET /episodes/ep-456/reward
→ { "reward": 0.75, "metrics": {...} }
Why this API shape? RL researchers already know how to use Gym environments. We're not reinventing the wheel — we're adapting it for code execution in the cloud.
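A thin client for this API fits in a few lines. The endpoint paths and response fields come from the examples above; the `EpisodeClient` class, the base URL, and the injectable transport are illustrative assumptions, not a real SDK:

```python
from typing import Callable, Optional

class EpisodeClient:
    """Minimal sketch of a client for the episode API above. The HTTP layer
    is injected as a callable so the class can be exercised without a live
    server; the transport shape is an assumption, not the real deployment."""

    def __init__(self, base_url: str,
                 transport: Callable[[str, str, Optional[dict]], dict]):
        self.base_url = base_url
        self.transport = transport  # (method, url, body) -> parsed JSON

    def create_episode(self, task_id: str, agent_config: dict) -> str:
        resp = self.transport("POST", f"{self.base_url}/episodes",
                              {"task_id": task_id, "agent_config": agent_config})
        return resp["episode_id"]

    def step(self, episode_id: str, action: str) -> dict:
        return self.transport("POST",
                              f"{self.base_url}/episodes/{episode_id}/steps",
                              {"action": action})

    def reward(self, episode_id: str) -> dict:
        return self.transport("GET",
                              f"{self.base_url}/episodes/{episode_id}/reward",
                              None)
```

Injecting the transport keeps the client testable with a stub, which mirrors the adapter-friendly style the rest of the system uses.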
Sandboxed Execution with Cloud Build
Agents run untrusted code. You can't let them touch the network, read other evaluations, or consume unbounded resources. Cloud Build provides isolated containers with no network access, hard timeouts, resource limits, and read-only filesystems.
The trade-off: Cloud Build adds latency (~10s container startup) but guarantees isolation. Each agent execution runs in a fresh container that can't see anything outside its sandbox. If an agent tries to make a network request, it fails. If it tries to read files outside its workspace, it fails. If it runs too long, Cloud Build kills it.
This isolation is non-negotiable. Without it, one malicious or buggy agent could corrupt the entire training environment.
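For local development, the same guarantees can be approximated with standard Docker flags: no network, read-only root filesystem, a memory cap, and a hard timeout. This is a sketch of that analogue, not the production Cloud Build setup; the image name, mount path, and helper names are hypothetical:

```python
import subprocess

def sandbox_command(image: str, workspace: str, cmd: list,
                    memory: str = "512m") -> list:
    """Build a `docker run` invocation mirroring the sandbox guarantees.
    Only the mounted workspace is writable; everything else is locked down."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                 # no network access
        "--read-only",                       # read-only root filesystem
        "--memory", memory,                  # hard resource limit
        "-v", f"{workspace}:/workspace:rw",  # workspace is the only writable path
        "-w", "/workspace",
        image, *cmd,
    ]

def run_sandboxed(image: str, workspace: str, cmd: list, timeout_s: int = 300):
    # The subprocess timeout plays the role of Cloud Build's hard kill:
    # exceed it and the execution is terminated, no appeal.
    return subprocess.run(sandbox_command(image, workspace, cmd),
                          capture_output=True, timeout=timeout_s)
```

The point of the sketch is that every guarantee in the paragraph above maps to one flag, which makes the isolation contract easy to audit.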
Three-Tier State Management
State management follows a three-tier architecture, each tier chosen for its access pattern:
Redis holds episode state — fast, ephemeral. Episodes last minutes to hours, and we need sub-second lookups for reward computation. Redis fits perfectly.
Firestore stores trajectories — persistent, queryable. Training data lives forever. Researchers need to query "all episodes where reward > 0.8" or "episodes from agent X in the last week." Firestore's query engine handles this.
Cloud SQL manages task configuration — relational, schema-managed. Tasks have complex relationships (dependencies, prerequisites, difficulty tiers), and SQL's relational model makes these queries natural.
Each tier does what it's best at. Redis for speed, Firestore for queries, SQL for relationships.
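To make the Cloud SQL tier concrete, here is a toy version of the task-configuration store using SQLite as a stand-in. The table and column names are invented for illustration, but they show why dependencies map naturally onto a relational model:

```python
import sqlite3

# In-memory SQLite standing in for Cloud SQL; schema is hypothetical.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tasks (
        id         TEXT PRIMARY KEY,
        difficulty TEXT NOT NULL
    );
    CREATE TABLE task_dependencies (
        task_id    TEXT REFERENCES tasks(id),
        depends_on TEXT REFERENCES tasks(id)
    );
""")
db.executemany("INSERT INTO tasks VALUES (?, ?)",
               [("task-1", "easy"), ("task-2", "medium"), ("task-3", "hard")])
db.executemany("INSERT INTO task_dependencies VALUES (?, ?)",
               [("task-3", "task-1"), ("task-3", "task-2")])

# "Which tasks must be completed before task-3?" is a single join.
prereqs = [row[0] for row in db.execute(
    "SELECT t.id FROM tasks t "
    "JOIN task_dependencies d ON d.depends_on = t.id "
    "WHERE d.task_id = ?", ("task-3",))]
```

The same prerequisite query in Redis or Firestore would need either denormalized duplicates or multiple round trips, which is exactly why this tier is relational.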
The Preprocessor: Turning Repos into Training Tasks
The preprocessor is a multi-step pipeline that takes a raw GitHub repository and produces a complete RL training task. Why this complexity? Each step reduces the manual work a researcher would otherwise do to set up a single training task.
The flow: clone the repo → detect the primary language → resolve dependencies → discover all tests → run baseline to establish ground truth → generate verifiers → configure rewards → estimate difficulty → set resource limits → generate sandbox config → create prompts → define validation rules → package artifacts → upload to GCS.
Without the preprocessor, setting up one training task would take hours. With it, a researcher points at a GitHub repo and gets a fully configured task in minutes. The preprocessor handles edge cases — missing dependencies, flaky tests, unclear success criteria — that would otherwise require manual intervention.
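The flow above can be sketched as a pipeline of small steps, each taking and returning a task-context dict. The step bodies here are stubs with hypothetical keys; the real implementations are not shown in this article:

```python
from typing import Callable

Step = Callable[[dict], dict]

def clone_repo(ctx: dict) -> dict:
    # Stub: the real step runs `git clone` into a scratch directory.
    return {**ctx, "workdir": f"/tmp/{ctx['repo'].split('/')[-1]}"}

def detect_language(ctx: dict) -> dict:
    # Stub: the real step inspects file extensions and build files.
    return {**ctx, "language": "python"}

def discover_tests(ctx: dict) -> dict:
    # Stub: the real step walks the tree for test files.
    return {**ctx, "tests": ["tests/test_basic.py"]}

def run_pipeline(ctx: dict, steps: list) -> dict:
    for step in steps:
        ctx = step(ctx)  # each step enriches the shared task context
    return ctx

task = run_pipeline({"repo": "github.com/example/repo"},
                    [clone_repo, detect_language, discover_tests])
```

Modeling each stage as a pure context-in, context-out function is what lets the fourteen-step flow stay maintainable: steps can be reordered, skipped, or tested in isolation.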
Getting the Reward Function Right
The reward function is where RL theory meets engineering reality. If rewards are too sparse, agents don't learn; if too dense, they game the metric. The final formula balances test passage, code style, time efficiency, and error penalties:
def compute_reward(episode):
    # Binary signal for the primary objective: did the tests pass?
    base_reward = 1.0 if episode.tests_passed else 0.0
    # 0..1 score from the style checker
    style_bonus = compute_style_score(episode.code)
    # 0..1 bonus for finishing under budget (larger when faster)
    time_bonus = max(0, 1.0 - episode.duration / episode.max_duration)
    # Each error costs a flat 0.1
    error_penalty = -0.1 * episode.error_count
    return base_reward + 0.2 * style_bonus + 0.1 * time_bonus + error_penalty
This formula went through multiple iterations. Early versions rewarded test passage only, and agents learned to write code that passed tests but was unreadable. Adding the style bonus fixed that. Then agents learned to finish instantly with trivial solutions, so the time term's weight was capped at 0.1, small enough that speed can never outweigh passing tests. The error penalty discourages random code generation.
Reward engineering is an art. The formula looks simple, but each coefficient was tuned through experimentation.
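A runnable sketch of the formula helps check the intuition. The `Episode` dataclass and the stub style scorer below are stand-ins; the real style checker is not shown in the article:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    tests_passed: bool
    code: str
    duration: float
    max_duration: float
    error_count: int

def compute_style_score(code: str) -> float:
    # Stand-in for the real style scorer: reward reasonably short lines.
    lines = code.splitlines() or [""]
    return sum(len(line) <= 79 for line in lines) / len(lines)

def compute_reward(ep: Episode) -> float:
    base_reward = 1.0 if ep.tests_passed else 0.0
    style_bonus = compute_style_score(ep.code)
    time_bonus = max(0.0, 1.0 - ep.duration / ep.max_duration)
    error_penalty = -0.1 * ep.error_count
    return base_reward + 0.2 * style_bonus + 0.1 * time_bonus + error_penalty

# Passing tests, clean style, half the time budget used, one error:
# 1.0 + 0.2*1.0 + 0.1*0.5 - 0.1 = 1.15
ep = Episode(True, "print('ok')", duration=50, max_duration=100, error_count=1)
```

Working through a concrete episode like this makes the coefficient balance visible: a failed test run can never be rescued by style or speed bonuses alone.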
Hexagonal Architecture
The core RL domain logic is decoupled from GCP infrastructure. Why? Switching from Cloud Build to local Docker for development means changing an adapter, not rewriting domain logic.
The executor defines interfaces: SandboxExecutor, StateStore, RewardComputer. Concrete implementations — CloudBuildExecutor, RedisStateStore, FirestoreTrajectoryStore — live in adapter layers.
This architecture makes testing easier (mock the adapters), development faster (run locally with Docker), and migration possible (swap Cloud Build for another sandbox without touching domain code).
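The ports can be sketched as Python Protocols with a fake adapter of the kind used in tests. The method signatures here are assumptions; the article gives only the interface names:

```python
from typing import Protocol

class SandboxExecutor(Protocol):
    """Port for sandboxed code execution (CloudBuildExecutor in production)."""
    def execute(self, code: str, timeout_s: int) -> str: ...

class StateStore(Protocol):
    """Port for episode state (RedisStateStore in production)."""
    def get(self, episode_id: str) -> dict: ...
    def put(self, episode_id: str, state: dict) -> None: ...

class FakeExecutor:
    """Test double living in an adapter layer, like CloudBuildExecutor would."""
    def execute(self, code: str, timeout_s: int) -> str:
        return f"ran {len(code)} bytes in a fake sandbox"

def run_step(executor: SandboxExecutor, code: str) -> str:
    # Domain logic depends only on the port, never on GCP client types.
    return executor.execute(code, timeout_s=60)
```

Because `run_step` sees only the `SandboxExecutor` protocol, swapping Cloud Build for local Docker, or for a fake in unit tests, is a one-line change at the composition root.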