
Docker Image Hardening for AI Benchmarks

8 min read · Humza Tareen

AI coding benchmarks evaluate agents across dozens of programming languages: Rust, Go, JavaScript, Python, C++, and more. Each task runs inside a Docker image that contains a specific repository snapshot—the exact codebase the agent must modify to pass tests. The problem: many of those images were never designed to host an agent runtime. They lack Python, have broken git histories, or ship without the dependencies the agent needs to execute. When an agent lands in a broken container, it explores indefinitely, burning through its entire call budget without ever making progress. This post describes how we hardened the image pipeline to prevent that.

The Problem: Broken Containers, Wasted Compute

Benchmark task images come from many sources. Some are hand-curated for a single language. Others are auto-generated from public repos. The common thread: they focus on the task—the code to edit, the tests to run—not the agent that will run inside them. The agent runtime is Python-based. It needs git for version control, a working shell, and a predictable environment. A Rust-only image has no Python. A minimal Alpine image might lack git or have a nonstandard layout. An image built from a shallow clone might have a corrupted .git directory that breaks git status and git diff.

During a production evaluation, we observed agents spinning for hours on a subset of tasks. Logs showed repeated failed attempts to start the runtime, or agents looping on basic setup steps. The root cause: broken containers. The agent would start, detect that something was wrong, try to fix it, fail, and retry. With a fixed call budget per task, every retry consumed budget. By the time we noticed, a significant fraction of our compute had been spent on tasks that were never going to succeed.

Injecting the Runtime Layer

Our approach: programmatically append a runtime layer to every task image. The layer installs Miniconda with Python 3.11, the agent runtime, and any shared dependencies. The challenge is that task images span dozens of base distros and languages. The layer must be self-contained and not assume a particular base. We wrote a build script that:

  1. Discovers all task images (from a manifest or registry scan)
  2. For each image, generates a Dockerfile that FROMs the original image and appends the runtime layer
  3. Builds and tags the new image
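The three steps above can be sketched as a small driver. This is a hypothetical illustration, not our production script: the JSON manifest format, the `-agent` tag suffix, and the helper names are all assumptions, and the actual `docker build` invocation is left to the injection function shown next.

```python
import json

# Minimal runtime snippet appended to every task image (illustrative subset)
RUNTIME_SNIPPET = """\
COPY --from=runtime-builder /opt/conda /opt/conda
ENV PATH="/opt/conda/bin:$PATH"
"""

def generate_dockerfile(base_image: str) -> str:
    """Step 2: a Dockerfile that FROMs the task image and appends the runtime layer."""
    return f"FROM {base_image}\nUSER root\n{RUNTIME_SNIPPET}"

def plan_builds(manifest: list[dict], suffix: str = "-agent") -> list[dict]:
    """Steps 1 and 3: map each discovered image to its Dockerfile and output tag."""
    return [
        {
            "task_id": entry["task_id"],
            "output_tag": entry["image"] + suffix,
            "dockerfile": generate_dockerfile(entry["image"]),
        }
        for entry in manifest
    ]

# Example: a one-entry manifest (format is an assumption)
manifest = json.loads('[{"task_id": "rust-001", "image": "bench/rust-001:v1"}]')
for build in plan_builds(manifest):
    print(build["task_id"], "->", build["output_tag"])
```

Keeping discovery and Dockerfile generation pure (no Docker calls) makes the planning step easy to test; only the final build step touches the daemon.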

The injection function is straightforward. It takes the base image name, writes a Dockerfile that layers the runtime on top (copying the runtime itself from a separately built runtime-builder image), and triggers a build:

import subprocess

def inject_runtime_layer(base_image: str, output_tag: str, runtime_path: str) -> bool:
    """Append Python runtime + agent layer to a task image."""
    dockerfile = f"""
FROM {base_image}
USER root
# apt-get assumes a Debian-based image; non-Debian bases take a different branch
RUN apt-get update -qq && apt-get install -y -qq git curl && rm -rf /var/lib/apt/lists/*
COPY --from=runtime-builder /opt/conda /opt/conda
ENV PATH="/opt/conda/bin:$PATH"
# Copy the agent first so its requirements.txt exists for the install step
COPY agent/ /agent/
RUN pip install --no-cache-dir -r /agent/requirements.txt
WORKDIR /workspace
"""
    with open("/tmp/Dockerfile.inject", "w") as f:
        f.write(dockerfile)
    result = subprocess.run(
        ["docker", "build", "-t", output_tag, "-f", "/tmp/Dockerfile.inject", "."],
        capture_output=True, text=True, timeout=600
    )
    return result.returncode == 0

The runtime-builder image (not shown) installs Miniconda and pre-installs the agent dependencies; because the generated Dockerfile has no stage by that name, COPY --from=runtime-builder resolves to that locally built image. Copying /opt/conda out of it keeps the final image smaller and avoids conflicting with existing Python installations in the base. We run this for every task image before an evaluation. Images that fail to build are logged and excluded.
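For concreteness, a minimal sketch of what that builder image might look like. This is an assumption, not our actual Dockerfile: the base image, installer URL, and paths are illustrative, and pinning the installer to Python 3.11 is elided here.

```dockerfile
# Hypothetical runtime-builder: built once, then referenced via COPY --from=
FROM debian:bookworm-slim AS runtime-builder
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates && \
    curl -fsSL https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh && \
    bash /tmp/miniconda.sh -b -p /opt/conda && \
    rm /tmp/miniconda.sh
# Pre-install agent dependencies so task-image builds only copy, never compile
COPY agent/requirements.txt /tmp/requirements.txt
RUN /opt/conda/bin/pip install --no-cache-dir -r /tmp/requirements.txt
```

Everything the agent needs lives under /opt/conda, so the copy into a task image is a single self-contained directory.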

Pre-flight Validation

Injection ensures the runtime is present. It does not guarantee the container is usable. A base image might have a broken .git, a read-only filesystem, or a nonstandard working directory. We added a pre-flight validator that runs before any expensive evaluation:

import subprocess

def validate_image(image: str, repo_path: str = "/workspace") -> tuple[bool, str]:
    """Check that an image is ready for agent execution."""
    def run_in_container(cmd: str) -> tuple[bool, str]:
        try:
            r = subprocess.run(
                ["docker", "run", "--rm", image, "sh", "-c", cmd],
                capture_output=True, text=True, timeout=30
            )
        except subprocess.TimeoutExpired:
            return False, "timed out after 30s"
        return r.returncode == 0, (r.stderr or r.stdout or "").strip()

    checks = [
        ("image_exists", lambda: (subprocess.run(["docker", "inspect", image], capture_output=True).returncode == 0, "")),
        ("git_repo", lambda: run_in_container(f"test -d {repo_path}/.git && git -C {repo_path} status")),
        ("python_available", lambda: run_in_container("python3 --version")),
        ("agent_start", lambda: run_in_container("python3 -c 'import agent; print(\"ok\")'")),
    ]
    for name, check in checks:
        ok, msg = check()
        if not ok:
            return False, f"{name}: {msg}"
    return True, ""

Each check runs inside a short-lived container. image_exists verifies the image is present locally (docker inspect does not pull). git_repo ensures there is a valid git repository at the expected path. python_available confirms Python 3 is on the PATH. agent_start imports the agent module to catch missing dependencies or import errors. If any check fails, we log the task ID and reason, then skip it. No agent is ever launched for that task.

Pre-flight validation runs as a separate job before the main evaluation. We generate a skip list from the failures and pass it to the runner. The impact was immediate: we stopped wasting call budget on tasks that were structurally impossible to complete.
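The pre-flight job itself is a thin loop over the manifest. A hedged sketch, with the validator injected as a parameter so the logic is testable without Docker (the function and variable names here are illustrative, not from our codebase):

```python
from typing import Callable

def build_skip_list(
    tasks: list[tuple[str, str]],                 # (task_id, image) pairs
    validate: Callable[[str], tuple[bool, str]],  # e.g. validate_image from above
) -> list[str]:
    """Run the validator over every task image; return the task IDs to skip."""
    skipped = []
    for task_id, image in tasks:
        ok, reason = validate(image)
        if not ok:
            print(f"SKIP {task_id}: {reason}")
            skipped.append(task_id)
    return skipped

# Example with a fake validator that fails one image
tasks = [("go-007", "bench/go-007:v1"), ("rs-001", "bench/rs-001:v1")]
fake_validate = lambda img: (not img.startswith("bench/go"), "git_repo: not a repo")
skip = build_skip_list(tasks, fake_validate)
print(skip)
```

The returned list is written out as the skip file the runner consumes.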

Skip Lists and Completion Detection

Some images fail validation repeatedly. Rebuilding them for every run is wasteful. We maintain a persistent skip list—a file or database table of task IDs known to have broken images. The runner consults this list before scheduling a task. If the task is on the skip list, it is not run. We also detect already-completed tasks. Each task produces a trajectory file (the sequence of actions the agent took) and a prediction file (the final answer). If both exist for a task, we skip it on reruns:

#!/bin/bash
# In the evaluation runner
SKIP_LIST="config/known_broken_images.txt"
OUTPUT_DIR="results/run_001"

run_task() {
    local task_id=$1
    local image=$2

    # Skip known broken
    if grep -q "^${task_id}$" "$SKIP_LIST" 2>/dev/null; then
        echo "SKIP (broken): $task_id"
        return 0
    fi

    # Skip already completed
    if [[ -f "$OUTPUT_DIR/${task_id}/trajectory.json" && -f "$OUTPUT_DIR/${task_id}/prediction.json" ]]; then
        echo "SKIP (done): $task_id"
        return 0
    fi

    # Run validation, then agent (validate_image is a shell wrapper
    # around the Python validator; exits nonzero on failure)
    if ! validate_image "$image"; then
        echo "$task_id" >> "$SKIP_LIST"
        echo "SKIP (validation failed): $task_id"
        return 0
    fi

    docker run --rm -v "$OUTPUT_DIR:/out" "$image" agent_run "$task_id"
}

The skip list grows over time as we encounter new failures. Completion detection lets us resume interrupted runs without redoing finished tasks. Together, they reduce redundant work and prevent bad tasks from ever reaching the agent.

The Incident and the Fix

During a production evaluation, we noticed that a subset of tasks was consuming an outsized share of compute. Agents would run for hours, hit their call limit, and exit without producing a valid trajectory. Initial triage pointed to slow tasks or difficult problems. Deeper inspection showed the real pattern: the same task IDs appeared repeatedly in failure logs, and the failures occurred early—often within the first few agent actions. The containers were broken from the start.

We added pre-flight validation to the pipeline. The next run validated all images upfront. Roughly 12% of tasks failed validation. We added those to the skip list and re-ran the evaluation. Wasted compute dropped sharply. Agents no longer spun on impossible tasks. The evaluation completed in a fraction of the previous time, and we had confidence that the remaining failures were genuine agent limitations, not infrastructure issues.

Lessons Learned

Validate before you run. Expensive operations should be guarded by cheap checks. A 30-second validation per image is trivial compared to hours of agent execution. Fail fast, skip early.

Make the runtime layer self-contained. Task images vary wildly. The layer we inject cannot assume Ubuntu, or a specific Python version, or any particular tooling. It brings everything it needs. That increases image size but eliminates a whole class of environment bugs.

Persist skip lists. Broken images tend to stay broken until someone fixes the upstream. Storing failures in a skip list avoids re-validating and re-failing on every run. Update the list when images are rebuilt or fixed.

Detect completion for idempotent reruns. Evaluations get interrupted. Networks fail, nodes get preempted, humans cancel jobs. If the runner can detect already-finished tasks and skip them, reruns become cheap. Trajectory and prediction files are the natural completion signal.

Docker image hardening for AI benchmarks is unglamorous work. Nobody writes blog posts about fixing broken .git directories. But it is the difference between an evaluation that finishes and one that burns budget on tasks that were never going to succeed. The engineering decisions here—inject layers, validate first, skip intelligently—apply to any system that runs untrusted or heterogeneous containers at scale.