
Production-Ready Multi-Turn Evaluation

· 9 min read · Humza Tareen
Multi-Turn Evaluation Python Docker AI Agents

In an earlier post I walked through building a basic multi-turn runner for the evaluation platform: persistent Docker state, conversational hint injection, and a loop that keeps one environment alive across turns. That system worked well enough to prove the idea. It was not, however, production-ready. A working demo and a system you can run overnight on hundreds of tasks are different animals. This post is about hardening that runner: verifying patches with a real test harness, freezing the filesystem for debugging, surviving flaky infrastructure, resuming after crashes, and ending every run with an honest accounting of what actually completed.

The Problem: Success Without Evidence

The naive multi-turn loop answered a narrow question: did the agent framework produce a patch file this turn? That is a weak signal. You can have a non-empty diff that breaks the build, or an empty trajectory because the container glitched on write. You cannot tell whether a turn moved the task toward resolution without running the benchmark suite against the tree inside the container. You also cannot debug a bad turn days later if you never captured what the tree looked like after that turn.

Worse, long runs fail in the middle. A host reboot, an out-of-memory kill, or a transient Docker error would leave you with partial output and no map of which tasks had already finished. Re-running everything from scratch is expensive and skews comparisons if the model or platform changes between attempts. We needed four things: per-turn verification against the official harness, snapshots of the workspace after each turn, retry when artifacts failed validation, and resume so a restarted process could skip completed work. On top of that, we needed completeness reporting so stakeholders could open one JSON file and see per-task, per-turn status instead of grepping logs.

Production readiness here means: every turn is test-backed, inspectable, retriable, and accounted for.

Per-Turn Harness Evaluation

After each turn, once the agent has written its patch into the persistent container, we invoke the same style of test harness the benchmark suite uses for single-turn runs. The harness runs inside the container against the current tree (with the patch applied). Configuration points at a harness directory and a dataset file so the evaluation platform stays decoupled from any one benchmark layout. The outcome we care about is not “patch exists” but “tests relevant to this instance pass or fail in a defined way.”

Doing this per turn changes how you interpret a run. You can plot pass/fail over turns for a single task and see whether later hints correlate with improved harness results. You also catch regressions early: if turn two’s patch breaks tests that passed after turn one, you see it immediately instead of discovering it when aggregating final patches at the end of the job.

import subprocess

def run_harness_after_turn(container_id: str, cfg: HarnessConfig) -> HarnessResult:
    """Run the benchmark harness inside the container against the current tree."""
    cmd = [
        "docker", "exec", container_id,
        cfg.harness_entrypoint,
        "--dataset", cfg.dataset_file,
        "--instance", cfg.instance_id,
        "--repo-root", cfg.workdir_in_container,
    ]
    # timeout guards against a hung harness; TimeoutExpired propagates to the caller
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=cfg.timeout_s)
    return HarnessResult(
        exit_code=proc.returncode,
        stdout=proc.stdout,
        stderr=proc.stderr,
    )

In practice the entrypoint and flags match what your benchmark suite documents; the important part is that the harness runs in the same filesystem context the agent used, so you are not evaluating a stale checkout on the host.
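
For reference, the config and result containers used above can be plain dataclasses. The field names mirror the flags in the snippet; everything else here, including the default timeout, is an assumption to make the sketch self-contained:

```python
from dataclasses import dataclass

@dataclass
class HarnessConfig:
    harness_entrypoint: str        # path to the harness CLI inside the container
    dataset_file: str
    instance_id: str
    workdir_in_container: str
    timeout_s: int = 1800          # hypothetical default; tune per benchmark

@dataclass
class HarnessResult:
    exit_code: int
    stdout: str
    stderr: str
```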

Codebase Snapshots for Post-Hoc Debugging

Logs and diffs are not always enough. When a turn behaves oddly, you want the full tree: dependencies, generated files, and exact line endings as the agent left them. We capture a compressed archive of the workspace from inside the container after each successful turn boundary (after the patch is applied and before the next hint). The archive is streamed out via base64 over docker exec so we do not depend on bind mounts or shared volumes that might differ between dev and batch environments.

The snapshot is stored alongside the trajectory and patch for that turn. Months later, you can unpack the tarball and run local tools, a linter, or the harness manually without re-running the model. That single habit turned “we cannot reproduce this” into “here is the exact tree at turn two.”

import base64
import shlex
import subprocess

def snapshot_workspace_b64(container_id: str, workdir: str) -> bytes:
    """Tar.gz the workspace in-container and return bytes (stdout is base64)."""
    inner = (
        f"cd {shlex.quote(workdir)} && "
        "tar czf - . | base64 -w0"
    )
    proc = subprocess.run(
        ["docker", "exec", container_id, "bash", "-lc", inner],
        capture_output=True,
        check=True,
    )
    return base64.standard_b64decode(proc.stdout)

On macOS, base64 may not support -w0; you can use base64 | tr -d '\n' or write to a temp file inside the container and docker cp if you prefer. The pattern—archive in the container, encode for a single stdout pipe—stays the same.
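
Once the decoded bytes are back on the host, persisting them beside the turn's other artifacts is a small helper; a minimal sketch, assuming a per-turn directory layout (the layout here is illustrative, not prescribed):

```python
from pathlib import Path

def save_snapshot(data: bytes, turn_dir: Path) -> Path:
    """Write the decoded tar.gz bytes next to the turn's trajectory and patch."""
    turn_dir.mkdir(parents=True, exist_ok=True)
    out = turn_dir / "codebase_snapshot.tar.gz"
    out.write_bytes(data)
    return out
```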

Artifact Validation With Retry

Each turn should materialize three artifacts: a trajectory file (for example .traj), a unified diff (.patch), and codebase_snapshot.tar.gz. After the turn completes, a validator checks that each path exists, is readable, and has non-zero size. If any check fails, we treat the turn as failed-for-persistence rather than failed-for-quality: the model may have done fine work, but we do not have a trustworthy record.

We retry the turn up to a configurable maximum. Transient Docker I/O errors, racey flush behavior, or a short-lived exec timeout should not permanently lose a turn. Only after exhausting retries do we mark the turn failed and record the error in the completeness ledger. This dramatically reduced “empty patch but the log looked fine” incidents in long jobs.

Resume State Management

Batch evaluation is a graph of tasks and turns. We persist a small JSON file—call it .resume_state.json—that lists completed task identifiers (and optionally per-task turn completion). On startup, the runner loads this file and skips any task that is already marked complete. If you use finer granularity, you can store the last successfully validated turn per task and continue mid-task; we started with task-level resume and extended to turn-level where the harness and snapshots made partial replay safe.

The state file is updated atomically: write to a temporary path, then replace the canonical file, so a crash during write does not corrupt the resume index. That detail matters when your job runs for twelve hours and the laptop sleeps once.

import json
from pathlib import Path

def should_skip_task(task_id: str, resume_path: Path) -> bool:
    if not resume_path.is_file():
        return False
    state = json.loads(resume_path.read_text())
    done = set(state.get("completed_tasks", []))
    return task_id in done

def mark_task_complete(task_id: str, resume_path: Path) -> None:
    state = {"completed_tasks": []}
    if resume_path.is_file():
        state = json.loads(resume_path.read_text())
    tasks = set(state.get("completed_tasks", []))
    tasks.add(task_id)
    state["completed_tasks"] = sorted(tasks)
    # Atomic update: write to a temp path, then replace the canonical file.
    tmp = resume_path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(resume_path)

Completeness Reports

At the end of a run—or incrementally if you want live dashboards—the runner emits completeness_report.json. The structure is deliberately flat so it is easy to read: per task, per turn, you get harness outcome, artifact validation status, retry counts, and error strings. Aggregates at the top level answer “how many tasks fully completed,” “how many turns failed validation,” and “where did harnesses start passing.”

This file became the contract between the evaluation platform and downstream analytics. Instead of parsing thousands of log lines, a notebook or internal tool loads one JSON document. It is also how we proved to ourselves that a run was “done”: not when the process exited zero, but when every expected turn row existed with artifacts_ok: true or a documented failure reason.

import json
from pathlib import Path

def append_turn_report(
    report: dict,
    task_id: str,
    turn: int,
    harness: HarnessResult,
    artifacts_ok: bool,
    error: str | None,
) -> None:
    report.setdefault("tasks", {})
    report["tasks"].setdefault(task_id, {"turns": []})
    report["tasks"][task_id]["turns"].append({
        "turn": turn,
        "harness_exit_code": harness.exit_code,
        "artifacts_ok": artifacts_ok,
        "error": error,
    })

def write_completeness_report(path: Path, report: dict) -> None:
    path.write_text(json.dumps(report, indent=2))
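
The top-level aggregates can be computed from the same structure just before writing; a sketch, with summary key names that are our own choice rather than a fixed schema:

```python
def add_aggregates(report: dict) -> None:
    """Attach run-level counts so readers don't have to walk every turn row."""
    tasks = report.get("tasks", {})
    report["summary"] = {
        "tasks_total": len(tasks),
        "tasks_fully_complete": sum(
            1 for t in tasks.values()
            if t.get("turns") and all(r["artifacts_ok"] for r in t["turns"])
        ),
        "turns_failed_validation": sum(
            1 for t in tasks.values()
            for r in t.get("turns", []) if not r["artifacts_ok"]
        ),
    }
```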

Resolution Summaries

With per-turn harness results, you can post-process a run to find the resolution turn: the first turn index at which the harness reports success for that instance (according to whatever predicate your benchmark defines). A small script walks the completeness report or the per-turn harness logs, orders turns, and emits a summary map from task id to resolution turn, or null if the task never reached passing tests within the turn budget.

That summary is invaluable for comparing hint strategies or model versions. Two agents might both “solve” a task by the last turn, but the one that resolves on turn one is a different product story than the one that needs three nudges. Resolution turn is a simple metric, but it only exists if you recorded harness output every turn.

Incremental JSONL for Downstream Tools

Downstream pipelines often expect one JSON object per line, per turn, with fields that mirror how hints and patches are consumed in training or analysis—not a single giant transcript with cumulative text. We write a per-turn JSONL file where each record contains only the incremental hint for that turn (not the full concatenated history) and the model’s patch for that turn, plus identifiers and harness metadata. That matches what many evaluation and distillation tools expect and keeps each line self-contained for parallel processing.

Keeping “incremental only” in the JSONL avoids accidental duplication when someone joins records: the conversation history lives in the trajectory artifact; the JSONL row is the delta the business logic cares about for that step.
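
A minimal writer for that shape; the field names here are illustrative, so match whatever your downstream tools actually expect:

```python
import json
from pathlib import Path

def append_turn_record(
    path: Path,
    task_id: str,
    turn: int,
    hint: str,
    patch: str,
    harness_exit_code: int,
) -> None:
    """Append one self-contained JSONL row: the incremental hint and this turn's patch."""
    record = {
        "task_id": task_id,
        "turn": turn,
        "hint": hint,                      # delta only, not concatenated history
        "patch": patch,
        "harness_exit_code": harness_exit_code,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```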

What Changed and Why It Matters

The basic multi-turn runner was the right foundation: persistent Docker state and conversational hints are prerequisites for measuring iterative repair. Production hardening adds the evidence layer—harness, snapshots, validation, resume, completeness—and the analytics layer—resolution summaries and incremental JSONL. Together, they turn a clever loop into something you can run at scale, defend in a meeting, and debug months later without rerunning GPUs.

Tech stack: Python, Docker, JSON state and reports, tarball snapshots via base64 transfer, pytest for validators and resume logic where it pays off.

This post follows Multi-Turn Agent Evaluation at Scale, which covers the core runner, hint injection, and persistent environments.