
Multi-Turn Agent Evaluation at Scale

8 min read · Humza Tareen
Tags: Multi-Turn, AI Agents, Docker, Python, Evaluation

Evaluating AI coding agents is deceptively hard. Single-turn benchmarks tell you whether an agent can solve a problem when given everything upfront. But real debugging is iterative: you try something, see the error, adjust. Multi-turn evaluation measures whether an agent can improve over multiple attempts with progressive feedback. Building that system taught me more about evaluation infrastructure than any single benchmark ever could.

The Problem: Why Single-Turn Evaluation Failed

Our evaluation platform originally ran each turn as an independent job. A task might have three turns with hints like "check the edge case for empty input" and "the bug is in the comparison logic." The naive approach: spin up a fresh Docker container for each turn, concatenate all hints into the problem statement, and run the agent. Simple. Clean. And it produced zero successful patches.

The failure mode was subtle. Turn one runs in container A. The agent makes edits. Container A is torn down. Turn two starts in container B—a pristine environment. The agent has no memory of what it tried. The hints are all there in the problem statement, but they arrive as a wall of text rather than a natural conversational flow. The agent sees "fix this bug, also check empty input, also the bug is in comparison logic" and treats it as noise. No conversation history. No accumulated context. No persistence of edits between turns. The evaluation was measuring something, but it wasn't measuring multi-turn reasoning.

Hint Injection: Turning Hints Into Conversation

The fix wasn't to change the agent—it was to change how we delivered hints. Instead of concatenating every hint into the initial problem statement, we inject each hint as a natural message in the conversation history between turns. The agent receives the hint as if a human had just said it: "Here's a hint for turn 2: check the edge case for empty input." The agent's previous messages, tool calls, and outputs remain intact. The new hint arrives in context, not as a prepended wall of text.

We implemented this with an inject_turn_hint method that appends a user message to the conversation before the next turn begins. The key insight: hints are conversational, not declarative. They should feel like a colleague leaning over and saying "try looking at X" rather than a spec document that was there from the start.

def inject_turn_hint(conversation: list[dict], hint: str, turn: int) -> list[dict]:
    """Append a hint as a user message before the next agent turn."""
    return conversation + [
        {"role": "user", "content": f"[Turn {turn} hint] {hint}"}
    ]

Simplified, yes—the real implementation handles message formatting and role alternation—but the core idea holds. Each hint becomes a discrete exchange. The agent sees its own history, then the new hint, then continues. The evaluation measures whether the agent can incorporate feedback, not whether it can parse a concatenated problem statement.
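The production handling of role alternation is not shown here. As a minimal sketch of what it could look like, assume a chat API that rejects two consecutive user messages. The hypothetical helper below (inject_turn_hint_alternating is an illustrative name, not the real method) merges the hint into a trailing user message instead of appending a second one.

```python
def inject_turn_hint_alternating(
    conversation: list[dict], hint: str, turn: int
) -> list[dict]:
    """Add a hint without ever producing two consecutive user messages.

    Hypothetical variant of inject_turn_hint; returns a new list and
    leaves the input conversation untouched.
    """
    hint_text = f"[Turn {turn} hint] {hint}"
    if conversation and conversation[-1]["role"] == "user":
        # Last message is already from the user: merge rather than append.
        merged = dict(conversation[-1])
        merged["content"] = f"{merged['content']}\n\n{hint_text}"
        return conversation[:-1] + [merged]
    return conversation + [{"role": "user", "content": hint_text}]


history = [
    {"role": "user", "content": "Fix the failing test."},
    {"role": "assistant", "content": "I edited compare()."},
]
with_hint = inject_turn_hint_alternating(history, "check empty input", 2)
```

Returning a new list rather than mutating the input keeps earlier snapshots of the conversation valid, which helps when a failed turn has to be replayed.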

Persistent State: One Environment, Many Turns

Hint injection fixed the conversation. Persistent state fixed the environment. We built a MultiTurnRunner that maintains a single Docker container and a single agent instance per task across all turns. No teardown between turns. No reset. The agent's edits persist. The filesystem state persists. When turn two begins, the agent sees the code it wrote in turn one. When turn three begins, it sees both. That's the difference between "can you fix this" and "you tried X, it failed, here's a hint, try again."

The runner loop is straightforward: create the environment once, run each turn in sequence, inject hints between turns, and only tear down when the task completes or fails. The complexity is in the invariants: ensuring the container stays healthy, that state is correctly checkpointed, and that we don't leak resources when a run is interrupted.

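def run_multiturn(task: Task, max_turns: int) -> RunResult:
    # One environment and one agent for the whole task, not one per turn.
    env = create_persistent_environment(task)
    try:
        agent = create_agent_for_task(task)
        conversation = [{"role": "user", "content": task.initial_prompt}]

        for turn in range(1, max_turns + 1):
            # Guard the lookup so a task with fewer hints than turns does not raise IndexError.
            hint = task.hints[turn - 1] if turn - 1 < len(task.hints) else None
            if turn > 1 and hint:
                conversation = inject_turn_hint(conversation, hint, turn)
            response = agent.run(conversation, env)
            conversation.append({"role": "assistant", "content": response})
            if verify_patch(env):
                return RunResult(success=True, turn=turn)
        return RunResult(success=False)
    finally:
        # Tear down exactly once, even if a turn raises.
        # teardown() is an assumed method name for the environment's cleanup hook.
        env.teardown()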

The agent framework doesn't know it's in a multi-turn run. It receives a conversation and an environment. The runner orchestrates the turns and ensures the environment outlives any single invocation. That separation—orchestration vs. execution—made the system testable and debuggable.
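One way to see why that separation helps testing: the loop can be exercised with in-memory stand-ins and no Docker at all. The sketch below is an assumption-laden illustration, not the platform's test code. FakeEnv, FakeAgent, and run_turns are hypothetical names, and the "patch is correct" check stands in for verify_patch.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEnv:
    """In-memory stand-in for the Docker container (hypothetical)."""
    files: dict = field(default_factory=dict)
    torn_down: bool = False

    def teardown(self) -> None:
        self.torn_down = True


class FakeAgent:
    """Scripted agent: writes a wrong fix first, a correct one after a hint."""

    def run(self, conversation: list, env: FakeEnv) -> str:
        saw_hint = any("hint]" in m["content"] for m in conversation)
        attempts = env.files.get("attempts.log", "")
        env.files["attempts.log"] = attempts + "x"
        env.files["fix.py"] = "correct" if saw_hint else "wrong"
        return f"attempt {len(env.files['attempts.log'])}"


def run_turns(agent, env, prompt: str, hints: list, max_turns: int):
    """Minimal orchestration loop: one env, one agent, hints between turns."""
    conversation = [{"role": "user", "content": prompt}]
    try:
        for turn in range(1, max_turns + 1):
            hint = hints[turn - 1] if turn - 1 < len(hints) else None
            if turn > 1 and hint:
                conversation.append(
                    {"role": "user", "content": f"[Turn {turn} hint] {hint}"}
                )
            reply = agent.run(conversation, env)
            conversation.append({"role": "assistant", "content": reply})
            if env.files.get("fix.py") == "correct":  # stand-in for verify_patch
                return turn, conversation
        return None, conversation
    finally:
        env.teardown()


env = FakeEnv()
solved_on, transcript = run_turns(
    FakeAgent(), env, "Fix the bug.", [None, "check empty input"], max_turns=3
)
```

Here attempts.log accumulates across turns, which is the property a fresh container per turn would destroy, and torn_down confirms cleanup ran once the loop exited.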

The Config Trap: Call Limits and Silent Failure

We had everything working: persistent environments, hint injection, a clean runner loop. We kicked off a production run. Zero successful patches. Every single task exhausted its call budget and exited with no patch produced. We assumed a bug in the runner. We assumed a bug in the agent. We spent days tracing execution. The bug was in the config.

Every benchmark configuration in our suite had a call limit of 60. That limit caps how many tool invocations—file edits, command runs, reads—the agent can make per turn. For single-turn evaluation, 60 was often enough. For multi-turn, it was a death sentence. The agent would start turn one, make a few edits, hit 60 calls, and exit. No patch. Turn two never ran. The runner thought the task had completed; it had just completed without producing anything. One hundred percent of tasks failed the same way.

We raised the limit to 300 after validating that production runs completed within that budget. The impact was immediate: successful patches started appearing across Python, JavaScript, and other languages in the benchmark suite. The fix was a one-line config change. The lesson was that evaluation infrastructure fails silently when resource limits are misconfigured. There's no exception. The agent just stops. You have to instrument and monitor to catch it.
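As a rough sketch of that kind of instrumentation, assume the runner can log how many calls each turn used and whether a patch came out. The TurnRecord schema, the 50% threshold, and the sample records below are all illustrative assumptions, not values from our platform.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TurnRecord:
    """One agent turn as a monitoring hook might log it (hypothetical schema)."""
    task_id: str
    turn: int
    calls_used: int
    call_limit: int
    patch_produced: bool

    @property
    def stop_reason(self) -> str:
        if self.patch_produced:
            return "patch"
        if self.calls_used >= self.call_limit:
            return "call_limit"
        return "gave_up"


def budget_exhaustion_rate(records: list) -> float:
    """Fraction of turns that ended by hitting the call limit without a patch."""
    if not records:
        return 0.0
    reasons = Counter(r.stop_reason for r in records)
    return reasons["call_limit"] / len(records)


def check_run(records: list, threshold: float = 0.5) -> None:
    """Raise loudly instead of letting a starved run look like a completed one."""
    rate = budget_exhaustion_rate(records)
    if rate >= threshold:
        raise RuntimeError(
            f"{rate:.0%} of turns hit the call limit; the budget is likely too low"
        )


# Made-up sample data for illustration only.
records = [
    TurnRecord("task-1", 1, calls_used=60, call_limit=60, patch_produced=False),
    TurnRecord("task-2", 1, calls_used=60, call_limit=60, patch_produced=False),
    TurnRecord("task-3", 1, calls_used=41, call_limit=60, patch_produced=True),
]
```

Calling check_run(records) on this sample raises, because two of three turns stopped at the limit. Recording an explicit stop reason is the design choice that matters: it turns "the agent just stops" into something a dashboard or an assertion can see.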

# benchmark_config.yaml
task_suite: standard
max_turns: 3
agent:
  call_limit: 300   # Was 60 — too low for multi-turn
  timeout_seconds: 600

Testing the Multi-Turn Pipeline

We didn't trust the system until we had tests at every layer. The hint injection API got 64 unit tests: edge cases for empty hints, multiple hints per turn, malformed conversation structures. The benchmark configurations got 16 parametrized tests—different suites, different call limits, different turn counts—to ensure we never again ship a config that silently fails. The multi-turn runner got 8 integration tests: run a known task, verify state persists, verify hints are injected, verify the runner cleans up on failure.

The parametrized config tests were the ones that would have caught the call limit bug. We added them after the incident. Now every config change runs through a matrix of scenarios. If a config produces zero patches in our test harness, the test fails. That's the kind of invariant that prevents regression.
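The harness-based tests themselves are not reproduced here. A cheaper complement, sketched below under stated assumptions, is a static check that rejects suspicious configs before anything runs. validate_config and its 100-call floor are hypothetical, and the key names mirror the YAML shown earlier.

```python
def validate_config(config: dict, min_calls_per_turn: int = 100) -> list:
    """Return a list of problems; an empty list means the config looks sane.

    The 100-call floor is an illustrative assumption, not a measured value.
    """
    problems = []
    agent = config.get("agent", {})
    call_limit = agent.get("call_limit")
    max_turns = config.get("max_turns", 1)
    if not isinstance(call_limit, int) or call_limit <= 0:
        problems.append("agent.call_limit must be a positive integer")
    elif max_turns > 1 and call_limit < min_calls_per_turn:
        problems.append(
            f"agent.call_limit={call_limit} is below {min_calls_per_turn} "
            f"for a {max_turns}-turn run"
        )
    if agent.get("timeout_seconds", 0) <= 0:
        problems.append("agent.timeout_seconds must be positive")
    return problems


# Each row could become one pytest.mark.parametrize case: (config, should_pass).
CASES = [
    ({"max_turns": 3, "agent": {"call_limit": 300, "timeout_seconds": 600}}, True),
    ({"max_turns": 3, "agent": {"call_limit": 60, "timeout_seconds": 600}}, False),
    ({"max_turns": 1, "agent": {"call_limit": 60, "timeout_seconds": 600}}, True),
    ({"max_turns": 3, "agent": {"timeout_seconds": 600}}, False),
]

results = [(validate_config(cfg) == []) == should_pass for cfg, should_pass in CASES]
```

A static check like this cannot prove a config produces patches, so it supplements the harness run rather than replacing it. Its value is speed: it fails in milliseconds on the exact mistake that cost us days.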

What We Learned

Multi-turn evaluation isn't single-turn evaluation run multiple times. It requires persistent state, conversational hint delivery, and configs that match the workload. The engineering decisions—inject hints as messages, maintain one environment per task, validate configs with parametrized tests—came from watching the system fail and understanding why. The result is an evaluation platform that can measure iterative reasoning, not just one-shot problem solving. That distinction matters for building agents that debug, refine, and improve—the way humans actually write code.

Tech stack: Python, Docker, YAML configs, pytest for unit and parametrized tests.