Building an SFT Recording Pipeline
Supervised fine-tuning (SFT) only works when you have high-quality demonstration data: real people completing real tasks in environments that look like production, with every interaction captured faithfully enough that a model can learn from it. On the recording platform I helped build, that requirement forced a full vertical slice—sandboxed desktop sessions experts could drive remotely, client-side capture of fine-grained interaction traces, parallel video, a backend that could ingest large artifacts safely, and a review pipeline that could reject bad work without losing the story of why it failed.
This post walks that system end to end: Docker sandboxes with VNC, trajectory and media capture in the browser, eighteen HTTP routes that hold the lifecycle together, rubric-based review with rework, a twelve-page admin surface, observability hooks, and a test suite that held the whole thing accountable as it grew.
The problem
SFT needs human demonstrations: recordings of experts finishing tasks correctly and consistently. The product had to provision isolated desktop environments so experts never touched customer data directly, expose those environments through the browser (VNC), record not just pixels but semantics—mouse movement, clicks, keyboard input, scrolling, copy and paste, window resizes, focus and visibility changes—and pair that with a video stream reviewers could skim. Operators needed to score quality against rubrics, approve or push back, and route failed attempts into rework with clear feedback. None of that is a single feature; it is a pipeline where every stage can fail independently, and failures have to be recoverable without silent data loss.
Sandbox infrastructure
The core isolation story is Docker: each expert session runs in a container that presents a full desktop stack, reachable over VNC from a client embedded in the admin and worker UIs. Provisioning is not fire-and-forget. The platform tracks status from “requested” through “running” and “terminating,” with health checks and automatic retry when a container exits unexpectedly or fails its readiness probe. That matters because flaky sandboxes destroy trust faster than almost any UI bug—if experts lose ten minutes to a dead session, they stop trusting the task system entirely.
Lifecycle management is explicit: provision with a deterministic image and configuration bundle, monitor resource and connectivity signals, terminate cleanly when the recording ends or policy requires it. Before tear-down in debug-friendly environments, operators can optionally snapshot the filesystem state so incidents are reproducible without holding expensive capacity open forever. Every transition is observable in the task model so support and engineers can answer “where is this sandbox right now?” without SSH archeology.
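To make that concrete, here is a minimal sketch of a guarded transition, assuming a Postgres-backed store (the pg client and the sandboxes/task_events tables are illustrative, not the platform's actual schema): a compare-and-set update that rejects illegal jumps and appends the audit event in the same transaction.

import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* env vars

// Compare-and-set phase transition: fails loudly if the row is not in the
// phase the caller believes it is in, and records the hop in the audit trail.
async function transitionSandbox(taskId: string, from: string, to: string, meta: object = {}) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    const res = await client.query(
      "UPDATE sandboxes SET phase = $1, updated_at = now() WHERE task_id = $2 AND phase = $3",
      [to, taskId, from]
    );
    if (res.rowCount !== 1) throw new Error(`illegal transition ${from} -> ${to} for ${taskId}`);
    await client.query(
      "INSERT INTO task_events (task_id, kind, payload) VALUES ($1, $2, $3)",
      [taskId, `sandbox_${to}`, JSON.stringify({ from, ...meta })]
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}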
Client-side recording
The browser cannot see the remote desktop’s OS events directly; what it can do is instrument the viewer surface and surrounding application chrome with consistent DOM listeners, normalize differences across browsers, and timestamp everything against a monotonic clock aligned to the server at session start. Trajectory capture therefore centers on a small set of event families: pointer movement and buttons, keyboard down/up, scroll, clipboard read and write where policy allows, window and element resize, and visibility changes that flag when the expert looked away or the tab slept.
Events are buffered, chunked by time or byte budget, and tagged with sequence metadata so the server can reorder and gap-detect after upload. Video uses the same mental model as screen sharing: HTMLCanvasElement.captureStream fed into MediaRecorder, with codec choices constrained to what target browsers reliably support. The two streams—trajectory JSON and encoded video—share a session identifier and version counter so ingest can correlate partial failures.
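A sketch of that capture path, assuming the VNC viewer renders into a canvas element (the codec candidates, frame rate, and timeslice below are illustrative):

// Pick the first container/codec this browser can actually record.
const CANDIDATE_MIME_TYPES = [
  "video/webm;codecs=vp9",
  "video/webm;codecs=vp8",
  "video/mp4",
];

function startVideoCapture(canvas: HTMLCanvasElement, onChunk: (blob: Blob) => void) {
  const mimeType = CANDIDATE_MIME_TYPES.find((t) => MediaRecorder.isTypeSupported(t));
  if (!mimeType) throw new Error("no supported recording codec");

  const stream = canvas.captureStream(15); // 15 fps is plenty for desktop review
  const recorder = new MediaRecorder(stream, { mimeType });

  // The timeslice makes ondataavailable fire periodically, so video chunks
  // flow through the same chunked-upload path as trajectory data.
  recorder.ondataavailable = (ev) => {
    if (ev.data.size > 0) onChunk(ev.data);
  };
  recorder.start(5_000);
  return recorder; // caller stops it when the session ends
}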
The highest-leverage reliability feature is chunked upload. Waiting until the end of a thirty-minute task to POST a single blob guarantees data loss on any disconnect, browser crash, or laptop sleep. Instead, the client flushes chunks on a timer and on boundary events (pause, network recovery, explicit checkpoint), each chunk carrying an index and a rolling hash or length check that the API validates before appending. That design trades a bit of backend complexity for a massive reduction in “we lost the whole recording” incidents.
import { useCallback, useEffect, useRef } from "react";

// TrajectoryEvent, TrajectoryChunk, the listen* adapters, shouldFlushChunk,
// and CHUNK_INTERVAL_MS live in the capture module alongside this hook.
function useTrajectoryCapture(sessionId: string, onChunk: (chunk: TrajectoryChunk) => void) {
  const bufferRef = useRef<TrajectoryEvent[]>([]);
  const seqRef = useRef(0);
  const startedAt = useRef<number>(performance.now());

  // Drain the buffer into a sequence-numbered chunk and hand it to the uploader.
  const flush = useCallback(() => {
    if (bufferRef.current.length === 0) return;
    const chunk: TrajectoryChunk = {
      sessionId,
      seq: seqRef.current++,
      events: bufferRef.current.splice(0, bufferRef.current.length),
      t0: startedAt.current,
    };
    onChunk(chunk);
  }, [sessionId, onChunk]);

  useEffect(() => {
    const push = (ev: TrajectoryEvent) => {
      bufferRef.current.push(ev);
      if (shouldFlushChunk(bufferRef.current)) flush(); // byte/count budget hit
    };
    // Each listen* adapter attaches a normalized DOM listener and returns
    // an unsubscribe function.
    const subs = [
      listenPointer(push),
      listenKeyboard(push),
      listenScroll(push),
      listenClipboard(push),
      listenResize(push),
      listenVisibility(push),
    ];
    const id = window.setInterval(flush, CHUNK_INTERVAL_MS); // time budget
    return () => {
      window.clearInterval(id);
      subs.forEach((u) => u());
      flush(); // final flush so trailing events survive unmount
    };
  }, [flush]);

  return { flush };
}
// Thrown when the server rejects a part; callers decide whether to retry.
class UploadError extends Error {
  constructor(public code: string, detail: string) {
    super(`${code}: ${detail}`);
  }
}

async function uploadRecordingChunks(
  recordingId: string,
  stream: AsyncIterable<Blob>,
  opts: { signal?: AbortSignal } = {}
) {
  let index = 0;
  for await (const blob of stream) {
    const res = await fetch(`/api/recordings/${recordingId}/parts`, {
      method: "POST",
      headers: {
        "Content-Type": "application/octet-stream",
        "X-Part-Index": String(index++),
        "X-Part-Bytes": String(blob.size), // server validates length before appending
      },
      body: blob,
      signal: opts.signal,
    });
    if (!res.ok) throw new UploadError("chunk_rejected", await res.text());
  }
  // Mark the stream complete so ingest can run gap detection on part indexes.
  const done = await fetch(`/api/recordings/${recordingId}/complete`, {
    method: "POST",
    signal: opts.signal,
  });
  if (!done.ok) throw new UploadError("complete_rejected", await done.text());
}
DOM normalization lives in thin adapter layers so Safari, Chromium, and Firefox disagree on event fields in one place instead of across every call site. That is the kind of boring glue that keeps trajectory analytics trustworthy when you later train on the JSON, not just watch the video.
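For flavor, here is the kind of adapter that layer holds, normalizing wheel deltas that Firefox reports in lines while Chromium and Safari report pixels (the line-height constant is an assumption):

// Normalize WheelEvent deltas to pixels in one place so downstream
// trajectory analytics see a single unit regardless of browser.
const LINE_HEIGHT_PX = 16; // assumed approximate default line height

function normalizeWheel(ev: WheelEvent): { deltaX: number; deltaY: number } {
  const scale =
    ev.deltaMode === WheelEvent.DOM_DELTA_LINE ? LINE_HEIGHT_PX :
    ev.deltaMode === WheelEvent.DOM_DELTA_PAGE ? window.innerHeight :
    1; // DOM_DELTA_PIXEL
  return { deltaX: ev.deltaX * scale, deltaY: ev.deltaY * scale };
}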
The API layer: eighteen routes
The HTTP surface is intentionally bounded: eighteen routes cover the whole lifecycle without sprouting duplicate “almost the same” endpoints. Workers pull from a queue with an atomic claim operation so two tabs cannot mutate the same task. Sandbox routes provision and terminate instances, expose status for polling and webhooks where needed, and accept optional snapshot requests. Recording routes implement start, pause, resume, complete, and rework entry points—each validating state transitions so illegal combinations return explicit errors instead of corrupt rows.
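The atomic claim is the classic skip-locked queue pattern; a sketch assuming a Postgres tasks table (names illustrative):

import { Pool } from "pg";

const pool = new Pool();

// Atomically claim the oldest unclaimed task; FOR UPDATE SKIP LOCKED means
// two workers (or two tabs) can race without ever claiming the same row.
async function claimNextTask(workerId: string) {
  const { rows } = await pool.query(
    `UPDATE tasks SET status = 'claimed', claimed_by = $1, claimed_at = now()
     WHERE id = (
       SELECT id FROM tasks
       WHERE status = 'queued'
       ORDER BY created_at
       FOR UPDATE SKIP LOCKED
       LIMIT 1
     )
     RETURNING id, payload`,
    [workerId]
  );
  return rows[0] ?? null; // null when the queue is empty
}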
Trajectory ingestion accepts chunked POST bodies with session and sequence headers; the server stitches streams idempotently when the client retries a failed chunk. File routes handle artifact metadata, content types, and cleanup policies. Admin routes back the dashboard: queue depth, stuck sandboxes, reviewer assignments, and read-only inspection of trajectories for dispute resolution. Keeping the count stable was a forcing function: every new idea had to justify itself as an extension of an existing resource rather than a one-off script.
Duplicate chunk delivery is inevitable on flaky networks, so ingest handlers treat part index and content hash as idempotency keys: a repeat upload with the same metadata short-circuits to success without doubling storage, while a conflicting index surfaces a hard error instead of silently corrupting the timeline. Recording and trajectory resources share consistent versioning on the entity so optimistic UI updates cannot race ahead of completed uploads.
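A sketch of that ingest check, assuming an Express-style handler and a hypothetical store with findPart/appendPart helpers:

import type { Request, Response } from "express";
import { createHash } from "node:crypto";

// Hypothetical persistence helpers; a unique (recording_id, part_index)
// index backstops the read-then-append race below.
declare const store: {
  findPart(recordingId: string, index: number): Promise<{ hash: string } | null>;
  appendPart(recordingId: string, index: number, hash: string, body: Buffer): Promise<void>;
};

// Idempotent part ingest: same (recording, index, hash) short-circuits to
// success; same index with a different hash is a hard conflict.
async function handlePartUpload(req: Request, res: Response) {
  const recordingId = req.params.recordingId;
  const partIndex = Number(req.header("X-Part-Index"));
  const body = req.body as Buffer; // express.raw() middleware assumed upstream
  const hash = createHash("sha256").update(body).digest("hex");

  const existing = await store.findPart(recordingId, partIndex);
  if (existing) {
    if (existing.hash === hash) return res.status(200).json({ deduped: true });
    return res.status(409).json({ error: "part_index_conflict" });
  }
  await store.appendPart(recordingId, partIndex, hash, body);
  return res.status(201).json({ ok: true });
}

On the sandbox side, the same explicitness lives in a small state machine: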
type SandboxPhase = "pending" | "provisioning" | "running" | "draining" | "terminated" | "failed";

// DockerClient, SandboxStore, SandboxSpec, SandboxHandle, and Healthcheck are
// the app's own contracts; only the lifecycle logic lives in this class.
class SandboxLifecycleManager {
  constructor(private readonly docker: DockerClient, private readonly store: SandboxStore) {}

  async provision(taskId: string, spec: SandboxSpec): Promise<SandboxHandle> {
    await this.store.transition(taskId, "pending", "provisioning");
    try {
      const containerId = await this.docker.run(spec);
      await this.store.attachContainer(taskId, containerId);
      await this.waitUntilReady(containerId, spec.healthcheck);
      await this.store.transition(taskId, "provisioning", "running");
      return { taskId, containerId, phase: "running" };
    } catch (err) {
      // Roll back to "pending" before retrying so the recursive call's
      // pending -> provisioning transition is legal; mark "failed" only
      // once the retry policy gives up.
      if (spec.retryPolicy?.shouldRetry(err)) {
        await this.store.transition(taskId, "provisioning", "pending");
        return this.provision(taskId, spec);
      }
      await this.store.transition(taskId, "provisioning", "failed", { error: String(err) });
      throw err;
    }
  }

  // Poll the readiness probe until it passes or we give up; assumes the
  // app's DockerClient exposes an isHealthy(containerId, check) method.
  private async waitUntilReady(containerId: string, check: Healthcheck) {
    const deadline = Date.now() + 60_000;
    while (!(await this.docker.isHealthy(containerId, check))) {
      if (Date.now() > deadline) throw new Error(`sandbox ${containerId} failed readiness`);
      await new Promise((resolve) => setTimeout(resolve, 1_000));
    }
  }

  async terminate(taskId: string, opts?: { snapshot?: boolean }) {
    const row = await this.store.require(taskId);
    // Pass through "draining" so observers can tell an in-flight tear-down
    // from a finished one; snapshot before the container stops.
    await this.store.transition(taskId, row.phase, "draining");
    if (opts?.snapshot) await this.docker.snapshot(row.containerId);
    await this.docker.stop(row.containerId);
    await this.store.transition(taskId, "draining", "terminated");
  }
}
Review and rework
Raw recordings are not automatically training data. The review pipeline attaches each submission to a rubric: dimensions like task completion, efficiency, clarity of reasoning, and adherence to policy. Reviewers score against typed criteria, leave free-text notes, and choose one of three outcomes—approve, reject, or changes requested. Approve moves the artifact downstream; reject closes the attempt with reasons; changes requested opens a rework path where the expert sees reviewer feedback and starts a new attempt linked to the same task.
Rework is not a second random queue entry. It preserves lineage so analytics can compare attempts, and it triggers worker notifications through a real-time channel: when a reviewer selects changes requested, the review pipeline writes a durable document in Firestore and fans out through Firebase Cloud Messaging (FCM) so the assigned worker gets a push on lock screen or in-app—the same path used for other task updates, so notifications stay consistent instead of bolting on a second alert system.
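A sketch of that fan-out, assuming the firebase-admin SDK (collection names, the device-token plumbing, and the helper shape are illustrative; the rubric payload type is defined just below):

import { getFirestore } from "firebase-admin/firestore";
import { getMessaging } from "firebase-admin/messaging";

// On "changes requested": persist the rework document first, then fan out.
// A missed push is recoverable (the app reads Firestore); a missed write is not.
async function issueRework(review: RubricReviewPayload, workerId: string, deviceToken: string) {
  const db = getFirestore();
  const rework = await db.collection("reworks").add({
    recordingId: review.recordingId,
    reviewerId: review.reviewerId,
    workerId,
    feedback: review.feedbackForWorker ?? review.summary,
    createdAt: new Date(),
  });

  await getMessaging().send({
    token: deviceToken,
    notification: {
      title: "Changes requested",
      body: "A reviewer left feedback on your recording.",
    },
    data: { kind: "rework", reworkId: rework.id },
  });
}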
type RubricDimensionId = "correctness" | "efficiency" | "communication" | "policy";

type RubricScore = {
  dimension: RubricDimensionId;
  value: 1 | 2 | 3 | 4 | 5;
  comment?: string;
};

type ReviewDecision = "approve" | "reject" | "changes_requested";

type RubricReviewPayload = {
  recordingId: string;
  reviewerId: string;
  scores: RubricScore[];
  decision: ReviewDecision;
  summary: string;
  feedbackForWorker?: string;
};
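The HTTP layer validates these payloads before they touch the store; a sketch assuming zod for runtime validation (schema names illustrative):

import { z } from "zod";

// Mirrors RubricReviewPayload so malformed reviews are rejected with a 400
// instead of landing as corrupt rows.
const RubricScoreSchema = z.object({
  dimension: z.enum(["correctness", "efficiency", "communication", "policy"]),
  value: z.union([z.literal(1), z.literal(2), z.literal(3), z.literal(4), z.literal(5)]),
  comment: z.string().optional(),
});

const RubricReviewSchema = z.object({
  recordingId: z.string().min(1),
  reviewerId: z.string().min(1),
  scores: RubricScoreSchema.array().min(1),
  decision: z.enum(["approve", "reject", "changes_requested"]),
  summary: z.string(),
  feedbackForWorker: z.string().optional(),
});

// In a handler: RubricReviewSchema.parse(req.body) throws on bad input.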
The admin surface: twelve pages
Operators live in the admin app. Twelve pages cover the operational surface without duplicating the worker client: queue management and fair pulling rules, task detail with an embedded sandbox viewer so engineers can see exactly what the expert saw, sandbox screens for provision, monitor, and terminate, the review queue with the rubric UI, bulk CSV import for seeding tasks at scale, and analytics views for throughput, rejection reasons, and sandbox failure rates. The goal is that a single on-call person can trace a bad recording from intake to decision without opening five tools.
Observability
When dozens of containers and simultaneous uploads collide, grep stops scaling. The service emits structured JSON logs with stable component tags—sandbox, ingest, auth, review—so dashboards and alerts can filter meaningfully. AsyncLocalStorage carries a request identifier from the edge middleware through async handlers; the browser includes the same identifier on chunk uploads where possible so support can correlate a failed part POST with the server trace that rejected it.
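A minimal sketch of that propagation, assuming Node's AsyncLocalStorage and Express-style middleware (the header name and logger shape are illustrative):

import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";
import type { NextFunction, Request, Response } from "express";

const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// Edge middleware: reuse the client's id (sent on chunk uploads) or mint one,
// then run the rest of the handler chain inside that context.
function requestIdMiddleware(req: Request, res: Response, next: NextFunction) {
  const requestId = req.header("X-Request-Id") ?? randomUUID();
  res.setHeader("X-Request-Id", requestId);
  requestContext.run({ requestId }, next);
}

// Anywhere in an async handler, the logger pulls the id without plumbing it
// through every function signature.
function log(component: string, message: string, extra: object = {}) {
  const ctx = requestContext.getStore();
  console.log(JSON.stringify({ component, message, requestId: ctx?.requestId, ...extra }));
}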
Beyond logs, every meaningful task transition appends to a TaskEvent audit trail: recording started, paused, resumed, sandbox recycled, review submitted, rework issued. That trail is what makes arguments about “what happened to this submission?” factual instead of folklore.
Testing
This system is too large to hold in your head alone. The repository now runs 1,096 tests across 56 test files using Vitest, plus eight Playwright end-to-end specs that walk the golden path—claim a task, open a sandbox, record, upload chunks, submit for review, receive a decision—from the same surfaces users touch. Thirty-one newer unit tests target the brittle edges: MediaRecorder wrappers, chunked upload coordinators, DOM event adapters, HTTP handlers for each lifecycle transition, sandbox state machine transitions, and schema validation for rubric payloads.
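A representative unit test for the provisioning retry path, sketched with Vitest (the hand-rolled fakes and import path are illustrative):

import { describe, expect, it, vi } from "vitest";
import { SandboxLifecycleManager } from "./sandbox-lifecycle"; // path illustrative

describe("SandboxLifecycleManager.provision", () => {
  it("rolls back to pending and retries when the policy allows", async () => {
    // Only the calls provision() makes are stubbed; first run fails, second succeeds.
    const docker = {
      run: vi.fn().mockRejectedValueOnce(new Error("boom")).mockResolvedValue("c1"),
      stop: vi.fn(),
      snapshot: vi.fn(),
      isHealthy: vi.fn().mockResolvedValue(true),
    };
    const store = {
      transition: vi.fn().mockResolvedValue(undefined),
      attachContainer: vi.fn().mockResolvedValue(undefined),
      require: vi.fn(),
    };
    const spec = { healthcheck: {}, retryPolicy: { shouldRetry: () => true } };

    const manager = new SandboxLifecycleManager(docker as any, store as any);
    const handle = await manager.provision("t1", spec as any);

    expect(handle.phase).toBe("running");
    expect(docker.run).toHaveBeenCalledTimes(2);
    expect(store.transition).toHaveBeenCalledWith("t1", "provisioning", "pending");
  });
});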
Those tests paid for themselves every time someone refactored ingest to add compression or changed the retry policy on Docker provisioning. The end-to-end specs caught integration assumptions—CORS on part uploads, cookie timing, VNC iframe focus—that unit tests alone would never see.
Closing thoughts
An SFT recording pipeline is infrastructure, product, and risk management at once. The pieces that look glamorous—VNC in the browser, slick review UIs—only matter if sandboxes are dependable, uploads survive real networks, and operators can explain outcomes from data. Shipping that combination required discipline in route design, obsessive client reliability, and tests that treated the pipeline as a single system. If you are building something similar, invest early in chunked ingest and explicit sandbox state machines; everything else stacks on top more safely once those two are solid.