Back to Blog

Building an Events Table for Pipeline Observability

By · 9 min read
Observability Pipeline TypeScript GCS Event Sourcing Next.js

The platform I work on runs an auto-seeding task pipeline: each task moves through a multi-step workflow spanning task design, model output generation, quality checks, expert review, deduplication gates, and rework flows. A single run can touch thirteen or more steps, each with its own state transitions, failure modes, and side effects. When something went wrong in production, the debugging workflow was painful. There was no unified timeline. You opened the GCS manifest JSON, scrolled through statusHistory if it existed, downloaded individual trace.json files from step folders, cross-referenced result.json for non-agentic steps, and checked the database for assignment lifecycle events. Four different storage locations, four different schemas, zero causal ordering guarantee. I built an events table that aggregates all of it into one paginated, expandable timeline — and a companion instrumentation pass that made the write side trustworthy enough to depend on.

The Problem: Fragmented Observability

Pipeline failures do not announce themselves cleanly. A task can stall because a worker hit a dispatch deadline, because an admin paused it, because a dedup gate matched an existing task, because a QC check blocked advancement, or because a rework orchestrator sent it back three steps. Each of those failure paths writes evidence to a different place. Status transitions lived in the manifest — sometimes. Step-level execution detail lived in per-step trace files. Admin interventions were buried in actionHistory. Expert assignment lifecycle events lived only in the relational database.

Before this work, an engineer debugging a stuck task would spend twenty to forty minutes just assembling context: downloading manifest JSON from cloud storage, opening step trace files one by one, correlating timestamps manually, and guessing which event caused which downstream effect. The platform had rich data. It just had no lens.

Observability is not about collecting more logs. It is about making the causal chain visible in one place, in the order things actually happened.

The goal was straightforward: one API endpoint, one UI table, one chronological timeline per task. The implementation was not.

Four-Source Event Aggregation

I designed the aggregator to pull from four independent sources and normalize them into a single PipelineEvent schema. Each source contributes a different slice of the story.

Source What it captures Storage
statusHistory Status transitions at write time GCS manifest
actionHistory Admin actions (pause, resume, cancel, reject, rerun) GCS manifest
Step traces Execution detail, QC results, dedup verdicts, agent metrics GCS step folders (trace.json / events.json)
TaskAssignment Assignment lifecycle (created, claimed, closed) Database

The manifest's statusHistory array was the most important source — and the least complete before instrumentation. Each transition records fromStatus, toStatus, stepName, version, actorId, actorType, reason, and timestamp. Actor types distinguish human operators (USER) from automated paths (SYSTEM), which matters when you are trying to figure out whether a task was paused by an admin or blocked by a worker.

Step traces carry the execution layer: errorDetails with typed sub-reasons, per-check QC summaries, dedup verdicts with matched task IDs and similarity scores, and agent execution metrics (token counts, cost, turn count). For steps that do not write a trace.json — notably some model-output steps — the aggregator falls back to result.json in the same step folder.

The database source covers the expert review side of the pipeline. When a task enters a human review step, assignment records track creation, claim, and closure. Without this source, the timeline would show the task entering review but never show who claimed it or when it was released.

Architecture: Provider-Service Pattern

I structured the backend in four layers: route, service, aggregator, and provider. The route handles auth, pagination params, and cache headers. The service orchestrates merging, deduplication, consolidation, sorting, and pagination. The aggregator coordinates multiple providers. Each provider implements a narrow interface and knows how to read one storage backend.

interface EventsProvider {
  fetchStatusHistory(taskId: string): Promise<StatusTransitionEvent[]>;
  fetchActionHistory(taskId: string): Promise<AdminActionEvent[]>;
  fetchStepTraces(taskId: string): Promise<StepTraceEvent[]>;
}

class GcsEventsProvider implements EventsProvider {
  constructor(
    private readonly manifestStore: ManifestStore,
    private readonly stepStore: StepObjectStore,
  ) {}

  async fetchStatusHistory(taskId: string): Promise<StatusTransitionEvent[]> {
    const manifest = await this.manifestStore.read(taskId);
    return (manifest.statusHistory ?? []).map(normalizeStatusTransition);
  }

  async fetchStepTraces(taskId: string): Promise<StepTraceEvent[]> {
    const folders = await this.stepStore.listStepFolders(taskId);
    const traces = await Promise.all(
      folders.map((folder) => this.readTraceWithFallback(taskId, folder))
    );
    return traces.flat().map(enrichStepTrace);
  }
}

The EventsProvider interface is deliberately storage-agnostic. Today everything reads from GCS. Tomorrow, if we migrate manifest history to a database, we swap the provider implementation without touching the service layer or the UI. The service calls aggregator.collect(taskId) and receives a flat array of normalized events regardless of where they originated.

The service layer owns the messy parts: deduplicating events that appear in both statusHistory and step traces, consolidating step-level event groups so the UI can collapse an entire step execution into one row, and applying pagination after consolidation so step groups never split across page boundaries.

Trace Enrichment

Raw trace files are verbose. The UI does not need every field on every row — it needs the right summary at the right depth. I built an enrichment pass that extracts context-specific previews from each trace type.

For QC-blocked steps, the preview shows the count of failed checks and their IDs. Expanding the row opens a QcBreakdownPanel with per-check pass/fail status, confidence scores, explanations, and remediation guidance. For dedup gates, the preview shows the matched task ID and similarity score against the configured threshold. For agent failures, the preview surfaces error type and subtype without dumping the full stack trace into the table row.

function enrichStepTrace(raw: RawStepTrace): EnrichedStepEvent {
  const base = {
    type: "step_outcome" as const,
    stepName: raw.stepName,
    timestamp: raw.completedAt ?? raw.startedAt,
    actorType: "SYSTEM" as const,
  };

  if (raw.errorDetails) {
    return {
      ...base,
      preview: {
        kind: "error",
        errorType: raw.errorDetails.type,
        errorSubtype: raw.errorDetails.subtype,
        errorReason: classifyError(raw.errorDetails),
      },
      detail: raw.errorDetails,
    };
  }

  if (raw.qcSummary) {
    const failed = raw.qcSummary.checks.filter((c) => !c.passed);
    return {
      ...base,
      preview: {
        kind: "qc_blocked",
        failedCount: failed.length,
        checkIds: failed.map((c) => c.checkId),
      },
      detail: raw.qcSummary,
    };
  }

  if (raw.dedupVerdict) {
    return {
      ...base,
      preview: {
        kind: "dedup_match",
        matchedTaskId: raw.dedupVerdict.matchedTaskId,
        score: raw.dedupVerdict.score,
        threshold: raw.dedupVerdict.threshold,
      },
      detail: raw.dedupVerdict,
    };
  }

  return { ...base, preview: { kind: "success" }, detail: raw.agentMetrics };
}

Error classification deserves its own mention. Not every step failure is an agent crash. SIGTERM from dispatch deadline kills, output write failures, and enqueue failures each get a distinct errorReason stamp. This matters because the platform's rework flow can rerun a task from a failed step — and if the failure reason is misattributed, the rerun targets the wrong root cause.

Causal Event Ordering

Once events from four sources land in one array, sorting by timestamp alone is not enough. Multiple events can share the same millisecond — an admin action and the status transition it triggers, or a step outcome and the subsequent pipeline advance. Without deterministic tie-breaking, the timeline would shuffle between page loads.

I built an EVENT_TYPE_PRIORITY map that breaks timestamp ties in causal order: admin actions first, then status transitions, then step outcomes. The sort comparator applies timestamp first, then priority, then a stable secondary key.

const EVENT_TYPE_PRIORITY: Record<PipelineEventType, number> = {
  admin_action: 0,
  status_transition: 1,
  step_outcome: 2,
  assignment_lifecycle: 3,
};

function compareEvents(a: PipelineEvent, b: PipelineEvent): number {
  const timeDiff = a.timestamp.getTime() - b.timestamp.getTime();
  if (timeDiff !== 0) return timeDiff;

  const priorityDiff =
    EVENT_TYPE_PRIORITY[a.type] - EVENT_TYPE_PRIORITY[b.type];
  if (priorityDiff !== 0) return priorityDiff;

  return a.id.localeCompare(b.id);
}

This is a small piece of code with an outsized UX impact. Before it, an admin pause followed by a status change to paused could render in either order depending on which provider returned first. After it, the action always precedes the transition it caused.

Rich UI Components

The events table is not a generic data grid. Each event type renders a context-specific preview in the collapsed row, with expandable detail panels behind it. I built five panel types:

  • QcBreakdownPanel — per-check results with confidence, explanation, and remediation
  • DedupDetailsPanel — matched task link, score, threshold, and verdict reasoning
  • ErrorDetailsPanel — structured error type, subtype, stack context, and classified reason
  • SlugStatusPanel — slug-level status within multi-output steps
  • RepairSummaryPanel — rework and repair orchestration outcomes

Actor UIDs in the raw events are not human-readable. The UI resolves them to email addresses via a batch lookup on initial load, so operators see "paused by jane@example.com" instead of an opaque identifier. Pagination runs client-side after the service consolidates step groups, which prevents the confusing case where page one ends mid-step and page two starts with the error detail from that same step.

One deliberate UX decision: the Events tab uses keepMounted: false in the tab panel configuration. The timeline query can fan out across every step folder for a task. There is no reason to fire that aggregation when an operator is viewing a different tab. Unmounting the panel cancels in-flight requests and keeps the task detail page responsive.

Write-Side Instrumentation

Building the read path exposed a gap: statusHistory did not exist on most manifests. The array was defined in the Zod schema but populated at exactly zero call sites. The events table would have been a fancy viewer over empty data. I shipped a companion instrumentation pass that added write-side recording at every status mutation in the codebase.

The approach extended the manifest provider's update() and appendAction() methods to accept an optional StatusTransition parameter. Inside the same compare-and-swap window that writes the new status, the provider reads the current fromStatus and version from the existing manifest and appends a complete history entry. No separate read, no race between status write and history append.

async update(
  taskId: string,
  patch: ManifestPatch,
  transition?: StatusTransitionInput,
): Promise<Manifest> {
  return this.casWrite(taskId, (current) => {
    const next = applyPatch(current, patch);

    if (transition) {
      next.statusHistory = [
        ...(current.statusHistory ?? []),
        {
          fromStatus: current.status,
          toStatus: patch.status ?? current.status,
          stepName: transition.stepName,
          version: current.version,
          actorId: transition.actorId,
          actorType: transition.actorType,
          reason: transition.reason,
          errorReason: transition.errorReason,
          timestamp: new Date().toISOString(),
        },
      ];
    }

    return next;
  });
}

I instrumented thirty call sites across user-facing routes, background workers, the batch service, pipeline modules, and the rework orchestrator. Each call site passes the correct actorType: user routes stamp USER, workers and pipeline modules stamp SYSTEM. Critically, each transition carries its own errorReason rather than relying on a manifest-level error field. Without per-transition error reasons, a rework rerun could inherit the original failure's reason and misattribute the new failure to the old root cause.

The instrumentation pass added 367 lines across 19 files with 36 tests validating that every mutation path appends a history entry and that CAS conflicts do not produce duplicate or out-of-order records.

Performance and Security

The naive implementation listed step folders sequentially, then read each trace file one at a time. For a task with thirteen steps, that was thirteen serial round trips to GCS before any parsing began. I restructured the read path into three parallel stages: list all step folders, read all trace files via Promise.all, then parse and enrich in memory. The serial pipeline became three parallel batches.

The API response is cacheable. I set Cache-Control headers with a short TTL so repeated views of the same task within a debugging session do not re-hit GCS. For tasks that are actively running, the TTL is short enough that new events appear within seconds without requiring a hard refresh.

Security mattered too. Step trace files can contain raw model outputs and internal payloads. The step-detail events route strips sensitive fields for non-admin users before the response leaves the server. Admins see the full enrichment; standard operators see previews and sanitized detail panels.

What Changed in Practice

The combined delivery — read-path aggregation plus write-path instrumentation — landed at roughly 7,300 lines across 73 files over 64 commits. That sounds large, but the scope was end-to-end: backend provider interface, aggregation service, API routes, five detail panel components, actor resolution, pagination logic, error classification, parallelized GCS reads, cache headers, auth-gated payload stripping, and thirty instrumented write call sites.

The operational impact was immediate. A debugging session that previously required downloading four different file types and manually correlating timestamps now starts and ends in one browser tab. When a task is stuck in a dedup gate, the timeline shows the exact matched task, the similarity score, and the threshold in the collapsed row — no GCS console required. When a QC check blocks advancement, the breakdown panel shows which checks failed and what remediation the system recommends.

Pipeline observability is a read-side and write-side problem. The events table is the read side — aggregation, enrichment, ordering, and a UI that respects how operators actually debug. The statusHistory instrumentation is the write side — making sure every transition is recorded at the moment it happens, with the actor, reason, and error context that make the timeline trustworthy. Neither works without the other. Together, they turned a platform that hoarded observability data across disconnected storage into one where the causal chain is a first-class product surface.