When RAG Says Duplicate but the LLM Disagrees: Building an Adjudication Layer

By · 8 min read
RAG LLM TypeScript Deduplication Embeddings Feature Flags

The platform I work on generates training data through AI agents. Before each new task is created, a deduplication gate checks whether something similar already exists. The gate uses the retrieval side of RAG (retrieval-augmented generation): it embeds the candidate task, searches a vector index of existing tasks, and blocks generation when cosine similarity crosses a threshold. It is a sensible first line of defense against redundant training data.

It was also blocking valid, distinct tasks. Two tasks could share the same persona and task type but differ in subdomain context (for example, a retail softlines scenario versus a retail hardlines scenario) and still score above the duplicate threshold because their embeddings were nearly identical. The semantic overlap was real. So was the domain-specific distinction, but the embedding barely registered it. A human reviewer would immediately see these as different tasks. The embedding model did not.

I added an LLM adjudication layer that gives RAG a second opinion on ambiguous matches. The change shipped behind a feature flag across ten files and roughly 1,300 lines of TypeScript, with 109 tests. This post walks through the false-positive problem, the two-threshold band architecture, and why I chose fail-closed semantics when the LLM is unavailable.

The false positive problem

RAG dedup works by embedding similarity. When a candidate task arrives at the pipeline gate, the system computes its vector representation, queries the index for top-K nearest neighbors, and compares the highest similarity score against a fixed crossThreshold (0.80 in our configuration). Above that line, the task is treated as a duplicate and generation stops.
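To make the gate decision concrete, here is a minimal sketch of the similarity check, assuming embeddings are plain number arrays. The names cosineSimilarity and isRagDuplicate are illustrative, not functions from the actual codebase; in the real pipeline the vector index returns the scores.

```typescript
// Hypothetical helpers for illustration only.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// RAG-only decision: duplicate when the best neighbor meets the threshold.
function isRagDuplicate(
  candidate: number[],
  neighbors: number[][],
  crossThreshold = 0.8, // the value quoted in this post
): boolean {
  const topScore = neighbors.reduce(
    (best, neighbor) => Math.max(best, cosineSimilarity(candidate, neighbor)),
    0, // no neighbors means a top score of 0
  );
  return topScore >= crossThreshold;
}
```

Two vectors pointing in nearly the same direction are flagged regardless of which dimensions carry the small difference, which is why shared scaffolding can drown out subdomain context.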

The approach is fast and cheap. It also conflates "structurally similar" with "actually duplicate." Tasks that share a persona ("store manager"), a task type ("inventory reconciliation"), and broad domain framing ("retail operations") produce embeddings that cluster tightly in vector space. The subdomain context that makes them meaningfully different — softlines versus hardlines, apparel versus electronics — contributes less signal than the shared scaffolding.

Embedding similarity captures semantic overlap but misses domain-specific distinctions that humans recognize instantly. That gap is where false positives live.

We were seeing legitimate new tasks rejected at the gate. Operators would retry with slightly different wording and hit the same wall. The dedup system was doing its job too aggressively, and raising the global threshold to compensate would have let real duplicates through. We needed a way to resolve ambiguity without abandoning the fast path for obvious non-matches.

Two-threshold band architecture

The fix is not "call an LLM on every dedup check." That would add 10–30 seconds of latency to every task submission and burn inference budget on cases where the answer is already obvious. Instead, I introduced a review band between two thresholds:

  • Below llmClearThreshold (score < 0.75 by default): auto-clear. No LLM call. The match is weak enough that we trust RAG's negative signal.
  • From llmClearThreshold up to crossThreshold (0.75 ≤ score < 0.80): LLM review required. RAG is uncertain here; this is the ambiguous zone where embedding scores are least trustworthy.
  • At or above crossThreshold (score ≥ 0.80): RAG says duplicate, but the LLM gets an override opportunity. A strong embedding match is not automatically a true duplicate.

The band means LLM latency is only paid for cases that actually need judgment. Clear non-matches never touch the model; only scores at or above llmClearThreshold are sent for adjudication. The architecture looks like this in the dedup flow:

type DedupClearReason = 'below_threshold' | 'no_matches' | 'llm_cleared';

interface DedupVerdict {
  isDuplicate: boolean;
  topScore: number;
  clearReason?: DedupClearReason;
  llmAdjudication?: LlmAdjudication;
}

async function checkForDuplicate(
  candidate: TaskCandidate,
  config: DedupConfig,
): Promise<DedupVerdict> {
  const matches = await ragSearch(candidate, { topK: 5 });
  const topScore = matches[0]?.similarity ?? 0;

  if (matches.length === 0) {
    return { isDuplicate: false, topScore, clearReason: 'no_matches' };
  }

  if (!config.llmEnabled) {
    return {
      isDuplicate: topScore >= config.crossThreshold,
      topScore,
      clearReason: topScore < config.crossThreshold ? 'below_threshold' : undefined,
    };
  }

  if (topScore < config.llmClearThreshold) {
    return { isDuplicate: false, topScore, clearReason: 'below_threshold' };
  }

  // Review band or above crossThreshold — consult the LLM
  const adjudication = await adjudicateWithLlm(candidate, matches, config);
  return {
    isDuplicate: adjudication.isDuplicate,
    topScore,
    clearReason: adjudication.isDuplicate ? undefined : 'llm_cleared',
    llmAdjudication: adjudication,
  };
}

Notice the early returns. The LLM path is the exception, not the default. In production, the vast majority of candidates score below 0.75 and clear in a single RAG round-trip with zero added latency.
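The boundary behavior is easy to get wrong, so here is a small sketch that names the band a score falls into. classifyBand and DedupBand are hypothetical names introduced for illustration; the defaults mirror the thresholds quoted above.

```typescript
// Hypothetical helper: maps a top similarity score to its band.
type DedupBand = 'auto_clear' | 'review_band' | 'override_zone';

function classifyBand(
  topScore: number,
  llmClearThreshold = 0.75,
  crossThreshold = 0.8,
): DedupBand {
  // Strictly below the lower threshold: RAG's negative signal is trusted.
  if (topScore < llmClearThreshold) return 'auto_clear';
  // Lower threshold inclusive, upper threshold exclusive.
  if (topScore < crossThreshold) return 'review_band';
  // At or above crossThreshold: RAG says duplicate, LLM may override.
  return 'override_zone';
}

// Both upper bands route to the LLM; only auto_clear skips it.
const needsLlm = (band: DedupBand): boolean => band !== 'auto_clear';
```

A score of exactly 0.75 lands in the review band and a score of exactly 0.80 lands in the override zone, matching the `<` and `>=` comparisons in checkForDuplicate.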

The LLM adjudicator module

Rather than build a new inference endpoint, I integrated with an already-deployed auto-rater service. That service exposes POST /api/v1/review with support for a custom_evaluation_prompt, which is exactly what structured dedup judgment needs. The adjudicator submits the candidate task plus the top-K RAG matches, then polls for the result — submit-and-poll, not synchronous blocking.

interface LlmAdjudication {
  isDuplicate: boolean;
  reason: string;
  confidence: number;
  durationMs: number;
}

async function adjudicateWithLlm(
  candidate: TaskCandidate,
  matches: RagMatch[],
  config: DedupConfig,
): Promise<LlmAdjudication> {
  const start = Date.now();

  const reviewId = await submitReview({
    endpoint: `${config.autoRaterBaseUrl}/api/v1/review`,
    custom_evaluation_prompt: buildDedupPrompt(candidate, matches),
    payload: { candidate, matches },
  });

  const result = await pollForResult(reviewId, {
    timeoutMs: config.llmTimeoutMs,       // default: 90_000
    intervalMs: config.llmPollIntervalMs, // default: 5_000
  });

  return {
    isDuplicate: parseIsDuplicate(result.classification, result.review_feedback),
    reason: result.review_feedback ?? result.classification,
    confidence: result.confidence ?? 0,
    durationMs: Date.now() - start,
  };
}
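The pollForResult helper is not shown above. A minimal sketch of the submit-and-poll loop might look like the following. The ReviewStatus shape and the injected fetchStatus callback are assumptions made to keep the example self-contained; the real auto-rater status API may differ.

```typescript
// Assumed response shape, based on the fields used in adjudicateWithLlm.
interface ReviewResult {
  classification: string;
  review_feedback?: string;
  confidence?: number;
}

// Assumed status envelope (hypothetical).
type ReviewStatus =
  | { state: 'pending' }
  | { state: 'done'; result: ReviewResult };

interface PollOptions {
  timeoutMs: number;
  intervalMs: number;
  // Injected so the sketch stays transport-agnostic (hypothetical parameter).
  fetchStatus: (reviewId: string) => Promise<ReviewStatus>;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function pollForResult(
  reviewId: string,
  opts: PollOptions,
): Promise<ReviewResult> {
  const deadline = Date.now() + opts.timeoutMs;
  for (;;) {
    const status = await opts.fetchStatus(reviewId);
    if (status.state === 'done') return status.result;
    // Give up if the next poll would land past the deadline.
    if (Date.now() + opts.intervalMs > deadline) {
      throw new Error(`Review ${reviewId} timed out after ${opts.timeoutMs} ms`);
    }
    await sleep(opts.intervalMs);
  }
}
```

Throwing on timeout, rather than returning a default, leaves the fail-closed decision to the caller.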

The custom evaluation prompt instructs the model to compare the candidate against each retrieved match on persona, task type, and subdomain context. It must return a structured classification — duplicate or not — with a natural-language reason explaining the decision. The parser handles both structured JSON responses and plain-text feedback, with a fallback for malformed output that defaults to duplicate (fail-closed, described below).
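As a rough illustration of that parsing order, here is a sketch of parseIsDuplicate. The label strings, the is_duplicate JSON field, and the plain-text pattern are all assumptions, since the actual prompt contract is not reproduced in this post.

```typescript
// Assumed labels (hypothetical); the real classification values may differ.
const NOT_DUPLICATE_LABELS = new Set(['not_duplicate', 'not duplicate']);
const DUPLICATE_LABELS = new Set(['duplicate']);

function parseIsDuplicate(
  classification?: string,
  reviewFeedback?: string,
): boolean {
  // 1. Bare label in the classification field.
  const label = classification?.trim().toLowerCase();
  if (label !== undefined) {
    if (NOT_DUPLICATE_LABELS.has(label)) return false;
    if (DUPLICATE_LABELS.has(label)) return true;
  }

  // 2. Structured JSON in either field, e.g. {"is_duplicate": false}.
  for (const raw of [classification, reviewFeedback]) {
    if (!raw) continue;
    try {
      const parsed: unknown = JSON.parse(raw);
      if (typeof parsed === 'object' && parsed !== null && 'is_duplicate' in parsed) {
        const flag = (parsed as { is_duplicate: unknown }).is_duplicate;
        if (typeof flag === 'boolean') return flag;
      }
    } catch {
      // Not JSON; fall through to plain-text handling.
    }
  }

  // 3. Plain-text feedback: only an explicit "not a duplicate" clears.
  if (reviewFeedback && /\bnot\s+(a\s+)?duplicates?\b/i.test(reviewFeedback)) {
    return false;
  }

  // 4. Anything else is ambiguous, so fail closed.
  return true;
}
```

The key property is the last line: every path that cannot positively establish "not a duplicate" ends in true.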

Storing the full LlmAdjudication metadata on the verdict gives operators an audit trail — reason, confidence, and latency — in the pipeline event log.

Integration with the existing dedup flow

The adjudication layer slots into the existing rag-dedup.ts module without changing its external contract. The gate in step1-gate.ts — the first node in the task generation pipeline — already checks dedup verdicts before authorizing ingest. I added a new clear reason, llm_cleared, alongside the existing below_threshold and no_matches:

const AUTHORIZED_CLEAR_REASONS: DedupClearReason[] = [
  'below_threshold',
  'no_matches',
  'llm_cleared',
];

function authorizeTaskIngest(verdict: DedupVerdict): boolean {
  if (verdict.isDuplicate) return false;
  if (!verdict.clearReason) return false;
  return AUTHORIZED_CLEAR_REASONS.includes(verdict.clearReason);
}

The gate whitelists clear reasons rather than treating "not duplicate" as sufficient — incomplete verdict objects cannot accidentally bypass the gate.

Fail-closed safety

When the LLM is unavailable (network error, auto-rater timeout, malformed response), the system fails closed. The task is treated as a duplicate. No silent pass-through on inference failure.

This was a deliberate product decision, not a technical default. The dedup gate exists to prevent redundant training data from entering the corpus. A false negative (letting a duplicate through) corrupts the dataset silently. A false positive (blocking a valid task) is visible: the operator sees a rejection, can investigate, and can retry. Erring on the side of caution is the right tradeoff for a data quality gate.

An operator escape hatch env var exists for emergencies — auto-rater downtime, for example — but it is off by default and requires explicit per-environment configuration.
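One way to express the fail-closed rule is a thin wrapper around the adjudicator that converts any thrown error into a duplicate verdict. This is a sketch under assumptions: adjudicateFailClosed and the llm_unavailable reason prefix are hypothetical names, and the escape hatch is deliberately left out because its exact behavior is not described here.

```typescript
// Mirrors the LlmAdjudication interface shown earlier in this post.
interface LlmAdjudication {
  isDuplicate: boolean;
  reason: string;
  confidence: number;
  durationMs: number;
}

// Hypothetical wrapper: any adjudicator failure becomes a "duplicate" verdict.
async function adjudicateFailClosed(
  adjudicate: () => Promise<LlmAdjudication>,
): Promise<LlmAdjudication> {
  const start = Date.now();
  try {
    return await adjudicate();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return {
      isDuplicate: true, // fail closed: never clear on inference failure
      reason: `llm_unavailable: ${message}`,
      confidence: 0,
      durationMs: Date.now() - start,
    };
  }
}
```

Because the failure is recorded in the reason field, an operator reading the event log can tell a rejection caused by LLM unavailability apart from a genuine duplicate judgment.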

Latency analysis

The band architecture keeps latency predictable:

Score range                  LLM called?   Added latency
Below 0.75                   No            0 ms
0.75 – 0.80 (review band)    Yes           ~10–30 s typical
≥ 0.80 (override zone)       Yes           ~10–30 s typical
LLM timeout (90 s)           Failed        Fail-closed → duplicate

The 10–30 second range comes from auto-rater submit (~1 s) plus polling at 5-second intervals until the review completes. This is acceptable because dedup runs inside Cloud Tasks workers — asynchronous background jobs, not the user-facing request path. Nobody is staring at a spinner waiting for the LLM.

Feature-flagged rollout

The entire feature ships disabled by default. Environment variables control every knob:

  • AUTO_SEED_DEDUP_LLM_ENABLED — master switch (default: false)
  • AUTO_SEED_DEDUP_LLM_CLEAR_THRESHOLD — lower bound of review band (default: 0.75)
  • AUTO_SEED_DEDUP_LLM_TIMEOUT_MS — max poll duration (default: 90000)
  • AUTO_SEED_DEDUP_LLM_POLL_INTERVAL_MS — poll frequency (default: 5000)

Kubernetes ConfigMaps control per-environment rollout. Staging enables the flag first; production follows once clearance rates in staging have been validated. The config module clamps thresholds and rejects malformed env vars at startup.
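A sketch of what clamping and rejecting at startup could look like is below. The env var names and defaults come from the list above; the function names, the clamp ranges, and the accepted boolean spellings are assumptions for illustration.

```typescript
interface DedupLlmConfig {
  llmEnabled: boolean;
  llmClearThreshold: number;
  llmTimeoutMs: number;
  llmPollIntervalMs: number;
}

type Env = Record<string, string | undefined>;

function parseNumber(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === '') return fallback;
  const value = Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`${key} must be a finite number, got "${raw}"`);
  }
  return value;
}

function parseBoolean(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key]?.trim().toLowerCase();
  if (raw === undefined || raw === '') return fallback;
  if (raw === 'true') return true;
  if (raw === 'false') return false;
  throw new Error(`${key} must be "true" or "false", got "${raw}"`);
}

const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

// Hypothetical loader; throws on malformed values so bad config fails at startup.
function loadDedupLlmConfig(env: Env, crossThreshold = 0.8): DedupLlmConfig {
  return {
    llmEnabled: parseBoolean(env, 'AUTO_SEED_DEDUP_LLM_ENABLED', false),
    // Assumed range: keep the review band non-inverted (0 <= clear <= cross).
    llmClearThreshold: clamp(
      parseNumber(env, 'AUTO_SEED_DEDUP_LLM_CLEAR_THRESHOLD', 0.75),
      0,
      crossThreshold,
    ),
    llmTimeoutMs: Math.max(0, parseNumber(env, 'AUTO_SEED_DEDUP_LLM_TIMEOUT_MS', 90_000)),
    llmPollIntervalMs: Math.max(1, parseNumber(env, 'AUTO_SEED_DEDUP_LLM_POLL_INTERVAL_MS', 5_000)),
  };
}
```

In a Node service this would typically be called once with process.env during boot, so a typo in a ConfigMap surfaces as a crash at deploy time rather than as a silently mis-sized review band.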

Testing

I wrote 109 tests across six files. The breakdown covers the adjudicator in isolation, the dedup integration, configuration parsing, and gate authorization:

  • llm-adjudicator.test.ts (8 tests): approval, rejection, submit failure, poll timeout, plain-text feedback parsing, empty results, network error, malformed JSON fallback.
  • rag-dedup.test.ts (+4 tests): LLM disabled bypass, below-band auto-clear, in-band LLM clear, above-threshold LLM override, fail-closed on unavailability.
  • config.test.ts (+3 tests): env var parsing, threshold clamping to valid ranges.
  • step1-gate.test.ts (+1 test): llm_cleared authorization path.

The fail-closed tests verify that network errors, poll timeouts, and malformed responses all produce isDuplicate: true — never a silent clear.

What I learned

RAG is an excellent first-pass filter, but it is not a judge. Embeddings compress away the fine-grained distinctions that matter for domain-specific deduplication. When false positives come from structurally similar but contextually different tasks, the fix is a second opinion — not a better embedding model.

The two-threshold band keeps the LLM off the hot path for obvious cases, concentrates inference budget where ambiguity lives, and pairs every judgment with an audit trail. Fail-closed semantics and feature flags let us ship safely: the gate never silently degrades, and staging validates clearance rates before production.