When Two Cache Layers Serve Stale Data
Users on the dashboard reported a confusing pattern: when they clicked a completed task in the listing, the detail page briefly showed an intermediate In Progress state—step 1 or step 2 still running—before snapping to the correct completed view. The listing page always showed the right status. Only the detail page lied, and only for a moment. That asymmetry made the bug feel like a rendering glitch until we traced two independent cache layers, each serving stale data for different reasons. Fixing one was not enough; both had to change.
This post walks through that investigation: the architecture, two root causes, the fixes, and why sometimes the right answer is to remove a cache instead of perfecting its invalidation.
The bug report
The symptom was consistent and reproducible under the right timing. Open a task while it is still executing, leave, wait for the worker pipeline to finish, then open the same task from the listing where it now appears completed. For a second or two the detail page showed an old pipeline step—In Progress—then updated. No hard refresh was required; the UI eventually corrected itself. That “eventually” pointed at caching or polling, not at permanently wrong data in object storage.
Because the listing was correct, engineers first suspected the detail page’s client state or a race in the UI. Network tab inspection showed something subtler: the first response body for the detail API sometimes described an earlier step, even though manifests in storage had already moved on. Reproducing the issue required deliberate timing—visit during execution, navigate away, wait for completion, return from the listing—but once you had that sequence, the wrong first paint was reliable enough to rule out a one-off race.
Two data paths on the detail page
The detail page does not read storage directly in the browser. It has two stacked paths:
- Server: An API route loads task details from object storage via a TaskDetailService and returns JSON. That handler was wrapped in Next.js unstable_cache with a 15-second TTL and tag-based revalidation.
- Client: A React Query hook calls that API route on an interval, every 10 seconds, with background refetching while the tab is focused.
So a single user session could hit a cached server response and reuse a cached client query result. The listing page followed a different path: it read manifests fresh from storage on each revalidation, without the same server cache in front of the detail route. Same task, two pages, two truths—that is what made diagnosis slow.
Browser (detail page)
    |
    |  React Query (client cache, default gcTime 5 min)
    |  poll every 10s
    v
GET /api/tasks/[id] ----> unstable_cache (server, TTL 15s, tags)
    |                           |
    |                           v
    +------------------> TaskDetailService ----> object storage

Listing page (separate path)
    |
    v
manifests read fresh from storage (no detail-route server cache)
Root cause 1: server cache never invalidated
The API route used unstable_cache with a revalidate tag so that, in theory, writers could bust the cache when task state changed. In practice, the worker pipeline that advances steps never called revalidateTag. Workers update manifests directly in object storage. They have no knowledge of a Next.js cache sitting in front of the detail route.
What happened in production looked like this: a user opened the detail page while the task was on step 1. The server cached that JSON for up to 15 seconds. The worker finished step 2 and step 3 in storage, but nothing told the web layer to drop the entry. Another visit inside the TTL window got the cached “step 1 running” payload. React Query’s polling would fetch fresher data once the entry expired, but the first paint could still be wrong whenever the first request of a visit was answered from the server cache.
That is the classic “cache invalidation only works if every writer knows about the cache” problem. Your mental model might assume tags connect storage writes to HTTP responses automatically. In a split architecture—workers in one runtime, Next.js in another—the tag is invisible unless you build an explicit bridge. We had defined the bridge on paper (tags on the cache entry) but never wired the pipeline to call it.
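The failure mode can be reproduced with a small in-memory model. The sketch below is an illustration only: TaggedTtlCache, readDetail, the simulated clock, and the task id "42" are hypothetical names, and the class mimics the TTL-plus-tags behavior described here rather than the actual Next.js implementation.

```typescript
// Hypothetical in-memory model of a tagged TTL cache. It only mimics the
// behavior described above and is not how Next.js implements unstable_cache.
type Entry = { value: string; expiresAt: number; tags: string[] };

class TaggedTtlCache {
  private entries = new Map<string, Entry>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Serve the cached value until it expires; otherwise load and store it.
  get(key: string, tags: string[], load: () => string): string {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) return hit.value;
    const value = load();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs, tags });
    return value;
  }

  // Drop every entry carrying the tag. Useless unless some writer calls it.
  revalidateTag(tag: string): void {
    this.entries.forEach((entry, key) => {
      if (entry.tags.includes(tag)) this.entries.delete(key);
    });
  }
}

// Simulated clock and object storage, so the example needs no timers.
let clockMs = 0;
const storage = new Map<string, string>([["42", "step 1 running"]]);
const cache = new TaggedTtlCache(15_000, () => clockMs);

function readDetail(id: string): string {
  return cache.get(`task-detail-${id}`, [`task-${id}`], () => storage.get(id) ?? "unknown");
}

const firstVisit = readDetail("42"); // fills the cache with the step 1 payload
storage.set("42", "completed");      // the worker writes storage and nothing else
clockMs = 5_000;
const insideTtl = readDetail("42");  // answered from the cache, still step 1
cache.revalidateTag("task-42");      // the bridge the pipeline never called
const afterBust = readDetail("42");  // reloaded from storage

console.log({ firstVisit, insideTtl, afterBust });
```

In this model the second read returns the step 1 payload even though storage already says completed, and only the explicit revalidateTag call (or waiting out the TTL) brings the response back in line with storage.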
The tempting fix was to sprinkle revalidateTag calls through the worker pipeline whenever a step completes. That couples pipeline code to a specific web framework’s cache API—tight coupling across system boundaries, and easy to forget on the next new writer.
The better fix for this route: remove the server-side cache entirely. Client polling already bounds how often the browser hits the API. A 15-second server cache bought little latency and added a stale-serving failure mode whenever invalidation was incomplete.
Before: cached API handler
import { unstable_cache } from "next/cache";
const TASK_DETAIL_TAG = "task-detail";
export async function GET(
  _req: Request,
  { params }: { params: { id: string } }
) {
  const getCachedDetail = unstable_cache(
    async () => TaskDetailService.getById(params.id),
    [`task-detail-${params.id}`],
    { revalidate: 15, tags: [TASK_DETAIL_TAG, `task-${params.id}`] }
  );
  const detail = await getCachedDetail();
  return Response.json(detail);
}
After: direct service call
export async function GET(
  _req: Request,
  { params }: { params: { id: string } }
) {
  const detail = await TaskDetailService.getById(params.id);
  return Response.json(detail);
}
After deploying that change, server responses tracked storage on every request. The flash did not disappear.
Root cause 2: client gcTime replayed old data
Even with a fresh server, users still saw a brief stale state. The second layer was React Query’s default gcTime (formerly cacheTime): five minutes. While a user stayed on the detail page during execution, React Query stored the “step 1 in progress” result. When they navigated away and returned after completion, React Query immediately showed that cached entry—instant wrong UI—while a background refetch ran. Fresh data replaced it a moment later. Correct outcome, jarring path.
For a detail view where correctness on entry matters more than instant back-navigation, we set gcTime: 0. Leaving the page discards cached data immediately. The next visit shows a loading state, then the current truth. No stale flash. The tradeoff is intentional: users who bounce between list and detail see a spinner on every return visit, but they never see a completed task masquerading as mid-pipeline work.
Note that staleTime and gcTime solve different problems. staleTime controls how long data is considered fresh before a refetch; gcTime controls how long inactive query data stays in memory after the last subscriber unmounts. Our bug was almost entirely the latter—old data resurrected on remount, not a refusal to refetch while the page was open.
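The gcTime half of that distinction can be sketched without React at all. QueryCacheModel and firstPaintOnReturn below are hypothetical names, and the model covers only what happens to inactive data between unmount and remount; it does not reproduce staleTime, refetching, or React Query's internals.

```typescript
// Hypothetical model of a client query cache. It mimics only the gcTime
// behavior discussed above and is not React Query's implementation.
type CachedQuery = { data: string; inactiveSince: number | null };

class QueryCacheModel {
  private entries = new Map<string, CachedQuery>();
  private gcTimeMs: number;
  private now: () => number;

  constructor(gcTimeMs: number, now: () => number) {
    this.gcTimeMs = gcTimeMs;
    this.now = now;
  }

  // A fetch finished while the component was mounted.
  setData(key: string, data: string): void {
    this.entries.set(key, { data, inactiveSince: null });
  }

  // The last subscriber unmounted; the garbage-collection clock starts.
  unmount(key: string): void {
    const entry = this.entries.get(key);
    if (entry !== undefined) entry.inactiveSince = this.now();
  }

  // First paint on mount: replay cached data unless it was collected.
  mount(key: string): string {
    const entry = this.entries.get(key);
    if (entry === undefined) return "loading";
    const collected =
      entry.inactiveSince !== null && this.now() - entry.inactiveSince >= this.gcTimeMs;
    if (collected) {
      this.entries.delete(key);
      return "loading";
    }
    entry.inactiveSince = null;
    return entry.data;
  }
}

// Visit during execution, leave, wait one minute, then return.
function firstPaintOnReturn(gcTimeMs: number): string {
  let clockMs = 0;
  const queries = new QueryCacheModel(gcTimeMs, () => clockMs);
  queries.setData("task-detail/42", "step 1 running");
  queries.unmount("task-detail/42");
  clockMs = 60_000; // the worker finishes somewhere in this gap
  return queries.mount("task-detail/42");
}

const withDefaultGcTime = firstPaintOnReturn(5 * 60_000);
const withZeroGcTime = firstPaintOnReturn(0);
console.log({ withDefaultGcTime, withZeroGcTime });
```

With a five-minute window the return visit paints the step 1 snapshot first; with a zero window the entry is gone and the first paint is a loading state.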
Before: default garbage collection
export function useTaskDetail(taskId: string) {
  return useQuery({
    queryKey: ["task-detail", taskId],
    queryFn: () => fetchTaskDetail(taskId),
    refetchInterval: 10_000,
    refetchIntervalInBackground: true,
    // gcTime defaults to 5 minutes, so a stale snapshot survives navigation
  });
}
After: discard cache on unmount
export function useTaskDetail(taskId: string) {
  return useQuery({
    queryKey: ["task-detail", taskId],
    queryFn: () => fetchTaskDetail(taskId),
    refetchInterval: 10_000,
    refetchIntervalInBackground: true,
    gcTime: 0,
  });
}
Why both fixes were required
These layers are independent. Removing only the server cache still left React Query replaying a snapshot up to five minutes old on remount. Setting only gcTime: 0 still allowed unstable_cache to return JSON up to 15 seconds old from the API. Each layer could be “correct” in isolation while the combined system was wrong.
Cache invalidation is only as reliable as the least-invalidated layer. Here, the worker never participated in server invalidation, and the client deliberately kept data warm for UX on other pages—defaults that made sense globally but hurt this screen.
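To see why neither fix is sufficient alone, the return visit can be modeled as a tiny timeline. Everything in this sketch is an assumption made for illustration: the function name statesSeenOnReturn, the CacheConfig shape, and the timings (leave at 2 s, worker finishes at 5 s, return at 8 s). It records what the user would see at first paint and after the first refetch.

```typescript
// Hypothetical timeline model; names and timings are illustrative assumptions.
type CacheConfig = { serverCacheTtlMs: number; clientGcTimeMs: number };

function statesSeenOnReturn(config: CacheConfig): string[] {
  const STALE = "step 1 running";
  let storage = STALE;

  // t = 0 s: the first visit fills both layers with the in-progress state.
  const serverEntry: { value: string; cachedAt: number } | null =
    config.serverCacheTtlMs > 0 ? { value: storage, cachedAt: 0 } : null;
  const clientEntry = { value: storage, unmountedAt: 2_000 }; // user leaves at t = 2 s

  // t = 5 s: the worker finishes and writes storage; no cache is told.
  storage = "completed";

  // t = 8 s: the user opens the detail page again from the listing.
  const now = 8_000;
  const seen: string[] = [];

  // Client layer: replay the cached entry unless it was garbage collected.
  const clientAlive = now - clientEntry.unmountedAt < config.clientGcTimeMs;
  seen.push(clientAlive ? clientEntry.value : "loading");

  // Server layer: the refetch is answered from the cache while the TTL holds.
  const fromServer =
    serverEntry !== null && now - serverEntry.cachedAt < config.serverCacheTtlMs
      ? serverEntry.value
      : storage;
  seen.push(fromServer);
  return seen;
}

const bothCaches = statesSeenOnReturn({ serverCacheTtlMs: 15_000, clientGcTimeMs: 300_000 });
const serverFixOnly = statesSeenOnReturn({ serverCacheTtlMs: 0, clientGcTimeMs: 300_000 });
const clientFixOnly = statesSeenOnReturn({ serverCacheTtlMs: 15_000, clientGcTimeMs: 0 });
const bothFixes = statesSeenOnReturn({ serverCacheTtlMs: 0, clientGcTimeMs: 0 });

console.log({ bothCaches, serverFixOnly, clientFixOnly, bothFixes });
```

Only the configuration with both fixes never surfaces the step 1 state. With the server fix alone the stale value appears at first paint; with the client fix alone it appears after the loading state, when the first response arrives from the server cache.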
| Layer | Symptom if unfixed | Fix |
|---|---|---|
| Server unstable_cache | Stale JSON within TTL; listing vs detail mismatch | Remove cache on detail route |
| React Query client cache | Instant replay of old step on return visit | gcTime: 0 on detail query |
Listing vs detail asymmetry
The listing page worked because it never depended on the cached detail handler. It aggregated status from manifests refreshed on a separate revalidation path. So operators trusted the grid and distrusted the detail page—a classic split-brain symptom when two features read the same domain through different pipelines.
When debugging “page A is right, page B is wrong,” map the full path for each screen before assuming a shared bug in storage or workers. Here, storage was fine; the caches in front of the detail route were not.
The lesson
Every cache you add is a potential source of stale data. Multiple layers multiply that risk: a bug or omission in either layer can serve old state even if the other is perfect. Tag-based revalidation only works when every writer that mutates underlying data knows about the tag—and workers that touch object storage often do not.
Sometimes the right fix is not better invalidation but fewer caches. Server caching on a route that is already polled every ten seconds was redundant. Client caching with a five-minute gcTime was wrong for a view where users care about the first paint after navigation. Together, those choices produced a bug that looked like a UI flicker but was really two systems each doing what they were told—just not coordinated with how tasks actually change in the platform.
If you are seeing a brief wrong state that self-corrects, check the stack twice: once at the edge of your framework on the server, and once in your data library on the client. Fix one, verify, then fix the other. Stale data rarely respects how neatly you drew the architecture diagram.