GCS Race Conditions & Generation-Fenced Leases
The pipeline I work on runs hundreds of steps concurrently, each executed by a worker process that pulls from a queue, runs the step, and reports results. To prevent two workers from executing the same step at the same time, the system uses leases: before a worker starts a step, it acquires a lease stored as an object in Google Cloud Storage. The lease contains a token identifying the owner and an expiry timestamp. If the lease already exists and has not expired, other workers back off. When the step finishes, the worker deletes the lease. Simple enough on paper.
This post is about two race conditions hiding in that "simple" lease lifecycle, and the one-line fix that eliminated both of them.
How the lease works
Each step has a corresponding GCS object at a deterministic path — something like leases/{stepId}.json. The object body is a small JSON blob:
interface StepLease {
  token: string;     // unique per acquisition
  holder: string;    // worker identity
  expiresAt: string; // ISO timestamp
}
Acquiring a lease means creating or overwriting this object. Releasing means deleting it. If a worker crashes, the lease expires naturally, and another worker can reclaim the step by noticing the expired timestamp on a subsequent acquisition attempt.
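The expiry rule can be sketched as a small pure function. This is a minimal illustration, not the pipeline's code: `isLeaseExpired` and the injected `now` parameter are names I made up so the rule can be checked without touching the clock.

```typescript
interface StepLease {
  token: string;     // unique per acquisition
  holder: string;    // worker identity
  expiresAt: string; // ISO timestamp
}

// Hypothetical helper: a lease can be reclaimed once its expiry is at or before `now`.
// This mirrors the `expiresAt > now` "still active" check used in the acquisition code.
function isLeaseExpired(lease: StepLease, now: Date = new Date()): boolean {
  return new Date(lease.expiresAt).getTime() <= now.getTime();
}

const example: StepLease = {
  token: "abc-123",
  holder: "worker-1",
  expiresAt: "2024-01-01T00:00:00.000Z",
};
console.log(isLeaseExpired(example, new Date("2024-01-01T00:05:00.000Z"))); // true
```

The comparison uses whichever clock the calling worker has, so clock skew between workers shifts the effective expiry by the same amount.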
This pattern is common in distributed systems that use object storage as a coordination layer. It works well — until you look at the space between the read and the write.
Race condition 1: the release path
Consider the following sequence. Worker A acquires a lease for step 42, starts executing, but runs slower than expected. The lease expires. Worker B comes along, sees the expired lease, acquires a fresh one, and begins its own execution of step 42. Meanwhile, Worker A finishes and calls releaseStepLease().
The original release code looked like this:
async function releaseStepLease(
  bucket: Bucket,
  stepId: string,
  token: string
): Promise<void> {
  const file = bucket.file(`leases/${stepId}.json`);
  const [content] = await file.download();
  const lease: StepLease = JSON.parse(content.toString());
  if (lease.token !== token) {
    return; // not our lease anymore
  }
  await file.delete();
}
The intent is correct: read the lease, check if the token matches ours, and only then delete. But there is a gap between the download() and the delete(). In that gap, the GCS object can change. Here is the timeline that breaks things:
- T1: Worker A downloads the lease. The token matches, because Worker B's acquisition has not landed yet. (GCS object reads are strongly consistent, so this is purely a timing gap between two calls, not a stale read.)
- T2: Worker B overwrites the lease with a fresh token and a new expiry.
- T3: Worker A's token check passes (it compares against the copy it read at T1), and Worker A calls file.delete().
- T4: Worker B's lease is gone. Worker B is now executing step 42 without a lock.
Worker B has no idea its lease was destroyed. If a third worker comes along, it sees no lease at all and happily acquires one. Now two workers are running the same step. Depending on the step's side effects, this can mean duplicated writes, corrupted outputs, or subtle data inconsistencies that surface hours later.
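The interleaving can be replayed deterministically against an in-memory stand-in. This is only a sketch: `FakeObject` and `replayReleaseRace` are hypothetical names, and the fake models nothing about GCS beyond "read, overwrite, unconditional delete".

```typescript
type Lease = { token: string; holder: string };

// In-memory stand-in for a single lease object (not the real SDK).
class FakeObject {
  private value: Lease | null = null;
  read(): Lease | null {
    return this.value;
  }
  write(lease: Lease): void {
    this.value = lease;
  }
  // Unconditional delete, like the original release path.
  delete(): void {
    this.value = null;
  }
}

function replayReleaseRace(): Lease | null {
  const obj = new FakeObject();
  obj.write({ token: "token-A", holder: "worker-A" }); // A's lease, now expired

  const seenByA = obj.read();                          // T1: A reads its own token
  obj.write({ token: "token-B", holder: "worker-B" }); // T2: B overwrites the lease
  if (seenByA?.token === "token-A") {
    obj.delete();                                      // T3: A's check passes, A deletes
  }
  return obj.read();                                   // T4: what is left for B
}

console.log(replayReleaseRace()); // null: Worker B's lease was deleted
```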
Race condition 2: the expired-lease reclaim path
The second bug lives in the acquisition path. When a worker tries to acquire a lease and finds an existing one that has expired, it needs to clean up the stale lease and create a fresh one. The original code:
async function acquireStepLease(
  bucket: Bucket,
  stepId: string,
  workerIdentity: string
): Promise<AcquiredStepLease | null> {
  const file = bucket.file(`leases/${stepId}.json`);
  const [exists] = await file.exists();
  if (exists) {
    const [content] = await file.download();
    const lease: StepLease = JSON.parse(content.toString());
    if (new Date(lease.expiresAt) > new Date()) {
      return null; // active lease held by someone else
    }
    // Expired — delete and recreate
    await file.delete();
  }
  const newLease: StepLease = {
    token: crypto.randomUUID(),
    holder: workerIdentity,
    expiresAt: new Date(Date.now() + LEASE_TTL_MS).toISOString(),
  };
  await file.save(JSON.stringify(newLease));
  return { ...newLease, stepId };
}
Same pattern, same problem. Between the download() that reads the expired lease and the file.delete() that removes it, another worker may have already reclaimed it. Worker C reads the expired lease. Worker D reads the same expired lease. Worker C deletes it and creates a fresh one. Worker D then deletes Worker C's fresh lease and creates its own. Worker C is now running without a lock.
The window is small, but under load — dozens of workers polling the same expired leases — it is not hypothetical. It is a guaranteed occurrence at sufficient concurrency.
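To make the Worker C / Worker D interleaving concrete, here is a small deterministic replay. It is a sketch with made-up names (`replayReclaimRace`, the numeric `expiresAt`), and a plain variable stands in for the GCS object.

```typescript
type Lease = { token: string; expiresAt: number };

function replayReclaimRace(now: number): { holders: string[]; stored: Lease | null } {
  let stored: Lease | null = { token: "old", expiresAt: now - 1 }; // already expired
  const holders: string[] = [];
  const read = (): Lease | null => stored;

  const seenByC = read(); // Worker C reads the expired lease
  const seenByD = read(); // Worker D reads the same expired lease

  // Worker C: lease looks expired, so delete it and write a fresh one.
  if (seenByC !== null && seenByC.expiresAt <= now) {
    stored = null;
    stored = { token: "token-C", expiresAt: now + 60_000 };
    holders.push("worker-C");
  }

  // Worker D acts on its earlier read and deletes whatever is there now,
  // which is Worker C's fresh lease.
  if (seenByD !== null && seenByD.expiresAt <= now) {
    stored = null;
    stored = { token: "token-D", expiresAt: now + 60_000 };
    holders.push("worker-D");
  }

  return { holders, stored };
}

const outcome = replayReclaimRace(1_000_000);
console.log(outcome.holders); // both workers believe they hold the lease
```

Both workers return from acquisition believing they own step 42, while the stored object only names Worker D.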
The fix: generation fencing
GCS assigns a generation number to every object. Each time an object is created or overwritten, it gets a new generation. This number is available in the object's metadata and can be used as a precondition on subsequent operations. Fencing relies only on the generation changing with every write, not on any ordering between values.
The fix is conceptually simple: when you read a lease, also capture its generation. When you delete, pass ifGenerationMatch as a precondition. If any other worker has overwritten the object between your read and your delete, the generation will have changed, and GCS returns 412 Precondition Failed. Your delete becomes a no-op instead of destroying someone else's lease.
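The mechanism is easy to model in memory. The sketch below is not the GCS SDK: `FencedObject`, `httpError`, and `replayFencedRelease` are hypothetical names, and the only behavior modeled is "every write gets a new generation, and a delete only succeeds against the generation the caller names".

```typescript
// Error with an HTTP-style numeric `code`, similar in shape to what the text handles.
function httpError(code: number, message: string): Error & { code: number } {
  return Object.assign(new Error(message), { code });
}

// Hypothetical in-memory model of one object with a generation counter.
class FencedObject {
  private body: string | null = null;
  private generation = 0;
  private writes = 0;

  // Every write produces a generation that no earlier version had.
  write(body: string): number {
    this.body = body;
    this.generation = ++this.writes;
    return this.generation;
  }

  read(): { body: string; generation: number } | null {
    return this.body === null ? null : { body: this.body, generation: this.generation };
  }

  // Compare-and-delete: removes the object only if it is still the version the caller saw.
  delete(opts: { ifGenerationMatch: number }): void {
    if (this.body === null) throw httpError(404, "Not Found");
    if (this.generation !== opts.ifGenerationMatch) throw httpError(412, "Precondition Failed");
    this.body = null;
  }
}

function replayFencedRelease(): { code: number | null; survivor: string | null } {
  const obj = new FencedObject();
  const genA = obj.write("token-A"); // Worker A acquires
  obj.write("token-B");              // A's lease expires, Worker B overwrites it
  let code: number | null = null;
  try {
    obj.delete({ ifGenerationMatch: genA }); // Worker A releases late
  } catch (err: unknown) {
    code = (err as { code?: number }).code ?? null;
  }
  return { code, survivor: obj.read()?.body ?? null };
}

console.log(replayFencedRelease()); // code 412, survivor "token-B"
```

In this model a delete against an object that is already gone surfaces as a 404 rather than a 412. I would expect the real API to behave the same way, so that case may deserve its own handling.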
First, a helper that returns both the lease data and the generation:
interface LeaseWithGeneration {
  lease: StepLease;
  generation: number;
}
async function readLeaseWithGeneration(
  file: File
): Promise<LeaseWithGeneration | null> {
  const [exists] = await file.exists();
  if (!exists) return null;
  // Note: download() and getMetadata() are separate requests. If the object
  // is overwritten between them, the body and the generation can describe
  // different versions of the lease.
  const [content] = await file.download();
  const [metadata] = await file.getMetadata();
  return {
    lease: JSON.parse(content.toString()),
    generation: Number(metadata.generation),
  };
}
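One detail worth a guard: the generation is a 64-bit integer that arrives as a decimal string, and `Number()` silently rounds anything above 2^53 - 1. The values I have seen fit comfortably, but a defensive parse is cheap. `parseGeneration` below is a hypothetical helper, not part of the original code. An alternative, which I believe the SDK's precondition options accept, is to keep the generation as a string end to end.

```typescript
// Hypothetical helper: convert a generation to a number, refusing values
// that cannot be represented exactly as a JavaScript number.
function parseGeneration(raw: string | number): number {
  const value = Number(raw);
  if (!Number.isSafeInteger(value)) {
    throw new RangeError(`generation ${String(raw)} is not a safe integer`);
  }
  return value;
}

// 2^53 + 1 cannot be represented exactly; Number() rounds it to 2^53.
console.log(Number("9007199254740993") === 2 ** 53); // true

console.log(parseGeneration("1700000000000000")); // 1700000000000000
```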
The AcquiredStepLease type now carries the generation so that the release path has access to it without a second metadata fetch:
interface AcquiredStepLease {
  stepId: string;
  token: string;
  holder: string;
  expiresAt: string;
  generation: number;
}
The release path becomes:
async function releaseStepLease(
  bucket: Bucket,
  stepId: string,
  lease: AcquiredStepLease
): Promise<void> {
  const file = bucket.file(`leases/${stepId}.json`);
  try {
    await file.delete({
      ifGenerationMatch: lease.generation,
    });
  } catch (err: any) {
    if (err.code === 412) {
      // Lease was already overwritten by another worker.
      // Our lease expired and someone else reclaimed it — nothing to clean up.
      return;
    }
    throw err;
  }
}
No more read-then-check-then-delete. The generation precondition turns the delete into an atomic compare-and-delete. If the object has been touched since we acquired it, the delete silently fails, and the rightful owner keeps their lock.
The acquisition path gets the same treatment:
async function acquireStepLease(
  bucket: Bucket,
  stepId: string,
  workerIdentity: string
): Promise<AcquiredStepLease | null> {
  const file = bucket.file(`leases/${stepId}.json`);
  const existing = await readLeaseWithGeneration(file);
  if (existing) {
    if (new Date(existing.lease.expiresAt) > new Date()) {
      return null; // active lease
    }
    try {
      await file.delete({
        ifGenerationMatch: existing.generation,
      });
    } catch (err: any) {
      if (err.code === 412) {
        return null; // someone else already reclaimed it
      }
      throw err;
    }
  }
  const newLease: StepLease = {
    token: crypto.randomUUID(),
    holder: workerIdentity,
    expiresAt: new Date(Date.now() + LEASE_TTL_MS).toISOString(),
  };
  await file.save(JSON.stringify(newLease));
  const [metadata] = await file.getMetadata();
  return {
    stepId,
    ...newLease,
    generation: Number(metadata.generation),
  };
}
When two workers race to reclaim the same expired lease, at most one of them will have a matching generation on the delete. The loser gets a 412, returns null, and retries on the next poll cycle. A reclaim based on an outdated read can no longer delete a lease that another worker has just written.
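The same precondition can also guard the create step. In GCS, a generation match of 0 means "only succeed if no live object exists", so two workers that both find no lease cannot both write one. The sketch below models that in memory. `FakeLeaseObject` and `tryCreateLease` are hypothetical names. In the Node.js SDK I believe the equivalent is passing `preconditionOpts: { ifGenerationMatch: 0 }` to `save()`, but treat that option name as an assumption to check against the SDK reference.

```typescript
function preconditionFailed(): Error & { code: number } {
  return Object.assign(new Error("Precondition Failed"), { code: 412 });
}

// Hypothetical in-memory model: generation 0 stands for "no live object".
class FakeLeaseObject {
  private body: string | null = null;
  private generation = 0;
  private writes = 0;

  write(body: string, opts: { ifGenerationMatch?: number } = {}): number {
    if (opts.ifGenerationMatch !== undefined && opts.ifGenerationMatch !== this.generation) {
      throw preconditionFailed();
    }
    this.body = body;
    this.generation = ++this.writes;
    return this.generation;
  }

  read(): string | null {
    return this.body;
  }
}

// Create-if-absent: returns the new generation, or null if another worker won.
function tryCreateLease(obj: FakeLeaseObject, holder: string): number | null {
  try {
    return obj.write(holder, { ifGenerationMatch: 0 });
  } catch (err: unknown) {
    if ((err as { code?: number }).code === 412) return null;
    throw err;
  }
}

const leaseObject = new FakeLeaseObject();
const wonByC = tryCreateLease(leaseObject, "worker-C");
const wonByD = tryCreateLease(leaseObject, "worker-D");
console.log(wonByC, wonByD, leaseObject.read()); // 1 null worker-C
```

With the create fenced this way, a worker that loses the race sees a 412 on the write and backs off, exactly as it does on the delete.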
Testing the fix
The tests mock GCS's file.delete() and file.getMetadata() to verify that ifGenerationMatch is always present on delete calls. A representative pattern:
it("passes ifGenerationMatch when releasing", async () => {
  const deleteSpy = jest.fn().mockResolvedValue([]);
  mockFile.delete = deleteSpy;
  const lease: AcquiredStepLease = {
    stepId: "step-42",
    token: "abc-123",
    holder: "worker-1",
    expiresAt: new Date(Date.now() + 60_000).toISOString(),
    generation: 17,
  };
  await releaseStepLease(mockBucket, "step-42", lease);
  expect(deleteSpy).toHaveBeenCalledWith({
    ifGenerationMatch: 17,
  });
});
A second set of tests simulates the race: the mock throws a 412 error on delete(), and the test asserts that the function returns gracefully instead of propagating the error or leaving the system in a broken state.
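The 412 case can also be exercised without a mocking library by passing a hand-rolled fake. This is a sketch rather than the project's test suite: `DeletableFile`, `releaseFenced`, and `makeFakeFile` are hypothetical names, and unlike the real release function this variant returns a status so the outcome is easy to assert.

```typescript
// Minimal structural stand-in for the part of the file object that release uses.
interface DeletableFile {
  delete(opts: { ifGenerationMatch: number }): Promise<unknown>;
}

// Same shape as the fenced release: swallow 412, rethrow everything else.
async function releaseFenced(
  file: DeletableFile,
  generation: number
): Promise<"released" | "lost"> {
  try {
    await file.delete({ ifGenerationMatch: generation });
    return "released";
  } catch (err: unknown) {
    if ((err as { code?: number }).code === 412) {
      return "lost"; // someone else owns the lease now
    }
    throw err;
  }
}

// Fake file whose delete() either succeeds or rejects with the given status code.
function makeFakeFile(failWithCode?: number): DeletableFile {
  return {
    delete: async () => {
      if (failWithCode !== undefined) {
        throw Object.assign(new Error(`HTTP ${failWithCode}`), { code: failWithCode });
      }
      return [];
    },
  };
}

releaseFenced(makeFakeFile(412), 17).then((status) => console.log(status)); // "lost"
```

A fake that fails with any other code, such as 503, should make the promise reject, which is worth asserting alongside the 412 case.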
The pattern: optimistic concurrency on object storage
This is the same principle as database row versioning or HTTP ETags. Read the version, attempt a conditional write (or delete), handle conflicts. The terminology changes — generation in GCS, ETag in S3, version in a database row — but the concurrency control is identical.
The key insight is that GCS objects are not just blobs. They carry metadata — generation, metageneration, content hashes — that enables safe concurrent access without external coordination services. You do not need Redis, ZooKeeper, or DynamoDB to build correct distributed locking on GCS. The primitives are already there, built into every API call, waiting to be used.
Why not use a dedicated lock service?
For this system, the leases were already GCS objects. Adding ifGenerationMatch to existing delete() calls was a one-line change per call site. Introducing a Redis cluster or a Chubby-style lock service would mean new infrastructure to provision, monitor, and page on — for a problem that GCS metadata solves natively.
The tradeoff is real: GCS-based locking has higher latency than an in-memory lock service, and it does not support features like blocking acquisition (you have to poll). For a pipeline running hundreds of steps with lease durations measured in minutes, the latency is irrelevant and polling is already the natural acquisition pattern. Zero additional infrastructure, zero new failure modes, and the correctness guarantee comes from the storage layer itself rather than from a sidecar you have to keep alive.
Takeaways
Any time you see a read-then-act pattern on a shared mutable resource, ask what happens if the resource changes between the read and the act. In distributed systems, the answer is almost always "something bad, eventually." The fix is almost always some form of conditional operation: compare-and-swap, version-gated writes, generation-fenced deletes.
GCS makes this easy. The generation number is always there. The precondition parameters are already in the SDK. The 412 response is well-defined and easy to handle. The hard part is not the implementation — it is recognizing that you need it, which usually means staring at a timeline diagram until the interleaving jumps out at you.
Two race conditions, two stale deletes, one fix: stop trusting your read and start fencing your writes.