Building a Notification Service: Cloud Tasks Delivery, Circuit Breakers, and Webhook Dispatch
When an AI evaluation completes, a score is generated, or an agent finishes a task, someone needs to know. The notification service is the event-driven delivery backbone: it ensures the right information reaches the right subscribers at the right time, even when things go wrong. This isn't just about sending webhooks; it's about building a system that can handle partial failures, network timeouts, and subscriber outages without losing messages or letting failures cascade across the platform.
The Architecture
The notification pipeline follows a clear flow: event sources publish messages to Pub/Sub, the notification service consumes those messages, matches them against a subscription registry, and dispatches them through a delivery pipeline built on Cloud Tasks. Each webhook dispatch is retried with exponential backoff, failed deliveries end up in a dead-letter queue, and circuit breakers prevent one failing subscriber from slowing down deliveries to everyone else.
```
Event Sources (Pub/Sub)
          ↓
Notification Service (Cloud Run)
          ↓
Subscription Registry
          ↓
Delivery Pipeline (Cloud Tasks)
          ↓
Webhook Dispatch (with retries)
          ↓
Dead Letter Queue  ←  Circuit Breaker
          ↓
Observability (correlation IDs)
```
Each component exists for a reason. Pub/Sub provides durable message storage and at-least-once delivery guarantees. The subscription registry lets subscribers express interest in specific event types. Cloud Tasks handles the heavy lifting of retries and backoff, ensuring that transient failures don't become permanent message loss. The dead-letter queue captures messages that can't be delivered after exhausting retries, allowing manual intervention and analysis. Circuit breakers prevent a single failing subscriber from consuming all retry capacity, ensuring that healthy subscribers continue to receive notifications even when others are down.
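To make the registry concrete, here is a minimal sketch of event-type matching, assuming a simple in-memory registry. The `Subscription` shape, the wildcard convention, and the field names are illustrative, not the service's actual model:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    """Illustrative subscription record: who wants which events, delivered where."""
    subscriber_id: str
    event_types: frozenset  # e.g. {"evaluation.completed"}; "*" means everything
    webhook_url: str


def match_subscriptions(event_type: str, registry: list) -> list:
    """Return every subscription interested in this event type."""
    return [
        sub for sub in registry
        if event_type in sub.event_types or "*" in sub.event_types
    ]
```

In production the registry lives in the database, but the matching logic reduces to the same question: which subscribers declared interest in this event type?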
The Pydantic v2 Serialization Bug
Pydantic v2 changed how it handles validation errors. In v1, validation errors were simple dictionaries that FastAPI could serialize to JSON without issue. In v2, Pydantic embeds the original Python exception object inside validation errors, preserving more context for debugging. This is great for development, but it breaks FastAPI's JSON serialization.
When a request with invalid data arrived, Pydantic would raise a validation error containing a ValueError exception object. FastAPI tried to serialize this to JSON for the 422 response, but Python exceptions aren't JSON serializable. The serialization failed, FastAPI crashed, and users saw a 500 internal server error instead of the helpful 422 validation error they should have received.
The fix required a custom exception encoder that knows how to serialize exception objects. We extended FastAPI's jsonable_encoder to handle exceptions by converting them to strings:
```python
# Before: Pydantic v2 wraps exceptions, FastAPI can't serialize them
# -> 422 (validation error) becomes 500 (server error)
# After: Custom exception encoder
from fastapi import FastAPI, Request
from fastapi.encoders import jsonable_encoder
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse

app = FastAPI()


@app.exception_handler(RequestValidationError)
async def validation_handler(request: Request, exc: RequestValidationError):
    # Convert embedded exception objects to strings so the payload is JSON-safe
    return JSONResponse(
        status_code=422,
        content=jsonable_encoder(
            {"detail": exc.errors()},
            custom_encoder={Exception: str},
        ),
    )
```
Now validation errors serialize correctly, and users get the helpful error messages they need to fix their requests. It's a small change, but it transforms a confusing 500 error into a clear 422 validation error that guides users toward the correct input format.
Building the Migration System from Scratch
Reliable delivery requires reliable state tracking. You can't know if a notification was delivered without storing delivery attempts. You can't retry failed deliveries without a record of what failed. You can't implement circuit breakers without tracking per-subscriber failure rates. All of this requires a database schema that can evolve as the system grows.
We built the migration system using Alembic with PostgreSQL 17. The schema covers subscriptions (who wants to receive what), notifications (the events themselves), delivery logs (every delivery attempt with its outcome), and dead-letter queue tables (permanent failures that need manual intervention). Each table has proper indexes for the queries we need to run—finding subscriptions for an event type, retrieving delivery logs for a notification, checking circuit breaker state.
Schema-first development matters here. You can't retrofit reliable delivery onto a system that doesn't track state. The migration system lets us evolve the schema as we discover new requirements—adding fields for retry policies, circuit breaker state, or delivery metadata—without losing existing data or breaking running code.
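As a concrete sketch, the delivery-log portion of the schema might look like the following SQLAlchemy Core definition. Table, column, and index names are illustrative, not the service's actual schema, which is managed through Alembic migrations:

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# One row per delivery attempt, indexed for the queries the text describes
delivery_logs = sa.Table(
    "delivery_logs",
    metadata,
    sa.Column("id", sa.BigInteger, primary_key=True),
    sa.Column("notification_id", sa.BigInteger, nullable=False),
    sa.Column("subscriber_id", sa.Text, nullable=False),
    sa.Column("attempt", sa.Integer, nullable=False, server_default="1"),
    sa.Column("status_code", sa.Integer),  # NULL until a response is received
    sa.Column("delivered_at", sa.DateTime(timezone=True), server_default=sa.func.now()),
    # Retrieve delivery logs for a notification
    sa.Index("ix_delivery_logs_notification", "notification_id"),
    # Per-subscriber failure rates feed the circuit breaker
    sa.Index("ix_delivery_logs_subscriber", "subscriber_id", "delivered_at"),
)
```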
Cloud Tasks Delivery Pipeline
The delivery pipeline is where reliability meets complexity. Cloud Tasks provides at-least-once delivery with automatic retries and exponential backoff, but building a production-ready system on top requires careful design.
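The retry behavior is mostly queue configuration. A sketch of creating such a queue with gcloud (queue name and exact values are illustrative):

```shell
gcloud tasks queues create webhook-delivery \
  --max-attempts=8 \
  --min-backoff=1s \
  --max-backoff=300s \
  --max-doublings=5
```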
Idempotent handlers are essential. Cloud Tasks may deliver the same task multiple times—network partitions, retries, or race conditions can cause duplicate deliveries. Each handler must check if it's already processed this notification before doing any work. A simple database check prevents duplicate webhook calls even when Cloud Tasks delivers the same task twice.
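A sketch of that idempotency guard, using SQLite to stay self-contained (in production this would be Postgres with `INSERT ... ON CONFLICT DO NOTHING`; the table and function names are illustrative):

```python
import sqlite3


def process_once(conn: sqlite3.Connection, delivery_id: str, deliver) -> bool:
    """Run deliver() only if this delivery_id has not been processed yet.

    Returns True if the delivery ran, False if it was a duplicate.
    """
    try:
        # The UNIQUE constraint makes the "claim" atomic even under races
        conn.execute(
            "INSERT INTO processed_deliveries (delivery_id) VALUES (?)",
            (delivery_id,),
        )
        conn.commit()
    except sqlite3.IntegrityError:
        return False  # Already processed: drop the duplicate task silently
    deliver()
    return True
```

A natural key for `delivery_id` is the Cloud Tasks task name, which is stable across redeliveries of the same task.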
Exponential backoff handles transient failures gracefully. A webhook endpoint that's temporarily down gets retried with increasing delays—first after one second, then two, then four, then eight. This gives the endpoint time to recover without overwhelming it with rapid retries. Permanent failures—like 404 Not Found or 401 Unauthorized—are detected early and moved to the dead-letter queue without wasting retry capacity.
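The two decisions in that paragraph, compute the next delay and classify permanent failures, can be sketched in a few lines. The status-code set, base delay, and cap are assumptions, not the service's exact policy:

```python
# Illustrative: client errors that retrying will never fix
PERMANENT_FAILURES = {400, 401, 403, 404, 410}


def is_permanent_failure(status_code: int) -> bool:
    """Permanent failures skip retries and go straight to the dead-letter queue."""
    return status_code in PERMANENT_FAILURES


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff: 1s, 2s, 4s, 8s, ... capped to avoid unbounded waits."""
    return min(base * (2 ** attempt), cap)
```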
Per-subscriber circuit breakers are the hardest part. If one subscriber's webhook endpoint is down, you don't want its failures to slow down deliveries to all other subscribers. The circuit breaker tracks failure rates per subscriber—when failures exceed a threshold, that subscriber's circuit opens, and new deliveries are paused. After a cooldown period, the circuit half-opens, allowing a test delivery. If it succeeds, the circuit closes and normal delivery resumes. If it fails, the circuit opens again.
This isolation is critical. Without circuit breakers, a single failing subscriber can consume all retry capacity, delaying notifications to healthy subscribers. With circuit breakers, failing subscribers are isolated, and healthy subscribers continue to receive notifications without delay.
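The closed/open/half-open state machine described above can be sketched as a small in-memory breaker. The threshold, cooldown, and in-memory state are illustrative; a real deployment would persist per-subscriber state in the database:

```python
import time


class CircuitBreaker:
    """Minimal per-subscriber circuit breaker sketch (one instance per subscriber)."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: normal delivery
        if self.clock() - self.opened_at >= self.cooldown_s:
            return True  # half-open: let one test delivery through
        return False  # open: pause deliveries to this subscriber

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open the circuit
```

The delivery handler calls `allow()` before dispatching and reports the outcome back with `record_success()` or `record_failure()`.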
What the Audit Revealed
A production load test revealed issues that don't show up in development. Under concurrent load, race conditions appeared in the caching layer—multiple requests would check the cache simultaneously, find it empty, and all try to populate it at once. The fix required proper locking and atomic cache operations.
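The shape of that fix, in sketch form: move the check-then-populate sequence under a lock so only one request computes the value. The class and method names are illustrative:

```python
import threading


class LockedCache:
    """Sketch: cache whose check-and-populate step is atomic under concurrency."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get_or_compute(self, key, compute):
        with self._lock:
            if key not in self._data:        # the check happens under the lock,
                self._data[key] = compute()  # so only one caller populates
            return self._data[key]
```

Holding the lock across `compute()` is the simplest correct version; if computation is slow, per-key locks keep unrelated requests from blocking each other.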
Unbounded memory growth came from the caching layer never evicting entries. As the system ran, the cache grew without bound, eventually consuming all available memory. Adding a TTL and LRU eviction policy kept memory usage bounded while maintaining cache performance.
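A sketch of bounded caching with both mechanisms, a TTL for staleness and LRU eviction for size. The sizes, TTL, and class name are illustrative, and a production version would also need the locking discussed above:

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Sketch: cache bounded by entry count (LRU) and entry age (TTL)."""

    def __init__(self, max_entries: int = 1024, ttl_s: float = 300.0,
                 clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_s = ttl_s
        self.clock = clock  # injectable for testing
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] < self.clock():
            self._data.pop(key, None)  # expired entries are evicted lazily
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return entry[1]

    def put(self, key, value) -> None:
        self._data[key] = (self.clock() + self.ttl_s, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
```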
The event publisher was fire-and-forget—events were published to Pub/Sub without waiting for acknowledgment. On transient Pub/Sub failures, events were silently dropped. Adding proper error handling and retry logic ensured that events weren't lost on transient failures.
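The shape of that fix can be sketched as a retry wrapper around a publish call that raises on failure instead of dropping silently. Here `publish` stands in for the real Pub/Sub client call (e.g. waiting on the future returned by the publisher); the attempt count and delays are illustrative:

```python
import time


def publish_with_retry(publish, message: bytes, attempts: int = 3,
                       base_delay: float = 0.5, sleep=time.sleep) -> None:
    """Retry transient publish failures with exponential backoff.

    Raises the last error if all attempts fail, so the event is never
    silently lost.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            publish(message)  # must raise on failure, not swallow it
            return
        except Exception as exc:  # in practice, catch transient errors only
            last_error = exc
            sleep(base_delay * (2 ** attempt))
    raise last_error
```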
Overly permissive CORS headers allowed any origin to make requests to the notification service. While the service still required authentication, the permissive CORS policy was unnecessary and increased the attack surface. Restricting CORS to known origins reduced risk without affecting legitimate clients.
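The allow-list itself reduces to a simple check, sketched below with hypothetical origins. With FastAPI this is typically wired up through `CORSMiddleware` with an explicit `allow_origins` list rather than a wildcard:

```python
from typing import Optional

# Illustrative allow-list; the real origins come from configuration
ALLOWED_ORIGINS = {"https://app.example.com", "https://admin.example.com"}


def cors_origin_header(request_origin: Optional[str]) -> Optional[str]:
    """Return the Access-Control-Allow-Origin value, or None to omit the header."""
    if request_origin in ALLOWED_ORIGINS:
        return request_origin  # echo the specific origin, never "*"
    return None
```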
These are lessons about what breaks under production load but never surfaces in development. Race conditions only appear with concurrent requests. Memory leaks only show up over time. Fire-and-forget patterns only fail when the external service has issues. The audit process of systematic load testing, code review, and security scanning surfaced these issues before they became production incidents.