Event-Driven Architecture on GCP: Pub/Sub, Cloud Tasks, and Cloud Scheduler in Practice
Why Event-Driven?
Our AI evaluation platform has a problem that REST APIs can't solve cleanly: a single evaluation task takes 5-30 minutes, touches 6 different services, and can fail at any point.
With synchronous HTTP you get timeouts, retries that hit already-completed work, and debugging nightmares. With events, each service subscribes independently: if one fails, it retries from its own checkpoint, and the others don't even know.
The Pipeline Phases
Six phases, each representing a handoff where the next service might fail independently. Phase A creates the task. Phase B preprocesses it. Phase C executes the agent. Phase D collects outputs. Phase E judges and scores. Phase F aggregates and learns. Each phase publishes to a Pub/Sub topic, and the next phase subscribes. If Phase C fails, Phase D doesn't even know a task was attempted. Phase C retries from its checkpoint, and when it succeeds, Phase D picks up the work. This independence is what makes event-driven architecture resilient.
Phase A: Task Creation → (Pub/Sub: task-events)
Phase B: Preprocessing → (Pub/Sub: phase-b-batch-notifications)
Phase C: Agent Execution → (Pub/Sub: phase-c-asset-feed)
Phase D: Output Collection → (Pub/Sub: phase-e-jobs)
Phase E: Judging & Scoring → (Pub/Sub: phase-f-jobs)
Phase F: Aggregation & Learning → (Pub/Sub: evaluation-events)
Consumers: Notification, RAG Sync, Analytics
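The checkpoint-and-handoff behavior can be sketched independently of the Pub/Sub client. Here `handle_phase_c` is a hypothetical handler (the function and store names are illustrative, not from the real codebase): it skips work already recorded in its checkpoint store but always republishes, so a redelivered message still reaches Phase D via the injected `publish_next` callable.

```python
import json

def handle_phase_c(message_data, checkpoint_store, publish_next):
    """Process one Phase C message; safe to redeliver."""
    event = json.loads(message_data)
    task_id = event["task_id"]
    # Checkpoint: skip work already done, but always hand off,
    # so a redelivered message still reaches Phase D.
    if checkpoint_store.get(task_id) != "executed":
        run_agent(event)  # stand-in for the real agent execution
        checkpoint_store[task_id] = "executed"
    publish_next(json.dumps({"task_id": task_id, "next_phase": "D"}))

def run_agent(event):
    # Placeholder for the 5-30 minute agent run
    return {"task_id": event["task_id"], "output": "ok"}
```

Passing `publish_next` in as a callable means the same handler runs against real Pub/Sub in production and an in-memory list in tests.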
Fan-Out Pattern
Our most powerful pattern: one topic, multiple subscriptions. evaluation-events is published to once but consumed by notification service, RAG sync, analytics pipeline, and auto-eval scheduler. Adding a new consumer? Create a subscription. Zero changes to the publisher. Here's how it works:
# Publisher (evaluation engine)
from google.cloud import pubsub_v1
import json

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "evaluation-events")
publisher.publish(topic_path, json.dumps({
    "evaluation_id": "eval_123",
    "status": "completed",
    "score": 0.85,
}).encode())
# Multiple subscribers (each independent)
# - notification-service subscribes to send emails
# - rag-service subscribes to update embeddings
# - analytics-service subscribes to update dashboards
# - auto-eval-service subscribes to trigger next batch
Each subscriber processes the message independently. If the notification service fails, the RAG service still processes the message. This decoupling is what makes fan-out so powerful.
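"Create a subscription" really is one command. A sketch, with a hypothetical subscription name (the topic name comes from the pipeline above); note the publisher is untouched:

```shell
# New consumer: attach another subscription to the existing topic.
gcloud pubsub subscriptions create analytics-evaluation-events \
    --topic=evaluation-events \
    --ack-deadline=60
```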
Dead-Letter Queues
Every production topic has a DLQ with a 5-retry policy, and we review DLQs weekly. Dead-letter queues catch failures that aren't transient: schema migration issues, where the message format changed but old messages are still in the queue; quota limits, where Firestore rejects writes because we've hit the daily cap; and Cloud SQL connection failures that look transient but are permanent because the database is actually down. Without DLQs, these would have been silent data loss: messages would fail, get retried, fail again, and disappear once the retry limit was exhausted. With DLQs, we see every failure, investigate the root cause, and fix it.
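In Pub/Sub, the retry policy lives on the subscription, not the topic. A sketch with hypothetical topic and subscription names; the dead-letter topic needs a subscription of its own, or forwarded messages are dropped:

```shell
# Dead-letter topic, plus a subscription so its messages are retained
gcloud pubsub topics create evaluation-events-dlq
gcloud pubsub subscriptions create evaluation-events-dlq-sub \
    --topic=evaluation-events-dlq

# Consumer subscription that forwards after 5 failed delivery attempts
gcloud pubsub subscriptions create notification-evaluation-events \
    --topic=evaluation-events \
    --dead-letter-topic=evaluation-events-dlq \
    --max-delivery-attempts=5
```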
Cloud Scheduler for Automation
Each scheduled job publishes a message to a Pub/Sub topic. The consuming service doesn't know if the message came from a schedule, a user action, or an API call. We could pause auto-eval-daily in production without touching any service code. Here's the pattern:
# Cloud Scheduler job
gcloud scheduler jobs create pubsub auto-eval-daily \
    --schedule="0 2 * * *" \
    --topic=auto-eval-trigger \
    --message-body='{"trigger": "daily", "batch_size": 100}'
# Service handler (doesn't know the source)
import json

def handle_auto_eval(message):
    data = json.loads(message.data)
    # Process the batch - same code path whether
    # triggered by schedule, API, or manual publish
    process_evaluation_batch(data["batch_size"])
This abstraction means we can test scheduled jobs by manually publishing to the topic, pause them by disabling the scheduler job, or change the schedule without deploying code.
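Each of those three operations is a one-liner, sketched here with the job and topic names from above:

```shell
# Test the scheduled job by publishing the same message manually
gcloud pubsub topics publish auto-eval-trigger \
    --message='{"trigger": "manual", "batch_size": 10}'

# Pause the schedule without touching service code
gcloud scheduler jobs pause auto-eval-daily

# Change the schedule without a deploy
gcloud scheduler jobs update pubsub auto-eval-daily \
    --schedule="0 4 * * *"
```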
Monitoring
Every Cloud Run service has an alert at >5% error rate. We also monitor CPU utilization < 10% to auto-detect idle VMs—this saves significant compute costs. But the real value is in the alerting strategy: we alert on error rates, not individual errors. One failed message isn't a problem. A sustained error rate above 5% means something is wrong. We also monitor message age in DLQs—if messages are sitting in DLQs for more than an hour, we investigate. This proactive monitoring catches issues before they become incidents.
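As a sketch, the DLQ-age alert can be expressed as a Cloud Monitoring alert-policy condition on the built-in `oldest_unacked_message_age` metric. The subscription name here is hypothetical, and 3600 seconds is the one-hour threshold:

```json
{
  "displayName": "DLQ messages older than 1 hour",
  "conditionThreshold": {
    "filter": "metric.type=\"pubsub.googleapis.com/subscription/oldest_unacked_message_age\" resource.type=\"pubsub_subscription\" resource.label.subscription_id=\"evaluation-events-dlq-sub\"",
    "comparison": "COMPARISON_GT",
    "thresholdValue": 3600,
    "duration": "300s"
  }
}
```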
Pub/Sub vs Cloud Tasks: When to Use Which
Pub/Sub is for events—things that happened that other services might care about. An evaluation completed. A task was created. A user signed up. Multiple services might subscribe to the same event. Cloud Tasks is for commands—things you want done once, with retries and deadlines. Send an email. Process a payment. Update a database record. One task, one handler, at-least-once delivery (Cloud Tasks deduplicates by task name, but the handler should still be idempotent).
We use Pub/Sub for the evaluation pipeline because multiple services care about evaluation events. We use Cloud Tasks for notification delivery because each email should be enqueued once and retried if the email service is down. The distinction matters: Pub/Sub is fire-and-forget events. Cloud Tasks is reliable command execution.
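The command side can be sketched with the gcloud CLI. The queue name, task name, and worker URL below are hypothetical; each task targets exactly one HTTP handler, and Cloud Tasks manages the retries and deadlines:

```shell
# Enqueue one command: send one email via the notifier's HTTP endpoint.
# Naming the task lets Cloud Tasks deduplicate re-enqueues.
gcloud tasks create-http-task email-eval-123 \
    --queue=email-queue \
    --url=https://notifier.example.com/send \
    --method=POST \
    --body-content='{"to": "user@example.com"}'
```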
Key Lessons
Message ordering doesn't matter if you design correctly. Every message contains enough context to be processed independently. If you need ordering, include a sequence number or timestamp in the message. Don't rely on Pub/Sub's delivery order: without ordering keys there is no ordering guarantee at all, and even with ordering keys the guarantee is per key, not global.
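A minimal sketch of the sequence-number approach (function and field names are illustrative): the consumer tracks the highest sequence applied per entity and drops anything stale, so delivery order stops mattering.

```python
import json

last_seen = {}  # entity_id -> highest sequence number applied

def apply_if_fresh(message_data, state):
    """Apply a status update only if it's newer than what we've seen."""
    event = json.loads(message_data)
    entity, seq = event["evaluation_id"], event["seq"]
    if seq <= last_seen.get(entity, -1):
        return False  # stale or duplicate delivery: drop it
    last_seen[entity] = seq
    state[entity] = event["status"]
    return True
```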
Acknowledge after commit, not before. Write first, ack second, always. If you ack before writing to the database, and the write fails, the message is gone forever. If you write first, then ack, and the ack fails, the message will be redelivered—but your idempotency check will prevent duplicate processing.
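The write-then-ack order plus an idempotency check can be sketched like this (a `FakeMessage` stands in for the Pub/Sub message object, and a dict stands in for the database; all names are illustrative):

```python
import json

class FakeMessage:
    """Stand-in for a Pub/Sub message with data and ack()."""
    def __init__(self, data):
        self.data = data
        self.acked = 0
    def ack(self):
        self.acked += 1

def handle(message, db, processed_ids):
    """Write first, ack second. Redelivery is absorbed by the ID check."""
    event = json.loads(message.data)
    if event["evaluation_id"] in processed_ids:
        message.ack()  # already committed: just re-ack the duplicate
        return
    db[event["evaluation_id"]] = event   # 1. commit the write
    processed_ids.add(event["evaluation_id"])
    message.ack()                        # 2. only then acknowledge
```

If the process crashes between the write and the ack, the message is redelivered, the ID check fires, and the duplicate is acked without a second write.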
Schema evolution is inevitable. Version your message schemas. Include a version field in every message. When you need to change the schema, publish both old and new versions, update consumers gradually, then deprecate the old version. Don't break existing consumers with breaking changes.
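Version dispatch on the consumer side can be sketched like this (the field layouts for v1 and v2 are hypothetical): old messages without a version field default to v1, so they keep working while v2 rolls out.

```python
import json

def parse_evaluation_event(raw):
    """Dispatch on the message's version field; default old messages to v1."""
    event = json.loads(raw)
    version = event.get("version", 1)
    if version == 1:
        # v1: flat score field
        return {"evaluation_id": event["evaluation_id"],
                "score": event["score"]}
    if version == 2:
        # v2: score moved under a "result" object
        return {"evaluation_id": event["evaluation_id"],
                "score": event["result"]["score"]}
    raise ValueError(f"unknown schema version: {version}")
```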
Pub/Sub for events, Cloud Tasks for commands. Keep the distinction clear. Events are "something happened." Commands are "do this thing." Mixing them leads to confusion about retry semantics, delivery guarantees, and error handling.