The Database Bug That Took Down Our AI Pipeline: Debugging AlloyDB SSL Connection Drops
The Symptoms
Our AI evaluation pipeline was failing at random. A task would run fine at minute 2, then crash at minute 15. The error logs showed: SSL connection has been closed unexpectedly.
We were running long-running AI agent evaluations on GCP. Each evaluation could take 5-30 minutes — agents executing code, running tests, getting judged by LLMs.
What We Tried First (and Why It Failed)
- Increasing timeouts → didn't help; the connections were already dead
- Adding retry decorators → retries hit the same dead pooled connection
- Scaling up instances → more instances = more idle connections = more drops
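The retry failure mode is worth making concrete. Here is a toy model (the Pool and Connection classes are illustrative stand-ins, not SQLAlchemy APIs) showing why a blind retry just checks out the next equally dead connection from the pool:

```python
# Toy model: a pool that never checks liveness, fronting connections the
# server has already killed. Retries cycle through dead connections.

class DeadConnectionError(Exception):
    """Stand-in for 'SSL connection has been closed unexpectedly'."""

class Connection:
    def __init__(self, conn_id, alive=True):
        self.conn_id = conn_id
        self.alive = alive

    def execute(self, query):
        if not self.alive:
            raise DeadConnectionError(f"connection {self.conn_id} is dead")
        return f"ok: {query}"

class Pool:
    """FIFO pool; simplified so checkout rotates the connection to the back."""
    def __init__(self, connections):
        self.connections = list(connections)

    def checkout(self):
        conn = self.connections.pop(0)
        self.connections.append(conn)
        return conn

def run_with_retries(pool, query, attempts=3):
    last_err = None
    for _ in range(attempts):
        conn = pool.checkout()
        try:
            return conn.execute(query)
        except DeadConnectionError as err:
            last_err = err  # retry just grabs the next (also dead) connection
    raise last_err

# Every idle connection was killed server-side; all three retries fail.
pool = Pool([Connection(i, alive=False) for i in range(3)])
try:
    run_with_retries(pool, "SELECT 1")
except DeadConnectionError as err:
    print("all retries failed:", err)
```

The retry logic is doing its job; the problem is that the pool keeps vending corpses, so every attempt fails the same way.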
Root Cause Analysis
AlloyDB (Google's PostgreSQL-compatible database) silently drops SSL connections from the connection pool after ~10 minutes of inactivity. But our SQLAlchemy pool kept serving those dead connections to new requests.
The timeline:
- Task A grabs Connection 1 from pool → starts 20-min evaluation
- Connections 2, 3, 4 sit idle in the pool
- After 10 min: AlloyDB kills Connections 2, 3, 4 (idle too long)
- Task B grabs Connection 2 from pool → DEAD → SSL error
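The timeline above can be sketched as a minimal simulation. The classes and the reaper function are toy stand-ins (not SQLAlchemy or AlloyDB APIs); the 10-minute cutoff is the behavior we observed:

```python
IDLE_TIMEOUT_MIN = 10  # AlloyDB's observed idle cutoff, in minutes

class Connection:
    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.idle_since = 0  # minute the connection went idle
        self.alive = True

def server_reaper(pool, now):
    # The server side: silently kill anything idle past the timeout.
    # The client pool is never told.
    for conn in pool:
        if conn.alive and now - conn.idle_since > IDLE_TIMEOUT_MIN:
            conn.alive = False

pool = [Connection(i) for i in range(1, 5)]

# t=0: Task A checks out Connection 1 for a 20-minute evaluation.
task_a = pool.pop(0)

# t=15: Connections 2-4 have been idle 15 minutes; the reaper kills them.
server_reaper(pool, now=15)

# t=15: Task B checks out Connection 2 -> dead -> SSL error at first query.
task_b = pool.pop(0)
print(task_b.conn_id, task_b.alive)
```

The key asymmetry: Task A's connection survives because it is busy, while every idle connection dies without the pool noticing.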
The Fix
# Before: connections lived 15 minutes in the pool
DB_POOL_RECYCLE = 900 # 15 min
# After: recycle before AlloyDB kills them
DB_POOL_RECYCLE = 180 # 3 min
Plus two critical additions:
- Connection validation on checkout (pool_pre_ping=True)
- Idempotent writes for Cloud Tasks — because when a connection drops mid-transaction, Cloud Tasks retries the whole thing
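One way to make those writes idempotent is to key each result on the Cloud Tasks task name, so a retried delivery of the same task is a no-op. The sketch below uses stdlib sqlite3 and an illustrative table; with AlloyDB the equivalent would be PostgreSQL's INSERT ... ON CONFLICT (task_name) DO NOTHING:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluation_results ("
    "  task_name TEXT PRIMARY KEY,"  # task name as the natural idempotency key
    "  score REAL"
    ")"
)

def record_result(task_name: str, score: float) -> None:
    # INSERT OR IGNORE makes the write safe to repeat: a second delivery of
    # the same task hits the primary-key conflict and is silently skipped.
    conn.execute(
        "INSERT OR IGNORE INTO evaluation_results (task_name, score)"
        " VALUES (?, ?)",
        (task_name, score),
    )
    conn.commit()

# Cloud Tasks redelivers after a mid-transaction connection drop:
record_result("eval-task-42", 0.87)
record_result("eval-task-42", 0.87)  # retry: no duplicate row

count = conn.execute("SELECT COUNT(*) FROM evaluation_results").fetchone()[0]
print(count)  # 1
```

With this in place, a connection drop followed by a full task retry converges to the same database state as a clean first run.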
The Deeper Lesson
This pattern appears everywhere in cloud-native systems. Any time you have connection pooling + managed databases, long-running tasks + idle connections, or retry mechanisms + non-idempotent writes — you will hit this bug eventually.
The lesson isn't "set pool_recycle to 180s" — it's that your connection pool strategy needs to match your infrastructure's behavior, not your application's assumptions.
We went from ~5% random task failures to 0.01% after this change. Across thousands of daily evaluations, that's the difference between a reliable platform and constant firefighting.
The Full Configuration
engine = create_engine(
    DATABASE_URL,
    pool_recycle=180,    # Recycle before AlloyDB kills them
    pool_pre_ping=True,  # Validate connection on checkout
    pool_size=5,         # Don't hold more than you need
    max_overflow=10,     # Allow burst but not stampede
    pool_timeout=30,     # Fail fast if pool exhausted
)
And set containerConcurrency in Cloud Run to something reasonable (10-20, not 80). Three lines of configuration changed everything.