
The Database Bug That Took Down Our AI Pipeline: Debugging AlloyDB SSL Connection Drops

· 6 min read · Humza Tareen
PostgreSQL AlloyDB GCP Python Debugging

The Symptoms

Our AI evaluation pipeline was randomly failing. Tasks that finished in two minutes ran fine; the same code would crash fifteen minutes into a longer run. The error logs showed: SSL connection has been closed unexpectedly.

We were running long-running AI agent evaluations on GCP. Each evaluation could take 5-30 minutes — agents executing code, running tests, getting judged by LLMs.

What We Tried First (and Why It Failed)

  • Increasing timeouts → didn't help; the connections were already dead
  • Adding retry decorators → the retries hit the same dead pooled connection
  • Scaling up instances → more instances = more idle connections = more drops
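The retry failure is worth dwelling on: a retry only helps if the second attempt gets a *different* connection, which means the failed one has to be invalidated first (in real SQLAlchemy code, `connection.invalidate()` or `engine.dispose()`). Here is a minimal sketch of that dynamic using a toy pool — `FakePool`, `DeadConnectionError`, and `with_retry` are stand-ins for illustration, not SQLAlchemy APIs:

```python
import functools

class DeadConnectionError(Exception):
    """Stands in for the 'SSL connection has been closed' error."""

class FakePool:
    """Toy pool: hands back the same connection until it is invalidated."""
    def __init__(self):
        self._conn = {"alive": False}  # already killed server-side

    def checkout(self):
        return self._conn

    def invalidate(self):
        self._conn = {"alive": True}  # pool opens a fresh connection

def with_retry(pool, invalidate_on_failure):
    """Retry once; optionally invalidate the pooled connection on failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper():
            for attempt in range(2):
                conn = pool.checkout()
                try:
                    return fn(conn)
                except DeadConnectionError:
                    if invalidate_on_failure:
                        pool.invalidate()  # next checkout gets a live connection
                    if attempt == 1:
                        raise
        return wrapper
    return decorator

def query(conn):
    if not conn["alive"]:
        raise DeadConnectionError("SSL connection has been closed unexpectedly")
    return "ok"
```

A naive retry checks out the same dead connection twice and still fails; a retry that invalidates first succeeds on the second attempt. That is exactly why our retry decorators alone didn't help.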

Root Cause Analysis

AlloyDB (Google's PostgreSQL-compatible database) silently drops SSL connections from the connection pool after ~10 minutes of inactivity. But our SQLAlchemy pool kept serving those dead connections to new requests.

The timeline:

  1. Task A grabs Connection 1 from pool → starts 20-min evaluation
  2. Connections 2, 3, 4 sit idle in the pool
  3. After 10 min: AlloyDB kills Connections 2, 3, 4 (idle too long)
  4. Task B grabs Connection 2 from pool → DEAD → SSL error
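The window where this bites can be expressed in a few lines. The sketch below models the mismatch with illustrative constants (the ~10-minute server timeout and our old 15-minute recycle setting); it isn't AlloyDB code, just the arithmetic of the bug:

```python
IDLE_TIMEOUT = 600   # AlloyDB drops idle SSL connections after ~10 min
POOL_RECYCLE = 900   # our old setting: pool kept connections for 15 min

def is_dead_at_checkout(idle_seconds, pool_recycle=POOL_RECYCLE):
    """True if the pool would serve a connection the server already killed:
    young enough to stay in the pool, but idle past the server's timeout."""
    still_in_pool = idle_seconds < pool_recycle
    killed_by_server = idle_seconds >= IDLE_TIMEOUT
    return still_in_pool and killed_by_server

# Task B checks out Connection 2 after 12 idle minutes: the pool thinks
# it's fine (12 < 15 min), but AlloyDB killed it at the 10-minute mark.
```

With `pool_recycle` dropped below the server timeout, `still_in_pool` is false for any connection old enough to have been killed, so the danger window disappears.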

The Fix

# Before: connections lived 15 minutes in the pool
DB_POOL_RECYCLE = 900  # 15 min

# After: recycle before AlloyDB kills them
DB_POOL_RECYCLE = 180  # 3 min

Plus two critical additions:

  1. Connection validation on checkout (pool_pre_ping=True)
  2. Idempotent writes for Cloud Tasks — because when a connection drops mid-transaction, Cloud Tasks retries the whole thing
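The idempotency piece can be as simple as keying writes on the task ID and ignoring conflicts. A minimal sketch, using stdlib SQLite as a stand-in for the database and hypothetical names (`results`, `task_id`, `record_result`); on PostgreSQL/AlloyDB the same idea is spelled `INSERT ... ON CONFLICT (task_id) DO NOTHING`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (task_id TEXT PRIMARY KEY, score REAL)")

def record_result(task_id, score):
    # Idempotent: a retried Cloud Tasks delivery of the same payload is a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO results (task_id, score) VALUES (?, ?)",
        (task_id, score),
    )
    conn.commit()

record_result("eval-123", 0.92)
record_result("eval-123", 0.92)  # retried delivery: no duplicate row
```

Now a connection drop mid-transaction followed by a Cloud Tasks retry produces one row instead of two.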

The Deeper Lesson

This pattern appears everywhere in cloud-native systems. Any time you have connection pooling + managed databases, long-running tasks + idle connections, or retry mechanisms + non-idempotent writes — you will hit this bug eventually.

The lesson isn't "set pool_recycle to 180s" — it's that your connection pool strategy needs to match your infrastructure's behavior, not your application's assumptions.

We went from ~5% random task failures to 0.01% after this change. Across thousands of daily evaluations, that's the difference between a reliable platform and constant firefighting.

The Full Configuration

from sqlalchemy import create_engine

engine = create_engine(
    DATABASE_URL,
    pool_recycle=180,      # Recycle before AlloyDB kills them
    pool_pre_ping=True,    # Validate connection on checkout
    pool_size=5,           # Don't hold more than you need
    max_overflow=10,       # Allow burst but not stampede
    pool_timeout=30,       # Fail fast if pool exhausted
)

And set containerConcurrency in Cloud Run to something reasonable (10-20, not the default of 80). A few lines of configuration changed everything.
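For reference, containerConcurrency lives in the Cloud Run service spec; a sketch with placeholder service and image names:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: eval-worker          # placeholder service name
spec:
  template:
    spec:
      containerConcurrency: 10   # cap concurrent requests per instance
      containers:
        - image: gcr.io/my-project/eval-worker  # placeholder image
```

Fewer concurrent requests per instance means fewer simultaneously checked-out connections, which keeps the pool small and the idle window predictable.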