
When Your Logging Framework Crashes Production

· 7 min read · Humza Tareen
Logging Python Production Debugging FastAPI

Logging is supposed to help you debug. When your logging infrastructure itself crashes production, you have a meta-problem.

The Crash

Two logging components were colliding. A ComponentAdapter adds service context to every log record, and a LogContext manager adds request-scoped context. Both inject keys into the log record's extra dict. When both were active in the same call chain, which happened during agent execution sessions, they collided.

The ComponentAdapter expected extra to be a flat dict. LogContext had wrapped it in a nested structure. Result: KeyError deep in the logging pipeline, crashing the agent session.

Why It Was Hard to Find

The crash only happened when specific agent frameworks were used in specific execution modes. Standard API requests were fine. Only agent sessions that used both the component logger and the request-scoped context triggered it. The error appeared as a generic Python logging crash, not as an application error.

The Fix: safe_log_extra()

I built a helper that normalizes the extra dict before any logging operation:

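def safe_log_extra(extra: dict | None = None) -> dict:
    """Return a flat copy of extra, lifting one level of nested dicts to the top."""
    if extra is None:
        return {}
    normalized = {}
    for key, value in extra.items():
        if isinstance(value, dict):
            # Merge the nested dict's keys into the top level; the wrapper key is dropped.
            normalized.update(value)
        else:
            normalized[key] = value
    return normalized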

I applied it to every logging call path: ComponentAdapter, LogContext, and direct logger usage. With the extra dict flattened defensively at each entry point, a nested structure from one component can no longer crash the pipeline.
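As an illustration of what "applied to every logging call path" might look like for an adapter, here is a sketch under stated assumptions. SafeComponentAdapter and ListHandler are hypothetical names introduced for this example, and the helper is repeated (without type hints) only so the snippet runs on its own.

```python
import logging


def safe_log_extra(extra=None):
    # Same behaviour as the helper above: lift one level of nested dicts.
    if extra is None:
        return {}
    normalized = {}
    for key, value in extra.items():
        if isinstance(value, dict):
            normalized.update(value)
        else:
            normalized[key] = value
    return normalized


class SafeComponentAdapter(logging.LoggerAdapter):
    """Hypothetical adapter that normalizes extra before merging service context."""

    def process(self, msg, kwargs):
        flat = safe_log_extra(kwargs.get("extra"))
        # Service context is merged last, so it wins on key collisions.
        kwargs["extra"] = {**flat, **safe_log_extra(self.extra)}
        return msg, kwargs


class ListHandler(logging.Handler):
    """Collects records in memory so the result can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


handler = ListHandler()
base = logging.getLogger("agent.safe")
base.addHandler(handler)
base.setLevel(logging.INFO)
base.propagate = False

log = SafeComponentAdapter(base, {"service": "agent-runner"})
# The nested "context" dict is flattened instead of raising.
log.info("step done", extra={"context": {"request_id": "r-1"}, "step": 3})

record = handler.records[0]
print(record.service, record.request_id, record.step)
```

One trade-off of this approach is that flattening drops the wrapper key and lets later keys overwrite earlier ones, so two components using the same key name will silently shadow each other.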

SSL Retry for Database Methods

While investigating the logging crash, I also added SSL connection retry wrappers around database methods used during agent runs. These long-running operations (5–30 minutes) were hitting the AlloyDB SSL drop issue in a code path that didn't have the retry protection yet.
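A retry wrapper of this kind can be sketched as a decorator. This is a generic illustration, not the wrapper used in production: retry_on_ssl_drop, its parameters, and fetch_agent_state are hypothetical. Catching ssl.SSLError and ConnectionError is also an assumption, since a real database driver will usually raise its own exception type, which would go in the except clause instead.

```python
import functools
import ssl
import time


def retry_on_ssl_drop(max_attempts=3, base_delay=0.5):
    """Retry a callable when the connection drops with an SSL-level error."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except (ssl.SSLError, ConnectionError):
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    # Exponential backoff: base_delay, then 2x, 4x, ...
                    time.sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator


calls = {"n": 0}


@retry_on_ssl_drop(max_attempts=3, base_delay=0)
def fetch_agent_state():
    # Simulated database method: fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ssl.SSLError("connection dropped")
    return "ok"


result = fetch_agent_state()
print(result, calls["n"])
```

Retrying like this is only safe for operations that are idempotent or that run inside a transaction that rolled back when the connection dropped.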

Diagnostic Log Cleanup

I also cleaned up diagnostic logging across the codebase: I removed debug-level messages that were being logged at INFO and assigned proper log levels (DEBUG for internal state, INFO for lifecycle events, WARNING for recoverable errors, ERROR for failures).
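The level convention is easier to see in a small example. The run_step function and its fallback logic below are invented purely for illustration; only the mapping of levels to kinds of events comes from the cleanup described above.

```python
import logging

logger = logging.getLogger("agent.session")


def run_step(name, payload):
    logger.info("step %s started", name)                # lifecycle event
    logger.debug("step %s payload: %r", name, payload)  # internal state

    value = payload.get("value")
    if value is None:
        # Recoverable: fall back to a default and carry on.
        logger.warning("step %s has no value, falling back to 1", name)
        value = 1

    try:
        result = 100 / value
    except ZeroDivisionError:
        # Failure: the step cannot complete.
        logger.error("step %s failed: division by zero", name)
        raise

    logger.info("step %s finished", name)               # lifecycle event
    return result


print(run_step("a", {"value": 4}))
print(run_step("b", {}))
```

With this split, running production at INFO shows the lifecycle of each step without the payload dumps, and switching a single logger to DEBUG brings the internal state back when it is needed.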

Production logging infrastructure needs the same engineering rigor as any other production system. If your logging can crash your application, your logging is a liability.