I Audited Our AI Platform and Found Critical Security Gaps
Six months of shipping fast had created something nobody wanted to talk about: security debt. I decided to spend two days auditing the entire platform systematically. What I found was sobering.
The Audit Methodology
I approached it service by service, starting with the most exposed: the API gateway and evaluation engine. For each service, I examined five categories: input validation, authentication, secrets management, logging hygiene, and architecture boundaries. I wasn't looking for perfection—I was looking for production incidents waiting to happen.
What I Found
The BigQuery f-string SQL injection was the first red flag. A developer had used f-strings to interpolate dynamic table names, which works fine in development but is catastrophic in production. The query looked innocent: f"SELECT * FROM {table_name}". But table_name came from user input, and the interpolation happens in Python before the query ever reaches BigQuery, so nothing downstream could sanitize it. One malicious payload could have exposed our entire evaluation dataset.
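To make the failure mode concrete, here's a sketch of how the interpolation betrays you. The table name and payload are invented for illustration:

```python
# What the developer intended:
table_name = "eval_results"
query = f"SELECT * FROM {table_name}"
# → "SELECT * FROM eval_results"

# What an attacker can send instead: the payload is spliced into
# the SQL verbatim, long before BigQuery ever sees the query.
table_name = "eval_results UNION ALL SELECT * FROM secrets.api_keys"
query = f"SELECT * FROM {table_name}"
```

The second query is perfectly valid SQL by the time it leaves the process, so no amount of server-side checking can tell it apart from a legitimate request.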
Then I found the alembic.ini credentials. Database migration config committed with hardcoded passwords. Anyone with repository access had production database credentials. This wasn't a theoretical risk—it was a live vulnerability sitting in our git history.
The CI/CD log leak was subtler. API keys were being printed during build steps, visible to anyone with repo access. The logs weren't public, but they didn't need to be. Internal access was enough.
Three endpoints had no authentication because they were "only called by other services." The "internal = safe" assumption. In a microservices architecture, internal doesn't mean secure. It means accessible to any compromised service or misconfigured network policy.
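The fix is not exotic: even internal calls can carry a credential. A minimal, stdlib-only sketch of a shared-secret check (the header name and token value are illustrative; in practice the token comes from a secret manager):

```python
import hmac

# Hypothetical shared secret; load it from a secret manager in practice.
EXPECTED_TOKEN = "service-token-from-secret-manager"

def is_authorized(headers: dict) -> bool:
    """Reject requests that lack a valid service token,
    even when they originate inside the cluster."""
    token = headers.get("X-Service-Token", "")
    # compare_digest avoids leaking the token via timing side channels.
    return hmac.compare_digest(token, EXPECTED_TOKEN)
```

A check like this costs one header per request and closes the "internal = safe" hole: a compromised neighbor service without the token gets a 401 like anyone else.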
PII in logs was the compliance nightmare. User emails and evaluation data in plaintext application logs. GDPR violations waiting to happen, and we were generating them by the thousands.
The Monolithic God Class
Beyond security, there was architecture debt. A monolithic ParallelOrchestrator handled evaluation logic, API routing, state management, and error handling. V1/V2 code boundaries leaked into each other. In-memory rate limiting didn't work across Cloud Run instances: each instance kept its own counter, so a client spread across N instances could send N times the intended limit. CPU-bound algorithms blocked the async event loop, turning our FastAPI app into a synchronous bottleneck.
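The event-loop problem, concretely: calling a CPU-bound function directly inside an async handler stalls every other request until it finishes. The standard fix is to hand the call to a worker thread. A minimal sketch (the scoring function is a stand-in, not our actual algorithm):

```python
import asyncio

def score_evaluation(payload: str) -> int:
    # Stand-in for a CPU-bound scoring algorithm (name is invented).
    return sum(ord(c) for c in payload) % 100

async def handle_request(payload: str) -> int:
    # Blocking form: `return score_evaluation(payload)` would stall
    # the event loop for the whole computation.
    # Non-blocking form: run it in a worker thread and await the result.
    return await asyncio.to_thread(score_evaluation, payload)

result = asyncio.run(handle_request("example"))
```

While the thread computes, the event loop keeps serving other requests, which is the whole point of running FastAPI async in the first place.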
The Remediation
All secrets moved to GCP Secret Manager. Here's what changed:
# Before
DATABASE_PASSWORD = "hardcoded_password_123"

# After
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
# Fetch the latest secret version at runtime;
# nothing sensitive ever lands in the repository.
secret = client.access_secret_version(
    name="projects/my-project/secrets/db-password/versions/latest"
)
DATABASE_PASSWORD = secret.payload.data.decode("UTF-8")
Parameterized queries everywhere for values, plus an allowlist for identifiers like table names, which can't be bound as parameters. No more raw f-strings built from user input. Here's the pattern we enforced:
# Before (vulnerable)
query = f"SELECT * FROM {table_name} WHERE id = {user_id}"

# After (safe)
# Identifiers can't be bound as query parameters, so table names
# must come from a fixed allowlist; values are bound as parameters.
if table_name not in ALLOWED_TABLES:
    raise ValueError(f"unknown table: {table_name}")
query = f"SELECT * FROM `{table_name}` WHERE id = %s"
cursor.execute(query, (user_id,))
Structured logging with a PII-stripping middleware: every log entry goes through a sanitizer that removes emails, API keys, and user identifiers before it is written. Authentication middleware on every endpoint, no exceptions. Pre-commit hooks round it out: linting with ruff, type checking, and security scanning with bandit.
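A sanitizer like that can live in a standard logging.Filter. A minimal sketch of the idea (the regexes here only cover emails and one hypothetical API-key shape; a real deployment needs a much broader pattern set):

```python
import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
API_KEY_RE = re.compile(r"sk-[A-Za-z0-9]{16,}")  # illustrative key shape

class PIIFilter(logging.Filter):
    """Redact emails and API-key-shaped strings before a record is emitted."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_RE.sub("[EMAIL]", str(record.msg))
        record.msg = API_KEY_RE.sub("[API_KEY]", record.msg)
        return True  # keep the record, just scrubbed
```

Attached with logger.addFilter(PIIFilter()), it scrubs every record regardless of which handler ultimately writes it.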
We broke the God Class into focused services: EvaluationService, RoutingService, StateService, ErrorHandler. Built an LLM Gateway abstraction so we could swap providers without touching business logic. Moved rate limiting to Redis so it works across instances. Wrapped CPU-bound work in thread pools so the event loop stays responsive.
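Moving the counter into Redis is a small amount of code. A fixed-window sketch, assuming a redis-py-style client (only incr and expire are used; the key format and limits are illustrative):

```python
import time

def allow_request(client, user_id: str, limit: int = 100,
                  window_s: int = 60) -> bool:
    """Fixed-window rate limit backed by a shared Redis client.

    Every instance increments the same key, so the limit holds
    globally instead of per Cloud Run instance.
    """
    key = f"ratelimit:{user_id}:{int(time.time()) // window_s}"
    count = client.incr(key)  # atomic in Redis
    if count == 1:
        client.expire(key, window_s)  # old windows clean themselves up
    return count <= limit
```

In production this is called with a real redis.Redis(...) client shared by all instances; fixed windows allow a brief burst at window boundaries, which was an acceptable trade-off for us against the complexity of sliding windows.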
The Takeaway
The two days I spent auditing saved us from at least three potential production incidents. Schedule security audits. Don't wait for an incident. Security debt compounds silently, and the longer you ignore it, the more expensive it becomes to fix.