Systematic Service Hardening: Security, Performance, and Reliability

·12 min read·Humza Tareen
Security · Performance · Python · FastAPI

One Week, Three Services, Zero Downtime

I spent a week methodically hardening three production services — the auth gateway, the notification service, and the auto-rater. Not feature work. Not refactoring for aesthetics. Targeted fixes for real vulnerabilities, performance bottlenecks, and reliability gaps that I'd been cataloging during months of development.

The Auth Gateway: Death by a Thousand Queries

The auth gateway had a classic N+1 query problem. Listing 50 teams triggered 100+ individual database queries — two per team (member count + project count). At scale, page loads were painfully slow.

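from sqlalchemy import distinct, func, select

# Before: N+1 queries (one query for the teams, then two more per team)
result = await session.execute(select(Team).filter_by(org_id=org_id))
teams = result.scalars().all()  # unwrap result rows into Team instances
for team in teams:
    team.member_count = await count_members(team.id)  # Query per team
    team.project_count = await count_projects(team.id)  # Another query per team

# After: single query using outer joins and COUNT(DISTINCT ...)
# DISTINCT matters here: joining both child tables yields members x projects
# rows per team, so a plain COUNT would overcount.
result = await session.execute(
    select(
        Team,
        func.count(distinct(TeamMember.id)).label("member_count"),
        func.count(distinct(Project.id)).label("project_count"),
    )
    .outerjoin(TeamMember)
    .outerjoin(Project)
    .filter(Team.org_id == org_id)
    .group_by(Team.id)
)
rows = result.all()  # each row is (Team, member_count, project_count)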

The same service had non-atomic counter operations. Incrementing a team's member count used read-then-write: load the object, increment in Python, save. Under concurrent requests, two users joining simultaneously would both read count=5, both write count=6, losing one increment.

from sqlalchemy import update

# Before: non-atomic read-modify-write (race condition)
team = await get_team(team_id)
team.member_count += 1
await session.commit()

# After: atomic SQL UPDATE; the database performs the increment itself
await session.execute(
    update(Team)
    .where(Team.id == team_id)
    .values(member_count=Team.member_count + 1)
)
await session.commit()

Security: SSRF, XSS, and CORS

The auth gateway accepted route URLs without validation — a server-side request forgery (SSRF) vector. An attacker could register a route pointing to http://169.254.169.254/ (the cloud metadata endpoint) and have the gateway proxy requests to it.

I added URL validation: HTTPS-only, blocked internal hosts (localhost, 127.0.0.1, cloud metadata IPs), and rejected private/loopback address ranges.
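
The checks described above can be sketched with the standard library alone. This is a minimal illustration, not the gateway's actual code: the function name `validate_route_url` and the `BLOCKED_HOSTNAMES` set are assumptions made for the example.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical blocklist; a real deployment would tailor this to its cloud.
BLOCKED_HOSTNAMES = {"localhost", "metadata.google.internal"}


def validate_route_url(url: str) -> str:
    """Return the URL unchanged if it passes the checks, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("only https:// URLs are allowed")

    host = parts.hostname  # lowercased, with IPv6 brackets stripped
    if not host:
        raise ValueError("URL has no host")
    if host.rstrip(".") in BLOCKED_HOSTNAMES:
        raise ValueError(f"blocked host: {host}")

    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return url  # a DNS name rather than an IP literal

    # Reject private, loopback, link-local (169.254.0.0/16), and similar ranges.
    if (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_unspecified
    ):
        raise ValueError(f"blocked address: {ip}")
    return url
```

A check like this only inspects the URL string. A public DNS name that resolves to an internal address would still pass, so the resolved IP should also be checked at request time.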

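The frontend stored auth tokens in localStorage, where they persist across tabs and browser restarts and can be read by any script running in the page, including an XSS payload. I switched to sessionStorage behind a safe storage wrapper so tokens die with the browser tab. That shortens a stolen token's useful lifetime but does not stop XSS from reading it, since injected script can read sessionStorage too. I also added OAuth state parameter validation for CSRF protection and fixed a CORS misconfiguration that was reflecting any origin with credentials.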
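
To illustrate the CORS fix, here is a small framework-agnostic sketch of an origin allowlist. The names `ALLOWED_ORIGINS` and `cors_headers` and the example origin are hypothetical. The point is only that credentialed responses echo an origin back when it is explicitly trusted, never unconditionally.

```python
from typing import Dict, Optional

# Hypothetical allowlist; the real set of trusted frontends is an assumption.
ALLOWED_ORIGINS = frozenset({"https://app.example.com"})


def cors_headers(origin: Optional[str]) -> Dict[str, str]:
    """Return CORS response headers for a request's Origin header.

    Unknown origins get no CORS headers at all, so the browser blocks the
    cross-origin read instead of receiving a reflected origin.
    """
    if origin is None or origin not in ALLOWED_ORIGINS:
        return {}
    return {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Credentials": "true",
        # Tell caches that the response differs per Origin header.
        "Vary": "Origin",
    }
```

In a FastAPI app the same effect usually comes from configuring `CORSMiddleware` with an explicit `allow_origins` list rather than hand-writing headers.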

The Notification Service: Race Conditions Everywhere

The notification service had race conditions in four subsystems: caching stats, notification dispatch, subscription management, and the dead-letter queue. Each shared mutable state across async tasks without proper locking.

The fixes were surgical: dedicated asyncio.Lock instances for each critical section, atomic operations where possible, and a complete rewrite of the DLQ from in-memory-only to file-based persistence so messages survive process restarts.
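
As a rough sketch of the two ideas together, here is a file-backed dead-letter queue whose critical sections are guarded by an `asyncio.Lock`. The class name `FileDeadLetterQueue`, the JSON Lines format, and the method names are assumptions for illustration, not the service's actual design.

```python
import asyncio
import json
from pathlib import Path


class FileDeadLetterQueue:
    """Append-only JSON Lines dead-letter queue guarded by an asyncio.Lock."""

    def __init__(self, path: Path) -> None:
        self._path = path
        # Serializes file access between tasks in this process only.
        self._lock = asyncio.Lock()

    async def put(self, message: dict) -> None:
        line = json.dumps(message) + "\n"
        async with self._lock:
            # Run blocking file I/O in a worker thread (Python 3.9+).
            await asyncio.to_thread(self._append, line)

    async def drain(self) -> list:
        """Return all stored messages and clear the file."""
        async with self._lock:
            return await asyncio.to_thread(self._read_and_clear)

    def _append(self, line: str) -> None:
        with self._path.open("a", encoding="utf-8") as fh:
            fh.write(line)

    def _read_and_clear(self) -> list:
        if not self._path.exists():
            return []
        with self._path.open("r", encoding="utf-8") as fh:
            messages = [json.loads(line) for line in fh if line.strip()]
        self._path.unlink()
        return messages
```

Because the messages live in a file, a fresh `FileDeadLetterQueue` pointed at the same path after a restart can still drain them. The lock does not coordinate between separate processes; that would need file locking or a real queue.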

The delivery services (SendGrid, Slack, webhooks) had no retry logic and treated only a narrow set of status codes as success. I added retry strategies with exponential backoff, added input validation (email format, URL scheme), and made the event publisher reliable: it now awaits Pub/Sub acknowledgment instead of firing and forgetting.
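
Exponential backoff can be sketched as a small async helper. Everything here is illustrative: `RetryableError`, `retry_with_backoff`, and the default delays are assumptions, and the jitter is an addition that the original fix may or may not have included.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx, 429)."""


async def retry_with_backoff(
    send: Callable[[], Awaitable[T]],
    *,
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call send() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await send()
        except RetryableError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last failure
            # The delay doubles each attempt: 0.5s, 1s, 2s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from many clients failing at once.
            await sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

Non-retryable failures, such as a 400 caused by a malformed email address, should raise a different exception so they fail immediately instead of being retried.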

Auto-Rater: Scoring Integrity

The auto-rater needed mathematical correctness guarantees. I wrote 24 tests with known inputs and analytically verified expected outputs for the Borda Count and Bradley-Terry scoring algorithms. These aren't "does it return 200" tests — they verify that the ranking math is correct.
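
To give a flavor of that kind of test, here is a sketch using a textbook Borda Count, where a ballot ranking n candidates best-first awards n - 1 points to first place down to 0 for last. The `borda_scores` function is a stand-in written for this example; the auto-rater's real implementation and scoring variant may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence


def borda_scores(ballots: Iterable[Sequence[str]]) -> Dict[str, int]:
    """Sum Borda points: position i on a ballot of n candidates earns n - 1 - i."""
    scores: Dict[str, int] = defaultdict(int)
    for ballot in ballots:
        n = len(ballot)
        for position, candidate in enumerate(ballot):
            scores[candidate] += n - 1 - position
    return dict(scores)


def test_borda_matches_hand_computed_scores() -> None:
    ballots = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
    # Worked by hand: a = 2 + 2 + 1, b = 1 + 0 + 2, c = 0 + 1 + 0.
    assert borda_scores(ballots) == {"a": 5, "b": 3, "c": 1}


def test_borda_total_points_invariant() -> None:
    ballots = [["a", "b", "c"], ["c", "b", "a"]]
    # Each ballot of n candidates hands out n * (n - 1) / 2 points in total.
    assert sum(borda_scores(ballots).values()) == 2 * (3 * 2 // 2)
```

The first test pins the output to values computed by hand. The second checks a property that holds for any input, which catches off-by-one errors in the point assignment.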

I also added missing database indexes for common query patterns, startup configuration validation (fail fast on misconfiguration), and deployment concurrency controls to prevent race conditions during Cloud Run deploys.
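
Fail-fast configuration validation might look something like the following sketch. The `Settings` fields and the environment variable names `DATABASE_URL` and `MAX_CONCURRENCY` are hypothetical. The idea is simply to collect every problem and refuse to start, rather than fail on the first request that touches a bad value.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    max_concurrency: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Validate configuration once at startup; raise if anything is wrong."""
    errors: List[str] = []

    database_url = env.get("DATABASE_URL", "")
    if not database_url:
        errors.append("DATABASE_URL is required")

    raw_concurrency = env.get("MAX_CONCURRENCY", "8")
    max_concurrency = 0
    try:
        max_concurrency = int(raw_concurrency)
        if max_concurrency < 1:
            errors.append("MAX_CONCURRENCY must be at least 1")
    except ValueError:
        errors.append(f"MAX_CONCURRENCY is not an integer: {raw_concurrency!r}")

    if errors:
        # Report every problem at once so one deploy fixes them all.
        raise RuntimeError("invalid configuration: " + "; ".join(errors))
    return Settings(database_url=database_url, max_concurrency=max_concurrency)
```

Calling `load_settings()` before the app starts serving makes a misconfigured container crash at boot, where the deploy pipeline can see it, instead of serving errors later.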

The Test Suite

The auth gateway had zero automated tests. I wrote 124 unit tests covering all critical backend services, models, and infrastructure. The test suite runs on every PR with coverage reporting, alongside Trivy vulnerability scanning on the Docker image.

The Pattern

Every service got the same treatment: identify the vulnerability class, write the fix, add tests proving it works, and verify no regressions. Rate limiting on admin endpoints. Enum validation at the database level. Pagination on all list endpoints. Soft delete support. Consistent error responses. Security headers. Request body size limits.
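
As one concrete example from that list, rate limiting is often built on a token bucket. The sketch below is a generic in-process version with hypothetical names; it is not the limiter these services use, and a multi-instance deployment would need shared state such as Redis.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate
        self._capacity = float(capacity)
        self._tokens = float(capacity)  # start with a full bucket
        self._clock = clock
        self._updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may proceed."""
        now = self._clock()
        # Refill based on elapsed time, never exceeding capacity.
        elapsed = now - self._updated
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A handler would keep one bucket per client or API key and return HTTP 429 when `allow()` is false. The injectable `clock` exists so tests can advance time without sleeping.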

It's not glamorous work. But production systems that don't get hardened eventually become production incidents.