The Connective Tissue of an AI Platform: Workflow, Taxonomy, Auth, and Memory
When you're building an AI evaluation platform with multiple microservices, the "core" services get all the attention — the evaluation engine, the scoring system, the RAG pipeline. But a platform doesn't work without the connective tissue: the workflow orchestration that keeps humans in the loop, the taxonomy engine that classifies tasks intelligently, the platform service that ties authentication together, and the evaluation suites that ensure models actually remember context.
These four services don't make headlines, but they're what turned a collection of microservices into an actual platform. Here's what went into each one and why the engineering decisions mattered.
Workflow Orchestration: The Human-in-the-Loop Engine
AI evaluation is not fully automated — and it shouldn't be. Certain decisions require human judgment: Is this model response harmful? Does this evaluation rubric make sense for this domain? Is this edge case a genuine failure or acceptable behavior?
The workflow orchestrator manages these decision points. It coordinates multi-step evaluation workflows where some steps are automated (LLM scoring, data validation) and others require human approval before the pipeline continues.
The Architecture
The core is a state machine built on FastAPI and PostgreSQL. Each workflow is a DAG (directed acyclic graph) of tasks, where each node can be:
- Automated: Runs immediately, calls another service (scoring, data enrichment), stores the result
- Human gate: Pauses the workflow, notifies the assigned reviewer via the notification service, waits for approval/rejection
- Conditional: Routes to different branches based on previous step outcomes (e.g., if confidence score < threshold, escalate to senior reviewer)
State transitions are persisted in PostgreSQL with Alembic-managed migrations. Every transition is logged — who approved what, when, and with what context. This audit trail turned out to be critical for client reporting.
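As a rough sketch of the node types above (all names hypothetical, not the actual service's API), the state machine might be modeled like this:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class NodeType(Enum):
    AUTOMATED = "automated"
    HUMAN_GATE = "human_gate"
    CONDITIONAL = "conditional"

class StepStatus(Enum):
    PENDING = "pending"
    WAITING_FOR_HUMAN = "waiting_for_human"
    COMPLETED = "completed"

@dataclass
class WorkflowStep:
    name: str
    node_type: NodeType
    status: StepStatus = StepStatus.PENDING
    # For CONDITIONAL nodes: maps a step outcome to the next step's name
    routes: dict = field(default_factory=dict)

def advance(step: WorkflowStep, outcome: Optional[str] = None) -> Optional[str]:
    """Advance one step; returns the next step's name for conditional routes."""
    if step.node_type is NodeType.HUMAN_GATE:
        step.status = StepStatus.WAITING_FOR_HUMAN  # pause until a reviewer acts
        return None
    if step.node_type is NodeType.CONDITIONAL:
        step.status = StepStatus.COMPLETED
        return step.routes.get(outcome)  # e.g. "low_confidence" -> senior review
    step.status = StepStatus.COMPLETED  # AUTOMATED: run and move on
    return None
```

Each transition in the real service would also write an audit row (actor, timestamp, context) before the status change is committed.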
Real-Time Updates with WebSockets
The original system polled the API every 5 seconds to check workflow status. With dozens of reviewers working concurrently, this created unnecessary load and a poor user experience — you'd approve a task and see nothing happen for up to 5 seconds.
I replaced this with WebSocket connections that push state changes in real-time. When a reviewer approves a step, every connected client watching that workflow sees the update instantly. The implementation uses FastAPI's WebSocket support with Redis Pub/Sub as the message broker, so it works across multiple Cloud Run instances.
```python
# Simplified WebSocket broadcast pattern
import json
from datetime import datetime, timezone

async def broadcast_workflow_update(workflow_id: str, event: dict):
    # `redis` is a shared async Redis client (e.g. redis.asyncio.Redis)
    channel = f"workflow:{workflow_id}"
    await redis.publish(channel, json.dumps({
        "type": "state_change",
        "workflow_id": workflow_id,
        "step": event["step"],
        "status": event["new_status"],
        "actor": event["actor_email"],
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }))
```
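On the receiving side, each Cloud Run instance subscribes to the workflow's channel and forwards messages to its local WebSocket connections. A minimal in-process stand-in for that fan-out, using asyncio queues in place of real Redis Pub/Sub (the production code would use `redis.asyncio` and FastAPI's WebSocket support):

```python
import asyncio
import json

class LocalBroker:
    """In-process stand-in for Redis Pub/Sub: one queue per subscriber."""
    def __init__(self):
        self.subscribers: dict[str, list[asyncio.Queue]] = {}

    def subscribe(self, channel: str) -> asyncio.Queue:
        q = asyncio.Queue()
        self.subscribers.setdefault(channel, []).append(q)
        return q

    async def publish(self, channel: str, message: str):
        # Every subscriber on the channel gets its own copy of the message
        for q in self.subscribers.get(channel, []):
            await q.put(message)

async def demo():
    broker = LocalBroker()
    # Two "clients" watching the same workflow
    q1 = broker.subscribe("workflow:wf-123")
    q2 = broker.subscribe("workflow:wf-123")
    await broker.publish("workflow:wf-123",
                         json.dumps({"type": "state_change", "step": "review"}))
    return json.loads(await q1.get()), json.loads(await q2.get())

m1, m2 = asyncio.run(demo())
```

The per-subscriber queue is what makes the pattern scale across instances: each Cloud Run instance holds only its own connections, and Redis handles delivery between instances.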
Production Logging Overhaul
The existing codebase used print() statements everywhere. In production on Cloud Run, these were effectively invisible — they'd show up as unstructured text in Cloud Logging with no way to filter, search, or correlate them.
I replaced the entire logging infrastructure with structured JSON logging. Every log entry includes a correlation ID that traces a request across the workflow orchestrator, the notification service, and whatever downstream service is involved. When a workflow fails at step 4 of 7, you can now trace exactly what happened at each step, in each service, with a single query.
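The core of that change is a formatter that emits one JSON object per log line, with the correlation ID attached at the call site. A simplified sketch (field names are illustrative, not the exact production schema):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so Cloud Logging can index the fields."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            # correlation_id is attached via `extra=` at the call site
            "correlation_id": getattr(record, "correlation_id", None),
        })

logger = logging.getLogger("workflow")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("step approved", extra={"correlation_id": "req-8f3a"})
```

Because every service logs the same `correlation_id` field, a single Cloud Logging query on that field returns the full cross-service trace for one request.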
Taxonomy Workflow Engine: Intelligent Task Classification
Not all evaluation tasks are the same. A code generation task requires different rubrics, different evaluators, and different tooling than a conversational AI task. The taxonomy engine is the routing layer that classifies incoming tasks and determines which evaluation workflow to apply.
The Problem It Solves
Before this service existed, task classification was manual. A project manager would look at incoming evaluation requests, decide which team should handle them, and assign the appropriate rubric. This worked at 50 tasks per day. It didn't work at thousands.
How It Works
The engine uses a combination of keyword matching, metadata analysis, and configurable rule sets to classify tasks. Each classification determines:
- Which evaluation rubric to apply
- Which reviewer pool to draw from (by expertise)
- Whether the task requires single or multi-reviewer consensus
- SLA targets for completion time
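A stripped-down version of the keyword-matching half of that classification (rule shapes and field names are illustrative, not the service's actual rule format):

```python
RULES = [
    # (keywords, classification decision)
    ({"code", "function", "refactor"}, {
        "rubric": "code_generation_v2",
        "reviewer_pool": "engineering",
        "consensus": "single",
        "sla_hours": 24,
    }),
    ({"conversation", "chat", "dialogue"}, {
        "rubric": "conversational_v1",
        "reviewer_pool": "linguistics",
        "consensus": "multi",
        "sla_hours": 48,
    }),
]

DEFAULT = {"rubric": "general_v1", "reviewer_pool": "general",
           "consensus": "multi", "sla_hours": 72}

def classify(task_text: str) -> dict:
    """Pick the rule with the most keyword overlap; fall back to a default."""
    words = set(task_text.lower().split())
    best, best_overlap = DEFAULT, 0
    for keywords, classification in RULES:
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best, best_overlap = classification, overlap
    return best
```

The real engine layers metadata analysis and configurable rule sets on top of this, but the output shape is the same: one decision object that drives rubric, reviewer pool, consensus mode, and SLA.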
The file upload system allows clients to submit evaluation tasks in bulk via CSV/JSON uploads to GCS. The engine parses and validates each file, classifies each row, and enqueues the rows into the appropriate workflow, all asynchronously via Cloud Tasks.
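For the CSV path, the important design choice is collecting per-row errors instead of failing the whole file. A minimal sketch of that parse-and-validate step (column names hypothetical):

```python
import csv
import io

REQUIRED = {"task_id", "task_text"}

def parse_bulk_upload(csv_text: str):
    """Parse a bulk CSV upload; return (valid_rows, errors) rather than
    rejecting the entire file on the first bad row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        return [], [f"missing columns: {sorted(missing)}"]
    valid, errors = [], []
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        if not row["task_text"].strip():
            errors.append(f"row {i}: empty task_text")
            continue
        valid.append(row)
    return valid, errors
```

Each valid row would then be classified and enqueued as its own Cloud Tasks message, so one malformed row never blocks the rest of the batch.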
Infrastructure: Cloud SQL for taxonomy rules and classification history, GCS for bulk file uploads, Cloud Run for the API layer, Cloud Tasks for async processing.
Core Platform Service: The Authentication Backbone
Every microservice in the platform needs to answer two questions: "Who is making this request?" and "Are they allowed to do this?" The core platform service provides those answers.
JWT Authentication Fixes
The existing JWT implementation had a subtle but critical bug: token validation was checking expiration time against the server's local time rather than UTC. Cloud Run instances can have slight clock drift, and this meant tokens would occasionally be rejected as "expired" when they were still valid, or accepted when they should have been rejected.
The fix was straightforward — normalize all time comparisons to UTC — but finding it required tracing sporadic 401 errors across multiple services to realize the pattern correlated with specific Cloud Run instances, not specific users.
```python
# Before: clock-sensitive comparison
if token_exp < datetime.now():  # local time, unreliable on Cloud Run
    raise HTTPException(401, "Token expired")

# After: UTC-normalized comparison
if token_exp < datetime.now(timezone.utc):  # always consistent
    raise HTTPException(401, "Token expired")
```
User Management and GDPR Compliance
I built the user deletion endpoint, which sounds simple until you realize that "deleting a user" in a system with audit trails, evaluation history, and cross-service references means carefully cascading the deletion while preserving anonymized audit records. The implementation soft-deletes the user profile, anonymizes the user's evaluation history (replacing PII with hashed identifiers), and propagates the deletion event to downstream services via Pub/Sub.
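The anonymization step can be sketched as replacing identifiers with stable salted hashes and dropping direct PII outright (function and field names here are hypothetical):

```python
import hashlib

def anonymize_user_id(user_id: str, salt: str) -> str:
    """Replace a user identifier with a stable salted hash,
    so anonymized audit rows remain joinable to each other."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()[:16]

def anonymize_audit_record(record: dict, salt: str) -> dict:
    anonymized = dict(record)
    anonymized["user_id"] = anonymize_user_id(record["user_id"], salt)
    # Drop direct PII outright; keep the non-identifying evaluation data
    for pii_field in ("email", "display_name"):
        anonymized.pop(pii_field, None)
    return anonymized
```

The stable hash is the key property: audit records from the same (deleted) user still correlate with each other, but nothing maps back to the person.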
Developer Experience
I improved the local development workflow by rewriting start_dev.sh to properly handle the Docker container lifecycle. The previous script would silently fail if the PostgreSQL container was already running from a previous session, leading to "connection refused" errors that wasted 10-15 minutes of debugging per developer, multiple times per week. The new script checks for existing containers, handles cleanup, and prints clear status messages.
Memory Evaluation Suite: Does the Model Remember?
One of the harder problems in LLM evaluation is measuring context retention. When you give a model a long conversation or a complex document, does it actually use information from the beginning when answering questions at the end? Or does it "forget" earlier context?
The memory evaluation suite provides structured tests for this. It generates conversations with deliberate information planted at various positions (beginning, middle, end), then asks questions that require recalling that information. The scoring tracks not just accuracy, but where in the context window the model starts losing information — which is critical data for the teams training these models.
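The generate-and-score loop above can be sketched in a few lines. This is a simplified stand-in for the suite's actual test builder (names and the substring-match scoring are illustrative):

```python
def build_memory_probe(filler: list[str], facts: dict[str, str]) -> list[str]:
    """Plant one fact at the start, middle, and end of a filler conversation.
    `facts` maps position name -> the sentence to plant."""
    turns = list(filler)
    turns.insert(0, facts["beginning"])
    turns.insert(len(turns) // 2, facts["middle"])
    turns.append(facts["end"])
    return turns

def score_recall(answers: dict[str, str], expected: dict[str, str]) -> dict[str, bool]:
    """Per-position recall: did the model's answer contain the planted detail?"""
    return {pos: expected[pos].lower() in answers.get(pos, "").lower()
            for pos in expected}
```

Because scoring is keyed by position rather than aggregated into one accuracy number, the results show where in the context window recall degrades, which is the signal the training teams actually need.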
Code Quality as a Feature
This was also where I implemented the team's first pre-commit workflow using Ruff for linting and formatting, enforced via GitHub Actions. The motivation wasn't just code aesthetics — inconsistent formatting was causing unnecessary merge conflicts across the team. Two developers would change the same file, both would reformat it differently, and the merge conflict had nothing to do with the actual logic.
After rolling out the pre-commit pipeline on this service and proving it reduced merge conflicts, we adopted it across every service in the platform.
The Engineering Pattern
What ties these four services together isn't the domain logic — it's the systematic engineering discipline I applied to each one. Every service I touched got the same treatment:
| Pattern | Why It Matters |
|---|---|
| Structured JSON logging with correlation IDs | One query to trace a request across all services. Reduced mean time to diagnosis from hours to minutes. |
| Pre-commit hooks (Ruff, type checking) | Eliminated formatting merge conflicts. Caught type errors before they hit production. |
| Custom exception hierarchies | Consistent error responses across services. Clients can programmatically handle errors instead of parsing strings. |
| Alembic migrations | Version-controlled schema changes. Zero-downtime deployments with reversible migrations. |
| Security audit per service | Found hardcoded credentials, missing auth checks, and SQL injection vectors before they became incidents. |
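As one concrete example of these patterns, a custom exception hierarchy that maps cleanly to consistent HTTP error responses might look like this (class and field names are illustrative, not the platform's exact hierarchy):

```python
class PlatformError(Exception):
    """Base class: every service error carries a machine-readable
    error code and an HTTP status."""
    status_code = 500
    error_code = "internal_error"

    def __init__(self, detail: str):
        super().__init__(detail)
        self.detail = detail

    def to_response(self) -> dict:
        # Same JSON shape across all services, so clients can
        # branch on error_code instead of parsing message strings
        return {"error_code": self.error_code, "detail": self.detail}

class NotFoundError(PlatformError):
    status_code = 404
    error_code = "not_found"

class AuthorizationError(PlatformError):
    status_code = 403
    error_code = "forbidden"
```

A single FastAPI exception handler registered for `PlatformError` then turns any subclass into the right status code and body, in every service, with no per-endpoint error plumbing.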
Why This Work Matters
It's easy to dismiss "I worked on four more services" as a breadth play. But the reality is that platform engineering requires breadth. The workflow orchestrator doesn't exist without the platform service providing authentication. The taxonomy engine doesn't work without the workflow orchestrator to route tasks into. The memory evaluation suite's code quality pipeline became the template for every other service.
These aren't four independent projects. They're four layers of a system that only works because someone cared enough to apply the same engineering rigor to the "boring" services that they applied to the "interesting" ones.