WARD #3 | DEC 26, 2025

The AI RACI

Why “the robot wrote it” is not a valid defense

We need to talk about accountability.

In the golden age of "vibe coding", a developer can prompt a service into existence in 45 minutes. When that service later leaks memory, mishandles auth, or exposes something it shouldn't… who owns the blast radius?

"The model wrote it" is not an explanation. It’s a workflow smell.

This is the Liability Gap: output without ownership.

If you want boring software that actually stays up, you close the gap with something boring: clear roles + hard gates.

The problem: confusing responsibility with accountability

Let’s be clear:

An LLM is not a senior engineer. It is a hyperactive goblin with a keyboard, infinite stamina, and zero moral compass.

You can’t hold a model accountable. You can’t fire a neural network. So the accountability always boomerangs back to humans.

The AI RACI (GS-TDD Edition)

Yes, this bends "classic" RACI a bit, because in an AI workflow the always-on consultant is your test suite.

In the formal GS-TDD framework (see the Research Paper), the RACI matrix maps the literal workflow steps. But through a governance lens, the Test Suite deserves a promotion: in this 'Boring Reliability' interpretation, we treat it not just as a tool, but as the primary Consultant.

Role | Concept     | The Player              | Job Description
-----|-------------|-------------------------|----------------------------------------------------------------
R    | Responsible | The AI (Cursor/LLM)     | The doer. Writes boilerplate, drafts implementations, refactors, grinds through the typing.
A    | Accountable | You (Engineer)          | The owner. Defines what should be built. Reviews. Signs the commit. Fixes it when it breaks.
C    | Consulted   | The Test Suite (Wards)  | The oracle. Rejects invalid logic. Forces reality onto the implementation.
I    | Informed    | Users / On-call         | The people who shouldn’t have to care how it was built, only whether it works.

R: Responsible (The Goblin)

The AI is fantastic at being Responsible. It can churn out scaffolding and "green-phase" code at a ridiculous pace.

But it does not care about truth. It cares about plausible tokens.
So we use it for speed — never for trust.

A: Accountable (The Engineer)

In the old world you were paid to type syntax. In the new world you’re paid to maintain Accountability.

Your GS-TDD job is to:

  1. Define behaviour (Gold Standard Tests; a sketch follows this list).
  2. Review output (Does it actually satisfy the intent?).
  3. Ship consciously (Commit + Deploy).
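
Step 1 is the load-bearing one, so here is a minimal sketch of what it can look like. Vitest and the canWithdraw() module are illustrative assumptions, not names from this project; the test is written before the implementation exists, so it starts Red by design.

  // Gold-standard test, written first. Assumes Vitest as the runner and
  // a hypothetical ./account module that does not exist yet (Red phase).
  import { describe, it, expect } from "vitest";
  import { canWithdraw } from "./account";

  describe("withdrawal policy (gold standard)", () => {
    it("allows a withdrawal covered by the balance", () => {
      expect(canWithdraw({ balance: 100 }, 40)).toBe(true);
    });

    it("rejects a withdrawal that overdraws the account (sad path)", () => {
      expect(canWithdraw({ balance: 100 }, 140)).toBe(false);
    });
  });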

In practice, when you merge AI-generated code, you are professionally adopting the consequences. "The robot did it" isn’t a defense. It’s an admission that nobody checked.

C: Consulted (The Wards)

In a manual workflow, "consulted" often means "ask a colleague". In GS-TDD, the test suite is the consultant that never sleeps:

  • AI: "I generated an auth middleware!"
  • Tests: "Objection. You broke the refresh token flow for mobile users."
  • AI: "Correcting..."

Without a gold-standard test suite, you’re letting the goblin run the bank.
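
To make that objection concrete, here is a hedged sketch of such a ward. Again, Vitest and the refreshSession() helper are assumptions for illustration; the point is that the expectation lives in code, so the consultant never forgets it.

  // A ward pinning down the refresh-token flow. refreshSession() is a
  // hypothetical helper; the shape of the assertions is what matters.
  import { describe, it, expect } from "vitest";
  import { refreshSession } from "./auth";

  describe("auth middleware: refresh token flow", () => {
    it("issues a new access token for a valid mobile refresh token", async () => {
      const session = await refreshSession({ token: "valid-token", client: "mobile" });
      expect(session.accessToken).toBeTruthy();
    });

    it("rejects an expired refresh token (sad path)", async () => {
      await expect(refreshSession({ token: "expired-token", client: "mobile" }))
        .rejects.toThrow();
    });
  });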

A tiny real example

We recently flipped a strictness flag in TypeScript. The model "fixed" a handler by adding a default branch that looked safe… but it silently skipped a permission check for admin routes.

Because we had a Consulted test suite covering the "Sad Path" (access denied), the build failed immediately. The AI was Responsible for the fix, but the test suite protected the engineer's Accountability.
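
Here is a reconstruction of the failure mode, with hypothetical names (the real handler stays internal). The default branch compiles cleanly under the stricter settings and looks harmless, but it answers "allow" for every role the switch does not name:

  // Reconstructed, not our actual code. Under the stricter compiler
  // settings the switch had to handle every case, and the model picked
  // the worst possible default.
  import { it, expect } from "vitest";

  type Role = "admin" | "user" | "guest";

  function canAccess(route: string, role: Role): boolean {
    switch (role) {
      case "admin":
        return true;
      case "user":
        return !route.startsWith("/admin");
      default:
        return true; // BUG: guests silently pass the admin permission check
    }
  }

  // The sad-path ward that failed the build immediately:
  it("denies guests access to admin routes (sad path)", () => {
    expect(canAccess("/admin/users", "guest")).toBe(false);
  });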

That’s the point: LLMs move fast. Tests define reality. Humans own the outcome.

The workflow on a normal Tuesday

  1. Accountable (You) writes a gold-standard test. It fails (Red). This defines the boundary of reality.
  2. Responsible (AI) drafts code to satisfy it (see the sketch after this list).
  3. Consulted (Tests) rejects hallucinations and shortcuts.
  4. Responsible (AI) iterates until tests go Green.
  5. Accountable (You) reviews design and correctness, refactors (often with AI), and merges.
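
Closing the loop on the earlier withdrawal sketch: once the Red test exists, the Responsible party's draft only has to satisfy it. A minimal (still hypothetical) implementation that turns that ward Green:

  // ./account.ts: the minimal implementation drafted to satisfy the
  // gold-standard test sketched earlier. Hypothetical, like the test.
  export interface Account {
    balance: number;
  }

  export function canWithdraw(account: Account, amount: number): boolean {
    // Reality as the wards define it: a withdrawal must be positive
    // and fully covered by the current balance.
    return amount > 0 && amount <= account.balance;
  }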

Conclusion: Don’t abdicate the throne

AI feels like magic. That’s why it’s dangerous.

Boring Reliability isn’t about doing less. It’s about spending your energy where it matters: correctness, architecture, and accountability. Let the AI do the typing. Just don’t let it hold the keys.

The Boring AI Checklist

Before you merge that AI-generated PR, ask yourself:

  1. Did I write the test first? (Or am I just testing that the AI's hallucination works?)
  2. Did the tests actually fail? (Trust, but verify the Red phase.)
  3. Did I read the implementation? (Or did I just vibe-check the diff?)
  4. Am I ready to be on-call for this? (If it breaks at 3AM, is “Cursor wrote it” going to comfort me?)
  5. Is it boring? (Clever AI code is technical debt. Boring code is an asset.)