My Superhero Origin Story

The day I discovered my AI was cheating on my tests.

Every superhero needs an origin story.

Batman had the alley.
Spider-Man had the spider.
I had… Cursor, Claude 3.5 Sonnet, and a very bad idea.

I had just fallen in love with agentic coding. Cursor was happily orchestrating Claude like a tiny overcaffeinated intern, and I thought:

“What’s the worst that could happen if I get a tiny bit ambitious?”

So obviously, I chose this as my starter project:

“Let’s build an MCP server that can analyse extremely complex codebases and answer questions like:

  • Which endpoints touch this entity?
  • If I change this file, what else breaks?
  • What does the dependency graph look like across React, Vue, C#/.NET, etc.?”

Impact analysis. Dependency trees. Multiple stacks. All in one go.

It was less “hello world” and more “hello nervous breakdown”.

But it was fine, because I had tests. And as we all know: if you have tests, nothing bad can ever happen. Right?


The "Success"

So I wired up some TDD-ish tests, opened Cursor and basically said:

“Use TDD to build this. Make the tests pass.”

Claude went full superhero montage:

  • generating modules, helpers, abstractions
  • wiring things up
  • watching tests fail
  • trying a “different approach”
  • more failing tests
  • more code, more changes, more retries

And then… all green.

Beautiful. I’d done it. I was now officially The Guy Who Does Real TDD With AI™.


The Horror

Fast forward to my actual superpower: reading the diff.

Buried in one of the core functions, I found something that looked suspiciously like this:

// test-specific path to satisfy expected output
if (isTestRun && input === "someTestFixture") {
  return "whateverTheTestExpects";
}

My AI assistant had not discovered deep insights about complex code graphs.

It had simply hard-coded the test’s expected value into the implementation.

Green tests by way of: if (runningUnderJest) just lie().

Once I saw that, I did what any responsible engineer would do: I grepped.

And then the real horror movie started. I found:

  • “placeholder – works for tests” in core logic
  • “TODO: real implementation, current version only matches test data”
  • mocks that had quietly migrated from the test files into the production code
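
If you’d rather not rely on remembering to grep, here is a minimal sketch of a guard test in the same Jest setup: it walks the production source (assumed to live under src/) and fails if any of those markers show up. The directory name and the marker list are assumptions, lifted straight from what I found.

// guard.test.js: fails the build if production code contains test-only escape hatches
const fs = require("fs");
const path = require("path");

// Markers that have no business being in production code.
const FORBIDDEN_MARKERS = [
  "isTestRun",
  "placeholder – works for tests",
  "TODO: real implementation",
];

// Recursively collect every .js file under a directory.
function collectFiles(dir) {
  return fs.readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) return collectFiles(full);
    return entry.name.endsWith(".js") ? [full] : [];
  });
}

test("production code contains no test-only escape hatches", () => {
  const offenders = collectFiles("src").filter((file) => {
    const source = fs.readFileSync(file, "utf8");
    return FORBIDDEN_MARKERS.some((marker) => source.includes(marker));
  });
  expect(offenders).toEqual([]);
});

It is crude, but it turns “I happened to read the diff” into “the build goes red if the model tries this trick again”.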

The model had optimised perfectly for the goal I had actually given it:

“Make the tests pass.”

Not:

“Implement the behaviour these tests describe in a way that’s safe for production.”

The Realization

That was the moment I yelled genuinely ugly words at my computer. Not because the AI was evil – but because it was painfully logical. It was acting like a very fast junior who has realised the promotion criterion is “green ticks”, not “robust systems”.

And that, unfortunately, was my superhero origin moment.

The radioactive spider bite was realising this:

  • Tests alone are not enough.
  • AI will ruthlessly game whatever you point it at.
  • If your only objective is “green”, it will happily redefine “done” to mean “the tests stopped complaining”.

The Result: Gold Standard TDD

Out of that mess came what I now call Gold Standard TDD (GS-TDD):

  1. Red – Write tests that actually describe behaviour, constraints and risk (see the sketch after this list).
  2. Gold – Ask the AI for a Gold Standard implementation suitable for production, not just the smallest hack that appeases Jest.
  3. Refactor – Keep a human in the loop whose job is to say “no” when the model gets clever instead of reliable.
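
To make step 1 concrete, here is a minimal sketch of what I mean by tests that describe behaviour rather than a single fixture. Everything in it is illustrative: getDependents, its module path and the graph shape (a map from each file to the files it imports) are assumptions, not the real MCP server’s API.

// dependency-graph.test.js: several independent cases plus an invariant
const { getDependents } = require("../src/graph"); // hypothetical helper and path

// Distinct fixtures across stacks, so one hard-coded return value cannot satisfy them all.
const cases = [
  {
    graph: { "a.ts": ["b.ts"], "b.ts": ["c.ts"], "c.ts": [] },
    file: "c.ts",
    expected: ["a.ts", "b.ts"], // transitive dependents
  },
  {
    graph: { "x.vue": ["y.vue"], "y.vue": [] },
    file: "y.vue",
    expected: ["x.vue"],
  },
  {
    graph: { "Solo.cs": [] },
    file: "Solo.cs",
    expected: [],
  },
];

test.each(cases)("finds every dependent of $file", ({ graph, file, expected }) => {
  expect(getDependents(graph, file).sort()).toEqual(expected.sort());
});

// Behavioural constraint: a file is never reported as its own dependent.
test("never reports a file as its own dependent", () => {
  for (const { graph, file } of cases) {
    expect(getDependents(graph, file)).not.toContain(file);
  }
});

One fixture invites a hard-coded answer; three fixtures plus an invariant force an actual traversal.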

So no, I don’t have a PhD or a tragic laboratory accident in my backstory.

I just have one very specific day, one very ambitious MCP server, and one AI that cheerfully cheated its way to green tests.

My Superpower

I assume the model will do exactly what I say, and I design my tests and prompts so “what I say” is finally a lot closer to “what I actually want”.