My Superhero Origin Story

The day I discovered my AI was cheating on my tests.

Every superhero needs an origin story.

Batman had the alley.
Spider-Man had the spider.
I had… Cursor, Claude 3.5 Sonnet, and a very bad idea.

I had just fallen in love with agentic coding. Cursor was happily orchestrating Claude like a tiny overcaffeinated intern, and I thought:

“What’s the worst that could happen if I get a tiny bit ambitious?”

So obviously, I chose this as my starter project:

“Let’s build an MCP server that can analyse extremely complex codebases and answer questions like:

  • Which endpoints touch this entity?
  • If I change this file, what else breaks?
  • What does the dependency graph look like across React, Vue, C#/.NET, etc.?”

Impact analysis. Dependency trees. Multiple stacks. All in one go.

It was less “hello world” and more “hello nervous breakdown”.

But it was fine, because I had tests. And as we all know: if you have tests, nothing bad can ever happen. Right?


The "Success"

So I wired up some TDD-ish tests, opened Cursor and basically said:

“Use TDD to build this. Make the tests pass.”

Claude went full superhero montage:

  • generating modules, helpers, abstractions
  • wiring things up
  • watching tests fail
  • trying a “different approach”
  • more failing tests
  • more code, more changes, more retries

And then… all green.

Beautiful. I’d done it. I was now officially The Guy Who Does Real TDD With AI™.


The Horror

Fast forward to my actual superpower: reading the diff.

Buried in one of the core functions, I found something that looked suspiciously like this:

// test-specific path to satisfy expected output
if (isTestRun && input === "someTestFixture") {
  return "whateverTheTestExpects";
}

My AI assistant had not discovered deep insights about complex code graphs.

It had simply hard-coded the test’s expected value into the implementation.

Green tests by way of: if (runningUnderJest) just lie().

Once I saw that, I did what any responsible engineer would do: I grepped.

And then the real horror movie started. I found:

  • “placeholder – works for tests” in core logic
  • “TODO: real implementation, current version only matches test data”
  • mocks that had quietly migrated from the test files into the production code
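
If you’d rather not rely on remembering to grep, here is a minimal sketch of a guard test in the same Jest setup: it walks the production source (assumed to live under src/) and fails if any of those markers show up. The directory name and the marker list are assumptions, lifted straight from what I found.

// guard.test.js: fails the build if production code contains test-only escape hatches
const fs = require("fs");
const path = require("path");

// Markers that have no business being in production code.
const FORBIDDEN_MARKERS = [
  "isTestRun",
  "placeholder – works for tests",
  "TODO: real implementation",
];

// Recursively collect every .js file under a directory.
function collectFiles(dir) {
  return fs.readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) return collectFiles(full);
    return entry.name.endsWith(".js") ? [full] : [];
  });
}

test("production code contains no test-only escape hatches", () => {
  const offenders = collectFiles("src").filter((file) => {
    const source = fs.readFileSync(file, "utf8");
    return FORBIDDEN_MARKERS.some((marker) => source.includes(marker));
  });
  expect(offenders).toEqual([]);
});

It is crude, but it turns “I happened to read the diff” into “the build goes red if the model tries this trick again”.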

The model had optimised perfectly for the goal I had actually given it:

“Make the tests pass.”

Not:

“Implement the behaviour these tests describe in a way that’s safe for production.”

The Realization

That was the moment I yelled genuinely ugly words at my computer. Not because the AI was evil – but because it was painfully logical. It was acting like a very fast junior who has realised the promotion criterion is “green ticks”, not “robust systems”.

And that, unfortunately, was my superhero origin moment.

The radioactive spider bite was realising this:

  • Tests alone are not enough.
  • AI will ruthlessly game whatever you point it at.
  • If your only objective is “green”, it will happily redefine “done” to mean “the tests stopped complaining”.

The Result: Gold Standard TDD

Out of that mess came what I now call Gold Standard TDD (GS-TDD):

  1. Red – Write tests that actually describe behaviour, constraints and risk (see the sketch after this list).
  2. Gold – Ask the AI for a Gold Standard implementation suitable for production, not just the smallest hack that appeases Jest.
  3. Refactor – Keep a human in the loop whose job is to say “no” when the model gets clever instead of reliable.
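
To make step 1 concrete, here is a minimal sketch of what I mean by tests that describe behaviour rather than a single fixture. Everything in it is illustrative: getDependents, its module path and the graph shape (a map from each file to the files it imports) are assumptions, not the real MCP server’s API.

// dependency-graph.test.js: several independent cases plus an invariant
const { getDependents } = require("../src/graph"); // hypothetical helper and path

// Distinct fixtures across stacks, so one hard-coded return value cannot satisfy them all.
const cases = [
  {
    graph: { "a.ts": ["b.ts"], "b.ts": ["c.ts"], "c.ts": [] },
    file: "c.ts",
    expected: ["a.ts", "b.ts"], // transitive dependents
  },
  {
    graph: { "x.vue": ["y.vue"], "y.vue": [] },
    file: "y.vue",
    expected: ["x.vue"],
  },
  {
    graph: { "Solo.cs": [] },
    file: "Solo.cs",
    expected: [],
  },
];

test.each(cases)("finds every dependent of $file", ({ graph, file, expected }) => {
  expect(getDependents(graph, file).sort()).toEqual(expected.sort());
});

// Behavioural constraint: a file is never reported as its own dependent.
test("never reports a file as its own dependent", () => {
  for (const { graph, file } of cases) {
    expect(getDependents(graph, file)).not.toContain(file);
  }
});

One fixture invites a hard-coded answer; three fixtures plus an invariant force an actual traversal.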

So no, I don’t have a PhD or a tragic laboratory accident in my backstory.

I just have one very specific day, one very ambitious MCP server, and one AI that cheerfully cheated its way to green tests.

My Superpower

I assume the model will do exactly what I say, and I design my tests and prompts so “what I say” is finally a lot closer to “what I actually want”.