The Testing Pyramid in the Age of AI
Or: When Your Base Layer Is Just AI-Generated Foam.
The testing pyramid was designed for a world where humans wrote code and tests:
- tests were expensive,
- running them was slow,
- and any line of test code had to justify its existence to a tired developer at 23:47.
In that world, the pyramid made perfect sense:
- lots of cheap unit tests at the bottom,
- fewer integration tests in the middle,
- a small set of slow, expensive end-to-end tests at the top.
Fast-forward to today:
- Your AI assistant can write 200 “unit tests” faster than you can scroll past them.
- It can also happily cheat its way to green by hard-coding test expectations into the implementation.
- And you’re somehow supposed to believe the pyramid still works exactly as before.
Spoiler: it doesn’t.
The shape is fine; the semantics are broken.
This article is about what the pyramid actually means in an AI world – and how to use it without ending up with a very tall, very flaky Jenga tower made of auto-generated tests.
1. The Pyramid Was Never About Counting Tests
Let’s start with the obvious:
The testing pyramid is not a KPI dashboard.
It’s a picture of how risk, speed, and feedback loops should be distributed.
In the “classic” reading:
- Unit tests
- Cheap to write
- Very fast to run
- Narrow in scope
- Fail close to where the bug lives
- Integration tests
- Test how pieces work together
- Slower, more setup
- Fewer of them
- End-to-end (E2E) tests
- Simulate real user flows
- Slow, brittle, infrastructure-heavy
- Very few, but very valuable
So the rules of thumb were:
- Keep the base wide: lots of cheap fast tests.
- Keep the top narrow: few slow expensive tests.
- Don’t invert the pyramid, unless you enjoy pain.
This was all very reasonable when humans were doing all the work. The pyramid was an economic model at heart:
Spend more effort where feedback is cheap and useful.
Spend less effort where feedback is expensive and fragile.
Now enter AI.
2. What AI Actually Changes (Hint: It’s Not “We Don’t Need Tests”)
AI doesn’t change the need for tests. It changes the economics and the failure modes.
2.1. Test code is no longer “expensive” to write
You can now say:
“Write unit tests for this module.”
And a model will cheerfully generate dozens of them:
- some useful,
- some redundant,
- some weirdly obsessed with private implementation details,
- some subtly wrong but still green.
The marginal cost of generating tests approaches zero.
The cost of understanding, curating, and maintaining them? That’s still very real – and now nicely hidden inside a wall of green.
2.2. The model optimises ruthlessly for “green”
As I learned the hard way in my “superhero origin story”, if you tell the model:
“Make the tests pass.”
…it will happily special-case test fixtures, move mocks into production code, or encode “whatever the test expects” directly as logic.
Not because it’s evil. Because it’s a pattern-matching machine optimising for the objective you gave it.
2.3. The pyramid gets filled with foam
If you don’t pay attention, you end up with a very wide base of AI-generated unit tests that:
- test trivial getters/setters,
- hard-code current behaviour instead of desired behaviour,
- or simply assert what the implementation already does.
On paper, the pyramid still looks beautiful: so many tests, such coverage, much green.
In reality, the base is hollow. You have a Test Foam Pyramid.
3. The “Hollow Pyramid” Problem
So what does a Hollow Pyramid look like?
- Thousands of unit tests, but most of them depend on mocks that don’t reflect reality, lock in accidental implementation details, or are so shallow they never fail for the right reasons.
- A thin middle layer of integration tests that nobody really trusts.
- A couple of shaky E2E tests that sometimes fail because someone looked at them funny.
And then someone proudly says:
“We have 95% coverage, so we’re good.”
Sure. Until the first real user flow hits a path that skips all your mocks, bypasses your golden path unit tests, and ends up in a part of the system nobody ever tested end-to-end.
Coverage is a map of where your test code goes, not a map of what your users do.
In the age of AI, the pyramid doesn’t automatically become inverted. It becomes misleading:
Big base, tiny signal.
We need a new way to interpret the layers.
4. From “Number of Tests” to “Signal per Layer”
Here’s the mental flip:
The testing pyramid in the age of AI is not about how many tests you have at each level.
It’s about how much reliable signal each level gives you.
4.1. Base layer: fast, behavior-focused tests (Gold tests)
At the base, we don’t just want to dump “unit tests” in. We want fast, deterministic, behaviour-focused tests written in domain language.
In GS-TDD, these Gold tests live at the base of the pyramid and double as both spec and prompt for your AI assistant.
These tests might technically be “units”, “components”, “module tests”, or “headless integration tests”. I don’t care what you call them.
What matters is:
- they run quickly,
- they don’t depend on fragile external infra,
- and a failure tells you something meaningful about behaviour.
4.2. Middle layer: integration tests that actually matter
In the middle we still want integration tests, but fewer, more realistic, and ruthlessly focused on contracts between subsystems.
- “The payment service records a successful charge and updates the order.”
- “The notification service sends an email and writes an audit log entry.”
Here, AI can help with setup and scaffolding, but you decide which paths are worth exercising.
4.3. Top layer: E2E + observability
The top hasn’t changed much: A small set of end-to-end tests that mimic real user journeys, plus observability.
If your pyramid has 500 E2E tests, it’s not a pyramid. It’s a slow-motion incident waiting to happen.
5. Where Does AI Actually Fit in This Pyramid?
Short answer: in the grunt work, not in the strategy.
5.1. AI is great at scaffolding, terrible at deciding what matters
Things AI is excellent at:
- Generating initial test files with reasonable structure.
- Expanding examples once you’ve shown a pattern.
- Refactoring tests to reduce duplication.
- Translating human-readable scenarios into concrete test code.
Things AI is terrible at:
- Knowing which behaviours your business actually relies on.
- Assessing risk if a particular path breaks.
- Deciding which tests you’ll still care about in 6 months.
AI can help you build the pyramid.
It should not be allowed to design it.
5.2. The “cheating to green” loophole
The AI Pattern Matching Trap
AI will happily overfit to fixtures, encode expected outputs directly, or “temporarily” bypass logic during tests.
You’ve seen this if you’ve ever found if (process.env.NODE_ENV === "test") in production code.
AI just does this faster.
6. The GS-TDD “Signal Pyramid”
Let’s put it together in a slightly more opinionated shape.
6.1. Layer 1 – Gold tests (broad base)
Characteristics: Fast, Behaviour-focused, BDD-flavoured, Minimal mocking.
it("allows a known user to reset their password via email exactly once", async () => { /** * Given a registered user * And they request a password reset * When they use the link in the email * Then they can set a new password * And the token cannot be reused * And no information is leaked if a token is invalid or expired */ // ... });
These tests define the contract, drive the Gold implementation, and act as living documentation.
6.2. Layer 2 – Critical integrations
A smaller set of tests that verify service-to-service contracts and data shape expectations. You probably don’t need hundreds of these. You just need the right ones.
6.3. Layer 3 – E2E & production checks
At the top: a tiny set of key user journeys as E2E tests, plus runtime checks (smoke tests, synthetic transactions).
This is where you verify that:
“The system we think we’ve tested is roughly the same as the system users are actually touching.”
7. Practical Guidelines for an AI-Age Pyramid
Let’s translate theory into something you can actually do on Monday.
7.1. Stop optimising for number of tests
If your dashboard says: “We went from 200 to 2000 tests this week thanks to AI” …that is not automatically good news.
Ask instead:
- How many of these are Gold tests we actually care about?
- How many tests are duplicating behaviour or implementation details?
- Which tests are we willing to let an AI optimise against?
If you’re not sure, the answer is “too many are foam”.
7.2. Design tests; generate code
Flip the usual mental model: Humans design behaviours. AI helps encode them.
- Write the story in plain language first (Given/When/Then).
- Turn that story into a test skeleton.
- Let the AI flesh out the boring parts (setup, factories).
You keep the intent; the model fills in the syntax.
7.3. Treat mocks as radioactive
Mocks are dangerous because AI will happily overfit to whatever fake world your mocks describe. Prefer real collaborators with in-memory or lightweight fakes over heavy mocking.
If your base layer is “hundreds of mocked unit tests the AI wrote”, your pyramid has no foundation. It has special effects.
8. So… Is the Pyramid “Upside Down” Now?
No. The original intuition is more relevant. The bottom still needs to be wide – but with real signal, not AI spam.
What has changed:
- Counting tests is now meaningless.
- Coverage can be trivially inflated.
- The bottleneck is now how many tests we can trust.
The Verdict
The testing pyramid in the age of AI is not dead. It’s just allergic to bullshit numbers.
If your base is full of AI-generated foam, your pyramid will look beautiful in dashboards and fall over in production.
If instead you design a strong base of Gold tests and keep a human in the loop, you get the best of both worlds: the speed of AI-assisted development, and the boring reliability of a test strategy that actually means something.
9. One Question to Keep Asking
Any time you’re tempted to auto-generate another pile of tests, ask:
“If the AI optimises purely to make these tests green, will the system my users touch actually become safer – or just more impressive in CI?”
If the honest answer is “more impressive in CI”, those tests belong in the foam bin, not in your pyramid.
The future of testing isn’t “no tests because AI”. It’s fewer, better tests that you’re willing to treat as Gold.