Mutation Testing: Watch Out for the Wolverine
Your tests pass. Your coverage is 100%. Ship it, right? Wrong. Those tests might be testing nothing.
The Problem with Test Coverage
Code coverage tells you which lines your tests execute. It doesn't tell you if those tests would catch bugs.
Example of a worthless test with 100% coverage:
// Production code
function add(a: number, b: number) {
return a + b;
}
// "Test" with 100% coverage
test("add function runs", () => {
add(2, 2);
// No assertion! Test always passes.
});
Coverage: ✅ 100%
Actual value: ❌ Zero
Bug detection: ❌ None
Enter Mutation Testing
Mutation testing introduces small bugs (mutations) into your code, then runs your tests. If the tests still pass with the bug present, you have a problem.
Killed
Test caught the bug. Good test!
Survived
Bug went unnoticed. Weak test.
No Coverage
Code never tested at all.
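To make those outcomes concrete, here's a minimal sketch in plain TypeScript (no test framework): a mutant flips `+` to `-`, the assertion-free "test" from above lets it survive, and a test with a real assertion kills it. The `addMutant` function is a hand-written stand-in for what a tool like Stryker generates automatically.

```typescript
// Original function and a hand-written mutant (the + flipped to -).
function add(a: number, b: number): number {
  return a + b;
}
function addMutant(a: number, b: number): number {
  return a - b; // the injected bug
}

// The assertion-free "test" merely checks that the call runs.
// It passes for both versions, so the mutant SURVIVES.
console.log(typeof add(2, 2) === "number");       // true
console.log(typeof addMutant(2, 2) === "number"); // true

// A test with a real assertion passes only for the original,
// so the mutant is KILLED.
console.log(add(2, 2) === 4);       // true
console.log(addMutant(2, 2) === 4); // false
```

Coverage is identical for both tests; only the second one can tell the original from the mutant.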
Real Example from This Codebase
We ran Stryker mutation testing on our rate limiter. Here's what we found:
Initial Mutation Test Results:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total mutants generated: 50
✅ Killed: 23 - Tests caught these bugs
⚠️ Survived: 4 - Tests missed these!
🚫 Error: 16 - Would crash immediately
❌ No Coverage: 7 - Untested code
Mutation Score: 67.65%
Covered Code Score: 85.19%
The 4 survived mutants are the interesting ones. They represent bugs our tests didn't catch. More importantly: these are realistic bugs a developer could easily introduce.
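For reference, here's how Stryker derives those two percentages from the raw counts. This is a simplified sketch: Stryker's real formula also folds timeout mutants into the "detected" bucket (this run reported none), and error mutants are treated as invalid and excluded from both scores.

```typescript
// Simplified version of Stryker's score math for the initial run.
const killed = 23;
const survived = 4;
const noCoverage = 7;
// The 16 error mutants are invalid and excluded from both scores.

// Mutation score: detected mutants over all valid mutants.
const mutationScore = (killed / (killed + survived + noCoverage)) * 100;
// Covered code score: same, but ignoring mutants no test even reaches.
const coveredCodeScore = (killed / (killed + survived)) * 100;

console.log(mutationScore.toFixed(2));    // "67.65"
console.log(coveredCodeScore.toFixed(2)); // "85.19"
```

The gap between the two numbers is exactly the 7 no-coverage mutants: the covered code score tells you how good your existing tests are, while the mutation score also penalizes code you never test at all.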
Survived Mutant #1: The Arithmetic Trap
// Original code
const windowMs = config.windowSeconds * 1000;
// Mutant (SURVIVED!)
const windowMs = config.windowSeconds / 1000;
Why is this dangerous?
Imagine you're refactoring time calculations. You see * 1000 everywhere and think "maybe I should convert this to a helper function." During the refactor, you accidentally flip the operator. Or you copy-paste from another file where division was correct.
The result: Your rate limiter now has a 60-millisecond window instead of 60 seconds. The window resets a thousand times a minute, so sustained abuse sails straight through. Your rate limiter is effectively disabled, but all your tests pass because they only checked if blocking happened, not when.
Why did our tests miss it?
Our tests verified the rate limiter blocked requests after the limit. They didn't verify the timing. A 60-second window and a 0.06-second window both block the 11th request, just at wildly different times. The tests said "✅ works" when it absolutely didn't.
The fix: Add a test that validates the resetAt timestamp.
it("should set resetAt to current time + window duration", () => {
const now = Date.now();
const config = { maxRequests: 10, windowSeconds: 60 };
const result = checkRateLimit("test-client", config);
// resetAt should be ~60 seconds in the future
const expectedResetAt = now + (60 * 1000);
expect(result.resetAt).toBeGreaterThanOrEqual(expectedResetAt - 100);
expect(result.resetAt).toBeLessThanOrEqual(expectedResetAt + 100);
});
Survived Mutant #2: The Boundary Bug
// Original code
if (!record || record.resetAt < now) {
// Reset window
}
// Mutant (SURVIVED!)
if (!record || record.resetAt <= now) {
Why is this dangerous?
You're reviewing a PR. Someone changed < to <= because "it feels more correct to include the exact reset time." Sounds reasonable, right? You approve. Tests pass. Ship it.
The result: Now users can make one extra request at the exact millisecond the window resets. Doesn't sound bad? It isn't, until you realize attackers can time their requests to that millisecond and effectively bypass your rate limit by 10%. Or your monitoring shows weird spikes you can't explain.
Why did our tests miss it?
Classic boundary condition. What happens when resetAt === now? Should we reset the window or not? Our tests never checked this exact millisecond. They tested "before window expires" and "after window expires" but not the edge.
The fix: Test the boundary explicitly.
it("should NOT reset window at exact resetAt time", () => {
const config = { maxRequests: 2, windowSeconds: 60 };
const clientId = "boundary-test";
// First request sets the window
const result1 = checkRateLimit(clientId, config);
const resetTime = result1.resetAt;
// Mock time to exact resetAt
vi.useFakeTimers();
vi.setSystemTime(resetTime);
// At resetAt, window should still be active (< not <=)
const result2 = checkRateLimit(clientId, config);
expect(result2.remaining).toBe(0); // Second request counted in the same window
vi.useRealTimers();
});
Running the Tests Again
After adding these tests, we run Stryker again:
$ npm run test:mutation
Final Mutation Test Results:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total mutants: 50
✅ Killed: 22 - Down from 23 (more precise tests)
⚠️ Survived: 2 - Down from 4 (FIXED! 🎯)
🚫 Error: 19 - Up from 16 (better detection)
❌ No Coverage: 7 - Same as initial (setInterval cleanup)
Mutation Score: 70.97% (was 67.65%)
Covered Code Score: 91.67% (was 85.19%)
Critical Achievement:
✅ Arithmetic mutant (windowSeconds * 1000 → windowSeconds / 1000): KILLED
✅ Boundary mutant (resetAt < now → resetAt <= now): KILLED
⚠️ setInterval timing mutants: Survived (acceptable)
One of those killed mutants would have broken the rate limiter entirely (* 1000 → / 1000 means a 60ms window instead of 60s). The other would have created a subtle security gap (the < → <= boundary condition). Both passed all tests before mutation testing caught them. Our covered code score jumped to 91.67%. The remaining 2 survivors are in setInterval cleanup, a conscious decision documented below.
What About Those 19 Errors?
Our mutation report shows 19 "RuntimeError" mutants. Are we in trouble? Not really—these are less scary than survived mutants.
Error mutants by type:
- ObjectLiteral mutations: { count: 1, resetAt } → {}
- ConditionalExpression mutations: if (condition) → if (true)
- StringLiteral mutations: "x-forwarded-for" → ""
Why they error: These mutants create invalid runtime states (missing properties, broken logic) that crash immediately.
Why "less scary"? TypeScript catches some at compile time, and the rest throw runtime errors right away. They can't silently corrupt data like survived mutants can.
Bad UX? Yes. But they're caught by monitoring and error tracking. They still need proper logging and alerting, but they're noisy failures, not quiet ones. Survived mutants are worse—they slip through tests and runtime, corrupting data without anyone noticing.
The Full Picture
Final scores: 70.97% mutation score, 91.67% covered code score. But what does that actually mean for our system?
- Critical business logic: 100% killed. Rate limiting works correctly.
- Security logic: 100% killed. IP extraction and window calculations are solid.
- Boundary conditions: 100% killed. No off-by-one timing bugs.
- Error mutants (19): Crash immediately. Caught by TypeScript/runtime, can't cause silent data corruption.
- Performance cleanup (2 survived + 7 no coverage): Informed decision not to test. Non-critical, complex to mock.
Translation: The code that matters—security, correctness, user-facing behavior—is bulletproof. The code we didn't test (setInterval cleanup) is documented and justified.
Why "Watch Out for the Wolverine"?
In the X-Men comics, Wolverine has a healing factor: cut him, he heals. Survived mutants are the same. Your tests take a swing, and the mutant walks away unharmed, exposing a weakness your suite can't kill.
But unlike Wolverine, survived mutants are good news. Each one is a learning opportunity, a gap in your test suite you didn't know existed.
Should We Test the 7 No-Coverage Mutants?
All 7 no-coverage mutants are in the same place: the setInterval cleanup code that removes expired rate limit records every 5 minutes.
setInterval(() => {
const now = Date.now();
for (const [key, record] of rateLimitStore.entries()) {
if (record.resetAt < now) { // ← 7 mutants here
rateLimitStore.delete(key);
}
}
}, 5 * 60 * 1000);
Should we test this? Let's think it through:
Arguments FOR Testing
- It manipulates shared state
- Bugs could delete wrong records
- Memory leaks are bad
Arguments AGAINST Testing
- Not security-critical
- Rate limiting works without it
- Worst case: extra memory usage
- Would require complex timer mocking
- 5 min interval = slow tests
This is an informed decision, not laziness. Document it and move on.
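That said, there's a middle path worth noting: extract the loop body into a pure function that takes `now` as a parameter, and the timer-mocking objection largely disappears. A sketch, with the record shape and function name assumed for illustration rather than taken from the module:

```typescript
// Hypothetical refactor: the cleanup loop body as a pure function.
// Testing it needs no fake timers, only a Map and a timestamp.
interface RateLimitRecord {
  count: number;
  resetAt: number;
}

function cleanupExpired(store: Map<string, RateLimitRecord>, now: number): void {
  for (const [key, record] of store.entries()) {
    if (record.resetAt < now) {
      store.delete(key);
    }
  }
}

// Direct test: one expired record, one still active.
const store = new Map<string, RateLimitRecord>([
  ["expired", { count: 3, resetAt: 1_000 }],
  ["active", { count: 1, resetAt: 99_000 }],
]);
cleanupExpired(store, 50_000);
console.log(store.has("expired")); // false
console.log(store.has("active"));  // true
```

The `setInterval` wrapper then contains nothing but a call to `cleanupExpired(rateLimitStore, Date.now())`, and the 7 no-coverage mutants shrink to the one-line scheduling glue. Whether that refactor is worth it is the same cost/benefit call as above.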
When to Use Mutation Testing
Critical business logic: Rate limiters, payment processing, authentication
Complex conditionals: Lots of if/else, boundary conditions
Security-sensitive code: Where bugs = vulnerabilities
When to Skip It
Simple presentational code: JSX rendering, CSS classes
Prototypes: Code that'll be rewritten anyway
Generated code: ORM models, API clients
Setting Up Stryker (5 Minutes)
# Install
npm install --save-dev @stryker-mutator/core @stryker-mutator/vitest-runner
# Configure stryker.config.json
{
"testRunner": "vitest",
"mutate": ["lib/rate-limiter.ts"],
"thresholds": { "high": 80, "low": 60, "break": 50 }
}
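Note that Stryker doesn't create the `npm run test:mutation` script for you; that name is this project's convention. A minimal package.json fragment, assuming the config file above sits in the project root:

```json
{
  "scripts": {
    "test:mutation": "stryker run"
  }
}
```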
# Run
npm run test:mutation
The Bottom Line
Mutation testing isn't about hitting a score. It's about understanding the score.
- Killed mutants: Your tests work. Keep them.
- Survived mutants: Silent bugs your tests miss. Fix these.
- Error mutants: Would crash immediately. TypeScript/runtime catches them.
- No coverage mutants: Untested code. Decide if it matters.
A 70% overall score with 91% covered code score? Not bad. But the real win: finding 2 bugs that look exactly like mistakes a developer would make. Flip an operator during refactoring. Change < to <= in a PR review. These aren't exotic edge cases—they're Tuesday afternoon bugs.
Each survived mutant is a mini code review from a very pedantic robot that knows every way your code can break. Listen to it.
Real results from our rate limiter code:
Critical bugs found: 2 — arithmetic operator flip (* → /) + boundary condition (< → <=).
Both are realistic mistakes developers make during refactoring or code review.
Both passed all tests. Both would have shipped to production without mutation testing.
Time spent: 45 minutes. Value: Caught bugs that monitoring wouldn't find until users complained.