Mutation Testing: Watch Out for the Wolverine

Your tests pass. Your coverage is 100%. Ship it, right? Wrong. Those tests might be testing nothing.

Real talk: These bugs were caught in our own rate limiter code. Both survived our initial tests. Both could have shipped to production.

The Problem with Test Coverage

Code coverage tells you which lines your tests execute. It doesn't tell you if those tests would catch bugs.

Example of a worthless test with 100% coverage:

// Production code
function add(a: number, b: number) {
  return a + b;
}

// "Test" with 100% coverage
test("add function runs", () => {
  add(2, 2);
  // No assertion! Test always passes.
});

Coverage: ✅ 100%
Actual value: ❌ Zero
Bug detection: ❌ None

Enter Mutation Testing

Mutation testing introduces small bugs (mutations) into your code, then runs your tests. If the tests still pass with the bug present, you have a problem.

Killed

Test caught the bug. Good test!

Survived

Bug went unnoticed. Weak test.

No Coverage

Code never tested at all.

Real Example from This Codebase

We ran Stryker mutation testing on our rate limiter. Here's what we found:

Initial Mutation Test Results:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total mutants generated: 50

✅ Killed:        23 - Tests caught these bugs
⚠️  Survived:      4 - Tests missed these!
🚫 Error:        16 - Would crash immediately  
❌ No Coverage:   7 - Untested code

Mutation Score: 67.65%
Covered Code Score: 85.19%

The 4 survived mutants are the interesting ones. They represent bugs our tests didn't catch. More importantly: these are realistic bugs a developer could easily introduce.

Survived Mutant #1: The Arithmetic Trap

lib/rate-limiter.ts:46
// Original code
const windowMs = config.windowSeconds * 1000;

// Mutant (SURVIVED!)
const windowMs = config.windowSeconds / 1000;

Why is this dangerous?

Imagine you're refactoring time calculations. You see * 1000 everywhere and think "maybe I should convert this to a helper function." During the refactor, you accidentally flip the operator. Or you copy-paste from another file where division was correct.

The result: Your rate limiter now has a 60-millisecond window instead of 60 seconds. Every user gets blocked after 10 requests in 0.06 seconds. Your API is effectively down, but all your tests pass because they only checked if blocking happened, not when.

Why did our tests miss it?

Our tests verified the rate limiter blocked requests after the limit. They didn't verify thetiming. A 60-second window and a 0.06-second window both block the 11th request—just at wildly different times. The tests said "✅ works" when it absolutely didn't.

The fix: Add a test that validates the resetAt timestamp.

it("should set resetAt to current time + window duration", () => {
  const now = Date.now();
  const config = { maxRequests: 10, windowSeconds: 60 };
  
  const result = checkRateLimit("test-client", config);
  
  // resetAt should be ~60 seconds in the future
  const expectedResetAt = now + (60 * 1000);
  expect(result.resetAt).toBeGreaterThanOrEqual(expectedResetAt - 100);
  expect(result.resetAt).toBeLessThanOrEqual(expectedResetAt + 100);
});

Survived Mutant #2: The Boundary Bug

lib/rate-limiter.ts:52
// Original code
if (!record || record.resetAt < now) {
  // Reset window
}

// Mutant (SURVIVED!)
if (!record || record.resetAt <= now) {

Why is this dangerous?

You're reviewing a PR. Someone changed < to <= because "it feels more correct to include the exact reset time." Sounds reasonable, right? You approve. Tests pass. Ship it.

The result: Now users can make one extra request at the exact millisecondthe window resets. Doesn't sound bad? It isn't—until you realize attackers can time their requests to that millisecond and effectively bypass your rate limit by 10%. Or your monitoring shows weird spikes you can't explain.

Why did our tests miss it?

Classic boundary condition. What happens when resetAt === now? Should we reset the window or not? Our tests never checked this exact millisecond. They tested "before window expires" and "after window expires" but not the edge.

The fix: Test the boundary explicitly.

it("should NOT reset window at exact resetAt time", () => {
  const config = { maxRequests: 2, windowSeconds: 60 };
  const clientId = "boundary-test";
  
  // First request sets the window
  const result1 = checkRateLimit(clientId, config);
  const resetTime = result1.resetAt;
  
  // Mock time to exact resetAt
  vi.useFakeTimers();
  vi.setSystemTime(resetTime);
  
  // At resetAt, window should still be active (< not <=)
  const result2 = checkRateLimit(clientId, config);
  expect(result2.remaining).toBe(1); // Still in same window
  
  vi.useRealTimers();
});

Running the Tests Again

After adding these tests, we run Stryker again:

$ npm run test:mutation

Final Mutation Test Results:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total mutants: 50

✅ Killed:       22 - Down from 23 (more precise tests)
⚠️  Survived:     2 - Down from 4 (FIXED! 🎯)
🚫 Error:       19 - Up from 16 (better detection)
❌ No Coverage:  7 - Same as initial (setInterval cleanup)

Mutation Score: 70.97% (was 67.65%)
Covered Code Score: 91.67% (was 85.19%)

Critical Achievement:
✅ Arithmetic mutant (windowSeconds * 1000 → windowSeconds / 1000): KILLED
✅ Boundary mutant (resetAt < now → resetAt <= now): KILLED
⚠️  setInterval timing mutants: Survived (acceptable)
Mission accomplished: We killed the 2 critical mutants—bugs that a developer could easily introduce during refactoring or code review. One would have made our API unusable (* 1000 → / 1000 = 60ms window instead of 60s). The other would have created a subtle security gap (< → <= boundary condition). Both passed all tests before mutation testing caught them. Our covered code score jumped to 91.67%. The remaining 2 survivors are in setInterval cleanup—a conscious decision documented below.

What About Those 19 Errors?

Our mutation report shows 19 "RuntimeError" mutants. Are we in trouble? Not really—these are less scary than survived mutants.

Error mutants by type:
- ObjectLiteral mutations: { count: 1, resetAt } → {}
- ConditionalExpression mutations: if (condition) → if (true)
- StringLiteral mutations: "x-forwarded-for" → ""

Why they error: These mutants create invalid runtime states
(missing properties, broken logic) that crash immediately.

Why "less scary"? These mutants crash immediately. TypeScript catches some at compile time, others throw runtime errors. They can't silently corrupt data like survived mutants can.

Bad UX? Yes. But they're caught by monitoring and error tracking. They still need proper logging and alerting, but they're noisy failures, not quiet ones. Survived mutants are worse—they slip through tests and runtime, corrupting data without anyone noticing.

Key insight: Error mutants are different from survived mutants. Survived = bug goes unnoticed. Error = app crashes immediately, which is bad UX but not a silent security vulnerability.

The Full Picture

Final scores: 70.97% mutation score, 91.67% covered code score. But what does that actually mean for our system?

What actually matters:
  • Critical business logic: 100% killed. Rate limiting works correctly.
  • Security logic: 100% killed. IP extraction and window calculations are solid.
  • Boundary conditions: 100% killed. No off-by-one timing bugs.
  • Error mutants (19): Crash immediately. Caught by TypeScript/runtime, can't cause silent data corruption.
  • Performance cleanup (2 survived + 7 no coverage): Informed decision not to test. Non-critical, complex to mock.

Translation: The code that matters—security, correctness, user-facing behavior—is bulletproof. The code we didn't test (setInterval cleanup) is documented and justified.

Why "Watch Out for the Wolverine"?

In the X-Men comics, Wolverine has a healing factor. Cut him, he heals. Survived mutants are like that—they expose weaknesses your tests can't kill.

But unlike Wolverine, survived mutants are good news. Each one is a learning opportunity, a gap in your test suite you didn't know existed.

Should We Test the 7 No-Coverage Mutants?

All 7 no-coverage mutants are in the same place: the setInterval cleanup code that removes expired rate limit records every 5 minutes.

setInterval(() => {
  const now = Date.now();
  for (const [key, record] of rateLimitStore.entries()) {
    if (record.resetAt < now) {      // ← 7 mutants here
      rateLimitStore.delete(key);
    }
  }
}, 5 * 60 * 1000);

Should we test this? Let's think it through:

Arguments FOR Testing

  • It manipulates shared state
  • Bugs could delete wrong records
  • Memory leaks are bad

Arguments AGAINST Testing

  • Not security-critical
  • Rate limiting works without it
  • Worst case: extra memory usage
  • Would require complex timer mocking
  • 5 min interval = slow tests
Our decision: We're not testing this. It's a performance optimization, not business logic. If it breaks, the app still works—it just uses more memory. The cost of testing (complex mocking, slow tests) outweighs the benefit.

This is an informed decision, not laziness. Document it and move on.

When to Use Mutation Testing

Critical business logic: Rate limiters, payment processing, authentication

Complex conditionals: Lots of if/else, boundary conditions

Security-sensitive code: Where bugs = vulnerabilities

When to Skip It

Simple presentational code: JSX rendering, CSS classes

Prototypes: Code that'll be rewritten anyway

Generated code: ORM models, API clients

Setting Up Stryker (5 Minutes)

# Install
npm install --save-dev @stryker-mutator/core @stryker-mutator/vitest-runner

# Configure stryker.config.json
{
  "testRunner": "vitest",
  "mutate": ["lib/rate-limiter.ts"],
  "thresholds": { "high": 80, "low": 60, "break": 50 }
}

# Run
npm run test:mutation

The Bottom Line

Mutation testing isn't about hitting a score. It's about understanding the score.

  • Killed mutants: Your tests work. Keep them.
  • Survived mutants: Silent bugs your tests miss. Fix these.
  • Error mutants: Would crash immediately. TypeScript/runtime catches them.
  • No coverage mutants: Untested code. Decide if it matters.

A 70% overall score with 91% covered code score? Not bad. But the real win: finding 2 bugs that look exactly like mistakes a developer would make. Flip an operator during refactoring. Change < to <= in a PR review. These aren't exotic edge cases—they're Tuesday afternoon bugs.

Each survived mutant is a mini code review from a very pedantic robot that knows every way your code can break. Listen to it.

Real results from our rate limiter code:
Critical bugs found: 2 — arithmetic operator flip (* → /) + boundary condition (< → <=).
Both are realistic mistakes developers make during refactoring or code review.
Both passed all tests. Both would have shipped to production without mutation testing.
Time spent: 45 minutes. Value: Caught bugs that monitoring wouldn't find until users complained.