Ward #1: Yes, You Can Actually Build Good Software With AI

A calm introduction to Gold Standard TDD (GS-TDD)

Let's be honest: most AI-assisted coding right now is just… vibes.

You open your editor, type a slightly desperate prompt into your AI of choice, stare at the generated code and think:

"Yeah, that looks about right."

Then you copy-paste, tweak a line or two, and move on.
Future-you (or Security, or Ops) can deal with the fallout.

That pattern has a name: vibe coding. It's quietly wrecking quality, security, and predictability in a lot of teams.

This first Ward is about the opposite of that.

It's about how you can absolutely use AI in your stack and still ship boringly reliable, production-grade software. Not by trusting the vibes, but by putting the AI inside a calm, test-driven, quality-obsessed workflow.

That workflow is what I call Gold Standard Test-Driven Development (GS-TDD). This article is your introduction to it.


The problem isn't AI. It's how we use it.

There's a funny pattern in recent research on AI coding assistants.

Developers often:

  • feel faster with AI
  • but in controlled experiments they're often slower
  • and they ship more insecure code
  • while being more confident about that worse code

That's the vibe coding trap in one sentence:

AI makes you feel smarter, faster, and safer than you actually are.

You've probably seen some of these failure modes in real life:

  • A "quick Copilot snippet" that later explodes on an edge case
  • A refactor that probably works because "the tests are green-ish"
  • A generated implementation that passes happy-path tests but ignores auth, rate limiting, or performance

Want a concrete example? Check out LocalStorage Blues — a song about what happens when "remember me" becomes "remember everything, forever, in plaintext, for anyone with XSS access." Those are Tuesday afternoon bugs that should've been caught by tests.

The core issue: we're dropping a super-powered autocomplete into workflows that were never designed for a second, semi-autonomous engineer with infinite confidence and zero accountability.

If you just plug AI into your old habits, you don't get a 10x engineer.

You get a very fast intern with no brakes.


What if the process was AI-native?

Classic TDD tells you to:

  1. Write a failing test (Red)
  2. Write the minimum code to make it pass (Green)
  3. Refactor

That pattern's tuned for human brains: small steps, minimal changes, constant feedback.
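
As a tiny illustration (the slugify helper is hypothetical, invented here for the example), one classic Red/Green cycle might look like this:

// Red: one small, failing test
it("replaces spaces with dashes", () => {
  expect(slugify("hello world")).toBe("hello-world");
});

// Green: the least code that makes that single test pass
function slugify(input: string): string {
  return input.replace(/ /g, "-");
}

Deliberately incomplete. Punctuation, casing, and repeated separators wait for the next small loop.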

An AI doesn't think like that. It doesn't need to protect its working memory. It can consider the entire feature, all the tests, and several cross-cutting concerns at once.

So instead of asking the AI for minimal "green" code, GS-TDD suggests a different move:

Ask the AI for a Gold Standard implementation from the start.

Red → Gold → Refactor

Where classic TDD cycles Red → Green → Refactor, GS-TDD reframes the loop:

  1. Red – You write a failing, behavior-focused test suite.
  2. Gold – You ask the AI for a production-oriented implementation, not a minimal one.
  3. Refactor – You (and sometimes the AI) improve architecture, clarity, and performance once the tests are green.

The Gold step is the important one. A Gold Standard implementation is:

  • not perfect
  • not final
  • but intentionally holistic from the first serious attempt

You explicitly instruct the AI to care about:

  • correctness and edge cases
  • security basics (no obvious injection holes, no plain-text secrets)
  • maintainability (separation of concerns, clean interfaces)
  • fit with the surrounding architecture

Instead of generating something that just passes the tests, you get a first draft that tries to behave like production code, under the watchful eye of your test suite.

The magic isn't in the phrase "Gold Standard". It's in the expectations you set in your prompts and in your process.


The tests become the AI's contract

In "normal" TDD, tests are mostly there for regression protection and design feedback.

In GS-TDD, they also become your AI contract.

Instead of writing abstract, implementation-shaped unit tests like:

it("should return true for truthy inputs", () => {
  expect(coerceBoolean("true")).toBe(true);
  expect(coerceBoolean(1)).toBe(true);
});

you describe behaviors in a way that both humans and AIs can understand:

it("should handle boolean values", () => {
  /**
   * Given: A boolean
   * When: coerceBoolean is called
   * Then: It should return the boolean unchanged
   */
  expect(coerceBoolean(true)).toBe(true);
  expect(coerceBoolean(false)).toBe(false);
});

it('should handle string "true"/"false"', () => {
  /**
   * Given: A string "true" or "false"
   * When: coerceBoolean is called
   * Then: It should return the corresponding boolean
   */
  expect(coerceBoolean("true")).toBe(true);
  expect(coerceBoolean("false")).toBe(false);
});

it("should return undefined for null/undefined", () => {
  /**
   * Given: null or undefined
   * When: coerceBoolean is called
   * Then: It should return undefined
   */
  expect(coerceBoolean(null)).toBeUndefined();
  expect(coerceBoolean(undefined)).toBeUndefined();
});

Those comments aren't just for humans. They're high-quality input to the AI:

  • They encode intent, not just inputs and outputs.
  • They call out important edge cases explicitly.
  • They name the contract: what should happen, not how it happens.

Later, when you ask the AI:

"Here's the test suite. Generate a Gold Standard implementation that satisfies these behaviors."

you've already done the hardest part. You've turned fuzzy product ideas into precise, testable, natural-language behavior.

The AI now has a clear contract. Your job shifts from "guess if this code looks right" to "check whether this code genuinely satisfies the contract".
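
For the coerceBoolean suite above, a Gold Standard response might look something like this. Treat it as a sketch of the kind of output to expect, not the one true implementation; how it handles inputs the tests never mention (numbers, unrecognised strings, objects) is an assumption I'm spelling out, not part of the contract.

export function coerceBoolean(value: unknown): boolean | undefined {
  // Booleans pass through unchanged
  if (typeof value === "boolean") return value;

  // Absent values stay absent instead of silently becoming false
  if (value === null || value === undefined) return undefined;

  // Normalise common string spellings: "true", "TRUE", " false "
  if (typeof value === "string") {
    const normalised = value.trim().toLowerCase();
    if (normalised === "true") return true;
    if (normalised === "false") return false;
    return undefined; // unrecognised strings: refuse to guess
  }

  // Numbers: an assumption, not in the tests — treat exactly 1 and 0 as true/false
  if (typeof value === "number") {
    if (value === 1) return true;
    if (value === 0) return false;
    return undefined;
  }

  // Everything else (objects, arrays, symbols) is not a boolean here
  return undefined;
}

Notice that it already cares about things a minimal "make the tests green" version wouldn't: casing and whitespace in strings, and an explicit refuse-to-guess stance for anything outside the contract.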


Humans stay accountable. AI becomes responsible.

GS-TDD also adds a small governance layer, inspired by the RACI model:

  • The human stays Accountable. You own correctness, security, architecture, and final approval.
  • The AI is usually Responsible. It writes tests, first implementations, and many refactorings.
  • Both can be Consulted. You can ask the AI for better tests. The AI relies on your review and edits.
  • The team is Informed. Pull requests, docs, and shared patterns make the process visible.

In practice this means:

  • The AI's allowed to move fast. It can generate tests, try an implementation, and iterate in a tight loop with the test suite.
  • Nothing ships unless a human has:
    • read and approved the tests
    • skimmed or reviewed the implementation
    • decided that it's good enough for production in this context

You're not outsourcing judgment. You're outsourcing typing and a lot of routine reasoning.


What this Ward (and the series) is here to do

This first Ward isn't the deep dive. It's the reframe.

  • AI isn't just a smarter autocomplete. It behaves like a second engineer who badly needs process.
  • The answer isn't "never use AI".
  • The answer's also not "YOLO prompts in production".
  • The interesting space sits in the middle: AI-native workflows with boringly reliable outcomes.

In the upcoming Wards, we'll get much more concrete. For example:

  • How to structure BDD-style tests so they double as great AI prompts.
  • How to phrase "Gold Standard" instructions that actually change the quality of the output.
  • How to run an AI-driven "Monitored Debugging Loop" without letting it cheat or overfit to tests.
  • How to adapt GS-TDD to:
    • frontend components
    • backend services
    • legacy code and refactors
    • AI agent workflows themselves

My north star for all of this:

If a senior engineer followed this process with a decent AI, they should end up with code they're not ashamed to put their name on.

No hype. No "10x engineer" cosplay. Just a calm, opinionated way to let AI do more of the typing, while you keep control of quality, architecture, and risk.


Where to go from here

If any of this felt familiar, you can treat this Ward as your baseline.

  1. Notice where you vibe code today.
    The "it looks right" moments. The rushed "eh, ship it" merges.

  2. Pick one feature or bugfix on your next task and ask:
    "What would the Gold Standard version of this look like if I made the AI work harder?"

  3. Write slightly better tests.
    Add one or two BDD-style descriptions that explain behavior in human language, not just asserts.
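
For example (parsePageParam is a hypothetical helper, invented for illustration), the upgrade to an existing test can be as small as adding the description it was missing:

// Before: asserts only, intent lives in your head
it("parses page params", () => {
  expect(parsePageParam("abc")).toBe(1);
});

// After: same assert, behavior spelled out for humans and AIs
it("should fall back to page 1 for non-numeric input", () => {
  /**
   * Given: A page query parameter that isn't a valid number
   * When: parsePageParam is called
   * Then: It should return 1 instead of NaN or an exception
   */
  expect(parsePageParam("abc")).toBe(1);
});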

Small changes in how you ask the AI and how you constrain the workflow can make a big difference in the quality of what lands in main.

GS-TDD is just a name for that shift. Away from vibes, and toward verifiable, test-driven collaboration with your AI.

More patterns, prompts, and concrete walkthroughs will follow in the next Wards.


Real talk

I built this site using GS-TDD with Claude Sonnet. Every feature started with tests. The AI wrote most of the code. I reviewed everything. It works, it's tested, and I'm not embarrassed by it.

The mutation testing article you might've just read? That rate limiter was built this way. The chatbot nudging you right now? Also GS-TDD. Tests first, Gold implementation, refactor when needed.

This isn't theory. It's how this site exists.