AI TOOLING

DEC 17, 2025

Flagship LLMs in 2025: Reliability vs. Vibes

(GPT-5.2 vs Claude Sonnet 4.5 vs Gemini 3 Pro)

There are two kinds of AI assistants:

  1. The kind that does what you asked.
  2. The kind that does what it wished you asked.

Boring Reliability is mostly about paying extra for (1), and then designing your prompts, tests, and guardrails so it stays that way.

This is a practical comparison of three current “flagship-ish” models:

  • OpenAI: GPT-5.2 (the “yes, and” got replaced with “yes, exactly”)
  • Anthropic: Claude Sonnet 4.5 (fast, elegant, occasionally… imaginative about constraints)
  • Google: Gemini 3 Pro (powerful, multimodal, sometimes commits to a path like it signed a lease)

My working take (aka: vibes with accountability)

  • GPT-5.2 is the most reliably precise in the “do the task as specified” sense. Not always the most poetic. Often a feature, not a bug.
  • Claude Sonnet 4.5 is quick and strong, but it can “helpfully” fill in blanks with assumptions unless you keep it on a short prompt leash.
  • Gemini 3 Pro is extremely capable, especially with giant multimodal context — but if it chooses an approach early, it can get… corner-shaped.

That’s the human take. Now let’s do the boring part: numbers.


The boring numbers that matter

Limits & pricing (API list rates)

| Model | Context (tokens) | Max output (tokens) | Input ($ / 1M) | Cached/other | Output ($ / 1M) |
|---|---|---|---|---|---|
| GPT-5.2 | 400,000 | 128,000 | $1.75 | $0.175 / 1M cached input | $14.00 |
| Claude Sonnet 4.5 | 200K (1M via beta) | 64K | $3.00 | see Anthropic caching tiers | $15.00 |
| Gemini 3 Pro (Preview) | 1,048,576 | 65,536 | $2.00 (≤200K) / $4.00 (>200K) | caching priced separately | $12.00 (≤200K) / $18.00 (>200K) |
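To make the tiered Gemini pricing concrete, here's a back-of-envelope calculator using the list rates above. One assumption to flag: it treats the >200K tier as keyed to prompt size and applies it to both input and output rates, and it ignores caching, batching, and any thinking-token accounting.

```python
def gemini_3_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost in USD from the list rates above.

    Assumes the long-context tier triggers when the prompt exceeds
    200K tokens and applies to both input and output pricing.
    """
    long_context = input_tokens > 200_000
    input_rate = 4.00 if long_context else 2.00     # $ per 1M input tokens
    output_rate = 18.00 if long_context else 12.00  # $ per 1M output tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

print(gemini_3_pro_cost(150_000, 8_000))  # 0.396  (short-context tier)
print(gemini_3_pro_cost(300_000, 8_000))  # 1.344  (long-context tier)
```

Crossing the 200K line roughly doubles the per-request bill, which is why the first note below matters.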

Notes (because production systems are where joy goes to be tested):

  • Gemini’s listed pricing changes when your prompts exceed 200K tokens, which is… very on brand for a model with a 1M context window.
  • Claude Sonnet 4.5 can do 1M context via a beta header; the default is 200K. Great power, great footnotes.
  • GPT-5.2’s cached input discount is substantial when you structure workloads to reuse long prefixes (templates, system policies, big retrieved context); see the sketch below.
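Prefix caching rewards byte-stable prefixes: put everything that never changes (system policy, templates, big retrieved context) first, and the per-request bits last. A minimal sketch with the OpenAI Python SDK; the model name comes from the table above, the file paths are invented for illustration, and exact caching thresholds and discounts are provider-specific.

```python
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stable prefix: identical bytes on every request, so the provider can
# cache it. Anything that varies per request goes *after* this.
SYSTEM_POLICY = Path("policies/system_policy.md").read_text()  # hypothetical path
REFERENCE_DOC = Path("context/reference_doc.md").read_text()   # hypothetical path

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5.2",  # model name per the table above
        messages=[
            {"role": "system", "content": SYSTEM_POLICY},  # long, reused
            {"role": "user", "content": REFERENCE_DOC},    # long, reused
            {"role": "user", "content": question},         # short, varies
        ],
    )
    return resp.choices[0].message.content
```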

“But which one is best?” (leaderboard snapshot)

Leaderboards are not truth tablets from the mountain. They’re a measurement instrument with quirks:

  • They reflect preference votes on a specific distribution of prompts.
  • They shift over time.
  • Early results are often marked preliminary for a reason.

Still, they’re useful as a sanity check.

LMArena — Text Arena (Last updated: Dec 16, 2025)

  • Gemini 3 Pro: Rank #1, score 1492 (18,120 votes)
  • Claude Sonnet 4.5 (thinking): Rank #8, score 1450 (30,277 votes)
  • GPT-5.2-high: Rank #13, score 1441 (Preliminary, 6,035 votes)

LMArena — WebDev Leaderboard (Last updated: Dec 11, 2025)

  • GPT-5.2-high: Rank #2, score 1486 (Preliminary, 1,641 votes)
  • Gemini 3 Pro: Rank #4, score 1482 (7,897 votes)
  • Claude Sonnet 4.5 (thinking): Rank #7, score 1395 (6,974 votes)
  • (Also: GPT-5.2 baseline shows up at Rank #6, score 1399, Preliminary)

What I infer from this (carefully, like an adult)

  • Gemini 3 Pro looks extremely competitive on general text preference right now.
  • GPT-5.2-high looks very strong on webdev-style tasks, which matches the “agentic + coding” positioning.
  • Claude Sonnet 4.5 remains strong, especially with thinking enabled, but it’s not dominating these two slices at this moment.

And here’s the key reliability point:

A leaderboard score is not the same thing as “does it follow instructions without improvising.”

Reliability (in the boringreliability.dev sense) is about:

  • constraint-following
  • reproducibility
  • tool-call correctness
  • refusal to invent missing requirements
  • not wandering off to build a cathedral when you asked for a shed (a harness sketch follows this list)
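And you can check most of that mechanically instead of by vibes. A minimal harness sketch: `call_model` is a stand-in for whichever SDK you use, and the three-key JSON spec is invented for illustration.

```python
import json

EXPECTED_KEYS = {"title", "summary", "risk_level"}  # the spec, made explicit
ALLOWED_RISK = {"low", "medium", "high"}            # enum: no improvised values

def check_constraints(raw: str) -> list[str]:
    """Return the list of constraint violations in one model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    violations = []
    if set(data) != EXPECTED_KEYS:
        violations.append(f"keys differ from spec: {sorted(set(data) ^ EXPECTED_KEYS)}")
    if data.get("risk_level") not in ALLOWED_RISK:
        violations.append(f"risk_level {data.get('risk_level')!r} not in enum")
    return violations

def pass_rate(call_model, prompt: str, trials: int = 20) -> float:
    """Reproducibility check: same prompt, N trials, fraction that pass."""
    passes = sum(not check_constraints(call_model(prompt)) for _ in range(trials))
    return passes / trials
```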

Practical guidance (what I’d pick, and why)

Pick GPT-5.2 when…

You care about precision, adherence, and controlled execution.

  • Strong default for “do exactly this” workflows (coding tasks, refactors, structured outputs, tool calls).
  • Pricing + caching can be cost-effective if you design for reuse.
  • Creativity is available, but it tends to be opt-in, not a surprise guest.

Pick Claude Sonnet 4.5 when…

You want speed + quality drafts, and you’re willing to supervise.

  • Great at moving fast and staying pleasant.
  • Add explicit constraints (“do not assume,” “ask before choosing X”) or it may “help” by guessing; see the prompt sketch below.
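For example, the supervision can live in the system prompt. A sketch with the Anthropic Python SDK; the rule wording is mine, not an official recipe, and the model ID is an assumption to confirm against Anthropic's docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SUPERVISOR_RULES = """\
- Do not assume missing requirements. If the spec is ambiguous, stop and ask.
- Before choosing a library, framework, or schema, list the options and ask.
- If you cannot satisfy a constraint, say so explicitly; never silently relax it.
"""

response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model ID; check Anthropic's model docs
    max_tokens=1024,
    system=SUPERVISOR_RULES,
    messages=[
        {"role": "user", "content": "Refactor utils.py to remove the global cache."},
    ],
)
print(response.content[0].text)
```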

Pick Gemini 3 Pro when…

You want big multimodal context and heavy-duty ingestion.

  • 1M context is genuinely useful for messy real-world inputs (PDFs, mixed media, huge repos).
  • Watch for early commitment: if it picks a framing that’s slightly off, it can take work to steer it back.

The “two-hour expert” phenomenon (release-day comedy)

A new model drops, and within two hours the internet produces:

  • 400 hot takes
  • 40 threads titled “I tested it thoroughly”
  • at least one person who is definitely “shaking rn” (for science)

The boring truth: you can’t responsibly generalize about model behavior in two hours unless your “test suite” is literally ready, automated, and statistically sane.
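“Statistically sane” doesn't require a stats department, just repeated trials and honest error bars. A pure-Python sketch: score the same eval N times and report a Wilson interval instead of one anecdote.

```python
import math

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate: honest error bars
    even at release-day sample sizes."""
    if trials == 0:
        return (0.0, 1.0)
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (center - margin, center + margin)

# 9/10 on release day sounds decisive until you see the interval:
lo, hi = wilson_interval(9, 10)
print(f"pass rate 90%, 95% CI [{lo:.0%}, {hi:.0%}]")  # about [60%, 98%]
```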

That’s why the most honest early statements look like:

  • “Here are my prompts.”
  • “Here are failure cases.”
  • “Here’s what I didn’t test.”
  • “Here’s what might be placebo.”

Or, in Boring Reliability terms:

Ship claims only as fast as you can ship evidence.


Bottom line

If your goal is boring outcomes (repeatable, spec-following, production-safe):

  • GPT-5.2 is the safest default, both in my experience and in how it’s designed and positioned.
  • Claude Sonnet 4.5 is excellent when paired with tight requirements and a supervisor mindset.
  • Gemini 3 Pro is a monster in capability and context — just don’t let it drive unsupervised into a narrative ditch.

The universe is chaotic. Your AI doesn’t have to be.


References (public docs & leaderboards)

  • OpenAI pricing: https://openai.com/api/pricing/
  • GPT-5.2 model docs: https://platform.openai.com/docs/models/gpt-5.2
  • Anthropic model overview (Sonnet 4.5 limits/pricing/context): https://platform.claude.com/docs/en/about-claude/models/overview
  • Gemini 3 Pro pricing (Developer API): https://ai.google.dev/gemini-api/docs/pricing
  • Gemini 3 Pro model details (Vertex AI): https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-pro
  • LMArena Text leaderboard: https://lmarena.ai/leaderboard/text
  • LMArena WebDev leaderboard: https://lmarena.ai/leaderboard/webdev