For many workloads they feel close enough to today’s top models that the cost gap becomes the dominant variable, not the quality gap.

How Much Cheaper Are Minis?

Here’s what “x times cheaper” actually looks like at the token level for previous‑gen minis versus current‑gen flagships. Prices are per 1M tokens.[1][2][3][4]

OpenAI: GPT‑4o‑mini vs GPT‑5

| Metric | GPT‑4o‑mini | GPT‑5 | Savings factor |
|---|---|---|---|
| Input (/1M) | $0.075 [1] | $0.625 [1] | 8.3x cheaper |
| Output (/1M) | $0.30 [1] | $5.00 [1] | 16.7x cheaper |

A “normal” job with 700k input and 300k output tokens costs about $0.14 on GPT‑4o‑mini versus $1.94 on GPT‑5, roughly a 14x difference for something that often feels similar for drafting, summarisation, and light coding.[1]
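The arithmetic behind that comparison is simple enough to sketch. A minimal helper, using the per‑1M‑token prices from the table above (the prices are quoted figures, not live API pricing):

```python
def job_cost(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one job, given per-1M-token input/output prices."""
    return (input_tokens / 1_000_000) * in_price_per_m \
         + (output_tokens / 1_000_000) * out_price_per_m

# The 700k-in / 300k-out job from the text, priced on both models:
mini = job_cost(700_000, 300_000, 0.075, 0.30)      # GPT-4o-mini
flagship = job_cost(700_000, 300_000, 0.625, 5.00)  # GPT-5

print(f"${mini:.2f} vs ${flagship:.2f}, {flagship / mini:.1f}x")
# -> $0.14 vs $1.94, 13.6x
```

Note that the blended savings factor for a given job always lands somewhere between the input and output multipliers, weighted by the job’s token mix.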

Anthropic: Claude 3.5 Haiku vs Claude 4 Opus

| Metric | Claude 3.5 Haiku | Claude 4 Opus | Savings factor |
|---|---|---|---|
| Input (/1M) | $0.25 [2] | $3.00 [3] | 12x cheaper |
| Output (/1M) | $1.25 [2] | $15.00 [3] | 12x cheaper |

Run the same 700k/300k job and you’re looking at roughly $0.55 on Haiku versus $6.60 on Opus, again about 12x cheaper while still being “good enough” for a lot of production tasks.[2][3]

Google: Gemini 1.5 Flash vs Gemini 2.5 Pro

| Metric | Gemini 1.5 Flash | Gemini 2.5 Pro | Savings factor |
|---|---|---|---|
| Input (/1M) | $0.075 [4] | $1.25 [4] | 16.7x cheaper |
| Output (/1M) | $0.30 [4] | $3.75 [4] | 12.5x cheaper |

Here the same job comes out around $0.14 on 1.5 Flash versus $2.00 on 2.5 Pro, roughly 14x cheaper, and the gap stays above 12x even for output‑heavy workloads like report generation or multi‑turn agents.[4]
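Running the same 700k/300k job through all three provider pairs makes the pattern easy to see. A small sketch, using the prices quoted in the tables above (quoted figures, not live pricing):

```python
# (input $/1M, output $/1M) for each (mini, flagship) pair from the tables
PAIRS = {
    "OpenAI":    ((0.075, 0.30), (0.625, 5.00)),   # GPT-4o-mini vs GPT-5
    "Anthropic": ((0.25, 1.25), (3.00, 15.00)),    # 3.5 Haiku vs 4 Opus
    "Google":    ((0.075, 0.30), (1.25, 3.75)),    # 1.5 Flash vs 2.5 Pro
}

def cost(tokens_in: int, tokens_out: int, prices: tuple) -> float:
    """Dollar cost of a job given (input, output) per-1M-token prices."""
    in_p, out_p = prices
    return tokens_in / 1e6 * in_p + tokens_out / 1e6 * out_p

for vendor, (mini, flagship) in PAIRS.items():
    c_mini = cost(700_000, 300_000, mini)
    c_flag = cost(700_000, 300_000, flagship)
    print(f"{vendor}: ${c_mini:.2f} vs ${c_flag:.2f} ({c_flag / c_mini:.1f}x)")
# -> OpenAI: $0.14 vs $1.94 (13.6x)
# -> Anthropic: $0.55 vs $6.60 (12.0x)
# -> Google: $0.14 vs $2.00 (14.0x)
```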

Why This Matters In Practice

Once you look at the multipliers, a pattern emerges: previous‑gen minis are often 8–17x cheaper per token while still being perfectly viable as default “daily driver” models. For internal tools, batch jobs, RAG systems, and agents, that price curve means you can either cut your bill by an order of magnitude or run 10x more experiments for the same budget.[1][2][3][4]

The top models (GPT‑5, Claude Opus, Gemini Pro) are still worth it for the genuinely hard stuff: complex reasoning, messy multi‑step workflows, high‑stakes outputs, or when latency and retries matter more than pure cost. But if you route everything through them by default, you are almost certainly burning money you do not need to spend.[5][6][7]

A Simple Routing Mental Model

A practical pattern that falls out of these numbers:

  • Default to previous‑gen mini (GPT‑4o‑mini, Claude 3.5 Haiku, Gemini 1.5 Flash) for 80–90% of traffic: chatbots, CRUD apps, summarisation, log analysis, boilerplate code, content drafts.[8][9][10]
  • Escalate to current‑gen top models only when:
    • The prompt is clearly “hard” (nested reasoning, long tool chains, high ambiguity).
    • The user or system flags low confidence and asks for a second opinion.
    • The domain is high‑risk or high‑value (legal, medical, financial, exec‑visible outputs).
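The escalation rules above can be sketched as a routing function. Everything here is hypothetical scaffolding: the model IDs come from the text, but the difficulty heuristic, the domain list, and the `low_confidence` signal are illustrative assumptions, not any provider’s API.

```python
# Hypothetical router: default to a previous-gen mini, escalate only on
# hard, low-confidence, or high-risk requests. The heuristics below are
# illustrative placeholders, not a production difficulty classifier.

HIGH_RISK_DOMAINS = {"legal", "medical", "financial", "executive"}

def pick_model(prompt: str, domain: str, low_confidence: bool = False) -> str:
    # Crude "hard prompt" proxy: very long prompts or explicit
    # multi-step reasoning requests.
    looks_hard = len(prompt) > 4_000 or "step by step" in prompt.lower()
    if looks_hard or low_confidence or domain in HIGH_RISK_DOMAINS:
        return "gpt-5"        # or claude-opus-4 / gemini-2.5-pro
    return "gpt-4o-mini"      # or claude-3.5-haiku / gemini-1.5-flash

print(pick_model("Summarise this log file", "internal-tools"))  # -> gpt-4o-mini
print(pick_model("Draft the court filing", "legal"))            # -> gpt-5
```

In practice the "looks hard" check is the piece worth investing in; a cheap classifier (or even the mini model grading its own confidence) usually beats string heuristics.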

With that split, you keep the UX of frontier models where it matters, and let the cheap workhorses quietly do everything else for a tenth (or less) of the price.


References & Further Reading