A client spent £50,000 on GPT-4 API costs in their first month of production. Their document analysis system was processing legal contracts—sensible use case—but they were sending full 100-page documents through the API without chunking. They didn't understand that transformer costs scale with context length, or that a smaller model with proper RAG would have been 90% cheaper and equally accurate.
Understanding how transformers work isn't just academic—it's the difference between a sustainable AI deployment and a budget crisis. This guide covers what you need to know to make informed decisions about model selection, context management, and infrastructure costs.
In this guide: the price gap between Haiku and GPT-4, the trade-offs between small and large models, and what doubling your context does to cost.
Transformers process sequences of data (text, code, images) and generate context-aware outputs. Unlike earlier models that processed text word-by-word sequentially, transformers see the entire input at once using a mechanism called "attention"—they can focus on any relevant part of the input when generating each part of the output.
This parallel processing is both their superpower and their cost driver. The attention mechanism compares every token to every other token, so the compute required grows roughly quadratically with input length, and every token you send adds to the bill.
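To make that concrete, here is a tiny back-of-the-envelope sketch in plain Python. The numbers are illustrative only, not any provider's price list; it simply counts token-pair comparisons.

```python
# Why long contexts get expensive: attention compares every token with every
# other token, so comparisons grow with the square of the input length.
# Doubling the context roughly quadruples the attention work.
def attention_pairs(num_tokens: int) -> int:
    return num_tokens * num_tokens

for tokens in (10_000, 20_000, 40_000, 80_000):
    print(f"{tokens:>6} tokens -> {attention_pairs(tokens):>13,} pairwise comparisons")

# 10k -> 100M, 20k -> 400M, 40k -> 1.6B, 80k -> 6.4B comparisons
```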
The "context window" is the maximum amount of text a transformer can process in one request. Modern models range from 8K tokens (GPT-3.5) to 200K tokens (Claude) to 1M+ tokens (Gemini). But bigger isn't always better:
For most use cases, RAG (retrieving relevant chunks and including only those) outperforms stuffing the entire context window. It's cheaper, faster, and often more accurate.
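As a sketch of what retrieve-then-ask looks like in practice: the `embed` function below is a placeholder for whatever embedding model you use, and the chunk size and `top_k` values are illustrative.

```python
# Retrieve-then-ask, sketched: send only the chunks most relevant to the
# question instead of the whole document. embed() is a placeholder.
import numpy as np

def chunk_document(text: str, chunk_size: int = 2_000) -> list[str]:
    # Naive fixed-size chunking; production systems usually split on
    # paragraphs, headings, or sentences instead.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def retrieve(question: str, chunks: list[str], embed, top_k: int = 4) -> list[str]:
    # Score every chunk against the question, keep only the most relevant.
    def score(chunk: str) -> float:
        q, c = np.asarray(embed(question)), np.asarray(embed(chunk))
        return float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c)))
    return sorted(chunks, key=score, reverse=True)[:top_k]

def build_prompt(question: str, relevant_chunks: list[str]) -> str:
    context = "\n\n".join(relevant_chunks)
    return f"Use the excerpts below to answer.\n\n{context}\n\nQuestion: {question}"
```

Once retrieval keeps prompts small, model choice becomes the other big cost lever: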
| Option | When to Use | Trade-offs |
|---|---|---|
| Claude Haiku / GPT-4o-mini | High volume, simple tasks, cost-sensitive applications | £0.25-0.60/1M tokens, fast, good for classification and extraction |
| Claude Sonnet / GPT-4o | Complex reasoning, coding, analysis requiring nuance | £2.50-5/1M tokens, balanced cost/capability, most common choice |
| Claude Opus / GPT-4 | Highest-stakes tasks, complex multi-step reasoning | £15-60/1M tokens, use sparingly for critical paths only |
| Open-source (Llama, Mistral) | Data sovereignty, cost control at scale, customisation needs | Infrastructure costs, operational overhead, but no per-token fees |
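To see what those per-token prices mean at volume, here is a rough monthly estimate using mid-range figures from the table above. Treat the prices and the workload assumptions as illustrative; always check your provider's current price list.

```python
# Back-of-the-envelope monthly cost comparison across model tiers.
# Prices are rough per-million-token figures from the table above.
PRICE_PER_1M_TOKENS_GBP = {
    "small (Haiku / GPT-4o-mini)": 0.50,
    "mid (Sonnet / GPT-4o)": 4.00,
    "large (Opus / GPT-4)": 30.00,
}

requests_per_month = 100_000
tokens_per_request = 3_000  # prompt + response combined

total_tokens = requests_per_month * tokens_per_request
for tier, price in PRICE_PER_1M_TOKENS_GBP.items():
    cost = total_tokens / 1_000_000 * price
    print(f"{tier:<30} ~£{cost:,.0f}/month")

# With these assumptions: ~£150 vs ~£1,200 vs ~£9,000 per month for the
# same workload -- a 60x spread, and wider still at the table's extremes.
```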
You don't need to understand every detail, but these concepts help explain transformer behaviour and limitations:
Text is split into "tokens"—roughly word pieces. "Transformers" becomes ["Trans", "form", "ers"]. This is why token count ≠ word count, and why pricing is per-token not per-word.
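You can inspect this yourself; the sketch below uses the `tiktoken` library, which implements the tokenisers used by OpenAI models (other providers tokenise slightly differently).

```python
# Counting tokens with tiktoken (OpenAI's tokeniser).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Transformers process sequences of data."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
print([enc.decode([t]) for t in tokens])  # the individual word pieces
```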
Each token becomes a vector (list of numbers) capturing its meaning. Similar words have similar vectors. This is the same technology used in semantic search and RAG systems.
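A toy illustration of the idea, using made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
# Toy embedding vectors -- made up for illustration. Similar meanings point
# in similar directions, which cosine similarity picks up.
import numpy as np

vectors = {
    "contract":  np.array([0.9, 0.1, 0.2]),
    "agreement": np.array([0.8, 0.2, 0.1]),
    "banana":    np.array([0.1, 0.9, 0.7]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["contract"], vectors["agreement"]))  # high, ~0.99
print(cosine(vectors["contract"], vectors["banana"]))     # low,  ~0.30
```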
The model compares each token to every other token to understand relationships. In "The bank by the river was steep", attention connects "bank" with "river" and "steep" to understand it means a riverbank, not a financial institution.
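The mechanism itself is compact. Here is the standard scaled dot-product attention in a few lines of numpy, with toy matrices and the batch/multi-head details left out:

```python
# Scaled dot-product attention: each token's query is compared against every
# token's key, and the resulting weights mix the value vectors together.
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (tokens, tokens) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # weighted mix of value vectors

# The (tokens, tokens) score matrix is exactly what makes attention's cost
# grow quadratically with sequence length.
tokens, dim = 6, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((tokens, dim)) for _ in range(3))
print(attention(Q, K, V).shape)  # (6, 8)
```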
Output is generated one token at a time, each new token becoming input for the next. This is why streaming responses arrive word-by-word, and why longer outputs take longer to generate.
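Here is what that looks like from the API side. This sketch uses the OpenAI Python SDK (v1.x) purely as an example; other providers' SDKs offer the same streaming pattern.

```python
# Streaming completion sketch: tokens arrive one at a time because that is
# literally how the model generates them.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise this clause in one sentence: ..."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # print each piece as it arrives
```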
Understanding your options for customising transformer behaviour comes down to three levers: prompting, RAG, and fine-tuning.
Most teams should exhaust prompting and RAG before considering fine-tuning. Fine-tuning is expensive, requires ML expertise, and creates model management overhead.
- Define your latency requirements (real-time chat vs batch processing). Large models can have 2-5 second first-token latency.
- Estimate your token volume and calculate costs for different models. Defaulting to the largest model can 100x your costs.
- Test the smallest viable model on real examples before scaling up. Most tasks don't need flagship models.
- Consider data residency requirements (EU/UK hosting). Not all providers offer EU data centres.
- Plan for model routing if different tasks need different capabilities. A one-size-fits-all model wastes money on simple tasks; see the routing sketch below.
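Here is a minimal routing sketch, assuming incoming tasks can be labelled by type. The task labels and model names are illustrative placeholders, not real API identifiers.

```python
# Minimal model routing: send each task type to the cheapest model that
# handles it well. Labels and model names are illustrative placeholders.
ROUTES = {
    "classification": "small-model",   # high volume, simple -> cheapest tier
    "extraction":     "small-model",
    "drafting":       "mid-model",     # needs nuance -> mid tier
    "legal_review":   "large-model",   # highest stakes -> flagship, used sparingly
}

def pick_model(task_type: str) -> str:
    # Unknown task types fall back to the mid tier, not the most expensive model.
    return ROUTES.get(task_type, "mid-model")

print(pick_model("classification"))  # small-model
print(pick_model("legal_review"))    # large-model
```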
Understanding transformers isn't about becoming an ML engineer—it's about making informed decisions. Knowing that costs scale with context, that smaller models often suffice, and that RAG usually beats fine-tuning will save you time and money.
The key takeaway: model selection has massive cost implications. A 100x difference between the cheapest and most expensive options means getting this decision right early in your project can determine whether it's financially viable at scale.
Need help choosing the right model for your use case? We help organisations navigate the cost/capability trade-offs and build sustainable AI systems.
Learn about our method →