A financial services firm asked us to review their document processing pipeline. Their LLM was extracting contract clauses with 62% accuracy—barely better than keyword matching. The fix wasn't a model upgrade or fine-tuning. We rewrote their prompts using structured techniques, added few-shot examples, and implemented an evaluation loop. Within two weeks, accuracy hit 94%.
Prompting is the most underrated lever in production AI. Teams spend months on infrastructure while shipping prompts written in five minutes. This guide covers the techniques that separate amateur prompts from production-grade ones.
We've seen teams switch from GPT-4 to Claude to Gemini, hoping each new model will fix their quality issues. It rarely does. The problem is almost always the prompt. A well-structured prompt on a smaller model will outperform a vague prompt on a flagship model—at a fraction of the cost.
In production systems, prompts are code. They should be versioned, tested, and reviewed like any other critical component. Yet most teams treat them as afterthoughts—strings buried in application code, edited ad-hoc, never evaluated systematically.
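One practical way to get there is to keep each prompt in its own file in the repository and load it as a template, so every change shows up in diffs and code review. A minimal sketch (the file layout and helper below are illustrative, not a specific library):

```python
from pathlib import Path
from string import Template

# Hypothetical layout: one file per prompt (e.g. prompts/contract_summary_v3.txt), versioned in git.
PROMPT_DIR = Path("prompts")

def load_prompt(name: str, **variables: str) -> str:
    """Load a prompt template from the repo and fill in its variables."""
    template = Template((PROMPT_DIR / f"{name}.txt").read_text(encoding="utf-8"))
    return template.substitute(**variables)

# Usage: the application code never hard-codes the prompt string itself.
prompt = load_prompt("contract_summary_v3", contract_text="[CONTRACT TEXT HERE]")
```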
Every production prompt should have four components, in this order: a role, the context, the task, and the constraints.
Before: Vague prompt with poor results
[#ff7b72]">class="text-[#a5d6ff]">"Summarise [#ff7b72]">this contract."After: Structured prompt with predictable output
"You are a legal analyst specialising in UK commercial contracts. [#79c0ff]">CONTEXT:You are reviewing a supplier agreement [#ff7b72]">for a mid-sized manufacturingcompany. The legal team needs to understand key obligations quickly. [#79c0ff]">TASK:Extract and summarise the following [#ff7b72]">from the attached [#79c0ff]">contract:[#79c0ff]">1. Payment terms and penalties[#79c0ff]">2. Termination clauses and notice periods[#79c0ff]">3. Liability caps and indemnification provisions[#79c0ff]">4. Any unusual or non-standard clauses [#79c0ff]">CONSTRAINTS:- Use bullet points [#ff7b72]">for each section- Flag any clauses that deviate [#ff7b72]">from market standard- Maximum [#79c0ff]">500 words total- Quote specific clause numbers where relevant"These four techniques form the foundation of production-grade prompting. Master them before exploring anything else.
For reasoning tasks, asking the model to "think step by step" improves accuracy by 20-40%. The model shows its working, making errors easier to spot and fix.
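To make that concrete, here is the working the model should show for the council-tax example that follows (a quick check of the arithmetic in Python, not part of the prompt itself):

```python
# Figures taken from the example prompt below.
annual_charge = 1847.52                  # Band D annual charge
after_discount = annual_charge * 0.75    # 25% single person discount -> 1385.64
amount_due = after_discount - 123.40     # apply the £123.40 credit -> 1262.24
print(f"£{amount_due:.2f}")              # £1262.24
```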
"A council tax bill [#79c0ff]">shows:- Band D [#79c0ff]">property: £[#79c0ff]">1,[#79c0ff]">847.52 annual charge- Single person [#79c0ff]">discount: [#79c0ff]">25%- Previous balance carried [#79c0ff]">forward: £[#79c0ff]">123.40 credit Calculate the amount due [#ff7b72]">for the year after discounts and credits.Think through each calculation step before giving the final figure."Modern APIs separate prompts into roles. The system prompt defines behaviour, constraints, and persona—it's your "constitution" that applies to every interaction. The user prompt contains the specific request.
[#79c0ff]">System: "You are a triage assistant [#ff7b72]">for NHS [#79c0ff]">111. Your role is toask clarifying questions and categorise urgency, NOT to diagnose. [#79c0ff]">RULES:- Never suggest specific medications- Always recommend [#79c0ff]">999 [#ff7b72]">for chest pain, difficulty breathing, or stroke symptoms- Escalate to human when confidence is below [#79c0ff]">70%- Log reasoning [#ff7b72]">for all categorisation decisions" [#79c0ff]">User: [#ff7b72]">class="text-[#a5d6ff]">"I[#ff7b72]">class="text-[#a5d6ff]">'ve had a headache [#ff7b72]">for three days and paracetamol isn't helping."Provide 2-4 examples of input-output pairs. This is the single highest-impact technique for classification, extraction, and formatting tasks. The model learns your exact requirements from examples rather than descriptions.
"Classify these citizen enquiries [#ff7b72]">for a local [#79c0ff]">council: [#79c0ff]">Enquiry: [#ff7b72]">class="text-[#a5d6ff]">'The streetlight outside [#79c0ff]">42 Oak Road has been flickering [#ff7b72]">for weeks'→ [#79c0ff]">Category: Highways | [#79c0ff]">Priority: Low | [#79c0ff]">Department: Street Lighting [#79c0ff]">Enquiry: [#ff7b72]">class="text-[#a5d6ff]">'There's raw sewage flooding into my garden [#ff7b72]">from the drain'→ [#79c0ff]">Category: Environmental Health | [#79c0ff]">Priority: Urgent | [#79c0ff]">Department: Drainage [#79c0ff]">Enquiry: [#ff7b72]">class="text-[#a5d6ff]">'I need to know when my bin collection day is changing'→ [#79c0ff]">Category: Waste Services | [#79c0ff]">Priority: Low | [#79c0ff]">Department: Refuse [#79c0ff]">Enquiry: [#ff7b72]">class="text-[#a5d6ff]">'A tree branch fell and is blocking the pavement on Church Lane'"When output feeds into downstream systems, request JSON or other machine-readable formats. Most APIs now support "JSON mode" to guarantee valid responses.
"Extract invoice details and [#ff7b72]">return as [#79c0ff]">JSON: { [#ff7b72]">class="text-[#a5d6ff]">"vendor_name": string, [#ff7b72]">class="text-[#a5d6ff]">"invoice_number": string, [#ff7b72]">class="text-[#a5d6ff]">"date_issued": [#ff7b72]">class="text-[#a5d6ff]">"YYYY-MM-DD", [#ff7b72]">class="text-[#a5d6ff]">"line_items": [{ [#ff7b72]">class="text-[#a5d6ff]">"description": string, [#ff7b72]">class="text-[#a5d6ff]">"amount": number }], [#ff7b72]">class="text-[#a5d6ff]">"total_amount": number, [#ff7b72]">class="text-[#a5d6ff]">"vat_amount": number | [#79c0ff]">null, [#ff7b72]">class="text-[#a5d6ff]">"payment_due_date": [#ff7b72]">class="text-[#a5d6ff]">"YYYY-MM-DD" | [#79c0ff]">null} If a field cannot be determined [#ff7b72]">from the document, use [#79c0ff]">null."| Option | When to Use | Trade-offs |
|---|---|---|
| Chain-of-Thought | Reasoning, calculations, multi-step logic | Increases token usage by 20-50%, but improves accuracy significantly |
| Few-Shot Examples | Classification, extraction, formatting tasks | Requires good examples upfront, adds to prompt length |
| Structured Output | Data feeds into code or databases | May reduce output naturalness, requires schema maintenance |
| System Prompts | Consistent behaviour across conversations | Static context, can't adapt to individual interactions |
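For the structured-output row above, most providers expose "JSON mode" as a single request option that constrains the response to valid JSON. A minimal sketch, assuming the OpenAI Python SDK (parameter names vary between providers):

```python
import json
from openai import OpenAI

invoice_text = "[INVOICE TEXT HERE]"  # raw document text (placeholder)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    response_format={"type": "json_object"},  # ask the API to return valid JSON only
    messages=[
        {"role": "system", "content": "Extract invoice details and return valid JSON matching the agreed schema."},
        {"role": "user", "content": invoice_text},
    ],
)
invoice = json.loads(response.choices[0].message.content)  # safe to parse: the response is valid JSON
```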
The difference between amateur and production prompting is evaluation. Without measurement, you're guessing. Here's the loop we use with every client: build a representative test set with expected outputs, run the current prompt against it, score the results, change one thing, and run it again.
This loop transforms prompting from art into engineering. Teams using it typically see 30-50% improvement in output quality within the first month.
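A minimal version of that loop is a script that runs the current prompt over a labelled test set and reports a score, so every prompt change gets a number attached to it. A sketch under assumed conventions (the JSONL test-set format and the `run_prompt` helper are illustrative):

```python
import json

def run_prompt(prompt_version: str, input_text: str) -> dict:
    """Call the model with the named prompt version and parse its output."""
    ...  # wrap your chat API call and output parsing here

def evaluate(prompt_version: str, test_set_path: str = "tests/extraction_cases.jsonl") -> float:
    """Score a prompt version against a labelled test set and return accuracy."""
    with open(test_set_path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f]
    correct = sum(run_prompt(prompt_version, case["input"]) == case["expected"] for case in cases)
    accuracy = correct / len(cases)  # exact match; swap in field-level scoring where appropriate
    print(f"{prompt_version}: {accuracy:.1%} on {len(cases)} cases")
    return accuracy

# Usage: run before and after every prompt change, and keep the history.
# evaluate("policy_extraction_v4")
```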
Here's a production prompt we built for a government client processing 200+ policy documents per week. It combines role assignment, few-shot learning, structured output, and explicit constraints:
```
SYSTEM PROMPT:
You are a policy analyst for UK central government. Your role is to
extract structured information from policy documents to populate a
searchable database. Accuracy is critical—when uncertain, mark fields
as "unclear" rather than guessing.

EXAMPLES:
Document: "The Carbon Budget Order 2021 sets the UK's sixth carbon
budget at 965 MtCO2e for 2033-2037..."
→ {
  "policy_area": "Climate",
  "effective_date": "2021-06-24",
  "key_figures": [{"metric": "carbon budget", "value": "965 MtCO2e", "period": "2033-2037"}],
  "departments": ["BEIS"],
  "compliance_deadline": null
}

Document: "Local authorities must publish their Local Plan within 30
months of the National Planning Policy Framework update..."
→ {
  "policy_area": "Planning",
  "effective_date": "unclear",
  "key_figures": [{"metric": "publication deadline", "value": "30 months", "period": "from NPPF update"}],
  "departments": ["DLUHC", "Local Authorities"],
  "compliance_deadline": "30 months from NPPF update"
}

USER PROMPT:
Extract policy details from the following document. Return valid JSON only.

[DOCUMENT TEXT HERE]
```
This prompt achieves 91% extraction accuracy with automatic quality flags for human review.
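The quality flags are deliberately simple: because the system prompt tells the model to write "unclear" rather than guess, a post-processing step only has to look for that marker. A minimal sketch (the exact rules below are ours to illustrate, not the client's):

```python
def needs_human_review(record: dict) -> bool:
    """Route an extracted record to a reviewer if any key field is missing or marked 'unclear'."""
    required = ("policy_area", "effective_date", "key_figures", "departments")
    return any(record.get(field) in (None, "unclear", []) for field in required)

# Usage: flagged records go to a review queue instead of straight into the database.
record = {"policy_area": "Planning", "effective_date": "unclear", "key_figures": [], "departments": ["DLUHC"]}
print(needs_human_review(record))  # True -> send to a human reviewer
```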
- Define a clear role and expertise level. Generic roles like "helpful assistant" give generic outputs.
- Provide 2-4 few-shot examples for classification and extraction tasks. Examples should cover edge cases, not just happy paths.
- Specify the output format explicitly (JSON schema, bullet points, etc.). Vague format requests lead to inconsistent outputs.
- Set up an evaluation test set before iterating. Without baselines, you can't measure improvement.
- Version control prompts alongside application code. Untracked prompt changes cause silent regressions.
The gap between "using AI" and "shipping reliable AI" is largely a prompting gap. Teams that treat prompts as first-class engineering artefacts—versioned, tested, reviewed—consistently outperform those that treat them as magic incantations.
Start with the fundamentals: clear roles, explicit constraints, few-shot examples, and structured outputs. Add an evaluation loop to measure progress. Within weeks, you'll see measurable improvements in accuracy, consistency, and downstream costs.
Need help building evaluation frameworks or improving prompt quality at scale? We set up prompt engineering practices for teams shipping production AI.
Learn about our method →