A financial services firm asked us to review their document processing pipeline. Their LLM was extracting contract clauses with 62% accuracy—barely better than keyword matching. The fix wasn't a model upgrade or fine-tuning. We rewrote their prompts using structured techniques, added few-shot examples, and implemented an evaluation loop. Within two weeks, accuracy hit 94%.
Prompting is the most underrated lever in production AI. Teams spend months on infrastructure while shipping prompts written in five minutes. This guide covers the techniques that separate amateur prompts from production-grade ones.
We've seen teams switch from GPT-4 to Claude to Gemini, hoping each new model will fix their quality issues. It rarely does. The problem is almost always the prompt. A well-structured prompt on a smaller model will outperform a vague prompt on a flagship model—at a fraction of the cost.
In production systems, prompts are code. They should be versioned, tested, and reviewed like any other critical component. Yet most teams treat them as afterthoughts—strings buried in application code, edited ad-hoc, never evaluated systematically.
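One practical way to get there is to keep each prompt in its own file in the repository and load it as a template, so every change shows up in diffs and code review. A minimal sketch (the file layout and helper below are illustrative, not a specific library):

```python
from pathlib import Path
from string import Template

# Hypothetical layout: one file per prompt (e.g. prompts/contract_summary_v3.txt), versioned in git.
PROMPT_DIR = Path("prompts")

def load_prompt(name: str, **variables: str) -> str:
    """Load a prompt template from the repo and fill in its variables."""
    template = Template((PROMPT_DIR / f"{name}.txt").read_text(encoding="utf-8"))
    return template.substitute(**variables)

# Usage: the application code never hard-codes the prompt string itself.
prompt = load_prompt("contract_summary_v3", contract_text="[CONTRACT TEXT HERE]")
```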
Every production prompt should have four components, in this order: a role, the context, the task, and the constraints.
Before: Vague prompt with poor results
[#ff7b72]">class="text-[#a5d6ff]">"Summarise [#ff7b72]">this contract."After: Structured prompt with predictable output
"You are a legal analyst specialising in UK commercial contracts. [#79c0ff]">CONTEXT:You are reviewing a supplier agreement [#ff7b72]">for a mid-sized manufacturingcompany. The legal team needs to understand key obligations quickly. [#79c0ff]">TASK:Extract and summarise the following [#ff7b72]">from the attached [#79c0ff]">contract:[#79c0ff]">1. Payment terms and penalties[#79c0ff]">2. Termination clauses and notice periods[#79c0ff]">3. Liability caps and indemnification provisions[#79c0ff]">4. Any unusual or non-standard clauses [#79c0ff]">CONSTRAINTS:- Use bullet points [#ff7b72]">for each section- Flag any clauses that deviate [#ff7b72]">from market standard- Maximum [#79c0ff]">500 words total- Quote specific clause numbers where relevant"These four techniques form the foundation of production-grade prompting. Master them before exploring anything else.
For reasoning tasks, asking the model to "think step by step" improves accuracy by 20-40%. The model shows its working, making errors easier to spot and fix.
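To make that concrete, here is the working the model should show for the council-tax example that follows (a quick check of the arithmetic in Python, not part of the prompt itself):

```python
# Figures taken from the example prompt below.
annual_charge = 1847.52                  # Band D annual charge
after_discount = annual_charge * 0.75    # 25% single person discount -> 1385.64
amount_due = after_discount - 123.40     # apply the £123.40 credit -> 1262.24
print(f"£{amount_due:.2f}")              # £1262.24
```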
"A council tax bill [#79c0ff]">shows:- Band D [#79c0ff]">property: £[#79c0ff]">1,[#79c0ff]">847.52 annual charge- Single person [#79c0ff]">discount: [#79c0ff]">25%- Previous balance carried [#79c0ff]">forward: £[#79c0ff]">123.40 credit Calculate the amount due [#ff7b72]">for the year after discounts and credits.Think through each calculation step before giving the final figure."Modern APIs separate prompts into roles. The system prompt defines behaviour, constraints, and persona—it's your "constitution" that applies to every interaction. The user prompt contains the specific request.
[#79c0ff]">System: "You are a triage assistant [#ff7b72]">for NHS [#79c0ff]">111. Your role is toask clarifying questions and categorise urgency, NOT to diagnose. [#79c0ff]">RULES:- Never suggest specific medications- Always recommend [#79c0ff]">999 [#ff7b72]">for chest pain, difficulty breathing, or stroke symptoms- Escalate to human when confidence is below [#79c0ff]">70%- Log reasoning [#ff7b72]">for all categorisation decisions" [#79c0ff]">User: [#ff7b72]">class="text-[#a5d6ff]">"I[#ff7b72]">class="text-[#a5d6ff]">'ve had a headache [#ff7b72]">for three days and paracetamol isn't helping."Provide 2-4 examples of input-output pairs. This is the single highest-impact technique for classification, extraction, and formatting tasks. The model learns your exact requirements from examples rather than descriptions.
"Classify these citizen enquiries [#ff7b72]">for a local [#79c0ff]">council: [#79c0ff]">Enquiry: [#ff7b72]">class="text-[#a5d6ff]">'The streetlight outside [#79c0ff]">42 Oak Road has been flickering [#ff7b72]">for weeks'→ [#79c0ff]">Category: Highways | [#79c0ff]">Priority: Low | [#79c0ff]">Department: Street Lighting [#79c0ff]">Enquiry: [#ff7b72]">class="text-[#a5d6ff]">'There's raw sewage flooding into my garden [#ff7b72]">from the drain'→ [#79c0ff]">Category: Environmental Health | [#79c0ff]">Priority: Urgent | [#79c0ff]">Department: Drainage [#79c0ff]">Enquiry: [#ff7b72]">class="text-[#a5d6ff]">'I need to know when my bin collection day is changing'→ [#79c0ff]">Category: Waste Services | [#79c0ff]">Priority: Low | [#79c0ff]">Department: Refuse [#79c0ff]">Enquiry: [#ff7b72]">class="text-[#a5d6ff]">'A tree branch fell and is blocking the pavement on Church Lane'"When output feeds into downstream systems, request JSON or other machine-readable formats. Most APIs now support "JSON mode" to guarantee valid responses.
"Extract invoice details and [#ff7b72]">return as [#79c0ff]">JSON: { [#ff7b72]">class="text-[#a5d6ff]">"vendor_name": string, [#ff7b72]">class="text-[#a5d6ff]">"invoice_number": string, [#ff7b72]">class="text-[#a5d6ff]">"date_issued": [#ff7b72]">class="text-[#a5d6ff]">"YYYY-MM-DD", [#ff7b72]">class="text-[#a5d6ff]">"line_items": [{ [#ff7b72]">class="text-[#a5d6ff]">"description": string, [#ff7b72]">class="text-[#a5d6ff]">"amount": number }], [#ff7b72]">class="text-[#a5d6ff]">"total_amount": number, [#ff7b72]">class="text-[#a5d6ff]">"vat_amount": number | [#79c0ff]">null, [#ff7b72]">class="text-[#a5d6ff]">"payment_due_date": [#ff7b72]">class="text-[#a5d6ff]">"YYYY-MM-DD" | [#79c0ff]">null} If a field cannot be determined [#ff7b72]">from the document, use [#79c0ff]">null."| Option | When to Use | Trade-offs |
|---|---|---|
| Chain-of-Thought | Reasoning, calculations, multi-step logic | Increases token usage by 20-50%, but improves accuracy significantly |
| Few-Shot Examples | Classification, extraction, formatting tasks | Requires good examples upfront, adds to prompt length |
| Structured Output | Data feeds into code or databases | May reduce output naturalness, requires schema maintenance |
| System Prompts | Consistent behaviour across conversations | Static context, can't adapt to individual interactions |
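For the structured-output row above, most providers expose "JSON mode" as a single request option that constrains the response to valid JSON. A minimal sketch, assuming the OpenAI Python SDK (parameter names vary between providers):

```python
import json
from openai import OpenAI

invoice_text = "[INVOICE TEXT HERE]"  # raw document text (placeholder)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    response_format={"type": "json_object"},  # ask the API to return valid JSON only
    messages=[
        {"role": "system", "content": "Extract invoice details and return valid JSON matching the agreed schema."},
        {"role": "user", "content": invoice_text},
    ],
)
invoice = json.loads(response.choices[0].message.content)  # safe to parse: the response is valid JSON
```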
The difference between amateur and production prompting is evaluation. Without measurement, you're guessing. Here's the loop we use with every client: build a representative test set with expected outputs, run the current prompt against it, score the results, change one thing, and run it again.
This loop transforms prompting from art into engineering. Teams using it typically see 30-50% improvement in output quality within the first month.
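A minimal version of that loop is a script that runs the current prompt over a labelled test set and reports a score, so every prompt change gets a number attached to it. A sketch under assumed conventions (the JSONL test-set format and the `run_prompt` helper are illustrative):

```python
import json

def run_prompt(prompt_version: str, input_text: str) -> dict:
    """Call the model with the named prompt version and parse its output."""
    ...  # wrap your chat API call and output parsing here

def evaluate(prompt_version: str, test_set_path: str = "tests/extraction_cases.jsonl") -> float:
    """Score a prompt version against a labelled test set and return accuracy."""
    with open(test_set_path, encoding="utf-8") as f:
        cases = [json.loads(line) for line in f]
    correct = sum(run_prompt(prompt_version, case["input"]) == case["expected"] for case in cases)
    accuracy = correct / len(cases)  # exact match; swap in field-level scoring where appropriate
    print(f"{prompt_version}: {accuracy:.1%} on {len(cases)} cases")
    return accuracy

# Usage: run before and after every prompt change, and keep the history.
# evaluate("policy_extraction_v4")
```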
Here's a production prompt we built for a government client processing 200+ policy documents per week. It combines role assignment, few-shot learning, structured output, and explicit constraints:
```
SYSTEM PROMPT:
You are a policy analyst for UK central government. Your role is to
extract structured information from policy documents to populate a
searchable database. Accuracy is critical—when uncertain, mark fields
as "unclear" rather than guessing.

EXAMPLES:
Document: "The Carbon Budget Order 2021 sets the UK's sixth carbon
budget at 965 MtCO2e for 2033-2037..."
→ {
  "policy_area": "Climate",
  "effective_date": "2021-06-24",
  "key_figures": [{"metric": "carbon budget", "value": "965 MtCO2e", "period": "2033-2037"}],
  "departments": ["BEIS"],
  "compliance_deadline": null
}

Document: "Local authorities must publish their Local Plan within 30
months of the National Planning Policy Framework update..."
→ {
  "policy_area": "Planning",
  "effective_date": "unclear",
  "key_figures": [{"metric": "publication deadline", "value": "30 months", "period": "from NPPF update"}],
  "departments": ["DLUHC", "Local Authorities"],
  "compliance_deadline": "30 months from NPPF update"
}

USER PROMPT:
Extract policy details from the following document. Return valid JSON only.

[DOCUMENT TEXT HERE]
```
This prompt achieves 91% extraction accuracy with automatic quality flags for human review.
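The quality flags are deliberately simple: because the system prompt tells the model to write "unclear" rather than guess, a post-processing step only has to look for that marker. A minimal sketch (the exact rules below are ours to illustrate, not the client's):

```python
def needs_human_review(record: dict) -> bool:
    """Route an extracted record to a reviewer if any key field is missing or marked 'unclear'."""
    required = ("policy_area", "effective_date", "key_figures", "departments")
    return any(record.get(field) in (None, "unclear", []) for field in required)

# Usage: flagged records go to a review queue instead of straight into the database.
record = {"policy_area": "Planning", "effective_date": "unclear", "key_figures": [], "departments": ["DLUHC"]}
print(needs_human_review(record))  # True -> send to a human reviewer
```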
- Define a clear role and expertise level. Generic roles like "helpful assistant" give generic outputs.
- Provide 2-4 few-shot examples for classification and extraction tasks. Examples should cover edge cases, not just happy paths.
- Specify the output format explicitly (JSON schema, bullet points, etc.). Vague format requests lead to inconsistent outputs.
- Set up an evaluation test set before iterating. Without baselines, you can't measure improvement.
- Version control prompts alongside application code. Untracked prompt changes cause silent regressions.
The gap between "using AI" and "shipping reliable AI" is largely a prompting gap. Teams that treat prompts as first-class engineering artefacts—versioned, tested, reviewed—consistently outperform those that treat them as magic incantations.
Start with the fundamentals: clear roles, explicit constraints, few-shot examples, and structured outputs. Add an evaluation loop to measure progress. Within weeks, you'll see measurable improvements in accuracy, consistency, and downstream costs.
Need help building evaluation frameworks or improving prompt quality at scale? We set up prompt engineering practices for teams shipping production AI.
Learn about our method →