Performance Optimization for LLMs and AI Agents in Production — Cutting Latency by 90% and Cost by 70%
The Performance Engineering Mindset for AI Systems
LLM and agent performance optimization is not the same as traditional software performance engineering. You cannot simply add more CPU. The bottleneck is not compute — it is the inherent latency of autoregressive token generation, the round-trip time to external model APIs, and the cost of processing unnecessary tokens.
The optimisation stack is layered: each layer builds on the previous, and together they can reduce average latency from 2,800ms to under 200ms and cut per-operation cost by 60-70% for steady-state production workloads.
Optimisation Layer 1: Semantic Caching
How it works
Semantic caching stores LLM responses keyed on the embedding of the input prompt. When a new request arrives, compute its embedding and search the cache for the closest prior prompt by cosine similarity (typical threshold: 0.92-0.96). If the best match clears the threshold, return the cached response at zero LLM inference cost; otherwise, send the request to the model and cache its response for future similar requests.
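A minimal sketch of the lookup path, assuming a FAISS inner-product index over L2-normalised prompt embeddings (so inner product equals cosine similarity); computing the embedding itself, TTL expiry, and eviction are left to the caller:

```python
import numpy as np
import faiss

class SemanticCache:
    """Cosine-similarity cache: prompt embeddings in a FAISS index, responses stored by position."""

    def __init__(self, embedding_dim: int, threshold: float = 0.94):
        # Inner product over L2-normalised vectors is cosine similarity.
        self.index = faiss.IndexFlatIP(embedding_dim)
        self.responses: list[str] = []
        self.threshold = threshold

    def _normalise(self, embedding: np.ndarray) -> np.ndarray:
        vec = embedding.astype("float32").reshape(1, -1)
        faiss.normalize_L2(vec)
        return vec

    def get(self, prompt_embedding: np.ndarray) -> str | None:
        if self.index.ntotal == 0:
            return None
        scores, ids = self.index.search(self._normalise(prompt_embedding), k=1)
        if scores[0][0] >= self.threshold:
            return self.responses[ids[0][0]]   # hit: no LLM inference needed
        return None                            # miss: caller invokes the model

    def put(self, prompt_embedding: np.ndarray, response: str) -> None:
        self.index.add(self._normalise(prompt_embedding))
        self.responses.append(response)
```

On a miss, the caller invokes the model as usual and writes the result back with put() so the next similar request can hit.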
Implementation considerations
Cache TTL must be set appropriately for the task type — short for time-sensitive queries (hours), long for stable document types (days to weeks). The similarity threshold controls the tradeoff between cache hit rate and output quality; tune empirically with a held-out quality evaluation set. Use a vector store (Faiss, Pinecone, Redis Vector) for the cache backend.
| Workload Type | Expected Cache Hit Rate | Recommended TTL | Quality Impact |
|---|---|---|---|
| Freight document extraction (B/L, AWB) | 40–65% | 7 days | Negligible — document structures are standardised |
| Invoice processing | 50–70% | 3 days | Low — invoice formats repeat within supplier |
| Conversational queries | 15–30% | 1 hour | Moderate — tune threshold carefully |
| Code generation | 25–45% | 24 hours | Low for similar patterns |
| Classification tasks | 60–80% | 30 days | Very low — classes are stable |
Optimisation Layer 2: Intelligent Model Routing
Not every task requires the largest, most capable model. Routing requests to the smallest model capable of meeting quality requirements is the highest-impact cost optimisation available after caching. The challenge: how do you know which model is "capable enough" for a given task?
Cascade routing: Send every request to a small, fast model first. If the small model's output confidence (or a downstream quality check) meets threshold, use it. If it does not, escalate to a larger model. This gives you small-model speed and cost on the majority of requests, with large-model quality on the minority that require it.
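A sketch of the cascade, with the small-model call, large-model call, and quality check injected as callables; the confidence score is whatever the small-model wrapper exposes (a mean token logprob, a validator score, or similar):

```python
from typing import Callable

def cascade_route(
    prompt: str,
    small_model: Callable[[str], tuple[str, float]],  # returns (text, confidence)
    large_model: Callable[[str], str],
    quality_check: Callable[[str], bool],
    confidence_threshold: float = 0.85,
) -> str:
    """Try the cheap model first; escalate only when confidence or validation fails."""
    draft, confidence = small_model(prompt)
    if confidence >= confidence_threshold and quality_check(draft):
        return draft            # the majority of requests stop here at small-model cost
    return large_model(prompt)  # only the remainder pays large-model cost and latency
```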
Task-based routing: Classify every incoming request by task type (extraction, classification, reasoning, generation) and route each class to a pre-selected model. This approach is more predictable than cascade routing and adds less per-request overhead, but it requires upfront evaluation to select the right model for each task class.
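Task-based routing can then reduce to a lookup table; the model names below are placeholders, the mapping comes from that upfront per-class evaluation, and a hypothetical classify callable does the lightweight task classification:

```python
from typing import Callable

# Placeholder model names; populate from your own per-task evaluation results.
TASK_MODEL_MAP = {
    "extraction":     "small-fast-model",
    "classification": "small-fast-model",
    "generation":     "mid-tier-model",
    "reasoning":      "large-capable-model",
}

def task_route(prompt: str, classify: Callable[[str], str]) -> str:
    """Pick a model for the request; classify is any cheap task classifier
    (keyword rules, a small fine-tuned model, or a one-token LLM call)."""
    task_type = classify(prompt)
    return TASK_MODEL_MAP.get(task_type, "large-capable-model")  # unknown tasks escalate by default
```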
Optimisation Layer 3: Model Quantisation and Inference Optimisation
For self-hosted models, quantisation reduces model weight precision (FP32 → INT8 → INT4) to shrink memory footprint and increase inference throughput. INT8 quantisation typically reduces memory by 2x and increases throughput by 1.5-2x with less than 1% quality degradation on most enterprise tasks. INT4 quantisation achieves 4x memory reduction with 2-4% quality impact.
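As one illustration, a sketch of loading a self-hosted checkpoint at INT8 or INT4 precision with Hugging Face Transformers and bitsandbytes; the model ID is a placeholder, and the exact memory and quality numbers depend on the model and task:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-model"  # placeholder: any causal LM checkpoint

# INT8: roughly 2x memory reduction, minimal quality loss on most enterprise tasks.
int8_config = BitsAndBytesConfig(load_in_8bit=True)

# INT4 (NF4): roughly 4x memory reduction, with a small additional quality impact.
int4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=int8_config,  # swap in int4_config for tighter memory budgets
    device_map="auto",
)
```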
Complementary inference optimisations: KV-cache management (share KV cache across requests with identical prefixes — especially effective for system prompts), speculative decoding (use a small draft model to generate token candidates that the large model verifies — 2-3x throughput improvement), and continuous batching (replace static batching with dynamic batching that processes tokens as they arrive).
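A brief serving-side sketch, assuming vLLM (which performs continuous batching by default) with its prefix-caching option enabled so requests sharing a long system prompt reuse the same KV cache; the model name and documents are placeholders, and speculative-decoding configuration is omitted:

```python
from vllm import LLM, SamplingParams

# Continuous batching is vLLM's default scheduler; prefix caching reuses the KV
# cache for the shared system-prompt prefix across requests.
llm = LLM(model="your-org/your-model", enable_prefix_caching=True)

system_prompt = "You are a freight-document extraction assistant."
documents = ["<bill of lading text>", "<air waybill text>"]
prompts = [f"{system_prompt}\n\nDocument:\n{doc}" for doc in documents]

outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=256))
for out in outputs:
    print(out.outputs[0].text)
```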
Optimisation Layer 4: Prompt Engineering for Performance
Prompt length directly drives inference cost and latency. Every token in the prompt is processed at inference time. Prompt bloat — accumulated through iterative additions without discipline — is the most common source of unexpected cost growth in production LLM deployments.
Prompt audit process: Monthly review of all production prompts. For each prompt, identify sections that were added to address edge cases that are now rare. Measure the average and P95 prompt token count. Profile each prompt section's contribution to output quality using ablation testing. Target: eliminate 20-30% of prompt tokens without measurable quality impact.
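A sketch of the ablation step, assuming the production prompt is maintained as named sections and a hypothetical run_eval function that scores a prompt variant against the held-out evaluation set:

```python
# Example section names and contents are placeholders for a real production prompt.
PROMPT_SECTIONS = {
    "role":       "You extract structured data from freight documents.",
    "schema":     "Return JSON with fields: shipper, consignee, weight_kg, ...",
    "edge_cases": "If the weight is given in lbs, convert to kg. ...",
    "formatting": "Do not wrap the output in markdown fences.",
}

def ablation_report(eval_set, run_eval) -> dict[str, float]:
    """run_eval(prompt, eval_set) -> quality score; higher is better (hypothetical hook)."""
    baseline = run_eval("\n\n".join(PROMPT_SECTIONS.values()), eval_set)
    deltas = {}
    for name in PROMPT_SECTIONS:
        variant = "\n\n".join(text for key, text in PROMPT_SECTIONS.items() if key != name)
        deltas[name] = baseline - run_eval(variant, eval_set)  # quality attributable to this section
    return deltas
```

Sections whose removal barely moves the metric are the first candidates for the 20-30% token reduction target.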
Optimisation Layer 5: Async and Batch Processing
Not every agent task needs to complete synchronously. Document processing, report generation, and background analysis can be queued for batch processing during off-peak hours. Batch inference at supported providers reduces cost by 30-50% versus real-time API calls for identical workloads. Async processing also improves user-perceived latency: the system acknowledges receipt immediately, processes in the background, and exposes a webhook or polling endpoint for completion.
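A minimal sketch of the async pattern using FastAPI background tasks with a polling endpoint; the in-memory job store and the run_extraction worker are stand-ins for a real queue and processing pipeline:

```python
import uuid
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()
jobs: dict[str, dict] = {}  # in-memory for brevity; use Redis or a database in production

def process_document(job_id: str, document: str) -> None:
    # run_extraction is a hypothetical worker: batch LLM extraction, report generation, etc.
    jobs[job_id] = {"status": "done", "result": run_extraction(document)}

@app.post("/documents")
async def submit(document: str, background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "processing"}
    background_tasks.add_task(process_document, job_id, document)
    # Acknowledge immediately; the client polls the status endpoint or receives a webhook.
    return {"job_id": job_id, "status": "accepted"}

@app.get("/documents/{job_id}")
async def status(job_id: str):
    return jobs.get(job_id, {"status": "not_found"})
```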
VoltusWave's platform includes semantic caching, intelligent model routing, async processing queues, and prompt governance tooling as production defaults. No separate performance engineering project required.
See Performance Benchmarks →