Performance Optimization for LLMs and AI Agents in Production — Cutting Latency by 90% and Cost by 70%
The Performance Engineering Mindset for AI Systems
LLM and agent performance optimization is not the same as traditional software performance engineering. You cannot simply add more CPU. The bottleneck is not compute — it is the inherent latency of autoregressive token generation, the round-trip time to external model APIs, and the cost of processing unnecessary tokens.
The optimisation stack is layered: each layer builds on the previous, and together they can reduce average latency from 2,800ms to under 200ms and cut per-operation cost by 60-70% for steady-state production workloads.
Optimisation Layer 1: Semantic Caching
How it works
Semantic caching stores LLM responses keyed on the embedding of the input prompt. When a new request arrives, compute its embedding and search the cache for the closest prior prompt by cosine similarity (typical threshold: 0.92-0.96). If the best match clears the threshold, return the cached response at zero LLM inference cost; otherwise, send the request to the model and cache its response for future similar requests.
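A minimal sketch of the lookup path, assuming a FAISS inner-product index over L2-normalised prompt embeddings (so inner product equals cosine similarity); computing the embedding itself, TTL expiry, and eviction are left to the caller:

```python
import numpy as np
import faiss

class SemanticCache:
    """Cosine-similarity cache: prompt embeddings in a FAISS index, responses stored by position."""

    def __init__(self, embedding_dim: int, threshold: float = 0.94):
        # Inner product over L2-normalised vectors is cosine similarity.
        self.index = faiss.IndexFlatIP(embedding_dim)
        self.responses: list[str] = []
        self.threshold = threshold

    def _normalise(self, embedding: np.ndarray) -> np.ndarray:
        vec = embedding.astype("float32").reshape(1, -1)
        faiss.normalize_L2(vec)
        return vec

    def get(self, prompt_embedding: np.ndarray) -> str | None:
        if self.index.ntotal == 0:
            return None
        scores, ids = self.index.search(self._normalise(prompt_embedding), k=1)
        if scores[0][0] >= self.threshold:
            return self.responses[ids[0][0]]   # hit: no LLM inference needed
        return None                            # miss: caller invokes the model

    def put(self, prompt_embedding: np.ndarray, response: str) -> None:
        self.index.add(self._normalise(prompt_embedding))
        self.responses.append(response)
```

On a miss, the caller invokes the model as usual and writes the result back with put() so the next similar request can hit.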
Implementation considerations
Cache TTL must be set appropriately for the task type — short for time-sensitive queries (hours), long for stable document types (days to weeks). The similarity threshold controls the tradeoff between cache hit rate and output quality; tune empirically with a held-out quality evaluation set. Use a vector store (Faiss, Pinecone, Redis Vector) for the cache backend.
| Workload Type | Expected Cache Hit Rate | Recommended TTL | Quality Impact |
|---|---|---|---|
| Freight document extraction (B/L, AWB) | 40–65% | 7 days | Negligible — document structures are standardised |
| Invoice processing | 50–70% | 3 days | Low — invoice formats repeat within supplier |
| Conversational queries | 15–30% | 1 hour | Moderate — tune threshold carefully |
| Code generation | 25–45% | 24 hours | Low for similar patterns |
| Classification tasks | 60–80% | 30 days | Very low — classes are stable |
Optimisation Layer 2: Intelligent Model Routing
Not every task requires the largest, most capable model. Routing requests to the smallest model capable of meeting quality requirements is the highest-impact cost optimisation available after caching. The challenge: how do you know which model is "capable enough" for a given task?
Cascade routing: Send every request to a small, fast model first. If the small model's output confidence (or a downstream quality check) meets threshold, use it. If it does not, escalate to a larger model. This gives you small-model speed and cost on the majority of requests, with large-model quality on the minority that require it.
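A sketch of the cascade, with the small-model call, large-model call, and quality check injected as callables; the confidence score is whatever the small-model wrapper exposes (a mean token logprob, a validator score, or similar):

```python
from typing import Callable

def cascade_route(
    prompt: str,
    small_model: Callable[[str], tuple[str, float]],  # returns (text, confidence)
    large_model: Callable[[str], str],
    quality_check: Callable[[str], bool],
    confidence_threshold: float = 0.85,
) -> str:
    """Try the cheap model first; escalate only when confidence or validation fails."""
    draft, confidence = small_model(prompt)
    if confidence >= confidence_threshold and quality_check(draft):
        return draft            # the majority of requests stop here at small-model cost
    return large_model(prompt)  # only the remainder pays large-model cost and latency
```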
Task-based routing: Classify every incoming request by task type (extraction, classification, reasoning, generation) and route each class to a pre-selected model. This approach is more predictable than cascade routing and adds less per-request overhead, but it requires upfront evaluation to select the right model for each task class.
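Task-based routing can then reduce to a lookup table; the model names below are placeholders, the mapping comes from that upfront per-class evaluation, and a hypothetical classify callable does the lightweight task classification:

```python
from typing import Callable

# Placeholder model names; populate from your own per-task evaluation results.
TASK_MODEL_MAP = {
    "extraction":     "small-fast-model",
    "classification": "small-fast-model",
    "generation":     "mid-tier-model",
    "reasoning":      "large-capable-model",
}

def task_route(prompt: str, classify: Callable[[str], str]) -> str:
    """Pick a model for the request; classify is any cheap task classifier
    (keyword rules, a small fine-tuned model, or a one-token LLM call)."""
    task_type = classify(prompt)
    return TASK_MODEL_MAP.get(task_type, "large-capable-model")  # unknown tasks escalate by default
```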
Optimisation Layer 3: Model Quantisation and Inference Optimisation
For self-hosted models, quantisation reduces model weight precision (FP32 → INT8 → INT4) to shrink memory footprint and increase inference throughput. INT8 quantisation typically reduces memory by 2x and increases throughput by 1.5-2x with less than 1% quality degradation on most enterprise tasks. INT4 quantisation achieves 4x memory reduction with 2-4% quality impact.
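As one illustration, a sketch of loading a self-hosted checkpoint at INT8 or INT4 precision with Hugging Face Transformers and bitsandbytes; the model ID is a placeholder, and the exact memory and quality numbers depend on the model and task:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-model"  # placeholder: any causal LM checkpoint

# INT8: roughly 2x memory reduction, minimal quality loss on most enterprise tasks.
int8_config = BitsAndBytesConfig(load_in_8bit=True)

# INT4 (NF4): roughly 4x memory reduction, with a small additional quality impact.
int4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=int8_config,  # swap in int4_config for tighter memory budgets
    device_map="auto",
)
```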
Complementary inference optimisations: KV-cache management (share KV cache across requests with identical prefixes — especially effective for system prompts), speculative decoding (use a small draft model to generate token candidates that the large model verifies — 2-3x throughput improvement), and continuous batching (replace static batching with dynamic batching that processes tokens as they arrive).
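A brief serving-side sketch, assuming vLLM (which performs continuous batching by default) with its prefix-caching option enabled so requests sharing a long system prompt reuse the same KV cache; the model name and documents are placeholders, and speculative-decoding configuration is omitted:

```python
from vllm import LLM, SamplingParams

# Continuous batching is vLLM's default scheduler; prefix caching reuses the KV
# cache for the shared system-prompt prefix across requests.
llm = LLM(model="your-org/your-model", enable_prefix_caching=True)

system_prompt = "You are a freight-document extraction assistant."
documents = ["<bill of lading text>", "<air waybill text>"]
prompts = [f"{system_prompt}\n\nDocument:\n{doc}" for doc in documents]

outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=256))
for out in outputs:
    print(out.outputs[0].text)
```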
Optimisation Layer 4: Prompt Engineering for Performance
Prompt length directly drives inference cost and latency. Every token in the prompt is processed at inference time. Prompt bloat — accumulated through iterative additions without discipline — is the most common source of unexpected cost growth in production LLM deployments.
Prompt audit process: Monthly review of all production prompts. For each prompt, identify sections that were added to address edge cases that are now rare. Measure the average and P95 prompt token count. Profile each prompt section's contribution to output quality using ablation testing. Target: eliminate 20-30% of prompt tokens without measurable quality impact.
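A sketch of the ablation step, assuming the production prompt is maintained as named sections and a hypothetical run_eval function that scores a prompt variant against the held-out evaluation set:

```python
# Example section names and contents are placeholders for a real production prompt.
PROMPT_SECTIONS = {
    "role":       "You extract structured data from freight documents.",
    "schema":     "Return JSON with fields: shipper, consignee, weight_kg, ...",
    "edge_cases": "If the weight is given in lbs, convert to kg. ...",
    "formatting": "Do not wrap the output in markdown fences.",
}

def ablation_report(eval_set, run_eval) -> dict[str, float]:
    """run_eval(prompt, eval_set) -> quality score; higher is better (hypothetical hook)."""
    baseline = run_eval("\n\n".join(PROMPT_SECTIONS.values()), eval_set)
    deltas = {}
    for name in PROMPT_SECTIONS:
        variant = "\n\n".join(text for key, text in PROMPT_SECTIONS.items() if key != name)
        deltas[name] = baseline - run_eval(variant, eval_set)  # quality attributable to this section
    return deltas
```

Sections whose removal barely moves the metric are the first candidates for the 20-30% token reduction target.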
Optimisation Layer 5: Async and Batch Processing
Not every agent task needs to complete synchronously. Document processing, report generation, and background analysis can be queued for batch processing during off-peak hours. Batch inference at supported providers reduces cost by 30-50% versus real-time API calls for identical workloads. Async processing also improves user-perceived latency: the system acknowledges receipt immediately, processes in the background, and exposes a webhook or polling endpoint for completion.
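A minimal sketch of the async pattern using FastAPI background tasks with a polling endpoint; the in-memory job store and the run_extraction worker are stand-ins for a real queue and processing pipeline:

```python
import uuid
from fastapi import FastAPI, BackgroundTasks

app = FastAPI()
jobs: dict[str, dict] = {}  # in-memory for brevity; use Redis or a database in production

def process_document(job_id: str, document: str) -> None:
    # run_extraction is a hypothetical worker: batch LLM extraction, report generation, etc.
    jobs[job_id] = {"status": "done", "result": run_extraction(document)}

@app.post("/documents")
async def submit(document: str, background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "processing"}
    background_tasks.add_task(process_document, job_id, document)
    # Acknowledge immediately; the client polls the status endpoint or receives a webhook.
    return {"job_id": job_id, "status": "accepted"}

@app.get("/documents/{job_id}")
async def status(job_id: str):
    return jobs.get(job_id, {"status": "not_found"})
```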
VoltusWave's platform includes semantic caching, intelligent model routing, async processing queues, and prompt governance tooling as production defaults. No separate performance engineering project required.
See Performance Benchmarks →