How to Reduce LLM Latency in Production by 25x
An AI startup with 12,000 active users came to us with one problem: their RAG system took 28 seconds to respond. Users were churning. We reduced latency to 1.1 seconds in 5 days — without changing the model or rewriting the application.
Why 28 Seconds Is a Disaster
A user asks a question in the chat. After 5 seconds, they start to suspect something is broken. After 10, they close the tab. After 28, they have already switched to a competitor.
Industry studies suggest conversion drops by roughly 7% for every additional second of delay. At 28 seconds, you lose most users before they ever see a response.
Diagnosis: Where Time Is Lost
First, we profiled the request end-to-end. Here's what we found:
| Stage | Time (before) | Time (after) |
|---|---|---|
| Query embedding | 4.2s | 0.08s |
| Vector DB search | 8.1s | 0.12s |
| Context loading | 3.4s | 0.05s |
| LLM inference | 12.3s | 0.85s |
| Total | 28.0s | 1.10s |
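To reproduce this kind of breakdown, a small timing helper around each stage is enough. Below is a minimal sketch; the `timed()` helper and the stage names are illustrative, not code from the project.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Hypothetical usage inside the request handler:
# timings = {}
# with timed("embedding", timings):
#     vector = embed(query)
# with timed("vector_search", timings):
#     hits = search(vector)
# log.info("latency breakdown: %s", timings)
```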
What We Did: 5 Concrete Changes
Moved the embedding model to an optimized runtime
The original model ran through Hugging Face Transformers without optimizations. We switched to ONNX Runtime with INT8 quantization. Single query embedding: 4.2s → 0.08s.
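For reference, here is roughly what that path looks like. This is a hedged sketch: the model id (`sentence-transformers/all-MiniLM-L6-v2`) and file names are placeholders, since the article doesn't name the original embedding model.

```python
# Sketch of the ONNX Runtime + INT8 path. Model id and file paths are placeholders.
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic
from transformers import AutoTokenizer

# One-time step: export the model to ONNX (e.g. with `optimum-cli export onnx ...`),
# then quantize the weights to INT8.
quantize_dynamic("embedder.onnx", "embedder-int8.onnx", weight_type=QuantType.QInt8)

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
session = ort.InferenceSession("embedder-int8.onnx", providers=["CPUExecutionProvider"])

def embed(text):
    """Tokenize, run the quantized ONNX model, and mean-pool the token embeddings."""
    inputs = tokenizer(text, return_tensors="np", truncation=True)
    expected = {i.name for i in session.get_inputs()}
    feed = {k: v for k, v in inputs.items() if k in expected}
    hidden = session.run(None, feed)[0]              # (1, seq_len, dim)
    mask = inputs["attention_mask"][..., None]       # (1, seq_len, 1)
    return (hidden * mask).sum(axis=1) / mask.sum(axis=1)
```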
Deployed Qdrant instead of FAISS in-memory
FAISS reloaded the entire index into memory on every restart, which took 8 minutes. With Qdrant's persistent storage and HNSW index, search takes 0.12s and startup is instant.
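A minimal version of that setup with the `qdrant-client` library might look like this; the collection name, vector size, and HNSW parameters are illustrative assumptions.

```python
# Sketch of the Qdrant setup with on-disk persistence and an HNSW index.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=128),  # HNSW graph parameters
)

# Index documents once; Qdrant persists them to disk, so restarts are instant.
client.upsert(
    collection_name="docs",
    points=[PointStruct(id=1, vector=[0.0] * 384, payload={"text": "example chunk"})],
)

# Per-request search (~0.12s in this setup).
hits = client.search(collection_name="docs", query_vector=[0.0] * 384, limit=5)
```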
Switched from Ollama to vLLM with PagedAttention
Ollama doesn't support continuous batching, so concurrent requests queue behind each other. vLLM with PagedAttention handles parallel requests efficiently. LLM inference: 12.3s → 0.85s with the same Mistral 7B model.
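A minimal sketch of the vLLM side, assuming the `mistralai/Mistral-7B-Instruct-v0.2` checkpoint (the article only says "Mistral 7B") and typical sampling settings:

```python
# Sketch of serving Mistral 7B with vLLM. Model id and sampling parameters are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.2, max_tokens=512)

# vLLM batches these prompts under the hood (continuous batching + PagedAttention),
# so throughput scales with concurrent requests instead of degrading.
outputs = llm.generate(["Question 1 ...", "Question 2 ..."], params)
for out in outputs:
    print(out.outputs[0].text)
```

In production you would more likely run vLLM's OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server --model ...`) and send concurrent HTTP requests, letting continuous batching do the rest.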
Added semantic caching
Similar questions (cosine similarity > 0.92) return a cached response instantly. 35% of requests are served from cache in < 50ms.
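Conceptually, the cache is just "nearest stored query above a similarity threshold". A minimal in-memory sketch is below; a production version would more likely live in Redis or a dedicated Qdrant collection, but the logic is the same.

```python
import numpy as np
from typing import Optional

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # list of (normalized query embedding, cached answer)

    def get(self, query_vec) -> Optional[str]:
        """Return a cached answer if some stored query is similar enough."""
        q = query_vec / np.linalg.norm(query_vec)
        for vec, answer in self.entries:
            if float(np.dot(q, vec)) > self.threshold:
                return answer
        return None

    def put(self, query_vec, answer) -> None:
        self.entries.append((query_vec / np.linalg.norm(query_vec), answer))
```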
Parallelized embedding and retrieval
Query embedding and DB retrieval ran sequentially. We ran them in parallel via asyncio.gather(). Savings: ~200ms per request.
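Note that this only pays off when the two steps are genuinely independent, for example a dense query embedding alongside a lexical lookup that doesn't need the vector. A minimal sketch with stubbed helpers (the function names are hypothetical):

```python
import asyncio

async def embed_query(query):
    # Placeholder for the ONNX embedding call (~80 ms).
    await asyncio.sleep(0.08)
    return [0.0] * 384

async def keyword_search(query):
    # Placeholder for a lexical lookup that does not depend on the embedding.
    await asyncio.sleep(0.05)
    return ["chunk-42"]

async def retrieve(query):
    # The two independent steps run concurrently instead of back to back.
    query_vec, keyword_hits = await asyncio.gather(
        embed_query(query),
        keyword_search(query),
    )
    return query_vec, keyword_hits

asyncio.run(retrieve("What is our refund policy?"))
```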
Final Architecture
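Putting the five changes together, the request path looks roughly like this. All helpers below are stubs standing in for the real components; the timings in the comments are the figures from the table above.

```python
import asyncio
from typing import Optional

async def check_cache(query) -> Optional[str]:
    return None            # semantic cache lookup, < 50 ms; ~35% hit rate

async def embed(query):
    return [0.0] * 384     # ONNX Runtime INT8 embedding, ~0.08 s

async def keyword_lookup(query):
    return []              # independent lexical lookup, runs alongside embed()

async def vector_search(vec):
    return ["chunk-1"]     # Qdrant HNSW search, ~0.12 s

async def generate(query, context):
    return "answer"        # vLLM + Mistral 7B, ~0.85 s

async def answer(query):
    cached = await check_cache(query)
    if cached is not None:
        return cached
    vec, extra = await asyncio.gather(embed(query), keyword_lookup(query))
    context = await vector_search(vec)
    return await generate(query, context + extra)

print(asyncio.run(answer("How do I export my data?")))
```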
Key Takeaways
1. Most latency problems are not in the model, but in the infrastructure around it.
2. ONNX Runtime + quantization gives a 10-50x embedding speedup with negligible quality loss.
3. vLLM with PagedAttention is the standard for production LLM serving.
4. Semantic caching reduces GPU load by 30-40%.
5. Profile each stage separately — the problem may not be where you expect.
Have a Similar Problem?
We specialize in production LLM infrastructure. Tell us about your stack — we'll propose a concrete plan.