LLM · RAG · Production · vLLM

How to Reduce LLM Latency in Production by 25x

InfoScale Team · 10 February 2026 · 8 min read

An AI startup with 12,000 active users came to us with one problem: their RAG system took 28 seconds to respond. Users were churning. We reduced latency to 1.1 seconds in 5 days — without changing the model or rewriting the application.

Why 28 Seconds Is a Disaster

A user asks a question in the chat. After 5 seconds, they start thinking something is broken. After 10 — they close the tab. After 28 seconds — they switch to a competitor.

Research on page latency consistently shows conversion dropping roughly 7% for each additional second of delay. At 28 seconds, you lose most users before they ever see a response.

Diagnosis: Where Time Is Lost

First, we profiled the request end-to-end. Here's what we found:

| Stage | Time (before) | Time (after) |
| --- | --- | --- |
| Query embedding | 4.2 s | 0.08 s |
| Vector DB search | 8.1 s | 0.12 s |
| Context loading | 3.4 s | 0.05 s |
| LLM inference | 12.3 s | 0.85 s |
| **Total** | **28.0 s** | **1.10 s** |

What We Did: 5 Concrete Changes

01. Replaced the embedding model with an optimized one

The original model ran through Hugging Face Transformers without optimizations. We switched to ONNX Runtime with INT8 quantization. Single query embedding: 4.2s → 0.08s.
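To see why INT8 costs almost no quality, here is a toy, plain-Python illustration of the symmetric quantization that tools like ONNX Runtime's `quantize_dynamic` apply per tensor (the real API works on exported model weights; these numbers are made up):

```python
def quantize_int8(values):
    """Symmetric INT8 quantization: map floats to [-127, 127] via one scale."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.82, -1.54, 0.03, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Round-trip error per weight is bounded by scale/2 (~0.006 here), which is
# why INT8 embedding models lose essentially no retrieval quality.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(max_err, 4))
```

The speedup comes from doing matrix multiplies in 8-bit integer arithmetic instead of FP32, while the dequantization scale keeps outputs numerically close to the original.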

02. Deployed Qdrant instead of in-memory FAISS

FAISS reloaded the entire index on every restart (8 minutes). Qdrant with persistent storage and HNSW index: search 0.12s, instant startup.
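A minimal sketch of such a setup, assuming the `qdrant-client` package and a Qdrant instance on the default local port; the collection name, vector size, and payload fields are illustrative, not taken from the article:

```python
def as_points(ids, vectors, payloads):
    """Package documents as point dicts for upsert (pure helper)."""
    return [
        {"id": i, "vector": v, "payload": p}
        for i, v, p in zip(ids, vectors, payloads)
    ]

def build_collection(points, url="http://localhost:6333"):
    # Requires `pip install qdrant-client` and a running Qdrant server.
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    client = QdrantClient(url=url)
    # HNSW is Qdrant's default index, and data is persisted on disk,
    # so a restart is instant -- no multi-minute index reload.
    client.recreate_collection(
        collection_name="docs",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
    client.upsert(
        collection_name="docs",
        points=[PointStruct(**p) for p in points],
    )
    return client
```

Search is then a single call per query, e.g. `client.search(collection_name="docs", query_vector=emb, limit=5)`.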

03. Switched from Ollama to vLLM with PagedAttention

Ollama doesn't support continuous batching, so concurrent requests queue up behind each other. vLLM with PagedAttention handles parallel requests efficiently. LLM inference: 12.3s → 0.85s with the same Mistral 7B model.
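A sketch of vLLM's offline API (a team could equally run `vllm serve` behind an OpenAI-compatible endpoint); the Hugging Face model id and the prompt template are assumptions, since the article only says "Mistral 7B":

```python
def build_prompt(question, contexts):
    """Assemble a RAG prompt from retrieved chunks (illustrative template)."""
    context_block = "\n\n".join(contexts)
    return (
        f"Answer using only this context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

def generate(prompts):
    # Requires `pip install vllm` and a GPU. PagedAttention and continuous
    # batching are on by default: passing many prompts at once lets vLLM
    # interleave their decoding steps instead of serving them one by one.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
    params = SamplingParams(temperature=0.2, max_tokens=256)
    return [out.outputs[0].text for out in llm.generate(prompts, params)]
```

The key point is that throughput scaling comes from the serving engine, not the model: the weights are unchanged.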

04. Added semantic caching

Similar questions (cosine similarity > 0.92) return a cached response instantly. 35% of requests are served from cache in < 50ms.
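The idea in a minimal in-memory sketch (the production version keeps the vectors in Redis; the 0.92 threshold is the article's, everything else is illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a cached answer when a new query embedding is close enough."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, embedding):
        best = max(self.entries, key=lambda e: cosine(e[0], embedding),
                   default=None)
        if best is not None and cosine(best[0], embedding) > self.threshold:
            return best[1]
        return None  # cache miss: run the full RAG pipeline

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

cache = SemanticCache()
cache.put([1.0, 0.0, 0.0], "Paris")
print(cache.get([0.99, 0.05, 0.0]))  # near-duplicate query -> "Paris"
print(cache.get([0.0, 1.0, 0.0]))    # unrelated query -> None
```

The linear scan here is fine for a sketch; at scale the nearest-neighbor lookup itself would go through the vector index.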

05. Parallelized embedding and retrieval

Query embedding and DB retrieval ran sequentially. We ran them in parallel via asyncio.gather(). Savings: ~200ms per request.
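The pattern, with stub coroutines standing in for the real calls (names and delays are illustrative; the leg that can run alongside the embedding must be one that doesn't depend on it, e.g. a keyword search):

```python
import asyncio
import time

async def embed_query(query):
    await asyncio.sleep(0.1)  # stands in for the embedding call
    return [0.1, 0.2, 0.3]

async def keyword_search(query):
    await asyncio.sleep(0.1)  # stands in for a non-vector retrieval leg
    return ["doc-17", "doc-42"]

async def retrieve(query):
    # Both legs run concurrently instead of back-to-back.
    embedding, keyword_hits = await asyncio.gather(
        embed_query(query), keyword_search(query)
    )
    return embedding, keyword_hits

start = time.perf_counter()
embedding, hits = asyncio.run(retrieve("how do I reduce latency?"))
elapsed = time.perf_counter() - start
print(hits, round(elapsed, 1))  # total wall time ~0.1s, not 0.2s
```

With `asyncio.gather`, total latency is the max of the two stages rather than their sum, which is where the ~200ms saving comes from.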

Final Architecture

# RAG Pipeline — Optimized Stack
Embedding: ONNX Runtime + INT8 quantization
Vector DB: Qdrant (persistent, HNSW index)
LLM serving: vLLM + PagedAttention + continuous batching
Cache: Redis + semantic similarity (cosine > 0.92)
Orchestration: Kubernetes + GPU node pool
Monitoring: Prometheus + Grafana + LLM-specific metrics
# Result: p50 latency 1.1s, p95 latency 2.3s

Key Takeaways

  1. Most latency problems are not in the model, but in the infrastructure around it
  2. ONNX Runtime + quantization gives a 10-50x embedding speedup without quality loss
  3. vLLM with PagedAttention is the standard for production LLM serving
  4. Semantic caching reduces GPU load by 30-40%
  5. Profile each stage separately — the problem may not be obvious

Have a Similar Problem?

We specialize in production LLM infrastructure. Tell us about your stack — we'll propose a concrete plan.