Semantic Cache

Laghav uses Redis Stack vector similarity search to serve repeated and near-identical queries from cache: zero LLM cost, zero quality loss.

How it works

  1. The compressed prompt is embedded using all-MiniLM-L6-v2 via the compressor service.
  2. Laghav searches the Redis Stack vector index for embeddings with cosine similarity > 0.92 (distance < 0.08).
  3. On a hit, the cached response is returned immediately with the header X-Laghav-Cache: semantic.
  4. On a miss, the LLM is called, and the response and its embedding are stored in Redis (TTL: 30 minutes).
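
The lookup flow in steps 2 to 4 can be sketched in plain Python. This is a minimal illustration, not Laghav's implementation: the in-memory list stands in for the Redis Stack vector index, and the names InMemorySemanticCache, complete_with_cache, and call_llm are hypothetical. Only the 0.92 similarity threshold and the 1800-second TTL come from the steps above.

```python
import math

SIMILARITY_THRESHOLD = 0.92  # documented: cosine similarity > 0.92
TTL_SECONDS = 1800           # documented: 30 minutes


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Hypothetical stand-in for the Redis Stack vector index."""

    def __init__(self, threshold=SIMILARITY_THRESHOLD, ttl=TTL_SECONDS):
        self.threshold = threshold
        self.ttl = ttl
        self._entries = []  # (embedding, response, expires_at)

    def lookup(self, embedding, now):
        """Return the best unexpired response strictly above the threshold, else None."""
        best_response, best_score = None, self.threshold
        for stored, response, expires_at in self._entries:
            if expires_at <= now:
                continue  # entry has outlived its TTL
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_response, best_score = response, score
        return best_response

    def store(self, embedding, response, now):
        self._entries.append((list(embedding), response, now + self.ttl))


def complete_with_cache(cache, embedding, call_llm, now):
    """Search the cache; on a miss, call the LLM and store the result."""
    cached = cache.lookup(embedding, now)
    if cached is not None:
        return cached, True   # hit: no LLM call
    response = call_llm()     # miss: call_llm is an assumed zero-argument callable
    cache.store(embedding, response, now)
    return response, False
```

Passing the current time in as `now` keeps the sketch deterministic; a real deployment would rely on Redis key expiry instead of checking timestamps by hand.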

Cache key schema

bash
laghav:semcache:{tenant_id}:{vec_hash} # TTL: 1800s (30 min)
laghav:dedup:{tenant_id}:{exact_hash} # TTL: 900s (15 min) — exact match cache

Tenant isolation

The cache is always scoped to your tenant_id, so you never receive another customer's cached response.
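
To show how the key schema enforces that scoping, here is a small sketch that builds keys in the documented format. The function names are hypothetical, and the use of SHA-256 for exact_hash is an assumption; this page does not say how either hash is computed.

```python
import hashlib


def exact_hash(prompt: str) -> str:
    # Assumption: SHA-256 of the prompt text. The real hash function is not documented here.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def dedup_key(tenant_id: str, prompt: str) -> str:
    """Exact-match cache key: laghav:dedup:{tenant_id}:{exact_hash}."""
    return f"laghav:dedup:{tenant_id}:{exact_hash(prompt)}"


def semcache_key(tenant_id: str, vec_hash: str) -> str:
    """Semantic cache key: laghav:semcache:{tenant_id}:{vec_hash}."""
    return f"laghav:semcache:{tenant_id}:{vec_hash}"
```

Because tenant_id is part of every key, the same prompt sent by two tenants maps to two different keys and can never share a cache entry.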

Detecting a cache hit

cache_hit.py
response = client.complete(messages=messages, model="auto")
if response.laghav_meta.cache_hit:
    print("Served from cache: $0 LLM cost")
    print(f"Latency overhead: {response.laghav_meta.latency_overhead_ms}ms")  # typically 3–8ms
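
The same cache_hit flag can be aggregated over a batch of responses to estimate a hit rate. This is a sketch only: cache_hit_rate is a hypothetical helper, and the SimpleNamespace objects below merely imitate the response.laghav_meta.cache_hit attribute shown above.

```python
from types import SimpleNamespace


def cache_hit_rate(responses):
    """Fraction of responses whose laghav_meta.cache_hit is true (0.0 for an empty batch)."""
    responses = list(responses)
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if r.laghav_meta.cache_hit)
    return hits / len(responses)


def fake_response(hit):
    # Stand-in for a real client response; only the cache_hit attribute is modelled.
    return SimpleNamespace(laghav_meta=SimpleNamespace(cache_hit=hit))


batch = [fake_response(True), fake_response(False), fake_response(True), fake_response(True)]
rate = cache_hit_rate(batch)
```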

Clearing the cache

bash
# Clear all cache types
curl -X POST https://api.laghav.ai/api/cache/clear \
  -H "Authorization: Bearer lgh_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{"scope": "all"}'
# Scope options: "dedup" | "semantic" | "routing" | "all"
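
The same request can be built from Python with the standard library alone. This is a sketch that mirrors the curl call above; the helper name build_clear_request and the client-side scope check are assumptions, not part of any Laghav SDK.

```python
import json
import urllib.request

CLEAR_URL = "https://api.laghav.ai/api/cache/clear"
VALID_SCOPES = {"dedup", "semantic", "routing", "all"}


def build_clear_request(api_key: str, scope: str = "all") -> urllib.request.Request:
    """Build (but do not send) the POST request that clears the given cache scope."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {sorted(VALID_SCOPES)}, got {scope!r}")
    body = json.dumps({"scope": scope}).encode("utf-8")
    return urllib.request.Request(
        CLEAR_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# To send it: urllib.request.urlopen(build_clear_request("lgh_live_xxx", "semantic"))
```

Validating the scope before sending catches typos locally instead of relying on the server to reject them.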

Disabling cache per-call

no_cache.py
# Useful for time-sensitive queries ("what's today's news?")
response = client.complete(
    messages=messages,
    model="auto",
    laghav_options={"cache": False},
)