Semantic Cache
Laghav uses Redis Stack vector similarity search to serve repeat and near-identical queries from cache — zero LLM cost, zero quality loss.
How it works
- The compressed prompt is embedded using all-MiniLM-L6-v2 via the compressor service.
- Laghav searches the Redis Stack vector index for embeddings with cosine similarity > 0.92 (distance < 0.08).
- On a hit: the cached response is returned immediately with the X-Laghav-Cache: semantic header.
- On a miss: the LLM is called, and the response and its embedding are stored in Redis (TTL: 30 minutes).
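The hit/miss flow above can be sketched in a few lines of Python. This is an illustration only, not Laghav's implementation: `ToySemanticCache` and `complete` are hypothetical names, a plain in-memory list stands in for the Redis Stack vector index, and embeddings are passed in directly rather than computed with all-MiniLM-L6-v2. Only the 0.92 threshold and the 30-minute TTL come from the text.

```python
import math
import time

SIMILARITY_THRESHOLD = 0.92  # from the text: cosine similarity > 0.92 (distance < 0.08)
TTL_SECONDS = 30 * 60        # from the text: entries live for 30 minutes


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToySemanticCache:
    """Hypothetical in-memory stand-in for the Redis Stack vector index."""

    def __init__(self):
        self._entries = []  # list of (embedding, response, expires_at)

    def lookup(self, embedding, now=None):
        """Return the best unexpired response above the threshold, or None."""
        now = time.time() if now is None else now
        best_response, best_score = None, SIMILARITY_THRESHOLD
        for cached_embedding, response, expires_at in self._entries:
            if expires_at <= now:
                continue  # expired entries never match
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_response, best_score = response, score
        return best_response

    def store(self, embedding, response, now=None):
        now = time.time() if now is None else now
        self._entries.append((embedding, response, now + TTL_SECONDS))


def complete(cache, embedding, call_llm):
    """Return (response, cache_hit); the LLM is only called on a miss."""
    cached = cache.lookup(embedding)
    if cached is not None:
        return cached, True
    response = call_llm()
    cache.store(embedding, response)
    return response, False
```

A linear scan is used here for readability; a real vector index would answer the same nearest-neighbour question without comparing against every stored embedding.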
Cache key schema
bash
laghav:semcache:{tenant_id}:{vec_hash}   # TTL: 1800s (30 min)
laghav:dedup:{tenant_id}:{exact_hash}    # TTL: 900s (15 min), exact-match cache
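To make the schema concrete, here is a minimal sketch of how keys of this shape could be built. The helper names are hypothetical, and the hashing is an assumption: the page does not say how `vec_hash` or `exact_hash` are derived, so SHA-256 over a canonical string is used purely for illustration.

```python
import hashlib

SEMCACHE_TTL_SECONDS = 1800  # 30 min, from the schema
DEDUP_TTL_SECONDS = 900      # 15 min, from the schema


def _sha256_hex(data: bytes) -> str:
    # Assumption: SHA-256 is illustrative; the real hash function is not documented.
    return hashlib.sha256(data).hexdigest()


def dedup_key(tenant_id: str, prompt: str) -> str:
    """Exact-match key: identical prompts from one tenant map to one key."""
    return f"laghav:dedup:{tenant_id}:{_sha256_hex(prompt.encode('utf-8'))}"


def semcache_key(tenant_id: str, embedding: list[float]) -> str:
    """Semantic-cache key derived from a fixed-precision rendering of the vector."""
    # Assumption: six decimal places is an arbitrary canonical form for this sketch.
    canonical = ",".join(f"{x:.6f}" for x in embedding)
    return f"laghav:semcache:{tenant_id}:{_sha256_hex(canonical.encode('utf-8'))}"
```

Because `tenant_id` sits in the key prefix, two tenants sending the same prompt produce different keys.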
✦ Tenant isolation
Cache is always scoped to your tenant_id, so you never receive another customer's cached response.
Detecting a cache hit
cache_hit.py
response = client.complete(messages=messages, model="auto")
if response.laghav_meta.cache_hit:
    print("Served from cache - $0 LLM cost")
    print(f"Latency overhead: {response.laghav_meta.latency_overhead_ms}ms")  # typically 3–8ms
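If you call the API over raw HTTP rather than through the client, the `X-Laghav-Cache: semantic` header mentioned under "How it works" is the other signal of a semantic hit. The helper below is a hypothetical sketch that inspects a plain dict of response headers. Header values other than `semantic` are not documented on this page, so it treats anything else as "not a semantic hit".

```python
def is_semantic_cache_hit(headers):
    """Return True if the response headers mark a semantic cache hit.

    HTTP header names are case-insensitive, so the lookup ignores case.
    """
    for name, value in headers.items():
        if name.lower() == "x-laghav-cache":
            return value.strip().lower() == "semantic"
    return False
```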
Clearing the cache
bash
# Clear all cache types
curl -X POST https://api.laghav.ai/api/cache/clear \
  -H "Authorization: Bearer lgh_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{"scope": "all"}'
# Scope options: "dedup" | "semantic" | "routing" | "all"
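The same request can be built from Python using only the standard library. This is a sketch: `build_clear_request` is a hypothetical helper, the URL, headers, and scope values are copied from the curl example, and the response format is not documented here, so the code only constructs the request and leaves sending it to the caller.

```python
import json
import urllib.request

CLEAR_URL = "https://api.laghav.ai/api/cache/clear"
VALID_SCOPES = {"dedup", "semantic", "routing", "all"}


def build_clear_request(api_key: str, scope: str = "all") -> urllib.request.Request:
    """Build (but do not send) a POST request that clears the given cache scope."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    body = json.dumps({"scope": scope}).encode("utf-8")
    return urllib.request.Request(
        CLEAR_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Sending the request performs a real network call:
# with urllib.request.urlopen(build_clear_request("lgh_live_xxx", "semantic")) as resp:
#     print(resp.status)
```

Validating the scope locally catches typos such as "semantics" before a request is made.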
Disabling cache per-call
no_cache.py
# Useful for time-sensitive queries ("what's today's news?")
response = client.complete(
    messages=messages,
    model="auto",
    laghav_options={"cache": False},
)
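One way to apply this selectively is to decide per request whether a query looks time-sensitive. The sketch below is a hypothetical heuristic, not a Laghav feature: the keyword list is arbitrary, and it assumes `messages` is a list of dicts with `role` and `content` keys. It returns `{"cache": False}` for time-sensitive queries and an empty dict otherwise, since `False` is the only value for the `cache` option shown on this page.

```python
import re

# Assumption: an arbitrary keyword list chosen for illustration; tune it for your traffic.
TIME_SENSITIVE = re.compile(
    r"\b(today|tonight|now|latest|current|currently|breaking)\b",
    re.IGNORECASE,
)


def laghav_options_for(messages):
    """Return options that bypass the cache when the last user message looks time-sensitive."""
    last_user = next(
        (m.get("content", "") for m in reversed(messages) if m.get("role") == "user"),
        "",
    )
    if TIME_SENSITIVE.search(last_user):
        return {"cache": False}
    return {}
```

The result could then be passed as `laghav_options=laghav_options_for(messages)` in the call above.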