Semantic Cache

Laghav uses Redis Stack vector similarity search to serve repeated and near-identical queries from cache, with zero LLM cost and zero quality loss.

How it works

The compressed prompt is embedded using all-MiniLM-L6-v2 via the compressor service.
Laghav searches the Redis Stack vector index for embeddings with cosine similarity > 0.92 (cosine distance < 0.08).
On a hit: the cached response is returned immediately with the X-Laghav-Cache: semantic header.
On a miss: the LLM is called, and the response and its embedding are stored in Redis (TTL: 30 minutes).
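
The hit-or-miss decision above can be sketched in a few lines of Python. This is only an illustration of the threshold logic, not Laghav's implementation: the in-memory `cache` list is a hypothetical stand-in for the Redis Stack vector index, and the toy 3-dimensional vectors stand in for the 384-dimensional embeddings that all-MiniLM-L6-v2 produces.

```python
import math

# Threshold from the text: similarity > 0.92 is the same as cosine distance < 0.08.
SIMILARITY_THRESHOLD = 0.92


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def lookup(cache, query_embedding):
    """Return the most similar cached response above the threshold, or None on a miss.

    `cache` is a hypothetical list of (embedding, response) pairs.
    """
    best_response, best_score = None, SIMILARITY_THRESHOLD
    for embedding, response in cache:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_response, best_score = response, score
    return best_response


cache = [([1.0, 0.0, 0.0], "cached answer")]
near = [0.99, 0.05, 0.0]  # nearly the same direction as the cached vector
far = [0.5, 0.5, 0.7]     # points in a clearly different direction

print(lookup(cache, near))  # hit: returns the cached response
print(lookup(cache, far))   # miss: returns None, so the LLM would be called
```

In a real deployment the linear scan is replaced by the vector index query; the comparison against 0.92 is the part this sketch is meant to show.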

Cache key schema

bash

laghav:semcache:{tenant_id}:{vec_hash}  # TTL: 1800s (30 min)
laghav:dedup:{tenant_id}:{exact_hash}   # TTL: 900s (15 min) — exact match cache
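
As a rough illustration of how keys in this schema could be assembled, the sketch below builds both key types in Python. The schema does not say how `vec_hash` or `exact_hash` are computed, so the SHA-256 digest and the fixed-precision rendering of the vector are assumptions made purely for the example, as are the helper names.

```python
import hashlib

SEMCACHE_TTL_SECONDS = 1800  # 30 min, from the schema
DEDUP_TTL_SECONDS = 900      # 15 min, from the schema


def _digest(text: str) -> str:
    # Assumption: SHA-256 hex digest. The actual hash function is not documented here.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedup_key(tenant_id: str, prompt: str) -> str:
    """Key for the exact-match cache: identical prompts map to the same key."""
    return f"laghav:dedup:{tenant_id}:{_digest(prompt)}"


def semcache_key(tenant_id: str, embedding: list[float]) -> str:
    """Key for a semantic cache entry.

    Assumption: the vector is rendered at fixed precision before hashing so
    that the same embedding always yields the same key.
    """
    rendered = ",".join(f"{x:.6f}" for x in embedding)
    return f"laghav:semcache:{tenant_id}:{_digest(rendered)}"


print(dedup_key("acme", "What is a semantic cache?"))
print(semcache_key("acme", [0.12, -0.03, 0.98]))
```

Because `tenant_id` is part of every key, two tenants sending the same prompt never share an entry.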

Tenant isolation

The cache is always scoped to your tenant_id, so you never receive another customer's cached response.

Detecting a cache hit

cache_hit.py

# Assumes `client` is an initialized Laghav client and `messages` is your chat message list.
response = client.complete(messages=messages, model="auto")
if response.laghav_meta.cache_hit:
    print("Served from cache — $0 LLM cost")
    print(f"Latency overhead: {response.laghav_meta.latency_overhead_ms}ms")  # typically 3–8ms
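
If you call the HTTP API directly rather than through the client, a semantic hit can also be detected from the X-Laghav-Cache response header described earlier. The helper below is a minimal sketch that works on any header mapping; `cache_source` is a hypothetical name, and only the `semantic` value is documented on this page.

```python
from typing import Mapping, Optional


def cache_source(headers: Mapping[str, str]) -> Optional[str]:
    """Return the X-Laghav-Cache header value, or None if the header is absent."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "x-laghav-cache":
            return value.strip()
    return None


headers = {"Content-Type": "application/json", "X-Laghav-Cache": "semantic"}
if cache_source(headers) == "semantic":
    print("Served from the semantic cache")
```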

Clearing the cache

bash

# Clear all cache types
curl -X POST https://api.laghav.ai/api/cache/clear \
  -H "Authorization: Bearer lgh_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{"scope": "all"}'
# Scope options: "dedup" | "semantic" | "routing" | "all"
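
The same request can be built from Python with the standard library. This is a sketch that mirrors the curl call above: the URL, headers, and scope values come from that example, while `build_clear_request` is a hypothetical helper name. The request is only constructed here, and the commented line shows how it could be sent.

```python
import json
import urllib.request

# Scope values listed in the curl example.
VALID_SCOPES = {"dedup", "semantic", "routing", "all"}


def build_clear_request(api_key: str, scope: str = "all") -> urllib.request.Request:
    """Build (but do not send) a POST request that clears the given cache scope."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {sorted(VALID_SCOPES)}, got {scope!r}")
    body = json.dumps({"scope": scope}).encode("utf-8")
    return urllib.request.Request(
        "https://api.laghav.ai/api/cache/clear",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_clear_request("lgh_live_xxx", scope="semantic")
# To send it: urllib.request.urlopen(request)
```

Validating the scope locally avoids a round trip for a typo such as "semantics".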

Disabling cache per-call

no_cache.py

# Useful for time-sensitive queries ("what's today's news?")
response = client.complete(
    messages=messages,
    model="auto",
    laghav_options={"cache": False}
)
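
One way to apply this automatically is to decide per query whether the cache should be bypassed. The sketch below uses a simple keyword heuristic; the keyword list and the `laghav_options_for` helper are hypothetical, and passing an empty options dict to `client.complete` is assumed to mean "use the defaults".

```python
import re

# Hypothetical keyword list; tune it for your own traffic.
_TIME_SENSITIVE = re.compile(
    r"\b(today|tonight|now|latest|current|currently|breaking)\b",
    re.IGNORECASE,
)


def laghav_options_for(query: str) -> dict:
    """Return {"cache": False} for queries that look time-sensitive, else {}."""
    if _TIME_SENSITIVE.search(query):
        return {"cache": False}
    return {}


print(laghav_options_for("what's today's news?"))
print(laghav_options_for("Explain cosine similarity"))
```

The result can be passed as `laghav_options` in the call shown above.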
