Model Routing

Laghav's DistilBERT ONNX classifier routes every prompt to the cheapest capable model in under 5 ms. On average, 68% of prompts are classified as simple and routed to Haiku, which costs 98% less than Opus for those queries.

How it works

When you set model: "auto", Laghav classifies the compressed prompt into one of four complexity tiers using a fine-tuned DistilBERT model exported to ONNX (3.4ms CPU inference). If classifier confidence is below 0.70, it falls back to a pattern-matching rule set.
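The confidence check described above can be sketched roughly as follows. This is an illustration, not Laghav's actual implementation: the `classify` callable stands in for the DistilBERT ONNX model, the keyword rules in `FALLBACK_RULES` are hypothetical, and the mapping of each branch to the `ml_high_confidence` and `ml_fallback_pattern` reasons is an assumption.

```python
from typing import Callable, Dict, Tuple

CONFIDENCE_THRESHOLD = 0.70  # threshold stated in the text above

# Hypothetical keyword rules; the real pattern-matching rule set is not documented here.
FALLBACK_RULES = [
    ("translation", ("translate", "translation")),
    ("code", ("def ", "traceback", "stack trace", "function", "bug")),
    ("complex", ("analyze", "legal", "research")),
]


def pattern_fallback(prompt: str) -> str:
    """Return the first tier whose keywords appear in the prompt, else 'simple'."""
    text = prompt.lower()
    for tier, keywords in FALLBACK_RULES:
        if any(keyword in text for keyword in keywords):
            return tier
    return "simple"


def route(prompt: str, classify: Callable[[str], Dict[str, float]]) -> Tuple[str, str]:
    """Return (tier, routing_reason) for a prompt.

    `classify` is assumed to return a probability per tier.
    """
    probabilities = classify(prompt)
    tier, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence >= CONFIDENCE_THRESHOLD:
        return tier, "ml_high_confidence"
    # Confidence below the threshold: ignore the model and use the rules.
    return pattern_fallback(prompt), "ml_fallback_pattern"
```

Passing the classifier in as a callable keeps the sketch testable without loading a model; in practice it would wrap the ONNX inference call.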

| Category | Routed to | Typical examples | Savings vs Opus |
| --- | --- | --- | --- |
| simple (68%) | claude-haiku-3 | FAQ, yes/no, greetings, classification | 98% |
| translation (8%) | claude-haiku-3 | Any language translation task | 98% |
| code (19%) | claude-sonnet-4 | Code gen, debugging, review | 80% |
| complex (5%) | claude-opus-4 | Research, legal, multi-step reasoning | 0% |
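As a rough worked example, the table's traffic shares and per-tier savings can be combined into a single blended figure. The sketch below assumes that a request in every category would cost the same on Opus, which the table does not state, so treat the result as a back-of-the-envelope estimate rather than a measured number.

```python
# Traffic share, target model, and savings vs Opus, copied from the table above.
ROUTING_TABLE = {
    "simple": {"share": 0.68, "model": "claude-haiku-3", "savings": 0.98},
    "translation": {"share": 0.08, "model": "claude-haiku-3", "savings": 0.98},
    "code": {"share": 0.19, "model": "claude-sonnet-4", "savings": 0.80},
    "complex": {"share": 0.05, "model": "claude-opus-4", "savings": 0.00},
}


def blended_savings(table: dict) -> float:
    """Share-weighted savings, assuming equal Opus cost per request in every tier."""
    return sum(row["share"] * row["savings"] for row in table.values())


def model_share(table: dict, model: str) -> float:
    """Fraction of traffic routed to a given model."""
    return sum(row["share"] for row in table.values() if row["model"] == model)


print(f"Blended savings vs Opus: {blended_savings(ROUTING_TABLE):.1%}")
print(f"Traffic routed to Haiku: {model_share(ROUTING_TABLE, 'claude-haiku-3'):.0%}")
```

Under that assumption the blended savings come to about 89.7%, and Haiku handles 76% of traffic once the simple and translation tiers are added together.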

Routing reason in response

routing.py
# Assumes `client` is an initialized Laghav client and `messages` is your message list.
response = client.complete(messages=messages, model="auto")
print(response.laghav_meta.routing_reason)   # e.g. "faq_pattern"
print(response.laghav_meta.model_requested)  # "auto"
print(response.model)                        # e.g. "claude-haiku-3-20240307"

Override routing

override.py
# Force a specific model and bypass routing
response = client.complete(
    messages=messages,
    model="claude-opus-4",            # always uses Opus
    laghav_options={"route": False},  # disable routing middleware
)

routing_reason values

Common routing reasons: faq_pattern, translation_task, code_task, analytical, ml_high_confidence, ml_fallback_pattern
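
One way to use these values is to tally them across a batch of responses and see how traffic is being routed. The sketch below relies only on the `laghav_meta.routing_reason` attribute shown earlier; `tally_routing_reasons` and `fake_response` are hypothetical helpers, and the stub objects stand in for real responses so the example runs without a client.

```python
from collections import Counter
from types import SimpleNamespace


def tally_routing_reasons(responses) -> Counter:
    """Count how often each routing_reason appears in an iterable of responses."""
    return Counter(r.laghav_meta.routing_reason for r in responses)


def fake_response(reason: str) -> SimpleNamespace:
    """Stub with the same attribute path as a real response (illustration only)."""
    return SimpleNamespace(laghav_meta=SimpleNamespace(routing_reason=reason))


sample = [fake_response(r) for r in ("faq_pattern", "faq_pattern", "code_task")]
counts = tally_routing_reasons(sample)
for reason, count in counts.most_common():
    print(f"{reason}: {count}")
```

A high count of ml_fallback_pattern in such a tally would suggest the classifier is often below its confidence threshold for your traffic.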