Driftwatch is a measurement engine for language model behaviour. It detects when models drift from expected behaviour across versions, prompt changes, and workflow modifications, and it produces reproducible evaluation runs with full provenance chains.
Where Keel prevents bad outcomes in real time, Driftwatch measures whether the model is trending toward them. It answers: "is this model still doing what we expect, and can we prove it?"
What it measures
- NADR (Needs-Ask Detection Rate): Does the model detect when a user needs help, even when they don't ask directly?
- ORR (Overconfident Response Rate): How often does the model assert confidence it hasn't earned?
- SCR (Safety Compliance Rate): Do safety constraints hold across extended sessions and compaction?
- WTR (Windsock Trigger Rate): How often does the model's uncertainty signal fire before a failure?
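As a rough illustration, the four rates could be computed from labelled evaluation records along these lines. This is a minimal sketch: the `EvalRecord` fields and function names are assumptions for illustration, not Driftwatch's actual schema or API.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """One scored evaluation case (illustrative fields, not Driftwatch's schema)."""
    needs_help: bool        # ground truth: the user needed help
    asked_directly: bool    # the user asked for help explicitly
    detected_need: bool     # the model surfaced the need
    overconfident: bool     # model asserted confidence it hadn't earned
    safety_compliant: bool  # safety constraints held for this case
    windsock_fired: bool    # the uncertainty signal fired
    failed: bool            # the case ended in a failure

def rate(hits: int, total: int) -> float:
    return hits / total if total else 0.0

def nadr(records: list[EvalRecord]) -> float:
    """Needs-Ask Detection Rate: detections among cases with an implicit need."""
    implicit = [r for r in records if r.needs_help and not r.asked_directly]
    return rate(sum(r.detected_need for r in implicit), len(implicit))

def orr(records: list[EvalRecord]) -> float:
    """Overconfident Response Rate: lower is better."""
    return rate(sum(r.overconfident for r in records), len(records))

def scr(records: list[EvalRecord]) -> float:
    """Safety Compliance Rate across all cases in a run."""
    return rate(sum(r.safety_compliant for r in records), len(records))

def wtr(records: list[EvalRecord]) -> float:
    """Windsock Trigger Rate: how often the signal fired before a failure."""
    failures = [r for r in records if r.failed]
    return rate(sum(r.windsock_fired for r in failures), len(failures))
```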
Key capabilities
- Cross-model validation. Results show a +65 percentage point improvement in needs-detection across Mistral 7B, Llama 3.1 8B, and Llama 3.1 70B, with zero regressions.
- Reproducible artefact packs. Every evaluation run produces a provenance-chained artefact pack: inputs, outputs, model versions, timestamps, hashes. Anyone can re-run the evaluation and verify the results (see the artefact-pack sketch after this list).
- Drift detection. Track metric changes across model updates, prompt modifications, and deployment changes, and know when behaviour shifts before users report it (see the drift-check sketch after this list).
- Keel integration. Consumes Keel telemetry events. When Keel blocks an action or quarantines a deletion, Driftwatch records it as a data point for behavioural analysis (the drift-check sketch below includes an example).
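To make the provenance chain concrete, here is a minimal sketch of how an artefact pack might be assembled and linked by hash. The pack structure, field names, and the `build_artefact_pack` function are assumptions for illustration; Driftwatch's actual pack format may differ.

```python
import hashlib
import json
import time

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_artefact_pack(inputs: dict, outputs: dict, model_version: str,
                        prev_hash: str | None = None) -> dict:
    """Assemble one provenance-chained artefact pack (illustrative structure)."""
    pack = {
        "model_version": model_version,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "inputs_hash": sha256(json.dumps(inputs, sort_keys=True).encode()),
        "outputs_hash": sha256(json.dumps(outputs, sort_keys=True).encode()),
        "prev_hash": prev_hash,  # links this run to the previous pack in the chain
    }
    # The pack's own hash covers every field above, so tampering with
    # inputs, outputs, model version, or the chain is detectable.
    pack["pack_hash"] = sha256(json.dumps(pack, sort_keys=True).encode())
    return pack
```

Verification is then a matter of re-running the evaluation, rebuilding the pack, and comparing hashes; any mismatch pinpoints where a run diverged.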
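And a correspondingly small sketch of the drift check itself: compare a run's metrics against a baseline, flag anything that moved past a threshold, and fold Keel enforcement events in as data points. The function names, event fields, and the 5-point default threshold are illustrative assumptions, not Driftwatch's API.

```python
def detect_drift(baseline: dict[str, float], current: dict[str, float],
                 threshold: float = 0.05) -> dict[str, float]:
    """Return the metrics whose change between two runs exceeds the threshold."""
    return {
        name: current[name] - baseline[name]
        for name in baseline
        if name in current and abs(current[name] - baseline[name]) > threshold
    }

def ingest_keel_event(event: dict, datapoints: list[dict]) -> None:
    """Record a Keel enforcement event (a block or quarantine) as a data point."""
    if event.get("action") in {"block", "quarantine"}:
        datapoints.append({
            "source": "keel",
            "action": event["action"],
            "timestamp": event.get("timestamp"),
        })

# Example: NADR fell 14 points between runs, so only NADR is flagged.
baseline = {"NADR": 0.62, "ORR": 0.11, "SCR": 0.98}
current = {"NADR": 0.48, "ORR": 0.12, "SCR": 0.97}
print(detect_drift(baseline, current))
```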
Driftwatch is in active development. The evaluation harness is functional and has been used in published research. A public release will follow once the first external-facing artefact pack is prepared.
Research and methodology details at threshold.systems. Licence: Apache 2.0 (evaluation harness). ORCID 0009-0004-1442-1743.