Plugin
llm-eval-harness
Evaluate any LLM behind an OpenAI- or Anthropic-compatible endpoint across four dimensions: speed (time to first token, TTFT, plus thinking-aware tokens/sec), concurrency and stability (success rate, p50/p90, breaking point), Anthropic protocol compliance (thinking-block trigger rate), and quality regression against your own accumulated use cases (blind-judge precision). Use it to benchmark a model, verify a tokens-per-second claim, compare models head-to-head, or vet a newly released model before adopting it.
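To make the speed and stability metrics concrete, here is a minimal sketch of how they might be computed from per-request timing records. It is not the plugin's actual implementation: `StreamTrace`, `ttft`, `tokens_per_sec`, and `percentiles` are hypothetical names, "thinking-aware" is assumed to mean that tokens emitted in thinking blocks count toward throughput, and p50/p90 are assumed to be taken over per-request latencies.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class StreamTrace:
    """Timing record for one streamed request (hypothetical structure)."""
    request_sent: float    # seconds on a monotonic clock
    first_token_at: float  # arrival time of the first streamed token
    last_token_at: float   # arrival time of the final streamed token
    thinking_tokens: int   # tokens emitted inside thinking blocks
    answer_tokens: int     # tokens in the visible answer


def ttft(trace: StreamTrace) -> float:
    """Time to first token, in seconds."""
    return trace.first_token_at - trace.request_sent


def tokens_per_sec(trace: StreamTrace, include_thinking: bool = True) -> float:
    """Generation throughput over the streaming window.

    Assumption: "thinking-aware" means thinking tokens are counted,
    so a model that reasons before answering is not scored as slow.
    """
    tokens = trace.answer_tokens
    if include_thinking:
        tokens += trace.thinking_tokens
    duration = trace.last_token_at - trace.first_token_at
    return tokens / duration if duration > 0 else 0.0


def percentiles(values: list[float]) -> tuple[float, float]:
    """Return (p50, p90); needs at least two data points."""
    # quantiles(n=10) yields 9 cut points: index 4 is p50, index 8 is p90.
    cuts = quantiles(values, n=10, method="inclusive")
    return cuts[4], cuts[8]


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of requests that completed without error."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Under these assumptions, a trace that streams 300 thinking tokens and 100 answer tokens over two seconds scores 200 tokens/sec with thinking counted and 50 tokens/sec without, which is why the distinction matters when checking a vendor's tokens-per-second claim.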