An honest eval protocol for agent benchmarks.
Feb 19, 8 min
Task pass-rate hides the failure modes that matter on long runs. a protocol we use internally and a request for replication from anyone running their own.
Feb 19, 8 min
Task pass-rate hides the failure modes that matter on long runs. a protocol we use internally and a request for replication from anyone running their own.
internally we track six things on every long-horizon eval. each one catches a failure mode that the next one misses.
**task pass-rate.** does the agent produce a working patch on this task. necessary, not sufficient.
**first-attempt rate.** does it produce the patch on the first try, or does it need to back out and re-derive. for production use, this is the metric that drives spend.
**trajectory consistency.** does the run end up looking like a self-consistent piece of work, or does it look like three different agents took turns. measured with the drift signal.
**re-derivation count.** how many times in the run does the agent re-establish a fact it already established. counted directly from the trace.
**tool-call efficiency.** how many tool calls per unit of useful work, where useful is judged at the end.
**recovery cost.** for any tool call that turned out to be a misstep, how many subsequent calls did the agent spend correcting it.
the model's stated confidence in its own answer. unreliable. self-graded quality. unreliable. anything elicited by asking the model how it thinks the run went.
if you're running agent benchmarks and you've found a metric that catches failure modes the six above miss, we want to hear about it. engineering@skalpel.ai. we will publish anything you let us publish under your name.
ryan