An honest eval protocol for agent benchmarks.

Feb 19, 8 min

Task pass-rate hides the failure modes that matter on long runs. a protocol we use internally and a request for replication from anyone running their own.

What we measure

internally we track six things on every long-horizon eval. each one catches a failure mode that the next one misses.

**task pass-rate.** does the agent produce a working patch on this task. necessary, not sufficient.

**first-attempt rate.** does it produce the patch on the first try, or does it need to back out and re-derive. for production use, this is the metric that drives spend.

**trajectory consistency.** does the run end up looking like a self-consistent piece of work, or does it look like three different agents took turns. measured with the drift signal.

**re-derivation count.** how many times in the run does the agent re-establish a fact it already established. counted directly from the trace.

**tool-call efficiency.** how many tool calls per unit of useful work, where useful is judged at the end.

**recovery cost.** for any tool call that turned out to be a misstep, how many subsequent calls did the agent spend correcting it.

What we don't measure

the model's stated confidence in its own answer. unreliable. self-graded quality. unreliable. anything elicited by asking the model how it thinks the run went.

The ask

if you're running agent benchmarks and you've found a metric that catches failure modes the six above miss, we want to hear about it. engineering@skalpel.ai. we will publish anything you let us publish under your name.

ryan

← docs @ryanndngg