SWE-bench: zero quality loss after two weeks of evals.

May 06, 6 min

What we ran, how we ran it, and what we held constant. the headline: agents under skalpel finish the same number of tasks they would have unguided, and they finish them more often on the first try.

The question

if skalpel intervenes at decision boundaries during a long agent run, the obvious worry is that it pushes the agent down the wrong fork and quality drops. the goal of the last two weeks of evals: measure that delta honestly.

The setup

we ran claude code on the standard swe-bench-verified split and on an internal long-horizon set of multi-file engineering tasks we use during development. each task ran five times under each of two conditions, with the same seeds in both:

  - unguided: claude code as it ships, no skalpel
  - guided: the same model with the trajectory engine and drift detector enabled

we held the prompt, the model, the temperature, the tool set, and the eval harness identical between conditions. the only difference is whether skalpel is in the loop.
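
a minimal sketch of what "the only difference is whether skalpel is in the loop" can look like in a harness. this is not our eval code: `RunConfig`, `build_run_matrix`, and `differing_fields` are hypothetical names, and it assumes one seed per repeat, reused across both conditions.

```python
from dataclasses import dataclass, fields, replace
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical config record; field names are illustrative only.
    task_id: str
    seed: int
    prompt: str
    model: str
    temperature: float
    tools: tuple
    guided: bool  # the only field allowed to differ between conditions


def build_run_matrix(task_ids, seeds, base):
    """Return an unguided and a guided config for every (task, seed) pair."""
    runs = []
    for task_id, seed in product(task_ids, seeds):
        unguided = replace(base, task_id=task_id, seed=seed, guided=False)
        runs.append(unguided)
        # The guided run is a copy of the unguided one with a single flag flipped.
        runs.append(replace(unguided, guided=True))
    return runs


def differing_fields(a, b):
    """Names of the fields on which two configs disagree."""
    return {f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)}
```

checking that `differing_fields` returns exactly `{"guided"}` for each adjacent pair is a cheap guard against a prompt or temperature change leaking into one condition.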

The result

across the eval window, guided quality, measured by task pass rate, matched unguided within noise. zero net quality loss. when we tightened the criterion to first-attempt completion, the guided runs pulled ahead, mostly because unguided runs more often went back to a finished file or re-derived a fact they'd already established.
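
one way to make "within noise" concrete, as a sketch only: a paired bootstrap over tasks on the pass-rate difference, reading an interval that contains zero as "no measurable difference". this is an assumption about how you might check it, not a description of our analysis. the function names and the 95% level are illustrative, and the inputs are per-task scores (0/1, or the fraction of repeats that passed) listed in the same task order.

```python
import random


def pass_rate(outcomes):
    """Mean of per-task scores (0/1, or fraction of repeats that passed)."""
    return sum(outcomes) / len(outcomes)


def paired_bootstrap_delta(guided, unguided, n_resamples=2000, seed=0):
    """Approximate 95% interval for pass_rate(guided) - pass_rate(unguided).

    Tasks are resampled with replacement, and each draw keeps the guided
    and unguided score of the same task together (a paired bootstrap).
    """
    if len(guided) != len(unguided):
        raise ValueError("both conditions must cover the same tasks")
    rng = random.Random(seed)
    n = len(guided)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(guided[i] - unguided[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int(0.025 * n_resamples)]
    hi = deltas[int(0.975 * n_resamples) - 1]
    return lo, hi
```

pairing by task matters here: task difficulty varies far more than the condition effect, so resampling the two conditions independently would widen the interval and make a real regression easier to miss.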

What we did not change

we did not retune the model. we did not edit the prompt. we did not run a second model in the background. we did not retry on failure. all of those were on the table; none of them are in the shipping path. zero quality loss has to mean zero quality loss against an honest baseline.

What's in this report

we'll publish the full eval table here once we wire up the methodology page. until then, contact engineering@skalpel.ai if you're trying to replicate and we'll send you the seeds and the trace dumps.

ryan

@ryanndngg