Holding the trajectory: long agent runs that finish.

May 12, 2026, 8 min

Claude Code is great at the first ten steps and loses the thread by step thirty. we describe what we measure, what we steer, and what we leave alone.

The problem

ask claude code to do a single thing and it does it well. ask it to do twenty things that depend on each other (refactor, then re-test, then update the call sites, then re-run the integration suite, then write a migration) and something shifts around step ten or fifteen. it stops asking the right next question. it re-opens a file it already finished. it commits early, or it doesn't commit at all. anyone running long agent sessions has felt this.

we call it trajectory drift. the model isn't getting worse. its context window is fine. its weights are fine. what breaks is its sequence of decisions, which stops being self-consistent. the run forgets why it started.

What we built

skalpel is a set of engines that intervene at the moments where drift happens. you install it once and run claude code the way you always do. nothing about your code, your prompt, or the model changes.

today we ship three engines under the same install:

- the trajectory engine, the primary one, intercepts at decision boundaries
- the drift detector, which produces the on-task signal we steer on
- the compressor, which holds long context on-task without trimming meaning

each engine does one job. they compose. more are in flight.

What we measure

we record one number per agent decision: a confidence score for the action the agent just took, conditioned on the run's history. this is not the model's stated confidence. it is a separate signal we compute over the decision distribution. high values mean "this action follows from the run." low values mean "the run has gone off-script."
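
to make the idea of a score "over the decision distribution" concrete, here is a minimal sketch. it is not skalpel's actual signal, which this post does not specify. it assumes a probability for each candidate action is available, and the name `decision_confidence` and the rescaling against chance are hypothetical choices made for illustration.

```python
from typing import Sequence


def decision_confidence(probs: Sequence[float], chosen: int) -> float:
    """Hypothetical score in [0, 1]: 0 is chance level, 1 is certainty.

    `probs` is an assumed distribution over candidate actions, already
    conditioned on the run's history. `chosen` is the index of the action taken.
    """
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two candidate actions")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")

    p = probs[chosen] / total      # normalize defensively
    chance = 1.0 / n               # what a uniform pick would score
    # rescale so chance maps to 0 and certainty maps to 1;
    # picks that fall below chance are clamped to 0
    return max(0.0, (p - chance) / (1.0 - chance))
```

under this sketch a uniform distribution scores 0 for any action, and an action holding all of the probability mass scores 1, which matches the "high means on-script, low means off-script" reading above.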

across thousands of long-horizon traces we have run internally, the same shape shows up. unguided runs hold confidence flat for the first ten to fifteen steps and then decay toward chance. guided runs hold steady all the way out.
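
the "flat, then decaying" shape can be checked mechanically on a recorded trace. the sketch below is one hypothetical way to do it, not the detector described in this post: it compares a trailing-window mean against the mean of the early steps. the function name `drift_onset` and the parameters `baseline_steps`, `window`, and `drop` are assumptions chosen for illustration.

```python
from typing import Optional, Sequence


def drift_onset(
    scores: Sequence[float],
    baseline_steps: int = 10,
    window: int = 3,
    drop: float = 0.2,
) -> Optional[int]:
    """Return the first step index where the trailing-window mean falls more
    than `drop` below the early-run baseline, or None if the run holds steady."""
    if baseline_steps <= 0 or window <= 0:
        raise ValueError("baseline_steps and window must be positive")
    if len(scores) < baseline_steps + window:
        return None  # too short to compare a window against the baseline

    baseline = sum(scores[:baseline_steps]) / baseline_steps
    # slide a window over the steps that come after the baseline period
    for end in range(baseline_steps + window, len(scores) + 1):
        recent = scores[end - window:end]
        if sum(recent) / window < baseline - drop:
            return end - 1  # index of the last step in the offending window
    return None
```

a trace that stays at its early level returns `None`. a trace that falls away after the baseline period returns the step at which the windowed mean first crosses the threshold.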

What we steer

we do not edit your prompt. we do not fine-tune. we do not route to a different model.

we intervene at decision boundaries, the moments inside a tool call where the model commits to one path over another. when the drift detector is confident the run is drifting, the trajectory engine narrows the choice set. when it isn't, we step back. most of the time the engines do nothing. their job is to wait for the moments where doing something is worth the latency.
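
the gating logic in that paragraph can be sketched in a few lines. this is an illustration under stated assumptions, not the trajectory engine itself: the function `steer`, the `on_task` predicate, the `drift_probability` input, and the 0.9 threshold are all hypothetical.

```python
from typing import Callable, List, Sequence


def steer(
    candidates: Sequence[str],
    drift_probability: float,
    on_task: Callable[[str], bool],
    threshold: float = 0.9,
) -> List[str]:
    """Narrow the choice set only when drift is reported with high confidence."""
    if drift_probability < threshold:
        # step back: the common case, leave the choice set untouched
        return list(candidates)

    narrowed = [c for c in candidates if on_task(c)]
    # never hand back an empty choice set; fall back to the original options
    return narrowed or list(candidates)
```

a usage example, with made-up candidate actions:

```python
candidates = ["run tests", "reopen finished file", "commit"]
is_on_task = lambda action: action != "reopen finished file"

steer(candidates, 0.20, is_on_task)  # below threshold: all three candidates pass through
steer(candidates, 0.95, is_on_task)  # above threshold: the off-task candidate is dropped
```

the early return is the point of the design: the cheap check runs on every decision, and the narrowing step only runs when the detector says it is worth it.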

Quality

we ran two weeks of evals on the engines you can install today and measured zero quality loss versus the same model running unguided on the same tasks. the evals covered the standard agent benchmarks and long-form internal tasks where the model is doing real engineering work over hours.

zero is the bar we held the engines to before shipping. we will not claim anything beyond zero loss until the runs prove out for longer.