Holding the trajectory: long agent runs that finish.
May 12, 2026, 8 min
Claude code is great at the first ten steps and loses the thread by step thirty. we describe what we measure, what we steer, and what we leave alone.
May 12, 2026, 8 min
Claude code is great at the first ten steps and loses the thread by step thirty. we describe what we measure, what we steer, and what we leave alone.
ask claude code to do a single thing and it does it well. ask it to do twenty things that depend on each other, like refactor, then re-test, then update the call sites, then re-run the integration suite, then write a migration. something shifts around step ten or fifteen. it stops asking the right next question. it re-opens a file it already finished. it commits early, or it doesn't commit at all. anyone running long agent sessions has felt this.
we call it trajectory drift. the model isn't getting worse. its context window is fine. its weights are fine. its sequence of decisions stops being self-consistent. the run forgets why it started.
skalpel is a set of engines that intervene at the moments where drift happens. you install it once and run claude code the way you always do. nothing about your code, your prompt, or the model changes.
today we ship three engines under the same install:
- the trajectory engine, the primary one, intercepts at decision boundaries - the drift detector, which produces the on-task signal we steer on - the compressor, which holds long context on-task without trimming meaning
each engine does one job. they compose. more are in flight.
we record one number per agent decision: a confidence score for the action it just took, conditioned on the run's history. not the model's stated confidence. a separate signal we compute over the decision distribution. high values mean "this action follows from the run." low values mean "the run has gone off-script."
across thousands of long-horizon traces we have run internally, the same shape shows up. unguided runs hold confidence flat for the first ten to fifteen steps and then decay toward chance. guided runs hold steady all the way out.
we do not edit your prompt. we do not fine-tune. we do not route to a different model.
we intervene at decision boundaries, the moments inside a tool call where the model commits to one path over another. when the drift detector reports high-confidence drift, the trajectory engine narrows the choice set. when it doesn't, we step back. most of the time the engines do nothing. their job is to wait for the moments where doing something is worth the latency.
we ran two weeks of evals on the engines you can install today. zero quality loss versus the same model running unguided on the same tasks. we measured this on the standard agent benchmarks and on long-form internal tasks where the model is doing real engineering work over hours.
zero is the bar we held the engines to before shipping. we will not raise our claim above zero until the runs prove out for longer.
how the drift signal is computed. what the intervention policy looks like. where the boundaries are drawn. these stay inside the company while we figure out the right shape to publish. there's enough above to know what we built and how to test whether it works.
the docs page collects what we publish as we publish it. more engines are coming, and each one ships with its own note. follow @ryanndngg on x for drops, or email engineering@skalpel.ai with questions, replication asks, or anything we got wrong.
ryan san francisco, may 2026