long-form agent work.
finished.
skalpel is a set of engines that keep claude code on course over long runs. we intervene at the decisions where agents drift, leave the prompt and the model untouched, and ship with zero measured quality loss across two weeks of evals.
measured against the same model running unguided on the same tasks. held across two weeks of evals.
swe-bench plus our internal long-horizon set. every shipping change re-runs the full battery.
trajectory, drift detection, compression. more in flight.
not one engine. several, each doing one job.
trajectory engine.
the primary engine. intervenes at decision boundaries when the run is about to drift. acts only when it should; otherwise it watches.
drift detector.
produces the on-task signal we steer on. one number per agent decision, conditioned on the run's history.
compressor.
holds long agent context on-task as the run grows. structurally and semantically. the model sees the same shape of context at step fifty as it did at step five.
what happens on a long claude code run.
same model, same prompt, same repo. observed over thirty steps of real engineering work.
what to ask before installing anything.
- what about quality?
- two weeks of evals across the standard agent benchmarks and our internal long-horizon set. zero quality loss versus the same model unguided. we hold the engines to that bar before shipping.
- what does skalpel see?
- only what your agent already sends to claude. we optimize the trajectory in flight. nothing is stored.
- do you train on my prompts?
- no. prompts and completions are never used as training data. contractually, not aspirationally.
- what happens if skalpel goes down?
- your agent keeps working. nothing we do sits on the critical path.
- can i bring my own anthropic key?
- yes. byok by default. you keep the relationship with anthropic. skalpel just shapes the run.
- how much does it cost?
- free.