What we do not touch, and why.

Mar 04, 4 min

No prompt rewriting. no fine-tuning. no model swap. no chain-of-thought injection. a deliberate list of interventions we evaluated and rejected.

The rejected list

over the last six months we evaluated and rejected the following classes of intervention. each one is something the field has tried; each one gave us a reason to walk away.

**prompt rewriting.** modifying the user's prompt en route to the model. rejected because it makes the system non-reproducible: the developer no longer knows what the model saw. trust is a feature.
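a minimal sketch of what "the developer knows what the model saw" can mean in practice, assuming a byte-for-byte comparison is the bar. the names `passthrough`, `rewriting`, and `saw_what_was_sent` are hypothetical and exist only for illustration; this is not skalpel's code.

```python
import hashlib


def fingerprint(prompt: str) -> str:
    # sha-256 of the exact bytes; any rewrite, even whitespace, changes it
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def passthrough(prompt: str) -> str:
    # hypothetical middleware that leaves the prompt alone
    return prompt


def rewriting(prompt: str) -> str:
    # hypothetical middleware that "helpfully" edits the prompt en route
    return prompt.strip() + "\nthink step by step."


def saw_what_was_sent(middleware, prompt: str) -> bool:
    # true only if the model would receive exactly what the developer wrote
    return fingerprint(middleware(prompt)) == fingerprint(prompt)


user_prompt = "refactor the billing module. keep the public api stable."
print(saw_what_was_sent(passthrough, user_prompt))
print(saw_what_was_sent(rewriting, user_prompt))
```

with a pass-through the fingerprints match, so a logged prompt is enough to replay a run. once anything rewrites the prompt, the developer's copy and the model's copy diverge and the log stops being a faithful record.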

**fine-tuning a long-horizon variant of the model.** rejected because it locks us to one model version and cuts the system off from new model releases. skalpel works the day a new claude lands; a fine-tuned variant has to be retrained first.

**model swap mid-run.** rejected for the reasons in the flat-router post. on a long run, the cost you save on cheap calls comes back as the cost of recovering from cheap mistakes.

**chain-of-thought injection.** asking the model to think out loud at every step. rejected because it inflates the trace and slows the run without measurably improving the result.

**post-hoc evaluators.** running a second model over the trace to grade and re-run. rejected because grading costs roughly as much as the original run, and the second model can be wrong in the same way the first was.

What's left

what's left is the thing we built: measure trajectory, act at decision boundaries, leave the prompt and the model alone. it's a smaller surface than what most of the field is doing, and that's deliberate.
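a rough sketch of that shape, under loud assumptions: `run`, `is_boundary`, and `act` are hypothetical names, and how a boundary is detected or what the action does is not specified in this post. the only things the sketch tries to show are that the prompt goes to the model untouched, the model stays the same on every step, and anything extra happens only at a boundary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    prompt: str  # exactly what the developer wrote
    reply: str   # exactly what the model returned


def run(
    prompts: List[str],
    model: Callable[[str], str],
    is_boundary: Callable[[List[Step]], bool],  # hypothetical boundary test
    act: Callable[[List[Step]], None],          # hypothetical action hook
) -> List[Step]:
    trajectory: List[Step] = []
    for prompt in prompts:
        reply = model(prompt)                   # same model, untouched prompt
        trajectory.append(Step(prompt, reply))  # measure: record, nothing else
        if is_boundary(trajectory):             # act only at a boundary
            act(trajectory)
    return trajectory


# toy stand-ins so the sketch runs end to end
seen: List[str] = []
actions: List[int] = []


def fake_model(prompt: str) -> str:
    seen.append(prompt)
    return "ok: " + prompt


prompts = ["plan", "edit file", "run tests", "commit"]
steps = run(
    prompts,
    fake_model,
    is_boundary=lambda t: t[-1].prompt in {"run tests", "commit"},
    act=lambda t: actions.append(len(t)),
)
print(seen == prompts)  # the model saw exactly what was sent
print(actions)          # trajectory lengths at which the hook fired
```

the loop has no place to rewrite a prompt or swap a model, which is the point: the surface is the record and the hook, nothing more.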

ryan

@ryanndngg