Trajectory drift, defined.

Apr 28, 5 min

A one-page definition of the metric we steer on and why it predicts long-horizon failure earlier than task-level pass/fail does.

What it is

trajectory drift is a step-level measure of how out-of-character an agent's decisions become as a run progresses. for each step, we look at the decision the model committed to and ask: does this action follow from the run's history, the goal, and the model's prior choices, or does it look like a different agent showed up at step fifteen?

drift is bounded between zero and one. zero means the decision is what a self-consistent run would do at this point. one means the run has fully forgotten what it was doing.

What it isn't

it is not the model's stated confidence in its own answer. those numbers are unreliable on long-horizon work because the model can be confident and wrong, or uncertain and right. we compute the drift signal independently over the decision distribution and the run's history.

Why we use it

task-level pass/fail is the wrong unit for long runs. by the time you can tell a run failed, you've already paid for fifty tool calls and a re-derivation. drift gives us a signal at every step, so the trajectory engine can act in the third quartile of the run rather than waiting for the post-mortem.

across the long-horizon traces we sampled, drift starts climbing at exactly the step where the run starts going wrong. the model's own confidence stays flat through the same window.

What we publish

we are working on a methodology note that walks through how the signal is computed and how to validate it on your own traces. it will live here when it lands.

ryan

← docs @ryanndngg