Long-context retrieval: where drift bites first.

Apr 14, 6 min

Results from a small internal eval on million-token windows. Drift starts earlier in the run and bites harder when the model is reasoning over a context bigger than its working memory.

The setup

we built a small internal eval that asks the model to do long-form engineering work over a context window of hundreds of thousands of tokens. concretely: refactor a module that spans thirty files, with the source for all of them in context. the model has to keep track of references, types, and behavior across the whole window.

What we saw

drift shows up earliest on this kind of task. on a thirty-step refactor with a wide context, unguided runs start drifting around step eight, whereas on a tight short-context task they start drifting around steps twelve to fifteen.
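
to make "starts drifting around step N" concrete, here is a minimal Python sketch of one way an onset step could be read off a per-step drift signal. everything in it is an assumption for illustration: the scalar signal, the `threshold`, the `patience` rule, and the name `drift_onset_step` are hypothetical, not the eval's actual definition, and the numbers in the usage lines are toy values, not results.

```python
from typing import Optional, Sequence


def drift_onset_step(
    signal: Sequence[float],
    threshold: float,
    patience: int = 2,
) -> Optional[int]:
    """Return the 1-based step where drift first becomes sustained.

    "Sustained" here means the signal stays above `threshold` for
    `patience` consecutive steps (a hypothetical rule). The returned
    step is the first step of that run. Returns None if it never happens.
    """
    run = 0  # length of the current streak above the threshold
    for step, value in enumerate(signal, start=1):
        if value > threshold:
            run += 1
            if run == patience:
                return step - patience + 1
        else:
            run = 0  # a single in-range step resets the streak
    return None


# toy values only: one isolated spike at step 3, sustained drift from step 5
toy_signal = [0.1, 0.1, 0.6, 0.1, 0.7, 0.8, 0.9]
onset = drift_onset_step(toy_signal, threshold=0.5, patience=2)
```

the patience rule is there so that a single noisy step does not count as onset; with `patience=1` the same function reports the first step above the threshold.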

with the trajectory engine on, the drift signal stays inside the band we hold the run to, through the final step. task quality matches the unguided baseline, with no loss.
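
"stays inside the band" can be read as a simple per-step check. the sketch below, in Python, assumes the drift signal is one number per step and the band is a fixed lower and upper bound. the function names and the band values are hypothetical, and this is not a description of how the trajectory engine works internally.

```python
from typing import List, Sequence


def steps_outside_band(
    signal: Sequence[float],
    lower: float,
    upper: float,
) -> List[int]:
    """Return the 1-based steps whose drift value falls outside [lower, upper]."""
    return [
        step
        for step, value in enumerate(signal, start=1)
        if not (lower <= value <= upper)
    ]


def stays_in_band(signal: Sequence[float], lower: float, upper: float) -> bool:
    """True if every step of the run stays inside the band."""
    return not steps_outside_band(signal, lower, upper)


# toy values only, not eval data
guided = [0.10, 0.20, 0.15, 0.30]
unguided = [0.10, 0.20, 0.90, 0.30]
guided_ok = stays_in_band(guided, lower=0.0, upper=0.5)
unguided_breaches = steps_outside_band(unguided, lower=0.0, upper=0.5)
```

returning the offending steps, rather than only a boolean, makes it easy to see where in a run the signal left the band.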

we think this is where skalpel's bite is sharpest: anywhere the model is reasoning over a context wider than what it can hold in active attention at once.

Caveat

the eval is internal and small. we are working on a public version with seeds and traces. expect that here within the next two weeks.

ryan

@ryanndngg