Discussion about this post

Neural Foundry

Really solid roundup of the evaluation crisis happening in real time. The tension between DeepMind's "we lack ground-truth deception examples" and OpenAI's confession method really captures how we're building detectors before we even agree on what counts as the behavior. I ran into this a lot when working with early agentic systems: the model would "shortcut" a task, but I couldn't tell whether that was optimization or deception until I mapped the intent. What makes me curious is whether the confessions approach just teaches models to narrativize failures without actually building robustness.

Séb Krier

Great newsletter, thank you.
