When Your AI Co-Worker Starts Improvising

What Anthropic's Study Means for How We Work

Feb 03, 2026

Here’s something that should matter to everyone building workflows with AI: the more you let AI “think,” the less consistent it becomes.

Anthropic just published research showing that extended reasoning doesn’t automatically make AI more reliable. In their tests, it made AI more unpredictable. Not dangerous in a sci-fi way, but unpredictable in a “wait, why did it give me three different answers to the same question?” way.

They measured AI mistakes in two categories:

Bias = the model consistently gets something wrong the same way every time
Variance = the model’s answers bounce around randomly between attempts

When variance dominates (when the AI becomes what they call “incoherent”), you’re dealing with something that might give you brilliant work on Monday and nonsense on Tuesday for reasons no one can explain.
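To make the two categories concrete, here is a minimal sketch of how you might split error into bias and variance when you ask the same numeric question several times. It uses the textbook decomposition (mean squared error = bias squared + variance). The function name `decompose_error` and the `incoherence` ratio (the share of error that comes from variance) are illustrative assumptions, not the study’s exact metric, which this post doesn’t spell out.

```python
from statistics import mean, pvariance

def decompose_error(answers, truth):
    """Split mean squared error into bias^2 and variance for repeated numeric answers.

    Hypothetical helper: 'incoherence' here is simply variance / (bias^2 + variance),
    an assumed stand-in for the study's measure.
    """
    avg = mean(answers)
    bias_sq = (avg - truth) ** 2      # consistent, repeatable error
    variance = pvariance(answers)     # run-to-run scatter around the average
    total = bias_sq + variance        # equals the mean squared error
    incoherence = variance / total if total > 0 else 0.0
    return {"bias_sq": bias_sq, "variance": variance, "incoherence": incoherence}

# Made-up data: the correct answer is 10.
consistent_but_wrong = decompose_error([12.0, 12.0, 12.0, 12.0], truth=10.0)
scattered = decompose_error([6.0, 14.0, 9.0, 11.0], truth=10.0)

print(consistent_but_wrong)  # all bias, no variance
print(scattered)             # right on average, but all variance
```

The first set of answers is wrong in the same way every time, so you can spot and correct it. The second set averages out to the right answer, yet no single run can be trusted.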

The study found a troubling pattern across their tests: more reasoning steps lead to more randomness, longer sequences of agent actions reduce consistency, and complex tasks cause even the smartest models to wobble. This appears to be a feature of how these systems scale, not a bug in today’s AI.

What This Actually Means

If you’re building AI into your curriculum, planning organizational rollout, or integrating AI into daily workflows, this changes your risk model entirely.

The old fear was that AI pursues the wrong goal really well. The new worry is that AI behaves differently each time you run it, without pursuing any consistent goal at all.

For educators: You can’t just check if an AI tool gives good answers. You need to know if it gives consistent answers when students use it repeatedly.

For workflow designers: That AI agent that worked beautifully in testing might behave completely differently in production, not because it learned something new, but because extended reasoning introduces drift.

For anyone implementing AI: “Let the AI think longer” isn’t automatically better. Sometimes it’s just more expensive chaos.
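One practical way to act on this is to ask the same question several times and measure how often the answers agree. The sketch below assumes you have some function that sends a prompt to your AI tool and returns text. Here it is a hypothetical `ask_model` callable you would supply yourself, not any real vendor API. Exact-match comparison after trimming and lowercasing is also a simplifying assumption, and free-form answers would need a looser comparison.

```python
from collections import Counter

def agreement_rate(answers):
    """Fraction of runs that match the most common answer (1.0 = fully consistent)."""
    if not answers:
        raise ValueError("need at least one answer")
    # Ignore differences in case and surrounding whitespace.
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)

def check_consistency(ask_model, prompt, runs=5):
    """Send the same prompt `runs` times and score how often the answers agree.

    `ask_model` is a placeholder: any callable that takes a prompt string
    and returns the model's answer as a string.
    """
    answers = [ask_model(prompt) for _ in range(runs)]
    return agreement_rate(answers), answers

# Example with canned answers standing in for real model output.
sample = ["Paris", "paris ", "Lyon", "Paris"]
print(agreement_rate(sample))  # 0.75
```

A rate well below 1.0 on questions that have a single right answer is a signal to review the tool’s output by hand before relying on it.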

This is why human judgment remains essential in AI collaboration.

We’re here, in part, to catch AI being inconsistent, to notice when the pattern changes, when the reasoning wobbles, when Tuesday’s answer contradicts Monday’s for no good reason.

That’s a fundamentally different skill than “AI oversight.” It’s closer to editorial judgment, quality control, or what we used to call professional instinct. It’s recognizing when something’s off.

If you’re assessing whether your organization is ready for AI, the question shifts: instead of “Can we trust the AI?” ask “Can we catch when the AI becomes unreliable?”

Reliability isn’t a feature you can count on. It’s something you have to monitor, especially as AI systems scale up their reasoning capacity.

We’re learning to catch AI when it starts improvising, when the variance overtakes the bias, when longer thinking produces less consistent output, when “smarter” paradoxically means “less predictable.”

That’s a reliability problem in the practical sense. And reliability is something humans are actually quite good at spotting, if we’re paying attention, if we understand what we’re looking for, and if we haven’t handed over our judgment to the very systems we’re supposed to be collaborating with.

As AI gets more capable at reasoning, human pattern recognition becomes the quality control layer that keeps the whole system honest.
