Self-Consistency in LLMs: The Silent Effect That Degrades Long Sessions


Have you ever noticed that after a long conversation with an AI it gets "dumber"? Or starts making more errors?

It's not black magic. It has a name: Self-Consistency.

The problem

You start a simple, repetitive task with an AI — adding prices, classifying reviews, applying a rule to many rows — and at first it's perfect. After several minutes, errors appear. You correct one, and soon there are more. It feels like it degrades even though the task isn't difficult.

What self-consistency is

Self-consistency describes how the model influences itself with its own history. In long-horizon tasks, both correct answers and errors stay in the context window.

The model tries to be consistent with what it already "said." If there's an error in the context, it tends to align with that past and, unintentionally, reinforces it.

The mechanics:

  1. High precision at the start — almost 100% in the first steps
  2. Eventually an error surfaces
  3. That error stays in the context
  4. The model "looks back" and adjusts its belief
  5. Degradation loop → the probability of the next error increases
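The loop above can be sketched as a toy Monte Carlo simulation. The `contamination` parameter — how much each error in context lowers subsequent per-step accuracy — is an assumed illustrative value, not a measured one:

```python
import random

def run_session(steps=50, base_acc=0.95, contamination=0.10, seed=0):
    """Toy model: once an error enters the context, per-step accuracy
    drops by `contamination` per accumulated error (assumed parameter)."""
    rng = random.Random(seed)
    acc = base_acc
    errors = 0
    for _ in range(steps):
        if rng.random() > acc:          # this step produced an error
            errors += 1
            acc = max(0.0, base_acc - contamination * errors)  # degradation loop
    return errors

# Average errors over 1000 simulated sessions, with and without the feedback loop
with_loop = sum(run_session(seed=s) for s in range(1000)) / 1000
no_loop = sum(run_session(contamination=0.0, seed=s) for s in range(1000)) / 1000
```

With the feedback loop switched on, the average error count per session is noticeably higher than the ~2.5 errors that 50 independent steps at 95% accuracy would predict.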

A concrete example

Task: "Extract prices and keep a running total."

Step after step it works, until one price is misread. The wrong total stays in the context, and from then on every new total is built on the bad number. Minutes later, two truths are circulating: yours and the chat's.

If the precision per step is 95%, the chance of 50 perfect steps is ~7.7%. The error is expected; the trick is making sure it doesn't contaminate everything.
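The arithmetic behind that figure, under the simplifying assumption that steps are independent:

```python
# Probability that all n steps are correct, assuming an independent
# per-step precision p (a simplifying assumption, not a measured value).
def perfect_run_prob(p: float, n: int) -> float:
    return p ** n

print(round(perfect_run_prob(0.95, 50), 3))  # → 0.077
print(round(perfect_run_prob(0.95, 10), 3))  # → 0.599
```

Note how shortening the run changes the odds: 10 consecutive steps succeed almost 60% of the time, which is the intuition behind the chunking fix below.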

Solutions

Chunking: process in short batches and validate externally.

Recalculate from scratch: in each batch, don't use previous totals.

Context reset: if an error appears, start a new thread with the canonical dataset.

Rigid output format: a fixed schema per batch makes deviations easy to spot and validate.

Useful prompt:

Process 20 at a time. Recalculate from zero using only these 20.
Validate with these rules. If it fails, mark needs_review and don't correct the historical data.
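The batch-and-validate pattern from the prompt can be sketched in a few lines. The batch size, the validation rule (`max_price`), and the `needs_review` field are illustrative assumptions:

```python
def process_in_batches(prices, batch_size=20, max_price=10_000):
    """Chunking sketch: each batch is summed from zero (never from a
    carried-over running total) and validated externally; suspect values
    are flagged for review instead of being silently corrected."""
    results = []
    for i in range(0, len(prices), batch_size):
        batch = prices[i:i + batch_size]
        flagged = [p for p in batch if p < 0 or p > max_price]  # assumed rule
        results.append({
            "batch": i // batch_size,
            "subtotal": sum(p for p in batch if p not in flagged),
            "needs_review": flagged,
        })
    return results

# Usage: the grand total is recomputed from the validated subtotals,
# so no single bad value contaminates everything downstream.
batches = process_in_batches([19.9, 5.0, -3.0, 42.0])
total = sum(r["subtotal"] for r in batches)
```

Because each subtotal is recomputed from zero, an error in one batch stays in that batch; the external validation step, not the model's own history, decides what enters the final total.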

TL;DR

Self-consistency causes past errors to drag the model toward more errors when the session is long. Context reinforces its own narrative. The antidote: short batches, recalculate from scratch, resets when something gets contaminated, and a rigid output format.