Attractor Divergence as a Bias Signal

On TwoPass, analytical frames, and what disagreement reveals about weight space

TwoPass is a 475-line Python tool that does something I haven’t seen formalized elsewhere: it treats the disagreement between two differently-framed passes through the same model as a direct signal for training-data bias. Not the answer from either pass — the gap between them.

The core claim is that chain-of-thought prompting and attractor bias are orthogonal problems. CoT improves reasoning along a given trajectory through weight space. But if the trajectory itself reflects the dominant narrative in training data rather than the strongest available evidence, more reasoning produces more articulate bias. TwoPass proposes that routing through a different analytical frame — a different system prompt, a different evaluative persona — forces different attention paths through the same weights, and the resulting divergence is informative.
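Mechanically, this reduces to a thin orchestration layer over any chat model. The sketch below is hypothetical, not TwoPass's actual API: the `two_pass` function, the `model(system, user)` callable signature, and the prompt wording are all illustrative.

```python
from typing import Callable

def two_pass(model: Callable[[str, str], str], prompt: str, frame_system: str) -> dict:
    """Run the same prompt twice: once plainly, once under an analytical frame.

    `model(system, user)` is any function returning a completion string.
    Pass 2 sees pass 1's output and critiques it through the frame.
    """
    # Pass 1: the model answers with no special framing.
    pass1 = model("You are a helpful assistant.", prompt)

    # Pass 2: a differently-framed persona reviews pass 1's answer,
    # forcing different attention paths through the same weights.
    critique_input = (
        f"Original prompt:\n{prompt}\n\n"
        f"Pass 1 response:\n{pass1}\n\n"
        "Analyze the response per your instructions."
    )
    pass2 = model(frame_system, critique_input)

    # The signal is the gap between the passes, not either answer alone.
    return {"pass1": pass1, "pass2": pass2}
```

The key design point is that the frame change is deterministic: nothing here depends on sampling temperature.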

Relationship to existing work

This sits adjacent to several established approaches but isn’t quite any of them.

Self-consistency (Wang et al., 2022) samples multiple reasoning paths at higher temperature and takes a majority vote. The variation comes from stochastic sampling — same prompt, different random seeds. TwoPass gets its variation from a deterministic change in framing. The analytical frame isn’t noise; it’s a directed lens. These are different mechanisms targeting different failure modes. Self-consistency catches reasoning errors. TwoPass catches framing absorption.
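The contrast can be made concrete: self-consistency reconciles stochastic samples into one answer, while TwoPass treats the disagreement between two deterministic runs as the output itself. A schematic comparison (function names are mine, not from either codebase):

```python
from collections import Counter
from typing import Callable

def self_consistency(sample: Callable[[], str], k: int = 5) -> str:
    """Self-consistency: same prompt, k stochastic samples, majority vote.
    Variation comes from sampling noise; the vote collapses it."""
    votes = Counter(sample() for _ in range(k))
    return votes.most_common(1)[0][0]

def frames_diverge(answer_plain: str, answer_framed: str) -> bool:
    """TwoPass-style: one deterministic run per frame. The disagreement
    between the runs is the signal; nothing is reconciled away."""
    return answer_plain != answer_framed
```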

Debate (Irving et al., 2018) has two models argue opposing positions before a judge. TwoPass is single-model: the same weights, examined from two positions. This is more constrained — you can’t surface knowledge one model has that another lacks — but it isolates what’s happening inside a single model’s weight space, which is arguably a cleaner signal for attractor analysis.

Constitutional AI (Bai et al., 2022) uses principles to critique and revise outputs. TwoPass’s analytical frames serve a similar structural role — they define what to look for — but they’re targeted at epistemic bias patterns (hedging, asymmetric standards, consensus-as-truth) rather than safety or helpfulness criteria. The frame library is the engineering contribution here.

The frame designs

Four built-in frames, each targeting a distinct failure mode. These are worth reading in frames.py because the prompt engineering is the theory made concrete:

Bias regression looks for six specific patterns: hedging that weakens well-supported claims, strawman constructions, asymmetric evidentiary standards, absorbed talking points, consensus-as-truth substitution, and selective framing. The system prompt positions the model as a “bias regression testing tool” — structural analysis, not fact-checking.

Epistemic mapping charts claimed confidence against actual evidentiary basis. For each major claim in pass 1, it asks: is confidence outpacing evidence? Is the response following source volume or source quality? Where does mainstream consensus diverge from replication data? The persona is an “epistemic cartographer.”

Factual verification checks faithfulness against the original prompt — numbers restated wrong, constraints dropped, claims added that weren’t in the input. This catches a different class of error from the bias frames: not attractor distortion, but simple infidelity to the prompt.

Adversarial steelmanning constructs the strongest opposing case: systematically underweighted evidence, better-fitting alternative frameworks, correlation-causation confusions, unjustified assumptions. The persona is a “dialectical analyst.”
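Structurally, the frame library reduces to named personas plus the failure mode each targets. A sketch of how such a registry might look (the field names and prompt text here are illustrative paraphrases, not quotes from frames.py):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    name: str
    targets: str          # the failure mode this frame is designed to surface
    system_prompt: str    # persona + instructions for pass 2

FRAMES = {
    "bias_regression": Frame(
        name="bias_regression",
        targets="framing absorption: hedging, asymmetric standards, consensus-as-truth",
        system_prompt="You are a bias regression testing tool. Analyze structure, not facts.",
    ),
    "epistemic_map": Frame(
        name="epistemic_map",
        targets="confidence outpacing evidence",
        system_prompt="You are an epistemic cartographer. Chart claimed confidence against evidentiary basis.",
    ),
    "factual_check": Frame(
        name="factual_check",
        targets="infidelity to the prompt: dropped constraints, altered numbers, added claims",
        system_prompt="Verify the response against the original prompt. Flag additions and distortions.",
    ),
    "steelman": Frame(
        name="steelman",
        targets="systematically underweighted opposing evidence",
        system_prompt="You are a dialectical analyst. Construct the strongest opposing case.",
    ),
}
```

The choice to make each frame a persona rather than a checklist matters: the persona shifts the whole attention trajectory, not just the output rubric.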

What the test runs show

Sixteen runs across three models (LFM2-24B at 2.3B active params, GLM-4.7-flash, and Kimi-K2.5) on contested topics and logic puzzles. Some observations worth noting:

On the SSRI question, pass 1 produced a competent-looking analysis that described the evidence as “modest efficacy” — accurate in a narrow sense, but pass 2 caught that this framing downplays statistical significance without comparing effect sizes to older antidepressants. It identified asymmetric evidentiary standards: the response demanded more rigor from SSRIs than from therapy alternatives mentioned in the same breath. The divergence here isn’t factual error. It’s framing absorption — the model echoed the dominant narrative structure from its training data.

On the Belt and Road question, the pattern was different. Pass 1 presented a “balanced” analysis where balance meant equal weight to both positions. Pass 2 identified that the Sri Lanka Hambantota Port example was doing disproportionate work — one dramatic case standing in for a systemic claim, while the statistic that 60% of BRI countries have debt-to-GDP ratios below 50% was buried. The attractor wasn’t toward one position; it was toward a particular structure of balance that training data favors.

On logic puzzles (Monty Hall, bat-and-ball), the divergence was smaller. These have clear correct answers, and the models mostly got them right in pass 1. Pass 2 found minor hedging. The tool adds least where attractors are weakest — when the training signal converges on a correct answer, there’s nothing to diverge from.

Open questions

Frame selection. The four built-in frames were designed from intuition about which bias patterns exist. Is there a principled way to determine which frame will produce the most informative divergence for a given prompt? Could the frame itself be generated from the prompt?

Divergence quantification. Right now, divergence is qualitative — you read pass 2 and see what it found. A formal metric for divergence magnitude would enable comparison across prompts, models, and frames. What would that metric look like?
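One deliberately naive starting point is lexical: measure how far pass 2's vocabulary departs from pass 1's. A Jaccard-distance sketch (a crude baseline of my own, not anything TwoPass implements; embedding-based or claim-level metrics would be the serious candidates):

```python
import re

def divergence_score(pass1: str, pass2: str) -> float:
    """Jaccard distance over content-word sets: 0.0 means identical
    vocabulary, 1.0 means no overlap. A rough proxy for how far the
    framed pass departed from the plain pass."""
    def words(text: str) -> set:
        # Keep lowercase words longer than 3 chars as crude "content words".
        return {w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3}
    a, b = words(pass1), words(pass2)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

Even a metric this blunt would let the 16 runs be ranked by divergence magnitude, which is the comparison the current qualitative readings can't support.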

Multi-model divergence. Running the same prompt through different models reveals different attractor patterns. The README notes this as a use case. But the interesting analysis would be systematic: which topics produce the most model-to-model divergence, and what does that tell us about training-data composition?

Scaling behavior. The README claims the effect scales with model capability. If true, this has implications for how attractor bias behaves as models get larger — does it get stronger (deeper attractors from more training data) or weaker (more capacity to represent competing positions)?

Attractor topology. The current approach assumes a single dominant attractor per topic. But contested topics might have multiple local attractors — progressive vs. conservative framings of the same issue, for instance. Multi-pass analysis could map these, but the current three-pass pipeline doesn’t attempt it.

TwoPass is 475 lines of stdlib Python. The frame designs in frames.py are where the ideas live. The test runs in runs/ are where the evidence lives. Start there.