Code Correctness Is Not in the Text

Cross-Domain Limits of Hand-Crafted CoT-Surface Features
Yuhan Chi · Fudan University · 2026   |   GitHub  ·  Full Report (PDF)

Same model, same feature family. Completely different stories.

Math CoT — AIME24 #1 (is_correct: true, AUROC 0.982)
# Reasoning trace from a correct mathematical solution
Problem: Aya walks 9 km. At s km/h it takes 54 min longer than at s+2 km/h.
Step 1: Let walking speed be s km/h. Boosted speed: s + 2 km/h.
Step 2: Time normal = 9/s, Time boosted = 9/(s+2).
Step 3: The difference is 54 min = 9/10 hour: 9/s − 9/(s+2) = 9/10.
Step 4: Multiply by s(s+2): 9(s+2) − 9s = (9/10)s(s+2), so s² + 2s = 20.
Step 5: Quadratic: s² + 2s − 20 = 0, so s = −1 + √21 ≈ 3.58.
Step 6: Verify: 9/3.58 − 9/5.58 ≈ 0.9 hr = 54 min ✓
Confidence is high throughout. No backtracking. traj_continuity: stable. The reasoning converges monotonically — every step builds on the last.
Result: ✓ Clean, confident, convergent. Correctness is visible in the trace.
Coding CoT — LiveCodeBench-v5 #131 (is_correct: false, AUROC 0.407)
# Actual snippet from a branch_heavy coding trace
t=6 which is divisible by 3. but note: the condition is "divisible by t". since the product can be 0 (if any digit is 0) then that aut... from n upwards and check the condition for each number. however, note the example: for n=1... digits must be divisible by t. note that if the number contains a 0, then... but wait, the product of digits: if an... actually, if left >= right, we return. but we already moved left, so now left < right again... but note: the condition says "divisible by t", however the problem might require th...
Result: ✗ Model lost in conditional maze. Features see fluent text, miss wrong logic.

1. Background: Can We Judge a Run by Its Trace?

Large language models boost reasoning through sample-evaluate-ensemble pipelines: generate many solutions, pick the best. But how do you know which one is correct without running ground-truth checks?

A long line of work has studied this question. Cobbe et al. (2021) trained verifiers to score reasoning traces. Lightman et al. (2023) showed that process-level reward models outperform outcome-level ones in math. Math-Shepherd extended this to step-by-step verification without human annotations. ProcessBench evaluated step-level error detection, and generative verifiers cast reward modeling as next-token prediction.

But these approaches have costs. Training a process reward model requires substantial data and compute. Running an LLM judge at inference time is expensive. The natural question becomes: before investing in a sophisticated verifier, what can be recovered from a very cheap feature family?

Cheap features matter because they enable real-time deployment: early stopping (prune unpromising traces to save tokens), best-of-N selection (rank candidates without a full reward model), and training guidance (signal which trajectories are healthy). If simple CoT surface features can predict correctness, we get these benefits almost for free.

The Core Question

Can hand-crafted CoT-surface features—token confidence, trajectory shape, activation summaries—indicate whether a solution is correct? And does this transfer across domains?

2. Feature Family: What We Measure

We study five hand-crafted features computed from cached token streams and sparse activations. These are deliberately cheap—no LLM inference, no execution, no hidden-state probing.

Token-Level Summaries

Feature | Computation | Intuition
tok_conf | Mean negative log-probability over visible prefix tokens | Lower = model is more certain. Confident models tend to be correct.
tok_gini | Gini coefficient of per-token probability distributions | Higher = distribution is more peaked. Converged reasoning has sharp distributions.
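The report does not give closed-form definitions for these two summaries; a minimal sketch under common conventions (mean negative log-probability for tok_conf; mean Gini coefficient over per-token next-token distributions for tok_gini) — the exact formulas are assumptions:

```python
import numpy as np

def tok_conf(logprobs: np.ndarray) -> float:
    """Mean negative log-probability over visible prefix tokens.
    Lower = the model assigned higher probability to its own tokens."""
    return float(np.mean(-logprobs))

def gini(x: np.ndarray) -> float:
    """Gini coefficient of a probability vector: 0 = uniform, -> 1 = peaked."""
    x = np.sort(x)                 # ascending order
    n = len(x)
    cum = np.cumsum(x)
    # Standard cumulative-share form of the Gini coefficient.
    return float((n + 1 - 2 * np.sum(cum) / cum[-1]) / n)

def tok_gini(per_token_probs: list[np.ndarray]) -> float:
    """Mean Gini coefficient across per-token probability distributions."""
    return float(np.mean([gini(p) for p in per_token_probs]))
```

A uniform distribution scores 0 and a one-hot distribution approaches 1, matching the "higher = more peaked" reading in the table.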

Trajectory Statistics

Trajectory features slice the CoT trace into contiguous, non-overlapping 40-word windows and compute set-overlap statistics:

Feature | Computation | Intuition
traj_continuity | Mean Jaccard similarity between consecutive slices | How much does the reasoning stay on the same topic? High continuity = coherent reasoning.
traj_reflection_count | Negated count of non-adjacent slice pairs (≥2 slices apart, i.e. ≥80 words) with Jaccard > 0.30 | How often does the model revisit earlier ideas? More negative = more backtracking. E.g. a trace that says "...check divisibility..." and then, 200 words later, "...back to divisibility..." increments this counter. In math, this captures failure to converge.
traj_novelty | Mean of (1 − max Jaccard with all prior slices) | How much new ground does each step cover? Balanced novelty = healthy exploration.
Key Detail: Jaccard Over Word Sets

Each slice's content is converted to a set of keywords. Jaccard similarity J(A,B) = |A ∩ B| / |A ∪ B| measures how much two slices overlap in vocabulary. These are surface features—they operate on word overlap, not semantic understanding.
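The three trajectory statistics follow directly from these definitions. A sketch, with plain lowercased word sets standing in for the keyword extraction (which the report does not fully specify):

```python
def jaccard(a: set, b: set) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B|; defined as 0 for two empty sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def slice_sets(trace: str, width: int = 40) -> list[set]:
    """Contiguous, non-overlapping 40-word windows, each as a word set."""
    words = trace.lower().split()
    return [set(words[i:i + width]) for i in range(0, len(words), width)]

def traj_continuity(slices: list[set]) -> float:
    """Mean Jaccard between consecutive slices."""
    return sum(jaccard(a, b) for a, b in zip(slices, slices[1:])) / max(len(slices) - 1, 1)

def traj_reflection_count(slices: list[set], gap: int = 2, thresh: float = 0.30) -> int:
    """Negated count of slice pairs >= `gap` apart with Jaccard above `thresh`."""
    n = len(slices)
    return -sum(1 for i in range(n) for j in range(i + gap, n)
                if jaccard(slices[i], slices[j]) > thresh)

def traj_novelty(slices: list[set]) -> float:
    """Mean of (1 - max Jaccard with all prior slices), over slices 2..n."""
    vals = [1 - max(jaccard(slices[j], slices[i]) for i in range(j))
            for j in range(1, len(slices))]
    return sum(vals) / len(vals) if vals else 1.0
```

A trace of slices A, B, A (go somewhere, come back) then scores zero continuity, −1 reflection, and mid novelty, which is exactly the backtracking signature the math results rely on.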

The Pipeline

StandardScaler → TruncatedSVD → LogisticRegression

Fit separately per domain. Evaluation uses problem-grouped splits. The main metric is AoA (AUC-of-AUROC)—the arithmetic mean of AUROC values at 10 anchor positions (10%–100% in 10% steps)—measuring how well features discriminate correct vs. incorrect solutions as the reasoning unfolds. Bootstrap 95% CIs use B=10,000 resamplings.
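The probe and the AoA metric can be sketched in a few lines of scikit-learn; the SVD rank, regularization, and split size below are placeholder assumptions, not the report's settings:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import roc_auc_score

def fit_probe(X, y, groups, svd_rank=8):
    """StandardScaler -> TruncatedSVD -> LogisticRegression, evaluated on a
    problem-grouped split so no problem appears in both train and test."""
    split = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    tr, te = next(split.split(X, y, groups))
    probe = make_pipeline(StandardScaler(),
                          TruncatedSVD(n_components=svd_rank),
                          LogisticRegression(max_iter=1000))
    probe.fit(X[tr], y[tr])
    return probe, roc_auc_score(y[te], probe.predict_proba(X[te])[:, 1])

def aoa(aurocs_at_anchors):
    """AoA: arithmetic mean of AUROC at the 10 anchors (10%, 20%, ..., 100%)."""
    assert len(aurocs_at_anchors) == 10
    return float(np.mean(aurocs_at_anchors))
```

Grouping the split by problem matters: runs of the same problem share surface vocabulary, so an ungrouped split would leak problem identity into the test folds.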

The SVD pipeline end-to-end: X_raw + X_rank → X_rep → StandardScaler → TruncatedSVD → LogisticRegression.

3. Results: Math Works, Science Partial, Coding Fails

Domain | AoA (95% CI) | AUROC@100% | Best-of-N=64 | Dominant Feature
Math | 0.968 [0.931, 0.980] | 0.982 | +10.0 pp | traj_reflection_count (0.242)
Science | 0.827 [0.775, 0.822] | 0.841 | +8.0 pp | tok_gini_tail (0.038)
Coding | 0.432 [0.404, 0.464] | 0.407 | −0.6 pp | None above noise
Figure 1: AUROC at each anchor position (10%–100% of trace visible). Math rises early and plateaus high; science improves gradually; coding stays flat throughout.

Math: Trajectory Structure Carries the Signal

In math, the probe substantially outperforms the token-confidence baseline: AoA 0.968 vs. 0.685. The signal emerges early—reaching 95% of final AUROC by the 10% anchor. Permutation importance shows traj_reflection_count dominates (importance 0.242), followed by traj_continuity (0.092) and traj_novelty (0.052). This is a trajectory-driven pattern: math correctness is visible in how the reasoning unfolds.

Science: Confidence, Not Structure

Science shows positive but qualitatively different results. The full probe (AoA 0.827) doesn't beat a simple tok_conf baseline (AoA 0.809). The leading features are tok_gini_tail, tok_conf_recency, tok_conf_prefix—all confidence-based. The science signal is genuine but comes from how certain the model sounds, not from reasoning structure.

Coding: Below Chance

Coding achieves AoA 0.432 and AUROC@100% 0.407—below the tok_conf baseline (0.506). No best-of-N gain (−0.6 pp). The 95% CIs are non-overlapping with math, confirming this is a real domain gap, not statistical noise.

Feature Importance Contrast

This is the key finding: the same feature family measures different things in different domains.

  • Math: Trajectory features dominate → features measure reasoning quality
  • Science: Token confidence dominates → features measure model certainty
  • Coding: Nothing dominates → features measure text fluency, not correctness

This is a measurement invariance failure (Vandenberg & Lance, 2000; Belinkov, 2022): the same operationalization measures different latent constructs across contexts.

4. Why Coding Fails: Five Convergent Checks

One negative result might be a feature engineering problem. We ran five independent checks to confirm this is deeper.

Check 1: Feature Sweep — 83 Coding-Specific Scalars

Feature funnel: 83 features → screened → top-4 → AUROC 0.556. Candidates included struct_density, cf_n_ifs, indentation depth, loop nesting, keyword frequencies, +78 more.

Approach: Designed 83 features specifically for code: struct_density, cf_n_ifs (branch count), cot_trace_count, indentation depth, keyword frequencies, loop nesting depth.

Result: Best single feature (struct_density): AUROC 0.565. Best top-4 combination: AUROC 0.556.

Reading: These features are not pure noise; cf_n_ifs weakly correlates with correctness. But the effects are small and semantically mixed: deeper branching correlates weakly positively, while denser structural narration correlates weakly negatively. Neither comes close to separating correct from incorrect solutions.

Check 2: Self-Supervised Pre-training

SSL v2 training pipeline: 42K unlabeled traces → per-domain encoder (50 → 24 dims) → multi-objective SSL (family-block masking, cross-anchor NT-Xent, future-anchor prediction, VICReg) → fine-tune with labels. Coding AUROC: 0.537 with 5% of labels; 0.454 with 100% of labels.

Approach: SSL on 42K unlabeled traces. SSL learns text structure (like BERT learns sentence structure)—recurring patterns, common phrases, syntactic regularities.

v1 result: Collapsed—the loss stalled at 0.432, and the rank-8 representation offered nothing.

v2 result: Per-domain encoders with VICReg. Coding full-label result: AUROC 0.454 — paradoxically lower than the 5%-label peak of 0.537, suggesting SSL pre-training does not uncover correctness-relevant structure and may even interfere at full supervision.

Reading: v1's failure showed the domains are fundamentally different—a universal encoder can't work. v2's partial success at low label counts shows SSL helps with regularization, but it doesn't uncover correctness-relevant structure. SSL learns that "range" tends to follow "for", but not whether the loop terminates correctly.

Check 3: Nonlinear MLP

Architecture comparison: linear probe on 25D features, AUROC 0.506; one-hidden-layer MLP on the same features, AUROC 0.499–0.507. Adding a hidden layer does not recover missing signal.

Approach: One-hidden-layer MLP on 25D features—can it learn nonlinear combinations that linear models miss?

Result: AUROC 0.499–0.507

Reading: Adding model capacity doesn't help. If the signal isn't in the features, no amount of nonlinearity can extract it.
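The capacity check is easy to restate in code. A sketch with assumed hyperparameters (hidden width 32, 5-fold CV), not the report's exact configuration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def compare_capacity(X, y, cv=5):
    """Same features, linear probe vs. one-hidden-layer MLP.
    If the features carry no signal, added capacity cannot create it."""
    linear = LogisticRegression(max_iter=1000)
    mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    lin_auc = cross_val_score(linear, X, y, cv=cv, scoring="roc_auc").mean()
    mlp_auc = cross_val_score(mlp, X, y, cv=cv, scoring="roc_auc").mean()
    return lin_auc, mlp_auc
```

On uninformative features, both scores hover near chance: the nonlinear model has nothing to be nonlinear about.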

Check 4: De-knotting — Removing Circular Reasoning Spans

De-knotting process: in a trace like "check each digit… wait, actually, but note: condition says divisible by t …so for n=1", GLM-4 identifies the knot span (the circular middle) and those tokens are removed, leaving the spliced trace "check each digit… …so for n=1". Coding result: Δ=+0.006 (neutral).

Approach: One hypothesis: coding traces contain noisy circular spans ("knots") that obscure an otherwise useful signal. We used GLM-4 and human annotation to identify knot spans, then removed them from all cached per-token arrays.

Result by domain:

  • Math: Removing knots degrades AUROC (0.502 → 0.453, Δ=−0.049) — these spans carry signal
  • Science: Minimal change (Δ=−0.006)
  • Coding: Neutral (0.460 → 0.466, Δ=+0.006)

Reading: In math, knot spans are informative—they're where the model struggles and reveals uncertainty. In coding, they're neither helpful nor harmful. The coding weakness isn't attributable to removable surface clutter.

Check 5: Coding-Specific Run Judge

Cross-validation pipeline: 19 surface features → 5-fold GroupKFold grouped by problem_id (each fold holds out unseen problems) → OOF AUC 0.481 ± 0.053 across folds, consistent with chance (0.5).

Approach: 19 surface features (planning density, vocabulary overlap, structural flags, length, neuron count, cot_mode indicators). Evaluated on unseen problem_ids via 5-fold GroupKFold.

Result: OOF AUC 0.481 ± 0.053 — fails on unseen problems

Reading: Surface cues that appear meaningful within problems do not generalize across problems.
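The grouped out-of-fold evaluation looks like this; the logistic classifier is a stand-in for the actual judge, and the feature matrix is whatever the 19 surface scalars produce:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def oof_auc(X, y, problem_ids, n_splits=5):
    """Out-of-fold AUC with folds grouped by problem_id, so every
    prediction is made on problems never seen during training."""
    oof = np.zeros(len(y))
    for tr, te in GroupKFold(n_splits=n_splits).split(X, y, groups=problem_ids):
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        oof[te] = clf.predict_proba(X[te])[:, 1]
    return roc_auc_score(y, oof)
```

An ungrouped K-fold would let the model memorize per-problem vocabulary; grouping is what exposes the failure to generalize across problems.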

Figure 2: Coding AUROC Across Five Robustness Checks (chance = 0.5)
Feature Sweep: Math 0.990, Coding 0.556 · SSL (full label): Math 0.978, Coding 0.454 · Nonlinear MLP: Math 0.985, Coding 0.503 · De-knotting: Math 0.453, Coding 0.466 · Run Judge: Coding 0.481
De-knotting values: AUROC on de-knotted data (Math: 0.502→0.453; Coding: 0.460→0.466). Run Judge: coding-specific, no Math equivalent.
Five independent robustness checks. For the first three, Math AUROC stays above 0.97 while coding never clears 0.6. De-knotting operates at a different scale (subset evaluation) but confirms the same pattern: signal exists in math, not in coding.

5. Why the Divergence?

The convergent failure across five methods points to something deeper than feature engineering. Here's our best explanation:

Math: Correctness Lives in the Reasoning Process

Mathematical reasoning is inherently sequential: solve step A, verify, proceed to step B. "Predict the next token" aligns naturally with "predict the next reasoning step." When a solution fails, the trace shows visible patterns—hesitation, backtracking, circular reasoning. Our features capture these because, in math, correctness is coupled with reasoning quality.

As Lanham et al. (2023) showed, CoT traces aren't always faithful—but in math, unfaithful reasoning tends to fail, creating a useful correlation between trace quality and correctness.

Science: Confidence as Proxy

Science problems share some verbal uncertainty cues with math, but the usable signal comes primarily from token confidence, not trajectory structure. This makes sense: science questions often test knowledge rather than reasoning process. The model either knows the answer (high confidence) or doesn't.

Coding: Correctness Lives in the Execution

Linear Text vs. Non-linear Code Execution
Linear CoT text: Step 1 → Step 2 → Step 3 → Step 4 → Step 5
Actual code execution: Step 1 → Step 2 → if X? → Step 3a / Step 3b → Step 4 → Step 5

Programming involves control flow that natural language cannot faithfully represent:

  • Branches: Conditional logic with mutually exclusive paths. In linear text, the model describes both paths simultaneously.
  • Loops: Iterative state updates. Describing iteration in linear text loses the "current state" at each step.
  • State: Variable bindings and runtime behavior that leave no trace in the CoT text.

A model can write highly confident, well-structured pseudo-code that fails on edge cases. Our features would rate this as "high quality" but it's actually wrong. Code correctness is not in the text—it's in the runtime.
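A concrete illustration, loosely reconstructed from the trace quoted at the top (the task spec here is an assumption, not the actual LiveCodeBench problem): two candidates whose surrounding prose would read equally fluent, separated only by execution.

```python
def digit_product(m: int) -> int:
    """Product of the decimal digits of m."""
    prod = 1
    for ch in str(m):
        prod *= int(ch)
    return prod

def smallest_valid(n: int, t: int) -> int:
    """Smallest m >= n whose digit product is divisible by t.
    Correct: a 0 digit gives product 0, and 0 is divisible by any t."""
    m = n
    while digit_product(m) % t != 0:
        m += 1
    return m

def smallest_valid_confused(n: int, t: int) -> int:
    """Same task after the trace's wrong turn ("if the number contains
    a 0, then..."): it talks itself into skipping zero-digit numbers."""
    m = n
    while "0" in str(m) or digit_product(m) % t != 0:
        m += 1
    return m
```

smallest_valid(10, 2) returns 10 (digit product 0 is divisible by 2), while the confused variant skips every number containing a 0 and returns 12. Nothing in the surrounding prose distinguishes the two; only running them does.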

The Knot Problem

We call this the knot problem: natural language is a linear medium. Code has non-linear control flow. When you flatten branching logic into sequential text, you create "knots"—places where the model loses track of which branch it's in, iteration state gets tangled, and caller/callee frames blur.

Our de-knotting experiment (Check 4) tested whether these surface knots are the bottleneck. They are not: removing them leaves coding performance unchanged (Δ=+0.006). This suggests the coding weakness runs deeper than any one class of noisy spans, but we emphasize this is a negative finding—it rules out one explanation, rather than confirming another.

As CodeT (Chen et al., 2022) demonstrated, code verification benefits from stronger information channels—execution feedback, test cases. Our finding is consistent with this: when you intentionally restrict yourself to the weakest channel (CoT surface), it works for math but breaks for coding. Whether this failure stems from the linear-text vs. non-linear-code mismatch specifically, or from other aspects of the coding domain, remains an open question.

6. What This Means

Practical Takeaway

Domain signal strength: Math = strong CoT signal (reasoning features); Science = moderate CoT signal (confidence proxy); Coding = near-zero CoT signal (execution needed).

Information channels, weakest → strongest: CoT surface text → CoT token probabilities (we tested this) → model hidden states → LLM-as-judge → execution feedback / test cases.

For math and science, cheap CoT surface features are genuinely useful—they enable early stopping, best-of-N selection, and quality monitoring at near-zero cost. For coding, you need stronger signals: execution feedback, test cases, or model internals.

What We Don't Claim

  • We don't claim all text-based verifiers fail on coding. LLM judges, self-refinement, or raw-text classifiers may work—they use different information channels.
  • We don't claim execution-aware methods won't work (they should!).
  • All experiments use a single model (DeepSeek-R1-0528-Qwen3-8B). Results may differ for other architectures.
  • Our SSL conclusion is objective-specific: it shows failure under the tested pretraining losses, not a universal impossibility.

Broader Context

This connects to a growing understanding that CoT quality is domain-dependent. Huang et al. (2023) showed that LLMs cannot self-correct reasoning yet. Liu et al. (2024) found some intrinsic self-correction ability. Our contribution is showing that even the measurability of correctness from CoT surface depends on the domain—the same instrument measures different things.

7. Technical Appendix

Model

DeepSeek-R1-0528-Qwen3-8B (8B parameters, reasoning-specialized), temperature 1.0.

Data
  • Math: 7,680 runs across AIME24, AIME25, BRUMO25, HMMT25
  • Science: 12,672 runs from GPQA Diamond
  • Coding: 10,688 runs from LiveCodeBench-v5
  • Total: 31,040 runs
Annotation Pipeline (Coding)

Coding traces were annotated for knot spans using GLM-4, Python pattern-matching scripts, and human review. The multi-method approach ensured comprehensive coverage of circular reasoning, branch confusion, and loop tangles.

Metric

AoA (AUC-of-AUROC)—the arithmetic mean of AUROC values at 10 anchor positions (10%–100% in 10% steps): measures how well features discriminate correct vs. incorrect as reasoning unfolds. We also report AUROC@100%. Bootstrap 95% CIs use B=10,000 resamplings per domain.
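A percentile-bootstrap sketch of those CIs; the resampling unit (individual runs, passed as NumPy arrays) and the percentile method are assumptions:

```python
import numpy as np

def bootstrap_ci(scores, labels, metric, B=10_000, seed=0):
    """Percentile bootstrap 95% CI: resample (score, label) pairs with
    replacement B times, recompute the metric, take the 2.5/97.5 percentiles."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        if labels[idx].min() == labels[idx].max():  # need both classes for AUROC
            continue
        stats.append(metric(labels[idx], scores[idx]))
    return float(np.percentile(stats, 2.5)), float(np.percentile(stats, 97.5))
```

With a metric like roc_auc_score and B=10,000 this directly mirrors the reported intervals; smaller B is fine for quick checks.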

Reproducibility

All tables come from results/tables/ and workshop/cotknot/results/tables/. Key artifacts: coding_feature_family_ablation.csv, cot_run_judge_scores.csv, cot_run_judge_rerank.csv.

Appendix: Extra Details from the Paper

For readers who want more depth, here are additional findings that didn't fit the main narrative.

A. Cross-Anchor Transfer

We tested whether an SVD basis learned at one trace position transfers to another. Math exhibits near-lossless transfer: forward gap Δ=−0.001, backward Δ=−0.006. Science shows a larger forward gap (Δ=−0.010), suggesting its quality geometry is less stable across positions. This means math has a consistent "quality structure" throughout the trace; science's structure shifts as the reasoning progresses.

B. Dense-Anchor Robustness

We verified that the domain ordering is robust to anchor density. On our 10-anchor grid, AoA reaches 0.968 (math), 0.827 (science), and 0.432 (coding). A coarser 4-anchor subset {10%, 40%, 70%, 100%} yields the same ordering: 0.958 / 0.799 / 0.434. Science continues improving into later anchors, while coding stays weak throughout—confirming the gap isn't an artifact of measurement density.

C. Self-Consistency Oracle

A self-consistency oracle (using ground truth to know the majority answer) achieves AoA 0.953 (math), 0.921 (science), and 0.961 (coding). The fact that SC oracle achieves 0.961 in coding means coding correctness is predictable—just not from text surface features. This validates our interpretation: the information exists (in execution, in hidden states), but our features can't access it.

D. Grouped Family Ablations

We tested whether coding failure stems from one bad feature subset. Results:

  • traj_only: AUROC 0.509
  • token_plus_traj: AUROC 0.501
  • Full 30-feature surface family: AUROC 0.506

No subset performs well—the failure is distributed across the entire feature family, not concentrated in one group.

E. De-knotting Across Domains

De-knotting has opposite effects in different domains:

  • Math: AUROC drops from 0.502 → 0.453 (Δ=−0.049). Knot spans carry signal—they're where the model struggles.
  • Science: Minimal change (Δ=−0.006).
  • Coding: Neutral (Δ=+0.006). Removing knots doesn't help.

This asymmetry is informative: in math, circular reasoning is a symptom of the disease (wrong reasoning). In coding, removing circular spans does not help, which rules out the hypothesis that coding weakness is caused by removable surface clutter. Whether the deeper cause is a linear-text vs. non-linear-code representational mismatch—as the knot hypothesis suggests—or some other aspect of the coding domain, is not settled by this experiment.

F. Difficulty Band Analysis

The coding-specific CoT-only judge was evaluated across difficulty bands:

  • Hard problems: AUC 0.515
  • Mid problems: AUC 0.470
  • Easy problems: AUC 0.509

Performance stays near chance across all difficulty levels, confirming the failure isn't limited to hard problems.

References

  1. Wei, J. et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS.
  2. Cobbe, K. et al. (2021). Training verifiers to solve math word problems. arXiv:2110.14168.
  3. Lightman, H. et al. (2023). Let's verify step by step. arXiv:2305.20050.
  4. Wang, P. et al. (2024). Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. ACL.
  5. Zheng, C. et al. (2025). ProcessBench: Identifying process errors in mathematical reasoning. ACL.
  6. Zhang, L. et al. (2025). Generative verifiers: Reward modeling as next-token prediction. ICLR.
  7. Lanham, T. et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702.
  8. Huang, J. et al. (2023). Large language models cannot self-correct reasoning yet. arXiv:2310.01798.
  9. Liu, D. et al. (2024). Large language models have intrinsic self-correction ability. NeurIPS.
  10. Madaan, A. et al. (2023). Self-Refine: Iterative refinement with self-feedback. arXiv:2303.17651.
  11. Shinn, N. et al. (2023). Reflexion: Language agents with verbal reinforcement learning. arXiv:2303.11366.
  12. Chen, B. et al. (2022). CodeT: Code generation with generated tests. arXiv:2207.10397.
  13. Vandenberg, R. E. & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature. Organizational Research Methods, 3(1), 4–70.
  14. Belinkov, Y. (2022). Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1), 207–219.