Same model, same feature family. Completely different stories.
1. Background: Can We Judge a Run by Its Trace?
Large language models boost reasoning through sample-evaluate-ensemble pipelines: generate many solutions, pick the best. But how do you know which one is correct without running ground-truth checks?
A long line of work has studied this question. Cobbe et al. (2021) trained verifiers to score reasoning traces. Lightman et al. (2023) showed that process-level reward models outperform outcome-level ones in math. Math-Shepherd extended this to step-by-step verification without human annotations. ProcessBench evaluated step-level error detection, and generative verifiers cast reward modeling as next-token prediction.
But these approaches have costs. Training a process reward model requires substantial data and compute. Running an LLM judge at inference time is expensive. The natural question becomes: before investing in a sophisticated verifier, what can be recovered from a very cheap feature family?
Cheap features matter because they enable real-time deployment: early stopping (prune unpromising traces to save tokens), best-of-N selection (rank candidates without a full reward model), and training guidance (signal which trajectories are healthy). If simple CoT surface features can predict correctness, we get these benefits almost for free.
The Core Question
Can hand-crafted CoT-surface features—token confidence, trajectory shape, activation summaries—indicate whether a solution is correct? And does this transfer across domains?
2. Feature Family: What We Measure
We study a compact family of hand-crafted features computed from cached token streams and sparse activations; the five headline features below illustrate the two groups. These are deliberately cheap—no LLM inference, no execution, no hidden-state probing.
Token-Level Summaries
| Feature | Computation | Intuition |
|---|---|---|
| `tok_conf` | Mean negative log-probability over visible prefix tokens | Lower = model is more certain. Confident models tend to be correct. |
| `tok_gini` | Gini coefficient of per-token probability distributions | Higher = distribution is more peaked. Converged reasoning has sharp distributions. |
Trajectory Statistics
Trajectory features slice the CoT trace into contiguous, non-overlapping 40-word windows and compute set-overlap statistics:
| Feature | Computation | Intuition |
|---|---|---|
| `traj_continuity` | Mean Jaccard similarity between consecutive slices | How much does the reasoning stay on the same topic? High continuity = coherent reasoning. |
| `traj_reflection_count` | Negated count of non-adjacent slice pairs (≥2 slices apart, i.e. ≥80 words) with Jaccard > 0.30 | How often does the model revisit earlier ideas? More negative = more backtracking. E.g. a trace that says "...check divisibility..." and 200 words later "...back to divisibility..." increments the counter. In math, this captures "failure to converge." |
| `traj_novelty` | Mean of (1 − max Jaccard with all prior slices) | How much new ground does each step cover? Balanced novelty = healthy exploration. |
Key Detail: Jaccard Over Word Sets
Each slice's content is converted to a set of keywords. Jaccard similarity J(A,B) = |A ∩ B| / |A ∪ B| measures how much two slices overlap in vocabulary. These are surface features—they operate on word overlap, not semantic understanding.
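Under those definitions, the three trajectory statistics can be sketched as follows. Keyword extraction is simplified to lowercased word sets; the 40-word width and the 0.30 threshold come from the text above:

```python
import numpy as np

def slices(trace: str, width: int = 40) -> list[set[str]]:
    """Contiguous, non-overlapping 40-word windows, each reduced to a
    word set (the paper uses keywords; plain words stand in here)."""
    words = trace.lower().split()
    return [set(words[i:i + width]) for i in range(0, len(words), width)]

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def traj_features(trace: str) -> dict[str, float]:
    s = slices(trace)
    if len(s) < 3:
        raise ValueError("trace too short to slice")
    # Continuity: mean overlap between consecutive slices.
    continuity = np.mean([jaccard(s[i], s[i + 1]) for i in range(len(s) - 1)])
    # Reflection: negated count of non-adjacent pairs (>= 2 slices apart,
    # i.e. >= 80 words) whose overlap exceeds 0.30 -- revisiting old ideas.
    reflection = -sum(
        jaccard(s[i], s[j]) > 0.30
        for i in range(len(s)) for j in range(i + 2, len(s))
    )
    # Novelty: how much new vocabulary each slice introduces.
    novelty = np.mean([
        1 - max(jaccard(s[i], s[j]) for j in range(i))
        for i in range(1, len(s))
    ])
    return {"traj_continuity": float(continuity),
            "traj_reflection_count": float(reflection),
            "traj_novelty": float(novelty)}
```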
The Pipeline
StandardScaler → TruncatedSVD → LogisticRegression
Fit separately per domain. Evaluation uses problem-grouped splits. The main metric is AoA (AUC-of-AUROC)—the arithmetic mean of AUROC values at 10 anchor positions (10%–100% in 10% steps)—measuring how well features discriminate correct vs. incorrect solutions as the reasoning unfolds. Bootstrap 95% CIs use B=10,000 resamplings.
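A minimal sklearn rendering of the probe and its grouped evaluation (the SVD rank, fold count, and dummy data are assumptions; the text specifies only the three stages and problem-grouped splits):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(320, 25))             # 25 scalar features per run
y = rng.integers(0, 2, size=320)           # 1 = correct solution
problem_ids = np.repeat(np.arange(40), 8)  # 8 runs per problem

probe = make_pipeline(
    StandardScaler(),
    TruncatedSVD(n_components=10),
    LogisticRegression(max_iter=1000),
)

# Problem-grouped splits: no problem appears in both train and test.
aurocs = cross_val_score(probe, X, y, groups=problem_ids,
                         cv=GroupKFold(n_splits=5), scoring="roc_auc")
```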
3. Results: Math Works, Science Partial, Coding Fails
| Domain | AoA (95% CI) | AUROC@100% | Best-of-N=64 | Dominant Feature |
|---|---|---|---|---|
| Math | 0.968 [0.931, 0.980] | 0.982 | +10.0 pp | traj_reflection_count (0.242) |
| Science | 0.827 [0.775, 0.822] | 0.841 | +8.0 pp | tok_gini_tail (0.038) |
| Coding | 0.432 [0.404, 0.464] | 0.407 | −0.6 pp | None above noise |
Math: Trajectory Structure Carries the Signal
In math, the probe substantially outperforms the token-confidence baseline: AoA 0.968 vs. 0.685. The signal emerges early—reaching 95% of final AUROC by the 10% anchor. Permutation importance shows traj_reflection_count dominates (importance 0.242), followed by traj_continuity (0.092) and traj_novelty (0.052). This is a trajectory-driven pattern: math correctness is visible in how the reasoning unfolds.
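The importance numbers come from permutation importance; with the probe and toy data from the pipeline sketch above, the computation looks roughly like this (the repeat count and the grouped split are assumptions):

```python
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GroupShuffleSplit

# Reusing X, y, problem_ids, and `probe` from the pipeline sketch.
tr, te = next(GroupShuffleSplit(test_size=0.25, random_state=0)
              .split(X, y, groups=problem_ids))
probe.fit(X[tr], y[tr])
result = permutation_importance(probe, X[te], y[te], scoring="roc_auc",
                                n_repeats=20, random_state=0)
ranked = sorted(enumerate(result.importances_mean), key=lambda t: -t[1])
for i, imp in ranked[:5]:
    print(f"feature {i}: {imp:+.3f}")   # drop in AUROC when shuffled
```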
Science: Confidence, Not Structure
Science shows positive but qualitatively different results. The full probe (AoA 0.827) only marginally improves on a simple tok_conf baseline (AoA 0.809). The leading features are tok_gini_tail, tok_conf_recency, and tok_conf_prefix—all confidence-based. The science signal is genuine, but it comes from how certain the model sounds, not from reasoning structure.
Coding: Below Chance
Coding achieves AoA 0.432 and AUROC@100% 0.407—below the tok_conf baseline (0.506). No best-of-N gain (−0.6 pp). The 95% CIs are non-overlapping with math, confirming this is a real domain gap, not statistical noise.
Feature Importance Contrast
This is the key finding: the same feature family measures different things in different domains.
- Math: Trajectory features dominate → features measure reasoning quality
- Science: Token confidence dominates → features measure model certainty
- Coding: Nothing dominates → features measure text fluency, not correctness
This is a measurement invariance failure (Vandenberg & Lance, 2000; Belinkov, 2022): the same operationalization measures different latent constructs across contexts.
4. Why Coding Fails: Five Convergent Checks
One negative result might be a feature engineering problem. We ran five independent checks to confirm this is deeper.
Check 1: Feature Sweep — 83 Coding-Specific Scalars
Approach: Designed 83 features specifically for code: struct_density, cf_n_ifs (branch count), cot_trace_count, indentation depth, keyword frequencies, loop nesting depth.
Result: Best single feature (struct_density) reaches AUROC 0.565; the best top-4 combination reaches AUROC 0.556.
Reading: These features are not pure noise—cf_n_ifs weakly correlates with correctness—but the effects are small and semantically mixed: deeper branching correlates weakly positively, denser structural narration weakly negatively. Not enough to separate correct from incorrect.
Check 2: Self-Supervised Pre-training
Approach: SSL on 42K unlabeled traces. SSL learns text structure (like BERT learns sentence structure)—recurring patterns, common phrases, syntactic regularities.
v1 result: Collapsed—the loss stalled at 0.432 and the rank-8 representation offered nothing.
v2 result: Per-domain encoders with VICReg. Coding full-label result: AUROC 0.454 — paradoxically lower than the 5%-label peak of 0.537, suggesting SSL pre-training does not uncover correctness-relevant structure and may even interfere at full supervision.
Reading: v1's failure showed the domains are fundamentally different—a universal encoder can't work. v2's partial success at low label counts shows SSL helps as regularization, but it doesn't uncover correctness-relevant structure. SSL learns that `range` often follows `for`, but not whether the loop terminates correctly.
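For reference, v2's VICReg objective combines an invariance term with variance and covariance regularizers that prevent collapse. A numpy sketch with the default weights from Bardes et al. (2022)—whether v2 used these weights is an assumption:

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """z_a, z_b: (batch, dim) embeddings of two views of the same trace."""
    n, d = z_a.shape
    # Invariance: two views of one trace should embed to the same point.
    sim = np.mean((z_a - z_b) ** 2)
    # Variance: hinge keeps each dimension's std above 1 (anti-collapse).
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + 1e-4)
        return np.mean(np.maximum(0.0, 1.0 - std))
    # Covariance: push off-diagonal covariances to zero (decorrelate dims).
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d
    return (sim_w * sim
            + var_w * (var_term(z_a) + var_term(z_b))
            + cov_w * (cov_term(z_a) + cov_term(z_b)))
```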
Check 3: Nonlinear MLP
Approach: One-hidden-layer MLP on 25D features—can it learn nonlinear combinations that linear models miss?
Result: AUROC 0.499–0.507
Reading: Adding model capacity doesn't help. If the signal isn't in the features, no amount of nonlinearity can extract it.
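The setup is as simple as it sounds; a sketch with sklearn (the hidden width is an assumption—the text says only "one-hidden-layer"):

```python
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier

# Reusing X, y and the grouped train/test split (tr, te) from above.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
mlp.fit(X[tr], y[tr])
print(roc_auc_score(y[te], mlp.predict_proba(X[te])[:, 1]))
```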
Check 4: De-knotting — Removing Circular Reasoning Spans
Approach: One hypothesis: coding traces contain noisy circular spans ("knots") that obscure an otherwise useful signal. We used GLM-4 and human annotation to identify knot spans, then removed them from all cached per-token arrays.
Result by domain:
- Math: Removing knots degrades AUROC (0.502 → 0.453, Δ=−0.049) — these spans carry signal
- Science: Minimal change (Δ=−0.006)
- Coding: Neutral (0.460 → 0.466, Δ=+0.006)
Reading: In math, knot spans are informative—they're where the model struggles and reveals uncertainty. In coding, they're neither helpful nor harmful. The coding weakness isn't attributable to removable surface clutter.
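Mechanically, de-knotting is a masking operation over the cached arrays; a minimal sketch (the span format is an assumption):

```python
import numpy as np

def remove_spans(arr: np.ndarray, spans: list[tuple[int, int]]) -> np.ndarray:
    """Drop annotated knot spans, given as (start, end) token-index
    pairs, from a cached per-token array; everything else keeps its
    order. Features are then recomputed on the shortened array."""
    keep = np.ones(len(arr), dtype=bool)
    for start, end in spans:
        keep[start:end] = False
    return arr[keep]

# e.g. remove_spans(token_logprobs, knot_spans) before re-running the probe
```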
Check 5: Coding-Specific Run Judge
Approach: 19 surface features (planning density, vocabulary overlap, structural flags, length, neuron count, cot_mode indicators), evaluated via 5-fold GroupKFold grouped by problem_id—each fold holds out problems never seen in training.
Result: OOF AUC 0.481 ± 0.053 — fails on unseen problems
Reading: Surface cues that appear meaningful within problems do not generalize across problems.
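Out-of-fold evaluation under GroupKFold looks like this; the judge's classifier is not specified in the text, so logistic regression stands in:

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def oof_auc(X, y, problem_ids, n_splits=5):
    """OOF AUC with GroupKFold on problem_id: every test fold contains
    only problems absent from its training fold."""
    oof = np.zeros(len(y))
    for tr, te in GroupKFold(n_splits=n_splits).split(X, y, problem_ids):
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        oof[te] = clf.predict_proba(X[te])[:, 1]
    return roc_auc_score(y, oof)
```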
5. Why the Divergence?
The convergent failure across five methods points to something deeper than feature engineering. Here's our best explanation:
Math: Correctness Lives in the Reasoning Process
Mathematical reasoning is inherently sequential: solve step A, verify, proceed to step B. "Predict the next token" aligns naturally with "predict the next reasoning step." When a solution fails, the trace shows visible patterns—hesitation, backtracking, circular reasoning. Our features capture these because in math, correctness is coupled with reasoning quality.
As Lanham et al. (2023) showed, CoT traces aren't always faithful—but in math, unfaithful reasoning tends to fail, creating a useful correlation between trace quality and correctness.
Science: Confidence as Proxy
Science problems share some verbal uncertainty cues with math, but the usable signal comes primarily from token confidence, not trajectory structure. This makes sense: science questions often test knowledge rather than reasoning process. The model either knows the answer (high confidence) or doesn't.
Coding: Correctness Lives in the Execution
Programming involves control flow that natural language cannot faithfully represent:
- Branches: Conditional logic with mutually exclusive paths. In linear text, the model describes both paths simultaneously.
- Loops: Iterative state updates. Describing iteration in linear text loses the "current state" at each step.
- State: Variable bindings and runtime behavior that leave no trace in the CoT text.
A model can write highly confident, well-structured pseudo-code that fails on edge cases. Our features would rate this as "high quality" but it's actually wrong. Code correctness is not in the text—it's in the runtime.
We call this the knot problem: natural language is a linear medium. Code has non-linear control flow. When you flatten branching logic into sequential text, you create "knots"—places where the model loses track of which branch it's in, iteration state gets tangled, and caller/callee frames blur.
Our de-knotting experiment (Check 4) tested whether these surface knots are the bottleneck. They are not: removing them leaves coding performance unchanged (Δ=+0.006). This suggests the coding weakness runs deeper than any one class of noisy spans, but we emphasize this is a negative finding—it rules out one explanation, rather than confirming another.
As CodeT (Chen et al., 2022) demonstrated, code verification benefits from stronger information channels—execution feedback, test cases. Our finding is consistent with this: when you intentionally restrict yourself to the weakest channel (CoT surface), it works for math but breaks for coding. Whether this failure stems from the linear-text vs. non-linear-code mismatch specifically, or from other aspects of the coding domain, remains an open question.
6. What This Means
Practical Takeaway
- Math: surface probe works (reasoning features)
- Science: usable with caution (confidence proxy)
- Coding: needs stronger signals (execution needed)
For math and science, cheap CoT surface features are genuinely useful—they enable early stopping, best-of-N selection, and quality monitoring at near-zero cost. For coding, you need stronger signals: execution feedback, test cases, or model internals.
What We Don't Claim
- We don't claim all text-based verifiers fail on coding. LLM judges, self-refinement, or raw-text classifiers may work—they use different information channels.
- We don't claim execution-aware methods won't work (they should!).
- All experiments use a single model (DeepSeek-R1-0528-Qwen3-8B). Results may differ for other architectures.
- Our SSL conclusion is objective-specific: it shows failure under the tested pretraining losses, not a universal impossibility.
Broader Context
This connects to a growing understanding that CoT quality is domain-dependent. Huang et al. (2023) showed that LLMs cannot self-correct reasoning yet. Liu et al. (2024) found some intrinsic self-correction ability. Our contribution is showing that even the measurability of correctness from CoT surface depends on the domain—the same instrument measures different things.
7. Technical Appendix
Model
DeepSeek-R1-0528-Qwen3-8B (8B parameters, reasoning-specialized), temperature 1.0.
Data
- Math: 7,680 runs across AIME24, AIME25, BRUMO25, HMMT25
- Science: 12,672 runs from GPQA Diamond
- Coding: 10,688 runs from LiveCodeBench-v5
- Total: 31,040 runs
Annotation Pipeline (Coding)
Coding traces were annotated for knot spans using GLM-4, Python pattern-matching scripts, and human review. The multi-method approach ensured comprehensive coverage of circular reasoning, branch confusion, and loop tangles.
Metric
AoA (AUC-of-AUROC)—the arithmetic mean of AUROC values at 10 anchor positions (10%–100% in 10% steps): measures how well features discriminate correct vs. incorrect as reasoning unfolds. We also report AUROC@100%. Bootstrap 95% CIs use B=10,000 resamplings per domain.
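In code, the metric and its CI reduce to a few lines (resampling runs i.i.d. is an assumption; the text does not say whether runs or problems are the resampling unit):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def aoa(anchor_scores, y):
    """Mean AUROC over the 10 anchors; anchor_scores is a list of 10
    score arrays, one per anchor position (10%, 20%, ..., 100%)."""
    return float(np.mean([roc_auc_score(y, s) for s in anchor_scores]))

def bootstrap_ci(anchor_scores, y, B=10_000, seed=0):
    """Percentile 95% CI over B bootstrap resamplings of the runs."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    scores = [np.asarray(s) for s in anchor_scores]
    stats = []
    while len(stats) < B:
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():   # AUROC needs both classes
            continue
        stats.append(np.mean([roc_auc_score(y[idx], s[idx]) for s in scores]))
    return np.percentile(stats, [2.5, 97.5])
```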
Reproducibility
All tables come from results/tables/ and workshop/cotknot/results/tables/. Key artifacts: coding_feature_family_ablation.csv, cot_run_judge_scores.csv, cot_run_judge_rerank.csv.
Appendix: Extra Details from the Paper
For readers who want more depth, here are additional findings that didn't fit the main narrative.
A. Cross-Anchor Transfer
We tested whether an SVD basis learned at one trace position transfers to another. Math exhibits near-lossless transfer: forward gap Δ=−0.001, backward Δ=−0.006. Science shows a larger forward gap (Δ=−0.010), suggesting its quality geometry is less stable across positions. This means math has a consistent "quality structure" throughout the trace; science's structure shifts as the reasoning progresses.
B. Dense-Anchor Robustness
We verified that the domain ordering is robust to anchor density. On our 10-anchor grid, AoA reaches 0.968 (math), 0.827 (science), and 0.432 (coding). A coarser 4-anchor subset {10%, 40%, 70%, 100%} yields the same ordering: 0.958 / 0.799 / 0.434. Science continues improving into later anchors, while coding stays weak throughout—confirming the gap isn't an artifact of measurement density.
C. Self-Consistency Oracle
A self-consistency oracle (using ground truth to know the majority answer) achieves AoA 0.953 (math), 0.921 (science), and 0.961 (coding). The fact that SC oracle achieves 0.961 in coding means coding correctness is predictable—just not from text surface features. This validates our interpretation: the information exists (in execution, in hidden states), but our features can't access it.
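For orientation, the non-oracle half of this signal is just majority agreement; a sketch (the oracle variant's exact use of ground truth is not spelled out here):

```python
from collections import Counter

def self_consistency_scores(answers: list[str]) -> list[float]:
    """Score each run of one problem by agreement with the majority
    answer. The oracle variant additionally consults ground truth;
    this sketch shows only the majority-vote signal."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [float(a == majority) for a in answers]

# self_consistency_scores(["42", "42", "17", "42"]) -> [1.0, 1.0, 0.0, 1.0]
```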
D. Grouped Family Ablations
We tested whether coding failure stems from one bad feature subset. Results:
- traj_only: AUROC 0.509
- token_plus_traj: AUROC 0.501
- Full 30-feature surface family: AUROC 0.506
No subset performs well—the failure is distributed across the entire feature family, not concentrated in one group.
E. De-knotting Across Domains
De-knotting has opposite effects in different domains:
- Math: AUROC drops from 0.502 → 0.453 (Δ=−0.049). Knot spans carry signal—they're where the model struggles.
- Science: Minimal change (Δ=−0.006).
- Coding: Neutral (Δ=+0.006). Removing knots doesn't help.
This asymmetry is informative: in math, circular reasoning is a symptom of the disease (wrong reasoning). In coding, removing circular spans does not help, which rules out the hypothesis that coding weakness is caused by removable surface clutter. Whether the deeper cause is a linear-text vs. non-linear-code representational mismatch—as the knot hypothesis suggests—or some other aspect of the coding domain, is not settled by this experiment.
F. Difficulty Band Analysis
The coding-specific CoT-only judge was evaluated across difficulty bands:
- Hard problems: AUC 0.515
- Mid problems: AUC 0.470
- Easy problems: AUC 0.509
Performance stays near chance across all difficulty levels, confirming the failure isn't limited to hard problems.
References
- Wei, J. et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. NeurIPS.
- Cobbe, K. et al. (2021). Training verifiers to solve math word problems. arXiv:2110.14168.
- Lightman, H. et al. (2023). Let's verify step by step. arXiv:2305.20050.
- Wang, P. et al. (2024). Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. ACL.
- Zheng, C. et al. (2025). ProcessBench: Identifying process errors in mathematical reasoning. ACL.
- Zhang, L. et al. (2025). Generative verifiers: Reward modeling as next-token prediction. ICLR.
- Lanham, T. et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv:2307.13702.
- Huang, J. et al. (2023). Large language models cannot self-correct reasoning yet. arXiv:2310.01798.
- Liu, D. et al. (2024). Large language models have intrinsic self-correction ability. NeurIPS.
- Madaan, A. et al. (2023). Self-Refine: Iterative refinement with self-feedback. arXiv:2303.17651.
- Shinn, N. et al. (2023). Reflexion: Language agents with verbal reinforcement learning. arXiv:2303.11366.
- Chen, B. et al. (2022). CodeT: Code generation with generated tests. arXiv:2207.10397.
- Vandenberg, R. E. & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature. Organizational Research Methods, 3(1), 4–70.
- Belinkov, Y. (2022). Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1), 207–219.