When CoT Surface Features Transfer

Published: May 04, 2026

This note summarizes my project code-not-text. The interactive version is here: demo. The full technical report is here: PDF report.

Question

Can cheap summaries of a chain-of-thought trace predict whether a solution is correct?

I study one deliberately narrow feature family: hand-crafted CoT-surface and token-trajectory descriptors. The point is not to test every possible verifier. The point is to ask whether this cheap, interpretable feature family transfers across domains.

Setup

The experiments use DeepSeek-R1-0528-Qwen3-8B at temperature 1.0 across three domains:

Math: AIME24, AIME25, BRUMO25, and HMMT25; 7,680 aligned runs
Science: GPQA-style tasks; 12,672 runs
Coding: LiveCodeBench-v5 tasks; 10,688 runs

The feature family includes token-confidence summaries, token-trajectory statistics, trajectory continuity, novelty, reflection count, and a small activation-derived descriptor. Evaluation uses problem-grouped splits. I report AoA (AUC-of-AUROC across trace anchors), AUROC@100%, and best-of-N reranking.
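To make the evaluation protocol concrete, here is a minimal sketch of the three pieces named above. It is an illustration and not the project code: the function names, the 20% test fraction, and the reading of AoA as a span-normalized trapezoidal area under the AUROC-versus-anchor curve are my assumptions for this example.

```python
import random


def auroc(scores, labels):
    """Probability that a random correct run outscores a random incorrect run.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect run")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def aoa(anchor_fracs, aurocs):
    """Assumed AoA: trapezoidal area under AUROC vs. trace anchor, divided by the anchor span."""
    area = 0.0
    for i in range(1, len(anchor_fracs)):
        width = anchor_fracs[i] - anchor_fracs[i - 1]
        area += width * (aurocs[i] + aurocs[i - 1]) / 2
    return area / (anchor_fracs[-1] - anchor_fracs[0])


def grouped_split(problem_ids, test_frac=0.2, seed=0):
    """Assign whole problems to train or test so that no problem appears in both."""
    problems = sorted(set(problem_ids))
    random.Random(seed).shuffle(problems)
    n_test = max(1, round(test_frac * len(problems)))
    test_problems = set(problems[:n_test])
    train_idx = [i for i, p in enumerate(problem_ids) if p not in test_problems]
    test_idx = [i for i, p in enumerate(problem_ids) if p in test_problems]
    return train_idx, test_idx
```

The grouped split is the part that matters most for the claims in this note: with many runs per problem, a random per-run split would let the probe see other runs of the same test problem during training.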

Main Result

The same feature family behaves very differently across domains.

Domain	AoA	AUROC@100%	Best-of-N (N=64) change in pass@1
Math	0.958	0.982	+10.0 pp
Science	0.799	0.841	+8.0 pp
Coding	0.434	0.407	-0.6 pp

This is the central finding: the feature family is strong in math and weaker in science. In coding, both AoA (0.434) and AUROC@100% (0.407) fall below the 0.5 chance level on unseen problems, and reranking slightly lowers pass@1.
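For readers unfamiliar with best-of-N reranking, the sketch below shows one way to compute a pass@1 change in percentage points. It is a hedged illustration: the function name and data layout are hypothetical, and the baseline used here (mean per-run accuracy, i.e. picking a run at random) is my assumption for this example, not necessarily the baseline behind the table.

```python
def best_of_n_gain(runs_by_problem):
    """Compare verifier-selected pass@1 against a random-pick baseline.

    runs_by_problem maps a problem id to a list of (verifier_score, is_correct)
    pairs, one pair per sampled run. Returns (reranked pass@1, baseline pass@1,
    gain in percentage points).
    """
    picked, baseline = [], []
    for runs in runs_by_problem.values():
        # Reranking: keep the single run the verifier scores highest.
        top = max(runs, key=lambda r: r[0])
        picked.append(top[1])
        # Assumed baseline: expected accuracy of one run drawn uniformly at random.
        baseline.append(sum(c for _, c in runs) / len(runs))
    reranked = sum(picked) / len(picked)
    base = sum(baseline) / len(baseline)
    return reranked, base, 100 * (reranked - base)
```

A negative gain, as in the coding row, means the top-scored run is correct less often than a randomly chosen one.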

Interpretation

The same feature names do not necessarily measure the same latent construct.

In math, features such as reflection count and trajectory continuity can track whether reasoning is converging. In science, the signal is more confidence-heavy. In coding, executable correctness is farther away from the surface form of the reasoning trace. A trace can look organized and confident while still implementing the wrong program logic.

I view this as a measurement non-invariance result: a feature family can be useful in one domain without being a general-purpose correctness proxy.

Robustness Checks

The coding result is not just one failed probe. I tried several interventions that could have rescued the signal if the failure had a simple cause:

an 83-scalar coding-specific feature sweep,
grouped feature ablations,
a coding-specific CoT-only judge,
nonlinear MLP classifiers,
SSL pre-training on 42K unlabeled traces,
token-level de-knotting.

None produced a strong generalizable coding verifier from this feature family.

Semantic Knots

I also ran a smaller semantic analysis around what I call semantic knots: local places in the visible reasoning prose where the trace appears to lose state control and does not immediately repair itself. Ordinary uncertainty, normal exploration, and clean one-step self-correction do not count.

This analysis supports the same cross-domain story:

Math: knots are common and strongly associated with incorrect answers.
Science: knots appear less often, and the correctness relationship is weaker.
Coding: knots are rare and do not meaningfully separate correct from incorrect solutions.

The de-knotting check is especially useful. Removing knot spans reduces the math signal, which suggests that those spans carry failure information. In coding, de-knotting does not reveal a hidden verifier; the signal stays essentially as weak as before. That is why I read the result as measurement non-invariance rather than a missing preprocessing trick.
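The mechanics of token-level de-knotting can be sketched as follows. This is a toy version under stated assumptions: knot spans are taken to be half-open token index ranges, the trace is represented as a list of per-token confidences, and `mean_confidence` stands in for the real descriptors. All names here are hypothetical.

```python
def remove_spans(tokens, knot_spans):
    """Drop every token whose index falls in a half-open [start, end) knot span."""
    drop = set()
    for start, end in knot_spans:
        drop.update(range(start, end))
    return [t for i, t in enumerate(tokens) if i not in drop]


def mean_confidence(confidences):
    """Toy stand-in for a surface feature: the average per-token confidence."""
    return sum(confidences) / len(confidences)


# Example: a four-token trace whose middle two tokens form a knot.
trace = [0.9, 0.2, 0.1, 0.8]
deknotted = remove_spans(trace, [(1, 3)])
feature_before = mean_confidence(trace)
feature_after = mean_confidence(deknotted)
```

The check then recomputes the features on the de-knotted traces and re-runs the same grouped evaluation. If the separation between correct and incorrect runs shrinks, the removed spans were carrying part of the signal.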

Boundary

This does not claim that all text-based verifiers fail. It also does not rule out code-aware models, execution-aware selectors, hidden-state probes, raw-text classifiers, or LLM judges.

The narrower conclusion is: cheap CoT-surface features are domain-specific measurement instruments, not general-purpose correctness proxies.

Yuhan Chi
