Token-Level Verification under Controlled Evaluation
This project asks whether shallow token-level signals—entropy, log-probability, and confidence trajectories—can distinguish correct from incorrect math reasoning traces without extra model calls.
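As a rough illustration of what "shallow token-level signals" could mean in practice, the sketch below summarises a single trace from the next-token distributions the model already produced while generating it, so no extra model call is needed. The function name token_signals, its argument names, and the choice to define the confidence trajectory as the per-step probability of the generated token are assumptions made for this example, not definitions taken from the project.

```python
import math

def token_signals(step_probs, chosen_ids):
    """Summarise one trace from its per-step next-token distributions.

    step_probs: list of probability distributions (one list per generated token)
    chosen_ids: index of the token actually generated at each step
    All names here are hypothetical; the project may define signals differently.
    """
    entropies, logprobs = [], []
    for probs, tok in zip(step_probs, chosen_ids):
        # Shannon entropy (in nats) of the next-token distribution at this step
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
        # Log-probability of the token that was actually generated
        logprobs.append(math.log(probs[tok]))
    n = len(logprobs)
    return {
        "mean_entropy": sum(entropies) / n,
        "mean_logprob": sum(logprobs) / n,
        # Assumed definition: per-step probability of the generated token
        "confidence_trajectory": [math.exp(lp) for lp in logprobs],
    }

# Toy two-token trace over a two-word vocabulary
trace = token_signals([[0.9, 0.1], [0.5, 0.5]], [0, 1])
```

Each summary is a single scalar (or a short sequence) per trace, which is what makes these signals cheap and also what makes their apparent usefulness sensitive to how they are evaluated.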
The emphasis is not on proposing a new verifier. Instead, the project audits how much apparent verification performance depends on evaluation protocol choices: whether traces are pooled globally or compared within each problem, whether scores are computed in-sample, and whether AUROC is reported direction-agnostically.
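Two of these protocol effects can be shown on toy data. The sketch below is a minimal pure-Python illustration, not the project's evaluation code: auroc, direction_agnostic, and within_problem_auroc are hypothetical helper names, and the six scores are invented for the example. It assumes a higher score is meant to indicate a correct trace.

```python
def auroc(scores, labels):
    """Fixed-direction AUROC: P(correct trace scores higher), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def direction_agnostic(scores, labels):
    """Picks the better sign after seeing the labels, so it is never below 0.5."""
    a = auroc(scores, labels)
    return max(a, 1.0 - a)

def within_problem_auroc(scores, labels, problem_ids):
    """Mean AUROC over problems that have both correct and incorrect traces."""
    groups = {}
    for s, y, pid in zip(scores, labels, problem_ids):
        g = groups.setdefault(pid, ([], []))
        g[0].append(s)
        g[1].append(y)
    per_problem = [auroc(s, y) for s, y in groups.values()]
    per_problem = [a for a in per_problem if a is not None]
    return sum(per_problem) / len(per_problem) if per_problem else None

# Invented data: the "easy" problem gets higher scores overall, but within
# each problem the score does not separate correct from incorrect traces.
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 1, 0]
pids = ["easy"] * 3 + ["hard"] * 3

pooled = auroc(scores, labels)                       # 6/9, about 0.667
within = within_problem_auroc(scores, labels, pids)  # 0.5
flipped = direction_agnostic([0.1, 0.9], [1, 0])     # 1.0, fixed direction gives 0.0
```

In this toy case the pooled number looks informative only because the score tracks problem difficulty, and the direction-agnostic number turns a fully reversed signal into a perfect one.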
Takeaway: shallow token statistics can be useful diagnostics, but they should be reported with fixed-direction baselines and permutation-null calibration before being treated as stable standalone verifiers.
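One way to read "permutation-null calibration" is sketched below: keep the score direction fixed, shuffle the correctness labels many times, and ask how often a shuffled AUROC matches or exceeds the observed one. This is a minimal stdlib-only sketch under that assumption. The names permutation_null, n_perm, and seed are hypothetical, the data are synthetic, and a within-problem design would shuffle labels inside each problem rather than globally as done here.

```python
import random

def auroc(scores, labels):
    """Fixed-direction AUROC: P(correct trace scores higher), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_null(scores, labels, n_perm=200, seed=0):
    """Return (observed AUROC, one-sided permutation p-value).

    Labels are shuffled globally; class counts are preserved, so the AUROC
    is defined for every permutation.
    """
    rng = random.Random(seed)
    observed = auroc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auroc(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive
    return observed, (1 + hits) / (1 + n_perm)

# Synthetic, perfectly separated scores: 10 incorrect then 10 correct traces
scores = [i / 20 for i in range(20)]
labels = [0] * 10 + [1] * 10
observed, p_value = permutation_null(scores, labels)
```

Reporting the observed AUROC next to its permutation p-value, with the direction fixed in advance, makes it harder for a chance-level signal to pass as a verifier.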
