Token-Level Verification under Controlled Evaluation

This project asks whether shallow token-level signals—entropy, log-probability, and confidence trajectories—can distinguish correct from incorrect math reasoning traces without extra model calls.

The emphasis is not on proposing a new verifier. Instead, the project audits how much apparent verification performance depends on evaluation protocol choices such as global pooling, in-sample scoring, direction-agnostic AUROC, and within-problem controls.

Takeaway: shallow token statistics can be useful diagnostics, but they should be reported with fixed-direction baselines and permutation-null calibration before being treated as stable standalone verifiers.
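
As a rough illustration of why the takeaway matters, the sketch below contrasts a fixed-direction AUROC with a direction-agnostic one and calibrates the latter against a label-permutation null. It is a minimal sketch under stated assumptions, not the project's code: the function names (auroc, direction_agnostic_auroc, permutation_p_value) are hypothetical, scores stand in for any per-trace token statistic such as mean entropy, and labels are shuffled globally rather than within problem.

```python
import numpy as np

def auroc(scores, labels):
    """Fixed-direction AUROC: P(score of a correct trace > score of an incorrect one).

    Ties count as 0.5. Uses the O(n_pos * n_neg) pairwise definition for clarity.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def direction_agnostic_auroc(scores, labels):
    """Picks whichever sign of the score looks better, so it is never below 0.5."""
    a = auroc(scores, labels)
    return max(a, 1.0 - a)

def permutation_p_value(scores, labels, stat, n_perm=1000, seed=0):
    """One-sided p-value of `stat` under a global label-shuffling null (assumed design)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)
    observed = stat(scores, labels)
    null = np.array([stat(scores, rng.permutation(labels)) for _ in range(n_perm)])
    # The +1 terms keep the estimate strictly positive.
    return (1 + int((null >= observed).sum())) / (1 + n_perm)

# Toy usage with made-up numbers (illustrative only, not project data).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
fixed = auroc(scores, labels)                        # 0.75
agnostic = direction_agnostic_auroc(scores, labels)  # 0.75
p = permutation_p_value(scores, labels, direction_agnostic_auroc, n_perm=200)
```

Because the direction-agnostic statistic is bounded below by 0.5, its null distribution sits above chance, which is why comparing it to 0.5 overstates signal and a permutation null is the fairer reference. A within-problem variant would shuffle labels only among traces of the same problem.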
