Token-Level Verification under Controlled Evaluation

This project asks whether shallow token-level signals (entropy, log-probability, and confidence trajectories) can distinguish correct from incorrect math reasoning traces without extra model calls.
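As a rough illustration of what such signals might look like, the sketch below summarizes one trace from quantities a decoder already produces. The function names, the feature set, and the "late half minus early half" drift are assumptions made for this example, not the project's actual feature definitions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def trace_features(step_probs, chosen_ids):
    """Summarize one reasoning trace without any extra model calls.

    step_probs: list of per-step probability vectors (hypothetical input format).
    chosen_ids: index of the token actually emitted at each step.
    """
    if not step_probs or len(step_probs) != len(chosen_ids):
        raise ValueError("need one chosen token per non-empty step list")
    entropies = [token_entropy(p) for p in step_probs]
    logprobs = [math.log(p[i]) for p, i in zip(step_probs, chosen_ids)]
    n = len(logprobs)
    half = n // 2
    if half == 0:
        drift = 0.0  # a single-token trace has no trajectory
    else:
        # Crude confidence trajectory: late-half mean minus early-half mean.
        early = sum(logprobs[:half]) / half
        late = sum(logprobs[half:]) / (n - half)
        drift = late - early
    return {
        "mean_entropy": sum(entropies) / n,
        "mean_logprob": sum(logprobs) / n,
        "logprob_drift": drift,
    }
```

Each feature is a single scalar per trace, which is what makes these signals cheap and also what makes their evaluation sensitive to protocol choices.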

The emphasis is not on proposing a new verifier. Instead, the project audits how much apparent verification performance depends on evaluation protocol choices such as global pooling, in-sample scoring, direction-agnostic AUROC, and within-problem controls.
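One of these protocol choices, direction-agnostic AUROC, can be made concrete with a small sketch. The helper names below are hypothetical, and the pairwise AUROC is a standard textbook formulation rather than the project's own code.

```python
def auroc(scores, labels):
    """Fixed-direction AUROC: chance that a correct trace (label 1)
    scores higher than an incorrect one (label 0); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one trace of each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def direction_agnostic_auroc(scores, labels):
    """Picks whichever sign of the score looks better on this data."""
    a = auroc(scores, labels)
    return max(a, 1.0 - a)

# A signal whose sign is exactly wrong under the fixed convention.
scores = [0.1, 0.2, 0.8, 0.9]
labels = [1, 1, 0, 0]
fixed = auroc(scores, labels)                        # 0.0
agnostic = direction_agnostic_auroc(scores, labels)  # 1.0
```

Because the direction-agnostic value can never fall below 0.5, even a pure-noise score looks at least as good as chance under it, which is one way a reported number can depend on the protocol rather than on the signal.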

Takeaway: shallow token statistics can be useful diagnostics, but they should be reported with fixed-direction baselines and permutation-null calibration before being treated as stable standalone verifiers.
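A minimal sketch of permutation-null calibration, assuming a global label shuffle and a direction-agnostic statistic; the function names, the number of permutations, and the seed are illustrative choices, not the project's settings.

```python
import random

def auroc(scores, labels):
    """Pairwise AUROC with ties counted as 0.5 (label 1 = correct trace)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one trace of each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def agnostic(scores, labels):
    a = auroc(scores, labels)
    return max(a, 1.0 - a)

def permutation_null(scores, labels, statistic=agnostic, n_perm=200, seed=0):
    """Compare the observed statistic with its distribution under shuffled labels.

    The same statistic is applied to real and shuffled labels, so any
    optimism from choosing the direction after the fact is present in
    the null as well.
    """
    rng = random.Random(seed)
    observed = statistic(scores, labels)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling keeps the class counts unchanged
        null.append(statistic(scores, shuffled))
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (1 + sum(v >= observed for v in null)) / (1 + n_perm)
    return observed, sum(null) / n_perm, p_value

scores = [i / 20 for i in range(20)]
labels = [0] * 10 + [1] * 10
observed, null_mean, p_value = permutation_null(scores, labels)
```

For a within-problem control, one variant would shuffle labels only among traces of the same problem, so that problem difficulty cannot explain the separation.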

Yuhan Chi
