From Bandits to PPO: RL Prerequisite Seminar

Date: April 18, 2026

Talk at Fudan University seminar, Shanghai, China

This page is the seminar detail page for “From Bandits to PPO”.

The PDF notes are the complete written version of this seminar. They include the full formulas, derivations, and algorithm summaries that are only briefly mentioned on this page.

The content follows a single line, starting from bandits and moving through MDPs, MC/TD, policy gradient, and GAE to PPO.

The goal is to keep the concepts connected, so readers can see where each method comes from and why PPO is the endpoint of this prerequisite talk.

If you want a quick reading path, focus on these four points:

The basic map: what is learned (value vs. policy) and how the learning signal is estimated (MC vs. TD).
Variance reduction: baseline, advantage, actor-critic, and GAE.
Update stability: why plain policy gradient is unstable and how PPO clipping helps.
Handoff to the next topic: why PPO is a natural stopping point before GRPO/DAPO.
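
As a small illustration of the first point, the sketch below contrasts the two learning signals on a toy three-step episode. The function names, rewards, and value estimates are made up for illustration and are not taken from the PDF notes.

```python
def mc_returns(rewards, gamma):
    """Discounted Monte Carlo return G_t for every step of a finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_targets(rewards, next_values, gamma):
    """One-step TD targets r_t + gamma * V(s_{t+1}).

    next_values[t] is the current estimate of V(s_{t+1});
    it should be 0 when s_{t+1} is terminal.
    """
    return [r + gamma * v for r, v in zip(rewards, next_values)]


rewards = [1.0, 0.0, 2.0]   # hypothetical episode
gamma = 0.5

print(mc_returns(rewards, gamma))                    # [1.5, 1.0, 2.0]
# With exact next-state values, the TD targets match the MC returns.
print(td_targets(rewards, [1.0, 2.0, 0.0], gamma))   # [1.5, 1.0, 2.0]
# With a poor value estimate (all zeros), the TD targets are biased.
print(td_targets(rewards, [0.0, 0.0, 0.0], gamma))   # [1.0, 0.0, 2.0]
```

MC waits for the episode to finish and uses the sampled return, while TD bootstraps from a value estimate, so the quality of the TD target depends on the quality of that estimate.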

Seminar Snapshot

Item	Detail
Theme	Mathematical path from bandits to PPO
Positioning	Prerequisite for PPO-family LLM post-training methods (GRPO/DAPO)
Duration	1 hour
Audience	Students with basic probability and linear algebra
Output	Conceptual map + formulas needed for next seminar

Seminar At A Glance

Goal: understand why modern policy optimization naturally leads to actor-critic, GAE, and PPO
Focus: theory and intuition first, engineering details later

Simple Outline

Bandits and contextual bandits
MDPs, trajectories, and value functions
Monte Carlo vs. Temporal Difference (TD)
Tabular control: MC iteration, SARSA, Q-learning
Function approximation fork: DQN vs. policy parameterization
Policy gradient and REINFORCE
Baselines, advantage, actor-critic, and GAE
PPO clipping objective and practical training loop

Key Mathematical Threads

Learning object: value function vs. policy
Learning signal: Monte Carlo returns vs. TD bootstrapping
Variance control: baseline subtraction and advantage estimation
Stability: importance ratio clipping in PPO
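
The signal and variance threads meet in GAE. The following is a minimal sketch of the standard GAE recursion, assuming a finished episode and a list of value estimates. The function name and the toy numbers are illustrative assumptions, not the notation of the PDF notes.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation over one trajectory.

    values must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the final state (0.0 if that state is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]        # hypothetical episode
values = [0.5, 0.5, 1.0, 0.0]    # hypothetical critic outputs, terminal value 0
gamma = 0.5

# lam = 0 reduces to the one-step TD error.
print(gae_advantages(rewards, values, gamma, lam=0.0))  # [0.75, 0.0, 1.0]
# lam = 1 reduces to the Monte Carlo return minus the value baseline.
print(gae_advantages(rewards, values, gamma, lam=1.0))  # [1.0, 0.5, 1.0]
```

The parameter lam interpolates between the two learning signals: small values lean on the critic (lower variance, more bias), large values lean on sampled returns (higher variance, less bias).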

Why This Stop Point Matters

PPO is the right endpoint for this seminar because it gives all the conceptual ingredients needed for understanding GRPO-like variants: policy gradient objective, advantage estimation, and controlled policy updates.
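
For reference, the three ingredients can be written in their standard textbook form. The notation below (advantage estimate \hat{A}_t, ratio r_t, clip range \epsilon) is the common convention and is assumed here; the PDF notes may use different symbols.

```latex
% Policy gradient with an advantage estimate in place of the raw return
\nabla_\theta J(\theta)
  = \mathbb{E}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]

% Importance ratio between the new policy and the data-collecting policy
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

% PPO clipped surrogate objective (maximized)
L^{\text{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[ \min\!\Big( r_t(\theta)\, \hat{A}_t,\;
      \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big) \right]
```

The clip removes the incentive to push the ratio outside the interval [1 - \epsilon, 1 + \epsilon], which is what keeps each policy update controlled.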

[Download / View Seminar PDF Notes]



Yuhan Chi