Work in progress. Feedback welcome.
Dyna-Q Notebook
Dyna-Q is a Q-learning algorithm with a planning component: between real environment steps, it performs additional Q-learning updates using previously encountered (state, action, reward, next-state) transitions stored in a learned model. We implement Tabular Dyna-Q from Chapter 8 of Sutton & Barto (Example 8.1) on a gridworld environment.
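The planning loop can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the environment is abstracted as an `env_step(state, action)` callable, and the model simply memorizes observed transitions, as in the tabular case.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start_state, actions, episodes=30, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.3, max_steps=1000, seed=0):
    """Tabular Dyna-Q: one real Q-learning update per environment step,
    plus `planning_steps` simulated updates replayed from a learned model."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(state, action)]
    model = {}               # model[(state, action)] = (reward, next_state, done)

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = env_step(s, a)
            target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            model[(s, a)] = (r, s2, done)
            # Planning: replay random previously observed transitions.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                ptarget = pr if pdone else pr + gamma * max(Q[(ps2, b)] for b in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
            if done:
                break
    return Q
```

With `planning_steps=0` this reduces to plain Q-learning; the planning updates are what let Dyna-Q propagate a reward along a path after seeing it only once.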
DQN Notebook
DQN trains a neural network to predict Q-values in a supervised fashion, regressing on targets computed from previously seen transitions sampled from a replay buffer.
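The two ingredients named above can be sketched independently of any network library: a fixed-size replay buffer, and the TD-target computation that turns sampled transitions into regression targets. This is a simplified sketch with illustrative names (`ReplayBuffer`, `td_targets`), not the notebook's actual code; `q_next` stands in for the network's Q-value predictions at the next state.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (state, action, reward, next_state, done)."""
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement within the batch.
        return self.rng.sample(list(self.buffer), batch_size)

def td_targets(batch, q_next, gamma=0.99):
    """Regression targets r + gamma * max_a Q(s', a),
    with bootstrapping cut off at terminal states."""
    targets = []
    for _, _, r, s2, done in batch:
        targets.append(r if done else r + gamma * max(q_next(s2)))
    return targets
```

The network is then fit by gradient descent on the squared error between its prediction for the taken action and these targets; the Nature 2015 paper computes `q_next` with a separate, periodically updated target network.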
| Environment | Random policy | Trained policy | Training details |
|---|---|---|---|
| Pong | ![]() | ![]() Random action during evaluation with 0% vs. 5% probability; scores 5:21 vs. 8:21. | 10 million frames |
MCTS Notebook
Monte-Carlo tree search is a search algorithm that, at each step, selects the most promising action by trading off how good actions are expected to be against how uncertain those estimates are. New actions are evaluated by querying either the environment or a model of it, while existing Q-value estimates are reused within the tree. We implement a naive Python version of the MCTS algorithm used by MuZero, and compare its output with MCTX, the faster JAX implementation released by DeepMind.
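The value-vs-uncertainty trade-off is easiest to see in a minimal UCT-style sketch. This is an illustration under simplifying assumptions, not MuZero's variant: MuZero uses pUCT with learned priors and a value network, whereas this sketch uses plain UCB1 selection and random rollouts against a deterministic `step(state, action)` model.

```python
import math
import random

class Node:
    def __init__(self):
        self.children = {}     # action -> Node, holding Q(s, a) statistics
        self.visits = 0
        self.value_sum = 0.0

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_action(node, actions, c=1.4):
    """UCB1: exploit high average value, explore rarely tried actions."""
    def ucb(a):
        child = node.children.get(a)
        if child is None or child.visits == 0:
            return float("inf")  # untried actions go first
        return child.value() + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(actions, key=ucb)

def rollout(state, step, actions, depth, gamma, rng):
    """Evaluate a leaf by a random playout of bounded depth."""
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        r, state = step(state, rng.choice(actions))
        ret += discount * r
        discount *= gamma
    return ret

def simulate(node, state, step, actions, depth, gamma, rng):
    """One simulation: select down the tree, expand, evaluate, back up."""
    if depth == 0:
        return 0.0
    a = select_action(node, actions)
    r, next_state = step(state, a)
    if a not in node.children:
        node.children[a] = Node()
        ret = r + gamma * rollout(next_state, step, actions, depth - 1, gamma, rng)
    else:
        ret = r + gamma * simulate(node.children[a], next_state, step,
                                   actions, depth - 1, gamma, rng)
    node.children[a].visits += 1
    node.children[a].value_sum += ret
    node.visits += 1
    return ret

def mcts(root_state, step, actions, simulations=100, depth=5, gamma=1.0, seed=0):
    rng = random.Random(seed)
    root = Node()
    for _ in range(simulations):
        simulate(root, root_state, step, actions, depth, gamma, rng)
    # Act by root visit counts, as MuZero does.
    return max(actions,
               key=lambda a: root.children[a].visits if a in root.children else -1)
```

Each simulation walks down the tree by the UCB rule, expands one new node, evaluates it, and backs the return up along the visited path; the visit counts at the root then summarize where the search spent its effort.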
Work in progress: as a first step, I'm implementing changes to the DQN code based only on the MuZero paper, with the goal of a scaled-down version of MuZero that demonstrates improvements over DQN on Ms Pacman. I won't look at the published pseudocode for now; afterwards, I'll review my implementation against it.
Reinforcement Learning: An Introduction (Sutton & Barto)
Playing Atari with Deep Reinforcement Learning (DQN Arxiv 2013)
Human-level control through deep reinforcement learning (DQN Nature 2015)
Mastering the game of Go with deep neural networks and tree search (AlphaGo)
Mastering the Game of Go without Human Knowledge (AlphaGo Zero)
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (AlphaZero)
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero)
Monte Carlo tree search in JAX (MCTX)
Deep Reinforcement Learning and the Deadly Triad (function approximation, off-policy learning, and bootstrapping)
- Dyna-Q
- DQN
- Replay buffer
- Atari environment
- Neural network, stochastic gradient descent
- Training loop
- Signs of life :)
- GPU
- Debug!
- Remaining details from both DQN papers
- Run for the full number of frames
- MuZero
- Monte-Carlo tree search
- Does it work with tensors / batches
- Other changes to DQN
- Different loss
- TD-targets
- Non-uniform sampling from replay buffer
- ...
- Monte-Carlo tree search



