Step-by-step Reimplementation Attempt of MuZero for Ms Pacman

Work in progress. Feedback welcome.

Contents

  1. Dyna-Q
  2. Deep-Q-Network (DQN)
  3. Monte-Carlo Tree Search (MCTS)
  4. MuZero
  5. References

Dyna-Q Notebook

Dyna-Q is Q-learning augmented with a planning component: after each real environment step, it performs additional Q-learning updates on previously encountered (state, action, reward, next state) transitions. We implement Tabular Dyna-Q from Chapter 8 of Sutton & Barto (Example 8.1) with a gridworld environment.
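The algorithm can be sketched roughly as follows. The toy corridor environment, hyperparameters, and function names below are illustrative stand-ins for this sketch, not the gridworld notebook code:

```python
import random
from collections import defaultdict

def dyna_q(step_fn, actions, start, is_terminal, episodes=50,
           n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q (Sutton & Barto, Ch. 8): after each real Q-learning
    update, replay n_planning simulated transitions from a learned model."""
    rng = random.Random(seed)
    Q = defaultdict(float)          # Q[(state, action)] -> estimated value
    model = {}                      # model[(state, action)] -> (reward, next state)

    def q_update(s, a, r, s2):
        best_next = 0.0 if is_terminal(s2) else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    def greedy(s):                  # greedy action with random tie-breaking
        best = max(Q[(s, b)] for b in actions)
        return rng.choice([b for b in actions if Q[(s, b)] == best])

    for _ in range(episodes):
        s = start
        while not is_terminal(s):
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2 = step_fn(s, a)
            q_update(s, a, r, s2)            # direct RL update
            model[(s, a)] = (r, s2)          # model learning (deterministic env)
            for _ in range(n_planning):      # planning: replay seen (s, a) pairs
                ps, pa = rng.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                q_update(ps, pa, pr, ps2)
            s = s2
    return Q

# Toy corridor: states 0..4, actions move left/right, reward 1 on reaching 4.
def corridor_step(s, a):
    s2 = min(4, max(0, s + a))
    return (1.0 if s2 == 4 else 0.0), s2

Q = dyna_q(corridor_step, actions=[-1, +1], start=0, is_terminal=lambda s: s == 4)
```

The planning loop is what distinguishes Dyna-Q from plain Q-learning: each real transition is reused many times, so value information propagates far faster through the state space.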

Deep-Q-Network (DQN)

DQN trains a neural network in a supervised fashion to predict Q-values, fitting targets computed from previously seen transitions sampled from a replay buffer.

| Environment | Random policy | Trained policy | Training details |
| --- | --- | --- | --- |
| Pong | (animation) | (animation) | Random action during evaluation with 0% vs. 5% probability. Scores: 5:21 vs. 8:21. Trained for 10 million frames. |
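The replay-buffer training step described above can be sketched in a few lines. This uses a linear Q-function in plain NumPy rather than the convolutional network of the DQN papers, and the bandit-style usage example and all hyperparameters are illustrative assumptions:

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity, seed=0):
        self.buf = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        batch = self.rng.sample(self.buf, batch_size)
        return tuple(map(np.array, zip(*batch)))

def dqn_update(W, W_target, batch, gamma=0.99, lr=0.1):
    """One supervised step: regress Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
    s, a, r, s2, done = batch
    q_next = (s2 @ W_target).max(axis=1)            # bootstrap from target weights
    target = r + gamma * q_next * (1.0 - done)
    q = s @ W                                       # (batch, n_actions)
    td_error = q[np.arange(len(a)), a] - target
    grad = np.zeros_like(W)                         # gradient of mean squared TD error
    for i, ai in enumerate(a):
        grad[:, ai] += td_error[i] * s[i]
    return W - lr * grad / len(a)

# Usage on a trivial one-state, two-action problem: action 0 pays 1, action 1 pays 0.
rng = np.random.default_rng(0)
buf = ReplayBuffer(100)
for _ in range(50):
    a = int(rng.integers(2))
    buf.push((np.array([1.0]), a, float(a == 0), np.array([1.0]), 1.0))
W = np.zeros((1, 2))
for _ in range(200):
    W = dqn_update(W, W.copy(), buf.sample(16), gamma=0.99, lr=0.1)
```

Sampling uniformly from the buffer decorrelates consecutive updates, and the separate (periodically copied) target weights stabilize the bootstrapped regression target.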

Monte-Carlo Tree Search (MCTS)

Monte-Carlo tree search is a search algorithm that, at each step, selects the most promising action by balancing how good actions are expected to be against how uncertain those estimates are. New actions are initialized by querying either the environment or a model of it, while existing Q-value estimates are reused within the tree. We implement a naive Python version of the MCTS algorithm used by MuZero and compare its output with MCTX, the faster JAX implementation released by DeepMind.
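The select/expand/backup cycle can be sketched as below. This is a simplified pUCT-style rule (constant exploration weight, no min-max value normalization, unlike the full MuZero search), and the toy chain MDP at the end is an illustrative assumption:

```python
import math

class Node:
    def __init__(self, prior, state=None, reward=0.0):
        self.prior, self.state, self.reward = prior, state, reward
        self.visit_count, self.value_sum = 0, 0.0
        self.children = {}                       # action -> child Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def puct(parent, child, c=1.25):
    # Exploitation (empirical value) plus prior-weighted exploration bonus.
    explore = c * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.value() + explore

def expand(node, model_fn, step_fn):
    priors, value = model_fn(node.state)
    for a, p in priors.items():
        reward, next_state = step_fn(node.state, a)
        node.children[a] = Node(prior=p, state=next_state, reward=reward)
    return value

def mcts(root_state, model_fn, step_fn, num_simulations=100, gamma=0.99):
    root = Node(prior=1.0, state=root_state)
    expand(root, model_fn, step_fn)
    for _ in range(num_simulations):
        node, path = root, [root]
        while node.children:                     # 1. selection down the tree
            parent = node
            _, node = max(parent.children.items(), key=lambda kv: puct(parent, kv[1]))
            path.append(node)
        value = expand(node, model_fn, step_fn)  # 2. expansion + leaf evaluation
        for n in reversed(path):                 # 3. backup with discounted rewards
            value = n.reward + gamma * value
            n.visit_count += 1
            n.value_sum += value
    return {a: child.visit_count for a, child in root.children.items()}

# Toy chain MDP: action 0 always yields reward 1, action 1 yields 0.
counts = mcts(
    root_state=0,
    model_fn=lambda s: ({0: 0.5, 1: 0.5}, 0.0),
    step_fn=lambda s, a: ((1.0 if a == 0 else 0.0), s + 1),
)
```

The returned root visit counts are the search's output: MuZero normalizes them into an improved policy for acting and as a training target.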

MuZero

Work in progress: As a first step, I'm implementing changes to the DQN code based only on the MuZero paper, with the goal of a scaled-down version of MuZero that demonstrates improvements over DQN on Ms Pacman. I won't look at the published pseudocode for now; afterwards, I'll review my implementation against it.
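The key structural change from DQN, per the MuZero paper, is replacing the single Q-network with three jointly trained functions: a representation function h, a dynamics function g, and a prediction function f. The interface can be sketched with random linear placeholders (the weights below are untrained stand-ins, not the paper's residual networks; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, HID, ACTIONS = 8, 4, 3

H = rng.normal(size=(OBS, HID))            # representation h: observation -> latent
G = rng.normal(size=(HID + ACTIONS, HID))  # dynamics g: (latent, action) -> next latent
R = rng.normal(size=(HID + ACTIONS,))      # reward head of g
P = rng.normal(size=(HID, ACTIONS))        # prediction f: latent -> policy logits
V = rng.normal(size=(HID,))                # prediction f: latent -> value

def represent(obs):
    return obs @ H

def predict(latent):
    logits = latent @ P
    policy = np.exp(logits - logits.max())
    return policy / policy.sum(), latent @ V

def dynamics(latent, action):
    x = np.concatenate([latent, np.eye(ACTIONS)[action]])
    return x @ R, x @ G                    # (predicted reward, next latent)

# Unroll the learned model a few steps from a real observation -- this latent
# rollout is what MCTS traverses instead of the real environment.
latent = represent(rng.normal(size=OBS))
for action in [0, 2, 1]:
    policy, value = predict(latent)
    reward, latent = dynamics(latent, action)
```

Because the search only ever touches h, g, and f, the environment is needed just once per real step, which is what lets MuZero plan without a simulator.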

References

Main

Reinforcement Learning: An Introduction (Sutton & Barto)

Playing Atari with Deep Reinforcement Learning (DQN, arXiv 2013)

Human-level control through deep reinforcement learning (DQN, Nature 2015)

Mastering the game of Go with deep neural networks and tree search (AlphaGo)

Mastering the Game of Go without Human Knowledge (AlphaGo Zero)

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (AlphaZero)

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero)

MuZero Pseudocode

Monte Carlo tree search in JAX (MCTX)

Related

Deep Reinforcement Learning and the Deadly Triad (function approximation, off-policy learning, and bootstrapping)

TODO

  • Dyna-Q
  • DQN
    • Replay buffer
    • Atari environment
    • Neural network, stochastic gradient descent
    • Training loop
    • Signs of life :)
    • GPU
    • Debug!
    • Remaining details from both DQN papers
    • Run for the full number of frames
  • MuZero
    • Monte-Carlo tree search
      • Does it work with tensors / batches?
    • Other changes to DQN
      • Different loss
      • TD-targets
      • Non-uniform sampling from replay buffer
      • ...
