Using N-step dueling DDQN with PER for learning how to play a Pacman game
DeepMind's famous paper Playing Atari with Deep Reinforcement Learning introduced a new algorithm called DQN. It showed that an AI agent can learn to play games simply by watching the screen, without any prior knowledge about the game. In this project I added a few enhancements to the vanilla DQN from various follow-up papers and tested the agent on OpenAI Gym's MsPacman-v4 environment. The resulting network is summarized below.
```
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 88, 80, 1)    0
conv2d_1 (Conv2D)               (None, 22, 20, 32)   2080        input_1[0][0]
conv2d_2 (Conv2D)               (None, 11, 10, 64)   32832       conv2d_1[0][0]
conv2d_3 (Conv2D)               (None, 11, 10, 64)   36928       conv2d_2[0][0]
flatten_1 (Flatten)             (None, 7040)         0           conv2d_3[0][0]
dense_2 (Dense)                 (None, 5)            35205       flatten_1[0][0]
lambda_1 (Lambda)               (None, 1)            0           dense_2[0][0]
dense_1 (Dense)                 (None, 1)            7041        flatten_1[0][0]
subtract_1 (Subtract)           (None, 5)            0           dense_2[0][0]
                                                                 lambda_1[0][0]
add_1 (Add)                     (None, 5)            0           dense_1[0][0]
                                                                 subtract_1[0][0]
==================================================================================================
Total params: 114,086
Trainable params: 114,086
Non-trainable params: 0
__________________________________________________________________________________________________
```
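The summary above corresponds to a dueling head built with the Keras functional API. Below is a minimal sketch that reproduces the same layer shapes and parameter counts; the kernel sizes, strides, padding and ReLU activations are inferred from the output shapes (they follow the standard DQN convolutional stack), so treat it as an illustration rather than the exact training code.

```python
# Minimal sketch of the dueling network above, using the Keras functional API.
# Kernel sizes, strides, padding and activations are assumptions; they
# reproduce the layer shapes and parameter counts shown in the summary.
from keras.layers import Input, Conv2D, Flatten, Dense, Lambda, Subtract, Add
from keras.models import Model
import keras.backend as K

n_actions = 5  # reduced MsPacman action set used here

frames = Input(shape=(88, 80, 1))
x = Conv2D(32, 8, strides=4, padding='same', activation='relu')(frames)
x = Conv2D(64, 4, strides=2, padding='same', activation='relu')(x)
x = Conv2D(64, 3, strides=1, padding='same', activation='relu')(x)
x = Flatten()(x)

value = Dense(1)(x)              # V(s): how good the state is
advantage = Dense(n_actions)(x)  # A(s, a): per-action advantage
# Average baseline: subtract the mean advantage for a stable, identifiable split
mean_advantage = Lambda(lambda a: K.mean(a, axis=1, keepdims=True))(advantage)
centered_advantage = Subtract()([advantage, mean_advantage])
q_values = Add()([value, centered_advantage])  # Q(s, a) = V(s) + (A(s, a) - mean A)

model = Model(inputs=frames, outputs=q_values)
model.summary()
```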
In the max operator of its update, DQN uses the same network both to select and to evaluate an action. This makes it more likely to pick overestimated values, resulting in overoptimistic value estimates. Double DQN solves this by using the main (online) network as the action selector and the target network as the value estimator, as in the sketch below.
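A small illustration of this Double-DQN target computation (not the project's exact code), assuming two Keras models with `predict`, integer-encoded actions, and a 0/1 `dones` array; the gamma value is an assumption:

```python
import numpy as np

def double_dqn_targets(online_model, target_model, states, actions,
                       rewards, next_states, dones, gamma=0.99):
    """Double DQN: the online network selects the next action, the target
    network evaluates it. Names and the gamma value are illustrative."""
    batch = np.arange(len(states))
    # Action selection with the online (main) network
    best_actions = np.argmax(online_model.predict(next_states), axis=1)
    # Action evaluation with the target network
    next_q = target_model.predict(next_states)[batch, best_actions]
    # Bellman target; no bootstrap on terminal transitions.
    # With n-step returns, gamma here becomes gamma ** n.
    targets = online_model.predict(states)
    targets[batch, actions] = rewards + gamma * next_q * (1.0 - dones)
    return targets
```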
Vanilla DQN accumulates a single reward and then bootstraps from the greedy action at the next step. Alternatively, forward-view multi-step targets can be used, accumulating the rewards of several steps before bootstrapping (5 steps here); see the sketch below.
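For illustration, an n-step transition can be assembled from a sliding window of single-step transitions like this. It is a sketch under the assumption that each stored transition is a `(state, action, reward, next_state, done)` tuple; the function name is made up:

```python
from collections import deque

N_STEPS, GAMMA = 5, 0.99

def n_step_transition(window, gamma=GAMMA):
    """Collapse the last N_STEPS transitions into one forward-view n-step
    transition. Illustrative sketch, not the project's exact code."""
    state, action = window[0][0], window[0][1]
    next_state, done = window[-1][3], window[-1][4]
    n_step_reward = 0.0
    for _, _, r, s_next, d in reversed(list(window)):
        n_step_reward = r + gamma * n_step_reward
        if d:  # the episode ended inside the window: cut the return there
            n_step_reward = r
            next_state, done = s_next, True
    # The learning update later bootstraps with gamma ** N_STEPS from next_state.
    return state, action, n_step_reward, next_state, done

# Usage: keep a sliding window of the most recent single-step transitions
window = deque(maxlen=N_STEPS)
```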
The dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action for every state. This is particularly useful in states where no action affects the environment in any relevant way. Also, for a more stable optimization, the advantage stream is centered around its mean before being added to the state value, i.e. an average baseline is used for the Q evaluation, as formalized below.
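Concretely, the two streams are combined with the mean advantage as a baseline:

$$Q(s, a) = V(s) + \Big(A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\Big)$$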
Lastly, transitions can be prioritized for replay by the magnitude of their TD error. To avoid replaying only a small subset of high-error transitions over and over, a stochastic sampling method is used that interpolates between pure greedy prioritization and uniform random sampling: I used a min-heap and drew about 60% of each batch by TD-error priority and 40% uniformly, as sketched below.
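A minimal sketch of that mixed sampling scheme follows. The class and method names are made up for illustration, the memory is append-only, and priority updates after replay are omitted; Python's `heapq` is a min-heap, so priorities are stored negated in order to fetch the largest |TD error| first.

```python
import heapq
import random

class MixedPriorityReplay:
    """Sketch of mixed replay: a greedy share picked by |TD error| plus a
    uniform share. Illustrative only, not the project's actual buffer."""

    def __init__(self):
        self.transitions = []
        self.heap = []  # (-|td_error|, index): negate so the min-heap yields the largest error

    def add(self, transition, td_error):
        index = len(self.transitions)
        self.transitions.append(transition)
        heapq.heappush(self.heap, (-abs(td_error), index))

    def sample(self, batch_size, priority_fraction=0.6):
        n_greedy = int(batch_size * priority_fraction)
        # Greedy share: transitions with the largest |TD error|
        greedy = [self.transitions[i] for _, i in heapq.nsmallest(n_greedy, self.heap)]
        # Uniform share: plain random sampling over the whole memory
        uniform = random.sample(self.transitions, batch_size - n_greedy)
        return greedy + uniform
```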
This project is licensed under the MIT License - see the LICENSE file for details