A repo to understand Q-learning and Deep Q-learning.
Custom implementations of Q-learning and Deep Q-learning.
Reinforcement Learning explained here: https://medium.com/@ipaar3/saturnmind-94586f0d0158
Run grid_environment.py to run just the Q-learning part.
The experimental setup is simple: an 8x8 grid of nodes with some blocked cells. The aim is to navigate to the diagonally opposite corner of the grid in the minimal number of steps.
python grid_environment.py --train True
Convergence is observed after 4000-5000 episodes.
See this to know how to save and retrieve the model.
After the model is trained for 5000 episodes, we can just quit the environment and the model is saved automatically.
python grid_environment.py
For each prediction, we get Q values for the 4 actions.
cost = predicted Q value - actual Q value
The actual value here comes from the Bellman equation:
actual Q value = reward + discount_factor * max Q value
where the max Q value is taken over the actions of the next state (the state reached from the previous state by the move taken).
The cost is then minimized with the Adam optimizer.
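A minimal sketch of that loss, assuming a small PyTorch network over a 2-D grid state; the layer sizes, learning rate, and discount factor below are illustrative, not necessarily the repo's actual setup.

```python
import torch
import torch.nn as nn

# Hypothetical network: maps a 2-D grid position to Q values for the 4 actions.
q_network = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)
discount_factor = 0.9  # assumed value

def train_step(state, action, reward, next_state, done):
    """One Adam step on the squared Bellman error for a single transition."""
    predicted_q = q_network(state)[action]            # predicted Q value of the taken action
    with torch.no_grad():                             # the target is treated as a constant
        max_next_q = q_network(next_state).max()      # max Q value over actions in the next state
        target_q = reward if done else reward + discount_factor * max_next_q
    cost = (predicted_q - target_q) ** 2              # cost = predicted - actual, squared
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()
    return cost.item()
```

In the grid example, `state` would simply be the agent's (row, column) position encoded as a float tensor.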
Step 1 : Define your environment and set your actions and goal.
E.g.
Environment : Super Mario. (The best game ever!)
Actions : move forward, jump, duck, long jump, etc.
Goal : rescuing the princess.
Note: enemy tortoises and the triangle-shaped things are removed from the scenario for simplicity.
Step 2 : Initialize the Q table with states and actions.
The Q table, AKA Quality table, represents the quality of the move being made in that state.
Higher magnitude -> higher quality move in a state.
States - the current state or position in the environment.
E.g. the current location of Mario in the frame.
Actions - the list of movements allowed in the environment, as defined above.
Like this one (a small code sketch for initializing it follows the table):
State | Forward | Jump | Duck | Ljump |
---|---|---|---|---|
Frame 0 | 0.0 | 0.0 | 0.0 | 0.0 |
Frame 1 | 0.0 | 0.0 | 0.0 | 0.0 |
Frame 2 | 0.0 | 0.0 | 0.0 | 0.0 |
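A minimal initialization sketch; the state and action names below mirror the example table, not the repo's code.

```python
import numpy as np

# States and moves from the example table above (illustrative names).
states = ["Frame 0", "Frame 1", "Frame 2"]
actions = ["Forward", "Jump", "Duck", "Ljump"]

# The Q table starts at all zeros: no (state, move) pair is preferred yet.
q_table = np.zeros((len(states), len(actions)))
```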
Now our job is to train and adapt the above Q table by interacting with the environment, following the steps below.
Step 3 : Let the hero explore the environment.
Our hero takes a random move if the Q table's values for the present state are zero or equally distributed.
Otherwise, the hero chooses the move with the highest Q value for the present state, as sketched in code below.
For a given state:
if Jump > Forward:
    Mario chooses to jump.
else:
    Mario chooses to move forward.
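A sketch of that selection rule (random while the row is still flat, otherwise greedy). Many implementations use epsilon-greedy exploration instead; this follows the rule exactly as described above.

```python
import numpy as np

def choose_move(q_table, state, rng=None):
    """Pick a move for the current state: random if the row is flat, else the best one."""
    rng = np.random.default_rng() if rng is None else rng
    q_values = q_table[state]
    if np.allclose(q_values, q_values[0]):        # all moves look equally good (e.g. still all zeros)
        return int(rng.integers(len(q_values)))   # explore with a random move
    return int(np.argmax(q_values))               # exploit the move with the highest Q value
```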
Step 4 : Update the Q table.
Now the reward for each move towards the goal is calculated and the Q table is updated.
The update is specific to that state and move at that instant.
Q table(S, M) = Q table(S, M) + learning_rate * [reward + discount_factor * max Q table(S', all moves) - Q table(S, M)]
learning_rate = 0.1 # one step at a time.
S - State, M - Move, S' - next State.
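The same update written as a small helper; the reward, next_state, and default discount_factor arguments are assumptions layered on top of the formula above.

```python
import numpy as np

def update_q_table(q_table, state, move, reward, next_state,
                   learning_rate=0.1, discount_factor=0.9):
    """Nudge Q(state, move) towards the Bellman target for one transition."""
    target = reward + discount_factor * np.max(q_table[next_state])   # actual Q value
    q_table[state, move] += learning_rate * (target - q_table[state, move])
```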
An updated Q table after some movements:
State | Forward | Jump | Duck | Ljump |
---|---|---|---|---|
Frame 0 | 0.4 | 0.0 | 0.2 | 0.9 |
Frame 1 | 0.6 | 0.3 | 0.0 | 0.5 |
Frame 2 | 0.9 | 0.7 | 0.0 | 0.0 |
Step 5 : Handling fail conditions.
If our hero fails to reach the goal, update the Q table with a negative reward.
Negatively rewarding a move at that state reduces the selection of that move in the future.
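Using the hypothetical q_table and update_q_table from the sketches above, a fail condition is just the same update with a negative reward, for example:

```python
# Mario falls into a pit after jumping from Frame 1: penalize that (state, move) pair.
# The indices and the -1.0 penalty are illustrative.
update_q_table(q_table, state=1, move=1, reward=-1.0, next_state=1)   # move 1 = "Jump"
```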
Step 6 : Reaching the goal.
The above process continues until our hero reaches the goal.
Once the goal is reached, our program has completed one generation.
Step 7 : Passing knowledge across generations.
Once a generation is complete, the game is started again.
But the same Q table is kept, in order to retain the knowledge of the previous generations.
Steps 3 - 6 are repeated again and again until saturation, or until enough experience is gathered in larger cases.
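A compact sketch of this loop, reusing the choose_move and update_q_table sketches from Steps 3 and 4 on a tiny stand-in environment (a 5-frame strip where Forward advances one frame); the environment and generation count are assumptions, not the repo's setup.

```python
import numpy as np

num_states, num_actions, goal = 5, 4, 4          # toy stand-in environment, not the repo's
q_table = np.zeros((num_states, num_actions))

def step(state, move):
    """Hypothetical environment step: Forward (move 0) advances one frame, others stay put."""
    next_state = min(state + 1, goal) if move == 0 else state
    reward = 1.0 if next_state == goal else -0.1
    return next_state, reward

for generation in range(500):                    # each pass through the game is one generation
    state = 0                                    # the game restarts...
    while state != goal:                         # ...but the same Q table is kept
        move = choose_move(q_table, state)       # Step 3: explore or exploit
        next_state, reward = step(state, move)
        update_q_table(q_table, state, move, reward, next_state)   # Step 4: update
        state = next_state

print(np.round(q_table, 2))                      # Forward should dominate every row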
Finally, we get an updated Q table with enough knowledge of the environment.
This Q table can be used to successfully complete Super Mario with much more ease.
Nostalgia, huh?