Reinforcement Learning of an Intelligent Agent: Q-Learning
A Nagel-Schreckenberg cellular automaton model for training an agent to make efficient lane-change decisions using the Q-learning algorithm.
This program has three main objects: car, road, and representation. The representation object handles interactive mode, while the road and car classes make up the environment for the simulation. The road has three lanes of 100 cells each and is modeled as a circular road (periodic boundary conditions). The simulation starts with 99 human-driven vehicles (HVs) with well-defined properties distributed randomly on the road, along with 1 agent placed randomly on the same road. Each update of the system involves every car object making a lane-change decision, followed by a longitudinal update. The agent uses the Q-learning algorithm to learn the optimal lane-change policy that minimizes the time taken to complete 10 cycles of the road.
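For background, each longitudinal update in a Nagel-Schreckenberg model applies four rules per car: acceleration, gap-limited slowing, random braking, and movement. The sketch below illustrates those rules on a single circular lane using the parameters quoted later in this README; it is a generic illustration, not the code in simulation/road.py.

import random

LANE_LENGTH = 100   # cells per lane (circular / periodic boundary)
V_MAX = 5           # maximum speed on the road
P_BRAKE = 0.4       # random braking probability

def nasch_step(positions, velocities):
    # One longitudinal Nagel-Schreckenberg update on a single circular lane.
    # positions: cell indices sorted in increasing order; velocities: matching speeds.
    n = len(positions)
    new_positions, new_velocities = [], []
    for i in range(n):
        pos, v = positions[i], velocities[i]
        ahead = positions[(i + 1) % n]             # car in front (periodic boundary)
        gap = (ahead - pos - 1) % LANE_LENGTH      # empty cells to the car ahead
        v = min(v + 1, V_MAX)                      # 1. acceleration
        v = min(v, gap)                            # 2. slow down to avoid collision
        if v > 0 and random.random() < P_BRAKE:
            v -= 1                                 # 3. random braking
        new_positions.append((pos + v) % LANE_LENGTH)  # 4. move
        new_velocities.append(v)
    return new_positions, new_velocities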
Use the package manager pip to install pygame, matplotlib, and any other packages that you may be missing on your system:
pip install pygame matplotlib
This version has the following simulation conditions:
Total Number of cars: 100
Maximum speed on road: 5
Maximum AV-AV speed: 3
Maximum AV-HV speed: 3
Maximum HV speed: 3
Probability of lane change of AV: 0.6
Probability of lane change of HV: 0.6
Probability of braking of AV: 0.4
Probability of braking of HV: 0.4
Number of AVs: 1
Simulation Terminates at (cycles): 10
Random seed is fixed at 4
Maximum Time Steps: 2000
The default parameters used for Q-learning are listed below (a sketch of the training loop they typically drive follows the list):
#define parameters
num_episodes = 100
max_steps_per_episode = 2000
learning_rate = 0.1
discount_rate = 0.99
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.005
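For reference, these parameters typically drive a standard tabular Q-learning loop with epsilon-greedy exploration and an exponentially decaying exploration rate, as sketched below. The env_reset and env_step helpers are hypothetical placeholders, not functions from this repository.

import random
import numpy as np

q_table = np.zeros((32, 3))   # 32 visibility-based states x 3 lane-change actions

for episode in range(num_episodes):
    state = env_reset()                       # hypothetical: reset the road, return the agent's state
    for step in range(max_steps_per_episode):
        # epsilon-greedy action selection
        if random.random() < exploration_rate:
            action = random.randrange(3)
        else:
            action = int(np.argmax(q_table[state]))
        new_state, reward, done = env_step(action)   # hypothetical environment step
        # standard tabular Q-learning update
        q_table[state, action] += learning_rate * (
            reward + discount_rate * np.max(q_table[new_state]) - q_table[state, action]
        )
        state = new_state
        if done:
            break
    # exponential decay of the exploration rate
    exploration_rate = min_exploration_rate + (
        max_exploration_rate - min_exploration_rate
    ) * np.exp(-exploration_decay_rate * episode)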
The new model of the Q-learning framework introduces a realistic assumption that (theoretically) makes training much faster and the program more accurate and realistic.
The new model relies on the concept of visibility: the agent is now aware of vehicles in its vicinity (front, back, and sides) and makes its lane-change decisions based on this visibility radius. This reduces the state space by a factor of 10, so training speed increases drastically.
The Q-table is a matrix of 32 rows (assuming a visibility of 5 cells front and back and 2 cells on the sides) and 3 columns.
The rows correspond to the state space (32 grids in the road data structure).
The columns correspond to the action space (change lane up, change lane down, do not change lane).
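A minimal sketch of such a Q-table, assuming the cells visible to the agent are encoded into an integer state index (the encode_state helper below is hypothetical; the actual encoding lives in the repository's car/road classes):

import numpy as np

NUM_STATES = 32    # visibility-based states
NUM_ACTIONS = 3    # change lane up, change lane down, do not change lane

q_table = np.zeros((NUM_STATES, NUM_ACTIONS))

def encode_state(visible_cells):
    # hypothetical encoding: map the occupancy pattern of the visible cells
    # to an integer index in [0, NUM_STATES)
    index = 0
    for occupied in visible_cells:
        index = (index << 1) | int(occupied)
    return index % NUM_STATES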
The environment is laid out as follows:
Walkway
Road (lane 1)
Road (lane 2)
Road (lane 3)
Walkway
Note that the roads are circular (periodic boundary).
There are three types of rewards and two types of penalties.
Best Reward > Better Reward > Good Reward
Highest Penalty > Penalty
The reward function for version 1.0 works as follows (a Python sketch appears after this list):
the action is passed as input to the function guiding the agent's lane-change dynamics
if action results in agent moving to an occupied lane -> end episode and high penalty
if action leads to empty block and safe for moving -> good reward
if action leads to safety and better v_potential -> highest reward
if action leads to not changing lane and lane change was not possible -> good reward
for each cycle completed, c, record time taken, t.
rewards += constant * ( cycle_distance / t)
-> this incentivizes the agent to increase its average speed
-> this also ensures that it completes more cycles
Once the aggregate reward has been calculated using the above code block, the final reward for the episode is computed as follows:
final_reward = aggregate_reward - timesteps_taken_to_complete_10_cycles
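A minimal Python sketch of this version 1.0 reward logic, assuming hypothetical names for the reward constants and for the flags (occupied cell, safe move, better v_potential, lane change possible) that the repository's allocateReward() presumably derives from the road state:

# illustrative constants obeying Best Reward > Better Reward > Good Reward
BEST_REWARD, BETTER_REWARD, GOOD_REWARD = 10.0, 5.0, 1.0
# and Highest Penalty > Penalty
HIGHEST_PENALTY, PENALTY = -100.0, -10.0
CYCLE_CONSTANT = 1.0   # the "constant" in the per-cycle bonus

def allocate_reward(moved_into_occupied_cell, safe_move,
                    gained_v_potential, lane_change_possible):
    # Return (reward, episode_done) for one lane-change action (v1.0 logic).
    # Note: where BETTER_REWARD applies is not spelled out in the description above.
    if moved_into_occupied_cell:
        return HIGHEST_PENALTY, True      # collision: end the episode with a high penalty
    if safe_move and gained_v_potential:
        return BEST_REWARD, False         # safe move with better v_potential
    if safe_move:
        return GOOD_REWARD, False         # move into an empty, safe block
    if not lane_change_possible:
        return GOOD_REWARD, False         # stayed in lane because no change was possible
    return PENALTY, False                 # assumption: any other unsafe attempt is penalized

def cycle_bonus(cycle_distance, t):
    # rewards += constant * (cycle_distance / t) for each completed cycle c
    return CYCLE_CONSTANT * (cycle_distance / t)

def final_reward(aggregate_reward, timesteps_taken_to_complete_10_cycles):
    return aggregate_reward - timesteps_taken_to_complete_10_cycles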
Dr. Li Meeting Note:
Reward Functions should be functions of speed and collision risk.
Rewards can be some constant times the speed gain from action.
Need to quantify collision risk and associate it with reward/penalty.
Three-dimensional action consequences: safety, velocity gain, and survival. These are the cases that may result from a lane-changing action (one possible formulation is sketched below).
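One possible way to turn these notes into code (an assumption for illustration, not the implemented method) is to quantify collision risk from the gap to the leading vehicle in the target lane and combine it with the speed gain from the action:

def collision_risk(gap_to_leader, own_speed):
    # crude headway-style risk: risk grows as the gap shrinks relative to the
    # agent's speed (hypothetical formulation, clipped to [0, 1])
    if own_speed == 0:
        return 0.0
    return min(1.0, own_speed / (gap_to_leader + 1))

def lane_change_reward(speed_before, speed_after, gap_to_leader,
                       speed_constant=1.0, risk_constant=5.0):
    # reward = constant * speed gain, penalized by the quantified collision risk
    speed_gain = speed_after - speed_before
    risk = collision_risk(gap_to_leader, speed_after)
    return speed_constant * speed_gain - risk_constant * risk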
Training must be done for a finite number of episodes, and each episode must start with random initial conditions. The training code must allow models to learn from previously trained models. The training model may also include neural networks for more efficient training (in the works).
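A minimal sketch of how learning from a previously trained model could work by saving and reloading the Q-table between runs; the checkpoint file name and overall structure are assumptions, not the repository's actual code:

import os
import numpy as np

CHECKPOINT = "q_table.npy"   # hypothetical checkpoint file

# start from the previous model if one exists, otherwise from scratch
if os.path.exists(CHECKPOINT):
    q_table = np.load(CHECKPOINT)
else:
    q_table = np.zeros((32, 3))

# ... run the training episodes with random initial conditions ...

np.save(CHECKPOINT, q_table)   # persist the learned values for the next run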
In order to change the simulation conditions, edit the file “config/case.py”:
# sim data
data = ["trial.txt",100,5,3,3,3,0.6,0.6,0.4,0.4,1,10]
"""
order of data:
["Output file name: ","Total Number of cars: ", "Maximum speed on road: ",
"Maximum AV-AV speed: ", "Maximum AV-HV speed: ",
"Maximum HV speed: ", "Probability of lane change of AV: ",
"Probability of lane change of HV: ", "Probability of braking of AV: ",
"Probability of braking of HV: ", "Number of AVs: ","Simulation Terminates at (cycles): "]
"""
In order to change the Q-learning parameters, edit the respective variables in the “driver.py” file. To change the environment conditions, edit the “simulation/road.py” file; to change agent/other-car behaviors and the reward functions, edit the “simulation/car.py” file.
Important methods in road.py:
self.step(act), self.setEnvironment(totalCars, agentNum), and the constructor
Important methods in car.py:
self.qUpdateLane(act), self.agentLaneChange(act), self.allocateReward(), and the constructor
In your shell, execute the following command:
python3 driver.py
The program records the rewards, Q-values, and timesteps associated with each episode and, upon termination, creates a text file with these key statistics and generates two plots: rewards vs. episodes and timesteps vs. episodes.
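For reference, here is a minimal matplotlib sketch of how plots of this kind can be produced from per-episode statistics; the variable and file names are illustrative, not the ones used by driver.py.

import matplotlib.pyplot as plt

def plot_training_stats(rewards, timesteps):
    # rewards and timesteps are lists with one entry per episode
    episodes = range(1, len(rewards) + 1)

    plt.figure()
    plt.plot(episodes, rewards)
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.title("Rewards vs Episodes")
    plt.savefig("rewards_vs_episodes.png")

    plt.figure()
    plt.plot(episodes, timesteps)
    plt.xlabel("Episode")
    plt.ylabel("Timesteps")
    plt.title("Timesteps vs Episodes")
    plt.savefig("timesteps_vs_episodes.png")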