Reinforcement Learning

This module is a collection of common RL approaches implemented in Lightning.


Module authors

Contributions by: Donal Byrne

  • DQN

  • Double DQN

  • Dueling DQN

  • Noisy DQN

  • NStep DQN

  • Prioritized Experience Replay DQN

  • Reinforce

  • Vanilla Policy Gradient


Note

RL models currently only support CPU and single GPU training with distributed_backend=dp. Full GPU support will be added in later updates.

DQN Models

The following models are based on DQN.


Deep-Q-Network (DQN)

DQN model introduced in Playing Atari with Deep Reinforcement Learning. Paper authors: Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller.

Original implementation by: Donal Byrne

The DQN was introduced in Playing Atari with Deep Reinforcement Learning by researchers at DeepMind. It took the concept of tabular Q-learning and scaled it to much larger problems by approximating the Q function with a deep neural network.

The goal behind DQN was to take the simple control method of Q-learning and scale it up to solve complicated tasks. As well as this, the method needed to be stable. The DQN solves these issues with the following additions.

Approximated Q Function

Storing Q values in a table works well in theory, but is completely unscalable. Instead, the authors approximate the Q function using a deep neural network. This allows the DQN to be used for much more complicated tasks.

Replay Buffer

Similar to supervised learning, the DQN learns on randomly sampled batches of previous data stored in an Experience Replay Buffer. The ‘target’ is calculated using the Bellman equation

Q(s,a) \leftarrow r + \gamma \max_{a' \in A} Q(s', a')

and then we optimize using SGD just like a standard supervised learning problem.

L = (Q(s,a) - (r + \gamma \max_{a' \in A} Q(s', a')))^2
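
To make the loss concrete, here is a minimal sketch of the computation (assuming hypothetical net and target_net modules and a batch of (states, actions, rewards, dones, next_states) tensors; the model's actual loss code may differ in detail):

import torch
import torch.nn as nn

def dqn_loss(batch, net, target_net, gamma=0.99):
    """Standard DQN loss: MSE between Q(s, a) and the Bellman target."""
    states, actions, rewards, dones, next_states = batch

    # Q values of the actions that were actually taken
    q_values = net(states).gather(1, actions.unsqueeze(-1)).squeeze(-1)

    # The Bellman target uses the frozen target network and no gradients
    with torch.no_grad():
        next_q_values = target_net(next_states).max(dim=1)[0]
        next_q_values[dones] = 0.0
        target = rewards + gamma * next_q_values

    return nn.MSELoss()(q_values, target)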

DQN Results

DQN: Pong

DQN Baseline Results

Example:

from pl_bolts.models.rl import DQN
dqn = DQN("PongNoFrameskip-v4")
trainer = Trainer()
trainer.fit(dqn)
class pl_bolts.models.rl.dqn_model.DQN(env, gpus=0, eps_start=1.0, eps_end=0.02, eps_last_frame=150000, sync_rate=1000, gamma=0.99, learning_rate=0.0001, batch_size=32, replay_size=100000, warm_start_size=10000, num_samples=500, **kwargs)[source]

Bases: pytorch_lightning.LightningModule

Basic DQN Model

PyTorch Lightning implementation of DQN

Paper authors: Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller.

Model implemented by:

  • Donal Byrne <https://github.com/djbyrne>

Example

>>> from pl_bolts.models.rl.dqn_model import DQN
...
>>> model = DQN("PongNoFrameskip-v4")

Train:

trainer = Trainer()
trainer.fit(model)
Parameters
  • env (str) – gym environment tag

  • gpus (int) – number of gpus being used

  • eps_start (float) – starting value of epsilon for the epsilon-greedy exploration

  • eps_end (float) – final value of epsilon for the epsilon-greedy exploration

  • eps_last_frame (int) – the final frame for the decrease of epsilon. At this frame epsilon = eps_end

  • sync_rate (int) – the number of iterations between syncing up the target network with the train network

  • gamma (float) – discount factor

  • learning_rate (float) – learning rate

  • batch_size (int) – size of minibatch pulled from the DataLoader

  • replay_size (int) – total capacity of the replay buffer

  • warm_start_size (int) – how many random steps through the environment to be carried out at the start of training to fill the buffer with a starting point

  • num_samples (int) – the number of samples to pull from the dataset iterator and feed to the DataLoader

Note

This example is based on:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition/blob/master/Chapter06/02_dqn_pong.py

Note

Currently only supports CPU and single GPU training with distributed_backend=dp

static add_model_specific_args(arg_parser)[source]

Adds arguments for DQN model

Note: these params are fine tuned for Pong env

Parameters

arg_parser (ArgumentParser) – parent parser

Return type

ArgumentParser

build_networks()[source]

Initializes the DQN train and target networks

Return type

None

configure_optimizers()[source]

Initialize Adam optimizer

Return type

List[Optimizer]

forward(x)[source]

Passes in a state x through the network and gets the q_values of each action as an output

Parameters

x (Tensor) – environment state

Return type

Tensor

Returns

q values

populate(warm_start)[source]

Populates the buffer with initial experience

Return type

None

prepare_data()[source]

Initialize the Replay Buffer dataset used for retrieving experiences

Return type

None

test_dataloader()[source]

Get test loader

Return type

DataLoader

test_epoch_end(outputs)[source]

Log the avg of the test results

Return type

Dict[str, Tensor]

test_step(*args, **kwargs)[source]

Evaluate the agent for 10 episodes

Return type

Dict[str, Tensor]

train_dataloader()[source]

Get train loader

Return type

DataLoader

training_step(batch, _)[source]

Carries out a single step through the environment to update the replay buffer. Then calculates loss based on the minibatch received

Parameters
Return type

OrderedDict

Returns

Training loss and log metrics


Double DQN

Double DQN model introduced in Deep Reinforcement Learning with Double Q-learning. Paper authors: Hado van Hasselt, Arthur Guez, David Silver

Original implementation by: Donal Byrne

The original DQN tends to overestimate Q values during the Bellman update, leading to instability and harming training. This is due to the max operation in the Bellman equation.

We are constantly taking the max of our agent's estimates during our update. This may seem reasonable if we could trust these estimates. However, during the early stages of training these estimates will be off, which can lead to instability until they become more reliable.

The Double DQN fixes this overestimation by choosing actions for the next state using the main trained network, but taking the values of those actions from the more stable target network. So we still take the greedy action, but its value will be less “optimistic” because it is evaluated by the target network.

DQN expected return

Q(s_t, a_t) = r_t + \gamma \max_{a} Q'(s_{t+1}, a)

Double DQN expected return

Q(s_t, a_t) = r_t + \gamma Q'(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a))
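
A minimal sketch of how the Double DQN target differs from the standard target (assuming hypothetical net and target_net modules; only the target computation changes, the rest of the DQN loss stays the same):

import torch

@torch.no_grad()
def double_dqn_target(rewards, dones, next_states, net, target_net, gamma=0.99):
    """Select the next action with the online net, evaluate it with the target net."""
    next_actions = net(next_states).argmax(dim=1, keepdim=True)                   # argmax_a Q(s', a)
    next_q_values = target_net(next_states).gather(1, next_actions).squeeze(-1)   # Q'(s', argmax)
    next_q_values[dones] = 0.0
    return rewards + gamma * next_q_values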

Double DQN Results

Double DQN: Pong

Double DQN Result

DQN vs Double DQN: Pong

orange: DQN

blue: Double DQN

Double DQN Comparison Result

Example:

from pl_bolts.models.rl import DoubleDQN
ddqn = DoubleDQN("PongNoFrameskip-v4")
trainer = Trainer()
trainer.fit(ddqn)
class pl_bolts.models.rl.double_dqn_model.DoubleDQN(env, gpus=0, eps_start=1.0, eps_end=0.02, eps_last_frame=150000, sync_rate=1000, gamma=0.99, learning_rate=0.0001, batch_size=32, replay_size=100000, warm_start_size=10000, num_samples=500, **kwargs)[source]

Bases: pl_bolts.models.rl.dqn_model.DQN

Double Deep Q-network (DDQN) PyTorch Lightning implementation of Double DQN

Paper authors: Hado van Hasselt, Arthur Guez, David Silver

Model implemented by:

  • Donal Byrne <https://github.com/djbyrne>

Example

>>> from pl_bolts.models.rl.double_dqn_model import DoubleDQN
...
>>> model = DoubleDQN("PongNoFrameskip-v4")

Train:

trainer = Trainer()
trainer.fit(model)
Parameters
  • env (str) – gym environment tag

  • gpus (int) – number of gpus being used

  • eps_start (float) – starting value of epsilon for the epsilon-greedy exploration

  • eps_end (float) – final value of epsilon for the epsilon-greedy exploration

  • eps_last_frame (int) – the final frame for the decrease of epsilon. At this frame epsilon = eps_end

  • sync_rate (int) – the number of iterations between syncing up the target network with the train network

  • gamma (float) – discount factor

  • lr – learning rate

  • batch_size (int) – size of minibatch pulled from the DataLoader

  • replay_size (int) – total capacity of the replay buffer

  • warm_start_size (int) – how many random steps through the environment to be carried out at the start of training to fill the buffer with a starting point

  • sample_len – the number of samples to pull from the dataset iterator and feed to the DataLoader

Note

This example is based on:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition/blob/master/Chapter08/03_dqn_double.py

Note

Currently only supports CPU and single GPU training with distributed_backend=dp

training_step(batch, _)[source]

Carries out a single step through the environment to update the replay buffer. Then calculates loss based on the minibatch received

Parameters
Return type

OrderedDict

Returns

Training loss and log metrics


Dueling DQN

Dueling DQN model introduced in Dueling Network Architectures for Deep Reinforcement Learning. Paper authors: Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas

Original implementation by: Donal Byrne

The Q value that we are trying to approximate can be divided into two parts: the state value V(s) and the ‘advantage’ of actions in that state A(s, a). Instead of having one full network estimate the entire Q value, Dueling DQN uses two estimator heads in order to separate the estimation of the two parts.

The value is the same as in value iteration. It is the discounted expected reward achieved from state s. Think of the value as the ‘base reward’ from being in state s.

The advantage tells us how much ‘extra’ reward we get from taking action a while in state s. The advantage bridges the gap between Q(s, a) and V(s) as Q(s, a) = V(s) + A(s, a).

In the paper Dueling Network Architectures for Deep Reinforcement Learning (https://arxiv.org/abs/1511.06581), the network uses two heads: one outputs the state value and the other outputs the advantage. This leads to better training stability, faster convergence and overall better results. The value head outputs a single scalar (the state value), while the advantage head outputs a tensor equal to the size of the action space, containing an advantage value for each action in state s.

Changing the network architecture is not enough; we also need to ensure that the mean advantage is 0. This is done by subtracting the mean advantage from the Q value, which essentially pulls the mean advantage to 0.

Q(s, a) = V(s) + A(s, a) - \frac{1}{N} \sum_k A(s, k)
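
A minimal sketch of a dueling head that combines the two streams according to the equation above (the hidden layer size is an illustrative assumption, not the model's actual architecture):

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Separate value and advantage streams, combined with a mean-subtracted advantage."""

    def __init__(self, input_dim, n_actions, hidden=256):
        super().__init__()
        self.value_stream = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage_stream = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features):
        value = self.value_stream(features)          # (batch, 1)
        advantage = self.advantage_stream(features)  # (batch, n_actions)
        return value + advantage - advantage.mean(dim=1, keepdim=True)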

Dueling DQN Benefits

  • Ability to efficiently learn the state value function. In the dueling network, every Q update also updates the value stream, whereas in DQN only the value of the chosen action is updated. This provides a better approximation of the values.

  • The differences between total Q values for a given state are quite small in relation to the magnitude of Q. The difference in the Q values between the best action and the second best action can be very small, while the average state value can be much larger. The differences in scale can introduce noise, which may lead to the greedy policy switching the priority of these actions. The separate estimators for state value and advantage make the Dueling DQN robust to this type of scenario.

Dueling DQN Results

The results below show a noticeable improvement over the original DQN network.

Dueling DQN baseline: Pong

Similar to the results of the DQN baseline, the agent has a period where the number of steps per episode increases as it begins to hold its own against the heuristic opponent, but then the steps per episode quickly begin to drop as it gets better and starts to beat its opponent faster and faster. There is a noticeable point at step ~250k where the agent goes from losing to winning.

As you can see from the total rewards, the dueling network’s training progression is very stable and continues to trend upward until it finally plateaus.

Dueling DQN Result

DQN vs Dueling DQN: Pong

In comparison to the base DQN, we see that the Dueling network’s training is much more stable and it is able to reach a score in the high teens faster than the DQN agent. Even though the Dueling network is more stable and outperforms DQN early in training, by the end of training the two networks end up at the same point.

This could very well be due to the simplicity of the Pong environment.

  • Orange: DQN

  • Red: Dueling DQN

Dueling DQN Comparison Result

Example:

from pl_bolts.models.rl import DuelingDQN
dueling_dqn = DuelingDQN("PongNoFrameskip-v4")
trainer = Trainer()
trainer.fit(dueling_dqn)
class pl_bolts.models.rl.dueling_dqn_model.DuelingDQN(env, gpus=0, eps_start=1.0, eps_end=0.02, eps_last_frame=150000, sync_rate=1000, gamma=0.99, learning_rate=0.0001, batch_size=32, replay_size=100000, warm_start_size=10000, num_samples=500, **kwargs)[source]

Bases: pl_bolts.models.rl.dqn_model.DQN

PyTorch Lightning implementation of Dueling DQN

Paper authors: Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas

Model implemented by:

  • Donal Byrne <https://github.com/djbyrne>

Example

>>> from pl_bolts.models.rl.dueling_dqn_model import DuelingDQN
...
>>> model = DuelingDQN("PongNoFrameskip-v4")

Train:

trainer = Trainer()
trainer.fit(model)
Parameters
  • env (str) – gym environment tag

  • gpus (int) – number of gpus being used

  • eps_start (float) – starting value of epsilon for the epsilon-greedy exploration

  • eps_end (float) – final value of epsilon for the epsilon-greedy exploration

  • eps_last_frame (int) – the final frame for the decrease of epsilon. At this frame epsilon = eps_end

  • sync_rate (int) – the number of iterations between syncing up the target network with the train network

  • gamma (float) – discount factor

  • lr – learning rate

  • batch_size (int) – size of minibatch pulled from the DataLoader

  • replay_size (int) – total capacity of the replay buffer

  • warm_start_size (int) – how many random steps through the environment to be carried out at the start of training to fill the buffer with a starting point

  • sample_len – the number of samples to pull from the dataset iterator and feed to the DataLoader

Note

Currently only supports CPU and single GPU training with distributed_backend=dp

build_networks()[source]

Initializes the Dueling DQN train and target networks

Return type

None


Noisy DQN

Noisy DQN model introduced in Noisy Networks for Exploration. Paper authors: Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg

Original implementation by: Donal Byrne

Up until now the DQN agent has used a separate exploration policy, generally epsilon-greedy, where start and end values are set for its exploration. Noisy Networks For Exploration (https://arxiv.org/abs/1706.10295) introduces a new exploration strategy by adding noise parameters to the weights of the fully connected layers, which get updated during backpropagation of the network. The noise parameters drive the exploration of the network, instead of simply taking random actions more frequently at the start of training and less frequently towards the end.

During the optimization step a new set of noisy parameters is sampled. During training the agent acts according to the fixed set of parameters. At the next optimization step the parameters are updated with a new sample. This ensures the agent always acts based on parameters drawn from the current noise distribution.

The authors propose two methods of injecting noise to the network.

  1. Independent Gaussian Noise: This injects noise per weight. For each weight a random value is taken from the distribution. Noise parameters are stored inside the layer and are updated during backpropagation. The output of the layer is calculated as normal.

  2. Factorized Gaussian Noise: This injects noise per input/output. In order to minimize the number of random values, this method stores two random vectors, one with the size of the input and the other with the size of the output. Using these two vectors, a random matrix is generated for the layer by calculating the outer product of the vectors.

Noisy DQN Benefits

  • Improved exploration function. Instead of just performing completely random actions, we add a decreasing amount of noise and uncertainty to our policy, allowing the agent to explore while still utilising its policy.

  • The fact that this method is automatically tuned means that we do not have to tune hyperparameters for epsilon-greedy!

Note

For now I have just implemented the Independent Gaussian, as it has been reported there isn’t much difference in results for these benchmark environments.

In order to update the basic DQN to a Noisy DQN, the fully connected layers of the network are replaced with noisy linear layers; a sketch of such a layer is shown below.
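
A minimal sketch of an independent-Gaussian noisy linear layer (the sigma initialisation value is an assumption; the layer used by the model may differ in detail):

import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Linear layer with learnable, independent Gaussian noise on weights and biases."""

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__(in_features, out_features)
        # Learnable noise scales, updated by backpropagation like any other parameter
        self.sigma_weight = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.sigma_bias = nn.Parameter(torch.full((out_features,), sigma_init))
        # The noise samples themselves are not trained
        self.register_buffer("eps_weight", torch.zeros(out_features, in_features))
        self.register_buffer("eps_bias", torch.zeros(out_features))

    def forward(self, x):
        # Draw fresh noise on every forward pass
        self.eps_weight.normal_()
        self.eps_bias.normal_()
        weight = self.weight + self.sigma_weight * self.eps_weight
        bias = self.bias + self.sigma_bias * self.eps_bias
        return nn.functional.linear(x, weight, bias)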

Noisy DQN Results

The results below show improved stability and faster performance growth.

Noisy DQN baseline: Pong

Similar to the other improvements, the average score of the agent reaches positive numbers around the 250k mark and steadily increases till convergence.

Noisy DQN Result

DQN vs Noisy DQN: Pong

In comparison to the base DQN, the Noisy DQN is more stable and is able to converge on an optimal policy much faster than the original. It seems that the replacement of the epsilon-greedy strategy with network noise provides a better form of exploration.

  • Orange: DQN

  • Red: Noisy DQN

Noisy DQN Comparison Result

Example:

from pl_bolts.models.rl import NoisyDQN
noisy_dqn = NoisyDQN("PongNoFrameskip-v4")
trainer = Trainer()
trainer.fit(noisy_dqn)
class pl_bolts.models.rl.noisy_dqn_model.NoisyDQN(env, gpus=0, eps_start=1.0, eps_end=0.02, eps_last_frame=150000, sync_rate=1000, gamma=0.99, learning_rate=0.0001, batch_size=32, replay_size=100000, warm_start_size=10000, num_samples=500, **kwargs)[source]

Bases: pl_bolts.models.rl.dqn_model.DQN

PyTorch Lightning implementation of Noisy DQN

Paper authors: Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg

Model implemented by:

  • Donal Byrne <https://github.com/djbyrne>

Example

>>> from pl_bolts.models.rl.noisy_dqn_model import NoisyDQN
...
>>> model = NoisyDQN("PongNoFrameskip-v4")

Train:

trainer = Trainer()
trainer.fit(model)
Parameters
  • env (str) – gym environment tag

  • gpus (int) – number of gpus being used

  • eps_start (float) – starting value of epsilon for the epsilon-greedy exploration

  • eps_end (float) – final value of epsilon for the epsilon-greedy exploration

  • eps_last_frame (int) – the final frame for the decrease of epsilon. At this frame epsilon = eps_end

  • sync_rate (int) – the number of iterations between syncing up the target network with the train network

  • gamma (float) – discount factor

  • lr – learning rate

  • batch_size (int) – size of minibatch pulled from the DataLoader

  • replay_size (int) – total capacity of the replay buffer

  • warm_start_size (int) – how many random steps through the environment to be carried out at the start of training to fill the buffer with a starting point

  • sample_len – the number of samples to pull from the dataset iterator and feed to the DataLoader

Note

Currently only supports CPU and single GPU training with distributed_backend=dp

build_networks()[source]

Initializes the Noisy DQN train and target networks

Return type

None

on_train_start()[source]

Set the agent's epsilon to 0 as the exploration comes from the network

Return type

None

training_step(batch, _)[source]

Carries out a single step through the environment to update the replay buffer. Then calculates loss based on the minibatch received

Parameters
Return type

OrderedDict

Returns

Training loss and log metrics


N-Step DQN

N-Step DQN model introduced in Learning to Predict by the Methods of Temporal Differences. Paper authors: Richard S. Sutton

Original implementation by: Donal Byrne

N-Step DQN was introduced in Learning to Predict by the Methods of Temporal Differences. This method improves upon the original DQN by updating our Q values with the expected reward from multiple steps in the future, as opposed to the expected reward from the immediate next state. Using a single step, the Q value for a state-action pair looks like this

Q(s_t,a_t)=r_t+{\gamma}\max_aQ(s_{t+1},a_{t+1})

but because the Q function is recursive we can continue to roll this out into multiple steps, looking at the expected return for each step into the future.

Q(s_t,a_t)=r_t+{\gamma}r_{t+1}+{\gamma}^2\max_{a'}Q(s_{t+2},a')

The above example shows a 2-Step look ahead, but this could be rolled out to the end of the episode, which is just Monte Carlo learning. Although we could just do a monte carlo update and look forward to the end of the episode, it wouldn’t be a good idea. Every time we take another step into the future, we are basing our approximation off our current policy. For a large portion of training, our policy is going to be less than optimal. For example, at the start of training, our policy will be in a state of high exploration, and will be little better than random.

Note

For each rollout step you must scale the discount factor accordingly by the number of steps. As you can see from the equation above, the second gamma value is to the power of 2. If we rolled this out one step further, we would use gamma to the power of 3, and so on.

So if we are approximating future rewards off a bad policy, chances are those approximations are going to be pretty bad, and the further we unroll our update equation, the worse it will get. The fact that we are using an off-policy method like DQN with a large replay buffer will make this even worse, as there is a high chance that we will be training on experiences collected with an old policy that was worse than our current policy.

So we need to strike a balance between looking far enough ahead to improve the convergence of our agent, but not so far that our updates become unstable. In general, small values of 2-4 work best.
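
A minimal sketch of collapsing an n-step rollout into a single transition with the correctly scaled discount (the reward/done field names on the hypothetical experience tuples are assumptions):

def n_step_return(experiences, gamma=0.99):
    """Accumulate r_t + gamma*r_{t+1} + ... over the rollout.

    Returns the accumulated reward and the discount (gamma ** k) that should be
    applied to the bootstrapped Q value of the final state.
    """
    reward, discount = 0.0, 1.0
    for exp in experiences:
        reward += discount * exp.reward
        discount *= gamma
        if exp.done:  # the episode ended inside the rollout: no bootstrapping needed
            break
    return reward, discount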

N-Step Benefits

  • Multi-Step learning is capable of learning faster than typical 1 step learning methods.

  • Note that this method introduces a new hyperparameter n. That said, n=4 is generally a good starting point and provides good results across the board.

N-Step Results

As expected, the N-Step DQN converges much faster than the standard DQN, however it also adds more instability to the loss of the agent. This can be seen in the following experiments.

N-Step DQN: Pong

The N-Step DQN shows the greatest increase in performance with respect to the other DQN variations. After less than 150k steps the agent begins to consistently win games and achieves the top score after ~170K steps. This is reflected in the sharp peak of the total episode steps and of course, the total episode rewards.

N-Step DQN Result

DQN vs N-Step DQN: Pong

This improvement is shown in stark contrast to the base DQN, which only begins to win games after 250k steps and requires over twice as many steps (450k) as the N-Step agent to achieve the high score of 21. One important thing to notice is the large increase in the loss of the N-Step agent. This is expected, as the agent is building its expected reward off approximations of the future states. The larger the size of N, the greater the instability. Previous literature, listed below, shows the best results for the Pong environment with an N step between 3-5. For these experiments I opted for an N step of 4.

N-Step DQN Comparison Results

Example:

from pl_bolts.models.rl import NStepDQN
n_step_dqn = NStepDQN("PongNoFrameskip-v4")
trainer = Trainer()
trainer.fit(n_step_dqn)
class pl_bolts.models.rl.n_step_dqn_model.NStepDQN(env, gpus=0, eps_start=1.0, eps_end=0.02, eps_last_frame=150000, sync_rate=1000, gamma=0.99, learning_rate=0.0001, batch_size=32, replay_size=100000, warm_start_size=10000, num_samples=500, n_steps=4, **kwargs)[source]

Bases: pl_bolts.models.rl.dqn_model.DQN

NStep DQN Model

PyTorch Lightning implementation of N-Step DQN

Paper authors: Richard Sutton

Model implemented by:

  • Donal Byrne <https://github.com/djbyrne>

Example

>>> from pl_bolts.models.rl.n_step_dqn_model import NStepDQN
...
>>> model = NStepDQN("PongNoFrameskip-v4")

Train:

trainer = Trainer()
trainer.fit(model)
Parameters
  • env (str) – gym environment tag

  • gpus (int) – number of gpus being used

  • eps_start (float) – starting value of epsilon for the epsilon-greedy exploration

  • eps_end (float) – final value of epsilon for the epsilon-greedy exploration

  • eps_last_frame (int) – the final frame for the decrease of epsilon. At this frame epsilon = eps_end

  • sync_rate (int) – the number of iterations between syncing up the target network with the train network

  • gamma (float) – discount factor

  • learning_rate (float) – learning rate

  • batch_size (int) – size of minibatch pulled from the DataLoader

  • replay_size (int) – total capacity of the replay buffer

  • warm_start_size (int) – how many random steps through the environment to be carried out at the start of training to fill the buffer with a starting point

  • num_samples (int) – the number of samples to pull from the dataset iterator and feed to the DataLoader

  • n_steps – number of steps to approximate and use in the Bellman update

Note

Currently only supports CPU and single GPU training with distributed_backend=dp


Prioritized Experience Replay DQN

PER DQN model introduced in Prioritized Experience Replay. Paper authors: Tom Schaul, John Quan, Ioannis Antonoglou, David Silver

Original implementation by: Donal Byrne

The standard DQN uses a buffer to break up the correlation between experiences, drawing uniformly random samples for each batch. Instead of just randomly sampling from the buffer, prioritized experience replay (PER) prioritizes these samples based on training loss. This concept was introduced in the paper Prioritized Experience Replay.

Essentially we want to train more on the samples that surprise the agent.

The priority of each sample is defined below

P(i) = p^\alpha_i / \sum_k p_k^\alpha

where p_i is the priority of the i-th sample in the buffer and 𝛼 is the number that shows how much emphasis we give to the priority. If 𝛼 = 0, our sampling will become uniform as in the classic DQN method. Larger values for 𝛼 put more stress on samples with higher priority.

It's important that new samples are set to the highest priority so that they are sampled soon. This, however, introduces bias towards new samples in our dataset. In order to compensate for this bias, the value of the weight is defined as

w_i = (N \cdot P(i))^{-\beta}

where beta is a hyperparameter between 0 and 1. When beta is 1 the bias is fully compensated. However, the authors noted that in practice it is better to start beta with a small value near 0 and slowly increase it to 1.
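
A minimal sketch of turning stored priorities into sampling probabilities and importance-sampling weights according to the two equations above (the alpha and beta defaults are illustrative assumptions):

import numpy as np

def per_probabilities_and_weights(priorities, batch_indices, alpha=0.6, beta=0.4):
    """P(i) = p_i^alpha / sum_k p_k^alpha and w_i = (N * P(i))^(-beta)."""
    priorities = np.asarray(priorities, dtype=np.float64)
    probs = priorities ** alpha
    probs /= probs.sum()

    weights = (len(priorities) * probs[batch_indices]) ** (-beta)
    weights /= weights.max()  # normalise so the largest weight is 1, for stability
    return probs, weights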

PER Benefits

  • The benefits of this technique are that the agent sees more samples that it struggled with and gets more chances to improve upon them.

Memory Buffer

The first step is to replace the standard experience replay buffer with the prioritized experience replay buffer. This is pretty large (100+ lines), so I won’t go through it here. There are two buffers implemented. The first is a naive list-based buffer found in memory.PERBuffer and the second is a more efficient buffer using a Sum Tree data structure.

The list-based version is simpler, but has a sampling complexity of O(N). The Sum Tree, in comparison, has a complexity of O(log N) for both sampling and updating priorities; a sketch of the idea is shown below.
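
A minimal sketch of the Sum Tree idea, assuming a power-of-two capacity: the leaves hold the priorities, each internal node holds the sum of its children, updates propagate up the tree and sampling walks down it (the buffer shipped with the model may differ in detail):

import numpy as np

class SumTree:
    """Binary tree where every parent stores the sum of its children's priorities."""

    def __init__(self, capacity):
        self.capacity = capacity            # number of leaves, assumed to be a power of two
        self.tree = np.zeros(2 * capacity)  # index 1 is the root, leaves start at `capacity`

    def update(self, index, priority):
        """Set the priority of leaf `index` and propagate the change upwards: O(log N)."""
        pos = index + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def sample(self, value):
        """Find the leaf whose cumulative priority covers `value` in [0, total): O(log N)."""
        pos = 1
        while pos < self.capacity:          # descend until we reach a leaf
            left = 2 * pos
            if value <= self.tree[left]:
                pos = left
            else:
                value -= self.tree[left]
                pos = left + 1
        return pos - self.capacity          # leaf index back in [0, capacity)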

Update loss function

The next thing we do is use the sample weights that we get from PER at the end of the loss function. This applies the weight of each sample to its batch loss. We then return the mean loss and the weighted loss for each datum, with the addition of a small epsilon value; a sketch of this is shown below.
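
A minimal sketch of such a weighted loss, assuming the batch weights come from the buffer and q_values/targets are the per-sample Q estimates and Bellman targets (names are illustrative):

import torch

def per_weighted_mse(q_values, targets, weights, eps=1e-5):
    """Return the weighted mean loss and the per-sample values used as new priorities."""
    weights = torch.as_tensor(weights, dtype=q_values.dtype, device=q_values.device)
    per_sample_loss = weights * (q_values - targets) ** 2
    new_priorities = (per_sample_loss + eps).detach()  # fed back into the buffer
    return per_sample_loss.mean(), new_priorities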

PER Results

The results below show improved stability and faster performance growth.

PER DQN: Pong

Similar to the other improvements, we see that PER improves the stability of the agent's training and converges on an optimal policy faster.

PER DQN Results

DQN vs PER DQN: Pong

In comparison to the base DQN, the PER DQN does show improved stability and performance. As expected, the loss of the PER DQN is significantly lower. This is the main objective of PER: focusing training on experiences with high loss.

It is important to note that loss is not the only metric we should be looking at. Although the agent may have very low loss during training, it may still perform poorly due to lack of exploration.

PER DQN Results
  • Orange: DQN

  • Pink: PER DQN

Example:

from pl_bolts.models.rl import PERDQN
per_dqn = PERDQN("PongNoFrameskip-v4")
trainer = Trainer()
trainer.fit(per_dqn)
class pl_bolts.models.rl.per_dqn_model.PERDQN(env, gpus=0, eps_start=1.0, eps_end=0.02, eps_last_frame=150000, sync_rate=1000, gamma=0.99, learning_rate=0.0001, batch_size=32, replay_size=100000, warm_start_size=10000, num_samples=500, **kwargs)[source]

Bases: pl_bolts.models.rl.dqn_model.DQN

PyTorch Lightning implementation of DQN With Prioritized Experience Replay

Paper authors: Tom Schaul, John Quan, Ioannis Antonoglou, David Silver

Model implemented by:

  • Donal Byrne <https://github.com/djbyrne>

Example

>>> from pl_bolts.models.rl.per_dqn_model import PERDQN
...
>>> model = PERDQN("PongNoFrameskip-v4")

Train:

trainer = Trainer()
trainer.fit(model)

Parameters
  • env (str) – gym environment tag

  • gpus (int) – number of gpus being used

  • eps_start (float) – starting value of epsilon for the epsilon-greedy exploration

  • eps_end (float) – final value of epsilon for the epsilon-greedy exploration

  • eps_last_frame (int) – the final frame for the decrease of epsilon. At this frame epsilon = eps_end

  • sync_rate (int) – the number of iterations between syncing up the target network with the train network

  • gamma (float) – discount factor

  • learning_rate (float) – learning rate

  • batch_size (int) – size of minibatch pulled from the DataLoader

  • replay_size (int) – total capacity of the replay buffer

  • warm_start_size (int) – how many random steps through the environment to be carried out at the start of training to fill the buffer with a starting point

  • num_samples (int) – the number of samples to pull from the dataset iterator and feed to the DataLoader

Note

This example is based on:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition/blob/master/Chapter08/05_dqn_prio_replay.py

Note

Currently only supports CPU and single GPU training with distributed_backend=dp

prepare_data()[source]

Initialize the Replay Buffer dataset used for retrieving experiences

Return type

None

training_step(batch, _)[source]

Carries out a single step through the environment to update the replay buffer. Then calculates loss based on the minibatch received

Parameters
  • batch – current mini batch of replay data

  • _ – batch number, not used

Return type

OrderedDict

Returns

Training loss and log metrics


Policy Gradient Models

The following models are based on policy gradients.


REINFORCE

REINFORCE model introduced in Policy Gradient Methods For Reinforcement Learning With Function Approximation. Paper authors: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour

Original implementation by: Donal Byrne

Example:

from pl_bolts.models.rl import Reinforce
reinforce = Reinforce("CartPole-v0")
trainer = Trainer()
trainer.fit(reinforce)
class pl_bolts.models.rl.reinforce_model.Reinforce(env, gamma=0.99, lr=0.0001, batch_size=32, batch_episodes=4, **kwargs)[source]

Bases: pytorch_lightning.LightningModule

Basic REINFORCE Policy Model

PyTorch Lightning implementation of REINFORCE

Paper authors: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour

Model implemented by:

  • Donal Byrne <https://github.com/djbyrne>

Example

>>> from pl_bolts.models.rl.reinforce_model import Reinforce
...
>>> model = Reinforce("PongNoFrameskip-v4")

Train:

trainer = Trainer()
trainer.fit(model)
Parameters
  • env (str) – gym environment tag

  • gamma (float) – discount factor

  • lr (float) – learning rate

  • batch_size (int) – size of minibatch pulled from the DataLoader

  • batch_episodes (int) – how many episodes to rollout for each batch of training

Note

This example is based on:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition/blob/master/Chapter11/02_cartpole_reinforce.py

Note

Currently only supports CPU and single GPU training with distributed_backend=dp

_dataloader()[source]

Initialize the Replay Buffer dataset used for retrieving experiences

Return type

DataLoader

static add_model_specific_args(arg_parser)[source]

Adds arguments for DQN model

Note: these params are fine tuned for Pong env

Parameters

arg_parser – the current argument parser to add to

Return type

ArgumentParser

Returns

arg_parser with model specific args added

build_networks()[source]

Initializes the DQN train and target networks

Return type

None

calc_qvals(rewards)[source]

Takes in the rewards for each batched episode and returns list of qvals for each batched episode

Parameters

rewards (List[List]) – list of rewards for each episode in the batch

Return type

List[List]

Returns

List of qvals for each episode

configure_optimizers()[source]

Initialize Adam optimizer

Return type

List[Optimizer]

static flatten_batch(batch_actions, batch_qvals, batch_rewards, batch_states)[source]

Takes in the outputs of the processed batch and flattens the several episodes into a single tensor for each batched output

Parameters
Return type

Tuple[Tensor, Tensor, Tensor, Tensor]

Returns

The input batched results flattened into a single tensor

forward(x)[source]

Passes in a state x through the network and gets the q_values of each action as an output

Parameters

x (Tensor) – environment state

Return type

Tensor

Returns

q values

get_device(batch)[source]

Retrieve device currently being used by minibatch

Return type

str

loss(batch_qvals, batch_states, batch_actions)[source]

Calculates the MSE loss using a batch of states, actions and Q values from several episodes. These have all been flattened into a single tensor.

Parameters
  • batch_qvals (List[Tensor]) – current mini batch of q values

  • batch_actions (List[Tensor]) – current batch of actions

  • batch_states (List[Tensor]) – current batch of states

Return type

Tensor

Returns

loss

process_batch(batch)[source]

Takes in a batch of episodes and retrieves the q vals, the states and the actions for the batch

Parameters

batch (List[List[Experience]]) – list of episodes, each containing a list of Experiences

Return type

Tuple[List[Tensor], List[Tensor], List[Tensor]]

Returns

q_vals, states and actions used for calculating the loss

train_dataloader()[source]

Get train loader

Return type

DataLoader

training_step(batch, _)[source]

Carries out a single step through the environment to update the replay buffer. Then calculates loss based on the minibatch received

Parameters
Return type

OrderedDict

Returns

Training loss and log metrics


Vanilla Policy Gradient

Vanilla Policy Gradient model introduced in Policy Gradient Methods For Reinforcement Learning With Function Approximation. Paper authors: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour

Original implementation by: Donal Byrne

Example:

from pl_bolts.models.rl import PolicyGradient
vpg = PolicyGradient("CartPole-v0")
trainer = Trainer()
trainer.fit(vpg)
class pl_bolts.models.rl.vanilla_policy_gradient_model.PolicyGradient(env, gamma=0.99, lr=0.0001, batch_size=32, entropy_beta=0.01, batch_episodes=4, *args, **kwargs)[source]

Bases: pytorch_lightning.LightningModule

Vanilla Policy Gradient Model

PyTorch Lightning implementation of Vanilla Policy Gradient

Paper authors: Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour

Model implemented by:

  • Donal Byrne <https://github.com/djbyrne>

Example

>>> from pl_bolts.models.rl.vanilla_policy_gradient_model import PolicyGradient
...
>>> model = PolicyGradient("PongNoFrameskip-v4")

Train:

trainer = Trainer()
trainer.fit(model)
Parameters
  • env (str) – gym environment tag

  • gamma (float) – discount factor

  • lr (float) – learning rate

  • batch_size (int) – size of minibatch pulled from the DataLoader

  • batch_episodes (int) – how many episodes to rollout for each batch of training

  • entropy_beta (float) – dictates the level of entropy per batch

Note

This example is based on:

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition/blob/master/Chapter11/04_cartpole_pg.py

Note

Currently only supports CPU and single GPU training with distributed_backend=dp

_dataloader()[source]

Initialize the Replay Buffer dataset used for retrieving experiences

Return type

DataLoader

static add_model_specific_args(arg_parser)[source]

Adds arguments for DQN model

Note: these params are fine tuned for Pong env

Parameters

arg_parser – the current argument parser to add to

Return type

ArgumentParser

build_networks()[source]

Initializes the DQN train and target networks

Return type

None

calc_entropy_loss(log_prob, logits)[source]

Calculates the entropy to be added to the loss function

Parameters
  • log_prob (Tensor) – log probabilities for each action

  • logits (Tensor) – the raw outputs of the network

Return type

Tensor

Returns

entropy penalty for each state

static calc_policy_loss(batch_actions, batch_qvals, batch_states, logits)[source]

Calculate the policy loss given the batch outputs and logits

Parameters
  • batch_actions (Tensor) – actions from batched episodes

  • batch_qvals (Tensor) – Q values from batched episodes

  • batch_states (Tensor) – states from batched episodes

  • logits (Tensor) – raw output of the network given the batch_states

Return type

Tuple[List, Tensor]

Returns

policy loss

calc_qvals(rewards)[source]

Takes in the rewards for each batched episode and returns list of qvals for each batched episode

Parameters

rewards (List[Tensor]) – list of rewards for each episode in the batch

Return type

List[Tensor]

Returns

List of qvals for each episode

configure_optimizers()[source]

Initialize Adam optimizer

Return type

List[Optimizer]

static flatten_batch(batch_actions, batch_qvals, batch_rewards, batch_states)[source]

Takes in the outputs of the processed batch and flattens the several episodes into a single tensor for each batched output

Parameters
Return type

Tuple[Tensor, Tensor, Tensor, Tensor]

Returns

The input batched results flattened into a single tensor

forward(x)[source]

Passes in a state x through the network and gets the q_values of each action as an output

Parameters

x (Tensor) – environment state

Return type

Tensor

Returns

q values

get_device(batch)[source]

Retrieve device currently being used by minibatch

Return type

str

loss(batch_qvals, batch_states, batch_actions)[source]

Calculates the MSE loss using a batch of states, actions and Q values from several episodes. These have all been flattened into a single tensor.

Parameters
  • batch_qvals (List[Tensor]) – current mini batch of q values

  • batch_actions (List[Tensor]) – current batch of actions

  • batch_states (List[Tensor]) – current batch of states

Return type

Tensor

Returns

loss

process_batch(batch)[source]

Takes in a batch of episodes and retrieves the q vals, the states and the actions for the batch

Parameters

batch (List[List[Experience]]) – list of episodes, each containing a list of Experiences

Return type

Tuple[List[Tensor], List[Tensor], List[Tensor]]

Returns

q_vals, states and actions used for calculating the loss

train_dataloader()[source]

Get train loader

Return type

DataLoader

training_step(batch, _)[source]

Carries out a single step through the environment to update the replay buffer. Then calculates loss based on the minibatch received

Parameters
Return type

OrderedDict

Returns

Training loss and log metrics
