Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
PyTorch 1.x Reinforcement Learning Cookbook

You're reading from   PyTorch 1.x Reinforcement Learning Cookbook Over 60 recipes to design, develop, and deploy self-learning AI models using Python

Arrow left icon
Product type Paperback
Published in Oct 2019
Publisher Packt
ISBN-13 9781838551964
Length 340 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Yuxi (Hayden) Liu Yuxi (Hayden) Liu
Author Profile Icon Yuxi (Hayden) Liu
Yuxi (Hayden) Liu
Arrow right icon
View More author details
Toc

Table of Contents (11) Chapters Close

Preface 1. Getting Started with Reinforcement Learning and PyTorch 2. Markov Decision Processes and Dynamic Programming FREE CHAPTER 3. Monte Carlo Methods for Making Numerical Estimations 4. Temporal Difference and Q-Learning 5. Solving Multi-armed Bandit Problems 6. Scaling Up Learning with Function Approximation 7. Deep Q-Networks in Action 8. Implementing Policy Gradients and Policy Optimization 9. Capstone Project – Playing Flappy Bird with DQN 10. Other Books You May Enjoy

Developing the hill-climbing algorithm

As we can see in the random search policy, each episode is independent. In fact, all episodes in random search can be run in parallel, and the weight that achieves the best performance will eventually be selected. We've also verified this with the plot of reward versus episode, where there is no upward trend. In this recipe, we will develop a different algorithm, a hill-climbing algorithm, to transfer the knowledge acquired in one episode to the next episode.

In the hill-climbing algorithm, we also start with a randomly chosen weight. But here, for every episode, we add some noise to the weight. If the total reward improves, we update the weight with the new one; otherwise, we keep the old weight. In this approach, the weight is gradually improved as we progress through the episodes, instead of jumping around in each episode.

How to do it...

Let's go ahead and implement the hill-climbing algorithm with PyTorch:

  1. As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:
>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n
  1. We will reuse the run_episode function we defined in the previous recipe, so we will not repeat it here. Again, given the input weight, it simulates an episode and returns the total reward.
  2. Let's make it 1,000 episodes for now:

>>> n_episode = 1000
  1. We need to keep track of the best total reward on the fly, as well as the corresponding weight. So, let's specify their starting values:
>>> best_total_reward = 0
>>> best_weight = torch.rand(n_state, n_action)

We will also record the total reward for every episode:

>>> total_rewards = []
  1. As we mentioned, we will add some noise to the weight for each episode. In fact, we will apply a scale to the noise so that the noise won't overwhelm the weight. Here, we will choose 0.01 as the noise scale:
>>> noise_scale = 0.01
  1. Now, we can run the n_episode function. After we randomly pick an initial weight, for each episode, we do the following:
  • Add random noise to the weight
  • Let the agent take actions according to the linear mapping
  • An episode terminates and returns the total reward
  • If the current reward is greater than the best one obtained so far, update the best reward and the weight
  • Otherwise, the best reward and the weight remain unchanged
  • Also, keep a record of the total reward

Put this into code as follows:

 >>> for episode in range(n_episode):
... weight = best_weight +
noise_scale * torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... if total_reward >= best_total_reward:
... best_total_reward = total_reward
... best_weight = weight
... total_rewards.append(total_reward)
... print('Episode {}: {}'.format(episode + 1, total_reward))
...
Episode 1: 56.0
Episode 2: 52.0
Episode 3: 85.0
Episode 4: 106.0
Episode 5: 41.0
……
……
Episode 996: 39.0
Episode 997: 51.0
Episode 998: 49.0
Episode 999: 54.0
Episode 1000: 41.0

We also calculate the average total reward achieved by the hill-climbing version of linear mapping:

 >>> print('Average total reward over {} episode: {}'.format(
n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 50.024
  1. To assess the training using the hill-climbing algorithm, we repeat the training process multiple times (by running the code from Step 4 to Step 6 multiple times). We observe that the average total reward fluctuates a lot. The following are the results we got when running it 10 times:
Average total reward over 1000 episode: 9.261   
Average total reward over 1000 episode: 88.565
Average total reward over 1000 episode: 51.796
Average total reward over 1000 episode: 9.41
Average total reward over 1000 episode: 109.758
Average total reward over 1000 episode: 55.787
Average total reward over 1000 episode: 189.251
Average total reward over 1000 episode: 177.624
Average total reward over 1000 episode: 9.146
Average total reward over 1000 episode: 102.311

What could cause such variance? It turns out that if the initial weight is bad, adding noise at a small scale will have little effect on improving the performance. This will cause poor convergence. On the other hand, if the initial weight is good, adding noise at a big scale might move the weight away from the optimal weight and jeopardize the performance. How can we make the training of the hill-climbing model more stable and reliable? We can actually make the noise scale adaptive to the performance, just like the adaptive learning rate in gradient descent. Let's see Step 8 for more details.

  1. To make the noise adaptive, we do the following:
  • Specify a starting noise scale.
  • If the performance in an episode improves, decrease the noise scale. In our case, we take half of the scale, but set 0.0001 as the lower bound.
  • If the performance in an episode drops, increase the noise scale. In our case, we double the scale, but set 2 as the upper bound.

Put this into code:

 >>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
... weight = best_weight +
noise_scale * torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... if total_reward >= best_total_reward:
... best_total_reward = total_reward
... best_weight = weight
... noise_scale = max(noise_scale / 2, 1e-4)
... else:
... noise_scale = min(noise_scale * 2, 2)
... print('Episode {}: {}'.format(episode + 1, total_reward))
... total_rewards.append(total_reward)
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 9.0
Episode 4: 10.0
Episode 5: 10.0
……
……
Episode 996: 200.0
Episode 997: 200.0
Episode 998: 200.0
Episode 999: 200.0
Episode 1000: 200.0

The reward is increasing as the episodes progress. It reaches the maximum of 200 within the first 100 episodes and stays there. The average total reward also looks promising:

 >>> print('Average total reward over {} episode: {}'.format(
n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 186.11

We also plot the total reward for every episode as follows:

>>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()

In the resulting plot, we can see a clear upward trend before it plateaus at the maximum value:

Feel free to run the new training process a few times. The results are very stable compared to learning with a constant noise scale.

  1. Now, let's see how the learned policy performs on 100 new episodes:
 >>> n_episode_eval = 100
>>> total_rewards_eval = []
>>> for episode in range(n_episode_eval):
... total_reward = run_episode(env, best_weight)
... print('Episode {}: {}'.format(episode+1, total_reward))
... total_rewards_eval.append(total_reward)
...
Episode 1: 200.0
Episode 2: 200.0
Episode 3: 200.0
Episode 4: 200.0
Episode 5: 200.0
……
……
Episode 96: 200.0
Episode 97: 200.0
Episode 98: 200.0
Episode 99: 200.0
Episode 100: 200.0

Let's see the average performance:

>>> print('Average total reward over {} episode: {}'.format(n_episode, sum(total_rewards) / n_episode))
Average total reward over 1000 episode: 199.94

The average reward for the testing episodes is close to the maximum of 200 that we obtained with the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.

How it works...

We are able to achieve much better performance with the hill-climbing algorithm than with random search by simply adding adaptive noise to each episode. We can think of it as a special kind of gradient descent without a target variable. The additional noise is the gradient, albeit in a random way. The noise scale is the learning rate, and it is adaptive to the reward from the previous episode. The target variable in hill climbing becomes achieving the highest reward. In summary, rather than isolating each episode, the agent in the hill-climbing algorithm makes use of the knowledge learned from each episode and performs a more reliable action in the next episode. As the name hill climbing implies, the reward moves upwards through the episodes as the weight gradually moves towards the optimum value.

There's more...

We can observe that the reward can reach the maximum value within the first 100 episodes. Can we just stop training when the reward reaches 200, as we did with the random search policy? That might not be a good idea. Remember that the agent is making continuous improvements in hill climbing. Even if it finds a weight that generates the maximum reward, it can still search around this weight for the optimal point. Here, we define the optimal policy as the one that can solve the CartPole problem. According to the following wiki page, https://github.com/openai/gym/wiki/CartPole-v0, "solved" means the average reward over 100 consecutive episodes is no less than 195.

We refine the stopping criterion accordingly:

 >>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
... weight = best_weight + noise_scale * torch.rand(n_state, n_action)
... total_reward = run_episode(env, weight)
... if total_reward >= best_total_reward:
... best_total_reward = total_reward
... best_weight = weight
... noise_scale = max(noise_scale / 2, 1e-4)
... else:
... noise_scale = min(noise_scale * 2, 2)
... print('Episode {}: {}'.format(episode + 1, total_reward))
... total_rewards.append(total_reward)
... if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
... break
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 10.0
Episode 4: 10.0
Episode 5: 9.0
……
……
Episode 133: 200.0
Episode 134: 200.0
Episode 135: 200.0
Episode 136: 200.0
Episode 137: 200.0

At episode 137, the problem is considered solved.

See also

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image