Q-learning is one of the most popular and widely used foundational RL methods. The method was developed by Chris Watkins in 1989 as part of his thesis, Learning from Delayed Rewards. Q-learning, or rather Deep Q-learning, which we will cover in Chapter 6, Going Deep with DQN, became especially popular after DeepMind (Google) used it to play classic Atari games better than human players. What Watkins showed was how an update could be applied across state-action pairs using a learning rate and a discount factor, gamma.
This extended the value update equation into a Q, or quality, of state-action update equation, as shown in the following formula:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
In the previous equation, we have the following:
- $Q(s_t, a_t)$: The current state-action quality being updated
- $\alpha$: The learning rate
- $r_{t+1}$: The reward received for the next state
- $\gamma$: Gamma, the discount factor
- $\max_{a} Q(s_{t+1}, a)$: The maximum quality over the next state's actions, that is, the value of taking the best or greedy action
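To make the update concrete, the following is a minimal sketch of the tabular Q-learning update in Python. The table size, `alpha`, `gamma`, and the example transition are illustrative assumptions, not values from the text:

```python
import numpy as np

# Illustrative sizes for a small, discrete environment (assumed values)
n_states, n_actions = 16, 4
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor

# Q-table of state-action quality values, initialized to zero
Q = np.zeros((n_states, n_actions))

def q_update(state, action, reward, next_state):
    """Apply one Q-learning update for an observed transition."""
    # Greedy (max) estimate of the next state's value
    best_next = np.max(Q[next_state])
    # Temporal-difference target and error
    td_target = reward + gamma * best_next
    td_error = td_target - Q[state, action]
    # Move the current state-action quality toward the target
    Q[state, action] += alpha * td_error

# Example: after taking action 2 in state 3, receiving a reward
# of 1.0 and landing in state 7, apply one update
q_update(state=3, action=2, reward=1.0, next_state=7)
```

Repeating this update over many transitions, with sufficient exploration, moves the Q-table toward the optimal state-action values.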