Reinforcement Learning with TensorFlow

Deep Learning – Architectures and Frameworks

Artificial neural networks are computational systems that provide us with important tools to solve challenging machine learning tasks, ranging from image recognition to speech translation. Recent breakthroughs, such as Google DeepMind's AlphaGo defeating the best Go players or Carnegie Mellon University's Libratus defeating the world's best professional poker players, have demonstrated the advancement in the algorithms; these algorithms learn a narrow intelligence like a human would and achieve superhuman-level performance. In plain speech, artificial neural networks are a loose representation of the human brain that we can program in a computer; to be precise, it's an approach inspired by our knowledge of the functions of the human brain. A key concept of neural networks is to create a representation space of the input data and then solve the problem in that space; that is, warping the data from its current state in such a way that it can be represented in a different state where it can solve the concerned problem statement (say, a classification or regression). Deep learning means multiple hidden representations, that is, a neural network with many layers to create more effective representations of the data. Each layer refines the information received from the previous one.

Reinforcement learning, on the other hand, is another branch of machine learning: a technique for learning any kind of activity that follows a sequence of actions. A reinforcement learning agent gathers information from the environment and creates a representation of the states; it then performs an action that results in a new state and a reward (that is, quantifiable feedback from the environment telling us whether the action was good or bad). This cycle continues until the agent improves its performance beyond a certain threshold, that is, until it maximizes the expected value of the rewards. At each step, the action can be chosen randomly, can be fixed, or can be predicted by a neural network. Predicting actions with a deep neural network opens a new domain, called deep reinforcement learning. This forms the basis of AlphaGo, Libratus, and much other breakthrough research in the field of artificial intelligence.

We will cover the following topics in this chapter:

Deep learning
Reinforcement learning
Introduction to TensorFlow and OpenAI Gym
The influential researchers and projects in reinforcement learning

Deep learning

Deep learning refers to training large neural networks. Let's first discuss some basic use cases of neural networks and why deep learning is creating such a furore even though these neural networks have been here for decades.

The following are examples of supervised learning with neural networks:

Inputs(x)	Output(y)	Application domain	Suggested neural network approach
House features	Price of the house	Real estate	Standard neural network with rectified linear unit in the output layer
Ad and user info	Click on ad? Yes(1) or No(0)	Online advertising	Standard neural network with binary classification
Image object	Classifying from 100 different objects, that is, (1, 2, ..., 100)	Photo tagging	Convolutional neural network (since an image is spatial data)
Audio	Text transcript	Speech recognition	Recurrent neural network (since both input and output are sequential data)
English	Chinese	Machine translation	Recurrent neural network (since the input is sequential data)
Image, radar information	Position of other cars	Autonomous driving	Customized hybrid/complex neural network

We will go into the details of the previously-mentioned neural networks in the coming sections of this chapter, but first we must understand that different types of neural networks are used based on the objective of the problem statement.

Supervised learning is an approach in machine learning where an agent is trained using pairs of input features and their corresponding output/target values (also called labels).

Traditional machine learning algorithms worked very well for structured data, where most of the input features were very well defined. This is not the case with unstructured data, such as audio, image, and text, where the data is a signal, pixels, and letters, respectively. It's harder for computers to make sense of unstructured data than structured data. The ability of neural networks to make predictions from this unstructured data is the key reason behind their popularity and the economic value they generate.

What is driving the progress in deep learning at the present moment is scale: the scale of data, of computational power, and of new algorithms. The internet has been around for over four decades, resulting in an enormous and still growing accumulation of digital footprints. During that period, research and technological development helped to expand the storage and processing ability of computational systems. Currently, owing to these heavy computational systems and massive amounts of data, we are able to verify discoveries in the field of artificial intelligence made over the past three decades.

Now, what do we need to implement deep learning?

First, we need a large amount of data.

Second, we need to train a reasonably large neural network.

So, why not train a large neural network on small amounts of data?

Think back to your data structure lessons, where a structure is chosen to suit the kind of value it holds. For example, you would not store a scalar value in a variable that has the tensor data type. Similarly, large neural networks need a high volume of data to create distinct representations and pick up meaningful patterns, as shown in the following graph:

Please refer to the preceding graphical representation of data versus performance of different machine learning algorithms for the following inferences:

We see that the performance of traditional machine learning algorithms plateaus beyond a certain data volume, as they are not able to absorb distinct representations from data beyond that threshold.
Check the bottom-left part of the graph, near the origin. This is the region where the relative ordering of the algorithms is not well defined. Due to the small data size, the inner representations are not that distinct. As a result, the performance metrics of all the algorithms coincide. At this level, performance depends mainly on the quality of feature engineering. But these hand-engineered features fail to keep up as the data size increases. That's where deep neural networks come in, as they are able to capture better representations from large amounts of data.

Therefore, we can conclude that one shouldn't fit a deep learning architecture to just any data one encounters. The volume and variety of the data obtained indicate which algorithm to apply. Sometimes small data works better with traditional machine learning algorithms than with deep neural networks.

Deep learning problem statements and algorithms can be further segregated into four different segments based on their area of research and application:

General deep learning: Densely-connected layers or fully-connected networks
Sequence models: Recurrent neural networks, Long Short Term Memory Networks, Gated Recurrent Units, and so on
Spatial data models (images, for example): Convolutional neural networks, Generative Adversarial Networks
Others: Unsupervised learning, reinforcement learning, sparse encoding, and so on

Presently, the industry is mostly driven by the first three segments, but the future of artificial intelligence rests on the advancements in the fourth segment. Looking back over the advancements in machine learning, we can see that until recently these learning models gave real numbers as output, for example, movie reviews (sentiment score) and image classification (class object). Now, other types of output are being generated as well, for example, image captioning (input: image, output: text), machine translation (input: text, output: text), and speech recognition (input: audio, output: text).

Human-level performance is a commonly used benchmark in deep learning. Accuracy stops improving after some time, converging to the highest possible point. The error at this point is called the optimal error rate (also known as the Bayes error rate, that is, the lowest possible error rate for any classifier of a random outcome).

The reason behind this is that a lot of problems have a theoretical limit in performance owing to the noise in the data. Therefore, comparing against human-level accuracy is a good way to improve your models through error analysis. This is done by comparing human-level error, training set error, and validation set error to estimate bias and variance effects, that is, the underfitting and overfitting conditions.
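The error analysis described above can be sketched in a few lines. This is only an illustration: the function name `diagnose` and the three error values are hypothetical, and human-level error is used as a stand-in for the optimal error rate, which is an assumption rather than something the text prescribes.

```python
def diagnose(human_error, train_error, val_error):
    """Rough bias/variance reading from three error rates (all hypothetical)."""
    # Gap between training error and human-level error: a sign of underfitting.
    avoidable_bias = train_error - human_error
    # Gap between validation error and training error: a sign of overfitting.
    variance = val_error - train_error
    if avoidable_bias > variance:
        focus = "bias (underfitting)"
    else:
        focus = "variance (overfitting)"
    return avoidable_bias, variance, focus


# Illustrative numbers only, not results from any real model.
bias, variance, focus = diagnose(human_error=0.01, train_error=0.08, val_error=0.10)
print(f"avoidable bias={bias:.2f}, variance={variance:.2f}, focus on {focus}")
```

Here the larger gap is between the training error and the human-level error, so the sketch suggests working on underfitting first.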

The scale of data, type of algorithm, and performance metrics together help us benchmark the level of improvement across different machine learning algorithms, and thereby govern the crucial decision of whether to invest in deep learning or go with traditional machine learning approaches.

A basic perceptron with some input features (three, here in the following diagram) looks as follows:

The preceding diagram sets the basic approach of what a neural network looks like if we have input in the first layer and output in the next. Let's try to interpret it a bit. Here:

X1, X2, and X3 are input feature variables, that is, the dimension of input here is 3 (considering there's no bias variable).
W1, W2, and W3 are the corresponding weights associated with feature variables. When we talk about the training of neural networks, we mean to say the training of weights. Thus, these form the parameters of our small neural network.
The function in the output layer is an activation function applied over the aggregation of the information received from the previous layer. This function creates a representation state that corresponds to the actual output. The series of processes from the input layer to the output layer, resulting in a predicted output, is called forward propagation.
The error value between the output from the activation function and actual output is minimized through multiple iterations.
Minimization of the error only happens if we change the value of the weights (going from the output layer toward the input layer) in the direction that can minimize our error function. This process is termed backpropagation, as we are moving in the opposite direction.
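As a minimal sketch of the forward propagation step just described, the following computes the output of a three-input perceptron. The weights, inputs, and the choice of a sigmoid activation are assumptions made for illustration; the text has not fixed any of them at this point.

```python
import math


def perceptron_forward(x, w, b=0.0):
    """Aggregate the weighted inputs, then apply a sigmoid activation."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # aggregation step
    return 1.0 / (1.0 + math.exp(-z))                 # activation step


x = [1.0, 2.0, 3.0]      # X1, X2, X3 (hypothetical values)
w = [0.5, -0.25, 0.1]    # W1, W2, W3 (hypothetical values)
y_hat = perceptron_forward(x, w)
print(round(y_hat, 4))
```

Training would then adjust `w` so that `y_hat` moves closer to the actual output.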

Now, keeping these basics in mind, let's go into demystifying the neural networks further using logistic regression as a neural network and try to create a neural network with one hidden layer.

Activation functions for deep learning

Activation functions are the integral units of artificial neural networks. They decide whether a particular neuron is activated or not, that is, whether the information received by the neuron is relevant or not. The activation function performs nonlinear transformation on the receiving signal (data).

We will discuss some of the popular activation functions in the following sections.

The sigmoid function

Sigmoid is a smooth and continuously differentiable function. It results in nonlinear output. The sigmoid function is represented here:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

Look at the following graph of the sigmoid function. The function ranges from 0 to 1. Observing the curve, we see that the gradient is high when x lies between -3 and 3, but the curve becomes flat beyond that. Thus, small changes in x within this range bring large changes in the value of the sigmoid function. Therefore, the function tends to push its output values towards the extremes.

Therefore, it is used in classification problems:

Looking at the following graph of the gradient of the sigmoid function, we observe a smooth curve dependent on x. Since the gradient curve is continuous, it's easy to backpropagate the error and update the parameters, that is, $W$ and $b$:

Sigmoids are widely used, but their disadvantage is that the function goes flat beyond +3 and -3. Thus, whenever the input falls in that region, the gradient tends to approach zero and the learning of our neural network comes to a halt.

Since the sigmoid function outputs values from 0 to 1, it is not symmetric around the origin and all output signals are positive, that is, of the same sign. To tackle this, the sigmoid function has been scaled to the tanh function, which we will study next. Moreover, since the gradient becomes very small, the sigmoid is susceptible to the vanishing gradient problem (which we will discuss later in this chapter).
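To make the flattening concrete, here is a small sketch that evaluates the sigmoid and its gradient at a few points. It relies on the standard identity that the derivative of the sigmoid is sigmoid(x) * (1 - sigmoid(x)); the sample points are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid expressed through its own output.
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  gradient={sigmoid_grad(x):.4f}")
```

The gradient peaks at 0.25 for x = 0 and shrinks quickly as x moves away from the origin, which is the saturation behaviour described above.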

The tanh function

Tanh is a continuous function symmetric around the origin; it ranges from -1 to 1. The tanh function is represented as follows:

$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1$

Thus, the output signals will be both positive and negative, which adds to the segregation of the signals around the origin. As mentioned earlier, it is continuous, nonlinear, and differentiable at all points. We can observe these properties in the graph of the tanh function in the following diagram. Though symmetric, it becomes flat beyond -2 and 2:

Now, looking at the following gradient curve of the tanh function, we observe that it is steeper than that of the sigmoid function. The tanh function also has the vanishing gradient problem:
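The comparison with the sigmoid can be checked numerically. The sketch below uses the standard derivative of tanh, 1 - tanh(x)^2, alongside the sigmoid gradient; the sample points are arbitrary.

```python
import math


def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2.
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


for x in (0.0, 1.0, 2.0, 4.0):
    print(f"x={x:.1f}  tanh'={tanh_grad(x):.4f}  sigmoid'={sigmoid_grad(x):.4f}")
```

At the origin the tanh gradient is 1 against 0.25 for the sigmoid, yet it still collapses toward zero for large |x|, which is why tanh does not escape the vanishing gradient problem.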

The softmax function

The softmax function is mainly used to handle classification problems and is preferably used in the output layer, where it outputs the probabilities of the output classes. In binary logistic regression, the sigmoid function is able to handle only two classes. In order to handle multiple classes, we need a function that can generate values for all the classes such that those values follow the rules of probability. This objective is fulfilled by the softmax function, which exponentiates the output for each class and divides it by the sum of the exponentiated outputs for all the classes, so that each value lies between 0 and 1 and all the values sum to 1:

$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$

For example, consider an input x that holds the scores for four classes.

Then, the softmax function gives results (rounded to three decimal places) as:

Thus, we see the probabilities of all the classes. Since the output of every classifier demands probabilistic values for all the classes, the softmax function becomes the best candidate for the output layer activation function of the classifier.
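A minimal softmax sketch follows. The four scores are hypothetical values chosen for illustration and are not the example values from the original text. Subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax(scores):
    shift = max(scores)                       # stability shift, result unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 3.0, 4.0]                 # hypothetical scores for four classes
probs = softmax(scores)
print([round(p, 3) for p in probs])
print(sum(probs))
```

Every entry lies between 0 and 1, the entries sum to 1, and the largest score receives the largest probability.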

The rectified linear unit function

The rectified linear unit, better known as ReLU, is the most widely used activation function:

$f(x) = \max(0, x)$

The ReLU function has the advantage of being nonlinear while having a very simple gradient. Thus, backpropagation is easy, and we can therefore stack multiple hidden layers activated by the ReLU function, where for x<=0, f(x) = 0 and for x>0, f(x) = x.

The main advantage of the ReLU function over other activation functions is that it does not activate all the neurons at the same time. This can be observed from the preceding graph of the ReLU function, where we see that if the input is negative it outputs zero and the neuron does not activate. This results in a sparse network, and fast and easy computation.

Derivative graph of ReLU, shows f'(x) = 0 for x<=0 and f'(x) = 1 for x>0

Looking at the preceding gradient graph of ReLU, we can see that the negative side of the graph shows a constant zero. Therefore, activations falling in that region will have zero gradients and their weights will not get updated. This leads to inactivity of the nodes/neurons, as they will not learn. To overcome this problem, we have Leaky ReLUs, which modify the function as:

$f(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases}$, where $a$ is a small positive constant

This prevents the gradient from becoming zero on the negative side, and the weight training continues, but slowly, owing to the low value of $a$.
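The difference between the two functions on the negative side can be sketched as follows. The slope `a = 0.01` is an assumed, commonly used default, not a value given in the text.

```python
def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Matches the derivative graph: 0 for x <= 0 and 1 for x > 0.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, a=0.01):
    # a is the small negative-side slope (assumed value).
    return x if x > 0 else a * x


def leaky_relu_grad(x, a=0.01):
    return 1.0 if x > 0 else a


for x in (-2.0, 0.5):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For a negative input, ReLU passes back a zero gradient, while Leaky ReLU passes back the small slope `a`, so the weights feeding that neuron can still change.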

How to choose the right activation function

The activation function is decided depending upon the objective of the problem statement and the concerned properties. Some of the inferences are as follows:

Sigmoid functions work very well in the case of shallow networks and binary classifiers. In deeper networks, they may lead to vanishing gradients.
The ReLU function is the most widely used; try Leaky ReLU to avoid the case of dead neurons. Thus, start with ReLU, then move to another activation function if ReLU doesn't provide good results.
Use softmax in the output layer for multi-class classification.
Avoid using ReLU in the output layer.

Logistic regression as a neural network

Logistic regression is a classification algorithm. Here, we try to predict the probability of the output classes. The class with the highest probability becomes the predicted output. The error between the actual and predicted output is calculated using cross-entropy and minimized through backpropagation. Check the following diagram for binary logistic regression and multi-class logistic regression. The difference is based on the problem statement. If the number of unique output classes is two, it's called binary classification; if it's more than two, it's called multi-class classification. If there are no hidden layers and we use the sigmoid function for binary classification, we get the architecture for binary logistic regression. Similarly, if there are no hidden layers and we use the softmax function for multi-class classification, we get the architecture for multi-class logistic regression.

Now a question arises: why not use the sigmoid function for multi-class logistic regression?

The answer, which is true for the predicted output layer of any neural network, is that the predicted outputs should follow a probability distribution. In plain terms, say the output has N classes. This will result in N probabilities for an input data point having, say, d dimensions. Thus, the sum of the N probabilities for this one input should be 1, and each of those probabilities should be between 0 and 1 inclusive.

On the one hand, the sum of the sigmoid outputs for N different classes will not be 1 in the majority of cases. Therefore, in the binary case, the sigmoid function is applied to obtain the probability of one class, that is, p(y = 1|x), and the probability of the other class is obtained as p(y = 0|x) = 1 − p(y = 1|x). On the other hand, the outputs of a softmax function always satisfy the properties of a probability distribution. In the diagram, $\sigma$ refers to the sigmoid function:

A follow-up question might also arise: what if we use softmax in binary logistic regression?

As mentioned previously, as long as your predicted output follows the rules of a probability distribution, everything is fine. Later, we will discuss cross entropy and the importance of probability distributions as a building block for any machine learning problem, especially those dealing with classification tasks.

A probability distribution is valid if the probabilities of all the values in the distribution are between 0 and 1, inclusive, and the sum of those probabilities is 1.
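Both questions can be checked numerically. The sketch below, using hypothetical scores, first shows that independent sigmoid outputs for three classes do not sum to 1 while softmax outputs do, and then shows that a two-class softmax over the scores (z, 0) gives the same probability as the sigmoid of z. Fixing the second score to 0 is an assumption made to expose the equivalence.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5]                      # hypothetical scores for N = 3 classes
sigmoid_sum = sum(sigmoid(s) for s in scores)
softmax_sum = sum(softmax(scores))
print(sigmoid_sum, softmax_sum)

z = 1.3                                       # hypothetical score for class 1
two_class = softmax([z, 0.0])
print(two_class[0], sigmoid(z))
```

The sigmoid outputs add up to well over 1, so they are not a valid distribution over three classes, whereas using softmax for the binary case is fine because it reproduces the sigmoid probability.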

Logistic regression can be viewed as a very small neural network. Let's go through a step-by-step process to implement binary logistic regression, as shown here:

Notation

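Let the data be of the form $(x, y)$, where:

$y \in \{0, 1\}$ (number of classes = 2 because it's a binary classification)
$x$ is 'n' dimensional, that is, $x \in \mathbb{R}^{n}$ (refer to the preceding diagram)
The number of training examples is m. Thus the training set looks as follows:
- $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\}$.
- m = size of training dataset.
- And, $X = [x^{(1)}, x^{(2)}, \ldots, x^{(m)}]$, where each $x^{(i)} \in \mathbb{R}^{n}$.
- Therefore, $X$ is a matrix of size n * m, that is, number of features * number of training examples.
- $Y = [y^{(1)}, y^{(2)}, \ldots, y^{(m)}]$, a vector of m outputs, where each $y^{(i)} \in \{0, 1\}$.
- Parameters: Weights $W$ and bias $b$, where $W \in \mathbb{R}^{n}$ and $b$ is a scalar value.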

Objective

The objective of any supervised classification learning algorithm is to predict the correct class with higher probability. Therefore, for each given $x^{(i)}$, we have to calculate the predicted output $\hat{y}^{(i)}$, that is, the probability $P(y^{(i)} = 1 \mid x^{(i)})$. Therefore, $0 \le \hat{y}^{(i)} \le 1$.

Referring to binary logistic regression in the preceding diagram:

Predicted output, that is, $\hat{y}^{(i)} = \sigma(W^{T} x^{(i)} + b)$. Here, the sigmoid function squashes the value of $W^{T} x^{(i)} + b$ to between 0 and 1.
This means, when $W^{T} x^{(i)} + b \to \infty$, the sigmoid function of this, that is, $\sigma(W^{T} x^{(i)} + b) \to 1$.
When $W^{T} x^{(i)} + b \to -\infty$, the sigmoid function of this, that is, $\sigma(W^{T} x^{(i)} + b) \to 0$.

Once we have calculated $\hat{y}$, that is, the predicted output, we are done with our forward propagation task. Now, we will calculate the error value using the cost function and try to backpropagate to minimize our error value by changing the values of our parameters, W and b, through gradient descent.

The cost function

The cost function is a metric that determines how well or poorly a machine learning algorithm performed by comparing the actual training output with the predicted output. Recall linear regression, where the squared error was used as the loss function, that is, $L(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^{2}$. This works well when the resulting cost curve is convex, but in the case of classification with a sigmoid output, the curve is non-convex; as a result, gradient descent doesn't work well and may not reach the global optimum. Therefore, we use cross-entropy loss, which fits classification tasks better, as the cost function.

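Cross entropy as loss function (for the $i^{th}$ input data), that is,

$L(\hat{y}^{(i)}, y^{(i)}) = -\sum_{c=1}^{C} y_{c}^{(i)} \log \hat{y}_{c}^{(i)}$, where C refers to different output classes.
Thus, cost function = Average cross entropy loss (for the whole dataset), that is,

$J = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

In case of binary logistic regression, there are only two output classes, that is, 0 and 1, and the sum of the class values will always be 1. Therefore (for the $i^{th}$ input data), if one class is $y^{(i)}$, the other will be $1 - y^{(i)}$. Similarly, since the probability of class $y^{(i)}$ is $\hat{y}^{(i)}$ (prediction), the probability of the other class, that is, $1 - y^{(i)}$, will be $1 - \hat{y}^{(i)}$.

Therefore, the loss function modifies to $L(\hat{y}^{(i)}, y^{(i)}) = -[y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)})]$, where:

If $y^{(i)} = 1$, that is, $L(\hat{y}^{(i)}, y^{(i)}) = -\log \hat{y}^{(i)}$. Therefore, to minimize $L$, $\hat{y}^{(i)}$ should be large, that is, closer to 1.
If $y^{(i)} = 0$, that is, $L(\hat{y}^{(i)}, y^{(i)}) = -\log (1 - \hat{y}^{(i)})$. Therefore, to minimize $L$, $\hat{y}^{(i)}$ should be small, that is, closer to 0.

The loss function applies to a single example, whereas the cost function applies to the whole training set. Thus, the cost function for this case will be:

$J(W, b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)})]$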
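A small sketch of the binary cross-entropy loss for a single example and of the cost as its average over a dataset. The predictions and labels are hypothetical, and the clipping constant `eps` is an added safeguard against taking the log of 0, not part of the formula.

```python
import math


def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Loss for one example: -[y log(y_hat) + (1 - y) log(1 - y_hat)]."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def cost(y_hats, ys):
    """Cost: average loss over all m examples."""
    return sum(binary_cross_entropy(p, t) for p, t in zip(y_hats, ys)) / len(ys)


# Hypothetical predictions and labels.
print(binary_cross_entropy(0.9, 1))   # confident and correct: small loss
print(binary_cross_entropy(0.1, 1))   # confident and wrong: large loss
print(cost([0.9, 0.2, 0.6], [1, 0, 1]))
```

The loss is small when the prediction is close to the actual label and grows sharply as the prediction moves toward the wrong extreme.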

The gradient descent algorithm

The gradient descent algorithm is an optimization algorithm that finds the minimum of a function using first-order derivatives, that is, we differentiate the function with respect to its parameters to first order only. Here, the objective of the gradient descent algorithm is to minimize the cost function $J(W, b)$ with respect to $W$ and $b$.

This approach repeats the following steps for numerous iterations to minimize $J(W, b)$:

$W := W - \alpha \frac{\partial J(W, b)}{\partial W}$
$b := b - \alpha \frac{\partial J(W, b)}{\partial b}$

$\alpha$ used in the above equations refers to the learning rate. The learning rate is the speed at which the learning agent adapts to new knowledge. Thus, $\alpha$, that is, the learning rate, is a hyperparameter that needs to be assigned as a scalar value or as a function of time. In this way, in every iteration, the values of $W$ and $b$ are updated as mentioned in the preceding formula until the value of the cost function reaches an acceptable minimum value.

The gradient descent algorithm means moving down the slope. The slope of the curve is given by the derivative of the cost function with respect to the parameters. The gradient, that is, the slope, is positive where the cost increases with the parameter and negative where it decreases. Thus, we multiply the slope by a negative sign, since we have to move opposite to the direction of increase and toward the direction of decrease.

Using the optimum learning rate, $\alpha$, the descent is controlled and we don't overshoot the local minimum. If the learning rate, $\alpha$, is very small, convergence will take more time owing to the large number of iterations required, while if it's very high, the updates might overshoot, miss the minimum, and diverge:
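The effect of the learning rate can be seen on a toy problem. The sketch below minimizes the hypothetical one-parameter cost J(w) = (w - 3)^2, whose minimum is at w = 3, with three assumed learning rates; neither the function nor the rates come from the text.

```python
def gradient_descent(alpha, steps=50, w=0.0):
    """Minimize J(w) = (w - 3)^2 starting from w, using dJ/dw = 2 (w - 3)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w = w - alpha * grad        # step against the slope
    return w


well_tuned = gradient_descent(alpha=0.1)     # converges close to 3
too_small = gradient_descent(alpha=0.001)    # still far from 3 after 50 steps
too_large = gradient_descent(alpha=1.1)      # overshoots more on every step

print(well_tuned, too_small, too_large)
```

With the same number of steps, the small rate has barely moved, the large rate has diverged, and only the moderate rate has settled near the minimum.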

The computational graph

A basic neural network consists of forward propagation followed by a backward propagation. As a result, it consists of a series of steps that includes the values of different nodes, weights, and biases, as well as derivatives of cost function with regards to all the weights and biases. In order to keep track of these processes, the computational graph comes into the picture. The computational graph also keeps track of chain rule differentiation irrespective of the depth of the neural network.
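As a rough illustration of what a computational graph keeps track of, the sketch below writes the forward pass of a single sigmoid unit as three nodes (z, a, and the loss) and then applies the chain rule node by node in the backward pass. The input values are hypothetical, and the result is compared with a numerical derivative as a sanity check.

```python
import math


def forward(w, b, x, y):
    z = w * x + b                                        # node 1: aggregation
    a = 1.0 / (1.0 + math.exp(-z))                       # node 2: sigmoid
    loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))  # node 3: cross entropy
    return z, a, loss


def backward(w, b, x, y):
    z, a, loss = forward(w, b, x, y)
    dloss_da = -(y / a) + (1 - y) / (1 - a)   # local derivative at node 3
    da_dz = a * (1 - a)                       # local derivative at node 2
    dloss_dz = dloss_da * da_dz               # chain rule
    return dloss_dz * x, dloss_dz * 1.0       # dz/dw = x, dz/db = 1


w, b, x, y = 0.4, -0.1, 2.0, 1               # hypothetical values
dw, db = backward(w, b, x, y)

h = 1e-6
numeric_dw = (forward(w + h, b, x, y)[2] - forward(w - h, b, x, y)[2]) / (2 * h)
print(dw, numeric_dw)
```

Multiplying the local derivatives collapses to the familiar (a - y) term used in the logistic regression updates.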

Steps to solve logistic regression using gradient descent

Putting together all the building blocks we've just covered, let's try to solve a binary logistic regression with two input features.

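The basic steps to compute are:

Calculate $z^{(i)} = W^{T} x^{(i)} + b$
Calculate $\hat{y}^{(i)} = \sigma(z^{(i)})$, the predicted output
Calculate the cost function: $J(W, b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)})]$

Say we have two input features, that is, two dimensions, and a dataset of m samples. Therefore, the following would be the case:

Weights $W = [w_1, w_2]$ and bias $b$
Therefore, $z^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + b$, and, $\hat{y}^{(i)} = \sigma(z^{(i)})$
Calculate $J(W, b)$ (average loss over all the examples)
Calculate the derivatives with respect to $w_1$, $w_2$, and $b$, that is, $\frac{\partial J}{\partial w_1}$, $\frac{\partial J}{\partial w_2}$, and $\frac{\partial J}{\partial b}$, respectively
Modify $w_1$, $w_2$, and $b$ as mentioned in the preceding gradient descent section

The pseudocode for the preceding m-sample dataset is as follows:

Initialize the value of the learning rate, $\alpha$, and the number of epochs, e
Initialize b (bias) as 0; for w1 and w2, you can go for random normal or xavier initialization (explained in the next section)
Loop over the number of epochs e (where each time the full dataset will pass in batches)
At the start of each epoch, reset J (cost function), dw1, dw2, and db to 0

Here, a is $\hat{y}$, dw1 is $\frac{\partial J}{\partial w_1}$, dw2 is $\frac{\partial J}{\partial w_2}$, and db is $\frac{\partial J}{\partial b}$. Each iteration contains a loop iterating over m examples.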

The pseudo code for the same is given here:

w1 = xavier initialization, w2 = xavier initialization, b = 0, e = 100, α = 0.0001
for j → 1 to e :
     J = 0, dw1 = 0, dw2 = 0, db = 0
     for i → 1 to m :
         z = w1 * x1[i] + w2 * x2[i] + b
         a = σ(z)
         J = J - [ y[i] log a + (1-y[i]) log (1-a) ]
         dw1 = dw1 + (a-y[i]) * x1[i]
         dw2 = dw2 + (a-y[i]) * x2[i]
         db = db + (a-y[i])
     J = J / m
     dw1 = dw1 / m
     dw2 = dw2 / m
     db = db / m
     w1 = w1 - α * dw1
     w2 = w2 - α * dw2
     b = b - α * db
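The pseudocode can be turned into a runnable sketch in plain Python. Several details here are assumptions made for illustration: the tiny dataset is made up, the learning rate is 0.1 rather than 0.0001 so that the change in cost is visible within 100 epochs, the Xavier-style standard deviation uses the common form sqrt(2 / (n_in + n_out)), and the predicted value is clipped before taking logs.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(x1, x2, y, epochs=100, alpha=0.1, seed=0):
    rng = random.Random(seed)
    m = len(y)
    std = math.sqrt(2.0 / (2 + 1))           # assumed Xavier form, n_in=2, n_out=1
    w1, w2, b = rng.gauss(0.0, std), rng.gauss(0.0, std), 0.0
    costs = []
    for _ in range(epochs):
        J = dw1 = dw2 = db = 0.0             # reset accumulators every epoch
        for i in range(m):
            a = sigmoid(w1 * x1[i] + w2 * x2[i] + b)
            a_safe = min(max(a, 1e-12), 1.0 - 1e-12)   # guard against log(0)
            J -= y[i] * math.log(a_safe) + (1 - y[i]) * math.log(1.0 - a_safe)
            dw1 += (a - y[i]) * x1[i]
            dw2 += (a - y[i]) * x2[i]
            db += a - y[i]
        costs.append(J / m)
        w1 -= alpha * dw1 / m                # gradient descent updates
        w2 -= alpha * dw2 / m
        b -= alpha * db / m
    return w1, w2, b, costs


# Hypothetical two-feature dataset with m = 6 examples.
x1 = [0.0, 0.5, 1.0, 2.0, 2.5, 3.0]
x2 = [0.2, 0.1, 0.4, 1.8, 2.2, 2.0]
y = [0, 0, 0, 1, 1, 1]
w1, w2, b, costs = train(x1, x2, y)
print(costs[0], costs[-1])
```

Comparing the first and last entries of `costs` shows the cost falling as the weights and bias are updated.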

What is xavier initialization?

Xavier initialization is the initialization of the weights in a neural network as random variables following a Gaussian distribution with zero mean, where the variance is given by

$\text{var}(W) = \frac{2}{n_{in} + n_{out}}$

where $n_{in}$ is the number of units in the current layer, that is, the incoming signal units, and $n_{out}$ is the number of units in the next layer, that is, the outgoing resulting signal units. In short, $n_{in} \times n_{out}$ is the shape of $W$.

Why do we use xavier initialization?

The following factors call for the application of xavier initialization:

If the weights in a network start very small, most of the signals will shrink and become dormant at the activation functions in the later layers
If the weights start very large, most of the signals will grow massively as they pass through the activation functions in the later layers

Thus, xavier initialization helps in generating suitable initial weights, such that the signals stay within a useful range, thereby minimizing the chances of the signals becoming either too small or too large.

The derivation of the preceding formula is beyond the scope of this book. Feel free to search here (http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization) and go through the derivation for a better understanding.
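A minimal sketch of drawing Xavier-initialized weights, assuming the commonly used Glorot form in which the variance is 2 / (n_in + n_out). The function name `xavier_init` and the layer sizes are hypothetical.

```python
import math
import random


def xavier_init(n_in, n_out, seed=0):
    """Return an n_in x n_out weight matrix with zero-mean Gaussian entries."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (n_in + n_out))    # assumed variance: 2 / (n_in + n_out)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


W = xavier_init(100, 100)
flat = [w for row in W for w in row]
mean = sum(flat) / len(flat)
sample_var = sum((w - mean) ** 2 for w in flat) / len(flat)
print(mean, sample_var)
```

For these layer sizes the target variance is 0.01, and the sample variance of the drawn weights lands close to it.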

The neural network model

A neural network model is similar to the preceding logistic regression model. The only difference is the addition of hidden layers between the input and output layers. Let's consider a single hidden layer neural network for classification to understand the process as shown in the following diagram:

Here, Layer 0 is the input layer, Layer 1 is the hidden layer, and Layer 2 is the output layer. This is also known as a two-layered neural network, owing to the fact that when we count the number of layers in a neural network, we don't count the input layer. Thus, the input layer is considered as Layer 0 and the successive layers get the notation of Layer 1, Layer 2, and so on.

Now, a basic question comes to mind: why are the layers between the input and output layers termed hidden layers?

This is because the values of the nodes in the hidden layers are not present in the training set. As we have seen, at every node two calculations happen. These are:

Aggregation of the input signals from previous layers
Subjecting the aggregated signal to an activation to create deeper inner representations, which in turn are the values of the corresponding hidden nodes

Referring to the preceding diagram, we have three input features, $x_1$, $x_2$, and $x_3$. The node showing value 1 is regarded as the bias unit. Each layer, except the output, generally has a bias unit. Bias units can be regarded as an intercept term and play an important role in shifting the activation function left or right. Remember, the number of hidden layers and the number of nodes in them are hyperparameters that we define at the start. Here, we have defined the number of hidden layers to be one and the number of hidden nodes to be three, $h_1$, $h_2$, and $h_3$. Thus, we can say we have three input units, three hidden units, and three output units ($\hat{y}_1$, $\hat{y}_2$, and $\hat{y}_3$, since we have to predict one out of three classes). This will give us the shape of the weights and biases associated with the layers. For example, Layer 0 has 3 units and Layer 1 has 3. The shape of the weight matrix and bias vector associated with Layer i is given by:

$W^{(i)}$: (number of units in Layer i) × (number of units in Layer i+1), and $b^{(i)}$: one entry per unit in Layer i+1

Therefore, the shapes of $W^{(0)}$, $b^{(0)}$, $W^{(1)}$, and $b^{(1)}$:

$W^{(0)}$ will be $3 \times 3$ and $b^{(0)}$ will be a vector of 3 entries
$W^{(1)}$ will be $3 \times 3$ and $b^{(1)}$ will be a vector of 3 entries

Now, let's understand the following notation:

$W^{(i)}_{ad}$: Here, it refers to the value of the weight connecting node a in Layer i to node d in Layer i+1
$b^{(i)}_{d}$: Here, it refers to the value of the bias connecting the bias unit node in Layer i to node d in Layer i+1

Therefore, the nodes in the hidden layer can be calculated in the following way:

$h_d = f\left(\sum_{a=1}^{3} W^{(0)}_{ad} x_a + b^{(0)}_{d}\right)$, for $d = 1, 2, 3$

where f refers to the activation function. Remember that in logistic regression we used sigmoid and softmax as the activation functions for binary and multi-class logistic regression, respectively.

Similarly, we can calculate the output units, like so:

$\hat{y}_d = f\left(\sum_{a=1}^{3} W^{(1)}_{ad} h_a + b^{(1)}_{d}\right)$, for $d = 1, 2, 3$

This brings us to the end of the forward propagation process. Our next task is to train the neural network (that is, train the weights and biases parameters) through backpropagation.
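The forward propagation through this network of three inputs, three hidden units, and three outputs can be sketched as follows. All weight and bias values are hypothetical, the weights are indexed as W[a][d] (from node a in one layer to node d in the next), and using tanh in the hidden layer and softmax in the output layer is an assumed choice of activation functions.

```python
import math


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer(inputs, W, b):
    # Aggregation: W[a][d] connects input node a to output node d.
    return [sum(W[a][d] * inputs[a] for a in range(len(inputs))) + b[d]
            for d in range(len(b))]


def forward(x, W0, b0, W1, b1):
    h = [math.tanh(v) for v in layer(x, W0, b0)]   # hidden units
    y_hat = softmax(layer(h, W1, b1))              # output probabilities
    return h, y_hat


# Hypothetical parameters for a 3-3-3 network.
W0 = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.6]]
b0 = [0.0, 0.1, -0.1]
W1 = [[0.7, -0.4, 0.1], [-0.2, 0.5, 0.3], [0.4, 0.2, -0.6]]
b1 = [0.05, 0.0, -0.05]

h, y_hat = forward([1.0, 0.5, -1.0], W0, b0, W1, b1)
print(h)
print(y_hat, sum(y_hat))
```

The three outputs form a probability distribution over the three classes, and the class with the highest probability is the prediction.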

Let the actual output classes be $y_1$, $y_2$, and $y_3$.

Recalling the cost function section in logistic regression, we used cross entropy to formulate our cost function. The cost function is defined by

$J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{c=1}^{C} y_{c}^{(i)} \log \hat{y}_{c}^{(i)}$

where C = 3 and m = number of examples

Since this is a classification problem, for each example the output will have only one output class as 1 and the rest would be zero. For example, for i, it would be:

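Thus, for each example only the term of its true class survives in the inner sum, and the cost function reduces to $J = -\frac{1}{m} \sum_{i=1}^{m} \log \hat{y}_{k_i}^{(i)}$, where $k_i$ is the true class of example i.

Now, our goal is to minimize the cost function with respect to $W$ and $b$. In order to train our given neural network, first randomly initialize $W$ and $b$. Then we will try to optimize $J$ through gradient descent, where we will update $W$ and $b$ at the learning rate, $\alpha$, in the following manner:

$W := W - \alpha \frac{\partial J}{\partial W}$
$b := b - \alpha \frac{\partial J}{\partial b}$

After setting up this structure, we have to perform these optimization steps (of updating $W$ and $b$) repeatedly for numerous iterations to train our neural network.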

This brings us to the end of the basics of neural networks, which form the basic building block of any neural network, shallow or deep. Next, we will look at some of the famous deep neural network architectures, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Apart from that, we will also have a look at benchmark deep neural network architectures such as AlexNet, VGG-net, and Inception.

Recurrent neural networks

Recurrent neural networks, abbreviated as RNNs, are used in cases of sequential data, whether as an input, output, or both. RNNs became so effective because their architecture aggregates the learning from past data and uses it along with the new data to enhance the learning. This way, an RNN captures the sequence of events, which wasn't possible in a feedforward neural network or in earlier approaches to statistical time series analysis.

Consider time series data such as stock market, audio, or video datasets, where the sequence of events matters a lot. Thus, in this case, apart from the collective learning from the whole data, the order of learning from the data encountered over time matters. This will help to capture the underlying trend.

The ability to perform sequence based learning is what makes RNNs highly effective. Let's take a step back and try to understand the problem. Consider the following data diagram:

Imagine you have a sequence of events similar to the ones in the diagram, and at each point in time you want to make decisions as per the sequence of events. Now, if your sequence is reasonably stationary, you can use a classifier with similar weights for any time step, but here's the glitch. If you run the same classifier separately on the data of each time step, it will not train to similar weights for the different time steps. If you run a single classifier on the whole dataset containing the data of all the time steps, then the weights will be the same, but the sequence-based learning is hampered. For our solution, we want to share weights over different time steps and utilize what we have learned up to the last time step, as shown in the following diagram:

As per the problem, we have understood that our neural network should be able to consider the learnings from the past. This notion can be seen in the preceding diagrammatic representation, where in the first part it shows that at each time step, the network training the weights should consider the data learning from the past, and the second part gives the solution to that. We use a state representation of the classifier output from the previous time step as an input, along with the new time step data to learn the current state representation. This state representation can be defined as the collective learning (or summary) of what happened till last time step, recursively. The state is not the predicted output from the classifier. Instead, when it is subjected to a softmax activation function, it will yield the predicted output.

In order to remember further back, a deeper neural network would be required. Instead, we will go for a single model summarizing the past and provide that information, along with the new information, to our classifier.

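Thus, at any time step, t, in a recurrent neural network, the following calculation occurs:

$$h_t = f([h_{t-1}, x_t] W + b)$$

Here:

$W$ and $b$ are the weights and biases shared over time.
$f$ is the activation function.
$[h_{t-1}, x_t]$ refers to the concatenation of the previous hidden state and the current input. Say your input, $x_t$, is of shape $n \times d$, that is, n samples/rows and d dimensions/columns, and $h_{t-1}$ is of shape $n \times l$, where $l$ is the size of the hidden state. Then, the concatenation results in a matrix of shape $n \times (l + d)$.

Since the shape of any hidden state, $h_t$, is $n \times l$, the shape of $W$ is $(l + d) \times l$ and the shape of $b$ is $l$.

This is consistent, since multiplying the $n \times (l + d)$ concatenation by the $(l + d) \times l$ weight matrix yields an $n \times l$ matrix, which is the shape of $h_t$.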

These operations in a given time step, t, constitute an RNN cell unit. Let's visualize the RNN cell at time step t, as shown here:

Once we are done with the calculations up to the final time step, our forward propagation task is done. The next task is to minimize the overall loss by backpropagating through time to train our recurrent neural network. The total loss of one such sequence is the summation of the loss across all time steps; that is, given a sequence of X values and their corresponding output sequence of Y values, the loss is given by:

$$L(X, Y) = \sum_{t} L_t(\hat{y}_t, y_t)$$

where $\hat{y}_t$ is the prediction at time step t and $y_t$ is the corresponding target.

Thus, the cost function of the whole dataset containing m examples would be (where k refers to the example):

$$J = \frac{1}{m}\sum_{k=1}^{m} L(X^{(k)}, Y^{(k)})$$

Since the RNNs incorporate the sequential data, backpropagation is extended to backpropagation through time. Here, time is a series of ordered time steps connecting one to the other, which allows backpropagation through different time steps.
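The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the cell is taken to compute tanh([h_prev, x_t] W + b) for a single sample, the softmax is applied directly to the state (so the number of classes equals the state size), cross entropy is used as the per-step loss, and all names and numbers (rnn_cell, rnn_forward, the weight values) are made up for the example.

```python
import math

def matvec(v, W):
    """Row vector v (length p) times matrix W (p x q) -> list of length q."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_cell(h_prev, x_t, W, b):
    """One time step: h_t = tanh([h_prev, x_t] W + b)."""
    concat = h_prev + x_t  # list concatenation, length l + d
    return [math.tanh(p + bias) for p, bias in zip(matvec(concat, W), b)]

def rnn_forward(xs, targets, W, b, l):
    """Run the same cell (shared W and b) over the sequence and sum the per-step loss."""
    h = [0.0] * l  # initial state
    states, total_loss = [], 0.0
    for x_t, y_t in zip(xs, targets):
        h = rnn_cell(h, x_t, W, b)
        y_hat = softmax(h)                    # state -> predicted distribution
        total_loss += -math.log(y_hat[y_t])   # cross-entropy loss at this step
        states.append(h)
    return states, total_loss

d, l = 2, 3                                                     # input size, state size
W = [[0.1 * (i - j) for j in range(l)] for i in range(l + d)]   # shape (l + d) x l
b = [0.0] * l
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                       # three time steps
targets = [0, 2, 1]                                             # target class per step
states, loss = rnn_forward(xs, targets, W, b, l)
print(len(states), loss)
```

Note that W and b are created once and reused at every time step, which is the weight sharing that distinguishes the RNN from running a separate classifier per step.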

Long Short Term Memory Networks

RNNs practically fail to handle long-term dependencies. As the gap between the output data point in the output sequence and the input data point in the input sequence increases, RNNs fail to connect the information between the two. This usually happens in text-based tasks such as machine translation, audio to text, and many more where the sequences are long.

Long Short Term Memory Networks, also known as LSTMs (introduced by Hochreiter and Schmidhuber), are capable of handling these long-term dependencies. Take a look at the image given here:

The key feature of an LSTM is the cell state, $C_t$, which helps information flow through the sequence largely unchanged. We will start with the forget gate layer, $f_t$, which takes the concatenation of the last hidden state, $h_{t-1}$, and the current input, $x_t$, as its input and trains a neural network that outputs a number between 0 and 1 for each number in the last cell state, $C_{t-1}$, where 1 means keep the value and 0 means forget the value. Thus, this layer identifies what information to forget from the past and what information to retain:

$$f_t = \sigma([h_{t-1}, x_t] W_f + b_f)$$

Next, we come to the input gate layer, $i_t$, and the tanh layer, whose task is to identify what new information to add to that received from the past in order to update the cell state. The tanh layer creates a vector of new candidate values, $\tilde{C}_t$, while the input gate layer identifies which of those values to use for the update:

$$i_t = \sigma([h_{t-1}, x_t] W_i + b_i)$$
$$\tilde{C}_t = \tanh([h_{t-1}, x_t] W_C + b_C)$$

This new information is combined with the information retained by the forget gate layer, $f_t$, to update the cell state. Thus, the new cell state is:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Finally, a neural network is trained at the output gate layer, $o_t$, which decides which values of the cell state to output as the hidden state, $h_t$:

$$o_t = \sigma([h_{t-1}, x_t] W_o + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

Here, $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.

Thus, an LSTM cell incorporates the last cell state, $C_{t-1}$, the last hidden state, $h_{t-1}$, and the current time step input, $x_t$, and outputs the updated cell state, $C_t$, and the current hidden state, $h_t$.

LSTMs were a breakthrough as people were able to benchmark remarkable outcomes with RNNs by incorporating them as the cell unit. This was a great step towards the solution for issues concerned with long term dependencies.
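As a rough illustration of how the gates interact, the following plain Python sketch performs one LSTM step for a single sample. It assumes the standard formulation (sigmoid forget, input, and output gates, a tanh candidate layer, and element-wise products); the function names and the parameter dictionary keys such as "W_f" and "b_f" are hypothetical, and the parameter values are chosen by hand rather than trained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(v, W, b):
    """v W + b for a row vector v (length p), W (p x q), and b (length q)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) + b[j] for j in range(len(b))]

def lstm_cell(c_prev, h_prev, x_t, params):
    """One LSTM step: returns the new cell state and the new hidden state."""
    concat = h_prev + x_t  # [h_prev, x_t]
    f = [sigmoid(z) for z in affine(concat, params["W_f"], params["b_f"])]          # forget gate
    i = [sigmoid(z) for z in affine(concat, params["W_i"], params["b_i"])]          # input gate
    c_tilde = [math.tanh(z) for z in affine(concat, params["W_C"], params["b_C"])]  # candidates
    c = [f_j * c_j + i_j * ct_j for f_j, c_j, i_j, ct_j in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(z) for z in affine(concat, params["W_o"], params["b_o"])]          # output gate
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return c, h

# Hand-picked parameters for a state of size 1 and an input of size 1.
# A large forget bias and a very negative input bias make the cell keep its memory.
zeros = [[0.0], [0.0]]  # shape (l + d) x l = 2 x 1
keep_params = {
    "W_f": zeros, "b_f": [100.0],
    "W_i": zeros, "b_i": [-100.0],
    "W_C": zeros, "b_C": [0.0],
    "W_o": zeros, "b_o": [0.0],
}
c_new, h_new = lstm_cell(c_prev=[2.0], h_prev=[0.3], x_t=[1.0], params=keep_params)
print(c_new, h_new)
```

With the forget gate saturated at 1 and the input gate at 0, the cell state passes through unchanged, which is the mechanism that lets an LSTM carry information across long gaps.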

Convolutional neural networks

Convolutional neural networks or ConvNets, are deep neural networks that have provided successful results in computer vision. They were inspired by the organization and signal processing of neurons in the visual cortex of animals, that is, individual cortical neurons respond to the stimuli in their concerned small region (of the visual field), called the receptive field, and these receptive fields of different neurons overlap altogether covering the whole visual field.

When different parts of the input space contain the same kind of information, we share the weights and train them jointly across those inputs. For spatial data, such as images, this weight sharing leads to CNNs. Similarly, for sequential data, such as text, we saw this weight sharing in RNNs.

CNNs have wide applications in the field of computer vision and natural language processing. As far as the industry is concerned, Facebook uses it in their automated image-tagging algorithms, Google in their image search, Amazon in their product recommendation systems, Pinterest to personalize the home feeds, and Instagram for image search and recommendations.

Just as in any neural network, a neuron (or node) receives the weighted aggregation of its input signals from the last layer, which is then subjected to an activation function to produce an output. Then we backpropagate to minimize our loss function. This is the basic operation applied in any kind of neural network, so it works for CNNs too.

Unlike the neural networks discussed so far, where an input is in the form of a vector, CNNs take images as input that are multi-channeled, that is, RGB (three channels: red, green, and blue). Say there's an image of pixel size a × b; then the actual tensor representation is of shape a × b × 3.

Let's say you have an image similar to the one shown here:

It can be represented as a flat plate that has width, height, and, because of the RGB channels, a depth of three. Now, take a small patch of this image, say 2 × 2, and run a tiny neural network on it with an output depth of, say, k. This results in a representation patch of shape 1 × 1 × k. Now, sliding this neural network horizontally and vertically over the whole image without changing the weights results in another image of different width, height, and depth k (that is, now we have k channels).

This operation is termed convolution. Generally, ReLUs are used as the activation function in these neural networks:

Here, we are mapping three feature maps (that is, the RGB channels) to k feature maps.

The sliding motion of the patch over the image is called striding, and the number of pixels you shift each time, whether horizontally or vertically, is called the stride. If the patch never goes outside the image space while striding, it is regarded as valid padding. On the other hand, if the patch is allowed to go outside the image space so that the output map matches the input size, the pixels of the patch that fall outside the image are padded with zeros. This is called same padding.
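The sliding-patch idea and the effect of stride and padding on the output size can be sketched in plain Python. The output-size rule below follows the convention TensorFlow uses for its "VALID" and "SAME" modes; everything else (the function names, the nested-list image layout, the hand-picked filter weights) is an assumption made for this example, and only the valid-padding convolution is implemented.

```python
import math

def output_size(n, patch, stride, padding):
    """Number of patch positions along one dimension of length n."""
    if padding == "valid":   # the patch never leaves the image
        return math.ceil((n - patch + 1) / stride)
    if padding == "same":    # zero padding keeps every input position covered
        return math.ceil(n / stride)
    raise ValueError("padding must be 'valid' or 'same'")

def conv2d_valid(image, weights, biases, stride=1):
    """Slide k filters over image[h][w][c]; returns out[oh][ow][k] after ReLU.

    weights[k][ph][pw][c] holds one square patch of weights per output channel.
    """
    patch = len(weights[0])
    out_h = output_size(len(image), patch, stride, "valid")
    out_w = output_size(len(image[0]), patch, stride, "valid")
    out = []
    for r in range(out_h):
        row = []
        for col in range(out_w):
            pixel = []
            for w_k, b_k in zip(weights, biases):  # same weights at every position
                total = b_k
                for dr in range(patch):
                    for dc in range(patch):
                        source = image[r * stride + dr][col * stride + dc]
                        for ch, value in enumerate(source):
                            total += value * w_k[dr][dc][ch]
                pixel.append(max(0.0, total))      # ReLU activation
            row.append(pixel)
        out.append(row)
    return out

# A 4 x 4 RGB image of ones, a 2 x 2 patch, and k = 2 output channels
image = [[[1.0] * 3 for _ in range(4)] for _ in range(4)]
weights = [
    [[[0.5] * 3 for _ in range(2)] for _ in range(2)],   # filter 0
    [[[-1.0] * 3 for _ in range(2)] for _ in range(2)],  # filter 1
]
feature_maps = conv2d_valid(image, weights, biases=[0.0, 0.0])
print(len(feature_maps), len(feature_maps[0]), feature_maps[0][0])
```

Here the 4 × 4 × 3 input becomes a 3 × 3 × 2 output: the width and height shrink because of valid padding, and the depth becomes k = 2.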

A CNN architecture consists of a series of these convolutional layers. A stride greater than 1 in these convolutional layers causes spatial reduction. Thus, stride, patch size, and the activation function become the hyperparameters. Along with convolutional layers, one important layer is sometimes added, called the pooling layer. It takes all the convolution outputs in a neighborhood and combines them. One form of pooling is called max pooling.

In max pooling, the pooling operation looks at all the values in the patch of the feature map and returns the maximum among them. Thus, the pooling size (that is, the pooling patch/window size) and the pooling stride are the hyperparameters. The following image depicts the concept of max pooling:

Max pooling often yields more accurate results. Similarly, we have average pooling, where instead of the maximum value we take the average of the values in the pooling window, providing a low-resolution view of the feature map.
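Both pooling variants can be expressed with one small helper, sketched here in plain Python on a single-channel feature map. The function name pool2d, the sample values, and the 2 × 2 window with stride 2 are assumptions chosen for illustration.

```python
def pool2d(feature_map, size, stride, reduce_fn):
    """Apply reduce_fn to every size x size window of a 2D feature map."""
    out = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out

def mean(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
max_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=max)
avg_pooled = pool2d(feature_map, size=2, stride=2, reduce_fn=mean)
print(max_pooled)  # [[6, 4], [8, 9]]
print(avg_pooled)  # [[3.75, 2.25], [4.5, 4.0]]
```

In both cases the 4 × 4 map is reduced to 2 × 2; the only difference is how the values in each window are combined.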

By manipulating the hyperparameters and the ordering of the convolutional, pooling, and fully connected layers, many different variants of CNNs have been created, which are being used in research and industrial domains. Some of the famous ones among them are LeNet-5, AlexNet, VGG-Net, and the Inception model.

The LeNet-5 convolutional neural network

Architecture of LeNet-5, from Gradient-Based Learning Applied to Document Recognition by LeCun et al. (http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf)

LeNet-5 is a seven-level convolutional neural network, published in 1998 by a team comprising Yann LeCun, Yoshua Bengio, Leon Bottou, and Patrick Haffner to classify digits. It was used by banks to recognize handwritten numbers on checks. The layers are ordered as:

Input image | Convolutional Layer 1 (ReLU) | Pooling 1 | Convolutional Layer 2 (ReLU) | Pooling 2 | Fully Connected 1 (ReLU) | Fully Connected 2 | Output

LeNet-5 had remarkable results, but the ability to process higher-resolution images required more convolutional layers, such as in the AlexNet, VGG-Net, and Inception models.

The AlexNet model

AlexNet, a modification of LeNet, was designed by the group named SuperVision, which was composed of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever. AlexNet made history by achieving a top-5 error rate of 15.3% in the ImageNet Large Scale Visual Recognition Challenge in 2012, more than 10 percentage points lower than that of the runner-up.

The architecture uses five convolutional layers, three max pool layers, and three fully connected layers at the end, as shown in the following diagram. There were a total of 60 million parameters in the model trained on 1.2 million images, which took about five to six days on two NVIDIA GTX 580 3GB GPUs. The following image shows the AlexNet model:

Architecture of AlexNet, from ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky et al. (https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)

The VGG-Net model

VGG-Net was introduced by Karen Simonyan and Andrew Zisserman from the Visual Geometry Group (VGG) of the University of Oxford. They used small convolutional filters of size 3 × 3 to train networks of depth 16 and 19. Their team secured first and second place in the localization and classification tasks, respectively, of the ImageNet Challenge 2014.

The idea of designing a deeper neural network by adding more non-linearity to the model led to the use of smaller filters, to make sure the network didn't have too many parameters. While training, it was difficult to get the model to converge, so a pre-trained, simpler neural network was first used to initialize the weights of the deeper architecture. However, now we can directly use the Xavier initialization method instead of training a neural network to initialize the weights. Due to the depth of the model, it's very slow to train.

The Inception model

Inception was created by the team at Google in 2014. The main idea was to create deeper and wider networks while limiting the number of parameters and avoiding overfitting. The following image shows the full Inception module:

Architecture of the Inception module (naive version), from Going Deeper with Convolutions by Szegedy et al. (https://arxiv.org/pdf/1409.4842.pdf)

It applies multiple convolutional layers to a single input and outputs the stacked output of each convolution. The convolution sizes used are mainly 1 × 1, 3 × 3, and 5 × 5. This kind of architecture allows you to extract multi-level features from the same-sized input. An earlier version was also called GoogLeNet, which won the ImageNet challenge in 2014.

Limitations of deep learning

Deep neural networks are black boxes of weights and biases trained over a large amount of data to find hidden patterns through inner representations. Finding these patterns by hand would be impossible for humans, and even if it were possible, scalability would be an issue. Every neuron probably has a different weight. Thus, they will have different gradients.

Training happens during backpropagation. Thus, the direction of training is always from the later layers (output/right side) to the early layers (input/left side). This results in the later layers learning very well compared to the early layers. The deeper the network gets, the more the condition deteriorates. This gives rise to two possible problems associated with deep learning, which are:

The vanishing gradient problem
The exploding gradient problem

The vanishing gradient problem

The vanishing gradient problem arises in the training of artificial neural networks when the neurons in the early layers are not able to learn because the gradients that train their weights shrink toward zero. This happens due to the greater depth of the neural network, along with activation functions whose derivatives take small values.

Try the following steps:

Create one hidden layer neural network
Add more hidden layers, one by one

We observe the gradient with regards to all the nodes, and find that the gradient values get relatively smaller when we move from the later layers to the early layers. This condition worsens with the further addition of layers. This shows that the early layer neurons are learning slowly compared to the later layer neurons. This condition is called the vanishing gradient problem.

The exploding gradient problem

The exploding gradient problem arises in the training of artificial neural networks when the learning of the neurons in the early layers diverges because the gradients become so large that they cause severe changes in the weights, preventing convergence. This generally happens if the weights are not initialized properly.

While following the steps mentioned for the vanishing gradient problem, we may instead observe that the gradients explode in the early layers, that is, they become larger. The phenomenon of the early layers diverging is called the exploding gradient problem.
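Both effects can be reproduced with a deliberately simplified model: a chain of one-neuron sigmoid layers that all share the same weight and bias. By the chain rule, the gradient of the output with respect to the input is the product of one factor, w times the sigmoid derivative, per layer. The sketch below is plain Python, and the function name and the specific weight and bias values are assumptions picked to make each effect visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(x, w, b, depth):
    """d(output)/d(input) for a chain of `depth` one-neuron sigmoid layers."""
    a, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(w * a + b)
        grad *= w * a * (1.0 - a)  # chain rule factor: w * sigmoid'(z)
    return grad

for depth in (1, 5, 10):
    small = input_gradient(0.5, w=1.0, b=0.0, depth=depth)   # factor below 0.25 per layer
    large = input_gradient(0.5, w=8.0, b=-4.0, depth=depth)  # factor of 2 per layer
    print(depth, small, large)
```

With w = 1 each layer multiplies the gradient by at most 0.25 (the maximum of the sigmoid derivative), so it vanishes with depth. With w = 8 and b = -4 every pre-activation is 0, each layer multiplies the gradient by 8 × 0.25 = 2, and it grows exponentially instead.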

Overcoming the limitations of deep learning

These two possible problems can be overcome by:

Minimizing the use of the sigmoid and tanh activation functions
Using a momentum-based stochastic gradient descent
Proper initialization of weights and biases, such as Xavier initialization
Regularization (add regularization loss along with data loss and minimize that)
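As an example of the initialization point in the list above, here is a minimal plain Python sketch of the uniform variant of Xavier (Glorot) initialization, which draws each weight from a range that depends on the number of inputs and outputs of the layer. The function name and the layer sizes are assumptions for illustration; in practice you would normally rely on the initializer provided by your deep learning library.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Weights drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)
weights = xavier_uniform(fan_in=256, fan_out=128, rng=rng)
limit = math.sqrt(6.0 / (256 + 128))
print(len(weights), len(weights[0]), limit)
```

Scaling the range by the layer size keeps the magnitude of the signals roughly comparable from layer to layer, which is what helps against both vanishing and exploding gradients.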

For more detail, along with mathematical representations of the vanishing and exploding gradients, you can read this article: Intelligent Signals: Unstable Deep Learning. Why and How to solve them?

Reinforcement learning

Reinforcement learning is a branch of artificial intelligence that deals with an agent that perceives the environment in the form of state spaces and action spaces, and acts on the environment, thereby landing in a new state and receiving a reward as feedback for that action. This received reward is assigned to the new state. Just as we had to minimize the cost function in order to train our neural network, here the reinforcement learning agent has to maximize the overall reward to find the optimal policy to solve a particular task.

How is this different from supervised and unsupervised learning?

In supervised learning, the training dataset has input features, X, and their corresponding output labels, Y. A model is trained on this training dataset, to which test cases having input features, X', are given as the input and the model predicts Y'.

In unsupervised learning, input features, X, of the training set are given for the training purpose. There are no associated Y values. The goal is to create a model that learns to segregate the data into different clusters by understanding the underlying pattern and thereby, classifying them to find some utility. This model is then further used for the input features X' to predict their similarity to one of the clusters.

Reinforcement learning is different from both supervised and unsupervised learning. Reinforcement learning can guide an agent on how to act in the real world. The interface is broader than the training vectors used in supervised or unsupervised learning: here, it is the entire environment, which can be the real or a simulated world. Agents are trained in a different way, where the objective is to reach a goal state, unlike the case of supervised learning, where the objective is to maximize the likelihood or minimize the cost.

Reinforcement learning agents automatically receive feedback, that is, rewards from the environment, unlike in supervised learning, where labeling requires time-consuming human effort. One of the bigger advantages of reinforcement learning is that phrasing any task's objective in the form of a goal helps in solving a wide variety of problems. For example, the goal of a video game agent would be to win the game by achieving the highest score. This also helps in discovering new approaches to achieving the goal. For example, when AlphaGo became the world champion in Go, it found new, unique ways of winning.

A reinforcement learning agent is like a human. Humans evolved very slowly; an agent also learns through reinforcement, but it can do so very fast. As far as sensing the environment is concerned, neither humans nor artificial intelligence agents can sense the entire world at once. The perceived environment creates a state in which agents perform actions and land in a new state, that is, a newly perceived environment different from the earlier one. This creates a state space that can be finite or infinite.

The largest sector interested in this technology is defense. Can reinforcement learning agents replace soldiers that not only walk, but fight, and make important decisions?

Basic terminologies and conventions

The following are the basic terminologies associated with reinforcement learning:

Agent: The entity we create by programming so that it is able to sense the environment, perform actions, receive feedback, and try to maximize rewards.
Environment: The world where the agent resides. It can be real or simulated.
State: The perception or configuration of the environment that the agent senses. State spaces can be finite or infinite.
Rewards: Feedback the agent receives after any action it has taken. The goal of the agent is to maximize the overall reward, that is, the immediate and the future reward. Rewards are defined in advance. Therefore, they must be created properly to achieve the goal efficiently.
Actions: Anything that the agent is capable of doing in the given environment. Action space can be finite or infinite.
SAR triple: (state, action, reward) is referred to as the SAR triple, represented as (s, a, r).
Episode: Represents one complete run of the whole task.

Let's deduce the convention shown in the following diagram:

Every task is a sequence of SAR triples. We start from state S(t), perform action A(t), thereby receive a reward R(t+1), and land in a new state S(t+1). The current state and action pair gives the reward for the next step. Since S(t) and A(t) result in S(t+1), we have a new triple of (current state, action, new state), that is, [S(t), A(t), S(t+1)] or (s, a, s').
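The convention can be illustrated with a toy environment written in plain Python. Everything here is hypothetical: CorridorEnv is a made-up four-state corridor (it is not an OpenAI Gym environment), the reward scheme is invented, and the policy simply always moves right. Each recorded step stores the state, the action, the reward that follows, and the new state.

```python
class CorridorEnv:
    """Hypothetical toy environment: states 0..3, reaching state 3 ends the episode."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """action: 0 = left, 1 = right. Returns (new_state, reward, done)."""
        move = 1 if action == 1 else -1
        self.state = min(3, max(0, self.state + move))
        done = self.state == 3
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def run_episode(env, policy):
    """Collect (s, a, r, s') for one complete run of the task."""
    trajectory = []
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory

episode = run_episode(CorridorEnv(), policy=lambda s: 1)  # always move right
print(episode)  # [(0, 1, 0.0, 1), (1, 1, 0.0, 2), (2, 1, 1.0, 3)]
```

Dropping the last element of each tuple gives the SAR triples, and dropping the reward gives the (s, a, s') triples.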

Optimality criteria

The optimality criteria are a measure of the goodness of fit of the model created over the data. For example, in supervised classification learning algorithms, we have maximum likelihood as the optimality criterion. Thus, the optimality criteria differ on the basis of the problem statement and objective. In reinforcement learning, our major goal is to maximize the future rewards. Therefore, we have two different optimality criteria, which are:

Value function: To quantify a state on the basis of future probable rewards
Policy: To guide an agent on what action to take in a given state

We will discuss both of them in detail in the coming topics.

The value function for optimality

Agents should be able to think about both immediate and future rewards. Therefore, a value is assigned to each encountered state that reflects this future information too. This is called the value function. Here comes the concept of delayed rewards, where actions taken in the present lead to potential rewards in the future.

V(s), that is, the value of the state, is defined as the expected value of the rewards to be received in the future for all the actions taken from this state to subsequent states until the agent reaches the goal state. Basically, value functions tell us how good it is to be in this state. The higher the value, the better the state.

The reward assigned to each (s, a, s') triple is fixed. This is not the case with the value of the state; it is subject to change with every action in the episode and with different episodes too.

One alternative comes to mind: instead of the value function, why don't we store the knowledge of every possible state?

The answer is simple: it's time-consuming and expensive, and this cost grows exponentially. Therefore, it's better to store the knowledge of the current state, that is, V(s):

V(s) = E[all future rewards discounted | S(t)=s]
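One simple way to read this definition is as an average of discounted returns over episodes that start in state s. The plain Python sketch below illustrates that reading; the discount factor name gamma, the function names, and the two reward sequences are assumptions made for the example, not data from a real task.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma ** k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def estimate_value(reward_sequences, gamma):
    """Average discounted return over several episodes that start in the same state."""
    returns = [discounted_return(rewards, gamma) for rewards in reward_sequences]
    return sum(returns) / len(returns)

# Two hypothetical episodes observed from the same starting state
episodes_from_s = [
    [0.0, 0.0, 1.0],  # reward arrives after three steps
    [1.0],            # reward arrives immediately
]
print(discounted_return(episodes_from_s[0], gamma=0.9))  # approximately 0.81
print(estimate_value(episodes_from_s, gamma=0.9))        # approximately 0.905
```

The delayed reward of 1 contributes only about 0.81 to the value, which shows how discounting makes rewards further in the future count for less.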

More details on the value function will be covered in Chapter 3, The Markov Decision Process and Partially Observable MDP.

The policy model for optimality

Policy is defined as the model that guides the agent with action selection in different states. Policy is denoted as $\pi$. $\pi(a \mid s)$ is basically the probability of a certain action given a particular state:

$$\pi(a \mid s) = P(A(t) = a \mid S(t) = s)$$

Thus, a policy map provides the set of probabilities of different actions given a particular state. The policy, along with the value function, creates a solution that helps the agent navigate as per the policy and the calculated value of the state.
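A policy map of this kind can be sketched as a lookup table of action probabilities per state. The plain Python example below uses made-up state names, action names, and probabilities purely for illustration, and samples actions according to those probabilities.

```python
import random

# Hypothetical policy map: for each state, a probability for every action
policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.5},
}

def action_probability(policy, state, action):
    """Probability of taking `action` in `state` under the policy."""
    return policy[state][action]

def sample_action(policy, state, rng):
    """Draw one action according to the policy's probabilities for this state."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [sample_action(policy, "s0", rng) for _ in range(1000)]
print(action_probability(policy, "s0", "right"))
print(samples.count("right"), samples.count("left"))
```

Over many samples, the share of each action approaches its probability in the table, so "right" is chosen far more often than "left" in state s0.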

The Q-learning approach to reinforcement learning

Q-learning is an attempt to learn the value Q(s,a) of taking a specific action in a particular state. Consider a table where the rows represent the states and the columns represent the actions. This is called a Q-table. Thus, we have to learn these values to find which action is the best for the agent in a given state.

Steps involved in Q-learning:

Initialize the table of Q(s,a) with uniform values (say, all zeros).
Observe the current state, s
Choose an action, a, by epsilon greedy or any other action selection policies, and take the action
As a result, a reward, r, is received and a new state, s', is perceived
Update the Q value of the (s,a) pair in the table by using the following Bellman equation:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

where $\gamma$ is the discounting factor

Then, set the new state as the current state and repeat the process to complete one episode, that is, until the agent reaches the terminal state
Run multiple episodes to train the agent

To simplify, we can say that the Q-value for a given state, s, and action, a, is updated to the sum of the current reward, r, and the discounted ($\gamma$) maximum Q-value for the new state among all its actions. The discount factor reduces the weight of future rewards compared to present rewards. For example, a reward of 100 today is worth more than a reward of 100 received in the future. Therefore, we discount the future rewards. Repeating this update process continuously results in the Q-table values converging to accurate measures of the expected future reward for a given action in a given state.
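The steps above can be sketched as a small tabular Q-learning loop in plain Python. The environment is a hypothetical four-state corridor with deterministic moves and a reward of 1 for reaching the last state, and the update overwrites Q(s,a) with the reward plus the discounted maximum Q-value of the new state, as described in this section. Practical implementations usually blend the old and new values with a learning rate, which is omitted here because the toy environment is deterministic. All names and parameter values are assumptions for illustration.

```python
import random

N_STATES, GOAL = 4, 3
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(state, action):
    """Deterministic toy dynamics: returns (new_state, reward, done)."""
    nxt = min(GOAL, max(0, state + (1 if action == 1 else -1)))
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking among the best actions."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])

def train(episodes=200, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]    # Q-table initialized with zeros
    for _ in range(episodes):
        s, done = 0, False                       # observe the starting state
        while not done:
            a = choose_action(q, s, epsilon, rng)
            s_next, r, done = step(s, a)         # receive reward, perceive new state
            q[s][a] = r + gamma * max(q[s_next])  # update from the Bellman equation
            s = s_next                           # the new state becomes the current one
    return q

q_table = train()
for s in range(GOAL):
    print(s, q_table[s])
```

After training, moving right has the higher Q-value in every non-terminal state, and the values shrink by a factor of gamma for each additional step away from the goal.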

When the volume of the state and action spaces increases, maintaining a Q-table becomes difficult. In the real world, the state spaces can be infinitely large. Thus, there's a need for another approach that can produce Q(s,a) without a Q-table. One solution is to replace the Q-table with a function. This function takes the state as input in the form of a vector and outputs the vector of Q-values for all the actions in the given state. This function approximator can be represented by a neural network that predicts the Q-values. Thus, we can add more layers and fit a deep neural network for better prediction of Q-values when the state and action spaces become large, which seemed impossible with a Q-table. This gives rise to the Q-network, and if a deeper neural network, such as a convolutional neural network, is used, it results in a deep Q-network (DQN).

More details on Q-learning and deep Q-networks will be covered in Chapter 5, Q-Learning and Deep Q-Networks.

Asynchronous advantage actor-critic

The A3C algorithm was published in June 2016 by the combined team of Google DeepMind and MILA. It is simpler and has a lighter framework that uses asynchronous gradient descent to optimize the deep neural network. It is faster and was able to show good results on a multi-core CPU instead of a GPU. One of A3C's big advantages is that it can work on continuous as well as discrete action spaces. As a result, it has opened the gateway for many new challenging problems that have complex state and action spaces.

We will discuss it at a high level here, but we will dig deeper in Chapter 6, Asynchronous Methods. Let's start with the name, that is, asynchronous advantage actor-critic (A3C), and unpack it to get a basic overview of the algorithm:

Asynchronous: In DQN, you may remember, we used a neural network with our agent to predict actions. This means there is one agent and it's interacting with one environment. What A3C does is create multiple copies of the agent-environment pair to make the agent learn more efficiently. A3C has a global network and multiple worker agents, where each agent has its own set of network parameters and each of them interacts with its own copy of the environment simultaneously, without interacting with another agent's environment. The reason this works better than a single agent is that the experience of each agent is independent of the experience of the other agents. Thus, the overall experience from all the worker agents results in diverse training.
Actor-critic: Actor-critic combines the benefits of both value iteration and policy iteration. Thus, the network will estimate both a value function, V(s), and a policy, π(s), for a given state, s. There will be two separate fully connected layers at the top of the function approximator neural network that will output the value and policy of the state, respectively. The agent uses the value, which acts as a critic, to update the policy, that is, the intelligent actor.
Advantage: Policy gradients use discounted returns to tell the agent whether an action was good or bad. Replacing them with the advantage not only quantifies how good or bad the action was but also helps in encouraging and discouraging actions better (we will discuss this in Chapter 4, Policy Gradients).
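To give a feel for the advantage, the following plain Python sketch uses one common estimate: the discounted return observed from a state minus the critic's value estimate for that state. This particular estimate, the function names, and the numbers are assumptions made for illustration; the details of what A3C actually computes are left to the later chapters.

```python
def discounted_returns(rewards, gamma):
    """Discounted return from every time step of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def advantages(rewards, values, gamma):
    """Advantage estimate per step: observed discounted return minus critic value."""
    return [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]

rewards = [0.0, 0.0, 1.0]        # hypothetical rewards collected by the actor
critic_values = [0.5, 0.9, 0.7]  # hypothetical V(s) predicted by the critic
adv = advantages(rewards, critic_values, gamma=0.9)
print(adv)  # approximately [0.31, 0.0, 0.3]
```

A positive advantage means the action turned out better than the critic expected and should be encouraged, a negative one means it turned out worse, and a value near zero means the outcome matched the expectation.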

Introduction to TensorFlow and OpenAI Gym

TensorFlow is a mathematical library created by the Google Brain team at Google. Thanks to its dataflow programming model, it is heavily used as a deep learning library in both the research and development sectors. Since its release in 2015, TensorFlow has grown a very big community.

OpenAI Gym is a reinforcement learning playground created by the team at OpenAI with an aim to provide a simple interface, since creating an environment is itself a tedious task in reinforcement learning. It provides a good list of environments to test your reinforcement learning algorithms in so that you can benchmark them.

Basic computations in TensorFlow

The base of TensorFlow is the computational graph, which we discussed earlier in this chapter, and tensors. A tensor is an n-dimensional array. Thus, a scalar, a vector, and a matrix are all tensors. Here, we will try some of the basic computations to start with TensorFlow. Please try to implement this section in a Python environment such as Jupyter Notebook.

For the TensorFlow installation and dependencies please refer to the following link:

https://www.tensorflow.org/install/

Import tensorflow by the following command:

import tensorflow as tf

tf.zeros() and tf.ones() are some of the functions that instantiate basic tensors. tf.zeros() takes a tensor shape (for example, an integer or a tuple) and returns a tensor of that shape with all the values being zero. Similarly, tf.ones() takes a tensor shape but returns a tensor of that shape containing only ones. Try the following commands in the Python shell to create a tensor:

>>> tf.zeros(3)

<tf.Tensor 'zeros:0' shape=(3,) dtype=float32>

>>> tf.ones(3)

<tf.Tensor 'ones:0' shape=(3,) dtype=float32>

As you can see, TensorFlow returns a reference to the tensor and not the value of the tensor. In order to get the value, we can use the eval() method of the tensor object or the run() method of a session, both within a running session, as follows:

>>> a = tf.zeros(3)
>>> with tf.Session() as sess:
        sess.run(a)
        a.eval()

array([0., 0.,0.], dtype=float32)

array([0., 0.,0.], dtype=float32)

Next come the tf.fill() and tf.constant() methods to create a tensor of a certain shape and value:

>>> a = tf.fill((2,2),value=4.)
>>> b = tf.constant(4.,shape=(2,2))
>>> with tf.Session() as sess:
        sess.run(a)
        sess.run(b)

array([[ 4., 4.],
[ 4., 4.]], dtype=float32)

array([[ 4., 4.],
[ 4., 4.]], dtype=float32)

Next, we have functions that can randomly initialize a tensor. Among them, the most frequently used ones are:

tf.random_normal(): Samples random values from the normal distribution with the specified mean and standard deviation
tf.random_uniform(): Samples random values from the uniform distribution over the specified range

>>> a = tf.random_normal((2,2),mean=0,stddev=1)
>>> b = tf.random_uniform((2,2),minval=-3,maxval=3)
>>> with tf.Session() as sess:
        sess.run(a)
        sess.run(b)

array([[-0.31790468, 1.30740941],
[-0.52323157, -0.2980336 ]], dtype=float32)

array([[ 1.38419437, -2.91128755],
[-0.80171156, -0.84285879]], dtype=float32)

Variables in TensorFlow are holders for tensors and are defined by the function tf.Variable():

>>> a = tf.Variable(tf.ones((2,2)))
>>> a

<tf.Variable 'Variable:0' shape=(2, 2) dtype=float32_ref>

Evaluating a variable fails unless it has been explicitly initialized, which is done by running tf.global_variables_initializer() within a session:

>>> a = tf.Variable(tf.ones((2,2)))
>>> with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        a.eval()

array([[ 1., 1.],
[ 1., 1.]], dtype=float32)

Next in the queue, we have matrices. Identity matrices are square matrices with ones in the diagonal and zeros elsewhere. An identity matrix can be created with the function tf.eye():

>>> identity = tf.eye(4)  # size of the square matrix = 4
>>> with tf.Session() as sess:
        sess.run(identity)

array([[ 1., 0., 0., 0.],
[ 0., 1., 0., 0.],
[ 0., 0., 1., 0.],
[ 0., 0., 0., 1.]], dtype=float32)

Similarly, a diagonal matrix has the given values on the diagonal and zeros elsewhere. It can be created with tf.diag(), as shown here:

>>> a = tf.range(1,5,1)
>>> md = tf.diag(a)
>>> mdn = tf.diag([1,2,5,3,2])
>>> with tf.Session() as sess:
        sess.run(md)
        sess.run(mdn)

array([[1, 0, 0, 0],
[0, 2, 0, 0],
[0, 0, 3, 0],
[0, 0, 0, 4]], dtype=int32)

array([[1, 0, 0, 0, 0],
[0, 2, 0, 0, 0],
[0, 0, 5, 0, 0],
[0, 0, 0, 3, 0],
[0, 0, 0, 0, 2]], dtype=int32)

We use the tf.transpose() function to transpose a given matrix, as shown here:

>>> a = tf.ones((2,3))
>>> b = tf.transpose(a)
>>> with tf.Session() as sess:
        sess.run(a)
        sess.run(b)

array([[ 1., 1., 1.],
[ 1., 1., 1.]], dtype=float32)

array([[ 1., 1.],
[ 1., 1.],
[ 1., 1.]], dtype=float32)

The next matrix operation is matrix multiplication, which is done by the function tf.matmul(). The number of columns of the first matrix must equal the number of rows of the second:

>>> a = tf.ones((3,2))
>>> b = tf.ones((2,4))
>>> c = tf.matmul(a,b)
>>> with tf.Session() as sess:
        sess.run(a)
        sess.run(b)
        sess.run(c)

array([[ 1., 1.],
[ 1., 1.],
[ 1., 1.]], dtype=float32)

array([[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]], dtype=float32)

array([[ 2., 2., 2., 2.],
[ 2., 2., 2., 2.],
[ 2., 2., 2., 2.]], dtype=float32)
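The shape rule behind this result can be sketched in plain Python. The matmul function below is a hypothetical stand-in written for illustration, not TensorFlow's implementation: it multiplies an (n, k) matrix by a (k, m) matrix and rejects operands whose inner dimensions differ.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists of rows.

    Illustrative helper only: a has shape (n, k), b has shape (k, m),
    and the result has shape (n, m).
    """
    n, k = len(a), len(a[0])
    k2, m = len(b), len(b[0])
    if k != k2:
        raise ValueError("inner dimensions differ: %d vs %d" % (k, k2))
    # Entry (i, j) is the dot product of row i of a and column j of b.
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

a = [[1.0] * 2 for _ in range(3)]  # shape (3, 2), all ones
b = [[1.0] * 4 for _ in range(2)]  # shape (2, 4), all ones
c = matmul(a, b)                   # shape (3, 4)
print(c)
```

Each entry of c sums two products of ones, which is why the TensorFlow output above is filled with 2.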

Reshaping a tensor from one shape to another is done by using the tf.reshape() function, as shown here:

>>> a = tf.ones((2,4)) # initial shape is (2,4)
>>> b = tf.reshape(a,(8,)) # reshape to a vector of size 8, so the shape is (8,)
>>> c = tf.reshape(a,(2,2,2)) # reshape tensor a to shape (2,2,2)
>>> d = tf.reshape(b,(2,2,2)) # reshape tensor b to shape (2,2,2)
# Thus, tensors c and d hold the same values
>>> with tf.Session() as sess:
        sess.run(a)
        sess.run(b)
        sess.run(c)
        sess.run(d)

array([[ 1., 1., 1., 1.],
[ 1., 1., 1., 1.]], dtype=float32)

array([ 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32)

array([[[ 1., 1.],
[ 1., 1.]],
[[ 1., 1.],
[ 1., 1.]]], dtype=float32)

array([[[ 1., 1.],
[ 1., 1.]],
[[ 1., 1.],
[ 1., 1.]]], dtype=float32)
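Because a tensor of ones hides the ordering of elements, the following plain-Python sketch repeats the experiment with distinct values. It assumes row-major ordering (elements are read and refilled row by row), and the flatten and reshape helpers are hypothetical names written for illustration rather than TensorFlow code.

```python
def flatten(nested):
    """Flatten arbitrarily nested lists into one list, in row-major order."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat

def reshape(nested, shape):
    """Rebuild nested lists of the requested shape from the flattened values."""
    flat = flatten(nested)
    size = 1
    for dim in shape:
        size *= dim
    if size != len(flat):
        raise ValueError("cannot reshape %d values into %r" % (len(flat), shape))

    def build(values, dims):
        if len(dims) == 1:
            return list(values)
        step = len(values) // dims[0]  # number of values per sub-block
        return [build(values[i * step:(i + 1) * step], dims[1:])
                for i in range(dims[0])]

    return build(flat, list(shape))

a = [[1, 2, 3, 4], [5, 6, 7, 8]]  # shape (2, 4)
b = reshape(a, (8,))              # vector of size 8
c = reshape(a, (2, 2, 2))         # reshaped directly from a
d = reshape(b, (2, 2, 2))         # reshaped from the vector b
print(b)
print(c)
print(c == d)
```

Since both routes pass through the same flattened sequence, c and d come out identical, matching what the session output above shows for the all-ones tensor.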

The flow of computation in TensorFlow is represented as a computational graph, which is an instance of tf.Graph. The graph contains tensor and operation objects, and keeps track of the series of operations and tensors involved. The default instance of the graph can be fetched by tf.get_default_graph():

>>> tf.get_default_graph()

<tensorflow.python.framework.ops.Graph object at 0x7fa3e139b550>
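The idea of building a graph first and computing values later can be sketched in a few lines of plain Python. This is a conceptual toy, not how TensorFlow is implemented: the Node and Graph classes and the run function are hypothetical names chosen for illustration, with run playing the role that a session plays in the examples above.

```python
class Node:
    """A graph node: an operation plus the nodes that feed it."""

    def __init__(self, name, op, inputs=()):
        self.name = name
        self.op = op
        self.inputs = tuple(inputs)

    def __repr__(self):
        return "<Node %r>" % self.name

class Graph:
    """Keeps track of every node that has been created."""

    def __init__(self):
        self.nodes = []

    def _add(self, node):
        self.nodes.append(node)
        return node

    def constant(self, name, value):
        return self._add(Node(name, lambda: value))

    def add(self, name, x, y):
        return self._add(Node(name, lambda p, q: p + q, (x, y)))

def run(node):
    """Evaluate a node by first evaluating the nodes it depends on."""
    return node.op(*(run(dep) for dep in node.inputs))

graph = Graph()
x = graph.constant("x", 2.0)
y = graph.constant("y", 3.0)
z = graph.add("z", x, y)

print(z)       # a reference to the node, not its value
print(run(z))  # the value is computed only when the node is run
```

Printing z shows only a reference, just as printing a tensor does; the sum is computed when run is called on it.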

We will explore complex operations, the creation of neural networks, and much more in TensorFlow in the coming chapters.

An introduction to OpenAI Gym

The OpenAI Gym, created by the team at OpenAI, is a playground of different environments where you can develop and compare your reinforcement learning algorithms. It is compatible with deep learning libraries such as TensorFlow and Theano.

OpenAI Gym consists of two parts:

The gym open-source library: This consists of many environments for different test problems where you can test your reinforcement learning algorithms. Each environment provides the information about its state and action spaces.
The OpenAI Gym service: This allows you to compare the performance of your agent with other trained agents.
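To give a feel for how an agent interacts with such an environment, here is a minimal sketch of the interaction loop. It does not import gym: CorridorEnv is a hypothetical toy environment written for illustration, and its reset() and step() methods are assumed to mirror the classic Gym convention of returning an observation, a reward, a done flag, and an info dictionary. The installed version of the library may differ, so treat the interface as an assumption.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment (not part of Gym).

    The agent starts at cell 0 of a corridor and receives a reward of 1
    when it reaches the last cell. Actions: 0 moves left, 1 moves right.
    """

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps
        self.n_actions = 2

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        move = 1 if action == 1 else -1
        # Clamp the position so the agent stays inside the corridor.
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done, {}

def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        state, reward, done, _ = env.step(policy(state))
        total_reward += reward
    return total_reward

rng = random.Random(0)

def random_policy(state):
    return rng.randrange(2)  # pick an action at random

def always_right(state):
    return 1  # fixed policy

print(run_episode(CorridorEnv(), always_right))
print(run_episode(CorridorEnv(), random_policy))
```

The two policies correspond to the fixed and random action choices described at the start of this chapter; a learned policy would slot into the same loop in place of either function.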

For the installation and dependencies, please refer to the following link:

https://gym.openai.com/docs/

With the basics covered, we can now start with the implementation of reinforcement learning using the OpenAI Gym in the next chapter, Chapter 2, Training Reinforcement Learning Agents using OpenAI Gym.

The pioneers and breakthroughs in reinforcement learning

Before diving into the coding, let's shed some light on some of the pioneers, industry leaders, and research breakthroughs in the field of deep reinforcement learning.

David Silver

Dr. David Silver, with an h-index of 30, heads the reinforcement learning research team at Google DeepMind and is the lead researcher on AlphaGo. David co-founded Elixir Studios and then completed his PhD in reinforcement learning at the University of Alberta, where he co-introduced the algorithms used in the first master-level 9x9 Go programs. After this, he became a lecturer at University College London. He consulted for DeepMind before joining full-time in 2013. David led the AlphaGo project, which became the first program to defeat a top professional player in the game of Go.

Pieter Abbeel

Pieter Abbeel is a professor at UC Berkeley and was a Research Scientist at OpenAI. Pieter completed his PhD in Computer Science under Andrew Ng. His current research focuses on robotics and machine learning, with a particular focus on deep reinforcement learning, deep imitation learning, deep unsupervised learning, meta-learning, learning-to-learn, and AI safety. Pieter also won the NIPS 2016 Best Paper Award.

Google DeepMind

Google DeepMind is a British artificial intelligence company founded in September 2010 and acquired by Google in 2014. It is an industry leader in the domains of deep reinforcement learning and neural Turing machines. It made news in 2016 when the AlphaGo program defeated Lee Sedol, a 9-dan Go player. Google DeepMind has focused on two big sectors: energy and healthcare.

Here are some of its projects:

In July 2016, Google DeepMind and Moorfields Eye Hospital announced their collaboration to use eye scans to research early signs of diseases leading to blindness
In August 2016, Google DeepMind announced its collaboration with University College London Hospital to research and develop an algorithm to automatically differentiate between healthy and cancerous tissues in head and neck areas
Google DeepMind AI reduced Google's data center cooling bill by 40%

The AlphaGo program

As mentioned in the previous section on Google DeepMind, AlphaGo is a computer program that first defeated Lee Sedol and then Ke Jie, who at the time was the world No. 1 in Go. In 2017, an improved version, AlphaGo Zero, was launched that defeated AlphaGo 100 games to 0.

Libratus

Libratus is an artificial intelligence computer program designed by the team led by Professor Tuomas Sandholm at Carnegie Mellon University to play poker. Libratus, whose name is Latin for balanced, is the successor to an earlier program, Claudico.

In January 2017, it made history by defeating four of the world's best professional poker players in a marathon 20-day poker competition.

Though Libratus focuses on playing poker, its designers mentioned its ability to learn any game that has incomplete information and where opponents are engaging in deception. As a result, they have proposed that the system can be applied to problems in cybersecurity, business negotiations, or medical planning domains.
