Tracking down what to measure and deciding how to measure it
We will now tackle the tough task of finding the factors that can make a system go wrong.
The model built in the previous chapters can be summed up as follows:
The process starts from lv, the availability vector (the capacity in a warehouse, for example), and produces R, the reward matrix, from the raw data (Chapter 2, Building a Reward Matrix – Designing Your Datasets) required by the MDP reinforcement learning program (Chapter 1, Getting Started with Next-Generation Artificial Intelligence through Reinforcement Learning). As described in the previous chapter, a softmax(lv) function is applied to lv; a one-hot(softmax(lv)) function is then applied to the result, which is finally converted into the reward value R used by the Q (Q-learning) algorithm.
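The following is a minimal sketch of that lv-to-R conversion, assuming lv is a NumPy vector of warehouse availabilities; the helper names, the example values, and the reward of 100 for the selected location are illustrative assumptions, not the book's exact implementation:

import numpy as np

def softmax(lv):
    """Convert the availability vector into a probability distribution."""
    exp_lv = np.exp(lv - np.max(lv))   # subtract the max for numerical stability
    return exp_lv / exp_lv.sum()

def one_hot(probabilities):
    """Set the highest-probability entry to 1 and all other entries to 0."""
    encoded = np.zeros_like(probabilities)
    encoded[np.argmax(probabilities)] = 1
    return encoded

# Example availability vector (e.g., remaining capacity at six locations)
lv = np.array([10.0, 2.0, 1.0, 4.0, 3.0, 5.0])

probabilities = softmax(lv)        # softmax(lv)
choice = one_hot(probabilities)    # one-hot(softmax(lv))
reward_value = choice * 100        # reward value inserted into the reward matrix R

print(probabilities, choice, reward_value)

Running this prints the probability distribution, the one-hot selection, and the resulting reward value, which is the value written into R before the Q-learning program reads it.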
The MDP-driven Bellman equation then runs, from reading R (the reward matrix) through to producing the results. Gamma is the learning parameter, Q is the Q-learning function, and...