In this section, we are going to analyze a strategy for finding an optimal policy based on complete knowledge of the environment (in terms of transition probabilities and expected returns). The first step is to define a method that can be employed to build a greedy policy. Let's suppose we're working with a finite MDP and a generic policy, π; we can define the intrinsic value of a state, s_t, as the expected discounted return obtained by the agent starting from s_t and following the stochastic policy, π:
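In its standard form (with γ ∈ [0, 1] denoting the discount factor and r_{t+k+1} the reward collected k + 1 steps after t), this definition can be written as:

V^{\pi}(s_t) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t \right]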
In this case, we are assuming that, since the agent follows π, state s_a is more useful than s_b if the expected return starting from s_a is greater than the one obtained starting from s_b. Unfortunately, trying to directly find the value of each state using the previous definition is almost impossible when γ >...
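As a purely illustrative sketch (not taken from the text), the following Python snippet estimates V^π(s) for a tiny, hypothetical two-state MDP by sampling episodes and averaging their truncated discounted returns. The transition probabilities, rewards, and policy are invented for the example; the brute-force averaging it performs hints at why a more structured approach to computing state values is needed.

```python
# A minimal sketch: Monte Carlo estimation of V^pi(s) for a toy, hypothetical
# 2-state / 2-action MDP. All probabilities, rewards, and the policy below are
# made up purely for illustration.
import random

states = [0, 1]
actions = [0, 1]
gamma = 0.9

# P[s][a] -> list of (next_state, probability); R[s][a][s'] -> expected reward
P = {
    0: {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.9), (1, 0.1)]},
}
R = {
    0: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}},
    1: {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 0.0}},
}
# Stochastic policy pi(a|s): probability of each action in each state
pi = {0: [0.5, 0.5], 1: [0.3, 0.7]}


def sample_return(s, horizon=200):
    """Play one episode from s following pi, accumulating the discounted return."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(actions, weights=pi[s])[0]
        next_states, probs = zip(*P[s][a])
        s_next = random.choices(next_states, weights=probs)[0]
        g += discount * R[s][a][s_next]
        discount *= gamma
        s = s_next
    return g


def estimate_value(s, episodes=5000):
    """Monte Carlo estimate of V^pi(s): the average of sampled discounted returns."""
    return sum(sample_return(s) for _ in range(episodes)) / episodes


for s in states:
    print(f"V^pi({s}) ~ {estimate_value(s):.3f}")
```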