4.2 Policy Improvement

Finding the value function of a policy is important, but maybe our policy isn't the best one and needs some improvement. For some state $s$ we would like to know whether or not we should change the policy to deterministically choose an action $a \neq \pi(s)$. One way to answer this is to consider selecting $a$ in $s$ and thereafter following the existing policy $\pi$. The value of behaving this way is

$$q_\pi(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right] = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_\pi(s') \right].$$
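As a concrete illustration, here is a minimal sketch of this one-step lookahead for a tabular MDP. The arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected immediate rewards) are hypothetical names introduced just for this example:

```python
import numpy as np

def q_from_v(P, R, V, s, a, gamma=0.9):
    """One-step lookahead: value of taking action a in state s,
    then following the policy whose state values are V.

    P[s, a, s'] -- transition probabilities, assumed shape [S, A, S]
    R[s, a]     -- expected immediate reward for taking a in s
    V[s']       -- state values v_pi of the current policy
    """
    # q_pi(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) * v_pi(s')
    return R[s, a] + gamma * np.dot(P[s, a], V)
```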

If this value is greater than $v_\pi(s)$, you should change the policy to take the new action $a$ whenever you are in state $s$.

This is a special case of the policy improvement theorem: if $\pi$ and $\pi'$ are any pair of deterministic policies such that, for all states $s$,

$$q_\pi(s, \pi'(s)) \geq v_\pi(s),$$

then $\pi'$ must be as good as, or better than, $\pi$; that is, $v_{\pi'}(s) \geq v_\pi(s)$ for all $s$.
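Applying this criterion in every state gives the greedy policy with respect to $v_\pi$. A minimal sketch, reusing the hypothetical `P`, `R`, and `q_from_v` from the previous example:

```python
def greedy_policy(P, R, V, gamma=0.9):
    """Return the deterministic policy that is greedy w.r.t. V:
    pi'(s) = argmax_a q_pi(s, a)."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        # Evaluate the one-step lookahead value of every action in s.
        q = np.array([q_from_v(P, R, V, s, a, gamma) for a in range(n_actions)])
        pi[s] = int(np.argmax(q))  # ties broken arbitrarily (first maximum)
    return pi
```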

This is stated for deterministic policies, but it can easily be extended to stochastic policies. In particular, the policy improvement theorem carries through as stated for the stochastic case under the natural definition

$$q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s) \, q_\pi(s, a).$$

In addition, if there are ties in the policy improvement step (several actions achieve the maximum), then in the stochastic case we need not select a single action from among them. Instead, each maximizing action can be given a portion of the probability of being selected in the new greedy policy.
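To make the tie-breaking concrete, here is a hedged sketch of a stochastic greedy policy that splits probability equally among all maximizing actions, under the same assumed `P`, `R`, and `q_from_v` as above:

```python
def stochastic_greedy_policy(P, R, V, gamma=0.9, tol=1e-8):
    """Return pi'(a | s) that puts equal probability on every action
    achieving the maximum one-step lookahead value in each state."""
    n_states, n_actions = R.shape
    pi = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = np.array([q_from_v(P, R, V, s, a, gamma) for a in range(n_actions)])
        best = np.isclose(q, q.max(), atol=tol)  # mask of all maximizing actions
        pi[s, best] = 1.0 / best.sum()           # split probability evenly
    return pi
```

Splitting the probability this way is one natural choice; any apportioning scheme among the maximizing actions satisfies the policy improvement theorem equally well.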