4.3 Policy Iteration

The goal is to improve using , once we get we can do it again.

Each policy is guaranteed to be a strict improvement over the previous one.

Figure4.3

If you are using a finite MDP, this process must converge to an optimal policy.

This algorithm is a way of finding a optimal policy is called policy iteration. It is worth noting that the policy iteration increases greatly the speed of policy evaluation. Usually, policy iteration converges in few iterations.