5.3 Monte Carlo Control
We now consider how Monte Carlo estimation can be used in control, that is, to approximate optimal policies. The overall idea is to proceed in the same way as in DP methods.
- Policy evaluation is done exactly as described in the preceding section:
    - many episodes are experienced, with the approximate action-value function approaching the true function asymptotically
- We use the action-value function instead of the state-value function:
    - no model is needed to construct the greedy policy
    - we simply choose the action with the maximal action-value, as in the sketch below
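As a minimal sketch of this point, assuming the action-value estimates are stored as a dict mapping each state to a dict of action → estimated value (the representation and the name `greedy_policy` are illustrative, not from the text):

```python
# Constructing the greedy policy directly from action-value estimates:
# no model of the environment's dynamics is needed.
def greedy_policy(Q):
    """Return a deterministic policy: state -> action with maximal Q(state, action)."""
    policy = {}
    for state, action_values in Q.items():
        # simply pick the action with the largest estimated action value
        policy[state] = max(action_values, key=action_values.get)
    return policy

# usage with made-up values
Q = {"s0": {"left": 0.1, "right": 0.4}, "s1": {"left": 0.7, "right": 0.2}}
print(greedy_policy(Q))  # {'s0': 'right', 's1': 'left'}
```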
Policy improvement can be done by constructing each $\pi_{k+1}$ as the greedy policy with respect to $q_{\pi_k}$. The policy improvement theorem then applies to $\pi_k$ and $\pi_{k+1}$ because, for all $s \in \mathcal{S}$,

$$q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}\big(s, \arg\max_a q_{\pi_k}(s, a)\big) = \max_a q_{\pi_k}(s, a) \ge q_{\pi_k}(s, \pi_k(s)) \ge v_{\pi_k}(s).$$
Monte Carlo methods can be used to find optimal policies given only sample episodes and no other knowledge of the environment's dynamics.
There are two assumptions that have to be removed for practical use of Monte Carlo methods:
- episodes have exploring starts
- policy evaluation could require an infinite number of episodes
    - two solutions:
        - the same approach used in DP, i.e., stop the evaluation once the change in the estimates becomes sufficiently small (a check like the one sketched below)
        - give up trying to complete policy evaluation before returning to policy improvement
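As a rough illustration of the first option, a sweep-to-sweep convergence check on the action-value estimates might look as follows (the dict representation and the threshold name `theta` are assumptions for illustration):

```python
# Hedged sketch of the DP-style stopping rule: end policy evaluation once
# the largest change in the action-value estimates between two successive
# sweeps falls below a small threshold theta.
def evaluation_converged(Q_old, Q_new, theta=1e-4):
    keys = set(Q_old) | set(Q_new)
    max_change = max((abs(Q_new.get(k, 0.0) - Q_old.get(k, 0.0)) for k in keys), default=0.0)
    return max_change < theta
```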
- We introduce Monte Carlo ES, for Monte Carlo with Exploring Starts:
    - after each episode, the observed returns are used for policy evaluation
    - the policy is then improved at all the states visited in the episode, as in the sketch below
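Below is a rough sketch of the Monte Carlo ES loop described above, assuming an illustrative environment interface (`env.states`, `env.actions(s)`, `env.step(s, a)`, `env.is_terminal(s)`) and first-visit averaging of returns; these details are assumptions, not specified by the text.

```python
import random
from collections import defaultdict

def monte_carlo_es(env, num_episodes, gamma=1.0):
    """Monte Carlo control with Exploring Starts (rough sketch)."""
    Q = defaultdict(float)         # action-value estimates Q(s, a)
    n_returns = defaultdict(int)   # number of returns averaged per (s, a)
    nonterminal = [s for s in env.states if not env.is_terminal(s)]
    policy = {s: random.choice(env.actions(s)) for s in nonterminal}

    for _ in range(num_episodes):
        # Exploring start: a random (state, action) pair begins the episode.
        s = random.choice(nonterminal)
        a = random.choice(env.actions(s))

        # Generate an episode following the current policy after the start.
        episode = []
        while not env.is_terminal(s):
            next_s, reward = env.step(s, a)
            episode.append((s, a, reward))
            s = next_s
            if not env.is_terminal(s):
                a = policy[s]

        # Policy evaluation: average the observed returns (first-visit).
        G = 0.0
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t = episode[t]
            G = gamma * G + r_t
            if all((s_t, a_t) != (s_i, a_i) for s_i, a_i, _ in episode[:t]):
                n_returns[(s_t, a_t)] += 1
                Q[(s_t, a_t)] += (G - Q[(s_t, a_t)]) / n_returns[(s_t, a_t)]
                # Policy improvement at every state visited in the episode.
                policy[s_t] = max(env.actions(s_t), key=lambda act: Q[(s_t, act)])
    return policy, Q
```

Evaluation and improvement are interleaved after every single episode, which corresponds to the second of the two solutions above: policy evaluation is never completed before improvement resumes.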