5.3 Monte Carlo Control
We now consider how Monte Carlo estimation can be used in control, that is, to approximate optimal policies. The overall idea is to proceed in the same way as in DP methods.
- Policy evaluation is done exactly as described in the preceding section:
    - many episodes are experienced, with the approximate action-value function approaching the true function asymptotically
- We use the action-value function instead of the state-value function:
    - no model is needed to construct the greedy policy
    - we simply choose the action with the maximal action-value, as in the sketch below
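As a minimal sketch of this point, assuming the action-value estimates are stored as a dict mapping each state to a dict of action → estimated value (the representation and the name `greedy_policy` are illustrative, not from the text):

```python
# Constructing the greedy policy directly from action-value estimates:
# no model of the environment's dynamics is needed.
def greedy_policy(Q):
    """Return a deterministic policy: state -> action with maximal Q(state, action)."""
    policy = {}
    for state, action_values in Q.items():
        # simply pick the action with the largest estimated action value
        policy[state] = max(action_values, key=action_values.get)
    return policy

# usage with made-up values
Q = {"s0": {"left": 0.1, "right": 0.4}, "s1": {"left": 0.7, "right": 0.2}}
print(greedy_policy(Q))  # {'s0': 'right', 's1': 'left'}
```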
Policy improvement can be done by constructing each $\pi_{k+1}$ as the greedy policy with respect to $q_{\pi_k}$. The policy improvement theorem then applies to $\pi_k$ and $\pi_{k+1}$ because, for all $s \in \mathcal{S}$,

$$q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}\big(s, \arg\max_a q_{\pi_k}(s, a)\big) = \max_a q_{\pi_k}(s, a) \ge q_{\pi_k}(s, \pi_k(s)) \ge v_{\pi_k}(s).$$
Monte Carlo methods can be used to find optimal policies given only sample episodes and no other knowledge of the environment's dynamics.
There are two assumptions that have to be removed for practical use of Monte Carlo methods:
- episodes have exploring starts
- policy evaluation could require an infinite number of episodes
    - two solutions:
        - the same approach used in DP, i.e., stop the evaluation once the change in the estimates becomes sufficiently small (a check like the one sketched below)
        - give up trying to complete policy evaluation before returning to policy improvement
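As a rough illustration of the first option, a sweep-to-sweep convergence check on the action-value estimates might look as follows (the dict representation and the threshold name `theta` are assumptions for illustration):

```python
# Hedged sketch of the DP-style stopping rule: end policy evaluation once
# the largest change in the action-value estimates between two successive
# sweeps falls below a small threshold theta.
def evaluation_converged(Q_old, Q_new, theta=1e-4):
    keys = set(Q_old) | set(Q_new)
    max_change = max((abs(Q_new.get(k, 0.0) - Q_old.get(k, 0.0)) for k in keys), default=0.0)
    return max_change < theta
```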
- We introduce Monte Carlo ES, for Monte Carlo with Exploring Starts:
    - after each episode, the observed returns are used for policy evaluation
    - the policy is then improved at all the states visited in the episode, as in the sketch below
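Below is a rough sketch of the Monte Carlo ES loop described above, assuming an illustrative environment interface (`env.states`, `env.actions(s)`, `env.step(s, a)`, `env.is_terminal(s)`) and first-visit averaging of returns; these details are assumptions, not specified by the text.

```python
import random
from collections import defaultdict

def monte_carlo_es(env, num_episodes, gamma=1.0):
    """Monte Carlo control with Exploring Starts (rough sketch)."""
    Q = defaultdict(float)         # action-value estimates Q(s, a)
    n_returns = defaultdict(int)   # number of returns averaged per (s, a)
    nonterminal = [s for s in env.states if not env.is_terminal(s)]
    policy = {s: random.choice(env.actions(s)) for s in nonterminal}

    for _ in range(num_episodes):
        # Exploring start: a random (state, action) pair begins the episode.
        s = random.choice(nonterminal)
        a = random.choice(env.actions(s))

        # Generate an episode following the current policy after the start.
        episode = []
        while not env.is_terminal(s):
            next_s, reward = env.step(s, a)
            episode.append((s, a, reward))
            s = next_s
            if not env.is_terminal(s):
                a = policy[s]

        # Policy evaluation: average the observed returns (first-visit).
        G = 0.0
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t = episode[t]
            G = gamma * G + r_t
            if all((s_t, a_t) != (s_i, a_i) for s_i, a_i, _ in episode[:t]):
                n_returns[(s_t, a_t)] += 1
                Q[(s_t, a_t)] += (G - Q[(s_t, a_t)]) / n_returns[(s_t, a_t)]
                # Policy improvement at every state visited in the episode.
                policy[s_t] = max(env.actions(s_t), key=lambda act: Q[(s_t, act)])
    return policy, Q
```

Evaluation and improvement are interleaved after every single episode, which corresponds to the second of the two solutions above: policy evaluation is never completed before improvement resumes.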