5.2 Monte Carlo Estimation of Action Values

If a model is not available

  • state values alone are not enough to determine a policy (without a model we cannot look one step ahead to choose among actions)
  • we therefore have to estimate action values

Let's consider the policy evaluation problem for action values.

  • q_π(s, a) is the expected return when
    • starting in state s
    • taking action a
    • thereafter following policy π
  • MC methods for this are almost the same as the ones presented for state values
    • except that we talk about visits to a state-action pair rather than to a state
  • a state-action pair (s, a) is visited
    • if state s is visited and action a is taken in it
  • The every-visit and first-visit MC methods behave exactly as before
    • but with state-action pairs in place of states
    • they converge quadratically (as before) to the true expected values as the number of visits to each pair approaches infinity (a code sketch follows this list)
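
As a concrete illustration, here is a minimal Python sketch of first-visit MC prediction for action values. The environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`) and the deterministic policy function `pi` are assumptions made for this example, not something specified in the text.

```python
from collections import defaultdict

def mc_action_value_prediction(env, pi, num_episodes, gamma=1.0):
    """First-visit MC estimate of q_pi(s, a), under the assumed env/pi interface."""
    returns_sum = defaultdict(float)   # sum of returns observed per (s, a)
    returns_count = defaultdict(int)   # number of first visits per (s, a)
    q = defaultdict(float)             # current estimate of q_pi(s, a)

    for _ in range(num_episodes):
        # Generate one episode following pi.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = pi(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Record the index of the first visit to each state-action pair.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        # Walk backwards through the episode, accumulating the return G.
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = gamma * g + r
            if first_visit[(s, a)] == t:   # update only on the first visit
                returns_sum[(s, a)] += g
                returns_count[(s, a)] += 1
                q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return q
```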

The only complication is that many state-action pairs may never be visited. If π is a deterministic policy, then in following π we will observe returns for only one of the actions from each state; with no returns to average, the estimates of the other actions never improve. To compare alternatives, we need to estimate the value of all the actions from each state, not just the one we currently favor.

This is the general problem of maintaining exploration.

  • For policy evaluation to work for action values, we must assure continual exploration

    • one solution is
      • to specify that episodes start in a state-action pair
      • every pair has a nonzero probability of being selected as the start
      • this guarantees that all state-action pairs will be visited an infinite number of times, in the limit of an infinite number of episodes
      • we call this the assumption of exploring starts (see the sketch after this list)
  • The assumption of exploring starts can sometimes be useful

  • but it cannot be relied upon in general
    • particularly when learning directly from actual interaction with an environment
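
A sketch of how an episode might be generated under exploring starts. It assumes a hypothetical environment that exposes its state and action sets (`env.states`, `env.actions`) and can be reset into an arbitrary state (`env.reset_to`); these names are illustrative, not a real API.

```python
import random

def generate_episode_es(env, pi):
    # Exploring start: pick the initial state-action pair uniformly at
    # random, so every pair has a nonzero probability of starting an episode.
    state = random.choice(env.states)
    action = random.choice(env.actions)
    env.reset_to(state)   # hypothetical: place the env in the chosen state

    episode = []
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        if not done:
            state = next_state
            action = pi(state)   # after the start, follow pi as usual
    return episode
```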

Another solution is to consider only policies that are stochastic, with a nonzero probability of selecting all actions in each state. Then all state-action pairs can be encountered with nonzero probability; a minimal example is sketched below.
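
One common family of such stochastic policies is ε-greedy (the text does not name it here, so this is an illustrative choice). A minimal sketch, assuming `q` is a mapping from (state, action) pairs to value estimates:

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    # With probability epsilon, explore uniformly at random; otherwise act
    # greedily with respect to the current estimates q[(state, action)].
    # Every action keeps probability at least epsilon / len(actions),
    # so all state-action pairs remain reachable.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])
```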