5.4 Monte Carlo Control without Exploring Starts

To avoid the assumption of exploring starts, there are two approaches:

  • on-policy methods
    • attempt to evaluate or improve the policy that is used to make decisions
  • off-policy methods
    • evaluate or improve a policy different from the one used to generate the data

The Monte Carlo ES method is an on-policy method.

A soft policy (in on-policy control methods) means

  • $\pi(a|s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$
  • but gradually shifted closer to a deterministic optimal policy
  • $\epsilon$-greedy policies are an example of soft policies (see the sketch below)
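
As a concrete illustration, here is a minimal sketch of $\epsilon$-greedy action selection over a tabular $Q$; the function name `epsilon_greedy_action` and the assumption that `Q[state]` is an array of action values are illustrative, not from the text.

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon, rng=None):
    """Sample an action from an epsilon-greedy (hence epsilon-soft) policy.

    Every action keeps probability at least epsilon / n_actions, and the
    greedy action receives the remaining 1 - epsilon probability mass.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniformly random action
    return int(np.argmax(Q[state]))           # exploit: greedy w.r.t. Q
```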

The overall idea of on-policy MC control is still that of GPI. Since we cannot use exploring starts, we need another way to keep visiting all states and actions. Fortunately, GPI does not require the policy to be greedy, only to move toward a greedy policy. We will therefore fix the policy to be $\epsilon$-soft, i.e. $\pi(a|s) \ge \epsilon/|\mathcal{A}(s)|$ for all states and actions.

  • We adjust the policy toward greediness while keeping a probability $\epsilon$ of acting at random
  • just as $\epsilon$-greedy does for the greedy policy (see the improvement step below)
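
Why this still fits GPI: following the book's argument, any $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement over any $\epsilon$-soft policy $\pi$, because

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_a \pi'(a \mid s)\, q_\pi(s,a) \\
&= \frac{\epsilon}{|\mathcal{A}(s)|}\sum_a q_\pi(s,a) + (1-\epsilon)\max_a q_\pi(s,a) \\
&\ge \frac{\epsilon}{|\mathcal{A}(s)|}\sum_a q_\pi(s,a) + (1-\epsilon)\sum_a \frac{\pi(a \mid s) - \frac{\epsilon}{|\mathcal{A}(s)|}}{1-\epsilon}\, q_\pi(s,a) \\
&= \sum_a \pi(a \mid s)\, q_\pi(s,a) = v_\pi(s),
\end{aligned}
$$

so $v_{\pi'} \ge v_\pi$ by the policy improvement theorem.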

To improve an $\epsilon$-soft policy:

  • Let's consider a new environment which behaves like the original one
    • except that the requirement that policies be $\epsilon$-soft is "moved inside" the environment.
  • In each state, with probability $1-\epsilon$ you behave in the new environment exactly like in the old one.
  • With probability $\epsilon$ you re-pick the action at random and then act like the old environment with that random action (a minimal sketch of such a wrapper follows this list).
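
To make "moved inside the environment" concrete, here is a minimal sketch of such a wrapper, assuming a Gym-style environment with `reset()`, `step(action)`, and an `action_space` supporting `sample()`; the class name `EpsilonSoftEnv` is an illustrative assumption.

```python
import random

class EpsilonSoftEnv:
    """Wrap an environment so that epsilon-soft behaviour happens inside it.

    With probability 1 - epsilon the chosen action is executed as-is;
    with probability epsilon the action is re-picked uniformly at random
    and the old environment is then run with that random action.
    """

    def __init__(self, env, epsilon):
        self.env = env
        self.epsilon = epsilon

    def reset(self):
        return self.env.reset()

    def step(self, action):
        if random.random() < self.epsilon:
            action = self.env.action_space.sample()  # re-pick at random
        return self.env.step(action)                 # then act like the old environment
```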

  • Let $\tilde{v}_*$ and $\tilde{q}_*$ denote the optimal value functions for the new environment.

  • A policy $\pi$ is optimal among $\epsilon$-soft policies if and only if $v_\pi = \tilde{v}_*$; from its definition, $\tilde{v}_*$ is the unique solution to

    $$\tilde{v}_*(s) = (1-\epsilon)\max_a \tilde{q}_*(s,a) + \frac{\epsilon}{|\mathcal{A}(s)|}\sum_a \tilde{q}_*(s,a)$$

When equality holds and the $\epsilon$-soft policy $\pi$ is no longer improved, then we also know that

$$v_\pi(s) = (1-\epsilon)\max_a q_\pi(s,a) + \frac{\epsilon}{|\mathcal{A}(s)|}\sum_a q_\pi(s,a)$$

Those equations are the same except that $\tilde{v}_*$ is replaced by $v_\pi$ (and $\tilde{q}_*$ by $q_\pi$). Since $\tilde{v}_*$ is the unique solution, it must be that $v_\pi = \tilde{v}_*$: policy iteration works for $\epsilon$-soft policies, so we obtain the best policy among the $\epsilon$-soft policies without exploring starts.
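
For reference, the same fixed-point condition written out with the environment dynamics $p(s', r \mid s, a)$, as the book does when arguing uniqueness:

$$
\tilde{v}_*(s) = (1-\epsilon)\max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma \tilde{v}_*(s')\big] + \frac{\epsilon}{|\mathcal{A}(s)|}\sum_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma \tilde{v}_*(s')\big]
$$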

Figure 5.6: on-policy first-visit MC control (for $\epsilon$-soft policies)
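
A minimal Python sketch of the algorithm in that figure (on-policy first-visit MC control for $\epsilon$-soft policies). The Gym-style interface where `env.reset()` returns a hashable state and `env.step(action)` returns `(next_state, reward, done)`, the `max_steps` cap, and the function name are assumptions for illustration; the policy is kept implicitly as the $\epsilon$-greedy policy with respect to the current $Q$.

```python
import numpy as np
from collections import defaultdict

def on_policy_first_visit_mc_control(env, n_actions, num_episodes,
                                     epsilon=0.1, gamma=1.0, max_steps=1000):
    """On-policy first-visit MC control for epsilon-soft policies."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(n_actions))           # action-value estimates
    n_returns = defaultdict(lambda: np.zeros(n_actions))   # counts for incremental averages

    def policy_probs(state):
        # epsilon-soft probabilities: epsilon/|A| for every action,
        # plus the remaining 1 - epsilon on the currently greedy action
        probs = np.full(n_actions, epsilon / n_actions)
        probs[int(np.argmax(Q[state]))] += 1.0 - epsilon
        return probs

    for _ in range(num_episodes):
        # Generate an episode following the current epsilon-soft policy
        episode, state = [], env.reset()
        for _ in range(max_steps):
            action = int(rng.choice(n_actions, p=policy_probs(state)))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            if done:
                break

        # First-visit MC: average the return following the first occurrence
        # of each (state, action) pair in the episode
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                n_returns[s][a] += 1.0
                Q[s][a] += (G - Q[s][a]) / n_returns[s][a]  # incremental average of returns

    return Q
```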