3.8 Optimal Value Functions
Solving a reinforcement learning task means finding a policy that achieves a lot of reward over the long run. For finite MDPs, we can precisely define the following:

optimal policy:
- A policy $\pi$ is better than or equal to a policy $\pi'$ (written $\pi \ge \pi'$) if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in \mathcal{S}$
- $\pi^*$ is an optimal policy if $\pi^* \ge \pi$ for all $\pi$

optimal state-value function $V^*$:
- $V^*(s) = \max_\pi V^\pi(s)$ for all $s \in \mathcal{S}$

optimal action-value function $Q^*$:
- $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$
- We can write $Q^*$ in terms of $V^*$ as follows:
$$Q^*(s, a) = \mathbb{E}\{\, r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\ a_t = a \,\}$$
Because $V^*$ is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation for state values. Because it is the optimal value function, however, $V^*$'s consistency condition can be written in a special form without reference to any specific policy:
$$\begin{aligned}
V^*(s) &= \max_{a \in \mathcal{A}(s)} Q^{\pi^*}(s, a) \\
&= \max_{a} \mathbb{E}\{\, r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\ a_t = a \,\} \\
&= \max_{a \in \mathcal{A}(s)} \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^*(s') \right]
\end{aligned}$$
This is the Bellman equation for $V^*$, or the Bellman optimality equation.
The last two equations are two forms of the Bellman optimality equation for $V^*$. The Bellman optimality equation for $Q^*$ is
$$\begin{aligned}
Q^*(s, a) &= \mathbb{E}\{\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s,\ a_t = a \,\} \\
&= \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]
\end{aligned}$$
For finite MDPs, the Bellman optimality equation has a unique solution independent of the policy. The Bellman optimality equation is actually a system of equations, one for each state, so if there are $N$ states, then there are $N$ equations in $N$ unknowns.
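As a concrete illustration of solving this system, here is a minimal value-iteration sketch that repeatedly applies the Bellman optimality backup on a tiny made-up MDP. The arrays `P` and `R`, the discount `gamma`, and the two-state, two-action layout are assumptions chosen for the example, not anything specified in the text.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative values only).
# P[s, a, s2] = probability of moving from s to s2 under action a.
# R[s, a, s2] = expected immediate reward for that transition.
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[[ 1.0, 0.0],
               [ 0.0, 2.0]],
              [[-1.0, 1.0],
               [ 0.0, 0.5]]])
gamma = 0.9  # discount rate

# Value iteration: apply the Bellman optimality backup until the values
# stop changing.  Because the equation has a unique solution, the
# iteration converges to V* regardless of the starting values.
V = np.zeros(P.shape[0])
while True:
    Q = np.einsum('sat,sat->sa', P, R + gamma * V)  # one-step lookahead
    V_new = Q.max(axis=1)                           # max over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)  # approximate V*(s) for each state
```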
Once one has $V^*$, it is easy to determine an optimal policy. For each state $s$, there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy.
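For instance, under the same tabular-model conventions as the sketch above (the function name and array layout are illustrative assumptions), a greedy and hence optimal deterministic policy can be read off with a one-step lookahead per state:

```python
import numpy as np

def greedy_policy_from_v(P, R, gamma, V):
    """One-step lookahead: in each state, pick an action attaining the
    max in the Bellman optimality equation, given the model P, R and V*."""
    Q = np.einsum('sat,sat->sa', P, R + gamma * V)
    return np.argmax(Q, axis=1)  # one greedy (optimal) action per state
```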
Having $Q^*$ makes it even easier: for any state $s$, the agent can simply find any action that maximizes $Q^*(s, a)$. The action-value function effectively caches the results of all one-step-ahead searches.
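Assuming $Q^*$ is stored as a table indexed by state and action (again a hypothetical layout, not anything from the text), action selection needs no model and no lookahead at all:

```python
import numpy as np

def greedy_action_from_q(Q, s):
    """Optimal action selection from a Q* table: a plain argmax over the
    actions in state s; no knowledge of the dynamics is required."""
    return int(np.argmax(Q[s]))
```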
Many different decision-making methods can be viewed as ways of approximately solving the Bellman optimality equation. The methods of dynamic programming can be related even more closely to it. Many reinforcement learning methods can be clearly understood as approximately solving the Bellman optimality equation, using actual experienced transitions in place of knowledge of the expected transitions.