3.8 Optimal Value Functions
Solving a reinforcement learning task means finding a policy that achieves a lot of reward over the long run. For finite MDPs, we can precisely define the following:

optimal policy:
- A policy $\pi$ is better than or equal to a policy $\pi'$ (written $\pi \ge \pi'$) if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in \mathcal{S}$
- $\pi^*$ is an optimal policy if $\pi^* \ge \pi$ for all $\pi$

optimal state-value function $V^*$:
- $V^*(s) = \max_\pi V^\pi(s)$ for all $s \in \mathcal{S}$

optimal action-value function $Q^*$:
- $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$
- We can write $Q^*$ in terms of $V^*$ as follows:
$$Q^*(s, a) = \mathbb{E}\{\, r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\ a_t = a \,\}$$
Because $V^*$ is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation for state values. Because it is the optimal value function, however, $V^*$'s consistency condition can be written in a special form without reference to any specific policy:
$$\begin{aligned}
V^*(s) &= \max_{a \in \mathcal{A}(s)} Q^{\pi^*}(s, a) \\
&= \max_{a} \mathbb{E}\{\, r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\ a_t = a \,\} \\
&= \max_{a \in \mathcal{A}(s)} \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^*(s') \right]
\end{aligned}$$
This is the Bellman equation for $V^*$, or the Bellman optimality equation.
The last two equations are two forms of the Bellman optimality equation for $V^*$. The Bellman optimality equation for $Q^*$ is
$$\begin{aligned}
Q^*(s, a) &= \mathbb{E}\{\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s,\ a_t = a \,\} \\
&= \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]
\end{aligned}$$
For finite MDPs, the Bellman optimality equation has a unique solution independent of the policy. The Bellman optimality equation is actually a system of equations, one for each state, so if there are $N$ states, then there are $N$ equations in $N$ unknowns.
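As a concrete illustration of solving this system, here is a minimal value-iteration sketch that repeatedly applies the Bellman optimality backup on a tiny made-up MDP. The arrays `P` and `R`, the discount `gamma`, and the two-state, two-action layout are assumptions chosen for the example, not anything specified in the text.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative values only).
# P[s, a, s2] = probability of moving from s to s2 under action a.
# R[s, a, s2] = expected immediate reward for that transition.
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[[ 1.0, 0.0],
               [ 0.0, 2.0]],
              [[-1.0, 1.0],
               [ 0.0, 0.5]]])
gamma = 0.9  # discount rate

# Value iteration: apply the Bellman optimality backup until the values
# stop changing.  Because the equation has a unique solution, the
# iteration converges to V* regardless of the starting values.
V = np.zeros(P.shape[0])
while True:
    Q = np.einsum('sat,sat->sa', P, R + gamma * V)  # one-step lookahead
    V_new = Q.max(axis=1)                           # max over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)  # approximate V*(s) for each state
```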
Once one has $V^*$, it is easy to determine an optimal policy. For each state $s$, there will be one or more actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero probability only to these actions is an optimal policy.
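For instance, under the same tabular-model conventions as the sketch above (the function name and array layout are illustrative assumptions), a greedy and hence optimal deterministic policy can be read off with a one-step lookahead per state:

```python
import numpy as np

def greedy_policy_from_v(P, R, gamma, V):
    """One-step lookahead: in each state, pick an action attaining the
    max in the Bellman optimality equation, given the model P, R and V*."""
    Q = np.einsum('sat,sat->sa', P, R + gamma * V)
    return np.argmax(Q, axis=1)  # one greedy (optimal) action per state
```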
Having $Q^*$ makes it even easier: for any state $s$, the agent can simply find any action that maximizes $Q^*(s, a)$. The action-value function effectively caches the results of all one-step-ahead searches.
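Assuming $Q^*$ is stored as a table indexed by state and action (again a hypothetical layout, not anything from the text), action selection needs no model and no lookahead at all:

```python
import numpy as np

def greedy_action_from_q(Q, s):
    """Optimal action selection from a Q* table: a plain argmax over the
    actions in state s; no knowledge of the dynamics is required."""
    return int(np.argmax(Q[s]))
```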
Many different decision-making methods can be viewed as ways of approximately solving the Bellman optimality equation. The methods of dynamic programming can be related even more closely to it. Many reinforcement learning methods can be clearly understood as approximately solving the Bellman optimality equation, using actual experienced transitions in place of knowledge of the expected transitions.