4.1 Policy Evaluation

Policy evaluation computes the state-value function $v_\pi$ for an arbitrary policy $\pi$:

  • $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$. The existence and uniqueness of $v_\pi$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy.

$v_\pi$ satisfies the Bellman equation

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr].$$

To find the solution to this equation we use an iterative method. Consider a sequence of approximate value functions $v_0, v_1, v_2, \ldots$, where the initial approximation $v_0$ is chosen arbitrarily. Each successive approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule:

$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_k(s')\bigr].$$

$v_\pi$ is a fixed point of this update rule. This algorithm is called iterative policy evaluation.

  • full backup
    • each step replaces the old value of a state by a new value computed from the old values of all its possible successor states
  • backup done in a sweep
    • the new values are written directly into a single vector rather than into a temporary vector, so updates later in the same sweep already use the new values of states updated earlier
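The difference between keeping a temporary vector and sweeping in place can be sketched on a toy deterministic chain (the 3-state chain below is an assumption chosen just for illustration):

```python
import numpy as np

def backup(s, v):
    """One full backup for state s: its expected one-step return.

    Toy chain assumed for illustration: from states 0 and 1 the single
    action moves to s + 1 with reward -1; state 2 is terminal (value 0).
    """
    if s == 2:
        return 0.0
    return -1.0 + v[s + 1]

# Two-array version: every new value is computed from the OLD vector,
# then the whole vector is replaced at once.
v_old = np.zeros(3)
v_two_array = np.array([backup(s, v_old) for s in range(3)])

# In-place sweep: new values are written directly into the vector, so
# states updated later in the same sweep already see the new values of
# states updated earlier. Sweeping from the terminal state backwards
# propagates its value through the whole chain in a single sweep.
v = np.zeros(3)
for s in reversed(range(3)):
    v[s] = backup(s, v)
```

Starting from zeros, one two-array sweep gives `[-1, -1, 0]`, while one in-place sweep (in this favorable ordering) already reaches the fixed point `[-2, -1, 0]`; in general the in-place variant also converges and usually does so faster.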

Figure 4.1: Iterative policy evaluation

To stop the algorithm (since it converges only in the limit) we test the quantity $\max_s |v_{k+1}(s) - v_k(s)|$ after each sweep. When it is sufficiently small we stop the loop.
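Putting the update rule, the in-place sweep, and the stopping test together gives a minimal sketch of iterative policy evaluation. The nested-dict MDP encoding (`P[s][a]` as a list of `(prob, next_state, reward, done)` tuples) and the toy chain are assumptions made for this example:

```python
import numpy as np

def iterative_policy_evaluation(P, pi, gamma=1.0, theta=1e-8):
    """In-place iterative policy evaluation.

    P[s][a] is a list of (prob, next_state, reward, done) transition
    tuples and pi[s][a] is the probability of taking action a in
    state s (a hypothetical MDP encoding chosen for this sketch).
    """
    v = np.zeros(len(P))            # v_0 chosen arbitrarily (zeros here)
    while True:
        delta = 0.0                 # max change over this sweep
        for s in range(len(P)):     # one in-place sweep over all states
            v_old = v[s]
            # Bellman update: expected one-step return under pi,
            # bootstrapping from the current value estimates.
            v[s] = sum(
                pi[s][a] * sum(p * (r + gamma * v[s2] * (not done))
                               for p, s2, r, done in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v[s] - v_old))
        if delta < theta:           # stopping test
            return v

# Toy 3-state chain (an assumption for illustration): one action per
# state, moving right with reward -1; state 2 is terminal.
P = {
    0: {0: [(1.0, 1, -1.0, False)]},
    1: {0: [(1.0, 2, -1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)]},
}
pi = {0: {0: 1.0}, 1: {0: 1.0}, 2: {0: 1.0}}

values = iterative_policy_evaluation(P, pi)
# values is approximately [-2, -1, 0]
```

Here each pass through the `for s` loop is one sweep, and `delta` is exactly the quantity $\max_s |v_{k+1}(s) - v_k(s)|$ used in the stopping test.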