Dynamic Programming

Chapter 4: Dynamic Programming

Dynamic Programming (DP):

it refers to a collection of algorithms that can be used to compute optimal policies
it requires a perfect model of the environment, given as a Markov decision process (MDP)
it is computationally expensive
it has limited practical utility in reinforcement learning because of both the cost and the model requirement
it remains important theoretically
it is the basis for many algorithms that aim to achieve the same effect as DP, with less computation and without a perfect model

The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies.

Recall that the optimal value functions $v_*$ and $q_*$ satisfy the Bellman optimality equations: $\begin{align} v_*(s) &= \max_a \mathbb{E} [R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a]\\ &= \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_*(s')] \end{align}$

$\begin{align} q_*(s,a) &= \mathbb{E} [R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a]\\ &= \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')] \end{align}$
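
To make the two equations concrete, here is a minimal Python sketch that evaluates their right-hand sides on a tiny tabular MDP. The MDP itself (two states, actions `stay` and `go`, the rewards, and `GAMMA = 0.9`) is invented purely for illustration, and the names `P`, `q_from_v`, `bellman_optimality_backup`, and `solve` are hypothetical helpers, not taken from any library. The fixed point is found by applying the backup repeatedly, which is the idea behind value iteration; it is used here only as a convenient way to obtain numbers.

```python
# Hypothetical two-state MDP, invented for illustration only.
# P[s][a] is a list of (probability, next_state, reward) triples,
# i.e. a tabular encoding of p(s', r | s, a).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
GAMMA = 0.9  # discount factor (assumed value)


def q_from_v(v, s, a):
    """Compute sum over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])


def bellman_optimality_backup(v):
    """Apply the right-hand side of the v_* equation once to every state."""
    return {s: max(q_from_v(v, s, a) for a in P[s]) for s in P}


def solve(tol=1e-10, max_sweeps=10_000):
    """Repeat the backup until the values stop changing (approximate v_*)."""
    v = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        new_v = bellman_optimality_backup(v)
        if max(abs(new_v[s] - v[s]) for s in P) < tol:
            return new_v
        v = new_v
    return v


v_star = solve()
# q_* follows from v_* by dropping the outer max over actions.
q_star = {(s, a): q_from_v(v_star, s, a) for s in P for a in P[s]}

for s in P:
    print(s, round(v_star[s], 4), {a: round(q_star[(s, a)], 4) for a in P[s]})
```

In this sketch the two equations are tied together by $v_*(s) = \max_a q_*(s, a)$: the largest entry of `q_star` for a state equals that state's entry in `v_star`, up to the stopping tolerance.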