Incremental Implementation

2.3 Incremental Implementation

Since each action as to keep a records of the rewards

$\begin{align} Q_t(a) = \frac{[R_1 + R_2 + ... + R_{N_t(a)}]}{N_t(a)} \end{align}$

You can't implement this equation like this because the more reward you'll get, the more memory you'll need. Hopefully there is a trick to calculate easily Qt(a).

$\begin{align} Q_{k+1} &= \frac{1}{k} \sum_{i=1}^{k+1} R_i\\ &= Q_k + \frac{1}{k} [R_k - Q_k]\\ \end{align}$

We can have $Q_t(a)$ with the previous estimation of $q(a)$ . We just need to store the iteration (k), and the previous reward. We also need an arbitrary $Q_1$

The general form is : $NewEstimate \leftarrow OldEstimage + SteSize [ Target - OldEstimate]$

The expression [Target - OldEstimate] is an error in the estimate. It is reduced by Taking a step toward the "Target". The target is presumed to indicatea desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the k-th reward

The StepSize parameter is 1/k in our example but can be otherwise. It was a meaningful impact on the system.