2.3 Incremental Implementation
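Since the estimate of an action's value is the average of its past rewards, each action has to keep a record of all the rewards it has received:
Q_t(a) = (R_1 + R_2 + ... + R_k) / k, where k is the number of times action a has been selected so far.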
Implementing this equation directly is impractical: the more rewards you receive, the more memory and computation you need. Fortunately, there is a trick that makes Q_t(a) easy to compute.
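The trick can be derived in a few lines. This is a sketch that assumes Q_k denotes the average of the first k-1 rewards of one action and R_k denotes its k-th reward (notation chosen here for illustration):

```latex
\begin{aligned}
Q_{k+1} &= \frac{1}{k}\sum_{i=1}^{k} R_i \\
        &= \frac{1}{k}\Bigl(R_k + \sum_{i=1}^{k-1} R_i\Bigr) \\
        &= \frac{1}{k}\bigl(R_k + (k-1)\,Q_k\bigr) \\
        &= \frac{1}{k}\bigl(R_k + k\,Q_k - Q_k\bigr) \\
        &= Q_k + \frac{1}{k}\bigl[R_k - Q_k\bigr]
\end{aligned}
```

For k = 1 the factor (k-1) is zero, so the result is Q_2 = R_1 whatever value Q_1 had.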
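We can compute the new estimate Q_{k+1} from the previous estimate Q_k and the k-th reward R_k:
Q_{k+1} = Q_k + (1/k) [R_k - Q_k]
We just need to store the iteration count (k) and the previous estimate Q_k. We also need an arbitrary initial estimate Q_1; its value does not matter, because the first update (k = 1) replaces it with R_1.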
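As an illustration, the incremental update can be sketched in Python. The class name `IncrementalAverage` and the reward values are made up for this example; the point is only that the running estimate matches the plain average while storing just two numbers.

```python
class IncrementalAverage:
    """Running average of the rewards of one action, using constant memory."""

    def __init__(self, initial_estimate=0.0):
        self.estimate = initial_estimate  # Q_1, an arbitrary starting value
        self.k = 0                        # number of rewards seen so far

    def update(self, reward):
        # Q_{k+1} = Q_k + (1/k) * (R_k - Q_k)
        self.k += 1
        self.estimate += (reward - self.estimate) / self.k
        return self.estimate


# Hypothetical rewards, used only to compare against the batch average.
rewards = [1.0, 0.0, 2.0, 5.0]

q = IncrementalAverage(initial_estimate=10.0)  # deliberately odd Q_1
for r in rewards:
    q.update(r)

batch_average = sum(rewards) / len(rewards)
print(q.estimate, batch_average)
```

The initial estimate of 10.0 is overwritten by the first update, so both values agree even though no reward list is kept inside the class.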
The general form is:
NewEstimate ← OldEstimate + StepSize × [Target - OldEstimate]
The expression [Target - OldEstimate] is an error in the estimate. It is reduced by taking a step toward the "Target". The target is presumed to indicate a desirable direction in which to move, though it may be noisy. In the case above, for example, the target is the k-th reward.
The StepSize parameter is 1/k in our example, but other choices are possible. It has a significant impact on how the estimate behaves.
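To see the effect of the step size, here is a small sketch of the general update rule with two different choices. The function names, the constant value 0.1, and the reward sequence are assumptions made for this example only.

```python
def update_estimate(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)


def run(rewards, step_size_fn, initial_estimate=0.0):
    """Apply the update once per reward; step_size_fn maps k to StepSize."""
    estimate = initial_estimate
    for k, reward in enumerate(rewards, start=1):
        estimate = update_estimate(estimate, reward, step_size_fn(k))
    return estimate


# Hypothetical rewards whose level jumps from 0 to 1 halfway through.
rewards = [0.0] * 50 + [1.0] * 50

sample_average = run(rewards, lambda k: 1.0 / k)  # StepSize = 1/k
constant_step = run(rewards, lambda k: 0.1)       # StepSize = 0.1 (assumed value)

print(sample_average, constant_step)
```

With StepSize = 1/k every reward counts equally, so the estimate ends near the overall average of 0.5. With a constant step size the recent rewards weigh more, so the estimate ends close to 1.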