2.5 Optimistic Initial Values
All the methods above are biased to some extent by their initial action-value estimates $Q_1(a)$ (0 or +5, for example). For sample-average methods the bias disappears once every action has been selected at least once; for methods with a constant step size $\alpha$ the bias is permanent, though it decreases over time. In practice this kind of bias can sometimes be very helpful, because it is an easy way to supply some prior knowledge about what level of rewards can be expected.
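To see why the bias never fully vanishes with a constant step size, it may help to unroll the update rule. The sketch below assumes the standard incremental update $Q_{n+1} = Q_n + \alpha (R_n - Q_n)$ for a single action, with $R_i$ the $i$-th reward received for that action; this notation is an assumption, since the update itself is not restated in this section.

```latex
\begin{align*}
Q_{n+1} &= Q_n + \alpha \left( R_n - Q_n \right) \\
        &= (1-\alpha)\, Q_n + \alpha R_n \\
        &= (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i
\end{align*}
```

The weight on $Q_1$ is $(1-\alpha)^n$, which shrinks geometrically for $0 < \alpha < 1$ but is never exactly zero. With sample averages, by contrast, $Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i$ contains no $Q_1$ term once the action has been tried.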
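We can also use the initial values to force the system to explore. Set every initial estimate high (for example $Q_1(a) = +5$ when rewards are typically far below that). Whichever action is chosen first, its reward is less than the initial value, so its estimate is revised downward and the learner, even a purely greedy one, switches to the actions not yet chosen, whose estimates still sit at the optimistic initial value. Every action is therefore tried several times before the estimates settle. This technique is called optimistic initial values.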
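The mechanism can be sketched on a small simulated bandit. This is a minimal illustration, not a reference implementation: the function name `run_bandit`, its parameters, and the Gaussian reward model (true action values and rewards both drawn with unit variance) are assumptions chosen for the example.

```python
import random


def run_bandit(initial_value, epsilon, steps=1000, k=10, alpha=0.1, seed=0):
    """Run one k-armed bandit episode.

    Returns (fraction of steps on which the best arm was chosen,
             number of distinct arms tried).
    """
    rng = random.Random(seed)
    # Assumed testbed: true action values drawn from N(0, 1).
    true_values = [rng.gauss(0.0, 1.0) for _ in range(k)]
    best = max(range(k), key=lambda a: true_values[a])

    q = [initial_value] * k  # initial estimates Q_1(a)
    tried = set()
    optimal_picks = 0

    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(k)  # exploratory step
        else:
            top = max(q)
            # Greedy step, breaking ties at random.
            action = rng.choice([a for a in range(k) if q[a] == top])

        reward = rng.gauss(true_values[action], 1.0)
        # Constant step-size update: Q <- Q + alpha * (R - Q).
        q[action] += alpha * (reward - q[action])

        tried.add(action)
        if action == best:
            optimal_picks += 1

    return optimal_picks / steps, len(tried)


if __name__ == "__main__":
    for label, q1, eps in [("optimistic greedy", 5.0, 0.0),
                           ("realistic greedy", 0.0, 0.0),
                           ("realistic eps-greedy", 0.0, 0.1)]:
        frac, n_tried = run_bandit(q1, eps)
        print(f"{label}: optimal fraction={frac:.2f}, arms tried={n_tried}")
```

With `initial_value=5.0` and `epsilon=0.0` the learner is purely greedy, yet each pull lowers the estimate of the arm just tried, so the untried arms (still at 5.0) win the next greedy choice. A single seeded run like this only illustrates the mechanism; comparing methods fairly would require averaging over many independent runs.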
The technique is not well suited to nonstationary problems, because its drive for exploration is inherently temporary. If the task changes, creating a renewed need for exploration, this method cannot help. The same limitation applies to any method that treats the initial conditions in a special way: the beginning of time occurs only once, so we should not focus on it too much.