UCB Action Selection

2.6 Upper-Confidence-Bound Action Selection

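The greedy actions are those that look best at present, but some of the other actions may actually be better. Instead of (E1), we can select actions with a rule that also accounts for how uncertain each estimate is: $\begin{align} A_t = \operatorname*{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}\right] \end{align}$ (see page 55, equation 2.8 from pdf for a more visible version of the equation). Here $N_t(a)$ is the number of times action $a$ has been selected before time $t$, and $c > 0$ controls the degree of exploration.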

This is called upper confidence bound action selection.

It is very effective on n-armed bandit problems, but it is hard to extend this method to more general reinforcement learning settings.
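
The selection rule in equation 2.8 can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the book: the function names `ucb_select` and `run_bandit`, the default `c=2.0`, the unit-variance Gaussian rewards, and the sample-average update of the estimates are all assumptions made for the example. An action with $N_t(a) = 0$ is treated as a maximizing action, so every arm is tried once before the bonus term is used.

```python
import numpy as np


def ucb_select(q_values, counts, t, c=2.0):
    """Pick an action index with the UCB rule (hypothetical helper)."""
    q_values = np.asarray(q_values, dtype=float)
    counts = np.asarray(counts, dtype=float)

    # An action that was never tried counts as maximizing: try it first.
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])

    # Exploration bonus c * sqrt(ln t / N_t(a)); it shrinks as N_t(a) grows.
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_values + bonus))


def run_bandit(true_means, steps=1000, c=2.0, seed=0):
    """Run UCB on a Gaussian n-armed bandit (assumed reward model)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q = np.zeros(k)  # action-value estimates Q_t(a)
    n = np.zeros(k)  # selection counts N_t(a)

    for t in range(1, steps + 1):
        a = ucb_select(q, n, t, c)
        reward = rng.normal(true_means[a], 1.0)
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]  # incremental sample average
    return q, n


if __name__ == "__main__":
    estimates, pulls = run_bandit([0.0, 1.0, 0.2])
    print("estimates:", estimates)
    print("pulls:", pulls)
```

Setting `c=0` reduces the rule to greedy selection once every arm has been tried, while a larger `c` keeps rarely chosen arms in play for longer.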