2.7 Gradient Bandits

So far in this chapter we have considered methods that estimate action values and use those estimates to select actions. This is often a good approach, but it is not the only one possible. In this section we consider learning a numerical preference Ht(a) for each action a. The larger the preference, the more often that action is taken, but the preference has no interpretation in terms of reward.

How the preferences are defined and updated takes some care, and I invite you to look at the equations on pages 57 to 60 of the book.
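For reference, here is a sketch of those equations in LaTeX, following the book's notation. The preferences determine action probabilities through a soft-max distribution, and after each step the preference of the taken action is nudged up or down depending on whether its reward beats a running baseline:

$$\pi_t(a) \doteq \Pr\{A_t = a\} \doteq \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$

$$H_{t+1}(A_t) \doteq H_t(A_t) + \alpha\,(R_t - \bar{R}_t)\bigl(1 - \pi_t(A_t)\bigr)$$
$$H_{t+1}(a) \doteq H_t(a) - \alpha\,(R_t - \bar{R}_t)\,\pi_t(a) \quad \text{for all } a \neq A_t$$

Here $\alpha > 0$ is a step-size parameter and $\bar{R}_t$ is the average of the rewards received so far, which serves as the baseline.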

One can gain a deeper insight into this algorithm by understanding it as a stochastic approximation to gradient ascent: in expectation, each preference update moves $H_t(a)$ in the direction of the gradient of the expected reward.
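To make the update concrete, here is a minimal sketch in Python of a gradient bandit run on a stationary k-armed testbed with Gaussian rewards. The function name, parameters, and the testbed setup are my own illustrative choices, not from the book; the preference update itself follows the equations above.

```python
import numpy as np

def gradient_bandit(bandit_means, steps=1000, alpha=0.1, seed=0):
    """Gradient bandit on a stationary k-armed testbed.

    bandit_means: true mean reward of each arm (rewards are Gaussian, unit variance).
    alpha: step size for the preference updates.
    Returns the sequence of rewards received.
    """
    rng = np.random.default_rng(seed)
    k = len(bandit_means)
    H = np.zeros(k)          # action preferences H_t(a)
    baseline = 0.0           # running average of rewards, the baseline R-bar
    rewards = np.zeros(steps)

    for t in range(steps):
        # Soft-max over preferences gives the action probabilities pi_t(a).
        pi = np.exp(H - H.max())
        pi /= pi.sum()

        a = rng.choice(k, p=pi)                  # sample action A_t
        r = rng.normal(bandit_means[a], 1.0)     # observe reward R_t
        rewards[t] = r

        # Preference update: raise H for the chosen action and lower it for the
        # others, in proportion to how much R_t exceeds the baseline.
        one_hot = np.zeros(k)
        one_hot[a] = 1.0
        H += alpha * (r - baseline) * (one_hot - pi)

        # Incrementally update the average-reward baseline.
        baseline += (r - baseline) / (t + 1)

    return rewards

# Example: 10-armed testbed with randomly drawn true action values.
if __name__ == "__main__":
    true_means = np.random.default_rng(1).normal(0.0, 1.0, size=10)
    rs = gradient_bandit(true_means, steps=2000, alpha=0.1)
    print("average reward over the last 500 steps:", rs[-500:].mean())
```

Note that the probabilities are computed from H - H.max() before exponentiating; this does not change the soft-max but keeps it numerically stable when preferences grow large.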

In short, the idea is to learn an additional parameter, the preference, and use it rather than an estimated value to choose which action to take.