2.2 Action-Value Methods

Notation:

  • $a$: an action
  • $q_*(a)$: the true value of action $a$
  • $Q_t(a)$: the estimated value of action $a$ at the $t$-th time step
  • $R_t$: the reward at the $t$-th time step
  • $N_t(a)$: the number of times $a$ has been chosen prior to time $t$

We can estimate $Q_t(a)$ as the mean of the rewards received when $a$ was selected:

$$Q_t(a) = \frac{R_1 + R_2 + \cdots + R_{N_t(a)}}{N_t(a)}$$

where $R_1, \dots, R_{N_t(a)}$ here denote the rewards received on the $N_t(a)$ occasions $a$ was chosen prior to time $t$.

If $N_t(a) = 0$, then we instead define $Q_t(a)$ as some default value, such as $Q_1(a) = 0$.
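As a rough illustration, here is a minimal Python sketch of this sample-average estimate; the function name `sample_average` and the per-action reward list are illustrative assumptions, not from the text.

```python
def sample_average(rewards_for_a, default=0.0):
    """Estimate Q_t(a) as the mean of the rewards received when
    action a was selected; fall back to a default value (e.g. 0)
    when the action has never been tried."""
    n = len(rewards_for_a)           # N_t(a): times a was chosen so far
    if n == 0:
        return default               # Q_t(a) undefined, use the default
    return sum(rewards_for_a) / n    # mean of the observed rewards

# Example: action a was chosen 3 times with these rewards
print(sample_average([1.0, 0.0, 2.0]))  # -> 1.0
print(sample_average([]))               # -> 0.0 (never tried)
```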

The simplest action selection rule is to select at time $t$ the action with the highest estimated value, i.e. $\arg\max_a Q_t(a)$. This is the greedy action.

$\arg\max_a$ denotes the value of $a$ at which the expression that follows is maximized. Greedy action selection always exploits current knowledge to maximize immediate reward.
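A minimal sketch of greedy selection over an array of estimates (the name `greedy_action` and the use of NumPy are assumptions for illustration; ties are broken by the lowest index here):

```python
import numpy as np

def greedy_action(q_estimates):
    """Return argmax_a Q_t(a): the index of the action with the
    highest estimated value (ties go to the lowest index)."""
    return int(np.argmax(q_estimates))

print(greedy_action([0.2, 1.5, 0.7]))  # -> 1
```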

A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability $\varepsilon$, to instead select randomly from among all the actions with equal probability. We call these the $\varepsilon$-greedy methods.

Advantage:

  • In the long run, every $Q_t(a)$ converges to $q_*(a)$
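A minimal sketch of $\varepsilon$-greedy selection, assuming the estimates are kept in a list and using NumPy's random generator (names are illustrative, not from the text):

```python
import numpy as np

def epsilon_greedy_action(q_estimates, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (highest-estimate) action."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))  # explore
    return int(np.argmax(q_estimates))              # exploit

# Example: mostly picks action 1, occasionally explores
q = [0.2, 1.5, 0.7]
print([epsilon_greedy_action(q, epsilon=0.1) for _ in range(10)])
```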

Choosing $\varepsilon$ (i.e. the probability of not playing the action with the highest estimated value) is important. For example, if we take a very low $\varepsilon$, we will explore slowly and only converge to the true maximum later. But if we choose a high $\varepsilon$, we might find the best action faster, yet afterwards we will only play it a fraction $1 - \varepsilon$ of the time (because a fraction $\varepsilon$ of the time we play a random action, which is generally not the estimated maximum). A fix for this issue is to change $\varepsilon$ over time: high at the beginning but decreasing, so that we explore a lot early and play the optimal action at the end.
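One simple way to implement such a decaying $\varepsilon$ is a schedule like the exponential decay below; the function name and the constants are illustrative assumptions, not prescribed by the text.

```python
def decayed_epsilon(t, eps_start=1.0, eps_min=0.01, decay=0.99):
    """Epsilon at step t: start high (lots of exploration), shrink
    geometrically over time, and never go below eps_min."""
    return max(eps_min, eps_start * decay ** t)

# Epsilon shrinks over time: explore early, exploit late
print([round(decayed_epsilon(t), 3) for t in (0, 10, 100, 500)])
```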

Of course, this only works if we have a stationary problem, which is not the case in many reinforcement learning problems.