2.2 Action-Value Methods

Notation:

  • $a$: an action
  • $q_*(a)$: the true value of action $a$
  • $Q_t(a)$: the estimated value of action $a$ at the $t$-th time step
  • $R_t$: the reward at the $t$-th time step
  • $N_t(a)$: the number of times $a$ has been chosen prior to time $t$

We can estimate $Q_t(a)$ as the mean of the rewards received when $a$ was selected:

$$Q_t(a) = \frac{R_1 + R_2 + \cdots + R_{N_t(a)}}{N_t(a)}$$

where $R_1, \dots, R_{N_t(a)}$ here denote the rewards received on the $N_t(a)$ occasions $a$ was chosen prior to time $t$.

If $N_t(a) = 0$, then we instead define $Q_t(a)$ as some default value, such as $Q_1(a) = 0$.
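As a rough illustration, here is a minimal Python sketch of this sample-average estimate; the function name `sample_average` and the per-action reward list are illustrative assumptions, not from the text.

```python
def sample_average(rewards_for_a, default=0.0):
    """Estimate Q_t(a) as the mean of the rewards received when
    action a was selected; fall back to a default value (e.g. 0)
    when the action has never been tried."""
    n = len(rewards_for_a)           # N_t(a): times a was chosen so far
    if n == 0:
        return default               # Q_t(a) undefined, use the default
    return sum(rewards_for_a) / n    # mean of the observed rewards

# Example: action a was chosen 3 times with these rewards
print(sample_average([1.0, 0.0, 2.0]))  # -> 1.0
print(sample_average([]))               # -> 0.0 (never tried)
```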

The simplest action selection rule is to select at time $t$ the action with the highest estimated value, i.e. $\arg\max_a Q_t(a)$. This is the greedy action.

$\arg\max_a$ denotes the value of $a$ at which the expression that follows is maximized. Greedy action selection always exploits current knowledge to maximize immediate reward.
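A minimal sketch of greedy selection over an array of estimates (the name `greedy_action` and the use of NumPy are assumptions for illustration; ties are broken by the lowest index here):

```python
import numpy as np

def greedy_action(q_estimates):
    """Return argmax_a Q_t(a): the index of the action with the
    highest estimated value (ties go to the lowest index)."""
    return int(np.argmax(q_estimates))

print(greedy_action([0.2, 1.5, 0.7]))  # -> 1
```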

A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability $\varepsilon$, to instead select randomly from among all the actions with equal probability. We call these the $\varepsilon$-greedy methods.

Advantage:

  • In the long run, every $Q_t(a)$ converges to $q_*(a)$
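A minimal sketch of $\varepsilon$-greedy selection, assuming the estimates are kept in a list and using NumPy's random generator (names are illustrative, not from the text):

```python
import numpy as np

def epsilon_greedy_action(q_estimates, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (highest-estimate) action."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))  # explore
    return int(np.argmax(q_estimates))              # exploit

# Example: mostly picks action 1, occasionally explores
q = [0.2, 1.5, 0.7]
print([epsilon_greedy_action(q, epsilon=0.1) for _ in range(10)])
```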

Choosing $\varepsilon$ (i.e. the probability of not playing the action with the highest estimated value) is important. For example, if we take a very low $\varepsilon$, we will explore slowly and only converge to the true maximum later. But if we choose a high $\varepsilon$, we might find the best action faster, yet afterwards we will only play it a fraction $1 - \varepsilon$ of the time (because a fraction $\varepsilon$ of the time we play a random action, which is generally not the estimated maximum). A fix for this issue is to change $\varepsilon$ over time: high at the beginning but decreasing, so that we explore a lot early and play the optimal action at the end.
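One simple way to implement such a decaying $\varepsilon$ is a schedule like the exponential decay below; the function name and the constants are illustrative assumptions, not prescribed by the text.

```python
def decayed_epsilon(t, eps_start=1.0, eps_min=0.01, decay=0.99):
    """Epsilon at step t: start high (lots of exploration), shrink
    geometrically over time, and never go below eps_min."""
    return max(eps_min, eps_start * decay ** t)

# Epsilon shrinks over time: explore early, exploit late
print([round(decayed_epsilon(t), 3) for t in (0, 10, 100, 500)])
```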

Of course, this only works if we have a stationary problem, which is not the case in many reinforcement learning problems.