2.2 Action-Value Methods
Notation:
- a: an action
- q*(a): the true value of action a
- Q_t(a): the estimated value of action a at the t-th time step
- R_i: the reward received the i-th time action a was chosen
- N_t(a): the number of times action a has been chosen prior to time t
We can estimate Q_t(a) as the mean of the rewards received when action a was chosen (the sample-average method):

    Q_t(a) = (R_1 + R_2 + ... + R_{N_t(a)}) / N_t(a)

If N_t(a) = 0, then we define Q_t(a) instead as some default value, such as Q_t(a) = 0.
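A minimal sketch of the sample-average estimate on a toy bandit (the 5-arm setup, the Gaussian rewards, and all variable names are illustrative assumptions, not from the notes):

```python
# Minimal sketch (illustrative): sample-average estimates Q_t(a)
# on a toy 5-armed bandit with Gaussian rewards.
import random

n_arms = 5
true_values = [random.gauss(0.0, 1.0) for _ in range(n_arms)]  # q*(a), hidden from the agent

sum_rewards = [0.0] * n_arms  # running sum of rewards per action
counts = [0] * n_arms         # N_t(a): times each action has been chosen

def q_estimate(a, default=0.0):
    """Q_t(a): mean reward for action a, or a default value if N_t(a) == 0."""
    return sum_rewards[a] / counts[a] if counts[a] > 0 else default

# Pull arms at random and watch the estimates approach the true values.
for _ in range(5000):
    a = random.randrange(n_arms)
    reward = random.gauss(true_values[a], 1.0)  # noisy reward around q*(a)
    sum_rewards[a] += reward
    counts[a] += 1

print("estimates:", [round(q_estimate(a), 2) for a in range(n_arms)])
print("true q*:  ", [round(v, 2) for v in true_values])
```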
The simplest action selection rule is to select at time t the action with the highest estimated value; this is the greedy action:

    A_t = argmax_a Q_t(a)

where argmax_a denotes the value of a at which the expression that follows is maximized. Greedy action selection always exploits current knowledge to maximize immediate reward.
A simple alternative is to behave greedily most of the time, but every once in a while, say with small probability ε, to instead select an action at random from among all the actions with equal probability. We call these the ε-greedy methods.
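A minimal sketch of ε-greedy selection (epsilon=0.1 and the tie-breaking rule are assumptions; q_values stands for the current list of estimates Q_t(a)):

```python
# Minimal sketch of epsilon-greedy action selection.
import random

def select_action(q_values, epsilon=0.1):
    """With probability epsilon, explore: pick a uniformly random action.
    Otherwise, exploit: pick the greedy action argmax_a Q_t(a)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    best = max(q_values)
    # break ties randomly among the maximizing actions
    return random.choice([a for a, q in enumerate(q_values) if q == best])

# e.g. select_action([0.1, 0.5, 0.3]) returns action 1 about 93% of the time
```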
Advantage:
- In the long run, every action is sampled infinitely often, so all the estimates Q_t(a) converge to q*(a)
Choosing ε (i.e. the probability of not playing the action with the highest estimated value) is important. If we take a very low ε, we explore slowly and converge to the true maximum later. If we choose a high ε, we might find the best action faster, but afterwards we only play it about 1 - ε of the time (because ε of the time we play a random move, which is usually not the estimated maximum). A fix for this issue is to change ε over time: high ε in the beginning, decreasing over time, so that we explore a lot early and play the optimal move in the end; see the sketch below.
Of course, this only works well if the problem is stationary, which is not the case in many reinforcement learning problems.
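A minimal sketch of such a decaying-ε schedule (the linear annealing and the start/end/decay_steps values are illustrative choices, not prescribed here):

```python
# Minimal sketch of a decaying-epsilon schedule: explore a lot early,
# act almost greedily later.
def epsilon_at(t, start=1.0, end=0.01, decay_steps=1000):
    """Anneal epsilon linearly from `start` to `end` over `decay_steps` steps."""
    frac = min(t / decay_steps, 1.0)
    return start + frac * (end - start)

# epsilon_at(0) == 1.0, epsilon_at(500) == 0.505, epsilon_at(2000) == 0.01
```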