2.1 An n-Armed Bandit Problem

Quick problem presentation:

  • You are faced repeatedly with a choice among n different options, or actions
    • each choice gives you a numerical reward drawn from a stationary probability distribution that depends on the action selected
  • Your goal is to maximize the expected total reward over some time period (a simulation sketch follows this list)
    • for example, over 1000 action selections, or time steps
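
A minimal Python sketch of such a bandit (the class name Bandit and the Gaussian reward distributions are assumptions for illustration; any stationary per-action distribution fits the problem statement):

```python
import numpy as np

class Bandit:
    """A stationary n-armed bandit. Each arm's reward is drawn from a
    fixed Gaussian; Gaussian rewards are an assumption for illustration,
    any stationary distribution per action fits the problem statement."""

    def __init__(self, n, seed=None):
        self.rng = np.random.default_rng(seed)
        # True mean reward of each action -- unknown to the learner.
        self.true_values = self.rng.normal(0.0, 1.0, size=n)

    def pull(self, action):
        # Reward for one selection: stationary, depends only on the action.
        return self.rng.normal(self.true_values[action], 1.0)

bandit = Bandit(n=10, seed=0)
reward = bandit.pull(3)  # one time step: choose action 3, receive a reward
```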

We assume we do not know the value (expected or average reward) of each action; otherwise the solution would be trivial: just always select the action with the highest average reward.
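
To make the point concrete, here is a hypothetical sketch where the true values are visible (which a real learner never gets); acting optimally reduces to a single argmax:

```python
import numpy as np

# Hypothetical true mean rewards, visible here only to make the point;
# in the actual problem the learner never sees these.
true_values = np.array([0.2, 1.5, -0.3, 0.8])
best_action = int(np.argmax(true_values))  # always pick the best arm: 1
```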

We are therefore confronted with the exploration-exploitation dilemma.

  • Exploitation: taking the action whose estimated reward is currently the highest (the greedy action)
  • Exploration: trying other, less-known actions to improve the estimates and possibly discover better rewards

It is essential to know when to explore and when to exploit.
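
A common strategy for balancing the two (developed further later in the chapter) is ε-greedy selection: exploit the current estimates most of the time, but explore with small probability ε. A minimal sketch, reusing the Bandit class from above; the function name and epsilon=0.1 are illustrative assumptions:

```python
import numpy as np

def epsilon_greedy(bandit, n_steps=1000, epsilon=0.1, seed=1):
    """Sample-average value estimates with epsilon-greedy selection."""
    rng = np.random.default_rng(seed)
    n = len(bandit.true_values)
    estimates = np.zeros(n)  # Q(a): current estimate of each action's value
    counts = np.zeros(n)     # N(a): times each action has been selected
    total_reward = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = int(rng.integers(n))       # explore: random action
        else:
            action = int(np.argmax(estimates))  # exploit: greedy action
        reward = bandit.pull(action)
        counts[action] += 1
        # Incremental sample-average update of the estimate for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return total_reward, estimates

bandit = Bandit(n=10, seed=0)  # Bandit class from the sketch above
total, q = epsilon_greedy(bandit)
```

With ε = 0.1 the agent exploits 90% of the time, so the estimates keep improving for every arm while most steps still go to the best-known action.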