4.5 Asynchronous Dynamic Programming

The DP methods discussed so far involve operations over the entire state set of the MDP, i.e. they require sweeps of the state set. This is a problem when the state set is large, because even a single sweep can be prohibitively expensive. For example, the game of backgammon has over 10^20 states.

This motivates asynchronous DP algorithms:

  • they back up the values of states in any order
  • they use whatever values of other states happen to be available
  • The values of some states may be backed up several times before the values of others are backed up once
  • to converge correctly, the algorithm must continue to back up the values of all states
    • it can't ignore any state after some point in the computation
  • they allow great flexibility in selecting states to which backup operations are applied (see the sketch after this list)
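
As a concrete illustration, here is a minimal sketch of in-place asynchronous value iteration in Python. The `states`/`actions`/`transitions` interface is an assumption made for the example, not something specified in the text; the point is only that states are backed up one at a time, in an arbitrary order, using whatever values are currently available, with no sweep of the whole state set.

```python
import random

def async_value_iteration(states, actions, transitions, gamma=0.9, num_backups=10000):
    """In-place asynchronous value iteration (sketch).

    Hypothetical interface (not from the text): `states` and `actions` are lists,
    and `transitions[(s, a)]` is a list of (probability, next_state, reward) tuples.
    """
    V = {s: 0.0 for s in states}          # single value table, updated in place
    for _ in range(num_backups):
        s = random.choice(states)         # any order will do; here uniform at random
        # Bellman optimality backup using the current (possibly stale) values
        # of the other states.
        V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])
            for a in actions
        )
    return V
```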

Avoiding sweeps does not necessarily mean we get away with less computation overall. It just means that the algorithm does not need to get locked into a hopelessly long sweep before it can make progress improving a policy. We can exploit this flexibility by choosing the order of backups so that the iteration focuses on the states that matter most: some states do not need to be backed up as often as others, and some states may not need to be backed up at all if they are irrelevant to optimal behavior. A sketch of such a focused ordering follows.
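
One simple way to express this idea, sketched below with two assumed helper functions `bellman_backup` and `bellman_error` (both hypothetical, not from the text), is to repeatedly back up the state whose value is currently most inconsistent and to skip states whose values are already close to consistent.

```python
def focused_backups(states, bellman_backup, bellman_error, num_backups=1000, theta=1e-6):
    """Sketch of focusing computation on the states that matter most.

    Hypothetical helpers: `bellman_error(s)` returns how inconsistent V[s]
    currently is, and `bellman_backup(s)` updates V[s] in place.
    """
    for _ in range(num_backups):
        s = max(states, key=bellman_error)   # most inconsistent state right now
        if bellman_error(s) < theta:
            break                            # all states are (nearly) consistent: stop early
        bellman_backup(s)                    # states with small error get backed up rarely or never
```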

Asynchronous algorithms:

  • make it easier to mix computation with real-time interaction
    • We can run an iterative DP algorithm at the same time that an agent is experiencing the MDP
    • The agent's experience can be used to determine the states to which the DP algorithm applies its backups
    • the latest value and policy information (from the DP algorithm) can guide the agent's decision-making

For example, we can apply backups to states as the agent visits them. This makes it possible to focus the DP algorithm's backups onto the parts of the state set that are most relevant to the agent.
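
A minimal sketch of this interleaving, under a hypothetical `env` / `greedy_action` / `bellman_backup` interface (none of these names come from the text): the agent acts using the latest value estimates, while backups are applied to exactly the states it visits.

```python
def agent_driven_backups(env, greedy_action, bellman_backup, num_episodes=100):
    """Sketch of mixing DP computation with real-time interaction.

    Hypothetical interfaces: `env.reset()` returns a start state,
    `env.step(a)` returns (next_state, reward, done), `greedy_action(s)`
    chooses an action from the latest value estimates, and
    `bellman_backup(s)` updates V[s] in place.
    """
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            bellman_backup(s)            # back up the state the agent is actually in
            a = greedy_action(s)         # latest value/policy information guides the decision
            s, _, done = env.step(a)     # experience determines which state is backed up next
```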