3.6 Markov Decision Processes

Markov decision process (MDP): a reinforcement learning task that satisfies the Markov property. If the state and action spaces are finite, it is called a finite MDP.

A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state $s$ and action $a$, the probability of each possible pair of next state $s'$ and reward $r$ is denoted

$$p(s', r \mid s, a) \doteq \Pr\{S_{t+1} = s',\, R_{t+1} = r \mid S_t = s,\, A_t = a\}.$$

These quantities completely specify the dynamics of a finite MDP.
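As a minimal sketch (not from the text), one possible way to store the one-step dynamics of a small finite MDP is a table mapping each state-action pair to its possible (next state, reward, probability) outcomes. The states, actions, rewards, and probabilities below are purely hypothetical.

```python
# Hypothetical dynamics table for a two-state, two-action finite MDP.
# dynamics[(s, a)] lists (next_state, reward, probability) outcomes;
# the probabilities for each state-action pair must sum to 1.
dynamics = {
    ("s0", "left"):  [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "right"): [("s1", 1.0, 1.0)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 5.0, 0.5), ("s0", 0.0, 0.5)],
}

# Sanity check: p(., . | s, a) is a probability distribution for every (s, a).
for (s, a), outcomes in dynamics.items():
    total = sum(p for _, _, p in outcomes)
    assert abs(total - 1.0) < 1e-9, f"p(. | {s}, {a}) sums to {total}"
```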

Given the dynamics as specified, one can compute anything else one might want to know about the environment, such as the following (written out explicitly after the list):

  • the expected rewards for state-action pairs

  • the state-transition probabilities

  • the expected rewards for state-action-next-state triples
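In the standard notation, these quantities follow directly from $p(s', r \mid s, a)$:

$$r(s, a) \doteq \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a),$$

$$p(s' \mid s, a) \doteq \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\} = \sum_{r} p(s', r \mid s, a),$$

$$r(s, a, s') \doteq \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] = \frac{\sum_{r} r\, p(s', r \mid s, a)}{p(s' \mid s, a)}.$$

A minimal sketch of the same computations, continuing the hypothetical `dynamics` table from the earlier example (names and structure are illustrative, not from the text):

```python
def expected_reward(dynamics, s, a):
    """r(s, a): expected reward for a state-action pair."""
    return sum(prob * r for _, r, prob in dynamics[(s, a)])

def transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a): probability of moving to state s_next."""
    return sum(prob for s2, _, prob in dynamics[(s, a)] if s2 == s_next)

def expected_reward_given_next(dynamics, s, a, s_next):
    """r(s, a, s'): expected reward given the next state; defined only
    when p(s' | s, a) > 0."""
    p_next = transition_prob(dynamics, s, a, s_next)
    weighted = sum(prob * r for s2, r, prob in dynamics[(s, a)] if s2 == s_next)
    return weighted / p_next
```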