3.6 Markov Decision Processes
Markov decision process or MDP: A reinforcement learning task that satisfies the Markov property. If the state and action spaces are finite, then it is called a finite MDP.
A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state and action, $s$ and $a$, the probability of each possible pair of next state and reward, $s'$, $r$, is denoted
$$p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}.$$
These quantities completely specify the dynamics of a finite MDP.
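As a concrete sketch (not from the text), the one-step dynamics of a finite MDP can be stored as a table mapping each (state, action) pair to a distribution over (next state, reward) pairs. The states, actions, and probabilities below are invented purely for illustration.

```python
# Hypothetical one-step dynamics p(s', r | s, a) for a toy two-state MDP,
# stored as: (state, action) -> {(next_state, reward): probability}.
dynamics = {
    ("low", "wait"):    {("low", 1.0): 1.0},
    ("low", "search"):  {("high", -3.0): 0.4, ("low", 2.0): 0.6},
    ("high", "wait"):   {("high", 1.0): 1.0},
    ("high", "search"): {("high", 2.0): 0.7, ("low", 2.0): 0.3},
}

# Sanity check: for every (s, a), the probabilities over (s', r) must sum to 1.
for (s, a), dist in dynamics.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
```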
Given the dynamics as specified, one can compute anything else one might want to know about the environment, such as
the expected rewards for state-action pairs, $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_r r \sum_{s'} p(s', r \mid s, a)$,
the state-transition probabilities, $p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\} = \sum_r p(s', r \mid s, a)$, and
the expected rewards for state-action-next-state triples, $r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] = \frac{\sum_r r \, p(s', r \mid s, a)}{p(s' \mid s, a)}$.
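All three quantities follow by marginalizing the dynamics $p(s', r \mid s, a)$. The sketch below assumes a `dynamics` table of the form shown earlier (a dict mapping $(s, a)$ to a dict of $(s', r)$ probabilities); it is illustrative, not a definitive implementation.

```python
def expected_reward(dynamics, s, a):
    """r(s, a): sum over (s', r) of r * p(s', r | s, a)."""
    return sum(p * r for (s_next, r), p in dynamics[(s, a)].items())

def transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a): sum over r of p(s', r | s, a)."""
    return sum(p for (sn, r), p in dynamics[(s, a)].items() if sn == s_next)

def expected_reward_given_next(dynamics, s, a, s_next):
    """r(s, a, s'): sum over r of r * p(s', r | s, a), divided by p(s' | s, a)."""
    numerator = sum(p * r for (sn, r), p in dynamics[(s, a)].items() if sn == s_next)
    return numerator / transition_prob(dynamics, s, a, s_next)

# With the hypothetical `dynamics` dictionary from the earlier sketch:
#   expected_reward(dynamics, "low", "search")                    -> 0.4*(-3.0) + 0.6*2.0 = 0.0
#   transition_prob(dynamics, "low", "search", "high")            -> 0.4
#   expected_reward_given_next(dynamics, "low", "search", "high") -> -3.0
```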