Chapter 3: Finite Markov Decision Processes

Basic Definitions

A Markov decision process (MDP) is the most basic formulation of a sequential decision problem under the Markov property, i.e., the assumption that the future depends on the past only through the current state.

  1. State: The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.
  2. Action: The actions are the choices made by the agent.
  3. Reward: The reward defines what we want to achieve instead of how we want to achieve it.
  4. Dynamics: p(s', r | s, a), the probability of transitioning to state s' and receiving reward r after taking action a in state s.
  5. Return: The return is defined as some specific function of the reward sequence
    For episodic tasks, we have G_t = R_{t+1} + \cdots + R_T
    For continuing tasks, we have G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    They can be unified as G_t = \sum_{k=0}^\infty \gamma^{k} R_{t+k+1} by treating termination of an episodic task as entry into an absorbing state that yields only zero reward (with \gamma = 1 allowed only in the episodic case)
    The recursive form of the return is G_t = R_{t+1} + \gamma G_{t+1}, which forms the basis of the Bellman equations (see the sketch right after this list)
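
As a quick sanity check on the return definitions above, here is a minimal Python sketch (the reward sequence and discount factor are made up for illustration) that computes G_t both as the discounted sum and via the backward recursion G_t = R_{t+1} + \gamma G_{t+1}; the two agree.

    # Minimal sketch: computing returns from a finite reward sequence.
    # rewards[k] plays the role of R_{t+k+1}; gamma is the discount factor.

    def discounted_return(rewards, gamma):
        """G_t = sum_k gamma^k R_{t+k+1}, computed directly."""
        return sum(gamma ** k * r for k, r in enumerate(rewards))

    def returns_by_recursion(rewards, gamma):
        """All of G_t, G_{t+1}, ... via G_t = R_{t+1} + gamma * G_{t+1}."""
        returns = [0.0] * (len(rewards) + 1)   # G_T = 0 at the absorbing state
        for k in reversed(range(len(rewards))):
            returns[k] = rewards[k] + gamma * returns[k + 1]
        return returns[:-1]

    rewards = [1.0, 0.0, -1.0, 2.0]   # hypothetical episode rewards
    gamma = 0.9
    print(discounted_return(rewards, gamma))        # direct discounted sum
    print(returns_by_recursion(rewards, gamma)[0])  # same value via the recursion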

Further notes:

  1. In the RL book, the reward obtained from taking action A_t in state S_t at time step t is denoted R_{t+1} rather than R_t.
  2. RL beyond the MDP assumption is an important research topic (also discussed in the RL book).
  3. The representation of states and actions has a great influence on the learning process, but it is beyond the scope of the RL book (many recent works actually focus on this topic).
  4. The RL book focuses on a scalar reward signal, but some recent works consider multi-objective reward signals in vector form.

Policies and Value Functions

A policy is a mapping from states to the probability of selecting each possible action.
A value function gives the expected return of a state or a state-action pair under a given policy.
Value functions are defined w.r.t. particular policies, i.e., v_\pi (s) = \mathbb{E}_\pi [G_t | S_t = s], \quad q_\pi (s, a) = \mathbb{E}_\pi [G_t | S_t = s, A_t = a]
Based on the simple relationships

    v_\pi (s) = \sum_a \pi (a | s) q_\pi (s, a),
    q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big],

we can derive the Bellman equations, which express the relationship between the value of a state (or state-action pair) and the values of its successor states (or state-action pairs):

    v_\pi (s) = \sum_a \pi (a | s) \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big],
    q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma \sum_{a'} \pi (a' | s') q_\pi (s', a') \big].

The value function v_\pi is the unique solution to its Bellman equation, obtainable by solving a system of |\mathcal{S}| linear equations. Note that this assumes the system dynamics p(s', r | s, a) are known.
Another useful tool for visualizing these recursive relationships is the backup diagram.
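
To make the linear-system view concrete, here is a minimal Python/NumPy sketch under simplifying assumptions: a made-up 2-state, 2-action MDP specified by transition probabilities P[a, s, s'] and expected rewards R[a, s] (the expectation over r in p(s', r | s, a)), plus an arbitrary stochastic policy. It evaluates v_\pi exactly by solving the |\mathcal{S}| linear equations v = r_\pi + \gamma P_\pi v.

    import numpy as np

    # Hypothetical dynamics: P[a, s, s2] transition probabilities,
    # R[a, s] expected immediate reward for taking action a in state s.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
                  [[0.5, 0.5], [0.0, 1.0]]])   # action 1
    R = np.array([[1.0, 0.0],
                  [2.0, -1.0]])
    gamma = 0.9

    # An arbitrary fixed stochastic policy pi[s, a].
    pi = np.array([[0.7, 0.3],
                   [0.4, 0.6]])

    # Policy-averaged dynamics P_pi[s, s2] and rewards r_pi[s].
    P_pi = np.einsum('sa,ast->st', pi, P)
    r_pi = np.einsum('sa,as->s', pi, R)

    # Bellman equation in matrix form, v = r_pi + gamma * P_pi v,
    # solved exactly as a linear system.
    v_pi = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

    # Action values via q_pi(s, a) = E[r] + gamma * E[v_pi(s')].
    q_pi = R + gamma * np.einsum('ast,t->as', P, v_pi)
    print(v_pi)      # state values under pi
    print(q_pi.T)    # action values, indexed [s, a]

For large state spaces this direct solve becomes impractical, which is exactly where the approximate methods of later chapters come in.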

Optimal Policies and Optimal Value Functions

Definition of a "better" policy: \pi \geq \pi' if and only if v_\pi (s) \geq v_{\pi'} (s) for all s \in \mathcal{S}.
For finite MDPs, there always exist optimal value functions v_*(s) and q_*(s, a), together with corresponding optimal policies (potentially more than one). Intuitively, if a policy is not optimal, we can improve the value of some state s by changing the policy at that state alone. The improvement in the value of s then propagates to the values of all states that can reach s in the state-transition graph. By repeating this argument, we can always obtain a better policy and gradually reach an optimal one.
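
The improvement argument above is the policy-improvement step made explicit. Here is a small sketch (same made-up MDP and policy shapes as in the evaluation example; none of the numbers come from the book) that evaluates a policy, constructs the policy that is greedy with respect to q_\pi, and re-evaluates it so you can check that the values are no worse in every state.

    import numpy as np

    def evaluate(P, R, gamma, pi):
        """Exact v_pi by solving the Bellman linear system."""
        P_pi = np.einsum('sa,ast->st', pi, P)
        r_pi = np.einsum('sa,as->s', pi, R)
        return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

    def greedy_improvement(P, R, gamma, pi):
        """Return a deterministic policy that is greedy w.r.t. q_pi."""
        v = evaluate(P, R, gamma, pi)
        q = R + gamma * np.einsum('ast,t->as', P, v)   # q_pi[a, s]
        greedy = np.zeros_like(pi)
        greedy[np.arange(pi.shape[0]), q.argmax(axis=0)] = 1.0
        return greedy

    # Hypothetical MDP and an arbitrary starting policy.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])
    R = np.array([[1.0, 0.0], [2.0, -1.0]])
    gamma = 0.9
    pi = np.array([[0.7, 0.3], [0.4, 0.6]])

    pi_new = greedy_improvement(P, R, gamma, pi)
    # Policy improvement theorem: v_{pi_new}(s) >= v_pi(s) for all s.
    print(evaluate(P, R, gamma, pi))
    print(evaluate(P, R, gamma, pi_new))
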
Based on the following simple relations,

    v_\pi (s) = \sum_a \pi (a | s) q_\pi (s, a)  \rightarrow  v_*(s) = \max_{a \in \mathcal{A}(s)} q_*(s, a),
    q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big]  \rightarrow  q_*(s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_*(s') \big],

we obtain the Bellman optimality equations, which make no reference to any specific policy:

    v_*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_*(s') \big],
    q_*(s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma \max_{a'} q_*(s', a') \big].
An optimal policy can then be derived by acting greedily: a one-step lookahead with respect to v_*, or simply an argmax over q_*(s, a).
Solving the Bellman optimality equation requires solving |\mathcal{S}| nonlinear equations, assuming fully known system dynamics and the Markov property. Even if these two assumptions are satisfied, solving the equations exactly is computationally infeasible when the state space is very large. Consequently, different RL methods mainly focus on how to solve the Bellman optimality equation approximately.
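
One standard way to solve the Bellman optimality equation approximately when the dynamics are known is value iteration: sweep the Bellman optimality backup until the values stop changing, then act greedily. A minimal sketch on the same made-up MDP as above:

    import numpy as np

    P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[a, s, s2], action 0
                  [[0.5, 0.5], [0.0, 1.0]]])   # action 1
    R = np.array([[1.0, 0.0], [2.0, -1.0]])    # R[a, s] expected rewards
    gamma = 0.9

    v = np.zeros(P.shape[1])
    while True:
        # Bellman optimality backup: q(s, a), then max over actions.
        q = R + gamma * np.einsum('ast,t->as', P, v)
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < 1e-8:
            break
        v = v_new

    greedy_policy = q.argmax(axis=0)   # one optimal action per state
    print(v_new)           # approximate v_*
    print(greedy_policy)   # greedy policy extracted from q_*

Later chapters of the RL book relax the requirement that p(s', r | s, a) be known by learning from sampled experience instead.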

Further notes:
The MDP formulation of RL makes it closely related to (stochastic) optimal control.

Reinforcement learning adds to MDPs a focus on approximation and incomplete information for realistically large problems.

The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states.
