Chapter 5: Monte Carlo Methods
Monte Carlo (MC) methods are learning methods for estimating value functions and discovering optimal policies by averaging complete returns of sample experience. Some high-level characteristics include:
- Unlike DP, MC does not require the model dynamics. Instead, it learns from (simulated) experience. This is valuable because in many cases it is easy to generate sample trajectories but infeasible to obtain the model dynamics in explicit form.
- MC needs complete returns, so it is only suitable for episodic tasks. It can be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense.
- MC does not bootstrap, so its estimates are unbiased. Because the estimate for each state does not depend on the estimates for other states, we can also estimate the values of only the states we are interested in, regardless of the size of the state space.
Following the general framework of GPI, MC also consists of a value estimation (prediction) phase and a policy improvement (control) phase.
Monte Carlo Prediction
This phase learns the value functions for a given policy.
For estimating the state value $v_\pi(s)$, it simply averages the returns observed after visits to $s$ across sample episodes. There are two variants: first-visit MC averages only the return following the first visit to $s$ in each episode, while every-visit MC averages the returns following all visits. Both are guaranteed to converge to $v_\pi(s)$ as the number of visits to $s$ goes to infinity.
Since we cannot derive the optimal policy from state values without knowing the model dynamics, it is more important to estimate state-action values in MC. An important problem here is how to maintain exploration, so that every state-action pair is visited often enough for its value to be estimated and the optimal policy to be found. This will be discussed in the next part.
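As a minimal sketch of first-visit MC prediction, the function below averages, for each state, the return that follows its first visit in every episode. The episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state) and the name `first_visit_mc_prediction` are assumptions made for illustration only.

```python
from collections import defaultdict


def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns (illustrative sketch).

    episodes: iterable of episodes, each a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        # Record the time step of the first visit to each state.
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        g = 0.0
        # Walk backwards so that g is the discounted return from step t.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            if first_visit[state] == t:
                returns_sum[state] += g
                visit_count[state] += 1
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}


# Two hand-made episodes (hypothetical data, undiscounted).
episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],
    [("B", 0.0), ("A", 5.0)],
]
values = first_visit_mc_prediction(episodes)
```

In the first episode the second visit to `A` is ignored, so only the return from the start of the episode counts towards the estimate for `A`. Switching to every-visit MC would amount to dropping the `first_visit` check.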
The backup diagram of MC is a single line of the sampled trajectory.
Monte Carlo Control
MC is guaranteed to converge to the optimal policy under the following two assumptions:
- Each state-action pair has a non-zero probability of being visited;
- The policy evaluation can be done with an infinite number of episodes.
However, these two assumptions are usually unrealistic in real-world problems. The second assumption can be relaxed by approximating the policy evaluation with a finite number of episodes, at the cost of higher variance.
Theoretically, a solution to the first assumption is exploring starts, which gives every state-action pair a non-zero probability of being selected as the start of an episode. However, this is infeasible in most cases, as we cannot start episodes from arbitrary state-action pairs at will. The solutions to this problem include on-policy methods and off-policy methods.
On-policy Control
On-policy methods directly improve the policy that is used to generate data. They ensure exploration by learning an $\epsilon$-greedy policy instead of a deterministic policy. The cost is that we can only obtain the best policy among the $\epsilon$-soft policies, not the truly optimal policy.
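To make the exploration mechanism concrete, here is a small sketch of an epsilon-greedy policy for a single state: every action receives probability epsilon / n, and the greedy action receives the remaining 1 - epsilon on top. The function names and the list-of-action-values input are assumptions for illustration.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    # Every action gets epsilon / n, so no action has zero probability.
    probs = [epsilon / n] * n
    # The greedy action gets the remaining probability mass.
    probs[greedy] += 1.0 - epsilon
    return probs


def sample_action(q_values, epsilon, rng=random):
    """Draw an action index according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
```

Because each action keeps a probability of at least epsilon / n, the resulting policy is soft, which is what keeps every state-action pair reachable during learning.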
Off-policy Control
Off-policy methods learn an optimal target policy that is different from the behavior policy used to generate sample trajectories. We can use a soft behavior policy to ensure exploration while still learning a deterministic target policy.
The key challenge here is that we cannot directly average the returns generated by the behavior policy $b$, as they do not reflect the value function of the target policy $\pi$, i.e., $\mathbb{E}_b[G_t \mid S_t = s] = v_b(s) \neq v_\pi(s)$.
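To solve this problem, we introduce importance sampling, which reweights the returns sampled under the behavior policy $b$ so that they match the sample distribution under the target policy $\pi$. The importance sampling ratio is defined as:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

Then we have

$$\mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s)$$

There are two approaches to estimating $v_\pi(s)$ from $n$ returns $G_1, \dots, G_n$ observed from $s$. The first is ordinary importance sampling in the form of

$$V(s) = \frac{1}{n} \sum_{i=1}^{n} W_i G_i$$

while the second is weighted importance sampling in the form of

$$V(s) = \frac{\sum_{i=1}^{n} W_i G_i}{\sum_{i=1}^{n} W_i}$$

where we use $W_i$ to represent the importance sampling ratio of the $i$-th return for simplicity.

Ordinary importance sampling is unbiased, but it has much larger or even infinite variance. Weighted importance sampling has bounded variance, and its bias shrinks as the number of samples increases, so it is strongly preferred in practice. Moreover, the weighted estimator can be expressed in an incremental way as $V_{n+1} = V_n + \frac{W_n}{C_n}(G_n - V_n)$, where $C_{n+1} = C_n + W_{n+1}$ with $C_0 = 0$, and $V_{n+1}$ is the estimate after the first $n$ returns.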
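The sketch below compares the two estimators on a handful of made-up (ratio, return) pairs and checks that the incremental form of the weighted estimator gives the same result as the batch form. All function names and the sample numbers are hypothetical; the policies are passed in as plain probability functions.

```python
def importance_ratio(episode, target_prob, behavior_prob):
    """Product of pi(a|s) / b(a|s) over the (state, action) pairs of an episode."""
    rho = 1.0
    for state, action in episode:
        rho *= target_prob(state, action) / behavior_prob(state, action)
    return rho


def ordinary_is(ratios, returns):
    """Ordinary importance sampling: divide by the number of returns."""
    return sum(w * g for w, g in zip(ratios, returns)) / len(returns)


def weighted_is(ratios, returns):
    """Weighted importance sampling: divide by the sum of the ratios."""
    total = sum(ratios)
    if total == 0.0:
        return 0.0  # convention used here when every ratio is zero
    return sum(w * g for w, g in zip(ratios, returns)) / total


def weighted_is_incremental(ratios, returns):
    """Same estimate as weighted_is, updated one return at a time."""
    value, cum_weight = 0.0, 0.0
    for w, g in zip(ratios, returns):
        if w == 0.0:
            continue  # a zero-weight return leaves the estimate unchanged
        cum_weight += w
        value += (w / cum_weight) * (g - value)
    return value


# Deterministic target policy (always "a") vs. a uniform two-action behavior policy.
rho = importance_ratio(
    [("s0", "a"), ("s1", "a")],
    target_prob=lambda s, a: 1.0 if a == "a" else 0.0,
    behavior_prob=lambda s, a: 0.5,
)

ratios = [2.0, 0.0, 0.5]    # hypothetical importance sampling ratios
returns = [1.0, 10.0, 4.0]  # hypothetical returns
v_ordinary = ordinary_is(ratios, returns)
v_weighted = weighted_is(ratios, returns)
v_incremental = weighted_is_incremental(ratios, returns)
```

With these numbers the ordinary estimate is 4/3 while the weighted estimate is 1.6. The weighted estimate is a weighted average of the returns, so it always stays within the range of the observed returns that have non-zero ratios, which is the intuition behind its lower variance.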
The off-policy MC control is shown in the figure below. A problem here is highlighted in the last two rows. When the target policy is deterministic, the importance sampling ratio becomes zero as soon as the behavior policy takes an action that is not greedy under the current target policy. When nongreedy actions are common, the method only learns from the tails of episodes, which greatly slows down the learning process.
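The per-episode update can be sketched as follows, assuming weighted importance sampling with a greedy (deterministic) target policy. The episode is processed backwards, and the loop stops at the first action that disagrees with the greedy policy, since every earlier step would receive a ratio of zero. The function name, the `(state, action, reward)` episode format, and the dictionaries `q` and `c` are assumptions for illustration, not a transcription of the figure.

```python
from collections import defaultdict


def off_policy_mc_update(episode, q, c, actions, behavior_prob, gamma=1.0):
    """Update q from one behavior-policy episode (illustrative sketch).

    episode: list of (state, action, reward) triples in time order.
    q, c: dicts mapping (state, action) to the value estimate and to the
    cumulative importance weight.
    Returns how many steps, counted from the end, were actually used.
    """
    g, w, used = 0.0, 1.0, 0
    for state, action, reward in reversed(episode):
        g = gamma * g + reward
        c[(state, action)] += w
        q[(state, action)] += (w / c[(state, action)]) * (g - q[(state, action)])
        used += 1
        # The target policy is greedy with respect to the current q.
        greedy = max(actions, key=lambda a: q[(state, a)])
        if action != greedy:
            break  # ratio is zero for all earlier steps, so stop here
        # The target probability of the greedy action is 1.
        w /= behavior_prob(state, action)
    return used


actions = ["L", "R"]
q = defaultdict(float)
c = defaultdict(float)
q[("s1", "L")] = 5.0  # hypothetical prior estimate making "L" greedy in s1

episode = [("s0", "L", 0.0), ("s1", "R", 0.0), ("s2", "R", 1.0)]
used = off_policy_mc_update(episode, q, c, actions, lambda s, a: 0.5)
```

Here the behavior policy takes the nongreedy action `R` in `s1`, so only the last two steps of the three-step episode contribute and the estimate for `("s0", "L")` is never touched.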
Approximations in MC
- Truncated policy evaluation with only a finite number of sample episodes.
- Policy evaluation that reuses returns from previously sampled trajectories, which were generated under different policies.
Under these approximations, is it still wise to fully trust the estimated values to determine a greedy policy improvement?