Chapter 7: n-step Bootstrapping

n-step TD methods span a spectrum with MC methods at one end and one-step TD methods at the other.

n-step TD Prediction

The target in n-step TD combines the sampled rewards of the first n steps with a bootstrapped value estimate of the state reached after n steps, i.e., G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}). Correspondingly, the update rule is V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha [G_{t:t+n} - V_{t+n-1}(S_t)]. Note that this update is performed at time t+n, once R_{t+n} and S_{t+n} have been observed, and that it changes only the value of S_t; the values of all other states remain unchanged, i.e., V_{t+n}(s) = V_{t+n-1}(s) for all s \neq S_t. One possible issue is that the estimate V_{t+n-1}(S_{t+n}) used for bootstrapping may have last been updated long ago, so it may not reflect the expected value of that state under the current policy \pi_{t+n-1}. This is not covered in the RL book, and I'm not sure whether it causes any problems in practice.
One-step TD (introduced in the last chapter) is the special case of n-step TD with n=1, i.e., G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1}). MC is the opposite extreme, where n is at least the number of steps remaining in the episode (n \geq T-t), so the return runs to termination without bootstrapping, i.e., G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T.
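
To make the target and update rule concrete, here is a minimal sketch in Python. The function names (n_step_return, n_step_td_update) and the dict-based value table are my own assumptions for illustration, not code from the RL book.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Compute R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    rewards holds the n sampled rewards [R_{t+1}, ..., R_{t+n}];
    bootstrap_value is the current estimate V(S_{t+n}) (0.0 if S_{t+n} is terminal).
    """
    g = 0.0
    for i, r in enumerate(rewards):
        g += (gamma ** i) * r
    return g + (gamma ** len(rewards)) * bootstrap_value


def n_step_td_update(V, state, rewards, state_after_n, gamma, alpha, terminal=False):
    """Move V[state] toward the n-step target; no other entry of V is touched."""
    bootstrap = 0.0 if terminal else V[state_after_n]
    target = n_step_return(rewards, bootstrap, gamma)
    V[state] += alpha * (target - V[state])
    return target


if __name__ == "__main__":
    # Hypothetical two-state value table, purely for illustration.
    V = {"A": 0.0, "B": 1.0}
    target = n_step_td_update(V, "A", [1.0, 2.0], "B", gamma=0.5, alpha=0.1)
    print(target, V)
```

With a single reward in the list this reduces to the one-step TD target, and with terminal=True and all remaining rewards of the episode it reduces to the MC return.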

n-step Sarsa

n-step Sarsa is a natural generalization of one-step Sarsa, with the target G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}) and the corresponding update rule Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)]. The RL book gives a gridworld example, shown below, to illustrate the advantage of n-step Sarsa over one-step Sarsa. When the reward is sparse, a single rewarding episode lets one-step Sarsa strengthen only the last action before the reward, whereas n-step Sarsa strengthens the last n actions, which speeds up the propagation of reward to earlier states.

n-step Sarsa example.
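
The effect in the gridworld example can be reproduced with a small sketch. It assumes a recorded episode (so the policy does not change during the episode) on a hypothetical 10-step corridor with a single reward at the end; the function name n_step_sarsa_episode and the corridor itself are my own illustration, not the book's gridworld.

```python
from collections import defaultdict


def n_step_sarsa_episode(Q, pairs, rewards, n, alpha, gamma):
    """Apply n-step Sarsa updates along one recorded episode.

    pairs[t] is (S_t, A_t) for t = 0..T-1 and rewards[t] is R_{t+1}.
    Updates are applied in order of t, as in the online algorithm.
    """
    T = len(pairs)
    for t in range(T):
        end = min(t + n, T)
        # Discounted sum of the sampled rewards R_{t+1} .. R_{end}.
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        # Bootstrap from Q(S_{t+n}, A_{t+n}) unless the episode ended first.
        if t + n < T:
            G += gamma ** n * Q[pairs[t + n]]
        Q[pairs[t]] += alpha * (G - Q[pairs[t]])
    return Q


def count_updated(n):
    """Count state-action pairs with a nonzero value after one episode."""
    pairs = [(s, "right") for s in range(10)]   # hypothetical corridor
    rewards = [0.0] * 9 + [1.0]                 # reward only at the end
    Q = n_step_sarsa_episode(defaultdict(float), pairs, rewards,
                             n=n, alpha=0.5, gamma=0.9)
    return sum(1 for v in Q.values() if v != 0.0)


if __name__ == "__main__":
    print("one-step Sarsa:", count_updated(1))
    print("5-step Sarsa:  ", count_updated(5))
```

Starting from all-zero action values, the one-step version changes one entry after the first episode and the 5-step version changes five.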

n-step Off-policy Control

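For n-step off-policy control, we need to take importance sampling into account, which leads to the following update rule for n-step off-policy Sarsa: Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)], where \rho_{t:h} = \prod_{k=t}^{\min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}. Note that the ratio starts from step t+1 because we do not have to care how likely we were to select the action A_t; now that we have selected it, we want to learn fully from what happens, with importance sampling only for subsequent actions. For n=1 the Sarsa ratio still contains one factor, \pi(A_{t+1}|S_{t+1}) / b(A_{t+1}|S_{t+1}), because the target uses the sampled action A_{t+1}. One-step Q-learning and one-step Expected Sarsa do not have a ratio term because their targets take a max or an expectation over the actions at S_{t+1} instead of sampling A_{t+1}; for n-step Expected Sarsa the ratio is \rho_{t+1:t+n-1}, which is an empty product equal to 1 when n=1.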
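
The ratio \rho_{t:h} can be computed directly from its definition. The sketch below assumes policies stored as nested dicts policy[state][action]; the function name importance_ratio and the toy policies are hypothetical.

```python
def importance_ratio(pi, b, states, actions, start, end, T):
    """rho_{start:end} = prod_{k=start}^{min(end, T-1)} pi(A_k|S_k) / b(A_k|S_k).

    An empty index range (start > min(end, T-1)) yields the empty product 1.0.
    """
    rho = 1.0
    for k in range(start, min(end, T - 1) + 1):
        s, a = states[k], actions[k]
        rho *= pi[s][a] / b[s][a]
    return rho


if __name__ == "__main__":
    # Toy setup: the target policy always picks "R", the behavior policy is uniform.
    pi = {0: {"L": 0.0, "R": 1.0}, 1: {"L": 0.0, "R": 1.0}}
    b = {0: {"L": 0.5, "R": 0.5}, 1: {"L": 0.5, "R": 0.5}}
    states = [0, 1, 0, 1]
    actions = ["R", "R", "L", "R"]
    t, n, T = 0, 2, 4
    print(importance_ratio(pi, b, states, actions, t + 1, t + n, T))
```

In this toy trajectory A_2 = "L" has zero probability under the target policy, so \rho_{1:2} is zero and the corresponding update is skipped entirely.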

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

Importance sampling is required in the n-step off-policy update above because it is a sample-update method: the actions after A_t are sampled from the behavior policy b, so we need to multiply by the importance sampling ratio to make the update's expectation unbiased w.r.t. the target policy. Thus a natural way to avoid importance sampling is to perform an expected update w.r.t. the target policy \pi at every step, which gives the n-step tree backup algorithm.
In its simplest case with n=1, tree backup is exactly expected Sarsa, i.e., G_{t:t+1} = R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q_t(S_{t+1}, a).
For a = A_{t+1}, we can further expand the corresponding Q_t(S_{t+1}, a) term in the equation above to get a two-step target. Recursively, we can get the tree backup target as follows: G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1}) Q_{t+n-1}(S_{t+1},a) + \gamma \pi(A_{t+1}|S_{t+1}) G_{t+1:t+n}. And the update rule without importance sampling is Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)].
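
The recursion for the tree backup target translates almost line by line into code. The sketch below assumes nested dicts Q[state][action] and pi[state][action], rewards[k] holding R_{k+1}, and a horizon argument h = t+n; the function name tree_backup_return and the toy numbers are hypothetical.

```python
def tree_backup_return(t, h, states, actions, rewards, Q, pi, gamma):
    """Tree backup target G_{t:h}, with h = t + n and T = len(rewards).

    states[k] is S_k, actions[k] is A_k, and rewards[k] is R_{k+1}.
    """
    T = len(rewards)
    reward = rewards[t]                       # R_{t+1}
    if t + 1 >= T:                            # S_{t+1} is terminal: G = R_T
        return reward
    s_next = states[t + 1]
    if t + 1 >= h:                            # horizon reached: Expected Sarsa target
        expected_q = sum(pi[s_next][a] * Q[s_next][a] for a in pi[s_next])
        return reward + gamma * expected_q
    a_next = actions[t + 1]
    # Actions not taken contribute their current estimates ...
    off_path = sum(pi[s_next][a] * Q[s_next][a] for a in pi[s_next] if a != a_next)
    # ... while the action actually taken is expanded one level deeper.
    deeper = tree_backup_return(t + 1, h, states, actions, rewards, Q, pi, gamma)
    return reward + gamma * off_path + gamma * pi[s_next][a_next] * deeper


if __name__ == "__main__":
    # Toy episode with T = 3; the final entry of states is a terminal marker.
    states = [0, 1, 0, "terminal"]
    actions = ["R", "L", "R"]
    rewards = [1.0, 0.0, 2.0]
    Q = {0: {"L": 1.0, "R": 2.0}, 1: {"L": 3.0, "R": 4.0}}
    pi = {0: {"L": 0.5, "R": 0.5}, 1: {"L": 0.5, "R": 0.5}}
    print(tree_backup_return(0, 1, states, actions, rewards, Q, pi, gamma=0.5))
    print(tree_backup_return(0, 2, states, actions, rewards, Q, pi, gamma=0.5))
```

No behavior-policy probabilities appear anywhere in the function, which is the sense in which the target needs no importance sampling.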
