Chapter 7: n-step Bootstrapping

n-step TD methods span a spectrum with MC methods at one end and one-step TD methods at the other.

n-step TD Prediction

The target in n-step TD prediction combines the first n sampled rewards with a bootstrapped estimate of the value of the state reached after n steps, i.e., $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}).$ Correspondingly, the update rule is $V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha [G_{t:t+n} - V_{t+n-1}(S_t)].$ Note that this update can only be made at time $t+n$, once $R_{t+n}$ and $S_{t+n}$ have been observed, and that it changes only the value of $S_t$; the values of all other states remain unchanged, i.e., $V_{t+n}(s) = V_{t+n-1}(s)$ for all $s \neq S_t$. An issue here is that the estimate $V_{t+n-1}(s)$ of some state $s$ may have last been updated long ago, so it may not reflect the expected value of $s$ under the current policy $\pi_{t+n-1}$. This is not covered in the RL book and I'm not sure whether it causes any problem in practice.
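The target and update above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a tabular value function stored in a dict; the names `n_step_return` and `n_step_td_update`, and the convention that `None` marks a terminal bootstrap state, are made up for this note and do not come from the RL book.

```python
from typing import Dict, Hashable, List, Optional


def n_step_return(rewards: List[float], bootstrap_value: float, gamma: float) -> float:
    """Compute R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n}).

    rewards is [R_{t+1}, ..., R_{t+n}]; bootstrap_value is the current
    estimate of V(S_{t+n}) (use 0.0 when S_{t+n} is terminal).
    """
    g = bootstrap_value
    # Fold from the last reward backwards: g <- r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def n_step_td_update(
    V: Dict[Hashable, float],
    state: Hashable,
    rewards: List[float],
    bootstrap_state: Optional[Hashable],
    gamma: float,
    alpha: float,
) -> float:
    """Move V[state] towards the n-step target; all other entries stay unchanged."""
    # Assumption of this sketch: None stands for a terminal state with value 0.
    bootstrap_value = 0.0 if bootstrap_state is None else V.get(bootstrap_state, 0.0)
    target = n_step_return(rewards, bootstrap_value, gamma)
    old = V.get(state, 0.0)
    V[state] = old + alpha * (target - old)
    return target
```

Folding the rewards from the back avoids computing the powers of $\gamma$ explicitly; the result is the same sum as in the formula above.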
One-step TD (as introduced in the last chapter) is the special case of n-step TD with $n=1$, i.e., $G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1}).$ MC is the opposite extreme, where $n$ is at least the number of steps remaining in the episode ($n \geq T-t$), so no bootstrapping takes place and the target is the full return, i.e., $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$.
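As a quick numerical illustration (the numbers are hypothetical, chosen only for easy arithmetic), take an episode with $T=3$, rewards $R_1=1$, $R_2=0$, $R_3=2$, discount $\gamma=0.5$, and current estimates $V(S_1)=3$ and $V(S_2)=4$, held fixed for simplicity. The possible targets for $S_0$ are then:
$G_{0:1} = R_1 + \gamma V(S_1) = 1 + 0.5 \cdot 3 = 2.5$ (one-step TD),
$G_{0:2} = R_1 + \gamma R_2 + \gamma^2 V(S_2) = 1 + 0 + 0.25 \cdot 4 = 2$ (two-step TD),
$G_{0:3} = G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 = 1 + 0 + 0.25 \cdot 2 = 1.5$ (MC, no bootstrapping).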

n-step Sarsa

n-step Sarsa is a natural generalization of one-step Sarsa, with the target $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}),$ and the corresponding update rule $Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)].$ The RL book gives a gridworld example, shown below, to illustrate the advantage of n-step Sarsa over one-step Sarsa. When the reward is sparse, n-step Sarsa speeds up the propagation of reward information to earlier state-action pairs: one-step Sarsa strengthens only the last action of the path that led to the reward, whereas n-step Sarsa strengthens the last n actions.
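The propagation effect can be reproduced with a small sketch. It assumes a tabular `Q` stored in a `defaultdict` and replays the n-step Sarsa updates along one already recorded episode (so action selection is not modelled); the function name and the toy five-step episode are made up for this note and are not the gridworld from the RL book.

```python
from collections import defaultdict


def n_step_sarsa_episode_update(Q, states, actions, rewards, n, gamma, alpha):
    """Apply the n-step Sarsa updates along one recorded episode, in time order.

    states  = [S_0, ..., S_{T-1}]  (terminal state omitted)
    actions = [A_0, ..., A_{T-1}]
    rewards = [R_1, ..., R_T]      (rewards[k] holds R_{k+1})
    """
    T = len(rewards)
    for t in range(T):
        end = min(t + n, T)
        # Discounted sum of the sampled rewards R_{t+1}, ..., R_{end}.
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        # Bootstrap from Q(S_{t+n}, A_{t+n}) unless the episode ended first.
        if t + n < T:
            g += gamma ** n * Q[(states[t + n], actions[t + n])]
        key = (states[t], actions[t])
        Q[key] += alpha * (g - Q[key])
    return Q


# Toy episode: five distinct state-action pairs, reward only on the last step.
states = [0, 1, 2, 3, 4]
actions = ["a"] * 5
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]

for n in (1, 3):
    Q = n_step_sarsa_episode_update(
        defaultdict(float), states, actions, rewards, n=n, gamma=0.9, alpha=0.5
    )
    updated = sum(1 for v in Q.values() if v != 0.0)
    print(n, updated)  # prints "1 1", then "3 3"
```

Starting from all-zero action values, one episode changes one entry with $n=1$ and three entries with $n=3$, which is the effect the gridworld example illustrates.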

n-step Sarsa example.

n-step Off-policy Control

For n-step off-policy control, we need to take importance sampling into account, which leads to the following update rule: $Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)],$ where $\rho_{t:h} = \prod_{k=t}^{\min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$. Note that the ratio starts from step $t+1$ because we do not have to care how likely we were to select the action $A_t$; now that we have selected it we want to learn fully from what happens, with importance sampling only for subsequent actions. This also explains why one-step Q-learning does not have a ratio term: its target takes a maximum over the actions at $S_{t+1}$ instead of using the sampled $A_{t+1}$, so the last factor is not needed and the ratio becomes $\rho_{t+1:t+n-1}$, which is an empty product equal to 1 for $n=1$.
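The ratio $\rho_{t:h}$ is straightforward to compute from a recorded trajectory. The sketch below assumes that the policies are given as plain Python callables `pi(action, state)` and `b(action, state)` returning probabilities; this interface and the toy policies are assumptions of this note, not something defined in the RL book.

```python
def importance_ratio(pi, b, states, actions, t, h):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k).

    states = [S_0, ..., S_{T-1}], actions = [A_0, ..., A_{T-1}].
    An empty range (h < t) yields the empty product 1.0.
    """
    T = len(actions)
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    return rho


# Toy setup: the target policy always goes "left", the behavior policy is uniform.
def target_pi(action, state):
    return 1.0 if action == "left" else 0.0


def behavior_b(action, state):
    return 0.5


states = [0, 1, 2, 3]
actions = ["right", "left", "left", "right"]

# Ratio for updating (S_0, A_0) with n = 2: covers A_1 and A_2 only.
print(importance_ratio(target_pi, behavior_b, states, actions, 1, 2))  # prints 4.0
# With n = 3 the ratio also covers A_3, which the target policy would never take.
print(importance_ratio(target_pi, behavior_b, states, actions, 1, 3))  # prints 0.0
```

A ratio of zero means the sampled trajectory is ignored entirely for that update, while large ratios inflate it, which is one source of the high variance of importance-sampled n-step methods.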

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

Importance sampling is required in the n-step off-policy method above because it is a sample update method: the actions after $A_t$ are sampled from the behavior policy $b$, so we need to multiply by the importance sampling ratio to make the expected update correct w.r.t. the target policy. Thus a natural way to avoid importance sampling is to perform an expected update w.r.t. the target policy $\pi$, which is what the n-step tree backup algorithm does.
In its simplest case with $n=1$, tree backup is exactly Expected Sarsa, i.e., $G_{t:t+1} = R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q_t(S_{t+1}, a).$
For the action actually taken, $a = A_{t+1}$, we can replace the corresponding $Q_t(S_{t+1}, a)$ term in the equation above with a one-step target starting from $(S_{t+1}, A_{t+1})$ to get a two-step target. Applying this recursively gives the tree backup target: $G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1}) Q_{t+n-1}(S_{t+1},a) + \gamma \pi(A_{t+1}|S_{t+1}) G_{t+1:t+n}$ for $t < T-1$ and $n \geq 2$, with $G_{T-1:t+n} = R_T$ when the next state is terminal. The update rule, which needs no importance sampling, is $Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)].$
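The recursion can be written down almost literally. The following is a minimal sketch, assuming nested dicts `Q[s][a]` and `pi[s][a]` and a single fixed `Q` (so the time index on $Q$ is ignored); the function name and the toy numbers are made up for this note.

```python
def tree_backup_return(Q, pi, states, actions, rewards, t, n, gamma):
    """n-step tree backup target G_{t:t+n} for a recorded episode.

    states  = [S_0, ..., S_{T-1}], actions = [A_0, ..., A_{T-1}],
    rewards[k] holds R_{k+1}; Q[s][a] and pi[s][a] are nested dicts.
    """
    T = len(rewards)
    if t == T - 1:
        # S_{t+1} is terminal: the target is just the final reward R_T.
        return rewards[t]
    s_next, a_next = states[t + 1], actions[t + 1]
    if n == 1:
        # One-step case: the Expected Sarsa target.
        expected = sum(pi[s_next][a] * Q[s_next][a] for a in Q[s_next])
        return rewards[t] + gamma * expected
    # Actions not taken contribute their current estimates ...
    off_path = sum(pi[s_next][a] * Q[s_next][a] for a in Q[s_next] if a != a_next)
    # ... while the taken action contributes the deeper target G_{t+1:t+n}.
    deeper = tree_backup_return(Q, pi, states, actions, rewards, t + 1, n - 1, gamma)
    return rewards[t] + gamma * off_path + gamma * pi[s_next][a_next] * deeper


# Toy episode with T = 3 and two actions; the target policy is uniform.
states = [0, 1, 2]
actions = ["l", "r", "l"]
rewards = [1.0, 2.0, 3.0]
Q = {0: {"l": 0.0, "r": 0.0}, 1: {"l": 4.0, "r": 2.0}, 2: {"l": 1.0, "r": 3.0}}
pi = {s: {"l": 0.5, "r": 0.5} for s in states}

for n in (1, 2, 3):
    print(n, tree_backup_return(Q, pi, states, actions, rewards, 0, n, 0.5))
# prints "1 2.5", "2 2.75", "3 2.875"
```

Note that the behavior policy never appears: the sampled actions only decide which branch of the tree is expanded, and every branch is weighted by $\pi$.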
