Revisiting DQN and TD Learning
Let $Q(s, a; \mathbf{w})$ denote the DQN, an approximation of the optimal action-value function.
Train the DQN with the TD algorithm.
TD Algorithm
Observe state $s_t$ and perform action $a_t$; the environment returns reward $r_t$ and the next state $s_{t+1}$.
TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; \mathbf{w})$.
TD error: $\delta_t = Q(s_t, a_t; \mathbf{w}) - y_t$.
Goal: make $Q(s_t, a_t; \mathbf{w})$ close to $y_t$, i.e., minimize $\delta_t^2$.
Online gradient descent: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$.
Observe $(s_t, a_t, r_t, s_{t+1})$ and use it to update $\mathbf{w}$ once.
Discard $(s_t, a_t, r_t, s_{t+1})$ after using it.
Such a tuple $(s_t, a_t, r_t, s_{t+1})$ is called a transition; discarding it after a single update wastes experience.
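To make the update concrete, here is a minimal sketch of one online TD step, assuming a PyTorch-style network `q_net` that maps a state to a vector of Q-values over actions; `q_net`, `optimizer`, and the parameter names are illustrative assumptions, not from the source.

```python
import torch

def td_step(q_net, optimizer, s_t, a_t, r_t, s_next, gamma=0.99):
    # TD target: y_t = r_t + gamma * max_a Q(s_{t+1}, a; w)
    with torch.no_grad():
        y_t = r_t + gamma * q_net(s_next).max()
    # TD error: delta_t = Q(s_t, a_t; w) - y_t
    delta = q_net(s_t)[a_t] - y_t
    # Online gradient descent on the squared TD error: the gradient of
    # 0.5 * delta^2 w.r.t. w is exactly delta * dQ(s_t, a_t; w)/dw.
    loss = 0.5 * delta ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The transition (s_t, a_t, r_t, s_next) is then discarded.
```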
Experience Replay
Store recent transitions in a replay buffer.
Remove the oldest transitions so that the buffer holds at most $n$ transitions ($n$ is the buffer capacity).
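A minimal sketch of such a buffer, assuming a plain Python `deque`; the class name, the capacity argument, and the tuple layout are illustrative, not from the source.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # deque(maxlen=n) automatically drops the oldest transition
        # once more than n transitions are stored.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        # Store one transition (s_t, a_t, r_t, s_{t+1}).
        self.buffer.append((s, a, r, s_next))

    def sample(self):
        # Uniformly sample one stored transition.
        return random.choice(self.buffer)

    def __len__(self):
        return len(self.buffer)
```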
TD with Experience Replay
Update $\mathbf{w}$ by stochastic gradient descent (SGD):
Randomly sample a transition $(s_i, a_i, r_i, s_{i+1})$ from the buffer.
Compute the TD error $\delta_i$ and take a gradient step: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \delta_i \cdot \frac{\partial Q(s_i, a_i; \mathbf{w})}{\partial \mathbf{w}}$.
Experience replay is a standard technique for training DQN.
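A minimal sketch of one such SGD step, reusing the hypothetical `ReplayBuffer` and PyTorch-style `q_net` from the sketches above; averaging over a small sampled minibatch is a common choice, not something the source specifies.

```python
import random
import torch

def replay_td_step(q_net, optimizer, buffer, batch_size=32, gamma=0.99):
    # Randomly sample transitions (s_i, a_i, r_i, s_{i+1}) from the buffer.
    batch = random.sample(list(buffer.buffer), batch_size)
    loss = 0.0
    for s, a, r, s_next in batch:
        with torch.no_grad():
            y = r + gamma * q_net(s_next).max()   # TD target y_i
        delta = q_net(s)[a] - y                   # TD error delta_i
        loss = loss + 0.5 * delta ** 2            # squared TD error
    optimizer.zero_grad()
    (loss / batch_size).backward()                # average over the batch
    optimizer.step()
```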
Improvement: Prioritized Experience Replay
Rare experiences are more important.
If a transition has a high TD error $|\delta_t|$, it is given high priority.
Use importance sampling instead of uniform sampling.
The larger the TD error, the higher the probability with which the transition is sampled, e.g., $p_t \propto |\delta_t| + \epsilon$.
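A minimal sketch of priority-based sampling under these assumptions: priorities $p_t \propto (|\delta_t| + \epsilon)^\alpha$ and an importance-sampling weight $(n \cdot p_t)^{-\beta}$ that scales the learning rate to correct the bias of non-uniform sampling; the exponents and constants are illustrative defaults, not from the source.

```python
import numpy as np

def sample_prioritized(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    # Priority grows with |TD error|: bigger delta -> sampled more often.
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    n = len(td_errors)
    i = np.random.choice(n, p=probs)    # sample non-uniformly by priority
    # Importance-sampling weight corrects the bias of non-uniform sampling.
    weight = (n * probs[i]) ** (-beta)
    return i, weight

# Example: the transition with TD error 2.0 is sampled most often.
idx, w = sample_prioritized(np.array([0.1, 0.5, 2.0]))
```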