Brendan O’Donoghue, Rémi Munos, Koray Kavukcuoglu & Volodymyr Mnih
DeepMind
Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer.
This is motivated by making a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as ‘PGQ’, for policy gradient and Q-learning.
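As a rough illustration of the connection referred to here (a paraphrase under the paper's entropy-regularized setting, with $\alpha$ the regularization weight; the exact form below is a sketch, not quoted from the abstract), the policy's action preferences can be turned into Q-value estimates of the form

$$\tilde{Q}(s,a) \;=\; \alpha\bigl(\log \pi(a \mid s) + H^{\pi}(s)\bigr) + V(s),$$

where $H^{\pi}(s)$ is the entropy of $\pi(\cdot \mid s)$ and $V(s)$ is the critic's state-value estimate; the Q-learning updates mentioned above are then applied to $\tilde{Q}$.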
We also establish an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQ.
In particular, we tested PGQ on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.
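To make the combined update concrete, below is a minimal tabular sketch of a PGQ-style learner, written as a simplification of the idea in this abstract rather than the authors' implementation: an entropy-regularized actor-critic step on on-policy data, mixed (with an assumed weight `eta`) with a Q-learning step on replayed transitions, where the Q-values are reconstructed from the action preferences and the value estimate as above. The toy environment, hyperparameters, and function names are all assumptions made for illustration.

```python
# Minimal tabular sketch of a PGQ-style update (illustrative simplification,
# not the authors' implementation). Assumed pieces: the toy MDP, hyperparameters,
# and the mixing weight `eta` between the two updates.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
alpha, gamma, lr, eta = 0.1, 0.99, 0.05, 0.5  # entropy weight, discount, step size, mixing

logits = np.zeros((n_states, n_actions))  # action preferences of the policy
values = np.zeros(n_states)               # state-value estimates V(s)
replay = []                               # replay buffer of (s, a, r, s') tuples

def policy(s):
    z = logits[s] - logits[s].max()
    p = np.exp(z)
    return p / p.sum()

def q_tilde(s):
    # Q-values estimated from the policy's action preferences:
    # Q~(s, a) = alpha * (log pi(a|s) + H(pi(.|s))) + V(s)
    p = policy(s)
    logp = np.log(p + 1e-12)
    entropy = -(p * logp).sum()
    return alpha * (logp + entropy) + values[s]

def step_env(s, a):
    # Toy random MDP standing in for a real environment (pure assumption).
    reward = float(a == s % n_actions)
    return reward, int(rng.integers(n_states))

s = 0
for t in range(5000):
    p = policy(s)
    a = int(rng.choice(n_actions, p=p))
    r, s_next = step_env(s, a)
    replay.append((s, a, r, s_next))

    # (1) On-policy, entropy-regularized actor-critic step.
    adv = r + gamma * values[s_next] - values[s]   # one-step advantage estimate
    grad_log = -p.copy()
    grad_log[a] += 1.0                             # d log pi(a|s) / d logits[s]
    logits[s] += lr * (1 - eta) * (adv - alpha * np.log(p[a] + 1e-12)) * grad_log
    values[s] += lr * (1 - eta) * adv

    # (2) Off-policy Q-learning step on a replayed transition, applied to Q~
    # by pushing the TD error back into the preferences and the value estimate.
    sr, ar, rr, snr = replay[rng.integers(len(replay))]
    delta = rr + gamma * q_tilde(snr).max() - q_tilde(sr)[ar]
    logits[sr, ar] += lr * eta * delta
    values[sr] += lr * eta * delta

    s = s_next
```

The weight `eta` here simply trades off the two kinds of update; it is an assumed stand-in for whatever weighting the full method uses, and the tabular tables stand in for the shared network outputs used in the Atari experiments.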