Continuous control with deep reinforcement learning
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra
(Submitted on 9 Sep 2015)
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
DeepMind extends the DQN approach, which had already performed well on games, to settings where the action space is high-dimensional and continuous. To avoid optimizing over actions at every step, the paper uses an actor-critic method based on the deterministic policy gradient, and achieves good results on more than 20 simulated physics tasks. A minimal sketch of this update is given below.
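The sketch below illustrates one actor-critic update step in the style of the paper's algorithm (DDPG): the critic is regressed toward a bootstrapped target computed with target networks, and the actor is updated by ascending the critic's value of its own actions, i.e. the deterministic policy gradient. It assumes a pre-filled replay batch of (s, a, r, s') tensors; the network sizes, learning rates, and the gamma/tau values are illustrative placeholders, not the paper's hyper-parameters.

# Minimal DDPG-style update sketch (illustrative, not the paper's exact setup).
import copy
import torch
import torch.nn as nn

state_dim, action_dim, batch_size = 3, 1, 64
gamma, tau = 0.99, 0.005  # discount and soft target-update rate (assumed values)

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())          # mu(s | theta)
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))                              # Q(s, a | phi)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Stand-in replay batch; in practice these come from an experience replay buffer.
s = torch.randn(batch_size, state_dim)
a = torch.rand(batch_size, action_dim) * 2 - 1
r = torch.randn(batch_size, 1)
s2 = torch.randn(batch_size, state_dim)

# Critic update: regress Q(s, a) toward r + gamma * Q'(s', mu'(s')).
with torch.no_grad():
    y = r + gamma * critic_target(torch.cat([s2, actor_target(s2)], dim=1))
critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# Actor update: deterministic policy gradient, ascend Q(s, mu(s)) w.r.t. actor params.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# Soft (Polyak) update of the target networks.
for net, target in ((actor, actor_target), (critic, critic_target)):
    for p, p_t in zip(net.parameters(), target.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)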
Comments: 10 pages + supplementary
Subjects: Learning (cs.LG); Machine Learning (stat.ML)
Cite as: arXiv:1509.02971 [cs.LG]
(or arXiv:1509.02971v1 [cs.LG] for this version)