Note: the RL Algorithms section has not been read yet. To be continued...
Video: https://goo.gl/pyMd6p
Environment code: https://goo.gl/jAESt9
Abstract
Question--the proliferation of algorithms makes it difficult to discern which particular approach would be best suited for a rich, diverse task like grasping.
Goal--propose a simulated benchmark for robotic grasping that emphasizes off-policy learning and generalization to unseen objects.
Method--evaluate the benchmark tasks against a variety of Q-function estimation methods, a method previously proposed for robotic grasping with deep neural network models, and a novel approach based on a combination of Monte Carlo return estimation and an off-policy correction.
Results--several simple methods provide a surprisingly strong competitor to popular algorithms such as double Q-learning, and our analysis of stability sheds light on the relative tradeoffs between the algorithms.
I. INTRODUCTION
There are many approaches to the grasping problem, for example:
- analytic grasp metrics [43], [36]
- learning-based approaches [2]
Although learning-based methods built on computer vision have achieved strong results in recent years [22], they do not address the sequential aspect of the grasping task:
either a single grasp pose is chosen [33],
or the next most promising grasp is repeatedly chosen greedily [24].
RL was later introduced as a framework for robotic grasping in a sequential decision-making context, but prior work was limited to:
single objects [34]
simple geometric shapes such as cubes [40].
In this work, a variety of RL methods are compared on a realistic simulated benchmark.
Since successful generalization typically requires training on a large number of objects and scenes [33], [24], along with multiple viewpoints and continuous control, on-policy methods are impractical for diverse grasping scenarios, whereas off-policy reinforcement learning methods are a good fit.
Aim:to understand which off-policy RL algorithms are best suited for vision-based robotic grasping.
Contributions:
- a simulated grasping benchmark for a robotic arm with a two-finger parallel jaw gripper, grasping random objects from a bin.
- present an empirical evaluation of off-policy deep RL algorithms on vision-based robotic grasping tasks, covering the following six algorithms:
1. the grasp success prediction approach proposed by [24],
2. Q-learning [28],
3. path consistency learning (PCL) [29],
4. deep deterministic policy gradient (DDPG) [25],
5. Monte Carlo policy evaluation [39],
6. Corrected Monte-Carlo, a novel off-policy algorithm that extends Monte Carlo policy evaluation for unbiased off-policy learning.
Results show that deep RL can successfully learn grasping of diverse objects from raw pixels, and can grasp previously unseen objects in our simulator with an average success rate of 90%.
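To make the differences among these value-estimation methods concrete, the two standard learning targets they build on are recalled below (textbook forms, not copied from the paper; the off-policy correction term of Corrected Monte-Carlo is not reproduced here):

```latex
% One-step Q-learning target: bootstrapped, usable on off-policy data
y_t^{\mathrm{QL}} = r_t + \gamma \max_{a'} Q_{\theta}(s_{t+1}, a')
% Monte Carlo policy-evaluation target: full empirical return, unbiased only for on-policy data
y_t^{\mathrm{MC}} = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}
```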
II. RELATED WORK
Two main families of model-free algorithms for deep RL:
- policy gradient methods [44], [38], [27], [45]
- value-based methods [35], [28], [25], [15], [16], with actor-critic algorithms combining the two classes [29], [31], [14].
However, a common problem with model-free algorithms is that they are hard to tune.
Moreover, prior work on model-free algorithms, including popular benchmarks [7], [1], [3], focuses mainly on
- applications in video games
- relatively simple simulated robot locomotion tasks
rather than on what is needed here: diverse tasks with generalization to new settings.
Many RL methods have been applied to real robotic tasks, for example:
- guided policy search methods for manipulation tasks: contact-rich, vision-based skills [23], non-prehensile manipulation [10], and tasks involving significant discontinuities [5], [4]
- model-free algorithms applied directly to robot skill learning: fitted Q-iteration [21], Monte Carlo return estimates [37], deep deterministic policy gradient [13], trust-region policy optimization [11], and deep Q-networks [46]
These successful applications of reinforcement learning usually tackle individual skills and do not generalize to skills the robot was not trained on.
All of this build-up is meant to emphasize that the goal of this work is to provide a systematic comparison of deep RL approaches to robotic grasping, with generalization to new objects in a cluttered environment where objects may be obscured and the environment dynamics are complex (unlike [40], [34], and [19], which only consider grasping objects with simple shapes, so the setup here is considerably more demanding).
Besides reinforcement learning, there are many other learning strategies for grasping diverse sets of objects; for those, the authors point to the survey below.
[2] J. Bohg, A. Morales, T. Asfour, and D. Kragic. Data-driven grasp synthesis: a survey. Transactions on Robotics, 2014.
Earlier approaches mainly relied on three sources of supervision:
- human labels [17], [22],
- geometric criteria for grasp success computed offline [12],
- robot self-supervision, measuring grasp success using sensors on the robot’s gripper [33]
Deep learning-based methods appeared later: [20], [22], [24], [26], [32].
III. PRELIMINARIES
This section lays out the paper's notation; it is not repeated here.
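For reference, a minimal recap of the standard notation such papers use (textbook definitions, assumed here rather than quoted from the paper):

```latex
% MDP notation (assumed, textbook form): state s_t, action a_t,
% reward r(s_t, a_t), discount factor \gamma, horizon T.
Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\!\left[\sum_{t'=t}^{T} \gamma^{\,t'-t}\, r(s_{t'}, a_{t'}) \,\middle|\, s_t, a_t\right]
% The optimal Q-function satisfies the Bellman equation:
Q^{*}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}\!\left[\max_{a'} Q^{*}(s_{t+1}, a')\right]
```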
IV. PROBLEM SETUP
- Simulation environment: Bullet simulator
- Episode length: T = 15 timesteps
- A binary reward is given at the final step: 1 for a successful grasp, 0 for a failed grasp
- The current state st consists of the RGB image from the current viewpoint and the current timestep t, so the policy knows how many steps remain in the episode and can decide, for example, whether there is still time for a pre-grasp manipulation or whether it should move straight to a good grasping position
- The arm is controlled through position control of the vertically-oriented gripper
- The continuous action is expressed as a Cartesian displacement of the gripper, with φ being the rotation of the wrist about the z-axis
- When the gripper drops below a fixed height threshold, it closes automatically
- At the start of each episode, the positions and orientations of the objects in the bin are randomized (a rough code sketch of this episode structure is given after the two task variants below)
1) Regular grasping.
900 objects for training, 100 for testing
5 objects in the bin per episode
The objects are swapped out every 20 episodes
2) Targeted grasping in clutter.
All episodes use the same objects
3 of the 7 objects are designated as target objects
The arm is rewarded only when it grasps one of the target objects
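Pulling the setup above together, here is a rough sketch of the episode interface; every class, method, and constant name below is hypothetical (the real benchmark environment is at the "Environment code" link at the top), and the numeric placeholders are assumptions:

```python
import numpy as np

class GraspEnvSketch:
    """Hypothetical stand-in mirroring the setup described above:
    T = 15 steps, state = (RGB image, timestep), action = Cartesian gripper
    displacement plus wrist rotation, binary reward only at the final step."""

    T = 15                  # fixed episode length
    CLOSE_HEIGHT = 0.1      # gripper auto-closes below this height (value assumed)

    def reset(self):
        # Randomize object poses in the bin and return the initial state.
        self.t = 0
        self.gripper_z = 0.5  # assumed starting height of the gripper
        return self._state()

    def step(self, action):
        # action = (dx, dy, dz, dphi): Cartesian displacement + wrist rotation.
        dx, dy, dz, dphi = action
        self.gripper_z += dz
        self.t += 1
        done = self.t >= self.T
        # The gripper closes automatically once it drops below the height
        # threshold; the binary reward (1 = successful grasp, 0 = failure)
        # is only handed out when the episode terminates.
        closed = self.gripper_z < self.CLOSE_HEIGHT
        reward = float(done and closed and self._grasp_succeeded())
        return self._state(), reward, done

    def _state(self):
        rgb = np.zeros((64, 64, 3), dtype=np.uint8)  # placeholder camera image
        return {"rgb": rgb, "timestep": self.t}

    def _grasp_succeeded(self):
        # Stand-in: the simulator actually checks whether an object was lifted
        # (and, for targeted grasping, whether it is one of the target objects).
        return False
```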
V. REINFORCEMENT LEARNING ALGORITHMS
A. Learning to Grasp with Supervised Learning
Levine et al. [24]. This method does not consider long-horizon returns, but instead uses a greedy controller to choose the actions
with the highest predicted probability of producing a successful grasp.
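A hedged sketch of such a greedy controller: a learned model predicts the probability that an action leads to a successful grasp, and the controller simply executes the best of a batch of sampled candidates. The `success_model` interface and the sampling bounds are assumptions, not the paper's implementation (which optimizes the candidate actions more carefully, e.g. with a cross-entropy-method style search):

```python
import numpy as np

def greedy_grasp_action(state, success_model, num_candidates=64, rng=None):
    """Pick the candidate action with the highest predicted grasp-success
    probability. `success_model(state, action)` is a hypothetical callable
    returning that probability."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate Cartesian displacements (dx, dy, dz) and wrist rotations dphi.
    candidates = np.column_stack([
        rng.uniform(-0.05, 0.05, size=(num_candidates, 3)),       # translation bounds assumed
        rng.uniform(-np.pi / 4, np.pi / 4, size=num_candidates),  # rotation bounds assumed
    ])
    # Score every candidate with the success predictor and act greedily.
    scores = np.array([success_model(state, a) for a in candidates])
    return candidates[int(np.argmax(scores))]
```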
B. Off-Policy Q-Learning
C. Regression with Monte Carlo Return Estimates
D. Corrected Monte Carlo Evaluation
E. Deep Deterministic Policy Gradient
F. Path Consistency Learning
G. Summary and Unified View
VI. EXPERIMENTS
Four axes along which the RL algorithms are evaluated:
- overall performance
- data-efficiency
- robustness to off-policy data
- hyperparameter sensitivity
All algorithms use variants of the deep neural network architecture shown in Figure 3 to represent the Q-function.