Note: the RL Algorithms section has not been read yet. To be continued...
Video: https://goo.gl/pyMd6p
Environment code: https://goo.gl/jAESt9
Abstract
Question--the proliferation of algorithms makes it difficult to discern which particular approach would be best suited for a rich, diverse task like grasping.
Goal--propose a simulated benchmark for robotic grasping that emphasizes off-policy learning and generalization to unseen objects.
Method--evaluate the benchmark tasks against a variety of Q-function estimation methods, a method previously proposed for robotic grasping with deep neural network models, and a novel approach based on a combination of Monte Carlo return estimation and an off-policy correction.
Results--several simple methods provide a surprisingly strong competitor to popular algorithms such as double Q-learning, and our analysis of stability sheds light on the relative tradeoffs between the algorithms.
I. INTRODUCTION
There are many approaches to the grasping problem, for example:
- analytic grasp metrics [43], [36]
- learning-based approaches [2]
Although learning-based methods built on computer vision have achieved strong results in recent years [22], they do not address the sequential aspect of the grasping task:
either a single grasp pose is chosen [33],
or the next most promising grasp is repeatedly chosen greedily [24].
RL was later introduced as a framework for robotic grasping in a sequential decision-making context, but prior work was limited to:
single objects [34]
simple geometric shapes such as cubes [40].
In this work, a variety of RL methods are compared on a realistic simulated benchmark.
Since successful generalization typically requires training on a large number of objects and scenes [33], [24], along with multiple viewpoints and continuous control, on-policy methods are impractical for diverse grasping scenarios, whereas off-policy reinforcement learning methods are a good fit.
Aim:to understand which off-policy RL algorithms are best suited for vision-based robotic grasping.
Contributions:
- a simulated grasping benchmark for a robotic arm with a two-finger parallel jaw gripper, grasping random objects from a bin.
- present an empirical evaluation of off-policy deep RL algorithms on vision-based robotic grasping tasks, covering the following six algorithms:
1. the grasp success prediction approach proposed by [24],
2. Q-learning [28],
3. path consistency learning (PCL) [29],
4. deep deterministic policy gradient (DDPG) [25],
5. Monte Carlo policy evaluation [39],
6. Corrected Monte-Carlo, a novel off-policy algorithm that extends Monte Carlo policy evaluation for unbiased off-policy learning.
Results show that deep RL can successfully learn grasping of diverse objects from raw pixels, and can grasp previously unseen objects in our simulator with an average success rate of 90%.
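To make the differences among these value-estimation methods concrete, the two standard learning targets they build on are recalled below (textbook forms, not copied from the paper; the off-policy correction term of Corrected Monte-Carlo is not reproduced here):

```latex
% One-step Q-learning target: bootstrapped, usable on off-policy data
y_t^{\mathrm{QL}} = r_t + \gamma \max_{a'} Q_{\theta}(s_{t+1}, a')
% Monte Carlo policy-evaluation target: full empirical return, unbiased only for on-policy data
y_t^{\mathrm{MC}} = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}
```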
II. RELATED WORK
Two main families of model-free algorithms for deep RL:
- policy gradient methods [44], [38], [27], [45]
- value-based methods [35], [28], [25], [15], [16], with actor-critic algorithms combining the two classes [29], [31], [14].
However, a common problem with model-free algorithms is that they are hard to tune.
Moreover, prior work on model-free algorithms, including popular benchmarks [7], [1], [3], focuses mainly on
- applications in video games
- relatively simple simulated robot locomotion tasks
rather than on what is needed here: diverse tasks with generalization to new settings.
Many RL methods have been applied to real robotic tasks, for example:
- guided policy search methods for manipulation tasks: contact-rich, vision-based skills [23], non-prehensile manipulation [10], and tasks involving significant discontinuities [5], [4]
- model-free algorithms applied directly to robot skill learning: fitted Q-iteration [21], Monte Carlo return estimates [37], deep deterministic policy gradient [13], trust-region policy optimization [11], and deep Q-networks [46]
These successful applications of reinforcement learning usually tackle individual skills and do not generalize to skills the robot was not trained on.
All of this build-up is meant to emphasize that the goal of this work is to provide a systematic comparison of deep RL approaches to robotic grasping, with generalization to new objects in a cluttered environment where objects may be obscured and the environment dynamics are complex (unlike [40], [34], and [19], which only consider grasping objects with simple shapes, so the setup here is considerably more demanding).
Besides reinforcement learning, there are many other learning strategies for grasping diverse sets of objects; for those, the authors point to the survey below.
[2] J. Bohg, A. Morales, T. Asfour, and D. Kragic. Data-driven grasp synthesis: a survey. Transactions on Robotics, 2014.
Earlier approaches mainly relied on three sources of supervision:
- human labels [17], [22],
- geometric criteria for grasp success computed offline [12],
- robot self-supervision, measuring grasp success using sensors on the robot’s gripper [33]
Deep learning-based methods appeared later: [20], [22], [24], [26], [32].
III. PRELIMINARIES
This section lays out the paper's notation; it is not repeated here.
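For reference, a minimal recap of the standard notation such papers use (textbook definitions, assumed here rather than quoted from the paper):

```latex
% MDP notation (assumed, textbook form): state s_t, action a_t,
% reward r(s_t, a_t), discount factor \gamma, horizon T.
Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\!\left[\sum_{t'=t}^{T} \gamma^{\,t'-t}\, r(s_{t'}, a_{t'}) \,\middle|\, s_t, a_t\right]
% The optimal Q-function satisfies the Bellman equation:
Q^{*}(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}\!\left[\max_{a'} Q^{*}(s_{t+1}, a')\right]
```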
IV. PROBLEM SETUP
- Simulation environment: Bullet simulator
- Episode length: T = 15 timesteps
- A binary reward is given at the final step: 1 for a successful grasp, 0 for a failed grasp
- The current state st consists of the RGB image from the current viewpoint and the current timestep t, so the policy knows how many steps remain in the episode and can decide, for example, whether there is still time for a pre-grasp manipulation or whether it should move straight to a good grasping position
- The arm is controlled through position control of the vertically-oriented gripper
- The continuous action is expressed as a Cartesian displacement of the gripper, with φ being the rotation of the wrist about the z-axis
- When the gripper drops below a fixed height threshold, it closes automatically
- At the start of each episode, the positions and orientations of the objects in the bin are randomized (a rough code sketch of this episode structure is given after the two task variants below)
1) Regular grasping.
900 objects for training, 100 for testing
5 objects in the bin per episode
The objects are swapped out every 20 episodes
2) Targeted grasping in clutter.
All episodes use the same objects
3 of the 7 objects are designated as target objects
The arm is rewarded only when it grasps one of the target objects
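Pulling the setup above together, here is a rough sketch of the episode interface; every class, method, and constant name below is hypothetical (the real benchmark environment is at the "Environment code" link at the top), and the numeric placeholders are assumptions:

```python
import numpy as np

class GraspEnvSketch:
    """Hypothetical stand-in mirroring the setup described above:
    T = 15 steps, state = (RGB image, timestep), action = Cartesian gripper
    displacement plus wrist rotation, binary reward only at the final step."""

    T = 15                  # fixed episode length
    CLOSE_HEIGHT = 0.1      # gripper auto-closes below this height (value assumed)

    def reset(self):
        # Randomize object poses in the bin and return the initial state.
        self.t = 0
        self.gripper_z = 0.5  # assumed starting height of the gripper
        return self._state()

    def step(self, action):
        # action = (dx, dy, dz, dphi): Cartesian displacement + wrist rotation.
        dx, dy, dz, dphi = action
        self.gripper_z += dz
        self.t += 1
        done = self.t >= self.T
        # The gripper closes automatically once it drops below the height
        # threshold; the binary reward (1 = successful grasp, 0 = failure)
        # is only handed out when the episode terminates.
        closed = self.gripper_z < self.CLOSE_HEIGHT
        reward = float(done and closed and self._grasp_succeeded())
        return self._state(), reward, done

    def _state(self):
        rgb = np.zeros((64, 64, 3), dtype=np.uint8)  # placeholder camera image
        return {"rgb": rgb, "timestep": self.t}

    def _grasp_succeeded(self):
        # Stand-in: the simulator actually checks whether an object was lifted
        # (and, for targeted grasping, whether it is one of the target objects).
        return False
```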
V. REINFORCEMENT LEARNING ALGORITHMS
A. Learning to Grasp with Supervised Learning
Levine et al. [24]. This method does not consider long-horizon returns, but instead uses a greedy controller to choose the actions
with the highest predicted probability of producing a successful grasp.
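A hedged sketch of such a greedy controller: a learned model predicts the probability that an action leads to a successful grasp, and the controller simply executes the best of a batch of sampled candidates. The `success_model` interface and the sampling bounds are assumptions, not the paper's implementation (which optimizes the candidate actions more carefully, e.g. with a cross-entropy-method style search):

```python
import numpy as np

def greedy_grasp_action(state, success_model, num_candidates=64, rng=None):
    """Pick the candidate action with the highest predicted grasp-success
    probability. `success_model(state, action)` is a hypothetical callable
    returning that probability."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate Cartesian displacements (dx, dy, dz) and wrist rotations dphi.
    candidates = np.column_stack([
        rng.uniform(-0.05, 0.05, size=(num_candidates, 3)),       # translation bounds assumed
        rng.uniform(-np.pi / 4, np.pi / 4, size=num_candidates),  # rotation bounds assumed
    ])
    # Score every candidate with the success predictor and act greedily.
    scores = np.array([success_model(state, a) for a in candidates])
    return candidates[int(np.argmax(scores))]
```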
B. Off-Policy Q-Learning
C. Regression with Monte Carlo Return Estimates
D. Corrected Monte Carlo Evaluation
E. Deep Deterministic Policy Gradient
F. Path Consistency Learning
G. Summary and Unified View
VI. EXPERIMENTS
Four axes along which the RL algorithms are evaluated:
- overall performance
- data-efficiency
- robustness to off-policy data
- hyperparameter sensitivity
All algorithms use variants of the deep neural network architecture shown in Figure 3 to represent the Q-function.