Reading Notes: Socially Aware Motion Planning with Deep Reinforcement Learning

Socially Aware Motion Planning with Deep Reinforcement Learning


Brief review: earlier methods used feature-matching techniques to describe and imitate pedestrian trajectories, but those features vary from person to person, so the generated trajectories were not ideal. This paper points out that, although it is difficult to specify exactly what a robot should do when interacting with people during navigation (the precise mechanisms of pedestrian navigation), it is straightforward to specify what it should not do (violations of social norms). Specifically, using deep reinforcement learning, the paper proposes a time-efficient navigation policy that complies with social norms.

This work notes that while it is challenging to directly specify the details of what to do (precise mechanisms of human navigation), it is straightforward to specify what not to do (violations of social norms). Specifically, using deep reinforcement learning, this work develops a time-efficient navigation policy that respects common social norms.

In addition, this paper builds on the authors' earlier work on multiagent collision avoidance with deep reinforcement learning, introducing socially aware behaviors into multiagent systems. The main contribution is how to introduce and integrate social behaviors into CADRL. Roughly speaking, SA-CADRL = SA (socially aware) + CADRL (collision avoidance with deep reinforcement learning).

This work extends the collision avoidance with deep reinforcement learning framework (CADRL) to characterize and induce socially aware behaviors in multiagent systems.


INTRODUCTION

Evolution of human-robot interaction approaches

  1. Treat pedestrians as dynamic obstacles with simple kinematics and apply specific reactive collision-avoidance rules.

    A common approach treats pedestrians as dynamic obstacles with simple kinematics, and employs specific reactive rules for avoiding collision.

  • Drawback: these methods do not model human behavior and can produce unsafe or unnatural motions, especially when the robot moves at close to human walking speed.

    Since these methods do not capture human behaviors, they sometimes generate unsafe/unnatural movements, particularly when the robot operates near human walking speed.

  2. Use more sophisticated motion models that reason about the hidden intents of nearby pedestrians to generate a set of predicted trajectories, then apply classical path-planning algorithms to produce a collision-free path for the robot.

    More sophisticated motion models have been proposed, which would reason about the nearby pedestrians’ hidden intents to generate a set of predicted paths. Subsequently, classical path planning algorithms would be employed to generate a collision-free path for the robot.

  • Drawback: splitting navigation into disjoint prediction and planning steps can cause the freezing robot problem, where the robot cannot find any feasible action because the predicted trajectories mark most of the space as untraversable.

    Separating the navigation problem into disjoint prediction and planning steps can lead to the freezing robot problem, in which the robot fails to find any feasible action because the predicted paths could mark a large portion of the space untraversable.

  • My comment (MC): although the paper criticizes this approach, it is currently the popular choice in industry. Splitting navigation into layered modules keeps the interfaces between upstream and downstream stages transparent, which makes the system easier to port and debug.

  3. Cooperation

To address the problems above, the authors propose accounting for cooperation, i.e., modeling/anticipating the impact of the robot's motion on nearby pedestrians.

> A key to resolving this problem is to account for cooperation, that is, to model/anticipate the impact of the robot’s motion on the nearby pedestrians.

  • Existing research on cooperative social navigation falls into two broad categories: model-based approaches and learning-based approaches.

    Existing work on cooperative, socially compliant navigation can be broadly classified into two categories, namely model-based and learning-based.

    • Model-based approaches are typically extensions of multiagent collision avoidance algorithms, adding parameters to account for social interactions.

    Model-based approaches are typically extensions of multiagent collision avoidance algorithms, with additional parameters introduced to account for social interactions.

    Drawbacks of model-based methods: it is unclear whether pedestrians actually follow the assumed geometric models; potential fields require parameter tuning for different pedestrians; and the planned trajectories can be oscillatory.

    • Learning-based approaches aim to develop a policy by matching feature statistics.

    Learning-based approaches aim to develop a policy that emulates human behaviors by matching feature statistics.
    In particular, Inverse Reinforcement Learning (IRL) has been applied to learn a cost function from human demonstration (teleoperation), and a probability distribution over the set of joint trajectories with nearby pedestrians.

    Learning-based methods come closer to human behavior than model-based ones, but at a higher computational cost. Moreover, feature statistics vary significantly from person to person, which raises concerns about generalization across environments.

In short, existing methods try to model or replicate the detailed mechanisms of social compliance, which remain difficult to quantify because of the stochasticity of pedestrian behavior.

In short, existing works are mostly focused on modeling and replicating the detailed mechanisms of social compliance, which remains difficult to quantify due to the stochasticity in people's behaviors.

The authors observe that humans follow a set of simple social norms, such as passing on the right. They characterize these properties in a reinforcement learning framework and find that human-like navigation conventions emerge from solving a cooperative collision avoidance problem.

Building on a recent paper, we characterize these properties in a reinforcement learning framework, and show that human-like navigation conventions emerge from solving a cooperative collision avoidance problem.

Figure: symmetries in multiagent collision avoidance.

BACKGROUND

Collision Avoidance with Deep Reinforcement Learning

First, multiagent collision avoidance can be formulated as a sequential decision-making problem in a reinforcement learning framework.

A multiagent collision avoidance problem can be formulated as a sequential decision making problem in a reinforcement learning framework.

  • Modeling the reinforcement learning problem

This part of the theoretical analysis is excellent; it is worth reading several times to appreciate fully.

  1. To capture the uncertainty in nearby pedestrians' intents, the state vector is divided into an observable part and an unobservable part. The observable part contains each pedestrian's velocity, position, and size; the unobservable part contains the goal position, preferred speed, and heading.

  2. The goal of the model is therefore to develop a policy that minimizes the time to reach the goal while avoiding collisions with nearby pedestrians.

  3. On this basis, the problem is expressed in the RL framework over the joint configuration with nearby pedestrians. A reward function is introduced that rewards agents for reaching their goals and penalizes agents that collide with others (see the sketch after this list).

    In particular, a reward function can be specified to reward the agent for reaching its goal and penalize the agent for colliding with others.

  • The unknown state-transition model accounts for the hidden intents of the other agents, and thereby indirectly captures the uncertainty in their behavior.

    The unknown state-transition model takes into account the uncertainty in the other agent’s motion due to its hidden intents.

  4. Solving the RL problem then amounts to finding the optimal value function, which encodes an estimate of the expected time to goal; the optimal policy can then be retrieved from the value function.

Solving the RL problem amounts to finding the optimal value function that encodes an estimate of the expected time to goal.
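
For concreteness, here is a sketch of the underlying formulation as it appears in the CADRL line of work; the notation is reconstructed from those papers, so treat the exact constants as indicative rather than definitive. Each agent's state splits into observable and hidden parts, a min-time reward penalizes collisions and near-misses, and the optimal policy is recovered from the optimal value function:

```latex
% State decomposition: observable vs. hidden parts
s = [s^{o},\, s^{h}], \qquad
s^{o} = [p_x,\, p_y,\, v_x,\, v_y,\, r], \qquad
s^{h} = [p_{gx},\, p_{gy},\, v_{pref},\, \psi]

% Min-time reward with collision penalties (d_{min}: closest separation
% over the decision interval; constants as reported in the CADRL papers)
R(s^{jn}, a) =
\begin{cases}
  1                 & \text{if } \mathbf{p} = \mathbf{p}_g \\
  -0.25             & \text{if } d_{min} < 0 \\
  -0.1 - d_{min}/2  & \text{if } 0 < d_{min} < 0.2 \\
  0                 & \text{otherwise}
\end{cases}

% Optimal policy retrieved from the optimal value function V^*
\pi^{*}(s^{jn}_t) = \operatorname*{argmax}_{a}\;
  R(s^{jn}_t, a) + \gamma^{\Delta t \cdot v_{pref}}
  \int P(s^{jn}_{t+1} \mid s^{jn}_t, a)\, V^{*}(s^{jn}_{t+1})\, ds^{jn}_{t+1}
```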

The main challenge in finding the optimal value function is that the joint state is a continuous, high-dimensional vector, which makes discretizing and enumerating the state space infeasible.

A major challenge in finding the optimal value function is that the joint state s^jn is a continuous, high-dimensional vector, making it impractical to discretize and enumerate the state space.

Recently, deep neural networks have been used to represent value functions over high-dimensional state spaces, making such RL problems tractable while achieving human-level performance.

Recent advances in reinforcement learning address this issue by using deep neural networks to represent value functions in high-dimensional spaces, and have demonstrated human-level performance on various complex tasks.
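
As an illustration of what such a value network might look like for this setting — a minimal sketch, not the architecture from the paper; `JOINT_STATE_DIM` and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

JOINT_STATE_DIM = 14  # assumed size: ego full state + the other agent's observable state

class ValueNetwork(nn.Module):
    """Approximates V(s^jn) for a continuous joint state -- a minimal sketch,
    not the architecture used in the paper."""

    def __init__(self, state_dim: int = JOINT_STATE_DIM, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_state: torch.Tensor) -> torch.Tensor:
        return self.net(joint_state)

# Usage: a batch of 32 joint states -> 32 scalar value estimates.
values = ValueNetwork()(torch.randn(32, JOINT_STATE_DIM))
```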

MC: up to this point the authors have been reviewing their own earlier work, the collision avoidance with deep reinforcement learning framework (CADRL); next, they build on it to introduce socially aware behaviors among multiple agents.

Characterization of Social Norms

Rather than quantifying human behavior directly, the paper argues that complex normative motion patterns can be the consequence of a set of simple local interactions. MC: I agree with this view; breaking a complex problem into small ones makes it easier to solve.

Rather than trying to quantify human behaviors directly, this work notes that the complex normative motion patterns can be a consequence of simple local interactions.

The paper therefore conjectures that, rather than being a set of precisely defined procedural rules, social norms emerge from a time-efficient, reciprocal collision avoidance mechanism.

Thus, we conjecture that rather than a set of precisely defined procedural rules, social norms are the emergent behaviors from a time-efficient, reciprocal collision avoidance mechanism.

A somewhat philosophical point: the principle of reciprocity in local collision avoidance gives rise to what we call social norms. The authors further show experimentally that even rule-free CADRL exhibits certain navigation conventions. (This could serve as a research hypothesis.)

Reciprocity implicitly encodes a model of the other agents’ behavior, which is the key for enabling cooperation without explicit communication. While no behavioral rules were imposed in the problem formulation, CADRL policy exhibits certain navigation conventions.

On this basis, the authors argue that existing human navigation conventions can be acquired through multiagent collision avoidance learning.

Prior work has reported that human navigation (or robot teleoperation) tends to be cooperative and time-optimal. The authors note that CADRL already encodes these two properties, via the min-time reward function and the reciprocity assumption (each agent is assumed to adopt the same learned optimal behavior).

Existing works have reported that human navigation (or teleoperation of a robot) tends to be cooperative and time-efficient. This work notes that these two properties are encoded in the CADRL formulation through using the min-time reward function and the reciprocity assumption.

At the same time, the authors point out that the cooperative behaviors emerging from CADRL do not match human conventions; this is the problem addressed next.

However, the cooperative behaviors emerging from a CADRL solution are not consistent with human interpretation. The next section will address this issue and present a method to induce behaviors that respect human social norms.

APPROACH

This section first describes how to shape normative behaviors for a two-agent system in the RL framework, and then generalizes the method to multiagent scenarios.

We first describe a strategy for shaping normative behaviors for a two-agent system in the RL framework, and then generalize the method to multiagent scenarios.

Inducing Social Norms

MC: this is essentially the same as my own solution, except that mine did not use a neural network.

Existing social norms are just one of many ways to resolve a symmetric collision avoidance scenario. To induce a particular norm, a small bias favoring one set of behaviors over the others is introduced into RL training.

This work notes that social norms are one of the many ways to resolve a symmetrical collision avoidance scenario. To induce a particular norm, a small bias can be introduced in the RL training process in favor of one set of behaviors over others.

As the authors note, the advantage of this approach is that violations of a particular behavior are usually easy to identify, and the specification need not be precise. The added penalty breaks the symmetry of the collision avoidance problem, so the solution is biased toward behaviors that comply with the social norm.

The advantage of this approach is that violations of a particular social norm are usually easy to specify; and this specification need not be precise. This is because the addition of a penalty breaks the symmetry in the collision avoidance problem, thereby favoring behaviors respecting the desired social norm.
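
To make the idea concrete, here is a simplified stand-in for such a norm-inducing penalty — not the paper's exact indicator conditions (the paper specifies precise geometric penalty regions for overtaking, passing, and crossing). The penalty magnitude, frame convention, and side test below are all assumptions:

```python
import numpy as np

Q_NORM = -0.05  # assumed magnitude; the paper uses a small fixed penalty

def norm_penalty(rel_pos, ego_vel, other_vel):
    """Example penalty for violating a right-handed passing norm.

    rel_pos: the other agent's position in the ego frame (x forward, y to
    the left). Assumed geometry: under a right-handed convention an oncoming
    agent should pass on the ego agent's left, so states where it sits ahead
    and to the right are penalized. A simplified sketch only.
    """
    oncoming = float(np.dot(ego_vel, other_vel)) < 0.0   # roughly opposite headings
    ahead_right = rel_pos[0] > 0.0 and rel_pos[1] < 0.0  # wrong side for passing
    close = float(np.linalg.norm(rel_pos)) < 2.0         # interaction range (assumed)
    return Q_NORM if (oncoming and ahead_right and close) else 0.0

# Usage: oncoming agent 1.5 m ahead and 0.5 m to the right -> penalized.
print(norm_penalty(np.array([1.5, -0.5]), np.array([1.0, 0.0]), np.array([-1.0, 0.0])))
```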

Finally, the training results show that policies resembling human conventions are learned, such as left-handed and right-handed norms.

As long as training converges, the penalty sets’ size does not have a major effect on the learned policy. This is expected because the desired behaviors are not in the penalty set.

Training a Multiagent Value Network

Because the training above involves only two agents, it is difficult to encode higher-order behaviors, such as accounting for the relations among nearby agents in multiagent environments. This part describes how to train on multiagent scenarios directly.

Since training was solely performed on a two-agent system, it was difficult to encode/induce higher order behaviors, such as accounting for the relations between nearby agents. This work addresses this problem by developing a method that allows for training on multiagent scenarios directly.

To capture the symmetry of the multiagent system, the paper uses a neural network with weight-sharing and max-pooling layers. The network takes four agents as input, and the states of the three nearby agents can be interchanged without affecting the output.

For the detailed network design, see the original paper; a simplified sketch is given below.

To capture the multiagent system's symmetrical structure, a neural network with weight-sharing and max-pooling layers is employed.

Figure: network structure for multiagent scenarios.
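
A minimal sketch of how weight-sharing plus max-pooling yields invariance to the ordering of nearby agents — an illustrative deep-set-style design under assumed dimensions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

EGO_DIM, NBR_DIM, N_NBRS = 9, 6, 3  # assumed feature sizes; NBR_DIM includes a real/virtual flag

class MultiagentValueNet(nn.Module):
    """Value network that is invariant to permutations of the nearby agents:
    one shared encoder embeds every neighbor (weight sharing), and max-pooling
    collapses the embeddings into an order-independent summary."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.nbr_encoder = nn.Sequential(nn.Linear(NBR_DIM, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(EGO_DIM + hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, ego: torch.Tensor, nbrs: torch.Tensor) -> torch.Tensor:
        # ego: (B, EGO_DIM); nbrs: (B, N_NBRS, NBR_DIM)
        emb = self.nbr_encoder(nbrs)    # shared weights across all neighbors
        pooled = emb.max(dim=1).values  # permutation-invariant pooling
        return self.head(torch.cat([ego, pooled], dim=-1))

# Swapping neighbor rows leaves the value unchanged.
net = MultiagentValueNet()
ego, nbrs = torch.randn(2, EGO_DIM), torch.randn(2, N_NBRS, NBR_DIM)
assert torch.allclose(net(ego, nbrs), net(ego, nbrs[:, [2, 0, 1], :]))
```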

During training, trajectories are generated first and then converted into experience sets.

The trajectories are then turned into state-value pairs and assimilated into the experience sets.
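
In the CADRL papers, the training target for a state on a goal-reaching trajectory is a discounted function of the remaining time to goal; a sketch of converting one trajectory into state-value pairs under that assumption (the discount value and the v_pref scaling follow the papers' convention, but treat the details as indicative):

```python
import numpy as np

GAMMA = 0.97  # assumed discount factor

def trajectory_to_pairs(states, times, v_pref):
    """Turn a goal-reaching trajectory into (state, value) training pairs.

    Following the CADRL convention, the target for the state at time t is
    gamma ** ((t_goal - t) * v_pref): states nearer the goal get targets
    nearer 1. Collision-ending trajectories would instead receive the
    collision penalty as their target (not shown).
    """
    t_goal = times[-1]
    return [(s, GAMMA ** ((t_goal - t) * v_pref)) for s, t in zip(states, times)]

# Usage: a 3-step trajectory that reaches the goal at t = 2.0 s.
pairs = trajectory_to_pairs([np.zeros(14)] * 3, [0.0, 1.0, 2.0], v_pref=1.0)
```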

Differences between training CADRL and SA-CADRL

  • Two experience sets are used to distinguish between trajectories that reached the goals and those that ended in a collision.
  • During the training process, trajectories generated by SA-CADRL are reflected in the x-axis with probability (a sketch of this step follows below).
    • This procedure exploits symmetry in the problem to explore different topologies more efficiently.
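
A sketch of the reflection trick under simple assumptions about the state layout (the column indices are hypothetical, and the reflection probability is left as a parameter since the notes do not state its value):

```python
import numpy as np

def maybe_reflect(states, p, y_indices=(1, 3)):
    """With probability p, mirror a trajectory across the x-axis by negating
    the y-components of position and velocity.

    `y_indices` encodes an assumed state layout [px, py, vx, vy, ...].
    Reflecting swaps left- and right-biased encounters, so both topologies
    share training data.
    """
    if np.random.rand() >= p:
        return states
    reflected = states.copy()
    reflected[:, list(y_indices)] *= -1.0
    return reflected

# Usage: reflect a (T, D) trajectory half the time (example probability).
trajectory = np.random.randn(5, 14)
augmented = maybe_reflect(trajectory, p=0.5)
```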

When training the network, the authors include a binary flag indicating whether each other agent is real or virtual, so an n-agent network can also be used in scenarios with p (p ≤ n) agents.

An n-agent network can be used to generate trajectories for scenarios with fewer agents.
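
One plausible way to realize this — a sketch; the flag position and padding values are assumptions — is to pad unused neighbor slots with "virtual" agents whose flag is zero:

```python
import numpy as np

N_NBRS, NBR_DIM = 3, 6  # assumed: 3 neighbor slots, last feature is the real/virtual flag

def pad_neighbors(real_neighbors):
    """Fill unused neighbor slots with zeroed 'virtual' agents.

    Real neighbors get flag 1.0 and virtual padding gets flag 0.0, so a
    network trained with n agents can be queried with p <= n agents.
    """
    out = np.zeros((N_NBRS, NBR_DIM))
    for i, nbr in enumerate(real_neighbors[:N_NBRS]):
        out[i, :-1] = nbr   # the neighbor's observable features
        out[i, -1] = 1.0    # mark this slot as a real agent
    return out

# Usage: a scene with one real neighbor (5 observable features).
padded = pad_neighbors([np.random.randn(NBR_DIM - 1)])
```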

RESULTS

Computational Details (online performance and offline training)

The model shows good real-time performance and reliable convergence, producing time-efficient paths.

The size and connections in the multiagent network are tuned to obtain good performance (ensure convergence and produce time-efficient paths) while achieving real-time performance.

Simulation Results

Three comparison experiments were run: one without the norm-inducing reward function, and two with left-handed and right-handed norm-inducing rewards respectively.

Three copies of four-agent SA-CADRL policies were trained, one without the norm inducing reward, one with the left-handed, and the other with the right-handed.

Hardware Experiment

Hardware setup

  • The differential-drive vehicle is outfitted with a Lidar for localization, three Intel Realsenses for free space detection, and four webcams for pedestrian detection.

CONCLUSION

Contribution

  • In a reinforcement learning framework, a pair of simulated agents navigate around each other to learn a policy that respects human navigation norms, such as passing on the right and overtaking on the left in a right-handed system.
  • This approach is further generalized to multiagent (n > 2) scenarios through the use of a symmetrical neural network structure.
  • Moreover, SA-CADRL is implemented on robotic hardware, which enabled fully autonomous navigation at human walking speed in a dynamic environment with many pedestrians.

Future work

  • Consider the relationships between nearby pedestrians, such as a group of people who walk together.
  • MC: this could transfer to other settings that involve learning rules, e.g., the COLREGS rules for unmanned surface vehicles, which resemble human navigation conventions.
  • Training efficiency of the network model: the paper trains with only 4 agents; how would it scale to more complex scenarios with more agents?