Self-supervised sound source localization; probably similar to Sound of Pixels.
2018.4
Paper: https://arxiv.org/pdf/1804.03641.pdf
Project page: http://andrewowens.com/multisensory
Abstract.
The thud of a bouncing ball, the onset of speech as lips open—when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator’s voice from a foreign official’s speech. Code, models, and video results are available on our webpage: http://andrewowens.com/multisensory.
1 Introduction
As humans, we experience our world through a number of simultaneous sensory streams. When we bite into an apple, not only do we taste it, but — as Smith and Gasser [1] point out — we also hear it crunch, see its red skin, and feel the coolness of its core. The coincidence of sensations gives us strong evidence that they were generated by a common, underlying event [2], since it is unlikely that they co-occurred across multiple modalities merely by chance. These cross-modal, temporal co-occurrences therefore provide a useful learning signal: a model that is trained to detect them ought to discover multimodal structures that are useful for other tasks. In much of traditional computer vision research, however, we have been avoiding the use of other, non-visual modalities, arguably making the perception problem harder, not easier.
In this paper, we learn a temporal, multisensory representation that fuses the visual and audio components of a video signal. We propose to train this model without using any manually labeled data. That is, rather than explicitly telling the model that, e.g., it should associate moving lips with speech or a thud with a bouncing ball, we have it discover these audio-visual associations through self-supervised training [3]. Specifically, we train a neural network on a “pretext” task of detecting misalignment between audio and visual streams in synthetically-shifted videos. The network observes raw audio and video streams — some of which are aligned, and some that have been randomly shifted by a few seconds — and we task it with distinguishing between the two. This turns out to be a challenging training task that forces the network to fuse visual motion with audio information and, in the process, learn a useful audio-visual feature representation.
By comparison, Sound of Pixels is non-temporal, as its model architecture shows.
We demonstrate the usefulness of our multisensory representation in three audiovisual applications: (a) sound source localization, (b) audio-visual action recognition; and (c) on/off-screen sound source separation. Figure 1 shows examples of these applications. In Fig. 1(a), we visualize the sources of sound in a video using our network’s learned attention map, i.e. the impact of an axe, the opening of a mouth, and moving hands of a musician. In Fig. 1(b), we show an application of our learned features to audio-visual action recognition, i.e. classifying a video of a chef chopping an onion. In Fig. 1(c), we demonstrate our novel on/off-screen sound source separation model’s ability to separate the speakers’ voices by visually masking them from the video.
The main contributions of this paper are: 1) learning a general video representation that fuses audio and visual information; 2) evaluating the usefulness of this representation qualitatively (by sound source visualization) and quantitatively (on an action recognition task); and 3) proposing a novel video-conditional source separation method that uses our representation to separate on- and off-screen sounds, and is the first method to work successfully on real-world video footage, e.g. television broadcasts. Our feature representation, as well as code and models for all applications are available online.
2 Related work
Evidence from psychophysics While we often think of vision and hearing as being distinct systems, in humans they are closely intertwined [4] through a process known as multisensory integration. Perhaps the most compelling demonstration of this phenomenon is the McGurk effect [5], an illusion in which visual motion of a mouth changes one’s interpretation of a spoken sound. Hearing can also influence vision: the timing of a sound, for instance, affects whether we perceive two moving objects to be colliding or overlapping [2]. Moreover, psychologists have suggested that humans fuse audio and visual signals at a fairly early stage of processing [7,8], and that the two modalities are used jointly in perceptual grouping. For example, the McGurk effect is less effective when the viewer first watches a video where audio and visuals in a video are unrelated, as this causes the signals to become “unbound” (i.e. not grouped together) [9,10]. This multi-modal perceptual grouping process is often referred to as audio-visual scene analysis [11,7,12,10]. In this paper, we take inspiration from psychology and propose a self-supervised multisensory feature representation as a computational model of audio-visual scene analysis.
Self-supervised learning Self-supervised methods learn features by training a model to solve a task derived from the input data itself, without human labeling. Starting with the early work of de Sa [3], there have been many self-supervised methods that learn to find correlations between sight and sound [13,14,15,16]. These methods, however, have either learned the correspondence between static images and ambient sound [15,16], or have analyzed motion in very limited domains [14,13] (e.g. [14] only modeled drumstick impacts). Our learning task resembles Arandjelović and Zisserman [16], which predicts whether an image and an audio track are sampled from the same (or different) videos. Their task, however, is solvable from a single frame by recognizing semantics (e.g. indoor vs. outdoor scenes). Our inputs, by contrast, always come from the same video, and we predict whether they are aligned; hence our task requires motion analysis to solve. Time has also been used as a supervisory signal, e.g. predicting the temporal ordering in a video [17,18,19]. In contrast, our network learns to analyze audio-visual actions, which are likely to correspond to salient physical processes.
Audio-visual alignment While we study alignment for self-supervised learning, it has also been studied as an end in itself [20,21,22] e.g. in lip-reading applications [23]. Chung and Zisserman [22], the most closely related approach, train a two-stream network with an embedding loss. Since aligning speech videos is their end goal, they use a face detector (trained with labels) and a tracking system to crop the speaker’s face. This allows them to address the problem with a 2D CNN that takes 5 channel-wise concatenated frames cropped around a mouth as input (they also propose using their image features for self-supervision; while promising, these results are very preliminary).

Sound localization The goal of visually locating the source of sounds in a video has a long history. The seminal work of Hershey et al. [24] localized sound sources by measuring mutual information between visual motion and audio using a Gaussian process model. Subsequent work also considered subspace methods [25], canonical correlations [26], and keypoints [27]. Our model learns to associate motions with sounds via self-supervision, without us having to explicitly model them.
Audio-Visual Source Separation Blind source separation (BSS), i.e. separating the individual sound sources in an audio stream — also known as the cocktail party problem [28] — is a classic audio-understanding task [29]. Researchers have proposed many successful probabilistic approaches to this problem [30,31,32,33]. More recent deep learning approaches involve predicting an embedding that encodes the audio clustering [34,35], or optimizing a permutation invariant loss [36]. It is natural to also want to include the visual signal to solve this problem, often referred to as Audio-Visual Source Separation. For example, [37,25] masked frequencies based on their correlation with optical flow; [12] used graphical models; [27] used priors on harmonics; [38] used a sparsity-based factorization method; and [39] used a clustering method. Other methods use face detection and multi-microphone beamforming [40]. These methods make strong assumptions about the relationship between sound and motion, and have mostly been applied to lab-recorded video. Researchers have proposed learning-based methods that address these limitations, e.g. [41] use mixture models to predict separation masks. Recently, [42] proposed a convolutional network that isolates on-screen speech, although this model is relatively small-scale (tested on videos from one speaker). We do on/off-screen source separation on more challenging internet and broadcast videos by combining our representation with a u-net [43] regression model.
Concurrent work Concurrently and independently from us, a number of groups have proposed closely related methods for source separation and sound localization. Gabbay et al. [44,45] use a vision-to-sound method to separate speech, and propose a convolutional separation model. Unlike our work, they assume speaker identities are known. Ephrat et al. [46] and Afouras et al. [47] separate the speech of a user-chosen speaker from videos containing multiple speakers, using face detection and tracking systems to group the different speakers. Work by Zhao et al. [48] and Gao et al. [49] separate sound for multiple visible objects (e.g. musical instruments). This task involves associating objects with the sounds they typically make based on their appearance, while ours involves the “fine-grained” motion-analysis task of separating multiple speakers. There has also been recent work on localizing sound sources using a network’s attention map [50,51,52]. These methods are similar to ours, but they largely localize objects and ambient sound in static images, while ours responds to actions in videos.
3 Learning a self-supervised multisensory representation
We propose to learn a representation using self-supervision, by training a model to predict whether a video’s audio and visual streams are temporally synchronized.
Fused audio-visual network. We train an early-fusion multisensory network to predict whether video frames and audio are temporally aligned. We include residual connections between pairs of convolutions [53]. We represent the input as a T × H × W volume and denote a stride of 2 with “/2”. To generate misaligned examples, we synthetically shift the audio by a few seconds.
This network might be used to produce the final output, or to supply the training objective, similar to the discriminator D in a GAN.
Aligning sight with sound During training, we feed a neural network video clips. In half of them, the vision and sound streams are synchronized; in the others, we shift the audio by a few seconds. We train a network to distinguish between these examples. More specifically, we learn a model pθ(y | I; A) that predicts whether the image stream I and audio stream A are synchronized, by maximizing the log-likelihood:
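(The equation itself was dropped from these notes; the following is a plausible reconstruction of Equation 1 from the definitions in the next sentence, with A_0 the unshifted audio and A_t the audio shifted by a random offset t, not a verbatim copy.)

$$\max_{\theta}\; \mathbb{E}_{I,A,t}\!\left[\tfrac{1}{2}\log p_{\theta}(y=1 \mid I, A_{0}) + \tfrac{1}{2}\log p_{\theta}(y=0 \mid I, A_{t})\right]$$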
where A_s is the audio track shifted by s secs., t is a random temporal shift, θ are the model parameters, and y is the event that the streams are synchronized. This learning problem is similar to noise-contrastive estimation [54], which trains a model to distinguish between real examples and noise; here, the noisy examples are misaligned videos.
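To make the pretext task concrete, here is a minimal sketch of one training step, assuming a PyTorch model `net` that maps a video clip and a stereo waveform to a single alignment logit. The tensor shapes, the shift range, and the use of torch.roll to fake the shift are illustrative assumptions; the paper samples the shifted audio from the longer source clip rather than wrapping it around.

```python
import torch
import torch.nn.functional as F

def alignment_step(net, video, audio, sample_rate=21000, min_shift=2.0, max_shift=5.8):
    """video: (B, 3, T, H, W) frames; audio: (B, 2, S) waveforms aligned with the video.
    Half of the batch keeps its audio (label 1 = synchronized); the other half gets a
    randomly shifted track (label 0). Returns the binary cross-entropy loss."""
    B = video.shape[0]
    labels = (torch.rand(B, device=video.device) < 0.5).float()
    shifted = audio.clone()
    for i in range(B):
        if labels[i] == 0:
            secs = min_shift + (max_shift - min_shift) * torch.rand(1).item()
            shifted[i] = torch.roll(audio[i], int(secs * sample_rate), dims=-1)
    logits = net(video, shifted).squeeze(-1)      # p_theta(y | I, A) before the sigmoid
    return F.binary_cross_entropy_with_logits(logits, labels)
```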
Fused audio-visual network design Solving this task requires the integration of low-level information across modalities. In order to detect misalignment in a video of human speech, for instance, the model must associate the subtle motion of lips with the timing of utterances in the sound. We hypothesize that early fusion of audio and visual streams is important for modeling actions that produce a signal in both modalities. We therefore propose to solve our task using a 3D multisensory convolutional network (CNN) with an early-fusion design (Figure 2).
Before fusion, we apply a small number of 3D convolution and pooling operations to the video stream, reducing its temporal sampling rate by a factor of 4. We also apply a series of strided 1D convolutions to the input waveform, until its sampling rate matches that of the video network. We fuse the two subnetworks by concatenating their activations channel-wise, after spatially tiling the audio activations. The fused network then undergoes a series of 3D convolutions, followed by global average pooling [55]. We add residual connections between pairs of convolutions. We note that the network architecture resembles ResNet-18 [53] but with the extra audio subnetwork, and 3D convolutions instead of 2D ones (following work on inflated convolutions [56]).
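A minimal PyTorch sketch of this early-fusion design follows, shrunk for readability. The channel counts, kernel sizes, and the adaptive pooling used to match the audio rate to the video rate are assumptions; the paper uses a specific stack of strided 1D convolutions and a ResNet-18-like 3D trunk with residual connections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedAVNet(nn.Module):
    def __init__(self, v_ch=32, a_ch=32, fused_ch=64):
        super().__init__()
        # Video subnetwork: two temporal stride-2 3D convs -> temporal rate / 4.
        self.video = nn.Sequential(
            nn.Conv3d(3, v_ch, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(inplace=True),
            nn.Conv3d(v_ch, v_ch, kernel_size=(3, 3, 3), stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        # Audio subnetwork: strided 1D convolutions over the raw stereo waveform.
        self.audio = nn.Sequential(
            nn.Conv1d(2, a_ch, kernel_size=65, stride=4, padding=32), nn.ReLU(inplace=True),
            nn.Conv1d(a_ch, a_ch, kernel_size=15, stride=4, padding=7), nn.ReLU(inplace=True),
        )
        # Fused trunk after channel-wise concatenation.
        self.fused = nn.Sequential(
            nn.Conv3d(v_ch + a_ch, fused_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(fused_ch, fused_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Linear(fused_ch, 1)  # single alignment logit

    def forward(self, video, audio):
        v = self.video(video)                              # (B, Cv, T/4, H', W')
        a = self.audio(audio)                              # (B, Ca, S')
        a = F.adaptive_avg_pool1d(a, v.shape[2])           # match the video's temporal rate
        a = a[..., None, None].expand(-1, -1, -1, v.shape[3], v.shape[4])  # tile over space
        x = self.fused(torch.cat([v, a], dim=1))           # channel-wise early fusion
        x = x.mean(dim=(2, 3, 4))                          # global average pooling
        return self.head(x)

# Usage, e.g. as the `net` in the earlier training-step sketch:
# FusedAVNet()(torch.randn(1, 3, 16, 64, 64), torch.randn(1, 2, 16 * 700))  -> (1, 1) logit
```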
Training We train our model with 4.2-sec. videos, randomly shifting the audio by 2.0 to 5.8 seconds. We train our model on a dataset of approximately 750,000 videos randomly sampled from AudioSet [57]. We use full frame-rate videos (29.97 Hz), resulting in 125 frames per example. We select random 224 × 224 crops from resized 256×256 video frames, apply random left-right flipping, and use 21 kHz stereo sound. We sample these video clips from longer (10 sec.) videos. Optimization details can be found in Section A1.
Task performance We found that the model obtained 59.9% accuracy on held-out videos for its alignment task (chance = 50%). While at first glance this may seem low, we note that in many videos the sounds occur off-screen [15]. Moreover, we found that this task is also challenging for humans. To get a better understanding of human ability, we showed 30 participants from Amazon Mechanical Turk 60 aligned/shifted video pairs, and asked them to identify the one with out-of-sync sound. We gave them 15 secs. of video (so they have significant temporal context) and used large, 5-sec. shifts. They solved the task with 66.6% ± 2.4% accuracy.
To help understand what actions the model can predict synchronization for, we also evaluated its accuracy on categories from the Kinetics dataset [58] (Figure A1). It was most successful for classes involving human speech: e.g., news anchoring, answering questions, and testifying. Of course, the most important question is whether the learned audio-visual representation is useful for downstream tasks. We therefore turn our attention to applications.
4 Visualizing the locations of sound sources
One way of evaluating our representation is to visualize the audio-visual structures that it detects. A good audio-visual representation, we hypothesize, will pay special attention to visual sound sources — on-screen actions that make a sound, or whose motion is highly correlated with the onset of sound. We note that there is ambiguity in the notion of a sound source for in-the-wild videos. For example, a musician’s lips, their larynx, and their tuba could all potentially be called the source of a sound. Hence we use this term to refer to motions that are correlated with production of a sound, and study it through network visualizations.
To do this, we apply the class activation map (CAM) method of Zhou et al. [59], which has been used for localizing ambient sounds [52]. Given a space-time video patch Ix, its corresponding audio Ax, and the features assigned to them by the last convolutional layer of our model, f(Ix; Ax), we can estimate the probability of alignment with:
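(Equation 2 was dropped from these notes; a plausible form, consistent with the variable definitions in the next sentence, is the following, not a verbatim copy.)

$$p_{\theta}(y \mid I_{x}, A_{x}) = \sigma\!\left(\mathbf{w}^{\top} f(I_{x}, A_{x})\right)$$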
where y is the binary alignment label, σ the sigmoid function, and w is the model’s final affine layer. We can therefore measure the information content of a patch — and, by our hypothesis, the likelihood that it is a sound source — by the magnitude of the prediction |w?f(Ix; Ax)|.
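As a small illustration, this per-location score can be computed directly from the last convolutional layer's activations; the tensor layout below is an assumption.

```python
import torch

def cam_scores(features: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """features: (T, H, W, C) activations f(Ix, Ax) of the last conv layer;
    w: (C,) weights of the final affine layer. Returns |w^T f| for each
    space-time location, which can be overlaid on the frames as a heatmap."""
    return torch.einsum("thwc,c->thw", features, w).abs()
```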
One might ask how this self-supervised approach to localization relates to generative approaches, such as classic mutual information methods [24,25]. To help understand this, we can view our audio-visual observations as having been produced by a generative process (using an analysis similar to [60]): we sample the label y, which determines the alignment, and then conditionally sample Ix and Ax. Rather than computing mutual information between the two modalities (which requires a generative model that self-supervised approaches do not have), we find the patch/sound that provides the most information about the latent variable y, based on our learned model p(y | Ix; Ax).
Visualizations What actions does our network respond to? First, we asked which space-time patches in our test set were most informative, according to Equation 2. We show the top-ranked patches in Figure 3, with the class activation map displayed as a heatmap and overlaid on its corresponding video frame. From this visualization, we can see that the network is selective to faces and moving mouths. The strongest responses that are not faces tend to be unusual but salient audio-visual stimuli (e.g. two top-ranking videos contain strobe lights and music). For comparison, we show the videos with the weakest response in Figure 4; these contain relatively few faces. Next, we asked how the model responds to videos that do not contain speech, and applied our method to the Kinetics-Sounds dataset [16] — a subset of Kinetics [58] classes that tend to contain a distinctive sound. We show the examples with the highest response for a variety of categories, after removing examples in which the response was solely to a face (which appear in almost every category). We show results in Figure 5. Finally, we asked how the model’s attention varies with motion. To study this, we computed our CAM-based visualizations for videos, which we have included in the supplementary video (we also show some hand-chosen examples in Figure 1(a)). These results qualitatively suggest that the model’s attention varies with on-screen motion. This is in contrast to single-frame models [50,52,16], which largely attend to sound-making objects rather than actions.
5 Action recognition
We have seen through visualizations that our representation conveys information about sound sources. We now ask whether it is useful for recognition tasks. To study this, we fine-tuned our model for action recognition using the UCF-101 dataset [64], initializing the weights with those learned from our alignment task. We provide the results in Table 1, and compare our model to other unsupervised learning and 3D CNN methods.
UCF-101 is an action recognition dataset collected from YouTube, containing 101 action classes. Each class is performed by 25 people, each contributing 4-7 clips, for a total of 13,320 videos at 320×240 resolution (about 6.5 GB). The dataset has great diversity in how actions are captured, including camera motion, appearance variation, pose variation, object scale variation, background variation, and lighting variation. The 101 classes fall into five groups: human-object interaction, body motion only, human-human interaction, playing musical instruments, and sports.
We train with 2.56-second subsequences, following [56], which we augment with random flipping and cropping, and small (up to one frame) audio shifts. At test time, we follow [65] and average the model’s outputs over 25 clips from each video, and use a center 224 × 224 crop. Please see Section A1 for optimization details.
Analysis We see, first, that our model significantly outperforms self-supervised approaches that have previously been applied to this task, including Shuffle-and-Learn [17] (82.1% vs. 50.9% accuracy) and O3N [19] (60.3%). We suspect this is in part due to the fact that these methods either process a single frame or a short sequence, and they solve tasks that do not require extensive motion analysis. We then compared our model to methods that use supervised pretraining, focusing on the state-of-the-art I3D [56] model. While there is a large gap between our self-supervised model and a version of I3D that has been pretrained on the closely-related Kinetics dataset (94.5%), the performance of our model (with both sound and vision) is close to the (visual-only) I3D pretrained with ImageNet [66] (84.2%).
Next, we trained our multisensory network with the self-supervision task of [16] rather than our own, i.e. creating negative examples by randomly pairing the audio and visual streams from different videos, rather than by introducing misalignment. We found that this model performed significantly worse than ours (78.7%), perhaps due to the fact that its task can largely be solved without analyzing motion.
Finally, we asked how components of our model contribute to its performance. To test whether the model is obtaining its predictive power from audio, we trained a variation of the model in which the audio subnetwork was ablated (activations set to zero), finding that this results in a 5% drop in performance. This suggests both that sound is important for our results, and that our visual features are useful in isolation. We also tried training a variation of the model that operated on spectrograms, rather than raw waveforms, finding that this yielded similar performance (Section A2). To measure the importance of our self-supervised pretraining, we compared our model to a randomly initialized network (i.e. trained from scratch), finding that there was a significant (14%) drop in performance — similar in magnitude to removing ImageNet pretraining from I3D. These results suggest that the model has learned a representation that is useful both for vision-only and audio-visual action recognition.
6 On/off-screen audio-visual source separation
We now apply our representation to a classic audio-visual understanding task: separating on- and off-screen sound. To do this, we propose a source separation model that uses our learned features. Our formulation of the problem resembles recent audio-visual and audio-only separation work [34,36,67,42]. We create synthetic sound mixtures by summing an input video’s (“on-screen”) audio track with a randomly chosen (“off-screen”) track from a random video. Our model is then tasked with separating these sounds.
Task We consider models that take a spectrogram for the mixed audio as input and recover spectrograms for the two mixture components. Our simplest on/off-screen separation model learns to minimize:
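(Equation 3 was dropped from these notes; a plausible form, consistent with the definitions that follow, is the following, not a verbatim copy.)

$$\mathcal{L}(x_{F}, x_{B}) = \lVert x_{F} - f_{F}(x_{M}, I)\rVert_{1} + \lVert x_{B} - f_{B}(x_{M}, I)\rVert_{1}$$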
where xM is the mixture sound, xF and xB are the spectrograms of the on- and off-screen sounds that comprise it (i.e. foreground and background), and fF and fB are our model’s predictions of them conditional on the (audio-visual) video I.
Adapting our audio-visual network to a source separation task. Our model splits the input spectrogram into on-screen and off-screen audio streams. After each temporal downsampling layer, our multisensory features are concatenated with a u-net computed over the spectrogram. We invert the spectrogram to obtain a waveform. The model operates on raw video, without any preprocessing (e.g. no face detection).
We also consider models that segment the two sounds without regard for their on- or off-screen provenance, using the permutation invariant loss (PIT) of Yu et al. [36]. This loss is similar to Equation 3, but it allows for the on- and off-screen sounds to be swapped without penalty:
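(Equation 4 was dropped from these notes; a plausible form, consistent with the definitions in the next sentence, is the following, not a verbatim copy.)

$$\mathcal{L}_{\mathrm{PIT}}(\hat{x}_{1}, \hat{x}_{2}) = \min\!\left(L(\hat{x}_{1}, \hat{x}_{2}),\; L(\hat{x}_{2}, \hat{x}_{1})\right)$$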
where L(xi, xj) = ||xi - xF||1 + ||xj - xB||1, and x?1 and x?2 are the model’s two predictions.
6.1 Source separation model
We augment our audio-visual network with a u-net encoder-decoder [43,69,70] that maps the mixture sound to its on- and off-screen components (Figure 6). To provide the u-net with video information, we include our multisensory network’s features at three temporal scales: we concatenate the last layer of each temporal scale with the layer of the encoder that has the closest temporal sampling rate. Prior to concatenation, we use linear interpolation to make the video features match the audio sampling rate; we then mean-pool them spatially, and tile them over the frequency domain, thereby reshaping our 3D CNN’s time/height/width shape to match the 2D encoder’s time/frequency shape. We use parameters for u-net similar to [69], adding one pair of convolution layers to compensate for the large number of frequency channels in our spectrograms. We predict both the magnitude of the log-spectrogram and its phase (we scale the phase loss by 0.01 since it is less perceptually important). To obtain waveforms, we invert the predicted spectrogram. We emphasize that our model uses raw video, with no preprocessing or labels (e.g. no face detection or pretrained supervised features).
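A sketch of this feature injection is given below, assuming the multisensory activations at one temporal scale have shape (B, Cv, Tv, H, W) and the u-net encoder activations have shape (B, Ce, Ta, Fa) over (time, frequency); the function name and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def concat_video_features(av_feats: torch.Tensor, enc_feats: torch.Tensor) -> torch.Tensor:
    """Pool the video features spatially, interpolate them to the encoder's temporal
    sampling rate, tile them over frequency, and concatenate channel-wise."""
    v = av_feats.mean(dim=(3, 4))                                   # (B, Cv, Tv)
    v = F.interpolate(v, size=enc_feats.shape[2], mode="linear",
                      align_corners=False)                          # (B, Cv, Ta)
    v = v.unsqueeze(-1).expand(-1, -1, -1, enc_feats.shape[3])      # (B, Cv, Ta, Fa)
    return torch.cat([enc_feats, v], dim=1)                         # (B, Ce+Cv, Ta, Fa)
```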
Training We evaluated our model on the task of separating speech sounds using the VoxCeleb dataset [71]. We split the training/test to have disjoint speaker identities (72%, 8%, and 20% for training, validation, and test). During training, we sampled 2.1-sec. clips from longer 5-sec. clips, and normalized each waveform’s mean squared amplitude to a constant value. We used spectrograms with a 64 ms frame length and a 16 ms step size, producing 128 × 1025 spectrograms. In each mini-batch of the optimization, we randomly paired video clips, making one the off-screen sound for the other. We jointly optimized our multisensory network and the u-net model, initializing the weights using our self-supervised representation (see supplementary material for details).?
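The synthetic mixing procedure can be sketched as follows. The target mean-squared amplitude (0.1 here) is an assumption, since the paper only states that waveforms are normalized to a constant value, and the sketch treats the waveforms as monaural.

```python
import torch

def make_mixtures(wavs: torch.Tensor, target_ms: float = 0.1):
    """wavs: (B, S) waveforms. Pairs each clip with a randomly permuted partner as its
    off-screen track. Returns (mixture, on_screen, off_screen)."""
    scale = torch.sqrt(target_ms / wavs.pow(2).mean(dim=-1, keepdim=True).clamp_min(1e-8))
    x = wavs * scale                                          # normalize mean squared amplitude
    off = x[torch.randperm(x.shape[0], device=x.device)]      # another clip's audio as off-screen sound
    return x + off, x, off
```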
6.2 Evaluation
We compared our model to a variety of separation methods: 1) we replaced our self-supervised video representation with other features, 2) compared to audio-only methods that use blind separation, and 3) compared to other audio-visual models.
Ablations Since one of our main goals is to evaluate the quality of the learned features, we compared several variations of our model (Table 2). First, we replaced the multisensory features with the I3D network [56] pretrained on the Kinetics dataset — a 3D CNN-based representation that was very effective for action recognition (Section 5). This model performed significantly worse (11.4 vs. 12.3 spectrogram ?1 loss for Equation 3). One possible explanation is that our pretraining task requires extensive motion analysis, whereas even single-frame action recognition can still perform well [65,72].
Qualitative results of our on/off-screen separation model. We show the input frames and spectrograms for two synthetic mixtures from our test set, and for two in-the-wild internet videos that contain multiple speakers. The first (a male/male mixture) contains more artifacts than the second (a female/male mixture). The third video is a real-world mixture in which a female speaker (simultaneously) translates a male Spanish speaker into English. Finally, we separate the speech of two (male) speakers on a TV news program. Although there is no ground truth for these real-world examples, the source separation method qualitatively separates the two voices. Please see our webpage (http://andrewowens.com/multisensory) for video source separation results.
We then asked how much of our representation’s performance comes from motion features, rather than from recognizing properties of the speaker (e.g. gender). To test this, we trained the model with only a single frame (replicated temporally to make a video). We found a significant drop in performance (11.4 vs. 14.8 loss). The drop was particularly large for mixtures in which two speakers had the same gender — a case where lip motion is an important cue.
One might also ask whether early audio-visual fusion is helpful — the network, after all, fuses the modalities in the spectrogram encoder-decoder as well. To test this, we ablated the audio stream of our multisensory network and retrained the separation model. This model obtained worse performance, suggesting the fused audio is helpful even when it is available elsewhere. Finally, while the encoder-decoder uses only monaural audio, our representation uses stereo. To test whether it uses binaural cues, we converted all the audio to mono and re-evaluated it. We found that this did not significantly affect performance, which is perhaps due to the difficulty of using stereo cues in in-the-wild internet videos (e.g. 39% of the audio tracks were mono). Finally, we also transferred (without retraining) our learned models to the GRID dataset [73], a lab-recorded dataset in which people speak simple phrases in front of a plain background, finding a similar relative ordering of the methods.
Audio-only separation To get a better understanding of our model’s effectiveness, we compared it to audio-only separation methods. While these methods are not applicable to on/off-screen separation, we modified our model to have it separate audio using an extra permutation invariant loss (Equation 4) and then compared the methods using blind separation metrics [68]: signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR). For consistency across methods, we resampled predicted waveforms to 16 kHz (the minimum used by all methods), and used the mixture phase to invert our model’s spectrogram, rather than the predicted phase (which none of the others predict).
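For reference, these blind separation metrics can be computed with the mir_eval package; this is an assumption, since the paper cites the BSS Eval metrics [68] but does not say which implementation it uses.

```python
import numpy as np
import mir_eval

def separation_metrics(reference_sources, estimated_sources):
    """Both arguments: arrays of shape (n_sources, n_samples) at a common sample rate.
    Returns per-source SDR, SIR, and SAR."""
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
        np.asarray(reference_sources), np.asarray(estimated_sources))
    return sdr, sir, sar
```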
We compared our model to PIT-CNN [36]. This model uses a VGG-style [74] CNN to predict two soft separation masks via a fully connected layer. These maps are multiplied by the input mixture to obtain the segmented streams. While this method worked well on short clips, we found it failed on longer inputs (e.g. obtaining 1.8 SDR in the experiment shown in Table 2). To create a stronger PIT baseline, we therefore created an audio-only version of our u-net model, optimizing the PIT loss instead of our on/offscreen loss, i.e. replacing the VGG-style network and masks with u-net. We confirmed that this model obtains similar performance on short sequences (Table 3), and found it successfully trained on longer videos. Finally, we compared with a pretrained separation model [67], which is based on recurrent networks and trained on the TSP dataset [75].
We found that our audio-visual model, when trained with a PIT loss, outperformed all of these methods, except on the SAR metric, where the u-net PIT model was slightly better (the SAR metric largely measures the presence of artifacts in the generated waveform). In particular, our model did significantly better than the audio-only methods when the genders of the two speakers in the mixture were the same (Table 2). Interestingly, we found that the audio-only methods still performed better on blind separation metrics when transferring to the lab-recorded GRID dataset, which we hypothesize is due to the significant domain shift.
Audio-visual separation We compared to the audio-visual separation model of Hou et al. [42]. This model was designed for enhancing the speech of a previously known speaker, but we apply it to our task since it is the most closely related prior method. We also evaluated the network of Gabbay et al. [45] (a concurrent approach to ours). We trained these models using the same procedure as ours ([45] used speaker identities to create hard mixtures; we instead assumed speaker identities are unknown and mix randomly). Both models take very short (5-frame) video inputs. Therefore, following [45] we evaluated 200ms videos (Table 3). For these baselines, we cropped the video around the speaker’s mouth using the Viola-Jones [76] lip detector of [45] (we do not use face detection for our own model). These methods use a small number of frequency bands in their (Mel-) STFT representations, which limits their quantitative performance. To address these limitations, we evaluated only the on-screen audio, and downsampled the audio to a low, common rate (2 kHz) before computing SDR. Our model significantly outperforms these methods. Qualitatively, we observed that [45] often smooths the input spectrogram, and we suspect its performance on source separation metrics may be affected by the relatively small number of frequency bands in its audio representation.
6.3 Qualitative results
Our quantitative results suggest that our model can successfully separate on- and offscreen sounds. However, these metrics are limited in their ability to convey the quality of the predicted sound (and are sensitive to factors that may not be perceptually important, such as the frequency representation). Therefore, we also provide qualitative examples.
Real mixtures In Figure 7, we show results for two synthetic mixtures from our test set, and two real-world mixtures: a simultaneous Spanish-to-English translation and a television interview with concurrent speech. We exploit the fact that our model is fully convolutional to apply it to these 8.3-sec. videos (4× longer than training videos). We include additional source separation examples in the videos on our webpage. This includes a random sample of (synthetically mixed) test videos, as well as results on in-the-wild videos that contain both on- and off-screen sound.
Multiple on-screen sound sources To demonstrate our model’s ability to vary its prediction based on the speaker, we took a video in which two people are speaking on a TV debate show, visually masked one side of the screen (similar to [25]), and ran our source separation model. As shown in Figure 1, when the speaker on the left is hidden, we hear the speaker on the right, and vice versa. Please see our video for results.
Large-scale training We trained a larger variation of our model on significantly more data. For this, we combined the VoxCeleb and VoxCeleb2 [77] datasets (approx. 8× as many videos), as in [47], and modeled ambient sounds by sampling background audio tracks from AudioSet approximately 8% of the time. To provide more temporal context, we trained with 4.1-sec. videos (approx. 256 STFT time samples). We also simplified the model by decreasing the spectrogram frame length to 40 ms (513 frequency samples) and increased the weight of the phase loss to 0.2. Please see our webpage for results.
7 Discussion
In this paper, we presented a method for learning a temporal multisensory representation, and we showed through experiments that it was useful for three downstream tasks: (a) pretraining action recognition systems, (b) visualizing the locations of sound sources, and (c) on/off-screen source separation. We see this work as opening two potential directions for future research. The first is developing new methods for learning fused multisensory representations. We presented one method — detecting temporal misalignment — but one could also incorporate other learning signals, such as the information provided by ambient sound [15]. The other direction is to use our representation for additional audio-visual tasks. We presented several applications here, but there are other audio-understanding tasks could potentially benefit from visual information and, likewise, visual applications that could benefit from fused audio information.