Brief Summary
Abstract (translated)
In 2017, the authors proposed a new loss function called the Generalized End-to-End (GE2E) loss, which makes the training of speaker verification models more efficient than the earlier (2016) Tuple-based End-to-End (TE2E) loss.
Unlike TE2E, the GE2E loss updates the network parameters in a new way that emphasizes examples which are difficult to verify at each step of the training process. In addition, the GE2E loss does not require an initial stage of example selection. With these properties, our model with the new loss function decreases speaker verification EER by more than 10%, while reducing the training time by 60%.
We also introduce the MultiReader technique, which allows us to perform domain adaptation: training a more accurate model that supports multiple keywords (i.e., "OK Google" and "Hey Google") as well as multiple dialects.
Potentially Confusing Points
The paper covers quite a few points:
- The newly proposed GE2E loss is applied in two settings: Text-Independent Speaker Verification (TI-SV) and Text-Dependent Speaker Verification (TD-SV).
- In the text-dependent setting, the paper additionally proposes MultiReader (i.e., multiple keywords).
- For the concrete implementation of GE2E, two variants are proposed: one based on Softmax, and one based on Contrast (a contrastive form that emphasizes the examples that are hardest to verify at each training step).
- This leads to quite a few comparison experiments:
- Text-independent, 1 comparison: GE2E vs. TE2E vs. Softmax;
- Text-dependent, 4 comparisons: (GE2E, TE2E) x (MultiReader, no MultiReader);
- Softmax vs. Contrast inside GE2E (here Softmax refers to a step inside GE2E, not the baseline loss), 1 comparison, but no concrete numbers are reported for it.
What the paper avoids:
- The abstract emphasizes up front: "Unlike TE2E, the GE2E loss function updates the network parameters in a new way, by emphasizing examples that are difficult to verify at each step of the training process." This is what the paper calls the "Contrast" form, which picks out the hardest-to-distinguish negative. However, the paper never reports a concrete numerical comparison between the Softmax and Contrast variants of the GE2E loss; it only states briefly that Softmax performs slightly better for TI-SV, while Contrast performs better for TD-SV (keyword-based). Perhaps the Contrast formula in GE2E is so similar to the Triplet Loss that the comparison is simply sidestepped.
Background
This involves several related loss functions:
- Triplet Loss
- Softmax
- TE2E
Softmax
The cross-entropy loss; the network directly outputs per-class probabilities.
The softmax function is an activation function that turns numbers (logits) into probabilities that sum to one; its output is a vector representing the probability distribution over a list of potential outcomes.
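A minimal NumPy sketch of softmax and the cross-entropy loss computed from it (function names and the example logits are purely illustrative):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize so the outputs sum to 1.
    z = logits - logits.max(axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    # Negative log-probability assigned to the true class.
    return -np.log(softmax(logits)[label])

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))           # roughly [0.66, 0.24, 0.10], sums to 1
print(cross_entropy(logits, 0))  # loss when class 0 is the true label
```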
Triplet Loss
The Triplet Loss was proposed in the 2015 FaceNet paper.
The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.
The problem Triplet Loss addresses:
- When the number of classes is fixed, a softmax-based cross-entropy loss can be used.
- When the number of classes is variable, triplet loss can be used, because it is independent of the number of classes. Otherwise, if the number of classes is very large, e.g. a speaker dataset on the order of 10k speakers, the weight matrix of the softmax layer becomes too large to train well.
The strength of triplet loss is fine-grained discrimination; its weakness is slow convergence, and sometimes it does not converge at all. (A minimal sketch follows below.)
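A minimal sketch of the triplet loss in NumPy; the margin of 0.2 follows the FaceNet setup, and the function assumes the three embeddings are already L2-normalized:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # L = max(0, ||a - p||^2 - ||a - n||^2 + margin)
    d_ap = np.sum((anchor - positive) ** 2)   # squared distance to a same-identity example
    d_an = np.sum((anchor - negative) ** 2)   # squared distance to a different-identity example
    return max(0.0, d_ap - d_an + margin)
```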
Offline triplet
Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
Based on the distances between examples, triplets are divided into three kinds: easy, semi-hard, and hard triplets; the semi-hard and hard triplets are selected for training.
This approach is not very efficient, because every few epochs the negative examples have to be re-mined.
Online triplet
Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.
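A sketch of one common way to do this, the "batch hard" strategy: for every anchor in the mini-batch, take its farthest positive and closest negative. This is an assumed implementation for illustration, not the exact FaceNet mining procedure:

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    # embeddings: (B, D), assumed L2-normalized; labels: (B,) identity ids.
    dist = torch.cdist(embeddings, embeddings)           # (B, B) pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # (B, B) same-identity mask
    self_mask = torch.eye(len(labels), dtype=torch.bool)

    # Hardest positive: farthest example with the same label (excluding the anchor itself).
    hardest_pos = dist.masked_fill(~same | self_mask, float('-inf')).max(dim=1).values
    # Hardest negative: closest example with a different label.
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values

    return F.relu(hardest_pos - hardest_neg + margin).mean()
```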
Classifying with the triplet embedding
FaceNet is a feature extractor: it outputs a vector in Euclidean space, after which any standard machine-learning algorithm can be used for classification.
FaceNet. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
Prior work: Tuple-Based End-to-End Loss
2016 End-to-end text-dependent speaker verification. ICASSP
Tuple Based End-to-End Loss:
Pros
- Simulates runtime behavior in the loss function
- Each (N+1)-tuple contains all utterances involved in a verification decision (unlike triplet)
Cons
- Most tuples are easy - training is inefficient
This paper: Generalized End-to-End Loss
2017 Generalized end-to-end loss for speaker verification.
GE2E Loss
[Figure not recovered: GE2E training illustration. Embeddings of the same color belong to the same speaker, and $c_k$ marks the centroid of each speaker's embeddings.]
GE2E loss pushes the embedding towards the centroid of the true speaker, and away from the centroid of the most similar different speaker.
Similarity Matrix
Construct a similarity matrix for each batch (a code sketch follows this list):
- Embeddings are L2-normalized:
$e_{ji} = \dfrac{f(x_{ji}; w)}{\lVert f(x_{ji}; w) \rVert_2}$, where $x_{ji}$ is the $i$-th utterance of speaker $j$ and $f$ is the embedding network.
- Centroids:
$c_k = \frac{1}{M} \sum_{m=1}^{M} e_{km}$
- Similarity:
$S_{ji,k} = w \cdot \cos(e_{ji}, c_k) + b$, where $w > 0$ and $b$ are learnable parameters.
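A minimal PyTorch sketch of this construction; the batch shape and the initial values of $w$ and $b$ are assumptions, and the exclusive-centroid correction described further below is omitted here:

```python
import torch
import torch.nn.functional as F

def similarity_matrix(embeddings, w, b):
    # embeddings: (N, M, D) d-vectors for N speakers with M utterances each.
    e = F.normalize(embeddings, p=2, dim=-1)        # L2-normalize every embedding
    c = F.normalize(e.mean(dim=1), p=2, dim=-1)     # (N, D) per-speaker centroids
    cos = torch.einsum('jid,kd->jik', e, c)         # cosine of every e_ji against every c_k
    return w * cos + b                              # learnable scale (kept positive) and bias

emb = torch.randn(4, 5, 256)                        # N=4 speakers, M=5 utterances, D=256
S = similarity_matrix(emb, w=torch.tensor(10.0), b=torch.tensor(-5.0))
print(S.shape)                                      # torch.Size([4, 5, 4])
```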
Softmax vs. Contrast
Each row of the similarity matrix, i.e. $S_{ji,k}$ for a fixed $e_{ji}$, defines the similarity between $e_{ji}$ and every centroid $c_k$. We want $e_{ji}$ to be close to $c_j$ and far away from $c_k$ for $k \ne j$.
The Softmax and Contrast variants compute the loss in different ways; the authors' findings are below (a sketch of both variants follows this discussion).
- Softmax: good for text-independent applications. $L(e_{ji}) = -S_{ji,j} + \log \sum_{k=1}^{N} \exp(S_{ji,k})$
- Contrast: good for keyword-based (text-dependent) applications. The contrast loss is defined on positive pairs and the most aggressive negative pairs: $L(e_{ji}) = 1 - \sigma(S_{ji,j}) + \max_{k \ne j} \sigma(S_{ji,k})$, where $\sigma$ is the sigmoid function.
A question here: why does Contrast perform worse than Softmax on text-independent tasks?
The Contrast variant is in fact similar to the Triplet Loss: its loss combines the positive pair with the most aggressive (hardest) negative pair.
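A sketch of both variants in PyTorch, operating on a similarity matrix S of shape (N, M, N) as constructed above; the per-utterance losses are summed over the batch, and the function names are mine:

```python
import torch

def ge2e_softmax_loss(S):
    # L(e_ji) = -S_ji,j + log sum_k exp(S_ji,k), summed over all j, i.
    N, M, _ = S.shape
    target = torch.arange(N).unsqueeze(1).expand(N, M)   # true speaker index j for each e_ji
    log_probs = S.log_softmax(dim=-1)
    return -log_probs.gather(-1, target.unsqueeze(-1)).sum()

def ge2e_contrast_loss(S):
    # L(e_ji) = 1 - sigmoid(S_ji,j) + max_{k != j} sigmoid(S_ji,k), summed over all j, i.
    N, M, _ = S.shape
    sig = torch.sigmoid(S)
    idx = torch.arange(N)
    pos = sig[idx, :, idx]                               # (N, M) similarity to the true centroid
    mask = torch.eye(N, dtype=torch.bool).unsqueeze(1)   # (N, 1, N) marks the k = j entries
    hardest_neg = sig.masked_fill(mask, float('-inf')).max(dim=-1).values  # most aggressive wrong centroid
    return (1.0 - pos + hardest_neg).sum()
```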
Trick about centroid
For the true-speaker centroid, we should exclude the embedding itself: when computing $c_j^{(-i)}$ we leave out $e_{ji}$, i.e. $c_j^{(-i)} = \frac{1}{M-1} \sum_{m=1, m \ne i}^{M} e_{jm}$.
This avoids the trivial solution in which all utterances collapse to the same embedding. The similarity matrix is then unified into a single piecewise formula:
$S_{ji,k} = \begin{cases} w \cdot \cos(e_{ji}, c_j^{(-i)}) + b & \text{if } k = j \\ w \cdot \cos(e_{ji}, c_k) + b & \text{otherwise} \end{cases}$
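A sketch of how the exclusive centroids can be computed and spliced into the similarity matrix; shapes and helper names follow the earlier sketch and are assumptions:

```python
import torch
import torch.nn.functional as F

def exclusive_centroids(e):
    # e: (N, M, D) L2-normalized embeddings.
    # Centroid of speaker j computed without utterance i: (sum_m e_jm - e_ji) / (M - 1).
    N, M, D = e.shape
    c_excl = (e.sum(dim=1, keepdim=True) - e) / (M - 1)           # (N, M, D)
    return F.normalize(c_excl, p=2, dim=-1)

def similarity_matrix_excl(embeddings, w, b):
    # Same as the earlier sketch, but the k = j entries use the exclusive centroid c_j^(-i).
    e = F.normalize(embeddings, p=2, dim=-1)
    c = F.normalize(e.mean(dim=1), p=2, dim=-1)
    S = w * torch.einsum('jid,kd->jik', e, c) + b                 # (N, M, N)
    diag = w * (e * exclusive_centroids(e)).sum(dim=-1) + b       # (N, M) cos against own exclusive centroid
    idx = torch.arange(e.shape[0])
    S[idx, :, idx] = diag
    return S
```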
Efficiency estimate
TE2E vs. GE2E
The main idea: TE2E processes one tuple per step, whereas one GE2E step is essentially equivalent to a whole batch of tuples computed at once on the GPU, which is far more efficient.
For TE2E, assume a training batch has:
- N speakers
- M utterances per speaker
- P enrollment utterances per speaker
Number of all possible tuples: [formula lost with the image]. Theoretically, one GE2E step is equivalent to [a factor, lost with the image] TE2E steps.
TODO: I do not fully understand the concrete computation here, i.e. why that numerical relationship between GE2E and TE2E steps holds.
Comparison with Triplet Loss
The authors' view of the pros and cons of Triplet Loss:
- Pros: Simple, and correctly models the embedding space
- Cons: Does NOT simulate runtime behavior.
Here, "runtime behavior" means the workflow in which voice/face verification is actually used at runtime (see the sketch after this list):
- Enrollment: enrollment utterances -> utterance embeddings -> average over multiple utterance embeddings
- Verification: extract the embedding of the utterance recorded at verification time -> compute its similarity (e.g., cosine) to the enrolled average vector -> compare against a preset threshold to decide whether it is the same person.
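A minimal NumPy sketch of this runtime flow; the threshold value is illustrative and would normally be tuned on a development set:

```python
import numpy as np

def enroll(enrollment_embeddings):
    # Average several enrollment d-vectors into one speaker profile, then re-normalize.
    profile = np.mean(enrollment_embeddings, axis=0)
    return profile / np.linalg.norm(profile)

def verify(test_embedding, profile, threshold=0.7):
    # Cosine similarity between the test d-vector and the enrolled profile.
    e = test_embedding / np.linalg.norm(test_embedding)
    score = float(np.dot(e, profile))
    return score >= threshold, score
```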
Text-Independent
Text-Independent Speaker Verification
We are interested in identifying the speaker from arbitrary speech. Challenges:
- Length of utterance can vary
- Unlike keyword-based, where we assume fixed-length (0.8s) for keyword segment
Naive solution: Full sequence training?
- No batching - very very slow
- Dynamic RNN unrolling - even slower...
Solution: sliding-window inference (see the sketch after this list)
- In inference time, we extract sliding windows, and compute per-window d-vector
- For experiments, we use 1.6s window size, with 50% overlap
- L2-normalize the per-window d-vectors, then take the average of these vectors as the embedding of the variable-length utterance.
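A sketch of the sliding-window inference; the 160-frame window corresponds to a 1.6 s window with a 10 ms frame hop, and embed_fn stands for the trained embedding model (both specifics are assumptions about the front end):

```python
import numpy as np

def sliding_window_dvector(frames, embed_fn, window=160, overlap=0.5):
    # frames: (T, F) frame-level features of a variable-length utterance.
    # embed_fn: maps a (window, F) segment to a d-vector.
    step = int(window * (1 - overlap))
    dvecs = []
    for start in range(0, max(len(frames) - window, 0) + 1, step):
        d = embed_fn(frames[start:start + window])
        dvecs.append(d / np.linalg.norm(d))   # L2-normalize each per-window d-vector
    return np.mean(dvecs, axis=0)             # average -> utterance-level embedding
```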
Training
Text-independent Training.
This speeds up training while making full use of the data: because the utterances within a batch all have the same length, each utterance takes the same amount of computation (a sketch follows this list).
- In training time, we need to group utterances by length
- Extract segments by minimal truncation
- Form batches of same-length segments
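A simplified sketch of forming same-length batches; the 140-180 frame range and the random-crop truncation are assumptions loosely following the paper's TI-SV training setup:

```python
import random
import numpy as np

def make_same_length_batches(utterances, min_len=140, max_len=180, batch_size=64):
    # utterances: list of (T_i, F) feature arrays with varying lengths T_i.
    pool = [u for u in utterances if len(u) >= max_len]   # keep only long-enough utterances
    random.shuffle(pool)
    batches = []
    for i in range(0, len(pool) - batch_size + 1, batch_size):
        seg_len = random.randint(min_len, max_len)        # one segment length per batch
        segs = []
        for u in pool[i:i + batch_size]:
            start = random.randint(0, len(u) - seg_len)   # crop a segment instead of padding
            segs.append(u[start:start + seg_len])
        batches.append(np.stack(segs))                    # (batch_size, seg_len, F)
    return batches
```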
Experiment
Text-independent experiments.
Text-dependent
TODO
Training Notes
Text-independent
The workflow has two steps: data preprocessing and training.
Audio preprocessing: steps such as converting wav files to spectrograms are time-consuming; consider running them on the GPU, e.g. with PyTorch audio (torchaudio). A sketch follows below.
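A minimal torchaudio sketch of computing log-mel features on the GPU; the 16 kHz sample rate and the 25 ms window / 10 ms hop / 40 mel bins are assumptions typical of d-vector front ends:

```python
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40  # 25 ms window, 10 ms hop at 16 kHz
).to(device)

waveform, sr = torchaudio.load("example.wav")               # (channels, samples); assumes 16 kHz audio
features = torch.log(mel(waveform.to(device)) + 1e-6)       # (channels, n_mels, frames)
print(features.shape)
```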
Reference
Training time for reference:
Dataset:
- LibriSpeech: train-other-500 (extract as `LibriSpeech/train-other-500`). 500 hours of audio.
- VoxCeleb1: Dev A - D as well as the metadata file (extract as `VoxCeleb1/wav` and `VoxCeleb1/vox1_meta.csv`). 1,200 speakers.
- VoxCeleb2: Dev A - H (extract as `VoxCeleb2/dev`). 6,000 speakers.
time: trained 1.56M steps (20 days with a single GPU) with a batch size of 64. GPU: GTX 1080 Ti.
Training process:
Final result: the embedding vectors are reduced with UMAP and then plotted.
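A sketch of the UMAP visualization using the umap-learn package; the random embeddings and labels below are placeholders for the trained d-vectors and their speaker ids:

```python
import numpy as np
import umap                          # pip install umap-learn
import matplotlib.pyplot as plt

embeddings = np.random.randn(500, 256)         # placeholder: (num_utterances, d-vector dim)
labels = np.random.randint(0, 10, size=500)    # placeholder: speaker id per utterance

points = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine").fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap="tab10")
plt.title("d-vectors projected with UMAP")
plt.show()
```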