This post is titled an "attempt" because I had never used a Linux system before, let alone written shell scripts, so these notes are rather scattered; I will organize them into a cleaner post later. For how to obtain the TIMIT data, see the CSDN resource post. The goal here is to run the aishell recipe on the TIMIT dataset.
References
1.kaldi基礎介紹(一)在說話人識別中的數(shù)據(jù)準備 - monsieurliaxiamen的博客 - CSDN博客
2.kaldi中改寫sre10/v1用timit dataset做說話人識別總結(jié) - zjm750617105的專欄 - CSDN博客
3.kaldi下清華語音數(shù)據(jù)集的說話人測試腳本編寫 - 破曉的專欄 - CSDN博客
4.Voiceprint recognition in kaldi - Programmer Sought
5.kaldi中的聲紋識別 - yutouwd的博客 - CSDN博客
6.【數(shù)據(jù)預處理】TIMIT語料庫WAV文件轉(zhuǎn)換 - JJJanepp - 博客園
7.利用kaldi提取mfcc特征 - 長虹劍的專欄 - CSDN博客
8.對TIMIT數(shù)據(jù)進行格式轉(zhuǎn)換(SPHERE2WAV(RIFF)) - MengWang - 博客園
Kaldi's directory structure
egs: the example recipes, all written as scripts and named after the dataset they use. One level down, directories starting with s are speech recognition and directories starting with v are voiceprint (speaker) recognition; v1 is generally speaker recognition with the i-vector method.
src: Kaldi's C++ source code.
tools: the libraries Kaldi depends on, plus some utility scripts.
windows: the tools and configuration files needed to install under Windows.
Converting the TIMIT data format
TIMIT's wav files are not real wav files (they are NIST SPHERE files), so they have to be converted with Kaldi's sph2pipe tool.
For example, for the TIMIT file test/dr1/mdab0/si1039.wav, first run
fname=test_dr1_mdab0_si1039.wav # the file was renamed first
sph2pipe -f wav $fname > file.wav # convert the format first
and you get a file in true wav format that a music player can actually play. (If your files are already wav, this step is unnecessary.)
How to batch-convert? Reference 8, 對TIMIT數(shù)據(jù)進行格式轉(zhuǎn)換(SPHERE2WAV(RIFF)) - MengWang - 博客園, works particularly well!
First, change into the directory where the sph2pipe tool lives (this is the SPHERE audio conversion tool provided by the LDC):
cd '/home/dream/Research/kaldi-master/tools/sph2pipe_v2.5'
Next, test the conversion of a single audio file on the command line:
./sph2pipe -f wav ./wav_test/SA1.WAV ./wav_test/SA1_tr.WAV
Note that the sph2pipe executable is not on the PATH, so it must be run with an explicit path from the current directory, i.e. ./sph2pipe rather than sph2pipe.
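Batch conversion is then just a loop over all the SPHERE files. The sketch below is illustrative: the TIMIT and OUT paths are placeholders, and it assumes it is run from the sph2pipe_v2.5 directory so that ./sph2pipe resolves.
#!/usr/bin/env bash
TIMIT=/path/to/TIMIT                      # corpus root holding the SPHERE .WAV files
OUT=/path/to/timit_riff                   # destination for the converted RIFF files
find "$TIMIT" -iname '*.wav' | while read -r sph; do
  rel=${sph#"$TIMIT"/}                    # path relative to the corpus root
  mkdir -p "$OUT/$(dirname "$rel")"       # recreate the directory structure
  ./sph2pipe -f wav "$sph" "$OUT/$rel"    # SPHERE -> RIFF wav
done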
Getting to know the datasets
Training set (train): used to train the model.
Validation set (dev): used to tune the model's hyperparameters and to compare algorithms, checking which one works better.
Test set (test): used to correctly evaluate the performance of the final classifier.
Obviously TIMIT does not come with a dev split. You can change that yourself; it takes some basic list-manipulation skills, but the idea is always the same: collect all the wav files in one place, then re-split them according to your needs (see the sketch after the list below).
Splitting the dataset
- If you only have on the order of 100, 1,000 or 10,000 samples, a 70% training / 30% test split is common; 60% training, 20% validation and 20% test is also a reasonable choice.
- If the data is large, at the million scale, the validation and test sets should be well under 20% and 10% of the total.
- For research you will generally use a standard dataset with a published split, so the question does not arise.
- For a first attempt, there is no need to be strict about how you split.
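As an illustration, a speaker-level re-split could be scripted roughly as follows (a sketch only: it assumes the wavs have already been collected into one sub-directory per speaker, and the 80/10/10 ratio is arbitrary):
#!/usr/bin/env bash
SPKROOT=/path/to/all_speakers               # one sub-directory per speaker
mkdir -p split
ls "$SPKROOT" | shuf > split/spk_shuffled   # speakers in random order
n=$(wc -l < split/spk_shuffled)
n_train=$((n * 8 / 10)); n_dev=$((n / 10))
head -n "$n_train" split/spk_shuffled > split/train.spk
sed -n "$((n_train + 1)),$((n_train + n_dev))p" split/spk_shuffled > split/dev.spk
tail -n "+$((n_train + n_dev + 1))" split/spk_shuffled > split/test.spk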
Data preparation in Kaldi
In Kaldi's speaker recognition examples (egs/sre10, egs/sre16) there are two broad classes of data: the training set and the evaluation set. The evaluation set in turn splits into two parts: an enrollment set and a test set. So the prepared wav folder needs three sub-folders: train, dev and test.
Preparing the training set: spk2utt, utt2spk and wav.scp
All of these can be generated with existing utilities; you do not have to write the scripts yourself.
- spk2utt maps a speaker id (spkid) to the names of that speaker's utterances (uttid); a speaker usually has many utterances. The format is <spkid> <uttid1> <uttid2> ..., with exactly one speaker id per line. Both the uttids within each line and the spkids across lines must follow the ordering of the sort command, otherwise Kaldi will fail at the validate_data_dir.sh step.
- utt2spk maps each single utterance name (uttid) to its speaker, so every line is a one-to-one pair. utt2spk can be generated from spk2utt by a Kaldi script, or by a script of your own.
Conversely, spk2utt is obtained with Kaldi's own command utils/utt2spk_to_spk2utt.pl utt2spk > spk2utt, run from the recipe's top-level directory where utils/ is available (the data files themselves do not need to be placed under utils/).
- wav.scp maps each uttid to the full path of its audio, again one file per line. Depending on the audio format of the dataset, a format-conversion command may have to be embedded. If the source audio is already wav, each line is just uttid path; if it is sph or flac, a conversion command line must be added (see the examples after this list).
If you want to train gender-dependent models, a spk2gender text file (format <spkid> <m|f>) is needed as well (not used here).
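For concreteness, the three files can look like this for TIMIT-style ids (illustrative lines only: FADG0 is a test speaker in dialect region dr4, and the paths are placeholders):
# utt2spk, one utterance per line:
FADG0_SI649 FADG0
# spk2utt, one speaker per line:
FADG0 FADG0_SA1 FADG0_SI649 ...
# wav.scp when the audio is already RIFF wav:
FADG0_SI649 /path/to/timit_riff/test/dr4/fadg0/si649.wav
# wav.scp when the audio is still SPHERE: pipe through sph2pipe on the fly
# (-p forces 16-bit PCM output; note the trailing |):
FADG0_SI649 sph2pipe -f wav -p /path/to/TIMIT/test/dr4/fadg0/si649.wav |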
The enrollment set
To evaluate speaker recognition we first need to enroll a set of speakers, and since what is enrolled is a speaker's voiceprint, every speaker needs at least one utterance for enrollment. The test set additionally needs, for each trial, an enrolled speaker id, an utterance id, and a label indicating whether they belong to the same person.
Like the training set, the enrollment set consists of spk2utt, utt2spk and wav.scp. The contents follow the same patterns as for the training set, so they are not repeated here.
The test set
The test set consists of four files: spk2utt, utt2spk, wav.scp and trials.
The trials file format is <uttid> <spkid> <target/nontarget>, e.g.:
FADG0_SI649.WAV FADG0 target
FADG0_SI649.WAV FAKS0 nontarget
FADG0_SI649.WAV FASW0 nontarget
That is the data preparation format for Kaldi speaker recognition. When writing the scripts by hand you will run into two problems. One is sorting, mainly in the spk2utt and utt2spk files: generate them in a definite order, and rename uttids where necessary (for instance prefixing them with the spkid), so that they pass the validate-data-dir check (there is also an automatic sort-fixing script, shown below). The other is format conversion: the sph and flac formats encountered so far need different conversion tools (sph2pipe and sox respectively).
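Both the check and the automatic repair are standard Kaldi utilities; for a speaker recognition setup with no transcripts, run before feature extraction, a typical invocation is:
utils/validate_data_dir.sh --no-text --no-feats data/train   # report sorting and consistency problems
utils/fix_data_dir.sh data/train                             # sort and filter the files consistently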
Only three data directories actually have to be prepared: train, enroll and test. The SRE recipes use the SRE datasets to train the PLDA model because most of their training data is out of domain; in our case the PLDA model is trained directly on the training data, since it is in-domain. train should contain a set of speakers that does not overlap with the enroll and test data, and the utterances of the remaining speakers are split between enroll and test. Although enroll and test share speakers, they should not contain utterances from the same recording.
The official forum's explanation for TIMIT
Following the links from Kaldi's official GitHub, I found the corresponding Google group, kaldi-help. Relevant excerpts:
I'm not familiar with using TIMIT for speaker recognition, so I'm not sure how the evaluation is set up. It sounds like you might have only evaluation data and nothing to train your models with. Hopefully someone who has used TIMIT for this purpose can comment more. If you don't have any training data, you could try using the Librispeech corpus (look at the recipe in egs/ for more info).
You need at least the following datasets:
Training data. This is used to train the UBM, i-vector extractor and PLDA model. It should be non-overlapping with the other datasets. In the sre10 recipe, it corresponds to the "train" and "sre" data. The "sre" data is just a subset of "train" used to train the PLDA model, but it doesn't have to be that way in general.
Enrollment data. This is a subset of the evaluation data in which you know the identity of the speaker in the recording. Using the models created in the previous step, i-vectors are generated from this data. If you have multiple enrollment recordings per speaker, you might average their i-vectors to get speaker-level representations. In the sre10 recipe, this dataset is called "sre10_train."
Test data. This is also part of the evaluation data, and consists of recordings for which you don't know the identity of the speaker. These are compared (using the PLDA model or cosine distance) with the i-vectors created from the enrollment data. This dataset is called "sre10_test" in the recipe. The set of comparisons is defined by the "trials" file.
The aishell recipe flow
The run.sh in egs/aishell/v1 contains the entire voiceprint recognition pipeline. It is best to copy the commands from run.sh into another script and run them one at a time, so that errors can be found and fixed promptly.
1) Data preparation.
2) Extract the MFCC features, perform voice activity detection (VAD), and check for files that do not meet the requirements, sorting the file lists.
3) Train the UBM and the i-vector extractor. Note that the script that trains the i-vector extractor runs many jobs in parallel by default, which takes a lot of memory and can cause memory overflow: it executes nj * num_threads * num_processes jobs at the same time, and these must be changed in train_ivector_extractor.sh. With 16 GB of memory I had to set all three parameters to 2 before it would run. There are also two hyperparameters worth modifying: the UBM size and the i-vector dimension. The UBM size is changed directly in run.sh; the parameter after data/train in the train_diag_ubm.sh call is the number of UBM Gaussians, 1024 by default. To change the i-vector dimension, modify ivector_dim in train_ivector_extractor.sh, 400 by default. (A sketch of the adjusted call follows this list.)
4) Extract the i-vectors of the training set, and train the PLDA model used for scoring on them.
5) After that, the test set is divided into an enrollment set and an evaluation set. This step is mainly done by the script local/split_data_enroll_eval.py, which first stores each spk and its corresponding utts in a dict, then randomly shuffles each speaker's utts and redistributes them into enroll (the enrollment set) and eval (the evaluation set). In the penultimate line of the program, if(i<3): the utt is written to enroll, otherwise to eval; so the sizes of the enrollment and evaluation sets can be changed by changing this value.
6) After utt2spk is re-created, the trials have to be generated, by local/produce_trials.py. The trials file is the list of enrolled speakers and the different voices they need to be scored against; the format is the same as the trials example shown earlier, i.e. <uttid> <spkid> <target/nontarget>.
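For step 3), the adjusted extractor call could look roughly like this (a sketch following the aishell v1 layout; the final.ubm argument comes from the preceding train_full_ubm.sh stage, and the reduced settings match the memory constraints described above):
# 2 * 2 * 2 = 8 concurrent jobs instead of the much larger default:
sid/train_ivector_extractor.sh --cmd "$train_cmd" \
  --nj 2 --num-threads 2 --num-processes 2 \
  --ivector-dim 400 --num-iters 5 \
  exp/full_ubm_1024/final.ubm data/train exp/extractor_1024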
TIMIT+Aishell
Let us first look at how the AISHELL and TIMIT databases are partitioned. AISHELL has 400 speakers in total, split by default into train, dev and test sets: 340 speakers in train, 40 in dev and 20 in test. The recipe uses train as the training set and test as the test set; dev is not used. Each AISHELL speaker has about 300 utterances, each one a single sentence of roughly 2~6 s. The TIMIT database has 630 speakers, split into train and test: 462 speakers in the training set and 168 in the test set, each speaker with 10 utterances of roughly 2~4 s. Here TIMIT's original partition is used directly: 462 speakers for training and 168 for testing.
Having understood the differences between the two databases and the overall voiceprint recognition flow, we can begin rewriting the recipe. Not much actually needs to change; the main work is modifying the data preparation stage and the generation of the trials. For data preparation, we can write our own timit_data_prepare.sh modeled on the aishell_data_prep.sh script. The data preparation stage generates three files, utt2spk, spk2utt and wav.scp, in the formats shown earlier; a sketch of the generation loop follows.
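A minimal version of that loop might look like this (a sketch only, assuming the converted RIFF wavs are laid out as $wavdir/<speaker>/<utterance>.wav; prefixing each uttid with its spkid keeps utt2spk consistently sorted):
wavdir=/path/to/timit_riff/train
dir=data/train
mkdir -p $dir
rm -f $dir/wav.scp $dir/utt2spk            # start clean
for wav in $wavdir/*/*.wav; do
  spk=$(basename "$(dirname "$wav")")      # speaker id, taken from the directory name
  utt=${spk}_$(basename "$wav" .wav)       # utterance id, prefixed with the speaker id
  echo "$utt $wav" >> $dir/wav.scp
  echo "$utt $spk" >> $dir/utt2spk
done
sort -o $dir/wav.scp $dir/wav.scp          # Kaldi expects sorted files
sort -o $dir/utt2spk $dir/utt2spk
utils/utt2spk_to_spk2utt.pl $dir/utt2spk > $dir/spk2utt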
The script next checks that the wav files found add up to 141924, and then builds wav.scp, utt2spk, spk2utt, and the transcripts.txt used for speech recognition. Here we find the parts of the script related to transcripts.txt and delete them.
After the data preparation stage is complete, we can run the voiceprint recognition steps described above. One thing to note is the trials: if a speaker has only two or three utterances, you need to modify the proportions assigned to the enroll and eval sets. Since every speaker in the TIMIT database has 10 utterances, leaving them unmodified is fine; here 3 utterances are used for enrollment and the remaining 7 for verification.
The final error rate was about 4.5%. Although that is an acceptable result, it is still much worse than the 0.18% error rate on AISHELL. Possible reasons: first, there is less training speech; TIMIT has 462 training speakers but only 10 utterances each, far fewer than AISHELL's 340 training speakers with about 300 utterances each. Likewise, TIMIT's test set has 168 speakers, many more than AISHELL's 40. Moreover, the recipe's default UBM order and i-vector dimension are quite high for this amount of data, which may also push the error rate up. To reduce the error rate further, try shrinking the UBM and i-vector dimensions; after I reduced both, the error rate eventually reached 1.53%.
Other notes
- Files generated under sudo end up locked (owned by root); the way to unlock them is
sudo chmod 777 file
- Single-step debugging:
sh -x script.sh
After modifying the script, if you don't want to re-debug from the start, comment out the already-verified lines as a BLOCK (see the sketch at the end of this list).
- Thread settings in train_ivector_extractor.sh: even with the three parameters set to 2 as described above, I could not get it to run. Only after lowering the UBM to 600 Gaussians and the i-vector dimension to 400 did the program run reasonably fast, and it still took about an hour. I suspect they could be made even smaller, because the utterances are very short.
- If you don't use the text file, this can be switched off at the top of one of the sub-scripts instead of hunting for every specific line to comment out; I forget which script it was, so run it and track the error down when it appears.
- The only thing that has to be prepared in advance is the dataset itself. TIMIT's data format and directory layout are the biggest headache; it is best to write a script to handle them.
- In aishell_data_prep.sh, comment out everything related to dev and to transcripts.txt.
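The multi-line BLOCK comment mentioned above is the usual shell no-op heredoc trick: everything between the markers is read but never executed (BLOCK is an arbitrary label, and the command inside is just an example):
: <<'BLOCK'
steps/make_mfcc.sh --nj 2 data/train exp/make_mfcc/train mfcc   # already debugged, skipped on re-runs
BLOCK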