This post is titled an "attempt" because I had never used a Linux system before, let alone written shell scripts, so these notes are rather scattered; I will organize them into a cleaner post later. For how to obtain the TIMIT data, see the CSDN resource post. The goal here is to run the aishell recipe on the TIMIT dataset.
References
1.kaldi基礎介紹(一)在說話人識別中的數(shù)據(jù)準備 - monsieurliaxiamen的博客 - CSDN博客
2.kaldi中改寫sre10/v1用timit dataset做說話人識別總結(jié) - zjm750617105的專欄 - CSDN博客
3.kaldi下清華語音數(shù)據(jù)集的說話人測試腳本編寫 - 破曉的專欄 - CSDN博客
4.Voiceprint recognition in kaldi - Programmer Sought
5.kaldi中的聲紋識別 - yutouwd的博客 - CSDN博客
6.【數(shù)據(jù)預處理】TIMIT語料庫WAV文件轉(zhuǎn)換 - JJJanepp - 博客園
7.利用kaldi提取mfcc特征 - 長虹劍的專欄 - CSDN博客
8.對TIMIT數(shù)據(jù)進行格式轉(zhuǎn)換(SPHERE2WAV(RIFF)) - MengWang - 博客園
Kaldi's directory structure
egs: the example recipes, all written as scripts and named after the dataset they use. One level down, directories starting with s are speech recognition and directories starting with v are voiceprint (speaker) recognition; v1 is generally speaker recognition with the i-vector method.
src: Kaldi's C++ source code.
tools: the libraries Kaldi depends on, plus some utility scripts.
windows: the tools and configuration files needed to install under Windows.
Converting the TIMIT data format
TIMIT's wav files are not real wav files (they are NIST SPHERE files), so they have to be converted with Kaldi's sph2pipe tool.
For example, for the TIMIT file test/dr1/mdab0/si1039.wav, first run
fname=test_dr1_mdab0_si1039.wav # the file was renamed first
sph2pipe -f wav $fname > file.wav # convert the format first
and you get a file in true wav format that a music player can actually play. (If your files are already wav, this step is unnecessary.)
How to batch-convert? Reference 8, 對TIMIT數(shù)據(jù)進行格式轉(zhuǎn)換(SPHERE2WAV(RIFF)) - MengWang - 博客園, works particularly well!
First, change into the directory where the sph2pipe tool lives (this is the SPHERE audio conversion tool provided by the LDC):
cd '/home/dream/Research/kaldi-master/tools/sph2pipe_v2.5'
Next, test the conversion of a single audio file on the command line:
./sph2pipe -f wav ./wav_test/SA1.WAV ./wav_test/SA1_tr.WAV
Note that the sph2pipe executable is not on the PATH, so it must be run with an explicit path from the current directory, i.e. ./sph2pipe rather than sph2pipe.
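Batch conversion is then just a loop over all the SPHERE files. The sketch below is illustrative: the TIMIT and OUT paths are placeholders, and it assumes it is run from the sph2pipe_v2.5 directory so that ./sph2pipe resolves.
#!/usr/bin/env bash
TIMIT=/path/to/TIMIT                      # corpus root holding the SPHERE .WAV files
OUT=/path/to/timit_riff                   # destination for the converted RIFF files
find "$TIMIT" -iname '*.wav' | while read -r sph; do
  rel=${sph#"$TIMIT"/}                    # path relative to the corpus root
  mkdir -p "$OUT/$(dirname "$rel")"       # recreate the directory structure
  ./sph2pipe -f wav "$sph" "$OUT/$rel"    # SPHERE -> RIFF wav
done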
Getting to know the datasets
Training set (train): used to train the model.
Validation set (dev): used to tune the model's hyperparameters and to compare algorithms, checking which one works better.
Test set (test): used to correctly evaluate the performance of the final classifier.
Obviously TIMIT does not come with a dev split. You can change that yourself; it takes some basic list-manipulation skills, but the idea is always the same: collect all the wav files in one place, then re-split them according to your needs (see the sketch after the list below).
Splitting the dataset
- If you only have on the order of 100, 1,000 or 10,000 samples, a 70% training / 30% test split is common; 60% training, 20% validation and 20% test is also a reasonable choice.
- If the data is large, at the million scale, the validation and test sets should be well under 20% and 10% of the total.
- For research you will generally use a standard dataset with a published split, so the question does not arise.
- For a first attempt, there is no need to be strict about how you split.
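As an illustration, a speaker-level re-split could be scripted roughly as follows (a sketch only: it assumes the wavs have already been collected into one sub-directory per speaker, and the 80/10/10 ratio is arbitrary):
#!/usr/bin/env bash
SPKROOT=/path/to/all_speakers               # one sub-directory per speaker
mkdir -p split
ls "$SPKROOT" | shuf > split/spk_shuffled   # speakers in random order
n=$(wc -l < split/spk_shuffled)
n_train=$((n * 8 / 10)); n_dev=$((n / 10))
head -n "$n_train" split/spk_shuffled > split/train.spk
sed -n "$((n_train + 1)),$((n_train + n_dev))p" split/spk_shuffled > split/dev.spk
tail -n "+$((n_train + n_dev + 1))" split/spk_shuffled > split/test.spk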
Data preparation in Kaldi
In Kaldi's speaker recognition examples (egs/sre10, egs/sre16) there are two broad classes of data: the training set and the evaluation set. The evaluation set in turn splits into two parts: an enrollment set and a test set. So the prepared wav folder needs three sub-folders: train, dev and test.
Preparing the training set: spk2utt, utt2spk and wav.scp
All of these can be generated with existing utilities; you do not have to write the scripts yourself.
- spk2utt maps a speaker id (spkid) to the names of that speaker's utterances (uttid); a speaker usually has many utterances. The format is <spkid> <uttid1> <uttid2> ..., with exactly one speaker id per line. Both the uttids within each line and the spkids across lines must follow the ordering of the sort command, otherwise Kaldi will fail at the validate_data_dir.sh step.
- utt2spk maps each single utterance name (uttid) to its speaker, so every line is a one-to-one pair. utt2spk can be generated from spk2utt by a Kaldi script, or by a script of your own.
Conversely, spk2utt is obtained with Kaldi's own command utils/utt2spk_to_spk2utt.pl utt2spk > spk2utt, run from the recipe's top-level directory where utils/ is available (the data files themselves do not need to be placed under utils/).
- wav.scp maps each uttid to the full path of its audio, again one file per line. Depending on the audio format of the dataset, a format-conversion command may have to be embedded. If the source audio is already wav, each line is just uttid path; if it is sph or flac, a conversion command line must be added (see the examples after this list).
If you want to train gender-dependent models, a spk2gender text file (format <spkid> <m|f>) is needed as well (not used here).
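For concreteness, the three files can look like this for TIMIT-style ids (illustrative lines only: FADG0 is a test speaker in dialect region dr4, and the paths are placeholders):
# utt2spk, one utterance per line:
FADG0_SI649 FADG0
# spk2utt, one speaker per line:
FADG0 FADG0_SA1 FADG0_SI649 ...
# wav.scp when the audio is already RIFF wav:
FADG0_SI649 /path/to/timit_riff/test/dr4/fadg0/si649.wav
# wav.scp when the audio is still SPHERE: pipe through sph2pipe on the fly
# (-p forces 16-bit PCM output; note the trailing |):
FADG0_SI649 sph2pipe -f wav -p /path/to/TIMIT/test/dr4/fadg0/si649.wav |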
The enrollment set
To evaluate speaker recognition we first need to enroll a set of speakers, and since what is enrolled is a speaker's voiceprint, every speaker needs at least one utterance for enrollment. The test set additionally needs, for each trial, an enrolled speaker id, an utterance id, and a label indicating whether they belong to the same person.
Like the training set, the enrollment set consists of spk2utt, utt2spk and wav.scp. The contents follow the same patterns as for the training set, so they are not repeated here.
The test set
The test set consists of four files: spk2utt, utt2spk, wav.scp and trials.
The trials file format is <uttid> <spkid> <target/nontarget>, e.g.:
FADG0_SI649.WAV FADG0 target
FADG0_SI649.WAV FAKS0 nontarget
FADG0_SI649.WAV FASW0 nontarget
That is the data preparation format for Kaldi speaker recognition. When writing the scripts by hand you will run into two problems. One is sorting, mainly in the spk2utt and utt2spk files: generate them in a definite order, and rename uttids where necessary (for instance prefixing them with the spkid), so that they pass the validate-data-dir check (there is also an automatic sort-fixing script, shown below). The other is format conversion: the sph and flac formats encountered so far need different conversion tools (sph2pipe and sox respectively).
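Both the check and the automatic repair are standard Kaldi utilities; for a speaker recognition setup with no transcripts, run before feature extraction, a typical invocation is:
utils/validate_data_dir.sh --no-text --no-feats data/train   # report sorting and consistency problems
utils/fix_data_dir.sh data/train                             # sort and filter the files consistently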
Only three data directories actually have to be prepared: train, enroll and test. The SRE recipes use the SRE datasets to train the PLDA model because most of their training data is out of domain; in our case the PLDA model is trained directly on the training data, since it is in-domain. train should contain a set of speakers that does not overlap with the enroll and test data, and the utterances of the remaining speakers are split between enroll and test. Although enroll and test share speakers, they should not contain utterances from the same recording.
The official forum's explanation for TIMIT
Following the links from Kaldi's official GitHub, I found the corresponding Google group, kaldi-help. Relevant excerpts:
I'm not familiar with using TIMIT for speaker recognition, so I'm not sure how the evaluation is set up. It sounds like you might have only evaluation data and nothing to train your models with. Hopefully someone who has used TIMIT for this purpose can comment more. If you don't have any training data, you could try using the Librispeech corpus (look at the recipe in egs/ for more info).
You need at least the following datasets:
Training data. This is used to train the UBM, i-vector extractor and PLDA model. It should be non-overlapping with the other datasets. In the sre10 recipe, it corresponds to the "train" and "sre" data. The "sre" data is just a subset of "train" used to train the PLDA model, but it doesn't have to be that way in general.
Enrollment data. This is a subset of the evaluation data in which you know the identity of the speaker in the recording. Using the models created in the previous step, i-vectors are generated from this data. If you have multiple enrollment recordings per speaker, you might average their i-vectors to get speaker-level representations. In the sre10 recipe, this dataset is called "sre10_train."
Test data. This is also part of the evaluation data, and consists of recordings for which you don't know the identity of the speaker. These are compared (using the PLDA model or cosine distance) with the i-vectors created from the enrollment data. This dataset is called "sre10_test" in the recipe. The set of comparisons is defined by the "trials" file.
The aishell recipe flow
The run.sh in egs/aishell/v1 contains the entire voiceprint recognition pipeline. It is best to copy the commands from run.sh into another script and run them one at a time, so that errors can be found and fixed promptly.
1) Data preparation.
2) Extract the MFCC features, perform voice activity detection (VAD), and check for files that do not meet the requirements, sorting the file lists.
3) Train the UBM and the i-vector extractor. Note that the script that trains the i-vector extractor runs many jobs in parallel by default, which takes a lot of memory and can cause memory overflow: it executes nj * num_threads * num_processes jobs at the same time, and these must be changed in train_ivector_extractor.sh. With 16 GB of memory I had to set all three parameters to 2 before it would run. There are also two hyperparameters worth modifying: the UBM size and the i-vector dimension. The UBM size is changed directly in run.sh; the parameter after data/train in the train_diag_ubm.sh call is the number of UBM Gaussians, 1024 by default. To change the i-vector dimension, modify ivector_dim in train_ivector_extractor.sh, 400 by default. (A sketch of the adjusted call follows this list.)
4) Extract the i-vectors of the training set, and train the PLDA model used for scoring on them.
5) After that, the test set is divided into an enrollment set and an evaluation set. This step is mainly done by the script local/split_data_enroll_eval.py, which first stores each spk and its corresponding utts in a dict, then randomly shuffles each speaker's utts and redistributes them into enroll (the enrollment set) and eval (the evaluation set). In the penultimate line of the program, if(i<3): the utt is written to enroll, otherwise to eval; so the sizes of the enrollment and evaluation sets can be changed by changing this value.
6) After utt2spk is re-created, the trials have to be generated, by local/produce_trials.py. The trials file is the list of enrolled speakers and the different voices they need to be scored against; the format is the same as the trials example shown earlier, i.e. <uttid> <spkid> <target/nontarget>.
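For step 3), the adjusted extractor call could look roughly like this (a sketch following the aishell v1 layout; the final.ubm argument comes from the preceding train_full_ubm.sh stage, and the reduced settings match the memory constraints described above):
# 2 * 2 * 2 = 8 concurrent jobs instead of the much larger default:
sid/train_ivector_extractor.sh --cmd "$train_cmd" \
  --nj 2 --num-threads 2 --num-processes 2 \
  --ivector-dim 400 --num-iters 5 \
  exp/full_ubm_1024/final.ubm data/train exp/extractor_1024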
TIMIT+Aishell
Let us first look at how the AISHELL and TIMIT databases are partitioned. AISHELL has 400 speakers in total, split by default into train, dev and test sets: 340 speakers in train, 40 in dev and 20 in test. The recipe uses train as the training set and test as the test set; dev is not used. Each AISHELL speaker has about 300 utterances, each one a single sentence of roughly 2~6 s. The TIMIT database has 630 speakers, split into train and test: 462 speakers in the training set and 168 in the test set, each speaker with 10 utterances of roughly 2~4 s. Here TIMIT's original partition is used directly: 462 speakers for training and 168 for testing.
Having understood the differences between the two databases and the overall voiceprint recognition flow, we can begin rewriting the recipe. Not much actually needs to change; the main work is modifying the data preparation stage and the generation of the trials. For data preparation, we can write our own timit_data_prepare.sh modeled on the aishell_data_prep.sh script. The data preparation stage generates three files, utt2spk, spk2utt and wav.scp, in the formats shown earlier; a sketch of the generation loop follows.
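A minimal version of that loop might look like this (a sketch only, assuming the converted RIFF wavs are laid out as $wavdir/<speaker>/<utterance>.wav; prefixing each uttid with its spkid keeps utt2spk consistently sorted):
wavdir=/path/to/timit_riff/train
dir=data/train
mkdir -p $dir
rm -f $dir/wav.scp $dir/utt2spk            # start clean
for wav in $wavdir/*/*.wav; do
  spk=$(basename "$(dirname "$wav")")      # speaker id, taken from the directory name
  utt=${spk}_$(basename "$wav" .wav)       # utterance id, prefixed with the speaker id
  echo "$utt $wav" >> $dir/wav.scp
  echo "$utt $spk" >> $dir/utt2spk
done
sort -o $dir/wav.scp $dir/wav.scp          # Kaldi expects sorted files
sort -o $dir/utt2spk $dir/utt2spk
utils/utt2spk_to_spk2utt.pl $dir/utt2spk > $dir/spk2utt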
The script next checks that the wav files found add up to 141924, and then builds wav.scp, utt2spk, spk2utt, and the transcripts.txt used for speech recognition. Here we find the parts of the script related to transcripts.txt and delete them.
After the data preparation stage is complete, we can run the voiceprint recognition steps described above. One thing to note is the trials: if a speaker has only two or three utterances, you need to modify the proportions assigned to the enroll and eval sets. Since every speaker in the TIMIT database has 10 utterances, leaving them unmodified is fine; here 3 utterances are used for enrollment and the remaining 7 for verification.
The final error rate was about 4.5%. Although that is an acceptable result, it is still much worse than the 0.18% error rate on AISHELL. Possible reasons: first, there is less training speech; TIMIT has 462 training speakers but only 10 utterances each, far fewer than AISHELL's 340 training speakers with about 300 utterances each. Likewise, TIMIT's test set has 168 speakers, many more than AISHELL's 40. Moreover, the recipe's default UBM order and i-vector dimension are quite high for this amount of data, which may also push the error rate up. To reduce the error rate further, try shrinking the UBM and i-vector dimensions; after I reduced both, the error rate eventually reached 1.53%.
Other notes
- Files generated under sudo end up locked (owned by root); the way to unlock them is
sudo chmod 777 file
- Single-step debugging:
sh -x script.sh
After modifying the script, if you don't want to re-debug from the start, comment out the already-verified lines as a BLOCK (see the sketch at the end of this list).
- Thread settings in train_ivector_extractor.sh: even with the three parameters set to 2 as described above, I could not get it to run. Only after lowering the UBM to 600 Gaussians and the i-vector dimension to 400 did the program run reasonably fast, and it still took about an hour. I suspect they could be made even smaller, because the utterances are very short.
- If you don't use the text file, this can be switched off at the top of one of the sub-scripts instead of hunting for every specific line to comment out; I forget which script it was, so run it and track the error down when it appears.
- The only thing that has to be prepared in advance is the dataset itself. TIMIT's data format and directory layout are the biggest headache; it is best to write a script to handle them.
- In aishell_data_prep.sh, comment out everything related to dev and to transcripts.txt.
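The multi-line BLOCK comment mentioned above is the usual shell no-op heredoc trick: everything between the markers is read but never executed (BLOCK is an arbitrary label, and the command inside is just an example):
: <<'BLOCK'
steps/make_mfcc.sh --nj 2 data/train exp/make_mfcc/train mfcc   # already debugged, skipped on re-runs
BLOCK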