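Introduction to GloVe
GloVe is a method from Stanford University for generating word vectors. It combines global corpus statistics with local context-window statistics to learn vector representations of words. Official page: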
[http://nlp.stanford.edu/projects/glove/]
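Installing and Using GloVe
This section covers the C version of GloVe on Ubuntu Linux.
Download the code
Download GloVe-1.2.zip from the official site [https://nlp.stanford.edu/projects/glove/].
Code files overview:
After entering the glove directory, read README.txt first. It explains that the package consists of four programs, run in this order: vocab_count, cooccur, shuffle, glove.
1. vocab_count: counts word frequencies in the raw text (produces vocab.txt, one "word frequency" pair per line)
2. cooccur: counts word-word co-occurrences, which appear to be pairs of words inside a context window, much like the window in word2vec (produces the binary file cooccurrence.bin)
3. shuffle: shuffles the co-occurrence records from step 2 (produces another binary file, cooccurrence.shuf.bin)
4. glove: trains the GloVe model using the files from steps 1 and 3, and outputs vectors.txt and vectors.bin (the former is plain text that can be opened directly and is the one to study; the latter is binary)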
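To make the vocab.txt layout from step 1 concrete, here is a minimal sketch that reproduces the same "word frequency" format with standard Unix tools on a toy corpus. It is not how vocab_count is implemented. The file names tiny_corpus.txt and tiny_vocab.txt are made up for illustration, and breaking ties alphabetically is an assumption of this sketch.

```shell
# Toy corpus: one line of whitespace-separated tokens (hypothetical file name).
printf 'the cat sat on the mat the end\n' > tiny_corpus.txt

# Put one token per line, count duplicates, sort by descending count
# (ties broken alphabetically), and print "word count" pairs like vocab.txt.
tr -s ' ' '\n' < tiny_corpus.txt \
  | sort \
  | uniq -c \
  | sort -k1,1nr -k2,2 \
  | awk '{print $2, $1}' > tiny_vocab.txt

cat tiny_vocab.txt
```

The first line of tiny_vocab.txt is "the 3", followed by the five words that occur once.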
Training word vectors
- Unzip and upload the code: unzip the downloaded file to drive E:, then upload it:
put -r E:\GloVe-1.2
- Enter the GloVe-1.2 directory and compile the program:
make
If compilation succeeds, a build directory is created.
- Run the demo script:
./demo.sh
On success this produces cooccurrence.bin, cooccurrence.shuf.bin, vocab.txt, vectors.txt, and vectors.bin. Open vectors.txt to see the trained word vectors.
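A quick way to sanity-check the text output is to count the words and the vector dimension. The sketch below assumes every line of vectors.txt has the form "word v1 v2 ... vN" separated by spaces; describe_vectors is a hypothetical helper name, not part of GloVe.

```shell
# Hypothetical helper: report vocabulary size and vector dimension
# of a text vectors file whose lines look like: word v1 v2 ... vN
describe_vectors() {
  awk 'NR == 1 { dim = NF - 1 } END { print NR " words, " dim " dimensions" }' "$1"
}

# Usage once training has finished:
#   describe_vectors vectors.txt
#   head -n 3 vectors.txt
```

With the default demo settings the reported dimension should match VECTOR_SIZE in demo.sh.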
(1) If the script fails with a permission error,
fix the permissions as follows:
chmod +x demo.sh
(2) If the script still fails with an error,
modify the following part of demo.sh:
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
if [[ $? -eq 0 ]]
then
  $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
  if [[ $? -eq 0 ]]
  then
    $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
    if [[ $? -eq 0 ]]
    then
      $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
      if [[ $? -eq 0 ]]
      then
        if [ "$1" = 'matlab' ]; then
          matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
        elif [ "$1" = 'octave' ]; then
          octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
        else
          python eval/python/evaluate.py
        fi
      fi
    fi
  fi
fi
Change the code above to the following:
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
if [ "$CORPUS" = 'text8' ]; then
  if [ "$1" = 'matlab' ]; then
    matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2
  elif [ "$1" = 'octave' ]; then
    octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
  else
    echo "$ python eval/python/evaluate.py"
    python eval/python/evaluate.py
  fi
fi
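The replacement code repeats every command twice (once in echo, once for real) and no longer stops when a step fails. One possible way to keep both the logging and the early exit is a small wrapper. This is only a sketch: run is a hypothetical helper that does not exist in demo.sh, and unlike the echo lines above it does not print the redirections.

```shell
# Hypothetical helper: print the command to stderr, then run it.
# Printing to stderr keeps the log line out of any file that stdout
# is redirected to (for example $VOCAB_FILE).
run() {
  printf '$ %s\n' "$*" >&2
  "$@"
}

# Usage inside demo.sh (variables as defined there); stop at the first failure:
#   run $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE || exit 1
#   run $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE || exit 1
```

The wrapper uses only `[ ]`-free POSIX constructs, so it behaves the same whether the script is started with sh or bash.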
The corpus used by demo.sh is text8, downloaded from [http://mattmahoney.net/dc/text8.zip].
To train on your own corpus, modify the following parts of demo.sh:
make
if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi
# GloVe parameters
CORPUS=text8 # path to the already-tokenized corpus file
VOCAB_FILE=vocab.txt # output vocabulary
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50 # word vector dimension
MAX_ITER=15
WINDOW_SIZE=15 # context window size
BINARY=2 # also write binary output (2 = both text and binary)
NUM_THREADS=8
X_MAX=10
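When pointing CORPUS at your own file, it helps to fail early if the path is wrong, before the four programs start. The sketch below is one way to do that; check_corpus and my_corpus.txt are hypothetical names, and the token count simply assumes tokens are separated by whitespace, as in an already-tokenized corpus.

```shell
# Hypothetical helper: fail early if the corpus is missing or empty,
# otherwise print its whitespace-separated token count.
check_corpus() {
  if [ ! -s "$1" ]; then
    echo "corpus '$1' is missing or empty" >&2
    return 1
  fi
  wc -w < "$1" | tr -d ' '
}

# Usage in demo.sh, right after setting CORPUS (my_corpus.txt is a placeholder):
#   CORPUS=my_corpus.txt
#   check_corpus "$CORPUS" || exit 1
```

Note that the download block above only checks for text8, so it will still fetch text8 when that file is absent, even if CORPUS points elsewhere; it can be removed when training on your own data.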