Gensim Models (2): Doc2Vec

Doc2Vec Model

Introduces Gensim's Doc2Vec model and demonstrates its use on the Lee Corpus.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Doc2Vec is a model that represents each document as a vector. This tutorial introduces the model and demonstrates how to train and assess it.

Here is a list of what we will be doing:

  1. Review the relevant models: bag-of-words, Word2Vec, and Doc2Vec
  2. Load and preprocess the training and test corpora (see Corpus)
  3. Train a Doc2Vec model using the training corpus
  4. Demonstrate how the trained model can be used to infer a vector
  5. Assess the model
  6. Test the model on the test corpus

Review: Bag-of-words

Note: If you are already familiar with these models, feel free to skip these review sections.

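You may already be familiar with the bag-of-words model from the Vector section. This model transforms each document into a fixed-length vector of integers. For example, given the sentences: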

John likes to watch movies. Mary likes movies too.

John also likes to watch football games. Mary hates football.

the model outputs the vectors:

[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]

[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]

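Each vector has 11 elements, where each element counts the number of times a particular word occurred in the document. The order of the elements is arbitrary. In the example above, the order of the elements corresponds to the words: ["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"].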
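As a minimal sketch, the counting step can be written with the standard library alone. The helper name `bag_of_words` is hypothetical, and the regular-expression tokenizer is an assumption made for this example; it reproduces the two vectors shown above.

```python
import re
from collections import Counter

# Vocabulary order taken from the example above.
VOCAB = ["John", "likes", "to", "watch", "movies", "Mary",
         "too", "also", "football", "games", "hates"]

def bag_of_words(text, vocab=VOCAB):
    """Return one count per vocabulary word (hypothetical helper)."""
    # Split on anything that is not a letter; keeps case, drops punctuation.
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    return [counts[word] for word in vocab]

doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games. Mary hates football."

print(bag_of_words(doc1))  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
print(bag_of_words(doc2))  # [1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]
```

Because the vocabulary is fixed, every document maps to a vector of the same length, whatever the length of the text.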

Bag-of-words models are surprisingly effective, but they have several weaknesses.

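First, they lose all information about word order: "John likes Mary" and "Mary likes John" correspond to identical vectors. There is a solution: bag of n-grams models consider word phrases of length n to represent documents as fixed-length vectors. This captures local word order but suffers from data sparsity and high dimensionality.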
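The word-order point can be checked with a small sketch. The helper `bag_of_ngrams` is a hypothetical name introduced here, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

def bag_of_ngrams(tokens, n=2):
    """Count contiguous n-word phrases (hypothetical helper)."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)

a = "John likes Mary".split()
b = "Mary likes John".split()

# Same bag of words, because word order is ignored.
print(Counter(a) == Counter(b))              # True
# Different bag of bigrams, because local order is kept.
print(bag_of_ngrams(a) == bag_of_ngrams(b))  # False
print(sorted(bag_of_ngrams(a)))  # [('John', 'likes'), ('likes', 'Mary')]
```

The cost is visible even here: three distinct words already produce four distinct bigrams across the two sentences, which is where the sparsity and high dimensionality come from.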

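Second, the model does not attempt to learn the meaning of the underlying words. As a consequence, the distance between vectors does not always reflect the difference in meaning. The Word2Vec model addresses this second problem.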

Review: Word2Vec Model

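Word2Vec is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word vectors in which vectors close together in vector space have similar meanings based on context, and word vectors distant from each other have differing meanings. For example, strong and powerful would be close together, while strong and Paris would be relatively far apart.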

Gensim's Word2Vec class implements this model.

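With the Word2Vec model, we can calculate a vector for each word in a document. But what if we want to calculate a vector for the entire document? We could average the vectors of all the words in the document. This is quick and crude, but it is often useful. However, there is a better way...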
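The averaging baseline might look like the sketch below. The three-dimensional vectors are invented for illustration and do not come from a trained model, and `average_vector` is a hypothetical helper that skips words it has no vector for.

```python
# Toy 3-dimensional word vectors; the numbers are made up for illustration.
word_vectors = {
    "strong":   [0.9, 0.1, 0.0],
    "powerful": [0.8, 0.2, 0.0],
    "paris":    [0.0, 0.1, 0.9],
}

def average_vector(tokens, vectors):
    """Average the vectors of known tokens (hypothetical helper)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known words in document")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# "unknown" has no vector, so only two words contribute to the mean.
doc_vector = average_vector(["strong", "powerful", "unknown"], word_vectors)
print(doc_vector)
```

Note that the average is the same for any ordering of the words, so this baseline shares the word-order weakness of bag-of-words.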

Introducing: Paragraph Vector

Important: In Gensim, we refer to the Paragraph Vector model as Doc2Vec.

Le and Mikolov introduced the Doc2Vec algorithm in 2014. It usually outperforms simple averaging of Word2Vec vectors.

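The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions and is updated like other word vectors, but which we call a doc-vector. Gensim's Doc2Vec class implements this algorithm.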

There are two implementations:

  1. Paragraph Vector - Distributed Memory (PV-DM)
  2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)

Important: Don't let the implementation details below scare you. They are advanced material: if it is too much, move on to the next section.

PV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both the context word-vectors and the full document's doc-vector.

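PV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word from the full document's doc-vector alone. (It is also common to combine this with skip-gram training, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)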
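To make the two synthetic tasks concrete, the sketch below only lists the (input, target) pairs each mode would learn from. It is an illustration under simplifying assumptions, not Gensim's training code: there is no neural network, no negative sampling, and the window is fixed. The function names are hypothetical.

```python
def pv_dm_examples(doc_tag, tokens, window=2):
    """PV-DM: (document tag + context words) -> center word."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append(((doc_tag, tuple(context)), center))
    return examples

def pv_dbow_examples(doc_tag, tokens):
    """PV-DBOW: document tag alone -> each word in the document."""
    return [(doc_tag, word) for word in tokens]

tokens = ["mary", "likes", "movies"]
for inputs, target in pv_dm_examples(0, tokens, window=1):
    print("PV-DM  ", inputs, "->", target)
for inputs, target in pv_dbow_examples(0, tokens):
    print("PV-DBOW", inputs, "->", target)
```

In Gensim the mode is selected with the `dm` parameter of `Doc2Vec`: `dm=1` selects PV-DM and `dm=0` selects PV-DBOW.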

Prepare the Training and Test Data

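In this tutorial, we train our model using the Lee Background Corpus included in gensim. This corpus contains 314 documents selected from the Australian Broadcasting Corporation's news mail service, which provides text e-mails of headline stories and covers a number of broad topics.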

We will test our model by eye using the much shorter Lee Corpus, which contains 50 documents.

import os
import gensim
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
lee_test_file = os.path.join(test_data_dir, 'lee.cor')

Define a Function to Read and Preprocess Text

Below, we define a function to:

  1. open the train/test file (with latin encoding)
  2. read the file line by line
  3. preprocess each line (tokenize the text into individual words, remove punctuation, convert to lowercase, etc.)

The file we are reading is a corpus. Each line of the file is a document.

Important: To train the model, we need to associate a tag/number with each document of the training corpus. In our case, the tag is simply the zero-based line number.

import smart_open

def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))
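For readers without the corpus files at hand, here is a rough standard-library stand-in for the same pipeline. It is not Gensim's implementation: `rough_preprocess` only approximates `gensim.utils.simple_preprocess` for plain ASCII text (assuming its default token length limits of 2 and 15), and `Tagged` is a hypothetical namedtuple standing in for `TaggedDocument`.

```python
import re
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, which also has `words` and `tags` fields.
Tagged = namedtuple("Tagged", ["words", "tags"])

def rough_preprocess(line, min_len=2, max_len=15):
    """Lowercase, split on non-letters, drop very short or very long tokens."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

def read_lines(lines, tokens_only=False):
    """Mirror of read_corpus above, but over an in-memory list of lines."""
    for i, line in enumerate(lines):
        tokens = rough_preprocess(line)
        yield tokens if tokens_only else Tagged(tokens, [i])

lines = ["There hasn't been much relief.", "A new fire at 5 pm."]
for doc in read_lines(lines):
    print(doc)
```

The length filter is why a contraction such as "hasn't" ends up as the single token `hasn`: the one-letter fragment `t` is discarded, as are digits and one-letter words.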

Let's take a look at the training corpus:

print(train_corpus[:2])

Output:

[TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', 'caused', 'the', 'fire', 'to', 'burn', 'in', 'finger', 'formation', 'have', 'now', 'eased', 'and', 'about', 'fire', 'units', 'in', 'and', 'around', 'hill', 'top', 'are', 'optimistic', 'of', 'defending', 'all', 'properties', 'as', 'more', 'than', 'blazes', 'burn', 'on', 'new', 'year', 'eve', 'in', 'new', 'south', 'wales', 'fire', 'crews', 'have', 'been', 'called', 'to', 'new', 'fire', 'at', 'gunning', 'south', 'of', 'goulburn', 'while', 'few', 'details', 'are', 'available', 'at', 'this', 'stage', 'fire', 'authorities', 'says', 'it', 'has', 'closed', 'the', 'hume', 'highway', 'in', 'both', 'directions', 'meanwhile', 'new', 'fire', 'in', 'sydney', 'west', 'is', 'no', 'longer', 'threatening', 'properties', 'in', 'the', 'cranebrook', 'area', 'rain', 'has', 'fallen', 'in', 'some', 'parts', 'of', 'the', 'illawarra', 'sydney', 'the', 'hunter', 'valley', 'and', 'the', 'north', 'coast', 'but', 'the', 'bureau', 'of', 'meteorology', 'claire', 'richards', 'says', 'the', 'rain', 'has', 'done', 
'little', 'to', 'ease', 'any', 'of', 'the', 'hundred', 'fires', 'still', 'burning', 'across', 'the', 'state', 'the', 'falls', 'have', 'been', 'quite', 'isolated', 'in', 'those', 'areas', 'and', 'generally', 'the', 'falls', 'have', 'been', 'less', 'than', 'about', 'five', 'millimetres', 'she', 'said', 'in', 'some', 'places', 'really', 'not', 'significant', 'at', 'all', 'less', 'than', 'millimetre', 'so', 'there', 'hasn', 'been', 'much', 'relief', 'as', 'far', 'as', 'rain', 'is', 'concerned', 'in', 'fact', 'they', 've', 'probably', 'hampered', 'the', 'efforts', 'of', 'the', 'firefighters', 'more', 'because', 'of', 'the', 'wind', 'gusts', 'that', 'are', 'associated', 'with', 'those', 'thunderstorms'], tags=[0]),
 TaggedDocument(words=['indian', 'security', 'forces', 'have', 'shot', 'dead', 'eight', 'suspected', 'militants', 'in', 'night', 'long', 'encounter', 'in', 'southern', 'kashmir', 'the', 'shootout', 'took', 'place', 'at', 'dora', 'village', 'some', 'kilometers', 'south', 'of', 'the', 'kashmiri', 'summer', 'capital', 'srinagar', 'the', 'deaths', 'came', 'as', 'pakistani', 'police', 'arrested', 'more', 'than', 'two', 'dozen', 'militants', 'from', 'extremist', 'groups', 'accused', 'of', 'staging', 'an', 'attack', 'on', 'india', 'parliament', 'india', 'has', 'accused', 'pakistan', 'based', 'lashkar', 'taiba', 'and', 'jaish', 'mohammad', 'of', 'carrying', 'out', 'the', 'attack', 'on', 'december', 'at', 'the', 'behest', 'of', 'pakistani', 'military', 'intelligence', 'military', 'tensions', 'have', 'soared', 'since', 'the', 'raid', 'with', 'both', 'sides', 'massing', 'troops', 'along', 'their', 'border', 'and', 'trading', 'tit', 'for', 'tat', 'diplomatic', 'sanctions', 'yesterday', 'pakistan', 'announced', 'it', 'had', 'arrested', 'lashkar', 'taiba', 'chief', 'hafiz', 'mohammed', 'saeed', 'police', 'in', 'karachi', 'say', 'it', 'is', 'likely', 'more', 'raids', 'will', 'be', 'launched', 'against', 'the', 'two', 'groups', 'as', 'well', 'as', 'other', 'militant', 'organisations', 'accused', 'of', 'targetting', 'india', 'military', 'tensions', 'between', 'india', 'and', 'pakistan', 'have', 'escalated', 'to', 'level', 'not', 'seen', 'since', 'their', 'war'], tags=[1])]

And the testing corpus looks like this:

print(test_corpus[:2])

Output:

[['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to', 'june', 'chief', 'executive', 'paul', 'batchelor', 'said', 'the', 'result', 'was', 'solid', 'in', 'what', 'he', 'described', 'as', 'the', 'worst', 'conditions', 'for', 'stock', 'markets', 'in', 'years', 'amp', 'half', 'year', 'profit', 'sank', 'per', 'cent', 'to', 'million', 'or', 'share', 'as', 'australia', 'largest', 'investor', 'and', 'fund', 'manager', 'failed', 'to', 'hit', 'projected', 'per', 'cent', 'earnings', 'growth', 'targets', 'and', 'was', 'battered', 'by', 'falling', 'returns', 'on', 'share', 'markets']]

Notice that the testing corpus is just a list of lists and does not contain any tags.

Training the Model

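Now we will instantiate a Doc2Vec model with a vector size of 50 dimensions, iterating over the training corpus 40 times. We set the minimum word count to 2 in order to discard words with very few occurrences. (Without a variety of representative examples, retaining such infrequent words can often make a model worse!) Typical iteration counts in the published Paragraph Vector paper results, using tens of thousands to millions of documents, are 10-20. More iterations take more time and eventually reach a point of diminishing returns.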

However, this is a very small dataset (300 documents) with shortish documents (a few hundred words). Adding training passes can sometimes help with such small datasets.

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

Build a vocabulary:

model.build_vocab(train_corpus)

Output:

2020-09-30 21:08:55,026 : INFO : collecting all words and their counts
2020-09-30 21:08:55,027 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2020-09-30 21:08:55,043 : INFO : collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
2020-09-30 21:08:55,043 : INFO : Loading a fresh vocabulary
2020-09-30 21:08:55,064 : INFO : effective_min_count=2 retains 3955 unique words (56% of original 6981, drops 3026)
2020-09-30 21:08:55,064 : INFO : effective_min_count=2 leaves 55126 word corpus (94% of original 58152, drops 3026)
2020-09-30 21:08:55,098 : INFO : deleting the raw counts dictionary of 6981 items
2020-09-30 21:08:55,100 : INFO : sample=0.001 downsamples 46 most-common words
2020-09-30 21:08:55,100 : INFO : downsampling leaves estimated 42390 word corpus (76.9% of prior 55126)
2020-09-30 21:08:55,149 : INFO : estimated required memory for 3955 words and 50 dimensions: 3679500 bytes
2020-09-30 21:08:55,149 : INFO : resetting layer weights

Essentially, the vocabulary is a list (accessible via model.wv.index_to_key) of all of the unique words extracted from the training corpus. Additional attributes for each word are available using the model.wv.get_vecattr() method. For example, to see how many times "penalty" appeared in the training corpus:

print(f"Word 'penalty' appeared {model.wv.get_vecattr('penalty', 'count')} times in the training corpus.")

Output:

Word 'penalty' appeared 4 times in the training corpus.
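The effect of the minimum word count on the vocabulary can be sketched without Gensim. The helper `build_vocab` below is a hypothetical, simplified stand-in for the counting and filtering that the log above reports (6981 word types reduced to 3955). The toy corpus is invented for illustration.

```python
from collections import Counter

def build_vocab(corpus, min_count=2):
    """Keep words that occur at least `min_count` times (hypothetical helper)."""
    counts = Counter(word for doc in corpus for word in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_corpus = [
    ["fire", "crews", "fight", "fire"],
    ["rain", "eases", "fire", "danger"],
    ["rain", "falls"],
]
vocab = build_vocab(toy_corpus, min_count=2)
print(vocab)  # {'fire': 3, 'rain': 2}
```

Words seen only once are dropped, since a single occurrence gives the model too little context to learn a useful vector.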

Next, train the model on the corpus. If optimized Gensim (with a BLAS library) is being used, this should take no more than 3 seconds. If the BLAS library is not being used, this should take no more than 2 minutes, so use optimized Gensim with BLAS if you value your time.

model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Output:

2021-11-16 10:05:02,148 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 3 workers on 3955 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2021-11-16T10:05:02.148868', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 12:59:45) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}
2021-11-16 10:05:02,205 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,209 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,209 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,210 : INFO : EPOCH - 1 : training on 58152 raw words (42680 effective words) took 0.1s, 773539 effective words/s
2021-11-16 10:05:02,257 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,259 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,260 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,260 : INFO : EPOCH - 2 : training on 58152 raw words (42645 effective words) took 0.0s, 888781 effective words/s
2021-11-16 10:05:02,305 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,307 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,308 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,309 : INFO : EPOCH - 3 : training on 58152 raw words (42665 effective words) took 0.0s, 937587 effective words/s
2021-11-16 10:05:02,354 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,355 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,356 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,357 : INFO : EPOCH - 4 : training on 58152 raw words (42653 effective words) took 0.0s, 955655 effective words/s
2021-11-16 10:05:02,401 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,403 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,404 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,404 : INFO : EPOCH - 5 : training on 58152 raw words (42717 effective words) took 0.0s, 945565 effective words/s
2021-11-16 10:05:02,445 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,446 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,447 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,448 : INFO : EPOCH - 6 : training on 58152 raw words (42625 effective words) took 0.0s, 1035609 effective words/s
2021-11-16 10:05:02,486 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,487 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,488 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,489 : INFO : EPOCH - 7 : training on 58152 raw words (42800 effective words) took 0.0s, 1099399 effective words/s
2021-11-16 10:05:02,525 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,527 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,527 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,528 : INFO : EPOCH - 8 : training on 58152 raw words (42803 effective words) took 0.0s, 1150303 effective words/s
2021-11-16 10:05:02,566 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,568 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,568 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,569 : INFO : EPOCH - 9 : training on 58152 raw words (42763 effective words) took 0.0s, 1105952 effective words/s
2021-11-16 10:05:02,604 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,608 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,609 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,609 : INFO : EPOCH - 10 : training on 58152 raw words (42715 effective words) took 0.0s, 1150022 effective words/s
2021-11-16 10:05:02,649 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,651 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,651 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,651 : INFO : EPOCH - 11 : training on 58152 raw words (42628 effective words) took 0.0s, 1100282 effective words/s
2021-11-16 10:05:02,689 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,690 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,691 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,692 : INFO : EPOCH - 12 : training on 58152 raw words (42673 effective words) took 0.0s, 1115292 effective words/s
2021-11-16 10:05:02,731 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,732 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,733 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,733 : INFO : EPOCH - 13 : training on 58152 raw words (42519 effective words) took 0.0s, 1093006 effective words/s
2021-11-16 10:05:02,770 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,771 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,772 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,773 : INFO : EPOCH - 14 : training on 58152 raw words (42698 effective words) took 0.0s, 1154425 effective words/s
2021-11-16 10:05:02,809 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,809 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,810 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,810 : INFO : EPOCH - 15 : training on 58152 raw words (42717 effective words) took 0.0s, 1198759 effective words/s
2021-11-16 10:05:02,848 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,850 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,852 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,852 : INFO : EPOCH - 16 : training on 58152 raw words (42670 effective words) took 0.0s, 1070404 effective words/s
2021-11-16 10:05:02,889 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,890 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,890 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,891 : INFO : EPOCH - 17 : training on 58152 raw words (42785 effective words) took 0.0s, 1181380 effective words/s
2021-11-16 10:05:02,928 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,929 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,929 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,930 : INFO : EPOCH - 18 : training on 58152 raw words (42781 effective words) took 0.0s, 1151716 effective words/s
2021-11-16 10:05:02,967 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:02,967 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:02,968 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:02,968 : INFO : EPOCH - 19 : training on 58152 raw words (42722 effective words) took 0.0s, 1191799 effective words/s
2021-11-16 10:05:03,005 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,006 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,006 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,007 : INFO : EPOCH - 20 : training on 58152 raw words (42545 effective words) took 0.0s, 1196913 effective words/s
2021-11-16 10:05:03,043 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,046 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,047 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,048 : INFO : EPOCH - 21 : training on 58152 raw words (42669 effective words) took 0.0s, 1088880 effective words/s
2021-11-16 10:05:03,087 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,088 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,089 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,089 : INFO : EPOCH - 22 : training on 58152 raw words (42641 effective words) took 0.0s, 1085220 effective words/s
2021-11-16 10:05:03,127 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,128 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,129 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,130 : INFO : EPOCH - 23 : training on 58152 raw words (42682 effective words) took 0.0s, 1118567 effective words/s
2021-11-16 10:05:03,170 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,171 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,172 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,172 : INFO : EPOCH - 24 : training on 58152 raw words (42579 effective words) took 0.0s, 1068513 effective words/s
2021-11-16 10:05:03,229 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,232 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,235 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,236 : INFO : EPOCH - 25 : training on 58152 raw words (42758 effective words) took 0.1s, 688556 effective words/s
2021-11-16 10:05:03,285 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,286 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,286 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,287 : INFO : EPOCH - 26 : training on 58152 raw words (42724 effective words) took 0.0s, 940922 effective words/s
2021-11-16 10:05:03,328 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,330 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,330 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,331 : INFO : EPOCH - 27 : training on 58152 raw words (42712 effective words) took 0.0s, 1043624 effective words/s
2021-11-16 10:05:03,368 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,370 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,371 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,371 : INFO : EPOCH - 28 : training on 58152 raw words (42606 effective words) took 0.0s, 1107016 effective words/s
2021-11-16 10:05:03,407 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,408 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,409 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,409 : INFO : EPOCH - 29 : training on 58152 raw words (42713 effective words) took 0.0s, 1192136 effective words/s
2021-11-16 10:05:03,446 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,448 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,449 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,449 : INFO : EPOCH - 30 : training on 58152 raw words (42619 effective words) took 0.0s, 1140147 effective words/s
2021-11-16 10:05:03,486 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,487 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,487 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,488 : INFO : EPOCH - 31 : training on 58152 raw words (42653 effective words) took 0.0s, 1161469 effective words/s
2021-11-16 10:05:03,524 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,527 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,527 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,528 : INFO : EPOCH - 32 : training on 58152 raw words (42698 effective words) took 0.0s, 1135948 effective words/s
2021-11-16 10:05:03,565 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,566 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,567 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,567 : INFO : EPOCH - 33 : training on 58152 raw words (42689 effective words) took 0.0s, 1161094 effective words/s
2021-11-16 10:05:03,603 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,605 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,606 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,606 : INFO : EPOCH - 34 : training on 58152 raw words (42571 effective words) took 0.0s, 1153284 effective words/s
2021-11-16 10:05:03,643 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,644 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,645 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,645 : INFO : EPOCH - 35 : training on 58152 raw words (42741 effective words) took 0.0s, 1136146 effective words/s
2021-11-16 10:05:03,683 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,684 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,684 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,685 : INFO : EPOCH - 36 : training on 58152 raw words (42825 effective words) took 0.0s, 1160456 effective words/s
2021-11-16 10:05:03,722 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,723 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,724 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,724 : INFO : EPOCH - 37 : training on 58152 raw words (42707 effective words) took 0.0s, 1152339 effective words/s
2021-11-16 10:05:03,763 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,764 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,764 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,764 : INFO : EPOCH - 38 : training on 58152 raw words (42561 effective words) took 0.0s, 1126831 effective words/s
2021-11-16 10:05:03,801 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,803 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,803 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,803 : INFO : EPOCH - 39 : training on 58152 raw words (42737 effective words) took 0.0s, 1154606 effective words/s
2021-11-16 10:05:03,842 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-11-16 10:05:03,842 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-11-16 10:05:03,843 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-11-16 10:05:03,843 : INFO : EPOCH - 40 : training on 58152 raw words (42875 effective words) took 0.0s, 1138368 effective words/s
2021-11-16 10:05:03,844 : INFO : Doc2Vec lifecycle event {'msg': 'training on 2326080 raw words (1707564 effective words) took 1.7s, 1008257 effective words/s', 'datetime': '2021-11-16T10:05:03.844291', 'gensim': '4.1.2', 'python': '3.8.8 (default, Apr 13 2021, 12:59:45) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'train'}

Now we can use the trained model to infer a vector for any piece of text by passing a list of words to the model.infer_vector function. This vector can then be compared with other vectors via cosine similarity.

vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
print(vector)

Output:

[-0.08478509  0.05011684  0.0675064  -0.19926868 -0.1235586   0.01768214
 -0.12645927  0.01062329  0.06113973  0.35424358  0.01320948  0.07561274
 -0.01645093  0.0692549   0.08346193 -0.01599065  0.08287009 -0.0139379
 -0.17772709 -0.26271465  0.0442089  -0.04659882 -0.12873884  0.28799203
 -0.13040264  0.12478471 -0.14091878 -0.09698066 -0.07903259 -0.10124907
 -0.28239366  0.13270256  0.04445919 -0.24210942 -0.1907376  -0.07264525
 -0.14167067 -0.22816683 -0.00663796  0.23165748 -0.10436232 -0.01028251
 -0.04064698  0.08813146  0.01072008 -0.149789    0.05923386  0.16301566
  0.05815683  0.1258063]

Note that infer_vector() does not take a string, but rather a list of string tokens, which should have already been tokenized the same way as the words property of the original training document objects.

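Also note that because the underlying training/inference algorithms are an iterative approximation problem that makes use of internal randomization, repeated inferences of the same text will return slightly different vectors.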
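Cosine similarity, the measure used to compare inferred vectors, can be written out as a short sketch. The function below is a plain standard-library version for illustration and assumes both vectors are non-zero and of equal length.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Because the measure depends only on direction and not on length, two inferences of the same text that differ slightly still score close to 1.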

Assessing the Model

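To assess our new model, we will first infer new vectors for each document of the training corpus, compare the inferred vectors with the training corpus, and then return the rank of each document based on self-similarity. Basically, we are pretending that the training corpus is some new unseen data and then seeing how it compares with the trained model. The expectation is that we have likely overfit our model (i.e., all of the ranks will be less than 2), so we should be able to find similar documents very easily. Additionally, we will keep track of the second ranks for a comparison of less similar documents.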

ranks = []
second_ranks = []
for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)

    second_ranks.append(sims[1])
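The rank computation in the loop above can be looked at in isolation. In this sketch the `sims` list is invented for illustration. It has the same shape as the list used above (pairs of tag and similarity, sorted best-first), and `self_rank` is a hypothetical helper name.

```python
def self_rank(doc_id, sims):
    """Position of `doc_id` in a similarity list sorted best-first."""
    return [tag for tag, _ in sims].index(doc_id)

# Hypothetical result for document 2: (tag, similarity) pairs, best first.
sims = [(7, 0.91), (2, 0.88), (5, 0.40), (0, -0.05)]

print(self_rank(2, sims))  # 1: another document ranked above document 2
print(sims[1])             # the entry the loop stores in second_ranks
```

A rank of 0 means the document's inferred vector is closest to its own trained vector, which is the outcome we expect for most documents.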

Let's count how each document ranks with respect to the training corpus. Results vary between runs because of random seeding and the very small corpus.

import collections

counter = collections.Counter(ranks)
print(counter)

Output:

Counter({0: 292, 1: 8})

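Basically, more than 95% of the inferred documents are found to be most similar to themselves, and fewer than 5% are mistakenly most similar to another document. Checking an inferred vector against a training vector is a sort of "sanity check" as to whether the model is behaving in a usefully consistent manner, though not a real "accuracy" value.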
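The percentages follow directly from the counter printed above. As a small worked example using those counts (which will differ from run to run):

```python
from collections import Counter

counter = Counter({0: 292, 1: 8})   # the counts printed above
total = sum(counter.values())

for rank, count in sorted(counter.items()):
    print(f"rank {rank}: {count}/{total} = {count / total:.1%}")
# rank 0: 292/300 = 97.3%
# rank 1: 8/300 = 2.7%
```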

This is great and not entirely surprising. We can take a look at an example:

print('Document ({}): ?{}?\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: ?%s?\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Output:

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (299, 0.9490002989768982): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

SECOND-MOST (104, 0.7925528883934021): «australian cricket captain steve waugh has supported fast bowler brett lee after criticism of his intimidatory bowling to the south african tailenders in the first test in adelaide earlier this month lee was fined for giving new zealand tailender shane bond an unsportsmanlike send off during the third test in perth waugh says tailenders should not be protected from short pitched bowling these days you re earning big money you ve got responsibility to learn how to bat he said mean there no times like years ago when it was not professional and sort of bowlers code these days you re professional our batsmen work very hard at their batting and expect other tailenders to do likewise meanwhile waugh says his side will need to guard against complacency after convincingly winning the first test by runs waugh says despite the dominance of his side in the first test south africa can never be taken lightly it only one test match out of three or six whichever way you want to look at it so there lot of work to go he said but it nice to win the first battle definitely it gives us lot of confidence going into melbourne you know the big crowd there we love playing in front of the boxing day crowd so that will be to our advantage as well south africa begins four day match against new south wales in sydney on thursday in the lead up to the boxing day test veteran fast bowler allan donald will play in the warm up match and is likely to take his place in the team for the second test south african captain shaun pollock expects much better performance from his side in the melbourne test we still believe that we didn play to our full potential so if we can improve on our aspects the output we put out on the field will be lot better and we still believe we have side that is good enough to beat australia on our day he said»

MEDIAN (57, 0.24077531695365906): «afghanistan new interim government is to meet for the first time later today after an historic inauguration ceremony in the afghan capital kabul interim president hamid karzai and his fellow cabinet members are looking to start rebuilding afghanistan war ravaged economy mr karzai says he expects the reconstruction to cost many billions of dollars after years of war afghanistan must go from an economy of war to an economy of peace mr karzai said those people who ve earned living by taking the gun must be enabled with programs with plans with projects to put the gun aside and go to the various other forms of economic activity that can bring them livelihood he said»

LEAST (243, -0.0900598019361496): «four afghan factions have reached agreement on an interim cabinet during talks in germany the united nations says the administration which will take over from december will be headed by the royalist anti taliban commander hamed karzai it concludes more than week of negotiations outside bonn and is aimed at restoring peace and stability to the war ravaged country the year old former deputy foreign minister who is currently battling the taliban around the southern city of kandahar is an ally of the exiled afghan king mohammed zahir shah he will serve as chairman of an interim authority that will govern afghanistan for six month period before loya jirga or grand traditional assembly of elders in turn appoints an month transitional government meanwhile united states marines are now reported to have been deployed in eastern afghanistan where opposition forces are closing in on al qaeda soldiers reports from the area say there has been gun battle between the opposition and al qaeda close to the tora bora cave complex where osama bin laden is thought to be hiding in the south of the country american marines are taking part in patrols around the air base they have secured near kandahar but are unlikely to take part in any assault on the city however the chairman of the joint chiefs of staff general richard myers says they are prepared for anything they are prepared for engagements they re robust fighting force and they re absolutely ready to engage if that required he said»

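Notice above that the most similar document (usually the same text) has a similarity score approaching 1.0. The similarity score of the second-ranked document, however, should be significantly lower (assuming the documents are in fact different), and the reason becomes obvious when we examine the texts themselves.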
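
To make that comparison concrete, the sketch below picks the same four positions out of a sorted similarity list and measures the gap between the top two scores. `summarize` is a hypothetical helper, and the four-entry list is a shortened stand-in that reuses the document ids and rounded scores from the output above rather than the full 300-entry result.

```python
def summarize(sims):
    # sims is assumed to be ordered from most to least similar.
    picks = {
        "MOST": sims[0],
        "SECOND-MOST": sims[1],
        "MEDIAN": sims[len(sims) // 2],
        "LEAST": sims[-1],
    }
    # Difference between the best and the runner-up similarity scores.
    gap = sims[0][1] - sims[1][1]
    return picks, gap

# Shortened stand-in: ids and rounded scores taken from the output above.
sims = [(299, 0.949), (104, 0.793), (57, 0.241), (243, -0.090)]

picks, gap = summarize(sims)
for label, entry in picks.items():
    print(label, entry)
print("gap between top two:", round(gap, 3))
```

A large gap suggests the document was matched to itself with little ambiguity, while a small gap would point to a near-duplicate elsewhere in the corpus.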

The random-document cell below can be run repeatedly to see a sampling of other target-document comparisons.

# Pick a random document from the corpus and infer a vector from the model
import random
doc_id = random.randint(0, len(train_corpus) - 1)

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Output:

Train Document (158): «the afl leading goal kicker tony lockett will nominate for the pre season draft after all lockett approached the sydney swans about return to the game last week but after much media speculation decided it was not in the best interests of his family to come out of retirement today the year old changed his mind he has informed the swans of his intention to nominate for next tuesday pre season draft in statement released short time ago lockett says last week he felt rushed and did not feel comfortable with his decision he says over the weekend he had time to think the matter through with his family who support his comeback sydney says it is delighted lockett has decided to make return and it intends to draft him»

Similar Document (246, 0.7817429900169373): «the afl all time leading goalkicker tony lockett will decide within the next week if he will make comeback lockett has told the sydney swans he is interested in coming out of retirement and placing himself in this month pre season draft lockett retired at the end of the season and will turn in march swans chief executive kelvin templeton says the club would welcome lockett back we re not putting any undue pressure on him mr templeton said the approach really came from tony to us rather than the other way mr templeton says if lockett does make comeback the club would not expect him to play every game he certainly could play role albeit reduced role from the one the fans knew him to hold couple of years back he said»

Testing the Model

Using the same approach as above, we infer a vector for a randomly chosen test document and compare that document with our model by eye.

# Pick a random document from the test corpus and infer a vector from the model
doc_id = random.randint(0, len(test_corpus) - 1)
inferred_vector = model.infer_vector(test_corpus[doc_id])
sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

# Compare and print the most/median/least similar documents from the train corpus
print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))

Output:

Test Document (19): «the united nations was determined that its showpiece environment summit the biggest conference the world has ever witnessed should be staged in africa the venue however could not be further removed from the grim realities of life in the rest of africa johannesburg exclusive and formerly whites only suburb of sandton is the wealthiest neighbourhood in the continent just few kilometres from sandton begins the sprawling alexandra township where nearly million people live in squalor organisers of the conference which begins today seem determined that the two worlds should be kept as far apart as possible»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

MOST (298, 0.6037775278091431): «university of canberra academic proposal for republic will be one of five discussed at an historic conference starting in corowa today the conference is part of centenary of federation celebrations and recognises the corowa conference of which began the process towards the federation of australia in university of canberra law lecturer bedeharris is proposing three referenda to determine the republic issue they would decide on whether the monarchy should be replaced the codification powers for head of state and the choice of republic model doctor harris says any constitutional change must involve all australians think it is very important that the people of australia be given the opporunity to choose or be consulted at every stage of the process»

MEDIAN (270, 0.27405408024787903): «businessmen solomon lew and lindsay fox have called on the federal government to help break qantas dominance to ensure their bid for ansett is successful the pair met with the victorian premier steve bracks yesterday to update him on the progress of the bid over the weekend the federal government ruled out further assistance for the proposal mr lew says he has not requested financial assistance from the government but review of trade practices could be important he says he is also hopeful the government will help break qantas dominance of the aviation industry we are concerned of the fact that at this point in time the largest competitor has over per cent market share and the deputy prime minister john anderson did quote both to lindsay and myself and publicly that he would regulate it to per cent mr lew said he says the bid does not require any other government help at no time did we ever ask the government for any grant or any cash payment or any dollars from taxpayers what we asked for was for business from the government which will be forthcoming in our opinion and an assurance that there would be trade practices review of the current airline situation»

LEAST (153, -0.07414346933364868): «at least two helicopters have landed near tora bora mountain in eastern afghanistan in what could be the start of raid against al qaeda fighters an afp journalist said the helicopters landed around pm local time am aedt few hours after al qaeda fighters rejected deadline set by afghan militia leaders for them to surrender or face death us warplanes have been bombing the network of caves and tunnels for eight days as part of the hunt for al qaeda leader osama bin laden several witnesses have spoken in recent days of seeing members of us or british special forces near the frontline between the local afghan militia and the followers of bin laden they could not be seen but could be clearly heard as they came into land and strong lights were seen in the same district us bombers and other warplanes staged series of attacks on the al qaeda positions in the white mountains after bin laden fighters failed to surrender all four crew members of us bomber that has crashed in the indian ocean near diego garcia have been rescued us military officials said pentagon spokesman navy captain timothy taylor said initial reports said that all four were aboard the destroyer uss russell which was rushed to the scene after the crash the bomber which usually carries crew of four and is armed with bombs and cruise missiles was engaged in the air war over afghanistan pentagon officials said they had heard about the crash just after am aedt and were unable to say whether the plane was headed to diego garcia or flying from the indian ocean island it is thought the australian arrested in afghanistan for fighting alongside the taliban is from adelaide northern suburbs but the salisbury park family of year old david hicks is remaining silent the president of adelaide islamic society walli hanifi says mr hicks approached him in having just returned from kosovo where he had developed an interest in islam he says mr hicks wanted to know more about the faith but left after few weeks late yesterday afternoon mr hicks salisbury park family told media the australian federal police had told them not to comment local residents confirmed member of the family called mr hicks had travelled to kosovo in recent years and has not been seen for around three years but most including karen white agree they cannot imagine mr hicks fighting for terrorist regime not unless he changed now but when he left here no he wasn he just normal teenage adult boy she said but man known as nick told channel ten he is sure the man detained in afghanistan is his friend david he says in david told him about training in the kosovo liberation army he gone through six weeks basic training how he been in the trenches you know killed few people you know confirmed kills and had few of his mates killed as well the man said»
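
One detail of the test cell is easy to miss: test documents are passed to infer_vector() directly, while training documents are read through their words attribute. The sketch below illustrates the difference with stand-in data. `TaggedDoc` and `tokens_of` are hypothetical names, and the layout assumes, as the code above suggests, that training documents carry a words attribute while test documents are bare token lists.

```python
from collections import namedtuple

# Stand-in mirroring the words/tags layout of a tagged training document.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

train_corpus = [TaggedDoc(words=["davis", "cup", "final"], tags=[0])]
test_corpus = [["environment", "summit", "johannesburg"]]

def tokens_of(doc):
    # Accept either a tagged document or a bare list of tokens.
    return doc.words if hasattr(doc, "words") else doc

print(tokens_of(train_corpus[0]))
print(tokens_of(test_corpus[0]))
```

Either way, what reaches the model is a plain list of tokens.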