LDA Modeling

Data:

First, let's take a look at the data:
The corpus contains 9 documents, one per line. They are saved in the text file 16.LDA_test.txt (opened below as LDA_test.txt).

Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey

Code:

(1) First, import the required libraries and read the file in:

from pprint import pprint
from gensim import corpora, models, similarities

f = open('LDA_test.txt')

(2)然后對(duì)每行的文檔進(jìn)行分詞,并去掉停用詞:

stop_list = set('for a of the and to in'.split())
texts = [[word for word in line.strip().lower().split() if word not in stop_list] for line in f]
print 'Text = '
pprint(texts)

打印結(jié)果:

Text = 
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

(3)構(gòu)建字典:

dictionary = corpora.Dictionary(texts)
print dictionary

V = len(dictionary) # 字典的長(zhǎng)度

打印字典:總共有35個(gè)詞

Dictionary(35 unique tokens: [u'minors', u'generation', u'testing', u'iv', u'engineering']...)
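Under the hood, corpora.Dictionary simply assigns a unique integer id to each distinct token it sees. A minimal pure-Python sketch of the same idea, using first-seen order (gensim's actual id assignment may differ, so the concrete ids below are illustrative only):

```python
def build_dictionary(texts):
    # Assign a stable integer id to each distinct token, in first-seen order.
    token2id = {}
    for doc in texts:
        for word in doc:
            if word not in token2id:
                token2id[word] = len(token2id)
    return token2id

texts = [['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
         ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time']]
token2id = build_dictionary(texts)
print(len(token2id))  # 13 unique tokens across the two documents
```

Across all 9 documents, the same process yields the 35 unique tokens reported above.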

(4)計(jì)算每個(gè)文檔中的TF-IDF值:

# 根據(jù)字典滑沧,將每行文檔都轉(zhuǎn)換為索引的形式
corpus = [dictionary.doc2bow(text) for text in texts]
# 逐行打印
for line in corpus:
       print line

轉(zhuǎn)換后還是每行一片文章并村,只是原來(lái)的文字變成了(索引,1)的形式滓技,這個(gè)索引根據(jù)的是字典中的(索引哩牍,詞)。打印結(jié)果如下:

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
[(4, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
[(6, 1), (7, 1), (9, 1), (13, 1), (14, 1)]
[(5, 1), (7, 2), (14, 1), (15, 1), (16, 1)]
[(9, 1), (10, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
[(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
[(25, 1), (26, 1), (27, 1), (28, 1)]
[(25, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
[(8, 1), (26, 1), (29, 1)]
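doc2bow itself is essentially a word count over the document, reported as (token_id, count) pairs — note the (7, 2) in the fourth line, where 'system' occurs twice. A rough pure-Python equivalent with collections.Counter, using ids read off the printed output above (the exact ids for 'engineering' and 'testing' are an assumption; either assignment gives the same bag-of-words):

```python
from collections import Counter

def doc2bow(doc, token2id):
    # Count occurrences of each known token and report sorted (id, count) pairs.
    counts = Counter(token2id[w] for w in doc if w in token2id)
    return sorted(counts.items())

token2id = {'human': 5, 'system': 7, 'eps': 14, 'engineering': 15, 'testing': 16}
bow = doc2bow(['system', 'human', 'system', 'engineering', 'testing', 'eps'], token2id)
print(bow)  # [(5, 1), (7, 2), (14, 1), (15, 1), (16, 1)]
```

This reproduces the fourth line of the printed corpus.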

現(xiàn)在對(duì)每篇文檔中的每個(gè)詞都計(jì)算tf-idf值

corpus_tfidf = models.TfidfModel(corpus)[corpus]

#逐行打印
print 'TF-IDF:'
for c in corpus_tfidf:
    print c

Still one document per line, but the counts from the previous step have been replaced by the tf-idf weight of each word index.

TF-IDF:
[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
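By default, gensim's TfidfModel weights each term as tf * log2(N/df) and then L2-normalizes each document vector. The seventh document (['intersection', 'graph', 'paths', 'trees']) can be checked by hand: trees and graph each appear in 3 of the 9 documents, intersection and paths in only 1:

```python
import math

N = 9  # number of documents in the corpus
tf = {'intersection': 1, 'graph': 1, 'paths': 1, 'trees': 1}
df = {'intersection': 1, 'graph': 3, 'paths': 1, 'trees': 3}

# Raw weight: tf * log2(N / df)
raw = {w: tf[w] * math.log2(N / df[w]) for w in tf}
# L2-normalize the document vector
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {w: v / norm for w, v in raw.items()}

print(round(tfidf['trees'], 4))         # 0.3162
print(round(tfidf['intersection'], 4))  # 0.6325
```

These values match the seventh line of the printed TF-IDF output.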

(5)應(yīng)用LDA模型
前面4步可以說(shuō)是特征數(shù)據(jù)的準(zhǔn)備叠必。因?yàn)檫@里我們使用每篇文章的tf-idf值來(lái)作為特征輸入進(jìn)LDA模型荚孵。

訓(xùn)練模型:

print '\nLDA Model:'
# 設(shè)置主題的數(shù)目
num_topics = 2
# 訓(xùn)練模型
lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                      alpha='auto', eta='auto', minimum_probability=0.001)

打印一下每篇文檔被分在各個(gè)主題的概率:

doc_topic = [a for a in lda[corpus_tfidf]]
    print 'Document-Topic:\n'
    pprint(doc_topic)
LDA Model:
Document-Topic:

[[(0, 0.25865201763870671), (1, 0.7413479823612934)],
 [(0, 0.6704214035190138), (1, 0.32957859648098625)],
 [(0, 0.34722886288787302), (1, 0.65277113711212698)],
 [(0, 0.64268836524831052), (1, 0.35731163475168948)],
 [(0, 0.67316053818546506), (1, 0.32683946181453505)],
 [(0, 0.37897103968594514), (1, 0.62102896031405486)],
 [(0, 0.6244681672561716), (1, 0.37553183274382845)],
 [(0, 0.74840501728867792), (1, 0.25159498271132213)],
 [(0, 0.65364678163446832), (1, 0.34635321836553179)]]
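Each row is a probability distribution over the two topics, so it sums to 1, and the most likely topic for a document is simply the argmax. A quick check on the first two rows, with the values copied from the printed output above:

```python
doc_topic = [[(0, 0.25865201763870671), (1, 0.7413479823612934)],
             [(0, 0.6704214035190138), (1, 0.32957859648098625)]]

for i, dist in enumerate(doc_topic):
    total = sum(p for _, p in dist)                 # should be 1.0
    best_topic = max(dist, key=lambda tp: tp[1])[0]  # most likely topic
    print(i, best_topic, round(total, 6))
```

Document 0 is assigned mostly to topic 1, document 1 mostly to topic 0.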

打印每個(gè)主題中,每個(gè)詞出現(xiàn)的概率:
因?yàn)榍懊嬗?xùn)練模型時(shí)傳入了參數(shù)minimum_probability=0.001挠唆,所以小于這個(gè)概率的詞將不被輸出了处窥。

for topic_id in range(num_topics):
    print 'Topic', topic_id
    pprint(lda.show_topic(topic_id))
Topic 0
[(u'system', 0.041635423550867606),
 (u'survey', 0.040429107770606001),
 (u'graph', 0.038913672197129358),
 (u'minors', 0.038613604352799001),
 (u'trees', 0.035093470419085344),
 (u'time', 0.034314182442026844),
 (u'user', 0.032712431543062859),
 (u'response', 0.032562733895067024),
 (u'eps', 0.032317332054789358),
 (u'intersection', 0.031074066863528784)]
Topic 1
[(u'interface', 0.038423961073724748),
 (u'system', 0.036616390857180062),
 (u'management', 0.03585869312482335),
 (u'graph', 0.034776623890248701),
 (u'user', 0.03448476247382859),
 (u'survey', 0.033892977987880241),
 (u'eps', 0.033683486487186061),
 (u'computer', 0.032741732328417393),
 (u'minors', 0.031949259380969104),
 (u'human', 0.03156868862825063)]

計(jì)算文檔與文檔之間的相似性:
相似性是通過(guò)tf-idf計(jì)算的嘱吗。

similarity = similarities.MatrixSimilarity(lda[corpus_tfidf])
print 'Similarity:'
pprint(list(similarity))
Similarity:
[array([ 0.99999994,  0.71217406,  0.98829806,  0.74671113,  0.70895636,
        0.97756702,  0.76893044,  0.61318189,  0.73319417], dtype=float32),
 array([ 0.71217406,  1.        ,  0.81092042,  0.99872446,  0.99998957,
        0.8440569 ,  0.99642557,  0.99123365,  0.99953747], dtype=float32),
 array([ 0.98829806,  0.81092042,  1.        ,  0.83943164,  0.808236  ,
        0.99825525,  0.85745317,  0.72650033,  0.82834125], dtype=float32),
 array([ 0.74671113,  0.99872446,  0.83943164,  0.99999994,  0.99848306,
        0.87005669,  0.99941987,  0.98329824,  0.99979806], dtype=float32),
 array([ 0.70895636,  0.99998957,  0.808236  ,  0.99848306,  1.        ,
        0.84159577,  0.99602884,  0.99182749,  0.99938792], dtype=float32),
 array([ 0.97756702,  0.8440569 ,  0.99825525,  0.87005669,  0.84159577,
        0.99999994,  0.88634008,  0.76580745,  0.85997516], dtype=float32),
 array([ 0.76893044,  0.99642557,  0.85745317,  0.99941987,  0.99602884,
        0.88634008,  1.        ,  0.9765296 ,  0.99853373], dtype=float32),
 array([ 0.61318189,  0.99123365,  0.72650033,  0.98329824,  0.99182749,
        0.76580745,  0.9765296 ,  0.99999994,  0.9867571 ], dtype=float32),
 array([ 0.73319417,  0.99953747,  0.82834125,  0.99979806,  0.99938792,
        0.85997516,  0.99853373,  0.9867571 ,  1.        ], dtype=float32)]
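MatrixSimilarity computes cosine similarity between the documents' (here 2-dimensional) topic vectors. As a spot check, the cosine between the first two topic distributions printed earlier reproduces the matrix entry similarity[0][1] ≈ 0.7122:

```python
import math

a = (0.25865201763870671, 0.7413479823612934)   # doc 0 topic distribution
b = (0.6704214035190138, 0.32957859648098625)   # doc 1 topic distribution

# Cosine similarity: dot product divided by the product of the norms
dot = sum(x * y for x, y in zip(a, b))
cos = dot / (math.hypot(*a) * math.hypot(*b))
print(round(cos, 4))  # 0.7122
```

The small remaining difference from the printed value is float32 rounding in the similarity matrix.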
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末玄组,一起剝皮案震驚了整個(gè)濱河市,隨后出現(xiàn)的幾起案子谒麦,更是在濱河造成了極大的恐慌俄讹,老刑警劉巖,帶你破解...
    沈念sama閱讀 211,948評(píng)論 6 492
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件绕德,死亡現(xiàn)場(chǎng)離奇詭異患膛,居然都是意外死亡,警方通過(guò)查閱死者的電腦和手機(jī)耻蛇,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 90,371評(píng)論 3 385
  • 文/潘曉璐 我一進(jìn)店門踪蹬,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái),“玉大人臣咖,你說(shuō)我怎么就攤上這事跃捣。” “怎么了夺蛇?”我有些...
    開(kāi)封第一講書人閱讀 157,490評(píng)論 0 348
  • 文/不壞的土叔 我叫張陵疚漆,是天一觀的道長(zhǎng)。 經(jīng)常有香客問(wèn)我,道長(zhǎng)娶聘,這世上最難降的妖魔是什么闻镶? 我笑而不...
    開(kāi)封第一講書人閱讀 56,521評(píng)論 1 284
  • 正文 為了忘掉前任,我火速辦了婚禮丸升,結(jié)果婚禮上铆农,老公的妹妹穿的比我還像新娘。我一直安慰自己狡耻,他們只是感情好顿涣,可當(dāng)我...
    茶點(diǎn)故事閱讀 65,627評(píng)論 6 386
  • 文/花漫 我一把揭開(kāi)白布。 她就那樣靜靜地躺著酝豪,像睡著了一般涛碑。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上孵淘,一...
    開(kāi)封第一講書人閱讀 49,842評(píng)論 1 290
  • 那天蒲障,我揣著相機(jī)與錄音,去河邊找鬼瘫证。 笑死揉阎,一個(gè)胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的背捌。 我是一名探鬼主播毙籽,決...
    沈念sama閱讀 38,997評(píng)論 3 408
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼毡庆!你這毒婦竟也來(lái)了坑赡?” 一聲冷哼從身側(cè)響起,我...
    開(kāi)封第一講書人閱讀 37,741評(píng)論 0 268
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤么抗,失蹤者是張志新(化名)和其女友劉穎毅否,沒(méi)想到半個(gè)月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體蝇刀,經(jīng)...
    沈念sama閱讀 44,203評(píng)論 1 303
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡螟加,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 36,534評(píng)論 2 327
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了吞琐。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片捆探。...
    茶點(diǎn)故事閱讀 38,673評(píng)論 1 341
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡,死狀恐怖站粟,靈堂內(nèi)的尸體忽然破棺而出黍图,到底是詐尸還是另有隱情,我是刑警寧澤卒蘸,帶...
    沈念sama閱讀 34,339評(píng)論 4 330
  • 正文 年R本政府宣布雌隅,位于F島的核電站翻默,受9級(jí)特大地震影響,放射性物質(zhì)發(fā)生泄漏恰起。R本人自食惡果不足惜修械,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,955評(píng)論 3 313
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望检盼。 院中可真熱鬧肯污,春花似錦、人聲如沸吨枉。這莊子的主人今日做“春日...
    開(kāi)封第一講書人閱讀 30,770評(píng)論 0 21
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)貌亭。三九已至柬唯,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間圃庭,已是汗流浹背锄奢。 一陣腳步聲響...
    開(kāi)封第一講書人閱讀 32,000評(píng)論 1 266
  • 我被黑心中介騙來(lái)泰國(guó)打工, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留剧腻,地道東北人拘央。 一個(gè)月前我還...
    沈念sama閱讀 46,394評(píng)論 2 360
  • 正文 我出身青樓,卻偏偏與公主長(zhǎng)得像书在,于是被迫代替她去往敵國(guó)和親灰伟。 傳聞我的和親對(duì)象是個(gè)殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 43,562評(píng)論 2 349

推薦閱讀更多精彩內(nèi)容