Data:
First, let's take a look at the data:
The corpus contains 9 documents, one document per line. The data is stored in a plain-text file named LDA_test.txt.
Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey
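For reproducibility, here is a minimal sketch that writes these nine lines to LDA_test.txt (this file-creation step is an assumption; any plain-text file with one document per line works):
docs = """Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey"""
with open('LDA_test.txt', 'w') as f_out:
    f_out.write(docs)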
Code:
(1) First, read the file in:
from pprint import pprint
from gensim import corpora, models, similarities

f = open('LDA_test.txt')
(2) Then tokenize each document (line) and remove stop words:
stop_list = set('for a of the and to in'.split())
texts = [[word for word in line.strip().lower().split() if word not in stop_list] for line in f]
print 'Text = '
pprint(texts)
Output:
Text =
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'management', 'system'],
['system', 'human', 'system', 'engineering', 'testing', 'eps'],
['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
['generation', 'random', 'binary', 'unordered', 'trees'],
['intersection', 'graph', 'paths', 'trees'],
['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
['graph', 'minors', 'survey']]
(3) Build the dictionary:
dictionary = corpora.Dictionary(texts)
print dictionary
V = len(dictionary)  # size of the vocabulary
The printed dictionary shows 35 unique tokens in total:
Dictionary(35 unique tokens: [u'minors', u'generation', u'testing', u'iv', u'engineering']...)
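To see which integer index was assigned to which word, one can inspect the dictionary's token2id mapping (a small sketch; the exact id assignment may vary across gensim versions):
# word -> integer id mapping used by doc2bow below
pprint(dictionary.token2id)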
(4) Compute the TF-IDF value of each document:
# Using the dictionary, convert each document into (index, count) form
corpus = [dictionary.doc2bow(text) for text in texts]
# print one document per line
for line in corpus:
    print line
After the conversion, each line is still one document, but each word has been replaced by an (index, count) pair; the index comes from the dictionary's (index, word) mapping. The output looks like this:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
[(4, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
[(6, 1), (7, 1), (9, 1), (13, 1), (14, 1)]
[(5, 1), (7, 2), (14, 1), (15, 1), (16, 1)]
[(9, 1), (10, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
[(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
[(25, 1), (26, 1), (27, 1), (28, 1)]
[(25, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
[(8, 1), (26, 1), (29, 1)]
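To read such a vector back as words, the dictionary can map ids to tokens (a sketch):
# translate the first document's (index, count) pairs back into words
print [(dictionary[idx], count) for idx, count in corpus[0]]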
Now compute the tf-idf value of every word in every document:
corpus_tfidf = models.TfidfModel(corpus)[corpus]
# print one document per line
print 'TF-IDF:'
for c in corpus_tfidf:
    print c
Each line is still one document, but the 1s from the previous step have been replaced by the tf-idf value of the corresponding word index.
TF-IDF:
[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
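As a sanity check, the first document's weights can be reproduced by hand, assuming gensim's TfidfModel defaults (raw term frequency, base-2 log IDF, L2 normalization per document):
import math
N = 9                                                  # number of documents
dfs = [1, 1, 1, 1, 2, 2, 2]                            # document frequencies of the 7 words in document 0
weights = [math.log(float(N) / df, 2) for df in dfs]   # tf is 1 for every word here
norm = math.sqrt(sum(w * w for w in weights))
print [w / norm for w in weights]                      # ~0.4301 for df=1, ~0.2944 for df=2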
(5) Apply the LDA model
The previous four steps were essentially feature preparation: here we use each document's tf-idf values as the features fed into the LDA model. (A comparison sketch with raw counts follows below.)
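Note that LDA is more commonly trained directly on bag-of-words counts rather than tf-idf weights; a variant sketch for comparison (hypothetical, its results will differ from the tf-idf run below):
# alternative: train on the raw (index, count) corpus instead of tf-idf
lda_bow = models.LdaModel(corpus, num_topics=2, id2word=dictionary)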
Train the model:
print '\nLDA Model:'
# number of topics
num_topics = 2
# train the model
lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                      alpha='auto', eta='auto', minimum_probability=0.001)
Print the probability of each document under each topic:
doc_topic = [a for a in lda[corpus_tfidf]]
print 'Document-Topic:\n'
pprint(doc_topic)
LDA Model:
Document-Topic:
[[(0, 0.25865201763870671), (1, 0.7413479823612934)],
[(0, 0.6704214035190138), (1, 0.32957859648098625)],
[(0, 0.34722886288787302), (1, 0.65277113711212698)],
[(0, 0.64268836524831052), (1, 0.35731163475168948)],
[(0, 0.67316053818546506), (1, 0.32683946181453505)],
[(0, 0.37897103968594514), (1, 0.62102896031405486)],
[(0, 0.6244681672561716), (1, 0.37553183274382845)],
[(0, 0.74840501728867792), (1, 0.25159498271132213)],
[(0, 0.65364678163446832), (1, 0.34635321836553179)]]
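To pull out just the dominant topic for each document, a small sketch:
# pick the highest-probability topic per document
for i, topics in enumerate(doc_topic):
    print i, max(topics, key=lambda t: t[1])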
Print the probability of each word within each topic:
Because minimum_probability=0.001 was passed when the model was trained, entries below this threshold were filtered from the document-topic output above; show_topic itself returns only the top 10 words per topic by default.
for topic_id in range(num_topics):
    print 'Topic', topic_id
    pprint(lda.show_topic(topic_id))
Topic 0
[(u'system', 0.041635423550867606),
(u'survey', 0.040429107770606001),
(u'graph', 0.038913672197129358),
(u'minors', 0.038613604352799001),
(u'trees', 0.035093470419085344),
(u'time', 0.034314182442026844),
(u'user', 0.032712431543062859),
(u'response', 0.032562733895067024),
(u'eps', 0.032317332054789358),
(u'intersection', 0.031074066863528784)]
Topic 1
[(u'interface', 0.038423961073724748),
(u'system', 0.036616390857180062),
(u'management', 0.03585869312482335),
(u'graph', 0.034776623890248701),
(u'user', 0.03448476247382859),
(u'survey', 0.033892977987880241),
(u'eps', 0.033683486487186061),
(u'computer', 0.032741732328417393),
(u'minors', 0.031949259380969104),
(u'human', 0.03156868862825063)]
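To infer topics for an unseen document, the same pipeline applies. A sketch (the TfidfModel is refit here because the earlier step did not keep a reference to it; 'human computer interaction survey' is a made-up query, and tokens missing from the dictionary are silently ignored by doc2bow):
tfidf_model = models.TfidfModel(corpus)
new_bow = dictionary.doc2bow('human computer interaction survey'.lower().split())
print lda[tfidf_model[new_bow]]              # [(topic_id, probability), ...]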
Compute document-to-document similarity:
The similarity is computed over the LDA topic distributions of the documents (which were themselves derived from the tf-idf vectors).
similarity = similarities.MatrixSimilarity(lda[corpus_tfidf])
print 'Similarity:'
pprint(list(similarity))
Similarity:
[array([ 0.99999994, 0.71217406, 0.98829806, 0.74671113, 0.70895636,
0.97756702, 0.76893044, 0.61318189, 0.73319417], dtype=float32),
array([ 0.71217406, 1. , 0.81092042, 0.99872446, 0.99998957,
0.8440569 , 0.99642557, 0.99123365, 0.99953747], dtype=float32),
array([ 0.98829806, 0.81092042, 1. , 0.83943164, 0.808236 ,
0.99825525, 0.85745317, 0.72650033, 0.82834125], dtype=float32),
array([ 0.74671113, 0.99872446, 0.83943164, 0.99999994, 0.99848306,
0.87005669, 0.99941987, 0.98329824, 0.99979806], dtype=float32),
array([ 0.70895636, 0.99998957, 0.808236 , 0.99848306, 1. ,
0.84159577, 0.99602884, 0.99182749, 0.99938792], dtype=float32),
array([ 0.97756702, 0.8440569 , 0.99825525, 0.87005669, 0.84159577,
0.99999994, 0.88634008, 0.76580745, 0.85997516], dtype=float32),
array([ 0.76893044, 0.99642557, 0.85745317, 0.99941987, 0.99602884,
0.88634008, 1. , 0.9765296 , 0.99853373], dtype=float32),
array([ 0.61318189, 0.99123365, 0.72650033, 0.98329824, 0.99182749,
0.76580745, 0.9765296 , 0.99999994, 0.9867571 ], dtype=float32),
array([ 0.73319417, 0.99953747, 0.82834125, 0.99979806, 0.99938792,
0.85997516, 0.99853373, 0.9867571 , 1. ], dtype=float32)]
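A usage sketch: rank all documents by their similarity to document 0, using the index built above:
# similarities of every document to document 0, highest first
sims = similarity[doc_topic[0]]
ranking = sorted(enumerate(sims), key=lambda item: -item[1])
print ranking[:3]                            # document 0 itself comes first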