metaknowledge Text Analysis: LDA and NMF in Practice

Imports

from __future__ import print_function
import metaknowledge as mk
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np  # used below as np.asarray
import gensim
from gensim import corpora, models
from stop_words import get_stop_words
from nltk.tokenize import RegexpTokenizer
import pyLDAvis
import pyLDAvis.gensim as gensimvis

Core metaknowledge code

  • Extract the text to analyze and convert it to an array
folder_collec = mk.RecordCollection(r'F:\metaknow\example data')
# stopwords here is the English stopword list built below with get_stop_words('en')
topic = folder_collec.forNLP(r'F:\metaknow\example data\LDA_folder.csv',
                             dropList=stopwords, lower=True, removeNumbers=True)
document = topic['abstract']
docs = np.asarray(document)
  • forNLP also saves a CSV file (its structure was shown as a screenshot in the original post): the text file used for the analysis.
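To make the data flow concrete, the dict returned by forNLP can be mocked; this is a minimal sketch assuming forNLP returns a dict of parallel lists keyed by field name (the key names and values here are illustrative, not real metaknowledge output):

```python
import numpy as np

# Illustrative stand-in for the dict produced by forNLP; field names assumed
topic = {
    "id": ["WOS:0001", "WOS:0002"],
    "title": ["Paper one", "Paper two"],
    "abstract": ["scientific collaboration networks",
                 "coauthorship and research impact"],
}

# Same extraction step as in the snippet above
document = topic["abstract"]
docs = np.asarray(document)
print(docs.shape)  # one entry per abstract
```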

LDA with the gensim package

Text cleaning

  • Tokenization
  • Stopword removal
  • Vectorization
*************************************Tokenization*****************************
# Regex tokenizer: splits every English sentence into words
tokenizer = RegexpTokenizer(r'\w+')
# Holds the token streams
tokens = []
for l in document:
    # Tokenize one abstract and store it as a list
    token = tokenizer.tokenize(l)
    tokens.append(token)
# Equivalently: tokens = [tokenizer.tokenize(l) for l in document]
*************************************Stopword removal*****************************

# Load the English stopword list with get_stop_words
stopwords = get_stop_words('en')
# Holds the tokens left after stopword removal
cleaned_tokens = []
for l in tokens:
    cleaned_tokens.append([i for i in l if i not in stopwords])
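The tokenize-then-filter pipeline above can also be sketched with only the standard library (no NLTK or stop_words packages), using `re.findall(r'\w+', ...)` in place of RegexpTokenizer and a tiny hand-written stopword set:

```python
import re

sample_docs = ["Scientific collaboration networks are growing.",
               "We study coauthorship and collaboration."]
stopword_set = {"are", "and", "we", "the"}  # tiny illustrative list

# Tokenize each document, lowercasing so tokens match the stopword list
token_lists = [re.findall(r"\w+", d.lower()) for d in sample_docs]
# Keep only the non-stopword tokens
cleaned = [[w for w in doc if w not in stopword_set] for doc in token_lists]
print(cleaned[0])  # ['scientific', 'collaboration', 'networks', 'growing']
```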

*************************************Vectorization*****************************

dictionary = corpora.Dictionary(cleaned_tokens)
# https://blog.csdn.net/xuxiuning/article/details/47720337
# Assigns a unique ID to every word in the corpus, forming the dictionary
corpus = [dictionary.doc2bow(doc) for doc in cleaned_tokens]
# (doc2bow accepts the token lists directly; converting them to an array first is unnecessary)
# Bag-of-words model: each abstract becomes a vector that ignores word order
# and records only how often each dictionary word occurs.
'''
Given a dictionary such as:
{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
the sentence vector
  [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
means John appears once, likes twice, to/watch/movies once each,
also/football/games zero times, and Mary and too once each.
'''
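The worked example above can be reproduced without gensim. A minimal sketch of what doc2bow computes, using the same vocabulary (here with 0-based IDs) and the sentence "John likes to watch movies Mary likes too":

```python
from collections import Counter

vocab = ["John", "likes", "to", "watch", "movies",
         "also", "football", "games", "Mary", "too"]
sentence = "John likes to watch movies Mary likes too".split()

counts = Counter(sentence)
# Dense bag-of-words vector in dictionary order
vector = [counts.get(word, 0) for word in vocab]
print(vector)  # [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]

# gensim's doc2bow returns the sparse form: (word_id, count) for nonzero entries
sparse = [(i, c) for i, c in enumerate(vector) if c > 0]
```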

Fitting the model

ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=50,
                                           id2word=dictionary, passes=20)
# Print 10 of the 50 topics, with the top 5 words in each
print(ldamodel.print_topics(num_topics=10, num_words=5))
# Save to disk
dictionary.save(r'F:\metaknow\example data\paper_abstracts.dict')
ldamodel.save(r'F:\metaknow\example data\paper_abstracts_lda.model')

Ten of the topics

[(38, '0.031*"collaboration" + 0.024*"scientific" + 0.021*"impact" + 0.017*"research" + 0.009*"researchers"'), 
(17, '0.021*"scientific" + 0.011*"collaboration" + 0.010*"research" + 0.010*"science" + 0.009*"researchers"'),
 (1, '0.034*"scientific" + 0.020*"collaboration" + 0.018*"research" + 0.013*"network" + 0.010*"coauthorship"'),
 (4, '0.017*"research" + 0.015*"china" + 0.014*"hivaids" + 0.010*"coauthorship" + 0.010*"collaboration"'), 
(33, '0.029*"collaboration" + 0.020*"data" + 0.018*"scientific" + 0.007*"events" + 0.007*"can"'), 
(14, '0.023*"collaboration" + 0.016*"scientific" + 0.014*"analysis" + 0.013*"international" + 0.011*"study"'), 
(44, '0.018*"scientific" + 0.016*"knowledge" + 0.015*"collaborations" + 0.012*"trust" + 0.008*"commercialization"'), 
(5, '0.015*"scientific" + 0.010*"research" + 0.009*"order" + 0.008*"teams" + 0.008*"leadership"'), 
(31, '0.024*"collaboration" + 0.024*"research" + 0.022*"scientific" + 0.009*"scientists" + 0.008*"network"'), 
(22, '0.028*"research" + 0.020*"collaboration" + 0.019*"scientific" + 0.010*"impact" + 0.009*"collaborative"')]
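Each topic above is reported as a single weighted-word string. A small sketch for parsing one of those strings back into (word, weight) pairs with a regular expression:

```python
import re

# One topic string in the format produced by print_topics
topic_str = '0.031*"collaboration" + 0.024*"scientific" + 0.021*"impact"'
pairs = [(word, float(weight))
         for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic_str)]
print(pairs)  # [('collaboration', 0.031), ('scientific', 0.024), ('impact', 0.021)]
```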

  • Visualization
vis_data = gensimvis.prepare(ldamodel, corpus, dictionary)

pyLDAvis.show(vis_data,open_browser=False)

LDA visualization (served locally; a screenshot in the original post)

http://127.0.0.1:8888/#topic=0&lambda=1&term=

sklearn

  • TF-IDF matrix
  • Non-negative matrix factorization
    TF-IDF
# Build the TF-IDF matrix: represent each abstract (sentences work too)
# by its TF-IDF values, i.e. a matrix of TF-IDF weights over the documents
************************************Convert raw documents to a TF-IDF matrix****************
features = 500
topics = 20
top_words = 10
# Create the vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.95,  # drop terms appearing in more than 95% of documents
                                   min_df=2,     # drop terms appearing in fewer than 2 documents
                                   max_features=features,  # keep at most 500 terms
                                   # built-in English stopword list
                                   stop_words='english')
# Fit the vectorizer and transform the documents
tfidf = tfidf_vectorizer.fit_transform(docs)
# A second vectorizer that produces raw term counts instead of TF-IDF weights
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=features,
                                stop_words='english')
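The TF-IDF weighting itself is simple to compute by hand. A dependency-free sketch of the plain textbook formula tf × log(N/df) (note that scikit-learn's default uses a smoothed variant, log((1+N)/(1+df)) + 1, so its numbers differ):

```python
import math

docs_tokens = [["network", "collaboration", "network"],
               ["collaboration", "impact"]]
N = len(docs_tokens)

# Document frequency: in how many documents each term appears
df = {}
for doc in docs_tokens:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

# TF-IDF of "network" in the first document
tf = docs_tokens[0].count("network") / len(docs_tokens[0])  # 2/3
tfidf_network = tf * math.log(N / df["network"])            # (2/3) * log(2)
print(round(tfidf_network, 4))
```

"collaboration" appears in both documents, so its idf is log(2/2) = 0 and it carries no weight, which is exactly why ubiquitous terms are down-weighted.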

**********************************Non-negative matrix factorization********************************
tf = tf_vectorizer.fit_transform(docs)
# Non-negative matrix factorization, reducing to 20 topic dimensions
# (note: scikit-learn >= 1.2 replaces alpha with alpha_W / alpha_H)
nmf = NMF(n_components=topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
# Helper for printing the top words of each topic
def print_top_words(model, feature_names, top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-top_words - 1:-1]]))
    print()
# (note: scikit-learn >= 1.0 prefers get_feature_names_out())
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, top_words)
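What NMF does under the hood can be sketched directly in NumPy: factor a non-negative matrix V into non-negative W (document-topic) and H (topic-term) factors with the classic Lee-Seung multiplicative updates. This is a didactic sketch on random data, not scikit-learn's actual solver:

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((6, 5))            # 6 "documents" x 5 "terms", all non-negative
k = 2                             # number of topics

W = rng.random((6, k)) + 0.1      # document-topic weights
H = rng.random((k, 5)) + 0.1      # topic-term weights
for _ in range(200):              # multiplicative updates keep both factors non-negative
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

err = np.linalg.norm(V - W @ H)   # reconstruction error shrinks as updates run
```

Each row of H plays the role of `model.components_` in the sklearn code above: sorting it gives the top terms of one topic.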

Results from sklearn

Topic #0:
collaboration scientific collaborative paper information analysis science social results studies
Topic #1:
international science collaboration national domestic world index sci increased countries
Topic #2:
team scientific teaching construction members innovation university method paper performance
Topic #3:
authors articles number coauthorship journals published article author coauthors publications
Topic #4:
network centrality nodes analysis structure social coauthorship evolution collaboration degree
Topic #5:
data access sharing distributed use software resources including experiments provides
Topic #6:
scientists collaborators changes computer work colleagues connected group coauthored early
Topic #7:
researchers academic increasingly publications sample coauthors brazilian findings communities activities
Topic #8:
networks coauthorship social patterns properties structure clustering links ties high
Topic #9:
research university projects community project topics academic health scientific study
Topic #10:
collaborations scientific firms increasingly success physics domestic large performance benefits
Topic #11:
model empirical distribution degree proposed nodes graph evolution coauthorship node
Topic #12:
countries south collaboration production africa country institutions african latin world
Topic #13:
knowledge sharing scientific innovation production practices domain academic science communication
Topic #14:
china chinese chinas usa eu analysis science past collaborative european
Topic #15:
teams team members scientific new productivity work characteristics size related
Topic #16:
papers published collaboration coauthored citation patterns collaborative identified citations fields
Topic #17:
universities colleges management improve university innovation quality focus scientific engineering
Topic #18:
scholars academic collaboration coauthorship algorithm patterns collaborators based new age
Topic #19:
impact citation publications average citations output number publication greater scientific
?Copyright belongs to the author; contact the author for reprints or content collaboration.