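All content on this blog is for learning, research, and sharing. If you would like to repost it, please contact me, credit the author and the source, and use it for non-commercial purposes only. Thank you!
I. Abstract
This article walks through three ways to compute TF-IDF:
- computing tfidf values with the gensim library
- computing tfidf values with the sklearn library
- computing tfidf by hand in plain Python
I won't go into the theory behind TF-IDF here; the blog post "TF-IDF原理" (TF-IDF principles) by Ruan Yifeng covers it. It is clear and accessible, and of the many write-ups I have read it is the easiest to follow.
II. Main text
1. Extracting tfidf features from text with gensim
First, let's look at our corpus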
corpus = [
'this is the first document',
'this is the second second document',
'and the third one',
'is this the first document'
]
Next, let's look at the processing steps
1) Tokenize the corpus
[Input]:
word_list = []
for i in range(len(corpus)):
    word_list.append(corpus[i].split(' '))
print(word_list)
[Output]:
[['this', 'is', 'the', 'first', 'document'],
['this', 'is', 'the', 'second', 'second', 'document'],
['and', 'the', 'third', 'one'],
['is', 'this', 'the', 'first', 'document']]
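2) Get the id and frequency of each word
[Input]:
from gensim import corpora
# Assign an integer id to each distinct word in the corpus
dictionary = corpora.Dictionary(word_list)
new_corpus = [dictionary.doc2bow(text) for text in word_list]
print(new_corpus)
# In each tuple, the first element is the word's id in the dictionary and the second is the number of times the word occurs in the document
[Output]: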
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
[(0, 1), (2, 1), (3, 1), (4, 1), (5, 2)],
[(3, 1), (6, 1), (7, 1), (8, 1)],
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]]
[Input]:
# The mapping from each word in the corpus to its id can be inspected like this
print(dictionary.token2id)
[Output]:
{'document': 0, 'first': 1, 'is': 2, 'the': 3, 'this': 4, 'second': 5, 'and': 6,
'one': 7, 'third': 8}
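To make the (id, count) pairs less mysterious, here is a minimal pure-Python sketch that reproduces the two outputs above without gensim. The names `token2id` (as a plain dict) and `to_bow` are my own stand-ins, and the id-assignment rule (alphabetical order within each document) is an assumption chosen to match the printed result, not a statement about gensim's internals.

```python
from collections import Counter

corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document',
]
word_list = [doc.split() for doc in corpus]

# Hypothetical stand-in for dictionary.token2id: each new word gets the next
# free id, taken in alphabetical order within the document that introduces it.
token2id = {}
for words in word_list:
    for word in sorted(set(words)):
        if word not in token2id:
            token2id[word] = len(token2id)

def to_bow(words, token2id):
    """Return sorted (id, count) pairs; words without an id are skipped."""
    counts = Counter(words)
    return sorted((token2id[w], n) for w, n in counts.items() if w in token2id)

bows = [to_bow(words, token2id) for words in word_list]
print(token2id)
print(bows)
# Words that never got an id ('i', 'name') simply do not appear in the result.
print(to_bow('the i first second name'.split(), token2id))
```

The last call shows that a bag-of-words built this way only keeps words that already have an id, which is worth remembering when reading the tfidf results.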
3) Train the gensim model and save it for later use
[Input]:
# Train the model and save it
from gensim import models
tfidf = models.TfidfModel(new_corpus)
tfidf.save("my_model.tfidf")
# Load the model
tfidf = models.TfidfModel.load("my_model.tfidf")
# Use the trained model to get the tfidf values of the words
tfidf_vec = []
for i in range(len(corpus)):
    string = corpus[i]
    string_bow = dictionary.doc2bow(string.lower().split())
    string_tfidf = tfidf[string_bow]
    tfidf_vec.append(string_tfidf)
print(tfidf_vec)
[Output]:
[[(0, 0.33699829595119235),
(1, 0.8119707171924228),
(2, 0.33699829595119235),
(4, 0.33699829595119235)],
[(0, 0.10212329019650272),
(2, 0.10212329019650272),
(4, 0.10212329019650272),
(5, 0.9842319344536239)],
[(6, 0.5773502691896258), (7, 0.5773502691896258), (8, 0.5773502691896258)],
[(0, 0.33699829595119235),
(1, 0.8119707171924228),
(2, 0.33699829595119235),
(4, 0.33699829595119235)]]
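From the output above we can see that the length of each vector doesn't match the number of words in the document, even though what we want is a tfidf value for every word. To find out what is going on, let's run a small test
- A small test to reveal what gensim is really doing
[Input]:
# Pick a few arbitrary words for the test
string = 'the i first second name'
string_bow = dictionary.doc2bow(string.lower().split())
string_tfidf = tfidf[string_bow]
print(string_tfidf)
[Output]: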
[(1, 0.4472135954999579), (5, 0.8944271909999159)]
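Conclusions
- In gensim's tf-idf output, the left element of each pair is the word's id and the right element is its tfidf value
- gensim does not apply a stop-word list here; "the" disappears because it occurs in all four documents, so its idf is log2(4/4) = 0, and gensim leaves zero-weight entries out of the result
- "i" is dropped because it is not in the dictionary, not because it is a single letter; doc2bow ignores words it has not seen
- "name" is dropped for the same reason: it never appeared in the training corpus
- So gensim's result does not contain an entry for every word in the input: zero-weight words and unseen words are missing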
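The numbers gensim printed can be reproduced by hand, which also shows which words drop out of the result and why. The sketch below assumes gensim's default weighting is raw count times log2(N / df), followed by L2 normalisation, with near-zero weights discarded; that assumption matches every value printed above. The function name `gensim_style_tfidf` is mine.

```python
import math
from collections import Counter

corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document',
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)
# Number of documents each word occurs in.
doc_freq = Counter(word for doc in docs for word in set(doc))

def gensim_style_tfidf(words):
    """count * log2(N / df), L2-normalised; zero weights are dropped."""
    # Words that were never seen in the corpus are ignored.
    counts = Counter(w for w in words if w in doc_freq)
    weights = {w: n * math.log2(n_docs / doc_freq[w]) for w, n in counts.items()}
    # A word present in every document has idf log2(1) = 0 and is left out.
    weights = {w: v for w, v in weights.items() if abs(v) > 1e-12}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else {}

first_doc = gensim_style_tfidf(docs[0])
test_doc = gensim_style_tfidf('the i first second name'.split())
print(first_doc)
print(test_doc)
```

Under this reading, "the" is missing because its weight is exactly zero, while "i" and "name" are missing because they were never in the corpus.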
2. Extracting tfidf features from text with sklearn
Our corpus stays the same as above
corpus = [
'this is the first document',
'this is the second second document',
'and the third one',
'is this the first document'
]
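Now let's look at the processing steps
[Input]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)
# Get all distinct words in the corpus
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(tfidf_vec.get_feature_names_out().tolist())
# Get the id of each word
print(tfidf_vec.vocabulary_)
# Get the vector for each sentence
# The values in each vector are ordered by word id
print(tfidf_matrix.toarray())
[Output]: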
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[0. 0.43877674 0.54197657 0.43877674 0. 0.
0.35872874 0. 0.43877674]
[0. 0.27230147 0. 0.27230147 0. 0.85322574
0.22262429 0. 0.27230147]
[0.55280532 0. 0. 0. 0.55280532 0.
0.28847675 0.55280532 0. ]
[0. 0.43877674 0.54197657 0.43877674 0. 0.
0.35872874 0. 0.43877674]]
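sklearn's values differ from gensim's because the weighting differs. As a rough check, the sketch below assumes the TfidfVectorizer defaults amount to count times (ln((1 + N) / (1 + df)) + 1), followed by L2 normalisation, with columns in alphabetical order; those assumptions reproduce the matrix above. The helper name `sklearn_style_row` is mine, and the sketch skips sklearn's own tokenizer, which makes no difference for this corpus.

```python
import math
from collections import Counter

corpus = [
    'this is the first document',
    'this is the second second document',
    'and the third one',
    'is this the first document',
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)
doc_freq = Counter(word for doc in docs for word in set(doc))
vocab = sorted(doc_freq)  # alphabetical order, matching the ids printed above

def sklearn_style_row(words):
    """count * (ln((1 + N) / (1 + df)) + 1), then L2-normalised."""
    counts = Counter(words)
    raw = [counts[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
           for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [sklearn_style_row(doc) for doc in docs]
for row in rows:
    print([round(v, 8) for v in row])
```

Because of the "+ 1" added to the idf, a word that occurs in every document (here "the") still gets a positive weight instead of zero.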
3. Extracting tfidf features from text with plain Python
Our corpus again stays the same
corpus = [
'this is the first document',
'this is the second second document',
'and the third one',
'is this the first document'
]
- Tokenize the corpus
[Input]:
word_list = []
for i in range(len(corpus)):
    word_list.append(corpus[i].split(' '))
print(word_list)
[Output]:
[['this', 'is', 'the', 'first', 'document'],
['this', 'is', 'the', 'second', 'second', 'document'],
['and', 'the', 'third', 'one'],
['is', 'this', 'the', 'first', 'document']]
- Count word frequencies
[Input]:
from collections import Counter
countlist = []
for i in range(len(word_list)):
    count = Counter(word_list[i])
    countlist.append(count)
countlist
[Output]:
[Counter({'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1}),
Counter({'document': 1, 'is': 1, 'second': 2, 'the': 1, 'this': 1}),
Counter({'and': 1, 'one': 1, 'the': 1, 'third': 1}),
Counter({'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1})]
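- Define the functions that make up the tfidf formula
import math
# word comes from count, and count comes from countlist
# count[word] is the word's frequency; sum(count.values()) is the total number of words in the sentence
def tf(word, count):
    return count[word] / sum(count.values())
# Number of sentences that contain the word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)
# len(count_list) is the total number of sentences and n_containing(word, count_list) is the number of sentences containing the word; the 1 is added to keep the denominator from being 0
# Note: because of the +1, a word that occurs in every sentence gets a negative idf, e.g. log(4 / 5) for "the"
def idf(word, count_list):
    return math.log(len(count_list) / (1 + n_containing(word, count_list)))
# Multiply tf by idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)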
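Written out as formulas, the functions above compute the following. The symbols are my own notation: c(w, d) is the number of times word w occurs in sentence d, N is the number of sentences, and n_w is the number of sentences containing w.

```latex
\mathrm{tf}(w, d) = \frac{c(w, d)}{\sum_{w'} c(w', d)}, \qquad
\mathrm{idf}(w) = \ln\frac{N}{1 + n_w}, \qquad
\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)

% "first" in sentence 1: c = 1, sentence length 5, N = 4, n_w = 2
\mathrm{tfidf}(\text{first}, d_1) = \tfrac{1}{5} \ln\tfrac{4}{3} \approx 0.05754

% "the" in sentence 1: n_w = 4, so the logarithm is negative
\mathrm{tfidf}(\text{the}, d_1) = \tfrac{1}{5} \ln\tfrac{4}{5} \approx -0.04463
```

The second worked value shows the side effect of the +1 in the denominator: a word found in every sentence ends up with a small negative score rather than zero.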
- Compute the tfidf value of each word
[Input]:
import math
for i, count in enumerate(countlist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
[Output]:
Top words in document 1
Word: first, TF-IDF: 0.05754
Word: this, TF-IDF: 0.0
Word: is, TF-IDF: 0.0
Word: document, TF-IDF: 0.0
Word: the, TF-IDF: -0.04463
Top words in document 2
Word: second, TF-IDF: 0.23105
Word: this, TF-IDF: 0.0
Word: is, TF-IDF: 0.0
Word: document, TF-IDF: 0.0
Word: the, TF-IDF: -0.03719
Top words in document 3
Word: and, TF-IDF: 0.17329
Word: third, TF-IDF: 0.17329
Word: one, TF-IDF: 0.17329
Word: the, TF-IDF: -0.05579
Top words in document 4
Word: first, TF-IDF: 0.05754
Word: is, TF-IDF: 0.0
Word: this, TF-IDF: 0.0
Word: document, TF-IDF: 0.0
Word: the, TF-IDF: -0.04463
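III. Summary
I put this summary together because I have recently been studying word2vec, which led me to word2vec-based text representations. A trained word2vec model gives you a vector for each word, and those word vectors can then be used to build sentence vectors.
- The usual approach is to get the word2vec vector of every word in the sentence, add the vectors up, and divide by the number of words to get the sentence vector. The result can then be fed to a classifier (such as LogisticRegression) to classify text.
- Another approach is to concatenate the vectors of all the words in the sentence. For example, if each word vector is 1X100 and a sentence has 30 words, concatenating them gives a sentence representation of size 30X100.
- What I want to do is get the word2vec vector of every word in the sentence, multiply each vector by the tfidf value obtained earlier, then add them up and divide by the number of words to get the sentence vector. In other words, tfidf gives each word a weight that reflects how important it is.
- In the end I found that neither gensim nor sklearn met my needs, so I wrote a plain Python version.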
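As a rough illustration of the plain average and the tfidf-weighted average described above, here is a minimal sketch. The 3-dimensional `word_vectors` are made-up numbers standing in for a trained word2vec model, `sentence_vector` is a hypothetical helper name, and the weights are the document 1 scores printed by the plain Python version earlier.

```python
# Hypothetical 3-dimensional vectors standing in for a trained word2vec model.
word_vectors = {
    'this': [0.1, 0.3, 0.5],
    'is': [0.2, 0.1, 0.0],
    'the': [0.0, 0.1, 0.1],
    'first': [0.9, 0.4, 0.2],
    'document': [0.3, 0.8, 0.6],
}

# TF-IDF scores of document 1, copied from the plain Python output above.
tfidf_weights = {
    'first': 0.05754, 'this': 0.0, 'is': 0.0, 'document': 0.0, 'the': -0.04463,
}

def sentence_vector(words, word_vectors, weights=None):
    """Sum the (optionally weighted) word vectors and divide by the word count."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        weight = 1.0 if weights is None else weights[word]
        for k, value in enumerate(word_vectors[word]):
            total[k] += weight * value
    return [t / len(words) for t in total]

words = 'this is the first document'.split()
plain = sentence_vector(words, word_vectors)
weighted = sentence_vector(words, word_vectors, tfidf_weights)
print(plain)
print(weighted)
```

With these particular weights only "first" and "the" contribute to the weighted vector, since the other three words score 0.0 in this small corpus.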