TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency; IDF stands for inverse document frequency.
Why use TF-IDF? Computers can only work with numbers: they cannot make sense of individual words, let alone a sentence or a whole article. TF-IDF converts text into a numerical representation a computer can process, that is, into data that machine-learning or deep-learning models can actually be trained on.
First, let's look at what a text becomes after a TF-IDF transformation (the full code is attached at the end of this post). Each document becomes a row vector of weights, one column per vocabulary term; converting the sparse result to a dense array makes it easy to inspect:
arr = train_text_vector.toarray()  # convert the sparse TF-IDF matrix to a dense array
Now that we know what TF-IDF is for, let's look at how the algorithm works and how to implement it. The underlying idea is actually quite simple.
TF is short for term frequency: the number of times a given term appears in a particular document (note the distinction between this one document and the whole collection discussed below). This count is usually normalized (typically divided by the document's total number of terms) to prevent a bias toward long documents: the same term will tend to have a higher raw count in a long document than in a short one, regardless of how important it actually is.
IDF (inverse document frequency) reflects how often a term occurs across all documents in the collection. If a term appears in many documents, its IDF should be low; conversely, if it appears in only a few documents, its IDF should be high. Specialized terms such as "Machine Learning" are typical high-IDF words. In the extreme case, a term that appears in every document should have an IDF of 0.
Measuring a term's importance by TF alone or IDF alone is one-sided, so TF-IDF combines the strengths of both to evaluate how important a term is to one document within a collection or corpus. A term's importance increases in proportion to how often it appears in the document, but decreases in proportion to how often it appears across the corpus. In short: the more often a term occurs in one article, and the less often it occurs in all the other documents, the better it represents that article and distinguishes it from the rest.
Computing TF-IDF is straightforward.
The TF formula is $\mathrm{TF}_w = \frac{n_w}{N}$, where $n_w$ is the number of times term $w$ appears in the given document and $N$ is the total number of terms in that document.
The IDF formula is $\mathrm{IDF}_w = \log\frac{D}{D_w + 1}$, where $D$ is the total number of documents in the corpus and $D_w$ is the number of documents that contain term $w$. The $+1$ in the denominator avoids a zero denominator when the term appears in no document.
TF-IDF is simply the product of the two: $\mathrm{TF\text{-}IDF}_w = \mathrm{TF}_w \times \mathrm{IDF}_w$.
As these formulas show, a high term frequency within a particular document, combined with a low document frequency across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important, distinctive ones.
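To make the formulas concrete, here is a minimal from-scratch sketch. The four-document toy corpus is made up for illustration; note that scikit-learn (used below) computes a smoothed IDF and applies L2 normalization, so its numbers will differ:
import math

def tf(term, doc):
    # term frequency: occurrences of `term` divided by the document's length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log(D / (D_w + 1))
    d_w = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (d_w + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [["machine", "learning", "is", "fun"],
        ["deep", "learning", "is", "popular"],
        ["python", "is", "fun"],
        ["machine", "learning", "models"]]

print(tf_idf("machine", docs[0], docs))  # 0.25 * log(4/3) ≈ 0.072
print(tf_idf("is", docs[0], docs))       # 0.25 * log(4/4) = 0.0, "is" is too common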
Now let's look at the implementation, simulated with a tiny corpus of just two documents.
TfidfTransformer converts a term-frequency matrix into a TF-IDF matrix, so we first need CountVectorizer to build the term counts, then convert them to TF-IDF:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
# simulated corpus of two documents
corpus = ["second third document",
          "second second document"]
tfvectorizer = CountVectorizer()
count_vector = tfvectorizer.fit_transform(corpus)  # term-frequency (count) matrix
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(count_vector)    # convert the TF matrix into TF-IDF
arr = tfidf.toarray()
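For reference, with scikit-learn's default settings (smoothed IDF plus L2 normalization) and the alphabetically ordered vocabulary ['document', 'second', 'third'], arr should come out approximately as:
print(arr)
# [[0.5015  0.5015  0.7049]   <- "second third document"
#  [0.4472  0.8944  0.    ]]  <- "second second document"
Each row is the L2-normalized TF-IDF vector of one document: in the second document "second" dominates, and "third" does not appear at all.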
Those two rows are the vector representations of the two documents. The code above computes the TF matrix first and then converts it to TF-IDF; there is also a one-step approach:
# TF-IDF in one step
# fit on the whole corpus
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_df=0.5, min_df=0.0003)  # parameters optional; used here to reduce dimensionality
# =============================================================================
# all_text_vector = tfidf.fit_transform(all_text)  # fit and transform in one call
# =============================================================================
tfidf.fit(corpus)  # fit the vectorizer on the corpus
corpus_vector = tfidf.transform(corpus).toarray()
# print(corpus_vector)
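One caveat worth knowing: max_df and min_df given as floats are proportions of the corpus, so their effect depends on corpus size. On this two-document toy corpus, max_df=0.5 prunes every term that appears in both documents, which should leave only 'third' in the vocabulary (get_feature_names_out is available in recent scikit-learn versions; older ones use get_feature_names):
print(tfidf.get_feature_names_out())  # expected: ['third']
print(corpus_vector)                  # expected: [[1.], [0.]]
On a realistically sized corpus, such as the project below, these thresholds behave as intended, dropping only the most common and the rarest terms.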
Full source code of the project:
import pandas as pd
from string import punctuation
import re
def cleandata(data):
    clean = []
    # English punctuation + Chinese punctuation
    punc = punctuation + u'.,;《》?!""''@#¥%…&×()——+【】{};;●,。&~、|\s::'
    for line in data:
        line = re.sub(r"[{}]+".format(punc), " ", line)  # collapse runs of punctuation/whitespace into a single space
        clean.append(line)
    clean = pd.DataFrame(clean)
    return clean
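As a quick sanity check of the cleaner (the sample reviews here are made up for illustration), each run of punctuation or whitespace collapses into a single space:
sample = pd.Series(["Great movie!!! Would watch again...", "太難看了,不推薦。"])
print(cleandata(sample)[0].tolist())
# expected roughly: ['Great movie Would watch again ', '太難看了 不推薦 ']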
# clean the prediction (test) data
predata = pd.read_csv('test.csv')
pre_clean = cleandata(predata['review'])
# clean the training data
traindata = pd.read_csv('train.csv', lineterminator='\n')
train_clean = cleandata(traindata['review'])
# all cleaned text: training set followed by prediction set
all_clean = pd.concat([train_clean, pre_clean])  # DataFrame.append was removed in pandas 2.0
all_text = all_clean.iloc[:, 0]  # take the first column as a Series (avoids errors later)
# split the labelled data: rows [0, m) for training, rows [m, n) for validation
m = 6100
n = 6328
train_text = train_clean[0:m].iloc[:, 0]
test_text = train_clean[m:n].iloc[:, 0]
# TF-IDF, fitted on the entire corpus in one step
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_df=0.5, min_df=0.0003)
# =============================================================================
# all_text_vector = tfidf.fit_transform(all_text)  # fit and transform in one call
# =============================================================================
tfidf.fit(all_text)  # fit on all text so every split shares one vocabulary
all_text_vector = tfidf.transform(all_text).toarray()
train_text_vector = tfidf.transform(train_text)
test_text_vector = tfidf.transform(test_text)
train_label = traindata[0:m]['label']
test_label = traindata[m:n]['label']
# =============================================================================
# Alternative: compute TF first, then convert to IDF
# (count_v0 would be a CountVectorizer previously fitted on all_text)
# count_v1 = CountVectorizer(vocabulary=count_v0.vocabulary_)
# counts_train = count_v1.fit_transform(train_text)
#
# count_v2 = CountVectorizer(vocabulary=count_v0.vocabulary_)
# counts_test = count_v2.fit_transform(test_text)
#
# tfidftransformer = TfidfTransformer()
# train_text_vector = tfidftransformer.fit(counts_train).transform(counts_train)
# test_text_vector = tfidftransformer.fit(counts_test).transform(counts_test)
# print(train_text_vector)
# =============================================================================
# classification methods
# naive Bayes, alpha = 0.2 (score: 0.7)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.2)
# SVM (score: 0.64)
# =============================================================================
# from sklearn.svm import SVC
# clf = SVC(kernel = 'linear')
# =============================================================================
# decision tree
# =============================================================================
# from sklearn.tree import DecisionTreeClassifier
# clf = DecisionTreeClassifier()
# =============================================================================
# logistic regression
# =============================================================================
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression()
# =============================================================================
# MLP
# =============================================================================
# from sklearn.neural_network import MLPClassifier
# clf = MLPClassifier()
# =============================================================================
# train the classifier
clf = clf.fit(train_text_vector, train_label)
# =============================================================================
# preds = clf.predict(test_text_vector)  # output predicted labels
# preds = preds.tolist()
# =============================================================================
# predicted probabilities
proba = clf.predict_proba(test_text_vector)
from sklearn import metrics
auc = metrics.roc_auc_score(test_label, proba[:, 1])  # AUC on the validation split
# vectorize and score the unlabelled prediction set
pre_text = pre_clean.iloc[:, 0]
pre_text_vector = tfidf.transform(pre_text)
pre_proba = clf.predict_proba(pre_text_vector)
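Finally, to persist the predictions, a typical closing step would look like the sketch below (the column name and output file are assumptions, not part of the original project):
result = pd.DataFrame({'Pred': pre_proba[:, 1]})  # probability of the positive class
result.to_csv('submission.csv', index=False)      # hypothetical output file name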