1. Hash (dict) structures
from sklearn.feature_extraction import DictVectorizer
measurements = [
    {'city': 'dubai', 'temperature': 33},
    {'city': 'London', 'temperature': 12},
    {'city': 'San Francisco', 'temperature': 18},
]
vec = DictVectorizer()
vec.fit_transform(measurements).toarray()
# output
array([[ 0., 0., 1., 33.],
[ 1., 0., 0., 12.],
[ 0., 1., 0., 18.]])
vec.get_feature_names()
>>> ['city=London', 'city=San Francisco', 'city=dubai', 'temperature']
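The fitted vectorizer can be reused on new data with transform(). A minimal sketch (the second record uses a hypothetical city, 'Tokyo', that was never seen during fit and is therefore silently dropped):
new_measurements = [
    {'city': 'London', 'temperature': 21},
    {'city': 'Tokyo', 'temperature': 25},   # hypothetical unseen category, ignored by transform
]
vec.transform(new_measurements).toarray()
# expected output:
# array([[ 1.,  0.,  0., 21.],
#        [ 0.,  0.,  0., 25.]])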
2. Bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)  # keep terms that appear in at least one document
vectorizer
>>> CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
X=vectorizer.fit_transform(corpus)
X
>>> <4x9 sparse matrix of type '<class 'numpy.int64'>'
with 19 stored elements in Compressed Sparse Row format>
vectorizer.get_feature_names()
>>> ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
X.toarray()
>>> array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
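The mapping from feature name to column index is stored in the fitted vocabulary_ attribute. A small sketch (for example, 'document' corresponds to the second column above):
vectorizer.vocabulary_.get('document')
# expected output: 1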
Documents described this way completely ignore the relative positions of the words they contain.
analyze("This is a text document to analyze.")
>>> ['this', 'is', 'text', 'document', 'to', 'analyze']
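Words that were not seen in the training corpus are silently ignored in later calls to transform(). A small sketch reusing the vectorizer fitted above:
vectorizer.transform(['Something completely new.']).toarray()
# expected output:
# array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)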
To preserve some of this ordering information, we can extract 2-grams of words in addition to 1-grams (single words):
bigram_vectorizer=CountVectorizer(ngram_range=(1,2), token_pattern=r'\b\w+\b', min_df=1)
analyze=bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!')
>>> ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']
The vocabulary extracted by this vectorizer is therefore much larger than before, and it can resolve ambiguities encoded in local positioning patterns (see the check after the feature list below):
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
X_2
>>> array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]],
dtype=int64)
bigram_vectorizer.get_feature_names()
>>> ['and',
'and the',
'document',
'first',
'first document',
'is',
'is the',
'is this',
'one',
'second',
'second document',
'second second',
'the',
'the first',
'the second',
'the third',
'third',
'third one',
'this',
'this is',
'this the']
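For example, the bigram 'is this' occurs only in the last document, so its column separates 'Is this the first document?' from 'This is the first document.'. A small sketch using the vocabulary_ attribute of the fitted bigram_vectorizer:
feature_index = bigram_vectorizer.vocabulary_.get('is this')
X_2[:, feature_index]
# expected output: array([0, 0, 0, 1], dtype=int64)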
3. TF-IDF term weighting
tf stands for term frequency, idf for inverse document frequency, and tf-idf is the product tf * idf. The scheme was originally developed for information retrieval (ranking in search engines), and it has also turned out to work well for document classification and clustering. A by-hand recomputation of the numbers appears after the example below.
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
transformer
>>> TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False,
        use_idf=True)
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]
tfidf = transformer.fit_transform(counts)
tfidf
>>> <6x3 sparse matrix of type '<class 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>
tfidf.toarray()
>>> array([[ 0.85..., 0. ..., 0.52...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 1. ..., 0. ..., 0. ...],
[ 0.55..., 0.83..., 0. ...],
[ 0.63..., 0. ..., 0.77...]])
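These numbers can be reproduced by hand. A minimal sketch, assuming the default settings shown above (raw counts as tf, smooth_idf=True, i.e. idf = ln((1 + n) / (1 + df)) + 1, and l2 normalization of each row):
import numpy as np

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]], dtype=float)

n_docs = counts.shape[0]                   # 6 documents
df = (counts > 0).sum(axis=0)              # document frequency: [6, 1, 2]
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
tfidf_raw = counts * idf                   # tf * idf
tfidf = tfidf_raw / np.linalg.norm(tfidf_raw, axis=1, keepdims=True)  # l2-normalize each row
tfidf[0]
# expected output: approximately array([0.8515, 0., 0.5243]), matching the first row above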
4. Limitations of the BOW model
A collection of unigrams (the BOW representation) cannot capture phrases or multi-word expressions, and it loses all dependence on word order. In addition, the BOW model cannot account for possible misspellings or word derivations.
ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2,2), min_df=1)
counts=ngram_vectorizer.fit_transform(['word', 'wprds'])
counts
>>> <2x9 sparse matrix of type '<class 'numpy.int64'>'
with 11 stored elements in Compressed Sparse Row format>
ngram_vectorizer.get_feature_names()
>>> [' w', 'd ', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp']
With the 'char_wb' analyzer, character n-grams are created only from text inside word boundaries (padded with a space on each side). The 'char' analyzer, in contrast, creates character n-grams that can span across word boundaries.
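To see the difference, the two analyzers can be compared on the same input. A small sketch using character 5-grams (variable names are illustrative):
wb_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5), min_df=1)
wb_vectorizer.fit_transform(['jumpy fox'])
wb_vectorizer.get_feature_names()
# expected output: [' fox ', ' jump', 'jumpy', 'umpy ']   # n-grams stay inside the padded words

char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5), min_df=1)
char_vectorizer.fit_transform(['jumpy fox'])
char_vectorizer.get_feature_names()
# expected output: ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox']   # n-grams cross the word boundary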