
4.2.3. Text feature extraction

4.2.3.1. The Bag of Words representation

Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

  • tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
  • counting the occurrences of tokens in each document.
  • normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.
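The first two of these steps can be sketched in plain Python (a minimal illustration only; CountVectorizer, introduced below, performs all of this internally):

```python
import re
from collections import Counter

# toy documents for illustration
docs = ["The quick brown fox.", "The lazy dog!"]

# 1. tokenize: lowercase and split on non-word characters
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]

# 2. assign an integer id to every distinct token
vocabulary = {tok: i for i, tok in
              enumerate(sorted({t for doc in tokenized for t in doc}))}

# 3. count token occurrences per document
counts = [Counter(doc) for doc in tokenized]

print(vocabulary)   # -> {'brown': 0, 'dog': 1, 'fox': 2, 'lazy': 3, 'quick': 4, 'the': 5}
print(counts[0]["the"])   # -> 1
```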

In this scheme, features and samples are defined as follows:

  • each individual token occurrence frequency (normalized or not) is treated as a feature.
  • the vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

4.2.3.2. Sparsity

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

In order to be able to store such a matrix in memory and also to speed up algebraic matrix / vector operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
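As a tiny illustration (with made-up counts), scipy.sparse stores only the non-zero entries of such a matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# a tiny, made-up document-term count matrix; real ones are typically >99% zeros
dense = np.array([[0, 1, 0, 2],
                  [0, 0, 0, 1],
                  [3, 0, 0, 0]])
X = csr_matrix(dense)

print(X.nnz)                         # number of stored (non-zero) entries -> 4
print(X.shape)                       # -> (3, 4)
print((X.toarray() == dense).all())  # round-trips to the same dense matrix -> True
```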

4.2.3.3. Common Vectorizer usage

CountVectorizer implements both tokenization and occurrence counting in a single class:

from sklearn.feature_extraction.text import CountVectorizer

This model has many parameters, however the default values are quite reasonable (please see the reference documentation for the details):

>>> vectorizer = CountVectorizer()
>>> vectorizer                     
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:

>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X                              
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>

The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:

>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
True

Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True

>>> X.toarray()           
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)

The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:

>>> vectorizer.vocabulary_.get('document')
1

Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:

>>> vectorizer.transform(['Something completely new.']).toarray()
...                           
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)

Note that in the previous corpus, the first and the last documents have exactly the same words hence are encoded in equal vectors. In particular we lose the information that the last document is an interrogative form. To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):

>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
...                                     token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
...     ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True

The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local positioning patterns:

>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2
...                           
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]]...)

In particular the interrogative form “Is this” is only present in the last document:

>>> feature_index = bigram_vectorizer.vocabulary_.get('is this')
>>> X_2[:, feature_index]     
array([0, 0, 0, 1]...)

4.2.3.4. Tf–idf term weighting

In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for usage by a classifier, it is very common to use the tf–idf transform.

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency:

\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t)

Using the TfidfTransformer’s default settings, TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False), the term frequency, the number of times a term occurs in a given document, is multiplied with the idf component, which is computed as

\text{idf}(t) = log{\frac{1 + n_d}{1 + \text{df}(d,t)}} + 1,

where n_d is the total number of documents, and \text{df}(d,t) is the number of documents that contain term t. The resulting tf-idf vectors are then normalized by the Euclidean norm:

v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}.

This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engines results) that has also found good use in document classification and clustering.

The following sections contain further explanations and examples that illustrate how the tf-idfs are computed exactly and how the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation that defines the idf as

\text{idf}(t) = log{\frac{n_d}{1+\text{df}(d,t)}}.

In the TfidfTransformer and TfidfVectorizer with smooth_idf=False, the “1” count is added to the idf instead of the idf’s denominator:

\text{idf}(t) = log{\frac{n_d}{\text{df}(d,t)}} + 1


This normalization is implemented by the TfidfTransformer class:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer   
TfidfTransformer(norm=...'l2', smooth_idf=False, sublinear_tf=False,
                 use_idf=True)

Again please see the reference documentation for the details on all the parameters.

Let’s take an example with the following counts. The first term is present 100% of the time, hence not very interesting. The two other features occur in less than 50% of the documents, hence are probably more representative of the content of the documents:

>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf                         
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>

>>> tfidf.toarray()                        
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])

Each row is normalized to have unit Euclidean norm:

v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 + v{_2}^2 + \dots + v{_n}^2}}

For example, we can compute the tf-idf of the first term in the first document in the counts array as follows:

n_{d, {\text{term1}}} = 6
\text{df}(d, t)_{\text{term1}} = 6
\text{idf}(d, t)_{\text{term1}} = log \frac{n_d}{\text{df}(d, t)} + 1 = log(1)+1 = 1
\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3

Now, if we repeat this computation for the remaining 2 terms in the document, we get

\text{tf-idf}_{\text{term2}} = 0 \times (log(6/1)+1) = 0
\text{tf-idf}_{\text{term3}} = 1 \times (log(6/2)+1) \approx 2.0986

and the vector of raw tf-idfs:

\text{tf-idf}_{\text{raw}} = [3, 0, 2.0986].

Then, applying the Euclidean (L2) norm, we obtain the following tf-idfs for document 1:

\frac{[3, 0, 2.0986]}{\sqrt{\big(3^2 + 0^2 + 2.0986^2\big)}} = [ 0.819, 0, 0.573].

Furthermore, the default parameter smooth_idf=True adds “1” to the numerator and denominator as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions:

\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1

Using this modification, the tf-idf of the third term in document 1 changes to 1.8473:

\text{tf-idf}_{\text{term3}} = 1 \times (log(7/3)+1) \approx 1.8473

And the L2-normalized tf-idf changes to

\frac{[3, 0, 1.8473]}{\sqrt{\big(3^2 + 0^2 + 1.8473^2\big)}} = [0.8515, 0, 0.5243]


>>> transformer = TfidfTransformer()
>>> transformer.fit_transform(counts).toarray()
array([[ 0.85151335,  0.        ,  0.52433293],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.55422893,  0.83236428,  0.        ],
       [ 0.63035731,  0.        ,  0.77630514]])
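The first row of this output can be reproduced by hand with NumPy, following the smooth_idf=True formula above (a sketch for checking the arithmetic, not part of scikit-learn’s API):

```python
import numpy as np

tf = np.array([3.0, 0.0, 1.0])   # counts for the first document above
df = np.array([6.0, 1.0, 2.0])   # number of documents containing each term
n_d = 6                          # total number of documents

# smooth_idf=True: "1" is added to both numerator and denominator
idf = np.log((1 + n_d) / (1 + df)) + 1
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf)   # L2 normalization

print(np.round(tfidf, 4))        # close to [0.8515, 0, 0.5243], the first row above
```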

The weights of each feature computed by the fit method call are stored in a model attribute:

>>> transformer.idf_                       
array([ 1. ...,  2.25...,  1.84...])

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit_transform(corpus)
...                                
<4x9 sparse matrix of type '<... 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse ... format>

While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers might offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.

As usual the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by pipelining the feature extractor with a classifier:

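A minimal sketch of that approach (the toy corpus, labels, estimator choice, and grid values below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# made-up toy data; a real task would use an actual labelled corpus
docs = ['good movie', 'bad movie', 'great film', 'awful film'] * 5
labels = [1, 0, 1, 0] * 5

pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(random_state=0)),
])

# the grid reaches into both the feature extractor and the classifier
# via the '<step name>__<parameter>' convention
grid = GridSearchCV(pipeline, {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [1e-4, 1e-3],
}, cv=2)
grid.fit(docs, labels)
print(sorted(grid.best_params_))   # -> ['clf__alpha', 'vect__ngram_range']
```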

4.2.3.5. Decoding text files

Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.

Note

An encoding can also be called a ‘character set’, but this term is less accurate: several encodings can exist for a single character set.

The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").

If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either "ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type help(bytes.decode)at the Python prompt).
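For instance (with a made-up Latin-1 byte string that is not valid UTF-8):

```python
from sklearn.feature_extraction.text import CountVectorizer

raw = [b"caf\xe9 menu"]   # Latin-1 bytes; \xe9 is not valid UTF-8

strict = CountVectorizer()  # decode_error='strict' is the default
try:
    strict.fit(raw)
except UnicodeDecodeError:
    print("strict decoding raised UnicodeDecodeError")

# 'replace' substitutes the undecodable byte with a replacement character
lenient = CountVectorizer(decode_error='replace')
print(sorted(lenient.fit(raw).vocabulary_))   # -> ['caf', 'menu']
```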

If you are having trouble decoding text, here are some things to try:

  • Find out what the actual encoding of the text is. The file might come with a header or README that tells you the encoding, or there might be some standard encoding you can assume based on where the text comes from.
  • You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python chardet module comes with a script called chardetect.py that will guess the specific encoding, though you cannot rely on its guess being correct.
  • You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.decode(errors='replace') to replace all decoding errors with a meaningless character, or set decode_error='replace' in the vectorizer. This may damage the usefulness of your features.
  • Real text may come from a variety of sources that may have used different encodings, or even be sloppily decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try decoding the unknown text as latin-1 and then using ftfy to fix errors.
  • If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20 Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.

For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.

>>> import chardet
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
...            for x in (text1, text2, text3)]
>>> v = CountVectorizer().fit(decoded).vocabulary_
>>> for term in v: print(term)

(Depending on the version of chardet, it might get the first one wrong.)

For an introduction to Unicode and character encodings in general, see Joel Spolsky’s Absolute Minimum Every Software Developer Must Know About Unicode.

4.2.3.6. Applications and examples

The bag of words representation is quite simplistic but surprisingly useful in practice.

In particular in a supervised setting it can be successfully combined with fast and scalable linear models to train document classifiers, for instance:

In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such as K-means:

Finally it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using Non-negative matrix factorization (NMF or NNMF):

4.2.3.7. Limitations of the Bag of Words representation

A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence. Additionally, the bag of words model doesn’t account for potential misspellings or word derivations.

N-grams to the rescue! Instead of building a simple collection of unigrams (n=1), one might prefer a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.

One might alternatively consider a collection of character n-grams, a representation resilient against misspellings and derivations.

For example, let’s say we’re dealing with a corpus of two documents: ['words', 'wprds']. The second document contains a misspelling of the word ‘words’. A simple bag of words representation would consider these two as very distinct documents, differing in both of the two possible features. A character 2-gram representation, however, would find the documents matching in 4 out of 8 features, which may help the preferred classifier decide better:

>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
...     [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])

In the above example, the 'char_wb' analyzer is used, which creates n-grams only from characters inside word boundaries (padded with space on each side). The 'char' analyzer, alternatively, creates n-grams that span across words:


>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x4 sparse matrix of type '<... 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
...     [' fox ', ' jump', 'jumpy', 'umpy '])
True

>>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x5 sparse matrix of type '<... 'numpy.int64'>'
    with 5 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
...     ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
True

The word boundaries-aware variant char_wb is especially interesting for languages that use white-spaces for word separation as it generates significantly less noisy features than the raw char variant in that case. For such languages it can increase both the predictive accuracy and convergence speed of classifiers trained using such features while retaining the robustness with regards to misspellings and word derivations.

While some local positioning information can be preserved by extracting n-grams instead of individual words, bag of words and bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried by that internal structure.

In order to address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs should thus be taken into account. Many such models will thus be cast as “Structured output” problems which are currently outside of the scope of scikit-learn.

4.2.3.8. Vectorizing a large text corpus with the hashing trick

The above vectorization scheme is simple but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:

  • the larger the corpus, the larger the vocabulary will grow and hence the memory use too,
  • fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset.
  • building the word-mapping requires a full pass over the dataset hence it is not possible to fit text classifiers in a strictly online manner.
  • pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size),
  • it is not easily possible to split the vectorization work into concurrent sub tasks as the vocabulary_ attribute would have to be a shared state with a fine grained synchronization barrier: the mapping from token string to feature index is dependent on ordering of the first occurrence of each token hence would have to be shared, potentially harming the concurrent workers’ performance to the point of making them slower than the sequential variant.

It is possible to overcome those limitations by combining the “hashing trick” (Feature hashing) implemented by the sklearn.feature_extraction.FeatureHasher class and the text preprocessing and tokenization features of the CountVectorizer.

This combination is implemented in HashingVectorizer, a transformer class that is mostly API compatible with CountVectorizer. HashingVectorizer is stateless, meaning that you don’t have to call fit on it:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
...
<4x10 sparse matrix of type '<... 'numpy.float64'>'
    with 16 stored elements in Compressed Sparse ... format>

You can see that 16 non-zero feature tokens were extracted in the vector output: this is less than the 19 non-zeros extracted previously by the CountVectorizer on the same toy corpus. The discrepancy comes from hash function collisions because of the low value of the n_features parameter.

In a real world setting, the n_features parameter can be left to its default value of 2 ** 20 (roughly one million possible features). If memory or downstream models size is an issue selecting a lower value such as 2 ** 18 might help without introducing too many additional collisions on typical text classification tasks.

Note that the dimensionality does not affect the CPU training time of algorithms which operate on CSR matrices (LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive) but it does for algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc).

Let’s try again with the default setting:

>>> hv = HashingVectorizer()
>>> hv.transform(corpus)
...
<4x1048576 sparse matrix of type '<... 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse ... format>

We no longer get the collisions, but this comes at the expense of a much larger dimensionality of the output space. Of course, other terms than the 19 used here might still collide with each other.

The HashingVectorizer also comes with the following limitations:

  • it is not possible to invert the model (no inverse_transform method), nor to access the original string representation of the features, because of the one-way nature of the hash function that performs the mapping.
  • it does not provide IDF weighting as that would introduce statefulness in the model. A TfidfTransformer can be appended to it in a pipeline if required.

4.2.3.9. Performing out-of-core scaling with HashingVectorizer

An interesting development of using a HashingVectorizer is the ability to perform out-of-core scaling. This means that we can learn from data that does not fit into the computer’s main memory.

A strategy to implement out-of-core scaling is to stream data to the estimator in mini-batches. Each mini-batch is vectorized using HashingVectorizer so as to guarantee that the input space of the estimator has always the same dimensionality. The amount of memory used at any time is thus bounded by the size of a mini-batch. Although there is no limit to the amount of data that can be ingested using such an approach, from a practical point of view the learning time is often limited by the CPU time one wants to spend on the task.
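The strategy above can be sketched with SGDClassifier.partial_fit; the two-batch toy stream and the label set below are invented for illustration:

```python
# Sketch of out-of-core learning: each mini-batch is vectorized by a
# stateless HashingVectorizer and fed to SGDClassifier.partial_fit.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def minibatches():
    # stand-in for streaming batches from disk or a network source
    yield ["good movie", "great film"], np.array([1, 1])
    yield ["terrible plot", "bad acting"], np.array([0, 0])

# stateless, so every batch maps into the same 2**18-dimensional space
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier(random_state=0)

classes = np.array([0, 1])  # partial_fit must see all classes up front
for texts, y in minibatches():
    X = vectorizer.transform(texts)  # no fit step: nothing to learn
    clf.partial_fit(X, y, classes=classes)

pred = clf.predict(vectorizer.transform(["great movie"]))
```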

For a full-fledged example of out-of-core scaling in a text classification task see Out-of-core classification of text documents.

4.2.3.10. Customizing the vectorizer classes

It is possible to customize the behavior by passing a callable to the vectorizer constructor:

<pre style="padding: 5px 10px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 13px; color: rgb(34, 34, 34); border-radius: 4px; display: block; margin: 0.1em 0px 0.5em; line-height: 1.2em; word-break: break-all; word-wrap: break-word; white-space: pre-wrap; background-color: rgb(248, 248, 248); border: 1px solid rgb(221, 221, 221); overflow: auto hidden;">>>> def my_tokenizer(s):
...     return s.split()
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!") == (
...     ['some...', 'punctuation!'])
True
</pre>

In particular we name:

  • preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.
  • tokenizer: a callable that takes the output from the preprocessor and splits it into tokens, then returns a list of these.
  • analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.

(Lucene users might recognize these names, but be aware that scikit-learn concepts may not map one-to-one onto Lucene concepts.)
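A minimal sketch of the preprocessor hook (the tag-stripping regex below is illustrative only, not a robust HTML parser):

```python
# Sketch: a custom preprocessor that strips HTML-like tags before the
# default tokenizer runs.
import re
from sklearn.feature_extraction.text import CountVectorizer

def strip_tags(doc):
    # preprocessor: whole string in, whole (transformed) string out;
    # a custom preprocessor replaces the default lowercasing step,
    # so we lowercase here ourselves
    return re.sub(r"<[^>]+>", " ", doc.lower())

vect = CountVectorizer(preprocessor=strip_tags)
analyzed = vect.build_analyzer()("Some <b>bold</b> text")
print(analyzed)  # ['some', 'bold', 'text']
```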

To make the preprocessor, tokenizer and analyzer aware of the model parameters it is possible to derive from the class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods instead of passing custom functions.

Some tips and tricks:

  • If documents are pre-tokenized by an external package, then store them in files (or strings) with the tokens separated by whitespace and pass analyzer=str.split.
  • Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. are not included in the scikit-learn codebase, but can be added by customizing either the tokenizer or the analyzer. Here’s a CountVectorizer with a tokenizer and lemmatizer using NLTK:

<pre style="padding: 5px 10px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 13px; color: rgb(34, 34, 34); border-radius: 4px; display: block; margin: 0.1em 0px 0.5em; line-height: 1.2em; word-break: break-all; word-wrap: break-word; white-space: pre-wrap; background-color: rgb(248, 248, 248); border: 1px solid rgb(221, 221, 221); overflow: auto hidden;">>>> from nltk import word_tokenize          
>>> from nltk.stem import WordNetLemmatizer 
>>> class LemmaTokenizer(object):
...     def __init__(self):
...         self.wnl = WordNetLemmatizer()
...     def __call__(self, doc):
...         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
...
>>> vect = CountVectorizer(tokenizer=LemmaTokenizer())  
</pre>

(Note that this will not filter out punctuation.)

The following example will, for instance, transform some British spelling to American spelling:

<pre style="padding: 5px 10px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 13px; color: rgb(34, 34, 34); border-radius: 4px; display: block; margin: 0.1em 0px 0.5em; line-height: 1.2em; word-break: break-all; word-wrap: break-word; white-space: pre-wrap; background-color: rgb(248, 248, 248); border: 1px solid rgb(221, 221, 221); overflow: auto hidden;">>>> import re
>>> def to_american(tokens):
...     for t in tokens:
...         t = re.sub(r"(...)our$", r"\1or", t)
...         t = re.sub(r"([bt])re$", r"\1er", t)
...         t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
...         t = re.sub(r"ogue$", "og", t)
...         yield t
...
>>> class CustomVectorizer(CountVectorizer):
...     def build_tokenizer(self):
...         tokenize = super(CustomVectorizer, self).build_tokenizer()
...         return lambda doc: list(to_american(tokenize(doc)))
...
>>> print(CustomVectorizer().build_analyzer()(u"color colour"))
[...'color', ...'color']
</pre>

Customizing the tokenizer in this way is also useful for other styles of preprocessing; examples include stemming, lemmatization, or normalizing numerical tokens, with the latter illustrated in:

> *   [Biclustering documents with the Spectral Co-clustering algorithm](http://scikit-learn.org/stable/auto_examples/bicluster/plot_bicluster_newsgroups.html#sphx-glr-auto-examples-bicluster-plot-bicluster-newsgroups-py)

Customizing the vectorizer can also be useful when handling Asian languages that do not use an explicit word separator such as whitespace.
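One common fallback when no proper word segmenter is available for such a language is a character n-gram analyzer; a minimal character-bigram sketch (the two toy strings are assumptions):

```python
# Sketch: a character-bigram analyzer for text without whitespace
# word separators, passed in via the analyzer hook.
from sklearn.feature_extraction.text import CountVectorizer

def char_bigrams(doc):
    doc = "".join(doc.split())  # drop whatever whitespace there is
    return [doc[i:i + 2] for i in range(len(doc) - 1)]

vect = CountVectorizer(analyzer=char_bigrams)
X = vect.fit_transform(["机器学习", "学习机器"])
print(len(vect.vocabulary_))  # 4 distinct character bigrams
```

For production use, a dedicated segmenter for the target language is usually preferable to bigrams, but the hook shown here is the same either way.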

References:

  1. http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
  2. http://sklearn.apachecn.org/cn/0.19.0/modules/feature_extraction.html#text-feature-extraction