sklearn.feature_extraction.text.TfidfVectorizer
class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
Read more in the User Guide.
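A minimal usage sketch (the corpus is a toy example) showing both the one-step vectorizer and the equivalent two-step CountVectorizer + TfidfTransformer pipeline:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog jumps",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix, shape (3, n_features)
print(X.shape)

# The equivalent two-step pipeline: raw counts, then tf-idf reweighting
counts = CountVectorizer().fit_transform(corpus)
X2 = TfidfTransformer().fit_transform(counts)
print(np.allclose(X.toarray(), X2.toarray()))  # True
```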
Parameters:
- input : string {‘filename’, ‘file’, ‘content’}
If ‘filename’, the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.
If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch the bytes in memory.
Otherwise the input is expected to be a sequence of items that can be of type string or bytes, which are analyzed directly.
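A short sketch of the non-default input modes; the file-like objects are in-memory stand-ins, and the filename paths are placeholders rather than real files:

```python
import io
from sklearn.feature_extraction.text import TfidfVectorizer

# input='file': each item must expose a read() method
docs = [io.StringIO("first document"), io.StringIO("second document")]
X = TfidfVectorizer(input='file').fit_transform(docs)
print(X.shape)

# input='filename': each item is a path whose contents are read from disk
# (placeholder paths; uncomment with real files)
# X = TfidfVectorizer(input='filename').fit_transform(['doc1.txt', 'doc2.txt'])
```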
- encoding : string, ‘utf-8’ by default.
If bytes or files are given to analyze, this encoding is used to decode.
- decode_error : {‘strict’, ‘ignore’, ‘replace’}
Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
- strip_accents : {‘a(chǎn)scii’, ‘unicode’, None}
Remove accents during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.
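For example, with unicode stripping, accented characters fold to their unaccented form (toy input):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer(strip_accents='unicode')
v.fit(["café naïve"])
print(sorted(v.vocabulary_))  # ['cafe', 'naive']
```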
- analyzer : string, {‘word’, ‘char’} or callable
Whether the feature should be made of word or character n-grams.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
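A quick illustration of character n-grams in place of the default word analyzer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer(analyzer='char', ngram_range=(2, 2))
v.fit(["abc"])
print(sorted(v.vocabulary_))  # ['ab', 'bc']
```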
- preprocessor : callable or None (default)
Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
- tokenizer : callable or None (default)
Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.
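A sketch of a custom tokenizer: plain whitespace splitting keeps the apostrophe that the default token pattern treats as a separator. (Recent scikit-learn versions may warn that token_pattern is unused when a tokenizer is supplied.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer(tokenizer=str.split)  # split on whitespace only
v.fit(["don't stop"])
print(sorted(v.vocabulary_))              # ["don't", 'stop']
```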
- ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
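For example, ngram_range=(1, 2) extracts unigrams and bigrams (toy corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer(ngram_range=(1, 2))
v.fit(["good movie", "bad movie"])
print(sorted(v.vocabulary_))
# ['bad', 'bad movie', 'good', 'good movie', 'movie']
```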
- stop_words : string {‘english’}, list, or None (default)
If a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
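A small comparison of the built-in list and a custom list (toy sentence):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat"]
v1 = TfidfVectorizer(stop_words='english')      # built-in English list
v2 = TfidfVectorizer(stop_words=['the', 'on'])  # custom list
print(sorted(v1.fit(corpus).vocabulary_))
print(sorted(v2.fit(corpus).vocabulary_))       # ['cat', 'mat', 'sat']
```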
- lowercase : boolean, default True
Convert all characters to lowercase before tokenizing.
- token_pattern : string
Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
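For instance, dropping one \w from the default pattern keeps single-character tokens:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
v.fit(["a b ab"])
print(sorted(v.vocabulary_))  # ['a', 'ab', 'b']
```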
- max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, an absolute count. This parameter is ignored if vocabulary is not None.
- min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, an absolute count. This parameter is ignored if vocabulary is not None.
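A toy corpus illustrating both cut-offs at once:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "apple banana",
    "apple cherry",
    "apple banana cherry",
    "apple date",
]
# Ignore terms in more than 75% of documents or in fewer than 2 documents
v = TfidfVectorizer(max_df=0.75, min_df=2)
v.fit(corpus)
print(sorted(v.vocabulary_))  # ['banana', 'cherry']: 'apple' too common, 'date' too rare
```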
- max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
- vocabulary : Mapping or iterable, optional
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.
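A sketch with a fixed vocabulary (toy input):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A fixed vocabulary pins the column layout; out-of-vocabulary terms are ignored
v = TfidfVectorizer(vocabulary=['apple', 'banana'])
X = v.fit_transform(["apple banana cherry"])
print(X.shape)  # (1, 2): one column per vocabulary entry; 'cherry' is dropped
```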
- binary : boolean, default=False
If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.) This is useful for discrete probabilistic models that model binary events rather than integer counts.
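The parenthetical recipe above, spelled out on a toy input:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Binary tf combined with no idf and no normalization gives pure 0/1 outputs
v = TfidfVectorizer(binary=True, use_idf=False, norm=None)
X = v.fit_transform(["apple apple banana"])
print(X.toarray())  # both entries are 1, although 'apple' occurs twice
```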
- dtype : type, optional
Type of the matrix returned by fit_transform() or transform().
- norm : ‘l1’, ‘l2’ or None, optional
Norm used to normalize term vectors. None for no normalization.
- use_idf : boolean, default=True
Enable inverse-document-frequency reweighting.
- smooth_idf : boolean, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
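With smoothing enabled, the learned weights follow idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents. The check below reproduces them on a toy corpus with hand-counted document frequencies:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["apple banana", "apple"]
v = TfidfVectorizer(smooth_idf=True)
v.fit(corpus)

n = len(corpus)
df = np.array([2, 1])  # hand-counted: df('apple') = 2, df('banana') = 1
manual = np.log((1 + n) / (1 + df)) + 1
print(np.allclose(v.idf_, manual))  # True
```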
- sublinear_tf : boolean, default=False
Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
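A toy check of the damping effect, with idf and normalization disabled to isolate the tf term:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer(sublinear_tf=True, use_idf=False, norm=None)
X = v.fit_transform(["word " * 100])   # one document, one term, tf = 100
print(X.toarray()[0, 0])               # 5.605... rather than 100
print(1 + np.log(100))                 # matches: 1 + ln(100)
```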
Attributes:
- vocabulary_ : dict
A mapping of terms to feature indices.
- idf_ : array, shape = [n_features], or None
The learned idf vector (global term weights) when use_idf is set to True, None otherwise.
- stop_words_ : set
Terms that were ignored because they either:
  - occurred in too many documents (max_df)
  - occurred in too few documents (min_df)
  - were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
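A short sketch exercising all three attributes on a toy corpus; max_df=2 forces 'apple' into stop_words_:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer(max_df=2)
v.fit(["apple banana", "apple cherry", "apple date"])
print(v.vocabulary_)   # {'banana': 0, 'cherry': 1, 'date': 2}
print(v.idf_)          # one learned idf weight per feature
print(v.stop_words_)   # {'apple'}: its document frequency (3) exceeds max_df
```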