[scikit-learn translation] TfidfVectorizer

sklearn.feature_extraction.text.TfidfVectorizer

class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Convert a collection of raw documents to a matrix of TF-IDF features.
Equivalent to CountVectorizer followed by TfidfTransformer.
Read more in the User Guide.
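As a quick illustration of the equivalence stated above, the following sketch (using a toy two-document corpus invented for this example) checks that TfidfVectorizer produces the same matrix as CountVectorizer followed by TfidfTransformer:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Toy two-document corpus (invented for illustration)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# One step: raw documents -> TF-IDF matrix
tfidf = TfidfVectorizer().fit_transform(corpus)

# Two steps: raw documents -> term counts -> TF-IDF matrix
counts = CountVectorizer().fit_transform(corpus)
tfidf2 = TfidfTransformer().fit_transform(counts)

# Both routes produce the same matrix
assert np.allclose(tfidf.toarray(), tfidf2.toarray())
```

The one-step form is usually preferred; the two-step form is useful when the raw term counts are needed as an intermediate result.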

Parameters:

  • input : string {'filename', 'file', 'content'}

If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.

If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory.

Otherwise the input is expected to be a sequence of items that can be of type string or bytes, which are analyzed directly.
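A small sketch of two of the input modes (the strings and StringIO stand-ins below are made up for illustration; input='filename' is omitted to avoid touching the filesystem):

```python
import io
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first toy document", "second toy document"]

# input='content' (the default): pass the raw strings directly
X1 = TfidfVectorizer(input="content").fit_transform(docs)

# input='file': pass objects exposing a read() method
files = [io.StringIO(d) for d in docs]
X2 = TfidfVectorizer(input="file").fit_transform(files)

# Same documents either way, so the matrices are identical
assert (X1 != X2).nnz == 0
```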

  • encoding : string, 'utf-8' by default.

If bytes or files are given to analyze, this encoding is used to decode.

  • decode_error : {'strict', 'ignore', 'replace'}

Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.

  • strip_accents : {'ascii', 'unicode', None}

Remove accents during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing.

  • analyzer : string, {'word', 'char'} or callable

Whether the feature should be made of word or character n-grams.

If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

  • preprocessor : callable or None (default)

Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

  • tokenizer : callable or None (default)

Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.

  • ngram_range : tuple (min_n, max_n)

The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
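For instance, with ngram_range=(1, 2) both unigrams and bigrams are extracted (toy corpus invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 2): extract unigrams and bigrams
vec = TfidfVectorizer(ngram_range=(1, 2))
vec.fit(["good movie", "bad movie"])

# The vocabulary contains both single words and two-word phrases
assert sorted(vec.vocabulary_) == [
    "bad", "bad movie", "good", "good movie", "movie"]
```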

  • stop_words : string {'english'}, list, or None (default)

If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value.

If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
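A toy illustration of both the built-in list and a custom list (the sentence below is invented for the example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc = ["the cat is on the mat"]

# stop_words='english': the built-in list drops "the", "is", "on"
vec = TfidfVectorizer(stop_words="english")
vec.fit(doc)
assert sorted(vec.vocabulary_) == ["cat", "mat"]

# A custom list removes exactly the words you supply
vec2 = TfidfVectorizer(stop_words=["cat"])
vec2.fit(doc)
assert "cat" not in vec2.vocabulary_ and "mat" in vec2.vocabulary_
```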

  • lowercase : boolean, default True

Convert all characters to lowercase before tokenizing.

  • token_pattern : string

Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
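The effect of the default pattern versus a custom one that also keeps single-character tokens (toy text invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = ["a b ab abc"]

# Default pattern: tokens of 2+ word characters, so "a" and "b" are dropped
default = TfidfVectorizer()
default.fit(text)
assert sorted(default.vocabulary_) == ["ab", "abc"]

# Custom pattern matching any run of word characters keeps them
single = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
single.fit(text)
assert sorted(single.vocabulary_) == ["a", "ab", "abc", "b"]
```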

  • max_df : float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

  • min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
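Both thresholds on a toy three-document corpus (documents and cut-offs invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "common shared rare",
    "common shared",
    "common other",
]

# max_df=0.9 (a proportion): drop terms in more than 90% of documents
vec_hi = TfidfVectorizer(max_df=0.9)
vec_hi.fit(corpus)
assert "common" not in vec_hi.vocabulary_  # df = 3/3 > 0.9

# min_df=2 (an absolute count): keep terms in at least 2 documents
vec_lo = TfidfVectorizer(min_df=2)
vec_lo.fit(corpus)
assert sorted(vec_lo.vocabulary_) == ["common", "shared"]
```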

  • max_features : int or None, default=None

If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

This parameter is ignored if vocabulary is not None.

  • vocabulary : Mapping or iterable, optional

Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.

  • binary : boolean, default=False

If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs.)
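A toy check of the note in parentheses above: with idf and normalization disabled, binary=True yields literal 0/1 outputs (documents invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["word word word", "word other"]

# binary=True alone only binarizes the tf term; disabling idf and
# normalization as well makes the output matrix strictly 0/1
vec = TfidfVectorizer(binary=True, use_idf=False, norm=None)
X = vec.fit_transform(docs)
assert set(np.unique(X.toarray())) == {0.0, 1.0}
```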

  • dtype : type, optional

Type of the matrix returned by fit_transform() or transform().

  • norm : 'l1', 'l2' or None, optional

Norm used to normalize term vectors. None for no normalization.

  • use_idf : boolean, default=True

Enable inverse-document-frequency reweighting.

  • smooth_idf : boolean, default=True

Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
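With smoothing, scikit-learn's idf is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the document frequency of term t. A numeric check on a toy corpus (invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# "apple" appears in both documents, the other terms in one each
corpus = ["apple banana", "apple cherry"]
vec = TfidfVectorizer(smooth_idf=True)
vec.fit(corpus)

n = len(corpus)
df = {"apple": 2, "banana": 1, "cherry": 1}
for term, col in vec.vocabulary_.items():
    expected = np.log((1 + n) / (1 + df[term])) + 1
    assert np.isclose(vec.idf_[col], expected)
```

Note that a term present in every document still gets idf = 1 rather than 0, and the +1 smoothing prevents a zero division for unseen terms.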

  • sublinear_tf : boolean, default=False

Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
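A toy check of the sublinear scaling, with idf and normalization disabled so only the tf term is visible (document invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Disable idf and normalization so the raw tf scaling shows through
vec = TfidfVectorizer(sublinear_tf=True, use_idf=False, norm=None)
X = vec.fit_transform(["word word word"])

# A raw count of 3 becomes 1 + ln(3) instead of 3
assert np.isclose(X.toarray()[0, 0], 1 + np.log(3))
```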

Attributes:

  • vocabulary_ : dict

A mapping of terms to feature indices.

  • idf_ : array, shape = [n_features], or None

The learned idf vector (global term weights) when use_idf is set to True, None otherwise.

  • stop_words_ : set

Terms that were ignored because they either:

  • occurred in too many documents (max_df)
  • occurred in too few documents (min_df)
  • were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
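A quick look at these fitted attributes on a toy corpus (documents and the max_features cut-off invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(max_features=2)
vec.fit(["apple apple banana banana", "apple cherry"])

# vocabulary_: term -> column index in the feature matrix
assert set(vec.vocabulary_) == {"apple", "banana"}

# idf_: one learned idf weight per retained feature
assert vec.idf_.shape == (2,)

# stop_words_: terms dropped by the df / max_features filters
assert vec.stop_words_ == {"cherry"}
```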

References

  1. http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
  2. https://blog.csdn.net/laobai1015/article/details/80451371