A Survey of Text Classification

(Work in progress; continuously updated)

Introduction

1. Definition

What is text classification? Simply put, it is the task of assigning a piece of text to one or more predefined categories. It can be regarded as part of the broader field of document classification.
Input:
a document d
a fixed set of classes C = {c1, c2, ... , cn}
Output:
a predicted class ci from C

2. Some simple applications

  1. spam detection
  2. authorship attribution
  3. age/gender identification
  4. sentiment analysis
  5. assigning subject categories, topics or genres
    ......

Traditional methods

1. Naive Bayes

two assumptions:

  1. Bag of words assumption:
    position doesn't matter
  2. Conditional independence:

to compute these probabilities:

add-one (Laplace) smoothing to avoid zero probabilities (a constant other than one can be used as well):

to deal with unknown/unseen words:
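The pieces above (class priors, per-class word counts, add-one smoothing, and ignoring words never seen in training) can be sketched in plain Python; the toy spam/ham documents below are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model.
    docs: list of (list_of_words, class_label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # per-class word frequencies
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    # log priors: log P(c) = log(docs in class c / total docs)
    priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def predict_nb(priors, word_counts, vocab, words):
    """Return the class maximizing log P(c) + sum of log P(w|c),
    with add-one smoothing; unknown words are simply skipped."""
    best, best_score = None, float("-inf")
    V = len(vocab)
    for c in priors:
        total = sum(word_counts[c].values())
        score = priors[c]
        for w in words:
            if w not in vocab:           # drop words never seen in training
                continue
            score += math.log((word_counts[c][w] + 1) / (total + V))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [("buy cheap pills now".split(), "spam"),
        ("cheap pills online".split(), "spam"),
        ("meeting agenda for monday".split(), "ham"),
        ("monday project meeting".split(), "ham")]
model = train_nb(docs)
print(predict_nb(*model, "cheap pills".split()))   # -> spam
```

Working in log space avoids underflow when multiplying many small word probabilities.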

main features:

  1. very fast, low storage requirements
  2. robust to irrelevant features
  3. good in domains with many equally important features
  4. optimal if the independence assumptions hold
  5. lacks accuracy in general

2. SVM

cost function of SVM:

SVM decision boundary:
when C is very large:
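A minimal sketch of this cost in the form used in Andrew Ng's course (hinge losses cost1/cost0 weighted by C, plus l2 regularization of the weights); the convention of pairing theta[0] with a bias feature and leaving it unregularized is an assumption here:

```python
def svm_cost(theta, X, y, C):
    """SVM cost in the Coursera (Ng) form:
    J = C * sum_i [ y_i*cost1(z_i) + (1-y_i)*cost0(z_i) ] + 0.5*sum_j theta_j^2
    where z_i = theta . x_i, cost1(z) = max(0, 1-z), cost0(z) = max(0, 1+z),
    labels y_i are 0/1, and theta[0] pairs with a bias feature x_i[0] = 1."""
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))
    hinge = 0.0
    for x_i, y_i in zip(X, y):
        z = dot(theta, x_i)
        hinge += y_i * max(0.0, 1 - z) + (1 - y_i) * max(0.0, 1 + z)
    reg = 0.5 * sum(t * t for t in theta[1:])   # bias term not regularized
    return C * hinge + reg

# Both points lie outside the margin, so only the regularizer contributes:
print(svm_cost([0.0, 1.0], [[1.0, 2.0], [1.0, -2.0]], [1, 0], 1.0))  # -> 0.5
```

When C is very large, minimizing J forces every hinge term toward zero, i.e. every training example must sit outside the margin; this is what produces the large-margin decision boundary.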

about kernel:

Up to now, it may seem that SVMs are only applicable to two-class classification; multi-class problems are usually handled by combining binary classifiers (one-vs-rest or one-vs-one).

Comparing with Logistic regression:

To apply SVM or logistic regression to text classification, all you need is labeled data and a suitable way to represent each text as a vector (e.g., one-hot / bag-of-words representations, word2vec, doc2vec, ...).
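The simplest such representation, a bag-of-words count vector, needs nothing but a vocabulary map; the toy corpus below is invented:

```python
def build_vocab(texts):
    """Map each word seen in the corpus to a fixed index."""
    vocab = {}
    for text in texts:
        for w in text.split():
            if w not in vocab:
                vocab[w] = len(vocab)
    return vocab

def bow_vector(vocab, text):
    """Bag-of-words count vector; out-of-vocabulary words are dropped."""
    vec = [0] * len(vocab)
    for w in text.split():
        if w in vocab:
            vec[vocab[w]] += 1
    return vec

texts = ["good movie", "bad movie", "good good plot"]
vocab = build_vocab(texts)   # {'good': 0, 'movie': 1, 'bad': 2, 'plot': 3}
print(bow_vector(vocab, "good movie good"))   # -> [2, 1, 0, 0]
```

These vectors (or TF-IDF-weighted variants) are what you would feed to the SVM or logistic-regression classifier.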

Neural network methods

1. CNN

(1) the paper Convolutional Neural Networks for Sentence Classification which appeared in EMNLP 2014
(2) the paper A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification

The model uses multiple filters to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.
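A stripped-down sketch of that convolution-plus-max-over-time-pooling step; the embeddings and filter weights below are toy values, whereas the real model learns them by backpropagation:

```python
def conv_feature(embeds, filt, bias):
    """Slide one filter of width h over the token embeddings and take the
    max over time (as in Kim, 2014), with a ReLU nonlinearity."""
    h = len(filt)                        # filter width, in tokens
    feats = []
    for i in range(len(embeds) - h + 1):
        window = embeds[i:i + h]
        s = bias + sum(w_j * x_j
                       for w_row, x_row in zip(filt, window)
                       for w_j, x_j in zip(w_row, x_row))
        feats.append(max(0.0, s))        # ReLU
    return max(feats)                    # max-over-time pooling

def sentence_features(embeds, filters):
    """One pooled feature per filter; their concatenation forms the
    penultimate layer that is fed to the softmax classifier."""
    return [conv_feature(embeds, f, b) for f, b in filters]

# A 4-token sentence with 2-dimensional embeddings and two filters
# of widths 2 and 3:
embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
f1 = ([[1.0, 0.0], [0.0, 1.0]], 0.0)
f2 = ([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], -1.0)
print(sentence_features(embeds, [f1, f2]))   # -> [2.0, 3.0]
```

Using several filter widths lets the model pick up n-gram-like features of different sizes from the same sentence.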

For regularization we employ dropout on the penultimate layer with a constraint on the l2-norms of the weight vectors. Dropout prevents co-adaptation of hidden units by randomly dropping out (i.e., setting to zero) a proportion of the hidden units during training.
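Dropout itself is a small operation; here is a sketch of the common "inverted dropout" variant (rescaling the surviving units at training time is one standard convention, not necessarily the paper's exact implementation):

```python
import random

def dropout(values, p, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p during training
    and rescale survivors by 1/(1-p); the identity at test time."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

Because survivors are rescaled during training, the expected activation is unchanged and no correction is needed at test time.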

Pre-trained Word Vectors
We use the publicly available word2vec vectors that were trained on 100 billion words from Google News.

Results

There is a simplified implementation using TensorFlow on GitHub: https://github.com/dennybritz/cnn-text-classification-tf

2. RNN

the paper Hierarchical Attention Networks for Document Classification which appeared in NAACL 2016

In this paper we test the hypothesis that better representations can be obtained by incorporating knowledge of document structure into the model architecture.

  1. It is observed that different words and sentences in a document are differentially informative.
  2. Moreover, the importance of words and sentences is highly context dependent,
    i.e. the same word or sentence may be differentially important in different contexts.

Attention serves two benefits: not only does it often result in better performance, but it also provides insight into which words and sentences contribute to the classification decision, which can be of value in applications and analysis.

Hierarchical Attention Network

If you want to learn more about attention mechanisms: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

In the model they use a GRU-based sequence encoder.
1. Word Encoder:

2. Word Attention:
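The attention step scores each word's hidden state against a learned context vector, softmaxes the scores into weights, and returns the weighted sum as the sentence vector. A sketch (the paper's tanh projection u_it = tanh(W h_it + b) is folded out here for brevity, so this is a simplification of the full equations):

```python
import math

def attention(hidden_states, context):
    """HAN-style attention: score each hidden state against a learned
    context vector, softmax the scores, return (weights, weighted sum)."""
    scores = [sum(h_j * c_j for h_j, c_j in zip(h, context))
              for h in hidden_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    Z = sum(exps)
    alphas = [e / Z for e in exps]
    dim = len(hidden_states[0])
    summary = [sum(a * h[j] for a, h in zip(alphas, hidden_states))
               for j in range(dim)]
    return alphas, summary
```

The same routine is reused at the sentence level (step 4), with sentence vectors in place of word hidden states and a separate sentence-level context vector.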

3. Sentence Encoder:

4. Sentence Attention:

5. Document Classification:
Because the document vector v is a high-level representation of document d, it can be used directly as features for document classification.

j is the label of document d
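A sketch of that final step: a softmax over a linear transform of the document vector v gives the label probabilities, and training minimizes the negative log-likelihood of the true label j (the weights below are toy values; the real W and b are learned):

```python
import math

def classify_document(v, W, b):
    """p = softmax(W v + b): probability of each label given doc vector v."""
    logits = [sum(w_j * v_j for w_j, v_j in zip(row, v)) + b_i
              for row, b_i in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # numerically stable softmax
    Z = sum(exps)
    return [e / Z for e in exps]

def nll(p, j):
    """Training loss: negative log-likelihood of the true label j."""
    return -math.log(p[j])
```

With W = [[2, 0], [0, 2]], b = [0, 0] and v = [1, 0], label 0 gets probability e^2 / (e^2 + 1), about 0.88, and correspondingly the smaller loss.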

Results

There is a simplified implementation written in Python on GitHub: https://github.com/richliao/textClassifier

References

https://www.cs.cmu.edu/%7Ediyiy/docs/naacl16.pdf
https://www.coursera.org/learn/machine-learning/home/
https://www.youtube.com/playlist?list=PL6397E4B26D00A269
