BMI598: Natural Language Processing

Author: Zongwei Zhou | 周縱葦
Weibo: @MrGiovanni
Email: zongweiz@asu.edu

1. Token Features


1.1 token features

  • case folding
  • punctuation
  • prefix/stem patterns
  • word shape
  • character n-grams
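
A minimal sketch of these token features in plain Python (the feature names and dictionary layout are illustrative, not from the lecture):

import re

def token_features(token):
    # Word shape: capitals -> X, lowercase -> x, digits -> d
    shape = re.sub(r'[A-Z]', 'X', token)
    shape = re.sub(r'[a-z]', 'x', shape)
    shape = re.sub(r'[0-9]', 'd', shape)
    return {
        'lower': token.lower(),                           # case folding
        'is_punct': not any(c.isalnum() for c in token),  # punctuation
        'prefix3': token[:3],                             # prefix/stem pattern
        'shape': shape,                                   # word shape
        'char_3grams': [token[i:i+3] for i in range(len(token) - 2)],
    }

token_features('Aspirin81mg')['shape']   # 'Xxxxxxxddxx'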

1.2 context features

  • token features from the n tokens before and the n tokens after
  • word n-grams, n=2,3,4
  • skip-n-grams
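
A minimal sketch of the n-gram and skip-gram extraction (the function names are ours):

def ngrams(tokens, n):
    # All contiguous word n-grams
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    # Token pairs with up to k intervening tokens skipped
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + k + 2, len(tokens)))]

tokens = 'the patient denies chest pain'.split()
ngrams(tokens, 2)        # [('the', 'patient'), ('patient', 'denies'), ...]
skip_bigrams(tokens, 1)  # adjacent pairs plus pairs that skip one token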

1.3 sentence features

  • sentence length
  • case-folding patterns
  • presence of digits
  • enumeration tokens at the start
  • a colon at the end
  • whether verbs indicate past or future tense

1.4 section features

  • headings
  • subsections

1.5 document features

  • case pattern across the document
  • document length indicator

1.6 normalization

The difference between stemming and lemmatization:
Stemming: rule-based

from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem('wolves')
# the 'es' suffix is stripped by rule, leaving a non-word stem
u'wolv'

Lemmatization: dictionary-based

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('wolves')
# the dictionary lookup returns the correct lemma
u'wolf'
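
Note that WordNetLemmatizer treats its input as a noun by default; passing a part-of-speech tag changes the lookup:

lemmatizer.lemmatize('are')           # u'are' (treated as a noun by default)
lemmatizer.lemmatize('are', pos='v')  # u'be'  (lemmatized as a verb)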

2. Word Embedding


2.1 tf-idf

The feature vector is as long as the entire vocabulary.
Key idea: counting (term frequency weighted by inverse document frequency).
See the worked example at https://en.wikipedia.org/wiki/Tf%E2%80%93idf
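
As a minimal sketch (not from the lecture), scikit-learn's TfidfVectorizer builds exactly this vocabulary-length representation; get_feature_names_out assumes a recent scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the drug relieved the pain',
        'the patient reported severe pain']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # one row per document
vectorizer.get_feature_names_out()   # feature length = vocabulary size
X.toarray()                          # tf-idf weight per (document, term)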

2.2 word2vec

The feature vector length is fixed and typically small (a few hundred dimensions).

Start with V random 300-dimensional vectors as initial embeddings.
Use logistic regression, the second most basic classifier in machine learning after naïve Bayes.

  • Take a corpus and take pairs of words that co-occur as positive examples
  • Take pairs of words that don't co-occur as negative examples
  • Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance
  • Throw away the classifier code and keep the embeddings.

Pre-trained models are available for download:
https://code.google.com/archive/p/word2vec/
http://nlp.stanford.edu/projects/glove/
You can use gensim (in Python) to access the models.
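
For example, a minimal sketch of loading the pre-trained Google News vectors with gensim (the .bin file name is the one distributed at the word2vec page above):

from gensim.models import KeyedVectors

# 3 million words/phrases, 300 dimensions, trained on Google News
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
wv.most_similar('aspirin', topn=3)   # nearest neighbours by cosine similarity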

Brilliant insight: Use running text as implicitly supervised training data!

Setup
Let's represent words as vectors of some length (say 300), randomly initialized.
So we start with 300 * V random parameters, where V is the number of words in the vocabulary.
Over the entire training set, we’d like to adjust those word vectors such that we

  • Maximize the similarity of the (target word, context word) pairs (t, c) drawn from the positive data
  • Minimize the similarity of the (t, c) pairs drawn from the negative data.

Learning the classifier
This is an iterative process. We start with zero or random weights, then adjust the word weights to

  • make the positive pairs more likely
  • and the negative pairs less likely, over the entire training set.
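
A minimal numpy sketch of one such update under these assumptions (the learning rate, vocabulary size, and word indices are illustrative; real implementations also sample negatives by word frequency):

import numpy as np

rng = np.random.default_rng(0)
V, dim, lr = 10000, 300, 0.025
T = rng.normal(scale=0.1, size=(V, dim))   # target-word embeddings
C = rng.normal(scale=0.1, size=(V, dim))   # context-word embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_update(t, c_pos, c_negs):
    # One stochastic step: push (t, c_pos) together, push each (t, c_neg) apart
    for c, label in [(c_pos, 1.0)] + [(c, 0.0) for c in c_negs]:
        p = sigmoid(T[t] @ C[c])    # classifier's P(positive | t, c)
        g = lr * (label - p)        # gradient of the log-likelihood
        T[t], C[c] = T[t] + g * C[c], C[c] + g * T[t]

sgns_update(t=42, c_pos=7, c_negs=[901, 4412])   # indices are illustrative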

3. Sentence Vectors


Distributed Representations of Sentences and Documents (Le and Mikolov, 2014)

PV-DM (Distributed Memory)

  • The paragraph acts as a pseudo-word
  • The algorithm learns a matrix D of paragraph vectors, one per paragraph, in addition to the W word vectors
  • Contexts are fixed-length, sampled from a sliding window over the paragraph
  • Paragraph vectors and word vectors are trained jointly using stochastic gradient descent

What about unseen paragraphs?

  • Add more columns to D (the paragraph-vector matrix)
  • Learn the new columns of D while holding U, b (the softmax parameters), and W fixed
  • Use the resulting vectors in D as features in a standard classifier

PV-DBOW (Distributed Bag of Words)

  • Works by sampling a sliding window from the paragraph
  • then predicting a word randomly sampled from that window
  • prediction is a classification task: predict the random word given the paragraph vector

When predicting the sentiment of a sentence, use its paragraph vector rather than individual word embeddings.
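
Both variants are implemented in gensim's Doc2Vec; a minimal sketch on a toy corpus (dm=1 gives PV-DM, dm=0 gives PV-DBOW; infer_vector performs the unseen-paragraph step above, learning a new paragraph vector with the word vectors held fixed):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(['the drug relieved the pain',
                                    'the patient reported severe pain'])]
model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40, dm=1)
model.dv[0]                                              # trained paragraph vector
model.infer_vector('pain relieved by the drug'.split())  # vector for an unseen paragraph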

4. Neural Network


\sigma(z)=\frac{1}{1+e^{-z}}
softmax(z_i)=\frac{e^{z_i}}{\sum_{j=1}^{d}e^{z_j}}, \quad 1\leq i\leq d

import numpy as np
z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
# exponentiate and normalize; in practice subtract max(z) first for numerical stability
softmax = lambda z: np.exp(z) / np.sum(np.exp(z))
softmax(z)
array([0.02364054, 0.06426166, 0.1746813 , 0.474833  , 0.02364054, 0.06426166, 0.1746813 ])
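
Putting the two functions together, a minimal sketch of one forward pass through a feed-forward network (the weights are random, chosen only to show the shapes):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)     # hidden layer parameters
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)     # output layer parameters

h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))          # sigmoid hidden activations
z = W2 @ h + b2                                   # output logits
y = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))  # numerically stable softmax
y.sum()                                           # 1.0: a probability distribution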

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

5. Highlight Summary


  • i2b2 challenge – concepts, relations
  • Vector semantics – long vectors
  • Vector semantics – Word embeddings
  • Vector semantics – how to compute word embeddings
  • Vector semantics – Paragraph vectors
  • UMLS and MetaMap Lite (maximum-match algorithm)
  • Neuron and math behind it
  • Feed-forward neural network model – math behind it
  • Example FFN for predicting the next word
  • Keras – Intro and validation
  • Keras examples – simple solutions to concept extraction and relations
  • Data preparation for concept extraction and relation classification
  • IBM MADE 1.0 paper: concepts/relations using BiLSTM-CRF/Attention
  • Recurrent neural networks and LSTM