NLP: From Getting Started to Getting Buried
What is NLP
Reinventing the wheel
In the wheel-reinventing movement of the big front-end era, every company built its own wheels, with a great deal of duplicated coding.
The ML field is better, though not by much. There are plenty of ready-made data and method libraries for us to use; nobody hand-writes deep networks in raw TensorFlow anymore, and the arrival of AutoKeras has made network tuning almost foolproof.
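As a taste of how foolproof this has become, here is a minimal AutoKeras sketch, assuming `pip install autokeras` (which pulls in TensorFlow); the toy dataset, trial count, and epoch count are placeholders, not a recipe:

```python
# Toy AutoKeras sketch: the architecture search and tuning are automated.
import numpy as np
import autokeras as ak

x = np.array(["great movie", "terrible film", "loved it", "awful acting"])
y = np.array([1, 0, 1, 0])

clf = ak.TextClassifier(max_trials=1)   # tries candidate architectures itself
clf.fit(x, y, epochs=1)
print(clf.predict(np.array(["what a fantastic show"])))
```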
The knowledge you need:
- A basic grasp of Python
- A preliminary understanding of how neural networks work, e.g., gradient descent
Recommended courses:
CS224n: Natural Language Processing with Deep Learning. Stanford's famous NLP course; because of COVID-19 it moved online, which even changed the professor's never-changing slide style. A very detailed and systematic course, from ML basics to the mathematical formulas, with very detailed notes. However, the formulas and algorithms are quite "mathematical" and can be obscure for students without the background.
Deep Learning for Human Language Processing (2020, Spring). Taught by HUNG-YI LEE (Li Hongyi), National Taiwan University's famous stand-up-comedian of a lecturer; thanks to his humorous, easy-to-follow delivery, the course racks up high view counts on YouTube.
Machine Learning (2021, Spring). Also taught by Hung-Yi Lee, with its own ML introduction. The two courses overlap in places, and the overlapping blocks can be skipped as appropriate.
Representing words
Representing images
For images: a grayscale image is simply a matrix.
An RGB image is a three-channel matrix.
Various image-processing operations just stack "buffs" on this matrix: convolutions, filters, Gaussian and Fourier transforms, and so on. So how can the vocabulary of human language be made intelligible to a machine?
How do we have usable meaning in a computer?
Take a simple example: deciding a word's part of speech, i.e., whether it is a verb or a noun.
In machine-learning terms, we have a set of samples (x, y), where x is a word and y is its part of speech. We want to build a mapping f(x) -> y, but the mathematical model f (say, a neural network or an SVM) only accepts numerical input.
Words in NLP, however, are abstract human summaries in symbolic form (Chinese, English, Latin, and so on), so they must be converted into numerical form, or in other words, embedded into a mathematical space. This kind of embedding is called word embedding, and Word2vec is one kind of word embedding.
- WordNet: a thesaurus containing lists of synonym sets and hypernyms ("is a" relationships). WordNet was developed in part to support automatic text analysis and artificial-intelligence applications.
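A quick way to poke at WordNet is through NLTK; a minimal sketch (run `nltk.download("wordnet")` once beforehand):

```python
# Look up synonym sets and hypernyms ("is a" relations) in WordNet via NLTK.
from nltk.corpus import wordnet as wn

print(wn.synsets("good")[0].definition())   # gloss of the first sense of "good"
panda = wn.synsets("panda")[0]
print(panda.hypernyms())                    # e.g. [Synset('procyonid.n.01')]
```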
Representing words as discrete symbols: one-hot vectors
```
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
```
Vector dimension = number of words in the vocabulary (e.g., 500,000), which is far too much data.
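A minimal sketch of one-hot encoding with NumPy; the four-word vocabulary is a stand-in for a real 500,000-word one:

```python
# Each word becomes a vocabulary-sized vector with a single 1.
# Any two different one-hot vectors are orthogonal, so this
# representation carries no notion of word similarity.
import numpy as np

vocab = ["motel", "hotel", "deep", "learning"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(one_hot("motel"))                     # [1. 0. 0. 0.]
print(one_hot("motel") @ one_hot("hotel"))  # 0.0
```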
- Word vectors: also called word embeddings or (neural) word representations. They are a distributed representation.
How do we create them?
Count word co-occurrences within a window of a pre-specified size, and use the co-occurrence counts of the surrounding words as the current word's vector. Concretely, we define the word representations by building a co-occurrence matrix from a large text corpus.
For example, take the tiny corpus:
I like deep learning. I like NLP. I enjoy flying.
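A sketch of building that window-based co-occurrence matrix (window size 1) from the toy corpus:

```python
# Count, for every word, how often each other word appears next to it.
from collections import defaultdict

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]

counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in (i - 1, i + 1):            # neighbours within a window of 1
            if 0 <= j < len(tokens):
                counts[(word, tokens[j])] += 1

vocab = sorted({w for pair in counts for w in pair})
for w in vocab:                              # print the matrix row by row
    print(w, [counts[(w, c)] for c in vocab])
```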
NLP Tasks
| | One Sequence | Multiple Sequences |
| --- | --- | --- |
| One Class | Sentiment Classification, Stance Detection, Veracity Prediction, Intent Classification, Dialogue Policy | NLI, Search Engine, Relation Extraction |
| Class for each Token | POS Tagging, Word Segmentation, Extractive Summarization, Slot Filling, NER | |
| Copy from Input | Extractive QA | |
| General Sequence | Abstractive Summarization, Translation, Grammar Correction, NLG | General QA, Task-Oriented Dialogue, Chatbot |
| Other? | Parsing, Coreference Resolution | |
Part-of-Speech (POS) Tagging
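For a feel of the task, NLTK ships an off-the-shelf tagger; a sketch (run the two `nltk.download` lines once first):

```python
# POS-tag a sentence with NLTK's pretrained perceptron tagger.
# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

tokens = nltk.word_tokenize("I like deep learning")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('like', 'VBP'), ('deep', 'JJ'), ('learning', 'NN')]
```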
Word Segmentation
This is the problem of deciding where the breaks in a sentence fall: think of subordinate clauses in English, or strings of stacked attributives in Chinese.
A classic example, an unpunctuated couplet:
釀酒缸缸好造醋壇壇酸
養(yǎng)豬大如山老鼠只只死
Reading 1: 釀酒缸缸好，造醋壇壇酸 ("every jar of wine brews well, every jar of vinegar turns sour"); 養(yǎng)豬大如山，老鼠只只死 ("the pigs grow big as mountains, the rats all die").
Reading 2: 釀酒缸缸好造醋，壇壇酸 ("the wine jars are good for making vinegar; every jar turns sour"); 養(yǎng)豬大如山老鼠，只只死 ("the pigs grow as big as mountain rats, and die one by one").
News headline: 佟大為妻子產(chǎn)下一女 ("Tong Dawei's wife gives birth to a daughter").
Comment: "Who is this 佟大? Remarkable, truly impressive!!" (mis-segmenting the actor's name 佟大為 as 佟大 + 為妻子, i.e., "Tong Da gives birth to a daughter for his wife").
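For Chinese, a common off-the-shelf segmenter is jieba (our library choice, not the course's); a quick sketch:

```python
# Segment the ambiguous examples above with jieba (pip install jieba).
# jieba's default dictionary targets simplified Chinese, so output on
# these traditional-character strings is only illustrative.
import jieba

print(jieba.lcut("釀酒缸缸好造醋壇壇酸"))
print(jieba.lcut("佟大為妻子產(chǎn)下一女"))   # does it recover the name 佟大為?
```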
Parsing
Summarization
- Extractive summarization
The simplest solution: treat it as a binary classification problem and decide, for each sentence, whether it goes into the summary. It is like primary school, when the teacher asked us to summarize a text and we just copied out a couple of sentences.
But this often does not give the best result: if two sentences say nearly the same thing, feeding the model one sentence at a time is not enough.
With deep learning we can take the whole document into account: feed all the sentences in together, run an LSTM or a Transformer over them, and output a binary keep/drop decision for every sentence, as sketched below.
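A minimal PyTorch sketch of that idea; the sentence embeddings are random placeholders and every size is arbitrary:

```python
# Score each sentence of a document for inclusion in an extractive summary:
# a bidirectional LSTM over sentence embeddings, one keep-probability each.
import torch
import torch.nn as nn

class ExtractiveSummarizer(nn.Module):
    def __init__(self, sent_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(sent_dim, hidden, bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, sents):                # (batch, n_sents, sent_dim)
        h, _ = self.lstm(sents)              # (batch, n_sents, 2 * hidden)
        return torch.sigmoid(self.score(h))  # (batch, n_sents, 1)

model = ExtractiveSummarizer()
doc = torch.randn(1, 10, 128)                # 10 placeholder "sentence embeddings"
print(model(doc).squeeze(-1))                # one keep-probability per sentence
```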
- Abstractive summarization
The machine needs to write the summary in its own words rather than lifting text directly from the original. Solution: a Seq2Seq problem, long sequence -> short sequence.
One problem arises: the source article contains technical terms or well-turned phrases that were already written perfectly well, yet the model insists on paraphrasing them, and the output ends up garbled.
When summarizing we still want some of the original wording to survive, so we should encourage the network to have a copying ability instead of rewriting everything in its own plain words.
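Pretrained seq2seq summarizers are a pip install away; a sketch with the Hugging Face pipeline (the default model is the library's choice, and weights download on first run):

```python
# Abstractive summarization as seq2seq: long sequence in, short sequence out.
from transformers import pipeline

summarizer = pipeline("summarization")
article = ("Extractive methods pick sentences straight from the source text, "
           "while abstractive methods rewrite the content in new words, which "
           "is why they are trained as sequence-to-sequence models.")
print(summarizer(article, max_length=25, min_length=5))
```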
- Machine Translation
There are roughly 7,000 human languages, each with tens of thousands of words; pairwise translation between all of them would need on the order of 7,000 squared systems.
Hence: unsupervised learning!
- Grammar Error Correction: a Seq2seq problem. We can just throw data at it and brute-force the training.
A more advanced input format: align input tokens with output tokens and compute the difference, e.g., with three edit options: C for copy, R for replace, A for append. A toy sketch follows.
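Here is a toy sketch of computing such edit tags with Python's difflib, assuming only the three operations named above (deletions are ignored for simplicity):

```python
# Tag each source token with C (copy), R (replace, with the new token), or
# A (append new tokens after this one) so the edits rebuild the target.
from difflib import SequenceMatcher

def edit_tags(source, target):
    tags = [[] for _ in source]
    for op, i1, i2, j1, j2 in SequenceMatcher(a=source, b=target).get_opcodes():
        if op == "equal":
            for i in range(i1, i2):
                tags[i].append("C")
        elif op == "replace":
            for i, j in zip(range(i1, i2), range(j1, j2)):
                tags[i].append(("R", target[j]))
        elif op == "insert" and i1 > 0:
            tags[i1 - 1].extend(("A", target[j]) for j in range(j1, j2))
    return tags

src = "I has a apple".split()
tgt = "I have an apple today".split()
print(list(zip(src, edit_tags(src, tgt))))
```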
- Sentiment Classification: judging sentiment. Applications include ad targeting, gauging a film's word of mouth, classifying stock news as bullish or bearish, tracking weekly buzz in the crypto world, and so on.
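Off-the-shelf sentiment classifiers exist too; a sketch with the transformers pipeline (the default English model is the library's choice):

```python
# Binary sentiment classification with a pretrained model.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("This movie was a complete waste of time."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```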
- Stance Detection: a new style of polling, used for profiling voters unwilling to state a position and for targeting election ads; Bilibili's "Avalon" system is another example.
Source: "Trump is a good president." Reply: "He's just a capitalist." This commenter's stance? -> Deny. Many systems classify replies into four classes: support, deny, query, and comment (SDQC).
- Natural Language Inference (NLI)
The output is one of three labels: Entailment, Contradiction, or Neutral.
Premise: "a green triangle" ->? Hypothesis: "the sum of two sides is greater than the third side". Input: premise + hypothesis -> output: entailment.
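A sketch of running NLI with a pretrained MNLI model (the model name is one public choice among many):

```python
# Feed a premise/hypothesis pair to an MNLI-finetuned classifier.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")
print(nli({"text": "A man is eating pizza.",
           "text_pair": "A person is having a meal."}))
# e.g. [{'label': 'ENTAILMENT', 'score': ...}]
```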
- Search Engine: a BERT-based ranker can be simplified as:
Two inputs: query + document content -> model -> relevance score.
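A sketch of that two-input relevance model using a public cross-encoder from the sentence-transformers library (the model name is an assumption):

```python
# Score (query, document) pairs for relevance with a cross-encoder.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([
    ("what is a word embedding", "Word embeddings map words to dense vectors."),
    ("what is a word embedding", "The recipe calls for two cups of flour."),
])
print(scores)   # higher score = more relevant
```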
- Question Answering (QA) systems
The traditional approach is a huge architecture that chains together simple models such as SVMs. Input: question + knowledge source -> QA model -> answer. Truly general QA is still too hard; current networks only manage reading comprehension, i.e., extractive QA that outputs the field of the original text containing the answer, e.g., (1_7-11: paragraph 1, words 7-11). If general QA were ever achieved, it would be the birth of an oracle.
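Extractive QA at reading-comprehension level is likewise available off the shelf; a sketch (the default SQuAD-style model is chosen by the library):

```python
# Extract the answer span for a question from a given context passage.
from transformers import pipeline

qa = pipeline("question-answering")
print(qa(
    question="What do current networks output?",
    context="Current networks only reach reading-comprehension level and "
            "output the span of the original text that answers the question.",
))
# {'answer': ..., 'score': ..., 'start': ..., 'end': ...}
```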
- Chatting: idle chit-chat. Just... chit-chat.
- Task-oriented dialogue, typically a pipeline of:
  - Natural Language Understanding (NLU)
  - Policy & State Tracker
  - Natural Language Generation (NLG)
Networks
BERT:
BERT is a character from Sesame Street; everyone keeps contriving acronyms for their network methods so that they spell out Sesame Street characters. BERT and RNN will be covered in the next note.
LSTM: will be introduced in the next sharing session.
Final
Quote from the Interspeech 2019 keynote "Statistical approach to speech" by Prof. Keiichi Tokuda (a line originally attributed to Frederick Jelinek):
"Every time I fire a linguist, the performance of the speech recognizer goes up."