NLP: From Getting Started to Getting Buried
What is NLP
Reinventing the wheel
In the wheel-reinventing movement of the big front-end era, every company built its own wheels, with a great deal of duplicated coding.
The ML field is better, though not by much. There are plenty of ready-made data and method libraries for us to use; nobody hand-writes deep networks in raw TensorFlow anymore, and the arrival of AutoKeras has made network tuning almost foolproof.
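As a taste of how foolproof this has become, here is a minimal AutoKeras sketch, assuming `pip install autokeras` (which pulls in TensorFlow); the toy dataset, trial count, and epoch count are placeholders, not a recipe:

```python
# Toy AutoKeras sketch: the architecture search and tuning are automated.
import numpy as np
import autokeras as ak

x = np.array(["great movie", "terrible film", "loved it", "awful acting"])
y = np.array([1, 0, 1, 0])

clf = ak.TextClassifier(max_trials=1)   # tries candidate architectures itself
clf.fit(x, y, epochs=1)
print(clf.predict(np.array(["what a fantastic show"])))
```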
The knowledge you need:
- A basic grasp of Python
- A preliminary understanding of how neural networks work, e.g., gradient descent
Recommended courses:
CS224n: Natural Language Processing with Deep Learning. Stanford's famous NLP course; because of COVID-19 it moved online, which even changed the professor's never-changing slide style. A very detailed and systematic course, from ML basics to the mathematical formulas, with very detailed notes. However, the formulas and algorithms are quite "mathematical" and can be obscure for students without the background.
Deep Learning for Human Language Processing (2020, Spring). Taught by HUNG-YI LEE (Li Hongyi), National Taiwan University's famous stand-up-comedian of a lecturer; thanks to his humorous, easy-to-follow delivery, the course racks up high view counts on YouTube.
Machine Learning (2021, Spring). Also taught by Hung-Yi Lee, with its own ML introduction. The two courses overlap in places, and the overlapping blocks can be skipped as appropriate.
Representing words
Representing images
For images: a grayscale image is simply a matrix.
An RGB image is a three-channel matrix.
Various image-processing operations just stack "buffs" on this matrix: convolutions, filters, Gaussian and Fourier transforms, and so on. So how can the vocabulary of human language be made intelligible to a machine?
How do we have usable meaning in a computer?
Take a simple example: deciding a word's part of speech, i.e., whether it is a verb or a noun.
In machine-learning terms, we have a set of samples (x, y), where x is a word and y is its part of speech. We want to build a mapping f(x) -> y, but the mathematical model f (say, a neural network or an SVM) only accepts numerical input.
Words in NLP, however, are abstract human summaries in symbolic form (Chinese, English, Latin, and so on), so they must be converted into numerical form, or in other words, embedded into a mathematical space. This kind of embedding is called word embedding, and Word2vec is one kind of word embedding.
- WordNet: a thesaurus containing lists of synonym sets and hypernyms ("is a" relationships). WordNet was developed in part to support automatic text analysis and artificial-intelligence applications.
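A quick way to poke at WordNet is through NLTK; a minimal sketch (run `nltk.download("wordnet")` once beforehand):

```python
# Look up synonym sets and hypernyms ("is a" relations) in WordNet via NLTK.
from nltk.corpus import wordnet as wn

print(wn.synsets("good")[0].definition())   # gloss of the first sense of "good"
panda = wn.synsets("panda")[0]
print(panda.hypernyms())                    # e.g. [Synset('procyonid.n.01')]
```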
Representing words as discrete symbols: one-hot vectors
```
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
```
Vector dimension = number of words in the vocabulary (e.g., 500,000), which is far too much data.
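A minimal sketch of one-hot encoding with NumPy; the four-word vocabulary is a stand-in for a real 500,000-word one:

```python
# Each word becomes a vocabulary-sized vector with a single 1.
# Any two different one-hot vectors are orthogonal, so this
# representation carries no notion of word similarity.
import numpy as np

vocab = ["motel", "hotel", "deep", "learning"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(one_hot("motel"))                     # [1. 0. 0. 0.]
print(one_hot("motel") @ one_hot("hotel"))  # 0.0
```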
- Word vectors: also called word embeddings or (neural) word representations. They are a distributed representation.
How do we create them?
Count word co-occurrences within a window of a pre-specified size, and use the co-occurrence counts of the surrounding words as the current word's vector. Concretely, we define the word representations by building a co-occurrence matrix from a large text corpus.
For example, take the tiny corpus:
I like deep learning. I like NLP. I enjoy flying.
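A sketch of building that window-based co-occurrence matrix (window size 1) from the toy corpus:

```python
# Count, for every word, how often each other word appears next to it.
from collections import defaultdict

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]

counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in (i - 1, i + 1):            # neighbours within a window of 1
            if 0 <= j < len(tokens):
                counts[(word, tokens[j])] += 1

vocab = sorted({w for pair in counts for w in pair})
for w in vocab:                              # print the matrix row by row
    print(w, [counts[(w, c)] for c in vocab])
```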
NLP Tasks
| | One Sequence | Multiple Sequences |
| --- | --- | --- |
| One Class | Sentiment Classification, Stance Detection, Veracity Prediction, Intent Classification, Dialogue Policy | NLI, Search Engine, Relation Extraction |
| Class for each Token | POS Tagging, Word Segmentation, Extractive Summarization, Slot Filling, NER | |
| Copy from Input | Extractive QA | |
| General Sequence | Abstractive Summarization, Translation, Grammar Correction, NLG | General QA, Task-Oriented Dialogue, Chatbot |
| Other? | Parsing, Coreference Resolution | |
Part-of-Speech (POS) Tagging
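For a feel of the task, NLTK ships an off-the-shelf tagger; a sketch (run the two `nltk.download` lines once first):

```python
# POS-tag a sentence with NLTK's pretrained perceptron tagger.
# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

tokens = nltk.word_tokenize("I like deep learning")
print(nltk.pos_tag(tokens))
# e.g. [('I', 'PRP'), ('like', 'VBP'), ('deep', 'JJ'), ('learning', 'NN')]
```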
Word Segmentation
This is the problem of deciding where the breaks in a sentence fall: think of subordinate clauses in English, or strings of stacked attributives in Chinese.
A classic example, an unpunctuated couplet:
釀酒缸缸好造醋壇壇酸
養(yǎng)豬大如山老鼠只只死
Reading 1: 釀酒缸缸好，造醋壇壇酸 ("every jar of wine brews well, every jar of vinegar turns sour"); 養(yǎng)豬大如山，老鼠只只死 ("the pigs grow big as mountains, the rats all die").
Reading 2: 釀酒缸缸好造醋，壇壇酸 ("the wine jars are good for making vinegar; every jar turns sour"); 養(yǎng)豬大如山老鼠，只只死 ("the pigs grow as big as mountain rats, and die one by one").
News headline: 佟大為妻子產(chǎn)下一女 ("Tong Dawei's wife gives birth to a daughter").
Comment: "Who is this 佟大? Remarkable, truly impressive!!" (mis-segmenting the actor's name 佟大為 as 佟大 + 為妻子, i.e., "Tong Da gives birth to a daughter for his wife").
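For Chinese, a common off-the-shelf segmenter is jieba (our library choice, not the course's); a quick sketch:

```python
# Segment the ambiguous examples above with jieba (pip install jieba).
# jieba's default dictionary targets simplified Chinese, so output on
# these traditional-character strings is only illustrative.
import jieba

print(jieba.lcut("釀酒缸缸好造醋壇壇酸"))
print(jieba.lcut("佟大為妻子產(chǎn)下一女"))   # does it recover the name 佟大為?
```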
Parsing
Summarization
- Extractive summarization
The simplest solution: treat it as a binary classification problem and decide, for each sentence, whether it goes into the summary. It is like primary school, when the teacher asked us to summarize a text and we just copied out a couple of sentences.
But this often does not give the best result: if two sentences say nearly the same thing, feeding the model one sentence at a time is not enough.
With deep learning we can take the whole document into account: feed all the sentences in together, run an LSTM or a Transformer over them, and output a binary keep/drop decision for every sentence, as sketched below.
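A minimal PyTorch sketch of that idea; the sentence embeddings are random placeholders and every size is arbitrary:

```python
# Score each sentence of a document for inclusion in an extractive summary:
# a bidirectional LSTM over sentence embeddings, one keep-probability each.
import torch
import torch.nn as nn

class ExtractiveSummarizer(nn.Module):
    def __init__(self, sent_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(sent_dim, hidden, bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, sents):                # (batch, n_sents, sent_dim)
        h, _ = self.lstm(sents)              # (batch, n_sents, 2 * hidden)
        return torch.sigmoid(self.score(h))  # (batch, n_sents, 1)

model = ExtractiveSummarizer()
doc = torch.randn(1, 10, 128)                # 10 placeholder "sentence embeddings"
print(model(doc).squeeze(-1))                # one keep-probability per sentence
```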
- Abstractive summarization
The machine needs to write the summary in its own words rather than lifting text directly from the original. Solution: a Seq2Seq problem, long sequence -> short sequence.
One problem arises: the source article contains technical terms or well-turned phrases that were already written perfectly well, yet the model insists on paraphrasing them, and the output ends up garbled.
When summarizing we still want some of the original wording to survive, so we should encourage the network to have a copying ability instead of rewriting everything in its own plain words.
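Pretrained seq2seq summarizers are a pip install away; a sketch with the Hugging Face pipeline (the default model is the library's choice, and weights download on first run):

```python
# Abstractive summarization as seq2seq: long sequence in, short sequence out.
from transformers import pipeline

summarizer = pipeline("summarization")
article = ("Extractive methods pick sentences straight from the source text, "
           "while abstractive methods rewrite the content in new words, which "
           "is why they are trained as sequence-to-sequence models.")
print(summarizer(article, max_length=25, min_length=5))
```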
- Machine Translation
There are roughly 7,000 human languages, each with tens of thousands of words; pairwise translation between all of them would need on the order of 7,000 squared systems.
Hence: unsupervised learning!
- Grammar Error Correction: a Seq2seq problem. We can just throw data at it and brute-force the training.
A more advanced input format: align input tokens with output tokens and compute the difference, e.g., with three edit options: C for copy, R for replace, A for append. A toy sketch follows.
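Here is a toy sketch of computing such edit tags with Python's difflib, assuming only the three operations named above (deletions are ignored for simplicity):

```python
# Tag each source token with C (copy), R (replace, with the new token), or
# A (append new tokens after this one) so the edits rebuild the target.
from difflib import SequenceMatcher

def edit_tags(source, target):
    tags = [[] for _ in source]
    for op, i1, i2, j1, j2 in SequenceMatcher(a=source, b=target).get_opcodes():
        if op == "equal":
            for i in range(i1, i2):
                tags[i].append("C")
        elif op == "replace":
            for i, j in zip(range(i1, i2), range(j1, j2)):
                tags[i].append(("R", target[j]))
        elif op == "insert" and i1 > 0:
            tags[i1 - 1].extend(("A", target[j]) for j in range(j1, j2))
    return tags

src = "I has a apple".split()
tgt = "I have an apple today".split()
print(list(zip(src, edit_tags(src, tgt))))
```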
- Sentiment Classification: judging sentiment. Applications include ad targeting, gauging a film's word of mouth, classifying stock news as bullish or bearish, tracking weekly buzz in the crypto world, and so on.
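Off-the-shelf sentiment classifiers exist too; a sketch with the transformers pipeline (the default English model is the library's choice):

```python
# Binary sentiment classification with a pretrained model.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("This movie was a complete waste of time."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99...}]
```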
- Stance Detection: a new style of polling, used for profiling voters unwilling to state a position and for targeting election ads; Bilibili's "Avalon" system is another example.
Source: "Trump is a good president." Reply: "He's just a capitalist." This commenter's stance? -> Deny. Many systems classify replies into four classes: support, deny, query, and comment (SDQC).
- Natural Language Inference (NLI)
The output is one of three labels: Entailment, Contradiction, or Neutral.
Premise: "a green triangle" ->? Hypothesis: "the sum of two sides is greater than the third side". Input: premise + hypothesis -> output: entailment.
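A sketch of running NLI with a pretrained MNLI model (the model name is one public choice among many):

```python
# Feed a premise/hypothesis pair to an MNLI-finetuned classifier.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")
print(nli({"text": "A man is eating pizza.",
           "text_pair": "A person is having a meal."}))
# e.g. [{'label': 'ENTAILMENT', 'score': ...}]
```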
- Search Engine: a BERT-based ranker can be simplified as:
Two inputs: query + document content -> model -> relevance score.
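A sketch of that two-input relevance model using a public cross-encoder from the sentence-transformers library (the model name is an assumption):

```python
# Score (query, document) pairs for relevance with a cross-encoder.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([
    ("what is a word embedding", "Word embeddings map words to dense vectors."),
    ("what is a word embedding", "The recipe calls for two cups of flour."),
])
print(scores)   # higher score = more relevant
```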
- Question Answering (QA) systems
The traditional approach is a huge architecture that chains together simple models such as SVMs. Input: question + knowledge source -> QA model -> answer. Truly general QA is still too hard; current networks only manage reading comprehension, i.e., extractive QA that outputs the field of the original text containing the answer, e.g., (1_7-11: paragraph 1, words 7-11). If general QA were ever achieved, it would be the birth of an oracle.
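Extractive QA at reading-comprehension level is likewise available off the shelf; a sketch (the default SQuAD-style model is chosen by the library):

```python
# Extract the answer span for a question from a given context passage.
from transformers import pipeline

qa = pipeline("question-answering")
print(qa(
    question="What do current networks output?",
    context="Current networks only reach reading-comprehension level and "
            "output the span of the original text that answers the question.",
))
# {'answer': ..., 'score': ..., 'start': ..., 'end': ...}
```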
- Chatting: idle chit-chat. Just... chit-chat.
- Task-oriented dialogue, typically a pipeline of:
  - Natural Language Understanding (NLU)
  - Policy & State Tracker
  - Natural Language Generation (NLG)
Networks
BERT:
BERT is a character from Sesame Street; everyone keeps contriving acronyms for their network methods so that they spell out Sesame Street characters. BERT and RNN will be covered in the next note.
LSTM: will be introduced in the next sharing session.
Final
Quote from the Interspeech 2019 keynote "Statistical approach to speech" by Prof. Keiichi Tokuda (a line originally attributed to Frederick Jelinek):
"Every time I fire a linguist, the performance of the speech recognizer goes up."