NLP Notes Day 1: Environment Setup and Data Preprocessing

Introduction

Starting with the deeplearning.ai course, I am trying to pick up the NLP skills I have neglected for three years.
Coursera course link

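Setting up a Jupyter + VS Code learning environment

Start with Why

Why VS Code?

I am tempted to be lazy and say something like "you only know once you use it, and you lose out if you don't," but to reassure readers who still have doubts, I should at least explain in my own words why this tool is good.

VS Code is the "Swiss Army knife" of programming. If there is something it seems unable to do, you probably just haven't found the right extension yet.

So I won't write out a detailed list of VS Code features. Anyone who has used it knows that once you start, you may not be able to do without it, not only for writing code but for everyday writing too.

The work I now do in VS Code includes front-end and back-end development, debugging, note-taking, remote server login, the command line, deploying code and blog posts, and so on. I expect it to become the entry point for even more work in the future, just as I now want to integrate Jupyter into it.

I used Jupyter Notebook a few years ago when I first started learning machine learning, but gradually used it less (my Mac's limited computing power made running machine learning too painful), and eventually I forgot even how to set up the environment.

Recently I wanted to pick my NLP skills back up, which is where this idea came from. A quick search showed that a solution exists, so let's get straight to it.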

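Step 1: Set up a local Jupyter server

Jupyter Notebook is a very good tool for learning AI programming.

First, Jupyter combined with Anaconda lets you develop in separate package environments. Machine learning work in particular tends to load a large number of libraries, and this spares you the trouble of juggling different package versions and runtime environments.

Second, with Jupyter you can take notes while watching your program's output, without switching between windows. When you finish a study session, the notes can be published and shared right away, which reinforces your motivation to learn.

So I strongly recommend setting up a Jupyter Notebook environment on your own computer. There are plenty of tutorials online, so I won't repeat them here.

First, install a Python 3 environment yourself.

Then download and install Anaconda (and what on earth is that?).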

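I remember that installing Anaconda used to involve the command line; now you just download the installer and run it. While we are here, let me explain why to use Anaconda and what to watch out for.

Anaconda solves the problem of inconsistent runtime environments: you can configure a separate, isolated environment for each application. (That is the one-sentence version.)

If that sentence still doesn't click, try pulling a few Python projects from GitHub and running them without Anaconda. You will soon see why it is needed.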
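One way to see the isolation in practice is to ask the interpreter where it lives. The following is a minimal sketch using only the standard library. The describe_environment helper is a name made up for this note; CONDA_DEFAULT_ENV is the environment variable conda sets when an environment is activated, and it will simply be missing outside conda.

```python
import os
import sys


def describe_environment():
    """Return a few facts that identify the active Python environment."""
    return {
        # Full path of the interpreter running this code
        "executable": sys.executable,
        # Root directory of the environment the interpreter belongs to
        "prefix": sys.prefix,
        # Name of the activated conda environment, or None outside conda
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Interpreter version as major.minor.micro
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this in two different environments should print two different executable and prefix paths, which is the isolation described above.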

Once installed, start Jupyter Notebook by running:

jupyter notebook

When the command succeeds, the system automatically opens localhost:8888 in your browser. Since we are going to use Jupyter inside VS Code, you can simply close that page.

Step 2: Configure VS Code

Install the Jupyter extension from the extension marketplace, then open the Command Palette (Shift+Command+P).

Run Jupyter:Create New Blank Jupyter Notebook

Now you can start using it. The new document shows the same information as the web interface: the connected local server, whether Python 3 is running, and so on.

References

Working with Jupyter Notebooks in Visual Studio Code
Install and Use — Jupyter Documentation 4.1.1 alpha documentation

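NLP Basics 01 - Data Preprocessing

Preprocessing the data
Using NLTK to process the dataset

Importing packages

NLTK (http://www.nltk.org/) is a natural language toolkit. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

It is suitable for linguists, engineers, students, educators, researchers, and industry users. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.

Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more. The online version of the book has been updated for Python 3 and NLTK 3. (The original Python 2 version is still available at http://nltk.org/book_1ed.)

The task is sentiment analysis on tweets, that is, deciding whether each tweet is positive, negative, or neutral.
The NLTK package ships with a sample Twitter dataset that can be used directly.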

import nltk
from nltk.corpus import twitter_samples
import matplotlib.pyplot as plt
import random

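About the Twitter dataset

This NLTK dataset has already been split into positive and negative tweets, 5,000 of each. Although the data comes from real tweets, this balanced split is artificial.

Running nltk.download('twitter_samples') locally failed for me with the error: Errno 61 Connection refused

So the data has to be fetched manually, as follows (working with the command line in the same window is one of the advantages of VS Code):

Download the zip from the nltk/nltk_data repository (NLTK Data)
Unzip it and rename the folder to nltk_data
If the next step reports an error, check the message and move the folder into one of the directories the program searches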

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
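If the import above fails because NLTK cannot find its data, it helps to check which of the searched directories actually holds the corpus. The sketch below is only an illustration: find_corpus is a hypothetical helper written for this note, and the corpora/twitter_samples layout is my assumption about how the unpacked nltk_data folder is organized. The one NLTK name it touches is nltk.data.path, the list of directories NLTK searches.

```python
from pathlib import Path


def find_corpus(search_dirs, relative="corpora/twitter_samples"):
    """Return the first directory in search_dirs that holds the corpus.

    Both an unpacked folder and a .zip archive count as a hit.
    Returns None when no directory contains it.
    """
    for base in search_dirs:
        target = Path(base) / relative
        if target.exists() or target.with_suffix(".zip").exists():
            return Path(base)
    return None


if __name__ == "__main__":
    try:
        import nltk
    except ImportError:
        print("nltk is not installed in this environment")
    else:
        # nltk.data.path is the list of directories NLTK searches
        print("NLTK searches:", nltk.data.path)
        print("twitter_samples found in:", find_corpus(nltk.data.path))
        # A custom location can be added before loading a corpus, e.g.
        # nltk.data.path.append("/path/to/nltk_data")  # hypothetical path
```

If the helper returns None, the nltk_data folder is probably not in any searched directory yet, which matches the "move the folder" step in the list above.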

Load the data with the strings() function

all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

Now we can take a first look at the data. This is a very important step before any real processing.

print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('\nThe type of all_negative_tweets is: ', type(all_negative_tweets))
print('\nThe type of a tweet entry is: ', type(all_negative_tweets[0]))

Number of positive tweets:  5000
Number of negative tweets:  5000

The type of all_positive_tweets is:  <class 'list'>

The type of all_negative_tweets is:  <class 'list'>

The type of a tweet entry is:  <class 'str'>

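The results above show that the two JSON files have been converted into lists, and each tweet is a string.

You can also use the pyplot library to draw a pie chart describing this data (a little data visualization never hurts).

For pyplot usage, see Basic pie chart (Matplotlib 3.3.3 documentation)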

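# Set the figure size
fig = plt.figure(figsize=(5, 5))

# Labels for the two classes
labels = 'Positive', 'Negative'

# Size of each slice
sizes = [len(all_positive_tweets), len(all_negative_tweets)]

# Draw the pie chart: slice sizes, labels, percentage format, shadow;
# startangle=90 makes the split between the two equal halves vertical
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)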

plt.axis('equal')

plt.show()

Looking at the raw text

To see what the data really looks like, the code below prints one positive and one negative tweet, distinguished by color.

# Positive tweet in green; randint includes both endpoints, so stop at len - 1
print('\033[92m' + all_positive_tweets[random.randint(0, len(all_positive_tweets) - 1)])

# Negative tweet in red
print('\033[91m' + all_negative_tweets[random.randint(0, len(all_negative_tweets) - 1)])

@JayHorwell Hi Jay, if you haven't received it yet please email our events team at events@breastcancernow.org and they'll sort it :)
@pickledog47 @FoxyLustyGrover Its Kate, tho!!  :(  #sniff

This shows that the data contains plenty of emoticons and URLs, which later processing has to take into account.
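To put rough numbers on that impression, one could count how many tweets contain each kind of noise. This is a minimal sketch with the standard re module only. The patterns are deliberately simple assumptions of mine (the emoticon pattern covers only a handful of faces, and the handle pattern would also match the domain part of an e-mail address), and noise_summary is a hypothetical helper name.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
# A deliberately small emoticon pattern: eyes, optional nose, mouth
EMOTICON_PATTERN = re.compile(r"[:;=]-?[)(DPp]")
HANDLE_PATTERN = re.compile(r"@\w+")
HASHTAG_PATTERN = re.compile(r"#\w+")


def noise_summary(tweets):
    """Count how many tweets contain a URL, an emoticon, a handle, a hashtag."""
    counts = {"url": 0, "emoticon": 0, "handle": 0, "hashtag": 0}
    for tweet in tweets:
        if URL_PATTERN.search(tweet):
            counts["url"] += 1
        if EMOTICON_PATTERN.search(tweet):
            counts["emoticon"] += 1
        if HANDLE_PATTERN.search(tweet):
            counts["handle"] += 1
        if HASHTAG_PATTERN.search(tweet):
            counts["hashtag"] += 1
    return counts


sample = [
    "@pickledog47 @FoxyLustyGrover Its Kate, tho!!  :(  #sniff",
    "My beautiful sunflowers :) #happy https://t.co/3tfYom0N1i",
]
print(noise_summary(sample))
```

In the notebook, the same function could be called on all_positive_tweets and all_negative_tweets instead of the two-item sample.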

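Preprocessing the raw text

Data preprocessing is a key step in any machine learning project. It includes data cleaning and formatting. For NLP, the main tasks are:

Tokenization
Handling letter case
Removing stop words and punctuation
Stemming (particular to processing English)

# Pick a fairly complex tweet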
tweet = all_positive_tweets[2277]

print(tweet)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i

import re                                     # regular expression library
import string                                 # string operations library

from nltk.corpus import stopwords             # NLTK stop word corpus; apparently no Chinese support
from nltk.stem import PorterStemmer           # stemming module
from nltk.tokenize import TweetTokenizer      # tokenizer for tweets

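Removing hyperlinks, Twitter marks, and styles

Remove strings that are common on the Twitter platform. As on Weibo, there are many '@', '#', and URLs.
Use the re library to perform regular expression operations, with sub() replacing each match with an empty string.

For Python regular expressions, see: Python Regular Expressions | Runoob tutorial
You can debug regular expressions directly with VS Code's find tool.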

print('\033[92m' + tweet)
print('\033[94m')

tweet2 = re.sub(r'^RT[\s]+', '', tweet)  # remove a leading "RT" plus whitespace, the old-style retweet marker

tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)  # remove hyperlinks

tweet2 = re.sub(r'#', '', tweet2)  # remove only the '#' sign, keeping the word

print(tweet2)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…
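To see what each substitution contributes, the three patterns can be applied one at a time to a made-up string. This is a small sketch for illustration: strip_twitter_markup is a hypothetical helper, and the input text with its example.com link is invented. The regular expressions themselves are the ones used above.

```python
import re


def strip_twitter_markup(text):
    """Apply the three substitutions one at a time and record each result."""
    steps = []
    # 1. Drop a leading "RT " marker (old-style retweets)
    text = re.sub(r'^RT[\s]+', '', text)
    steps.append(text)
    # 2. Drop a hyperlink and everything after it up to the end of the line
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)
    steps.append(text)
    # 3. Drop only the '#' sign, keeping the hashtag word
    text = re.sub(r'#', '', text)
    steps.append(text)
    return steps


for result in strip_twitter_markup("RT #NLP is fun https://example.com/a more text"):
    print(repr(result))
```

Note what the second step does here: because `.*` is greedy, the words after the link ("more text") disappear along with it. That is harmless for tweets that end with their link, as the sample tweet does. A narrower pattern such as r'https?://\S+' would remove only the link itself.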

Let's first try tokenizing directly and see what comes out.

print()
print('\033[92m' + tweet2)
print('\033[94m')

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

tweet_tokens = tokenizer.tokenize(tweet2)

print()
print('Tokenized string:')
print(tweet_tokens)

My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…

Tokenized string:
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
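The three options passed to TweetTokenizer above each do a small, separate job. The sketch below imitates them with plain regular expressions so the effect is visible without NLTK. It is a rough stand-in under my own assumptions, not NLTK's implementation: approximate_options is a made-up name, and unlike the real tokenizer it lowercases everything and does not split the text into tokens.

```python
import re


def approximate_options(text):
    """Rough stand-ins for three TweetTokenizer options (not NLTK's own code)."""
    # strip_handles=True: drop @username mentions
    text = re.sub(r'@\w+', '', text)
    # reduce_len=True: cap runs of a repeated character at three
    text = re.sub(r'(.)\1{2,}', r'\1\1\1', text)
    # preserve_case=False: lowercase the text
    text = text.lower()
    # Collapse the whitespace left behind by the removals
    return ' '.join(text.split())


print(approximate_options("@JayHorwell Thaaaaanks SOOOOO much"))
```

Capping repeated characters means "sooo" and "soooooo" end up as the same token, which keeps the vocabulary smaller without throwing away the emphasis entirely.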

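Removing stop words and punctuation

Stop words are common words that carry little meaning on their own. If you have ever generated a word cloud, you will have seen Chinese words such as "的" and "那么" dominate it, so it is best to remove them before processing.
English is a little different; see the output of the next step.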

stopwords_english = stopwords.words('english')

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Punctuation

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

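We can see that the stop word list above includes some words that may well be important, for example "i", "not", "between", "won", and "against".

Depending on the goal of the analysis, the stop word list may need further tailoring. The nltk_data folder downloaded earlier contains a stopwords folder, and the english file in it is the stop word list.
In this exercise, we use the whole list.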
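As an illustration of what tailoring the list could look like, the sketch below keeps a few negation words out of the stop word list. It uses a short excerpt of the list printed above so that it runs on its own; in the notebook the excerpt would be stopwords.words('english'). Which words to keep is a hypothetical choice of mine, not something the course prescribes.

```python
# A small excerpt of the English list shown above, to keep the sketch
# self-contained; in the notebook this would be stopwords.words('english')
stopwords_excerpt = ['i', 'me', 'my', 'am', 'on', 'off', 'a', 'not', 'no',
                     'against', "won't", 'between']

# Hypothetical choice: words to keep because they may carry sentiment
keep_for_sentiment = {'not', 'no', 'against', "won't"}

# Everything in the excerpt except the words we decided to keep
custom_stopwords = [w for w in stopwords_excerpt if w not in keep_for_sentiment]

tokens = ['i', 'am', 'not', 'happy', 'on', 'monday']
filtered = [t for t in tokens if t not in custom_stopwords]
print(filtered)
```

With the full list, "not" would be dropped and "i am not happy" would reduce to just "happy", which is exactly the kind of loss that matters for sentiment analysis.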

Next, clean the tokenized words by dropping stop words and punctuation

print()
print('\033[92m')
print(tweet_tokens)
print('\033[94m')

tweets_clean = []

for word in tweet_tokens:
    if (word not in stopwords_english and
        word not in string.punctuation):
        tweets_clean.append(word)

print('removed stop words and punctuation:')
print(tweets_clean)

['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
removed stop words and punctuation:
['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
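The output still contains ':)' and '…', which is worth a second look. string.punctuation is a single string, so `word not in string.punctuation` is a substring test rather than a per-character test. The sketch below compares that test with a stricter alternative; both helper names are made up for this note.

```python
import string


def is_punctuation_substring(token):
    """The test used in the loop above: substring match on one long string."""
    return token in string.punctuation


def is_all_punctuation(token):
    """Stricter alternative: every character is an ASCII punctuation mark."""
    return bool(token) and all(ch in string.punctuation for ch in token)


for token in ['!', '!!', ':)', '…', '()']:
    print(repr(token), is_punctuation_substring(token), is_all_punctuation(token))
```

The substring test keeps ':)' only because ':' and ')' are not adjacent in string.punctuation, and it also lets '!!' through. The stricter test would remove '!!' but would throw away ':)' too, which is probably unwanted when emoticons carry sentiment. The ellipsis '…' is not an ASCII character, so neither test removes it.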

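Stemming

This is something that needs special attention when processing English. For example:

learn
learning
learned
learnt

All of these words share the stem learn, but the stem a stemmer extracts is not always a whole word. Take happy:

happy
happiness
happier

We want the stem happi rather than happ, because happ is the stem of happen.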
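Where does the trailing "i" come from? In the original Porter algorithm, one early rule turns a final "y" into "i" when the rest of the word contains a vowel. The toy sketch below implements only two of Porter's rules (plural stripping, step 1a, and the y-to-i rule, step 1c) to show that effect. It is an illustration under that assumption, not a full stemmer and not NLTK's code, and toy_stem is a made-up name.

```python
VOWELS = "aeiou"


def contains_vowel(stem):
    """True if the stem has a vowel; 'y' counts when it follows a non-vowel."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return True
        if ch == "y" and i > 0 and stem[i - 1] not in VOWELS:
            return True
    return False


def toy_stem(word):
    """Apply two early Porter rules: plural stripping (1a) and y -> i (1c)."""
    # Step 1a: sses -> ss, ies -> i, ss -> ss, s -> ''
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-2]
    elif word.endswith("ss"):
        pass
    elif word.endswith("s"):
        word = word[:-1]
    # Step 1c: a final y becomes i when the rest of the word has a vowel
    if word.endswith("y") and contains_vowel(word[:-1]):
        word = word[:-1] + "i"
    return word


for w in ["happy", "sunny", "sunflowers", "sky"]:
    print(w, "->", toy_stem(w))
```

A full Porter stemmer goes on to strip further suffixes, so its results can be shorter than what these two rules produce. The point here is only that "happy" and "happiness" can meet at "happi" while staying distinct from "happen".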

NLTK has several modules for stemming; we will use PorterStemmer for this.

print()
print('\033[92m')
print(tweets_clean)
print('\033[94m')

stemmer = PorterStemmer()

tweets_stem = []

for word in tweets_clean:
    stem_word = stemmer.stem(word)
    tweets_stem.append(stem_word)

print('stemmed words:')
print(tweets_stem)

['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
stemmed words:
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']

process_tweet()

The whole process above can be wrapped in a file such as utils.py. The process_tweet function is applied below.
The code for utils.py is at the end of this post.

from utils import process_tweet

tweet = all_positive_tweets[2277]

print()
print('\033[92m')
print(tweet)
print('\033[94m')

tweets_stem = process_tweet(tweet)

print('preprocessed tweet:')
print(tweets_stem)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
preprocessed tweet:
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']

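Summary

This exercise walked through the general NLP preprocessing pipeline. Real-world processing (especially when Chinese is involved) is more complicated and has to be adjusted continually to fit the data at hand.

Save the following as utils.py in the same directory as the ipynb file; otherwise the last step will not run.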

import re
import string
import numpy as np


from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean


def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    
    return freqs
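build_freqs is not exercised in this post, so here is a small sketch of what its output looks like. To stay runnable without NLTK it swaps in a stand-in for process_tweet that only lowercases and splits on whitespace. Both toy_process and build_freqs_sketch are hypothetical names for illustration; the counting loop mirrors the one in build_freqs, written with dict.get.

```python
def toy_process(tweet):
    """Stand-in for process_tweet: lowercase and split on whitespace."""
    return tweet.lower().split()


def build_freqs_sketch(tweets, labels, process=toy_process):
    """Count (word, label) pairs, mirroring the loop in build_freqs."""
    freqs = {}
    for label, tweet in zip(labels, tweets):
        for word in process(tweet):
            pair = (word, label)
            # dict.get supplies 0 the first time a pair is seen
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


tweets = ["happy happy day", "sad day"]
labels = [1, 0]
freqs = build_freqs_sketch(tweets, labels)
print(freqs)
```

The same word can appear under both labels, as "day" does here, and the two counts are kept apart because the label is part of the dictionary key.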

ChangeLog

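2021/1/28 17:12:10 Two hours of tinkering; stopping here for now. You will find that getting stuck on a programming problem feels a lot like getting stuck in a game (the punishing kind). When that happens, make yourself stop and go do something else for a while...
2021/2/1 16:28:38 Spent two hours finishing the rest of the content

Last edited: 2021.02.03 15:48:39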
