Study Notes: *Mastering Machine Learning with R*, Second Edition
1. Data Understanding and Data Preparation
In this case study, we examine former U.S. President Barack Obama's State of the Union addresses. There are two main analysis goals: first, build a topic model from the seven addresses; second, compare the first address from 2010 with the last one from January 2016.
> library(pacman)
> p_load(tm, wordcloud2, magrittr)
>
> # Set the directory containing the text files
> name <- file.path("./data_set/data-master")
>
> # List the file names
> dir(name, pattern = "*.txt")
## [1] "sou2010.txt" "sou2011.txt" "sou2012.txt" "sou2013.txt" "sou2014.txt"
## [6] "sou2015.txt" "sou2016.txt"
> # Count the .txt files
> length(dir(name, pattern = "*.txt"))
## [1] 7
> # Build the corpus (no corpus- or document-level metadata)
> docs <- Corpus(DirSource(name, pattern = "*.txt"))
> docs
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 7
Text cleaning
> # Convert all letters to lower case
> docs <- tm_map(docs, tolower) %>%
# Remove numbers
+ tm_map(removeNumbers) %>%
# Remove punctuation
+ tm_map(removePunctuation) %>%
# Remove English stop words
+ tm_map(removeWords, stopwords("english")) %>%
# Strip extra whitespace
+ tm_map(stripWhitespace) %>%
# Remove additional unwanted words
+ tm_map(removeWords, c("applause", "can", "cant", "will", "that", "weve", "dont",
+ "wont", "youll", "youre"))
>
> # Confirm the documents are still plain text (this call errors out, so it is commented)
> # docs <- tm_map(docs,PlainTextDocument)
>
> # Create the document-term matrix (DTM)
> dtm <- DocumentTermMatrix(docs)
> dtm
## <<DocumentTermMatrix (documents: 7, terms: 4753)>>
## Non-/sparse entries: 10913/22358
## Sparsity : 67%
## Maximal term length: 17
## Weighting : term frequency (tf)
> dim(dtm)
## [1] 7 4753
The seven speeches contain 4,753 terms. Sparse terms can be removed with the removeSparseTerms() function, although this step is not required. You specify a number between 0 and 1; the higher the number, the more sparsity is allowed in the matrix. Sparsity reflects a term's relative document frequency, so with a sparsity threshold of 0.75, only terms whose sparsity is greater than 0.75 are removed. In this example, (1 - 0.75) * 7 = 1.75, so any term that appears in fewer than two documents is dropped:
> dtm <- removeSparseTerms(dtm, 0.75)
> dim(dtm)
## [1] 7 2256
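As a quick sanity check, a minimal sketch (assuming dtm is the trimmed matrix from above) confirms that every remaining term appears in at least two of the seven documents:
> # Number of documents containing each remaining term
> doc_freq <- colSums(as.matrix(dtm) > 0)
> min(doc_freq)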
Because there is no document metadata, we should name the rows of the matrix so that we know which document each row represents:
> rownames(dtm) <- dir(name, pattern = "*.txt") %>%
+ stringr::str_extract_all("\\d+") %>% unlist()
Inspect the first five columns for all seven rows:
> inspect(dtm[1:7, 1:5])
## <<DocumentTermMatrix (documents: 7, terms: 5)>>
## Non-/sparse entries: 22/13
## Sparsity : 37%
## Maximal term length: 10
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs ability able abroad absolutely accept
## 2010 1 1 2 2 1
## 2011 0 4 3 0 0
## 2012 0 3 1 1 0
## 2013 3 3 2 1 1
## 2014 0 1 4 0 0
## 2015 0 1 1 0 1
## 2016 0 1 0 0 1
2. Model Building and Evaluation
2.1 Word Frequency Analysis and Topic Models
> # Sum the word frequencies in each column and sort in decreasing order
> freq <- colSums(as.matrix(dtm))
> ord <- order(freq, decreasing = T)
> freq[head(ord)]
## new america people jobs now years
## 193 174 169 163 157 148
Find the terms that occur at least 125 times:
> findFreqTerms(dtm, 125)
## [1] "america" "american" "americans" "jobs" "make" "new"
## [7] "now" "people" "thats" "work" "year" "years"
Find the terms whose correlation with "jobs" is at least 0.85:
> findAssocs(dtm, "jobs", corlimit = 0.85)
## $jobs
## colleges serve market shouldnt defense put tax came
## 0.97 0.91 0.89 0.88 0.87 0.87 0.87 0.86
Generate a word cloud:
> tibble::tibble(word = names(freq), freq = freq) %>% wordcloud2(size = 1, color = "random-dark",
+ backgroundColor = "white", minRotation = -pi/4, maxRotation = pi/4)
Generate a bar chart of the ten most frequent terms:
> p_load(ggplot2)
> tibble::tibble(word = names(freq), freq = freq) %>% dplyr::top_n(10, freq) %>%
+ ggplot(aes(reorder(word, -freq), freq)) + geom_bar(stat = "identity") +
+ labs(title = "Word Frequency", y = "", x = "") + theme_bw()
Build the topic model, fitting three topics to the seven speeches:
> p_load(topicmodels)
> set.seed(124)
> lda <- LDA(dtm, k = 3, method = "Gibbs")
> topics(lda)
## 2010 2011 2012 2013 2014 2015 2016
## 1 3 3 3 2 2 1
The first and last speeches fall into the same topic group, so it appears Obama began and ended his presidency on the same themes.
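To see how strongly each speech is tied to its assigned topic, a minimal sketch using the posterior() helper from topicmodels (applied to the lda object above) inspects the per-document topic probabilities:
> # Per-document topic probabilities; each row sums to 1
> round(posterior(lda)$topics, 3)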
Look at the top 25 terms in each topic:
> terms(lda, 25)
## Topic 1 Topic 2 Topic 3
## [1,] "americans" "new" "jobs"
## [2,] "year" "make" "now"
## [3,] "one" "like" "thats"
## [4,] "know" "world" "american"
## [5,] "right" "work" "people"
## [6,] "time" "help" "tonight"
## [7,] "economy" "america" "lets"
## [8,] "businesses" "congress" "america"
## [9,] "people" "every" "energy"
## [10,] "families" "want" "tax"
## [11,] "years" "need" "government"
## [12,] "even" "years" "well"
## [13,] "take" "job" "also"
## [14,] "security" "future" "nation"
## [15,] "give" "first" "last"
## [16,] "many" "country" "business"
## [17,] "still" "better" "education"
## [18,] "work" "home" "put"
## [19,] "get" "american" "get"
## [20,] "change" "today" "good"
## [21,] "care" "back" "reform"
## [22,] "just" "way" "companies"
## [23,] "next" "hard" "deficit"
## [24,] "health" "sure" "must"
## [25,] "support" "college" "small"
2.2 Other Quantitative Analyses
Compare the 2010 and 2016 speeches:
> p_load(qdap)
> speech.16 <- readLines("./data_set/data-master/sou2016.txt") %>% paste(collapse = " ") %>%
+ # Convert the text encoding to ASCII
+ iconv("latin1", "ASCII", "")
The qprep() function in the qdap package bundles the following functions:
- bracketX(): removes text within brackets
- replace_abbreviation(): replaces abbreviations
- replace_number(): replaces numbers with words, for example, 100 becomes one hundred
- replace_symbol(): replaces symbols with words, for example, @ becomes at
> prep16 <- qprep(speech.16) %>%
# Replace contractions, e.g. can't becomes cannot
+ replace_contraction() %>%
# Remove the 100 most common stop words
+ rm_stopwords(Top100Words, separate = F) %>%
# Strip all characters except periods and question marks
+ strip(char.keep = c(".", "?"))
Split the document into sentences:
> sent16 <- data.frame(speech = prep16) %>% sentSplit("speech") %>% dplyr::mutate(year = "2016")
> str(sent16)
## 'data.frame': 299 obs. of 3 variables:
## $ tot : chr "1.1" "2.2" "3.3" "4.4" ...
## $ speech: chr "mister speaker mister vice president members congress fellow americans tonight marks eighth year ive here report state union." "final im going try shorter." "know antsy back iowa." "also understand because election season expectations well achieve year low." ...
## $ year : chr "2016" "2016" "2016" "2016" ...
The data frame contains three variables: tot stands for Turn of Talk and serves as an index of sentence order; speech holds the text split into sentences; and year is the grouping variable for the year of the speech.
Repeat the steps above for the 2010 speech:
> speech.10 <- readLines("./data_set/data-master/sou2010.txt") %>%
+ paste(collapse = " ") %>% iconv("latin1", "ASCII", "") %>%
+ gsub("Applause.", "", .) %>% qprep() %>%
+ replace_contraction() %>%
+ rm_stopwords(Top100Words, separate = F) %>%
+ strip(char.keep = c(".", "?"))
>
> sent10 <- data.frame(speech = speech.10) %>%
+ sentSplit("speech") %>% dplyr::mutate(year = "2010")
Combine the two years into a single data frame:
> sentences <- rbind(sent10, sent16)
Look at the word frequencies again:
> freq_terms(sentences$speech) %>% dplyr::top_n(15, FREQ) %>% ggplot(aes(FREQ,
+ reorder(WORD, FREQ))) + geom_bar(stat = "identity") + labs(x = "", y = "") +
+ theme_bw()
Build a word frequency matrix that gives the count of each word in each speech:
> word.mat <- wfm(sentences$speech, sentences$year)
> head(word.mat)
## 2010 2016
## abide 1 0
## ability 1 0
## able 1 1
## abroad 2 0
## absolutely 2 0
## abuse 0 1
> # Sort the matrix by word frequency
> word.mat <- word.mat[order(word.mat[, 1], word.mat[, 2], decreasing = T), ]
> head(word.mat)
## 2010 2016
## our 120 85
## us 33 33
## year 29 17
## americans 28 15
## why 27 10
## jobs 23 8
Build a word cloud for each year:
> trans_cloud(sentences$speech, sentences$year, min.freq = 10)
Produce summary statistics for the two documents:
> ws <- word_stats(sentences$speech, sentences$year, rm.incomplete = T)
> plot(ws, label = T, lab.digits = 2)
The 2016 speech is much shorter than the 2010 one, by more than 100 sentences and almost 1,000 words. It also makes heavier use of questions as a rhetorical device, with 10 questions versus 4 in 2010.
Compare the polarity (sentiment scores) of the two documents:
> pol <- polarity(sentences$speech, sentences$year)
> pol
## year total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 2010 435 3900 0.052 0.432 0.121
## 2 2016 299 2982 0.105 0.395 0.267
stan.mean.polarity is the standardized mean polarity, that is, the mean polarity divided by its standard deviation. The standardized mean for 2016 (0.267) is somewhat higher than for 2010 (0.121).
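As a quick check of that definition, a minimal sketch (assuming the per-year summary is stored in pol$group, alongside the pol$all data frame used below) recomputes the standardized values:
> # Standardized mean polarity = mean polarity / standard deviation
> with(pol$group, ave.polarity / sd.polarity)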
> plot(pol)
The 2010 speech opens on a very pessimistic note and is more negative overall than the 2016 speech. Using the data frame stored in the pol object, we can find the most pessimistic sentence.
> pol.all <- pol$all
>
> pol.all$text.var[which.min(pol.all$polarity)]
## [1] "year ago took office amid wars economy rocked severe recession financial system verge collapse government deeply debt."
Summary:
Quantitative text analysis includes the following (a short usage sketch follows the list):
1. Polarity analysis, usually called sentiment analysis, tells you how positive or negative the sentiment of a text is; the polarity() function;
2. The automated readability index is a measure of text complexity and of a reader's ability to understand it; the automated_readability_index() function;
3. Formality measures the degree to which a text relates to its reader, or a speaker to his or her audience; the formality() function;
4. Diversity is the ratio of the number of distinct words used to the total number of words; the diversity() function;
5. Dispersion, or lexical dispersion, helps you understand how words are distributed throughout a text and is an excellent way to explore a text and identify patterns; the dispersion_plot() function.
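A minimal usage sketch of these functions on the sentences data frame built above (the grouping by year mirrors the earlier qdap calls; the exact arguments, especially for dispersion_plot(), are assumptions rather than code from the book):
> # Readability, formality, and lexical diversity by year
> automated_readability_index(sentences$speech, sentences$year)
> formality(sentences$speech, sentences$year)
> diversity(sentences$speech, sentences$year)
> # Lexical dispersion of a few hand-picked terms, grouped by year
> dispersion_plot(sentences$speech, c("jobs", "economy", "america"),
+     grouping.var = sentences$year)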