This text-mining walkthrough is based on the tm package; the data set consists of Obama's State of the Union addresses.
Link: https://github.com/datameister66/data
1. Loading the data
library(tm)
Build the path that contains the speech transcripts:
name <- file.path("/Users/mac/rstudio-workplace/txtData")
List the files under that path:
dir(name)
[1] "sou2010.txt" "sou2011.txt" "sou2012.txt" "sou2013.txt" "sou2014.txt" "sou2015.txt"
[7] "sou2016.txt"
Count the files under the path:
length(dir(name))
[1] 7
Use Corpus() to build the corpus, naming it docs:
docs <- Corpus(DirSource(name))
You can view the corpus contents with the inspect() function:
inspect(docs[1])
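inspect() mostly prints metadata. To read the actual text of a speech, a minimal sketch (assuming the docs corpus built above):
# Print the raw text of the first document
writeLines(as.character(docs[[1]]))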
2. Text transformations with tm's tm_map() function
Convert letters to lowercase with tolower (wrapped in content_transformer() so the result remains a valid corpus):
docs <- tm_map(docs, content_transformer(tolower))
Remove numbers with removeNumbers:
docs <- tm_map(docs, removeNumbers)
Remove punctuation with removePunctuation:
docs <- tm_map(docs, removePunctuation)
Remove stop words with removeWords and the built-in English stopwords list:
docs <- tm_map(docs, removeWords, stopwords("english"))
Strip extra whitespace with stripWhitespace:
docs <- tm_map(docs, stripWhitespace)
Remove other unwanted words by passing removeWords a character vector:
docs <- tm_map(docs, removeWords, c("applause", "can", "cant", "will", "that", "weve", "dont", "wont", "youll", "youre"))
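A quick sanity check that the transformations took effect; a small sketch, assuming the cleaned docs corpus:
# Look at the first 300 characters of the cleaned first speech
substr(paste(as.character(docs[[1]]), collapse = " "), 1, 300)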
3. Putting the corpus into a document-term matrix
dtm <- DocumentTermMatrix(docs)
7 documents, 4,715 terms:
dim(dtm)
[1] 7 4715
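With 4,715 terms over only 7 documents, many terms appear in a single speech. Optionally, tm's removeSparseTerms() can drop the sparsest columns; a sketch (the 0.5 threshold is an arbitrary choice for illustration):
# Keep only terms that appear in more than half of the speeches
dtms <- removeSparseTerms(dtm, 0.5)
dim(dtms)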
Inspect the matrix:
inspect(dtm)
<<DocumentTermMatrix (documents: 7, terms: 4715)>>
Non-/sparse entries: 10899/22106
Sparsity : 67%
Maximal term length: 17
Weighting : term frequency (tf)
Sample :
Terms
Docs america american jobs make new now people thats work years
sou2010.txt 18 18 23 14 20 30 32 26 21 20
sou2011.txt 18 19 25 23 36 25 31 24 20 25
sou2012.txt 30 34 34 15 27 26 21 24 16 18
sou2013.txt 24 19 32 20 24 35 18 18 20 22
sou2014.txt 28 21 23 22 29 11 24 19 27 21
sou2015.txt 35 19 18 23 41 15 22 30 20 25
sou2016.txt 21 16 8 17 16 15 21 29 20 17
View just the part of the matrix you are interested in:
inspect(dtm[1:3,1:3])
4. Word frequency analysis
Compute the sum of each column:
freq <- colSums(as.matrix(dtm))
head(freq)
abide ability able abroad absolutely abuses
1 4 14 13 4 1
Sort freq in descending order:
ord <- order(-freq)
head(ord)
[1] 913 60 1386 991 755 922
View the six most frequent words:
freq[head(ord)]
new america thats people jobs now
193 174 170 169 163 157
View the six least frequent words:
freq[tail(ord)]
withers wordvoices worldexcept worldin worry yearsnamely
1 1 1 1 1 1
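The ranking is easy to visualize as well; a minimal sketch using base R graphics (the cutoff of ten terms is arbitrary):
# Bar chart of the ten most frequent terms
top10 <- freq[head(ord, 10)]
barplot(top10, las = 2, main = "Top 10 terms")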
Examine the frequency of the frequencies: head(table(freq)) shows how many words occur one to six times, and tail() shows the counts at the highest frequencies:
head(table(freq))
freq
1 2 3 4 5 6
2226 788 382 234 142 137
tail(table(freq))
freq
157 163 169 170 174 193
1 1 1 1 1 1
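The long tail of words that occur exactly once is typical of natural language (Zipf's law). tm ships a helper for checking this; a sketch assuming the dtm above:
# Log-log plot of term frequency against rank
Zipf_plot(dtm)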
Use findFreqTerms() to find the words that occur at least N times, here N = 125:
findFreqTerms(dtm,125)
[1] "america" "american" "americans" "jobs" "make" "new" "now"
[8] "people" "thats" "work" "year" "years"
Use findAssocs() to compute correlations and find associations between words.
For example, words whose correlation with "job" is at least 0.9:
findAssocs(dtm, "job", corlimit = 0.9)
$job
wrong pollution forces together achieve training
0.97 0.96 0.93 0.93 0.93 0.91
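findAssocs() also accepts a vector of terms, with corlimit recycled across them; a sketch using two terms we know are in the matrix:
# Associations for two terms at once, same cutoff for both
findAssocs(dtm, c("jobs", "america"), corlimit = 0.85)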
Generate a word cloud:
library(wordcloud)
wordcloud(names(freq), freq, min.freq = 70, scale = c(3, .3), colors = brewer.pal(6, "Dark2"))
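wordcloud() places words at random, so the layout changes between runs; calling set.seed() first makes it reproducible. The same package also provides comparison.cloud() for contrasting term usage across documents; a sketch (max.words = 80 is an arbitrary choice):
# Term-document matrix with one column per speech
m <- as.matrix(TermDocumentMatrix(docs))
# Each word is drawn in the panel of the speech where it is most over-represented
comparison.cloud(m, max.words = 80, title.size = 1)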