Reference: *Text Data Mining: Based on R* (《文本數(shù)據(jù)挖掘——基于R語言》)
1. Basic feature extraction
Basic features include the number of characters, the number of sentences, the length of each word, the number of punctuation marks, and so on.
# Only works for English text; p_load() requires the pacman package
library(pacman)
p_load(textfeatures)
txt <- "1000pcs 8*32mm 0.5ml Plastic Centrifuge Tube Test Tubing Vial Clear Plastic Container Home Garden Storage Bottles. 1000pcs 6*22mm 0.2ml Plastic Bottles Gardening Storage Container Transparent Plastic Vials PCR Centrifuge Tube"
# The sentiment argument runs automatic sentiment analysis; word_dims vectorizes the text with a bag-of-words model; normalize normalizes the data column-wise. All three are turned off here.
textfeatures(txt, sentiment = F, word_dims = F, normalize = F,
verbose = F) %>%
# Print all columns of the result
print(width = Inf)
## # A tibble: 1 × 29
## n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 196
## n_uq_chars n_commas n_digits n_exclaims n_extraspaces n_lowers n_lowersp
## <int> <int> <int> <int> <int> <int> <dbl>
## 1 35 0 18 0 0 147 0.751
## n_periods n_words n_uq_words n_caps n_nonasciis n_puncts n_capsp
## <int> <int> <int> <int> <int> <int> <dbl>
## 1 3 29 20 26 0 2 0.137
## n_charsperword n_first_person n_first_personp n_second_person
## <dbl> <int> <int> <int>
## 1 6.57 0 0 0
## n_second_personp n_third_person n_tobe n_prepositions
## <int> <int> <int> <int>
## 1 0 0 0 0
● n_urls: number of URLs in the text.
● n_uq_urls: number of unique URLs in the text (the count in this example is not accurate).
● n_chars: total number of characters.
● n_commas: number of commas.
● n_lowers: number of lowercase characters.
● n_lowersp: proportion of lowercase characters.
● n_words: total number of words.
● n_uq_words: number of unique words.
● n_first_person: number of first-person singular words.
● n_second_personp: number of second-person plural words.
● n_prepositions: number of prepositions.
2构捡、基于TF-IDF的特征提取
TF-IDF就是詞頻TF與逆文檔頻率IDF的乘積壳猜,它背后的思想是:詞語的重要性與它在文件中出現(xiàn)的次數(shù)成正比,但同時(shí)會(huì)隨著它在語料庫(kù)中出現(xiàn)的頻率成反比统扳。
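As a quick illustration of the formula, here is a toy sketch; the two-document corpus and all names below are made up for demonstration (tidytext's bind_tf_idf(), used later, computes idf with the natural log in the same way):

```r
# Toy two-document corpus (hypothetical, for illustration only)
docs <- list(d1 = c("plastic", "bottle", "plastic"),
             d2 = c("glass", "bottle"))

# tf(t, d): share of the terms in document d that are t
tf <- sum(docs$d1 == "plastic") / length(docs$d1)            # 2/3

# idf(t): log of (number of documents / documents containing t)
idf <- log(length(docs) /
           sum(sapply(docs, function(d) "plastic" %in% d)))  # log(2/1)

tf * idf  # the TF-IDF weight of "plastic" in d1
```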
library(pacman)
p_load(dplyr, stringr, purrr)
2.1 Reading the data
Any text will do as a substitute; the data should contain two columns, one for the document name or ID and one for the text content.
storagebottles <- read.csv("dataset/ali/storagebottles0905.csv",
header = F) %>%
set_names(c("sku_name", "sku_price", "sku_sale_volume", "sku_score",
"sku_ship", "sku_isNewin", "sku_isPromotion",
"sku_isTopselling", "shop_name", "sku_link", "category4")) %>%
distinct(.keep_all = T)
storagebottles <- storagebottles %>%
filter(!is.na(sku_name)) %>%
filter(str_detect(sku_price, "^US")) %>%
filter(str_detect(sku_link, "aliexpress")) %>%
filter(str_detect(sku_sale_volume, "sold")) %>%
mutate(category = "home",
category2 = "Home Storage",
category3 = "Storage Bottles & Jars") %>%
mutate(sku_id = str_extract(sku_link, "\\d{16}"),
sku_link = paste0("http:", sku_link)) %>%
mutate(sku_id = as.character(sku_id)) %>%
arrange(sku_sale_volume) %>%
group_by(sku_id, .drop = T) %>%
slice_tail(n=1) %>%
ungroup()
df <- select(storagebottles, sku_id, sku_name)
# Tokenize and Porter-stem the titles to create the `stem` column used below;
# bind_tf_idf() comes from tidytext, wordStem() from SnowballC
p_load(tidytext, SnowballC)
df <- df %>%
  unnest_tokens(word, sku_name) %>%
  mutate(stem = wordStem(word))
count(df, sku_id, stem) %>%
  bind_tf_idf(term = stem,
              document = sku_id,
              n = n)
## # A tibble: 19,586 × 6
## sku_id stem n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 2251801564728378 0.5ml 1 0.0588 5.70 0.335
## 2 2251801564728378 1000pcs 1 0.0588 6.40 0.376
## 3 2251801564728378 32mm 1 0.0588 7.09 0.417
## 4 2251801564728378 8 1 0.0588 4.89 0.288
## 5 2251801564728378 bottl 1 0.0588 0.542 0.0319
## 6 2251801564728378 centrifug 1 0.0588 4.79 0.282
## 7 2251801564728378 clear 1 0.0588 2.29 0.135
## 8 2251801564728378 contain 1 0.0588 0.549 0.0323
## 9 2251801564728378 garden 1 0.0588 5.48 0.322
## 10 2251801564728378 home 1 0.0588 2.60 0.153
## # … with 19,576 more rows
2.2 Word embeddings
2.2.1 Based on BOW (bag-of-words)
This approach is a simplified representation of the information: a text is represented as the collection of its words, ignoring grammar and word order, while still recording the diversity of the content.
The representation can use raw term frequencies or TF-IDF:
# df is the tokenized, stemmed data from section 2.1;
# cast_dfm() comes from tidytext and returns a quanteda document-feature matrix
p_load(tidytext, quanteda)
count(df, sku_id, stem) %>%
  bind_tf_idf(term = stem,
              document = sku_id,
              n = n) %>%
  cast_dfm(document = sku_id,
           term = stem,
           value = tf_idf)
## Document-feature matrix of: 1,198 documents, 1,641 features (99.00% sparse) and 0 docvars.
## features
## docs 0.5ml 1000pcs 32mm 8 bottl
## 2251801564728378 0.3354185 0.3761919 0.4169652 0.2877167 0.03186020
## 2251801564729229 0 0.4263508 0 0 0.03610822
## 2251832228713647 0 0 0 0 0.02850649
## 2251832295192632 0 0 0 0 0
## 2251832346856028 0 0 0 0 0
## 2251832357989488 0 0 0 0 0
## features
## docs centrifug clear contain garden home
## 2251801564728378 0.2815190 0.1348599 0.03228370 0.3222924 0.1529278
## 2251801564729229 0.3190549 0 0.03658819 0.3652647 0
## 2251832228713647 0 0 0 0 0
## 2251832295192632 0 0 0 0 0
## 2251832346856028 0 0 0.03430143 0 0
## 2251832357989488 0 0 0.05226884 0 0
## [ reached max_ndoc ... 1,192 more documents, reached max_nfeat ... 1,631 more features ]
The text2vec package provides a powerful, high-performance implementation of the bag-of-words model:
p_load(text2vec)
df <- select(storagebottles, sku_id, sku_name)
# Preprocessing and tokenization
it <- itoken(df$sku_name,
# preprocessing function: convert to lowercase
preprocessor = tolower,
# tokenizer definition (word_tokenizer splits on non-word characters)
tokenizer = word_tokenizer,
ids = df$sku_id,
# whether to show a progress bar
progressbar = F)
# Build the vocabulary
vocab <- create_vocabulary(it)
# Create the vectorizer
vec <- vocab_vectorizer(vocab)
# Create the DTM, timing it as well
system.time({
dtm_train <- create_dtm(it = it,
vectorizer = vec)
})
##    user  system elapsed
##    0.00    0.03    0.03
2.2.2 Based on word2vec
word2vec is a family of natural language processing tools for generating word vectors. It is based on a shallow two-layer neural network; after training, it produces a vector space in which every word is assigned a vector. In this space, the closer two words are in meaning, the smaller the distance between their vectors, and vice versa. word2vec has two modes: CBOW and skip-gram.
p_load(word2vec)
mod <- word2vec(x = df$sku_name,
# dimensionality of the output vectors
dim = 10,
# number of iterations
iter = 20,
# whether to use the CBOW model or the skip-gram model
type = "cbow")
# Convert to a matrix
emb <- as.matrix(mod)
head(emb)
## [,1] [,2] [,3] [,4] [,5] [,6]
## Beads -1.9400728 -0.0888407 -0.08349484 -1.2562962 -0.2697616 -1.8421835
## Weekly -1.1608911 -0.4651599 -0.25436968 -1.4303011 -0.9386142 -1.7998306
## PVC 0.3128605 0.8036703 0.09885665 -2.0803137 0.7472407 -0.4748393
## Foaming -0.7814559 -0.2095719 0.88936573 0.2705340 -0.1661042 -1.4170694
## Flat 0.4067485 -0.7098307 -0.97712469 -0.8143641 0.9589156 -0.3076381
## Herbs -0.7869626 1.7579030 0.37374184 -1.6308879 -1.1516812 -0.4863058
## [,7] [,8] [,9] [,10]
## Beads -0.5022707 0.6688779 0.68726629 0.06747537
## Weekly -0.5683426 1.3080628 0.29630077 0.28856692
## PVC -1.0187222 1.5673116 0.50875384 0.61791813
## Foaming -2.1313248 0.8421388 -1.07562840 0.19196655
## Flat 0.2689168 1.7321179 0.43238384 1.85447288
## Herbs -0.1345760 -0.6410958 0.05552098 1.22308218
# The 5 words closest to "plastic"
predict(mod, c("plastic"), type = "nearest", top_n = 5)
## $plastic
## term1 term2 similarity rank
## 1 plastic transparent 0.9843451 1
## 2 plastic empty 0.9740745 2
## 3 plastic small 0.9626620 3
## 4 plastic perfume 0.9494769 4
## 5 plastic bottle 0.9455841 5
# Save the model; `path` is a user-defined file path
write.word2vec(mod, file = path)
# Load the model
read.word2vec(path)
2.2.3 Based on GloVe
GloVe is an unsupervised learning algorithm for obtaining word vector representations. Like BOW, it is based on word co-occurrence, but it preserves more contextual information from the surrounding words and generally achieves better vectorization results.
p_load(text2vec)
tokens <- df %>%
# tokenize on spaces
mutate(sku_name = tolower(sku_name)) %>%
mutate(token = space_tokenizer(sku_name)) %>%
pull(token)
# Set up the iterator
it <- itoken(df$sku_name, # the corpus
# convert to lowercase
preprocessor = tolower,
# tokenize on spaces
tokenizer = space_tokenizer,
ids = df$sku_id,
progressbar = F)
# Create the vocabulary, loading a stop word list
vocab <- create_vocabulary(it, stopwords = tm::stopwords())
# Keep only terms that occur at least 5 times
vocab <- prune_vocabulary(vocabulary = vocab,
term_count_min = 5L)
# Build the vectorizer
vectorizer <- vocab_vectorizer(vocabulary = vocab)
# Build the term co-occurrence matrix (TCM) with a window width of 5
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
# Set the word vector dimension to 50
glove <- GlobalVectors$new(rank = 50,
# cap on co-occurrence counts used for weighting
x_max = 10)
# 10 SGD iterations
wv_main <- glove$fit_transform(x = tcm, n_iter = 10,
convergence_tol = 0.01,
# defaults to all available threads when omitted
n_threads = 4)
## INFO [15:27:30.119] epoch 1, loss 0.3156
## INFO [15:27:30.137] epoch 2, loss 0.1382
## INFO [15:27:30.155] epoch 3, loss 0.1033
## INFO [15:27:30.172] epoch 4, loss 0.0841
## INFO [15:27:30.190] epoch 5, loss 0.0708
## INFO [15:27:30.208] epoch 6, loss 0.0610
## INFO [15:27:30.226] epoch 7, loss 0.0534
## INFO [15:27:30.243] epoch 8, loss 0.0473
## INFO [15:27:30.260] epoch 9, loss 0.0423
## INFO [15:27:30.277] epoch 10, loss 0.0382
wv_content <- glove$components
dim(wv_content)
## [1] 50 538
# Final result: combine the main and context vectors
word_vectors <- wv_main + t(wv_content)
2.2.4 Based on fastText
fastText handles tasks such as word embedding and text classification. Like GloVe, it is an extension of word2vec, but it uses a neural network to vectorize words and can learn character-level features.
The fastText model architecture is very similar to word2vec's CBOW model. The difference is that fastText predicts a label, while CBOW predicts the middle word.
For example, fastText can learn that "boy", "girl", "man", and "woman" refer to specific genders, and store these values in the relevant documents. When a program later receives a user request (say, "Where is my girlfriend?"), it can immediately look it up in the documents generated by fastText and understand that the user is asking a question about a woman.
p_load(text2vec)
# This kept failing: with R 4.2.1 you need to install Rtools 4.2 first, otherwise installation errors out
p_load_gh("pommedeterresautee/fastrtext")
# Build the text
txt <- df %>%
pull(sku_name) %>%
tolower() %>%
# strip punctuation (POSIX classes need double brackets: [[:punct:]])
str_remove_all("[[:punct:]]")
# Write the corpus to a file and run the vectorization
tmp_file_txt <- tempfile()
tmp_file_model <- tempfile()
writeLines(txt, con = tmp_file_txt)
execute(commands = c("skipgram", "-input", tmp_file_txt,
"-output", tmp_file_model,
"-verbose", 1))
## Read 0M words
## Number of words: 548
## Number of labels: 0
## Progress: 100.0% words/sec/thread: 34165 lr: 0.000000 avg.loss: 3.049489 ETA: 0h 0m 0s
# Load the model
model <- load_model(tmp_file_model)
# Get the dictionary
dict <- get_dictionary(model)
# Get the word vectors
word_vectors <- get_word_vectors(model)
# Free memory
unlink(tmp_file_txt)
unlink(tmp_file_model)
rm(model)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3840040 205.1 6683084 357.0 6683084 357.0
## Vcells 7261552 55.5 14786712 112.9 10146316 77.5
Arguments accepted by execute():
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-verbose verbosity level, default 2
Dictionary options:
-minCount minimal number of word occurrences, default 5
-minCountLabel minimal number of label occurrences, default 0
-wordNgrams max length of word ngram, default 1
-bucket number of buckets, default 2000000
-minn min length of char ngram, default 3
-maxn max length of char ngram, default 6
-t sampling threshold, default 0.0001
-label label prefix, default __label__
The following training arguments are optional:
-lr learning rate, default 0.05
-lrUpdateRate rate of updates for the learning rate, default 100
-dim dimension of word vectors, default 100
-ws size of the context window, default 5
-epoch number of epochs, default 5
-neg number of negatives sampled, default 5
-loss loss function {ns, hs, softmax}, default ns
-thread number of threads, default 12
-pretrainedVectors pretrained word vectors for supervised learning, default empty
-saveOutput whether to save output parameters, default 0
The following quantization arguments are optional:
-cutoff number of words and ngrams to retain, default 0
-retrain fine-tune embeddings if a cutoff is applied, default 0
-qnorm quantize the norm separately, default 0
-qout quantize the classifier, default 0
-dsub size of each sub-vector, default 2
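As a sketch of how these flags combine, here is a hypothetical call mixing a few of the optional arguments; the file paths are placeholders, not files from this tutorial:

```r
# Hypothetical invocation; replace the paths with real files
execute(commands = c("skipgram",
                     "-input", "corpus.txt",  # training file
                     "-output", "model",      # output path (writes model.bin / model.vec)
                     "-dim", 50,              # 50-dimensional word vectors
                     "-epoch", 10,            # 10 passes over the corpus
                     "-minCount", 2,          # drop words seen fewer than 2 times
                     "-thread", 4))           # 4 threads
```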
3丘损、文檔向量化
對(duì)文檔進(jìn)行分詞徘钥,然后利用獲得的詞向量,用文檔中所有詞匯向量進(jìn)行加和橱健,然后再除以一個(gè)標(biāo)量來獲得文檔表示向量拘荡。
p_load(textTinyR, text2vec)
tokens <- df %>%
mutate(sku_name = tolower(sku_name)) %>%
mutate(token = space_tokenizer(sku_name)) %>%
pull(token)
it <- itoken(tokens, progressbar = F)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
# 10 SGD iterations
wv_main <- glove$fit_transform(x = tcm, n_iter = 10)
## INFO [15:27:41.400] epoch 1, loss 0.3118
## INFO [15:27:41.417] epoch 2, loss 0.1380
## INFO [15:27:41.433] epoch 3, loss 0.1033
## INFO [15:27:41.450] epoch 4, loss 0.0842
## INFO [15:27:41.470] epoch 5, loss 0.0710
## INFO [15:27:41.486] epoch 6, loss 0.0613
## INFO [15:27:41.504] epoch 7, loss 0.0537
## INFO [15:27:41.520] epoch 8, loss 0.0476
## INFO [15:27:41.537] epoch 9, loss 0.0427
## INFO [15:27:41.553] epoch 10, loss 0.0385
wv_content <- glove$components
dim(wv_content)
## [1] 50 548
# Final result: combine the main and context vectors
word_vectors <- wv_main + t(wv_content)
# Save the word vectors
write.table(word_vectors, file = "wv.txt", col.names = F)
# Clean-up: strip quotation marks
readLines("wv.txt") %>%
str_remove_all("\\\"") %>%
writeLines("wv.txt")
# The documents to vectorize
tok_text <- tokens
# Document vectorization
init <- Doc2Vec$new(token_list = tok_text,
word_vector_FILE = "wv.txt")
# The "sum_sqrt" method first sums the word vectors into a single new vector,
# INITIAL_WORD_VECTOR, then takes the square root of the sum of squares of that
# vector to get a scalar k; INITIAL_WORD_VECTOR / k is the final document vector.
out <- init$doc2vec_methods(method = "sum_sqrt")
# Remove the saved word-vector file
unlink("wv.txt")
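The "sum_sqrt" computation can be reproduced by hand for a single document; a minimal sketch, assuming `word_vectors` and `tokens` from above are still in memory:

```r
# Manual sum_sqrt for the first document (assumes word_vectors / tokens exist)
doc_tokens <- tokens[[1]]
# keep only tokens that survived vocabulary pruning
doc_tokens <- doc_tokens[doc_tokens %in% rownames(word_vectors)]
v <- colSums(word_vectors[doc_tokens, , drop = FALSE])  # INITIAL_WORD_VECTOR
k <- sqrt(sum(v^2))                                     # scalar norm
doc_vec <- v / k                                        # final document vector
```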