寫在前面
之前寫了publish or perish的使用方法。那么問題來了坠七,我們獲取了這些數(shù)據(jù)能拿來干什么水醋?于是萌生了一個(gè)想法旗笔,不如統(tǒng)計(jì)一下引用TBtools的文章,統(tǒng)計(jì)一下每年文章發(fā)表數(shù)量(截止12月初)离例,發(fā)表的層次,用戶都用了哪些功能等等悉稠。當(dāng)初的想法是我一個(gè)滑鏟就全部搞定了宫蛆,但實(shí)現(xiàn)起來嘛……一干就是兩個(gè)周,并且短期內(nèi)我是再也不想看基因挖掘相關(guān)的任何東西了……統(tǒng)計(jì)到最后實(shí)在是想吐了佃声,如果你發(fā)現(xiàn)你發(fā)的文章被錯(cuò)誤的統(tǒng)計(jì)了甥温,或者沒統(tǒng)計(jì)到……歡迎私信告訴我认然,我一定不改(
數(shù)據(jù)導(dǎo)入和準(zhǔn)備工作
這里的話數(shù)據(jù)我已經(jīng)整理好了,如果你也感興趣的話點(diǎn)這里進(jìn)行下載(https://cowtransfer.com/s/6f63b578a5e34f)叛拷。
首先載入數(shù)據(jù)
#Wipe the environment
rm(list=ls())
#Set the work directory
setwd("/Users/yangshao/OneDrive")
#Import the data
dta <- read.csv("clean_cite.csv", header = T)
#Take a peek at the data
skim(dta)
#Dimension of the data
dim(dta)
PS: 這里是我的地址,不要無腦復(fù)制……
之后載入需要用到的包岂却,接下來就可以進(jìn)行分析了忿薇!如果沒有安裝對(duì)應(yīng)的包,先讓電腦安裝就好了躏哩。實(shí)在不會(huì)可以讓你師兄先來學(xué)署浩,然后教你(
library(skimr)
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(ggthemes)
library(tidytext)
library(stringr)
library(wordcloud)
library(networkD3)
越來越多的文章使用并引用TBtools
首先是每年引用文章數(shù)的可視化
#Citation per year
cite_trend <- sort(table(dta$Year), decreasing = TRUE)
cite_trend <- as.data.frame(cite_trend)
#Visualization
colnames(cite_trend) <- c('Year','Num.')
ggbarplot(cite_trend, x = "Year", fill = "Year", order = c('2018','2019','2020'), y = "Num.", palette = 'jco') +
scale_y_continuous(expand = c(0, 0))
從2018年至2020年,TBtools的每年引用數(shù)呈現(xiàn)上升趨勢(shì)(廢話)扫尺。截止2020年12月初筋栋,2020年的引用數(shù)已經(jīng)超過500次。需要說明的是正驻,這里的數(shù)據(jù)是相對(duì)比較保守的數(shù)據(jù)弊攘,實(shí)際引用數(shù)可能遠(yuǎn)不止這些。
TBtools的用戶都喜歡把文章發(fā)在誰家姑曙?
使用以下代碼完成可視化襟交,別問我為啥用ggpubr……
#Sort the top 15 publisher
cite_publisher <- head(sort(table(dta$Publisher), decreasing = TRUE),15)
#Transform the data to a dataframe
cite_publisher <- as.data.frame(cite_publisher)
#Rename the col names
colnames(cite_publisher) <- c('Publisher','Count')
#Visualization
ggbarplot(cite_publisher, x = "Publisher", fill = "Publisher", y = "Count") +
scale_y_continuous(expand = c(0, 0)) +
theme_pubr(
base_family = "",
border = FALSE,
margin = TRUE,
legend = 'none',
x.text.angle = 45
)
截止2020年12月初,mdpi發(fā)表了多數(shù)文章伤靠,施普林格(包含BMC)和愛斯維爾緊隨其后婿着。
三個(gè)出版商一共發(fā)表了538篇,占總數(shù)的65.1%醋界。
cite_publisher %>% head(3) %>% select(Count) %>% sum()
cite_publisher %>% head(3) %>% select(Count) %>% sum()/count(dta)
那么引用了TBtools的文章發(fā)表在了哪些雜志上呢竟宋?同樣的方式完成可視化,別問我為什么不用ggpubr……
#Sort the top 15 Journal
cite_source <- head(sort(table(dta$Source), decreasing = TRUE),15)
#Transform the data to a dataframe
cite_source <- as.data.frame(cite_source)
#Rename the col
colnames(cite_source) <- c('Journal Name','Count')
#Visualization
ggplot(cite_source,aes(x = cite_source$`Journal Name`, y = Count, fill = cite_source$`Journal Name`)) +
geom_bar(stat='identity') +
scale_y_continuous(expand = c(0, 0)) +
xlab("Journal Name") +
guides(fill=guide_legend(title="Journal Name")) +
theme_pubr(
base_family = "",
border = FALSE,
margin = TRUE
) + rremove("x.text")
從結(jié)果上看形纺,多數(shù)文章發(fā)在了OA雜志上丘侠,例如來自mdpi的International Journal of Molecular Sciences,例如來自Springer的BMC Plant Biology和BMC Genomics逐样,例如來自Frontier的Frontier in Genetics和Frontiers in Physiology蜗字,例如PeerJ, PloS等……按你胃打肝,有錢發(fā)OA真好……
用戶最愛用熱圖
每篇文章都會(huì)在文章里提到使用了TBtools的哪些功能。很遺憾這部分功能只能手動(dòng)登錄了挪捕,因?yàn)楹芏嗲闆r下粗梭,大家對(duì)同一個(gè)功能敘述存在差異,例如同樣是一個(gè)熱圖功能有的人說'the DEGs were visualized by TBtools'有的人會(huì)說'the heatmaps were used to show/draw 级零。断医。。奏纪。(TBtools鉴嗤, gihub)'或者是'the correlation coefficients were visualized by TBtools' 等等。還有Circos圖序调,很多時(shí)候難以分辨到底是Circos畫的還是用TBtools畫的醉锅,這種情況提到了TBtools索性都用Synteny代替了。還有一些懶得索性就說'the data were visualized by aaa, bbb, and TBtools' 的发绢,這種也只能自己手動(dòng)找硬耍。筆者已經(jīng)很認(rèn)真的過濾了,但人工做這方面工作難免會(huì)出錯(cuò)边酒,有些時(shí)候甚至看到某幾個(gè)圖就把幾個(gè)常用功能一列完事了……所以如果你發(fā)現(xiàn)我列舉的功能跟文章提到的有出入默垄,李姐萬歲……
在這里分析的時(shí)候需要先做成長(zhǎng)透視表(pivot_longer),然后選擇數(shù)據(jù)開始作圖甚纲,這里因?yàn)樾枰鋈瓮瑯拥膱D口锭,干脆就把表格的命令做成一個(gè)函數(shù),直接調(diào)用就好了介杆。這樣顯得不那么弱智(
#Create a pivot table
dta %>% select(Year,17:25) %>% pivot_longer(!Year,
names_to = 'Name',
values_to = 'Fun') -> Fun_count_pivot
#Function for counting
Fun_count <- function(year, Pivot){
table(Pivot %>% filter(Year == year) %>% .$Fun) %>%
as.data.frame() %>% .[-1,] -> Count_Year
colnames(Count_Year) <-c('Function', 'Count')
top_three <- tail(Count_Year[order(Count_Year$Count),],3)
ggbarplot(Count_Year, x = "Function", fill = "Function", y = "Count") +
scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0,.1))) +
theme_pubr(
base_family = "",
border = FALSE,
margin = TRUE
) + rremove("x.text") + geom_text(top_three, mapping = aes(label = Function), nudge_y = top_three$Count*.2) + geom_text(top_three, mapping = aes(label = Count), nudge_y = top_three$Count*.1)
}
#Visualization the mentioned functions in papers
Fun_count("2018",Fun_count_pivot)
Fun_count("2019",Fun_count_pivot)
Fun_count("2020",Fun_count_pivot)
從2018年到2020年鹃操,用戶使用TBtools的姿勢(shì)越來越多:2018年,主要提及的功能只有15個(gè)春哨,而到了2020年文章中提及的功能有29個(gè)荆隘。從2019年開始,TBtools用戶群體最多使用的三個(gè)功能為:'Heatmap', 'motif'和'Gene structure'赴背。
2018 | Count | 2019 | Count | 2020 | Count | |
---|---|---|---|---|---|---|
1 | motif | 8 | Heatmap | 71 | Heatmap | 315 |
2 | Synteny | 8 | motif | 66 | motif | 183 |
3 | Gene structure | 7 | Gene structure | 45 | Gene structure | 171 |
4 | GO | 7 | Synteny | 41 | Synteny | 102 |
5 | Heatmap | 6 | Gene location | 30 | Phylogenetic tree | 96 |
6 | Gene location | 5 | Phylogenetic tree | 22 | Gene location | 88 |
7 | KEGG | 5 | GO | 15 | Venn diagram | 43 |
8 | Venn diagram | 4 | Venn diagram | 14 | GO | 37 |
9 | Fasta extraction | 3 | Fasta extraction | 9 | Seqlogo | 35 |
10 | Phylogenetic tree | 3 | MEME | 8 | Fasta extraction | 21 |
11 | Seqlogo | 3 | blastv | 7 | Hierarchical clustering | 21 |
12 | blast | 2 | Ks&Ka | 6 | Ks&Ka | 21 |
13 | MEME | 2 | Seqlogo | 6 | KEGG | 13 |
14 | eFP | 1 | KEGG | 5 | Upset plot | 13 |
15 | Ks&Ka | 1 | Hierarchical clustering | 3 | MEME | 12 |
最多提及的功能可以通過以下代碼查看
Fun_sum <- function(year, Pivot){
table(Pivot %>% filter(Year == year) %>% .$Fun) %>%
as.data.frame() %>% .[-1,] -> Fun_sum
colnames(Fun_sum) <- c('Function', 'Count')
Fun_sum <- head(Fun_sum[order(-Fun_sum$Count),],15)
return(Fun_sum)
}
Count_2018 <- Fun_sum(2018,Fun_count_pivot)
Count_2019 <- Fun_sum(2019,Fun_count_pivot)
Count_2020 <- Fun_sum(2020,Fun_count_pivot)
data.frame("2018" = Count_2018,"2019" = Count_2019, "2020" = Count_2020) %>% view()
與2019年相比椰拒,今年的TBtools用戶更喜歡用TBtools做進(jìn)化樹,韋恩圖和fasta文件的相關(guān)操作凰荚。在2020年燃观,TBtools的多數(shù)功能確實(shí)都有用戶在用。尤其是陳博士重點(diǎn)照顧的Heatmap便瑟,gene structure annotation缆毁,motif和phylogenetic tree。常用到你不一起做就覺得少了點(diǎn)什么的程度到涂。當(dāng)然脊框,這也側(cè)面反映了今年家族挖掘的研究還是很流行……
使用了TBtools的文章都發(fā)在了什么水平雜志上
我知道IF這東西不好做橫向?qū)Ρ劝涠健V耙灿邢敕ㄗ鰝€(gè)TBtools到底有沒有幫助做著把文章發(fā)在更好的雜志上。但最后還是放棄了浇雹。說到底我認(rèn)為一篇文章能否發(fā)在更好的雜志上還得看你的文章是否有真東西沉御。工具是幫助你更好的展現(xiàn)你的想法,講好你的故事而不是幫你做一些花里胡哨的圖糊弄reviewer的昭灵。
所以在這里做一下TBtools這三年的IF趨勢(shì)吧吠裆。
IF_by_year <- dta %>% select(Year, Source, ImpactFactor)
IF_by_year$Year <- as.character(IF_by_year$Year)
IF_by_year$ImpactFactor <- as.numeric(IF_by_year$ImpactFactor)
gghistogram(IF_by_year, x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug = TRUE,add_density = TRUE, alpha = .8,xticks.by = 1)
gghistogram(IF_by_year, x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug = TRUE,add_density = TRUE, alpha = .8, facet.by = "Year",xticks.by = 1)
有意思的是,這三年TBtools的文章主要落在了IF=3-4的區(qū)間虎锚。而在2020年硫痰,IF的兩端出現(xiàn)了凸起衩婚。:一方面窜护,拿家族基因挖掘當(dāng)財(cái)富密碼的人越來越多。這塊相互競(jìng)爭(zhēng)激烈非春,一些文章不得不選擇低IF的雜志發(fā)表文章柱徙。而另一方面也有不少文章發(fā)表在了老牌期刊上,而這些期刊的IF常常落在6-7的區(qū)間奇昙。當(dāng)然這里有個(gè)例外:'Genomics'护侮,根據(jù)學(xué)之策公眾號(hào)的實(shí)時(shí)IF統(tǒng)計(jì)顯示,'Genomics'的實(shí)時(shí)IF實(shí)際在3.2左右储耐。我能說什么呢羊初,套路深啊……
dta %>% select(Year,Source, ImpactFactor) %>%
filter(ImpactFactor < 7 & ImpactFactor > 6 & Source != 'Genomics') %>% group_by(Source) %>%
summarise(Count = n()) %>% ggplot(aes(x=Source, y=Count, fill = Source)) +
geom_bar(stat='identity',show.legend = FALSE) +
scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0,.1))) +
coord_flip() + theme_few()
為了減少這部分的影響,這里剔除了這個(gè)'Genomics'期刊什湘。
IF_by_year %>%
filter(Source != 'Genomics') %>%
gghistogram(., x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug = TRUE,add_density = TRUE, alpha = .8,xticks.by = 1)
dta %>% select(Year,Source, ImpactFactor,17:25) %>%
filter(ImpactFactor > 6& Source != 'Genomics') %>% view()
這里再次看了下6分以上的文章都喜歡用TBtools的什么功能长赞。一看不得了,這可不是隨便用的啊闽撤,一看就是有備而來得哆。最多用的功能還是Heatmap。廣州這個(gè)地方肯定很熱……嗯哟旗,希望沒什么奇怪的聯(lián)想贩据。這里就不做圖了,感興趣的不妨自己試試闸餐。
TBtools用戶都喜歡研究什么
這個(gè)問題說真的我很難回答饱亮,畢竟做這個(gè)我不專業(yè)。大概琢磨了下分詞舍沙,弄了個(gè)詞云出來看看近尚。
dta %>% select(Year, Title) %>%
unnest_tokens(output = 'word', input = 'Title', token = "words", to_lower = TRUE) %>% #Text segmentation
select(word) %>%
anti_join(stop_words) %>% #delete the stop words
count(word, sort = TRUE) -> word_count
set.seed(2020) #for reproducibility
wordcloud(words = word_count$word,
freq = word_count$n,
min.freq = 1,
max.words = 500,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(8, "Dark2")
)
hmm,genome-wid and gene family expression analysis/identification...這還能說什么呢〕∏冢可漲點(diǎn)薪吧戈锻。
最后歼跟,因?yàn)橹参锟诔錾硭钥匆幌露加昧耸裁粗参镒鲅芯俊?/p>
get_source <- function(name){
dta %>% select(Title, ImpactFactor) %>%
mutate(lower_title = tolower(Title)) %>%
filter(str_detect(lower_title, name)) %>%
select(ImpactFactor)
}
list <- c('cotton','arabidopsis', 'tea','potato','rice','tomato','wheat','soybean','cucumber','barley','maize','tobacco','brachypodium','pepper','radish')
list_fruit <- c('pear','apple','citrus','cucumis','sugarcane','strawberry','dimocarpus','grape','litchi','banana','watermelon','jujube','peach','prunus')
index <- data.frame(list.i. = '', n = '')
for (i in 1:length(list)){
count <- get_source(list[i]) %>% count()
index_i <- data.frame(list[i],count)
index <- rbind(index, index_i)
}
index <- index[-1,]
colnames(index) <- c('Species name','Count')
index$Count <- as.numeric(index$Count)
index <- index[order(-index$Count),]
view(index)
Species name | Count |
---|---|
wheat | 56 |
tea | 36 |
cotton | 27 |
rice | 26 |
soybean | 18 |
cucumber | 18 |
potato | 15 |
arabidopsis | 12 |
tomato | 10 |
barley | 10 |
maize | 9 |
tobacco | 6 |
brachypodium | 5 |
pepper | 5 |
radish | 5 |
嗨,我覺得該改標(biāo)題了格遭,就叫什么哈街?研究小麥的居然還不知道TBtools?好了……本來還想一個(gè)滑鏟把家族基因統(tǒng)計(jì)了,我怕又鏟到溝里去拒迅,算了……另外還有個(gè)包含水果的list(我知道這么分類不專業(yè)骚秦,別罵了別罵了),感興趣的可以自己看看璧微。
寫在最后
沒啥想說的了作箍,最后搞個(gè)山雞圖和熱圖祝大家科研順利吧……
#Create a new dataset
dta %>% filter(Source != 'Genomics') %>%
select(Year, Publisher, Source, ImpactFactor, 17:25) %>%
pivot_longer(cols = starts_with('Fun_'),
names_to = 'Name',
values_to = 'Fun') %>%
select(Year, Publisher, Source, ImpactFactor, Fun) %>%
filter(Fun != '') -> sankey_dta
#Add an IF column
sankey_dta %>%
mutate(IF = ifelse(ImpactFactor > 6,'General Impact',
ifelse(ImpactFactor < 2, 'Have Impact',
ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific Impact',"no")))) %>%
select(Year,Publisher,Source,ImpactFactor, IF,Fun) -> sankey_dta
sankey_dta %>% group_by(Source,ImpactFactor) %>% filter(ImpactFactor>5|n()>10) -> sankey_dta
sankey_dta %>% select(Year,Publisher) %>%
group_by(Year,Publisher) %>%
summarise(value = n()) -> sankey_index
colnames(sankey_index) <- c('source','target','value')
sankey_index$source <- as.character(sankey_index$source)
sankey_index <- sankey_dta %>% select(Publisher,Source) %>%
group_by(Publisher,Source) %>%
summarise(value = n()) %>%
rename(.,source = Publisher, target = Source) %>%
rbind(sankey_index,.)
sankey_dta %>% select(Source,ImpactFactor) %>%
mutate(IF = ifelse(ImpactFactor > 6,'General_Impact',
ifelse(ImpactFactor < 2, 'Have_Impact',
ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific_Impact',"no")))) %>%
group_by(Source,IF) %>%
summarise(value = n()) %>% rename(.,source = Source, target = IF) %>%
rbind(sankey_index,.) -> sankey_index
sankey_dta %>% select(ImpactFactor,Fun) %>%
mutate(IF = ifelse(ImpactFactor > 6,'General_Impact',
ifelse(ImpactFactor < 2, 'Have_Impact',
ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific_Impact',"no")))) %>%
group_by(IF,Fun) %>%
summarise(value = n()) %>% rename(.,source = IF, target = Fun) %>%
rbind(sankey_index,.) -> sankey_index
sankey_index[order(sankey_index$source,sankey_index$target),]
nodes <- data.frame(
name=c(as.character(sankey_index$source), as.character(sankey_index$target)) %>%
unique()
)
sankey_index$IDsource <- match(sankey_index$source, nodes$name)-1
sankey_index$IDtarget <- match(sankey_index$target, nodes$name)-1
sankey_index[order(sankey_index$IDsource,sankey_index$IDtarget),]
p <- sankeyNetwork(Links = sankey_index, Nodes = nodes, Source = "IDsource", Target = "IDtarget",
Value = "value", NodeID = "name", fontSize = 12, nodeWidth = 20,LinkGroup="source")
p
熱圖當(dāng)然用TBtools做了。最近天氣好冷前硫,希望能熱起來胞得。嗯