數(shù)據(jù)分析:數(shù)據(jù)降維方法的實現(xiàn)

前言

隨著大數(shù)據(jù)時代的到來缎讼,我們常常需要對高緯度數(shù)據(jù)進行降維處理后再對數(shù)據(jù)整體結(jié)構(gòu)進行可視化收夸,進而發(fā)現(xiàn)數(shù)據(jù)內(nèi)在的區(qū)別,為后續(xù)數(shù)據(jù)分析提供初步的參考血崭。

這里主要介紹三類數(shù)據(jù)分析方法的R實現(xiàn)過程:

1.PCA主成分分析卧惜;

2.tSNE,基于概率密度分布計算能保留更多組間和個體差異的方法功氨;

3.UMAP序苏,非線性降維方法。

導入數(shù)據(jù)

options(warn = 0)
library(dplyr)
library(tibble)
library(ggplot2)
options(warn = 0)

ExprSet_LOG2Impute <- readRDS("Mass_Proteins_filtered_Normal_LOG2Impute.RDS")
subgrp <- c("NC", "PC", "PT")
grp.col <- c("#568875", "#73FAFC", "#EE853D")

Principal Component Analysis

PCAFun <- function(dataset = ExprSet_LOG2Impute ){
  
  # dataset = ExprSet_LOG2Impute 
  
  require(convert)
  metadata <- pData(dataset)
  profile <- exprs(dataset)
  
  # pca 
  pca <- prcomp(scale(t(profile), center = T, scale = T))
  require(factoextra)
  eig <- get_eig(pca)
  # explains variable 
  explains <- paste0(paste0("PC", seq(2)), "(", paste0(round(eig[1:2, 2], 2), "%"), ")")
  # principal component score of each sample
  score <- inner_join(pca$x %>% data.frame() %>% 
                        dplyr::select(c(1:2)) %>% 
                        rownames_to_column("SampleID"), 
                      metadata %>% rownames_to_column("SampleID"),
                      by = "SampleID") %>%
    mutate(Group=factor(Group, levels = grp))
  
  
  # PERMANOVA
  require(vegan)
  set.seed(123)
  if(any(profile < 0)){
    res_adonis <- adonis(vegdist(t(profile), method = "manhattan") ~ metadata$SubGroup, permutations = 999) 
  }else{
    res_adonis <- adonis(vegdist(t(profile), method = "bray") ~ metadata$SubGroup, permutations = 999)    
  }
  adn_pvalue <- res_adonis[[1]][["Pr(>F)"]][1]
  adn_rsquared <- round(res_adonis[[1]][["R2"]][1],3)
  #use the bquote function to format adonis results to be annotated on the ordination plot.
  signi_label <- paste(cut(adn_pvalue,breaks=c(-Inf, 0.001, 0.01, 0.05, Inf), label=c("***", "**", "*", ".")))
  adn_res_format <- bquote(atop(atop("PERMANOVA",R^2==~.(adn_rsquared)),
                                  atop("p-value="~.(adn_pvalue)~.(signi_label), phantom()))) 
  
  
  pl <- ggplot(score, aes(x=PC1, y=PC2))+
              geom_point(aes(fill=SubGroup), size=3.5, shape=21, stroke = .8, color = "black")+
              stat_ellipse(aes(color=SubGroup), level = 0.95, linetype = 1, size = 1.5)+
              labs(x=explains[1], y=explains[2])+
              scale_color_manual(values = grp.col)+
              scale_fill_manual(name = "Condition", 
                                values = grp.col)+
              annotate("text", x = max(score$PC1) - 8,
                       y = min(score$PC1),
                       label = adn_res_format,
                       size = 6)+ 
              guides(color=F)+
              theme_classic()+
              theme(axis.title = element_text(size = 10, color = "black", face = "bold"),
                    axis.text = element_text(size = 9, color = "black"),
                    text = element_text(size = 8, color = "black", family = "serif"),
                    strip.text = element_text(size = 9, color = "black", face = "bold"), 
                    panel.grid = element_blank(),
                    legend.title = element_text(size = 11, color = "black", family = "serif"),
                    legend.text = element_text(size = 10, color = "black", family = "serif"),
                    legend.position = c(0, 0),
                    legend.justification = c(0, 0),
                    legend.background = element_rect(color = "black", fill = "white", linetype = 2, size = 0.5))
  return(pl)
}

PCA_LOG2Impute <- PCAFun(dataset = ExprSet_LOG2Impute)
PCA_LOG2Impute 
ggsave("Mass_ProteinsLOG2Impute_PCA.pdf", PCA_LOG2Impute, width = 5, height = 5, dpi = 300)

Rtsne

RtsneFun <- function(dataset = ExprSet_LOG2Impute,
                     perpl = 10){
  
  # dataset = ExprSet_LOG2Impute
  # perpl = 10
  
  require(convert)
  metadata <- pData(dataset)
  profile <- t(exprs(dataset))
  
  # Rtsne
  require(Rtsne)
  #set.seed(123)
  Rtsne <- Rtsne(profile, 
                 dims=2, 
                 perplexity=perpl,
                 verbose=TRUE, 
                 max_iter=500, 
                 eta=200)
  point <- Rtsne$Y %>% data.frame() %>% 
    dplyr::select(c(1:2)) %>%
    setNames(c("tSNE1", "tSNE2"))
  rownames(point) <- rownames(profile)
  score <- inner_join(point %>% rownames_to_column("SampleID"), 
                      metadata %>% rownames_to_column("SampleID"),
                      by = "SampleID") %>%
    mutate(Group=factor(Group, levels = grp))
  
  # PERMANOVA
  require(vegan)
  set.seed(123)
  if(any(profile < 0)){
    res_adonis <- adonis(vegdist(profile, method = "manhattan") ~ metadata$SubGroup, permutations = 999) 
  }else{
    res_adonis <- adonis(vegdist(profile, method = "bray") ~ metadata$SubGroup, permutations = 999)    
  }
  adn_pvalue <- res_adonis[[1]][["Pr(>F)"]][1]
  adn_rsquared <- round(res_adonis[[1]][["R2"]][1],3)
  #use the bquote function to format adonis results to be annotated on the ordination plot.
  signi_label <- paste(cut(adn_pvalue,breaks=c(-Inf, 0.001, 0.01, 0.05, Inf), label=c("***", "**", "*", ".")))
  adn_res_format <- bquote(atop(atop("PERMANOVA",R^2==~.(adn_rsquared)),
                                  atop("p-value="~.(adn_pvalue)~.(signi_label), phantom()))) 
  
  pl <- ggplot(score, aes(x=tSNE1, y=tSNE2))+
              geom_point(aes(fill=SubGroup), size=3.5, shape=21, stroke = .8, color = "black")+
              stat_ellipse(aes(color=SubGroup), level = 0.95, linetype = 1, size = 1.5)+
              scale_color_manual(values = grp.col)+
              scale_fill_manual(name = "Condition", 
                                values = grp.col)+
              annotate("text", x = max(score$tSNE1) - 8,
                       y = max(score$tSNE2)-5,
                       label = adn_res_format,
                       size = 6)+ 
              guides(color=F)+
              theme_classic()+
              theme(axis.title = element_text(size = 10, color = "black", face = "bold"),
                    axis.text = element_text(size = 9, color = "black"),
                    text = element_text(size = 8, color = "black", family = "serif"),
                    strip.text = element_text(size = 9, color = "black", face = "bold"), 
                    panel.grid = element_blank(),
                    legend.title = element_text(size = 11, color = "black", family = "serif"),
                    legend.text = element_text(size = 10, color = "black", family = "serif"),
                    legend.position = c(0, 0),
                    legend.justification = c(0, 0),
                    legend.background = element_rect(color = "black", fill = "white", linetype = 2, size = 0.5))
  return(pl)
}

Rtsne_LOG2Impute <- RtsneFun(dataset = ExprSet_LOG2Impute, perpl = 10)
Rtsne_LOG2Impute
ggsave("Mass_ProteinsLOG2Impute_Rtsne.pdf", Rtsne_LOG2Impute, width = 5, height = 5, dpi = 300)

UMAP: a non-linear dimensionality reduction algorithm

UMAPFun <- function(dataset = ExprSet_LOG2Impute){
  
  # dataset = ExprSet_LOG2Impute
  
  require(convert)
  metadata <- pData(dataset)
  profile <- t(exprs(dataset))
  
  # umap 
  require(umap)
  umap <- umap::umap(profile)
  
  point <- umap$layout %>% data.frame() %>%
    setNames(c("UMAP1", "UMAP2"))
  rownames(point) <- rownames(profile)
  score <- inner_join(point %>% rownames_to_column("SampleID"), 
                      metadata %>% rownames_to_column("SampleID"),
                      by = "SampleID") %>%
    mutate(Group=factor(Group, levels = grp))
  
  # PERMANOVA
  require(vegan)
  set.seed(123)
  if(any(profile < 0)){
    res_adonis <- adonis(vegdist(profile, method = "manhattan") ~ metadata$SubGroup, permutations = 999) 
  }else{
    res_adonis <- adonis(vegdist(profile, method = "bray") ~ metadata$SubGroup, permutations = 999)    
  }
  adn_pvalue <- res_adonis[[1]][["Pr(>F)"]][1]
  adn_rsquared <- round(res_adonis[[1]][["R2"]][1],3)
  #use the bquote function to format adonis results to be annotated on the ordination plot.
  signi_label <- paste(cut(adn_pvalue,breaks=c(-Inf, 0.001, 0.01, 0.05, Inf), label=c("***", "**", "*", ".")))
  adn_res_format <- bquote(atop(atop("PERMANOVA",R^2==~.(adn_rsquared)),
                                  atop("p-value="~.(adn_pvalue)~.(signi_label), phantom())))   
  
  pl <- ggplot(score, aes(x=UMAP1, y=UMAP2))+
              geom_point(aes(fill=SubGroup), size=3.5, shape=21, stroke = .8, color = "black")+
              stat_ellipse(aes(color=SubGroup), level = 0.95, linetype = 1, size = 1.5)+
              scale_color_manual(values = grp.col)+
              scale_fill_manual(name = "Condition", 
                                values = grp.col)+
              annotate("text", x = max(score$UMAP1),
                       y = min(score$UMAP2),
                       label = adn_res_format,
                       size = 6)+ 
              guides(color=F)+
              theme_classic()+
              theme(axis.title = element_text(size = 10, color = "black", face = "bold"),
                    axis.text = element_text(size = 9, color = "black"),
                    text = element_text(size = 8, color = "black", family = "serif"),
                    strip.text = element_text(size = 9, color = "black", face = "bold"), 
                    panel.grid = element_blank(),
                    legend.title = element_text(size = 11, color = "black", family = "serif"),
                    legend.text = element_text(size = 10, color = "black", family = "serif"),
                    legend.position = c(0, 0),
                    legend.justification = c(0, 0),
                    legend.background = element_rect(color = "black", fill = "white", linetype = 2, size = 0.5))
  return(pl)
}

UMAP_LOG2Impute <- UMAPFun(dataset = ExprSet_LOG2Impute)
UMAP_LOG2Impute 
ggsave("Mass_ProteinsLOG2Impute_UMAP.pdf", UMAP_LOG2Impute, width = 5, height = 5, dpi = 300)

Notes: 三組數(shù)據(jù)的整體性差異如果較為明顯捷凄,一般可以在不同的降維方法上都有所體現(xiàn)忱详。從分組看,NC組和PC跺涤、PT組均存在明顯的差異匈睁,后續(xù)數(shù)據(jù)分析應(yīng)該著重關(guān)注 NC vs PC, NC vs PT,但也可以對PC vs PT做比較分析桶错。

Systemic information

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 8 (Core)

Matrix products: default
BLAS/LAPACK: /disk/share/anaconda3/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] umap_0.2.7.0                                        Rtsne_0.15                                         
 [3] vegan_2.5-6                                         lattice_0.20-41                                    
 [5] permute_0.9-5                                       factoextra_1.0.7                                   
 [7] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0 IlluminaHumanMethylationEPICmanifest_0.3.0         
 [9] minfi_1.36.0                                        bumphunter_1.32.0                                  
[11] locfit_1.5-9.4                                      iterators_1.0.13                                   
[13] foreach_1.5.1                                       Biostrings_2.58.0                                  
[15] XVector_0.30.0                                      SummarizedExperiment_1.20.0                        
[17] MatrixGenerics_1.2.1                                matrixStats_0.58.0                                 
[19] GenomicRanges_1.42.0                                GenomeInfoDb_1.26.4                                
[21] IRanges_2.24.1                                      S4Vectors_0.28.1                                   
[23] eulerr_6.1.0                                        UpSetR_1.4.0                                       
[25] ggplot2_3.3.3                                       DEP_1.12.0                                         
[27] convert_1.64.0                                      marray_1.68.0                                      
[29] limma_3.46.0                                        Biobase_2.50.0                                     
[31] BiocGenerics_0.36.0                                 data.table_1.14.0                                  
[33] tibble_3.1.0                                        dplyr_1.0.5                                        

loaded via a namespace (and not attached):
  [1] utf8_1.2.1                shinydashboard_0.7.1      reticulate_1.18           gmm_1.6-5                
  [5] tidyselect_1.1.0          RSQLite_2.2.5             AnnotationDbi_1.52.0      htmlwidgets_1.5.3        
  [9] grid_4.0.2                BiocParallel_1.24.1       norm_1.0-9.5              munsell_0.5.0            
 [13] codetools_0.2-18          preprocessCore_1.52.1     DT_0.18                   withr_2.4.1              
 [17] colorspace_2.0-0          knitr_1.31                rstudioapi_0.13           mzID_1.28.0              
 [21] labeling_0.4.2            GenomeInfoDbData_1.2.4    farver_2.1.0              bit64_4.0.5              
 [25] rhdf5_2.34.0              vctrs_0.3.7               generics_0.1.0            xfun_0.20                
 [29] BiocFileCache_1.14.0      R6_2.5.0                  doParallel_1.0.16         illuminaio_0.32.0        
 [33] clue_0.3-58               bitops_1.0-6              rhdf5filters_1.2.0        cachem_1.0.4             
 [37] reshape_0.8.8             DelayedArray_0.16.3       assertthat_0.2.1          promises_1.2.0.1         
 [41] scales_1.1.1              gtable_0.3.0              Cairo_1.5-12.2            affy_1.68.0              
 [45] sandwich_3.0-0            rlang_0.4.10              genefilter_1.72.0         mzR_2.24.1               
 [49] GlobalOptions_0.1.2       splines_4.0.2             rtracklayer_1.50.0        GEOquery_2.58.0          
 [53] impute_1.64.0             yaml_2.2.1                reshape2_1.4.4            BiocManager_1.30.12      
 [57] GenomicFeatures_1.42.2    httpuv_1.5.5              tools_4.0.2               nor1mix_1.3-0            
 [61] affyio_1.60.0             ellipsis_0.3.1            jquerylib_0.1.3           RColorBrewer_1.1-2       
 [65] siggenes_1.64.0           MSnbase_2.16.1            Rcpp_1.0.6                plyr_1.8.6               
 [69] sparseMatrixStats_1.2.1   progress_1.2.2            zlibbioc_1.36.0           purrr_0.3.4              
 [73] RCurl_1.98-1.3            prettyunits_1.1.1         openssl_1.4.3             GetoptLong_1.0.5         
 [77] cowplot_1.1.0             zoo_1.8-8                 ggrepel_0.9.1.9999        cluster_2.1.0            
 [81] tinytex_0.31              magrittr_2.0.1            RSpectra_0.16-0           circlize_0.4.10          
 [85] pcaMethods_1.80.0         mvtnorm_1.1-1             ProtGenerics_1.22.0       evaluate_0.14            
 [89] hms_1.0.0                 mime_0.10                 xtable_1.8-4              XML_3.99-0.6             
 [93] mclust_5.4.7              gridExtra_2.3             shape_1.4.5               compiler_4.0.2           
 [97] biomaRt_2.46.3            ncdf4_1.17                crayon_1.4.1              htmltools_0.5.1.1        
[101] mgcv_1.8-34               later_1.1.0.1             tidyr_1.1.3               DBI_1.1.1                
[105] dbplyr_2.1.1              ComplexHeatmap_2.6.2      MASS_7.3-53.1             tmvtnorm_1.4-10          
[109] rappdirs_0.3.3            Matrix_1.3-2              readr_1.4.0               cli_2.4.0                
[113] vsn_3.58.0                imputeLCMD_2.0            quadprog_1.5-8            pkgconfig_2.0.3          
[117] GenomicAlignments_1.26.0  MALDIquant_1.19.3         xml2_1.3.2                annotate_1.68.0          
[121] bslib_0.2.4               rngtools_1.5              multtest_2.46.0           beanplot_1.2             
[125] doRNG_1.8.2               scrime_1.3.5              stringr_1.4.0             digest_0.6.27            
[129] rmarkdown_2.7             base64_2.0                DelayedMatrixStats_1.12.3 curl_4.3                 
[133] shiny_1.6.0               Rsamtools_2.6.0           rjson_0.2.20              jsonlite_1.7.2           
[137] nlme_3.1-150              lifecycle_1.0.0           Rhdf5lib_1.12.1           askpass_1.1              
[141] fansi_0.4.2               pillar_1.6.0              fastmap_1.1.0             httr_1.4.2               
[145] survival_3.2-10           glue_1.4.2                png_0.1-7                 bit_4.0.4                
[149] sass_0.3.1                stringi_1.4.6             HDF5Array_1.18.1          blob_1.2.1               
[153] memoise_2.0.0   

Reference

  1. How to change Legend of ggplot2

  2. How to change ggplot facet labels

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
禁止轉(zhuǎn)載航唆,如需轉(zhuǎn)載請通過簡信或評論聯(lián)系作者。
  • 序言:七十年代末院刁,一起剝皮案震驚了整個濱河市糯钙,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌退腥,老刑警劉巖任岸,帶你破解...
    沈念sama閱讀 221,820評論 6 515
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場離奇詭異狡刘,居然都是意外死亡享潜,警方通過查閱死者的電腦和手機,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 94,648評論 3 399
  • 文/潘曉璐 我一進店門嗅蔬,熙熙樓的掌柜王于貴愁眉苦臉地迎上來剑按,“玉大人,你說我怎么就攤上這事澜术∫蘸” “怎么了?”我有些...
    開封第一講書人閱讀 168,324評論 0 360
  • 文/不壞的土叔 我叫張陵鸟废,是天一觀的道長猜敢。 經(jīng)常有香客問我,道長,這世上最難降的妖魔是什么锣枝? 我笑而不...
    開封第一講書人閱讀 59,714評論 1 297
  • 正文 為了忘掉前任,我火速辦了婚禮兰英,結(jié)果婚禮上撇叁,老公的妹妹穿的比我還像新娘。我一直安慰自己畦贸,他們只是感情好陨闹,可當我...
    茶點故事閱讀 68,724評論 6 397
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著薄坏,像睡著了一般趋厉。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上胶坠,一...
    開封第一講書人閱讀 52,328評論 1 310
  • 那天君账,我揣著相機與錄音,去河邊找鬼沈善。 笑死乡数,一個胖子當著我的面吹牛,可吹牛的內(nèi)容都是我干的闻牡。 我是一名探鬼主播净赴,決...
    沈念sama閱讀 40,897評論 3 421
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼罩润!你這毒婦竟也來了玖翅?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 39,804評論 0 276
  • 序言:老撾萬榮一對情侶失蹤割以,失蹤者是張志新(化名)和其女友劉穎金度,沒想到半個月后,有當?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體拳球,經(jīng)...
    沈念sama閱讀 46,345評論 1 318
  • 正文 獨居荒郊野嶺守林人離奇死亡审姓,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 38,431評論 3 340
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發(fā)現(xiàn)自己被綠了祝峻。 大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片魔吐。...
    茶點故事閱讀 40,561評論 1 352
  • 序言:一個原本活蹦亂跳的男人離奇死亡,死狀恐怖莱找,靈堂內(nèi)的尸體忽然破棺而出酬姆,到底是詐尸還是另有隱情,我是刑警寧澤奥溺,帶...
    沈念sama閱讀 36,238評論 5 350
  • 正文 年R本政府宣布辞色,位于F島的核電站,受9級特大地震影響浮定,放射性物質(zhì)發(fā)生泄漏相满。R本人自食惡果不足惜层亿,卻給世界環(huán)境...
    茶點故事閱讀 41,928評論 3 334
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望立美。 院中可真熱鬧匿又,春花似錦、人聲如沸建蹄。這莊子的主人今日做“春日...
    開封第一講書人閱讀 32,417評論 0 24
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽洞慎。三九已至痛单,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間劲腿,已是汗流浹背旭绒。 一陣腳步聲響...
    開封第一講書人閱讀 33,528評論 1 272
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留谆棱,地道東北人快压。 一個月前我還...
    沈念sama閱讀 48,983評論 3 376
  • 正文 我出身青樓,卻偏偏與公主長得像垃瞧,于是被迫代替她去往敵國和親蔫劣。 傳聞我的和親對象是個殘疾皇子,可洞房花燭夜當晚...
    茶點故事閱讀 45,573評論 2 359

推薦閱讀更多精彩內(nèi)容