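As many of you know, the TCGA database was updated this year: the old HTSeq - Counts workflow has been replaced by STAR - Counts. The download workflow in the TCGAbiolinks package changed slightly as a result. This post walks through the TCGAbiolinks download workflow again for reference.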
1. Load the R packages
library(TCGAbiolinks)
library(SummarizedExperiment)
These are the main packages needed. SummarizedExperiment is used to extract the different data types (counts, TPM, and so on).
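If either library() call fails, the package is probably not installed. The snippet below is a minimal sketch that only checks for the two packages and prints an install hint; it assumes you install Bioconductor packages through BiocManager, and the variable names are illustrative.

```r
# Packages used in this walkthrough
pkgs <- c("TCGAbiolinks", "SummarizedExperiment")

# requireNamespace() returns FALSE (instead of raising an error) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  # Only print a hint; nothing is installed automatically
  message("Missing packages: ", paste(not_installed, collapse = ", "))
  message("Try: BiocManager::install(c(",
          paste0('"', not_installed, '"', collapse = ", "), "))")
} else {
  message("All required packages are available.")
}
```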
2. Download the data
First, list the projects that TCGAbiolinks can download data for
> getGDCprojects()$project_id
[1] "EXCEPTIONAL_RESPONDERS-ER" "GENIE-GRCC"
[3] "GENIE-DFCI" "GENIE-NKI"
[5] "GENIE-VICC" "GENIE-UHN"
[7] "GENIE-MDA" "GENIE-MSK"
[9] "GENIE-JHU" "FM-AD"
[11] "OHSU-CNL" "MMRF-COMMPASS"
[13] "ORGANOID-PANCREATIC" "NCICCR-DLBCL"
[15] "VAREPOP-APOLLO" "CGCI-BLGSP"
[17] "BEATAML1.0-CRENOLANIB" "TRIO-CRU"
[19] "REBC-THYR" "TARGET-ALL-P2"
[21] "TARGET-ALL-P1" "CPTAC-2"
[23] "WCDT-MCRPC" "CMI-ASC"
[25] "TCGA-READ" "TCGA-UCS"
[27] "CMI-MPC" "CMI-MBC"
[29] "BEATAML1.0-COHORT" "TCGA-COAD"
[31] "TCGA-CESC" "TCGA-PAAD"
[33] "TCGA-ESCA" "TCGA-KIRP"
[35] "TCGA-PCPG" "TCGA-HNSC"
[37] "TCGA-BLCA" "TCGA-STAD"
[39] "CTSP-DLBCL1" "TCGA-SARC"
[41] "TCGA-CHOL" "TCGA-LAML"
[43] "TCGA-THYM" "TCGA-ACC"
[45] "TCGA-SKCM" "TCGA-LUAD"
[47] "TCGA-LIHC" "TCGA-KIRC"
[49] "TCGA-KICH" "TCGA-DLBC"
[51] "TCGA-PRAD" "TCGA-OV"
[53] "TCGA-MESO" "TCGA-LUSC"
[55] "TCGA-GBM" "TCGA-UVM"
[57] "TCGA-LGG" "HCMI-CMDC"
[59] "TCGA-BRCA" "TARGET-RT"
[61] "TARGET-CCSK" "TCGA-TGCT"
[63] "TARGET-NBL" "CPTAC-3"
[65] "CGCI-HTMCP-CC" "TARGET-ALL-P3"
[67] "TARGET-OS" "TARGET-AML"
[69] "TARGET-WT" "MP2PRT-WT"
[71] "TCGA-THCA" "TCGA-UCEC"
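The full list mixes TCGA projects with TARGET, CPTAC, GENIE and others. A small sketch of how to keep only the TCGA entries follows; the vector here is a hand-copied subset of the output above so the snippet runs offline, whereas in a live session it would come from getGDCprojects()$project_id.

```r
# Hand-copied subset of the project IDs printed above (for illustration only)
project_ids <- c("GENIE-MSK", "TCGA-READ", "TCGA-COAD", "TARGET-AML", "TCGA-BRCA")

# Keep only IDs that start with "TCGA-" and sort them alphabetically
tcga_ids <- sort(grep("^TCGA-", project_ids, value = TRUE))
print(tcga_ids)
```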
Colon cancer (TCGA-COAD) is used as the example here
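# Query the STAR - Counts gene expression files for TCGA-COAD
COAD <- GDCquery(project = "TCGA-COAD",
                 data.category = "Transcriptome Profiling",
                 data.type = "Gene Expression Quantification",
                 workflow.type = "STAR - Counts")
# Download the matched files through the GDC API
GDCdownload(COAD, method = "api")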
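Set the workflow.type argument to "STAR - Counts" whether you ultimately want counts, TPM or FPKM. How to extract each data type is covered further down.
After a long wait, the download finishes. By default the files are stored in the GDCdata folder under the current working directory; this can be changed with the directory argument of GDCdownload.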
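A minimal sketch of downloading into a custom folder is shown below. It is wrapped in a function so that nothing is downloaded until you call it. The function name download_coad and the folder name are made up for illustration, and the chunk size of 50 is an arbitrary choice. Note that when the files are not in the default GDCdata folder, the same directory should also be passed to GDCprepare so it can locate them.

```r
# Hypothetical helper: query, download and prepare TCGA-COAD data in a custom folder.
# Needs the TCGAbiolinks package and network access when it is actually called.
download_coad <- function(data_dir = file.path(getwd(), "TCGA_COAD_data")) {
  query <- TCGAbiolinks::GDCquery(
    project       = "TCGA-COAD",
    data.category = "Transcriptome Profiling",
    data.type     = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )

  # files.per.chunk splits the download into smaller requests,
  # which can help on an unstable connection
  TCGAbiolinks::GDCdownload(query, method = "api",
                            directory = data_dir, files.per.chunk = 50)

  # Read the files back from the same folder they were saved to
  TCGAbiolinks::GDCprepare(query = query, directory = data_dir)
}
```

Calling expr <- download_coad() would then return the same kind of object as the GDCprepare call in the next section.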
3. Merge and extract the data
expr <- GDCprepare(query=COAD)
This command merges the downloaded files into a single SummarizedExperiment object.
If you need the counts data, you can extract it directly from this object
count <- as.data.frame(assay(expr))
For data types other than counts, change the i argument in this step
TPM <- as.data.frame(assay(expr,i = "tpm_unstrand"))
The value of i for each data type is listed below:
Counts: i = "unstranded"
TPM: i = "tpm_unstrand"
FPKM: i = "fpkm_unstrand"
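To see how extraction by name behaves without downloading anything, here is a small sketch on a toy SummarizedExperiment. The gene names, sample names and values are invented; only the assay names mirror the list above. On a real object, assayNames(expr) shows which assays are actually available, and there may be more than these three.

```r
library(SummarizedExperiment)

# Toy 2-gene x 2-sample matrices standing in for a real GDCprepare() result
genes   <- c("geneA", "geneB")
samples <- c("sample1", "sample2")
make_mat <- function(values) {
  matrix(values, nrow = 2, dimnames = list(genes, samples))
}

toy <- SummarizedExperiment(assays = list(
  unstranded    = make_mat(c(10, 20, 30, 40)),
  tpm_unstrand  = make_mat(c(1.5, 2.5, 3.5, 4.5)),
  fpkm_unstrand = make_mat(c(0.5, 1.0, 1.5, 2.0))
))

# List every assay stored in the object
print(assayNames(toy))

# Extract one assay by name; the name must match exactly (no stray spaces)
tpm <- as.data.frame(assay(toy, "tpm_unstrand"))
print(tpm)
```

Calling assay() without i returns the first assay, which is why the earlier counts example works without naming "unstranded".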