TCGA 01 TCGAbiolinks下載轉(zhuǎn)錄組數(shù)據(jù)差異分析

1. 準(zhǔn)備

安裝和導(dǎo)入TCGAbiolinks包

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("TCGAbiolinks")

library(TCGAbiolinks)
library(dplyr)
library(DT)

查看各個癌種的項(xiàng)目id遣耍,總共有65個ID值

getGDCprojects()$project_id
# [1] "GENIE-MSK"             "TCGA-UCEC"             "TCGA-ACC"             
# [4] "TCGA-LGG"              "TCGA-SARC"             "TCGA-PAAD"            
# [7] "TCGA-ESCA"             "TCGA-LUAD"             "TCGA-PRAD"            
# [10] "GENIE-VICC"            "TCGA-LAML"             "TCGA-KIRC"            
# [13] "TCGA-COAD"             "TCGA-PCPG"             "TCGA-HNSC"            
# [16] "BEATAML1.0-COHORT"     "BEATAML1.0-CRENOLANIB" "GENIE-JHU"            
# [19] "MMRF-COMMPASS"         "TCGA-LUSC"             "TCGA-OV"              
# [22] "TCGA-GBM"              "TCGA-UCS"              "WCDT-MCRPC"           
# [25] "TCGA-MESO"             "TCGA-TGCT"             "TCGA-KICH"            
# [28] "TCGA-READ"             "TCGA-UVM"              "TCGA-STAD"            
# [31] "TCGA-THCA"             "OHSU-CNL"              "GENIE-DFCI"           
# [34] "GENIE-NKI"             "ORGANOID-PANCREATIC"   "VAREPOP-APOLLO"       
# [37] "GENIE-GRCC"            "FM-AD"                 "TARGET-ALL-P1"        
# [40] "TARGET-ALL-P2"         "TARGET-ALL-P3"         "TARGET-RT"            
# [43] "CTSP-DLBCL1"           "GENIE-UHN"             "GENIE-MDA"            
# [46] "TARGET-AML"            "TARGET-NBL"            "TARGET-WT"            
# [49] "NCICCR-DLBCL"          "TCGA-LIHC"             "TCGA-THYM"            
# [52] "TCGA-CHOL"             "TCGA-SKCM"             "TARGET-CCSK"          
# [55] "TARGET-OS"             "TCGA-DLBC"             "TCGA-CESC"            
# [58] "TCGA-BRCA"             "TCGA-KIRP"             "TCGA-BLCA"            
# [61] "CGCI-HTMCP-CC"         "HCMI-CMDC"             "CGCI-BLGSP"           
# [64] "CPTAC-2"               "CPTAC-3" 

查看project中有哪些數(shù)據(jù)類型舷嗡,如查詢"TCGA-LUAD",有7種數(shù)據(jù)類型袭灯,case_count為病人數(shù)应狱,file_count為對應(yīng)的文件數(shù)质欲,file_size為總文件大小树埠,若要下載表達(dá)譜,可以設(shè)置參數(shù)data.category="Transcriptome Profiling"

TCGAbiolinks:::getProjectSummary("TCGA-LUAD")
# $case_count
# [1] 585
# 
# $file_count
# [1] 18162
# 
# $file_size
# [1] 2.304064e+13
# 
# $data_categories
#   case_count file_count               data_category
# 1        519       2916     Transcriptome Profiling
# 2        569       5368 Simple Nucleotide Variation
# 3        585       2731                 Biospecimen
# 4        585        623                    Clinical
# 5        579        657             DNA Methylation
# 6        518       3383       Copy Number Variation
# 7        582       2484            Sequencing Reads

2. 下載LUAD肺腺癌的轉(zhuǎn)錄組數(shù)據(jù)

建立查詢

query <- GDCquery(project = "TCGA-LUAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")

查看所有樣本編號

# 594個barcode
samplesDown <- getResults(query,cols=c("cases"))

從結(jié)果中篩選出腫瘤樣本barcode:TP(primary solid tumor)

# 533個barcode
dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")

從結(jié)果中篩選出NT(正常組織)樣本的barcode:NT()

# 59個barcode
dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "NT")

還有2個復(fù)發(fā)的樣本barcode:NR(Recurrent Solid Tumor)

# 2個barcode
dataSmTR <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TR")

重新按照樣本分組建立查詢

queryDown <- GDCquery(project = "TCGA-LUAD", 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - Counts", 
                      barcode = c(dataSmTP, dataSmNT)
                      )

下載查詢到的數(shù)據(jù)嘶伟,默認(rèn)存放位置為當(dāng)前工作目錄下的GDCdata文件夾中

GDCdownload(query = queryDown, files.per.chunk=6, method ="api", 
            directory ="lung_cancer")

3. 數(shù)據(jù)讀入與預(yù)處理

讀取下載的數(shù)據(jù)并將其準(zhǔn)備到R對象中怎憋,在工作目錄生成luad_case.rda文件,同時還產(chǎn)生Human_genes__GRCh38_p12_.rda文件(project文件)

dataPrep1 <- GDCprepare(query = queryDown, save = TRUE, 
                        save.filename = "luad_cases.rda", 
                        directory ="lung_cancer")
# Starting to add information to samples
# => Add clinical information to samples
# => Adding TCGA molecular information from marker papers
# => Information will have prefix 'paper_' 
# luad subtype information from:doi:10.1038/nature13385
# Accessing www.ensembl.org to get gene information
# Downloading genome information (try:0) Using: Human genes (GRCh38.p13)
# From the 60483 genes we couldn't map 3990(不能mapping則自動刪除該基因)
# 去除dataPrep1中的異常值九昧,dataPrep數(shù)據(jù)中含有腫瘤組織和正常組織的數(shù)據(jù)

# save(dataPrep1, file="dataPrep1_LUAD_TP_TN.RData")
rm(list=ls())
load("dataPrep1_LUAD_TP_TN.RData")

TCGAanalyze_Preprocessing執(zhí)行不同樣本表達(dá)矩陣之間相關(guān)性(Array Array Intensity correlation绊袋,AAIC)分析。根據(jù)樣本之間Spearman相關(guān)系數(shù)的平方對稱矩陣和箱線圖铸鹰,找到具有低相關(guān)性的樣本癌别,這些樣本可以被識別為可能的異常值。使用spearman相關(guān)系數(shù)去除數(shù)據(jù)中的異常值掉奄,一般設(shè)置閾值為0.6规个。在該次分析中無異常樣本

# 無樣本被刪除
dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep1, 
                                      cor.cut = 0.6,
                                      datatype = "HTSeq - Counts",
                                      filename = "AAIC.png")

將數(shù)據(jù)進(jìn)行標(biāo)準(zhǔn)化使得不同樣本之間具有可比性,使用包含EDASeq包功能的TCGAanalyze_Normalization函數(shù)姓建,將mRNA標(biāo)準(zhǔn)化,此功能實(shí)現(xiàn)泳道內(nèi)歸一化程序缤苫,以針對GC含量效應(yīng)(或其他基因水平的效應(yīng)):全局縮放和分位數(shù)歸一化速兔。

dataNorm.luad <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                           geneInfo = geneInfo,
                                           method = "gcContent")

使用TCGAanalyze_Filtering對低表達(dá)量的mRNA進(jìn)行過濾,method = "quantile"時為篩選出所有樣本中均值小于所有樣本中rowMeans qnt.cut = 0.25的分位數(shù)(默認(rèn)為25%的分位數(shù))基因活玲,即只保留表達(dá)量為前75%的基因涣狗。參考:The filtering of low-expression genes of RNA-seq data

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm.luad,
                                  method = "quantile", 
                                  qnt.cut =  0.25)
save(dataFilt, file="dataFilt.RData")
load("dataFilt.RData")

4. 分組及差異分析

選擇NT組的前10和TP組的前10個樣本進(jìn)行差異分析

# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                   typesample = c("NT"))
# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), 
                                   typesample = c("TP"))
# Diff.expr.analysis (DEA)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[,samplesNT[1:10]],
                            mat2 = dataFilt[,samplesTP[1:10]],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
#                            fdr.cut = 0.01 ,
#                            logFC.cut = 1,
                            method = "glmLRT")
# Batch correction skipped since no factors provided
# ----------------------- DEA -------------------------------
#   there are Cond1 type Normal in  10 samples
# there are Cond2 type Tumor in  10 samples
# there are  12980 features as miRNA or genes 
# I Need about  8.7 seconds for this DEA. [Processing 30k elements /s]  
# ----------------------- END DEA -------------------------------

DEG.LUAD.edgeR <- TCGAanalyze_DEA(mat1 = dataFilt[,samplesNT[1:10]],
                                  mat2 = dataFilt[,samplesTP[1:10]],
                                  pipeline="edgeR",
                                  batch.factors = c("TSS"),
                                  Cond1type = "Normal",
                                  Cond2type = "Tumor",
                                  voom =FALSE, 
                                  method = "glmLRT",
                                  # fdr.cut = 0.01,  #保留FDR<0.01的基因
                                  # logFC.cut = 1 #保留logFC>1的基因
)
# ----------------------- DEA -------------------------------
#   there are Cond1 type Normal in  10 samples
# there are Cond2 type Tumor in  10 samples
# there are  12980 features as miRNA or genes 
# I Need about  8.7 seconds for this DEA. [Processing 30k elements /s]  
# ----------------------- END DEA -------------------------------

5. 繪制火山圖和熱圖

火山圖繪制

valcano_data <- data.frame(genes=rownames(DEG.LUAD.edgeR), 
                           logFC=DEG.LUAD.edgeR$logFC, 
                           FDR=DEG.LUAD.edgeR$FDR,
                           group=rep("NotSignificant", 
                                     nrow(DEG.LUAD.edgeR)),
                           stringsAsFactors = F)

valcano_data[which(valcano_data['FDR'] < 0.05 & 
                     valcano_data['logFC'] > 1.5),"group"] <- "Increased"
valcano_data[which(valcano_data['FDR'] < 0.05 &
                     valcano_data['logFC'] < -1.5),"group"] <- "Decreased"

cols = c("darkgrey","#00B2FF","orange")
names(cols) = c("NotSignificant","Increased","Decreased")

library(ggplot2)
ggplot(valcano_data, aes(x = logFC, y = -log10(FDR), color = group))+
  scale_colour_manual(values = cols) +
  ggtitle(label = "Volcano Plot", subtitle = "LUAD 10 samples volcano plot") +
  geom_point(size = 2.5, alpha = 1, na.rm = T) +
  theme_bw(base_size = 14) + 
  theme(legend.position = "right") + 
  xlab(expression(log[2]("logFC"))) + 
  ylab(expression(-log[10]("FDR"))) +
  geom_hline(yintercept = 1.30102, colour="#990000", linetype="dashed") + 
  geom_vline(xintercept = 1.5849, colour="#990000", linetype="dashed") + 
  geom_vline(xintercept = -1.5849, colour="#990000", linetype="dashed")+ 
  scale_y_continuous(trans = "log1p")

火山圖

在火山圖中標(biāo)注特定基因如CALCA和MUC5B基因?yàn)榧t色

valcano_data_specific_genes = valcano_data[which(valcano_data$genes %in% 
                                          c("CALCA", "MUC5B")), ]

cols = c("darkgrey","#00B2FF","orange", "red")
names(cols) = c("NotSignificant","Increased","Decreased", "Specific")

library(ggplot2)
ggplot(valcano_data, aes(x = logFC, y = -log10(FDR), color = group))+
  scale_colour_manual(values = cols) +
  ggtitle(label = "Volcano Plot", subtitle = "LUAD 10 samples volcano plot") +
  geom_point(size = 2.5, alpha = 1, na.rm = T) +
  theme_bw(base_size = 14) + 
  theme(legend.position = "right") + 
  xlab(expression(log[2]("logFC"))) + 
  ylab(expression(-log[10]("FDR"))) +
  geom_hline(yintercept = 1.30102, colour="#990000", linetype="dashed") + 
  geom_vline(xintercept = 1.5849, colour="#990000", linetype="dashed") + 
  geom_vline(xintercept = -1.5849, colour="#990000", linetype="dashed")+ 
  scale_y_continuous(trans = "log1p") + 
  geom_point(data=valcano_data_specific_genes, col="red", size=2)
火山圖——帶標(biāo)注

合并平均表達(dá)量與差異基因表格(方便查詢)

#獲取腫瘤組織對應(yīng)的表達(dá)矩陣
dataTP <- dataFilt[,samplesTP[1:10]]

#獲取正常組織對應(yīng)的表達(dá)矩陣
dataTN <- dataFilt[,samplesNT[1:10]]

# 合并表格,增加2組每個基因的平均表達(dá)量
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(FC_FDR_table_mRNA=dataDEGsFilt,
                                          typeCond1="Normal",
                                          typeCond2="Tumor", 
                                          TableCond1=dataTN,
                                          TableCond2=dataTP)
#查看結(jié)果
head(dataDEGsFiltLevel,2)
# mRNA     logFC         FDR   Normal    Tumor     Delta
# SFTPC   SFTPC -9.656391 0.001493513 850460.0  76090.8 8212374.1
# SFTPA2 SFTPA2 -1.755048 0.277785397 521998.4 154404.7  916132.2
# CALCA CALCA   5.803164    0.2678382   22.8    9511.5  132.3121
# MUC5B MUC5B   5.2703226   1.827029e-03    2722.7  49155.1 14349.507227

熱圖繪制

library(pheatmap)
t <- dataFilt[valcano_data$genes[which(valcano_data['FDR'] < 0.05 & 
                            abs(valcano_data['logFC']) >1.5)],
              c(samplesTP[1:10], samplesNT[1:10])]

t <- na.omit(t)
t <- log2(t+1)
col <- data.frame(Type=c(rep("Tumor", 10), rep("Normal", 10)))
col$Type <- factor(col$Type, levels = c("Normal", "Tumor"))
rownames(col) <- colnames(t)
pheatmap(t,
         annotation_col = col,
         annotation_colors = list(Type=c(Tumor="red", Normal="#00B2FF")),
         annotation_legend = T,
         border_color = "black",
         cluster_rows = T,
         cluster_cols = T,
         show_colnames = F,
         show_rownames = F,
         color = colorRampPalette(c("green", "black", "red"))(50))
熱圖
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末舒憾,一起剝皮案震驚了整個濱河市镀钓,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌镀迂,老刑警劉巖丁溅,帶你破解...
    沈念sama閱讀 206,013評論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場離奇詭異探遵,居然都是意外死亡窟赏,警方通過查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 88,205評論 2 382
  • 文/潘曉璐 我一進(jìn)店門箱季,熙熙樓的掌柜王于貴愁眉苦臉地迎上來涯穷,“玉大人,你說我怎么就攤上這事藏雏】娇觯” “怎么了?”我有些...
    開封第一講書人閱讀 152,370評論 0 342
  • 文/不壞的土叔 我叫張陵,是天一觀的道長赚瘦。 經(jīng)常有香客問我粟誓,道長,這世上最難降的妖魔是什么蚤告? 我笑而不...
    開封第一講書人閱讀 55,168評論 1 278
  • 正文 為了忘掉前任努酸,我火速辦了婚禮,結(jié)果婚禮上杜恰,老公的妹妹穿的比我還像新娘获诈。我一直安慰自己,他們只是感情好心褐,可當(dāng)我...
    茶點(diǎn)故事閱讀 64,153評論 5 371
  • 文/花漫 我一把揭開白布舔涎。 她就那樣靜靜地躺著,像睡著了一般逗爹。 火紅的嫁衣襯著肌膚如雪亡嫌。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 48,954評論 1 283
  • 那天掘而,我揣著相機(jī)與錄音挟冠,去河邊找鬼。 笑死袍睡,一個胖子當(dāng)著我的面吹牛知染,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播斑胜,決...
    沈念sama閱讀 38,271評論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼控淡,長吁一口氣:“原來是場噩夢啊……” “哼!你這毒婦竟也來了止潘?” 一聲冷哼從身側(cè)響起掺炭,我...
    開封第一講書人閱讀 36,916評論 0 259
  • 序言:老撾萬榮一對情侶失蹤,失蹤者是張志新(化名)和其女友劉穎凭戴,沒想到半個月后涧狮,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 43,382評論 1 300
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡簇宽,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 35,877評論 2 323
  • 正文 我和宋清朗相戀三年勋篓,在試婚紗的時候發(fā)現(xiàn)自己被綠了。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片魏割。...
    茶點(diǎn)故事閱讀 37,989評論 1 333
  • 序言:一個原本活蹦亂跳的男人離奇死亡譬嚣,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出钞它,到底是詐尸還是另有隱情拜银,我是刑警寧澤殊鞭,帶...
    沈念sama閱讀 33,624評論 4 322
  • 正文 年R本政府宣布,位于F島的核電站尼桶,受9級特大地震影響操灿,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜泵督,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 39,209評論 3 307
  • 文/蒙蒙 一趾盐、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧小腊,春花似錦救鲤、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,199評論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至入问,卻和暖如春丹锹,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背芬失。 一陣腳步聲響...
    開封第一講書人閱讀 31,418評論 1 260
  • 我被黑心中介騙來泰國打工楣黍, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留,地道東北人棱烂。 一個月前我還...
    沈念sama閱讀 45,401評論 2 352
  • 正文 我出身青樓锡凝,卻偏偏與公主長得像,于是被迫代替她去往敵國和親垢啼。 傳聞我的和親對象是個殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 42,700評論 2 345