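12/2 Study notes (TCGA)
I tidied up my earlier notes and then practiced downloading TCGA data, building on the processing of 24 papers that the teacher completed last time and following the videos posted by teacher Jianming.
TCGA data download
Installing the packages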
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")  # install the key package "TCGAbiolinks"
library(TCGAbiolinks)  # GDCquery() is a function inside TCGAbiolinks, not a separate package to install
Before downloading anything, I looked through the arguments of the GDCquery function.
GDCquery has 11 arguments in total:
1. project;  # TCGAbiolinks:::getGDCprojects()$project_id lists the project IDs of all cancer types, 45 IDs in total.
2. data.category;
3. data.type;
4. workflow.type;
5. legacy = FALSE;
6. access;
7. platform;
8. file.type;
9. barcode;
10. experimental.strategy;
11. sample.type
About the arguments
1. project
TCGAbiolinks:::getGDCprojects()$project_id lists the project IDs of all cancer types; the output below has 45 IDs.
For example, the liver cancer project I want to download is project = "TCGA-LIHC".
TCGAbiolinks:::getGDCprojects()$project_id
[1] "TCGA-READ" "TARGET-CCSK" "TCGA-MESO" "TCGA-CHOL"
[5] "NCICCR-DLBCL" "TARGET-WT" "TCGA-TGCT" "TCGA-PRAD"
[9] "TCGA-LAML" "TCGA-ESCA" "TCGA-SARC" "TCGA-ACC"
[13] "TCGA-PAAD" "TCGA-BLCA" "TCGA-KICH" "FM-AD"
[17] "TCGA-LUSC" "TCGA-THYM" "TCGA-GBM" "TCGA-UCEC"
[21] "TCGA-COAD" "TCGA-LUAD" "TARGET-AML" "TARGET-NBL"
[25] "TCGA-DLBC" "TCGA-UVM" "TCGA-THCA" "TARGET-OS"
[29] "TCGA-LGG" "TCGA-STAD" "TCGA-LIHC" "TCGA-CESC"
[33] "TCGA-HNSC" "TCGA-KIRC" "VAREPOP-APOLLO" "TCGA-SKCM"
[37] "TCGA-BRCA" "TCGA-OV" "TCGA-PCPG" "CTSP-DLBCL1"
[41] "TCGA-UCS" "CPTAC-3" "TCGA-KIRP" "TARGET-RT"
[45] "TARGET-ALL-P3"
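As a small sketch of working with this list, the IDs can be filtered by prefix in base R. The vector below is a hand-copied subset of the output above, not a live query; with the package loaded, the result of getGDCprojects()$project_id could presumably be used in its place.

```r
# A hand-copied subset of the project IDs printed above (not a live query).
project_ids <- c("TCGA-READ", "TARGET-CCSK", "TCGA-MESO", "TCGA-CHOL",
                 "NCICCR-DLBCL", "TARGET-WT", "TCGA-LIHC", "FM-AD")

# Keep only the IDs that belong to the TCGA program.
tcga_ids <- grep("^TCGA-", project_ids, value = TRUE)

# Check that the project we want to download is in the list.
has_lihc <- "TCGA-LIHC" %in% tcga_ids

print(tcga_ids)
print(has_lihc)
```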
2. data.category
TCGAbiolinks:::getProjectSummary(project) shows which data categories a project contains. Querying "TCGA-LIHC" returns 7 data categories (the same ones mentioned repeatedly in the group owner's videos); case_count is the number of patients and file_count is the number of corresponding files. I want the expression profile, so I set data.category = "Transcriptome Profiling".
TCGAbiolinks:::getProjectSummary("TCGA-LIHC")
$data_categories
  case_count file_count               data_category
1        376       2122     Transcriptome Profiling
2        376       1537       Copy Number Variation
3        375       3032 Simple Nucleotide Variation
4        377        430             DNA Methylation
5        377        423                    Clinical
6        377       1637            Sequencing Reads
7        377       1634                 Biospecimen
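To pick one category out of such a summary, ordinary data frame subsetting is enough. The sketch below rebuilds the table by hand from the printed output (it does not call getProjectSummary), and the variable names are only illustrative.

```r
# Rebuild the summary printed above by hand (values copied from the output).
data_categories <- data.frame(
  case_count = c(376, 376, 375, 377, 377, 377, 377),
  file_count = c(2122, 1537, 3032, 430, 423, 1637, 1634),
  data_category = c("Transcriptome Profiling", "Copy Number Variation",
                    "Simple Nucleotide Variation", "DNA Methylation",
                    "Clinical", "Sequencing Reads", "Biospecimen"),
  stringsAsFactors = FALSE
)

# Select the row for the category we plan to download.
expr <- data_categories[data_categories$data_category == "Transcriptome Profiling", ]

# Average number of files per patient in that category.
files_per_case <- expr$file_count / expr$case_count
print(round(files_per_case, 2))
```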
3. data.type
Filters the files to download by data type. There is no command that lists the available data.type values, but from the official website (see the table image below) and other material I consulted, the commonly used values are:
Downloading RNA-seq expression data (counts)
data.type = "Gene Expression Quantification"
Downloading miRNA data
data.type = "miRNA Expression Quantification"
Downloading Copy Number Variation data
data.type = "Copy Number Segment"
Here I am downloading the expression profile, so data.type = "Gene Expression Quantification".
1556293360665.png
4. workflow.type
Each data type has its own set of workflow options.
For gene expression data, workflow.type has three options:
HTSeq - FPKM-UQ: FPKM values normalized by the upper quartile
HTSeq - FPKM: FPKM values (expression values)
HTSeq - Counts: raw read counts
I need the raw counts, so workflow.type = "HTSeq - Counts".
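5. legacy = FALSE
This argument exists because TCGA data can be downloaded through two entry points, the GDC Legacy Archive and the GDC Data Portal. The main difference is the reference genome version used for annotation: the GDC Legacy Archive uses hg19 and the GDC Data Portal uses hg38. The default is FALSE, which downloads from the GDC Data Portal (hg38). My suggestion is to use hg38 for transcriptome-level data and hg19 for DNA-level data, because, for example, many of the databases used in SNP analysis have no hg38 version and only provide hg19.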
1556293412665.png
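6. access
Whether the data are open or restricted. There are two values: controlled and open.
Here I use access = "open".
7. platform
Many platforms are involved; the official website shows which platforms are available for each kind of data. This argument can be left unset.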
1556293428897.png
8. file.type
Mainly used when downloading from the GDC Legacy Archive; see the official documentation. I am downloading from the GDC Data Portal, so I leave this argument unset.
9. barcode
A list of barcodes to filter the files to download. This argument lets you download only specific samples, for example:
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
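To see what such a barcode encodes, it can be split on the hyphens. The sketch below assumes the standard TCGA barcode layout (project, tissue source site, participant, sample type plus vial, and so on), which comes from the TCGA barcode convention rather than from these notes; the column names are my own.

```r
barcodes <- c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")

# Split each barcode on "-" and stack the pieces into a character matrix.
parts <- do.call(rbind, strsplit(barcodes, "-", fixed = TRUE))

parsed <- data.frame(
  barcode     = barcodes,
  tss         = parts[, 2],                 # tissue source site
  participant = parts[, 3],
  sample_code = substr(parts[, 4], 1, 2),   # two-digit sample type code
  vial        = substr(parts[, 4], 3, 3),
  # The first three fields together identify the patient.
  patient_id  = apply(parts[, 1:3, drop = FALSE], 1, paste, collapse = "-"),
  stringsAsFactors = FALSE
)
print(parsed)
```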
10. experimental.strategy
Available values for the two entry points:
GDC Data Portal: WXS, RNA-Seq, miRNA-Seq, Genotyping Array.
Legacy: WXS, RNA-Seq, miRNA-Seq, Genotyping Array, DNA-Seq, Methylation array, Protein expression array, CGH array, VALIDATION, Gene expression array, WGS, MSI-Mono-Dinucleotide Assay, miRNA expression array, Mixed strategies, AMPLICON, Exon array, Total RNA-Seq, Capillary sequencing, Bisulfite-Seq
11. sample.type
A sample type to filter the files to download. I want the data for all sample types, so I leave it unset. The full list of values is on the official website; one example is sample.type = "Recurrent Solid Tumor".
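If everything is downloaded without a sample.type filter, samples can still be separated afterwards from their barcodes. This is a rough sketch that assumes characters 14 and 15 of a full TCGA barcode hold the sample type code; the lookup table is deliberately partial and hand-written, not taken from the package.

```r
# Partial, hand-written lookup of sample type codes (not the full code table).
sample_types <- c("01" = "Primary Solid Tumor",
                  "02" = "Recurrent Solid Tumor",
                  "11" = "Solid Tissue Normal")

barcodes <- c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")

# Characters 14-15 of a full barcode hold the sample type code.
codes <- substr(barcodes, 14, 15)
labels <- unname(sample_types[codes])

# Keep only the recurrent tumors.
recurrent <- barcodes[labels == "Recurrent Solid Tumor"]
print(data.frame(barcode = barcodes, code = codes, sample_type = labels))
```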
Data download
# First, find the data in the database that match the chosen arguments
query <- GDCquery(project = "TCGA-LIHC",
                  legacy = FALSE,
                  experimental.strategy = "RNA-Seq",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")
# Then download the files with GDCdownload()
GDCdownload(query)
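Since most of the 11 arguments are left unset, one way to keep a query tidy is to collect the arguments in a list and drop the unset ones before calling GDCquery. The helper build_query_args below is hypothetical (it is not part of TCGAbiolinks), and the argument values simply repeat the ones used in these notes; they may need adjusting for other package or GDC versions.

```r
# Hypothetical helper: collect GDCquery() arguments, dropping unset (NULL) ones.
build_query_args <- function(project, data.category, data.type,
                             workflow.type = NULL, barcode = NULL,
                             sample.type = NULL) {
  all_args <- list(project = project,
                   data.category = data.category,
                   data.type = data.type,
                   workflow.type = workflow.type,
                   barcode = barcode,
                   sample.type = sample.type)
  # Filter() keeps only the elements that are not NULL.
  Filter(Negate(is.null), all_args)
}

query_args <- build_query_args(project = "TCGA-LIHC",
                               data.category = "Transcriptome Profiling",
                               data.type = "Gene Expression Quantification",
                               workflow.type = "HTSeq - Counts")
str(query_args)

# With TCGAbiolinks installed and network access, the list could be passed on:
# query <- do.call(TCGAbiolinks::GDCquery, query_args)
# TCGAbiolinks::GDCdownload(query)
```

Keeping the real GDCquery call commented out means the sketch runs offline; only the list-building step is exercised.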
Getting the expression matrix
dataAssay <- GDCprepare(query, summarizedExperiment = FALSE)  # use the query object created above
rownames(dataAssay) <- as.character(dataAssay[, 1])
# dataAssay is now the matrix. It lives in the R environment, that is, in the computer's memory, and can be analyzed further in R.
# It can also be saved from R to disk with write.table or write.csv; saved as csv, it can be opened in Excel.
write.csv(dataAssay, "TCGA-matrix.csv")  # the file is saved as "TCGA-matrix.csv"
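The row-name and save steps can be tried without downloading anything. The sketch below uses a small made-up table as a stand-in for the GDCprepare result (the gene and sample names are invented), moves the ID column into the row names, and checks that the matrix survives a CSV round trip.

```r
# Toy stand-in for the table returned by GDCprepare(): IDs in column 1.
toy <- data.frame(gene = c("geneA", "geneB", "geneC"),
                  sampleA = c(10, 0, 25),
                  sampleB = c(3, 7, 0),
                  stringsAsFactors = FALSE)

# Use the ID column as row names and keep only the numeric columns.
expr_mat <- as.matrix(toy[, -1])
rownames(expr_mat) <- toy$gene

# Write to a temporary CSV and read it back with the row names restored.
csv_path <- tempfile(fileext = ".csv")
write.csv(expr_mat, csv_path)
restored <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))

print(restored)
```

Dropping the ID column before as.matrix keeps the matrix numeric; leaving it in would turn every value into a character string.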
Next, I will rewatch the videos and combine them with online material to understand this better.