title: GDAS002-Bioconductor備忘錄
date: 2019-09-02 12:0:00
type: "tags"
tags:
- Bioconductor
categories: - Genomics Data Analysis Series
前言
這篇筆記的主要內容是關于Bioconductor的一些備忘錄(cheat sheet)诵棵,作者總結的不錯,我就按原文翻譯了下來罕模,由于一些知識點并沒有學到,因此未翻譯些楣,后面學到后,再回頭過來翻譯途样。
安裝
安裝細節(jié)參考http://bioconductor.org/install/砌梆,具體代碼如下所示:
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install()
BiocManager::install(c("package1","package2")
BiocManager::valid() # are packages up to date?
# what Bioc version is release right now?
http://bioconductor.org/bioc-version
# what Bioc versions are release/devel?
http://bioconductor.org/js/versions.js
R語言幫助
簡單的幫忙如下所示:
?函數名
?"eSet-class" # 如果要查看類的幫助,需要后綴'-class'
help(package="foo",help_type="html") # 使用web瀏覽器來查看幫助文檔
vignette("topic")
browseVignettes(package="package") # 某包的簡單案例
高級用戶的幫助功能
functionName # 直接輸入函數名(不帶括號)壶熏,則顯示函數代碼
getMethod(method,"class") # 顯示某類的方法代碼
selectMethod(method, "class") # 查看繼承來的方法
showMethods(classes="class") # 顯示某類的所有方法show all methods for class
methods(class="GRanges") # 此函數限于R >=3.2
?"functionName,class-method" # 用于查看S4對象的方法文檔
?"plotMA,data.frame-method" # 需要加載geneplotter包句柠,from library(geneplotter)
?"method.class" # 查看S3對象的方法
?"plot.lm"
sessionInfo() # 獲取R語言版本的相關幫助信息
packageVersion("foo") # 查看某個包的版本
Bioconductor相關的論壇: https://support.bioconductor.org
如果你使用的Rstudio,則可以使用?
或help
來查看幫忙文檔棒假,如果使用是命令行模式(例如Linux)溯职,則可以使用下面的代碼通過瀏覽器來查看幫助文檔(我使用的Rstudio,因此未嘗試以下代碼):
alias rhelp="Rscript -e 'args <- commandArgs(TRUE); help(args[2], package=args[3], help_type=\"html\"); Sys.sleep(5)' --args"
debugging R
traceback() # 查看哪一步導致了R錯誤
# debug a function
debug(myFunction) # 逐行遍歷函數中的代碼
undebug(myFunction) # 停止debugging
debugonce(myFunction) # 功能如上帽哑,但是不需要undebug()
# 在寫代碼的過程中谜酒,可以將函數browser()放到一個關系的節(jié)點上
# this plus devtools::load_all() can be useful for programming
# to jump in function on error:
options(error=recover)
# turn that behavior off:
options(error=NULL)
# debug, e.g. estimateSizeFactors from DESeq2...
# debugging an S4 method is more difficult; this gives you a peek inside:
trace(estimateSizeFactors, browser, exit=browser, signature="DESeqDataSet")
顯示某個類的包特異方法
以下兩個段代碼實現的功能大致相同:查看特定的某個包中,某個特定類對象的操作方法妻枕。給定類的對象的操作方法僻族。
intersect(sapply(strsplit(as.character(methods(class="DESeqDataSet")), ","), `[`, 1), ls("package:DESeq2"))
sub("Function: (.*) \\(package .*\\)","\\1",grep("Function",showMethods(classes="DESeqDataSet",
where=getNamespace("DESeq2"), printTo=FALSE), value=TRUE))
注釋Annotations
這里需要先更新一下AnnotationHub,現在我們查看一下AnnotationHub的某個案例:
https://master.bioconductor.org/packages/release/bioc/html/AnnotationHub.html
以下代碼是如何使用數據庫包屡谐,以及biomart
述么,先看AnnotationDbi:
# 某用包個注釋包
library(AnnotationDbi)
library(org.Hs.eg.db) # 這個是人類注釋包,即Homo.sapiens
columns(org.Hs.eg.db)
keytypes(org.Hs.eg.db) # columns與keytypes返回的結果一樣愕掏,即有什么樣的注釋內容
k <- head(keys(org.Hs.eg.db, keytype="ENTREZID"))
# 返回一個字符向量(1對1的關系)度秘,查看mapIds則可以查看多個選項
res <- mapIds(org.Hs.eg.db, keys=k, column="ENSEMBL", keytype="ENTREZID")
# 產生1對多的映射關系
res <- select(org.Hs.eg.db, keys=k,
columns=c("ENTREZID","ENSEMBL","SYMBOL"),
keytype="ENTREZID")
這個包主要用于各種基因ID的轉換。
# 使用biomart包饵撑,將一個注釋與另外一個注釋進行映射
library(biomaRt)
m <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
map <- getBM(mart = m,
attributes = c("ensembl_gene_id", "entrezgene"),
filters = "ensembl_gene_id",
values = some.ensembl.genes)
GenomicRanges包
GenomicRanges這個包用于分析高通量測序數據敷钾,它能有效地表殼與操作基因組注釋信息(genomic annotations)與比對結果(alignments)。GenomicRanges包定義了一個通用容器肄梨,此容器用于儲存與操作基因間隔(genomic intervals)與變量(variables)筐骇。在GenomicAlignments包和SummarizedExperiment包中定義了一個更加特殊的容器烈涮,GenomicAlignments包中的這個更加特殊的容器可以根據一個參考基因組來表示和操作一些短比對結果(short alignments)逢捺,SummarizedExperiment包中容器可以構建一個實驗的矩陣樣匯總信息去扣。
此包的使用如下所示:
library(GenomicRanges)
z <- GRanges("chr1",IRanges(1000001,1001000),strand="+")
start(z)
end(z)
width(z)
strand(z)
mcols(z) # the 'metadata columns', any information stored alongside each range
ranges(z) # gives the IRanges
seqnames(z) # the chromosomes for each ranges
seqlevels(z) # the possible chromosomes
seqlengths(z) # the lengths for each chromosome
Intra-range methods
不清楚功能呛牲,后面補充缭保。
Affects ranges independently
function | description |
---|---|
shift | moves left/right |
narrow | narrows by relative position within range |
resize | resizes to width, fixing start for +, end for - |
flank | returns flanking ranges to the left +, or right - |
promoters | similar to flank |
restrict | restricts ranges to a start and end position |
trim | trims out of bound ranges |
+/- | expands/contracts by adding/subtracting fixed amount |
* | zooms in (positive) or out (negative) by multiples |
Inter-range methods
Affects ranges as a group
function | description |
---|---|
range | one range, leftmost start to rightmost end |
reduce | cover all positions with only one range |
gaps | uncovered positions within range |
disjoin | breaks into discrete ranges based on original starts/ends |
Nearest methods
Given two sets of ranges, x
and subject
, for each range in x
, returns...
function | description |
---|---|
nearest | index of the nearest neighbor range in subject |
precede | index of the range in subject that is directly preceded by the range in x |
follow | index of the range in subject that is directly followed by the range in x |
distanceToNearest | distances to its nearest neighbor in subject (Hits object) |
distance | distances to nearest neighbor (integer vector) |
A Hits object can be accessed with queryHits
, subjectHits
and mcols
if a distance is associated.
set methods
If y
is a GRangesList, then use punion
, etc. All functions have default ignore.strand=FALSE
, so are strand specific.
union(x,y)
intersect(x,y)
setdiff(x,y)
Overlaps
x %over% y # logical vector of which x overlaps any in y
fo <- findOverlaps(x,y) # returns a Hits object
queryHits(fo) # which in x
subjectHits(fo) # which in y
Seqnames and seqlevels
GenomicRanges and GenomeInfoDb
gr.sub <- gr[seqlevels(gr) == "chr1"]
seqlevelsStyle(x) <- "UCSC" # convert to 'chr1' style from "NCBI" style '1'
Sequences
Biostrings是一個操作序列或序列集的包遥缕,簡要操作可以參考Biostrings Quick Overview PDF侥祭,關于一些物種的蓖宦,縮寫的信息可以參考cheat sheet for annotation
library(BSgenome.Hsapiens.UCSC.hg19)
dnastringset <- getSeq(Hsapiens, granges) # returns a DNAStringSet
# also Views() for Bioconductor >= 3.1
library(Biostrings)
dnastringset <- readDNAStringSet("transcripts.fa")
substr(dnastringset, 1, 10) # to character string
subseq(dnastringset, 1, 10) # returns DNAStringSet
Views(dnastringset, 1, 10) # lightweight views into object
complement(dnastringset)
reverseComplement(dnastringset)
matchPattern("ACGTT", dnastring) # also countPattern, also works on Hsapiens/genome
vmatchPattern("ACGTT", dnastringset) # also vcountPattern
letterFrequecy(dnastringset, "CG") # how many C's or G's
# also letterFrequencyInSlidingView
alphabetFrequency(dnastringset, as.prob=TRUE)
# also oligonucleotideFrequency, dinucleotideFrequency, trinucleotideFrequency
# transcribe/translate for imitating biological processes
測序數據
Rsamtools中的 scanBam
返回BAM文件中的原始數值:
library(Rsamtools)
which <- GRanges("chr1",IRanges(1000001,1001000))
what <- c("rname","strand","pos","qwidth","seq")
param <- ScanBamParam(which=which, what=what)
# for more BamFile functions/details see ?BamFile
# yieldSize for chunk-wise access
bamfile <- BamFile("/path/to/file.bam")
reads <- scanBam(bamfile, param=param)
res <- countBam(bamfile, param=param)
# for more sophisticated counting modes
# see summarizeOverlaps() below
# quickly check chromosome names
seqinfo(BamFile("/path/to/file.bam"))
# DNAStringSet is defined in the Biostrings package
# see the Biostrings Quick Overview PDF
dnastringset <- scanFa(fastaFile, param=granges)
GenomicAlignments 返回一個Bioconductor對象(GRanges-based)
library(GenomicAlignments)
ga <- readGAlignments(bamfile) # single-end
ga <- readGAlignmentPairs(bamfile) # paired-end
轉錄本數據庫
# get a transcript database, which stores exon, trancript, and gene information
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# or build a txdb from GTF file (e.g. downloadable from Ensembl FTP site)
txdb <- makeTranscriptDbFromGFF("file.GTF", format="gtf")
# or build a txdb from Biomart (however, not as easy to reproduce later)
txdb <- makeTranscriptDbFromBiomart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# in Bioconductor >= 3.1, also makeTxDbFromGRanges
# saving and loading
saveDb(txdb, file="txdb.sqlite")
loadDb("txdb.sqlite")
# extracting information from txdb
g <- genes(txdb) # GRanges, just start to end, no exon/intron information
tx <- transcripts(txdb) # GRanges, similar to genes()
e <- exons(txdb) # GRanges for each exon
ebg <- exonsBy(txdb, by="gene") # exons grouped in a GRangesList by gene
ebt <- exonsBy(txdb, by="tx") # similar but by transcript
# then get the transcript sequence
txSeq <- extractTranscriptSeqs(Hsapiens, ebt)
根據范圍(ranges)和實驗匯總信息
SummarizedExperiment
是高維數據的儲存類齐婴, 它與GRanges
或GRangesList
配合使用(使用每個基因外顯子的read計數)。
library(GenomicAlignments)
fls <- list.files(pattern="*.bam$")
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
ebg <- exonsBy(txdb, by="gene")
# see yieldSize argument for restricting memory
bf <- BamFileList(fls)
library(BiocParallel)
register(MulticoreParam(4))
# lots of options in the man page
# singleEnd, ignore.strand, inter.features, fragments, etc.
se <- summarizeOverlaps(ebg, bf)
# operations on SummarizedExperiment
assay(se) # the counts from summarizeOverlaps
colData(se)
rowRanges(se)
Salmon是一個定量工具稠茂,如果數據占有GC依賴的數據柠偶,那么就添加上--gcBias
參數情妖,接著使用 tximport,以下是使用案例:
coldata <- read.table("samples.txt")
rownames(coldata) <- coldata$id
files <- coldata$files; names(files) <- coldata$id
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
dds <- DESeqDataSetFromTximport(txi, coldata, ~condition)
Rsubread包中的featureCounts
函數也可以用于read計算诱担,速度可能更快毡证。
library(Rsubread)
res <- featureCounts(files, annot.ext="annotation.gtf",
isGTFAnnotationFile=TRUE,
GTF.featureType="exon",
GTF.attrType="gene_id")
res$counts
RNA-seq分析方法
DESeq2:My preferred pipeline for DESeq2 users is to start with a lightweight transcript
abundance quantifier such as Salmon
and to use tximport, followed
by DESeqDataSetFromTximport
.
Here, coldata
is a data.frame with group
as a column.
library(DESeq2)
# from tximport
dds <- DESeqDataSetFromTximport(txi, coldata, ~ group)
# from SummarizedExperiment
dds <- DESeqDataSet(se, ~ group)
# from count matrix
dds <- DESeqDataSetFromMatrix(counts, coldata, ~ group)
# minimal filtering helps keep things fast
# one can set 'n' to e.g. min(5, smallest group sample size)
keep <- rowSums(counts(dds) >= 10) >= n
dds <- dds[keep,]
dds <- DESeq(dds)
res <- results(dds) # no shrinkage of LFC, or:
res <- lfcShrink(dds, coef = 2, type="apeglm") # shrink LFCs
# this chunk from the Quick start in the edgeR User Guide
library(edgeR)
y <- DGEList(counts=counts,group=group)
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
fit <- glmFit(y,design)
lrt <- glmLRT(fit)
topTags(lrt)
# or use the QL methods:
qlfit <- glmQLFit(y,design)
qlft <- glmQLFTest(qlfit)
topTags(qlft)
library(limma)
design <- model.matrix(~ group)
y <- DGEList(counts)
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
v <- voom(y,design)
fit <- lmFit(v,design)
fit <- eBayes(fit)
topTable(fit)
更多有關RNA-seq操作的包可以參考:Many more RNA-seq packages
表達數據集Expression set
library(Biobase)
data(sample.ExpressionSet)
e <- sample.ExpressionSet
exprs(e)
pData(e)
fData(e)
下載GEO數據庫
library(GEOquery)
e <- getGEO("GSE9514")
芯片分析
library(affy)
library(limma)
phenoData <- read.AnnotatedDataFrame("sample-description.csv")
eset <- justRMA("/celfile-directory", phenoData=phenoData)
design <- model.matrix(~ Disease, pData(eset))
fit <- lmFit(eset, design)
efit <- eBayes(fit)
topTable(efit, coef=2)
iCOBRA包
iCOBRA包用于計算和可視化排序數據以及二分類匹配后的數據。例如蔫仙,iCOBRA包的一個典型用途就是用于評估基因表達實驗中不同組之間的比較方法料睛,我們可以視之為一個排序問題(評估正確的效應量(effect size)以及按照顯著性進行排列的基因),還是一個二分類比較問題(binary assignment problem)(例如將基因分為差異表達和非差異表達)摇邦。
library(iCOBRA)
cd <- COBRAData(pval=pval.df, padj=padj.df, score=score.df, truth=truth.df)
cp <- calculate_performance(cd, binary_truth = "status", cont_truth = "logFC")
cobraplot <- prepare_data_for_plot(cp)
plot_fdrtprcurve(cobraplot)
# interactive shiny app:
COBRAapp(cd)