GDAS002-Bioconductor備忘錄


title: GDAS002-Bioconductor備忘錄
date: 2019-09-02 12:0:00
type: "tags"
tags:

  • Bioconductor
    categories:
  • Genomics Data Analysis Series

前言

這篇筆記的主要內容是關于Bioconductor的一些備忘錄(cheat sheet)诵棵,作者總結的不錯,我就按原文翻譯了下來罕模,由于一些知識點并沒有學到,因此未翻譯些楣,后面學到后,再回頭過來翻譯途样。

安裝

安裝細節(jié)參考http://bioconductor.org/install/砌梆,具體代碼如下所示:

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install()
BiocManager::install(c("package1","package2")
BiocManager::valid() # are packages up to date?

# what Bioc version is release right now?
http://bioconductor.org/bioc-version
# what Bioc versions are release/devel?
http://bioconductor.org/js/versions.js

R語言幫助

簡單的幫忙如下所示:

?函數名
?"eSet-class" # 如果要查看類的幫助,需要后綴'-class'
help(package="foo",help_type="html") # 使用web瀏覽器來查看幫助文檔
vignette("topic")
browseVignettes(package="package") # 某包的簡單案例

高級用戶的幫助功能

functionName # 直接輸入函數名(不帶括號)壶熏,則顯示函數代碼
getMethod(method,"class")  # 顯示某類的方法代碼
selectMethod(method, "class") # 查看繼承來的方法
showMethods(classes="class") # 顯示某類的所有方法show all methods for class
methods(class="GRanges") # 此函數限于R >=3.2
?"functionName,class-method" # 用于查看S4對象的方法文檔
?"plotMA,data.frame-method" # 需要加載geneplotter包句柠,from library(geneplotter)
?"method.class" # 查看S3對象的方法 
?"plot.lm"
sessionInfo() # 獲取R語言版本的相關幫助信息 
packageVersion("foo") # 查看某個包的版本

Bioconductor相關的論壇: https://support.bioconductor.org

如果你使用的Rstudio,則可以使用?help來查看幫忙文檔棒假,如果使用是命令行模式(例如Linux)溯职,則可以使用下面的代碼通過瀏覽器來查看幫助文檔(我使用的Rstudio,因此未嘗試以下代碼):

alias rhelp="Rscript -e 'args <- commandArgs(TRUE); help(args[2], package=args[3], help_type=\"html\"); Sys.sleep(5)' --args"

debugging R

traceback() # 查看哪一步導致了R錯誤 
# debug a function
debug(myFunction) # 逐行遍歷函數中的代碼 
undebug(myFunction) # 停止debugging
debugonce(myFunction) # 功能如上帽哑,但是不需要undebug()
# 在寫代碼的過程中谜酒,可以將函數browser()放到一個關系的節(jié)點上
# this plus devtools::load_all() can be useful for programming
# to jump in function on error:
options(error=recover)
# turn that behavior off:
options(error=NULL)
# debug, e.g. estimateSizeFactors from DESeq2...
# debugging an S4 method is more difficult; this gives you a peek inside:
trace(estimateSizeFactors, browser, exit=browser, signature="DESeqDataSet")

顯示某個類的包特異方法

以下兩個段代碼實現的功能大致相同:查看特定的某個包中,某個特定類對象的操作方法妻枕。給定類的對象的操作方法僻族。

intersect(sapply(strsplit(as.character(methods(class="DESeqDataSet")), ","), `[`, 1), ls("package:DESeq2"))
sub("Function: (.*) \\(package .*\\)","\\1",grep("Function",showMethods(classes="DESeqDataSet", 
    where=getNamespace("DESeq2"), printTo=FALSE), value=TRUE))

注釋Annotations

這里需要先更新一下AnnotationHub,現在我們查看一下AnnotationHub的某個案例:

https://master.bioconductor.org/packages/release/bioc/html/AnnotationHub.html

以下代碼是如何使用數據庫包屡谐,以及biomart述么,先看AnnotationDbi

# 某用包個注釋包
library(AnnotationDbi)
library(org.Hs.eg.db) # 這個是人類注釋包,即Homo.sapiens
columns(org.Hs.eg.db)
keytypes(org.Hs.eg.db) # columns與keytypes返回的結果一樣愕掏,即有什么樣的注釋內容
k <- head(keys(org.Hs.eg.db, keytype="ENTREZID"))

# 返回一個字符向量(1對1的關系)度秘,查看mapIds則可以查看多個選項
res <- mapIds(org.Hs.eg.db, keys=k, column="ENSEMBL", keytype="ENTREZID")

# 產生1對多的映射關系 
res <- select(org.Hs.eg.db, keys=k,
  columns=c("ENTREZID","ENSEMBL","SYMBOL"),
  keytype="ENTREZID")

biomaRt

這個包主要用于各種基因ID的轉換。

# 使用biomart包饵撑,將一個注釋與另外一個注釋進行映射
library(biomaRt)
m <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
map <- getBM(mart = m,
  attributes = c("ensembl_gene_id", "entrezgene"),
  filters = "ensembl_gene_id", 
  values = some.ensembl.genes)

GenomicRanges包

GenomicRanges這個包用于分析高通量測序數據敷钾,它能有效地表殼與操作基因組注釋信息(genomic annotations)與比對結果(alignments)。GenomicRanges包定義了一個通用容器肄梨,此容器用于儲存與操作基因間隔(genomic intervals)與變量(variables)筐骇。在GenomicAlignments包和SummarizedExperiment包中定義了一個更加特殊的容器烈涮,GenomicAlignments包中的這個更加特殊的容器可以根據一個參考基因組來表示和操作一些短比對結果(short alignments)逢捺,SummarizedExperiment包中容器可以構建一個實驗的矩陣樣匯總信息去扣。

此包的使用如下所示:

library(GenomicRanges)
z <- GRanges("chr1",IRanges(1000001,1001000),strand="+")
start(z)
end(z)
width(z)
strand(z)
mcols(z) # the 'metadata columns', any information stored alongside each range
ranges(z) # gives the IRanges
seqnames(z) # the chromosomes for each ranges
seqlevels(z) # the possible chromosomes
seqlengths(z) # the lengths for each chromosome

Intra-range methods

不清楚功能呛牲,后面補充缭保。

Affects ranges independently

function description
shift moves left/right
narrow narrows by relative position within range
resize resizes to width, fixing start for +, end for -
flank returns flanking ranges to the left +, or right -
promoters similar to flank
restrict restricts ranges to a start and end position
trim trims out of bound ranges
+/- expands/contracts by adding/subtracting fixed amount
* zooms in (positive) or out (negative) by multiples

Inter-range methods

Affects ranges as a group

function description
range one range, leftmost start to rightmost end
reduce cover all positions with only one range
gaps uncovered positions within range
disjoin breaks into discrete ranges based on original starts/ends

Nearest methods

Given two sets of ranges, x and subject, for each range in x, returns...

function description
nearest index of the nearest neighbor range in subject
precede index of the range in subject that is directly preceded by the range in x
follow index of the range in subject that is directly followed by the range in x
distanceToNearest distances to its nearest neighbor in subject (Hits object)
distance distances to nearest neighbor (integer vector)

A Hits object can be accessed with queryHits, subjectHits and mcols if a distance is associated.

set methods

If y is a GRangesList, then use punion, etc. All functions have default ignore.strand=FALSE, so are strand specific.

union(x,y) 
intersect(x,y)
setdiff(x,y)

Overlaps

x %over% y  # logical vector of which x overlaps any in y
fo <- findOverlaps(x,y) # returns a Hits object
queryHits(fo)   # which in x
subjectHits(fo) # which in y 

Seqnames and seqlevels

GenomicRanges and GenomeInfoDb

gr.sub <- gr[seqlevels(gr) == "chr1"]
seqlevelsStyle(x) <- "UCSC" # convert to 'chr1' style from "NCBI" style '1'

Sequences

Biostrings是一個操作序列或序列集的包遥缕,簡要操作可以參考Biostrings Quick Overview PDF侥祭,關于一些物種的蓖宦,縮寫的信息可以參考cheat sheet for annotation

library(BSgenome.Hsapiens.UCSC.hg19)
dnastringset <- getSeq(Hsapiens, granges) # returns a DNAStringSet
# also Views() for Bioconductor >= 3.1
library(Biostrings)
dnastringset <- readDNAStringSet("transcripts.fa")
substr(dnastringset, 1, 10) # to character string
subseq(dnastringset, 1, 10) # returns DNAStringSet
Views(dnastringset, 1, 10) # lightweight views into object
complement(dnastringset)
reverseComplement(dnastringset)
matchPattern("ACGTT", dnastring) # also countPattern, also works on Hsapiens/genome
vmatchPattern("ACGTT", dnastringset) # also vcountPattern
letterFrequecy(dnastringset, "CG") # how many C's or G's
# also letterFrequencyInSlidingView
alphabetFrequency(dnastringset, as.prob=TRUE)
# also oligonucleotideFrequency, dinucleotideFrequency, trinucleotideFrequency
# transcribe/translate for imitating biological processes

測序數據

Rsamtools中的 scanBam 返回BAM文件中的原始數值:

library(Rsamtools)
which <- GRanges("chr1",IRanges(1000001,1001000))
what <- c("rname","strand","pos","qwidth","seq")
param <- ScanBamParam(which=which, what=what)
# for more BamFile functions/details see ?BamFile
# yieldSize for chunk-wise access
bamfile <- BamFile("/path/to/file.bam")
reads <- scanBam(bamfile, param=param)
res <- countBam(bamfile, param=param) 
# for more sophisticated counting modes
# see summarizeOverlaps() below

# quickly check chromosome names
seqinfo(BamFile("/path/to/file.bam"))

# DNAStringSet is defined in the Biostrings package
# see the Biostrings Quick Overview PDF
dnastringset <- scanFa(fastaFile, param=granges)

GenomicAlignments 返回一個Bioconductor對象(GRanges-based)

library(GenomicAlignments)
ga <- readGAlignments(bamfile) # single-end
ga <- readGAlignmentPairs(bamfile) # paired-end

轉錄本數據庫

GenomicFeatures

# get a transcript database, which stores exon, trancript, and gene information
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# or build a txdb from GTF file (e.g. downloadable from Ensembl FTP site)
txdb <- makeTranscriptDbFromGFF("file.GTF", format="gtf")

# or build a txdb from Biomart (however, not as easy to reproduce later)
txdb <- makeTranscriptDbFromBiomart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# in Bioconductor >= 3.1, also makeTxDbFromGRanges

# saving and loading
saveDb(txdb, file="txdb.sqlite")
loadDb("txdb.sqlite")

# extracting information from txdb
g <- genes(txdb) # GRanges, just start to end, no exon/intron information
tx <- transcripts(txdb) # GRanges, similar to genes()
e <- exons(txdb) # GRanges for each exon
ebg <- exonsBy(txdb, by="gene") # exons grouped in a GRangesList by gene
ebt <- exonsBy(txdb, by="tx") # similar but by transcript

# then get the transcript sequence
txSeq <- extractTranscriptSeqs(Hsapiens, ebt)

根據范圍(ranges)和實驗匯總信息

SummarizedExperiment是高維數據的儲存類齐婴, 它與GRangesGRangesList配合使用(使用每個基因外顯子的read計數)。

library(GenomicAlignments)
fls <- list.files(pattern="*.bam$")
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
ebg <- exonsBy(txdb, by="gene")
# see yieldSize argument for restricting memory
bf <- BamFileList(fls)
library(BiocParallel)
register(MulticoreParam(4))
# lots of options in the man page
# singleEnd, ignore.strand, inter.features, fragments, etc.
se <- summarizeOverlaps(ebg, bf)

# operations on SummarizedExperiment
assay(se) # the counts from summarizeOverlaps
colData(se)
rowRanges(se)

Salmon是一個定量工具稠茂,如果數據占有GC依賴的數據柠偶,那么就添加上--gcBias參數情妖,接著使用 tximport,以下是使用案例:

coldata <- read.table("samples.txt")
rownames(coldata) <- coldata$id
files <- coldata$files; names(files) <- coldata$id
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
dds <- DESeqDataSetFromTximport(txi, coldata, ~condition)

Rsubread包中的featureCounts函數也可以用于read計算诱担,速度可能更快毡证。

library(Rsubread)
res <- featureCounts(files, annot.ext="annotation.gtf",
  isGTFAnnotationFile=TRUE,
  GTF.featureType="exon",
  GTF.attrType="gene_id")
res$counts

RNA-seq分析方法

DESeq2:My preferred pipeline for DESeq2 users is to start with a lightweight transcript
abundance quantifier such as Salmon
and to use tximport, followed
by DESeqDataSetFromTximport.

Here, coldata is a data.frame with group as a column.

library(DESeq2)
# from tximport
dds <- DESeqDataSetFromTximport(txi, coldata, ~ group)
# from SummarizedExperiment
dds <- DESeqDataSet(se, ~ group)
# from count matrix
dds <- DESeqDataSetFromMatrix(counts, coldata, ~ group)
# minimal filtering helps keep things fast 
# one can set 'n' to e.g. min(5, smallest group sample size)
keep <- rowSums(counts(dds) >= 10) >= n 
dds <- dds[keep,]
dds <- DESeq(dds)
res <- results(dds) # no shrinkage of LFC, or:
res <- lfcShrink(dds, coef = 2, type="apeglm") # shrink LFCs

edgeR

# this chunk from the Quick start in the edgeR User Guide
library(edgeR) 
y <- DGEList(counts=counts,group=group)
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
fit <- glmFit(y,design)
lrt <- glmLRT(fit)
topTags(lrt)
# or use the QL methods:
qlfit <- glmQLFit(y,design)
qlft <- glmQLFTest(qlfit)
topTags(qlft)

limma-voom

library(limma)
design <- model.matrix(~ group)
y <- DGEList(counts)
keep <- filterByExpr(y)
y <- y[keep,]
y <- calcNormFactors(y)
v <- voom(y,design)
fit <- lmFit(v,design)
fit <- eBayes(fit)
topTable(fit)

更多有關RNA-seq操作的包可以參考:Many more RNA-seq packages

表達數據集Expression set

library(Biobase)
data(sample.ExpressionSet)
e <- sample.ExpressionSet
exprs(e)
pData(e)
fData(e)

下載GEO數據庫

library(GEOquery)
e <- getGEO("GSE9514")

芯片分析

library(affy)
library(limma)
phenoData <- read.AnnotatedDataFrame("sample-description.csv")
eset <- justRMA("/celfile-directory", phenoData=phenoData)
design <- model.matrix(~ Disease, pData(eset))
fit <- lmFit(eset, design)
efit <- eBayes(fit)
topTable(efit, coef=2)

iCOBRA包

iCOBRA包用于計算和可視化排序數據以及二分類匹配后的數據。例如蔫仙,iCOBRA包的一個典型用途就是用于評估基因表達實驗中不同組之間的比較方法料睛,我們可以視之為一個排序問題(評估正確的效應量(effect size)以及按照顯著性進行排列的基因),還是一個二分類比較問題(binary assignment problem)(例如將基因分為差異表達和非差異表達)摇邦。

library(iCOBRA)
cd <- COBRAData(pval=pval.df, padj=padj.df, score=score.df, truth=truth.df)
cp <- calculate_performance(cd, binary_truth = "status", cont_truth = "logFC")
cobraplot <- prepare_data_for_plot(cp)
plot_fdrtprcurve(cobraplot)
# interactive shiny app:
COBRAapp(cd)

參考資料

  1. Genomic annotation in Bioconductor: Cheat sheet

  2. HarvardX Biomedical Data Science Open Online Training

最后編輯于
?著作權歸作者所有,轉載或內容合作請聯系作者
  • 序言:七十年代末恤煞,一起剝皮案震驚了整個濱河市,隨后出現的幾起案子施籍,更是在濱河造成了極大的恐慌居扒,老刑警劉巖,帶你破解...
    沈念sama閱讀 206,013評論 6 481
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件法梯,死亡現場離奇詭異苔货,居然都是意外死亡,警方通過查閱死者的電腦和手機立哑,發(fā)現死者居然都...
    沈念sama閱讀 88,205評論 2 382
  • 文/潘曉璐 我一進店門夜惭,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人铛绰,你說我怎么就攤上這事诈茧。” “怎么了捂掰?”我有些...
    開封第一講書人閱讀 152,370評論 0 342
  • 文/不壞的土叔 我叫張陵敢会,是天一觀的道長。 經常有香客問我这嚣,道長鸥昏,這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 55,168評論 1 278
  • 正文 為了忘掉前任姐帚,我火速辦了婚禮吏垮,結果婚禮上,老公的妹妹穿的比我還像新娘罐旗。我一直安慰自己膳汪,他們只是感情好,可當我...
    茶點故事閱讀 64,153評論 5 371
  • 文/花漫 我一把揭開白布九秀。 她就那樣靜靜地躺著遗嗽,像睡著了一般。 火紅的嫁衣襯著肌膚如雪鼓蜒。 梳的紋絲不亂的頭發(fā)上痹换,一...
    開封第一講書人閱讀 48,954評論 1 283
  • 那天征字,我揣著相機與錄音,去河邊找鬼晴音。 笑死柔纵,一個胖子當著我的面吹牛,可吹牛的內容都是我干的锤躁。 我是一名探鬼主播搁料,決...
    沈念sama閱讀 38,271評論 3 399
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼系羞!你這毒婦竟也來了郭计?” 一聲冷哼從身側響起,我...
    開封第一講書人閱讀 36,916評論 0 259
  • 序言:老撾萬榮一對情侶失蹤椒振,失蹤者是張志新(化名)和其女友劉穎昭伸,沒想到半個月后,有當地人在樹林里發(fā)現了一具尸體澎迎,經...
    沈念sama閱讀 43,382評論 1 300
  • 正文 獨居荒郊野嶺守林人離奇死亡庐杨,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內容為張勛視角 年9月15日...
    茶點故事閱讀 35,877評論 2 323
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發(fā)現自己被綠了夹供。 大學時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片灵份。...
    茶點故事閱讀 37,989評論 1 333
  • 序言:一個原本活蹦亂跳的男人離奇死亡,死狀恐怖哮洽,靈堂內的尸體忽然破棺而出填渠,到底是詐尸還是另有隱情,我是刑警寧澤鸟辅,帶...
    沈念sama閱讀 33,624評論 4 322
  • 正文 年R本政府宣布氛什,位于F島的核電站,受9級特大地震影響匪凉,放射性物質發(fā)生泄漏枪眉。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點故事閱讀 39,209評論 3 307
  • 文/蒙蒙 一再层、第九天 我趴在偏房一處隱蔽的房頂上張望贸铜。 院中可真熱鬧,春花似錦树绩、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 30,199評論 0 19
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至职车,卻和暖如春瘫俊,著一層夾襖步出監(jiān)牢的瞬間鹊杖,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 31,418評論 1 260
  • 我被黑心中介騙來泰國打工扛芽, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留骂蓖,地道東北人。 一個月前我還...
    沈念sama閱讀 45,401評論 2 352
  • 正文 我出身青樓川尖,卻偏偏與公主長得像登下,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子叮喳,可洞房花燭夜當晚...
    茶點故事閱讀 42,700評論 2 345

推薦閱讀更多精彩內容