Introduction to Bioconductor for Sequence Data

在院士組輪轉(zhuǎn)的時(shí)候，由于沒人安排我去做什么，我也不懂怎么和別人交流，于是那些日子就在翻譯Biocondutor的教程中度過由捎。本文為我學(xué)習(xí)Bioconductor所寫的第一篇學(xué)習(xí)筆記。說是學(xué)習(xí)筆記饿凛，差不多等于全文翻譯了bioconductor的 sequence Analysis一章狞玛。基本上高通量數(shù)據(jù)分析到了最后都會(huì)用到Bioconductor的包笤喳，甚至是前期都可能會(huì)用到为居，

什么是Bioconductor

官方的簡介是：

Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development. It has two releases each year, 1296 software packages, and an active user community. Bioconductor is also available as an AMI (Amazon Machine Image) and a series of Docker images.

簡單的說Bioconductor就是以R語言為平臺(tái)的一個(gè)高通量基因組數(shù)據(jù)的分析工具。那么下面就算正文環(huán)節(jié)

簡介

Biconductor為高通量基因組數(shù)據(jù)提供了分析和理解方法杀狡。我們擁有大量包可對(duì)大量數(shù)據(jù)進(jìn)行嚴(yán)格的統(tǒng)計(jì)分析，while keeping techological aritiacts in mind. Bioconductor能進(jìn)行多種類型分析：

測序： RNASeq, ChIPSeq, variants, copy number...
微矩陣：表達(dá)贰镣，SNP
Domain specific analysis: Flow cytometry, Proteomics...

對(duì)于這些分析呜象，可用簡單的導(dǎo)入并使用多種序列相關(guān)文件類型，包括,fasta, fastq,BAM,gtf,bed,和wig文件等等.Bionconductor支持導(dǎo)入碑隆，常見和高級(jí)的序列操作恭陡，triming,transformation, and alignment(質(zhì)量評(píng)估)

Sequencing Resources

不同階段，不同數(shù)據(jù)格式會(huì)用到的包

不同數(shù)據(jù)格式會(huì)用到的包

IRange, GenomicRanges：基于range的計(jì)算上煤，數(shù)據(jù)操作休玩，常見數(shù)據(jù)表示。Biostring: DNA和氨基酸序列表示劫狠，比對(duì)和模式匹配和大規(guī)模生物學(xué)序列的數(shù)據(jù)操作拴疤。ShortRead處理FASTQ文件
Rsamtools, GenomicAlignments用于比對(duì)read(BAM文件)的I/O和數(shù)據(jù)操作 rtracklayer用于導(dǎo)入和導(dǎo)出多種數(shù)據(jù)格式(BED,WIG,bigWig,GTF,GFF)和UCSC genome browser tracks操作
BSgenomes：用于獲取和操作規(guī)劃的全基因組。GenomicFeatures常見基因組之間序列特征注釋独泞，biomaRt 獲取Biomart數(shù)據(jù)庫
SRAdb用于從Sequnce Read Archive(SRA)查找和獲取數(shù)據(jù)呐矾。

Bioconductor包通過biocViews進(jìn)行管理羔飞，有些詞條位于Sequencing宵统，其他則在其他條目和代表性的包下，包括

RNASeq, e.g., edgeR, DESeq2, edgeR, derfinder, and QuasR.
ChIPSeq, e.g.,DiffBind, csaw, ChIPseeker, ChIPQC.
SNPs and other variants, e.g., VariantAnnotation, VariantFiltering, h5vc.
CopyNumberVariation e.g., DNAcopy, crlmm, fastseg.
Microbiome and metagenome sequencing, e.g., metagenomeSeq, phyloseq, DirichletMultinomial.

Ranges Infrastructure

許多其他的Bioconductor包高度依賴于IRanges/GenomicRanges所提供的低層數(shù)據(jù)結(jié)構(gòu)

library(GenomicRanges)
GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 3, 4)),
      IRanges(1:10, width=5), strand='-',
      score=101:110, GC = runif(10))

GenomicRanges使得我們可以將染色體坐標(biāo)的一個(gè)范圍和一段序列名（如chromosome)以及正負(fù)鏈關(guān)聯(lián)起來硫戈。這對(duì)描述數(shù)據(jù)(aligned reads坐標(biāo),called ChIP peaks 或copy number variants)和注釋(gene models, Roadmap Epigenomics regularoty elements, dbSNP的已知臨床相關(guān)變異)

GRanes對(duì)象，用來儲(chǔ)存基因組位置和關(guān)聯(lián)注釋的向量罚随。
Genomic ranges基本都是通過導(dǎo)入數(shù)據(jù) (e.g., viaGenomicAlignments::readGAlignments())或注釋(e.g., via GenomicFeatures::select() or rtracklayer::import() of BED, WIG, GTF, and other common file formats).進(jìn)行創(chuàng)建

help(package="GenomicRanges")
vignette(package="GenomicRanges")

GRanges常用操作包括findOverlaps玉工， nearest等

FASTA文件中的DNA/氨基酸序列

Biostrings類用于表示DNA或氨基酸序列

library(Biostrings)
d <- DNAString("TTGAAAA-CTC-N")
length(d)

使用AnnotationHub在Ensembl上下載Homo sapiens cDNA序列，文件名為‘Homo_sapiens.GRCh38.cdna.all.fa’

library(AnnotationHub)
ah <- AnnotationHub()
ah2 <- query(ah, c("fasta","homo sapiens","Ensembl"))
fa <- ah2[["AH18522]]
fa

使用Rsamtools打開fasta文件讀取里面的序列和寬度記錄

library(Rsamtools)
idx <- scanFaIndex(fa)
idx

long <- idx[width(idx)>82000]
getSeq(fa,param=long)

BSgenome提供全基因組序列淘菩，

library(BSgenome.Hsapiens.UCSC.hg19)
chr14_range <- GRanges("chr14",IRanges(1,seqlengths(HSapiens)["chr14"]))
chr12_dna <- getSeq(Hsapiens,chr14_range)
letterFrequency(chr14_dna,‘GC",as.prob=TRUE)

FASTQ文件中的Reads

ShortRead用于fastq文件.BiocParallel用于加速任務(wù)進(jìn)程

##1. attach ShortRead and BiocParallel
library(ShortRead)
library(BiocParallel)
## 2. create a vectior of file paths
fls <- dir("~/fastq",pattern="*fastq",full=TRUE)
## 3. collect statistics
stats0 <- qa(fls)
## 4. generate and browser the report
if (interactive())
    browseURL(report(stats))

BAM中的Aligned Reads

GenomicAlignments用于輸入比對(duì)到參考基因組的reads
下一個(gè)案例中瓮栗，我們會(huì)讀入一個(gè)BAM文件,and specifically read in reads / supporting an apparent exon splice junction/ spanning position 19653773 of chromosome 14.

RNAseqData.HNRNPC.bam.chr14_BAMFILES內(nèi)有8個(gè)BAM文件。我們只是用第一個(gè)BAM文件瞄勾。我們會(huì)載入軟件包和數(shù)據(jù)包费奸，為目標(biāo)區(qū)域構(gòu)建一個(gè)GRanges對(duì)象，然后使用summarizeJunctions()尋找目標(biāo)區(qū)域的read

# 1. 載入軟件包
library(GenomicRanges)
library(GenomicAlignments)
# 2. 載入樣本數(shù)據(jù)
library('RNAseqDta.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1),asMates=TRUE)
# 3. 定義目標(biāo)區(qū)域
roi <- GRanges("chr14", IRanges(19653773,width=1)）
# 4. alignments, junctons, overlapping our roi
paln <- readAlignmentsList(bf)
j <- summarizeJunctions(paln,with.revmap=TRUE)
j_overlap <- j[j %over% roi]

# 5.supporting reads
paln[j_overlap$revmap[[1]]]

Called Variants from VCF files

VCF(Variant Call Files)描述了SNP和其他變異进陡。該文件包括元信息行愿阐，含有列名的header line和眾多的數(shù)據(jù)行，每一行都包括基因組上位置信息和每個(gè)位置上可選基因型信息趾疚。
數(shù)據(jù)通過VariantAnnotation的readVcf()讀入

library(VariantAnnotation)
fl <- system.file("extdata","chr22.vcf.gz",package=“VariantAnnotation")
vcf <- readVcf(fl,"hg19")

Genome Annotations from BED, WIG, GTF etc files

rtracklayer能夠?qū)牒蛯?dǎo)出許多常見的文件類型缨历，BED,WIG,GTF,還能查詢和導(dǎo)航UCSC genome Browser
rtracklayer包含測序所用BED文件

library(rtracklayer)
test_path <- system.file("test",package = "rtracklayer")
test_bed <- file.path(test_path,"test.bed")

test <- import(test_bed,format="bed")
test

AnnotationHub 同樣包括多種基因組注釋文件（BED,GTF,BigWig)，使用rtracklayer import()糙麦。更多有關(guān)AnnotationHub的介紹需要看 Annotation workflow and AnnotationHub HOW TO vignette辛孵。

順便放一下自己的知識(shí)星球，如果你覺得我對(duì)你有幫助的話赡磅。

知識(shí)星球

Bioconductor的數(shù)據(jù)類型