參考生信技能樹1以及生信技能樹2
只記錄從數(shù)據(jù)下載蹦漠,到最終結(jié)果展示笛园,具體生物學(xué)知識(shí)請自行查閱
稍后關(guān)于ChIP-seq的背景知識(shí)我會(huì)再發(fā)布一篇文章侍芝。
數(shù)據(jù)下載:數(shù)據(jù)存放地址
關(guān)于環(huán)境配置和軟件安裝請參考我之前的RNA-seq分析流程
-------------------------------------------
1 軟件安裝及環(huán)境配置
2 sra數(shù)據(jù)下載(采取aspera下載)
cd /mnt/f/Data/chip_seq/data
for ((i=204;i<=209;i++));do ascp -QT -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR620/SRR620${i}/SRR620${i}.sra .;done
ls -lh *sra
-rwxrwxrwx 1 root root 522M Aug 14 10:03 SRR620204.sra
-rwxrwxrwx 1 root root 365M Aug 14 10:08 SRR620205.sra
-rwxrwxrwx 1 root root 572M Aug 14 10:19 SRR620206.sra
-rwxrwxrwx 1 root root 741M Aug 14 10:30 SRR620207.sra
-rwxrwxrwx 1 root root 346M Aug 14 10:37 SRR620208.sra
-rwxrwxrwx 1 root root 639M Aug 14 11:08 SRR620209.sra
3 sra到fastq格式轉(zhuǎn)換并進(jìn)行質(zhì)量控制
3.1 用samtools中的fastq-dump將sra格式轉(zhuǎn)為fastq格式
source ~/miniconda3/bin/activate
for ((i=204;i<=209;i++));do fastq-dump --gzip --split-3 -A SRR620$i -O .;done
ls -lh *fastq.gz
-rwxrwxrwx 1 root root 883M Aug 14 11:18 SRR620204.fastq.gz
-rwxrwxrwx 1 root root 703M Aug 14 11:24 SRR620205.fastq.gz
-rwxrwxrwx 1 root root 968M Aug 14 11:30 SRR620206.fastq.gz
-rwxrwxrwx 1 root root 1.4G Aug 14 11:41 SRR620207.fastq.gz
-rwxrwxrwx 1 root root 589M Aug 14 11:45 SRR620208.fastq.gz
-rwxrwxrwx 1 root root 1.1G Aug 14 11:54 SRR620209.fastq.gz
3.2 用fastqc進(jìn)行質(zhì)量控制
#將所有的數(shù)據(jù)進(jìn)行質(zhì)控棵红,得到zip的壓縮文件和html文件
fastqc -o . *.fastq.gz
4 Bowtie2進(jìn)行序列比對(duì)
下載小鼠index文件并解壓(迅雷下載)
unzip mm10.zip
unzip mm10.zip
Archive: mm10.zip
inflating: mm10.1.bt2
inflating: mm10.2.bt2
inflating: mm10.3.bt2
inflating: mm10.4.bt2
inflating: mm10.rev.1.bt2
inflating: mm10.rev.2.bt2
inflating: make_mm10.sh
cd /mnt/f/Data/chip_seq/data
bowtie2 -p 6 -3 5 --local -x /mnt/f/Data/reference/bowtie2_index/mm10 -U SRR620204.fastq.gz | samtools sort -O bam -o ../aligned/ring1B.bam &
19687027 reads; of these:
19687027 (100.00%) were unpaired; of these:
7778437 (39.51%) aligned 0 times
7766495 (39.45%) aligned exactly 1 time
4142095 (21.04%) aligned >1 times
60.49% overall alignment rate
[bam_sort_core] merging from 4 files and 1 in-memory blocks...
bowtie2 -p 6 -3 5 --local -x /mnt/f/Data/reference/bowtie2_index/mm10 -U SRR620205.fastq.gz | samtools sort -O bam -o ../aligned/cbx7.bam &
20710168 reads; of these:
20710168 (100.00%) were unpaired; of these:
2584870 (12.48%) aligned 0 times
10914615 (52.70%) aligned exactly 1 time
7210683 (34.82%) aligned >1 times
87.52% overall alignment rate
[bam_sort_core] merging from 4 files and 1 in-memory blocks...
bowtie2 -p 6 -3 5 --local -x /mnt/f/Data/reference/bowtie2_index/mm10 -U SRR620206.fastq.gz | samtools sort -O bam -o ../aligned/suz12.bam &
21551927 reads; of these:
21551927 (100.00%) were unpaired; of these:
7109852 (32.99%) aligned 0 times
9455003 (43.87%) aligned exactly 1 time
4987072 (23.14%) aligned >1 times
67.01% overall alignment rate
[bam_sort_core] merging from 4 files and 1 in-memory blocks...
bowtie2 -p 6 -3 5 --local -x /mnt/f/Data/reference/bowtie2_index/mm10 -U SRR620207.fastq.gz | samtools sort -O bam -o ../aligned/RYBP.bam &
39870646 reads; of these:
39870646 (100.00%) were unpaired; of these:
6312903 (15.83%) aligned 0 times
20702502 (51.92%) aligned exactly 1 time
12855241 (32.24%) aligned >1 times
84.17% overall alignment rate
[bam_sort_core] merging from 8 files and 1 in-memory blocks...
bowtie2 -p 6 -3 5 --local -x /mnt/f/Data/reference/bowtie2_index/mm10 -U SRR620208.fastq.gz | samtools sort -O bam -o ../aligned/IgGold.bam &
13291429 reads; of these:
13291429 (100.00%) were unpaired; of these:
5608796 (42.20%) aligned 0 times
4663584 (35.09%) aligned exactly 1 time
3019049 (22.71%) aligned >1 times
57.80% overall alignment rate
[bam_sort_core] merging from 2 files and 1 in-memory blocks...
bowtie2 -p 6 -3 5 --local -x /mnt/f/Data/reference/bowtie2_index/mm10 -U SRR620209.fastq.gz | samtools sort -O bam -o ../aligned/IgG.bam &
31218866 reads; of these:
31218866 (100.00%) were unpaired; of these:
5370101 (17.20%) aligned 0 times
15180537 (48.63%) aligned exactly 1 time
10668228 (34.17%) aligned >1 times
82.80% overall alignment rate
[bam_sort_core] merging from 6 files and 1 in-memory blocks...
cd ../aligned
ls -lh *bam
-rwxrwxrwx 1 root root 623M Aug 14 23:53 cbx7.bam
-rwxrwxrwx 1 root root 999M Aug 14 23:56 IgG.bam
-rwxrwxrwx 1 root root 503M Aug 14 23:43 IgGold.bam
-rwxrwxrwx 1 root root 752M Aug 14 19:37 ring1B.bam
-rwxrwxrwx 1 root root 1.2G Aug 14 23:59 RYBP.bam
-rwxrwxrwx 1 root root 845M Aug 14 23:49 suz12.bam
5 用MACS2獲取Chip-seq富集區(qū)
macs2 callpeak -c IgGold.bam -t suz12.bam -q 0.05 -f BAM -g mm -n suz12 &
macs2 callpeak -c IgGold.bam -t cbx7.bam -q 0.05 -f BAM -g mm -n cbx7 &
macs2 callpeak -c IgGold.bam -t ring1B.bam -q 0.05 -f BAM -g mm -n ring1B &
macs2 callpeak -c IgGold.bam -t RYBP.bam -q 0.05 -f BAM -g mm -n RYBP &
每個(gè)比較都會(huì)得到四個(gè)文件,如下
NAMEpeaks.xls: 以表格形式存放peak信息咧栗,雖然后綴是xls逆甜,但其實(shí)能用文本編輯器打開,和bed格式類似致板,但是以1為基交煞,而bed文件是以0為基.也就是說xls的坐標(biāo)都要減一才是bed文件的坐標(biāo)
NAMEpeaks.narrowPeak NAMEpeaks.broadPeak 類似。后面4列表示為斟或, integer score for display, fold-change萝挤,-log10pvalue御毅,-log10qvalue,relative summit position to peak start怜珍。內(nèi)容和NAMEpeaks.xls基本一致端蛆,適合用于導(dǎo)入R進(jìn)行分析。
NAMEsummits.bed:記錄每個(gè)peak的peak summits酥泛,也就是記錄極值點(diǎn)的位置今豆。MACS建議用該文件尋找結(jié)合位點(diǎn)的motif嫌拣。
NAME_model.r,能通過NAME_model.r作圖晚凿,得到是基于你提供數(shù)據(jù)的peak模型
- 注:RYBP無peak數(shù)據(jù)(RYBP_model.r可以用)亭罪,手動(dòng)下載peaks數(shù)據(jù)
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE42nnn/GSE42466/suppl/GSE42466_RYBP_peaks_5.txt.gz
gzip -d GSE42466_RYBP_peaks_5.txt.gz
mv GSE42466_RYBP_peaks_5.txt RYBP_summits.bed
6 對(duì)peaks注釋和可視化(R):Chipseeker包
6.0 安裝并加載相應(yīng)的R包,讀入peak文件
source ("https://bioconductor.org/biocLite.R")
biocLite("ChIPseeker")
biocLite("TxDb.Mmusculus.UCSC.mm10.knownGene")
library("ChIPseeker")
library("org.Mm.eg.db")
library("TxDb.Mmusculus.UCSC.mm10.knownGene")
library("clusterProfiler")
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
讀入peak文件
setwd("F:/Data/chip_seq/aligned")
suz12<-readPeakFile("suz12_peaks.narrowPeak")
6.1 查看peak在全基因組的位置
covplot
函數(shù)可以計(jì)算peak 在染色體上的覆蓋區(qū)域歼秽,并可視化应役。
covplot(suz12,weightCol=5)
備注:
使用Y叔CHIPseeker包說明書中的covplot(peak, weightCol="V5")
提示一下錯(cuò)誤:
Error in .normarg_shift_or_weight(weight, "weight", x) :
'weight' must be a numeric vector, a single string, or a list-like object
我直接把v去掉了,顯示結(jié)果和說明書中一致燥筷。
還可以定義具體的染色體(圖就不展示了)
covplot(suz12, weightCol=5, chrs=c("chr4", "chr5"), xlim=c(4.5e7, 5e7))
6.2 ChIP peaks結(jié)合TSS 區(qū)域的情況
- TSS:transcription start site
- 首先箩祥,計(jì)算ChIP peaks結(jié)合在TSS區(qū)域的情況。這就需要準(zhǔn)備TSS區(qū)域肆氓,這一般定義在TSS位點(diǎn)的側(cè)翼序列(默認(rèn)-3000~+3000)袍祖。然后比對(duì) map到這些區(qū)域的peak,并生成tagMatrix谢揪。
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)
tagMatrix <- getTagMatrix(suz12, windows=promoter)
-
6.2.1 Heatmap of ChIP binding to TSS regions
tagHeatmap(tagMatrix, xlim=c(-3000, 3000), color="red")
備注蕉陋,下面這個(gè)代碼可以一步生成上面這個(gè)圖
peakHeatmap(suz12, TxDb=txdb, upstream=3000, downstream=3000, color="blue")
>> preparing promoter regions... 2018-08-17 16:54:26
>> preparing tag matrix... 2018-08-17 16:54:26
>> generating figure... 2018-08-17 16:54:41
>> done... 2018-08-17 16:54:4
6.2.2 Average Profile of ChIP peaks binding to TSS region
plotAvgProf(tagMatrix, xlim=c(-3000, 3000),
conf=0.95,resample = 1000,
xlab="Genomic Region (5'->3')", ylab = "Read Count Frequency")
6.3 peaks注釋
-
annotatePeak
函數(shù)進(jìn)行peaks注釋,可以定義TSS(轉(zhuǎn)錄起始位點(diǎn))區(qū)域拨扶,默認(rèn)情況下TSS定義為-3kb到+ 3kb凳鬓。 - annotatePeak的輸出是csAnno格式。 ChIPseeker中國的
as.GRanges
函數(shù)將csAnno轉(zhuǎn)換為GRanges格式患民,as.data.frame
將csAnno轉(zhuǎn)換為data.frame缩举,然后通過write.table
將其導(dǎo)出到文件。 - TxDb.Hsapiens.UCSC.hg38.knownGene和TxDb.Hsapiens.UCSC.hg19.knownGene分別對(duì)應(yīng)人類基因組hg38和hg19匹颤,TxDb.Mmusculus仅孩。
- UCSC.mm10.knownGene和TxDb.Mmusculus.UCSC.mm9.knownGene則對(duì)應(yīng)小鼠基因組mm10和mm9。
- 用戶還可以通過R
makeTxDbFromBiomart
和makeTxDbFromUCSC
從UCSC Genome Bioinformatics和BioMart數(shù)據(jù)庫檢索準(zhǔn)備自己的TxDb對(duì)象印蓖。然后進(jìn)行峰值注釋辽慕。 - 所有的峰值信息都會(huì)保存在輸出文件中。其中包含peak最近的gene的位置和鏈的信息另伍,從peak到最近的gene的TSS的距離等鼻百。鑒于某些信息可能的重疊,ChIPseeker采取以下優(yōu)先級(jí)別
Promoter
5’ UTR
3’ UTR
Exon
Intron
Downstream
Intergenic
更加具體信息參考作者用戶手冊摆尝。
peakAnno <- annotatePeak(suz12, tssRegion=c(-3000, 3000),
TxDb=txdb, annoDb="org.Mm.eg.db")
>> preparing features information... 2018-08-17 17:24:41
>> identifying nearest features... 2018-08-17 17:24:41
>> calculating distance from peak to TSS... 2018-08-17 17:24:42
>> assigning genomic annotation... 2018-08-17 17:24:42
>> adding gene annotation... 2018-08-17 17:24:51
'select()' returned 1:many mapping between keys and columns
>> assigning chromosome lengths 2018-08-17 17:24:52
>> done... 2018-08-17 17:24:52
6.3.1 可視化基因組注釋
為了注釋給定peak在基因組的位置特征信息,可使用annotatePeak
函數(shù)因悲,這些信息在輸出信息的“annotation”列堕汞,包括peak是否在TSS,外顯子,5'UTR,3'UTR,內(nèi)含子或間隔區(qū)晃琳。很多研究人員對(duì)這些注釋非常感興趣讯检。用戶可以自己定義TSS區(qū)域琐鲁。
有以下幾種可視化方式
-
A:pie plot
plotAnnoPie(peakAnno)
----------------------------------------------------------------
-
B: bar plot
plotAnnoBar(peakAnno)
--------------------------------------------------------------
-
C: venn plot(重疊)
因?yàn)橐恍┳⑨寱?huì)重疊,如果想查看重疊的全部注釋信息人灼,可以使用veenpie
函數(shù)
vennpie(peakAnno)
-----------------------------------------------------------
-
D: upsetplot(重疊)
upsetplot(peakAnno)
-------------------------------------------------------
6.3.2 Visualize distribution of TF-binding loci relative to TSS
peak(TF結(jié)合位點(diǎn))到最近的gene的TSS之間的距離可以有annotatePeak
函數(shù)進(jìn)行計(jì)算围段。作者提供了plotDistToTSS
函數(shù)計(jì)算最近基因的TSS上游和下游的結(jié)合位點(diǎn)的百分比,并可視化這種分布投放。
plotDistToTSS(peakAnno,
title="Distribution of transcription factor-binding loci\nrelative to TSS")
6.4 Functional enrichment analysis功能富集分析
因個(gè)人工作原因奈泪,簡書最近沒有更新,現(xiàn)在已經(jīng)度過轉(zhuǎn)變期灸芳,馬上開始更新涝桅,感謝關(guān)注的朋友及發(fā)來私信的朋友。
我的博客即將搬運(yùn)同步至騰訊云+社區(qū)烙样,邀請大家一同入駐:https://cloud.tencent.com/developer/support-plan?invite_code=2ys0k0l1st2ck