0.數(shù)據(jù)介紹
這是一篇發(fā)表在Cancer Cell 2019 Sep的文章——Single-Cell Transcriptomics in Medulloblastoma Reveals Tumor-Initiating Progenitors and Oncogenic Cascades during Tumorigenesis and Relapse(PMID: 31474569)癣蟋。本文揭示了OLIG2+細(xì)胞在髓母細(xì)胞瘤的發(fā)生發(fā)展起到重要作用呵晨,并且會以靜息狀態(tài)宗侦,存在于成熟的髓母細(xì)胞瘤中检吆,并在復(fù)發(fā)時被激活,可能是髓母細(xì)胞瘤對治療不敏感的原因之一庄吼。OLIG2能夠通過表觀調(diào)控丘喻,激活hippo-YAP和AUTOTA-A/MYCN的網(wǎng)絡(luò)促進(jìn)MB的生長屡立。
在曾老師推送的基礎(chǔ)上褂痰,想進(jìn)一步對文章的數(shù)據(jù)進(jìn)行復(fù)現(xiàn)亩进。常規(guī)轉(zhuǎn)錄差異建議都加上一個轉(zhuǎn)錄因子數(shù)據(jù) (qq.com)
本數(shù)據(jù)集包含了ChIP-seq,ATAC-seq缩歪,RNA-seq归薛。是非常好的練手機會。
背景知識:關(guān)于Medulloblastoma (MB):主要是有4個分子亞型:Wingless (WNT), Sonic hedgehog (SHH), Group 3, and Group 4匪蝙。其中主籍,SHH亞型的信號通路激活主要緣于PTCH1的功能缺失以及SMO的激活,其中GNAS是一種抑癌基因逛球,其功能缺陷與不良預(yù)后相關(guān)
1.數(shù)據(jù)預(yù)處理
現(xiàn)在SRA網(wǎng)站中獲取實驗設(shè)計的情況千元。
Run BioProject BioSample Assay Type AvgSpotLen Bases Bytes Experiment GEO_Accession LibraryLayout LibrarySelection LibrarySource Sample Name source_name SRA Study Age geneotype Antibody tissue
1 SRR7984671 PRJNA495715 SAMN10220650 ATAC-seq 75 1.09 G 535.44 Mb SRX4815869 GSM3423042 SINGLE other GENOMIC GSM3423042 medulloblastoma tumor cells SRP165176 Postnatal day 20 hGFAPcre;Pthc/c
2 SRR7984672 PRJNA495715 SAMN10220649 ATAC-seq 75 569.80 M 286.09 Mb SRX4815870 GSM3423043 SINGLE other GENOMIC GSM3423043 medulloblastoma tumor cells SRP165176 Postnatal day 20 hGFAPcre;Pthc/c;Olig2-GFP
3 SRR7984673 PRJNA495714 SAMN10220667 ChIP-Seq 75 1.10 G 665.72 Mb SRX4815871 GSM3423044 SINGLE ChIP GENOMIC GSM3423044 Cerebellum SRP165177 Olig2 (Millipore, ab9610)
4 SRR7984674 PRJNA495714 SAMN10220666 ChIP-Seq 75 2.97 G 1.32 Gb SRX4815872 GSM3423045 SINGLE ChIP GENOMIC GSM3423045 Medulloblastoma tissue SRP165177 Olig2 (Millipore, ab9610)
5 SRR7984675 PRJNA495714 SAMN10220665 ChIP-Seq 75 1.47 G 753.94 Mb SRX4815873 GSM3423046 SINGLE ChIP GENOMIC GSM3423046 Medulloblastoma tissue SRP165177 H3k27ac (Abcam, ab4729)
6 SRR7984676 PRJNA495713 SAMN10220664 RNA-Seq 81 9.18 G 4.98 Gb SRX4815874 GSM3423057 PAIRED cDNA TRANSCRIPTOMIC GSM3423057 medullablastoma tumor cells SRP165179 Postnatal day 24 to30 hGFAPcre;Ptchc/c
7 SRR7984677 PRJNA495712 SAMN10220663 RNA-Seq 150 5.46 G 1.96 Gb SRX4815875 GSM3423058 PAIRED cDNA TRANSCRIPTOMIC GSM3423058 cerebellum SRP165178 Postnatal day 18 Lats1/2f/f
8 SRR7984678 PRJNA495712 SAMN10220662 RNA-Seq 150 4.95 G 1.79 Gb SRX4815876 GSM3423059 PAIRED cDNA TRANSCRIPTOMIC GSM3423059 cerebellum SRP165178 Postnatal day 18 Lats1/2f/f
9 SRR7984679 PRJNA495712 SAMN10220661 RNA-Seq 150 4.82 G 1.77 Gb SRX4815877 GSM3423060 PAIRED cDNA TRANSCRIPTOMIC GSM3423060 cerebellum SRP165178 Postnatal day 18 Lats1/2f/f
10 SRR7984680 PRJNA495712 SAMN10220660 RNA-Seq 150 4.38 G 1.59 Gb SRX4815878 GSM3423061 PAIRED cDNA TRANSCRIPTOMIC GSM3423061 Medullobalstoma SRP165178 Postnatal day 18 Lats1/2f/f; Math1cre
11 SRR7984681 PRJNA495712 SAMN10220646 RNA-Seq 150 4.17 G 1.50 Gb SRX4815879 GSM3423062 PAIRED cDNA TRANSCRIPTOMIC GSM3423062 Medullobalstoma SRP165178 Postnatal day 18 Lats1/2f/f; Math1cre
12 SRR7984682 PRJNA495712 SAMN10220645 RNA-Seq 150 5.06 G 1.84 Gb SRX4815880 GSM3423063 PAIRED cDNA TRANSCRIPTOMIC GSM3423063 Medullobalstoma SRP165178 Postnatal day 18 Lats1/2f/f; Math1cre
13 SRR7984683 PRJNA495712 SAMN10220644 RNA-Seq 76 2.24 G 720.74 Mb SRX4815881 GSM3423064 SINGLE cDNA TRANSCRIPTOMIC GSM3423064 Medullobalstoma SRP165178 Postnatal day 20 Ptchf/f; hGFAPcre
14 SRR7984684 PRJNA495712 SAMN10220643 RNA-Seq 76 2.17 G 700.20 Mb SRX4815882 GSM3423065 SINGLE cDNA TRANSCRIPTOMIC GSM3423065 Medullobalstoma SRP165178 Postnatal day 20 Ptchf/f; hGFAPcre
15 SRR7984685 PRJNA495712 SAMN10220642 RNA-Seq 76 2.16 G 692.55 Mb SRX4815883 GSM3423066 SINGLE cDNA TRANSCRIPTOMIC GSM3423066 Medullobalstoma SRP165178 Postnatal day 20 Ptchf/f; Olig2c/c; hGFAPcre
16 SRR7984686 PRJNA495712 SAMN10220641 RNA-Seq 76 2.21 G 710.51 Mb SRX4815884 GSM3423067 SINGLE cDNA TRANSCRIPTOMIC GSM3423067 Medullobalstoma SRP165178 Postnatal day 20 Ptchf/f; Olig2c/c; hGFAPcre
17 SRR8037270 PRJNA495592 SAMN10228087 RNA-Seq 132 3.16 G 1.87 Gb SRX4867885 GSM3423041 PAIRED cDNA TRANSCRIPTOMIC GSM3423041 medullablastoma tumor cells SRP165234 Postnatal day 10 hGFAPcre;Ptchc/c
18 SRR8037271 PRJNA495592 SAMN10228087 RNA-Seq 132 2.85 G 1.69 Gb SRX4867885 GSM3423041 PAIRED cDNA TRANSCRIPTOMIC GSM3423041 medullablastoma tumor cells SRP165234 Postnatal day 10 hGFAPcre;Ptchc/c
19 SRR8037272 PRJNA495592 SAMN10228087 RNA-Seq 132 4.23 G 2.48 Gb SRX4867885 GSM3423041 PAIRED cDNA TRANSCRIPTOMIC GSM3423041 medullablastoma tumor cells SRP165234 Postnatal day 10 hGFAPcre;Ptchc/c
20 SRR8037273 PRJNA495592 SAMN10228087 RNA-Seq 132 2.80 G 1.67 Gb SRX4867885 GSM3423041 PAIRED cDNA TRANSCRIPTOMIC GSM3423041 medullablastoma tumor cells SRP165234 Postnatal day 10 hGFAPcre;Ptchc/c
21 SRR9667717 PRJNA495714 SAMN12249854 ChIP-Seq 75 2.34 G 1.08 Gb SRX6428312 GSM3936983 SINGLE ChIP GENOMIC GSM3936983 Cerebellum SRP165177 Olig2 (Millipore, ab9610)
22 SRR9667718 PRJNA495714 SAMN12249853 ChIP-Seq 75 1.98 G 896.85 Mb SRX6428313 GSM3936984 SINGLE ChIP GENOMIC GSM3936984 Medulloblastoma tissue SRP165177 Olig2 (Millipore, ab9610)
23 SRR9667719 PRJNA495714 SAMN12249852 ChIP-Seq 75 1.82 G 838.77 Mb SRX6428308 GSM3936985 SINGLE ChIP GENOMIC GSM3936985 Medulloblastoma tissue SRP165177 Olig2 (Millipore, ab9610)
24 SRR9667720 PRJNA495714 SAMN12249850 ChIP-Seq 75 2.02 G 924.48 Mb SRX6428309 GSM3936986 SINGLE ChIP GENOMIC GSM3936986 Medulloblastoma tissue SRP165177 H3k27ac (Abcam, ab4729)
25 SRR9667721 PRJNA495714 SAMN12249849 ChIP-Seq 75 1.10 G 665.72 Mb SRX6428310 GSM3936987 SINGLE ChIP GENOMIC GSM3936987 Cerebellum SRP165177 none
26 SRR9667722 PRJNA495714 SAMN12249847 ChIP-Seq 75 2.64 G 1.16 Gb SRX6428311 GSM3936988 SINGLE ChIP GENOMIC GSM3936988 Medulloblastoma tissue SRP165177 none
27 SRR9667723 PRJNA495713 SAMN12249846 RNA-Seq 83 9.24 G 3.78 Gb SRX6428314 GSM3936989 PAIRED cDNA TRANSCRIPTOMIC GSM3936989 medullablastoma tumor cells SRP165179 Postnatal day 12 hGFAPcre;Ptchc/c
28 SRR9667724 PRJNA495712 SAMN12249845 RNA-Seq 75 1.74 G 830.27 Mb SRX6428315 GSM3936990 SINGLE cDNA TRANSCRIPTOMIC GSM3936990 GFP+ cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
29 SRR9667725 PRJNA495712 SAMN12249844 RNA-Seq 75 1.72 G 826.82 Mb SRX6428316 GSM3936991 SINGLE cDNA TRANSCRIPTOMIC GSM3936991 GFP+ cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
30 SRR9667726 PRJNA495712 SAMN12249843 RNA-Seq 75 1.73 G 832.94 Mb SRX6428317 GSM3936992 SINGLE cDNA TRANSCRIPTOMIC GSM3936992 GFP+ cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
31 SRR9667727 PRJNA495712 SAMN12249841 RNA-Seq 75 1.63 G 787.67 Mb SRX6428318 GSM3936993 SINGLE cDNA TRANSCRIPTOMIC GSM3936993 GFP- cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
32 SRR9667728 PRJNA495712 SAMN12249840 RNA-Seq 75 1.63 G 782.61 Mb SRX6428319 GSM3936994 SINGLE cDNA TRANSCRIPTOMIC GSM3936994 GFP- cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
33 SRR9667729 PRJNA495712 SAMN12249838 RNA-Seq 75 1.79 G 861.33 Mb SRX6428320 GSM3936995 SINGLE cDNA TRANSCRIPTOMIC GSM3936995 GFP- cells SRP165178 Postnatal day 15 hGFAPcre;Ptchc/c;Olig2-GFP
1.1 原始文件下載
mkdir SRRfile
prefetch --option-file SRR_Acc_List.txt
1.2 使用fasterq進(jìn)行解壓縮
ls .| while read id;do(fasterq-dump $id);done
mkdir ../fastq
mv *.fastq ../fastq
忘記設(shè)置output的地址了,這下得把所有的fastq文件重新整理一下
cd fastq
cat ../SRR_Acc_List.txt | while read id;do(mkdir $id);done
1.3 測序數(shù)據(jù)的質(zhì)控
cat SRR_Acc_List.txt | while read id;do(
fastqc -o ./fastqc -t 20 ./fastq/$id/*.fastq \
&& multiqc ./fastqc -o ./fastqc);
done
1.4 測序數(shù)據(jù)分類整理
由于原始數(shù)據(jù)涉及了不同的文件類型需忿,所以需要相應(yīng)進(jìn)行整理诅炉。在總文件夾中寫入不同數(shù)據(jù)類型所對應(yīng)的SRR序號蜡歹。shell腳本不是特別熟練屋厘,因此就在單機上,使用excel進(jìn)行篩選了月而。
# vi chip.txt
SRR7984673
SRR7984674
SRR7984675
SRR9667717
SRR9667718
SRR9667719
SRR9667720
SRR9667721
SRR9667722
# vi rna.txt
SRR7984676
SRR7984677
SRR7984678
SRR7984679
SRR7984680
SRR7984681
SRR7984682
SRR7984683
SRR7984684
SRR7984685
SRR7984686
SRR8037270
SRR8037271
SRR8037272
SRR8037273
SRR9667723
SRR9667724
SRR9667725
SRR9667726
SRR9667727
SRR9667728
SRR9667729
# vi atac.txt
SRR7984671
SRR7984672
RNA-seq共有22個測序文件汗洒;ATAC-seq共有2個測序文件;ChIP-seq共有9個測序文件父款。
2.ChIP-seq數(shù)據(jù)處理
2.1 使用trim_galore自動去接頭
這個數(shù)據(jù)貌似已經(jīng)經(jīng)過質(zhì)控了溢谤,因此這步貌似可做可不做瞻凤。G沒有去接頭時,比對上的比例還是能夠達(dá)到95%以上世杀,還是不錯的阀参。
mkdir clean
cat chip.txt | while read id;do\
trim_galore --quality 20 --phred33 \
--length 20 -j 20 \
-o ./clean/$id.fastq.gz \
./fastq/$id/$id.fastq;
done
2.2 使用bowtie2進(jìn)行比對
mkdir aligned
cat chip.txt | while read id;do \
bowtie2 -p 20 -3 5 --local -x ~/mm10/mm10 -U ./fastq/$id/$id.fastq | \
samtools sort -O bam -o ./aligned/$id.bam; \
done
比對過程需要花很多時間。
# SRR7984673
# 14675638 reads; of these:
# 14675638 (100.00%) were unpaired; of these:
# 339654 (2.31%) aligned 0 times
# 8731707 (59.50%) aligned exactly 1 time
# 5604277 (38.19%) aligned >1 times
# 97.69% overall alignment rate
# [bam_sort_core] merging from 4 files and 1 in-memory blocks...
......
2.3 使用sambamba去重
比對的效率總體還是不錯的瞻坝。比對之后進(jìn)行PCR去重蛛壳。一開始選擇了picard工具進(jìn)行去重。
mkdir deduplicate
cat atac.txt |while read id
do(picard MarkDuplicates --REMOVE_DUPLICATES true -I aligned/${id}.bam \
-O ./deduplicate/${id}_deduplicate.bam -M ${id}.log);
done
然而占用的線程過多所刀,為了避免被管理員拉黑衙荐,選擇其他的去重方法sambamba
mkdir deduplicate
cat chip.txt |while read id;do
sambamba markdup -r aligned/${id}.bam deduplicate/${id}_deduplicate.bam;
done
2.4 使用MACS2 call peak
在call peak之前,需要對明確對照組和實驗組浮创,根據(jù)文章的介紹忧吟,對GSM序號和SRR序號進(jìn)行配對。具體的對應(yīng)關(guān)系如下所示:
SRR7984673 GSM3423044 ChIP_CB_Olig2
SRR7984674 GSM3423045 ChIP_MB_Olig2
SRR7984675 GSM3423046 ChIP_MB_H3k27ac
SRR9667717 GSM3936983 ChIP_CB_H3k27ac
SRR9667718 GSM3936984 ChIP_MB_Olig2_repeat 2
SRR9667719 GSM3936985 ChIP_MB_Olig2_repeat 3
SRR9667720 GSM3936986 ChIP_MB_H3k27ac_repeat 2
SRR9667721 GSM3936987 CB_Input
SRR9667722 GSM3936988 MB_Input
因為涉及比較的組數(shù)不多斩披,所以可以手動進(jìn)行設(shè)置溜族,如需批量操作,可以自行構(gòu)建比較文檔垦沉。
MACS2的命令比較簡單斩祭,主要就是確定好對照組和實驗組。
mkdir macs2
# MB_H3K27ac_1
macs2 callpeak -c ./deduplicate/SRR9667722_deduplicate.bam \
-t ./deduplicate/SRR7984675_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n MB_H3K27ac_1_
# MB_H3K27ac_2
macs2 callpeak -c ./deduplicate/SRR9667722_deduplicate.bam \
-t ./deduplicate/SRR9667720_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n MB_H3K27ac_2_
# MB_OLIG2_1
macs2 callpeak -c ./deduplicate/SRR9667722_deduplicate.bam \
-t ./deduplicate/SRR7984674_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n MB_OLIG2_1_
# MB_OLIG2_2
macs2 callpeak -c ./deduplicate/SRR9667722_deduplicate.bam \
-t ./deduplicate/SRR9667718_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n MB_OLIG_2_
# MB_OLIG2_3
macs2 callpeak -c ./deduplicate/SRR9667722_deduplicate.bam \
-t ./deduplicate/SRR9667719_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n MB_OLIG_3_
# CB_OLIG2_1
macs2 callpeak -c ./deduplicate/SRR9667721_deduplicate.bam \
-t ./deduplicate/SRR7984673_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n CB_OLIG_1_
# CB_H3K27ac_1
macs2 callpeak -c ./deduplicate/SRR9667721_deduplicate.bam \
-t ./deduplicate/SRR9667717_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
-n CB_H3K27ac_1_
每次比較都會得到四個文件:NAMEpeaks.xls乡话,NAMEsummits.bed摧玫,NAME_model.r,NAMEpeaks.narrowPeak
NAMEpeaks.xls:peak信息绑青,和bed格式類似诬像,但是以1為基,而bed文件是以0為基.也就是說xls的坐標(biāo)都要減一才是bed文件的坐標(biāo)
NAMEpeaks.narrowPeak:后面4列表示為闸婴, integer score for display坏挠, fold-change,-log10pvalue邪乍,-log10qvalue降狠,relative summit position to peak start。內(nèi)容和NAMEpeaks.xls基本一致庇楞,適合用于導(dǎo)入R進(jìn)行分析榜配。
NAMEsummits.bed:記錄每個peak的peak summits,也就是記錄極值點的位置吕晌。MACS建議用該文件尋找結(jié)合位點的motif蛋褥。
NAME_model.r:能通過NAME_model.r作圖,得到是基于你提供數(shù)據(jù)的peak模型
2.5 deeptools可視化
peak的可視化有多種軟件可供選擇睛驳,包括deeptools烙心。也可以將數(shù)據(jù)導(dǎo)入R環(huán)境中膜廊,用Y叔的CHIPseeker包進(jìn)行可視化。
這是原文中Figure6F-H的數(shù)據(jù)淫茵。F圖是通過熱圖展示OLIG2 binding的位點上爪瓜,也往往有高強度的H3K27ac的信號,提示著一種OLIG2在增強相關(guān)基因表達(dá)的機制匙瘪。G圖是顯示在MB腫瘤中钥勋,OLIG2和H3K27ac的結(jié)合情況。H圖則是針對結(jié)合的位點進(jìn)行motif的分析辆苔,預(yù)測最有可能對目標(biāo)基因進(jìn)行調(diào)控的轉(zhuǎn)錄因子算灸。
原文中對方法學(xué)的描述是這樣的:
All sequencing data were mapped to mouse genome assembly mm10, and ChIP-seq peak calling was performed as previously described using model-based analysis of ChIP-seq (MACS, version 1.4.2; http://liulab.dfci.harvard.edu/MACS) with default parameters to get primary binding regions. To ensure that our data were of high quality and reproducibility, we called peaks with enrichment R10-fold over control (p % 10-9) and compared the peak sets using the ENCODE overlap rules. The identified primary regions were further filtered using the following criteria, to define a more stringent protein-DNA interactome: (1) the p value cutoff was set to %10-9; and (2) we required an enrichment of six-fold and peak height >5. The genome-wide distribution of protein-binding regions was determined by HOMER (http://homer.salk.edu/homer/index.html) in reference to UCSC mm10. For all ChIP-seq data sets, WIG files were generated with MACS, which were subsequently visualized using Mochiview v1.46. Occupancy was analyzed with Pearson’s correlation and ToppCluster (https://toppcluster.cchmc.org/). ChIP-seq heat maps were ordered by strength of binding. The heat maps were drawn using the Heatmap tools provided by Cistrome (http://cistrome.org/ap).
以下開始對這些圖進(jìn)行復(fù)現(xiàn)。
#首先將bam轉(zhuǎn)換為bw文件
effective_genome_size=1870000000
cat chip.txt |while read id;do
bamCoverage -p 25 --bam ./deduplicate/${id}_deduplicate.bam -o ./bw/${id}.bw \
--binSize 10 --normalizeUsing RPGC --effectiveGenomeSize ${effective_genome_size}
done
F圖
# conda install deeptools
mkdir deeptools
computeMatrix reference-point --referencePoint TSS -p 20\
-b 5000 -a 5000 -R ~/mm10.tss.bed \
-S ./bw/SRR7984673.bw ./bw/SRR7984674.bw ./bw/SRR7984675.bw \
--skipZeros -o ./deeptools/F.gz \
--outFileSortedRegions ./deeptools/F_Heatmap1sortedRegions.bed
cd deeptools
plotHeatmap -m F.gz -out F.TSS.Heatmap.pdf --plotFileFormat pdf
plotProfile -m F.gz -out F.TSS.Profile.pdf --plotFileFormat pdf --perGroup
cd ..
G圖
computeMatrix reference-point --referencePoint TSS -p 20\
-b 5000 -a 5000 -R ~/mm10.tss.bed \
-S ./bw/SRR7984674.bw ./bw/SRR7984675.bw \
--skipZeros -o ./deeptools/G.gz \
--outFileSortedRegions ./deeptools/G_Heatmap1sortedRegions.bed
cd deeptools
plotHeatmap -m G.gz -out G.TSS.Heatmap.pdf --plotFileFormat pdf
plotProfile -m G.gz -out G.TSS.Profile.pdf --plotFileFormat pdf --perGroup
仍然與原文中存在一些差異驻啤。這里TSS的范圍沒有按照原文-±400b的范圍進(jìn)行可視化菲驴,另外,需要對識別到的peak進(jìn)行相應(yīng)的篩選骑冗。
2.6 homer尋找motif
閱讀原文的方法學(xué)部分赊瞬,作者通過homer進(jìn)行motif的分析。下面對相關(guān)motif進(jìn)行復(fù)現(xiàn)
本文是針對MB中OLIG2結(jié)合位點進(jìn)行motif分析贼涩,因此巧涧,我們通過macs2獲得的peak,對特征性的motif進(jìn)行識別遥倦。
awk '{print $4"t"$1"t"$2"t"$3"t "}' ./macs2/MB_OLIG2_1__peaks.narrowPeak > ./macs2/MB_OLIG2_1.homer_peaks.tmp
# 尋找motif
mkdir homer
findMotifsGenome.pl ./macs2/MB_OLIG2_1__summits.bed ~/mm10.fa ./homer/MB_OLIG2_1.homer.motif -size 200 -len 8,10,12
針對結(jié)合的DNA序列谤绳,會返回很多識別到的motif序列。但貌似跟原文展示的最顯著的都不太一樣袒哥。
2.7 DiffBind找差異peak
由于本文中涉及多組進(jìn)行比較缩筛,因此還可以通過DiffBind的方法尋找差異peak。
3.ATAC-seq數(shù)據(jù)處理
3.1 使用bowtie2進(jìn)行序列比對
ATAC與ChIP類似堡称,還是需要先將測序的片段比對到基因組上去
cat atac.txt | while read id;do \
bowtie2 -p 25 -3 5 --local -x ~/mm10/mm10 -U ./fastq/$id/$id.fastq | \
samtools sort -O bam -o ./aligned/$id.bam; \
done
#14541510 reads; of these:
# 14541510 (100.00%) were unpaired; of these:
# 171629 (1.18%) aligned 0 times
# 9336347 (64.20%) aligned exactly 1 time
# 5033534 (34.61%) aligned >1 times
#98.82% overall alignment rate
#[bam_sort_core] merging from 4 files and 1 in-memory blocks...
#7597315 reads; of these:
# 7597315 (100.00%) were unpaired; of these:
# 95358 (1.26%) aligned 0 times
# 4675659 (61.54%) aligned exactly 1 time
# 2826298 (37.20%) aligned >1 times
#98.74% overall alignment rate
[bam_sort_core] merging from 2 files and 1 in-memory blocks...
注意瞎抛,這里得到的bam文件是已經(jīng)排過序的bam文件
3.2 去除PCR重復(fù)
同ChIP。
# mkdir deduplicate
cat atac.txt |while read id;do
sambamba markdup -r aligned/${id}.bam deduplicate/${id}_deduplicate.bam;
done
3.3 去重之后使用macs2進(jìn)行call peak
同ChIP却紧。
macs2 callpeak -c ./deduplicate/SRR7984671_deduplicate.bam \
-t ./deduplicate/SRR7984672_deduplicate.bam \
-q 0.05 -f BAM -g mm --outdir macs2 \
--nomodel --shift -100 --extsize 200 -n ATAC_
這步之后也可以獲得narrowPeak文件和bed文件桐臊。
3.4 獲取bw文件
前面的bam文件,narrowPeak文件晓殊,bed文件都有了断凶,現(xiàn)在還差一個bw文件
bw文件可以獲得測序的覆蓋度
# mkdir bw
effective_genome_size=1870000000
cat atac.txt |while read id;do
bamCoverage -p 20 --bam ./deduplicate/${id}_deduplicate.bam -o ./bw/${id}.bw \
--binSize 10 --normalizeUsing RPGC --effectiveGenomeSize ${effective_genome_size}
done
這是原文中根據(jù)ATAC-seq展現(xiàn)的不同基因位點染色質(zhì)開放程度的差異。
導(dǎo)出bw文件挺物,在igv中挑選了幾個基因進(jìn)行了可視化懒浮,的確可以看到OLIG2+組中飘弧,染色質(zhì)的開放程度是更高的
3.5 使用deeptools可視化
# mkdir deeptools
computeMatrix reference-point --referencePoint TSS -p 20\
-b 5000 -a 5000 -R ~/mm10.tss.bed \
-S ./bw/SRR7984671.bw ./bw/SRR7984672.bw \
--skipZeros -o ./deeptools/ATAC_TSS.gz \
--outFileSortedRegions ./deeptools/ATAC_Heatmap1sortedRegions.bed
cd deeptools
plotHeatmap -m ATAC_TSS.gz -out TSS.Heatmap.pdf --plotFileFormat pdf
plotProfile -m ATAC_TSS.gz -out TSS.Profile.pdf --plotFileFormat pdf --perGroup
不同的參數(shù)所映射的含義也不相同识藤,可以通過這張圖進(jìn)行了解(來源:plob.org)砚著。
從本數(shù)據(jù)中得到的Profile圖中,也可以明顯看到Olig2+的樣本中染色體的可及性更高痴昧,與上述結(jié)論一致稽穆。
4.RNA-seq數(shù)據(jù)處理
RNA-seq數(shù)據(jù)的處理和分析已經(jīng)比較普及了,在此先略過赶撰。
5.寫在后面
雖然能把整個pepline順過來舌镶,但是對于部分參數(shù)的理解還是不夠深入,仍然需要對相關(guān)生物學(xué)背景和軟件有更深的認(rèn)識豪娜。感謝生信技能樹搭建這個生信愛好者的平臺餐胀,以及Jimmy大神無私分享的這么多資料,讓生信這條學(xué)習(xí)路線不再陡峭瘤载。與大家一起進(jìn)步否灾。