參考基因組下載
有三大全文網(wǎng)站提供參考基因組下載,它們分別是:
1.NCBI (https://www.ncbi.nlm.nih.gov/grc)
2.UCSC (http://hgdownload.soe.ucsc.edu/downloads.html)
3.Ensemble (http://asia.ensembl.org/index.html?redirect=no)
目前最常用的人和小鼠的參考基因組版本如下(Jimmy總結(jié)):
|NCBI | UCSC| Ensemble|
|GRCh36 | hg18 | ENSEMBL release_52 |
|GRCh37 | hg19 | ENSEMBL release_59/61/64/68/69/75|
|GRCh38 | hg38 | ENSEMBL release_76/77/78/80/81/82|
這里我下載的是USCS版本的human基因組(hg19)
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz #下載USCS版本的hg19
$ tar -zxvf chromFa.tar.gz # 解壓縮
chr1.fa
chr10.fa
chr11.fa
chr11_gl000202_random.fa
chr12.fa
chr13.fa
[...]
現(xiàn)在文件夾里應(yīng)該有所有染色體的文件驾讲,每一條染色體序列是單獨(dú)的一個文件蚊伞,后綴都是.fa。
$ cat *.fa > hg19.fa
#把所有染色體的信息都重定向到一個文件里吮铭,整合所有染色體信息,最后的文件大小大約3.2G
$ rm -rf chr* #刪除單獨(dú)的染色體文件时迫,節(jié)省空間
下載注釋文件GTT/GFT
簡單來講注釋文件就是基因組的說明書,告訴我們哪些序列是編碼蛋白的基因沐兵,哪些是非編碼基因别垮,外顯子、內(nèi)含子扎谎、UTR等的位置等等碳想。以上三個網(wǎng)站都有基因組的注釋文件。
現(xiàn)在最權(quán)威的人類和小鼠基因組的注釋還屬Gencode數(shù)據(jù)庫毁靶。(參考:https://zhuanlan.zhihu.com/p/28126314)注意注釋文件的格式一般是gtf或者gff3格式的胧奔。
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.annotation.gtf.gz
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.annotation.gff3.gz
$ gzip -d gencode.v31lift37.annotation.gff3.gz #解壓
$ gzip -d gencode.v31lift37.annotation.gtf.gz #解壓
因為這個文章的數(shù)據(jù)在fastqc質(zhì)控檢查后質(zhì)量比較好,就沒有做trim预吆。
記錄一下GTF 和GFF3文件的內(nèi)容(http://www.reibang.com/p/3e545b9a3c68)龙填,感覺沒什么區(qū)別:
GTF(General Transfer Format)其實就是GFF2,以Tab分割拐叉,分為如下幾列:
- seqname 染色體名稱- name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
- source來源 - name of the program that generated this feature, or the data source (database or project name)
- feature特性(是基因岩遗,外顯子,還是其他一些什么) - feature type name, e.g. Gene, Variation, Similarity
- start(在染色體上的開始位置) - Start position of the feature, with sequence numbering starting at 1.
- end(在染色體上結(jié)束的位置) - End position of the feature, with sequence numbering starting at 1.
- score - A floating point value.
- strand - defined as + (forward) or - (reverse).
- frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
- attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
而GFF3(General Feature Format)的格式如下:
- seqid - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seq ID must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
- source - name of the program that generated this feature, or the data source (database or project name)
- type - type of feature. Must be a term or accession from the SOFA sequence ontology
- start - Start position of the feature, with sequence numbering starting at 1.
- end - End position of the feature, with sequence numbering starting at 1.
- score - A floating point value.
- strand - defined as + (forward) or - (reverse).
- phase - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
- attributes - A semicolon-separated list of tag-value pairs, providing additional information about each feature. Some of these tags are predefined, e.g. ID, Name, Alias, Parent - see the GFF documentation for more details.
下載hg19索引
$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz #下載索引凤瘦,會彈出下面的下載進(jìn)度
--2019-08-31 21:48:51-- ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz
=> ‘hg19.tar.gz’
Resolving ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)... 128.220.233.225
Connecting to ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)|128.220.233.225|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/infphilo/hisat2/data ... done.
==> SIZE hg19.tar.gz ... 4181115011
==> PASV ... done. ==> RETR hg19.tar.gz ... done.
Length: 4181115011 (3.9G) (unauthoritative)
hg19.tar.gz 100%[============================================>] 3.89G 1.27MB/s in 58m 27s
2019-08-31 22:47:18 (1.14 MB/s) - ‘hg19.tar.gz’ saved [4181115011]
也可以自己構(gòu)建索引宿礁,但是據(jù)說如果沒有服務(wù)器的話,筆記本電腦構(gòu)建人類基因組索引會非常費(fèi)時蔬芥,所以我就直接在官網(wǎng)下載了索引梆靖。下載索引也要一些時間控汉,這個花了我大概1個小時的時間,網(wǎng)速感人返吻。姑子。。
為什么以后的比對步驟需要索引测僵?(答案來自:http://www.reibang.com/p/479c7b576e6f)高通量測序遇到的第一個問題就是街佑,成千上萬甚至上幾億條read如果在合理的時間內(nèi)比對到參考基因組上,并且保證錯誤率在接受范圍內(nèi)恨课。為了提高比對速度舆乔,就需要根據(jù)參考基因組序列岳服,經(jīng)過BWT算法轉(zhuǎn)換成index剂公,而我們比對的序列其實是index的一個子集。當(dāng)然轉(zhuǎn)錄組比對還要考慮到可變剪切的情況,所以更加復(fù)雜。因此我門不是直接把read回貼到基因組上缭召,而是把read和index進(jìn)行比較郁油。
$ tar -zxvf *.tar.gz #解壓縮。解壓的文件中腕窥,包含genome.*.ht2的8個文件和一個shell腳本。
$ rm -rf *.tar.gz #刪除壓縮包節(jié)省空間
比對
之前我還跟著視頻練習(xí)了Bowtie2的比對方法,Bowtie2和hisat2都是比對軟件吊档,這兩個有啥不一樣的?我問了王院長唾糯,他說bowtie2一般用在CHIP怠硼,hisat2一般用在RNAseq。我又查了一下別人介紹的一些比對軟件移怯,有一篇文章是這樣講的:(http://www.reibang.com/p/681e02e7f9af)
RNA-Seq數(shù)據(jù)比對和DNA-Seq數(shù)據(jù)比對有什么差異香璃?
RNA-Seq數(shù)據(jù)分析分為很多種,比如說找差異表達(dá)基因或?qū)ふ倚碌目勺兗羟兄畚蟆H绻也町惐磉_(dá)基因單純只需要確定不同的read計數(shù)就行的話葡秒,我們可以用bowtie, bwa這類比對工具,或者是salmon這類align-free工具嵌溢,并且后者的速度更快眯牧。但是如果你需要找到新的isoform,或者RNA的可變剪切赖草,看看外顯子使用差異的話学少,你就需要TopHat, HISAT2或者是STAR這類工具用于找到剪切位點。因為RNA-Seq不同于DNA-Seq疚顷,DNA在轉(zhuǎn)錄成mRNA的時候會把內(nèi)含子部分去掉旱易。所以mRNA反轉(zhuǎn)的cDNA如果比對不到參考序列禁偎,會被分開,重新比對一次阀坏,判斷中間是否有內(nèi)含子如暖。
還有一篇文章是這樣講的:https://zhuanlan.zhihu.com/p/26506787:
轉(zhuǎn)錄組測序的比對通常分為基因組比對和轉(zhuǎn)錄組比對兩種,顧名思義忌堂,基因組比對就是把reads比對到完整的基因組序列上盒至,而轉(zhuǎn)錄組比對則是把reads比對到所有已知的轉(zhuǎn)錄本序列上。如果不是很急或者只想知道已知轉(zhuǎn)錄本表達(dá)量士修,個人建議使用基因組比對的方法進(jìn)行分析枷遂,理由如下:
① 轉(zhuǎn)錄組比對需要準(zhǔn)確的已知轉(zhuǎn)錄本的序列,對于來自未知轉(zhuǎn)錄本(比如一些未被數(shù)據(jù)庫收錄的lncRNA)或序列不準(zhǔn)確的reads無法正確比對棋嘲;
② 與上一條類似酒唉,轉(zhuǎn)錄組比對不能對轉(zhuǎn)錄本的可變剪接進(jìn)行分析,數(shù)據(jù)庫中未收錄的剪接位點會被直接丟棄沸移;
③ 由于同一個基因存在不同的轉(zhuǎn)錄本痪伦,因此很多reads可以同時完美比對到多個轉(zhuǎn)錄本,reads的比對評分會偏低雹锣,可能被后續(xù)計算表達(dá)量的軟件舍棄网沾,影響后續(xù)分析(有部分軟件解決了這個問題);
④ 由于與DNA測序使用的參考序列不同蕊爵,因此不利于RNA和DNA數(shù)據(jù)的整合分析辉哥。
而上面的問題使用基因組比對都可以解決。
此外攒射,值得注意的是醋旦,RNA測序并不能直接使用DNA測序常用的BWA、Bowtie等比對軟件匆篓,這是由于真核生物內(nèi)含子的存在浑度,導(dǎo)致測到的reads并不與基因組序列完全一致,因此需要使用Tophat/HISAT/STAR等專門為RNA測序設(shè)計的軟件進(jìn)行比對鸦概。
HISAT2箩张,取代Bowtie/TopHat程序,能夠?qū)NA-Seq的讀取與基因組進(jìn)行快速比對窗市。HISAT利用大量FM索引先慷,以覆蓋整個基因組。Index的目的主要使用與序列比對咨察。由于物種的基因組序列比較長论熙, 如果將測序序列與整個基因組進(jìn)行比對,則會非常耗時摄狱。因此采用將測序序列和參考基因組的Index文件進(jìn)行比對脓诡,會節(jié)省很多時間无午。以人類基因組為例,它需要48,000個索引祝谚,每個索引代表~64,000 bp的基因組區(qū)域宪迟。這些小的索引結(jié)合幾種比對策略,實現(xiàn)了RNA-Seq讀取的高效比對交惯,特別是那些跨越多個外顯子的讀取次泽。盡管它利用大量索引,但HISAT只需要4.3 GB的內(nèi)存席爽。這種應(yīng)用程序支持任何規(guī)模的基因組意荤,包括那些超過40億個堿基的。(http://www.biotrainee.com/thread-2073-1-1.html)
所以這次我試著用hisat2來做比對只锻,hisat2下載就不說了玖像,可以在網(wǎng)上搜一下怎么下載,然后把軟件放在環(huán)境變量里就行了炬藤。
$ hisat2 -h #先看一眼hisat2的使用方法御铃,然后會彈出一堆使用說明,各個參數(shù)的意思
HISAT2 version 2.1.0 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)
Usage:
hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
#這里-x是必須有的沈矿,如果單端測序的話-U必須有;-S必須有
<ht2-idx> Index filename prefix (minus trailing .X.ht2).
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
[...]
主要參數(shù):
-x <hisat2-idx>:-x后面跟的是參考基因組索引文件的路徑咬腋。
-1 <m1>:雙端測序結(jié)果的第一個文件羹膳。若有多組數(shù)據(jù),使用逗號將文件分隔根竿。Reads的長度可以不一致陵像。
-2 <m2>: 雙端測序結(jié)果的第二個文件。若有多組數(shù)據(jù)寇壳,使用逗號將文件分隔醒颖,并且文件順序要和-1參數(shù)對應(yīng)。Reads的長度可以不一致壳炎。
-U <r>:單端數(shù)據(jù)文件泞歉。若有多組數(shù)據(jù),使用逗號將文件分隔匿辩⊙遥可以和-1、-2參數(shù)同時使用铲球。Reads的長度可以不一致挺庞。
-S <hit>:指定輸出的SAM文件的路徑。
因為要比對的有4個fastq文件稼病,所以寫一個小腳本选侨,讓它自己一個一個的比對:
#!/bin/bash
for ((i=77;i<=80;i++))
do hisat2 -t -x /media/yanfang/FYWD/RNA_seq/ref_genome/index/hg19/genome -U /media/yanfang/FYWD/RNA_seq/fastq_files/SRR9576${i}_1.fastq.gz -S /media/yanfang/FYWD/RNA_seq/sam_files/SRR9576${i}.sam
done
-t是讓hisat2顯示每一個比對運(yùn)行的時間掖鱼。-x后面跟的是我下載的索引存放的路徑。-U后面跟的是每一個fastq文件存放的路徑援制,這里的i是變量锨用,指定了不同fastq文件的名稱,因為四個文件從SRR957677-SRR957680隘谣,所以i從77開始增拥,每循環(huán)一次就加1。
接下來就是等待的時間了寻歧,取決于你的筆記本電腦的配置掌栅,我的配置是RAM 8G, i7 7500码泛。
$ ./hisat2_map.sh #運(yùn)行hisat2比對的腳本
Time loading forward index: 00:00:40
Time loading reference: 00:00:07
Multiseed full-index search: 00:12:04
20803937 reads; of these:
20803937 (100.00%) were unpaired; of these:
1198535 (5.76%) aligned 0 times
17146459 (82.42%) aligned exactly 1 time
2458943 (11.82%) aligned >1 times
94.24% overall alignment rate
Time searching: 00:12:11
Overall time: 00:12:51
Time loading forward index: 00:00:47
Time loading reference: 00:00:07
Multiseed full-index search: 00:05:04
8828013 reads; of these:
8828013 (100.00%) were unpaired; of these:
572582 (6.49%) aligned 0 times
7275873 (82.42%) aligned exactly 1 time
979558 (11.10%) aligned >1 times
93.51% overall alignment rate
Time searching: 00:05:12
Overall time: 00:05:59
Time loading forward index: 00:00:43
Time loading reference: 00:00:06
Multiseed full-index search: 00:11:37
19909740 reads; of these:
19909740 (100.00%) were unpaired; of these:
1256224 (6.31%) aligned 0 times
16065546 (80.69%) aligned exactly 1 time
2587970 (13.00%) aligned >1 times
93.69% overall alignment rate
Time searching: 00:11:43
Overall time: 00:12:26
Time loading forward index: 00:00:43
Time loading reference: 00:00:07
Multiseed full-index search: 00:13:38
24231941 reads; of these:
24231941 (100.00%) were unpaired; of these:
1348062 (5.56%) aligned 0 times
20030375 (82.66%) aligned exactly 1 time
2853504 (11.78%) aligned >1 times
94.44% overall alignment rate
Time searching: 00:13:45
Overall time: 00:14:28
最重要的是比對到基因組或是轉(zhuǎn)錄組上的比對率猾封。人類基因組的比對率期望值是70-90%,會出現(xiàn)多個序列比對在有限的序列區(qū)稱之為“多重比對序列”(multi-mapping reads)噪珊;轉(zhuǎn)錄組上的比對率較低晌缘,由于未注釋的轉(zhuǎn)錄本會被過濾且“多重比對序列”增加,由于同一個基因不同亞型共有外顯子區(qū)痢站。
SAM文件轉(zhuǎn)換為BAM文件
SAM(sequence Alignment/mapping)數(shù)據(jù)格式是目前高通量測序中存放比對數(shù)據(jù)的標(biāo)準(zhǔn)格式磷箕。bam是sam的二進(jìn)制格式,為了減少sam文件的儲存量阵难。為什么要轉(zhuǎn)換格式岳枷?為了讓計算機(jī)好處理。工具:SAMtools呜叫。
SAMTools的主要功能如下:
view: BAM-SAM/SAM-BAM 轉(zhuǎn)換和提取部分比對
sort: 比對排序空繁,-o是根據(jù)染色體排序,-n參數(shù)則是根據(jù)read名進(jìn)行排序朱庆,-t 根據(jù)TAG進(jìn)行排序盛泡。
merge: 聚合多個排序比對
index: 索引排序比對
faidx: 建立FASTA索引,提取部分序列
tview: 文本格式查看序列
pileup: 產(chǎn)生基于位置的結(jié)果和 consensus/indel calling
#!/bin/bash
#這里寫了一個小腳本娱颊,把三個步驟寫在一個for循環(huán)里傲诵,for循環(huán)會依次對每一個sam文件進(jìn)行處理
for i in `seq 77 80`
do
samtools view -S /media/yanfang/FYWD/RNA_seq/sam_files/SRR9576${i}.sam -b > /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam
#第一步將比對后的sam文件轉(zhuǎn)換成bam文件。-S 后面跟的是sam文件的路徑维蒙;-b 指定輸出的文件為bam掰吕,后面跟輸出的路徑;最后重定向?qū)懭隻am文件
samtools sort /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam -o /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_sorted.bam
#第二步將所有的bam文件按默認(rèn)的染色體位置進(jìn)行排序颅痊。-o是指按染色體排序
samtools index /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_sorted.bam
#第三步將所有的排序文件建立索引殖熟,索引文件,生成的索引文件是以bai為后綴的
done
生成的bam文件可以先用samtools的flagstat看一下read比對情況斑响,例如:
$ samtools flagstat SRR957677_sorted.bam
25783414 + 0 in total (QC-passed reads + QC-failed reads)
4979477 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
24584879 + 0 mapped (95.35% : N/A)
0 + 0 paired in sequencing #因為是單端測序菱属,所以這項是0
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
比對質(zhì)控
對BAM文件進(jìn)行QC的軟件包括:
RSeQC(依賴于Python2.7的一個軟件钳榨,利用conda創(chuàng)建新環(huán)境)
Qualimap:對二代數(shù)據(jù)進(jìn)行質(zhì)控的綜合軟件
Picard:綜合質(zhì)控學(xué)習(xí)軟件。
這里我用的是RSeQC纽门。
先安裝RSeQC:
$ sudo -H pip install RSeQC
$ bam_stat.py -i SRR957677_sorted.bam
Load BAM file ... Done
#==================================================
#All numbers are READ count
#==================================================
Total records: 25783414
QC failed: 0
Optical/PCR duplicate: 0
Non primary hits 4979477
Unmapped reads: 1198535
mapq < mapq_cut (non-unique): 2458943
mapq >= mapq_cut (unique): 17146459
Read-1: 0
Read-2: 0
Reads map to '+': 8670460
Reads map to '-': 8475999
Non-splice reads: 14136525
Splice reads: 3009934
Reads mapped in proper pairs: 0
Proper-paired reads map to different chrom:0
htseq-count計數(shù)
reads的計數(shù)定量主要可分為三個水平:基因水平薛耻、轉(zhuǎn)錄組水平、外顯子水平赏陵。
在基因水平上饼齿,常用的軟件為HTSeq-count,featureCounts蝙搔,BEDTools, Qualimap, Rsubread, GenomicRanges等缕溉。以常用的HTSeq-count為例,這些工具要解決的問題就是根據(jù)read和基因位置的overlap判斷這個read到底是誰家的孩子吃型。值得注意的是不同工具對multimapping reads處理方式也是不同的证鸥,例如HTSeq-count就直接當(dāng)它們不存在。而Qualimpa則是一人一份勤晚,平均分配枉层。
在轉(zhuǎn)錄本水平上,一般常用工具為Cufflinks和它的繼任者StringTie赐写, eXpress鸟蜡。這些軟件要處理的難題就時轉(zhuǎn)錄本亞型(isoforms)之間通常是有重疊的,當(dāng)二代測序讀長低于轉(zhuǎn)錄本長度時血淌,如何進(jìn)行區(qū)分矩欠?這些工具大多采用的都是expectation maximization(EM)。好在我們有三代測序悠夯。上述軟件都是alignment-based,目前許多alignment-free軟件躺坟,如kallisto, silfish, salmon沦补,能夠省去比對這一步,直接得到read count咪橙,在運(yùn)行效率上更高夕膀。不過最近一篇文獻(xiàn)[1]指出這類方法在估計豐度時存在樣本特異性和讀長偏差。
在外顯子使用水平上美侦,其實和基因水平的統(tǒng)計類似产舞。但是值得注意的是為了更好的計數(shù),我們需要提供無重疊的外顯子區(qū)域的gtf文件[2]菠剩。用于分析差異外顯子使用的DEXSeq提供了一個Python腳本(dexseq_prepare_annotation.py)執(zhí)行這個任務(wù)易猫。(以上內(nèi)容來自:http://www.reibang.com/p/6d4cba26bb60)
htseq-count 自定義模型:
1.數(shù)據(jù)準(zhǔn)備:
htseq的計數(shù)需要進(jìn)行按照reads名稱進(jìn)行排序,之前過程中reads是按照染色體排序的具壮,所以還要重新排序:
#!/bin/bash
for ((i=77;i<=80;i++))
do
samtools sort -n /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam -o /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_nsorted.bam
done
安裝htseq的步驟省略准颓,先看一下htseq-count的使用方法:
$ htseq-count --h
usage: htseq-count [options] alignment_file gff_file
This script takes one or more alignment files in SAM/BAM format and a feature
file in GFF format and calculates for each feature the number of reads mapping
to it. See http://htseq.readthedocs.io/en/master/count.html for details.
positional arguments:
samfilenames Path to the SAM/BAM files containing the mapped reads.
If '-' is selected, read from standard input
featuresfilename Path to the file containing the features
optional arguments:
-h, --help show this help message and exit
-f {sam,bam}, --format {sam,bam}
#指定輸入文件格式哈蝇,默認(rèn)SAM
type of <alignment_file> data, either 'sam' or 'bam'
(default: sam)
-r {pos,name}, --order {pos,name}
#你需要利用samtool sort對數(shù)據(jù)根據(jù)read name或者位置進(jìn)行排序,默認(rèn)是name
'pos' or 'name'. Sorting order of <alignment_file>
(default: name). Paired-end sequencing data must be
sorted either by position or by read name, and the
sorting order must be specified. Ignored for single-
end data.
--max-reads-in-buffer MAX_BUFFER_SIZE
When <alignment_file> is paired end sorted by
position, allow only so many reads to stay in memory
until the mates are found (raising this number will
use more memory). Has no effect for single end or
paired end sorted by name
-s {yes,no,reverse}, --stranded {yes,no,reverse}
#數(shù)據(jù)是否來自于strand-specific assay攘已。
#DNA是雙鏈的炮赦,所以需要判斷到底來自于哪條鏈。
#如果選擇了no样勃, 那么每一條read都會跟正義鏈和反義鏈進(jìn)行比較吠勘。
#默認(rèn)的yes對于雙端測序表示第一個read都在同一個鏈上,第二個read則在另一條鏈上峡眶。
whether the data is from a strand-specific assay.
Specify 'yes', 'no', or 'reverse' (default: yes).
'reverse' means 'yes' with reversed strand
interpretation
-a MINAQUAL, --minaqual MINAQUAL
#最低質(zhì)量剧防, 剔除低于閾值的read
skip all reads with alignment quality lower than the
given minimum value (default: 10)
-t FEATURETYPE, --type FEATURETYPE
feature type (3rd column in GFF file) to be used, all
features of other type are ignored (default, suitable
for Ensembl GTF files: exon)
-i IDATTR, --idattr IDATTR
GFF attribute to be used as feature ID (default,
suitable for Ensembl GTF files: gene_id)
--additional-attr ADDITIONAL_ATTR
Additional feature attributes (default: none, suitable
for Ensembl GTF files: gene_name). Use multiple times
for each different attribute
-m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}
mode to handle reads overlapping more than one feature
(choices: union, intersection-strict, intersection-
nonempty; default: union)
--nonunique {none,all}
Whether to score reads that are not uniquely aligned
or ambiguously assigned to features
--secondary-alignments {score,ignore}
Whether to score secondary alignments (0x100 flag)
--supplementary-alignments {score,ignore}
Whether to score supplementary alignments (0x800 flag)
-o SAMOUTS, --samout SAMOUTS
write out all SAM alignment records into SAM files
(one per input file needed), annotating each line with
its feature assignment (as an optional field with tag
'XF')
-q, --quiet suppress progress report
2.寫一個腳本:
#!/bin/bash
for i in `seq 77 80`
do
htseq-count -r name -f bam /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_nsorted.bam /media/yanfang/FYWD/RNA_seq/ref_genome/gencode.v31lift37.annotation.gtf > /media/yanfang/FYWD/RNA_seq/matrix/SRR9576${i}.count
done
一共是4個樣品,大概花了兩三個小時的時間幌陕。
yanfang@YF-Lenovo:/media/yanfang/FYWD/RNA_seq/matrix$ wc -l *.count
#查看一下每一個matrix文件有多少行
62297 SRR957677.count
62297 SRR957678.count
62297 SRR957679.count
62297 SRR957680.count
249188 total
查看每個matrix文件的格式(前4行)诵姜,第一列ensembl_gene_id,第二列read_count計數(shù)
$ head -n 4 SRR9576*.count
==> SRR957677.count <==
ENSG00000000003.14_2 807
ENSG00000000005.6_3 0
ENSG00000000419.12_4 389
ENSG00000000457.14_4 288
==> SRR957678.count <==
ENSG00000000003.14_2 357
ENSG00000000005.6_3 0
ENSG00000000419.12_4 174
ENSG00000000457.14_4 108
==> SRR957679.count <==
ENSG00000000003.14_2 800
ENSG00000000005.6_3 0
ENSG00000000419.12_4 405
ENSG00000000457.14_4 218
==> SRR957680.count <==
ENSG00000000003.14_2 963
ENSG00000000005.6_3 1
ENSG00000000419.12_4 509
ENSG00000000457.14_4 283
再看一下后4行
$ tail -n 4 SRR9576*.count
==> SRR957677.count <==
__ambiguous 341518
__too_low_aQual 0
__not_aligned 1198535
__alignment_not_unique 2458943
==> SRR957678.count <==
__ambiguous 138861
__too_low_aQual 0
__not_aligned 572582
__alignment_not_unique 979558
==> SRR957679.count <==
__ambiguous 360081
__too_low_aQual 0
__not_aligned 1256224
__alignment_not_unique 2587970
==> SRR957680.count <==
__ambiguous 411012
__too_low_aQual 0
__not_aligned 1348062
__alignment_not_unique 2853504
合并表達(dá)矩陣并進(jìn)行注釋
上一步得到的4個單獨(dú)的矩陣文件,現(xiàn)在要把這4個文件合并為行為基因名搏熄,列為樣本名棚唆,中間為count的矩陣文件。
首先要啟動R-studio心例, 運(yùn)行R宵凌。載入數(shù)據(jù),把矩陣加上列名:
> options(stringsAsFactors = FALSE)
> control1<-read.table("SRR957677.count",sep= "\t",col.names = c("gene_id","control1"))
> head(control1)#查看前幾行
gene_id control1
1 ENSG00000000003.14_2 807
2 ENSG00000000005.6_3 0
3 ENSG00000000419.12_4 389
4 ENSG00000000457.14_4 288
5 ENSG00000000460.17_6 505
6 ENSG00000000938.13_4 0
> control2<-read.table("SRR957678.count",sep= "\t",col.names = c("gene_id","control2"))
> treat1<-read.table("SRR957679.count",sep= "\t",col.names = c("gene_id","treat1"))
> treat2<-read.table("SRR957680.count",sep= "\t",col.names = c("gene_id","treat2"))
> tail(control2)#查看后幾行
gene_id control2
62292 ENSG00000288111.1_1 0
62293 __no_feature 3856774
62294 __ambiguous 138861
62295 __too_low_aQual 0
62296 __not_aligned 572582
62297 __alignment_not_unique 979558
> tail(treat2)
gene_id treat2
62292 ENSG00000288111.1_1 0
62293 __no_feature 10430059
62294 __ambiguous 411012
62295 __too_low_aQual 0
62296 __not_aligned 1348062
62297 __alignment_not_unique 2853504
#合并矩陣
> raw_count <- merge(merge(control1, control2, by="gene_id"), merge(treat1, treat2, by="gene_id"))
> head(raw_count) #這里顯示的合并之后止后,行的順序改變了
gene_id control1 control2 treat1 treat2
1 __alignment_not_unique 2458943 979558 2587970 2853504
2 __ambiguous 341518 138861 360081 411012
3 __no_feature 9096888 3856774 8247195 10430059
4 __not_aligned 1198535 572582 1256224 1348062
5 __too_low_aQual 0 0 0 0
6 ENSG00000000003.14_2 807 357 800 963
> tail(raw_count)
gene_id control1 control2 treat1 treat2
62292 ENSG00000288106.1_1 0 3 1 2
62293 ENSG00000288107.1_1 2 0 0 0
62294 ENSG00000288108.1_1 0 0 0 0
62295 ENSG00000288109.1_1 0 0 1 0
62296 ENSG00000288110.1_1 0 0 0 0
62297 ENSG00000288111.1_1 0 0 0 0
>
以上代碼簡要說明:
(1)如果stringAsFactor=F瞎惫,就不會把字符轉(zhuǎn)換為factor。這樣以來译株,原來看起來是數(shù)字變成了character瓜喇,原來是character的還是character。
函數(shù)read.table是讀取矩形格子狀數(shù)據(jù)最為便利的方式歉糜。
(2)read.table的用法是:
read.table(file, header = FALSE, sep = "", quote = ""'",dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss")乘寒。
file是文件名,header指出文件的第一行是否為數(shù)據(jù)變量的名字匪补,缺省情況下伞辛,由文件的格式來確定此值,如果header設(shè)置為TRUE夯缺,則要求第一行要比數(shù)據(jù)列的數(shù)量少一列蚤氏。sep為數(shù)據(jù)分隔符,這里定義分隔符為制表符踊兜。用于指定包圍字符型數(shù)據(jù)的字符竿滨。如果不使用引用,則可以將該參數(shù)設(shè)置為quote=""。
raw_count_filt <- raw_count[-1:-5,] #刪掉前5行
> head(raw_count_filt) #查看刪完的數(shù)據(jù)矩陣的前幾行
gene_id control1 control2 treat1 treat2
6 ENSG00000000003.14_2 807 357 800 963
7 ENSG00000000005.6_3 0 0 0 1
8 ENSG00000000419.12_4 389 174 405 509
9 ENSG00000000457.14_4 288 108 218 283
10 ENSG00000000460.17_6 505 208 451 543
11 ENSG00000000938.13_4 0 0 0 0
上面的矩陣?yán)飃ene_id的名字有小數(shù)點姐呐〉盍可是我們無法在EBI數(shù)據(jù)庫上直接搜索找到ENSMUSG00000024045.5這樣的基因,只能是ENSMUSG00000024045的整數(shù)曙砂,沒有小數(shù)點头谜,所以需要進(jìn)一步替換為整數(shù)的形式。
> ENSEMBL <- gsub("\\.\\d*\\_\\d*", "", raw_count_filt$gene_id)#把gene_id列里的小數(shù)點后面的都去掉
#還有一篇文章是這樣的代碼:ENSEMBL <- gsub("(.*?)\\.\\d*?_\\d", "\\1", raw_count_filt$gene_id)
> row.names(raw_count_filt) <- ENSEMBL #將ENSEMBL重新添加到raw_count_filt矩陣
> raw_count_filt1 <- cbind(ENSEMBL,raw_count_filt)#合并矩陣ENSEMBL和filt2
> colnames(raw_count_filt1) <- c("ensembl_gene_id","gene_id","control1","control2","treat1","treat2")
#給矩陣1的列加名字
>head(raw_count_filt1)
ensembl_gene_id gene_id control1 control2 treat1 treat2
ENSG00000000003 ENSG00000000003 ENSG00000000003.14_2 807 357 800 963
ENSG00000000005 ENSG00000000005 ENSG00000000005.6_3 0 0 0 1
ENSG00000000419 ENSG00000000419 ENSG00000000419.12_4 389 174 405 509
ENSG00000000457 ENSG00000000457 ENSG00000000457.14_4 288 108 218 283
ENSG00000000460 ENSG00000000460 ENSG00000000460.17_6 505 208 451 543
ENSG00000000938 ENSG00000000938 ENSG00000000938.13_4 0 0 0 0
對基因進(jìn)行注釋-獲取gene_symbol
> mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
> my_ensembl_gene_id <- row.names(raw_count_filt1)
> options(timeout = 4000000) #提高連接時間
> hg_symbols<- getBM(attributes=c('ensembl_gene_id','hgnc_symbol',"chromosome_name", "start_position","end_position", "band"), filters= 'ensembl_gene_id', values = my_ensembl_gene_id, mart = mart)
> head(hg_symbols)
ensembl_gene_id hgnc_symbol chromosome_name start_position end_position band
1 ENSG00000000003 TSPAN6 X 100627109 100639991 q22.1
2 ENSG00000000005 TNMD X 100584936 100599885 q22.1
3 ENSG00000000419 DPM1 20 50934867 50958555 q13.13
4 ENSG00000000457 SCYL3 1 169849631 169894267 q24.2
5 ENSG00000000460 C1orf112 1 169662007 169854080 q24.2
6 ENSG00000000938 FGR 1 27612064 27635185 p35.3
>
將合并后的表達(dá)數(shù)據(jù)框raw_count_filt1和注釋得到的hg_symbols整合為一:
> readcount <- merge(raw_count_filt1, hg_symbols, by="ensembl_gene_id")
> head(readcount)
ensembl_gene_id gene_id control1 control2 treat1 treat2 hgnc_symbol chromosome_name start_position end_position band
1 ENSG00000000003 ENSG00000000003.14_2 807 357 800 963 TSPAN6 X 100627109 100639991 q22.1
2 ENSG00000000005 ENSG00000000005.6_3 0 0 0 1 TNMD X 100584936 100599885 q22.1
3 ENSG00000000419 ENSG00000000419.12_4 389 174 405 509 DPM1 20 50934867 50958555 q13.13
4 ENSG00000000457 ENSG00000000457.14_4 288 108 218 283 SCYL3 1 169849631 169894267 q24.2
5 ENSG00000000460 ENSG00000000460.17_6 505 208 451 543 C1orf112 1 169662007 169854080 q24.2
6 ENSG00000000938 ENSG00000000938.13_4 0 0 0 0 FGR 1 27612064 27635185 p35.3
輸出count矩陣文件:
> write.csv(readcount, file='readcount_all,csv')
> readcount<-raw_count_filt1[ ,-1:-2]
> write.csv(readcount, file='readcount.csv')
> head(readcount)
control1 control2 treat1 treat2
ENSG00000000003 807 357 800 963
ENSG00000000005 0 0 0 1
ENSG00000000419 389 174 405 509
ENSG00000000457 288 108 218 283
ENSG00000000460 505 208 451 543
ENSG00000000938 0 0 0 0