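Transcriptomics for Beginners (4): Understanding the Reference Genome and Gene Annotation
Download the hg19 reference genome from UCSC (my blog has detailed instructions), download the gene annotation file from the GENCODE database, and use IGV to look at the structure of genes you are interested in, such as TP53, KRAS, and EGFR.
Assignment: take screenshots of the IGV view of several gene structures! You can also download the GTF files from ENSEMBL and NCBI, load them into IGV as well, and screenshot the gene structures. Learn the basics of IGV.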
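Preparation
Reference genome
Sequencing yields short reads of a few hundred bp, which is like being handed a jigsaw puzzle in pieces. Without a reference genome, de novo assembly amounts to retracing the path of the Human Genome Project: the puzzle is scattered and nobody tells you what the picture looked like, so the task is extremely hard.
Fortunately the human genome has already been assembled, so we only need to map our sequenced reads back onto it. After all, any two people differ by less than 1%, so allowing mismatches is enough.
The first step is therefore to go to UCSC (http://genome.ucsc.edu/index.html) and download the hg19 reference genome (the version the paper requires).
The download page describes what each file contains. Among them: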
chromFa.tar.gz - The assembly sequence in one file per chromosome. Repeats from RepeatMasker and Tandem Repeats Finder (with period of 12 or less) are shown in lower case; non-repeating sequence is shown in upper case.
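Because repeats are written in lower case, you can estimate how much of a FASTA file is soft-masked just by counting letters. The following is a minimal sketch, not something from the UCSC page: the function name `masked_fraction` is made up for this post, and it assumes an uncompressed FASTA file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the fraction of bases in a FASTA file that are
# lower case (soft-masked repeats), to four decimal places.
masked_fraction() {
    local fasta="$1"
    # Drop header lines (starting with ">"), then count letters.
    # LC_ALL=C makes [a-z] match ASCII lower case only.
    grep -v '^>' "$fasta" | LC_ALL=C awk '
        {
            total += length($0)          # all bases on this line, including N
            lower += gsub(/[a-z]/, "")   # gsub returns how many it replaced
        }
        END {
            if (total == 0) { print "0.0000"; exit }
            printf "%.4f\n", lower / total
        }'
}

# Example (run after extracting the archive):
#   masked_fraction chr21.fa
```

Note that the total includes N bases, so the number is a rough indication only.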
I keep the data in the Data folder on my Windows F: drive for the later steps.
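cd /mnt/f/Data
mkdir reference && cd reference
mkdir -p genome/hg19 && cd genome/hg19
# Download in the background, then wait for it to finish before unpacking
nohup wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz &
wait
# -x extracts the archive (one .fa file per chromosome)
tar -zxvf chromFa.tar.gz
# Merge the per-chromosome files into a single FASTA
cat *.fa > hg19.fa
# Remove the per-chromosome files (this glob also matches chromFa.tar.gz)
rm chr*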
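After merging, it is worth checking that hg19.fa really contains the sequences you expect. The sketch below lists every sequence name with its length. The function name `fasta_lengths` is hypothetical, and the output layout (name, tab, length) is just a choice made here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "name<TAB>length" for every record in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            # A new header: report the previous record, then start a new one.
            if (name != "") print name "\t" len
            name = substr($1, 2)   # header without ">" and without any description
            len = 0
            next
        }
        { len += length($0) }      # sequence line: add its bases
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Example:
#   fasta_lengths hg19.fa | head
```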
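The following is Jimmy's introduction to reference genomes, from "[Live] My Genome (5): Preparing the Test Data and the Reference Genome".
This is a big pitfall for newcomers. hg19, GRCh37, and Ensembl 75 are probably the three genome versions people see most often. They are the internationally used human reference genome and actually store the same FASTA sequence; they are simply the releases published by three international bioinformatics resources, UCSC, NCBI, and ENSEMBL respectively. Some reference genomes are more niche and store different sequences, such as the YanHuang genome produced by BGI, the genome of Watson, one of the proposers of the DNA double-helix structure, and the Korean genome published in Nature in 2016 and billed as the most complete. For now we ignore these niche genomes and mainly download hg19 and hg38, both provided by UCSC. hg38 has many improvements over hg19 and quite a few advantages, but because much of the annotation so far is still based on the hg19 coordinate system, we download both, which is a good chance to explore them ourselves. While we are at it, let's also download the latest mouse reference genome. Alignment only takes about as long as a night's sleep anyway, and we can analyze the result to see whether the mapping rate is very low.
A gripe: the layout of Jimmy's blog really tests our thirst for knowledge. Every time I see it, I have to stop myself from clicking the top-right corner of the browser. For the sake of learning, I put up with it.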
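Annotation
A reference genome, however, is a book without words. To interpret its content you need additional annotation.
So the second step is to download the genome annotation file from the GENCODE database (http://www.gencodegenes.org/).
Only after seeing the figure below did I understand why Jimmy complains about how the various genome versions correspond to each other.
Then comes the choice between GTF and GFF3. Here is a brief introduction to the two formats.
GTF (General Transfer Format) is essentially GFF2. It is tab-delimited and has the following columns: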
- seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
- source - name of the program that generated this feature, or the data source (database or project name)
- feature - feature type name, e.g. Gene, Variation, Similarity
- start - Start position of the feature, with sequence numbering starting at 1.
- end - End position of the feature, with sequence numbering starting at 1.
- score - A floating point value.
- strand - defined as + (forward) or - (reverse).
- frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
- attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
GFF3 (General Feature Format) has the following columns:
- seqid - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seq ID must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
- source - name of the program that generated this feature, or the data source (database or project name)
- type - type of feature. Must be a term or accession from the SOFA sequence ontology
- start - Start position of the feature, with sequence numbering starting at 1.
- end - End position of the feature, with sequence numbering starting at 1.
- score - A floating point value.
- strand - defined as + (forward) or - (reverse).
- phase - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
- attributes - A semicolon-separated list of tag-value pairs, providing additional information about each feature. Some of these tags are predefined, e.g. ID, Name, Alias, Parent - see the GFF documentation for more details.
I can't really tell what the difference is. If you don't want to agonize over it, just download both.
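The column layouts are indeed almost identical. The most visible difference is in the last column: GTF writes attributes as `tag "value";`, while GFF3 writes them as `tag=value;`. The sketch below uses the GTF form to pull out the coordinates of a few genes, which is handy for knowing where to navigate in IGV. The function name `gene_coords_from_gtf` is made up here, and it assumes an uncompressed GTF whose gene records carry a `gene_name` attribute, as GENCODE files do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "gene<TAB>chrom<TAB>start<TAB>end<TAB>strand"
# for the requested gene names, using only the "gene" records of a GTF file.
# Usage: gene_coords_from_gtf annotation.gtf TP53 KRAS EGFR
gene_coords_from_gtf() {
    local gtf="$1"
    shift
    local genes="$*"
    awk -F '\t' -v genes="$genes" '
        BEGIN {
            # Turn the space-separated gene list into a lookup table.
            n = split(genes, g, " ")
            for (i = 1; i <= n; i++) want[g[i]] = 1
        }
        /^#/ { next }                       # skip header/comment lines
        $3 == "gene" {
            # Column 9 holds the attributes, e.g.  gene_name "TP53";
            if (match($9, /gene_name "[^"]+"/)) {
                # Skip the 11 characters of the prefix and the closing quote.
                name = substr($9, RSTART + 11, RLENGTH - 12)
                if (name in want) print name "\t" $1 "\t" $4 "\t" $5 "\t" $7
            }
        }
    ' "$gtf"
}

# Example (decompress first, e.g. gunzip -c file.gtf.gz > file.gtf):
#   gene_coords_from_gtf gencode.v26lift37.annotation.gtf TP53 KRAS EGFR
```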
nohup wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/GRCh37_mapping/gencode.v26lift37.annotation.gtf.gz &
nohup wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/GRCh37_mapping/gencode.v26lift37.annotation.gff3.gz &
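As the column descriptions note, chromosome names may come with or without the 'chr' prefix. An annotation only lines up with a genome if the names match (Ensembl GTF files, for example, use "1" where UCSC uses "chr1"). Here is a rough check, assuming uncompressed files. The function name `seqnames_missing_from_fasta` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every sequence name used in column 1 of a
# GTF/GFF3 file that does not occur as a record name in the FASTA file.
# Usage: seqnames_missing_from_fasta genome.fa annotation.gtf
seqnames_missing_from_fasta() {
    local fasta="$1" annotation="$2"
    awk -F '\t' '
        NR == FNR {
            # First file (FASTA): remember each record name.
            if (/^>/) {
                name = substr($0, 2)
                sub(/[ \t].*/, "", name)   # drop any description after the name
                seen[name] = 1
            }
            next
        }
        /^#/ { next }                      # skip annotation header lines
        # Second file: report each unknown sequence name once.
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$annotation"
}

# Example:
#   seqnames_missing_from_fasta hg19.fa gencode.v26lift37.annotation.gtf
```

If this prints nothing, every sequence name in the annotation exists in the genome file.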
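We take in text far less readily than pictures, so the next step is to download a genome browser.
IGV, Integrative Genomics Viewer
Download it from: http://software.broadinstitute.org/software/igv/download
On Windows, download the version shown below, which bundles a Java runtime.
Double-click igv.bat and the IGV window appears.
Load the genome file obtained earlier via Genomes -> Load Genome from File.
Next, load the gff gene annotation file via File -> Load from File.
If IGV reports an error because the file is not sorted, use Tools -> Run igvtools to sort it.
After that, reload the sorted gtf file and carry on. 生信寶典 has an article on visualizing sequencing data (http://mp.weixin.qq.com/s/Q7pqycmQH58xU6hw_LECWA). I am still working through the documentation myself, so for now here are the gene screenshots.
The figure below is a screenshot of the data provided in Jimmy's post from a few months ago on understanding high-throughput sequencing.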
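If you would rather sort on the command line, the sketch below does it with standard Unix tools instead of igvtools. It keeps the header lines on top and orders the records by chromosome name and then by start position. The function name `sort_gtf` is made up here, and I am assuming that this ordering is what IGV is asking for.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a copy of a GTF/GFF3 file sorted by sequence name
# (column 1) and numeric start position (column 4), header lines first.
# Usage: sort_gtf input.gtf output.sorted.gtf
sort_gtf() {
    local input="$1" output="$2"
    local tab
    tab=$(printf '\t')
    {
        # Header/comment lines stay at the top, in their original order.
        grep '^#' "$input"
        # Records: sort by column 1 as text, then by column 4 as a number.
        grep -v '^#' "$input" | LC_ALL=C sort -t "$tab" -k1,1 -k4,4n
    } > "$output"
}

# Example:
#   sort_gtf gencode.v26lift37.annotation.gtf gencode.v26lift37.annotation.sorted.gtf
```

Chromosome names are sorted as plain text here, so chr10 comes before chr2. Records are still grouped by chromosome and ordered by start within each one.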