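Downloading the Arabidopsis reference genome from the Ensembl database
TAIR is the primary database for Arabidopsis research; the Arabidopsis genome data in other databases come directly from TAIR.
Ensembl is used here simply because it feels more convenient.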
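Plant reference genomes: http://plants.ensembl.org/index.html
- Some commonly used species are listed on the home page: Arabidopsis, rice, maize, etc.
- If the species you want is not on the home page, click "View full list of all Ensembl Plants species" to get the list of all species.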
Animal reference genomes: http://asia.ensembl.org/index.html
Plant reference genomes: http://plants.ensembl.org/index.html
Reference genomes for fungi, bacteria and other organisms: http://ensemblgenomes.org/
Click through to the Arabidopsis reference genome information page.
Click Download DNA sequence (FASTA).
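- In most cases, download the *toplevel.fa.gz file, which is the complete reference genome.
- The meaning of the other sm and rm files is explained in the README file, quoted below; they are different ways of masking repeat regions: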
'dna_rm' - masked genomic DNA. Interspersed repeats and low complexity regions are detected with the RepeatMasker tool and masked by replacing repeats with 'N's.
'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions have been replaced with lowercased versions of their nucleic base
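To make the difference between the two masking styles concrete, here is a small Python sketch that counts hard-masked (N) and soft-masked (lowercase) bases in a sequence. The function name and the toy sequences are made up for illustration; they are not taken from the Ensembl files.

```python
def masking_summary(sequence):
    """Count hard-masked, soft-masked and unmasked bases in one sequence."""
    # Hard masking replaces repeat bases with N (counted case-insensitively here).
    hard = sum(1 for base in sequence if base in "Nn")
    # Soft masking keeps the base but writes it in lowercase.
    soft = sum(1 for base in sequence if base.islower() and base != "n")
    return {
        "hard_masked": hard,
        "soft_masked": soft,
        "unmasked": len(sequence) - hard - soft,
    }


# Toy example: the same 16 bases with an 8-base repeat masked in two ways.
soft_masked = "ACGTatatatatGGCC"   # what a dna_sm file would look like
hard_masked = "ACGTNNNNNNNNGGCC"   # what a dna_rm file would look like

print(masking_summary(soft_masked))
print(masking_summary(hard_masked))
```

With soft masking the repeat bases can still be read, which is why the plain or soft-masked top-level file is usually the one to keep for alignment.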
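Downloading the gene annotation GTF file
Starting from the previous step, click up to the parent directory three times. You will see the gff and gtf directories; enter the one for your species and download the corresponding file:
Note: the version of the FASTA file and the version of the GTF file must match.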
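A quick sanity check on the two downloads can be sketched in Python by comparing the species and assembly encoded in the file names. The regular expressions below are an assumption based only on the file names used in this post. The FASTA name carries no release number, so this check cannot confirm that both files come from the same release; that still has to be confirmed from the download directory.

```python
import re

# Assumed naming scheme, derived from the two file names used in this post.
FASTA_NAME = re.compile(
    r"^(?P<species>[^.]+)\.(?P<assembly>.+)\.dna(?:_rm|_sm)?\.toplevel\.fa(?:\.gz)?$"
)
GTF_NAME = re.compile(
    r"^(?P<species>[^.]+)\.(?P<assembly>.+)\.(?P<release>\d+)\.gtf(?:\.gz)?$"
)


def same_species_and_assembly(fasta_name, gtf_name):
    """Return True if both file names encode the same species and assembly."""
    fasta = FASTA_NAME.match(fasta_name)
    gtf = GTF_NAME.match(gtf_name)
    if fasta is None or gtf is None:
        raise ValueError("file name does not follow the assumed naming scheme")
    return (fasta["species"], fasta["assembly"]) == (gtf["species"], gtf["assembly"])


print(same_species_and_assembly(
    "Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz",
    "Arabidopsis_thaliana.TAIR10.45.gtf.gz",
))
```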
Upload the files to the server with Xftp.
Viewing the GTF annotation file
The GTF file looks like this:
$less -S Arabidopsis_thaliana.TAIR10.45.gtf.gz
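The GTF format in wide use today is version 2 (GTF2), which is mainly used to describe gene annotations. The format has two hard requirements:
- Depending on the software used, the feature types must be specified.
- Column 9 must begin with gene_id and transcript_id.
Column 9 of a GTF file differs from column 9 of a GFF file. Both hold tag-value pairs, but in GTF the tag and the value are separated by a space, and every attribute must be followed by a semicolon (including the last one):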
gene_id "geneA";transcript_id "geneA.1";database_id "0012";modified_by "Damian";duplicates 0;
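The attribute layout described above can be parsed with a short Python sketch. The function name is made up for illustration, and this is a minimal parser rather than a full GTF reader: if a tag occurs more than once, only the last value is kept.

```python
import re

# One attribute: a tag, whitespace, then a quoted or bare value, then a semicolon.
ATTRIBUTE = re.compile(r'\s*(\S+)\s+("([^"]*)"|[^;]*?)\s*;')


def parse_gtf_attributes(column9):
    """Turn the text of GTF column 9 into a dict of tag -> value strings."""
    attributes = {}
    for match in ATTRIBUTE.finditer(column9):
        tag = match.group(1)
        quoted = match.group(3)
        # Quoted values lose their quotes; bare values (e.g. numbers) are kept as text.
        attributes[tag] = quoted if quoted is not None else match.group(2)
    return attributes


example = 'gene_id "geneA";transcript_id "geneA.1";database_id "0012";modified_by "Damian";duplicates 0;'
print(parse_gtf_attributes(example))
```

All values come back as strings, so a numeric attribute such as duplicates needs an explicit int() conversion if it is used as a number.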
Decompressing the files
gzip -d Arabidopsis_thaliana.TAIR10.45.gtf.gz
gzip -d Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
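Note that gzip -d removes the .gz file once it has been decompressed. If you would rather keep the original archive next to the decompressed copy, a minimal Python sketch using only the standard library looks like this (the function name gunzip_keep is made up for illustration):

```python
import gzip
import shutil
from pathlib import Path


def gunzip_keep(src, dest=None):
    """Decompress src (a .gz file) to dest and leave the archive in place."""
    src_path = Path(src)
    # By default drop only the final ".gz" suffix: x.gtf.gz -> x.gtf
    dest_path = Path(dest) if dest is not None else src_path.with_suffix("")
    with gzip.open(src_path, "rb") as compressed, open(dest_path, "wb") as plain:
        shutil.copyfileobj(compressed, plain)  # streams, so large genomes are fine
    return dest_path


# Usage (paths as in this post):
# gunzip_keep("Arabidopsis_thaliana.TAIR10.45.gtf.gz")
# gunzip_keep("Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz")
```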
========================================================================
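The rest of this post is about building a reference for 10X single-cell transcriptomics; for ordinary bulk sequencing you can ignore it.
Using cellranger to check and generate the GTF file for the 10X pipeline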
$cellranger mkgtf Arabidopsis_thaliana.TAIR10.45.gtf Arabidopsis_thaliana.TAIR10.45_new.gtf
/opt/biosoft/cellranger-expression/cellranger-cs/3.1.0/bin
cellranger mkgtf (3.1.0)
Copyright (c) 2019 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
Writing new genes GTF file (may take 10 minutes for a 1GB input GTF file)...
...done
For the downstream analysis, add an "Mt" marker to the mitochondrial genes.
You can write a small Perl or Python script for this yourself.
python add_mt_marker.py Arabidopsis_thaliana.TAIR10.45_new.gtf Arabidopsis_thaliana.TAIR10.45_new2.gtf
mv Arabidopsis_thaliana.TAIR10.45_new2.gtf Arabidopsis_thaliana.TAIR10.45.gtf
less -S Arabidopsis_thaliana.TAIR10.45.gtf
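The script add_mt_marker.py is not shown in this post, so the following is only a hypothetical sketch of what it might do, not the original script. It assumes that the mitochondrial genome is named "Mt" in column 1 of the GTF (check your own file) and that the marker is wanted as an "Mt-" prefix on gene_name. Genes without a gene_name get one built from their gene_id.

```python
import re
import sys

MITO_SEQNAME = "Mt"  # assumption: column-1 name of the mitochondrial genome
PREFIX = "Mt-"       # assumption: the marker that downstream steps look for

GENE_NAME = re.compile(r'gene_name "([^"]*)"')
GENE_ID = re.compile(r'gene_id "([^"]*)"')


def _prefixed(match):
    """Add the prefix to one gene_name attribute unless it is already there."""
    name = match.group(1)
    if name.startswith(PREFIX):
        return match.group(0)
    return f'gene_name "{PREFIX}{name}"'


def mark_mito_line(line):
    """Return the GTF line with its gene_name marked if it lies on the Mt genome."""
    if line.startswith("#"):
        return line  # header or comment line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[0] != MITO_SEQNAME:
        return line
    attributes = fields[8]
    if GENE_NAME.search(attributes):
        attributes = GENE_NAME.sub(_prefixed, attributes)
    else:
        gene_id = GENE_ID.search(attributes)
        if gene_id:
            attributes = f'{attributes.rstrip()} gene_name "{PREFIX}{gene_id.group(1)}";'
    fields[8] = attributes
    return "\t".join(fields) + newline


def main(in_path, out_path):
    with open(in_path) as gtf_in, open(out_path, "w") as gtf_out:
        for line in gtf_in:
            gtf_out.write(mark_mito_line(line))


# Called as: python add_mt_marker.py input.gtf output.gtf
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Whatever form the marker takes, it has to match the pattern used later when computing the mitochondrial fraction per cell, so adjust PREFIX to your own downstream tool.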
Using cellranger to check and generate the reference for the 10X pipeline
$cellranger mkref --genome TAIR10 --fasta Arabidopsis_thaliana.TAIR10.dna.toplevel.fa --genes Arabidopsis_thaliana.TAIR10.45.gtf
/opt/biosoft/cellranger-expression/cellranger-cs/3.1.0/bin
cellranger mkref (3.1.0)
Copyright (c) 2019 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
Creating new reference folder at /share/nas1/Data/Users/luohb/Data/Reference/TAIR/TAIR10
...done
Writing genome FASTA file into reference folder...
...done
Computing hash of genome FASTA file...
...done
Indexing genome FASTA file...
...done
Writing genes GTF file into reference folder...
...done
Computing hash of genes GTF file...
...done
Writing genes index file into reference folder (may take over 10 minutes for a 3Gb genome)...
...done
Writing genome metadata JSON file into reference folder...
...done
Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
Jan 15 17:59:49 ..... Started STAR run
Jan 15 17:59:49 ... Starting to generate Genome files
Jan 15 17:59:55 ... starting to sort Suffix Array. This may take a long time...
Jan 15 17:59:55 ... sorting Suffix Array chunks and saving them to disk...
Jan 15 18:04:32 ... loading chunks from disk, packing SA...
Jan 15 18:04:50 ... Finished generating suffix array
Jan 15 18:04:50 ... Generating Suffix Array index
Jan 15 18:05:08 ... Completed Suffix Array index
Jan 15 18:05:08 ..... Processing annotations GTF
Jan 15 18:05:16 ..... Inserting junctions into the genome indices
Jan 15 18:09:14 ... writing Genome to disk ...
Jan 15 18:09:15 ... writing Suffix Array to disk ...
Jan 15 18:09:16 ... writing SAindex to disk
Jan 15 18:09:16 ..... Finished successfully
...done.
>>> Reference successfully created! <<<
You can now specify this reference on the command line:
cellranger --transcriptome=/share/nas1/Data/Users/luohb/Data/Reference/TAIR/TAIR10 ...
Take a look at the contents of the newly generated folder:
$cd TAIR10/
$ls
fasta genes pickle reference.json star
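If the reference is built from a script, the same check can be sketched in Python. The list of expected entries is simply copied from the ls output above (Cell Ranger 3.1.0); other versions may lay the folder out differently, so treat it as an assumption.

```python
from pathlib import Path

# Entries seen in the ls output above for Cell Ranger 3.1.0.
EXPECTED_ENTRIES = ("fasta", "genes", "pickle", "reference.json", "star")


def missing_reference_entries(reference_dir):
    """Return the expected entries that are absent from the reference folder."""
    root = Path(reference_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


# Usage: an empty list means every expected entry is present.
# print(missing_reference_entries("TAIR10"))
```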
Keep the original compressed files together with a README that records where they came from.
$cd ..
$cd source/
$vi README.txt
$ls
Arabidopsis_thaliana.TAIR10.45.gtf.gz Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz README.txt
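The README above was written by hand in vi. As an optional extra that is not part of the original workflow, the provenance lines could also be generated with a short Python sketch that records a SHA-256 checksum for each archive. The function names and the README layout are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def readme_lines(source_url, files):
    """Build README lines: the source URL, then one name/checksum line per file."""
    lines = [f"Source: {source_url}"]
    for path in files:
        lines.append(f"{Path(path).name}\tsha256={sha256_of(path)}")
    return lines


# Usage (run inside the source/ directory):
# files = ["Arabidopsis_thaliana.TAIR10.45.gtf.gz",
#          "Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz"]
# Path("README.txt").write_text(
#     "\n".join(readme_lines("http://plants.ensembl.org/index.html", files)) + "\n")
```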
All done, time to wrap up~