miRDeep2 tutorial

Source: https://drmirdeep.github.io/mirdeep2_tutorial.html

Extended miRDeep2 tutorial with step by step instructions. It will cover the mapper.pl for preprocessing and mapping, the miRDeep2.pl for de-novo prediction and the quantifier.pl for expression profiling.

Disclaimer

This tutorial comes with no warranty and demands common sense of the reader. I am not responsible for any damage that happens to your computer by using this tutorial. For comments or questions just create an 'issue' here

https://github.com/Drmirdeep/drmirdeep.github.io/issues

Apparently you will need a github account for that.

Preface

This is a step by step guide for a full small RNA sequencing data analysis using the miRDeep2 package and its patched files. The first part describes the general workflow for de-novo miRNA prediction based on a small RNA-seq data set and the second part focuses on expression analysis with the quantification module.

Installation instructions

If you haven't installed them yet you can obtain the main package from the miRDeep2 repository and the patched files from the patch repository by clicking on 'Clone or download' and then on 'Download Zip'. Extract the zipped files and then open a command line window. If you have git installed you can also obtain the packages directly from the command line by typing

git clone https://github.com/rajewsky-lab/mirdeep2.git

and

git clone https://github.com/Drmirdeep/mirdeep2_patch.git

To install the miRDeep2 package enter the directory to which the package was extracted. If you extracted the folder on the Desktop then typing

cd ~/Desktop/mirdeep2

will bring you to the mirdeep2 folder. Then typing

perl install.pl

will start the installer, which downloads and installs third party software. In particular, bowtie version 1, RNAfold, randfold and the Perl packages PDF::API2 and Font::TTF will be installed. Please follow the instructions on the screen. When mirdeep2 has been installed successfully please open a new terminal window and just type

miRDeep2.pl

If you see the miRDeep2 usage instructions on the screen you can continue to install the patch. Otherwise something went wrong during the installation. In case everything worked fine you can now enter the directory of the mirdeep2_patch by typing

cd ~/Desktop/mirdeep2_patch

if you extracted the patched file to the Desktop. Typing

bash patchme.sh

will add the patched files to your mirdeep2 installation. After that we can start with some miRNA data analysis. Download the miRBase reference files for version 21 by typing

mirbase.pl 21

This will download the hairpin.fa.gz and mature.fa.gz for version 21 to directory ~/mirbase/21/

(The ~ sign will be expanded by your computer to your home directory).

If you want the gff files as well then you need to type

mirbase.pl 21 1

The second argument can be anything but 0; any other value tells the script to also get the gff files from miRBase. For miRNA quantification we next extract the miRNAs for our species of interest. For that you need to know the 3-letter miRBase code of your species. For human this is 'hsa' and for mouse it is 'mmu'. To extract the mature sequences from the miRBase file we downloaded before you just need to type

extract_miRNAs.pl ~/mirbase/21/mature.fa.gz hsa > mature_ref.fa

and to get the hairpin sequences type

extract_miRNAs.pl ~/mirbase/21/hairpin.fa.gz hsa > hairpin_ref.fa

For running miRDeep2 for de-novo miRNA prediction it is beneficial to supply also mature miRNAs from related species. You could for example use mouse 'mmu' and chimp 'ptr' as related species. For extracting those miRNAs you can type

extract_miRNAs.pl ~/mirbase/21/mature.fa.gz mmu,ptr > mature_other.fa
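To make the species selection more concrete, here is a minimal sketch of the idea: keep only the FASTA records whose header starts with one of the requested 3-letter codes. The function name filter_species is hypothetical and not part of miRDeep2, and the real extract_miRNAs.pl may process the sequences further, so treat this as an illustration rather than a replacement.

```shell
# Hypothetical helper (not part of miRDeep2): keep only FASTA records whose
# header starts with one of the given miRBase species codes, e.g. "hsa" or "mmu,ptr".
filter_species() {
    # $1: comma-separated species codes; the FASTA records are read from stdin
    awk -v codes="$1" '
        BEGIN { n = split(codes, want, ",") }
        /^>/ {
            # decide per record whether its header matches a wanted code
            keep = 0
            for (i = 1; i <= n; i++)
                if (index($0, ">" want[i] "-") == 1) keep = 1
        }
        keep { print }   # prints the header and all sequence lines of kept records
    '
}
```

Under these assumptions it could be used as: gzip -dc ~/mirbase/21/mature.fa.gz | filter_species mmu,ptr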

Tutorial files

In case you want to follow the tutorial with example data you can get the files from here (tutorial_files) and type

unzip drmirdeep.github.io-master.zip

change to the unzipped directory with

cd drmirdeep.github.io-master

and continue with the tutorial.

The data and what else you will need

Usually you will have gotten a small RNA sequencing data file from a collaborator who wants you to analyze it. Before you can start with any kind of analysis you should either already know the small RNA sequencing adapter that was used for the sequencing of the sample or ask your collaborator to send it to you. If you don't clip the adapter then the majority of the reads that contain an adapter are unlikely to align anywhere.

Once you know the adapter sequence you should do a simple check to see how many of your sequences contain the adapter. This you can do by typing

grep -c TGGAATTC example_small_rna_file.fastq

where 'TGGAATTC' are the first 8 nucleotides of the adapter that was used for this sample. Replace it with the start of your own sequencing adapter. Note that grep -c counts matching lines and that a fastq file stores each read on 4 lines, so the number of input sequences is the line count of the file divided by 4. MicroRNAs have a mean length of 22 nucleotides in animals, so a sequenced miRNA will most likely have the sequencing adapter attached to it. If around 70% of your input sequences contain the adapter, the data set can be considered reasonably good. Note: If predominantly adapters have been sequenced you will also get a high number, which is obviously not good. If only 10% of the sequences contain an adapter then likely something went wrong during the sequencing library preparation or your sample doesn't contain many small RNAs.

For novel miRNA prediction we need to map the reads against a reference database which has to be indexed by bowtie 1. For this we take a reference database file, let's call it refdb.fa (this can be a genome file or simply a file with scaffolds), and build a bowtie index by typing

bowtie-build refdb.fa refdb.fa

The first argument is the actual file to index, the second argument is the prefix for the bowtie index files. You can name it differently but for ease of use I use the same name as my reference database file. Depending on the input file size the indexing can take several hours (for the human genome for example). However, the bowtie website already offers some prebuilt index files for download. If you decide to download index files you will also need to download the fasta file from which the index was built. Otherwise the results of the miRDeep2 prediction will not be reliable.
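Before moving on to preprocessing, the adapter check from earlier in this section can be turned into a percentage directly. The following is a minimal sketch, assuming a standard fastq file with exactly 4 lines per read; the function name adapter_percent is hypothetical and not part of miRDeep2. It only looks at the sequence line of each record and divides by the number of reads.

```shell
# Hypothetical helper: percentage of FASTQ reads whose sequence line contains
# the given adapter prefix. Assumes standard 4-line FASTQ records.
adapter_percent() {
    # $1: FASTQ file, $2: adapter prefix (e.g. TGGAATTC)
    awk -v adapter="$2" '
        NR % 4 == 2 {                      # second line of each record = sequence
            total++
            if (index($0, adapter) > 0) hits++
        }
        END {
            if (total == 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * hits / total
        }
    ' "$1"
}
```

For the example file this could be called as: adapter_percent example_small_rna_file.fastq TGGAATTC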

Data preprocessing for novel miRNA prediction

Since the miRDeep2 package was designed as a complete solution for miRNA prediction and quantification it also contains data preprocessing routines, including clipping of the sequencing adapter. The main function of the mapping module is the mapping of the preprocessed reads file to a reference database. The reference database is typically an annotated genome sequence but can also be simply a scaffold assembly if no genome is available. The scaffolds themselves however should be at least 200 nucleotides long so that a sane miRNA precursor plus some flanking region fits into them. Apart from clipping adapters the module does sanity checks on your sequencing reads and also collapses read sequences to reduce the file size, which saves computing time.

mapper.pl example_small_rna_file.fastq -e -h -i -j -k TGGAATTC -l 18 -m -p refdb.fa -s reads_collapsed.fa -t reads_vs_refdb.arf -v -o 4

What does this command do? The first argument needs to be your sequencing file. Typically, this will be a fastq file, which is indicated by option '-e'. If your file is already in fasta format you specify option '-c' instead. If your reads file is not in fasta format you also need to specify option '-h', which tells the mapper module to parse your file to fasta format. Option '-i' will convert RNA to DNA and option '-j' will remove sequences that contain characters other than ACGTN.

Now comes the actual adapter clipping, which is only done if an adapter sequence is given by option '-k'. Only the first 6 nucleotides of this sequence will be used to search for an exact match in the sequencing reads. Option '-l 18' then discards reads shorter than 18 nucleotides. Option '-m' will collapse the reads to remove redundancy and decrease the file size. A sequencing read seen 10 times in your raw file will occur only once in the collapsed file and have a _x10 in its identifier.
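The clipping and collapsing steps can be illustrated with a small sketch. This is not the code miRDeep2 runs, only the idea under stated assumptions: reads arrive as one sequence per line, the read is cut at the first exact match of the first 6 adapter nucleotides, reads without a match are kept unchanged, and the output is a simple "count, tab, sequence" table instead of miRDeep2's fasta identifiers. The function name clip_and_collapse is hypothetical.

```shell
# Hypothetical sketch of adapter clipping followed by read collapsing.
clip_and_collapse() {
    # $1: adapter sequence; reads (one sequence per line) come from stdin
    awk -v adapter="$1" '
        {
            prefix = substr(adapter, 1, 6)       # only the first 6 nt are searched
            pos = index($0, prefix)              # first exact match, 0 if none
            seq = (pos > 0) ? substr($0, 1, pos - 1) : $0
            if (seq != "") count[seq]++          # drop reads that are adapter only
        }
        END {
            for (seq in count) print count[seq] "\t" seq
        }
    ' | sort -k1,1nr -k2,2                       # most frequent sequence first
}
```

Each output line corresponds to one entry of a collapsed file, with the count playing the role of the number after _x in the identifier.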

After that the reads will be mapped to the given reference genome, whose bowtie index was specified by option '-p'. Option '-s' gives the name of the preprocessed reads file which is output by the mapper module and option '-t' the file name of the read mappings to the reference database ('refdb.fa') in miRDeep2's arf format. Option '-v' prints a progress report and option '-o 4' lets bowtie use 4 threads. A mapping file in arf format can also be obtained from a standard bowtie 1 output file (this is NOT the 'sam' format but bowtie's own text file format) by typing

convert_bowtie_output.pl reads_vs_refdb.bwt > reads_vs_refdb.arf

However, if you used the mapping module then the mapped output file is already in arf format.

Identification of known and novel miRNAs

For predicting novel miRNAs the miRDeep2 module from the package is called with a collapsed reads file, a reference genome file in fasta format and the read mappings in arf format. For better prediction results reference files of known miRNAs of your species and of related species should be given, since miRDeep2 considers predicted miRNAs with seeds conserved in other species more reliable than miRNAs with non-conserved seeds.

miRDeep2.pl reads_collapsed.fa refdb.fa reads_vs_refdb.arf mature_ref.fa mature_other.fa hairpin_ref.fa -t hsa 2>report.log
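Since this call takes six input files produced in earlier steps, it can save time to confirm that all of them exist and are non-empty before starting a long run. A minimal sketch, with check_inputs being a hypothetical helper that is not part of miRDeep2:

```shell
# Hypothetical pre-flight check: report every input file that is missing or empty.
check_inputs() {
    ci_status=0
    for ci_file in "$@"; do
        if [ ! -s "$ci_file" ]; then             # -s: file exists and has size > 0
            echo "missing or empty: $ci_file" >&2
            ci_status=1
        fi
    done
    return "$ci_status"                          # non-zero if any file failed
}
```

It could be chained in front of the prediction step, for example: check_inputs reads_collapsed.fa refdb.fa reads_vs_refdb.arf mature_ref.fa mature_other.fa hairpin_ref.fa && echo "inputs look fine"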

Data preprocessing for miRNA expression profiling
