Bacterial genome assembly tutorial

Source: https://bioinformatics.uconn.edu/bacterial-genome-assembly-tutorial/#

This tutorial will serve as an example of how to use free and open-source genome assembly and secondary scaffolding tools to generate high quality assemblies of bacterial sequence data. The bacterial sample used in this tutorial will be referred to simply as “Species” since it is live data. This data is paired-end data, meaning that there are forward and reverse reads, which we will designate as Sample_R1.fastq and Sample_R2.fastq, respectively.

Software download links:

Sickle
ABySS
SOAPdenovo
SPAdes
QUAST
SSPACE
AlignGraph

Assembly tutorial directory

<pre>/common/Assembly_Tutorial</pre>

Sickle: Quality control on raw reads

The first step is to perform quality control on the reads using sickle. To run the program we will use the sickle command. Since our reads are paired-end reads, we indicate this with the pe option. The -f flag designates the input file containing the forward reads, -r the input file containing the reverse reads, -o the output file containing the trimmed forward reads, -p the output file containing the trimmed reverse reads, and -s the output file containing trimmed singles. The -q flag designates the minimum quality, -l the minimum read length, and -t designates the type of read.

<pre class="p1">sickle pe -f /common/Assembly_Tutorial/Sample_R1.fastq -r /common/Assembly_Tutorial/Sample_R2.fastq -t sanger -o Sample_1.fastq -p Sample_2.fastq -s Sample_s.fastq -q 30 -l 45</pre>

The trimmed quality control files are located in /common/Assembly_Tutorial/Quality_Control and the script to perform the quality control is located at /common/Assembly_Tutorial/Quality_Control/Sample_QC.sh.

ABySS: de novo sequence assembler

ABySS is the first assembly program we will use to assemble our trimmed reads. Since our reads are paired-end reads, to run the assembler we will use the abyss-pe command. We will use the parameters k for the size of the kmer, name for the output file prefix, in for the paths to the forward/reverse trimmed reads, and se for the path to the singles file, np for number of processors, which in this case should be as same as number of processors declared in the header of your shell script.

<pre class="p1">abyss-pe np=8 k=31 name=Sample_Kmer31 in='/common/Assembly_Tutorial/Quality_Control/Sample_1.fastq /common/Assembly_Tutorial/Quality_Control/Sample_2.fastq' se='/common/Assembly_Tutorial/Quality_Control/Sample_s.fastq' # repeat for k=35, k=41, etc</pre>

The kmers used in this example can be viewed as a starting point to get an idea of what kmer would best assemble the data. The assembly output files are located in /common/Assembly_Tutorial/Assembly/ABySS and the script to perform assembly is located at /common/Assembly_Tutorial/Assembly/Sample_assembly.sh. Note that this script also includes the assembly commands for SOAP and SPAdes.

SOAPdenovo: de novo sequence assembler

SOAPdenovo is another de novo sequence assembler. Unlike the other assemblers, SOAP uses a config file to pass information about the sequences into the program. The configuration file is shown below. Notable fields include average insert size and read length, which differ depending on the sequencing technology, and q1, q2, and q; the paths to the forward, reverse and singles trimmed reads.

<pre>#maximal read length
max_rd_len=250
[LIB]

average insert size

avg_ins=550

if sequence needs to be reversed

reverse_seq=0

in which part(s) the reads are used

asm_flags=3

use only first 250 bps of each read

rd_len_cutoff=250

in which order the reads are used while scaffolding

rank=1

cutoff of pair number for a reliable connection (at least 3 for short insert size)

pair_num_cutoff=3

minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)

map_len=32

path to genes

q1=/common/Assembly_Tutorial/Quality_Control/Sample_1.fastq
q2=/common/Assembly_Tutorial/Quality_Control/Sample_2.fastq
q=/common/Assembly_Tutorial/Quality_Control/Sample_s.fastq</pre>

To run the assembler we will use the SOAPdenovo-63mer command with the all option (to perform kmer graph construction, contig error correction, mapping of reads to contigs, and scaffolding), -s for the path to the config file, -K for the size of the kmer, -o for the output prefix, 1 for assembly log, and 2 for assembly errors.

<pre>SOAPdenovo-63mer all -s /common/Assembly_Tutorial/Assembly/Sample.config -K 31 -R -o graph_Sample_31 1>ass31.log 2>ass31.err

repeat for k=35, k=41, etc</pre>

The assembly output files are located in /common/Assembly_Tutorial/Assembly/SOAP.

SPAdes: de Bruijn graph based assembler

The last assembler we will run is SPAdes. SPAdes is different from the other assemblers in that it generates a final assembly from multiple kmers. A list of kmers is automatically selected by SPAdes using the maximum read length of the input data, and each individual kmer contributes to the final assembly. To run SPAdes we will use the spades.py command with the --careful option to minimize the number of mismatches in the contigs, -o for the output folder, -1 for the path to the forward reads, -2 for the path to the reverse reads, and -s for the path to the singles reads. If desired, a list of kmers can be specified with the -k flag which will override automatic kmer selection.

<pre>spades.py --careful -o SPAdes_out -1 /common/Assembly_Tutorial/Quality_Control/Sample_1.fastq -2 /common/Assembly_Tutorial/Quality_Control/Sample_2.fastq -s /common/Assembly_Tutorial/Quality_Control/Sample_s.fastq</pre>

The assembly output files are located in /common/Assembly_Tutorial/Assembly/SPAdes.

QUAST: assembly statistics

Now that we have several assemblies, it’s time to analyze the quality of each assembly. ABySS and SOAPdenovo both have their own statistics output, but for consistency, we will be using the program QUAST. The statistics we are most interested in are number of contigs, total length, and N50. A good assembly would have a low number of contigs, a total length that makes sense for the species, and a high N50 value. To run quast on all of our final assembly files we will run the following commands, with the only parameters used being the name of the assembly file(s) and output directory.

<pre># ABySS statistics
python /opt/bioinformatics/quast-2.3/quast.py /common/Assembly_Tutorial/Assembly/ABySS/Sample_Kmer*-scaffolds.fa -o ABySS</pre>

<pre># SOAPdenovo statistics
python /opt/bioinformatics/quast-2.3/quast.py /common/Assembly_Tutorial/Assembly/SOAP/graph_Sample_*.scafSeq -o SOAP</pre>

<pre># SPAdes statistics
python /opt/bioinformatics/quast-2.3/quast.py /common/Assembly_Tutorial/Assembly/SPAdes/scaffolds.fasta -o SPAdes</pre>

Abyss results:

<colgroup><col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"></colgroup>
| Assembly | # contigs | Largest contig | Total length | GC (%) | N50 |
| Sample_Kmer31-scaffolds | 363 | 86593 | 2779506 | 32.76 | 14714 |
| Sample_Kmer35-scaffolds | 342 | 86909 | 2787431 | 32.75 | 16801 |
| Sample_Kmer41-scaffolds | 330 | 84960 | 2794086 | 32.76 | 17579 |

SOAP results:

<colgroup><col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"></colgroup>
| Assembly | # contigs | Largest contig | Total length | GC (%) | N50 |
| graph_Sample_31.scafSeq | 276 | 103125 | 3574101 | 32.44 | 26176 |
| graph_Sample_35.scafSeq | 246 | 86844 | 3543834 | 32.46 | 27766 |
| graph_Sample_41.scafSeq | 214 | 99593 | 3438095 | 32.46 | 36169 |

SPAdes results:

<colgroup><col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"></colgroup>
| Assembly | # contigs | Largest contig | Total length | GC (%) | N50 |
| scaffolds | 59 | 255551 | 2880184 | 32.65 | 147660 |

From the data, it’s clear that SPAdes performed the best. SPAdes generated only 59 contigs as compared to ~200 from SOAP and ~300 from ABySS. Additionally, the largest contig size and N50 values were the highest. Finally, the total number of base pairs was closest to the number of base pairs in a different strain of this bacteria that has already been sequenced. We will proceed to secondary scaffolding with this assembly, located in /common/Assembly_Tutorial/Assembly/SPAdes/scaffolds.fasta.

QUAST’s output consists of a folder containing results in multiple formats within each of the three assembly directories. The script to run QUAST is located at /common/Assembly_Tutorial/QUAST/Sample_quast.sh.

SSPACE Standard

SSPACE is a script able to extend and scaffold pre-assembled contigs. SSPACE requires a library file containing the paths to the paired end reads, average insert size, and type of data. This file is located at /common/Assembly_Tutorial/Scaffolding/Species_library.txt.

We will run SSPACE using a perl command with the parameters -l for the species library, -s for the fasta file containing assembled scaffolds, -b for the output prefix, and -T for the number of threads.

<pre class="p1">perl /opt/bioinformatics/SSPACE-STANDARD/SSPACE_Standard_v3.0.pl -l /common/Assembly_Tutorial/Species_library.txt -s /common/Assembly_Tutorial/Assembly/SPAdes/scaffolds.fasta -b SSPACE -T 16</pre>

The output file is located at /common/Assembly_Tutorial/Scaffolding/SSPACE/Sample_SSPACE.final.scaffolds.fasta. The script to run SSPACE is located at /common/Assembly_Tutorial/Scaffolding/Sample_sspace.sh.

We then will run QUAST on this file to compare it with previous assemblies. This time, we will run QUAST over the command line without a submit script, since it is only one line.

<pre>cd /common/Assembly_Tutorial/QUAST
python /opt/bioinformatics/quast-2.3/quast.py /common/Assembly_Tutorial/Scaffolding/SSPACE/SSPACE.final.scaffolds.fasta -o SSPACE</pre>

Quast results:

<colgroup><col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"></colgroup>
| Assembly | # contigs | Largest contig | Total length | GC (%) | N50 |
| Sample_SSPACE.final.scaffolds | 57 | 255551 | 2880249 | 32.65 | 147660 |

AlignGraph on close relation (different strain of species)

AlignGraph is the final step in this assembly pipeline. From the documentation, “AlignGraph is a software that extends and joins contigs or scaffolds by reassembling them with help provided by a reference genome of a closely related organism.” By using a reference genome of a closely related organism, it can improve the assembly.

To run AlignGraph we first need to convert the raw reads from fastq format to fasta format. There are many ways to do this, but one of the most efficient ways is to use a sed command to parse out the reads from the fastq file:

<pre class="p1">sed -n '14s/^@/>/p;24p' /common/Assembly_Tutorial/Sample_R1.fastq > Sample_R1.fasta
sed -n '14s/^@/>/p;24p' /common/Assembly_Tutorial/Sample_R2.fastq > Sample_R2.fasta</pre>

Then we will run AlignGraph using the AlignGraph command and the parameters --read1 for the forward read in fasta format, --read2 for the reverse read in fasta format, --contig for the path to the assembly we are rescaffolding, and --genome for the path to the reference genome we are using for rescaffolding. The genome we are using is named AlignGraph_genome.fasta, again to protect the live data.

Additionally, we have to define the --distanceLow and --distanceHigh parameters. From the documentation, distanceLow is the maximum of [insert size – 1000, insert size] and distanceHigh [insert size + 1000]. The insert size of this dataset is 550, giving us a distanceLow of 550 and distanceHigh of 1550. Finally, we define the output file names using --extendedContigs and --remainingContigs. –remainingContigs will contain the final assembly.

<pre class="p1">AlignGraph --read1 /common/Assembly_Tutorial/Scaffolding/Sample_R1.fasta --read2 /common/Assembly_Tutorial/Scaffolding/Sample_R1.fasta --contig /common/Assembly_Tutorial/Scaffolding/SSPACE/SSPACE.final.scaffolds.fasta --genome /common/Assembly_Tutorial/Scaffolding/AlignGraph_genome.fasta --distanceLow 550 --distanceHigh 1550 --extendedContig Species_extendedContigs.fa --remainingContig Species_remainingContigs.fa</pre>

The output file is located at /common/Assembly_Tutorial/Scaffolding/AlignGraph/Sample_remainingContigs.fa. The script to run AlignGraph is located at /common/Assembly_Tutorial/Scaffolding/Sample_aligngraph.sh.

Then QUAST:

<pre>cd /common/Assembly_Tutorial/QUAST
python /opt/bioinformatics/quast-2.3/quast.py /common/Assembly_Tutorial/Scaffolding/AlignGraph/Species_remainingContigs.fa -o AlignGraph</pre>

<colgroup><col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"> <col width="100"></colgroup>
| Assembly | # contigs | Largest contig | Total length | GC (%) | N50 |
| Species_remainingContigs | 57 | 255551 | 2880249 | 32.65 | 147660 |

Unfortunately, this dataset was not improved by AlignGraph with this specific genome, but this tutorial still illustrates the general idea.

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末泡嘴,一起剝皮案震驚了整個(gè)濱河市在张,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌阶牍,老刑警劉巖恳啥,帶你破解...
    沈念sama閱讀 217,277評(píng)論 6 503
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件偏灿,死亡現(xiàn)場(chǎng)離奇詭異,居然都是意外死亡钝的,警方通過(guò)查閱死者的電腦和手機(jī)翁垂,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,689評(píng)論 3 393
  • 文/潘曉璐 我一進(jìn)店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)硝桩,“玉大人沿猜,你說(shuō)我怎么就攤上這事⊥爰梗” “怎么了啼肩?”我有些...
    開封第一講書人閱讀 163,624評(píng)論 0 353
  • 文/不壞的土叔 我叫張陵,是天一觀的道長(zhǎng)衙伶。 經(jīng)常有香客問(wèn)我祈坠,道長(zhǎng),這世上最難降的妖魔是什么矢劲? 我笑而不...
    開封第一講書人閱讀 58,356評(píng)論 1 293
  • 正文 為了忘掉前任赦拘,我火速辦了婚禮,結(jié)果婚禮上卧须,老公的妹妹穿的比我還像新娘另绩。我一直安慰自己儒陨,他們只是感情好,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,402評(píng)論 6 392
  • 文/花漫 我一把揭開白布笋籽。 她就那樣靜靜地躺著蹦漠,像睡著了一般。 火紅的嫁衣襯著肌膚如雪车海。 梳的紋絲不亂的頭發(fā)上笛园,一...
    開封第一講書人閱讀 51,292評(píng)論 1 301
  • 那天,我揣著相機(jī)與錄音侍芝,去河邊找鬼研铆。 笑死,一個(gè)胖子當(dāng)著我的面吹牛州叠,可吹牛的內(nèi)容都是我干的棵红。 我是一名探鬼主播,決...
    沈念sama閱讀 40,135評(píng)論 3 418
  • 文/蒼蘭香墨 我猛地睜開眼咧栗,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼逆甜!你這毒婦竟也來(lái)了?” 一聲冷哼從身側(cè)響起致板,我...
    開封第一講書人閱讀 38,992評(píng)論 0 275
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤交煞,失蹤者是張志新(化名)和其女友劉穎,沒(méi)想到半個(gè)月后斟或,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體素征,經(jīng)...
    沈念sama閱讀 45,429評(píng)論 1 314
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,636評(píng)論 3 334
  • 正文 我和宋清朗相戀三年萝挤,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了御毅。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片。...
    茶點(diǎn)故事閱讀 39,785評(píng)論 1 348
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡平斩,死狀恐怖亚享,靈堂內(nèi)的尸體忽然破棺而出,到底是詐尸還是另有隱情绘面,我是刑警寧澤欺税,帶...
    沈念sama閱讀 35,492評(píng)論 5 345
  • 正文 年R本政府宣布,位于F島的核電站揭璃,受9級(jí)特大地震影響晚凿,放射性物質(zhì)發(fā)生泄漏。R本人自食惡果不足惜瘦馍,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,092評(píng)論 3 328
  • 文/蒙蒙 一歼秽、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧情组,春花似錦燥筷、人聲如沸箩祥。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,723評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)袍祖。三九已至,卻和暖如春谢揪,著一層夾襖步出監(jiān)牢的瞬間蕉陋,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 32,858評(píng)論 1 269
  • 我被黑心中介騙來(lái)泰國(guó)打工拨扶, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留凳鬓,地道東北人。 一個(gè)月前我還...
    沈念sama閱讀 47,891評(píng)論 2 370
  • 正文 我出身青樓患民,卻偏偏與公主長(zhǎng)得像缩举,于是被迫代替她去往敵國(guó)和親。 傳聞我的和親對(duì)象是個(gè)殘疾皇子酒奶,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 44,713評(píng)論 2 354

推薦閱讀更多精彩內(nèi)容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi閱讀 7,325評(píng)論 0 10
  • 今早兒子因?yàn)樽蛲砜磿t起晚了蚁孔。他似乎太困了,我心情寧?kù)o惋嚎,平和地喊他起床,他皺皺眉頭站刑,眼睛緊閉另伍,伸個(gè)生硬的...
    光與愛的Lily閱讀 183評(píng)論 0 1
  • 很長(zhǎng)一段時(shí)間,都是打字绞旅,刪除摆尝,打字,刪除因悲。 秉持著最初的想法堕汞,寫文章是梳理自己的內(nèi)心世界,通過(guò)文字觀察自我晃琳,很早的...
    羅嫚閱讀 290評(píng)論 2 3
  • 老爸的突然離世讯检,讓我感慨萬(wàn)千,想起了好多的往事卫旱。 媽媽說(shuō)我出生的時(shí)候人灼,老爸他當(dāng)時(shí)一聽接生婆說(shuō)是個(gè)女孩,就礎(chǔ)在那說(shuō)顾翼;...
    修家三妹閱讀 190評(píng)論 0 1
  • 親生母親給我灌下湯藥投放,逼我和傻子在一起,就在我快要失守的時(shí)候适贸,傻子忽然暴斃被鬼附身灸芳,為保清白我和惡鬼做了交易涝桅,從此...
    愛笑的南瓜閱讀 354評(píng)論 0 1