關(guān)鍵詞:BSA分析在抛,測(cè)序數(shù)據(jù)質(zhì)控钟病,F(xiàn)astQC,Trimmomatic刚梭,去接頭序列
數(shù)據(jù)格式
一定要先搞清楚數(shù)據(jù)的格式肠阱,才能進(jìn)行進(jìn)一步的數(shù)據(jù)整理。
我們拿到測(cè)序數(shù)據(jù)基本是fastq格式的文件朴读,現(xiàn)在一般都是150 PE(150bp, 雙端)的reads暂幼,數(shù)據(jù)文件通常也會(huì)命名成_1.fq.gz和_2.fq.gz淮菠,進(jìn)行區(qū)分(注意雙端數(shù)據(jù)一個(gè)樣本對(duì)應(yīng)兩個(gè)數(shù)據(jù)文件)蘸鲸。也有可能是從公共數(shù)據(jù)庫(kù)下載的數(shù)據(jù)榨婆,比如NCBI的SRA格式的數(shù)據(jù),可以使用sratoolkit這類軟件將格式轉(zhuǎn)換成fastq氮唯。
fastq的格式如下:
第一行:描述行(Header Line)鉴吹。以“@”字符開頭,屬于數(shù)據(jù)的標(biāo)簽惩琉,包含序列的名稱豆励、標(biāo)識(shí)符等測(cè)序樣本的信息,雙端數(shù)據(jù)還會(huì)有類似于"/1"和"/2"(也有“:2“和”:1“這種瞒渠,具體看數(shù)據(jù))這種分類標(biāo)識(shí)良蒸。
第二行:序列行(Sequence Data Line),留意序列長(zhǎng)度伍玖,如果是150 PE的序列嫩痰,小于150bp的數(shù)據(jù)我們通常會(huì)過(guò)濾掉;如果序列長(zhǎng)度大于150bp窍箍,一般是因?yàn)樾蛄械慕宇^未去除串纺。
第三行:分隔行(Separator Line):以“+”開始丽旅,可以儲(chǔ)存一些附加信息,一般是空的
第四行:質(zhì)量分?jǐn)?shù)行(Quality Score Line):包含與序列數(shù)據(jù)行中的對(duì)應(yīng)堿基的質(zhì)量分?jǐn)?shù)造垛。質(zhì)量分?jǐn)?shù)通常以ASCII字符表示,表示測(cè)序儀器對(duì)每個(gè)堿基的測(cè)序質(zhì)量晰搀。較高的ASCII值表示更高的質(zhì)量五辽,我們一般會(huì)去統(tǒng)計(jì)Q30堿基的比例,以此作為測(cè)序數(shù)據(jù)的質(zhì)量好壞的衡量標(biāo)準(zhǔn)之一外恕。
質(zhì)量值對(duì)應(yīng)表杆逗,計(jì)算方式可自行百度:
數(shù)據(jù)示例:
#sample1_1.fq.gz
@E150015079L1C001R0020000323/1
CATACTGAGTATGGTGATTCAGGATTATTCCTGACGGATGTATCGAATGTTGTGGTGAAAGTGTACCACCTCTGCAGAGTGCCAAACTACTCGAATAACCGCCTCCGTAGTTATGGACAGTTGGAGTGGCCATACTATTCTATCGCCAGA
+
@CCCCDDCCCDCC@CCCCDCCCCBCCBBCCCCCBDCCCCCCCCDCB?CCCCCCDCCCCCCCBBC@CDCCDDCCCCCDA<CDCCCCACCCDDDCCCCCCCDCCDDDDCCCCCCCCC3CDC0CCA?CC?C9@DDCCDCCBCDDCDCCCDDCC
#sample1_2.fq.gz
@E150015079L1C001R0020000323/2
GCAAAAAGCTAGATTTATCTCTAACTGAGTTTAGAAGTAGAGAAAACTAGTTTAATGTCTTTCAACACCTAAGTGGGAACTTAGCTTTCACTTTCATAACCCTGTTGTGATCAAGTCAAATTCATGATTAATTTCACAAGTCATTTTGAA
+
CCCC5?CCCCCCCCCCCDCCCCCBCCC@CCCBCCCCCCCBCCCBCCCCCCACCCCCCCCCCCCCACCDCCC5CCCBC?CCCCC>CCCCCBDCCCCCCBACCCCBCCBCC/CA@@CCCC=CBCCBCCCBC;?CCCDCCACC=DCCCCCCCC
數(shù)據(jù)質(zhì)控
FastQC下載安裝
如果了解數(shù)據(jù)格式之后仍舊不確定數(shù)據(jù)是否去除接頭序列以及數(shù)據(jù)質(zhì)量好壞,我們可以借助軟件的力量鳞疲,這里就不得不提fq質(zhì)控常見軟件FastQC了罪郊。
FastQC軟件Windows和Linux兩個(gè)系統(tǒng)的適用版本都有
1、Windos系統(tǒng)
下載后解壓縮-->雙擊"run_fastqc.bat"運(yùn)行程序--> 點(diǎn)擊"File">點(diǎn)擊"Open">選擇要分析的序列文件-->FastQC的分析界面展示結(jié)果-->點(diǎn)擊"File">點(diǎn)擊"Save report"保存報(bào)告
幫助說(shuō)明文檔:點(diǎn)擊“Help”>“Contents”
2尚洽、Linux系統(tǒng)
還是推薦用Linux悔橄,畢竟不受電腦cpu和樣本數(shù)據(jù)量的限制
#確認(rèn)是否安裝好了Java
java -version
#沒(méi)有就安
sudo apt install default-jre
sudo apt install openjdk-11-jre-headless
sudo apt install openjdk-8-jre-headless
sudo apt install openjdk-9-jre-headless
#安裝好之后,解壓fastqc,更改權(quán)限,執(zhí)行看是否安裝成功
unzip ~/softwarepath/fastqc_v0.12.1.zip -d ~/softwarepath/
chmod 777 ~/softwarepath/FastQC/fastqc
~/softwarepath/FastQC/fastqc -h
#懶人方法(推薦)
conda create -n fastqc
conda activate fastqc
conda install -c bioconda fastqc -y
#安裝好后腺毫,寫入環(huán)境配置文件中
echo 'export PATH=~/softwarepath/FastQC:$PATH' >>~/.bashrc
source ~/.bashrc
FastQC使用方法
fastqc --help
FastQC - A high throughput sequence QC analysis tool
SYNOPSIS
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN
DESCRIPTION
FastQC reads a set of sequence files and produces from each one a quality
control report consisting of a number of different modules, each one of
which will help to identify a different potential type of problem in your
data.
If no files to process are specified on the command line then the program
will start as an interactive graphical application. If files are provided
on the command line then the program will run with no user interaction
required. In this mode it is suitable for inclusion into a standardised
analysis pipeline.
The options for the program as as follows:
-h --help Print this help file and exit
-v --version Print the version of the program and exit
-o --outdir Create all output files in the specified output directory.
Please note that this directory must exist as the program
will not create it. If this option is not set then the
output file for each sequence file is created in the same
directory as the sequence file which was processed.
--casava Files come from raw casava output. Files in the same sample
group (differing only by the group number) will be analysed
as a set rather than individually. Sequences with the filter
flag set in the header will be excluded from the analysis.
Files must have the same names given to them by casava
(including being gzipped and ending with .gz) otherwise they
won't be grouped together correctly.
--nano Files come from nanopore sequences and are in fast5 format. In
this mode you can pass in directories to process and the program
will take in all fast5 files within those directories and produce
a single output file from the sequences found in all files.
--nofilter If running with --casava then don't remove read flagged by
casava as poor quality when performing the QC analysis.
--extract If set then the zipped output file will be uncompressed in
the same directory after it has been created. If --delete is
also specified then the zip file will be removed after the
contents are unzipped.
-j --java Provides the full path to the java binary you want to use to
launch fastqc. If not supplied then java is assumed to be in
your path.
--noextract Do not uncompress the output file after creating it. You
should set this option if you do not wish to uncompress
the output when running in non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp. All reports will
show data for every base in the read. WARNING: Using this
option will cause fastqc to crash and burn if you use it on
really long reads, and your plots may end up a ridiculous size.
You have been warned!
--min_length Sets an artificial lower limit on the length of the sequence
to be shown in the report. As long as you set this to a value
greater or equal to your longest read length then this will be
the sequence length used to create your read groups. This can
be useful for making directly comaparable statistics from
datasets with somewhat variable read lengths.
--dup_length Sets a length to which the sequences will be truncated when
defining them to be duplicates, affecting the duplication and
overrepresented sequences plot. This can be useful if you have
long reads with higher levels of miscalls, or contamination with
adapter dimers containing UMI sequences.
-f --format Bypasses the normal sequence file format detection and
forces the program to use the specified format. Valid
formats are bam,sam,bam_mapped,sam_mapped and fastq
--memory Sets the base amount of memory, in Megabytes, used to process
each file. Defaults to 512MB. You may need to increase this if
you have a file with very long sequences in it.
--svg Save the graphs in the report in SVG format.
-t --threads Specifies the number of files which can be processed
simultaneously. Each thread will be allocated 250MB of
memory so you shouldn't run more threads than your
available memory will cope with, and not more than
6 threads on a 32 bit machine
-c Specifies a non-default file which contains the list of
--contaminants contaminants to screen overrepresented sequences against.
The file must contain sets of named contaminants in the
form name[tab]sequence. Lines prefixed with a hash will
be ignored.
-a Specifies a non-default file which contains the list of
--adapters adapter sequences which will be explicity searched against
the library. The file must contain sets of named adapters
in the form name[tab]sequence. Lines prefixed with a hash
will be ignored.
-l Specifies a non-default file which contains a set of criteria
--limits which will be used to determine the warn/error limits for the
various modules. This file can also be used to selectively
remove some modules from the output all together. The format
needs to mirror the default limits.txt file found in the
Configuration folder.
-k --kmers Specifies the length of Kmer to look for in the Kmer content
module. Specified Kmer length must be between 2 and 10. Default
length is 7 if not specified.
-q --quiet Suppress all progress messages on stdout and only report errors.
-d --dir Selects a directory to be used for temporary files written when
generating report images. Defaults to system temp directory if
not specified.
主要參數(shù)解讀:
-o 或 --outdir #FastQC生成的報(bào)告文件的輸出路徑
--extract #生成的報(bào)告不打包成壓縮文件癣疟,默認(rèn)打包
-t 或 --threads #選擇程序運(yùn)行的線程數(shù)
-q 或 --quiet #安靜運(yùn)行模式,一般不選
#生成結(jié)果文件名與輸入有關(guān)潮酒。
fastqc path/to/sample1_1.fq.gz path/to/sample1_2.fq.gz … -o 文件夾
#使用fastqc進(jìn)行質(zhì)量檢測(cè)睛挚,會(huì)當(dāng)前文件夾生成一個(gè).html網(wǎng)頁(yè)文件和一個(gè).zip文件
FastQC結(jié)果解讀
需要重點(diǎn)關(guān)注的結(jié)果
● Basic Statistics: 數(shù)據(jù)量概覽
● Per base sequence quality:reads單堿基測(cè)序質(zhì)量的直接展示。
● Per sequence quality scores:總體reads測(cè)序質(zhì)量趨勢(shì)急黎。
● Per base sequence content: ATGC含量估計(jì)測(cè)序是否存在偏差扎狱。
● Sequence Duplication Levels:數(shù)據(jù)重復(fù)情況。
影響測(cè)序的因素很多勃教,要排除數(shù)據(jù)污染淤击,冗余、數(shù)據(jù)量不足等問(wèn)題故源,因此前期數(shù)據(jù)處理時(shí)一定要進(jìn)行質(zhì)控遭贸。
使用瀏覽器打開后綴是html的文件查看網(wǎng)頁(yè)版結(jié)果,就是圖表化的fastqc報(bào)告心软。
整個(gè)報(bào)告分成11個(gè)部分壕吹。綠色的√表示合格;黃色的删铃!表示警告耳贬;紅色的×表示不合格。
Encoding:表示測(cè)序平臺(tái)版本及對(duì)應(yīng)編碼版本號(hào)猎唁。
Total Sequences:reads總數(shù)咒劲。
Sequence length:測(cè)序長(zhǎng)度,范圍值。
%GC:表示序列中的GC含量腐魂,具有物種特異性帐偎,相同物種的GC較為接近。GC含量遠(yuǎn)遠(yuǎn)偏離該物種常見值蛔屹,說(shuō)明測(cè)序數(shù)據(jù)存在一定偏好性削樊,如果直接使用數(shù)據(jù)會(huì)影響后續(xù)變異檢測(cè)的分析。
橫軸代表堿基位置漫贞;縱軸代表堿基質(zhì)量值,紅色線表示該位置質(zhì)量值中位數(shù)育叁,黃色是25%-75%區(qū)間迅脐,觸須是10%-90%區(qū)間,藍(lán)線是質(zhì)量值平均數(shù)豪嗽。(一般要求所有位置的1/10分位數(shù)大于20谴蔑,否則去除質(zhì)量值在20以下的堿基,即Q20過(guò)濾龟梦。Warning树碱,如果某位置堿基質(zhì)量低于10,或者是某位置堿基質(zhì)量值中位數(shù)低于25;Failure,如果某位置堿基質(zhì)量低于5,或者是某位置堿基質(zhì)量值中位數(shù)低于20变秦。)
tile:每一次測(cè)序熒光掃描的最小單位成榜。橫軸代表144個(gè)堿基的位置;縱軸是tile的Index編號(hào)蹦玫。(檢查reads中每一個(gè)堿基位置在不同的測(cè)序小孔之間的偏離度赎婚,藍(lán)色表示低于平均偏離度,偏離度小樱溉,質(zhì)量好挣输;紅色表示偏離平均質(zhì)量越多,質(zhì)量也越差福贞。)
reads的質(zhì)量值是指每個(gè)位置Q值的平均值撩嚼,該圖橫軸表示Q值,范圍0-40挖帘,完丽;縱軸是某一Q值對(duì)應(yīng)的reads數(shù)。如果測(cè)序結(jié)果主要集中在高Q值區(qū)拇舀,證明測(cè)序質(zhì)量較好B咦濉(當(dāng)測(cè)序質(zhì)量峰值小于27(錯(cuò)誤率0.2%)時(shí)報(bào)"WARN";當(dāng)峰值小于20(錯(cuò)誤率1%)時(shí)報(bào)"FAIL"骄崩。)
橫軸是1-150 bp聘鳞;縱軸是百分比薄辅。圖中四條線代表A T C G在每個(gè)位置平均含量。( 理論認(rèn)為抠璃,%A=%T站楚,%G=%C,由于測(cè)序儀初始狀態(tài)狀態(tài)不穩(wěn)定搏嗡,可能會(huì)出現(xiàn)上圖左端的情況窿春。如果存在這種情況,需要去除開始部分序列信息,一般10bp左右即可彻况,具體參考橫坐標(biāo)軸谁尸。)
橫軸是平均GC值舅踪,為1%-100%纽甘; 縱軸是每條序列GC含量對(duì)應(yīng)的reads數(shù)量(藍(lán)線是經(jīng)驗(yàn)分布理論值,紅線是真實(shí)值抽碌,二者越接近越好悍赢。如果出現(xiàn)多峰紅線,一般是序列存在污染)
reads每個(gè)位置N(未知堿基)的比例货徙。當(dāng)出現(xiàn)峰時(shí)左权,說(shuō)明測(cè)序存在問(wèn)題。(當(dāng)任意位置的N的比例超過(guò)5%痴颊,報(bào)"WARN"赏迟;當(dāng)任意位置的N的比例超過(guò)20%,報(bào)"FAIL"蠢棱。)
測(cè)序長(zhǎng)度一般應(yīng)該相等锌杀,允許些許序列存在幾bp的偏差存在,這種序列數(shù)量少便不影響后續(xù)分析泻仙。(當(dāng)測(cè)序長(zhǎng)度參差不齊糕再,則表明測(cè)序數(shù)據(jù)有誤)
橫坐標(biāo)是重復(fù)的次數(shù),縱坐標(biāo)是重復(fù)序列占特異序列總數(shù)百分比玉转。(和PCR過(guò)程類似突想,測(cè)序會(huì)產(chǎn)生重復(fù)reads,測(cè)序深度越高,reads重復(fù)數(shù)越大;如果重復(fù)出現(xiàn)峰值究抓,就提示可能存在偏差猾担。fastqc一般抽取前200,000條reads進(jìn)行重復(fù)數(shù)統(tǒng)計(jì)。當(dāng)重復(fù)數(shù)目≥10時(shí)刺下,reads會(huì)被合并統(tǒng)計(jì)垒探,這也是最右側(cè)起峰的原因;而>75bp的reads一般只取50bp進(jìn)行比較怠李。由于reads越長(zhǎng)錯(cuò)誤率越高圾叼,所以重復(fù)率統(tǒng)計(jì)可能偏低蛤克。當(dāng)非特異的reads占總數(shù)的比例大于20%時(shí),報(bào)"WARN"夷蚊;當(dāng)非特異的reads占總數(shù)的比例大于50%時(shí)构挤,報(bào)"FAIL“)
如果在當(dāng)時(shí)fastqc分析的時(shí)候-a選項(xiàng)沒(méi)有內(nèi)容,則默認(rèn)使用圖例中的四種通用adapter序列進(jìn)行統(tǒng)計(jì)惕鼓,本例中接頭(adapter)已經(jīng)去除筋现。如果有adapter序列沒(méi)有去除干凈,會(huì)在圖中左右端起峰箱歧。在后續(xù)分析可以使用Trimmomatic等軟件去接頭矾飞。
其他質(zhì)控軟件
還有MultiQC、fastp等常見質(zhì)控的軟件呀邢,這里不再一一詳細(xì)介紹洒沦,有需要可留言。
MultiQC:可以整合多個(gè)樣本的質(zhì)控結(jié)果信息价淌,但相對(duì)的信息沒(méi)有FastQC那么詳細(xì)申眼、
fastp:除了可以完成質(zhì)控,還可以進(jìn)行接頭序列去除等操作蝉衣。
Trimmomatic下載安裝
#下載并解壓
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.40.zip && unzip Trimmomatic-0.40.zip
#解壓后的文件夾
Trimmomatic-0.39
├── adapters #該文件夾下是ILLUMINA常規(guī)的接頭文件括尸,有單端和雙端。
│ ├── NexteraPE-PE.fa
│ ├── TruSeq2-PE.fa
│ ├── TruSeq2-SE.fa
│ ├── TruSeq3-PE-2.fa
│ ├── TruSeq3-PE.fa
│ └── TruSeq3-SE.fa
├── LICENSE
└── trimmomatic-0.39.jar #編譯之后的文件病毡,使用java執(zhí)行
1 directory, 8 files
#懶人安裝
conda create -n trimmomatic
conda activate trimmomatic
conda install trimmomatic
#也可以使用git clone 從Github上獲得0.40的版本
Trimmomatic 使用方法
java -jar ~/softwarepath/trimmomatic-0.39.jar -h
Usage:
PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
or:
SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...
or:
-version
#參數(shù)注釋
PE/SE 設(shè)定對(duì)Paired-End或Single-End的reads進(jìn)行處理濒翻,其輸入和輸出參數(shù)稍有不一樣。
-threads 設(shè)置多線程運(yùn)行數(shù)
-phred33 設(shè)置堿基的質(zhì)量格式啦膜,可選pred64有送,不設(shè)置這個(gè)參數(shù),軟件會(huì)自動(dòng)判斷輸入文件是哪種格式
-trimlog 指定過(guò)濾日志文件名功戚,日志中包含以下四列內(nèi)容:read ID娶眷、過(guò)濾之后剩余序列長(zhǎng)度、過(guò)濾之后的序列起始?jí)A基位置(序列開頭處被切掉了多少個(gè)堿基)啸臀、過(guò)濾之后的序列末端堿基位置届宠、序列末端處被剪切掉的堿基數(shù)。
#由于生成的trimlog文件中包含了每一條 reads 的處理記錄乘粒,因此文件體積巨大(GB級(jí)別)豌注,如果后面不會(huì)用到 trim日志,建議不要使用這個(gè)參數(shù)
-basein 通常 PE 測(cè)序的兩個(gè)文件灯萍,指定其中 R1 文件名轧铁,軟件會(huì)推測(cè)出 R2 的文件名,但是這個(gè)功能實(shí)測(cè)并不好用旦棉,建議不用-basein參數(shù)齿风,直接指定兩個(gè)文件名(R1 和 R2)作為輸入
-baseout 輸出文件有四個(gè)药薯,使用 -baseout 參數(shù)指定輸出文件的 basename,軟件會(huì)自動(dòng)為四個(gè)輸出文件命名救斑,過(guò)濾之后雙端序列都保留的就是 paired童本,反之如果其中一端序列過(guò)濾之后被丟棄了另一端序列保留下來(lái)了就是 unpaired(即 成對(duì)的clean reads, 未成對(duì)的正向序列以及未成對(duì)的反向序列)
#一般情況下,若paired reads百分比占90%以上脸候,可只對(duì)paired reads進(jìn)行比對(duì)分析
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:true
切除adapter序列穷娱。參數(shù)后面分別接adapter序列的fasta文件:第一步 seed 搜索時(shí)允許的最大錯(cuò)配堿基個(gè)數(shù)2:palindrome模式下匹配堿基數(shù)閾值30:simple模式下的匹配堿基數(shù)閾值10(7-15之間):palindrome 模式允許切除的最短接頭序列為 8bp(默認(rèn)值):palindrome 模式去除與 R1 完全反向互補(bǔ)的 R2(默認(rèn)去除false),但在有些情況下运沦,例如需要用到 paired reads 的 bowtie2 流程泵额,就要將這個(gè)參數(shù)改為 true,否則會(huì)損失一部分 paired reads携添。
#按照規(guī)定順序嫁盲,ILLUMINACLIP 各個(gè)參數(shù)之間以冒號(hào)分開,PE測(cè)序需要注意最后一個(gè)參數(shù)薪寓。對(duì)于SE測(cè)序最后兩個(gè)參數(shù)可以不設(shè)置
LEADING:3 切除首端堿基質(zhì)量小于3的堿基
#Illumina平臺(tái)有些低質(zhì)量的堿基質(zhì)量值被標(biāo)記為 2 亡资,所以設(shè)置為3可以過(guò)濾掉這部分低質(zhì)量堿基澜共。
TRAILING:3 切除尾端堿基質(zhì)量小于3的堿基
SLIDINGWINDOW:15:20
滑窗質(zhì)量過(guò)濾向叉,一般一個(gè)read的低質(zhì)量序列都是集中在末端,也有很少部分在開頭嗦董。從5'端開始進(jìn)行滑動(dòng)母谎,當(dāng)滑動(dòng)位點(diǎn)周圍一段序列(window)的平均堿基低于閾值,則從該處進(jìn)行切除京革。Windows的size是15個(gè)堿基(一般設(shè)置在10-30之間)奇唤,其平均堿基質(zhì)量小于20,則切除
MINLEN:50 可被保留的最短reads長(zhǎng)度匹摇,應(yīng)根據(jù)原始序列的長(zhǎng)度而定
HEADCROP:<length> 在reads的首端切除指定的長(zhǎng)度
CROP:<length> 保留reads到指定的長(zhǎng)度
TOPHRED33 將堿基質(zhì)量轉(zhuǎn)換為pred33格式
TOPHRED64 將堿基質(zhì)量轉(zhuǎn)換為pred64格式
1咬扇、SE模式(單端數(shù)據(jù)模式)
指定輸入文件和輸出文件(均一個(gè))。
java -jar ~/softwarepath/trimmomatic-0.40.jar SE -threads 線程數(shù) -trimlog logFile(任務(wù)日志文件) <input> <output> <step 1> ...
(廊勃!不建議開)指定trimlog參數(shù)會(huì)創(chuàng)建一個(gè)記錄所有reads 的日志懈贺,內(nèi)容包括:
- read的名字,即QNAME
- 保留序列長(zhǎng)度
- 從起始位置坡垫,第一個(gè)保留堿基的位置梭灿。
- 原始read中最后一個(gè)堿基的位置
- 從末尾切除掉的堿基個(gè)數(shù)
可以根據(jù)需要指定多個(gè)步驟,所需處理(裁剪冰悠、接頭裁剪等步驟)作為附加參數(shù)放在指定在輸入/輸出文件之后均可堡妒。
2、PE模式(雙端數(shù)據(jù)模式)
對(duì)于Pair-End數(shù)據(jù)溉卓,需要兩個(gè)輸入文件皮迟,會(huì)輸出4個(gè)文件搬泥。其中兩個(gè)文件對(duì)應(yīng)于"paired"數(shù)據(jù),即read1和read2都保留伏尼。另外兩個(gè)對(duì)應(yīng)于"unpaired"數(shù)據(jù)佑钾,在處理的過(guò)程中會(huì)過(guò)濾掉其中一端的reads。
java -jar ~/softwarepath/trimmomatic-0.40.jar PE -threads 線程數(shù) -trimlog logFile(任務(wù)日志文件)\
[-basein <inputBase> | <input 1> <input 2>] [-baseout <outputBase> | <paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2> <step 1> ...
## 示例
java -jar ~/softwarepath/trimmomatic-0.39.jar PE \
# 輸入文件
sample_1.fq.gz sample_2.fq.gz \
# 輸出文件
Trim_paired_1.fq.gz Trim_unpaired_1.fq.gz \
Trim_paired_2.fq.gz Trim_unpaired_2.fq.gz \
# 去接頭參數(shù)
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
# 切除TruSeq3-PE中提供的Illumina adapter烦粒。最初Trimmomatic將尋找seed匹配(16個(gè)堿基)休溶,最多允許2個(gè)不匹配。如果在配對(duì)端讀取的情況下達(dá)到30分(約50個(gè)堿基扰她,50*0.6)兽掰,或在單端讀取的情況下達(dá)到10分(約17個(gè)堿基, 17*0.6),這些"seed"序列將被修剪徒役。
LEADING:3 \
# 刪除前低質(zhì)量堿基(低于質(zhì)量3)
TRAILING:3 \
# 刪除后低質(zhì)量堿基(低于質(zhì)量3)
SLIDINGWINDOW:4:15 \
# 掃描4個(gè)堿基寬滑動(dòng)窗口孽尽,當(dāng)每個(gè)堿基的平均質(zhì)量下降到15以下時(shí)進(jìn)行剪切
MINLEN:80
# 上述步驟完成后,刪除小于80個(gè)堿基的reads
Trimmomatic 結(jié)果文件
Trim_paired_*.fq.gz :read1和read2都保留的fastq文件忧勿;
Trim_unpaired_*.fq.gz :過(guò)濾后的fastq文件
往期回顧
參考資料
https://zhuanlan.zhihu.com/p/47722164
https://www.cnblogs.com/Sunny-King/archive/2022/06/16/Bioinformatics-Trimmomatic.html