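The official documentation already provides a detailed tutorial.
Here I record the methods I use most often:
Installation: install with Python 3 so that the multi-core option is available.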
python3 -m pip install --user --upgrade cutadapt
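What is a 3' adapter? It is an adapter that follows the sequence: XXXXXXXXXXXXXXadapter
What is a 5' adapter? It is an adapter at the start of the sequence: adapterXXXXXXXXXXXXXX
If my data is the first case, I use the -a option; the adapter and the sequence after it are both trimmed.
If it is the second case, I use the -g option; the adapter and the sequence before it are both trimmed.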
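To make the difference between -a and -g concrete, here is a toy sketch in Python. It is not how cutadapt works internally: it only handles an exact, full-length adapter match, while cutadapt also accepts partial matches and mismatches. The function names and the example read are made up for illustration.

```python
def trim_3prime(read: str, adapter: str) -> str:
    """Toy version of -a: drop the adapter and everything after it."""
    pos = read.find(adapter)
    # No exact match: return the read unchanged.
    return read if pos == -1 else read[:pos]


def trim_5prime(read: str, adapter: str) -> str:
    """Toy version of -g: drop the adapter and everything before it."""
    pos = read.find(adapter)
    return read if pos == -1 else read[pos + len(adapter):]


if __name__ == "__main__":
    read = "ACGTACGTTTAGGCCAA"   # made-up read containing the adapter TTAGG
    print(trim_3prime(read, "TTAGG"))  # keeps the part before the adapter
    print(trim_5prime(read, "TTAGG"))  # keeps the part after the adapter
```

With this read, the 3' version keeps ACGTACGT and the 5' version keeps CCAA, which is the same split that the p1 and p2 files further down are meant to capture.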
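The default maximum error rate for adapter matching is 10%; change it with the -e option. The output files in the examples below are uncompressed (cutadapt decides on compression from the output file extension).
Examples: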
cutadapt -a adapter=ATATCCAGAACCCTGACCCTGCCGTGTACCAGCTGAC -O 10 -o G18E2L2_R1.p1.fq -r R1.p2.fq --info-file=R1.cutadapt.log /your/fastq/fastq_1.fq.gz > R1.cutadapt.stats
cutadapt -g adapter=CACAGCGACCTCGGGTGGGAACACCTTGTTCAGGTCT -O 10 -o G18E2L2_R2.p1.fq -r R2.p2.fq --info-file=R2.cutadapt.log /your/fastq/fastq_2.fq.gz > R2.cutadapt.stats
-O --overlap=MINLENGTH : Require MINLENGTH overlap between read and adapter for an adapter to be found. Default: 3
-o output.fastq
-r FILE, --rest-file=FILE When the adapter matches in the middle of a read, write the rest (after the adapter) to FILE.
--info-file=FILE Write information about each read and its adapter matches into FILE. See the documentation for the file format.
-j CORES, --cores=CORES Number of CPU cores to use. Use 0 to auto-detect. Default: 1. Multi-core mode is not available under Python 2.
-a ADAPTER, --adapter=ADAPTER Sequence of an adapter ligated to the 3' end (paired data: of the first read). The adapter and subsequent bases are trimmed. If a '$' character is appended
('anchoring'), the adapter is only found if it is a suffix of the read.
-g ADAPTER, --front=ADAPTER Sequence of an adapter ligated to the 5' end (paired data: of the first read). The adapter and any preceding bases are trimmed. Partial matches at the 5'
end are allowed. If a '^' character is prepended ('anchoring'), the adapter is only found if it is a prefix of the read.
-b ADAPTER, --anywhere=ADAPTER Sequence of an adapter that may be ligated to the 5' or 3' end (paired data: of the first read). Both types of matches as described under -a and -g are allowed.
If the first base of the read is part of the match, the behavior is as with -g, otherwise as with -a. This option is mostly for rescuing failed library preparations
- do not use if you know which end your adapter was ligated to!
Fuzzy matching (error tolerance):
-e RATE, --error-rate=RATE Maximum allowed error rate as value between 0 and 1 (no. of errors divided by length of matching region). Default: 0.1 (=10%)
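A small sketch of what the error rate means in practice. As I understand the cutadapt documentation, the number of allowed errors is the error rate multiplied by the length of the matching region, rounded down; the function name below is made up.

```python
import math


def max_allowed_errors(match_length: int, error_rate: float = 0.1) -> int:
    """Allowed errors for a match of the given length (rounded down)."""
    return math.floor(match_length * error_rate)


if __name__ == "__main__":
    # 37 is the length of the adapters used in the examples above.
    for length in (9, 10, 37):
        print(length, max_allowed_errors(length))
```

So with the default 10%, a match shorter than 10 bases must be exact, and a full-length match of a 37-base adapter may contain up to 3 errors.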
For paired-end reads:
cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
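If you run many samples, it can be convenient to assemble the paired-end command in Python. This is only a sketch: the helper name, the sample file names and the choice of options are my own assumptions; the flags themselves (-a, -A, -O, -j, -o, -p) are the ones described on this page.

```python
import shlex


def build_paired_command(adapter1, adapter2, in1, in2, out1, out2,
                         cores=0, min_overlap=3):
    """Return the cutadapt argument list for one paired-end sample."""
    return [
        "cutadapt",
        "-a", adapter1,          # 3' adapter of read 1
        "-A", adapter2,          # 3' adapter of read 2
        "-O", str(min_overlap),  # minimum overlap between read and adapter
        "-j", str(cores),        # 0 = auto-detect the number of cores
        "-o", out1,
        "-p", out2,
        in1, in2,
    ]


if __name__ == "__main__":
    cmd = build_paired_command("ADAPT1", "ADAPT2",
                               "in1.fastq", "in2.fastq",
                               "out1.fastq", "out2.fastq")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```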
Options: -O MINLENGTH, --overlap=MINLENGTH
Require MINLENGTH overlap between read and adapter for an adapter to be found.
Default: 3
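-r: saves the part of the read that was cut off to a file (R2.p2.fq in the example above).
--info-file: writes the log file.
The stats file records the details of the adapter trimming. It is best to redirect it to a file, as I do, so you can check it later; by default it is printed to the screen.
By default cutadapt trims the adapter together with the sequence after it (for a 5' adapter, the sequence before it). So if you only want to cut out the adapter and keep the sequences both before and after it, you need to extract them from the log file.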
Processing the cutadapt log file:
The log file format looks like this.
It contains three types of record layouts.
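A hedged sketch of reading one line of the log. The field positions are my recollection of the cutadapt documentation for FASTQ input and should be checked against your own file: a line with a match is tab-separated as name, errors, start, end, sequence before, matched adapter, sequence after, adapter name, then the three quality strings; a line without a match has -1 in the second field followed by the whole read. The class and function names are made up, and the sketch does not cover every layout mentioned above.

```python
from typing import NamedTuple, Optional


class InfoRecord(NamedTuple):
    name: str
    errors: int             # -1 means no adapter was found
    before: str             # sequence left of the adapter match
    adapter: str            # adapter sequence as it appears in the read
    after: str              # sequence right of the adapter match
    adapter_name: Optional[str]


def parse_info_line(line: str) -> InfoRecord:
    """Parse one tab-separated line (assumed field layout, see above)."""
    fields = line.rstrip("\n").split("\t")
    errors = int(fields[1])
    if errors == -1:
        # No adapter: name, -1, read sequence, (qualities)
        return InfoRecord(fields[0], -1, fields[2], "", "", None)
    return InfoRecord(fields[0], errors, fields[4], fields[5],
                      fields[6], fields[7])


if __name__ == "__main__":
    # Made-up example line, not taken from a real run.
    line = "r1\t0\t4\t8\tACGT\tTTAG\tCCAA\tadapter\tIIII\tHHHH\tGGGG\n"
    print(parse_info_line(line))
```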
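Handy script 1:
Writes the read segments before and after the adapter, as recorded in the cutadapt log, to separate files for later use.
In other words, the segments on either side of the adapter go to the p1 and p2 files respectively.
Usage (I wrote the script myself and find it very handy):
python deal_cutadapt_log.py -l xxx.cutadapt.log -d /result/dir/
This produces
xxx.p1.fq and xxx.p2.fq
two files that hold the sequence before the adapter and the sequence after the adapter.
The -f option additionally lets you choose whether to keep or drop the reads in the log file that contain no adapter.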
usage: deal_cutadapt_log.py [-h] -l LOG_FILE [-d RESULT_DIR] [-f] [-v]
This is description
optional arguments:
-h, --help show this help message and exit
-l LOG_FILE, --log LOG_FILE
input read1 file
-d RESULT_DIR, --dir RESULT_DIR
input read2 file
-f, --flag means to contains -l flag in output.
-v, --version show program's version number and exit
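For readers who want to roll their own, here is a minimal sketch of the same idea. It is not deal_cutadapt_log.py; it assumes the tab-separated log layout from the cutadapt documentation for FASTQ input (sequence before, adapter, sequence after in fields 5 to 7 and their qualities in fields 9 to 11), and keep_unmatched is only my guess at what a flag like -f might do.

```python
import io


def fastq_record(name, seq, qual):
    """Format one four-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{qual}\n"


def split_info(lines, p1, p2, keep_unmatched=False):
    """Write the part before the adapter to p1 and the part after to p2."""
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 4:
            continue  # not a layout this sketch understands
        if f[1] == "-1":
            # Read without adapter: optionally keep it whole in p1.
            if keep_unmatched:
                p1.write(fastq_record(f[0], f[2], f[3]))
            continue
        if len(f) < 11:
            continue
        # Empty segments are skipped, so p1 and p2 may differ in length.
        if f[4]:
            p1.write(fastq_record(f[0], f[4], f[8]))
        if f[6]:
            p2.write(fastq_record(f[0], f[6], f[10]))


if __name__ == "__main__":
    # Made-up log lines; with real data pass open file handles instead.
    demo = ["r1\t0\t4\t8\tACGT\tTTAG\tCCAA\tadapter\tIIII\tHHHH\tGGGG\n",
            "r2\t-1\tACGTACGT\tIIIIIIII\n"]
    p1, p2 = io.StringIO(), io.StringIO()
    split_info(demo, p1, p2)
    print(p1.getvalue(), p2.getvalue(), sep="")
```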
Handy script 2:
Batch summary of cutadapt.stats files: the input is a directory, and the script collects the relevant numbers from every stats file in it.
python statistic_basic_info.py ./
sample     Total reads processed    Reads with adapters
G34E3L1    10,934,616               10,455,685 (95.6%)
Very handy.
Give this post a like and I will send you the scripts!
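In the meantime, a rough sketch of how such a summary could be collected. This is not statistic_basic_info.py; it assumes the stats files end in .cutadapt.stats (as in the examples above) and contain the report lines "Total reads processed:" and "Reads with adapters:", and the function names are made up.

```python
import re
from pathlib import Path

LABELS = ("Total reads processed", "Reads with adapters")


def summarize_stats(text):
    """Pull the two summary values out of one cutadapt report."""
    result = {}
    for label in LABELS:
        m = re.search(rf"^{re.escape(label)}:\s+(.+)$", text, re.MULTILINE)
        result[label] = m.group(1).strip() if m else "NA"
    return result


def summarize_dir(directory):
    """Return (sample, total, with_adapters) for each stats file found."""
    rows = []
    suffix = ".cutadapt.stats"  # assumed naming, as in the examples above
    for path in sorted(Path(directory).glob("*" + suffix)):
        info = summarize_stats(path.read_text())
        rows.append((path.name[:-len(suffix)],) + tuple(info[l] for l in LABELS))
    return rows


if __name__ == "__main__":
    print("\t".join(("sample",) + LABELS))
    for row in summarize_dir("./"):
        print("\t".join(row))
```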