從NCBI上下載轉(zhuǎn)錄組數(shù)據(jù),很多文章方法描述簡(jiǎn)單棍弄,無法判斷是否為鏈特異性數(shù)據(jù),導(dǎo)致在mapping和raw reads count時(shí)參數(shù)不知如何選擇。一直弄不懂鏈特異性和參數(shù)設(shè)置的同志們可以去看《鏈特異建庫那點(diǎn)事》披摄,講的非常清楚(雖然小白看完還是一知半解)。所以關(guān)于判斷轉(zhuǎn)錄組數(shù)據(jù)是否為鏈特異性勇凭,偷懶的小白找到了一個(gè)比較省事兒的方法疚膊,用RSeQC的infer_experiment.py工具。
官網(wǎng)鏈接:http://rseqc.sourceforge.net/#infer-experiment-py
官網(wǎng)用法如下:需要輸入自己數(shù)據(jù)的bam文件虾标,這個(gè)容易拿到寓盗。
但是另一個(gè)hg19.refseq.bed12文件,官網(wǎng)給出的參數(shù)解釋如下:但是Bedops(Bedopts)的gff2bed工具和RSeQC的舉例bed12格式卻大不一樣柳譬。
Bedops(Bedopts)的gff2bed轉(zhuǎn)化成bed文件的結(jié)果:
原鏈接:
https://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/gtf2bed.html
而RSeQC舉例的bed12格式的文件喳张,除了包含常用的chromosome, start, end, name, score, strand等信息外,最后一列包含了多個(gè)extron和intron等的位置美澳,用逗號(hào)隔開销部。
原鏈接:
http://dldcc-web.brc.bcm.edu/lilab/liguow/RSeQC/dat/sample.bed
小白通過搜索終于找到了獲取bed12文件Reference gene model in bed format的方法摸航,就是使用UCSC的gtfToGenePre工具,小白在上一篇筆記《Reference gene model in bed format》中已經(jīng)詳細(xì)講述舅桩,這里只列代碼:
#安裝gtfToGenePre
conda install -c bioconda ucsc-gtftogenepred
#準(zhǔn)備好基因組gtf文件酱虎,從gtf轉(zhuǎn)換為GenePred格式
gtfToGenePred -genePredExt -geneNameAsName2 genes.gtf gene.tmp
#從GenePred文件提取信息得到bed文件
awk '{print $2"\t"$4"\t"$5"\t"$1"\t0\t"$3"\t"$6"\t"$7"\t0\t"$8"\t"$9"\t"$10}' gene.tmp > genes_refseq.bed12
拿到bed12文件后,開始試試用infer_experiment.py判斷數(shù)據(jù)是否為鏈特異性擂涛。
infer_experiment.py -r genes_refseq.bed12 -i col.bam
結(jié)果:
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is SingleEnd Data
Fraction of reads failed to determine: 0.0050
Fraction of reads explained by "++,--": 0.4974
Fraction of reads explained by "+-,-+": 0.4976
這表明該數(shù)據(jù)為單端數(shù)據(jù)读串,以illumina standard建庫方式為代表的fr-unstranded的非鏈特異性轉(zhuǎn)錄組數(shù)據(jù)。
再來一個(gè)鏈特異性的雙端數(shù)據(jù)試試:
infer_experiment.py -r genes_refseq.bed12 -i m1_col_1_s.bam
結(jié)果:
Reading reference gene model genes_refseq.bed12 ... Done
Loading SAM/BAM file ... Total 200000 usable reads were sampled
This is PairEnd Data
Fraction of reads failed to determine: 0.0057
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0049
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9894
這表明該數(shù)據(jù)為雙端數(shù)據(jù)撒妈,以dUTP建庫方式為代表的RF (fr-firststrand)的鏈特異性轉(zhuǎn)錄組數(shù)據(jù)恢暖。
結(jié)果的詳細(xì)解讀可以去官網(wǎng)或參考鏈接:
http://rseqc.sourceforge.net/#infer-experiment-py
http://www.reibang.com/p/4987dce4d165
歡迎關(guān)注和討論哦,小白們一起學(xué)習(xí)生信~