可以先使用intron做參考基因組比對(duì)再取。看一下結(jié)果
如果用取sam的方式挫望,同一條序列只要看左右有沒有超出intron位置即可。
這樣可以使用索引狂窑,問題是媳板,比對(duì)算法中的索引是怎么起作用的?查看一下.fai文件泉哈。
改造igv
問題集中在取序列上:
1.取intron內(nèi)的
2.取配對(duì)的
是先寫出各部分的再組裝還是改原來取序列的代碼
conda安裝好軟件后需要先conda activate 進(jìn)環(huán)境后才能使用拷肌。即使計(jì)入了path 有的還是需要
gffread翻譯
gffread v0.12.1. Usage:
gffread <input_gff> [-g <genomic_seqs_fasta> | <dir>][-s <seq_info.fsize>]
[-o <outfile>] [-t <trackname>] [-r [[<strand>]<chr>:]<start>..<end> [-R]]
[-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
[-i <maxintron>] [--stream] [--bed] [--table <attrlist>] [--sort-by <ref.lst>]
Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
transcripts (exon or CDS) and more.
過濾轉(zhuǎn)化或者聚類GFF GTF BED記錄,提取轉(zhuǎn)錄本旨巷?序列等【主要功能】
By default (i.e. without -O) only transcripts are processed, discarding any
other non-transcript features. Default output is a simplified GFF3 with only
the basic attributes.
默認(rèn)(除非用-O)只處理轉(zhuǎn)錄本,丟棄非轉(zhuǎn)錄本信息添忘,默認(rèn)輸出GFF3.簡(jiǎn)化版
<input_gff> is a GFF file, use '-' for stdin
輸如的GFF文件采呐,用-作為
Options:
-i? discard transcripts having an intron larger than <maxintron>? ?丟棄有intron長(zhǎng)大于輸入?yún)?shù)的轉(zhuǎn)錄本
-l? discard transcripts shorter than <minlen> bases? 丟棄堿基數(shù)小于輸入?yún)?shù)的轉(zhuǎn)錄本
-r? only show transcripts overlapping coordinate range <start>..<end>
? ? ? (on chromosome/contig <chr>, strand <strand> if provided) 只展示重疊坐標(biāo)區(qū)的起止
-R? for -r option, discard all transcripts that are not fully
? ? ? contained within the given range? ?丟棄包括不完全在給定區(qū)域的轉(zhuǎn)錄本
-U? discard single-exon transcripts
-C? coding only: discard mRNAs that have no CDS features? 丟棄沒CDS特征的
--nc non-coding only: discard mRNAs that have CDS features??
--ignore-locus : discard locus features and attributes found in the input
-A? use the description field from <seq_info.fsize> and add it
? ? ? as the value for a 'descr' attribute to the GFF record? 使用來自??<seq_info.fsize>的描述字段添加到GFF文件
-s? <seq_info.fsize> is a tab-delimited file providing this info
? ? ? for each of the mapped sequences:
? ? ? <seq-name> <seq-length> <seq-description>
? ? ? (useful for -A option with mRNA/EST/protein mappings)
Sorting: (by default, chromosomes are kept in the order they were found)? 默認(rèn)不排序
--sort-alpha : chromosomes (reference sequences) are sorted alphabetically 按字母排序
--sort-by : sort the reference sequences by the order in which their
? ? ? names are given in the <refseq.lst> file
Misc options:
-F? preserve all GFF attributes (for non-exon features) 保留所有GFF屬性
--keep-exon-attrs : for -F option, do not attempt to reduce redundant
? ? ? exon/CDS attributes? 不要試圖減少冗余exon/CDS屬性
-G? do not keep exon attributes, move them to the transcript feature
? ? ? (for GFF3 output) 移除exon屬性
--keep-genes : in transcript-only mode (default), also preserve gene records
--keep-comments: for GFF3 input/output, try to preserve comments? 保留注釋
-O? process other non-transcript GFF records (by default non-transcript
? ? ? records are ignored)
-V? discard any mRNAs with CDS having in-frame stop codons (requires -g)
-H? for -V option, check and adjust the starting CDS phase
? ? ? if the original phase leads to a translation with an
? ? ? in-frame stop codon
-B? for -V option, single-exon transcripts are also checked on the
? ? ? opposite strand (requires -g)
-P? add transcript level GFF attributes about the coding status of each
? ? ? transcript, including partialness or in-frame stop codons (requires -g)
--add-hasCDS : add a "hasCDS" attribute with value "true" for transcripts
? ? ? that have CDS features
--adj-stop stop codon adjustment: enables -P and performs automatic
? ? ? adjustment of the CDS stop coordinate if premature or downstream
-N? discard multi-exon mRNAs that have any intron with a non-canonical
? ? ? splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)
-J? discard any mRNAs that either lack initial START codon
? ? ? or the terminal STOP codon, or have an in-frame stop codon
? ? ? (i.e. only print mRNAs with a complete CDS)
--no-pseudo: filter out records matching the 'pseudo' keyword
--in-bed: input should be parsed as BED format (automatic if the input
? ? ? ? ? filename ends with .bed*)
--in-tlf: input GFF-like one-line-per-transcript format without exon/CDS
? ? ? ? ? features (see --tlf option below); automatic if the input
? ? ? ? ? filename ends with .tlf)
--stream: fast processing of input GFF/BED transcripts as they are received
? ? ? ? ? ((no sorting, exons must be grouped by transcript in the input data)
Clustering:
-M/--merge : cluster the input transcripts into loci, discarding
? ? ? "duplicated" transcripts (those with the same exact introns
? ? ? and fully contained or equal boundaries)
-d <dupinfo> : for -M option, write duplication info to file <dupinfo>
--cluster-only: same as -M/--merge but without discarding any of the
? ? ? "duplicate" transcripts, only create "locus" features
-K? for -M option: also discard as redundant the shorter, fully contained
? ? ? transcripts (intron chains matching a part of the container)
-Q? for -M option, no longer require boundary containment when assessing
? ? ? redundancy (can be combined with -K); only introns have to match for
? ? ? multi-exon transcripts, and >=80% overlap for single-exon transcripts
-Y? for -M option, enforce -Q but also discard overlapping single-exon
? ? ? transcripts, even on the opposite strand (can be combined with -K)
Output options:
--force-exons: make sure that the lowest level GFF features are considered
? ? ? "exon" features? 確保最低的GFF特征被考慮
--gene2exon: for single-line genes not parenting any transcripts, add an
? ? ? exon feature spanning the entire gene (treat it as a transcript)
--t-adopt:? try to find a parent gene overlapping/containing a transcript
? ? ? that does not have any explicit gene Parent
-D? ? decode url encoded characters within attributes
-Z? ? merge very close exons into a single exon (when intron size<4)將非常近的外顯子合并為一個(gè)
-g? full path to a multi-fasta file with the genomic sequences
? ? ? for all input mappings, OR a directory with single-fasta files
? ? ? (one per genomic sequence, with file names matching sequence names)
-w? ? write a fasta file with spliced exons for each transcript? 對(duì)每個(gè)轉(zhuǎn)錄本寫一個(gè)fa文件,有剪切外顯子
--w-add <N> for the -w option, extract additional <N> bases
? ? ? both upstream and downstream of the transcript boundaries? 提取附加的上下游N個(gè)堿基
-x? ? write a fasta file with spliced CDS for each GFF transcript? ?
-y? ? write a protein fasta file with the translation of CDS for each record
-W? ? for -w and -x options, write in the FASTA defline all the exon
? ? ? coordinates projected onto the spliced sequence;
-S? ? for -y option, use '*' instead of '.' as stop codon translation
-L? ? Ensembl GTF to GFF3 conversion (implies -F; should be used with -m)
-m? ? <chr_replace> is a name mapping table for converting reference
? ? ? sequence names, having this 2-column format:
? ? ? <original_ref_ID> <new_ref_ID>
-t? ? use <trackname> in the 2nd column of each GFF/GTF output line
-o? ? write the records into <outfile> instead of stdout
-T? ? main output will be GTF instead of GFF3
--bed output records in BED format instead of default GFF3
--tlf output "transcript line format" which is like GFF
? ? ? but exons, CDS features and related data are stored as GFF
? ? ? attributes in the transcript feature line, like this:
? ? ? ? exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
? ? ? <exons> is a comma-delimited list of exon_start-exon_end coordinates;
? ? ? <CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>
--table output a simple tab delimited format instead of GFF, with columns
? ? ? having the values of GFF attributes given in <attrlist>; special
? ? ? pseudo-attributes (prefixed by @) are recognized:
? ? ? @id, @geneid, @chr, @start, @end, @strand, @numexons, @exons,
? ? ? @cds, @covlen, @cdslen
? ? ? If any of -w/-y/-x FASTA output files are enabled, the same fields
? ? ? (excluding @id) are appended to the definition line of corresponding
? ? ? FASTA records
-v,-E expose (warn about) duplicate transcript IDs and other potential
? ? ? problems with the given GFF/GTF records
GFFREAD看來達(dá)不到目的搁骑,從bedtools上看到有g(shù)etfast的subcommand可以達(dá)到目的
bedtools安裝:
之前不在conda的base環(huán)境中裝過斧吐,現(xiàn)在在base也能裝。經(jīng)驗(yàn)之一是:將conda的bin加入到path之后仲器,bin下的軟件都載入到path里了煤率,這個(gè)經(jīng)驗(yàn)非常重要。
參數(shù):
Usage: bedtools <subcommand> [options]
The bedtools sub-commands include:
[ Genome arithmetic ]
? ? intersect? ? Find overlapping intervals in various ways.
? ? window? ? ? ? Find overlapping intervals within a window around an interval.
? ? closest? ? ? Find the closest, potentially non-overlapping interval.
? ? coverage? ? ? Compute the coverage over defined intervals.
? ? map? ? ? ? ? Apply a function to a column for each overlapping interval.
? ? genomecov? ? Compute the coverage over an entire genome.
? ? merge? ? ? ? Combine overlapping/nearby intervals into a single interval.
? ? cluster? ? ? Cluster (but don't merge) overlapping/nearby intervals.
? ? complement? ? Extract intervals _not_ represented by an interval file.
? ? shift? ? ? ? Adjust the position of intervals.
? ? subtract? ? ? Remove intervals based on overlaps b/w two files.
? ? slop? ? ? ? ? Adjust the size of intervals.
? ? flank? ? ? ? Create new intervals from the flanks of existing intervals.
? ? sort? ? ? ? ? Order the intervals in a file.
? ? random? ? ? ? Generate random intervals in a genome.
? ? shuffle? ? ? Randomly redistribute intervals in a genome.
? ? sample? ? ? ? Sample random records from file using reservoir sampling.
? ? spacing? ? ? Report the gap lengths between intervals in a file.
? ? annotate? ? ? Annotate coverage of features from multiple files.
[ Multi-way file comparisons ]
? ? multiinter? ? Identifies common intervals among multiple interval files.
? ? unionbedg? ? Combines coverage intervals from multiple BEDGRAPH files.
[ Paired-end manipulation ]
? ? pairtobed? ? Find pairs that overlap intervals in various ways.
? ? pairtopair? ? Find pairs that overlap other pairs in various ways.
[ Format conversion ]
? ? bamtobed? ? ? Convert BAM alignments to BED (& other) formats.
? ? bedtobam? ? ? Convert intervals to BAM records.
? ? bamtofastq? ? Convert BAM records to FASTQ records.
? ? bedpetobam? ? Convert BEDPE intervals to BAM records.
? ? bed12tobed6? Breaks BED12 intervals into discrete BED6 intervals.
[ Fasta manipulation ]
? ? getfasta? ? ? Use intervals to extract sequences from a FASTA file.
? ? maskfasta? ? Use intervals to mask sequences from a FASTA file.
? ? nuc? ? ? ? ? Profile the nucleotide content of intervals in a FASTA file.
[ BAM focused tools ]
? ? multicov? ? ? Counts coverage from multiple BAMs at specific intervals.
? ? tag? ? ? ? ? Tag BAM alignments based on overlaps with interval files.
[ Statistical relationships ]
? ? jaccard? ? ? Calculate the Jaccard statistic b/w two sets of intervals.
? ? reldist? ? ? Calculate the distribution of relative distances b/w two files.
? ? fisher? ? ? ? Calculate Fisher statistic b/w two feature files.
[ Miscellaneous tools ]
? ? overlap? ? ? Computes the amount of overlap from two intervals.
? ? igv? ? ? ? ? Create an IGV snapshot batch script.
? ? links? ? ? ? Create a HTML page of links to UCSC locations.
? ? makewindows? Make interval "windows" across a genome.
? ? groupby? ? ? Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
? ? expand? ? ? ? Replicate lines based on lists of values in columns.
? ? split? ? ? ? Split a file into multiple files with equal records or base pairs.
? ? summary? ? ? Statistical summary of intervals in a file.
[ General help ]
? ? --help? ? ? ? Print this help menu.
? ? --version? ? What version of bedtools are you using?.
? ? --contact? ? Feature requests, bugs, mailing lists, etc.
出錯(cuò)
malformed畸形的
entry 入口 條目
GFF3有問題乏冀,轉(zhuǎn)bed試試
方法是:https://www.bbsmax.com/A/LPdo6Oy253/
第一步出錯(cuò):
這個(gè)報(bào)錯(cuò)比較明顯蝶糯,回頭檢查GFF3文件
在反向序列這里,star確實(shí)比end大辆沦,說明其默認(rèn)的star end是左右昼捍,只按一個(gè)順序。
看一下原文件
即使-也是左小右大肢扯。
看來需要改一下GFF3,暫時(shí)先用以前的intron.fa
改一下順序以后妒茬,流程應(yīng)該是:GFF3 直接取intron,fa,用bedtools
先用以前的:
沒索引
但即使跑出來蔚晨,也只能做為驗(yàn)證乍钻,正式的步驟應(yīng)該是sam按intron取,sam取互補(bǔ)的另一條
跑出來之后效果并不好
不但特別大铭腕,而且文件內(nèi):
還是不用intron做參考基因組比較好银择。
查找算法:http://www.reibang.com/p/95d224c4d13e
常見查找算法:
[1. 順序查找]
[2. 二分查找]??[3. 插值查找]??[4. 斐波那契查找]
[5. 樹表查找]
[6. 分塊查找]
[7. 哈希查找]
說是七種,其實(shí)二分查找谨履、插值查找以及斐波那契查找都可以歸為一類——插值查找欢摄。插值查找和斐波那契查找是在二分查找的基礎(chǔ)上的優(yōu)化查找算法。
之前想用二分查找笋粟,實(shí)現(xiàn)雖然不難怀挠,但是想實(shí)現(xiàn)以下更復(fù)雜的算法析蝴,因此想使用分塊(索引)、哈希的查找方式绿淋。
分塊查找
分塊查找又稱索引順序查找闷畸,它是順序查找的一種改進(jìn)方法。
算法思想:將n個(gè)數(shù)據(jù)元素"按塊有序"劃分為m塊(m ≤ n)吞滞。每一塊中的結(jié)點(diǎn)不必有序佑菩,但塊與塊之間必須"按塊有序";即第1塊中任一元素的關(guān)鍵字都必須小于第2塊中任一元素的關(guān)鍵字裁赠;而第2塊中任一元素又都必須小于第3塊中的任一元素殿漠,……
算法流程:
- step1 先選取各塊中的最大關(guān)鍵字構(gòu)成一個(gè)索引表;
- step2 查找分兩個(gè)部分:先對(duì)索引表進(jìn)行二分查找或順序查找佩捞,以確定待查記錄在哪一塊中绞幌;然后,在已確定的塊中用順序法進(jìn)行查找一忱。
建索引表的思想有助于理解比對(duì)算法
哈希查找
哈希表是一個(gè)在時(shí)間和空間上做出權(quán)衡的經(jīng)典例子莲蜘。如果沒有內(nèi)存限制,那么可以直接將鍵作為數(shù)組的索引帘营。那么所有的查找時(shí)間復(fù)雜度為O(1)票渠;如果沒有時(shí)間限制,那么我們可以使用無序數(shù)組并進(jìn)行順序查找芬迄,這樣只需要很少的內(nèi)存问顷。哈希表使用了適度的時(shí)間和空間來在這兩個(gè)極端之間找到了平衡。只需要調(diào)整哈希函數(shù)算法即可在時(shí)間和空間上做出取舍薯鼠。
基因組數(shù)據(jù)可視化工具
格式:
chr1? ? ? ? ? ? ? ? ? ? ? ? ? ? start? ? ?end? ? ? ? ? ? ? ?strand
只需要chrom chromstart chromend就夠用择诈。
一個(gè)問題是哈希只能找數(shù)字不能找范圍,那么需要將范圍轉(zhuǎn)化為一個(gè)數(shù)字出皇。用中值的辦法也可行羞芍,但是有點(diǎn)麻煩。設(shè)計(jì)了一個(gè)樹結(jié)構(gòu)郊艘,想用改變后的樹查找方式查找荷科。
左右子樹是intron 的范圍,在第一個(gè)不大于處停下纱注。在比較star和左子樹大小即可畏浆。
但這樣有點(diǎn)復(fù)雜,只是示意圖狞贱,實(shí)際做需要對(duì)每個(gè)染色體單獨(dú)建二叉樹刻获,再先找大的(先找小的也一樣),然后將小的與同一節(jié)點(diǎn)下的左子節(jié)點(diǎn)比對(duì)瞎嬉。
根據(jù)需求蝎毡,可以簡(jiǎn)化建樹過程厚柳。根據(jù)有序數(shù)據(jù)創(chuàng)建二叉樹,將有序數(shù)組放入二叉樹沐兵。https://blog.csdn.net/weixin_42813521/article/details/107648974
先建立别垮,再遍歷。采用深度優(yōu)先的遍歷方式扎谎。
但是并不是順序的碳想,而是重疊的,這樣建樹有困難毁靶。感覺不應(yīng)該重疊
重新組裝胧奔。看了轉(zhuǎn)錄組的重新組裝感覺第二部分要做的事和重頭組裝有一定相似性预吆。
想統(tǒng)計(jì)重疊區(qū)域葡盗,檢查sort過的bam文件:
確實(shí)是按照起始位點(diǎn)的順序排序的,read也不再是一對(duì)一對(duì)的啡浊。也同igv上看到的----有起始位點(diǎn)相同的區(qū)域。
相同序列胶背,相同起點(diǎn)和比對(duì)情況巷嚣,不同測(cè)序。
選取最長(zhǎng)的钳吟?
最上面是igv自己拼接的廷粒?
有點(diǎn)奇怪 將中部沒有的也拼接上
?
5' 對(duì)齊
3‘有空隙
方向無關(guān)
intron問題:
1的同名位置不同? 12的條帶不同
合起來红且?
同一段位置上有很多坝茎,以哪個(gè)為標(biāo)準(zhǔn)?
重新組裝問題
取在intron內(nèi)的序列
1.對(duì)gff3暇番,有去重和以什么為主的問題嗤放。一個(gè)位置當(dāng)然要么只能是intron,要么不是壁酬。先grep出所有正鏈次酌,做一個(gè)正鏈gff3文件。
命令:
grep + Arabidopsis_thaliana.TAIR10.47.intron.gff3 > Arabidopsis_thaliana.TAIR10.47.only+intron.gff3
結(jié)果是沒sort過的舆乔,不是從小到大岳服。
2.去重+以長(zhǎng)的為intron獲取位置
因此先不uniq,而是先sort希俩,看一下重復(fù)和igv中是否一樣多吊宋。且有的重復(fù)同名,有的不同名
命令:
sort Arabidopsis_thaliana.TAIR10.47.only+intron.gff3 -t " " -k 4 -n > TAIR10.47.only+intron_sort.gff3
要點(diǎn):-t后面想輸入tab需要ctrl+v 再在“”內(nèi)按tab? ?-n告知以數(shù)字排序颜武,不然會(huì)出錯(cuò)璃搜。
結(jié)果:
parent不同拖吼,但intron位置相同,命名不同腺劣。名不同主要可能是因?yàn)閜arent不同绿贞。而確實(shí)在一個(gè)gene下的intron沒有重復(fù)位置的。
問題應(yīng)該是:不同gene下的intron有相同位置橘原。
情況比較普遍籍铁。
提取start end chr,按chr趾断,start拒名,end格式。
命令:
cut -d " " -f 1,4,5 Arabidopsis_thaliana.TAIR10.47.only+intron.gff3 | uniq > TAIR10.47.only+intron_uniq.gff3
chr1 chr2一變就重新起頭
有兩個(gè)問題:
1.sam的chr變是否重排位置
2.0-base和1-base問題
3.uniq出的結(jié)果里芋酌,有沒有start 一樣end不一樣的增显?
確實(shí)有。如果我只留最大的脐帝,那么反向排序再uniq
命令:
sort TAIR10.47.only+intron_uniq.gff3 -t " " -k 2 -n -r > TAIR10.47.only+intron_uniq_reverse.gff3
結(jié)果:
改一下同云,先cut和sort reverse
再按chr5 -> chr1 pt mt 分成七個(gè)文件,再對(duì)每個(gè)文件sort?
不對(duì) 先cut出七個(gè)文件堵腹。
先egrep出7個(gè)文件炸站,再對(duì)每個(gè)文件sort? -n -k 2
因?yàn)槭菑淖笸业模哉驎r(shí)疚顷,開頭小結(jié)尾一樣的應(yīng)該被開頭大的覆蓋旱易,而倒序可以將開頭一樣結(jié)尾小的去掉。
總之腿堤,需要uniq一下開頭和結(jié)尾阀坏。都應(yīng)該是倒序以去掉小的。
先對(duì)倒序的文件uniq笆檀,因?yàn)橄瘸霈F(xiàn)的是end大的忌堂,所以同start但是end小的被去除。同時(shí)保持倒序不變酗洒,再對(duì)end一致的同樣從操作诗茎,以去掉同endstart小的褥赊,因?yàn)槲募裺tart排序,start大的倒序后在前。
首先正序驗(yàn)證一下鸭限,再同樣逆序處理
對(duì)比證明其如實(shí)保留了大的star谦炬。
至此昔榴,取出了chr1的intron start 和end
步驟總結(jié):
1.先cut 145 列 并根據(jù)12列反向排序
2.grep出chr1的
3.對(duì)反向文件逞怨,同start保留最大end
4.保持反向,對(duì)同end取最大start
由此取出范圍內(nèi)最大的intron起止位置。下一步放入樹中阔籽,比對(duì)在intron內(nèi)的序列流妻。