日常瞎掰
??工欲善其事必先利其器盅弛,好用的工具對做事來說很重要,可以讓你做起事來收到事半功倍的效果真友。前段時間分析數(shù)據(jù)時黄痪,需要給intervals
做基因注釋,需要操作gtf文件盔然,最后發(fā)現(xiàn)一款非常好用的python包 — pyranges
桅打。該包可以很輕松地將bed
、gtf
愈案、gff3
等格式文件讀取為PyRanges
對象挺尾,該對象有點類似潘大師的DataFrame
,熟悉潘大師的同學(xué)應(yīng)該能體會到DataFrame
操作數(shù)據(jù)的快樂站绪。同樣潦嘶,皮軟杰斯內(nèi)置操作intervals
的方法也絕對能給你帶來無語倫比的暢快。廢話不多說崇众,下面來感受一下使用皮軟杰斯做注釋基因有多方便掂僵。
讀取文件
??read_bed
方法可以用來讀取bed
格式的文件,read_gtf
顷歌、read_gff3
分別用來讀取gtf
锰蓬、gff3
文件:
import pyranges as pr
bed = pr.read_bed('data.bed')
bed
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome | Start | End | Name | Score | Strand |
| (category) | (int32) | (int32) | (object) | (int64) | (category) |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1 | 212609534 | 212609559 | U0 | 0 | + |
| chr1 | 169887529 | 169887554 | U0 | 0 | + |
| chr1 | 216711011 | 216711036 | U0 | 0 | + |
| chr1 | 144227079 | 144227104 | U0 | 0 | + |
| ... | ... | ... | ... | ... | ... |
| chrY | 15224235 | 15224260 | U0 | 0 | - |
| chrY | 13517892 | 13517917 | U0 | 0 | - |
| chrY | 8010951 | 8010976 | U0 | 0 | - |
| chrY | 7405376 | 7405401 | U0 | 0 | - |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
gtf = pr.read_gtf('gencode.vM24.annotation.gtf')
gtf
+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+
| Chromosome | Source | Feature | Start | End | Score | Strand | Frame | gene_id | gene_type | +15 |
| (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | (object) | (object) | ... |
|--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------|
| chr1 | HAVANA | gene | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... |
| chr1 | HAVANA | transcript | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... |
| chr1 | HAVANA | exon | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... |
| chr1 | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ENSMUSG00000064842.1 | snRNA | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| chrY | ENSEMBL | CDS | 90838871 | 90839177 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... |
| chrY | ENSEMBL | start_codon | 90839174 | 90839177 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... |
| chrY | ENSEMBL | stop_codon | 90838868 | 90838871 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... |
| chrY | ENSEMBL | UTR | 90838868 | 90838871 | . | - | . | ENSMUSG00000096850.1 | protein_coding | ... |
+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+
Stranded PyRanges object has 1,870,769 rows and 25 columns from 22 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)
??上面說過PyRanges
對象類似潘大師的DataFrame
,所以可以用類似的方法來過濾數(shù)據(jù)眯漩,如下面只保留gtf
里面Feature
屬性為gene的行:
gene = gtf[gtf.Feature == 'gene']
gene
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+
| Chromosome | Source | Feature | Start | End | Score | Strand | Frame | gene_id | gene_type | +15 |
| (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | (object) | (object) | ... |
|--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------|
| chr1 | HAVANA | gene | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... |
| chr1 | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ENSMUSG00000064842.1 | snRNA | ... |
| chr1 | HAVANA | gene | 3252756 | 3253236 | . | + | . | ENSMUSG00000102851.1 | processed_pseudogene | ... |
| chr1 | HAVANA | gene | 3466586 | 3513553 | . | + | . | ENSMUSG00000089699.1 | lncRNA | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| chrY | HAVANA | gene | 90603500 | 90605864 | . | - | . | ENSMUSG00000099619.6 | lncRNA | ... |
| chrY | HAVANA | gene | 90665345 | 90667625 | . | - | . | ENSMUSG00000099399.6 | lncRNA | ... |
| chrY | HAVANA | gene | 90752426 | 90755467 | . | - | . | ENSMUSG00000095366.2 | lncRNA | ... |
| chrY | ENSEMBL | gene | 90838868 | 90839177 | . | - | . | ENSMUSG00000096850.1 | protein_coding | ... |
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+
Stranded PyRanges object has 55,385 rows and 25 columns from 22 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)
基因注釋
??皮軟杰斯里面有nearest
芹扭、k_nearest
兩種方法可以用。通過函數(shù)名我們就可以猜到這兩個函數(shù)都可以用來查找interval1
集合里每一個區(qū)間在interval2
集合里面距離最近的區(qū)間赦抖,默認(rèn)是返回有交集的區(qū)間舱卡,而k_nearest
可以指定返回的區(qū)間個數(shù)。通過代碼看一下區(qū)別:
bed_test = pr.PyRanges(chromosomes="chr1", strands="+", starts=[3074300], ends=[3074322])
bed
+--------------+-----------+-----------+--------------+
| Chromosome | Start | End | Strand |
| (category) | (int32) | (int32) | (category) |
|--------------+-----------+-----------+--------------|
| chr1 | 3074300 | 3074322 | + |
+--------------+-----------+-----------+--------------+
bed_test.nearest(gtf,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
| Chromosome | Start | End | Strand | Source | Feature | Start_gtf | End_gtf | Score | Strand_gtf | Frame | +18 |
| (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... |
|--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|
| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3073252 | 3074322 | . | + | . | ... |
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
bed_test.nearest(gene,overlap=False,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
| Chromosome | Start | End | Strand | Source | Feature | Start_gtf | End_gtf | Score | Strand_gtf | Frame | +18 |
| (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... |
|--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|
| chr1 | 3074300 | 3074322 | + | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ... |
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
bed_test.k_nearest(gene,k=5,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+
| Chromosome | Start | End | Strand | Source | Feature | Start_b | End_b | Score | Strand_b | Frame | +18 |
| (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... |
|--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------|
| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3073252 | 3074322 | . | + | . | ... |
| chr1 | 3074300 | 3074322 | + | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ... |
| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3205900 | 3671498 | . | - | . | ... |
| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3252756 | 3253236 | . | + | . | ... |
| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3365730 | 3368549 | . | - | . | ... |
+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+
??從上面的代碼可知有無overlap=False
參數(shù)的區(qū)別队萤,以及nearest
和k_nearest
函數(shù)的區(qū)別轮锥。當(dāng)然,還有一些比較實用的參數(shù)如strandedness
:"same"要尔,指定鏈?zhǔn)欠裣嗤?code>how:"upstream"舍杜、 "downstream"分別指定選取最近區(qū)間的方位,nb_cpu
:指定使用的cpu數(shù)赵辕。除此之外既绩,k_nearest
還有一個參數(shù)ties
可以指定有距離相同的區(qū)間時如何返回,可選值為"first"还惠、"last"饲握、"different"。
??suffix
參數(shù)指定當(dāng)兩個集合有相同列名時,后一個集合列名會添加的字段救欧,默認(rèn)是'_b'歪今。不過,好像k_nearest
函數(shù)有BUG這個參數(shù)無效颜矿,但并不影響結(jié)果可以忽略寄猩。
??從上面可以看出k_nearest
函數(shù)功能上略勝一籌,注釋基因時用此函數(shù)相對更自由一些:
anno = bed.k_nearest(gene, ties='different')
anno = anno[['Chromosome', 'Start', 'End, 'Strand', 'gene_id']]
anno
+--------------+-----------+-----------+--------------+----------------------+
| Chromosome | Start | End | Strand | gene_id |
| (category) | (int32) | (int32) | (category) | (object) |
|--------------+-----------+-----------+--------------+----------------------|
| chr1 | 212609534 | 212609559 | + | ENSMUSG00000102307.1 |
| chr1 | 169887529 | 169887554 | + | ENSMUSG00000103110.1 |
| chr1 | 216711011 | 216711036 | + | ENSMUSG00000102307.1 |
| chr1 | 144227079 | 144227104 | + | ENSMUSG00000100389.1 |
| ... | ... | ... | ... | ... |
| chrY | 15224235 | 15224260 | - | ENSMUSG00000102835.5 |
| chrY | 13517892 | 13517917 | - | ENSMUSG00000102174.1 |
| chrY | 8010951 | 8010976 | - | ENSMUSG00000099861.1 |
| chrY | 7405376 | 7405401 | - | ENSMUSG00000118447.1 |
+--------------+-----------+-----------+--------------+----------------------+
anno.to_csv("data_anno.txt", sep="\t", header=True)
結(jié)束語
??pyranges
包的功能還是挺多的骑疆,不僅包含區(qū)間的交集田篇、差集,還可以進行一些統(tǒng)計運算如Jaccard
箍铭、Fisher
等泊柬。雖然,目前有不少現(xiàn)成的軟件如homer
诈火、chipseeker
可以做基因注釋兽赁,很多時候我們可以直接使用這些軟件即可,但pyranges
還是值得學(xué)習(xí)收藏一下冷守,也許做個性化數(shù)據(jù)處理的時候使用它會來得更為方便些刀崖。
往期回顧
scanpy踩坑實錄
差異基因密度分布
R繪圖配色總結(jié)
saddleplot | A/B compartments
雙曲線火山圖一鍵拿捏