bed基因注釋

日常瞎掰

??工欲善其事必先利其器盅弛,好用的工具對做事來說很重要,可以讓你做起事來收到事半功倍的效果真友。前段時間分析數(shù)據(jù)時黄痪,需要給intervals做基因注釋,需要操作gtf文件盔然,最后發(fā)現(xiàn)一款非常好用的python包 — pyranges桅打。該包可以很輕松地將bedgtf愈案、gff3等格式文件讀取為PyRanges對象挺尾,該對象有點類似潘大師的DataFrame,熟悉潘大師的同學(xué)應(yīng)該能體會到DataFrame操作數(shù)據(jù)的快樂站绪。同樣潦嘶,皮軟杰斯內(nèi)置操作intervals的方法也絕對能給你帶來無語倫比的暢快。廢話不多說崇众,下面來感受一下使用皮軟杰斯做注釋基因有多方便掂僵。

讀取文件

??read_bed方法可以用來讀取bed格式的文件,read_gtf顷歌、read_gff3分別用來讀取gtf锰蓬、gff3文件:

import pyranges as pr

bed = pr.read_bed('data.bed')
bed
+--------------+-----------+-----------+------------+-----------+--------------+
| Chromosome   | Start     | End       | Name       | Score     | Strand       |
| (category)   | (int32)   | (int32)   | (object)   | (int64)   | (category)   |
|--------------+-----------+-----------+------------+-----------+--------------|
| chr1         | 212609534 | 212609559 | U0         | 0         | +            |
| chr1         | 169887529 | 169887554 | U0         | 0         | +            |
| chr1         | 216711011 | 216711036 | U0         | 0         | +            |
| chr1         | 144227079 | 144227104 | U0         | 0         | +            |
| ...          | ...       | ...       | ...        | ...       | ...          |
| chrY         | 15224235  | 15224260  | U0         | 0         | -            |
| chrY         | 13517892  | 13517917  | U0         | 0         | -            |
| chrY         | 8010951   | 8010976   | U0         | 0         | -            |
| chrY         | 7405376   | 7405401   | U0         | 0         | -            |
+--------------+-----------+-----------+------------+-----------+--------------+
Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

gtf = pr.read_gtf('gencode.vM24.annotation.gtf')
gtf
+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+
| Chromosome   | Source     | Feature     | Start     | End       | Score      | Strand       | Frame      | gene_id              | gene_type      | +15   |
| (category)   | (object)   | (object)    | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)             | (object)       | ...   |
|--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------|
| chr1         | HAVANA     | gene        | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC            | ...   |
| chr1         | HAVANA     | transcript  | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC            | ...   |
| chr1         | HAVANA     | exon        | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC            | ...   |
| chr1         | ENSEMBL    | gene        | 3102015   | 3102125   | .          | +            | .          | ENSMUSG00000064842.1 | snRNA          | ...   |
| ...          | ...        | ...         | ...       | ...       | ...        | ...          | ...        | ...                  | ...            | ...   |
| chrY         | ENSEMBL    | CDS         | 90838871  | 90839177  | .          | -            | 0          | ENSMUSG00000096850.1 | protein_coding | ...   |
| chrY         | ENSEMBL    | start_codon | 90839174  | 90839177  | .          | -            | 0          | ENSMUSG00000096850.1 | protein_coding | ...   |
| chrY         | ENSEMBL    | stop_codon  | 90838868  | 90838871  | .          | -            | 0          | ENSMUSG00000096850.1 | protein_coding | ...   |
| chrY         | ENSEMBL    | UTR         | 90838868  | 90838871  | .          | -            | .          | ENSMUSG00000096850.1 | protein_coding | ...   |
+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+
Stranded PyRanges object has 1,870,769 rows and 25 columns from 22 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)

??上面說過PyRanges對象類似潘大師的DataFrame,所以可以用類似的方法來過濾數(shù)據(jù)眯漩,如下面只保留gtf里面Feature屬性為gene的行:

gene = gtf[gtf.Feature == 'gene']
gene
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+
| Chromosome   | Source     | Feature    | Start     | End       | Score      | Strand       | Frame      | gene_id              | gene_type            | +15   |
| (category)   | (object)   | (object)   | (int32)   | (int32)   | (object)   | (category)   | (object)   | (object)             | (object)             | ...   |
|--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------|
| chr1         | HAVANA     | gene       | 3073252   | 3074322   | .          | +            | .          | ENSMUSG00000102693.1 | TEC                  | ...   |
| chr1         | ENSEMBL    | gene       | 3102015   | 3102125   | .          | +            | .          | ENSMUSG00000064842.1 | snRNA                | ...   |
| chr1         | HAVANA     | gene       | 3252756   | 3253236   | .          | +            | .          | ENSMUSG00000102851.1 | processed_pseudogene | ...   |
| chr1         | HAVANA     | gene       | 3466586   | 3513553   | .          | +            | .          | ENSMUSG00000089699.1 | lncRNA               | ...   |
| ...          | ...        | ...        | ...       | ...       | ...        | ...          | ...        | ...                  | ...                  | ...   |
| chrY         | HAVANA     | gene       | 90603500  | 90605864  | .          | -            | .          | ENSMUSG00000099619.6 | lncRNA               | ...   |
| chrY         | HAVANA     | gene       | 90665345  | 90667625  | .          | -            | .          | ENSMUSG00000099399.6 | lncRNA               | ...   |
| chrY         | HAVANA     | gene       | 90752426  | 90755467  | .          | -            | .          | ENSMUSG00000095366.2 | lncRNA               | ...   |
| chrY         | ENSEMBL    | gene       | 90838868  | 90839177  | .          | -            | .          | ENSMUSG00000096850.1 | protein_coding       | ...   |
+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+
Stranded PyRanges object has 55,385 rows and 25 columns from 22 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)

基因注釋

??皮軟杰斯里面有nearest芹扭、k_nearest兩種方法可以用。通過函數(shù)名我們就可以猜到這兩個函數(shù)都可以用來查找interval1集合里每一個區(qū)間在interval2集合里面距離最近的區(qū)間赦抖,默認(rèn)是返回有交集的區(qū)間舱卡,而k_nearest可以指定返回的區(qū)間個數(shù)。通過代碼看一下區(qū)別:

bed_test = pr.PyRanges(chromosomes="chr1", strands="+", starts=[3074300], ends=[3074322])
bed
+--------------+-----------+-----------+--------------+
| Chromosome   |     Start |       End | Strand       |
| (category)   |   (int32) |   (int32) | (category)   |
|--------------+-----------+-----------+--------------|
| chr1         |   3074300 |   3074322 | +            |
+--------------+-----------+-----------+--------------+

bed_test.nearest(gtf,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
| Chromosome   |     Start |       End | Strand       | Source     | Feature    |   Start_gtf |   End_gtf | Score      | Strand_gtf   | Frame      | +18   |
| (category)   |   (int32) |   (int32) | (category)   | (object)   | (object)   |     (int32) |   (int32) | (object)   | (category)   | (object)   | ...   |
|--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |     3073252 |   3074322 | .          | +            | .          | ...   |
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+

bed_test.nearest(gene,overlap=False,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+
| Chromosome   |     Start |       End | Strand       | Source     | Feature    |   Start_gtf |   End_gtf | Score      | Strand_gtf   | Frame      | +18   |
| (category)   |   (int32) |   (int32) | (category)   | (object)   | (object)   |     (int32) |   (int32) | (object)   | (category)   | (object)   | ...   |
|--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|
| chr1         |   3074300 |   3074322 | +            | ENSEMBL    | gene       |     3102015 |   3102125 | .          | +            | .          | ...   |
+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+

bed_test.k_nearest(gene,k=5,suffix="_gtf")
+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+
| Chromosome   |     Start |       End | Strand       | Source     | Feature    |   Start_b |     End_b | Score      | Strand_b     | Frame      | +18   |
| (category)   |   (int32) |   (int32) | (category)   | (object)   | (object)   |   (int32) |   (int32) | (object)   | (category)   | (object)   | ...   |
|--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------|
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3073252 |   3074322 | .          | +            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | ENSEMBL    | gene       |   3102015 |   3102125 | .          | +            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3205900 |   3671498 | .          | -            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3252756 |   3253236 | .          | +            | .          | ...   |
| chr1         |   3074300 |   3074322 | +            | HAVANA     | gene       |   3365730 |   3368549 | .          | -            | .          | ...   |
+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+

??從上面的代碼可知有無overlap=False參數(shù)的區(qū)別队萤,以及nearestk_nearest函數(shù)的區(qū)別轮锥。當(dāng)然,還有一些比較實用的參數(shù)如strandedness:"same"要尔,指定鏈?zhǔn)欠裣嗤?code>how:"upstream"舍杜、 "downstream"分別指定選取最近區(qū)間的方位,nb_cpu:指定使用的cpu數(shù)赵辕。除此之外既绩,k_nearest還有一個參數(shù)ties可以指定有距離相同的區(qū)間時如何返回,可選值為"first"还惠、"last"饲握、"different"。
??suffix參數(shù)指定當(dāng)兩個集合有相同列名時,后一個集合列名會添加的字段救欧,默認(rèn)是'_b'歪今。不過,好像k_nearest函數(shù)有BUG這個參數(shù)無效颜矿,但并不影響結(jié)果可以忽略寄猩。
??從上面可以看出k_nearest函數(shù)功能上略勝一籌,注釋基因時用此函數(shù)相對更自由一些:

anno = bed.k_nearest(gene, ties='different')
anno = anno[['Chromosome', 'Start', 'End, 'Strand', 'gene_id']]
anno
+--------------+-----------+-----------+--------------+----------------------+
| Chromosome   | Start     | End       | Strand       | gene_id              |
| (category)   | (int32)   | (int32)   | (category)   | (object)             |
|--------------+-----------+-----------+--------------+----------------------|
| chr1         | 212609534 | 212609559 | +            | ENSMUSG00000102307.1 |
| chr1         | 169887529 | 169887554 | +            | ENSMUSG00000103110.1 |
| chr1         | 216711011 | 216711036 | +            | ENSMUSG00000102307.1 |
| chr1         | 144227079 | 144227104 | +            | ENSMUSG00000100389.1 |
| ...          | ...       | ...       | ...          | ...                  |
| chrY         | 15224235  | 15224260  | -            | ENSMUSG00000102835.5 |
| chrY         | 13517892  | 13517917  | -            | ENSMUSG00000102174.1 |
| chrY         | 8010951   | 8010976   | -            | ENSMUSG00000099861.1 |
| chrY         | 7405376   | 7405401   | -            | ENSMUSG00000118447.1 |
+--------------+-----------+-----------+--------------+----------------------+

anno.to_csv("data_anno.txt", sep="\t", header=True)

結(jié)束語

??pyranges包的功能還是挺多的骑疆,不僅包含區(qū)間的交集田篇、差集,還可以進行一些統(tǒng)計運算如Jaccard箍铭、Fisher等泊柬。雖然,目前有不少現(xiàn)成的軟件如homer诈火、chipseeker可以做基因注釋兽赁,很多時候我們可以直接使用這些軟件即可,但pyranges還是值得學(xué)習(xí)收藏一下冷守,也許做個性化數(shù)據(jù)處理的時候使用它會來得更為方便些刀崖。


往期回顧

scanpy踩坑實錄
差異基因密度分布
R繪圖配色總結(jié)
saddleplot | A/B compartments
雙曲線火山圖一鍵拿捏

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末,一起剝皮案震驚了整個濱河市拍摇,隨后出現(xiàn)的幾起案子亮钦,更是在濱河造成了極大的恐慌,老刑警劉巖充活,帶你破解...
    沈念sama閱讀 216,324評論 6 498
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件蜂莉,死亡現(xiàn)場離奇詭異,居然都是意外死亡混卵,警方通過查閱死者的電腦和手機映穗,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,356評論 3 392
  • 文/潘曉璐 我一進店門,熙熙樓的掌柜王于貴愁眉苦臉地迎上來幕随,“玉大人蚁滋,你說我怎么就攤上這事『狭辏” “怎么了枢赔?”我有些...
    開封第一講書人閱讀 162,328評論 0 353
  • 文/不壞的土叔 我叫張陵,是天一觀的道長拥知。 經(jīng)常有香客問我,道長碎赢,這世上最難降的妖魔是什么低剔? 我笑而不...
    開封第一講書人閱讀 58,147評論 1 292
  • 正文 為了忘掉前任,我火速辦了婚禮,結(jié)果婚禮上襟齿,老公的妹妹穿的比我還像新娘姻锁。我一直安慰自己,他們只是感情好猜欺,可當(dāng)我...
    茶點故事閱讀 67,160評論 6 388
  • 文/花漫 我一把揭開白布位隶。 她就那樣靜靜地躺著,像睡著了一般开皿。 火紅的嫁衣襯著肌膚如雪涧黄。 梳的紋絲不亂的頭發(fā)上,一...
    開封第一講書人閱讀 51,115評論 1 296
  • 那天赋荆,我揣著相機與錄音笋妥,去河邊找鬼。 笑死窄潭,一個胖子當(dāng)著我的面吹牛春宣,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播嫉你,決...
    沈念sama閱讀 40,025評論 3 417
  • 文/蒼蘭香墨 我猛地睜開眼月帝,長吁一口氣:“原來是場噩夢啊……” “哼!你這毒婦竟也來了幽污?” 一聲冷哼從身側(cè)響起嫁赏,我...
    開封第一講書人閱讀 38,867評論 0 274
  • 序言:老撾萬榮一對情侶失蹤,失蹤者是張志新(化名)和其女友劉穎油挥,沒想到半個月后潦蝇,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 45,307評論 1 310
  • 正文 獨居荒郊野嶺守林人離奇死亡深寥,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 37,528評論 2 332
  • 正文 我和宋清朗相戀三年攘乒,在試婚紗的時候發(fā)現(xiàn)自己被綠了。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片惋鹅。...
    茶點故事閱讀 39,688評論 1 348
  • 序言:一個原本活蹦亂跳的男人離奇死亡则酝,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出闰集,到底是詐尸還是另有隱情沽讹,我是刑警寧澤,帶...
    沈念sama閱讀 35,409評論 5 343
  • 正文 年R本政府宣布武鲁,位于F島的核電站爽雄,受9級特大地震影響,放射性物質(zhì)發(fā)生泄漏沐鼠。R本人自食惡果不足惜挚瘟,卻給世界環(huán)境...
    茶點故事閱讀 41,001評論 3 325
  • 文/蒙蒙 一叹谁、第九天 我趴在偏房一處隱蔽的房頂上張望。 院中可真熱鬧乘盖,春花似錦焰檩、人聲如沸。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,657評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽。三九已至穿扳,卻和暖如春衩侥,著一層夾襖步出監(jiān)牢的瞬間,已是汗流浹背纵揍。 一陣腳步聲響...
    開封第一講書人閱讀 32,811評論 1 268
  • 我被黑心中介騙來泰國打工顿乒, 沒想到剛下飛機就差點兒被人妖公主榨干…… 1. 我叫王不留,地道東北人泽谨。 一個月前我還...
    沈念sama閱讀 47,685評論 2 368
  • 正文 我出身青樓璧榄,卻偏偏與公主長得像,于是被迫代替她去往敵國和親吧雹。 傳聞我的和親對象是個殘疾皇子骨杂,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 44,573評論 2 353

推薦閱讀更多精彩內(nèi)容