【技能】使用Python提取基因結(jié)構(gòu)bed文件

為了方便提取基因結(jié)構(gòu)搁宾，寫了一個(gè)簡(jiǎn)單的Python腳本磕诊，支持提戎啤：

exon
intron
UTR
CDS
start codon
end codon

其中關(guān)于起始終止密碼子的提取可能還存在bug乎串，期待大家一起來(lái)debug健霹。

此外凡傅，代碼的風(fēng)格也不是“熟練老手”辟狈，仍有很大進(jìn)步空間，沒(méi)有做到“優(yōu)雅”夏跷，期待在新版本能有所進(jìn)步哼转。

#!/usr/bin/python3

import argparse
import textwrap

parser = argparse.ArgumentParser(prog = "getFeature.py",
                 description = textwrap.dedent('''\
                         This is a python tool for you to extract gene features in bed file format from bed13 file.

                         About how to convert GTF annotation file into bed13 file:
                         - http://www.reibang.com/p/de2455a8f507
                         About the structure of transcripts:
                         - http://www.reibang.com/p/f97935ce7255
                         '''),
                 formatter_class = argparse.RawDescriptionHelpFormatter,
                 epilog = "Owner: skm@smail.nju.edu.cn, Nanjing University."
                 )

parser.add_argument('--bed13', '-b', required = True, help = "the bed13 file.")
parser.add_argument('--prefix', '-p', required = True, help = "the prefix of the file to store the result.")
parser.add_argument('--feature', '-f', required = True, choices = ['exon', 'intron', '5utr', '3utr', 'start_codon', 'stop_codon', 'cds'], help = "the feature type you want to extract.")
parser.add_argument('--keep', '-k', action = 'store_true', help = "whether to keep non-main chr, default is not.") 
parser.add_argument('--version', '-v', action = 'version', version='%(prog)s 1.0.0')

args = parser.parse_args()

#input
bed = args.bed13
prefix = args.prefix
feature = args.feature


#load bed13 file
with open(bed, 'r') as input_file:
    tx_dict = {}
    chrs_keep = ['chrM', 'chrX', 'chrY']
    for i in range(22):
        chrs_keep.append('chr' + str(i+1))
    for line in input_file.readlines():
        line = line.strip()
        fields = line.split("\t")
#       chr = fields[0]
#       start = fields[1]
#       end = fields[2]
#       name = fields[3]
#       strand = fields[5]
#       cds_start = fields[6]
#       cds_end = fields[7]
#       num = fields[9]
#       exon_size = fields[10]
#       exon_start = fields[11]

        name = fields.pop(3)
        del fields[3]
        del fields[6]
        if args.keep:
            tx_dict[name] = fields
        else:
            if fields[0] in chrs_keep:
                tx_dict[name] = fields
#       chr, start, end, strand, cds_start, cds_end, num, exon_size, exon_start
#function to extract feature
#exon
def get_exon(dict, dest):
    file = dest + '.exon.bed'
    with open(file, 'w') as exon:
        for tx,feat in dict.items():
            for i in range(int(feat[6])):
                curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + "_" + str(i+1), '0', feat[3]]
                exon.write("\t".join(curr_line) + "\n")
#intron
def get_intron(dict, dest):
    file = dest + '.intron.bed'
    with open(file, 'w') as intron:
        for tx,feat in dict.items():
            if int(feat[6]) != 1:
                for i in range(int(feat[6]) - 1):
                    curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                    curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                    next_exon_start = int(feat[1]) + int(feat[8].split(",")[i+1])
                    curr_line = [feat[0], str(curr_exon_end), str(next_exon_start), tx + "_" + str(i+1), '0', feat[3]]
                    intron.write("\t".join(curr_line) + "\n")


#cds
def get_cds(dict, dest):
    file = dest + '.cds.bed'
    with open (file, 'w') as cds:
        for tx,feat in dict.items():
            if feat[4] != feat[5]:
                index = 0
                for i in range(int(feat[6])):
                    curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                    curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                    bool_1 = curr_exon_start < int(feat[4]) and curr_exon_end > int(feat[4])
                    bool_2 = curr_exon_start >= int(feat[4]) and curr_exon_end <= int(feat[5])
                    bool_3 = curr_exon_start <= int(feat[5]) and curr_exon_end > int(feat[5])
                    bool_4 = curr_exon_start <= int(feat[4]) and curr_exon_end >= int(feat[5])
                    if bool_1:
                        index += 1
                        curr_line = [feat[0], str(feat[4]), str(curr_exon_end), tx + "_" + str(index), '0', feat[3]]
                        cds.write("\t".join(curr_line) + "\n")
                    elif bool_2:
                        index += 1
                        curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + "_" + str(index), '0', feat[3]]
                        cds.write("\t".join(curr_line) + "\n")
                    elif bool_3:
                        index += 1
                        curr_line = [feat[0], str(curr_exon_start), str(feat[5]), tx + "_" + str(index), '0', feat[3]]
                        cds.write("\t".join(curr_line) + "\n")
                    elif bool_4:
                        index += 1
                        curr_line = [feat[0], str(feat[4]), str(feat[5]), tx + "_" + str(index), '0', feat[3]]
                        cds.write("\t".join(curr_line) + "\n")

#start_codon
def get_start_codon(dict, dest):
    file = dest + '.startcodon.bed'
    with open(file, 'w') as startcodon:
        for tx,feat in dict.items():
            if feat[4] != feat[5]:
                if feat[3] == "+":
                    curr_line = [feat[0], feat[4], str(int(feat[4]) + 3), tx, '0', feat[3]]
                else:
                    curr_line = [feat[0], str(int(feat[5]) - 3), feat[5], tx, '0', feat[3]]
                startcodon.write("\t".join(curr_line) + "\n")
#stop_codon
def get_stop_codon(dict, dest):
    file = dest + '.stopcodon.bed'
    with open(file, 'w') as stopcodon:
        for tx,feat in dict.items():
            if feat[4] != feat[5]:
                if feat[3] == "+":
                    curr_line = [feat[0], str(int(feat[5]) - 3), feat[5], tx, '0', feat[3]]
                else:
                    curr_line = [feat[0], feat[4], str(int(feat[4]) + 3), tx, '0', feat[3]]
                stopcodon.write("\t".join(curr_line) + "\n")


#5-utr:
def get_five_utr(dict, dest):
    file = dest + '.5utr.bed'
    with open(file, 'w') as fiveutr:
        for tx,feat in dict.items():
            if feat[4] != feat[5]:
                if feat[3] == "+":
                    for i in range(int(feat[6])):
                        curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                        curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                        bool_1 = curr_exon_start < int(feat[4]) and curr_exon_end <= int(feat[4])
                        bool_2 = curr_exon_start < int(feat[4]) and curr_exon_end > int(feat[4])
                        if bool_1:
                            curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + '_' + str(i+1), '0', feat[3]]
                            fiveutr.write("\t".join(curr_line) + "\n")
                        elif bool_2:
                            curr_line = [feat[0], str(curr_exon_start), feat[4], tx + '_' + str(i+1), '0', feat[3]]
                            fiveutr.write("\t".join(curr_line) + "\n")
                else:
                    index = 0
                    for i in range(int(feat[6])):
                        curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                        curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                        bool_1 = curr_exon_start > int(feat[5]) and curr_exon_end > int(feat[5])
                        bool_2 = curr_exon_start <= int(feat[5]) and curr_exon_end > int(feat[5])
                        if bool_1:
                            index += 1
                            curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + '_' + str(index), '0', feat[3]]
                            fiveutr.write("\t".join(curr_line) + "\n")
                        elif bool_2:
                            index += 1
                            curr_line = [feat[0], feat[5], str(curr_exon_end), tx + '_' + str(index), '0', feat[3]]
                            fiveutr.write("\t".join(curr_line) + "\n")
#3-utr
def get_three_utr(dict, dest):
    file = dest + '.3utr.bed'
    with open(file, 'w') as threeutr:
        for tx,feat in dict.items():
            if feat[4] != feat[5]:
                if feat[3] == "+":
                    index = 0
                    for i in range(int(feat[6])):
                        curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                        curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                        bool_1 = curr_exon_start > int(feat[5]) and curr_exon_end > int(feat[5])
                        bool_2 = curr_exon_start <= int(feat[5]) and curr_exon_end > int(feat[5])
                        if bool_1:
                            index += 1
                            curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + '_' + str(index), '0', feat[3]]
                            threeutr.write("\t".join(curr_line) + "\n")
                        elif bool_2:
                            index += 1
                            curr_line = [feat[0], feat[5], str(curr_exon_end), tx + '_' + str(index), '0', feat[3]]
                            threeutr.write("\t".join(curr_line) + "\n")
                else:
                    for i in range(int(feat[6])):
                        curr_exon_start = int(feat[1]) + int(feat[8].split(",")[i])
                        curr_exon_end = curr_exon_start + int(feat[7].split(",")[i])
                        bool_1 = curr_exon_start < int(feat[4]) and curr_exon_end <= int(feat[4])
                        bool_2 = curr_exon_start < int(feat[4]) and curr_exon_end > int(feat[4])
                        if bool_1:
                            curr_line = [feat[0], str(curr_exon_start), str(curr_exon_end), tx + '_' + str(i+1), '0', feat[3]]
                            threeutr.write("\t".join(curr_line) + "\n")
                        elif bool_2:
                            curr_line = [feat[0], str(curr_exon_start), feat[4], tx + '_' + str(i+1), '0', feat[3]]
                            threeutr.write("\t".join(curr_line) + "\n")


if __name__ == '__main__':
    print ("Processing...")
    if feature == "exon":
        get_exon(dict = tx_dict, dest = prefix)
    elif feature == "intron":
        get_intron(dict = tx_dict, dest = prefix)
    elif feature == "cds":
        get_cds(dict = tx_dict, dest = prefix)
    elif feature == "start_codon":
        get_start_codon(dict = tx_dict, dest = prefix)
    elif feature == "stop_codon":
        get_stop_codon(dict = tx_dict, dest = prefix)
    elif feature == "5utr":
        get_five_utr(dict = tx_dict, dest = prefix)
    elif feature == "3utr":
        get_three_utr(dict = tx_dict, dest = prefix)
    print ("Finished...")

使用方法：

getFeature.py --help

usage: getFeature.py [-h] --bed13 BED13 --prefix PREFIX --feature
                     {exon,intron,5utr,3utr,start_codon,stop_codon,cds}
                     [--keep] [--version]

This is a python tool for you to extract gene features in bed file format from bed13 file.

About how to convert GTF annotation file into bed13 file:
- http://www.reibang.com/p/de2455a8f507
About the structure of transcripts:
- http://www.reibang.com/p/f97935ce7255

optional arguments:
  -h, --help            show this help message and exit
  --bed13 BED13, -b BED13
                        the bed13 file.
  --prefix PREFIX, -p PREFIX
                        the prefix of the file to store the result.
  --feature {exon,intron,5utr,3utr,start_codon,stop_codon,cds}, -f {exon,intron,5utr,3utr,start_codon,stop_codon,cds}
                        the feature type you want to extract.
  --keep, -k            whether to keep non-main chr, default is not.
  --version, -v         show program's version number and exit

Owner: skm@smail.nju.edu.cn, Nanjing University.

感興趣的小伙伴可以在我的Github 上下載 getFeature.py或復(fù)制粘貼上面的code進(jìn)行使（試）用。

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者

人面猴
序言：七十年代末槽华，一起剝皮案震驚了整個(gè)濱河市壹蔓，隨后出現(xiàn)的幾起案子，更是在濱河造成了極大的恐慌猫态，老刑警劉巖佣蓉，帶你破解...
沈念sama閱讀 206,126評(píng)論 6贊 481
死咒
序言：濱河連續(xù)發(fā)生了三起死亡事件，死亡現(xiàn)場(chǎng)離奇詭異亲雪，居然都是意外死亡勇凭，警方通過(guò)查閱死者的電腦和手機(jī)，發(fā)現(xiàn)死者居然都...
沈念sama閱讀 88,254評(píng)論 2贊 382
救了他兩次的神仙讓他今天三更去死
文/潘曉璐我一進(jìn)店門匆光，熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)套像，“玉大人，你說(shuō)我怎么就攤上這事终息《峁” “怎么了？”我有些...
開(kāi)封第一講書人閱讀 152,445評(píng)論 0贊 341
道士緝兇錄：失蹤的賣姜人
文/不壞的土叔我叫張陵周崭，是天一觀的道長(zhǎng)柳譬。經(jīng)常有香客問(wèn)我，道長(zhǎng)续镇，這世上最難降的妖魔是什么美澳？我笑而不...
開(kāi)封第一講書人閱讀 55,185評(píng)論 1贊 278
?港島之戀（遺憾婚禮）
正文為了忘掉前任，我火速辦了婚禮，結(jié)果婚禮上制跟，老公的妹妹穿的比我還像新娘舅桩。我一直安慰自己，他們只是感情好雨膨，可當(dāng)我...
茶點(diǎn)故事閱讀 64,178評(píng)論 5贊 371
惡毒庶女頂嫁案：這布局不是一般人想出來(lái)的
文/花漫我一把揭開(kāi)白布擂涛。她就那樣靜靜地躺著，像睡著了一般聊记。火紅的嫁衣襯著肌膚如雪撒妈。梳的紋絲不亂的頭發(fā)上，一...
開(kāi)封第一講書人閱讀 48,970評(píng)論 1贊 284
城市分裂傳說(shuō)
那天排监，我揣著相機(jī)與錄音狰右，去河邊找鬼。笑死舆床，一個(gè)胖子當(dāng)著我的面吹牛棋蚌，可吹牛的內(nèi)容都是我干的。我是一名探鬼主播峭弟，決...
沈念sama閱讀 38,276評(píng)論 3贊 399
雙鴛鴦連環(huán)套：你想象不到人心有多黑
文/蒼蘭香墨我猛地睜開(kāi)眼附鸽，長(zhǎng)吁一口氣：“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼！你這毒婦竟也來(lái)了瞒瘸？” 一聲冷哼從身側(cè)響起坷备，我...
開(kāi)封第一講書人閱讀 36,927評(píng)論 0贊 259
萬(wàn)榮殺人案實(shí)錄
序言：老撾萬(wàn)榮一對(duì)情侶失蹤，失蹤者是張志新（化名）和其女友劉穎情臭，沒(méi)想到半個(gè)月后省撑，有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體，經(jīng)...
沈念sama閱讀 43,400評(píng)論 1贊 300
?護(hù)林員之死
正文獨(dú)居荒郊野嶺守林人離奇死亡俯在，尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛以下內(nèi)容為張勛視角年9月15日...
茶點(diǎn)故事閱讀 35,883評(píng)論 2贊 323
?白月光啟示錄
正文我和宋清朗相戀三年竟秫，在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了。大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片跷乐。...
茶點(diǎn)故事閱讀 37,997評(píng)論 1贊 333
活死人
序言：一個(gè)原本活蹦亂跳的男人離奇死亡肥败，死狀恐怖，靈堂內(nèi)的尸體忽然破棺而出愕提，到底是詐尸還是另有隱情馒稍，我是刑警寧澤，帶...
沈念sama閱讀 33,646評(píng)論 4贊 322
?日本核電站爆炸內(nèi)幕
正文年R本政府宣布浅侨，位于F島的核電站纽谒，受9級(jí)特大地震影響，放射性物質(zhì)發(fā)生泄漏如输。R本人自食惡果不足惜鼓黔，卻給世界環(huán)境...
茶點(diǎn)故事閱讀 39,213評(píng)論 3贊 307
男人毒藥：我在死后第九天來(lái)索命
文/蒙蒙一央勒、第九天我趴在偏房一處隱蔽的房頂上張望。院中可真熱鬧澳化，春花似錦崔步、人聲如沸。這莊子的主人今日做“春日...
開(kāi)封第一講書人閱讀 30,204評(píng)論 0贊 19
一樁弒父案刷晋，背后竟有這般陰謀
文/蒼蘭香墨我抬頭看了看天上的太陽(yáng)。三九已至慎陵，卻和暖如春，著一層夾襖步出監(jiān)牢的瞬間喻奥，已是汗流浹背席纽。一陣腳步聲響...
開(kāi)封第一講書人閱讀 31,423評(píng)論 1贊 260
情欲美人皮
我被黑心中介騙來(lái)泰國(guó)打工，沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留撞蚕，地道東北人润梯。一個(gè)月前我還...
沈念sama閱讀 45,423評(píng)論 2贊 352
代替公主和親
正文我出身青樓，卻偏偏與公主長(zhǎng)得像甥厦，于是被迫代替她去往敵國(guó)和親纺铭。傳聞我的和親對(duì)象是個(gè)殘疾皇子，可洞房花燭夜當(dāng)晚...
茶點(diǎn)故事閱讀 42,722評(píng)論 2贊 345

【技能】使用Python提取基因結(jié)構(gòu)bed文件

為了方便提取基因結(jié)構(gòu)搁宾，寫了一個(gè)簡(jiǎn)單的Python腳本磕诊，支持提戎啤：

其中關(guān)于起始終止密碼子的提取可能還存在bug乎串，期待大家一起來(lái)debug健霹。

使用方法：

推薦閱讀更多精彩內(nèi)容