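- The goal of ID conversion is to find the correspondence between two columns of IDs: one column holds the IDs already in your data, the other holds the IDs you need. Two columns, no more and no less. That means plain Linux commands can do the job: find a text file that contains this information and process it on the command line. The prerequisite is that you have the basic Linux commands down, at the level covered on day two of the training course.
- Below I process a GTF file to convert ENSG IDs to gene names. The file already contains both columns, so all I need to do is extract them.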
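Download the GTF
You can use the GTF from GENCODE or the GFF from NCBI, as long as it contains the ID information you want. Only the processing code differs between the two, which is why both basic and advanced Linux commands are worth learning: they are useful and efficient.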
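To illustrate how the processing code differs, here is a minimal sketch for a GFF3-style file, where column 9 uses `key=value;` pairs rather than the GTF `key "value";` form. The records are mock data, and the attribute names (`Dbxref`, `GeneID`, `Name`) are assumptions modeled on NCBI-style GFF3; check them against your own file before relying on this.

```shell
# Build a tiny mock GFF3 file (hypothetical records, for illustration only)
tmp=$(mktemp -d)
printf 'chr1\tMockSource\tgene\t100\t900\t.\t+\t.\tID=gene-GENEA;Dbxref=GeneID:1001,HGNC:HGNC:2002;Name=GENEA;gene=GENEA;gene_biotype=protein_coding\n' > "$tmp/mock.gff"
printf 'chr1\tMockSource\texon\t100\t300\t.\t+\t.\tID=exon-1;Parent=rna-1;gene=GENEA\n' >> "$tmp/mock.gff"

# Keep gene rows only, then pull values out of column 9 by key name
gff_map=$(awk -F'\t' -v OFS='\t' '
  $3 == "gene" {
    id = ""; name = ""
    n = split($9, attr, ";")
    for (i = 1; i <= n; i++) {
      # GeneID sits inside the comma-separated Dbxref list
      if (attr[i] ~ /^Dbxref=/ && match(attr[i], /GeneID:[0-9]+/))
        id = substr(attr[i], RSTART + 7, RLENGTH - 7)
      else if (attr[i] ~ /^Name=/)
        name = substr(attr[i], 6)
    }
    print id, name
  }' "$tmp/mock.gff")
printf '%s\n' "$gff_map"
rm -r "$tmp"
```

Matching on the key name rather than on a field position keeps the extraction stable when attributes appear in a different order from one record to the next.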
ENSG ID to gene name mapping
$ zcat gencode.v31.annotation.gtf.gz|grep -w 'gene'|cut -f 9|awk -v OFS="\t" '{print $2,$6,$4}'|sed 's/[";]//g'|sed '1i #gene_id\tgene_name\tgene_type'|less
# Or strip the quotes and semicolons with awk's gsub instead
$ zcat gencode.v31.annotation.gtf.gz|grep -w 'gene'|cut -f 9|awk -v OFS="\t" '{gsub(/[";]/,"");print $2,$6,$4}'|sed '1i #gene_id\tgene_name\tgene_type'|less
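Once the mapping table exists, applying it to your own data is a simple lookup. The sketch below assumes the table was saved as `id2name.tsv` and that the data to convert is a tab-separated `counts.tsv` with the ENSG ID in column 1; both file names and all of their contents are hypothetical.

```shell
tmp=$(mktemp -d)
# Mock mapping table in the same three-column layout as the output above
printf '#gene_id\tgene_name\tgene_type\n' > "$tmp/id2name.tsv"
printf 'ENSG_MOCK_1.1\tGENEA\tprotein_coding\n' >> "$tmp/id2name.tsv"
printf 'ENSG_MOCK_2.3\tGENEB\tlncRNA\n' >> "$tmp/id2name.tsv"
# Mock data keyed by gene_id; the last ID is absent from the table
printf 'ENSG_MOCK_2.3\t7\nENSG_MOCK_1.1\t42\nENSG_MOCK_9.1\t0\n' > "$tmp/counts.tsv"

# First file: load id -> name (skipping the header line).
# Second file: replace column 1 when a name is known, else keep the ID.
converted=$(awk -F'\t' -v OFS='\t' '
  NR == FNR { if ($0 !~ /^#/) name[$1] = $2; next }
  { if ($1 in name) $1 = name[$1]; print }
' "$tmp/id2name.tsv" "$tmp/counts.tsv")
printf '%s\n' "$converted"
rm -r "$tmp"
```

Keeping unmapped IDs as they are, rather than dropping those rows, makes it easy to spot afterwards which entries had no match.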
ENST ID to ENSG ID mapping
$ zcat gencode.v31.annotation.gtf.gz|grep -w 'transcript'|cut -f 9|awk -v OFS="\t" 'BEGIN{print "gene_id","transcript_id","gene_name","transcript_type"}{print $2,$4,$8,$10}'|sed 's/[";]//g'|less -S
# Deduplicated variant (note that sort moves the header line away from the top):
# zcat gencode.v31.annotation.gtf.gz|grep -w 'transcript'|cut -f 9|awk -v OFS="\t" 'BEGIN{print "gene_id","transcript_id","gene_name","transcript_type"}{print $2,$4,$8,$10}'|sed 's/[";]//g'|sort|uniq|less -S
# Four columns of output: gene_id, transcript_id, gene_name, transcript_type
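The positional fields `$2,$4,$8,$10` work only while the attributes in column 9 keep the same order. As a hedged alternative, the sketch below looks each attribute up by its key name. It runs on two mock transcript records (hypothetical IDs, with the attribute order deliberately swapped in the second one) rather than on the real GENCODE file.

```shell
tmp=$(mktemp -d)
# Two mock transcript records in GTF attribute style; gene_type and gene_name swap places
printf 'chr1\tMOCK\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSG_MOCK_1.1"; transcript_id "ENST_MOCK_1.2"; gene_type "lncRNA"; gene_name "GENEA"; transcript_type "lncRNA"; transcript_name "GENEA-201";\n' > "$tmp/mock.gtf"
printf 'chr1\tMOCK\ttranscript\t950\t1500\t.\t-\t.\tgene_id "ENSG_MOCK_2.1"; transcript_id "ENST_MOCK_2.1"; gene_name "GENEB"; gene_type "protein_coding"; transcript_type "protein_coding"; level 2;\n' >> "$tmp/mock.gtf"

tx_map=$(awk -F'\t' -v OFS='\t' '
  $3 == "transcript" {
    split("", v)                       # reset the per-record key/value store
    n = split($9, pair, /; */)         # one piece per attribute
    for (i = 1; i <= n; i++)
      if (split(pair[i], kv, / "/) == 2) {   # quoted attributes only
        sub(/"$/, "", kv[2])           # drop the closing quote
        v[kv[1]] = kv[2]
      }
    print v["gene_id"], v["transcript_id"], v["gene_name"], v["transcript_type"]
  }' "$tmp/mock.gtf")
printf '%s\n' "$tx_map"
rm -r "$tmp"
```

The positional one-liner is shorter and fine for a quick look; the key-based form is the safer choice when the same script has to run on annotation files from different releases or sources.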
If you need to filter by transcript type
$ zcat gencode.v31.annotation.gtf.gz|grep -w 'transcript'|awk '{print $18}'|sort|uniq
# Lists every transcript type in the file
"IG_C_gene";
"IG_C_pseudogene";
"IG_D_gene";
"IG_J_gene";
"IG_J_pseudogene";
"IG_pseudogene";
"IG_V_gene";
"IG_V_pseudogene";
"lncRNA";
"miRNA";
"misc_RNA";
"Mt_rRNA";
"Mt_tRNA";
"nonsense_mediated_decay";
"non_stop_decay";
"polymorphic_pseudogene";
"processed_pseudogene";
"protein_coding";
"pseudogene";
"retained_intron";
"ribozyme";
"rRNA";
"rRNA_pseudogene";
"scaRNA";
"scRNA";
"snoRNA";
"snRNA";
"sRNA";
"TEC";
"transcribed_processed_pseudogene";
"transcribed_unitary_pseudogene";
"transcribed_unprocessed_pseudogene";
"translated_processed_pseudogene";
"translated_unprocessed_pseudogene";
"TR_C_gene";
"TR_D_gene";
"TR_J_gene";
"TR_J_pseudogene";
"TR_V_gene";
"TR_V_pseudogene";
"unitary_pseudogene";
"unprocessed_pseudogene";
"vaultRNA";
# Then filter the result file from the previous step with grep
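The filtering step can be sketched as follows, assuming the four-column transcript table was saved as `tx2gene.tsv` (a hypothetical file name, filled with mock rows here). `grep -w` matches the type as a whole word anywhere on the line, while the `awk` form restricts the comparison to column 4 and keeps the header row.

```shell
tmp=$(mktemp -d)
# Mock transcript table in the four-column layout produced earlier
{
  printf 'gene_id\ttranscript_id\tgene_name\ttranscript_type\n'
  printf 'ENSG_MOCK_1.1\tENST_MOCK_1.1\tGENEA\tprotein_coding\n'
  printf 'ENSG_MOCK_1.1\tENST_MOCK_1.2\tGENEA\tretained_intron\n'
  printf 'ENSG_MOCK_2.1\tENST_MOCK_2.1\tGENEB\tlncRNA\n'
} > "$tmp/tx2gene.tsv"

# Whole-word match anywhere on the line
by_grep=$(grep -w 'protein_coding' "$tmp/tx2gene.tsv")
# Exact match on column 4 only, header row kept
by_awk=$(awk -F'\t' 'NR == 1 || $4 == "lncRNA"' "$tmp/tx2gene.tsv")
printf '%s\n' "$by_grep" "$by_awk"
rm -r "$tmp"
```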