適用背景
當(dāng)我們進(jìn)行跨物種比較分析時(shí)會(huì)發(fā)現(xiàn)基因名對(duì)不上陡蝇,或者找不到基因锁荔,這時(shí)候就需要進(jìn)行物種間的同源基因轉(zhuǎn)換棺蛛。最常用的工具就是Ensembl的biomaRt,另外還查到NCBI的一個(gè)工具h(yuǎn)omologene别厘,因此本文將詳細(xì)講解這兩個(gè)包。
方法一 biomaRt
biomaRt是基于Ensembl數(shù)據(jù)庫(kù)的R包拥诡,當(dāng)然也可以在網(wǎng)頁(yè)查詢同源基因触趴,但如果基因較多,物種較多渴肉,網(wǎng)頁(yè)版就比較麻煩冗懦。需要注意的是,biomaRt包需要聯(lián)網(wǎng)才能正常使用仇祭。
- 先安裝包
BiocManager::install("biomaRt")
- 加載包
library(biomaRt)
- 查看數(shù)據(jù)庫(kù)列表
> listMarts()
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 106
2 ENSEMBL_MART_MOUSE Mouse strains 106
3 ENSEMBL_MART_SNP Ensembl Variation 106
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 106
- 查看特定數(shù)據(jù)庫(kù)的數(shù)據(jù)集披蕉,'ensembl'表示使用所有數(shù)據(jù)庫(kù),可以看到所有可用數(shù)據(jù)庫(kù)里共有215個(gè)數(shù)據(jù)集乌奇。
> umart = useMart('ensembl')
> datalist = listDatasets(umart)
> head(datalist)
dataset description
1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
2 acalliptera_gene_ensembl Eastern happy genes (fAstCal1.2)
3 acarolinensis_gene_ensembl Green anole genes (AnoCar2.0v2)
4 acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2)
5 acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5)
6 amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2)
version
1 ASM259213v1
2 fAstCal1.2
3 AnoCar2.0v2
4 bAquChr1.2
5 Midas_v5
6 ASM200744v2
> dim(datalist)
[1] 215 3
- 模糊搜索數(shù)據(jù)庫(kù)里的數(shù)據(jù)集
> searchDatasets(mart = umart, pattern = "Mouse")
dataset description version
107 mmurinus_gene_ensembl Mouse Lemur genes (Mmur_3.0) Mmur_3.0
108 mmusculus_gene_ensembl Mouse genes (GRCm39) GRCm39
- 構(gòu)建數(shù)據(jù)集
mouse <- useMart('ensembl',dataset = "mmusculus_gene_ensembl")
macaque <- useMart('ensembl',dataset = "mfascicularis_gene_ensembl")
如果需要獲取單個(gè)物種的某些基因信息没讲,getBM可以實(shí)現(xiàn),但需要指定輸入的格式和輸出的格式礁苗。
getBM(attributes, filters = "", values = "", mart, curl = NULL,
checkFilters = TRUE, verbose = FALSE, uniqueRows = TRUE, bmHeader = FALSE,
quote = "\"", useCache = TRUE)
-
獲取某個(gè)物種全部基因
不指定 filters參數(shù)爬凑,只使用attributes參數(shù),mmusculus_gene_ensembl是小鼠的數(shù)據(jù)集寂屏。
g1 <- getBM(attributes = c("external_gene_name"),mart = useEnsembl("ensembl",dataset = 'mmusculus_gene_ensembl', host = "https://dec2021.archive.ensembl.org/"),uniqueRows = T)
-
輸入 filters
其filters參數(shù)可以指定基因的信息類型贰谣,用的最多的當(dāng)然是基因名,但也可以給定基因在染色體位置迁霎,基因ID等其它信息吱抚。簡(jiǎn)單來(lái)說(shuō)filters就指定了類型,例如"external_gene_name"考廉,values則是具體的值秘豹,例如基因Gad1,Sst昌粤。
listFilters可以列出所有輸入基因信息的類型既绕,而searchFilters還可以搜尋相關(guān)類型的名稱啄刹。可以看到剛剛構(gòu)建的小鼠數(shù)據(jù)集可以輸入400種基因信息類型凄贩。
> head(listFilters(mouse))
name description
1 chromosome_name Chromosome/scaffold name
2 start Start
3 end End
4 band_start Band Start
5 band_end Band End
6 strand Strand
> dim(listFilters(mouse))
[1] 400 2
> searchFilters(mart = mouse, 'gene_name')
name description
53 external_gene_name Gene Name(s) [e.g. mt-Tf]
98 wikigene_name WikiGene name(s) [e.g. 0610005C13Rik]
-
輸出 attributes
類似的誓军,指定輸出類型使用attributes參數(shù),用listAttributes函數(shù)查看可輸出的類型疲扎,searchAttributes模糊查詢可輸出的類型昵时。可以看到可以輸出2987種類型椒丧。
> dim(listAttributes(mouse))
[1] 2987 3
> head(listAttributes(mouse))
name description page
1 ensembl_gene_id Gene stable ID feature_page
2 ensembl_gene_id_version Gene stable ID version feature_page
3 ensembl_transcript_id Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5 ensembl_peptide_id Protein stable ID feature_page
6 ensembl_peptide_id_version Protein stable ID version feature_page
> searchAttributes(mart = mouse, 'ensembl_gene_id')
name description page
1 ensembl_gene_id Gene stable ID feature_page
2 ensembl_gene_id_version Gene stable ID version feature_page
183 ensembl_gene_id Gene stable ID structure
184 ensembl_gene_id_version Gene stable ID version structure
223 ensembl_gene_id Gene stable ID homologs
224 ensembl_gene_id_version Gene stable ID version homologs
2884 ensembl_gene_id Gene stable ID snp
2885 ensembl_gene_id_version Gene stable ID version snp
2942 ensembl_gene_id Gene stable ID sequences
2943 ensembl_gene_id_version Gene stable ID version sequences
- 使用示例
> getBM(attributes = c("ensembl_gene_id","external_gene_name"),filters = "external_gene_name", values = c("Gad1","Sst"),mart = mouse,uniqueRows = T)
ensembl_gene_id external_gene_name
1 ENSMUSG00000070880 Gad1
2 ENSMUSG00000004366 Sst
以下來(lái)自另一篇博主簡(jiǎn)書(shū)的用法
### 根據(jù)entrez ID號(hào)來(lái)找
entrzID=c("672","1") ##定義entrez ID
getBM(attributes=c("entrezgene","external_gene_name","gene_biotype"), filters = "ensembl_gene_id_version", values =entrzID, mart=human,uniqueRows=TRUE)
### 通過(guò)染色體及起始終止坐標(biāo)來(lái)挑選基因
getBM(c('affy_hg_u133_plus_2','ensembl_gene_id'), filters = c('chromosome_name','start','end'), values=list(16,1100000,1250000), mart=human)
###根據(jù)染色體位置(6p21.1)查找所有位于其上的基因
getBM(c('entrezgene','hgnc_symbol', 'transcript_biotype', 'chromosome_name','start_position','end_position', 'band'), filters = c('chromosome_name','band_start','band_end'), values=list(6,'p21.1','p21.1'), mart=human)
獲取不同物種的同源基因
- 使用函數(shù)getLDS
getLDS(attributes, filters = "", values = "", mart, attributesL,
filtersL = "", valuesL = "", martL, verbose = FALSE, uniqueRows = TRUE,
bmHeader=TRUE)
與getBM類似壹甥,參考物種還是原來(lái)的參數(shù),而查詢物種則需要在原參數(shù)名加上L壶熏,例如參考物種是attributes句柠,則查詢物種是attributesL。
使用示例
gene.mo2ma <- getLDS(attributes = c("external_gene_name"),filters = "external_gene_name",values = c("Gad1","Sst"),mart = mouse,attributesL = c("external_gene_name","chromosome_name"),martL = macaque,uniqueRows = T)
> gene.mo2ma <- getLDS(attributes = c("external_gene_name"),filters = "external_gene_name",values = c("Gad1","Sst"),mart = mouse,attributesL = c("external_gene_name","chromosome_name"),martL = macaque,uniqueRows = T)
> head(gene.mo2ma)
Gene.name Gene.name.1 Chromosome.scaffold.name
1 Sst SST 2
2 Gad1 GAD1 12
-
小插曲
我運(yùn)行上面的腳本時(shí)出現(xiàn)下面的報(bào)錯(cuò):
> gene.mo2ma <- getLDS(attributes = c("external_gene_name"),filters = "external_gene_name",values = c("Gad1","Sst"),mart = mouse,attributesL = c("external_gene_name","chromosome_name"),martL = macaque,uniqueRows = T)
錯(cuò)誤: biomaRt has encountered an unexpected server error.
Consider trying one of the Ensembl mirrors (for more details look at ?useEnsembl)
查了谷歌后發(fā)現(xiàn)是網(wǎng)頁(yè)自身的問(wèn)題棒假,在構(gòu)建數(shù)據(jù)集的時(shí)候需更換2021年版本的一個(gè)網(wǎng)頁(yè)才能正常運(yùn)行溯职,估計(jì)是2022年版本的bug。
解決辦法是是用host參數(shù)指定2021年版本網(wǎng)頁(yè)淆衷。
macaque <- useMart("ensembl", dataset = "mfascicularis_gene_ensembl", host = "https://dec2021.archive.ensembl.org/")
方法二 homologene
與biomaRt不同缸榄,homologene是基于NCBI HomoloGene數(shù)據(jù)庫(kù)的一個(gè)包,感興趣可以進(jìn)官網(wǎng)了解一下此數(shù)據(jù)庫(kù)祝拯,因此甚带,兩者做出來(lái)的結(jié)果會(huì)有差異,建議可以互為補(bǔ)充佳头。
但HomoloGene最新更新時(shí)間是2014-04-09鹰贵,只有21個(gè)物種,可能還是biomaRt更完善一些康嘉。
- 安裝包
BiocManager::install("homologene")
- 加載包
library(homologene)
這個(gè)包比較簡(jiǎn)單碉输,但是可提取信息也比較少,就一個(gè)函數(shù)就可以查詢同源基因亭珍,只需傳入三個(gè)參數(shù)敷钾,genelist是基因列表,inTax 是參考物種的Taxonomy ID肄梨,outTax是查詢物種的Taxonomy ID阻荒。例如小鼠是10090,人是9606众羡。
> genelist=c('Gad','Sst')
> homologene(genelist, inTax = 10090, outTax = 9606)
10090 9606 10090_ID 9606_ID
1 Sst SST 20604 6750
- 常用物種的Taxonomy ID(來(lái)自這篇文章):
10090 Mus musculus
10116 Rattus norvegicus
28985 Kluyveromyces lactis
318829 Magnaporthe oryzae
33169 Eremothecium gossypii
3702 Arabidopsis thaliana
4530 Oryza sativa
4896 Schizosaccharomyces pombe
4932 Saccharomyces cerevisiae
5141 Neurospora crassa
6239 Caenorhabditis elegans
7165 Anopheles gambiae
7227 Drosophila melanogaster
7955 Danio rerio
8364 Xenopus (Silurana) tropicalis
9031 Gallus gallus
9544 Macaca mulatta
9598 Pan troglodytes
9606 Homo sapiens
9615 Canis lupus familiaris
9913 Bos taurus
小結(jié)與補(bǔ)充
總的來(lái)說(shuō)biomaRt更加完善侨赡,功能更強(qiáng)大,所以更推薦biomaRt。但是只能對(duì)兩個(gè)物種進(jìn)行轉(zhuǎn)換羊壹,如果是多個(gè)物種進(jìn)行轉(zhuǎn)換蓖宦,該怎么做呢?下一篇文章試試寫(xiě)一下2個(gè)及以上物種的轉(zhuǎn)換函數(shù)吧油猫。