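rentrez is an R package that wraps E-utilities. It lets you call E-utilities conveniently from R code, without the hassle of constructing E-utilities URLs by hand.
NCBI rate-limits access by default: 3 requests per second without an API key, and 10 requests per second with one. If you have a key, set it with the set_entrez_key function.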
set_entrez_key("xxxxxx")
Under the hood this simply sets the environment variable "ENTREZ_KEY", so setting that environment variable directly works as well.
> set_entrez_key
function (key)
{
Sys.setenv(ENTREZ_KEY = key)
}
<bytecode: 0x5650672a7630>
<environment: namespace:rentrez>
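Since the function body above is a single Sys.setenv call, the direct route can be sketched with base R alone. This is a minimal illustration; "xxxxxx" is the same placeholder as above, not a real key.

```r
# Set the API key through the environment variable that
# set_entrez_key() writes to. "xxxxxx" is a placeholder, not a real key.
Sys.setenv(ENTREZ_KEY = "xxxxxx")

# Read it back to confirm the variable is visible in this R session.
key <- Sys.getenv("ENTREZ_KEY")
has_key <- nzchar(key)
has_key
```

To keep the key out of scripts, the same variable can be defined in ~/.Renviron as a line of the form ENTREZ_KEY=xxxxxx, which R reads at startup.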
Use the entrez_dbs function to list the available databases; most commonly used databases are supported.
> entrez_dbs()
[1] "pubmed" "protein" "nuccore" "ipg"
[5] "nucleotide" "structure" "genome" "annotinfo"
[9] "assembly" "bioproject" "biosample" "blastdbinfo"
[13] "books" "cdd" "clinvar" "gap"
[17] "gapplus" "grasp" "dbvar" "gene"
[21] "gds" "geoprofiles" "homologene" "medgen"
[25] "mesh" "nlmcatalog" "omim" "orgtrack"
[29] "pmc" "popset" "proteinclusters" "pcassay"
[33] "protfam" "pccompound" "pcsubstance" "seqannot"
[37] "snp" "sra" "taxonomy" "biocollections"
[41] "gtr"
Use the entrez_db_summary function to view a summary of a database.
> entrez_db_summary("nucleotide")
DbName: nuccore
MenuName: Nucleotide
Description: Core Nucleotide db
DbBuild: Build221030-1815m.1
Count: 508239779
LastUpdate: 2022/11/01 21:31
Use entrez_db_searchable to view the search fields a database supports; search conditions can then be built from those fields.
> entrez_db_searchable("nucleotide")
Searchable fields for database 'nuccore'
ALL All terms from all searchable fields
UID Unique number assigned to each sequence
FILT Limits the records
WORD Free text associated with record
TITL Words in definition line
KYWD Nonstandardized terms provided by submitter
AUTH Author(s) of publication
JOUR Journal abbreviation of publication
VOL Volume number of publication
ISS Issue number of publication
... # remaining fields omitted
esearch: run searches with the entrez_search function; field names are enclosed in [].
> es <- entrez_search(db = "nucleotide", term = "Human adenovirus 5[ORGN]", retmax = 5)
> es
Entrez search result with 681 hits (object contains 5 IDs and no web_history object)
Search term (as translated): "Human adenovirus 5"[Organism]
> es$ids
[1] "2295556242" "2295556240" "2295556238" "2287989107" "2260395970"
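When a query combines several fields, assembling the term string by hand gets error-prone. The sketch below shows one way to build it; build_term is a hypothetical helper (not part of rentrez), and the TITL value "complete genome" is only an illustrative assumption. ORGN and TITL are field names that appear in the examples and field listing above.

```r
# Hypothetical helper (not part of rentrez): turn named arguments into
# an Entrez term of the form "value"[FIELD] AND "value"[FIELD] ...
build_term <- function(...) {
  fields <- list(...)
  parts <- sprintf('"%s"[%s]', unlist(fields), names(fields))
  paste(parts, collapse = " AND ")
}

# Illustrative query: organism plus words in the definition line.
term <- build_term(ORGN = "Human adenovirus 5", TITL = "complete genome")
print(term)

# The string can then be passed to entrez_search, for example:
# es <- rentrez::entrez_search(db = "nucleotide", term = term, retmax = 5)
```

Quoting each value keeps multi-word phrases together, which matches the translated query that entrez_search reports in the output above.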
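The entrez_link function retrieves links between databases. For example, linking from a taxon to nucleotide sequences yields the list of nucleotide sequence IDs for a species, which can then be used to fetch sequence information or download the sequences.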
> el <- entrez_link(dbfrom = "taxonomy", id = 28285, db = "nucleotide")
> el
elink object with contents:
$links: IDs for linked records from NCBI
> el$links
elink result with information from 2 databases:
[1] taxonomy_nuccore taxonomy_nucleotide_exp
> el$links$taxonomy_nuccore[1:5]
[1] "9652377" "9652366" "296210" "2295556242" "2295556240"
The entrez_summary function retrieves summary information for records. Setting always_return_list = TRUE makes the function return a list even when only one ID is requested, which prevents unexpected behavior inside apply calls or loops.
> es <- entrez_summary(db = "nucleotide", id = "2295556242")
> es
esummary result with 32 items:
[1] uid term caption title extra
[6] gi createdate updatedate flags taxid
[11] slen biomol moltype topology sourcedb
[16] segsetsize projectid genome subtype subname
[21] assemblygi assemblyacc tech completeness geneticcode
[26] strand organism strain biosample statistics
[31] properties oslt
> es$organism
[1] "Human adenovirus 5"
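The reason always_return_list matters can be sketched as follows: code that iterates over records expects a list of summaries, whether one ID or many were requested. Both helper names below are hypothetical. The mock object is a plain list standing in for a real result, so the extraction step can be run without network access.

```r
# Hypothetical helper: pull the organism out of every record in a
# list of summaries, returning a character vector named by record ID.
organisms_from <- function(summaries) {
  vapply(summaries, function(rec) rec$organism, character(1))
}

# Hypothetical wrapper around the real call (requires network access).
# always_return_list = TRUE keeps the result a list of records even
# when length(ids) == 1, so organisms_from() works unchanged.
organisms_for <- function(ids) {
  summaries <- rentrez::entrez_summary(
    db = "nucleotide", id = ids, always_return_list = TRUE
  )
  organisms_from(summaries)
}

# Offline stand-in for a one-record result, using the ID and organism
# shown in the transcript above.
mock <- list("2295556242" = list(organism = "Human adenovirus 5"))
orgs <- organisms_from(mock)
print(orgs)
```

Without always_return_list = TRUE, a single-ID request returns the record itself rather than a list of records, and the vapply call would iterate over the record's fields instead.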
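The extract_from_esummary function extracts the required fields from a returned summary object. In the example below, es holds summaries for multiple records rather than the single-record result shown above.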
> head(extract_from_esummary(es, "biomol"))
1884989923 1884989922 1884989921 1884989920 1884989919 1041463157
"genomic" "genomic" "genomic" "genomic" "genomic" "genomic"
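The entrez_fetch function retrieves complete records and is commonly used to download data. For example, with the nucleotide database and rettype = "fasta" it returns the FASTA data as a string; writing that string to a file completes the download of the nucleotide FASTA file.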
> el <- entrez_link(dbfrom = "taxonomy", id = "28285", db = "nucleotide")
> nu5 <- el$links$taxonomy_nuccore[1:5]
> nu5
[1] "9652377" "9652366" "296210" "2295556242" "2295556240"
> ef <- entrez_fetch(db = "nucleotide", id = nu5, rettype = "fasta")
> temp <- tempfile()
> write(ef, temp)
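After writing the file it can be worth checking that the number of FASTA records matches the number of IDs requested. The sketch below counts header lines (those starting with ">"). count_fasta_records is a hypothetical helper, and the toy string is placeholder data standing in for the value returned by entrez_fetch.

```r
# Hypothetical helper: count FASTA records by counting header lines.
count_fasta_records <- function(fasta_text) {
  lines <- unlist(strsplit(fasta_text, "\n", fixed = TRUE))
  sum(startsWith(lines, ">"))
}

# Placeholder FASTA text (not real sequence data) standing in for `ef`.
toy <- ">seq1 placeholder record\nACGT\n\n>seq2 placeholder record\nTTGA\n\n"

# Write to a temporary file and count the records in memory and on disk.
out <- tempfile(fileext = ".fasta")
writeLines(toy, out)
n_in_memory <- count_fasta_records(toy)
n_on_disk <- count_fasta_records(readLines(out))
print(c(n_in_memory, n_on_disk))
```

With real data, comparing count_fasta_records(ef) against length(nu5) gives a quick check that every requested ID came back.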
web_history
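NCBI allows search results to be stored on its server and referenced later. This avoids the cycle of downloading results, parsing them, and uploading the parsed result, and it also avoids generating so many requests that you hit the rate limit. For example, calling entrez_link with cmd = "neighbor_history" returns a web_history object.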
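Take downloading the nucleotide sequences of a species as an example. With the default entrez_link return object you have to extract the list of nucleotide sequence IDs yourself (as in the earlier example) and then pass those IDs to entrez_fetch to download the sequences from NCBI. With web_history, because the object is stored on the NCBI server, you can download the species' nucleotide sequences directly from the web_history, as in the code below.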
> el <- entrez_link(dbfrom = "taxonomy", id = "208893", db = "nucleotide", cmd = "neighbor_history")
> el
elink object with contents:
$web_histories: Objects containing web history information
> el$web_histories
$taxonomy_nuccore
Web history object (QueryKey = 1, WebEnv = MCID_637d8b3...)
$taxonomy_nucleotide_exp
Web history object (QueryKey = 2, WebEnv = MCID_637d8b3...)
# Download all nucleotide sequences for this species
> ef <- entrez_fetch(db = "nucleotide", web_history = el$web_histories$taxonomy_nuccore, rettype = "fasta")
> str(ef)
chr ">LC732340.1 Human respiratory syncytial virus A Fukui_S_258_2018 gene for attachment glycoprotein, partial cds\"| __truncated__
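For species with many sequences, fetching everything in one call can produce a very large response. One common pattern, sketched here under assumptions, is to page through the web_history with the E-utilities retstart and retmax parameters, passed through the extra arguments of entrez_fetch. Both helper names are hypothetical, the batch size of 500 is arbitrary, and total must be supplied by the caller because the elink result shown above does not report a record count.

```r
# Hypothetical helper: zero-based start offsets for each batch.
batch_starts <- function(total, batch_size) {
  if (total <= 0) return(integer(0))
  seq(0, total - 1, by = batch_size)
}

# Hypothetical helper (requires network access): fetch a web_history
# in batches and append each chunk of FASTA text to `path`.
# Assumes retstart/retmax are forwarded to the efetch request.
fetch_history_in_batches <- function(web_history, total, path,
                                     batch_size = 500) {
  for (start in batch_starts(total, batch_size)) {
    chunk <- rentrez::entrez_fetch(
      db = "nucleotide", web_history = web_history, rettype = "fasta",
      retstart = start, retmax = batch_size
    )
    cat(chunk, file = path, append = TRUE)
  }
  invisible(path)
}

# Offsets for an illustrative total of 1200 records.
starts <- batch_starts(1200, 500)
print(starts)
```

A call might look like fetch_history_in_batches(el$web_histories$taxonomy_nuccore, total = n, path = "seqs.fasta"), where n and the file name are chosen by the caller.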