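While building the geographic-distribution part of a user profile, I pulled the IP addresses of active users from the database, close to 300,000 of them. I asked around and nobody had an IP geolocation database, but a Baidu search turned up plenty of usable third-party APIs, such as Sina, Sohu, and Taobao. Here I chose the Sina IP API.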
<p>Sina IP API details:</p>
<p><code>Query endpoint: http://int.dpool.sina.com.cn/iplookup/iplookup.php?format=js&ip=IP_ADDRESS</code></p>
<p><code>Response: var remote_ip_info = {"ret":1,"start":"114.114.112.0","end":"114.114.119.255","country":"\u4e2d\u56fd","province":"\u6c5f\u82cf","city":"\u5357\u4eac","district":"","isp":"\u7535\u4fe1","type":"","desc":"\u5357\u4eac\u4fe1\u98ce114dns\u4e13\u5c5e"};</code></p>
<p>Returned data format (JSON): country, province (autonomous region or municipality), city (county), and ISP. For example:</p>
<p><code>{"code":0,"data":{"ip":"210.75.225.254","country":"\u4e2d\u56fd","area":"\u534e\u5317","region":"\u5317\u4eac\u5e02","city":"\u5317\u4eac\u5e02","county":"","isp":"\u7535\u4fe1",
"country_id":"86","area_id":"100000","region_id":"110000","city_id":"110000",
"county_id":"-1","isp_id":"100017"}}</code></p>
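The Sina response shown above is a JavaScript assignment rather than bare JSON, so the `var remote_ip_info = ` prefix and the trailing semicolon have to be stripped before parsing. A minimal offline sketch, using an abbreviated copy of the sample response as a string; the helper name `parse_sina_response` is hypothetical, and `simplify = FALSE` is my own choice to force a list result:

```r
library(RJSONIO)  # fromJSON()

# Hypothetical helper: turn 'var remote_ip_info = {...};' into an R list.
parse_sina_response <- function(txt) {
  json <- sub("^[^=]*=[[:space:]]*", "", txt)  # drop everything up to the first "="
  json <- sub(";[[:space:]]*$", "", json)      # drop the trailing semicolon
  fromJSON(json, simplify = FALSE)             # keep the result as a named list
}

# Abbreviated copy of the sample response (no network call is made)
sample_txt <- 'var remote_ip_info = {"ret":1,"start":"114.114.112.0","end":"114.114.119.255","country":"\\u4e2d\\u56fd","province":"\\u6c5f\\u82cf","city":"\\u5357\\u4eac","isp":"\\u7535\\u4fe1"};'

info <- parse_sina_response(sample_txt)
info[["start"]]    # first address of the matched range
info[["country"]]  # country name decoded from the \u escapes
```

Parsing once and reading fields by name avoids depending on the position of each field in the response.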
A sample of my raw IP data is shown in the figure below:

The R implementation is as follows:
<pre><code>### Batch IP lookup
# Set the working directory
setwd("A:\\數據分析師的成長\\zfancy.R")
library(RCurl)   # provides getURL()
library(RJSONIO) # provides fromJSON()
# Build the Sina IP API request URL for one IP
Sinaurl <- function(ip){
paste("http://int.dpool.sina.com.cn/iplookup/iplookup.php?format=js&ip=",ip,sep="")}
# Import the sample IPs
Ip_yb <- read.csv("A:\\數據分析\\用戶基本畫像\\ip數據.csv",
stringsAsFactors = F,header = T)
# Define the function fanxi
fanxi <- function(aaa){
  n <- nrow(aaa)
  # Preallocate the result vectors, filled with NA
  ip <- rep(NA_character_, n); cou <- rep(NA_character_, n)
  pro <- rep(NA_character_, n); cit <- rep(NA_character_, n)
  for (i in seq_len(n)){
    ip[i] <- aaa[i,1]
    txt <- getURL(Sinaurl(ip[i]))               # raw API response
    txt <- sub("^[^=]*=[[:space:]]*","",txt)    # drop the leading "var remote_ip_info = "
    txt <- sub(";[[:space:]]*$","",txt)         # drop the trailing semicolon
    info <- fromJSON(txt)                       # parse the JSON once
    cou[i] <- info$country                      # country
    pro[i] <- info$province                     # province
    cit[i] <- info$city                         # city
    Sys.sleep(1)                                # pause 1s between requests
  }
  return(data.frame(ip=ip,country=cou,province=pro,city=cit)) # combine the results
}
# List that collects the result of each batch
MM <- list()
n <- ceiling(nrow(Ip_yb)/100)-1 # number of full batches of 100 rows; the last batch holds the remainder
pb <- txtProgressBar(min = 0, max = n, style = 3) # progress bar
for (i in seq_len(n)){
  MM[[i]] <- fanxi(data.frame(Ip_yb[(100*i-99):(100*i),],
                              stringsAsFactors = F))
  ## stringsAsFactors = F is required here, otherwise the ip values do not come through
  Sys.sleep(1.35) # pause 1.35s after each batch to avoid dropped connections
  setTxtProgressBar(pb,i)
}
# Match the last batch
MM[[n+1]] <- fanxi(data.frame(Ip_yb[(100*n+1):nrow(Ip_yb),],
                              stringsAsFactors = F))
result <- do.call(rbind,MM)
# Export the data
setwd("A:\\數據分析\\匹配結果")
write.csv(result,"ip.CSV")
</code></pre>
A portion of "匹配結果.csv" is shown below:

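The batching in the code above depends on index arithmetic for the last, shorter batch. One alternative sketch, in base R only, lets `split()` compute every batch including the remainder. `make_batches` and the toy `ips` vector are hypothetical names used for illustration, and the addresses come from a range reserved for documentation:

```r
# Hypothetical helper: split row indices 1..n into consecutive batches of `size`.
make_batches <- function(n, size = 100) {
  idx <- seq_len(n)
  split(idx, ceiling(idx / size))
}

ips <- sprintf("192.0.2.%d", 1:250)   # toy data, not real user IPs
batches <- make_batches(length(ips), size = 100)
lengths(batches)                       # size of each batch
```

Each element could then be passed to `fanxi()` in turn, for example `lapply(batches, function(b) fanxi(Ip_yb[b, , drop = FALSE]))`, with no separate step for the final batch.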
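Summary: matching the 300,000 records took about 15 hours in total, and the for loop is really slow. I hope this helps anyone who needs it, and I would welcome pointers from any experts passing by.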
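Since every request also carries a fixed pause, one way to cut the run time might be to query each distinct IP only once and map the results back onto the full list. A rough sketch, assuming the data contains repeated IPs (the source does not say whether it does); `lookup_unique` and the stand-in `fake_lookup` are hypothetical and make no network calls:

```r
# Hypothetical helper: call `lookup` once per distinct IP, then expand back.
lookup_unique <- function(ips, lookup) {
  uniq <- unique(ips)
  res  <- vapply(uniq, lookup, character(1), USE.NAMES = FALSE)
  res[match(ips, uniq)]   # one result per original row, in the original order
}

calls <- 0
fake_lookup <- function(ip) {   # stand-in for a real API request
  calls <<- calls + 1           # count how many lookups are performed
  paste0("city-of-", ip)
}

ips <- c("192.0.2.1", "192.0.2.2", "192.0.2.1", "192.0.2.1")
out <- lookup_unique(ips, fake_lookup)
calls   # number of lookups actually performed
```

How much this saves depends entirely on how many duplicates the real data holds.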