python與R是當(dāng)前數(shù)據(jù)科學(xué)計(jì)算的兩大支柱狸涌,就我個(gè)人的使用經(jīng)驗(yàn)而言,R更直觀寥掐、簡(jiǎn)單和上手一些。很多專業(yè)的統(tǒng)計(jì)分析Python并沒有提供R中的對(duì)應(yīng)體磷蜀,而你想要使用Python做數(shù)據(jù)分析召耘,這時(shí)候就需要使用rpy2
包來構(gòu)建這個(gè)橋梁了。
比如我最近遇到一個(gè)分析問題褐隆,Python無法計(jì)算非參數(shù)統(tǒng)計(jì)檢驗(yàn)wilcoxon test
的置信區(qū)間污它,如果你仔細(xì)查看Python提供的非參數(shù)檢驗(yàn),你會(huì)發(fā)現(xiàn)它使用的是正態(tài)逼近庶弃,這在樣本量大的時(shí)候(根據(jù)中心極限定理服從正態(tài)分布)當(dāng)然可以使用Python計(jì)算衫贬,當(dāng)如果你是小樣本,比如大多數(shù)生物醫(yī)學(xué)數(shù)據(jù)處理與分析中普遍樣本少的可憐歇攻。在Stack overflow上有人討論過并檢查統(tǒng)計(jì)檢驗(yàn)的p值固惯,結(jié)論是算檢驗(yàn),R更靠譜些缴守,優(yōu)先采用葬毫。
言歸正傳,在銀行統(tǒng)計(jì)工作室rpy2使用示例一文中對(duì)rpy2
包各方面的使用都有介紹屡穗,加rpy2官方文檔基本上可以掌握rpy2
的使用贴捡,這里我提供這兩天實(shí)現(xiàn)的一個(gè)實(shí)例——從python中調(diào)用R的wilcox.test
函數(shù)進(jìn)行非參數(shù)檢驗(yàn),如果大家有這方面需求可以作為一個(gè)參考村砂。
代碼已經(jīng)封裝為一個(gè)函數(shù)烂斋,略寫了一下文檔。
# 推薦使用conda管理環(huán)境
# conda create --name test python=3.6
# source activate test
# conda install rpy2 # should add conda-forge channel
# reference link: <http://www.cnblogs.com/cloudtj/articles/6372200.html>, <https://rpy2.readthedocs.io/en/version_2.8.x/index.html>
def pyWilcox(x, y=None, alternative='two.sided', mu=0, paired=False, exact=None, correct=True, conf_interval=True, conf_level=0.95):
'''
Run wilcoxon test using wilcox.test in R stats package, default is 'two sided' test.
return p value, statistical value and confidence interval in a dictory.
Arguments:
x
numeric vector of data values. Non-finite (e.g., infinite or missing) values will be omitted.
y
an optional numeric vector of data values: as with x non-finite values will be omitted.
alternative
a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter.
mu
a number specifying an optional parameter used to form the null hypothesis. See ‘Details’.
paired
a logical indicating whether you want a paired test.
exact
a logical indicating whether an exact p-value should be computed.
correct
a logical indicating whether to apply continuity correction in the normal approximation for the p-value.
conf_interval
a logical indicating whether a confidence interval should be computed.
conf_level
confidence level of the interval.
Example:
x = [i for i in range(0, 10)]
y = [i for i in range(10, 20)]
# two sample test
pyWilcox(x, y)
# one sample test
pyWilcox(x)
# output:
#
# {'p_value': [1.082508822446903e-05], 'statistic': [0.0], 'conf_interval': [-13.0, -7.0]}
# {'p_value': [0.009151688852650072], 'statistic': [45.0], 'conf_interval': [2.500027475911944, 7.499972524088056]}
Note:
More information please run help('wilcox.test.default') in R console
'''
# 載入r對(duì)象
from rpy2 import robjects
# 載入導(dǎo)入包函數(shù)
from rpy2.robjects.packages import importr
# 將stats包導(dǎo)入為模塊
stats = importr('stats')
# When one wants to create a vector from Python, either the class Vector or the convenience classes IntVector, FloatVector, BoolVector, StrVector can be used.
# 將列表x轉(zhuǎn)換為r可識(shí)別數(shù)據(jù)對(duì)象
x = robjects.FloatVector(x)
# 將參數(shù)中的.替換為_础废,解決不兼容問題源祈, 來自rpy2文檔函數(shù)部分
def iamfeelinglucky(func):
def f(*args, **kwargs):
d = {}
for k, v in kwargs.items():
d[k.replace('_', '.')] = v
return func(**d)
return f
# 矯正參數(shù)名
wilcox = iamfeelinglucky(stats.wilcox_test_default)
# None類型似乎沒有相應(yīng)的函數(shù),只能用條件語句進(jìn)行判斷
if y != None:
y = robjects.FloatVector(y)
if exact != None:
pr = wilcox(x = x, y = y, alternative = alternative, mu = mu, paired = paired, exact = exact, correct = correct, conf_int = conf_interval, conf_level = conf_level)
else:
pr = wilcox(x = x, y = y, alternative = alternative, mu = mu, paired = paired, conf_int = conf_interval, conf_level = conf_level)
else:
if exact != None:
pr = wilcox(x = x, alternative = alternative, mu = mu, exact = exact, correct = correct, conf_int = conf_interval, conf_level = conf_level)
else:
pr = wilcox(x = x, alternative = alternative, mu = mu, correct = correct, conf_int = conf_interval, conf_level = conf_level)
print(pr)
res = list(pr)
# 返回結(jié)果中需要的值構(gòu)建字典
res = {"p_value":list(res[2]), "statistic":list(res[0]), "conf_interval":list(res[7])}
return(res)