Content adapted from 【顧先生聊數(shù)據(jù)】: "PSM Propensity Score Matching [Part 1: Theory]" and "PSM Propensity Score Matching [Part 2: Python Practice]".
Definition: PSM (propensity score matching) fits a model to the data to estimate a single probability for each user (collapsing multi-dimensional features into a one-dimensional score), then finds the control-group samples closest to each treated sample so the two can be compared.
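Formally, the propensity score of a user with covariates x is e(x) = P(T = 1 | X = x), the probability of receiving the treatment given those covariates; matching on this single number stands in for matching on all of the covariates at once.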
Assumptions:
- Conditional independence assumption: before the treatment, the control and treatment groups do not differ systematically, so the effect observed in the treatment group comes entirely from the treatment.
- Common support assumption: ideally, every individual in the treatment group can find a counterpart in the control group. When the two groups share little overlap in propensity-score values, PSM is not applicable (a quick overlap check is sketched below).
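A simple way to probe common support, once the propensity scores from step 2 are available, is to compare the score range of the two groups and see how many treated users fall inside the overlap. This is only a minimal sketch; the function name check_common_support is mine, and the column names PUSH and PROPENSITY mirror the code below rather than anything in the original post.

import pandas as pd

def check_common_support(df, treat_col="PUSH", score_col="PROPENSITY"):
    """Compare the propensity-score ranges of treated vs. control users to spot weak overlap."""
    treated = df.loc[df[treat_col] == 1, score_col]
    control = df.loc[df[treat_col] == 0, score_col]
    # region where both groups have support
    low = max(treated.min(), control.min())
    high = min(treated.max(), control.max())
    # share of treated users whose score falls inside the overlapping region
    in_support = treated.between(low, high).mean()
    print(f"overlap region: [{low:.3f}, {high:.3f}]")
    print(f"treated users inside the overlap: {in_support:.1%}")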
Code
1. Import the relevant Python libraries and set the path and model
import psmatching.match as psm
import pytest
import pandas as pd
import numpy as np
from psmatching.utilities import *
import statsmodels.api as sm
# path to the data file
path = "E:/pythonFile/data/psm/psm_gxslsj_data.csv"
# the model consists of the treatment variable and the covariates, in the form "treatment ~ feature + feature + ..."
model = "PUSH ~ AGE + SEX + VIP_LEVEL + LASTDAY_BUY_DIFF + PREFER_TYPE + LOGTIME_PREFER + USE_COUPON_BEFORE + ACTIVE_LEVEL"
# number of matches per treated unit; with k=3, each PUSH=1 user is matched to three similar PUSH=0 users
k = "3"
m = psm.PSMatch(path, model, k)
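For reference, the PSMatch call and the manual code below expect a flat CSV with one row per user: an ID column (use_id), a 0/1 treatment column (PUSH), and the covariates named in the model string. The values in this sketch are made up purely to illustrate the layout; the real psm_gxslsj_data.csv follows the same column structure.

import pandas as pd

# made-up rows, only to show the expected column layout of the input CSV
toy = pd.DataFrame({
    "use_id": [1001, 1002, 1003],
    "PUSH": [1, 0, 0],                 # treatment indicator
    "AGE": [25, 31, 42],
    "SEX": [1, 0, 1],
    "VIP_LEVEL": [2, 3, 1],
    "LASTDAY_BUY_DIFF": [3, 10, 27],
    "PREFER_TYPE": [1, 2, 1],
    "LOGTIME_PREFER": [0, 1, 2],
    "USE_COUPON_BEFORE": [1, 0, 0],
    "ACTIVE_LEVEL": [4, 2, 3],
})
toy.to_csv("toy_psm_data.csv", index=False)   # same layout as the file at `path`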
2孙援、獲得傾向性匹配得分
df = pd.read_csv(path)
df = df.set_index("use_id") # use use_id as the index; replace with your own ID field if needed
print("\nCalculating propensity scores ...", end=" ")
# estimate propensity scores with logistic regression: a generalized linear model with a Binomial family
glm_binom = sm.formula.glm(formula = model, data = df, family = sm.families.Binomial())
# fit the generalized linear model for the given family
#https://www.w3cschool.cn/doc_statsmodels/statsmodels-generated-statsmodels-genmod-generalized_linear_model-glm-fit.html?lang=en
result = glm_binom.fit()
# print the regression summary
# print(result.summary())
propensity_scores = result.fittedvalues
print("\n計算完成!")
#將傾向性匹配得分寫入data
df["PROPENSITY"] = propensity_scores
df
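The GLM above is simply a logistic regression that predicts PUSH from the covariates, so the scores can also be produced with other tooling. Below is a sketch of an equivalent fit with scikit-learn; it is not part of the original post, it assumes the covariate columns are numeric or can be dummy-coded, and scikit-learn regularizes by default, so its scores will be close to but not exactly equal to the statsmodels fit.

from sklearn.linear_model import LogisticRegression

features = ["AGE", "SEX", "VIP_LEVEL", "LASTDAY_BUY_DIFF", "PREFER_TYPE",
            "LOGTIME_PREFER", "USE_COUPON_BEFORE", "ACTIVE_LEVEL"]
X = pd.get_dummies(df[features], drop_first=True)   # dummy-code any categorical columns
y = df["PUSH"]

clf = LogisticRegression(max_iter=1000).fit(X, y)
# predicted probability of PUSH = 1 is the propensity score
sklearn_scores = clf.predict_proba(X)[:, 1]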
3拓售、區(qū)分干預(yù)與非干預(yù)
groups是干預(yù)項,propensity是傾向性匹配得分镶奉,這里要分開干預(yù)與非干預(yù)础淤,且確保n1<n2
groups = df.PUSH  # replace PUSH with your own treatment variable
propensity = df.PROPENSITY
# convert the treatment indicator to True/False
groups = groups == groups.unique()[1]
n = len(groups)
# count the number of True values (treated) and False values (control)
n1 = groups[groups==1].sum()
n2 = n - n1
g1, g2 = propensity[groups==1], propensity[groups==0]
# make sure n2 > n1 (the smaller group is matched against the larger one); otherwise swap
if n1 > n2:
    n1, n2, g1, g2 = n2, n1, g2, g1
m_order = list(np.random.permutation(groups[groups==1].index))  # shuffle the treated units to reduce the influence of the original ordering
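One caveat about `groups == groups.unique()[1]`: unique() returns labels in order of first appearance, so which value becomes True depends on how the rows happen to be sorted. When the treatment column is already coded 0/1, a safer variant (my suggestion, not from the original post) is to compare against the treated label explicitly:

# pin True to the treated label explicitly instead of relying on unique()[1]
groups = df.PUSH == 1   # True = treated (PUSH == 1), False = control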
4. Match the treatment group to the control group by propensity-score difference
Note: caliper = None below can be replaced with whatever matching tolerance you want.
matches = {}
k = int(k)
print("\n給每個干預(yù)組匹配 [" + str(k) + "] 個對照組 ... ", end = " ")
for m in m_order:
# 計算所有傾向得分差異,這里用了最粗暴的絕對值
# 將propensity[groups==1]分別拿出來移国,每一個都與所有的propensity[groups==0]相減
dist = abs(g1[m]-g2)
array = np.array(dist)
#如果無放回地匹配吱瘩,最后會出現(xiàn)要選取3個匹配對象,但是只有一個候選對照組的錯誤迹缀,故進行判斷
if k < len(array):
# 在array里面選擇K個最小的數(shù)字使碾,并轉(zhuǎn)換成列表
k_smallest = np.partition(array, k)[:k].tolist()
# 用卡尺做判斷
caliper = None
if caliper:
caliper = float(caliper)
# 判斷k_smallest是否在定義的卡尺范圍
keep_diffs = [i for i in k_smallest if i <= caliper]
keep_ids = np.array(dist[dist.isin(keep_diffs)].index)
else:
# 如果不用標(biāo)尺判斷,那就直接上k_smallest了
keep_ids = np.array(dist[dist.isin(k_smallest)].index)
# 如果keep_ids比要匹配的數(shù)量多祝懂,那隨機選擇下票摇,如要少,通過補NA配平數(shù)量
if len(keep_ids) > k:
matches[m] = list(np.random.choice(keep_ids, k, replace=False))
elif len(keep_ids) < k:
while len(matches[m]) <= k:
matches[m].append("NA")
else:
matches[m] = keep_ids.tolist()
# 判斷 replace 是否放回
replace = False
if not replace:
g2 = g2.drop(matches[m])
print("\n匹配完成!")
5砚蓬、將匹配完成的結(jié)果合并起來
matches = pd.DataFrame.from_dict(matches, orient="index")
matches = matches.reset_index()
column_names = {}
column_names["index"] = "干預(yù)組"
for i in range(k):
column_names[i] = str("匹配對照組_" + str(i+1))
matches = matches.rename(columns = column_names)
matches
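With the matched table in hand, a typical next step is to stack the treated IDs and their matched control IDs and compare an outcome metric between the two sides. The sketch below assumes a hypothetical outcome column named BUY_AMOUNT in df; that column is not part of the original data and is used only to show the pattern, and the column names TREATED and MATCHED_CONTROL_* come from the renaming in step 5.

# collect the treated ids and their matched control ids, dropping the "NA" padding
treated_ids = matches["TREATED"].tolist()
control_cols = [c for c in matches.columns if c.startswith("MATCHED_CONTROL_")]
control_ids = [i for i in matches[control_cols].values.ravel() if i != "NA" and pd.notna(i)]

# BUY_AMOUNT is a hypothetical outcome column, used only for illustration
outcome = "BUY_AMOUNT"
treated_mean = df.loc[treated_ids, outcome].mean()
control_mean = df.loc[control_ids, outcome].mean()
print(f"treated mean {outcome}: {treated_mean:.2f}")
print(f"matched control mean {outcome}: {control_mean:.2f}")
print(f"naive matched effect estimate: {treated_mean - control_mean:.2f}")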