- Batch effect
Batch effects are an often-overlooked problem in high-throughput research. Simply put, a batch effect arises when differences in experimental conditions, reagent lots, laboratory personnel, and similar factors introduce error into an experiment and confound its results.
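To make the idea of confounding concrete, here is a minimal simulated sketch. Everything in it is hypothetical: the sample size, the batch shift of 2 units, and the variable names are assumptions chosen for illustration, not values from any real experiment. There is no true treatment effect in the simulated data, yet a naive comparison reports one when every treated sample is run in the second batch.

```r
set.seed(1)
n <- 50

# Hypothetical confounded design: all controls in batch 1, all treated in batch 2
batch <- rep(c(1, 2), each = n)
group <- rep(c("control", "treated"), each = n)

# No true treatment effect; batch 2 simply reads 2 units higher (assumed shift)
y <- rnorm(2 * n, mean = 0, sd = 1) + ifelse(batch == 2, 2, 0)

# Naive comparison: the batch shift is mistaken for a treatment effect
p_confounded <- t.test(y ~ group)$p.value
print(p_confounded)

# Balanced design: each batch contains half controls and half treated samples,
# so batch can be included as a covariate and separated from the group effect
group_bal <- rep(c("control", "treated"), times = n)
fit <- lm(y ~ group_bal + factor(batch))
print(summary(fit)$coefficients)
```

In the balanced design the batch term absorbs the shift, which is why spreading conditions across batches at the design stage matters more than any correction applied afterwards.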
- Simpson's Paradox
Simpson's paradox is a phenomenon in probability and statistics in which a trend that appears in each of several groups of data disappears or reverses when the groups are combined. The cause is usually confounding between factors. For example:
library(dagdata)
data(admissions)
head(admissions)
## admissions holds the admission records for 6 different majors
#   Major Number Percent Gender  total
# 1     A    825      62      1 511.50
# 2     B    560      63      1 352.80
# 3     C    325      37      1 120.25
# 4     D    417      33      1 137.61
# 5     E    191      28      1  53.48
# 6     F    373       6      1  22.38
## Use a chi-square test to examine the relationship between gender and admission
index <- admissions$Gender==1
men <- admissions[index,]
women <- admissions[!index,]
menYes <- sum(men$Number*men$Percent/100)
menNo <- sum(men$Number*(1-men$Percent/100))
womenYes <- sum(women$Number*women$Percent/100)
womenNo <- sum(women$Number*(1-women$Percent/100))
tab <- matrix(c(menYes,womenYes,menNo,womenNo),2,2)
print(chisq.test(tab)$p.val)
## [1] 9.139492e-22
The p-value is below 0.05, so we reject the null hypothesis (that gender and admission are independent). However, if we group the data by major, this association disappears, because being male is confounded with the easier majors:
y=cbind(admissions[1:6,5],admissions[7:12,5])
y=sweep(y,2,colSums(y),"/")*100
x=rowMeans(cbind(admissions[1:6,3],admissions[7:12,3]))
library(rafalib)
mypar()
matplot(x,y,xlab="percent that gets in the major",ylab="percent that applies to major",col=c("blue","red"),cex=1.5)
legend("topleft",c("Male","Female"),col=c("blue","red"),pch=c("1","2"),box.lty=0)
The plot shows that admitted men are concentrated in the easier majors. However, once we stratify the analysis by major, this effect disappears:
y=cbind(admissions[1:6,3],admissions[7:12,3])
matplot(1:6,y,xaxt="n",xlab="major",ylab="percent",col=c("blue","red"),cex=1.5)
axis(1,1:6,LETTERS[1:6])
legend("topright",c("Male","Female"),col=c("blue","red"),pch=c("1","2"),
box.lty=0)
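The contrast between the pooled and the stratified view can also be checked numerically. The sketch below assumes that R's built-in `UCBAdmissions` table (from the base `datasets` package) is an acceptable stand-in for `admissions` from `dagdata`; it stores counts rather than rounded percentages, so the numbers may differ slightly from those above. The Cochran-Mantel-Haenszel test is swapped in here as one common way to test for association while stratifying; it is not part of the original analysis.

```r
# UCBAdmissions ships with base R: a 2 x 2 x 6 table of Admit x Gender x Dept
data(UCBAdmissions)

# Pooled over departments: a 2 x 2 table of Admit by Gender
pooled <- apply(UCBAdmissions, c(1, 2), sum)
p_pooled <- chisq.test(pooled)$p.value
print(p_pooled)

# Admission rate (in percent) within each gender and department
rates <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
print(round(100 * rates))

# Cochran-Mantel-Haenszel test: Admit vs Gender, stratified by Dept
p_stratified <- mantelhaen.test(UCBAdmissions)$p.value
print(p_stratified)
```

If the stand-in data behave like the data above, the pooled p-value should be very small while the stratified one should not indicate a clear association, which is the pattern Simpson's paradox describes.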