問(wèn)題
你有分類數(shù)據(jù)然后想要檢驗(yàn)是否這些數(shù)據(jù)值的頻數(shù)分布是否與預(yù)期不符旱捧,或者是否組間的頻數(shù)分布有(顯著)差異葵陵。
方案
頻數(shù)檢驗(yàn)通常解決兩類問(wèn)題:
- 頻數(shù)分布與預(yù)期或者理論的分布(比如50%的yes甜孤,50%的no)符合嗎碉纳?(擬合優(yōu)度檢驗(yàn))
- 兩組或多組之間的頻率分布有差異嗎惶楼?(獨(dú)立檢驗(yàn))
通常用于解決這樣問(wèn)題的統(tǒng)計(jì)檢驗(yàn)方法轮蜕,分為精確檢驗(yàn)與近似檢驗(yàn)兩種。
期望分布 | 比較組別 | |
---|---|---|
精確 | 精確二項(xiàng)檢驗(yàn) | Fisher精確檢驗(yàn) |
近似 | 卡方擬合優(yōu)度 | 獨(dú)立卡方檢驗(yàn) |
注意:精確二項(xiàng)檢驗(yàn)僅能用于有兩個(gè)水平的單變量收壕。Fisher精確檢驗(yàn)僅能用于二維列聯(lián)表(比如讥脐,當(dāng)存在一個(gè)獨(dú)立變量和一個(gè)非獨(dú)立變量時(shí)它可以使用;但不能用于兩個(gè)獨(dú)立變量和一個(gè)非獨(dú)立變量的情況)啼器。
想要檢驗(yàn)配對(duì)或被試內(nèi)效應(yīng)旬渠,我們可以使用McNemar檢驗(yàn)。使用該檢驗(yàn)必須滿足存在兩個(gè)水平的獨(dú)立變量和兩個(gè)水平的非獨(dú)立變量端壳。
想要檢驗(yàn)有重復(fù)測(cè)量的兩個(gè)變量獨(dú)立性告丢,我們可以使用Cochran-Mantel-Haenszel 檢驗(yàn)。
假設(shè)你有下面的數(shù)據(jù)损谦,其中每一行代表一個(gè)記錄:
data <- read.table(header=TRUE, text='
condition result
control 0
control 0
control 0
control 0
treatment 1
control 0
control 0
treatment 0
treatment 1
control 1
treatment 1
treatment 1
treatment 1
treatment 1
treatment 0
control 0
control 1
control 0
control 1
treatment 0
treatment 1
treatment 0
treatment 0
control 0
treatment 1
control 0
control 0
treatment 1
treatment 0
treatment 1
')
相比于以記錄的數(shù)據(jù)框存儲(chǔ)岖免,你的數(shù)據(jù)可能是計(jì)數(shù)的數(shù)據(jù)框,或者是一個(gè)列聯(lián)表照捡。本文提到的分析必須使用列聯(lián)表颅湘,你可以參見this page獲取更多解決方案信息。
擬合優(yōu)度檢驗(yàn) (期望頻率)
卡方檢驗(yàn)
想要檢驗(yàn)假設(shè):結(jié)果列result(忽略條件condition)中的兩個(gè)值在總體中幾乎相等(50%-50%)栗精。
# 為result列創(chuàng)建列聯(lián)表闯参,包含0和1兩個(gè)值
# 注意“0”和“1”是列名而不是實(shí)際的值
ct <- table(data$result)
ct
#>
#> 0 1
#> 17 13
# 也可以手動(dòng)創(chuàng)建表格
#ct <- matrix(c(17,13), ncol=2)
#colnames(ct1) <- c("0", "1")
# 執(zhí)行卡方檢驗(yàn)
chisq.test(ct)
#>
#> Chi-squared test for given probabilities
#>
#> data: ct
#> X-squared = 0.53333, df = 1, p-value = 0.4652
想要檢驗(yàn)有不同期望頻率的樣本(比如下面一個(gè)0.75,一個(gè)0.25):
# 概率表 —— 和必須為1
pt <- c(.75, .25)
chisq.test(ct, p=pt)
#>
#> Chi-squared test for given probabilities
#>
#> data: ct
#> X-squared = 5.3778, df = 1, p-value = 0.02039
如果你想要從檢驗(yàn)結(jié)果中提取信息悲立,可以將結(jié)果保存進(jìn)一個(gè)變量鹿寨,然后用str()
函數(shù)查看變量信息,接著把你想要的部分取出來(lái)薪夕。例如:
chi_res <- chisq.test(ct, p=pt)
# View all the parts that can be extracted
str(chi_res)
#> List of 9
#> $ statistic: Named num 5.38
#> ..- attr(*, "names")= chr "X-squared"
#> $ parameter: Named num 1
#> ..- attr(*, "names")= chr "df"
#> $ p.value : num 0.0204
#> $ method : chr "Chi-squared test for given probabilities"
#> $ data.name: chr "ct"
#> $ observed : 'table' int [1:2(1d)] 17 13
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:2] "0" "1"
#> $ expected : Named num [1:2] 22.5 7.5
#> ..- attr(*, "names")= chr [1:2] "0" "1"
#> $ residuals: table [1:2(1d)] -1.16 2.01
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:2] "0" "1"
#> $ stdres : table [1:2(1d)] -2.32 2.32
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:2] "0" "1"
#> - attr(*, "class")= chr "htest"
# 獲取卡方值
chi_res$statistic
#> X-squared
#> 5.377778
# 獲取p值
chi_res$p.value
#> [1] 0.02039484
精確二項(xiàng)檢驗(yàn)
精確二項(xiàng)檢驗(yàn)僅能用于存在兩個(gè)值的單變量數(shù)據(jù)脚草。
ct <- table(data$result)
ct
#>
#> 0 1
#> 17 13
binom.test(ct, p=0.5)
#>
#> Exact binomial test
#>
#> data: ct
#> number of successes = 17, number of trials = 30, p-value = 0.5847
#> alternative hypothesis: true probability of success is not equal to 0.5
#> 95 percent confidence interval:
#> 0.3742735 0.7453925
#> sample estimates:
#> probability of success
#> 0.5666667
# 使用75%的期望概率——注意1在第二列,所以只需要令p=0.25
binom.test(ct, p=0.25)
#>
#> Exact binomial test
#>
#> data: ct
#> number of successes = 17, number of trials = 30, p-value = 0.0002157
#> alternative hypothesis: true probability of success is not equal to 0.25
#> 95 percent confidence interval:
#> 0.3742735 0.7453925
#> sample estimates:
#> probability of success
#> 0.5666667
如果你想要從檢驗(yàn)結(jié)果中提取信息原献,可以將結(jié)果保存進(jìn)一個(gè)變量馏慨,然后用str()
函數(shù)查看變量信息埂淮,接著把你想要的部分取出來(lái)。例如:
bin_res <- binom.test(ct, p=0.25)
# View all the parts that can be extracted
str(bin_res)
#> List of 9
#> $ statistic : Named num 17
#> ..- attr(*, "names")= chr "number of successes"
#> $ parameter : Named num 30
#> ..- attr(*, "names")= chr "number of trials"
#> $ p.value : Named num 0.000216
#> ..- attr(*, "names")= chr "0"
#> $ conf.int : atomic [1:2] 0.374 0.745
#> ..- attr(*, "conf.level")= num 0.95
#> $ estimate : Named num 0.567
#> ..- attr(*, "names")= chr "probability of success"
#> $ null.value : Named num 0.25
#> ..- attr(*, "names")= chr "probability of success"
#> $ alternative: chr "two.sided"
#> $ method : chr "Exact binomial test"
#> $ data.name : chr "ct"
#> - attr(*, "class")= chr "htest"
# 獲取p值
bin_res$p.value
#> 0
#> 0.0002156938
# 獲取95%置信區(qū)間
bin_res$conf.int
#> [1] 0.3742735 0.7453925
#> attr(,"conf.level")
#> [1] 0.95
獨(dú)立檢驗(yàn)(比較組間)
卡方檢驗(yàn)
想要檢驗(yàn)控制和處理組結(jié)果的頻數(shù)差異写隶,使用2維列聯(lián)表倔撞。
ct <- table(data$condition, data$result)
ct
#>
#> 0 1
#> control 11 3
#> treatment 6 10
chisq.test(ct)
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: ct
#> X-squared = 3.593, df = 1, p-value = 0.05802
chisq.test(ct, correct=FALSE)
#>
#> Pearson's Chi-squared test
#>
#> data: ct
#> X-squared = 5.1293, df = 1, p-value = 0.02353
對(duì) 2x2 列表,默認(rèn)使用 Yates’s continuity correction 樟澜。這個(gè)檢驗(yàn)對(duì)小樣本進(jìn)行更加保守地估計(jì),設(shè)置選項(xiàng)correct=FALSE
使用無(wú)校正的Pearson卡方檢驗(yàn)叮盘。
Fisher精確檢驗(yàn)
對(duì)于小樣本而言Fisher精確檢驗(yàn)更為適合秩贰。小樣本的2x2列表非常典型,樣本更多柔吼、更復(fù)雜的列表計(jì)算強(qiáng)度非常大毒费。當(dāng)然,用R進(jìn)行比較復(fù)雜的計(jì)算也是沒(méi)有太大問(wèn)題的愈魏。
ct <- table(data$condition, data$result)
ct
#>
#> 0 1
#> control 11 3
#> treatment 6 10
fisher.test(ct)
#>
#> Fisher's Exact Test for Count Data
#>
#> data: ct
#> p-value = 0.03293
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.966861 45.555016
#> sample estimates:
#> odds ratio
#> 5.714369
Cochran-Mantel-Haenszel test
Cochran-Mantel-Haenszel 檢驗(yàn) (或稱為 Mantel-Haenszel 檢驗(yàn)))用于檢驗(yàn)重復(fù)測(cè)量?jī)呻x散變量的獨(dú)立性觅玻。通常使用 2x2xK列表表示,K是測(cè)量條件的次數(shù)培漏。比如你想要指導(dǎo)是否一個(gè)處理(C vs. D)是否影響了恢復(fù)的概率(yes or no)溪厘。假設(shè)該處理一天監(jiān)控測(cè)量三次——早上、中午和晚上牌柄,而你想要你的檢驗(yàn)?zāi)軌蚩刂扑D敲茨憧梢允褂肅MH檢驗(yàn)對(duì)2x2x3列聯(lián)表進(jìn)行操作,第三個(gè)變量是你想要控制的變量珊佣。
R中的CMH檢驗(yàn)可以處理比2x2xK維度更高的數(shù)據(jù)蹋宦,例如你處理3x3xK列聯(lián)表。
在接下來(lái)的例子里有三個(gè)變量:Location咒锻,Allele和Habitat冷冗。問(wèn)題是——當(dāng)控制location變量時(shí),Allel(94或非94)和Habitat(marine或estuarine)兩個(gè)變量是否獨(dú)立惑艇。
fish <- read.table(header=TRUE, text='
Location Allele Habitat Count
tillamook 94 marine 56
tillamook 94 estuarine 69
tillamook non-94 marine 40
tillamook non-94 estuarine 77
yaquina 94 marine 61
yaquina 94 estuarine 257
yaquina non-94 marine 57
yaquina non-94 estuarine 301
alsea 94 marine 73
alsea 94 estuarine 65
alsea non-94 marine 71
alsea non-94 estuarine 79
umpqua 94 marine 71
umpqua 94 estuarine 48
umpqua non-94 marine 55
umpqua non-94 estuarine 48
')
注意上面的數(shù)據(jù)是計(jì)數(shù)的數(shù)據(jù)框蒿辙,而不是像之前的例子是記錄的數(shù)據(jù)框。這里我們使用xtabs()
函數(shù)將它轉(zhuǎn)換為列聯(lián)表滨巴。
# 制造一個(gè)3維的列聯(lián)表须板,最后一個(gè)變量時(shí)要控制的Location變量
ct <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
ct
#> , , Location = alsea
#>
#> Habitat
#> Allele estuarine marine
#> 94 65 73
#> non-94 79 71
#>
#> , , Location = tillamook
#>
#> Habitat
#> Allele estuarine marine
#> 94 69 56
#> non-94 77 40
#>
#> , , Location = umpqua
#>
#> Habitat
#> Allele estuarine marine
#> 94 48 71
#> non-94 48 55
#>
#> , , Location = yaquina
#>
#> Habitat
#> Allele estuarine marine
#> 94 257 61
#> non-94 301 57
# This prints ct in a "flat" format
ftable(ct)
#> Location alsea tillamook umpqua yaquina
#> Allele Habitat
#> 94 estuarine 65 69 48 257
#> marine 73 56 71 61
#> non-94 estuarine 79 77 48 301
#> marine 71 40 55 57
# 按指定方式進(jìn)行變量輸出
ftable(ct, row.vars=c("Location","Allele"), col.vars="Habitat")
#> Habitat estuarine marine
#> Location Allele
#> alsea 94 65 73
#> non-94 79 71
#> tillamook 94 69 56
#> non-94 77 40
#> umpqua 94 48 71
#> non-94 48 55
#> yaquina 94 257 61
#> non-94 301 57
mantelhaen.test(ct)
#>
#> Mantel-Haenszel chi-squared test with continuity correction
#>
#> data: ct
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#> 0.759022
根據(jù)檢驗(yàn)結(jié)果,當(dāng)控制Location變量時(shí)Allele與Habitat變量存在相關(guān)(p=.025)兢卵。
注意列聯(lián)表的前兩個(gè)維度處理是一致的习瑰,所以前后順序變化都不會(huì)影響結(jié)果。而最后一個(gè)變量變化會(huì)導(dǎo)致結(jié)果的不同秽荤,下面是一個(gè)實(shí)例甜奄。
# 下面兩個(gè)看似不同的列聯(lián)表柠横,實(shí)際檢驗(yàn)結(jié)果相同
ct.1 <- xtabs(Count ~ Habitat + Allele + Location, data=fish)
ct.2 <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
mantelhaen.test(ct.1)
#>
#> Mantel-Haenszel chi-squared test with continuity correction
#>
#> data: ct.1
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#> 0.759022
mantelhaen.test(ct.2)
#>
#> Mantel-Haenszel chi-squared test with continuity correction
#>
#> data: ct.2
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#> 0.759022
# 把Allele放到最后,結(jié)果不同了
ct.3 <- xtabs(Count ~ Location + Habitat + Allele, data=fish)
ct.4 <- xtabs(Count ~ Habitat + Location + Allele, data=fish)
mantelhaen.test(ct.3)
#>
#> Cochran-Mantel-Haenszel test
#>
#> data: ct.3
#> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16
mantelhaen.test(ct.4)
#>
#> Cochran-Mantel-Haenszel test
#>
#> data: ct.4
#> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16
# 把Habitat放最后课兄,結(jié)果也不同
ct.5 <- xtabs(Count ~ Allele + Location + Habitat, data=fish)
ct.6 <- xtabs(Count ~ Location + Allele + Habitat, data=fish)
mantelhaen.test(ct.5)
#>
#> Cochran-Mantel-Haenszel test
#>
#> data: ct.5
#> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689
mantelhaen.test(ct.6)
#>
#> Cochran-Mantel-Haenszel test
#>
#> data: ct.6
#> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689
McNemar檢驗(yàn)
McNemar檢驗(yàn)概念上是頻數(shù)數(shù)據(jù)的一個(gè)被試內(nèi)檢驗(yàn)牍氛。例如,假設(shè)你想要檢驗(yàn)是否一個(gè)處理增加了一個(gè)人對(duì)某個(gè)問(wèn)題反應(yīng)“yes”的概率烟阐,而且你只有每個(gè)人處理前和處理后的數(shù)據(jù)搬俊。標(biāo)準(zhǔn)的卡方檢驗(yàn)將不合適,因?yàn)樗僭O(shè)了組別是獨(dú)立的蜒茄。取而代之唉擂,我們可以使用McNemar檢驗(yàn)。該檢驗(yàn)僅適用于當(dāng)存在一個(gè)獨(dú)立變量的兩次測(cè)量時(shí)檀葛。用于McNemar的列聯(lián)表與用于卡方檢驗(yàn)的非常相似玩祟,但結(jié)構(gòu)上是不同的。
假設(shè)你有下面的數(shù)據(jù)屿聋。每個(gè)對(duì)象有處理前和后的反應(yīng)空扎。
data <- read.table(header=TRUE, text='
subject time result
1 pre 0
1 post 1
2 pre 1
2 post 1
3 pre 0
3 post 1
4 pre 1
4 post 0
5 pre 1
5 post 1
6 pre 0
6 post 1
7 pre 0
7 post 1
8 pre 0
8 post 1
9 pre 0
9 post 1
10 pre 1
10 post 1
11 pre 0
11 post 0
12 pre 1
12 post 1
13 pre 0
13 post 1
14 pre 0
14 post 0
15 pre 0
15 post 1
')
如果你的數(shù)據(jù)不是寬格式,必須要進(jìn)行轉(zhuǎn)換(參見this page獲取更多信息)润讥。
library(tidyr)
data_wide <- spread(data, time, result)
data_wide
#> subject post pre
#> 1 1 1 0
#> 2 2 1 1
#> 3 3 1 0
#> 4 4 0 1
#> 5 5 1 1
#> 6 6 1 0
#> 7 7 1 0
#> 8 8 1 0
#> 9 9 1 0
#> 10 10 1 1
#> 11 11 0 0
#> 12 12 1 1
#> 13 13 1 0
#> 14 14 0 0
#> 15 15 1 0
接下來(lái)從數(shù)據(jù)框的pre和post列生成列聯(lián)表:
ct <- table( data_wide[,c("pre","post")] )
ct
#> post
#> pre 0 1
#> 0 2 8
#> 1 1 4
# 下面是用于標(biāo)準(zhǔn)卡方檢驗(yàn)的列聯(lián)表转锈,注意差別喔
# table(data[,c("time","result")])
# result
# time 0 1
# post 3 12
# pre 10 5
執(zhí)行檢驗(yàn):
mcnemar.test(ct)
#>
#> McNemar's Chi-squared test with continuity correction
#>
#> data: ct
#> McNemar's chi-squared = 4, df = 1, p-value = 0.0455
對(duì)于小樣本,它會(huì)使用連續(xù)校正楚殿。我們可以使用精確校正的McNemar檢驗(yàn)替換這種校正方式黑忱,前者更加的精確,可通過(guò)exact2x2
包獲取勒魔。
library(exact2x2)
#> Loading required package: exactci
#> Loading required package: ssanv
mcnemar.exact(ct)
#>
#> Exact McNemar test (with central confidence intervals)
#>
#> data: ct
#> b = 8, c = 1, p-value = 0.03906
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 1.072554 354.981246
#> sample estimates:
#> odds ratio
#> 8
原文鏈接:http://www.cookbook-r.com/Statistical_analysis/Frequency_tests/#cochran-mantel-haenszel-test