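0. Preface
Time to brush up on statistics. This is actually the start of yet another series, the obscure, hard-to-follow kind that nobody reads. But this kind of knowledge is genuinely valuable and brings real progress to me and to any reader who studies it carefully, so I will keep writing whatever the view count. This post is about the central limit theorem. So what does it mean?
The central limit theorem says: take a population with any distribution. Draw n observations from it at random, and do this m times. Then compute the mean of each of these m samples. The distribution of these means is close to a normal distribution, and the approximation improves as n grows.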
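Stated a little more formally, in the classical form of the theorem (the symbols $X_i$, $\mu$, $\sigma^2$ and $\bar{X}_n$ are notation introduced here, not taken from the video):

```latex
Let $X_1, \dots, X_n$ be independent draws from the same population with
mean $\mu$ and finite variance $\sigma^2 > 0$, and let
$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then
\[
  \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1)
  \quad \text{as } n \to \infty,
\]
so for large $n$ the sample mean is approximately
$\mathcal{N}(\mu, \sigma^2 / n)$.
```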
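In the words of the StatQuest guy: even if you’re not normal, the average is normal.
Whatever distribution the population follows (apart from distributions whose mean cannot be computed; the classical theorem also needs a finite variance), the sample means are approximately normal, so you do not have to worry about the distribution of the original data. Magic! The video uses the uniform and exponential distributions as examples to illustrate the theorem.
You can find the video by searching for statquest on Bilibili.
I reproduced the plots the instructor draws in the video using R. Great fun.
1. Preparing the data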
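The exception in parentheses can be seen in a quick simulation. This is a minimal sketch and not something from the video: the standard Cauchy distribution (`rcauchy`), which has no mean, is my own choice of counterexample, and the helper name `iqr_of_means` is made up for this illustration. It measures how spread out the sample means are for a small and a large sample size.

```r
# Interquartile range of m sample means, each computed from n draws of rfun.
# (iqr_of_means is a hypothetical helper name, not from the original post.)
iqr_of_means <- function(rfun, n, m = 2000) {
  IQR(replicate(m, mean(rfun(n))))
}

set.seed(1004)
# Exponential population (rate 0.06, as in this post): means tighten as n grows.
iqr_exp_small <- iqr_of_means(function(n) rexp(n, rate = 0.06), n = 10)
iqr_exp_large <- iqr_of_means(function(n) rexp(n, rate = 0.06), n = 1000)

# Standard Cauchy population (no mean): means stay just as spread out.
iqr_cauchy_small <- iqr_of_means(rcauchy, n = 10)
iqr_cauchy_large <- iqr_of_means(rcauchy, n = 1000)

round(c(iqr_exp_small, iqr_exp_large, iqr_cauchy_small, iqr_cauchy_large), 2)
```

For the exponential population the spread should shrink by roughly a factor of 10 when n goes from 10 to 1000, while for the Cauchy population it should stay near 2 for both sample sizes, because the mean of n standard Cauchy draws is itself standard Cauchy.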
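rm(list = ls())      # start from a clean workspace
library(ggplot2)
library(patchwork)   # lets ggplot objects be combined with + and /
# Density of each distribution evaluated at x = 1..100
df = data.frame(x = 1:100,
                y1 = dnorm(1:100,50,20),   # normal: mean 50, sd 20
                y2 = dunif(1:100,1,100),   # uniform on [1, 100]
                y3 = dexp(1:100,0.06))     # exponential: rate 0.06
# 100 random draws from each distribution; the seed is reset before each draw
set.seed(1004)
rn1 = rnorm(100,50,20)
set.seed(1004)
rn2 = runif(100,1,100)
set.seed(1004)
rn3 = rexp(100,0.06)
rn = data.frame(x = 1:100,
                rn1 = rn1,
                rn2 = rn2,
                rn3 = rn3)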
head(df)
## x y1 y2 y3
## 1 1 0.0009918677 0.01010101 0.05650587
## 2 2 0.0011197265 0.01010101 0.05321523
## 3 3 0.0012609110 0.01010101 0.05011621
## 4 4 0.0014163519 0.01010101 0.04719767
## 5 5 0.0015869826 0.01010101 0.04444909
## 6 6 0.0017737296 0.01010101 0.04186058
head(rn)
## x rn1 rn2 rn3
## 1 1 37.84318 27.89301 12.995556
## 2 2 65.35258 25.35061 28.127434
## 3 3 46.71456 78.08598 4.341248
## 4 4 49.42446 97.92214 0.180288
## 5 5 50.27116 44.04110 14.291344
## 6 6 35.68566 91.49909 36.579238
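There are two data frames: df holds the probability density of each value under a given distribution, and rn holds actual values drawn from that distribution. In both, columns 2, 3 and 4 correspond to the normal, uniform and exponential distributions.
2. Plots of the three distributions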
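The difference between the two data frames comes down to R's naming convention for distributions: the `d` prefix returns the density at the values you pass in, while the `r` prefix draws random values. A small sketch with the same parameters as above (the variable names are only for illustration):

```r
# d* functions return the density at the given points; no randomness involved.
dens_unif <- dunif(5, min = 1, max = 100)    # 1 / (100 - 1), about 0.0101
dens_norm <- dnorm(50, mean = 50, sd = 20)   # height of the normal curve at its peak

# r* functions draw random values, so the result depends on the seed.
set.seed(1004)
draws_a <- rexp(3, rate = 0.06)
set.seed(1004)
draws_b <- rexp(3, rate = 0.06)

identical(draws_a, draws_b)   # TRUE: same seed, same draws
```

This is also why the y2 column of df shows the same value, 0.01010101, in every row: the uniform density is constant over its range.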
# 1. Normal distribution
p1 = ggplot(df,aes(x = x,y = y1))+
  geom_line()+theme_classic()
# 2. Uniform distribution
p2 = ggplot(df,aes(x = x,y = y2))+
  geom_line()+theme_classic()
# 3. Exponential distribution
p3 = ggplot(df,aes(x = x,y = y3))+
  geom_line()+theme_classic()
p1+p2+p3
3. Distribution of the means for normally distributed data
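# Draw a vertical line at each sample mean
a1 = p1
n = numeric(100)   # preallocate room for the 100 means
for(i in 1:100){
  # mean of 50 values drawn without replacement from the 100 normal draws
  n[i] = mean(sample(rn$rn1,50))
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  a1 = a1 + geom_vline(xintercept = n[i],color = "red",linewidth = 0.3,alpha = 0.3)
}
# Histogram of the means
dat = data.frame(n = n)
b1 = ggplot(dat,aes(x = n,y = after_stat(density)))+
  geom_histogram(color = "#D0505D",
                 fill = "#D0505D",
                 alpha = 0.4,binwidth = 1)+
  theme_classic()+
  scale_y_continuous(expand = c(0,0))
# Overlay a normal curve
y = data.frame(
  x = seq(40,62,0.2),
  y1 = dnorm(seq(40,62,0.2),50,2))
b1 = b1 + geom_line(aes(x = x,y = y1),data = y)
a1 + b1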
Conclusion: the means of normally distributed data follow a normal distribution.
4. Distribution of the means for uniformly distributed data
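The overlaid curve uses `dnorm(..., 50, 2)`, and the standard deviation of 2 can be motivated rather than eyeballed. Because `sample(rn$rn1, 50)` draws 50 of the 100 stored values without replacement, the spread of the means is the usual standard error shrunk by a finite population correction. A rough sketch of that calculation follows; it is my own reading of the numbers rather than something stated in the video, and names such as `pool` and `se_fpc` are illustrative.

```r
set.seed(1004)
pool <- rnorm(100, 50, 20)   # same call that produces rn1 in this post
N <- length(pool)            # size of the pool being resampled
k <- 50                      # values drawn per sample

# Standard error of the mean of k values drawn WITHOUT replacement from pool
se_fpc <- sd(pool) / sqrt(k) * sqrt(1 - k / N)

# Empirical check: spread of many resampled means
means <- replicate(5000, mean(sample(pool, k)))
c(theory = se_fpc, simulated = sd(means))
```

With these parameters both numbers should land near 2, in line with the curve drawn above. If the samples were drawn with replacement, or freshly from the full distribution, the correction factor would drop out and the value would be closer to 20 / sqrt(50), about 2.8.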
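# Draw a vertical line at each sample mean
a2 = p2
n = numeric(100)   # preallocate room for the 100 means
for(i in 1:100){
  # mean of 50 values drawn without replacement from the 100 uniform draws
  n[i] = mean(sample(rn$rn2,50))
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  a2 = a2 + geom_vline(xintercept = n[i],color = "red",linewidth = 0.3,alpha = 0.3)
}
# Histogram of the means
dat = data.frame(n = n)
b2 = ggplot(dat,aes(x = n,y = after_stat(density)))+
  geom_histogram(color = "#D0505D",
                 fill = "#D0505D",
                 alpha = 0.4,binwidth = 1)+
  theme_classic()+
  scale_y_continuous(expand = c(0,0))
# Overlay a normal curve
y = data.frame(
  x = 40:62,
  y1 = dnorm(40:62,50,3))
b2 = b2 + geom_line(aes(x = x,y = y1),data = y)
a2 + b2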
Conclusion: the means of uniformly distributed data are approximately normal.
5. Distribution of the means for exponentially distributed data
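# Draw a vertical line at each sample mean
a3 = p3
n = numeric(100)   # preallocate room for the 100 means
for(i in 1:100){
  # mean of 50 values drawn without replacement from the 100 exponential draws
  n[i] = mean(sample(rn$rn3,50))
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  a3 = a3 + geom_vline(xintercept = n[i],color = "red",linewidth = 0.3,alpha = 0.3)
}
# Histogram of the means
dat = data.frame(n = n)
b3 = ggplot(dat,aes(x = n,y = after_stat(density)))+
  geom_histogram(color = "#D0505D",
                 fill = "#D0505D",
                 alpha = 0.4,binwidth = 1)+
  theme_classic()+
  scale_y_continuous(expand = c(0,0))
# Overlay a normal curve
y = data.frame(
  x = seq(11,22,0.1),
  y1 = dnorm(seq(11,22,0.1),16.5,1.5))
b3 = b3 + geom_line(aes(x = x,y = y1),data = y)
a3 + b3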
Conclusion: the means of exponentially distributed data are approximately normal too!
6. Finally, a family portrait
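The exponential case is the most striking one because the raw data are strongly right-skewed. One way to put a number on this is the sample skewness, which is close to 0 for symmetric data. A minimal sketch, drawing directly from `rexp` instead of resampling `rn$rn3`, with a hand-written `skewness` helper (base R does not ship one; the name is my own):

```r
# Sample skewness: third central moment divided by the cube of the sd.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1004)
raw   <- rexp(5000, rate = 0.06)                        # individual values
means <- replicate(5000, mean(rexp(50, rate = 0.06)))   # means of 50 values

c(raw = skewness(raw), means = skewness(means))
```

For an exponential distribution the theoretical skewness is 2, and for the mean of n independent draws it is 2 / sqrt(n), about 0.28 for n = 50, so the means should come out far more symmetric than the raw values.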
(p1+p2+p3)/(a1+a2+a3)/(b1+b2+b3)
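Sections 3 to 5 repeat the same resampling loop three times. If you want to experiment with other sample sizes, the loop could be wrapped in a small function. This is only a sketch: `sample_means` is a hypothetical name, and the plotting code is left out.

```r
# Means of m samples of size k, each drawn without replacement from pool.
# (sample_means is a hypothetical helper, not part of the original post.)
sample_means <- function(pool, k = 50, m = 100) {
  vapply(seq_len(m), function(i) mean(sample(pool, k)), numeric(1))
}

set.seed(1004)
pool <- runif(100, 1, 100)                # same call that produces rn2
means_k50 <- sample_means(pool)           # the setting used in this post
means_k10 <- sample_means(pool, k = 10)   # smaller samples for comparison

c(sd_k50 = sd(means_k50), sd_k10 = sd(means_k10))   # smaller k, wider spread
```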
R really is a great toy for learning statistics!