計(jì)數(shù)模型的dependent variable為正整數(shù)矿卑,計(jì)數(shù)模型有很多回歸的方法糠赦,這里我們只討論泊松回歸以及負(fù)二項(xiàng)回歸
Poisson Model
Particularly useful for “rare” events凄杯,即數(shù)量少的事件错英,例如家庭擁有車輛數(shù)的預(yù)測(cè)骑冗。
若數(shù)據(jù)符合泊松分布赊瞬,那么事件發(fā)生的間隔獨(dú)立且符合指數(shù)分布。這里我就不放公式了贼涩。運(yùn)用泊松回歸需要注意的是均值與方差相等, 且均值大于等于0 (μ = σ^2)(μ = e^βo+β1X1)
-
回歸結(jié)果解釋
log(daysabsent) = 2.647 ? 0.014LangArts ? 0.409Male
如果在語言藝術(shù)這門考試中表現(xiàn)得好巧涧,那么就會(huì)有更少的缺席天數(shù),如果你是男性磁携,你將會(huì)有更少的缺席天數(shù)褒侧。
image.png(1)AgeGroup 45-54: e^1.90 = 6.86: The ratio of crash incidences is 6.86 times greater for 45-54 when compared to those aged 35 and under (baseline)
(2)RegionSouth: e^0.86 = 2.36: The ratio of crash incidences is 2.4 times greater for the South when compared to North Region (baseline) -
回歸模型的質(zhì)量 (Poisson Deviance)用來衡量模型有多接近完美image.png
Negetive Binomial Model
overdispersion 時(shí)我們會(huì)選擇負(fù)二項(xiàng)回歸, 即 variance is larger than mean (i.e, σ^2 > μ)
在這個(gè)模型中我們依舊加入了error term ε 獨(dú)立分布,E(ε)符合Gamma分布
實(shí)際上泊松模型是一個(gè)特殊的負(fù)二項(xiàng)模型形式谊迄,當(dāng)NB model的dispersion parameter θ = ∞時(shí)闷供,負(fù)二項(xiàng)模型變?yōu)椴此赡P汀N覀冊(cè)跀?shù)據(jù)分析時(shí)可以兩個(gè)模型都做一下统诺,然后通過對(duì)比他們的AIC值或者利用似然比檢驗(yàn)來比較哪個(gè)模型更好歪脏。
library(foreign)
library(spam) #needed for library (fields)
library(fields)
library(MASS)
library(car)
library(ggplot2)
setwd(...)
pdata <- read.csv("poissonreg.csv")
names(pdata)
hist(pdata$daysabs,breaks=40,xlim=c(0,40),col="gray",main="Days Absent in HS",ylab="Num of HS juniors", xlab="Days Absent")
PModel<-glm(daysabs~math+langarts+male,family=poisson, data=pdata)
summary(PModel)
#rerun without math since not significant
PModel2<-glm(daysabs~langarts+male,family=poisson, data=pdata)
summary(PModel2)
pdata$phat<-predict(PModel2, type = "response")
pdata <- pdata[with(pdata, order(langarts, male)), ]
ggplot(pdata, aes(x = langarts, y = phat)) +
geom_point(aes(y = daysabs), alpha = 0.5, position = position_jitter(h = 0.2)) +
geom_line(size = 1) +
labs(x = "Language Arts Scores", y = "Expected days absent")
##dispersion
#you would like this number to be close to 1; if it is larger than one, your data is overdispersed.
#PModel2$df.res <= degrees of freedom of your model
#residuals(PModel2) <= actual-predicted value from PModel2
dp<-sum(residuals(PModel2,type="pearson")^2)/PModel2$df.res
plot(residuals(PModel2))
mean(pdata$daysabs)
var(pdata$daysabs)
sd(pdata$daysabs)
#if greater than one, the data shows signs of overdispersion
dp<-round(dp,2)
library(faraway) #written by Julian Faraway
par(mfrow=c(1,2))
halfnorm(residuals(PModel2))
plot(log(fitted(PModel2)),
log((pdata$daysabs-fitted(PModel2))^2),
xlab=expression(hat(mu)),
ylab=expression((y-hat(mu))^2),
main=bquote("Dispersion"==.(dp)))
abline(0,1)
##alternative method
library(AER)
dispersiontest(PModel2)
##Negative binomial model
nbmod<-glm.nb(formula=daysabs~math+langarts+male,link=log,data=pdata)
summary(nbmod)
##rerun without math since not significant
nbmod2<-glm.nb(formula=daysabs~langarts+male,link=log,data=pdata)
summary(nbmod2)
##likelihood ratio test
library(zoo)
library(lmtest)
lrtest(nbmod2)
lrtest(PModel2)