5 Poisson data
Natural model for simple counts: \[y_{i}\sim \ Poisson(\mu_{i})\] where \(g(\mu_{i})=x_{i}'\beta\).
R allows the use of the three following links:
log: This is the most common link (hence also the default link). \[\log(\mu_{i})=x_{i}'\beta\] Note with this link: \[\mu_{i}=\exp(x_{i}'\beta)=\prod_{j}\exp(x_{ij}\beta_{j})\] that is a multiplicative structure for \(\mu_{i}\). With this link \(\mu_{i}\geq 0\) and \(-\infty < x_{i}'\beta<\infty\).
sqrt: \[\sqrt(\mu_{i})=x_{i}'\beta\] Thus: \[\mu_{i}=\left(x_{i}'\beta\right)^{2}.\] With this link \(\mu_{i}\geq 0\) and \(-\infty < x_{i}'\beta<\infty\).
identity: \[\mu_{i}=x_{i}'\beta\] Note: with this link \(\mu_{i}\) is not guaranteed to be greater than zero and thus we have to be careful when using it.
5.1 Poisson Examples
5.1.1 Example 1: Poisson response with one covariate
The following data gives Poisson observations y for three values of a covariate x (x=0,1,2). Compare the fits of the model \[\eta_{i}=\beta_{0}+\beta^{x}x_{i}\]
under the log and sqrt links. Can the model be simplified?
| x=0 | x=1 | x=2 |
|---|---|---|
| 1,0,1 | 2,3,2 | 3,4,2 |
| 3,0,2 | 1,0,2 | 0,5,1 |
| 0,1,2 | 1,4,1 | 4,2,3 |
| 1 | 3,1 |
Read in the data considering each level of the covariate in turn.
y<-c(1,0,1,3,0,2,0,1,2,1,2,3,2,1,0,2,1,4,1,3,4,2,0,5,1,4,2,3,3,1)
x<-c(rep(0,10),rep(1,9),rep(2,11))Some Exploratory Data Analysis shown in Figure 5.1 shows possibility of Poisson counts with mean increasing with x.
plot(x,y)
Figure 5.1: Exploratory Data: Analysis Poisson Counts
Tidy up the inputted values - by storing them in a data frame and then removing the individual variables.
ex1.df<-data.frame(y,x)
ex1.df y x
1 1 0
2 0 0
3 1 0
4 3 0
5 0 0
6 2 0
7 0 0
8 1 0
9 2 0
10 1 0
11 2 1
12 3 1
13 2 1
14 1 1
15 0 1
16 2 1
17 1 1
18 4 1
19 1 1
20 3 2
21 4 2
22 2 2
23 0 2
24 5 2
25 1 2
26 4 2
27 2 2
28 3 2
29 3 2
30 1 2
rm(x,y)Begin by fitting the model using the Poisson default log link.
ex1.glm1<-glm(y~x,poisson,data=ex1.df)
summary(ex1.glm1)
Call:
glm(formula = y ~ x, family = poisson, data = ex1.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2691 -0.5870 -0.1265 0.6162 1.4921
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.1239 0.2608 0.475 0.635
x 0.4109 0.1705 2.409 0.016 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 35.057 on 29 degrees of freedom
Residual deviance: 28.941 on 28 degrees of freedom
AIC: 97.474
Number of Fisher Scoring iterations: 5
Then fit the model using the sqrt link.
ex1.glm2<-glm(y~x,poisson(link=sqrt),data=ex1.df)
summary(ex1.glm2)
Call:
glm(formula = y ~ x, family = poisson(link = sqrt), data = ex1.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2609 -0.6219 -0.1038 0.6384 1.4813
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0523 0.1451 7.251 4.15e-13 ***
x 0.2732 0.1092 2.502 0.0124 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 35.057 on 29 degrees of freedom
Residual deviance: 28.905 on 28 degrees of freedom
AIC: 97.438
Number of Fisher Scoring iterations: 4
As can be seen in the output above, the sqrt link has a slightly better deviance (NB no formal test between links – prefer one with lower deviance). For this link we shall try to see if the model can be simplified.
anova(ex1.glm2,test="Chi")Analysis of Deviance Table
Model: poisson, link: sqrt
Response: y
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 29 35.057
x 1 6.1517 28 28.905 0.01313 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(ex1.glm2)
Figure 5.2: Diagnostic Plots: Poisson Counts
From this we see the covariate is significant and thus we can not simplify the model. The residual plot in Figure 5.2 gives us no cause for concern.
5.1.2 Example 2: Poisson response with two factors
The following data records Poisson observations over two factors A and B at 3 and 2 levels respectively. Find a suitable model for the data.
Read in the data a row at a time (e.g. level 1 then level 2 of B). Create the data frame. Then plot a histogram of the counts ignoring the factors; it shows possibly Poisson count data in Figure 5.3.
y<-c(14,6,6,8,8,16,10,9,10,16,14,20,24,8,10,8,10,12,12,9,10,6,14,18,18,26,24,24)
A<-factor(rep(c(1,2,3,1,2,3),c(4,5,4,2,8,5)))
B<-factor(rep(c(1,2),c(13,15)))
ex2.df<-data.frame(y,A,B)
ex2.df y A B
1 14 1 1
2 6 1 1
3 6 1 1
4 8 1 1
5 8 2 1
6 16 2 1
7 10 2 1
8 9 2 1
9 10 2 1
10 16 3 1
11 14 3 1
12 20 3 1
13 24 3 1
14 8 1 2
15 10 1 2
16 8 2 2
17 10 2 2
18 12 2 2
19 12 2 2
20 9 2 2
21 10 2 2
22 6 2 2
23 14 2 2
24 18 3 2
25 18 3 2
26 26 3 2
27 24 3 2
28 24 3 2
rm(y,A,B)
attach(ex2.df)hist(y)
Figure 5.3: Histogram of y variable
Plotting boxplots in Figure 5.4 for each factor allows the further exploration of the data; suggesting that both levels of B may be the same and that levels 1 and 2 of A might be the same.
par(mfrow=c(1,2))
plot(y~A,xlab="A",ylab="Counts")
plot(y~B,xlab="B",ylab="Counts")
Figure 5.4: Boxplots of counts against Levels
detach(ex2.df)Fit the full model first; note deviance is not zero as there are replicates at various factor combinations. Then use the step command to knock out terms as per the code and output below.
ex2.glm.full<-glm(y~A*B,family=poisson,data=ex2.df)
ex2.glm.full
Call: glm(formula = y ~ A * B, family = poisson, data = ex2.df)
Coefficients:
(Intercept) A2 A3 B2 A2:B2
2.14007 0.22079 0.77770 0.05716 -0.10300
A3:B2
0.11611
Degrees of Freedom: 27 Total (i.e. Null); 22 Residual
Null Deviance: 70.18
Residual Deviance: 18.58 AIC: 152.1
ex2.glm1<-step(ex2.glm.full)Start: AIC=152.07
y ~ A * B
Df Deviance AIC
- A:B 2 19.475 148.96
<none> 18.579 152.07
Step: AIC=148.96
y ~ A + B
Df Deviance AIC
- B 1 20.026 147.51
<none> 19.475 148.96
- A 2 68.916 194.40
Step: AIC=147.51
y ~ A
Df Deviance AIC
<none> 20.026 147.51
- A 2 70.184 193.67
The step algorithm leads to a model that has just factor A in it.
summary(ex2.glm1)
Call:
glm(formula = y ~ A, family = poisson, data = ex2.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.51214 -0.60109 -0.09749 0.57646 1.66174
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.1595 0.1387 15.572 < 2e-16 ***
A2 0.1734 0.1634 1.061 0.289
A3 0.8582 0.1571 5.465 4.64e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 70.184 on 27 degrees of freedom
Residual deviance: 20.026 on 25 degrees of freedom
AIC: 147.51
Number of Fisher Scoring iterations: 4
However, examination of the parameter values and t statistics suggest that level 2 of A may not be different to level 1 of A. Therefore create a new factor variable, ATRY, which has levels 1 and 2 of A the same. Add this to the data frame and remove unwanted variable. Fit new model with ATRY instead of A.
ATRY<-factor(rep(c(1,2,1,2),c(9,4,10,5)))
ex2.df<-data.frame(ex2.df,ATRY)
rm(ATRY)
ex2.glm2<-glm(y~ATRY,family=poisson,data=ex2.df)We need to check that this model is better and if so whether it can be further simplified (unlikely from Wald statistics).
summary(ex2.glm2)
Call:
glm(formula = y ~ ATRY, family = poisson, data = ex2.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5121 -0.5908 -0.0158 0.7029 1.8166
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.28131 0.07332 31.113 < 2e-16 ***
ATRY2 0.73640 0.10398 7.082 1.42e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 70.184 on 27 degrees of freedom
Residual deviance: 21.181 on 26 degrees of freedom
AIC: 146.67
Number of Fisher Scoring iterations: 4
anova(ex2.glm2,ex2.glm1,test="Chi")Analysis of Deviance Table
Model 1: y ~ ATRY
Model 2: y ~ A
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 26 21.181
2 25 20.026 1 1.1545 0.2826
drop1(ex2.glm2)Single term deletions
Model:
y ~ ATRY
Df Deviance AIC
<none> 21.181 146.67
ATRY 1 70.184 193.67
It appears that modelling with the reduced levels of A is better and that the model can not be further simplified.
Residuals in Figure 5.5 give us no cause for concern.
par(mfrow=c(2,2))
plot(ex2.glm2)
Figure 5.5: Diagnostic Plots
5.1.3 Example 3: Poisson response with two covariates
The following data shows a Poisson response y together with two associated covariate values x1 and x2. Set up data frame.
y<-c(3,10,4,24,5,4,43,3,2,26,3,2)
x1<-c(14.154,31.817,2.203,22.646,8.585,2.160,53.517,6.234,2.858,34.124,2.484,6.619)
x2<-c(0.1132,4.5437,5.1989,15.0614,2.6844,11.2151,22.5853,0.7164,0.8493,16,5.6245,0.1385)
ex3.df<-data.frame(y,x1,x2)
ex3.df y x1 x2
1 3 14.154 0.1132
2 10 31.817 4.5437
3 4 2.203 5.1989
4 24 22.646 15.0614
5 5 8.585 2.6844
6 4 2.160 11.2151
7 43 53.517 22.5853
8 3 6.234 0.7164
9 2 2.858 0.8493
10 26 34.124 16.0000
11 3 2.484 5.6245
12 2 6.619 0.1385
rm(y,x1,x2)Then plot the response against each of the covariates; in both cases there appears to be an increase in y with x as illustrated in Figure 5.6.
attach(ex3.df)
par(mfrow=c(1,2))
plot(x1,y)
plot(x2,y)
Figure 5.6: Plotting y against both x1 and x2
Try fitting an additive model with the two covariates.
ex3.glm1<-glm(y~x1+x2,family=poisson,data=ex3.df)
summary(ex3.glm1)
Call:
glm(formula = y ~ x1 + x2, family = poisson, data = ex3.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3794 -0.7879 -0.3445 0.4833 2.0939
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.048961 0.183864 5.705 1.16e-08 ***
x1 0.019498 0.009257 2.106 0.035182 *
x2 0.081492 0.022829 3.570 0.000357 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 142.352 on 11 degrees of freedom
Residual deviance: 11.962 on 9 degrees of freedom
AIC: 62.063
Number of Fisher Scoring iterations: 4
The deviance 11.962 on 9 degrees of freedom is a respectable figure. Need to check diagnostic plots in Figure 5.7 to check quality of fit okay.
par(mfrow=c(2,2))
plot(ex3.glm1)
Figure 5.7: Diagnostic Plots
The plot of the residuals against fitted values shows a clear parabolic trend. The normal score plot is slightly curved. The 7th observation appears to have a lot of leverage. By plotting deviances against individual covariates in Figure 5.8 we aim to find out more.
par(mfrow=c(1,2))
plot(x1,ex3.glm1$res)
plot(x2,ex3.glm1$res)
Figure 5.8: Residual Plots
The plot against x1 shows a divergent linear trend whilst the plot against x2 shows a fan. It would seem to be sensible to compress the x1 scale by taking logs which might sort out the plot against the deviance.
logx1 <- log(x1)
detach(ex3.df)
ex3.df<-data.frame(ex3.df,logx1)
rm(logx1)
ex3.df y x1 x2 logx1
1 3 14.154 0.1132 2.6499973
2 10 31.817 4.5437 3.4600007
3 4 2.203 5.1989 0.7898201
4 24 22.646 15.0614 3.1199832
5 5 8.585 2.6844 2.1500165
6 4 2.160 11.2151 0.7701082
7 43 53.517 22.5853 3.9799994
8 3 6.234 0.7164 1.8300182
9 2 2.858 0.8493 1.0501221
10 26 34.124 16.0000 3.5300009
11 3 2.484 5.6245 0.9098702
12 2 6.619 0.1385 1.8899443
Fit the new model with the new variable log(x1).
ex3.glm2<-glm(y~logx1+x2, family = poisson, data = ex3.df)
ex3.glm2
Call: glm(formula = y ~ logx1 + x2, family = poisson, data = ex3.df)
Coefficients:
(Intercept) logx1 x2
0.34705 0.44301 0.07829
Degrees of Freedom: 11 Total (i.e. Null); 9 Residual
Null Deviance: 142.4
Residual Deviance: 4.348 AIC: 54.45
summary(ex3.glm2)
Call:
glm(formula = y ~ logx1 + x2, family = poisson, data = ex3.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8047 -0.4720 -0.1825 0.2832 1.2641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.34705 0.30770 1.128 0.25937
logx1 0.44301 0.13495 3.283 0.00103 **
x2 0.07829 0.01681 4.657 3.21e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 142.3521 on 11 degrees of freedom
Residual deviance: 4.3478 on 9 degrees of freedom
AIC: 54.448
Number of Fisher Scoring iterations: 4
The new deviance is an improvement, but what about the residuals in Figure 5.9?
par(mfrow=c(3,2))
attach(ex3.df)
plot(ex3.glm2)
plot(x1,ex3.glm2$res)
plot(x2,ex3.glm2$res)
Figure 5.9: Diagnostic Plots and Residual Plots: including logarithm term
The normal score plot is reasonably linear apart from a strange tail. The plot of the residuals against fitted values is better, but possible a detectable arc? The plot of the deviance against x1 is now sorted. The plot of the deviance against x2 now seems like a divergent arc. This might indicate a quadratic or square root of x2 is needed; we shall create the new variables sqx2 and sqrtx2 and try models with them instead of x2.
sqx2<-x2*x2
sqrtx2<-sqrt(x2)
detach(ex3.df)
ex3.df<-data.frame(ex3.df,sqx2,sqrtx2)
rm(sqx2, sqrtx2)
ex3.df y x1 x2 logx1 sqx2 sqrtx2
1 3 14.154 0.1132 2.6499973 0.01281424 0.3364521
2 10 31.817 4.5437 3.4600007 20.64520969 2.1315956
3 4 2.203 5.1989 0.7898201 27.02856121 2.2801096
4 24 22.646 15.0614 3.1199832 226.84576996 3.8809020
5 5 8.585 2.6844 2.1500165 7.20600336 1.6384139
6 4 2.160 11.2151 0.7701082 125.77846801 3.3488953
7 43 53.517 22.5853 3.9799994 510.09577609 4.7523994
8 3 6.234 0.7164 1.8300182 0.51322896 0.8464042
9 2 2.858 0.8493 1.0501221 0.72131049 0.9215747
10 26 34.124 16.0000 3.5300009 256.00000000 4.0000000
11 3 2.484 5.6245 0.9098702 31.63500025 2.3716028
12 2 6.619 0.1385 1.8899443 0.01918225 0.3721559
Trying the new models:
ex3.glm3<-glm(y~logx1+sqx2, family = poisson, data = ex3.df)
ex3.glm3
Call: glm(formula = y ~ logx1 + sqx2, family = poisson, data = ex3.df)
Coefficients:
(Intercept) logx1 sqx2
0.501070 0.527566 0.002563
Degrees of Freedom: 11 Total (i.e. Null); 9 Residual
Null Deviance: 142.4
Residual Deviance: 12.32 AIC: 62.42
summary(ex3.glm3)
Call:
glm(formula = y ~ logx1 + sqx2, family = poisson, data = ex3.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5991 -0.7583 -0.1730 0.4160 2.0488
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.5010699 0.3483684 1.438 0.150339
logx1 0.5275662 0.1427855 3.695 0.000220 ***
sqx2 0.0025633 0.0006604 3.882 0.000104 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 142.35 on 11 degrees of freedom
Residual deviance: 12.32 on 9 degrees of freedom
AIC: 62.421
Number of Fisher Scoring iterations: 4
ex3.glm4<-glm(y~logx1+sqrtx2, family = poisson, data = ex3.df)
ex3.glm4
Call: glm(formula = y ~ logx1 + sqrtx2, family = poisson, data = ex3.df)
Coefficients:
(Intercept) logx1 sqrtx2
-0.2341 0.4747 0.4535
Degrees of Freedom: 11 Total (i.e. Null); 9 Residual
Null Deviance: 142.4
Residual Deviance: 1.502 AIC: 51.6
summary(ex3.glm4)
Call:
glm(formula = y ~ logx1 + sqrtx2, family = poisson, data = ex3.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.55196 -0.25190 -0.06186 0.14680 0.81518
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.23408 0.30879 -0.758 0.448428
logx1 0.47471 0.12242 3.878 0.000105 ***
sqrtx2 0.45347 0.09362 4.844 1.27e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 142.352 on 11 degrees of freedom
Residual deviance: 1.502 on 9 degrees of freedom
AIC: 51.602
Number of Fisher Scoring iterations: 4
The deviance using sqx2 is 12.32 on 9 d.f. which is not worth following up after 4.348 with the previous model. The deviance for the model with sqrtx2 looks promising, but what about the residual plots in Figure 5.10?
par(mfrow=c(3,2))
attach(ex3.df)
plot(ex3.glm4)
plot(x1,ex3.glm4$res)
plot(x2,ex3.glm4$res)
Figure 5.10: Diagnostic Plots and Residual Plots: including square root of x2
The normal score plot is slightly concave. The plot of the residuals against fitted values is okay, but possibly a fan? The plot of residuals against x2 is better but still shows a divergent fan. The plot of residuals against x1 is okay. The plot of the square root of the absolute deviance residuals against predicted values shows a clear divergent arc. The leverage of the 7th data point is not as disturbing any more.
Since x1 is okay we shall concentrate on x2. Taking the square root of x2 has helped, so we shall try a new variable that is the fourth root.
fourthx2<-x2^0.25
detach(ex3.df)
ex3.df<-data.frame(ex3.df,fourthx2)
rm(fourthx2)
ex3.df y x1 x2 logx1 sqx2 sqrtx2 fourthx2
1 3 14.154 0.1132 2.6499973 0.01281424 0.3364521 0.5800449
2 10 31.817 4.5437 3.4600007 20.64520969 2.1315956 1.4599985
3 4 2.203 5.1989 0.7898201 27.02856121 2.2801096 1.5100032
4 24 22.646 15.0614 3.1199832 226.84576996 3.8809020 1.9700005
5 5 8.585 2.6844 2.1500165 7.20600336 1.6384139 1.2800054
6 4 2.160 11.2151 0.7701082 125.77846801 3.3488953 1.8299987
7 43 53.517 22.5853 3.9799994 510.09577609 4.7523994 2.1799999
8 3 6.234 0.7164 1.8300182 0.51322896 0.8464042 0.9200023
9 2 2.858 0.8493 1.0501221 0.72131049 0.9215747 0.9599868
10 26 34.124 16.0000 3.5300009 256.00000000 4.0000000 2.0000000
11 3 2.484 5.6245 0.9098702 31.63500025 2.3716028 1.5400009
12 2 6.619 0.1385 1.8899443 0.01918225 0.3721559 0.6100458
ex3.glm5<-glm(y~logx1+fourthx2, family = poisson, data = ex3.df)
ex3.glm5
Call: glm(formula = y ~ logx1 + fourthx2, family = poisson, data = ex3.df)
Coefficients:
(Intercept) logx1 fourthx2
-1.2133 0.5224 1.3280
Degrees of Freedom: 11 Total (i.e. Null); 9 Residual
Null Deviance: 142.4
Residual Deviance: 1.737 AIC: 51.84
summary(ex3.glm5)
Call:
glm(formula = y ~ logx1 + fourthx2, family = poisson, data = ex3.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.75819 -0.20401 0.05865 0.23673 0.69476
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2133 0.4220 -2.875 0.00404 **
logx1 0.5224 0.1161 4.500 6.79e-06 ***
fourthx2 1.3280 0.2827 4.698 2.62e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 142.3521 on 11 degrees of freedom
Residual deviance: 1.7371 on 9 degrees of freedom
AIC: 51.837
Number of Fisher Scoring iterations: 4
The new deviance is still good, so then examine the diagnostic plots in Figure 5.11.
attach(ex3.df)
par(mfrow=c(3,2))
plot(ex3.glm5)
plot(x1, ex3.glm5$res)
plot(x2, ex3.glm5$res)
Figure 5.11: Diagnostic Plots and Residual Plots: \(\log(x1)\) and \(\sqrt[4]{x2}\)
These now look okay. Thus the final model will be Poisson response with means given by: \[\mu_{i}=\exp\left(-1.2133+0.5224\log(x1)+1.3280x2^{0.25} \right) \]
summary(ex3.glm5)
Call:
glm(formula = y ~ logx1 + fourthx2, family = poisson, data = ex3.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.75819 -0.20401 0.05865 0.23673 0.69476
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2133 0.4220 -2.875 0.00404 **
logx1 0.5224 0.1161 4.500 6.79e-06 ***
fourthx2 1.3280 0.2827 4.698 2.62e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 142.3521 on 11 degrees of freedom
Residual deviance: 1.7371 on 9 degrees of freedom
AIC: 51.837
Number of Fisher Scoring iterations: 4
5.2 Offsets
An offset is a term to be added to a linear predictor, such as in a generalised linear model, with known coefficient 1 rather than an estimated coefficient. A common use of an offset is when dealing with poisson counts from populations of various sizes. It is necessary to acknowledge that the observed counts are partly related to the population sizes as well as any explanatory variables.
5.2.1 Example 4: Insurance data
For example, the data given in the example dataset within R ‘’Insurance’’ consist of the numbers of policyholders of an insurance company who were exposed to risk, and the numbers of car insurance claims made by those policyholders in the third quarter of 1973.
This data frame contains the following variables.
- District: district of policyholder, A/B/C/D, D is major cities.
- Group: group of car (1 to 4), $<$1 litre, 1-1.5 litre, 1.5-2 litre, $>$2 litre.
- Age: age of driver in 4 ordered groups, $<$25, 25-29, 30-35, $>$35.
- Holders: numbers of policyholders.
- Claims: numbers of claims.
library(MASS)
data("Insurance")
Insurance District Group Age Holders Claims
1 1 <1l <25 197 38
2 1 <1l 25-29 264 35
3 1 <1l 30-35 246 20
4 1 <1l >35 1680 156
5 1 1-1.5l <25 284 63
6 1 1-1.5l 25-29 536 84
7 1 1-1.5l 30-35 696 89
8 1 1-1.5l >35 3582 400
9 1 1.5-2l <25 133 19
10 1 1.5-2l 25-29 286 52
11 1 1.5-2l 30-35 355 74
12 1 1.5-2l >35 1640 233
13 1 >2l <25 24 4
14 1 >2l 25-29 71 18
15 1 >2l 30-35 99 19
16 1 >2l >35 452 77
17 2 <1l <25 85 22
18 2 <1l 25-29 139 19
19 2 <1l 30-35 151 22
20 2 <1l >35 931 87
21 2 1-1.5l <25 149 25
22 2 1-1.5l 25-29 313 51
23 2 1-1.5l 30-35 419 49
24 2 1-1.5l >35 2443 290
25 2 1.5-2l <25 66 14
26 2 1.5-2l 25-29 175 46
27 2 1.5-2l 30-35 221 39
28 2 1.5-2l >35 1110 143
29 2 >2l <25 9 4
30 2 >2l 25-29 48 15
31 2 >2l 30-35 72 12
32 2 >2l >35 322 53
33 3 <1l <25 35 5
34 3 <1l 25-29 73 11
35 3 <1l 30-35 89 10
36 3 <1l >35 648 67
37 3 1-1.5l <25 53 10
38 3 1-1.5l 25-29 155 24
39 3 1-1.5l 30-35 240 37
40 3 1-1.5l >35 1635 187
41 3 1.5-2l <25 24 8
42 3 1.5-2l 25-29 78 19
43 3 1.5-2l 30-35 121 24
44 3 1.5-2l >35 692 101
45 3 >2l <25 7 3
46 3 >2l 25-29 29 2
47 3 >2l 30-35 43 8
48 3 >2l >35 245 37
49 4 <1l <25 20 2
50 4 <1l 25-29 33 5
51 4 <1l 30-35 40 4
52 4 <1l >35 316 36
53 4 1-1.5l <25 31 7
54 4 1-1.5l 25-29 81 10
55 4 1-1.5l 30-35 122 22
56 4 1-1.5l >35 724 102
57 4 1.5-2l <25 18 5
58 4 1.5-2l 25-29 39 7
59 4 1.5-2l 30-35 68 16
60 4 1.5-2l >35 344 63
61 4 >2l <25 3 0
62 4 >2l 25-29 16 6
63 4 >2l 30-35 25 8
64 4 >2l >35 114 33
The three explanatory variables (District/Group/Car) form a three way table with \(4^3=64\) cells each containing a number of claims. If there was one policyholder in each cell was the same then we could (say) fit an additive model:
mod<-glm(Claims~District+Group+Age,family=poisson,data=Insurance)
summary(mod)
Call:
glm(formula = Claims ~ District + Group + Age, family = poisson,
data = Insurance)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5967 -0.9877 -0.1092 0.5180 4.3268
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.92306 0.03355 116.924 < 2e-16 ***
District2 -0.43822 0.04297 -10.198 < 2e-16 ***
District3 -0.91521 0.05032 -18.187 < 2e-16 ***
District4 -1.44367 0.06158 -23.445 < 2e-16 ***
Group.L -0.51133 0.04932 -10.368 < 2e-16 ***
Group.Q -1.02479 0.04198 -24.413 < 2e-16 ***
Group.C 0.21633 0.03304 6.547 5.87e-11 ***
Age.L 1.50084 0.04916 30.527 < 2e-16 ***
Age.Q 0.47465 0.04882 9.722 < 2e-16 ***
Age.C 0.41495 0.04847 8.560 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4236.68 on 63 degrees of freedom
Residual deviance: 121.31 on 54 degrees of freedom
AIC: 458.63
Number of Fisher Scoring iterations: 4
This would give the expected claim in a cell to be: \[Expected\ claim = \exp(\beta_{0} + \beta^{District} + \beta^{Group} + \beta^{Age})\] However there are a number of policyholders in each cell so we would require: \[Expected\ claim = Holders \times \exp(\beta_{0} + \beta^{District} + \beta^{Group} + \beta^{Age})\] Which can be re-written as: \[Expected\ claim = \exp\left(\beta_{0} + \beta^{District} + \beta^{Group} + \beta^{Age}+\log(Holders)\right)\] Note that the linear predictor now includes log(Holders) which is not a variable with a parameter to estimate! The appropriate main-effects fit as Poisson GLM with offset is:
moda<-glm(Claims~District+Group+Age+ offset(log(Holders)),
family=poisson,data=Insurance)
summary(moda)
Call:
glm(formula = Claims ~ District + Group + Age + offset(log(Holders)),
family = poisson, data = Insurance)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.46558 -0.50802 -0.03198 0.55555 1.94026
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.810508 0.032972 -54.910 < 2e-16 ***
District2 0.025868 0.043016 0.601 0.547597
District3 0.038524 0.050512 0.763 0.445657
District4 0.234205 0.061673 3.798 0.000146 ***
Group.L 0.429708 0.049459 8.688 < 2e-16 ***
Group.Q 0.004632 0.041988 0.110 0.912150
Group.C -0.029294 0.033069 -0.886 0.375696
Age.L -0.394432 0.049404 -7.984 1.42e-15 ***
Age.Q -0.000355 0.048918 -0.007 0.994210
Age.C -0.016737 0.048478 -0.345 0.729910
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 236.26 on 63 degrees of freedom
Residual deviance: 51.42 on 54 degrees of freedom
AIC: 388.74
Number of Fisher Scoring iterations: 4
Now consider an analysis of this data, beginning with a saturated model and using model simplification approaches to reduce the complexity of the final model.
mod1<-glm(Claims~District*Group*Age+offset(log(Holders)),
family=poisson,data=Insurance)
summary(mod1)
Call:
glm(formula = Claims ~ District * Group * Age + offset(log(Holders)),
family = poisson, data = Insurance)
Deviance Residuals:
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[24] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[47] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.870e+00 4.896e-02 -38.188 < 2e-16 ***
District2 1.350e-01 7.387e-02 1.827 0.06763 .
District3 5.649e-02 9.699e-02 0.582 0.56024
District4 -1.182e+00 2.640e+03 0.000 0.99964
Group.L 3.545e-01 1.182e-01 2.998 0.00271 **
Group.Q -4.487e-02 9.792e-02 -0.458 0.64681
Group.C 3.656e-02 7.209e-02 0.507 0.61206
Age.L -2.816e-01 1.058e-01 -2.662 0.00778 **
Age.Q -5.541e-02 9.792e-02 -0.566 0.57146
Age.C 6.156e-02 8.935e-02 0.689 0.49088
District2:Group.L 5.975e-02 1.754e-01 0.341 0.73342
District3:Group.L -6.837e-02 2.332e-01 -0.293 0.76934
District4:Group.L -3.327e+00 7.085e+03 0.000 0.99963
District2:Group.Q 2.122e-01 1.477e-01 1.437 0.15085
District3:Group.Q -1.755e-01 1.940e-01 -0.905 0.36554
District4:Group.Q -2.709e+00 5.281e+03 -0.001 0.99959
District2:Group.C -1.232e-01 1.135e-01 -1.086 0.27761
District3:Group.C -2.240e-01 1.445e-01 -1.550 0.12107
District4:Group.C -1.234e+00 2.362e+03 -0.001 0.99958
District2:Age.L -2.699e-01 1.576e-01 -1.713 0.08677 .
District3:Age.L -1.427e-01 1.928e-01 -0.740 0.45945
District4:Age.L 3.875e+00 7.085e+03 0.001 0.99956
District2:Age.Q 6.156e-02 1.477e-01 0.417 0.67691
District3:Age.Q 2.289e-01 1.940e-01 1.180 0.23801
District4:Age.Q -2.727e+00 5.281e+03 -0.001 0.99959
District2:Age.C -4.086e-03 1.372e-01 -0.030 0.97624
District3:Age.C -2.941e-01 1.951e-01 -1.507 0.13169
District4:Age.C 1.122e+00 2.362e+03 0.000 0.99962
Group.L:Age.L 4.896e-01 2.567e-01 1.907 0.05655 .
Group.Q:Age.L -8.415e-02 2.116e-01 -0.398 0.69087
Group.C:Age.L -2.339e-01 1.537e-01 -1.522 0.12802
Group.L:Age.Q -4.445e-01 2.365e-01 -1.880 0.06012 .
Group.Q:Age.Q 9.545e-02 1.958e-01 0.487 0.62600
Group.C:Age.Q 1.636e-01 1.442e-01 1.134 0.25662
Group.L:Age.C 1.192e-03 2.143e-01 0.006 0.99556
Group.Q:Age.C 2.331e-01 1.787e-01 1.304 0.19215
Group.C:Age.C 5.697e-02 1.340e-01 0.425 0.67072
District2:Group.L:Age.L -6.082e-01 3.748e-01 -1.623 0.10459
District3:Group.L:Age.L -6.794e-01 4.545e-01 -1.495 0.13496
District4:Group.L:Age.L 9.466e+00 1.901e+04 0.000 0.99960
District2:Group.Q:Age.L -2.889e-01 3.152e-01 -0.917 0.35929
District3:Group.Q:Age.L 1.724e-01 3.857e-01 0.447 0.65495
District4:Group.Q:Age.L 7.722e+00 1.417e+04 0.001 0.99957
District2:Group.C:Age.L 2.818e-01 2.413e-01 1.168 0.24281
District3:Group.C:Age.L 3.668e-01 3.015e-01 1.216 0.22383
District4:Group.C:Age.L 3.543e+00 6.337e+03 0.001 0.99955
District2:Group.L:Age.Q 4.292e-01 3.509e-01 1.223 0.22121
District3:Group.L:Age.Q 1.045e+00 4.663e-01 2.241 0.02501 *
District4:Group.L:Age.Q -7.033e+00 1.417e+04 0.000 0.99960
District2:Group.Q:Age.Q 1.593e-01 2.955e-01 0.539 0.58984
District3:Group.Q:Age.Q 2.985e-01 3.880e-01 0.769 0.44165
District4:Group.Q:Age.Q -5.933e+00 1.056e+04 -0.001 0.99955
District2:Group.C:Age.Q 4.501e-02 2.269e-01 0.198 0.84279
District3:Group.C:Age.Q -3.545e-03 2.890e-01 -0.012 0.99022
District4:Group.C:Age.Q -2.592e+00 4.723e+03 -0.001 0.99956
District2:Group.L:Age.C 3.166e-01 3.252e-01 0.974 0.33030
District3:Group.L:Age.C -6.739e-01 4.779e-01 -1.410 0.15847
District4:Group.L:Age.C 3.207e+00 6.337e+03 0.001 0.99960
District2:Group.Q:Age.C -4.190e-01 2.744e-01 -1.527 0.12668
District3:Group.Q:Age.C -5.423e-01 3.902e-01 -1.390 0.16457
District4:Group.Q:Age.C 2.769e+00 4.723e+03 0.001 0.99953
District2:Group.C:Age.C 4.101e-02 2.116e-01 0.194 0.84635
District3:Group.C:Age.C -3.275e-01 2.760e-01 -1.187 0.23527
District4:Group.C:Age.C 9.491e-01 2.112e+03 0.000 0.99964
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2.3626e+02 on 63 degrees of freedom
Residual deviance: 4.1222e-10 on 0 degrees of freedom
AIC: 445.32
Number of Fisher Scoring iterations: 20
mod2<-step(mod1,direction="both")Start: AIC=445.32
Claims ~ District * Group * Age + offset(log(Holders))
Df Deviance AIC
- District:Group:Age 27 27.29 418.61
<none> 0.00 445.32
Step: AIC=418.61
Claims ~ District + Group + Age + District:Group + District:Age +
Group:Age + offset(log(Holders))
Df Deviance AIC
- District:Age 9 33.527 406.85
- District:Group 9 34.457 407.78
- Group:Age 9 37.685 411.01
<none> 27.290 418.61
+ District:Group:Age 27 0.000 445.32
Step: AIC=406.85
Claims ~ District + Group + Age + District:Group + Group:Age +
offset(log(Holders))
Df Deviance AIC
- District:Group 9 40.907 396.23
- Group:Age 9 44.132 399.45
<none> 33.527 406.85
+ District:Age 9 27.290 418.61
Step: AIC=396.23
Claims ~ District + Group + Age + Group:Age + offset(log(Holders))
Df Deviance AIC
- Group:Age 9 51.420 388.74
<none> 40.907 396.23
- District 3 54.850 404.17
+ District:Group 9 33.527 406.85
+ District:Age 9 34.457 407.78
Step: AIC=388.74
Claims ~ District + Group + Age + offset(log(Holders))
Df Deviance AIC
<none> 51.420 388.74
+ Group:Age 9 40.907 396.23
- District 3 65.291 396.61
+ District:Group 9 44.132 399.45
+ District:Age 9 44.859 400.18
- Age 3 136.290 467.61
- Group 3 140.087 471.41
summary(mod2)
Call:
glm(formula = Claims ~ District + Group + Age + offset(log(Holders)),
family = poisson, data = Insurance)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.46558 -0.50802 -0.03198 0.55555 1.94026
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.810508 0.032972 -54.910 < 2e-16 ***
District2 0.025868 0.043016 0.601 0.547597
District3 0.038524 0.050512 0.763 0.445657
District4 0.234205 0.061673 3.798 0.000146 ***
Group.L 0.429708 0.049459 8.688 < 2e-16 ***
Group.Q 0.004632 0.041988 0.110 0.912150
Group.C -0.029294 0.033069 -0.886 0.375696
Age.L -0.394432 0.049404 -7.984 1.42e-15 ***
Age.Q -0.000355 0.048918 -0.007 0.994210
Age.C -0.016737 0.048478 -0.345 0.729910
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 236.26 on 63 degrees of freedom
Residual deviance: 51.42 on 54 degrees of freedom
AIC: 388.74
Number of Fisher Scoring iterations: 4
Making a prediction
mod2.pred<-predict(mod2,se.fit=T,type="response")
mod2.pred$fit
1 2 3 4 5 6
31.863585 35.275867 28.180802 158.878292 53.977716 84.160115
7 8 9 10 11 12
93.690431 398.060076 31.862179 56.602450 60.234044 229.717751
13 14 15 16 17 18
6.819092 16.665525 19.922338 75.089737 14.108529 19.060004
19 20 21 22 23 24
17.751277 90.352333 29.061421 50.433636 57.880788 278.599876
25 26 27 28 29 30
16.225653 35.541983 38.480469 159.554148 2.624171 11.562089
31 32 33 34 35 36
14.868666 54.894955 5.883384 10.137418 10.595927 63.688499
37 38 39 40 41 42
10.468941 25.293210 33.575924 188.830232 5.975384 16.043331
43 44 45 46 47 48
21.336824 100.736656 2.067017 7.074396 8.992994 42.299863
49 50 51 52 53 54
4.088580 5.573164 5.791517 37.770823 7.446839 16.074617
55 56 57 58 59 60
20.756777 101.689401 5.450175 9.755463 14.582657 60.900833
61 62 63 64
1.077335 4.746732 6.358566 23.936524
$se.fit
1 2 3 4 5 6
2.4467288 2.3205295 1.8336491 7.8310962 3.8413334 4.7928758
7 8 9 10 11 12
5.1246179 14.3024535 2.3711431 3.4120459 3.5100699 9.7020485
13 14 15 16 17 18
0.6102392 1.2790771 1.4898542 4.7613246 1.1421776 1.3282707
19 20 21 22 23 24
1.2209058 4.8532204 2.1783800 3.0548953 3.3675926 11.2937839
25 26 27 28 29 30
1.2663282 2.2641762 2.3682811 7.3719026 0.2422804 0.9161009
31 32 33 34 35 36
1.1466586 3.6141140 0.5077136 0.7620401 0.7818445 3.7658612
37 38 39 40 41 42
0.8464585 1.6946155 2.1564105 9.0165498 0.5017790 1.1253227
43 44 45 46 47 48
1.4423277 5.3432260 0.1998079 0.5918393 0.7299460 2.9526387
49 50 51 52 53 54
0.3792472 0.4624035 0.4724435 2.6112194 0.6542183 1.2184624
55 56 57 58 59 60
1.5194113 6.0900921 0.4925305 0.7626322 1.1045784 3.8778838
61 62 63 64
0.1103283 0.4303229 0.5610220 1.8759244
$residual.scale
[1] 1
Printing confidence intervals:
print(paste(round(mod2.pred$fit[1],2)," (",
round(mod2.pred$fit[1]-qt(.975,54)*mod2.pred$se.fit[1],2),",",
round(mod2.pred$fit[1]+qt(.975,54)*mod2.pred$se.fit[1],2),
")",sep=""), quote=F)[1] 31.86 (26.96,36.77)
Note that these can be embedded within your Rmarkdown code as 31.86 (26.96, 36.77). While the code may seem to be a lot more than just typing in the results directly, the advantage it has is that you have made a mistake in data entry at some point and need to rerun your code, the new results will automatically update here.
5.3 Practical exercises
5.3.1 Exercise 1: Missing persons data
The following data set appeared in The Independent, March 8, 1994, under the headline “Thousands of people who disappear without trace”. Here, using figures from the Metropolitan police, the numbers in the table are of the form r/n where n = the number reported missing during the year ending March 1993 and r = the number still missing at the end of that year. Questions of interest are whether a simple model fits these data, whether the age and/or sex effect are significant, and how to interpret the statistical conclusions to the layman.
| Age in years | ||
|---|---|---|
| 13 and under | 14-18 | |
| Males | 33/3271 | 63/7257 |
| Females | 38/2486 | 108/8877 |
- Read in the data n and r.
- Create factors sex and age.
- Create the data frame missing.df and remove unwanted files.
- Fit a Poisson regression using the appropriate offset; that is the linear predictor determined by \(r\sim sex+age+offset(\log(n))\).
- Describe and interpret these results.
- Simplify the Poisson if possible. Comment on the final model.
5.3.2 Exercise 2: AIDS data
The total number of reported new cases per month of AIDS in the UK up to November 1985 are listed below (data from A. Sykes, 1986):
0,0,3,0,1,1,1,2,2,4,2,8,0,3,4,5,2,2 ,2,5,4,3,15,12,7,14,6,10,14,8,19,10,7,20,10,19
(data for 36 consecutive months - reading across).
- Read in the data y.
- Create a covariate for month number using i=1:36.
- Plot number of new cases against month; note that y (more or less) increases as i increases.
- Create aids.df data frame.
- Fit a model that has the following simple log-linear relationship; \(\log(\mu_{i})=\alpha+\beta_{i}\).
- Can \(\beta\) be dropped?
5.3.3 Exercise 3: Composite material cracks data
The following data records the experimental results of an investigation into the strength properties of two types of composite materials. Specimens were subjected to various forces in Newtons, x, and the ensuing number of cracks appearing was recorded, y.
| Material 1 | y | 4 | 4 | 4 | 7 | 9 | 10 | 3 |
| x | 0.1 | 0.4 | 0.6 | 0.8 | 1.0 | 1.3 | 1.4 | |
| Material 2 | y | 6 | 9 | 9 | 12 | 10 | 13 | 15 |
| x | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 |
Investigate two possible models:
\(\eta_{i}=\beta_{0}+\beta_{j\left(i\right)}^{A}+\beta^{X}x_{i}+\beta_{j\left(i\right)}^{AX}x_{i}\)
\(\eta_{i}=\beta_{0}+\beta_{j\left(i\right)}^{A}+\beta^{X}x_{i}\)
What do the two models represent? Perform any model simplification that is appropriate.
5.4 Over-dispersion
Over-dispersion can be a problem when working with the 1-parameter error distributions (e.g. Poisson or binomial errors) and occurs when the variance of the response exceeds the nominal variance. Over-dispersion tends to occur when:
- you have not measured one or more factors that turn out to be important;
- the underlying distribution being non-Poisson.
The net result of these will be that the residual deviance is inflated.
For the 1-parameter error distributions the scale parameter \(\phi\) is assumed to be one. The usual estimator of \(\phi\) is \(\hat{\phi}=D/(n-p)\) and thus the residual deviance divided by the residual degrees of freedom ought to be one. If this ratio is substantially larger than the assumed scale parameter of one (i.e. the residual deviance is much greater than the degrees of freedom), then this would suggest that the data are over-dispersed.
It is dealt with by adjusting the scale parameter to be a value other than one. This can be dealt with in two ways:
- using the quasipoisson family which differs from the poisson family because the dispersion parameter is not fixed at one, so it can model over-dispersion. The usual link functions are available.
- using the quasi family which allows us to estimate model parameters without fully specifying the error distribution of the response. However, we must specify the link and variance functions (so this requires a bit more consideration).
Note:
- Significance is then assessed by F-tests, using the estimated scale parameter as the denominator.
- The parameter estimates are not affected by this procedure, but the standard errors are inflated.
- Significance tests are therefore more stringent and can lead to fewer significant effects.
- family=quasi(link=‘’log’‘,var=’‘mu’’) is the same as quasipoisson
Under-dispersion is when the variance of the response is less than the nominal variance. It is dealt with in a similar fashion.
5.4.1 Example 5: Poisson (slugs)
Count data were obtained on the number of slugs under 40 tiles placed over two types of grassland; nursery and rookery. Does the mean slug density differ between the two?
slugs<-read.table("slugsurvey.txt",header=T)
names(slugs)[1] "count" "field"
attach(slugs)
plot(count~field,xlab="Field",ylab="count")
Figure 5.12: Slug count by Location
summary(count[field=="Nursery"]) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 1.275 1.250 10.000
summary(count[field=="Rookery"]) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.275 3.000 9.000
detach(slugs)From the above looks like the medians are different, but the range of counts in both fields is large and so significance may be in doubt. The data does not appear to be normal as for the nursery it is heavily positive skewed. Also, it looks like the variance between the fields may not be constant. So standard ANOVA does not look appropriate.
We start by fitting Poisson model with log link:
mod1<-glm(count~field,poisson,data=slugs)
summary(mod1)
Call:
glm(formula = count ~ field, family = poisson, data = slugs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1331 -1.5969 -0.9519 0.4580 4.8727
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2429 0.1400 1.735 0.082744 .
fieldRookery 0.5790 0.1749 3.310 0.000932 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 224.86 on 79 degrees of freedom
Residual deviance: 213.44 on 78 degrees of freedom
AIC: 346.26
Number of Fisher Scoring iterations: 6
mod2<-update(mod1,.~.-field)
anova(mod1,mod2,test="Chi")Analysis of Deviance Table
Model 1: count ~ field
Model 2: count ~ 1
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 78 213.44
2 79 224.86 -1 -11.422 0.000726 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(mod1)
Figure 5.13: Diagnostic Plots for mod1
This suggests that field is significant. BUT it appears that over-dispersion is present; from the output we would estimate the scale parameter to be 213.44/78=2.74 which is much more than one which it should be. Q-Q plot is very concave.
We shall now try the quasipoisson family still with log link. This will allow for the presence of a non unity scale parameter.
mod3<-glm(count~field,quasipoisson,data=slugs)
summary(mod3)
Call:
glm(formula = count ~ field, family = quasipoisson, data = slugs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1331 -1.5969 -0.9519 0.4580 4.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2429 0.2494 0.974 0.3331
fieldRookery 0.5790 0.3116 1.858 0.0669 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 3.17311)
Null deviance: 224.86 on 79 degrees of freedom
Residual deviance: 213.44 on 78 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 6
Deletion tests now need to be done using the F distribution.
mod4<-update(mod3,.~.-field)
anova(mod3,mod4,test="F")Analysis of Deviance Table
Model 1: count ~ field
Model 2: count ~ 1
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 78 213.44
2 79 224.86 -1 -11.422 3.5995 0.0615 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(mod3)
Figure 5.14: Diagnostic Plots for mod3
We can now see that the difference in mean slug density is not significant. The Q-Q plot isn’t improved much. We could try the quasi family with log link which allows us to break away from the Poisson relationship between the mean and the variance.
mod5<-glm(count~field,quasi(link="log",var="mu^2"),data=slugs)
summary(mod5)
Call:
glm(formula = count ~ field, family = quasi(link = "log", var = "mu^2"),
data = slugs)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7232 -0.1261 0.0000 0.2900 3.0931
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2429 0.2300 1.056 0.2941
fieldRookery 0.5790 0.3252 1.780 0.0789 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasi family taken to be 2.115738)
Null deviance: 40.144 on 79 degrees of freedom
Residual deviance: 45.606 on 78 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 11
mod6<-update(mod5,.~.-field)
anova(mod5,mod6,test="F")Analysis of Deviance Table
Model 1: count ~ field
Model 2: count ~ 1
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 78 45.606
2 79 40.144 -1 5.4615
par(mfrow=c(2,2))
plot(mod5)
Figure 5.15: Diagnostic Plots for mod5
Again a non-significant result but Q-Q plot has got worse in Figure 5.15.
Alternative approaches for this problem could be to use ANOVA after transformation to get normality/constant variance; try either log or sqrt transformations. Note we have to add 1 to count for log transformation as there are zero counts in this dataset.
attach(slugs)
plot(log(count+1)~field)
Figure 5.16: Log Transformation
mod7<-glm(log(count+1)~field,data=slugs)
summary(mod7)
Call:
glm(formula = log(count + 1) ~ field, data = slugs)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9698 -0.4967 -0.2766 0.4165 1.9012
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4967 0.1117 4.446 2.86e-05 ***
fieldRookery 0.4730 0.1580 2.994 0.00369 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.4993769)
Null deviance: 43.426 on 79 degrees of freedom
Residual deviance: 38.951 on 78 degrees of freedom
AIC: 175.45
Number of Fisher Scoring iterations: 2
mod8<-update(mod7,.~.-field)
anova(mod7,mod8,test="F")Analysis of Deviance Table
Model 1: log(count + 1) ~ field
Model 2: log(count + 1) ~ 1
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 78 38.951
2 79 43.426 -1 -4.475 8.9612 0.003693 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(mod7)
Figure 5.17: Diagnostic Plots for mod7
Here we get a significant effect for field. But the associated Q-Q plots still not great in Figure 5.17.
Try a square root transformation (as in Figure 5.18 and associated output).
plot(sqrt(count)~field)
Figure 5.18: Square Root Transformation
mod9<-glm(sqrt(count)~field,data=slugs)
summary(mod9)
Call:
glm(formula = sqrt(count) ~ field, data = slugs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2549 -0.6446 -0.2549 0.4772 2.5176
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6447 0.1414 4.559 1.88e-05 ***
fieldRookery 0.6103 0.2000 3.052 0.00311 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.799815)
Null deviance: 69.834 on 79 degrees of freedom
Residual deviance: 62.386 on 78 degrees of freedom
AIC: 213.13
Number of Fisher Scoring iterations: 2
mod10<-update(mod9,.~.-field)
anova(mod9,mod10,test="F")Analysis of Deviance Table
Model 1: sqrt(count) ~ field
Model 2: sqrt(count) ~ 1
Resid. Df Resid. Dev Df Deviance F Pr(>F)
1 78 62.386
2 79 69.834 -1 -7.4481 9.3123 0.003111 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(2,2))
plot(mod9)
Figure 5.19: Diagnostic Plots for mod9
detach(slugs)Again a significant result for field, but Q-Q plots no better. We have contradictory results from the above which is common with data with low means and high variances. The residual plots in Figure 5.19 are not helping us in identifying which is the best model. There is no obvious right answer, but the results correcting for over-dispersion are warning us that conclusions from the ANOVA approach should be treated with caution especially as the transformed data does not appear normal.