install.packages("SMPracticals")
3 R practicals
3.1 Getting started
For running the code below, you will need the R packages MASS
and SMPracticals
. The MASS
package (Venables & Ripley, 2002) is one of the recommended R package and is included with the binary distributions of R, so you should have it. The SMPracticals
package (Davison, 2024) is the R package providing the datasets and a few functions for use with the practicals outlined in Davison (2003, Appendix A), and can be installed by running
The packages can be loaded and attached by doing
library("MASS")
library("SMPracticals")
3.2 trees
data
trees
contains data on the volume of timber, height and girth (diameter) of 31 felled black cherry trees; girth is measured four feet six inches above ground (Atkinson, 1985, p. 63). The problem is to find a simple linear model for predicting volume from height and girth. See ?trees
for more details.
data("trees", package = "datasets")
pairs(trees, panel = panel.smooth)
pairs(log(trees), panel = panel.smooth)
coplot()
generates conditioning plots, in which the relationship between two variables is displayed conditional on subsets of values of other variables. This is useful to see if the relationship is stable over the range of other variables. The plots should be read from left to right, starting from the bottom row, and each plot corresponds to the ranges of values (from left to right) shown on the top plot for the conditioning variable. For the relationship of log volume and log girth, conditional on height we get:
coplot(log(Volume) ~ log(Girth) | Height, data = trees, panel = panel.smooth)
Produce and interpret the conditioning plots on the orginal scale.
For an initial fit, we take a linear model and assess model fit using diagnostic plots:
<- glm(Volume ~ Girth + Height, data = trees)
m_trees summary(m_trees)
plot.glm.diag(m_trees)
What do you make of the m_trees
fit?
To assess the possibility of transformation:
boxcox(m_trees)
Both \(\lambda = 1\) and \(\lambda = 0\) lie outside the confidence interval, though the latter is better supported. One possibility is to take \(\lambda = 1/3\), corresponding to response \(\texttt{Volume}^{1/3}\).
What transformations for Girth
and Height
are then needed for dimensional compatibility? Fit this model, give interpretations of the parameter estimates, and discuss its suitability.
An alternative is to suppose that a tree is conical in shape, in which case
\[
\texttt{Volume} \propto \texttt{Height} \times \texttt{Girth}^2 \, .
\] Equivalently, we fit
<- glm(log(Volume) ~ log(Girth) + log(Height), data = trees)
m_trees_log summary(m_trees_log)
plot.glm.diag(m_trees_log)
Are the parameter estimates consistent with this model? Does it fit adequately? What advantage has it over the others for prediction of future volumes?
3.3 salinity
data
salinity
contains \(n = 28\) observations on the salinity of water in Pamlico Sound, North Carolina (Atkinson, 1985, p. 48; Ruppert & Carroll, 1980). The response sal
is the bi-weekly average of salinity. The other three columns contain values of the covariates, respectively a lagged value of salinity lag
, a trend indicator trend
, and the river discharge dis
.
Using the techniques from the analysis of the tree
data set as a guide, find a model suitable for prediction of salinity from the covariates. The data contain at least one outlier.
3.4 shuttle
data
shuttle
contains the data in Davison (2003, Table 1.3) on O-ring failures for the space shuttle (Dalal et al., 1989).
data("shuttle", package = "SMPracticals")
row.names(shuttle) <- NULL
To fit a binomial logistic regression model with covariate temperature, we do:
<- glm(cbind(r, m - r) ~ temperature, family = binomial(), data = shuttle)
m_shuttle anova(m_shuttle)
summary(m_shuttle)
Try fitting with and without both covariates. To assess model fit, try
plot.glm.diag(m_shuttle)
Do you find these diagnostics useful?
3.5 bliss
data
bliss
provides data on mortality of flour-beetles as a function of dose of a poison (Bliss, 1935). To plot the death rates and fit a logistic regression model, we do:
data("bliss", package = "SMPracticals")
<- glm(cbind(r, m - r) ~ log(dose), family = binomial(), data = bliss)
m_bliss_logit summary(m_bliss_logit)
with(bliss, {
plot(log(dose), r/m, ylim = c(0, 1), ylab = "Proportion dead")
points(log(dose), fitted(m_bliss_logit), pch = 3, col = 2)
})
Does the fit seem reasonble to you?
Try again with the probit and cloglog link functions.
For example, for the cloglog link function we have:
<- glm(cbind(r, m-r) ~ log(dose), family = binomial("cloglog"), data = bliss)
m_bliss_cloglog with(bliss, {
plot(log(dose), r/m, ylim = c(0, 1), ylab = "Proportion dead")
points(log(dose), fitted(m_bliss_logit), pch = 3, col = 2)
points(log(dose), fitted(m_bliss_cloglog), pch = 3, col = 3)
})
Which link function fits best? Give a careful interpretation of the resulting model.