Task
Select a statistical model from a set of candidate models on the basis of data.

Principles for selection
Background knowledge
Choose models based on substantive knowledge (e.g. from previous studies), theoretical arguments, dimensional or other general considerations.

Robustness to departures from assumptions
Prefer models that provide valid inference even if some of their assumptions are invalid.

Quality of fit
Prefer models that perform well in terms of informal devices, such as residuals and graphical assessments, or more formal goodness-of-fit tests.

Parsimony
Look for the simplest models that are adequate descriptions of the data.

Set of candidate models
Defined on the basis of background knowledge and mathematical or computational convenience.

Defined accounting for the trade-off between allowed flexibility and cardinality of the set of candidate models.

Linear regression
Data
Response values \(y_1, \ldots, y_n\), with \(y_i \in \Re\)

Covariate vectors \(x_1, \ldots, x_n\), with \(x_i = (x_{i1}, \ldots, x_{ip})^\top \in \Re^p\)

Model
\[
Y_i = \sum_{t = 1}^p \beta_t x_{it} + \epsilon_i
\] where \(\mathop{\mathrm{E}}(\epsilon_i) = 0\) and \(\mathop{\mathrm{var}}(\epsilon_i) = \sigma^2 < \infty\)
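In matrix form, with \(X\) the \(n \times p\) matrix whose rows are \(x_i^\top\), the least-squares estimator minimizes the residual sum of squares and, assuming \(X\) has full column rank, is

\[
\hat\beta = \arg\min_{\beta \in \Re^p} \|y - X\beta\|_2^2 = (X^\top X)^{-1} X^\top y \,.
\]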

Set of candidate models
For \(p = 30\), assuming that every model has an intercept parameter (\(x_{i1} = 1\)), the set of candidate models contains about 537 million models, without even accounting for covariate transformations or interactions!
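The count follows because, with the intercept always included, each of the remaining \(p - 1 = 29\) covariates can be either in or out of the model:

\[
2^{p - 1} = 2^{29} = 536{,}870{,}912 \approx 5.4 \times 10^{8} \,.
\]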

Logistic regression
Data
Response values \(y_1, \ldots, y_n\), with \(y_i \in \{0, 1\}\)

Covariate vectors \(x_1, \ldots, x_n\), with \(x_i = (x_{i1}, \ldots, x_{ip})^\top \in \Re^p\)

Model
\(Y_{1}, \ldots, Y_{n}\) conditionally independent with

\[
Y_i \mid x_i \sim \mathop{\mathrm{Bernoulli}}(\mu_i)\,, \quad
\log \frac{\mu_i}{1 - \mu_i} = \eta_i = \sum_{t = 1}^p \beta_t x_{it}
\]
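Inverting the logit link expresses the success probability directly in terms of the linear predictor, so any \(\beta \in \Re^p\) yields valid probabilities:

\[
\mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{1}{1 + \exp(-\eta_i)} \in (0, 1) \,.
\]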

Widely used for inference about covariate effects on probabilities, probability calibration, and prediction.

Set of candidate models
Same considerations as for linear regression.

Logistic regression likelihood
\[
\ell(\beta) = \sum_{i=1}^n y_i x_i^\top \beta - \sum_{i=1}^n \log\left\{1 + \exp( x_i^\top \beta)\right\}
\]
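This expression follows by taking the log of the Bernoulli likelihood and substituting \(\log\{\mu_i / (1 - \mu_i)\} = x_i^\top \beta\):

\[
\ell(\beta) = \sum_{i=1}^n \left\{ y_i \log \mu_i + (1 - y_i) \log (1 - \mu_i) \right\}
= \sum_{i=1}^n y_i x_i^\top \beta - \sum_{i=1}^n \log\left\{1 + \exp(x_i^\top \beta)\right\} \,.
\]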

We expect a good fit to be one that gives a large maximized log-likelihood \(\hat\ell = \ell(\hat\beta)\), where \(\hat\beta\) is the maximum likelihood estimator.

Nodal involvement: `?SMPracticals::nodal`

The treatment strategy for a patient diagnosed with cancer of the prostate depends strongly on whether the cancer has spread to the surrounding lymph nodes.

It is common to operate on the patient to take samples from the nodes, which are then analysed under a microscope.

For a sample of 53 prostate cancer patients, a number of possible predictor variables were measured before surgery. The patients then had surgery to determine nodal involvement.

```
data("nodal", package = "SMPracticals")
str(nodal)
```

```
'data.frame': 53 obs. of 7 variables:
$ m : Named num 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
$ r : Named num 1 1 1 1 1 0 1 0 0 0 ...
..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
$ aged : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
$ stage: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 1 1 ...
..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
$ grade: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 1 1 ...
..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
$ xray : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 1 1 ...
..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
$ acid : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
```

Nodal involvement
Considering only models with an intercept and without any interactions between the 5 binary covariates results in \(2^5 = 32\) possible logistic regression models for these data.

```
library("SMPracticals")
library("MuMIn")
nodal_full <- glm(r ~ aged + stage + grade + xray + acid,
                  data = nodal, family = "binomial", na.action = "na.fail")
nodal_table <- dredge(nodal_full, rank = logLik)
plot(logLik ~ df, data = nodal_table,
     xlab = "Number of parameters",
     ylab = "Maximized log-likelihood",
     bg = "#ff7518", pch = 21)
```

Figure 1: Maximized log-likelihoods for the 32 possible logistic regression models for the `nodal` data.

Adding terms always increases the maximized log-likelihood.

Taking the model with the highest \(\hat\ell\) would always give the full model.
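The monotonicity is a consequence of nesting: if model \(A\) is obtained from model \(B\) by fixing some coefficients at zero, then maximization for \(B\) is over a larger parameter space, so

\[
\hat\ell_B = \max_{\beta \in \Re^{p_B}} \ell(\beta) \ \ge\ \max_{\beta \in \Re^{p_A}} \ell(\beta) = \hat\ell_A \,, \quad A \subset B \,,
\]

and \(\hat\ell\) alone cannot penalize model complexity.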

Linear regression
Let \(Y_i = \eta_i + \epsilon_i\). Removing terms always increases the residual sum of squares \(\|y - \hat{\eta}\|_2^2\).
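The same nesting argument applies in the least-squares setting: if the column space of the smaller design matrix \(X_A\) is contained in that of \(X_B\), minimizing over the larger space cannot increase the residual sum of squares,

\[
\min_{\beta \in \Re^{p_B}} \|y - X_B \beta\|_2^2 \ \le\ \min_{\beta \in \Re^{p_A}} \|y - X_A \beta\|_2^2 \,,
\]

so removing terms always increases, or at best leaves unchanged, \(\|y - \hat{\eta}\|_2^2\).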