# APTS Statistical Modelling Model Selection

Ioannis Kosmidis

Professor of Statistics
Department of Statistics, University of Warwick

ioannis.kosmidis@warwick.ac.uk
ikosmidis.com

19 April 2024

> Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.

Box and Draper (1987). *Empirical Model-Building and Response Surfaces*, p. 74

# Model selection

Select a statistical model from a set of candidate models on the basis of data.

## Principles for selection

### Background knowledge

Choose models based on substantive knowledge (e.g. from previous studies), theoretical arguments, dimensional or other general considerations.

### Robustness to departures from assumptions

Prefer models that provide valid inference even if some of their assumptions are invalid.

### Quality of fit

Prefer models that perform well in terms of informal devices, such as residuals and graphical assessments, or more formal goodness-of-fit tests.

### Parsimony

Look for the simplest models that are adequate descriptions of the data.

## Set of candidate models

Defined based on background knowledge, and on mathematical or computational convenience.

Defined accounting for the trade-off between the flexibility allowed and the cardinality of the set of candidate models.

## Linear regression

### Data

Response values $$y_1, \ldots, y_n$$, with $$y_i \in \Re$$

Covariate vectors $$x_1, \ldots, x_n$$, with $$x_i = (x_{i1}, \ldots, x_{ip})^\top \in \Re^p$$

### Model

$Y_i = \sum_{t = 1}^p \beta_t x_{it} + \epsilon_i$ where $$\mathop{\mathrm{E}}(\epsilon_i) = 0$$ and $$\mathop{\mathrm{var}}(\epsilon_i) = \sigma^2 < \infty$$

### Set of candidate models

For $$p = 30$$, assuming that every model includes an intercept parameter ($$x_{i1} = 1$$), the set of candidate models contains $$2^{29} \approx 537$$ million possible models, without even accounting for covariate transformations or interactions!
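The count comes from noting that, with the intercept always included, each of the remaining 29 covariates can be either in or out of a model independently. A quick check in R:

```r
## With the intercept always included, each of the remaining p - 1 = 29
## covariates is either in or out of the model, giving 2^29 candidate models
p <- 30
n_models <- 2^(p - 1)
n_models
#> 536870912
```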

## Logistic regression

### Data

Response values $$y_1, \ldots, y_n$$, with $$y_i \in \{0, 1\}$$

Covariate vectors $$x_1, \ldots, x_n$$, with $$x_i = (x_{i1}, \ldots, x_{ip})^\top \in \Re^p$$

### Model

$$Y_{1}, \ldots, Y_{n}$$ conditionally independent with

$Y_i \mid x_i \sim \mathop{\mathrm{Bernoulli}}(\mu_i)\,, \quad \log \frac{\mu_i}{1 - \mu_i} = \eta_i = \sum_{t = 1}^p \beta_t x_{it}$

Widely used for inference about covariate effects on probabilities, probability calibration, and prediction.

### Set of candidate models

Same considerations as for linear regression.

## Logistic regression likelihood

$\ell(\beta) = \sum_{i=1}^n y_i x_i^\top \beta - \sum_{i=1}^n \log\left\{1 + \exp( x_i^\top \beta)\right\}$

We expect a good fit to be one that gives a large maximized log-likelihood $$\hat \ell = \ell(\hat\beta)$$, where $$\hat\beta$$ is the maximum likelihood estimator.
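The log-likelihood above can be transcribed directly into R. The function below is a minimal sketch (the name `loglik_logistic` is illustrative, not from the source); it agrees with `logLik()` from `glm()` when evaluated at the maximum likelihood estimate:

```r
## Log-likelihood of a logistic regression model at parameter value beta,
## for 0/1 responses y and an n x p model matrix X; a direct transcription of
## the expression above, using log1p(exp(.)) for mild numerical stability
loglik_logistic <- function(beta, y, X) {
    eta <- drop(X %*% beta)
    sum(y * eta) - sum(log1p(exp(eta)))
}

## Check against glm() on simulated data (hypothetical example)
set.seed(1)
x <- rnorm(20)
y <- rbinom(20, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)
loglik_logistic(coef(fit), y, model.matrix(fit))
as.numeric(logLik(fit))  # same value
```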

## Nodal involvement: ?SMPracticals::nodal

The treatment strategy for a patient diagnosed with cancer of the prostate depends strongly on whether the cancer has spread to the surrounding lymph nodes.

It is common to operate on the patient to obtain samples from the nodes, which are then analysed under a microscope.

For a sample of 53 prostate cancer patients, a number of possible predictor variables were measured before surgery. The patients then had surgery to determine nodal involvement.

```r
data("nodal", package = "SMPracticals")
str(nodal)
```

```
'data.frame':   53 obs. of  7 variables:
 $ m    : Named num  1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
 $ r    : Named num  1 1 1 1 1 0 1 0 0 0 ...
  ..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
 $ aged : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
 $ stage: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
 $ grade: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
 $ xray : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 1 1 ...
  ..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
 $ acid : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "names")= chr [1:53] "1" "2" "3" "4" ...
```

## Nodal involvement

Considering only the models with an intercept and without any interactions between the 5 binary covariates results in $$2^5 = 32$$ possible logistic regression models for these data.

```r
library("SMPracticals")
library("MuMIn")
nodal_full <- glm(r ~ aged + stage + grade + xray + acid,
                  data = nodal, family = "binomial", na.action = "na.fail")
nodal_table <- dredge(nodal_full, rank = logLik)
plot(logLik ~ df, data = nodal_table,
     xlab = "Number of parameters",
     ylab = "Maximized log-likelihood",
     bg = "#ff7518", pch = 21)
```

Figure 1: Maximized log-likelihoods for 32 possible logistic regression models for the nodal data.

Adding terms always increases the maximized log-likelihood.

Taking the model with highest $$\hat\ell$$ would give the full model.
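The fact that adding terms can never decrease the maximized log-likelihood can be checked directly. A minimal sketch with simulated binary data (a hypothetical example, not the nodal data):

```r
## Nested-model illustration: the maximized log-likelihood of a sub-model
## never exceeds that of a model containing it, because the sub-model's
## parameter space is a subset of the larger model's
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x1))
fit_sub  <- glm(y ~ x1, family = binomial)
fit_full <- glm(y ~ x1 + x2, family = binomial)
as.numeric(logLik(fit_sub)) <= as.numeric(logLik(fit_full))  # TRUE
```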

## Linear regression

Let $$Y_i = \eta_i + \epsilon_i$$. Removing terms always increases the residual sum of squares $$\|y - \hat{\eta} \|_2^2$$.
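This mirrors the log-likelihood behaviour in logistic regression, and can be checked with simulated data (a hypothetical example):

```r
## Dropping a term cannot decrease the residual sum of squares, because
## least squares projects y onto a smaller column space
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + rnorm(n)
rss <- function(fit) sum(residuals(fit)^2)
rss(lm(y ~ x1)) >= rss(lm(y ~ x1 + x2))  # TRUE
```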