5 Lab 1 (with solution)
5.1 Exercise
Suppose \[ Y_{im} \stackrel{\text{ind}}{\sim}\mathop{\mathrm{Poisson}}\left( \mu^*(x_{im}) \right) \quad (i = 1, \ldots, n; m = 1, \ldots, M)\,, \] where \[ \begin{aligned} \mu^*(x_{im}) & = 8 \exp \left(w(x_{im})\right) \,, \\ x_{im} & = x_i = -10 + 20 \frac{i-1}{n-1} \, , \\ w(x) & = 0.001 \left( 100 + x + x^2 + x^3 \right) \, . \end{aligned} \]
Consider the following simulation study. For \(b = 1, \ldots, B\):
- Generate \[ Y_{im} \stackrel{\text{ind}}{\sim}\mathop{\mathrm{Poisson}}(\mu^*(x_{im})) \quad (i=1,\ldots,n; m = 1, \ldots, M) \]
- Compute the AIC and BIC of the candidate models \[ Y_{im} \stackrel{\text{ind}}{\sim}\mathop{\mathrm{Poisson}}(\mu(x_{im})), \quad \mu(x_{im}) = \exp \left(\sum_{j=1}^p \beta_j x^{j-1}_{im} \right) \,, \] for \(p = 1, \ldots, p_{\max}\).
You can carry out the simulation study for \(n = 200\), \(M = 3\), \(p_{\max} = 20\), and \(B = 100\) with the following code:
B <- 100
n <- 200
M <- 3
pmax <- 20
w <- function(x) {
  0.001 * (100 + x + x^2 + x^3)
}
mu <- function(x) {
  8 * exp(w(x))
}
## Covariates
x <- rep(seq(from = -10, to = 10, length = n), each = M)
## Objects to hold AICs and BICs
aics <- bics <- matrix(NA, nrow = B, ncol = pmax)
set.seed(20240416)
for (b in 1:B) {
  ## Simulate responses
  y <- rpois(n = M * n, lambda = mu(x))
  ## Fit intercept-only model and compute AIC, BIC
  mod <- glm(y ~ 1, family = poisson)
  aics[b, 1] <- AIC(mod)
  bics[b, 1] <- BIC(mod)
  ## Fit remaining models and compute AIC, BIC
  for (p in 2:pmax) {
    modp <- glm(y ~ poly(x, p - 1), family = poisson)
    aics[b, p] <- AIC(modp)
    bics[b, p] <- BIC(modp)
  }
}
The number of times each \(p\) has been selected by AIC and BIC is
AICorder <- apply(aics, 1, which.min)
BICorder <- apply(bics, 1, which.min)
table(AICorder)
AICorder
 4  5  6  7  8  9 10 13
78 10  3  5  1  1  1  1
table(BICorder)
BICorder
 4  5
98  2
The AIC and BIC values for each sample as a function of \(p\) are
par(mfrow = c(1, 2))
matplot(x = 1:pmax, t(aics), xlab = "p", ylab = "AIC", type = "l", lty = 1, col = gray(0.9))
matplot(x = 1:pmax, t(bics), xlab = "p", ylab = "BIC", type = "l", lty = 1, col = gray(0.9))
Modify the code above to investigate the performance of AIC and BIC as model selection criteria for \(n \in \{25, 50, 100, 1000\}\).
[Suggestion: write a function that carries out the simulation for user-supplied \(n\), \(M\), \(p_{\max}\), \(B\), and the function \(w(\cdot)\).]
Use \[ w(x) = \frac{1.2}{1 + \exp(-x)} \] in the model we simulate from. How do AIC and BIC perform when the candidate models do not include the simulation model?
What information criterion would you use to estimate \(\Delta\) in (1.2) and why? Use the simulation results to obtain a simulation-based estimate of \(\Delta\) as a function of \(p\), for \(n \in \{25, 50, 100, 1000\}\), and each choice of \(w(\cdot)\).
Then, estimate \(\Delta\) directly from its definition, using the out-of-sample log-likelihood.
[Hint: for the out-of-sample log-likelihood you can do sum(dpois(y_plus, mu_hat, log = TRUE)), where mu_hat are the fitted means from a training sample and y_plus is an independent sample of the same size and at exactly the same \(x\) values as the training sample.]
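For concreteness, here is a minimal sketch of how the pieces of the hint fit together. It assumes the mu() function and covariates x defined in the code above, and the cubic fit is just an illustrative choice:
y <- rpois(length(x), mu(x))       # training sample
y_plus <- rpois(length(x), mu(x))  # independent sample at the same x values
mu_hat <- fitted(glm(y ~ poly(x, 3), family = poisson))  # fitted means from the training fit
sum(dpois(y_plus, mu_hat, log = TRUE))  # out-of-sample log-likelihood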
5.2 Solution
Since we are planning to explore various experimental conditions for the simulation study, it is a good idea to write a general function that allows us to vary \(n\), \(M\), \(p_{\max}\), \(B\), and the function \(w(\cdot)\).
run_simulation <- function(n, M = 3, pmax = 20, B = 100,
                           w = function(x) 0.001 * (100 + x + x^2 + x^3)) {
  mu <- function(x) {
    8 * exp(w(x))
  }
  ## Covariates
  x <- rep(seq(from = -10, to = 10, length = n), each = M)
  ## Objects to hold AICs and BICs
  aics <- bics <- matrix(NA, nrow = B, ncol = pmax)
  for (b in 1:B) {
    ## Simulate responses
    y <- rpois(n = M * n, lambda = mu(x))
    ## Fit intercept-only model and compute AIC, BIC
    mod <- glm(y ~ 1, family = poisson)
    aics[b, 1] <- AIC(mod)
    bics[b, 1] <- BIC(mod)
    ## Fit remaining models and compute AIC, BIC
    for (p in 2:pmax) {
      modp <- glm(y ~ poly(x, p - 1), family = poisson)
      aics[b, p] <- AIC(modp)
      bics[b, p] <- BIC(modp)
    }
  }
  list(AIC = aics, BIC = bics)
}
We can also write functions that take as input the output of run_simulation()
and summarize the results, as is done in the code provided.
## Returns the number of times each $p$ has been selected by the
## criterion used to compute ics
get_selection_counts <- function(ics) {
  ic_order <- apply(ics, 1, which.min)
  table(ic_order)
}

## Plots the information criterion values in ics for each sample as a
## function of $p$
plot_ic <- function(ics, ylab = "IC", main = NULL) {
  matplot(x = 1:pmax, t(ics), xlab = "p", ylab = ylab, main = main,
          type = "l", lty = 1, col = gray(0.9))
}
So, for the simulation study with \(n = 200\), \(M = 3\), \(p_{\max} = 20\), and \(B = 100\), we can now simply do (using the same seed)
set.seed(20240416)
res <- run_simulation(n = 200, M = 3, pmax = 20, B = 100,
                      w = function(x) 0.001 * (100 + x + x^2 + x^3))
Then, for example, the results for AIC are
get_selection_counts(res$AIC)
ic_order
4 5 6 7 8 9 10 13
78 10 3 5 1 1 1 1
plot_ic(res$AIC, ylab = "AIC")
- The code chunk below will run the required simulations and store the results in the list res_true.
set.seed(1)
ns <- c(25, 50, 100, 1000)
res_true <- vector("list", length(ns))
for (j in 1:length(ns)) {
  res_true[[j]] <- run_simulation(n = ns[j], M = 3, pmax = 20, B = 100,
                                  w = function(x) 0.001 * (100 + x + x^2 + x^3))
}
names(res_true) <- paste("n =", ns)
Let’s investigate the behaviour of AIC.
sapply(res_true, function(x) get_selection_counts(x$AIC))
$`n = 25`
ic_order
4 5 6 7 8 9 10 18
67 11 11 2 5 1 1 2
$`n = 50`
ic_order
4 5 6 7 8 9 11 13
70 7 10 5 3 1 3 1
$`n = 100`
ic_order
4 5 6 7 8 9 18
72 11 8 5 2 1 1
$`n = 1000`
ic_order
4 5 6 7 8 9 10 11
69 11 10 4 2 2 1 1
AIC behaves similarly for all \(n\). In all cases, the correct (cubic) model is preferred most of the time, but the probability of it being selected does not tend to one as \(n\) grows. This is also apparent in the AIC vs \(p\) plots.
par(mfrow = c(2, 2))
for (j in 1:length(ns)) {
  plot_ic(res_true[[j]]$AIC, ylab = "AIC", main = paste("n =", ns[j]))
  abline(v = 4, lty = 3)
}
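To make the lack of consistency concrete, here is a small check (an editorial sketch reusing res_true and the fact that \(B = 100\) above) of the empirical probability that AIC selects the cubic model (\(p = 4\)) for each \(n\):
## Empirical probability that AIC selects p = 4, for each n (B = 100)
sapply(res_true, function(x) get_selection_counts(x$AIC)["4"] / 100)
The proportions stay around 0.7 for every \(n\), rather than tending to one.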
Let’s explore the behaviour of BIC.
sapply(res_true, function(x) get_selection_counts(x$BIC))
$`n = 25`
ic_order
2 3 4 5 6
1 1 93 4 1
$`n = 50`
ic_order
4 5
99 1
$`n = 100`
ic_order
4 5
97 3
$`n = 1000`
ic_order
4
100
In this example, the set of candidate models includes the model we simulate from, and BIC selects the correct model more and more often as \(n\) increases. This is also apparent in the BIC vs \(p\) plots.
par(mfrow = c(2, 2))
for (j in 1:length(ns)) {
  plot_ic(res_true[[j]]$BIC, ylab = "BIC", main = paste("n =", ns[j]))
  abline(v = 4, lty = 3)
}
- The code chunk below will run the required simulations and store the results in the list res_other.
set.seed(1)
ns <- c(25, 50, 100, 1000)
res_other <- vector("list", length(ns))
for (j in 1:length(ns)) {
  res_other[[j]] <- run_simulation(n = ns[j], M = 3, pmax = 20, B = 100,
                                   w = function(x) 1.2 / (1 + exp(-x)))
}
names(res_other) <- paste("n =", ns)
Let’s investigate the behaviour of AIC.
sapply(res_other, function(x) get_selection_counts(x$AIC))
$`n = 25`
ic_order
4 5 6 7 8 9 10 11 12 13 14 15 16 20
9 2 33 8 19 5 7 3 4 2 2 4 1 1
$`n = 50`
ic_order
4 6 7 8 9 10 11 12 13 14 15 16 17 18 20
1 31 5 22 10 8 5 4 1 3 3 3 1 2 1
$`n = 100`
ic_order
6 7 8 9 10 11 12 13 16 17 18
15 3 33 10 28 2 3 1 1 2 2
$`n = 1000`
ic_order
10 11 12 13 14 15 16 17 18 20
13 5 33 6 21 5 8 1 7 1
As \(n\) increases, AIC tends to select increasingly complex models; these provide better and better approximations to the true data-generating distribution, which is not a polynomial model.
On the other hand, BIC prefers simpler models than AIC does, although in this case it too tends to select more complex models as \(n\) increases.
sapply(res_other, function(x) get_selection_counts(x$BIC))
$`n = 25`
ic_order
4 5 6 7 8 10 13
41 4 43 5 5 1 1
$`n = 50`
ic_order
4 5 6 7 8 9 10
18 2 59 9 10 1 1
$`n = 100`
ic_order
6 7 8 10
55 6 35 4
$`n = 1000`
ic_order
8 10 11 12 13
13 76 3 7 1
- We have shown that \[ \bar\ell(\hat\theta) - p / (n M) = -\frac{1}{2nM} \{2 (p - \ell(\hat\theta))\} = -\frac{1}{2nM} AIC \] is a bias-corrected estimator of \(\Delta\), since \(AIC = 2\{p - \ell(\hat\theta)\}\) and \(\bar\ell(\hat\theta) = \ell(\hat\theta) / (nM)\).
So, \[ \tilde{\Delta} = - \frac{1}{2nM} \mathop{\mathrm{E}}(AIC) \] should approach \(\Delta\) as \(n\) increases. We can estimate \(\tilde{\Delta}\) empirically by transforming the AIC values we obtained in the simulations and taking averages for every \(p\).
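For example (a sketch reusing res_true, ns, and the global M from above), the empirical estimates of \(\tilde{\Delta}\) under the cubic \(w(\cdot)\) are
## Average -AIC / (2nM) over the B samples, separately for each p and n
Delta_tilde <- lapply(seq_along(ns),
                      function(j) colMeans(-res_true[[j]]$AIC / (2 * ns[j] * M)))
names(Delta_tilde) <- paste("n =", ns)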
A direct, simulation-based estimator of \(\Delta\) can be obtained from its definition, using samples independent of the ones used for estimation. The code chunk below provides a function that returns a simulation-based estimate of \(\Delta\) as a function of \(p\).
estimate_Delta <- function(n, M = 3, pmax = 20, B = 100,
                           w = function(x) 0.001 * (100 + x + x^2 + x^3)) {
  mu <- function(x) {
    8 * exp(w(x))
  }
  ## Covariates
  x <- rep(seq(from = -10, to = 10, length = n), each = M)
  ## Object to hold out-of-sample log-likelihoods
  oo_ll <- matrix(NA, nrow = B, ncol = pmax)
  for (b in 1:B) {
    ## Simulate training and independent out-of-sample responses
    y <- rpois(n = M * n, lambda = mu(x))
    y_plus <- rpois(n = M * n, lambda = mu(x))
    ## Fit intercept-only model and compute out-of-sample log-likelihood
    mod <- glm(y ~ 1, family = poisson)
    oo_ll[b, 1] <- sum(dpois(y_plus, fitted(mod), log = TRUE))
    ## Fit remaining models and compute out-of-sample log-likelihood
    for (p in 2:pmax) {
      modp <- glm(y ~ poly(x, p - 1), family = poisson)
      oo_ll[b, p] <- sum(dpois(y_plus, fitted(modp), log = TRUE))
    }
  }
  colMeans(oo_ll / (n * M))
}
The estimates of \(\Delta\) for \(w(x) = 0.001 \left( 100 + x + x^2 + x^3 \right)\) and \(w(x) = 1.2/\{1 + \exp(-x)\}\) are
set.seed(1)
ns <- c(25, 50, 100, 1000)
Delta_true <- Delta_other <- vector("list", length(ns))
for (j in 1:length(ns)) {
  Delta_true[[j]] <- estimate_Delta(n = ns[j], M = 3, pmax = 20, B = 100,
                                    w = function(x) 0.001 * (100 + x + x^2 + x^3))
  Delta_other[[j]] <- estimate_Delta(n = ns[j], M = 3, pmax = 20, B = 100,
                                     w = function(x) 1.2 / (1 + exp(-x)))
}
names(Delta_true) <- names(Delta_other) <- paste("n =", ns)
For \(w(x) = 0.001 \left( 100 + x + x^2 + x^3 \right)\), we get
cols <- hcl.colors(2)
par(mfrow = c(2, 2))
for (j in 1:length(ns)) {
  Delta_t <- -res_true[[j]]$AIC / (2 * ns[j] * M)
  plot_ic(Delta_t, ylab = "-AIC / (2nM)", main = paste("n =", ns[j]))
  lines(1:pmax, colMeans(Delta_t), col = cols[1])
  lines(1:pmax, Delta_true[[j]], col = cols[2])
}
and for \(w(x) = 1.2/\{1 + \exp(-x)\}\), we get
par(mfrow = c(2, 2))
for (j in 1:length(ns)) {
  Delta_t <- -res_other[[j]]$AIC / (2 * ns[j] * M)
  plot_ic(Delta_t, ylab = "-AIC / (2nM)", main = paste("n =", ns[j]))
  lines(1:pmax, colMeans(Delta_t), col = cols[1])
  lines(1:pmax, Delta_other[[j]], col = cols[2])
}
The purple piecewise linear functions are the simulation-based estimates of \(\Delta\) based on AIC, and the yellow piecewise linear functions are the simulation-based estimates of \(\Delta\) based on its definition. In the current setting, we can see that the two simulation-based estimates are almost identical in value.
The fact that BIC does not estimate \(\Delta\) can now be made directly apparent. For example, for \(w(x) = 1.2/\{1 + \exp(-x)\}\),
par(mfrow = c(2, 2))
for (j in 1:length(ns)) {
  plot_ic(-res_other[[j]]$BIC / (2 * ns[j] * M), ylab = "-BIC / (2nM)",
          main = paste("n =", ns[j]))
  lines(1:pmax, Delta_other[[j]], col = cols[1])
  abline(v = 4, lty = 3)
}