Penalized likelihood estimation and inference in high-dimensional logistic regression

Ioannis Kosmidis

Professor of Statistics
Department of Statistics, University of Warwick

ioannis.kosmidis@warwick.ac.uk
ikosmidis.com   ikosmidis ikosmidis_


CRiSM 2.0 Conference
University of Warwick

22 May 2025

CRISM 1.0


Joint with

Philipp Sterzinger

Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290

Slides

Logistic regression

Logistic regression

Data

\(y_1, \ldots, y_n\) with \(y_i \in \{0, 1\}\)

\(x_1, \ldots, x_n\) with \(x_i \in \Re^p\)

Model

\(Y_{1}, \ldots, Y_{n}\) conditionally independent with

\[ Y_i \mid x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \sum_{t = 1}^p \beta_t x_{it} \]

Maximum likelihood estimation

Log-likelihood

\(\displaystyle l(\beta \,;\,y, X) = \sum_{i = 1}^n \left\{y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right\}\)

Maximum likelihood (ML) estimator

\(\hat{\beta} \in \arg \sup l(\beta \,;\,y, X)\)

Iterative re-weighted least squares

\(\hat\beta := \beta^{(\infty)}\) where \(\displaystyle \beta^{(j + 1)} = \left(X^T W^{(j)} X\right)^{-1} X^T W^{(j)} z^{(j)}\)

\(W\) is a diagonal matrix with \(i\)th diagonal element \(\pi_i ( 1- \pi_i)\)

\(z_i = \eta_i + (y_i - \pi_i) / \{ \pi_i (1 - \pi_i) \}\) is the working variate
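A minimal R sketch of the IRLS recursion above; illustrative only, not the internals of glm().

## Minimal IRLS for logistic regression (illustrative sketch, not glm()'s internals)
irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (j in seq_len(maxit)) {
    eta <- drop(X %*% beta)
    pi <- plogis(eta)                            # fitted probabilities
    w <- pi * (1 - pi)                           # diagonal of W
    z <- eta + (y - pi) / w                      # working variate
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}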

UCI’s Multiple Features data set, doi: 10.24432/C5HC70

Data

\(200\) binary images per digit (\(0\)-\(9\)), extracted from a collection of Dutch public utility maps

Feature sets:

  • 76 Fourier coefficients of the character shapes (rotation invariant) per digit

  • 64 Karhunen-Loève coefficients per digit

1000 digits for training + 1000 digits for testing

Aim

Explain the contribution of the feature sets in describing the digit “7”

Considerations

Font, digitization noise, and downscaling complicate discrimination

If only rotation invariant features are used, “7” and “2” can look similar

fou + kar VS fou

full_ml <- glm(y ~ ., data = full_train, family = binomial)
nest_ml <- glm(y ~ ., data = nest_train, family = binomial)
Model       Residual Df   Log-likelihood   Df   LR     p-value
fou         923           0.00
fou + kar   859           0.00             64   0.00   1
library("detectseparation")
full_sep <- update(full_ml, method = "detect_separation")
nest_sep <- update(nest_ml, method = "detect_separation")
Model Separation Infinite estimates
fou + kar true true
fou true true
  • Perfect fits on the training data

  • 1000 observations with 140 covariates

\(p / n \to \kappa \in (0, 1)\)

Recent developments

Candès & Sur (2020)

Sharp phase transition for when the ML estimate has infinite components, when \(\eta_i = \delta + x_i^\top \beta_0\), \(x_i \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top\beta_0) \to \gamma^2\)

Sur & Candès (2019), Zhao et al. (2022)

A method, based on approximate message passing (AMP), that for \(\delta = 0\) recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever the latter exists.

Phase transition, \(\hat\beta\)

Phase transition, \(\hat\beta / \mu_\star\) (Sur & Candès, 2019)

?

Diaconis-Ylvisaker prior

Diaconis-Ylvisaker prior

Prior

\[ \log p(\beta; X) = \frac{1 - \alpha}{\alpha} \sum_{i=1}^n \left\{ \zeta'\left(x_i^\top \beta_P\right) x_i^\top \beta - \zeta\left(x_i^\top \beta\right) \right\} + C \] with \(\zeta(\eta) = \log(1 + e^\eta)\)

Penalized log-likelihood

\[ l^*(\beta; y, X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \zeta\left(x_i^\top \beta\right) \right\} \] with \(y_i^* = \alpha y_i + (1 - \alpha) \zeta'\left(x_i^\top \beta_P\right)\)

Maximum DY prior penalized likelihood

Maximum DY prior penalized likelihood (MDYPL)

\(\hat{\beta}^{\textrm{\small DY}}= \arg \max l^*(\beta; y, X)\)

MDYPL implementation using ML procedures

MDYPL is ML with pseudo-responses: \(l(\beta \,;\,y^*, X) / \alpha = l^*(\beta \,;\,y, X)\)

Existence and uniqueness

Rigon & Aliverti (2023, Theorem 1): \(\hat{\beta}^{\textrm{\small DY}}\) exists and is unique for all \(\{y, X\}\) configurations

Shrinkage

\(\displaystyle \hat{\beta}^{\textrm{\small DY}}\longrightarrow \left\{ \begin{array}{ll} \hat\beta\,, & \alpha \to 1 \\ \beta_P\,, & \alpha \to 0 \end{array}\right.\)

Rigon & Aliverti (2023) suggest adaptive shrinkage with \(\alpha = 1 / (1 + \kappa)\)

Shrinkage direction

Fixed \(p\) (Cordeiro & McCullagh, 1991)

ML estimator of a logistic regression model is biased away from \(0\)

\(p / n \to \kappa \in (0, 1)\) (Sur & Candès, 2019; Zhao et al., 2022)

Empirical evidence that this continues to hold in high dimensions


\[\LARGE\Downarrow\]

\[\beta_{P} = 0\]

fou + kar VS fou using MDYPL

Maximum Diaconis-Ylvisaker prior penalized likelihood (MDYPL)

  1. Adjust responses to \(\displaystyle y_i^* = \alpha y_i + \frac{1 - \alpha}{2}\) with \(\alpha \in [0, 1)\)

  2. Use ML estimation with the adjusted responses
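
A sketch of step 1 in R, building the adjusted data sets used in the glm() calls below; full_train / nest_train and the response column y are as in the earlier ML fits (names illustrative).

## Sketch: adjusted responses y* = alpha * y + (1 - alpha) / 2; the same alpha is
## used for both fits so that the two penalized likelihoods are comparable
kappa <- 140 / 1000                     # p / n for the fou + kar model
alpha <- 1 / (1 + kappa)                # adaptive shrinkage (Rigon & Aliverti, 2023)
full_train_adj <- full_train
full_train_adj$y <- alpha * full_train$y + (1 - alpha) / 2
nest_train_adj <- nest_train
nest_train_adj$y <- alpha * nest_train$y + (1 - alpha) / 2
## glm() typically warns about non-integer "successes" for the adjusted responses;
## the resulting fits are still the MDYPL estimates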

full_mdypl <- glm(y ~ ., data = full_train_adj, family = binomial)
nest_mdypl <- glm(y ~ ., data = nest_train_adj, family = binomial)

Penalized likelihood ratio test

Df PLR
64 64.36

fou + kar VS fou using MDYPL

Simulation setup

\(1000 \times 140\) matrix \(X_{\rm fou + kar}\) is fixed

\(\beta_{\rm fou + kar} = (\beta_{\rm fou}^\top, 0_{64}^\top)^\top\)

\(\beta_{\rm fou}\) has i.i.d. \(\mathop{\mathrm{N}}(0, 1)\) entries, rescaled so that \[ \widehat{\rm var}(X_{\rm fou + kar} \beta_{\rm fou + kar}) = \gamma^2 \]

Intercept: \(\delta \in \{-3, -2, -1, 0\}\)

Signal strength: \(\gamma^2 \in \{1, 2, 4, 8, 16\}\)

\(500\) response vectors per \((\delta, \gamma^2)\)

MDYPL: estimation and inference

Aggregate asymptotic behaviour of MDYPL estimator

Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely. Then for any function \(\psi: \Re^2 \to \Re\) that is pseudo-Lipschitz of order 2, \[ \frac{1}{p} \sum_{j=1}^{p} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}\)

\(\psi(t,u)\) statistic quantity a.s. limit
\(t\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)\) Bias \(0\)
\(t^2\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)^2\) Variance \({\sigma}_{\star}^2\)
\((t-(1-{\mu}_{\star})u)^2\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\) MSE \({\sigma}_{\star}^2 + (1-{\mu}_{\star})^2 \gamma^2 / \kappa\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\)

\(\psi(t,u)\) statistic quantity a.s. limit
\(t / {\mu}_{\star}\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)\) Bias \(0\)
\(t^2/{\mu}_{\star}^2\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)^2\) MSE / Variance \({\sigma}_{\star}^2 / {\mu}_{\star}^2\)

State evolution of AMP recursion

Solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) to the system of nonlinear equations in \((\mu, b, \sigma)\)

\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2} - \zeta' \left(\textrm{prox}_{b \zeta}\left(Z_{*} + \frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2}b\right) \right) \right\} \right] \\ 0 & = 1 - {\color[rgb]{0.70,0.1,0.12}{\kappa}} - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + b \zeta''(\textrm{prox}_{b\zeta}\left(Z_{*} + \frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2}b\right))} \right] \\ 0 & = \sigma^2 - \frac{b^2}{{\color[rgb]{0.70,0.1,0.12}{\kappa}}^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2} - \zeta'\left(\textrm{prox}_{b\zeta}\left( Z_{*} + \frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2}b\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]

\(Z \sim \mathop{\mathrm{N}}(0, {\color[rgb]{0.70,0.1,0.12}{\gamma}}^2)\)

\(Z_{*} = \mu Z + {\color[rgb]{0.70,0.1,0.12}{\kappa}}^{1/2} \sigma G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)

\(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator


The system is solved with standard solvers for nonlinear systems, after approximating the expectations by cubature
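
One way to do this, as an illustrative sketch (not the paper's implementation): approximate the expectations by Gauss-Hermite product cubature and pass the system to a general-purpose nonlinear solver such as nleqslv. The example values of \(\kappa\), \(\gamma^2\), \(\alpha\) and the number of cubature nodes are arbitrary.

## Sketch: solve the state evolution equations (1) for (mu, b, sigma)
library(nleqslv)

## Probabilists' Gauss-Hermite rule: E_{N(0,1)} f(X) is approximately sum(weights * f(nodes))
gauss_hermite <- function(n) {
  J <- matrix(0, n, n)
  off <- sqrt(seq_len(n - 1))
  J[cbind(1:(n - 1), 2:n)] <- off
  J[cbind(2:n, 1:(n - 1))] <- off
  e <- eigen(J, symmetric = TRUE)
  list(nodes = e$values, weights = e$vectors[1, ]^2)
}

zeta1 <- function(eta) plogis(eta)                       # zeta'
zeta2 <- function(eta) plogis(eta) * (1 - plogis(eta))   # zeta''

## prox_{b zeta}(x): the solution in u of u + b * zeta'(u) = x
prox_bzeta <- function(x, b) {
  vapply(x, function(xi)
    uniroot(function(u) u + b * plogis(u) - xi,
            lower = xi - abs(b) - 1, upper = xi + abs(b) + 1)$root,
    numeric(1))
}

## The three equations in (1) as a function of (mu, b, sigma)
state_evolution <- function(par, kappa, gamma, alpha, gh = gauss_hermite(40)) {
  mu <- par[1]; b <- par[2]; sigma <- par[3]
  a <- (1 + alpha) / 2
  nodes <- expand.grid(z = gh$nodes, g = gh$nodes)       # cubature grid for (Z, G)
  w <- as.vector(outer(gh$weights, gh$weights))
  Z <- gamma * nodes$z                                   # Z ~ N(0, gamma^2)
  Zs <- mu * Z + sqrt(kappa) * sigma * nodes$g           # Z_* = mu Z + sqrt(kappa) sigma G
  pr <- prox_bzeta(Zs + a * b, b)
  c(sum(w * 2 * zeta1(Z) * Z * (a - zeta1(pr))),
    1 - kappa - sum(w * 2 * zeta1(Z) / (1 + b * zeta2(pr))),
    sigma^2 - (b^2 / kappa^2) * sum(w * 2 * zeta1(Z) * (a - zeta1(pr))^2))
}

## Example: kappa = 0.2, gamma^2 = 5, alpha = 1 / (1 + kappa)
sol <- nleqslv(c(0.5, 1, 1), state_evolution,
               kappa = 0.2, gamma = sqrt(5), alpha = 1 / 1.2)
sol$x   # approximates (mu_star, b_star, sigma_star)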

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}/ \mu_\star\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}/ \mu_\star\) for \(\alpha = 1 / (1 + \kappa)\)

Optimal shrinkage: Asymptotic unbiasedness

Amount of shrinkage (value of \(\alpha\)) that results in \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) being asymptotically unbiased

Optimal shrinkage: Minimum MSE

Amount of shrinkage (value of \(\alpha\)) that results in \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) having minimal asymptotic MSE

DY prior penalized likelihood ratio test statistics

Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely. Define the DY prior penalized likelihood ratio test statistic for \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\) as \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta \,;\,y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta \,;\,y^*, X)\,, \quad I = \{i_1,\ldots,i_k\} \]

Then, \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{\star}^2}{b_{\star}} \chi^2_k \]
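
For example, a sketch of the rescaled calibration for the fou + kar vs fou comparison; kappa_hat, sigma_star, b_star are placeholders for estimates of \(\kappa\), \({\sigma}_{\star}\), \({b}_{\star}\) (see "Estimating unknowns" below).

## Sketch: Lambda_I from the MDYPL fits via deviances (differences of maximized
## penalized log-likelihoods at the same y*; the saturated term cancels), then
## the rescaled chi-squared p-value for the 64 kar coefficients
Lambda <- (deviance(nest_mdypl) - deviance(full_mdypl)) / 2
p_value <- pchisq(2 * Lambda * b_star / (kappa_hat * sigma_star^2),
                  df = 64, lower.tail = FALSE)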

Adjusted Wald statistics

Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely.

Then, for any regression coefficient such that \(\sqrt{n} \tau_j \beta_{0,j} = \mathcal{O}(1)\), \[ Z^*_j = \sqrt{n} \tau_j \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{\star}\beta_{0,j}}{{\sigma}_{\star}} \overset{d}{\longrightarrow} \mathcal{N}(0,1) \] where \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\)

Artificial example

\[ \begin{array}{l} n = 2000 \\ p = 400 \\ x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \\ \beta_0 = (b_1, \ldots, b_{200}, 0, \ldots, 0)^\top \\ b_j \stackrel{iid}{\sim} \mathop{\mathrm{N}}(6.78, 1) \\ \end{array} \implies \begin{array}{l} \gamma^2 = \mathop{\mathrm{var}}(x_i^\top \beta_0) \simeq 4.7 \\ \kappa = 0.2 \\ \end{array} \]



\[ y_i \stackrel{ind}{\sim} \mathop{\mathrm{Bernoulli}}(1/(1 + {e^{-x_i^\top\beta_0}})) \]
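
A sketch reproducing this setup in R (seed and object names arbitrary).

## Sketch: generate the artificial example
set.seed(123)
n <- 2000; p <- 400
X <- matrix(rnorm(n * p, sd = sqrt(1 / n)), n, p)
beta0 <- c(rnorm(200, mean = 6.78, sd = 1), rep(0, 200))
eta <- drop(X %*% beta0)
y <- rbinom(n, size = 1, prob = plogis(eta))
c(kappa = p / n, gamma2 = var(eta))     # approximately 0.2 and 4.7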

Frequentist performance

\(W = 2 ( \hat{l} - \hat{l}_{0} )\), where \(\hat{l}_{0}\) is the maximized log-likelihood under \(H_0: \beta_{201} = \ldots = \beta_{210} = 0\)

Classical theory implies \(W \stackrel{d}{\rightarrow} \chi^2_{10}\) when \(\beta_{201} = \ldots = \beta_{210} = 0\)

\(Z = \hat\beta_{k} / [(X^\top \hat{W} X)^{-1}]_{kk}^{1/2}\)

Classical theory implies \(Z \stackrel{d}{\rightarrow} \mathop{\mathrm{N}}(0, 1)\) when \(\beta_{k} = 0\)

Estimating unknowns

Dimension

\(\hat\kappa = p / n\)

Signal strength

Adaptation of the Signal Strength Leave-One-Out Estimator (SLOE) of Yadlowsky et al. (2021), that consistently estimates \(\upsilon^2 = \lim\limits_{n \to \infty}\mathop{\mathrm{var}}(x_j^\top \hat{\beta}^{\textrm{\small DY}}) = {\mu}_{\star}^2 \gamma^2 + \kappa {\sigma}_{\star}^2\)

\[ \hat \upsilon^2 = \frac{\sum_{i = 1}^n (s_i - \bar{s})^2}{n} \quad \text{with} \quad s_i = \hat{\eta}^{\textrm{\small DY}}_i - \frac{\hat{h}^{\textrm{\small DY}}_i}{1 - \hat{h}^{\textrm{\small DY}}_i} (\hat{z}^{\textrm{\small DY}}_i - \hat{\eta}^{\textrm{\small DY}}_i) \] where \(\hat{h}^{\textrm{\small DY}}_i\) is the hat value for the \(i\)th observation, and \({\rm DY}\) implies evaluation at \(\hat{\beta}^{\textrm{\small DY}}\)
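
A sketch of this estimator from an MDYPL glm() fit (e.g. full_mdypl above; names illustrative); hat values and working residuals give \(\hat{z}^{\textrm{\small DY}}_i - \hat{\eta}^{\textrm{\small DY}}_i\) directly.

## Sketch: SLOE-type estimate of upsilon^2 from an MDYPL fit
eta_hat <- predict(full_mdypl, type = "link")      # linear predictors at the MDYPL fit
h_hat <- hatvalues(full_mdypl)                     # hat values
wres <- residuals(full_mdypl, type = "working")    # z_i - eta_i
s <- eta_hat - h_hat / (1 - h_hat) * wres
upsilon2_hat <- sum((s - mean(s))^2) / length(s)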

Estimating unknowns

Estimates of \({\mu}_{\star}\), \(b_{\star}\), \({\sigma}_{\star}\)

  1. Replace \(\gamma^2\) in the state evolution equations (1) with \((\hat\upsilon^2 - \hat\kappa \sigma^2) / \mu^2\)

  2. Solve (1) to get estimates for \({\mu}_{\star}\), \(b_{\star}\), \({\sigma}_{\star}\)

Conditional variance of covariates

For adjusted Wald tests, \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\) can be estimated using the residual sums of squares from the regressions of each covariate on all the others
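
A sketch of this step and of the resulting adjusted Wald statistics; X is the \(n \times p\) covariate matrix, beta_dy the MDYPL estimate, and sigma_star comes from solving (1) (names illustrative).

## Sketch: tau_j^2 from the residual sum of squares of covariate j on the others,
## then adjusted Wald statistics for H0: beta_{0,j} = 0
n <- nrow(X); p <- ncol(X)
tau2_hat <- vapply(seq_len(p), function(j) {
  fit_j <- lm.fit(x = cbind(1, X[, -j, drop = FALSE]), y = X[, j])
  sum(fit_j$residuals^2) / n           # or n - p for a degrees-of-freedom correction
}, numeric(1))
z_adj <- sqrt(n * tau2_hat) * beta_dy / sigma_star
p_adj <- 2 * pnorm(abs(z_adj), lower.tail = FALSE)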

Comparison to corrected least squares (Lewis & Battey, 2024)

\(p = n\kappa\), \(x_j \sim \mathop{\mathrm{N}}(0, \Sigma)\) with \(\Sigma_{jj} = 1\), \(\Sigma_{jk} = 0.5\) \((j \ne k)\)

\(\beta_0 = (c, c, c, c, c, 0, \ldots, 0)^\top\) with \(c\) such that \(\beta_0^\top \Sigma \beta_0 = \gamma^2 = 9\)

\(5000\) independent realizations of the response vector per setting


Models with an intercept

A conjecture, accurate according to empirical evidence, with the state evolution given by a system of 4 equations in 4 unknowns.

A SLOE-based procedure for estimating unknowns, replacing the asymptotic limit of the MDYPL estimator of the intercept by its MDYPL estimate.

fou + kar VS fou using MDYPL

Penalized likelihood ratio tests

Test           Statistic   p-value   Infinite estimates (fou + kar)   Infinite estimates (fou)
LR             0.00        1.00      true                             true
PLR            64.36       0.46      false                            false
rescaled PLR   173.34      0.00      false                            false


Current work

More general model matrices

Distributions with light tails

Asymptotic results seem to apply for a broad class of covariate distributions with sufficiently light tails (e.g. sub-Gaussian)

Universality

Results in Han & Shen (2023) can be used to examine universality for more general model matrices with independent entries

DY prior

Arbitrary shrinkage directions (\(\beta_P \ne 0\))

Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290

References

Candès, E. J., & Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. Annals of Statistics, 48(1), 27–42. https://doi.org/10.1214/18-AOS1789
Cordeiro, G. M., & McCullagh, P. (1991). Bias correction in generalized linear models. Journal of the Royal Statistical Society, Series B: Methodological, 53(3), 629–643. https://doi.org/10.1111/j.2517-6161.1991.tb01852.x
Han, Q., & Shen, Y. (2023). Universality of regularized regression estimators in high dimensions. The Annals of Statistics, 51(4), 1799–1823. https://doi.org/10.1214/23-AOS2309
Kosmidis, I., & Zietkiewicz, P. (2025). Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias. Statistical Science (to appear). https://arxiv.org/abs/2311.11290
Lewis, R. M., & Battey, H. S. (2024). On inference in high-dimensional logistic regression models with separated data. Biometrika, 111(3), 989–1011. https://doi.org/10.1093/biomet/asad065
Rigon, T., & Aliverti, E. (2023). Conjugate priors and bias reduction for logistic regression models. Statistics & Probability Letters, 202, 109901. https://doi.org/10.1016/j.spl.2023.109901
Sur, P., & Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516–14525. https://doi.org/10.1073/pnas.1810420116
Yadlowsky, S., Yun, T., McLean, C. Y., & D’Amour, A. (2021). SLOE: A faster method for statistical inference in high-dimensional logistic regression. Advances in Neural Information Processing Systems, 34, 29517–29528. https://proceedings.neurips.cc/paper/2021/file/f6c2a0c4b566bc99d596e58638e342b0-Paper.pdf
Zhao, Q., Sur, P., & Candès, E. J. (2022). The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3). https://doi.org/10.3150/21-BEJ1401

Maximum Jeffreys prior penalized likelihood

\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \delta + \sum_{t = 1}^p \beta_t x_{it} \]

Suppose that \(x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\), and that \(\mathop{\mathrm{var}}(x_i^\top \beta_0) \to \gamma_0^2\), as \(p / n \to \kappa \in (0, 1)\).

Then, the mJPL estimator \((\tilde\delta, \tilde\beta^\top)^\top\) satisfies \[ \frac{1}{p} \sum_{j = 1}^p (\tilde\beta_j - \alpha_\star \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} 0 \] with \(\alpha_\star \le 1\). For small to moderate values of \(\delta_0^2 / \gamma_0^2\), the scaling \(\alpha_\star\) can be approximated by \[ q(\kappa, \gamma, \gamma_0; b) = \left\{ \begin{array}{ll} 1 \,, & \text{if } \kappa < h_{\rm MLE}(\delta, \gamma_0) \\ \kappa^{b_1} \gamma^{b_2} \gamma_0^{b_3}\,, & \text{if } \kappa > h_{\rm MLE}(\delta, \gamma_0) \end{array} \right. \] where \(b_1 < -1\), \(b_2 < -1\), \(b_3 > 0\).


Predictions for \(b_1\), \(b_2\), \(b_3\) are -1.172, -1.869, 0.817, respectively (Kosmidis & Zietkiewicz, 2025)