Penalized likelihood estimation and inference in high-dimensional logistic regression

Ioannis Kosmidis

Professor of Statistics
Department of Statistics, University of Warwick

ioannis.kosmidis@warwick.ac.uk
ikosmidis.com   ikosmidis ikosmidis_


CCDA 2024
Department of Statistics, London School of Economics, London, UK

01 November 2025

Joint with

Philipp Sterzinger

Sterzinger P, Kosmidis I (2023). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290

Slides

Logistic regression

Logistic regression

Data

\(y_1, \ldots, y_n\) with \(y_i \in \{0, 1\}\)

\(x_1, \ldots, x_n\) with \(x_i \in \Re^p\)

Model

\(Y_{1}, \ldots, Y_{n}\) conditionally independent with

\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \sum_{t = 1}^p \beta_t x_{it} \]

Maximum likelihood estimation

Log-likelihood

\(\displaystyle l(\beta \,;\,y, X) = \sum_{i = 1}^n \left\{y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right\}\)

Maximum likelihood (ML) estimator

\(\hat{\beta} = \arg \max l(\beta \,;\,y, X)\)

Iteratively reweighted least squares (IRLS)

\(\hat\beta := \beta^{(\infty)}\) where \(\displaystyle \beta^{(j + 1)} = \left(X^T W^{(j)} X\right)^{-1} X^T W^{(j)} z^{(j)}\)

\(W\) is a diagonal matrix with \(i\)th diagonal element \(\pi_i (1 - \pi_i)\)

\(z_i = \eta_i + (y_i - \pi_i) / \{ \pi_i (1 - \pi_i) \}\) is the working variate
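A minimal IRLS sketch in R, illustrative only (it assumes a design matrix X with any intercept column already included and a binary response y, and ignores step-halving and separation issues that glm() handles):

## minimal IRLS for logistic regression; illustrative, not a replacement for glm()
irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (j in seq_len(maxit)) {
    eta <- drop(X %*% beta)
    pi <- plogis(eta)                       # fitted probabilities
    w <- pi * (1 - pi)                      # diagonal of W
    z <- eta + (y - pi) / w                 # working variate
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}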

UCI’s Multiple Features data (doi: 10.24432/C5HC70)

Data

\(200\) binary images per digit (\(0\)-\(9\)), extracted from maps of a Dutch public utility

Feature sets:

  • 76 Fourier coefficients of the character shapes (rotation invariant) per digit

  • 64 Karhunen-Loève coefficients per digit

1000 digits for training + 1000 digits for testing

Aim

Explain the contribution of the feature sets in describing the digit “7”

Considerations

Font, digitization noise, and downscaling complicate discrimination

If only rotation invariant features are used, “7” and “2” can look similar

fou + kar vs fou

full_ml <- glm(y ~ ., data = full_train, family = binomial)
nest_ml <- glm(y ~ ., data = nest_train, family = binomial)
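The likelihood ratio comparison below can be reproduced along these lines (the table reports log-likelihoods rather than the residual deviances that anova() prints):

anova(nest_ml, full_ml, test = "Chisq")
logLik(nest_ml); logLik(full_ml)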
Model       Residual Df   Log-likelihood   Df   LR     p-value
fou         923           0.00
fou + kar   859           0.00             64   0.00   1
library("detectseparation")
full_sep <- update(full_ml, method = "detect_separation")
nest_sep <- update(nest_ml, method = "detect_separation")
Model       Separation   Infinite estimates
fou + kar   True         True
fou         True         True
  • Perfect fits on the training data

  • 1000 observations with 140 covariates

fou + kar vs fou using MDYPL

Maximum Diaconis-Ylvisaker prior penalized likelihood (MDYPL)

  1. Adjust responses to \(\displaystyle y_i^* = \alpha y_i + \frac{1 - \alpha}{2}\) with \(\alpha \in [0, 1)\)

  2. Use ML estimation with the adjusted responses (a minimal R sketch follows this list)
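A minimal sketch of the two steps in R, assuming the full_train data frame from before with the response in column y, and using \(\alpha = 1/(1 + p/n)\) as in the later slides; full_train_adj is what the glm() calls below are fitted to:

n <- nrow(full_train)
p <- ncol(full_train) - 1                                    # number of covariates
alpha <- 1 / (1 + p / n)                                     # alpha = 1 / (1 + kappa)
full_train_adj <- full_train
full_train_adj$y <- alpha * full_train$y + (1 - alpha) / 2   # step 1: adjust responses
## step 2 is ML estimation with the adjusted responses, as in the glm() calls below;
## glm() warns about non-integer "successes", which is harmless here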

Existence and uniqueness, and equivariance

\(\hat{\beta}^{\textrm{\small DY}}\) exists and is unique for all \(\{y, X\}\) configurations, and \(\widehat{g(\beta)} = g(\hat\beta)\)

full_mdypl <- glm(y ~ ., data = full_train_adj, family = binomial)
nest_mdypl <- glm(y ~ ., data = nest_train_adj, family = binomial)

Penalized likelihood ratio test

Df PLR
64 64.36

fou + kar vs fou using MDYPL

Simulation setup

\(1000 \times 140\) matrix \(X_{\rm fou + kar}\) is fixed

\(\beta_{\rm fou + kar} = (\beta_{\rm fou}^\top, 0_{64}^\top)^\top\)

\(\beta_{\rm fou}\) from i.i.d \(N(0, 1)\), rescaled so that \[ \widehat{\rm var}(X_{\rm fou + kar} \beta_{\rm fou + kar}) = \gamma^2 \]

Intercept: \(\delta \in \{-3, -2, -1, 0\}\)

Signal strength: \(\gamma^2 \in \{1, 2, 4, 8, 16\}\)

\(500\) response vectors per \((\delta, \gamma^2)\)
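A sketch of one simulation replicate under these settings, assuming the fixed design is available as X_fou_kar with the 76 fou columns first:

delta <- -1; gamma2 <- 4                               # one (delta, gamma^2) setting
beta_fou_kar <- c(rnorm(76), rep(0, 64))               # signal only on the fou block
eta <- drop(X_fou_kar %*% beta_fou_kar)
beta_fou_kar <- beta_fou_kar * sqrt(gamma2) / sd(eta)  # sample variance of X beta is gamma^2
y <- rbinom(1000, 1, plogis(delta + drop(X_fou_kar %*% beta_fou_kar)))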

\(p / n \to \kappa \in (0, 1)\)

Recent developments

Candès & Sur (2020)

A sharp phase transition characterizing when the ML estimate has infinite components, under \(\eta_i = \delta + x_i^\top \beta_0\), with \(x_i \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), and \(\mathop{\mathrm{var}}(x_i^\top\beta_0) \to \gamma^2\)

Sur & Candès (2019), Zhao et al. (2022)

A method, based on approximate message passing (AMP), that, for \(\delta = 0\), recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever the ML estimate exists.

Phase transition, \(\hat\beta\)

Phase transition, \(\hat\beta / \mu_\star\) (Sur & Candès, 2019)


MDYPL: estimation and inference

Aggregate asymptotic behaviour of MDYPL estimator

For any \(\psi(\cdot, \cdot)\) that is pseudo-Lipschitz of order 2, \[ \frac{1}{p} \sum_{j=1}^{p} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{*}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)

Aggregate bias of \(\hat{\beta}^{\textrm{\small DY}}\) and \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{*}\)

\(\psi(t,u)\) statistic a.s. limit
\(t\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}\right)\) \(0\)
\(t / {\mu}_{*}\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{*}- \beta_{0,j}\right)\) \(0\)

State evolution of AMP recursion

Solution \(({\mu}_{*}, {b}_{*}, {\sigma}_{*})\) to the system of nonlinear equations in \((\mu, b, \sigma)\)

\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+\alpha}{2} - \zeta' \left(\textrm{prox}_{b \zeta}\left(Z_{*} + \frac{1+\alpha}{2}b\right) \right) \right\} \right] \\ 0 & = 1 - \kappa - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + b \zeta''\left(\textrm{prox}_{b\zeta}\left(Z_{*} + \frac{1+\alpha}{2}b\right)\right)} \right] \\ 0 & = \sigma^2 - \frac{b^2}{\kappa^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+\alpha}{2} - \zeta'\left(\textrm{prox}_{b\zeta}\left( Z_{*} + \frac{1+\alpha}{2}b\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]

\(\zeta(t) = \log(1 + e^{t})\), so that \(\zeta'\) is the logistic function

\(Z \sim \mathop{\mathrm{N}}(0, \gamma^2)\)

\(Z_{*} = \mu Z + \kappa^{1/2} \sigma G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)

\(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator


Solve with standard solvers for nonlinear systems, after cubature approximation of the expectations
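A minimal sketch of such a solver in R (not the authors' implementation), assuming the statmod package for normal-weight cubature and nleqslv for the nonlinear system:

library("statmod")    # gauss.quad.prob() for normal-weight cubature
library("nleqslv")    # nleqslv() for nonlinear systems

zeta1 <- function(t) plogis(t)                    # zeta'(t) with zeta(t) = log(1 + e^t)
zeta2 <- function(t) plogis(t) * (1 - plogis(t))  # zeta''(t)

## proximal operator of b * zeta, applied elementwise (assumes b > 0)
prox_bzeta <- function(x, b)
  vapply(x, function(xi)
    optimize(function(u) b * log1p(exp(u)) + 0.5 * (xi - u)^2,
             interval = xi + c(-1, 1) * (abs(b) + 20))$minimum,
    numeric(1))

## left-hand sides of the three state evolution equations in (mu, b, sigma)
state_evolution <- function(pars, kappa, gamma, alpha, n_nodes = 40) {
  mu <- pars[1]; b <- pars[2]; sigma <- pars[3]
  gh <- gauss.quad.prob(n_nodes, dist = "normal")
  Z <- gamma * rep(gh$nodes, times = n_nodes)          # Z ~ N(0, gamma^2)
  G <- rep(gh$nodes, each = n_nodes)                   # G ~ N(0, 1), independent of Z
  w <- rep(gh$weights, times = n_nodes) * rep(gh$weights, each = n_nodes)
  pr <- prox_bzeta(mu * Z + sqrt(kappa) * sigma * G + (1 + alpha) / 2 * b, b)
  c(sum(w * 2 * zeta1(Z) * Z * ((1 + alpha) / 2 - zeta1(pr))),
    1 - kappa - sum(w * 2 * zeta1(Z) / (1 + b * zeta2(pr))),
    sigma^2 - (b / kappa)^2 * sum(w * 2 * zeta1(Z) * ((1 + alpha) / 2 - zeta1(pr))^2))
}

## example: kappa = 0.14, gamma^2 = 4, alpha = 1 / (1 + kappa)
sol <- nleqslv(c(0.5, 1, 1), state_evolution,
               kappa = 0.14, gamma = 2, alpha = 1 / 1.14)
sol$x   # (mu_*, b_*, sigma_*)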

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}/ \mu_\star\) for \(\alpha = 1 / (1 + \kappa)\)

DY prior penalized likelihood ratio test statistics

Define the DY prior penalized likelihood ratio test statistic for \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\) as \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta \,;\,y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta \,;\,y^*, X)\,, \quad I = \{i_1,\ldots,i_k\} \]

Then, \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{*}^2}{b_{*}} \chi^2_k \]
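For instance, with the MDYPL fits from before and a state evolution solution sol (for example from the sketch after (1), with estimated \(\kappa\) and \(\gamma\)), the rescaled statistic and its p-value can be computed along these lines:

## 2 * Lambda_I as a difference of deviances of the MDYPL fits (saturated parts cancel)
PLR <- deviance(nest_mdypl) - deviance(full_mdypl)
k <- df.residual(nest_mdypl) - df.residual(full_mdypl)   # number of tested coefficients
kappa <- length(coef(full_mdypl)) / nobs(full_mdypl)
b_star <- sol$x[2]; sigma_star <- sol$x[3]               # from the state evolution solution
rescaled_PLR <- PLR * b_star / (kappa * sigma_star^2)
pchisq(rescaled_PLR, df = k, lower.tail = FALSE)         # p-value against chi^2_k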

Estimating unknowns

Dimension

\(\hat\kappa = p / n\)

Signal strength

Adaptation of the Signal Strength Leave-One-Out Estimator (SLOE) of Yadlowsky et al. (2021)

\[ \hat\gamma^2 = \frac{\sum_{i = 1}^n (s_i - \bar{s})^2}{n} \quad \text{with} \quad s_i = \hat{\eta}^{\textrm{\small DY}}_i - \frac{\hat{h}^{\textrm{\small DY}}_i}{1 - \hat{h}^{\textrm{\small DY}}_i} (\hat{z}^{\textrm{\small DY}}_i - \hat{\eta}^{\textrm{\small DY}}_i) \] where \(\hat{h}^{\textrm{\small DY}}_i\) is the hat value for the \(i\)th observation
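A sketch of this estimator in R from the MDYPL fit of the full model (assuming full_mdypl from before; hatvalues() and the working residuals give \(\hat{h}^{\textrm{\small DY}}_i\) and \(\hat{z}^{\textrm{\small DY}}_i - \hat{\eta}^{\textrm{\small DY}}_i\)):

h <- hatvalues(full_mdypl)                              # hat values
eta_hat <- predict(full_mdypl, type = "link")           # linear predictors
z_minus_eta <- residuals(full_mdypl, type = "working")  # working variate minus linear predictor
s <- eta_hat - h / (1 - h) * z_minus_eta
gamma2_hat <- sum((s - mean(s))^2) / nobs(full_mdypl)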

Intercepts

A similar conjecture to Zhao et al. (2022, Conjecture 7.1), with state evolution given by a system of 4 equations in 4 unknowns

fou + kar vs fou using MDYPL

Penalized likelihood ratio tests

Test           Statistic   p-value   Infinite estimates (fou + kar)   Infinite estimates (fou)
LR             0.00        1.00      True                             True
PLR            64.36       0.46      False                            False
rescaled PLR   173.34      0.00      False                            False

Simulation setup

\(1000 \times 140\) matrix \(X_{\rm fou + kar}\) is fixed

\(\beta_{\rm fou + kar} = (\beta_{\rm fou}^\top, 0_{64}^\top)^\top\)

\(\beta_{\rm fou}\) from i.i.d \(N(0, 1)\), rescaled so that \[ \widehat{\rm var}(X_{\rm fou + kar} \beta_{\rm fou + kar}) = \gamma^2 \]

Intercept: \(\delta \in \{-3, -2, -1, 0\}\)

Signal strength: \(\gamma^2 \in \{1, 2, 4, 8, 16\}\)

\(500\) response vectors per \((\delta, \gamma^2)\)

Optimal shrinkage

Amount of shrinkage (value of \(\alpha\)) that results in \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{*}\) having minimal asymptotic MSE

Better asymptotic MSE than the rescaled ML estimator, whenever the latter exists
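A sketch of how such an \(\alpha\) could be found numerically, reusing state_evolution() and nleqslv() from the earlier sketch and assuming that the asymptotic aggregate MSE of \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{*}\) is \(\sigma_*^2 / \mu_*^2\) (which follows from the aggregate result with \(\psi(t, u) = (t / \mu_*)^2\)):

## asymptotic aggregate MSE of beta^DY / mu_* as a function of alpha
amse <- function(alpha, kappa, gamma) {
  fit <- nleqslv(c(0.5, 1, 1), state_evolution,
                 kappa = kappa, gamma = gamma, alpha = alpha)
  fit$x[3]^2 / fit$x[1]^2                    # sigma_*^2 / mu_*^2
}
opt <- optimize(amse, interval = c(0.05, 0.95), kappa = 0.14, gamma = 2)
opt$minimum                                  # approximately MSE-optimal alpha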

Sterzinger P, Kosmidis I (2023). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290

References

Candès, E. J., & Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. Annals of Statistics, 48(1), 27–42. https://doi.org/10.1214/18-AOS1789
Rigon, T., & Aliverti, E. (2023). Conjugate priors and bias reduction for logistic regression models. Statistics & Probability Letters, 202, 109901. https://doi.org/10.1016/j.spl.2023.109901
Sur, P., & Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516–14525. https://doi.org/10.1073/pnas.1810420116
Yadlowsky, S., Yun, T., McLean, C. Y., & D’Amour, A. (2021). SLOE: A faster method for statistical inference in high-dimensional logistic regression. Advances in Neural Information Processing Systems, 34, 29517–29528. https://proceedings.neurips.cc/paper/2021/file/f6c2a0c4b566bc99d596e58638e342b0-Paper.pdf
Zhao, Q., Sur, P., & Candès, E. J. (2022). The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3). https://doi.org/10.3150/21-BEJ1401