# Penalized likelihood estimation and inference in high-dimensional logistic regression

Ioannis Kosmidis

Professor of Statistics
Department of Statistics, University of Warwick

ioannis.kosmidis@warwick.ac.uk
ikosmidis.com   ikosmidis ikosmidis_

Statistics Seminars
Department of Decision Sciences
Università Bocconi

22 February 2024

## Collaborators

Philipp Sterzinger
Patrick Zietkiewicz

Sterzinger P, Kosmidis I (2023). Diaconis-Ylvisaker prior penalized likelihood for $p / n \to \kappa \in (0, 1)$ logistic regression. ArXiV: 2311.11290

Kosmidis I, Zietkiewicz P (2023). Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias. ArXiV: 2311.07419

# Logistic regression

## Logistic regression model

### Data

Responses $y_1, \ldots, y_n$ with $y_i \in \{0, 1\}$

Covariate vectors $x_1, \ldots, x_n$ with $x_i \in \Re^p$

### Model

$Y_{1}, \ldots, Y_{n}$ conditionally independent with

$Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \sum_{t = 1}^p \beta_t x_{it}$

Widely used for inference about covariate effects on binomial probabilities, probability calibration, and prediction

## Maximum likelihood estimation

### Log-likelihood

$\displaystyle l(\beta; y, X) = \sum_{i = 1}^n \left\{y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right\}$

### Maximum likelihood (ML) estimator

$\hat{\beta} = \arg \max l(\beta; y, X)$

### Iterative re-weighted least squares

$\hat\beta := \beta^{(\infty)}$ where $\displaystyle \beta^{(j + 1)} = \left(X^T W^{(j)} X\right)^{-1} X^T W^{(j)} z^{(j)}$

$W$ is a diagonal matrix with $i$th diagonal element $\pi_i ( 1- \pi_i)$

$z_i = (y_i - \pi_i) / \{ \pi_i (1 - \pi_i) \}$ is the working variate

# $p / n \to \kappa \in (0, 1)$

## Artificial example

$\begin{array}{l} n = 2000 \\ p = 400 \\ x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \\ \beta_0 = (b_1, \ldots, b_{200}, 0, \ldots, 0)^\top \\ b_j \stackrel{iid}{\sim} \mathop{\mathrm{N}}(6.78, 1) \\ \end{array} \implies \begin{array}{l} \gamma^2 = \mathop{\mathrm{var}}(x_i^\top \beta_0) \simeq 4.7 \\ \kappa = 0.2 \\ \end{array}$

$y_i \stackrel{ind}{\sim} \mathop{\mathrm{Bernoulli}}(1/(1 + {e^{-x_i^\top\beta_0}}))$

## Frequentist performance

$W = 2 ( \hat{l} - \hat{l}_{0} )$, $\hat{l}_{0}$ is maximized log-likelihood under $H_0: \beta_{201} = \ldots = \beta_{210} = 0$

Classical theory predicts $W_0 \stackrel{d}{\rightarrow} \chi^2_{10}$

$Z = \hat\beta_{k} / [(X^\top \hat{W} X)^{-1}]_{kk}^{1/2}$

Classical theory predicts $Z \stackrel{d}{\rightarrow} \mathop{\mathrm{N}}(0, 1)$ when $\beta_{k} = 0$

## Recent developments

### Candès & Sur (2020)

sharp phase transition about when the ML estimate has infinite components, when $\eta_i = \delta + x_i^\top \beta$, $x_i \sim \mathop{\mathrm{N}}(0, \Sigma)$, $p/n \to \kappa \in (0, 1)$, $\mathop{\mathrm{var}}(x_i^\top\beta_0) \to \gamma^2$

### Sur & Candès (2019), Zhao et al. (2022)

a method, based on approximate message passing (AMP), that recovers estimation and inferential performance by appropriately rescaling $\hat\beta$, whenever that exists

### Sur & Candès (2019), Kosmidis & Firth (2021)

empirical evidence that maximum Jeffreys-penalized likelihood (mJPL) achieves a substantial reduction in persistent bias when $p / n \to \kappa$, performing similarly to the rescaled ML estimator, whenever that exists

?

# Diaconis-Ylvisaker prior

## Diaconis-Ylvisaker prior

### Prior

$\log p(\beta; X) = \frac{1 - {\color[rgb]{0.70,0.1,0.12}{\alpha}}}{{\color[rgb]{0.70,0.1,0.12}{\alpha}}} \sum_{i=1}^n \left\{ \zeta'\left(x_i^\top {\color[rgb]{0.70,0.1,0.12}{\beta}_P}\right) x_i^\top \beta - \zeta\left(x_i^\top \beta\right) \right\} + C$ with $\zeta(\eta) = \log(1 + e^\eta)$

### Penalized log-likelihood

$l^*(\beta; y, X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \zeta\left(x_i^\top \beta\right) \right\}$ with $y_i^* = \alpha y_i + (1 - \alpha) \zeta'\left(x_i^\top \beta_P\right)$

## Maximum DY prior penalized likelihood

### Maximum DY prior penalized likelihood (MDYPL)

$\hat{\beta}^{\textrm{\small DY}}= \arg \max l^*(\beta; y, X)$

### MDYPL implementation using ML procedures

MDYPL is ML with pseudo-responses $\quad \Longleftarrow \quad$ $l(\beta; y^*, X) / \alpha = \ell^*(\beta; y, X)$

### Existence and uniqueness

Rigon & Aliverti (2023, Theorem 1): $\hat{\beta}^{\textrm{\small DY}}$ is unique and exists for all $\{y, X\}$ configurations

### Shrinkage

$\displaystyle \hat{\beta}^{\textrm{\small DY}}\longrightarrow \left\{ \begin{array}{ll} \hat\beta\,, & \alpha \to 1 \\ \beta_P\,, & \alpha \to 0 \end{array}\right.$

Rigon & Aliverti (2023) suggest adaptive shrinkage with $\alpha = 1 / (1 + \kappa)$

## Shrinkage direction

### $p / n = o(1)$(Cordeiro & McCullagh, 1991)

ML estimator of a logistic regression model is biased away from $0$

### $p / n \to \kappa \in (0, 1)$(Sur & Candès, 2019; Zhao et al., 2022)

Empirical evidence that this continues to hold in high dimensions

$\LARGE\Downarrow$

$\beta_{P} = 0$

# Estimation

## Setting

### Signal $\beta_0$

The empirical distribution of $\beta_0$ elements converges to $\pi_{\bar{\beta}}$ and $\sum_{j = 1}^p \beta_{0,j}^2 / p \to \mathop{\mathrm{E}}(\bar{\beta}^2) < \infty$

### Covariate distribution, dimension, and signal strength

$x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1 / n)$

As $p / n \to \kappa \in (0, 1)$, $\mathop{\mathrm{var}}(x_i^\top \beta_0) \to \gamma^2 > 0$

### Idea

Develop a generalized AMP recursion (see Feng et al., 2022 for an overview of AMP), whose iterates have a known asymptotic distribution, with stationary point $\hat{\beta}^{\textrm{\small DY}}$

## State evolution of AMP recursion

Solution $({\mu}_{*}, {b}_{*}, {\sigma}_{*})$ to the system of nonlinear equations in $(\mu, b, \sigma)$

\begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+\alpha}{2} - \zeta' \left(\textrm{prox}_{b \zeta}\left(Z_{*} + \frac{1+\alpha}{2}b\right) \right) \right\} \right] \\ 0 & = 1 - \kappa - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + b \zeta''(\textrm{prox}_{b\zeta}\left(Z_{*} + \frac{1+\alpha}{2}b\right))} \right] \\ 0 & = \sigma^2 - \frac{b^2}{\kappa^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+\alpha}{2} - \zeta'\left(\textrm{prox}_{b\zeta}\left( Z_{*} + \frac{1+\alpha}{2}b\right)\right) \right\}^2 \right] \end{aligned} \tag{1}

$Z \sim \mathop{\mathrm{N}}(0, \gamma^2)$

$Z_{*} = \mu Z + \kappa^{1/2} \sigma G$ with $G \sim \mathop{\mathrm{N}}(0, 1)$ independent of $Z$

$\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}$ is the proximal operator

Solvers for nonlinear systems, after cubature approximation of expectations

## MDYPL asymptotics

Theorem 1 (Aggregate asymptotic behaviour of mDYPL estimator)

Assume that $(\alpha,\kappa,\gamma)$ are such that $\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})$ almost surely, and that (1) admits a solution $({\mu}_{*}, {b}_{*}, {\sigma}_{*})$ such that the Jacobian of the RHS of (1) is nonsingular.

Then, for any function $\psi: \Re^2 \to \Re$ that is pseudo-Lipschitz of order 2 $\frac{1}{p} \sum_{j=1}^{n} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{*}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty$ where $G \sim \mathop{\mathrm{N}}(0,1)$ is independent of $\bar{\beta} \sim \pi_{\bar{\beta}}$

## Behaviour of $\hat{\beta}^{\textrm{\small DY}}$

$\psi(t,u)$ statistic a.s. limit
$t$ $\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}\right)$ Bias $0$
$(t-(1-{\mu}_{*})u)^2$ $\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2$ MSE ${\sigma}_{*}^2 + (1-{\mu}_{*})^2 \gamma^2 / \kappa$
$t^2$ $\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}\right)^2$ Variance ${\sigma}_{*}^2$

## Behaviour of $\hat{\beta}^{\textrm{\small DY}}/ \mu^*$

$\psi(t,u)$ statistic a.s. limit
$t$ $\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{*}- \beta_{0,j}\right)$ Bias $0$
$t^2/{\mu}_{*}^2$ $\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{*}- \beta_{0,j}\right)^2$ MSE / Variance ${\sigma}_{*}^2 / {\mu}_{*}^2$

## $\hat{\beta}^{\textrm{\small DY}}$ for $\alpha = 1 / (1 + \kappa)$

Rigon & Aliverti (2023)

## $\hat{\beta}^{\textrm{\small DY}}$ for $\alpha = 1 / (1 + \kappa)$

Rigon & Aliverti (2023)

## Optimal shrinkage

For any given $\kappa$ and $\gamma$, find the amount of shrinkage (value of $\alpha$) that results in:

• $\hat{\beta}^{\textrm{\small DY}}$ having zero asymptotic bias

• $\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{*}$ having minimal asymptotic mean squared error

### Zero Asymptotic bias

For $(\kappa, \gamma) \in (0, 1) \times (0, \infty)$: $\displaystyle \quad \alpha_{\rm \small UB} = \arg \mathop{\mathrm{solve}}_{\alpha \in (0, 1)} \{{\mu}_{*}= 1\}$

### Minimal asymptotic MSE

For $(\kappa, \gamma) \in (0, 1) \times (0, \infty)$: $\displaystyle \quad \alpha_{\rm \small MM} = \arg \min_{\alpha \in (0, 1)} \frac{{\sigma}_{*}^2}{{\mu}_{*}^2}$

## Extensions

### Abritrary covariate covariance

$\hat{\beta}^{\textrm{\small DY}}$ is the mDYPL estimator with covariate vectors $x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)$ and signal $\beta_0$

### Dimension and signal strength

As $p / n \to \kappa \in (0, 1)$, $\mathop{\mathrm{var}}(x_i^\top \beta_0) = \beta_0^\top \Sigma \beta_0 \to \gamma^2 > 0$

### Aggregate asymptotics

Under the assumption $\underset{n \to \infty}{\limsup } \, \lambda_{\max}(\Sigma) / \lambda_{\min}(\Sigma) < \infty$, Theorem 1 still holds with $\frac{1}{p} \sum_{j=1}^p \psi \left( {\color[rgb]{0.70,0.1,0.12}{\sqrt{n} \tau_j}} \left( \hat{\beta}^{\textrm{\small DY}}_{j}-{\mu}_{*}\beta_{0,j}\right), {\color[rgb]{0.70,0.1,0.12}{\sqrt{n} \tau_j}} \beta_{0,j} \right) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{*}G, \bar{\beta})\right]$ where $\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})$

# Inference

## Adjusted $Z$-statistics

Theorem 2 Assume that $(\alpha,\kappa,\gamma)$ are such that the conditions of Theorem 1 are met.

Then for any regression coefficient such that $\sqrt{n} \tau_j \beta_{0,j} = \mathcal{O}(1)$, $Z^*_j = \sqrt{n} \tau_j \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{*}\beta_{0,j}}{{\sigma}_{*}} \overset{d}{\longrightarrow} \mathcal{N}(0,1)$

## DY prior penalized likelihood ratio test statistics

Theorem 3 Assume that $(\alpha,\kappa,\gamma)$ are such that the conditions of Theorem 1 are met.

Let $I = \{i_1,\ldots,i_k\}$, and define the DY prior penalized likelihood ratio test statistic $\Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta; y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta; y^*, X)$

Then, under $H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0$, it holds that $2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{*}^2}{b_{*}} \chi^2_k$

# Discussion

## Current work

### Models with intercept

Empirical evidence with MDYPL supports Zhao et al. (2022, Conjecture 7.1), with state evolution given from a system of 4 equations in 4 unknowns

### Non-gaussian model matrices

#### Distributions with light-tails

Asymptotic results seem to apply for a broad class of covariate distributions with sufficiently light tails (e.g. sub-Gaussian), as also observed in Sur & Candès (2019)

#### Universality

Results in Han & Shen (2023) can be used to examine universality for general model matrices

## Estimating unknowns

### Dimension

$p / n$ is a good estimate of $\kappa$

### Signal strength

For multivariate normal covariates and potentially other distributions with light tails, $\gamma$ can be estimated using

• the ProbFrontier method

• the SLOE method

### Conditional variance of covariates

$\tau_j$ can be estimated using the residual sums of squares from the regressions of each covariate on all the others

## Maximum Jeffreys prior penalized likelihood

$Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \delta + \sum_{t = 1}^p \beta_t x_{it}$

Conjecture 1 Suppose that $x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)$, and that $\mathop{\mathrm{var}}(x_i^\top \beta_0) \to \gamma_0^2$, as $p / n \to \kappa \in (0, 1)$.

Then, the mJPL estimator $(\tilde\delta, \tilde\beta^\top)^\top$ satisfies $\frac{1}{p} \sum_{j = 1}^p (\tilde\beta_j - \alpha_\star \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} 0$ with $\alpha_\star \le 1$. For small to moderate values of $\delta_0^2 / \gamma_0^2$, the scaling $\alpha_\star$ can be approximated by $q(\kappa, \gamma, \gamma_0; b) = \left\{ \begin{array}{ll} 1 \,, & \text{if } \kappa < h_{\rm MLE}(\delta, \gamma_0) \\ \kappa^{b_1} \gamma^{b_2} \gamma_0^{b_3}\,, & \text{if } \kappa > h_{\rm MLE}(\delta, \gamma_0) \end{array} \right.$ where $b_1 < -1$, $b_2 < -1$, $b_3 > 0$.

Predictions for $b_1$, $b_2$, $b_3$ are -1.172, -1.869, 0.817, respectively

## References

Candès, E. J., & Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. Annals of Statistics, 48(1), 27–42. https://doi.org/10.1214/18-AOS1789
Cordeiro, G. M., & McCullagh, P. (1991). Bias correction in generalized linear models. Journal of the Royal Statistical Society, Series B: Methodological, 53(3), 629–643. https://doi.org/10.1111/j.2517-6161.1991.tb01852.x
Feng, O. Y., Venkataramanan, R., Rush, C., & Samworth, R. J. (2022). A unifying tutorial on approximate message passing. Foundations and Trends in Machine Learning, 15(4), 335–536. http://dx.doi.org/10.1561/2200000092
Han, Q., & Shen, Y. (2023). Universality of regularized regression estimators in high dimensions. The Annals of Statistics, 51(4), 1799–1823. https://doi.org/10.1214/23-AOS2309
Kosmidis, I., & Firth, D. (2021). Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models. Biometrika, 108(1), 71–82. https://doi.org/10.1093/biomet/asaa052
Kosmidis, I., & Zietkiewicz, P. (2023). Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias. arXiv Preprint arXiv:2311.11290. https://arxiv.org/abs/2311.11290
Rigon, T., & Aliverti, E. (2023). Conjugate priors and bias reduction for logistic regression models. Statistics & Probability Letters, 202, 109901. https://doi.org/10.1016/j.spl.2023.109901
Sur, P., & Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516–14525. https://doi.org/10.1073/pnas.1810420116
Yadlowsky, S., Yun, T., McLean, C. Y., & D’Amour, A. (2021). SLOE: A faster method for statistical inference in high-dimensional logistic regression. Advances in Neural Information Processing Systems, 34, 29517–29528. https://proceedings.neurips.cc/paper/2021/file/f6c2a0c4b566bc99d596e58638e342b0-Paper.pdf
Zhao, Q., Sur, P., & Candès, E. J. (2022). The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3). https://doi.org/10.3150/21-BEJ1401

Sterzinger P, Kosmidis I (2023). Diaconis-Ylvisaker prior penalized likelihood for $p / n \to \kappa \in (0, 1)$ logistic regression. ArXiV: 2311.11290

Kosmidis I, Zietkiewicz P (2023). Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias. ArXiV: 2311.07419