Penalized likelihood estimation and inference in high-dimensional logistic regression

Ioannis Kosmidis

Professor of Statistics
Department of Statistics, University of Warwick

ioannis.kosmidis@warwick.ac.uk
ikosmidis.com ikosmidis ikosmidis_

ISNPS 2024
Braga, Portugal

26 June 2024

Joint with

Philipp Sterzinger

Sterzinger P, Kosmidis I (2023). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290

Slides

ikosmidis.com/files/ikosmidis-ISNPS-2024

Logistic regression

Data

Responses \(y_1, \ldots, y_n\) with \(y_i \in \{0, 1\}\)

Covariate vectors \(x_1, \ldots, x_n\) with \(x_i \in \Re^p\)

Model

\(Y_{1}, \ldots, Y_{n}\) conditionally independent with

\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \sum_{t = 1}^p \beta_t x_{it} \]

Widely used for inference about covariate effects on binomial probabilities, probability calibration, and prediction

Maximum likelihood estimation

Log-likelihood

\(\displaystyle l(\beta \,;\,y, X) = \sum_{i = 1}^n \left\{y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right\}\)

Maximum likelihood (ML) estimator

\(\hat{\beta} = \arg \max l(\beta \,;\,y, X)\)

Iterative re-weighted least squares

\(\hat\beta := \beta^{(\infty)}\) where \(\displaystyle \beta^{(j + 1)} = \left(X^T W^{(j)} X\right)^{-1} X^T W^{(j)} z^{(j)}\)

\(W\) is a diagonal matrix with \(i\)th diagonal element \(\pi_i ( 1- \pi_i)\)

\(z_i = \eta_i + (y_i - \pi_i) / \{ \pi_i (1 - \pi_i) \}\) is the working variate
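The IRLS update above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production fitter: the function name `irls_logistic`, the starting value of zero, the stopping rule and the simulated data are all assumptions made for the example.

```python
import numpy as np

def irls_logistic(X, y, max_iter=50, tol=1e-10):
    """Logistic regression ML fit by iteratively re-weighted least squares.

    Hypothetical helper for illustration; it does not guard against
    separated data, where the ML estimate has infinite components.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        w = pi * (1.0 - pi)              # diagonal of W
        z = eta + (y - pi) / w           # working variate
        XtW = X.T * w                    # X^T W without forming the n x n matrix
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Small simulated example (assumed settings, not those of the slides)
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
beta_true = np.array([1.0, -0.5, 0.25])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
beta_hat = irls_logistic(X, y)
```

At convergence the score \(X^\top (y - \pi)\) is numerically zero, which is a convenient check on any implementation.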

\(p / n \to \kappa \in (0, 1)\)

Performance

\(n = 2000, p = 400, x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1/n)\)

Estimation

Performance

\(n = 2000, p = 400, x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1/n)\)

Likelihood ratio tests

\(W = 2 ( \hat{l} - \hat{l}_{0} )\), \(\hat{l}_{0}\) is maximized log-likelihood under \(H_0: \beta_{201} = \ldots = \beta_{210} = 0\)

Classical theory predicts \(W \stackrel{d}{\rightarrow} \chi^2_{10}\) under \(H_0\)

Performance

\(n = 2000, p = 400, x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1/n)\)

Wald tests

\(Z = \hat\beta_{k} / [(X^\top \hat{W} X)^{-1}]_{kk}^{1/2}\)

Classical theory predicts \(Z \stackrel{d}{\rightarrow} \mathop{\mathrm{N}}(0, 1)\) when \(\beta_{k} = 0\)
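As a sketch of how the classical Wald statistic is computed from a fitted coefficient vector, the snippet below evaluates \(Z\) and the two-sided p-value implied by the \(\mathop{\mathrm{N}}(0, 1)\) approximation. The function name `wald_z` and the toy inputs are assumptions for illustration; the p-value is only as good as the classical approximation, which the slides argue fails when \(p / n\) is not small.

```python
import math
import numpy as np

def wald_z(X, beta_hat, k):
    """Classical Wald z-statistic for coefficient k and its two-sided p-value."""
    eta = X @ beta_hat
    pi = 1.0 / (1.0 + np.exp(-eta))
    w = pi * (1.0 - pi)                      # diagonal of W evaluated at beta_hat
    info = (X.T * w) @ X                     # X^T W X
    se_k = math.sqrt(np.linalg.inv(info)[k, k])
    z = float(beta_hat[k]) / se_k
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # 2 * P(N(0,1) > |z|)
    return z, p_value

# Toy inputs (hypothetical): one observation, one covariate
X_toy = np.array([[2.0]])
beta_toy = np.array([0.5])
z_toy, p_toy = wald_z(X_toy, beta_toy, k=0)
```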

Recent developments

Candès & Sur (2020)

A sharp phase transition for whether the ML estimate has infinite components, in the setting \(\eta_i = \delta + x_i^\top \beta\), \(x_i \sim \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top\beta_0) \to \gamma^2\)

Sur & Candès (2019), Zhao et al. (2022)

A method, based on approximate message passing (AMP), that recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever the ML estimate exists

Phase transition, \(\hat\beta\)

Phase transition, \(\hat\beta / \mu_\star\) (Sur & Candès, 2019)

Diaconis-Ylvisaker prior

Prior

\[ \log p(\beta \,;\,X) = \frac{1 - {\color[rgb]{0.70,0.1,0.12}{\alpha}}}{{\color[rgb]{0.70,0.1,0.12}{\alpha}}} \sum_{i=1}^n \left\{ \frac{1}{2}x_i^\top \beta - \zeta\left(x_i^\top \beta\right) \right\} + C \] with \(\zeta(\eta) = \log(1 + e^\eta)\), \(\alpha \in (0, 1)\)

Penalized log-likelihood

\[ l^*(\beta \,;\,y, X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \zeta\left(x_i^\top \beta\right) \right\} \] with \(y_i^* = \alpha y_i + (1 - \alpha) / 2\)
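The pseudo-responses \(y_i^*\) arise by adding the log-likelihood and the log-prior and collecting terms. A short worked step, using only the definitions above and writing \(\eta_i = x_i^\top \beta\):

\[ \begin{aligned} l(\beta \,;\,y, X) + \log p(\beta \,;\,X) & = \sum_{i = 1}^n \left\{ y_i \eta_i - \zeta(\eta_i) \right\} + \frac{1 - \alpha}{\alpha} \sum_{i = 1}^n \left\{ \frac{1}{2} \eta_i - \zeta(\eta_i) \right\} + C \\ & = \frac{1}{\alpha} \sum_{i = 1}^n \left\{ \left( \alpha y_i + \frac{1 - \alpha}{2} \right) \eta_i - \zeta(\eta_i) \right\} + C \\ & = l^*(\beta \,;\,y, X) + C \end{aligned} \]

The coefficient of \(\eta_i\) is \(y_i + (1 - \alpha) / (2 \alpha) = y_i^* / \alpha\), and the coefficient of \(\zeta(\eta_i)\) is \(-1 - (1 - \alpha) / \alpha = -1 / \alpha\).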

Maximum DY prior penalized likelihood

Maximum DY prior penalized likelihood (MDYPL)

\(\hat{\beta}^{\textrm{\small DY}}= \arg \max l^*(\beta \,;\,y, X)\)

MDYPL implementation using ML procedures

\(l(\beta \,;\,y^*, X) / \alpha = l^*(\beta \,;\,y, X)\) \(\quad \Longrightarrow \quad\) MDYPL is ML with pseudo-responses
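Because MDYPL is ML with pseudo-responses, any logistic regression fitter that accepts responses in \((0, 1)\) can be reused. The sketch below is a minimal illustration under that observation: the function name `mdypl`, the Newton iteration (equivalent to IRLS for this model) and the toy data are assumptions for the example, not the authors' implementation.

```python
import numpy as np

def mdypl(X, y, alpha, max_iter=100, tol=1e-12):
    """Maximum DY prior penalized likelihood via ML on pseudo-responses."""
    y_star = alpha * y + (1.0 - alpha) / 2.0     # pseudo-responses in (0, 1)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)
        # Newton step: (X^T W X)^{-1} X^T (y* - pi); same fixed point as IRLS
        step = np.linalg.solve((X.T * w) @ X, X.T @ (y_star - pi))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Perfectly separated toy data: the ML estimate is infinite here,
# while the MDYPL estimate is finite for any alpha in (0, 1).
X_sep = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_sep = np.array([0.0, 0.0, 1.0, 1.0])
beta_08 = mdypl(X_sep, y_sep, alpha=0.8)
beta_05 = mdypl(X_sep, y_sep, alpha=0.5)   # smaller alpha, more shrinkage
```

In this toy example the estimate with \(\alpha = 0.5\) is closer to zero than the one with \(\alpha = 0.8\), in line with shrinkage towards zero as \(\alpha\) decreases.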

Existence and uniqueness, and equivariance

\(\hat{\beta}^{\textrm{\small DY}}\) is unique and exists for all \(\{y, X\}\) configurations¹, and \(\widehat{g(\beta)} = g(\hat\beta)\)

Shrinkage ²

\(\displaystyle \hat{\beta}^{\textrm{\small DY}}\longrightarrow \left\{ \begin{array}{ll} \hat\beta\,, & \alpha \to 1 \\ 0 \,, & \alpha \to 0 \end{array}\right.\)

Estimation

Setting

Covariate distribution, dimension, and signal strength

\(x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\)

As \(p / n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top \beta_0) = \beta_0^\top \Sigma \beta_0 \to \gamma^2 > 0\)

Signal \(\beta_0\)

The empirical distribution of the appropriately scaled components of \(\beta_0\) converges to a distribution \(\pi_{\bar\beta}\) with finite second moment

Goal

Develop a generalized AMP recursion¹, whose iterates have a known asymptotic distribution, with stationary point \(\hat{\beta}^{\textrm{\small DY}}\), and use that to correct MDYPL-based estimation and inference

State evolution of AMP recursion

Solution \(({\mu}_{*}, {b}_{*}, {\sigma}_{*})\) to the system of nonlinear equations in \((\mu, b, \sigma)\)

\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+\alpha}{2} - \zeta' \left(\textrm{prox}_{b \zeta}\left(Z_{*} + \frac{1+\alpha}{2}b\right) \right) \right\} \right] \\ 0 & = 1 - \kappa - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + b \zeta''(\textrm{prox}_{b\zeta}\left(Z_{*} + \frac{1+\alpha}{2}b\right))} \right] \\ 0 & = \sigma^2 - \frac{b^2}{\kappa^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+\alpha}{2} - \zeta'\left(\textrm{prox}_{b\zeta}\left( Z_{*} + \frac{1+\alpha}{2}b\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]

\(Z \sim \mathop{\mathrm{N}}(0, \gamma^2)\)

\(Z_{*} = \mu Z + \kappa^{1/2} \sigma G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)

\(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator
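The proximal operator has no closed form for \(\zeta(\eta) = \log(1 + e^\eta)\), but its first-order condition \(u + b \zeta'(u) = x\) has a unique root in \([x - b, x]\) because \(\zeta'(u) \in (0, 1)\). A minimal sketch that finds it by bisection follows; the function names and the choice of bisection over Newton's method are assumptions made for robustness of the illustration.

```python
import numpy as np

def sigmoid(u):
    """zeta'(u) = 1 / (1 + exp(-u)), computed stably."""
    return np.exp(-np.logaddexp(0.0, -u))

def prox_b_zeta(b, x, n_iter=60):
    """prox_{b zeta}(x): root of u + b * zeta'(u) - x = 0, by bisection."""
    x = np.asarray(x, dtype=float)
    lo, hi = x - b, x.copy()              # the root is bracketed by [x - b, x]
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        too_big = mid + b * sigmoid(mid) - x > 0
        hi = np.where(too_big, mid, hi)
        lo = np.where(too_big, lo, mid)
    return 0.5 * (lo + hi)

x_grid = np.linspace(-5.0, 5.0, 11)
u_grid = prox_b_zeta(2.0, x_grid)
```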

Solvers for nonlinear systems, after cubature approximation of expectations
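One way to carry out the cubature step is a tensor-product Gauss-Hermite rule over \((Z, G)\). The sketch below evaluates the three right-hand sides of system (1) at a given \((\mu, b, \sigma)\); the function names, the number of nodes and the bisection-based proximal operator are assumptions for illustration, and the slides do not specify which cubature rule or solver is used.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def sigmoid(u):
    return np.exp(-np.logaddexp(0.0, -u))          # zeta'(u)

def prox_b_zeta(b, x, n_iter=60):
    lo, hi = x - b, x.copy()                       # root of u + b*zeta'(u) = x
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        too_big = mid + b * sigmoid(mid) - x > 0
        hi, lo = np.where(too_big, mid, hi), np.where(too_big, lo, mid)
    return 0.5 * (lo + hi)

def se_residuals(mu, b, sigma, kappa, gamma, alpha, n_nodes=40):
    """Right-hand sides of system (1), expectations by Gauss-Hermite cubature."""
    nodes, weights = hermegauss(n_nodes)           # weight function exp(-x^2 / 2)
    weights = weights / np.sqrt(2.0 * np.pi)       # normalise to N(0, 1)
    z, g = np.meshgrid(gamma * nodes, nodes, indexing="ij")   # Z ~ N(0, gamma^2)
    w2 = np.outer(weights, weights)
    c = (1.0 + alpha) / 2.0
    z_star = mu * z + np.sqrt(kappa) * sigma * g
    s_z = sigmoid(z)
    s_u = sigmoid(prox_b_zeta(b, z_star + c * b))
    expect = lambda f: float(np.sum(w2 * f))
    r1 = expect(2.0 * s_z * z * (c - s_u))
    r2 = 1.0 - kappa - expect(2.0 * s_z / (1.0 + b * s_u * (1.0 - s_u)))
    r3 = sigma**2 - b**2 / kappa**2 * expect(2.0 * s_z * (c - s_u) ** 2)
    return np.array([r1, r2, r3])

res = se_residuals(mu=0.8, b=1.0, sigma=1.5, kappa=0.2, gamma=1.0, alpha=0.9)
```

A function of this form could then be handed to a generic root finder (for example `scipy.optimize.root`) with \((\mu, b, \sigma)\) as unknowns. A useful sanity check is \(b = 0\), where the second residual reduces to \(-\kappa\) and the third to \(\sigma^2\).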

Aggregate asymptotic behaviour of MDYPL estimator

For any \(\psi(\cdot, \cdot)\) that is pseudo-Lipschitz of order 2, \[ \frac{1}{p} \sum_{j=1}^{p} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{*}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}\)

\(\psi(t,u)\)	Statistic	Quantity	a.s. limit
\(t\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}\right)\)	Bias	\(0\)
\(t^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}\right)^2\)	Variance	\({\sigma}_{*}^2\)
\((t-(1-{\mu}_{*})u)^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\)	MSE	\({\sigma}_{*}^2 + (1-{\mu}_{*})^2 \gamma^2 / \kappa\)
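The MSE limit in the last row follows from the aggregate result with \(\psi(t, u) = (t - (1 - {\mu}_{*}) u)^2\). A short worked step, assuming (as the stated limit implies) that \(\mathop{\mathrm{E}}[\bar{\beta}^2] = \gamma^2 / \kappa\):

\[ \begin{aligned} \mathop{\mathrm{E}}\left[ \left\{ {\sigma}_{*} G - (1 - {\mu}_{*}) \bar{\beta} \right\}^2 \right] & = {\sigma}_{*}^2 \mathop{\mathrm{E}}[G^2] - 2 {\sigma}_{*} (1 - {\mu}_{*}) \mathop{\mathrm{E}}[G] \mathop{\mathrm{E}}[\bar{\beta}] + (1 - {\mu}_{*})^2 \mathop{\mathrm{E}}[\bar{\beta}^2] \\ & = {\sigma}_{*}^2 + (1 - {\mu}_{*})^2 \gamma^2 / \kappa \end{aligned} \]

The cross term vanishes because \(G\) is independent of \(\bar{\beta}\) and has mean zero.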

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{*}\)

\(\psi(t,u)\)	Statistic	Quantity	a.s. limit
\(t / {\mu}_{*}\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{*}- \beta_{0,j}\right)\)	Bias	\(0\)
\(t^2/{\mu}_{*}^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{*}- \beta_{0,j}\right)^2\)	MSE / Variance	\({\sigma}_{*}^2 / {\mu}_{*}^2\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}/ \mu_\star\) for \(\alpha = 1 / (1 + \kappa)\)

Inference

Adjusted \(Z\)-statistics

If \(\sqrt{n} \tau_j \beta_{0,j} = \mathcal{O}(1)\) with \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\), then \[ Z^*_j = \sqrt{n} \tau_j \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{*}\beta_{0,j}}{{\sigma}_{*}} \overset{d}{\longrightarrow} \mathcal{N}(0,1) \]
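The adjusted statistic is a one-line computation once \({\mu}_{*}\), \({\sigma}_{*}\) and \(\tau_j\) are available. The sketch below is illustrative only: the function name `adjusted_z` and all numerical inputs are hypothetical, and the default tests \(\beta_{0,j} = 0\).

```python
import math

def adjusted_z(beta_dy_j, tau_j, n, mu_star, sigma_star, beta_null_j=0.0):
    """Adjusted z-statistic for coefficient j and its two-sided N(0, 1) p-value."""
    z = math.sqrt(n) * tau_j * (beta_dy_j - mu_star * beta_null_j) / sigma_star
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # 2 * P(N(0,1) > |z|)
    return z, p_value

# Hypothetical inputs, for illustration only
z_adj, p_adj = adjusted_z(beta_dy_j=0.1, tau_j=1.0, n=1000,
                          mu_star=0.7, sigma_star=2.0)
```

Note that \({\mu}_{*}\) only enters through the null value, so it plays no role when testing \(\beta_{0,j} = 0\).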

DY prior penalized likelihood ratio test statistics

Define the DY prior penalized likelihood ratio test statistic for \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\) as \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, l(\beta \,;\,y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, l(\beta \,;\,y^*, X)\,, \quad I = \{i_1,\ldots,i_k\} \]

Then, under \(H_0\), \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{*}^2}{b_{*}} \chi^2_k \]
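Equivalently, the limit suggests referring a rescaled statistic to the usual \(\chi^2_k\) distribution. As a worked restatement, with \(\chi^2_{k; 1 - a}\) denoting the \(1 - a\) quantile of \(\chi^2_k\) (notation introduced here for illustration):

\[ \frac{b_{*}}{\kappa \sigma_{*}^2} \, 2 \Lambda_{I} \overset{d}{\longrightarrow} \chi^2_k\,, \qquad \text{so reject } H_0 \text{ at level } a \text{ when } \frac{b_{*}}{\kappa \sigma_{*}^2} \, 2 \Lambda_{I} > \chi^2_{k; 1 - a} \]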

Demonstration

\(\gamma^2 = 5, \Sigma_{ij} = 0.5^{|i - j|}\)

\(n = 1000, p = 800\)

\(\alpha = n / (n + p)\)

Penalized likelihood ratio tests

Demonstration

\(\gamma^2 = 5, \Sigma_{ij} = 0.5^{|i - j|}\)

\(n = 1000, p = 800\)

\(\alpha = n / (n + p)\)

Adjusted Wald tests

Remarks

Optimal shrinkage

Amount of shrinkage (value of \(\alpha\)) that results in \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{*}\) having minimal asymptotic MSE

Smaller asymptotic MSE than the rescaled ML estimator, whenever the latter exists

Estimating unknowns

Dimension

\(\hat\kappa = p / n\)

Signal strength

For multivariate normal covariates (potentially other distributions) we can adapt the Signal Strength Leave-One-Out Estimator (SLOE) of Yadlowsky et al. (2021)

\[ \hat\gamma^2 = \frac{\sum_{i = 1}^n (s_i - \bar{s})^2}{n} \quad \text{with} \quad s_i = \hat{\eta}^{\textrm{\small DY}}_i - \frac{\hat{h}^{\textrm{\small DY}}_i}{1 - \hat{h}^{\textrm{\small DY}}_i} (\hat{z}^{\textrm{\small DY}}_i - \hat{\eta}^{\textrm{\small DY}}_i) \] where \(\hat{h}^{\textrm{\small DY}}_i\) is the hat value for the \(i\)th observation

Conditional variance of covariates

\(\hat{\tau}_j\) from the residual sums of squares from regressing the \(j\)th covariate on all others
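All \(p\) residual sums of squares can be obtained at once from the inverse Gram matrix, since for a regression without intercept \(\mathrm{RSS}_j = 1 / [(X^\top X)^{-1}]_{jj}\). The sketch below uses that identity; the function name `tau_hat` and the divisor \(n - p + 1\) (the residual degrees of freedom when regressing on the other \(p - 1\) covariates) are assumptions, as the slides do not state the normalisation.

```python
import numpy as np

def tau_hat(X):
    """Estimate tau_j for every column j from residual sums of squares.

    RSS_j is the residual sum of squares from regressing column j of X
    on all other columns (no intercept), via 1 / [(X^T X)^{-1}]_{jj}.
    The divisor n - p + 1 is an assumed degrees-of-freedom correction.
    """
    n, p = X.shape
    gram_inv = np.linalg.inv(X.T @ X)
    rss = 1.0 / np.diag(gram_inv)
    return np.sqrt(rss / (n - p + 1))

# Simulated covariates (assumed settings, for illustration)
rng = np.random.default_rng(0)
X_sim = rng.standard_normal((200, 5))
kappa_hat = X_sim.shape[1] / X_sim.shape[0]   # p / n
tau_sim = tau_hat(X_sim)
```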

Current work

Models with intercept

Empirical evidence for a conjecture similar to Zhao et al. (2022, Conjecture 7.1), with state evolution given by a system of 4 equations in 4 unknowns

Non-Gaussian model matrices

Distributions with light tails

Asymptotic results seem to apply for a broad class of covariate distributions with sufficiently light tails (e.g. sub-Gaussian), as also observed in Sur & Candès (2019)

Universality

Results in Han & Shen (2023) can be used to examine universality for general model matrices

References

Candès, E. J., & Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics, 48(1), 27–42. https://doi.org/10.1214/18-AOS1789

Cordeiro, G. M., & McCullagh, P. (1991). Bias correction in generalized linear models. Journal of the Royal Statistical Society, Series B: Methodological, 53(3), 629–643. https://doi.org/10.1111/j.2517-6161.1991.tb01852.x

Han, Q., & Shen, Y. (2023). Universality of regularized regression estimators in high dimensions. The Annals of Statistics, 51(4), 1799–1823. https://doi.org/10.1214/23-AOS2309

Rigon, T., & Aliverti, E. (2023). Conjugate priors and bias reduction for logistic regression models. Statistics & Probability Letters, 202, 109901. https://doi.org/10.1016/j.spl.2023.109901

Sur, P., & Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516–14525. https://doi.org/10.1073/pnas.1810420116

Yadlowsky, S., Yun, T., McLean, C. Y., & D’Amour, A. (2021). SLOE: A faster method for statistical inference in high-dimensional logistic regression. Advances in Neural Information Processing Systems, 34, 29517–29528. https://proceedings.neurips.cc/paper/2021/file/f6c2a0c4b566bc99d596e58638e342b0-Paper.pdf

Zhao, Q., Sur, P., & Candès, E. J. (2022). The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3). https://doi.org/10.3150/21-BEJ1401

Sterzinger, P., & Kosmidis, I. (2023). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv:2311.11290
