Penalized likelihood estimation and inference in high-dimensional logistic regression

Ioannis Kosmidis

Professor of Statistics
Department of Statistics, University of Warwick

ioannis.kosmidis@warwick.ac.uk
ikosmidis.com ikosmidis ikosmidis_

Statistics Seminars
Department of Decision Sciences
Università Bocconi

22 February 2024

Collaborators

Philipp Sterzinger

Patrick Zietkiewicz

Sterzinger P, Kosmidis I (2023). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. ArXiV: 2311.11290

Kosmidis I, Zietkiewicz P (2023). Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias. ArXiV: 2311.07419

Logistic regression

Logistic regression model

Data

Responses \(y_1, \ldots, y_n\) with \(y_i \in \{0, 1\}\)

Covariate vectors \(x_1, \ldots, x_n\) with \(x_i \in \Re^p\)

Model

\(Y_{1}, \ldots, Y_{n}\) conditionally independent with

\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \sum_{t = 1}^p \beta_t x_{it} \]

Widely used for inference about covariate effects on binomial probabilities, probability calibration, and prediction

Maximum likelihood estimation

Log-likelihood

\(\displaystyle l(\beta; y, X) = \sum_{i = 1}^n \left\{y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right\}\)

Maximum likelihood (ML) estimator

\(\hat{\beta} = \arg \max l(\beta; y, X)\)

Iterative re-weighted least squares

\(\hat\beta := \beta^{(\infty)}\) where \(\displaystyle \beta^{(j + 1)} = \left(X^T W^{(j)} X\right)^{-1} X^T W^{(j)} z^{(j)}\)

\(W\) is a diagonal matrix with \(i\)th diagonal element \(\pi_i ( 1- \pi_i)\)

\(z_i = (y_i - \pi_i) / \{ \pi_i (1 - \pi_i) \}\) is the working variate

\(p / n \to \kappa \in (0, 1)\)

Artificial example

\[ \begin{array}{l} n = 2000 \\ p = 400 \\ x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \\ \beta_0 = (b_1, \ldots, b_{200}, 0, \ldots, 0)^\top \\ b_j \stackrel{iid}{\sim} \mathop{\mathrm{N}}(6.78, 1) \\ \end{array} \implies \begin{array}{l} \gamma^2 = \mathop{\mathrm{var}}(x_i^\top \beta_0) \simeq 4.7 \\ \kappa = 0.2 \\ \end{array} \]

\[ y_i \stackrel{ind}{\sim} \mathop{\mathrm{Bernoulli}}(1/(1 + {e^{-x_i^\top\beta_0}})) \]

\(W = 2 ( \hat{l} - \hat{l}_{0} )\), \(\hat{l}_{0}\) is maximized log-likelihood under \(H_0: \beta_{201} = \ldots = \beta_{210} = 0\)

Classical theory predicts \(W_0 \stackrel{d}{\rightarrow} \chi^2_{10}\)

\(Z = \hat\beta_{k} / [(X^\top \hat{W} X)^{-1}]_{kk}^{1/2}\)

Classical theory predicts \(Z \stackrel{d}{\rightarrow} \mathop{\mathrm{N}}(0, 1)\) when \(\beta_{k} = 0\)

Recent developments

Candès & Sur (2020)

sharp phase transition about when the ML estimate has infinite components, when \(\eta_i = \delta + x_i^\top \beta\), \(x_i \sim \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top\beta_0) \to \gamma^2\)

Sur & Candès (2019), Zhao et al. (2022)

a method, based on approximate message passing (AMP), that recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever that exists

Sur & Candès (2019), Kosmidis & Firth (2021)

empirical evidence that maximum Jeffreys-penalized likelihood (mJPL) achieves a substantial reduction in persistent bias when \(p / n \to \kappa\), performing similarly to the rescaled ML estimator, whenever that exists

Phase transition, \(\hat\beta\)

Phase transition, \(\hat\beta / \mu_\star\) (Sur & Candès, 2019)

Diaconis-Ylvisaker prior

Prior

\[ \log p(\beta; X) = \frac{1 - {\color[rgb]{0.70,0.1,0.12}{\alpha}}}{{\color[rgb]{0.70,0.1,0.12}{\alpha}}} \sum_{i=1}^n \left\{ \zeta'\left(x_i^\top {\color[rgb]{0.70,0.1,0.12}{\beta}_P}\right) x_i^\top \beta - \zeta\left(x_i^\top \beta\right) \right\} + C \] with \(\zeta(\eta) = \log(1 + e^\eta)\)

Penalized log-likelihood

\[ l^*(\beta; y, X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \zeta\left(x_i^\top \beta\right) \right\} \] with \(y_i^* = \alpha y_i + (1 - \alpha) \zeta'\left(x_i^\top \beta_P\right)\)

Maximum DY prior penalized likelihood

Maximum DY prior penalized likelihood (MDYPL)

\(\hat{\beta}^{\textrm{\small DY}}= \arg \max l^*(\beta; y, X)\)

MDYPL implementation using ML procedures

MDYPL is ML with pseudo-responses \(\quad \Longleftarrow \quad\) \(l(\beta; y^*, X) / \alpha = \ell^*(\beta; y, X)\)

Existence and uniqueness

Rigon & Aliverti (2023, Theorem 1): \(\hat{\beta}^{\textrm{\small DY}}\) is unique and exists for all \(\{y, X\}\) configurations

Shrinkage

\(\displaystyle \hat{\beta}^{\textrm{\small DY}}\longrightarrow \left\{ \begin{array}{ll} \hat\beta\,, & \alpha \to 1 \\ \beta_P\,, & \alpha \to 0 \end{array}\right.\)

Rigon & Aliverti (2023) suggest adaptive shrinkage with \(\alpha = 1 / (1 + \kappa)\)

Shrinkage direction

\(p / n = o(1)\) (Cordeiro & McCullagh, 1991)

ML estimator of a logistic regression model is biased away from \(0\)

\(p / n \to \kappa \in (0, 1)\) (Sur & Candès, 2019; Zhao et al., 2022)

Empirical evidence that this continues to hold in high dimensions

\[\LARGE\Downarrow\]

\[\beta_{P} = 0\]

Estimation

Setting

Signal \(\beta_0\)

The empirical distribution of \(\beta_0\) elements converges to \(\pi_{\bar{\beta}}\) and \(\sum_{j = 1}^p \beta_{0,j}^2 / p \to \mathop{\mathrm{E}}(\bar{\beta}^2) < \infty\)

Covariate distribution, dimension, and signal strength

\(x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1 / n)\)

As \(p / n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top \beta_0) \to \gamma^2 > 0\)

Idea

Develop a generalized AMP recursion (see Feng et al., 2022 for an overview of AMP), whose iterates have a known asymptotic distribution, with stationary point \(\hat{\beta}^{\textrm{\small DY}}\)

State evolution of AMP recursion

Solution \(({\mu}_{*}, {b}_{*}, {\sigma}_{*})\) to the system of nonlinear equations in \((\mu, b, \sigma)\)

\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+\alpha}{2} - \zeta' \left(\textrm{prox}_{b \zeta}\left(Z_{*} + \frac{1+\alpha}{2}b\right) \right) \right\} \right] \\ 0 & = 1 - \kappa - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + b \zeta''(\textrm{prox}_{b\zeta}\left(Z_{*} + \frac{1+\alpha}{2}b\right))} \right] \\ 0 & = \sigma^2 - \frac{b^2}{\kappa^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+\alpha}{2} - \zeta'\left(\textrm{prox}_{b\zeta}\left( Z_{*} + \frac{1+\alpha}{2}b\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]

\(Z \sim \mathop{\mathrm{N}}(0, \gamma^2)\)

\(Z_{*} = \mu Z + \kappa^{1/2} \sigma G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)

\(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator

Solvers for nonlinear systems, after cubature approximation of expectations

MDYPL asymptotics

Theorem 1 (Aggregate asymptotic behaviour of mDYPL estimator)

Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely, and that (1) admits a solution \(({\mu}_{*}, {b}_{*}, {\sigma}_{*})\) such that the Jacobian of the RHS of (1) is nonsingular.

Then, for any function \(\psi: \Re^2 \to \Re\) that is pseudo-Lipschitz of order 2 \[ \frac{1}{p} \sum_{j=1}^{n} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{*}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}\)

\(\psi(t,u)\)	statistic		a.s. limit
\(t\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}\right)\)	Bias	\(0\)
\((t-(1-{\mu}_{*})u)^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\)	MSE	\({\sigma}_{}^2 + (1-{\mu}_{})^2 \gamma^2 / \kappa\)
\(t^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{*}\beta_{0,j}\right)^2\)	Variance	\({\sigma}_{*}^2\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}/ \mu^*\)

\(\psi(t,u)\)	statistic		a.s. limit
\(t\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{*}- \beta_{0,j}\right)\)	Bias	\(0\)
\(t^2/{\mu}_{*}^2\)	\(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{*}- \beta_{0,j}\right)^2\)	MSE / Variance	\({\sigma}_{}^2 / {\mu}_{}^2\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

Rigon & Aliverti (2023)

\(\hat{\beta}^{\textrm{\small DY}}/ \mu_\star\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

Rigon & Aliverti (2023)

\(\hat{\beta}^{\textrm{\small DY}}/ \mu_\star\) for \(\alpha = 1 / (1 + \kappa)\)

Optimal shrinkage

For any given \(\kappa\) and \(\gamma\), find the amount of shrinkage (value of \(\alpha\)) that results in:

\(\hat{\beta}^{\textrm{\small DY}}\) having zero asymptotic bias
\(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{*}\) having minimal asymptotic mean squared error

Zero Asymptotic bias

For \((\kappa, \gamma) \in (0, 1) \times (0, \infty)\): \(\displaystyle \quad \alpha_{\rm \small UB} = \arg \mathop{\mathrm{solve}}_{\alpha \in (0, 1)} \{{\mu}_{*}= 1\}\)

Minimal asymptotic MSE

For \((\kappa, \gamma) \in (0, 1) \times (0, \infty)\): \(\displaystyle \quad \alpha_{\rm \small MM} = \arg \min_{\alpha \in (0, 1)} \frac{{\sigma}_{*}^2}{{\mu}_{*}^2}\)

Optimal shrinkage

Zero asymptotic bias
Minimal asymptotic MSE

Extensions

Abritrary covariate covariance

\(\hat{\beta}^{\textrm{\small DY}}\) is the mDYPL estimator with covariate vectors \(x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\) and signal \(\beta_0\)

Dimension and signal strength

As \(p / n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top \beta_0) = \beta_0^\top \Sigma \beta_0 \to \gamma^2 > 0\)

Aggregate asymptotics

Under the assumption \(\underset{n \to \infty}{\limsup } \, \lambda_{\max}(\Sigma) / \lambda_{\min}(\Sigma) < \infty\), Theorem 1 still holds with \[ \frac{1}{p} \sum_{j=1}^p \psi \left( {\color[rgb]{0.70,0.1,0.12}{\sqrt{n} \tau_j}} \left( \hat{\beta}^{\textrm{\small DY}}_{j}-{\mu}_{*}\beta_{0,j}\right), {\color[rgb]{0.70,0.1,0.12}{\sqrt{n} \tau_j}} \beta_{0,j} \right) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{*}G, \bar{\beta})\right] \] where \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\)

Inference

Adjusted \(Z\)-statistics

Theorem 2 Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 1 are met.

Then for any regression coefficient such that \(\sqrt{n} \tau_j \beta_{0,j} = \mathcal{O}(1)\), \[ Z^*_j = \sqrt{n} \tau_j \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{*}\beta_{0,j}}{{\sigma}_{*}} \overset{d}{\longrightarrow} \mathcal{N}(0,1) \]

DY prior penalized likelihood ratio test statistics

Theorem 3 Assume that \((\alpha,\kappa,\gamma)\) are such that the conditions of Theorem 1 are met.

Let \(I = \{i_1,\ldots,i_k\}\), and define the DY prior penalized likelihood ratio test statistic \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta; y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta; y^*, X) \]

Then, under \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\), it holds that \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{*}^2}{b_{*}} \chi^2_k \]

Frequentist performance

Penalized likelihood ratio tests
Adjusted Wald tests

Discussion

Current work

Models with intercept

Empirical evidence with MDYPL supports Zhao et al. (2022, Conjecture 7.1), with state evolution given from a system of 4 equations in 4 unknowns

Non-gaussian model matrices

Distributions with light-tails

Asymptotic results seem to apply for a broad class of covariate distributions with sufficiently light tails (e.g. sub-Gaussian), as also observed in Sur & Candès (2019)

Universality

Results in Han & Shen (2023) can be used to examine universality for general model matrices

Estimating unknowns

Dimension

\(p / n\) is a good estimate of \(\kappa\)

Signal strength

For multivariate normal covariates and potentially other distributions with light tails, \(\gamma\) can be estimated using

the ProbFrontier method (Sur & Candès, 2019)
the SLOE method (Yadlowsky et al., 2021)

Conditional variance of covariates

\(\tau_j\) can be estimated using the residual sums of squares from the regressions of each covariate on all the others (Zhao et al., 2022, sec. 5.1)

Maximum Jeffreys prior penalized likelihood

\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \delta + \sum_{t = 1}^p \beta_t x_{it} \]

Conjecture 1 Suppose that \(x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\), and that \(\mathop{\mathrm{var}}(x_i^\top \beta_0) \to \gamma_0^2\), as \(p / n \to \kappa \in (0, 1)\).

Then, the mJPL estimator \((\tilde\delta, \tilde\beta^\top)^\top\) satisfies \[ \frac{1}{p} \sum_{j = 1}^p (\tilde\beta_j - \alpha_\star \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} 0 \] with \(\alpha_\star \le 1\). For small to moderate values of \(\delta_0^2 / \gamma_0^2\), the scaling \(\alpha_\star\) can be approximated by \[ q(\kappa, \gamma, \gamma_0; b) = \left\{ \begin{array}{ll} 1 \,, & \text{if } \kappa < h_{\rm MLE}(\delta, \gamma_0) \\ \kappa^{b_1} \gamma^{b_2} \gamma_0^{b_3}\,, & \text{if } \kappa > h_{\rm MLE}(\delta, \gamma_0) \end{array} \right. \] where \(b_1 < -1\), \(b_2 < -1\), \(b_3 > 0\).

Predictions for \(b_1\), \(b_2\), \(b_3\) are -1.172, -1.869, 0.817, respectively (Kosmidis & Zietkiewicz, 2023)

References

Candès, E. J., & Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. Annals of Statistics, 48(1), 27–42. https://doi.org/10.1214/18-AOS1789

Cordeiro, G. M., & McCullagh, P. (1991). Bias correction in generalized linear models. Journal of the Royal Statistical Society, Series B: Methodological, 53(3), 629–643. https://doi.org/10.1111/j.2517-6161.1991.tb01852.x

Feng, O. Y., Venkataramanan, R., Rush, C., & Samworth, R. J. (2022). A unifying tutorial on approximate message passing. Foundations and Trends in Machine Learning, 15(4), 335–536. http://dx.doi.org/10.1561/2200000092

Han, Q., & Shen, Y. (2023). Universality of regularized regression estimators in high dimensions. The Annals of Statistics, 51(4), 1799–1823. https://doi.org/10.1214/23-AOS2309

Kosmidis, I., & Firth, D. (2021). Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models. Biometrika, 108(1), 71–82. https://doi.org/10.1093/biomet/asaa052

Kosmidis, I., & Zietkiewicz, P. (2023). Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias. arXiv Preprint arXiv:2311.11290. https://arxiv.org/abs/2311.11290

Rigon, T., & Aliverti, E. (2023). Conjugate priors and bias reduction for logistic regression models. Statistics & Probability Letters, 202, 109901. https://doi.org/10.1016/j.spl.2023.109901

Sur, P., & Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516–14525. https://doi.org/10.1073/pnas.1810420116

Yadlowsky, S., Yun, T., McLean, C. Y., & D’Amour, A. (2021). SLOE: A faster method for statistical inference in high-dimensional logistic regression. Advances in Neural Information Processing Systems, 34, 29517–29528. https://proceedings.neurips.cc/paper/2021/file/f6c2a0c4b566bc99d596e58638e342b0-Paper.pdf

Zhao, Q., Sur, P., & Candès, E. J. (2022). The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3). https://doi.org/10.3150/21-BEJ1401

Sterzinger P, Kosmidis I (2023). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. ArXiV: 2311.11290

Kosmidis I, Zietkiewicz P (2023). Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias. ArXiV: 2311.07419

Penalized likelihood estimation and inference in high-dimensional logistic regression

Collaborators

Related outputs

Logistic regression

Logistic regression model

Data

Model

Maximum likelihood estimation

Log-likelihood

Maximum likelihood (ML) estimator

Iterative re-weighted least squares

\(p / n \to \kappa \in (0, 1)\)

Artificial example

Frequentist performance

Recent developments

Candès & Sur (2020)

Sur & Candès (2019), Zhao et al. (2022)

Sur & Candès (2019), Kosmidis & Firth (2021)

Phase transition, \(\hat\beta\)

Phase transition, \(\hat\beta / \mu_\star\) (Sur & Candès, 2019)

Diaconis-Ylvisaker prior

Diaconis-Ylvisaker prior

Prior

Penalized log-likelihood

Maximum DY prior penalized likelihood

Maximum DY prior penalized likelihood (MDYPL)

MDYPL implementation using ML procedures

Existence and uniqueness

Shrinkage

Shrinkage direction

\(p / n = o(1)\) (Cordeiro & McCullagh, 1991)

\(p / n \to \kappa \in (0, 1)\) (Sur & Candès, 2019; Zhao et al., 2022)

Estimation

Setting

Signal \(\beta_0\)

Covariate distribution, dimension, and signal strength

Idea

State evolution of AMP recursion

MDYPL asymptotics

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}/ \mu^*\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}/ \mu_\star\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}/ \mu_\star\) for \(\alpha = 1 / (1 + \kappa)\)

Optimal shrinkage

Zero Asymptotic bias

Minimal asymptotic MSE

Optimal shrinkage

Extensions

Abritrary covariate covariance

Dimension and signal strength

Aggregate asymptotics

Inference

Adjusted \(Z\)-statistics

DY prior penalized likelihood ratio test statistics

Frequentist performance

Discussion

Current work

Models with intercept

Non-gaussian model matrices

Distributions with light-tails

Universality

Estimating unknowns

Dimension

Signal strength

Conditional variance of covariates

Maximum Jeffreys prior penalized likelihood

References