Penalized likelihood estimation and inference in high-dimensional logistic regression

Ioannis Kosmidis

Professor of Statistics
Department of Statistics, University of Warwick

ioannis.kosmidis@warwick.ac.uk
ikosmidis.com   ikosmidis ikosmidis_


CRiSM 2.0 Conference
University of Warwick

22 May 2025

CRISM 1.0


Joint with

Philipp Sterzinger

Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290

Slides

Logistic regression

Logistic regression

Data

\(y_1, \ldots, y_n\) with \(y_i \in \{0, 1\}\)

\(x_1, \ldots, x_n\) with \(x_i \in \Re^p\)

Model

\(Y_{1}, \ldots, Y_{n}\) conditionally independent with

\[ Y_i \mid x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \sum_{t = 1}^p \beta_t x_{it} \]

Maximum likelihood estimation

Log-likelihood

\(\displaystyle l(\beta \,;\,y, X) = \sum_{i = 1}^n \left\{y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right\}\)

Maximum likelihood (ML) estimator

\(\hat{\beta} \in \arg \sup l(\beta \,;\,y, X)\)

Iterative re-weighted least squares

\(\hat\beta := \beta^{(\infty)}\) where \(\displaystyle \beta^{(j + 1)} = \left(X^T W^{(j)} X\right)^{-1} X^T W^{(j)} z^{(j)}\)

\(W\) is a diagonal matrix with \(i\)th diagonal element \(\pi_i ( 1- \pi_i)\)

\(z_i = \eta_i + (y_i - \pi_i) / \{ \pi_i (1 - \pi_i) \}\) is the working variate
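A minimal R sketch of the IRLS recursion above; illustrative only, not the internals of glm().

## Minimal IRLS for logistic regression (illustrative sketch, not glm()'s internals)
irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (j in seq_len(maxit)) {
    eta <- drop(X %*% beta)
    pi <- plogis(eta)                            # fitted probabilities
    w <- pi * (1 - pi)                           # diagonal of W
    z <- eta + (y - pi) / w                      # working variate
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}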

UCI’s Multiple Features data set, doi: 10.24432/C5HC70

Data

\(200\) binary images per digit (\(0\)-\(9\)), extracted from a collection of Dutch public utility maps

Feature sets:

  • 76 Fourier coefficients of the character shapes (rotation invariant) per digit

  • 64 Karhunen-Loève coefficients per digit

1000 digits for training + 1000 digits for testing

Aim

Explain the contribution of the feature sets in describing the digit “7”

Considerations

Font, digitization noise, and downscaling complicate discrimination

If only rotation invariant features are used, “7” and “2” can look similar

fou + kar VS fou

full_ml <- glm(y ~ ., data = full_train, family = binomial)
nest_ml <- glm(y ~ ., data = nest_train, family = binomial)
Model       Residual Df   Log-likelihood   Df   LR     p-value
fou         923           0.00
fou + kar   859           0.00             64   0.00   1
library("detectseparation")
full_sep <- update(full_ml, method = "detect_separation")
nest_sep <- update(nest_ml, method = "detect_separation")
Model Separation Infinite estimates
fou + kar true true
fou true true
  • Perfect fits on the training data

  • 1000 observations with 140 covariates

\(p / n \to \kappa \in (0, 1)\)

Recent developments

Candès & Sur (2020)

Sharp phase transition for when the ML estimate has infinite components, when \(\eta_i = \delta + x_i^\top \beta_0\), \(x_i \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top\beta_0) \to \gamma^2\)

Sur & Candès (2019), Zhao et al. (2022)

A method, based on approximate message passing (AMP), that for \(\delta = 0\) recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever the latter exists.

Phase transition, \(\hat\beta\)

Phase transition, \(\hat\beta / \mu_\star\) (Sur & Candès, 2019)

?

Diaconis-Ylvisaker prior

Diaconis-Ylvisaker prior

Prior

\[ \log p(\beta; X) = \frac{1 - \alpha}{\alpha} \sum_{i=1}^n \left\{ \zeta'\left(x_i^\top \beta_P\right) x_i^\top \beta - \zeta\left(x_i^\top \beta\right) \right\} + C \] with \(\zeta(\eta) = \log(1 + e^\eta)\)

Penalized log-likelihood

\[ l^*(\beta; y, X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \zeta\left(x_i^\top \beta\right) \right\} \] with \(y_i^* = \alpha y_i + (1 - \alpha) \zeta'\left(x_i^\top \beta_P\right)\)

Maximum DY prior penalized likelihood

Maximum DY prior penalized likelihood (MDYPL)

\(\hat{\beta}^{\textrm{\small DY}}= \arg \max l^*(\beta; y, X)\)

MDYPL implementation using ML procedures

MDYPL is ML with pseudo-responses: \(l(\beta \,;\,y^*, X) / \alpha = l^*(\beta \,;\,y, X)\)

Existence and uniqueness

Rigon & Aliverti (2023, Theorem 1): \(\hat{\beta}^{\textrm{\small DY}}\) exists and is unique for all \(\{y, X\}\) configurations

Shrinkage

\(\displaystyle \hat{\beta}^{\textrm{\small DY}}\longrightarrow \left\{ \begin{array}{ll} \hat\beta\,, & \alpha \to 1 \\ \beta_P\,, & \alpha \to 0 \end{array}\right.\)

Rigon & Aliverti (2023) suggest adaptive shrinkage with \(\alpha = 1 / (1 + \kappa)\)

Shrinkage direction

Fixed \(p\) (Cordeiro & McCullagh, 1991)

ML estimator of a logistic regression model is biased away from \(0\)

\(p / n \to \kappa \in (0, 1)\) (Sur & Candès, 2019; Zhao et al., 2022)

Empirical evidence that this continues to hold in high dimensions


\[\LARGE\Downarrow\]

\[\beta_{P} = 0\]

fou + kar VS fou using MDYPL

Maximum Diaconis-Ylvisaker prior penalized likelihood (MDYPL)

  1. Adjust responses to \(\displaystyle y_i^* = \alpha y_i + \frac{1 - \alpha}{2}\) with \(\alpha \in [0, 1)\)

  2. Use ML estimation with the adjusted responses
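
A sketch of step 1 in R, building the adjusted data sets used in the glm() calls below; full_train / nest_train and the response column y are as in the earlier ML fits (names illustrative).

## Sketch: adjusted responses y* = alpha * y + (1 - alpha) / 2; the same alpha is
## used for both fits so that the two penalized likelihoods are comparable
kappa <- 140 / 1000                     # p / n for the fou + kar model
alpha <- 1 / (1 + kappa)                # adaptive shrinkage (Rigon & Aliverti, 2023)
full_train_adj <- full_train
full_train_adj$y <- alpha * full_train$y + (1 - alpha) / 2
nest_train_adj <- nest_train
nest_train_adj$y <- alpha * nest_train$y + (1 - alpha) / 2
## glm() typically warns about non-integer "successes" for the adjusted responses;
## the resulting fits are still the MDYPL estimates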

full_mdypl <- glm(y ~ ., data = full_train_adj, family = binomial)
nest_mdypl <- glm(y ~ ., data = nest_train_adj, family = binomial)

Penalized likelihood ratio test

Df PLR
64 64.36

fou + kar VS fou using MDYPL

Simulation setup

\(1000 \times 140\) matrix \(X_{\rm fou + kar}\) is fixed

\(\beta_{\rm fou + kar} = (\beta_{\rm fou}^\top, 0_{64}^\top)^\top\)

\(\beta_{\rm fou}\) has i.i.d. \(\mathop{\mathrm{N}}(0, 1)\) entries, rescaled so that \[ \widehat{\rm var}(X_{\rm fou + kar} \beta_{\rm fou + kar}) = \gamma^2 \]

Intercept: \(\delta \in \{-3, -2, -1, 0\}\)

Signal strength: \(\gamma^2 \in \{1, 2, 4, 8, 16\}\)

\(500\) response vectors per \((\delta, \gamma^2)\)

MDYPL: estimation and inference

Aggregate asymptotic behaviour of MDYPL estimator

Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely. Then for any function \(\psi: \Re^2 \to \Re\) that is pseudo-Lipschitz of order 2, \[ \frac{1}{p} \sum_{j=1}^{p} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}\)

\(\psi(t,u)\) statistic quantity a.s. limit
\(t\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)\) Bias \(0\)
\(t^2\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)^2\) Variance \({\sigma}_{\star}^2\)
\((t-(1-{\mu}_{\star})u)^2\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\) MSE \({\sigma}_{\star}^2 + (1-{\mu}_{\star})^2 \gamma^2 / \kappa\)

Behaviour of \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\)

\(\psi(t,u)\) statistic quantity a.s. limit
\(t / {\mu}_{\star}\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)\) Bias \(0\)
\(t^2/{\mu}_{\star}^2\) \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)^2\) MSE / Variance \({\sigma}_{\star}^2 / {\mu}_{\star}^2\)

State evolution of AMP recursion

Solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) to the system of nonlinear equations in \((\mu, b, \sigma)\)

\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2} - \zeta' \left(\textrm{prox}_{b \zeta}\left(Z_{*} + \frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2}b\right) \right) \right\} \right] \\ 0 & = 1 - {\color[rgb]{0.70,0.1,0.12}{\kappa}} - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + b \zeta''(\textrm{prox}_{b\zeta}\left(Z_{*} + \frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2}b\right))} \right] \\ 0 & = \sigma^2 - \frac{b^2}{{\color[rgb]{0.70,0.1,0.12}{\kappa}}^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2} - \zeta'\left(\textrm{prox}_{b\zeta}\left( Z_{*} + \frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2}b\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]

\(Z \sim \mathop{\mathrm{N}}(0, {\color[rgb]{0.70,0.1,0.12}{\gamma}}^2)\)

\(Z_{*} = \mu Z + {\color[rgb]{0.70,0.1,0.12}{\kappa}}^{1/2} \sigma G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)

\(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator


The system is solved with standard solvers for nonlinear systems, after approximating the expectations by cubature
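
One way to do this, as an illustrative sketch (not the paper's implementation): approximate the expectations by Gauss-Hermite product cubature and pass the system to a general-purpose nonlinear solver such as nleqslv. The example values of \(\kappa\), \(\gamma^2\), \(\alpha\) and the number of cubature nodes are arbitrary.

## Sketch: solve the state evolution equations (1) for (mu, b, sigma)
library(nleqslv)

## Probabilists' Gauss-Hermite rule: E_{N(0,1)} f(X) is approximately sum(weights * f(nodes))
gauss_hermite <- function(n) {
  J <- matrix(0, n, n)
  off <- sqrt(seq_len(n - 1))
  J[cbind(1:(n - 1), 2:n)] <- off
  J[cbind(2:n, 1:(n - 1))] <- off
  e <- eigen(J, symmetric = TRUE)
  list(nodes = e$values, weights = e$vectors[1, ]^2)
}

zeta1 <- function(eta) plogis(eta)                       # zeta'
zeta2 <- function(eta) plogis(eta) * (1 - plogis(eta))   # zeta''

## prox_{b zeta}(x): the solution in u of u + b * zeta'(u) = x
prox_bzeta <- function(x, b) {
  vapply(x, function(xi)
    uniroot(function(u) u + b * plogis(u) - xi,
            lower = xi - abs(b) - 1, upper = xi + abs(b) + 1)$root,
    numeric(1))
}

## The three equations in (1) as a function of (mu, b, sigma)
state_evolution <- function(par, kappa, gamma, alpha, gh = gauss_hermite(40)) {
  mu <- par[1]; b <- par[2]; sigma <- par[3]
  a <- (1 + alpha) / 2
  nodes <- expand.grid(z = gh$nodes, g = gh$nodes)       # cubature grid for (Z, G)
  w <- as.vector(outer(gh$weights, gh$weights))
  Z <- gamma * nodes$z                                   # Z ~ N(0, gamma^2)
  Zs <- mu * Z + sqrt(kappa) * sigma * nodes$g           # Z_* = mu Z + sqrt(kappa) sigma G
  pr <- prox_bzeta(Zs + a * b, b)
  c(sum(w * 2 * zeta1(Z) * Z * (a - zeta1(pr))),
    1 - kappa - sum(w * 2 * zeta1(Z) / (1 + b * zeta2(pr))),
    sigma^2 - (b^2 / kappa^2) * sum(w * 2 * zeta1(Z) * (a - zeta1(pr))^2))
}

## Example: kappa = 0.2, gamma^2 = 5, alpha = 1 / (1 + kappa)
sol <- nleqslv(c(0.5, 1, 1), state_evolution,
               kappa = 0.2, gamma = sqrt(5), alpha = 1 / 1.2)
sol$x   # approximates (mu_star, b_star, sigma_star)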

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}/ \mu_\star\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}\) for \(\alpha = 1 / (1 + \kappa)\)

\(\hat{\beta}^{\textrm{\small DY}}/ \mu_\star\) for \(\alpha = 1 / (1 + \kappa)\)

Optimal shrinkage: Asymptotic unbiasedness

Amount of shrinkage (value of \(\alpha\)) that results in \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) being asymptotically unbiased

Optimal shrinkage: Minimum MSE

Amount of shrinkage (value of \(\alpha\)) that results in \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) having minimal asymptotic MSE

DY prior penalized likelihood ratio test statistics

Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely. Define the DY prior penalized likelihood ratio test statistic for \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\) as \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta \,;\,y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta \,;\,y^*, X)\,, \quad I = \{i_1,\ldots,i_k\} \]

Then, \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{\star}^2}{b_{\star}} \chi^2_k \]
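
For example, a sketch of the rescaled calibration for the fou + kar vs fou comparison; kappa_hat, sigma_star, b_star are placeholders for estimates of \(\kappa\), \({\sigma}_{\star}\), \({b}_{\star}\) (see "Estimating unknowns" below).

## Sketch: Lambda_I from the MDYPL fits via deviances (differences of maximized
## penalized log-likelihoods at the same y*; the saturated term cancels), then
## the rescaled chi-squared p-value for the 64 kar coefficients
Lambda <- (deviance(nest_mdypl) - deviance(full_mdypl)) / 2
p_value <- pchisq(2 * Lambda * b_star / (kappa_hat * sigma_star^2),
                  df = 64, lower.tail = FALSE)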

Adjusted Wald statistics

Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely.

Then, for any regression coefficient such that \(\sqrt{n} \tau_j \beta_{0,j} = \mathcal{O}(1)\), \[ Z^*_j = \sqrt{n} \tau_j \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{\star}\beta_{0,j}}{{\sigma}_{\star}} \overset{d}{\longrightarrow} \mathcal{N}(0,1) \] where \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\)

Artificial example

\[ \begin{array}{l} n = 2000 \\ p = 400 \\ x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \\ \beta_0 = (b_1, \ldots, b_{200}, 0, \ldots, 0)^\top \\ b_j \stackrel{iid}{\sim} \mathop{\mathrm{N}}(6.78, 1) \\ \end{array} \implies \begin{array}{l} \gamma^2 = \mathop{\mathrm{var}}(x_i^\top \beta_0) \simeq 4.7 \\ \kappa = 0.2 \\ \end{array} \]



\[ y_i \stackrel{ind}{\sim} \mathop{\mathrm{Bernoulli}}(1/(1 + {e^{-x_i^\top\beta_0}})) \]
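
A sketch reproducing this setup in R (seed and object names arbitrary).

## Sketch: generate the artificial example
set.seed(123)
n <- 2000; p <- 400
X <- matrix(rnorm(n * p, sd = sqrt(1 / n)), n, p)
beta0 <- c(rnorm(200, mean = 6.78, sd = 1), rep(0, 200))
eta <- drop(X %*% beta0)
y <- rbinom(n, size = 1, prob = plogis(eta))
c(kappa = p / n, gamma2 = var(eta))     # approximately 0.2 and 4.7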

Frequentist performance

\(W = 2 ( \hat{l} - \hat{l}_{0} )\), where \(\hat{l}_{0}\) is the maximized log-likelihood under \(H_0: \beta_{201} = \ldots = \beta_{210} = 0\)

Classical theory implies \(W \stackrel{d}{\rightarrow} \chi^2_{10}\) when \(\beta_{201} = \ldots = \beta_{210} = 0\)

\(Z = \hat\beta_{k} / [(X^\top \hat{W} X)^{-1}]_{kk}^{1/2}\)

Classical theory implies \(Z \stackrel{d}{\rightarrow} \mathop{\mathrm{N}}(0, 1)\) when \(\beta_{k} = 0\)

Estimating unknowns

Dimension

\(\hat\kappa = p / n\)

Signal strength

Adaptation of the Signal Strength Leave-One-Out Estimator (SLOE) of Yadlowsky et al. (2021), that consistently estimates \(\upsilon^2 = \lim\limits_{n \to \infty}\mathop{\mathrm{var}}(x_j^\top \hat{\beta}^{\textrm{\small DY}}) = {\mu}_{\star}^2 \gamma^2 + \kappa {\sigma}_{\star}^2\)

\[ \hat \upsilon^2 = \frac{\sum_{i = 1}^n (s_i - \bar{s})^2}{n} \quad \text{with} \quad s_i = \hat{\eta}^{\textrm{\small DY}}_i - \frac{\hat{h}^{\textrm{\small DY}}_i}{1 - \hat{h}^{\textrm{\small DY}}_i} (\hat{z}^{\textrm{\small DY}}_i - \hat{\eta}^{\textrm{\small DY}}_i) \] where \(\hat{h}^{\textrm{\small DY}}_i\) is the hat value for the \(i\)th observation, and \({\rm DY}\) implies evaluation at \(\hat{\beta}^{\textrm{\small DY}}\)
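
A sketch of this estimator from an MDYPL glm() fit (e.g. full_mdypl above; names illustrative); hat values and working residuals give \(\hat{z}^{\textrm{\small DY}}_i - \hat{\eta}^{\textrm{\small DY}}_i\) directly.

## Sketch: SLOE-type estimate of upsilon^2 from an MDYPL fit
eta_hat <- predict(full_mdypl, type = "link")      # linear predictors at the MDYPL fit
h_hat <- hatvalues(full_mdypl)                     # hat values
wres <- residuals(full_mdypl, type = "working")    # z_i - eta_i
s <- eta_hat - h_hat / (1 - h_hat) * wres
upsilon2_hat <- sum((s - mean(s))^2) / length(s)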

Estimating unknowns

Estimates of \({\mu}_{\star}\), \(b_{\star}\), \({\sigma}_{\star}\)

  1. Replace \(\gamma^2\) in the state evolution equations (1) with \((\hat\upsilon^2 - \hat\kappa \sigma^2) / \mu^2\)

  2. Solve (1) to get estimates for \({\mu}_{\star}\), \(b_{\star}\), \({\sigma}_{\star}\)

Conditional variance of covariates

For adjusted Wald tests, \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\) can be estimated using the residual sums of squares from the regressions of each covariate on all the others
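
A sketch of this step and of the resulting adjusted Wald statistics; X is the \(n \times p\) covariate matrix, beta_dy the MDYPL estimate, and sigma_star comes from solving (1) (names illustrative).

## Sketch: tau_j^2 from the residual sum of squares of covariate j on the others,
## then adjusted Wald statistics for H0: beta_{0,j} = 0
n <- nrow(X); p <- ncol(X)
tau2_hat <- vapply(seq_len(p), function(j) {
  fit_j <- lm.fit(x = cbind(1, X[, -j, drop = FALSE]), y = X[, j])
  sum(fit_j$residuals^2) / n           # or n - p for a degrees-of-freedom correction
}, numeric(1))
z_adj <- sqrt(n * tau2_hat) * beta_dy / sigma_star
p_adj <- 2 * pnorm(abs(z_adj), lower.tail = FALSE)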

Comparison to corrected least squares (Lewis & Battey, 2024)

\(p = n\kappa\), \(x_j \sim \mathop{\mathrm{N}}(0, \Sigma)\) with \(\Sigma_{jj} = 1\), \(\Sigma_{jk} = 0.5\) \((j \ne k)\)

\(\beta_0 = (c, c, c, c, c, 0, \ldots, 0)^\top\) with \(c\) such that \(\beta_0^\top \Sigma \beta_0 = \gamma^2 = 9\)

\(5000\) independent realizations of the response vector per setting


Models with an intercept

A conjecture, accurate according to empirical evidence, with the state evolution given by a system of 4 equations in 4 unknowns.

A SLOE-based procedure for estimating unknowns, replacing the asymptotic limit of the MDYPL estimator of the intercept by its MDYPL estimate.

fou + kar VS fou using MDYPL

Penalized likelihood ratio tests

Test           Statistic   p-value   Infinite estimates (fou + kar)   Infinite estimates (fou)
LR             0.00        1.00      true                             true
PLR            64.36       0.46      false                            false
rescaled PLR   173.34      0.00      false                            false


Current work

More general model matrices

Distributions with light tails

Asymptotic results seem to apply for a broad class of covariate distributions with sufficiently light tails (e.g. sub-Gaussian)

Universality

Results in Han & Shen (2023) can be used to examine universality for more general model matrices with independent entries

DY prior

Arbitrary shrinkage directions (\(\beta_P \ne 0\))

Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290

References

Candès, E. J., & Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. Annals of Statistics, 48(1), 27–42. https://doi.org/10.1214/18-AOS1789
Cordeiro, G. M., & McCullagh, P. (1991). Bias correction in generalized linear models. Journal of the Royal Statistical Society, Series B: Methodological, 53(3), 629–643. https://doi.org/10.1111/j.2517-6161.1991.tb01852.x
Han, Q., & Shen, Y. (2023). Universality of regularized regression estimators in high dimensions. The Annals of Statistics, 51(4), 1799–1823. https://doi.org/10.1214/23-AOS2309
Kosmidis, I., & Zietkiewicz, P. (2025). Jeffreys-prior penalty for high-dimensional logistic regression: A conjecture about aggregate bias. Statistical Science (to appear). https://arxiv.org/abs/2311.11290
Lewis, R. M., & Battey, H. S. (2024). On inference in high-dimensional logistic regression models with separated data. Biometrika, 111(3), 989–1011. https://doi.org/10.1093/biomet/asad065
Rigon, T., & Aliverti, E. (2023). Conjugate priors and bias reduction for logistic regression models. Statistics & Probability Letters, 202, 109901. https://doi.org/10.1016/j.spl.2023.109901
Sur, P., & Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29), 14516–14525. https://doi.org/10.1073/pnas.1810420116
Yadlowsky, S., Yun, T., McLean, C. Y., & D’Amour, A. (2021). SLOE: A faster method for statistical inference in high-dimensional logistic regression. Advances in Neural Information Processing Systems, 34, 29517–29528. https://proceedings.neurips.cc/paper/2021/file/f6c2a0c4b566bc99d596e58638e342b0-Paper.pdf
Zhao, Q., Sur, P., & Candès, E. J. (2022). The asymptotic distribution of the MLE in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3). https://doi.org/10.3150/21-BEJ1401

Maximum Jeffreys prior penalized likelihood

\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \delta + \sum_{t = 1}^p \beta_t x_{it} \]

Suppose that \(x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\), and that \(\mathop{\mathrm{var}}(x_i^\top \beta_0) \to \gamma_0^2\), as \(p / n \to \kappa \in (0, 1)\).

Then, the mJPL estimator \((\tilde\delta, \tilde\beta^\top)^\top\) satisfies \[ \frac{1}{p} \sum_{j = 1}^p (\tilde\beta_j - \alpha_\star \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} 0 \] with \(\alpha_\star \le 1\). For small to moderate values of \(\delta_0^2 / \gamma_0^2\), the scaling \(\alpha_\star\) can be approximated by \[ q(\kappa, \gamma, \gamma_0; b) = \left\{ \begin{array}{ll} 1 \,, & \text{if } \kappa < h_{\rm MLE}(\delta, \gamma_0) \\ \kappa^{b_1} \gamma^{b_2} \gamma_0^{b_3}\,, & \text{if } \kappa > h_{\rm MLE}(\delta, \gamma_0) \end{array} \right. \] where \(b_1 < -1\), \(b_2 < -1\), \(b_3 > 0\).


Predictions for \(b_1\), \(b_2\), \(b_3\) are -1.172, -1.869, 0.817, respectively (Kosmidis & Zietkiewicz, 2025)