Professor of Statistics
Department of Statistics, University of Warwick
ioannis.kosmidis@warwick.ac.uk
ikosmidis.com · ikosmidis · ikosmidis_
CRiSM 2.0 Conference
University of Warwick
22 May 2025
Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290
\(y_1, \ldots, y_n\) with \(y_i \in \{0, 1\}\)
\(x_1, \ldots, x_n\) with \(x_i \in \Re^p\)
\(Y_{1}, \ldots, Y_{n}\) conditionally independent with
\[ Y_i \mid x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \sum_{t = 1}^p \beta_t x_{it} \]
\(\displaystyle l(\beta \,;\,y, X) = \sum_{i = 1}^n \left\{y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right\}\)
\(\hat{\beta} \in \arg \sup l(\beta \,;\,y, X)\)
\(\hat\beta := \beta^{(\infty)}\) where \(\displaystyle \beta^{(j + 1)} = \left(X^T W^{(j)} X\right)^{-1} X^T W^{(j)} z^{(j)}\)
\(W\) is a diagonal matrix with \(i\)th diagonal element \(\pi_i (1 - \pi_i)\)
\(z_i = \eta_i + (y_i - \pi_i) / \{ \pi_i (1 - \pi_i) \}\) is the working variate; \(W^{(j)}\) and \(z^{(j)}\) are \(W\) and \(z\) evaluated at \(\beta^{(j)}\)
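A minimal numpy sketch of this IRLS recursion (function name, starting value, and stopping rule are illustrative):

```python
import numpy as np
from scipy.special import expit

def irls_logistic(X, y, max_iter=100, tol=1e-8):
    """Iteratively reweighted least squares for logistic regression:
    beta^{(j+1)} = (X' W X)^{-1} X' W z, with W and z evaluated at beta^{(j)}."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        pi = expit(eta)                  # zeta'(eta) = e^eta / (1 + e^eta)
        w = pi * (1.0 - pi)              # diagonal of W
        z = eta + (y - pi) / w           # working variate
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta  # may not have converged, e.g. under data separation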
\(200\) binary images per digit (\(0\)-\(9\)), extracted from a collection of Dutch utility maps
Feature sets:
76 Fourier coefficients of the character shapes (rotation invariant) per digit
64 Karhunen-Loève coefficients per digit
1000 digits for training + 1000 digits for testing
Explain the contribution of the feature sets in describing the digit “7”
Font, digitization noise, and downscaling complicate discrimination
If only rotation invariant features are used, “7” and “2” can look similar
fou + kar vs fou
Residual Df | Log-likelihood | Df | LR | p-value
---|---|---|---|---
923 | 0.00 | | |
859 | 0.00 | 64 | 0.00 | 1
Model | Separation | Infinite estimates
---|---|---
fou + kar | true | true
fou | true | true
Perfect fits on the training data
1000 observations with 140 covariates
Sharp phase transition characterizing when the ML estimate has infinite components, when \(\eta_i = \delta + x_i^\top \beta\), \(x_i \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), \(\mathop{\mathrm{var}}(x_i^\top\beta_0) \to \gamma^2\)
A method, based on approximate message passing (AMP), that for \(\delta = 0\) recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever the ML estimate exists.
What about when the ML estimate does not exist?
\[ \log p(\beta; X) = \frac{1 - \alpha}{\alpha} \sum_{i=1}^n \left\{ \zeta'\left(x_i^\top \beta_P\right) x_i^\top \beta - \zeta\left(x_i^\top \beta\right) \right\} + C \] with \(\zeta(\eta) = \log(1 + e^\eta)\)
\[ l^*(\beta; y, X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \zeta\left(x_i^\top \beta\right) \right\} \] with \(y_i^* = \alpha y_i + (1 - \alpha) \zeta'\left(x_i^\top \beta_P\right)\)
\(\hat{\beta}^{\textrm{\small DY}}= \arg \max l^*(\beta; y, X)\)
MDYPL is ML with pseudo-responses: \(l(\beta \,;\, y^*, X) / \alpha = l^*(\beta \,;\, y, X)\)
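Because MDYPL is ML with pseudo-responses, \(\hat{\beta}^{\textrm{\small DY}}\) can be computed with any ML routine that accepts responses in \((0, 1)\); a sketch reusing the `irls_logistic` function above (`mdypl_fit` and its signature are illustrative):

```python
import numpy as np
from scipy.special import expit

def mdypl_fit(X, y, alpha, beta_p=None):
    """MDYPL as ML with pseudo-responses
    y*_i = alpha * y_i + (1 - alpha) * zeta'(x_i' beta_P);
    beta_P = 0 gives y*_i = alpha * y_i + (1 - alpha) / 2."""
    eta_p = np.zeros(len(y)) if beta_p is None else X @ beta_p
    y_star = alpha * y + (1 - alpha) * expit(eta_p)
    return irls_logistic(X, y_star)  # IRLS applies unchanged for y in (0, 1)
```

Since \(y_i^* \in (0, 1)\), the optimum is finite even when the data are separated.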
Rigon & Aliverti (2023, Theorem 1): \(\hat{\beta}^{\textrm{\small DY}}\) is unique and exists for all \(\{y, X\}\) configurations
\(\displaystyle \hat{\beta}^{\textrm{\small DY}}\longrightarrow \left\{ \begin{array}{ll} \hat\beta\,, & \alpha \to 1 \\ \beta_P\,, & \alpha \to 0 \end{array}\right.\)
Rigon & Aliverti (2023) suggest adaptive shrinkage with \(\alpha = 1 / (1 + \kappa)\)
The ML estimator of a logistic regression model is biased away from \(0\)
Empirical evidence that this continues to hold in high dimensions
\[\LARGE\Downarrow\]
\[\beta_{P} = 0\]
fou + kar vs fou using MDYPL
Adjust responses to \(\displaystyle y_i^* = \alpha y_i + \frac{1 - \alpha}{2}\) with \(\alpha \in [0, 1)\)
Use ML estimation with the adjusted responses
Df | PLR
---|---
64 | 64.36
fou + kar vs fou using MDYPL
\(1000 \times 140\) matrix \(X_{\rm fou + kar}\) is fixed
\(\beta_{\rm fou + kar} = (\beta_{\rm fou}^\top, 0_{64}^\top)^\top\)
\(\beta_{\rm fou}\) with i.i.d. \(\mathop{\mathrm{N}}(0, 1)\) components, rescaled so that \[ \widehat{\rm var}(X_{\rm fou + kar} \beta_{\rm fou + kar}) = \gamma^2 \]
Intercept: \(\delta \in \{-3, -2, -1, 0\}\)
Signal strength: \(\gamma^2 \in \{1, 2, 4, 8, 16\}\)
\(500\) response vectors per \((\delta, \gamma^2)\)
Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely. Then for any function \(\psi: \Re^2 \to \Re\) that is pseudo-Lipschitz of order 2, \[ \frac{1}{p} \sum_{j=1}^{p} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)
\(\psi(t,u)\) | statistic | | a.s. limit
---|---|---|---
\(t\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)\) | Bias | \(0\)
\(t^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)^2\) | Variance | \({\sigma}_{\star}^2\)
\((t-(1-{\mu}_{\star})u)^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\) | MSE | \({\sigma}_{\star}^2 + (1-{\mu}_{\star})^2 \gamma^2 / \kappa\)
\(\psi(t,u)\) | statistic | | a.s. limit
---|---|---|---
\(t / {\mu}_{\star}\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)\) | Bias | \(0\)
\(t^2/{\mu}_{\star}^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)^2\) | MSE / Variance | \({\sigma}_{\star}^2 / {\mu}_{\star}^2\)
Solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) to the system of nonlinear equations in \((\mu, b, \sigma)\)
\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2} - \zeta' \left(\textrm{prox}_{b \zeta}\left(Z_{*} + \frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2}b\right) \right) \right\} \right] \\ 0 & = 1 - {\color[rgb]{0.70,0.1,0.12}{\kappa}} - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + b \zeta''(\textrm{prox}_{b\zeta}\left(Z_{*} + \frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2}b\right))} \right] \\ 0 & = \sigma^2 - \frac{b^2}{{\color[rgb]{0.70,0.1,0.12}{\kappa}}^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2} - \zeta'\left(\textrm{prox}_{b\zeta}\left( Z_{*} + \frac{1+{\color[rgb]{0.70,0.1,0.12}{\alpha}}}{2}b\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]
\(Z \sim \mathop{\mathrm{N}}(0, {\color[rgb]{0.70,0.1,0.12}{\gamma}}^2)\)
\(Z_{*} = \mu Z + {\color[rgb]{0.70,0.1,0.12}{\kappa}}^{1/2} \sigma G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)
\(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator
Solvers for nonlinear systems, after cubature approximation of expectations
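One way to implement this, sketched below: probabilists' Gauss-Hermite product quadrature for the expectations over \((Z, G)\), Newton iterations for \(\textrm{prox}_{b\zeta}\), and `scipy.optimize.fsolve` for the system; node counts, starting values, and the absence of safeguards are illustrative choices.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss  # weight exp(-x^2 / 2)
from scipy.optimize import fsolve
from scipy.special import expit

def prox_bzeta(x, b, iters=50):
    """prox_{b zeta}(x): root of u + b * zeta'(u) = x, by Newton's method."""
    u = np.array(x, dtype=float)
    for _ in range(iters):
        pu = expit(u)
        u -= (u + b * pu - x) / (1.0 + b * pu * (1.0 - pu))
    return u

def state_evolution(alpha, kappa, gamma, n_nodes=80):
    """Solve system (1) for (mu_star, b_star, sigma_star)."""
    t, w = hermegauss(n_nodes)
    w = w / np.sqrt(2.0 * np.pi)     # now E[f(N(0,1))] ~ sum_i w_i f(t_i)
    T1, T2 = np.meshgrid(t, t, indexing="ij")
    W2 = np.outer(w, w)              # product rule over (Z, G)
    Z = gamma * T1
    a2 = (1.0 + alpha) / 2.0

    def eqs(theta):
        mu, b, sigma = theta
        z_star = mu * Z + np.sqrt(kappa) * sigma * T2
        u = prox_bzeta(z_star + a2 * b, b)
        pz, pu = expit(Z), expit(u)  # zeta'(Z), zeta'(prox); zeta'' = pu(1-pu)
        return [
            np.sum(W2 * 2.0 * pz * Z * (a2 - pu)),
            1.0 - kappa - np.sum(W2 * 2.0 * pz / (1.0 + b * pu * (1.0 - pu))),
            sigma**2 - b**2 / kappa**2 * np.sum(W2 * 2.0 * pz * (a2 - pu) ** 2),
        ]

    return fsolve(eqs, x0=[0.5, 1.0, 1.0])  # (mu_star, b_star, sigma_star)
```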
Amount of shrinkage (value of \(\alpha\)) that results in \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) being asymptotically unbiased
Amount of shrinkage (value of \(\alpha\)) that results in \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) having minimal asymptotic MSE
Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely. Define the DY prior penalized likelihood ratio test statistic for \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\) as \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, \ell(\beta \,;\,y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, \ell(\beta \,;\,y^*, X)\,, \quad I = \{i_1,\ldots,i_k\} \]
Then, \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{\star}^2}{b_{\star}} \chi^2_k \]
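With \((b_{\star}, \sigma_{\star})\) from the state-evolution solution, applying the rescaling is immediate; a minimal sketch (function name is illustrative):

```python
from scipy.stats import chi2

def rescaled_plr_test(lambda_I, k, kappa, b_star, sigma_star):
    """Rescale 2 * Lambda_I by b_star / (kappa * sigma_star^2) and
    compare with its asymptotic chi^2_k distribution."""
    stat = 2.0 * lambda_I * b_star / (kappa * sigma_star**2)
    return stat, chi2.sf(stat, df=k)  # statistic and p-value
```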
Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely.
Then, for any regression coefficient such that \(\sqrt{n} \tau_j \beta_{0,j} = \mathcal{O}(1)\), \[ Z^*_j = \sqrt{n} \tau_j \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{\star}\beta_{0,j}}{{\sigma}_{\star}} \overset{d}{\longrightarrow} \mathop{\mathrm{N}}(0,1) \] where \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\)
\[ \begin{array}{l} n = 2000 \\ p = 400 \\ x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \\ \beta_0 = (b_1, \ldots, b_{200}, 0, \ldots, 0)^\top \\ b_j \stackrel{iid}{\sim} \mathop{\mathrm{N}}(6.78, 1) \\ \end{array} \implies \begin{array}{l} \gamma^2 = \mathop{\mathrm{var}}(x_i^\top \beta_0) \simeq 4.7 \\ \kappa = 0.2 \\ \end{array} \]
\[ y_i \stackrel{ind}{\sim} \mathop{\mathrm{Bernoulli}}(1/(1 + {e^{-x_i^\top\beta_0}})) \]
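A sketch of this data-generating process (the seed is illustrative):

```python
import numpy as np

rng = np.random.default_rng(123)
n, p = 2000, 400                                     # kappa = p / n = 0.2
X = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, p))   # x_ij ~ N(0, 1/n)
beta0 = np.concatenate([rng.normal(6.78, 1.0, size=200), np.zeros(200)])
# var(x_i' beta0) ~ (200 / n) * (6.78^2 + 1) ~ 4.7 = gamma^2
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta0)))
```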
\(W = 2 ( \hat{l} - \hat{l}_{0} )\), where \(\hat{l}_{0}\) is the maximized log-likelihood under \(H_0: \beta_{201} = \ldots = \beta_{210} = 0\)
Classical theory implies \(W \stackrel{d}{\rightarrow} \chi^2_{10}\) when \(\beta_{201} = \ldots = \beta_{210} = 0\)
\(Z = \hat\beta_{k} / [(X^\top \hat{W} X)^{-1}]_{kk}^{1/2}\)
Classical theory implies \(Z \stackrel{d}{\rightarrow} \mathop{\mathrm{N}}(0, 1)\) when \(\beta_{k} = 0\)
\(\hat\kappa = p / n\)
Adaptation of the Signal Strength Leave-One-Out Estimator (SLOE) of Yadlowsky et al. (2021), that consistently estimates \(\upsilon^2 = \lim\limits_{n \to \infty}\mathop{\mathrm{var}}(x_i^\top \hat{\beta}^{\textrm{\small DY}}) = {\mu}_{\star}^2 \gamma^2 + \kappa {\sigma}_{\star}^2\)
\[ \hat \upsilon^2 = \frac{\sum_{i = 1}^n (s_i - \bar{s})^2}{n} \quad \text{with} \quad s_i = \hat{\eta}^{\textrm{\small DY}}_i - \frac{\hat{h}^{\textrm{\small DY}}_i}{1 - \hat{h}^{\textrm{\small DY}}_i} (\hat{z}^{\textrm{\small DY}}_i - \hat{\eta}^{\textrm{\small DY}}_i) \] where \(\hat{h}^{\textrm{\small DY}}_i\) is the hat value for the \(i\)th observation, and \({\rm DY}\) implies evaluation at \(\hat{\beta}^{\textrm{\small DY}}\)
Replace \(\gamma\) in the state evolution equations (1) with \(\hat\gamma = \{(\hat\upsilon^2 - \hat\kappa {\sigma}_{\star}^2) / {\mu}_{\star}^2\}^{1/2}\)
Solve (1) to get estimates for \({\mu}_{\star}\), \(b_{\star}\), \({\sigma}_{\star}\)
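A sketch of the \(\hat\upsilon^2\) computation, assuming the hat values and working variate come from the weighted-least-squares representation of the MDYPL fit with pseudo-responses \(y^*\) (an assumption consistent with MDYPL being ML on \(y^*\); function name is illustrative):

```python
import numpy as np
from scipy.special import expit

def sloe_upsilon2(X, y_star, beta_dy):
    """SLOE-style estimate of upsilon^2 via the leave-one-out linearization
    s_i = eta_i - h_i / (1 - h_i) * (z_i - eta_i)."""
    eta = X @ beta_dy
    pi = expit(eta)
    w = pi * (1.0 - pi)
    z_minus_eta = (y_star - pi) / w               # z_i - eta_i
    XW = X * w[:, None]
    # hat values h_i = w_i * x_i' (X' W X)^{-1} x_i
    h = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ XW), XW)
    s = eta - h / (1.0 - h) * z_minus_eta
    return np.var(s)                              # sum_i (s_i - s_bar)^2 / n
```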
For adjusted Wald tests, \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\) can be estimated using the residual sums of squares from the regressions of each covariate on all the others
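A sketch of that estimator and the resulting adjusted Wald statistics; dividing each residual sum of squares by \(n\) rather than \(n - p + 1\) is an illustrative choice:

```python
import numpy as np

def tau_hat(X):
    """tau_j = sd(x_ij | x_i,-j), estimated from the residual sum of squares
    of the least-squares regression of column j on the remaining columns."""
    n, p = X.shape
    tau = np.empty(p)
    for j in range(p):
        xj, X_rest = X[:, j], np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(X_rest, xj, rcond=None)
        tau[j] = np.sqrt(np.sum((xj - X_rest @ coef) ** 2) / n)
    return tau

def adjusted_wald_z(beta_dy, tau, mu_star, sigma_star, n, beta_null=0.0):
    """Z*_j = sqrt(n) tau_j (beta_dy_j - mu_star * beta_null_j) / sigma_star,
    compared with N(0, 1); beta_null = 0 tests H0: beta_{0,j} = 0."""
    return np.sqrt(n) * tau * (beta_dy - mu_star * beta_null) / sigma_star
```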
\(p = n\kappa\), \(x_i \sim \mathop{\mathrm{N}}(0, \Sigma)\) with \(\Sigma_{jj} = 1\), \(\Sigma_{jk} = 0.5\) \((j \ne k)\)
\(\beta_0 = (c, c, c, c, c, 0, \ldots, 0)^\top\) with \(c\) such that \(\beta_0^\top \Sigma \beta_0 = \gamma^2 = 9\)
\(5000\) independent realizations of the response vector per setting
A conjecture that is accurate according to empirical evidence, with state evolution given by a system of 4 equations in 4 unknowns.
A SLOE-based procedure for estimating unknowns, replacing the asymptotic limit of the MDYPL estimator of the intercept by its MDYPL estimate.
fou + kar vs fou using MDYPL

Test | Statistic | p-value | Infinite estimates (fou + kar) | Infinite estimates (fou)
---|---|---|---|---
LR | 0.00 | 1.00 | true | true
PLR | 64.36 | 0.46 | false | false
rescaled PLR | 173.34 | 0.00 | false | false
Asymptotic results seem to apply for a broad class of covariate distributions with sufficiently light tails (e.g. sub-Gaussian)
Results in Han & Shen (2023) can be used to examine universality for more general model matrices with independent entries
Arbitrary shrinkage directions (\(\beta_P \ne 0\))
Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290
\[ Y_i \mid x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \delta + \sum_{t = 1}^p \beta_t x_{it} \]
Suppose that \(x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\), and that \(\mathop{\mathrm{var}}(x_i^\top \beta_0) \to \gamma_0^2\), as \(p / n \to \kappa \in (0, 1)\).
Then, the mJPL estimator \((\tilde\delta, \tilde\beta^\top)^\top\) satisfies \[ \frac{1}{p} \sum_{j = 1}^p (\tilde\beta_j - \alpha_\star \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} 0 \] with \(\alpha_\star \le 1\). For small to moderate values of \(\delta_0^2 / \gamma_0^2\), the scaling \(\alpha_\star\) can be approximated by \[ q(\kappa, \gamma, \gamma_0; b) = \left\{ \begin{array}{ll} 1 \,, & \text{if } \kappa < h_{\rm MLE}(\delta, \gamma_0) \\ \kappa^{b_1} \gamma^{b_2} \gamma_0^{b_3}\,, & \text{if } \kappa > h_{\rm MLE}(\delta, \gamma_0) \end{array} \right. \] where \(b_1 < -1\), \(b_2 < -1\), \(b_3 > 0\).
Predictions for \(b_1\), \(b_2\), \(b_3\) are -1.172, -1.869, 0.817, respectively (Kosmidis & Zietkiewicz, 2025)