Professor of Statistics
Department of Statistics, University of Warwick
ioannis.kosmidis@warwick.ac.uk
ikosmidis.com ikosmidis ikosmidis_
Department of Statistical Sciences
University of Padova, Italy
29 April 2025
Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290
\(y_1, \ldots, y_n\) with \(y_i \in \{0, 1\}\)
\(x_1, \ldots, x_n\) with \(x_i \in \Re^p\)
\(Y_{1}, \ldots, Y_{n}\) conditionally independent with
\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \sum_{t = 1}^p \beta_t x_{it} \]
\(\displaystyle l(\beta \,;\,y, X) = \sum_{i = 1}^n \left\{y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right\}\)
\(\hat{\beta} = \arg \max l(\beta \,;\,y, X)\)
\(\hat\beta := \beta^{(\infty)}\) where \(\displaystyle \beta^{(j + 1)} = \left(X^T W^{(j)} X\right)^{-1} X^T W^{(j)} z^{(j)}\)
\(W\) is a diagonal matrix with \(i\)th diagonal element \(\pi_i (1 - \pi_i)\)
\(z_i = \eta_i + (y_i - \pi_i) / \{ \pi_i (1 - \pi_i) \}\) is the working variate
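A minimal NumPy sketch of the IRLS recursion above; the function name and stopping rule are illustrative, not from the source.

```python
import numpy as np

def irls_logistic(X, y, n_iter=100, tol=1e-8):
    """Logistic regression ML fit via iteratively reweighted least squares.

    Each iteration solves beta = (X' W X)^{-1} X' W z, with the weights W
    and the working variate z evaluated at the current estimate.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        pi = 1 / (1 + np.exp(-eta))
        w = pi * (1 - pi)                       # diagonal of W
        z = eta + (y - pi) / w                  # working variate
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```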
\(200\) binary images per digit (\(0\)–\(9\)), extracted from a collection of Dutch utility maps
Feature sets:
76 Fourier coefficients of the character shapes (rotation invariant) per digit
64 Karhunen-Loève coefficients per digit
1000 digits for training + 1000 digits for testing
Explain the contribution of the feature sets in describing the digit "7"
Font, digitization noise, and downscaling complicate discrimination
If only rotation invariant features are used, “7” and “2” can look similar
fou + kar vs fou
Residual Df | Log-likelihood | Df | LR | p-value |
---|---|---|---|---|
923 | 0.00 | | | |
859 | 0.00 | 64 | 0.00 | 1 |
Model | Separation | Infinite estimates |
---|---|---|
fou + kar | true | true |
fou | true | true |
Perfect fits on the training data
1000 observations with 140 covariates
A sharp phase transition for when the ML estimate has infinite components, when \(\eta_i = \delta + x_i^\top \beta_0\), \(x_i \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\), \(p/n \to \kappa \in (0, 1)\), and \(\mathop{\mathrm{var}}(x_i^\top\beta_0) \to \gamma^2\)
A method, based on approximate message passing (AMP), that for \(\delta = 0\) recovers estimation and inferential performance by appropriately rescaling \(\hat\beta\), whenever the ML estimate exists.
?
\[ \log p(\beta; X) = \frac{1 - \alpha}{\alpha} \sum_{i=1}^n \left\{ \zeta'\left(x_i^\top \beta_P\right) x_i^\top \beta - \zeta\left(x_i^\top \beta\right) \right\} + C \] with \(\zeta(\eta) = \log(1 + e^\eta)\)
\[ l^*(\beta; y, X) = \frac{1}{\alpha} \sum_{i = 1}^n \left\{y_i^* x_i^\top\beta - \zeta\left(x_i^\top \beta\right) \right\} \] with \(y_i^* = \alpha y_i + (1 - \alpha) \zeta'\left(x_i^\top \beta_P\right)\)
\(\hat{\beta}^{\textrm{\small DY}}= \arg \max l^*(\beta; y, X)\)
MDYPL is ML with pseudo-responses: \(l(\beta; y^*, X) / \alpha = l^*(\beta; y, X)\)
Rigon & Aliverti (2023, Theorem 1): \(\hat{\beta}^{\textrm{\small DY}}\) is unique and exists for all \(\{y, X\}\) configurations
\(\displaystyle \hat{\beta}^{\textrm{\small DY}}\longrightarrow \left\{ \begin{array}{ll} \hat\beta\,, & \alpha \to 1 \\ \beta_P\,, & \alpha \to 0 \end{array}\right.\)
Rigon & Aliverti (2023) suggest adaptive shrinkage with \(\alpha = 1 / (1 + \kappa)\)
ML estimator of a logistic regression model is biased away from \(0\)
Empirical evidence that this continues to hold in high dimensions
\[\LARGE\Downarrow\]
\[\beta_{P} = 0\]
fou + kar vs fou using MDYPL

Adjust responses to \(\displaystyle y_i^* = \alpha y_i + \frac{1 - \alpha}{2}\) with \(\alpha \in [0, 1)\)
Use ML estimation with the adjusted responses
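Because MDYPL is ML with pseudo-responses, the fit can reuse the `irls_logistic` sketch from earlier; a hedged illustration, defaulting to the adaptive \(\alpha = 1/(1+\kappa)\) of Rigon & Aliverti (2023).

```python
def mdypl_logistic(X, y, alpha=None):
    """MDYPL fit: ML with pseudo-responses y* = alpha * y + (1 - alpha) / 2."""
    n, p = X.shape
    if alpha is None:
        alpha = 1 / (1 + p / n)   # adaptive shrinkage of Rigon & Aliverti (2023)
    return irls_logistic(X, alpha * y + (1 - alpha) / 2)
```

IRLS applies unchanged because the score equations keep the same form with \(y\) replaced by the (fractional) \(y^*\).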
Df | PLR |
---|---|
64 | 64.36 |
fou + kar vs fou using MDYPL

\(1000 \times 140\) matrix \(X_{\rm fou + kar}\) is fixed
\(\beta_{\rm fou + kar} = (\beta_{\rm fou}^\top, 0_{64}^\top)^\top\)
\(\beta_{\rm fou}\) with i.i.d. \(\mathop{\mathrm{N}}(0, 1)\) components, rescaled so that \[ \widehat{\rm var}(X_{\rm fou + kar} \beta_{\rm fou + kar}) = \gamma^2 \]
Intercept: \(\delta \in \{-3, -2, -1, 0\}\)
Signal strength: \(\gamma^2 \in \{1, 2, 4, 8, 16\}\)
\(500\) response vectors per \((\delta, \gamma^2)\)
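A sketch of one cell of this design; the real \(X_{\rm fou + kar}\) comes from the digits data, so a simulated stand-in (`X_fk`) is used here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2024)
n, p_fou, p_kar = 1000, 76, 64
X_fk = rng.normal(size=(n, p_fou + p_kar))     # stand-in for [fou | kar]
delta, gamma2 = -1.0, 4.0                      # one (delta, gamma^2) setting
beta = np.concatenate([rng.normal(size=p_fou), np.zeros(p_kar)])
beta *= np.sqrt(gamma2 / np.var(X_fk @ beta))  # var-hat(X beta) = gamma^2
eta = delta + X_fk @ beta
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))    # one of the 500 response draws
```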
Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely. Then for any function \(\psi: \Re^2 \to \Re\) that is pseudo-Lipschitz of order 2, \[ \frac{1}{p} \sum_{j=1}^{p} \psi(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}, \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} \mathop{\mathrm{E}}\left[\psi({\sigma}_{\star}G, \bar{\beta})\right], \quad \textrm{as } n \to \infty \] where \(G \sim \mathop{\mathrm{N}}(0,1)\) is independent of \(\bar{\beta} \sim \pi_{\bar{\beta}}\)
\(\psi(t,u)\) | statistic | | a.s. limit |
---|---|---|---|
\(t\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)\) | Bias | \(0\) |
\(t^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - {\mu}_{\star}\beta_{0,j}\right)^2\) | Variance | \({\sigma}_{\star}^2\) |
\((t-(1-{\mu}_{\star})u)^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j - \beta_{0,j}\right)^2\) | MSE | \({\sigma}_{\star}^2 + (1-{\mu}_{\star})^2 \gamma^2 / \kappa\) |
\(\psi(t,u)\) | statistic | | a.s. limit |
---|---|---|---|
\(t / {\mu}_{\star}\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)\) | Bias | \(0\) |
\(t^2/{\mu}_{\star}^2\) | \(\frac{1}{p} \sum_{j=1}^p \left(\hat{\beta}^{\textrm{\small DY}}_j / {\mu}_{\star}- \beta_{0,j}\right)^2\) | MSE / Variance | \({\sigma}_{\star}^2 / {\mu}_{\star}^2\) |
Solution \(({\mu}_{\star}, {b}_{\star}, {\sigma}_{\star})\) to the system of nonlinear equations in \((\mu, b, \sigma)\)
\[ \begin{aligned} 0 & = \mathop{\mathrm{E}}\left[2 \zeta'(Z) Z \left\{\frac{1+\alpha}{2} - \zeta' \left(\textrm{prox}_{b \zeta}\left(Z_{*} + \frac{1+\alpha}{2}b\right) \right) \right\} \right] \\ 0 & = 1 - \kappa - \mathop{\mathrm{E}}\left[\frac{2 \zeta'(Z)}{1 + b \zeta''\left(\textrm{prox}_{b\zeta}\left(Z_{*} + \frac{1+\alpha}{2}b\right)\right)} \right] \\ 0 & = \sigma^2 - \frac{b^2}{\kappa^2} \mathop{\mathrm{E}}\left[2 \zeta'(Z) \left\{\frac{1+\alpha}{2} - \zeta'\left(\textrm{prox}_{b\zeta}\left( Z_{*} + \frac{1+\alpha}{2}b\right)\right) \right\}^2 \right] \end{aligned} \tag{1}\]
\(Z \sim \mathop{\mathrm{N}}(0, \gamma^2)\)
\(Z_{*} = \mu Z + \kappa^{1/2} \sigma G\) with \(G \sim \mathop{\mathrm{N}}(0, 1)\) independent of \(Z\)
\(\textrm{prox}_{b\zeta}\left(x\right)=\arg\min_{u} \left\{ b\zeta(u) + \frac{1}{2} (x-u)^2 \right\}\) is the proximal operator
Solve with standard solvers for nonlinear systems, after approximating the expectations by cubature (see the sketch below)
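A possible implementation of this recipe using SciPy's generic solvers; the function names, quadrature order, and starting values are illustrative choices, not from the source.

```python
import numpy as np
from scipy.optimize import brentq, fsolve

def zeta_p(u):
    """zeta'(u) = 1 / (1 + exp(-u)), the logistic cdf."""
    return 1 / (1 + np.exp(-u))

def zeta_pp(u):
    """zeta''(u) = zeta'(u) * (1 - zeta'(u))."""
    p = zeta_p(u)
    return p * (1 - p)

def prox(x, b):
    """prox_{b zeta}(x): the root in u of u + b * zeta'(u) = x (assumes b > 0)."""
    return brentq(lambda u: u + b * zeta_p(u) - x, x - b - 1, x + 1)

def state_evolution(alpha, kappa, gamma, n_nodes=40):
    """Solve system (1) for (mu, b, sigma), approximating the expectations
    over (Z, Z_*) with tensorized Gauss-Hermite cubature."""
    u, w = np.polynomial.hermite_e.hermegauss(n_nodes)
    w = w / np.sqrt(2 * np.pi)                  # normalize to N(0, 1) weights
    U1, U2 = np.meshgrid(u, u)
    W2 = np.outer(w, w)
    a2 = (1 + alpha) / 2

    def equations(theta):
        mu, b, sigma = theta
        Z = gamma * U1                          # Z ~ N(0, gamma^2)
        Zs = mu * Z + np.sqrt(kappa) * sigma * U2
        P = np.vectorize(lambda x: prox(x, b))(Zs + a2 * b)
        r = a2 - zeta_p(P)
        e1 = np.sum(W2 * 2 * zeta_p(Z) * Z * r)
        e2 = 1 - kappa - np.sum(W2 * 2 * zeta_p(Z) / (1 + b * zeta_pp(P)))
        e3 = sigma**2 - (b / kappa)**2 * np.sum(W2 * 2 * zeta_p(Z) * r**2)
        return [e1, e2, e3]

    return fsolve(equations, x0=[0.5, 1.0, 1.0])
```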
Amount of shrinkage (value of \(\alpha\)) that results in \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) being asymptotically unbiased
Amount of shrinkage (value of \(\alpha\)) that results in \(\hat{\beta}^{\textrm{\small DY}}/ {\mu}_{\star}\) having minimal asymptotic MSE
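For instance, the minimal-aMSE \(\alpha\) can be located by a one-dimensional search over the `state_evolution` solution above; the search bounds are illustrative (the unbiasedness version would instead solve \({\mu}_{\star}(\alpha) = 1\) in \(\alpha\)).

```python
from scipy.optimize import minimize_scalar

def min_amse_alpha(kappa, gamma):
    """Shrinkage alpha minimizing the asymptotic MSE of beta_DY / mu_star."""
    def amse(alpha):
        mu, b, sigma = state_evolution(alpha, kappa, gamma)
        return (sigma / mu) ** 2
    return minimize_scalar(amse, bounds=(0.05, 0.95), method="bounded").x
```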
Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely. Define the DY prior penalized likelihood ratio test statistic for \(H_0: \beta_{0,i_1} = \ldots = \beta_{0,i_k} = 0\) as \[ \Lambda_I = \underset{\beta \in \Re^p}{\max} \, l(\beta \,;\,y^*, X) - \underset{\substack{\beta \in \Re^p: \\ \beta_j = 0, \, j \in I }}{\max} \, l(\beta \,;\,y^*, X)\,, \quad I = \{i_1,\ldots,i_k\} \]
Then, \[ 2 \Lambda_{I} \overset{d}{\longrightarrow} \frac{\kappa \sigma_{\star}^2}{b_{\star}} \chi^2_k \]
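A small sketch of the implied adjusted test, with the state-evolution constants supplied as inputs.

```python
from scipy.stats import chi2

def rescaled_plr_test(Lambda_I, k, kappa, sigma_star, b_star):
    """p-value from the limit 2 * Lambda_I ~ (kappa * sigma_star^2 / b_star) chi2_k."""
    stat = 2 * Lambda_I * b_star / (kappa * sigma_star**2)
    return stat, chi2.sf(stat, df=k)
```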
Assume that \((\alpha,\kappa,\gamma)\) are such that \(\| \hat{\beta}^{\textrm{\small DY}} \|_2 = \mathcal{O}(n^{1/2})\) almost surely.
Then, for any regression coefficient such that \(\sqrt{n} \tau_j \beta_{0,j} = \mathcal{O}(1)\), \[ Z^*_j = \sqrt{n} \tau_j \frac{\hat{\beta}^{\textrm{\small DY}}_{j} - {\mu}_{\star}\beta_{0,j}}{{\sigma}_{\star}} \overset{d}{\longrightarrow} \mathcal{N}(0,1) \] where \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\)
\[ \begin{array}{l} n = 2000 \\ p = 400 \\ x_{ij} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, 1/n) \\ \\ \beta_0 = (b_1, \ldots, b_{200}, 0, \ldots, 0)^\top \\ b_j \stackrel{iid}{\sim} \mathop{\mathrm{N}}(6.78, 1) \\ \end{array} \implies \begin{array}{l} \gamma^2 = \mathop{\mathrm{var}}(x_i^\top \beta_0) \simeq 4.7 \\ \kappa = 0.2 \\ \end{array} \]
\[ y_i \stackrel{ind}{\sim} \mathop{\mathrm{Bernoulli}}(1/(1 + {e^{-x_i^\top\beta_0}})) \]
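A sketch of this data-generating process; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 400
X = rng.normal(scale=np.sqrt(1 / n), size=(n, p))
beta0 = np.concatenate([rng.normal(6.78, 1.0, size=200), np.zeros(200)])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta0))))
print(np.var(X @ beta0))   # approximately gamma^2 = 4.7; kappa = p / n = 0.2
```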
\(W = 2 ( \hat{l} - \hat{l}_{0} )\), where \(\hat{l}\) is the maximized log-likelihood and \(\hat{l}_{0}\) is the maximized log-likelihood under \(H_0: \beta_{201} = \ldots = \beta_{210} = 0\)
Classical theory implies \(W \stackrel{d}{\rightarrow} \chi^2_{10}\) when \(\beta_{201} = \ldots = \beta_{210} = 0\)
\(Z = \hat\beta_{k} / [(X^\top \hat{W} X)^{-1}]_{kk}^{1/2}\)
Classical theory implies \(Z \stackrel{d}{\rightarrow} \mathop{\mathrm{N}}(0, 1)\) when \(\beta_{k} = 0\)
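Using the IRLS and simulation sketches above, the classical statistics can be computed as follows; indices 200:210 are the zero-based counterpart of \(\beta_{201}, \ldots, \beta_{210}\).

```python
def loglik(beta, X, y):
    """Logistic regression log-likelihood."""
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

I = np.arange(200, 210)                       # H0: beta_201 = ... = beta_210 = 0
beta_hat = irls_logistic(X, y)
X0 = np.delete(X, I, axis=1)
W_stat = 2 * (loglik(beta_hat, X, y) - loglik(irls_logistic(X0, y), X0, y))

pi = 1 / (1 + np.exp(-(X @ beta_hat)))
XtWX_inv = np.linalg.inv(X.T @ ((pi * (1 - pi))[:, None] * X))
k = 210                                       # any coordinate with beta_k = 0
Z_stat = beta_hat[k] / np.sqrt(XtWX_inv[k, k])
```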
\(\hat\kappa = p / n\)
Adaptation of the Signal Strength Leave-One-Out Estimator (SLOE) of Yadlowsky et al. (2021), which consistently estimates \(\upsilon^2 = \lim\limits_{n \to \infty}\mathop{\mathrm{var}}(x_i^\top \hat{\beta}^{\textrm{\small DY}}) = {\mu}_{\star}^2 \gamma^2 + \kappa {\sigma}_{\star}^2\)
\[ \hat \upsilon^2 = \frac{\sum_{i = 1}^n (s_i - \bar{s})^2}{n} \quad \text{with} \quad s_i = \hat{\eta}^{\textrm{\small DY}}_i - \frac{\hat{h}^{\textrm{\small DY}}_i}{1 - \hat{h}^{\textrm{\small DY}}_i} (\hat{z}^{\textrm{\small DY}}_i - \hat{\eta}^{\textrm{\small DY}}_i) \] where \(\hat{h}^{\textrm{\small DY}}_i\) is the hat value for the \(i\)th observation
Replace \(\gamma^2\) in the state evolution equations (1) with \(\hat\gamma^2 = (\hat\upsilon^2 - \hat\kappa {\sigma}_{\star}^2) / {\mu}_{\star}^2\)
Solve (1) to get estimates for \({\mu}_{\star}\), \(b_{\star}\), \({\sigma}_{\star}\)
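A sketch of the \(\hat\upsilon^2\) computation from the MDYPL fit, with hat values taken from the weighted hat matrix at \(\hat{\beta}^{\textrm{\small DY}}\); names are illustrative.

```python
import numpy as np

def sloe_upsilon2(X, beta_dy, y_star):
    """SLOE-style estimate of upsilon^2 from the MDYPL fit (a sketch)."""
    eta = X @ beta_dy
    pi = 1 / (1 + np.exp(-eta))
    w = pi * (1 - pi)
    z = eta + (y_star - pi) / w                    # working variate
    Xw = X * np.sqrt(w)[:, None]
    h = np.einsum('ij,ji->i', Xw, np.linalg.solve(Xw.T @ Xw, Xw.T))  # hat values
    s = eta - h / (1 - h) * (z - eta)
    return np.mean((s - s.mean()) ** 2)
```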
For adjusted Wald tests, \(\tau_j^2 = \mathop{\mathrm{var}}(x_{ij} \mid x_{i, -j})\) can be estimated using the residual sums of squares from the regressions of each covariate on all the others
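A sketch of the resulting adjusted Wald tests under \(H_0: \beta_{0,j} = 0\); the per-covariate regressions are done naively, one least-squares fit per column, for clarity.

```python
import numpy as np
from scipy.stats import norm

def adjusted_wald(X, beta_dy, sigma_star):
    """Adjusted z-statistics for H0: beta_{0,j} = 0, with tau_j^2 estimated
    by RSS / n from regressing column j on the remaining columns."""
    n, p = X.shape
    z = np.empty(p)
    for j in range(p):
        Xmj = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(Xmj, X[:, j], rcond=None)
        rss = np.sum((X[:, j] - Xmj @ coef) ** 2)
        z[j] = np.sqrt(n) * np.sqrt(rss / n) * beta_dy[j] / sigma_star
    return z, 2 * norm.sf(np.abs(z))               # statistics and p-values
```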
\(p = n\kappa\), \(x_j \sim \mathop{\mathrm{N}}(0, \Sigma)\) with \(\Sigma_{jj} = 1\), \(\Sigma_{jk} = 0.5\) \((j \ne k)\)
\(\beta_0 = (c, c, c, c, c, 0, \ldots, 0)^\top\) with \(c\) such that \(\beta_0^\top \Sigma \beta_0 = \gamma^2 = 9\)
\(5000\) independent realizations of the response vector per setting
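A sketch of this design; \(n\) and \(\kappa\) are illustrative values, since only \(p = n\kappa\) and \(\gamma^2 = 9\) are fixed by the setting.

```python
import numpy as np

rng = np.random.default_rng(1)
n, kappa, gamma2 = 1000, 0.2, 9.0               # n, kappa illustrative
p = int(n * kappa)
Sigma = np.full((p, p), 0.5) + 0.5 * np.eye(p)  # Sigma_jj = 1, Sigma_jk = 0.5
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
b = np.zeros(p)
b[:5] = 1.0
beta0 = b * np.sqrt(gamma2 / (b @ Sigma @ b))   # beta0' Sigma beta0 = 9
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ beta0))))  # 1 of 5000 replications
```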
A conjecture, accurate according to empirical evidence, with state evolution given by a system of 4 equations in 4 unknowns.
A SLOE-based procedure for estimating unknowns, replacing the asymptotic limit of the MDYPL estimator of the intercept by its MDYPL estimate.
fou + kar vs fou using MDYPL

Test | Statistic | p-value | Infinite estimates (fou + kar) | Infinite estimates (fou) |
---|---|---|---|---|
LR | 0.00 | 1.00 | true | true |
PLR | 64.36 | 0.46 | false | false |
rescaled PLR | 173.34 | 0.00 | false | false |
\(1000 \times 140\) matrix \(X_{\rm fou + kar}\) is fixed
\(\beta_{\rm fou + kar} = (\beta_{\rm fou}^\top, 0_{64}^\top)^\top\)
\(\beta_{\rm fou}\) with i.i.d. \(\mathop{\mathrm{N}}(0, 1)\) components, rescaled so that \[ \widehat{\rm var}(X_{\rm fou + kar} \beta_{\rm fou + kar}) = \gamma^2 \]
Intercept: \(\delta \in \{-3, -2, -1, 0\}\)
Signal strength: \(\gamma^2 \in \{1, 2, 4, 8, 16\}\)
\(500\) response vectors per \((\delta, \gamma^2)\)
\[ Y_i | x_i \sim \mathop{\mathrm{Bernoulli}}(\pi_i)\,, \quad \log \frac{\pi_i}{1 - \pi_i} = \eta_i = \delta + \sum_{t = 1}^p \beta_t x_{it} \]
Suppose that \(x_{i} \stackrel{iid}{\sim} \mathop{\mathrm{N}}(0, \Sigma)\), and that \(\mathop{\mathrm{var}}(x_i^\top \beta_0) \to \gamma_0^2\), as \(p / n \to \kappa \in (0, 1)\).
Then, the mJPL estimator \((\tilde\delta, \tilde\beta^\top)^\top\) satisfies \[ \frac{1}{p} \sum_{j = 1}^p (\tilde\beta_j - \alpha_\star \beta_{0,j}) \overset{\textrm{a.s.}}{\longrightarrow} 0 \] with \(\alpha_\star \le 1\). For small to moderate values of \(\delta_0^2 / \gamma_0^2\), the scaling \(\alpha_\star\) can be approximated by \[ q(\kappa, \gamma, \gamma_0; b) = \left\{ \begin{array}{ll} 1 \,, & \text{if } \kappa < h_{\rm MLE}(\delta_0, \gamma_0) \\ \kappa^{b_1} \gamma^{b_2} \gamma_0^{b_3}\,, & \text{if } \kappa > h_{\rm MLE}(\delta_0, \gamma_0) \end{array} \right. \] where \(b_1 < -1\), \(b_2 < -1\), \(b_3 > 0\).
Predictions for \(b_1\), \(b_2\), \(b_3\) are \(-1.172\), \(-1.869\), \(0.817\), respectively (Kosmidis & Zietkiewicz, 2025)
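A hedged sketch of the resulting approximation; `min(1, .)` stands in for the exact \(h_{\rm MLE}\) phase-transition condition, which is not reproduced here.

```python
def mjpl_scaling(kappa, gamma, gamma0, b=(-1.172, -1.869, 0.817)):
    """Approximate aggregate scaling alpha* of the mJPL estimator.

    min(1, .) is a crude stand-in for the exact piecewise condition
    kappa <> h_MLE(delta_0, gamma_0) of the source.
    """
    return min(1.0, kappa ** b[0] * gamma ** b[1] * gamma0 ** b[2])
```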
Sterzinger P, Kosmidis I (2024). Diaconis-Ylvisaker prior penalized likelihood for \(p / n \to \kappa \in (0, 1)\) logistic regression. arXiv: 2311.11290
Ioannis Kosmidis - High-dimensional logistic regression