Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.
Box and Draper (1987), Empirical Model-Building and Response Surfaces, p. 74
Latent variable models
Latent variables: setting
Many statistical models simplify when written in terms of an unobserved latent variable \(U\) in addition to the observed data \(Y\). The latent variable
may really exist: for example, when \(Y = I(U > c)\) for some continuous \(U\) (“do you earn more than \(c\) per year?”);
may be a human construct: for example, something called IQ is said to underlie scores on intelligence tests, but is IQ just a cultural construct? (the “Mismeasure of Man” debate, etc.);
may just be a mathematical / computational device: for example, latent variables are used in the implementation of MCMC or EM algorithms.
Velocity of galaxies: ?MASS::galaxies
The velocities, in km/sec, of 82 galaxies, moving away from our own galaxy, from 6 well-separated conic sections of an ‘unfilled’ survey of the Corona Borealis region. Multimodality in such surveys is evidence for voids and superclusters in the far universe.
If galaxies are indeed super-clustered, the distribution of their velocities, estimated from the red-shift in their light spectra, would be multimodal; otherwise it would be unimodal.
data("galaxies", package ="MASS")## Fix typo see `?galaxies`galaxies[78] <-26960galaxies
## Rescale to 1000 km/s
galaxies <- galaxies / 1000
## `cols` is a vector of plotting colours, assumed to be defined in a setup
## chunk (e.g. cols <- palette.colors(4))
plot(x = c(0, 40), y = c(0, 0.25), type = "n", bty = "l",
     xlab = "velocity of galaxy (1000km/s)", ylab = "density")
rug(galaxies)
lines(density(galaxies, bw = "nrd0"), col = cols[2], lwd = 2)
lines(density(galaxies, bw = "SJ"), col = cols[3], lwd = 2)
legend("topright", legend = c('bw = "nrd0"', 'bw = "SJ"'),
       col = cols[2:3], lty = 1, lwd = 2)
Figure 1: Density of galaxy velocities in 1000 km/s, using two kernel density estimators with a Gaussian kernel but different bandwidth selection procedures.
Latent variable models
Let \([U]\) and \(D\) denote discrete random variables, and \((U)\) and \(X\) continuous ones. Some widely used latent variable model classes, organized by the type of latent variable and observed response:
\([U] \to X\) or \([U] \to D\): finite mixture models, hidden Markov models, changepoint models, \(\ldots\)
\((U) \to D\): censoring, truncation, item-response theory models, \(\ldots\)
\((U) \to X\) or \((U) \to D\): generalized nonlinear mixed models, exploratory / confirmatory factor analysis, \(\ldots\)
Example: probit regression
A generalized linear model for binary data with a probit link function can be written as \[
\begin{aligned}
U_i & \stackrel{\text{ind}}{\sim}\mathop{\mathrm{N}}(x_i^\top \beta, 1) \\
Y_i & = I(U_i \geq 0)
\end{aligned}
\]
Since \(P(Y_i = 1) = P(U_i \geq 0) = \Phi(x_i^\top \beta)\), the log-likelihood of the probit regression model is then \[
\sum_{i = 1}^n \left\{ y_i \log \Phi(x_i^\top \beta) + (1 - y_i)\log\{1 - \Phi(x_i^\top\beta)\} \right\}
\]
Different assumptions for the distribution of the latent variable \(U\) give rise to different models (e.g. logistic distribution gives rise to logistic regression, extreme-value distribution to complementary log-log regression, etc.).
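As a small illustration, the latent-variable formulation can be simulated directly and \(\beta\) recovered with `glm`; a minimal sketch, where the choices of `x`, `beta` and the sample size are arbitrary:

## Simulate from the latent-variable formulation of probit regression
set.seed(1)
n <- 500
x <- rnorm(n)
beta <- c(-0.5, 1)
u <- rnorm(n, mean = beta[1] + beta[2] * x, sd = 1)  ## latent U_i ~ N(x_i' beta, 1)
y <- as.integer(u >= 0)                              ## observed Y_i = I(U_i >= 0)
## The probit fit recovers beta (approximately); other links correspond to
## other latent-variable distributions
coef(glm(y ~ x, family = binomial(link = "probit")))
coef(glm(y ~ x, family = binomial(link = "logit")))
coef(glm(y ~ x, family = binomial(link = "cloglog")))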
Finite mixture models
Setting
Observations are taken from a population composed of distinct sub-populations, but it is unknown which of those sub-populations each observation comes from (e.g. the galaxy velocities data).
\(p\)-component finite mixture model
The \(p\)-component finite mixture model has density or probability mass function \[
f(y \,;\,\pi, \theta) = \sum_{r=1}^p \pi_r f_r(y \,;\,\theta) \quad (0\leq \pi_r\leq 1 ; \sum_{r=1}^p \pi_r = 1)
\] where \(\pi_r\) is the probability that \(y\) is from the \(r\)th component, and \(f_r(y \,;\,\theta)\) is its density or probability mass function conditional on this event (component density).
A widely used class of models for density estimation and clustering. In practice, the number of components \(p\) is typically unknown and has to be chosen.
Such models are non-regular for likelihood inference:
The ordering of the components is not identifiable. In other words, changing the order of the components does not change the model.
Setting \(\pi_r = 0\) eliminates the unknown parameters in \(f_r(y \,;\,\theta)\) from the model.
Depending on the specification of the component distribution, the maximum of the log-likelihood can be \(+\infty\), and can be achieved for various \(\theta\).
Typically, implementations use constraints on the parameters to overcome such issues.
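A minimal sketch of the mixture density \(f(y \,;\,\pi, \theta)\) itself, for two normal components whose proportions, means and standard deviations are arbitrary choices for illustration:

## Evaluate a p-component normal mixture density on a grid of y values
dmix <- function(y, pi, mean, sd) {
    rowSums(sapply(seq_along(pi), function(r) pi[r] * dnorm(y, mean[r], sd[r])))
}
y_grid <- seq(-4, 8, length.out = 200)
plot(y_grid, dmix(y_grid, pi = c(0.3, 0.7), mean = c(0, 4), sd = c(1, 1.5)),
     type = "l", xlab = "y", ylab = "density")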
Expectation-Maximization
Aim
The Expectation-Maximization (EM) algorithm is a powerful tool that allows us to use the observed value \(y\) of \(Y\) for inference on \(\theta\) when we cannot easily compute \[
f(y \,;\,\theta) = \int f(y\mid u \,;\,\theta) f(u \,;\,\theta) du
\]
It proceeds by exploiting the latent variables.
EM Derivation
Complete data log-likelihood
Suppose that we observe \(U\). Then, Bayes' theorem gives that the complete data log-likelihood is \[
\log f(y, u \,;\,\theta) = \log f(y \,;\,\theta) + \log f(u \mid y \,;\,\theta)
\tag{1}\]
Incomplete data log-likelihood
The incomplete data log-likelihood (also known as the observed-data log-likelihood) is simply \[
\ell(\theta) = \log f(y \,;\,\theta)
\]
EM Derivation: \(Q\) function
\(Q(\theta\,;\,\theta')\)
Take expectations on both sides of (1) with respect to \(f(u \mid y \,;\,\theta')\). We have \[
\underbrace{\mathop{\mathrm{E}}\left( \log f(Y, U \,;\,\theta) \mid Y = y \,;\,\theta'\right)}_{Q(\theta\,;\,\theta')} =
\ell(\theta) + \overbrace{\mathop{\mathrm{E}}\left( \log f(U \mid Y \,;\,\theta)\mid Y = y \,;\,\theta'\right)}^{C(\theta \,;\,\theta')}
\tag{2}\]
Behaviour of \(Q(\theta \,;\,\theta')\) as a function of \(\theta\)
\[
\begin{aligned}
C(\theta \,;\,\theta') - C(\theta' \,;\,\theta') & = \mathop{\mathrm{E}}\left( \log \frac{f(U \mid Y \,;\,\theta)}{f(U \mid Y \,;\,\theta')} \mid Y = y \,;\,\theta'\right) \\
& \le \log \mathop{\mathrm{E}}\left( \frac{f(U \mid Y \,;\,\theta)}{f(U \mid Y \,;\,\theta')} \mid Y = y \,;\,\theta'\right) \quad \text{(Jensen's inequality)} \\
& = \log \int \frac{f(u \mid Y = y \,;\,\theta)}{f(u \mid Y =y \,;\,\theta')} f(u \mid Y = y \,;\,\theta') \, du \\
& = \log \int f(u \mid Y = y \,;\,\theta) \, du = 0 \\
& \implies C(\theta' \,;\,\theta')\geq C(\theta \,;\,\theta')
\end{aligned}
\]
Under mild smoothness conditions, \(C(\theta \,;\,\theta')\) has a stationary point at \(\theta := \theta'\).
So, if \(Q(\theta \,;\,\theta')\) is stationary at \(\theta := \theta'\), (2) implies that so too is \(\ell(\theta)\)!
EM algorithm for maximizing \(\ell(\theta)\)
Starting from an initial value \(\theta'\) of \(\theta\):
1. Expectation (E) step: compute \(Q(\theta \,;\,\theta') = \mathop{\mathrm{E}}\left( \log f(Y,U \,;\,\theta) \mid Y = y \,;\,\theta'\right)\).
2. Maximization (M) step: with \(\theta'\) fixed, maximize \(Q(\theta \,;\,\theta')\) with respect to \(\theta\), and let \(\theta^\dagger\) be the maximizer.
3. Check convergence, based on the size of \(\ell(\theta^\dagger) - \ell(\theta')\) if available, or \(\|\theta^\dagger - \theta'\|\), or both. If the algorithm has converged, stop and return \(\theta^\dagger\) as the value of the maximum likelihood estimator \(\hat\theta\). Otherwise, set \(\theta' := \theta^\dagger\) and go to 1.
The M-step ensures that \(Q(\theta^\dagger \,;\,\theta') \geq Q(\theta' \,;\,\theta')\), so (2), combined with \(C(\theta' \,;\,\theta') \geq C(\theta^\dagger \,;\,\theta')\), implies that \(\ell(\theta^\dagger) \geq \ell(\theta')\).
\[
\LARGE\Downarrow
\]
The log likelihood never decreases between EM iterations!
EM algorithm: notes on convergence
If \(\ell(\theta)\) has only one stationary point, and \(Q(\theta \,;\,
\theta')\) eventually reaches a stationary value at \(\hat\theta\), then \(\hat\theta\) must maximize \(\ell(\theta)\).
Otherwise, the algorithm may converge to a local maximum of the log likelihood or to a saddlepoint.
EM never decreases the log-likelihood, so it is generally more stable than Newton–Raphson-type algorithms. The rate of convergence depends on the closeness of \(Q(\theta \,;\,\theta')\) and \(\ell(\theta)\).
Missing information principle
Observed information equals the complete-data information minus the missing information.
EM’s rate of convergence is slow if the missing information is a high proportion of the total, that is, if the largest eigenvalue of \(I_c(\theta \,;\,y)^{-1}I_m(\theta \,;\,y)\) is close to 1.
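In symbols, a sketch of the standard statement of the principle, obtained by differentiating (1) twice with respect to \(\theta\) and taking expectations with respect to \(f(u \mid y \,;\,\theta)\), with \(I_c\) and \(I_m\) as above: \[
-\frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta^\top} = I_c(\theta \,;\,y) - I_m(\theta \,;\,y),
\] where \[
I_c(\theta \,;\,y) = -\mathop{\mathrm{E}}\left( \frac{\partial^2 \log f(Y, U \,;\,\theta)}{\partial\theta\,\partial\theta^\top} \mid Y = y \,;\,\theta\right), \quad
I_m(\theta \,;\,y) = -\mathop{\mathrm{E}}\left( \frac{\partial^2 \log f(U \mid Y \,;\,\theta)}{\partial\theta\,\partial\theta^\top} \mid Y = y \,;\,\theta\right).
\]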
Example: negative binomial
## Incomplete-data log-likelihood of the negative binomial model
inc_loglik <- function(theta, nu, y) {
    sum(dnbinom(y, nu, nu / (theta + nu), log = TRUE))
}
## Q(theta ; theta_d), up to additive constants not involving theta
Q <- function(theta, theta_d, nu, y) {
    sum(-(1 + nu / theta) * (y + nu) / (1 + nu / theta_d) - nu * log(theta))
}
## EM update: the maximizer of Q(theta ; theta_d) over theta
theta_dagger <- function(theta_d, nu, y) {
    theta_d * (mean(y) + nu) / (theta_d + nu)
}
set.seed(123)
nu <- 20
theta_true <- 1.5
y <- rnbinom(20, nu, nu / (theta_true + nu))
theta_grid <- seq(0.01, 10, length = 100)
inc_ll <- sapply(theta_grid, inc_loglik, nu = nu, y = y)
maxit <- 400
theta_start <- theta_d <- 0.1
## Scaling factor for nice plots
scl <- 2
plot(theta_grid, inc_ll, type = "l", ylim = c(-400, 10),
     ylab = "", xlab = expression(theta))
legend("bottomright", legend = c("incomplete data log-likelihood", "Q"),
       col = c("black", "grey"), lty = 1)
for (r in 1:maxit) {
    Q1 <- sapply(theta_grid, Q, theta_d = theta_d, nu = nu, y = y) / scl
    if (r %% 4 == 0) {
        points(theta_grid, Q1, type = "l", col = "grey")
        theta_d <- theta_dagger(theta_d, nu, y)
        rug(theta_d)
    }
}
abline(v = mean(y), lty = 2)
The maximum of the incomplete-data log-likelihood is at \(\hat\theta \approx 1.75\).
EM starting at \(\theta' = 0.1\).
Example: negative binomial
## Repeat, starting the EM from theta' = 8
theta_start <- theta_d <- 8
## Scaling factor for nice plots
scl <- 10
plot(theta_grid, inc_ll, type = "l", ylim = c(-400, 10),
     ylab = "", xlab = expression(theta))
legend("bottomright", legend = c("incomplete data log-likelihood", "Q"),
       col = c("black", "grey"), lty = 1)
for (r in 1:300) {
    Q1 <- sapply(theta_grid, Q, theta_d = theta_d, nu = nu, y = y) / scl
    if (r %% 4 == 0) {
        points(theta_grid, Q1, type = "l", col = "grey")
        theta_d <- theta_dagger(theta_d, nu, y)
        rug(theta_d)
    }
}
abline(v = mean(y), lty = 2)
The maximum of the incomplete-data log-likelihood is at \(\hat\theta \approx 1.75\).
EM starting at \(\theta' = 8\).
EM for mixture models
E-step for mixture models
The contribution of an observation \(y\) to the complete data log-likelihood of a mixture model (assuming that we know which component \(y\) comes from) is \[
\log f(y, u \,;\,\pi, \theta) = \sum_{r=1}^p I(u=r)\left\{ \log \pi_r + \log f_r(y \,;\,\theta)\right\}
\]
We need the expectation of \(\log f(y,u \,;\,\theta)\) over the distribution of \(U \mid Y = y\) at \(\pi'\) and \(\theta'\), that is \[
w_r(y \,;\,\pi', \theta') = P(U = r \mid Y = y \,;\,\pi', \theta') = {\pi'_r f_r(y \,;\,\theta')\over \sum_{s=1}^p \pi'_s f_s(y \,;\,\theta')} \quad (r=1,\ldots,p)
\tag{4}\] which is the weight attributable to component \(r\) given observation \(y\).
The only quantities depending on \(U\) in the complete data log-likelihood are the indicators \(I(U = r)\). So, \[
\begin{aligned}
\mathop{\mathrm{E}}(I(U = r) \mid Y = y \,;\,\pi', \theta') & = P(U = r \mid Y = y \,;\,\pi', \theta') = w_r(y \,;\,\pi', \theta')\\
\end{aligned}
\]
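A minimal sketch of the weights in (4) for normal component densities; the function and argument names (`e_step_weights`, and `pi_d`, `mu_d`, `sigma_d` for the current values \(\pi'\), \(\mu'_r\), \(\sigma'_r\)) are illustrative choices:

## E-step: n x p matrix of weights w_r(y_j ; pi', theta') for normal components
e_step_weights <- function(y, pi_d, mu_d, sigma_d) {
    num <- sapply(seq_along(pi_d),
                  function(r) pi_d[r] * dnorm(y, mu_d[r], sigma_d[r]))
    num / rowSums(num)   ## normalise so the weights sum to one over components
}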
E-step for mixture models
The expected value of the complete-data log likelihood based on a random sample \((y_1, u_1), \ldots, (y_n, u_n)\) is \[
\begin{aligned}
Q(\pi, \theta \,;\,\pi', \theta') & = \sum_{j=1}^n \sum_{r=1}^p w_r(y_j \,;\,\pi', \theta')\left\{ \log\pi_r + \log
f_r(y_j \,;\,\theta)\right\}\\
&= \underbrace{\sum_{r=1}^p \left\{ \sum_{j=1}^n w_r(y_j \,;\,\pi', \theta')\right\}\log\pi_r}_{A(\pi)} +
\underbrace{\sum_{r=1}^p \sum_{j=1}^n w_r(y_j \,;\,\pi', \theta')\log f_r(y_j \,;\,\theta)}_{B(\theta)}
\end{aligned}
\]
Iterates for \(\pi\)
Write \(Q(\pi, \theta \,;\,\pi', \theta') = A(\pi) + B(\theta)\) as above. The mixing proportions \(\pi_r\) do not usually appear in the component densities \(f_r\), so the M-step update for \(\pi\) maximizes \(A(\pi)\) subject to \(\sum_{r=1}^p \pi_r = 1\), giving \[
\pi_r^\dagger = \frac{1}{n} \sum_{j = 1}^n w_r(y_j \,;\,\pi', \theta') \quad (r = 1, \ldots, p)
\] which are the average weights given the data for each component.
M-step for mixture models
Iterates for \(\theta\)
The component densities \(f_r\) do not depend on \(\pi_r\), so the M-step update for \(\theta\) is \[
\theta^\dagger = \arg\max_\theta B(\theta)
\]
Example: Univariate normal components
If \(f_r\) is the normal density with mean \(\mu_r\) and variance \(\sigma^2_r\), a simple calculation shows that the M-step results in the weighted estimates \[
\mu_r^\dagger = {\sum_{j=1}^n w_r(y_j \,;\,\pi', \theta') y_j\over \sum_{j=1}^n
w_r(y_j \,;\,\pi', \theta')}\,, \quad
\sigma^{2\dagger}_r = {\sum_{j=1}^n w_r(y_j \,;\,\pi', \theta')(y_j-\mu_r^\dagger)^2\over \sum_{j=1}^n
w_r(y_j \,;\,\pi', \theta')} \quad (r=1,\ldots,p)
\]
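Putting the E- and M-steps together, a bare-bones EM sketch for a \(p\)-component univariate normal mixture (crude starting values, no safeguards; `normal_mixture_em` is an illustrative name, and this is not the mclust implementation used below):

normal_mixture_em <- function(y, p, maxit = 1000, tol = 1e-8) {
    ## crude starting values; serious implementations initialise more carefully
    pi_d <- rep(1 / p, p)
    mu_d <- as.numeric(quantile(y, probs = seq(0.1, 0.9, length.out = p)))
    sigma_d <- rep(sd(y), p)
    ll_old <- -Inf
    for (it in seq_len(maxit)) {
        ## E-step: pi'_r f_r(y_j ; theta') and the weights in (4)
        f <- sapply(seq_len(p), function(r) pi_d[r] * dnorm(y, mu_d[r], sigma_d[r]))
        ll <- sum(log(rowSums(f)))   ## incomplete-data log-likelihood
        if (ll - ll_old < tol) break
        ll_old <- ll
        w <- f / rowSums(f)
        ## M-step: average weights and weighted means / variances
        pi_d <- colMeans(w)
        mu_d <- colSums(w * y) / colSums(w)
        sigma_d <- sqrt(colSums(w * outer(y, mu_d, "-")^2) / colSums(w))
    }
    list(pi = pi_d, mu = mu_d, sigma = sigma_d, loglik = ll, iterations = it)
}
## e.g. normal_mixture_em(galaxies, p = 4)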
Velocity of galaxies: Mixture of normal distributions
Let’s fit a mixture model with normal component densities, where each component has its own mean and variance. This can be done using the mclust R package, which implements the EM algorithm for (multivariate) Gaussian mixture models.
library("mclust")gal_mix <-Mclust(galaxies, G =1:10, modelNames ="V")plot(gal_mix, what ="BIC")abline(v =4, lty =3)
[mclust reports the negative of the BIC as we defined it]
Velocity of galaxies: Density estimation
plot(x = c(0, 40), y = c(0, 0.25), type = "n", bty = "l",
     xlab = "velocity of galaxy (1000km/s)", ylab = "density")
rug(galaxies)
lines(density(galaxies, bw = "nrd0"), col = cols[2], lwd = 2)
lines(density(galaxies, bw = "SJ"), col = cols[3], lwd = 2)
legend("topright", legend = c('bw = "nrd0"', 'bw = "SJ"'),
       col = cols[2:3], lty = 1, lwd = 2)
Figure 2: Density of galaxy velocities in 1000 km/s, using two kernel density estimators with a Gaussian kernel but different bandwidth selection procedures, and a mixture of 4 normal densities.
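A sketch of how the fitted mixture density can be overlaid on the kernel density estimates, computed directly from `gal_mix$parameters` (this assumes the selected model has the 4 components referred to in the caption, and that the colour vector `cols` is defined as before):

## Density of the fitted normal mixture, from the estimated proportions,
## means and component-specific variances in gal_mix$parameters
pars <- gal_mix$parameters
v_grid <- seq(0, 40, length.out = 500)
mix_dens <- sapply(v_grid, function(v)
    sum(pars$pro * dnorm(v, mean = pars$mean, sd = sqrt(pars$variance$sigmasq))))
lines(v_grid, mix_dens, col = cols[4], lwd = 2)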
Clustering and assignment uncertainty
Apart from being computationally convenient, EM for mixture models produces some useful byproducts.
The weights (4) from the E-step can be used to determine
- the component to which each observation is assigned
- the uncertainty about that assignment
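For the fitted `gal_mix` object, these byproducts are available directly; a brief sketch using standard components of an `Mclust` fit:

## Posterior component probabilities: the weights w_r(y_j) at the estimates
head(gal_mix$z)
## Hard assignment: the component with the largest weight for each observation
head(gal_mix$classification)
## Assignment uncertainty: 1 minus the largest weight for each observation
head(gal_mix$uncertainty)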
Exponential families
Suppose that the complete-data log likelihood has the form \[
\log f(y,u \,;\,\theta) = s(y,u)^\top\theta - \kappa(\theta) + c(y, u)
\]
Ignoring \(c(y, u)\), the M-step of the EM algorithm involves maximizing \[
Q(\theta \,;\,\theta') = \left\{\mathop{\mathrm{E}}\left( s(y,U)\mid Y = y \,;\,\theta'\right)\right\}^\top \theta -
\kappa(\theta)
\] or, equivalently, solving for \(\theta\) the equation \[
\mathop{\mathrm{E}}\left( s(y,U)\mid Y=y \,;\,\theta'\right) = {\partial\kappa(\theta)\over\partial\theta}
\]
The likelihood equation for \(\theta\) based on the complete data is \(s(y,u) = \partial\kappa(\theta)/\partial\theta\).
\[\Large\Downarrow\]
EM simply replaces \(s(y, u)\) by its conditional expectation \(\mathop{\mathrm{E}}\left( s(y,U) \mid Y = y \,;\,\theta'\right)\) and solves the likelihood equation. A routine to fit the complete-data model can readily be adapted for missing data if the conditional expectations are available.
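As a small illustration (an example chosen here, with notation to match the above), consider exponential observations subject to right-censoring. With complete data \(u_1, \ldots, u_n\) independent exponential with rate \(\theta\), \[
\log f(u \,;\,\theta) = n \log\theta - \theta \sum_{i=1}^n u_i ,
\] which is of the above form with \(s(y, u) = -\sum_{i=1}^n u_i\) and \(\kappa(\theta) = -n\log\theta\), so the complete-data likelihood equation gives \(\hat\theta = n / \sum_{i} u_i\). If observation \(i\) is right-censored at \(c_i\), then by lack of memory \(\mathop{\mathrm{E}}(U_i \mid U_i > c_i \,;\,\theta') = c_i + 1/\theta'\), and the EM iterate replaces each censored \(u_i\) by this conditional expectation: \[
\theta^\dagger = \frac{n}{\sum_{i\ \text{uncensored}} u_i + \sum_{i\ \text{censored}} (c_i + 1/\theta')}.
\]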