# Mixture of models and the E.M algorithm

## Table of Contents

## 1. Summary and Aknowledgment

- Notes on EM algorithm for mixture models estimate.
- Variational view of EM.
- Mainly based on Christopher Bishop's book (see chapter 9 of Pattern Recognition and Machine Learning).

## 2. Introduction and examples

One of the most widely used mixture of distributions (or mixture model, *i.e* MM) is the Gaussian mixture model (GMM).
When a random vector of real values \(\X\) is drawn from a GMM, its density of probability is:
\[
p(\X| \M,\C,\pis) = \sum_{k=1}^K \pi_k \normal(\X|\M_k,\C_k),
\]
with \(\pis = (\pi_k)_{1}^K\) and \(K\) the number of gaussians or clusters. The set of parameters for this model is \(\pa=(\pi_k, \M_k, \C_k)_{k=1}^K\).
The generative story of this model is as follows:

- First pick a cluster (or a gaussian) \(\sim Cat(\pis)\)
- Generate \(\X \sim \normal(\X|\M_k,\C_k)\)

MMs are mainly used for clustering or probability density estimation.

### 2.1. Clustering

In a clustering task, each gaussian is associated to a cluster, and the learning step aims at estimating the hidden assignments of a data point to a cluster. In the case of GMM, a gaussian is associated to a cluster. This can be interpreted as an extension of the K-means algorithm, with soft assignements and the possibility to better control the shape of each clusters with the covariance matrix.

### 2.2. Estimation for mixture of distributions

For probabilistic classifier, each class is modeled by a distribution (the likelihood term). However, following the gaussian example, assigning a single gaussian to model the data of one class can be a poor assumption. Inside a class, the data can exhibit different modes associated to the intra-class variability of the data. In this case a MM can be a better choice: for instance by increasing the number of Gaussians, we allows the model to have an improved expressivity along with more parameters to be estimated.

## 3. The EM Algorithm in general

A mixture model is a generative model that relies on latent variables \(\Z\) to explain the observed data \(\X\). These latent parameters \(\Z\) represent the affectation of each observation to a component of the mixture. To learn the parameters of the model we can maximize the log-likelihood of the parameters on the training data:

\begin{align*} \log(p(\X|\pa)) &= \log(\sum_{\Z} p(\X,\Z|\pa) ) \end{align*}For a training point \(\x\), a latent vector \(\z\) of dimension \(K\) is associated. \(\z\) corresponds to the assignement of \(\x\) to each cluster. \(\Z\) is therefore a set of binary random variables:

- \(\z = (z_k)_{k=1}^K\)
- \(z_k=1\) if \(\x\) belongs to the cluster \(k\) and \(0\) otherwise.

To say it differently, \[ p(\X , \Z = \z, \pa) = \pi_k \normal(\M_k,\C_k),\] if \(z_k = 1\) and \(z_{k'} = 0,\ \forall k'\neq k\).

If \(\Z\) could be observed, it becomes therefore a *simple* and easy to solve classification problem. \((\X, \Z)\) is often denoted as the **complete** data set, while \((\X)\) is the **incomplete** one. In other words, \(P(\X,\Z|\pa)\) is easy to optimize, while \(P(\X|\pa)\) requires to marginalize the latent variable, introducing a log-sum without a closed-form solution.

The **Expectation-Maximization (EM)** algorithm is a solution to solve the optimization problem of finding \(\pa\) to maximize \(\log(p(\X|\pa))\).
While in practice \(\Z\) is unknown, for a given set of parameters we can compute \(P(\Z|\X,\pa)\) and also the **expected** value of \(\Z|\X,\pa\). This is the **E(xpectation) step**. In a second step, learning the classification task can be carried out: **maximizing** \(\pa\) knowing \(\Z\) (or the expected value). This is the **M(aximization) step**.

## 4. Variational view of EM

The quantity of interest can be rewritten as follows by introducing a distribution \(q(\Z)\) defined over the latent variables: \[\log(P(\X|\pa))= \sum_{\Z}q(\Z) \log(\frac{P(\X,\Z|\pa)}{q(\Z)}) - \sum_{\Z}q(\Z) \log(\frac{P(\Z|\X,\pa)}{q(\Z)})\]

Two terms appear. The second one is the Kullback-Leibler divergence between \(P(\Z|\X,\pa)\) and \(q(\Z)\) while the first one is for the moment denoted by \(\lb\).

\begin{align*} \log(P(\X|\pa)) &= \lb + \dkl(q(\Z)||P(\Z|\X,\pa))\\ \lb &= \sum_{\Z}q(\Z) \log(\frac{P(\X,\Z|\pa)}{q(\Z)})\\ \dkl(q(\Z)||P(\Z|\X,\pa)) &= \sum_{\Z}q(\Z) \log(\frac{P(\Z|\X,\pa)}{q(\Z)}) \end{align*}
Recall that the Kullback-Leibler divergence satisfies \(\dkl(q|p)\ge 0\), with equality if, and only if \(q(Z) = p(Z|X,\pa)\). Therefore \(P(\X|\pa) \ge \lb\) which means that \(\lb\) is a **lower bound** of \(P(\X|\pa)\) we want to maximize. **The goal of the EM algorithm is therefore to maximize \(\lb\) to indirectly maximize \(P(X|\pa)\).**

### 4.1. Remark

This formulation does not solve the tractability issue since in the \(\dkl\) term we still have a dependance between \(P(\Z|X,\pa)\) and \(P(\X|\pa)\), *i.e* :

\(P(\Z|X,\pa) = \frac{P(\X,\Z|\pa)}{P(\X|\pa)}\).

So both \(P(\Z|X,\pa)\) and \(P(\X|\pa)\) are untractable. The probabilistic model specifies the joint distribution \(P(\X, \Z)\), and the goal is to find an approximation for the posterior distribution \(P(\Z|\X,\pa)\) as well as for the model evidence \(P(X|\pa)\).

### 4.2. E step

Suppose that the current value of the parameter vector is \(\paold\). In the E step, the lower bound \(\lb\) **is maximized with respect to \(q(\Z)\)** while holding \(\pa\) fixed to \(\paold\).

- the solution is when \(q(\Z) = P(\Z|\X,\paold)\), and
- the Kullback-Leibler divergence vanishes.

Since the KL divergence is zero, we have \(\lb=P(\X|\paold)\). In fact, the E-step consists in computing the posterior distribution over \(\Z\) with the parameters fixed at \(\paold\). Then you just set *theoritically* \(q(\Z) = P(\Z|\X,\paold)\).

### 4.3. M step

The distribution \(q(Z)\) is now fixed and the **lower bound \(\lb\) is now maximized with respect to \(\pa\)**. The maximization process yields a new value for the parameters \(\panew\) and increases the lower bound (except if we are already at the maximum). Note that \(q(Z)=P(\Z|\X,\paold)\) acts as a constant in the maximization process:

The second term is a positive constant, since it depends only on \(\paold\). This is the entropy of the posterior distribution \(H(P(\Z|\X,\paold))\). The first we want to maximize is the expectation under the posterior \(P(\Z|\X,\paold)\) of the log-likelihood of complete data. This means in practice, that we optimize a classifier of \(\X\) into \(\Z\), with the supervision of \(P(\Z|\X,\paold)\) that provides the pseudo-affectation.

Since the distribution \(q\) is fixed to \(P(\Z|\X,\paold)\), \(q(Z) \neq P(\Z|\X,\panew)\) and now the KL divergence term is nonzero. The increase in the log likelihood function is therefore greater than the increase in the lower bound.

### 4.4. A summary of EM

You can see the EM algorithm as pushing two functions.

- With fixed parameters (\(\paold\)), push \(\lb\) to stick \(P(\X|\paold)\) by setting \(q(Z)=P(\Z|\X,\paold)\)
- Then recompute the parameters to get \(\panew\) with a fixed \(q(\Z)=P(\Z|\X,\paold)\). The criterion is to maximize \(\lb\) w.r.t \(\pa\). In fact you are moving in the parameter space in order to push \(\lb\), but by doing this you also push the likelihood even further: \(P(\X|\panew) \ge \lb\) because \(P(\Z|\X,\pa)\) has changed to \(P(\Z|\X,\panew)\) while \(q(\Z)\) has not.

For graphical illustration, you can look at the book of Christopher Bishop called Pattern Recognition and Machine Learning. You can also read the great paper of Radford Neal and Geoffrey Hinton on the link between Variational Bayes and EM.

Note that for implementation, \(q(\Z)\) does not really exist or is not a quantity of interest. It acts more like a temporary variable to perform this two steps game.

## 5. GMM

Application of the EM algorithm to GMM.

### 5.1. E step

Given the current set of parameters \(\pa=\paold\), we set \(q(\Z)=P(\Z|\X,\paold)\). This implies that for each training example \(\X\), we compute the posterior distribution of its associated latent variable \(\Z\)

\begin{align*} q(\Z)&=P(\Z|\X,\paold)\\ &=\frac{P(\X,\Z|\paold)}{\sum_z P(\X,\Z|\paold)}\\ &= \frac{\pi_k \normal(\X|\M_k,\C_k)}{\sum_{j=1}^K\ \pi_j \normal(\X|\M_j,\C_j)} \end{align*}\(q(\Z)\) can ba be considered as the a soft assignement of \(\X\) to the different clusters, or its responsability.

### 5.2. M step

The distribution \(q(Z)\) is now fixed and the lower bound \(\lb\) is maximized with respect to \(\pa\). We optimize a classifier of \(\X\) into \(\Z\), with the supervision of \(P(\Z|\X,\paold)\) that provides the pseudo-affectation.

- Each data points is involved in each cluster, but weighted by its responsability.
- \(\x_n\) is associated to \(\z_n\)
- \(\x_n\) belongs to cluster \(k\) with a weight given by \(q(z_{nk})\)

## 6. TODO Online (or incremental version)

## 7. Bernoulli Mixture model

With the naive assumption, a random vector \(\X\) is assumed to be generated by a set of independant Bernoulli distributions, one for each component: \[ P(\X=\x|\M) = \prod_{i=1}^d \m_i^{x_i} (1-\m_i)^{1-x_i}, \] where \(\M\) is the vector of parameters, and \(\m_i\) is the parameter of the Bernoulli distribution associated to the component \(i\) of \(\X\). We can then compute the (statistical) mean and covariance of \(\X\) under this naive assumption:

\begin{align*} E[\X] &= \M\\ cov[\X] &= diag\{\M(1-\M)\}. \end{align*}Here \(diag\) denotes a diagonal matrix and \(\M(1-\M)\) is the vector gathering the diagonal values. Therefore, there is no correlation between the component of \(\X\) (of course, it was designed like this). The variance for one component is thus \(\m_i(1-\m_i)\).

If we want to capture correlations between the variables, unlike a single Bernoulli distribution, a solution is to assume a mixture of Bernoulli distributions. \[ P(\X=\x|\M,\pis) = \sum_{k=1}^K \pi_k P(\X=\x|\m_k), \] Under this distribution, the statistics are now:

\begin{align*} E[\X] &= \sum_k \pi_k \M_k\\ cov[\X] &= \sum_k \pi_k (diag\{\M_k(1-\M_k)\} + \M_k\M_k^t) + E[\X]E[\X]^t \end{align*}The covariance is no longer diagonal and the model can capture correlation between variables at the expense of increasing the number of parameters (\(K\) times).