Tutorial on the EM algorithm
This tutorial draws heavily on the following materials:
- Frank Dellaert, The Expectation Maximization Algorithm link
- C.B. Do & S. Batzoglou, What is the expectation maximization algorithm link
Both papers present a simple and elegant explanation, and this tutorial aims to combine them. More specifically, in the following sections we will solve the unknown-coin problem of material 2 using the mathematical derivation of material 1.
Intuitive Explanation of EM
The EM algorithm is an iterative method for estimating unknown parameters \(\Theta\) given observed data \(U\). In the usual maximum likelihood (or, with a flat prior, maximum a posteriori) setting, an optimal estimate of \(\Theta\) is obtained through:
\[ \Theta^{*} = argmax~P(\Theta|U) \]
A simple example, provided by material 2, is the following: given coins A and B with unknown biases \(\Theta = \{\theta_A , \theta_B\}\), and observations from a coin-tossing experiment, estimate \(\theta_A\) and \(\theta_B\). Suppose we know which coin, A or B, was chosen for each toss; then the MLE of \(\theta_A\) and \(\theta_B\) is:
\[ \theta_J = \frac{\#\text{ of heads using coin } J}{\text{total } \#\text{ of flips using coin } J}, \quad J \in \{A,B\} \]
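For concreteness, here is a minimal Python sketch of this complete-data MLE; the counts in `records` are illustrative placeholders, not the data from the materials:

```python
# Complete-data MLE: each record is (coin_label, n_heads, n_tails) with the coin known.
records = [("A", 8, 2), ("B", 4, 6), ("A", 7, 3)]  # illustrative counts only

def complete_data_mle(records):
    """Return {coin: n_H / (n_H + n_T)} using only the flips attributed to that coin."""
    heads = {"A": 0, "B": 0}
    flips = {"A": 0, "B": 0}
    for coin, n_h, n_t in records:
        heads[coin] += n_h
        flips[coin] += n_h + n_t
    return {coin: heads[coin] / flips[coin] for coin in heads}

print(complete_data_mle(records))  # e.g. {'A': 0.75, 'B': 0.4}
```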
However, if the identity of the coin (A or B) is unknown, this simple MLE can no longer be applied directly. Denote by \(J \in \{A,B\}\) the unobserved variable indicating which coin was used; it is called the latent variable. The problem now becomes an optimal estimation of \(\Theta\) through:
\[ \Theta^{*} = argmax~\sum_{J}P(\Theta,J|U) \]
EM is designed to solve this problem. The intuition is quite straightforward: alternate between estimating the unknowns \(\Theta\) and a distribution over \(J\). One of the most insightful ways to analyze the EM algorithm is in terms of lower-bound maximization. Equivalently to the maximization above, we seek to maximize the logarithm of the joint distribution (which is proportional to the posterior):
\[ \Theta^* = argmax~logP(U,\Theta) = argmax~log\sum_{J}P(U,J,\Theta) \]
Now start with a guess \(\Theta^t\), compute a lower bound \(B(\Theta;\Theta^t)\) to the function \(logP(U,\Theta)\), and maximize that bound instead. This lower bound is built from a guess distribution \(f^t(J)\), which is itself derived from the guessed parameters \(\Theta^t\):
\[ logP(U,\Theta) = log\sum_{J}P(U,J,\Theta) = log\sum_{J}f^t(J)\frac{P(U,J,\Theta)}{f^t(J)} \geq \sum_{J}f^t(J)log\frac{P(U,J,\Theta)}{f^t(J)} \]
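As a sanity check, the following minimal sketch verifies the inequality numerically for made-up values of the joint terms \(P(U,J,\Theta)\) and a guess distribution \(f^t(J)\) (all numbers are illustrative only):

```python
import math

# Made-up joint probabilities P(U, J, Theta) for the two values of J,
# and an arbitrary guess distribution f^t(J) that sums to 1.
p_joint = {"A": 0.03, "B": 0.01}
f = {"A": 0.7, "B": 0.3}

lhs = math.log(sum(p_joint[j] for j in p_joint))           # log sum_J P(U,J,Theta)
rhs = sum(f[j] * math.log(p_joint[j] / f[j]) for j in f)   # sum_J f(J) log [P(U,J,Theta)/f(J)]

print(lhs, rhs, lhs >= rhs)  # the lower bound holds for any valid f
```

The bound becomes tight exactly when \(f^t(J) \propto P(U,J,\Theta)\), which is what the next step exploits.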
EM goes one step further and tries to find the best bound. Using the method of Lagrange multipliers, the best (tightest) lower bound is attained at:
\[ f^t(J) = P(J|U,\Theta^t). \]
Considering that \(\sum_{J}f^t(J)log(f^t(J))\) does not depend on \(\Theta\), we can equivalently maximize the term \(B'(\Theta;\Theta^t) = \sum_{J}P(J|U,\Theta^t)logP(U,J,\Theta)\). That is to say:
\[ \Theta^{t+1} = argmax~B'(\Theta;\Theta^t) = argmax~\sum_{J}P(J|U,\Theta^t)logP(U,J,\Theta) \]
At each iteration, the EM algorithm first finds a lower bound \(B'(\Theta;\Theta^t)\) at the current guess \(\Theta^t\), and then maximizes this bound to obtain an improved estimate \(\Theta^{t+1}\).
Because the bound is expressed as an expectation, the first step is called the "Expectation step" or "E step", and the second step is called the "Maximization step" or "M step". The complete algorithm can thus be summarized as:
- E step: \(f^t(J) = P(J|U,\Theta^t)\)
- M step: \(\Theta^{t+1} = argmax~\sum_{J}P(J|U,\Theta^t)logP(U,J,\Theta)\)
It can be proved that the EM algorithm converges to a local (not necessarily global) maximum of \(logP(U,\Theta)\).
The coin problem
Now let's turn to the coin problem of material 2. With coins A and B, we have 5 sets of tosses, 10 tosses per set, and each set is produced by a single (randomly chosen) coin.
In the figure above, when the data is complete, i.e., the coin \(J\) used for each set is known, the MLE of \(\theta_A\) or \(\theta_B\) is simply the ratio of the number of heads (denoted \(n_H\)) to the total number of flips: \(n_H / (n_H + n_T)\), where \(n_T\) denotes the number of tails.
Now consider the case where the data is incomplete, that is, the variable \(J\) is not known. In Figure (b), the data is \(U = \{U_1,U_2,U_3,U_4,U_5\}\) with \(U_i = \{n_{H},n_{T}\}\). Let's focus on \(U_1\) only. The EM algorithm starts with an initial guess \(\Theta^t = \{\theta_A^t, \theta_B^t\} = \{0.60, 0.50\}\). In the E step, it calculates \(f^t(J)\):
\[ P(J = A | U_1, \Theta^t) = P(J=A|\{n_H=5,n_T=5\},\Theta^t=\{0.60,0.50\}) \approx 0.45 \\ P(J = B | U_1, \Theta^t) = P(J=B|\{n_H=5,n_T=5\},\Theta^t=\{0.60,0.50\}) \approx 0.55 \]
Note that the above calculation uses the conditions \(P(J = A | U_1, \Theta^t) + P(J = B | U_1, \Theta^t) = 1\) and a uniform prior over the coin choice, \(P(J=A|\Theta^t) = P(J=B|\Theta^t)\).
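A minimal sketch of this E-step calculation for \(U_1\) (5 heads, 5 tails), assuming the binomial model and the uniform prior just mentioned:

```python
from math import comb

def responsibilities(n_h, n_t, theta_a, theta_b):
    """E step for one data set: P(J=A | U, Theta^t) and P(J=B | U, Theta^t), with P(J=A)=P(J=B)."""
    like_a = comb(n_h + n_t, n_h) * theta_a**n_h * (1 - theta_a)**n_t
    like_b = comb(n_h + n_t, n_h) * theta_b**n_h * (1 - theta_b)**n_t
    total = like_a + like_b
    return like_a / total, like_b / total

p_a, p_b = responsibilities(5, 5, 0.60, 0.50)
print(round(p_a, 2), round(p_b, 2))  # approximately 0.45 and 0.55
```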
The second step looks at \(P(J,U,\Theta)\):
\[ P(J=A,U_1,\Theta) = \binom{10}{n_H} \theta_A^{n_H}(1-\theta_A)^{n_T} \\ P(J=B,U_1,\Theta) = \binom{10}{n_H} \theta_B^{n_H}(1-\theta_B)^{n_T} \]
For data set 1, \(n_H = n_T = 5\). The binomial coefficient is the same for both coins, so it cancels in the E-step normalization; it is also constant with respect to \(\Theta\), so it can be dropped from the M-step objective. The objective function to be maximized, \(\sum_{J}P(J|U,\Theta^t)logP(U,J,\Theta)\), can then be written (dropping the constant) as:
\[ P(J = A | U_1, \Theta^t) [n_H~log\theta_A + n_T~log(1-\theta_A)] + P(J = B | U_1, \Theta^t) [n_H~log\theta_B + n_T~log(1-\theta_B)] \]
Taking partial derivatives and setting them to zero, we get \(\theta_A^{t+1} = n_H / (n_H + n_T) = 0.5\) and \(\theta_B^{t+1} = n_H / (n_H + n_T) = 0.5\) (using data set 1 alone). This is the M step.
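To spell out the derivative step for \(\theta_A\):
\[ \frac{\partial B'}{\partial \theta_A} = P(J = A | U_1, \Theta^t)\left[\frac{n_H}{\theta_A} - \frac{n_T}{1-\theta_A}\right] = 0 ~~\Rightarrow~~ \theta_A^{t+1} = \frac{n_H}{n_H + n_T} \]
and symmetrically for \(\theta_B\); with \(n_H = n_T = 5\) in data set 1, both updates equal 0.5.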
In the figure there are 5 data sets, so the M-step solution must be modified accordingly. Since the 5 data sets are independent, the log-likelihood function can be written as:
\[ logP(U,\Theta) = log\prod_i P(U_i,\Theta) = \sum_i logP(U_i,\Theta) = \sum_i log\sum_{J_i}P(U_i,J_i,\Theta) \]
Accordingly, the E step and M step will be modified as:
- E step: for \(i \in \{1,2,\dots,5\}\), \(f^t(J_i) = P(J_i|U_i,\Theta^t)\)
- M step: \(\Theta^{t+1} = argmax~\sum_i\sum_{J_i}P(J_i|U_i,\Theta^t)logP(U_i,J_i,\Theta)\)
Taking partial derivatives with respect to \(\theta_A\) and \(\theta_B\), the solution to the M step is:
\[ \theta_A^{t+1} = \frac{\sum_iP(J_i=A|U_i,\Theta^t)n_{Hi}}{\sum_iP(J_i=A|U_i,\Theta^t)n_{Hi}+\sum_iP(J_i=A|U_i,\Theta^t)n_{Ti}} \\ \theta_B^{t+1} = \frac{\sum_iP(J_i=B|U_i,\Theta^t)n_{Hi}}{\sum_iP(J_i=B|U_i,\Theta^t)n_{Hi}+\sum_iP(J_i=B|U_i,\Theta^t)n_{Ti}} \]
That's exactly what is done in Figure (b).
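As a concluding sketch, the following Python implements the full E/M loop above. The per-set head counts `[5, 9, 8, 4, 7]` are assumed here to match the tosses shown in the figure of material 2; substitute the actual counts if your copy of the figure differs.

```python
from math import log

# Head counts per 10-toss set; assumed to match the figure in material 2.
heads = [5, 9, 8, 4, 7]
tails = [10 - h for h in heads]

def em(theta_a, theta_b, n_iter=10):
    """Run the E and M steps above for the five coin-toss data sets."""
    for t in range(n_iter):
        # E step: responsibilities P(J_i = A | U_i, Theta^t); the binomial
        # coefficient and the uniform prior over the coin cancel in the ratio.
        resp_a = []
        for n_h, n_t in zip(heads, tails):
            like_a = theta_a ** n_h * (1 - theta_a) ** n_t
            like_b = theta_b ** n_h * (1 - theta_b) ** n_t
            resp_a.append(like_a / (like_a + like_b))
        # M step: closed-form updates using the expected head/tail counts.
        exp_h_a = sum(r * h for r, h in zip(resp_a, heads))
        exp_t_a = sum(r * t_ for r, t_ in zip(resp_a, tails))
        exp_h_b = sum((1 - r) * h for r, h in zip(resp_a, heads))
        exp_t_b = sum((1 - r) * t_ for r, t_ in zip(resp_a, tails))
        theta_a = exp_h_a / (exp_h_a + exp_t_a)
        theta_b = exp_h_b / (exp_h_b + exp_t_b)
        # Incomplete-data log-likelihood (up to constant binomial coefficients),
        # assuming a uniform prior P(J_i = A) = P(J_i = B) = 0.5.
        ll = sum(log(0.5 * theta_a ** h * (1 - theta_a) ** t_
                     + 0.5 * theta_b ** h * (1 - theta_b) ** t_)
                 for h, t_ in zip(heads, tails))
        print(f"iter {t + 1}: theta_A = {theta_a:.3f}, theta_B = {theta_b:.3f}, loglik = {ll:.3f}")
    return theta_a, theta_b

em(0.60, 0.50)
```

If the assumed counts are right, the printed trajectory should match Figure (b) of material 2, starting at roughly \(\theta_A \approx 0.71\), \(\theta_B \approx 0.58\) after the first iteration and approaching \(\theta_A \approx 0.80\), \(\theta_B \approx 0.52\) after ten iterations.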