Expectation-Maximization (EM) data mining algorithm in plain English

The EM data mining algorithm is part of a longer article about many more data mining algorithms.

What does it do?

In data mining, expectation-maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery.

In statistics, the EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables.

OK, hang on while I explain...

I'm not a statistician, so hopefully my simplification is both correct and helps with understanding.

Here are a few concepts that will make this way easier...

What's a statistical model?

I see a model as something that describes how observed data is generated. For example, the grades for an exam could fit a bell curve, so the assumption that the grades are generated via a bell curve (a.k.a. normal distribution) is the model.

Wait, what's a distribution?

A distribution represents the probabilities for all measurable outcomes. For example, the grades for an exam could fit a normal distribution. This normal distribution represents all the probabilities of a grade.

In other words, given a grade, you can use the distribution to determine how many exam takers are expected to get that grade.

Cool, what are the parameters of a model?

A parameter describes a distribution which is part of a model. For example, a bell curve can be described by its mean and variance.

Using the exam scenario, the distribution of grades on an exam (the measurable outcomes) followed a bell curve (this is the distribution). The mean was 85 and the variance was 100.

So, all you need to describe a normal distribution are 2 parameters:

The mean
The variance

And likelihood?

Going back to our previous bell curve example... suppose we have a bunch of grades and are told the grades follow a bell curve. However, we're not given all the grades... only a sample.

Here's the deal:

We don't know the mean or variance of all the grades, but we can estimate them using the sample. The likelihood is the probability that the bell curve with estimated mean and variance results in those bunch of grades.

In other words, given a set of measurable outcomes, let's estimate the parameters. Using these estimated parameters, the hypothetical probability of the outcomes is called likelihood.

Remember, it's the hypothetical probability of the existing grades, not the probability of a future grade.

You're probably wondering, what's probability then?

Using the bell curve example, suppose we know the mean and variance. Then we're told the grades follow a bell curve. The chance that we observe certain grades and how often they are observed is the probability.

In more general terms, given the parameters, let's estimate what outcomes should be observed. That's what probability does for us.

Great! Now, what's the difference between observed and unobserved data?

Observed data is the data that you saw or recorded. Unobserved data is data that is missing. There a number of reasons that the data could be missing (not recorded, ignored, etc.).

Here's the kicker:

For data mining and clustering, what's important to us is looking at the class of a data point as missing data. We don't know the class, so interpreting missing data this way is crucial for applying EM to the task of clustering.

Once again:

The EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables. Hopefully, this is way more understandable now.

The best part is...

By optimizing the likelihood, EM generates an awesome model that assigns class labels to data points -- sounds like clustering to me!

How does EM help with clustering?

EM begins by making a guess at the model parameters.

Then it follows an iterative 3-step process:

E-step: Based on the model parameters, it calculates the probabilities for assignments of each data point to a cluster.
M-step: Update the model parameters based on the cluster assignments from the E-step.
Repeat until the model parameters and cluster assignments stabilize (a.k.a. convergence).

Is this supervised or unsupervised?

Since we do not provide labeled class information, this is unsupervised learning.

Why use EM?

A key selling point of EM is it's simple and straight-forward to implement. In addition, not only can it optimize for model parameters, it can also iteratively make guesses about missing data.

This makes it great for clustering and generating a model with parameters. Knowing the clusters and model parameters, it's possible to reason about what the clusters have in common and which cluster new data belongs to.

EM is not without weaknesses though...

First, EM is fast in the early iterations, but slow in the later iterations.
Second, EM doesn't always find the optimal parameters and gets stuck in local optima rather than global optima.

Where is it used?

The EM algorithm is available in Weka. R has an implementation in the mclust package. scikit-learn also has an implementation in its gmm module.

Checkout how I used EM

What data mining does Google do? Take a look on the main list...

About the Author

Ray Li

Ray is a software engineer and data enthusiast who has been blogging for over a decade. He loves to learn, teach and grow. You’ll usually find him wrangling data, programming and lifehacking.