AdaBoost data mining algorithm in plain English

The AdaBoost data mining algorithm is part of a longer article about many more data mining algorithms.

What does it do? 

AdaBoost is a boosting algorithm which constructs a classifier.

As you probably remember, a classifier takes a bunch of data and attempts to predict which class a new data element belongs to.

But what's boosting? 

Boosting is an ensemble learning algorithm which takes multiple learning algorithms (e.g. decision trees) and combines them. The goal is to take an ensemble or group of weak learners and combine them to create a single strong learner.

What's the difference between a strong and weak learner? 

A weak learner classifies with accuracy barely above chance. A popular example of a weak learner is the decision stump which is a one-level decision tree.

Alternatively...

A strong learner has much higher accuracy, and an often used example of a strong learner is SVM.
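
Here's a quick way to see the difference for yourself. This is just a toy sketch (nothing from a real study) that trains a decision stump and an SVM on the same synthetic scikit-learn dataset and compares their accuracy:

```python
# Toy comparison of a weak learner (decision stump) vs. a strong
# learner (SVM) on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A decision stump is a one-level decision tree.
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
svm = SVC().fit(X_train, y_train)

print("stump accuracy:", stump.score(X_test, y_test))  # usually the lower score
print("SVM accuracy:  ", svm.score(X_test, y_test))
```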

What's an example of AdaBoost? 

Let's start with 3 weak learners. We're going to train them over 10 rounds on a training dataset of patient medical records.

The question is...

How can we predict whether the patient will get cancer?

Here's how AdaBoost answers the question...

In round 1: 

AdaBoost takes a sample of the training dataset and tests how accurate each learner is. The end result is that we find the best learner.

In addition, samples that are misclassified are given a heavier weight, so that they have a higher chance of being picked in the next round.

One more thing: the best learner is also given a weight based on its accuracy and incorporated into the ensemble of learners (right now there's just 1 learner).
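
By the way, if code is clearer to you than prose, here's a rough sketch of one round. It assumes labels coded as -1/+1 and a hypothetical list of already-trained candidate learners; the alpha formula is the classic AdaBoost weight update.

```python
import numpy as np

def adaboost_round(learners, X, y, weights):
    # `learners` is a hypothetical list of fitted classifiers; `y` holds
    # labels in {-1, +1}; `weights` is the current sample distribution.

    # Test how accurate each learner is under the current weights...
    errors = [np.sum(weights * (clf.predict(X) != y)) for clf in learners]

    # ...and keep the best one (lowest weighted error).
    idx = int(np.argmin(errors))
    best, err = learners[idx], errors[idx]

    # The best learner's say in the ensemble depends on its accuracy.
    alpha = 0.5 * np.log((1 - err) / err)  # assumes 0 < err < 1

    # Misclassified samples get heavier weights for the next round.
    weights = weights * np.exp(-alpha * y * best.predict(X))
    weights = weights / weights.sum()  # renormalize to a distribution

    return best, alpha, weights
```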

In round 2: 

AdaBoost again looks for the best learner.

And here's the kicker:

The sample of patient training data is now influenced by the heavier weights on misclassified patients. In other words, previously misclassified patients have a higher chance of showing up in the sample.

Why?

It's like getting to the second level of a video game and not having to start all over again when your character is killed. Instead, you start at level 2 and focus all your efforts on getting to level 3.

Likewise, the first learner likely classified some patients correctly. Instead of trying to classify them again, let's focus all the efforts on getting the misclassified patients.

The best learner is again weighted and incorporated into the ensemble, misclassified patients are re-weighted so they have a higher chance of being picked, and we rinse and repeat.
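
To put rough numbers on that rinse-and-repeat (using the classic AdaBoost update, not figures from this patient example): a learner with a weighted error of 0.3 gets a vote of 0.5 × ln(0.7/0.3) ≈ 0.42, every misclassified patient's weight is multiplied by e^0.42 ≈ 1.5, every correctly classified patient's weight is scaled down by e^-0.42 ≈ 0.65, and then the weights are renormalized.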

At the end of the 10 rounds: 

We're left with an ensemble of weighted learners trained and then repeatedly retrained on misclassified data from the previous rounds.
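
In the classic formulation, that ensemble makes its final call by a weighted majority vote. A minimal sketch, again assuming -1/+1 labels and a hypothetical list of (alpha, learner) pairs collected over the rounds:

```python
import numpy as np

def ensemble_predict(ensemble, X):
    # `ensemble` is a hypothetical list of (alpha, fitted_learner) pairs.
    # Final prediction: sign of the weighted sum of each learner's vote.
    votes = sum(alpha * clf.predict(X) for alpha, clf in ensemble)
    return np.sign(votes)  # -1 or +1
```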

Is this supervised or unsupervised? 

This is supervised learning, since each iteration trains the weak learners on the labelled dataset.

Why use AdaBoost? 

AdaBoost is simple. The algorithm is relatively straightforward to program.

In addition, it's fast! Weak learners are generally simpler than strong learners. Being simpler means they'll likely execute faster.

Another thing...

It's a super elegant way to auto-tune a classifier, since each successive AdaBoost round refines the weights for each of the best learners. All you need to specify is the number of rounds.

Finally, it's flexible and versatile. AdaBoost can incorporate any learning algorithm, and it can work with a large variety of data.
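
For example, scikit-learn's AdaBoostClassifier uses decision stumps as its weak learners by default. Here's a minimal sketch on synthetic data where the only knob we touch is the number of rounds:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# 10 boosting rounds, mirroring the example above; the base learner
# defaults to a one-level decision tree (a stump).
model = AdaBoostClassifier(n_estimators=10, random_state=42).fit(X, y)
print("training accuracy:", model.score(X, y))
```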

Where is it used? 

AdaBoost has a ton of implementations and variants. Here are a few:

- scikit-learn
- ICSIBoost
- gbm: Generalized Boosted Regression Models

Check out how I used AdaBoost

If you get along with your neighbors, you'll love the next algorithm in the main algorithm list...

