Naive Bayes data mining algorithm

Naive Bayes data mining algorithm in plain English

The Naive Bayes data mining algorithm is part of a longer article about many more data mining algorithms.

What does it do? 

Naive Bayes is not a single algorithm, but a family of classification algorithms that share one common assumption:

Every feature of the data being classified is independent of all other features given the class.
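Written in probability notation (my shorthand, not part of the original write-up), the assumption says that for, say, two features the joint probability factors into a simple product once the class is known:

$latex P(\textit{Feature 1}, \textit{Feature 2}|\textit{Class A}) = P(\textit{Feature 1}|\textit{Class A}) \cdot P(\textit{Feature 2}|\textit{Class A})$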

What does independent mean? 

2 features are independent when the value of one feature has no effect on the value of another feature.

For example:

Let's say you have a patient dataset containing features like pulse, cholesterol level, weight, height and zip code. All features would be independent if the values of the features have no effect on each other. For this dataset, it's reasonable to assume that the patient's height and zip code are independent, since a patient's height has little to do with their zip code.

But let's not stop there: are the other features independent?

Sadly, the answer is no. Here are 3 feature relationships which are not independent:

  • If height increases, weight likely increases.
  • If cholesterol level increases, weight likely increases.
  • If cholesterol level increases, pulse likely increases as well.

In my experience, the features of a dataset are generally not all independent.

And that ties in with the next question...

Why is it called naive? 

The assumption that all features of a dataset are independent is precisely why it's called naive -- it's generally not the case that all features are independent.

What's Bayes?

Thomas Bayes was an English statistician after whom Bayes' Theorem is named. You can click on the link to find out more about Bayes' Theorem.

In a nutshell, the theorem allows us to predict the class given a set of features using probability.

The simplified equation for classification looks something like this:

$latex P(\textit{Class A}|\textit{Feature 1}, \textit{Feature 2}) = \dfrac{P(\textit{Feature 1}|\textit{Class A}) \cdot P(\textit{Feature 2}|\textit{Class A}) \cdot P(\textit{Class A})}{P(\textit{Feature 1}) \cdot P(\textit{Feature 2})}$

Let's dig deeper into this...

What does the equation mean? 

The equation finds the probability of Class A given Features 1 and 2. In other words, if you see Features 1 and 2, this is the probability the data is Class A.

The equation reads: The probability of Class A given Features 1 and 2 is a fraction.

  • The fraction's numerator is the probability of Feature 1 given Class A multiplied by the probability of Feature 2 given Class A multiplied by the probability of Class A.
  • The fraction's denominator is the probability of Feature 1 multiplied by the probability of Feature 2.
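To make the arithmetic concrete, here's a tiny Python sketch of that equation (my own illustration -- the function and the numbers plugged in at the end are made up, not from any particular dataset):

# Plug the five probabilities from the equation straight into the formula.
def naive_bayes_probability(p_f1_given_a, p_f2_given_a, p_a, p_f1, p_f2):
    numerator = p_f1_given_a * p_f2_given_a * p_a
    denominator = p_f1 * p_f2
    return numerator / denominator

# Made-up example: P(Feature 1|Class A)=0.8, P(Feature 2|Class A)=0.7,
# P(Class A)=0.5, P(Feature 1)=0.5, P(Feature 2)=0.65
print(naive_bayes_probability(0.8, 0.7, 0.5, 0.5, 0.65))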

What is an example of Naive Bayes? 

Below is a great example taken from a Stack Overflow thread (Ram's answer).

Here's the deal:

  • We have a training dataset of 1,000 fruits.
  • The fruit can be a Banana, Orange or Other (these are the classes).
  • The fruit can be Long, Sweet or Yellow (these are the features).

Fruit Probabilities (counts from the training dataset):

          Long   Sweet   Yellow   Total
Banana     400     350      450     500
Orange       0     150      300     300
Other      100     150       50     200
Total      500     650      800    1000

What do you see in this training dataset?

  • Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow.
  • Out of 300 oranges, none are long, 150 are sweet and 300 are yellow.
  • Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.

If we are given the length, sweetness and color of a fruit (without knowing its class), we can now calculate the probability of it being a banana, orange or other fruit.

Suppose we are told the unknown fruit is long, sweet and yellow.

Here's how we calculate all the probabilities in 4 steps:

Step 1: 

To calculate the probability the fruit is a banana, let's first recognize that this looks familiar. It's the probability of the class Banana given the features Long, Sweet and Yellow, or more succinctly:

$latex P(Banana|Long, Sweet, Yellow)$

This is exactly like the equation discussed earlier.

Step 2: 

Starting with the numerator, let's plug everything in.

  • $latex P(Long|Banana) = 400/500 = 0.8$
  • $latex P(Sweet|Banana) = 350/500 = 0.7$
  • $latex P(Yellow|Banana) = 450/500 = 0.9$
  • $latex P(Banana) = 500 / 1000 = 0.5$

Multiplying everything together (as in the equation), we get:

$latex 0.8 \times 0.7 \times 0.9 \times 0.5 = 0.252$

Step 3: 

Ignore the denominator, since it'll be the same for all the other calculations -- the probabilities of Long, Sweet and Yellow don't depend on the class, so the denominator doesn't change which class wins.

Step 4: 

Do a similar calculation for the other classes:

  • $latex P(Orange|Long, Sweet, Yellow) = 0$
  • $latex P(Other|Long, Sweet, Yellow) = 0.01875$

Since $latex 0.252$ is greater than both $latex 0$ and $latex 0.01875$, Naive Bayes would classify this long, sweet and yellow fruit as a banana.
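If you'd like to check the arithmetic yourself, here's a short Python sketch of the whole calculation (my own code, not part of Ram's answer). As in Step 3, it skips the denominator, since the denominator is the same for every class:

# Counts straight from the fruit training dataset above.
counts = {
    "Banana": {"Long": 400, "Sweet": 350, "Yellow": 450, "Total": 500},
    "Orange": {"Long": 0,   "Sweet": 150, "Yellow": 300, "Total": 300},
    "Other":  {"Long": 100, "Sweet": 150, "Yellow": 50,  "Total": 200},
}
total_fruit = 1000
features = ["Long", "Sweet", "Yellow"]

scores = {}
for fruit, c in counts.items():
    # P(class) times P(feature|class) for each observed feature (numerator only).
    score = c["Total"] / total_fruit
    for f in features:
        score *= c[f] / c["Total"]
    scores[fruit] = score

print(scores)                       # Banana: 0.252, Orange: 0.0, Other: 0.01875 (up to rounding)
print(max(scores, key=scores.get))  # Banana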

Is this supervised or unsupervised? 

This is supervised learning, since Naive Bayes is provided a labeled training dataset in order to construct the tables.

Why use Naive Bayes? 

As you can see in the example above, Naive Bayes involves simple arithmetic. It's just tallying up counts, multiplying and dividing.

Once the frequency tables are calculated, classifying an unknown fruit just involves calculating the probabilities for all the classes, and then choosing the highest probability.

Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it's been found to be effective for spam filtering.

Where is it used? 

Implementations of Naive Bayes can be found in Orange, scikit-learn, Weka and R.
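For example, here's roughly what the fruit classifier could look like in scikit-learn using BernoulliNB (just a sketch -- the six training rows below are made up to show the API, not the 1,000-fruit dataset from the example):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Columns are the 0/1 features [Long, Sweet, Yellow]; labels are the classes.
X = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
])
y = ["Banana", "Banana", "Orange", "Orange", "Other", "Other"]

model = BernoulliNB()
model.fit(X, y)

# Classify an unknown fruit that is long, sweet and yellow.
print(model.predict([[1, 1, 1]]))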

Check out how I used Naive Bayes

Check out the next algorithm on the main list...

