C4.5 data mining algorithm in plain English

The C4.5 data mining algorithm is part of a longer article about many more data mining algorithms.

What does it do?

C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified.

Wait, what's a classifier?

A classifier is a tool in data mining that takes a bunch of data representing things we want to classify and attempts to predict which class the new data belongs to.

What's an example of this?

Sure, suppose a dataset contains a bunch of patients. We know various things about each patient like age, pulse, blood pressure, VO₂max, family history, etc. These are called attributes.

Now:

Given these attributes, we want to predict whether the patient will get cancer. The patient can fall into 1 of 2 classes: will get cancer or won't get cancer. C4.5 is told the class for each patient.

And here's the deal:

Using a set of patient attributes and the patient's corresponding class, C4.5 constructs a decision tree that can predict the class for new patients based on their attributes.
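To make that concrete, here is a minimal sketch of learning a tree from labeled examples and then classifying a new one. It is a stripped-down relative of C4.5, not C4.5 itself: it only handles discrete attributes, picks splits by plain information gain, and does no pruning. The patient records, attribute names (family_history, smoker) and class labels are invented purely for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """How much splitting on `attr` reduces the entropy of the labels."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a nested dict (decision node)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    node = {"attr": best, "branches": {}, "default": majority}
    remaining = [a for a in attrs if a != best]
    for value in sorted({row[best] for row in rows}):
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node

def classify(tree, row):
    """Walk from the root to a leaf, answering one question per node."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attr"]], tree["default"])
    return tree

# Invented toy training set: attributes plus the known class of each patient.
patients = [
    {"family_history": "yes", "smoker": "yes"},
    {"family_history": "yes", "smoker": "no"},
    {"family_history": "no", "smoker": "yes"},
    {"family_history": "no", "smoker": "no"},
]
classes = ["will get cancer", "will get cancer",
           "won't get cancer", "won't get cancer"]

tree = build_tree(patients, classes, ["family_history", "smoker"])
new_patient = {"family_history": "yes", "smoker": "no"}
print(classify(tree, new_patient))
```

In this toy data the class lines up perfectly with family_history, so the learned tree asks a single question and ignores smoker entirely.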

Cool, so what's a decision tree?

Decision tree learning creates something similar to a flowchart to classify new data. Using the same patient example, one particular path in the flowchart could be:

Patient has a history of cancer
Patient is expressing a gene highly correlated with cancer patients
Patient has tumors
Patient's tumor size is greater than 5cm

The bottom line is:

At each point in the flowchart is a question about the value of some attribute, and depending on the answers, the patient gets classified. You can find lots of examples of decision trees.
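If you think in code, the path above reads like a chain of nested if statements. This is a hand-written stand-in, not the output of C4.5. The field names are made up, and both the "will get cancer" leaf and the fallback class are assumptions, since the example path doesn't say where it ends. A real tree would also have its own questions on every "no" branch.

```python
def classify_patient(patient):
    """Follow one path of the flowchart; each `if` is one question."""
    if patient["family_history_of_cancer"]:
        if patient["expresses_cancer_gene"]:
            if patient["has_tumors"]:
                if patient["tumor_size_cm"] > 5:
                    return "will get cancer"  # assumed leaf label
    # Assumed fallback: a real tree would keep asking questions here.
    return "won't get cancer"

example = {
    "family_history_of_cancer": True,
    "expresses_cancer_gene": True,
    "has_tumors": True,
    "tumor_size_cm": 6.2,
}
print(classify_patient(example))
```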

Is this supervised or unsupervised?

This is supervised learning, since the training dataset is labeled with classes. Using the patient example, C4.5 doesn't learn on its own that a patient will get cancer or won't get cancer. We told it first, it generated a decision tree, and now it uses the decision tree to classify.

You might be wondering: how is C4.5 different from other decision tree systems?

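First, C4.5 uses information gain when generating the decision tree. More precisely, it ranks candidate splits by gain ratio, which is information gain normalized so that attributes with many distinct values aren't unfairly favored.
Second, although other systems also incorporate pruning, C4.5 uses a single-pass pruning process to mitigate over-fitting. Pruning typically yields a smaller tree that generalizes better to data it hasn't seen.
Third, C4.5 can work with both continuous and discrete data. My understanding is it does this by specifying ranges or thresholds for continuous data, thus turning continuous data into discrete data.
Finally, incomplete data is dealt with in its own ways: records with missing attribute values can still be used for training and can still be classified, rather than being thrown out.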
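Here is a rough sketch of the first and third points: scoring a split by information gain and gain ratio, and turning a continuous attribute into a yes/no question by searching for a threshold. It makes some simplifying assumptions. Candidate thresholds are the midpoints between neighboring values, the threshold is chosen by plain information gain, and the tumor sizes and labels are invented. The actual C4.5 program differs in its details.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy of `labels` minus the weighted entropy of its partitions."""
    total = len(labels)
    return entropy(labels) - sum(len(p) / total * entropy(p) for p in partitions if p)

def gain_ratio(labels, partitions):
    """Information gain divided by the entropy of the partition sizes."""
    total = len(labels)
    split_info = -sum(len(p) / total * log2(len(p) / total) for p in partitions if p)
    return information_gain(labels, partitions) / split_info if split_info > 0 else 0.0

def best_threshold(values, labels):
    """Try each midpoint between distinct sorted values; keep the best gain."""
    distinct = sorted(set(values))
    best_t, best_gain = None, 0.0
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = information_gain(labels, [left, right])
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Invented toy data: tumor size in cm and the known class of each patient.
tumor_size_cm = [1.0, 2.0, 4.0, 6.0, 7.0, 9.0]
outcome = ["no", "no", "no", "yes", "yes", "yes"]

threshold, gain = best_threshold(tumor_size_cm, outcome)
left = [l for v, l in zip(tumor_size_cm, outcome) if v <= threshold]
right = [l for v, l in zip(tumor_size_cm, outcome) if v > threshold]
print(threshold, gain, gain_ratio(outcome, [left, right]))
```

With this toy data the search settles on a threshold of 5.0, which separates the two classes cleanly, so the continuous attribute becomes the discrete question "is the tumor larger than 5 cm?".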

Why use C4.5?

Arguably, the best selling point of decision trees is their ease of interpretation and explanation. They are also quite fast and quite popular, and their output is human readable.

Where is it used?

A popular open-source Java implementation can be found over at OpenTox. Orange, an open-source data visualization and analysis tool for data mining, implements C4.5 in its decision tree classifier.

Check out how I used C5.0 (the successor to C4.5)

Classifiers are great, but make sure to check out the other data mining algorithms...

About the Author

Ray Li

Ray is a software engineer and data enthusiast who has been blogging for over a decade. He loves to learn, teach and grow. You’ll usually find him wrangling data, programming and lifehacking.