This description of the kNN data mining algorithm is part of a longer article covering many more data mining algorithms.
What does it do?
kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the classifiers previously described because it's a lazy learner.
What's a lazy learner?
A lazy learner doesn't do much during the training process other than store the training data. Only when new unlabeled data is input does this type of learner start to classify.
On the other hand, an eager learner builds a classification model during training. When new unlabeled data is input, this type of learner feeds the data into the classification model.
How do C4.5, SVM, and AdaBoost fit into this?
Unlike kNN, they are all eager learners.
Here's why:
- C4.5 builds a decision tree classification model during training.
- SVM builds a hyperplane classification model during training.
- AdaBoost builds an ensemble classification model during training.
So what does kNN do?
kNN builds no such classification model. Instead, it just stores the labeled training data.
When new unlabeled data comes in, kNN operates in 2 basic steps:
- First, it looks at the k closest labeled training data points -- in other words, the k-nearest neighbors.
- Second, using the neighbors' classes, kNN gets a better idea of how the new data should be classified.
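To make those 2 steps concrete, here's a minimal sketch of a kNN classifier in plain Python. The function names and the tiny dataset are just for illustration, and it uses Euclidean distance and a simple majority vote (both covered below):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(new_point, training_data, k=3):
    """training_data is a list of (features, label) pairs."""
    # Step 1: find the k closest labeled training points
    neighbors = sorted(training_data,
                       key=lambda pair: euclidean(new_point, pair[0]))[:k]
    # Step 2: let the neighbors' classes decide -- here, a simple majority vote
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Tiny made-up dataset: two classes in 2D
training_data = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'),
                 ((5.0, 5.0), 'B'), ((4.8, 5.2), 'B')]
print(knn_classify((1.1, 0.9), training_data, k=3))  # prints 'A'
```

Notice there's no training step at all: the "model" is just the stored training data.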
You might be wondering...
How does kNN figure out what's closer?
For continuous data, kNN uses a distance metric like Euclidean distance. The choice of distance metric largely depends on the data. Some even suggest learning a distance metric based on the training data. There are tons more details and papers on kNN distance metrics.
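If you want to play with a few metrics side by side, SciPy ships with ready-made distance functions (the points here are made up purely for illustration):

```python
from scipy.spatial.distance import euclidean, cityblock, minkowski

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]

print(euclidean(a, b))       # straight-line distance: sqrt(3^2 + 2^2) ~ 3.61
print(cityblock(a, b))       # Manhattan distance: 3 + 2 = 5.0
print(minkowski(a, b, p=3))  # generalizes both through the p parameter
```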
For discrete data, the idea is to transform discrete data into continuous data. 2 examples of this are:
- Using Hamming distance as a metric for the "closeness" of 2 text strings.
- Transforming discrete data into binary features.
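Here's a rough sketch of both ideas, using SciPy for Hamming distance and scikit-learn for the binary (one-hot) features; the strings and categories are invented for illustration:

```python
from scipy.spatial.distance import hamming
from sklearn.preprocessing import OneHotEncoder

# Hamming distance: the fraction of positions where two equal-length sequences differ
print(hamming(list("karolin"), list("kathrin")))  # 3 of 7 positions differ ~ 0.43

# Turning a discrete feature into binary features (one-hot encoding),
# so a continuous metric like Euclidean distance can be applied
encoder = OneHotEncoder()
colors = [["red"], ["green"], ["blue"], ["green"]]
print(encoder.fit_transform(colors).toarray())
```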
How does kNN classify new data when neighbors disagree?
kNN has an easy time when all the neighbors are the same class. The intuition is that if all the neighbors agree, then the new data point likely falls in the same class.
I'll bet you can guess where things get hairy...
How does kNN decide the class when neighbors don't have the same class?
2 common techniques for dealing with this are:
- Take a simple majority vote from the neighbors. Whichever class has the greatest number of votes becomes the class for the new data point.
- Take a similar vote, except give a heavier weight to those neighbors that are closer. A simple way to do this is to use reciprocal distance: if the neighbor is 5 units away, then weight its vote 1/5. As the neighbor gets farther away, the reciprocal distance gets smaller and smaller... exactly what we want!
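Here's a quick sketch of both voting schemes, assuming we've already found each neighbor's distance and class (the numbers are made up):

```python
from collections import Counter, defaultdict

neighbors = [(0.5, 'A'), (1.0, 'B'), (2.0, 'B')]  # (distance, class) pairs

# Simple majority vote: every neighbor counts equally
majority = Counter(label for _, label in neighbors).most_common(1)[0][0]
print(majority)  # 'B' -- two votes to one

# Distance-weighted vote: each vote is weighted by 1 / distance
weights = defaultdict(float)
for distance, label in neighbors:
    weights[label] += 1.0 / distance
print(max(weights, key=weights.get))  # 'A' -- 1/0.5 = 2.0 beats 1/1.0 + 1/2.0 = 1.5
```

Notice how the 2 schemes can disagree: the single close 'A' neighbor outvotes the 2 farther 'B' neighbors once distance is taken into account.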
Is this supervised or unsupervised?
This is supervised learning, since kNN is provided a labeled training dataset.
Why use kNN?
Ease of understanding and implementing are 2 of the key reasons to use kNN. Depending on the distance metric, kNN can be quite accurate.
But that's just part of the story...
Here are 5 things to watch out for:
- kNN can get very computationally expensive when trying to determine the nearest neighbors on a large dataset.
- Noisy data can throw off kNN classifications.
- Features with a larger range of values can dominate the distance metric relative to features that have a smaller range, so feature scaling is important (see the scaling sketch after this list).
- Since data processing is deferred, kNN generally has greater storage requirements than eager classifiers.
- Selecting a good distance metric is crucial to kNN's accuracy.
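On the feature scaling point, here's a minimal sketch using scikit-learn; the little age/income dataset and the class labels are invented purely for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Income is in the tens of thousands while age is in the tens,
# so unscaled income would dominate the Euclidean distance
X = [[25, 30000], [45, 32000], [35, 90000], [50, 95000]]
y = ['budget', 'budget', 'premium', 'premium']

# Standardizing each feature to zero mean and unit variance puts them on equal footing
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[30, 31000]]))  # ['budget']
```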
Where is it used?
A number of kNN implementations exist:
- MATLAB k-nearest neighbor classification
- scikit-learn KNeighborsClassifier
- k-Nearest Neighbour Classification in R
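As a taste of the scikit-learn one, here's a quick sketch on its bundled iris dataset -- note that weights='distance' gives you the reciprocal-distance style of voting discussed above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# weights='distance' weights each neighbor's vote by the inverse of its distance
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on the held-out data
```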
Spam? Fuhgeddaboudit! Check out the entire algorithm list to learn about the next algorithm...