What does it do?
Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high level, SVM performs a task similar to C4.5, except SVM doesn't use decision trees at all.
Whoa, a hyper-what?
A hyperplane is a function like the equation for a line, $latex y = mx + b$. In fact, for a simple classification task with just 2 features, the hyperplane can be a line.
As it turns out...
SVM can perform a trick to project your data into higher dimensions. Once projected into higher dimensions...
...SVM figures out the best hyperplane which separates your data into the 2 classes.
Do you have an example?
Absolutely, the simplest example I found starts with a bunch of red and blue balls on a table. If the balls aren't too mixed together, you could take a stick and without moving the balls, separate them with the stick.
When a new ball is added on the table, by knowing which side of the stick the ball is on, you can predict its color.
What do the balls, table and stick represent?
The balls represent data points, and the red and blue color represent 2 classes. The stick represents the hyperplane which in this case is a line.
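To make the stick idea concrete, here is a minimal Python sketch. It assumes the stick is already known and written as `w . point + b = 0`; the weights `w`, the offset `b` and the function names are made up for illustration (in a real SVM they are learned from the data).

```python
def side_of_stick(point, w, b):
    """Signed value of w . point + b; its sign tells which side of the stick the ball is on."""
    return sum(wi * xi for wi, xi in zip(w, point)) + b


def predict_color(point, w, b):
    """Assumed convention: positive side is red, the other side is blue."""
    return "red" if side_of_stick(point, w, b) > 0 else "blue"


# Hypothetical stick: the line y = x, rewritten as -1*x + 1*y + 0 = 0.
w, b = (-1.0, 1.0), 0.0

print(predict_color((20, 50), w, b))  # ball above the stick
print(predict_color((50, 20), w, b))  # ball below the stick
```

Predicting the color of a new ball is just a matter of checking the sign of one number.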
And the coolest part?
SVM figures out the function for the hyperplane.
What if things get more complicated?
Right, they frequently do. If the balls are mixed together, a straight stick won't work.
Here's the work-around:
Quickly lift up the table, throwing the balls in the air. If the balls are thrown up in just the right way, you can use a large sheet of paper to divide them while they're still in the air.
You might be wondering if this is cheating:
Nope, lifting up the table is the equivalent of mapping your data into higher dimensions. In this case, we go from the 2 dimensional table surface to the 3 dimensional balls in the air.
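Here is one way "lifting the table" could look in Python. This is only a sketch under assumptions: the data is invented, coordinates are measured from the centre of the table, and the height `z = x^2 + y^2` is one hand-picked mapping, not what SVM necessarily uses.

```python
def lift(point):
    """Map a 2D point (x, y) to 3D by giving it a height z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)


# Made-up balls: blue ones clustered near the centre, red ones surrounding them.
blue = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
red = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

# On the table, no straight stick separates a cluster from the balls around it.
# In the air, a flat sheet of paper at height z = 1 does (height chosen by hand).
SHEET_HEIGHT = 1.0


def predict_color(point):
    """Balls above the sheet are red, balls below it are blue."""
    return "red" if lift(point)[2] > SHEET_HEIGHT else "blue"


print([predict_color(p) for p in blue])
print([predict_color(p) for p in red])
```

The balls far from the centre fly higher, so a flat sheet can slide in between the two colors.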
How does SVM do this?
By using a kernel we have a nice way to operate in higher dimensions. The large sheet of paper is still called a hyperplane, but it is now a function for a plane rather than a line. Note from Yuval that once we're in 3 dimensions, the hyperplane must be a plane rather than a line.
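A small sketch of why a kernel is a "nice way" to work in higher dimensions. The post doesn't say which kernel is meant, so this assumes a degree-2 polynomial kernel as an example; `phi` and `poly_kernel` are hypothetical names. The point is that the kernel gives the same number as mapping both points to 3 dimensions and taking a dot product there, without ever doing the mapping.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(p):
    """Explicit map from 2D to 3D that matches the degree-2 polynomial kernel."""
    x, y = p
    return (x * x, math.sqrt(2) * x * y, y * y)


def poly_kernel(p, q):
    """Degree-2 polynomial kernel (p . q)^2, computed entirely in 2D."""
    return dot(p, q) ** 2


p, q = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(p), phi(q))  # go up to 3D first, then take the dot product
shortcut = poly_kernel(p, q)    # stay in 2D the whole time

print(explicit, shortcut)
```

Both routes agree (up to floating-point rounding), which is why SVM can act as if the balls were in the air while only ever doing arithmetic on table coordinates.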
I found this visualization super helpful:
Reddit also has 2 great threads on this in the ELI5 and ML subreddits.
How do balls on a table or in the air map to real-life data?
A ball on a table has a location that we can specify using coordinates. For example, a ball could be 20cm from the left edge and 50cm from the bottom edge. Another way to describe the ball is as (x, y) coordinates or (20, 50). x and y are 2 dimensions of the ball.
Here's the deal:
If we had a patient dataset, each patient could be described by various measurements like pulse, cholesterol level, blood pressure, etc. Each of these measurements is a dimension.
The bottom line is:
SVM takes these data points, maps them into a higher dimension and then finds the hyperplane to separate the classes.
Margins are often associated with SVM. What are they?
The margin is the distance between the hyperplane and the closest data point from each class. In the ball and table example, the distance between the stick and the closest red and blue ball is the margin.
The key is:
SVM attempts to maximize the margin, so that the hyperplane is just as far away from the closest red ball as from the closest blue ball. In this way, it decreases the chance of misclassification.
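To see what "maximizing the margin" means, here is a rough Python sketch that compares two candidate sticks by hand. The balls and both sticks are made up, and the real SVM searches over all possible sticks rather than picking between two.

```python
import math


def distance_to_stick(point, w, b):
    """Perpendicular distance from a point to the line w . point + b = 0."""
    value = sum(wi * xi for wi, xi in zip(w, point)) + b
    return abs(value) / math.sqrt(sum(wi * wi for wi in w))


def margin(points, w, b):
    """Distance from the stick to the closest ball of either color."""
    return min(distance_to_stick(p, w, b) for p in points)


# Made-up balls on the table.
red = [(1.0, 4.0), (2.0, 5.0)]
blue = [(4.0, 1.0), (5.0, 2.0)]

# Two hypothetical sticks; both separate red from blue.
centered = ((-1.0, 1.0), 0.0)     # the line y = x
off_center = ((-1.0, 1.0), -2.0)  # the line y = x + 2, hugging the red balls

print(margin(red + blue, *centered))
print(margin(red + blue, *off_center))
```

Both sticks classify these balls correctly, but the centered one leaves more room on each side, so a new ball landing near the boundary is less likely to end up on the wrong side.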
Where does SVM get its name from?
Using the ball and table example, the hyperplane is equidistant from the closest red ball and the closest blue ball. These balls or data points are called support vectors, because they support the hyperplane.
Is this supervised or unsupervised?
This is supervised learning, since a labeled dataset is first used to teach the SVM about the classes. Only then is the SVM capable of classifying new data.
Why use SVM?
SVM, along with C4.5, is generally one of the 2 classifiers to try first. That said, no classifier will be the best in all cases due to the No Free Lunch Theorem. In addition, kernel selection and interpretability are some weaknesses of SVM.
Where is it used?
There are many implementations of SVM. A few of the popular ones are scikit-learn, MATLAB and of course libsvm.
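As a quick illustration with one of these, here is a minimal scikit-learn sketch of the ball and table example. The data is invented, and it assumes scikit-learn is installed; `SVC` with `kernel="linear"` is used so the hyperplane stays a straight stick.

```python
from sklearn.svm import SVC

# Made-up training data: (x, y) positions of balls on the table and their colors.
X = [[1, 4], [2, 5], [1, 5], [4, 1], [5, 2], [5, 1]]
y = ["red", "red", "red", "blue", "blue", "blue"]

# Supervised step: teach the SVM about the classes.
clf = SVC(kernel="linear")
clf.fit(X, y)

# Only now can it classify new balls.
new_balls = [[2, 6], [6, 2]]
predictions = clf.predict(new_balls)

print(predictions)
print(clf.support_vectors_)  # the balls that "support" the learned stick
```

Swapping `kernel="linear"` for another kernel such as `"rbf"` is how you would ask scikit-learn to do the "throw the balls in the air" trick.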
The next algorithm on the main list of data mining algorithms is one of my favorites...