Top 10 data mining algorithms in plain English

Today, I'm going to explain in plain English the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.

Once you know what they are, how they work, what they do and where you can find them, my hope is you'll have this blog post as a springboard to learn even more about data mining.

What are we waiting for? Let's get started!

C4.5 data mining algorithm

C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified.
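
To make "constructs a decision tree" a bit more concrete, here is a minimal sketch of the scoring step C4.5 is known for: rating each attribute by its gain ratio and splitting on the best one. The patient records, attribute names and labels are invented for illustration, and a real C4.5 implementation also handles numeric attributes, missing values and pruning.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, attribute):
    """Information gain from splitting on `attribute`, divided by split info."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split info penalizes attributes that shatter the data into many values.
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical, already-classified training data.
patients = [
    {"age": "young", "smoker": "no"},
    {"age": "young", "smoker": "yes"},
    {"age": "old", "smoker": "no"},
    {"age": "old", "smoker": "yes"},
]
gets_sick = ["no", "yes", "no", "yes"]

# The attribute with the highest gain ratio becomes the root of the tree.
best_attribute = max(["age", "smoker"],
                     key=lambda a: gain_ratio(patients, gets_sick, a))
print(best_attribute)
```

In this toy data, "smoker" separates the classes perfectly while "age" tells us nothing, so the tree would split on "smoker" first.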

k-means data mining algorithm

k-means creates $k$ groups from a set of objects so that the members of each group are more similar to one another than to members of other groups. It’s a popular cluster analysis technique for exploring a dataset.
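
Here is a rough sketch of the usual k-means loop: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat. The 2-D points are made up, and starting from the first $k$ points is a simplifying assumption (real implementations usually pick starting centroids more carefully).

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points into k groups; returns (centroids, clusters)."""
    centroids = list(points[:k])  # naive start: the first k points (an assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data: two obvious blobs.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

The points near (1, 1) end up in one group and the points near (9, 9) in the other, without anyone telling the algorithm which is which.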

SVM data mining algorithm

Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high level, SVM performs a task similar to C4.5, except SVM doesn’t use decision trees at all.
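
The sketch below skips the learning part and only shows what a hyperplane does once you have one: in 2 dimensions it is a line, and a point's class depends on which side of the line it falls. The weights `w` and offset `b` are hand-picked for illustration, not learned, and the margin formula assumes the usual convention that the closest points score exactly +1 or -1.

```python
from math import hypot


def classify(w, b, point):
    """Return the class for `point` based on its side of the line w.x + b = 0."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return "red" if score >= 0 else "blue"


def margin_width(w):
    """Width of the margin, assuming the closest points score exactly +1 or -1."""
    return 2 / hypot(w[0], w[1])


# Hypothetical hyperplane: the line x + y = 10.
w, b = (1.0, 1.0), -10.0

print(classify(w, b, (8, 7)))  # above the line
print(classify(w, b, (2, 3)))  # below the line
print(margin_width(w))
```

Training an SVM amounts to searching for the `w` and `b` that separate the classes with the widest possible margin.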

Apriori data mining algorithm

The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions.
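
As a small sketch of how this plays out, the code below works in two phases: first it finds the frequent itemsets using only support, then it turns those itemsets into rules using confidence. The shopping baskets and both thresholds are invented for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Phase 1: grow itemsets level by level, keeping those with enough support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    size = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        size += 1
        # Join survivors into bigger candidates, then prune any candidate
        # that has an infrequent subset (the "a priori" property).
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == size}
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, size - 1))
        ]
    return frequent


def association_rules(frequent, min_confidence):
    """Phase 2: split each frequent itemset into 'if lhs then rhs' rules."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), confidence))
    return rules


# Hypothetical shopping baskets.
baskets = [{"chips", "dip"}, {"chips", "dip", "soda"}, {"chips", "soda"}, {"dip"}]
frequent = frequent_itemsets(baskets, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.9)
print(rules)
```

With these baskets, the only rule that clears the confidence bar is "if soda, then chips": every basket containing soda also contains chips.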

EM data mining algorithm

In data mining, expectation-maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery.
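
To show the "expectation" and "maximization" steps side by side, here is a minimal sketch that fits a mixture of two 1-D Gaussians. The data, the starting guesses and the fixed number of iterations are all assumptions made for the example; real uses typically check for convergence instead.

```python
from math import exp, pi, sqrt


def gaussian(x, mean, var):
    """Density of a normal distribution at x."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def em_two_gaussians(data, iterations=20):
    """Fit a 2-component Gaussian mixture; returns (means, variances, weights)."""
    means = [min(data), max(data)]  # crude starting guess (an assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: how responsible is each cluster for each point?
        resp = []
        for x in data:
            likes = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(2)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # M-step: re-estimate each cluster from its (soft) members.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a zero variance
    return means, variances, weights


# Hypothetical measurements from two groups.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
means, variances, weights = em_two_gaussians(data)
print(means, weights)
```

Unlike k-means, each point gets a soft share of every cluster rather than a hard assignment, which is why the E-step produces fractions instead of labels.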

PageRank data mining algorithm

PageRank is a link analysis algorithm designed to determine the relative importance of some object linked within a network of objects.
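
One common way to compute this is to start every page with an equal score and repeatedly let each page hand its score to the pages it links to. The sketch below does exactly that. The 3-page network is invented, the 0.85 damping factor is the commonly quoted default, and the code assumes every page has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively compute PageRank for a dict of page -> list of linked pages."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share...
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        # ...plus an equal slice of the rank of each page linking to it.
        for page, outgoing in links.items():
            share = damping * ranks[page] / len(outgoing)
            for target in outgoing:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical network: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Page C comes out on top here because it collects links from both A and B, while B only receives half of A's vote.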

AdaBoost data mining algorithm

AdaBoost is a boosting algorithm which constructs a classifier. As you probably remember, a classifier takes a bunch of data and attempts to predict or classify which class a new data element belongs to.
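
The heart of boosting is a single bookkeeping step that repeats every round: score the current weak learner, then raise the weights of the examples it got wrong so the next learner focuses on them. Here is a sketch of one such round. The labels and predictions are made up, and the function assumes the weak learner's weighted error is strictly between 0 and 1.

```python
from math import exp, log


def adaboost_round(weights, predictions, labels):
    """Run one AdaBoost round; returns (learner_vote, new_example_weights)."""
    # Weighted error: total weight of the misclassified examples.
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    # The learner's say in the final vote: lower error means a bigger say.
    alpha = 0.5 * log((1 - error) / error)
    # Shrink weights of correct examples, grow weights of mistakes.
    updated = [
        w * exp(-alpha if p == y else alpha)
        for w, p, y in zip(weights, predictions, labels)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical round: 5 equally weighted examples, the last one misclassified.
labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round([0.2] * 5, predictions, labels)
print(alpha, new_weights)
```

After this round the single misclassified example carries half of the total weight, so the next weak learner is strongly pushed to get it right.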

kNN data mining algorithm

kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the classifiers previously described because it’s a lazy learner.
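
"Lazy" is easy to see in code: there is no training function at all. The sketch below just keeps the labeled data around and does all of its work when a new point shows up. The points, labels and the choice of $k = 3$ are invented for the example, and plain Euclidean distance is an assumption.

```python
from collections import Counter
from math import dist


def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled data: (point, class).
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]

print(knn_classify(training, (2, 2)))
print(knn_classify(training, (7, 9)))
```

The trade-off is that every classification scans the whole training set, which is why lazy learners train instantly but can be slow to classify.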

Naive Bayes data mining algorithm

Naive Bayes is not a single algorithm, but a family of classification algorithms that share one common assumption: Every feature of the data being classified is independent of all other features given the class.
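
Because of that independence assumption, scoring a class is just multiplication: the probability of the class times the probability of each feature given the class. Here is a small sketch with a fruit table whose counts are invented for illustration. The scores are proportional to the real probabilities, which is enough to pick a winner.

```python
# Hypothetical counts: how many fruits of each kind have each feature.
counts = {
    "banana": {"total": 500, "long": 400, "sweet": 350, "yellow": 450},
    "orange": {"total": 300, "long": 0, "sweet": 150, "yellow": 300},
    "other": {"total": 200, "long": 100, "sweet": 150, "yellow": 50},
}


def score(fruit, features):
    """P(fruit) times the product of P(feature | fruit) for each feature."""
    all_fruit = sum(c["total"] for c in counts.values())
    result = counts[fruit]["total"] / all_fruit  # prior, e.g. P(Banana)
    for feature in features:
        # e.g. P(Long | Banana), P(Sweet | Banana), P(Yellow | Banana)
        result *= counts[fruit][feature] / counts[fruit]["total"]
    return result


observed = ["long", "sweet", "yellow"]
scores = {fruit: score(fruit, observed) for fruit in counts}
best_guess = max(scores, key=scores.get)
print(scores, best_guess)
```

Notice that a single zero count (no long oranges in this table) wipes out a class entirely, which is why practical implementations usually add some smoothing.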

CART data mining algorithm

CART stands for classification and regression trees. It is a decision tree learning technique that outputs either classification or regression trees. Like C4.5, CART is a classifier.
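
CART's classification trees are usually described as growing through binary splits scored by Gini impurity. The sketch below finds the best single split of one numeric feature that way. The values and labels are made up, and a full CART implementation would repeat this across all features, recurse on each side, and prune.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 means the group is pure."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (threshold, weighted_impurity) of the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical feature values with their classes.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_binary_split(values, labels)
print(threshold, impurity)
```

For a regression tree, the same search would score splits by how much they reduce the squared error instead of the Gini impurity.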

Interesting Resources

Now it's your turn...

Now that I've shared my thoughts and research around these data mining algorithms, I want to turn it over to you.

Are you going to give data mining a try?
Which data mining algorithms have you heard of but weren't on the list?
Or maybe you have a question about an algorithm?

Let me know what you think by leaving a comment below right now.

Thanks to Yuval Merhav and Oliver Keyes for their suggestions which I've incorporated into the post.

Thanks to Dan Steinberg (yes, the CART expert!) for the suggested updates to the CART section which have now been added.

About the Author

Ray Li

Ray is a software engineer and data enthusiast who has been blogging for over a decade. He loves to learn, teach and grow. You’ll usually find him wrangling data, programming and lifehacking.

Comments 150

Pingback: 1 – Top data mining algorithms in plain English | blog.offeryour.com
Pingback: Bookmarks for May 17th | Chris's Digital Detritus
Joe Guy
May 17, 2015 at 6:19 pm

Your explanation of SVM is the best I have ever seen. Thanks!

Reply
1. Raymond Li
  May 17, 2015 at 7:14 pm
  
  Thanks, Joe. Definitely appreciate it! 🙂
  
  I owe a lot of it to a few threads from Reddit and Yuval (both are linked in the post above).
  
  Reply
2. Heather Stark
  May 29, 2015 at 9:25 am
  
  Agree!!!
  
  Reply
Roger Huang
May 17, 2015 at 7:10 pm

Really snappy and informative view into data mining algorithms. I clicked on a whole ton of links: always a mark of a resource done right! Kudos.

Reply
1. Raymond Li
  May 17, 2015 at 8:27 pm
  
  Thanks, Roger. I’m happy you found it snappy and click-worthy. 🙂 Sometimes data mining resources can be a bit on the dry side.
  
  Reply
Lakshminarayanan
May 17, 2015 at 11:41 pm

Thanks for the excellent compile
This is what I was looking for as a starter.

Reply
1. Ray Li
  May 18, 2015 at 12:19 pm
  
  Thanks, Lakshminarayanan!
  
  Reply
Pingback: Els 10 primers algoritmes del Big data explicats en paraules | Blog d'estadística oficial
Pingback: LessThunk.com « Top 10 data mining algorithms in plain English — recommended for big data users
Meghana
May 18, 2015 at 2:57 am

Out of all the numerous websites about data mining algorithms I have gone through, this one is by far the best! Explaining everything in such casual terms really helps beginners like me. The examples were definitely apt and helpful.

Thank you so much! You made my work a lot easier. 🙂

Reply
1. Ray Li
  May 18, 2015 at 12:26 pm
  
  I’m excited to hear this helped with your work, Meghana! And really appreciate the kind words. 🙂
  
  Reply
Vrashabh Irde
May 18, 2015 at 3:36 am

This is an awesome list! Thanks. I’m trying to dabble in ML myself, and having a simple know-how of everything is very useful.

Reply
1. Ray Li
  May 18, 2015 at 12:29 pm
  
  Very glad to hear you find it useful, Vrashabh. Thank you!
  
  Reply
Pingback: Data Mining Algorithms Explained : Stephen E. Arnold @ Beyond Search
Pingback: Distilled News | Data Analytics & R
Kyle
May 18, 2015 at 9:23 am

Excellent man, this is so well explained. Thanks!!

Reply
1. Ray Li
  May 18, 2015 at 12:32 pm
  
  My pleasure, Kyle. 🙂
  
  Reply
suanfazu
May 18, 2015 at 9:47 am

Thanks for the excellent share

Reply
1. Ray Li
  May 18, 2015 at 12:35 pm
  
  My pleasure, Suanfazu! Thanks for exploring the blog and leaving your kind words. 🙂
  
  Reply
Anonymous
May 18, 2015 at 9:53 am

Hey, great introduction! I would love to see more posts like this in our community; great way to grasp the concept of algorithms before diving into the hard math.

Just one thing, though: On Step 2 in Naive Bayes you repeated P(Long | Banana) twice. The third one should be P(Yellow | Banana).

Thanks again!

Reply
1. Ray Li
  May 18, 2015 at 12:45 pm
  
  Hi Anonymous,
  
  Nice catch! I fixed it now, but have no one to attribute the fix to. 🙁
  
  I totally agree about understanding the concepts of the algorithm before the hard math. I’ve always felt using concepts and examples as a platform for understanding makes the math part way easier.
  
  Thanks again,
  Ray
  
  Reply
Robert Klein
May 18, 2015 at 9:59 am

This is a great resource. I’ve bookmarked it. Thanks for your work. I love using height-zip code to illustrate independence. That will be a go-to for me now. The only thing I can offer in return is a heads-up about the API we just released for ML preprocessing. It’s all about correlating themes in unstructured information streams. Hope it’s useful. Let us know what you think. Thanks again.

Reply
1. Ray Li
  May 18, 2015 at 12:57 pm
  
  Thanks for bookmarking and the heads-up, Robert! 🙂
  
  Reply
Raghav
May 18, 2015 at 12:46 pm

Hello Ray,
Thanks for a great article.
It looks like there is a typo in step 2 of Naive Bayes. One of the probabilities should be P(Yellow|Banana).
Thanks again!

Reply
1. Ray Li
  May 18, 2015 at 1:00 pm
  
  My pleasure, Raghav. Thanks also for letting me know about the typo. It should be corrected now.
  
  Reply
Jens
May 18, 2015 at 1:00 pm

Hello Raymond,

first of all kudos for your sum up of data mining algos!

I’ve been exploring this for a few weeks now (mainly using scikit learn and nltk in python).

In the past few days I came up with the idea to create a classifier that is able to group products by their title to a corresponding product taxonomy.

For that I crawled a German product marketplace for their category landing pages and created a corpus consisting of a taxonomy tree node in column “a” and a set of Snowball-stemmed relevant uni- and bigram keywords (approx. 50 per node) that have been extracted from all products on each category page (this is comma separated in column “b”).

Now I would like to build a classifier from that with the idea in mind, that I could throw stemmed product titles at the classifier and let it return the most probable taxonomy node.

Could you advise which would be the most appropriate one for the given task. I can email you the corpus…

Hope to get some direction… to omit any detours / too much trial and error.

Looking forward to your reply.

Thanks again for your great article.

Cheers from Cologne Germany

Jens

Reply
1. Ray Li
  May 18, 2015 at 9:05 pm
  
  Hi Jens,
  
  Thanks for the kudos and taking the time to leave a comment.
  
  Short answer to your question…
  I don’t know. 🙂 It sounds like there’s a bunch I could learn from you!
  
  For example:
  You just taught me about stemming and the Snowball framework. Honestly, I’m amazed there are tools like Snowball that can create stemming algorithms. Very cool!
  
  Longer answer…
  I found the StackOverflow.com, stats.stackexchange.com and reddit.com forums invaluable when I was learning, researching and simplifying the algorithms to make them easier to describe.
  
  Sorry I couldn’t be more help, but I’m working to catch up… 🙂
  
  Ray
  
  Reply
  1. Jens
    May 20, 2015 at 6:48 am
    
    Hi Ray,
    
    thanks for your feedback 🙂
    I found a good solution in the meantime using a naive bayes approach.
    
    By the way your regular contact form does not work. There is an htaccess authentication popping up upon form submit.
    
    Cheers
    Jens
    
    Reply
    1. Ray Li
      May 20, 2015 at 7:30 am
      
      Awesome!
      
      Also, thanks for the heads up about the contact form. It should be fixed now. There’s a small issue with the confirmation message (some fields are not displayed), but no more auth pop-up and the message successfully sends.
      
      Reply
Malhar
May 18, 2015 at 1:21 pm

This goes in my bookmarks. Excellent simple explanation. Loved that you covered SVM. It would be great if you could add neural networks with various kernels.

Reply
1. Ray Li
  May 18, 2015 at 9:21 pm
  
  Definitely appreciate the bookmark, Malhar! Thanks for your suggestion about the neural nets. I’ll definitely be diving into that one very soon.
  
  Reply
2. Meghana
  May 18, 2015 at 11:37 pm
  
  Exactly the same concern, Malhar. I was looking for information on Neural Networks as well.
  
  Reply
Serge
May 18, 2015 at 6:04 pm

Man, I really wish I had this guide a few years ago! I was trying my hand at unsupervised categorization of email messages. I didn’t know what terms to google, so the only thing I used was LSM (latent semantic mapping). The problem is, when you have thousands of words and tens of thousands of emails, the N^2 matrix gets a little hard to handle, computationally. I ended up giving up on it.

What I had never considered was using a different algorithm to pre-create groups, which would have helped a lot. This was a useful read.

Reply
1. Ray Li
  May 18, 2015 at 9:31 pm
  
  Thanks for reading and your kind words, Serge!
  
  Reply
Pingback: The Data Scientist - Professional Data Science in Singapore » 10 Data Science Algorithms Explained – In English
David
May 19, 2015 at 5:27 pm

Great article! Now, as a public service, how about a decision tree or categorization matrix for selecting the right algorithm?

Reply
1. Ray Li
  May 20, 2015 at 12:28 am
  
  Thanks, David.
  
  It’s a good call about selecting the right algorithm. From all the readings so far, I feel picking the right one is the hardest part.
  
  It’s one of the main reasons I was attracted to the original survey paper despite it being a bit outdated. Might as well dive into the ones the panelists thought were important, and then figure out why they use them.
  
  I certainly have a lot more to learn, and I’m already having some ideas on future posts.
  
  Ray
  
  Reply
D Lego
May 19, 2015 at 5:48 pm

Good post. Curiously, I’m writing a version in Spanish on this same theme.

Reply
1. Ray Li
  May 20, 2015 at 12:34 am
  
  Thank you, D Lego. I’m curious — can you email me the link?
  
  Reply
Pingback: Data mining algorithms | has many :code_blocks
michael davies
May 20, 2015 at 8:24 am

Great work Raymond

Reply
1. Ray Li
  May 20, 2015 at 12:57 pm
  
  Appreciate it, Michael!
  
  Reply
Sthitaprajna Sahoo
May 20, 2015 at 9:04 am

Couldn’t ask for a simpler explanation. A very good collection, and hoping for more posts from you.

Reply
1. Ray Li
  May 20, 2015 at 1:00 pm
  
  My pleasure, Sthitaprajna.
  
  Reply
Pingback: Data Mining Algorithms Explained In Plain Language | Artificial Intelligence Matters
Stephen Oman
May 20, 2015 at 10:36 am

This is a really excellent article with some nice explanations. Looking forward to your piece on Artificial Neural Networks too!

Reply
1. Ray Li
  May 20, 2015 at 1:31 pm
  
  Thanks, Stephen!
  
  Reply
Richard Grigonis
May 21, 2015 at 9:13 am

Including Decision Forests would have been nice.

Reply
1. Ray Li
  May 21, 2015 at 10:13 pm
  
  Although I haven’t used that one myself, that’s a good one, Richard!
  
  Reply
Pingback: Top 10 data mining algorithms in plain English « Another Word For It
Daniel Zilber
May 21, 2015 at 12:03 pm

Thanks for the write up!

Reply
1. Ray Li
  May 21, 2015 at 10:13 pm
  
  Appreciate it, Daniel. 🙂
  
  Reply
Pingback: Els 10 primers algoritmes del Big data explicats en paraules | Econometria aplicada
Sylvio Allore
May 21, 2015 at 10:26 pm

Hello,

It is a good review of things undergraduates learn, but what about starting with just a single example of an application, such as predicting stock returns? Do you have an example of applying, say, naive Bayes to predicting stock returns? That would be more useful than listing a set of methods one can find in most ML books.

Reply
1. Ray Li
  May 21, 2015 at 11:00 pm
  
  Thanks, Sylvio. I appreciate the constructive comments.
  
  Depth and real-life applications are certainly something to improve on in this article series (Yep… I think it deserves to be a series!). Stay tuned… 🙂
  
  Reply
Ray Li
May 21, 2015 at 10:31 pm

Super excited about this…

Due to all your comments and sharing, this article has been reposted to KDnuggets, a leading resource on data mining: http://bit.ly/1AoicbW!

There’s no way this could’ve happened without you reading, commenting and sharing. My sincerest thank you! 🙂

Reply
Matt Cairnduff
May 22, 2015 at 5:07 am

Echoing all the sentiments above Ray. This is a tremendously useful resource that’s gone straight into my bookmarks. Really appreciate the informal writing style as well, which makes it nice and accessible, and easy to share with colleagues!

Reply
1. Ray Li
  May 22, 2015 at 5:41 pm
  
  Thank you, Matt. I’m glad you found the writing style accessible and shareable. Please do share… 🙂
  
  Reply
Adriana Wilde
May 22, 2015 at 5:36 am

Excellent blogpost! Very accessible and rather complete (apart from multilayer perceptrons, which I hope you’ll touch in a follow up post).
I found it useful that you refer to the NFL theorem and list the characteristics of each algorithm that make it better suited to one type of problem than another (e.g. lazy learners are faster in training but slower classifiers, and why). I also liked that you explained which algorithms are for supervised and unsupervised learning. These are all things to take into account when choosing a classifier. Wish I had read this 5 years ago!
Thanks!

Reply
1. Ray Li
  May 22, 2015 at 5:52 pm
  
  Hi Adriana,
  
  Thank you for your kind words.
  
  I think I came across the standard perceptron while researching SVM. Definitely thinking about tackling MLPs and more recently all the buzz about deep learning at some point.
  
  Thanks for your insightful comment.
  
  Ray
  
  Reply
brian piercy
May 22, 2015 at 8:00 am

What an awesome article! I learned more from this than 20 hours of plowing through SciKit. Well done!

Reply
1. Ray Li
  May 22, 2015 at 5:53 pm
  
  Appreciate it, Brian! 🙂
  
  Reply
david berneda
May 25, 2015 at 3:04 am

Thanks a lot Ray for your article !
I did a clustering library sometime ago, your article encourages me to try expanding it with more algorithms.
regards
david

Reply
1. Ray Li
  May 25, 2015 at 1:11 pm
  
  My pleasure, David.
  
  Reply
Pingback: Les liens de la semaine – Édition #133 | French Coding
Pingback: #1 Time Management is Key | Kenechi Learns Code
Martin Campbell
May 25, 2015 at 11:25 pm

This is a fantastic article and just what I needed as I start attempting to learn all this stuff. I’ll be shooting up the Kaggle rankings in no time (well, from 100,000 to 90,000 perhaps!).

Reply
1. Ray Li
  May 26, 2015 at 12:45 pm
  
  Appreciate it, Martin. I’m really happy to hear that it helps to get the ball rolling for you. Your increased Kaggle ranking would be nice icing on the cake! 🙂
  
  Reply
Yolande Tra
May 26, 2015 at 6:27 am

Excellent overview. You have a gift for putting complex topics into down-to-earth terms. Here is my comment: when using the data mining algorithms in this list (classifiers), I am more concerned about accuracy. We can try each one of these, but in the end we are interested in validation after training. Accuracy was only addressed with SVM and AdaBoost.

Reply
1. Ray Li
  May 26, 2015 at 12:50 pm
  
  Thank you for your kind words, Yolande.
  
  It’s a good point about the accuracy. I’ll definitely keep this in mind to explore accuracy in an upcoming post.
  
  Reply
Maksim Gayduk
May 26, 2015 at 8:43 am

I didn’t quite understand the part about C4.5 pruning.
In the link provided, it says that in order to decide whether to prune a tree or not, it calculates error rate of both pruned and unpruned tree and decides which one leads to the lower limit of confidence interval.
It should work okay for already pruned trees, but how does it start? Usually decision tree algorithms build the tree until it reaches entropy = 0, which means zero error rate, and zero upper limit for the confidence interval. In this case, such a tree can never be pruned using that logic…

Reply
1. Ray Li
  May 26, 2015 at 5:30 pm
  
  This is a great question, Maksim. It got me thinking a bunch, but unfortunately I don’t have an answer that I’m satisfied with.
  
  My investigation so far indicates that the error rate for the training data is distinct from the estimated error rate for the unseen data. As you pointed out, this is what the confidence interval is meant to bound. Based on the formula in the link, given f=0, I’m also at a loss on how a pruned tree could beat the unpruned tree.
  
  If you’re up for it, CrossValidated or StackOverflow might be an awesome place to get your question answered. You or I could even post a link here for reference.
  
  Reply
Pingback: No solutions for a simple predictive analytics challenge? | Decision Management Community
Ilan Sharfer
May 26, 2015 at 12:42 pm

Ray, thanks a lot for this really useful review. Some of the algorithms are
already familiar to me, others are new. So it surely helps to have them all in
one place.
As a practical application I’m interested in a data mining algorithm that can
be used in investment portfolio selection based on historical data, that is,
decide which stocks to invest in and make timely buy/sell orders. Can you
recommend a suitable algorithm?

Reply
1. Ray Li
  May 26, 2015 at 6:33 pm
  
  My pleasure, Ilan. Same here, I’ve come across a few of these algorithms before writing this article, and I had to teach myself the unfamiliar ones.
  
  I’m planning to go into more practical applications in an upcoming post. Stay tuned for that one… 🙂
  
  On a side note, you might already be aware of them, but the “random walk hypothesis” and “efficient-market hypothesis” might be of interest to you. They don’t answer your question, but they offer an alternate perspective on predicting future returns based on historical data.
  
  Reply
Zeeshan
May 26, 2015 at 7:59 pm

Awesome explanation!

Reply
1. Ray Li
  May 26, 2015 at 8:29 pm
  
  Much appreciated, Zeeshan.
  
  Reply
Lalit A Patel
May 26, 2015 at 11:09 pm

This is an excellent blog. It is helping me digest what I have studied elsewhere. Thanks a lot.

Reply
1. Ray Li
  May 28, 2015 at 8:07 am
  
  Thank you, Lalit. I’m happy to hear the blog is helping you with your studies.
  
  Reply
Phaneendra
May 28, 2015 at 1:40 am

Fantastic post, Ray. Nicely explained. It helped me enhance my understanding. Please keep sharing the knowledge 🙂 It helps.

Regards,
Phaneendra

Reply
1. Ray Li
  May 28, 2015 at 8:10 am
  
  Thanks, Phaneendra. More is definitely on the way… 🙂
  
  Reply
Adrian Cuyugan
May 28, 2015 at 7:01 am

These are very good and simple explanations. Thank you for sharing!

Reply
1. Ray Li
  May 28, 2015 at 8:11 am
  
  Appreciate it, Adrian.
  
  Reply
Pingback: BirdView (2) – Ranking Everything: an Overview of Link Analysis Using PageRank Algorithm | datawarrior
Peter Nour
May 28, 2015 at 4:50 pm

Thanks Ray! This is a fantastic post with great details and yet so simple to understand.

Cheers,

Peter

Reply
1. Ray Li
  May 28, 2015 at 8:39 pm
  
  Much appreciated, Peter. Glad you liked the post.
  
  Reply
Sanjoy
May 29, 2015 at 8:08 am

Awesome explanation of some of the oft-used data-mining algorithms.

Are you thinking of doing something similar for some of the other algorithms (Discriminant Analysis, Neural Networks, etc.) as well?

Would love to read your posts on them.

Thanks,
Sanjoy

Reply
1. Ray Li
  May 31, 2015 at 12:55 am
  
  Thanks, Sanjoy. Those are good ones. NNs are definitely at the top of the list.
  
  Reply
Suresh
May 29, 2015 at 11:12 am

Thanks Ray!! Awesome compilation and explanation. This truly helps me get started with learning and applying data science.

Reply
1. Ray Li
  May 31, 2015 at 12:56 am
  
  My pleasure, Suresh. I’m really happy to hear the post helped you start learning and applying.
  
  Reply
Pingback: June 2015 Items of Interest | Tidewater Analytics
Ulf
May 30, 2015 at 10:24 am

I’m afraid to be rather boring by having nothing to contribute than more of the well deserved praise to the quality of your article: thanks, really a great wrap-up and very good primer for the subject.
I shared the link to your post on the intranet of my company and rarely an article has received so many “likes” in no time.
The only thing I was missing was a bit more visual support. You have an excellent video embedded for SVM. But for many of the other concepts, rather straightforward visual representations are also possible (e.g. clustering, k-nearest-neighbour).
I found the book “Data Science for Business” (http://www.data-science-for-biz.com/) a VERY good start into the subject (…though I would have preferred to have read your article before, as it really wraps it up so well…). This book offers real inspiration as to how the underlying concepts of the algorithms you explain can be visualized and thus be made more intuitively understandable.
Enhancing your article with a bit more visual support would be the cherry on the icing on the cake 😉

Reply
1. Ray Li
  May 31, 2015 at 1:06 am
  
  Hi Ulf,
  
  Really appreciate your kind words and you sharing it with your colleagues. 🙂
  
  That’s a good point about visualizations… especially for visual learners. Like in the case of the SVM video, I found seeing it in action made it so much clearer.
  
  I definitely appreciate the book recommendation. From the sound of it, that book might be a fantastic reference not just for this article but for future articles covering this area.
  
  Thanks again,
  Ray
  
  Reply
Praveen G S
May 31, 2015 at 11:57 pm

Thanks for your wonderful post. I like the way you describe SVM, kNN and Bayes, since your language is so user friendly and easy to understand. Can you also write a blog post on some of the ensembles, like random forest, which is one of the most popular machine learning algorithms and has good predictive power compared to other algorithms?

Reply
1. Ray Li
  June 1, 2015 at 5:56 pm
  
  Thanks, Praveen. Those are good ones, and I’ll add them to my growing list of potential algorithms to dive into.
  
  Reply
Tom F
June 2, 2015 at 5:17 am

Fantastic article. Thanks.

One point:
>> What do the balls, table and stick represent? The balls represent data points, and the red and blue color represent 2 classes. The stick represents the simplest hyperplane which is a line.

The simplest (i.e. 1 dimensional) hyperplane is a point, not a line.

Reply
1. Ray Li
  June 2, 2015 at 1:32 pm
  
  Thanks, Tom. Good “point” about the simplest hyperplane. I’ve modified the sentence to read “The stick represents the hyperplane which in this case is a line.”
  
  Reply
Pingback: Guide to Data Science Competitions | Happy Endpoints
vdep
June 15, 2015 at 2:31 am

Hi Ray,
All the algorithms are explained in a simple and neat manner. It would be extremely useful for beginners as well as pros if you could come up with a “cheat sheet” explaining the best and worst scenarios for each algorithm (I mean, how to choose the best algorithm for given data).

Thank you

Reply
1. Ray Li
  June 15, 2015 at 12:14 pm
  
  Appreciate your kind words, vdep! Thanks also for your suggestion about the “cheat sheet.” 🙂
  
  Reply
Houssem
June 16, 2015 at 12:30 am

Hi Ray,
Thank you for your effort to explain such algorithms with such simplicity.
Good to start on data science !

Reply
1. Ray Li
  June 16, 2015 at 12:37 am
  
  My pleasure, Houssem!
  
  Reply
Pingback: Linkblog #6 | Ivan Yurchenko
Pingback: Web Picks (week of 1 June 2015) | DataMiningApps
Pingback: DB Weekly No.59 | ENUE Blog
Paris
September 11, 2015 at 2:45 am

Excellent simplified approach!

Reply
1. Ray Li
  September 11, 2015 at 9:37 am
  
  Thanks, Paris! Much appreciated… 🙂
  
  Reply
Pingback: Klicks #33: Vielmehr Überbleibsel - Ole Reißmann
Pingback: Very interesting explainer: Top 10 data mining algorithms in plain English rayli.net/blog/data/top-10-dat… (via @TheBrowser) | Stromabnehmer
Pingback: 机器学习(Machine Learning)&深度学习(Deep Learning)资料(Chapter 1) | ~ Code flavor ~
Pingback: Data Lab Link Roundup: python pivot tables, Hypothesis for testing, data mining algorithms in plain english and more… | Open Data Aha!
Pingback: Top 10 Data mining algorithm – C4.5 | Ken's Study Note
Pingback: Top 10 Data mining algorithm – k-means | Ken's Study Note
Pingback: Top 10 Data mining algorithm – kNN | Ken's Study Note
Pingback: How To Learn Everything About Machine Learning | Meanchey Center
Kurac
November 23, 2015 at 2:54 pm

The latest downloadable Orange data mining suite and its Associate add-on don’t seem to be using Apriori for enumerating frequent itemsets, but the FP-growth algorithm instead.

I must say it’s MUCH faster now. 😀

Reply
1. Ray Li
  November 29, 2015 at 2:34 pm
  
  Thanks, Kurac.
  
  Reply
Pingback: Simulando, visualizando ML, algoritmos, cheatsheet y conjuntos de datos: Lecturas para el fin de semana | To the mean!
Pingback: February 2016 Items of Interest | Tidewater Analytics
mounika
February 20, 2016 at 12:30 am

Is there any search algorithm in data mining? Please help me.

Reply
1. Ray Li
  February 20, 2016 at 2:06 pm
  
  Yes, even within the context of the 10 data mining algorithms, we are searching.
  
  The first 3 that come to mind are K-means, Apriori and PageRank.
  
  K-means groups similar data together. It’s essentially a way to search through the data and group together data that have similar attributes.
  
  Apriori attempts to search for relationships and patterns among a set of transactions.
  
  Finally, PageRank searches through a network in order to unearth the relative importance of an object in the network.
  
  Hope this helps!
  
  Reply
2. Ray Li
  February 20, 2016 at 2:12 pm
  
  However, if you’re looking for a search algorithm that finds specific item(s) that match certain attributes, these 10 data mining algorithms may not be a good fit.
  
  Reply
Jenny
March 1, 2016 at 9:37 pm

This article is so helpful!

I’ve always had trouble understanding the Naive Bayes and SVM algorithms.

Your article has done such a great job of explaining these two algorithms that I now have a much better understanding of them.

Thanks a lot! 🙂

Reply
1. Ray Li
  March 2, 2016 at 9:34 am
  
  Glad you found the article helpful, Jenny. Thanks for the kind words!
  
  Reply
Pingback: Spectroscopy and Chemometrics News Weekly #9, 2016 | NIR Calibration Model
Mikail
March 14, 2016 at 3:34 pm

Thank you!

Reply
David Millie
April 2, 2016 at 1:35 pm

very nice summary article … question – is the current implementation of Orange (still) using C4.5 as the classification tree algorithm … I cannot find any reference to it in the current documentation

Reply
1. Ray Li
  April 3, 2016 at 2:54 pm
  
  Thanks, David. This might help: http://orange.biolab.si/docs/latest/reference/rst/Orange.classification.tree.html.
  
  Orange includes multiple implementations of classification tree learners: a very flexible TreeLearner, a fast SimpleTreeLearner, and a C45Learner, which uses the C4.5 tree induction algorithm.
  
  Hope this helps!
  
  Reply
Mak
April 3, 2016 at 4:04 pm

Good job! 🙂 This is a great resource for a beginner like me.

Reply
1. Ray Li
  April 3, 2016 at 8:18 pm
  
  Thank you, Mak!
  
  Reply
Jermaine Allgood
April 12, 2016 at 10:31 am

THANK YOU!!!!!!! As a budding data scientist, this is really helpful. I appreciate it immensely!!!!!

Reply
1. Ray Li
  April 13, 2016 at 7:02 am
  
  Thanks, Jermaine! Good luck in your data scientist journey. 🙂
  
  Reply
Bruno Ferreira
April 23, 2016 at 1:57 pm

Thank you very much for this article.

This is by far the best page about the most used data-mining algorithms.
As a data-mining student, this was very helpful.

Reply
1. Ray Li
  April 24, 2016 at 3:28 pm
  
  My pleasure, Bruno. Thanks for the kind words!
  
  Reply
Paolo
May 2, 2016 at 7:27 am

Great article, Ray, top level, thank you so much!

This question could be a bit OT: which technique would you suggest for the analysis of biological networks? Classical graph theory measures, functional cartography (by Guimera & Amaral), entropy and clustering are already used with good results. PageRank on undirected networks provides similar results to betweenness centrality, so I am looking for innovative approaches to compare with the ones mentioned.

Thanks again!

Reply
1. Ray Li
  May 8, 2016 at 7:19 pm
  
  Thank you, Paolo. Really appreciate it!
  
  From the techniques you’ve already mentioned, it sounds like you’re already deep into the area of biological network analysis. Although I don’t have any new approaches to add (and am probably not as familiar with this area as you are), perhaps someone reading this thread could point us in the right direction.
  
  Reply
abdul
May 7, 2016 at 7:17 am

Wonderful list and even more wonderful explanations. Question though, you don’t think Random Forests merit a place on that list?

Cheers

Reply
1. Ray Li
  May 8, 2016 at 7:26 pm
  
  Thanks, Abdul! Random forests is a great one. However, the authors of the original 2007 paper describe how their analysis arrived at these top 10. If a similar analysis were done today, I’m sure random forest would be a strong contender.
  
  Reply
  1. Abdul
    May 11, 2016 at 6:05 am
    
    Ok. Fair enough
    Again, nice work
    
    Reply
Phil
May 8, 2016 at 9:41 pm

I did not read the whole article, but the description of the Apriori algorithm is incorrect.

It is said that there are three steps and that the second step is “Those itemsets that satisfy the support and confidence move onto the next round for 2-itemsets.”

This is incorrect; it is not how the Apriori algorithm works. The Apriori algorithm does NOT consider the confidence when generating itemsets. It only considers the confidence after finding the itemsets, when it is generating the rules.

In other words, the Apriori algorithm first finds the frequent itemsets by applying the three steps. Then it applies another algorithm for generating the rules from these itemsets. The confidence is only considered by the second algorithm. It is not considered during itemset generation.

Reply
Pingback: æœºå™¨å¦ä¹ (Machine Learning)&æ·±åº¦å¦ä¹ (Deep Learning)èµ„æ–™ | Dotteåšå®¢
Pingback: d204: Top 10 data mining algorithms explained in plain English [nd009 study materials] – AI
Pingback: Top 10 data mining algorithms in plain English | rayli.net – Unstable Contextuality Research
Aftab khan
January 28, 2017 at 4:29 am

Sir,
This information is very helpful for students like me. I was searching for an algorithm for my final year project in data mining. Now I can easily select an algorithm to start work on my final year project. Thanks!

Reply
Pingback: How to Become a Data Scientist | Springboard Blog
Kirk Paul Lafler
January 26, 2021 at 11:37 pm

Fantastic explanation of the top data mining algorithms. Thank you for sharing!

Reply
Sokolyk Petro
April 13, 2021 at 10:24 am

Thank you, Mr. Ray Li. Your explanation is much easier to understand for beginners.

Reply
