The Apriori data mining algorithm is part of a longer article about many more data mining algorithms.
What does it do?
The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions.
What are association rules?
Association rule learning is a data mining technique for learning correlations and relations among variables in a database.
What's an example of Apriori?
Let's say we have a database full of supermarket transactions. You can think of a database as a giant spreadsheet where each row is a customer transaction and every column represents a different grocery item.
Here's the best part:
By applying the Apriori algorithm, we can learn which grocery items are purchased together, a.k.a. association rules.
The power of this is:
You can find those items that tend to be purchased together more frequently than other items -- the ultimate goal being to get shoppers to buy more. Together, these items are called itemsets.
Suppose a handful of transactions look like this: {chips, dip}, {chips, soda}, {chips, dip, soda}, {bread, milk}. You can probably quickly see that chips + dip and chips + soda seem to frequently occur together. These are called 2-itemsets. With a large enough dataset, it will be much harder to "see" the relationships, especially when you're dealing with 3-itemsets or more. That's precisely what Apriori helps with!
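To make this concrete, here's a minimal sketch that counts 2-itemsets across a handful of made-up supermarket transactions (the items and baskets are purely illustrative):

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions, made up for illustration
transactions = [
    {"chips", "dip"},
    {"chips", "soda"},
    {"chips", "dip", "soda"},
    {"bread", "milk"},
]

# Count every 2-itemset (pair of items) across all transactions
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
# ('chips', 'dip') and ('chips', 'soda') each occur twice,
# every other pair only once
```

Even in four transactions the chips + dip and chips + soda pairs stand out; Apriori automates this kind of counting at scale.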
You might be wondering how Apriori works.
Before getting into the nitty-gritty of the algorithm, you'll need to define 3 things:
- The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?
- The second is your support, or the number of transactions containing the itemset divided by the total number of transactions. An itemset whose support meets your minimum threshold is called a frequent itemset.
- The third is your confidence, or the conditional probability of some item given the other items in your itemset. A good example: given chips in your itemset, there is a 67% confidence of also having soda in the itemset.
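Support and confidence can be sketched in a few lines of Python. The transactions below are made up, and chosen so that the chips → soda rule comes out at roughly the 67% confidence mentioned above:

```python
# Hypothetical transactions, made up for illustration
transactions = [
    {"chips", "dip"},
    {"chips", "soda"},
    {"chips", "dip", "soda"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"chips", "soda"}))       # 2 of 4 transactions -> 0.5
print(confidence({"chips"}, {"soda"}))  # 2 of 3 chips baskets -> ~0.67
```

Note that support is a property of an itemset, while confidence is a property of a rule (antecedent → consequent); that distinction matters in the algorithm below.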
The basic Apriori algorithm is a 3-step approach:
- Join. Scan the whole database for how frequent the 1-itemsets are, then join the frequent itemsets to generate candidate itemsets one size larger.
- Prune. Those candidate itemsets that satisfy the minimum support move on to the next round. (Confidence comes into play later, when generating rules from the frequent itemsets.)
- Repeat. This is repeated for each itemset level until we reach our previously defined size.
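The three steps above can be sketched as a small loop. This is a bare-bones illustration, not an optimized implementation; the transactions and the 0.5 support threshold are made up:

```python
# Hypothetical transactions, made up for illustration
transactions = [
    {"chips", "dip"},
    {"chips", "soda"},
    {"chips", "dip", "soda"},
    {"bread", "milk"},
]
min_support = 0.5  # itemset must appear in at least half the transactions
max_size = 3       # the itemset size we defined up front

def support(itemset):
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)

# Frequent 1-itemsets: scan the whole database once
items = {item for basket in transactions for item in basket}
frequent = [{frozenset([i]) for i in items if support({i}) >= min_support}]

# Join and prune, one itemset level at a time
for k in range(2, max_size + 1):
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}  # join
    level = {c for c in candidates if support(c) >= min_support}         # prune
    if not level:
        break
    frequent.append(level)

for level in frequent:
    print([sorted(itemset) for itemset in level])
```

With this data the loop finds {chips}, {dip}, and {soda} as frequent 1-itemsets, then {chips, dip} and {chips, soda} as frequent 2-itemsets, and stops because no 3-itemset meets the support threshold.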
Is this supervised or unsupervised?
Apriori is generally considered an unsupervised learning approach, since it's often used to discover or mine for interesting patterns and relationships.
But wait, there's more...
Apriori can also be modified to do classification based on labelled data.
Why use Apriori?
Apriori is well understood, easy to implement, and has many derivatives.
On the other hand...
The algorithm can be quite memory-, space- and time-intensive when generating itemsets.
Where is it used?
Implementations of Apriori are available in most data mining toolkits, including Weka, R's arules package, and Python's mlxtend library.
The next algorithm was the most difficult for me to understand, so keep reading...