PageRank data mining algorithm

PageRank data mining algorithm in plain English

The PageRank data mining algorithm is part of a longer article about many more data mining algorithms.

What does it do? 

PageRank is a link analysis algorithm designed to determine the relative importance of some object linked within a network of objects.

Yikes... what's link analysis?

It's a type of network analysis looking to explore the associations (a.k.a. links) among objects.

Here's an example: 

The most prevalent example of PageRank is Google's search engine. Although their search engine doesn't solely rely on PageRank, it's one of the measures Google uses to determine a web page's importance.

Let me explain:

Web pages on the World Wide Web link to each other. If rayli.net links to a web page on CNN, a vote is added for the CNN page indicating rayli.net finds the CNN web page relevant.

And it doesn't stop there...

rayli.net's votes are in turn weighted by rayli.net's own importance and relevance. In other words, any web page that votes for rayli.net increases rayli.net's relevance, which makes rayli.net's votes count for more.
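For readers who like symbols, the weighting is often written as follows. This is the usual textbook form, not something spelled out in this article, and the symbol names are assumptions chosen for illustration:

$$
PR(p) = \frac{1 - d}{N} + d \sum_{q \in B_p} \frac{PR(q)}{L(q)}
$$

Here $PR(p)$ is the PageRank of page $p$, $N$ is the total number of pages, $B_p$ is the set of pages that link to $p$, $L(q)$ is the number of outgoing links on page $q$, and $d$ is a damping factor (0.85 is a commonly quoted value). Each page $q$ splits its own score evenly among the pages it votes for, which is why a vote from an important page is worth more.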

The bottom line?

This concept of voting and relevance is PageRank. rayli.net's vote for CNN increases CNN's PageRank, and the strength of rayli.net's PageRank influences how much its vote affects CNN's PageRank.
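To make the voting idea concrete, here is a minimal sketch in Python. It assumes the common power-iteration formulation with a damping factor of 0.85, and the tiny link graph (including the `blog-a.example` and `blog-b.example` pages) is made up for illustration. It is not how Google computes PageRank in production.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict mapping page -> list of pages it links to."""
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: the link structure is invented for this example.
toy_web = {
    "rayli.net": ["cnn.com"],
    "blog-a.example": ["rayli.net", "cnn.com"],
    "blog-b.example": ["rayli.net"],
    "cnn.com": [],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the two blogs vote for rayli.net, which lifts rayli.net's score, and rayli.net's vote for cnn.com then carries more weight than a vote from either blog would.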

What does a PageRank of 0, 1, 2, 3, etc. mean? 

Although the precise meaning of a PageRank number isn't disclosed by Google, we can get a sense of its relative meaning.

And here's how:

[Figure: PageRank table]

You see?

It's a bit like a popularity contest. We all have a sense of which websites are relevant and popular in our minds. PageRank is just an uber elegant way to define it.

What other applications are there of PageRank? 

PageRank was specifically designed for the World Wide Web.

Think about it:

At its core, PageRank is really just a super effective way to do link analysis. The objects being linked don't have to be web pages.

Here are 3 innovative applications of PageRank:

  1. Dr Stefano Allesina, from the University of Chicago, applied PageRank to ecology to determine which species are critical for sustaining ecosystems.
  2. Twitter developed WTF (Who-to-Follow), a personalized PageRank recommendation engine that suggests accounts to follow.
  3. Bin Jiang, from The Hong Kong Polytechnic University, used a variant of PageRank to predict human movement rates based on topographical metrics in London.
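To give a feel for the "personalized" flavor mentioned in the list above, here is a rough sketch. It assumes the standard personalized PageRank idea, where random jumps always return to a chosen set of seed nodes instead of landing anywhere in the graph. The follow graph, the user names and the `personalized_pagerank` helper are all hypothetical, and this is not Twitter's actual implementation.

```python
def personalized_pagerank(links, seeds, damping=0.85, iterations=100):
    """PageRank where random jumps land only on the seed nodes."""
    seeds = set(seeds)
    nodes = set(links) | seeds
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)

    # Jump distribution: uniform over the seeds, zero everywhere else.
    jump = {node: (1.0 / len(seeds) if node in seeds else 0.0) for node in nodes}
    rank = dict(jump)

    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) * jump[node] for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's vote evenly among the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links sends its vote back to the seeds.
                for other in nodes:
                    new_rank[other] += damping * rank[node] * jump[other]
        rank = new_rank
    return rank


# Hypothetical follow graph: user -> accounts that user follows.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": ["dave"],
    "frank": ["erin"],
}

scores = personalized_pagerank(follows, seeds={"alice"})

# Suggest accounts alice does not already follow, highest score first.
suggestions = sorted(
    (user for user in scores if user != "alice" and user not in follows["alice"]),
    key=lambda user: -scores[user],
)
print(suggestions)
```

Because every jump returns to alice, the scores measure importance from her point of view: accounts followed by the people she follows rise to the top, while accounts nobody in her neighborhood points to score nothing.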

Is this supervised or unsupervised? 

PageRank is generally considered an unsupervised learning approach, since it needs no labeled examples: it discovers the importance or relevance of a web page from the link structure alone.

Why use PageRank? 

Arguably, the main selling point of PageRank is its robustness: a relevant incoming link is difficult to get, which makes a page's rank hard to inflate artificially.

Simply stated:

If you have a graph or network and want to understand relative importance, priority, ranking or relevance, give PageRank a try.

Where is it used? 

The PageRank trademark is owned by Google. However, the PageRank algorithm is actually patented by Stanford University.

You might be wondering if you can use PageRank:

I'm not a lawyer, so best to check with an actual lawyer, but you can probably use the algorithm as long as it doesn't commercially compete against Google/Stanford.

Here are 3 implementations of PageRank:

  1. C++ OpenSource PageRank Implementation
  2. Python PageRank Implementation
  3. igraph — The network analysis package (R)

Check out how I used PageRank

Get a boost on data mining and see the next algorithm in the complete list...

About the Author

Ray Li

Ray is a software engineer and data enthusiast who has been blogging for over a decade. He loves to learn, teach and grow. You’ll usually find him wrangling data, programming and lifehacking.
