IS4200/CS6200: Information Retrieval

Project 1

Return to basic course information.

Assigned: Thursday, 27 September 2012
Due: Friday, 12 October 2012, 6pm

Students have to submit a hard-copy / paper copy of their report including the details wherever 'To hand in: ' is present in the project problem statement. You can hand it in at Thursday's lecture, or at Madan's office hours on Friday 1-3 and 5:30-6.

The only part that should be emailed is the source code in form of zip file. The zip formats are already mentioned in the first post of 'Submission Guidelines' on Piazza. TAs will schedule a 5 minute demo wherein you may be asked specific questions about your implementation.


PageRank

In this project, you will compute PageRank on a collection of 183,811 web documents. Consider the version of PageRank described in class. PageRank can be computed iteratively as show in the following pseudocode:
// P is the set of all pages; |P| = N
// S is the set of sink nodes, i.e., pages that have no out links
// M(p) is the set of pages that link to page p
// L(q) is the number of out-links from page q
// d is the PageRank damping/teleportation factor; use d = 0.85 as is typical

foreach page p in P
  PR(p) = 1/N                          /* initial value */

while PageRank has not converged do
  sinkPR = 0
  foreach page p in S                  /* calculate total sink PR */
    sinkPR += PR(p)
  foreach page p in P
    newPR(p) = (1-d)/N                 /* teleportation */
    newPR(p) += d*sinkPR/N             /* spread remaining sink PR evenly */
    foreach page q in M(p)             /* pages pointing to p */
      newPR(p) += d*PR(q)/L(q)         /* add share of PageRank from in-links */
  foreach page p
    PR(p) = newPR(p)

return PR

In order to facilitate the computation of PageRank using the above pseudocode, one would ideally have access to an in-link respresentation of the web graph, i.e., for each page p, a list of the pages q that link to p.

Consider the following directed graph:

graph

We can represent this graph as follows:

A D E F
B A F
C A B D
D B C
E B C D F
F A B D
where the first line indicates that page A is linked from pages D, E, and F, and so on. Note that, unlike this example, in a real web graph, not every page will have in-links, nor will every page have out-links.

The Project

What to Submit

In addition to the items mentioned above, you should hand in a copy of your code, which should hopefully be relatively short. (The pseudocode for the iterative PageRank algorithm is 10 lines long, as shown above.)