CS6200: Information Retrieval

Homework 1

Assigned: Thursday, September 10
Due: Wednesday, September 23, 11:59 p.m.


Instructions

  1. If you collaborated with others, you must list the people you worked with on the assignment. If your collaborators differ from problem to problem, list them separately for each problem.
  2. Submit to the TAs the requested written answers, your code, and instructions on how to (compile and) run the code.

Focused Crawling

Implement your own web crawler, with the following properties:

Hand in your code and instructions on how to (compile and) run it. In addition, hand in two lists of URLs, each with at most 1000 entries:

  1. the pages crawled when the crawler is run with no keyphrase, that is, all Wikipedia pages meeting the requirements above to a depth of 5 from the starting seed; and
  2. the pages crawled when the keyphrase is ‘concordance’. (If you already did the crawl with ‘index’, that's OK, too.)
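
As a rough illustration of how the two runs differ, here is a minimal Python sketch of a breadth-first crawl with a depth limit, a page cap, and an optional keyphrase filter. It is only a sketch under stated assumptions, not the required implementation: the names `focused_crawl`, `get_text`, `get_links`, and the toy `PAGES` graph are hypothetical, the seed is counted as depth 1, and links are followed only from pages that match the keyphrase. It uses an in-memory graph instead of real HTTP requests, and it does not encode the crawler properties the assignment requires.

```python
from collections import deque


def focused_crawl(seed, get_text, get_links, keyphrase=None,
                  max_depth=5, max_pages=1000):
    """Breadth-first crawl from `seed`; return the URLs that were kept.

    Assumptions (not taken from the assignment text):
    - the seed counts as depth 1;
    - a page is kept if `keyphrase` is None or occurs in its text,
      ignoring case;
    - links are followed only from pages that were kept.
    """
    seen = {seed}               # URLs already queued, to avoid revisits
    queue = deque([(seed, 1)])  # FIFO queue gives breadth-first order
    kept = []
    while queue and len(kept) < max_pages:
        url, depth = queue.popleft()
        text = get_text(url)
        if keyphrase is not None and keyphrase.lower() not in text.lower():
            continue            # page does not match: skip it and its links
        kept.append(url)
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return kept


# Toy stand-in for the web: URL -> (page text, outgoing links).
PAGES = {
    "A": ("Index and concordance", ["B", "C"]),
    "B": ("A concordance lists words", ["D"]),
    "C": ("Unrelated page", ["D"]),
    "D": ("Concordance and index", []),
}


def toy_text(url):
    return PAGES[url][0]


def toy_links(url):
    return PAGES[url][1]


unfocused = focused_crawl("A", toy_text, toy_links)
focused = focused_crawl("A", toy_text, toy_links, keyphrase="concordance")
print(unfocused)
print(focused)
```

In a real crawler, `get_text` and `get_links` would fetch and parse pages, and the filtering and politeness rules would follow the properties listed in the assignment.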

Also, tell us what proportion of the total pages were retrieved by the focused crawler for ‘concordance’. Keep in mind that this will be a significant overestimate of the prevalence of Wikipedia articles on indexing and information retrieval.
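
The proportion itself is a simple ratio. The sketch below assumes one possible reading, in which the "total pages" are the URLs in the unfocused list and the retrieved pages are the URLs in the focused list; the function name `retrieved_proportion` and the sample URLs are hypothetical. If you count the total differently (for example, all pages the focused crawler fetched), pass that list instead.

```python
def retrieved_proportion(focused_urls, total_urls):
    """Return the fraction of distinct total URLs that the focused crawl kept.

    Assumption: `total_urls` is whatever set of pages you treat as the total,
    for example the list from the run with no keyphrase.
    """
    total = set(total_urls)
    if not total:
        raise ValueError("total_urls must contain at least one URL")
    # Duplicates are ignored so that each page is counted once.
    return len(set(focused_urls)) / len(total)


example = retrieved_proportion(
    ["page1", "page2", "page4"],
    ["page1", "page2", "page3", "page4"],
)
print(f"{example:.2%}")
```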