CS6200: Information Retrieval
Assigned: Thursday, September 10
Due: Wednesday, September 23, 11:59 p.m.
- If you collaborated with others, you must write down with whom
you worked on the assignment. If this changes from problem to
problem, then you should write down this information separately
with each problem.
- Submit the requested written answers, code, and instructions to
the TAs on how to (compile and) run the code.
Implement your own web crawler, with the following properties:
- Be polite and use a delay of at least one second between
requests to the web server.
- You should start from the seed: the Wikipedia article on Hugh of
Saint-Cher, one of the originators of a key information retrieval
technology.
- You should only follow links with the prefix
http://en.wikipedia.org/wiki/. In other words, do not follow links
to non-English articles or to pages outside Wikipedia.
- Do not follow links with a colon (:) in the rest of the URL.
This will help filter out Wikipedia help and administration pages.
- Do not follow links to the main page.
- You may use existing libraries to request documents over HTTP,
including following redirects that handle alternate article
titles in Wikipedia.
- Otherwise, you should implement your own code to extract
links, keep track of what you've crawled, and decide what to
crawl next.
- Crawl to at most depth 5 from the seed page. In other words,
you should retrieve the seed page, pages it links to, the pages
those pages link to, the pages they link to, and the pages they
link to. The seed page is thus not depth 0, but depth 1.
- Wikipedia has a lot of links, so you should also stop when you
reach 1000 unique URLs.
- Your crawler should take two arguments: the seed page and an
optional "keyphrase" that must be present, in any combination of
upper and lower case, on any page you crawl (after the seed).
Don't worry about tokenization: just match the characters
ignoring case. If the keyphrase is not present, stop crawling.
This is a very simple version of focused crawling, where the
presence or absence of a single feature is used to determine
whether a document is relevant.
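The requirements above can be sketched as one short Python program.
The filtering rules, the breadth-first depth limit, the 1000-URL cap,
the one-second politeness delay, and the case-insensitive keyphrase
check all come from the assignment; the function names, the
regex-based link extraction, and the exact "Main_Page" test are
illustrative choices, not a required design.

```python
import re
import sys
import time
import urllib.parse
import urllib.request
from collections import deque

PREFIX = "http://en.wikipedia.org/wiki/"

def should_follow(url):
    """Apply the assignment's link filters: English-Wikipedia article
    prefix, no colon in the rest of the URL (help/administration
    pages), and not the main page."""
    if not url.startswith(PREFIX):
        return False
    rest = url[len(PREFIX):]
    if ":" in rest:
        return False
    if rest == "Main_Page":  # one plausible "main page" check
        return False
    return True

def extract_links(html, base_url):
    """Pull href values out of anchor tags with a simple regex and
    resolve them against the page's URL. (A real crawler might prefer
    an HTML parser, but the assignment asks you to extract links
    yourself.)"""
    return [urllib.parse.urljoin(base_url, m.group(1))
            for m in re.finditer(r'<a[^>]+href="([^"#]+)"', html)]

def crawl(seed, keyphrase=None, max_depth=5, max_urls=1000):
    """Breadth-first crawl from `seed`, at most `max_depth` levels deep
    (the seed itself counting as depth 1). The 1000-URL stopping rule
    is interpreted here as 1000 pages kept. If `keyphrase` is given,
    pages after the seed that lack it (ignoring case) are neither kept
    nor expanded."""
    seen = {seed}
    frontier = deque([(seed, 1)])
    crawled = []
    while frontier and len(crawled) < max_urls:
        url, depth = frontier.popleft()
        try:
            # urlopen follows redirects, handling alternate article titles
            with urllib.request.urlopen(url) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        time.sleep(1)  # politeness: at least one second between requests
        if keyphrase and depth > 1 and keyphrase.lower() not in html.lower():
            continue  # focused crawling: page is not relevant
        crawled.append(url)
        if depth < max_depth:
            for link in extract_links(html, url):
                if link not in seen and should_follow(link):
                    seen.add(link)
                    frontier.append((link, depth + 1))
    return crawled

if __name__ == "__main__" and len(sys.argv) > 1:
    for u in crawl(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else None):
        print(u)
```

Run once with only the seed URL and once with the keyphrase as a
second argument, redirecting the printed URLs to the two files you
hand in.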
Hand in your code and instructions on how to (compile and) run it.
In addition, hand in two lists of URLs, each with at most 1000
entries:
- the pages crawled when the crawler is run with no
keyphrase, in other words all Wikipedia pages meeting the
requirements above to a depth of 5 from the starting seed;
- the pages crawled when the keyphrase is
‘concordance’. (If you already did the crawl with
‘index’, that's OK, too.)
Also, tell us what proportion of the total pages were retrieved by
the focused crawler for ‘concordance’. Keep in mind
that this will be a significant overestimate of the
prevalence of Wikipedia articles on indexing and information
retrieval.