CS6200: Information Retrieval
Homework 1
Assigned: Thursday, 12 September 2013
Due: Email TAs with subject "CS6200 HW1" by Thursday, 19 September 2013, 6 p.m.
Instructions
- This assignment is due at the beginning of class on the due date
assigned above.
- If you collaborated with others, you must write down with whom
you worked on the assignment. If this changes from problem to
problem, then you should write down this information separately
with each problem.
- Submit to the TAs the requested written answers, your code, and
instructions on how to (compile and) run it.
Problems
- [10 points] Document filtering is an application that stores a large number
of queries or user profiles and compares these profiles to every
incoming document on a feed. Documents that are sufficiently similar
to the profile are forwarded to that person via email or some other
mechanism.
- Describe the components of a filtering engine using a block
diagram of the architecture, a flowchart of the filtering
process, and text explaining the function of the components.
Use the same level of detail that we gave in the second
lecture. For instance, don't just say that the filter needs
"text acquisition"; say that it needs "format conversion" and
"stemming", to name only one example.
- Explain the major differences compared to a search
engine. Consider issues such as specific efficiency problems and
the usefulness of ranking in a filtering application.
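To make the description of filtering above concrete, here is a minimal sketch of the matching step only, not a full architecture. It assumes profiles are stored as term-frequency vectors and compared to each incoming document by cosine similarity; the function names, the sample profiles, and the threshold of 0.3 are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(document, profiles, threshold=0.3):
    """Return the users whose profile similarity meets the (assumed) threshold."""
    doc_vec = Counter(tokenize(document))
    return sorted(
        user
        for user, profile_vec in profiles.items()
        if cosine(doc_vec, profile_vec) >= threshold
    )


# Hypothetical profiles, built once and reused for every incoming document.
profiles = {
    "alice": Counter(tokenize("web crawler robots politeness")),
    "bob": Counter(tokenize("stemming tokenization text processing")),
}

matches = route("A polite web crawler obeys robots exclusion rules", profiles)
print(matches)
```

Note that the roles are reversed relative to a search engine: the profiles are the long-lived, stored data, and each document is processed once as it arrives. The decision here is a yes/no threshold test rather than a ranked list.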
- [5 points] Use the GNU wget utility to crawl the CIS college site,
starting with the seed www.ccs.neu.edu.
- wget is installed on departmental machines and is included with
most Linux distributions.
- Generate a file of the first 100 unique links you find, restricting
the links to web pages and PDFs that are on this site.
- Respect robots.txt files and use a delay of 5 seconds between accesses.
- Provide the list of links and the wget command-line invocation
you used.
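The constraints above (unique links, on-site only, web pages and PDFs, robots.txt respected) can be illustrated with a short sketch. It is not a wget invocation and it fetches nothing; it only shows one possible reading of the filtering rules. The suffix list, the helper names, the sample URLs, and the robots.txt rules are all assumptions made for illustration.

```python
from urllib.parse import urldefrag, urlparse
from urllib.robotparser import RobotFileParser

SITE = "www.ccs.neu.edu"                   # seed host from the assignment
PAGE_SUFFIXES = (".html", ".htm", ".pdf")  # assumed meaning of "web pages and PDFs"


def keep(url):
    """True if the URL is on the site and looks like a web page or PDF."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or parts.hostname != SITE:
        return False
    path = parts.path.lower()
    last_segment = path.rsplit("/", 1)[-1]
    # Paths without a file extension (e.g. directories) are treated as pages.
    return "." not in last_segment or path.endswith(PAGE_SUFFIXES)


def first_unique(urls, robots, limit=100):
    """Collect up to `limit` distinct, allowed URLs in discovery order."""
    seen, result = set(), []
    for url in urls:
        url, _fragment = urldefrag(url)  # "#section" variants are the same page
        if url in seen or not keep(url) or not robots.can_fetch("*", url):
            continue
        seen.add(url)
        result.append(url)
        if len(result) == limit:
            break
    return result


# Hypothetical robots.txt rules, parsed from a list so no network is needed.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

found = [
    "http://www.ccs.neu.edu/",
    "http://www.ccs.neu.edu/home/index.html#top",
    "http://www.ccs.neu.edu/home/index.html",
    "http://www.ccs.neu.edu/private/notes.pdf",
    "http://www.ccs.neu.edu/logo.png",
    "http://www.northeastern.edu/",
    "http://www.ccs.neu.edu/papers/ir.pdf",
]
links = first_unique(found, robots)
print("\n".join(links))
```

A program that actually fetched pages would also need to pause for 5 seconds between requests, for example with `time.sleep(5)`, and would read the site's real robots.txt rather than the made-up rules used here.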
- [15 points] Implement your own crawler