CS6200: Information Retrieval

Project 3

Assigned: Thursday, 14 November 2013
Due: Email TAs with subject "CS6200 Project 3" by Tuesday, 10 December 2013, 6pm

Indexing

In this project, you will replicate the functionality of the Lemur index used in Project 2, and in conjuction with the code you created implementing various retrieval functions for Project 2, you will have created a fully functioning search engine.

The Project

Download the CACM collection from the Search Engines: Information Retrieval in Practice test collection site.
Create an index of the CACM collection, together with code replicating the functionality of the Lemur index used in Project 2.

Notes:
- From the "Search Engines" book site, you should use the .tar.gz version. The .corpus version is in a special format meant for use by the (book's) Galago search engine.
  
  You will note that the CACM "documents" are actually just abstracts of full articles, and, in many cases, not much more than titles. Before disk was cheap, many retrieval systems used no more than this.
  
  Finally, you can ignore the columns of numbers. This is an encoding of bibliographic references. Just index the text.
- When creating your index, you should first apply stop-wording using the stop-word list from Project 2.
- When creating your index, you should then apply stemming. You may do so using any reasonable stemmer, such as the Porter stemmer or the KStem stemmer, each of which are freely available on the web: Porter stemmer and KStem stemmer.
- Next, you should create an inverted index of the CACM collection documents, as described in class. The index will typically consist of multiple files: (1) a file that maps term names to term IDs and associated term information, such as inverted index offset and length values (see below) and corpus frequency statistics (2) the inverted index file that maps term IDs to document IDs and associated term frequencies, and (3) a file that maps document IDs to document names and associated document information, such as document lengths.
  
  The inverted index file constitutes the bulk of the index. For simplicity, you can build up to this file in stages:
  1. As you process documents, maintain a separate file per unique term, adding document information to these files as you go.
  2. Concatenate these files into one inverted index file and add the appropriate inverted index offset and length values to the term information file.
- Finally, create code that, given a specified term and the index files above, replicates the functionality of the Lemur index used in Project 2.
Now, using the index you just built and your code from Project 2, perform retrieval experiments on all CACM queries using all five retrieval algorithms from Project 2. Record and report mean average precision and mean precision at cutoff 10 and 30 results, as you did for Project 2. You should, of course, use the queries and qrel file that come with the CACM collection. Note that you should use the "raw" queries, not the "processed" ones. This will allow you to stopword and stem your queries in exactly the same way you did the documents when indexing.

As with Project 2, experiment with various retrieval model parameters, such as the smoothing parameter, and compare and contrast your results here with those from Project 2: Do the same retrieval models work best? Do the optimal parameters change? And so on.

New: Extra Credit

For a maximum of 50 extra points, consider the following:

Many modern search engines end up indexing even stop words. Disk is cheap! But what are the tradeoffs? For extra credit, analyze the empirical time and space complexity of including stopword information in the index:

First, build an index without removing stopwords. How much larger is the index? How long did it take to create, compared to the smaller stopword-free index? If you chose to use compression in the index, what did you do?
Second, run the same queries against the same models as you did above but still remove stopwords. In other words, what effect does the bigger index have on runtime for the same set of queries? Since the collection is relatively small, I suggest running the queries 100 times to get accurate measurements.
Third, run the same queries but without removing stopwords. What happens to the query processing time and effectiveness?

Note that you should still stem the document and query terms. Points will be assigned for a clear description of the approach and presentation of the results.

What to Submit

The main assignment is worth 150 points. The extra credit portion is worth at most 50 extra points (for a maximum total of 200).

Submit a report describing your system and the results and analyses requested above.
Submit a copy of your code and instructions on how to run it.