IS4200/CS6200 Information Retrieval

Homework 1: Retrieval Models
Objective

- Introduction to elasticsearch, one of the many available commercial-grade indexes. These instructions will generally not spell out how to accomplish various tasks in elasticsearch; instead, you are encouraged to read the online documentation to familiarize yourself with this tool. Feel free to ask for help on Piazza or during office hours if you encounter any difficulties.
- Implement and compare various retrieval systems using vector space models and language models. Explain how and why their performance differs.
This assignment involves writing two main programs:

- A program to parse the corpus and index it with elasticsearch
- A query processor, which runs queries from an input file using a selected retrieval model
Task 1: Elasticsearch

Task 2: Document Indexing
Create an index of the downloaded corpus. The documents are found in the ap89_collection folder in the data .zip file. You will need to write a program to parse the documents and send them to your elasticsearch instance.

The corpus files are in a standard format used by TREC. Each file contains multiple documents. The format is similar to XML, but standard XML and HTML parsers will not work correctly. Instead, read each file one line at a time and apply the following rules:
					
- Each document begins with a line containing <DOC> and ends with a line containing </DOC>.
- The first several lines of a document's record contain various metadata. You should read the <DOCNO> field and use it as the ID of the document.
- The document contents are between lines containing <TEXT> and </TEXT>.
- All other file contents can be ignored.
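A minimal parsing-and-indexing sketch in Python, assuming the official elasticsearch client (8.x-style keyword arguments), a local instance, and hypothetical names for the index ("ap89"), the text field ("text"), and the corpus folder; adapt these to your own setup:

    import glob
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch("http://localhost:9200")   # adjust host/auth for your instance
    INDEX = "ap89"                                # hypothetical index name

    def parse_trec_file(path):
        """Yield (docno, text) pairs from one TREC-formatted corpus file."""
        docno, text_lines, in_text = None, [], False
        with open(path, errors="ignore") as f:
            for line in f:
                line = line.strip()
                if line.startswith("<DOC>"):
                    docno, text_lines, in_text = None, [], False
                elif line.startswith("<DOCNO>"):
                    docno = line.replace("<DOCNO>", "").replace("</DOCNO>", "").strip()
                elif line.startswith("<TEXT>"):
                    in_text = True
                elif line.startswith("</TEXT>"):
                    in_text = False
                elif line.startswith("</DOC>"):
                    yield docno, " ".join(text_lines)
                elif in_text:
                    text_lines.append(line)          # keep only lines inside <TEXT>...</TEXT>

    def actions():
        for path in glob.glob("ap89_collection/*"):  # hypothetical corpus location
            for docno, text in parse_trec_file(path):
                yield {"_index": INDEX, "_id": docno, "_source": {"text": text}}

    bulk(es, actions())                              # bulk-index all parsed documents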
Use the elasticsearch API to retrieve values such as term frequency and term positions within a document. You will need these values to score documents using the retrieval models listed below.
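For instance, the termvectors API returns per-term statistics (TF, positions, and optionally DF) for a single document. A small sketch, reusing the hypothetical es, INDEX, and "text" field from the indexing sketch above; the document ID is only an example:

    resp = es.termvectors(index=INDEX, id="AP890101-0001",   # example document ID
                          fields=["text"],
                          term_statistics=True, positions=True)
    terms = resp["term_vectors"]["text"]["terms"]
    for term, info in terms.items():
        # term_freq = tf in this document; doc_freq = df (from term_statistics);
        # tokens carry the position of each occurrence
        print(term, info["term_freq"], info.get("doc_freq"),
              [t["position"] for t in info.get("tokens", [])])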
					  
Task 3: Query Execution
Write a program to run the queries in the file query_desc.51-100.short.txt, included in the data .zip file. You should run all queries (omitting the leading number) using each of the retrieval models listed below, and output the top 1000 results for each query to an output file. If a particular query has fewer than 1000 documents with a nonzero matching score, then just list whichever documents have nonzero scores.

You should write exactly one output file per retrieval model. Each line of an output file should specify one retrieved document, in the following format:
					<query-number> Q0 <docno> <rank> <score> Exp
	
Where:

- <query-number> is the number preceding the query in the query list
- <docno> is the document number, from the <DOCNO> field (which we asked you to index)
- <rank> is the document rank: an integer from 1-1000
- <score> is the retrieval model's matching score for the document
- Q0 and Exp are entered literally
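A small sketch of the output step, assuming results is a list of (docno, score) pairs already sorted by descending score and query_number is taken from the query file; the file name is arbitrary:

    # One output file per retrieval model; one line per retrieved document.
    # Opened in append mode so all queries for a model go to the same file.
    with open("okapi_tf.txt", "a") as out:
        for rank, (docno, score) in enumerate(results[:1000], start=1):
            out.write(f"{query_number} Q0 {docno} {rank} {score} Exp\n")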
Your program will run queries against elasticsearch. Instead of using its built-in query engine, we will retrieve information such as TF and DF scores from elasticsearch and implement our own retrieval models to rank documents. It will be helpful to write a method which takes a term as a parameter and retrieves the postings for that term from elasticsearch. You can then easily reuse this method to implement the retrieval models.
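One possible shape for such a helper, again assuming the hypothetical es, INDEX, and "text" field from Task 2: find the documents that contain an (analyzed) term, then read the term's TF in each from termvectors. Note that the term you pass must match the token produced by the analyzer, and you will likely want to cache the results:

    def postings(term, max_docs=10000):
        """Return {docno: tf} for documents containing `term`."""
        hits = es.search(index=INDEX,
                         query={"match": {"text": term}},
                         size=max_docs)["hits"]["hits"]
        result = {}
        for hit in hits:
            tv = es.termvectors(index=INDEX, id=hit["_id"], fields=["text"])
            terms = tv["term_vectors"].get("text", {}).get("terms", {})
            if term in terms:
                result[hit["_id"]] = terms[term]["term_freq"]
        return result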
	
					Implement the following retrieval models, using TF
					  and DF scores from your elasticsearch index, as
					  needed:
				
ES built-in

Use an ES "match" query, e.g. {"match": {"body_text": "query keywords"}}. This should be somewhat similar to BM25 scoring.
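A sketch of this run with the Python client; the field name here is "text" to match the earlier sketches (the snippet above uses "body_text"; use whatever field name you actually indexed):

    resp = es.search(index=INDEX,
                     query={"match": {"text": "query keywords"}},
                     size=1000)
    for rank, hit in enumerate(resp["hits"]["hits"], start=1):
        print(hit["_id"], rank, hit["_score"])   # docno, rank, ES relevance score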
								
	
					
					
Okapi TF

This is a vector space model using a slightly modified version of TF to score documents. The Okapi TF score for term $w$ in document $d$ is as follows:

$$ okapi\_tf(w, d) = \frac{tf_{w,d}}{tf_{w,d} + 0.5 + 1.5 \cdot (len(d) / avg(len(d)))} $$
					  
Where:

- $tf_{w,d}$ is the term frequency of term $w$ in document $d$
- $len(d)$ is the length of document $d$
- $avg(len(d))$ is the average document length over the entire corpus

The matching score for document $d$ and query $q$ is as follows:

$$ tf(d, q) = \sum_{w \in q} okapi\_tf(w, d) $$
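A direct translation of these two formulas into Python, assuming you have already gathered doc_tfs (a {term: tf} map for the document), the document length, and the corpus-wide average document length:

    def okapi_tf(tf, doc_len, avg_len):
        return tf / (tf + 0.5 + 1.5 * (doc_len / avg_len))

    def okapi_tf_score(query_terms, doc_tfs, doc_len, avg_len):
        # Sum okapi_tf over the query terms; terms absent from the document contribute 0.
        return sum(okapi_tf(doc_tfs.get(w, 0), doc_len, avg_len) for w in query_terms)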
TF-IDF

This is the second vector space model. The scoring function is as follows:

$$ tfidf(d, q) = \sum_{w \in q} okapi\_tf(w, d) \cdot \log \frac{D}{df_w} $$

Where:

- $D$ is the total number of documents in the corpus
- $df_w$ is the number of documents which contain term $w$
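A sketch building on the okapi_tf function above, with dfs a hypothetical {term: df} map and D the total document count (available, for example, from es.count):

    import math

    def tfidf_score(query_terms, doc_tfs, doc_len, avg_len, dfs, D):
        return sum(okapi_tf(doc_tfs.get(w, 0), doc_len, avg_len) * math.log(D / dfs[w])
                   for w in query_terms if dfs.get(w))   # skip terms absent from the corpus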
Okapi BM25

BM25 is a probabilistic retrieval model based on the binary independence model. Its matching score is as follows:

$$ bm25(d, q) = \sum_{w \in q} \left[ \log\left( \frac{D + 0.5}{df_w + 0.5} \right) \cdot \frac{tf_{w,d} + k_1 \cdot tf_{w,d}}{tf_{w,d} + k_1 \left((1-b) + b \cdot \frac{len(d)}{avg(len(d))}\right)} \cdot \frac{tf_{w,q} + k_2 \cdot tf_{w,q}}{tf_{w,q} + k_2} \right] $$

Where:

- $tf_{w,q}$ is the term frequency of term $w$ in query $q$
- $k_1$, $k_2$, and $b$ are constants you choose; typical starting points are $k_1$ around 1.2, $b$ around 0.75, and $k_2$ anywhere from 0 to 1000
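A sketch of this formula, reusing the same hypothetical doc_tfs/dfs/D inputs as above; the constant values shown are common defaults, not values prescribed by the assignment:

    import math
    from collections import Counter

    def bm25_score(query_terms, doc_tfs, doc_len, avg_len, dfs, D,
                   k1=1.2, k2=100.0, b=0.75):
        q_tfs = Counter(query_terms)                     # tf_{w,q}
        score = 0.0
        for w in set(query_terms):
            tf, df, qtf = doc_tfs.get(w, 0), dfs.get(w, 0), q_tfs[w]
            idf = math.log((D + 0.5) / (df + 0.5))
            doc_part = (tf + k1 * tf) / (tf + k1 * ((1 - b) + b * doc_len / avg_len))
            query_part = (qtf + k2 * qtf) / (qtf + k2)
            score += idf * doc_part * query_part
        return score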
			
Unigram LM with Laplace smoothing

This is a language model with Laplace ("add-one") smoothing. We will use maximum likelihood estimates of the query based on a multinomial model "trained" on the document. The matching score is as follows:

$$ lm\_laplace(d, q) = \sum_{w \in q} \log p\_laplace(w|d) $$

$$ p\_laplace(w|d) = \frac{tf_{w,d} + 1}{len(d) + V} $$
	
Where:

- $V$ is the vocabulary size: the total number of unique terms in the corpus
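A sketch under the same assumptions as the earlier scoring functions, with V the corpus vocabulary size:

    import math

    def lm_laplace_score(query_terms, doc_tfs, doc_len, V):
        return sum(math.log((doc_tfs.get(w, 0) + 1) / (doc_len + V))
                   for w in query_terms)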
			 
Unigram LM with Jelinek-Mercer smoothing

This is a similar language model, except that here we smooth a foreground document language model with a background model from the entire corpus.

$$ lm\_jm(d, q) = \sum_{w \in q} \log p\_jm(w|d) $$

$$ p\_jm(w|d) = \lambda \frac{tf_{w,d}}{len(d)} + (1 - \lambda) \frac{\sum_{d'} tf_{w,d'}}{\sum_{d'} len(d')} $$
	
Where:

- $\lambda \in (0, 1)$ is a smoothing parameter which specifies the mixture of the foreground (document) and background (corpus) distributions, with the background sums running over all documents $d'$ in the corpus
Think carefully about how to efficiently obtain the
					  background model here. If you wish, you can instead
					  estimate the corpus probability using 
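A sketch under the same assumptions, where cf is a hypothetical map from each term to its total frequency across the corpus and total_len is the sum of all document lengths; the value of lambda is only an example:

    import math

    def lm_jm_score(query_terms, doc_tfs, doc_len, cf, total_len, lam=0.7):
        score = 0.0
        for w in query_terms:
            p = lam * doc_tfs.get(w, 0) / doc_len + (1 - lam) * cf.get(w, 0) / total_len
            if p > 0:                     # skip terms never seen anywhere in the corpus
                score += math.log(p)
        return score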
				
			
Task 4: Evaluation

A) Manually compare the top 10 documents returned by ES built-in, TF-IDF, BM25, and LM Laplace for the 5 queries specified by the TAs. Explain or speculate on the reasons for differences in the rankings.

B) Download trec_eval and use it to evaluate your results for each retrieval model. To perform an evaluation, run:

trec_eval [-q] qrel_file results_file
	
The -q option shows a summary average evaluation across all queries, followed by individual evaluation results for each query; without the -q option, you will see only the summary average. The trec_eval program provides a wealth of statistics about how well your results file did for those queries, including average precision, precision at various recall cut-offs, and so on.

You should evaluate using the QREL file named qrels.adhoc.51-100.AP89.txt, included in the data .zip file.
Create a document showing the following metrics:

- The uninterpolated mean average precision numbers for all six retrieval models.
- Precision at 10 and precision at 30 documents for all six retrieval models.

Be prepared to answer questions during your demo as to how model performance compares, why the best model outperformed the others, and so on.
Task 5: Pseudo-relevance Feedback (MS students only)
Implement pseudo-relevance feedback. The general algorithm is:

- Retrieve the top k documents for the query.
- Identify terms in these documents which are distinctive to the documents.
- Add the terms to the query, and re-run the retrieval process. Return the final results.
It is up to you to devise a reasonable way to choose
				  terms to add to the query. It doesn’t have to be
				  complicated, but you should be prepared to explain and
				  justify your approach.
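One possible (by no means required) way to pick expansion terms, sketched here: pool the term frequencies of the top-k documents, weight each candidate by its frequency in the pool times its IDF, and keep the best few terms that are not stopwords and not already in the query. doc_term_vector is a hypothetical helper returning a {term: tf} map for one document (for example, built on termvectors), and dfs, D, and stopwords are assumed to be available:

    import math
    from collections import Counter

    def expansion_terms(query_terms, top_docnos, dfs, D, stopwords, n_terms=5):
        pool = Counter()
        for docno in top_docnos:
            pool.update(doc_term_vector(docno))       # hypothetical {term: tf} helper
        scored = {t: tf * math.log(D / dfs[t])
                  for t, tf in pool.items()
                  if t not in stopwords and t not in query_terms and dfs.get(t)}
        return sorted(scored, key=scored.get, reverse=True)[:n_terms]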
				Evaluate your results using trec_eval and include
				  similar metrics with your submission.
				
				
					  
Pseudo-relevance Feedback using ElasticSearch aggs "significant terms"

Use the ES "significant_terms" aggregation separately on each query term (stem root) to get a list of related words. The words you want to add to the query are:

- related to more than one query term
- not stopwords
- high IDF
- selected with any other tricks you might need in order to only get interesting words

Add a few of these words to the query and re-run your models.
					  
Below is an example of this API in Sense for the query term "atom":

GET /ap_dataset/document/_search
{
    "query": {
        "terms": { "TEXT": [ "atom" ] }
    },
    "aggregations": {
        "significantCrimeTypes": {
            "significant_terms": { "field": "TEXT" }
        }
    },
    "size": 0
}
				
				
					
					
Extra Credit

These extra problems are provided for students who wish to dig deeper into this project. Extra credit is meant to be significantly harder and more open-ended than the standard problems. We strongly recommend completing all of the above before attempting any of these problems.
ECs are added to satisfy students motivated to do more than the regular classwork. They are not required, and are worth up to 10 pts max, depending on which and how many of the ECs you complete.
				   
				  
EC1: Bigram Language Models
					Perform retrieval using a bigram language model. You
					  should match bigrams from the query against bigrams in
					  the document. You may smooth with a unigram language
					  model, if you wish.
					You should devise a matching score function for this
					  task. Your matching score should assign smaller scores
					  to documents when the two bigram terms appear at a
					  greater distance from each other. Use term position
					  information from elasticsearch to accomplish this. Be
					  prepared to explain and justify your matching score
					  function to the graders.
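One possible shape for such a distance penalty (only an illustration; you must devise and justify your own function): for each query bigram (w1, w2), reward every occurrence where w2 follows w1 within a small window, weighting closer pairs more heavily. The position lists come from termvectors, as above:

    def bigram_pair_score(positions_w1, positions_w2, window=5):
        """Score one query bigram in one document from the two terms' position lists."""
        score = 0.0
        for p1 in positions_w1:
            for p2 in positions_w2:
                gap = p2 - p1
                if 0 < gap <= window:
                    score += 1.0 / gap    # adjacent pair (gap = 1) counts the most
        return score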
					Evaluate your results using trec_eval and include
					  similar metrics with your submission.
					
	
	
	
	
	
	
	
	
	
	
Submission checklist

- Your source code for the indexer
- Your source code for the query processor using the listed retrieval models
- A short report describing your implementation and comparing your scoring results, presented in tabular form. The evaluation results should also be included in the report. A report template will be provided.
Rubric 
					
						Check Canvas for a detailed rubric!