On this page:
Searching by relevance
Before you go...
6.10

Lab 7 Working with local

home work!

Purpose: To practice working with nested data definitions, nested function definitions, and local

Textbook References: Chapter 14: Similarities Everywhere, Chapter 15: Designing Abstractions, Chapter 16: Using abstractions

Searching by relevance

Goals: Practice using abstraction.

We use search engines all the time, and fundamentally a search engine queries by relevance: just because a search term appears in a given document doesn’t mean that it’s actually a useful or informative document. The science of how to return high-quality answers to user queries is known as Information Retrieval, and there is plenty more to learn beyond the particular techniques we develop in this lab.

To begin, we need some data definitions to represent a corpus of documents and searches over them. For today’s lab we’ll use the following simplified definitions:

; A Search is a String
; INTERPRETATION: a user query, consisting of an individual word
 
; A Doc is a [List-of String]
; INTERPRETATION: a document is represented by the list of words in it.
 
; A Corpus is a [List-of Doc]
; INTERPRETATION: a collection of documents

Exercise 1 Make some examples of Docs and at least one Corpus.

Note: Right-click on the link and choose "Save as..." — do not just copy&paste the text into a new buffer, since it isn’t an ISL file.

We have provided for you some text files that you can use as documents for this lab. You can load them by downloading this file into the same directory as your lab work and calling (require "corpus.rkt") at the top of your lab file. Notice that we have already removed any punctuation marks for you, so that all that remains are individual words. (This is a necessary cleanup step, so that our later functions don’t get confused by commas or periods, for example.)

The primary measurement of relevance we will compute is called TF-IDF, which is short for "Term Frequency - Inverse Document Frequency":
  • Term Frequency simply means "how many times does this term appear in this document?" Presumably, if a word appears many times in a given document, then it’s likely that the document is really about that term, rather than merely mentioning it in passing...

  • Inverse Document Frequency is defined as one divided by (one plus the number of documents that the word appears in). Intuitively, it means that if a term appears in a lot of documents, it probably isn’t a very interesting query word. For instance, just about every document will mention the word "the". So even though it appears a lot, it also appears almost everywhere, so it’s not a very specific search term. (The +1 in the denominator is to ensure that if a term doesn’t appear at all, you don’t end up with a division-by-zero error.)

The TF-IDF score is simply the product of the TF score and the IDF score.

Exercise 2 Design the function term-freq/doc, that takes in a Doc and a Search, and returns a Number indicating the count of how often the given search term appears in the given document. (Since a document is a list of its words, and a search is simply a word, you can use string=? to determine whether a term appears in a document.)

Design two versions of your function: once following the list template, and again using some appropriate list-processing function as was discussed in lecture.

Exercise 3 Design the function doc-freq, that takes in a Corpus and a Search, and returns the count of how many documents contain the search term. (Note that this counts documents, not the number of times the search term appears in each document.) You will likely need to design a helper function here...

Again, design two versions of your function: once following the list template and again using some appropriate list-processing function.

Exercise 4 Design the function tf-idf, that takes in a Corpus of documents, a specific Doc, and a Search term, and returns the TF-IDF score for that particular document and search term, relative to the overall corpus. Be careful, and make examples first!

Do you need to design two different versions of this function?

Exercise 5 Design a function score-corpus, that takes in a Corpus and a Search, and returns a list of documents together with their TF-IDF scores for the given search. You might find the following data definition helpful:
; A Ranking is a (make-ranking Doc Number)
(define-struct ranking [doc score])

Design this function twice, using the list template and using a list-processing function.

Exercise 6 Design a function all-relevant-documents, that takes in a Corpus and a Search, and returns all the documents that contain the search term. (Hint: you’ve likely written a helper function that’s useful here.)

Design this function twice, using the list template and again using a list-processing function.

Exercise 7 Design a function best-document, that takes in a Corpus and a Search, and returns the Doc that has the highest TF-IDF score. Use some or all of the helper functions above.

Design this function twice, using the list template and using a list-processing function.

Before you go...

If you had trouble finishing any of the exercises in the lab or homework, or just feel like you’re struggling with any of the class material, please feel free to come to office hours and talk to a TA or tutor for additional assistance.