homework 4

autocomplete with topical information


modeling short sequences of words


Language models predict which words are most likely to follow a given preceding text; the simplest such model is the n-gram model. In this homework you will build your first language model, trained on Twitter data. Review the assignment in this PDF file, and remember that reading resources can be found in the syllabus.
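As a concrete illustration of the n-gram idea (this snippet is not part of the starter code; the function name and the toy corpus are our own), a bigram model estimates the probability of a word given the previous word directly from counts:

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Estimate P(word | prev_word) from raw bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))   # counts of adjacent word pairs
    unigrams = Counter(tokens[:-1])              # counts of context words
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

corpus = "i like tea i like coffee".split()
print(bigram_probability(corpus, "i", "like"))  # 1.0: "i" is always followed by "like"
```

On real Twitter data the counts come from millions of tokens rather than a six-word toy corpus, but the estimation principle is the same.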



data and starter kit


You will need the data (the en_US.twitter.txt file) and the starter code. You can read more about this data here. If you are more comfortable with notebooks, you can test your code with the following options:

You will fill out the portions of the code marked <YOUR-CODE-HERE>. Helpful unit-test functions with the suffix _test() are also provided. For example,

  def estimate_probabilities():
      """
      The graded function that you will need to fill out.
      """
      # <YOUR-CODE-HERE>
      return None

  def estimate_probabilities_test():
      """
      Ungraded: use this function to test estimate_probabilities.
      """
      # Run this function to test your implementation
      return
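To give a feel for what a filled-in version might involve, here is a rough sketch of estimating smoothed n-gram probabilities. The parameter names, the signature, and the use of add-k smoothing are assumptions for illustration; follow the signature given in the starter code, not this one.

```python
def estimate_probabilities_sketch(previous_n_gram, n_gram_counts,
                                  n_plus1_gram_counts, vocabulary, k=1.0):
    """Sketch: add-k smoothed probability of each vocabulary word
    following previous_n_gram (a tuple of words)."""
    previous_n_gram = tuple(previous_n_gram)
    prev_count = n_gram_counts.get(previous_n_gram, 0)
    # Add-k smoothing spreads k pseudo-counts over the whole vocabulary
    denominator = prev_count + k * len(vocabulary)
    probabilities = {}
    for word in vocabulary:
        count = n_plus1_gram_counts.get(previous_n_gram + (word,), 0)
        probabilities[word] = (count + k) / denominator
    return probabilities

# Toy example: the context ("i",) was seen twice, both times followed by "like"
probs = estimate_probabilities_sketch(("i",), {("i",): 2},
                                      {("i", "like"): 2}, ["like", "tea"])
print(probs)  # {'like': 0.75, 'tea': 0.25}
```

Note that the smoothed probabilities over the vocabulary sum to 1, which is a quick sanity check you can reuse in your own _test() functions.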

You might find prototyping with notebooks useful, but it is important that you submit a Python file, not a notebook.


submission instructions

  • Submit your homework on Gradescope, Assignment 4. Upload your well-commented code as a Python file (not a notebook).