homework 4

autocomplete with topical information


modeling short sequences of words


Language models predict which words are most likely to follow a given preceding text; the simplest such model is the n-gram model. In this homework you will build your first language model, trained on Twitter data. Review the assignment in this PDF file, and remember that reading resources can be found in the syllabus.
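As a concrete illustration of the n-gram idea (this snippet is not part of the starter code; the function name and the toy corpus are our own), a bigram model estimates the probability of a word given the previous word directly from counts:

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    """Estimate P(word | prev_word) from raw bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))   # counts of adjacent word pairs
    unigrams = Counter(tokens[:-1])              # counts of context words
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]

corpus = "i like tea i like coffee".split()
print(bigram_probability(corpus, "i", "like"))  # 1.0: "i" is always followed by "like"
```

On real Twitter data the counts come from millions of tokens rather than a six-word toy corpus, but the estimation principle is the same.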



data and starter kit


You will need the data (the en_US.twitter.txt file) and the starter code. You can read more about this data here. If you are more comfortable with notebooks, you can test your code with the following options:

You will fill out the portions of the code marked <YOUR-CODE-HERE>. Helpful unit-test functions with the suffix _test() are also provided. For example,

  def estimate_probabilities():
      """
      The graded function that you will need to fill out.
      """
      # <YOUR-CODE-HERE>
      return None

  def estimate_probabilities_test():
      """
      Ungraded: use this function to test estimate_probabilities.
      """
      # Run this function to test your implementation
      return
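To give a feel for what a filled-in version might involve, here is a rough sketch of estimating smoothed n-gram probabilities. The parameter names, the signature, and the use of add-k smoothing are assumptions for illustration; follow the signature given in the starter code, not this one.

```python
def estimate_probabilities_sketch(previous_n_gram, n_gram_counts,
                                  n_plus1_gram_counts, vocabulary, k=1.0):
    """Sketch: add-k smoothed probability of each vocabulary word
    following previous_n_gram (a tuple of words)."""
    previous_n_gram = tuple(previous_n_gram)
    prev_count = n_gram_counts.get(previous_n_gram, 0)
    # Add-k smoothing spreads k pseudo-counts over the whole vocabulary
    denominator = prev_count + k * len(vocabulary)
    probabilities = {}
    for word in vocabulary:
        count = n_plus1_gram_counts.get(previous_n_gram + (word,), 0)
        probabilities[word] = (count + k) / denominator
    return probabilities

# Toy example: the context ("i",) was seen twice, both times followed by "like"
probs = estimate_probabilities_sketch(("i",), {("i",): 2},
                                      {("i", "like"): 2}, ["like", "tea"])
print(probs)  # {'like': 0.75, 'tea': 0.25}
```

Note that the smoothed probabilities over the vocabulary sum to 1, which is a quick sanity check you can reuse in your own _test() functions.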

You might find prototyping with notebooks useful, but it is important that you submit a Python file, not a notebook.


submission instructions

  • Submit your homework on Gradescope, Assignment 4. Upload your well-commented code as a Python file (not a notebook).