homework 4
autocomplete with topical information
modeling short sequences of words
Language models predict the most likely words to follow any preceding text; the simplest such model is the n-gram model. In this homework we will build our first language model, trained on Twitter data. Review the homework in this PDF file. Remember that reading resources can be found in the syllabus.
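To make the idea concrete before you start, here is a minimal, illustrative bigram sketch (not part of the homework code): an n-gram model estimates the probability of a word given the previous n-1 words from raw counts. The toy corpus below is made up for illustration.

```python
from collections import Counter

# Toy corpus for illustration only.
corpus = ["i like nlp", "i like dogs", "i love nlp"]

unigrams = Counter()
bigrams = Counter()
for line in corpus:
    # Sentence-boundary markers so the model can score starts and ends.
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens)
    for a, b in zip(tokens, tokens[1:]):
        bigrams[(a, b)] += 1

# P("like" | "i") = count("i like") / count("i") = 2 / 3
p = bigrams[("i", "like")] / unigrams["i"]
print(p)
```

The homework's model works on the same principle, just over real Twitter text and with extra machinery (vocabulary handling, smoothing) that the starter code walks you through.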

data and starter kit
You will need the data (the en_US.twitter.txt file) and the code. You can read more about this data here. If you are more comfortable with notebooks, you can test your code with any of the following options:
- Locally on your laptop
- Google Cloud Vertex AI Workbench with your Google Cloud credits
- Google Colab with your Google account
You will be filling out the portions of the code that say <YOUR-CODE-HERE>. There is also helpful unit test code with the suffix _test(). For example,
def estimate_probabilities():
    """
    The graded function that you will need to fill out.
    """
    # <YOUR-CODE-HERE>
    return None

def estimate_probabilities_test():
    """
    Ungraded: you can use this function to test out estimate_probabilities.
    """
    # Run this function to test
    return
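As a rough guide to the shape of a filled-in solution, here is a hypothetical sketch of estimate_probabilities using add-k smoothing. The parameter names and signature here are assumptions for illustration; your starter code's actual signature and requirements may differ, so follow the stub you are given.

```python
def estimate_probabilities(previous_tokens, ngram_counts, nplus1gram_counts,
                           vocabulary, k=1.0):
    """Return {word: P(word | previous_tokens)} with add-k smoothing.

    Hypothetical signature for illustration; the graded stub may differ.
    """
    previous = tuple(previous_tokens)
    # Add-k smoothing: pretend every vocabulary word was seen k extra times.
    denominator = ngram_counts.get(previous, 0) + k * len(vocabulary)
    probabilities = {}
    for word in vocabulary:
        numerator = nplus1gram_counts.get(previous + (word,), 0) + k
        probabilities[word] = numerator / denominator
    return probabilities
```

A quick sanity check you can put in the _test() function is that the returned probabilities sum to 1 over the vocabulary, since the distribution is conditioned on the same context throughout.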
You might find prototyping with Notebooks useful, but it is important that you submit a Python file and not a Notebook.
submission instructions
- Submit your homework on Gradescope, Assignment 4. You will need to upload your well-commented code as a Python file (not a notebook).