syllabus
schedule
-
Week 1
Introduction and Applications
January 9 (video - part 1), (video - part 2)
- Language is the most efficient and compact way to transfer knowledge: if there is a window to AGI, it runs through NLP. This introductory lecture takes us through the history of how we arrived at LLMs. We'll also review some applications of NLP, current industry standards, and some of the most impactful approaches and where they are being implemented. Finally, we'll preview what we'll be learning, the logistics of how we'll be doing so, and the expectations for your participation in this class.
-
Applications Overview
- Machine Translation (Baidu's Word-Word)
- Summarization (Dialogues, Newspaper Articles, etc.)
- Text Classification and Clustering (News Article Groupings, etc.)
- Question and Answering (LLMs and Chatbots)
-
Submissions
- Laboratory - Getting Started on Google Cloud with Your Credits
- Assignment 1 is assigned - A First Look at Processing Language
-
Week 2
ML Foundations and Software Engineering
January 16 (video)
- As NLP is a branch of machine learning, we will review foundational knowledge that we'll use throughout this class. We'll look at both machine learning and software engineering best practices that will help you build and scale NLP systems later in the course. Because most NLP algorithms today rely heavily on computing resources, we'll also dive into distributed computation approaches and cloud-based operations.
-
Lecturing Topics
- Foundations of Machine Learning
- Software Engineering Practices
- Required Keynote Reading
-
Submissions
- Laboratory - Containerization in the Cloud
- Assignment 1 is due
- Assignment 2 is assigned - Text Classification
-
Week 3
Language Classification
January 23 (video)
- Building upon our review of machine learning, we discuss strategies for feature extraction and generation. Because building a vocabulary can explode the required memory, our featurization relies on NLP-specific techniques such as tokenization and lemmatization.
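- As a preview, here is a minimal preprocessing sketch using NLTK. The library choice, downloads, and example sentence are illustrative assumptions; the labs may use a different toolkit.

```python
# Minimal preprocessing sketch (assumed NLTK toolkit): tokenization, stopword
# removal, stemming, and lemmatization on a toy sentence.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of NLTK data (newer NLTK releases may also need "punkt_tab").
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

text = "The striped bats were hanging on their feet and eating best batches of food."

tokens = nltk.word_tokenize(text.lower())                        # tokenization
stop = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop]   # stopword removal

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in content])          # e.g. "batches" -> "batch"
print([lemmatizer.lemmatize(t) for t in content])  # e.g. "feet" -> "foot"
```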
-
Lecturing Topics
- Building Vocabulary with Stopwords and Stemming
- Preprocessing - Tokenization and Lemmatization
- Logistic Regression Classifier
- Naïve Bayes Classifiers
- Application - Sentiment Analysis
- Submissions
-
Week 4
Text Processing Algorithms
January 30 (video)
- Among the most widely used algorithms in practice today are autocorrect algorithms, which typically must run on-device. In this lecture, we'll review elements of dynamic programming, particularly the minimum edit distance algorithm, and how we can apply these concepts to the autocorrect problem and subsequently to autocomplete.
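- A minimal sketch of the dynamic-programming table for minimum edit distance appears below; the unit costs are an assumption, and the lecture may use a different cost scheme (e.g., substitution cost 2).

```python
def min_edit_distance(source: str, target: str,
                      ins_cost: int = 1, del_cost: int = 1, sub_cost: int = 1) -> int:
    """Dynamic-programming minimum edit distance.

    D[i][j] = cost of transforming source[:i] into target[:j].
    Unit costs are assumptions; the lecture's cost scheme may differ.
    """
    m, n = len(source), len(target)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # delete all of source[:i]
        D[i][0] = i * del_cost
    for j in range(1, n + 1):          # insert all of target[:j]
        D[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,     # delete
                          D[i][j - 1] + ins_cost,     # insert
                          D[i - 1][j - 1] + sub)      # substitute or match
    return D[m][n]

print(min_edit_distance("play", "stay"))  # 2 with unit costs
```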
-
Lecturing Topics
- Representations of Language
- Comparisons / Differences in Language
- Minimum Edit Distance Algorithms
- Application - Autocorrect in Practice
-
Submissions
- Laboratory - Autocorrect Vocabulary Candidates
- Assignment 2 is due
- Assignment 3 is assigned - Autocorrect and Minimum Edit Distances
-
Week 5
Introduction to Language Modeling
February 6
-
Lecturing Topics
- What is a language model? (Abstractive vs extractive approaches)
- Overview of Basic Modeling Approaches
- The N-Gram Model
- Out of Vocabulary Words and Smoothing
- Language Model Evaluation
- Application - Autocompleting words and sentences
-
Submissions
- Laboratory 5.1 - N-Grams Processing
- Laboratory 5.2 - Out of Vocabulary Words
- Laboratory 5.3 - Building the Language Model
- Assignment 3 is due
- Assignment 4 is assigned - Autocomplete with Topical Information
-
Week 6
Unsupervised NLP - Topic Modeling
February 13
- This week, we will explore David Blei's contributions to the field: a set of concepts that indirectly attack the age-old question of "what is k?" in the k-means clustering algorithm. We will review how to model natural language hierarchically using Bayesian concepts, where our corpora are processed without preserving the order of words. This week also marks the first week of required keynote paper reading, in which we begin a tour of seminal papers that have revolutionized not only language processing but also machine learning and artificial intelligence writ large. This week's paper is perhaps the most difficult one you'll read in this class, since it relies heavily on probability and statistics.
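- For a sense of what topic modeling produces, here is a hedged sketch using scikit-learn's LDA implementation on a toy corpus (the library choice and documents are assumptions; note that scikit-learn fits LDA with variational inference rather than the collapsed Gibbs sampler derived in lecture).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus (an assumption); LDA treats each document as a bag of words.
docs = [
    "the goalie saved the penalty in the final match",
    "the striker scored twice in the league match",
    "the senate passed the budget bill after a long debate",
    "the president vetoed the tax bill in congress",
]

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)

# k = 2 topics; scikit-learn uses variational inference, not (collapsed) Gibbs sampling.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-5:][::-1]
    print(f"topic {k}:", [terms[i] for i in top])
```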
-
Lecturing Topics
- Parameter Estimation of a Distribution
- The Dirichlet Distribution and its Attributes
- Infinite Bayesian Models in Topic Modeling
- Latent Semantic Indexing and Latent Dirichlet Allocation
- (Collapsed) Gibbs Sampling, and Optimization
- Application - Grouping Documents
- Required Keynote Paper - Latent Dirichlet Allocation
- Submissions
-
Week 7
Word Modeling with Self-Supervision
February 20 (video)
- Perhaps the most influential paper to have come out of the natural language community is the word2vec paper, which most general machine learning practitioners recognize. You'll find elements of its practice in communities from the information retrieval sciences to modern cyber applications to general ML problems. As it pertains to language models, modeling words is often the first stage in any system pipeline that you may design. This week's lecture reviews word models (including word2vec and the continuous bag-of-words model) and the embeddings / representations that they create.
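- A minimal sketch of training word embeddings with gensim's Word2Vec is shown below (the library choice, toy corpus, and hyperparameters are assumptions; the labs build CBOW by hand and optionally study the original C code).

```python
from gensim.models import Word2Vec

# Toy corpus (an assumption); real word vectors need far more text,
# so the resulting similarities here are not meaningful.
sentences = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
]

# sg=1 selects the skip-gram objective with negative sampling (negative=5);
# sg=0 would give the continuous bag-of-words (CBOW) variant covered in lab.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=100)

print(model.wv["king"].shape)                 # (50,) embedding vector
print(model.wv.most_similar("king", topn=3))  # nearest neighbors in embedding space
```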
-
Lecturing Topics
- Embeddings with Continuous Bag of Words
- Intrinsic and Extrinsic Evaluation of Word Models
- Word Modeling in Practice
- The Skip-gram and Negative Sampling
- From Words to Sentences
- Required Keynote Paper - Distributed Representations of Words
-
Submissions
- Laboratory - Word Embeddings with CBOW
- Laboratory - The Original Word2Vec Code in C (Optional)
- Assignment 4 is due
- Assignment 5 is assigned - Word2Vec - Skipgram Implementation (Optional)
-
Week 8
Introduction to Sequential Modeling
February 27 (video)
-
Lecturing Topics
- Modeling with Hidden Markov Models
- The Viterbi Algorithm - Initialization, Forward, and Backward Passes
- Application - Parts of Speech Tagging
- Required Keynote Reading - A Survey of LLMs Including ChatGPT and GPT-4
- Required Keynote Reading - Learning Text Similarity with Siamese Recurrent Networks
- Submissions
-
Week 9
No Instruction - Spring Break
March 6
- Have a nice holiday!
-
Week 10
Recurrence and Neural Networks
March 13 (video)
- While newer architectures like transformers now dominate the field of NLP, Recurrent Neural Networks were, in their short tenure, the workhorses that first demonstrated the power of deep learning for sequential data like text. This lecture builds an appreciation of how modeling language works, how attention and transformers originated, and how the field subsequently transitioned to truly deep architectures. Beyond studying the history, we'll review the fundamental principles of RNNs that underpin modern NLP.
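- As a preview of the lab, here is a minimal vanilla-RNN forward pass in NumPy (the dimensions and initialization are illustrative assumptions; the lab builds a trainable version).

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h), collected for every step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return states

# Toy dimensions (assumptions): 8-dim inputs, 16-dim hidden state, 5 time steps.
rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 8, 16, 5
inputs = [rng.normal(size=d_in) for _ in range(seq_len)]
states = rnn_forward(inputs,
                     W_xh=rng.normal(scale=0.1, size=(d_hidden, d_in)),
                     W_hh=rng.normal(scale=0.1, size=(d_hidden, d_hidden)),
                     b_h=np.zeros(d_hidden))
print(len(states), states[-1].shape)  # 5 hidden states, each of shape (16,)
```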
-
Lecturing Topics
- Traditional Language Models vs Recurrent Models
- The Recurrent Neural Network
- Vanishing and Exploding Gradients
- Memory Gating - GRUs and LSTMs
- Accuracy and Evaluation - Perplexity
- Applications - Named Entity Recognition and Machine Translation
- Required Keynote Paper - Long Short-Term Memory Networks
- Required Keynote Paper - On the Difficulty of Training RNNs
-
Submissions
- Laboratory - Building Your First RNN
- Assignment 5 is due
- Assignment 6 is assigned - Implement Your Own Recurrent Network
-
Week 11
Attention and the Transformer Model
March 20 (video)
- Attention models were the leap forward that provided the fundamental building blocks of modern machine learning, including the essential ingredients for Large Language Models. We'll go deep into attention layers in neural networks, building our own from scratch.
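- The core operation we'll build is scaled dot-product attention, sketched below in NumPy following the formulation in "Attention Is All You Need" (the toy shapes are assumptions).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional additive mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (len_q, len_k) similarity scores
    if mask is not None:
        scores = scores + mask           # mask uses -inf to block positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

# Toy shapes (assumptions): 4 query positions, 6 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (4, 8) (4, 6)
```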
-
Lecturing Topics
- Introduction to the Attention Modeling
- The Self-Attention Mechanism
- The Transformer Modeling Layer
- Large Scale Attention Modeling
- Required Keynote Reading - Attention is All You Need
- Required Keynote Reading - BERT - Pre-training Bidirectional Transformers
-
Submissions
- Laboratory - Dot Product Attention
- Laboratory - Masking in Attention
- Laboratory - Positional Encoding
- Assignment 6 is due
- Assignment 7 is assigned - Attention and Transformer Networks
-
Week 12
Introduction to Large Language Models (LLMs)
March 27 (video)
- The next three weeks are devoted to the state of the art in industry and to LLMs in practice, which may have changed since you started this course! This week, we introduce large language models using the fundamentals you have learned, from perplexity in system design to transformer layers for pre-training. We'll focus on techniques that large companies (or well-funded ones, at least) use to create foundation LLMs, drawing on training methods from OpenAI, Anthropic, Amazon, and Google.
-
Lecturing Topics
- Large Language Modeling (LLM) in Code
- Tuning with Low Resources - LoRA and Quantization
- Required Keynote Reading - Training to Instruct with Human Feedback
- Required Keynote Reading - GPT-4 Technical Report from OpenAI
- Submissions
-
Week 13
Practically Leveraging Large Language Models
April 3
- Last week, we discussed how large companies might train LLMs. In contrast, this week's lecture is most useful for those interested in joining mid-sized companies or startups: we explore common approaches for optimally leveraging a large language model for your particular applications once it has been created. These techniques also attack limitations of LLMs such as knowledge gaps, hallucinations, and logical reasoning problems.
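- To make the retrieval-augmented generation idea concrete, here is a minimal sketch: embed a few documents, retrieve the most relevant ones for a query, and assemble a prompt. The embed() function, documents, and prompt template are hypothetical placeholders; a real system would use an actual embedding model and an LLM API.

```python
import numpy as np

# Minimal RAG sketch. embed() is a hypothetical stand-in for a real embedding model.
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "Premium subscribers get priority queueing for support tickets.",
]

def embed(text, dim=64):
    """Hypothetical stand-in embedding: a hashed, normalized bag-of-words vector."""
    v = np.zeros(dim)
    for token in text.lower().split():
        v[hash(token) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query, k=2):
    scores = doc_vectors @ embed(query)   # cosine similarity (vectors are unit norm)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "Can I get my money back two weeks after buying?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the LLM of your choice
```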
-
Lecturing Topics
- Prompt Engineering - In Context Learning
- Aligning LLMs in the Instruction Following Framework
- Deep Reinforcement Learning from Human Feedback
- Retrieval Augmented Generation (RAG)
- Required Keynote Reading - Retrieval Augmented Generation
- Required Keynote Reading - Parameter Efficient Fine-Tuning
- Submissions
-
Week 14
Lecturer on Travel (No Lecture)
April 10
-
Week 15
Language Modeling Systems Lifecycle
April 17
- You've learned about the inner workings of the LLM, the mechanisms that power it, how and when to tune it, the data collection processes that govern it, how it can be used practically, and the agents that can run on it. In this lecture, we explore the practical work GenAI engineers face when product managers ask them to design a system. Beyond the theory, we'll study the system itself, devoting time to *when* to focus on certain components of your LLM and to the life cycle of your system design.
-
Lecturing Topics
- Guidelines and NLP Systems Engineering Diagrams
- Intelligent Agents with Program-Aided LLMs
- Multimodal Large Language Models
- Applications - Creating Your Own GenAI Smart Agents
-
Week 16
Demonstrations and Poster Sessions
April 24
- Deploy and show off your domain-specific LLM and pitch your startup idea! Review the guidelines at the Final Project Website.
grading criteria
Participation | 5%
Reading Group | 15%
Labs | 25%
LLM Deployment Project | 25%
Assignments | 30%
course meeting times
-
Lectures
- Thurs, 4pm-7:20pm
- Room 1045
-
Office Hours
- Karl, Tues 8:30-9:30pm
- Raman, Mon 1-3pm, 9th Floor
- Joy, Tues 12-2pm, 9th Floor
- Bella, Wed 1-3pm, 9th Floor
suggested textbooks
- Speech and Language Processing, 3rd Ed. Dan Jurafsky and James Martin, 2024
- A Comprehensive Overview of Large Language Models, Naveed et al., 2024