DS 5230 Unsupervised Machine Learning and Data Mining / DS 4420 Machine Learning and Data Mining 2 - Fall 2018

Lectures

Time: MW 2.50pm - 4.30pm
Room: Kariotis Hall 011


Instructor

Jan-Willem van de Meent [personal page]
E-mail:
Phone: +1 617 373 7696
Office Hours: Monday 4.30pm - 6.00pm (or by appointment)


Teaching Assistant

Hao Wu [bio]
E-mail:
Office Hours: Wednesday 4.30pm - 6.00pm (or by appointment)


Resources

Blackboard: DS 5230 / DS 4420 (Homework problems and Grades)
Piazza: DS 5230 & DS 4420 (Discussion)
Exam Prep: Midterm Topic List


Course Overview

This course introduces a range of techniques in unsupervised machine learning and data mining:

  • Frequent itemset & association rule mining
  • Clustering methods
  • Gaussian mixtures and expectation maximization
  • Dimensionality reduction
  • Topic models
  • Social network analysis
  • Link analysis
  • Recommender systems

This course is designed for MS students in computer science. Lectures will focus on developing a mathematical and algorithmic understanding of the methods commonly employed to solve unsupervised machine learning and data mining problems. Homework problem sets will ask students to implement algorithms and/or work out examples.

Students will also collaborate on a project in which they must complete a data analysis task from start to finish, including pre-processing of data, analysis, and visualization of results.


Requirements

CS 5800 or CS 7800, or consent of instructor. Students without this prerequisite should e-mail a CV and transcripts to the instructor. If these materials are acceptable, then the student will be asked to complete the self-test prior to admission to the course.

In addition to the formal requirements, students are expected to have a good working knowledge of calculus, linear algebra, probability, statistics, and algorithms.


Reading

This class is not structured to directly follow the outline of a textbook. The schedule will list chapters from a number of textbooks as background reading for each lecture, as well as additional materials. Students are expected to read the materials in preparation for each lecture.

Textbooks

  1. [HTF] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer 2013. [pdf]

  2. [LRU] Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014 [pdf]

  3. [TSK] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, 2005. [ch6, ch8]

  4. [Aggarwal] Charu C. Aggarwal, Data Mining: The Textbook, Springer 2015. [pdf]

The HTF and LRU books are freely available from the authors’ websites. The Aggarwal book is available online to Northeastern students.

Additional Materials


Homework

The homework in this class will consist of 4 problem sets. Submissions must be made via Blackboard by 11.59pm on the due date. Please upload a single ZIP file containing both the source code for programming problems and the PDF files for math problems. Name this zip file:

Please follow these guidelines:

Math Problems

Please submit math exercises as PDF files (preferably in LaTeX).

Programming Problems

The preferred language for this course is Python. However, you may use any programming language you like, as long as your submission contains clear instructions on how to compile and run the code.

  • Data File Path: Don’t use absolute paths for data files in your code. Add a data folder to your project and refer to it using a relative path.

  • 3rd Party Jars: If you use any 3rd-party JAR files, make sure you include them in your submission.

  • Clarity: When coding up multiple variants of an algorithm, ensure that your code is properly factored into small, readable and clearly commented functions.

The TAs can deduct points for submissions that do not meet these guidelines at their discretion.
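
As an illustration of the data file path guideline, here is a minimal Python sketch, assuming a hypothetical layout in which the script sits next to a `data` folder. The helper names `project_dir` and `data_path` and the file name `transactions.csv` are made up for this example and are not part of any course template. The path is resolved at run time from the script's own location, so nothing machine-specific is hardcoded.

```python
from pathlib import Path


def project_dir():
    """Return the folder that contains this script.

    Falls back to the current working directory when __file__ is not
    defined (for example, inside a notebook).
    """
    try:
        return Path(__file__).resolve().parent
    except NameError:
        return Path.cwd()


def data_path(filename):
    """Return the location of a file inside the project's data folder."""
    return project_dir() / "data" / filename


if __name__ == "__main__":
    # Hypothetical file name, used only to show the call.
    print(data_path("transactions.csv"))
```

With this pattern the submission runs unchanged on another machine, provided the `data` folder is included in the ZIP file alongside the code.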


Project

The goal of the project is to gain hands-on experience with analysis of a dataset of your choice. You should select a problem and a dataset that can be analyzed using methods covered in class. The project should be conducted in groups of 2-4 people. Each group should work independently, but you are welcome to discuss technical issues on Piazza. Completion of the project will include a project proposal, two milestone project updates, a report, and a review of the project by another team.


Participation and Collaboration

Students are expected to attend lectures and actively participate by asking questions. While students are required to complete homework programming exercises individually, helping fellow students by explaining course material is encouraged. At the end of the semester, students will be able to indicate which of their peers contributed most to their understanding of the material, and bonus points will be awarded based on this feedback.


Grading

The final grade for this course will be weighted as follows:

  • Homework: 40%
  • Midterm Exam: 15%
  • Final Exam: 15%
  • Course Project: 30%

Class participation will be used to adjust the final grade upwards at the discretion of the instructor.
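
For students who want to estimate their standing, the weighting above can be sketched in Python as follows. This is only an illustration: it assumes every component is scored on a 0-100 scale, the function name `weighted_grade` and the example scores are hypothetical, and the discretionary participation adjustment is not modeled.

```python
# Weights taken from the grading breakdown above.
WEIGHTS = {
    "homework": 0.40,
    "midterm": 0.15,
    "final": 0.15,
    "project": 0.30,
}


def weighted_grade(scores):
    """Combine per-component scores (assumed 0-100) into a weighted total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Made-up scores, used only to show the arithmetic.
    example = {"homework": 90, "midterm": 80, "final": 70, "project": 85}
    print(round(weighted_grade(example), 2))
```

For the made-up scores in the example, the components contribute 36, 12, 10.5, and 25.5 points respectively, for a weighted total of 84.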


Self-evaluation

Students will be asked to indicate the amount of time spent on each homework, as well as the project. They will also be able to indicate what they think went well, and what they think did not go well. There will also be an opportunity to provide feedback on the class after the midterm exam.


Schedule

Note: This schedule is subject to change and will be adjusted as needed throughout the semester.

Date | Lectures | Homework / Project | Reading
Wed Sep 05 | 1. Overview: Unsupervised Learning and Data Mining [slides] | |
Mon Sep 10 | 2. Math Review [slides] | Homework 1 Out | [CS 229 Linear Algebra Notes], [Iain Murray Math Crib Sheet]
Wed Sep 12 | 3. Frequent Itemsets & Association Rules [slides] | Self-test Due (Fri) | [TSK Chapter 6]
Mon Sep 17 | 4. Maximum Likelihood, Maximum A Posteriori, Conjugacy [slides] | | [CS 229 Probability Notes], [Blei Exponential Family Notes]
Wed Sep 19 | 5. Predictive Distribution, Graphical Models, Exponential Families [slides] | | [CS 229 Probability Notes], [Blei Exponential Family Notes]
Mon Sep 24 | 6. Bayesian Regression [slides] | Homework 2 Out | Bishop [Ch 1.1-1.2, Ch 3.1-3.3] Optional: Rasmussen and Williams [Ch 1, Ch 2, Ch 4]
Wed Sep 26 | 7. Dimensionality Reduction 1 [slides] | | [Survey Paper by Cunningham et al.]
Mon Oct 01 | 8. Dimensionality Reduction 2 [slides] | Homework 1 Due (Fri) | [t-SNE paper by van der Maaten et al.]
Wed Oct 03 | 9. Clustering 1 [slides] | Project Teams Due | [TSK Chapter 8]
Mon Oct 08 | Columbus Day (No Class) | |
Wed Oct 10 | 10. Clustering 2 [slides] | Homework 2 Due (Fri) | [TSK Chapter 8]
Mon Oct 15 | 11. Clustering 3 [slides] | Homework 3 Out | [TSK Chapter 8]
Wed Oct 17 | 12. Clustering 4 | |
Mon Oct 22 | 13. Topic Modeling 1 | Project Abstracts Due |
Wed Oct 24 | Midterm Exam | Homework 3 Due (Fri) |
Mon Oct 29 | 14. Topic Modeling 2 | Homework 4 Out |
Wed Oct 31 | 15. Topic Modeling 3 | |
Mon Nov 05 | 16. Community Detection 1 | |
Wed Nov 07 | 17. Community Detection 2 | |
Mon Nov 12 | Veteran’s Day (No Class) | Project Milestone 1 Due |
Wed Nov 14 | 18. Link Analysis | Homework 4 Due (Fri) |
Mon Nov 19 | 19. Recommender Systems | Project Milestone 2 Due |
Wed Nov 21 | Thanksgiving (No Class) | |
Mon Nov 26 | 20. Review | |
Wed Nov 28 | Project Presentations | |
Mon Dec 03 | (No Class) | |
Wed Dec 05 | (No Class) | Project Reports Due (Fri) |
Mon Dec 10 | Final Exam | |