Time: Wednesdays and Fridays 11:45am - 1:30pm
Room: Ryder Hall 161
Jan-Willem van de Meent
Phone: +1 617 373 7696
Office Hours: WVH 478, Wednesdays 1:45pm - 2:45pm (or by appointment)
Office Hours: WVH 462, Wednesdays 3pm - 5pm
Office Hours: WVH 462, Fridays 3pm - 5pm
This course introduces a range of topics in data mining and unsupervised machine learning:
This course is designed for MS students in computer science. Students are expected to have a good working knowledge of basic linear algebra, probability, statistics, and algorithms.
Lectures will focus on developing a mathematical and algorithmic understanding of the methods commonly employed to solve data mining problems. Homework problem sets will ask students to implement algorithms, apply them to datasets, and evaluate the relative merit of different methods.
In addition, students will carry out a project in which they complete a data mining task from start to finish, including pre-processing of data, analysis, and visualization of results.
CS 5800 or CS 7800, or consent of instructor.
This class is not structured to directly follow the outline of a textbook. The class schedule will list chapters from the books below as background reading for each lecture:
[Bishop] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer 2007. [amazon]
[Aggarwal] Charu C. Aggarwal, Data Mining, The Textbook, Springer 2015. [pdf]
[HTF] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer 2013. [pdf]
[LRU] Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014 [pdf]
The HKP and Aggarwal books are available online for Northeastern students. The HTF and LRU books are freely available from the authors’ websites. The Bishop book is on reserve at Snell Library with 3-hour loan periods.
The homework in this class will consist of 5 problem sets. Submissions must be made via Blackboard by 11:59pm on the due date. Please upload a single ZIP file containing both the source code for programming problems and the PDF files for math problems. Name this zip file:
Please adhere to the following guidelines:
Please submit math exercises as PDF files (preferably in LaTeX).
You may use any programming language you like, as long as your submission contains clear instructions on how to compile and run the code.
Data File Path: Do not use absolute paths for data files in your code. Add a data folder to your project and refer to it using a relative path.
Third-Party JARs: If you use any third-party JAR, make sure you include it in your submission.
Clarity: When coding up multiple variants of an algorithm, ensure that your code is properly factored into small, readable and clearly commented functions.
The TAs can deduct points for submissions that do not meet these guidelines at their discretion.
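As an illustration of the data file path guideline, here is a minimal Python sketch. It assumes the code is run from the project root and that the project contains a folder named `data`. The helper names `data_path` and `read_lines` and the file name `ratings.csv` are hypothetical and only serve as an example. The same idea applies in any other language.

```python
from pathlib import Path

# Assumption: the submission is run from the project root, which contains
# a folder named "data". No machine-specific absolute path is hard-coded.
DATA_DIR = Path("data")


def data_path(filename):
    """Return the relative path of a file inside the data folder."""
    return DATA_DIR / filename


def read_lines(filename):
    """Read a text file from the data folder and return its lines."""
    with open(data_path(filename), encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# "ratings.csv" is a hypothetical file name used for illustration only.
print(data_path("ratings.csv").as_posix())
```

Because the path is relative, the code should run unchanged on a grader's machine as long as the ZIP file preserves the project layout.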
In the second week of the semester, the class will be allowed to vote to choose a project type:
Freeform: Students form teams and develop their own project proposals. Projects may analyse an existing published dataset, or analyse self-acquired data.
Predefined (guidelines): The instructor will provide a single dataset with an accompanying prediction task. Students will form teams and evaluate the efficacy of a number of different algorithms on this dataset.
For both project types, students are required to submit a report outlining the chosen analysis methodology and results. Team members will be asked to rank each other in terms of who contributed the most to the project.
This vote has now concluded in favor of option 1. Project guidelines can be found here.
Students are expected to attend lectures and actively participate by asking questions. While students are required to complete homework programming exercises individually, helping fellow students by explaining course material is encouraged. At the end of the semester, students will be able to indicate which of their peers most contributed to their understanding of the material, and bonus points will be awarded based on this feedback.
The final grade for this course will be weighted as follows:
Bonus points earned through class participation will be used to adjust the final grade upwards at the discretion of the instructor.
Students will be asked to indicate the amount of time spent on each homework, as well as the project. They will also be able to indicate what they think went well, and what they think did not go well. There will also be an opportunity to provide feedback on the class after the midterm exam.
Note: This schedule is subject to change and will be adjusted as needed throughout the semester.
| Week | Date | Topic | Homework | Project | Reading |
|---|---|---|---|---|---|
| 1 | 07 Sep | Introduction 1: Course Overview | | | HKP: 1,2 |
| | 09 Sep | Introduction 2: Linear regression, Overfitting, Cross validation | | | Bishop: 1,2; HTF: 2; Ng: 1 |
| 2 | 14 Sep | Introduction 3: Probability, Bayes Rule, Conjugacy | #1 out | Vote on type | Bishop: 2,3; HTF: 7, 13, 16; Ng: 4,5 |
| | 16 Sep | Classification 1: k-NN, Logistic Regression, Linear Discriminant Analysis | | | Bishop: 4; HKP: 8; HTF: 4; Aggarwal: 10; Ng: 1 |
| 3 | 21 Sep | Classification 2: Naive Bayes, Support Vector Machines | | | Bishop: 7; HKP: 9; HTF: 12; Aggarwal: 10; Ng: 1,3,6 |
| | 23 Sep | Classification 3: Non-linear SVMs, Kernels | | | Bishop: 6,7; HTF: 5; HKP: 9 |
| 4 | 28 Sep | Classification 4: Decision Trees, Random Forests, Gradient Boosting | #2 out | Teams due | Bishop: 14; Aggarwal: 11; HTF: 9, 10 |
| | 30 Sep | Classification Wrap-up, Clustering 1: Hierarchical Clustering | #1 due | | Bishop: 9; HTF: 13, 14; Aggarwal: 6; Ng: 7a |
| 5 | 05 Oct | Clustering 2: K-means, K-medoids, DBSCAN | | | Bishop: 9; Aggarwal: 6; Ng: 7b |
| | 07 Oct | Clustering 3: Mixture Models, Expectation Maximization | | | Bishop: 9; Ng: 8 |
| 6 | 12 Oct | Topic Models: pLSA/pLSI, Latent Dirichlet Allocation | | | Aggarwal: 13; Tutorials: Hong, Blei |
| | 14 Oct | Dimensionality Reduction 1: PCA, CCA | #2 due | | Bishop: 12; HKP: 8; HTF: 14; HKP: 3; Ng: 10,11; FG: 5 |
| 7 | 19 Oct | Dimensionality Reduction 2: SVD, Random Projections, t-SNE | #3 out | | FG: 5 |
| | 21 Oct | Recommender Systems | | | Bishop: 2; HKP: 9, 13; HTF: 13; Aggarwal: 18; FG: 5,8 |
| 8 | 26 Oct | Midterm exam | | | |
| | 28 Oct | Project Proposal presentations | | Proposals due | |
| 9 | 04 Nov | Frequent Pattern Mining 1: Apriori | | | HKP: 6; HTF: 14; Aggarwal: 4,5; TSK: 6 |
| | 07 Nov | Frequent Pattern Mining 2: PCY, FP-Growth | | | HKP: 6; HTF: 14; Aggarwal: 4,5; TSK: 6 |
| 10 | 09 Nov | Link Analysis: PageRank, TrustRank | #4 out | | LRU: 5; Aggarwal: 18.4 |
| | 11 Nov | (Veterans Day) | #3 due | | |
| 11 | 16 Nov | Time Series: Autoregressive Models, Hidden Markov Models | | | Aggarwal: 14.3; Bishop: 13.1-2; HKP: 13.1.1 |
| | 18 Nov | Community Detection: Betweenness, Spectral Clustering | | | LRU: 10 |
| 12 | 23 Nov | (Thanksgiving Holiday) | | | |
| | 25 Nov | (Thanksgiving Holiday) | | | |
| 13 | 30 Nov | Bonus Topic: Deep Learning | #4 due | | |
| 14 | 07 Dec | (No Class) | | | |
| | 09 Dec | Final Exam | | | |
| 15 | 14 Dec | Project Presentations | | Reports due | |
| 16 | 19 Dec | (Final grades posted) | | | |