CS6220 - Fall 2016 - Section 3 - Data Mining Techniques

Lectures

Time: Wednesdays and Fridays 11:45am - 1:30pm
Room: Ryder Hall 161


Instructor

Jan-Willem van de Meent
E-mail: contact
Phone: +1 617 373 7696
Office Hours: WVH 478, Wednesdays 1:45pm - 2:45pm (or by appointment)


Teaching Assistants

Yuan Zhong
E-mail: yzhong@ccs.neu.edu
Office Hours: WVH 462, Wednesdays 3pm - 5pm

Kamlendru Kumar
E-mail: kumark@zimbra.ccs.neu.edu
Office Hours: WVH 462, Fridays 3pm - 5pm


Resources


Course Overview

This course introduces a range of topics in data mining and unsupervised machine learning:

  • Regression (Bias-variance tradeoff, overfitting, cross-validation)
  • Data pre-processing and visualization
  • Dimensionality Reduction (PCA, ICA, Random Projections)
  • Classification (Naive Bayes, Logistic Regression, SVMs, Random Forests)
  • Clustering (K-means, K-medoids, DBSCAN, EM for Mixture Models)
  • Recommender systems
  • Frequent Pattern Mining (Apriori, FP-Growth)
  • Time Series (ARIMA, HMMs)
  • Networks (PageRank, Spectral Clustering)

This course is designed for MS students in computer science. Students are expected to have a good working knowledge of basic linear algebra, probability, statistics, and algorithms.

Lectures will focus on developing a mathematical and algorithmic understanding of the methods commonly employed to solve data mining problems. Homework problem sets will ask students to implement algorithms, apply them to datasets, and evaluate the relative merit of different methods.

In addition, students will carry out a project in which they must complete a data mining task from start to finish, including pre-processing of data, analysis, and visualization of results.


Requirements

CS 5800 or CS 7800, or consent of instructor.


Textbooks

This class is not structured to directly follow the outline of a textbook. The class schedule lists chapters from the following 5 books as background reading for each lecture:

  1. [Bishop] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer 2007. [amazon]

  2. [HKP] Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2011. [amazon] [ebrary]

  3. [Aggarwal] Charu C. Aggarwal, Data Mining, The Textbook, Springer 2015. [pdf]

  4. [HTF] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer 2013. [pdf]

  5. [LRU] Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014 [pdf]

The HKP and Aggarwal books are available online for Northeastern students. The HTF and LRU books are freely available from the authors’ websites. The Bishop book is on reserve at the Snell library with 3 hr loan periods.


Additional Reading

Lecture Notes by Andrew Ng (course webpage)

  • Part 1: Regression, Classification, Generalized Linear Models
  • Part 2: Gaussian Discriminant Analysis, Naive Bayes
  • Part 3: Support Vector Machines
  • Part 4: Learning Theory
  • Part 5: Regularization and Model Selection
  • Part 6: The Perceptron and Large Margin Classifiers
  • Part 7a: K-means clustering
  • Part 7b: Expectation Maximization for Gaussian Mixtures
  • Part 8: The EM Algorithm
  • Part 10: Principal component analysis
  • Part 11: Independent component analysis

Lecture Notes by Carlos Fernandez-Granda (course webpage)

  • Part 5: PCA and random projections
  • Part 8: PCA, collaborative filtering, Non-negative matrix factorization.

Math Background Notes

Materials on Specific Topics


Homework

The homework in this class will consist of 5 problem sets. Submissions must be made via Blackboard by 11:59pm on the due date. Please upload a single ZIP file containing both source code for programming problems and PDF files for math problems. Name this zip file:

Please follow these guidelines:

Math Problems

Please submit math exercises as PDF files (preferably in LaTeX).

Programming Problems

You may use any programming language you like, as long as your submission contains clear instructions on how to compile and run the code.

  • Data File Path: Don’t use absolute paths for data files in your code. Please add a data folder to your project and refer to it using a relative path.

  • 3rd Party Jars: If you are using any 3rd party jars, make sure you include them in your submission.

  • Clarity: When coding up multiple variants of an algorithm, ensure that your code is properly factored into small, readable and clearly commented functions.

The TAs can deduct points for submissions that do not meet these guidelines at their discretion.
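As an illustration of the data file path guideline, here is a minimal sketch in Python (any language is acceptable for submissions). The folder name `data`, the helper names `data_path` and `load_lines`, and the file name `example.csv` are hypothetical placeholders, not course requirements.

```python
from pathlib import Path

# Relative to the directory the code is run from; no absolute path is
# hard-coded. "data" is a hypothetical folder name.
DATA_DIR = Path("data")


def data_path(filename: str) -> Path:
    """Return the relative path to a file inside the project's data folder."""
    return DATA_DIR / filename


def load_lines(filename: str) -> list:
    """Read a text file from the data folder and return its lines."""
    with data_path(filename).open() as f:
        return [line.rstrip("\n") for line in f]
```

With this layout, a call such as `load_lines("example.csv")` should work on any machine, provided the code is run from the project root as described in your run instructions.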


Project

In the second week of the semester, the class will be allowed to vote to choose a project type:

  1. Freeform: Students form teams and develop their own project proposals. Projects may analyse an existing published dataset, or analyse self-acquired data.

  2. Predefined (guidelines): The instructor will provide a single dataset with an accompanying prediction task. Students will form teams and evaluate the efficacy of a number of different algorithms on this dataset.

For both project types, students are required to submit a report outlining the chosen analysis methodology and results. Team members will be asked to rank each other in terms of who contributed the most to the project.

This vote has now concluded in favor of option 1. Project guidelines can be found here.


Participation and Collaboration

Students are expected to attend lectures and actively participate by asking questions. While students are required to complete homework programming exercises individually, helping fellow students by explaining course material is encouraged. At the end of the semester, students will be able to indicate which of their peers most contributed to their understanding of the material, and bonus points will be awarded based on this feedback.


Grading

The final grade for this course will be weighted as follows:

  • Homework: 30%
  • Midterm Exam: 20%
  • Final Exam: 20%
  • Course Project: 30%
  • Participation (Bonus): 10%

Bonus points earned through class participation will be used to adjust the final grade upwards at the discretion of the instructor.
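As a rough illustration of the weighting, the sketch below computes a base score from the four non-bonus components listed above. It assumes each component is scored on a 0-100 scale; the example scores are made up, and the participation bonus is not modeled because it is applied at the discretion of the instructor.

```python
# Weights taken from the grading list above (bonus excluded).
WEIGHTS = {"homework": 0.30, "midterm": 0.20, "final": 0.20, "project": 0.30}


def base_score(scores: dict) -> float:
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = {"homework": 90.0, "midterm": 80.0, "final": 70.0, "project": 85.0}
print(round(base_score(example), 2))  # 27 + 16 + 14 + 25.5 = 82.5
```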


Self-evaluation

Students will be asked to indicate the amount of time spent on each homework, as well as the project. They will also be able to indicate what they think went well, and what they think did not go well. There will also be an opportunity to provide feedback on the class after the midterm exam.


Schedule

Note: This schedule is subject to change and will be adjusted as needed throughout the semester.

| Wk | Day | Lectures | Homework | Project | Reading |
| 1 | 07 Sep | Introduction 1: Course Overview | | | HKP: 1,2 |
| | 09 Sep | Introduction 2: Linear regression, Overfitting, Cross validation | | | Bishop: 1,2; HTF: 2; Ng: 1 |
| 2 | 14 Sep | Introduction 3: Probability, Bayes Rule, Conjugacy | #1 out | Vote on type | Bishop: 2,3; HTF: 7, 13, 16; Ng: 4,5 |
| | 16 Sep | Classification 1: k-NN, Logistic Regression, Linear Discriminant Analysis | | | Bishop: 4; HKP: 8; HTF: 4; Aggarwal: 10; Ng: 1 |
| 3 | 21 Sep | Classification 2: Naive Bayes, Support Vector Machines | | | Bishop: 7; HKP: 9; HTF: 12; Aggarwal: 10; Ng: 1,3,6 |
| | 23 Sep | Classification 3: Non-linear SVMs, Kernels | | | Bishop: 6,7; HTF: 5; HKP: 9 |
| 4 | 28 Sep | Classification 4: Decision Trees, Random Forests, Gradient Boosting | #2 out | Teams due | Bishop: 14; Aggarwal: 11; HTF: 9, 10 |
| | 30 Sep | Classification Wrap-up, Clustering 1: Hierarchical Clustering | #1 due | | Bishop: 9; HTF: 13, 14; Aggarwal: 6; Ng: 7a |
| 5 | 05 Oct | Clustering 2: K-means, K-medoids, DBSCAN | | | Bishop: 9; Aggarwal: 6; Ng: 7b |
| | 07 Oct | Clustering 3: Mixture Models, Expectation Maximization | | | Bishop: 9; Ng: 8 |
| 6 | 12 Oct | Topic Models: pLSA/pLSI, Latent Dirichlet Allocation | | | Aggarwal: 13; Tutorials: Hong, Blei |
| | 14 Oct | Dimensionality Reduction 1: PCA, CCA | #2 due | | Bishop: 12; HKP: 8; HTF: 14; HKP: 3; Ng: 10,11; FG: 5 |
| 7 | 19 Oct | Dimensionality Reduction 2: SVD, Random Projections, t-SNE | #3 out | | FG: 5 |
| | 21 Oct | Recommender Systems | | | Bishop: 2; HKP: 9, 13; HTF: 13; Aggarwal: 18; FG: 5,8 |
| 8 | 26 Oct | Midterm exam | | | |
| | 28 Oct | Project Proposal presentations | | Proposals due | |
| 9 | 04 Nov | Frequent Pattern Mining 1: Apriori | | | HKP: 6; HTF: 14; Aggarwal: 4,5; TSK: 6 |
| | 07 Nov | Frequent Pattern Mining 2: PCY, FP-Growth | | | HKP: 6; HTF: 14; Aggarwal: 4,5; TSK: 6 |
| 10 | 09 Nov | Link Analysis: PageRank, TrustRank | #4 out | | LRU: 5; Aggarwal: 18.4 |
| | 11 Nov | (Veterans Day) | #3 due | | |
| 11 | 16 Nov | Time Series: Autoregressive Models, Hidden Markov Models | | | Aggarwal: 14.3; Bishop: 13.1-2; HKP: 13.1.1 |
| | 18 Nov | Community Detection: Betweenness, Spectral Clustering | | | LRU: 10 |
| 12 | 23 Nov | (Thanksgiving Holiday) | | | |
| | 25 Nov | (Thanksgiving Holiday) | | | |
| 13 | 30 Nov | Bonus Topic: Deep Learning | #4 due | | |
| | 02 Dec | Review | | | |
| 14 | 07 Dec | (No Class) | | | |
| | 09 Dec | Final Exam | | | |
| 15 | 14 Dec | Project Presentations | | Reports due | |
| 16 | 19 Dec | (Final grades posted) | | | |