CS6220 - Fall 2016 - Section 3 - Data Mining Techniques
Final Exam Topic List
Note: Midterm topics are also included in the final exam, but are not repeated here.
Topic Models
Bag of words representations of documents
Multinomial mixture models
Latent Dirichlet Allocation
Generative model
Expectation Maximization (PLSA/PLSI)
Variational inference (high level)
Perplexity
Extensions (high level)
Dynamic Topic Models
Supervised LDA
Ideal Point Topic Models
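Perplexity, listed above as an evaluation measure for topic models, is just the exponentiated negative average per-token log-likelihood on held-out text; a minimal sketch (the example token probabilities are made up for illustration):

```python
import math

def perplexity(log_probs):
    """exp of the negative mean per-token log-likelihood; lower is better."""
    return math.exp(-sum(log_probs) / len(log_probs))

# sanity check: a uniform model over a 4-word vocabulary has perplexity 4
uniform_lp = [math.log(1 / 4)] * 10
```

In practice the per-token log-probabilities come from the trained model evaluated on held-out documents.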
Dimensionality Reduction
Principal Component Analysis
Interpretation as minimization of reconstruction error
Interpretation as maximization of captured variance
Interpretation as EM in generative model
Computation using eigenvalue decomposition
Computation using SVD
Applications (high level)
Eigenfaces
Latent Semantic Analysis
Relationship to LDA
Multi-task learning
Kernel PCA
Direct method vs modular method
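The two computation routes listed above, eigendecomposition of the covariance matrix and SVD of the centered data matrix, yield the same principal components; a small NumPy sketch comparing them (the data here is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Xc = X - X.mean(axis=0)                   # center the data first

# route 1: SVD of the centered data; principal directions are rows of Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# route 2: eigendecomposition of the sample covariance matrix
cov = Xc.T @ Xc / (len(X) - 1)
evals, evecs = np.linalg.eigh(cov)        # eigenvalues in ascending order

# variance captured by each component equals S**2 / (n - 1)
var_svd = S**2 / (len(X) - 1)
```

The singular values squared (scaled by n − 1) match the covariance eigenvalues, which is exactly the "captured variance" interpretation above.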
Canonical Correlation Analysis
Objective
Relationship to PCA
Regularized CCA
Motivation
Objective
Singular Value Decomposition
Definition
Complexity
Relationship to PCA
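As a quick check of the definition: any real matrix factors as A = UΣVᵀ with orthonormal U, V and non-negative, non-increasing singular values (the small matrix below is arbitrary):

```python
import numpy as np

A = np.array([[3., 1.],
              [1., 3.],
              [0., 2.]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# reconstructing A from the factors recovers it exactly
A_hat = U @ np.diag(S) @ Vt
```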
Random Projections
Johnson-Lindenstrauss Lemma
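The Johnson-Lindenstrauss lemma can be demonstrated empirically: a scaled Gaussian random projection approximately preserves all pairwise distances. A sketch with arbitrary dimensions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 300
X = rng.standard_normal((n, d))

# Gaussian projection scaled by 1/sqrt(k) preserves squared norms in expectation
P = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ P

def pdists(Z):
    """All pairwise Euclidean distances."""
    diffs = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diffs**2).sum(-1))

D0, D1 = pdists(X), pdists(Y)
iu = np.triu_indices(n, 1)
distortion = np.abs(D1[iu] / D0[iu] - 1)  # relative distance error per pair
```

With k on the order of log(n)/ε², the lemma guarantees all distortions stay below ε with high probability.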
Stochastic Neighbor Embeddings
Similarity definition in original space
Similarity definition in lower dimensional space
Definition of objective in terms of KL divergence
Gradient of objective
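For reference, the SNE similarity definitions, KL objective, and gradient listed above (the standard formulas from Hinton and Roweis, with x the original points, y the embedded points, and σᵢ a per-point bandwidth):

```latex
p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)},
\qquad
q_{j|i} = \frac{\exp\!\left(-\|y_i - y_j\|^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\|y_i - y_k\|^2\right)}

C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i)
  = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}},
\qquad
\frac{\partial C}{\partial y_i}
  = 2 \sum_j \left( p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j} \right)(y_i - y_j)
```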
Recommender Systems
Motivation: The long tail of product popularity
Content-based filtering
Formulation as a regression problem
User and item bias
Temporal effects
Collaborative filtering
(user, user) vs (item, item) similarity
Pros and cons of each approach
Parzen-window CF
Similarity measures
Pearson correlation coefficient
Regularization for small support
Regularization for small neighborhood
Jaccard similarity
Regularization
Observed/expected ratio
Regularization
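The similarity measures above can be sketched directly; note the shrinkage factor n/(n + λ) used here is one common way to regularize for small support, and λ is a hypothetical tuning constant:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation over the items two users co-rated."""
    return float(np.corrcoef(np.asarray(a, float), np.asarray(b, float))[0, 1])

def shrunk_pearson(a, b, lam=5.0):
    """Shrink the correlation toward 0 when the support n is small."""
    n = len(a)
    return n / (n + lam) * pearson(a, b)

def jaccard(A, B):
    """|A ∩ B| / |A ∪ B| for implicit-feedback item sets."""
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)
```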
Matrix Factorization
Formulation of recommender systems as matrix factorization
Solution through alternating least squares
Solution through stochastic gradient descent
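The alternating least squares solution above can be sketched on a tiny rating matrix (the ratings, rank, and regularization strength below are arbitrary example values): fixing the item factors makes each user update a ridge regression, and vice versa.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], float)
M = R > 0                                  # mask of observed ratings
k, lam = 2, 0.1
U = rng.standard_normal((4, k))            # user factors
V = rng.standard_normal((4, k))            # item factors

def loss():
    """Squared error on observed entries only."""
    return (((R - U @ V.T) * M) ** 2).sum()

errs = [loss()]
for _ in range(20):
    for i in range(4):                     # user step: ridge regression, V fixed
        idx = M[i]
        A = V[idx].T @ V[idx] + lam * np.eye(k)
        U[i] = np.linalg.solve(A, V[idx].T @ R[i, idx])
    for j in range(4):                     # item step: ridge regression, U fixed
        idx = M[:, j]
        A = U[idx].T @ U[idx] + lam * np.eye(k)
        V[j] = np.linalg.solve(A, U[idx].T @ R[idx, j])
    errs.append(loss())
```

Each half-step solves a convex problem exactly, so the training error is non-increasing across iterations.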
Frequent Itemsets & Association Rules
Problem formulation and examples
Customer purchasing
Plagiarism detection
Frequent Itemset
Definition of (fractional) support
Association Rules
Confidence
Measures of interest
Added value
Mutual information
A-priori
Base principle
Algorithm
Self-joining and pruning of candidate sets
Maximal vs closed itemsets
Hash tree implementation for subset matching
I/O and memory limited steps
PCY method for reducing candidate sets
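The base A-priori algorithm, including the self-join and subset-pruning steps above, fits in a short sketch (the transaction database and support threshold are toy examples; real implementations use hash trees rather than this brute-force support count):

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"},
                {"b", "c"}, {"a", "b", "c"}]
minsup = 3                                  # absolute support threshold

def support(itemset):
    """Count transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions)

# level 1: frequent single items
items = sorted({x for t in transactions for x in t})
Lk = [frozenset([x]) for x in items if support(frozenset([x])) >= minsup]
frequent = list(Lk)

k = 2
while Lk:
    # self-join: unions of frequent (k-1)-itemsets that form k-itemsets
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # prune: every (k-1)-subset must itself be frequent (A-priori principle)
    prev = set(Lk)
    candidates = [c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))]
    Lk = [c for c in candidates if support(c) >= minsup]
    frequent += Lk
    k += 1
```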
FP-Growth
FP-tree construction
Pattern mining using conditional FP-trees
Performance of A-priori vs FP-growth
Link Analysis
PageRank
Recursive formulation
Interpretation of links as weighted votes
Interpretation as equilibrium condition in flow model for population of surfers (inflow equal to outflow)
Interpretation as visit frequency of random surfer
Probabilistic model
Stochastic matrices
Power iteration
Dead ends (and fix)
Spider traps (and fix)
PageRank Equation
Extension to topic-specific PageRank
Extension to TrustRank
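Power iteration with teleportation, which is the standard fix for both dead ends and spider traps, can be sketched on a hypothetical three-page web (0→1, 0→2, 1→2, 2→0):

```python
import numpy as np

# column-stochastic link matrix: M[j, i] = 1 / outdeg(i) if i links to j
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
beta, n = 0.85, 3                 # damping factor and number of pages
r = np.full(n, 1 / n)             # start from the uniform distribution
for _ in range(100):
    # PageRank equation: follow links w.p. beta, teleport w.p. 1 - beta
    r = beta * M @ r + (1 - beta) / n
```

Page 2 receives links from both other pages and ends up ranked highest.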
Time Series
Time series smoothing
Moving average
Exponential
Definition of a stationary time series
Autocorrelation
AR(p), MA(q), ARMA(p,q) and ARIMA(p,d,q) models
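The two smoothing methods above are short enough to write out directly; a sketch (window size, smoothing constant, and seeding convention are illustrative choices):

```python
def exponential_smoothing(x, alpha):
    """s_t = alpha * x_t + (1 - alpha) * s_{t-1}, seeded with s_0 = x_0."""
    s = [x[0]]
    for xt in x[1:]:
        s.append(alpha * xt + (1 - alpha) * s[-1])
    return s

def moving_average(x, w):
    """Trailing moving average over a window of up to w points."""
    return [sum(x[max(0, i - w + 1):i + 1]) / len(x[max(0, i - w + 1):i + 1])
            for i in range(len(x))]
```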
Hidden Markov Models
Relationship of dynamics to random surfer in page rank
Relationship to mixture models
Forward-backward algorithm (see notes)
Social Networks
Centrality measures
Betweenness
Closeness
Degree
Girvan-Newman algorithm for clustering
Calculating betweenness
Selecting number of clusters using the modularity
Spectral clustering
Graph cuts
Normalized cuts
Laplacian Matrix
Definition in terms of Adjacency and Degree matrix
Properties of eigenvectors
Eigenvalues are >= 0
First eigenvector
Eigenvalue is 0
Eigenvector is [1 … 1]^T
Second eigenvector (Fiedler vector)
Elements sum to 0
Eigenvalue is the normalized sum of squared differences across edges
Use of first eigenvector to find normalized cut
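The Laplacian properties listed above can all be verified numerically on a small example; here for an unweighted path graph 0–1–2 (chosen purely for illustration):

```python
import numpy as np

# adjacency and degree matrices of the path graph 0-1-2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], float)
D = np.diag(A.sum(axis=1))
L = D - A                               # Laplacian: L = D - A

vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
```

The smallest eigenvalue is 0 with a constant eigenvector, all eigenvalues are non-negative, and the second eigenvector (the Fiedler vector) has entries summing to 0, which is what makes it usable for finding a normalized cut.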