CS6220 - Fall 2016 - Section 3 - Data Mining Techniques

Final Exam Topic List

Note: Midterm topics are also included in the final exam, but not repeated here.

Topic Models

  • Bag of words representations of documents
  • Multinomial mixture models
  • Latent Dirichlet Allocation
    • Generative model
    • Expectation Maximization (PLSA/PLSI)
    • Variational inference (high level)
  • Perplexity
  • Extensions (high level)
    • Dynamic Topic Models
    • Supervised LDA
    • Ideal Point Topic Models
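
The bag-of-words representation that these models build on can be sketched in a few lines (the example document is illustrative):

```python
from collections import Counter

def bag_of_words(doc):
    """Represent a document as unordered token counts; word order is discarded."""
    return Counter(doc.lower().split())

bow = bag_of_words("the cat sat on the mat")
```

Multinomial mixtures and LDA both model documents as draws over such count vectors, which is why two documents with the same words in different order are indistinguishable to them.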

Dimensionality Reduction

  • Principal Component Analysis
    • Interpretation as minimization of reconstruction error
    • Interpretation as maximization of captured variance
    • Interpretation as EM in generative model
    • Computation using eigenvalue decomposition
    • Computation using SVD
    • Applications (high-level)
      • Eigenfaces
      • Latent Semantic Analysis
        • Relationship to LDA
      • Multi-task learning
    • Kernel PCA
      • Direct method vs modular method
  • Canonical Correlation Analysis
    • Objective
    • Relationship to PCA
    • Regularized CCA
      • Motivation
      • Objective
  • Singular Value Decomposition
    • Definition
    • Complexity
    • Relationship to PCA
  • Random Projections
    • Johnson-Lindenstrauss Lemma
  • Stochastic Neighbor Embeddings
    • Similarity definition in original space
    • Similarity definition in lower dimensional space
    • Definition of objective in terms of KL divergence
    • Gradient of objective
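
The PCA-via-SVD computation above can be sketched as follows (the data here is random and purely illustrative):

```python
import numpy as np

def pca_svd(X, k):
    """Top-k principal components via SVD of the centered data matrix.
    Rows of Vt are the principal directions; the singular values relate
    to captured variance (var_i = s_i**2 / (n - 1))."""
    Xc = X - X.mean(axis=0)          # PCA requires centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k]                       # (k, d) matrix of principal directions
    return Xc @ W.T, W               # projected data and components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, W = pca_svd(X, 2)
```

This is the SVD route; the eigenvalue-decomposition route instead diagonalizes the sample covariance `Xc.T @ Xc / (n - 1)` and yields the same directions.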

Recommender Systems

  • Motivation: The long tail of product popularity
  • Content-based filtering
    • Formulation as a regression problem
    • User and item bias
    • Temporal effects
  • Collaborative filtering
    • (user, user) vs (item, item) similarity
      • Pros and cons of each approach
    • Parzen-window CF
    • Similarity measures
      • Pearson correlation coefficient
        • Regularization for small support
        • Regularization for small neighborhood
      • Jaccard similarity
        • Regularization
      • Observed/expected ratio
        • Regularization
  • Matrix Factorization
    • Formulation of recommender systems as matrix factorization
    • Solution through alternating least squares
    • Solution through stochastic gradient descent
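
The matrix factorization formulation with the SGD solution can be sketched as below (the rating matrix, rank, learning rate, and epoch count are all illustrative choices; zeros mark unobserved entries):

```python
import numpy as np

def mf_sgd(R, k=2, lr=0.05, reg=0.02, epochs=1000, seed=0):
    """Fit R ~ U @ V.T on the observed (nonzero) entries by stochastic
    gradient descent with L2 regularization on the factors."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(R.shape[0], k))
    V = 0.1 * rng.normal(size=(R.shape[1], k))
    observed = list(zip(*np.nonzero(R)))
    for _ in range(epochs):
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]          # residual on one rating
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * U[i] - reg * V[j])
    return U, V

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
U, V = mf_sgd(R)
```

Alternating least squares would instead fix `V` and solve exactly for `U` (and vice versa), which is easier to parallelize but touches all observed entries per sweep.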

Frequent Itemsets & Association Rules

  • Problem formulation and examples
    • Customer purchasing
    • Plagiarism detection
  • Frequent Itemset
    • Definition of (fractional) support
  • Association Rules
    • Confidence
    • Measures of interest
      • Added value
      • Mutual information
  • A-priori
    • Base principle
    • Algorithm
    • Self-joining and pruning of candidate sets
    • Maximal vs closed itemsets
    • Hash tree implementation for subset matching
    • I/O and memory limited steps
    • PCY method for reducing candidate sets
  • FP-Growth
    • FP-tree construction
    • Pattern mining using conditional FP-trees
  • Performance of A-priori vs FP-growth
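
The level-wise A-priori procedure, including self-joining and pruning of candidate sets, can be sketched as follows (the transactions and support threshold are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining using the A-priori principle:
    every subset of a frequent itemset must itself be frequent."""
    n = len(transactions)
    freq = {}
    candidates = [frozenset([i]) for i in {i for t in transactions for i in t}]
    k = 1
    while candidates:
        # Count the fractional support of each candidate itemset
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        freq.update(level)
        # Self-join frequent k-itemsets into (k+1)-candidates, then prune
        # any candidate that has an infrequent k-subset
        keys = list(level)
        candidates, seen = [], set()
        for a in keys:
            for b in keys:
                u = a | b
                if len(u) == k + 1 and u not in seen:
                    seen.add(u)
                    if all(frozenset(s) in level for s in combinations(u, k)):
                        candidates.append(u)
        k += 1
    return freq

T = [frozenset(t) for t in ({"milk", "bread"}, {"milk", "diapers"},
                            {"milk", "bread", "diapers"}, {"bread"})]
F = apriori(T, 0.5)
```

In this toy run {bread, diapers} falls below support 0.5, so the pruning step eliminates {milk, bread, diapers} without ever counting it.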

PageRank

  • Recursive formulation
    • Interpretation of links as weighted votes
    • Interpretation as an equilibrium condition in a flow model for a population of surfers (inflow equals outflow)
    • Interpretation as visit frequency of a random surfer
  • Probabilistic model
  • Stochastic matrices
  • Power iteration
  • Dead ends (and fix)
  • Spider traps (and fix)
  • PageRank equation
    • Extension to topic-specific PageRank
    • Extension to TrustRank
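
Power iteration with teleportation, which implements the spider-trap fix, can be sketched as below (the link matrix is a toy three-page web; dead-end columns are assumed to have been replaced by uniform columns beforehand):

```python
import numpy as np

def pagerank(M, beta=0.85, tol=1e-10):
    """Power iteration on the Google matrix: r = beta * M r + (1 - beta) e / n.
    M is column-stochastic: M[i, j] = 1 / outdeg(j) if page j links to page i.
    Teleportation with probability 1 - beta lets the random surfer escape
    spider traps."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)              # start from the uniform distribution
    while True:
        r_next = beta * M @ r + (1.0 - beta) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

# Toy 3-page web with links 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
r = pagerank(M)
```

Page 2 receives links from both other pages, so it ends up with the largest score; the scores remain a probability distribution at every iteration.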

Time Series

  • Time series smoothing
    • Moving average
    • Exponential
  • Definition of a stationary time series
  • Autocorrelation
  • AR(p), MA(q), ARMA(p,q) and ARIMA(p,d,q) models
  • Hidden Markov Models
    • Relationship of dynamics to the random surfer in PageRank
    • Relationship to mixture models
    • Forward-backward algorithm (see notes)
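
The exponential smoothing recurrence from this section, s_t = alpha * x_t + (1 - alpha) * s_{t-1}, can be sketched as follows (the series and smoothing constant are illustrative):

```python
def exponential_smoothing(series, alpha):
    """Exponentially weighted moving average with s_0 = x_0.
    Larger alpha tracks the series more closely; smaller alpha smooths more."""
    s = [series[0]]
    for x in series[1:]:
        s.append(alpha * x + (1 - alpha) * s[-1])
    return s

smoothed = exponential_smoothing([2.0, 4.0, 6.0, 4.0], 0.5)
```

Unlike a moving average with a fixed window, every past observation contributes here, with geometrically decaying weight.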

Social Networks

  • Centrality measures
    • Betweenness
    • Closeness
    • Degree
  • Girvan-Newman algorithm for clustering
    • Calculating betweenness
    • Selecting number of clusters using the modularity
  • Spectral clustering
    • Graph cuts
    • Normalized cuts
    • Laplacian Matrix
      • Definition in terms of Adjacency and Degree matrix
      • Properties of eigenvectors
        • Eigenvalues are >= 0
        • First eigenvector
          • Eigenvalue is 0
          • Eigenvector is [1 … 1]^T
        • Second eigenvector (Fiedler vector)
          • Elements sum to 0
          • Eigenvalue equals the sum of squared differences of eigenvector entries across edges (for a unit-norm eigenvector)
    • Use of the second eigenvector (Fiedler vector) to find the normalized cut
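
The spectral clustering pipeline above (Laplacian from adjacency and degree matrices, then a sign split of the Fiedler vector) can be sketched on a toy graph of two triangles joined by a single edge (all data illustrative):

```python
import numpy as np

# Adjacency matrix: triangles {0, 1, 2} and {3, 4, 5} bridged by edge (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))      # degree matrix
L = D - A                       # unnormalized graph Laplacian

vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
fiedler = vecs[:, 1]            # second eigenvector (Fiedler vector)
labels = (fiedler > 0).astype(int)  # sign split approximates the best cut
```

The first eigenvector is constant with eigenvalue 0, carrying no cluster information; the Fiedler vector's entries sum to zero and change sign exactly across the bridge, recovering the two triangles.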