CS6220 - Fall 2016 - Section 3 - Data Mining Techniques

Midterm Exam Topic List

Note: These topics are also included in the final exam. Additional topics for the final exam can be found here.

Linear Regression

  • Problem definition
  • Ordinary Least Squares, Pseudo-inverse (see the sketch after this list)
    • Implementation
    • Computational complexity
  • Least Mean Squares, Gradient descent
    • Implementation
    • Effect of step size
    • Computational Complexity
  • Regularization
    • Ridge regression (L2-regularization)
      • Closed form solution
    • LASSO (L1-regularization)
  • Probabilistic interpretation
    • MSE minimization as maximum likelihood estimation
    • Regularized regression as maximum a posteriori estimation
  • Polynomial / basis function regression
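
A minimal NumPy sketch of the closed-form solutions listed above (OLS via the pseudo-inverse, ridge regression via its regularized normal equations); the data and parameter values are illustrative, not from the course:

    import numpy as np

    def ols_fit(X, y):
        # w = (X^T X)^{-1} X^T y, computed via the pseudo-inverse for numerical stability
        return np.linalg.pinv(X) @ y

    def ridge_fit(X, y, lam):
        # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Toy example with a bias column appended to the design matrix
    X = np.hstack([np.random.randn(100, 3), np.ones((100, 1))])
    y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * np.random.randn(100)
    print(ols_fit(X, y), ridge_fit(X, y, lam=0.1))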

Model Selection

  • Curse of Dimensionality
    • Number of points in center vs periphery as a function of number of dimensions
  • Bias-variance trade-off
  • Overfitting vs Underfitting
  • Learning Curve
  • Cross validation (see the sketch after this list)
  • Model evidence / marginal likelihood
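
A minimal sketch of k-fold cross validation, assuming a generic fit/score interface (the function names here are placeholders, not a specific library API):

    import numpy as np

    def k_fold_cv(X, y, fit, score, k=5, seed=0):
        # Shuffle indices and split them into k folds
        idx = np.random.default_rng(seed).permutation(len(y))
        folds = np.array_split(idx, k)
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[train], y[train])                  # train on k-1 folds
            scores.append(score(model, X[test], y[test]))    # evaluate on the held-out fold
        return np.mean(scores)

Model selection then picks the setting (e.g. the regularization strength or polynomial degree) with the best average held-out score.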

Probability

  • Measure theory (high level)
    • Sample space, event space, probability measures
    • Axioms of probability
  • Probability mass functions vs probability density functions
  • Expected values, Mean, Variance, Covariance
  • Joint, conditional and marginal probabilities
  • Sum and Product rules
  • Bayes Rule (see the worked example after this list)
  • Conditional independence
  • Independent and identically distributed variables
  • Central limit theorem
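
A small worked example of Bayes' rule with made-up numbers (1% prevalence, 95% sensitivity, 90% specificity), just to illustrate the computation:

    p_d = 0.01                    # prior P(disease)
    p_pos_given_d = 0.95          # likelihood P(+ | disease)
    p_pos_given_not_d = 0.10      # P(+ | no disease) = 1 - specificity

    # Marginal P(+) via the sum and product rules
    p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

    # Posterior P(disease | +) via Bayes' rule
    print(p_pos_given_d * p_d / p_pos)   # roughly 0.088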

Distributions / Densities

  • Bernoulli
  • Binomial
  • Discrete
  • Multinomial
  • Normal
  • Multivariate Normal
    • Spherical
    • Diagonal
    • General relationship between Σ and shape (in 2D)
  • Beta
  • Dirichlet
  • Conjugacy
    • Beta and Binomial (see the sketch after this list)
    • Dirichlet and Multinomial
    • Gaussian likelihood with unknown mean and Gaussian prior
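
A minimal sketch of Beta-Binomial conjugacy: with a Beta(a, b) prior on the success probability and k successes in n trials, the posterior is Beta(a + k, b + n - k). The numbers are arbitrary:

    a, b = 2.0, 2.0     # Beta(a, b) prior on the success probability
    n, k = 10, 7        # observed: 7 successes out of 10 trials

    # Conjugate update: posterior is Beta(a + k, b + n - k)
    a_post, b_post = a + k, b + (n - k)
    print(a_post / (a_post + b_post))   # posterior mean, here 9/14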

Information Theory

  • Entropy (see the sketches after this list)
  • Mutual Information
  • KL Divergence
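
Minimal NumPy sketches of entropy and KL divergence for discrete distributions (natural log is an arbitrary choice of base here):

    import numpy as np

    def entropy(p):
        # H(p) = -sum_i p_i log p_i, with 0 log 0 treated as 0
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def kl_divergence(p, q):
        # KL(p || q) = sum_i p_i log(p_i / q_i); assumes q_i > 0 wherever p_i > 0
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    print(entropy([0.5, 0.5]))                     # ln 2
    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # positive; 0 only when p == q

Mutual information can then be computed as the KL divergence between a joint distribution and the product of its marginals.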

Data pre-processing

  • Normalization via Min-max, Z-score, and Scaling (see the sketches after this list)
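
Minimal sketches of min-max and z-score normalization applied column-wise to a data matrix; purely illustrative:

    import numpy as np

    def min_max(X):
        # Rescale each feature (column) to the range [0, 1]
        lo, hi = X.min(axis=0), X.max(axis=0)
        return (X - lo) / (hi - lo)

    def z_score(X):
        # Center each feature and divide by its standard deviation
        return (X - X.mean(axis=0)) / X.std(axis=0)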

Distance Metrics

  • Euclidean, Manhattan and Minkowski distance (see the sketches after this list)
  • Hamming distance
  • Edit distance
  • Properties of a distance measure
    • Relationship to inner products
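
Minimal sketches of the distance measures above; Minkowski distance with p = 1 reduces to Manhattan and with p = 2 to Euclidean:

    import numpy as np

    def minkowski(x, y, p):
        # (sum_i |x_i - y_i|^p)^(1/p)
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def hamming(x, y):
        # Number of positions at which two equal-length sequences differ
        return int(np.sum(np.asarray(x) != np.asarray(y)))

    x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
    print(minkowski(x, y, 1), minkowski(x, y, 2), hamming([0, 1, 1], [1, 1, 0]))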

Optimization (high level)

  • Gradient descent
  • Stochastic gradient descent (see the sketch after this list)
    • Batch size
    • Step size schedule
    • Robbins-Monro conditions
  • Convex Optimization
    • Convex sets and functions
    • Lagrange duality
      • Lagrangian
      • Generalized Lagrangian
      • Primal problem, dual problem, duality gap
      • KKT conditions at optimum, dual complementarity
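
A minimal sketch of mini-batch stochastic gradient descent with a decaying step-size schedule eta_t = eta_0 / (1 + t), which satisfies the Robbins-Monro conditions; it is applied here to the mean squared error of linear regression, and all settings are illustrative:

    import numpy as np

    def sgd_linear_regression(X, y, batch_size=10, eta0=0.1, epochs=50, seed=0):
        rng = np.random.default_rng(seed)
        w, t = np.zeros(X.shape[1]), 0
        for _ in range(epochs):
            perm = rng.permutation(len(y))              # reshuffle each epoch
            X, y = X[perm], y[perm]
            for i in range(0, len(y), batch_size):
                Xb, yb = X[i:i + batch_size], y[i:i + batch_size]
                grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)   # MSE gradient on the mini-batch
                t += 1
                w -= (eta0 / (1 + t)) * grad                  # decaying step size
        return w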

Kernels (high level)

  • Definition of Hilbert space and inner product
  • Sum rule, product rule, mapping rule, Taylor series kernels
  • Polynomial, Radial Basis Function, Squared Exponential, Automatic Relevance Determination (see the sketches after this list)
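
Minimal sketches of the polynomial and RBF / squared exponential kernels; parameter names and defaults are illustrative:

    import numpy as np

    def polynomial_kernel(x, z, degree=3, c=1.0):
        # k(x, z) = (x . z + c)^degree
        return (np.dot(x, z) + c) ** degree

    def rbf_kernel(x, z, lengthscale=1.0):
        # Squared exponential: k(x, z) = exp(-||x - z||^2 / (2 * lengthscale^2))
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * lengthscale ** 2))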

Classification

  • Problem definition
  • k-Nearest Neighbors
    • Definition
    • Time and Storage complexity
    • Performance vs dimensionality of feature space
  • Linear classifiers
    • Perceptron
      • Definition
      • Failure modes
    • Logistic regression
      • Definition
      • Gradient of objective (see the sketch after this list)
    • Loss functions
      • Square loss, logistic loss, hinge loss
      • Relationship to perceptron, LDA, logistic regression and SVM
  • Generative Learning Algorithms
    • Naive Bayes
      • Maximum likelihood parameter estimation
        • Failure modes
      • Maximum posterior parameter estimation (Smoothing)
    • Linear Discriminant Analysis
      • Problem definition
      • Probabilistic interpretation
        • Maximum likelihood estimates for parameters
        • Maximum posterior interpretation for class prediction
      • Implicit assumptions about feature distributions and failure modes
      • Extension to quadratic discriminant analysis
      • Relationship to logistic regression and SVMs
        • Similarities and differences
        • Advantages and disadvantages
    • Support Vector Machines
      • Margin of a classification boundary
        • Relationship to hinge loss
      • Solution as convex optimization problem
        • Dual problem, support vectors
      • Extension to non-separable data
      • Relationship to logistic regression, LDA
      • Extension to nonlinear case
        • Feature map
        • Kernel trick
          • Effect on computational complexity
        • Dual problem with kernels
  • Decision trees
    • Algorithm for construction
    • Split criteria
      • Entropy, information gain
      • Gini index
    • Bias-variance trade-off in relation to depth
    • Relationship to k-Nearest Neighbors
  • Ensemble methods
    • Bagging vs Boosting
    • Random Forests
      • Definition
      • Parameters
    • Gradient Boosting
      • Definition
      • Parameters
  • Performance measures
    • Confusion matrix
    • True/False Positives/Negatives
    • Decision-theoretic interpretation of biasing the odds in terms of a loss matrix
    • Precision, Recall, F1 Score, Specificity
    • Precision-recall and ROC curves
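
A minimal sketch of logistic regression trained by gradient descent on the (average) negative log likelihood, matching the "gradient of objective" item above; labels are assumed to be 0/1 and the step size and iteration count are illustrative:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def fit_logistic_regression(X, y, eta=0.1, iters=1000):
        # Gradient of the average negative log likelihood: X^T (sigmoid(Xw) - y) / n
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            w -= eta * X.T @ (sigmoid(X @ w) - y) / len(y)
        return w

    def predict(w, X):
        # Threshold the predicted class probability at 0.5
        return (sigmoid(X @ w) >= 0.5).astype(int)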

Clustering

  • Hierarchical (linkage-based methods)
    • Dendrogram
    • Agglomerative vs Divisive Clustering
    • Single, Average and Complete Linkage
    • Strengths and weaknesses
  • K-means and K-medoids (centroid-based methods)
    • Algorithm definition (see the sketch at the end of this section)
    • Sum of Squared Errors (SSE)
    • Time complexity
    • Initialization and local minima
    • Choosing K, elbow finding
    • Limitations: Different sizes, different densities, non-globular shapes
  • DBSCAN (density-based methods)
    • Method definition
      • EPS and MinPts parameters
      • Core points, border points, noise points
      • Density-reachable points, density-connected points
    • Time and space complexity
    • Methodology for determining EPS and MinPts
    • Strengths and weaknesses
  • Gaussian Mixture Models (distribution-based methods)
    • Generative model
      • Relationship to LDA and QDA
    • Expectation Maximization
      • Calculation of posterior weights
      • Calculation of maximum likelihood parameters
      • Lower bound on log likelihood
    • Advantages and disadvantages
  • Quality Criteria
    • External vs Internal
    • Mutual information (External)
      • Definition
      • Minimum, maximum, and properties under label permutation
    • Scatter criteria (Internal)
      • Within cluster and between cluster covariances
      • Trace criteria
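
A minimal sketch of Lloyd's algorithm for k-means, reporting the sum of squared errors (SSE) objective it decreases; the random initialization is what makes the result sensitive to local minima:

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
        for _ in range(iters):
            # Assignment step: each point joins its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid moves to the mean of its assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        sse = np.sum((X - centroids[labels]) ** 2)   # sum of squared errors
        return labels, centroids, sse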