MGMT 69000 Fall 2016

PhD Seminar in Analytics: Topics in High-dimensional Data Analysis

Friday 3:15pm-6:15pm, Jerry S Rawls Hall 2079


Overview and Objectives

Online social networks and e-commerce platforms such as Facebook, Twitter, Amazon, and Bitcoin generate enormous amounts of data. Likewise, with the
advent of high-throughput measurement methods in biology, a large amount of biological data has accumulated, and many other sources, such as
transportation networks, power networks, and sensor networks, contribute further. These data contain a wealth of information, and it is often desirable
to extract that information for various purposes, for instance to predict user preferences, discover the causes of disease, or forecast traffic patterns.
However, the data are often noisy and voluminous; extracting useful information from them therefore requires highly efficient algorithms that can
process large amounts of data and detect tenuous statistical signatures.

This research-oriented course is designed for graduate students interested in theoretical aspects of high-dimensional data analysis.
Two central questions will be addressed in this course:

  • How do we characterize the threshold above which the task of extracting information is fundamentally possible and below which it is fundamentally impossible?
  • How do we develop computationally efficient algorithms that attain this fundamental limit, or understand why none exists?

The course also aims to familiarize students with advanced analytical tools such as concentration of measure, probabilistic methods, information-theoretic arguments,
convex duality theory, and random matrix theory, by working through a number of emerging research topics in high-dimensional data analysis, including data clustering,
community detection, submatrix localization, sparse PCA, learning graphical models, and fast algorithms for linear algebra.



    Prerequisites: Mathematical maturity in linear algebra and probability is required.
    Some familiarity with basic optimization and algorithms is also recommended.

    Instructor: Prof. Jiaming Xu, Krannert Building 431, xu972@purdue.edu , Office Hours: TBD


    Credit: 3 hours




    Course Outline

    Part I: Clustering

     

    • Introduction: Examples of high-dimensional data analysis and applications

     

    • k-means clustering: Optimization formulation of k-means, convergence of k-means, failure cases of k-means,
      model-based formulation of k-means, maximum likelihood estimation and EM algorithm for Gaussian mixtures, soft k-means
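      As a concrete companion to the optimization formulation above, here is a minimal Python sketch of Lloyd's iteration
      (a sketch only; all names and defaults are illustrative):

        import numpy as np

        def lloyd_kmeans(X, k, n_iters=100, seed=0):
            """Minimal Lloyd's iteration for k-means on the rows of X (n x d)."""
            rng = np.random.default_rng(seed)
            # Initialize centers by sampling k distinct data points.
            centers = X[rng.choice(len(X), size=k, replace=False)]
            for _ in range(n_iters):
                # Assignment step: each point joins its nearest center.
                dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
                labels = dists.argmin(axis=1)
                # Update step: each center moves to the mean of its cluster
                # (an empty cluster keeps its previous center).
                centers_new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                        else centers[j] for j in range(k)])
                if np.allclose(centers_new, centers):
                    break
                centers = centers_new
            return centers, labels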

     

    • Review of linear algebra: eigenvalue decomposition of symmetric matrices, singular value decomposition and best-fit subspaces,
      Frobenius norm, spectral norm, best low-rank matrix approximation, spectral relaxations of k-means, principal component analysis
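      The best low-rank approximation result above (the Eckart-Young theorem) is easy to verify numerically; a minimal
      Python sketch, with all names illustrative:

        import numpy as np

        def best_rank_k(A, k):
            """Best rank-k approximation of A in Frobenius norm, via truncated SVD."""
            U, s, Vt = np.linalg.svd(A, full_matrices=False)
            return (U[:, :k] * s[:k]) @ Vt[:k, :]

        # Eckart-Young: the Frobenius error equals the norm of the tail singular values.
        rng = np.random.default_rng(0)
        A = rng.standard_normal((8, 5))
        s = np.linalg.svd(A, compute_uv=False)
        err = np.linalg.norm(A - best_rank_k(A, 2), 'fro')
        assert np.isclose(err, np.sqrt((s[2:] ** 2).sum()))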

     

    • Concentration inequalities: Markov's inequality, Chebyshev's inequality, Chernoff bounds, sub-Gaussian random variables,
      sub-exponential random variables, Bernstein's inequality, symmetrization technique, Gaussian isoperimetric inequality
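      As a worked instance of the Chernoff method above: if X is sub-Gaussian with variance proxy sigma^2, optimizing the
      exponent over lambda yields (in LaTeX)

        % assuming \mathbb{E}\, e^{\lambda (X - \mathbb{E}X)} \le e^{\lambda^2 \sigma^2 / 2} for all \lambda:
        \mathbb{P}\{X - \mathbb{E}X \ge t\}
          \le \inf_{\lambda > 0} e^{-\lambda t}\, \mathbb{E}\, e^{\lambda (X - \mathbb{E}X)}
          \le \inf_{\lambda > 0} e^{-\lambda t + \lambda^2 \sigma^2 / 2}
          = e^{-t^2 / (2\sigma^2)}, \qquad t \ge 0,
        % with the infimum attained at \lambda = t / \sigma^2.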

     

    • Matrix concentration inequalities: Wigner's semicircle law, Gaussian comparison inequalities, epsilon-net method, matrix Bernstein inequality
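      For reference, the matrix Bernstein inequality above takes the following form in the symmetric case (following Tropp):

        % X_1, \dots, X_n independent, mean-zero, symmetric d \times d matrices,
        % \|X_i\| \le L almost surely, and v = \big\| \sum_i \mathbb{E} X_i^2 \big\|:
        \mathbb{P}\Big\{ \Big\| \sum_{i=1}^n X_i \Big\| \ge t \Big\}
          \le 2d \exp\!\Big( \frac{-t^2/2}{v + Lt/3} \Big), \qquad t \ge 0.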

     

    • Spectral clustering: spectral clustering under the Gaussian mixture model, spectral clustering based on the Laplacian matrix,
      perturbation theory for linear operators, Davis-Kahan sin-theta theorem
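      A minimal Python sketch of the two-cluster case, which partitions by the sign pattern of the Fiedler vector of the
      unnormalized Laplacian (illustrative only; the course treats the general k-cluster case):

        import numpy as np

        def two_way_spectral(A):
            """Two-cluster spectral partition of a symmetric 0/1 adjacency matrix A."""
            L = np.diag(A.sum(axis=1)) - A      # unnormalized graph Laplacian
            vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
            fiedler = vecs[:, 1]                # eigenvector of the 2nd-smallest eigenvalue
            return (fiedler >= 0).astype(int)   # cluster labels in {0, 1}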


    Part II: Community Detection

     

    • Random graphs G(n,p): giant component, branching processes, connectivity threshold
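      The connectivity threshold at p = log(n)/n is easy to see in simulation; a minimal sketch, with all names and
      defaults illustrative:

        import numpy as np

        def is_connected(A):
            """Depth-first search check of connectivity for a 0/1 adjacency matrix."""
            seen, stack = {0}, [0]
            while stack:
                u = stack.pop()
                for v in np.flatnonzero(A[u]):
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
            return len(seen) == len(A)

        def gnp_connectivity_rate(n, c, trials=100, seed=0):
            """Empirical probability that G(n, p) is connected at p = c * log(n) / n."""
            rng = np.random.default_rng(seed)
            p = c * np.log(n) / n
            hits = 0
            for _ in range(trials):
                U = rng.random((n, n)) < p
                A = np.triu(U, 1)
                A = (A + A.T).astype(int)       # symmetric, no self-loops
                hits += is_connected(A)
            return hits / trials

        # Around the threshold c = 1: e.g. gnp_connectivity_rate(300, 0.7) is near 0,
        # while gnp_connectivity_rate(300, 1.5) is typically close to 1.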

     

    • Planted models: stochastic block model, planted partition, planted clique
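      A two-community stochastic block model is straightforward to simulate; a minimal sketch (names illustrative):

        import numpy as np

        def sample_sbm(n, p, q, seed=0):
            """Two-community SBM: edge prob. p within a community, q across."""
            rng = np.random.default_rng(seed)
            sigma = np.repeat([0, 1], [n // 2, n - n // 2])   # fixed balanced labels
            P = np.where(sigma[:, None] == sigma[None, :], p, q)
            U = rng.random((n, n)) < P
            A = np.triu(U, 1)
            return (A + A.T).astype(int), sigma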

     

    • Spectral graph clustering: spectrum of random graphs, failure of naive spectral methods on sparse graphs, spectral barrier in planted clique

     

    • Information-theoretic tools: mutual information, Kullback-Leibler (KL) divergence and its operational characterizations,
      data processing inequality, Fano's inequality, derivation of the recovery limits of planted models
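      For reference, Fano's inequality in the form typically used to derive recovery lower bounds:

        % X uniform over M hypotheses and X \to Y \to \hat{X} a Markov chain:
        \mathbb{P}\{\hat{X} \neq X\} \;\ge\; 1 - \frac{I(X; Y) + \log 2}{\log M}.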

     

    • First and second moment methods: binary hypothesis testing, likelihood ratio test, generalized likelihood ratio test,
      total variation distance, Hellinger distance, chi-square divergence, first moment method, second moment method,
      derivation of detection limits, information-computation gap
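      The second moment method rests on the following standard bound of total variation by the chi-square divergence
      (a consequence of the Cauchy-Schwarz inequality):

        \mathrm{TV}(P_n, Q_n) \;\le\; \tfrac{1}{2} \sqrt{\chi^2(P_n \,\|\, Q_n)},
        \qquad
        \chi^2(P_n \,\|\, Q_n) \;=\; \mathbb{E}_{Q_n}\!\Big[ \Big( \frac{dP_n}{dQ_n} \Big)^{\!2} \Big] - 1,
        % so if \chi^2(P_n \| Q_n) \to 0, no test can distinguish the planted
        % distribution P_n from the null Q_n.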

     

    • Semidefinite relaxations for community detection: convex duality theory, performance analysis of SDP relaxations,
      Grothendieck's inequality, SDP in real-world networks
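      As one concrete instance: for two balanced communities with adjacency matrix A, a standard SDP relaxation lifts the
      label vector to a matrix variable and drops the rank-one constraint:

        % relax \max \{ \sigma^\top A \sigma : \sigma \in \{\pm 1\}^n, \ \mathbf{1}^\top \sigma = 0 \} via X = \sigma \sigma^\top:
        \max_X \ \langle A, X \rangle
        \quad \text{s.t.} \quad X \succeq 0, \quad X_{ii} = 1 \ \forall i, \quad \langle \mathbf{J}, X \rangle = 0,
        % where \mathbf{J} denotes the all-ones matrix.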

     

    • Belief propagation for community detection: locally tree-like argument, density evolution, fixed-point analysis

     

    • Submatrix localization: Information limits and algorithmic limits

     

    • Covariance matrix estimation: Spiked covariance model, BBP phase transition, sparse PCA


    Part III: Graphical models and message passing

     

    • Graphical model representation: Bayesian networks, pairwise graphical models, factor graphs, Markov random fields

     

    • Inference via message passing algorithms: Tree networks, sum-product algorithm, max-product algorithm, Ising model, Hidden Markov model
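      A minimal Python sketch of the sum-product recursion on a hidden Markov chain (the forward-backward algorithm);
      conventions and names are illustrative:

        import numpy as np

        def forward_backward(T, E, pi, obs):
            """Posterior marginals P(state_t | obs) for a hidden Markov chain.

            T[i, j]: transition prob. i -> j; E[i, o]: emission prob. of symbol o
            in state i; pi: initial distribution; obs: sequence of observed symbols.
            """
            n, k = len(obs), len(pi)
            alpha, beta = np.zeros((n, k)), np.ones((n, k))
            alpha[0] = pi * E[:, obs[0]]
            alpha[0] /= alpha[0].sum()                 # normalize for numerical stability
            for t in range(1, n):                      # forward messages
                alpha[t] = (alpha[t - 1] @ T) * E[:, obs[t]]
                alpha[t] /= alpha[t].sum()
            for t in range(n - 2, -1, -1):             # backward messages
                beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])
                beta[t] /= beta[t].sum()
            post = alpha * beta
            return post / post.sum(axis=1, keepdims=True)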

     

    • Variational methods: Free energy and Gibbs free energy, naive mean field, Bethe free energy

     

    • Bayesian inference: Gaussian mixtures and community detection as spin glasses

     

    • Learning graphical models: the Chow-Liu algorithm on trees
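      A minimal sketch of the Chow-Liu procedure for binary data: estimate all pairwise mutual informations, then take a
      maximum-weight spanning tree (names illustrative; assumes SciPy is available):

        import numpy as np
        from scipy.sparse.csgraph import minimum_spanning_tree

        def chow_liu_edges(X):
            """Edges of the Chow-Liu tree for 0/1 data X (n samples x d variables)."""
            n, d = X.shape
            MI = np.zeros((d, d))
            for i in range(d):
                for j in range(i + 1, d):
                    joint = np.histogram2d(X[:, i], X[:, j], bins=[2, 2])[0] / n
                    pi, pj = joint.sum(axis=1), joint.sum(axis=0)
                    mask = joint > 0
                    MI[i, j] = (joint[mask]
                                * np.log(joint[mask] / np.outer(pi, pj)[mask])).sum()
            # Max-weight spanning tree on MI = min spanning tree on -MI.
            mst = minimum_spanning_tree(-MI)
            rows, cols = mst.nonzero()
            return list(zip(rows.tolist(), cols.tolist()))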


    Part IV: Other selected topics

     

    • Randomized linear algebra: matrix-vector product, matrix multiplication, low-rank approximation
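      A minimal sketch of randomized matrix multiplication by sampling outer products (the basic estimator of
      Drineas, Kannan, and Mahoney; names illustrative):

        import numpy as np

        def approx_matmul(A, B, s, seed=0):
            """Unbiased randomized approximation of A @ B from s sampled outer products.

            Samples column/row index k with probability proportional to
            ||A[:, k]|| * ||B[k, :]||, the choice that minimizes the expected
            Frobenius-norm error, and rescales to keep the estimator unbiased.
            """
            rng = np.random.default_rng(seed)
            norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
            p = norms / norms.sum()
            idx = rng.choice(len(p), size=s, p=p)
            return sum(np.outer(A[:, k], B[k, :]) / (s * p[k]) for k in idx)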

     

    • Ranking: Bradley-Terry-Luce (BTL) model, Plackett-Luce model, maximum likelihood estimation, EM algorithm
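      A minimal sketch of the MM iteration for BTL maximum likelihood (after Hunter, 2004; names illustrative, and the
      comparison graph is assumed connected with every item winning at least once):

        import numpy as np

        def btl_mm(wins, n_iters=500, tol=1e-8):
            """MM iterations for Bradley-Terry-Luce scores.

            wins[i, j] = number of times item i beat item j; the model posits
            P(i beats j) = w[i] / (w[i] + w[j]).
            """
            n = wins.shape[0]
            games = wins + wins.T                  # total comparisons per pair
            w = np.ones(n)
            for _ in range(n_iters):
                denom = games / (w[:, None] + w[None, :])
                np.fill_diagonal(denom, 0.0)
                w_new = wins.sum(axis=1) / denom.sum(axis=1)
                w_new /= w_new.sum()               # fix the scale invariance
                done = np.max(np.abs(w_new - w)) < tol
                w = w_new
                if done:
                    break
            return w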

     

    • DNA sequencing: DNA scaffolding, hidden Hamiltonian path problem


    Grading

    • 30% Homework (late homework will not be accepted)
    • 30% Attendance
    • 40% Final project
    • Final project: either a paper presentation or a standalone research project.