MGMT 69000 Fall 2016

PhD Seminar in Analytics: Topics in High-dimensional Data Analysis

Friday 3:15pm-6:15pm, Jerry S Rawls Hall 2079


Overview and Objectives

Online social networks and e-commerce platforms such as Facebook, Twitter, Amazon, and Bitcoin generate enormous amounts of data. Likewise, with the
advent of high-throughput measurement methods in biology, a large amount of biological data has accumulated, and many other sources, such as
transportation networks, power networks, and sensor networks, contribute further. These data contain a wealth of information, and it is often desirable
to extract that information for various purposes, for instance to predict user preferences, discover the causes of disease, or forecast traffic patterns.
However, the data are often noisy and voluminous; extracting useful information from them therefore requires highly efficient algorithms that can
process large amounts of data and detect tenuous statistical signatures.

This research-oriented course is designed for graduate students interested in theoretical aspects of high-dimensional data analysis.
Two central questions will be addressed in this course:

  • How do we characterize the threshold above which the task of extracting information is fundamentally possible and below which it is fundamentally impossible?
  • How do we develop computationally efficient algorithms that attain this fundamental limit, or understand why none exists?

The course also aims to familiarize students with advanced analytical tools such as concentration of measure, probabilistic methods, information-theoretic arguments,
convex duality theory, and random matrix theory, by working through a number of emerging research topics in high-dimensional data analysis, including data clustering,
community detection, submatrix localization, sparse PCA, learning graphical models, and fast algorithms for linear algebra.



    Prerequisites: Mathematical maturity in linear algebra and probability is required.
    Some familiarity with basic optimization and algorithms is also recommended.

    Instructor: Prof. Jiaming Xu, Krannert Building 431, xu972@purdue.edu , Office Hours: TBD


    Credit: 3 hours




    Course Outline

    Part I: Clustering

     

    • Introduction: Examples of high-dimensional data analysis and applications

     

    • k-means clustering: Optimization formulation of k-means, convergence of k-means, failure cases of k-means,
      model-based formulation of k-means, maximum likelihood estimation and EM algorithm for Gaussian mixtures, soft k-means
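      As a concrete companion to the optimization formulation above, here is a minimal Python sketch of Lloyd's iteration
      (a sketch only; all names and defaults are illustrative):

        import numpy as np

        def lloyd_kmeans(X, k, n_iters=100, seed=0):
            """Minimal Lloyd's iteration for k-means on the rows of X (n x d)."""
            rng = np.random.default_rng(seed)
            # Initialize centers by sampling k distinct data points.
            centers = X[rng.choice(len(X), size=k, replace=False)]
            for _ in range(n_iters):
                # Assignment step: each point joins its nearest center.
                dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
                labels = dists.argmin(axis=1)
                # Update step: each center moves to the mean of its cluster
                # (an empty cluster keeps its previous center).
                centers_new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                        else centers[j] for j in range(k)])
                if np.allclose(centers_new, centers):
                    break
                centers = centers_new
            return centers, labels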

     

    • Review of linear algebra: eigenvalue decomposition of symmetric matrices, singular value decomposition and best-fit subspaces,
      Frobenius norm, spectral norm, best low-rank matrix approximation, spectral relaxations of k-means, principal component analysis
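      The best low-rank approximation result above (the Eckart-Young theorem) is easy to verify numerically; a minimal
      Python sketch, with all names illustrative:

        import numpy as np

        def best_rank_k(A, k):
            """Best rank-k approximation of A in Frobenius norm, via truncated SVD."""
            U, s, Vt = np.linalg.svd(A, full_matrices=False)
            return (U[:, :k] * s[:k]) @ Vt[:k, :]

        # Eckart-Young: the Frobenius error equals the norm of the tail singular values.
        rng = np.random.default_rng(0)
        A = rng.standard_normal((8, 5))
        s = np.linalg.svd(A, compute_uv=False)
        err = np.linalg.norm(A - best_rank_k(A, 2), 'fro')
        assert np.isclose(err, np.sqrt((s[2:] ** 2).sum()))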

     

    • Concentration inequalities: Markov's inequality, Chebyshev's inequality, Chernoff bounds, sub-Gaussian random variables,
      sub-exponential random variables, Bernstein's inequality, symmetrization technique, Gaussian isoperimetric inequality
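      As a worked instance of the Chernoff method above: if X is sub-Gaussian with variance proxy sigma^2, optimizing the
      exponent over lambda yields (in LaTeX)

        % assuming \mathbb{E}\, e^{\lambda (X - \mathbb{E}X)} \le e^{\lambda^2 \sigma^2 / 2} for all \lambda:
        \mathbb{P}\{X - \mathbb{E}X \ge t\}
          \le \inf_{\lambda > 0} e^{-\lambda t}\, \mathbb{E}\, e^{\lambda (X - \mathbb{E}X)}
          \le \inf_{\lambda > 0} e^{-\lambda t + \lambda^2 \sigma^2 / 2}
          = e^{-t^2 / (2\sigma^2)}, \qquad t \ge 0,
        % with the infimum attained at \lambda = t / \sigma^2.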

     

    • Matrix concentration inequalities: Wigner's semicircle law, Gaussian comparison inequalities, epsilon-net method, matrix Bernstein inequality
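      For reference, the matrix Bernstein inequality above takes the following form in the symmetric case (following Tropp):

        % X_1, \dots, X_n independent, mean-zero, symmetric d \times d matrices,
        % \|X_i\| \le L almost surely, and v = \big\| \sum_i \mathbb{E} X_i^2 \big\|:
        \mathbb{P}\Big\{ \Big\| \sum_{i=1}^n X_i \Big\| \ge t \Big\}
          \le 2d \exp\!\Big( \frac{-t^2/2}{v + Lt/3} \Big), \qquad t \ge 0.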

     

    • Spectral clustering: spectral clustering under the Gaussian mixture model, spectral clustering based on the Laplacian matrix,
      perturbation theory for linear operators, Davis-Kahan sin-theta theorem
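      A minimal Python sketch of the two-cluster case, which partitions by the sign pattern of the Fiedler vector of the
      unnormalized Laplacian (illustrative only; the course treats the general k-cluster case):

        import numpy as np

        def two_way_spectral(A):
            """Two-cluster spectral partition of a symmetric 0/1 adjacency matrix A."""
            L = np.diag(A.sum(axis=1)) - A      # unnormalized graph Laplacian
            vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
            fiedler = vecs[:, 1]                # eigenvector of the 2nd-smallest eigenvalue
            return (fiedler >= 0).astype(int)   # cluster labels in {0, 1}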


    Part II: Community Detection

     

    • Random graphs G(n,p): giant component, branching processes, connectivity threshold
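      The connectivity threshold at p = log(n)/n is easy to see in simulation; a minimal sketch, with all names and
      defaults illustrative:

        import numpy as np

        def is_connected(A):
            """Depth-first search check of connectivity for a 0/1 adjacency matrix."""
            seen, stack = {0}, [0]
            while stack:
                u = stack.pop()
                for v in np.flatnonzero(A[u]):
                    if v not in seen:
                        seen.add(v)
                        stack.append(v)
            return len(seen) == len(A)

        def gnp_connectivity_rate(n, c, trials=100, seed=0):
            """Empirical probability that G(n, p) is connected at p = c * log(n) / n."""
            rng = np.random.default_rng(seed)
            p = c * np.log(n) / n
            hits = 0
            for _ in range(trials):
                U = rng.random((n, n)) < p
                A = np.triu(U, 1)
                A = (A + A.T).astype(int)       # symmetric, no self-loops
                hits += is_connected(A)
            return hits / trials

        # Around the threshold c = 1: e.g. gnp_connectivity_rate(300, 0.7) is near 0,
        # while gnp_connectivity_rate(300, 1.5) is typically close to 1.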

     

    • Planted models: stochastic block model, planted partition, planted clique
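      A two-community stochastic block model is straightforward to simulate; a minimal sketch (names illustrative):

        import numpy as np

        def sample_sbm(n, p, q, seed=0):
            """Two-community SBM: edge prob. p within a community, q across."""
            rng = np.random.default_rng(seed)
            sigma = np.repeat([0, 1], [n // 2, n - n // 2])   # fixed balanced labels
            P = np.where(sigma[:, None] == sigma[None, :], p, q)
            U = rng.random((n, n)) < P
            A = np.triu(U, 1)
            return (A + A.T).astype(int), sigma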

     

    • Spectral graph clustering: spectrum of random graphs, failure of naive spectral methods on sparse graphs, spectral barrier in planted clique

     

    • Information-theoretic tools: mutual information, Kullback-Leibler (KL) divergence and its operational characterizations,
      data processing inequality, Fano's inequality, derivation of the recovery limits of planted models
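      For reference, Fano's inequality in the form typically used to derive recovery lower bounds:

        % X uniform over M hypotheses and X \to Y \to \hat{X} a Markov chain:
        \mathbb{P}\{\hat{X} \neq X\} \;\ge\; 1 - \frac{I(X; Y) + \log 2}{\log M}.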

     

    • First and second moment methods: binary hypothesis testing, likelihood ratio test, generalized likelihood ratio test,
      total variation distance, Hellinger distance, chi-square divergence, first moment method, second moment method,
      derivation of detection limits, information-computation gap
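      The second moment method rests on the following standard bound of total variation by the chi-square divergence
      (a consequence of the Cauchy-Schwarz inequality):

        \mathrm{TV}(P_n, Q_n) \;\le\; \tfrac{1}{2} \sqrt{\chi^2(P_n \,\|\, Q_n)},
        \qquad
        \chi^2(P_n \,\|\, Q_n) \;=\; \mathbb{E}_{Q_n}\!\Big[ \Big( \frac{dP_n}{dQ_n} \Big)^{\!2} \Big] - 1,
        % so if \chi^2(P_n \| Q_n) \to 0, no test can distinguish the planted
        % distribution P_n from the null Q_n.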

     

    • Semidefinite relaxations for community detection: convex duality theory, performance analysis of SDP relaxations,
      Grothendieck's inequality, SDP in real-world networks
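      As one concrete instance: for two balanced communities with adjacency matrix A, a standard SDP relaxation lifts the
      label vector to a matrix variable and drops the rank-one constraint:

        % relax \max \{ \sigma^\top A \sigma : \sigma \in \{\pm 1\}^n, \ \mathbf{1}^\top \sigma = 0 \} via X = \sigma \sigma^\top:
        \max_X \ \langle A, X \rangle
        \quad \text{s.t.} \quad X \succeq 0, \quad X_{ii} = 1 \ \forall i, \quad \langle \mathbf{J}, X \rangle = 0,
        % where \mathbf{J} denotes the all-ones matrix.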

     

    • Belief propagation for community detection: locally tree-like argument, density evolution, fixed-point analysis

     

    • Submatrix localization: Information limits and algorithmic limits

     

    • Covariance matrix estimation: Spiked covariance model, BBP phase transition, sparse PCA


    Part III: Graphical models and message passing

     

    • Graphical model representation: Bayesian networks, pairwise graphical models, factor graphs, Markov random fields

     

    • Inference via message passing algorithms: Tree networks, sum-product algorithm, max-product algorithm, Ising model, Hidden Markov model
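      A minimal Python sketch of the sum-product recursion on a hidden Markov chain (the forward-backward algorithm);
      conventions and names are illustrative:

        import numpy as np

        def forward_backward(T, E, pi, obs):
            """Posterior marginals P(state_t | obs) for a hidden Markov chain.

            T[i, j]: transition prob. i -> j; E[i, o]: emission prob. of symbol o
            in state i; pi: initial distribution; obs: sequence of observed symbols.
            """
            n, k = len(obs), len(pi)
            alpha, beta = np.zeros((n, k)), np.ones((n, k))
            alpha[0] = pi * E[:, obs[0]]
            alpha[0] /= alpha[0].sum()                 # normalize for numerical stability
            for t in range(1, n):                      # forward messages
                alpha[t] = (alpha[t - 1] @ T) * E[:, obs[t]]
                alpha[t] /= alpha[t].sum()
            for t in range(n - 2, -1, -1):             # backward messages
                beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])
                beta[t] /= beta[t].sum()
            post = alpha * beta
            return post / post.sum(axis=1, keepdims=True)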

     

    • Variational methods: Free energy and Gibbs free energy, naive mean field, Bethe free energy

     

    • Bayesian inference: Gaussian mixtures and community detection as spin glasses

     

    • Learning graphical models: the Chow-Liu algorithm on trees
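      A minimal sketch of the Chow-Liu procedure for binary data: estimate all pairwise mutual informations, then take a
      maximum-weight spanning tree (names illustrative; assumes SciPy is available):

        import numpy as np
        from scipy.sparse.csgraph import minimum_spanning_tree

        def chow_liu_edges(X):
            """Edges of the Chow-Liu tree for 0/1 data X (n samples x d variables)."""
            n, d = X.shape
            MI = np.zeros((d, d))
            for i in range(d):
                for j in range(i + 1, d):
                    joint = np.histogram2d(X[:, i], X[:, j], bins=[2, 2])[0] / n
                    pi, pj = joint.sum(axis=1), joint.sum(axis=0)
                    mask = joint > 0
                    MI[i, j] = (joint[mask]
                                * np.log(joint[mask] / np.outer(pi, pj)[mask])).sum()
            # Max-weight spanning tree on MI = min spanning tree on -MI.
            mst = minimum_spanning_tree(-MI)
            rows, cols = mst.nonzero()
            return list(zip(rows.tolist(), cols.tolist()))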


    Part IV: Other selected topics

     

    • Randomized linear algebra: matrix-vector product, matrix multiplication, low-rank approximation
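      A minimal sketch of randomized matrix multiplication by sampling outer products (the basic estimator of
      Drineas, Kannan, and Mahoney; names illustrative):

        import numpy as np

        def approx_matmul(A, B, s, seed=0):
            """Unbiased randomized approximation of A @ B from s sampled outer products.

            Samples column/row index k with probability proportional to
            ||A[:, k]|| * ||B[k, :]||, the choice that minimizes the expected
            Frobenius-norm error, and rescales to keep the estimator unbiased.
            """
            rng = np.random.default_rng(seed)
            norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
            p = norms / norms.sum()
            idx = rng.choice(len(p), size=s, p=p)
            return sum(np.outer(A[:, k], B[k, :]) / (s * p[k]) for k in idx)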

     

    • Ranking: Bradley-Terry-Luce (BTL) model, Plackett-Luce model, maximum likelihood estimation, EM algorithm
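      A minimal sketch of the MM iteration for BTL maximum likelihood (after Hunter, 2004; names illustrative, and the
      comparison graph is assumed connected with every item winning at least once):

        import numpy as np

        def btl_mm(wins, n_iters=500, tol=1e-8):
            """MM iterations for Bradley-Terry-Luce scores.

            wins[i, j] = number of times item i beat item j; the model posits
            P(i beats j) = w[i] / (w[i] + w[j]).
            """
            n = wins.shape[0]
            games = wins + wins.T                  # total comparisons per pair
            w = np.ones(n)
            for _ in range(n_iters):
                denom = games / (w[:, None] + w[None, :])
                np.fill_diagonal(denom, 0.0)
                w_new = wins.sum(axis=1) / denom.sum(axis=1)
                w_new /= w_new.sum()               # fix the scale invariance
                done = np.max(np.abs(w_new - w)) < tol
                w = w_new
                if done:
                    break
            return w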

     

    • DNA sequencing: DNA scaffolding, hidden Hamiltonian path problem


    Grading

    • 30% Homework (late homework will not be accepted)
    • 30% Attendance
    • 40% Final project
    • Final project: either a paper presentation or a standalone research project.