MGMT 69000, Fall 2016
PhD Seminar in Analytics: Topics in High-dimensional Data Analysis
Friday 3:15pm-6:15pm, Jerry S. Rawls Hall 2079
Overview and Objectives
Today we see a surge of online social networks and e-commerce platforms such as Facebook, Amazon, Twitter and Bitcoin, which generate enormous amounts of data. Also, with the advent of high-throughput measurement methods in biology, a large amount of biological data has accumulated. There are many other sources of data, such as transportation networks, power networks, and sensor networks. This data contains a wealth of information, and it is often desirable to extract useful information from it for various purposes, for instance to predict user preferences, discover disease causes, or predict traffic patterns. However, this data is often noisy and voluminous; thus extracting useful information from it requires highly efficient algorithms that can process large amounts of data and detect tenuous statistical signatures.
This will be a research-oriented course designed for graduate students with an interest in doing research in theoretical aspects of high-dimensional data analysis. Two central questions will be addressed in this course: How shall we characterize the limit above which the task of extracting information is fundamentally possible and below which it is fundamentally impossible? How shall we develop computationally efficient algorithms that attain this fundamental limit, or understand the lack thereof?
This course aims to familiarize students with advanced analytical tools, such as concentration of measure, probabilistic methods, information-theoretic arguments, convex duality theory, and random matrix theory, by going over a number of emerging research topics in the area of high-dimensional data analysis, such as data clustering, community detection, submatrix localization, sparse PCA, learning graphical models, and fast algorithms for linear algebra.
Prerequisites: Mathematical maturity in linear algebra and probability is required. Some knowledge of basic optimization and algorithms is also recommended.
Instructor: Prof. Jiaming Xu, Krannert Building 431, xu972@purdue.edu, Office Hours: TBD
Credit: 3 hours
Course Outline
Part I: Clustering
- Introduction: Examples of high-dimensional data analysis and applications
- k-means clustering: optimization formulation of k-means, convergence of k-means, failure cases of k-means, model-based formulation of k-means, maximum likelihood estimation and the EM algorithm for Gaussian mixtures, soft k-means
- Review of linear algebra: eigenvalue decomposition of symmetric matrices, singular value decomposition and best-fit subspaces, Frobenius norm, spectral norm, best low-rank matrix approximation, spectral relaxations of k-means, principal component analysis
- Concentration inequalities: Markov's inequality, Chebyshev's inequality, the Chernoff bound, sub-Gaussian random variables, sub-exponential random variables, Bernstein's inequality, the symmetrization technique, the Gaussian isoperimetric inequality
- Matrix concentration inequalities: Wigner's semicircle law, Gaussian comparison inequalities, the epsilon-net method, the matrix Bernstein inequality
- Spectral clustering: spectral clustering under the Gaussian mixture model, spectral clustering based on the Laplacian matrix, perturbation theory for linear operators, the Davis-Kahan sin-theta theorem
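As a concrete reference point for the clustering material above, here is a minimal sketch of Lloyd's iteration for k-means. This is illustrative code only, not part of the course materials; the function name and parameters are my own choices.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest center and recomputing each center as its cluster mean."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Update step: each nonempty cluster's center moves to its mean.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

The failure cases discussed in lecture (bad local optima, sensitivity to initialization) show up directly in the choice of initial centers here.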
Part II: Community Detection
- Random graphs G(n,p): the giant component, branching processes, the connectivity threshold
- Planted models: stochastic block model, planted partition, planted clique
- Spectral graph clustering: the spectrum of random graphs, failure of naive spectral methods on sparse graphs, the spectral barrier in planted clique
- Information-theoretic tools: mutual information, Kullback-Leibler (KL) divergence and its operational characterizations, the data processing inequality, Fano's inequality, derivation of recovery limits for planted models
- First and second moment methods: binary hypothesis testing, the likelihood ratio test, the generalized likelihood ratio test, total variation distance, Hellinger distance, chi-square divergence, the first moment method, the second moment method, derivation of detection limits, the information-computation gap
- Semidefinite relaxations for community detection: convex duality theory, performance analysis of SDP relaxations, Grothendieck's inequality, SDP on real-world networks
- Belief propagation for community detection: the locally tree-like argument, density evolution, fixed-point analysis
- Submatrix localization: information limits and algorithmic limits
- Covariance matrix estimation: the spiked covariance model, the BBP phase transition, sparse PCA
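To make the spectral approach to the stochastic block model concrete, here is a hedged sketch (my own illustrative code, not course material): with two equal communities and within-community edge probability p larger than cross-community probability q, the sign pattern of the eigenvector for the second-largest adjacency eigenvalue approximately recovers the partition.

```python
import numpy as np

def sbm(n, p, q, rng):
    """Adjacency matrix of a two-community stochastic block model:
    the first n//2 nodes form community 0, the rest community 1."""
    sigma = np.array([0] * (n // 2) + [1] * (n - n // 2))
    probs = np.where(sigma[:, None] == sigma[None, :], p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)   # sample upper triangle
    return (upper | upper.T).astype(float)             # symmetrize, zero diagonal

def spectral_partition(A):
    """Cluster by the sign of the eigenvector of the second-largest
    eigenvalue: the top eigenvector tracks overall degree, the second
    one tracks the community structure."""
    vals, vecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    v2 = vecs[:, -2]                 # eigenvector of the second-largest eigenvalue
    return (v2 > 0).astype(int)
```

The "failure of naive spectral methods in sparse graphs" topic above concerns exactly the regime where this simple recipe breaks down (p, q of order 1/n), which is what motivates the more refined methods covered in this part.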
Part III: Graphical models and message passing
- Graphical model representations: Bayesian networks, pairwise graphical models, factor graphs, Markov random fields
- Inference via message-passing algorithms: tree networks, the sum-product algorithm, the max-product algorithm, the Ising model, hidden Markov models
- Variational methods: free energy and Gibbs free energy, naive mean field, the Bethe free energy
- Bayesian inference: Gaussian mixtures and community detection as spin glasses
- Learning graphical models: the Chow-Liu algorithm on trees
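As a small illustration of the message-passing material (again my own sketch, not course code): on a tree, sum-product computes exact marginals. For a chain of binary variables with pairwise potentials, the forward and backward messages can be checked against brute-force enumeration.

```python
import itertools
import numpy as np

def chain_marginals(psi):
    """Exact marginals of a binary chain MRF p(x) ∝ prod_i psi[i][x_i, x_{i+1}]
    via sum-product: forward and backward messages along the chain."""
    fwd = [np.ones(2)]                 # message arriving at node i from the left
    for P in psi:
        fwd.append(fwd[-1] @ P)        # m(x_{i+1}) = sum_y m(y) psi(y, x_{i+1})
    bwd = [np.ones(2)]                 # message arriving at node i from the right
    for P in reversed(psi):
        bwd.append(P @ bwd[-1])
    bwd = bwd[::-1]
    marg = [f * b for f, b in zip(fwd, bwd)]
    return [m / m.sum() for m in marg]

def brute_marginals(psi):
    """Brute-force enumeration over all 2^n configurations, for checking."""
    n = len(psi) + 1
    marg = np.zeros((n, 2))
    for x in itertools.product([0, 1], repeat=n):
        w = np.prod([psi[i][x[i], x[i + 1]] for i in range(n - 1)])
        for i in range(n):
            marg[i, x[i]] += w
    return marg / marg.sum(axis=1, keepdims=True)
```

The brute-force check costs 2^n while the message passes cost O(n), which is the efficiency gap that makes sum-product on trees useful.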
Part IV: Other selected topics
- Randomized linear algebra: matrix-vector products, matrix multiplication, low-rank approximation
- Ranking: the BTL model, the Plackett-Luce model, maximum likelihood estimation, the EM algorithm
- DNA sequencing: DNA scaffolding, the hidden Hamiltonian path problem
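One example from the randomized linear algebra topic, sketched under my own choice of method (a Gaussian randomized range finder, not necessarily the variant covered in class): sketch the matrix with a random test matrix, orthonormalize the sketch, and project onto the resulting subspace to get a low-rank approximation.

```python
import numpy as np

def randomized_low_rank(A, k, oversample=5, seed=0):
    """Rank-(k + oversample) approximation of A: multiply by a Gaussian
    test matrix, orthonormalize the product, and project A onto it."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)   # orthonormal basis for the sketched range
    return Q @ (Q.T @ A)             # project A onto that subspace
```

The sketch A @ Omega costs only one pass over A, which is the point of these methods for matrices too large for a full SVD.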
Grading
- 30% Homework (late homework will not be accepted)
- 30% Attendance
- 40% Final project
Final project: either a presentation of a paper or a standalone research project.