MGMT 69000, Fall 2016
PhD Seminar in Analytics: Topics in High-dimensional Data Analysis
Friday 3:15pm-6:15pm, Jerry S. Rawls Hall 2079
Overview and Objectives
Today we see a surge of online social networks and e-commerce platforms such as Facebook, Amazon, Twitter, and Bitcoin, which generate enormous amounts of data. Also, with the advent of high-throughput measurement methods in biology, a large amount of biological data has accumulated. There are many other sources of data, such as transportation networks, power networks, and sensor networks. This data contains a wealth of information, and it is often desirable to extract useful information from it for various purposes, for instance to predict user preferences, discover disease causes, or predict traffic patterns. However, this data is often noisy and voluminous; thus extracting useful information from it requires highly efficient algorithms that can process large amounts of data and detect tenuous statistical signatures.
This is a research-oriented course designed for graduate students with an interest in doing research on theoretical aspects of high-dimensional data analysis. Two central questions will be addressed in this course:
How shall we characterize the limit above which the task of extracting information is fundamentally possible and below which it is fundamentally impossible?
How shall we develop computationally efficient algorithms that attain the fundamental limit, or understand why no such algorithms exist?
This course aims to familiarize students with advanced analytical tools such as concentration of measure, probabilistic methods, information-theoretic arguments, convex duality theory, and random matrix theory, by going over a number of emerging research topics in the area of high-dimensional data analysis, such as data clustering, community detection, submatrix localization, sparse PCA, learning graphical models, and fast algorithms for linear algebra.
Prerequisites: Maturity in linear algebra and probability is required. Some knowledge of basic optimization and algorithms is also recommended.
Instructor: Prof. Jiaming Xu, Krannert Building 431, xu972@purdue.edu
Credit: 3 hours
Office Hours: TBD
Course Outline
Part I: Clustering
- Introduction: Examples of high-dimensional data analysis and applications
- k-means clustering: Optimization formulation of k-means, convergence of k-means, failure cases of k-means, model-based formulation of k-means, maximum likelihood estimation and EM algorithm for Gaussian mixtures, soft k-means (a minimal Lloyd's-algorithm sketch follows this part's outline)
- Review on linear algebra: eigenvalue decomposition of symmetric matrices, singular value decomposition and best-fit subspaces, Frobenius norm, spectral norm, best low-rank matrix approximation, spectral relaxations of k-means, principal component analysis (an SVD-based low-rank approximation example follows this part's outline)
- Concentration inequalities: Markov's inequality, Chebyshev's inequality, Chernoff's bound, sub-Gaussian random variables, sub-exponential random variables, Bernstein's inequality, symmetrization technique, Gaussian isoperimetric inequality (the basic tail bounds are stated after this part's outline)
- Matrix concentration inequalities: Wigner semicircle law, Gaussian comparison inequality, epsilon-net method, matrix Bernstein inequality
- Spectral clustering: Spectral clustering under the Gaussian mixture model, spectral clustering based on the Laplacian matrix, perturbation theory for linear operators, Davis-Kahan sin-theta theorem
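
For the k-means item above, here is a minimal sketch of Lloyd's algorithm, assuming a data matrix X whose rows are points and a cluster count k chosen purely for illustration; it is a reference sketch of the alternating scheme, not code from the course.

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    # Minimal Lloyd's algorithm: alternate nearest-centroid assignment
    # and centroid (mean) updates until the centroids stop moving.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every centroid.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster;
        # keep the old centroid if a cluster becomes empty.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if (labels == j).any() else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

Each iteration weakly decreases the k-means objective, which is why the algorithm converges, but only to a local optimum; the failure cases mentioned above typically stem from poor initialization or non-spherical clusters.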
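For the linear-algebra review item, a short illustration of best low-rank approximation via the SVD (the Eckart-Young theorem); the matrix A and the target rank r below are placeholders.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))  # placeholder data matrix
r = 5                              # target rank, for illustration only

# Truncating the SVD to the top r singular triples yields the best
# rank-r approximation in both Frobenius and spectral norm.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = (U[:, :r] * s[:r]) @ Vt[:r, :]

# Eckart-Young: the Frobenius error equals the l2 norm of the tail
# singular values, so the two printed numbers agree.
print(np.linalg.norm(A - A_r, 'fro'), np.sqrt((s[r:] ** 2).sum()))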
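The two concentration items build on a standard chain of tail bounds; for reference, the classical statements (the matrix version is the one from Tropp's survey):

\Pr[X \ge a] \le \frac{\mathbb{E}[X]}{a} \quad \text{(Markov; } X \ge 0,\ a > 0\text{)}

\Pr[|X - \mathbb{E}X| \ge a] \le \frac{\mathrm{Var}(X)}{a^2} \quad \text{(Chebyshev)}

\Pr[X \ge a] \le \inf_{\lambda > 0} e^{-\lambda a}\, \mathbb{E}[e^{\lambda X}] \quad \text{(Chernoff)}

Matrix Bernstein: for independent, zero-mean, symmetric d-by-d random matrices X_k with \|X_k\| \le L almost surely and \sigma^2 = \big\|\sum_k \mathbb{E}[X_k^2]\big\|,

\Pr\Big[\Big\|\sum\nolimits_k X_k\Big\| \ge t\Big] \le 2d \exp\Big(\frac{-t^2/2}{\sigma^2 + Lt/3}\Big).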
Part II: Community Detection
- Random graphs G(n,p): giant component, branching process, connectivity threshold
- Planted models: stochastic block model, planted partition, planted clique (a sketch that samples from the stochastic block model follows this part's outline)
- Spectral graph clustering: spectrum of random graphs, failure of naive spectral methods in sparse graphs, spectral barrier in planted clique
- Information-theoretic tools: Mutual information, Kullback-Leibler (KL) divergence and operational characterizations, data processing inequality, Fano's inequality, derivation of recovery limits of planted models (the two key inequalities are stated after this part's outline)
- First and second moment methods: Binary hypothesis testing, likelihood ratio test, generalized likelihood ratio test, total variation distance, Hellinger distance, chi-square divergence, first moment method, second moment method, derivation of detection limits, information-computation gap (both moment bounds are stated after this part's outline)
- Semidefinite relaxations for community detection:
convex duality theory, performance
analysis of SDP relaxations,
Grothendieck's inequality, SDP in real-world networks
- Belief propagation for community detection:
locally tree-like argument, density evolution, fixed-point analysis
- Submatrix localization:
Information limits and algorithmic limits
- Covariance matrix estimation:
Spiked covariance model, BBP phase transition, sparse PCA
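
For the planted-models item, a minimal sketch that samples a two-community stochastic block model; the parameters n, p, q are placeholders, with p > q so that within-community edges are more likely.

import numpy as np

def sample_sbm(n, p, q, seed=0):
    # Two-community stochastic block model: each node gets a hidden
    # label; within-community edges appear with probability p,
    # cross-community edges with probability q.
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    probs = np.where(labels[:, None] == labels[None, :], p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # upper triangle only
    A = (upper | upper.T).astype(int)                 # symmetric, no self-loops
    return A, labels

Planted partition and planted clique fit the same template: hidden structure is planted in an otherwise uniform random graph, and the task is to recover it from the adjacency matrix alone.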
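For the information-theoretic tools item, the two workhorse statements in their standard forms:

Data processing inequality: if X \to Y \to Z form a Markov chain, then I(X; Z) \le I(X; Y).

Fano's inequality: if X is uniform over M hypotheses and \hat{X} is any estimator based on Y, then

\Pr[\hat{X} \ne X] \ge 1 - \frac{I(X; Y) + \log 2}{\log M}.

This is the route to recovery limits of planted models: upper-bound I(X; Y) for the model at hand, and the error probability stays bounded away from zero below the limit.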
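For the first and second moment methods item, the two inequalities behind the detection limits, stated for a nonnegative integer-valued count X:

\Pr[X \ge 1] \le \mathbb{E}[X] \quad \text{(first moment)}

\Pr[X > 0] \ge \frac{(\mathbb{E}[X])^2}{\mathbb{E}[X^2]} \quad \text{(second moment, by Cauchy-Schwarz)}

If \mathbb{E}[X] \to 0 the structure is absent with high probability; if \mathbb{E}[X^2] = (1 + o(1))(\mathbb{E}[X])^2 it is present with high probability. Applied to the likelihood ratio, the second-moment computation controls the chi-square divergence between the planted and null models, which is how detection limits are derived.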
Part III: Graphical models and message passing
- Graphical model representation:
Bayesian networks, pairwise graphical models, factor graphs, Markov random fields
- Inference via message passing algorithms: Tree networks, sum-product algorithm, max-product algorithm, Ising model, Hidden Markov model (the Ising model is written out after this part's outline)
- Variational methods:
Free energy and Gibbs free energy, naive mean field, Bethe free energy
- Bayesian inference: Gaussian mixtures and community detection as spin glasses
- Learning graphical models: the Chow-Liu algorithm for learning tree-structured models
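
For the pairwise graphical models and Ising model items, the Ising model in its standard form, with coupling \beta and external field B:

\Pr[x] = \frac{1}{Z} \exp\Big( \beta \sum_{(i,j) \in E} x_i x_j + B \sum_{i \in V} x_i \Big), \qquad x \in \{-1, +1\}^V.

On a tree, the sum-product algorithm computes every marginal \Pr[x_i] exactly with one upward and one downward pass of messages; on loopy graphs the same updates give the belief propagation scheme analyzed in Part II.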
Part IV: Other selected topics
- Randomized linear algebra: matrix-vector product, matrix multiplication, low-rank approximation (a randomized low-rank approximation sketch follows this part's outline)
- Ranking: BTL model, Plackett-Luce model, maximum likelihood estimation, EM algorithm (both models are written out after this part's outline)
- DNA sequencing:
DNA scaffolding, hidden Hamiltonian path problem
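
For the randomized linear algebra item, a minimal sketch of one standard route to randomized low-rank approximation: a randomized range finder followed by a small exact SVD, in the style of Halko, Martinsson, and Tropp. A, r, and the oversampling amount are placeholders, and this is only one of several variants the topic covers.

import numpy as np

def randomized_low_rank(A, r, oversample=10, seed=0):
    # Hit A with a random test matrix to capture its dominant column
    # space, orthonormalize the result, then take an exact SVD of the
    # small projected matrix.
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], r + oversample))
    Q, _ = np.linalg.qr(A @ Omega)     # orthonormal basis for the sampled range
    B = Q.T @ A                        # small (r + oversample) x n matrix
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_b[:, :r], s[:r], Vt[:r, :]  # rank-r factors

The expensive step is the product A @ Omega, which needs only one pass over A; this is what makes the approach attractive for matrices too large for a full SVD.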
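For the ranking item, the two models in their standard forms. Each item i carries a positive weight w_i:

\Pr[i \text{ beats } j] = \frac{w_i}{w_i + w_j} \quad \text{(Bradley-Terry-Luce)}

\Pr[\pi] = \prod_{k=1}^{n} \frac{w_{\pi(k)}}{\sum_{l \ge k} w_{\pi(l)}} \quad \text{(Plackett-Luce)}

In the Plackett-Luce model a full ranking \pi is built by repeatedly choosing the next item with probability proportional to its weight among the items that remain; BTL is the special case of a single pairwise comparison.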
Grading
- 30% Homework (late homework will not be accepted)
- 30% Attendance
- 40% Final project: either presenting a paper or a standalone research project