High-dimensional Statistical Models

Pradeep Ravikumar, UT Austin, MLSS 2014

Curse of Dimensionality. Statistical learning: given n observations from p(x; θ*), where θ* ∈ R^p, recover the signal/parameter θ*. For reliable statistical learning, the number of observations n should scale exponentially with the data dimension p. What if we do not have that many observations? What if, instead, the data dimension p scales exponentially with the number of observations n?

High-dim. Data: Imaging. Tens of thousands of voxels in each 3D image. We don't want to spend hundreds of thousands of minutes inside the scanner in the name of the curse of dimensionality!

High-dim. Data: Gene (Microarray) Experiments. Tens of thousands of genes. Each experiment costs money, so there is no access to exponentially more observations.

High-dim. Data: Social Networks. Millions or billions of nodes/parameters, with far fewer observations.

Linear Regression. [Figure source: Wikipedia.] Y: real-valued response, modeled as Y_i = X_i^T θ* + ε_i, i = 1, ..., n, with covariates/features X ∈ R^p. Examples: Finance: modeling investment risk, spending, demand, etc. (responses) given market conditions (features). Biology: linking tobacco smoking (feature) to mortality (response).

Linear Regression. Y_i = X_i^T θ* + ε_i, i = 1, ..., n. What if p ≫ n? There is hope for consistent estimation even in such a high-dimensional model if there is some low-dimensional structure. Sparsity: only a few entries of θ* are non-zero.

Sparse Linear Regression. Set-up: noisy observations y = Xθ* + w with sparse θ*, i.e. ‖θ*‖_0 = |{j ∈ {1, ..., p} : θ*_j ≠ 0}| is small. Estimate a sparse linear model:

min_θ ‖y − Xθ‖_2²  subject to  ‖θ‖_0 ≤ k.

This is ℓ0-constrained linear regression: NP-hard (Davis, 1994; Natarajan, 1995). Note: the estimation problem is non-convex.
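
To make the combinatorial nature of this problem concrete, here is a minimal brute-force sketch (my own illustration, not from the slides): the hypothetical helper `l0_regression` enumerates all (p choose k) supports and solves a least-squares problem on each, which is only feasible for very small p.

```python
# Illustrative brute-force solver for l0-constrained regression:
#   min_theta ||y - X theta||_2^2  subject to  ||theta||_0 <= k.
# The loop runs over all (p choose k) supports, which is why this approach
# does not scale and the general problem is NP-hard.
from itertools import combinations
import numpy as np

def l0_regression(X, y, k):
    """Best k-sparse least-squares fit by exhaustive search (tiny p only)."""
    n, p = X.shape
    best_theta, best_loss = None, np.inf
    for support in combinations(range(p), k):
        cols = list(support)
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        loss = np.sum((y - X[:, cols] @ coef) ** 2)
        if loss < best_loss:
            best_loss = loss
            best_theta = np.zeros(p)
            best_theta[cols] = coef
    return best_theta

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
theta_star = np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0])   # 2-sparse truth
y = X @ theta_star + 0.1 * rng.standard_normal(30)
print(l0_regression(X, y, k=2).round(2))
```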

ℓ1 Regularization. Why an ℓ1 penalty? [Figure: unit balls of the ℓ0 quasi-norm, the ℓ1 norm, and the ℓ2 norm; source: Tropp '06, "Just Relax".] The ℓ1 norm is the closest convex norm to the ℓ0 penalty.

ℓ1 Regularization. Example: sparse regression. [Figure: y = Xθ* + w, with θ* ∈ R^p supported on S and zero on S^c.] Set-up: noisy observations y = Xθ* + w with sparse θ*. Estimator: the Lasso program

θ̂ ∈ arg min_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)² + λ_n Σ_{j=1}^p |θ_j|.

Equivalent constrained form:

min_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)²  subject to  ‖θ‖_1 ≤ C.

Some past work: Tibshirani, 1996; Chen et al., 1998; Donoho/Huo, 2001; Tropp, 2004; Fuchs, 2004; Meinshausen/Bühlmann, 2005; Candès/Tao, 2005; Donoho, 2005; Haupt & Nowak, 2006; Zhao/Yu, 2006; Wainwright, 2006; Zou, 2006; Koltchinskii, 2007; Meinshausen/Yu, 2007; Tsybakov et al., 2008.
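
As a concrete illustration of the Lasso program above, the following is a minimal proximal-gradient (ISTA) sketch; the soft-thresholding step is the proximal operator of the λ‖θ‖_1 term. The data, step size, and λ value are my own illustrative assumptions, not from the slides, and an off-the-shelf solver such as scikit-learn's Lasso would compute the same estimator.

```python
# Minimal ISTA (proximal gradient) sketch for the Lasso objective
#   (1/n) * sum_i (y_i - x_i^T theta)^2 + lam * ||theta||_1
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=1000):
    n, p = X.shape
    step = 1.0 / (2.0 * np.linalg.norm(X, ord=2) ** 2 / n)  # 1 / Lipschitz constant
    theta = np.zeros(p)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ theta - y)             # gradient of smooth part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(0)
n, p, s, sigma = 100, 200, 5, 0.5                 # high-dimensional: p > n
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:s] = 3.0    # sparse truth
y = X @ theta_star + sigma * rng.standard_normal(n)
lam = 2 * np.sqrt(2 * sigma**2 * np.log(p) / n)   # lambda_n ~ sigma * sqrt(log p / n)
theta_hat = lasso_ista(X, y, lam)
print("estimated support:", np.nonzero(np.abs(theta_hat) > 1e-2)[0])
```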

Group Sparsity. Parameters come in groups G_1, ..., G_m: θ = (θ_{G_1}, ..., θ_{G_m}), where θ_{G_1} collects the first block of coordinates and so on up to θ_p. A group analog of sparsity: only a few groups of θ* are active; the remaining groups are entirely zero.

Handwriting Recognition. Data: the digit "two" from multiple writers. Task: recognize the digit given a new image. (Each image is represented as a vector of features.) One could model digit recognition for each writer separately, or mix all digits for training. Alternative: use group sparsity. Model digit recognition for each writer separately, but make the models share relevant features.

Group-sparse Multiple Linear Regression. m response variables: Y_i^{(l)} = X_i^T Θ^{(l)} + W_i^{(l)}, i = 1, ..., n, l = 1, ..., m. Collate into matrices Y ∈ R^{n×m}, X ∈ R^{n×p}, and Θ ∈ R^{p×m}: multiple linear regression Y = XΘ + W.

Group-sparse Multiple Linear Regression: multivariate regression with block regularizers. [Figure: Y = XΘ* + W with Y ∈ R^{n×m}, X ∈ R^{n×p}, Θ* ∈ R^{p×m}; the non-zero rows of Θ* are indexed by S, the zero rows by S^c.] Estimate a group-sparse model where only a few rows (groups) of Θ* are non-zero, i.e. |{j ∈ {1, ..., p} : Θ*_j ≠ 0}| is small. Estimator: the ℓ1/ℓq-regularized group Lasso

Θ̂ ∈ arg min_{Θ ∈ R^{p×m}} (1/(2n)) ‖Y − XΘ‖_F² + λ_n ‖Θ‖_{1,q},

with λ_n ≥ 2 ‖X^T W / n‖_{∞,q'}, where 1/q + 1/q' = 1.

Corollary: Suppose Θ* is supported on |S| = s rows, X satisfies restricted strong convexity (RSC), and W_{ij} ~ N(0, σ²). Then

‖Θ̂ − Θ*‖_F ≤ (2/γ(L)) Ψ_q(S) λ_n,  where Ψ_q(S) = √s · m^{1/q − 1/2} if q ∈ [1, 2), and √s if q ≥ 2.
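
For concreteness, here is the corollary specialized to q = 2 (a standard case; this is a sketch that simply instantiates the bound above, under the same RSC and Gaussian-noise assumptions):

```latex
% Special case q = 2: each row is penalized by its l2 norm, the dual exponent
% is q' = 2, and \Psi_2(S) = \sqrt{s}.  The corollary then reads
\[
  \|\widehat{\Theta} - \Theta^*\|_F \;\le\; \frac{2}{\gamma(\mathcal{L})}\,\sqrt{s}\,\lambda_n,
  \qquad \text{for any } \lambda_n \ge 2\,\Bigl\|\tfrac{1}{n}X^T W\Bigr\|_{\infty,2},
\]
% so the Frobenius error is governed by the number s of active rows and the
% noise term X^T W / n, rather than by the ambient dimension p.
```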

Group Lasso:

min_Θ { Σ_{l=1}^m Σ_{i=1}^{n^{(l)}} (Y_i^{(l)} − X_i^T Θ^{(l)})² + λ Σ_{j=1}^p ‖Θ_j‖_q }.

A group analog of the Lasso. Lasso: relax ‖θ‖_0 to Σ_{j=1}^p |θ_j|. Group Lasso: relax ‖(‖Θ_j‖_q)_{j=1}^p‖_0 to Σ_{j=1}^p ‖Θ_j‖_q. (Obozinski et al.; Negahban et al.; Huang et al.; ...)
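
Below is a minimal proximal-gradient sketch of the q = 2 group Lasso above, where the proximal step block-soft-thresholds each row of Θ; the data and the value of λ are illustrative assumptions rather than anything prescribed on the slides.

```python
# Minimal proximal-gradient sketch for the l1/l2 group Lasso
#   (1/(2n)) * ||Y - X Theta||_F^2 + lam * sum_j ||Theta_j||_2   (Theta_j = j-th row)
import numpy as np

def row_soft_threshold(Theta, t):
    """Block soft-thresholding: shrink the l2 norm of each row by t."""
    norms = np.linalg.norm(Theta, axis=1, keepdims=True)
    return Theta * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

def group_lasso(X, Y, lam, n_iters=1000):
    n, p = X.shape
    step = n / np.linalg.norm(X, ord=2) ** 2      # 1 / Lipschitz constant of smooth part
    Theta = np.zeros((p, Y.shape[1]))
    for _ in range(n_iters):
        grad = X.T @ (X @ Theta - Y) / n
        Theta = row_soft_threshold(Theta - step * grad, step * lam)
    return Theta

rng = np.random.default_rng(1)
n, p, m, s = 80, 100, 5, 4                        # only s rows of Theta* are non-zero
X = rng.standard_normal((n, p))
Theta_star = np.zeros((p, m)); Theta_star[:s] = rng.standard_normal((s, m))
Y = X @ Theta_star + 0.1 * rng.standard_normal((n, m))
Theta_hat = group_lasso(X, Y, lam=0.3)
print("estimated active rows:", np.nonzero(np.linalg.norm(Theta_hat, axis=1) > 1e-2)[0])
```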

Low Rank. Matrix-structured observations: X_i ∈ R^{k×m}, Y_i ∈ R. The parameters form a matrix: Θ* ∈ R^{k×m}. Linear model: Y_i = ⟨X_i, Θ*⟩ + W_i, i = 1, ..., n, where ⟨X_i, Θ*⟩ = tr(X_i^T Θ*). Applications: analysis of fMRI image data, EEG data decoding, neural response modeling, financial data. Low-rank models also arise in collaborative filtering: predicting user preferences for items (such as movies) based on their and other users' ratings of related items.

Low Rank. Example: low-rank matrix approximation Θ* ≈ U D V^T, with U ∈ R^{k×r}, D ∈ R^{r×r}, V ∈ R^{m×r}. Set-up: matrix Θ* ∈ R^{k×m} with rank r ≪ min{k, m}. Estimator:

Θ̂ ∈ arg min_Θ (1/n) Σ_{i=1}^n (y_i − ⟨X_i, Θ⟩)² + λ_n Σ_{j=1}^{min{k,m}} σ_j(Θ).

Some past work: Frieze et al., 1998; Achlioptas & McSherry, 2001; Srebro et al., 2004; Drineas et al., 2005; Rudelson & Vershynin, 2006; Recht et al., 2007; Bach, 2008; Meka et al., 2009; Candès & Tao, 2009; Keshavan et al., 2009.

Nuclear Norm. Singular values of A ∈ R^{k×m}: the square roots of the non-zero eigenvalues of A^T A. Singular value decomposition: A = Σ_{i=1}^r σ_i u_i v_i^T. Rank of A = |{i ∈ {1, ..., min{k, m}} : σ_i ≠ 0}|. The nuclear norm is the low-rank analog of the Lasso's ℓ1 penalty: ‖A‖_* = Σ_{i=1}^{min{k,m}} σ_i(A).
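
Just as entrywise soft-thresholding is the proximal operator of the ℓ1 norm, soft-thresholding the singular values gives the proximal operator of the nuclear norm, which is the building block of proximal solvers for the low-rank estimator two slides back. A minimal sketch follows (illustrative data and threshold, not from the slides).

```python
# Proximal operator of the nuclear norm: soft-threshold the singular values.
#   prox_{t ||.||_*}(A) = argmin_Z  0.5 * ||Z - A||_F^2 + t * ||Z||_*
import numpy as np

def prox_nuclear(A, t):
    """Singular value soft-thresholding."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

rng = np.random.default_rng(0)
k, m, r = 30, 20, 3
Theta_star = rng.standard_normal((k, r)) @ rng.standard_normal((r, m))  # rank-r signal
Noisy = Theta_star + 0.5 * rng.standard_normal((k, m))
Theta_hat = prox_nuclear(Noisy, t=4.0)
print("rank of shrunken estimate:", np.linalg.matrix_rank(Theta_hat, tol=1e-8))
print("relative error:", np.linalg.norm(Theta_hat - Theta_star) / np.linalg.norm(Theta_star))
```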

High-dimensional Statistical Analysis. Classical statistical consistency analysis: holding the model size p fixed, as the number of samples n goes to infinity, the estimated parameter θ̂ approaches the true parameter θ*. This is meaningless in finite-sample settings where p ≫ n! We need a new breed of modern statistical analysis in which both the model size p and the sample size n go to infinity. Typical statistical guarantees of interest for an estimate θ̂: Structure recovery: e.g., is the sparsity pattern of θ̂ the same as that of θ*? Parameter bounds: ‖θ̂ − θ*‖ (e.g., ℓ2 error bounds). Risk (loss) bounds: the difference in expected loss.

Recall: the Lasso. Example: sparse regression. [Figure: y = Xθ* + w, with θ* supported on S and zero on S^c.] Set-up: noisy observations y = Xθ* + w with sparse θ*. Estimator: the Lasso program

θ̂ ∈ arg min_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)² + λ_n Σ_{j=1}^p |θ_j|.

Statistical assumption: (x_i, y_i) are drawn from the linear model y_i = x_i^T θ* + w_i, with w_i ~ N(0, σ²). (Past work as cited above.)

Sparsistency. Theorem: Suppose the design matrix X satisfies certain conditions (to be specified later), and suppose we solve the Lasso problem with regularization penalty

λ_n > (2/γ) √(2σ² log p / n).

Then, for some c_1 > 0, the following properties hold with probability at least 1 − 4 exp(−c_1 n λ_n²) → 1: (i) the Lasso problem has a unique solution θ̂ whose support is contained within the true support, S(θ̂) ⊆ S(θ*); (ii) if θ_min = min_{j ∈ S(θ*)} |θ*_j| > c_2 λ_n for some c_2 > 0, then S(θ̂) = S(θ*). (Wainwright, 2008; Zhao and Yu, 2006; ...)
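
A small simulation sketch of the support-recovery statement follows; the Gaussian design, noise level, and constants below are illustrative assumptions (γ is taken to be 1), and the theorem's conditions on X are not verified here. scikit-learn's Lasso is used for brevity: its objective (1/(2n))‖y − Xθ‖² + α‖θ‖_1 matches the slide's objective with α = λ_n/2.

```python
# Illustrative check of Lasso support recovery with lambda_n ~ sqrt(sigma^2 log p / n).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s, sigma = 400, 1000, 10, 0.5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[rng.choice(p, s, replace=False)] = 1.0     # entries well above c_2 * lambda_n
y = X @ theta_star + sigma * rng.standard_normal(n)

lam = 2 * np.sqrt(2 * sigma**2 * np.log(p) / n)       # rate from the theorem, gamma = 1
fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, y)  # alpha = lambda_n / 2

S_hat = set(np.flatnonzero(np.abs(fit.coef_) > 1e-3))
S_star = set(np.flatnonzero(theta_star))
print("S(theta_hat) == S(theta*):", S_hat == S_star)
print("S(theta_hat) subset of S(theta*):", S_hat <= S_star)
```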