Composite Loss Functions and Multivariate Regression; Sparse PCA
1 Composite Loss Functions and Multivariate Regression; Sparse PCA

References:
- G. Obozinski, B. Taskar, and M. I. Jordan (2009). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, to appear.
- G. Obozinski, M. J. Wainwright, and M. I. Jordan (2009). Union support recovery in multivariate regression. Annals of Statistics, under review.
- A. Amini and M. J. Wainwright (2009). High-dimensional analysis of semidefinite relaxations for sparse PCA. Annals of Statistics, to appear.
2 Introduction

- classical asymptotic theory of statistical inference: number of observations n → +∞, model dimension p stays fixed
- not suitable for many modern applications, where { images, signals, systems, networks } are frequently large (p → ∞)
- interesting consequences: might have p = Θ(n) or even p ≫ n
- curse of dimensionality: frequently impossible to obtain consistent procedures unless p/n → 0
- can be saved by a lower effective dimensionality, due to some form of complexity constraint
3 Example: Sparse linear regression

[Figure: observation model y = Xβ + w, with the rows of β partitioned into support S and complement S^c.]

- vector β ∈ R^p with at most k ≪ p non-zero entries
- observation model: noisy linear observations y = Xβ + w, where X ∈ R^{n×p} is the design matrix and w ∈ R^{n×1} is the noise vector
- various applications (database sketching, imaging, genetic testing, ...)
4 Example: Graphical model selection

- consider an m-dimensional random vector Z = (Z_1, ..., Z_m):
    P(Z_1, ..., Z_m; β) ∝ exp( Σ_{(i,j) ∈ E} β_ij Z_i Z_j )
- given n independent and identically distributed (i.i.d.) samples of Z, identify the underlying graph G = (V, E)
- lower effective dimensionality: graphs with k ≪ p := C(m, 2) edges
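To make the model concrete, here is a brute-force sanity check that the exponential family above defines a valid distribution once normalized. The 3-node chain and its edge weights are hypothetical, chosen only for illustration:

```python
import numpy as np
from itertools import product

# Ising model P(z) proportional to exp( sum_{(i,j) in E} beta_ij z_i z_j )
# on a tiny 3-node chain 0-1-2 with spins z_i in {-1, +1}.
edges = {(0, 1): 0.8, (1, 2): -0.5}   # hypothetical edge weights; (0, 2) absent

def score(z):
    """Unnormalized log-probability of a spin configuration z."""
    return sum(b * z[i] * z[j] for (i, j), b in edges.items())

states = list(product([-1, 1], repeat=3))
Z_norm = sum(np.exp(score(z)) for z in states)      # partition function
probs = {z: np.exp(score(z)) / Z_norm for z in states}
print(abs(sum(probs.values()) - 1.0) < 1e-12)       # normalized to 1
```

Even at this toy size, the partition function sums over 2^m states, which is one reason graph selection methods work with the structure of E rather than exact normalization.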
5 Example: Sparse principal components analysis

[Figure: decomposition Σ = ZZ^T + D.]

Set-up: covariance matrix Σ = ZZ^T + D, where the leading eigenspace Z has sparse columns.
Goal: produce an estimate Ẑ based on samples X(i) with covariance matrix Σ.
6 Some issues in high-dimensional inference

Consider some fixed loss function, and a fixed level δ of error.
- Given particular (polynomial-time) algorithms, for what sample sizes n do they succeed/fail to achieve error δ?
- When does more computation reduce the minimum number of samples needed?
7 Outline

1. Multivariate regression in high dimensions
   (a) Practical limitations: scaling laws for second-order cone programs
   (b) SOCP vs. Lasso: when does more computation reduce statistical error?
2. Sparse principal component analysis in high dimensions
   (a) Thresholding methods
   (b) Semidefinite programming
8 Optimization-based estimators in (sparse) regression

Regularized QP:
    β̂ ∈ arg min_{β ∈ R^p} { (1/2n) ||y − Xβ||_2^2 + ρ_n R(β) },
where the first term is the data term and R(β) is the regularizer.

- R(β) = ||β||_2^2: ridge regression (Tik43, HoeKen70)
- R(β) = ||β||_1: convex, ℓ1-constrained QP (CheDonSau96; Tibs96)
- R(β) = ||β||_0: subset selection, combinatorial and NP-hard (Nat95)
- R(β) = ||β||_a, a ∈ (0, 1): non-convex ℓ_a regularization
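For the ℓ1 case, the regularized QP can be solved by proximal gradient descent (ISTA), whose prox step is elementwise soft-thresholding. A minimal numpy sketch on hypothetical toy data follows; the problem sizes and the choice ρ_n = 0.1 are illustrative, not tuned:

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding: the prox operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, rho, n_iter=500):
    """Minimize (1/2n)||y - X b||_2^2 + rho ||b||_1 by proximal gradient."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / L, rho / L)
    return beta

# Toy instance: k = 2 non-zero entries out of p = 20, mild noise.
rng = np.random.default_rng(0)
n, p = 100, 20
beta_star = np.zeros(p)
beta_star[:2] = [3.0, -2.0]
X = rng.standard_normal((n, p))
y = X @ beta_star + 0.1 * rng.standard_normal(n)
beta_hat = lasso_ista(X, y, rho=0.1)
print(np.nonzero(np.abs(beta_hat) > 1e-3)[0])   # estimated support
```

The soft-thresholding step is what produces exact zeros in β̂, which is why the ℓ1 estimator is useful for the support-recovery criterion discussed next.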
9 Different loss functions

Given an estimate β̂, how do we assess its performance?

1. Predictive loss: compute expected error E[||ỹ − Xβ̂||_2^2]
   - goal is to construct a model with good predictive power
   - β̂ itself of secondary interest (need not be uniquely determined)
2. ℓ2-loss: E[||β̂ − β*||_2^2]
   - appropriate when β* itself is of primary interest (signal recovery, compressed sensing, denoising, etc.)
3. Support recovery criterion: define the estimated support Ŝ(β̂) = { i = 1, ..., p | β̂_i ≠ 0 }, and measure the probability P[Ŝ(β̂) ≠ S(β*)]
   - useful for feature selection, dimensionality reduction, model selection
   - can be used as a pre-processing step for estimation in ℓ2-norm
10 1. Multivariate regression in high dimensions

[Figure: observation model Y = XB* + W, with the rows of the p × r matrix B* partitioned into support S and complement S^c.]

- signal B* is a p × r matrix, partitioned into non-zero rows S and zero rows S^c
- observe n noisy projections Y = XB* + W, defined via design matrix X ∈ R^{n×p} and noise matrix W ∈ R^{n×r}
- matrix Y ∈ R^{n×r} of observations
- high-dimensional scaling: allow parameters (n, p, r, |S|) to scale
11 Block regularization and second-order cone programs
(Obozinski, Taskar & Jordan, 2009)

For a fixed parameter q ∈ [1, ∞], estimate B* via:
    B̂ ∈ arg min_{B ∈ R^{p×r}} { (1/2n) ||Y − XB||_F^2 + ρ_n ||B||_{1,q} },
where the data term is ||Y − XB||_F^2 = Σ_{j=1}^n Σ_{l=1}^r (Y_jl − (XB)_jl)^2, the block regularizer is ||B||_{1,q} = Σ_{i=1}^p ||(B_i1, ..., B_ir)||_q, and the regularization constant ρ_n > 0 is to be chosen by the user.

Different cases:
- q = 1: elementwise ℓ1 norm (constrained QP)
- q = 2: second-order cone program (SOCP)
- q = ∞: block ℓ1/ℓ∞ max-norm (constrained QP)

In all cases, efficiently solvable (e.g., by interior point methods).

Generalization of the Lasso (Tibshirani, 1996; Chen et al., 1998); special case of the CAP family (Zhao, Rocha & Yu, 2006); see also (Turlach et al., 2005; Yuan & Lin, 2006; Nardi & Rinaldo, 2008).
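The three block norms are easy to compute directly; a small numpy sketch (the matrix values are arbitrary, for illustration):

```python
import numpy as np

def block_norm(B, q):
    """||B||_{1,q}: sum over the p rows of the lq-norm of each length-r row."""
    return np.linalg.norm(B, ord=q, axis=1).sum()

B = np.array([[1.0, -1.0],
              [0.0,  0.0],
              [3.0,  4.0]])
print(block_norm(B, 1))        # q = 1, elementwise l1: 9.0
print(block_norm(B, 2))        # q = 2: sqrt(2) + 0 + 5
print(block_norm(B, np.inf))   # q = inf: 1 + 0 + 4 = 5.0
```

Note that for q = 2 and q = ∞ a row is charged essentially once no matter how many columns use it, which is what couples the columns' supports.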
12 Two strategies

Goal: model selection consistency, i.e., recover the union of supports
    S(B*) := { i ∈ {1, 2, ..., p} | ||(B*_i1, ..., B*_ir)||_2 ≠ 0 }.

Lasso-based recovery:
1. Solve a separate Lasso (ℓ1-constrained QP) for each column l = 1, ..., r, yielding column vector β̂_l ∈ R^p.
2. Estimate row support Ŝ_Lasso = { i ∈ {1, 2, ..., p} | β̂_il ≠ 0 for some l }.

SOCP-based recovery:
1. Solve a single SOCP, obtaining matrix estimate B̂ ∈ R^{p×r}.
2. Estimate support Ŝ_SOCP = { i ∈ {1, ..., p} | ||(B̂_i1, ..., B̂_ir)||_2 ≠ 0 }.

Trade-offs:
- Lasso (QP) cheap to solve, but the method ignores coupling among columns
- SOCP more expensive, but the block regularizer is better tailored to matrix structure
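The two support estimators differ only in how the row pattern is read off the estimate; a sketch with a hypothetical 3 × 2 estimate:

```python
import numpy as np

def union_support(B_cols, tol=1e-8):
    """S_Lasso: union over columns l of the support of column estimate l."""
    return set(np.nonzero(np.any(np.abs(B_cols) > tol, axis=1))[0])

def row_support(B_hat, tol=1e-8):
    """S_SOCP: rows of the joint estimate with non-zero l2 norm."""
    return set(np.nonzero(np.linalg.norm(B_hat, axis=1) > tol)[0])

B = np.array([[0.5, 0.0],
              [0.0, 0.0],
              [0.0, 2.0]])
print(union_support(B))   # {0, 2}
print(row_support(B))     # {0, 2}
```

On the same matrix the two read-outs coincide; the statistical difference lies in how the estimates themselves are produced (separate QPs versus one SOCP).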
13 Scaling law for high-dimensional SOCP recovery
(Obozinski, Wainwright & Jordan, 2009)

SOCP method:
    B̂ ∈ arg min_{B ∈ R^{p×r}} (1/2n) ||Y − XB||_F^2 + ρ_n ||B||_{1,2}.

Parameters: problem dimension p; number of non-zero rows k. Design matrix X: i.i.d. rows from a sub-Gaussian distribution, with suitable covariance Σ.

Theorem: If the rescaled sample size
    θ_SOCP(n, p, k, B*) := n / ( Ψ(B*_S; Σ_SS) log(p − k) )
is greater than a critical threshold θ_ℓ(Σ; σ^2), then for suitable ρ_n we have, with probability greater than 1 − 2 exp(−c_2 log k):
(a) the SOCP has a unique solution B̂ such that S(B̂) ⊆ S(B*), and
(b) it includes all rows i with ||B*_i||_2 ≥ c_3 √( max{k, log(p − k)} / n ).
14 Assumptions on design covariance

- support set S = { i | β*_i ≠ 0 }, complement S^c := {1, ..., p} \ S
- random design matrix X ∈ R^{n×p}: rows drawn i.i.d. with covariance Σ, sub-Gaussian; partition Σ into blocks Σ_SS, Σ_SS^c, Σ_S^cS, Σ_S^cS^c

1. Bounded eigenspectrum: λ(Σ_SS) ∈ [C_min, C_max].
2. Mutual incoherence/irrepresentability: there exists ν ∈ (0, 1] such that
    ||Σ_S^cS (Σ_SS)^{-1}||_{∞,∞} ≤ 1 − ν.

Example: if Σ_SS = I, then we require max_{j ∈ S^c} Σ_{i ∈ S} |Σ_ji| ≤ 1 − ν.
15 Order parameter captures threshold (angle 0°)

[Figure: probability of success versus rescaled sample size θ_SOCP(n, p, k, B*) = n / (Ψ(B*_S; Σ_SS) log(p − k)).]

16 Order parameter captures threshold (angle 60°)

[Figure: probability of success versus rescaled sample size θ_SOCP(n, p, k, B*) = n / (Ψ(B*_S; Σ_SS) log(p − k)).]
17 Sparsity overlap function Ψ

- form the gradient matrix Z(B*_S) := ∂||B_S||_{1,2}/∂B_S evaluated at B_S = B*_S, a matrix in R^{k×r}
  (equivalent to renormalizing B*_S to have unit ℓ2-norm rows)
- form the r × r Gram matrix G = Z^T (Σ_SS)^{-1} Z, with entries G_ab = ⟨Z_a, Z_b⟩_{(Σ_SS)^{-1}}
- the sparsity overlap function is the maximum eigenvalue of G:
    Ψ(B*_S; Σ_SS) = ||G||_2
- measures relative alignments of the renormalized columns of B*

Special case: univariate regression (r = 1): Ψ(β*_S; I) = k for any vector β*_S.
18 Concrete examples (k = 4, r = 2)

- Aligned columns: each renormalized row of Z(B*_S) equals (1/√2)(1, 1), so G = [[2, 2], [2, 2]] and Ψ = ||G||_2 = 4.
- Orthogonal columns: the two columns have disjoint supports of size 2, so each row of Z(B*_S) is (±1, 0) or (0, ±1), giving G = [[2, 0], [0, 2]] and Ψ = ||G||_2 = 2.
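These two cases can be checked numerically; a minimal sketch of the sparsity-overlap computation (with Σ_SS = I, as in the example):

```python
import numpy as np

def sparsity_overlap(B_S, Sigma_SS):
    """Psi(B_S; Sigma_SS) = ||Z^T Sigma_SS^{-1} Z||_2, where Z is B_S with
    each row rescaled to unit l2-norm."""
    Z = B_S / np.linalg.norm(B_S, axis=1, keepdims=True)
    G = Z.T @ np.linalg.solve(Sigma_SS, Z)   # r x r Gram matrix
    return np.linalg.norm(G, 2)              # maximum eigenvalue (G is PSD)

k = 4
I = np.eye(k)
aligned = np.ones((k, 2))                       # both columns identical
orthogonal = np.vstack([np.eye(2), np.eye(2)])  # disjoint supports, 2 rows each
print(sparsity_overlap(aligned, I))     # 4.0
print(sparsity_overlap(orthogonal, I))  # 2.0
```

The aligned case attains the worst value Ψ = k, the orthogonal case the best value Ψ = k/r, matching the efficiency comparison on the next slides.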
19 Empirical illustration of sparsity-overlap Ψ

[Figure: three panels comparing the SOCP to the ordinary Lasso (problems solved separately): orthogonal regression (columns Z_1 ⊥ Z_2), intermediate angle (columns at 60°), and aligned regression (columns parallel).]
20 SOCP versus ordinary QP

Corollary: If Σ_SS = I_{k×k}, the SOCP always dominates the ordinary QP, with relative statistical efficiency
    1 ≤ (QP sample size)/(SOCP sample size) = max_{l=1,...,r} [ k_l log(p − k_l) ] / [ Ψ(B*_S; I) log(p − k) ] ≤ r.

- increased statistical efficiency of SOCP depends on orthogonality properties of the rescaled columns of B*_S
- up to a factor 1/r reduction in the number of samples required
- most pessimistic case: no gain for disjoint supports
- SOCP can be worse in some cases (if Σ_SS ≠ I)
21 Proof sketch of sufficient conditions

Direct analysis: given n noisy linear observations of β* ∈ R^p with |S(β*)| = k, the oracle decoder performs the following two steps:
1. For each subset S of size k, solve the quadratic program
    f(S) = min_{β_S ∈ R^k} ||y − X_S β_S||_2^2.
2. Output the subset Ŝ = arg min_{|S|=k} f(S).

- by symmetry of the ensemble, may assume that a fixed subset S is the true support
- for sets U different from the true set S, consider the range of non-overlaps t := |U \ S| ∈ {1, ..., k}
- number of subsets with non-overlap t given by N(t) = C(k, k − t) · C(p − k, t)
22 Error exponents for random projections

- the union bound yields an upper bound on the error probability P[error | S true]:
    Σ_{t=1}^k C(k, k − t) C(p − k, t) P[error on a subset with non-overlap t]
- orthogonal projection Π_U := I_{n×n} − X_U (X_U^T X_U)^{-1} X_U^T
- the optimal decoder chooses U incorrectly over S if and only if
    Δ(U) := ||Π_U (X_{S\U} β*_{S\U} + W)||_2^2 − ||Π_S W||_2^2 < 0,
  comparing the effective noise in U against the effective noise in S
- use large deviations to establish that P[Δ(U) < 0] ≤ exp(−n F(||β*_{S\U}||_2; t))
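The projection Π_U used above can be formed explicitly; a minimal numpy check of its two defining properties (it annihilates the columns of X_U and is idempotent), on arbitrary random data:

```python
import numpy as np

def residual_projection(X_U):
    """Pi_U = I - X_U (X_U^T X_U)^{-1} X_U^T: orthogonal projection onto the
    complement of the column span of X_U."""
    n = X_U.shape[0]
    return np.eye(n) - X_U @ np.linalg.solve(X_U.T @ X_U, X_U.T)

rng = np.random.default_rng(0)
X_U = rng.standard_normal((10, 3))
P = residual_projection(X_U)
print(np.allclose(P @ X_U, 0))   # annihilates the columns of X_U
print(np.allclose(P @ P, P))     # idempotent
```

Because Π_U X_U = 0, the residual f(U) = ||Π_U y||_2^2 retains only the part of the signal outside U plus projected noise, which is exactly the decomposition used in Δ(U).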
23 Proof sketch of necessary conditions

Fano's inequality applied to a restricted ensemble, assuming a fixed choice of β*:
    β_i[U] = β_min if i ∈ U, and 0 otherwise.

- by Fano's inequality, the probability of success is upper bounded as
    1 − P[error] ≤ ( I(Y; β*) + o(1) ) / log(M − 1),
  where I(Y; β*) is the mutual information between β* and the observation vector Y, and M = C(p, k) is the number of competing models
- some work to establish that the following upper bound holds w.h.p. over X:
    I(Y; β* | X) ≤ (n/2) log( 1 + (1 − k/p) k β_min^2 )
24 2. High-dimensional analysis of sparse PCA

- principal components analysis (PCA): classical method for dimensionality reduction
- high-dimensional version: eigenvectors from sample covariance Σ̂ based on n samples in p dimensions
- in general, high-dimensional PCA inconsistent unless p/n → 0 (Joh01, JohLu04)
- natural to investigate more structured ensembles for which consistency is still possible even with p/n → +∞:
  - sparse eigenvector recovery (JolEtal03, JohLu04, ZouEtAl06)
  - sparse covariance matrices (LevBic06, ElKar07)
25 Spiked covariance ensembles

- sequences {Σ_p} of spiked population covariance matrices:
    Σ_p = Σ_{i=1}^M α_i β_i β_i^T + Γ_p,
  with leading eigenvectors (β_i, i = 1, ..., M)
- past work on identity spiked ensembles (Γ_p = I_p) (Joh01; JohLu04)
- different sparsity models:
  - hard sparsity model: β* has exactly k non-zero coefficients
  - weak ℓq-sparsity: β* belongs to the ℓq-ball B_q(R_q) = { z ∈ R^p | Σ_{i=1}^p |z_i|^q ≤ R_q }
- given n i.i.d. samples {X_i}_{i=1}^n with E[X_i] = 0 and cov(X_i) = Σ_p
26 SDP relaxation of sparse PCA
(d'Aspremont, El Ghaoui, Jordan & Lanckriet, 2006)

- Courant–Fischer variational principle for the maximum eigenvalue/eigenvector (PCA):
    λ_max(Q) = max_{||z||_2 = 1} z^T Q z
- equivalent/exact semidefinite program (SDP) for the maximum eigenvector:
    λ_max(Q) = max_{Z ⪰ 0, trace(Z) = 1} trace(ZQ)
- SDP relaxation of sparse PCA:
    Ẑ = arg max_{Z ⪰ 0, trace(Z) = 1} { trace(ZQ) − ρ_n Σ_{i,j} |Z_ij| },
  with regularization parameter ρ_n > 0 chosen by the user
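The exactness of the unrelaxed SDP (ρ_n = 0) is easy to verify numerically: the feasible point Z = vv^T built from the leading eigenvector attains trace(ZQ) = λ_max(Q). A sketch on an arbitrary symmetric matrix:

```python
import numpy as np

# Check the exact SDP characterization lambda_max(Q) = max over PSD Z with
# trace(Z) = 1 of trace(ZQ): the maximum is attained at Z = v v^T for the
# leading eigenvector v, since trace(v v^T Q) = v^T Q v.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Q = A + A.T                      # symmetric test matrix
evals, evecs = np.linalg.eigh(Q)
v = evecs[:, -1]                 # leading (unit-norm) eigenvector
Z = np.outer(v, v)               # feasible: PSD, trace(Z) = ||v||^2 = 1
print(np.allclose(np.trace(Z @ Q), evals[-1]))
```

The ℓ1 penalty ρ_n Σ|Z_ij| then biases the maximizer toward matrices with few non-zero entries, which is the relaxation of eigenvector sparsity.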
27 Rates in spectral norm

- given n samples from the spiked identity model Σ_p = α zz^T + σ^2 I_p
- eigenvector z in the weak ℓq-ball B_q(R_q)
- SDP relaxation: Ẑ ∈ arg max_{Z ⪰ 0, trace(Z)=1} { trace(Z Σ̂) − ρ_n Σ_{i,j} |Z_ij| }

Theorem (AmiWai08b): Suppose that we apply the SDP to the sample covariance Σ̂ with regularization parameter ρ_n = f(α, σ^2) √(log p / n). Then with probability greater than 1 − c_1 exp(−c_2 log p) → 1, we have:
    ||Ẑ − zz^T||_2 ≤ C R_q ( log p / n )^{ 1/(2(1+q)) }.

Example (hard sparsity): q = 0 and radius R_q = k (number of non-zeros):
    ||Ẑ − zz^T||_2 ≤ C √( k^2 log p / n ).
28 Comparison to some known results

Estimating sparse covariance matrices (BicLev07): the thresholding estimator T_{λn}(Σ̂) achieves rate
    ||T_{λn}(Σ̂) − Σ||_2 ≤ C R_q ( log p / n )^{ (1−q)/2 }.
- by matrix perturbation results, for well-separated eigenvalues, the same rate applies to the leading eigenvector
- agrees with the SDP result for q = 0, but slower rate for q > 0

Minimax rates for q ∈ (0, 2) (PauJoh08): with the sign convention ⟨ẑ, z⟩ ≥ 0,
    min_ẑ max_{z ∈ B_q(R_q)} E[||ẑ − z||_2^2] ≍ C R_q ( log p / n )^{ 1 − q/2 }.
- same rate as the normal sequence model (DonJoh94)
- SDP rate is slower, but approaches the minimax rate as q → 0
29 Model selection consistency for hard sparsity (q = 0)

Goal: given a spiked model with k-sparse eigenvector (z_i = ±1/√k), recover the support set
    S(z) = { i ∈ {1, 2, ..., p} | z_i ≠ 0 }
exactly.

Methods:
1. Diagonal thresholding: complexity O(np + p log p) (JohLu04)
   (a) Form the sample covariance Σ̂ = (1/n) Σ_{i=1}^n X_i X_i^T.
   (b) Extract the top k order statistics Σ̂_(11), ..., Σ̂_(kk) of the diagonal, and estimate the support Ŝ(D) by their rank indices.
2. SDP-based recovery: complexity O(np + p^4 log p) (AspLanGhaJor08)
   (a) Solve the SDP with ρ_n = α/(2σ^2 k).
   (b) Given the solution Ẑ, estimate the support Ŝ := { i ∈ {1, ..., p} | Ẑ_ij ≠ 0 for some j }.
30 Sharp threshold for diagonal thresholding

Model: Σ_p = α zz^T + σ^2 I_p.
Parameters: p, the model dimension; k, the number of non-zeroes in the spiked eigenvector.

Proposition (AmiWai08a): If k = O(p^{1−δ}) for any δ ∈ (0, 1), diagonal thresholding for support recovery is controlled by the rescaled sample size
    θ_thr(n, p, k) := n / ( k^2 log(p − k) ).
That is, there are constants 0 < τ_ℓ(α, σ^2) ≤ τ_u(α, σ^2) < ∞ such that
(a) Success: if n > τ_u k^2 log(p − k), then
    P[Ŝ(D) = S(z)] ≥ 1 − c_1 exp(−c_2 k^2 log(p − k)) → 1.
(b) Failure: if n ≤ τ_ℓ k^2 log(p − k), then
    P[Ŝ(D) = S(z)] ≤ c_1 exp(−c_2 log(p − k)) → 0.
31 Performance of diagonal thresholding

[Figure: two panels, (a) logarithmic sparsity k = O(log p) and (b) square-root sparsity k = O(√p): probability of success P[Ŝ(D) = S(z)] versus the rescaled sample size θ_thr(n, p, k) = n / (k^2 log(p − k)), for several problem sizes p (100, 300, 600, ...).]
32 Eigenvector support recovery via SDP relaxation

- spiked identity model Σ_p = α zz^T + σ^2 I_p with k-sparse eigenvector z
- SDP relaxation: Ẑ ∈ arg max_{Z ⪰ 0, trace(Z)=1} trace(Z Σ̂) − ρ_n Σ_{i,j} |Z_ij|

Theorem (AmiWai08a): Suppose that we solve the SDP with ρ_n = α/(2σ^2 k). Then there are constants θ_wr and θ_crit such that
(a) for sample sizes such that θ_thr(n, p, k) = n / (k^2 log(p − k)) > θ_wr, the SDP has a rank-one solution w.h.p., and
(b) for problem sequences such that k = O(log p) and
    θ_sdp(n, p, k) := n / (k log(p − k)) > θ_crit,
a rank-one solution (when it exists) specifies the correct support w.h.p.

Remark: the technical condition k = O(log p) is likely an artifact.
33 Performance of SDP relaxation

[Figure: two panels, (a) logarithmic sparsity k = O(log p) and (b) linear sparsity k = 0.1p: probability of success P[Ŝ = S(z)] versus the rescaled sample size θ_sdp(n, p, k) = n / (k log(p − k)), for p ∈ {100, 200, 300}.]
34 Summary and open directions

1. When does more computation yield greater statistical accuracy?
   - Multivariate regression: second-order cone programming versus quadratic programming (Lasso)
   - Sparse PCA: diagonal thresholding versus SDP relaxation
2. When are polynomial-time algorithms as good as optimal algorithms?
   - Multivariate regression: Lasso/SOCP order-optimal for k = o(p)
   - Sparse PCA: SDP relaxation order-optimal for k = O(log p)
More informationESL Chap3. Some extensions of lasso
ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied
More informationWithin Group Variable Selection through the Exclusive Lasso
Within Group Variable Selection through the Exclusive Lasso arxiv:1505.07517v1 [stat.me] 28 May 2015 Frederick Campbell Department of Statistics, Rice University and Genevera Allen Department of Statistics,
More informationAn Introduction to Sparse Approximation
An Introduction to Sparse Approximation Anna C. Gilbert Department of Mathematics University of Michigan Basic image/signal/data compression: transform coding Approximate signals sparsely Compress images,
More informationLearning gradients: prescriptive models
Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan
More informationProperties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation
Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon
More informationChapter 3 Transformations
Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,
More informationSTATS 306B: Unsupervised Learning Spring Lecture 13 May 12
STATS 306B: Unsupervised Learning Spring 2014 Lecture 13 May 12 Lecturer: Lester Mackey Scribe: Jessy Hwang, Minzhe Wang 13.1 Canonical correlation analysis 13.1.1 Recap CCA is a linear dimensionality
More informationCompressed Sensing and Neural Networks
and Jan Vybíral (Charles University & Czech Technical University Prague, Czech Republic) NOMAD Summer Berlin, September 25-29, 2017 1 / 31 Outline Lasso & Introduction Notation Training the network Applications
More informationESTIMATION OF (NEAR) LOW-RANK MATRICES WITH NOISE AND HIGH-DIMENSIONAL SCALING
Submitted to the Annals of Statistics ESTIMATION OF (NEAR) LOW-RANK MATRICES WITH NOISE AND HIGH-DIMENSIONAL SCALING By Sahand Negahban and Martin J. Wainwright University of California, Berkeley We study
More informationA lower bound for the Laplacian eigenvalues of a graph proof of a conjecture by Guo
A lower bound for the Laplacian eigenvalues of a graph proof of a conjecture by Guo A. E. Brouwer & W. H. Haemers 2008-02-28 Abstract We show that if µ j is the j-th largest Laplacian eigenvalue, and d
More informationInformation-theoretically Optimal Sparse PCA
Information-theoretically Optimal Sparse PCA Yash Deshpande Department of Electrical Engineering Stanford, CA. Andrea Montanari Departments of Electrical Engineering and Statistics Stanford, CA. Abstract
More informationSketching for Large-Scale Learning of Mixture Models
Sketching for Large-Scale Learning of Mixture Models Nicolas Keriven Université Rennes 1, Inria Rennes Bretagne-atlantique Adv. Rémi Gribonval Outline Introduction Practical Approach Results Theoretical
More informationIntroduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin
1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)
More informationGaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature
More informationRobust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model Fang Han Department of Biostatistics Johns Hopkins University Baltimore, MD 21210 fhan@jhsph.edu Han Liu Department
More informationHigh Dimensional Covariance and Precision Matrix Estimation
High Dimensional Covariance and Precision Matrix Estimation Wei Wang Washington University in St. Louis Thursday 23 rd February, 2017 Wei Wang (Washington University in St. Louis) High Dimensional Covariance
More informationSpectral Graph Theory Lecture 3. Fundamental Graphs. Daniel A. Spielman September 5, 2018
Spectral Graph Theory Lecture 3 Fundamental Graphs Daniel A. Spielman September 5, 2018 3.1 Overview We will bound and derive the eigenvalues of the Laplacian matrices of some fundamental graphs, including
More informationIntroduction to Compressed Sensing
Introduction to Compressed Sensing Alejandro Parada, Gonzalo Arce University of Delaware August 25, 2016 Motivation: Classical Sampling 1 Motivation: Classical Sampling Issues Some applications Radar Spectral
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear
More informationL 0 methods. H.J. Kappen Donders Institute for Neuroscience Radboud University, Nijmegen, the Netherlands. December 5, 2011.
L methods H.J. Kappen Donders Institute for Neuroscience Radboud University, Nijmegen, the Netherlands December 5, 2 Bert Kappen Outline George McCullochs model The Variational Garrote Bert Kappen L methods
More informationInformation-Theoretic Methods in Data Science
Information-Theoretic Methods in Data Science Information-theoretic bounds on sketching Mert Pilanci Department of Electrical Engineering Stanford University Contents Information-theoretic bounds on sketching
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationGeneralized Power Method for Sparse Principal Component Analysis
Generalized Power Method for Sparse Principal Component Analysis Peter Richtárik CORE/INMA Catholic University of Louvain Belgium VOCAL 2008, Veszprém, Hungary CORE Discussion Paper #2008/70 joint work
More informationDimension reduction for semidefinite programming
1 / 22 Dimension reduction for semidefinite programming Pablo A. Parrilo Laboratory for Information and Decision Systems Electrical Engineering and Computer Science Massachusetts Institute of Technology
More informationhigh-dimensional inference robust to the lack of model sparsity
high-dimensional inference robust to the lack of model sparsity Jelena Bradic (joint with a PhD student Yinchu Zhu) www.jelenabradic.net Assistant Professor Department of Mathematics University of California,
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationApproximate Principal Components Analysis of Large Data Sets
Approximate Principal Components Analysis of Large Data Sets Daniel J. McDonald Department of Statistics Indiana University mypage.iu.edu/ dajmcdon April 27, 2016 Approximation-Regularization for Analysis
More informationBasic Concepts in Matrix Algebra
Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1
More informationInformation-Theoretic Limits of Group Testing: Phase Transitions, Noisy Tests, and Partial Recovery
Information-Theoretic Limits of Group Testing: Phase Transitions, Noisy Tests, and Partial Recovery Jonathan Scarlett jonathan.scarlett@epfl.ch Laboratory for Information and Inference Systems (LIONS)
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationGraph Partitioning Using Random Walks
Graph Partitioning Using Random Walks A Convex Optimization Perspective Lorenzo Orecchia Computer Science Why Spectral Algorithms for Graph Problems in practice? Simple to implement Can exploit very efficient
More information