Applied Statistics and Machine Learning
1 Applied Statistics and Machine Learning. Theory: Stability, CLT, and Sparse Modeling. Bin Yu, IMA, June 26, 2013
2 Today's plan
1. Reproducibility and statistical stability
2. In the classical world: CLT stability result; asymptotic normality and the score function for parametric inference; valid bootstrap
3. In the modern world: L2 estimation rate of the Lasso and extensions; model selection consistency of the Lasso; sample-to-sample variability meets heavy-tail noise
3 2013: Information technology and data. "Although scientists have always comforted themselves with the thought that science is self-correcting, the immediacy and rapidity with which knowledge disseminates today means that incorrect information can have a profound impact before any corrective process can take place." Reforming Science: Methodological and Cultural Reforms, Infection and Immunity editorial (2012) by Arturo Casadevall (Editor in Chief, mBio) and Ferric C. Fang (Editor in Chief, Infection and Immunity)
4 2013: IT and data. "A recent study* analyzed the cause of retraction for 788 retracted papers and found that error and fraud were responsible for 545 (69%) and 197 (25%) cases, respectively, while the cause was unknown in 46 (5.8%) cases (31)." -- Casadevall and Fang (2012). * R. Grant Steen (2011), J. Med. Ethics. Casadevall and Fang called for "enhanced training in probability and statistics."
5 Scientific reproducibility. Recent papers on reproducibility: Ioannidis, 2005; Kraft et al., 2009; Nosek et al., 2012; Naik, 2011; Booth, 2012; Donoho et al., 2009; Fanio et al., 2012. Statistical stability is a minimum requirement for scientific reproducibility. Statistical stability: statistical conclusions should be stable to appropriate perturbations of data and/or models. Stable statistical results are more reliable.
6 Scientific reproducibility is our responsibility.
- Stability is a minimum requirement for reproducibility.
- Statistical stability: statistical conclusions should be robust to appropriate perturbations of data.
7 Data perturbation has a long history. Data perturbation schemes are now routinely employed to estimate the bias, variance, and sampling distribution of an estimator.
- Jackknife: Quenouille (1949, 1956), Tukey (1958), Mosteller and Tukey (1977), Efron and Stein (1981), Wu (1986), Carlstein (1986), Wu (1990)
- Sub-sampling: Mahalanobis (1949), Hartigan (1969, 1970), Politis and Romano (1992, 1994), Bickel, Götze, and van Zwet (1997)
- Cross-validation: Allen (1974), Stone (1974, 1977), Hall (1983), Li (1985)
- Bootstrap: Efron (1979), Bickel and Freedman (1981), Beran (1984)
8 The stability argument is central to limiting-law results. One proof of the CLT (bedrock of classical statistical theory):
1. Universality of a limiting law for a normalized sum of iid variables through a stability argument, or Lindeberg's swapping trick: perturbing a (normalized) sum by a random variable with matching first and second moments does not change the distribution of the (normalized) sum in the limit.
2. Finding the limit law via an ODE. Cf. lecture notes of Terence Tao.
9-11 CLT. A proof of the CLT via Lindeberg's swapping trick. (The displayed equations on these three slides were lost in transcription; for a complete proof, see Tao's lecture notes.)
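A standard rendering of the swapping argument, consistent with the outline on slide 8 (a reconstruction, not the slides' exact notation):

```latex
\text{Let } X_1,\dots,X_n \text{ be iid with } \mathbb{E}X_i = 0,\ \mathbb{E}X_i^2 = 1,
\text{ and let } Y_1,\dots,Y_n \overset{\text{iid}}{\sim} N(0,1) \text{ be independent of the } X_i.
\text{ Set}
\quad Z_i = \frac{Y_1 + \cdots + Y_i + X_{i+1} + \cdots + X_n}{\sqrt{n}},
\qquad Z_0 = S_n, \quad Z_n = T_n \sim N(0,1).
\text{For a smooth test function } \varphi,
\quad \mathbb{E}\varphi(S_n) - \mathbb{E}\varphi(T_n)
  = \sum_{i=1}^n \bigl[\mathbb{E}\varphi(Z_{i-1}) - \mathbb{E}\varphi(Z_i)\bigr].
\text{Each term swaps } X_i \text{ for } Y_i\text{; Taylor expanding } \varphi \text{ around the shared part cancels}
\text{the first- and second-moment terms, leaving }
O\!\bigl((\mathbb{E}|X_i|^3 + \mathbb{E}|Y_i|^3)\, n^{-3/2}\bigr)
\text{ per swap, so the total difference is } O(n^{-1/2}) \to 0.
```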
12 Asymptotic normality of the MLE. Via a Taylor expansion, for a nice parametric family f(x, θ), the MLE θ̂_n from n iid samples X_1, ..., X_n satisfies the approximation

√n(θ̂_n − θ) ≈ I(θ)^{-1} (Y_1 + ... + Y_n)/√n,

where Y_i = U_θ(X_i), I(θ) is the Fisher information, and U_θ(x) = d log f(x, θ)/dθ is the score function, whose first moment (expected value) is zero.
13 Asymptotic normality of the MLE comes from stability. Lindeberg's swapping-trick proof of the CLT means that the MLE is asymptotically normal because we can swap an independent normal random variable (of mean zero and variance equal to the Fisher information) into the sum of the score functions without changing the sum much: there is stability.
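As a quick numerical illustration (not from the slides): for the N(θ, 1) family the score is U_θ(x) = x − θ, the Fisher information is I(θ) = 1, and the MLE is the sample mean, so √n(θ̂_n − θ) should look like N(0, 1/I(θ)) = N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 100, 4000

# Draw `reps` datasets of size n from N(theta, 1); the MLE is the sample mean.
samples = rng.normal(loc=theta, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - theta)

# z should have mean close to 0 and variance close to 1/I(theta) = 1.
print(round(float(z.mean()), 2), round(float(z.var()), 2))
```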
14 Why does the bootstrap work?
15 A sketch of the proof: a similar proof can be carried out using Lindeberg's swapping trick, with the minor modification that the second moments are matched only to order 1/√n. That is, let

S_{n,i} = [(X_1 − μ) + ... + (X_i − μ) + (X*_{i+1} − μ̂_n) + ... + (X*_n − μ̂_n)] / √n,
S_{n,i−1} = [(X_1 − μ) + ... + (X_{i−1} − μ) + (X*_i − μ̂_n) + ... + (X*_n − μ̂_n)] / √n,

where the X_j are the original observations with mean μ and the X*_j are bootstrap draws centered at the sample mean μ̂_n. The summands in the two sums have the same first moment 0, their second moments differ by O(1/√n), and their third moments are finite. Going through Tao's proof, the second-moment term contributes another term of order O(n^{−3/2}) per swap, which is acceptable.
16 The stability argument is central. Recent generalizations obtain other universal limiting distributions, e.g., the Wigner law under non-Gaussian assumptions and last-passage percolation (Chatterjee, 2006; Suidan, 2006). Concentration results also assume stability-type conditions. In learning theory, stability is closely related to good generalization performance (Devroye and Wagner, 1979; Kearns and Ron, 1999; Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Mukherjee et al., 2006).
17 Lasso and/or compressed sensing: much theory lately. Compressed sensing emphasizes the choice of design matrix (often with iid Gaussian entries). Theoretical studies: much recent work on the Lasso in terms of
- L2 error of the parameter
- model selection consistency
- L2 prediction error (not covered in this lecture)
18 Regularized M-estimation, including the Lasso. Estimation: minimize a loss function plus a regularization term:

θ̂_{λ_n} ∈ arg min_{θ ∈ R^p} { L_n(θ; X_1^n) + λ_n r(θ) },

where L_n is the loss function and r is the regularizer. Goal: for some error metric d (e.g., L2 error), bound the difference d(θ̂_{λ_n} − θ*) in the high-dimensional scaling where (n, p) tends to infinity.
19 Example 1: Lasso (sparse linear model). Set-up: noisy observations y = Xθ* + w, with X an n × p design and θ* sparse with support S. Estimator: the Lasso program

θ̂_{λ_n} ∈ arg min_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)^2 + λ_n Σ_{j=1}^p |θ_j|.

Some past work: Tropp, 2004; Fuchs, 2000; Meinshausen/Bühlmann, 2005; Candes/Tao, 2005; Donoho, 2005; Zhao & Yu, 2006; Zou, 2006; Wainwright, 2009; Koltchinskii, 2007; Tsybakov et al., 2007; van de Geer, 2007; Zhang and Huang, 2008; Meinshausen and Yu, 2009; Bickel et al., 2008; Negahban et al., 2012.
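The Lasso program above can be solved by coordinate descent with soft-thresholding. A minimal numpy sketch for the objective (1/n)Σ(y_i − x_i^T θ)² + λ Σ|θ_j| (a generic illustration under the stated objective, not the code used in the lecture):

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/n)||y - X theta||_2^2 + lam * ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = (2.0 / n) * np.sum(X ** 2, axis=0)  # per-coordinate curvature (2/n)||X_j||^2
    for _ in range(n_iter):
        for j in range(p):
            # residual with coordinate j's contribution removed
            r_j = y - X @ theta + X[:, j] * theta[j]
            rho = (2.0 / n) * (X[:, j] @ r_j)
            theta[j] = soft_threshold(rho, lam) / col_sq[j]
    return theta

# Toy sparse recovery: k = 3 relevant predictors out of p = 50
rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
theta_true = np.zeros(p)
theta_true[:3] = [3.0, -2.0, 1.5]
y = X @ theta_true + 0.1 * rng.normal(size=n)
theta_hat = lasso_cd(X, y, lam=0.2)
print(sorted(np.flatnonzero(np.abs(theta_hat) > 0.5).tolist()))
```

With this well-conditioned Gaussian design and small noise, the estimate should place its large coordinates on the true support, with each active coordinate shrunk toward zero by roughly λ/2.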
20 Example 2: Low-rank matrix approximation. Set-up: matrix Θ* ∈ R^{k×m} with rank r ≪ min{k, m}, so Θ* = U D V^T with U ∈ R^{k×r}, V ∈ R^{m×r}. Estimator:

Θ̂ ∈ arg min_Θ (1/n) Σ_{i=1}^n (y_i − ⟨X_i, Θ⟩)^2 + λ_n Σ_{j=1}^{min{k,m}} σ_j(Θ),

where the σ_j(Θ) are the singular values of Θ (nuclear-norm regularization). Some past work: Frieze et al., 1998; Achlioptas & McSherry, 2001; Fazel & Boyd, 2001; Srebro et al., 2004; Drineas et al., 2005; Rudelson & Vershynin, 2006; Recht et al., 2007; Bach, 2008; Candes and Tao, 2009; Halko et al., 2009; Keshavan et al., 2009; Negahban & Wainwright, 2009; Tsybakov.
21 Example 3: Structured (inverse) covariance estimation. Set-up: samples from a random vector with sparse covariance Σ or sparse inverse covariance Θ* ∈ R^{p×p}; the zero pattern of the inverse covariance encodes the graph structure. Estimator:

Θ̂ ∈ arg min_Θ ⟨(1/n) Σ_{i=1}^n x_i x_i^T, Θ⟩ − log det(Θ) + λ_n Σ_{i≠j} |Θ_{ij}|.

Some past work: Yuan and Lin, 2006; d'Aspremont et al., 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al., 2007; Zhou et al., 2007; Friedman et al., 2008; Ravikumar et al., 2008; Lam and Fan, 2009; Cai and Zhou, 2009.
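The objective above is easy to write down directly. A small numpy sketch (the function name is mine, for illustration) checks that, with λ = 0, the objective evaluated at the population covariance S = Σ is smaller at Θ = Σ^{-1} than at a mismatched candidate:

```python
import numpy as np

def penalized_logdet_objective(S, Theta, lam):
    """<S, Theta> - log det(Theta) + lam * sum_{i != j} |Theta_ij|."""
    off_diag = np.abs(Theta).sum() - np.abs(np.diag(Theta)).sum()
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    return float(np.sum(S * Theta) - logdet + lam * off_diag)

Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
at_truth = penalized_logdet_objective(Sigma, np.linalg.inv(Sigma), lam=0.0)
at_identity = penalized_logdet_objective(Sigma, np.eye(2), lam=0.0)
print(at_truth < at_identity)  # the true inverse covariance minimizes the unpenalized objective
```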
22 Unified analysis. S. Negahban, P. Ravikumar, M. J. Wainwright and B. Yu (2012). A unified framework for the analysis of regularized M-estimators. Statistical Science, 27(4). Many high-dimensional models and associated results have been treated case by case. Is there a core set of ideas underlying these analyses? We discovered that two key conditions are needed for a unified error bound:
- decomposability of the regularizer r (e.g., the L1 norm)
- restricted strong convexity of the loss function (e.g., the least squares loss)
23 Why can we estimate parameters? Strong convexity of the cost (curvature, captured in the Fisher information when p is fixed). (Figure: (a) high curvature makes the parameter easy to estimate; (b) low curvature makes it harder.) Asymptotic analysis of the maximum likelihood estimator or OLS:
1. Fisher information (Hessian matrix) non-degenerate, i.e., strong convexity in all directions.
2. Sampling noise disappears as the sample size gets large.
24 In high dimensions, when r corresponds to structure (e.g., sparsity):
1. Restricted strong convexity (RSC), courtesy of high dimensions: loss functions are often flat in many directions, so curvature is needed only over a restricted set C. The loss L_n(θ) := L_n(θ; X_1^n) satisfies

L_n(θ* + Δ) − L_n(θ*) − ⟨∇L_n(θ*), Δ⟩ ≥ γ(L) d^2(Δ) for all Δ ∈ C

(excess loss minus the score term is bounded below by the squared error).
2. Decomposability of the regularizer makes C small in high dimensions: for subspace pairs (A, B⊥), where A represents the model constraints,

r(u + v) = r(u) + r(v) for all u ∈ A and v ∈ B⊥,

which forces the error Δ̂ = θ̂_{λ_n} − θ* into C.
25 In high dimensions, when r corresponds to structure (e.g., sparsity): when p > n, as in the fMRI problem, least squares is flat in many directions, so it is impossible to have strong convexity in all directions; regularization is needed. When λ_n is large enough to overcome the sampling noise, we have a deterministic situation: the decomposable regularization norm forces the estimation difference θ̂_{λ_n} − θ* into a constraint (cone) set. This constraint set is small relative to R^p when p is large, and strong convexity is needed only over this constraint set. When the predictors are random and Gaussian (dependence is ok), strong convexity does hold for least squares over the l1-norm-induced constraint set.
26 Main result (Negahban et al., 2012). Theorem (Negahban, Ravikumar, Wainwright & Yu, 2009). Suppose the regularizer decomposes across the pair (A, B⊥) with A ⊆ B, and restricted strong convexity holds over C. With the regularization constant chosen so that λ_n ≥ 2 r*(∇L(θ*; X_1^n)), any solution θ̂_{λ_n} satisfies

d(θ̂_{λ_n} − θ*) ≤ (1/γ(L)) Ψ(B̄) λ_n [estimation error] + sqrt( (2 λ_n / γ(L)) r(Π_{A⊥}(θ*)) ) [approximation error].

Quantities that control the rates: the restricted strong convexity parameter γ(L); the dual norm of the regularizer, r*(v) := sup_{r(u)=1} ⟨v, u⟩; and the optimal subspace constant Ψ(B̄) := sup_{θ ∈ B̄\{0}} r(θ)/d(θ). More work is required for each case: verify restricted strong convexity, and find λ_n to overcome the sampling noise (concentration inequalities).
27 Recovering an existing result (Bickel et al., 2008). Example: linear regression (exact sparsity). Lasso program: min_{θ ∈ R^p} (1/n)||y − Xθ||_2^2 + λ_n ||θ||_1. RSC reduces to a lower bound on the restricted singular values of X ∈ R^{n×p}; for a k-sparse vector, ||θ||_1 ≤ √k ||θ||_2. Corollary: suppose the true parameter θ* is exactly k-sparse. Under RSC and with λ_n ≥ 2 ||X^T ε/n||_∞, any Lasso solution satisfies ||θ̂_{λ_n} − θ*||_2 ≤ γ(L)^{-1} √k λ_n. Some stochastic instances recover known results. Compressed sensing: X_ij ~ N(0, 1) and bounded noise ||ε||_2 ≤ σ√n. Deterministic design: X with bounded columns and ε_i ~ N(0, σ^2); then ||X^T ε/n||_∞ ≤ sqrt(2σ^2 log p / n) w.h.p., so ||θ̂_{λ_n} − θ*||_2 ≤ (8σ/γ(L)) sqrt(k log p / n). (e.g., Candes & Tao, 2007; Meinshausen/Yu, 2008; Zhang and Huang, 2008; Bickel et al., 2009)
28 Obtaining a new result. Example: linear regression (weak sparsity). For some q ∈ [0, 1], say θ* belongs to the l_q ball

B_q(R_q) := { θ ∈ R^p : Σ_{j=1}^p |θ_j|^q ≤ R_q }.

Corollary: for θ* ∈ B_q(R_q), any Lasso solution satisfies (w.h.p.)

||θ̂_{λ_n} − θ*||_2^2 ≤ O( σ^2 R_q (log p / n)^{1 − q/2} ).

This rate is known to be minimax optimal (Raskutti et al., 2009).
29 Effective sample size n/log(p). We lose at least a factor of log(p) for having to search over a set of p predictors for a small number of relevant predictors. For the static image data, we lose a factor of log(10921) ≈ 9.30, so the effective sample size is not n = 1750 but 1750/9.30 ≈ 188. For the movie data, we lose at least a factor of log(26000) ≈ 10.17, so the effective sample size is not n = 7200 but 7200/10.17 ≈ 708. The log p factor is very much a consequence of the light-tailed Gaussian (or sub-Gaussian) noise assumption; we lose more if the noise term has a heavier tail.
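The arithmetic above is easy to reproduce (a trivial check of the slide's numbers, using the natural log):

```python
import math

# Effective sample size n / log(p) for the two data sets mentioned above
for n, p in [(1750, 10921), (7200, 26000)]:
    print(f"n = {n}, p = {p}, effective sample size ~ {n / math.log(p):.0f}")
```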
30 Summary of the unified analysis.
- Recovered existing results: sparse linear regression with the Lasso; multivariate group Lasso; sparse inverse covariance estimation.
- Derived new results: weakly sparse linear models with the Lasso; low-rank matrix estimation; generalized linear models.
31 An MSE result for L2Boosting. Peter Bühlmann and Bin Yu (2003). Boosting with the L2 loss: regression and classification. J. Amer. Statist. Assoc. 98. Note that the variance term is a measure of complexity: it increases by an exponentially diminishing amount as the iteration number m increases, and it is always bounded.
32 Model selection consistency of the Lasso. Zhao and Yu (2006), On model selection consistency of Lasso, JMLR, 7. Set-up: linear regression model with n observations and p predictors. Assume condition (A). (The display defining (A) was lost in transcription.) Knight and Fu (2000) showed L2 estimation consistency under (A) for fixed p.
33 Model selection consistency of the Lasso: p small, n large (Zhao and Yu, 2006). Assume (A) and the irrepresentable condition

| sign(β_S)^T (X_S^T X_S)^{-1} X_S^T X_{S^c} | < 1 elementwise (a 1 × (p − s) vector),

where S indexes the s relevant predictors. Then, roughly*, the irrepresentable condition is equivalent to model selection consistency. (* Some ambiguity when equality holds.) A population version replaces the sample Gram matrices by their population counterparts. Related work: Tropp (2006), Meinshausen and Bühlmann (2006), Zou (2006), Wainwright (2009).
34 Irrepresentable condition (s = 2, p = 3): geometry. (Figures: the geometry of the condition for constant correlation r = 0.4 and r = 0.6.)
35 Model selection consistency of the Lasso. Consistency also holds for s and p growing with n, assuming: the irrepresentable condition; bounds on the maximum and minimum eigenvalues of the design matrix; and the smallest nonzero coefficient bounded away from zero. Results cover Gaussian noise (Wainwright, 2009) and noise with finite 2k-th moment (Zhao & Yu, 2006). (The displayed scaling conditions were lost in transcription.)
36 Consistency of the Lasso for model selection: interpretation of the condition. Regress the irrelevant predictors on the relevant predictors. If the L1 norm of the regression coefficients is
- larger than 1: the Lasso cannot distinguish the irrelevant predictor from the relevant predictors for some parameter values;
- smaller than 1: the Lasso can distinguish the irrelevant predictor from the relevant predictors.
Sufficient conditions (verifiable): constant correlation; power-decay correlation; bounded correlation.
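This interpretation suggests a direct numerical check (an illustrative sketch; the function and variable names are mine): regress each irrelevant column on the relevant ones and look at the signed quantity from the condition.

```python
import numpy as np

def irrepresentable_value(X, support, signs):
    """Max abs entry of sign(beta_S)^T (X_S^T X_S)^{-1} X_S^T X_{S^c}."""
    S = X[:, support]
    Sc = np.delete(X, support, axis=1)
    v = signs @ np.linalg.solve(S.T @ S, S.T @ Sc)
    return float(np.max(np.abs(v)))

rng = np.random.default_rng(2)
n = 100
X1, X2 = rng.normal(size=(2, n))
unrelated = rng.normal(size=n)       # independent irrelevant predictor
confounded = 0.6 * X1 + 0.6 * X2     # exactly in the span of the relevant predictors
signs = np.array([1.0, 1.0])

ok = irrepresentable_value(np.column_stack([X1, X2, unrelated]), [0, 1], signs)
bad = irrepresentable_value(np.column_stack([X1, X2, confounded]), [0, 1], signs)
print(ok < 1.0, round(bad, 1))  # condition holds for the unrelated column; fails (value 1.2) for the confounded one
```

The confounded column regresses on the relevant pair with coefficients (0.6, 0.6), so its L1 quantity is 1.2 > 1 by construction, matching the "larger than 1" case above.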
37 Bootstrap and Lasso+mLS or Lasso+Ridge (Liu and Yu, 2013). Because the Lasso is biased, it is common to use it to select variables and then correct the bias with a modified OLS (mLS) or a ridge fit with a very small smoothing parameter. A residual bootstrap can then be carried out to get confidence intervals for the parameters. Under the irrepresentable condition (IC), Lasso+mLS and Lasso+Ridge are shown to have the parametric MSE rate k/n (no log p term), and the residual bootstrap is shown to work as well. Even when the IC does not hold, simulations show that this bootstrap seems to work better than its comparison methods in the paper. See Liu and Yu (2013) for more details.
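A minimal sketch of the residual bootstrap mechanism, shown here for a plain low-dimensional OLS fit rather than the Lasso+mLS estimator of the paper (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([2.0, 0.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Fit, form centered residuals, then resample residuals with replacement.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
resid = resid - resid.mean()

B = 500
boot = np.empty((B, p))
for b in range(B):
    y_star = X @ beta_hat + rng.choice(resid, size=n, replace=True)
    boot[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)

lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)  # 95% percentile intervals
```

In Liu and Yu (2013) the same resampling is applied to the residuals of the two-stage Lasso+mLS (or Lasso+Ridge) fit, which is what the IC-based theory justifies.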
38 L1-penalized log Gaussian likelihood. Given n iid observations of X, Banerjee, El Ghaoui, and d'Aspremont (2008) solve the l1-penalized Gaussian log-likelihood (the displayed objective was lost in transcription) by a block descent algorithm.
39 Model selection consistency. Ravikumar, Wainwright, Raskutti, and Yu (2011) give sufficient conditions for model selection consistency, stated in terms of the Hessian of the objective and a model complexity parameter. (The displayed definitions were lost in transcription.)
40 Model selection consistency (Ravikumar et al., 2011). Assume the irrepresentable condition holds, and either
1. X is sub-Gaussian with a given parameter and a sufficiently large effective sample size, or
2. X has finite 4m-th moments.
Then, with high probability as n tends to infinity, the correct model is chosen.
41 Success probability's dependence on n and p (Gaussian). (Figure: edge covariances Θ*_ij = 0.1; each point is an average over 100 trials. The curves stack up in the second plot, so n/log p controls model selection.)
42 Success probability's dependence on model complexity K and n. (Figure: chain graph with p = 120 nodes; curves from left to right have increasing values of K. Models with larger K thus require more samples n for the same probability of success.)
43 For model selection consistency results for gradient and/or backward-forward algorithms, see the work of T. Zhang:
- Tong Zhang. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Trans. Info. Theory, 57.
- Tong Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Trans. Info. Theory, 57.
44 Back to stability: robust statistics also aims at stability. Mean functions are fitted with the L2 loss. What if the errors have heavier tails? The L1 loss is commonly used in robust statistics to deal with heavy-tailed errors in regression. Will the L1 loss add more stability?
45 Model perturbation is used in robust statistics: Tukey (1958), Huber (1964, ...), Hampel (1968, ...), Bickel (1976, ...), Rousseeuw (1979, ...), Portnoy (1979, ...). "Overall, and in analogy with, for example, the stability aspects of differential equations or of numerical computations, robustness theories can be viewed as stability theories of statistical inference." - p. 8, Hampel, Rousseeuw, Ronchetti and Stahel (1986)
46 Seeking insights through analytical work. For high-dimensional data such as ours, removing some data units can change the outcomes of our model because of feature dependence. This phenomenon is also seen in data simulated from Gaussian linear models in high dimensions. How does sample-to-sample variability interact with heavy-tailed errors?
47 Sample stability meets robust statistics in high dimensions (El Karoui, Bean, Bickel, Lim and Yu, 2012). Set-up: linear regression model Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}. For i = 1, ..., n: X_i ~ N(0, Σ_p), and the ε_i are iid with E ε_i = 0. We consider the random-matrix regime p/n → κ ∈ (0, 1) and the M-estimator

β̂ = argmin_{β ∈ R^p} Σ_i ρ(Y_i − X_i^T β).

Due to invariance, WLOG assume Σ_p = I_p and β = 0.
48 Sample stability meets robust statistics in high dimensions (El Karoui, Bean, Bickel, Lim and Yu, 2012). RESULT (in an important special case):
1. Let r_ρ(p, n) = ||β̂||; then β̂ is distributed as r_ρ(p, n) U, where U ~ uniform(S^{p−1}).
2. r_ρ(p, n) → r_ρ(κ). Let ẑ_ε := ε + r_ρ(κ) Z, with Z independent of ε, and define the proximal mapping

prox_c(ρ)(x) = argmin_{y ∈ R} [ ρ(y) + (x − y)^2 / (2c) ].

Then r_ρ(κ) satisfies the system

E{ [prox_c(ρ)]'(ẑ_ε) } = 1 − κ,
E{ [ẑ_ε − prox_c(ρ)(ẑ_ε)]^2 } = κ r_ρ^2(κ).
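The prox function in the result is the standard proximal operator. For ρ(y) = |y| it reduces to soft-thresholding, which is easy to confirm numerically (an illustrative check, not from the slides):

```python
import numpy as np

def prox_numeric(rho, x, c, grid):
    """argmin_y rho(y) + (x - y)^2 / (2c), approximated on a grid."""
    return grid[np.argmin(rho(grid) + (x - grid) ** 2 / (2.0 * c))]

def prox_abs(x, c):
    """Closed form for rho(y) = |y|: soft-thresholding at level c."""
    return np.sign(x) * max(abs(x) - c, 0.0)

grid = np.linspace(-5.0, 5.0, 100001)  # grid spacing 1e-4
for x in (-3.0, 0.2, 1.7):
    print(x, prox_numeric(np.abs, x, 1.0, grid), prox_abs(x, 1.0))
```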
49 Sample stability meets robust statistics in high dimensions (continued): limiting results; the normalization constant stabilizes. Sketch of proof:
- invariance;
- the "leave-one-out" trick both ways (reminiscent of the swapping trick for the CLT);
- analytical derivations via prox functions (reminiscent of proving normality in the limit for the CLT).
50 Sample stability meets robust statistics in high dimensions (continued). Corollary of the result: when κ = p/n > 0.3, L2 loss fitting (OLS) is better than L1 loss fitting (LAD), even when the error is double-exponential. (Figure: ratio of Var(LAD) to Var(OLS); blue = simulated, magenta = analytic.)
51 Sample stability meets robust statistics in high dimensions (continued). Remarks: the MLE doesn't work here for a different reason than in the cases where penalized MLE works better than MLE; we have unbiased estimators in a non-sparse situation, and the question is about variance. The optimal loss function can be calculated (Bean et al., 2012). A simulated model with the design matrix from the fMRI data and double-exponential errors shows the same phenomenon: for p/n > 0.3, OLS is better than LAD. This provides some insurance for using the L2 loss function in the fMRI project. The results hold in more general settings.
52 John W. Tukey (June 16, 1915 - July 26, 2000). "What of the future? The future of data analysis can involve great progress, the overcoming of real difficulties, and the provision of a great service to all fields of science and technology. Will it? That remains to us, to our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachments. Who is for the challenge?" John W. Tukey (1962), The Future of Data Analysis
53 Thank you for your sustained enthusiasm and questions!
More informationLatent Variable Graphical Model Selection Via Convex Optimization
Latent Variable Graphical Model Selection Via Convex Optimization The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationM-Estimation under High-Dimensional Asymptotics
M-Estimation under High-Dimensional Asymptotics 2014-05-01 Classical M-estimation Big Data M-estimation An out-of-the-park grand-slam home run Annals of Mathematical Statistics 1964 Richard Olshen Classical
More informationExtended Bayesian Information Criteria for Gaussian Graphical Models
Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical
More informationSample Size Requirement For Some Low-Dimensional Estimation Problems
Sample Size Requirement For Some Low-Dimensional Estimation Problems Cun-Hui Zhang, Rutgers University September 10, 2013 SAMSI Thanks for the invitation! Acknowledgements/References Sun, T. and Zhang,
More informationEstimation of (near) low-rank matrices with noise and high-dimensional scaling
Estimation of (near) low-rank matrices with noise and high-dimensional scaling Sahand Negahban Department of EECS, University of California, Berkeley, CA 94720, USA sahand n@eecs.berkeley.edu Martin J.
More informationarxiv: v1 [math.st] 13 Feb 2012
Sparse Matrix Inversion with Scaled Lasso Tingni Sun and Cun-Hui Zhang Rutgers University arxiv:1202.2723v1 [math.st] 13 Feb 2012 Address: Department of Statistics and Biostatistics, Hill Center, Busch
More informationRobust high-dimensional linear regression: A statistical perspective
Robust high-dimensional linear regression: A statistical perspective Po-Ling Loh University of Wisconsin - Madison Departments of ECE & Statistics STOC workshop on robustness and nonconvexity Montreal,
More informationSupplementary material for a unified framework for high-dimensional analysis of M-estimators with decomposable regularizers
Submitted to the Statistical Science Supplementary material for a unified framework for high-dimensional analysis of M-estimators with decomposable regularizers Sahand N. Negahban 1, Pradeep Ravikumar,
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationsparse and low-rank tensor recovery Cubic-Sketching
Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru
More informationLasso-type recovery of sparse representations for high-dimensional data
Lasso-type recovery of sparse representations for high-dimensional data Nicolai Meinshausen and Bin Yu Department of Statistics, UC Berkeley December 5, 2006 Abstract The Lasso (Tibshirani, 1996) is an
More informationInvertibility of random matrices
University of Michigan February 2011, Princeton University Origins of Random Matrix Theory Statistics (Wishart matrices) PCA of a multivariate Gaussian distribution. [Gaël Varoquaux s blog gael-varoquaux.info]
More informationAn efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss
An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao
More informationNew ways of dimension reduction? Cutting data sets into small pieces
New ways of dimension reduction? Cutting data sets into small pieces Roman Vershynin University of Michigan, Department of Mathematics Statistical Machine Learning Ann Arbor, June 5, 2012 Joint work with
More informationRestricted Strong Convexity Implies Weak Submodularity
Restricted Strong Convexity Implies Weak Submodularity Ethan R. Elenberg Rajiv Khanna Alexandros G. Dimakis Department of Electrical and Computer Engineering The University of Texas at Austin {elenberg,rajivak}@utexas.edu
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationregression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered,
L penalized LAD estimator for high dimensional linear regression Lie Wang Abstract In this paper, the high-dimensional sparse linear regression model is considered, where the overall number of variables
More informationSelection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty
Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the
More informationOrthogonal Matching Pursuit for Sparse Signal Recovery With Noise
Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published
More information1 Regression with High Dimensional Data
6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:
More informationA Study of Relative Efficiency and Robustness of Classification Methods
A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics
More informationHigh Dimensional Inverse Covariate Matrix Estimation via Linear Programming
High Dimensional Inverse Covariate Matrix Estimation via Linear Programming Ming Yuan October 24, 2011 Gaussian Graphical Model X = (X 1,..., X p ) indep. N(µ, Σ) Inverse covariance matrix Σ 1 = Ω = (ω
More informationTheoretical results for lasso, MCP, and SCAD
Theoretical results for lasso, MCP, and SCAD Patrick Breheny March 2 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/23 Introduction There is an enormous body of literature concerning theoretical
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationA General Framework for High-Dimensional Inference and Multiple Testing
A General Framework for High-Dimensional Inference and Multiple Testing Yang Ning Department of Statistical Science Joint work with Han Liu 1 Overview Goal: Control false scientific discoveries in high-dimensional
More informationEstimators based on non-convex programs: Statistical and computational guarantees
Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationOn Iterative Hard Thresholding Methods for High-dimensional M-Estimation
On Iterative Hard Thresholding Methods for High-dimensional M-Estimation Prateek Jain Ambuj Tewari Purushottam Kar Microsoft Research, INDIA University of Michigan, Ann Arbor, USA {prajain,t-purkar}@microsoft.com,
More informationHigh-dimensional regression with unknown variance
High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f
More informationLinear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1
Linear Regression with Strongly Correlated Designs Using Ordered Weigthed l 1 ( OWL ) Regularization Mário A. T. Figueiredo Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de
More information9. Robust regression
9. Robust regression Least squares regression........................................................ 2 Problems with LS regression..................................................... 3 Robust regression............................................................
More informationTuning Parameter Selection in Regularized Estimations of Large Covariance Matrices
Tuning Parameter Selection in Regularized Estimations of Large Covariance Matrices arxiv:1308.3416v1 [stat.me] 15 Aug 2013 Yixin Fang 1, Binhuan Wang 1, and Yang Feng 2 1 New York University and 2 Columbia
More informationOWL to the rescue of LASSO
OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,
More informationComposite Loss Functions and Multivariate Regression; Sparse PCA
Composite Loss Functions and Multivariate Regression; Sparse PCA G. Obozinski, B. Taskar, and M. I. Jordan (2009). Joint covariate selection and joint subspace selection for multiple classification problems.
More informationHigh dimensional Ising model selection
High dimensional Ising model selection Pradeep Ravikumar UT Austin (based on work with John Lafferty, Martin Wainwright) Sparse Ising model US Senate 109th Congress Banerjee et al, 2008 Estimate a sparse
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationAn Introduction to Sparse Approximation
An Introduction to Sparse Approximation Anna C. Gilbert Department of Mathematics University of Michigan Basic image/signal/data compression: transform coding Approximate signals sparsely Compress images,
More informationECE 275A Homework 7 Solutions
ECE 275A Homework 7 Solutions Solutions 1. For the same specification as in Homework Problem 6.11 we want to determine an estimator for θ using the Method of Moments (MOM). In general, the MOM estimator
More informationCovariance function estimation in Gaussian process regression
Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian
More informationBoosting Methods: Why They Can Be Useful for High-Dimensional Data
New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,
More informationNear Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing
Near Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar
More informationGuaranteed Sparse Recovery under Linear Transformation
Ji Liu JI-LIU@CS.WISC.EDU Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706, USA Lei Yuan LEI.YUAN@ASU.EDU Jieping Ye JIEPING.YE@ASU.EDU Department of Computer Science
More informationSparse inverse covariance estimation with the lasso
Sparse inverse covariance estimation with the lasso Jerome Friedman Trevor Hastie and Robert Tibshirani November 8, 2007 Abstract We consider the problem of estimating sparse graphs by a lasso penalty
More informationDoes Modeling Lead to More Accurate Classification?
Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang
More informationOn Model Selection Consistency of Lasso
On Model Selection Consistency of Lasso Peng Zhao Department of Statistics University of Berkeley 367 Evans Hall Berkeley, CA 94720-3860, USA Bin Yu Department of Statistics University of Berkeley 367
More informationConfidence Intervals for Low-dimensional Parameters with High-dimensional Data
Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Cun-Hui Zhang and Stephanie S. Zhang Rutgers University and Columbia University September 14, 2012 Outline Introduction Methodology
More informationGaussian Graphical Models and Graphical Lasso
ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More information