Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms


Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms
Andrew Barron, Cong Huang, Xi Luo
Department of Statistics, Yale University
2008 Workshop on Sparsity in High Dimensional Statistics and Learning Theory

Outline
1. Settings and Penalized Estimator: acceptability of penalty; general view
2. Settings and $\ell_1$ Penalization: risk bound for $\ell_1$ penalized least squares; risk properties for finite-dimension libraries; trade-off in the resolvability
3. Algorithms: $\ell_1$ penalized least squares; $\ell_1$ penalized loglikelihood
4. Summary

Part 1: Settings and Penalized Estimator

Settings
- Regression: $Y = f^*(X) + \epsilon$
- Training data $(X, Y) = (X_i, Y_i)_{i=1}^n$
- Evaluation sample $X' = (X'_i)_{i=1}^n$
- Target function $f^*(x) = E[Y \mid X = x]$; assume $\|f^*\|_\infty \le B$
- Noise $\epsilon = Y - f^*(X)$ satisfies Bernstein's moment conditions
- Candidate functions $f$ from a class $F$
- Average squared error $\|Y - f\|_X^2 = \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2$

Penalized Least Squares
$\hat f$ is chosen to satisfy
  $\|Y - \hat f\|_X^2 + \mathrm{pen}_n(\hat f) \le \inf_{f \in F} \{ \|Y - f\|_X^2 + \mathrm{pen}_n(f) + A_f \}$
- $\mathrm{pen}_n(f)$ and $A_f$ may depend on the data $X, Y$
- $A_f$ is an index of computational accuracy
- Truncated estimator $T\hat f$ at a level $B' \ge B$
- We want the risk bounded by
  $E\|T\hat f - f^*\|^2 \le (1+\delta) \underbrace{\inf_{f \in F} \{ \|f - f^*\|^2 + E\,\mathrm{pen}_n(f) + E A_f \}}_{\text{index of resolvability}}$

Acceptable Penalties
What kinds of penalties produce the required risk bound? Acceptable (or proper) penalties.

Countable Case
- Consider a countable $F$
- Penalty $\gamma L(f)/n$ proportional to complexities; Kraft inequality $\sum_{f \in F} e^{-L(f)} \le 1$
- $P_n$, $P'_n$ empirical distributions of $X$ and $X'$
- From the Hoeffding and Bernstein inequalities,
  $E \sup_{f \in F} \Big\{ \frac{1}{c} \underbrace{P'_n[(f - f^*)^2]}_{\|f - f^*\|_{X'}^2} - \underbrace{P_n[(Y - f)^2 - \epsilon^2]}_{\|Y - f\|_X^2 - \|Y - f^*\|_X^2} - \frac{\gamma L(f)}{n} \Big\} \le 0, \quad c > 1$
- $\gamma$ depends on $B$, $B'$, $c$, $\sigma^2$ and $h_{\mathrm{Bern}}$

Risk Bound in the Countable Case
The risk is bounded by
  $E\|T\hat f - f^*\|_{X'}^2 \le c \min_{f \in F} \Big\{ \|f - f^*\|^2 + \frac{\gamma L(f)}{n} \Big\}$

Uncountable Case
A valid $\mathrm{pen}_n(f)$ for uncountable $F$: there exist a countable class $\tilde F$ and complexities $L(\tilde f)$ with
  $\sup_{f \in F} \Big\{ \frac{1}{c} P'_n(g_f) - P_n(\rho_f) - \mathrm{pen}_n(f) \Big\} \le \sup_{\tilde f \in \tilde F} \Big\{ \frac{1}{\tilde c} P'_n(g_{\tilde f}) - P_n(\rho_{\tilde f}) - \frac{\gamma L(\tilde f)}{n} \Big\}$
where $c \ge \tilde c > 1$. The inequality may hold pointwise or in expectation. Here
  $g_f(X') = (f(X') - f^*(X'))^2, \qquad \rho_f(X, Y) = (Y - f(X))^2 - (Y - f^*(X))^2$

Acceptable Penalty
Variable-complexity, variable-distortion cover: for $f$ in $F$, the penalty $\mathrm{pen}_n(f)$ is valid if there is a representor $\tilde f$ such that $\mathrm{pen}_n(f)$ is at least
  $\frac{\gamma L(\tilde f)}{n} + \Delta_n(f, \tilde f), \qquad \Delta_n(f, \tilde f) = \|Y - \tilde f\|_X^2 - \|Y - f\|_X^2 + \frac{1}{\tilde c}\|\tilde f - Tf\|_{X'}^2 - \frac{1}{c}\|f - f^*\|_{X'}^2$
with $c \ge \tilde c > 1$.
- $\tilde F$ consists of $\tilde f$ bounded by $B'$
- The risk is bounded by
  $E\|T\hat f - f^*\|_{X'}^2 \le c \inf_{f \in F} \big\{ \|f - f^*\|^2 + E[\mathrm{pen}_n(f) + A_f] \big\}$

Penalty via the Complexity-Distortion Trade-off
Allowing unbounded $\tilde f$: an acceptable penalty is at least
  $\inf_{\tilde f \in \tilde F} \Big\{ \frac{\gamma L(\tilde f)}{n} + D_n(f, \tilde f) \Big\}$
where
  $D_n(f, \tilde f) = \|Y - \tilde f\|_X^2 - \|Y - f\|_X^2 + \|\tilde f - f\|_{X'}^2$
  $\gamma = \underbrace{1.6(B + B')^2}_{\text{main term}} + 2 h_{\mathrm{Bern}} (B + B') + \underbrace{2.7\sigma^2}_{\text{arising from noise}}$

Risk Bound
The risk of $T\hat f$ satisfies
  $E\|T\hat f - f^*\|_{X'}^2 \le 3 \inf_{f \in F} \Big\{ \|f - f^*\|^2 + E[\mathrm{pen}_n(f) + A_f] + \frac{\mathrm{tail}}{n} \Big\}$
- Noise bounded: $\mathrm{tail} = 0$, $B' = B + C$
- Noise sub-Gaussian: $\mathrm{tail} = \mathrm{const}$, $B' = B + C\sqrt{\log n}$
- Noise Bernstein: $\mathrm{tail} = \mathrm{const}$, $B' = B + C \log n$

Our Work
General penalty condition:
- Subset selection: $\mathrm{pen}_n(f_m) = \frac{\gamma}{n} \log\binom{M}{m} + \frac{m \log n}{n}$
- $\ell_1$ penalization: $\mathrm{pen}_n(f_\beta) = \lambda_n \|\beta\|_1$. What size $\lambda_n$?
- Combinations thereof (see paper)
- A greedy algorithm for each

Sampling Idea
$f = \sum_h \beta_h h$, a linear combination of the $h$ in $H$. Randomly draw $h_1, h_2, \ldots, h_m$ independently, with probability proportional to $|\beta_h|$ that $h_i = h$. This idea is useful in:
- approximation bounds
- the proof of the acceptability of a penalty via countable covers
- bounding the computational inaccuracy of the greedy algorithm
Squared error of order $1/m$ or better for each (a numerical sketch follows).
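
To make the sampling idea concrete, here is a minimal numerical sketch, not from the talk: the cosine library, the coefficients, and all variable names are invented for illustration. Drawing $m$ library members with probability proportional to $|\beta_h|$ and averaging gives an unbiased approximation of $f_\beta$ with squared error of order $\|\beta\|_1^2/m$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, m = 200, 50, 25
X = rng.uniform(-1, 1, size=n)

# Library H: sinusoid features h_j(x) (so ||h_j||_n <= 1); beta: sparse coefficients.
freqs = rng.uniform(0.5, 5.0, size=M)
H = np.array([np.cos(f * X) for f in freqs])      # shape (M, n)
beta = np.zeros(M)
beta[rng.choice(M, size=5, replace=False)] = rng.normal(size=5)

f_target = beta @ H                                # f(x) = sum_h beta_h h(x)
v = np.abs(beta).sum()                             # v = ||beta||_1

# Draw h_1,...,h_m with P(h_i = h_j) proportional to |beta_j|; the average
# f_m = (v/m) sum_k sign(beta_{j_k}) h_{j_k} is an unbiased estimate of f.
idx = rng.choice(M, size=m, p=np.abs(beta) / v)
f_m = (v / m) * (np.sign(beta[idx]) @ H[idx])

# Empirical squared error against the v^2/m bound from the sampling argument.
print(np.mean((f_target - f_m) ** 2), v ** 2 / m)
```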

Part 2: Settings and $\ell_1$ Penalization

Regression Settings
- Training data $(X, Y) = (X_i, Y_i)_{i=1}^n$
- Evaluation at $X' = (X'_i)_{i=1}^n$, an independent copy of $X$
- Target function $f^*(x) = E[Y \mid X = x]$, with $\|f^*\|_\infty \le B$
- Noise $\epsilon = Y - f^*(X)$ satisfies Bernstein's conditions
- Function class $F = F_H$, the linear span of a library $H$
- $f$ in $F_H$ has the form $f(x) = f_\beta(x) = \sum_h \beta_h h(x)$ with $\beta = (\beta_h : h \in H)$

$\ell_1$ Penalized Least Squares Estimator
Find $\hat\beta$, $\hat f = f_{\hat\beta}$ satisfying
  $\|Y - f_{\hat\beta}\|_X^2 + \lambda \|\hat\beta\|_{1,a} = \min_\beta \big\{ \|Y - f_\beta\|_X^2 + \lambda \|\beta\|_{1,a} \big\}$
where $f_\beta(x) = \sum_h \beta_h h(x)$ and $\|\beta\|_{1,a} = \sum_h |\beta_h| a_h$.
- Lasso (Tibshirani 1996)
- Basis Pursuit (Chen and Donoho 1996)
(A worked example with an off-the-shelf solver follows.)
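
The weighted criterion can be handed to any plain lasso solver after rescaling the dictionary columns by $1/a_h$. A minimal sketch assuming scikit-learn is available; the data and all names are synthetic, and alpha = $\lambda/2$ matches scikit-learn's $\frac{1}{2n}\|y - Xw\|^2 + \alpha\|w\|_1$ normalization against the $\frac{1}{n}$ normalization used here.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, M = 100, 30
Phi = rng.normal(size=(n, M))                  # Phi[i, j] = h_j(X_i)
beta_true = np.zeros(M)
beta_true[:3] = [2.0, -1.5, 1.0]
y = Phi @ beta_true + 0.1 * rng.normal(size=n)

a = np.sqrt((Phi ** 2).mean(axis=0))           # a_h = ||h||_n (traditional weights)
lam = 0.1

# Rescaling column h by 1/a_h turns the plain l1 penalty on the rescaled
# coefficients w into ||beta||_{1,a} for beta = w / a.
model = Lasso(alpha=lam / 2, fit_intercept=False)
model.fit(Phi / a, y)
beta_hat = model.coef_ / a
print(np.round(beta_hat[:5], 3))
```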

Areas to Be Explored
We show that $\|\beta\|_1 = \|\beta\|_{1,a}$ gives a proper penalty, and hence the corresponding resolvability risk bound follows.
- What kinds of weights $a_h$?
- What is the condition on $\lambda$?
- What is the convergence rate of the risk?
Results in Huang, Cheang and Barron (2008), Section 4.


Risk Bound When $H$ Is Finite
- Consider the case that $H$ is finite, of size $M$ (also called $p$)
- Weights $a_h = \|h\|$ in the traditional setting
- Weights $a_h = \sqrt{2}\,\|h\|_{X,X'}$ in the transductive setting, where $\|h\|_{X,X'}^2 = \frac{1}{n}\sum_{i=1}^n \big( h^2(X_i) + h^2(X'_i) \big)$
- $\lambda$ chosen at least $2\sqrt{2\gamma(\log 2M)/n}$, with $\gamma = 1.6(B+B')^2 + 2(B+B') h_{\mathrm{Bern}} + 2.7\sigma^2$
- $T\hat f$ truncates at the level $B' \ge B$
- The risk satisfies
  $E\|T\hat f - f^*\|^2 \le 3 \inf_\beta \big[ \|f_\beta - f^*\|^2 + \lambda \|\beta\|_{1,a} \big] + \mathrm{adjust}/n$
- The adjustment terms are negligible compared to the main terms

A Glance at the Proof
$\tilde F$ consists of $\tilde f$ of the form
  $\tilde f(x) = \frac{v}{m} \sum_{k=1}^m h_k(x)/a_{h_k}, \quad h_k \in H, \ m = 1, 2, \ldots$
with complexities $L(\tilde f) = m \log M + m \log 2$. Using the sampling idea, there is a representor $\tilde f_m$ for each $f$:
  $\underbrace{\|Y - \tilde f_m\|_X^2 - \|Y - f\|_X^2}_{\le\, v \|\beta\|_{1,a}/m} + \underbrace{\|f - \tilde f_m\|_{X'}^2}_{\le\, v \|\beta\|_{1,a}/m} + \underbrace{\gamma L(\tilde f_m)/n}_{\gamma m \log(2M)/n}$
Take $m = \|\beta\|_{1,a}/\eta$ and $v = m\eta$; optimizing $\eta$ (of order $\sqrt{\gamma \log(2M)/n}$) shows that $\mathrm{pen}_n(f_\beta)$ at least $2\sqrt{2\gamma(\log 2M)/n}\,\|\beta\|_{1,a}$ suffices.

Improvement
An improvement based on an empirical $L_2$ covering of the library $H$:
- $\tilde H_2$: a finite cover of precision $\varepsilon_2$ and cardinality $\tilde m_2$
- Use $a_h = 1$
- $\lambda$ is chosen at least $\lambda_n = 2\varepsilon_2 \sqrt{\frac{2\gamma \log(2\tilde m_2)}{n}}$
- The risk satisfies
  $E\|T\hat f - f^*\|^2 \le 3 \min_\beta \big\{ \|f_\beta - f^*\|^2 + \lambda \|\beta\|_1 \big\} + \frac{\mathrm{adjust}}{n}$

Stratified Sampling
- $\tilde H_2$ is an $L_2$ cover of $H$ with precision $\varepsilon_2$ and cardinality $\tilde m_2$
- For $f = \sum_h \beta_h h$ in $F_H$, there is an $f_m = (v/m)\sum_{k=1}^m h_k$ such that
  $\|f - f_m\|^2 \le \frac{\varepsilon_2^2 \|\beta\|_1 v}{m - \tilde m_2}$
- $v$ is between $\sum_h |\beta_h|$ and $\sum_h |\beta_h| \,(1 + \tilde m_2/(m - \tilde m_2))$
- Based on Makovoz (1996)

Proof of the Improvement
Same $\tilde F$ and $L(\tilde f)$. Using the stratified sampling idea, there is a representor $\tilde f_m$ for each $f$:
  $\underbrace{\|Y - \tilde f_m\|_X^2 - \|Y - f\|_X^2}_{\le\, \varepsilon_2^2 \|\beta\|_1 v/(m - \tilde m_2)} + \underbrace{\|f - \tilde f_m\|_{X'}^2}_{\le\, \varepsilon_2^2 \|\beta\|_1 v/(m - \tilde m_2)} + \underbrace{\gamma L(\tilde f_m)/n}_{\gamma m \log(2\tilde m_2)/n}$
Set $v = m\eta/\varepsilon_2$ with $\varepsilon_2 \sum_h |\beta_h|/\eta \le m \le \varepsilon_2 \sum_h |\beta_h|/\eta + \tilde m_2$. Optimizing $\eta$, the penalty is at least
  $\lambda_n \|\beta\|_1 + \frac{\gamma \tilde m_2 \log(2\tilde m_2)}{n}$

Risk Bound When $H$ Is Infinite
An improvement with two levels of cover:
- A fine precision $\varepsilon_1$, typically of order $\varepsilon_2/\sqrt{n}$: we consider empirical covers $\tilde H_1$ providing an effective library size $\tilde M_1$, which serves as a surrogate for $M$
- $\tilde H_2$ is the same as before
- Use $a_h = 1$
- $\lambda$ at least $2\varepsilon_2 \sqrt{\frac{2\gamma \log(2\tilde M_1)}{n}} + 16 B' \varepsilon_1$
- The risk satisfies
  $E\|T\hat f - f^*\|^2 \le 3 \min_\beta \big\{ \|f_\beta - f^*\|^2 + \lambda \|\beta\|_1 \big\} + \frac{\mathrm{adjust}}{n}$
- A quantity $2\gamma \tilde m_2 \log(2\tilde M_1)$ appears in the adjustment

Further Exploration of the Infinite Case
- Take advantage of the covering properties of $H$ to relate $\tilde M_1$ and $\varepsilon_1$, and likewise $\tilde m_2$ and $\varepsilon_2$
- The library $H$ has metric dimension $d_1$ w.r.t. the empirical $L_1$ norm if the cardinality $\tilde M_1$ is of order $(1/\varepsilon)^{d_1}$
- Likewise, the metric dimension is $d_2$ w.r.t. the empirical $L_2$ norm
- $d_1 \le d_2 \le 2 d_1$

$\ell_1$ Penalty for Libraries of Finite Metric Dimension
The library $H$ has dimensions $d_1$ and $d_2$ w.r.t. the empirical $L_1$ and $L_2$ norms. For $\lambda$ at least
  $\lambda_{n,d} = C_1(d_1, d_2) \left( \frac{d_1}{n} \log\frac{n}{d_1} \right)^{(d_2+2)/(2d_2+2)}$
the risk tends to zero at the rate
  $\inf_\beta \left\{ \|f_\beta - f^*\|^2 + \left( \frac{d_1}{n} \log\frac{n}{d_1} \right)^{\frac{d_2+2}{2(d_2+1)}} \|\beta\|_1 \right\}$

A Refined Penalty
The library $H$ has dimensions $d_1$ and $d_2$ w.r.t. the empirical norms. Using the penalty
  $\mathrm{pen}_n(f_\beta) = \lambda \|\beta\|_1^{d_2/(d_2+1)}$
with $\lambda$ at least $\lambda_{n,d}$, the penalized least squares estimator $\hat f$ satisfies the resolvability risk bound
  $E\|T\hat f - f^*\|^2 \le 3 \min_\beta \left\{ \|f_\beta - f^*\|^2 + \lambda_{n,d} \|\beta\|_1^{\frac{d_2}{d_2+1}} \right\} + \frac{\mathrm{adjust}}{n}$
This gives a smaller index of resolvability.

Variation and $L_{1,H}$
The variation $V(f)$ of $f$, w.r.t. $H$ and weights $a = (a_h)$, is
  $V(f) = \lim_{\varepsilon \to 0} \inf_{f_\varepsilon \in F_H} \left\{ \|\beta\|_{1,a} : f_\varepsilon = \sum_h \beta_h h \ \text{and} \ \|f_\varepsilon - f\| \le \varepsilon \right\}$
- A natural extension of $\|\beta\|_{1,a}$
- $L_{1,H}$ consists of the functions with finite variation

Approximation and Penalty Trade-off
We discuss the trade-off between approximation error and penalty, as expressed in the resolvability, and its relationship to interpolation spaces between two classes of functions.
- Squared approximation error: $\mathrm{App}(f^*, v) = \inf_{f_\beta :\, \|\beta\|_1 = v} \|f_\beta - f^*\|^2$
- Resolvability: $R_1(f^*, \lambda_n) = \inf_v \{ \mathrm{App}(f^*, v) + \lambda_n v \}$
- If $f^* \in L_{1,H}$, then $R_1(f^*, \lambda_n) \le \lambda_n V(f^*)$ goes to 0 linearly in $\lambda_n$
- If $f^*$ is merely in $L_2(P)$, the convergence rate can be arbitrarily slow

The Interpolation Space $B_{1,p}^{\mathrm{res}}$
Consider $B_{1,p}^{\mathrm{res}} = \{ f : R_1(f, \lambda) \le c_f \lambda^{2-p} \ \text{for all} \ \lambda > 0 \}$, indexed by $1 \le p \le 2$.
- These coincide with the traditional interpolation spaces $B_p = [L_2(P), L_{1,H}]_\theta$
- When $p = 1$, $B_{1,1}^{\mathrm{res}}$ includes $L_{1,H}$
- If $f^* \in B_{1,p}^{\mathrm{res}}$, the resolvability, of order $\lambda_n^{2-p}$, provides the rate $\varepsilon_2^{2-p} \left( \frac{\gamma \log M}{n} \right)^{1-p/2}$

Trade-off for Finite-Dimensional Libraries
Suppose $f^* \in B_{1,p}^{\mathrm{res}}$ and $H$ has dimensions $d_1$ and $d_2$.
- The resolvability $R_1(f^*, \lambda_n)$, with $R_1(f, \lambda) = \inf_v \{ \mathrm{App}(f, v) + \lambda v \}$, is of order
  $\left( \frac{d_1}{n} \log\frac{n}{d_1} \right)^{\frac{(1-p/2)(d_2+2)}{d_2+1}}$
- The resolvability $R_{1,r}(f^*, \lambda_n)$, with $r = d_2/(d_2+1)$ and $R_{1,r}(f, \lambda) = \inf_v \{ \mathrm{App}(f, v) + \lambda v^r \}$, is of order
  $\left( \frac{d_1}{n} \log\frac{n}{d_1} \right)^{\frac{(1-p/2)(d_2+2)}{d_2+2-p}}$

Variable Complexity
For a finite library $H$:
- Variable complexities $L(h)$ satisfying $\sum_h e^{-L(h)} \le 1$
- $a_{L,h} = \|h\| \sqrt{L(h) + \log 2}$ in the traditional setting
- $a_{L,h} = \sqrt{2}\,\|h\|_{X,X'} \sqrt{L(h) + \log 2}$ in the transductive setting
- A similar risk bound holds, using $L(h) + \log 2$ inside the sum defining $\|\beta\|_{1,a_L}$, in place of the constant $\log M + \log 2$ outside the sum

Computational Accuracy
$\hat\beta$ and $f_{\hat\beta}$ satisfy
  $\|Y - f_{\hat\beta}\|_X^2 + \lambda \|\hat\beta\|_{1,a} \le \inf_\beta \big\{ \|Y - f_\beta\|_X^2 + \lambda \|\beta\|_{1,a} + A_{\beta,m} \big\}$
The same risk bound still holds, with $E A_{\beta,m}$ inside the index of resolvability.

Part 3: Algorithms for $\ell_1$ Penalized Least Squares and $\ell_1$ Penalized Loglikelihood

$\ell_1$ Penalized Least Squares
Dictionary $H = \{h(x)\}$ of size $M$ (also called $p$); data $(X_i, Y_i)_{i=1}^n$. Fit a function in the linear span $F_H$ to minimize
  $\frac{1}{n} \sum_{i=1}^n \Big( Y_i - \sum_h \beta_h h(X_i) \Big)^2 + \lambda \sum_h |\beta_h|$

Algorithms for $\ell_1$ Penalized Least Squares
Examples of existing algorithms:
- interior point methods (Boyd et al., 2004)
- LARS (Efron, Hastie, Johnstone and Tibshirani, 2004)
- coordinate descent (Friedman et al., 2007)
Built on Jones (1992) and others:
- $\ell_1$ penalized greedy pursuit (LPGP), HCB (2008), Section 3

LPGP for Least Squares
Initialize $f_0(x) = 0$. Iteratively seek
  $f_m(x) = (1 - \alpha_m) f_{m-1}(x) + \beta_m h_m(x)$
to minimize, over $h \in H$, $\alpha \in (0,1)$ and $\beta \in \mathbb{R}$,
  $\big\| Y_i - (1-\alpha) f_{m-1}(X_i) - \beta h(X_i) \big\|_n^2 + \lambda \big[ |\beta| + (1-\alpha) v_{m-1} \big]$
where $v_m = \sum_{j=1}^m |\beta_{j,m}|$ for $f_m = \sum_{j=1}^m \beta_{j,m} h_j$. (A code sketch follows.)
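
A hedged sketch of this update, not the authors' implementation: the dictionary is a matrix Phi with Phi[i, j] = h_j(X_i), $\alpha$ is handled by a grid search, and for fixed $(\alpha, h)$ the optimal $\beta$ is the usual soft-thresholded least squares coefficient. All names are illustrative.

```python
import numpy as np

def lpgp_least_squares(Phi, y, lam, n_steps=50, alpha_grid=None):
    """Greedy pursuit for (1/n)||y - f||^2 + lam * v, sketched after LPGP."""
    n, M = Phi.shape
    if alpha_grid is None:
        alpha_grid = np.linspace(0.01, 0.99, 25)
    f = np.zeros(n)                    # current fit f_{m-1}(X_i)
    v = 0.0                            # current l1 weight v_{m-1}
    col_sq = (Phi ** 2).mean(axis=0)   # ||h||_n^2 for each column
    for _ in range(n_steps):
        best = None
        for alpha in alpha_grid:
            r = y - (1 - alpha) * f    # residual of the shrunken fit
            # For each h, minimize mean((r - beta*h)^2) + lam*|beta| exactly:
            # the soft-thresholded least squares coefficient.
            c = (Phi * r[:, None]).mean(axis=0)
            beta = np.sign(c) * np.maximum(np.abs(c) - lam / 2, 0.0) / col_sq
            obj = (((r[:, None] - Phi * beta) ** 2).mean(axis=0)
                   + lam * (np.abs(beta) + (1 - alpha) * v))
            j = int(np.argmin(obj))
            if best is None or obj[j] < best[0]:
                best = (obj[j], alpha, j, beta[j])
        _, alpha, j, beta_j = best
        f = (1 - alpha) * f + beta_j * Phi[:, j]
        v = (1 - alpha) * v + abs(beta_j)
    return f, v
```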

Theorem for Accuracy
Let $\|\cdot\|_n$ be the empirical $\ell_2$ norm and $V_f = \inf \{ \sum_h |\beta_h| \|h\|_n : f(x) = \sum_h \beta_h h(x) \}$. The $m$-step LPGP estimator is within order $V_f^2/m$ of the minimal objective:
  $\|Y_i - f_m(X_i)\|_n^2 + \lambda v_m \le \inf_{f \in F_H} \left\{ \|Y_i - f(X_i)\|_n^2 + \lambda V_f + \frac{4 V_f^2}{m+1} \right\}$
(A toy numerical check follows.)
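
A toy numerical check of this guarantee, reusing the hypothetical lpgp_least_squares sketch above and comparing against a direct lasso solve with scikit-learn (alpha = $\lambda/2$ matches the normalization, as before); the gap between the two objective values should be small and shrink with the number of greedy steps.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, M = 120, 40
Phi = rng.normal(size=(n, M))
y = Phi[:, 0] - 0.5 * Phi[:, 1] + 0.1 * rng.normal(size=n)
lam = 0.2

f_m, v_m = lpgp_least_squares(Phi, y, lam, n_steps=100)
greedy_obj = np.mean((y - f_m) ** 2) + lam * v_m

exact = Lasso(alpha=lam / 2, fit_intercept=False).fit(Phi, y)
exact_obj = np.mean((y - exact.predict(Phi)) ** 2) + lam * np.abs(exact.coef_).sum()
print(greedy_obj - exact_obj)   # small, nonnegative up to numerical tolerance
```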

Advantages and Disadvantages
- Advantages: an explicit guarantee of accuracy; cost $Mnm$ vs. $Mn^2$ for LARS; inexpensive optimization at each iteration.
- Disadvantages: an approximate solution; fixed $\lambda_n$.

Idea of the Proof
WLOG $\beta_h \ge 0$ (assume $H$ is closed under sign change). For an arbitrary $f = \sum_h \beta_h h$, let
  $e_m^2 = \|Y_i - f_m(X_i)\|_n^2 - \|Y_i - f(X_i)\|_n^2 + \lambda v_m$
With $f_m = (1-\alpha_m) f_{m-1} + \beta_m h_m$, the value $e_m^2$ is at least as good as with the choices $\alpha = \frac{2}{m+1}$, $\beta = \alpha V_f$, and an $h$ chosen at random.

Idea of the Proof (continued)
Rearrange to obtain
  $e_m^2 \le (1-\alpha) e_{m-1}^2 + \alpha^2 b(V_f h) + \alpha \lambda V_f - 2\alpha(1-\alpha) \underbrace{\frac{1}{n} \sum_{i=1}^n (Y_i - f_{m-1}(X_i)) (V_f h(X_i) - f(X_i))}_{\to\, 0 \ \text{on averaging over} \ h}$
where $b(V_f h) = \|Y - V_f h\|_n^2 - \|Y - f\|_n^2$. Drawing $h$ with probability $\beta_h / V_f$, the cross term vanishes, and the average of $b(V_f h)$ over $h$ is bounded by $V_f^2$.

Idea of the Proof (conclusion)
We show
  $e_m^2 \le (1-\alpha) e_{m-1}^2 + \alpha^2 V_f^2 + \alpha \lambda V_f$
and induction then yields the $O(1/m)$ accuracy, as worked out below.
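
For completeness, a worked version of the induction; this is our own sketch, and the shorthand $d_m$ is not from the slides.

```latex
% Shorthand: d_m := e_m^2 - \lambda V_f, with \alpha_m = 2/(m+1).
% The recursion above gives
\[
  d_m \;\le\; (1 - \alpha_m)\, d_{m-1} + \alpha_m^2 V_f^2 .
\]
% Claim: d_m \le 4 V_f^2/(m+1).  Base case m = 1 (\alpha_1 = 1): d_1 \le V_f^2.
% Inductive step:
\[
  d_m \;\le\; \Bigl(1 - \tfrac{2}{m+1}\Bigr) \frac{4 V_f^2}{m} + \frac{4 V_f^2}{(m+1)^2}
      \;=\; 4 V_f^2 \Bigl[ \frac{m-1}{m(m+1)} + \frac{1}{(m+1)^2} \Bigr]
      \;\le\; \frac{4 V_f^2}{m+1},
\]
% since (m-1)(m+1) + m \le m(m+1).
```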

$\ell_1$ Penalized Loglikelihood
Let $X_1, \ldots, X_n$ be i.i.d. in $\mathbb{R}^p$, distributed as
  $p_f(x) = \frac{e^{f(x)} p_0(x)}{C_f}$
and let $L_n(f) = \frac{1}{n} \log(1/p_f(X^n))$. The $\ell_1$ penalized loglikelihood estimator $\hat f = f_{\hat\beta}$ minimizes
  $L_n(f) + \lambda V_f$
where $V_f = \inf \{ \sum_h |\beta_h| : f(x) = \sum_h \beta_h h(x), \ h \in H \}$.

Motivation
- Minimization is computationally demanding when $p$ is large.
- Term-by-term selection is favored in sparse settings.
- Approximate optimization is good enough for the risk analysis.
- BHLL (2008) extends LPGP to the penalized loglikelihood.

LPGP for Penalized Loglikelihood
Initialize with $f_0(x) = 0$. Set
  $f_m(x) = (1-\alpha_m) f_{m-1}(x) + \beta_m h_m(x)$
with $\alpha_m$, $\beta_m$ and $h_m$ chosen by
  $\operatorname{argmin}_{\alpha, \beta, h} \big\{ L_n(f_m) + \lambda [ (1-\alpha) v_{m-1} + |\beta| ] \big\}$
where $v_{m-1} = \sum_{j=1}^{m-1} |\beta_{j,m-1}|$ for $f_{m-1} = \sum_{j=1}^{m-1} \beta_{j,m-1} h_j$. (A sketch follows.)
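
A hedged sketch of this variant on a finite domain, where the normalizer $C_f = \sum_x p_0(x) e^{f(x)}$ is computable exactly; the grid search over $(\alpha, \beta, h)$ and all names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def lpgp_loglik(H_grid, H_data, p0_grid, lam, n_steps=30):
    # H_grid[j]: h_j on the domain grid; H_data[j]: h_j at the samples X_i.
    M = len(H_grid)
    f_grid, f_data, v = np.zeros_like(H_grid[0]), np.zeros_like(H_data[0]), 0.0
    alphas = np.linspace(0.05, 0.95, 19)
    betas = np.linspace(-2.0, 2.0, 41)

    def objective(fg, fd, weight):
        # L_n(f) + lam * v, up to a constant that does not depend on f.
        logC = np.log(np.sum(p0_grid * np.exp(fg)))
        return -(fd.mean() - logC) + lam * weight

    for _ in range(n_steps):
        best, move = objective(f_grid, f_data, v), None
        for a in alphas:
            for b in betas:
                for j in range(M):
                    obj = objective((1 - a) * f_grid + b * H_grid[j],
                                    (1 - a) * f_data + b * H_data[j],
                                    (1 - a) * v + abs(b))
                    if obj < best:
                        best, move = obj, (a, b, j)
        if move is None:
            break                      # no strictly improving greedy step
        a, b, j = move
        f_grid = (1 - a) * f_grid + b * H_grid[j]
        f_data = (1 - a) * f_data + b * H_data[j]
        v = (1 - a) * v + abs(b)
    return f_grid, v
```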

Theorem
Suppose $|h(x)| \le C$ for all $h \in H$. The $m$-step LPGP estimator $f_m(x)$ has
  $L_n(f_m) + \lambda v_m \le \inf_{f \in F} \left\{ L_n(f) + \lambda V_f + \frac{2 V_f^2}{m+1} \right\}$

Idea of the Proof
- The $m$-step error has linear and nonlinear components.
- The linear parts are handled as in the least squares case.
- The nonlinear part (the normalizing constants) is $O(\alpha^2)$, by a moment generating function bound.
- Induction completes the proof.

Accuracy
With $f_m(x) = (1-\alpha_m) f_{m-1}(x) + \beta_m h_m(x)$, let
  $e_m = L_n(f_m) - L_n(f) + \lambda [ (1-\alpha_m) v_{m-1} + |\beta_m| ]$
From the definition, $L_n(f_m) - L_n(f)$ equals
  $-\frac{1}{n} \sum_{i=1}^n \underbrace{\big[ (1-\alpha_m) f_{m-1}(X_i) + \beta_m h_m(X_i) - f(X_i) \big]}_{\text{linear}} + \underbrace{\log \frac{\int e^{f_m(t)} p_0(t)\,dt}{\int e^{f(t)} p_0(t)\,dt}}_{\text{nonlinear}}$

Sampling $h$
Consider $\alpha = 2/(m+1)$, $\beta = \alpha V_f$, and a random $h(x)$. Rearranging, with $p_\alpha(x) = e^{(1-\alpha)[f_{m-1}(x) - f(x)]} p_f(x)/c$ for a normalizing constant $c$,
  $e_m \le (1-\alpha) e_{m-1} + \alpha \lambda V_f + \alpha \underbrace{\frac{1}{n} \sum_{i=1}^n [ f(X_i) - V_f h(X_i) ]}_{\to\, 0 \ \text{on averaging over} \ h} + \log \int p_\alpha(t) \exp\{ \alpha (V_f h(t) - f(t)) \}\,dt$
Sampling $h$ with probability $\beta_h / V_f$, the third term vanishes. Bringing the average over $h$ inside the log, the expectation over the random $h$ of $\exp\{ \alpha (V_f h(t) - f(t)) \}$ is not more than $e^{\alpha^2 V_f^2 / 2}$.

Induction
We show
  $e_m \le (1-\alpha) e_{m-1} + \frac{\alpha^2 V_f^2}{2} + \alpha \lambda V_f$
and induction completes the proof.

Current Work
- Generalize to permit an $\ell_2$ norm in the penalized loglikelihood.
- High dimensional graphical models: logistic, Gaussian.
- An R package will be made publicly available.

Part 4: Summary

Our Work (Summary)
General penalty condition:
- Subset selection: $\mathrm{pen}_n(f_m) = \frac{\gamma}{n} \log\binom{M}{m} + \frac{m \log n}{n}$
- $\ell_1$ penalization: $\mathrm{pen}_n(f_\beta) = \lambda_n \|\beta\|_1$ (valid). What size $\lambda_n$?
- Combinations thereof (see paper)
- A greedy algorithm for each (valid)

Sampling Idea (Recap)
- $f = \sum_h \beta_h h$, a linear combination of the $h$ in $H$
- Randomly draw $h_1, h_2, \ldots, h_m$ independently, with probability proportional to $|\beta_h|$ that $h_i = h$
- Useful in approximation bounds, in proving the acceptability of a penalty via countable covers, and in bounding the computational inaccuracy of the greedy algorithm
- Squared error of order $1/m$ or better for each