Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms


Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms
Andrew Barron, Cong Huang, Xi Luo
Department of Statistics, Yale University
2008 Workshop on Sparsity in High Dimensional Statistics and Learning Theory

Outline
1. Settings and Penalized Estimator: acceptability of penalty; general view
2. Settings and $\ell_1$ Penalization: risk bound for $\ell_1$ penalized least squares; risk properties for finite-dimension libraries; trade-off in the resolvability
3. Algorithms: $\ell_1$ penalized least squares; $\ell_1$ penalized loglikelihood
4. Summary

Part 1: Settings and Penalized Estimator

Settings
- Regression: $Y = f^*(X) + \epsilon$
- Training data $(X, Y) = (X_i, Y_i)_{i=1}^n$
- Evaluation sample $X' = (X'_i)_{i=1}^n$
- Target function $f^*(x) = E[Y \mid X = x]$; assume $\|f^*\|_\infty \le B$
- Noise $\epsilon = Y - f^*(X)$ satisfies Bernstein's moment conditions
- Candidate functions $f$ from a class $F$
- Average squared error $\|Y - f\|_X^2 = \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2$

Penalized Least Squares
$\hat f$ is chosen to satisfy
  $\|Y - \hat f\|_X^2 + \mathrm{pen}_n(\hat f) \le \inf_{f \in F} \{ \|Y - f\|_X^2 + \mathrm{pen}_n(f) + A_f \}$
- $\mathrm{pen}_n(f)$ and $A_f$ may depend on the data $X, Y$
- $A_f$ is an index of computational accuracy
- Truncated estimator $T\hat f$ at a level $B' \ge B$
- We want the risk bounded by
  $E\|T\hat f - f^*\|^2 \le (1+\delta) \underbrace{\inf_{f \in F} \{ \|f - f^*\|^2 + E\,\mathrm{pen}_n(f) + E A_f \}}_{\text{index of resolvability}}$

Acceptable Penalties
What kinds of penalties produce the required risk bound? Acceptable (or proper) penalties.

Countable Case
- Consider a countable $F$
- Penalty $\gamma L(f)/n$ proportional to complexities; Kraft inequality $\sum_{f \in F} e^{-L(f)} \le 1$
- $P_n$, $P'_n$ empirical distributions of $X$ and $X'$
- From the Hoeffding and Bernstein inequalities,
  $E \sup_{f \in F} \Big\{ \frac{1}{c} \underbrace{P'_n[(f - f^*)^2]}_{\|f - f^*\|_{X'}^2} - \underbrace{P_n[(Y - f)^2 - \epsilon^2]}_{\|Y - f\|_X^2 - \|Y - f^*\|_X^2} - \frac{\gamma L(f)}{n} \Big\} \le 0, \quad c > 1$
- $\gamma$ depends on $B$, $B'$, $c$, $\sigma^2$ and $h_{\mathrm{Bern}}$

Risk Bound in the Countable Case
The risk is bounded by
  $E\|T\hat f - f^*\|_{X'}^2 \le c \min_{f \in F} \Big\{ \|f - f^*\|^2 + \frac{\gamma L(f)}{n} \Big\}$

Uncountable Case
A valid $\mathrm{pen}_n(f)$ for uncountable $F$: there exist a countable class $\tilde F$ and complexities $L(\tilde f)$ with
  $\sup_{f \in F} \Big\{ \frac{1}{c} P'_n(g_f) - P_n(\rho_f) - \mathrm{pen}_n(f) \Big\} \le \sup_{\tilde f \in \tilde F} \Big\{ \frac{1}{\tilde c} P'_n(g_{\tilde f}) - P_n(\rho_{\tilde f}) - \frac{\gamma L(\tilde f)}{n} \Big\}$
where $c \ge \tilde c > 1$. The inequality may hold pointwise or in expectation. Here
  $g_f(X') = (f(X') - f^*(X'))^2, \qquad \rho_f(X, Y) = (Y - f(X))^2 - (Y - f^*(X))^2$

Acceptable Penalty
Variable-complexity, variable-distortion cover: for $f$ in $F$, the penalty $\mathrm{pen}_n(f)$ is valid if there is a representor $\tilde f$ such that $\mathrm{pen}_n(f)$ is at least
  $\frac{\gamma L(\tilde f)}{n} + \Delta_n(f, \tilde f), \qquad \Delta_n(f, \tilde f) = \|Y - \tilde f\|_X^2 - \|Y - f\|_X^2 + \frac{1}{\tilde c}\|\tilde f - Tf\|_{X'}^2 - \frac{1}{c}\|f - f^*\|_{X'}^2$
with $c \ge \tilde c > 1$.
- $\tilde F$ consists of $\tilde f$ bounded by $B'$
- The risk is bounded by
  $E\|T\hat f - f^*\|_{X'}^2 \le c \inf_{f \in F} \big\{ \|f - f^*\|^2 + E[\mathrm{pen}_n(f) + A_f] \big\}$

Penalty via the Complexity-Distortion Trade-off
Allowing unbounded $\tilde f$: an acceptable penalty is at least
  $\inf_{\tilde f \in \tilde F} \Big\{ \frac{\gamma L(\tilde f)}{n} + D_n(f, \tilde f) \Big\}$
where
  $D_n(f, \tilde f) = \|Y - \tilde f\|_X^2 - \|Y - f\|_X^2 + \|\tilde f - f\|_{X'}^2$
  $\gamma = \underbrace{1.6(B + B')^2}_{\text{main term}} + 2 h_{\mathrm{Bern}} (B + B') + \underbrace{2.7\sigma^2}_{\text{arising from noise}}$

Risk Bound
The risk of $T\hat f$ satisfies
  $E\|T\hat f - f^*\|_{X'}^2 \le 3 \inf_{f \in F} \Big\{ \|f - f^*\|^2 + E[\mathrm{pen}_n(f) + A_f] + \frac{\mathrm{tail}}{n} \Big\}$
- Noise bounded: $\mathrm{tail} = 0$, $B' = B + C$
- Noise sub-Gaussian: $\mathrm{tail} = \mathrm{const}$, $B' = B + C\sqrt{\log n}$
- Noise Bernstein: $\mathrm{tail} = \mathrm{const}$, $B' = B + C \log n$

Our Work
General penalty condition:
- Subset selection: $\mathrm{pen}_n(f_m) = \frac{\gamma}{n} \log\binom{M}{m} + \frac{m \log n}{n}$
- $\ell_1$ penalization: $\mathrm{pen}_n(f_\beta) = \lambda_n \|\beta\|_1$. What size $\lambda_n$?
- Combinations thereof (see paper)
- A greedy algorithm for each

Sampling Idea
$f = \sum_h \beta_h h$, a linear combination of the $h$ in $H$. Randomly draw $h_1, h_2, \ldots, h_m$ independently, with probability proportional to $|\beta_h|$ that $h_i = h$. This idea is useful in:
- approximation bounds
- the proof of the acceptability of a penalty via countable covers
- bounding the computational inaccuracy of the greedy algorithm
Squared error of order $1/m$ or better for each (a numerical sketch follows).
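
To make the sampling idea concrete, here is a minimal numerical sketch, not from the talk: the cosine library, the coefficients, and all variable names are invented for illustration. Drawing $m$ library members with probability proportional to $|\beta_h|$ and averaging gives an unbiased approximation of $f_\beta$ with squared error of order $\|\beta\|_1^2/m$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, m = 200, 50, 25
X = rng.uniform(-1, 1, size=n)

# Library H: sinusoid features h_j(x) (so ||h_j||_n <= 1); beta: sparse coefficients.
freqs = rng.uniform(0.5, 5.0, size=M)
H = np.array([np.cos(f * X) for f in freqs])      # shape (M, n)
beta = np.zeros(M)
beta[rng.choice(M, size=5, replace=False)] = rng.normal(size=5)

f_target = beta @ H                                # f(x) = sum_h beta_h h(x)
v = np.abs(beta).sum()                             # v = ||beta||_1

# Draw h_1,...,h_m with P(h_i = h_j) proportional to |beta_j|; the average
# f_m = (v/m) sum_k sign(beta_{j_k}) h_{j_k} is an unbiased estimate of f.
idx = rng.choice(M, size=m, p=np.abs(beta) / v)
f_m = (v / m) * (np.sign(beta[idx]) @ H[idx])

# Empirical squared error against the v^2/m bound from the sampling argument.
print(np.mean((f_target - f_m) ** 2), v ** 2 / m)
```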

Part 2: Settings and $\ell_1$ Penalization

Regression Settings
- Training data $(X, Y) = (X_i, Y_i)_{i=1}^n$
- Evaluation at $X' = (X'_i)_{i=1}^n$, an independent copy of $X$
- Target function $f^*(x) = E[Y \mid X = x]$, with $\|f^*\|_\infty \le B$
- Noise $\epsilon = Y - f^*(X)$ satisfies Bernstein's conditions
- Function class $F = F_H$, the linear span of a library $H$
- $f$ in $F_H$ has the form $f(x) = f_\beta(x) = \sum_h \beta_h h(x)$ with $\beta = (\beta_h : h \in H)$

$\ell_1$ Penalized Least Squares Estimator
Find $\hat\beta$, $\hat f = f_{\hat\beta}$ satisfying
  $\|Y - f_{\hat\beta}\|_X^2 + \lambda \|\hat\beta\|_{1,a} = \min_\beta \big\{ \|Y - f_\beta\|_X^2 + \lambda \|\beta\|_{1,a} \big\}$
where $f_\beta(x) = \sum_h \beta_h h(x)$ and $\|\beta\|_{1,a} = \sum_h |\beta_h| a_h$.
- Lasso (Tibshirani 1996)
- Basis Pursuit (Chen and Donoho 1996)
(A worked example with an off-the-shelf solver follows.)
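
The weighted criterion can be handed to any plain lasso solver after rescaling the dictionary columns by $1/a_h$. A minimal sketch assuming scikit-learn is available; the data and all names are synthetic, and alpha = $\lambda/2$ matches scikit-learn's $\frac{1}{2n}\|y - Xw\|^2 + \alpha\|w\|_1$ normalization against the $\frac{1}{n}$ normalization used here.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, M = 100, 30
Phi = rng.normal(size=(n, M))                  # Phi[i, j] = h_j(X_i)
beta_true = np.zeros(M)
beta_true[:3] = [2.0, -1.5, 1.0]
y = Phi @ beta_true + 0.1 * rng.normal(size=n)

a = np.sqrt((Phi ** 2).mean(axis=0))           # a_h = ||h||_n (traditional weights)
lam = 0.1

# Rescaling column h by 1/a_h turns the plain l1 penalty on the rescaled
# coefficients w into ||beta||_{1,a} for beta = w / a.
model = Lasso(alpha=lam / 2, fit_intercept=False)
model.fit(Phi / a, y)
beta_hat = model.coef_ / a
print(np.round(beta_hat[:5], 3))
```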

Areas to Be Explored
We show that $\|\beta\|_1 = \|\beta\|_{1,a}$ gives a proper penalty, and hence the corresponding resolvability risk bound follows.
- What kinds of weights $a_h$?
- What is the condition on $\lambda$?
- What is the convergence rate of the risk?
Results in Huang, Cheang and Barron (2008), Section 4.


Risk Bound When $H$ Is Finite
- Consider the case that $H$ is finite, of size $M$ (also called $p$)
- Weights $a_h = \|h\|$ in the traditional setting
- Weights $a_h = \sqrt{2}\,\|h\|_{X,X'}$ in the transductive setting, where $\|h\|_{X,X'}^2 = \frac{1}{n}\sum_{i=1}^n \big( h^2(X_i) + h^2(X'_i) \big)$
- $\lambda$ chosen at least $2\sqrt{2\gamma(\log 2M)/n}$, with $\gamma = 1.6(B+B')^2 + 2(B+B') h_{\mathrm{Bern}} + 2.7\sigma^2$
- $T\hat f$ truncates at the level $B' \ge B$
- The risk satisfies
  $E\|T\hat f - f^*\|^2 \le 3 \inf_\beta \big[ \|f_\beta - f^*\|^2 + \lambda \|\beta\|_{1,a} \big] + \mathrm{adjust}/n$
- The adjustment terms are negligible compared to the main terms

A Glance at the Proof
$\tilde F$ consists of $\tilde f$ of the form
  $\tilde f(x) = \frac{v}{m} \sum_{k=1}^m h_k(x)/a_{h_k}, \quad h_k \in H, \ m = 1, 2, \ldots$
with complexities $L(\tilde f) = m \log M + m \log 2$. Using the sampling idea, there is a representor $\tilde f_m$ for each $f$:
  $\underbrace{\|Y - \tilde f_m\|_X^2 - \|Y - f\|_X^2}_{\le\, v \|\beta\|_{1,a}/m} + \underbrace{\|f - \tilde f_m\|_{X'}^2}_{\le\, v \|\beta\|_{1,a}/m} + \underbrace{\gamma L(\tilde f_m)/n}_{\gamma m \log(2M)/n}$
Take $m = \|\beta\|_{1,a}/\eta$ and $v = m\eta$; optimizing $\eta$ (of order $\sqrt{\gamma \log(2M)/n}$) shows that $\mathrm{pen}_n(f_\beta)$ at least $2\sqrt{2\gamma(\log 2M)/n}\,\|\beta\|_{1,a}$ suffices.

Improvement
An improvement based on an empirical $L_2$ covering of the library $H$:
- $\tilde H_2$: a finite cover of precision $\varepsilon_2$ and cardinality $\tilde m_2$
- Use $a_h = 1$
- $\lambda$ is chosen at least $\lambda_n = 2\varepsilon_2 \sqrt{\frac{2\gamma \log(2\tilde m_2)}{n}}$
- The risk satisfies
  $E\|T\hat f - f^*\|^2 \le 3 \min_\beta \big\{ \|f_\beta - f^*\|^2 + \lambda \|\beta\|_1 \big\} + \frac{\mathrm{adjust}}{n}$

Stratified Sampling
- $\tilde H_2$ is an $L_2$ cover of $H$ with precision $\varepsilon_2$ and cardinality $\tilde m_2$
- For $f = \sum_h \beta_h h$ in $F_H$, there is an $f_m = (v/m)\sum_{k=1}^m h_k$ such that
  $\|f - f_m\|^2 \le \frac{\varepsilon_2^2 \|\beta\|_1 v}{m - \tilde m_2}$
- $v$ is between $\sum_h |\beta_h|$ and $\sum_h |\beta_h| \,(1 + \tilde m_2/(m - \tilde m_2))$
- Based on Makovoz (1996)

Proof of the Improvement
Same $\tilde F$ and $L(\tilde f)$. Using the stratified sampling idea, there is a representor $\tilde f_m$ for each $f$:
  $\underbrace{\|Y - \tilde f_m\|_X^2 - \|Y - f\|_X^2}_{\le\, \varepsilon_2^2 \|\beta\|_1 v/(m - \tilde m_2)} + \underbrace{\|f - \tilde f_m\|_{X'}^2}_{\le\, \varepsilon_2^2 \|\beta\|_1 v/(m - \tilde m_2)} + \underbrace{\gamma L(\tilde f_m)/n}_{\gamma m \log(2\tilde m_2)/n}$
Set $v = m\eta/\varepsilon_2$ with $\varepsilon_2 \sum_h |\beta_h|/\eta \le m \le \varepsilon_2 \sum_h |\beta_h|/\eta + \tilde m_2$. Optimizing $\eta$, the penalty is at least
  $\lambda_n \|\beta\|_1 + \frac{\gamma \tilde m_2 \log(2\tilde m_2)}{n}$

Risk Bound When $H$ Is Infinite
An improvement with two levels of cover:
- A fine precision $\varepsilon_1$, typically of order $\varepsilon_2/\sqrt{n}$: we consider empirical covers $\tilde H_1$ providing an effective library size $\tilde M_1$, which serves as a surrogate for $M$
- $\tilde H_2$ is the same as before
- Use $a_h = 1$
- $\lambda$ at least $2\varepsilon_2 \sqrt{\frac{2\gamma \log(2\tilde M_1)}{n}} + 16 B' \varepsilon_1$
- The risk satisfies
  $E\|T\hat f - f^*\|^2 \le 3 \min_\beta \big\{ \|f_\beta - f^*\|^2 + \lambda \|\beta\|_1 \big\} + \frac{\mathrm{adjust}}{n}$
- A quantity $2\gamma \tilde m_2 \log(2\tilde M_1)$ appears in the adjustment

Further Exploration of the Infinite Case
- Take advantage of the covering properties of $H$ to relate $\tilde M_1$ and $\varepsilon_1$, and likewise $\tilde m_2$ and $\varepsilon_2$
- The library $H$ has metric dimension $d_1$ w.r.t. the empirical $L_1$ norm if the cardinality $\tilde M_1$ is of order $(1/\varepsilon)^{d_1}$
- Likewise, the metric dimension is $d_2$ w.r.t. the empirical $L_2$ norm
- $d_1 \le d_2 \le 2 d_1$

$\ell_1$ Penalty for Libraries of Finite Metric Dimension
The library $H$ has dimensions $d_1$ and $d_2$ w.r.t. the empirical $L_1$ and $L_2$ norms. For $\lambda$ at least
  $\lambda_{n,d} = C_1(d_1, d_2) \left( \frac{d_1}{n} \log\frac{n}{d_1} \right)^{(d_2+2)/(2d_2+2)}$
the risk tends to zero at the rate
  $\inf_\beta \left\{ \|f_\beta - f^*\|^2 + \left( \frac{d_1}{n} \log\frac{n}{d_1} \right)^{\frac{d_2+2}{2(d_2+1)}} \|\beta\|_1 \right\}$

A Refined Penalty
The library $H$ has dimensions $d_1$ and $d_2$ w.r.t. the empirical norms. Using the penalty
  $\mathrm{pen}_n(f_\beta) = \lambda \|\beta\|_1^{d_2/(d_2+1)}$
with $\lambda$ at least $\lambda_{n,d}$, the penalized least squares estimator $\hat f$ satisfies the resolvability risk bound
  $E\|T\hat f - f^*\|^2 \le 3 \min_\beta \left\{ \|f_\beta - f^*\|^2 + \lambda_{n,d} \|\beta\|_1^{\frac{d_2}{d_2+1}} \right\} + \frac{\mathrm{adjust}}{n}$
This gives a smaller index of resolvability.

Variation and $L_{1,H}$
The variation $V(f)$ of $f$, w.r.t. $H$ and weights $a = (a_h)$, is
  $V(f) = \lim_{\varepsilon \to 0} \inf_{f_\varepsilon \in F_H} \left\{ \|\beta\|_{1,a} : f_\varepsilon = \sum_h \beta_h h \ \text{and} \ \|f_\varepsilon - f\| \le \varepsilon \right\}$
- A natural extension of $\|\beta\|_{1,a}$
- $L_{1,H}$ consists of the functions with finite variation

Approximation and Penalty Trade-off
We discuss the trade-off between approximation error and penalty, as expressed in the resolvability, and its relationship to interpolation spaces between two classes of functions.
- Squared approximation error: $\mathrm{App}(f^*, v) = \inf_{f_\beta :\, \|\beta\|_1 = v} \|f_\beta - f^*\|^2$
- Resolvability: $R_1(f^*, \lambda_n) = \inf_v \{ \mathrm{App}(f^*, v) + \lambda_n v \}$
- If $f^* \in L_{1,H}$, then $R_1(f^*, \lambda_n) \le \lambda_n V(f^*)$ goes to 0 linearly in $\lambda_n$
- If $f^*$ is merely in $L_2(P)$, the convergence rate can be arbitrarily slow

The Interpolation Space $B_{1,p}^{\mathrm{res}}$
Consider $B_{1,p}^{\mathrm{res}} = \{ f : R_1(f, \lambda) \le c_f \lambda^{2-p} \ \text{for all} \ \lambda > 0 \}$, indexed by $1 \le p \le 2$.
- These coincide with the traditional interpolation spaces $B_p = [L_2(P), L_{1,H}]_\theta$
- When $p = 1$, $B_{1,1}^{\mathrm{res}}$ includes $L_{1,H}$
- If $f^* \in B_{1,p}^{\mathrm{res}}$, the resolvability, of order $\lambda_n^{2-p}$, provides the rate $\varepsilon_2^{2-p} \left( \frac{\gamma \log M}{n} \right)^{1-p/2}$

Trade-off for Finite-Dimensional Libraries
Suppose $f^* \in B_{1,p}^{\mathrm{res}}$ and $H$ has dimensions $d_1$ and $d_2$.
- The resolvability $R_1(f^*, \lambda_n)$, with $R_1(f, \lambda) = \inf_v \{ \mathrm{App}(f, v) + \lambda v \}$, is of order
  $\left( \frac{d_1}{n} \log\frac{n}{d_1} \right)^{\frac{(1-p/2)(d_2+2)}{d_2+1}}$
- The resolvability $R_{1,r}(f^*, \lambda_n)$, with $r = d_2/(d_2+1)$ and $R_{1,r}(f, \lambda) = \inf_v \{ \mathrm{App}(f, v) + \lambda v^r \}$, is of order
  $\left( \frac{d_1}{n} \log\frac{n}{d_1} \right)^{\frac{(1-p/2)(d_2+2)}{d_2+2-p}}$

Variable Complexity
For a finite library $H$:
- Variable complexities $L(h)$ satisfying $\sum_h e^{-L(h)} \le 1$
- $a_{L,h} = \|h\| \sqrt{L(h) + \log 2}$ in the traditional setting
- $a_{L,h} = \sqrt{2}\,\|h\|_{X,X'} \sqrt{L(h) + \log 2}$ in the transductive setting
- A similar risk bound holds, using $L(h) + \log 2$ inside the sum defining $\|\beta\|_{1,a_L}$, in place of the constant $\log M + \log 2$ outside the sum

Computational Accuracy
$\hat\beta$ and $f_{\hat\beta}$ satisfy
  $\|Y - f_{\hat\beta}\|_X^2 + \lambda \|\hat\beta\|_{1,a} \le \inf_\beta \big\{ \|Y - f_\beta\|_X^2 + \lambda \|\beta\|_{1,a} + A_{\beta,m} \big\}$
The same risk bound still holds, with $E A_{\beta,m}$ inside the index of resolvability.

Part 3: Algorithms for $\ell_1$ Penalized Least Squares and $\ell_1$ Penalized Loglikelihood

$\ell_1$ Penalized Least Squares
Dictionary $H = \{h(x)\}$ of size $M$ (also called $p$); data $(X_i, Y_i)_{i=1}^n$. Fit a function in the linear span $F_H$ to minimize
  $\frac{1}{n} \sum_{i=1}^n \Big( Y_i - \sum_h \beta_h h(X_i) \Big)^2 + \lambda \sum_h |\beta_h|$

Algorithms for $\ell_1$ Penalized Least Squares
Examples of existing algorithms:
- interior point methods (Boyd et al., 2004)
- LARS (Efron, Hastie, Johnstone and Tibshirani, 2004)
- coordinate descent (Friedman et al., 2007)
Built on Jones (1992) and others:
- $\ell_1$ penalized greedy pursuit (LPGP), HCB (2008), Section 3

LPGP for Least Squares
Initialize $f_0(x) = 0$. Iteratively seek
  $f_m(x) = (1 - \alpha_m) f_{m-1}(x) + \beta_m h_m(x)$
to minimize, over $h \in H$, $\alpha \in (0,1)$ and $\beta \in \mathbb{R}$,
  $\big\| Y_i - (1-\alpha) f_{m-1}(X_i) - \beta h(X_i) \big\|_n^2 + \lambda \big[ |\beta| + (1-\alpha) v_{m-1} \big]$
where $v_m = \sum_{j=1}^m |\beta_{j,m}|$ for $f_m = \sum_{j=1}^m \beta_{j,m} h_j$. (A code sketch follows.)
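
A hedged sketch of this update, not the authors' implementation: the dictionary is a matrix Phi with Phi[i, j] = h_j(X_i), $\alpha$ is handled by a grid search, and for fixed $(\alpha, h)$ the optimal $\beta$ is the usual soft-thresholded least squares coefficient. All names are illustrative.

```python
import numpy as np

def lpgp_least_squares(Phi, y, lam, n_steps=50, alpha_grid=None):
    """Greedy pursuit for (1/n)||y - f||^2 + lam * v, sketched after LPGP."""
    n, M = Phi.shape
    if alpha_grid is None:
        alpha_grid = np.linspace(0.01, 0.99, 25)
    f = np.zeros(n)                    # current fit f_{m-1}(X_i)
    v = 0.0                            # current l1 weight v_{m-1}
    col_sq = (Phi ** 2).mean(axis=0)   # ||h||_n^2 for each column
    for _ in range(n_steps):
        best = None
        for alpha in alpha_grid:
            r = y - (1 - alpha) * f    # residual of the shrunken fit
            # For each h, minimize mean((r - beta*h)^2) + lam*|beta| exactly:
            # the soft-thresholded least squares coefficient.
            c = (Phi * r[:, None]).mean(axis=0)
            beta = np.sign(c) * np.maximum(np.abs(c) - lam / 2, 0.0) / col_sq
            obj = (((r[:, None] - Phi * beta) ** 2).mean(axis=0)
                   + lam * (np.abs(beta) + (1 - alpha) * v))
            j = int(np.argmin(obj))
            if best is None or obj[j] < best[0]:
                best = (obj[j], alpha, j, beta[j])
        _, alpha, j, beta_j = best
        f = (1 - alpha) * f + beta_j * Phi[:, j]
        v = (1 - alpha) * v + abs(beta_j)
    return f, v
```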

Theorem for Accuracy
Let $\|\cdot\|_n$ be the empirical $\ell_2$ norm and $V_f = \inf \{ \sum_h |\beta_h| \|h\|_n : f(x) = \sum_h \beta_h h(x) \}$. The $m$-step LPGP estimator is within order $V_f^2/m$ of the minimal objective:
  $\|Y_i - f_m(X_i)\|_n^2 + \lambda v_m \le \inf_{f \in F_H} \left\{ \|Y_i - f(X_i)\|_n^2 + \lambda V_f + \frac{4 V_f^2}{m+1} \right\}$
(A toy numerical check follows.)
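
A toy numerical check of this guarantee, reusing the hypothetical lpgp_least_squares sketch above and comparing against a direct lasso solve with scikit-learn (alpha = $\lambda/2$ matches the normalization, as before); the gap between the two objective values should be small and shrink with the number of greedy steps.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, M = 120, 40
Phi = rng.normal(size=(n, M))
y = Phi[:, 0] - 0.5 * Phi[:, 1] + 0.1 * rng.normal(size=n)
lam = 0.2

f_m, v_m = lpgp_least_squares(Phi, y, lam, n_steps=100)
greedy_obj = np.mean((y - f_m) ** 2) + lam * v_m

exact = Lasso(alpha=lam / 2, fit_intercept=False).fit(Phi, y)
exact_obj = np.mean((y - exact.predict(Phi)) ** 2) + lam * np.abs(exact.coef_).sum()
print(greedy_obj - exact_obj)   # small, nonnegative up to numerical tolerance
```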

Advantages and Disadvantages
- Advantages: an explicit guarantee of accuracy; cost $Mnm$ vs. $Mn^2$ for LARS; inexpensive optimization at each iteration.
- Disadvantages: an approximate solution; fixed $\lambda_n$.

Idea of the Proof
WLOG $\beta_h \ge 0$ (assume $H$ is closed under sign change). For an arbitrary $f = \sum_h \beta_h h$, let
  $e_m^2 = \|Y_i - f_m(X_i)\|_n^2 - \|Y_i - f(X_i)\|_n^2 + \lambda v_m$
With $f_m = (1-\alpha_m) f_{m-1} + \beta_m h_m$, the value $e_m^2$ is at least as good as with the choices $\alpha = \frac{2}{m+1}$, $\beta = \alpha V_f$, and an $h$ chosen at random.

Idea of the Proof (continued)
Rearrange to obtain
  $e_m^2 \le (1-\alpha) e_{m-1}^2 + \alpha^2 b(V_f h) + \alpha \lambda V_f - 2\alpha(1-\alpha) \underbrace{\frac{1}{n} \sum_{i=1}^n (Y_i - f_{m-1}(X_i)) (V_f h(X_i) - f(X_i))}_{\to\, 0 \ \text{on averaging over} \ h}$
where $b(V_f h) = \|Y - V_f h\|_n^2 - \|Y - f\|_n^2$. Drawing $h$ with probability $\beta_h / V_f$, the cross term vanishes, and the average of $b(V_f h)$ over $h$ is bounded by $V_f^2$.

Idea of the Proof (conclusion)
We show
  $e_m^2 \le (1-\alpha) e_{m-1}^2 + \alpha^2 V_f^2 + \alpha \lambda V_f$
and induction then yields the $O(1/m)$ accuracy, as worked out below.
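
For completeness, a worked version of the induction; this is our own sketch, and the shorthand $d_m$ is not from the slides.

```latex
% Shorthand: d_m := e_m^2 - \lambda V_f, with \alpha_m = 2/(m+1).
% The recursion above gives
\[
  d_m \;\le\; (1 - \alpha_m)\, d_{m-1} + \alpha_m^2 V_f^2 .
\]
% Claim: d_m \le 4 V_f^2/(m+1).  Base case m = 1 (\alpha_1 = 1): d_1 \le V_f^2.
% Inductive step:
\[
  d_m \;\le\; \Bigl(1 - \tfrac{2}{m+1}\Bigr) \frac{4 V_f^2}{m} + \frac{4 V_f^2}{(m+1)^2}
      \;=\; 4 V_f^2 \Bigl[ \frac{m-1}{m(m+1)} + \frac{1}{(m+1)^2} \Bigr]
      \;\le\; \frac{4 V_f^2}{m+1},
\]
% since (m-1)(m+1) + m \le m(m+1).
```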

$\ell_1$ Penalized Loglikelihood
Let $X_1, \ldots, X_n$ be i.i.d. in $\mathbb{R}^p$, distributed as
  $p_f(x) = \frac{e^{f(x)} p_0(x)}{C_f}$
and let $L_n(f) = \frac{1}{n} \log(1/p_f(X^n))$. The $\ell_1$ penalized loglikelihood estimator $\hat f = f_{\hat\beta}$ minimizes
  $L_n(f) + \lambda V_f$
where $V_f = \inf \{ \sum_h |\beta_h| : f(x) = \sum_h \beta_h h(x), \ h \in H \}$.

Motivation
- Minimization is computationally demanding when $p$ is large.
- Term-by-term selection is favored in sparse settings.
- Approximate optimization is good enough for the risk analysis.
- BHLL (2008) extends LPGP to the penalized loglikelihood.

LPGP for Penalized Loglikelihood
Initialize with $f_0(x) = 0$. Set
  $f_m(x) = (1-\alpha_m) f_{m-1}(x) + \beta_m h_m(x)$
with $\alpha_m$, $\beta_m$ and $h_m$ chosen by
  $\operatorname{argmin}_{\alpha, \beta, h} \big\{ L_n(f_m) + \lambda [ (1-\alpha) v_{m-1} + |\beta| ] \big\}$
where $v_{m-1} = \sum_{j=1}^{m-1} |\beta_{j,m-1}|$ for $f_{m-1} = \sum_{j=1}^{m-1} \beta_{j,m-1} h_j$. (A sketch follows.)
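
A hedged sketch of this variant on a finite domain, where the normalizer $C_f = \sum_x p_0(x) e^{f(x)}$ is computable exactly; the grid search over $(\alpha, \beta, h)$ and all names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def lpgp_loglik(H_grid, H_data, p0_grid, lam, n_steps=30):
    # H_grid[j]: h_j on the domain grid; H_data[j]: h_j at the samples X_i.
    M = len(H_grid)
    f_grid, f_data, v = np.zeros_like(H_grid[0]), np.zeros_like(H_data[0]), 0.0
    alphas = np.linspace(0.05, 0.95, 19)
    betas = np.linspace(-2.0, 2.0, 41)

    def objective(fg, fd, weight):
        # L_n(f) + lam * v, up to a constant that does not depend on f.
        logC = np.log(np.sum(p0_grid * np.exp(fg)))
        return -(fd.mean() - logC) + lam * weight

    for _ in range(n_steps):
        best, move = objective(f_grid, f_data, v), None
        for a in alphas:
            for b in betas:
                for j in range(M):
                    obj = objective((1 - a) * f_grid + b * H_grid[j],
                                    (1 - a) * f_data + b * H_data[j],
                                    (1 - a) * v + abs(b))
                    if obj < best:
                        best, move = obj, (a, b, j)
        if move is None:
            break                      # no strictly improving greedy step
        a, b, j = move
        f_grid = (1 - a) * f_grid + b * H_grid[j]
        f_data = (1 - a) * f_data + b * H_data[j]
        v = (1 - a) * v + abs(b)
    return f_grid, v
```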

Theorem
Suppose $|h(x)| \le C$ for all $h \in H$. The $m$-step LPGP estimator $f_m(x)$ has
  $L_n(f_m) + \lambda v_m \le \inf_{f \in F} \left\{ L_n(f) + \lambda V_f + \frac{2 V_f^2}{m+1} \right\}$

Idea of the Proof
- The $m$-step error has linear and nonlinear components.
- The linear parts are handled as in the least squares case.
- The nonlinear part (the normalizing constants) is $O(\alpha^2)$, by a moment generating function bound.
- Induction completes the proof.

Accuracy
With $f_m(x) = (1-\alpha_m) f_{m-1}(x) + \beta_m h_m(x)$, let
  $e_m = L_n(f_m) - L_n(f) + \lambda [ (1-\alpha_m) v_{m-1} + |\beta_m| ]$
From the definition, $L_n(f_m) - L_n(f)$ equals
  $-\frac{1}{n} \sum_{i=1}^n \underbrace{\big[ (1-\alpha_m) f_{m-1}(X_i) + \beta_m h_m(X_i) - f(X_i) \big]}_{\text{linear}} + \underbrace{\log \frac{\int e^{f_m(t)} p_0(t)\,dt}{\int e^{f(t)} p_0(t)\,dt}}_{\text{nonlinear}}$

Sampling $h$
Consider $\alpha = 2/(m+1)$, $\beta = \alpha V_f$, and a random $h(x)$. Rearranging, with $p_\alpha(x) = e^{(1-\alpha)[f_{m-1}(x) - f(x)]} p_f(x)/c$ for a normalizing constant $c$,
  $e_m \le (1-\alpha) e_{m-1} + \alpha \lambda V_f + \alpha \underbrace{\frac{1}{n} \sum_{i=1}^n [ f(X_i) - V_f h(X_i) ]}_{\to\, 0 \ \text{on averaging over} \ h} + \log \int p_\alpha(t) \exp\{ \alpha (V_f h(t) - f(t)) \}\,dt$
Sampling $h$ with probability $\beta_h / V_f$, the third term vanishes. Bringing the average over $h$ inside the log, the expectation over the random $h$ of $\exp\{ \alpha (V_f h(t) - f(t)) \}$ is not more than $e^{\alpha^2 V_f^2 / 2}$.

Induction
We show
  $e_m \le (1-\alpha) e_{m-1} + \frac{\alpha^2 V_f^2}{2} + \alpha \lambda V_f$
and induction completes the proof.

Current Work
- Generalize to permit an $\ell_2$ norm in the penalized loglikelihood.
- High dimensional graphical models: logistic, Gaussian.
- An R package will be made publicly available.

Part 4: Summary

Our Work (Summary)
General penalty condition:
- Subset selection: $\mathrm{pen}_n(f_m) = \frac{\gamma}{n} \log\binom{M}{m} + \frac{m \log n}{n}$
- $\ell_1$ penalization: $\mathrm{pen}_n(f_\beta) = \lambda_n \|\beta\|_1$ (valid). What size $\lambda_n$?
- Combinations thereof (see paper)
- A greedy algorithm for each (valid)

Sampling Idea (Recap)
- $f = \sum_h \beta_h h$, a linear combination of the $h$ in $H$
- Randomly draw $h_1, h_2, \ldots, h_m$ independently, with probability proportional to $|\beta_h|$ that $h_i = h$
- Useful in approximation bounds, in proving the acceptability of a penalty via countable covers, and in bounding the computational inaccuracy of the greedy algorithm
- Squared error of order $1/m$ or better for each