$\ell_1$-Regularized Linear Regression: Persistence and Oracle Inequalities


$\ell_1$-Regularized Linear Regression: Persistence and Oracle Inequalities. Peter Bartlett, EECS and Statistics, UC Berkeley. Slides at http://www.stat.berkeley.edu/~bartlett. Joint work with Shahar Mendelson and Joe Neeman.

$\ell_1$-regularized linear regression

Random pair: $(X, Y) \sim P$, in $\mathbb{R}^d \times \mathbb{R}$. $n$ independent samples drawn from $P$: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Find $\beta$ so that the linear function $\langle X, \beta\rangle$ has small risk,
$$P\ell_\beta = P\left(\langle X, \beta\rangle - Y\right)^2.$$
Here, $\ell_\beta(X, Y) = (\langle X, \beta\rangle - Y)^2$ is the quadratic loss of the linear prediction.
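
A minimal numerical rendering of these objects (my own illustration, not from the talk): the quadratic loss, the empirical risk $P_n\ell_\beta$, and a Monte Carlo estimate of the population risk $P\ell_\beta$ under a synthetic choice of $P$.

```python
# Quadratic loss, empirical risk P_n l_beta, and a Monte Carlo estimate of P l_beta.
import numpy as np

rng = np.random.default_rng(0)
d = 10
beta_true = rng.standard_normal(d)

def sample(n):                        # a stand-in for P: X ~ N(0, I_d), Y = <X, beta_true> + noise
    X = rng.standard_normal((n, d))
    return X, X @ beta_true + 0.3 * rng.standard_normal(n)

def empirical_risk(beta, X, y):       # P_n l_beta = (1/n) sum_i (<X_i, beta> - Y_i)^2
    return np.mean((X @ beta - y) ** 2)

beta = np.zeros(d)
X, y = sample(200)
print(empirical_risk(beta, X, y))             # P_n l_beta on n = 200 samples
print(empirical_risk(beta, *sample(10**6)))   # Monte Carlo estimate of the population risk P l_beta
```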

$\ell_1$-regularized linear regression

Random pair: $(X, Y) \sim P$, in $\mathbb{R}^d \times \mathbb{R}$. $n$ independent samples drawn from $P$: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Find $\beta$ so that the linear function $\langle X, \beta\rangle$ has small risk, $P\ell_\beta = P(\langle X, \beta\rangle - Y)^2$.

Example. $\ell_1$-regularized least squares:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^d}\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^d} \right),$$
where $P_n \ell_\beta = \frac{1}{n}\sum_{i=1}^n (\langle X_i, \beta\rangle - Y_i)^2$ and $\|\beta\|_{\ell_1^d} = \sum_{j=1}^d |\beta_j|$.

$\ell_1$-regularized linear regression

Example. $\ell_1$-regularized least squares:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^d}\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^d} \right),$$
where $P_n \ell_\beta = \frac{1}{n}\sum_{i=1}^n (\langle X_i, \beta\rangle - Y_i)^2$ and $\|\beta\|_{\ell_1^d} = \sum_{j=1}^d |\beta_j|$.

Tends to select sparse solutions (few non-zero components $\beta_j$). Useful, for example, if $d \gg n$.

$\ell_1$-regularized linear regression

Example. $\ell_1$-regularized least squares:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^d}\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^d} \right).$$

Example. $\ell_1$-constrained least squares:
$$\hat\beta = \arg\min_{\|\beta\|_{\ell_1^d} \le b_n} P_n \ell_\beta.$$

[Recall: $\ell_\beta(X, Y) = (\langle X, \beta\rangle - Y)^2$.]
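
Both estimators are easy to try numerically. Below is a minimal sketch (my own illustration, not from the talk): the penalized form via scikit-learn's Lasso, whose objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$ is a rescaling of $P_n\ell_\beta + \rho_n\|\beta\|_{\ell_1^d}$, and the constrained form via projected gradient descent onto the $\ell_1$ ball. The data, the radius $b$, and the penalty level are arbitrary choices for illustration.

```python
# Minimal sketch of the two estimators on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso

def project_l1_ball(v, b):
    # Euclidean projection of v onto {x : ||x||_1 <= b} (sort-based algorithm)
    if np.abs(v).sum() <= b:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - b)[0][-1]
    theta = (css[k] - b) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def l1_constrained_ls(X, y, b, n_iter=2000):
    # Projected gradient descent for min_{||beta||_1 <= b} P_n l_beta
    n, d = X.shape
    beta = np.zeros(d)
    step = 0.5 * n / np.linalg.norm(X, 2) ** 2       # 1/L, where L is the gradient's Lipschitz constant
    for _ in range(n_iter):
        grad = (2.0 / n) * (X.T @ (X @ beta - y))
        beta = project_l1_ball(beta - step * grad, b)
    return beta

rng = np.random.default_rng(0)
n, d = 200, 500                                      # d > n: the regime where the l1 constraint helps
beta_true = np.zeros(d); beta_true[:5] = 1.0
X = rng.standard_normal((n, d))
y = X @ beta_true + 0.5 * rng.standard_normal(n)

beta_pen = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_   # l1-regularized least squares
beta_con = l1_constrained_ls(X, y, b=5.0)                          # l1-constrained least squares
print(int((beta_pen != 0).sum()), float(np.abs(beta_con).sum()))   # sparsity / l1 norm of the solutions
```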

$\ell_1$-regularized linear regression

Example. $\ell_1$-regularized least squares:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^d}\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^d} \right).$$

Example. $\ell_1$-constrained least squares:
$$\hat\beta = \arg\min_{\|\beta\|_{\ell_1^d} \le b_n} P_n \ell_\beta.$$

Some questions:

Prediction: Does $\hat\beta$ give accurate forecasts? e.g., How does $P\ell_{\hat\beta}$ compare with $P\ell_{\beta^*}$? Here, $\beta^* = \arg\min\left\{P\ell_\beta : \|\beta\|_{\ell_1^d} \le b_n\right\}$.

$\ell_1$-regularized linear regression

Example. $\ell_1$-regularized least squares:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^d}\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^d} \right).$$

Example. $\ell_1$-constrained least squares:
$$\hat\beta = \arg\min_{\|\beta\|_{\ell_1^d} \le b_n} P_n \ell_\beta.$$

Some questions:

Prediction: Does $\hat\beta$ give accurate forecasts? e.g., $P\ell_{\hat\beta}$ versus $P\ell_{\beta^*} = \min\left\{P\ell_\beta : \|\beta\|_{\ell_1^d} \le b_n\right\}$?

Estimation: Under assumptions on $P$, is $\hat\beta$ correct?

Sparsity Pattern Estimation: Under assumptions on $P$, are the non-zeros of $\hat\beta$ correct?

Outline of Talk

1. For $\ell_1$-constrained least squares, bounds on $P\ell_{\hat\beta} - P\ell_{\beta^*}$.
   Persistence (Greenshtein and Ritov, 2004): For what $d_n$, $b_n$ does $P\ell_{\hat\beta} - P\ell_{\beta^*} \to 0$?
   Convex Aggregation (Tsybakov, 2003): For $b = 1$ (convex combinations of dictionary functions), what is the rate of $P\ell_{\hat\beta} - P\ell_{\beta^*}$?
2. For $\ell_1$-regularized least squares, oracle inequalities.
3. Proof ideas.

$\ell_1$-regularized linear regression

Key Issue: $\ell_\beta$ is unbounded, so some key tools (e.g., concentration inequalities) cannot immediately be applied.

For $(X, Y)$ bounded, $\ell_\beta$ can be bounded using $\|\beta\|_{\ell_1^d}$, but this gives loose prediction bounds.

We use chaining to show that the metric structures of $\ell_1$-constrained linear functions under $P_n$ and $P$ are similar.

Main Results: Excess Risk

For $\ell_1$-constrained least squares, $\hat\beta = \arg\min_{\|\beta\|_{\ell_1^d} \le b} P_n \ell_\beta$, if $X$ and $Y$ have suitable tail behaviour then, with probability $1 - \delta$,
$$P\ell_{\hat\beta} - P\ell_{\beta^*} \;\le\; c\,\frac{\log^\alpha(nd)}{\delta^2}\,\min\left\{\frac{b^2}{n} + \frac{d}{n},\;\; \frac{b}{\sqrt n}\left(1 + \frac{b}{\sqrt n}\right)\right\}.$$

Small $d$ regime: $d/n$. Large $d$ regime: $b/\sqrt n$.

Main Results: Excess Risk

For $\ell_1$-constrained least squares, with probability $1 - \delta$,
$$P\ell_{\hat\beta} - P\ell_{\beta^*} \;\le\; c\,\frac{\log^\alpha(nd)}{\delta^2}\,\min\left\{\frac{b^2}{n} + \frac{d}{n},\;\; \frac{b}{\sqrt n}\left(1 + \frac{b}{\sqrt n}\right)\right\}.$$

Conditions:
1. $PY^2$ is bounded by a constant.
2. Either $X$ is bounded a.s., or $X$ is log-concave with $\max_j \|\langle X, e_j\rangle\|_{L_2} \le c$, or $X$ is log-concave and isotropic.

Application: Persistence

Consider $\ell_1$-constrained least squares, $\hat\beta = \arg\min_{\|\beta\|_{\ell_1^d} \le b_n} P_n \ell_\beta$.

Suppose that $PY^2$ is bounded by a constant and the tails of $X$ decay nicely (e.g., $X$ bounded a.s., or $X$ log-concave and isotropic). Then for increasing $d_n$ and
$$b_n = o\!\left(\frac{\sqrt n}{\log^{3/2} n\,\log^{3/2}(n d_n)}\right),$$
$\ell_1$-constrained least squares is persistent (i.e., $P\ell_{\hat\beta} - P\ell_{\beta^*} \to 0$).
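
A hedged simulation sketch (my own illustration, not from the talk): with $X \sim N(0, I_d)$ and $Y = \langle X, \beta_0\rangle + \text{noise}$, the population risk is $P\ell_\beta = \|\beta - \beta_0\|_2^2 + \sigma^2$ and $\beta^*$ is the Euclidean projection of $\beta_0$ onto the $\ell_1$ ball, so the excess risk $P\ell_{\hat\beta} - P\ell_{\beta^*}$ can be computed exactly. The solver, the simplified radius schedule $b_n \asymp \sqrt n/\log(n d_n)$ (which is of the allowed $\tilde o(\sqrt n)$ order), and all constants are illustrative assumptions.

```python
# Simulation sketch of persistence for l1-constrained least squares.
# With X ~ N(0, I_d) and Y = <X, beta0> + sigma*Z, the population risk is
#   P l_beta = ||beta - beta0||_2^2 + sigma^2,
# and beta* (the best beta in the l1 ball) is the Euclidean projection of beta0 onto the ball.
import numpy as np

def project_l1_ball(v, b):
    # Euclidean projection onto {x : ||x||_1 <= b} (sort-based algorithm)
    if np.abs(v).sum() <= b:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - b)[0][-1]
    theta = (css[k] - b) / (k + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def l1_constrained_ls(X, y, b, n_iter=1000):
    # Projected gradient descent for min_{||beta||_1 <= b} (1/n)||X beta - y||^2
    n, d = X.shape
    beta = np.zeros(d)
    step = 0.5 * n / np.linalg.norm(X, 2) ** 2        # 1/L for the empirical-risk gradient
    for _ in range(n_iter):
        grad = (2.0 / n) * (X.T @ (X @ beta - y))
        beta = project_l1_ball(beta - step * grad, b)
    return beta

rng = np.random.default_rng(1)
sigma = 0.5
for n in [100, 400, 1600]:
    d = 2 * n                                          # dimension d_n growing with n
    b = np.sqrt(n) / np.log(n * d)                     # simplified b_n schedule, of order o-tilde(sqrt(n))
    beta0 = np.zeros(d); beta0[:5] = 1.0               # fixed target, not necessarily inside the ball
    X = rng.standard_normal((n, d))
    y = X @ beta0 + sigma * rng.standard_normal(n)
    beta_hat = l1_constrained_ls(X, y, b)
    beta_star = project_l1_ball(beta0, b)              # minimizer of P l_beta over the l1 ball
    excess = np.sum((beta_hat - beta0) ** 2) - np.sum((beta_star - beta0) ** 2)
    print(n, d, round(float(b), 2), round(float(excess), 4))
```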

Application: Persistence

If $PY^2$ is bounded and the tails of $X$ decay nicely, then $\ell_1$-constrained least squares is persistent provided that $d_n$ is increasing and
$$b_n = o\!\left(\frac{\sqrt n}{\log^{3/2} n\,\log^{3/2}(n d_n)}\right).$$

Previous Results (Greenshtein and Ritov, 2004):
1. $b_n = \omega(n^{1/2}/\log^{1/2} n)$ implies empirical minimization is not persistent for Gaussian $(X, Y)$.
2. $b_n = o(n^{1/2}/\log^{1/2} n)$ implies empirical minimization is persistent for Gaussian $(X, Y)$.
3. $b_n = o(n^{1/4}/\log^{1/4} n)$ implies empirical minimization is persistent under tail conditions on $(X, Y)$.

Application: Convex Aggregation

Consider $b = 1$, so that the $\ell_1$-ball of radius $b$ is the convex hull of a dictionary of $d$ functions (the components of $X$). Tsybakov (2003) showed that, for any aggregation scheme $\hat\beta$, the rate of convex aggregation satisfies
$$P\ell_{\hat\beta} - P\ell_{\beta^*} = \Omega\!\left(\min\left\{\frac{d}{n},\; \sqrt{\frac{\log d}{n}}\right\}\right).$$

For bounded, isotropic distributions, our result implies that this rate can be achieved, up to log factors, by least squares over the convex hull of the dictionary. Previous positive results (Tsybakov, 2003; Bunea, Tsybakov and Wegkamp, 2006) involved complicated estimators.

Main Results: Oracle Inequality

For $\ell_1$-regularized least squares,
$$\hat\beta = \arg\min_\beta\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^{d_n}} \right),$$
if $d_n$ increases but $\log d_n = o(n)$, and $Y$ and $X$ are bounded a.s., then with probability at least $1 - o(1)$,
$$P\ell_{\hat\beta} \;\le\; \inf_\beta\left( P\ell_\beta + c\,\rho_n\left(1 + \|\beta\|_{\ell_1^{d_n}}\right) \right),
\qquad \text{where } \rho_n = c\,\frac{\log^{3/2} n\,\log^{1/2}(d_n n)}{\sqrt n}.$$
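
As a quick worked instance (my own arithmetic, using the expression for $\rho_n$ above), the penalty level vanishes for polynomially and even sub-exponentially growing dimension:
$$d_n = n^\gamma \;\Longrightarrow\; \rho_n = c\,\frac{\log^{3/2} n\,\sqrt{(\gamma+1)\log n}}{\sqrt n} = c\sqrt{\gamma+1}\;\frac{\log^{2} n}{\sqrt n}\;\to 0,
\qquad
d_n = e^{n^{a}},\ a<1 \;\Longrightarrow\; \rho_n \asymp \frac{\log^{3/2} n}{n^{(1-a)/2}} \to 0.$$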

Outline of Talk

1. For $\ell_1$-constrained least squares, bounds on $P\ell_{\hat\beta} - P\ell_{\beta^*}$.
   Persistence: For what $d_n$, $b_n$ does $P\ell_{\hat\beta} - P\ell_{\beta^*} \to 0$?
   Convex Aggregation: For $b = 1$ (convex combinations of dictionary functions), what is the rate of $P\ell_{\hat\beta} - P\ell_{\beta^*}$?
2. For $\ell_1$-regularized least squares, oracle inequalities.
3. Proof ideas.

Proof Ideas: 1. $\epsilon$-equivalence of $P$ and $P_n$ structures

Define
$$G_\lambda = \left\{ \frac{\lambda}{P(\ell_\beta - \ell_{\beta^*})}\,(\ell_\beta - \ell_{\beta^*}) \;:\; P(\ell_\beta - \ell_{\beta^*}) \ge \lambda \right\}.$$

Then: if $\mathbb{E}\sup_{g \in G_\lambda} |P_n g - P g|$ is small, then with high probability, for all $\beta$ with $P(\ell_\beta - \ell_{\beta^*}) \ge \lambda$,
$$(1 - \epsilon)\,P(\ell_\beta - \ell_{\beta^*}) \;\le\; P_n(\ell_\beta - \ell_{\beta^*}) \;\le\; (1 + \epsilon)\,P(\ell_\beta - \ell_{\beta^*}),$$
and hence $P(\ell_{\hat\beta} - \ell_{\beta^*}) \le \lambda$, where $\hat\beta = \arg\min_\beta P_n \ell_\beta$.
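
The final implication is a one-line argument (my own rendering of the standard step): $\hat\beta$ minimizes the empirical risk, so $P_n(\ell_{\hat\beta} - \ell_{\beta^*}) \le 0$; if we also had $P(\ell_{\hat\beta} - \ell_{\beta^*}) \ge \lambda$, the left-hand inequality would give
$$0 \;\ge\; P_n(\ell_{\hat\beta} - \ell_{\beta^*}) \;\ge\; (1-\epsilon)\,P(\ell_{\hat\beta} - \ell_{\beta^*}) \;\ge\; (1-\epsilon)\,\lambda \;>\; 0,$$
a contradiction (for $\epsilon < 1$ and $\lambda > 0$). Hence $P(\ell_{\hat\beta} - \ell_{\beta^*}) < \lambda$.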

Proof Ideas: 2. Symmetrization, subgaussian tails

For a class of functions $F$,
$$\mathbb{E}\sup_{f \in F} |P_n f - P f| \;\le\; \frac{2}{n}\,\mathbb{E}\sup_{f \in F} \left|\sum_i \epsilon_i f(X_i)\right|,$$
where the $\epsilon_i$ are independent Rademacher ($\pm 1$) random variables. (Giné and Zinn, 1984)

The Rademacher process
$$Z_\beta = \sum_i \epsilon_i\left(\ell_\beta(X_i, Y_i) - \ell_{\beta^*}(X_i, Y_i)\right)$$
indexed by $\beta$ in
$$T_\lambda = \left\{\beta \in \mathbb{R}^d : \|\beta\|_{\ell_1^d} \le b,\; P\ell_\beta - P\ell_{\beta^*} \le \lambda\right\}$$
is subgaussian w.r.t. a certain metric $d$. That is, for every $\beta_1, \beta_2 \in T_\lambda$ and $t \ge 1$,
$$\Pr\left(|Z_{\beta_1} - Z_{\beta_2}| \ge t\,d(\beta_1, \beta_2)\right) \le 2\exp(-t^2/2).$$
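
To make the symmetrization step concrete, here is a small Monte Carlo sketch (my own illustration, and for the plain linear class $\{x \mapsto \langle x, \beta\rangle : \|\beta\|_{\ell_1^d} \le b\}$ rather than the loss-difference class above). By $\ell_1$/$\ell_\infty$ duality, the supremum of the Rademacher sum over the $\ell_1$ ball is $b\,\max_j |\sum_i \epsilon_i X_{ij}|$, so the Rademacher average can be estimated directly.

```python
# Monte Carlo estimate of the (conditional) Rademacher average of the linear class
# F = { x -> <x, beta> : ||beta||_1 <= b }.  By l1/l_infty duality,
#   sup_{||beta||_1 <= b} | (1/n) sum_i eps_i <X_i, beta> | = (b/n) * max_j | sum_i eps_i X_ij |.
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 500, 2000, 1.0
X = rng.standard_normal((n, d))      # design held fixed; the expectation is over the signs eps_i

n_mc = 200
sups = np.empty(n_mc)
for m in range(n_mc):
    eps = rng.choice([-1.0, 1.0], size=n)          # independent Rademacher signs
    sups[m] = b / n * np.max(np.abs(eps @ X))      # sup over the l1 ball, via duality
rademacher_avg = sups.mean()

# Symmetrization: E sup_{f in F} |P_n f - P f| <= 2 * E[ this quantity ].
# For comparison, the finite-maximum bound gives roughly b * sqrt(2 log(2d) / n) here,
# since each column of X has Euclidean norm about sqrt(n).
print(rademacher_avg, b * np.sqrt(2 * np.log(2 * d) / n))
```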

Proof Ideas: 2. Symmetrization, subgaussian tails

The Rademacher process $Z_\beta = \sum_i \epsilon_i\left(\ell_\beta(X_i, Y_i) - \ell_{\beta^*}(X_i, Y_i)\right)$ indexed by $\beta$ in $T_\lambda = \{\beta \in \mathbb{R}^d : \|\beta\|_{\ell_1^d} \le b,\; P\ell_\beta - P\ell_{\beta^*} \le \lambda\}$ is subgaussian w.r.t. the metric
$$d(\beta_1, \beta_2) = 4 \max_i |\langle X_i, \beta_1 - \beta_2\rangle| \left( \sup_{v \in \sqrt\lambda D \,\cap\, 2T} \sum_i \langle X_i, v\rangle^2 + \sum_i \ell_{\beta^*}(X_i, Y_i) \right)^{1/2}.$$

This is a random $\ell_\infty$ distance, scaled by the random $\ell_2$ diameter of $\sqrt\lambda D \cap 2T$. Here, $D = \{x \in \mathbb{R}^d : P\langle X, x\rangle^2 \le 1\}$.

Proof Ideas: 3. Chaining

For a subgaussian process $\{Z_t\}$ indexed by a metric space $(T, d)$, and for $t_0 \in T$,
$$\mathbb{E}\sup_{t \in T} |Z_t - Z_{t_0}| \;\le\; c\,D(T, d) = c \int_0^{\mathrm{diam}(T,d)} \sqrt{\log N(\epsilon, T, d)}\, d\epsilon,$$
where $N(\epsilon, T, d)$ is the $\epsilon$-covering number of $T$.
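
As a sanity check of the entropy-integral bound (a standard special case, not from the talk): for a finite index set with $|T| = N$, the covering numbers satisfy $N(\epsilon, T, d) \le N$ for every $\epsilon$, so
$$\mathbb{E}\sup_{t \in T} |Z_t - Z_{t_0}| \;\le\; c\int_0^{\mathrm{diam}(T,d)} \sqrt{\log N(\epsilon, T, d)}\,d\epsilon \;\le\; c\,\mathrm{diam}(T,d)\,\sqrt{\log N},$$
recovering the usual maximal inequality for finitely many subgaussian increments.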

Proof Ideas: 4. Bounding the Entropy Integral

It suffices to calculate the entropy integral $D(\sqrt\lambda D \cap 2b B_1^d,\, d)$ (here $B_1^d$ denotes the unit ball of $\ell_1^d$). We can approximate this by
$$D(\sqrt\lambda D \cap 2b B_1^d,\, d) \;\le\; \min\left\{ D(\sqrt\lambda D,\, d),\; D(2b B_1^d,\, d) \right\}.$$

This leads to:
$$P\ell_{\hat\beta} - P\ell_{\beta^*} \;\le\; c\,\frac{\log^\alpha(nd)}{\delta^2}\,\min\left\{\frac{b^2}{n} + \frac{d}{n},\;\; \frac{b}{\sqrt n}\left(1 + \frac{b}{\sqrt n}\right)\right\}.$$

Proof Ideas: 5. Oracle Inequalities

We get an isomorphic condition on $\{\ell_\beta - \ell_{\beta^*}\}$:
$$\tfrac{1}{2}\, P_n(\ell_\beta - \ell_{\beta^*}) - \epsilon_n \;\le\; P(\ell_\beta - \ell_{\beta^*}) \;\le\; 2\, P_n(\ell_\beta - \ell_{\beta^*}) + \epsilon_n,$$
and this implies that $\hat\beta = \arg\min_\beta \left(P_n \ell_\beta + c\,\epsilon_n\right)$ has
$$P\ell_{\hat\beta} \;\le\; \inf_\beta \left( P\ell_\beta + c\,\epsilon_n \right).$$

This leads to the oracle inequality: for $\ell_1$-regularized least squares,
$$\hat\beta = \arg\min_\beta\left( P_n \ell_\beta + \rho_n \|\beta\|_{\ell_1^{d_n}} \right),$$
with probability at least $1 - o(1)$,
$$P\ell_{\hat\beta} \;\le\; \inf_\beta\left( P\ell_\beta + c\,\rho_n\left(1 + \|\beta\|_{\ell_1^{d_n}}\right) \right).$$

Outline of Talk

1. For $\ell_1$-constrained least squares,
$$P\ell_{\hat\beta} - P\ell_{\beta^*} \;\le\; c\,\frac{\log^\alpha(nd)}{\delta^2}\,\min\left\{\frac{b^2}{n} + \frac{d}{n},\;\; \frac{b}{\sqrt n}\left(1 + \frac{b}{\sqrt n}\right)\right\}.$$
   Persistence: If $b_n = \tilde o(\sqrt n)$, then $P\ell_{\hat\beta} - P\ell_{\beta^*} \to 0$.
   Convex Aggregation: Empirical risk minimization gives the optimal rate (up to log factors): $\tilde O\!\left(\min\left\{d/n,\, \sqrt{\log d / n}\right\}\right)$.
2. For $\ell_1$-regularized least squares, oracle inequalities.
3. Proof ideas: subgaussian Rademacher process.