A talk on Oracle inequalities and regularization, by Sara van de Geer
1 A talk on Oracle inequalities and regularization, by Sara van de Geer.
Workshop "Regularization in Statistics", Banff International Research Station, September 6-11, 2003.
2 Aim: to compare the $\ell_1$ penalty with other penalties.
- Explain the bias-variance paradigm
- Consider penalized least squares in general
- Study the $\ell_1$ penalty in robust regression
- Study classification problems, $\ell_1$ penalties, and an $\ell_{1/2}$ penalty
3 1. Regression model.
$\epsilon_1, \dots, \epsilon_n$ independent $\mathcal{N}(0, \sigma^2)$, $x_i \in \mathcal{X}$, $i = 1, \dots, n$.
$f_0 : [0,1] \to \mathbb{R}$ unknown function.
$$Y_i = f_0(x_i) + \epsilon_i, \quad i = 1, \dots, n.$$
4 An example. $x_i = i/n$, $i = 1, \dots, n$. Estimator:
$$\hat f_n = \arg\min_f \Big\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \lambda^2 \int_0^1 |f'(x)|^2\, dx \Big\}.$$
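A minimal numerical sketch of this criterion (not part of the original slides; the grid discretization, noise level, and choice of $\lambda$ are assumptions for illustration). Replacing the derivative by a first-order difference turns the criterion into a ridge-type quadratic solved by a linear system:

```python
import numpy as np

def smoothing_estimator(y, lam):
    """Minimize (1/n) sum (y_i - f_i)^2 + lam^2 * (1/n) sum ((Df)_i)^2,
    a discrete analogue of the roughness-penalized least squares criterion."""
    n = len(y)
    # Difference matrix D: (Df)_i = n * (f_{i+1} - f_i) approximates f'(x_i).
    D = n * (np.eye(n - 1, n, k=1) - np.eye(n - 1, n))
    # Setting the gradient to zero gives (I + lam^2 D^T D) f = y.
    return np.linalg.solve(np.eye(n) + lam**2 * (D.T @ D), y)

rng = np.random.default_rng(0)
n = 200
x = np.arange(1, n + 1) / n
f0 = np.sin(2 * np.pi * x)                 # hypothetical true function
y = f0 + 0.1 * rng.standard_normal(n)      # noisy observations
fhat = smoothing_estimator(y, lam=0.05)
print(np.mean((fhat - f0) ** 2))           # empirical squared error
```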
5 [Figure: true $f$, and the same signal with noise added, noise level = 0.01.]
6 [Figure: denoised signals. Left: lambda = 0.2. Right: lambda = 0.1, fit = 3.4322e-04.]
7 [Figure: denoised signals. Left: lambda = 0.1. Right: lambda = 0.05, error = 7.8683e-05.]
8 Intermezzo: Continuous version:
$$\hat f = \arg\min \Big\{ \int_0^1 |y(x) - f(x)|^2\, dx + \lambda^2 \int_0^1 |f'(x)|^2\, dx \Big\}.$$
Lemma 1.1. Solution:
$$\hat f(x) = \frac{C}{\lambda} \cosh\Big(\frac{x}{\lambda}\Big) + \frac{1}{\lambda} \int_0^x y(u) \sinh\Big(\frac{u - x}{\lambda}\Big)\, du,$$
where
$$C = \Big\{ Y(1) - \frac{1}{\lambda} \int_0^1 Y(u) \sinh\Big(\frac{1 - u}{\lambda}\Big)\, du \Big\} \Big/ \sinh\Big(\frac{1}{\lambda}\Big),$$
with
$$Y(x) = \int_0^x y(u)\, du.$$
9 Choice of the regularization parameter $\lambda$?
10 Mimic the oracle: choose $\lambda$ as if the unknown $f_0$ were known. Our choice will be
$$\hat f_n = \arg\min_f \min_\lambda \Big\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \lambda^2 \int_0^1 |f'(x)|^2\, dx + \frac{c}{n\lambda} \Big\}.$$
See later.
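One way to make this concrete (an illustration, not from the slides; the constant $c$ and the grid of candidate values are assumptions): for each $\lambda$ on a grid, solve the inner penalized least squares problem, add the complexity term $c/(n\lambda)$, and keep the minimizer. A sketch reusing `smoothing_estimator` from the previous example:

```python
import numpy as np

def oracle_lambda(y, lams, c=1.0):
    """Pick lambda minimizing penalized fit + c/(n*lambda), as on slide 10."""
    n = len(y)
    D = n * (np.eye(n - 1, n, k=1) - np.eye(n - 1, n))   # discrete derivative
    best = None
    for lam in lams:
        f = smoothing_estimator(y, lam)                  # from the previous sketch
        crit = (np.mean((y - f) ** 2)
                + lam**2 * np.sum((D @ f) ** 2) / n      # approximates integral |f'|^2
                + c / (n * lam))
        if best is None or crit < best[0]:
            best = (crit, lam, f)
    return best[1], best[2]

# lam_hat, f_hat = oracle_lambda(y, np.logspace(-3, 0, 30))
```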
11 2. The sequence space model.
$$Y_j = \theta_j + \epsilon_j, \quad j = 1, \dots, n,$$
with $\epsilon_1, \dots, \epsilon_n$ independent $\mathcal{N}(0, \sigma^2/n)$. Define
$$\|\vartheta\|_n^2 = \sum_{j=1}^n \vartheta_j^2, \quad \vartheta \in \mathbb{R}^n.$$
13 Let $J \subset \{1, \dots, n\}$. Define
$$\hat\theta_j(J) = \begin{cases} Y_j & \text{if } j \in J \\ 0 & \text{if } j \notin J. \end{cases}$$
Then
$$E\|\hat\theta(J) - \theta\|_n^2 = \sum_{j \notin J} \theta_j^2 + |J| \frac{\sigma^2}{n} = \text{bias}^2 + \text{variance} = \text{approximation error} + \text{estimation error}.$$
The oracle $\theta^{\text{oracle}}$ uses the optimal tradeoff between bias$^2$ and variance:
$$\|\theta^{\text{oracle}} - \theta\|_n^2 = \min_{J \subset \{1, \dots, n\}} \Big\{ \sum_{j \notin J} \theta_j^2 + |J| \frac{\sigma^2}{n} \Big\}.$$
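The minimum over all $2^n$ subsets is easy to compute: coordinate $j$ belongs to the optimal $J$ exactly when keeping it (cost $\sigma^2/n$) is cheaper than dropping it (cost $\theta_j^2$). A small sketch (illustration only, with an arbitrary $\theta$):

```python
import numpy as np

def oracle_risk(theta, sigma, n):
    """Oracle risk min_J { sum_{j not in J} theta_j^2 + |J| sigma^2 / n }.
    Coordinate j is kept iff theta_j^2 > sigma^2 / n."""
    keep = theta**2 > sigma**2 / n
    return np.sum(theta[~keep] ** 2) + keep.sum() * sigma**2 / n

theta = np.array([2.0, 0.5, 0.1, 0.01])
print(oracle_risk(theta, sigma=1.0, n=100))   # bias^2 + variance at the optimal J
```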
14 3. Hard and soft thresholding.
Let $\lambda = \sqrt{2\sigma^2 \log n / n}$ be the threshold (= regularization parameter).
Hard thresholding:
$$\hat\theta_j(\text{hard}) = \begin{cases} Y_j & \text{if } |Y_j| > \lambda \\ 0 & \text{if } |Y_j| \le \lambda, \end{cases} \quad j = 1, \dots, n.$$
Thus $\hat\theta(\text{hard})$ minimizes
$$\sum_{j=1}^n (Y_j - \vartheta_j)^2 + \lambda^2 \#\{j : \vartheta_j \ne 0\} = \sum_{j=1}^n (Y_j - \vartheta_j)^2 + \text{pen}(\vartheta),$$
where
$$\text{pen}(\vartheta) = \lambda^2 \#\{j : \vartheta_j \ne 0\} = 2 \log n \, |J(\vartheta)| \frac{\sigma^2}{n}.$$
Here $J(\vartheta) = \{j : \vartheta_j \ne 0\}$. So the penalty is, up to log-terms, equal to the variance.
15 Soft thresholding:
$$\hat\theta_j(\text{soft}) = \begin{cases} Y_j - \lambda & \text{if } Y_j > \lambda \\ Y_j + \lambda & \text{if } Y_j < -\lambda \\ 0 & \text{if } |Y_j| \le \lambda, \end{cases} \quad j = 1, \dots, n.$$
Thus $\hat\theta(\text{soft})$ minimizes
$$\sum_{j=1}^n (Y_j - \vartheta_j)^2 + 2\lambda \sum_{j=1}^n |\vartheta_j| = \sum_{j=1}^n (Y_j - \vartheta_j)^2 + \text{pen}(\vartheta),$$
where
$$\text{pen}(\vartheta) = 2\lambda \sum_{j=1}^n |\vartheta_j| = 2\sqrt{\frac{2\sigma^2 \log n}{n}} \sum_{j=1}^n |\vartheta_j|.$$
This penalty is generally much larger than the variance!
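Both characterizations can be checked numerically, since the criteria separate across coordinates. A sketch (mine, not from the slides): hard thresholding minimizes the $\ell_0$-penalized criterion, soft thresholding the $\ell_1$-penalized one.

```python
import numpy as np

def hard_threshold(y, lam):
    return np.where(np.abs(y) > lam, y, 0.0)

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Coordinatewise check: each rule minimizes its penalized criterion over a fine grid.
rng = np.random.default_rng(1)
y, lam = rng.standard_normal(5), 0.8
grid = np.linspace(-5, 5, 20001)          # includes 0 exactly
for j in range(5):
    l0 = grid[np.argmin((y[j] - grid) ** 2 + lam**2 * (grid != 0))]
    l1 = grid[np.argmin((y[j] - grid) ** 2 + 2 * lam * np.abs(grid))]
    assert abs(l0 - hard_threshold(y[j:j+1], lam)[0]) < 1e-3
    assert abs(l1 - soft_threshold(y[j:j+1], lam)[0]) < 1e-3
```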
16 The estimators $\hat\theta(\text{hard})$ and $\hat\theta(\text{soft})$ have similar oracle properties.
Lemma. Let $\hat\theta \in \{\hat\theta(\text{hard}), \hat\theta(\text{soft})\}$. We have
$$E\|\hat\theta - \theta\|_n^2 \le c \min_\vartheta \{\text{blue}(\vartheta) + \text{red}(\vartheta)\},$$
where
$$\text{blue}(\vartheta) = \|\vartheta - \theta\|_n^2 = \text{approximation error} = \text{bias}^2,$$
and
$$\text{red}(\vartheta) = \log n \, |J(\vartheta)| \frac{\sigma^2}{n} = \text{estimation error} \ (\approx \text{variance}).$$
17 4. Discretization.
$Y_i = f_0(x_i) + \epsilon_i$, $i = 1, \dots, n$, with $\epsilon_1, \dots, \epsilon_n$ independent $\mathcal{N}(0,1)$, and $x_i \in \mathcal{X}$, $Y_i \in \mathbb{R}$, $i = 1, \dots, n$.
Let $F$ be a finite collection of functions, and
$$\hat f = \arg\min_{f \in F} \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2.$$
Define
$$\|f^* - f_0\|_n = \min_{f \in F} \|f - f_0\|_n.$$
Lemma 4.1. We have
$$E\|\hat f - f_0\|_n^2 \le 2\|f^* - f_0\|_n^2 + c\, \frac{\log |F|}{n} + \frac{c}{n} = \text{approximation error} + \text{estimation error}.$$
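A direct toy illustration (mine; the finite family of sinusoids and all constants are assumptions): least squares over a finite collection $F$ is just an argmin over candidates.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = np.arange(1, n + 1) / n
f0 = np.sin(2 * np.pi * x)
y = f0 + rng.standard_normal(n)              # sigma = 1 as on the slide

# Finite collection F: sinusoids at a few amplitudes and frequencies.
F = [lambda x, a=a, k=k: a * np.sin(2 * np.pi * k * x)
     for a in (0.5, 1.0, 2.0) for k in (1, 2, 3)]
fhat = min(F, key=lambda f: np.mean((y - f(x)) ** 2))
print(np.mean((fhat(x) - f0) ** 2))          # ||fhat - f0||_n^2
```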
18 Let $\{F_m\}_{m \in M}$ be a collection of increasing nested finite models, and let $F = \cup_{m \in M} F_m$. Define
$$\text{pen}(f) = \min_{m:\, f \in F_m} c\, \frac{\log |F_m|}{n}.$$
Let
$$\hat f = \arg\min_{f \in F} \Big\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \text{pen}(f) \Big\},$$
and
$$f^* = \arg\min_{f \in F} \big\{ \|f - f_0\|_n^2 + \text{pen}(f) \big\} = \arg\min_{f \in F} \{\text{blue}(f) + \text{red}(f)\}.$$
Lemma 4.2. We have
$$E[\|\hat f - f_0\|_n^2 + \text{pen}(\hat f)] \le 2[\text{blue}(f^*) + \text{red}(f^*)] + \frac{c}{n}.$$
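Continuing the toy example above (again an illustration with arbitrary constants): with nested families $F_1 \subset F_2 \subset \dots$, the penalized estimator charges each candidate the complexity of the smallest model containing it.

```python
import numpy as np

def select(y, x, models, c=2.0):
    """Penalized least squares over nested models (slide 18).
    `models` is a list of lists of functions, models[0] ⊂ models[1] ⊂ ...
    Iterating in increasing order, the first model containing f gives its
    minimal penalty; re-visits in larger models carry a larger penalty and
    never win, so the loop computes pen(f) = min_{m: f in F_m} c log|F_m|/n."""
    n = len(y)
    best = None
    for Fm in models:
        pen = c * np.log(len(Fm)) / n
        for f in Fm:
            crit = np.mean((y - f(x)) ** 2) + pen
            if best is None or crit < best[0]:
                best = (crit, f)
    return best[1]

# e.g. with the sinusoid family F from the previous sketch:
# fhat = select(y, x, [F[:3], F[:6], F])
```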
19 5. General penalties.
$Y_i = f_0(x_i) + \epsilon_i$, $i = 1, \dots, n$, with $\epsilon_1, \dots, \epsilon_n$ independent $\mathcal{N}(0,1)$, and $x_i \in \mathcal{X}$, $Y_i \in \mathbb{R}$, $i = 1, \dots, n$. Let
$$\hat f = \arg\min_{f \in F} \Big\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \text{pen}(f) \Big\},$$
and let the oracle be
$$f^* = \arg\min_{f \in F} \big\{ \|f - f_0\|_n^2 + \text{pen}(f) \big\}.$$
Definition 5.1. The $\delta$-entropy $H(\delta, F, Q_n)$ is the logarithm of the minimum number of balls with radius $\delta$ (in the $L_2(Q_n)$ metric) necessary to cover $F$.
20 Lemma 5.2. For $n\delta_n^2 \ge c$ satisfying
$$\int_0^{\delta_n} H^{1/2}\big(u, \{f : \|f - f^*\|_n^2 + \text{pen}(f) \le \delta_n^2\}, Q_n\big)\, du \le \sqrt{n}\, \delta_n^2,$$
we have
$$E[\|\hat f - f_0\|_n^2 + \text{pen}(\hat f)] \le 2[\|f^* - f_0\|_n^2 + \text{pen}(f^*) + \delta_n^2] + \frac{c}{n}.$$
21 Example: Sobolev penalties
$$I_s^2(f) = \int_0^1 |f^{(s)}(x)|^2\, dx.$$
a) Penalty on $I_s$ with $s$ fixed: $\text{pen}(f) = \lambda^2 I_s^2(f)$. We find
$$E[\|\hat f - f_0\|_n^2 + \lambda^2 I_s^2(\hat f)] \le 2[\|f^* - f_0\|_n^2 + \lambda^2 I_s^2(f^*)] + \frac{c}{n\lambda^{1/s}} + \frac{c}{n}.$$
b) Penalty on $I_s$ with $s$ fixed, and on $\lambda$:
$$\text{pen}(f) = \inf_{0 < \lambda < \infty} \Big\{ \lambda^2 I_s^2(f) + \frac{c}{n\lambda^{1/s}} \Big\} = \text{red}(f).$$
Then
$$E\|\hat f - f_0\|_n^2 \le 2\big[\|f^* - f_0\|_n^2 + c\, n^{-\frac{2s}{2s+1}} I_s^{\frac{2}{2s+1}}(f^*)\big] + \frac{c \log n}{n} = 2[\text{blue}(f^*) + \text{red}(f^*)] + \frac{c \log n}{n}.$$
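The trade-off in b) can be made concrete (a sketch of mine; constants are arbitrary): minimizing $\lambda^2 I^2 + c/(n\lambda^{1/s})$ over $\lambda$ yields, up to constants, $n^{-2s/(2s+1)} I^{2/(2s+1)}$, the familiar smoothing-spline rate.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def red(I, n, s, c=1.0):
    """inf over lambda of lambda^2 I^2 + c / (n * lambda^(1/s)),
    minimized over log(lambda) for numerical stability."""
    obj = lambda loglam: np.exp(2 * loglam) * I**2 + c / (n * np.exp(loglam / s))
    return minimize_scalar(obj, bounds=(-20, 20), method="bounded").fun

n, s, I = 10_000, 2, 3.0
print(red(I, n, s))
print(n ** (-2 * s / (2 * s + 1)) * I ** (2 / (2 * s + 1)))  # same order of magnitude
```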
22 c) Penalty on $I_s$ and on $s$:
$$\text{pen}(f) = \min_{1 \le s \le s_{\max}} \Big\{ \lambda^2 I_s^2(f) + \frac{c_0 s_{\max}^3}{n\lambda^{1/s}} \Big\}.$$
d) Penalty on $I_s$, on $s$, and on $\lambda$:
$$\text{pen}(f) = \inf_{0 < \lambda < \infty} \min_{1 \le s \le s_{\max}} \Big\{ \lambda^2 I_s^2(f) + \frac{c_0 s_{\max}^3}{n\lambda^{1/s}} \Big\}.$$
23 6. Robust regression.
Let $Y_i$ depend on some covariable $x_i$, $i = 1, \dots, n$. Assume $Y_1, \dots, Y_n$ are independent. Let $\gamma : \mathbb{R} \to \mathbb{R}$ be a convex loss function satisfying the Lipschitz condition
$$|\gamma(\tilde y) - \gamma(y)| \le |\tilde y - y|, \quad \tilde y, y \in \mathbb{R}.$$
Consider
$$\hat f_n = \arg\min_f \Big\{ \frac{1}{n} \sum_{i=1}^n \gamma(Y_i - f(x_i)) + \text{pen}(f) \Big\}.$$
Examples:
- Least absolute deviations: $\gamma(y) = |y|$.
- Quantile loss: $\gamma(y) = \tau |y|\, l\{y < 0\} + (1 - \tau)|y|\, l\{y > 0\}$, $y \in \mathbb{R}$, where $0 < \tau < 1$ is fixed.
- The Huber loss function.
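For concreteness, a sketch of the three losses just listed (not part of the slides; the tuning values are arbitrary):

```python
import numpy as np

def lad(y):                        # least absolute deviations, 1-Lipschitz
    return np.abs(y)

def quantile(y, tau=0.3):          # check loss; Lipschitz with constant max(tau, 1-tau) <= 1
    return tau * np.abs(y) * (y < 0) + (1 - tau) * np.abs(y) * (y > 0)

def huber(y, k=1.345):             # quadratic near 0, linear in the tails;
    a = np.abs(y)                  # Lipschitz constant k, so rescale by 1/k
    return np.where(a <= k, 0.5 * y**2, k * (a - 0.5 * k))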
24 The true regression function: $f_0 = \arg\min \Gamma(f)$, where
$$\Gamma(f) = \frac{1}{n} \sum_{i=1}^n E\gamma(Y_i - f(x_i)).$$
25 6.1. Standard identifiability condition.
Suppose that for some (unknown) $\sigma > 0$,
$$\Gamma(f) - \Gamma(f_0) \ge \|f - f_0\|^2 / \sigma.$$
Then one can prove results similar to those for the penalized least squares estimator.
6.2. More general margin condition.
Suppose that for some unknown $\kappa \ge 1$ and $\sigma > 0$,
$$\Gamma(f) - \Gamma(f_0) \ge \|f - f_0\|^{2\kappa} / \sigma.$$
The estimation error then depends on the unknown $\kappa$!
26 6.4. Oracle.
Let
$$f = \sum_{j=1}^n \vartheta_j \psi_j, \qquad J_f = \{j : \vartheta_j \ne 0\},$$
and
$$f^* = \arg\min_f \{\text{blue}(f) + \text{red}(f)\},$$
where
$$\text{blue}(f) = \Gamma(f) - \Gamma(f_0), \qquad \text{red}(f) = \Big[ \frac{\log n}{n} |J_f| \Big]^{\frac{\kappa}{2\kappa - 1}}.$$
27 6.5. $\ell_1$ penalty.
$$\hat f_n = \arg\min_f \Big\{ \frac{1}{n} \sum_{i=1}^n \gamma(Y_i - f(x_i)) + \text{pen}(f) \Big\},$$
with
$$\text{pen}(f) = c \sqrt{\frac{\log n}{n}} \sum_{j=1}^n |\vartheta_j|.$$
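A minimal numerical sketch of this estimator (my illustration; the basis, the constant $c$, and the subgradient solver are assumptions, not from the talk), for least absolute deviations with an $\ell_1$ penalty on the coefficients:

```python
import numpy as np

def l1_lad(Psi, y, lam, steps=5000, lr=0.01):
    """Minimize (1/n) sum |y_i - (Psi theta)_i| + lam * ||theta||_1
    by subgradient descent (simple, not the fastest choice)."""
    n, p = Psi.shape
    theta = np.zeros(p)
    for t in range(1, steps + 1):
        r = y - Psi @ theta
        grad = -(Psi.T @ np.sign(r)) / n + lam * np.sign(theta)
        theta -= lr / np.sqrt(t) * grad        # diminishing step size
    return theta

rng = np.random.default_rng(3)
n, p = 200, 20
Psi = rng.standard_normal((n, p))              # basis functions psi_j evaluated at x_i
theta0 = np.zeros(p); theta0[:3] = [2.0, -1.0, 0.5]
y = Psi @ theta0 + rng.standard_normal(n)
lam = 2.0 * np.sqrt(np.log(n) / n)             # c * sqrt(log n / n), c = 2 arbitrary
print(np.round(l1_lad(Psi, y, lam)[:5], 2))    # first coefficients roughly recovered
```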
28 Oracle inequality. One has
$$E(\Gamma(\hat f_n) - \Gamma(f_0)) \le C \{\text{blue}(f^*) + \text{red}(f^*)\} = C \Big\{ \Gamma(f^*) - \Gamma(f_0) + \Big[ \frac{\log n}{n} |J_{f^*}| \Big]^{\frac{\kappa}{2\kappa - 1}} \Big\}.$$
In other words, $\hat f_n$ adapts to the smoothness of $f_0$ as well as to $\kappa$.
29 Typical example. Suppose $\|f^* - f_0\| \asymp |J_{f^*}|^{-s}$. Then
$$\Gamma(\hat f_n) - \Gamma(f_0) \sim \Big( \frac{\log n}{n} \Big)^{\frac{2\kappa s}{4\kappa s - 2s + 1}}.$$
30 [Figure: simulation results comparing the FY3PQ and IRLS algorithms (SMSE, MAD, and computation times; FY3PQ: 3.56 s).]
31 Conclusion: the advantage of the $\ell_1$ penalty over (nonrandom) penalties based on bias-variance considerations is that it is not only adaptive to the smoothness, but also adaptive to the margin.
32 The classification problem.
1. Introduction.
$Y \in \{0, 1\}$ binary response variable, $X \in \mathcal{X}$ covariable. Aim: predict $Y$ given $X$.
Examples:
- Recognition of speech or handwriting
- Classifying an object in an image
- Classification of gene expression levels
- Etc.
Training set: $n$ i.i.d. copies $(X_i, Y_i)_{i=1}^n$ of $(X, Y)$.
33 [Figure: learning data with training errors, and testing data with testing errors.]
34 2. Bayes classifier.
When using $G$ as classifier (with $l_G$ the indicator function of $G$), the PREDICTION ERROR is
$$R(G) = P(Y \ne l_G(X)).$$
BAYES RULE: $G_0 = \arg\min_G R(G)$, where the minimum is over all sets $G \subset \mathcal{X}$. Thus
$$G_0 = \{x : \eta(x) \ge 1/2\},$$
where $\eta(x)$ is the regression of $Y$ on $X = x$:
$$\eta(x) = P(Y = 1 \mid X = x).$$
So the Bayes rule predicts the most likely label.
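A toy illustration (mine, with an arbitrary $\eta$): the Bayes rule thresholds the regression function at $1/2$, and its error matches $E[\min(\eta, 1-\eta)]$.

```python
import numpy as np

rng = np.random.default_rng(4)
eta = lambda x: np.sin(3 * x) ** 2                   # hypothetical P(Y=1 | X=x)

X = rng.uniform(0, 1, 100_000)
Y = (rng.uniform(size=X.size) < eta(X)).astype(int)  # labels drawn from eta

bayes = (eta(X) >= 0.5).astype(int)                  # G_0 = {x : eta(x) >= 1/2}
print("Bayes error:", np.mean(Y != bayes))
print("theory:     ", np.mean(np.minimum(eta(X), 1 - eta(X))))
```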
35 [Figure: $\eta$ and the level $1/2$; the Bayes rule $G_0$ is the indicated union of intervals where $\eta \ge 1/2$.]
36 3. Empirical risk minimization.
Let $\mathcal{G}$ be a collection of sets. The EMPIRICAL RISK MINIMIZER is
$$\hat G_n = \arg\min_{G \in \mathcal{G}} R_n(G),$$
where
$$R_n(G) = \frac{1}{n} \sum_{i=1}^n |Y_i - l_G(X_i)|$$
is the EMPIRICAL RISK.
38 5. Mammen & Tsybakov margin condition.
Let
$$G \triangle G_0 = (G \setminus G_0) \cup (G_0 \setminus G)$$
be the symmetric difference between the two sets.
[Figure: the symmetric difference $G \triangle G_0$ of $G$ and $G_0$.]
39 The EXCESS RISK (= approximation error) at $G$ is $R(G) - R(G_0)$.
Margin condition (Mammen and Tsybakov (1999), Tsybakov (2003)). For some (unknown) constants $\sigma > 0$ and $\kappa \ge 1$,
$$R(G) - R(G_0) \ge \sigma Q^\kappa(G \triangle G_0)$$
for all sets $G \subset \mathcal{X}$, where $Q$ denotes the distribution of $X$.
40 6. Boundary fragments.
We assume $\mathcal{X} = [0,1]^{d+1}$ and write $X = (S, T)$, with $S \in [0,1]^d$, $T \in [0,1]$. For a function $f : [0,1]^d \to [0,1]$, we define the boundary fragment
$$G_f = \{x = (s, t) : t \le f(s)\}.$$
41 [Figure: the symmetric difference $G_f \triangle G_{\tilde f}$ for the boundary fragments $G_f$ and $G_{\tilde f}$ formed by the subgraphs of $f$ and $\tilde f$.]
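A small sketch (my addition, under the assumption that $X$ is uniform on $[0,1]^{d+1}$): the measure of the symmetric difference of two boundary fragments is the $L_1$ distance between the edge functions, which a Monte Carlo check illustrates.

```python
import numpy as np

rng = np.random.default_rng(5)
f  = lambda s: 0.5 + 0.3 * np.sin(2 * np.pi * s)        # hypothetical edge functions
ft = lambda s: 0.5 + 0.1 * np.cos(2 * np.pi * s)

S, T = rng.uniform(size=(2, 1_000_000))                 # X = (S, T) uniform, d = 1
in_f, in_ft = T <= f(S), T <= ft(S)
print("Q(G_f symdiff G_ft):", np.mean(in_f ^ in_ft))    # symmetric difference mass
print("integral |f - ft|:  ", np.mean(np.abs(f(S) - ft(S))))  # agrees in the limit
```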
42 7. Oracle.
Define the oracle
$$G^* = \arg\min_{G \in \mathcal{G}} [\text{red}(G) + \text{blue}(G)],$$
where
$$\text{red}(G) = \text{estimation error}, \qquad \text{blue}(G) = R(G) - R(G_0) = \text{approximation error}.$$
Thus $G^*$ gives the best trade-off between estimation error and approximation error.
43 8. Oracle inequality.
Let $f_\vartheta = \sum_j \vartheta_j \psi_j$, and
$$\mathcal{G} = \{\text{boundary fragments } G_{f_\vartheta} : \vartheta \in \mathbb{R}^n\}.$$
We take a square root penalty (or $\ell_{1/2}$ penalty)
$$\text{pen}(G_{f_\vartheta}) = \lambda_n \sum_l 2^{dl/2} \Big( \sum_j |\vartheta_{j,l}| \Big)^{1/2}.$$
Theorem (Tsybakov and van de Geer (2003)). Let
$$\hat G_n = \arg\min_{G \in \mathcal{G}} \{R_n(G) + \text{pen}(G)\},$$
where
$$\lambda_n = c \sqrt{\frac{\log^4 n}{n}},$$
and where $c$ is a (large enough) universal constant. Then
$$P\big( R(\hat G_n) - R(G_0) \ge c_{\sigma,\kappa} [\text{red}(G^*) + \text{blue}(G^*)] \big) \le \exp[-c_q \log^4 n].$$
44 Conclusion. The estimator with the square root penalty adapts, up to log-factors, to the smoothness as well as to the margin... but we cannot compute it!
45 9. Surrogate loss functions.
We now code the label as $Y \in \{\pm 1\}$. Let
$$f_\vartheta(x) = \sum_{j=1}^N \vartheta_j \psi_j(x).$$
Introduce a margin $0 \le \lambda < 1$. We call $(X, Y)$ well classified by $f_\vartheta$ if
$$Y f_\vartheta(X) \ge \lambda \|\vartheta\|_1.$$
Here $\|\vartheta\|_1 = \sum_{j=1}^N |\vartheta_j|$.
46 Define
$$R_n(f_\vartheta, \lambda) = \frac{\#\{i : Y_i f_\vartheta(X_i) < \lambda \|\vartheta\|_1\}}{n}.$$
Then by Chebyshev's inequality, for any non-negative, increasing function $\phi$,
$$R_n(f_\vartheta, \lambda) \le \frac{L_n(f_\vartheta)}{\phi(1 - \lambda \|\vartheta\|_1)},$$
where
$$L_n(f) = \frac{1}{n} \sum_{i=1}^n \phi(1 - Y_i f(X_i)).$$
This leads to minimizing the penalized loss function $L_n(f_\vartheta)\, \text{pen}(\vartheta)$, where
$$\text{pen}(\vartheta) = 1 / \phi(1 - \lambda \|\vartheta\|_1).$$
47 Example: support vector machine loss.
Adaptive estimation: minimize
$$\frac{1}{n} \sum_{i=1}^n (1 - Y_i f_\vartheta(X_i))_+ + \lambda \|\vartheta\|_1,$$
with $\lambda = c\sqrt{\log n / n}$.
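A minimal sketch of this $\ell_1$-penalized hinge loss minimization (my illustration; the data, the constant $c$, and the subgradient solver are assumptions, not from the talk):

```python
import numpy as np

def l1_svm(Psi, y, lam, steps=5000, lr=0.1):
    """Minimize (1/n) sum (1 - y_i f_theta(x_i))_+ + lam * ||theta||_1
    by subgradient descent, with f_theta = Psi @ theta."""
    n, p = Psi.shape
    theta = np.zeros(p)
    for t in range(1, steps + 1):
        margin = y * (Psi @ theta)
        active = (margin < 1).astype(float)          # where the hinge is sloped
        grad = -(Psi.T @ (active * y)) / n + lam * np.sign(theta)
        theta -= lr / np.sqrt(t) * grad
    return theta

rng = np.random.default_rng(6)
n, p = 300, 10
Psi = rng.standard_normal((n, p))                    # basis psi_j evaluated at X_i
y = np.sign(Psi[:, 0] - 0.5 * Psi[:, 1] + 0.3 * rng.standard_normal(n))
theta_hat = l1_svm(Psi, y, lam=2.0 * np.sqrt(np.log(n) / n))
print("training error:", np.mean(np.sign(Psi @ theta_hat) != y))
```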
48 Conclusion. Adaptation using empirical risk minimization leads to $\ell_{1/2}$ penalties, and is computationally very hard. In combination with surrogate loss functions, $\ell_1$ penalties are natural, and also computationally simple. But the resulting oracle inequalities generally do not yield fast optimal rates for the excess risk.
49 Some references
D.L. Donoho and I.M. Johnstone (1996). New minimax theorems, thresholding and adaptation. Bernoulli.
L. Birgé and P. Massart (1997). From model selection to adaptive estimation. In: Festschrift for Lucien Le Cam: Research Papers in Probab. and Statist. (Eds. D. Pollard, E. Torgersen and G. Yang), 55-87, Springer, New York.
G. Lugosi and A. Nobel (1999). Adaptive model selection using empirical complexities. Ann. Statist.
E. Mammen and A.B. Tsybakov (1999). Smooth discriminant analysis. Ann. Statist.
S. van de Geer (2001). Least squares estimation with complexity penalties. Mathematical Methods of Statistics.
S. van de Geer (2002). M-estimation using penalties or sieves. Journal of Statistical Planning and Inference.
50 J.-M. Loubes and S. van de Geer (2002). Adaptive estimation in regression, using soft thresholding type penalties. Statistica Neerlandica.
A.B. Tsybakov (2003). Optimal aggregation of classifiers in statistical learning. To appear in Ann. Statist.
A.B. Tsybakov and S.A. van de Geer (2003). Square root penalty: adaptation to the margin in classification and in edge estimation. Prépublication PMA-820, Lab. de Probab. et Modèles Aléatoires, Université Paris VII (submitted).