Less is More: Computational Regularization by Subsampling
Lorenzo Rosasco, University of Genova - Istituto Italiano di Tecnologia - Massachusetts Institute of Technology (lcsl.mit.edu)
Joint work with Alessandro Rudi and Raffaello Camoriano
Paris
A Starting Point
Classically, statistics and optimization are distinct steps in algorithm design: empirical process theory + optimization.
At large scale, consider the interplay between statistics and optimization! (Bottou, Bousquet 08)
Computational regularization: computational tricks = regularization.
Supervised Learning
Problem: estimate f* given S_n = {(x_1, y_1), ..., (x_n, y_n)}.
The setting:
- y_i = f*(x_i) + ε_i, i ∈ {1, ..., n}
- ε_i ∈ R, x_i ∈ R^d random (bounded but with unknown distribution)
- f* unknown
Outline
- Nonparametric Learning
- Data Dependent Subsampling
- Data Independent Subsampling
Non-linear/non-parametric learning
f(x) = Σ_{i=1}^{M} c_i q(x, w_i)
- q: a non-linear function
- w_i ∈ R^d: the centers
- c_i ∈ R: the coefficients
- M = M_n: could/should grow with n
Question: how to choose w_i, c_i and M given S_n?
Learning with Positive Definite Kernels
There is an elegant answer if q is symmetric and all the matrices Q_{ij} = q(x_i, x_j) are positive semi-definite (i.e., have non-negative eigenvalues).
Representer Theorem (Kimeldorf, Wahba 70; Schölkopf et al. 01): M = n, w_i = x_i, and the c_i are obtained by convex optimization!
Kernel Ridge Regression (KRR), a.k.a. Tikhonov Regularization
f̂_λ = argmin_{f ∈ H} (1/n) Σ_{i=1}^{n} (y_i − f(x_i))^2 + λ ||f||^2
where
H = {f | f(x) = Σ_{i=1}^{M} c_i q(x, w_i), c_i ∈ R, w_i ∈ R^d (any center!), M ∈ N (any length!)}
and the norm is induced by the inner product ⟨f, f'⟩ = Σ_{i,j} c_i c'_j q(x_i, x_j).
Solution: f̂_λ(x) = Σ_{i=1}^{n} c_i q(x, x_i), with c = (Q̂ + λnI)^{-1} ŷ.
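The closed-form solution above amounts to building the kernel matrix once and solving one linear system. A minimal NumPy sketch; the Gaussian kernel, bandwidth and toy data are illustrative choices, not taken from the slides:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # q(a, b) = exp(-gamma * ||a - b||^2), computed pairwise.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def krr_fit(X, y, lam, gamma=1.0):
    # Coefficients c = (Q + lam * n * I)^{-1} y.
    n = len(X)
    Q = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(Q + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test, gamma=1.0):
    # f_hat(x) = sum_i c_i q(x, x_i).
    return gaussian_kernel(X_test, X_train, gamma) @ c

# Toy usage: noisy sine regression, with the theoretical choice lambda = 1/sqrt(n).
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
c = krr_fit(X, y, lam=1.0 / np.sqrt(n))
y_hat = krr_predict(X, c, X)
```

The O(n^2) kernel matrix and the O(n^3) solve in `krr_fit` are exactly the costs discussed on the next slides.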
KRR: Statistics
Well understood statistical properties.
Classical Theorem: if f* ∈ H, then λ = 1/√n gives E(f̂_λ(x) − f*(x))^2 ≲ 1/√n.
Remarks:
1. Optimal nonparametric bound.
2. More refined results for smooth kernels: λ = n^{-1/(2s+1)} gives E(f̂_λ(x) − f*(x))^2 ≲ n^{-2s/(2s+1)}.
3. Adaptive tuning, e.g. via cross validation.
4. Proofs: inverse problems results + random matrices (Smale, Zhou; Caponnetto, De Vito, R.).
KRR: Optimization
f̂_λ(x) = Σ_{i=1}^{n} c_i q(x, x_i), where c solves the linear system (Q̂ + λnI) c = ŷ.
Complexity: space O(n^2), time O(n^3).
BIG DATA? Running out of time and space... Can this be fixed?
Beyond Tikhonov: Spectral Filtering
(Q̂ + λI)^{-1} is an approximation of Q̂^{-1} controlled by λ. Can we approximate Q̂^{-1} while saving computations? Yes!
Spectral filtering (Engl 96, inverse problems; Rosasco et al. 05, ML): g_λ(Q̂) ≈ Q̂^{-1}. The filter function g_λ defines the form of the approximation.
Spectral Filtering: Examples
- Tikhonov → ridge regression
- Truncated SVD → principal component regression
- Landweber iteration → GD / L2-boosting
- nu-method → accelerated GD / Chebyshev method
- ...
Landweber iteration (truncated power series):
c_t = g_t(Q̂) ŷ = γ Σ_{r=0}^{t−1} (I − γQ̂)^r ŷ
... it's GD for ERM! For r = 1, ..., t:
c_r = c_{r−1} − γ(Q̂ c_{r−1} − ŷ), c_0 = 0
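The identity between the truncated power series and gradient descent on the empirical risk can be checked numerically. A small sketch; the kernel, step size and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
# A positive semi-definite kernel matrix Q_ij = exp(-||x_i - x_j||^2).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Q = np.exp(-sq)
y = rng.standard_normal(50)

gamma = 1.0 / np.linalg.norm(Q, 2)  # step size <= 1/||Q|| for stability
t = 100

# Gradient descent for ERM: c_r = c_{r-1} - gamma * (Q c_{r-1} - y), c_0 = 0.
c = np.zeros(50)
for _ in range(t):
    c = c - gamma * (Q @ c - y)

# Truncated power series: c_t = gamma * sum_{r=0}^{t-1} (I - gamma Q)^r y.
c_series = np.zeros(50)
P = np.eye(50)
for _ in range(t):
    c_series += gamma * (P @ y)
    P = P @ (np.eye(50) - gamma * Q)

print(np.allclose(c, c_series))  # the two computations coincide
```

Here the iteration count t plays the role of the regularization parameter, which is the point of the semiconvergence slide below.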
Statistics and Computations with Spectral Filtering
The different filters achieve essentially the same optimal statistical error! The difference is in computations (note: λ^{-1} = t for iterative methods):

Filter         | Time          | Space
Tikhonov       | n^3           | n^2
GD             | n^2 λ^{-1}    | n^2
Accelerated GD | n^2 λ^{-1/2}  | n^2
Truncated SVD  | n^2 λ^{-γ}    | n^2
Semiconvergence
[plot: empirical vs. expected error as a function of the iteration number]
Iterations control statistics and time complexity.
Computational Regularization
BIG DATA? Running out of time and space...
Is there a principle to control statistics, time and space complexity?
Outline
- Nonparametric Learning
- Data Dependent Subsampling
- Data Independent Subsampling
Subsampling
1. Pick the w_i at random from the training set (Smola, Schölkopf 00): {w_1, ..., w_M} ⊆ {x_1, ..., x_n}, M ≤ n.
2. Perform KRR on H_M = {f | f(x) = Σ_{i=1}^{M} c_i q(x, w_i), c_i ∈ R}.
Linear system: Q̂_M c = ŷ.
Complexity: space O(n^2) → O(nM), time O(n^3) → O(nM^2).
What about statistics? What's the price for efficient computations?
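Restricting KRR to H_M turns the n × n system into an M × M one: writing K_nM for the n × M matrix q(x_i, w_j) and K_MM for the M × M matrix q(w_i, w_j), the coefficients solve (K_nM^T K_nM + λn K_MM) c = K_nM^T ŷ. A hedged sketch of this Nyström-style solver; the Gaussian kernel, jitter term and toy data are illustrative:

```python
import numpy as np

def gk(A, B):
    # Gaussian kernel q(a, b) = exp(-||a - b||^2), computed pairwise.
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def nystrom_krr(X, y, M, lam, rng):
    # 1. Subsample M centers uniformly at random from the training set.
    idx = rng.choice(len(X), size=M, replace=False)
    W = X[idx]
    # 2. KRR restricted to span{q(., w_1), ..., q(., w_M)}: solve the
    #    M x M system (K_nM^T K_nM + lam * n * K_MM) c = K_nM^T y.
    K_nM = gk(X, W)          # n x M
    K_MM = gk(W, W)          # M x M
    A = K_nM.T @ K_nM + lam * len(X) * K_MM
    c = np.linalg.solve(A + 1e-10 * np.eye(M), K_nM.T @ y)  # tiny jitter
    return W, c

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
W, c = nystrom_krr(X, y, M=25, lam=1e-3, rng=rng)  # M ~ sqrt(n) centers
y_hat = gk(X, W) @ c
```

Only K_nM (size nM) is ever stored, and the solve costs O(nM^2): exactly the complexities claimed above.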
Putting Our Result in Context
- *Many* different subsampling schemes (Smola, Schölkopf 00; Williams, Seeger 01; ...).
- Theoretical guarantees mainly on matrix approximation (Mahoney, Drineas 09; Cortes et al. 10; Kumar et al. ...): ||Q − Q_M|| ≲ 1/√M.
- Statistical guarantees suboptimal or in restricted settings (Cortes et al. 10; Jin et al. 11; Bach 13; El Alaoui, Mahoney 14).
Main Result (Rudi, Camoriano, Rosasco 15)
Theorem: if f* ∈ H, then λ = 1/√n and M = 1/λ = √n give E(f̂_{λ,M}(x) − f*(x))^2 ≲ 1/√n.
Remarks:
1. Subsampling achieves the optimal bound with M = √n ≪ n!!
2. More generally, λ = n^{-1/(2s+1)} and M = 1/λ give E_x(f̂_{λ,M}(x) − f*(x))^2 ≲ n^{-2s/(2s+1)}.
Note: an interesting insight is obtained by rewriting the result...
Computational Regularization by Subsampling (Rudi, Camoriano, Rosasco 15)
A simple idea: swap the roles of λ and M...
Theorem: if f* ∈ H with a smooth kernel, then M = n^{1/(2s+1)} and λ = 1/M give E_x(f̂_{λ,M}(x) − f*(x))^2 ≲ n^{-2s/(2s+1)}.
λ and M play the same role; new interpretation: subsampling regularizes!
This suggests a natural incremental algorithm:
1. Pick a center + compute the solution.
2. Pick another center + rank-one update.
3. Pick another center...
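A sketch of the incremental idea: with λ fixed, grow the number of centers M and monitor the validation error. For clarity this sketch re-solves the M × M system at every step instead of performing the rank-one updates the algorithm actually uses; the kernel, schedule and data are illustrative:

```python
import numpy as np

def gk(A, B):
    # Gaussian kernel q(a, b) = exp(-||a - b||^2), computed pairwise.
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)
X_tr, y_tr, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

lam = 1e-3                          # n and lambda stay fixed
order = rng.permutation(len(X_tr))  # random order in which centers are added
val_errs = []
for M in (1, 2, 4, 8, 16, 32):
    # Grow the set of centers; here we simply re-solve the M x M system,
    # whereas the actual algorithm updates the factorization rank-one per center.
    W = X_tr[order[:M]]
    K_nM, K_MM = gk(X_tr, W), gk(W, W)
    A = K_nM.T @ K_nM + lam * len(X_tr) * K_MM + 1e-10 * np.eye(M)
    c = np.linalg.solve(A, K_nM.T @ y_tr)
    val_errs.append(np.mean((gk(X_val, W) @ c - y_val) ** 2))
print(val_errs)  # tracking this curve is how M is chosen: M regularizes
```

Stopping when the validation error flattens tailors the time/space requirement to generalization, as the next slide illustrates.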
Nyström CoRe Illustrated
[plot: validation error as a function of computation, with n and λ fixed]
Computation controls stability! Time/space requirements tailored to generalization.
Experiments
Comparable/better w.r.t. the state of the art.
[table: test error on the Ins. Co., CPU, CT slices, Year Pred and Forest datasets (with n_tr and d per dataset) for Incremental CoRe, Standard KRLS, Standard Nyström, Random Features RF and Fastfood; the numeric entries did not survive transcription]
Random Features (Rahimi, Recht 07); Fastfood (Le et al. 13).
Summary So Far
- Optimal learning with data dependent subsampling.
- Computational regularization: subsampling regularizes!
A few more questions:
- Can one do better than uniform sampling? Yes: leverage score sampling...
- What about data independent sampling?
Outline
- Nonparametric Learning
- Data Dependent Subsampling
- Data Independent Subsampling
Random Features
f(x) = Σ_{i=1}^{M} c_i q(x, w_i)
- q: a general non-linear function.
- Pick the w_i at random according to a distribution µ: w_1, ..., w_M ∼ µ.
- Perform KRR on H_M = {f | f(x) = Σ_{i=1}^{M} c_i q(x, w_i), c_i ∈ R}.
Random Fourier Features (Rahimi, Recht 07)
Consider q(x, w) = e^{i w^T x} with w ∼ µ(w) = N(0, I). Then
E_w[q(x, w) q(x', w)*] = e^{−||x − x'||^2 / 2γ} = K(x, x').
By sampling w_1, ..., w_M we are implicitly considering the approximate kernel
(1/M) Σ_{i=1}^{M} q(x, w_i) q(x', w_i)* = K_M(x, x').
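In practice one works with the real-valued variant z(x) = √(2/M) cos(w^T x + b), b ∼ U[0, 2π], whose expectation also recovers the Gaussian kernel (here with unit bandwidth). A sketch of the kernel approximation and of the resulting M × M ridge regression; dimensions, λ and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 400, 3, 200
X = rng.standard_normal((n, d))

# Random Fourier features z(x) = sqrt(2/M) cos(W x + b), w_i ~ N(0, I),
# b_i ~ U[0, 2*pi]; E[z(x)^T z(x')] = exp(-||x - x'||^2 / 2) = K(x, x').
W = rng.standard_normal((M, d))
b = rng.uniform(0, 2 * np.pi, size=M)

def features(A):
    return np.sqrt(2.0 / M) * np.cos(A @ W.T + b)

# Monte Carlo kernel approximation: |K_M - K| shrinks like 1/sqrt(M).
Z = features(X)
K_M = Z @ Z.T
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2)
err = np.abs(K_M - K).max()

# KRR in feature space: an M x M linear system instead of an n x n one.
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
lam = 1e-3
c = np.linalg.solve(Z.T @ Z + lam * n * np.eye(M), Z.T @ y)
y_hat = Z @ c
```

Training touches only the n × M feature matrix Z, giving the O(nM) space and O(nM^2) time stated below.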
More Random Features
- Translation invariant kernels: K(x, x') = H(x − x'), q(x, w) = e^{i w^T x}, w ∼ µ = F(H) (the Fourier transform of H).
- Infinite neural nets kernels: q(x, w) = |w^T x + b|_+, (w, b) ∼ µ = U[S^d].
- Infinite dot product kernels.
- Homogeneous additive kernels.
- Group invariant kernels.
- ...
Note: connections with hashing and sketching techniques.
Properties of Random Features
Optimization: time O(n^3) → O(nM^2), space O(n^2) → O(nM).
Statistics: as before, do we pay a price for efficient computations?
Previous Works
- *Many* different random features for different kernels (Rahimi, Recht 07; Vedaldi, Zisserman; ...).
- Theoretical guarantees mainly on kernel approximation (Rahimi, Recht 07; ...; Sriperumbudur, Szabó 15): |K(x, x') − K_M(x, x')| ≲ 1/√M.
- Statistical guarantees suboptimal or in restricted settings (Rahimi, Recht 09; Yang et al.; Bach 15).
Main Result
Let q(x, w) = e^{i w^T x} with w ∼ µ(w) = c_d (1 + ||w||^2)^{−(d+1)/2}.
Theorem: if f* ∈ H^s (a Sobolev space), then λ = n^{-1/(2s+1)} and M = (1/λ)^{2s} give E(f̂_{λ,M}(x) − f*(x))^2 ≲ n^{-2s/(2s+1)}.
Random features achieve the optimal bound! Efficient subsampling in the worst case (M = √n), but they cannot exploit smoothness.
Remarks & Extensions
Nyström vs. random features:
- Both achieve optimal rates.
- Nyström seems to need fewer samples (random centers).
How tight are the results?
[plot: test error as a function of log λ and log M]
Contributions
- Optimal bounds for data dependent/independent subsampling.
- Subsampling: Nyström vs. random features.
- Beyond ridge regression: early stopping and multiple passes of SGD (see arXiv).
Some questions:
- The quest for the best sampling.
- Regularization by projection: inverse problems and preconditioning.
- Beyond randomization: non-convex neural nets optimization?
Some perspectives:
- Computational regularization: subsampling regularizes.
- Algorithm design: control stability for good statistics/computations.
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationThe Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017
The Kernel Trick, Gram Matrices, and Feature Extraction CS6787 Lecture 4 Fall 2017 Momentum for Principle Component Analysis CS6787 Lecture 3.1 Fall 2017 Principle Component Analysis Setting: find the
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationAn Adaptive Test of Independence with Analytic Kernel Embeddings
An Adaptive Test of Independence with Analytic Kernel Embeddings Wittawat Jitkrittum 1 Zoltán Szabó 2 Arthur Gretton 1 1 Gatsby Unit, University College London 2 CMAP, École Polytechnique ICML 2017, Sydney
More informationConvergence rates of spectral methods for statistical inverse learning problems
Convergence rates of spectral methods for statistical inverse learning problems G. Blanchard Universtität Potsdam UCL/Gatsby unit, 04/11/2015 Joint work with N. Mücke (U. Potsdam); N. Krämer (U. München)
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationKernel-based Approximation. Methods using MATLAB. Gregory Fasshauer. Interdisciplinary Mathematical Sciences. Michael McCourt.
SINGAPORE SHANGHAI Vol TAIPEI - Interdisciplinary Mathematical Sciences 19 Kernel-based Approximation Methods using MATLAB Gregory Fasshauer Illinois Institute of Technology, USA Michael McCourt University
More informationDistribution Regression
Zoltán Szabó (École Polytechnique) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL) Dagstuhl Seminar 16481
More informationApproximate Spectral Clustering via Randomized Sketching
Approximate Spectral Clustering via Randomized Sketching Christos Boutsidis Yahoo! Labs, New York Joint work with Alex Gittens (Ebay), Anju Kambadur (IBM) The big picture: sketch and solve Tradeoff: Speed
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationOnline Kernel PCA with Entropic Matrix Updates
Dima Kuzmin Manfred K. Warmuth Computer Science Department, University of California - Santa Cruz dima@cse.ucsc.edu manfred@cse.ucsc.edu Abstract A number of updates for density matrices have been developed
More informationLarge-scale Image Annotation by Efficient and Robust Kernel Metric Learning
Large-scale Image Annotation by Efficient and Robust Kernel Metric Learning Supplementary Material Zheyun Feng Rong Jin Anil Jain Department of Computer Science and Engineering, Michigan State University,
More informationLecture 2: Linear Algebra Review
EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1
More information10-701/ Recitation : Kernels
10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer
More informationA Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices
A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices Ryota Tomioka 1, Taiji Suzuki 1, Masashi Sugiyama 2, Hisashi Kashima 1 1 The University of Tokyo 2 Tokyo Institute of Technology 2010-06-22
More informationStochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos
1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and
More informationApproximate Principal Components Analysis of Large Data Sets
Approximate Principal Components Analysis of Large Data Sets Daniel J. McDonald Department of Statistics Indiana University mypage.iu.edu/ dajmcdon April 27, 2016 Approximation-Regularization for Analysis
More informationLearning gradients: prescriptive models
Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University May 11, 2007 Relevant papers Learning Coordinate Covariances via Gradients. Sayan
More informationMIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 04: Features and Kernels. Lorenzo Rosasco
MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications Class 04: Features and Kernels Lorenzo Rosasco Linear functions Let H lin be the space of linear functions f(x) = w x. f w is one
More informationRegularized Least Squares
Regularized Least Squares Ryan M. Rifkin Google, Inc. 2008 Basics: Data Data points S = {(X 1, Y 1 ),...,(X n, Y n )}. We let X simultaneously refer to the set {X 1,...,X n } and to the n by d matrix whose
More informationRKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets Class 22, 2004 Tomaso Poggio and Sayan Mukherjee
RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets 9.520 Class 22, 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce an alternate perspective of RKHS via integral operators
More informationSketched Ridge Regression:
Sketched Ridge Regression: Optimization and Statistical Perspectives Shusen Wang UC Berkeley Alex Gittens RPI Michael Mahoney UC Berkeley Overview Ridge Regression min w f w = 1 n Xw y + γ w Over-determined:
More informationStat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.
Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationOptimal Distributed Learning with Multi-pass Stochastic Gradient Methods
Optimal Distributed Learning with Multi-pass Stochastic Gradient Methods Junhong Lin Volkan Cevher Abstract We study generalization properties of distributed algorithms in the setting of nonparametric
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date
More informationAdvances in kernel exponential families
Advances in kernel exponential families Arthur Gretton Gatsby Computational Neuroscience Unit, University College London NIPS, 2017 1/39 Outline Motivating application: Fast estimation of complex multivariate
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement
More informationIntroductory Machine Learning Notes 1
Introductory Machine Learning Notes 1 Lorenzo Rosasco DIBRIS, Universita degli Studi di Genova LCSL, Massachusetts Institute of Technology and Istituto Italiano di Tecnologia lrosasco@mit.edu December
More informationCOMP 551 Applied Machine Learning Lecture 20: Gaussian processes
COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55
More informationSparse Random Features Algorithm as Coordinate Descent in Hilbert Space
Sparse Random Features Algorithm as Coordinate Descent in Hilbert Space Ian E.H. Yen 1 Ting-Wei Lin Shou-De Lin Pradeep Ravikumar 1 Inderjit S. Dhillon 1 Department of Computer Science 1: University of
More informationStochastic gradient descent and robustness to ill-conditioning
Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,
More informationDistribution Regression: A Simple Technique with Minimax-optimal Guarantee
Distribution Regression: A Simple Technique with Minimax-optimal Guarantee (CMAP, École Polytechnique) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department,
More informationAn Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM
An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department
More informationRandomized Algorithms
Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models
More informationarxiv: v6 [stat.ml] 17 Mar 2016
Less is More: Nyström Computational Regularization Alessandro Rudi 1 Raffaello Camoriano 1,2 Lorenzo Rosasco 1,3 ale rudi@mit.edu raffaello.camoriano@iit.it lrosasco@mit.edu arxiv:1507.04717v6 [stat.ml]
More information