Less is More: Computational Regularization by Subsampling


1 Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro Rudi, Raffaello Camoriano Jan 13th, 2016 UCL, London

2-4 A Starting Point

Classically: statistics and optimization are distinct steps in algorithm design.

Large scale: consider the interplay between statistics and optimization! (Bottou, Bousquet 08)

Computational Regularization: computation tricks = regularization

5-7 Supervised Learning

Problem: estimate $f_*$ given $S_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$

[Figure: sample points $(x_1, y_1), \dots, (x_5, y_5)$ and the target function $f_*$]

The Setting
$y_i = f_*(x_i) + \varepsilon_i, \quad i \in \{1, \dots, n\}$
$\varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$ random (with unknown distribution), $f_*$ unknown

8 Outline

Learning with dictionaries and kernels
Data Dependent Subsampling
Data Independent Subsampling

9-14 Non-linear/non-parametric learning

$f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$

$q$ non-linear function
$w_i \in \mathbb{R}^d$ centers
$c_i \in \mathbb{R}$ coefficients
$M = M_n$ could/should grow with $n$

Question: How to choose $w_i$, $c_i$ and $M$ given $S_n$?
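To make the dictionary model above concrete, here is a minimal NumPy sketch of evaluating $f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$ with a Gaussian bump as the non-linear unit $q$; the choice of $q$, the bandwidth and the toy centers/coefficients are illustrative assumptions, not part of the talk.

```python
import numpy as np

def q(x, w, gamma=1.0):
    """Example non-linear unit: a Gaussian bump centred at w."""
    return np.exp(-gamma * np.sum((x - w) ** 2, axis=-1))

def f(x, centers, coeffs, gamma=1.0):
    """Dictionary expansion f(x) = sum_i c_i q(x, w_i)."""
    # x: (d,) single point; centers: (M, d); coeffs: (M,)
    return np.sum(coeffs * q(x[None, :], centers, gamma))

# toy usage: M = 3 random centers in R^2 with random coefficients
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 2))
coeffs = rng.normal(size=3)
print(f(np.array([0.1, -0.2]), centers, coeffs))
```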

15-16 Learning with Positive Definite Kernels

There is an elegant answer if:
$q$ is symmetric
all the matrices $Q_{ij} = q(x_i, x_j)$ are positive semi-definite (i.e. they have non-negative eigenvalues)

Representer Theorem (Kimeldorf, Wahba 70; Schölkopf et al. 01)
$M = n$, $w_i = x_i$, $c_i$ by convex optimization!

17-19 Kernel Ridge Regression (KRR) a.k.a. Penalized Least Squares

$\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2$

where
$\mathcal{H} = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\; c_i \in \mathbb{R},\; \underbrace{w_i \in \mathbb{R}^d}_{\text{any center!}},\; \underbrace{M \in \mathbb{N}}_{\text{any length!}} \}$

Solution
$\hat f_\lambda = \sum_{i=1}^{n} c_i\, q(x, x_i)$ with $c = (\hat Q + \lambda n I)^{-1} \hat y$
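A minimal NumPy sketch of the closed-form KRR solution $c = (\hat Q + \lambda n I)^{-1}\hat y$, assuming a Gaussian kernel and toy data; the helper names (gaussian_kernel, krr_fit, krr_predict) are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # pairwise squared distances between the rows of A and B
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam, gamma=1.0):
    """Representer coefficients c = (Q + lam * n * I)^{-1} y."""
    n = X.shape[0]
    Q = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(Q + lam * n * np.eye(n), y)

def krr_predict(Xtest, Xtrain, c, gamma=1.0):
    return gaussian_kernel(Xtest, Xtrain, gamma) @ c

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)
c = krr_fit(X, y, lam=1 / np.sqrt(200))
print(krr_predict(X[:5], X, c))
```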

20-25 KRR: Statistics

Well understood statistical properties:

Classical Theorem. If $f_* \in \mathcal{H}$, then choosing $\lambda = \frac{1}{\sqrt{n}}$ gives
$\mathbb{E}\,(\hat f_\lambda(x) - f_*(x))^2 \lesssim \frac{1}{\sqrt{n}}$

Remarks
1. Optimal nonparametric bound
2. Results for general kernels (e.g. splines/Sobolev etc.):
$\lambda = n^{-\frac{1}{2s+1}}, \qquad \mathbb{E}\,(\hat f_\lambda(x) - f_*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$
3. Adaptive tuning via cross validation
4. Proofs: analysis/linear algebra + random matrices (Smale and Zhou; Caponnetto, De Vito, R.; Steinwart)
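Remark 3 mentions adaptive tuning via cross validation; below is a hedged hold-out sketch that scans a grid of $\lambda$ values and keeps the one with the smallest validation error. The grid, the split and the Gaussian kernel are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

# toy data, split into training and hold-out validation sets
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=300)
Xtr, ytr, Xval, yval = X[:200], y[:200], X[200:], y[200:]

Qtr = gaussian_kernel(Xtr, Xtr)
Qval = gaussian_kernel(Xval, Xtr)
lambdas = np.logspace(-6, 0, 13)          # candidate regularization values
errs = []
for lam in lambdas:
    c = np.linalg.solve(Qtr + lam * 200 * np.eye(200), ytr)
    errs.append(np.mean((Qval @ c - yval) ** 2))
print("selected lambda:", lambdas[int(np.argmin(errs))])
```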

26-27 KRR: Optimization

$\hat f_\lambda = \sum_{i=1}^{n} c_i\, q(x, x_i)$ with $c = (\hat Q + \lambda n I)^{-1} \hat y$

Linear system $(\hat Q + \lambda n I)\, c = \hat y$: Space $O(n^2)$, Time $O(n^3)$

BIG DATA? Running out of time and space... Can this be fixed?

28-31 Beyond Tikhonov: Spectral Filtering

$(\hat Q + \lambda I)^{-1}$: approximation of $\hat Q^{-1}$ controlled by $\lambda$

Can we approximate $\hat Q^{-1}$ while saving computations? Yes!

Spectral filtering (Engl 96 - inverse problems; Rosasco et al. 05 - ML)
$g_\lambda(\hat Q) \approx \hat Q^{-1}$
The filter function $g_\lambda$ defines the form of the approximation

32-34 Spectral filtering

Examples
Tikhonov - ridge regression
Truncated SVD - principal component regression
Landweber iteration - GD / $L^2$-boosting
nu-method - accelerated GD / Chebyshev method
...

Landweber iteration (truncated power series)...
$c^t = g_t(\hat Q)\,\hat y = \gamma \sum_{r=0}^{t-1} (I - \gamma \hat Q)^r\, \hat y$

... it's GD for ERM!
$c^r = c^{r-1} - \gamma(\hat Q c^{r-1} - \hat y), \qquad c^0 = 0, \quad r = 1, \dots, t$
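A minimal sketch of the Landweber/GD iteration $c^r = c^{r-1} - \gamma(\hat Q c^{r-1} - \hat y)$ on a toy kernel problem; the step-size rule and the Gaussian kernel are illustrative assumptions. The number of iterations $t$ plays the role of the regularization parameter.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def landweber(Q, y, t=100, gamma=None):
    """Iterate c_r = c_{r-1} - gamma * (Q c_{r-1} - y) with c_0 = 0, for t steps."""
    if gamma is None:
        gamma = 1.0 / np.linalg.norm(Q, 2)   # step size below 2/||Q|| keeps the iteration stable
    c = np.zeros(Q.shape[0])
    for _ in range(t):
        c = c - gamma * (Q @ c - y)
    return c

# toy usage: the iteration count t is the regularization parameter
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=100)
c = landweber(gaussian_kernel(X, X), y, t=50)
print(c[:5])
```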

35 Semiconvergence

[Figure: empirical error and expected error as a function of the iteration, illustrating semiconvergence.]

36-39 Early Stopping at Work

[Figure sequence: fit on the training set as the number of iterations grows.]

40-41 Statistics and computations with spectral filtering

The different filters achieve essentially the same optimal statistical error!
Difference is in computations:

Filter           Time                      Space
Tikhonov         $n^3$                     $n^2$
GD               $n^2 \lambda^{-1}$        $n^2$
Accelerated GD   $n^2 \lambda^{-1/2}$      $n^2$
Truncated SVD    $n^2 \lambda^{-\gamma}$   $n^2$

42-46 Computational Regularization Arises

Computational regularization: iterations control statistics and time complexity
Built-in regularization path

Is there an advantage in going for online learning?
Not much, maybe $n^2 \log n$? (Dieuleveut, Bach 15; R., Villa 15)

Computational regularization: principles to control statistics, time and space complexity

47 Outline

Learning with dictionaries and kernels
Data Dependent Subsampling
Data Independent Subsampling

48-52 Subsampling

1. Pick $w_i$ at random... from the training set (Smola, Schölkopf 00):
$\{w_1, \dots, w_M\} \subseteq \{x_1, \dots, x_n\}, \qquad M \le n$

2. Perform KRR on
$\mathcal{H}_M = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\; c_i \in \mathbb{R} \}$

Linear system $\hat Q_M c = \hat y$: Space $O(n^2) \to O(nM)$, Time $O(n^3) \to O(nM^2)$

What about statistics? What's the price for efficient computations?
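A minimal sketch of Nyström-subsampled KRR with uniformly chosen centers, written with the standard normal-equations form $(K_{nM}^T K_{nM} + \lambda n K_{MM})\, c = K_{nM}^T \hat y$; the Gaussian kernel, the jitter term and the toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def nystrom_krr_fit(X, y, M, lam, gamma=1.0, seed=0):
    """KRR restricted to M centers sampled uniformly from the training set.

    Solves (K_nM^T K_nM + lam * n * K_MM) c = K_nM^T y for c in R^M.
    """
    n = X.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=M, replace=False)  # uniform subsampling
    W = X[idx]
    K_nM = gaussian_kernel(X, W, gamma)
    K_MM = gaussian_kernel(W, W, gamma)
    A = K_nM.T @ K_nM + lam * n * K_MM + 1e-10 * np.eye(M)   # small jitter for stability
    c = np.linalg.solve(A, K_nM.T @ y)
    return W, c

def nystrom_krr_predict(Xtest, W, c, gamma=1.0):
    return gaussian_kernel(Xtest, W, gamma) @ c

# toy usage: M ~ sqrt(n) centers
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=400)
W, c = nystrom_krr_fit(X, y, M=20, lam=1 / np.sqrt(400))
print(nystrom_krr_predict(X[:5], W, c))
```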

53-55 Putting our Result in Context

*Many* different subsampling schemes (Smola, Schölkopf 00; Williams, Seeger 01; ...)

Theoretical guarantees mainly on matrix approximation (Mahoney and Drineas 09; Cortes et al. 10; Kumar et al. ...):
$\|\hat Q - \hat Q_M\| \lesssim \frac{1}{\sqrt{M}}$

Few prediction guarantees, either suboptimal or in a restricted setting (Cortes et al. 10; Jin et al. 11; Bach 13; Alaoui, Mahoney 14)

56-61 Main Result (Rudi, Camoriano, Rosasco, 15)

Theorem. If $f_* \in \mathcal{H}$, then choosing
$\lambda = \frac{1}{\sqrt{n}}, \quad M = \frac{1}{\lambda}$ gives
$\mathbb{E}\,(\hat f_{\lambda,M}(x) - f_*(x))^2 \lesssim \frac{1}{\sqrt{n}}$

Remarks
1. Subsampling achieves the optimal bound with $M \sim \sqrt{n} \ll n$!!
3. More generally,
$\lambda = n^{-\frac{1}{2s+1}}, \quad M = \frac{1}{\lambda}, \qquad \mathbb{E}_x\,(\hat f_{\lambda,M}(x) - f_*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$

Note: an interesting insight is obtained by rewriting the result...

62-68 Computational Regularization by Subsampling (Rudi, Camoriano, Rosasco, 15)

A simple idea: swap the roles of $\lambda$ and $M$...

Theorem. If $f_* \in \mathcal{H}$, then choosing
$M = n^{\frac{1}{2s+1}}, \quad \lambda = \frac{1}{M}$ gives
$\mathbb{E}_x\,(\hat f_{\lambda,M}(x) - f_*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$

$\lambda$ and $M$ play the same role; new interpretation: subsampling regularizes!

A new, natural incremental algorithm (see the sketch below)...

Algorithm
1. Pick a center + compute solution
2. Pick another center + rank-one update
3. Pick another center...
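A hedged sketch of the incremental idea: add one random center at a time and monitor the validation error, so that $M$ acts as the regularization parameter. For simplicity this sketch re-solves the linear system at every step instead of performing the rank-one (Cholesky-style) updates used by the actual algorithm; the Gaussian kernel and toy data are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def incremental_core(Xtr, ytr, Xval, yval, lam, max_M=50, gamma=1.0, seed=0):
    """Add one random center at a time and record the validation error for each M."""
    n = Xtr.shape[0]
    order = np.random.default_rng(seed).permutation(n)
    errs = []
    for M in range(1, max_M + 1):
        W = Xtr[order[:M]]                              # first M randomly ordered centers
        K_nM = gaussian_kernel(Xtr, W, gamma)
        K_MM = gaussian_kernel(W, W, gamma)
        c = np.linalg.solve(K_nM.T @ K_nM + lam * n * K_MM + 1e-10 * np.eye(M),
                            K_nM.T @ ytr)
        pred = gaussian_kernel(Xval, W, gamma) @ c
        errs.append(np.mean((pred - yval) ** 2))
    return np.array(errs)

# toy usage: pick M at (or near) the minimum of the validation curve
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=300)
errs = incremental_core(X[:200], y[:200], X[200:], y[200:], lam=1e-6, max_M=40)
print("selected M:", 1 + int(np.argmin(errs)))
```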

69 CoRe Illustrated

[Figure: validation error as a function of the number of centers, with $n$ and $\lambda$ fixed.]

Computation controls stability! Time/space requirements tailored to generalization

70 Experiments

Comparable/better w.r.t. the state of the art

[Table: test performance (mean ± std) on the Insurance Company, CPU, CT slices, Year Prediction and Forest datasets (with their $n_{tr}$ and $d$), comparing Incremental CoRe, Standard KRLS, Standard Nyström, Random Features and Fastfood; the numeric entries were not recovered in this transcription.]

Random Features (Rahimi, Recht 07); Fastfood (Le et al. 13)

71-72 Summary so far

Optimal learning with data dependent subsampling
Computational regularization: subsampling regularizes!

Few more questions:
Can one do better than uniform sampling?
What about data independent sampling?

73-75 (Approximate) Leverage scores

Leverage scores
$l_i(t) = \big(\hat Q(\hat Q + t n I)^{-1}\big)_{ii}$

ALS: with probability at least $1 - \delta$,
$\frac{1}{T}\, l_i(t) \le \hat l_i(t) \le T\, l_i(t)$

ALS subsampling: pick $w_1, \dots, w_M$ from $x_1, \dots, x_n$ with replacement, with probability $P_t(i) = \hat l_i(t) / \sum_j \hat l_j(t)$.
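A minimal sketch of ridge leverage scores $l_i(t) = (\hat Q(\hat Q + t n I)^{-1})_{ii}$ and of ALS-style sampling with replacement. This computes the exact scores at $O(n^3)$ cost, purely for illustration; the point of approximate leverage scores is precisely to avoid this.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def ridge_leverage_scores(Q, t):
    """Exact scores l_i(t) = (Q (Q + t*n*I)^{-1})_{ii}; O(n^3), for illustration only."""
    n = Q.shape[0]
    return np.diag(Q @ np.linalg.inv(Q + t * n * np.eye(n)))

def als_subsample(Q, t, M, seed=0):
    """Sample M center indices with replacement, with probability proportional to the scores."""
    scores = ridge_leverage_scores(Q, t)
    p = scores / scores.sum()
    return np.random.default_rng(seed).choice(Q.shape[0], size=M, replace=True, p=p)

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
idx = als_subsample(gaussian_kernel(X, X), t=1e-3, M=20)
print(idx)
```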

76-80 Learning with ALS Subsampling (Rudi, Camoriano, Rosasco, 15)

Theorem. If $f_* \in \mathcal{H}$ and the integral operator associated to $q$ has eigenvalue decay $i^{-1/\gamma}$, $\gamma \in (0, 1)$, then choosing
$\lambda = n^{-\frac{1}{1+\gamma}}, \quad M = \frac{1}{\lambda^{\gamma}}$ gives
$\mathbb{E}\,(\hat f_{\lambda,M}(x) - f_*(x))^2 \lesssim n^{-\frac{1}{1+\gamma}}$

Non-uniform subsampling achieves the optimal bound
Most importantly: potentially much fewer samples needed
Need efficient ALS computation (...)
Extensions to more general sampling schemes are possible

81 Outline

Learning with dictionaries and kernels
Data Dependent Subsampling
Data Independent Subsampling

82-85 Random Features

$f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$, with $q$ a general non-linear function

Pick $w_i$ at random according to a distribution $\mu$:
$w_1, \dots, w_M \sim \mu$

Perform KRR on
$\mathcal{H}_M = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\; c_i \in \mathbb{R} \}$

86-89 Random Fourier Features (Rahimi, Recht 07)

Consider
$q(x, w) = e^{i w^T x}, \qquad w \sim \mu(w) = \mathcal{N}(0, I)$

Then
$\mathbb{E}_w[q(x, w)\,\overline{q(x', w)}] = e^{-\frac{\|x - x'\|^2}{2\gamma}} = K(x, x')$

By sampling $w_1, \dots, w_M$ we are considering the approximating kernel
$\tilde K(x, x') = \frac{1}{M}\sum_{i=1}^{M} q(x, w_i)\,\overline{q(x', w_i)}$
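A minimal sketch of random Fourier features for the Gaussian kernel followed by ridge regression on the explicit features. It uses the standard real-valued cosine form $z(x) = \sqrt{2/M}\cos(W^T x + b)$ rather than the complex exponential on the slide; the frequencies, phases and toy data are illustrative.

```python
import numpy as np

def rff_features(X, M, gamma=1.0, seed=0):
    """Real-valued random Fourier features approximating exp(-gamma * ||x - x'||^2)."""
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, M))   # frequencies w ~ N(0, 2*gamma*I)
    b = rng.uniform(0, 2 * np.pi, size=M)                   # random phases
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

def rff_ridge_fit(X, y, M, lam, gamma=1.0, seed=0):
    Z = rff_features(X, M, gamma, seed)
    n = X.shape[0]
    return np.linalg.solve(Z.T @ Z + lam * n * np.eye(M), Z.T @ y)

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=500)
c = rff_ridge_fit(X, y, M=100, lam=1e-4)
Ztest = rff_features(X[:5], 100)     # same M, gamma, seed -> same random features
print(Ztest @ c)
```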

90-96 More Random Features

Translation invariant kernels:
$K(x, x') = H(x - x'), \qquad q(x, w) = e^{i w^T x}, \qquad w \sim \mu = \mathcal{F}(H)$

Infinite neural nets kernels:
$q(x, w) = |w^T x + b|_+, \qquad (w, b) \sim \mu = U[\mathbb{S}^d]$

Infinite dot product kernels, homogeneous additive kernels, group invariant kernels, ...

Note: connections with hashing and sketching techniques.
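A minimal sketch of the "infinite neural net" random features $q(x, w) = |w^T x + b|_+$ with $(w, b)$ drawn uniformly on the sphere $\mathbb{S}^d$, as in the slide; the $1/\sqrt{M}$ normalization and the ridge step are illustrative choices.

```python
import numpy as np

def relu_random_features(X, M, seed=0):
    """Features q(x, w) = max(w^T x + b, 0) with (w, b) uniform on the sphere S^d."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(d + 1, M))
    V /= np.linalg.norm(V, axis=0, keepdims=True)   # normalized Gaussians are uniform on S^d
    Xa = np.hstack([X, np.ones((n, 1))])            # append 1 so that w^T x + b = v^T (x, 1)
    return np.maximum(Xa @ V, 0.0) / np.sqrt(M)

# linear ridge regression on these features approximates KRR
# with the corresponding infinite-network kernel
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=300)
Z = relu_random_features(X, M=200)
c = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(200), Z.T @ y)
print(Z[:5] @ c)
```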

97-99 Properties of Random Features

Optimization
Time: $O(n^3) \to O(nM^2)$
Space: $O(n^2) \to O(nM)$

Statistics
As before: do we pay a price for efficient computations?

100-103 Previous works

*Many* different random features for different kernels (Rahimi, Recht 07; Vedaldi, Zisserman; ...)

Theoretical guarantees: mainly kernel approximation (Rahimi, Recht 07, ..., Sriperumbudur and Szabo 15):
$|K(x, x') - \tilde K(x, x')| \lesssim \frac{1}{\sqrt{M}}$

Statistical guarantees suboptimal or in a restricted setting (Rahimi, Recht 09; Yang et al.; Bach 15)

104-108 Main Result

Let
$q(x, w) = e^{i w^T x}, \qquad w \sim \mu(w) = c_d\,\big(1 + \|w\|^2\big)^{-\frac{d+1}{2}}$

Theorem. If $f_* \in H^s$ (Sobolev space), then choosing
$\lambda = n^{-\frac{1}{2s+1}}, \quad M = \frac{1}{\lambda^{2s}}$ gives
$\mathbb{E}\,(\hat f_{\lambda,M}(x) - f_*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$

Random features achieve the optimal bound!
Efficient worst-case subsampling $M \sim \sqrt{n}$, but cannot exploit smoothness.

109-111 Remarks & Extensions

Nyström vs. Random features:
Both achieve optimal rates
Nyström seems to need fewer samples (random centers)

How tight are the results?
[Figure: test error as a function of $\log \lambda$ and of $\log M$.]

112-114 Contributions

Optimal bounds for data dependent/independent subsampling
Subsampling: Nyström vs. Random features
Beyond ridge regression: early stopping and multiple passes SGD (coming up in AISTATS!)

Some questions:
Quest for the best sampling
Regularization by projection: inverse problems and preconditioning
Beyond randomization: non-convex optimization?

Some perspectives:
Computational regularization: subsampling regularizes
Algorithm design: control stability with computations
