Less is More: Computational Regularization by Subsampling


1 Less is More: Computational Regularization by Subsampling Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu joint work with Alessandro Rudi, Raffaello Camoriano Jan 13th, 2016 UCL, London

2-4 A Starting Point

Classically: statistics and optimization are distinct steps in algorithm design.

Large scale: consider the interplay between statistics and optimization! (Bottou, Bousquet 08)

Computational Regularization: computation tricks = regularization

5-7 Supervised Learning

Problem: estimate $f_*$ given $S_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$

[Figure: sample points $(x_1, y_1), \dots, (x_5, y_5)$ and the target function $f_*$]

The Setting
$y_i = f_*(x_i) + \varepsilon_i, \quad i \in \{1, \dots, n\}$
$\varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$ random (with unknown distribution), $f_*$ unknown

8 Outline

Learning with dictionaries and kernels
Data Dependent Subsampling
Data Independent Subsampling

9-14 Non-linear/non-parametric learning

$f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$

$q$ non-linear function
$w_i \in \mathbb{R}^d$ centers
$c_i \in \mathbb{R}$ coefficients
$M = M_n$ could/should grow with $n$

Question: How to choose $w_i$, $c_i$ and $M$ given $S_n$?
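To make the dictionary model above concrete, here is a minimal NumPy sketch of evaluating $f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$ with a Gaussian bump as the non-linear unit $q$; the choice of $q$, the bandwidth and the toy centers/coefficients are illustrative assumptions, not part of the talk.

```python
import numpy as np

def q(x, w, gamma=1.0):
    """Example non-linear unit: a Gaussian bump centred at w."""
    return np.exp(-gamma * np.sum((x - w) ** 2, axis=-1))

def f(x, centers, coeffs, gamma=1.0):
    """Dictionary expansion f(x) = sum_i c_i q(x, w_i)."""
    # x: (d,) single point; centers: (M, d); coeffs: (M,)
    return np.sum(coeffs * q(x[None, :], centers, gamma))

# toy usage: M = 3 random centers in R^2 with random coefficients
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 2))
coeffs = rng.normal(size=3)
print(f(np.array([0.1, -0.2]), centers, coeffs))
```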

15-16 Learning with Positive Definite Kernels

There is an elegant answer if:
$q$ is symmetric
all the matrices $Q_{ij} = q(x_i, x_j)$ are positive semi-definite (i.e. they have non-negative eigenvalues)

Representer Theorem (Kimeldorf, Wahba 70; Schölkopf et al. 01)
$M = n$, $w_i = x_i$, $c_i$ by convex optimization!

17-19 Kernel Ridge Regression (KRR) a.k.a. Penalized Least Squares

$\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2$

where
$\mathcal{H} = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\; c_i \in \mathbb{R},\; \underbrace{w_i \in \mathbb{R}^d}_{\text{any center!}},\; \underbrace{M \in \mathbb{N}}_{\text{any length!}} \}$

Solution
$\hat f_\lambda = \sum_{i=1}^{n} c_i\, q(x, x_i)$ with $c = (\hat Q + \lambda n I)^{-1} \hat y$
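A minimal NumPy sketch of the closed-form KRR solution $c = (\hat Q + \lambda n I)^{-1}\hat y$, assuming a Gaussian kernel and toy data; the helper names (gaussian_kernel, krr_fit, krr_predict) are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # pairwise squared distances between the rows of A and B
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam, gamma=1.0):
    """Representer coefficients c = (Q + lam * n * I)^{-1} y."""
    n = X.shape[0]
    Q = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(Q + lam * n * np.eye(n), y)

def krr_predict(Xtest, Xtrain, c, gamma=1.0):
    return gaussian_kernel(Xtest, Xtrain, gamma) @ c

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)
c = krr_fit(X, y, lam=1 / np.sqrt(200))
print(krr_predict(X[:5], X, c))
```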

20-25 KRR: Statistics

Well understood statistical properties:

Classical Theorem. If $f_* \in \mathcal{H}$, then choosing $\lambda = \frac{1}{\sqrt{n}}$ gives
$\mathbb{E}\,(\hat f_\lambda(x) - f_*(x))^2 \lesssim \frac{1}{\sqrt{n}}$

Remarks
1. Optimal nonparametric bound
2. Results for general kernels (e.g. splines/Sobolev etc.):
$\lambda = n^{-\frac{1}{2s+1}}, \qquad \mathbb{E}\,(\hat f_\lambda(x) - f_*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$
3. Adaptive tuning via cross validation
4. Proofs: analysis/linear algebra + random matrices (Smale and Zhou; Caponnetto, De Vito, R.; Steinwart)
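Remark 3 mentions adaptive tuning via cross validation; below is a hedged hold-out sketch that scans a grid of $\lambda$ values and keeps the one with the smallest validation error. The grid, the split and the Gaussian kernel are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

# toy data, split into training and hold-out validation sets
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=300)
Xtr, ytr, Xval, yval = X[:200], y[:200], X[200:], y[200:]

Qtr = gaussian_kernel(Xtr, Xtr)
Qval = gaussian_kernel(Xval, Xtr)
lambdas = np.logspace(-6, 0, 13)          # candidate regularization values
errs = []
for lam in lambdas:
    c = np.linalg.solve(Qtr + lam * 200 * np.eye(200), ytr)
    errs.append(np.mean((Qval @ c - yval) ** 2))
print("selected lambda:", lambdas[int(np.argmin(errs))])
```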

26-27 KRR: Optimization

$\hat f_\lambda = \sum_{i=1}^{n} c_i\, q(x, x_i)$ with $c = (\hat Q + \lambda n I)^{-1} \hat y$

Linear system $(\hat Q + \lambda n I)\, c = \hat y$: Space $O(n^2)$, Time $O(n^3)$

BIG DATA? Running out of time and space... Can this be fixed?

28-31 Beyond Tikhonov: Spectral Filtering

$(\hat Q + \lambda I)^{-1}$: approximation of $\hat Q^{-1}$ controlled by $\lambda$

Can we approximate $\hat Q^{-1}$ while saving computations? Yes!

Spectral filtering (Engl 96 - inverse problems; Rosasco et al. 05 - ML)
$g_\lambda(\hat Q) \approx \hat Q^{-1}$
The filter function $g_\lambda$ defines the form of the approximation

32-34 Spectral filtering

Examples
Tikhonov - ridge regression
Truncated SVD - principal component regression
Landweber iteration - GD / $L^2$-boosting
nu-method - accelerated GD / Chebyshev method
...

Landweber iteration (truncated power series)...
$c^t = g_t(\hat Q)\,\hat y = \gamma \sum_{r=0}^{t-1} (I - \gamma \hat Q)^r\, \hat y$

... it's GD for ERM!
$c^r = c^{r-1} - \gamma(\hat Q c^{r-1} - \hat y), \qquad c^0 = 0, \quad r = 1, \dots, t$
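A minimal sketch of the Landweber/GD iteration $c^r = c^{r-1} - \gamma(\hat Q c^{r-1} - \hat y)$ on a toy kernel problem; the step-size rule and the Gaussian kernel are illustrative assumptions. The number of iterations $t$ plays the role of the regularization parameter.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def landweber(Q, y, t=100, gamma=None):
    """Iterate c_r = c_{r-1} - gamma * (Q c_{r-1} - y) with c_0 = 0, for t steps."""
    if gamma is None:
        gamma = 1.0 / np.linalg.norm(Q, 2)   # step size below 2/||Q|| keeps the iteration stable
    c = np.zeros(Q.shape[0])
    for _ in range(t):
        c = c - gamma * (Q @ c - y)
    return c

# toy usage: the iteration count t is the regularization parameter
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=100)
c = landweber(gaussian_kernel(X, X), y, t=50)
print(c[:5])
```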

35 Semiconvergence

[Figure: empirical error and expected error as a function of the iteration, illustrating semiconvergence.]

36-39 Early Stopping at Work

[Figure sequence: fit on the training set as the number of iterations grows.]

40-41 Statistics and computations with spectral filtering

The different filters achieve essentially the same optimal statistical error!
Difference is in computations:

Filter           Time                      Space
Tikhonov         $n^3$                     $n^2$
GD               $n^2 \lambda^{-1}$        $n^2$
Accelerated GD   $n^2 \lambda^{-1/2}$      $n^2$
Truncated SVD    $n^2 \lambda^{-\gamma}$   $n^2$

42-46 Computational Regularization Arises

Computational regularization: iterations control statistics and time complexity
Built-in regularization path

Is there an advantage in going for online learning?
Not much, maybe $n^2 \log n$? (Dieuleveut, Bach 15; R., Villa 15)

Computational regularization: principles to control statistics, time and space complexity

47 Outline

Learning with dictionaries and kernels
Data Dependent Subsampling
Data Independent Subsampling

48-52 Subsampling

1. Pick $w_i$ at random... from the training set (Smola, Schölkopf 00):
$\{w_1, \dots, w_M\} \subseteq \{x_1, \dots, x_n\}, \qquad M \le n$

2. Perform KRR on
$\mathcal{H}_M = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\; c_i \in \mathbb{R} \}$

Linear system $\hat Q_M c = \hat y$: Space $O(n^2) \to O(nM)$, Time $O(n^3) \to O(nM^2)$

What about statistics? What's the price for efficient computations?
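A minimal sketch of Nyström-subsampled KRR with uniformly chosen centers, written with the standard normal-equations form $(K_{nM}^T K_{nM} + \lambda n K_{MM})\, c = K_{nM}^T \hat y$; the Gaussian kernel, the jitter term and the toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def nystrom_krr_fit(X, y, M, lam, gamma=1.0, seed=0):
    """KRR restricted to M centers sampled uniformly from the training set.

    Solves (K_nM^T K_nM + lam * n * K_MM) c = K_nM^T y for c in R^M.
    """
    n = X.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=M, replace=False)  # uniform subsampling
    W = X[idx]
    K_nM = gaussian_kernel(X, W, gamma)
    K_MM = gaussian_kernel(W, W, gamma)
    A = K_nM.T @ K_nM + lam * n * K_MM + 1e-10 * np.eye(M)   # small jitter for stability
    c = np.linalg.solve(A, K_nM.T @ y)
    return W, c

def nystrom_krr_predict(Xtest, W, c, gamma=1.0):
    return gaussian_kernel(Xtest, W, gamma) @ c

# toy usage: M ~ sqrt(n) centers
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=400)
W, c = nystrom_krr_fit(X, y, M=20, lam=1 / np.sqrt(400))
print(nystrom_krr_predict(X[:5], W, c))
```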

53-55 Putting our Result in Context

*Many* different subsampling schemes (Smola, Schölkopf 00; Williams, Seeger 01; ...)

Theoretical guarantees mainly on matrix approximation (Mahoney and Drineas 09; Cortes et al. 10; Kumar et al. ...):
$\|\hat Q - \hat Q_M\| \lesssim \frac{1}{\sqrt{M}}$

Few prediction guarantees, either suboptimal or in a restricted setting (Cortes et al. 10; Jin et al. 11; Bach 13; Alaoui, Mahoney 14)

56-61 Main Result (Rudi, Camoriano, Rosasco, 15)

Theorem. If $f_* \in \mathcal{H}$, then choosing
$\lambda = \frac{1}{\sqrt{n}}, \quad M = \frac{1}{\lambda}$ gives
$\mathbb{E}\,(\hat f_{\lambda,M}(x) - f_*(x))^2 \lesssim \frac{1}{\sqrt{n}}$

Remarks
1. Subsampling achieves the optimal bound with $M \sim \sqrt{n} \ll n$!!
3. More generally,
$\lambda = n^{-\frac{1}{2s+1}}, \quad M = \frac{1}{\lambda}, \qquad \mathbb{E}_x\,(\hat f_{\lambda,M}(x) - f_*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$

Note: an interesting insight is obtained by rewriting the result...

62-68 Computational Regularization by Subsampling (Rudi, Camoriano, Rosasco, 15)

A simple idea: swap the roles of $\lambda$ and $M$...

Theorem. If $f_* \in \mathcal{H}$, then choosing
$M = n^{\frac{1}{2s+1}}, \quad \lambda = \frac{1}{M}$ gives
$\mathbb{E}_x\,(\hat f_{\lambda,M}(x) - f_*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$

$\lambda$ and $M$ play the same role; new interpretation: subsampling regularizes!

A new, natural incremental algorithm (see the sketch below)...

Algorithm
1. Pick a center + compute solution
2. Pick another center + rank-one update
3. Pick another center...
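A hedged sketch of the incremental idea: add one random center at a time and monitor the validation error, so that $M$ acts as the regularization parameter. For simplicity this sketch re-solves the linear system at every step instead of performing the rank-one (Cholesky-style) updates used by the actual algorithm; the Gaussian kernel and toy data are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def incremental_core(Xtr, ytr, Xval, yval, lam, max_M=50, gamma=1.0, seed=0):
    """Add one random center at a time and record the validation error for each M."""
    n = Xtr.shape[0]
    order = np.random.default_rng(seed).permutation(n)
    errs = []
    for M in range(1, max_M + 1):
        W = Xtr[order[:M]]                              # first M randomly ordered centers
        K_nM = gaussian_kernel(Xtr, W, gamma)
        K_MM = gaussian_kernel(W, W, gamma)
        c = np.linalg.solve(K_nM.T @ K_nM + lam * n * K_MM + 1e-10 * np.eye(M),
                            K_nM.T @ ytr)
        pred = gaussian_kernel(Xval, W, gamma) @ c
        errs.append(np.mean((pred - yval) ** 2))
    return np.array(errs)

# toy usage: pick M at (or near) the minimum of the validation curve
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=300)
errs = incremental_core(X[:200], y[:200], X[200:], y[200:], lam=1e-6, max_M=40)
print("selected M:", 1 + int(np.argmin(errs)))
```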

69 CoRe Illustrated

[Figure: validation error as a function of the number of centers, with $n$ and $\lambda$ fixed.]

Computation controls stability! Time/space requirements tailored to generalization

70 Experiments

Comparable/better w.r.t. the state of the art

[Table: test performance (mean ± std) on the Insurance Company, CPU, CT slices, Year Prediction and Forest datasets (with their $n_{tr}$ and $d$), comparing Incremental CoRe, Standard KRLS, Standard Nyström, Random Features and Fastfood; the numeric entries were not recovered in this transcription.]

Random Features (Rahimi, Recht 07); Fastfood (Le et al. 13)

71-72 Summary so far

Optimal learning with data dependent subsampling
Computational regularization: subsampling regularizes!

Few more questions:
Can one do better than uniform sampling?
What about data independent sampling?

73-75 (Approximate) Leverage scores

Leverage scores
$l_i(t) = \big(\hat Q(\hat Q + t n I)^{-1}\big)_{ii}$

ALS: with probability at least $1 - \delta$,
$\frac{1}{T}\, l_i(t) \le \hat l_i(t) \le T\, l_i(t)$

ALS subsampling: pick $w_1, \dots, w_M$ from $x_1, \dots, x_n$ with replacement, with probability $P_t(i) = \hat l_i(t) / \sum_j \hat l_j(t)$.
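A minimal sketch of ridge leverage scores $l_i(t) = (\hat Q(\hat Q + t n I)^{-1})_{ii}$ and of ALS-style sampling with replacement. This computes the exact scores at $O(n^3)$ cost, purely for illustration; the point of approximate leverage scores is precisely to avoid this.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def ridge_leverage_scores(Q, t):
    """Exact scores l_i(t) = (Q (Q + t*n*I)^{-1})_{ii}; O(n^3), for illustration only."""
    n = Q.shape[0]
    return np.diag(Q @ np.linalg.inv(Q + t * n * np.eye(n)))

def als_subsample(Q, t, M, seed=0):
    """Sample M center indices with replacement, with probability proportional to the scores."""
    scores = ridge_leverage_scores(Q, t)
    p = scores / scores.sum()
    return np.random.default_rng(seed).choice(Q.shape[0], size=M, replace=True, p=p)

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
idx = als_subsample(gaussian_kernel(X, X), t=1e-3, M=20)
print(idx)
```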

76-80 Learning with ALS Subsampling (Rudi, Camoriano, Rosasco, 15)

Theorem. If $f_* \in \mathcal{H}$ and the integral operator associated to $q$ has eigenvalue decay $i^{-1/\gamma}$, $\gamma \in (0, 1)$, then choosing
$\lambda = n^{-\frac{1}{1+\gamma}}, \quad M = \frac{1}{\lambda^{\gamma}}$ gives
$\mathbb{E}\,(\hat f_{\lambda,M}(x) - f_*(x))^2 \lesssim n^{-\frac{1}{1+\gamma}}$

Non-uniform subsampling achieves the optimal bound
Most importantly: potentially much fewer samples needed
Need efficient ALS computation (...)
Extensions to more general sampling schemes are possible

81 Outline

Learning with dictionaries and kernels
Data Dependent Subsampling
Data Independent Subsampling

82-85 Random Features

$f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i)$, with $q$ a general non-linear function

Pick $w_i$ at random according to a distribution $\mu$:
$w_1, \dots, w_M \sim \mu$

Perform KRR on
$\mathcal{H}_M = \{ f \mid f(x) = \sum_{i=1}^{M} c_i\, q(x, w_i),\; c_i \in \mathbb{R} \}$

86-89 Random Fourier Features (Rahimi, Recht 07)

Consider
$q(x, w) = e^{i w^T x}, \qquad w \sim \mu(w) = \mathcal{N}(0, I)$

Then
$\mathbb{E}_w[q(x, w)\,\overline{q(x', w)}] = e^{-\frac{\|x - x'\|^2}{2\gamma}} = K(x, x')$

By sampling $w_1, \dots, w_M$ we are considering the approximating kernel
$\tilde K(x, x') = \frac{1}{M}\sum_{i=1}^{M} q(x, w_i)\,\overline{q(x', w_i)}$
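A minimal sketch of random Fourier features for the Gaussian kernel followed by ridge regression on the explicit features. It uses the standard real-valued cosine form $z(x) = \sqrt{2/M}\cos(W^T x + b)$ rather than the complex exponential on the slide; the frequencies, phases and toy data are illustrative.

```python
import numpy as np

def rff_features(X, M, gamma=1.0, seed=0):
    """Real-valued random Fourier features approximating exp(-gamma * ||x - x'||^2)."""
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, M))   # frequencies w ~ N(0, 2*gamma*I)
    b = rng.uniform(0, 2 * np.pi, size=M)                   # random phases
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

def rff_ridge_fit(X, y, M, lam, gamma=1.0, seed=0):
    Z = rff_features(X, M, gamma, seed)
    n = X.shape[0]
    return np.linalg.solve(Z.T @ Z + lam * n * np.eye(M), Z.T @ y)

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=500)
c = rff_ridge_fit(X, y, M=100, lam=1e-4)
Ztest = rff_features(X[:5], 100)     # same M, gamma, seed -> same random features
print(Ztest @ c)
```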

90-96 More Random Features

Translation invariant kernels:
$K(x, x') = H(x - x'), \qquad q(x, w) = e^{i w^T x}, \qquad w \sim \mu = \mathcal{F}(H)$

Infinite neural nets kernels:
$q(x, w) = |w^T x + b|_+, \qquad (w, b) \sim \mu = U[\mathbb{S}^d]$

Infinite dot product kernels, homogeneous additive kernels, group invariant kernels, ...

Note: connections with hashing and sketching techniques.
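A minimal sketch of the "infinite neural net" random features $q(x, w) = |w^T x + b|_+$ with $(w, b)$ drawn uniformly on the sphere $\mathbb{S}^d$, as in the slide; the $1/\sqrt{M}$ normalization and the ridge step are illustrative choices.

```python
import numpy as np

def relu_random_features(X, M, seed=0):
    """Features q(x, w) = max(w^T x + b, 0) with (w, b) uniform on the sphere S^d."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(d + 1, M))
    V /= np.linalg.norm(V, axis=0, keepdims=True)   # normalized Gaussians are uniform on S^d
    Xa = np.hstack([X, np.ones((n, 1))])            # append 1 so that w^T x + b = v^T (x, 1)
    return np.maximum(Xa @ V, 0.0) / np.sqrt(M)

# linear ridge regression on these features approximates KRR
# with the corresponding infinite-network kernel
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=300)
Z = relu_random_features(X, M=200)
c = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(200), Z.T @ y)
print(Z[:5] @ c)
```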

97-99 Properties of Random Features

Optimization
Time: $O(n^3) \to O(nM^2)$
Space: $O(n^2) \to O(nM)$

Statistics
As before: do we pay a price for efficient computations?

100-103 Previous works

*Many* different random features for different kernels (Rahimi, Recht 07; Vedaldi, Zisserman; ...)

Theoretical guarantees: mainly kernel approximation (Rahimi, Recht 07, ..., Sriperumbudur and Szabo 15):
$|K(x, x') - \tilde K(x, x')| \lesssim \frac{1}{\sqrt{M}}$

Statistical guarantees suboptimal or in a restricted setting (Rahimi, Recht 09; Yang et al.; Bach 15)

104-108 Main Result

Let
$q(x, w) = e^{i w^T x}, \qquad w \sim \mu(w) = c_d\,\big(1 + \|w\|^2\big)^{-\frac{d+1}{2}}$

Theorem. If $f_* \in H^s$ (Sobolev space), then choosing
$\lambda = n^{-\frac{1}{2s+1}}, \quad M = \frac{1}{\lambda^{2s}}$ gives
$\mathbb{E}\,(\hat f_{\lambda,M}(x) - f_*(x))^2 \lesssim n^{-\frac{2s}{2s+1}}$

Random features achieve the optimal bound!
Efficient worst-case subsampling $M \sim \sqrt{n}$, but cannot exploit smoothness.

109-111 Remarks & Extensions

Nyström vs. Random features:
Both achieve optimal rates
Nyström seems to need fewer samples (random centers)

How tight are the results?
[Figure: test error as a function of $\log \lambda$ and of $\log M$.]

112-114 Contributions

Optimal bounds for data dependent/independent subsampling
Subsampling: Nyström vs. Random features
Beyond ridge regression: early stopping and multiple passes SGD (coming up in AISTATS!)

Some questions:
Quest for the best sampling
Regularization by projection: inverse problems and preconditioning
Beyond randomization: non-convex optimization?

Some perspectives:
Computational regularization: subsampling regularizes
Algorithm design: control stability with computations
