Less is More: Computational Regularization by Subsampling

Lorenzo Rosasco
University of Genova - Istituto Italiano di Tecnologia - Massachusetts Institute of Technology (lcsl.mit.edu)
Joint work with Alessandro Rudi and Raffaello Camoriano. Paris.

A Starting Point

Classically, statistics and optimization are distinct steps in algorithm design: empirical process theory + optimization.

Large scale: consider the interplay between statistics and optimization! (Bottou, Bousquet 08)

Computational regularization: computational tricks = regularization.

Supervised Learning

Problem: estimate f* given S_n = {(x_1, y_1), ..., (x_n, y_n)}.

[figure: target function f* and sample points (x_1, y_1), ..., (x_5, y_5)]

The setting:
- y_i = f*(x_i) + ε_i, i ∈ {1, ..., n}
- ε_i ∈ R, x_i ∈ R^d random (bounded but with unknown distribution)
- f* unknown
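For concreteness, a minimal Python sketch of this data model; the choice of f*, the uniform input distribution and the noise level are illustrative assumptions, not those used in the talk.

```python
import numpy as np

# Toy instance of y_i = f*(x_i) + eps_i with bounded random inputs.
# f*, the input distribution and the noise level are illustrative choices.
rng = np.random.default_rng(0)
n, d = 200, 1
X = rng.uniform(-1.0, 1.0, size=(n, d))           # x_i in R^d, bounded
f_star = lambda x: np.sin(3.0 * np.pi * x[:, 0])   # "unknown" target f*
y = f_star(X) + 0.1 * rng.standard_normal(n)       # y_i = f*(x_i) + eps_i
```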

Outline
- Nonparametric learning
- Data dependent subsampling
- Data independent subsampling

Non-linear/nonparametric learning

f(x) = \sum_{i=1}^{M} c_i q(x, w_i)

- q: a nonlinear function
- w_i ∈ R^d: centers
- c_i ∈ R: coefficients
- M = M_n could/should grow with n

Question: how to choose w_i, c_i and M given S_n?

Learning with Positive Definite Kernels

There is an elegant answer if:
- q is symmetric
- all the matrices Q_{ij} = q(x_i, x_j) are positive semi-definite (i.e. they have non-negative eigenvalues)

Representer Theorem (Kimeldorf, Wahba 70; Schölkopf et al. 01): M = n, w_i = x_i, and the c_i are obtained by convex optimization!

Kernel Ridge Regression (KRR), a.k.a. Tikhonov Regularization

\hat{f}_\lambda = \arg\min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2

where

H = \{ f \mid f(x) = \sum_{i=1}^{M} c_i q(x, w_i), \; c_i \in R, \; w_i \in R^d \text{ (any center!)}, \; M \in N \text{ (any length!)} \}

and the norm is induced by the inner product \langle f, f' \rangle = \sum_{i,j} c_i c'_j q(x_i, x_j).

Solution:

\hat{f}_\lambda(x) = \sum_{i=1}^{n} c_i q(x, x_i), \quad c = (\hat{Q} + \lambda n I)^{-1} \hat{y}
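A minimal sketch of the closed-form KRR solution above; the Gaussian kernel and its bandwidth are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Kernel matrix Q_ij = q(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    """Solve c = (Q + lam * n * I)^{-1} y."""
    n = len(X)
    Q = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(Q + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_new, sigma=1.0):
    """Evaluate f_lambda(x) = sum_i c_i q(x, x_i) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ c

# e.g. c = krr_fit(X, y, lam=1.0 / np.sqrt(len(X))); y_hat = krr_predict(X, c, X)
```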

KRR: Statistics

Well understood statistical properties.

Classical Theorem: if f* ∈ H, then choosing λ = 1/\sqrt{n} gives

E (\hat{f}_\lambda(x) - f*(x))^2 \lesssim \frac{1}{\sqrt{n}}.

Remarks:
1. Optimal nonparametric bound.
2. More refined results for smooth kernels: λ = n^{-1/(2s+1)} gives E (\hat{f}_\lambda(x) - f*(x))^2 \lesssim n^{-2s/(2s+1)}.
3. Adaptive tuning, e.g. via cross validation.
4. Proofs: inverse problems results + random matrices (Smale and Zhou; Caponnetto, De Vito, Rosasco).

KRR: Optimization

\hat{f}_\lambda(x) = \sum_{i=1}^{n} c_i q(x, x_i), \quad c = (\hat{Q} + \lambda n I)^{-1} \hat{y}

Linear system: (\hat{Q} + \lambda n I) c = \hat{y}. Complexity: space O(n^2), time O(n^3).

BIG DATA? Running out of time and space... Can this be fixed?

Beyond Tikhonov: Spectral Filtering

(\hat{Q} + \lambda I)^{-1} is an approximation of \hat{Q}^{-1} controlled by λ.

Can we approximate \hat{Q}^{-1} while saving computations? Yes!

Spectral filtering (Engl 96 - inverse problems; Rosasco et al. 05 - ML):

g_\lambda(\hat{Q}) \approx \hat{Q}^{-1}

The filter function g_λ defines the form of the approximation.

Spectral Filtering: Examples

- Tikhonov - ridge regression
- Truncated SVD - principal component regression
- Landweber iteration - GD / L2-boosting
- nu-method - accelerated GD / Chebyshev method
- ...

Landweber iteration (truncated power series):

c_t = g_t(\hat{Q}) \hat{y} = \gamma \sum_{r=0}^{t-1} (I - \gamma \hat{Q})^r \hat{y}

... it's gradient descent for ERM! For r = 1, ..., t:

c_r = c_{r-1} - \gamma (\hat{Q} c_{r-1} - \hat{y}), \quad c_0 = 0
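A minimal sketch of the Landweber filter above as plain gradient descent; whether \hat{Q} is normalized by n and the default step size are assumptions of this sketch.

```python
import numpy as np

def landweber(Q, y, t, gamma=None):
    """Gradient descent on ERM with kernel matrix Q:
        c_r = c_{r-1} - gamma * (Q c_{r-1} - y),  c_0 = 0,
    i.e. c_t = gamma * sum_{r<t} (I - gamma Q)^r y.
    The iteration count t plays the role of 1/lambda."""
    if gamma is None:
        gamma = 1.0 / np.linalg.norm(Q, 2)   # step size below 2/||Q|| for stability
    c = np.zeros(len(y))
    for _ in range(t):
        c = c - gamma * (Q @ c - y)
    return c
```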

Statistics and Computations with Spectral Filtering

The different filters achieve essentially the same optimal statistical error! The difference is in computations (note: λ^{-1} = t for iterative methods).

Filter          | Time          | Space
Tikhonov        | n^3           | n^2
GD              | n^2 λ^{-1}    | n^2
Accelerated GD  | n^2 λ^{-1/2}  | n^2
Truncated SVD   | n^2 λ^{-γ}    | n^2

Semiconvergence

[figure: empirical error and expected error as a function of the iteration number]

Iterations control statistics and time complexity.

Computational Regularization

BIG DATA? Running out of time and space...

Is there a principle to control statistics, time and space complexity?

Outline
- Nonparametric learning
- Data dependent subsampling
- Data independent subsampling

Subsampling

1. Pick the w_i at random from the training set (Smola, Schölkopf 00): w_1, ..., w_M ⊂ {x_1, ..., x_n}, with M ≤ n.
2. Perform KRR on H_M = \{ f \mid f(x) = \sum_{i=1}^{M} c_i q(x, w_i), \; c_i \in R \}.

Linear system: \hat{Q}_M c = \hat{y}. Complexity: space O(n^2) → O(nM), time O(n^3) → O(nM^2).

What about statistics? What's the price for efficient computations?
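A minimal sketch of KRR restricted to M uniformly subsampled centers. The particular linear system below, (K_nM^T K_nM + λ n K_MM) c = K_nM^T \hat{y}, is one standard way of writing the subsampled estimator, and the small jitter is added only for numerical stability; both are assumptions of this sketch.

```python
import numpy as np

def nystrom_krr(X, y, M, lam, kernel, rng=None):
    """KRR on H_M with M centers drawn uniformly from the training set.
    Solves (K_nM^T K_nM + lam * n * K_MM) c = K_nM^T y, so that
    f(x) = sum_j c_j q(x, w_j). Cost: O(n M) memory, O(n M^2 + M^3) time."""
    rng = np.random.default_rng(rng)
    n = len(X)
    idx = rng.choice(n, size=M, replace=False)   # uniform subsampling of centers
    W = X[idx]
    K_nM = kernel(X, W)                          # n x M cross-kernel matrix
    K_MM = kernel(W, W)                          # M x M kernel among centers
    A = K_nM.T @ K_nM + lam * n * K_MM + 1e-10 * np.eye(M)   # jitter (assumption)
    c = np.linalg.solve(A, K_nM.T @ y)
    return W, c

# Prediction on new points: kernel(X_new, W) @ c  (kernel as in gaussian_kernel above)
```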

Putting our Result in Context

- *Many* different subsampling schemes (Smola, Schölkopf 00; Williams, Seeger 01; ...).
- Theoretical guarantees mainly on matrix approximation (Mahoney and Drineas 09; Cortes et al. 10; Kumar et al. ...): \|Q - Q_M\| \lesssim \frac{1}{\sqrt{M}}.
- Statistical guarantees suboptimal or in restricted settings (Cortes et al. 10; Jin et al. 11; Bach 13; Alaoui, Mahoney 14).

Main Result (Rudi, Camoriano, Rosasco 15)

Theorem: if f* ∈ H, then choosing λ = 1/\sqrt{n} and M = 1/λ gives

E (\hat{f}_{\lambda,M}(x) - f*(x))^2 \lesssim \frac{1}{\sqrt{n}}.

Remarks:
1. Subsampling achieves the optimal bound... with M ≈ \sqrt{n}!
2. More generally, λ = n^{-1/(2s+1)} and M = 1/λ give E_x (\hat{f}_{\lambda,M}(x) - f*(x))^2 \lesssim n^{-2s/(2s+1)}.

Note: an interesting insight is obtained by rewriting the result...

Computational Regularization by Subsampling (Rudi, Camoriano, Rosasco 15)

A simple idea: swap the roles of λ and M...

Theorem: if f* ∈ H with a smooth kernel, then choosing M = n^{1/(2s+1)} and λ = 1/M gives

E_x (\hat{f}_{\lambda,M}(x) - f*(x))^2 \lesssim n^{-2s/(2s+1)}.

λ and M play the same role - new interpretation: subsampling regularizes!

This suggests a natural incremental algorithm (a minimal sketch follows the list):
1. Pick a center + compute the solution.
2. Pick another center + rank-one update.
3. Pick another center...
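A minimal sketch of the incremental idea: add one random center at a time and monitor validation error. For clarity it re-solves the Nyström system at every step rather than doing the rank-one (Cholesky) updates used in practice; the system being solved and the validation split are assumptions of this sketch.

```python
import numpy as np

def incremental_nystrom_path(X, y, X_val, y_val, M_max, lam, kernel, rng=None):
    """Grow the number of centers M one at a time and record validation error;
    the resulting curve shows how M acts as a regularization parameter.
    (Re-solving is O(n M^2) per step; the real algorithm uses rank-one updates.)"""
    rng = np.random.default_rng(rng)
    order = rng.permutation(len(X))[:M_max]
    val_errors = []
    for M in range(1, M_max + 1):
        W = X[order[:M]]
        K_nM, K_MM = kernel(X, W), kernel(W, W)
        A = K_nM.T @ K_nM + lam * len(X) * K_MM + 1e-10 * np.eye(M)
        c = np.linalg.solve(A, K_nM.T @ y)
        pred = kernel(X_val, W) @ c
        val_errors.append(np.mean((pred - y_val) ** 2))
    return val_errors
```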

Nyström CoRe Illustrated

[figure: validation error as centers are added incrementally, with n and λ fixed]

Computation controls stability! Time/space requirements are tailored to generalization.

Experiments

Comparable to or better than the state of the art.

[table: test errors (± std) on the Ins. Co., CPU, CT slices, Year Pred. and Forest datasets (with n_tr and d) for Incremental CoRe, standard KRLS, standard Nyström, Random Features and Fastfood RF]

Random Features (Rahimi, Recht 07); Fastfood (Le et al. 13).

Summary So Far

- Optimal learning with data dependent subsampling.
- Computational regularization: subsampling regularizes!

A few more questions:
- Can one do better than uniform sampling? Yes: leverage score sampling...
- What about data independent sampling?

Outline
- Nonparametric learning
- Data dependent subsampling
- Data independent subsampling

Random Features

f(x) = \sum_{i=1}^{M} c_i q(x, w_i)

- q: a general nonlinear function
- pick the w_i at random according to a distribution μ: w_1, ..., w_M ∼ μ
- perform KRR on H_M = \{ f \mid f(x) = \sum_{i=1}^{M} c_i q(x, w_i), \; c_i \in R \}.

Random Fourier Features (Rahimi, Recht 07)

Consider q(x, w) = e^{i w^T x} with w ∼ μ(w) = N(0, I). Then

E_w [ q(x, w) \overline{q(x', w)} ] = e^{-\|x - x'\|^2 / (2\gamma)} = K(x, x').

By sampling w_1, ..., w_M we are considering the approximate kernel

\frac{1}{M} \sum_{i=1}^{M} q(x, w_i) \overline{q(x', w_i)} = K_M(x, x').
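A minimal real-valued sketch of random Fourier features for the Gaussian kernel K(x, x') = exp(-‖x - x'‖^2 / (2γ)); the cos/sin parameterization and drawing w ∼ N(0, I/γ) are standard choices but assumptions of this sketch.

```python
import numpy as np

def random_fourier_features(X, M, gamma=1.0, rng=None):
    """Feature map phi with phi(x)^T phi(x') ≈ exp(-||x - x'||^2 / (2 * gamma)).
    Uses M/2 frequencies with paired cos/sin features (total dimension ~ M)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / np.sqrt(gamma), size=(d, M // 2))  # w ~ N(0, I/gamma)
    Z = X @ W                                                     # projections w^T x
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(M // 2)

# phi(x)^T phi(x') = K_M(x, x') -> K(x, x') as M grows (error ~ 1/sqrt(M))
```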

More Random Features

- Translation invariant kernels: K(x, x') = H(x - x'), q(x, w) = e^{i w^T x}, w ∼ μ = F(H) (the Fourier transform of H)
- Infinite neural net kernels: q(x, w) = (w^T x + b)_+, (w, b) ∼ μ = U[S^d]
- Infinite dot product kernels
- Homogeneous additive kernels
- Group invariant kernels
- ...

Note: connections with hashing and sketching techniques.

Properties of Random Features

Optimization: time O(n^3) → O(nM^2), space O(n^2) → O(nM).

Statistics: as before - do we pay a price for efficient computations?
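The complexity gain comes from replacing the n × n kernel system with an M × M system on the explicit feature matrix. A minimal sketch; the λ n scaling of the penalty mirrors the KRR formulation above and is otherwise an assumption.

```python
import numpy as np

def ridge_on_features(Phi, y, lam):
    """Linear ridge regression on an n x M feature matrix Phi:
    solve (Phi^T Phi + lam * n * I) c = Phi^T y.
    Cost: O(n M) space and O(n M^2 + M^3) time, vs. O(n^2) / O(n^3) for exact KRR."""
    n, M = Phi.shape
    return np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(M), Phi.T @ y)

# e.g. Phi = random_fourier_features(X, M=300, gamma=0.5)
#      c = ridge_on_features(Phi, y, lam=1.0 / np.sqrt(len(y)))
# (reuse the same random frequencies W when featurizing test points)
```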

Previous Works

- *Many* different random features for different kernels (Rahimi, Recht 07; Vedaldi, Zisserman; ...).
- Theoretical guarantees mainly on kernel approximation (Rahimi, Recht 07; ...; Sriperumbudur and Szabo 15): |K(x, x') - K_M(x, x')| \lesssim \frac{1}{\sqrt{M}}.
- Statistical guarantees suboptimal or in restricted settings (Rahimi, Recht 09; Yang et al.; Bach 15; ...).

Main Result

Let q(x, w) = e^{i w^T x} with w ∼ μ(w) = c_d (1 + \|w\|^2)^{-\frac{d+1}{2}}.

Theorem: if f* ∈ H^s (a Sobolev space), then choosing λ = n^{-1/(2s+1)} and M = 1/λ^{2s} gives

E (\hat{f}_{\lambda,M}(x) - f*(x))^2 \lesssim n^{-2s/(2s+1)}.

Random features achieve the optimal bound! Efficient in the worst case (M ≈ \sqrt{n}), but they cannot exploit smoothness.

Remarks & Extensions

Nyström vs random features:
- Both achieve optimal rates.
- Nyström seems to need fewer samples (random centers).

How tight are the results?

[figure: test error as a function of log λ and log M]

Contributions

- Optimal bounds for data dependent/independent subsampling.
- Subsampling: Nyström vs random features.
- Beyond ridge regression: early stopping and multiple-pass SGD (see arXiv).

Some questions:
- Quest for the best sampling.
- Regularization by projection: inverse problems and preconditioning.
- Beyond randomization: non-convex neural net optimization?

Some perspectives:
- Computational regularization: subsampling regularizes.
- Algorithm design: control stability for good statistics/computations.
