Stochastic optimization in Hilbert spaces


Aymeric Dieuleveut

Outline
- Learning vs statistics
- Tradeoffs of large scale learning: algorithm, ERM, complexity
- Stochastic optimization: why is SGD so useful in learning?
- A simple case: least mean squares in finite dimension
- Higher dimension? RKHS, non-parametric learning
- Lower complexity? Column sampling, feature selection

Tradeoffs of large scale learning: Statistics vs Machine Learning

Statistics      | Machine Learning
Estimation      | Learning
Classifier      | Hypothesis
Data point      | Example / Instance
Regression      | Supervised learning
Classification  | Supervised learning
Covariate       | Feature
Response        | Label

Essentially, the AI and the math communities are doing the same kind of thing. The main differences: statisticians are more interested in the model and in drawing conclusions about it, while machine learners are more interested in prediction, with a concern for algorithms that handle high-dimensional data.

Framework
We consider the classical risk minimization problem. Given:
- a space of input-output pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$, with probability distribution $P(x, y)$,
- a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$,
- a class of functions $\mathcal{F}$,
the risk of a function $f : \mathcal{X} \to \mathcal{Y}$ is $R(f) := \mathbb{E}_P[\ell(f(x), y)]$, and our aim is
$$\min_{f \in \mathcal{F}} R(f).$$
$R$ is unknown, but given a sequence of i.i.d. data points $(x_i, y_i)_{i=1..n} \sim P^{\otimes n}$ we can define the empirical risk
$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$$
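
To make the notation concrete, here is a minimal sketch (not from the slides) of the empirical risk for the squared loss with a linear predictor; the data-generating choices below are arbitrary.

```python
import numpy as np

def empirical_risk(f, X, Y, loss):
    """R_n(f) = (1/n) * sum_i loss(f(x_i), y_i), the empirical counterpart of R(f)."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

# Illustrative data: a linear model with Gaussian noise (arbitrary choices).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta = rng.normal(size=5)
Y = X @ theta + 0.1 * rng.normal(size=100)

squared_loss = lambda y_hat, y: (y_hat - y) ** 2
predictor = lambda x: x @ theta
print(empirical_risk(predictor, X, Y, squared_loss))
```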

The bias-variance tradeoff, a.k.a. estimation vs. approximation error
There are many ways of seeing it: the constrained case, the penalized case, other forms of regularization. In all cases one ends up with a compromise $\varepsilon_{\mathrm{app}} + \varepsilon_{\mathrm{est}}$ driven by the size of the class $\mathcal{F}$. This is the classical setting.
(Figure: the approximation error $\varepsilon_{\mathrm{app}}$ decreases and the estimation error $\varepsilon_{\mathrm{est}}$ increases as the class $\mathcal{F}$ grows.)

Adding an optimization term
When facing large datasets, it may be difficult, and useless, to optimize the estimator to high accuracy. We therefore question the choice of the algorithm from a fixed time budget point of view (ref. 2).
This raises the following questions:
- up to which precision is it necessary to optimize?
- which is the limiting factor: time or data points?
A problem is said to be large scale when time is the limiting factor. For large scale problems: which algorithm? And does more data mean less work (when time is limiting)?
2. Ref: [Shalev-Shwartz and Srebro, 2008; Shalev-Shwartz and K., 2011; Bottou and Bousquet, 2008]

Tradeoffs: large scale learning
(Figure: the excess error decomposes as $\varepsilon_{\mathrm{app}} + \varepsilon_{\mathrm{est}} + \varepsilon_{\mathrm{opt}}$, now driven by the class $\mathcal{F}$, the sample size $n$, the optimization accuracy $\varepsilon$, and the time budget $T$.)

Different algorithms
To minimize the empirical risk, a bunch of algorithms may be considered:
- gradient descent,
- second order gradient descent,
- stochastic gradient descent,
- fast stochastic algorithms (requiring high memory storage).
Let us compare the first order methods: SGD and GD.

Stochastic gradient algorithms
Aim: $\min_f R(f)$, where we only have access to unbiased estimates of $R(f)$ and $\nabla R(f)$.
1. Start at some $f_0$.
2. Iterate: get an unbiased gradient estimate $g_k$, s.t. $\mathbb{E}[g_k] = \nabla R(f_k)$, and set $f_{k+1} \leftarrow f_k - \gamma_k g_k$.
3. Output $f_m$, or $\bar f_m := \frac{1}{m}\sum_{k=1}^{m} f_k$ (averaged SGD).
Gradient descent: same, but with the true gradient.
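
A minimal sketch of averaged SGD for the least-squares risk, assuming a stream of i.i.d. pairs (x, y); the step size γ_k = γ_0/√k and the toy data are illustrative choices, not the slide's prescription.

```python
import numpy as np

def averaged_sgd(stream, d, gamma0=0.1):
    """Averaged SGD on R(theta) = E[(theta^T x - y)^2]: iterate, then average the iterates."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for k, (x, y) in enumerate(stream, start=1):
        g = (theta @ x - y) * x                 # unbiased estimate of the gradient at theta_k
        theta = theta - gamma0 / np.sqrt(k) * g
        theta_bar += (theta - theta_bar) / k    # running average (Polyak-Ruppert)
    return theta_bar

# Toy usage: a stream generated from y = theta_*^T x + noise.
rng = np.random.default_rng(0)
theta_star = rng.normal(size=5)
stream = ((x, x @ theta_star + 0.1 * rng.normal())
          for x in rng.normal(size=(10_000, 5)))
print(averaged_sgd(stream, d=5))
```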

ERM
SGD on the empirical risk, $\min_{f \in \mathcal{F}} R_n(f)$:
- pick any $(x_i, y_i)$ from the empirical sample, set $g_k = \nabla_f \ell(f_k, (x_i, y_i))$;
- $f_{k+1} \leftarrow \Pi_{\mathcal{F}}(f_k - \gamma_k g_k)$, with step size $\gamma_k$ proportional to $1/\sqrt{k}$;
- output $\bar f_m$: $R_n(\bar f_m) - R_n(f_n^*) \le O(1/\sqrt{m})$, while $\sup_{f \in \mathcal{F}} |R - R_n|(f) \le O(1/\sqrt{n})$;
- cost of one iteration: $O(d)$.
GD on the empirical risk, $\min_{f \in \mathcal{F}} R_n(f)$:
- $g_k = \nabla_f R_n(f_k) = \frac{1}{n}\sum_{i=1}^{n} \nabla_f \ell(f_k, (x_i, y_i))$;
- $f_{k+1} \leftarrow \Pi_{\mathcal{F}}(f_k - \gamma_k g_k)$;
- output $f_m$: $R_n(f_m) - R_n(f_n^*) \le O((1-\kappa)^m)$, while $\sup_{f \in \mathcal{F}} |R - R_n|(f) \le O(1/\sqrt{n})$;
- cost of one iteration: $O(nd)$.
Overall, for SGD: $R(\bar f_m) - R(f^*) \le O(1/\sqrt{m}) + O(1/\sqrt{n})$.
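
For contrast with the stochastic sketch above, a minimal full gradient descent on the empirical least-squares risk: each iteration touches all n points, which is the O(nd) per-iteration cost noted in the comparison (step size and data are again illustrative).

```python
import numpy as np

def gd_erm_least_squares(X, Y, gamma=0.1, iters=200):
    """Full gradient descent on R_n(theta) = (1/2n) * ||X theta - Y||^2."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        g = X.T @ (X @ theta - Y) / n   # exact gradient: one pass over all n points, O(nd)
        theta = theta - gamma * g
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
theta_star = rng.normal(size=5)
Y = X @ theta_star + 0.1 * rng.normal(size=1000)
print(gd_erm_least_squares(X, Y))
```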

Conclusion
In the large scale setting, it is beneficial to use SGD!
Does more data help? With the global estimation error held fixed, the time $T$ needed to reach $R(\bar f_m) - R(f^*) \lesssim 1/\sqrt{n}$ seems to decrease with $n$. But upper bounding $R_n - R$ uniformly is dangerous: we also have to compare with one-pass SGD, which minimizes the true risk $R$ directly.

Expectation minimization
Stochastic gradient descent may also be used to minimize $R(f)$ directly.
SGD on the empirical risk, $\min_{f \in \mathcal{F}} R_n(f)$:
- pick any $(x_i, y_i)$ from the empirical sample, $g_k = \nabla_f \ell(f_k, (x_i, y_i))$, $f_{k+1} \leftarrow \Pi_{\mathcal{F}}(f_k - \gamma_k g_k)$;
- output $\bar f_m$: $R_n(\bar f_m) - R_n(f_n^*) \le O(1/\sqrt{m})$, and $\sup_{f \in \mathcal{F}} |R - R_n|(f) \le O(1/\sqrt{n})$;
- cost of one iteration: $O(d)$.
One-pass SGD, $\min_{f \in \mathcal{F}} R(f)$:
- pick an independent $(x, y)$, $g_k = \nabla_f \ell(f_k, (x, y))$, $f_{k+1} \leftarrow \Pi_{\mathcal{F}}(f_k - \gamma_k g_k)$;
- output $\bar f_k$, $k \le n$: $R(\bar f_k) - R(f^*) \le O(1/\sqrt{k})$;
- cost of one iteration: $O(d)$.
SGD with one pass (early stopping as a regularization) achieves a nearly optimal bias-variance tradeoff with low complexity.

Rate of convergence
We are interested in prediction.
- Strongly convex objective: $1/(\mu n)$.
- Non-strongly convex: $1/\sqrt{n}$.

A case study: finite-dimensional linear least mean squares
LMS [Bach and Moulines, 2013]. We now consider the simple case where $\mathcal{X} = \mathbb{R}^d$ and the loss $\ell$ is quadratic. We are interested in linear predictors:
$$\min_{\theta \in \mathbb{R}^d} \mathbb{E}_P\left[(\theta^T x - y)^2\right].$$
We assume that the data points are generated according to $y_i = \theta_*^T x_i + \varepsilon_i$, and we consider the stochastic gradient algorithm started at $\theta_0 = 0$:
$$\theta_{n+1} = \theta_n - \gamma_n \left(\langle x_n, \theta_n \rangle x_n - y_n x_n\right).$$
This recursion may be rewritten as
$$\theta_{n+1} - \theta_* = (I - \gamma_n x_n x_n^T)(\theta_n - \theta_*) - \gamma_n \xi_n. \qquad (1)$$
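
A minimal sketch of this LMS recursion with Polyak-Ruppert averaging and a constant step size, in the spirit of [Bach and Moulines, 2013]; the value of γ and the synthetic data are illustrative assumptions.

```python
import numpy as np

def averaged_lms(X, Y, gamma=0.05):
    """LMS recursion theta_{n+1} = theta_n - gamma * (<x_n, theta_n> - y_n) x_n, with averaging."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for i in range(n):
        x, y = X[i], Y[i]
        theta = theta - gamma * (x @ theta - y) * x   # one stochastic gradient step
        theta_bar += (theta - theta_bar) / (i + 1)    # Polyak-Ruppert average
    return theta_bar

rng = np.random.default_rng(0)
d, n = 5, 20_000
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ theta_star + 0.1 * rng.normal(size=n)
print(np.linalg.norm(averaged_lms(X, Y) - theta_star))  # small: averaging gives the 1/n rate
```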

Rate of convergence, back again!
We are interested in prediction.
- Strongly convex objective: $1/(\mu n)$.
- Non-strongly convex: $1/\sqrt{n}$.
We define $H = \mathbb{E}[x x^T]$; then $\mu = \min \mathrm{Sp}(H)$. For least squares, the statistical rate of the ordinary least-squares estimator is $\sigma^2 d / n$: there is still a gap to be bridged!

A few assumptions
We define $H = \mathbb{E}[x x^T]$ and $C = \mathbb{E}[\xi \xi^T]$.
- Bounded noise variance: we assume $C \preceq \sigma^2 H$.
- Covariance operator: no assumption on the minimal eigenvalue, only $\mathbb{E}[\|x\|^2] \le R^2$.

Result
Theorem. $\mathbb{E}[R(\bar\theta_n) - R(\theta_*)] \le \frac{4}{n}\left(\sigma^2 d + R^2 \|\theta_0 - \theta_*\|^2\right)$:
the optimal statistical rate $1/n$, without strong convexity.

Non-parametric learning: outline
What if $d \gg n$? Carry the analysis over to a Hilbert space, using reproducing kernel Hilbert spaces.
- Non-parametric regression in RKHS: an interesting problem in itself.
- Optimal statistical rates in RKHS; choice of $\gamma$.
- Behaviour in finite dimension: adaptivity, tradeoffs.

Reproducing kernel Hilbert space [Dieuleveut and Bach, 2014]
We denote by $\mathcal{H}_K \subset \mathbb{R}^{\mathcal{X}}$ a Hilbert space of functions, characterized by the kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:
- for any $x$, the function $K_x : \mathcal{X} \to \mathbb{R}$ defined by $K_x(x') = K(x, x')$ belongs to $\mathcal{H}_K$;
- reproducing property: for all $g \in \mathcal{H}_K$ and $x \in \mathcal{X}$, $g(x) = \langle g, K_x \rangle_K$.
Two usages:
α) a hypothesis space for regression;
β) mapping data points into a linear space.

α) A hypothesis space for regression
Classical regression setting: $(X_i, Y_i) \sim \rho$ i.i.d., with values in $\mathcal{X} \times \mathbb{R}$.
Goal: minimizing the prediction error,
$$\min_{g \in L^2} \mathbb{E}\left[(g(X) - Y)^2\right],$$
i.e. looking for an estimator $\hat g_n$ of $g_\rho(X) = \mathbb{E}[Y \mid X]$, with $g_\rho \in L^2_{\rho_X}$, where
$$L^2_{\rho_X} = \left\{ f : \mathcal{X} \to \mathbb{R} \;\middle|\; \int f^2(t)\, d\rho_X(t) < \infty \right\}.$$

β) Mapping data points into a linear space
Linear regression on data mapped into some RKHS:
$$\arg\min_{\theta \in \mathcal{H}} \|Y - X\theta\|^2.$$

Two approaches to the regression problem
Link: in general $\mathcal{H}_K \subset L^2_{\rho_X}$, and in some cases the completion of the RKHS in $L^2_{\rho_X}$ is the whole space:
$$\overline{\mathcal{H}_K}^{\,L^2_{\rho_X}} = L^2_{\rho_X}.$$
We then look for an estimator of the regression function in the RKHS, linking
- the general regression problem, $g_\rho \in L^2$, and
- the linear regression problem in the RKHS:
we look for an estimator for the first problem using natural algorithms for the second one.


SGD algorithm in the RKHS
Starting from $g_0 \in \mathcal{H}_K$ (we often consider $g_0 = 0$), the iterates can be written
$$g_n = \sum_{i=1}^{n} a_i K_{x_i}, \qquad (2)$$
with $(a_n)_n$ such that
$$a_n := -\gamma_n \left(g_{n-1}(x_n) - y_n\right) = -\gamma_n \Big( \sum_{i=1}^{n-1} a_i K(x_n, x_i) - y_n \Big).$$
Indeed,
$$g_n = g_{n-1} - \gamma_n \left(g_{n-1}(x_n) - y_n\right) K_{x_n} = \sum_{i=1}^{n} a_i K_{x_i} \quad \text{with } a_n \text{ defined as above},$$
and $(g_{n-1}(x_n) - y_n) K_{x_n}$ is an unbiased estimate of the gradient of $\mathbb{E}[(\langle K_x, g_{n-1}\rangle - y)^2]$ (up to a factor 2).
The SGD algorithm in the RKHS thus takes a very simple form.
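
A minimal sketch of this recursion with a Gaussian kernel (the kernel, the step size, and the toy data are illustrative assumptions); it maintains the coefficients a_i of equation (2), and for simplicity it returns the last iterate rather than the averaged one used in the theorem below.

```python
import numpy as np

def gaussian_kernel(x, xp, bandwidth=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * bandwidth ** 2))

def kernel_sgd(X, Y, kernel=gaussian_kernel, gamma=0.5):
    """One-pass SGD in the RKHS: g_n = sum_i a_i K_{x_i}, updated one point at a time."""
    a = []
    for x, y in zip(X, Y):
        # Evaluate the current iterate g_{n-1} at x_n: sum_{i<n} a_i K(x_n, x_i).
        pred = sum(a_i * kernel(x, X[i]) for i, a_i in enumerate(a))
        a.append(-gamma * (pred - y))            # a_n = -gamma_n * (g_{n-1}(x_n) - y_n)
    def g(x_new):                                # the resulting estimator
        return sum(a_i * kernel(x_new, X[i]) for i, a_i in enumerate(a))
    return g

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=500)
g = kernel_sgd(X, Y)
print(g(np.array([0.5])), np.sin(1.5))   # the estimate should be close to the target
```

Note that iteration n requires n kernel evaluations, which is the quadratic complexity addressed in the last part of the talk.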

Assumptions
Two important points characterize the difficulty of the problem:
- the regularity of the objective function;
- the spectrum of the covariance operator.

Covariance operator
We define $\Sigma = \mathbb{E}[K_x \otimes K_x]$, where $K_x \otimes K_x : g \mapsto \langle K_x, g \rangle K_x = g(x) K_x$.
The covariance operator is a self-adjoint operator which contains information on the distribution of $K_x$.
Assumptions:
- $\mathrm{tr}(\Sigma^\alpha) < \infty$ for some $\alpha \in [0, 1]$;
- on $g_\rho$: $g_\rho \in \Sigma^r\big(L^2_{\rho_X}\big)$ with $r \ge 0$.

Interpretation
The first assumption controls how fast the eigenvalues of $\Sigma$ decrease; the second places $g_\rho$ in an ellipsoid class of functions (we do not assume $g_\rho \in \mathcal{H}_K$).

Result
Theorem. Under a few hidden assumptions,
$$\mathbb{E}\left[R(\bar g_n) - R(g_\rho)\right] \le O\!\left(\frac{\sigma^2 \,\mathrm{tr}(\Sigma^\alpha)\, \gamma^\alpha}{n^{1-\alpha}}\right) + O\!\left(\frac{\|\Sigma^{-r} g_\rho\|^2}{(n\gamma)^{2(r \wedge 1)}}\right).$$
- Bias-variance decomposition.
- $O(\cdot)$ hides a known constant (4 or 8).
- Finite-horizon result here, but it extends to the online setting.
- Saturation (the bias exponent saturates at $r \wedge 1$).

Corollary
Assume A1-8. If $\frac{1-\alpha}{2} < r < \frac{2-\alpha}{2}$, then with $\gamma = n^{-\frac{2r+\alpha-1}{2r+\alpha}}$ we get the optimal rate
$$\mathbb{E}\left[R(\bar g_n) - R(g_\rho)\right] = O\!\left(n^{-\frac{2r}{2r+\alpha}}\right). \qquad (3)$$
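
As a concrete instance (my reading of the corollary, not stated on the slide): in the attainable case $r = 1/2$, i.e. $g_\rho \in \mathcal{H}_K$, and for $\alpha \in (0, 1)$, the prescription becomes
$$\gamma = n^{-\frac{\alpha}{1+\alpha}}, \qquad \mathbb{E}\left[R(\bar g_n) - R(g_\rho)\right] = O\!\left(n^{-\frac{1}{1+\alpha}}\right),$$
which interpolates between $n^{-1/2}$ (no decay of the eigenvalues, $\alpha = 1$) and $n^{-1}$ (finite-dimensional regime, $\alpha \to 0$).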

Conclusion 1: optimal statistical rates in RKHS, choice of $\gamma$
- We get the statistically optimal rate of convergence for learning in the RKHS with one-pass SGD.
- We get insights on how to choose the kernel and the step size.
- We compare favorably to [Ying and Pontil, 2008; Caponnetto and De Vito, 2007; Tarrès and Yao, 2011].

Conclusion 2: behaviour in finite dimension, adaptivity, tradeoffs
The theorem can be rewritten as
$$\mathbb{E}\left[R(\bar\theta_n) - R(\theta_*)\right] \le O\!\left(\frac{\sigma^2 \,\mathrm{tr}(\Sigma^\alpha)\, \gamma^\alpha}{n^{1-\alpha}}\right) + O\!\left(\frac{\theta_*^T \Sigma^{1-2r} \theta_*}{(n\gamma)^{2(r \wedge 1)}}\right), \qquad (4)$$
where the ellipsoid condition appears more clearly. Thus SGD is adaptive to the regularity of the problem; it bridges the gap between the different regimes and explains the behaviour when $d \gg n$.

The complexity challenge, approximation of the kernel
Outline:
1. Tradeoffs of large scale learning
2. A case study: finite-dimensional linear least mean squares
3. Non-parametric learning
4. The complexity challenge, approximation of the kernel

Reducing complexity: sampling methods
However, the complexity of such a method remains quadratic in the number of examples: iteration number $n$ costs $n$ kernel computations.

                 Finite dimension   Infinite dimension
Rate             $d/n$              $d_n/n$
Complexity       $O(dn)$            $O(n^2)$

(here $d_n$ plays the role of an effective dimension of the infinite-dimensional problem).

Two related methods:
- approximate the kernel matrix;
- approximate the kernel.
Results from [Bach, 2012] for the first, extended by [Alaoui and Mahoney, 2014; Rudi et al., 2015]. There also exist results in the second situation [Rahimi and Recht, 2008; Dai et al., 2014].

Sharp analysis
We only consider a fixed design setting. We then have to approximate the kernel matrix: instead of computing the whole matrix, we randomly pick a number $d_n$ of columns. We still get the same estimation errors, leading to:

                 Finite dimension   Infinite dimension
Rate             $d/n$              $d_n/n$
Complexity       $O(dn)$            $O(n d_n^2)$
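
A minimal sketch of this column-sampling idea, written as Nyström kernel ridge regression (the ridge formulation, the Gaussian kernel, and the values of m and λ are my illustrative choices, not the slide's setting): only an n × d_n block of the kernel matrix is ever formed.

```python
import numpy as np

def nystrom_krr(X, Y, m, lam=1e-3, bandwidth=1.0, seed=0):
    """Kernel ridge regression using only m randomly sampled columns of the kernel matrix.

    Solves min_a ||Y - K_nm a||^2 + lam * a^T K_mm a, which needs the n x m block
    K_nm instead of the full n x n matrix.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)                 # sampled columns (landmarks)
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K_nm = np.exp(-sq(X, X[idx]) / (2 * bandwidth ** 2))       # n x m kernel block
    K_mm = K_nm[idx]                                           # m x m block among landmarks
    a = np.linalg.solve(K_nm.T @ K_nm + lam * K_mm, K_nm.T @ Y)
    return X[idx], a

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=2000)
landmarks, a = nystrom_krr(X, Y, m=50)
print(a.shape)   # only 50 coefficients instead of 2000
```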

Random feature selection
Many kernels may be represented, thanks to Bochner's theorem, as
$$K(x, y) = \int_{\mathcal{W}} \varphi(w, x)\, \varphi(w, y)\, d\mu(w)$$
(think of translation-invariant kernels and the Fourier transform). We thus consider the low rank approximation
$$\hat K(x, y) = \frac{1}{d_n} \sum_{i=1}^{d_n} \varphi(x, w_i)\, \varphi(y, w_i), \qquad w_i \sim \mu,$$
and use this approximation of the kernel in SGD.
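
A minimal random Fourier feature sketch for the Gaussian kernel, in the spirit of [Rahimi and Recht, 2008]; the map $\varphi(w, x) = \sqrt{2}\,\cos(w^T x + b)$ with Gaussian $w$ and uniform $b$ is one standard choice, and the dimensions below are arbitrary.

```python
import numpy as np

def random_fourier_features(X, n_features=2000, bandwidth=1.0, seed=0):
    """Map X to features phi(x) such that phi(x)^T phi(y) approximates the Gaussian kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))   # w_i ~ mu (spectral measure)
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Phi = random_fourier_features(X, n_features=5000, seed=1)
approx = Phi @ Phi.T
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2)
print(np.abs(approx - exact).max())   # approximation error shrinks as n_features grows
```

With such features, the kernel SGD above reduces to the finite-dimensional linear SGD of the earlier slides, applied to $\varphi(x)$ instead of $x$.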

Directions
What I am working on at the moment:
- random feature selection: tuning the sampling to improve the accuracy of the approximation;
- acceleration + stochasticity (with Nicolas Flammarion).

Some references

Alaoui, A. E. and Mahoney, M. W. (2014). Fast randomized kernel methods with statistical guarantees. CoRR.
Bach, F. (2012). Sharp analysis of low-rank kernel matrix approximations. ArXiv e-prints.
Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). ArXiv e-prints.
Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20.
Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3).
Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems 27.
Dieuleveut, A. and Bach, F. (2014). Non-parametric stochastic approximation with large step sizes. ArXiv e-prints.
Rahimi, A. and Recht, B. (2008). Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems 21.
Rudi, A., Camoriano, R., and Rosasco, L. (2015). Less is more: Nyström computational regularization. CoRR.
Shalev-Shwartz, S. and K., S. (2011). Theoretical basis for more data less work.
Shalev-Shwartz, S. and Srebro, N. (2008). SVM optimization: inverse dependence on training set size. Proceedings of the International Conference on Machine Learning (ICML).
Tarrès, P. and Yao, Y. (2011). Online learning as stochastic approximation of regularization paths. ArXiv e-prints.
Ying, Y. and Pontil, M. (2008). Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5).

Thank you for your attention!


Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Machine Learning in the Data Revolution Era

Machine Learning in the Data Revolution Era Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

A summary of Deep Learning without Poor Local Minima

A summary of Deep Learning without Poor Local Minima A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Beating SGD: Learning SVMs in Sublinear Time

Beating SGD: Learning SVMs in Sublinear Time Beating SGD: Learning SVMs in Sublinear Time Elad Hazan Tomer Koren Technion, Israel Institute of Technology Haifa, Israel 32000 {ehazan@ie,tomerk@cs}.technion.ac.il Nathan Srebro Toyota Technological

More information

Divide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates

Divide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates : A Distributed Algorithm with Minimax Optimal Rates Yuchen Zhang, John C. Duchi, Martin Wainwright (UC Berkeley;http://arxiv.org/pdf/1305.509; Apr 9, 014) Gatsby Unit, Tea Talk June 10, 014 Outline Motivation.

More information

ONLINE LEARNING WITH KERNELS: OVERCOMING THE GROWING SUM PROBLEM. Abhishek Singh, Narendra Ahuja and Pierre Moulin

ONLINE LEARNING WITH KERNELS: OVERCOMING THE GROWING SUM PROBLEM. Abhishek Singh, Narendra Ahuja and Pierre Moulin 22 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 23 26, 22, SANTANDER, SPAIN ONLINE LEARNING WITH KERNELS: OVERCOMING THE GROWING SUM PROBLEM Abhishek Singh, Narendra Ahuja

More information

Modern Optimization Techniques

Modern Optimization Techniques Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari Machine Learning Basics: Stochastic Gradient Descent Sargur N. srihari@cedar.buffalo.edu 1 Topics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation Sets

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan Linear Regression CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis Regularization

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Hilbert Space Methods in Learning

Hilbert Space Methods in Learning Hilbert Space Methods in Learning guest lecturer: Risi Kondor 6772 Advanced Machine Learning and Perception (Jebara), Columbia University, October 15, 2003. 1 1. A general formulation of the learning problem

More information