Stochastic optimization in Hilbert spaces


Aymeric Dieuleveut

Outline
- Learning vs statistics
- Tradeoffs of large scale learning: algorithm, ERM, complexity
- Stochastic optimization: why is SGD so useful in learning?
- A simple case: least mean squares in finite dimension
- Higher dimension? RKHS, non-parametric learning
- Lower complexity? Column sampling, feature selection

Tradeoffs of large scale learning: Statistics vs Machine Learning

Statistics      | Machine Learning
Estimation      | Learning
Classifier      | Hypothesis
Data point      | Example / Instance
Regression      | Supervised learning
Classification  | Supervised learning
Covariate       | Feature
Response        | Label

Essentially, the AI and the math communities are doing the same kind of thing. The main differences: statisticians are more interested in the model and in drawing conclusions about it, while machine learners are more interested in prediction, with a concern for algorithms that handle high-dimensional data.

Framework
We consider the classical risk minimization problem. Given:
- a space of input-output pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$, with probability distribution $P(x, y)$,
- a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$,
- a class of functions $\mathcal{F}$,
the risk of a function $f : \mathcal{X} \to \mathcal{Y}$ is $R(f) := \mathbb{E}_P[\ell(f(x), y)]$, and our aim is
$$\min_{f \in \mathcal{F}} R(f).$$
$R$ is unknown, but given a sequence of i.i.d. data points $(x_i, y_i)_{i=1..n} \sim P^{\otimes n}$ we can define the empirical risk
$$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i).$$
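
To make the notation concrete, here is a minimal sketch (not from the slides) of the empirical risk for the squared loss with a linear predictor; the data-generating choices below are arbitrary.

```python
import numpy as np

def empirical_risk(f, X, Y, loss):
    """R_n(f) = (1/n) * sum_i loss(f(x_i), y_i), the empirical counterpart of R(f)."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

# Illustrative data: a linear model with Gaussian noise (arbitrary choices).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta = rng.normal(size=5)
Y = X @ theta + 0.1 * rng.normal(size=100)

squared_loss = lambda y_hat, y: (y_hat - y) ** 2
predictor = lambda x: x @ theta
print(empirical_risk(predictor, X, Y, squared_loss))
```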

The bias-variance tradeoff, a.k.a. estimation vs. approximation error
There are many ways of seeing it: the constrained case, the penalized case, other forms of regularization. In all cases one ends up with a compromise $\varepsilon_{\mathrm{app}} + \varepsilon_{\mathrm{est}}$ driven by the size of the class $\mathcal{F}$. This is the classical setting.
(Figure: the approximation error $\varepsilon_{\mathrm{app}}$ decreases and the estimation error $\varepsilon_{\mathrm{est}}$ increases as the class $\mathcal{F}$ grows.)

Adding an optimization term
When facing large datasets, it may be difficult, and useless, to optimize the estimator to high accuracy. We therefore question the choice of the algorithm from a fixed time budget point of view (ref. 2).
This raises the following questions:
- up to which precision is it necessary to optimize?
- which is the limiting factor: time or data points?
A problem is said to be large scale when time is the limiting factor. For large scale problems: which algorithm? And does more data mean less work (when time is limiting)?
2. Ref: [Shalev-Shwartz and Srebro, 2008; Shalev-Shwartz and K., 2011; Bottou and Bousquet, 2008]

Tradeoffs: large scale learning
(Figure: the excess error decomposes as $\varepsilon_{\mathrm{app}} + \varepsilon_{\mathrm{est}} + \varepsilon_{\mathrm{opt}}$, now driven by the class $\mathcal{F}$, the sample size $n$, the optimization accuracy $\varepsilon$, and the time budget $T$.)

Different algorithms
To minimize the empirical risk, a bunch of algorithms may be considered:
- gradient descent,
- second order gradient descent,
- stochastic gradient descent,
- fast stochastic algorithms (requiring high memory storage).
Let us compare the first order methods: SGD and GD.

Stochastic gradient algorithms
Aim: $\min_f R(f)$, where we only have access to unbiased estimates of $R(f)$ and $\nabla R(f)$.
1. Start at some $f_0$.
2. Iterate: get an unbiased gradient estimate $g_k$, s.t. $\mathbb{E}[g_k] = \nabla R(f_k)$, and set $f_{k+1} \leftarrow f_k - \gamma_k g_k$.
3. Output $f_m$, or $\bar f_m := \frac{1}{m}\sum_{k=1}^{m} f_k$ (averaged SGD).
Gradient descent: same, but with the true gradient.
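
A minimal sketch of averaged SGD for the least-squares risk, assuming a stream of i.i.d. pairs (x, y); the step size γ_k = γ_0/√k and the toy data are illustrative choices, not the slide's prescription.

```python
import numpy as np

def averaged_sgd(stream, d, gamma0=0.1):
    """Averaged SGD on R(theta) = E[(theta^T x - y)^2]: iterate, then average the iterates."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for k, (x, y) in enumerate(stream, start=1):
        g = (theta @ x - y) * x                 # unbiased estimate of the gradient at theta_k
        theta = theta - gamma0 / np.sqrt(k) * g
        theta_bar += (theta - theta_bar) / k    # running average (Polyak-Ruppert)
    return theta_bar

# Toy usage: a stream generated from y = theta_*^T x + noise.
rng = np.random.default_rng(0)
theta_star = rng.normal(size=5)
stream = ((x, x @ theta_star + 0.1 * rng.normal())
          for x in rng.normal(size=(10_000, 5)))
print(averaged_sgd(stream, d=5))
```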

ERM
SGD on the empirical risk, $\min_{f \in \mathcal{F}} R_n(f)$:
- pick any $(x_i, y_i)$ from the empirical sample, set $g_k = \nabla_f \ell(f_k, (x_i, y_i))$;
- $f_{k+1} \leftarrow \Pi_{\mathcal{F}}(f_k - \gamma_k g_k)$, with step size $\gamma_k$ proportional to $1/\sqrt{k}$;
- output $\bar f_m$: $R_n(\bar f_m) - R_n(f_n^*) \le O(1/\sqrt{m})$, while $\sup_{f \in \mathcal{F}} |R - R_n|(f) \le O(1/\sqrt{n})$;
- cost of one iteration: $O(d)$.
GD on the empirical risk, $\min_{f \in \mathcal{F}} R_n(f)$:
- $g_k = \nabla_f R_n(f_k) = \frac{1}{n}\sum_{i=1}^{n} \nabla_f \ell(f_k, (x_i, y_i))$;
- $f_{k+1} \leftarrow \Pi_{\mathcal{F}}(f_k - \gamma_k g_k)$;
- output $f_m$: $R_n(f_m) - R_n(f_n^*) \le O((1-\kappa)^m)$, while $\sup_{f \in \mathcal{F}} |R - R_n|(f) \le O(1/\sqrt{n})$;
- cost of one iteration: $O(nd)$.
Overall, for SGD: $R(\bar f_m) - R(f^*) \le O(1/\sqrt{m}) + O(1/\sqrt{n})$.
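
For contrast with the stochastic sketch above, a minimal full gradient descent on the empirical least-squares risk: each iteration touches all n points, which is the O(nd) per-iteration cost noted in the comparison (step size and data are again illustrative).

```python
import numpy as np

def gd_erm_least_squares(X, Y, gamma=0.1, iters=200):
    """Full gradient descent on R_n(theta) = (1/2n) * ||X theta - Y||^2."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        g = X.T @ (X @ theta - Y) / n   # exact gradient: one pass over all n points, O(nd)
        theta = theta - gamma * g
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
theta_star = rng.normal(size=5)
Y = X @ theta_star + 0.1 * rng.normal(size=1000)
print(gd_erm_least_squares(X, Y))
```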

Conclusion
In the large scale setting, it is beneficial to use SGD!
Does more data help? With the global estimation error held fixed, the time $T$ needed to reach $R(\bar f_m) - R(f^*) \lesssim 1/\sqrt{n}$ seems to decrease with $n$. But upper bounding $R_n - R$ uniformly is dangerous: we also have to compare with one-pass SGD, which minimizes the true risk $R$ directly.

Expectation minimization
Stochastic gradient descent may also be used to minimize $R(f)$ directly.
SGD on the empirical risk, $\min_{f \in \mathcal{F}} R_n(f)$:
- pick any $(x_i, y_i)$ from the empirical sample, $g_k = \nabla_f \ell(f_k, (x_i, y_i))$, $f_{k+1} \leftarrow \Pi_{\mathcal{F}}(f_k - \gamma_k g_k)$;
- output $\bar f_m$: $R_n(\bar f_m) - R_n(f_n^*) \le O(1/\sqrt{m})$, and $\sup_{f \in \mathcal{F}} |R - R_n|(f) \le O(1/\sqrt{n})$;
- cost of one iteration: $O(d)$.
One-pass SGD, $\min_{f \in \mathcal{F}} R(f)$:
- pick an independent $(x, y)$, $g_k = \nabla_f \ell(f_k, (x, y))$, $f_{k+1} \leftarrow \Pi_{\mathcal{F}}(f_k - \gamma_k g_k)$;
- output $\bar f_k$, $k \le n$: $R(\bar f_k) - R(f^*) \le O(1/\sqrt{k})$;
- cost of one iteration: $O(d)$.
SGD with one pass (early stopping as a regularization) achieves a nearly optimal bias-variance tradeoff with low complexity.

Rate of convergence
We are interested in prediction.
- Strongly convex objective: $1/(\mu n)$.
- Non-strongly convex: $1/\sqrt{n}$.

A case study: finite-dimensional linear least mean squares
LMS [Bach and Moulines, 2013]. We now consider the simple case where $\mathcal{X} = \mathbb{R}^d$ and the loss $\ell$ is quadratic. We are interested in linear predictors:
$$\min_{\theta \in \mathbb{R}^d} \mathbb{E}_P\left[(\theta^T x - y)^2\right].$$
We assume that the data points are generated according to $y_i = \theta_*^T x_i + \varepsilon_i$, and we consider the stochastic gradient algorithm started at $\theta_0 = 0$:
$$\theta_{n+1} = \theta_n - \gamma_n \left(\langle x_n, \theta_n \rangle x_n - y_n x_n\right).$$
This recursion may be rewritten as
$$\theta_{n+1} - \theta_* = (I - \gamma_n x_n x_n^T)(\theta_n - \theta_*) - \gamma_n \xi_n. \qquad (1)$$
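
A minimal sketch of this LMS recursion with Polyak-Ruppert averaging and a constant step size, in the spirit of [Bach and Moulines, 2013]; the value of γ and the synthetic data are illustrative assumptions.

```python
import numpy as np

def averaged_lms(X, Y, gamma=0.05):
    """LMS recursion theta_{n+1} = theta_n - gamma * (<x_n, theta_n> - y_n) x_n, with averaging."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for i in range(n):
        x, y = X[i], Y[i]
        theta = theta - gamma * (x @ theta - y) * x   # one stochastic gradient step
        theta_bar += (theta - theta_bar) / (i + 1)    # Polyak-Ruppert average
    return theta_bar

rng = np.random.default_rng(0)
d, n = 5, 20_000
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ theta_star + 0.1 * rng.normal(size=n)
print(np.linalg.norm(averaged_lms(X, Y) - theta_star))  # small: averaging gives the 1/n rate
```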

Rate of convergence, back again!
We are interested in prediction.
- Strongly convex objective: $1/(\mu n)$.
- Non-strongly convex: $1/\sqrt{n}$.
We define $H = \mathbb{E}[x x^T]$; then $\mu = \min \mathrm{Sp}(H)$. For least squares, the statistical rate of the ordinary least-squares estimator is $\sigma^2 d / n$: there is still a gap to be bridged!

A few assumptions
We define $H = \mathbb{E}[x x^T]$ and $C = \mathbb{E}[\xi \xi^T]$.
- Bounded noise variance: we assume $C \preceq \sigma^2 H$.
- Covariance operator: no assumption on the minimal eigenvalue, only $\mathbb{E}[\|x\|^2] \le R^2$.

Result
Theorem. $\mathbb{E}[R(\bar\theta_n) - R(\theta_*)] \le \frac{4}{n}\left(\sigma^2 d + R^2 \|\theta_0 - \theta_*\|^2\right)$:
the optimal statistical rate $1/n$, without strong convexity.

Non-parametric learning: outline
What if $d \gg n$? Carry the analysis over to a Hilbert space, using reproducing kernel Hilbert spaces.
- Non-parametric regression in RKHS: an interesting problem in itself.
- Optimal statistical rates in RKHS; choice of $\gamma$.
- Behaviour in finite dimension: adaptivity, tradeoffs.

Reproducing kernel Hilbert space [Dieuleveut and Bach, 2014]
We denote by $\mathcal{H}_K \subset \mathbb{R}^{\mathcal{X}}$ a Hilbert space of functions, characterized by the kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:
- for any $x$, the function $K_x : \mathcal{X} \to \mathbb{R}$ defined by $K_x(x') = K(x, x')$ belongs to $\mathcal{H}_K$;
- reproducing property: for all $g \in \mathcal{H}_K$ and $x \in \mathcal{X}$, $g(x) = \langle g, K_x \rangle_K$.
Two usages:
α) a hypothesis space for regression;
β) mapping data points into a linear space.

α) A hypothesis space for regression
Classical regression setting: $(X_i, Y_i) \sim \rho$ i.i.d., with values in $\mathcal{X} \times \mathbb{R}$.
Goal: minimizing the prediction error,
$$\min_{g \in L^2} \mathbb{E}\left[(g(X) - Y)^2\right],$$
i.e. looking for an estimator $\hat g_n$ of $g_\rho(X) = \mathbb{E}[Y \mid X]$, with $g_\rho \in L^2_{\rho_X}$, where
$$L^2_{\rho_X} = \left\{ f : \mathcal{X} \to \mathbb{R} \;\middle|\; \int f^2(t)\, d\rho_X(t) < \infty \right\}.$$

β) Mapping data points into a linear space
Linear regression on data mapped into some RKHS:
$$\arg\min_{\theta \in \mathcal{H}} \|Y - X\theta\|^2.$$

Two approaches to the regression problem
Link: in general $\mathcal{H}_K \subset L^2_{\rho_X}$, and in some cases the completion of the RKHS in $L^2_{\rho_X}$ is the whole space:
$$\overline{\mathcal{H}_K}^{\,L^2_{\rho_X}} = L^2_{\rho_X}.$$
We then look for an estimator of the regression function in the RKHS, linking
- the general regression problem, $g_\rho \in L^2$, and
- the linear regression problem in the RKHS:
we look for an estimator for the first problem using natural algorithms for the second one.


SGD algorithm in the RKHS
Starting from $g_0 \in \mathcal{H}_K$ (we often consider $g_0 = 0$), the iterates can be written
$$g_n = \sum_{i=1}^{n} a_i K_{x_i}, \qquad (2)$$
with $(a_n)_n$ such that
$$a_n := -\gamma_n \left(g_{n-1}(x_n) - y_n\right) = -\gamma_n \Big( \sum_{i=1}^{n-1} a_i K(x_n, x_i) - y_n \Big).$$
Indeed,
$$g_n = g_{n-1} - \gamma_n \left(g_{n-1}(x_n) - y_n\right) K_{x_n} = \sum_{i=1}^{n} a_i K_{x_i} \quad \text{with } a_n \text{ defined as above},$$
and $(g_{n-1}(x_n) - y_n) K_{x_n}$ is an unbiased estimate of the gradient of $\mathbb{E}[(\langle K_x, g_{n-1}\rangle - y)^2]$ (up to a factor 2).
The SGD algorithm in the RKHS thus takes a very simple form.
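
A minimal sketch of this recursion with a Gaussian kernel (the kernel, the step size, and the toy data are illustrative assumptions); it maintains the coefficients a_i of equation (2), and for simplicity it returns the last iterate rather than the averaged one used in the theorem below.

```python
import numpy as np

def gaussian_kernel(x, xp, bandwidth=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * bandwidth ** 2))

def kernel_sgd(X, Y, kernel=gaussian_kernel, gamma=0.5):
    """One-pass SGD in the RKHS: g_n = sum_i a_i K_{x_i}, updated one point at a time."""
    a = []
    for x, y in zip(X, Y):
        # Evaluate the current iterate g_{n-1} at x_n: sum_{i<n} a_i K(x_n, x_i).
        pred = sum(a_i * kernel(x, X[i]) for i, a_i in enumerate(a))
        a.append(-gamma * (pred - y))            # a_n = -gamma_n * (g_{n-1}(x_n) - y_n)
    def g(x_new):                                # the resulting estimator
        return sum(a_i * kernel(x_new, X[i]) for i, a_i in enumerate(a))
    return g

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=500)
g = kernel_sgd(X, Y)
print(g(np.array([0.5])), np.sin(1.5))   # the estimate should be close to the target
```

Note that iteration n requires n kernel evaluations, which is the quadratic complexity addressed in the last part of the talk.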

Assumptions
Two important points characterize the difficulty of the problem:
- the regularity of the objective function;
- the spectrum of the covariance operator.

Covariance operator
We define $\Sigma = \mathbb{E}[K_x \otimes K_x]$, where $K_x \otimes K_x : g \mapsto \langle K_x, g \rangle K_x = g(x) K_x$.
The covariance operator is a self-adjoint operator which contains information on the distribution of $K_x$.
Assumptions:
- $\mathrm{tr}(\Sigma^\alpha) < \infty$ for some $\alpha \in [0, 1]$;
- on $g_\rho$: $g_\rho \in \Sigma^r\big(L^2_{\rho_X}\big)$ with $r \ge 0$.

Interpretation
The first assumption controls how fast the eigenvalues of $\Sigma$ decrease; the second places $g_\rho$ in an ellipsoid class of functions (we do not assume $g_\rho \in \mathcal{H}_K$).

Result
Theorem. Under a few hidden assumptions,
$$\mathbb{E}\left[R(\bar g_n) - R(g_\rho)\right] \le O\!\left(\frac{\sigma^2 \,\mathrm{tr}(\Sigma^\alpha)\, \gamma^\alpha}{n^{1-\alpha}}\right) + O\!\left(\frac{\|\Sigma^{-r} g_\rho\|^2}{(n\gamma)^{2(r \wedge 1)}}\right).$$
- Bias-variance decomposition.
- $O(\cdot)$ hides a known constant (4 or 8).
- Finite-horizon result here, but it extends to the online setting.
- Saturation (the bias exponent saturates at $r \wedge 1$).

Corollary
Assume A1-8. If $\frac{1-\alpha}{2} < r < \frac{2-\alpha}{2}$, then with $\gamma = n^{-\frac{2r+\alpha-1}{2r+\alpha}}$ we get the optimal rate
$$\mathbb{E}\left[R(\bar g_n) - R(g_\rho)\right] = O\!\left(n^{-\frac{2r}{2r+\alpha}}\right). \qquad (3)$$
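
As a concrete instance (my reading of the corollary, not stated on the slide): in the attainable case $r = 1/2$, i.e. $g_\rho \in \mathcal{H}_K$, and for $\alpha \in (0, 1)$, the prescription becomes
$$\gamma = n^{-\frac{\alpha}{1+\alpha}}, \qquad \mathbb{E}\left[R(\bar g_n) - R(g_\rho)\right] = O\!\left(n^{-\frac{1}{1+\alpha}}\right),$$
which interpolates between $n^{-1/2}$ (no decay of the eigenvalues, $\alpha = 1$) and $n^{-1}$ (finite-dimensional regime, $\alpha \to 0$).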

Conclusion 1: optimal statistical rates in RKHS, choice of $\gamma$
- We get the statistically optimal rate of convergence for learning in the RKHS with one-pass SGD.
- We get insights on how to choose the kernel and the step size.
- We compare favorably to [Ying and Pontil, 2008; Caponnetto and De Vito, 2007; Tarrès and Yao, 2011].

Conclusion 2: behaviour in finite dimension, adaptivity, tradeoffs
The theorem can be rewritten as
$$\mathbb{E}\left[R(\bar\theta_n) - R(\theta_*)\right] \le O\!\left(\frac{\sigma^2 \,\mathrm{tr}(\Sigma^\alpha)\, \gamma^\alpha}{n^{1-\alpha}}\right) + O\!\left(\frac{\theta_*^T \Sigma^{1-2r} \theta_*}{(n\gamma)^{2(r \wedge 1)}}\right), \qquad (4)$$
where the ellipsoid condition appears more clearly. Thus SGD is adaptive to the regularity of the problem; it bridges the gap between the different regimes and explains the behaviour when $d \gg n$.

The complexity challenge, approximation of the kernel
Outline:
1. Tradeoffs of large scale learning
2. A case study: finite-dimensional linear least mean squares
3. Non-parametric learning
4. The complexity challenge, approximation of the kernel

Reducing complexity: sampling methods
However, the complexity of such a method remains quadratic in the number of examples: iteration number $n$ costs $n$ kernel computations.

                 Finite dimension   Infinite dimension
Rate             $d/n$              $d_n/n$
Complexity       $O(dn)$            $O(n^2)$

(here $d_n$ plays the role of an effective dimension of the infinite-dimensional problem).

Two related methods:
- approximate the kernel matrix;
- approximate the kernel.
Results from [Bach, 2012] for the first, extended by [Alaoui and Mahoney, 2014; Rudi et al., 2015]. There also exist results in the second situation [Rahimi and Recht, 2008; Dai et al., 2014].

Sharp analysis
We only consider a fixed design setting. We then have to approximate the kernel matrix: instead of computing the whole matrix, we randomly pick a number $d_n$ of columns. We still get the same estimation errors, leading to:

                 Finite dimension   Infinite dimension
Rate             $d/n$              $d_n/n$
Complexity       $O(dn)$            $O(n d_n^2)$
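
A minimal sketch of this column-sampling idea, written as Nyström kernel ridge regression (the ridge formulation, the Gaussian kernel, and the values of m and λ are my illustrative choices, not the slide's setting): only an n × d_n block of the kernel matrix is ever formed.

```python
import numpy as np

def nystrom_krr(X, Y, m, lam=1e-3, bandwidth=1.0, seed=0):
    """Kernel ridge regression using only m randomly sampled columns of the kernel matrix.

    Solves min_a ||Y - K_nm a||^2 + lam * a^T K_mm a, which needs the n x m block
    K_nm instead of the full n x n matrix.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)                 # sampled columns (landmarks)
    sq = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K_nm = np.exp(-sq(X, X[idx]) / (2 * bandwidth ** 2))       # n x m kernel block
    K_mm = K_nm[idx]                                           # m x m block among landmarks
    a = np.linalg.solve(K_nm.T @ K_nm + lam * K_mm, K_nm.T @ Y)
    return X[idx], a

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=2000)
landmarks, a = nystrom_krr(X, Y, m=50)
print(a.shape)   # only 50 coefficients instead of 2000
```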

Random feature selection
Many kernels may be represented, thanks to Bochner's theorem, as
$$K(x, y) = \int_{\mathcal{W}} \varphi(w, x)\, \varphi(w, y)\, d\mu(w)$$
(think of translation-invariant kernels and the Fourier transform). We thus consider the low rank approximation
$$\hat K(x, y) = \frac{1}{d_n} \sum_{i=1}^{d_n} \varphi(x, w_i)\, \varphi(y, w_i), \qquad w_i \sim \mu,$$
and use this approximation of the kernel in SGD.
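
A minimal random Fourier feature sketch for the Gaussian kernel, in the spirit of [Rahimi and Recht, 2008]; the map $\varphi(w, x) = \sqrt{2}\,\cos(w^T x + b)$ with Gaussian $w$ and uniform $b$ is one standard choice, and the dimensions below are arbitrary.

```python
import numpy as np

def random_fourier_features(X, n_features=2000, bandwidth=1.0, seed=0):
    """Map X to features phi(x) such that phi(x)^T phi(y) approximates the Gaussian kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))   # w_i ~ mu (spectral measure)
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Phi = random_fourier_features(X, n_features=5000, seed=1)
approx = Phi @ Phi.T
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2)
print(np.abs(approx - exact).max())   # approximation error shrinks as n_features grows
```

With such features, the kernel SGD above reduces to the finite-dimensional linear SGD of the earlier slides, applied to $\varphi(x)$ instead of $x$.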

Directions
What I am working on at the moment:
- random feature selection: tuning the sampling to improve the accuracy of the approximation;
- acceleration + stochasticity (with Nicolas Flammarion).

Some references

Alaoui, A. E. and Mahoney, M. W. (2014). Fast randomized kernel methods with statistical guarantees. CoRR.
Bach, F. (2012). Sharp analysis of low-rank kernel matrix approximations. ArXiv e-prints.
Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). ArXiv e-prints.
Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20.
Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3).
Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems 27.
Dieuleveut, A. and Bach, F. (2014). Non-parametric stochastic approximation with large step sizes. ArXiv e-prints.
Rahimi, A. and Recht, B. (2008). Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems 21.
Rudi, A., Camoriano, R., and Rosasco, L. (2015). Less is more: Nyström computational regularization. CoRR.
Shalev-Shwartz, S. and K., S. (2011). Theoretical basis for more data less work.
Shalev-Shwartz, S. and Srebro, N. (2008). SVM optimization: inverse dependence on training set size. Proceedings of the International Conference on Machine Learning (ICML).
Tarrès, P. and Yao, Y. (2011). Online learning as stochastic approximation of regularization paths. ArXiv e-prints.
Ying, Y. and Pontil, M. (2008). Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5).

Thank you for your attention!


Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Machine Learning in the Data Revolution Era

Machine Learning in the Data Revolution Era Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

A summary of Deep Learning without Poor Local Minima

A summary of Deep Learning without Poor Local Minima A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Beating SGD: Learning SVMs in Sublinear Time

Beating SGD: Learning SVMs in Sublinear Time Beating SGD: Learning SVMs in Sublinear Time Elad Hazan Tomer Koren Technion, Israel Institute of Technology Haifa, Israel 32000 {ehazan@ie,tomerk@cs}.technion.ac.il Nathan Srebro Toyota Technological

More information

Divide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates

Divide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates : A Distributed Algorithm with Minimax Optimal Rates Yuchen Zhang, John C. Duchi, Martin Wainwright (UC Berkeley;http://arxiv.org/pdf/1305.509; Apr 9, 014) Gatsby Unit, Tea Talk June 10, 014 Outline Motivation.

More information

ONLINE LEARNING WITH KERNELS: OVERCOMING THE GROWING SUM PROBLEM. Abhishek Singh, Narendra Ahuja and Pierre Moulin

ONLINE LEARNING WITH KERNELS: OVERCOMING THE GROWING SUM PROBLEM. Abhishek Singh, Narendra Ahuja and Pierre Moulin 22 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 23 26, 22, SANTANDER, SPAIN ONLINE LEARNING WITH KERNELS: OVERCOMING THE GROWING SUM PROBLEM Abhishek Singh, Narendra Ahuja

More information

Modern Optimization Techniques

Modern Optimization Techniques Modern Optimization Techniques 2. Unconstrained Optimization / 2.2. Stochastic Gradient Descent Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari Machine Learning Basics: Stochastic Gradient Descent Sargur N. srihari@cedar.buffalo.edu 1 Topics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation Sets

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan Linear Regression CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis Regularization

More information

ECE521 lecture 4: 19 January Optimization, MLE, regularization

ECE521 lecture 4: 19 January Optimization, MLE, regularization ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity

More information

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Hilbert Space Methods in Learning

Hilbert Space Methods in Learning Hilbert Space Methods in Learning guest lecturer: Risi Kondor 6772 Advanced Machine Learning and Perception (Jebara), Columbia University, October 15, 2003. 1 1. A general formulation of the learning problem

More information