Stochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces
Aymeric Dieuleveut
Outline
- Tradeoffs of large-scale learning: learning vs. statistics; algorithm vs. ERM complexity.
- Stochastic optimization: why is SGD so useful in learning?
- A simple case: least mean squares in finite dimension.
- Higher dimension: RKHS, non-parametric learning.
- Lower complexity: column sampling, feature selection.
Tradeoffs of large-scale learning
Statistics vs. Machine Learning

  Statistics      | Machine Learning
  ----------------|------------------
  Estimation      | Learning
  Classifier      | Hypothesis
  Data point      | Example / Instance
  Regression      | Supervised learning
  Classification  | Supervised learning
  Covariate       | Feature
  Response        | Label

Essentially, the statistics and AI communities work on the same kind of problems. The main differences:
- Statisticians are more interested in the model itself and in drawing conclusions about it.
- Machine learners are more interested in prediction, with a concern for algorithms that handle high-dimensional data.

1. taken from
Tradeoffs of large-scale learning
Framework

We consider the classical risk minimization problem. Given:
- a space of input-output pairs (x, y) in X x Y, with probability distribution P(x, y),
- a loss function l : Y x Y -> R,
- a class of functions F,
the risk of a function f : X -> Y is R(f) := E_P[l(f(x), y)]. Our aim is

    min_{f in F} R(f),

but R is unknown. Given a sequence of i.i.d. data points (x_i, y_i)_{i=1..n} ~ P^n, we can define the empirical risk

    R_n(f) = (1/n) sum_{i=1}^n l(f(x_i), y_i).
Tradeoffs of large-scale learning
The bias-variance tradeoff, a.k.a. the estimation vs. approximation error decomposition

There are many ways of seeing it:
- constrained case,
- penalized case,
- other regularizations.

Thus the compromise: eps_app + eps_est. This is the classical setting.

[Figure: decomposition of the error into the approximation error eps_app and the estimation error eps_est, as a function of the class F.]
Tradeoffs of large-scale learning
Adding an optimization term

When we face large datasets, it may be difficult, and even useless, to optimize the estimator to high accuracy. We then question the choice of algorithm from a fixed-time-budget point of view.^2

This raises the following questions:
- Up to which precision is it necessary to optimize?
- Which is the limiting factor: time, or data points?

A problem is said to be large scale when time is the limiting factor. For large-scale problems: which algorithm? Does more data mean less work (if time is limiting)?

2. Refs: [Shalev-Schwartz and Srebro, 2008; Shalev-Schwartz and K., 2011; Bottou and Bousquet, 2008]
Tradeoffs of large-scale learning
Tradeoffs in large-scale learning

[Figure: total error eps as a function of the class F, the sample size n, and the computation time T, decomposed as eps_app + eps_est + eps_opt.]
Tradeoffs of large-scale learning
Different algorithms

To solve the ERM problem, a number of algorithms may be considered:
- gradient descent,
- second-order gradient descent,
- stochastic gradient descent,
- fast stochastic algorithms (requiring high memory storage).

Let us first compare the first-order methods: SGD and GD.
Tradeoffs of large-scale learning
Stochastic gradient algorithms

Aim: min_f R(f), where we only have access to unbiased estimates of R(f) and of its gradient.

1. Start at some f_0.
2. Iterate: get an unbiased gradient estimate g_k, such that E[g_k] = grad R(f_k), and set

       f_{k+1} = f_k - gamma_k g_k.

3. Output f_m, or the average f_bar_m := (1/m) sum_{k=1}^m f_k (averaged SGD).

Gradient descent: same, but with the true gradient.
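As an illustrative sketch (not part of the original slides), the averaged-SGD loop above fits in a few lines. The toy quadratic risk, the noise level, and the step-size choice gamma_k proportional to 1/sqrt(k) are assumptions made for the demo.

```python
import numpy as np

def averaged_sgd(grad_estimate, f0, n_steps, step):
    """SGD f_{k+1} = f_k - gamma_k g_k with E[g_k] = grad R(f_k);
    returns the Polyak-Ruppert average (1/m) sum_k f_k."""
    f = f0.copy()
    f_bar = np.zeros_like(f0)
    for k in range(n_steps):
        g = grad_estimate(f)              # unbiased gradient estimate
        f = f - step(k) * g
        f_bar += (f - f_bar) / (k + 1)    # running average of the iterates
    return f_bar

# Toy least-squares risk R(f) = E[(x^T f - y)^2] / 2 with streaming data.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])

def grad(f):
    x = rng.standard_normal(2)
    y = x @ theta_star + 0.1 * rng.standard_normal()
    return (x @ f - y) * x                # E[grad] = H (f - theta*)

f_bar = averaged_sgd(grad, np.zeros(2), 5000, step=lambda k: 0.1 / np.sqrt(k + 1))
```

The averaged iterate f_bar, rather than the last iterate, is the output analyzed in the rates below.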
Tradeoffs of large-scale learning
ERM

SGD in ERM: min_{f in F} R_n(f)
- Pick any (x_i, y_i) from the empirical sample.
- g_k = grad_f l(f_k, (x_i, y_i)).
- f_{k+1} = f_k - gamma_k g_k, with step size gamma_k proportional to 1/sqrt(k).
- Output f_bar_m.
- R_n(f_bar_m) - R_n(f*_n) <= O(1/sqrt(m)); sup_{f in F} |R - R_n|(f) = O(1/sqrt(n)).
- Cost of one iteration: O(d).

GD in ERM: min_{f in F} R_n(f)
- g_k = sum_{i=1}^n grad_f l(f_k, (x_i, y_i)) = grad_f R_n(f_k).
- f_{k+1} = f_k - gamma_k g_k.
- Output f_m.
- R_n(f_m) - R_n(f*_n) <= O((1 - kappa)^m); sup_{f in F} |R - R_n|(f) = O(1/sqrt(n)).
- Cost of one iteration: O(nd).

Overall: R(f_bar_m) - R(f*) <= O(1/sqrt(m)) + O(1/sqrt(n)).
Tradeoffs of large-scale learning
Conclusion

In the large-scale setting, it is beneficial to use SGD!

Does more data help? With the global estimation error fixed, it seems that R(f_bar_m) - R(f*) <= O(1/sqrt(T)) + O(1/sqrt(n)) is decreasing with n.

Upper bounding R_n - R uniformly is dangerous. Indeed, we also have to compare to one-pass SGD, which directly minimizes the true risk R.
Tradeoffs of large-scale learning
Expectation minimization

Stochastic gradient descent may also be used to minimize R(f) directly:

SGD in ERM: min_{f in F} R_n(f)
- Pick any (x_i, y_i) from the empirical sample.
- g_k = grad_f l(f_k, (x_i, y_i)); f_{k+1} = f_k - gamma_k g_k.
- Output f_bar_m.
- R_n(f_bar_m) - R_n(f*_n) <= O(1/sqrt(m)); sup_{f in F} |R - R_n|(f) = O(1/sqrt(n)).
- Cost of one iteration: O(d).

One-pass SGD: min_{f in F} R(f)
- Pick an independent fresh sample (x, y).
- g_k = grad_f l(f_k, (x, y)); f_{k+1} = f_k - gamma_k g_k.
- Output f_bar_k, k <= n.
- R(f_bar_k) - R(f*) <= O(1/sqrt(k)).
- Cost of one iteration: O(d).

SGD with one pass (early stopping as a regularization) achieves a nearly optimal bias-variance tradeoff with low complexity.
Tradeoffs of large-scale learning
Rates of convergence

We are interested in prediction:
- strongly convex objective: 1/(mu n),
- non-strongly convex objective: 1/sqrt(n).
A case study: finite-dimensional linear least mean squares
LMS [Bach and Moulines, 2013]

We now consider the simple case where X = R^d and the loss l is quadratic. We are interested in linear predictors:

    min_{theta in R^d} E_P[(theta^T x - y)^2].

We assume that the data points are generated according to y_i = theta*^T x_i + eps_i, and we consider the stochastic gradient algorithm started at theta_0 = 0:

    theta_{n+1} = theta_n - gamma_n (<x_n, theta_n> x_n - y_n x_n).

This recursion may be rewritten as

    theta_{n+1} - theta* = (I - gamma_n x_n x_n^T)(theta_n - theta*) - gamma_n xi_n.   (1)
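A minimal numerical sketch of recursion (1) with Polyak-Ruppert averaging, in the spirit of [Bach and Moulines, 2013]; the dimension, sample size, constant step size, and noise level are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, gamma = 5, 20000, 0.05             # constant step size for least squares
theta_star = rng.standard_normal(d)

theta = np.zeros(d)                      # theta_0 = 0
theta_bar = np.zeros(d)
for k in range(n):
    x = rng.standard_normal(d)
    y = x @ theta_star + 0.1 * rng.standard_normal()  # y_n = <theta*, x_n> + eps_n
    theta = theta - gamma * ((x @ theta) - y) * x     # LMS recursion (1)
    theta_bar += (theta - theta_bar) / (k + 1)        # averaged iterate
```

The averaged iterate theta_bar is the estimator covered by the sigma^2 d / n type bound of the theorem below; no strong-convexity constant enters the step-size choice.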
A case study: finite-dimensional linear least mean squares
Rates of convergence, back again!

We are interested in prediction:
- strongly convex objective: 1/(mu n),
- non-strongly convex objective: 1/sqrt(n).

We define H = E[x x^T]; then mu = min Sp(H). For least squares, the statistical rate of the ordinary least-squares estimator is sigma^2 d / n: there is still a gap to be bridged!
A case study: finite-dimensional linear least mean squares
A few assumptions

We define H = E[x x^T] and C = E[xi xi^T].
- Bounded noise variance: we assume C <= sigma^2 H.
- Covariance operator: no assumption on the minimal eigenvalue; E[||x||^2] <= R^2.
A case study: finite-dimensional linear least mean squares
Result

Theorem:

    E[R(theta_bar_n) - R(theta*)] <= (4/n) (sigma^2 d + R^2 ||theta_0 - theta*||^2).

This is the optimal statistical rate 1/n, without strong convexity.
Non-parametric learning
Outline

What if d >> n?
- Non-parametric regression in RKHS: an interesting problem in itself.
- Carry the analysis into a Hilbert space, using reproducing kernel Hilbert spaces.
- Optimal statistical rates in RKHS; choice of gamma.
- Behaviour in finite dimension: adaptivity, tradeoffs.
Non-parametric learning
Reproducing kernel Hilbert spaces [Dieuleveut and Bach, 2014]

We denote by H_K a Hilbert space of functions, H_K subset of R^X, characterized by the kernel function K : X x X -> R:
- for any x, the function K_x : X -> R defined by K_x(x') = K(x, x') is in H_K;
- reproducing property: for all g in H_K and x in X, g(x) = <g, K_x>_K.

Two usages:
alpha) a hypothesis space for regression;
beta) mapping data points into a linear space.
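A tiny sketch of the two defining properties, using a Gaussian kernel (my choice, not fixed by the slides): a function built as a finite kernel expansion is evaluated through plain kernel sums, which is the reproducing property at work.

```python
import numpy as np

def K(x, xp, bandwidth=1.0):
    """Gaussian kernel K(x, x'); K_x = K(x, .) is an element of H_K."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(xp)) ** 2) / (2 * bandwidth**2))

# g = sum_j c_j K_{x_j} lies in H_K (finite expansion for the demo).
centers = [0.0, 1.0, 2.0]
coeffs = [1.0, -0.5, 0.25]

def g(x):
    # Reproducing property: g(x) = <g, K_x>_K = sum_j c_j <K_{x_j}, K_x>_K
    #                            = sum_j c_j K(x_j, x)
    return sum(c * K(xj, x) for c, xj in zip(coeffs, centers))

val = g(0.0)
```

Since <K_x, K_{x'}>_K = K(x, x'), evaluating g never requires an explicit representation of the (possibly infinite-dimensional) feature space.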
Non-parametric learning
alpha) A hypothesis space for regression

Classical regression setting: (X_i, Y_i) ~ rho, i.i.d., with (X_i, Y_i) in X x R. Goal: minimize the prediction error

    min_{g in L^2} E[(g(X) - Y)^2].

We look for an estimator g_hat_n of g_rho(X) = E[Y | X], with g_rho in L^2_rhoX, where

    L^2_rhoX = { f : X -> R : integral of f^2(t) drho_X(t) < infinity }.
Non-parametric learning
beta) Mapping data points into a linear space

Linear regression on data mapped into some RKHS:

    argmin_{theta in H} ||Y - X theta||^2.
Non-parametric learning
Two approaches to the regression problem

Link: in general H_K is a subset of L^2_rhoX, and in some cases the closure of the RKHS satisfies cl_{L^2_rhoX}(H_K) = L^2_rhoX. We then look for an estimator of the regression function in the RKHS:
- general regression problem: g_rho in L^2;
- linear regression problem in the RKHS.

We look for an estimator for the first problem using the natural algorithms for the second one.
Non-parametric learning
SGD algorithm in the RKHS

Start at g_0 in H_K (we often consider g_0 = 0), and write

    g_n = sum_{i=1}^n a_i K_{x_i},   (2)

with (a_n) such that a_n := -gamma_n (g_{n-1}(x_n) - y_n) = -gamma_n (sum_{i=1}^{n-1} a_i K(x_n, x_i) - y_n). Indeed,

    g_n = g_{n-1} - gamma_n (g_{n-1}(x_n) - y_n) K_{x_n} = sum_{i=1}^n a_i K_{x_i},

with a_n defined as above, and (g_{n-1}(x_n) - y_n) K_{x_n} is an unbiased estimate of the gradient of E[(<K_x, g_{n-1}> - y)^2]. The SGD algorithm thus takes a very simple form in the RKHS.
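The coefficient recursion (2) can be sketched directly; the toy 1-D data, the Gaussian kernel bandwidth, and the constant step size are illustrative assumptions, not choices made in the slides.

```python
import numpy as np

def kernel_sgd(xs, ys, kernel, gamma):
    """One-pass SGD in the RKHS, eq. (2): g_n = sum_{i<=n} a_i K_{x_i},
    with a_n = -gamma_n (g_{n-1}(x_n) - y_n). Only the coefficients are stored."""
    n = len(xs)
    a = np.zeros(n)
    for k in range(n):
        # g_{k-1}(x_k) via the reproducing property: a kernel sum over past points
        pred = a[:k] @ kernel(xs[:k], xs[k])
        a[k] = -gamma(k) * (pred - ys[k])
    return a

# Toy 1-D regression with a Gaussian kernel (hand-picked bandwidth and step).
rng = np.random.default_rng(2)
kern = lambda u, v: np.exp(-(u - v) ** 2 / 0.02)
xs = rng.uniform(-1, 1, 400)
ys = np.sin(3 * xs) + 0.05 * rng.standard_normal(400)
a = kernel_sgd(xs, ys, kern, gamma=lambda k: 0.5)

g_hat = lambda x: a @ kern(xs, x)   # g_n can then be evaluated anywhere
```

Note that iteration k costs k kernel evaluations, which is exactly the quadratic complexity addressed in the last part of the talk.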
Non-parametric learning
Assumptions

Two important points characterize the difficulty of the problem:
- the regularity of the objective function,
- the spectrum of the covariance operator.
Non-parametric learning
Covariance operator

We define Sigma = E[K_x (x) K_x], where K_x (x) K_x : g -> <K_x, g> K_x = g(x) K_x. The covariance operator is a self-adjoint operator which contains information on the distribution of K_x.

Assumptions:
- tr(Sigma^alpha) < infinity, for some alpha in [0, 1];
- on g_rho: g_rho in Sigma^r(L^2_rho(X)) with r >= 0.
Non-parametric learning
Interpretation

The eigenvalue decay of Sigma defines an ellipsoid class of functions (we do not assume g_rho in H_K).
Non-parametric learning
Result

Theorem. Under a few hidden assumptions:

    E[R(g_bar_n) - R(g_rho)] <= O(sigma^2 tr(Sigma^alpha) gamma^alpha / n^{1-alpha}) + O(||Sigma^{-r} g_rho||^2 / (n gamma)^{2 min(r,1)}).

- This is a bias-variance decomposition.
- The O(.) hides a known constant (4 or 8).
- Finite-horizon result here, but it extends to the online setting.
- Saturation at r = 1.
Non-parametric learning
Corollary

Corollary. Assume A1-8. If (1 - alpha)/2 < r < (2 - alpha)/2, then with gamma = n^{-(2r + alpha - 1)/(2r + alpha)} we get the optimal rate:

    E[R(g_bar_n) - R(g_rho)] = O(n^{-2r/(2r + alpha)}).   (3)
Non-parametric learning
Conclusion 1: optimal statistical rates in RKHS; choice of gamma

- We get the statistically optimal rate of convergence for learning in an RKHS with one-pass SGD.
- We get insights on how to choose the kernel and the step size.
- We compare favorably to [Ying and Pontil, 2008; Caponnetto and De Vito, 2007; Tarrès and Yao, 2011].
Non-parametric learning
Conclusion 2: behaviour in finite dimension; adaptivity, tradeoffs

The theorem can be rewritten as

    E[R(theta_bar_n) - R(theta*)] <= O(sigma^2 tr(Sigma^alpha) gamma^alpha / n^{1-alpha}) + O(theta*^T Sigma^{1-2r} theta* / (n gamma)^{2 min(r,1)}),   (4)

where the ellipsoid condition appears more clearly. Thus SGD:
- is adaptive to the regularity of the problem;
- bridges the gap between the different regimes and explains the behaviour when d >> n.
The complexity challenge: approximation of the kernel
Outline

1. Tradeoffs of large-scale learning
2. A case study: finite-dimensional linear least mean squares
3. Non-parametric learning
4. The complexity challenge: approximation of the kernel
The complexity challenge: approximation of the kernel
Reducing complexity: sampling methods

However, the complexity of such a method remains quadratic in the number of examples: iteration number n costs n kernel computations.

                      Rate     Complexity
  Finite dimension    d/n      O(dn)
  Infinite dimension  d_n/n    O(n^2)
The complexity challenge: approximation of the kernel
Two related methods

- Approximate the kernel matrix.
- Approximate the kernel.

Results on the first come from [Bach, 2012], and have been extended by [Alaoui and Mahoney, 2014; Rudi et al., 2015]. There also exist results for the second situation [Rahimi and Recht, 2008; Dai et al., 2014].
The complexity challenge: approximation of the kernel
Sharp analysis

We only consider a fixed-design setting. We then approximate the kernel matrix: instead of computing the whole matrix, we randomly pick a number d_n of columns. We still get the same estimation errors, leading to:

                      Rate     Complexity
  Finite dimension    d/n      O(dn)
  Infinite dimension  d_n/n    O(n d_n^2)
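A sketch of the column-sampling idea in its simplest (uniform Nyström) form; the kernel, the data, and the number d of sampled columns are illustrative choices not taken from the slides.

```python
import numpy as np

def nystrom(kernel, xs, d, rng):
    """Approximate the n x n kernel matrix from d uniformly sampled columns:
    K_hat = C W^+ C^T with C = K[:, J], W = K[J, J]; cost O(n d^2) vs O(n^2)."""
    n = len(xs)
    J = rng.choice(n, size=d, replace=False)
    C = np.array([[kernel(x, xs[j]) for j in J] for x in xs])  # n x d block
    W = C[J, :]                                                # d x d block
    return C @ np.linalg.pinv(W, rcond=1e-10) @ C.T

rng = np.random.default_rng(3)
xs = rng.standard_normal(60)
kern = lambda u, v: np.exp(-(u - v) ** 2)
K = np.array([[kern(u, v) for v in xs] for u in xs])
K_hat = nystrom(kern, xs, 30, rng)
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

When the spectrum of the kernel matrix decays quickly, a small d already gives a very accurate approximation; the truncated pseudo-inverse guards against the near-singularity of W.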
98 The complexity challenge, approximation of the kernel. Random feature selection. Many kernels may be represented, by Bochner's theorem, as K(x, y) = ∫_W φ(w, x) φ(w, y) dµ(w) (think of translation-invariant kernels and the Fourier transform). We thus consider the low-rank approximation K̂(x, y) = (1/d_n) Σ_{i=1}^{d_n} φ(x, w_i) φ(y, w_i), where w_i ~ µ. We use this approximation of the kernel in SGD. Aymeric Dieuleveut Stochastic optimization Hilbert spaces 44 / 48
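For the Gaussian kernel, Bochner's theorem makes µ a Gaussian over frequencies, and the low-rank approximation above becomes the classical random Fourier features construction. A minimal sketch, in which the kernel bandwidth gamma and the number of features d are illustrative choices:

```python
import numpy as np

def rff_features(X, d, gamma, rng):
    """d random Fourier features approximating the Gaussian kernel
    K(x, y) = exp(-gamma * ||x - y||^2); frequencies w_i are drawn
    from the kernel's spectral measure mu (here N(0, 2*gamma*I))."""
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], d))
    b = rng.uniform(0, 2 * np.pi, size=d)
    # phi(w, x) = sqrt(2) cos(w^T x + b); averaging phi(w, x) phi(w, y)
    # over (w, b) recovers K(x, y)
    return np.sqrt(2.0 / d) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rff_features(X, d=2000, gamma=0.5, rng=rng)
K_approx = Z @ Z.T  # (1/d) sum_i phi(x, w_i) phi(y, w_i)
K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
print(np.abs(K_approx - K_exact).max())  # Monte Carlo error, ~1/sqrt(d)
```

Running SGD on the explicit d-dimensional features Z then costs O(d) per iteration, in place of the n kernel evaluations of the exact method.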
99 The complexity challenge, approximation of the kernel. Directions. What I am working on at the moment: random feature selection; tuning the sampling to improve the accuracy of the approximation; acceleration + stochasticity (with Nicolas Flammarion). Aymeric Dieuleveut Stochastic optimization Hilbert spaces 45 / 48
100 The complexity challenge, approximation of the kernel. Some references I
Alaoui, A. E. and Mahoney, M. W. (2014). Fast randomized kernel methods with statistical guarantees. CoRR, abs/
Bach, F. (2012). Sharp analysis of low-rank kernel matrix approximations. ArXiv e-prints.
Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). ArXiv e-prints.
Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20.
Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3).
Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems 27, Montreal, Quebec, Canada.
Dieuleveut, A. and Bach, F. (2014). Non-parametric stochastic approximation with large step sizes. ArXiv e-prints.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 46 / 48
101 The complexity challenge, approximation of the kernel. Some references II
Rahimi, A. and Recht, B. (2008). Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems 21, Vancouver, British Columbia, Canada, December 8-11, 2008.
Rudi, A., Camoriano, R., and Rosasco, L. (2015). Less is more: Nyström computational regularization. CoRR, abs/
Shalev-Shwartz, S. and K., S. (2011). Theoretical basis for more data less work.
Shalev-Shwartz, S. and Srebro, N. (2008). SVM optimization: inverse dependence on training set size. Proceedings of the International Conference on Machine Learning (ICML).
Tarrès, P. and Yao, Y. (2011). Online learning as stochastic approximation of regularization paths. ArXiv e-prints.
Ying, Y. and Pontil, M. (2008). Online gradient descent learning algorithms. Foundations of Computational Mathematics, 5.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 47 / 48
102 The complexity challenge, approximation of the kernel Thank you for your attention! Aymeric Dieuleveut Stochastic optimization Hilbert spaces 48 / 48