Introduction to Machine Learning (67577) Lecture 7


Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem
Solving Convex Problems using SGD and RLM

Outline
1. Reminder: Convex learning problems
2. Learning Using Stochastic Gradient Descent
3. Learning Using Regularized Loss Minimization
4. Dimension vs. Norm bounds
   Example application: text categorization

Convex-Lipschitz-Bounded learning problems

Definition (Convex-Lipschitz-Bounded Learning Problem)
A learning problem $(\mathcal{H}, Z, \ell)$ is called Convex-Lipschitz-Bounded with parameters $(\rho, B)$ if the following holds:
- The hypothesis class $\mathcal{H}$ is a convex set, and for all $w \in \mathcal{H}$ we have $\|w\| \le B$.
- For all $z \in Z$, the loss function $\ell(\cdot, z)$ is convex and $\rho$-Lipschitz.

Example:
- $\mathcal{H} = \{w \in \mathbb{R}^d : \|w\| \le B\}$
- $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le \rho\}$, $\mathcal{Y} = \mathbb{R}$, $\ell(w, (x, y)) = |\langle w, x \rangle - y|$
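
To make the example concrete, here is a minimal Python sketch (my own illustration, not course code) of the absolute loss $\ell(w,(x,y)) = |\langle w, x\rangle - y|$ together with one of its subgradients, plus a numerical check of the $\rho$-Lipschitz property; the helper names are illustrative only.

```python
import numpy as np

def abs_loss(w, x, y):
    """Absolute loss l(w, (x, y)) = |<w, x> - y|, convex in w."""
    return abs(np.dot(w, x) - y)

def abs_loss_subgradient(w, x, y):
    """A subgradient of the absolute loss w.r.t. w: sign(<w, x> - y) * x.
    Its norm is at most ||x||, so when ||x|| <= rho the loss is rho-Lipschitz."""
    return np.sign(np.dot(w, x) - y) * x

# Quick numerical check of rho-Lipschitzness on a random instance.
rng = np.random.default_rng(0)
rho, d = 2.0, 5
x = rng.normal(size=d)
x *= rho / np.linalg.norm(x)              # scale so that ||x|| = rho
y = 1.0
w1, w2 = rng.normal(size=d), rng.normal(size=d)
lhs = abs(abs_loss(w1, x, y) - abs_loss(w2, x, y))
rhs = rho * np.linalg.norm(w1 - w2)
assert lhs <= rhs + 1e-12
```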

Convex-Smooth-Bounded learning problems

Definition (Convex-Smooth-Bounded Learning Problem)
A learning problem $(\mathcal{H}, Z, \ell)$ is called Convex-Smooth-Bounded with parameters $(\beta, B)$ if the following holds:
- The hypothesis class $\mathcal{H}$ is a convex set, and for all $w \in \mathcal{H}$ we have $\|w\| \le B$.
- For all $z \in Z$, the loss function $\ell(\cdot, z)$ is convex, non-negative, and $\beta$-smooth.

Example:
- $\mathcal{H} = \{w \in \mathbb{R}^d : \|w\| \le B\}$
- $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le \sqrt{\beta/2}\}$, $\mathcal{Y} = \mathbb{R}$, $\ell(w, (x, y)) = (\langle w, x \rangle - y)^2$
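
As a sanity check on the example (my own verification, not a slide from the lecture): the squared loss is indeed $\beta$-smooth whenever $\|x\| \le \sqrt{\beta/2}$, since its gradient is $\beta$-Lipschitz:
$$\nabla \ell(w, (x, y)) = 2\big(\langle w, x \rangle - y\big) x
\quad\Longrightarrow\quad
\|\nabla \ell(w, (x, y)) - \nabla \ell(u, (x, y))\| = 2\,|\langle w - u, x \rangle|\,\|x\| \le 2\|x\|^2 \|w - u\| \le \beta \|w - u\|.$$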

Part 2: Learning Using Stochastic Gradient Descent

Learning Using Stochastic Gradient Descent

- Consider a learning problem. Recall: our goal is to (probably approximately) solve
  $$\min_{w \in \mathcal{H}} L_{\mathcal{D}}(w) \quad \text{where} \quad L_{\mathcal{D}}(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$$
- So far, learning was based on the empirical risk $L_S(w)$.
- We now consider minimizing $L_{\mathcal{D}}(w)$ directly.

Stochastic Gradient Descent

$$\min_{w \in \mathcal{H}} L_{\mathcal{D}}(w) \quad \text{where} \quad L_{\mathcal{D}}(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$$

- Recall the gradient descent method, in which we initialize $w^{(1)} = 0$ and update $w^{(t+1)} = w^{(t)} - \eta \nabla L_{\mathcal{D}}(w^{(t)})$.
- Observe: $\nabla L_{\mathcal{D}}(w) = \mathbb{E}_{z \sim \mathcal{D}}[\nabla \ell(w, z)]$.
- We can't calculate $\nabla L_{\mathcal{D}}(w)$ because we don't know $\mathcal{D}$.
- But we can estimate it by $\nabla \ell(w, z)$ for $z \sim \mathcal{D}$.
- If we take a step in the direction $v = \nabla \ell(w, z)$ then, in expectation, we are moving in the right direction.
- In other words, $v$ is an unbiased estimate of the gradient.
- We'll show that this is good enough.

Stochastic Gradient Descent

initialize: $w^{(1)} = 0$
for $t = 1, 2, \dots, T$:
    choose $z_t \sim \mathcal{D}$
    let $v_t = \nabla \ell(w^{(t)}, z_t)$
    update $w^{(t+1)} = w^{(t)} - \eta v_t$
output $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$
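
Below is a minimal, self-contained Python sketch of this procedure (my own illustration, not code from the course), applied to the absolute-loss regression example from earlier; a fresh example is sampled from a synthetic distribution $\mathcal{D}$ at every step, and the averaged iterate is returned. The step size matches the one used in the analysis on the next slides.

```python
import numpy as np

def sgd(sample_z, subgradient, d, eta, T):
    """Stochastic gradient descent for minimizing L_D(w) = E_z[l(w, z)].

    sample_z    -- draws a fresh example z ~ D
    subgradient -- returns a (sub)gradient of l(., z) at w
    Returns the averaged iterate w_bar = (1/T) * sum_t w^(t).
    """
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        z = sample_z()
        v = subgradient(w, z)
        w_sum += w                        # accumulate w^(t) before the update
        w = w - eta * v
    return w_sum / T

# Synthetic distribution D: ||x|| = rho, y = <w_true, x> + small noise.
rng = np.random.default_rng(1)
d, rho, B = 10, 1.0, 3.0
w_true = rng.normal(size=d)
w_true *= (B / 2) / np.linalg.norm(w_true)

def sample_z():
    x = rng.normal(size=d)
    x *= rho / np.linalg.norm(x)
    y = np.dot(w_true, x) + 0.1 * rng.normal()
    return (x, y)

def abs_loss_subgradient(w, z):
    x, y = z
    return np.sign(np.dot(w, x) - y) * x  # subgradient of |<w, x> - y|

T = 10_000
eta = np.sqrt(B**2 / (rho**2 * T))        # step size from the analysis below
w_bar = sgd(sample_z, abs_loss_subgradient, d, eta, T)
print("estimation error:", np.linalg.norm(w_bar - w_true))
```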

Analyzing SGD for convex-Lipschitz-bounded problems

- By algebraic manipulations (expanded right after this slide), for any sequence $v_1, \dots, v_T$ and any $w^\star$,
  $$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle = \frac{\|w^{(1)} - w^\star\|^2 - \|w^{(T+1)} - w^\star\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2$$
- Assuming $\|v_t\| \le \rho$ for all $t$ and $\|w^\star\| \le B$, we obtain
  $$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \le \frac{B^2}{2\eta} + \frac{\eta \rho^2 T}{2}$$
- In particular, for $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$ we get
  $$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \le B \rho \sqrt{T}.$$
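
For completeness (a standard argument, not spelled out on the slide), the "algebraic manipulations" are just an expansion of the update rule $w^{(t+1)} = w^{(t)} - \eta v_t$:
$$\|w^{(t+1)} - w^\star\|^2 = \|w^{(t)} - w^\star\|^2 - 2\eta \langle w^{(t)} - w^\star, v_t \rangle + \eta^2 \|v_t\|^2,$$
so
$$\langle w^{(t)} - w^\star, v_t \rangle = \frac{\|w^{(t)} - w^\star\|^2 - \|w^{(t+1)} - w^\star\|^2}{2\eta} + \frac{\eta}{2} \|v_t\|^2,$$
and summing over $t = 1, \dots, T$ telescopes the first term. Since $w^{(1)} = 0$, the numerator is at most $\|w^\star\|^2 \le B^2$.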

Analyzing SGD for convex-Lipschitz-bounded problems (cont.)

- Taking expectation of both sides w.r.t. the randomness of choosing $z_1, \dots, z_T$, we obtain
  $$\mathbb{E}_{z_1, \dots, z_T}\left[\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle\right] \le B \rho \sqrt{T}.$$
- The law of total expectation: for every two random variables $\alpha, \beta$ and a function $g$,
  $$\mathbb{E}_\alpha[g(\alpha)] = \mathbb{E}_\beta\, \mathbb{E}_\alpha[g(\alpha) \mid \beta].$$
- Therefore
  $$\mathbb{E}_{z_1, \dots, z_T}[\langle w^{(t)} - w^\star, v_t \rangle] = \mathbb{E}_{z_1, \dots, z_{t-1}}\, \mathbb{E}_{z_t}[\langle w^{(t)} - w^\star, v_t \rangle \mid z_1, \dots, z_{t-1}].$$
- Once we know $z_1, \dots, z_{t-1}$, the value of $w^{(t)}$ is not random; hence
  $$\mathbb{E}_{z_t}[\langle w^{(t)} - w^\star, v_t \rangle \mid z_1, \dots, z_{t-1}] = \langle w^{(t)} - w^\star, \mathbb{E}_{z_t}[\nabla \ell(w^{(t)}, z_t)] \rangle = \langle w^{(t)} - w^\star, \nabla L_{\mathcal{D}}(w^{(t)}) \rangle.$$

Analyzing SGD for convex-Lipschitz-bounded problems (cont.)

- We got:
  $$\mathbb{E}_{z_1, \dots, z_T}\left[\sum_{t=1}^{T} \langle w^{(t)} - w^\star, \nabla L_{\mathcal{D}}(w^{(t)}) \rangle\right] \le B \rho \sqrt{T}$$
- By convexity, this means
  $$\mathbb{E}_{z_1, \dots, z_T}\left[\sum_{t=1}^{T} \left(L_{\mathcal{D}}(w^{(t)}) - L_{\mathcal{D}}(w^\star)\right)\right] \le B \rho \sqrt{T}$$
- Dividing by $T$ and using convexity again (both convexity steps are spelled out right after this slide),
  $$\mathbb{E}_{z_1, \dots, z_T}\left[L_{\mathcal{D}}\!\left(\frac{1}{T} \sum_{t=1}^{T} w^{(t)}\right)\right] \le L_{\mathcal{D}}(w^\star) + \frac{B \rho}{\sqrt{T}}$$
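
The two convexity steps, written out (standard facts about convex functions): the gradient inequality for the convex function $L_{\mathcal{D}}$ gives
$$L_{\mathcal{D}}(w^{(t)}) - L_{\mathcal{D}}(w^\star) \le \langle w^{(t)} - w^\star, \nabla L_{\mathcal{D}}(w^{(t)}) \rangle,$$
and Jensen's inequality applied to the average of the iterates gives
$$L_{\mathcal{D}}\!\left(\frac{1}{T} \sum_{t=1}^{T} w^{(t)}\right) \le \frac{1}{T} \sum_{t=1}^{T} L_{\mathcal{D}}(w^{(t)}).$$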

Learning convex-Lipschitz-bounded problems using SGD

Corollary
Consider a convex-Lipschitz-bounded learning problem with parameters $(\rho, B)$. Then, for every $\epsilon > 0$, if we run the SGD method for minimizing $L_{\mathcal{D}}(w)$ with a number of iterations (i.e., number of examples)
$$T \ge \frac{B^2 \rho^2}{\epsilon^2}$$
and with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then the output of SGD satisfies
$$\mathbb{E}[L_{\mathcal{D}}(\bar{w})] \le \min_{w \in \mathcal{H}} L_{\mathcal{D}}(w) + \epsilon.$$

Remark: a high-probability bound can be obtained using boosting the confidence (Lecture 4).
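
A tiny illustrative calculation (my own, not from the slides) of the number of examples and step size the corollary prescribes for given parameters:

```python
import math

def sgd_parameters(B, rho, eps):
    """Iteration count and step size suggested by the corollary for a
    convex-Lipschitz-bounded problem with parameters (rho, B)."""
    T = math.ceil((B * rho / eps) ** 2)       # T >= B^2 * rho^2 / eps^2
    eta = math.sqrt(B ** 2 / (rho ** 2 * T))
    return T, eta

print(sgd_parameters(B=1.0, rho=1.0, eps=0.1))  # -> (100, 0.1)
```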

Convex-smooth-bounded problems

A similar result holds for smooth problems:

Corollary
Consider a convex-smooth-bounded learning problem with parameters $(\beta, B)$. Assume in addition that $\ell(0, z) \le 1$ for all $z \in Z$. For every $\epsilon > 0$, set $\eta = \frac{1}{\beta(1 + 3/\epsilon)}$. Then, running SGD with
$$T \ge \frac{12 B^2 \beta}{\epsilon^2}$$
yields
$$\mathbb{E}[L_{\mathcal{D}}(\bar{w})] \le \min_{w \in \mathcal{H}} L_{\mathcal{D}}(w) + \epsilon.$$

Part 3: Learning Using Regularized Loss Minimization

Regularized Loss Minimization (RLM)

- Given a regularization function $R : \mathbb{R}^d \to \mathbb{R}$, the RLM rule is
  $$A(S) = \operatorname*{argmin}_w \left( L_S(w) + R(w) \right).$$
- We will focus on Tikhonov regularization:
  $$A(S) = \operatorname*{argmin}_w \left( L_S(w) + \lambda \|w\|^2 \right).$$
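
As a concrete instance (my own sketch, not the lecture's code): with the squared loss, the Tikhonov-regularized objective has a closed-form minimizer, the familiar ridge-regression solution.

```python
import numpy as np

def rlm_squared_loss(X, y, lam):
    """Tikhonov-regularized ERM for the squared loss:
        argmin_w  (1/m) * ||X w - y||^2 + lam * ||w||^2
    Setting the gradient to zero gives (X^T X / m + lam * I) w = X^T y / m."""
    m, d = X.shape
    A = X.T @ X / m + lam * np.eye(d)
    b = X.T @ y / m
    return np.linalg.solve(A, b)

# Small usage example on synthetic data.
rng = np.random.default_rng(0)
m, d = 50, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(m, d))
y = X @ w_true + 0.1 * rng.normal(size=m)
for lam in (0.0, 0.01, 1.0):
    w_hat = rlm_squared_loss(X, y, lam)
    print(lam, np.linalg.norm(w_hat - w_true))
```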

Why regularize?

- Similar to MDL: specify a prior belief about hypotheses. We bias ourselves toward short vectors.
- Stabilizer: we'll show that Tikhonov regularization makes the learner stable w.r.t. small perturbations of the training set, which in turn leads to generalization.

Stability

- Informally: an algorithm $A$ is stable if a small change of its input $S$ leads to only a small change of its output hypothesis.
- We need to specify what counts as a small change of the input and a small change of the output.

Stability

Replace one sample: given $S = (z_1, \dots, z_m)$ and an additional example $z'$, let
$$S^{(i)} = (z_1, \dots, z_{i-1}, z', z_{i+1}, \dots, z_m).$$

Definition (on-average-replace-one-stable)
Let $\epsilon : \mathbb{N} \to \mathbb{R}$ be a monotonically decreasing function. We say that a learning algorithm $A$ is on-average-replace-one-stable with rate $\epsilon(m)$ if for every distribution $\mathcal{D}$,
$$\mathbb{E}_{(S, z') \sim \mathcal{D}^{m+1},\, i \sim U(m)}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right] \le \epsilon(m).$$

Stable rules do not overfit

Theorem
If $A$ is on-average-replace-one-stable with rate $\epsilon(m)$, then
$$\mathbb{E}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(A(S)) - L_S(A(S))] \le \epsilon(m).$$

Proof.
Since $S$ and $z'$ are both drawn i.i.d. from $\mathcal{D}$, we have that for every $i$,
$$\mathbb{E}_S[L_{\mathcal{D}}(A(S))] = \mathbb{E}_{S, z'}[\ell(A(S), z')] = \mathbb{E}_{S, z'}[\ell(A(S^{(i)}), z_i)].$$
On the other hand, we can write
$$\mathbb{E}_S[L_S(A(S))] = \mathbb{E}_{S, i}[\ell(A(S), z_i)].$$
The theorem follows from the definition of stability.

Tikhonov Regularization as a Stabilizer

Theorem
Assume that the loss function is convex and $\rho$-Lipschitz. Then the RLM rule with the regularizer $\lambda \|w\|^2$ is on-average-replace-one-stable with rate $\frac{2\rho^2}{\lambda m}$. It follows that
$$\mathbb{E}_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(A(S)) - L_S(A(S))] \le \frac{2\rho^2}{\lambda m}.$$
Similarly, for a convex, $\beta$-smooth, and non-negative loss the rate is $\frac{48 \beta C}{\lambda m}$, where $C$ is an upper bound on $\max_z \ell(0, z)$.

The proof relies on the notion of strong convexity and can be found in the book.

The Fitting-Stability Tradeoff

- Observe:
  $$\mathbb{E}_S[L_{\mathcal{D}}(A(S))] = \mathbb{E}_S[L_S(A(S))] + \mathbb{E}_S[L_{\mathcal{D}}(A(S)) - L_S(A(S))].$$
- The first term measures how well $A$ fits the training set.
- The second term is the overfitting, and it is bounded by the stability of $A$.
- $\lambda$ controls the tradeoff between the two terms.

The Fitting-Stability Tradeoff (cont.)

- Let $A$ be the RLM rule.
- We saw (for convex-Lipschitz losses):
  $$\mathbb{E}_S[L_{\mathcal{D}}(A(S)) - L_S(A(S))] \le \frac{2\rho^2}{\lambda m}$$
- Fix some arbitrary vector $w^\star$. Then:
  $$L_S(A(S)) \le L_S(A(S)) + \lambda \|A(S)\|^2 \le L_S(w^\star) + \lambda \|w^\star\|^2.$$
- Taking expectation of both sides with respect to $S$ and noting that $\mathbb{E}_S[L_S(w^\star)] = L_{\mathcal{D}}(w^\star)$, we obtain
  $$\mathbb{E}_S[L_S(A(S))] \le L_{\mathcal{D}}(w^\star) + \lambda \|w^\star\|^2.$$
- Therefore:
  $$\mathbb{E}_S[L_{\mathcal{D}}(A(S))] \le L_{\mathcal{D}}(w^\star) + \lambda \|w^\star\|^2 + \frac{2\rho^2}{\lambda m}$$
  (the choice of $\lambda$ that balances the last two terms is worked out below).
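
Balancing the last two terms is a standard calculation (mine, not on the slide): restricting attention to $\|w^\star\| \le B$ and minimizing $\lambda B^2 + \frac{2\rho^2}{\lambda m}$ over $\lambda > 0$ gives
$$\lambda = \sqrt{\frac{2\rho^2}{B^2 m}}, \qquad \mathbb{E}_S[L_{\mathcal{D}}(A(S))] \le \min_{w : \|w\| \le B} L_{\mathcal{D}}(w) + \rho B \sqrt{\frac{8}{m}},$$
so RLM with this $\lambda$ learns convex-Lipschitz-bounded problems with sample complexity $m \ge \frac{8 \rho^2 B^2}{\epsilon^2}$.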

The Regularization Path

- The RLM rule as a function of $\lambda$ is
  $$w(\lambda) = \operatorname*{argmin}_w\, L_S(w) + \lambda \|w\|^2$$
- It can be seen as a Pareto objective: minimize both $L_S(w)$ and $\|w\|^2$.
- [Figure: the plane with $\|w\|^2$ on one axis and $L_S(w)$ on the other; the feasible region lies above a curve of Pareto-optimal points, and the regularization path traces this curve from $w(\infty)$ to $w(0)$.]

How to choose $\lambda$?

- Bound minimization: choose $\lambda$ according to the bound on $L_{\mathcal{D}}(w)$; usually far from optimal, as the bound is worst-case.
- Validation: calculate several Pareto-optimal points on the regularization path (by varying $\lambda$) and use a validation set to choose the best one, as sketched below.
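
A minimal sketch of the validation approach (illustrative code under my own assumptions, using the squared loss so the RLM rule has a closed form): train on one part of the data for each $\lambda$ on a grid and keep the value with the smallest validation loss.

```python
import numpy as np

def ridge(X, y, lam):
    """RLM with squared loss: argmin_w (1/m)||Xw - y||^2 + lam * ||w||^2."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

def choose_lambda(X_train, y_train, X_val, y_val, lambdas):
    """Walk along the regularization path and pick lambda by validation loss."""
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        w = ridge(X_train, y_train, lam)
        err = np.mean((X_val @ w - y_val) ** 2)   # validation (squared) loss
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

# Usage on synthetic data: split into a training set and a validation set.
rng = np.random.default_rng(0)
m, d = 200, 50
w_true = np.zeros(d); w_true[:5] = 1.0            # only a few relevant coordinates
X = rng.normal(size=(m, d))
y = X @ w_true + 0.5 * rng.normal(size=m)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]
lambdas = [10.0 ** k for k in range(-4, 3)]       # grid along the regularization path
print(choose_lambda(X_tr, y_tr, X_va, y_va, lambdas))
```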

Part 4: Dimension vs. Norm bounds, with an example application to text categorization

Dimension vs. Norm Bounds

- Previously in the course, when we learnt $d$ parameters, the sample complexity grew with $d$.
- Here we learn $d$ parameters, but the sample complexity depends on the norm of $w$ and on the Lipschitzness/smoothness, rather than on $d$.
- Which approach is better depends on the properties of the distribution.

Example: document categorization

"Signs all encouraging for Phelps in comeback. He did not win any gold medals or set any world records but Michael Phelps ticked all the boxes he needed in his comeback to competitive swimming."

Is this document about sport?

Bag-of-words representation

"Signs all encouraging for Phelps in comeback. He did not win any gold medals or set any world records but Michael Phelps ticked all the boxes he needed in his comeback to competitive swimming."

[Figure: the document as a vector of indicators over the dictionary; e.g. the coordinates for "swimming" and "world" are 1 while the coordinate for "elephant" is 0.]

Document categorization

- Let $\mathcal{X} = \{x \in \{0, 1\}^d : \|x\|^2 \le R^2,\ x_d = 1\}$.
- Think of $x \in \mathcal{X}$ as a text document represented as a bag of words:
  - At most $R^2 - 1$ words in each document
  - $d - 1$ is the size of the dictionary
  - The last coordinate is the bias
- Let $\mathcal{Y} = \{\pm 1\}$ (e.g., the document is about sport or not).
- Linear classifiers: $x \mapsto \operatorname{sign}(\langle w, x \rangle)$.
- Intuitively: $w_i$ is large (positive) for words indicative of sport, while $w_i$ is small (negative) for words indicative of non-sport.
- Hinge loss: $\ell(w, (x, y)) = [1 - y \langle w, x \rangle]_+$ (a small end-to-end sketch follows below).
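
To tie the pieces together, here is an illustrative end-to-end sketch (my own, not course code): build binary bag-of-words vectors with a bias coordinate and train a linear classifier with the hinge loss, using SGD over examples drawn uniformly from a toy training set.

```python
import numpy as np

def bag_of_words(doc, vocab):
    """Binary bag-of-words vector over `vocab`, with a bias coordinate appended."""
    words = set(doc.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocab] + [1.0])

def hinge_subgradient(w, x, y):
    """Subgradient of the hinge loss [1 - y<w, x>]_+ with respect to w."""
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(x)

def train_sgd(X, Y, eta=0.1, T=2000, seed=0):
    """Plain SGD on the hinge loss, sampling training examples uniformly."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(T):
        i = rng.integers(len(X))
        w_sum += w
        w = w - eta * hinge_subgradient(w, X[i], Y[i])
    return w_sum / T

# Toy corpus: label +1 = about sport, -1 = not about sport.
vocab = ["swimming", "medals", "records", "phelps", "election", "parliament"]
docs = [("Phelps ticked all the boxes in his comeback to competitive swimming", +1),
        ("He did not win any gold medals or set any world records", +1),
        ("The parliament debated the upcoming election", -1),
        ("An election result surprised the parliament", -1)]
X = np.array([bag_of_words(d, vocab) for d, _ in docs])
Y = np.array([y for _, y in docs])
w = train_sgd(X, Y)
print(np.sign(X @ w), Y)   # predicted vs. true labels on the toy training set
```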

Dimension vs. Norm

- The VC dimension is $d$, but $d$ can be extremely large (the number of words in English).
- The loss function is convex and $R$-Lipschitz.
- Assume that the number of relevant words is small and their weights are not too large; then there is a $w^\star$ with small norm and small $L_{\mathcal{D}}(w^\star)$.
- Then we can learn it with sample complexity that depends on $R^2 \|w^\star\|^2$ and does not depend on $d$ at all!
- But there are, of course, opposite cases, in which $d$ is much smaller than $R^2 \|w^\star\|^2$.

Summary

- Learning convex learning problems using SGD
- Learning convex learning problems using RLM
- The regularization path
- Dimension vs. Norm
