Introduction to Machine Learning (67577) Lecture 7
Introduction to Machine Learning (67577) Lecture 7
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
Solving Convex Problems using SGD and RLM
Shai Shalev-Shwartz (Hebrew U) IML Lecture 7 SGD and RLM 1 / 31
Outline

1. Reminder: Convex learning problems
2. Learning Using Stochastic Gradient Descent
3. Learning Using Regularized Loss Minimization
4. Dimension vs. Norm bounds
   - Example application: Text categorization
Convex-Lipschitz-bounded learning problems

Definition (Convex-Lipschitz-Bounded Learning Problem). A learning problem $(\mathcal{H}, Z, \ell)$ is called Convex-Lipschitz-Bounded with parameters $\rho, B$ if the following holds:
- The hypothesis class $\mathcal{H}$ is a convex set and for all $w \in \mathcal{H}$ we have $\|w\| \le B$.
- For all $z \in Z$, the loss function $\ell(\cdot, z)$ is a convex and $\rho$-Lipschitz function.

Example: $\mathcal{H} = \{w \in \mathbb{R}^d : \|w\| \le B\}$, $X = \{x \in \mathbb{R}^d : \|x\| \le \rho\}$, $Y = \mathbb{R}$, and $\ell(w, (x, y)) = |\langle w, x \rangle - y|$.
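The example loss above is $\rho$-Lipschitz in $w$ because, by Cauchy-Schwarz, $|\ell(w_1, (x, y)) - \ell(w_2, (x, y))| \le \|x\| \cdot \|w_1 - w_2\|$. A quick numeric spot check of that inequality (a sketch with illustrative values, not part of the lecture):

```python
import random

# Numeric spot check that l(w,(x,y)) = |<w,x> - y| is rho-Lipschitz in w
# whenever ||x|| <= rho (all concrete values here are illustrative).
random.seed(1)
rho = 2.0
x = [random.uniform(-1, 1) for _ in range(3)]
scale = rho / sum(xi * xi for xi in x) ** 0.5
x = [scale * xi for xi in x]          # rescale so that ||x|| = rho
y = 0.5

def loss(w):
    return abs(sum(wi * xi for wi, xi in zip(w, x)) - y)

for _ in range(100):
    w1 = [random.uniform(-1, 1) for _ in range(3)]
    w2 = [random.uniform(-1, 1) for _ in range(3)]
    dist = sum((a - b) ** 2 for a, b in zip(w1, w2)) ** 0.5
    # Cauchy-Schwarz gives |loss(w1) - loss(w2)| <= ||x|| * ||w1 - w2||
    assert abs(loss(w1) - loss(w2)) <= rho * dist + 1e-9
```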
Convex-Smooth-bounded learning problems

Definition (Convex-Smooth-Bounded Learning Problem). A learning problem $(\mathcal{H}, Z, \ell)$ is called Convex-Smooth-Bounded with parameters $\beta, B$ if the following holds:
- The hypothesis class $\mathcal{H}$ is a convex set and for all $w \in \mathcal{H}$ we have $\|w\| \le B$.
- For all $z \in Z$, the loss function $\ell(\cdot, z)$ is a convex, non-negative, and $\beta$-smooth function.

Example: $\mathcal{H} = \{w \in \mathbb{R}^d : \|w\| \le B\}$, $X = \{x \in \mathbb{R}^d : \|x\|^2 \le \beta/2\}$, $Y = \mathbb{R}$, and $\ell(w, (x, y)) = (\langle w, x \rangle - y)^2$.
Outline — next: Learning Using Stochastic Gradient Descent
Learning Using Stochastic Gradient Descent

Consider a learning problem. Recall: our goal is to (probably approximately) solve
$$\min_{w \in \mathcal{H}} L_D(w) \quad \text{where} \quad L_D(w) = \mathbb{E}_{z \sim D}[\ell(w, z)]$$
- So far, learning was based on the empirical risk $L_S(w)$.
- We now consider directly minimizing $L_D(w)$.
Stochastic Gradient Descent

$$\min_{w \in \mathcal{H}} L_D(w) \quad \text{where} \quad L_D(w) = \mathbb{E}_{z \sim D}[\ell(w, z)]$$

Recall the gradient descent method, in which we initialize $w^{(1)} = 0$ and update $w^{(t+1)} = w^{(t)} - \eta \nabla L_D(w^{(t)})$.
- Observe: $\nabla L_D(w) = \mathbb{E}_{z \sim D}[\nabla \ell(w, z)]$.
- We can't calculate $\nabla L_D(w)$ because we don't know $D$.
- But we can estimate it by $\nabla \ell(w, z)$ for $z \sim D$.
- If we take a step based on the direction $v = \nabla \ell(w, z)$, then in expectation we are moving in the right direction.
- In other words, $v$ is an unbiased estimate of the gradient $\nabla L_D(w)$.
- We'll show that this is good enough.
Stochastic Gradient Descent

initialize: $w^{(1)} = 0$
for $t = 1, 2, \ldots, T$:
    choose $z_t \sim D$
    let $v_t = \nabla \ell(w^{(t)}, z_t)$
    update $w^{(t+1)} = w^{(t)} - \eta v_t$
output $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$
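The procedure above can be sketched directly in code. This minimal sketch (all helper names and the synthetic distribution are illustrative, not from the lecture) runs SGD with the absolute loss $\ell(w, (x, y)) = |\langle w, x \rangle - y|$ and returns the averaged iterate $\bar{w}$:

```python
import random

def sgd(grad, draw_example, dim, eta, T):
    """SGD as on the slide: start from w(1) = 0, step against an unbiased
    gradient estimate, and output the average of w(1), ..., w(T)."""
    w = [0.0] * dim
    w_sum = [0.0] * dim
    for _ in range(T):
        z = draw_example()                      # z_t ~ D
        v = grad(w, z)                          # v_t, a (sub)gradient of l(., z_t) at w(t)
        w_sum = [s + wi for s, wi in zip(w_sum, w)]
        w = [wi - eta * vi for wi, vi in zip(w, v)]
    return [s / T for s in w_sum]

# Illustrative use on the absolute loss with a synthetic distribution.
random.seed(0)
w_true = [1.0, -2.0]

def draw():
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    return x, sum(wt * xi for wt, xi in zip(w_true, x))

def abs_loss_grad(w, z):
    x, y = z
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    sign = (residual > 0) - (residual < 0)      # subgradient of |.|
    return [sign * xi for xi in x]

w_bar = sgd(abs_loss_grad, draw, dim=2, eta=0.05, T=5000)
```

With these settings the averaged iterate lands near $w_{\text{true}}$; the averaging is what controls the oscillation the constant step size would otherwise leave.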
Analyzing SGD for convex-Lipschitz-bounded problems

By algebraic manipulations, for any sequence $v_1, \ldots, v_T$ and any $w^\star$,
$$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle = \frac{\|w^{(1)} - w^\star\|^2 - \|w^{(T+1)} - w^\star\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2$$

Assuming that $\|v_t\| \le \rho$ for all $t$ and that $\|w^\star\| \le B$, we obtain
$$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \le \frac{B^2}{2\eta} + \frac{\eta \rho^2 T}{2}$$

In particular, for $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$ we get
$$\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \le B \rho \sqrt{T}.$$
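The step size $\eta = \sqrt{B^2/(\rho^2 T)}$ is exactly the value that balances the two terms of the bound. A quick numeric check of that identity, with illustrative values for $B$, $\rho$, $T$:

```python
import math

# Check that eta = sqrt(B^2 / (rho^2 * T)) balances the two terms:
# B^2/(2*eta) + eta*rho^2*T/2 == B*rho*sqrt(T)   (values are illustrative).
B, rho, T = 2.0, 3.0, 10_000
eta = math.sqrt(B**2 / (rho**2 * T))
bound = B**2 / (2 * eta) + eta * rho**2 * T / 2
assert math.isclose(bound, B * rho * math.sqrt(T))
```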
Analyzing SGD for convex-Lipschitz-bounded problems

Taking expectation of both sides w.r.t. the randomness of choosing $z_1, \ldots, z_T$ we obtain:
$$\mathbb{E}_{z_1, \ldots, z_T}\left[\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle\right] \le B \rho \sqrt{T}.$$

The law of total expectation: for every two random variables $\alpha, \beta$ and a function $g$,
$$\mathbb{E}_\alpha[g(\alpha)] = \mathbb{E}_\beta\, \mathbb{E}_\alpha[g(\alpha) \mid \beta].$$

Therefore
$$\mathbb{E}_{z_1, \ldots, z_T}[\langle w^{(t)} - w^\star, v_t \rangle] = \mathbb{E}_{z_1, \ldots, z_{t-1}}\, \mathbb{E}_{z_t}[\langle w^{(t)} - w^\star, v_t \rangle \mid z_1, \ldots, z_{t-1}].$$

Once we know $z_1, \ldots, z_{t-1}$, the value of $w^{(t)}$ is not random, hence
$$\mathbb{E}_{z_t}[\langle w^{(t)} - w^\star, v_t \rangle \mid z_1, \ldots, z_{t-1}] = \langle w^{(t)} - w^\star, \mathbb{E}_{z_t}[\nabla \ell(w^{(t)}, z_t)] \rangle = \langle w^{(t)} - w^\star, \nabla L_D(w^{(t)}) \rangle.$$
Analyzing SGD for convex-Lipschitz-bounded problems

We got:
$$\mathbb{E}_{z_1, \ldots, z_T}\left[\sum_{t=1}^{T} \langle w^{(t)} - w^\star, \nabla L_D(w^{(t)}) \rangle\right] \le B \rho \sqrt{T}$$

By convexity, this means
$$\mathbb{E}_{z_1, \ldots, z_T}\left[\sum_{t=1}^{T} \left(L_D(w^{(t)}) - L_D(w^\star)\right)\right] \le B \rho \sqrt{T}$$

Dividing by $T$ and using convexity again (Jensen's inequality),
$$\mathbb{E}_{z_1, \ldots, z_T}\left[L_D\!\left(\frac{1}{T} \sum_{t=1}^{T} w^{(t)}\right)\right] \le L_D(w^\star) + \frac{B \rho}{\sqrt{T}}$$
Learning convex-Lipschitz-bounded problems using SGD

Corollary. Consider a convex-Lipschitz-bounded learning problem with parameters $\rho, B$. Then, for every $\epsilon > 0$, if we run the SGD method for minimizing $L_D(w)$ with a number of iterations (i.e., number of examples) $T \ge \frac{B^2 \rho^2}{\epsilon^2}$ and with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then the output of SGD satisfies:
$$\mathbb{E}[L_D(\bar{w})] \le \min_{w \in \mathcal{H}} L_D(w) + \epsilon.$$

Remark: a high-probability bound can be obtained using boosting the confidence (Lecture 4).
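Concretely, the corollary turns a target accuracy into a number of examples. A small worked computation (the values of $B$, $\rho$, $\epsilon$ are illustrative):

```python
import math

# Number of SGD iterations (= examples) sufficient for excess risk epsilon,
# per the corollary: T >= B^2 * rho^2 / epsilon^2  (illustrative values).
B, rho, eps = 1.0, 1.0, 0.1
T = math.ceil(B**2 * rho**2 / eps**2)       # T == 100
eta = math.sqrt(B**2 / (rho**2 * T))        # eta == 0.1
```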
Convex-smooth-bounded problems

A similar result holds for smooth problems:

Corollary. Consider a convex-smooth-bounded learning problem with parameters $\beta, B$. Assume in addition that $\ell(0, z) \le 1$ for all $z \in Z$. For every $\epsilon > 0$, set $\eta = \frac{1}{\beta(1 + 3/\epsilon)}$. Then, running SGD with $T \ge \frac{12 B^2 \beta}{\epsilon^2}$ yields
$$\mathbb{E}[L_D(\bar{w})] \le \min_{w \in \mathcal{H}} L_D(w) + \epsilon.$$
Outline — next: Learning Using Regularized Loss Minimization
Regularized Loss Minimization (RLM)

Given a regularization function $R : \mathbb{R}^d \to \mathbb{R}$, the RLM rule is:
$$A(S) = \operatorname{argmin}_{w} \left( L_S(w) + R(w) \right).$$

We will focus on Tikhonov regularization:
$$A(S) = \operatorname{argmin}_{w} \left( L_S(w) + \lambda \|w\|^2 \right).$$
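For the squared loss, the Tikhonov RLM rule has a closed form (ridge regression): setting the gradient of $L_S(w) + \lambda \|w\|^2$ to zero gives $\left(\frac{1}{m} X^\top X + \lambda I\right) w = \frac{1}{m} X^\top y$. A minimal sketch for $d = 2$, with a hypothetical helper name:

```python
def ridge_2d(data, lam):
    """Tikhonov RLM for the squared loss in d = 2: solves
    (X^T X / m + lam*I) w = X^T y / m via the 2x2 normal equations."""
    m = len(data)
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x1, x2), y in data:
        a11 += x1 * x1 / m
        a12 += x1 * x2 / m
        a22 += x2 * x2 / m
        b1 += x1 * y / m
        b2 += x2 * y / m
    a11 += lam
    a22 += lam
    det = a11 * a22 - a12 * a12
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

# With nearly no regularization we recover the interpolating solution;
# larger lam shrinks the output toward the zero vector.
S = [((1.0, 0.0), 1.0), ((0.0, 1.0), -2.0), ((1.0, 1.0), -1.0)]
w = ridge_2d(S, lam=1e-6)        # close to (1.0, -2.0)
```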
Why regularize?

- Similar to MDL: specify a prior belief in hypotheses. We bias ourselves toward short vectors.
- Stabilizer: we'll show that Tikhonov regularization makes the learner stable w.r.t. small perturbations of the training set, which in turn leads to generalization.
Stability

Informally: an algorithm $A$ is stable if a small change of its input $S$ leads to a small change of its output hypothesis.
We need to specify what counts as a small change of the input and a small change of the output.
Stability

Replace one sample: given $S = (z_1, \ldots, z_m)$ and an additional example $z'$, let
$$S^{(i)} = (z_1, \ldots, z_{i-1}, z', z_{i+1}, \ldots, z_m).$$

Definition (on-average-replace-one-stable). Let $\epsilon : \mathbb{N} \to \mathbb{R}$ be a monotonically decreasing function. We say that a learning algorithm $A$ is on-average-replace-one-stable with rate $\epsilon(m)$ if for every distribution $D$,
$$\mathbb{E}_{(S, z') \sim D^{m+1},\, i \sim U(m)}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right] \le \epsilon(m).$$
Stable rules do not overfit

Theorem. If $A$ is on-average-replace-one-stable with rate $\epsilon(m)$ then
$$\mathbb{E}_{S \sim D^m}[L_D(A(S)) - L_S(A(S))] \le \epsilon(m).$$

Proof. Since $S$ and $z'$ are both drawn i.i.d. from $D$, we have that for every $i$,
$$\mathbb{E}_S[L_D(A(S))] = \mathbb{E}_{S, z'}[\ell(A(S), z')] = \mathbb{E}_{S, z'}[\ell(A(S^{(i)}), z_i)].$$
On the other hand, we can write
$$\mathbb{E}_S[L_S(A(S))] = \mathbb{E}_{S, i}[\ell(A(S), z_i)].$$
The proof follows from the definition of stability.
Tikhonov Regularization as a Stabilizer

Theorem. Assume that the loss function is convex and $\rho$-Lipschitz. Then, the RLM rule with the regularizer $\lambda \|w\|^2$ is on-average-replace-one-stable with rate $\frac{2\rho^2}{\lambda m}$. It follows that
$$\mathbb{E}_{S \sim D^m}[L_D(A(S)) - L_S(A(S))] \le \frac{2\rho^2}{\lambda m}.$$

Similarly, for a convex, $\beta$-smooth, and non-negative loss the rate is $\frac{48 \beta C}{\lambda m}$, where $C$ is an upper bound on $\max_z \ell(0, z)$.

The proof relies on the notion of strong convexity and can be found in the book.
The Fitting-Stability Tradeoff

Observe:
$$\mathbb{E}_S[L_D(A(S))] = \mathbb{E}_S[L_S(A(S))] + \mathbb{E}_S[L_D(A(S)) - L_S(A(S))].$$

- The first term is how well $A$ fits the training set.
- The second term is the overfitting, and is bounded by the stability of $A$.
- $\lambda$ controls the tradeoff between the two terms.
The Fitting-Stability Tradeoff

Let $A$ be the RLM rule. We saw (for convex-Lipschitz losses)
$$\mathbb{E}_S[L_D(A(S)) - L_S(A(S))] \le \frac{2\rho^2}{\lambda m}$$

Fix some arbitrary vector $w^\star$; then
$$L_S(A(S)) \le L_S(A(S)) + \lambda \|A(S)\|^2 \le L_S(w^\star) + \lambda \|w^\star\|^2.$$

Taking expectation of both sides with respect to $S$ and noting that $\mathbb{E}_S[L_S(w^\star)] = L_D(w^\star)$, we obtain
$$\mathbb{E}_S[L_S(A(S))] \le L_D(w^\star) + \lambda \|w^\star\|^2.$$

Therefore:
$$\mathbb{E}_S[L_D(A(S))] \le L_D(w^\star) + \lambda \|w^\star\|^2 + \frac{2\rho^2}{\lambda m}$$
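Minimizing the right-hand side over $\lambda$ (with $\|w^\star\| \le B$) balances $\lambda B^2$ against $\frac{2\rho^2}{\lambda m}$, giving $\lambda = \frac{\rho}{B}\sqrt{\frac{2}{m}}$ and a bound of $L_D(w^\star) + B\rho\sqrt{8/m}$. A numeric check of that balance (the concrete values are illustrative):

```python
import math

def excess_bound(lam, B, rho, m):
    """The two lambda-dependent terms of the slide's bound: lam*B^2 + 2*rho^2/(lam*m)."""
    return lam * B**2 + 2 * rho**2 / (lam * m)

B, rho, m = 1.0, 1.0, 50_000
lam_star = (rho / B) * math.sqrt(2.0 / m)
best = excess_bound(lam_star, B, rho, m)
assert math.isclose(best, B * rho * math.sqrt(8.0 / m))
# A grid around lam_star never does better:
for lam in [lam_star / 10, lam_star / 2, lam_star * 2, lam_star * 10]:
    assert excess_bound(lam, B, rho, m) >= best
```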
The Regularization Path

The RLM rule as a function of $\lambda$ is
$$w(\lambda) = \operatorname{argmin}_{w} \left( L_S(w) + \lambda \|w\|^2 \right)$$

Can be seen as a Pareto objective: minimize both $L_S(w)$ and $\|w\|^2$.

(Figure: the feasible region in the $(\|w\|^2, L_S(w))$ plane, with the Pareto-optimal curve running from $w(\infty)$ to $w(0)$.)
How to choose $\lambda$?

- Bound minimization: choose $\lambda$ according to the bound on $L_D(w)$. Usually far from optimal, as the bound is worst case.
- Validation: calculate several Pareto-optimal points on the regularization path (by varying $\lambda$) and use a validation set to choose the best one.
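The validation approach can be sketched as follows; the RLM rule here is a hypothetical one-dimensional ridge fit, chosen only to make the example self-contained:

```python
def choose_lambda(train, val, lambdas, fit, loss):
    """Fit the RLM rule on the training set for each lambda and keep the
    one whose output has the smallest average loss on the validation set."""
    def val_loss(w):
        return sum(loss(w, z) for z in val) / len(val)
    fitted = {lam: fit(train, lam) for lam in lambdas}
    best = min(lambdas, key=lambda lam: val_loss(fitted[lam]))
    return best, fitted[best]

# Illustrative 1-d ridge as the RLM rule:
# argmin_w (1/m) sum (w*x - y)^2 + lam*w^2  =>  w = sum(xy) / (sum(x^2) + lam*m)
def fit_1d(S, lam):
    m = len(S)
    return sum(x * y for x, y in S) / (sum(x * x for x, _ in S) + lam * m)

def sq_loss(w, z):
    x, y = z
    return (w * x - y) ** 2

train = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0)]
val = [(3.0, 6.0), (0.5, 1.0)]
lam, w = choose_lambda(train, val, [1e-4, 1e-1, 10.0], fit_1d, sq_loss)
# Noiseless data with y = 2x, so the smallest lambda wins and w is close to 2.
```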
Outline — next: Dimension vs. Norm bounds; example application: Text categorization
Dimension vs. Norm Bounds

- Previously in the course, when we learned $d$ parameters the sample complexity grew with $d$.
- Here, we learn $d$ parameters but the sample complexity depends on the norm of $w^\star$ and on the Lipschitzness/smoothness, rather than on $d$.
- Which approach is better depends on the properties of the distribution.
Example: document categorization

"Signs all encouraging for Phelps in comeback. He did not win any gold medals or set any world records but Michael Phelps ticked all the boxes he needed in his comeback to competitive swimming."

Is this document about sport?
Bag-of-words representation

"Signs all encouraging for Phelps in comeback. He did not win any gold medals or set any world records but Michael Phelps ticked all the boxes he needed in his comeback to competitive swimming."

(Illustration: a binary vector indexed by dictionary words, e.g. swimming → 1, world → 1, elephant → 0.)
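The mapping from a document to its binary vector can be sketched as below; the tiny dictionary and the bias convention (last coordinate fixed to 1, matching the next slide) are illustrative:

```python
def bag_of_words(doc, dictionary):
    """Binary bag-of-words vector over the dictionary, with the bias
    coordinate x_d = 1 appended last."""
    words = set(doc.lower().split())
    return [1 if term in words else 0 for term in dictionary] + [1]

dictionary = ["swimming", "world", "elephant"]
doc = "he needed in his comeback to competitive swimming"
x = bag_of_words(doc, dictionary)
# x == [1, 0, 0, 1]
```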
Document categorization

- Let $X = \{x \in \{0, 1\}^d : \|x\|^2 \le R^2,\ x_d = 1\}$
- Think of $x \in X$ as a text document represented as a bag of words:
  - At most $R^2 - 1$ words in each document
  - $d - 1$ is the size of the dictionary
  - The last coordinate is the bias
- Let $Y = \{\pm 1\}$ (e.g., the document is about sport or not)
- Linear classifiers: $x \mapsto \operatorname{sign}(\langle w, x \rangle)$
- Intuitively: $w_i$ is large (positive) for words indicative of sport, while $w_i$ is small (negative) for words indicative of non-sport
- Hinge loss: $\ell(w, (x, y)) = [1 - y \langle w, x \rangle]_+$
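The hinge loss and a subgradient of it (the quantity SGD would use here) can be sketched as follows; the helper names are hypothetical:

```python
def hinge_loss(w, x, y):
    """l(w,(x,y)) = [1 - y*<w,x>]_+"""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)

def hinge_subgradient(w, x, y):
    """A subgradient of the hinge loss at w: -y*x when the margin is
    below 1, and the zero vector otherwise."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin < 1.0:
        return [-y * xi for xi in x]
    return [0.0] * len(x)

# A confidently correct prediction incurs zero loss and zero subgradient:
w, x = [2.0, 0.0, 1.0], [1, 0, 1]
assert hinge_loss(w, x, +1) == 0.0          # margin = 3 >= 1
assert hinge_loss(w, x, -1) == 4.0          # margin = -3, loss = 1 + 3
```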
Dimension vs. Norm

- The VC dimension is $d$, but $d$ can be extremely large (the number of words in English)
- The loss function is convex and $R$-Lipschitz
- Assume that the number of relevant words is small and their weights are not too large; then there is a $w^\star$ with small norm and small $L_D(w^\star)$
- Then, we can learn it with sample complexity that depends on $R^2 \|w^\star\|^2$, and does not depend on $d$ at all!
- But there are of course opposite cases, in which $d$ is much smaller than $R^2 \|w^\star\|^2$
Summary

- Learning convex learning problems using SGD
- Learning convex learning problems using RLM
- The regularization path
- Dimension vs. Norm
More informationCSC 411: Lecture 02: Linear Regression
CSC 411: Lecture 02: Linear Regression Richard Zemel, Raquel Urtasun and Sanja Fidler University of Toronto (Most plots in this lecture are from Bishop s book) Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationMachine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationLinear and Logistic Regression. Dr. Xiaowei Huang
Linear and Logistic Regression Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Two Classical Machine Learning Algorithms Decision tree learning K-nearest neighbor Model Evaluation Metrics
More informationl 1 and l 2 Regularization
David Rosenberg New York University February 5, 2015 David Rosenberg (New York University) DS-GA 1003 February 5, 2015 1 / 32 Tikhonov and Ivanov Regularization Hypothesis Spaces We ve spoken vaguely about
More informationInformation theoretic perspectives on learning algorithms
Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationCSC 411 Lecture 6: Linear Regression
CSC 411 Lecture 6: Linear Regression Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 06-Linear Regression 1 / 37 A Timely XKCD UofT CSC 411: 06-Linear Regression
More informationInverse Time Dependency in Convex Regularized Learning
Inverse Time Dependency in Convex Regularized Learning Zeyuan A. Zhu (Tsinghua University) Weizhu Chen (MSRA) Chenguang Zhu (Tsinghua University) Gang Wang (MSRA) Haixun Wang (MSRA) Zheng Chen (MSRA) December
More information1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016
AM 1: Advanced Optimization Spring 016 Prof. Yaron Singer Lecture 11 March 3rd 1 Overview In this lecture we will introduce the notion of online convex optimization. This is an extremely useful framework
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationStochastic Optimization
Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic
More informationLeast Mean Squares Regression
Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method
More informationOptimistic Rates Nati Srebro
Optimistic Rates Nati Srebro Based on work with Karthik Sridharan and Ambuj Tewari Examples based on work with Andy Cotter, Elad Hazan, Tomer Koren, Percy Liang, Shai Shalev-Shwartz, Ohad Shamir, Karthik
More informationStochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationSupport Vector Machine
Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
More informationOnline Learning Summer School Copenhagen 2015 Lecture 1
Online Learning Summer School Copenhagen 2015 Lecture 1 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Online Learning Shai Shalev-Shwartz (Hebrew U) OLSS Lecture
More informationComputational Learning Theory
Computational Learning Theory Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Ordibehesht 1390 Introduction For the analysis of data structures and algorithms
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationLearnability, Stability, Regularization and Strong Convexity
Learnability, Stability, Regularization and Strong Convexity Nati Srebro Shai Shalev-Shwartz HUJI Ohad Shamir Weizmann Karthik Sridharan Cornell Ambuj Tewari Michigan Toyota Technological Institute Chicago
More informationFast Stochastic Optimization Algorithms for ML
Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2
More informationCourse Notes for EE227C (Spring 2018): Convex Optimization and Approximation
Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October
More informationSGD and Deep Learning
SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients
More informationMachine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression
Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationTopics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families
Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationActive Learning: Disagreement Coefficient
Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting Readings: Murphy 16.4; Hastie 16 Dhruv Batra Virginia Tech Administrativia HW3 Due: April 14, 11:55pm You will implement
More informationIntroduction to Machine Learning
Introduction to Machine Learning PAC Learning and VC Dimension Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationEfficient Bandit Algorithms for Online Multiclass Prediction
Efficient Bandit Algorithms for Online Multiclass Prediction Sham Kakade, Shai Shalev-Shwartz and Ambuj Tewari Presented By: Nakul Verma Motivation In many learning applications, true class labels are
More informationCSC321 Lecture 9: Generalization
CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 27 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions
More informationComputational Learning Theory - Hilary Term : Learning Real-valued Functions
Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling
More informationMachine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015
Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and
More informationComputational Learning Theory: Shattering and VC Dimensions. Machine Learning. Spring The slides are mainly from Vivek Srikumar
Computational Learning Theory: Shattering and VC Dimensions Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Computational Learning Theory The Theory of Generalization
More informationMachine Learning and Data Mining. Linear classification. Kalev Kask
Machine Learning and Data Mining Linear classification Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ = f(x ; q) Parameters q Program ( Learner ) Learning algorithm Change q
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 14, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 14, 2017 1 Regularized Loss Minimization Assume h is defined by a vector w = (w 1,..., w d ) T R d (e.g., linear models) Regularization
More informationCS7267 MACHINE LEARNING
CS7267 MACHINE LEARNING ENSEMBLE LEARNING Ref: Dr. Ricardo Gutierrez-Osuna at TAMU, and Aarti Singh at CMU Mingon Kang, Ph.D. Computer Science, Kennesaw State University Definition of Ensemble Learning
More informationLogarithmic Regret Algorithms for Strongly Convex Repeated Games
Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 8: Boosting (and Compression Schemes) Boosting the Error If we have an efficient learning algorithm that for any distribution
More informationThe FTRL Algorithm with Strongly Convex Regularizers
CSE599s, Spring 202, Online Learning Lecture 8-04/9/202 The FTRL Algorithm with Strongly Convex Regularizers Lecturer: Brandan McMahan Scribe: Tamara Bonaci Introduction In the last lecture, we talked
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More informationSimple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017
Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationCSC321 Lecture 8: Optimization
CSC321 Lecture 8: Optimization Roger Grosse Roger Grosse CSC321 Lecture 8: Optimization 1 / 26 Overview We ve talked a lot about how to compute gradients. What do we actually do with them? Today s lecture:
More informationMachine Learning - MT & 5. Basis Expansion, Regularization, Validation
Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships
More informationGradient Descent. Ryan Tibshirani Convex Optimization /36-725
Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like
More informationLittlestone s Dimension and Online Learnability
Littlestone s Dimension and Online Learnability Shai Shalev-Shwartz Toyota Technological Institute at Chicago The Hebrew University Talk at UCSD workshop, February, 2009 Joint work with Shai Ben-David
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationMachine Learning. Ensemble Methods. Manfred Huber
Machine Learning Ensemble Methods Manfred Huber 2015 1 Bias, Variance, Noise Classification errors have different sources Choice of hypothesis space and algorithm Training set Noise in the data The expected
More informationDan Roth 461C, 3401 Walnut
CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationComposite Objective Mirror Descent
Composite Objective Mirror Descent John C. Duchi 1,3 Shai Shalev-Shwartz 2 Yoram Singer 3 Ambuj Tewari 4 1 University of California, Berkeley 2 Hebrew University of Jerusalem, Israel 3 Google Research
More informationKernelized Perceptron Support Vector Machines
Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:
More informationDeep Learning for Computer Vision
Deep Learning for Computer Vision Lecture 4: Curse of Dimensionality, High Dimensional Feature Spaces, Linear Classifiers, Linear Regression, Python, and Jupyter Notebooks Peter Belhumeur Computer Science
More informationLinear Models for Regression
Linear Models for Regression Machine Learning Torsten Möller Möller/Mori 1 Reading Chapter 3 of Pattern Recognition and Machine Learning by Bishop Chapter 3+5+6+7 of The Elements of Statistical Learning
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationPart of the slides are adapted from Ziko Kolter
Part of the slides are adapted from Ziko Kolter OUTLINE 1 Supervised learning: classification........................................................ 2 2 Non-linear regression/classification, overfitting,
More information