Introduction to Machine Learning (67577) Lecture 3
Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem
General Learning Model and the Bias-Complexity Tradeoff
Shai Shalev-Shwartz (Hebrew U), IML Lecture 2, bias-complexity tradeoff
Outline
1. The general PAC model: relaxing the realizability assumption; beyond binary classification; the general PAC learning model
2. Learning via Uniform Convergence
3. Linear Regression and Least Squares; Polynomial Fitting
4. The Bias-Complexity Tradeoff; Error Decomposition
5. Validation and Model Selection
Relaxing the realizability assumption (agnostic PAC learning)
So far we assumed that labels are generated by some f ∈ H. This assumption may be too strong. We relax the realizability assumption by replacing the target labeling function with a more flexible notion: a data-labels generating distribution.
Recall: in the PAC model, D is a distribution over X. From now on, let D be a distribution over X × Y.
We redefine the risk as
    L_D(h) := P_{(x,y)∼D}[h(x) ≠ y] = D({(x, y) : h(x) ≠ y})
and we redefine the approximately-correct notion to
    L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε.
PAC vs. Agnostic PAC learning

                PAC                                      Agnostic PAC
Distribution    D over X                                 D over X × Y
Truth           f ∈ H                                    not in class, or doesn't exist
Risk            L_{D,f}(h) = D({x : h(x) ≠ f(x)})        L_D(h) = D({(x,y) : h(x) ≠ y})
Training set    (x_1,...,x_m) ∼ D^m, ∀i y_i = f(x_i)     ((x_1,y_1),...,(x_m,y_m)) ∼ D^m
Goal            L_{D,f}(A(S)) ≤ ε                        L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε
Beyond Binary Classification
Scope of learning problems:
- Multiclass categorization: Y is a finite set representing the different classes. E.g., X is documents and Y = {News, Sports, Biology, Medicine}.
- Regression: Y = R. E.g., one wishes to predict a baby's birth weight based on ultrasound measures of head circumference, abdominal circumference, and femur length.
Loss Functions
Let Z = X × Y. Given a hypothesis h ∈ H and an example (x, y) ∈ Z, how good is h on (x, y)?
A loss function is a mapping ℓ : H × Z → R₊.
Examples:
- 0-1 loss: ℓ(h, (x, y)) = 1 if h(x) ≠ y, and 0 if h(x) = y
- Squared loss: ℓ(h, (x, y)) = (h(x) − y)²
- Absolute-value loss: ℓ(h, (x, y)) = |h(x) − y|
- Cost-sensitive loss: ℓ(h, (x, y)) = C_{h(x),y}, where C is some |Y| × |Y| matrix
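The four losses above depend on h only through the prediction h(x), so they can be written directly as functions of h(x) and y. A minimal Python sketch (the function names and the cost matrix are illustrative, not from the lecture):

```python
def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return 0.0 if y_hat == y else 1.0

def squared_loss(y_hat, y):
    """Squared loss, the standard choice for regression."""
    return (y_hat - y) ** 2

def absolute_loss(y_hat, y):
    """Absolute-value loss, less sensitive to outliers than squared loss."""
    return abs(y_hat - y)

def cost_sensitive_loss(y_hat, y, C):
    """Cost-sensitive loss: C[y_hat][y] for a |Y| x |Y| cost matrix C."""
    return C[y_hat][y]

# Hypothetical cost matrix: predicting class 1 when the truth is 0 costs 5,
# the reverse mistake costs 1, correct predictions cost 0.
C = [[0, 1],
     [5, 0]]
assert zero_one_loss(1, 1) == 0.0
assert squared_loss(3.0, 1.0) == 4.0
assert absolute_loss(3.0, 1.0) == 2.0
assert cost_sensitive_loss(1, 0, C) == 5
```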
The General PAC Learning Problem
We wish to Probably Approximately Solve
    min_{h∈H} L_D(h)   where   L_D(h) := E_{z∼D}[ℓ(h, z)].
- The learner knows H, Z, and ℓ
- The learner receives an accuracy parameter ε and a confidence parameter δ
- The learner can decide on the training set size m based on ε, δ
- The learner doesn't know D, but can sample S ∼ D^m
- Using S, the learner outputs some hypothesis A(S)
- We want that, with probability at least 1 − δ over the choice of S, L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε
Formal definition
A hypothesis class H is agnostic PAC learnable with respect to a set Z and a loss function ℓ : H × Z → R₊ if there exist a function m_H : (0,1)² → N and a learning algorithm A with the following property: for every ε, δ ∈ (0,1), every m ≥ m_H(ε, δ), and every distribution D over Z,
    D^m({S ∈ Z^m : L_D(A(S)) ≤ min_{h∈H} L_D(h) + ε}) ≥ 1 − δ.
Learning via Uniform Convergence
Representative Sample
Definition (ε-representative sample): A training set S is called ε-representative if
    ∀h ∈ H, |L_S(h) − L_D(h)| ≤ ε.
Lemma: Assume that a training set S is (ε/2)-representative. Then any output of ERM_H(S), namely any h_S ∈ argmin_{h∈H} L_S(h), satisfies
    L_D(h_S) ≤ min_{h∈H} L_D(h) + ε.
Proof: For every h ∈ H,
    L_D(h_S) ≤ L_S(h_S) + ε/2 ≤ L_S(h) + ε/2 ≤ L_D(h) + ε/2 + ε/2 = L_D(h) + ε.
Uniform Convergence is Sufficient for Learnability
Definition (uniform convergence): H has the uniform convergence property if there exists a function m_H^{UC} : (0,1)² → N such that for every ε, δ ∈ (0,1), every m ≥ m_H^{UC}(ε, δ), and every distribution D,
    D^m({S ∈ Z^m : S is ε-representative}) ≥ 1 − δ.
Corollary: If H has the uniform convergence property with a function m_H^{UC}, then H is agnostically PAC learnable with sample complexity m_H(ε, δ) ≤ m_H^{UC}(ε/2, δ). Furthermore, in that case, the ERM_H paradigm is a successful agnostic PAC learner for H.
Finite Classes are Agnostic PAC Learnable
We will prove the following:
Theorem: Assume H is finite and the range of the loss function is [0, 1]. Then H is agnostically PAC learnable using the ERM_H algorithm with sample complexity
    m_H(ε, δ) ≤ ⌈2 log(2|H|/δ) / ε²⌉.
Proof: It suffices to show that H has the uniform convergence property with
    m_H^{UC}(ε, δ) ≤ ⌈log(2|H|/δ) / (2ε²)⌉.
Proof (cont.): To show uniform convergence, we need
    D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) < δ.
Using the union bound:
    D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε})
      = D^m(∪_{h∈H} {S : |L_S(h) − L_D(h)| > ε})
      ≤ Σ_{h∈H} D^m({S : |L_S(h) − L_D(h)| > ε}).
Proof (cont.): Recall that L_D(h) = E_{z∼D}[ℓ(h, z)] and L_S(h) = (1/m) Σ_{i=1}^m ℓ(h, z_i). Denote θ_i = ℓ(h, z_i). Then, for all i, E[θ_i] = L_D(h).
Lemma (Hoeffding's inequality): Let θ_1, ..., θ_m be a sequence of i.i.d. random variables and assume that for all i, E[θ_i] = μ and P[a ≤ θ_i ≤ b] = 1. Then, for any ε > 0,
    P[ |(1/m) Σ_{i=1}^m θ_i − μ| > ε ] ≤ 2 exp(−2mε²/(b − a)²).
This implies:
    D^m({S : |L_S(h) − L_D(h)| > ε}) ≤ 2 exp(−2mε²).
Proof (cont.): We have shown
    D^m({S : ∃h ∈ H, |L_S(h) − L_D(h)| > ε}) ≤ 2|H| exp(−2mε²).
So, if m ≥ log(2|H|/δ) / (2ε²), then the right-hand side is at most δ, as required.
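The bound just proved can be checked numerically: draw many training sets, measure how often *some* hypothesis in a small finite class deviates from its true risk by more than ε, and compare against 2|H| exp(−2mε²). A minimal simulation with the 0-1 loss; the threshold class and the distribution are invented for illustration:

```python
import math
import random

random.seed(0)

# H: nine threshold classifiers h_t(x) = 1 iff x >= t, on X = [0, 1].
thresholds = [0.1 * t for t in range(1, 10)]

# D: x uniform on [0, 1], label y = 1 iff x >= 0.45 (no h in H is perfect).
def sample(m):
    return [(x, 1 if x >= 0.45 else 0)
            for x in (random.random() for _ in range(m))]

def true_risk(t):
    # h_t errs exactly when x lies between t and 0.45, so L_D(h_t) = |t - 0.45|.
    return abs(t - 0.45)

def emp_risk(t, S):
    return sum(1 for x, y in S if (1 if x >= t else 0) != y) / len(S)

m, eps, trials = 200, 0.1, 2000
bad = sum(
    any(abs(emp_risk(t, S) - true_risk(t)) > eps for t in thresholds)
    for S in (sample(m) for _ in range(trials))
) / trials
bound = 2 * len(thresholds) * math.exp(-2 * m * eps ** 2)
print(bad, "<=", bound)  # the empirical frequency should respect the bound
assert bad <= bound
```

The empirical frequency is typically far below the union bound, which is expected: the bound holds for every distribution, so it is loose on any particular one.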
The Discretization Trick
- Suppose H is parameterized by d numbers
- Suppose we are happy with a representation of each number using b bits (say, b = 32)
- Then |H| ≤ 2^{db}, and so
    m_H(ε, δ) ≤ ⌈(2db + 2 log(2/δ)) / ε²⌉.
- While not very elegant, it's a great tool for upper bounding sample complexity
Linear Regression and Least Squares
Linear Regression
X ⊆ R^d, Y = R, H = {x ↦ ⟨w, x⟩ : w ∈ R^d}
Example: d = 1, predict the weight of a child based on their age. [Figure: scatter plot of Weight (Kg.) against Age (years).]
The Squared Loss
The zero-one loss doesn't make sense in regression. Squared loss: ℓ(h, (x, y)) = (h(x) − y)².
The ERM problem:
    min_{w∈R^d} (1/m) Σ_{i=1}^m (⟨w, x_i⟩ − y_i)²
Equivalently, suppose X is a matrix whose i-th column is x_i, and y is a vector with y_i in its i-th entry; then
    min_{w∈R^d} ‖X^⊤w − y‖².
Background: Gradient and Optimization
- Given a function f : R → R, its derivative is f′(x) = lim_{Δ→0} (f(x + Δ) − f(x)) / Δ
- If x minimizes f(x), then f′(x) = 0
- Now take f : R^d → R. Its gradient is a d-dimensional vector, ∇f(x), where the i-th coordinate of ∇f(x) is the derivative of the scalar function g(a) = f((x_1, ..., x_{i−1}, x_i + a, x_{i+1}, ..., x_d))
- The derivative of g is called the partial derivative of f
- If x minimizes f(x), then ∇f(x) = (0, ..., 0)
Background: Jacobian and the chain rule
- The Jacobian of f : R^n → R^m at x ∈ R^n, denoted J_x(f), is the m × n matrix whose (i, j) element is the partial derivative of f_i : R^n → R w.r.t. its j-th variable at x
- Note: if m = 1, then J_x(f) = ∇f(x) (as a row vector)
- Example: if f(w) = Aw for A ∈ R^{m×n}, then J_w(f) = A
- Chain rule: given f : R^n → R^m and g : R^k → R^n, the Jacobian of the composition (f ∘ g) : R^k → R^m at x is
    J_x(f ∘ g) = J_{g(x)}(f) J_x(g).
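The chain rule can be sanity-checked numerically: for f(v) = ½‖v‖² and g(w) = Aw − y, it predicts that the gradient of f ∘ g at w is A^⊤(Aw − y), which should agree with finite differences. A small pure-Python check (the matrix and vectors are made-up test data):

```python
# f(v) = 0.5 * ||v||^2 and g(w) = A w - y; the chain rule predicts that the
# gradient of (f o g) at w equals A^T (A w - y).
A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
y = [1.0, 0.0, -1.0]
w = [0.5, -0.25]

def g(w):
    return [sum(A[i][j] * w[j] for j in range(2)) - y[i] for i in range(3)]

def f_of_g(w):
    return 0.5 * sum(v * v for v in g(w))

# Analytic gradient from the chain rule: A^T (A w - y).
r = g(w)
grad = [sum(A[i][j] * r[i] for i in range(3)) for j in range(2)]

# Finite-difference approximation of each partial derivative.
h = 1e-6
for j in range(2):
    wp = list(w)
    wp[j] += h
    num = (f_of_g(wp) - f_of_g(w)) / h
    assert abs(num - grad[j]) < 1e-3, (num, grad[j])
```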
Least Squares
Recall that we'd like to solve the ERM problem
    min_{w∈R^d} (1/2)‖X^⊤w − y‖².
- Let g(w) = X^⊤w − y and f(v) = ½‖v‖² = ½ Σ_{i=1}^m v_i²
- Then we need to solve min_w f(g(w))
- Note that J_w(g) = X^⊤ and J_v(f) = (v_1, ..., v_m)
- Using the chain rule: J_w(f ∘ g) = J_{g(w)}(f) J_w(g) = g(w)^⊤ X^⊤ = (X^⊤w − y)^⊤ X^⊤
- Requiring that J_w(f ∘ g) = (0, ..., 0) yields (X^⊤w − y)^⊤ X^⊤ = 0, i.e., X X^⊤ w = X y
- This is a linear system of equations. If X X^⊤ is invertible, the solution is w = (X X^⊤)^{−1} X y
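For d = 2 the normal equations X X^⊤ w = X y are a 2×2 system that can be solved by hand, which makes the closed form easy to check in plain Python. A minimal sketch (the data points are made up for illustration):

```python
# Fit h(x) = w0 + w1*x by least squares: the i-th example is the column (1, x_i),
# so X X^T = [[m, sum x], [sum x, sum x^2]] and X y = (sum y, sum x*y).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 1 + 2x, so the fit should be exact
m = len(xs)

s_x  = sum(xs)
s_xx = sum(x * x for x in xs)
s_y  = sum(ys)
s_xy = sum(x * y for x, y in zip(xs, ys))

# Solve the 2x2 normal equations by Cramer's rule.
det = m * s_xx - s_x * s_x          # X X^T is invertible iff det != 0
w0 = (s_y * s_xx - s_x * s_xy) / det
w1 = (m * s_xy - s_x * s_y) / det
print(w0, w1)  # expect 1.0 and 2.0
assert abs(w0 - 1.0) < 1e-9 and abs(w1 - 2.0) < 1e-9
```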
Least Squares (cont.)
What if X X^⊤ is not invertible? In the exercise you'll see that there is always a solution to the set of linear equations, using the pseudo-inverse.
A non-rigorous trick to help remember the formula: we want X^⊤w ≈ y; multiply both sides by X to obtain X X^⊤ w ≈ X y; multiply both sides by (X X^⊤)^{−1} to obtain the formula w = (X X^⊤)^{−1} X y.
Least Squares: interpretation as projection
Recall, we try to minimize ‖X^⊤w − y‖. The set C = {X^⊤w : w ∈ R^d} ⊆ R^m is a linear subspace, forming the range of X^⊤. Therefore, if w is the least squares solution, then the vector ŷ = X^⊤w is the vector in C which is closest to y. This is called the projection of y onto C. We can find ŷ by taking V to be an m × d matrix whose columns are an orthonormal basis of the range of X^⊤, and then setting ŷ = V V^⊤ y.
Polynomial Fitting
Sometimes linear predictors are not expressive enough for our data. We will show how to fit a polynomial to the data using linear regression.
A one-dimensional polynomial function of degree n: p(x) = a_0 + a_1 x + a_2 x² + ... + a_n x^n.
Goal: given data S = ((x_1, y_1), ..., (x_m, y_m)), find the ERM with respect to the class of polynomials of degree n.
Reduction to linear regression:
- Define ψ : R → R^{n+1} by ψ(x) = (1, x, x², ..., x^n)
- Define a = (a_0, a_1, ..., a_n) and observe: p(x) = Σ_{i=0}^n a_i x^i = ⟨a, ψ(x)⟩
- To find a, we can solve Least Squares w.r.t. ((ψ(x_1), y_1), ..., (ψ(x_m), y_m))
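The reduction is mechanical: expand each x_i through ψ, then solve the resulting normal equations. A self-contained sketch with a small Gaussian-elimination solver; the helper names are mine, not the lecture's:

```python
def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[c][c] != 0:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def poly_fit(xs, ys, n):
    """Least-squares fit of a degree-n polynomial via psi(x) = (1, x, ..., x^n)."""
    feats = [[x ** k for k in range(n + 1)] for x in xs]   # rows are psi(x_i)
    # Normal equations: (Psi^T Psi) a = Psi^T y.
    G = [[sum(f[i] * f[j] for f in feats) for j in range(n + 1)]
         for i in range(n + 1)]
    b = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(n + 1)]
    return solve(G, b)

# Data sampled exactly from p(x) = 2 - x + 3x^2, so the fit should recover it.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [2 - x + 3 * x * x for x in xs]
a = poly_fit(xs, ys, 2)
assert all(abs(ai - ti) < 1e-8 for ai, ti in zip(a, [2.0, -1.0, 3.0]))
```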
The Bias-Complexity Tradeoff
Error Decomposition
Let h_S = ERM_H(S). We can decompose the risk of h_S as
    L_D(h_S) = ε_app + ε_est.
The approximation error, ε_app = min_{h∈H} L_D(h):
- How much risk we have due to restricting to H
- Doesn't depend on S
- Decreases with the complexity (size, or VC dimension) of H
The estimation error, ε_est = L_D(h_S) − ε_app:
- The result of L_S being only an estimate of L_D
- Decreases with the size of S
- Increases with the complexity of H
Bias-Complexity Tradeoff
How should we choose H? [Figure: the same data fit with polynomials of degree 2, degree 3, and degree 10.]
Validation and Model Selection
Validation
We have already learned some hypothesis h, and now we want to estimate how good h is.
Simple solution:
- Take a fresh i.i.d. sample V = (x_1, y_1), ..., (x_{m_v}, y_{m_v})
- Output L_V(h) as an estimator of L_D(h)
Using Hoeffding's inequality, if the range of ℓ is [0, 1], then with probability at least 1 − δ,
    |L_V(h) − L_D(h)| ≤ √(log(2/δ) / (2 m_v)).
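The bound above gives a concrete validation-set size: to estimate the risk to within ε with confidence 1 − δ, it suffices that m_v ≥ log(2/δ) / (2ε²). A quick sketch (the numbers are just an example):

```python
import math

def validation_size(eps, delta):
    """Smallest m_v such that sqrt(log(2/delta) / (2*m_v)) <= eps."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def validation_radius(m_v, delta):
    """Half-width of the Hoeffding confidence interval around L_V(h)."""
    return math.sqrt(math.log(2 / delta) / (2 * m_v))

# E.g., estimating the risk to within 0.01 at 95% confidence:
m_v = validation_size(0.01, 0.05)
print(m_v)  # about 18445 examples
assert validation_radius(m_v, 0.05) <= 0.01
```

Note the m_v ∝ 1/ε² scaling: halving the desired accuracy quadruples the required validation-set size, while tightening the confidence δ costs only logarithmically.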
Validation for Model Selection
Fit polynomials of degrees 2, 3, and 10 based on the black points; the red points are validation examples. We choose the degree-3 polynomial, as it has minimal validation error. [Figure omitted.]
Validation for Model Selection: Analysis
Let H = {h_1, ..., h_r} be the output predictors of applying ERM w.r.t. the different classes on S. Let V be a fresh validation set, and choose h* ∈ ERM_H(V). By our analysis of finite classes,
    L_D(h*) ≤ min_{h∈H} L_D(h) + √(2 log(2|H|/δ) / |V|).
The model-selection curve
[Figure: train and validation error (y-axis) as a function of polynomial degree (x-axis).]
Train-Validation-Test split
In practice, we usually have one pool of examples and split it into three sets:
- Training set: apply the learning algorithm with different parameters on the training set to produce H = {h_1, ..., h_r}
- Validation set: choose h* from H based on the validation set
- Test set: estimate the error of h* using the test set
k-fold cross validation
The train-validation-test split is the best approach when data is plentiful. If data is scarce:

k-fold Cross Validation for Model Selection
input: training set S = (x_1, y_1), ..., (x_m, y_m);
       learning algorithm A and a set of parameter values Θ
partition S into S_1, S_2, ..., S_k
foreach θ ∈ Θ:
    for i = 1, ..., k:
        h_{i,θ} = A(S \ S_i ; θ)
    error(θ) = (1/k) Σ_{i=1}^k L_{S_i}(h_{i,θ})
output: θ* = argmin_θ error(θ), h_{θ*} = A(S ; θ*)
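The pseudocode above translates directly to Python. A minimal sketch that selects between a constant predictor (θ = 0) and a line (θ = 1) with the squared loss; the data, fold construction, and the two toy learners are invented for illustration:

```python
import random

random.seed(1)

def fit(S, deg):
    """A(S; deg): least-squares constant (deg=0) or affine line (deg=1)."""
    xs = [x for x, _ in S]
    ys = [y for _, y in S]
    m = len(S)
    if deg == 0:
        c = sum(ys) / m
        return lambda x: c
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx
    w0 = (sy * sxx - sx * sxy) / det
    w1 = (m * sxy - sx * sy) / det
    return lambda x: w0 + w1 * x

def sq_loss(h, S):
    return sum((h(x) - y) ** 2 for x, y in S) / len(S)

def kfold_select(S, thetas, k):
    """Pick theta minimizing the k-fold CV error, then retrain on all of S."""
    folds = [S[i::k] for i in range(k)]
    def cv_error(theta):
        errs = []
        for i in range(k):
            train = [z for j, f in enumerate(folds) if j != i for z in f]
            errs.append(sq_loss(fit(train, theta), folds[i]))
        return sum(errs) / k
    best = min(thetas, key=cv_error)
    return best, fit(S, best)

# Noisy samples from y = 2x + 1; the line (deg=1) should beat the constant.
S = [(x / 10, 2 * (x / 10) + 1 + random.gauss(0, 0.1)) for x in range(40)]
best_theta, h = kfold_select(S, [0, 1], k=5)
print(best_theta)  # expect 1
assert best_theta == 1
```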
Summary
- The general PAC model: agnostic learning, general loss functions
- Uniform convergence is sufficient for learnability
- Uniform convergence holds for finite classes and bounded loss
- Linear regression, least squares, polynomial fitting
- The bias-complexity tradeoff: approximation error vs. estimation error
- Validation and model selection
Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and
More information12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016
12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses
More informationIntroduction and Models
CSE522, Winter 2011, Learning Theory Lecture 1 and 2-01/04/2011, 01/06/2011 Lecturer: Ofer Dekel Introduction and Models Scribe: Jessica Chang Machine learning algorithms have emerged as the dominant and
More informationMachine Learning. Linear Models. Fabio Vandin October 10, 2017
Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w
More informationThe PAC Learning Framework -II
The PAC Learning Framework -II Prof. Dan A. Simovici UMB 1 / 1 Outline 1 Finite Hypothesis Space - The Inconsistent Case 2 Deterministic versus stochastic scenario 3 Bayes Error and Noise 2 / 1 Outline
More informationPAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University PAC Learning Matt Gormley Lecture 14 March 5, 2018 1 ML Big Picture Learning Paradigms:
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory Problem set 1 Due: Monday, October 10th Please send your solutions to learning-submissions@ttic.edu Notation: Input space: X Label space: Y = {±1} Sample:
More informationMachine Learning. Computational Learning Theory. Eric Xing , Fall Lecture 9, October 5, 2016
Machine Learning 10-701, Fall 2016 Computational Learning Theory Eric Xing Lecture 9, October 5, 2016 Reading: Chap. 7 T.M book Eric Xing @ CMU, 2006-2016 1 Generalizability of Learning In machine learning
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,
More informationThe Perceptron algorithm
The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationComputational and Statistical Learning theory
Computational and Statistical Learning theory Problem set 2 Due: January 31st Email solutions to : karthik at ttic dot edu Notation : Input space : X Label space : Y = {±1} Sample : (x 1, y 1,..., (x n,
More informationLittlestone s Dimension and Online Learnability
Littlestone s Dimension and Online Learnability Shai Shalev-Shwartz Toyota Technological Institute at Chicago The Hebrew University Talk at UCSD workshop, February, 2009 Joint work with Shai Ben-David
More informationIntroduction to Machine Learning
Introduction to Machine Learning 236756 Prof. Nir Ailon Lecture 4: Computational Complexity of Learning & Surrogate Losses Efficient PAC Learning Until now we were mostly worried about sample complexity
More informationThe sample complexity of agnostic learning with deterministic labels
The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College
More informationMachine Learning Theory (CS 6783)
Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Kimball, B-11 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)
More informationMachine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Computational Learning Theory Le Song Lecture 11, September 20, 2012 Based on Slides from Eric Xing, CMU Reading: Chap. 7 T.M book 1 Complexity of Learning
More informationComputational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar
Computational Learning Theory: Probably Approximately Correct (PAC) Learning Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Computational Learning Theory The Theory
More informationConsistency of Nearest Neighbor Methods
E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationCOMP 551 Applied Machine Learning Lecture 2: Linear Regression
COMP 551 Applied Machine Learning Lecture 2: Linear Regression Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise
More information6.867 Machine learning: lecture 2. Tommi S. Jaakkola MIT CSAIL
6.867 Machine learning: lecture 2 Tommi S. Jaakkola MIT CSAIL tommi@csail.mit.edu Topics The learning problem hypothesis class, estimation algorithm loss and estimation criterion sampling, empirical and
More informationConditional Sparse Linear Regression
COMS 6998-4 Fall 07 November 3, 07 Introduction. Background Conditional Sparse Linear Regression esenter: Xingyu Niu Scribe: Jinyi Zhang Given a distribution D over R d R and data samples (y, z) D, linear
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationCOMP 551 Applied Machine Learning Lecture 2: Linear regression
COMP 551 Applied Machine Learning Lecture 2: Linear regression Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted for this
More informationFORMULATION OF THE LEARNING PROBLEM
FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we
More informationAn Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI
An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process
More information1 Differential Privacy and Statistical Query Learning
10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 5: December 07, 015 1 Differential Privacy and Statistical Query Learning 1.1 Differential Privacy Suppose
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationComputational Learning Theory - Hilary Term : Learning Real-valued Functions
Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationComputational Learning Theory. CS534 - Machine Learning
Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows
More informationComputational Learning Theory. CS 486/686: Introduction to Artificial Intelligence Fall 2013
Computational Learning Theory CS 486/686: Introduction to Artificial Intelligence Fall 2013 1 Overview Introduction to Computational Learning Theory PAC Learning Theory Thanks to T Mitchell 2 Introduction
More informationMulticlass Learnability and the ERM principle
Multiclass Learnability and the ERM principle Amit Daniely Sivan Sabato Shai Ben-David Shai Shalev-Shwartz November 5, 04 arxiv:308.893v [cs.lg] 4 Nov 04 Abstract We study the sample complexity of multiclass
More informationMachine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution
More informationLecture Support Vector Machine (SVM) Classifiers
Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 6: Training versus Testing (LFD 2.1) Cho-Jui Hsieh UC Davis Jan 29, 2018 Preamble to the theory Training versus testing Out-of-sample error (generalization error): What
More informationCOS 402 Machine Learning and Artificial Intelligence Fall Lecture 3: Learning Theory
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 3: Learning Theory Sanjeev Arora Elad Hazan Admin Exercise 1 due next Tue, in class Enrolment Recap We have seen: AI by introspection
More informationCS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims
CS340 Machine learning Lecture 5 Learning theory cont'd Some slides are borrowed from Stuart Russell and Thorsten Joachims Inductive learning Simplest form: learn a function from examples f is the target
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationMachine Learning Theory (CS 6783)
Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Hollister, 306 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More information4 Bias-Variance for Ridge Regression (24 points)
Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,
More informationLecture 21: Minimax Theory
Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways
More informationSample Complexity of Learning Independent of Set Theory
Sample Complexity of Learning Independent of Set Theory Shai Ben-David University of Waterloo, Canada Based on joint work with Pavel Hrubes, Shay Moran, Amir Shpilka and Amir Yehudayoff Simons workshop,
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of
More informationIntroduction to Machine Learning
Introduction to Machine Learning PAC Learning and VC Dimension Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE
More informationVariance Reduction and Ensemble Methods
Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis
More informationBias-Variance Decomposition. Mohammad Emtiyaz Khan EPFL Oct 6, 2015
Bias-Variance Decomposition Mohammad Emtiyaz Khan EPFL Oct 6, 2015 Mohammad Emtiyaz Khan 2015 Motivation In ridge regression, we observe a typical behaviour for train and test errors with respect to model
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationVC Dimension Review. The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces.
VC Dimension Review The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces. Previously, in discussing PAC learning, we were trying to answer questions about
More informationGeneralization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh
Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds
More informationEvaluation requires to define performance measures to be optimized
Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation
More informationOnline Convex Optimization
Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables?
Linear Regression Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2014 1 What about continuous variables? n Billionaire says: If I am measuring a continuous variable, what
More informationEfficiently Training Sum-Product Neural Networks using Forward Greedy Selection
Efficiently Training Sum-Product Neural Networks using Forward Greedy Selection Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Greedy Algorithms, Frank-Wolfe and Friends
More informationPart of the slides are adapted from Ziko Kolter
Part of the slides are adapted from Ziko Kolter OUTLINE 1 Supervised learning: classification........................................................ 2 2 Non-linear regression/classification, overfitting,
More informationGeneralization, Overfitting, and Model Selection
Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationActive Learning and Optimized Information Gathering
Active Learning and Optimized Information Gathering Lecture 7 Learning Theory CS 101.2 Andreas Krause Announcements Project proposal: Due tomorrow 1/27 Homework 1: Due Thursday 1/29 Any time is ok. Office
More informationAssociation studies and regression
Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration
More informationGeneralization Bounds and Stability
Generalization Bounds and Stability Lorenzo Rosasco Tomaso Poggio 9.520 Class 9 2009 About this class Goal To recall the notion of generalization bounds and show how they can be derived from a stability
More informationLearning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin
Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good
More informationMulticlass Learnability and the ERM Principle
Journal of Machine Learning Research 6 205 2377-2404 Submitted 8/3; Revised /5; Published 2/5 Multiclass Learnability and the ERM Principle Amit Daniely amitdaniely@mailhujiacil Dept of Mathematics, The
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 7: Computational Complexity of Learning Agnostic Learning Hardness of Learning via Crypto Assumption: No poly-time algorithm
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More information1 Review of The Learning Setting
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we
More informationAgnostic Online learnability
Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes
More informationLEARNING KERNEL-BASED HALFSPACES WITH THE 0-1 LOSS
LEARNING KERNEL-BASED HALFSPACES WITH THE 0- LOSS SHAI SHALEV-SHWARTZ, OHAD SHAMIR, AND KARTHIK SRIDHARAN Abstract. We describe and analyze a new algorithm for agnostically learning kernel-based halfspaces
More informationMachine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression
Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,
More informationE0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)
E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how
More informationTTIC An Introduction to the Theory of Machine Learning. Learning from noisy data, intro to SQ model
TTIC 325 An Introduction to the Theory of Machine Learning Learning from noisy data, intro to SQ model Avrim Blum 4/25/8 Learning when there is no perfect predictor Hoeffding/Chernoff bounds: minimizing
More informationIntroduction to Machine Learning
Introduction to Machine Learning Vapnik Chervonenkis Theory Barnabás Póczos Empirical Risk and True Risk 2 Empirical Risk Shorthand: True risk of f (deterministic): Bayes risk: Let us use the empirical
More informationLogarithmic Regret Algorithms for Strongly Convex Repeated Games
Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600
More informationLinear Classifiers. Michael Collins. January 18, 2012
Linear Classifiers Michael Collins January 18, 2012 Today s Lecture Binary classification problems Linear classifiers The perceptron algorithm Classification Problems: An Example Goal: build a system that
More informationLogistic Regression. Will Monroe CS 109. Lecture Notes #22 August 14, 2017
1 Will Monroe CS 109 Logistic Regression Lecture Notes #22 August 14, 2017 Based on a chapter by Chris Piech Logistic regression is a classification algorithm1 that works by trying to learn a function
More information