Introduction to Machine Learning (67577) Lecture 3


Introduction to Machine Learning (67577), Lecture 3
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
General Learning Model and the Bias-Complexity Tradeoff

Outline
1 The general PAC model: relaxing the realizability assumption; beyond binary classification; the general PAC learning model
2 Learning via Uniform Convergence
3 Linear Regression and Least Squares; Polynomial Fitting
4 The Bias-Complexity Tradeoff; Error Decomposition
5 Validation and Model Selection

Relaxing the realizability assumption: Agnostic PAC learning
So far we assumed that labels are generated by some $f \in H$. This assumption may be too strong. We relax the realizability assumption by replacing the target labeling function with a more flexible notion: a data-labels generating distribution.

Relaxing the realizability assumption: Agnostic PAC learning
Recall: in the PAC model, $D$ is a distribution over $X$.
From now on, let $D$ be a distribution over $X \times Y$.
We redefine the risk as
$$L_D(h) \;\stackrel{\text{def}}{=}\; \mathbb{P}_{(x,y)\sim D}[h(x) \neq y] \;\stackrel{\text{def}}{=}\; D(\{(x,y) : h(x) \neq y\}).$$
We redefine the approximately correct notion to
$$L_D(A(S)) \le \min_{h \in H} L_D(h) + \epsilon.$$

PAC vs. Agnostic PAC learning
Distribution: PAC: $D$ over $X$. Agnostic PAC: $D$ over $X \times Y$.
Truth: PAC: $f \in H$. Agnostic PAC: not in the class, or doesn't exist.
Risk: PAC: $L_{D,f}(h) = D(\{x : h(x) \neq f(x)\})$. Agnostic PAC: $L_D(h) = D(\{(x,y) : h(x) \neq y\})$.
Training set: PAC: $(x_1,\dots,x_m) \sim D^m$ with $y_i = f(x_i)$ for all $i$. Agnostic PAC: $((x_1,y_1),\dots,(x_m,y_m)) \sim D^m$.
Goal: PAC: $L_{D,f}(A(S)) \le \epsilon$. Agnostic PAC: $L_D(A(S)) \le \min_{h \in H} L_D(h) + \epsilon$.

Beyond Binary Classification
Scope of learning problems:
Multiclass categorization: $Y$ is a finite set representing $|Y|$ different classes. E.g., $X$ is a set of documents and $Y$ = {News, Sports, Biology, Medicine}.
Regression: $Y = \mathbb{R}$. E.g., one wishes to predict a baby's birth weight based on ultrasound measures of head circumference, abdominal circumference, and femur length.

Loss Functions
Let $Z = X \times Y$.
Given a hypothesis $h \in H$ and an example $(x,y) \in Z$, how good is $h$ on $(x,y)$?
Loss function: $\ell : H \times Z \to \mathbb{R}_+$
Examples:
0-1 loss: $\ell(h,(x,y)) = 1$ if $h(x) \neq y$ and $0$ if $h(x) = y$
Squared loss: $\ell(h,(x,y)) = (h(x) - y)^2$
Absolute-value loss: $\ell(h,(x,y)) = |h(x) - y|$
Cost-sensitive loss: $\ell(h,(x,y)) = C_{h(x),y}$ where $C$ is some $|Y| \times |Y|$ matrix
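To make these loss functions concrete, here is a minimal Python sketch of the four examples above, applied to a hypothesis represented as an ordinary function. The names (`zero_one_loss`, `cost_sensitive_loss`, etc.) and the tiny usage values are illustrative, not part of the lecture.

```python
import numpy as np

def zero_one_loss(h, x, y):
    # 1 if the prediction disagrees with the label, 0 otherwise
    return 0.0 if h(x) == y else 1.0

def squared_loss(h, x, y):
    return (h(x) - y) ** 2

def absolute_loss(h, x, y):
    return abs(h(x) - y)

def cost_sensitive_loss(h, x, y, C):
    # C is a |Y| x |Y| matrix; labels are assumed to be encoded as 0..|Y|-1
    return C[h(x), y]

# Tiny usage example with simple predictors
h = lambda x: 1
print(zero_one_loss(h, x=3.0, y=0))               # 1.0
print(squared_loss(lambda x: 0.5 * x, 3.0, 2.0))  # (1.5 - 2.0)^2 = 0.25
C = np.array([[0.0, 1.0], [10.0, 0.0]])           # false positives 10x as costly
print(cost_sensitive_loss(h, x=3.0, y=0, C=C))    # 10.0
```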

The General PAC Learning Problem
We wish to Probably Approximately Solve:
$$\min_{h \in H} L_D(h) \quad \text{where} \quad L_D(h) \stackrel{\text{def}}{=} \mathbb{E}_{z \sim D}[\ell(h,z)].$$
The learner knows $H$, $Z$, and $\ell$.
The learner receives an accuracy parameter $\epsilon$ and a confidence parameter $\delta$.
The learner can decide on the training set size $m$ based on $\epsilon, \delta$.
The learner doesn't know $D$ but can sample $S \sim D^m$.
Using $S$, the learner outputs some hypothesis $A(S)$.
We want that with probability of at least $1-\delta$ over the choice of $S$, the following holds:
$$L_D(A(S)) \le \min_{h \in H} L_D(h) + \epsilon.$$

Formal definition
A hypothesis class $H$ is agnostic PAC learnable with respect to a set $Z$ and a loss function $\ell : H \times Z \to \mathbb{R}_+$ if there exist a function $m_H : (0,1)^2 \to \mathbb{N}$ and a learning algorithm $A$ with the following property: for every $\epsilon, \delta \in (0,1)$, every $m \ge m_H(\epsilon,\delta)$, and every distribution $D$ over $Z$,
$$D^m\left(\left\{S \in Z^m : L_D(A(S)) \le \min_{h \in H} L_D(h) + \epsilon\right\}\right) \ge 1 - \delta.$$

Next: Learning via Uniform Convergence

Representative Sample
Definition ($\epsilon$-representative sample): A training set $S$ is called $\epsilon$-representative if for all $h \in H$, $|L_S(h) - L_D(h)| \le \epsilon$.

Representative Sample
Lemma: Assume that a training set $S$ is $\epsilon/2$-representative. Then any output of $\mathrm{ERM}_H(S)$, namely any $h_S \in \mathrm{argmin}_{h \in H} L_S(h)$, satisfies
$$L_D(h_S) \le \min_{h \in H} L_D(h) + \epsilon.$$
Proof: For every $h \in H$,
$$L_D(h_S) \le L_S(h_S) + \tfrac{\epsilon}{2} \le L_S(h) + \tfrac{\epsilon}{2} \le L_D(h) + \tfrac{\epsilon}{2} + \tfrac{\epsilon}{2} = L_D(h) + \epsilon.$$

Uniform Convergence is Sufficient for Learnability
Definition (uniform convergence): $H$ has the uniform convergence property if there exists a function $m^{UC}_H : (0,1)^2 \to \mathbb{N}$ such that for every $\epsilon, \delta \in (0,1)$, every distribution $D$, and every $m \ge m^{UC}_H(\epsilon,\delta)$,
$$D^m(\{S \in Z^m : S \text{ is } \epsilon\text{-representative}\}) \ge 1 - \delta.$$
Corollary: If $H$ has the uniform convergence property with a function $m^{UC}_H$, then $H$ is agnostically PAC learnable with sample complexity $m_H(\epsilon,\delta) \le m^{UC}_H(\epsilon/2, \delta)$. Furthermore, in that case, the $\mathrm{ERM}_H$ paradigm is a successful agnostic PAC learner for $H$.

Finite Classes are Agnostic PAC Learnable
We will prove the following:
Theorem: Assume $H$ is finite and the range of the loss function is $[0,1]$. Then $H$ is agnostically PAC learnable using the $\mathrm{ERM}_H$ algorithm with sample complexity
$$m_H(\epsilon,\delta) \le \left\lceil \frac{2\log(2|H|/\delta)}{\epsilon^2} \right\rceil.$$
Proof: It suffices to show that $H$ has the uniform convergence property with
$$m^{UC}_H(\epsilon,\delta) \le \left\lceil \frac{\log(2|H|/\delta)}{2\epsilon^2} \right\rceil.$$
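As a quick sanity check on the bound, here is a small Python computation of this sample-complexity guarantee for some illustrative values of $|H|$, $\epsilon$, and $\delta$ (the numbers are only an example, not from the lecture).

```python
import math

def sample_complexity(H_size, eps, delta):
    """Upper bound on m_H(eps, delta) for a finite class with loss in [0, 1]."""
    return math.ceil(2 * math.log(2 * H_size / delta) / eps ** 2)

# Illustrative values: a class with one million hypotheses
for eps in (0.1, 0.05, 0.01):
    m = sample_complexity(H_size=10**6, eps=eps, delta=0.01)
    print(f"eps={eps:<5} delta=0.01  ->  m <= {m}")
```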

Proof (cont.)
To show uniform convergence, we need:
$$D^m(\{S : \exists h \in H,\ |L_S(h) - L_D(h)| > \epsilon\}) < \delta.$$
Using the union bound:
$$D^m(\{S : \exists h \in H,\ |L_S(h) - L_D(h)| > \epsilon\}) = D^m\Big(\textstyle\bigcup_{h \in H} \{S : |L_S(h) - L_D(h)| > \epsilon\}\Big) \le \sum_{h \in H} D^m(\{S : |L_S(h) - L_D(h)| > \epsilon\}).$$

Proof (cont.)
Recall: $L_D(h) = \mathbb{E}_{z \sim D}[\ell(h,z)]$ and $L_S(h) = \frac{1}{m}\sum_{i=1}^m \ell(h, z_i)$.
Denote $\theta_i = \ell(h, z_i)$. Then, for all $i$, $\mathbb{E}[\theta_i] = L_D(h)$.
Lemma (Hoeffding's inequality): Let $\theta_1,\dots,\theta_m$ be a sequence of i.i.d. random variables and assume that for all $i$, $\mathbb{E}[\theta_i] = \mu$ and $\mathbb{P}[a \le \theta_i \le b] = 1$. Then, for any $\epsilon > 0$,
$$\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^m \theta_i - \mu\right| > \epsilon\right] \le 2\exp\left(-2m\epsilon^2/(b-a)^2\right).$$
This implies:
$$D^m(\{S : |L_S(h) - L_D(h)| > \epsilon\}) \le 2\exp(-2m\epsilon^2).$$
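The following small Monte Carlo sketch (illustrative only, not part of the lecture) compares the empirical probability of a large deviation of the sample mean with Hoeffding's bound, for Bernoulli losses bounded in $[0,1]$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, mu = 200, 0.1, 0.3            # sample size, deviation, true mean
trials = 20_000

# Draw `trials` independent samples of size m from Bernoulli(mu)
samples = rng.random((trials, m)) < mu
deviations = np.abs(samples.mean(axis=1) - mu)

empirical = np.mean(deviations > eps)
hoeffding = 2 * np.exp(-2 * m * eps**2)   # here (b - a) = 1
print(f"empirical P[|mean - mu| > eps] ~ {empirical:.4f}")
print(f"Hoeffding bound               = {hoeffding:.4f}")
```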

Proof (cont.)
We have shown:
$$D^m(\{S : \exists h \in H,\ |L_S(h) - L_D(h)| > \epsilon\}) \le 2|H|\exp(-2m\epsilon^2).$$
So, if $m \ge \frac{\log(2|H|/\delta)}{2\epsilon^2}$ then the right-hand side is at most $\delta$, as required.

The Discretization Trick
Suppose $H$ is parameterized by $d$ numbers.
Suppose we are happy with a representation of each number using $b$ bits (say, $b = 32$).
Then $|H| \le 2^{db}$, and so
$$m_H(\epsilon,\delta) \le \frac{2db + 2\log(2/\delta)}{\epsilon^2}.$$
While not very elegant, it's a great tool for upper bounding sample complexity.
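As a rough numerical illustration of the discretization bound (the values $d = 100$ and $b = 32$ are chosen for illustration only):

```python
import math

def discretized_bound(d, b, eps, delta):
    # |H| <= 2^(d*b), so plugging in gives (2*d*b*log(2) + 2*log(2/delta)) / eps^2;
    # the slide's slightly looser form replaces log(2) by 1, as computed here.
    return (2 * d * b + 2 * math.log(2 / delta)) / eps ** 2

print(discretized_bound(d=100, b=32, eps=0.1, delta=0.01))  # roughly 641,000 examples
```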

Next: Linear Regression and Least Squares

Linear Regression
$X = \mathbb{R}^d$, $Y = \mathbb{R}$, $H = \{x \mapsto \langle w, x\rangle : w \in \mathbb{R}^d\}$
Example: $d = 1$, predict the weight of a child based on age. [Figure: weight (kg) plotted against age (years).]

The Squared Loss
The zero-one loss doesn't make sense in regression.
Squared loss: $\ell(h,(x,y)) = (h(x) - y)^2$
The ERM problem:
$$\min_{w \in \mathbb{R}^d} \frac{1}{m}\sum_{i=1}^m (\langle w, x_i\rangle - y_i)^2$$
Equivalently, suppose $X$ is the matrix whose $i$th column is $x_i$ and $y$ is the vector whose $i$th entry is $y_i$; then the problem is
$$\min_{w \in \mathbb{R}^d} \|X^\top w - y\|^2.$$

Background: Gradient and Optimization
Given a function $f : \mathbb{R} \to \mathbb{R}$, its derivative is
$$f'(x) = \lim_{\Delta \to 0} \frac{f(x+\Delta) - f(x)}{\Delta}.$$
If $x$ minimizes $f(x)$ then $f'(x) = 0$.
Now take $f : \mathbb{R}^d \to \mathbb{R}$. Its gradient is a $d$-dimensional vector, $\nabla f(x)$, where the $i$th coordinate of $\nabla f(x)$ is the derivative of the scalar function
$$g(a) = f((x_1,\dots,x_{i-1}, x_i + a, x_{i+1},\dots,x_d)).$$
The derivative of $g$ is called the partial derivative of $f$.
If $x$ minimizes $f(x)$ then $\nabla f(x) = (0,\dots,0)$.

Background: Jacobian and the chain rule
The Jacobian of $f : \mathbb{R}^n \to \mathbb{R}^m$ at $x \in \mathbb{R}^n$, denoted $J_x(f)$, is the $m \times n$ matrix whose $(i,j)$ element is the partial derivative of $f_i : \mathbb{R}^n \to \mathbb{R}$ w.r.t. its $j$th variable at $x$.
Note: if $m = 1$ then $J_x(f) = \nabla f(x)$ (as a row vector).
Example: if $f(w) = Aw$ for $A \in \mathbb{R}^{m \times n}$ then $J_w(f) = A$.
Chain rule: given $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^k \to \mathbb{R}^n$, the Jacobian of the composition $(f \circ g) : \mathbb{R}^k \to \mathbb{R}^m$ at $x$ is
$$J_x(f \circ g) = J_{g(x)}(f)\, J_x(g).$$

Least Squares
Recall that we'd like to solve the ERM problem:
$$\min_{w \in \mathbb{R}^d} \frac{1}{2}\|X^\top w - y\|^2$$
Let $g(w) = X^\top w - y$ and $f(v) = \frac{1}{2}\|v\|^2 = \frac{1}{2}\sum_{i=1}^m v_i^2$.
Then we need to solve $\min_w f(g(w))$.
Note that $J_w(g) = X^\top$ and $J_v(f) = (v_1,\dots,v_m) = v^\top$.
Using the chain rule:
$$J_w(f \circ g) = J_{g(w)}(f)\, J_w(g) = g(w)^\top X^\top = (X^\top w - y)^\top X^\top.$$
Requiring that $J_w(f \circ g) = (0,\dots,0)$ yields
$$(X^\top w - y)^\top X^\top = 0 \;\Longleftrightarrow\; XX^\top w = X y.$$
This is a linear set of equations. If $XX^\top$ is invertible, the solution is
$$w = (XX^\top)^{-1} X y.$$
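A minimal NumPy sketch of the closed-form solution derived above, using the lecture's convention that the $i$th column of $X$ is the example $x_i$ (the data here is synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 50
w_true = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(d, m))                  # columns are the examples x_i
y = X.T @ w_true + 0.1 * rng.normal(size=m)  # noisy labels y_i ~ <w_true, x_i>

# Normal equations: X X^T w = X y
w_hat = np.linalg.solve(X @ X.T, X @ y)

# Equivalent and safer when X X^T may be singular: use the pseudo-inverse
w_pinv = np.linalg.pinv(X @ X.T) @ (X @ y)

print(w_hat)                     # close to w_true
print(np.allclose(w_hat, w_pinv))
```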

Least Squares
What if $XX^\top$ is not invertible? In the exercise you'll see that there's always a solution to the set of linear equations, using the pseudo-inverse.
Non-rigorous trick to help remember the formula: we want $X^\top w \approx y$. Multiply both sides by $X$ to obtain $XX^\top w \approx Xy$. Multiply both sides by $(XX^\top)^{-1}$ to obtain the formula $w = (XX^\top)^{-1} X y$.

Least Squares: Interpretation as projection
Recall, we try to minimize $\|X^\top w - y\|$.
The set $C = \{X^\top w : w \in \mathbb{R}^d\} \subseteq \mathbb{R}^m$ is a linear subspace, forming the range of $X^\top$.
Therefore, if $w$ is the least squares solution, then the vector $\hat{y} = X^\top w$ is the vector in $C$ which is closest to $y$. This is called the projection of $y$ onto $C$.
We can find $\hat{y}$ by taking $V$ to be an $m \times d$ matrix whose columns are an orthonormal basis of the range of $X^\top$, and then setting $\hat{y} = V V^\top y$.
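The projection view can be checked numerically. The sketch below (illustrative, with its own small synthetic dataset) obtains an orthonormal basis of the range of $X^\top$ via a reduced QR decomposition and verifies that $VV^\top y$ coincides with $X^\top \hat{w}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 20
X = rng.normal(size=(d, m))
y = rng.normal(size=m)

# Least squares fit: X X^T w = X y
w = np.linalg.solve(X @ X.T, X @ y)
y_hat = X.T @ w                      # prediction vector, lives in C = range(X^T)

# Orthonormal basis of range(X^T) via reduced QR: V is m x d with orthonormal columns
V, _ = np.linalg.qr(X.T)
y_proj = V @ (V.T @ y)               # projection of y onto C

print(np.allclose(y_hat, y_proj))    # True (up to numerical error)
```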

Polynomial Fitting
Sometimes linear predictors are not expressive enough for our data. We will show how to fit a polynomial to the data using linear regression.

Polynomial Fitting
A one-dimensional polynomial function of degree $n$:
$$p(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$$
Goal: given data $S = ((x_1,y_1),\dots,(x_m,y_m))$, find the ERM with respect to the class of polynomials of degree $n$.
Reduction to linear regression:
Define $\psi : \mathbb{R} \to \mathbb{R}^{n+1}$ by $\psi(x) = (1, x, x^2,\dots,x^n)$.
Define $a = (a_0, a_1,\dots,a_n)$ and observe:
$$p(x) = \sum_{i=0}^n a_i x^i = \langle a, \psi(x)\rangle$$
To find $a$, we can solve Least Squares w.r.t. $((\psi(x_1), y_1),\dots,(\psi(x_m), y_m))$.
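A short NumPy sketch of this reduction (the synthetic data and names are illustrative): build the feature map $\psi(x) = (1, x, \dots, x^n)$ for each point and solve the resulting least squares problem.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 3                                   # m examples, degree-n polynomial
x = rng.uniform(-1, 1, size=m)
y = 1 - 2 * x + 0.5 * x**3 + 0.05 * rng.normal(size=m)

# Feature map psi(x) = (1, x, x^2, ..., x^n); rows are psi(x_i)
Psi = np.vander(x, N=n + 1, increasing=True)   # shape (m, n+1)

# ERM over degree-n polynomials = least squares on the transformed data
a, *_ = np.linalg.lstsq(Psi, y, rcond=None)
print(a)                                       # close to (1, -2, 0, 0.5)
```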

Next: The Bias-Complexity Tradeoff

Error Decomposition
Let $h_S = \mathrm{ERM}_H(S)$. We can decompose the risk of $h_S$ as:
$$L_D(h_S) = \epsilon_{app} + \epsilon_{est}$$
The approximation error, $\epsilon_{app} = \min_{h \in H} L_D(h)$: how much risk we have due to restricting to $H$. It doesn't depend on $S$ and decreases with the complexity (size, or VC dimension) of $H$.
The estimation error, $\epsilon_{est} = L_D(h_S) - \epsilon_{app}$: the result of $L_S$ being only an estimate of $L_D$. It decreases with the size of $S$ and increases with the complexity of $H$.

Bias-Complexity Tradeoff
How to choose $H$? [Figure: the same data fit with polynomials of degree 2, degree 3, and degree 10.]

Next: Validation and Model Selection

Validation
We have already learned some hypothesis $h$.
Now we want to estimate how good $h$ is.
Simple solution: take a fresh i.i.d. sample $V = (x_1,y_1),\dots,(x_{m_v},y_{m_v})$ and output $L_V(h)$ as an estimator of $L_D(h)$.
Using Hoeffding's inequality, if the range of $\ell$ is $[0,1]$, then with probability at least $1-\delta$,
$$|L_V(h) - L_D(h)| \le \sqrt{\frac{\log(2/\delta)}{2 m_v}}.$$
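For a feel of the numbers, here is a tiny Python computation of this validation bound for a few validation-set sizes (values chosen purely for illustration):

```python
import math

def validation_bound(m_v, delta=0.05):
    # With probability >= 1 - delta, |L_V(h) - L_D(h)| <= sqrt(log(2/delta) / (2 m_v))
    return math.sqrt(math.log(2 / delta) / (2 * m_v))

for m_v in (100, 1_000, 10_000):
    print(m_v, round(validation_bound(m_v), 4))
# 100 -> ~0.1358, 1000 -> ~0.0429, 10000 -> ~0.0136
```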

Validation for Model Selection
Fitting polynomials of degrees 2, 3, and 10 based on the black points; the red points are validation examples. Choose the degree-3 polynomial, as it has minimal validation error.

Validation for Model Selection: Analysis
Let $H = \{h_1,\dots,h_r\}$ be the output predictors of applying ERM w.r.t. the different classes on $S$.
Let $V$ be a fresh validation set.
Choose $h^* \in \mathrm{ERM}_H(V)$.
By our analysis of finite classes, with probability at least $1-\delta$,
$$L_D(h^*) \le \min_{h \in H} L_D(h) + \sqrt{\frac{2\log(2|H|/\delta)}{|V|}}.$$

The model-selection curve
[Figure: train and validation error plotted against the polynomial degree.]

Train-Validation-Test Split
In practice, we usually have one pool of examples and we split it into three sets:
Training set: apply the learning algorithm with different parameters on the training set to produce $H = \{h_1,\dots,h_r\}$.
Validation set: choose $h^*$ from $H$ based on the validation set.
Test set: estimate the error of $h^*$ using the test set.

k-Fold Cross Validation
The train-validation-test split is the best approach when data is plentiful. If data is scarce:

k-Fold Cross Validation for Model Selection
input: training set S = (x_1, y_1), ..., (x_m, y_m), learning algorithm A, and a set of parameter values Θ
partition S into S_1, S_2, ..., S_k
foreach θ ∈ Θ
    for i = 1 ... k
        h_{i,θ} = A(S \ S_i ; θ)
    error(θ) = (1/k) Σ_{i=1}^k L_{S_i}(h_{i,θ})
output: θ* = argmin_θ error(θ), h_{θ*} = A(S; θ*)
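A minimal Python sketch of this procedure for the polynomial-degree example; the helper names `fit_poly` and `mse` and the synthetic data are illustrative assumptions, not from the lecture.

```python
import numpy as np

def fit_poly(x, y, degree):
    """Learning algorithm A(S; theta): least squares over degree-`degree` polynomials."""
    Psi = np.vander(x, N=degree + 1, increasing=True)
    a, *_ = np.linalg.lstsq(Psi, y, rcond=None)
    return a

def mse(a, x, y):
    """Validation loss L_{S_i}(h) with the squared loss."""
    Psi = np.vander(x, N=len(a), increasing=True)
    return np.mean((Psi @ a - y) ** 2)

def k_fold_select(x, y, degrees, k=5):
    folds = np.array_split(np.random.permutation(len(x)), k)
    errors = {}
    for deg in degrees:                           # foreach theta in Theta
        fold_errs = []
        for i in range(k):                        # for i = 1 ... k
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            a = fit_poly(x[train], y[train], deg) # h_{i,theta} = A(S \ S_i; theta)
            fold_errs.append(mse(a, x[val], y[val]))
        errors[deg] = np.mean(fold_errs)          # error(theta)
    best = min(errors, key=errors.get)            # theta* = argmin error(theta)
    return best, fit_poly(x, y, best)             # h_{theta*} = A(S; theta*)

# Usage on synthetic data
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 60)
y = 1 - 2 * x + 0.5 * x**3 + 0.1 * rng.normal(size=60)
best_degree, model = k_fold_select(x, y, degrees=[1, 2, 3, 5, 10])
print("selected degree:", best_degree)
```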

Summary
The general PAC model: agnostic learning, general loss functions
Uniform convergence is sufficient for learnability; uniform convergence holds for finite classes and bounded loss
Linear regression, least squares, polynomial fitting
The bias-complexity tradeoff: approximation error vs. estimation error
Validation and model selection
