Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh


1 Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh

2 Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds

3 Learning Scenario Goal: given feature vectors $(x_1,\ldots,x_m)$ and labels $(y_1,\ldots,y_m)$ drawn i.i.d. from an unknown distribution $D$, learn a function (hypothesis) $h \in H$ that minimizes the expected loss (risk) over the entire distribution, $R(h) = \mathbb{E}_{(x,y)\sim D}[\ell(h(x),y)]$, where $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is the loss function.

4 Learning Scenario Cannot measure the risk directly; approximate it using the sample $S = ((x_1,y_1),\ldots,(x_m,y_m))$: $\hat{R}_S(h) = \frac{1}{m}\sum_{i=1}^m \ell(h(x_i),y_i)$. How well does $\hat{R}_S$ approximate $R$? In other words, how well does the algorithm generalize?
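
As a concrete illustration of the empirical risk, here is a minimal sketch (the threshold hypothesis, the 0/1 loss, and the synthetic data are assumptions for the example, not from the slides) that computes $\hat{R}_S(h)$ on a sample:

```python
# Minimal sketch: empirical risk R_S(h) = (1/m) * sum_i loss(h(x_i), y_i)
# for a hypothetical 1-D threshold classifier on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

def h(x, threshold=0.5):
    """Hypothetical hypothesis: predict +1 if x >= threshold, else -1."""
    return np.where(x >= threshold, 1, -1)

def zero_one_loss(y_pred, y_true):
    """0/1 loss l(h(x), y)."""
    return (y_pred != y_true).astype(float)

m = 1000
x = rng.uniform(0, 1, size=m)
y = np.where(x >= 0.4, 1, -1)          # assumed "true" labeling rule
empirical_risk = zero_one_loss(h(x), y).mean()
print(f"R_S(h) = {empirical_risk:.3f}")
```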

5 Generalization Bound

6 Generalization Bound Generalization (risk) bounds show how well $\hat{R}_S(h)$ bounds $R(h)$ as a function of $m = |S|$. $R(h) \le \hat{R}_S(h) + \text{complexity}(h)$

7 Generalization Bound Generalization (risk) bounds show how well $\hat{R}_S(h)$ bounds $R(h)$ as a function of $m = |S|$. $R(h) \le \hat{R}_S(h) + \text{complexity}(h)$ Uniform convergence: holds simultaneously over all $h \in H$.

8 Generalization Bound Generalization (risk) bounds show how well $\hat{R}_S(h)$ bounds $R(h)$ as a function of $m = |S|$. $R(h) \le \hat{R}_S(h) + \text{complexity}(h)$ Uniform convergence: holds simultaneously over all $h \in H$. Complexity (capacity) is a measure of the richness of the hypothesis class.

9 Generalization Bound Generalization (risk) bounds show how well $\hat{R}_S(h)$ bounds $R(h)$ as a function of $m = |S|$. $R(h) \le \hat{R}_S(h) + \text{complexity}(h)$ Uniform convergence: holds simultaneously over all $h \in H$. Complexity (capacity) is a measure of the richness of the hypothesis class. Explicitly shows the tradeoff between complexity and generalization ability.

10 Outline

11 Outline We will cover four main examples,

12 Outline We will cover four main examples, Hypothesis Complexity Bounds: VC-Dimension Covering Number Rademacher Complexity

13 Outline We will cover four main examples, Hypothesis Complexity Bounds: VC-Dimension Covering Number Rademacher Complexity Algorithm Complexity Bounds: Algorithmic Stability

14 Concentration Bounds Hoeffding's Inequality ['63] - Let $(X_1,\ldots,X_m)$ be independent bounded random variables ($a_i \le X_i \le b_i$). Then the following is true: $\Pr\big(|Y - \mathbb{E}[Y]| \ge \epsilon\big) \le 2\exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^m (b_i - a_i)^2}\Big)$, where $Y = X_1 + \cdots + X_m$.
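
A quick Monte Carlo sanity check of Hoeffding's inequality (the Bernoulli variables and the chosen parameters are assumptions for illustration):

```python
# Compare the observed deviation probability Pr(|Y - E[Y]| >= eps) with
# Hoeffding's bound 2*exp(-2*eps^2 / sum_i (b_i - a_i)^2) for Y = X_1 + ... + X_m.
import numpy as np

rng = np.random.default_rng(0)
m, p, eps, trials = 100, 0.3, 10.0, 20000   # X_i ~ Bernoulli(p), so b_i - a_i = 1

samples = rng.binomial(1, p, size=(trials, m)).sum(axis=1)
empirical_prob = np.mean(np.abs(samples - m * p) >= eps)
hoeffding_bound = 2 * np.exp(-2 * eps**2 / (m * 1.0**2))

print(f"empirical Pr = {empirical_prob:.4f}, Hoeffding bound = {hoeffding_bound:.4f}")
```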

15 Concentration Bounds McDiarmid's Inequality ['89] - Let $f$ be a real-valued function such that for all $i$, $\sup_{x_1,\ldots,x_m,\,x_i'} \big|f(x_1,\ldots,x_i,\ldots,x_m) - f(x_1,\ldots,x_i',\ldots,x_m)\big| \le c_i$. Then the following is true with respect to an i.i.d. sample $S = (x_1,\ldots,x_m)$: $\Pr\big(|f(S) - \mathbb{E}[f]| \ge \epsilon\big) \le 2\exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^m c_i^2}\Big)$.

16 Warmup For a finite hypothesis set $H$ and bounded loss function $\ell(\cdot,\cdot) \le M$, the following is true for all $h \in H$ simultaneously: $\Pr\big(|R(h) - \hat{R}_S(h)| \ge \epsilon\big) \le 2|H|\exp\Big(\frac{-2m\epsilon^2}{M^2}\Big)$

17 Warmup For a finite hypothesis set $H$ and bounded loss function $\ell(\cdot,\cdot) \le M$, the following is true for all $h \in H$ simultaneously: $\Pr\big(|R(h) - \hat{R}_S(h)| \ge \epsilon\big) \le 2|H|\exp\Big(\frac{-2m\epsilon^2}{M^2}\Big)$ Setting the r.h.s. equal to $\delta$, we can say that with probability at least $1 - \delta$ the following is true: $R(h) \le \hat{R}_S(h) + M\sqrt{\frac{\ln|H| + \ln(2/\delta)}{2m}}$
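
A small sketch evaluating the resulting complexity term $M\sqrt{(\ln|H| + \ln(2/\delta))/(2m)}$ for illustrative (assumed) values of $|H|$, $m$, and $\delta$:

```python
# Evaluate the finite-hypothesis-class bound term for a few sample sizes.
import math

def finite_class_gap(m, H_size, delta=0.05, M=1.0):
    """Upper bound on R(h) - R_S(h), uniform over a finite class H."""
    return M * math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * m))

for m in (100, 1000, 10000):
    print(m, round(finite_class_gap(m, H_size=10**6), 4))
```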

18 Warmup

19 Warmup Proof: for any fixed hypothesis $h$, by Hoeffding's inequality we have $\Pr\big(|R(h) - \hat{R}_S(h)| \ge \epsilon\big) \le 2\exp\Big(\frac{-2m\epsilon^2}{M^2}\Big)$, where in this case $(b_i - a_i) \le M/m$.

20 Warmup Proof: for any fixed hypothesis $h$, by Hoeffding's inequality we have $\Pr\big(|R(h) - \hat{R}_S(h)| \ge \epsilon\big) \le 2\exp\Big(\frac{-2m\epsilon^2}{M^2}\Big)$, where in this case $(b_i - a_i) \le M/m$. Taking the union bound over all hypotheses in $H$ completes the proof.

21 Hypothesis Complexity

22 Hypothesis Complexity Bound can be loose for large values of $|H|$.

23 Hypothesis Complexity Bound can be loose for large values of $|H|$. In general, the hypothesis class is not even finite!

24 Hypothesis Complexity Bound can be loose for large values of $|H|$. In general, the hypothesis class is not even finite! We must measure complexity in a more careful way.

25 Symmetrization Main Insight: Define the quantity of interest purely over a finite sample. We can now consider the complexity of a hypothesis class over only a finite sample.

26 Symmetrization Main Insight: Define the quantity of interest purely over a finite sample. $\Pr\Big[\sup_h |R(h) - \hat{R}_S(h)| \ge \epsilon\Big] \le 2\Pr\Big[\sup_h |\hat{R}_S(h) - \hat{R}_{S'}(h)| \ge \frac{\epsilon}{2}\Big]$, where $S'$ is another ("ghost") i.i.d. sample of size $m$. We can now consider the complexity of a hypothesis class over only a finite sample.

27 VC-dimension

28 VC-dimension Consider binary classification; denote the set of dichotomies realized by $H$ over a sample $S$ as follows: $\Pi_H(S) = \{(h(x_1),\ldots,h(x_m)) : h \in H\}$

29 VC-dimension Consider binary classification; denote the set of dichotomies realized by $H$ over a sample $S$ as follows: $\Pi_H(S) = \{(h(x_1),\ldots,h(x_m)) : h \in H\}$ The sample $S$ is shattered by $H$ if $|\Pi_H(S)| = 2^m$.

30 VC-dimension Consider binary classification; denote the set of dichotomies realized by $H$ over a sample $S$ as follows: $\Pi_H(S) = \{(h(x_1),\ldots,h(x_m)) : h \in H\}$ The sample $S$ is shattered by $H$ if $|\Pi_H(S)| = 2^m$. The shattering coefficient over $m$ points is then defined as $\Pi_H(m) = \max_S\{|\Pi_H(S)| : |S| = m\}$.
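
To make the shattering coefficient concrete, here is a minimal sketch that enumerates the dichotomies realized by one-sided threshold classifiers on a small 1-D sample (this hypothesis class is an assumption for illustration, not one used in the slides):

```python
# Count the dichotomies Pi_H(S) realized by h_t(x) = +1 iff x >= t,
# and compare against 2^m.
import numpy as np

def dichotomies_thresholds(sample):
    """Set of label vectors realizable by threshold classifiers on the sample."""
    thresholds = np.concatenate(([-np.inf], np.sort(sample) + 1e-9))
    return {tuple(np.where(sample >= t, 1, -1)) for t in thresholds}

S = np.array([0.1, 0.4, 0.7])
dich = dichotomies_thresholds(S)
print(f"|Pi_H(S)| = {len(dich)} out of 2^m = {2**len(S)}")  # 4 of 8: VC-dim is 1
```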

31 VC-dimension

32 VC-dimension VC-dimension: the size of the largest set that is shattered by $H$: $d := \max\{m : \exists S,\ |S| = m,\ |\Pi_H(S)| = 2^m\}$

33 VC-dimension VC-dimension: the size of the largest set that is shattered by $H$: $d := \max\{m : \exists S,\ |S| = m,\ |\Pi_H(S)| = 2^m\}$ Example: halfspaces in $\mathbb{R}^n$.

34 VC-dimension VC-dimension: the size of the largest set that is shattered by $H$: $d := \max\{m : \exists S,\ |S| = m,\ |\Pi_H(S)| = 2^m\}$ Example: halfspaces in $\mathbb{R}^n$. Can always shatter a set of $n$ (non-degenerate) points.

35 VC-dimension VC-dimension: the size of the largest set that is shattered by $H$: $d := \max\{m : \exists S,\ |S| = m,\ |\Pi_H(S)| = 2^m\}$ Example: halfspaces in $\mathbb{R}^n$. Can always shatter a set of $n$ (non-degenerate) points. No configuration of $(n+1)$ points can be shattered. [Figure: the two cases, Case 1 and Case 2.]

36 SVM Margin Bound For the linear hypothesis class $h(x) = \langle w, x\rangle$ with canonical margin, the VC-dimension can be bounded independently of the input dimension: $d \le \Lambda^2 R^2$, where the data lies in a ball of radius $R$ centered at the origin and $\|w\| \le \Lambda$. [Figure: separating hyperplane $\langle w, x\rangle = 0$ with margin $2/\|w\|$.]

37 VC-bound

38 VC-bound Sauer's Lemma - If the VC-dimension is finite, then $\Pi_H(m) \le O(m^d)$.

39 VC-bound Sauer's Lemma - If the VC-dimension is finite, then $\Pi_H(m) \le O(m^d)$. VC-bound ['71] - for a hypothesis class $H$ with 0/1-loss function and finite VC-dimension $d$, the following holds with probability at least $1 - \delta$: $R(h) \le \hat{R}_S(h) + O\Big(\sqrt{\frac{d\ln(m) + \ln(1/\delta)}{m}}\Big)$ Shown to be optimal within a $\log(m)$ factor.

40 Covering Number

41 Covering Number Def. The ε-covering number of a function class $F$ with respect to a sample $S$, denoted $N_p(F, \epsilon, S)$, is the minimum number of ε-balls needed to cover all $h(S)$ in $H(S)$.

42-46 Covering Number [Figure build-up across these slides: ε-balls covering the set $H(S)$; definition repeated from the previous slide.]

47 Covering Number Def. The ε-covering number of a function class $F$ with respect to a sample $S$, denoted $N_p(F, \epsilon, S)$, is the minimum number of ε-balls needed to cover all $h(S)$ in $H(S)$. Intuitively, the number of functions needed to ε-approximate (in p-norm) any hypothesis in $H$.

48 Covering Number Def. The ε-covering number of a function class $F$ with respect to a sample $S$, denoted $N_p(F, \epsilon, S)$, is the minimum number of ε-balls needed to cover all $h(S)$ in $H(S)$. Intuitively, the number of functions needed to ε-approximate (in p-norm) any hypothesis in $H$. Also, denote $N_p(F, \epsilon, m) = \sup_{S : |S| = m} N_p(F, \epsilon, S)$.

49 Covering Number

50 Covering Number A natural complexity measure for continuous functions.

51 Covering Number A natural complexity measure for continuous functions. Can be applied during the union bound step: replace each function in the loss class $\ell \circ H$ with an ε/8 approximation.

52 Covering Number A natural complexity measure for continuous functions. Can be applied during the union bound step: replace each function in the loss class $\ell \circ H$ with an ε/8 approximation. If the loss function is $L$-Lipschitz, then $N_p(\ell \circ F, \epsilon, m) \le N_p(F, \epsilon/L, m)$.

53 Covering Number Bound Thm. [Pollard '84] For a Lipschitz loss function bounded by $M$, the following is true: $\Pr\Big(\sup_{h\in H} |R(h) - \hat{R}_S(h)| \ge \epsilon\Big) \le O\Big(N_1(H, \epsilon, m)\, e^{-\frac{m\epsilon^2}{M^2}}\Big)$

54 Rademacher Complexity

55 Rademacher Complexity How well does the hypothesis class fit random noise?

56 Rademacher Complexity How well does the hypothesis class fit random noise? Empirical Rademacher Complexity: $\hat{R}_S(H) = \mathbb{E}_\sigma\Big[\sup_{h\in H} \frac{2}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big]$, where each $\sigma_i$ is chosen uniformly from $\{-1, +1\}$.

57 Rademacher Complexity How well does the hypothesis class fit random noise? Empirical Rademacher Complexity: $\hat{R}_S(H) = \mathbb{E}_\sigma\Big[\sup_{h\in H} \frac{2}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big]$, where each $\sigma_i$ is chosen uniformly from $\{-1, +1\}$. This quantity can be computed from data (assuming we can minimize empirical error).
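
A minimal sketch of how $\hat{R}_S(H)$ could be estimated by Monte Carlo over random sign vectors; the 1-D threshold class is an assumed example for which the supremum over $H$ can be taken exactly:

```python
# Monte Carlo estimate of R_S(H) = E_sigma[ sup_h (2/m) sum_i sigma_i h(x_i) ]
# for 1-D threshold classifiers h_t(x) = +1 iff x >= t (assumed example class).
import numpy as np

rng = np.random.default_rng(0)

def all_threshold_outputs(x):
    """Rows = the distinct label vectors (h_t(x_1), ..., h_t(x_m)) over h_t in H."""
    thresholds = np.concatenate(([-np.inf], np.sort(x) + 1e-9))
    return np.array([np.where(x >= t, 1, -1) for t in thresholds])

def empirical_rademacher(x, n_draws=2000):
    outputs = all_threshold_outputs(x)                # shape (|H(S)|, m)
    m = len(x)
    sigmas = rng.choice([-1, 1], size=(n_draws, m))   # random sign vectors
    # For each sigma, take the sup over h of (2/m) * <sigma, h(S)>.
    sups = (2.0 / m) * (sigmas @ outputs.T).max(axis=1)
    return sups.mean()

for m in (10, 100, 1000):
    x = rng.uniform(0, 1, size=m)
    print(m, round(empirical_rademacher(x), 3))
```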

58 Rademacher Complexity

59 Rademacher Complexity Rademacher Complexity: $R_m(H) = \mathbb{E}_S\big[\hat{R}_S(H) \,\big|\, |S| = m\big]$

60 Rademacher Complexity Rademacher Complexity: $R_m(H) = \mathbb{E}_S\big[\hat{R}_S(H) \,\big|\, |S| = m\big]$ McDiarmid's inequality exponentially bounds the difference between $\hat{R}_S$ and $R_m$; with probability at least $1-\delta$, $R_m(H) \le \hat{R}_S(H) + \sqrt{\frac{\ln(1/\delta)}{2m}}$.

61 Rademacher Bound Rademacher Bound: Consider binary classification with the 0/1 loss function; then the following is true: $\Pr\Big[\sup_{h\in H} \big(R(h) - \hat{R}_S(h)\big) \ge R_m(H) + \epsilon\Big] \le e^{-2m\epsilon^2}$ Again, with probability at least $1 - \delta$: $R(h) \le \hat{R}_S(h) + R_m(H) + \sqrt{\frac{\ln(1/\delta)}{2m}}$

62 Rademacher Bound Proof (outline): Apply McDiarmid's inequality to $\Phi(S) = \sup_h \big(R(h) - \hat{R}_S(h)\big)$, which is $1/m$-Lipschitz in the bounded-difference sense. Main insight: use a symmetrization-type argument to show $\mathbb{E}\big[\Phi(S)\big] \le R_m(H)$.

63 Rademacher Complexity Important qualities: We can compute $\hat{R}_S(H)$ from data. The empirical Rademacher complexity lower bounds the VC-bound! $R_m(H) \le 2M\sqrt{\frac{2d\ln(em/d)}{m}}$

64 Algorithmic Stability

65 Algorithmic Stability Complexity measure of a specific algorithm (rather than a hypothesis class).

66 Algorithmic Stability Complexity measure of a specific algorithm (rather than a hypothesis class). The usual union-bound term is not needed.

67 Algorithmic Stability Complexity measure of a specific algorithm (rather than a hypothesis class). The usual union-bound term is not needed. Intuition: How does the algorithm explore the hypothesis space?

68 Algorithmic Stability An algorithm is called β-stable if, for any two samples $S$ and $S'$ that differ in one point, the following holds: $\sup_{x,y} \big|\ell(h_S(x),y) - \ell(h_{S'}(x),y)\big| \le \beta$, where $h_S$ and $h_{S'}$ are the hypotheses produced by the algorithm for samples $S$ and $S'$ respectively.

69 Stability Bound Thm. Let $h_S$ be the hypothesis returned by a β-stable algorithm with access to a sample $S$; then the following bound holds with respect to a nonnegative loss function with upper bound $M$: $\Pr_S\big(R(h_S) - \hat{R}_S(h_S) \ge \epsilon + \beta\big) \le \exp\Big(\frac{-2m\epsilon^2}{(m\beta + M)^2}\Big)$ Thus, with probability at least $1 - \delta$ over the sample $S$: $R(h_S) \le \hat{R}_S(h_S) + \beta + (m\beta + M)\sqrt{\frac{\ln(1/\delta)}{2m}}$
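
A small sketch evaluating the high-probability form of the stability bound for an assumed stability coefficient $\beta = 1/m$ (the parameter values are illustrative):

```python
# Evaluate R(h_S) - R_S(h_S) <= beta + (m*beta + M) * sqrt(ln(1/delta) / (2m)).
import math

def stability_gap(m, beta, M=1.0, delta=0.05):
    """Upper bound on R(h_S) - R_S(h_S) for a beta-stable algorithm."""
    return beta + (m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

for m in (100, 1000, 10000):
    beta = 1.0 / m                 # O(1/m)-stability, needed for convergence
    print(m, round(stability_gap(m, beta), 4))
```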

70 Stability Bound Proof: Apply McDiarmid's inequality to the random variable $\Phi(S) = R(h_S) - \hat{R}_S(h_S)$. Stability allows us to bound the mean and verify the Lipschitz (bounded-difference) property.

71 Stability Bound Let $S$ and $S^i$ differ in the $i$-th coordinate; to show the Lipschitz behavior: $\big|\Phi(S) - \Phi(S^i)\big| = \big|R(h_S) - \hat{R}_S(h_S) - R(h_{S^i}) + \hat{R}_{S^i}(h_{S^i})\big| \le \underbrace{\big|\hat{R}_{S^i}(h_{S^i}) - \hat{R}_S(h_S)\big|}_{(1)} + \underbrace{\big|R(h_S) - R(h_{S^i})\big|}_{(2)}$ We bound the two terms separately.

72 Stability Bound Part (1): $\big|\hat{R}_{S^i}(h_{S^i}) - \hat{R}_S(h_S)\big| \le \frac{1}{m}\sum_{j\ne i} \big|\ell(h_{S^i},x_j) - \ell(h_S,x_j)\big| + \frac{1}{m}\big|\ell(h_{S^i},x_i') - \ell(h_S,x_i)\big| \le \beta + \frac{M}{m}$ (where $x_i'$ denotes the replacement point in $S^i$).

73 Stability Bound Part (2): $\big|R(h_S) - R(h_{S^i})\big| = \big|\mathbb{E}_x\big[\ell(h_S,x) - \ell(h_{S^i},x)\big]\big| \le \beta$. So we have $c_i = \big|\Phi(S) - \Phi(S^i)\big| \le 2\beta + \frac{M}{m}$.

74 Stability Bound Bounding the mean: $\mathbb{E}_S\big[\Phi(S)\big] = \mathbb{E}_{S,x}\big[\ell(h_S,x)\big] - \mathbb{E}_S\Big[\frac{1}{m}\sum_i \ell(h_S,x_i)\Big] = \mathbb{E}_{S,x}\big[\ell(h_S,x)\big] - \mathbb{E}_{S,x}\Big[\frac{1}{m}\sum_i \ell(h_{S^i},x)\Big] = \mathbb{E}_{S,x}\Big[\frac{1}{m}\sum_i \underbrace{\big(\ell(h_S,x) - \ell(h_{S^i},x)\big)}_{\le \beta}\Big] \le \beta$ The second equality holds by the i.i.d. assumption.

75 Stable Algorithms

76 Stable Algorithms Algorithm must be $1/m$-stable for the bound to converge.

77 Stable Algorithms Algorithm must be $1/m$-stable for the bound to converge. Kernel Regularized Algorithms: $\min_h \hat{R}_S(h) + \lambda\|h\|_K^2$

78 Stable Algorithms Algorithm must be $1/m$-stable for the bound to converge. Kernel Regularized Algorithms: $\min_h \hat{R}_S(h) + \lambda\|h\|_K^2$ Thm. Kernel regularized algorithms with a κ-bounded kernel and a σ-Lipschitz convex loss function are stable with coefficient $\beta \le \frac{\sigma^2\kappa^2}{\lambda m}$.
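
A small sketch combining this theorem with the earlier stability bound: compute $\beta \le \sigma^2\kappa^2/(\lambda m)$ for assumed (illustrative) values of $\sigma$, $\kappa$, $\lambda$, and $m$, and plug it into the generalization gap:

```python
# Stability coefficient of a kernel-regularized algorithm, plugged into the
# stability generalization bound (all parameter values below are illustrative).
import math

def kernel_stability_beta(m, lam, sigma=1.0, kappa=1.0):
    """beta <= sigma^2 * kappa^2 / (lambda * m)."""
    return sigma**2 * kappa**2 / (lam * m)

def stability_gap(m, beta, M=1.0, delta=0.05):
    """Upper bound on R(h_S) - R_S(h_S) for a beta-stable algorithm."""
    return beta + (m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

m, lam = 5000, 0.1
beta = kernel_stability_beta(m, lam)
print(f"beta = {beta:.2e}, generalization gap <= {stability_gap(m, beta):.4f}")
```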

79-82 Stable Algorithms Proof outline: Define the Bregman divergence with respect to a convex function $F$ on the RKHS, $B_F(f \,\|\, g) = F(f) - F(g) - \langle f - g, \nabla F(g)\rangle$. First show $B_{\|\cdot\|_K^2}(h_{S'} \,\|\, h_S) + B_{\|\cdot\|_K^2}(h_S \,\|\, h_{S'}) \le \frac{2\sigma}{m\lambda}\sup_{x\in S} |\Delta h(x)|$, where $\Delta h = h_{S'} - h_S$.

83-86 Stable Algorithms Proof outline: Notice $B_{\|\cdot\|_K^2}(h_S \,\|\, h_{S'}) = \|h_S - h_{S'}\|_K^2$. Using the kernel reproducing property we see $2\|\Delta h\|_K^2 \le \frac{2\sigma}{m\lambda}\sup_{x\in S}|\Delta h(x)| = \frac{2\sigma}{m\lambda}\sup_{x\in S}\big|\langle K(x,\cdot), \Delta h\rangle_K\big| \le \frac{2\sigma\kappa}{m\lambda}\|\Delta h\|_K$.

87-89 Stable Algorithms Proof outline: Thus $\|\Delta h\|_K \le \frac{\sigma\kappa}{m\lambda}$, and finally we see that $\big|\ell(h_S(x),y) - \ell(h_{S'}(x),y)\big| \le \sigma|\Delta h(x)| \le \kappa\sigma\|\Delta h\|_K \le \frac{\sigma^2\kappa^2}{m\lambda}$.

90 Conclusion Generalization bounds: show that the empirical error closely estimates the true error. Complexity: several different notions are used to measure the size of the hypothesis class: VC-dimension (combinatorial), Stability (algorithm specific), Covering Number (continuous hypotheses), Rademacher Complexity (data specific).

91 Additional Material

92 VC-dimension

93 VC-dimension Consider binary classification; denote the set of dichotomies realized by $H$ over a sample $S$ as follows: $\Pi_H(S) = \{(h(x_1),\ldots,h(x_m)) : h \in H\}$

94 VC-dimension Consider binary classification; denote the set of dichotomies realized by $H$ over a sample $S$ as follows: $\Pi_H(S) = \{(h(x_1),\ldots,h(x_m)) : h \in H\}$ The sample $S$ is shattered by $H$ if $|\Pi_H(S)| = 2^m$.

95 VC-dimension Consider binary classification; denote the set of dichotomies realized by $H$ over a sample $S$ as follows: $\Pi_H(S) = \{(h(x_1),\ldots,h(x_m)) : h \in H\}$ The sample $S$ is shattered by $H$ if $|\Pi_H(S)| = 2^m$. The shattering coefficient over $m$ points is then defined as $\Pi_H(m) = \max_S\{|\Pi_H(S)| : |S| = m\}$.

96 VC-dimension

97 VC-dimension VC-dimension: the size of the largest set that is shattered by $H$: $d := \max\{m : \exists S,\ |S| = m,\ |\Pi_H(S)| = 2^m\}$

98 VC-dimension VC-dimension: the size of the largest set that is shattered by $H$: $d := \max\{m : \exists S,\ |S| = m,\ |\Pi_H(S)| = 2^m\}$ Example: halfspaces in $\mathbb{R}^n$.

99 VC-dimension VC-dimension: the size of the largest set that is shattered by $H$: $d := \max\{m : \exists S,\ |S| = m,\ |\Pi_H(S)| = 2^m\}$ Example: halfspaces in $\mathbb{R}^n$. Can always shatter a set of $n$ (non-degenerate) points.

100 VC-dimension VC-dimension: the size of the largest set that is shattered by $H$: $d := \max\{m : \exists S,\ |S| = m,\ |\Pi_H(S)| = 2^m\}$ Example: halfspaces in $\mathbb{R}^n$. Can always shatter a set of $n$ (non-degenerate) points. No configuration of $(n+1)$ points can be shattered. [Figure: the two cases, Case 1 and Case 2.]

101 VC-dimension

102 VC-dimension Sauer's Lemma ['72]: The shattering coefficient can be bounded by the VC-dimension: $\Pi_H(m) \le \sum_{i=0}^d \binom{m}{i}$

103 VC-dimension Sauer's Lemma ['72]: The shattering coefficient can be bounded by the VC-dimension: $\Pi_H(m) \le \sum_{i=0}^d \binom{m}{i}$ Furthermore, if $d$ is finite, then for $m > d$, $\sum_{i=0}^d \binom{m}{i} \le \Big(\frac{em}{d}\Big)^d = O(m^d)$
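
A quick numerical check of the second inequality for a few (assumed) values of $m$ and $d$:

```python
# Verify sum_{i<=d} C(m, i) <= (e*m/d)^d for m > d.
import math

def shattering_upper(m, d):
    return sum(math.comb(m, i) for i in range(d + 1))

for m, d in [(20, 3), (100, 5), (1000, 10)]:
    lhs = shattering_upper(m, d)
    rhs = (math.e * m / d) ** d
    print(m, d, lhs, round(rhs, 1), lhs <= rhs)
```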

104 VC-bound VC-bound [VC '71] - for a hypothesis class $H$ with 0/1-loss function, the following is true: $\Pr\Big[\sup_{h\in H} |R(h) - \hat{R}_S(h)| \ge \epsilon\Big] \le 8\Pi_H(m)e^{-m\epsilon^2/32}$ Using Sauer's Lemma, the following holds with probability at least $1 - \delta$: $R(h) \le \hat{R}_S(h) + 4\sqrt{\frac{2d\ln(em/d) + 2\ln(8/\delta)}{m}}$
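
A small sketch evaluating the explicit VC complexity term from this bound for assumed values of $d$, $m$, and $\delta$:

```python
# Evaluate 4 * sqrt((2d*ln(e*m/d) + 2*ln(8/delta)) / m).
import math

def vc_gap(m, d, delta=0.05):
    """Upper bound on R(h) - R_S(h) for VC-dimension d and sample size m."""
    return 4 * math.sqrt((2 * d * math.log(math.e * m / d) + 2 * math.log(8 / delta)) / m)

for m in (10**3, 10**4, 10**5, 10**6):
    print(m, round(vc_gap(m, d=10), 4))
```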

105 VC-bound (proof outline)

106 VC-bound (proof outline) Symmetrization - $\Pr\Big[\sup_h |R(h) - \hat{R}_S(h)| \ge \epsilon\Big] \le 2\Pr\Big[\sup_h |\hat{R}_S(h) - \hat{R}_{S'}(h)| \ge \frac{\epsilon}{2}\Big]$, where $S'$ is another i.i.d. sample of size $m$.

107 VC-bound (proof outline) Symmetrization - $\Pr\Big[\sup_h |R(h) - \hat{R}_S(h)| \ge \epsilon\Big] \le 2\Pr\Big[\sup_h |\hat{R}_S(h) - \hat{R}_{S'}(h)| \ge \frac{\epsilon}{2}\Big]$, where $S'$ is another i.i.d. sample of size $m$. Random Signs - for $\Pr(\sigma_i = 1) = \Pr(\sigma_i = -1) = 1/2$, $2\Pr\Big[\sup_h \frac{1}{m}\Big|\sum_i \big(1_{h(x_i)\ne y_i} - 1_{h(x_i')\ne y_i'}\big)\Big| \ge \frac{\epsilon}{2}\Big] \le 4\Pr\Big[\sup_h \frac{1}{m}\Big|\sum_i \sigma_i 1_{h(x_i)\ne y_i}\Big| \ge \frac{\epsilon}{4}\Big]$

108 VC-bound (proof outline)

109 VC-bound (proof outline) Conditioning - $4\Pr\Big[\sup_h \frac{1}{m}\Big|\sum_i \sigma_i 1_{h(x_i)\ne y_i}\Big| \ge \frac{\epsilon}{4} \,\Big|\, x_1,\ldots,x_m\Big] \le 4\,\Pi_H(m)\,\sup_h \Pr\Big[\frac{1}{m}\Big|\sum_i \sigma_i 1_{h(x_i)\ne y_i}\Big| \ge \frac{\epsilon}{4} \,\Big|\, x_1,\ldots,x_m\Big]$

110 VC-bound (proof outline) Conditioning - $4\Pr\Big[\sup_h \frac{1}{m}\Big|\sum_i \sigma_i 1_{h(x_i)\ne y_i}\Big| \ge \frac{\epsilon}{4} \,\Big|\, x_1,\ldots,x_m\Big] \le 4\,\Pi_H(m)\,\sup_h \Pr\Big[\frac{1}{m}\Big|\sum_i \sigma_i 1_{h(x_i)\ne y_i}\Big| \ge \frac{\epsilon}{4} \,\Big|\, x_1,\ldots,x_m\Big]$ Concentration bound (Hoeffding) - $\sup_h \Pr\Big[\frac{1}{m}\Big|\sum_i \sigma_i 1_{h(x_i)\ne y_i}\Big| \ge \frac{\epsilon}{4} \,\Big|\, x_1,\ldots,x_m\Big] \le 2e^{-m\epsilon^2/32}$ Taking the expectation of both sides proves the result.

111 SVM Margin Bound For the linear hypothesis class $h(x) = \langle w, x\rangle$ with canonical margin, the VC-dimension can be bounded independently of the input dimension: $d \le \Lambda^2 R^2$, where the data lies in a ball of radius $R$ centered at the origin and $\|w\| \le \Lambda$. [Figure: separating hyperplane $\langle w, x\rangle = 0$ with margin $2/\|w\|$.]

112 SVM Margin Bound Proof: Choose $k$ points that the hypothesis class shatters. Then upper and lower bound $\mathbb{E}\big[\|\sum_{i=1}^k y_i x_i\|^2\big]$ with respect to uniform $\pm 1$ labels chosen for $y$. Upper bound: By independence of the $y_i$'s, $\mathbb{E}\big[\|\sum_i y_i x_i\|^2\big] = \sum_{i,j}\mathbb{E}\big[\langle y_i x_i, y_j x_j\rangle\big] = \sum_i \mathbb{E}\big[\|y_i x_i\|^2\big]$; also $\|y_i x_i\| = \|x_i\| \le R$, thus $\mathbb{E}\big[\|\sum_{i=1}^k y_i x_i\|^2\big] \le kR^2$.

113 SVM Margin Bound Lower bound: Since the hypothesis class (with canonical margin) shatters the sample, $\forall i,\ y_i(w \cdot x_i) \ge 1$, thus $k \le \sum_i y_i(w \cdot x_i) \le \|w\|\,\big\|\sum_i y_i x_i\big\| \le \Lambda\,\big\|\sum_i y_i x_i\big\|$. Combining the bounds shows $\frac{k^2}{\Lambda^2} \le \mathbb{E}\big[\|\sum_{i=1}^k y_i x_i\|^2\big] \le kR^2$.
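
A Monte Carlo sketch of the upper-bound step: for random $\pm 1$ labels, $\mathbb{E}\big[\|\sum_i y_i x_i\|^2\big]$ equals $\sum_i \|x_i\|^2 \le kR^2$ (the Gaussian points below are an assumption for illustration):

```python
# Check E||sum_i y_i x_i||^2 = sum_i ||x_i||^2 <= k * R^2 under random labels.
import numpy as np

rng = np.random.default_rng(0)
k, n = 8, 5
X = rng.normal(size=(k, n))                  # k points (assumed data)
R2 = (X**2).sum(axis=1).max()                # R^2 = max_i ||x_i||^2

trials = 100000
Y = rng.choice([-1, 1], size=(trials, k))    # uniform random labelings
norms_sq = ((Y @ X) ** 2).sum(axis=1)        # ||sum_i y_i x_i||^2 per labeling

print(f"Monte Carlo E||sum_i y_i x_i||^2 = {norms_sq.mean():.2f}")
print(f"sum_i ||x_i||^2                  = {(X**2).sum():.2f}")
print(f"k * R^2                          = {k * R2:.2f}")
```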

114 Covering Number

115 Covering Number Def. The ε-covering number of a function class $F$ with respect to a sample $S = (x_1,\ldots,x_m)$, denoted $N_p(F, \epsilon, S)$, is the minimum number of points $v_j$ needed such that $\forall f \in F,\ \exists v_j : \Big[\frac{1}{m}\sum_{i=1}^m |f(x_i) - v_j^i|^p\Big]^{1/p} \le \epsilon$

116 Covering Number Def. The ε-covering number of a function class $F$ with respect to a sample $S = (x_1,\ldots,x_m)$, denoted $N_p(F, \epsilon, S)$, is the minimum number of points $v_j$ needed such that $\forall f \in F,\ \exists v_j : \Big[\frac{1}{m}\sum_{i=1}^m |f(x_i) - v_j^i|^p\Big]^{1/p} \le \epsilon$ Intuitively, the number of epsilon balls needed to cover the hypothesis space.

117 Covering Number Def. The ε-covering number of a function class $F$ with respect to a sample $S = (x_1,\ldots,x_m)$, denoted $N_p(F, \epsilon, S)$, is the minimum number of points $v_j$ needed such that $\forall f \in F,\ \exists v_j : \Big[\frac{1}{m}\sum_{i=1}^m |f(x_i) - v_j^i|^p\Big]^{1/p} \le \epsilon$ Intuitively, the number of epsilon balls needed to cover the hypothesis space. Also, denote $N_p(F, \epsilon, m) = \sup_{S : |S| = m} N_p(F, \epsilon, S)$.

118 Covering Number

119 Covering Number A natural complexity measure for continuous functions.

120 Covering Number A natural complexity measure for continuous functions. Can be applied during the conditioning step: replace each function in the loss class $\ell \circ H$ with an ε/8 approximation.

121 Covering Number A natural complexity measure for continuous functions. Can be applied during the conditioning step: replace each function in the loss class $\ell \circ H$ with an ε/8 approximation. If the loss function is $L$-Lipschitz, then $N_p(\ell \circ F, \epsilon, m) \le N_p(F, \epsilon/L, m)$.

122 Covering Number Bound Thm. [Pollard '84] For an $L$-Lipschitz loss function bounded by $M$, the following is true: $\Pr\Big(\sup_{h\in H} |R(h) - \hat{R}_S(h)| \ge \epsilon\Big) \le 8N_1\big(H, \epsilon/(8L), m\big)\exp\Big(\frac{-m\epsilon^2}{128M^2}\Big)$

123 Covering Number Bound To bound the covering number, notice that by Jensen's inequality, for $p < q$, $N_p \le N_q$. Result of Zhang ['02]: for linear hypothesis classes ($h(x) = \langle w, x\rangle$), if $\|x\| \le b$ and $\|w\| \le a$, then $\log_2 N_2(H, \epsilon, m) \le \frac{a^2 b^2}{\epsilon^2}\log_2(2m + 1)$
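
A small sketch evaluating Zhang's bound on $\log_2 N_2(H,\epsilon,m)$ for assumed norm bounds $a$ and $b$:

```python
# Evaluate log2 N_2(H, eps, m) <= (a^2 * b^2 / eps^2) * log2(2m + 1)
# for illustrative (assumed) norm bounds a (on w) and b (on x).
import math

def log2_covering_bound(m, eps, a=1.0, b=1.0):
    return (a**2 * b**2 / eps**2) * math.log2(2 * m + 1)

for eps in (0.5, 0.1, 0.01):
    print(eps, round(log2_covering_bound(m=10000, eps=eps), 1))
```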

124 Rademacher Complexity $\hat{R}_S(H) = \mathbb{E}_\sigma\Big[\sup_{h\in H} \frac{2}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big]$

125 Rademacher Complexity Empirical Rademacher Complexity: $\hat{R}_S(H) = \mathbb{E}_\sigma\Big[\sup_{h\in H} \frac{2}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big]$, where again each $\sigma_i$ is selected uniformly from $\{-1,+1\}$.

126 Rademacher Complexity Empirical Rademacher Complexity: $\hat{R}_S(H) = \mathbb{E}_\sigma\Big[\sup_{h\in H} \frac{2}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big]$, where again each $\sigma_i$ is selected uniformly from $\{-1,+1\}$. This quantity can be computed from data.

127 Rademacher Complexity Empirical Rademacher Complexity: $\hat{R}_S(H) = \mathbb{E}_\sigma\Big[\sup_{h\in H} \frac{2}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big]$, where again each $\sigma_i$ is selected uniformly from $\{-1,+1\}$. This quantity can be computed from data. Rademacher Complexity: $R_m(H) = \mathbb{E}_S\big[\hat{R}_S(H) \,\big|\, |S| = m\big]$

128 Rademacher Complexity Empirical Rademacher Complexity: $\hat{R}_S(H) = \mathbb{E}_\sigma\Big[\sup_{h\in H} \frac{2}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big]$, where again each $\sigma_i$ is selected uniformly from $\{-1,+1\}$. This quantity can be computed from data. Rademacher Complexity: $R_m(H) = \mathbb{E}_S\big[\hat{R}_S(H) \,\big|\, |S| = m\big]$ McDiarmid's inequality provides an exponentially bounded difference between $\hat{R}_S$ and $R_m$.

129 Rademacher Bound Rademacher Bound: Consider binary classification with the 0/1 loss function; then the following is true: $\Pr\Big[\sup_{h\in H} \big(R(h) - \hat{R}_S(h)\big) \ge R_m(H) + \epsilon\Big] \le e^{-2m\epsilon^2}$ Again, with probability at least $1 - \delta$: $R(h) \le \hat{R}_S(h) + R_m(H) + \sqrt{\frac{\ln(1/\delta)}{2m}}$

130 Rademacher Bound Proof (outline): Apply McDiarmid's inequality to $\Phi(S) = \sup_h \big(R(h) - \hat{R}_S(h)\big)$, which is $1/m$-Lipschitz in the bounded-difference sense. Main insight: use a symmetrization-type argument to show $\mathbb{E}\big[\Phi(S)\big] \le R_m(H)$.

131 Rademacher Complexity Improvements over the VC-bound: We can compute $\hat{R}_S(H)$ from data. The empirical Rademacher complexity lower bounds the VC-bound! $R_m(H) \le 2M\sqrt{\frac{2d\ln(em/d)}{m}}$

132 Rademacher Complexity Proof outline: Use the Chernoff bounding method; need to show $\exp\Big(\frac{tm}{2}R_m(H)\Big) \le \Pi_H(m)\exp(t^2M^2m/2)$, then by Sauer's Lemma $\frac{m}{2}R_m(H) \le \frac{2d\ln(m)}{t} + \frac{tM^2m}{2}$, then choose the best $t$ so that $R_m(H) \le 2M\sqrt{\frac{2d\ln(em/d)}{m}}$.
