Rademacher Bounds for Non-i.i.d. Processes


1 Rademacher Bounds for Non-i.i.d. Processes Afshin Rostamizadeh Joint work with: Mehryar Mohri

5 Background
Generalization bounds: how well can we estimate an algorithm's true performance based on a finite sample? That is, how far is $\widehat{\mathrm{err}}(h, S)$ from $\mathbb{E}_S[\widehat{\mathrm{err}}(h, S)] = \mathrm{err}(h)$?
Usually, generalization bounds take the form: for all $h \in H$, with high probability over the draw of $S \in (X \times Y)^m$,
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h, S) + \mathrm{complexity}(H),$$
making the trade-off in the choice of $H$ explicit.

12 Background
One natural measure: Rademacher complexity.
Define Rademacher random variables $\sigma_i$ by $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$.
Empirical: $\widehat{R}_S(H) = \frac{2}{m}\,\mathbb{E}_\sigma\Big[\sup_{h \in H} \sum_{i=1}^m \sigma_i h(z_i)\Big]$, where $S = (z_1, \ldots, z_m)$.
Actual: $R_m(H) = \mathbb{E}_S[\widehat{R}_S(H)]$.
Intuitively, this measures the ability of the hypothesis class to fit uniform random noise.
It can be measured from data, which yields tighter bounds.
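To build intuition, the empirical Rademacher complexity can be estimated numerically by sampling the $\sigma$ vector. The sketch below is not from the talk: the threshold-classifier example and all names are illustrative. It performs the Monte Carlo estimate for a finite hypothesis class given as a matrix of its outputs on the sample.

```python
import numpy as np

def empirical_rademacher(outputs, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_S(H) = (2/m) E_sigma[ sup_h sum_i sigma_i h(z_i) ]
    for a finite hypothesis class.

    outputs: array of shape (n_hypotheses, m), row k holding h_k(z_1), ..., h_k(z_m).
    """
    rng = np.random.default_rng(seed)
    _, m = outputs.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher draw
        total += np.max(outputs @ sigma)          # sup over the hypothesis class
    return 2.0 / m * total / n_draws

# Illustrative example: threshold classifiers on 50 random points in [0, 1].
rng = np.random.default_rng(1)
z = np.sort(rng.uniform(size=50))
H = np.array([np.where(z <= t, 1.0, -1.0) for t in np.linspace(0.0, 1.0, 20)])
print(empirical_rademacher(H))
```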

15 Background
Rademacher generalization bound (0/1 loss) [Bartlett, Mendelson 01; Koltchinskii, Panchenko 00]: for all $h \in H$, with probability at least $1 - \delta$ over the draw of $S \in (X \times Y)^m$,
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h, S) + \frac{\widehat{R}_S(H)}{2} + \sqrt{\frac{\log(1/\delta)}{2m}}.$$
CRITICAL assumption: the sample must be independently and identically distributed (i.i.d.).
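Given an empirical error and an estimate of the empirical Rademacher complexity (for instance the Monte Carlo estimate above), the right-hand side of this i.i.d. bound is straightforward to evaluate; the numbers below are placeholders, not results from the talk.

```python
import numpy as np

def iid_rademacher_bound(emp_err, emp_rad, m, delta=0.05):
    """Right-hand side of the i.i.d. Rademacher bound for the 0/1 loss:
    err(h) <= emp_err + emp_rad / 2 + sqrt(log(1/delta) / (2 m))."""
    return emp_err + emp_rad / 2.0 + np.sqrt(np.log(1.0 / delta) / (2.0 * m))

print(iid_rademacher_bound(emp_err=0.12, emp_rad=0.30, m=1000))
```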

19 Motivation
In practice, data often contains dependencies. Here we consider temporal dependencies.
We assume the distribution to be mixing, which implies a dependence that weakens over time.
This is natural in the context of time-series analysis (e.g., stock market quotes).

26 Motivation
We MUST deal with non-i.i.d. data!
Often in practice, one simply uses an algorithm designed for the i.i.d. case, and many times such algorithms still perform well.
How can we justify or guarantee this performance?
Give generalization bounds that do NOT assume i.i.d. data!
Here we present general proof techniques useful for mixing processes, and we extend useful properties of Rademacher complexity to this non-i.i.d. setting.

30 Definitions
We make the standard assumption of stationarity.
[Stationarity] A sequence of random variables $\mathbf{Z} = \{Z_t\}_{t=-\infty}^{+\infty}$ is said to be stationary if for any $t$ and non-negative integers $m$ and $k$, the random vectors $(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution.
For example, stationarity implies $\Pr(X_t = B \mid X_{t-s} = A) = \Pr(X_{t+k} = B \mid X_{t-s+k} = A)$: only the relative distance between the indices matters. It does not imply $\Pr(X_t = B \mid X_{t-s} = A) = \Pr(X_{t+k} = B \mid X_{t-s+l} = A)$ when the gap changes ($l \neq s$).
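As a concrete illustration (mine, not from the talk): a Gaussian AR(1) sequence with $|\phi| < 1$ is stationary once its initial state is drawn from the stationary distribution, and it is β-mixing with geometrically decaying coefficients. The sketch below only generates such a sequence; the parameter names are my own.

```python
import numpy as np

def stationary_ar1(m, phi=0.8, noise_std=1.0, seed=0):
    """Generate a stationary Gaussian AR(1) sequence Z_t = phi * Z_{t-1} + eps_t.

    Drawing Z_0 from the stationary distribution N(0, noise_std^2 / (1 - phi^2))
    makes every window (Z_t, ..., Z_{t+k}) have the same joint distribution."""
    rng = np.random.default_rng(seed)
    z = np.empty(m)
    z[0] = rng.normal(0.0, noise_std / np.sqrt(1.0 - phi ** 2))
    for t in range(1, m):
        z[t] = phi * z[t - 1] + rng.normal(0.0, noise_std)
    return z

print(stationary_ar1(1000)[:5])
```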

33 Definitions
We quantify dependence with the natural β-mixing coefficient.
[β-mixing] Let $\mathbf{Z} = \{Z_t\}_{t=-\infty}^{+\infty}$ be a stationary sequence of random variables. Let $\sigma_i^j$ denote the σ-algebra generated by the random variables $Z_k$, $i \le k \le j$. The β-mixing coefficient of the stochastic process is defined as
$$\beta(k) = \sup_n \, \mathbb{E}_{B \in \sigma_1^n}\Big[\sup_{A \in \sigma_{n+k}^{+\infty}} \big|\Pr[A \mid B] - \Pr[A]\big|\Big].$$
Mixing implies that for events far apart in time, $\Pr[B \mid A] \approx \Pr[B]$ (near independence), while for nearby events $\Pr[B \mid A]$ may differ substantially from $\Pr[B]$.

38 Proof Strategy
Reduce the dependent scenario to the independent case.
If we introduce gaps in the sequence, can we treat the blocks as independent?
[Figure: the sample $S$ of size $m = 2a\mu$ is split into $2\mu$ alternating blocks of size $a$; $S_0$ collects the odd-numbered blocks and $S_1$ the even-numbered ones, and $\widetilde{S}_0$ denotes the corresponding sequence of independent blocks.]
The β-mixing assumption allows us to exactly bound this approximation.
Lemma [Yu, 94]: Let $S_0$ be defined as above and let $h$ be a function of $S_0$ bounded by $M$. Then
$$\big|\mathbb{E}_{S_0}[h] - \mathbb{E}_{\widetilde{S}_0}[h]\big| \le (\mu - 1)\, M\, \beta(a),$$
where $\mathbb{E}_{S_0}$ (resp. $\mathbb{E}_{\widetilde{S}_0}$) denotes expectation with respect to the dependent (resp. independent) block sequence.
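A minimal sketch of the block construction itself (my own helper, assuming the sample length is exactly $m = 2a\mu$): split the sequence into $2\mu$ consecutive blocks of size $a$ and collect the odd- and even-indexed blocks as $S_0$ and $S_1$.

```python
import numpy as np

def split_blocks(sequence, a, mu):
    """Split a sequence of length m = 2 * a * mu into 2 * mu consecutive blocks
    of size a; return the odd-indexed blocks S0 and the even-indexed blocks S1,
    each as a (mu, a) array."""
    z = np.asarray(sequence)
    assert len(z) == 2 * a * mu, "expected exactly m = 2 * a * mu points"
    blocks = z.reshape(2 * mu, a)
    return blocks[0::2], blocks[1::2]

# Example: 24 points, block size a = 3, mu = 4 blocks per half-sample.
s0, s1 = split_blocks(np.arange(24), a=3, mu=4)
print(s0)
```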

41 Proof Strategy
In the i.i.d. case, apply McDiarmid's inequality to
$$\Phi(S) = \sup_{h \in H}\big(R(h) - \widehat{R}_S(h)\big), \qquad \text{where } \widehat{R}_S(h) = \frac{1}{m}\sum_{i=1}^m h(z_i) \text{ and } R(h) = \mathbb{E}_S[\widehat{R}_S(h)].$$
We apply it to i.i.d. blocks, extending $H$ in a natural way: define $h_a(B) = \frac{1}{a}\sum_{i=1}^a h(z_i)$ for any block $B = (z_1, \ldots, z_a) \in Z^a$, and define $H_a$ as the set of all block-based hypotheses $h_a$ generated from $h \in H$.

46 Preparation Step
Re-write in terms of blocks:
$$\begin{aligned}
\Pr_S[\Phi(S) > \epsilon]
&= \Pr_S\Big[\sup_h \big(R(h) - \widehat{R}_S(h)\big) > \epsilon\Big] \\
&= \Pr_S\Big[\sup_h \Big(\tfrac{R(h) - \widehat{R}^b_{S_0}(h)}{2} + \tfrac{R(h) - \widehat{R}^b_{S_1}(h)}{2}\Big) > \epsilon\Big] && \text{(def. of $\widehat{R}_S(h)$)} \\
&\le \Pr_S\big[\Phi(S_0) + \Phi(S_1) > 2\epsilon\big] && \text{(def. of $\Phi$)} \\
&\le \Pr_{S_0}[\Phi(S_0) > \epsilon] + \Pr_{S_1}[\Phi(S_1) > \epsilon] && \text{(union bound)} \\
&= 2 \Pr_{S_0}[\Phi(S_0) > \epsilon] && \text{(stationarity)} \\
&= 2 \Pr_{S_0}\big[\Phi(S_0) - \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)] > \epsilon'\big], && \text{(def. of $\epsilon' = \epsilon - \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)]$)}
\end{aligned}$$
Obtain independent blocks using Yu's Lemma:
$$2 \Pr_{S_0}\big[\Phi(S_0) - \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)] > \epsilon'\big] \le 2 \Pr_{\widetilde{S}_0}\big[\Phi(\widetilde{S}_0) - \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)] > \epsilon'\big] + 2(\mu - 1)\beta(a).$$

50 Concentration Bound
Now, apply McDiarmid's inequality to the i.i.d. blocks.
Changing a single block $\widetilde{Z}_k$ of the sample $\widetilde{S}_0$ can change $\Phi(\widetilde{S}_0)$ by at most $\frac{1}{\mu}\,h_a(\widetilde{Z}_k) \le M/\mu$, and by McDiarmid's inequality the following holds for any $\epsilon' > 2(\mu - 1) M \beta(a)$:
$$\Pr_{\widetilde{S}_0}\big[\Phi(\widetilde{S}_0) - \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)] > \epsilon'\big] \le \exp\!\Big(\frac{-2\epsilon'^2}{\sum_{i=1}^{\mu}(M/\mu)^2}\Big) = \exp\!\Big(\frac{-2\mu\,\epsilon'^2}{M^2}\Big).$$
So far, we have:
$$\Pr_S[\Phi(S) > \epsilon] \le 2\exp\!\Big(\frac{-2\mu\,\epsilon'^2}{M^2}\Big) + 2(\mu - 1)\beta(a).$$
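To connect this tail bound to the high-probability statement on the next slides, one can set the right-hand side equal to $\delta$ and solve for $\epsilon$; the following is my own restatement of that inversion, using the definitions above:
$$2\exp\!\Big(\frac{-2\mu\,\epsilon'^2}{M^2}\Big) + 2(\mu - 1)\beta(a) = \delta
\;\Longrightarrow\;
\epsilon' = M\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}}
\quad\text{with } \delta' = \delta - 2(\mu - 1)\beta(a) > 0.$$
Since $\epsilon = \epsilon' + \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)]$, with probability at least $1 - \delta$, for all $h \in H$,
$$R(h) \le \widehat{R}_S(h) + \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)] + M\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}},$$
which becomes the stated bound once the expectation is bounded by the Rademacher complexity.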

53 Bounding the Expectation
To make the bound useful, we must bound the expectation (over i.i.d. blocks). Introducing a ghost sample $\widetilde{S}_0'$ and Rademacher variables $\sigma_i$:
$$\begin{aligned}
\mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)]
&\le \mathbb{E}_{\widetilde{S}_0, \widetilde{S}_0'}\Big[\sup_{h \in H}\big(\widehat{R}_{\widetilde{S}_0'}(h) - \widehat{R}_{\widetilde{S}_0}(h)\big)\Big] \\
&= \mathbb{E}_{\widetilde{S}_0, \widetilde{S}_0'}\Big[\sup_{h_a \in H_a}\frac{1}{\mu}\sum_{i=1}^{\mu}\big(h_a(\widetilde{Z}_i') - h_a(\widetilde{Z}_i)\big)\Big] && \text{(def. of $\widehat{R}$)} \\
&= \mathbb{E}_{\widetilde{S}_0, \widetilde{S}_0', \sigma}\Big[\sup_{h_a \in H_a}\frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i\big(h_a(\widetilde{Z}_i') - h_a(\widetilde{Z}_i)\big)\Big] && \text{(Rademacher var.'s)} \\
&\le \mathbb{E}_{\widetilde{S}_0', \sigma}\Big[\sup_{h_a \in H_a}\frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i h_a(\widetilde{Z}_i')\Big]
 + \mathbb{E}_{\widetilde{S}_0, \sigma}\Big[\sup_{h_a \in H_a}\frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i h_a(\widetilde{Z}_i)\Big] && \text{(prop. of sup)} \\
&= 2\,\mathbb{E}_{\widetilde{S}_0, \sigma}\Big[\sup_{h_a \in H_a}\frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i h_a(\widetilde{Z}_i)\Big].
\end{aligned}$$

59 Bounding the Expectation
We would like the complexity of $H$, not $H_a$:
$$\mathbb{E}_{\widetilde{S}_0,\sigma}\Big[\frac{2}{\mu}\sup_{h_a \in H_a}\sum_{i=1}^{\mu}\sigma_i h_a(\widetilde{Z}_i)\Big]
 = \mathbb{E}_{\widetilde{S}_0,\sigma}\Big[\sup_{h \in H}\frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i \frac{1}{a}\sum_{j=1}^{a} h(z^{(i)}_j)\Big].$$
This almost looks like a Rademacher complexity, but the $\sigma$'s are shared within each block.
$$\begin{aligned}
\mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)]
&\le \mathbb{E}_{\widetilde{S}_0,\sigma}\Big[\sup_{h \in H}\frac{1}{a}\sum_{j=1}^{a}\frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i h(z^{(i)}_j)\Big] && \text{(reversing order of sums)} \\
&\le \frac{1}{a}\sum_{j=1}^{a}\mathbb{E}_{\widetilde{S}_0,\sigma}\Big[\sup_{h \in H}\frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i h(z^{(i)}_j)\Big] && \text{(convexity of sup)} \\
&= \frac{1}{a}\sum_{j=1}^{a}\mathbb{E}_{\widetilde{S}^j_0,\sigma}\Big[\sup_{h \in H}\frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i h(z^{(i)}_j)\Big] && \text{(marginalization)} \\
&= \mathbb{E}_{S_\mu,\sigma}\Big[\sup_{h \in H}\frac{2}{\mu}\sum_{z_i \in S_\mu}\sigma_i h(z_i)\Big] = R^D_\mu(H).
\end{aligned}$$
Here $D$ denotes the (i.i.d.) marginal distribution, and $S_\mu$ an i.i.d. sample of $\mu$ points drawn from $D$.

65 Bound So Far
We have a bound in terms of the Rademacher complexity. With probability at least $1 - \delta$, where $\delta' = \delta - 2(\mu - 1)\beta(a)$, the following inequality holds for all hypotheses $h \in H$:
$$R(h) \le \widehat{R}_S(h) + R^D_\mu(H) + M\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}}.$$
We would like bounds in terms of the empirical Rademacher complexity, which has many benefits:
It can be measured from data (tighter bounds).
It can be related to other complexity measures (e.g. VC-dimension).
It can be bounded for specific hypothesis classes.
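To see how the mixing penalty enters numerically, the helper below (parameter names are mine) computes $\delta' = \delta - 2(\mu - 1)\beta(a)$ and the slack term $M\sqrt{\log(2/\delta')/(2\mu)}$ for an algebraically mixing coefficient $\beta(a) = \beta_0 a^{-r}$, the case considered later in the talk; it returns None when the dependence penalty already exceeds $\delta$.

```python
import numpy as np

def slack_term(m, mu, delta=0.05, M=1.0, beta0=1.0, r=2.0):
    """Compute delta' = delta - 2 * (mu - 1) * beta(a), with beta(a) = beta0 * a**(-r)
    and a = m / (2 * mu), and the slack M * sqrt(log(2 / delta') / (2 * mu))."""
    a = m / (2 * mu)
    delta_prime = delta - 2 * (mu - 1) * beta0 * a ** (-r)
    if delta_prime <= 0:
        return None  # too much dependence for this choice of mu
    return M * np.sqrt(np.log(2.0 / delta_prime) / (2.0 * mu))

for mu in (10, 50, 100):
    print(mu, slack_term(m=10000, mu=mu))
```

Larger $\mu$ shrinks the concentration term but inflates the β penalty through the shorter blocks; this is the trade-off resolved by the explicit parameter choice later in the talk.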

69 Bound So Far
Using similar techniques, we can show that $\widehat{R}_{S_\mu}$ is close to $R^D_\mu$:
With probability at least $1 - \delta$, where $\delta' = \delta/2 - 2(\mu - 1)\beta(a)$, the following inequality holds for all hypotheses $h \in H$:
$$R(h) \le \widehat{R}_S(h) + \widehat{R}_{S_\mu}(H) + 3M\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}}.$$
Kernel hypothesis bound [Bartlett, Mendelson 01]: in the case of classification with hypotheses based on a kernel $K$ and a weight vector bounded by $B$, $\|w\| \le B$, the empirical Rademacher complexity can be bounded as
$$\widehat{R}_{S_\mu}(H) \le \frac{2B}{\mu}\sqrt{\mathrm{Tr}[K]}.$$
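The kernel bound is easy to evaluate from a Gram matrix; a small sketch, using a Gaussian kernel on random points purely for illustration:

```python
import numpy as np

def kernel_rademacher_bound(K, B=1.0):
    """Upper bound (2B / mu) * sqrt(trace(K)) on the empirical Rademacher
    complexity of kernel hypotheses with ||w|| <= B, where K is the Gram
    matrix on the mu points used for the estimate."""
    mu = K.shape[0]
    return 2.0 * B / mu * np.sqrt(np.trace(K))

# Illustrative Gram matrix: Gaussian kernel on 200 random points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)
print(kernel_rademacher_bound(K, B=1.0))
```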

73 Classification Bound
Let $H$ be the set of hypotheses
$$\Big\{(x, y) \in Z \mapsto y\sum_{i=1}^m \alpha_i K(x_i, x) \;:\; \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i, x_j) \le 1\Big\}.$$
Let $\widehat{R}^\rho_S(h)$ denote the average amount by which $y_i h(x_i)$ deviates from the margin $\rho$:
$$\widehat{R}^\rho_S(h) = \frac{1}{m}\sum_{i=1}^m \big(\rho - y_i h(x_i)\big)_+.$$
[Figure: plot of the margin loss $(\rho - yh(x))_+$ against $yh(x)$, illustrating that $\frac{1}{\rho}(\rho - yh(x))_+$ upper bounds the 0/1 loss.]
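The empirical margin term in the bound is just a clipped average; a small helper (mine, not from the slides):

```python
import numpy as np

def empirical_margin_loss(scores, labels, rho=1.0):
    """Average margin loss (1/m) * sum_i (rho - y_i * h(x_i))_+,
    where `scores` holds h(x_i) and `labels` holds y_i in {-1, +1}."""
    margins = labels * scores
    return float(np.mean(np.maximum(0.0, rho - margins)))

y = np.array([1, -1, 1, 1, -1])
h = np.array([0.9, -1.2, 0.3, -0.1, 0.4])
print(empirical_margin_loss(h, y, rho=0.5))
```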

77 Classification Bound
With probability at least $1 - \delta$, the following inequality holds for all hypotheses $h \in H$:
$$\Pr[y h(x) \le 0] \le \frac{1}{\rho}\widehat{R}^\rho_S(h) + \frac{4}{\mu\rho}\sqrt{\mathrm{Tr}[K]} + 3\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}},$$
where $\delta' = \delta/2 - 2(\mu - 1)\beta(a)$, and $K$ is the Gram matrix of the kernel $K$ for the sample $S$.
Now we need to appropriately choose the parameters $a$ and $\mu$. If we assume algebraic mixing, $\beta(a) := \beta_0 a^{-r}$, one suitable choice is $\mu = m^{\frac{2r+1}{2r+4}}/2$.
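Under the reading of the slide's choice as $\mu = m^{(2r+1)/(2r+4)}/2$ together with the constraint $2a\mu = m$ (an assumption on my part, given the garbled original), the block parameters scale with the sample size as in this sketch:

```python
def block_parameters(m, r):
    """Block count mu and block size a for algebraic mixing beta(a) = beta0 * a**(-r),
    assuming mu = m**((2 * r + 1) / (2 * r + 4)) / 2 and 2 * a * mu = m."""
    mu = m ** ((2 * r + 1) / (2 * r + 4)) / 2
    a = m / (2 * mu)
    return mu, a

for m in (10 ** 3, 10 ** 4, 10 ** 5):
    mu, a = block_parameters(m, r=2)
    print(m, round(mu, 1), round(a, 1))
```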

80 A Complete Bound
Assuming that the sample is drawn from a stationary, algebraically β-mixing distribution, $\beta(a) = \beta_0 a^{-r}$, the following bound holds:
$$\Pr[y h(x) \le 0] \le \frac{1}{\rho}\widehat{R}^\rho_S(h) + \frac{8}{\rho}\, m^{\gamma_1} + 3\, m^{\gamma_2}\sqrt{\log\frac{2}{\delta'}},$$
where $\gamma_1 = \frac{1}{2}\big(\frac{3}{r+2} - 1\big)$, $\gamma_2 = \frac{1}{2}\big(\frac{3}{2r+4} - 1\big)$, and $\delta' = \delta/2 - \beta_0 m^{\gamma_1}$.
As $r \to \infty$ and $\beta_0 \to 0$ (i.e. as the i.i.d. scenario is approached), this bound has the same asymptotic behavior as the i.i.d. bound.
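A quick check of the asymptotic claim, using the exponents as reconstructed above (so this inherits that reconstruction's uncertainty): both exponents tend to $-1/2$, the rate of the i.i.d. bound.
$$\lim_{r\to\infty}\gamma_1 = \lim_{r\to\infty}\tfrac{1}{2}\Big(\tfrac{3}{r+2}-1\Big) = -\tfrac{1}{2},
\qquad
\lim_{r\to\infty}\gamma_2 = \lim_{r\to\infty}\tfrac{1}{2}\Big(\tfrac{3}{2r+4}-1\Big) = -\tfrac{1}{2},$$
so the two penalty terms both decay like $m^{-1/2}$, matching the $\sqrt{\log(1/\delta)/(2m)}$ dependence of the i.i.d. Rademacher bound.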

87 Summary
We have given the first data-dependent bounds for a non-i.i.d. scenario.
These are the first known margin-based classification bounds in this setting (and can be extended to regression as well).
Bounds in terms of other complexity measures follow easily via the Rademacher complexity.
Future work:
Can we make use of the entire sample to compute the empirical Rademacher complexity?
Can we strictly generalize the i.i.d. bound?
