Rademacher Bounds for Non-i.i.d. Processes


1 Rademacher Bounds for Non-i.i.d. Processes Afshin Rostamizadeh Joint work with: Mehryar Mohri

5 Background
Generalization bounds: how well can we estimate an algorithm's true performance based on a finite sample? That is, how far is $\widehat{\mathrm{err}}(h, S)$ from $\mathbb{E}_S[\widehat{\mathrm{err}}(h, S)] = \mathrm{err}(h)$?
Usually, generalization bounds take the form: for all $h \in H$, with high probability over the draw of $S \in (X \times Y)^m$,
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h, S) + \mathrm{complexity}(H),$$
making the trade-off in the choice of $H$ explicit.

12 Background
One natural measure: Rademacher complexity.
Define Rademacher random variables $\sigma_i$ by $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$.
Empirical: $\widehat{R}_S(H) = \frac{2}{m}\,\mathbb{E}_\sigma\Big[\sup_{h \in H} \sum_{i=1}^m \sigma_i h(z_i)\Big]$, where $S = (z_1, \ldots, z_m)$.
Actual: $R_m(H) = \mathbb{E}_S[\widehat{R}_S(H)]$.
Intuitively, this measures the ability of the hypothesis class to fit uniform random noise.
It can be measured from data, which yields tighter bounds.
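To build intuition, the empirical Rademacher complexity can be estimated numerically by sampling the $\sigma$ vector. The sketch below is not from the talk: the threshold-classifier example and all names are illustrative. It performs the Monte Carlo estimate for a finite hypothesis class given as a matrix of its outputs on the sample.

```python
import numpy as np

def empirical_rademacher(outputs, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_S(H) = (2/m) E_sigma[ sup_h sum_i sigma_i h(z_i) ]
    for a finite hypothesis class.

    outputs: array of shape (n_hypotheses, m), row k holding h_k(z_1), ..., h_k(z_m).
    """
    rng = np.random.default_rng(seed)
    _, m = outputs.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher draw
        total += np.max(outputs @ sigma)          # sup over the hypothesis class
    return 2.0 / m * total / n_draws

# Illustrative example: threshold classifiers on 50 random points in [0, 1].
rng = np.random.default_rng(1)
z = np.sort(rng.uniform(size=50))
H = np.array([np.where(z <= t, 1.0, -1.0) for t in np.linspace(0.0, 1.0, 20)])
print(empirical_rademacher(H))
```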

15 Background
Rademacher generalization bound (0/1 loss) [Bartlett, Mendelson 01; Koltchinskii, Panchenko 00]: for all $h \in H$, with probability at least $1 - \delta$ over the draw of $S \in (X \times Y)^m$,
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h, S) + \frac{\widehat{R}_S(H)}{2} + \sqrt{\frac{\log(1/\delta)}{2m}}.$$
CRITICAL assumption: the sample must be independently and identically distributed (i.i.d.).
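Given an empirical error and an estimate of the empirical Rademacher complexity (for instance the Monte Carlo estimate above), the right-hand side of this i.i.d. bound is straightforward to evaluate; the numbers below are placeholders, not results from the talk.

```python
import numpy as np

def iid_rademacher_bound(emp_err, emp_rad, m, delta=0.05):
    """Right-hand side of the i.i.d. Rademacher bound for the 0/1 loss:
    err(h) <= emp_err + emp_rad / 2 + sqrt(log(1/delta) / (2 m))."""
    return emp_err + emp_rad / 2.0 + np.sqrt(np.log(1.0 / delta) / (2.0 * m))

print(iid_rademacher_bound(emp_err=0.12, emp_rad=0.30, m=1000))
```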

19 Motivation
In practice, data often contains dependencies. Here we consider temporal dependencies.
We assume the distribution to be mixing, which implies a dependence that weakens over time.
This is natural in the context of time-series analysis (e.g., stock market quotes).

26 Motivation
We MUST deal with non-i.i.d. data!
Often in practice, one simply uses an algorithm designed for the i.i.d. case, and many times such algorithms still perform well.
How can we justify or guarantee this performance?
Give generalization bounds that do NOT assume i.i.d. data!
Here we present general proof techniques useful for mixing processes, and we extend useful properties of Rademacher complexity to this non-i.i.d. setting.

30 Definitions
We make the standard assumption of stationarity.
[Stationarity] A sequence of random variables $\mathbf{Z} = \{Z_t\}_{t=-\infty}^{+\infty}$ is said to be stationary if for any $t$ and non-negative integers $m$ and $k$, the random vectors $(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution.
For example, stationarity implies $\Pr(X_t = B \mid X_{t-s} = A) = \Pr(X_{t+k} = B \mid X_{t-s+k} = A)$: only the relative distance between the indices matters. It does not imply $\Pr(X_t = B \mid X_{t-s} = A) = \Pr(X_{t+k} = B \mid X_{t-s+l} = A)$ when the gap changes ($l \neq s$).
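As a concrete illustration (mine, not from the talk): a Gaussian AR(1) sequence with $|\phi| < 1$ is stationary once its initial state is drawn from the stationary distribution, and it is β-mixing with geometrically decaying coefficients. The sketch below only generates such a sequence; the parameter names are my own.

```python
import numpy as np

def stationary_ar1(m, phi=0.8, noise_std=1.0, seed=0):
    """Generate a stationary Gaussian AR(1) sequence Z_t = phi * Z_{t-1} + eps_t.

    Drawing Z_0 from the stationary distribution N(0, noise_std^2 / (1 - phi^2))
    makes every window (Z_t, ..., Z_{t+k}) have the same joint distribution."""
    rng = np.random.default_rng(seed)
    z = np.empty(m)
    z[0] = rng.normal(0.0, noise_std / np.sqrt(1.0 - phi ** 2))
    for t in range(1, m):
        z[t] = phi * z[t - 1] + rng.normal(0.0, noise_std)
    return z

print(stationary_ar1(1000)[:5])
```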

33 Definitions
We quantify dependence with the natural β-mixing coefficient.
[β-mixing] Let $\mathbf{Z} = \{Z_t\}_{t=-\infty}^{+\infty}$ be a stationary sequence of random variables. Let $\sigma_i^j$ denote the σ-algebra generated by the random variables $Z_k$, $i \le k \le j$. The β-mixing coefficient of the stochastic process is defined as
$$\beta(k) = \sup_n \, \mathbb{E}_{B \in \sigma_1^n}\Big[\sup_{A \in \sigma_{n+k}^{+\infty}} \big|\Pr[A \mid B] - \Pr[A]\big|\Big].$$
Mixing implies that for events far apart in time, $\Pr[B \mid A] \approx \Pr[B]$ (near independence), while for nearby events $\Pr[B \mid A]$ may differ substantially from $\Pr[B]$.

38 Proof Strategy
Reduce the dependent scenario to the independent case.
If we introduce gaps in the sequence, can we treat the blocks as independent?
[Figure: the sample $S$ of size $m = 2a\mu$ is split into $2\mu$ alternating blocks of size $a$; $S_0$ collects the odd-numbered blocks and $S_1$ the even-numbered ones, and $\widetilde{S}_0$ denotes the corresponding sequence of independent blocks.]
The β-mixing assumption allows us to exactly bound this approximation.
Lemma [Yu, 94]: Let $S_0$ be defined as above and let $h$ be a function of $S_0$ bounded by $M$. Then
$$\big|\mathbb{E}_{S_0}[h] - \mathbb{E}_{\widetilde{S}_0}[h]\big| \le (\mu - 1)\, M\, \beta(a),$$
where $\mathbb{E}_{S_0}$ (resp. $\mathbb{E}_{\widetilde{S}_0}$) denotes expectation with respect to the dependent (resp. independent) block sequence.
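A minimal sketch of the block construction itself (my own helper, assuming the sample length is exactly $m = 2a\mu$): split the sequence into $2\mu$ consecutive blocks of size $a$ and collect the odd- and even-indexed blocks as $S_0$ and $S_1$.

```python
import numpy as np

def split_blocks(sequence, a, mu):
    """Split a sequence of length m = 2 * a * mu into 2 * mu consecutive blocks
    of size a; return the odd-indexed blocks S0 and the even-indexed blocks S1,
    each as a (mu, a) array."""
    z = np.asarray(sequence)
    assert len(z) == 2 * a * mu, "expected exactly m = 2 * a * mu points"
    blocks = z.reshape(2 * mu, a)
    return blocks[0::2], blocks[1::2]

# Example: 24 points, block size a = 3, mu = 4 blocks per half-sample.
s0, s1 = split_blocks(np.arange(24), a=3, mu=4)
print(s0)
```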

41 Proof Strategy
In the i.i.d. case, apply McDiarmid's inequality to
$$\Phi(S) = \sup_{h \in H}\big(R(h) - \widehat{R}_S(h)\big), \qquad \text{where } \widehat{R}_S(h) = \frac{1}{m}\sum_{i=1}^m h(z_i) \text{ and } R(h) = \mathbb{E}_S[\widehat{R}_S(h)].$$
We apply it to i.i.d. blocks, extending $H$ in a natural way: define $h_a(B) = \frac{1}{a}\sum_{i=1}^a h(z_i)$ for any block $B = (z_1, \ldots, z_a) \in Z^a$, and define $H_a$ as the set of all block-based hypotheses $h_a$ generated from $h \in H$.

46 Preparation Step
Re-write in terms of blocks:
$$\begin{aligned}
\Pr_S[\Phi(S) > \epsilon]
&= \Pr_S\Big[\sup_h \big(R(h) - \widehat{R}_S(h)\big) > \epsilon\Big] \\
&= \Pr_S\Big[\sup_h \Big(\tfrac{R(h) - \widehat{R}^b_{S_0}(h)}{2} + \tfrac{R(h) - \widehat{R}^b_{S_1}(h)}{2}\Big) > \epsilon\Big] && \text{(def. of $\widehat{R}_S(h)$)} \\
&\le \Pr_S\big[\Phi(S_0) + \Phi(S_1) > 2\epsilon\big] && \text{(def. of $\Phi$)} \\
&\le \Pr_{S_0}[\Phi(S_0) > \epsilon] + \Pr_{S_1}[\Phi(S_1) > \epsilon] && \text{(union bound)} \\
&= 2 \Pr_{S_0}[\Phi(S_0) > \epsilon] && \text{(stationarity)} \\
&= 2 \Pr_{S_0}\big[\Phi(S_0) - \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)] > \epsilon'\big], && \text{(def. of $\epsilon' = \epsilon - \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)]$)}
\end{aligned}$$
Obtain independent blocks using Yu's Lemma:
$$2 \Pr_{S_0}\big[\Phi(S_0) - \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)] > \epsilon'\big] \le 2 \Pr_{\widetilde{S}_0}\big[\Phi(\widetilde{S}_0) - \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)] > \epsilon'\big] + 2(\mu - 1)\beta(a).$$

50 Concentration Bound
Now, apply McDiarmid's inequality to the i.i.d. blocks.
Changing a single block $\widetilde{Z}_k$ of the sample $\widetilde{S}_0$ can change $\Phi(\widetilde{S}_0)$ by at most $\frac{1}{\mu}\,h_a(\widetilde{Z}_k) \le M/\mu$, and by McDiarmid's inequality the following holds for any $\epsilon' > 2(\mu - 1) M \beta(a)$:
$$\Pr_{\widetilde{S}_0}\big[\Phi(\widetilde{S}_0) - \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)] > \epsilon'\big] \le \exp\!\Big(\frac{-2\epsilon'^2}{\sum_{i=1}^{\mu}(M/\mu)^2}\Big) = \exp\!\Big(\frac{-2\mu\,\epsilon'^2}{M^2}\Big).$$
So far, we have:
$$\Pr_S[\Phi(S) > \epsilon] \le 2\exp\!\Big(\frac{-2\mu\,\epsilon'^2}{M^2}\Big) + 2(\mu - 1)\beta(a).$$
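To connect this tail bound to the high-probability statement on the next slides, one can set the right-hand side equal to $\delta$ and solve for $\epsilon$; the following is my own restatement of that inversion, using the definitions above:
$$2\exp\!\Big(\frac{-2\mu\,\epsilon'^2}{M^2}\Big) + 2(\mu - 1)\beta(a) = \delta
\;\Longrightarrow\;
\epsilon' = M\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}}
\quad\text{with } \delta' = \delta - 2(\mu - 1)\beta(a) > 0.$$
Since $\epsilon = \epsilon' + \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)]$, with probability at least $1 - \delta$, for all $h \in H$,
$$R(h) \le \widehat{R}_S(h) + \mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)] + M\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}},$$
which becomes the stated bound once the expectation is bounded by the Rademacher complexity.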

53 Bounding the Expectation
To make the bound useful, we must bound the expectation (over i.i.d. blocks). Introducing a ghost sample $\widetilde{S}_0'$ and Rademacher variables $\sigma_i$:
$$\begin{aligned}
\mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)]
&\le \mathbb{E}_{\widetilde{S}_0, \widetilde{S}_0'}\Big[\sup_{h \in H}\big(\widehat{R}_{\widetilde{S}_0'}(h) - \widehat{R}_{\widetilde{S}_0}(h)\big)\Big] \\
&= \mathbb{E}_{\widetilde{S}_0, \widetilde{S}_0'}\Big[\sup_{h_a \in H_a}\frac{1}{\mu}\sum_{i=1}^{\mu}\big(h_a(\widetilde{Z}_i') - h_a(\widetilde{Z}_i)\big)\Big] && \text{(def. of $\widehat{R}$)} \\
&= \mathbb{E}_{\widetilde{S}_0, \widetilde{S}_0', \sigma}\Big[\sup_{h_a \in H_a}\frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i\big(h_a(\widetilde{Z}_i') - h_a(\widetilde{Z}_i)\big)\Big] && \text{(Rademacher var.'s)} \\
&\le \mathbb{E}_{\widetilde{S}_0', \sigma}\Big[\sup_{h_a \in H_a}\frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i h_a(\widetilde{Z}_i')\Big]
 + \mathbb{E}_{\widetilde{S}_0, \sigma}\Big[\sup_{h_a \in H_a}\frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i h_a(\widetilde{Z}_i)\Big] && \text{(prop. of sup)} \\
&= 2\,\mathbb{E}_{\widetilde{S}_0, \sigma}\Big[\sup_{h_a \in H_a}\frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i h_a(\widetilde{Z}_i)\Big].
\end{aligned}$$

59 Bounding the Expectation
We would like the complexity of $H$, not $H_a$:
$$\mathbb{E}_{\widetilde{S}_0,\sigma}\Big[\frac{2}{\mu}\sup_{h_a \in H_a}\sum_{i=1}^{\mu}\sigma_i h_a(\widetilde{Z}_i)\Big]
 = \mathbb{E}_{\widetilde{S}_0,\sigma}\Big[\sup_{h \in H}\frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i \frac{1}{a}\sum_{j=1}^{a} h(z^{(i)}_j)\Big].$$
This almost looks like a Rademacher complexity, but the $\sigma$'s are shared within each block.
$$\begin{aligned}
\mathbb{E}_{\widetilde{S}_0}[\Phi(\widetilde{S}_0)]
&\le \mathbb{E}_{\widetilde{S}_0,\sigma}\Big[\sup_{h \in H}\frac{1}{a}\sum_{j=1}^{a}\frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i h(z^{(i)}_j)\Big] && \text{(reversing order of sums)} \\
&\le \frac{1}{a}\sum_{j=1}^{a}\mathbb{E}_{\widetilde{S}_0,\sigma}\Big[\sup_{h \in H}\frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i h(z^{(i)}_j)\Big] && \text{(convexity of sup)} \\
&= \frac{1}{a}\sum_{j=1}^{a}\mathbb{E}_{\widetilde{S}^j_0,\sigma}\Big[\sup_{h \in H}\frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i h(z^{(i)}_j)\Big] && \text{(marginalization)} \\
&= \mathbb{E}_{S_\mu,\sigma}\Big[\sup_{h \in H}\frac{2}{\mu}\sum_{z_i \in S_\mu}\sigma_i h(z_i)\Big] = R^D_\mu(H).
\end{aligned}$$
Here $D$ denotes the (i.i.d.) marginal distribution, and $S_\mu$ an i.i.d. sample of $\mu$ points drawn from $D$.

65 Bound So Far
We have a bound in terms of the Rademacher complexity. With probability at least $1 - \delta$, where $\delta' = \delta - 2(\mu - 1)\beta(a)$, the following inequality holds for all hypotheses $h \in H$:
$$R(h) \le \widehat{R}_S(h) + R^D_\mu(H) + M\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}}.$$
We would like bounds in terms of the empirical Rademacher complexity, which has many benefits:
It can be measured from data (tighter bounds).
It can be related to other complexity measures (e.g. VC-dimension).
It can be bounded for specific hypothesis classes.
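To see how the mixing penalty enters numerically, the helper below (parameter names are mine) computes $\delta' = \delta - 2(\mu - 1)\beta(a)$ and the slack term $M\sqrt{\log(2/\delta')/(2\mu)}$ for an algebraically mixing coefficient $\beta(a) = \beta_0 a^{-r}$, the case considered later in the talk; it returns None when the dependence penalty already exceeds $\delta$.

```python
import numpy as np

def slack_term(m, mu, delta=0.05, M=1.0, beta0=1.0, r=2.0):
    """Compute delta' = delta - 2 * (mu - 1) * beta(a), with beta(a) = beta0 * a**(-r)
    and a = m / (2 * mu), and the slack M * sqrt(log(2 / delta') / (2 * mu))."""
    a = m / (2 * mu)
    delta_prime = delta - 2 * (mu - 1) * beta0 * a ** (-r)
    if delta_prime <= 0:
        return None  # too much dependence for this choice of mu
    return M * np.sqrt(np.log(2.0 / delta_prime) / (2.0 * mu))

for mu in (10, 50, 100):
    print(mu, slack_term(m=10000, mu=mu))
```

Larger $\mu$ shrinks the concentration term but inflates the β penalty through the shorter blocks; this is the trade-off resolved by the explicit parameter choice later in the talk.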

69 Bound So Far
Using similar techniques, we can show that $\widehat{R}_{S_\mu}$ is close to $R^D_\mu$:
With probability at least $1 - \delta$, where $\delta' = \delta/2 - 2(\mu - 1)\beta(a)$, the following inequality holds for all hypotheses $h \in H$:
$$R(h) \le \widehat{R}_S(h) + \widehat{R}_{S_\mu}(H) + 3M\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}}.$$
Kernel hypothesis bound [Bartlett, Mendelson 01]: in the case of classification with hypotheses based on a kernel $K$ and a weight vector bounded by $B$, $\|w\| \le B$, the empirical Rademacher complexity can be bounded as
$$\widehat{R}_{S_\mu}(H) \le \frac{2B}{\mu}\sqrt{\mathrm{Tr}[K]}.$$
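The kernel bound is easy to evaluate from a Gram matrix; a small sketch, using a Gaussian kernel on random points purely for illustration:

```python
import numpy as np

def kernel_rademacher_bound(K, B=1.0):
    """Upper bound (2B / mu) * sqrt(trace(K)) on the empirical Rademacher
    complexity of kernel hypotheses with ||w|| <= B, where K is the Gram
    matrix on the mu points used for the estimate."""
    mu = K.shape[0]
    return 2.0 * B / mu * np.sqrt(np.trace(K))

# Illustrative Gram matrix: Gaussian kernel on 200 random points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)
print(kernel_rademacher_bound(K, B=1.0))
```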

73 Classification Bound
Let $H$ be the set of hypotheses
$$\Big\{(x, y) \in Z \mapsto y\sum_{i=1}^m \alpha_i K(x_i, x) \;:\; \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i, x_j) \le 1\Big\}.$$
Let $\widehat{R}^\rho_S(h)$ denote the average amount by which $y_i h(x_i)$ deviates from the margin $\rho$:
$$\widehat{R}^\rho_S(h) = \frac{1}{m}\sum_{i=1}^m \big(\rho - y_i h(x_i)\big)_+.$$
[Figure: plot of the margin loss $(\rho - yh(x))_+$ against $yh(x)$, illustrating that $\frac{1}{\rho}(\rho - yh(x))_+$ upper bounds the 0/1 loss.]
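The empirical margin term in the bound is just a clipped average; a small helper (mine, not from the slides):

```python
import numpy as np

def empirical_margin_loss(scores, labels, rho=1.0):
    """Average margin loss (1/m) * sum_i (rho - y_i * h(x_i))_+,
    where `scores` holds h(x_i) and `labels` holds y_i in {-1, +1}."""
    margins = labels * scores
    return float(np.mean(np.maximum(0.0, rho - margins)))

y = np.array([1, -1, 1, 1, -1])
h = np.array([0.9, -1.2, 0.3, -0.1, 0.4])
print(empirical_margin_loss(h, y, rho=0.5))
```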

77 Classification Bound
With probability at least $1 - \delta$, the following inequality holds for all hypotheses $h \in H$:
$$\Pr[y h(x) \le 0] \le \frac{1}{\rho}\widehat{R}^\rho_S(h) + \frac{4}{\mu\rho}\sqrt{\mathrm{Tr}[K]} + 3\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}},$$
where $\delta' = \delta/2 - 2(\mu - 1)\beta(a)$, and $K$ is the Gram matrix of the kernel $K$ for the sample $S$.
Now we need to appropriately choose the parameters $a$ and $\mu$. If we assume algebraic mixing, $\beta(a) := \beta_0 a^{-r}$, one suitable choice is $\mu = m^{\frac{2r+1}{2r+4}}/2$.
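Under the reading of the slide's choice as $\mu = m^{(2r+1)/(2r+4)}/2$ together with the constraint $2a\mu = m$ (an assumption on my part, given the garbled original), the block parameters scale with the sample size as in this sketch:

```python
def block_parameters(m, r):
    """Block count mu and block size a for algebraic mixing beta(a) = beta0 * a**(-r),
    assuming mu = m**((2 * r + 1) / (2 * r + 4)) / 2 and 2 * a * mu = m."""
    mu = m ** ((2 * r + 1) / (2 * r + 4)) / 2
    a = m / (2 * mu)
    return mu, a

for m in (10 ** 3, 10 ** 4, 10 ** 5):
    mu, a = block_parameters(m, r=2)
    print(m, round(mu, 1), round(a, 1))
```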

80 A Complete Bound
Assuming that the sample is drawn from a stationary, algebraically β-mixing distribution, $\beta(a) = \beta_0 a^{-r}$, the following bound holds:
$$\Pr[y h(x) \le 0] \le \frac{1}{\rho}\widehat{R}^\rho_S(h) + \frac{8}{\rho}\, m^{\gamma_1} + 3\, m^{\gamma_2}\sqrt{\log\frac{2}{\delta'}},$$
where $\gamma_1 = \frac{1}{2}\big(\frac{3}{r+2} - 1\big)$, $\gamma_2 = \frac{1}{2}\big(\frac{3}{2r+4} - 1\big)$, and $\delta' = \delta/2 - \beta_0 m^{\gamma_1}$.
As $r \to \infty$ and $\beta_0 \to 0$ (i.e. as the i.i.d. scenario is approached), this bound has the same asymptotic behavior as the i.i.d. bound.
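A quick check of the asymptotic claim, using the exponents as reconstructed above (so this inherits that reconstruction's uncertainty): both exponents tend to $-1/2$, the rate of the i.i.d. bound.
$$\lim_{r\to\infty}\gamma_1 = \lim_{r\to\infty}\tfrac{1}{2}\Big(\tfrac{3}{r+2}-1\Big) = -\tfrac{1}{2},
\qquad
\lim_{r\to\infty}\gamma_2 = \lim_{r\to\infty}\tfrac{1}{2}\Big(\tfrac{3}{2r+4}-1\Big) = -\tfrac{1}{2},$$
so the two penalty terms both decay like $m^{-1/2}$, matching the $\sqrt{\log(1/\delta)/(2m)}$ dependence of the i.i.d. Rademacher bound.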

87 Summary
We have given the first data-dependent bounds for a non-i.i.d. scenario.
These are the first known margin-based classification bounds in this setting (and can be extended to regression as well).
Bounds in terms of other complexity measures follow easily via the Rademacher complexity.
Future work:
Can we make use of the entire sample to compute the empirical Rademacher complexity?
Can we strictly generalize the i.i.d. bound?
