Rademacher Bounds for Non-i.i.d. Processes
Afshin Rostamizadeh
Joint work with: Mehryar Mohri
Background

Generalization bounds: how well can we estimate an algorithm's true performance based on a finite sample?

$$\widehat{\mathrm{err}}(h, S) \approx\ ?\qquad \mathbb{E}_S[\widehat{\mathrm{err}}(h, S)] = \mathrm{err}(h)$$

Usually, generalization bounds take the form: for all $h \in H$ and all samples $S \in (X \times Y)^m$,
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h, S) + \mathrm{complexity}(H).$$

Explicit trade-off in the choice of $H$.
One natural complexity measure: Rademacher complexity.

Define Rademacher random variables by $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$.

Empirical Rademacher complexity:
$$\hat{R}_S(H) = \frac{2}{m}\,\mathbb{E}_\sigma\!\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i\, h(z_i)\right], \qquad S = (z_1, \ldots, z_m).$$

Actual Rademacher complexity: $R_m(H) = \mathbb{E}_S[\hat{R}_S(H)]$.

Intuitively, this measures the ability of the hypothesis class to fit uniform random noise. It can be measured from data, which yields tighter bounds.
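As an aside (not part of the slides), the empirical Rademacher complexity of a finite hypothesis class can be estimated by Monte Carlo directly from its definition; the hypothesis class and sample below are made up for illustration.

```python
import numpy as np

def empirical_rademacher(preds, n_trials=2000, seed=0):
    """Monte Carlo estimate of R_hat_S(H) = (2/m) E_sigma[ sup_h sum_i sigma_i h(z_i) ].

    preds: array of shape (n_hypotheses, m) with preds[k, i] = h_k(z_i).
    """
    rng = np.random.default_rng(seed)
    _, m = preds.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(preds @ sigma)            # sup over the finite class
    return 2.0 / m * total / n_trials

# Toy example: 50 random {-1,+1}-valued hypotheses on a sample of size 200.
H_preds = np.random.default_rng(1).choice([-1.0, 1.0], size=(50, 200))
print(empirical_rademacher(H_preds))
```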
Rademacher generalization bounds (0/1 loss) [Bartlett & Mendelson '01; Koltchinskii & Panchenko '00]: for all $h \in H$ and $S \in (X \times Y)^m$,
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h, S) + \frac{\hat{R}_S(H)}{2} + \sqrt{\frac{\log(1/\delta)}{2m}}.$$

CRITICAL assumption: the sample must be independently and identically distributed (i.i.d.).
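For a feel of the magnitudes (the numbers here are hypothetical, not from the talk), the two slack terms of this i.i.d. bound are easy to evaluate:

```python
import math

def rademacher_bound_slack(rad_hat, m, delta):
    """Slack added to the empirical error in the i.i.d. Rademacher bound:
    R_hat_S(H)/2 + sqrt(log(1/delta) / (2m))."""
    return rad_hat / 2.0 + math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# e.g. with R_hat_S(H) = 0.1, m = 10,000 points and confidence 1 - delta = 0.99:
print(rademacher_bound_slack(0.1, 10_000, 0.01))   # about 0.065
```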
Motivation

In practice, data often contains dependencies. Here we consider temporal dependencies.

We assume the distribution to be mixing, which implies a dependence that weakens over time.

This is natural in the context of time-series analysis (e.g., stock market quotes).
We MUST deal with non-i.i.d. data!

In practice, one often simply uses an algorithm designed for the i.i.d. case, and many times such algorithms still perform well. How can we justify or guarantee this performance?

Give generalization bounds that do NOT assume i.i.d. data!

Here we present general proof techniques useful for mixing processes, and we extend useful properties of Rademacher complexity to this non-i.i.d. setting.
Definitions

We make the standard assumption of stationarity.

[Stationarity] A sequence of random variables $Z = \{Z_t\}_{t=-\infty}^{\infty}$ is said to be stationary if for any $t$ and non-negative integers $m$ and $k$, the random vectors $(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$ have the same distribution.

In particular, $\Pr(X_t = B \mid X_{t-s} = A) = \Pr(X_{t+k} = B \mid X_{t-s+k} = A)$: only the relative distance between time points matters, so this in general differs from $\Pr(X_{t+k} = B \mid X_{t-s+l} = A)$ when $l \neq s$.
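As a concrete example (the AR(1) model is an illustration of mine, not from the slides), a stationary autoregressive process satisfies this definition: the joint law of any window depends only on the lags between its time points, not on where the window sits.

```python
import numpy as np

def ar1_paths(n_paths, n_steps, phi=0.8, sigma=1.0, seed=0):
    """Simulate independent stationary AR(1) paths X_t = phi * X_{t-1} + eps_t,
    each started from the stationary marginal N(0, sigma^2 / (1 - phi^2))."""
    rng = np.random.default_rng(seed)
    X = np.empty((n_paths, n_steps))
    X[:, 0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), size=n_paths)
    eps = rng.normal(0.0, sigma, size=(n_paths, n_steps))
    for t in range(1, n_steps):
        X[:, t] = phi * X[:, t - 1] + eps[:, t]
    return X

X = ar1_paths(20_000, 200)
# Stationarity: the joint law of (X_t, X_{t+5}) does not depend on t, so both
# sample covariances below approach phi**5 * sigma**2 / (1 - phi**2) ~ 0.91.
print(np.cov(X[:, 50], X[:, 55])[0, 1], np.cov(X[:, 150], X[:, 155])[0, 1])
```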
We quantify dependence with the natural β-mixing coefficient.

[β-mixing] Let $Z = \{Z_t\}_{t=-\infty}^{\infty}$ be a stationary sequence of random variables. Let $\sigma_i^j$ denote the σ-algebra generated by the random variables $Z_k$, $i \le k \le j$. The β-mixing coefficient of the stochastic process is defined as
$$\beta(k) = \sup_n\, \mathbb{E}_{B \in \sigma_{-\infty}^{n}}\Bigl[\sup_{A \in \sigma_{n+k}^{\infty}} \bigl|\Pr[A \mid B] - \Pr[A]\bigr|\Bigr].$$

Mixing implies that $\Pr[B \mid A] \approx \Pr[B]$ when the events $A$ and $B$ are generated far apart in time, while $\Pr[B \mid A]$ may differ substantially from $\Pr[B]$ when they are close.
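To make the coefficient concrete (this identity is a standard fact about Markov chains, stated here as an illustration rather than taken from the slides): for a stationary finite-state Markov chain, β(k) equals the expected total-variation distance between the k-step transition law and the stationary distribution, and it decays geometrically for an ergodic chain.

```python
import numpy as np

def beta_markov(P, k):
    """beta(k) for a stationary finite-state Markov chain with transition matrix P:
    the stationary-weighted total-variation distance between the k-step law
    P^k(x, .) and the stationary distribution pi."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()                               # stationary distribution
    Pk = np.linalg.matrix_power(P, k)                # k-step transition probabilities
    tv = 0.5 * np.abs(Pk - pi).sum(axis=1)           # TV distance from each start state
    return float(pi @ tv)

# Two-state chain: the dependence on the past decays like |second eigenvalue|^k = 0.7^k.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print([round(beta_markov(P, k), 4) for k in (1, 2, 5, 10)])
```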
Proof Strategy

Reduce the dependent scenario to the independent case.

If we introduce gaps in the sequence, can we treat the blocks as independent? Split the sample $S$ (of size $m = 2a\mu$) into $2\mu$ consecutive blocks of size $a$; let $S_0$ collect the odd-numbered blocks and $S_1$ the even-numbered blocks, so that within $S_0$ consecutive blocks are separated by gaps of length $a$.

The β-mixing assumption allows us to exactly bound this approximation.

Lemma [Yu '94]: Let $S_0$ be defined as above and let $h$ be a function of $S_0$ bounded by $M$. Then
$$\bigl|\mathbb{E}_{S_0}[h] - \mathbb{E}_{\tilde{S}_0}[h]\bigr| \le (\mu - 1)\, M\, \beta(a),$$
where $\mathbb{E}_{S_0}$ (resp. $\mathbb{E}_{\tilde{S}_0}$) denotes the expectation with respect to the dependent (resp. independent) block sequence.
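A minimal sketch of the block construction (array shapes and names are mine): cut the sequence into 2µ consecutive blocks of size a and collect the alternating blocks into S_0 and S_1.

```python
import numpy as np

def block_split(sequence, a):
    """Split a sequence of length m = 2 * a * mu into alternating blocks of size a.

    Returns (S0, S1), each of shape (mu, a). Within S0, consecutive blocks are
    separated by a gap of length a, which is what lets a beta-mixing process be
    approximated by independent blocks."""
    z = np.asarray(sequence)
    mu = len(z) // (2 * a)
    blocks = z[: 2 * a * mu].reshape(2 * mu, a)
    return blocks[0::2], blocks[1::2]

# Example: m = 24 and block size a = 3, so mu = 4 blocks per half.
S0, S1 = block_split(np.arange(24), a=3)
print(S0)   # rows: [0 1 2], [6 7 8], [12 13 14], [18 19 20]
```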
In the i.i.d. case, apply McDiarmid's inequality to
$$\Phi(S) = \sup_{h \in H}\bigl(R(h) - \hat{R}_S(h)\bigr), \qquad \hat{R}_S(h) = \frac{1}{m}\sum_{i=1}^{m} h(z_i), \qquad R(h) = \mathbb{E}_S[\hat{R}_S(h)].$$

We apply it to the i.i.d. blocks, extending $H$ in a natural way: define $h_a(B) = \frac{1}{a}\sum_{i=1}^{a} h(z_i)$ for any block $B = (z_1, \ldots, z_a) \in Z^a$, and define $H_a$ as the set of all block-based hypotheses $h_a$ generated from $h \in H$.
Preparation Step

Re-write in terms of blocks:
$$\begin{aligned}
\Pr_S[\Phi(S) > \epsilon] &= \Pr_S\Bigl[\sup_{h}\bigl(R(h) - \hat{R}_S(h)\bigr) > \epsilon\Bigr] \\
&= \Pr_S\Bigl[\sup_{h}\Bigl(\tfrac{R(h) - \hat{R}_{S_0}(h)}{2} + \tfrac{R(h) - \hat{R}_{S_1}(h)}{2}\Bigr) > \epsilon\Bigr] &&\text{(def. of } \hat{R}_S(h)\text{)} \\
&\le \Pr_S\bigl[\Phi(S_0) + \Phi(S_1) > 2\epsilon\bigr] &&\text{(def. of } \Phi\text{)} \\
&\le \Pr_{S_0}[\Phi(S_0) > \epsilon] + \Pr_{S_1}[\Phi(S_1) > \epsilon] &&\text{(union bound)} \\
&= 2\Pr_{S_0}[\Phi(S_0) > \epsilon] &&\text{(stationarity)} \\
&= 2\Pr_{S_0}\bigl[\Phi(S_0) - \mathbb{E}_{\tilde{S}_0}[\Phi(\tilde{S}_0)] > \epsilon'\bigr], &&\text{(def. of } \epsilon'\text{)}
\end{aligned}$$
where $\epsilon' = \epsilon - \mathbb{E}_{\tilde{S}_0}[\Phi(\tilde{S}_0)]$.

Obtain independent blocks using Yu's lemma:
$$2\Pr_{S_0}\bigl[\Phi(S_0) - \mathbb{E}_{\tilde{S}_0}[\Phi(\tilde{S}_0)] > \epsilon'\bigr] \le 2\Pr_{\tilde{S}_0}\bigl[\Phi(\tilde{S}_0) - \mathbb{E}_{\tilde{S}_0}[\Phi(\tilde{S}_0)] > \epsilon'\bigr] + 2(\mu - 1)\beta(a).$$
Concentration Bound

Now, apply McDiarmid's inequality to the i.i.d. blocks.

Changing a block $\tilde{Z}_k$ of the sample $\tilde{S}_0$ can change $\Phi(\tilde{S}_0)$ by at most $\frac{1}{\mu}\,h_a(\tilde{Z}_k) \le M/\mu$, and by McDiarmid's inequality the following holds for any $\epsilon' > 2(\mu - 1) M \beta(a)$:
$$\Pr_{\tilde{S}_0}\bigl[\Phi(\tilde{S}_0) - \mathbb{E}_{\tilde{S}_0}[\Phi(\tilde{S}_0)] > \epsilon'\bigr] \le \exp\!\left(\frac{-2\epsilon'^2}{\sum_{i=1}^{\mu}(M/\mu)^2}\right) = \exp\!\left(\frac{-2\mu\,\epsilon'^2}{M^2}\right).$$

So far, we have:
$$\Pr_S[\Phi(S) > \epsilon] \le 2\exp\!\left(\frac{-2\mu\,\epsilon'^2}{M^2}\right) + 2(\mu - 1)\beta(a).$$
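The trade-off in a and µ can be seen numerically (all parameter values below are hypothetical): a larger block size a shrinks the mixing penalty 2(µ−1)β(a) but leaves fewer blocks µ for the exponential concentration term. A sketch assuming algebraic mixing β(a) = β₀ a^(−r):

```python
import math

def deviation_bound(m, a, M, eps_prime, beta0, r):
    """Right-hand side of the bound so far, assuming beta(a) = beta0 * a**(-r)."""
    mu = m // (2 * a)                                   # number of blocks in S0
    concentration = 2.0 * math.exp(-2.0 * mu * eps_prime**2 / M**2)
    mixing_penalty = 2.0 * (mu - 1) * beta0 * a ** (-r)
    return concentration + mixing_penalty

m, M, eps_prime, beta0, r = 100_000, 1.0, 0.1, 0.5, 2.0
for a in (25, 125, 250, 625):
    print(a, round(deviation_bound(m, a, M, eps_prime, beta0, r), 4))
# The total is smallest at an intermediate block size (around a = 125 here).
```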
Bounding the Expectation

To make the bound useful, we must bound the expectation (over the i.i.d. blocks):
$$\begin{aligned}
\mathbb{E}_{\tilde{S}_0}[\Phi(\tilde{S}_0)] &\le \mathbb{E}_{\tilde{S}_0, \tilde{S}'_0}\Bigl[\sup_{h \in H}\ \hat{R}_{\tilde{S}'_0}(h) - \hat{R}_{\tilde{S}_0}(h)\Bigr] \\
&= \mathbb{E}_{\tilde{S}_0, \tilde{S}'_0}\Bigl[\sup_{h_a \in H_a}\ \frac{1}{\mu}\sum_{i=1}^{\mu}\bigl(h_a(\tilde{Z}'_i) - h_a(\tilde{Z}_i)\bigr)\Bigr] &&\text{(def. of } \hat{R}\text{)} \\
&= \mathbb{E}_{\tilde{S}_0, \tilde{S}'_0, \sigma}\Bigl[\sup_{h_a \in H_a}\ \frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i\bigl(h_a(\tilde{Z}'_i) - h_a(\tilde{Z}_i)\bigr)\Bigr] &&\text{(Rademacher r.v.'s)} \\
&\le \mathbb{E}_{\tilde{S}'_0, \sigma}\Bigl[\sup_{h_a \in H_a}\ \frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i\, h_a(\tilde{Z}'_i)\Bigr] + \mathbb{E}_{\tilde{S}_0, \sigma}\Bigl[\sup_{h_a \in H_a}\ \frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i\, h_a(\tilde{Z}_i)\Bigr] &&\text{(prop. of sup)} \\
&= 2\,\mathbb{E}_{\tilde{S}_0, \sigma}\Bigl[\sup_{h_a \in H_a}\ \frac{1}{\mu}\sum_{i=1}^{\mu}\sigma_i\, h_a(\tilde{Z}_i)\Bigr].
\end{aligned}$$
We would like the complexity of $H$, not $H_a$:
$$\mathbb{E}_{\tilde{S}_0, \sigma}\Bigl[\sup_{h_a \in H_a}\ \frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i\, h_a(\tilde{Z}_i)\Bigr] = \mathbb{E}_{\tilde{S}_0, \sigma}\Bigl[\sup_{h \in H}\ \frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i\,\frac{1}{a}\sum_{j=1}^{a} h\bigl(z_j^{(i)}\bigr)\Bigr].$$

This almost looks like a Rademacher complexity, except that the $\sigma_i$'s are shared within each block.

$$\begin{aligned}
\mathbb{E}_{\tilde{S}_0}[\Phi(\tilde{S}_0)] &\le \mathbb{E}_{\tilde{S}_0, \sigma}\Bigl[\sup_{h \in H}\ \frac{1}{a}\sum_{j=1}^{a}\frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i\, h\bigl(z_j^{(i)}\bigr)\Bigr] &&\text{(reversing order of sums)} \\
&\le \frac{1}{a}\sum_{j=1}^{a}\mathbb{E}_{\tilde{S}_0, \sigma}\Bigl[\sup_{h \in H}\ \frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i\, h\bigl(z_j^{(i)}\bigr)\Bigr] &&\text{(convexity of sup)} \\
&= \frac{1}{a}\sum_{j=1}^{a}\mathbb{E}_{\tilde{S}_0^{j}, \sigma}\Bigl[\sup_{h \in H}\ \frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i\, h\bigl(z_j^{(i)}\bigr)\Bigr] &&\text{(marginalization)} \\
&= \mathbb{E}_{\tilde{S}_\mu, \sigma}\Bigl[\sup_{h \in H}\ \frac{2}{\mu}\sum_{i=1}^{\mu}\sigma_i\, h(\tilde{z}_i)\Bigr] = R_\mu^{D}(H),
\end{aligned}$$
where $\tilde{S}_\mu = (\tilde{z}_1, \ldots, \tilde{z}_\mu)$ is an i.i.d. sample of $\mu$ points and $D$ denotes the i.i.d. distribution.
Bound So Far

We have a bound in terms of the (true) Rademacher complexity. With probability at least $1 - \delta$, where $\delta' = \delta - 2(\mu - 1)\beta(a)$, the following inequality holds for all hypotheses $h \in H$:
$$R(h) \le \hat{R}_S(h) + R_\mu^{D}(H) + M\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}}.$$

We would like bounds in terms of the empirical Rademacher complexity, which has many benefits:
it can be measured from data (tighter bounds);
it can be related to other complexity measures (e.g., VC-dimension);
it can be bounded for specific hypothesis classes.
Using similar techniques, we can show that the empirical $\hat{R}_{S_\mu}(H)$ is close to $R_\mu^{D}(H)$:

With probability at least $1 - \delta$, where $\delta' = \delta/2 - 2(\mu - 1)\beta(a)$, the following inequality holds for all hypotheses $h \in H$:
$$R(h) \le \hat{R}_S(h) + \hat{R}_{S_\mu}(H) + 3M\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}}.$$

Kernel hypothesis bounds [Bartlett & Mendelson '01]: in the case of classification with hypotheses based on a kernel $K$ and a weight vector bounded by $B$, $\|w\| \le B$, the empirical Rademacher complexity can be bounded as follows:
$$\hat{R}_{S_\mu}(H) \le \frac{2B}{\mu}\sqrt{\mathrm{Tr}[K]}.$$
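A small sketch of the kernel bound above (data and kernel choice are mine; B = 1 matches the unit-norm constraint used on the next slide):

```python
import numpy as np

def kernel_rademacher_bound(X, B=1.0, gamma=0.5):
    """Bound (2B/mu) * sqrt(Tr[K]) on the empirical Rademacher complexity of
    kernel hypotheses with ||w|| <= B, here with an RBF kernel Gram matrix."""
    mu = X.shape[0]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq_dists)                    # Gram matrix on the mu points
    return 2.0 * B / mu * np.sqrt(np.trace(K))

X = np.random.default_rng(0).normal(size=(200, 5))   # mu = 200 points in R^5
print(kernel_rademacher_bound(X))                    # equals 2B/sqrt(mu) since K(x, x) = 1
```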
Classification Bound

Let $H$ be the set of hypotheses
$$\Bigl\{(x, y) \in Z \mapsto y\sum_{i=1}^{m}\alpha_i K(x_i, x)\ :\ \sum_{i,j=1}^{m}\alpha_i\alpha_j K(x_i, x_j) \le 1\Bigr\}.$$

Let $\hat{R}_S^{\rho}(h)$ denote the average amount by which $y_i h(x_i)$ deviates from the margin $\rho$:
$$\hat{R}_S^{\rho}(h) = \frac{1}{m}\sum_{i=1}^{m}\bigl(\rho - y_i h(x_i)\bigr)_+.$$

(Figure: plot of the margin loss $(\rho - y h(x))_+$ against $y h(x)$, with marks at $\rho$ and 1.)
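A one-line sketch (variable names hypothetical) of the empirical margin loss defined above:

```python
import numpy as np

def empirical_margin_loss(scores, labels, rho):
    """R_hat^rho_S(h) = (1/m) * sum_i (rho - y_i h(x_i))_+ with labels in {-1, +1}."""
    margins = labels * scores                           # y_i h(x_i)
    return np.mean(np.maximum(0.0, rho - margins))      # hinge at the margin rho

scores = np.array([0.8, -0.3, 1.5, 0.1])
labels = np.array([1, -1, 1, -1])
print(empirical_margin_loss(scores, labels, rho=1.0))   # (0.2 + 0.7 + 0.0 + 1.1) / 4 = 0.5
```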
With probability at least $1 - \delta$, the following inequality holds for all hypotheses $h \in H$:
$$\Pr[y h(x) \le 0] \le \frac{1}{\rho}\hat{R}_S^{\rho}(h) + \frac{4}{\mu\rho}\sqrt{\mathrm{Tr}[K]} + 3\sqrt{\frac{\log\frac{2}{\delta'}}{2\mu}},$$
where $\delta' = \delta/2 - 2(\mu - 1)\beta(a)$ and $K$ is the Gram matrix of the kernel $K$ for the sample $S$.

Now, we need to choose the parameters $a$ and $\mu$ appropriately. If we assume algebraic mixing, $\beta(a) := \beta_0 a^{-r}$, one suitable choice is $\mu = \frac{1}{2} m^{\frac{2r+1}{2r+4}}$.
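Putting the pieces together (all parameter values are hypothetical, and Tr[K] ≤ m assumes a normalized kernel with K(x, x) ≤ 1), the complexity and confidence terms of the classification bound can be evaluated for given block parameters:

```python
import math

def classification_bound_slack(m, a, rho, delta, beta0, r, trace_K=None):
    """Complexity + confidence terms of the non-i.i.d. margin bound:
    (4/(mu*rho)) * sqrt(Tr[K]) + 3 * sqrt(log(2/delta') / (2*mu)),
    with delta' = delta/2 - 2*(mu - 1)*beta(a) and beta(a) = beta0 * a**(-r)."""
    mu = m // (2 * a)
    if trace_K is None:
        trace_K = m                    # assumes K(x, x) <= 1
    delta_prime = delta / 2.0 - 2.0 * (mu - 1) * beta0 * a ** (-r)
    if delta_prime <= 0:
        return float("inf")            # gap too small: the mixing penalty exceeds delta/2
    return (4.0 / (mu * rho)) * math.sqrt(trace_K) + 3.0 * math.sqrt(
        math.log(2.0 / delta_prime) / (2.0 * mu)
    )

print(classification_bound_slack(m=1_000_000, a=25, rho=1.0, delta=0.05, beta0=0.1, r=4.0))
# about 0.23 for this (fast-mixing) configuration
```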
A Complete Bound

Assuming that the sample is drawn from a stationary, algebraically β-mixing distribution, $\beta(a) = \beta_0 a^{-r}$, the following bound holds:
$$\Pr[y h(x) \le 0] \le \frac{1}{\rho}\hat{R}_S^{\rho}(h) + \frac{8}{\rho}\, m^{\gamma_1} + 3\, m^{\gamma_2}\sqrt{\log\frac{2}{\delta'}},$$
where $\gamma_1$ and $\gamma_2$ are negative exponents determined by $r$, and $\delta' = \delta/2 - \beta_0 m^{\gamma_1}$.

As $r \to \infty$ and $\beta_0 \to 0$ (i.e., as the i.i.d. scenario is approached), this bound has the same asymptotic behavior as the i.i.d. bound.
Summary

We have given the first data-dependent bounds for a non-i.i.d. scenario.

These are the first known margin-based classification bounds in this setting (and they can be extended to regression as well).

The bounds can easily be extended to other complexity measures via Rademacher complexity.

Future work:
Can we make use of the entire sample to compute the empirical Rademacher complexity?
Can we strictly generalize the i.i.d. bound?