Plug-in Approach to Active Learning


Stanislav Minsker (Georgia Tech)

Prediction framework

Let (X, Y) be a random couple in R^d × {−1, +1}: X is the observed quantity and the label Y is to be predicted. Prediction is carried out via a binary classifier g : S → {−1, +1}.
The joint distribution of (X, Y) will be denoted by P and the distribution of X by Π.
η(x) := E(Y | X = x) is the regression function.
The generalization error of a binary classifier g is R(g) = Pr(Y ≠ g(X)).
The Bayes classifier g*(x) := sign η(x) has the smallest possible generalization error.
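
The following short numerical illustration is added here for concreteness (it is not from the talk, and the toy η below is made up): when η is known, thresholding it at zero, i.e. using sign η(x), gives the smallest misclassification error.

    import numpy as np

    rng = np.random.default_rng(0)

    def eta(x):
        # toy regression function eta(x) = E(Y | X = x), values in [-0.8, 0.8]
        return 0.8 * np.sin(4 * np.pi * x)

    n = 50_000
    X = rng.uniform(0, 1, n)
    Y = np.where(rng.uniform(size=n) < (1 + eta(X)) / 2, 1, -1)

    bayes = np.sign(eta(X))                  # Bayes classifier g*(x) = sign eta(x)
    print("risk of Bayes classifier:", np.mean(Y != bayes))
    print("risk of constant +1 rule:", np.mean(Y != 1))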

Difference between Active and Passive Learning

In the passive learning framework we are given an i.i.d. sample (X_i, Y_i), i = 1, …, n, as input.
In practice, the cost of the training data is associated with labeling the observations, while the pool of observations itself is almost unlimited.
Active learning algorithms try to take advantage of this modified framework. The goal is to construct a classifier with good generalization properties using a small number of labels.

Difference between Active and Passive Learning

In the active framework, observations are sampled sequentially:
X_k is sampled from a modified distribution ˆΠ_k that depends on (X_1, Y_1), …, (X_{k−1}, Y_{k−1});
Y_k is sampled from the conditional distribution P_{Y|X}(· | X = x), and labels are conditionally independent given the feature vectors X_i, i ≤ n;
ˆΠ_k is supported on a set where classification is difficult.
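
A schematic of this sequential protocol, added for illustration: the label oracle and the grid-based sampler below are the editor's stand-ins, not part of the talk. The point is that a label is requested only after X has been drawn from a restricted region.

    import numpy as np

    rng = np.random.default_rng(1)
    eta = lambda x: 0.8 * np.sin(4 * np.pi * x)

    def label_oracle(x):
        """Draw Y from the conditional distribution P(Y = 1 | X = x) = (1 + eta(x)) / 2."""
        return 1 if rng.uniform() < (1 + eta(x)) / 2 else -1

    def sample_from_region(mask, grid):
        """Sample X from Pi restricted to a region, here by choosing among grid points."""
        return rng.choice(grid[mask])

    grid = np.linspace(0, 1, 1001)
    hard = np.abs(eta(grid)) < 0.2           # stand-in for a region where classification is hard
    x_k = sample_from_region(hard, grid)
    y_k = label_oracle(x_k)                  # only now is a label requested
    print(x_k, y_k)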

Previous work

1. Dasgupta, S., Hsu, D. and Monteleoni, C. A general agnostic active learning algorithm.
2. Castro, R. M. and Nowak, R. D. (2008). Minimax bounds for active learning.
3. Balcan, M.-F., Beygelzimer, A. and Langford, J. (2009). Agnostic active learning.
4. Hanneke, S. (2010). Rates of Convergence in Active Learning.
5. Koltchinskii, V. (2010). Rademacher Complexities and Bounding the Excess Risk in Active Learning.

Previous work

It was discovered that in some cases active learners can significantly outperform passive algorithms. In particular, this is the case when Tsybakov's low noise assumption is satisfied: there exist constants B, γ > 0 such that for all t > 0,
Π(x : |η(x)| ≤ t) ≤ B t^γ.

Castro and Nowak: if the decision boundary {x ∈ R^d : η(x) = 0} is the graph of a Hölder smooth function g ∈ Σ(β, K, [0, 1]^{d−1}) and the noise assumption is satisfied with γ > 0, then
E R(ĝ_N) − R* ≲ N^{−β(1+γ)/(2β+γ(d−1))},
where N is the label budget. However, the construction of the classifier that achieves this bound assumes β and γ to be known.

Motivation and Assumptions

A good learning algorithm is expected to be computationally efficient and adaptive with respect to the underlying structure of the problem (smoothness and noise level). Existing algorithms do not possess these two properties simultaneously.

Motivation and Assumptions

Goals:
1. Development of active learning algorithms that are adaptive and computationally tractable.
2. Obtaining minimax lower bounds for the excess risk of active learning for a broad class of underlying distributions.

Main tool: construct a nonparametric estimator ˆη_n(·) of the regression function and use the plug-in classifier ĝ(·) = sign ˆη_n(·).

Motivation and Assumptions

Assumptions:
1. X is supported in [0, 1]^d and the marginal Π is regular (for example, dΠ(x) = p(x) dx with 0 < µ_1 ≤ p(x) ≤ µ_2 < ∞);
2. η(·) ∈ Σ(β, L, [0, 1]^d);
3. for all t > 0, Π(x : |η(x)| ≤ t) ≤ B t^γ.

In passive learning, under similar assumptions, the plug-in approach leads to classifiers that attain optimal rates of convergence:
C_1 N^{−β(1+γ)/(2β+d)} ≤ sup_{P ∈ P(β,γ)} [E R(ĝ_N) − R*] ≤ C_2 N^{−β(1+γ)/(2β+d)}
(J.-Y. Audibert, A. B. Tsybakov. Fast learning rates for plug-in classifiers).

Lower bound

Theorem. Let β, γ, d be such that βγ ≤ d. There exists C > 0 such that for all N large enough and for any active classifier ĝ_N(x) we have
sup_{P ∈ P(β,γ)} [E R_P(ĝ_N) − R*] ≥ C N^{−β(1+γ)/(2β+d−βγ)}.

Here, N is the number of labels used to construct ĝ_N rather than the size of the training data set. The proof combines techniques developed by Tsybakov, Audibert and Castro, Nowak.

Example: β = 1, d = 2, γ = 1. Passive rate: N^{−1/2}. Active rate: N^{−2/3}.
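
To see where the exponents in the example come from (a check added here, not on the slide), plug β = 1, d = 2, γ = 1 into the two rates:

    Passive:  β(1+γ)/(2β+d)    = 2/(2+2)   = 1/2,  giving the rate N^{−1/2};
    Active:   β(1+γ)/(2β+d−βγ) = 2/(2+2−1) = 2/3,  giving the rate N^{−2/3}.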

Active learning algorithm

Suppose that assumptions (1), (2), (3) are satisfied and that Π, the marginal distribution of X, is known. Input: label budget N.
1. Use a small fraction of the data to construct an estimator that is close to η(x) in sup-norm; in our case, this is a piecewise polynomial estimator.
2. Construct a confidence band based on the obtained estimator.
3. Active set: the subset of [0, 1]^d where the confidence band crosses the zero level. The size of the active set is controlled by the low noise assumption.
4. Outside of the active set: correct classification with high confidence!
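
A minimal sketch of how an active set can be read off a confidence band (the editor's illustration; the band width δ below is hand-picked): keep the cells where the band [ˆη − δ, ˆη + δ] contains zero.

    import numpy as np

    def active_set(eta_hat, delta):
        """Cells where the band [eta_hat - delta, eta_hat + delta] contains zero."""
        return (eta_hat - delta <= 0) & (eta_hat + delta >= 0)

    # example: a piecewise-constant estimate on 16 dyadic cells of [0, 1]
    centers = (np.arange(16) + 0.5) / 16
    eta_hat = 0.8 * np.sin(4 * np.pi * centers)
    print("active cells:", np.flatnonzero(active_set(eta_hat, delta=0.3)))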

Next iteration:
Sample observations from ˆΠ_1(dx) := Π(dx | X ∈ Active Set);
construct a tighter confidence band;
repeat until the label budget is exhausted;
output ĝ = sign ˆη_fin.

Active learning algorithm, r = 0

input : label threshold N; confidence α; minimal regularity 0 < ν < 1
output: ĝ := sign ˆη

ω := d/(2ν);
k := 0;
N_0 := …;
LB := N − 2N_0;
for i = 1 to 2N_0 do
    sample i.i.d. (X_i^(0), Y_i^(0)) with X_i^(0) ~ Π;
S_{0,1} := {(X_i^(0), Y_i^(0)), i ≤ N_0},  S_{0,2} := {(X_i^(0), Y_i^(0)), N_0 < i ≤ 2N_0};
ˆm_0 := ˆm(s, N_0; S_{0,1})   /* see equation (2) */;
ˆη_0 := ˆη_{ˆm_0, [0,1]^d; S_{0,2}}   /* see equation (1) */;
while LB > 0 do
    k := k + 1;
    Â_k := {x ∈ [0,1]^d : ∃ f_1, f_2 ∈ ˆF_{k−1}, sign(f_1(x)) ≠ sign(f_2(x))};
    if Â_k ∩ supp(Π) = ∅ then
        break
    else
        ˆm_k := …;
        τ_k := ˆm_k − ˆm_{k−1};
        N_k := …   /* depends on N_{k−1} and τ_k */;
        for i = 1 to N_k Π(Â_k) do
            sample i.i.d. (X_i^(k), Y_i^(k)) with X_i^(k) ~ ˆΠ_k := Π(dx | x ∈ Â_k);
        S_k := {(X_i^(k), Y_i^(k)), i ≤ N_k Π(Â_k)};
        ˆη_k := ˆη_{ˆm_k, Â_k}   /* estimator based on S_k */;
        δ_k := D (log(N/α))^ω N_k^{−β/(2β+d)}   /* size of the confidence band */;
        ˆF_k := {f ∈ F^0_{ˆm_k} : f|_{Â_k} ∈ F_{Â_k}(ˆη_k; δ_k),  f|_{[0,1]^d \ Â_k} = ˆη_{k−1}|_{[0,1]^d \ Â_k}};
        LB := LB − N_k Π(Â_k);
        ˆη := ˆη_k   /* keeping track of the most recent estimator */;
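
To make the control flow concrete, here is a deliberately simplified one-dimensional sketch of the same kind of loop (piecewise-constant estimates, a hand-picked geometrically shrinking band width, and none of the budget or resolution bookkeeping above); it is an editor's illustration, not the algorithm as specified.

    import numpy as np

    rng = np.random.default_rng(2)
    eta = lambda x: 0.8 * np.sin(4 * np.pi * x)           # toy regression function
    label = lambda x: np.where(rng.uniform(size=np.size(x)) < (1 + eta(x)) / 2, 1, -1)

    n_bins, per_round, rounds = 32, 400, 4
    edges = np.linspace(0, 1, n_bins + 1)
    active = np.ones(n_bins, dtype=bool)                  # start with the whole interval [0, 1]
    eta_hat = np.zeros(n_bins)

    for k in range(rounds):
        # sample only inside the current active set (Pi restricted to the active cells)
        cells = rng.choice(np.flatnonzero(active), size=per_round)
        X = edges[cells] + rng.uniform(0, 1 / n_bins, size=per_round)
        Y = label(X)
        # piecewise-constant estimate of eta on the active cells (r = 0)
        for c in np.flatnonzero(active):
            in_c = cells == c
            if in_c.any():
                eta_hat[c] = Y[in_c].mean()
        # shrink the band each round and keep only cells where it still crosses zero
        delta = 0.6 / 2 ** k
        active = active & (np.abs(eta_hat) <= delta)
        if not active.any():
            break

    g_hat = np.sign(eta_hat)                              # final plug-in classifier on the grid
    print("active cells left:", np.flatnonzero(active))

Each round spends labels only inside the current active set and tightens the band, which is the mechanism the rates below rely on.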

Rates of convergence

Let ĝ be the classifier output by the algorithm.

Theorem. Assume P ∈ P(β, γ). Then, with probability at least 1 − α, the following bound holds uniformly over all 0 < ν ≤ β ≤ r + 1 and γ > 0:
R_P(ĝ) − R* ≤ Const · N^{−β(1+γ)/(2β+d−(β∧1)γ)} log^p(N/α),
where p = p(β, γ).

Remarks:
1. Fast rates (i.e., faster than N^{−1/2}) are attained when βγ > d/2 in passive learning, and when βγ > d/3 (if β ≤ 1) or βγ > d/(2 + 1/β) (if β > 1) in active learning.
2. The upper bound matches the lower bound, up to log-factors, when β ≤ 1.
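
As a sanity check on Remark 1 (added here, not on the slide), the thresholds follow by comparing each rate with N^{−1/2}:

    Passive:         β(1+γ)/(2β+d) > 1/2     ⇔  2βγ > d       ⇔  βγ > d/2;
    Active (β ≤ 1):  β(1+γ)/(2β+d−βγ) > 1/2  ⇔  3βγ > d       ⇔  βγ > d/3;
    Active (β > 1):  β(1+γ)/(2β+d−γ) > 1/2   ⇔  γ(2β+1) > d   ⇔  βγ > d/(2 + 1/β).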

Comments

Properties of the algorithm:

Positive:
1. Adaptation to unknown smoothness and noise level.
2. Algorithm is computationally tractable.

Negative:
1. Requires the minimal regularity ν as an input.
2. Currently, we are unaware of any effective way to certify that the resulting classifier has small excess risk.

Comments

For β > 1 there is a gap between the upper and lower bounds. Question: is it a flaw of the proof techniques, or is it possible to improve the upper bound under some extra assumptions?

What if the design distribution Π is discrete? Example: active learning on graphs.

Simulation

Model: Y_i = sign[f(X_i) + ε_i], ε_i ∼ N(0, σ²), i = 1, 2, …,
with f(x) = x(1 + sin 5x) sin(4πx) and σ² = 0.2.
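
A sketch of how data from this model can be generated (added for illustration; the design distribution is assumed uniform on [0, 1], which the slide does not state, and f is taken as transcribed):

    import numpy as np

    rng = np.random.default_rng(3)

    def f(x):
        # the function from the simulation slide, as transcribed
        return x * (1 + np.sin(5 * x)) * np.sin(4 * np.pi * x)

    n, sigma2 = 1000, 0.2
    X = rng.uniform(0, 1, n)                       # design distribution assumed uniform on [0, 1]
    Y = np.sign(f(X) + rng.normal(0.0, np.sqrt(sigma2), n))
    print("fraction of +1 labels:", np.mean(Y == 1))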

Proof details

Define the nested families of piecewise-polynomial functions
F^r_m := { f = Σ_{i=1}^{2^{dm}} q_i(x_1, …, x_d) I_{R_i} },
where {R_i}_{i=1}^{2^{dm}} forms the dyadic partition of the unit cube and the q_i(x_1, …, x_d) are polynomials of degree at most r in d variables.

The estimator is
ˆη_m(x) := Σ_{i=1}^{d_m} ˆα_i φ_i(x),   where   ˆα_i := (1/N) Σ_{j=1}^{N} Y_j φ_i(X_j).   (1)

This choice provides a simple structure for the active set and enjoys good concentration around the mean.
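
For the case r = 0 and d = 1 this estimator reduces to averaging labels over dyadic cells; the sketch below (an illustration, not the paper's implementation, and it skips the basis-function normalization) shows that special case.

    import numpy as np

    def eta_hat_piecewise_constant(X, Y, m):
        """Piecewise-constant (r = 0) fit on the dyadic partition of [0, 1] into 2**m cells:
        the estimate on each cell is the average of the labels falling into it."""
        n_cells = 2 ** m
        cells = np.minimum((X * n_cells).astype(int), n_cells - 1)
        est = np.zeros(n_cells)
        for c in range(n_cells):
            in_c = cells == c
            if in_c.any():
                est[c] = Y[in_c].mean()
        return est

    rng = np.random.default_rng(4)
    X = rng.uniform(0, 1, 2000)
    Y = np.where(rng.uniform(size=2000) < (1 + 0.8 * np.sin(4 * np.pi * X)) / 2, 1, -1)
    est = eta_hat_piecewise_constant(X, Y, m=4)
    x_new = np.array([0.1, 0.5, 0.9])
    print(est[np.minimum((x_new * 16).astype(int), 15)])  # value of the step function at new points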

Main steps of the proof

Estimate the size of the active set: it is controlled by the width of the confidence band and the low noise assumption.

Estimate the risk of the plug-in classifier on the active set, using the comparison inequality
R_P(f) − R* ≤ Const · ‖(f − η) I{sign f ≠ sign η}‖_∞^{1+γ}.

How to choose the optimal resolution level?

Lepski-type method:
ˆm := ˆm(t, N) = min{ m : for all l > m, ‖ˆη_l − ˆη_m‖_∞ ≤ K_1 t √(2^{dl} l / N) }.   (2)

Compare to the optimal resolution level
m̄ := min{ m : ‖η − η_m‖_∞ ≤ K_2 √(2^{dm} m / N) }.   (3)

Here, η_m is the L_2(Π) projection of η onto F^r_m.
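
An illustrative implementation of this selection rule for d = 1, r = 0 (the constants K_1, t and the evaluation grid are arbitrary choices of the editor, not values from the talk):

    import numpy as np

    def fit_dyadic(X, Y, m):
        """d = 1 piecewise-constant fit at resolution m, returned as a callable step function."""
        n_cells = 2 ** m
        cells = np.minimum((X * n_cells).astype(int), n_cells - 1)
        means = np.array([Y[cells == c].mean() if np.any(cells == c) else 0.0
                          for c in range(n_cells)])
        return lambda x: means[np.minimum((x * n_cells).astype(int), n_cells - 1)]

    def lepski_resolution(X, Y, m_max, K1=1.0, t=1.0):
        """Smallest m such that, for every finer level l > m, the fits at levels l and m differ
        (in sup-norm over a grid) by at most K1 * t * sqrt(2**l * l / N)."""
        N = len(X)
        grid = np.linspace(0, 1, 512, endpoint=False)
        fits = [fit_dyadic(X, Y, m)(grid) for m in range(m_max + 1)]
        for m in range(m_max + 1):
            if all(np.max(np.abs(fits[l] - fits[m])) <= K1 * t * np.sqrt(2 ** l * l / N)
                   for l in range(m + 1, m_max + 1)):
                return m
        return m_max

    rng = np.random.default_rng(5)
    X = rng.uniform(0, 1, 4000)
    Y = np.where(rng.uniform(size=4000) < (1 + 0.8 * np.sin(4 * np.pi * X)) / 2, 1, -1)
    print("selected resolution:", lepski_resolution(X, Y, m_max=7))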

How to choose the optimal resolution level?

Assume that
B_2 2^{−βm} ≤ ‖η − η_m‖_∞ ≤ B_1 2^{−βm}.
The lower bound is crucial to control the bias.

Lemma. With probability at least 1 − C 2^{d m̄} log N · exp(−c t m̄),
ˆm ∈ ( m̄ − 1, m̄ + (1/β)(log_2 t + log_2(B_1/B_2)) + h ],
where h is some fixed positive number that depends on d, r, Π.

Confidence band

The previous result implies
‖η − ˆη‖_∞ ≤ C (log(N/α))^ω N^{−β/(2β+d)}.
This allows us to construct a confidence band for η(x), and hence to control the size of the active set. The final bound for the risk follows after simple algebraic computations.

Thank you for your attention!
