Plug-in Approach to Active Learning


Stanislav Minsker (Georgia Tech)

Prediction framework

Let (X, Y) be a random couple in R^d × {−1, +1}: X is the observed quantity and the label Y is to be predicted. Prediction is carried out via a binary classifier g : S → {−1, +1}.

The joint distribution of (X, Y) will be denoted by P, and the distribution of X by Π.

η(x) := E(Y | X = x) is the regression function.

The generalization error of a binary classifier g is R(g) = Pr(Y ≠ g(X)).

Bayes classifier: g*(x) := sign η(x) has the smallest possible generalization error.
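To make the plug-in principle above concrete, here is a minimal Python sketch (illustration only, not part of the slides): any estimate eta_hat of the regression function induces a classifier by taking its sign, and its generalization error can be approximated on an independent sample. The particular eta used below is an arbitrary example.

import numpy as np

def plug_in_classifier(eta_hat):
    # Turn an estimate of eta(x) = E[Y | X = x] into a classifier g(x) = sign(eta_hat(x)).
    return lambda x: np.where(eta_hat(x) >= 0, 1, -1)

def empirical_error(g, X, Y):
    # Monte Carlo approximation of the generalization error R(g) = Pr(Y != g(X)).
    return np.mean(g(X) != Y)

# Example with a known regression function, so sign(eta) is the Bayes classifier.
rng = np.random.default_rng(0)
eta = lambda x: np.clip(1 - 2 * np.abs(x - 0.5), -1, 1) * np.sign(x - 0.25)
X = rng.uniform(0, 1, size=10_000)
Y = np.where(rng.uniform(size=X.shape) < (1 + eta(X)) / 2, 1, -1)   # P(Y = 1 | X) = (1 + eta(X)) / 2
print("estimated Bayes risk:", empirical_error(plug_in_classifier(eta), X, Y))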

Difference between Active and Passive Learning

In the passive learning framework we are given an i.i.d. sample (X_i, Y_i), i = 1, ..., n, as input. In practice, the cost of the training data is associated with labeling the observations, while the pool of observations itself is almost unlimited. Active learning algorithms try to take advantage of this modified framework: the goal is to construct a classifier with good generalization properties using a small number of labels.

Difference between Active and Passive Learning

In active learning, observations are sampled sequentially:
X_k is sampled from a modified distribution Π̂_k that depends on (X_1, Y_1), ..., (X_{k−1}, Y_{k−1});
Y_k is sampled from the conditional distribution P_{Y|X}(· | X = x);
labels are conditionally independent given the feature vectors X_i, i ≤ n;
Π̂_k is supported on a set where classification is difficult.

Previous work

1. Dasgupta, S., Hsu, D. and Monteleoni, C. A general agnostic active learning algorithm.
2. Castro, R. M. and Nowak, R. D. (2008). Minimax bounds for active learning.
3. Balcan, M.-F., Beygelzimer, A. and Langford, J. (2009). Agnostic active learning.
4. Hanneke, S. (2010). Rates of Convergence in Active Learning.
5. Koltchinskii, V. (2010). Rademacher Complexities and Bounding the Excess Risk in Active Learning.

Previous work

It was discovered that in some cases active learners can significantly outperform passive algorithms. In particular, this is the case when Tsybakov's low noise assumption is satisfied: there exist constants B, γ > 0 such that for all t > 0, Π(x : |η(x)| ≤ t) ≤ B t^γ.

Castro and Nowak: if the decision boundary {x ∈ R^d : η(x) = 0} is the graph of a Hölder smooth function g ∈ Σ(β, K, [0, 1]^{d−1}) and the noise assumption is satisfied with γ > 0, then

E R(ĝ_N) − R* ≲ N^{−β(1+γ)/(2β+γ(d−1))},

where N is the label budget. However, the construction of the classifier that achieves this bound assumes β and γ to be known.

Motivation and Assumptions

A good learning algorithm is expected to be computationally efficient and adaptive with respect to the underlying structure of the problem (smoothness and noise level). Existing algorithms do not possess these two properties simultaneously.

Motivation and Assumptions

Development of active learning algorithms that are adaptive and computationally tractable.
Obtaining minimax lower bounds for the excess risk of active learning over a broad class of underlying distributions.

Main tool: construct a nonparametric estimator η̂_n(·) of the regression function and use the plug-in classifier ĝ(·) = sign η̂_n(·).

Motivation and Assumptions

Assumptions:
1. X is supported in [0, 1]^d, and the marginal Π is regular (for example, dΠ(x) = p(x) dx with 0 < μ_1 ≤ p(x) ≤ μ_2 < ∞);
2. η(·) ∈ Σ(β, L, [0, 1]^d);
3. for all t > 0, Π(x : |η(x)| ≤ t) ≤ B t^γ.

In passive learning, under similar assumptions, the plug-in approach leads to classifiers that attain optimal rates of convergence:

C_1 N^{−β(1+γ)/(2β+d)} ≤ sup_{P ∈ P(β,γ)} [E R(ĝ_N) − R*] ≤ C_2 N^{−β(1+γ)/(2β+d)}

(J.-Y. Audibert, A. B. Tsybakov, Fast learning rates for plug-in classifiers).

Lower bound

Theorem. Let β, γ, d be such that βγ ≤ d. There exists C > 0 such that for all N large enough and for any active classifier ĝ_N(x),

sup_{P ∈ P(β,γ)} [E R_P(ĝ_N) − R*] ≥ C N^{−β(1+γ)/(2β+d−βγ)}.

Here, N is the number of labels used to construct ĝ_N rather than the size of the training data set. The proof combines techniques developed by Tsybakov and Audibert and by Castro and Nowak.

Example: β = 1, d = 2, γ = 1. Passive rate: N^{−1/2}. Active rate: N^{−2/3}.

Active learning algorithm

Suppose that assumptions (1), (2), (3) are satisfied and that Π, the marginal distribution of X, is known. Input: label budget N.

Use a small fraction of the data to construct an estimator that is close to η(x) in sup-norm; in our case, this is a piecewise polynomial estimator.
Construct a confidence band based on the obtained estimator.
Active set: the subset of [0, 1]^d where the confidence band crosses the zero level. The size of the active set is controlled by the low noise assumption.
Outside of the active set: correct classification with high confidence!

Next iteration:
Sample observations from Π̂_1(dx) := Π(dx | X ∈ Active Set);
construct a tighter confidence band;
repeat until the label budget is exhausted;
output ĝ = sign η̂_fin.

Active learning algorithm (r = 0)

input: label budget N; confidence α; minimal regularity 0 < ν ≤ 1
output: ĝ := sign η̂

ω := 1 + (4 + 2d)/(2ν);
k := 0; N_0 := N; LB := N − 2N_0;
for i = 1 to 2N_0 do
    sample i.i.d. (X_i^(0), Y_i^(0)) with X_i^(0) ~ Π;
S_{0,1} := {(X_i^(0), Y_i^(0)), i ≤ N_0},  S_{0,2} := {(X_i^(0), Y_i^(0)), N_0 + 1 ≤ i ≤ 2N_0};
m̂_0 := m̂(s, N_0; S_{0,1})   /* see equation (2) */;
η̂_0 := η̂_{m̂_0, [0,1]^d; S_{0,2}}   /* see equation (1) */;
while LB > 0 do
    k := k + 1;
    Â_k := {x ∈ [0,1]^d : ∃ f_1, f_2 ∈ F̂_{k−1} with sign(f_1(x)) ≠ sign(f_2(x))};
    if Â_k ∩ supp(Π) = ∅ then
        break
    else
        m̂_k := m̂_{k−1} + 1;
        τ_k := m̂_k / m̂_{k−1};
        N_k := N_{k−1}^{τ_k};
        for i = 1 to ⌈N_k Π(Â_k)⌉ do
            sample i.i.d. (X_i^(k), Y_i^(k)) with X_i^(k) ~ Π̂_k := Π(dx | x ∈ Â_k);
        S_k := {(X_i^(k), Y_i^(k)), i ≤ ⌈N_k Π(Â_k)⌉};
        η̂_k := η̂_{m̂_k, Â_k}   /* estimator based on S_k */;
        δ_k := D (log(N/α))^ω N_k^{−β/(2β+d)}   /* size of the confidence band */;
        F̂_k := {f ∈ F^0_{m̂_k} : f|_{Â_k} ∈ F_{Â_k}(η̂_k; δ_k), f|_{[0,1]^d \ Â_k} = η̂_{k−1}|_{[0,1]^d \ Â_k}};
        LB := LB − ⌈N_k Π(Â_k)⌉;
        η̂ := η̂_k   /* keeping track of the most recent estimator */;
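Below is a simplified one-dimensional Python caricature of this procedure, not the algorithm above: it fits a piecewise-constant estimate of η on the current active cells, forms a crude confidence band of half-width delta, and keeps only the cells where the band still crosses zero. The budget split, the band-width rule, and the sampler sample_labels are placeholders rather than the quantities defined in equations (1) and (2).

import numpy as np

def active_learning_sketch(sample_labels, label_budget, n_cells=32, band_const=2.0, seed=0):
    # sample_labels(x) must return labels in {-1, +1} for an array of points x in [0, 1].
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_cells + 1)
    active = np.ones(n_cells, dtype=bool)            # every dyadic cell starts active
    eta_hat = np.zeros(n_cells)
    spent, per_round = 0, max(label_budget // 4, 1)  # placeholder budget split
    while spent < label_budget and active.any():
        # sample only inside active cells, i.e. from Pi(dx | x in active set)
        cells = rng.choice(np.flatnonzero(active), size=per_round)
        x = rng.uniform(edges[cells], edges[cells + 1])
        y = sample_labels(x)
        spent += per_round
        for c in np.flatnonzero(active):             # piecewise-constant estimate per active cell
            hit = cells == c
            if hit.any():
                eta_hat[c] = y[hit].mean()
        counts = np.bincount(cells, minlength=n_cells)
        delta = band_const / np.sqrt(np.maximum(counts, 1))   # crude band half-width
        active &= np.abs(eta_hat) <= delta           # keep cells where the band crosses zero
    def classifier(x):
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_cells - 1)
        return np.where(eta_hat[idx] >= 0, 1, -1)
    return classifier

Run it with any label oracle, for example classifier = active_learning_sketch(lambda x: np.where(x > 0.4, 1, -1), label_budget=2000); the active set quickly concentrates around the point where the labels change sign.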

Rates of convergence

Let ĝ be the classifier output by the algorithm.

Theorem. Assume P ∈ P(β, γ). Then the following bound holds uniformly over all 0 < ν ≤ β ≤ r + 1 and γ > 0, with probability at least 1 − α:

R_P(ĝ) − R* ≤ Const · N^{−β(1+γ)/(2β+d−(β∧1)γ)} log^p(N/α),

where p = p(β, γ).

Remarks:
1. Fast rates (i.e., faster than N^{−1/2}) are attained when:
   passive learning: βγ > d/2;
   active learning: βγ > d/3 (if β ≤ 1), βγ > d/(2 + 1/β) if β > 1.
2. Matches the lower bound, up to log factors, when β ≤ 1.

Comments

Properties of the algorithm.
Positive:
1. Adaptation to unknown smoothness and noise level.
2. The algorithm is computationally tractable.
Negative:
1. Requires the minimal regularity ν as an input.
2. Currently, we are unaware of any effective way to certify that the resulting classifier has small excess risk.

Comments

For β > 1 there is a gap between the upper and lower bounds. Question: is this a flaw of the proof techniques, or is it possible to improve the upper bound under some extra assumptions?

What if the design distribution Π is discrete? Example: active learning on graphs.

Simulation

Model: Y_i = sign[f(X_i) + ε_i], ε_i ~ N(0, σ²), i = 1, ..., 34,
f(x) = x (1 + sin 5x) sin(4πx), σ² = 0.2.
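A short Python sketch of this data-generating model (the form of f is transcribed from the slide, and the uniform design on [0, 1] is an assumption made for illustration):

import numpy as np

def f(x):
    # Signal as read off the slide; treat the exact expression as illustrative.
    return x * (1 + np.sin(5 * x)) * np.sin(4 * np.pi * x)

def simulate(n=34, sigma2=0.2, seed=0):
    # Y_i = sign(f(X_i) + eps_i), eps_i ~ N(0, sigma2)
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=n)
    y = np.where(f(x) + eps >= 0, 1, -1)
    return x, y

X_sim, Y_sim = simulate()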

Proof details

Define the nested families of piecewise-polynomial functions

F^r_m := { f = Σ_{i=1}^{2^{dm}} q_i(x_1, ..., x_d) I_{R_i} },

where {R_i}_{i=1}^{2^{dm}} forms the dyadic partition of the unit cube and the q_i(x_1, ..., x_d) are polynomials of degree at most r in d variables. The corresponding estimator is

η̂_m(x) := Σ_i α̂_i φ_i(x),  with  α̂_i := (1/N) Σ_{j=1}^N Y_j φ_i(X_j).   (1)

This construction:
provides a simple structure for the active set;
gives good concentration around the mean.
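For r = 0 and d = 1 the estimator in (1) reduces to a dyadic histogram of the labels. A minimal Python sketch, under the simplifying assumption of a uniform design so that the normalized indicators phi_i = 2^{m/2} 1_{R_i} are orthonormal in L_2(Π):

import numpy as np

def dyadic_projection_estimator(X, Y, m):
    # Estimator (1) with r = 0, d = 1: dyadic partition of [0, 1] into 2**m cells,
    # basis phi_i = 2**(m/2) * indicator of cell R_i (orthonormal when X is uniform).
    n_cells, N = 2 ** m, len(X)
    cell = np.clip((np.asarray(X) * n_cells).astype(int), 0, n_cells - 1)
    phi_norm = np.sqrt(n_cells)                                              # 2**(m/2)
    alpha = phi_norm * np.bincount(cell, weights=Y, minlength=n_cells) / N   # alpha_i in (1)
    def eta_hat(x):
        idx = np.clip((np.asarray(x) * n_cells).astype(int), 0, n_cells - 1)
        return alpha[idx] * phi_norm                                         # sum_i alpha_i phi_i(x)
    return eta_hat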

Main steps of the proof

Estimate the size of the active set: controlled by the width of the confidence band and the low noise assumption.

Estimate the risk of the plug-in classifier on the active set, using the comparison inequality

R_P(f) − R* ≤ Const ‖(f − η) 1_{sign f ≠ sign η}‖_∞^{1+γ}.
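In symbols, using only the confidence band and the low noise assumption: if the band is valid on the active set, i.e. ‖η̂_k − η‖_∞ ≤ δ_k there, then two members of the band can disagree in sign only where |η̂_k| ≤ δ_k, hence only where |η| ≤ 2δ_k, so

Π(Â_{k+1}) ≤ Π(x : |η(x)| ≤ 2δ_k) ≤ B (2δ_k)^γ;

and the comparison inequality applied to the plug-in classifier ĝ_k = sign η̂_k gives

R_P(ĝ_k) − R* ≤ Const ‖(η̂_k − η) 1_{sign η̂_k ≠ sign η}‖_∞^{1+γ} ≤ Const δ_k^{1+γ}.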

How to choose the optimal resolution level?

Lepski-type method:

m̂ := m̂(t, N) = min{ m : ∀ l > m, ‖η̂_l − η̂_m‖_∞ ≤ K_1 t √(2^{dl} l / N) }.   (2)

Compare to the optimal resolution level

m̄ := min{ m : ‖η − η_m‖_∞ ≤ K_2 √(2^{dm} m / N) }.   (3)

Here, η_m is the L_2(Π) projection of η onto F^r_m.
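A schematic Python version of this Lepski-type rule, reusing the dyadic_projection_estimator sketch above (the constant K1, the grid approximation of the sup-norm, and d = 1 are simplifications, not the exact quantities in (2) and (3)):

import numpy as np

def lepski_resolution(X, Y, t, K1=1.0, m_max=8):
    # Smallest m such that every finer estimate (l > m) stays within the threshold
    # K1 * t * sqrt(2**l * l / N) of the estimate at level m (schematic version of (2)).
    N = len(X)
    grid = np.linspace(0.0, 1.0, 1024)        # points used to approximate the sup-norm
    est = [dyadic_projection_estimator(X, Y, m)(grid) for m in range(m_max + 1)]
    for m in range(m_max + 1):
        if all(np.max(np.abs(est[l] - est[m])) <= K1 * t * np.sqrt(2 ** l * l / N)
               for l in range(m + 1, m_max + 1)):
            return m
    return m_max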

How to choose the optimal resolution level?

Assume that B_2 2^{−βm} ≤ ‖η − η_m‖_∞ ≤ B_1 2^{−βm}. The lower bound is crucial to control the bias.

Lemma. m̂ ∈ ( m̄ − (1/β)(log_2 t + log_2(B_1/B_2)) − h, m̄ ] with probability at least 1 − C 2^{d m̄} log N exp(−c t m̄), where h is some fixed positive number that depends on d, r, Π.

Confidence band

The previous result implies

‖η − η̂‖_∞ ≤ C (log(N/α))^ω N^{−β/(2β+d)}.

This allows us to construct a confidence band for η(x), and hence to control the size of the active set. The final bound for the risk follows after simple algebraic computations.

Thank you for your attention!