Plug-in Approach to Active Learning


Journal of Machine Learning Research. Submitted 2/11; Revised 9/11; Published 1/12

Plug-in Approach to Active Learning

Stanislav Minsker
School of Mathematics, Georgia Institute of Technology, 686 Cherry Street, Atlanta, GA, USA

Editor: Sanjoy Dasgupta

Abstract

We present a new active learning algorithm based on nonparametric estimators of the regression function. Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by the proposed method over a broad class of underlying distributions. We also prove minimax lower bounds which show that the obtained rates are almost tight.

Keywords: active learning, selective sampling, model selection, classification, confidence bands

1. Introduction

Let $(S,\mathcal B)$ be a measurable space and let $(X,Y)\in S\times\{-1,1\}$ be a random couple with unknown distribution $P$. The marginal distribution of the design variable $X$ will be denoted by $\Pi$. Let $\eta(x):=\mathbb E(Y\mid X=x)$ be the regression function. The goal of binary classification is to predict the label $Y$ based on the observation $X$. Prediction is based on a classifier, a measurable function $f:S\to\{-1,1\}$. The quality of a classifier is measured in terms of its generalization error, $R(f)=\Pr\big(Y\ne f(X)\big)$.

In practice, the distribution $P$ remains unknown, but the learning algorithm has access to training data, the i.i.d. sample $(X_i,Y_i)$, $i=1,\dots,n$, from $P$. It often happens that the cost of obtaining the training data is associated with labeling the observations $X_i$, while the pool of observations itself is almost unlimited. This suggests measuring the performance of a learning algorithm in terms of its label complexity, the number of labels $Y_i$ required to obtain a classifier with the desired accuracy. Active learning theory is mainly devoted to the design and analysis of algorithms that can take advantage of this modified framework. Most of these procedures can be characterized by the following property: at each step $k$, the observation $X_k$ is sampled from a distribution $\hat\Pi_k$ that depends on the previously obtained $(X_i,Y_i)$, $i\le k-1$ (while passive learners obtain all available training data at the same time). $\hat\Pi_k$ is designed to be supported on a set where classification is difficult and requires more labeled data to be collected. The situation when active learners outperform passive algorithms might occur when the so-called Tsybakov's low noise assumption is satisfied: there exist constants $B,\gamma>0$ such that
$$\forall t>0,\quad \Pi\big(x:\ |\eta(x)|\le t\big)\le Bt^{\gamma}.\tag{1}$$
This assumption provides a convenient way to characterize the noise level of the problem and will play a crucial role in our investigation.
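To see what the exponent $\gamma$ in (1) measures, consider a simple illustration (ours, not part of the original argument): let $d=1$, let $X$ be uniform on $[0,1]$, and let $\eta(x)=\mathrm{sign}(2x-1)\,|2x-1|^{1/\gamma}$ for some $\gamma>0$. Then for every $t\in(0,1]$,
$$\Pi\big(x:\ |\eta(x)|\le t\big)=\Pi\big(|2x-1|\le t^{\gamma}\big)=t^{\gamma},$$
so (1) holds with $B=1$ and exactly this $\gamma$: a larger $\gamma$ means that $\eta$ moves away from zero faster near the decision boundary $x=1/2$, leaving less probability mass in the region where classification is hard.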

The topic of active learning is widely present in the literature; see Balcan et al. (2009), Hanneke (2011), and Castro and Nowak (2008) for a review. It was discovered that in some cases the generalization error of the resulting classifier can converge to zero exponentially fast with respect to its label complexity (while the best rate for passive learning is usually polynomial with respect to the cardinality of the training data set). However, the available algorithms that adapt to the unknown parameters of the problem ($\gamma$ in Tsybakov's low noise assumption, regularity of the decision boundary) involve empirical risk minimization with the binary loss, along with other computationally hard problems; see Balcan et al. (2008), Dasgupta et al. (2008), Hanneke (2011) and Koltchinskii (2010). On the other hand, the algorithms that can be effectively implemented, as in Castro and Nowak (2008), are not adaptive.

The majority of the previous work in the field was done under standard complexity assumptions on the set of possible classifiers (such as polynomial growth of the covering numbers). Castro and Nowak (2008) derived their results under regularity conditions on the decision boundary and a noise assumption which is slightly more restrictive than (1). Essentially, they proved that if the decision boundary is the graph of a Hölder smooth function $g\in\Sigma(\beta,K,[0,1]^{d-1})$ (see Section 2 for definitions) and the noise assumption is satisfied with $\gamma>0$, then the minimax lower bound for the expected excess risk of the active classifier is of order $CN^{-\frac{\beta(1+\gamma)}{2\beta+\gamma(d-1)}}$ and the upper bound is $C(N/\log N)^{-\frac{\beta(1+\gamma)}{2\beta+\gamma(d-1)}}$, where $N$ is the label budget. However, the construction of the classifier that achieves the upper bound assumes $\beta$ and $\gamma$ to be known.

In this paper, we consider the problem of active learning under classical nonparametric assumptions on the regression function: namely, we assume that it belongs to a certain Hölder class $\Sigma(\beta,K,[0,1]^d)$ and satisfies the low noise condition (1) with some positive $\gamma$. In this case, the work of Audibert and Tsybakov (2005) showed that plug-in classifiers can attain optimal rates in the passive learning framework, namely, that the expected excess risk of the classifier $\hat g=\mathrm{sign}\,\hat\eta$ is bounded above by $CN^{-\frac{\beta(1+\gamma)}{2\beta+d}}$ (which is the optimal rate), where $\hat\eta$ is the local polynomial estimator of the regression function and $N$ is the size of the training data set. We were able to partially extend this claim to the case of active learning: first, we obtain minimax lower bounds for the excess risk of an active classifier in terms of its label complexity. Second, we propose a new algorithm that is based on plug-in classifiers, attains almost optimal rates over a broad class of distributions, and possesses adaptivity with respect to $\beta,\gamma$ (within a certain range of these parameters).

The paper is organized as follows: the next section introduces the remaining notation and specifies the main assumptions made throughout the paper. This is followed by a qualitative description of our learning algorithm. The second part of the work contains the statements and proofs of our main results, the minimax upper and lower bounds for the excess risk.

2. Preliminaries

Our active learning framework is governed by the following rules:

1. Observations are sampled sequentially: $X_k$ is sampled from the modified distribution $\hat\Pi_k$ that depends on $(X_1,Y_1),\dots,(X_{k-1},Y_{k-1})$.
2. $Y_k$ is sampled from the conditional distribution $P_{Y|X}(\cdot\mid X=X_k)$. Labels are conditionally independent given the feature vectors $X_i$, $i\le n$.

Usually, the distribution $\hat\Pi_k$ is supported on a set where classification is difficult.

Given a probability measure $Q$ on $S\times\{-1,1\}$, we denote the integral with respect to this measure by $Qg:=\int g\,dQ$. Let $\mathcal F$ be a class of bounded, measurable functions.

The risk and the excess risk of $f\in\mathcal F$ with respect to the measure $Q$ are defined by
$$R_Q(f):=Q\,I\{y\ne\mathrm{sign}\,f(x)\},\qquad \mathcal E_Q(f):=R_Q(f)-\inf_{g\in\mathcal F}R_Q(g),$$
where $I_A$ is the indicator of the event $A$. We will omit the subindex $Q$ when the underlying measure is clear from the context. Recall that we denoted the distribution of $(X,Y)$ by $P$. The minimal possible risk with respect to $P$ is
$$R^*=\inf_{g:S\to[-1,1]}\Pr\big(Y\ne\mathrm{sign}\,g(X)\big),$$
where the infimum is taken over all measurable functions. It is well known that it is attained for any $g$ such that $\mathrm{sign}\,g(x)=\mathrm{sign}\,\eta(x)$ $\Pi$-a.s.

Given $g\in\mathcal F$, $A\in\mathcal B$ and $\delta>0$, define
$$\mathcal F_{\infty,A}(g;\delta):=\big\{f\in\mathcal F:\ \|f-g\|_{\infty,A}\le\delta\big\},\qquad\text{where }\|f-g\|_{\infty,A}=\sup_{x\in A}|f(x)-g(x)|.$$
For $A\in\mathcal B$, define the function class
$$\mathcal F|_A:=\{f|_A,\ f\in\mathcal F\},\qquad\text{where } f|_A(x):=f(x)I_A(x).$$
From now on, we restrict our attention to the case $S=[0,1]^d$. Let $K>0$.

Definition 1 We say that $g:\mathbb R^d\to\mathbb R$ belongs to $\Sigma(\beta,K,[0,1]^d)$, the $(\beta,K,[0,1]^d)$-Hölder class of functions, if $g$ is $\lfloor\beta\rfloor$ times continuously differentiable and for all $x,x_1\in[0,1]^d$ satisfies
$$|g(x_1)-T_x(x_1)|\le K\|x-x_1\|^{\beta},$$
where $T_x$ is the Taylor polynomial of degree $\lfloor\beta\rfloor$ of $g$ at the point $x$.

Definition 2 $\mathcal P(\beta,\gamma)$ is the class of probability distributions on $[0,1]^d\times\{-1,+1\}$ with the following properties:
1. $\forall t>0,\ \Pi\big(x:\ |\eta(x)|\le t\big)\le Bt^{\gamma}$;
2. $\eta(x)\in\Sigma(\beta,K,[0,1]^d)$.

We do not mention the dependence of $\mathcal P(\beta,\gamma)$ on the fixed constants $B,K$ explicitly, but this should not cause any uncertainty.

Finally, let us define $\mathcal P_U(\beta,\gamma)$ and $\mathcal P^*_U(\beta,\gamma)$, the subclasses of $\mathcal P(\beta,\gamma)$, by imposing two additional assumptions. Along with the formal descriptions of these assumptions, we shall try to provide some motivation behind them. The first deals with the marginal $\Pi$. For an integer $M\ge 1$, let
$$G_M:=\Big\{\Big(\frac{k_1}{M},\dots,\frac{k_d}{M}\Big),\ k_i=1,\dots,M,\ i=1,\dots,d\Big\}$$
be the regular grid on the unit cube $[0,1]^d$ with mesh size $M^{-1}$. It naturally defines a partition into a set of $M^d$ open cubes $R_i$, $i=1,\dots,M^d$, with edges of length $M^{-1}$ and vertices in $G_M$. Below, we consider the nested sequence of grids $\{G_{2^m},\ m\ge 1\}$ and the corresponding dyadic partitions of the unit cube.
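The dyadic partition structure introduced above is used throughout the paper. The following short sketch (ours, not part of the original text; the function names are illustrative) shows how a point of $[0,1]^d$ is mapped to the dyadic cube containing it at resolution level $m$, which is all that is needed to work with the classes $\mathcal F_m$ below.

```python
import numpy as np

def dyadic_cell_index(x, m):
    """Return the multi-index (k_1, ..., k_d) of the dyadic cube of side 2**-m
    that contains x in [0, 1)^d (boundary points are sent to the last cell)."""
    x = np.asarray(x, dtype=float)
    # Each coordinate is binned into one of 2**m intervals of length 2**-m.
    k = np.minimum(np.floor(x * 2**m).astype(int), 2**m - 1)
    return tuple(k)

def flat_index(k, m):
    """Flatten a multi-index into a single index in {0, ..., 2**(d*m) - 1}."""
    idx = 0
    for ki in k:
        idx = idx * 2**m + ki
    return idx

# Example: the point (0.3, 0.7) at resolution m = 2 lies in the cube
# [0.25, 0.5) x [0.5, 0.75), i.e. multi-index (1, 2).
print(dyadic_cell_index([0.3, 0.7], m=2))   # (1, 2)
print(flat_index((1, 2), m=2))              # 6
```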

Definition 3 We will say that $\Pi$ is $(u_1,u_2)$-regular with respect to $\{G_{2^m}\}$ if for any $m\ge 1$ and any element of the partition $R_i$, $i\le 2^{dm}$, such that $R_i\cap\mathrm{supp}(\Pi)\ne\emptyset$, we have
$$u_1\,2^{-dm}\le\Pi(R_i)\le u_2\,2^{-dm},$$
where $0<u_1\le u_2<\infty$.

Assumption 1 $\Pi$ is $(u_1,u_2)$-regular.

In particular, $(u_1,u_2)$-regularity holds for any distribution with a density $p$ on $[0,1]^d$ such that $0<u_1\le p(x)\le u_2<\infty$. Let us mention that our definition of regularity is of a rather technical nature; for most of the paper, the reader might think of $\Pi$ as being uniform on $[0,1]^d$ (however, we need a slightly more complicated marginal to construct the minimax lower bounds for the excess risk). It is known that estimation of the regression function in sup-norm is sensitive to the geometry of the design distribution, mainly because the quality of estimation depends on the local amount of data at every point; conditions similar to our Assumption 1 were used in previous works where this problem appeared, for example, the strong density assumption in Audibert and Tsybakov (2005) and Assumption D in Gaïffas (2007). A useful characteristic of a $(u_1,u_2)$-regular distribution $\Pi$ is that this property is stable with respect to restrictions of $\Pi$ to certain subsets of its support. This fact fits the active learning framework particularly well.

Definition 4 We say that $Q$ belongs to $\mathcal P_U(\beta,\gamma)$ if $Q\in\mathcal P(\beta,\gamma)$ and Assumption 1 is satisfied for some $u_1,u_2$.

The second assumption is crucial in the derivation of the upper bounds. The space of piecewise-constant functions which is used to construct the estimators of $\eta(x)$ is defined via
$$\mathcal F_m=\Big\{\sum_{i=1}^{2^{dm}}\lambda_i I_{R_i}:\ |\lambda_i|\le 1,\ i=1,\dots,2^{dm}\Big\},$$
where $\{R_i\}_{i=1}^{2^{dm}}$ forms the dyadic partition of the unit cube. Note that $\mathcal F_m$ can be viewed as an $\|\cdot\|_\infty$-unit ball in the linear span of the first $2^{dm}$ Haar basis functions in $[0,1]^d$. Moreover, $\{\mathcal F_m,\ m\ge 1\}$ is a nested family, which is a desirable property for model selection procedures. By $\eta_m(x)$ we denote the $L_2(\Pi)$-projection of the regression function onto $\mathcal F_m$.

We will say that a set $A\subset[0,1]^d$ approximates the decision boundary $\{x:\eta(x)=0\}$ if there exists $t>0$ such that
$$\{x:|\eta(x)|\le t\}_\Pi\subseteq A_\Pi\subseteq\{x:|\eta(x)|\le 3t\}_\Pi,\tag{2}$$
where for any set $A$ we define $A_\Pi:=A\cap\mathrm{supp}(\Pi)$. The most important example we have in mind is the following: let $\hat\eta$ be some estimator of $\eta$ with $\|\hat\eta-\eta\|_{\infty,\mathrm{supp}(\Pi)}\le t$, and define the $2t$-band around $\hat\eta$ by
$$\hat{\mathcal F}=\big\{f:\ \hat\eta(x)-2t\le f(x)\le\hat\eta(x)+2t\ \ \forall x\in[0,1]^d\big\}.$$
Take $A=\{x:\ \exists f_1,f_2\in\hat{\mathcal F}\ \text{s.t. }\mathrm{sign}\,f_1(x)\ne\mathrm{sign}\,f_2(x)\}$; then it is easy to see that $A$ satisfies (2). The modified design distributions used by our algorithm are supported on sets with a similar structure.

Let $\sigma(\mathcal F_m)$ be the sigma-algebra generated by $\mathcal F_m$ and $A\in\sigma(\mathcal F_m)$.
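To illustrate the construction of the band $\hat{\mathcal F}$ and of the associated set $A$ for a piecewise-constant estimator, here is a minimal sketch (ours; the function name and the flat array representation of the dyadic cells are illustrative assumptions): a cell belongs to the set exactly when the band $[\hat\eta-2t,\hat\eta+2t]$ contains both signs on that cell, i.e., when it crosses zero.

```python
import numpy as np

def active_cells(eta_hat, band_width):
    """Given a piecewise-constant estimate eta_hat (one value per dyadic cell)
    and a band half-width band_width (the `2t` above), return a boolean mask of
    cells where the band [eta_hat - band_width, eta_hat + band_width] crosses
    zero, i.e. where two functions in the band can disagree in sign."""
    eta_hat = np.asarray(eta_hat, dtype=float)
    lower = eta_hat - band_width
    upper = eta_hat + band_width
    return (lower <= 0.0) & (upper >= 0.0)

# Example: estimates on a 1-d dyadic grid with 8 cells and band half-width 0.15.
eta_hat = np.array([-0.9, -0.6, -0.3, -0.1, 0.05, 0.2, 0.5, 0.8])
print(active_cells(eta_hat, band_width=0.15))
# [False False False  True  True False False False]
```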

Assumption 2 There exists $B_2>0$ such that for all $m\ge 1$ and all $A\in\sigma(\mathcal F_m)$ satisfying (2) and such that $A_\Pi\ne\emptyset$, the following holds true:
$$\int_{[0,1]^d}|\eta-\eta_m|^2\,\Pi(dx\mid x\in A_\Pi)\ \ge\ B_2\,\|\eta-\eta_m\|^2_{\infty,A_\Pi}.$$

The appearance of Assumption 2 is motivated by the structure of our learning algorithm; namely, it is based on adaptive confidence bands for the regression function. Nonparametric confidence bands are a big topic in the statistical literature, and a review of this subject is not our goal. We just mention that it is impossible to construct adaptive confidence bands of optimal size over the whole scale $\bigcup_{\beta\le 1}\Sigma(\beta,K,[0,1]^d)$; Low (1997) and Hoffmann and Nickl (to appear) discuss the subject in detail. However, it is possible to construct adaptive $L_2$-confidence balls (see the example following Theorem 6.1 in Koltchinskii, 2011). For functions satisfying Assumption 2, this fact allows one to obtain confidence bands of the desired size. In particular, (a) functions that are differentiable, with gradient bounded away from 0 in the vicinity of the decision boundary, and (b) Lipschitz continuous functions that are convex in the vicinity of the decision boundary satisfy Assumption 2. For precise statements, see Propositions 15 and 16 in Appendix A. A different approach to adaptive confidence bands, in the case of one-dimensional density estimation, is presented in Giné and Nickl (2010).

Finally, we define $\mathcal P^*_U(\beta,\gamma)$:

Definition 5 We say that $Q$ belongs to $\mathcal P^*_U(\beta,\gamma)$ if $Q\in\mathcal P_U(\beta,\gamma)$ and Assumption 2 is satisfied for some $B_2>0$.

2.1 Learning Algorithm

Now we give a brief description of the algorithm, since several definitions appear naturally in this context. First, let us emphasize that the marginal distribution $\Pi$ is assumed to be known to the learner. This is not a restriction, since we are not limited in the use of unlabeled data and $\Pi$ can be estimated to any desired accuracy. Our construction is based on so-called plug-in classifiers of the form $\hat f=\mathrm{sign}\,\hat\eta$, where $\hat\eta$ is a piecewise-constant estimator of the regression function. As we have already mentioned above, it was shown in Audibert and Tsybakov (2005) that in the passive learning framework plug-in classifiers attain the optimal rate for the excess risk of order $N^{-\frac{\beta(1+\gamma)}{2\beta+d}}$, with $\hat\eta$ being the local polynomial estimator.

Our active learning algorithm iteratively improves the classifier by constructing shrinking confidence bands for the regression function. On every step $k$, the piecewise-constant estimator $\hat\eta_k$ is obtained via a model selection procedure which allows adaptation to the unknown smoothness (for Hölder exponent $\le 1$). The estimator is further used to construct a confidence band $\hat{\mathcal F}_k$ for $\eta(x)$. The active set associated with $\hat{\mathcal F}_k$ is defined as
$$\hat A_k=A(\hat{\mathcal F}_k):=\big\{x\in\mathrm{supp}(\Pi):\ \exists f_1,f_2\in\hat{\mathcal F}_k,\ \mathrm{sign}\,f_1(x)\ne\mathrm{sign}\,f_2(x)\big\}.$$
Clearly, this is the set where the confidence band crosses the zero level and where classification is potentially difficult. $\hat A_k$ serves as the support of the modified distribution $\hat\Pi_{k+1}$.

On step $k+1$, the label $Y$ is requested only for observations $X\in\hat A_k$, forcing the labeled data to concentrate in the domain where higher precision is needed. This allows one to obtain a tighter confidence band for the regression function restricted to the active set. Since $\hat A_k$ approaches the decision boundary, its size is controlled by the low noise assumption. The algorithm does not require a priori knowledge of the noise and regularity parameters, being adaptive for $\gamma>0$, $\beta\le 1$. Further details are given in Section 3.

Figure 1: Active Learning Algorithm (schematic: the regression function $\eta(x)$, its piecewise-constant estimator, the surrounding confidence band, and the active set where the band crosses zero).

2.2 Comparison Inequalities

Before proceeding with the main results, let us recall the well-known connections between the binary excess risk and the $\|\cdot\|_\infty$- and $L_2(\Pi)$-norm risks:

Proposition 6 Under the low noise assumption,
$$R_P(f)-R^*\ \le\ D_1\big\|(f-\eta)I\{\mathrm{sign}\,f\ne\mathrm{sign}\,\eta\}\big\|_\infty^{1+\gamma};\tag{3}$$
$$R_P(f)-R^*\ \le\ D_2\big\|(f-\eta)I\{\mathrm{sign}\,f\ne\mathrm{sign}\,\eta\}\big\|_{L_2(\Pi)}^{\frac{2(1+\gamma)}{2+\gamma}};\tag{4}$$
$$R_P(f)-R^*\ \ge\ D_3\,\Pi\big(\mathrm{sign}\,f\ne\mathrm{sign}\,\eta\big)^{\frac{1+\gamma}{\gamma}}.\tag{5}$$

Proof For (3) and (4), see Audibert and Tsybakov (2005), Lemmas 5.1 and 5.2 respectively; for (5), see Koltchinskii (2011).

3. Main Results

The question we address below is: what are the best possible rates that can be achieved by active algorithms in our framework, and how can these rates be attained?

3.1 Minimax Lower Bounds for the Excess Risk

The goal of this section is to prove that for $P\in\mathcal P(\beta,\gamma)$, no active learner can output a classifier with expected excess risk converging to zero faster than $N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}$. Our result builds upon the minimax bounds of Audibert and Tsybakov (2005) and Castro and Nowak (2008).
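Before constructing the lower bound, it may help to record explicitly how this target rate compares with the passive minimax rate; the following short computation is ours and is included for orientation only:
$$\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}-\frac{\beta(1+\gamma)}{2\beta+d}=\frac{\beta(1+\gamma)\cdot\beta\gamma}{(2\beta+d-\beta\gamma)(2\beta+d)}>0\quad\text{whenever }\gamma>0,$$
so the active exponent is strictly larger than the passive exponent $\frac{\beta(1+\gamma)}{2\beta+d}$ of Audibert and Tsybakov (2005) as soon as the low noise exponent $\gamma$ is positive, and the improvement grows with $\beta\gamma$ (subject to the constraint $\beta\gamma\le d$ appearing below).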

Remark The theorem below is proved for the smaller class $\mathcal P^*_U(\beta,\gamma)$, which implies the result for $\mathcal P(\beta,\gamma)$.

Theorem 7 Let $\beta,\gamma,d$ be such that $\beta\gamma\le d$. Then there exists $C>0$ such that for all $N$ large enough and for any active classifier $\hat f_N(x)$ we have
$$\sup_{P\in\mathcal P^*_U(\beta,\gamma)}\mathbb E\big(R_P(\hat f_N)-R^*\big)\ \ge\ CN^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}.$$

Proof We proceed by constructing an appropriate family of classifiers $f_\sigma(x)=\mathrm{sign}\,\eta_\sigma(x)$, in a way similar to Theorem 3.5 in Audibert and Tsybakov (2005), and then apply Theorem 2.5 from Tsybakov (2009). We present it below for the reader's convenience.

Theorem 8 Let $\Sigma$ be a class of models, $d:\Sigma\times\Sigma\to\mathbb R$ a pseudometric, and $\{P_f,\ f\in\Sigma\}$ a collection of probability measures associated with $\Sigma$. Assume there exists a subset $\{f_0,\dots,f_M\}$ of $\Sigma$ such that
1. $d(f_i,f_j)\ge 2s>0$ for all $0\le i<j\le M$;
2. $P_{f_j}\ll P_{f_0}$ for every $1\le j\le M$;
3. $\frac1M\sum_{j=1}^M \mathrm{KL}(P_{f_j},P_{f_0})\le\alpha\log M$, with $0<\alpha<\frac18$.
Then
$$\inf_{\hat f}\sup_{f\in\Sigma}P_f\big(d(\hat f,f)\ge s\big)\ \ge\ \frac{\sqrt M}{1+\sqrt M}\Big(1-2\alpha-\sqrt{\frac{2\alpha}{\log M}}\Big),$$
where the infimum is taken over all possible estimators of $f$ based on a sample from $P_f$ and $\mathrm{KL}(\cdot,\cdot)$ is the Kullback-Leibler divergence.

Going back to the proof, let $q=2^l$, $l\ge 1$, and let
$$G_q:=\Big\{\Big(\frac{2k_1-1}{2q},\dots,\frac{2k_d-1}{2q}\Big),\ k_i=1,\dots,q,\ i=1,\dots,d\Big\}$$
be the grid on $[0,1]^d$. For $x\in[0,1]^d$, let
$$n_q(x)=\underset{x_k\in G_q}{\mathrm{argmin}}\ \|x-x_k\|_2.$$
If $n_q(x)$ is not unique, we choose a representative with the smallest $\|\cdot\|_2$ norm. The unit cube is partitioned with respect to $G_q$ as follows: $x_1,x_2$ belong to the same subset if $n_q(x_1)=n_q(x_2)$. Let $\preceq$ be some order on the elements of $G_q$ such that $x\preceq y$ implies $\|x\|_2\le\|y\|_2$. Assume that the elements of the partition are enumerated with respect to the order of their centers induced by $\preceq$: $[0,1]^d=\bigcup_{i=1}^{q^d}R_i$. Fix $1\le m\le q^d$ and let
$$S:=\bigcup_{i=1}^m R_i.$$

8 MINSKER Note that the partition is ordered in such a way that there always exists 1 k q d with B + 0, k S B + 0, k+ 3 d, 6 q q where B + 0,R := { x R d + : x 2 R }. In other words, 6 means that that the difference between the radii of inscribed and circumscribed spherical sectors of S is of order Cdq 1. Let v>r 1 > r 2 be three integers satisfying Define ux :R R + by where Ut := 2 v < 2 r 1 < 2 r 1 d < 2 r 2 d < { ux := x Utdt, 8 1/2 Utdt 2 v 1 exp 1/2 xx 2 v, x 2 v, else. Note that ux is an infinitely differentiable function such that ux = 1, x [0,2 v ] and ux = 0, x 1 2. Finally, for x Rd let Φx := Cu x 2, where C := C L,β is chosen such that Φ Σβ,L,R d. Let r S := inf{r>0: B + 0,r S} and { } A 0 := R i : R i B + 0,r S + q βγ d = /0. Note that i r S c m1/d q, 9 since VolS=mq d. Define H m ={P σ : σ { 1,1} m } to be the hypercube of probability distributions on [0,1] d { 1,+1}. The marginal distribution Π of X is independent of σ: define its density p by px= 2 dr dr 1 r 2 1, x B c 0, x A 0, 0 else. z, 2 r 2 q \B z, 2 r 1 q, z G q S, where B z,r :={x : x z r}, c 0 := 1 mq d VolA 0 note that ΠR i=q d i m and r 1,r 2 are defined in 7. In particular, Π satisfies Assumption 1 since it is supported on the union of dyadic cubes and has bounded above and below on suppπ density. Let Ψx := u 1/2 q βγ d dist2 x,b + 0,r S, 74

9 PLUG-IN APPROACH TO ACTIVE LEARNING Figure 2: Geometry of the support where u is defined in 8 and dist 2 x,a := inf{ x y 2, y A}. Finally, the regression function η σ x=e Pσ Y X = x is defined via { σi q β Φq[x n q x], x R i, 1 i m η σ x := 1 C L,β d dist 2 x,b + 0,r S d γ Ψx, x [0,1] d \ S. The graph of η σ is a surface consisting of small bumps spread around S and tending away from 0 monotonically with respect to dist 2,B + 0,r S on[0,1] d \ S. Clearly, η σ x satisfies smoothness requirement, 1 since for x [0,1] d dist 2 x,b + 0,r S = x 2 r S 0. Let s check that it also satisfies the low noise condition. Since η σ Cq β on the support of Π, it is enough to consider t = Czq β for z>1: Π η σ x Czq β mq d + Π dist 2 x,b + 0,r S Cz γ/d q βγ d mq d +C 2 r S +Cz γ/d q βγ d d mq d +C 3 mq d +C 4 z γ q βγ Ĉt γ, if mq d = Oq βγ. Here, the first inequality follows from considering η σ on S and A 0 separately, and second inequality follows from 9 and direct computation of the sphere volume. Finally, η σ satisfies Assumption 2 with some B 2 := B 2 q since on suppπ 0<c 1 q η σ x 2 c 2 q<. The next step in the proof is to choose the subset ofh which is well-separated : this can be done due to the following fact see Tsybakov, 2009, Lemma 2.9: 1. Ψx is introduced to provide extra smoothness at the boundary of B + 0,r S. 75

10 MINSKER Proposition 9 Gilbert-Varshamov For m 8, there exists {σ 0,...,σ M } { 1,1} m such that σ 0 ={1,1,...,1}, ρσ i,σ j m 8 0 i<k M and M 2m/8 where ρ stands for the Hamming distance. Let H := {P σ0,...,p σm } be chosen such that {σ 0,...,σ M } satisfies the proposition above. Next, following the proof of Theorems 1 and 3 in Castro and Nowak 2008, we note that σ H, σ σ 0 KLP σ,n P σ0,n 8N max x [0,1] η σx η σ0 x 2 32C 2 L,β Nq 2β, 10 where P σ,n is the joint distribution of X i,y i N i=1 under hypothesis that the distribution of couple X,Y is P σ. Let us briefly sketch the derivation of 10; see also the proof of Theorem 1 in Castro and Nowak Denote Then dp σ,n admits the following factorization: dp σ,n X N,Ȳ N = X k :=X 1,...,X k, Ȳ k :=Y 1,...,Y k. N i=1 P σ Y i X i dpx i X i 1,Ȳ i 1, where dpx i X i 1,Ȳ i 1 does not depend on σ but only on the active learning algorithm. As a consequence, KLP σ,n P σ0,n=e Pσ,N log dp σ,n X N,Ȳ N dp σ0,n X n,ȳ N =E P σ,n log N i=1 P σy i X i N i=1 P σ 0 Y i X i = = N i=1 E Pσ,N [ E Pσ log P σy i X i N max x [0,1] d E Pσ ] P σ0 Y i X i X i log P σy 1 X 1 P σ0 Y 1 X 1 X 1 = x 8N max x [0,1] d η σ x η σ0 x 2, where the last inequality follows from Lemma 1 Castro and Nowak, Also, note that we have max x [0,1] d in our bounds rather than the average over x that would appear in the passive learning framework. It remains to choose q,m in appropriate way: set q = C 1 N βγ 1 and m = C 2 q d βγ where C 1, C 2 are such that q d m 1 and 32CL,β 2 Nq 2β < m 64 which is possible for N big enough. In particular, mq d = Oq βγ. Together with the bound 10, this gives 1 M σ H KLP σ P σ 0 32C 2 unq 2β < m 8 2 = 1 8 log H, 76

so that the conditions of Theorem 8 are satisfied. Setting
$$f_\sigma(x):=\mathrm{sign}\,\eta_\sigma(x),\qquad d(f_{\sigma_1},f_{\sigma_2}):=\Pi\big(\mathrm{sign}\,\eta_{\sigma_1}(x)\ne\mathrm{sign}\,\eta_{\sigma_2}(x)\big),$$
we finally have, for all $\sigma_1\ne\sigma_2\in\mathcal H$,
$$d(f_{\sigma_1},f_{\sigma_2})\ \ge\ \frac{m}{8q^d}\ \ge\ C_4 N^{-\frac{\beta\gamma}{2\beta+d-\beta\gamma}},$$
where the lower bound just follows by the construction of our hypotheses. Since under the low noise assumption $R_P(\hat f_N)-R^*\ge c\,\Pi\big(\hat f_N\ne\mathrm{sign}\,\eta\big)^{\frac{1+\gamma}{\gamma}}$ (see (5)), we conclude by Theorem 8 that
$$\inf_{\hat f_N}\sup_{P\in\mathcal P^*_U(\beta,\gamma)}\Pr\Big(R_P(\hat f_N)-R^*\ge C_4 N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}\Big)\ \ge\ \inf_{\hat f_N}\sup_{P\in\mathcal P^*_U(\beta,\gamma)}\Pr\Big(\Pi\big(\hat f_N(x)\ne\mathrm{sign}\,\eta_P(x)\big)\ge \tfrac{C_4}{2}N^{-\frac{\beta\gamma}{2\beta+d-\beta\gamma}}\Big)\ \ge\ \tau>0.$$

3.2 Upper Bounds for the Excess Risk

Below, we present a new active learning algorithm which is computationally tractable, adaptive with respect to $\beta,\gamma$ in a certain range of these parameters, and can be applied in the nonparametric setting. We show that the classifier constructed by the algorithm attains the rates of Theorem 7, up to a polylogarithmic factor, if $0<\beta\le 1$ and $\beta\gamma\le d$ (the last condition covers the most interesting case when the regression function hits or crosses the decision boundary in the interior of the support of $\Pi$; for a detailed statement about the connection between the behavior of the regression function near the decision boundary and the parameters $\beta,\gamma$, see Proposition 3.4 in Audibert and Tsybakov, 2005). The problem of adaptation to a higher order of smoothness ($\beta>1$) is still awaiting its complete solution; we address these questions below in our final remarks.

For the purpose of this section, the regularity assumption reads as follows: there exists $0<\beta\le 1$ such that
$$\forall x_1,x_2\in[0,1]^d,\quad |\eta(x_1)-\eta(x_2)|\le B_1\|x_1-x_2\|^{\beta}.\tag{11}$$
Since we want to be able to construct non-asymptotic confidence bands, some estimates on the size of the constants in (11) and Assumption 2 are needed. Below, we will additionally assume that $B_1\le\log N$ and $B_2\ge\log^{-1}N$, where $N$ is the label budget. This can be replaced by any known bounds on $B_1,B_2$.

Let $A\in\sigma(\mathcal F_m)$ with $A_\Pi:=A\cap\mathrm{supp}(\Pi)\ne\emptyset$. Define
$$\hat\Pi_A(dx):=\Pi(dx\mid x\in A_\Pi)$$

and $d_m:=\dim\big(\mathcal F_m|_{A_\Pi}\big)$. Next, we introduce a simple estimator of the regression function on the set $A_\Pi$. Given the resolution level $m$ and an i.i.d. sample $(X_i,Y_i)$, $i\le N$, with $X_i\sim\hat\Pi_A$, let
$$\hat\eta_{m,A}(x):=\sum_{i:\,R_i\cap A_\Pi\ne\emptyset}\frac{\sum_{j=1}^N Y_j I_{R_i}(X_j)}{N\,\hat\Pi_A(R_i)}\,I_{R_i}(x).\tag{12}$$
Since we assumed that the marginal $\Pi$ is known, the estimator is well-defined. The following proposition provides information about the concentration of $\hat\eta_{m,A}$ around its mean:

Proposition 10 For all $t>0$,
$$\Pr\Big(\max_{x\in A_\Pi}\big|\hat\eta_{m,A}(x)-\eta_m(x)\big|\ge t\Big)\ \le\ 2d_m\exp\Big(-\frac{u_1 N\,2^{-dm}}{\Pi(A)}\cdot\frac{t^2}{2+\frac{2t}{3}}\Big).$$

Proof This is a straightforward application of Bernstein's inequality to the random variables $S^i_N:=\sum_{j=1}^N Y_j I_{R_i}(X_j)$, $i\in\{i:\ R_i\cap A_\Pi\ne\emptyset\}$, and the union bound: indeed, note that $\mathbb E\big(Y I_{R_i}(X)\big)^2=\hat\Pi_A(R_i)$, so that
$$\Pr\Big(\Big|S^i_N-N\!\int_{R_i}\eta\,d\hat\Pi_A\Big|\ge tN\hat\Pi_A(R_i)\Big)\le 2\exp\Big(-\frac{N\hat\Pi_A(R_i)\,t^2}{2+2t/3}\Big),$$
and the rest follows by simple algebra, using that $\hat\Pi_A(R_i)\ge u_1\frac{2^{-dm}}{\Pi(A)}$ by the $(u_1,u_2)$-regularity of $\Pi$.

Given a sequence of hypothesis classes $\mathcal G_m$, $m\ge 1$, define the index set
$$J(N):=\Big\{m\in\mathbb N:\ 1\le\dim\mathcal G_m\le\frac{N}{\log^2 N}\Big\}\tag{13}$$
— the set of possible resolution levels of an estimator based on $N$ classified observations (the upper bound corresponds to the fact that we want the estimator to be consistent). When talking about model selection procedures below, we will implicitly assume that the model index is chosen from the corresponding set $J$. The role of $\mathcal G_m$ will be played by $\mathcal F_m|_A$ for an appropriately chosen set $A$. We are now ready to present the active learning algorithm, followed by its detailed analysis (see Table 1).

Remark Note that on every iteration, Algorithm 1a uses the whole sample to select the resolution level $\hat m_k$ and to build the estimator $\hat\eta_k$. While being suitable for practical implementation, this is not convenient for theoretical analysis. We will prove the upper bounds for a slightly modified version: namely, on every iteration $k$ the labeled data is divided into two subsamples $S_{k,1}$ and $S_{k,2}$ of approximately equal size, $|S_{k,1}|\approx|S_{k,2}|\approx\frac12 N_k\Pi(\hat A_k)$. Then $S_{k,1}$ is used to select the resolution level $\hat m_k$ and $S_{k,2}$ to construct $\hat\eta_k$. We will call this modified version Algorithm 1b.
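For concreteness, here is a minimal sketch of the piecewise-constant estimator (12) in one dimension (ours; it assumes $\Pi$ is known, so that the values $\hat\Pi_A(R_i)$ can be supplied directly, and the function name is illustrative):

```python
import numpy as np

def piecewise_constant_estimate(X, Y, m, pi_hat_A):
    """Estimator (12) on [0, 1] at resolution m: for each dyadic cell R_i of
    length 2**-m, sum the labels of the points falling into R_i and normalize
    by N * pi_hat_A[i], where pi_hat_A[i] = Pi(R_i) / Pi(A) is assumed known.
    Returns one value per cell (cells with pi_hat_A[i] == 0 are left at 0)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    N, n_cells = len(X), 2**m
    cells = np.minimum((X * n_cells).astype(int), n_cells - 1)
    sums = np.bincount(cells, weights=Y, minlength=n_cells)
    eta_hat = np.zeros(n_cells)
    mask = pi_hat_A > 0
    eta_hat[mask] = sums[mask] / (N * pi_hat_A[mask])
    return eta_hat

# Toy usage: A = [0, 1], Pi uniform, so pi_hat_A[i] = 2**-m for every cell.
rng = np.random.default_rng(0)
m, N = 3, 2000
X = rng.uniform(0, 1, N)
Y = np.where(rng.uniform(0, 1, N) < (1 + (2 * X - 1)) / 2, 1, -1)   # eta(x) = 2x - 1
print(piecewise_constant_estimate(X, Y, m, np.full(2**m, 2.0**-m)))
```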

Algorithm 1a
  input: label budget $N$; confidence $\alpha$;
  $\hat m_0:=0$; $\hat{\mathcal F}_0:=\mathcal F_{\hat m_0}$; $\hat\eta_0\equiv 0$;
  LB $:=N$;  // label budget
  $N_0:=2\log_2 N$;
  $s(m,N,\alpha):=m\log N+\log\frac1\alpha$;
  $k:=0$;
  while LB $\ge 0$ do
      $k:=k+1$;
      $N_k:=2N_{k-1}$;
      $\hat A_k:=\big\{x\in[0,1]^d:\ \exists f_1,f_2\in\hat{\mathcal F}_{k-1},\ \mathrm{sign}\,f_1(x)\ne\mathrm{sign}\,f_2(x)\big\}$;
      if $\hat A_k\cap\mathrm{supp}(\Pi)=\emptyset$ or LB $<N_k\Pi(\hat A_k)$ then
          break; output $\hat g:=\mathrm{sign}\,\hat\eta_{k-1}$
      else
          $\hat\Pi_k:=\Pi(dx\mid x\in\hat A_k)$;
          for $i=1,\dots,N_k\Pi(\hat A_k)$: sample i.i.d. $(X_i^k,Y_i^k)$ with $X_i^k\sim\hat\Pi_k$; end for;
          LB $:=$ LB $-\,N_k\Pi(\hat A_k)$;
          $\hat P_k:=\frac{1}{N_k\Pi(\hat A_k)}\sum_i\delta_{(X_i^k,Y_i^k)}$;  // active empirical measure
          $\hat m_k:=\underset{m\ge\hat m_{k-1}}{\mathrm{argmin}}\Big[\inf_{f\in\mathcal F_m}\hat P_k\big(Y-f(X)\big)^2+K_1\frac{2^{dm}\Pi(\hat A_k)+s(m\vee\hat m_{k-1},N,\alpha)}{N_k\Pi(\hat A_k)}\Big]$;
          $\hat\eta_k:=\hat\eta_{\hat m_k,\hat A_k}$;  // see (12)
          $\delta_k:=D\sqrt{\dfrac{2^{d\hat m_k}\log^2\frac N\alpha}{N_k}}$;
          $\hat{\mathcal F}_k:=\big\{f\in\mathcal F_{\hat m_k}:\ f|_{\hat A_k}\in\mathcal F_{\infty,\hat A_k}(\hat\eta_k;\delta_k),\ f|_{[0,1]^d\setminus\hat A_k}\equiv\hat\eta_{k-1}|_{[0,1]^d\setminus\hat A_k}\big\}$;
      end;

Table 1: Active Learning Algorithm

As a first step towards the analysis of Algorithm 1b, let us prove a useful fact about the general model selection scheme. Given an i.i.d. sample $(X_i,Y_i)$, $i\le N$, set $s_m=ms+\log\log_2 N$, $m\ge 1$, and
$$\hat m:=\hat m(s)=\underset{m\in J(N)}{\mathrm{argmin}}\Big[\inf_{f\in\mathcal F_m}P_N\big(Y-f(X)\big)^2+K_1\frac{2^{dm}+s_m}{N}\Big],$$
$$\bar m:=\min\Big\{m\ge 1:\ \inf_{f\in\mathcal F_m}\mathbb E\big(f(X)-\eta(X)\big)^2\le K_2\frac{2^{dm}}{N}\Big\}.$$

Theorem 11 There exists an absolute constant $K_1$ large enough such that, with probability at least $1-e^{-s}$,
$$\hat m\le\bar m.$$

Proof See Appendix B.
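To make the structure of Algorithm 1a concrete, the following is a highly simplified sketch of its main loop (ours, not the authors' implementation): it keeps only the doubling schedule, the restriction of sampling to the current active set, and the band update; it replaces the model-selection step by a fixed resolution $m$, works on $[0,1]$ with $\Pi$ uniform, and assumes an oracle `sample_labeled(cells, n)` that draws labeled pairs with $X$ distributed as $\Pi$ restricted to the given dyadic cells.

```python
import numpy as np

def active_plugin_learner(sample_labeled, N, m=4, D=1.0, alpha=0.05):
    """Simplified sketch of Algorithm 1a on [0, 1] with Pi uniform and a fixed
    resolution m (the real algorithm chooses m_k by penalized model selection)."""
    n_cells = 2**m
    active = np.ones(n_cells, dtype=bool)        # A_1 = [0, 1]
    eta_hat = np.zeros(n_cells)
    budget, n_k = N, int(2 * np.log2(N))          # LB and N_0
    while True:
        n_k *= 2                                  # N_k := 2 N_{k-1}
        n_request = int(np.ceil(n_k * active.mean()))   # ~ N_k * Pi(A_k)
        if not active.any() or budget < n_request:
            break                                 # output sign(eta_hat)
        X, Y = sample_labeled(np.flatnonzero(active), n_request)
        budget -= n_request
        cells = np.minimum((X * n_cells).astype(int), n_cells - 1)
        # Estimator (12) with Pi uniform: Pi_hat_A(R_i) = 1 / (number of active cells).
        sums = np.bincount(cells, weights=Y, minlength=n_cells)
        eta_hat[active] = sums[active] * active.sum() / n_request
        eta_hat = np.clip(eta_hat, -1.0, 1.0)     # keep |lambda_i| <= 1 as in F_m
        # Confidence band half-width and new active set (band crosses zero there).
        delta = D * np.sqrt(n_cells * np.log(N / alpha) ** 2 / n_k)
        active = active & (np.abs(eta_hat) <= delta)
    return np.sign(eta_hat), active
```

Outside the current active set the estimate is frozen, mirroring the constraint $f|_{[0,1]^d\setminus\hat A_k}\equiv\hat\eta_{k-1}$ in the definition of $\hat{\mathcal F}_k$.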

A straightforward application of this result immediately yields the following:

Corollary 12 Suppose $\eta(x)\in\Sigma(\beta,L,[0,1]^d)$. Then, with probability at least $1-e^{-s}$,
$$2^{\hat m}\le C_1 N^{\frac{1}{2\beta+d}}.$$

Proof By the definition of $\bar m$, we have
$$\bar m\le 1+\max\Big\{m:\ \inf_{f\in\mathcal F_m}\mathbb E\big(f(X)-\eta(X)\big)^2>K_2\frac{2^{dm}}{N}\Big\}\le 1+\max\Big\{m:\ L^2 2^{-2\beta m}>K_2\frac{2^{dm}}{N}\Big\},$$
and the claim follows.

With this bound in hand, we are ready to formulate and prove the main result of this section:

Theorem 13 Suppose that $P\in\mathcal P^*_U(\beta,\gamma)$ with $B_1\le\log N$, $B_2\ge\log^{-1}N$ and $\beta\gamma\le d$. Then, with probability at least $1-3\alpha$, the classifier $\hat g$ returned by Algorithm 1b with label budget $N$ satisfies
$$R_P(\hat g)-R^*\le \mathrm{Const}\cdot N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}\log^{p}\frac N\alpha,$$
where $p\le 2(1+\gamma)\frac{2\beta+d}{2\beta+d-\beta\gamma}$ and $B_1,B_2$ are the constants from (11) and Assumption 2.

Remarks
1. Note that when $\beta\gamma>\frac d3$, the rate $N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}$ is a fast rate, that is, faster than $N^{-1/2}$ (indeed, $\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}>\frac12$ is equivalent to $3\beta\gamma>d$); at the same time, the passive learning rate $N^{-\frac{\beta(1+\gamma)}{2\beta+d}}$ is guaranteed to be fast only when $\beta\gamma>\frac d2$, see Audibert and Tsybakov (2005).
2. For $\hat\alpha\simeq N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}$, Algorithm 1b returns a classifier $\hat g_{\hat\alpha}$ that satisfies
$$\mathbb E R_P(\hat g_{\hat\alpha})-R^*\le \mathrm{Const}\cdot N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}\log^{p}N.$$
This is a direct corollary of Theorem 13 and the inequality $\mathbb E|Z|\le t+\|Z\|_\infty\Pr(|Z|\ge t)$.

Proof Our main goal is to construct high probability bounds for the size of the active sets defined by Algorithm 1b. In turn, these bounds depend on the size of the confidence bands for $\eta(x)$, and the previous result (Theorem 11) is used to obtain the required estimates.

Suppose $L$ is the number of steps performed by the algorithm before termination; clearly, $L\le N$. Let $N_k^{act}:=N_k\Pi(\hat A_k)$ be the number of labels requested on the $k$-th step of the algorithm: this choice guarantees that the density of labeled examples doubles on every step.

Claim: the following bound for the size of the active set holds uniformly for all $2\le k\le L$ with probability at least $1-2\alpha$:
$$\Pi(\hat A_k)\le CN_k^{-\frac{\beta\gamma}{2\beta+d}}\log^{2\gamma}\frac N\alpha.\tag{14}$$

15 PLUG-IN APPROACH TO ACTIVE LEARNING It is not hard to finish the proof assuming 14 is true: indeed, it implies that the number of labels requested on step k satisfies N act k βγ = N k Π k C Nk log N 2γ α with probability 1 2α. Since Nk act N, one easily deduces that on the last iteration L we have k N L c N log 2γ N/α βγ To obtain the risk bound of the theorem from here, we apply 2 inequality 3 from Proposition 6: 15 R P ĝ R D 1 ˆη L η I{sign ˆη L sign η} 1+γ. 16 It remains to estimate ˆη L η,âl : we will show below while proving 14 that ˆη L η,âl C N β L log 2 N α. Together with 15 and 16, it implies the final result. To finish the proof, it remains to establish 14. Recall that η k stands for the L 2 Π - projection of η ontof ˆmk. An important role in the argument is played by the bound on the L 2 ˆΠ k - norm of the bias η k η: together with Assumption 2, it allows to estimate η k η,âk. The required bound follows from the following oracle inequality: there exists an eventb of probability 1 α such that on this event for every 1 k L [ η k η 2 L 2 ˆΠ k inf m ˆm k 1 inf f η 2 f F L 2 ˆΠ m k K 1 2 dm Π k +m ˆm k 1 logn/α N k Π k It general form, this inequality is given by Theorem 6.1 in Koltchinskii 2011 and provides the estimate for ˆη k η L2 ˆΠ k, so it automatically implies the weaker bound for the bias term only. To deduce 17, we use the mentioned general inequality L timesonce for every iteration and the union bound. The quantity 2 dm Π k in 17 plays the role of the dimension, which is justified below. Let k 1 be fixed. For m ˆm k 1, consider hypothesis classes { F m Âk := fiâk, f F m }. An obvious but important fact is that for P P U β,γ, the dimension off m Âk is bounded by u m Π k : indeed, Π k = ΠR j u 1 2 dm # { j : R j  k /0 }, j:r j  k /0 2. alternatively, inequality 4 can be used but results in a slightly inferior logarithmic factor. ]. 81

16 MINSKER hence dimf m Âk = # { j : R j  k /0 } u m Π k. 18 { } Theorem 11 applies conditionally on X j Nj i, j k 1 with sample of size Nact k and s = i=1 logn/α: to apply the theorem, note that, by definition of  k, it is independent of X k i, i=1...nk act. Arguing as in Corollary 12 and using 18, we conclude that the following inequality holds with probability 1 α N for every fixed k: 1 2 ˆm k C Nk. 19 LetE 1 be an event of probability 1 α such that on this event bound 19 holds for every step k, k L and lete 2 be an event of probability 1 α on which inequalities 17 are satisfied. Suppose that event E 1 E 2 occurs and let k 0 be a fixed arbitrary integer 2 k 0 L+1. It is enough to assume that  k0 1 is nonemptyotherwise, the bound trivially holds, so that it contains at least one cube with sidelength 2 ˆm k 0 2 and Consider inequality 17 with k=k 0 1 and 2 m N Π k0 1 u 1 2 d ˆm k 0 1 cn d k η k0 1 η 2 L 2 ˆΠ k0 1 1 k 0 1. By 20, we have CN 2β k 0 1 log 2 N α. 21 For convenience and brevity, denote Ω := suppπ. Now Assumption 2 comes into play: it implies, together with 21 that To bound CN β k 0 1 log N α η k 0 1 η L2 ˆΠ k0 1 B 2 η k0 1 η,ω Âk ˆη k0 1x η k0 1x,Ω Âk0 1 we apply Proposition 10. Recall that ˆm k0 1 depends only on the subsample S k0 1,1 but not on S k0 1,2. Let { { } } T k := X j i,y j N act j i, j k 1 S k,1 be the random vector that defines  k and resolution level ˆm k. Note that for any x, Proposition 10 thus implies Pr i=1 E ˆη k0 1x T k0 1 a.s. = η ˆmk0 1 x. max ˆη k0 1x η x Kt ˆmk0 1 x Ω Â k0 1 N exp 2 d ˆm k 0 1 N k0 1 T k 0 1. t t 3 C 3 82

17 PLUG-IN APPROACH TO ACTIVE LEARNING Choosing t = clogn/α and taking expectation, the inequalitynow unconditional becomes Pr max ˆη x η ˆmk0 1 ˆm x K 2 d ˆm k 0 1 log 2 N/α k0 1 1 α. 23 x Ω Â k0 1 N k0 1 LetE 3 be the event on which 23 holds true. Combined, the estimates 19,22 and 23 imply that one 1 E 2 E 3 η ˆη k0 1 η η,ω Âk0 1 k η,ω Âk0 1 k 0 1 ˆη k0 1,Ω Âk0 1 C N β k B 0 1 log N 2 α + K 2 d ˆm k 0 1 log 2 N/α 24 N k0 1 K+C N β k 0 1 log 2 N α, where we used the assumption B 2 log 1 N. Now the width of the confidence band is defined via δ k := 2K+C N β k 0 1 log 2 N α 25 in particular, D from Algorithm 1a is equal to 2K+ C. With the bound 24 available, it is straightforward to finish the proof of the claim. Indeed, by 25 and the definition of the active set, the necessary condition for x Ω Â k0 is ηx 3K+C N β k 0 1 log 2 N α, so that ΠÂ k0 =ΠΩ Â k0 Π ηx 3K+C N β k 0 1 log 2 N α BN βγ k 0 1 log 2γ N α. by the low noise assumption. This completes the proof of the claim since PrE 1 E 2 E 3 1 3α. We conclude this section by discussing running time of the active learning algorithm. Assume that the algorithm has access to the sampling subroutine that, given A [0,1] d with ΠA>0, generates i.i.d. X i,y i with X i Πdx x A. Proposition 14 The running time of Algorithm 1a1b with label budget N is OdN log 2 N. Remark In view of Theorem 13, the running time required to output a classifier ĝ such that R P ĝ R ε with probability 1 α is βγ O 1 β1+γ poly log 1. ε εα 83

Proof We will use the notation of Theorem 13. Let $N_k^{act}$ be the number of labels requested by the algorithm on step $k$. The resolution level $\hat m_k$ is always chosen such that $\hat A_k$ is partitioned into at most $N_k^{act}$ dyadic cubes, see (13). This means that the estimator $\hat\eta_k$ takes at most $N_k^{act}$ distinct values. The key observation is that for any $k$, the active set $\hat A_{k+1}$ is always represented as the union of a finite number (at most $N_k^{act}$) of dyadic cubes: to determine if a cube $R_j\subset\hat A_{k+1}$, it is enough to take a point $x\in R_j$ and compare $\mathrm{sign}\big(\hat\eta_k(x)-\delta_k\big)$ with $\mathrm{sign}\big(\hat\eta_k(x)+\delta_k\big)$: $R_j\subset\hat A_{k+1}$ only if the signs are different (so that the confidence band crosses the zero level). This can be done in $O(N_k^{act})$ steps. Next, the resolution level $\hat m_k$ can be found in $O(dN_k^{act}\log^2 N)$ steps: there are at most $\log_2 N_k^{act}$ models to consider; for each $m$, $\inf_{f\in\mathcal F_m}\hat P_k\big(Y-f(X)\big)^2$ is found explicitly and is achieved by the piecewise-constant function
$$\hat f(x)=\frac{\sum_{i}Y_i^k I_{R_j}(X_i^k)}{\sum_i I_{R_j}(X_i^k)},\qquad x\in R_j.$$
The sorting of the data required for this computation is done in $O(dN_k^{act}\log N)$ steps for each $m$, so the running time of the whole $k$-th iteration is $O(dN_k^{act}\log^2 N)$. Since $\sum_k N_k^{act}\le N$, the result follows.

4. Conclusion and Open Problems

We have shown that active learning can significantly improve the quality of a classifier over the passive algorithm for a large class of underlying distributions. The presented method achieves fast rates of convergence for the excess risk; moreover, it is adaptive (in a certain range of smoothness and noise parameters) and involves minimization only with respect to the quadratic loss (rather than the 0-1 loss). The natural question related to our results is: can we implement adaptive smooth estimators in the learning algorithm to extend our results beyond the case $\beta\le 1$? The answer to this question is so far an open problem. Our conjecture is that the correct rate of convergence for the excess risk is, up to logarithmic factors,
$$N^{-\frac{\beta(1+\gamma)}{2\beta+d-\gamma(\beta\wedge 1)}},$$
which coincides with the presented results for $\beta\le 1$. This rate can be derived from an argument similar to the proof of Theorem 13 under the assumption that on every step $k$ one could construct an estimator $\hat\eta_k$ with $\|\eta-\hat\eta_k\|_{\infty,\hat A_k}\lesssim N_k^{-\frac{\beta}{2\beta+d}}$. At the same time, the active set associated with $\hat\eta_k$ should maintain some structure which is suitable for the iterative nature of the algorithm. Transforming these ideas into a rigorous proof is a goal of our future work.

Acknowledgments

I want to express my sincere gratitude to my Ph.D. advisor, Dr. Vladimir Koltchinskii, for his support and numerous helpful discussions. I am grateful to the anonymous reviewers for carefully reading the manuscript. Their insightful and wise suggestions helped to improve the quality of the presentation and results. This project was supported by NSF Grant DMS and by the Algorithms and Randomness Center, Georgia Institute of Technology, through the ARC Fellowship.

19 PLUG-IN APPROACH TO ACTIVE LEARNING Appendix A. Functions Satisfying Assumption 2 In the propositions below, we will assume for simplicity that the marginal distribution Π is absolutely continuous with respect to Lebesgue measure with density px such that Given t 0,1], define A t :={x : ηx t}. 0< p 1 px p 2 < for all x [0,1] d. Proposition 15 Suppose η is Lipschitz continuous with Lipschitz constant S. Assume also that for some t > 0 we have a Π A t /3 > 0; b η is twice differentiable for all x A t ; c inf x At ηx 1 s>0; d sup x At D 2 ηx C< where is the operator norm. Then η satisfies Assumption 2. Proof By intermediate value theorem, for any cube R i, 1 i 2 dm there exists x 0 R i such that η m x=ηx 0, x R i. This implies On the other hand, if R i A t then ηx η m x = ηx ηx 0 = ηξ x x 0 ηx η m x = ηx ηx 0 = ηξ 1 x x 0 S 2 m. = ηx 0 x x [D2 ηξ]x x 0 x x 0 ηx 0 x x sup D 2 ηξ max x x ξ x R i ηx 0 x x 0 C 1 2 2m. Note that a strictly positive continuous function hy,u= u x y 2 dx [0,1] d achieves its minimal value h > 0 on a compact set[0,1] d { u R d : u 1 = 1 }. This impliesusing 26 and the inequality a b 2 a2 2 b2 Π 1 R i ηx η m x 2 pxdx R i 1 2 p 22 dm 1 ηx 0 x x 0 2 p 1 dx C12 2 4m R i 1 p 1 ηx p 12 2m h C12 2 4m c 2 2 2m for m m

20 MINSKER Now take a set A σf m, m m 0 from Assumption 2. There are 2 possibilities: either A A t or A A t /3. In the first case the computation above implies η η m 2 Πdx x A c 2 2 2m c 2 = S 2 S2 2 2m [0,1] d c 2 S 2 η η m 2,A. If the second case occurs, note that, since { x : 0< ηx < t } 3 has nonempty interior, it must contain a dyadic cube R with edge length 2 m. Then for any m maxm 0,m η η m 2 Πdx x A [0,1] d and the claim follows. Π 1 A R η η m 2 Πdx c 2 S 2 ΠR η η m 2,A c m ΠR The next proposition describes conditions which allow functions to have vanishing gradient on decision boundary but requires convexity and regular behaviour of the gradient. Everywhere below, η denotes the subgradient of a convex function η. For 0< t 1 < t 2, define Gt 1,t 2 := sup x At 2 \At 1 ηx 1 a representative that makes Gt 1,t 2 as small as possible. inf ηx 1. In case when ηx is not unique, we choose x At 2 \At 1 Proposition 16 Suppose ηx is Lipschitz continuous with Lipschitz constant S. Moreover, assume that there exists t > 0 and q :0, 0, such that A t 0,1 d and a b 1 t γ ΠA t b 2 t γ t < t ; b For all 0< t 1 < t 2 t, Gt 1,t 2 q t2 t1 ; c Restriction of η to any convex subset of A t is convex. Then η satisfies Assumption 2. Remark The statement remains valid if we replace η by η in c. Proof Assume that for some t t and k>0 R A t \ A t/k is a dyadic cube with edge length 2 m and let x 0 be such that η m x = ηx 0, x R. Note that η is convex on R due to c. Using the subgradient inequality ηx ηx 0 ηx 0 x x 0, we obtain ηx ηx 0 2 dπx ηx ηx 0 2 I{ ηx 0 x x 0 0}dΠx R R R ηx 0 x x 0 2 I { ηx 0 x x 0 0}dΠx

21 PLUG-IN APPROACH TO ACTIVE LEARNING The next step is to show that under our assumptions x 0 can be chosen such that dist x 0, R ν2 m, 28 where ν = νk is independent of m. In this case any part of R cut by a hyperplane through x 0 contains half of a ball Bx 0,r 0 of radius r 0 = νk2 m and the last integral in 27 can be further bounded below to get ηx ηx 0 2 dπx 1 2 ηx 0 x x 0 2 p 1 dx R Bx 0,r 0 ck ηx m 2 dm. 29 It remains to show 28. Assume that for all y such that ηy=ηx 0 we have dist y, R δ2 m for some δ>0. This implies that the boundary of the convex set {x R:ηx ηx 0 } is contained in R δ := {x R:dist x, R δ2 m }. There are two possibilities: either {x R:ηx ηx 0 } R\R δ or {x R:ηx ηx 0 } R δ. We consider the first case onlythe proof in the second case is similar. First, note that by b for all x R δ ηx 1 qk ηx 0 1 and ηx ηx 0 + ηx 1 δ2 m ηx 0 +qk ηx 0 1 δ2 m. 30 Let x c be the center of the cube R and u - the unit vector in direction ηx c. Observe that ηx c +1 3δ2 m u ηx c ηx c 1 3δ2 m u= On the other hand, x c +1 3δ2 m u R\R δ and =1 3δ2 m ηx c 2. ηx c +1 3δ2 m u ηx 0, hence ηx c ηx 0 c1 3δ2 m ηx c 1. Consequently, for all { x Bx c,δ := x : x x c 1 } 2 c2 m 1 3δ we have ηx ηx c + ηx c 1 x x c ηx c2 m 1 3δ ηx c

22 MINSKER Finally, recall that ηx 0 is the average value of η on R. Together with 30 and 31 this gives ΠRηx 0 = ηxdπ = ηxdπ+ ηxdπ R R\R δ R δ ηx 0 +qk ηx 0 1 δ2 m ΠR δ + +ηx 0 c 2 2 m 1 3δ ηx 0 1 ΠBx c,δ+ + ηx 0 ΠR\R δ Bx c,δ= = ΠRηx 0 +qk ηx 0 1 δ2 m ΠR δ c 2 2 m 1 3δ ηx 0 1 ΠBx c,δ. Since ΠR δ p 2 2 dm and ΠBx c,δ c 3 2 dm 1 3δ d, the inequality above implies c 4 qkδ 1 3δ d+1 c qk3d+4. which is impossible for small δ e.g., for δ< Let A be a set from condition 2. If A A t /3, then there exists a dyadic cube R with edge length 2 m such that R A t /3\ A t /k for some k > 0, and the claim follows from 29 as in Proposition 15. Assume now that A t A A 3t and 3t t. Condition a of the proposition implies that for any ε>0 we can choose kε>0 large enough so that ΠA\A t/k ΠA b 2 t/k γ ΠA b 2 b 1 k γ ΠA t 1 επa. 32 This means that for any partition of A into dyadic cubes R i with edge length 2 m at least half of them satisfy ΠR i \ A t/k 1 cεπr i. 33 LetI be the index set of cardinality I cπa2 dm 1 such that 33 is true for i I. Since R i A t/k is convex, there exists 3 z=zε N such that for any such cube R i there exists a dyadic sub-cube with edge length 2 m+z entirely contained in R i \ A t/k : T i R i \ A t/k A 3t \ A t/k. It follows that Π T i cεπa. Recall that condition b implies i sup ηx 1 x i T i q3k. inf ηx 1 x i T i Finally, sup ηx 2 is attained at the boundary point, that is for some x : ηx = 3t, and by x A 3t b sup ηx 1 d ηx 1 q3k d inf ηx 1. x A 3t x A 3t \A t/k 3. If, on the contrary, every sub-cube with edge length 2 m+z contains a point from A t/k, then A t/k must contain the convex hull of these points which would contradict 32 for large z. 88

23 PLUG-IN APPROACH TO ACTIVE LEARNING Application of 29 to every cube T i gives ηx η m+z x 2 dπx c 1 kπa I inf ηx m 2 dm x A 3t \A t/k i I T i concluding the proof. c 2 kπa sup x A 3t ηx m c 3 kπa η ηm 2,A Appendix B. Proof of Theorem 11 The main ideas of this proof, which significantly simplifies and clarifies initial author s version, are due to V. Koltchinskii. For convenience and brevity, let us introduce additional notations. Recall that s m = ms+loglog 2 N. Let 2 dm + s m τ N m,s := K 1, N 2 dm + s+loglog π N m,s := K 2 N 2. N ByE P F, f ore PN F, f we denote the excess risk of f F with respect to the true or empirical measure: E P F, f := Py fx 2 inf g F Py gx2, E PN F, f := P N y fx 2 inf g F P Ny gx 2. It follows from Theorem 4.2 in Koltchinskii 2011 and the union bound that there exists an eventb of probability 1 e s such that on this event the following holds for all m such that dm logn: E P F m, ˆf ˆm π N m,s, f F m, E P F m, f 2E PN F m, f π N m,s, 34 f F m, E PN F m, f 3 2 E PF m, f π N m,s. We will show that onb,{ˆm m} holds. Indeed, assume that, on the contrary, ˆm> m; by definition of ˆm, we have P N Y ˆf ˆm 2 + τ N ˆm,s P N Y ˆf m 2 + τ N m,s, which implies for K 1 big enough. By 34, E PN F ˆm, ˆf m τ N ˆm,s τ N m,s>3π N ˆm,s E PN F ˆm, ˆf m = inf f F m E PN F ˆm, f inf E P F ˆm, f π N ˆm,s, f F m

and combining the two inequalities above yields
$$\inf_{f\in\mathcal F_{\bar m}}\mathcal E_P(\mathcal F_{\hat m},f)>\pi_N(\hat m,s).\tag{35}$$
Since for any $m$ and $f$ we have $\mathcal E_P(\mathcal F_m,f)\le\mathbb E\big(f(X)-\eta(X)\big)^2$, the definition of $\bar m$ and (35) imply that
$$\pi_N(\bar m,s)\ \ge\ \inf_{f\in\mathcal F_{\bar m}}\mathbb E\big(f(X)-\eta(X)\big)^2\ >\ \pi_N(\hat m,s),$$
contradicting our assumption, hence proving the claim.

References

J.-Y. Audibert and A. B. Tsybakov. Fast learning rates for plug-in classifiers. Preprint, 2005.

M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In Proceedings of the Conference on Learning Theory, pages 45-56, 2008.

M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. System Sci., 75(1):78-89, 2009.

R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Trans. Inform. Theory, 54(5), 2008.

S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20. MIT Press, 2008.

S. Gaïffas. Sharp estimation in sup norm with random design. Statist. Probab. Lett., 77(8), 2007.

E. Giné and R. Nickl. Confidence bands in density estimation. Ann. Statist., 38(2), 2010.

S. Hanneke. Rates of convergence in active learning. Ann. Statist., 39(1), 2011.

M. Hoffmann and R. Nickl. On adaptive inference and confidence bands. The Annals of Statistics, to appear.

V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. J. Mach. Learn. Res., 11, 2010.

V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011. Lectures from the 38th Probability Summer School held in Saint-Flour, 2008 (École d'Été de Probabilités de Saint-Flour).

M. G. Low. On nonparametric confidence intervals. Ann. Statist., 25(6), 1997.

A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.


More information

Supplementary Notes for W. Rudin: Principles of Mathematical Analysis

Supplementary Notes for W. Rudin: Principles of Mathematical Analysis Supplementary Notes for W. Rudin: Principles of Mathematical Analysis SIGURDUR HELGASON In 8.00B it is customary to cover Chapters 7 in Rudin s book. Experience shows that this requires careful planning

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance

The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance Marina Meilă University of Washington Department of Statistics Box 354322 Seattle, WA

More information

Rademacher Complexity Bounds for Non-I.I.D. Processes

Rademacher Complexity Bounds for Non-I.I.D. Processes Rademacher Complexity Bounds for Non-I.I.D. Processes Mehryar Mohri Courant Institute of Mathematical ciences and Google Research 5 Mercer treet New York, NY 00 mohri@cims.nyu.edu Afshin Rostamizadeh Department

More information

Boosting with Early Stopping: Convergence and Consistency

Boosting with Early Stopping: Convergence and Consistency Boosting with Early Stopping: Convergence and Consistency Tong Zhang Bin Yu Abstract Boosting is one of the most significant advances in machine learning for classification and regression. In its original

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 1. Minimax-Optimal Bounds for Detectors Based on Estimated Prior Probabilities

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 1. Minimax-Optimal Bounds for Detectors Based on Estimated Prior Probabilities SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 1 Minimax-Optimal Bounds for Detectors Based on Estimated Prior Probabilities Jiantao Jiao*, Lin Zhang, Member, IEEE and Robert D. Nowak, Fellow, IEEE

More information

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 1 Active Learning Most classic machine learning methods and the formal learning

More information

Regularity for Poisson Equation

Regularity for Poisson Equation Regularity for Poisson Equation OcMountain Daylight Time. 4, 20 Intuitively, the solution u to the Poisson equation u= f () should have better regularity than the right hand side f. In particular one expects

More information

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries Chapter 1 Measure Spaces 1.1 Algebras and σ algebras of sets 1.1.1 Notation and preliminaries We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by the empty set.

More information

DR.RUPNATHJI( DR.RUPAK NATH )

DR.RUPNATHJI( DR.RUPAK NATH ) Contents 1 Sets 1 2 The Real Numbers 9 3 Sequences 29 4 Series 59 5 Functions 81 6 Power Series 105 7 The elementary functions 111 Chapter 1 Sets It is very convenient to introduce some notation and terminology

More information

arxiv:cs/ v1 [cs.cc] 16 Aug 2006

arxiv:cs/ v1 [cs.cc] 16 Aug 2006 On Polynomial Time Computable Numbers arxiv:cs/0608067v [cs.cc] 6 Aug 2006 Matsui, Tetsushi Abstract It will be shown that the polynomial time computable numbers form a field, and especially an algebraically

More information

Nonparametric regression with martingale increment errors

Nonparametric regression with martingale increment errors S. Gaïffas (LSTA - Paris 6) joint work with S. Delattre (LPMA - Paris 7) work in progress Motivations Some facts: Theoretical study of statistical algorithms requires stationary and ergodicity. Concentration

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE

SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE FRANÇOIS BOLLEY Abstract. In this note we prove in an elementary way that the Wasserstein distances, which play a basic role in optimal transportation

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Part III. 10 Topological Space Basics. Topological Spaces

Part III. 10 Topological Space Basics. Topological Spaces Part III 10 Topological Space Basics Topological Spaces Using the metric space results above as motivation we will axiomatize the notion of being an open set to more general settings. Definition 10.1.

More information

Some Background Math Notes on Limsups, Sets, and Convexity

Some Background Math Notes on Limsups, Sets, and Convexity EE599 STOCHASTIC NETWORK OPTIMIZATION, MICHAEL J. NEELY, FALL 2008 1 Some Background Math Notes on Limsups, Sets, and Convexity I. LIMITS Let f(t) be a real valued function of time. Suppose f(t) converges

More information

General Power Series

General Power Series General Power Series James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University March 29, 2018 Outline Power Series Consequences With all these preliminaries

More information

A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION

A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION O. SAVIN. Introduction In this paper we study the geometry of the sections for solutions to the Monge- Ampere equation det D 2 u = f, u

More information

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin The Annals of Statistics 1996, Vol. 4, No. 6, 399 430 ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE By Michael Nussbaum Weierstrass Institute, Berlin Signal recovery in Gaussian

More information

On John type ellipsoids

On John type ellipsoids On John type ellipsoids B. Klartag Tel Aviv University Abstract Given an arbitrary convex symmetric body K R n, we construct a natural and non-trivial continuous map u K which associates ellipsoids to

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

Lebesgue Measure on R n

Lebesgue Measure on R n CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true 3 ohn Nirenberg inequality, Part I A function ϕ L () belongs to the space BMO() if sup ϕ(s) ϕ I I I < for all subintervals I If the same is true for the dyadic subintervals I D only, we will write ϕ BMO

More information

Fourier Transform & Sobolev Spaces

Fourier Transform & Sobolev Spaces Fourier Transform & Sobolev Spaces Michael Reiter, Arthur Schuster Summer Term 2008 Abstract We introduce the concept of weak derivative that allows us to define new interesting Hilbert spaces the Sobolev

More information

Decoupling Lecture 1

Decoupling Lecture 1 18.118 Decoupling Lecture 1 Instructor: Larry Guth Trans. : Sarah Tammen September 7, 2017 Decoupling theory is a branch of Fourier analysis that is recent in origin and that has many applications to problems

More information

GAUSSIAN MEASURE OF SECTIONS OF DILATES AND TRANSLATIONS OF CONVEX BODIES. 2π) n

GAUSSIAN MEASURE OF SECTIONS OF DILATES AND TRANSLATIONS OF CONVEX BODIES. 2π) n GAUSSIAN MEASURE OF SECTIONS OF DILATES AND TRANSLATIONS OF CONVEX BODIES. A. ZVAVITCH Abstract. In this paper we give a solution for the Gaussian version of the Busemann-Petty problem with additional

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

Minimax Analysis of Active Learning

Minimax Analysis of Active Learning Journal of Machine Learning Research 6 05 3487-360 Submitted 0/4; Published /5 Minimax Analysis of Active Learning Steve Hanneke Princeton, NJ 0854 Liu Yang IBM T J Watson Research Center, Yorktown Heights,

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

A Generalized Sharp Whitney Theorem for Jets

A Generalized Sharp Whitney Theorem for Jets A Generalized Sharp Whitney Theorem for Jets by Charles Fefferman Department of Mathematics Princeton University Fine Hall Washington Road Princeton, New Jersey 08544 Email: cf@math.princeton.edu Abstract.

More information

Proof. We indicate by α, β (finite or not) the end-points of I and call

Proof. We indicate by α, β (finite or not) the end-points of I and call C.6 Continuous functions Pag. 111 Proof of Corollary 4.25 Corollary 4.25 Let f be continuous on the interval I and suppose it admits non-zero its (finite or infinite) that are different in sign for x tending

More information

Behaviour of Lipschitz functions on negligible sets. Non-differentiability in R. Outline

Behaviour of Lipschitz functions on negligible sets. Non-differentiability in R. Outline Behaviour of Lipschitz functions on negligible sets G. Alberti 1 M. Csörnyei 2 D. Preiss 3 1 Università di Pisa 2 University College London 3 University of Warwick Lars Ahlfors Centennial Celebration Helsinki,

More information

A LOWER BOUND ON BLOWUP RATES FOR THE 3D INCOMPRESSIBLE EULER EQUATION AND A SINGLE EXPONENTIAL BEALE-KATO-MAJDA ESTIMATE. 1.

A LOWER BOUND ON BLOWUP RATES FOR THE 3D INCOMPRESSIBLE EULER EQUATION AND A SINGLE EXPONENTIAL BEALE-KATO-MAJDA ESTIMATE. 1. A LOWER BOUND ON BLOWUP RATES FOR THE 3D INCOMPRESSIBLE EULER EQUATION AND A SINGLE EXPONENTIAL BEALE-KATO-MAJDA ESTIMATE THOMAS CHEN AND NATAŠA PAVLOVIĆ Abstract. We prove a Beale-Kato-Majda criterion

More information

Generalization, Overfitting, and Model Selection

Generalization, Overfitting, and Model Selection Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Activized Learning with Uniform Classification Noise

Activized Learning with Uniform Classification Noise Activized Learning with Uniform Classification Noise Liu Yang Machine Learning Department, Carnegie Mellon University Steve Hanneke LIUY@CS.CMU.EDU STEVE.HANNEKE@GMAIL.COM Abstract We prove that for any

More information

arxiv: v1 [math.co] 3 Nov 2014

arxiv: v1 [math.co] 3 Nov 2014 SPARSE MATRICES DESCRIBING ITERATIONS OF INTEGER-VALUED FUNCTIONS BERND C. KELLNER arxiv:1411.0590v1 [math.co] 3 Nov 014 Abstract. We consider iterations of integer-valued functions φ, which have no fixed

More information

Risk Bounds for CART Classifiers under a Margin Condition

Risk Bounds for CART Classifiers under a Margin Condition arxiv:0902.3130v5 stat.ml 1 Mar 2012 Risk Bounds for CART Classifiers under a Margin Condition Servane Gey March 2, 2012 Abstract Non asymptotic risk bounds for Classification And Regression Trees (CART)

More information

Only Intervals Preserve the Invertibility of Arithmetic Operations

Only Intervals Preserve the Invertibility of Arithmetic Operations Only Intervals Preserve the Invertibility of Arithmetic Operations Olga Kosheleva 1 and Vladik Kreinovich 2 1 Department of Electrical and Computer Engineering 2 Department of Computer Science University

More information

Tsybakov noise adap/ve margin- based ac/ve learning

Tsybakov noise adap/ve margin- based ac/ve learning Tsybakov noise adap/ve margin- based ac/ve learning Aar$ Singh A. Nico Habermann Associate Professor NIPS workshop on Learning Faster from Easy Data II Dec 11, 2015 Passive Learning Ac/ve Learning (X j,?)

More information

Integral Jensen inequality

Integral Jensen inequality Integral Jensen inequality Let us consider a convex set R d, and a convex function f : (, + ]. For any x,..., x n and λ,..., λ n with n λ i =, we have () f( n λ ix i ) n λ if(x i ). For a R d, let δ a

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning) A.

More information

SYMMETRY RESULTS FOR PERTURBED PROBLEMS AND RELATED QUESTIONS. Massimo Grosi Filomena Pacella S. L. Yadava. 1. Introduction

SYMMETRY RESULTS FOR PERTURBED PROBLEMS AND RELATED QUESTIONS. Massimo Grosi Filomena Pacella S. L. Yadava. 1. Introduction Topological Methods in Nonlinear Analysis Journal of the Juliusz Schauder Center Volume 21, 2003, 211 226 SYMMETRY RESULTS FOR PERTURBED PROBLEMS AND RELATED QUESTIONS Massimo Grosi Filomena Pacella S.

More information

ITERATED FUNCTION SYSTEMS WITH CONTINUOUS PLACE DEPENDENT PROBABILITIES

ITERATED FUNCTION SYSTEMS WITH CONTINUOUS PLACE DEPENDENT PROBABILITIES UNIVERSITATIS IAGELLONICAE ACTA MATHEMATICA, FASCICULUS XL 2002 ITERATED FUNCTION SYSTEMS WITH CONTINUOUS PLACE DEPENDENT PROBABILITIES by Joanna Jaroszewska Abstract. We study the asymptotic behaviour

More information

5. Some theorems on continuous functions

5. Some theorems on continuous functions 5. Some theorems on continuous functions The results of section 3 were largely concerned with continuity of functions at a single point (usually called x 0 ). In this section, we present some consequences

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 The Annals of Statistics 1997, Vol. 25, No. 6, 2512 2546 OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 By O. V. Lepski and V. G. Spokoiny Humboldt University and Weierstrass Institute

More information

The optimal partial transport problem

The optimal partial transport problem The optimal partial transport problem Alessio Figalli Abstract Given two densities f and g, we consider the problem of transporting a fraction m [0, min{ f L 1, g L 1}] of the mass of f onto g minimizing

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

Chapter One. The Calderón-Zygmund Theory I: Ellipticity

Chapter One. The Calderón-Zygmund Theory I: Ellipticity Chapter One The Calderón-Zygmund Theory I: Ellipticity Our story begins with a classical situation: convolution with homogeneous, Calderón- Zygmund ( kernels on R n. Let S n 1 R n denote the unit sphere

More information