arxiv: v2 [math.st] 2 Nov 2011

Size: px
Start display at page:

Download "arxiv: v2 [math.st] 2 Nov 2011"

Transcription

1 Plug-in Approach to Active Learning Stanislav Minsker, arxiv: v2 [math.st] 2 Nov 2011 Abstract: We present a new active learning algorithm based on nonparametric estimators of the regression function. Our investigation provides probabilistic bounds for the rates of convergence of the generalization error achievable by proposed method over a broad class of underlying distributions. We also prove minimax lower bounds which show that the obtained rates are almost tight. Keywords and phrases: Active learning, selective sampling, model selection, classification, confidence bands. 1. Introduction Let S, B be a measurable space and let X, Y S { 1, 1} be a random couple with unknown distribution P. The marginal distribution of the design variable X will be denoted by Π. Let ηx := EY X = x be the regression function. The goal of binary classification is to predict label Y based on the observation X. Prediction is based on a classifier - a measurable function f : S { 1, 1}. The quality of a classifier is measured in terms of its generalization error, Rf = Pr Y fx. In practice, the distribution P remains unknown but the learning algorithm has access to the training data - the i.i.d. sample X i, Y i, i = 1... n from P. It often happens that the cost of obtaining the training data is associated with labeling the observations X i while the pool of observations itself is almost unlimited. This suggests to measure the performance of a learning algorithm in terms of its label complexity, the number of labels Y i required to obtain a classifier with the desired accuracy. Active learning theory is mainly devoted to design and analysis of the algorithms that can take advantage of this modified framework. Most of these procedures can be characterized by the following property: at each step k, observation X k is sampled from a distribution ˆΠ k that depends on previously obtained X i, Y i, i k 1while passive learners obtain all available training data at the same time. ˆΠ k is designed to be supported on a set where classification is difficult and requires more labeled Partially supported by ARC Fellowship, NSF Grants DMS and CCF Mailing address: 686 Cherry street, School of Mathematics, Atlanta, GA

2 S. Minsker/Plug-in Approach 2 data to be collected. The situation when active learners outperform passive algorithms might occur when the so-called Tsybakov s low noise assumption is satisfied: there exist constants B, γ > 0 such that t > 0, Πx : ηx t Bt γ 1.1 This assumption provides a convenient way to characterize the noise level of the problem and will play a crucial role in our investigation. The topic of active learning is widely present in the literature; see Balcan et al. [3], Hanneke [7], Castro and Nowak [4] for review. It was discovered that in some cases the generalization error of a resulting classifier can converge to zero exponentially fast with respect to its label complexitywhile the best rate for passive learning is usually polynomial with respect to the cardinality of the training data set. However, available algorithms that adapt to the unknown parameters of the problemγ in Tsybakov s low noise assumption, regularity of the decision boundary involve empirical risk minimization with binary loss, along with other computationally hard problems, see Balcan et al. [2], Hanneke [7]. On the other hand, the algorithms that can be effectively implemented, as in Castro and Nowak [4], are not adaptive. The majority of the previous work in the field was done under standard complexity assumptions on the set of possible classifierssuch as polynomial growth of the covering numbers. Castro and Nowak [4] derived their results under the regularity conditions on the decision boundary and the noise assumption which is slightly more restrictive then 1.1. Essentially, they proved that if the decision boundary is a graph of the Hölder smooth function g Σβ, K, [0, 1] d 1 see section 2 for definitions and the noise assumption is satisfied with γ > 0, then the minimax lower bound for the expected excess risk of the active classifier is of order C N β1+γ 2β+γd 1 and the β1+γ upper bound is CN/ log N 2β+γd 1, where N is the label budget. However, the construction of the classifier that achieves an upper bound assumes β and γ to be known. In this paper, we consider the problem of active learning under classical nonparametric assumptions on the regression function - namely, we assume that it belongs to a certain Hölder class Σβ, K, [0, 1] d and satisfies to the low noise condition 1.1 with some positive γ. In this case, the work of Audibert and Tsybakov [1] showed that plug-in classifiers can attain optimal rates in the passive learning framework, namely, that the expected excess which is the optimal rate, where ˆη is the local polynomial estimator of the regression function and N is the size of the training data set. We were able to partially risk of a classifier ĝ = sign ˆη is bounded above by CN β1+γ 2β+d

3 S. Minsker/Plug-in Approach 3 extend this claim to the case of active learning: first, we obtain minimax lower bounds for the excess risk of an active classifier in terms of its label complexity. Second, we propose a new algorithm that is based on plug-in classifiers, attains almost optimal rates over a broad class of distributions and possesses adaptivity with respect to β, γwithin the certain range of these parameters. The paper is organized as follows: the next section introduces remaining notations and specifies the main assumptions made throughout the paper. This is followed by a qualitative description of our learning algorithm. The second part of the work contains the statements and proofs of our main results - minimax upper and lower bounds for the excess risk. 2. Preliminaries Our active learning framework is governed by the following rules: 1. Observations are sampled sequentially: X k is sampled from the modified distribution ˆΠ k that depends on X 1, Y 1,..., X k 1, Y k Y k is sampled from the conditional distribution P Y X X = x. Labels are conditionally independent given the feature vectors X i, i n. Usually, the distribution ˆΠ k is supported on a set where classification is difficult. Given the probability measure Q on S { 1, 1}, we denote the integral with respect to this measure by Qg := gd Q. Let F be a class of bounded, measurable functions. The risk and the excess risk of f F with respect to the measure Q are defined by R Q f := QI y sign fx E Q f := R Q f inf g F R Qg, where I A is the indicator of event A. We will omit the subindex Q when the underlying measure is clear from the context. Recall that we denoted the distribution of X, Y by P. The minimal possible risk with respect to P is R = inf Pr Y sign gx, g:s [ 1,1] where the infimum is taken over all measurable functions. It is well known that it is attained for any g such that sign gx = sign ηx Π - a.s. Given g F, A B, δ > 0, define F,A g; δ := {f F : f g,a δ},

4 S. Minsker/Plug-in Approach 4 where f g,a = sup fx gx. For A B, define the function class x A F A := {f A, f F} where f A x := fxi A x. From now on, we restrict our attention to the case S = [0, 1] d. Let K > 0. Definition 2.1. We say that g : R d R belongs to Σβ, K, [0, 1] d, the β, K, [0, 1] d - Hölder class of functions, if g is β times continuously differentiable and for all x, x 1 [0, 1] d satisfies gx 1 T x x 1 K x x 1 β, where T x is the Taylor polynomial of degree β of g at the point x. Definition 2.2. Pβ, γ is the class of probability distributions on [0, 1] d { 1, +1} with the following properties: 1. t > 0, Πx : ηx t Bt γ ; 2. ηx Σβ, K, [0, 1] d. We do not mention the dependence of Pβ, γ on the fixed constants B, K explicitly, but this should not cause any uncertainty. Finally, let us define PU β, γ and P Uβ, γ, the subclasses of Pβ, γ, by imposing two additional assumptions. Along with the formal descriptions of these assumptions, we shall try to provide some motivation behind them. The first deals with the marginal Π. For an integer M 1, let G M := { k1 M,..., k d M, k i = 1... M, i = 1... d be the regular grid on the unit cube [0, 1] d with mesh size M 1. It naturally defines a partition into a set of M d open cubes R i, i = 1... M d with edges of length M 1 and vertices in G M. Below, we consider the nested sequence of grids {G 2 m, m 1} and corresponding dyadic partitions of the unit cube. Definition 2.3. We will say that Π is u 1, u 2 -regular with respect to {G 2 m} if for any m 1, any element of the partition R i, i 2 dm such that R i suppπ, we have where 0 < u 1 u 2 <. Assumption 1. Π is u 1, u 2 - regular. u 1 2 dm Π R i u 2 2 dm. 2.1 }

5 S. Minsker/Plug-in Approach 5 In particular, u 1, u 2 -regularity holds for the distribution with a density p on [0, 1] d such that 0 < u 1 px u 2 <. Let us mention that our definition of regularity is of rather technical nature; for most of the paper, the reader might think of Π as being uniform on [0, 1] d however, we need slightly more complicated marginal to construct the minimax lower bounds for the excess risk. It is know that estimation of regression function in sup-norm is sensitive to the geometry of design distribution, mainly because the quality of estimation depends on the local amount of data at every point; conditions similar to our assumption 1 were used in the previous works where this problem appeared, e.g., strong density assumption in Audibert and Tsybakov [1] and assumption D in Gaïffas [5]. Another useful characteristic of u 1, u 2 - regular distribution Π is that this property is stable with respect to restrictions of Π to certain subsets of its support. This fact fits the active learning framework particularly well. Definition 2.4. We say that Q belongs to P U β, γ if Q Pβ, γ and assumption 1 is satisfied for some u 1, u 2. The second assumption is crucial in derivation of the upper bounds. The space of piecewise-constant functions which is used to construct the estimators of ηx is defined via 2 dm F m = λ i I Ri : λ i 1, i = dm, i=1 where {R i } 2dm i=1 forms the dyadic partition of the unit cube. Note that F m can be viewed as a -unit ball in the linear span of first 2 dm Haar basis functions in [0, 1] d. Moreover, {F m, m 1} is a nested family, which is a desirable property for the model selection procedures. By η m x we denote the L 2 Π - projection of the regression function onto F m. We will say that the set A [0, 1] d approximates the decision boundary {x : ηx = 0} if there exists t > 0 such that {x : ηx t} Π A Π {x : ηx 3t} Π, 2.2 where for any set A we define A Π := A suppπ. The most important example we have in mind is the following: let ˆη be some estimator of η with ˆη η,suppπ t, and define the 2t - band around η by ˆF = {f : ˆηx 2t fx ˆηx + 2t x [0, 1] d}

6 S. Minsker/Plug-in Approach 6 { Take A = x : f 1, f 2 ˆF } s.t. sign f 1 x sign f 2 x, then it is easy to see that A satisfies 2.2. Modified design distributions used by our algorithm are supported on the sets with similar structure. Let σf m be the sigma-algebra generated by F m and A σf m. Assumption 2. There exists B 2 > 0 such that for all m 1, A σf m satisfying 2.2 and such that A Π the following holds true: η η m 2 Πdx x A Π B 2 η η m 2,A Π [0,1] d β 1 Appearance of assumption 2 is motivated by the structure of our learning algorithm - namely, it is based on adaptive confidence bands for the regression function. Nonparametric confidence bands is a big topic in statistical literature, and the review of this subject is not our goal. We just mention that it is impossible to construct adaptive confidence bands of optimal size over the whole Σ β, K, [0, 1] d. Hoffmann and Nickl [8], Low [11] discuss the subject in details. However, it is possible to construct adaptive L 2 - confidence ballssee an example following Theorem 6.1 in Koltchinskii [10]. For functions satisfying assumption 2, this fact allows to obtain confidence bands of desired size. In particular, a functions that are differentiable, with gradient being bounded away from 0 in the vicinity of decision boundary; b Lipschitz continuous functions that are convex in the vicinity of decision boundary satisfy assumption 2. For precise statements, see Propositions A.1, A.2 in Appendix A. A different approach to adaptive confidence bands in case of one-dimensional density estimation is presented in Giné and Nickl [6]. Finally, we define PU β, γ: Definition 2.5. We say that Q belongs to P U β, γ if Q P Uβ, γ and assumption 2 is satisfied for some B 2 > Learning algorithm Now we give a brief description of the algorithm, since several definitions appear naturally in this context. First, let us emphasize that the marginal distribution Π is assumed to be known to the learner. This is not a restriction, since we are not limited in the use of unlabeled data and Π can be

7 S. Minsker/Plug-in Approach 7 estimated to any desired accuracy. Our construction is based on so-called plug-in classifiers of the form ˆf = sign ˆη, where ˆη is a piecewise-constant estimator of the regression function. As we have already mentioned above, it was shown in Audibert and Tsybakov [1] that in the passive learning framework plug-in classifiers attain optimal rate for the excess risk of order N β1+γ 2β+d, with ˆη being the local polynomial estimator. Our active learning algorithm iteratively improves the classifier by constructing shrinking confidence bands for the regression function. On every step k, the piecewise-constant estimator ˆη k is obtained via the model selection procedure which allows adaptation to the unknown smoothnessfor Hölder exponent 1. The estimator is further used to construct a confidence band ˆF k for ηx. The active set assosiated with ˆF k is defined as  k = A ˆF { k := x suppπ : f 1, f 2 ˆF } k, sign f 1 x sign f 2 x Clearly, this is the set where the confidence band crosses zero level and where classification is potentially difficult.  k serves as a support of the modified distribution ˆΠ k+1 : on step k + 1, label Y is requested only for observations X Âk, forcing the labeled data to concentrate in the domain where higher precision is needed. This allows one to obtain a tighter confidence band for the regression function restricted to the active set. Since Âk approaches the decision boundary, its size is controlled by the low noise assumption. The algorithm does not require a priori knowledge of the noise and regularity parameters, being adaptive for γ > 0, β 1. Further details are given in confidence band!x regression function 1 confidence band estimator of!x 1 ACTIVE SET Fig 1. Active Learning Algorithm section 3.2.

8 2.2. Comparison inequalities S. Minsker/Plug-in Approach 8 Before proceeding to the main results, let us recall the well-known connections between the binary risk and the, L2 Π - norm risks: Proposition 2.1. Under the low noise assumption, R P f R D 1 f ηi {sign f sign η} 1+γ ; 2.3 R P f R 21+γ 2+γ D 2 f ηi {sign f sign η} L 2 Π ; 2.4 R P f R D 3 Πsign f sign η 1+γ γ 2.5 Proof. For 2.3 and 2.4, see Audibert and Tsybakov [1], lemmas 5.1, 5.2 respectively, and for 2.5 Koltchinskii [10], lemma Main results The question we address below is: what are the best possible rates that can be achieved by active algorithms in our framework and how these rates can be attained Minimax lower bounds for the excess risk The goal of this section is to prove that for P Pβ, γ no active learner can output a classifier with expected excess risk converging to zero faster than N β1+γ 2β+d βγ. Our result builds upon the minimax bounds of Audibert and Tsybakov [1], Castro and Nowak [4]. Remark The theorem below is proved for a smaller class PU β, γ, which implies the result for Pβ, γ. Theorem 3.1. Let β, γ, d be such that βγ d. Then there exists C > 0 such that for all n large enough and for any active classifier ˆf n x we have sup ER P ˆf n R CN P PU β,γ β1+γ 2β+d βγ Proof. We proceed by constructing the appropriate family of classifiers f σ x = sign η σ x, in a way similar to Theorem 3.5 in Audibert and Tsybakov [1], and then apply Theorem 2.5 from Tsybakov [13]. We present it below for reader s convenience.

9 S. Minsker/Plug-in Approach 9 Theorem 3.2. Let Σ be a class of models, d : Σ Σ R - the pseudometric and {P f, f Σ} - a collection of probability measures associated with Σ. Assume there exists a subset {f 0,..., f M } of Σ such that 1. df i, f j 2s > 0 0 i < j M 2. P fj P f0 for every 1 j M 3. 1 M M j=1 KLP f j, P f0 α log M, 0 < α < 1 8 Then inf ˆf sup P f d ˆf, f s f Σ M 2α α M log M where the infimum is taken over all possible estimators of f based on a sample from P f and KL, is the Kullback-Leibler divergence. Going back to the proof, let q = 2 l, l 1 and { 2k1 1 G q :=,..., 2k } d 1, k i = 1... q, i = 1... d 2q 2q be the grid on [0, 1] d. For x [0, 1] d, let n q x = argmin { x x k 2 : x k G q } If n q x is not unique, we choose the one with smallest 2 norm. The unit cube is partitioned with respect to G q as follows: x 1, x 2 belong to the same subset if n q x 1 = n q x 2. Let be some order on the elements of G q such that x y implies x 2 y 2. Assume that the elements of the partition are enumerated with respect to the order of their centers induced by : [0, 1] d = q d i=1 R i. Fix 1 m q d and let S := m i=1 Note that the partition is ordered in such a way that there always exists 1 k q d with B + 0, k S B + 0, k + 3 d, 3.1 q q where B + 0, R := { x R d + : x 2 R }. In other words, 3.1 means that that the difference between the radii of inscribed and circumscribed spherical R i

10 sectors of S is of order Cdq 1. Let v > r 1 > r 2 be three integers satisfying S. Minsker/Plug-in Approach 10 2 v < 2 r 1 < 2 r 1 d < 2 r 2 d < Define ux : R R + by ux := x 1/2 Utdt 2 v Utdt 3.3 where Ut := { 1 exp 1/2 xx 2 v, x 2 v, else. Note that ux is an infinitely diffferentiable function such that ux = 1, x [0, 2 v ] and ux = 0, x 1 2. Finally, for x Rd let Φx := Cu x 2 where C := C L,β is chosen such that Φ Σβ, L, R d. Let r S := inf {r > 0 : B + 0, r S} and { } A 0 := R i : R i B + 0, r S + q βγ d = Note that i r S c m1/d, 3.4 q since Vol S = mq d. Define H m = {P σ : σ { 1, 1} m } to be the hypercube of probability distributions on [0, 1] d { 1, +1}. The marginal distribution Π of X is independent of σ: define its density p by px = 2 dr dr 1 r 2 1, x B c 0, x A 0, 0 else. z, 2 r 2 q \ B z, 2 r 1 q, z G q S, where B z, r := {x : x z r}, c 0 := 1 mq d VolA 0 note that ΠR i = q d i m and r 1, r 2 are defined in 3.2. In particular, Π satisfies

11 S. Minsker/Plug-in Approach 11 Fig 2. Geometry of the support assumption 1 since it is supported on the union of dyadic cubes and has bounded above and below on suppπ density. Let Ψx := u 1/2 q βγ d dist2 x, B + 0, r S, where u is defined in 3.3 and dist 2 x, A := inf { x y 2, y A}. Finally, the regression function η σ x = E Pσ Y X = x is defined via η σ x := { σi q β Φq[x n q x], x R i, 1 i m 1 C L,β d dist 2 x, B + 0, r S d γ Ψx, x [0, 1] d \ S. The graph of η σ is a surface consisting of small bumps spread around S and tending away from 0 monotonically with respect to dist 2, B + 0, r S on [0, 1] d \S. Clearly, η σ x satisfies smoothness requirement, since for x [0, 1] d dist 2 x, B + 0, r S = x 2 r S and d γ β by assumption. 1 Let s check that it also satisfies the low noise condition. Since η σ Cq β on support of Π, it is enough to consider 1 Ψx can be replaced by 1 unless βγ = d and β is an integer, in which case extra smoothness at the boundary of B +0, r S, provided by Ψ, is necessary.

12 S. Minsker/Plug-in Approach 12 t = Czq β for z > 1: Π η σ x Czq β mq d + Π dist 2 x, B + 0, r S Cz γ/d q βγ d mq d + C 2 r S + Cz γ/d q βγ d d mq d + C 3 mq d + C 4 z γ q βγ Ĉtγ, if mq d = Oq βγ. Here, the first inequality follows from considering η σ on S and A 0 separately, and second inequality follows from 3.4 and direct computation of the sphere volume. Finally, η σ satisfies assumption 2 with some B 2 := B 2 q since on suppπ 0 < c 1 q η σ x 2 c 2 q < The next step in the proof is to choose the subset of H which is wellseparated : this can be done due to the following factsee Tsybakov [13], Lemma 2.9: Proposition 3.1 Gilbert-Varshamov. For m 8, there exists {σ 0,..., σ M } { 1, 1} m such that σ 0 = {1, 1,..., 1}, ρσ i, σ j m 8 0 i < k M and M 2m/8 where ρ stands for the Hamming distance. Let H := {P σ0,..., P σm } be chosen such that {σ 0,..., σ M } satisfies the proposition above. Next, following the proof of Theorems 1 and 3 in Castro and Nowak [4], we note that σ H, σ σ 0 KLP σ,n P σ0,n 8N max x [0,1] η σx η σ0 x 2 32C 2 L,β Nq 2β, 3.5 where P σ,n is the joint distribution of X i, Y i N i=1 under hypothesis that the distribution of couple X, Y is P σ. Let us briefly sketch the derivation of 3.5; see also the proof of Theorem 1 in Castro and Nowak [4]. Denote X k := X 1,..., X k, Ȳ k := Y 1,..., Y k Then dp σ,n admits the following factorization: dp σ,n X N, ȲN = N P σ Y i X i dp X i X i 1, Ȳi 1, i=1

13 S. Minsker/Plug-in Approach 13 where dp X i X i 1, Ȳi 1 does not depend on σ but only on the active learning algorithm. As a consequence, KLP σ,n P σ0,n = E Pσ,N log dp σ,n X N, ȲN N dp σ0,n X n, ȲN = E i=1 P σ,n log P σy i X i N i=1 P σ 0 Y i X i = = N i=1 E Pσ,N [ E Pσ log P σy i X i N max x [0,1] d E P σ ] P σ0 Y i X i X i log P σy 1 X 1 P σ0 Y 1 X 1 X 1 = x 8N max x [0,1] dη σx η σ0 x 2, where the last inequality follows from Lemma 1, Castro and Nowak [4]. Also, note that we have max x [0,1] d in our bounds rather than the average over x that would appear in the passive learning framework. It remains to choose q, m in appropriate way: set q C 1 N 1 2β+d βγ and m = C 2 q d βγ where C 1, C 2 are such that q d m 1 and 32CL,β 2 Nq 2β < m 64 which is possible for N big enough. In particular, mq d = Oq βγ. Together with the bound 3.5, this gives 1 M σ H KLP σ P σ 0 32C 2 unq 2β < m 8 2 = 1 8 log H, so that conditions of Theorem 3.2 are satisfied. Setting we finally have σ 1 σ 2 H f σ x := sign η σ x, df σ1, f σ2 := Πsign η σ1 x sign η σ2 x m 8q d C 4N βγ 2β+d βγ, where the lower bound just follows by construction of our hypotheses. Since under the low noise assumption R P ˆf n R cπ ˆf n sign η 1+γ γ see 2.5, we conclude that inf sup Pr R P ˆf n R C 4 N β1+γ 2β+d βγ ˆf N P PU β,γ inf ˆf N sup Pr P PU β,γ Π ˆf n x sign η P x C 4 2 N βγ 2β+d βγ τ > 0.

14 3.2. Upper bounds for the excess risk S. Minsker/Plug-in Approach 14 Below, we present a new active learning algorithm which is computationally tractable, adaptive with respect to β, γin a certain range of these parameters and can be applied in the nonparametric setting. We show that the classifier constructed by the algorithm attains the rates of Theorem 3.1, up to polylogarithmic factor, if 0 < β 1 and βγ d the last condition covers the most interesting case when the regression function hits or crosses the decision boundary in the interior of the support of Π; for detailed statement about the connection between the behavior of the regression function near the decision boundary with parameters β, γ, see Proposition 3.4 in Audibert and Tsybakov [1]. The problem of adaptation to higher order of smoothness β > 1 is still awaiting its complete solution; we address these questions below in our final remarks. For the purpose of this section, the regularity assumption reads as follows: there exists 0 < β 1 such that x 1, x 2 [0, 1] d ηx 1 ηx 2 B 1 x 1 x 2 β 3.6 Since we want to be able to construct non-asymptotic confidence bands, some estimates on the size of constants in 3.6 and assumption 2 are needed. Below, we will additionally assume that B 1 log N B 2 log 1 N, where N is the label budget. This can be replaced by any known bounds on B 1, B 2. Let A σf m with A Π := A suppπ. Define ˆΠ A dx := Πdx x A Π and d m := dim F m AΠ. Next, we introduce a simple estimator of the regression function on the set A Π. Given the resolution level m and an iid sample X i, Y i, i N with X i ˆΠ A, let ˆη m,a x := N j=1 Y ji Ri X j N ˆΠ I Ri x 3.7 A R i i:r i A Π Since we assumed that the marginal Π is known, the estimator is welldefined. The following proposition provides the information about concentration of ˆη m around its mean:

15 Proposition 3.2. For all t > 0, Pr S. Minsker/Plug-in Approach 15 max ˆη m,a x η m x t x A Π 2d m exp 2 dm ΠA u 1 N t t 3 2 dm ΠA/u 1 N Proof. This is a straightforward application of the Bernstein s inequality to the random variables S i N := N Y j I Ri X j, i {i : R i A Π }, j=1 and the union bound: indeed, note that EY I Ri X j 2 = ˆΠ A R i, so that Pr SN i N ηdˆπ A tn ˆΠ A R i 2 exp N ˆΠ A R i t 2, R i 2 + 2t/3 and the rest follows by simple algebra using that ˆΠ A R i u 1, u 2 -regularity of Π., u 1 2 dm ΠA by the Given a sequence of hypotheses classes G m, m 1, define the index set { J N := m N : 1 dim G m N } log N - the set of possible resolution levels of an estimator based on N classified observationsan upper bound corresponds to the fact that we want the estimator to be consistent. When talking about model selection procedures below, we will implicitly assume that the model index is chosen from the corresponding set J. The role of G m will be played by F m A for appropriately chosen set A. We are now ready to present the active learning algorithm followed by its detailed analysissee Table 1. Remark Note that on every iteration, Algorithm 1a uses the whole sample to select the resolution level ˆm k and to build the estimator ˆη k. While being suitable for practical implementation, this is not convenient for theoretical analysis. We will prove the upper bounds for a slighly modified version: namely, on every iteration k labeled data is divided into two subsamples S k,1 and S k,2 of approximately equal size, S k,1 S k,2 1 2 N k ΠÂk.

16 S. Minsker/Plug-in Approach 16 Algorithm 1a input label budget N; confidence α; ˆm 0 = 0, ˆF0 := F ˆm0, ˆη 0 0; LB := N; // label budget N 0 := 2 log 2 N ; s k m, N, α := sm, N, α := mlog N + log 1 ; α k := 0; while LB 0 do k := k + 1; N k := { 2N k 1 ; Â k := x [0, 1] d : f 1, f 2 ˆF } k 1, sign f 1x sign f 2x ; if Âk suppπ = or LB < N k ΠÂk then break; output ĝ := sign ˆη k 1 else for i = 1... N k ΠÂk sample i.i.d X k i, Y k end for; LB := LB N k ΠÂk ; 1 ˆP k := N k ΠÂk i [ ˆm k := argmin m ˆmk 1 i δ k X i,y k i with X k i inf f Fm ˆη k := ˆη ˆmk // see 3.7,Âk δ k := D log 2 N 2 d ˆm k α N k ; { ˆF k := f F ˆmk : f Âk F, Â k ˆη k ; δ k, end; ˆΠ k := Πdx x Âk; // active empirical measure ˆPk Y fx 2 + K 1 2 dm ΠÂk+sm ˆm k 1,N,α N k ΠÂk Table 1 Active Learning Algorithm f [0,1] d \Âk ˆη k 1 [0,1] d \Âk } ; ] Then S 1,k is used to select the resolution level ˆm k and S k,2 - to construct ˆη k. We will call this modified version Algorithm 1b. As a first step towards the analysis of Algorithm 1b, let us prove the useful fact about the general model selection scheme. Given an iid sample X i, Y i, i N, set s m = ms + log log 2 N, m 1 and [ ˆm := ˆms = argmin m J N inf P N Y fx 2 2 dm ] + s m + K f F m N { m := min m 1 : inf EfX ηx 2 2 dm } K f F m N Theorem 3.3. There exist an absolute constant K 1 big enough such that, with probability 1 e s, ˆm m

17 Proof. See Appendix B. S. Minsker/Plug-in Approach 17 Straightforward application of this result immediately yields the following: Corollary 3.1. Suppose ηx Σβ, L, [0, 1] d. Then, with probability 1 e s, 2 ˆm C 1 N 1 2β+d Proof. By definition of m, we have { m 1 + max m : inf EfX ηx 2 2 dm } > K 2 f F m N 1 + max {m : L 2 2 2βm 2 dm } > K 2, N and the claim follows. With this bound in hand, we are ready to formulate and prove the main result of this section: Theorem 3.4. Suppose that P P U β, γ with B 1 log N, B 2 log 1 N and βγ d. Then, with probability 1 3α, the classifier ĝ returned by Algorithm 1b with label budget N satisfies R P ĝ R Const N β1+γ 2β+d βγ log p N α, where p 2βγ1+γ 2β+d βγ and B 1, B 2 are the constants from 3.6 and assumption 2. Remarks 1. Note that when βγ > d 3, N β1+γ 2β+d βγ is a fast rate, i.e., faster than N 1 2 ; at the same time, the passive learning rate N β1+γ 2β+d is guaranteed to be fast only when βγ > d 2, see Audibert and Tsybakov [1]. 2. For ˆα N β1+γ 2β+d βγ Algorithm 1b returns a classifier ĝˆα that satisfies ER P ĝˆα R Const N β1+γ 2β+d βγ log p N. This is a direct corollary of Theorem 3.4 and the inequality E Z t + Z Pr Z t

18 S. Minsker/Plug-in Approach 18 Proof. Our main goal is to construct high probability bounds for the size of the active sets defined by Algorithm 1b. In turn, these bounds depend on the size of the confidence bands for ηx, and the previous resulttheorem 3.3 is used to obtain the required estimates. Suppose L is the number of steps performed by the algorithm before termination; clearly, L N. Let Nk act := N k ΠÂk be the number of labels requested on k-th step of the algorithm: this choice guarantees that the density of labeled examples doubles on every step. Claim: the following bound for the size of the active set holds uniformly for all 2 k L with probability at least 1 2α: ΠÂk CN βγ 2β+d k log N 2γ 3.11 α It is not hard to finish the proof assuming 3.11 is true: indeed, it implies that the number of labels requested on step k satisfies N act k with probability 1 2α. Since k the last iteration L we have 2β+d βγ 2β+d = N k ΠÂk C Nk log N 2γ α N L c N act k N log 2γ N/α N, one easily deduces that on 2β+d 2β+d βγ 3.12 To obtain the risk bound of the theorem from here, we apply inequality from proposition 2.1: R P ĝ R D 1 ˆη L η I {sign ˆη L sign η} 1+γ 3.13 It remains to estimate ˆη L η, Â L : we will show below while proving 3.11 that ˆη L η, Â L C N β 2β+d L log 2 N α Together with 3.12 and 3.13, it implies the final result. To finish the proof, it remains to establish Recall that η k stands for the L 2 Π - projection of η onto F ˆmk. An important role in the argument is played by the bound on the L 2 ˆΠ k - norm of the bias η k η: 2 alternatively, inequality 2.4 can be used but results in a slightly inferior logarithmic factor.

19 S. Minsker/Plug-in Approach 19 together with assumption 2, it allows to estimate η k η,  k. The required bound follows from the following oracle inequality: there exists an event B of probability 1 α such that on this event for every 1 k L [ η k η 2 L 2 ˆΠ k inf m ˆm k 1 inf f η 2 f F m L 2 ˆΠ k K 1 2 dm ΠÂk + m ˆm k 1 logn/α N k ΠÂk It general form, this inequality is given by Theorem 6.1 in Koltchinskii [10] and provides the estimate for ˆη k η L2 ˆΠ k, so it automatically implies the weaker bound for the bias term only. To deduce 3.14, we use the mentioned general inequality L timesonce for every iteration and the union bound. The quantity 2 dm ΠÂk in 3.14 plays the role of the dimension, which is justified below. Let k 1 be fixed. For m ˆm k 1, consider hypothesis classes } F m Âk :=, f F {fiâk m An obvious but important fact is that for P P U β, γ, the dimension of F m Âk is bounded by u m ΠÂk: indeed, hence ΠÂk = j:r j Âk ΠR j u 1 2 dm # { } j : R j Âk, { } dim F m Âk = # j : R j Âk u m ΠÂk { Theorem 3.3 applies conditionally on X j i } Nj i=1 ], j k 1 with sample of size Nk act and s = logn/α: to apply the theorem, note that, by definition of  k, it is independent of X k i, i = 1... Nk act. Arguing as in Corollary 3.1 and using 3.15, we conclude that the following inequality holds with probability 1 α N for every fixed k: 1 2 ˆm k 2β+d C Nk Let E 1 be an event of probability 1 α such that on this event bound 3.16 holds for every step k, k L and let E 2 be an event of probability 1 α on which inequalities 3.14 are satisfied. Suppose that event E 1 E 2

20 S. Minsker/Plug-in Approach 20 occurs and let k 0 be a fixed arbitrary integer 2 k 0 L + 1. It is enough to assume that Âk 0 1 is nonemptyotherwise, the bound trivially holds, so that it contains at least one cube with sidelength 2 ˆm k 0 2 and ΠÂk 0 1 u 1 2 d ˆm k 0 1 cn 2β+d k Consider inequality 3.14 with k = k 0 1 and 2 m N have η k0 1 η 2 2β L 2 ˆΠ 2β+d CN k0 1 k 0 1 log 2 N α d 1 2β+d k 0 1. By 3.17, we 3.18 For convenience and brevity, denote Ω := suppπ. Now assumption 2 comes into play: it implies, together with 3.18 that CN β 2β+d k 0 1 log N α η k 0 1 η L2 ˆΠ k0 1 B 2 η k0 1 η,ω Â k To bound ˆη k0 1x η k0 1x,Ω Â k0 1 we apply Proposition 3.2. Recall that ˆm k0 1 depends only on the subsample S k0 1,1 but not on S k0 1,2. Let T k := { { X j i, Y j i } N act j i=1, j k 1; S k,1 be the random vector that defines Âk and resolution level ˆm k. Note that Eˆη k0 1x T k0 1 = η x ˆmk0 x a.s. 1 Proposition 3.2 thus implies Pr max ˆη k0 1x η x Kt 2 d ˆm k 0 1 ˆmk0 1 x Ω Âk 0 N 1 k0 1 T k 0 1 N exp } t t 3 C 3 Choosing t = c logn/α and taking expectation, the inequalitynow unconditional becomes Pr ˆη x η ˆmk0 1 ˆm x K 2 d ˆm k 0 1 log 2 N/α k0 1 α 1 N k0 1 max x Ω Âk

21 S. Minsker/Plug-in Approach 21 Let E 3 be the event on which 3.20 holds true. Combined, the estimates 3.16,3.19 and 3.20 imply that on E 1 E 2 E 3 η ˆη k0 1,Ω Â η η k0 k 1 0 1,Ω Â + η k0 k ˆη k0 1,Ω Â k0 1 C N β 2β+d B k 0 1 log N 2 α + K 2 d ˆm k 0 1 log 2 N/α N k K + C N β 2β+d k 0 1 log 2 N α where we used the assumption B 2 log 1 N. Now the width of the confidence band is defined via δ k := 2K + C N β 2β+d k 0 1 log 2 N α 3.22 in particular, D from Algorithm 1a is equal to 2K + C. With the bound 3.21 available, it is straightforward to finish the proof of the claim. Indeed, by 3.22 and the definition of the active set, the necessary condition for x Ω Âk 0 is so that ΠÂk 0 = ΠΩ Âk 0 Π ηx 3K + C N 2β+d k 0 1 log 2 N α, β ηx 3K + C N β 2β+d k 0 1 log 2 N α βγ 2β+d BN k 0 1 log 2γ N α by the low noise assumption. This completes the proof of the claim since Pr E 1 E 2 E 3 1 3α. We conclude this section by discussing running time of the active learning algorithm. Assume that the algorithm has access to the sampling subroutine that, given A [0, 1] d with ΠA > 0, generates i.i.d. X i, Y i with X i Πdx x A. Proposition 3.3. The running time of Algorithm 1a1b with label budget N is OdN log 2 N.

22 S. Minsker/Plug-in Approach 22 Remark In view of Theorem 3.4, the running time required to output a classifier ĝ such that R P ĝ R ε with probability 1 α is O 1 ε 2β+d βγ β1+γ poly log 1. εα Proof. We will use the notations of Theorem 3.4. Let Nk act be the number of labels requested by the algorithm on step k. The resolution level ˆm k is always chosen such that Âk is partitioned into at most Nk act dyadic cubes, see 3.8. This means that the estimator ˆη k takes at most Nk act distinct values. The key observation is that for any k, the active set Âk+1 is always represented as the union of a finite numberat most Nk act of dyadic cubes: to determine if a cube R j Âk+1, it is enough to take a point x R j and compare signˆη k x δ k with signˆη k x + δ k : R j Âk+1 only if the signs are differentso that the confidence band crosses zero level. This can be done in ONk act steps. Next, resolution level ˆm k can be found in ONk act log 2 N steps: there are at most log 2 Nk act models to consider; for each m, inf f Fm ˆPk Y fx 2 is found explicitly and is achieved for the piecewise-constant ˆfx = i Y k i I Rj X k i I R j X k i i, x R j. Sorting of the data required for this computation is done in OdNk act log N steps for each m, so the whole k-th iteration running time is OdNk act log 2 N. Since Nk act N, the result follows. k 4. Conclusion and open problems We have shown that active learning can significantly improve the quality of a classifier over the passive algorithm for a large class of underlying distributions. Presented method achieves fast rates of convergence for the excess risk, moreover, it is adaptivein the certain range of smoothness and noise parameters and involves minimization only with respect to quadratic lossrather than the 0 1 loss. The natural question related to our results is: Can we implement adaptive smooth estimators in the learning algorithm to extend our results beyond the case β 1?

23 S. Minsker/Plug-in Approach 23 The answer to this second question is so far an open problem. Our conjecture is that the correct rate of convergence for the excess risk is N β1+γ 2β+d γβ 1, up to logarithmic factors, which coincides with presented results for β 1. This rate can be derived from an argument similar to the proof of Theorem 3.4 under the assumption that on every step k one could construct an estimator ˆη k with η ˆη k, Â k N β 2β+d k. At the same time, the active set associated to ˆη k should maintain some structure which is suitable for the iterative nature of the algorithm. Transforming these ideas into a rigorous proof is a goal of our future work. Acknowledgements I want to express my deepest gratitude to my Ph.D. advisor, Dr. Vladimir Koltchinskii, for his support and numerous helpful discussions. I am grateful to the anonymous reviewers for carefully reading the manuscript. Their insightful and wise suggestions helped to improve the quality of presentation and results. I would like to acknowledge support for this project from the National Science Foundation NSF Grants DMS and CCF and by the Algorithms and Randomness Center, Georgia Institute of Technology, through the ARC Fellowship. Appendix A: Functions satisfying assumption 2 In the propositions below, we will assume for simplicity that the marginal distribution Π is absolutely continuous with respect to Lebesgue measure with density px such that 0 < p 1 px p 2 < for all x [0, 1] d A.1 Given t 0, 1], define A t := {x : ηx t}. Proposition A.1. Suppose η is Lipschitz continuous with Lipschitz constant S. Assume also that for some t > 0 we have a Π A t /3 > 0; b η is twice differentiable for all x A t ; c inf x At ηx 1 s > 0; d sup x At D 2 ηx C < where is the operator norm.

24 Then η satisfies assumption 2. S. Minsker/Plug-in Approach 24 Proof. By intermediate value theorem, for any cube R i, 1 i 2 dm there exists x 0 R i such that η m x = ηx 0, x R i. This implies ηx η m x = ηx ηx 0 = ηξ x x 0 ηξ 1 x x 0 S 2 m On the other hand, if R i A t then ηx η m x = ηx ηx 0 = = ηx 0 x x [D2 ηξ]x x 0 x x 0 ηx 0 x x sup D 2 ηξ max x x ξ x R i A.2 ηx 0 x x 0 C 1 2 2m Note that a strictly positive continuous function hy, u = u x y 2 dx [0,1] d achieves its minimal value h > 0 on a compact set [0, 1] d { u R d : u 1 = 1 }. This impliesusing A.2 and the inequality a b 2 a2 2 b2 Π 1 R i ηx η m x 2 pxdx R i 1 2 p 22 dm 1 ηx 0 x x 0 2 p 1 dx C12 2 4m R i 1 p 1 ηx p 12 2m h C12 2 4m c 2 2 2m for m m 0. 2 Now take a set A σf m, m m 0 from assumption 2. There are 2 possibilities: either A A t or A A t /3. In the first case the computation above implies η η m 2 Πdx x A c 2 2 2m = c 2 S 2 S2 2 2m [0,1] d c 2 S 2 η η m 2,A

25 S. Minsker/Plug-in Approach 25 If the second case occurs, note that, since { } x : 0 < ηx < t 3 has nonempty interior, it must contain a dyadic cube R with edge length 2 m. Then for any m maxm 0, m η η m 2 Πdx x A [0,1] d Π 1 A η η m 2 c 2 Πdx 4 2 2m ΠR and the claim follows. R c 2 S 2 ΠR η η m 2,A The next proposition describes conditions which allow functions to have vanishing gradient on decision boundary but requires convexity and regular behaviour of the gradient. Everywhere below, η denotes the subgradient of a convex function η. sup x A t2 \A t1 ηx 1 For 0 < t 1 < t 2, define Gt 1, t 2 := inf ηx 1. In case when ηx x A t2 \A t1 is not unique, we choose a representative that makes Gt 1, t 2 as small as possible. Proposition A.2. Suppose ηx is Lipschitz continuous with Lipschitz constant S. Moreover, assume that there exists t > 0 and q : 0, 0, such that A t 0, 1 d and a b 1 t γ ΠA t b 2 t γ t < t ; b For all 0 < t 1 < t 2 t, Gt 1, t 2 q t2 t1 ; c Restriction of η to any convex subset of A t is convex. Then η satisfies assumption 2. Remark The statement remains valid if we replace η by η in c. Proof. Assume that for some t t and k > 0 R A t \ A t/k is a dyadic cube with edge length 2 m and let x 0 be such that η m x = ηx 0, x R. Note that η is convex on R due to c. Using the subgradient

26 S. Minsker/Plug-in Approach 26 inequality ηx ηx 0 ηx 0 x x 0, we obtain ηx ηx 0 2 dπx ηx ηx 0 2 I { ηx 0 x x 0 0} dπx R R R ηx 0 x x 0 2 I { ηx 0 x x 0 0} dπx A.3 The next step is to show that under our assumptions x 0 can be chosen such that dist x 0, R ν2 m A.4 where ν = νk is independent of m. In this case any part of R cut by a hyperplane through x 0 contains half of a ball Bx 0, r 0 of radius r 0 = νk2 m and the last integral in A.3 can be further bounded below to get ηx ηx 0 2 dπx 1 ηx 0 x x 0 2 p 1 dx 2 R Bx 0,r 0 ck ηx m 2 dm A.5 It remains to show A.4. Assume that for all y such that ηy = ηx 0 we have dist y, R δ2 m for some δ > 0. This implies that the boundary of the convex set {x R : ηx ηx 0 } is contained in R δ := {x R : dist x, R δ2 m }. There are two possibilities: either {x R : ηx ηx 0 } R \ R δ or {x R : ηx ηx 0 } R δ. We consider the first case onlythe proof in the second case is similar. First, note that by b for all x R δ ηx 1 qk ηx 0 1 and ηx ηx 0 + ηx 1 δ2 m ηx 0 + qk ηx 0 1 δ2 m A.6 Let x c be the center of the cube R and u - the unit vector in direction ηx c. Observe that ηx c + 1 3δ2 m u ηx c ηx c 1 3δ2 m u = = 1 3δ2 m ηx c 2

27 S. Minsker/Plug-in Approach 27 On the other hand, x c + 1 3δ2 m u R \ R δ and ηx c + 1 3δ2 m u ηx 0, hence ηx c ηx 0 c1 3δ2 m ηx c 1. Consequently, for all { x Bx c, δ := x : x x c 1 } 2 c2 m 1 3δ we have ηx ηx c + ηx c 1 x x c ηx c2 m 1 3δ ηx c 1 A.7 Finally, recall that ηx 0 is the average value of η on R. Together with A.6,A.7 this gives ΠRηx 0 = ηxdπ = ηxdπ + ηxdπ R R δ R\R δ ηx 0 + qk ηx 0 1 δ2 m ΠR δ + + ηx 0 c 2 2 m 1 3δ ηx 0 1 Π Bx c, δ + + ηx 0 ΠR \ R δ Bx c, δ = = ΠRηx 0 + qk ηx 0 1 δ2 m ΠR δ c 2 2 m 1 3δ ηx 0 1 Π Bx c, δ Since ΠR δ p 2 2 dm and ΠBx c, δ c 3 2 dm 1 3δ d, the inequality above implies c 4 qkδ 1 3δ d+1 c qk3d+4. which is impossible for small δe.g., for δ < Let A be a set from condition 2. If A A t /3, then there exists a dyadic cube R with edge length 2 m such that R A t /3 \ A t /k for some k > 0, and the claim follows from A.5 as in proposition A.1. Assume now that A t A A 3t and 3t t. Condition a of the proposition implies that for any ε > 0 we can choose kε > 0 large enough so that ΠA \ A t/k ΠA b 2 t/k γ ΠA b 2 b 1 k γ ΠA t 1 επa A.8

28 S. Minsker/Plug-in Approach 28 This means that for any partition of A into dyadic cubes R i with edge length 2 m at least half of them satisfy ΠR i \ A t/k 1 cεπr i A.9 Let I be the index set of cardinality I cπa2 dm 1 such that A.9 is true for i I. Since R i A t/k is convex, there exists 3 z = zε N such that for any such cube R i there exists a dyadic sub-cube with edge length 2 m+z entirely contained in R i \ A t/k : T i R i \ A t/k A 3t \ A t/k. It follows that Π i T i cεπa. Recall that condition b implies sup ηx 1 x i T i q3k inf ηx 1 x i T i Finally, sup x A 3t ηx 2 is attained at the boundary point, that is for some x : ηx = 3t, and by b sup ηx 1 d ηx 1 q3k d inf ηx 1. x A 3t x A 3t \A t/k Application of A.5 to every cube T i gives ηx η m+z x 2 dπx c 1 kπa I inf ηx m 2 dm x A 3t \A i I t/k T i c 2 kπa sup x A 3t ηx m c 3 kπa η ηm 2,A concluding the proof. Appendix B: Proof of Theorem 3.3 The main ideas of this proof, which significantly simplifies and clarifies initial author s version, are due to V. Koltchinskii. For conveniece and brevity, let us introduce additional notations. Recall that s m = ms + log log 2 N 3 If, on the contrary, every sub-cube with edge length 2 m+z contains a point from A t/k, then A t/k must contain the convex hull of these points which would contradict A.8 for large z.

29 S. Minsker/Plug-in Approach 29 Let 2 dm + s m τ N m, s := K 1 N 2 dm + s + log log π N m, s := K 2 N 2 N By E P F, f or E PN F, f we denote the excess risk of f F with respect to the true or empirical measure: E P F, f := P y fx 2 inf P y gx2 g F E PN F, f := P N y fx 2 inf g F P Ny gx 2 It follows from Theorem 4.2 in Koltchinskii [10] and the union bound that there exists an event B of probability 1 e s such that on this event the following holds for all m such that dm log N: E P F m, ˆf ˆm π N m, s f F m, E P F m, f 2E PN F m, f π N m, s B.1 f F m, E PN F m, f 3 2 E P F m, f π N m, s. We will show that on B, { ˆm m} holds. Indeed, assume that, on the contrary, ˆm > m; by definition of ˆm, we have which implies P N Y ˆf ˆm 2 + τ N ˆm, s P N Y ˆf m 2 + τ N m, s, E PN F ˆm, ˆf m τ N ˆm, s τ N m, s > 3π N ˆm, s for K 1 big enough. By B.1, E PN F ˆm, ˆf m = inf f F m E PN F ˆm, f 3 2 and combination the two inequalities above yields inf f F m E P F ˆm, f > π N ˆm, s inf E P F ˆm, f π N ˆm, s, f F m B.2 Since for any m E P F m, f EfX ηx 2, the definition of m and B.2 imply that π N m, s inf f F m EfX ηx 2 > π N ˆm, s, contradicting our assumption, hence proving the claim.

30 S. Minsker/Plug-in Approach 30 References [1] J.-Y. Audibert and A. B. Tsybakov. Fast learning rates for plug-in classifiers. Preprint, Available at: publications/papers/05preprint_audtsy.pdf. [2] M.-F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. In COLT, pages 45 56, [3] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. System Sci., 751:78 89, [4] R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Trans. Inform. Theory, 545: , [5] S. Gaïffas. Sharp estimation in sup norm with random design. Statist. Probab. Lett., 778: , [6] E. Giné and R. Nickl. Confidence bands in density estimation. Ann. Statist., 382: , [7] S. Hanneke. Rates of convergence in active learning. Ann. Statist., 39 1: , [8] M. Hoffmann and R. Nickl. On adaptive inference and confidence bands. The Annals of Statistics, to appear, [9] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. J. Mach. Learn. Res., 11: , [10] V. Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems. Springer, Lectures from the 38th Probability Summer School held in Saint-Flour, 2008, École d Été de Probabilités de Saint-Flour. [11] M. G. Low. On nonparametric confidence intervals. Ann. Statist., 25 6: , [12] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 321: , [13] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, [14] A. W. van der Vaart and J. A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics.

Plug-in Approach to Active Learning

Plug-in Approach to Active Learning Plug-in Approach to Active Learning Stanislav Minsker Stanislav Minsker (Georgia Tech) Plug-in approach to active learning 1 / 18 Prediction framework Let (X, Y ) be a random couple in R d { 1, +1}. X

More information

Plug-in Approach to Active Learning

Plug-in Approach to Active Learning Journal of Machine Learning Research 13 2012 67-90 Submitted 2/11; Revised 9/11; Published 1/12 Plug-in Approach to Active Learning Stanislav Minsker 686 Cherry Street School of Mathematics Georgia Institute

More information

Fast learning rates for plug-in classifiers under the margin condition

Fast learning rates for plug-in classifiers under the margin condition Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,

More information

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk) 10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC

More information

Minimax Bounds for Active Learning

Minimax Bounds for Active Learning IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO. X, MONTH YEAR Minimax Bounds for Active Learning Rui M. Castro, Robert D. Nowak, Senior Member, IEEE, Abstract This paper analyzes the potential advantages

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

arxiv: v2 [math.ag] 24 Jun 2015

arxiv: v2 [math.ag] 24 Jun 2015 TRIANGULATIONS OF MONOTONE FAMILIES I: TWO-DIMENSIONAL FAMILIES arxiv:1402.0460v2 [math.ag] 24 Jun 2015 SAUGATA BASU, ANDREI GABRIELOV, AND NICOLAI VOROBJOV Abstract. Let K R n be a compact definable set

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu

More information

LECTURE 15: COMPLETENESS AND CONVEXITY

LECTURE 15: COMPLETENESS AND CONVEXITY LECTURE 15: COMPLETENESS AND CONVEXITY 1. The Hopf-Rinow Theorem Recall that a Riemannian manifold (M, g) is called geodesically complete if the maximal defining interval of any geodesic is R. On the other

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

Laplace s Equation. Chapter Mean Value Formulas

Laplace s Equation. Chapter Mean Value Formulas Chapter 1 Laplace s Equation Let be an open set in R n. A function u C 2 () is called harmonic in if it satisfies Laplace s equation n (1.1) u := D ii u = 0 in. i=1 A function u C 2 () is called subharmonic

More information

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Whitney s Extension Problem for C m

Whitney s Extension Problem for C m Whitney s Extension Problem for C m by Charles Fefferman Department of Mathematics Princeton University Fine Hall Washington Road Princeton, New Jersey 08544 Email: cf@math.princeton.edu Supported by Grant

More information

arxiv: v1 [math.co] 3 Nov 2014

arxiv: v1 [math.co] 3 Nov 2014 SPARSE MATRICES DESCRIBING ITERATIONS OF INTEGER-VALUED FUNCTIONS BERND C. KELLNER arxiv:1411.0590v1 [math.co] 3 Nov 014 Abstract. We consider iterations of integer-valued functions φ, which have no fixed

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin The Annals of Statistics 1996, Vol. 4, No. 6, 399 430 ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE By Michael Nussbaum Weierstrass Institute, Berlin Signal recovery in Gaussian

More information

C m Extension by Linear Operators

C m Extension by Linear Operators C m Extension by Linear Operators by Charles Fefferman Department of Mathematics Princeton University Fine Hall Washington Road Princeton, New Jersey 08544 Email: cf@math.princeton.edu Partially supported

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Unlabeled Data: Now It Helps, Now It Doesn t

Unlabeled Data: Now It Helps, Now It Doesn t institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster

More information

JUHA KINNUNEN. Harmonic Analysis

JUHA KINNUNEN. Harmonic Analysis JUHA KINNUNEN Harmonic Analysis Department of Mathematics and Systems Analysis, Aalto University 27 Contents Calderón-Zygmund decomposition. Dyadic subcubes of a cube.........................2 Dyadic cubes

More information

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April

More information

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3

Brownian Motion. 1 Definition Brownian Motion Wiener measure... 3 Brownian Motion Contents 1 Definition 2 1.1 Brownian Motion................................. 2 1.2 Wiener measure.................................. 3 2 Construction 4 2.1 Gaussian process.................................

More information

2. Function spaces and approximation

2. Function spaces and approximation 2.1 2. Function spaces and approximation 2.1. The space of test functions. Notation and prerequisites are collected in Appendix A. Let Ω be an open subset of R n. The space C0 (Ω), consisting of the C

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

S chauder Theory. x 2. = log( x 1 + x 2 ) + 1 ( x 1 + x 2 ) 2. ( 5) x 1 + x 2 x 1 + x 2. 2 = 2 x 1. x 1 x 2. 1 x 1.

S chauder Theory. x 2. = log( x 1 + x 2 ) + 1 ( x 1 + x 2 ) 2. ( 5) x 1 + x 2 x 1 + x 2. 2 = 2 x 1. x 1 x 2. 1 x 1. Sep. 1 9 Intuitively, the solution u to the Poisson equation S chauder Theory u = f 1 should have better regularity than the right hand side f. In particular one expects u to be twice more differentiable

More information

Nonparametric regression with martingale increment errors

Nonparametric regression with martingale increment errors S. Gaïffas (LSTA - Paris 6) joint work with S. Delattre (LPMA - Paris 7) work in progress Motivations Some facts: Theoretical study of statistical algorithms requires stationary and ergodicity. Concentration

More information

(convex combination!). Use convexity of f and multiply by the common denominator to get. Interchanging the role of x and y, we obtain that f is ( 2M ε

(convex combination!). Use convexity of f and multiply by the common denominator to get. Interchanging the role of x and y, we obtain that f is ( 2M ε 1. Continuity of convex functions in normed spaces In this chapter, we consider continuity properties of real-valued convex functions defined on open convex sets in normed spaces. Recall that every infinitedimensional

More information

Behaviour of Lipschitz functions on negligible sets. Non-differentiability in R. Outline

Behaviour of Lipschitz functions on negligible sets. Non-differentiability in R. Outline Behaviour of Lipschitz functions on negligible sets G. Alberti 1 M. Csörnyei 2 D. Preiss 3 1 Università di Pisa 2 University College London 3 University of Warwick Lars Ahlfors Centennial Celebration Helsinki,

More information

Supplementary Notes for W. Rudin: Principles of Mathematical Analysis

Supplementary Notes for W. Rudin: Principles of Mathematical Analysis Supplementary Notes for W. Rudin: Principles of Mathematical Analysis SIGURDUR HELGASON In 8.00B it is customary to cover Chapters 7 in Rudin s book. Experience shows that this requires careful planning

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

Active Learning: Disagreement Coefficient

Active Learning: Disagreement Coefficient Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which

More information

SYMMETRY RESULTS FOR PERTURBED PROBLEMS AND RELATED QUESTIONS. Massimo Grosi Filomena Pacella S. L. Yadava. 1. Introduction

SYMMETRY RESULTS FOR PERTURBED PROBLEMS AND RELATED QUESTIONS. Massimo Grosi Filomena Pacella S. L. Yadava. 1. Introduction Topological Methods in Nonlinear Analysis Journal of the Juliusz Schauder Center Volume 21, 2003, 211 226 SYMMETRY RESULTS FOR PERTURBED PROBLEMS AND RELATED QUESTIONS Massimo Grosi Filomena Pacella S.

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

Geometric intuition: from Hölder spaces to the Calderón-Zygmund estimate

Geometric intuition: from Hölder spaces to the Calderón-Zygmund estimate Geometric intuition: from Hölder spaces to the Calderón-Zygmund estimate A survey of Lihe Wang s paper Michael Snarski December 5, 22 Contents Hölder spaces. Control on functions......................................2

More information

GAUSSIAN MEASURE OF SECTIONS OF DILATES AND TRANSLATIONS OF CONVEX BODIES. 2π) n

GAUSSIAN MEASURE OF SECTIONS OF DILATES AND TRANSLATIONS OF CONVEX BODIES. 2π) n GAUSSIAN MEASURE OF SECTIONS OF DILATES AND TRANSLATIONS OF CONVEX BODIES. A. ZVAVITCH Abstract. In this paper we give a solution for the Gaussian version of the Busemann-Petty problem with additional

More information

Risk Bounds for CART Classifiers under a Margin Condition

Risk Bounds for CART Classifiers under a Margin Condition arxiv:0902.3130v5 stat.ml 1 Mar 2012 Risk Bounds for CART Classifiers under a Margin Condition Servane Gey March 2, 2012 Abstract Non asymptotic risk bounds for Classification And Regression Trees (CART)

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Constructions with ruler and compass.

Constructions with ruler and compass. Constructions with ruler and compass. Semyon Alesker. 1 Introduction. Let us assume that we have a ruler and a compass. Let us also assume that we have a segment of length one. Using these tools we can

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE

SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE SEPARABILITY AND COMPLETENESS FOR THE WASSERSTEIN DISTANCE FRANÇOIS BOLLEY Abstract. In this note we prove in an elementary way that the Wasserstein distances, which play a basic role in optimal transportation

More information

Topological properties

Topological properties CHAPTER 4 Topological properties 1. Connectedness Definitions and examples Basic properties Connected components Connected versus path connected, again 2. Compactness Definition and first examples Topological

More information

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989),

1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer 11(2) (1989), Real Analysis 2, Math 651, Spring 2005 April 26, 2005 1 Real Analysis 2, Math 651, Spring 2005 Krzysztof Chris Ciesielski 1/12/05: sec 3.1 and my article: How good is the Lebesgue measure?, Math. Intelligencer

More information

Part III. 10 Topological Space Basics. Topological Spaces

Part III. 10 Topological Space Basics. Topological Spaces Part III 10 Topological Space Basics Topological Spaces Using the metric space results above as motivation we will axiomatize the notion of being an open set to more general settings. Definition 10.1.

More information

Boosting with Early Stopping: Convergence and Consistency

Boosting with Early Stopping: Convergence and Consistency Boosting with Early Stopping: Convergence and Consistency Tong Zhang Bin Yu Abstract Boosting is one of the most significant advances in machine learning for classification and regression. In its original

More information

A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION

A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION A LOCALIZATION PROPERTY AT THE BOUNDARY FOR MONGE-AMPERE EQUATION O. SAVIN. Introduction In this paper we study the geometry of the sections for solutions to the Monge- Ampere equation det D 2 u = f, u

More information

Integral Jensen inequality

Integral Jensen inequality Integral Jensen inequality Let us consider a convex set R d, and a convex function f : (, + ]. For any x,..., x n and λ,..., λ n with n λ i =, we have () f( n λ ix i ) n λ if(x i ). For a R d, let δ a

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

A Bound on the Label Complexity of Agnostic Active Learning

A Bound on the Label Complexity of Agnostic Active Learning A Bound on the Label Complexity of Agnostic Active Learning Steve Hanneke March 2007 CMU-ML-07-103 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Machine Learning Department,

More information

Minimax Bounds for Active Learning

Minimax Bounds for Active Learning Minimax Bounds for Active Learning Rui M. Castro 1,2 and Robert D. Nowak 1 1 University of Wisconsin, Madison WI 53706, USA, rcastro@cae.wisc.edu,nowak@engr.wisc.edu, 2 Rice University, Houston TX 77005,

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries Chapter 1 Measure Spaces 1.1 Algebras and σ algebras of sets 1.1.1 Notation and preliminaries We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by the empty set.

More information

Lebesgue Measure on R n

Lebesgue Measure on R n CHAPTER 2 Lebesgue Measure on R n Our goal is to construct a notion of the volume, or Lebesgue measure, of rather general subsets of R n that reduces to the usual volume of elementary geometrical sets

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true

for all subintervals I J. If the same is true for the dyadic subintervals I D J only, we will write ϕ BMO d (J). In fact, the following is true 3 ohn Nirenberg inequality, Part I A function ϕ L () belongs to the space BMO() if sup ϕ(s) ϕ I I I < for all subintervals I If the same is true for the dyadic subintervals I D only, we will write ϕ BMO

More information

NECESSARY CONDITIONS FOR WEIGHTED POINTWISE HARDY INEQUALITIES

NECESSARY CONDITIONS FOR WEIGHTED POINTWISE HARDY INEQUALITIES NECESSARY CONDITIONS FOR WEIGHTED POINTWISE HARDY INEQUALITIES JUHA LEHRBÄCK Abstract. We establish necessary conditions for domains Ω R n which admit the pointwise (p, β)-hardy inequality u(x) Cd Ω(x)

More information

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 The Annals of Statistics 1997, Vol. 25, No. 6, 2512 2546 OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 By O. V. Lepski and V. G. Spokoiny Humboldt University and Weierstrass Institute

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Proof. We indicate by α, β (finite or not) the end-points of I and call

Proof. We indicate by α, β (finite or not) the end-points of I and call C.6 Continuous functions Pag. 111 Proof of Corollary 4.25 Corollary 4.25 Let f be continuous on the interval I and suppose it admits non-zero its (finite or infinite) that are different in sign for x tending

More information

C.7. Numerical series. Pag. 147 Proof of the converging criteria for series. Theorem 5.29 (Comparison test) Let a k and b k be positive-term series

C.7. Numerical series. Pag. 147 Proof of the converging criteria for series. Theorem 5.29 (Comparison test) Let a k and b k be positive-term series C.7 Numerical series Pag. 147 Proof of the converging criteria for series Theorem 5.29 (Comparison test) Let and be positive-term series such that 0, for any k 0. i) If the series converges, then also

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

Some Background Math Notes on Limsups, Sets, and Convexity

Some Background Math Notes on Limsups, Sets, and Convexity EE599 STOCHASTIC NETWORK OPTIMIZATION, MICHAEL J. NEELY, FALL 2008 1 Some Background Math Notes on Limsups, Sets, and Convexity I. LIMITS Let f(t) be a real valued function of time. Suppose f(t) converges

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization

More information

Regularity of the density for the stochastic heat equation

Regularity of the density for the stochastic heat equation Regularity of the density for the stochastic heat equation Carl Mueller 1 Department of Mathematics University of Rochester Rochester, NY 15627 USA email: cmlr@math.rochester.edu David Nualart 2 Department

More information

Local strong convexity and local Lipschitz continuity of the gradient of convex functions

Local strong convexity and local Lipschitz continuity of the gradient of convex functions Local strong convexity and local Lipschitz continuity of the gradient of convex functions R. Goebel and R.T. Rockafellar May 23, 2007 Abstract. Given a pair of convex conjugate functions f and f, we investigate

More information

REAL RENORMINGS ON COMPLEX BANACH SPACES

REAL RENORMINGS ON COMPLEX BANACH SPACES REAL RENORMINGS ON COMPLEX BANACH SPACES F. J. GARCÍA PACHECO AND A. MIRALLES Abstract. In this paper we provide two ways of obtaining real Banach spaces that cannot come from complex spaces. In concrete

More information

arxiv: v1 [math.fa] 14 Jul 2018

arxiv: v1 [math.fa] 14 Jul 2018 Construction of Regular Non-Atomic arxiv:180705437v1 [mathfa] 14 Jul 2018 Strictly-Positive Measures in Second-Countable Locally Compact Non-Atomic Hausdorff Spaces Abstract Jason Bentley Department of

More information

SOME CONVERSE LIMIT THEOREMS FOR EXCHANGEABLE BOOTSTRAPS

SOME CONVERSE LIMIT THEOREMS FOR EXCHANGEABLE BOOTSTRAPS SOME CONVERSE LIMIT THEOREMS OR EXCHANGEABLE BOOTSTRAPS Jon A. Wellner University of Washington The bootstrap Glivenko-Cantelli and bootstrap Donsker theorems of Giné and Zinn (990) contain both necessary

More information

Math212a1413 The Lebesgue integral.

Math212a1413 The Lebesgue integral. Math212a1413 The Lebesgue integral. October 28, 2014 Simple functions. In what follows, (X, F, m) is a space with a σ-field of sets, and m a measure on F. The purpose of today s lecture is to develop the

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Approximate Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning) A.

More information

General Power Series

General Power Series General Power Series James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University March 29, 2018 Outline Power Series Consequences With all these preliminaries

More information

ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS

ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS C. BESSENRODT AND S. VAN WILLIGENBURG Abstract. Confirming a conjecture made by Bessenrodt and Kleshchev in 1999, we classify

More information

A Generalized Sharp Whitney Theorem for Jets

A Generalized Sharp Whitney Theorem for Jets A Generalized Sharp Whitney Theorem for Jets by Charles Fefferman Department of Mathematics Princeton University Fine Hall Washington Road Princeton, New Jersey 08544 Email: cf@math.princeton.edu Abstract.

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

The optimal partial transport problem

The optimal partial transport problem The optimal partial transport problem Alessio Figalli Abstract Given two densities f and g, we consider the problem of transporting a fraction m [0, min{ f L 1, g L 1}] of the mass of f onto g minimizing

More information

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 1. Minimax-Optimal Bounds for Detectors Based on Estimated Prior Probabilities

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 1. Minimax-Optimal Bounds for Detectors Based on Estimated Prior Probabilities SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 1 Minimax-Optimal Bounds for Detectors Based on Estimated Prior Probabilities Jiantao Jiao*, Lin Zhang, Member, IEEE and Robert D. Nowak, Fellow, IEEE

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Analysis-3 lecture schemes

Analysis-3 lecture schemes Analysis-3 lecture schemes (with Homeworks) 1 Csörgő István November, 2015 1 A jegyzet az ELTE Informatikai Kar 2015. évi Jegyzetpályázatának támogatásával készült Contents 1. Lesson 1 4 1.1. The Space

More information

Integration on Measure Spaces

Integration on Measure Spaces Chapter 3 Integration on Measure Spaces In this chapter we introduce the general notion of a measure on a space X, define the class of measurable functions, and define the integral, first on a class of

More information

Continuity of convex functions in normed spaces

Continuity of convex functions in normed spaces Continuity of convex functions in normed spaces In this chapter, we consider continuity properties of real-valued convex functions defined on open convex sets in normed spaces. Recall that every infinitedimensional

More information

Generalization, Overfitting, and Model Selection

Generalization, Overfitting, and Model Selection Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How

More information

Computational and Statistical Learning theory

Computational and Statistical Learning theory Computational and Statistical Learning theory Problem set 2 Due: January 31st Email solutions to : karthik at ttic dot edu Notation : Input space : X Label space : Y = {±1} Sample : (x 1, y 1,..., (x n,

More information

A talk on Oracle inequalities and regularization. by Sara van de Geer

A talk on Oracle inequalities and regularization. by Sara van de Geer A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem

More information

Regularity for Poisson Equation

Regularity for Poisson Equation Regularity for Poisson Equation OcMountain Daylight Time. 4, 20 Intuitively, the solution u to the Poisson equation u= f () should have better regularity than the right hand side f. In particular one expects

More information

2 A Model, Harmonic Map, Problem

2 A Model, Harmonic Map, Problem ELLIPTIC SYSTEMS JOHN E. HUTCHINSON Department of Mathematics School of Mathematical Sciences, A.N.U. 1 Introduction Elliptic equations model the behaviour of scalar quantities u, such as temperature or

More information