Plug-in Approach to Active Learning


Stanislav Minsker (Georgia Tech)

Prediction framework

Let (X, Y) be a random couple in R^d × {−1, +1}: X is the observed quantity and the label Y is to be predicted. Prediction is carried out via a binary classifier g : S → {−1, +1}.

The joint distribution of (X, Y) will be denoted by P, and the distribution of X by Π.

η(x) := E(Y | X = x) is the regression function.

The generalization error of a binary classifier g is R(g) = Pr(Y ≠ g(X)).

Bayes classifier: g*(x) := sign η(x) has the smallest possible generalization error.
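To make the plug-in principle above concrete, here is a minimal Python sketch (illustration only, not part of the slides): any estimate eta_hat of the regression function induces a classifier by taking its sign, and its generalization error can be approximated on an independent sample. The particular eta used below is an arbitrary example.

import numpy as np

def plug_in_classifier(eta_hat):
    # Turn an estimate of eta(x) = E[Y | X = x] into a classifier g(x) = sign(eta_hat(x)).
    return lambda x: np.where(eta_hat(x) >= 0, 1, -1)

def empirical_error(g, X, Y):
    # Monte Carlo approximation of the generalization error R(g) = Pr(Y != g(X)).
    return np.mean(g(X) != Y)

# Example with a known regression function, so sign(eta) is the Bayes classifier.
rng = np.random.default_rng(0)
eta = lambda x: np.clip(1 - 2 * np.abs(x - 0.5), -1, 1) * np.sign(x - 0.25)
X = rng.uniform(0, 1, size=10_000)
Y = np.where(rng.uniform(size=X.shape) < (1 + eta(X)) / 2, 1, -1)   # P(Y = 1 | X) = (1 + eta(X)) / 2
print("estimated Bayes risk:", empirical_error(plug_in_classifier(eta), X, Y))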

Difference between Active and Passive Learning

In the passive learning framework we are given an i.i.d. sample (X_i, Y_i), i = 1, ..., n, as input. In practice, the cost of the training data is associated with labeling the observations, while the pool of observations itself is almost unlimited. Active learning algorithms try to take advantage of this modified framework: the goal is to construct a classifier with good generalization properties using a small number of labels.

Difference between Active and Passive Learning

In active learning, observations are sampled sequentially:
X_k is sampled from a modified distribution Π̂_k that depends on (X_1, Y_1), ..., (X_{k−1}, Y_{k−1});
Y_k is sampled from the conditional distribution P_{Y|X}(· | X = x);
labels are conditionally independent given the feature vectors X_i, i ≤ n;
Π̂_k is supported on a set where classification is difficult.

Previous work

1. Dasgupta, S., Hsu, D. and Monteleoni, C. A general agnostic active learning algorithm.
2. Castro, R. M. and Nowak, R. D. (2008). Minimax bounds for active learning.
3. Balcan, M.-F., Beygelzimer, A. and Langford, J. (2009). Agnostic active learning.
4. Hanneke, S. (2010). Rates of Convergence in Active Learning.
5. Koltchinskii, V. (2010). Rademacher Complexities and Bounding the Excess Risk in Active Learning.

Previous work

It was discovered that in some cases active learners can significantly outperform passive algorithms. In particular, this is the case when Tsybakov's low noise assumption is satisfied: there exist constants B, γ > 0 such that for all t > 0, Π(x : |η(x)| ≤ t) ≤ B t^γ.

Castro and Nowak: if the decision boundary {x ∈ R^d : η(x) = 0} is the graph of a Hölder smooth function g ∈ Σ(β, K, [0, 1]^{d−1}) and the noise assumption is satisfied with γ > 0, then

E R(ĝ_N) − R* ≲ N^{−β(1+γ)/(2β+γ(d−1))},

where N is the label budget. However, the construction of the classifier that achieves this bound assumes β and γ to be known.

Motivation and Assumptions

A good learning algorithm is expected to be computationally efficient and adaptive with respect to the underlying structure of the problem (smoothness and noise level). Existing algorithms do not possess these two properties simultaneously.

Motivation and Assumptions

Development of active learning algorithms that are adaptive and computationally tractable.
Obtaining minimax lower bounds for the excess risk of active learning over a broad class of underlying distributions.

Main tool: construct a nonparametric estimator η̂_n(·) of the regression function and use the plug-in classifier ĝ(·) = sign η̂_n(·).

Motivation and Assumptions

Assumptions:
1. X is supported in [0, 1]^d, and the marginal Π is regular (for example, dΠ(x) = p(x) dx with 0 < μ_1 ≤ p(x) ≤ μ_2 < ∞);
2. η(·) ∈ Σ(β, L, [0, 1]^d);
3. for all t > 0, Π(x : |η(x)| ≤ t) ≤ B t^γ.

In passive learning, under similar assumptions, the plug-in approach leads to classifiers that attain optimal rates of convergence:

C_1 N^{−β(1+γ)/(2β+d)} ≤ sup_{P ∈ P(β,γ)} [E R(ĝ_N) − R*] ≤ C_2 N^{−β(1+γ)/(2β+d)}

(J.-Y. Audibert, A. B. Tsybakov, Fast learning rates for plug-in classifiers).

Lower bound

Theorem. Let β, γ, d be such that βγ ≤ d. There exists C > 0 such that for all N large enough and for any active classifier ĝ_N(x),

sup_{P ∈ P(β,γ)} [E R_P(ĝ_N) − R*] ≥ C N^{−β(1+γ)/(2β+d−βγ)}.

Here, N is the number of labels used to construct ĝ_N rather than the size of the training data set. The proof combines techniques developed by Tsybakov and Audibert and by Castro and Nowak.

Example: β = 1, d = 2, γ = 1. Passive rate: N^{−1/2}. Active rate: N^{−2/3}.

Active learning algorithm

Suppose that assumptions (1), (2), (3) are satisfied and that Π, the marginal distribution of X, is known. Input: label budget N.

Use a small fraction of the data to construct an estimator that is close to η(x) in sup-norm; in our case, this is a piecewise polynomial estimator.
Construct a confidence band based on the obtained estimator.
Active set: the subset of [0, 1]^d where the confidence band crosses the zero level. The size of the active set is controlled by the low noise assumption.
Outside of the active set: correct classification with high confidence!

Next iteration:
Sample observations from Π̂_1(dx) := Π(dx | X ∈ Active Set);
construct a tighter confidence band;
repeat until the label budget is exhausted;
output ĝ = sign η̂_fin.

Active learning algorithm (r = 0)

input: label budget N; confidence α; minimal regularity 0 < ν ≤ 1
output: ĝ := sign η̂

ω := 1 + (4 + 2d)/(2ν);
k := 0; N_0 := N; LB := N − 2N_0;
for i = 1 to 2N_0 do
    sample i.i.d. (X_i^(0), Y_i^(0)) with X_i^(0) ~ Π;
S_{0,1} := {(X_i^(0), Y_i^(0)), i ≤ N_0},  S_{0,2} := {(X_i^(0), Y_i^(0)), N_0 + 1 ≤ i ≤ 2N_0};
m̂_0 := m̂(s, N_0; S_{0,1})   /* see equation (2) */;
η̂_0 := η̂_{m̂_0, [0,1]^d; S_{0,2}}   /* see equation (1) */;
while LB > 0 do
    k := k + 1;
    Â_k := {x ∈ [0,1]^d : ∃ f_1, f_2 ∈ F̂_{k−1} with sign(f_1(x)) ≠ sign(f_2(x))};
    if Â_k ∩ supp(Π) = ∅ then
        break
    else
        m̂_k := m̂_{k−1} + 1;
        τ_k := m̂_k / m̂_{k−1};
        N_k := N_{k−1}^{τ_k};
        for i = 1 to ⌈N_k Π(Â_k)⌉ do
            sample i.i.d. (X_i^(k), Y_i^(k)) with X_i^(k) ~ Π̂_k := Π(dx | x ∈ Â_k);
        S_k := {(X_i^(k), Y_i^(k)), i ≤ ⌈N_k Π(Â_k)⌉};
        η̂_k := η̂_{m̂_k, Â_k}   /* estimator based on S_k */;
        δ_k := D (log(N/α))^ω N_k^{−β/(2β+d)}   /* size of the confidence band */;
        F̂_k := {f ∈ F^0_{m̂_k} : f|_{Â_k} ∈ F_{Â_k}(η̂_k; δ_k), f|_{[0,1]^d \ Â_k} = η̂_{k−1}|_{[0,1]^d \ Â_k}};
        LB := LB − ⌈N_k Π(Â_k)⌉;
        η̂ := η̂_k   /* keeping track of the most recent estimator */;
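Below is a simplified one-dimensional Python caricature of this procedure, not the algorithm above: it fits a piecewise-constant estimate of η on the current active cells, forms a crude confidence band of half-width delta, and keeps only the cells where the band still crosses zero. The budget split, the band-width rule, and the sampler sample_labels are placeholders rather than the quantities defined in equations (1) and (2).

import numpy as np

def active_learning_sketch(sample_labels, label_budget, n_cells=32, band_const=2.0, seed=0):
    # sample_labels(x) must return labels in {-1, +1} for an array of points x in [0, 1].
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_cells + 1)
    active = np.ones(n_cells, dtype=bool)            # every dyadic cell starts active
    eta_hat = np.zeros(n_cells)
    spent, per_round = 0, max(label_budget // 4, 1)  # placeholder budget split
    while spent < label_budget and active.any():
        # sample only inside active cells, i.e. from Pi(dx | x in active set)
        cells = rng.choice(np.flatnonzero(active), size=per_round)
        x = rng.uniform(edges[cells], edges[cells + 1])
        y = sample_labels(x)
        spent += per_round
        for c in np.flatnonzero(active):             # piecewise-constant estimate per active cell
            hit = cells == c
            if hit.any():
                eta_hat[c] = y[hit].mean()
        counts = np.bincount(cells, minlength=n_cells)
        delta = band_const / np.sqrt(np.maximum(counts, 1))   # crude band half-width
        active &= np.abs(eta_hat) <= delta           # keep cells where the band crosses zero
    def classifier(x):
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_cells - 1)
        return np.where(eta_hat[idx] >= 0, 1, -1)
    return classifier

Run it with any label oracle, for example classifier = active_learning_sketch(lambda x: np.where(x > 0.4, 1, -1), label_budget=2000); the active set quickly concentrates around the point where the labels change sign.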

Rates of convergence

Let ĝ be the classifier output by the algorithm.

Theorem. Assume P ∈ P(β, γ). Then the following bound holds uniformly over all 0 < ν ≤ β ≤ r + 1 and γ > 0, with probability at least 1 − α:

R_P(ĝ) − R* ≤ Const · N^{−β(1+γ)/(2β+d−(β∧1)γ)} log^p(N/α),

where p = p(β, γ).

Remarks:
1. Fast rates (i.e., faster than N^{−1/2}) are attained when:
   passive learning: βγ > d/2;
   active learning: βγ > d/3 (if β ≤ 1), βγ > d/(2 + 1/β) if β > 1.
2. Matches the lower bound, up to log factors, when β ≤ 1.

Comments

Properties of the algorithm.
Positive:
1. Adaptation to unknown smoothness and noise level.
2. The algorithm is computationally tractable.
Negative:
1. Requires the minimal regularity ν as an input.
2. Currently, we are unaware of any effective way to certify that the resulting classifier has small excess risk.

Comments

For β > 1 there is a gap between the upper and lower bounds. Question: is this a flaw of the proof techniques, or is it possible to improve the upper bound under some extra assumptions?

What if the design distribution Π is discrete? Example: active learning on graphs.

Simulation

Model: Y_i = sign[f(X_i) + ε_i], ε_i ~ N(0, σ²), i = 1, ..., 34,
f(x) = x (1 + sin 5x) sin(4πx), σ² = 0.2.
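A short Python sketch of this data-generating model (the form of f is transcribed from the slide, and the uniform design on [0, 1] is an assumption made for illustration):

import numpy as np

def f(x):
    # Signal as read off the slide; treat the exact expression as illustrative.
    return x * (1 + np.sin(5 * x)) * np.sin(4 * np.pi * x)

def simulate(n=34, sigma2=0.2, seed=0):
    # Y_i = sign(f(X_i) + eps_i), eps_i ~ N(0, sigma2)
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=n)
    y = np.where(f(x) + eps >= 0, 1, -1)
    return x, y

X_sim, Y_sim = simulate()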

Proof details

Define the nested families of piecewise-polynomial functions

F^r_m := { f = Σ_{i=1}^{2^{dm}} q_i(x_1, ..., x_d) I_{R_i} },

where {R_i}_{i=1}^{2^{dm}} forms the dyadic partition of the unit cube and the q_i(x_1, ..., x_d) are polynomials of degree at most r in d variables. The corresponding estimator is

η̂_m(x) := Σ_i α̂_i φ_i(x),  with  α̂_i := (1/N) Σ_{j=1}^N Y_j φ_i(X_j).   (1)

This construction:
provides a simple structure for the active set;
gives good concentration around the mean.
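For r = 0 and d = 1 the estimator in (1) reduces to a dyadic histogram of the labels. A minimal Python sketch, under the simplifying assumption of a uniform design so that the normalized indicators phi_i = 2^{m/2} 1_{R_i} are orthonormal in L_2(Π):

import numpy as np

def dyadic_projection_estimator(X, Y, m):
    # Estimator (1) with r = 0, d = 1: dyadic partition of [0, 1] into 2**m cells,
    # basis phi_i = 2**(m/2) * indicator of cell R_i (orthonormal when X is uniform).
    n_cells, N = 2 ** m, len(X)
    cell = np.clip((np.asarray(X) * n_cells).astype(int), 0, n_cells - 1)
    phi_norm = np.sqrt(n_cells)                                              # 2**(m/2)
    alpha = phi_norm * np.bincount(cell, weights=Y, minlength=n_cells) / N   # alpha_i in (1)
    def eta_hat(x):
        idx = np.clip((np.asarray(x) * n_cells).astype(int), 0, n_cells - 1)
        return alpha[idx] * phi_norm                                         # sum_i alpha_i phi_i(x)
    return eta_hat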

Main steps of the proof

Estimate the size of the active set: controlled by the width of the confidence band and the low noise assumption.

Estimate the risk of the plug-in classifier on the active set, using the comparison inequality

R_P(f) − R* ≤ Const ‖(f − η) 1_{sign f ≠ sign η}‖_∞^{1+γ}.
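In symbols, using only the confidence band and the low noise assumption: if the band is valid on the active set, i.e. ‖η̂_k − η‖_∞ ≤ δ_k there, then two members of the band can disagree in sign only where |η̂_k| ≤ δ_k, hence only where |η| ≤ 2δ_k, so

Π(Â_{k+1}) ≤ Π(x : |η(x)| ≤ 2δ_k) ≤ B (2δ_k)^γ;

and the comparison inequality applied to the plug-in classifier ĝ_k = sign η̂_k gives

R_P(ĝ_k) − R* ≤ Const ‖(η̂_k − η) 1_{sign η̂_k ≠ sign η}‖_∞^{1+γ} ≤ Const δ_k^{1+γ}.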

How to choose the optimal resolution level?

Lepski-type method:

m̂ := m̂(t, N) = min{ m : ∀ l > m, ‖η̂_l − η̂_m‖_∞ ≤ K_1 t √(2^{dl} l / N) }.   (2)

Compare to the optimal resolution level

m̄ := min{ m : ‖η − η_m‖_∞ ≤ K_2 √(2^{dm} m / N) }.   (3)

Here, η_m is the L_2(Π) projection of η onto F^r_m.
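A schematic Python version of this Lepski-type rule, reusing the dyadic_projection_estimator sketch above (the constant K1, the grid approximation of the sup-norm, and d = 1 are simplifications, not the exact quantities in (2) and (3)):

import numpy as np

def lepski_resolution(X, Y, t, K1=1.0, m_max=8):
    # Smallest m such that every finer estimate (l > m) stays within the threshold
    # K1 * t * sqrt(2**l * l / N) of the estimate at level m (schematic version of (2)).
    N = len(X)
    grid = np.linspace(0.0, 1.0, 1024)        # points used to approximate the sup-norm
    est = [dyadic_projection_estimator(X, Y, m)(grid) for m in range(m_max + 1)]
    for m in range(m_max + 1):
        if all(np.max(np.abs(est[l] - est[m])) <= K1 * t * np.sqrt(2 ** l * l / N)
               for l in range(m + 1, m_max + 1)):
            return m
    return m_max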

How to choose the optimal resolution level?

Assume that B_2 2^{−βm} ≤ ‖η − η_m‖_∞ ≤ B_1 2^{−βm}. The lower bound is crucial to control the bias.

Lemma. m̂ ∈ ( m̄ − (1/β)(log_2 t + log_2(B_1/B_2)) − h, m̄ ] with probability at least 1 − C 2^{d m̄} log N exp(−c t m̄), where h is some fixed positive number that depends on d, r, Π.

Confidence band

The previous result implies

‖η − η̂‖_∞ ≤ C (log(N/α))^ω N^{−β/(2β+d)}.

This allows us to construct a confidence band for η(x), and hence to control the size of the active set. The final bound for the risk follows after simple algebraic computations.

Thank you for your attention!