Robust Bounds for Classification via Selective Sampling


Nicolò Cesa-Bianchi, DSI, Università degli Studi di Milano, Italy (cesa-bianchi@dsi.unimi.it)
Claudio Gentile, DICOM, Università dell'Insubria, Varese, Italy (claudio.gentile@uninsubria.it)
Francesco Orabona, Idiap, Martigny, Switzerland (forabona@idiap.ch)

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We introduce a new algorithm for binary classification in the selective sampling protocol. Our algorithm uses Regularized Least Squares (RLS) as base classifier, and for this reason it can be efficiently run in any RKHS. Unlike previous margin-based semi-supervised algorithms, our sampling condition hinges on a simultaneous upper bound on bias and variance of the RLS estimate under a simple linear label noise model. This fact allows us to prove performance bounds that hold for an arbitrary sequence of instances. In particular, we show that our sampling strategy approximates the margin of the Bayes optimal classifier to any desired accuracy ε by asking Õ(d/ε²) queries (in the RKHS case d is replaced by a suitable spectral quantity). While these are the standard rates in the fully supervised i.i.d. case, the best previously known result in our harder setting was Õ(d³/ε⁴). Preliminary experiments show that some of our algorithms also exhibit good practical performance.

1 Introduction

A practical variant of the standard fully supervised online learning protocol is a setting where, at each prediction step, the learner can abstain from observing the current label. If the learner observes the label, which it can do by issuing a query, then the label value can be used to improve future predictions. If the label is predicted but not queried, then the learner never knows whether its prediction was correct. Thus, only queried labels are observed, while all others remain unknown. This protocol is often called selective sampling, and we interchangeably use "queried labels" and "sampled labels" to denote the labels observed by the learner.

Given a general online prediction technique, like Regularized Least Squares (RLS), we are interested in controlling the predictive performance as the query rate goes from fully supervised (all labels are queried) to fully unsupervised (no label is queried). This is motivated by observing that, in a typical practical scenario, one might want to control the accuracy of predictions while imposing an upper bound on the query rate. In fact, the number of observed labels usually has a very direct influence on basic computational aspects of online learning algorithms, such as running time and storage requirements.

In this work we develop semi-supervised variants of RLS for binary classification. We analyze these variants under no assumptions on the mechanism generating the sequence of instances, while imposing a simple linear noise model for the conditional label distribution. Intuitively, our algorithms issue a query when a common upper bound on bias and variance of the current RLS estimate is larger than a given threshold. Conversely, when this upper bound gets small, we infer via a simple large deviation argument that the margin of the RLS estimate on the current instance is close enough to the margin of the Bayes optimal classifier. Hence the learner can safely avoid issuing a query on that step.

In order to summarize our results, assume for the sake of simplicity that the Bayes optimal classifier (which for us is a linear classifier u) has unknown margin |u^T x_t| ≥ ε > 0 on all instances x_t ∈ R^d. Then, in our data model, the average per-step risk of the fully supervised RLS, asking N_T = T labels in T steps, is known to converge to the average risk of the Bayes optimal classifier at rate d ε^{-2} T^{-1} (excluding logarithmic factors). In this work we show that, using our semi-supervised RLS variant, we can replace N_T = T with any desired query bound N_T = T^κ, for 0 ≤ κ ≤ 1, while achieving a convergence rate of order d ε^{-2} T^{-1} + ε^{-2/κ} T^{-1}. One might wonder whether these results could also be obtained just by running a standard RLS algorithm with a constant label sampling rate, say T^{κ-1}, independent of the sequence of instances. If we could prove for RLS an instantaneous regret bound like d/T, then the answer would be affirmative. However, the lack of assumptions on the way instances are generated makes it hard to prove any nontrivial instantaneous regret bound.

If the margin ε is known or, equivalently, our goal is to approximate the Bayes margin to some accuracy ε, then we show that the above strategies achieve, with high probability, any desired accuracy ε by querying only order of d/ε² labels (excluding logarithmic factors). Again, the reader should observe that this bound could not be obtained by, say, concentrating all queries in an initial phase of length O(d/ε²). In such a case, an obvious adversarial strategy would be to generate noninformative instances just in that phase. In short, if we require online semi-supervised learning algorithms to work in worst-case scenarios, we need to resort to nontrivial label sampling techniques.

We have run comparative experiments on both artificial and real-world medium-size datasets. These experiments, though preliminary in nature, reveal the effectiveness of our sampling strategies even from a practical standpoint.

1.1 Related work and comparison

Selective sampling is a well-known semi-supervised online learning setting. Pioneering works in this area are (Cohn et al., 1990) and (Freund et al., 1997). More recent related results focusing on linear classification problems include (Balcan et al., 2006; Balcan et al., 2007; Cavallanti et al., 2009; Cesa-Bianchi et al., 2006b; Dasgupta et al., 2008; Dasgupta et al., 2005; Strehl & Littman, 2008), although some of these works analyze batch rather than online protocols. Most previous studies consider the case when instances are drawn i.i.d. from a fixed distribution, exceptions being the worst-case analysis in (Cesa-Bianchi et al., 2006b) and the very recent analysis in the KWIK learning protocol (Strehl & Littman, 2008). Both of these papers use variants of RLS working on arbitrary instance sequences. The work (Cesa-Bianchi et al., 2006b) is completely worst case: the authors make no assumptions whatsoever on the mechanism generating labels and instances; however, they are unable to prove bounds on the label query rate as we do here. The KWIK model of (Strehl & Littman, 2008) (see also the more general setup in Li et al., 2008) is closest to the setting considered in this paper. There the goal is to approximate the Bayes margin to within a given accuracy ε. The authors assume arbitrary sequences of instances and the same linear stochastic model for labels as the one considered here. A modification of the linear regressor in (Auer, 2002), combined with covering arguments, allows them to compete against an adaptive adversarial strategy for generating instances. Their algorithm, however, yields the significantly worse bound Õ(d³/ε⁴) on the number of queries, and seems to work in the finite dimensional (d < ∞) case only. In contrast, our algorithms achieve the better query bound Õ(d/ε²) against oblivious adversaries. Moreover, our algorithms can be easily run in any infinite-dimensional RKHS.
2 Preliminaries

In the selective sampling protocol for online binary classification, at each step t = 1, 2, ... the learner receives an instance x_t ∈ R^d and outputs a binary prediction for the associated unknown label y_t ∈ {-1, +1}. After each prediction the learner may observe the label y_t only by issuing a query. If no query is issued at time t, then y_t remains unknown. Since one expects the learner's performance to improve if more labels are observed, our goal is to trade off predictive accuracy against the number of queries.

All results proven in this paper hold for any fixed individual sequence x_1, x_2, ... of instances, under the sole assumption that ||x_t|| = 1 for all t ≥ 1. Given any such sequence, we assume the corresponding labels y_t ∈ {-1, +1} are realizations of random variables Y_t such that E[Y_t] = u^T x_t for all t ≥ 1, where u ∈ R^d is a fixed and unknown vector with ||u|| = 1. Note that sgn(Δ_t), for Δ_t = u^T x_t, is the Bayes optimal classifier for this noise model. We study selective sampling algorithms that use sgn(Δ̂_t) to predict Y_t. The quantity Δ̂_t = w_t^T x_t is a margin computed via the RLS estimate

    w_t = (I + S_{t-1} S_{t-1}^T + x_t x_t^T)^{-1} S_{t-1} Y_{t-1}        (1)

defined over the matrix S_{t-1} = [x'_1, ..., x'_{N_{t-1}}] of the N_{t-1} queried instances observed up to time t-1. The random vector Y_{t-1} = (Y'_1, ..., Y'_{N_{t-1}})^T contains the observed labels, so that Y'_k is the label of x'_k, and I is the identity matrix.
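To make the estimate (1) concrete, here is a minimal NumPy sketch (ours, not the authors' code) of the margin computation; the function name rls_margin and the convention that S stores the queried instances as columns and Y their labels are illustrative assumptions.

import numpy as np

def rls_margin(S, Y, x):
    # RLS margin of (1): w_t = (I + S S^T + x x^T)^{-1} S Y, returned value is w_t . x
    d = x.shape[0]
    A = np.eye(d) + S @ S.T + np.outer(x, x)   # A_t = I + S_{t-1} S_{t-1}^T + x_t x_t^T
    w = np.linalg.solve(A, S @ Y)              # solve A_t w = S_{t-1} Y_{t-1} instead of inverting
    return float(w @ x)                        # the learner predicts the sign of this value

With no stored queries (S of shape (d, 0)) the margin is 0, which matches the initialization w = 0 used by the algorithms below.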

We are interested in simultaneously controlling the cumulative regret

    R_T = Σ_{t=1}^T ( P(Y_t Δ̂_t < 0) - P(Y_t Δ_t < 0) )        (2)

and the number N_T of queried labels, uniformly over T.

Let A_t = I + S_{t-1} S_{t-1}^T + x_t x_t^T. We introduce the following relevant quantities:

    B_t = u^T (I + x_t x_t^T) A_t^{-1} x_t ,    r_t = x_t^T A_t^{-1} x_t ,
    q_t = S_{t-1}^T A_t^{-1} x_t ,              s_t = ||A_t^{-1} x_t|| .

The following properties of the RLS estimate (1) have been proven in, e.g., (Cesa-Bianchi et al., 2006a).

Lemma 1. For each t = 1, 2, ... the following inequalities hold:
1. E[Δ̂_t] = Δ_t - B_t;
2. |B_t| ≤ s_t + r_t;
3. s_t ≤ √r_t;
4. ||q_t||² ≤ r_t;
5. for all ε > 0, P(|Δ̂_t + B_t - Δ_t| ≥ ε) ≤ 2 exp(-ε² / (2 ||q_t||²));
6. if N_T is the total number of queries issued in the first T steps, then

    Σ_{t ≤ T : Y_t queried} r_t ≤ Σ_{i=1}^{d} ln(1 + λ_i) ≤ d ln(1 + N_T/d) ,

where λ_i is the i-th eigenvalue of the Gram matrix S_T^T S_T defined on the queried instances.

3 A new selective sampling algorithm

Our main theoretical result provides bounds on the cumulative regret and the number of queried labels for the selective sampling algorithm introduced in Algorithm 1. We call this algorithm the BBQ (Bound on Bias Query) algorithm.

Algorithm 1 The BBQ selective sampler
Parameters: 0 ≤ κ ≤ 1.
Initialization: weight vector w = 0.
for each time step t = 1, 2, ... do
    observe instance x_t ∈ R^d;
    predict label y_t ∈ {-1, +1} with sgn(w^T x_t);
    if r_t > t^{-κ} then
        query label y_t,
        update w_t using (x_t, y_t) as in (1)
    end if
end for

BBQ queries x_t whenever r_t is larger than a threshold vanishing as t^{-κ}, where 0 ≤ κ ≤ 1 is an input parameter. This simple query condition builds on Property 5 of Lemma 1. This property shows that Δ̂_t is likely to be close to the margin Δ_t of the Bayes optimal predictor when both the bias B_t and the variance bound ||q_t||² are small. Since these quantities are both bounded by functions of r_t (see Properties 2, 3, and 4 of Lemma 1), this suggests that one can safely disregard Y_t when r_t is small.

According to our noise model, the label of x_t is harder to predict if |Δ_t| is small. For this reason, our regret bound is split into a cumulative regret on big margin steps, where |Δ_t| ≥ ε, and small margin steps, where |Δ_t| < ε. On the one hand, we bound the regret on small margin steps simply by ε|T_ε|, where T_ε = {1 ≤ t ≤ T : |Δ_t| < ε}. On the other hand, we show that the overall regret can be bounded in terms of the best possible choice of ε, with no need for the algorithm to know this optimal value.

Theorem 1. If BBQ is run with input κ ∈ [0, 1], then its cumulative regret R_T after any number T of steps satisfies

    R_T ≤ min_{0 < ε < 1} { ε |T_ε| + (2 + e) [ (1/κ)! (8/ε²)^{1/κ} + (1 + 8/(e ε²)) d ln(1 + N_T/d) ] } .

The number of queried labels is N_T = O(d T^κ ln T).

It is worth observing that the bounds presented here hold in the finite dimensional (d < ∞) case only. One can easily turn them into bounds that hold in any RKHS after switching to an eigenvalue representation of the cumulative regret (e.g., by using the middle bound in Property 6 of Lemma 1 rather than the rightmost one, as we did in the proof below). This essentially corresponds to analyzing Algorithm 1 in a dual variable representation. A similar comment holds for Remark 1 and Theorem 2 below.
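As a reading aid for Algorithm 1, the following is a short Python sketch of the BBQ loop under the assumptions of Section 2 (instances with ||x_t|| = 1); the function name bbq and the label_oracle callback are our own illustrative choices, not part of the paper.

import numpy as np

def bbq(instances, label_oracle, kappa=0.5):
    # Sketch of Algorithm 1: predict with the RLS margin, query y_t when r_t > t^(-kappa).
    T, d = instances.shape
    S = np.empty((d, 0))                           # queried instances, stored as columns
    Y = np.empty(0)                                # their observed labels
    predictions, n_queries = [], 0
    for t in range(1, T + 1):
        x = instances[t - 1]
        A = np.eye(d) + S @ S.T + np.outer(x, x)   # A_t
        Ainv_x = np.linalg.solve(A, x)
        margin = float((S @ Y) @ Ainv_x)           # margin estimate x_t^T A_t^{-1} S_{t-1} Y_{t-1}
        predictions.append(1.0 if margin >= 0 else -1.0)
        r_t = float(x @ Ainv_x)                    # r_t = x_t^T A_t^{-1} x_t
        if r_t > t ** (-kappa):                    # query condition of Algorithm 1
            y_t = label_oracle(t - 1)              # the label is observed only on queried steps
            S = np.column_stack([S, x])
            Y = np.append(Y, y_t)
            n_queries += 1
    return predictions, n_queries

Setting kappa close to 1 makes the threshold t^{-κ} small and the sampler nearly fully supervised, while kappa close to 0 suppresses queries, matching the interpolation discussed above.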

In the rest of the paper we denote by 1{φ} the indicator function of a Boolean predicate φ.

Proof [of Theorem 1]: Fix any ε ∈ (0, 1). As in (Cavallanti et al., 2009), we first observe that our label noise model allows us to upper bound the time-t regret P(Y_t Δ̂_t < 0) - P(Y_t Δ_t < 0) as

    P(Y_t Δ̂_t < 0) - P(Y_t Δ_t < 0) ≤ ε 1{|Δ_t| < ε} + P(Δ̂_t Δ_t ≤ 0, |Δ_t| ≥ ε)
                                     ≤ ε 1{|Δ_t| < ε} + P(|Δ̂_t - Δ_t| ≥ ε) .

Hence the cumulative regret (2) can be split as follows:

    R_T ≤ ε |T_ε| + Σ_{t=1}^T P(|Δ̂_t - Δ_t| ≥ ε) .        (3)

We proceed by expanding the indicator of |Δ̂_t - Δ_t| ≥ ε with the introduction of the bias term B_t:

    1{|Δ̂_t - Δ_t| ≥ ε} ≤ 1{|Δ̂_t + B_t - Δ_t| ≥ ε/2} + 1{|B_t| > ε/2} .

Note that

    1{|B_t| > ε/2} ≤ 1{r_t > ε²/8} ≤ e exp(-ε²/(8 r_t)) ,

the first inequality deriving from a combination of Properties 2 and 3 in Lemma 1 and then overapproximating, whereas the second one uses 1{b < 1} ≤ e^{1-b}. Moreover, by Properties 4 and 5 in Lemma 1, we have that

    P(|Δ̂_t + B_t - Δ_t| ≥ ε/2) ≤ 2 exp(-ε²/(8 r_t)) .

We substitute this back into (3) and single out the steps where queries are issued. This gives

    R_T ≤ ε |T_ε| + (2 + e) Σ_{t : r_t ≤ t^{-κ}} exp(-ε²/(8 r_t)) + (2 + e) Σ_{t : r_t > t^{-κ}} exp(-ε²/(8 r_t)) .

The second term is bounded as follows:

    Σ_{t : r_t ≤ t^{-κ}} exp(-ε²/(8 r_t)) ≤ Σ_{t=1}^T exp(-ε² t^κ / 8)
        ≤ ∫_0^∞ exp(-ε² x^κ / 8) dx = (1/κ) (8/ε²)^{1/κ} Γ(1/κ) ,

where Γ is Euler's Gamma function Γ(x) = ∫_0^∞ e^{-t} t^{x-1} dt, and the last equality follows from the change of variable u = ε² x^κ / 8. We further bound (1/κ) Γ(1/κ) ≤ (1/κ)! using the monotonicity of Γ. For the third term we write

    Σ_{t : r_t > t^{-κ}} exp(-ε²/(8 r_t)) ≤ Σ_{t : r_t > t^{-κ}} (8/(e ε²)) r_t ≤ (8/(e ε²)) d ln(1 + N_T/d) .

The first step uses the inequality e^{-x} ≤ 1/(e x) for x > 0, while the second step uses Property 6 in Lemma 1. Finally, in order to derive a bound on the number N_T of queried labels, we have

    N_T ≤ Σ_{t : r_t > t^{-κ}} r_t t^κ ≤ T^κ Σ_{t : r_t > t^{-κ}} r_t ≤ T^κ d ln(1 + N_T/d) ,

where for the last inequality we used, once more, Property 6 in Lemma 1. Hence N_T = O(d T^κ ln T), and this concludes the proof.

It is important to observe that, if we disregard the margin term ε|T_ε| (which is fully controlled by the adversary), the regret bound depends only logarithmically on T for any constant κ > 0:

    R_T ≤ ε |T_ε| + O( 1/ε^{2/κ} + (d/ε²) ln T ) .

If κ is set to 1, then our bound on the number of queries N_T becomes vacuous, and the selective sampling algorithm essentially becomes fully supervised. This recovers the known regret bound for RLS in the fully supervised case, R_T ≤ ε|T_ε| + O( (d ln T)/ε² ).

Remark 1. A randomized variant of BBQ exists that queries label y_t with independent probability r_t^{(1-κ)/κ} ∈ [0, 1]. Through a bias-variance analysis similar to the one in Theorem 1 above, one can show that, in expectation over the internal randomization of this algorithm, the cumulative regret R_T is bounded by min_{0<ε<1} { ε|T_ε| + O( d L / ε^{2/κ} ) }, while the number of queried labels N_T is O( T^κ (d L)^{1-κ} ), where L = ln T. This bound is similar (though generally incomparable) to the one of Theorem 1.

4 A parametric performance guarantee

In the proof of Theorem 1 the quantity ε acts as a threshold for the cumulative regret, which is split into a sum over steps t such that |Δ_t| < ε, where the regret grows by less than ε, and a sum over the remaining steps.

Most technicalities in the proof are due to the fact that the final bound depends on the optimal choice of this ε, which the algorithm need not know. On the other hand, if a specific value for ε is provided as input to the algorithm, then the cumulative regret over steps t such that |Δ_t| ≥ ε can be bounded by any constant δ > 0 using only order of (d/ε²) ln(T/δ) queries. In particular, when min_t |Δ_t| ≥ ε, the above logarithmic bound implies that the per-step regret vanishes exponentially fast as a function of the number of queries. As we stated in the introduction, this result cannot be obtained as an easy consequence of known results, due to the adversarial nature of the instance sequence.

We now develop the above argument for a practically motivated variant of our BBQ selective sampler. Let us disregard for a moment the bias term B_t. In order to guarantee that |Δ̂_t - Δ_t| ≤ ε holds when no query is issued, it is enough to observe that Property 5 of Lemma 1 implies that |Δ̂_t - Δ_t| ≤ √(2 r_t ln(2/δ)) with probability at least 1 - δ. This immediately delivers a rule prescribing that no query be issued at time t when √(2 r_t ln(2/δ)) ≤ ε. A slightly more involved condition, one that better exploits the inequalities of Lemma 1, allows us to obtain a significantly improved practical performance. This results in the algorithm described in Algorithm 2. The algorithm, called Parametric BBQ, takes as input two parameters ε and δ, and issues a query at time t whenever

    [ε - r_t - s_t]_+ < ||q_t|| √(2 ln(2t(t+1)/δ)) .        (4)

(Here and throughout, [x]_+ = max{0, x}.)

Algorithm 2 The Parametric BBQ selective sampler
Parameters: 0 < ε, δ < 1.
Initialization: weight vector w = 0.
for each time step t = 1, 2, ... do
    observe instance x_t ∈ R^d;
    predict label y_t ∈ {-1, +1} with sgn(w^T x_t);
    if [ε - r_t - s_t]_+ < ||q_t|| √(2 ln(2t(t+1)/δ)) then
        query label y_t,
        update w_t using (x_t, y_t) as in (1)
    end if
end for

Theorem 2. If Parametric BBQ is run with input ε, δ ∈ (0, 1), then:
1. with probability at least 1 - δ, |Δ̂_t - Δ_t| ≤ ε holds on all time steps t when no query is issued;
2. the number N_T of queries issued after any number T of steps is bounded as

    N_T = O( (d/ε²) ln(T/δ) ln( (d/ε) ln(T/δ) ) ) .

This theorem has been phrased so as to make it easier to compare with a corresponding result in (Strehl & Littman, 2008) for the KWIK ("Knows What It Knows") framework. In that paper, the authors use a modification of Auer's upper confidence linear regression algorithm for associative reinforcement learning (Auer, 2002). This modification allows them to compete against any adaptive adversarial strategy generating the instance vectors x_t, but it yields the significantly worse bound Õ(d³/ε⁴) on N_T (in the KWIK setting, N_T is the number of times the prediction algorithm answers "I don't know"). Besides, their strategy seems to work in the finite dimensional (d < ∞) case only. In contrast, Parametric BBQ works against an oblivious adversary only, but it has the better bound N_T = Õ(d/ε²), with the Õ notation hiding a mild logarithmic dependence on T. Moreover, Parametric BBQ can be readily run in an infinite-dimensional (d = ∞) RKHS (recall the comment before the proof of Theorem 1). In fact, this is quite an important feature: the real-world experiments of Section 5 needed kernels in order to either attain a good empirical performance (on Adult) or use a reasonable amount of computational resources (on RCV1).

Remark 2. The bound on the number of queried labels in Theorem 2 is optimal up to logarithmic factors. In fact, it is possible to prove that there exists a sequence x_1, x_2, ... of instances and a number ε_0 > 0 such that: for all ε ≤ ε_0 and for any learning algorithm that issues N = O(d/ε²) queries, there exists a target vector u ∈ R^d and a time step t = Ω(d/ε²) for which the estimate Δ̂_t computed by the algorithm for Δ_t = u^T x_t has the property P(|Δ̂_t - Δ_t| > ε) = Ω(1). Hence, at least Ω(d/ε²) queries are needed to learn any target hyperplane with arbitrarily small accuracy and arbitrarily high confidence.
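The query test (4) can be sketched in a few lines of NumPy (again our own illustration under the conventions used above, with S holding the previously queried instances as columns; not the authors' implementation):

import numpy as np

def parametric_bbq_should_query(S, x, t, eps, delta):
    # Condition (4): query iff [eps - r_t - s_t]_+ < ||q_t|| sqrt(2 ln(2 t (t+1) / delta))
    d = x.shape[0]
    A = np.eye(d) + S @ S.T + np.outer(x, x)      # A_t
    Ainv_x = np.linalg.solve(A, x)
    r_t = float(x @ Ainv_x)                       # r_t = x_t^T A_t^{-1} x_t
    s_t = float(np.linalg.norm(Ainv_x))           # s_t = ||A_t^{-1} x_t||
    q_norm = float(np.linalg.norm(S.T @ Ainv_x))  # ||q_t|| = ||S_{t-1}^T A_t^{-1} x_t||
    threshold = q_norm * np.sqrt(2.0 * np.log(2.0 * t * (t + 1) / delta))
    return max(eps - r_t - s_t, 0.0) < threshold

Note that the test only involves quantities the learner can compute (r_t, s_t, and ||q_t||), never the unknown vector u.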
Proof [of Theorem 2]: Let I ⊆ {1, ..., T} be the set of time steps when a query is issued. Then, using Property 2 of Lemma 1, we can write

    Σ_{t ∉ I} 1{|Δ̂_t - Δ_t| > ε} ≤ Σ_{t ∉ I} 1{|Δ̂_t + B_t - Δ_t| > ε - |B_t|}
                                  ≤ Σ_{t ∉ I} 1{|Δ̂_t + B_t - Δ_t| > [ε - r_t - s_t]_+} .

We first take expectations on both sides, and then apply Property 5 of Lemma 1 along with condition (4), which on the steps where no query is issued can be rewritten as

    2 exp( -[ε - r_t - s_t]_+² / (2 ||q_t||²) ) ≤ δ / (t(t+1)) .

This gives

    Σ_{t ∉ I} P(|Δ̂_t - Δ_t| > ε) ≤ Σ_{t ∉ I} P(|Δ̂_t + B_t - Δ_t| > [ε - r_t - s_t]_+)
        ≤ Σ_{t ∉ I} 2 exp( -[ε - r_t - s_t]_+² / (2 ||q_t||²) )
        ≤ Σ_{t ∉ I} δ/(t(t+1)) ≤ Σ_{t=1}^∞ δ/(t(t+1)) = δ .

In order to derive a bound on the number N_T of queried labels, we proceed as follows. For every step t ∈ I in which a query was issued we can write

    ε - r_t - √r_t ≤ ε - r_t - s_t ≤ [ε - r_t - s_t]_+ < ||q_t|| √(2 ln(2t(t+1)/δ)) ≤ √(r_t) √(2 ln(2t(t+1)/δ)) ,

where we used Properties 3 and 4 of Lemma 1. Solving for r_t and overapproximating we obtain

    r_t ≥ ε² / ( 2ε + (1 + √(2 ln(2t(t+1)/δ)))² ) .        (5)

Similarly to the proof of Theorem 1, we then write

    N_T min_{t ∈ I} r_t ≤ Σ_{t ∈ I} r_t ≤ d ln(1 + N_T/d) .

Using (5) we get N_T = O( (d/ε²) ln(T/δ) ln( (d/ε) ln(T/δ) ) ).

5 Experiments

In this section we report on preliminary experiments with the Parametric BBQ algorithm.

The first test is a synthetic experiment to validate the model. We generated 10,000 random examples on the unit circle in R². The labels of these examples were generated according to our noise model (see Section 2) using a randomly selected hyperplane u with unit norm. We then set δ = 0.1 and analyzed the behavior of the algorithm with various settings of ε > 0, using a linear kernel.

[Figure 1: Maximum error (jagged blue line) and number of queried labels (decreasing green line) on a synthetic dataset for Parametric BBQ with δ = 0.1 and 0.15 ≤ ε ≤ 1. The straight blue line is the theoretical upper bound on the maximum error provided by the theory.]

In Figure 1 the jagged blue line represents the maximum error over the example sequence, i.e., max_t |Δ̂_t - Δ_t|. Although we stopped the plot at ε = 1, note that the maximum error is dominated by Δ̂_t, which can be of the order of N_t. As predicted by Theorem 2, the maximum error remains below the straight line y = ε (the maximum error predicted by the theory). In the same plot, the decreasing green line shows the number of queried labels, which closely follows the curve ε^{-2} predicted by the theory.

This initial test reveals that the algorithm is dramatically underconfident, i.e., it is a lot more precise than it thinks. Moreover, the actual error is rather insensitive to the choice of ε. In order to leverage this, we ran the remaining tests using Parametric BBQ with a more extreme setting of parameters. Namely, we changed the query condition (the "if" condition in Algorithm 2) to

    [1 - r_t - s_t]_+ < ||q_t|| √(2 ln(2/δ))

for 0 < δ < 1. This amounts to setting the desired error to a default value of ε = 1 while making the number of queried labels independent of T.

With the above setting, we compared Parametric BBQ to the second-order version of the label-efficient classifier (SOLE) of (Cesa-Bianchi et al., 2006b). This is a mistake-driven RLS algorithm that queries the label of the current instance with probability 1/(1 + b|Δ̂_t|), where b > 0 is a parameter and Δ̂_t is the RLS margin. The other baseline algorithm is a vanilla sampler (called Random in the plots) that asks for labels at random with constant probability 0 < p < 1. Recall that SOLE does not come with a guaranteed bound on the number of queried labels. Random, on the other hand, has the simple expectation bound E[N_T] = pT.
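For concreteness, the three query rules just described can be sketched as follows; the modified Parametric BBQ test and the query probabilities for SOLE and Random follow the descriptions above, while the function names and the rng argument (a NumPy random generator) are our own illustrative choices.

import numpy as np

def modified_bbq_should_query(r_t, s_t, q_norm, delta):
    # Modified condition used in the experiments (desired error fixed to eps = 1):
    # [1 - r_t - s_t]_+ < ||q_t|| sqrt(2 ln(2/delta))
    return max(1.0 - r_t - s_t, 0.0) < q_norm * np.sqrt(2.0 * np.log(2.0 / delta))

def sole_should_query(margin, b, rng):
    # SOLE queries the current label with probability 1 / (1 + b |margin|)
    return rng.random() < 1.0 / (1.0 + b * abs(margin))

def random_should_query(p, rng):
    # Random baseline: query with constant probability p, so E[N_T] = p T
    return rng.random() < p

A generator such as rng = np.random.default_rng(0) can be shared across the two randomized rules so that runs are reproducible.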

For each algorithm we plot the F-measure (harmonic mean of precision and recall) against the fraction of queried labels. We control the fraction of queried labels by changing the parameters of the three algorithms: δ for Parametric BBQ, b for SOLE, and p for Random.

For the first real-world experiment we chose a9a (available from www.csie.ntu.edu.tw/~cjlin/libsvmtools/), a subset of the census-income (Adult) database with 32,561 binary-labeled examples and 123 features. In order to bring all algorithms to a reasonable performance level, we used a Gaussian kernel with σ² = 12.5. The plots (Figure 2) show that less than 6% of queries are enough for the three algorithms to saturate their performance. In the whole query range Parametric BBQ is consistently slightly better than SOLE, while Random has the worst performance.

[Figure 2: F-measure against fraction of queried labels on the a9a dataset (32,561 examples in random order). The plotted curves are averages over 10 random shuffles.]

For our second real-world experiment we used the first 40,000 newswire stories, in chronological order, from the Reuters Corpus Volume 1 dataset (RCV1). Each newsstory of this corpus is tagged with one or more labels from a set of 102 classes. A standard TF-IDF bag-of-words encoding was used to obtain 138,860 features. We considered the 50 most populated classes and trained 50 one-vs-all classifiers using a linear kernel. Earlier experiments, such as those reported in (Cesa-Bianchi et al., 2006b), show that RLS-based algorithms perform best on RCV1 when run in a mistake-driven fashion. For this reason, on this dataset we used a mistake-driven variant of Parametric BBQ, storing a queried label only when it is wrongly predicted. Figure 3 shows the macroaveraged F-measure plotted against the average fraction of queried labels, where averages are computed over the 50 classifiers. Here the algorithms need over 35% of the labels to saturate. Moreover, Parametric BBQ performs worse than SOLE, although still better than Random. Since SOLE and Parametric BBQ are both based on the mistake-driven RLS classifier, any difference in performance is due to their different query conditions: SOLE is margin-based, while Parametric BBQ uses ||q_t|| and related quantities. Note that, unlike the margin, q_t does not depend on the queried labels, but only on the correlation between the corresponding instances. This fact, which helped us a lot in the analysis of BBQ, could make a crucial difference between domains like RCV1 (where instances are extremely sparse) and Adult (where instances are relatively dense). More experimental work is needed in order to settle this conjecture.

[Figure 3: F-measure against fraction of queried labels (average over the 50 most frequent categories of RCV1, first 40,000 examples in chronological order).]

6 Conclusions and ongoing research

We have introduced a new family of online algorithms, the BBQ family, for selective sampling in oblivious adversarial environments. These algorithms naturally interpolate between fully supervised and fully unsupervised learning scenarios. A parametric variant (Parametric BBQ) of our basic algorithm is designed to work in a weakened KWIK framework (Li et al., 2008; Strehl & Littman, 2008) with improved bounds on the number of queried labels. We have made preliminary experiments. First, we validated the theory on an artificially generated dataset. Second, we compared a variant of Parametric BBQ to algorithms with similar guarantees, with encouraging results.

A few issues we are currently working on are the following. First, we are trying to see whether a sharper analysis of BBQ exists that allows one to prove a regret bound of the form ε|T_ε| + (ln T)/ε² when N_T = O(d ln T). This bound would be a worst-case analog of the bound Cavallanti et al. (2009) have obtained in an i.i.d. setting. This improvement is likely to require refined bounds on the bias and variance of our estimators. Moreover, we would like to see whether it is possible either to remove the ln T dependence in the bound on N_T in Theorem 2 or to make Parametric BBQ work in adaptive adversarial environments (presumably at the cost of looser bounds on N_T). In fact, it is currently unclear to us how a direct covering argument could be applied in Theorem 2 so as to avoid the need for a conditionally independent structure of the involved random variables. On the experimental side, we are planning to perform a more thorough empirical investigation using additional datasets. In particular, since our algorithms can also be viewed as memory-bounded procedures, we would like to see how they perform when compared to budget-based algorithms, such as those in (Weston et al., 2005; Dekel et al., 2007; Cavallanti et al., 2007; Orabona et al., 2008). Finally, since our algorithms can be easily adapted to solve regression tasks, we are planning to test the BBQ family on standard regression benchmarks.

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. The third author is supported by the DIRAC project under EC grant FP6-0027787. All authors acknowledge partial support by the PASCAL2 NoE under EC grant FP7-216886. This publication only reflects the authors' views.

References

Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3, 397-422.

Balcan, M., Beygelzimer, A., & Langford, J. (2006). Agnostic active learning. Proceedings of the 23rd International Conference on Machine Learning.

Balcan, M., Broder, A., & Zhang, T. (2007). Margin-based active learning. Proceedings of the 20th Annual Conference on Learning Theory.

Cavallanti, G., Cesa-Bianchi, N., & Gentile, C. (2007). Tracking the best hyperplane with a simple budget Perceptron. Machine Learning, 69, 143-167.

Cavallanti, G., Cesa-Bianchi, N., & Gentile, C. (2009). Linear classification and selective sampling under low noise conditions. In Advances in Neural Information Processing Systems 21. MIT Press.

Cesa-Bianchi, N., Gentile, C., & Zaniboni, L. (2006a). Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7, 31-54.

Cesa-Bianchi, N., Gentile, C., & Zaniboni, L. (2006b). Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7, 1205-1230.

Cohn, D., Atlas, L., & Ladner, R. (1990). Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems 2. MIT Press.

Dasgupta, S., Hsu, D., & Monteleoni, C. (2008). A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 21 (pp. 353-360). MIT Press.

Dasgupta, S., Kalai, A. T., & Monteleoni, C. (2005). Analysis of perceptron-based active learning. Proceedings of the 18th Annual Conference on Learning Theory (pp. 249-263).

Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2007). The Forgetron: A kernel-based Perceptron on a budget. SIAM Journal on Computing, 37, 1342-1372.

Freund, Y., Seung, S., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28, 133-168.

Li, L., Littman, M., & Walsh, T. (2008). Knows what it knows: a framework for self-aware learning. Proceedings of the 25th International Conference on Machine Learning (pp. 568-575).

Orabona, F., Keshet, J., & Caputo, B. (2008). The Projectron: a bounded kernel-based Perceptron. Proceedings of the 25th International Conference on Machine Learning (pp. 720-727).

Strehl, A., & Littman, M. (2008). Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems 20. MIT Press.

Weston, J., Bordes, A., & Bottou, L. (2005). Online (and offline) on an even tighter budget. Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (pp. 413-420).