Efficient Learning with Partially Observed Attributes


Nicolò Cesa-Bianchi, DSI, Università degli Studi di Milano, Italy (cesa-bianchi@dsi.unimi.it)
Shai Shalev-Shwartz, The Hebrew University, Jerusalem, Israel (shais@cs.huji.ac.il)
Ohad Shamir, The Hebrew University, Jerusalem, Israel (ohadsh@cs.huji.ac.il)

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

We describe and analyze efficient algorithms for learning a linear predictor from examples when the learner can only view a few attributes of each training example. This is the case, for instance, in medical research, where each patient participating in the experiment is only willing to go through a small number of tests. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on each training example. We demonstrate the efficiency of our algorithms by showing that when running on digit recognition data, they obtain a high prediction accuracy even when the learner gets to see only four pixels of each image.

1 Introduction

Suppose we would like to predict if a person has some disease based on medical tests. Theoretically, we may choose a sample of the population, perform a large number of medical tests on each person in the sample, and learn from this information. In many situations this is unrealistic, since patients participating in the experiment are not willing to go through a large number of medical tests. The above example motivates the problem studied in this paper, namely learning when there is a hard constraint on the number of attributes the learner may view for each training example. We propose an efficient algorithm for dealing with this partial information problem, and bound the number of additional training examples sufficient to compensate for the lack of full information on each training example. Roughly speaking, we actively pick which attributes to observe in a randomized way so as to construct a noisy version of all attributes. Intuitively, we can still learn despite the error of this estimate because, instead of receiving the exact value of each individual example in a small set, it suffices to get noisy estimates of many examples.

1.1 Related Work

Many methods have been proposed for dealing with missing or partial information. Most of the approaches do not come with formal guarantees on the risk of the resulting algorithm, and are not guaranteed to converge in polynomial time. The difficulty stems from the exponential number of ways to complete the missing information. In the framework of generative models, a popular approach is the Expectation-Maximization (EM) procedure (Dempster et al., 1977). The main drawback of the EM approach is that it might find sub-optimal solutions. In contrast, the methods we propose in this paper are provably efficient and come with finite sample guarantees on the risk. Our technique for dealing with missing information borrows ideas from algorithms for the adversarial multi-armed bandit problem (Auer et al., 2003; Cesa-Bianchi and Lugosi, 2006). Our learning algorithms actively choose which attributes to observe for each example. This and similar protocols were studied in the context of active learning (Cohn et al., 1994; Balcan et al., 2006; Hanneke, 2007; 2009; Beygelzimer et al., 2009), where the learner can ask for the target associated with specific examples. The specific learning task we consider in this paper was first proposed in (Ben-David and Dichterman, 1998), where it is called "learning with restricted focus of attention". Ben-David and Dichterman (1998) considered the classification setting and showed learnability

of several hypothesis classes in this model, like k-DNF and axis-aligned rectangles (Ben-David and Dichterman (1998) do describe learnability results for similar classes, but only under the restricted family of product distributions). However, to the best of our knowledge, no efficient algorithm for the class of linear predictors has been proposed. A related setting, called budgeted learning, was recently studied; see for example (Deng et al., 2007; Kapoor and Greiner, 2005) and the references therein. In budgeted learning, the learner purchases attributes at some fixed cost subject to an overall budget. Besides lacking formal guarantees, this setting is different from the one we consider in this paper, because we impose a budget constraint on the number of attributes that can be obtained for every individual example, as opposed to a global budget. In some applications, such as the medical application discussed previously, our constraint leads to a more realistic data acquisition process: the global budget allows asking for many attributes of some individual patients, while our protocol guarantees a constant number of medical tests for all the patients. Our technique is reminiscent of methods used in the compressed learning framework (Calderbank et al., 2009; Zhou et al., 2009), where data is accessed via a small set of random linear measurements. Unlike compressed learning, where learners are both trained and evaluated in the compressed domain, our techniques are mainly designed for a scenario in which only the access to training data is restricted. The opposite setting, in which full information is given at training time and the goal is to train a predictor that depends only on a small number of attributes at test time, was studied in the context of learning sparse predictors; see for example (Tibshirani, 1996) and the wide literature on sparsity properties of ℓ1 regularization. Since our algorithms also enforce a low ℓ1 norm, many of those results can be combined with our techniques to yield an algorithm that views only a constant number of attributes per training example, and a number of attributes comparable to the achievable sparsity at test time. Since our focus in this work is on constrained information at training time, we do not elaborate on this subject. Furthermore, in some real-world situations, it is reasonable to assume that attributes are very expensive at training time but are easier to obtain at test time. Returning to the example of medical applications, it is unrealistic to convince patients to participate in a medical experiment in which they need to go through a lot of medical tests, but once the system is trained, at testing time, patients who need the prediction of the system will agree to perform as many medical tests as needed. A variant of the above setting is the one studied by Greiner et al. (2002), where the learner has all the information at training time and at test time tries to actively choose a small number of attributes to form a prediction. Note that active learning at training time, as we do here, may give more learning power than active learning at testing time. For example, we formally prove that while it is possible to learn a consistent predictor accessing at most 2 attributes of each example at training time, it is not possible (even with an infinite amount of training examples) to build an active classifier that uses at most 2 attributes of each example at test time, and whose error will be smaller than a constant.

2 Main Results

In this section we outline the main results. We start with a formal description of the learning problem. In linear regression each example is an instance-target pair, (x, y) ∈ R^d × R. We refer to x as a vector of attributes, and the goal of the learner is to find a linear predictor x ↦ ⟨w, x⟩, where we refer to w ∈ R^d as the predictor. The performance of a predictor w on an instance-target pair (x, y) ∈ R^d × R is measured by a loss function ℓ(⟨w, x⟩, y). For simplicity, we focus on the squared loss function, ℓ(a, b) = (a − b)^2, and briefly discuss other loss functions in Section 5. Following the standard framework of statistical learning (Haussler, 1992; Devroye et al., 1996; Vapnik, 1998), we model the environment as a joint distribution D over the set of instance-target pairs, R^d × R. The goal of the learner is to find a predictor with low risk, defined as the expected loss:

L_D(w) := E_{(x,y)~D} [ ℓ(⟨w, x⟩, y) ].

Since the distribution D is unknown, the learner relies on a training set of m examples S = (x_1, y_1), …, (x_m, y_m), which are assumed to be sampled i.i.d. from D. We denote the training loss by

L_S(w) := (1/m) Σ_{i=1}^m ( ⟨w, x_i⟩ − y_i )^2.

We now distinguish between two scenarios:

Full information: The learner receives the entire training set. This is the traditional linear regression setting.

Partial information: For each individual example (x_i, y_i), the learner receives the target y_i but is only allowed to see k attributes of x_i, where k is a parameter of the problem. The learner has the freedom to actively choose which of the attributes will be revealed, as long as at most k of them are given.
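To make the access protocol concrete, the following minimal Python sketch (our own illustration; the function name and toy data are not from the paper) enforces the budget of k observed attributes per example, while the target y remains visible.

    import numpy as np

    def reveal_attributes(x, indices, k):
        """Partial-information oracle (hypothetical helper): return only the
        requested attribute values of x, enforcing a budget of at most k
        observed attributes for this example."""
        indices = list(indices)
        if len(indices) > k:
            raise ValueError("attribute budget exceeded")
        return {int(i): float(x[i]) for i in indices}

    # Toy usage: the learner sees the target y, but only k = 4 chosen pixels of x.
    rng = np.random.default_rng(0)
    x, y = rng.uniform(-1.0, 1.0, size=784), 1.0
    observed = reveal_attributes(x, rng.choice(784, size=4, replace=False), k=4)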

While the full information case was extensively studied, the partial information case is more challenging. Our approach for dealing with the problem of partial information is to rely on algorithms for the full information case and to fill in the missing information in a randomized, data- and algorithm-dependent, way. As a simple baseline, we begin by describing a straightforward adaptation of Lasso (Tibshirani, 1996), based on a direct nonadaptive estimate of the loss function. We then turn to describe a more effective approach, which combines a stochastic gradient descent algorithm called Pegasos (Shalev-Shwartz et al., 2007) with the active sampling of attributes in order to estimate the gradient of the loss at each step.

2.1 Baseline Algorithm

A popular approach for learning a linear regressor is to minimize the empirical loss on the training set plus a regularization term taking the form of a norm of the predictor. For example, in ridge regression the regularization term is ‖w‖_2^2 and in Lasso the regularization term is ‖w‖_1. Instead of regularization, we can include a constraint of the form ‖w‖_1 ≤ B or ‖w‖_2 ≤ B. With an adequate tuning of parameters, the regularization form is equivalent to the constraint form. In the constraint form, the predictor is a solution to the following optimization problem:

min_{w ∈ R^d} (1/|S|) Σ_{(x,y) ∈ S} ( ⟨w, x⟩ − y )^2   s.t. ‖w‖_p ≤ B,   (1)

where S = {(x_1, y_1), …, (x_m, y_m)} is a training set of m examples, B is a regularization parameter, and p is 1 for Lasso and 2 for ridge regression. Standard risk bounds for Lasso imply that if ŵ is a minimizer of (1) (with p = 1), then with probability greater than 1 − δ over the choice of a training set of size m we have

L_D(ŵ) ≤ min_{w: ‖w‖_1 ≤ B} L_D(w) + O( B^2 √( ln(d/δ) / m ) ).   (2)

To adapt Lasso to the partial information case, we first rewrite the squared loss as follows:

( ⟨w, x⟩ − y )^2 = w^T (x x^T) w − 2 y x^T w + y^2,

where w, x are column vectors and w^T, x^T are their corresponding transposes (i.e., row vectors). Next, we estimate the matrix x x^T and the vector x using the partial information we have, and then we solve the optimization problem given in (1) with the estimated values of x x^T and x. To estimate the vector x we can pick an index i uniformly at random from [d] = {1, …, d} and define the estimate to be a vector v such that

v_r = d x_r if r = i, and v_r = 0 otherwise.   (3)

It is easy to verify that v is an unbiased estimate of x, namely E[v] = x, where the expectation is with respect to the choice of the index i. When we are allowed to see k > 1 attributes, we simply repeat the above process (without replacement) and set v to be the averaged vector. To estimate the matrix x x^T we could pick two indices i, j independently and uniformly at random from [d], and define the estimate to be a matrix with all zeros except d^2 x_i x_j in the (i, j) entry. However, this yields a non-symmetric matrix, which would make our optimization problem with the estimated matrix non-convex. To overcome this obstacle, we symmetrize the matrix by adding its transpose and dividing by 2. The resulting baseline procedure is given in Algorithm 1.

Algorithm 1 Baseline(S, k)
  Input: S - full information training set with m examples;
         k - can view only k elements of each instance in S
  Parameter: B
  Initialize: Ā = 0 ∈ R^{d×d}; v̄ = 0 ∈ R^d; ȳ = 0
  for each (x, y) ∈ S
    v = 0 ∈ R^d; A = 0 ∈ R^{d×d}
    choose C uniformly at random from all subsets of [d] × [d] of size k/2
    for each (i, j) ∈ C
      v_i = v_i + (d/k) x_i
      v_j = v_j + (d/k) x_j
      A_{i,j} = A_{i,j} + (d^2/k) x_i x_j
      A_{j,i} = A_{j,i} + (d^2/k) x_i x_j
    end
    Ā = Ā + A/m
    v̄ = v̄ − (2y/m) v
    ȳ = ȳ + y^2/m
  end
  Let L̃_S(w) = w^T Ā w + ⟨w, v̄⟩ + ȳ
  Output: solution of min_{w: ‖w‖_1 ≤ B} L̃_S(w)

(We note that an even simpler approach is to arbitrarily assume that the correlation matrix is the identity matrix; the solution to the loss minimization problem is then simply the averaged vector w = (1/m) Σ_{(x,y) ∈ S} y x. In that case, we can simply replace x by its estimated vector as defined in (3). While this naive approach can work on very simple classification tasks, it performs poorly on realistic data sets, in which the correlation matrix is not likely to be the identity. Indeed, in our experiments with the MNIST data set, we found that this approach performed poorly relative to the algorithms proposed in this paper.)
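The per-example estimates used by Algorithm 1 are straightforward to implement. The sketch below is our own Python illustration (not the authors' code): it builds the unbiased estimates of x x^T and of x for a single example while reading at most k attributes; Algorithm 1 then accumulates A/m, subtracts (2y/m) v, and averages y^2/m over the training set.

    import numpy as np

    def baseline_estimates(x, k, rng):
        """One inner iteration of the Baseline procedure (sketch): sample k/2
        cells of [d] x [d], and form the symmetrized unbiased estimate A of
        x x^T and the unbiased estimate v of x, reading at most k attributes."""
        d = x.shape[0]
        A = np.zeros((d, d))
        v = np.zeros(d)
        for c in rng.choice(d * d, size=k // 2, replace=False):
            i, j = divmod(int(c), d)
            v[i] += (d / k) * x[i]
            v[j] += (d / k) * x[j]
            A[i, j] += (d ** 2 / k) * x[i] * x[j]
            A[j, i] += (d ** 2 / k) * x[i] * x[j]
        return A, v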

The following theorem shows that, similarly to Lasso, the Baseline algorithm is competitive with the optimal linear predictor having a bounded ℓ1 norm.

Theorem 1 Let D be a distribution such that P[ x ∈ [−1, +1]^d ∧ y ∈ [−1, +1] ] = 1. Let ŵ be the output of Baseline(S, k), where |S| = m. Then, with probability of at least 1 − δ over the choice of the training set and the algorithm's own randomization, we have

L_D(ŵ) ≤ min_{w: ‖w‖_1 ≤ B} L_D(w) + O( ((d B)^2 / k) √( ln(d/δ) / m ) ).

The above theorem tells us that for a sufficiently large training set we can find a very good predictor. Put another way, a large number of examples can compensate for the lack of full information on each individual example. In particular, to overcome the extra factor d^2/k in the bound, which does not appear in the full information bound given in (2), we need to increase m by a factor of d^4/k^2. Note that when k = d we do not recover the full information bound. This is because we try to estimate a matrix with d^2 entries using only k = d < d^2 samples. In the next subsection, we describe a better, adaptive procedure for the partial information case.

2.2 Gradient-based Attribute Efficient Regression

In this section, by avoiding the estimation of the matrix x x^T, we significantly decrease the number of additional examples sufficient for learning with k attributes per training example. To do so, we do not try to estimate the loss function but rather estimate the gradient ∇ℓ(w) = 2(⟨w, x⟩ − y) x, with respect to w, of the squared loss function (⟨w, x⟩ − y)^2. Each vector w defines a probability distribution over [d] by letting P[i] = |w_i| / ‖w‖_1. We can estimate the gradient using 2 attributes as follows. First, we randomly pick j from [d] according to the distribution defined by w. Using j, we estimate the term ⟨w, x⟩ by sgn(w_j) ‖w‖_1 x_j. It is easy to verify that the expectation of this estimate equals ⟨w, x⟩. Second, we randomly pick i from [d] according to the uniform distribution over [d]. Based on i, we estimate the vector x as in (3). Overall, we obtain the following unbiased estimate of the gradient:

∇̃ℓ(w) = 2 ( sgn(w_j) ‖w‖_1 x_j − y ) v,   (4)

where v is as defined in (3). The advantage of the above approach over the loss-based approach we took before is that the magnitude of each element of the gradient estimate is of order d ‖w‖_1. This is in contrast to what we had for the loss-based approach, where the magnitude of each element of the matrix A was of order d^2. In many situations, the ℓ1 norm of a good predictor is significantly smaller than d, and in these cases the gradient-based estimate is better than the loss-based estimate. However, while in the previous approach our estimate did not depend on a specific w, now the estimate depends on w. We therefore need an iterative learning method in which at each iteration we use the gradient of the loss function on an individual example. Luckily, the stochastic gradient descent approach conveniently fits our needs. Concretely, below we describe a variant of the Pegasos algorithm (Shalev-Shwartz et al., 2007) for learning linear regressors. Pegasos tries to minimize the regularized risk

min_w E_{(x,y)~D} [ ( ⟨w, x⟩ − y )^2 ] + λ ‖w‖_2^2.   (5)

Of course, the distribution D is unknown, and therefore we cannot hope to solve the above problem exactly. Instead, Pegasos finds a sequence of weight vectors that (on average) converge to the solution of (5). We start with the all-zeros vector w = 0 ∈ R^d. Then, at each iteration Pegasos picks the next example in the training set (which is equivalent to sampling a fresh example according to D) and calculates the gradient of the loss function on this example with respect to the current weight vector w. In our case, the gradient is simply 2(⟨w, x⟩ − y) x. We denote this gradient vector by ∇. Finally, Pegasos updates the predictor according to the rule w = (1 − 1/t) w − (1/(λt)) ∇, where t is the current iteration number. To apply Pegasos in the partial information case we could simply replace the gradient vector ∇ with its estimate given in (4). However, our analysis shows that it is desirable to maintain an estimation vector with small magnitude. Since the magnitude of the estimate is of order d ‖w‖_1, where w is the current weight vector maintained by the algorithm, we would like to ensure that ‖w‖_1 is always smaller than some threshold B. We achieve this goal by adding an additional projection step at the end of each Pegasos iteration. Formally, after performing the update we set

w ← argmin_{u: ‖u‖_1 ≤ B} ‖u − w‖_2.   (6)

This projection step can be performed efficiently in time O(d) using the technique described in (Duchi et al., 2008). A pseudo-code of the resulting Attribute Efficient Regression (AER) algorithm is given in Algorithm 2.
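Putting the pieces together, here is our own Python sketch of a single update (an illustration under assumptions, not the paper's code): the ⟨w, x⟩ and x estimates each use k/2 observed attributes, the step follows the Pegasos rule with the estimated gradient 2(ŷ − y)v, and the projection uses the standard O(d log d) sort-based method rather than the O(d) algorithm of Duchi et al. (2008). A toy driver loop at the end mirrors the iterate averaging done in Algorithm 2 below.

    import numpy as np

    def l1_projection(w, B):
        """Euclidean projection of w onto the l1 ball of radius B
        (sort-based, O(d log d))."""
        if np.abs(w).sum() <= B:
            return w
        u = np.sort(np.abs(w))[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, w.size + 1) > (css - B))[0][-1]
        theta = (css[rho] - B) / (rho + 1.0)
        return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

    def aer_step(w, x, y, t, k, lam, B, rng):
        """One AER iteration (sketch): estimate x and <w, x> from at most k
        observed attributes, take a Pegasos-style step on the squared loss
        with the estimated gradient, then project onto the l1 ball."""
        d = w.shape[0]
        # unbiased estimate of x from k/2 uniformly chosen attributes, as in (3)
        v = np.zeros(d)
        for j in rng.choice(d, size=k // 2, replace=False):
            v[j] += (2.0 * d / k) * x[j]
        # unbiased estimate of <w, x> from k/2 attributes sampled prop. to |w_i|
        y_hat = 0.0
        norm1 = np.abs(w).sum()
        if norm1 > 0.0:                      # at w = 0 the estimate is simply 0
            probs = np.abs(w) / norm1
            for i in rng.choice(d, size=k // 2, p=probs):
                y_hat += (2.0 / k) * np.sign(w[i]) * norm1 * x[i]
        # Pegasos-style update with the gradient estimate 2 * (y_hat - y) * v
        w = (1.0 - 1.0 / t) * w - (2.0 / (lam * t)) * (y_hat - y) * v
        return l1_projection(w, B)

    # Toy driver (assumed parameters): run over a synthetic stream and average
    # the iterates, as Algorithm 2 does, to obtain the output w_bar.
    rng = np.random.default_rng(0)
    d, m, k, B, lam = 50, 5000, 4, 1.0, 0.1
    w_true = np.zeros(d); w_true[:5] = 0.2
    w = np.zeros(d); w_bar = np.zeros(d)
    for t in range(1, m + 1):
        x = rng.uniform(-1.0, 1.0, size=d)
        y = float(x @ w_true)
        w = aer_step(w, x, y, t, k, lam, B, rng)
        w_bar += w / m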

Algorithm 2 AER(S, k)
  Input: S - full information training set with m examples;
         k - access only k elements of each instance in S
  Parameters: λ, B
  w = (0, …, 0); w̄ = w; t = 1
  for each (x, y) ∈ S
    v = 0 ∈ R^d
    choose C uniformly at random from all subsets of [d] of size k/2
    for each j ∈ C
      v_j = v_j + (2d/k) x_j
    end
    ŷ = 0
    for r = 1, …, k/2
      sample i from [d] based on P[i] = |w_i| / ‖w‖_1
      ŷ = ŷ + (2/k) sgn(w_i) ‖w‖_1 x_i
    end
    w = (1 − 1/t) w − (2/(λt)) (ŷ − y) v
    w = argmin_{u: ‖u‖_1 ≤ B} ‖u − w‖_2
    w̄ = w̄ + w/m
    t = t + 1
  end
  Output: w̄

The following theorem provides convergence guarantees for AER.

Theorem 2 Let D be a distribution such that P[ x ∈ [−1, +1]^d ∧ y ∈ [−1, +1] ] = 1. Let w* be any vector such that ‖w*‖_1 ≤ B and ‖w*‖_2 ≤ B_2. Then,

E[ L_D(w̄) ] ≤ L_D(w*) + O( d (B + 1) B_2 √( ln(m) / (k m) ) ),

where |S| = m, w̄ is the output of AER(S, k) run with λ = ((B + 1) d / B_2) √( log(m) / (k m) ), and the expectation is over the choice of S and over the algorithm's own randomization.

For simplicity and readability, in the above theorem we only bounded the expected risk. It is possible to obtain similar guarantees with high probability by relying on Azuma's inequality; see for example (Cesa-Bianchi et al., 2004). Note that ‖w*‖_2 ≤ ‖w*‖_1 ≤ B, so Theorem 2 implies that

L_D(w̄) ≤ min_{w: ‖w‖_1 ≤ B} L_D(w) + O( d B^2 √( ln(m) / (k m) ) ).

Therefore, the bound for AER is much better than the bound for Baseline: instead of d^2/k we have d/√k. (When comparing bounds, we ignore logarithmic terms; also, in this discussion we assume that B and B_2 are at least 1.) It is interesting to compare the bound for AER to the Lasso bound in the full information case given in (2). As can be seen, to achieve the same level of risk, AER needs a factor of d^2/k more examples than the full information Lasso. (We note that when d = k we still do not recover the full information bound. However, it is possible to improve the analysis and replace the factor d/√k with a factor of order √( d max_t ‖x_t‖_2^2 / k ).) Since each AER example uses only k attributes while each Lasso example uses all d attributes, the ratio between the total number of attributes AER needs and the number of attributes Lasso needs to achieve the same error is O(d). Intuitively, when given d times the total number of attributes, we can fully compensate for the partial information protocol. However, in some situations even this extra d factor is not needed. Suppose we know that the vector w*, which minimizes the risk, is dense; that is, it satisfies ‖w*‖_1 ≈ √d ‖w*‖_2. In this case, choosing B_2 = B/√d, the bound in Theorem 2 becomes of order B^2 √(d/k) / √m. Therefore, the number of examples AER needs in order to achieve the same error as Lasso is only a factor d/k more than the number of examples Lasso uses. But this implies that both AER and Lasso need the same number of attributes in order to achieve the same level of error! Crucially, the above holds only if w* is dense. When w* is sparse we have ‖w*‖_1 ≈ ‖w*‖_2, and then AER needs more attributes than Lasso. One might wonder whether a more clever active sampling strategy could attain, in the sparse case, the performance of Lasso while using the same number of attributes. The next subsection shows that this is not possible in general.

2.3 Lower bounds and negative results

We now show (proof in the appendix) that any attribute efficient algorithm needs in general order of d/ε examples for learning an ε-accurate sparse linear predictor. Recall that the upper bound of AER implies that order of d^2 (B + 1)^2 B_2^2 / ε^2 examples are sufficient for learning a predictor with L_D(w) − L_D(w*) < ε. Specializing this sample complexity bound of AER to the w* described in Theorem 3 below yields that O(d^2/ε) examples are sufficient for AER to learn a good predictor in this case. That is, we have a gap of a factor d between the lower bound and the upper bound, and it remains open to bridge this gap.

Theorem 3 For any ε ∈ (0, 1/16), k, and d ≥ 4k,

there exists a distribution over examples and a weight vector w*, with ‖w*‖_0 = 1 and ‖w*‖_2 = ‖w*‖_1 = 2√ε, such that any attribute efficient regression algorithm accessing at most k attributes per training example must see (in expectation) at least Ω( d / (k ε) ) examples in order to learn a linear predictor w with L_D(w) − L_D(w*) < ε.

Recall that in our setting, while at training time the learner can only view k attributes of each example, at test time all attributes can be observed. The setting of Greiner et al. (2002), instead, assumes that at test time the learner cannot observe all the attributes. The following theorem shows that if a learner can view at most 2 attributes at test time, then it is impossible to give accurate predictions at test time, even when the optimal linear predictor is known.

Theorem 4 There exists a weight vector w* and a distribution D such that L_D(w*) = 0, while any algorithm A that gives predictions A(x) while viewing only 2 attributes of each x must have L_D(A) ≥ 1/9.

The proof is given in the appendix. This negative result highlights an interesting phenomenon: we can learn an arbitrarily accurate predictor w from partially observed examples; however, even if we know the optimal w*, we might not be able to accurately predict a new partially observed example.

3 Proof Sketch of Theorem 2

Here we only sketch the proof of Theorem 2. A complete proof of all our theorems is given in the appendix. We start with a general logarithmic regret bound for strongly convex functions (Hazan et al., 2006; Kakade and Shalev-Shwartz, 2008). The regret bound implies the following. Let z_1, …, z_m be a sequence of vectors, each of which has norm bounded by G. Let λ > 0 and consider the sequence of functions g_1, …, g_m such that g_t(w) = (λ/2) ‖w‖^2 + ⟨z_t, w⟩. Each g_t is λ-strongly convex (meaning, it is not too flat), and therefore regret bounds for strongly convex functions tell us that there is a way to construct a sequence of vectors w_1, …, w_m such that for any w* that satisfies ‖w*‖_1 ≤ B we have

(1/m) Σ_{t=1}^m g_t(w_t) − (1/m) Σ_{t=1}^m g_t(w*) ≤ O( G^2 log(m) / (λ m) ).

With an appropriate choice of λ, and with the assumption ‖w*‖_2 ≤ B_2, the above inequality implies that

(1/m) Σ_{t=1}^m ⟨ z_t, w_t − w* ⟩ ≤ α,   where α = O( G B_2 √( log(m) / m ) ).

This holds for any sequence z_1, …, z_m, and in particular we can set z_t = 2(ŷ_t − y_t) v_t. Note that z_t is a random vector that depends both on the value of w_t and on the random bits chosen on round t. Taking the conditional expectation of z_t with respect to the random bits chosen on round t, we obtain that E[z_t | w_t] is exactly the gradient of (⟨w, x_t⟩ − y_t)^2 at w_t, which we denote by ∇_t. From the convexity of the squared loss, we can lower bound ⟨∇_t, w_t − w*⟩ by (⟨w_t, x_t⟩ − y_t)^2 − (⟨w*, x_t⟩ − y_t)^2. That is, in expectation we have

E[ (1/m) Σ_{t=1}^m ( (⟨w_t, x_t⟩ − y_t)^2 − (⟨w*, x_t⟩ − y_t)^2 ) ] ≤ α.

Taking expectation with respect to the random choice of the examples from D, denoting w̄ = (1/m) Σ_{t=1}^m w_t, and using Jensen's inequality, we get that E[L_D(w̄)] ≤ L_D(w*) + α. Finally, we need to make sure that α is not too large. The only potential danger is that G, the bound on the norms of z_1, …, z_m, will be large. We make sure this cannot happen by restricting each w_t to the ℓ1 ball of radius B, which ensures that ‖z_t‖ ≤ O((B + 1) d) for all t.

4 Experiments

We performed some preliminary experiments to test the behavior of our algorithms on the well-known MNIST digit recognition dataset (Cun et al., 1998), which contains 70,000 images (28 × 28 pixels each) of the digits 0-9. The advantages of this dataset for our purposes are that it is not a small-scale dataset, it has a reasonable dimensionality-to-data-size ratio, and the setting is clearly interpretable graphically. While this dataset is designed for classification (e.g., recognizing the digit in the image), we can still apply our algorithms to it by regressing to the label. First, to demonstrate the hardness of our setting, we provide in Figure 1 below some examples of images from the dataset, in the full information setting and the partial information setting. The upper row contains six images from the dataset, as available to a full-information algorithm. A partial-information algorithm, however, will have a much more limited access to these images. In particular, if the algorithm may only choose k = 4 pixels from each image, the same six images as available to it might look like the bottom row of Figure 1. We began by looking at a dataset composed of "3" vs. "5", where all the 3 digits were labeled as −1 and all the 5 digits were labeled as +1. We ran four different algorithms on this dataset: the simple Baseline algorithm, AER, as well as ridge regression and Lasso for comparison (for Lasso, we solved (1) with p = 1). Both ridge regression and Lasso were run in the full information setting: namely, they enjoyed full access to

[Figure 1: In the upper row, six examples from the training set (of digits 3 and 5) are shown. In the lower row we show the same six examples, where only four randomly sampled pixels from each original image are displayed.]

all attributes of all examples in the training set. The Baseline algorithm and AER, however, were given access to only 4 attributes from each training example. We randomly split the dataset into a training set and a test set (with the test set being 10% of the original dataset). For each algorithm, parameter tuning was performed using 10-fold cross validation. Then, we ran the algorithm on increasingly long prefixes of the training set, and measured the average regression error (⟨w, x⟩ − y)^2 on the test set. The results (averaged over runs on 10 random train-test splits) are presented in Figure 2. In the upper plot, we see how the test regression error improves with the number of examples. The Baseline algorithm is highly unstable at the beginning, probably due to the ill-conditioning of the estimated covariance matrix, although it eventually stabilizes (to prevent a graphical mess at the left-hand side of the figure, we removed the error bars from the corresponding plot). Its performance is worse than that of AER, completely in line with our earlier theoretical analysis. The bottom plot of Figure 2 is similar, only that now the x-axis represents the cumulative number of attributes seen by each algorithm rather than the number of examples. For the partial-information algorithms, the graph ends at approximately 49,000 attributes, which is the total number of attributes accessed by the algorithm after running over all training examples, seeing k = 4 pixels from each example. However, for the full-information algorithms, 49,000 attributes are already seen after just 62 examples. When we compare the algorithms in this way, we see that our AER algorithm achieves excellent performance for a given attribute budget, significantly better than the other ℓ1-based algorithms, and even comparable to full-information ridge regression.

[Figure 2: Test regression error for each of the 4 algorithms (Ridge Regression, Lasso, AER, Baseline), over increasing prefixes of the training set for 3 vs. 5, plotted against the number of examples (upper plot) and against the cumulative number of features (bottom plot). The results are averaged over 10 runs.]

Finally, we tested the algorithms over 45 datasets generated from MNIST, one for each possible pair of digits. For each dataset and each of 10 random train-test splits, we performed parameter tuning for each algorithm separately, and checked the average squared error on the test set. The median test errors over all datasets are presented in the table below.

                        Test Error
  Full information      Ridge      0.110
                        Lasso      0.222
  Partial information   AER        0.320
                        Baseline   0.815

As can be seen, the AER algorithm manages to achieve good performance, not much worse than the full-information Lasso algorithm. The Baseline algorithm, however, achieves a substantially worse performance, in line with our theoretical analysis above. We also calculated the test classification error of AER, i.e. the frequency of sign(⟨w, x⟩) ≠ y, and found that AER, which can see only 4 pixels per image, usually performs only a little worse than the full-information algorithms (ridge regression and Lasso), which enjoy full access to all 784 pixels in each image. In particular, the median test classification errors of AER, Lasso, and Ridge are

3.5%, 1%, and 1.3%, respectively.

5 Discussion and Extensions

In this paper, we provided an efficient algorithm for learning when only a few attributes of each training example can be seen. The algorithm comes with formal guarantees, is provably competitive with algorithms which enjoy full access to the data, and seems to perform well in practice. We also presented sample complexity lower bounds, which are only a factor d smaller than the upper bound achieved by our algorithm, and it remains open to bridge this gap. Our approach easily extends to other gradient-based algorithms besides Pegasos, for example generalized additive algorithms such as p-norm Perceptrons and Winnow; see, e.g., (Cesa-Bianchi and Lugosi, 2006). An obvious direction for future research is how to deal with loss functions other than the squared loss. In upcoming work on a related problem, we develop a technique which allows us to deal with arbitrary analytic loss functions, but which in the setting of this paper would lead to sample complexity bounds that are exponential in d. Another interesting extension we are considering is connecting our results to the field of privacy-preserving learning (Dwork, 2008), where the goal is to exploit the attribute efficiency property in order to prevent acquisition of information about individual data instances.

References

P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 2003.
M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of ICML, 2006.
S. Ben-David and E. Dichterman. Learning with restricted focus of attention. Journal of Computer and System Sciences, 56, 1998.
A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of ICML, 2009.
R. Calderbank, S. Jafarpour, and R. Schapire. Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. Manuscript, 2009.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), September 2004.
D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201-221, 1994.
Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), November 1998.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
K. Deng, C. Bourke, S. Scott, J. Sunderman, and Y. Zheng. Bandit-based algorithms for budgeted learning. In Proceedings of ICDM. IEEE Computer Society, 2007.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of ICML, 2008.
C. Dwork. Differential privacy: A survey of results. In M. Agrawal, D.-Z. Du, Z. Duan, and A. Li, editors, TAMC, volume 4978 of Lecture Notes in Computer Science, pages 1-19. Springer, 2008.
R. Greiner, A. Grove, and D. Roth. Learning cost-sensitive active classifiers. Artificial Intelligence, 139(2):137-174, 2002.
S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of ICML, 2007.
S. Hanneke. Adaptive rates of convergence in active learning. In Proceedings of COLT, 2009.
D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.
E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of ICML, 2006.
S. Kakade and S. Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Proceedings of NIPS, 2008.
A. Kapoor and R. Greiner. Learning and classifying under hard budgets. In Proceedings of ECML, pages 170-181, 2005.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-gradient SOlver for SVM. In Proceedings of ICML, 2007.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.
S. Zhou, J. Lafferty, and L. Wasserman. Compressed and privacy-sensitive sparse regression. IEEE Transactions on Information Theory, 55(2), 2009.

A Proofs

A.1 Proof of Theorem 1

To ease our calculations, we first show that sampling k elements without replacement and then averaging the result has the same expectation as sampling just once. In the lemma below, for a set C we denote the uniform distribution over C by U(C).

Lemma 1 Let C be a finite set and let f : C → R be an arbitrary function. Let C_k = {C' ⊆ C : |C'| = k}. Then,

E_{C' ~ U(C_k)} [ (1/k) Σ_{c ∈ C'} f(c) ] = E_{c ~ U(C)} [ f(c) ].

Proof Denote |C| = n, and write (n choose k) for the binomial coefficient. We have:

E_{C' ~ U(C_k)} [ (1/k) Σ_{c ∈ C'} f(c) ]
  = (1 / (n choose k)) Σ_{C' ∈ C_k} (1/k) Σ_{c ∈ C'} f(c)
  = (1 / (k (n choose k))) Σ_{c ∈ C} f(c) |{C' ∈ C_k : c ∈ C'}|
  = (1 / (k (n choose k))) Σ_{c ∈ C} f(c) (n−1 choose k−1)
  = ( (n−1)! k! (n−k)! / (k n! (k−1)! (n−k)!) ) Σ_{c ∈ C} f(c)
  = (1/n) Σ_{c ∈ C} f(c)
  = E_{c ~ U(C)} [ f(c) ].

To prove Theorem 1, we first show that the estimated matrix constructed by the Baseline algorithm is likely to be close to the true correlation matrix over the training set.

Lemma 2 Let A_t be the matrix constructed at iteration t of the Baseline algorithm, and note that Ā = (1/m) Σ_{t=1}^m A_t. Let X̄ = (1/m) Σ_{t=1}^m x_t x_t^T. Then, with probability of at least 1 − δ over the algorithm's own randomness, we have that for all r, s,

| Ā_{r,s} − X̄_{r,s} | ≤ (d^2 / k) √( 2 ln(2 d^2 / δ) / m ).

Proof Based on Lemma 1, it is easy to verify that E[A_t] = x_t x_t^T. Additionally, since we sample without replacement, each element of A_t is in [−d^2/k, d^2/k] (because we assume ‖x_t‖_∞ ≤ 1). Therefore, we can apply Hoeffding's inequality to each element of Ā and obtain that

P[ | Ā_{r,s} − X̄_{r,s} | > ε ] ≤ 2 exp( −m k^2 ε^2 / (2 d^4) ).

Combining the above with the union bound, we obtain that

P[ ∃(r, s) : | Ā_{r,s} − X̄_{r,s} | > ε ] ≤ 2 d^2 exp( −m k^2 ε^2 / (2 d^4) ).

Setting the right-hand side of the above to δ and rearranging terms, we conclude the proof.

Next, we show that the estimate of the linear part of the objective function is also likely to be accurate.
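As a quick numerical sanity check of Lemma 1 (our own illustration, with arbitrary toy values), one can enumerate all k-subsets of a small set and verify that the two expectations coincide exactly:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(1)
    n, k = 7, 3
    f = rng.normal(size=n)              # arbitrary f: C -> R, stored as a lookup table
    subsets = list(combinations(range(n), k))
    lhs = np.mean([f[list(s)].mean() for s in subsets])   # average over all k-subsets
    rhs = f.mean()                                        # f at a uniform element of C
    assert np.isclose(lhs, rhs)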

Lemma 3 Let v_t be the vector constructed at iteration t of the Baseline algorithm, and note that v̄ = −(1/m) Σ_{t=1}^m 2 y_t v_t. Let x̄ = −(1/m) Σ_{t=1}^m 2 y_t x_t. Then, with probability of at least 1 − δ over the algorithm's own randomness, we have

‖ v̄ − x̄ ‖_∞ ≤ (d / k) √( 8 ln(2 d / δ) / m ).

Proof Based on Lemma 1, it is easy to verify that E[2 y_t v_t] = 2 y_t x_t. Additionally, since we sample k/2 pairs without replacement, each element of v_t is in [−2d/k, 2d/k] (because we assume ‖x_t‖_∞ ≤ 1), and thus each element of 2 y_t v_t is in [−4d/k, 4d/k] (because we assume |y_t| ≤ 1). Therefore, we can apply Hoeffding's inequality to each element of v̄ and obtain that

P[ | v̄_r − x̄_r | > ε ] ≤ 2 exp( −m k^2 ε^2 / (8 d^2) ).

Combining the above with the union bound, we obtain that

P[ ∃r : | v̄_r − x̄_r | > ε ] ≤ 2 d exp( −m k^2 ε^2 / (8 d^2) ).

Setting the right-hand side of the above to δ and rearranging terms, we conclude the proof.

We next show that the estimated training loss found by the Baseline algorithm, L̃_S(w), is close to the true training loss.

Lemma 4 With probability greater than 1 − δ over the Baseline algorithm's own randomization, for all w such that ‖w‖_1 ≤ B we have

| L̃_S(w) − L_S(w) | ≤ O( (B^2 d^2 / k) √( ln(d/δ) / m ) ).

Proof Combining Lemma 2 with the boundedness of ‖w‖_1 and using Hölder's inequality twice, we easily get that

| w^T (Ā − X̄) w | ≤ B^2 (d^2 / k) √( 2 ln(2 d^2 / δ) / m ).

Similarly, using Lemma 3 and Hölder's inequality,

| w^T (v̄ − x̄) | ≤ B (d / k) √( 8 ln(2 d / δ) / m ).

Combining the above inequalities with the union bound and the triangle inequality, we conclude the proof.

We are now ready to prove Theorem 1. First, using standard risk bounds (based on Rademacher complexities), we know that with probability greater than 1 − δ over the choice of a training set of m examples, for all w s.t. ‖w‖_1 ≤ B, we have

| L_S(w) − L_D(w) | ≤ O( B^2 √( ln(d/δ) / m ) ).

(To bound the Rademacher complexity, we use the boundedness of w, x, y to get that the squared loss is O(B)-Lipschitz on the domain. Combining this with the contraction principle yields the desired Rademacher bound.) Combining the above with Lemma 4, we obtain that for any w s.t. ‖w‖_1 ≤ B,

| L_D(w) − L̃_S(w) | ≤ | L_D(w) − L_S(w) | + | L_S(w) − L̃_S(w) | ≤ O( (B^2 d^2 / k) √( ln(d/δ) / m ) ).

The proof of Theorem 1 follows since the Baseline algorithm minimizes L̃_S(w).
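The concentration stated in Lemma 2 is also easy to observe numerically. The following self-contained simulation (our own illustration, with arbitrary toy parameters) rebuilds the per-example matrix estimate of Algorithm 1 and compares the largest entrywise deviation of Ā from X̄ with the bound of Lemma 2:

    import numpy as np

    rng = np.random.default_rng(3)
    d, k, m, delta = 8, 4, 2000, 0.05
    A_bar = np.zeros((d, d))
    X_bar = np.zeros((d, d))
    for _ in range(m):
        x = rng.uniform(-1.0, 1.0, size=d)
        X_bar += np.outer(x, x) / m
        A = np.zeros((d, d))
        for c in rng.choice(d * d, size=k // 2, replace=False):   # k/2 cells of [d] x [d]
            i, j = divmod(int(c), d)
            A[i, j] += (d ** 2 / k) * x[i] * x[j]
            A[j, i] += (d ** 2 / k) * x[i] * x[j]
        A_bar += A / m
    bound = (d ** 2 / k) * np.sqrt(2 * np.log(2 * d ** 2 / delta) / m)
    print(np.abs(A_bar - X_bar).max(), "<=", bound)   # holds with probability >= 1 - delta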

A.2 Proof of Theorem 2

We start with the following lemma.

Lemma 5 Let y_t, ŷ_t, v_t, w_t be the values of y, ŷ, v, w, respectively, at iteration t of the AER algorithm. Then, for any vector w* s.t. ‖w*‖_1 ≤ B we have

(1/m) Σ_{t=1}^m ( (λ/2) ‖w_t‖^2 + ⟨ 2(ŷ_t − y_t) v_t, w_t ⟩ )
  ≤ (1/m) Σ_{t=1}^m ( (λ/2) ‖w*‖^2 + ⟨ 2(ŷ_t − y_t) v_t, w* ⟩ ) + O( ((B+1) d)^2 log(m) / (k λ m) ).

Proof The proof follows directly from logarithmic regret bounds for strongly convex functions (Hazan et al., 2006; Kakade and Shalev-Shwartz, 2008), by noting that, according to our construction, max_t ‖ 2(ŷ_t − y_t) v_t ‖_2 ≤ O( (B+1) d / √k ).

Let B_2 be such that ‖w*‖_2 ≤ B_2 and choose λ = ((B+1) d / B_2) √( log(m) / (k m) ). Since λ ‖w_t‖^2 ≥ 0, we obtain from Lemma 5 that

(1/m) Σ_{t=1}^m ⟨ 2(ŷ_t − y_t) v_t, w_t − w* ⟩ ≤ λ ‖w*‖^2 + O( ((B+1) d)^2 log(m) / (k λ m) ) = O( (d / √(k m)) (B+1) B_2 √(log(m)) ) =: α.   (7)

For each t, let ∇_t = 2(⟨w_t, x_t⟩ − y_t) x_t and z_t = 2(ŷ_t − y_t) v_t. Taking expectation of (7) with respect to the algorithm's own randomization, and noting that the conditional expectation of z_t equals ∇_t, we obtain

E[ (1/m) Σ_{t=1}^m ⟨ ∇_t, w_t − w* ⟩ ] ≤ α.   (8)

From the convexity of the squared loss we know that

(⟨w_t, x_t⟩ − y_t)^2 − (⟨w*, x_t⟩ − y_t)^2 ≤ ⟨ ∇_t, w_t − w* ⟩.

Combining with (8) yields

E[ (1/m) Σ_{t=1}^m ( (⟨w_t, x_t⟩ − y_t)^2 − (⟨w*, x_t⟩ − y_t)^2 ) ] ≤ α.   (9)

Taking expectation again, this time with respect to the randomness in choosing the training set, and using the fact that w_t only depends on previous examples in the training set, we obtain that

E[ (1/m) Σ_{t=1}^m L_D(w_t) ] − L_D(w*) ≤ α.   (10)

Finally, from Jensen's inequality we know that E[ (1/m) Σ_{t=1}^m L_D(w_t) ] ≥ E[ L_D(w̄) ], and this concludes the proof.

A.3 Proof of Theorem 3

The outline of the proof is as follows. We define a specific distribution such that only one "good" feature is slightly correlated with the label. We then show that if some algorithm learns a linear predictor with an extra risk of at most ε, then it must know the value of the good feature. Next, we construct a variant of a multi-armed bandit problem out of our distribution, and show that a good learner yields a good prediction strategy. Finally, we adapt a lower bound for the multi-armed bandit problem given in (Auer et al., 2003) to conclude that in our case no learner can be too good.

The distribution: We generate a joint distribution over R^d × R as follows. Choose some j ∈ [d]. First, each feature is generated i.i.d. according to P[x_i = 1] = P[x_i = −1] = 1/2. Next, given x and j, the label y is generated according to P[y = x_j] = 1/2 + p and P[y = −x_j] = 1/2 − p, where p is set to be √ε. Denote by P_j the distribution described above when the good feature is j. Also denote by P_u the uniform distribution over {±1}^{d+1}. Analogously, we denote by E_j and E_u expectations w.r.t. P_j and P_u.

A good regressor knows j: We now show that if we have a good linear regressor then we can know the value of j. The optimal linear predictor is w* = 2p e_j, and the risk of w* is

L_D(w*) = E[ (⟨w*, x⟩ − y)^2 ] = (1/2 + p)(1 − 2p)^2 + (1/2 − p)(1 + 2p)^2 = 1 + 4p^2 − 8p^2 = 1 − 4p^2.

The risk of an arbitrary weight vector w under the aforementioned distribution is

L_D(w) = E_{x,y}[ (⟨w, x⟩ − y)^2 ] = Σ_{i ≠ j} w_i^2 + E[ (w_j x_j − y)^2 ] = Σ_{i ≠ j} w_i^2 + 1 + w_j^2 − 4 p w_j.   (11)

Suppose that L_D(w) − L_D(w*) < ε. This implies that:

1. For all i ≠ j we have w_i^2 < ε, or equivalently |w_i| < √ε.
2. 1 + w_j^2 − 4 p w_j − (1 − 4p^2) < ε, and thus (w_j − 2p)^2 < ε, which gives w_j > 2p − √ε.

Since we set p = √ε, the above implies that we can identify the value of j from any w whose risk is strictly smaller than L_D(w*) + ε.

Constructing a variant of a multi-armed bandit problem: We now construct a variant of the multi-armed bandit problem out of the distribution P_j. Each i ∈ [d] is an arm, and the reward of pulling i is (x_i y + 1)/2 ∈ {0, 1}. Unlike standard multi-armed bandit problems, here at each round the learner chooses K arms a_{t,1}, …, a_{t,K}, which correspond to the K attributes accessed at round t, and its reward is defined to be the average of the rewards of the chosen arms. At the end of each round the learner observes the value of x_t at a_{t,1}, …, a_{t,K}, as well as the value of y_t. Note that the expected reward at round t is 1/2 + (p/K) Σ_{i=1}^K 1[a_{t,i} = j]. Therefore, the total expected reward of an algorithm that runs for T rounds is upper bounded by T/2 + p E[N_j], where N_j is the number of times j ∈ {a_{t,1}, …, a_{t,K}}.

A good learner yields a strategy: Suppose that we have a learner that can learn a linear predictor with L_D(w) − L_D(w*) < ε using m examples (on average). Since we have shown that once L_D(w) − L_D(w*) < ε we know the value of j, we can construct a strategy for the multi-armed bandit problem in a straightforward way: simply use the first m examples to learn w, and from then on always pull arm j, namely a_{t,1} = … = a_{t,K} = j. The expected reward of this algorithm is at least m/2 + (T − m)(1/2 + p) = T/2 + (T − m) p.

An upper bound on the reward of any strategy: Consider an arbitrary prediction algorithm. At round t the algorithm uses the history (and its own random bits, which we can assume are set in advance) to ask for the current K attributes a_{t,1}, …, a_{t,K}. The history is the value of x_s at a_{s,1}, …, a_{s,K}, as well as the value of y_s, for all s < t. That is, we can denote the history at round t by r^t = (r_{1,1}, …, r_{1,K+1}), …, (r_{t−1,1}, …, r_{t−1,K+1}). Therefore, on round t the algorithm uses a mapping from r^t to [d]^K. We use r as a shorthand for r^{T+1}. The following lemma shows that any function of the history cannot distinguish too well between the distribution P_j and the uniform distribution.

Lemma 6 Let f : {−1, 1}^{(K+1)T} → [0, M] be any function defined on a history sequence r = (r_{1,1}, …, r_{1,K+1}), …, (r_{T,1}, …, r_{T,K+1}). Let N_j be the number of times the algorithm calculating f picks action j among the selected arms. Then,

E_j[ f(r) ] ≤ E_u[ f(r) ] + M √( −log(1 − 4p^2) E_u[N_j] ).

Proof For any two distributions P, Q we let ‖P − Q‖_1 = Σ_r |P[r] − Q[r]| be the total variation distance, and let KL(P, Q) = Σ_r P[r] log(P[r]/Q[r]) be the KL divergence. Using Hölder's inequality we know that

| E_j[f(r)] − E_u[f(r)] | ≤ M ‖P_j − P_u‖_1.

Additionally, using Pinsker's inequality we have

(1/2) ‖P_j − P_u‖_1^2 ≤ KL(P_u, P_j).

Finally, the chain rule and simple calculations yield

KL(P_u, P_j)
  = Σ_r (1/2)^{(K+1)T} Σ_{t=1}^T log( P_u[r_t | r^t] / P_j[r_t | r^t] )
  = Σ_r (1/2)^{(K+1)T} Σ_{t=1}^T log( (1/2)^{K+1} / ( (1/2)^{K+1} (1 + 2p sgn(x_{t,j} y_t))^{1[∨_{i=1}^K a_{t,i} = j]} ) )
  = − Σ_{t=1}^T E_u[ 1[∨_{i=1}^K a_{t,i} = j] log(1 + 2p sgn(x_{t,j} y_t)) ]
  ≤ − Σ_{t=1}^T Σ_{i=1}^K P_u(a_{t,i} = j) E_u[ log(1 + 2p sgn(x_{t,j} y_t)) ]   (since x_{t,j} y_t is independent of a_{t,1}, …, a_{t,K})
  = − ( (1/2) log(1 + 2p) + (1/2) log(1 − 2p) ) Σ_{t=1}^T Σ_{i=1}^K P_u(a_{t,i} = j)
  = − (1/2) log(1 − 4p^2) E_u[N_j].

Combining all the above, we conclude the proof.

We have shown previously that the expected reward of any algorithm is bounded above by T/2 + p E_j[N_j]. Applying Lemma 6 to f(r) = N_j ∈ {0, 1, …, T}, we get that

E_j[N_j] ≤ E_u[N_j] + T √( −log(1 − 4p^2) E_u[N_j] ).

Therefore, the expected reward of any algorithm is at most

T/2 + p ( E_u[N_j] + T √( −log(1 − 4p^2) E_u[N_j] ) ).

Since the adversary will choose j to minimize the above, and since the minimum over j is smaller than the expectation over a uniformly random choice of j, the reward against an adversarial choice of j is at most

T/2 + (p/d) Σ_{j=1}^d ( E_u[N_j] + T √( −log(1 − 4p^2) E_u[N_j] ) ).   (12)

Note that

(1/d) Σ_{j=1}^d E_u[N_j] = (1/d) E_u[ N_1 + ⋯ + N_d ] ≤ K T / d.

Combining this with (12) and using Jensen's inequality, we obtain the following upper bound on the reward:

T/2 + p ( (K/d) T + T √( −log(1 − 4p^2) (K/d) T ) ).

Assuming that ε ≤ 1/16, we have that 4p^2 = 4ε ≤ 1/4, and thus, using the inequality −log(1 − q) ≤ (3/2) q, which holds for q ∈ [0, 1/4], we get the upper bound

T/2 + p ( (K/d) T + T √( 6 (K/d) p^2 T ) ).   (13)

Concluding the proof: Take a learning algorithm that finds an ε-good predictor using m examples. Since the reward of the strategy based on this learning algorithm cannot exceed the upper bound given in (13), we obtain that

T/2 + (T − m) p ≤ T/2 + p ( (K/d) T + T √( 6 (K/d) p^2 T ) ),

which, solved for m, gives

m ≥ T ( 1 − K/d − √( 6 (K/d) p^2 T ) ).

Since we assume d ≥ 4K, choosing T = d / (96 K p^2), and recalling that p^2 = ε, gives

m ≥ T/2 = d / (192 K ε).

A.4 Proof of Theorem 4

Let w* = (1/3, 1/3, 1/3). Let x ∈ {±1}^3 be distributed uniformly at random, and let y be determined deterministically as y = ⟨w*, x⟩. Then L_D(w*) = 0. However, any algorithm that views only 2 attributes has an uncertainty of at least ±1/3 about the label, and therefore its expected squared error is at least 1/9. Formally, suppose the algorithm asks for the first two attributes and outputs the prediction ŷ. Since the attributes are generated independently, the value of x_3 does not depend on x_1, x_2, and ŷ, and therefore

E[ (ŷ − ⟨w*, x⟩)^2 ] = E[ (ŷ − w*_1 x_1 − w*_2 x_2 − w*_3 x_3)^2 ] = E[ (ŷ − w*_1 x_1 − w*_2 x_2)^2 ] + E[ (w*_3 x_3)^2 ] ≥ 0 + (1/3)^2 E[x_3^2] = 1/9,

which concludes the proof.
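The construction in the proof of Theorem 4 is easy to check numerically; the small simulation below (our own illustration) confirms that predicting from the first two attributes alone incurs squared error 1/9, even though all three attributes determine y exactly.

    import numpy as np

    rng = np.random.default_rng(4)
    w_star = np.full(3, 1.0 / 3.0)
    X = rng.choice([-1.0, 1.0], size=(100_000, 3))   # uniform x in {-1, +1}^3
    Y = X @ w_star                                    # y = <w*, x>, so L_D(w*) = 0
    y_hat = X[:, :2] @ w_star[:2]                     # best prediction given only x_1, x_2
    print(np.mean((y_hat - Y) ** 2))                  # equals (1/3)^2 = 1/9 exactly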


More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Multiple Instance Learning with Query Bags

Multiple Instance Learning with Query Bags Multiple Instance Learning with Query Bags Boris Babenko UC San Diego bbabenko@cs.ucsd.edu Piotr Dollár California Institute of Technology pdollar@caltech.edu Serge Belongie UC San Diego sjb@cs.ucsd.edu

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Symmetrization and Rademacher Averages

Symmetrization and Rademacher Averages Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and

More information

Tight Complexity Bounds for Optimizing Composite Objectives

Tight Complexity Bounds for Optimizing Composite Objectives Tight Coplexity Bounds for Optiizing Coposite Objectives Blake Woodworth Toyota Technological Institute at Chicago Chicago, IL, 60637 blake@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Detection and Estimation Theory

Detection and Estimation Theory ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016/2017 Lessons 9 11 Jan 2017 Outline Artificial Neural networks Notation...2 Convolutional Neural Networks...3

More information

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13 CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

Multi-Dimensional Hegselmann-Krause Dynamics

Multi-Dimensional Hegselmann-Krause Dynamics Multi-Diensional Hegselann-Krause Dynaics A. Nedić Industrial and Enterprise Systes Engineering Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu B. Touri Coordinated Science Laboratory

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

Machine Learning in the Data Revolution Era

Machine Learning in the Data Revolution Era Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

Good Learners for Evil Teachers

Good Learners for Evil Teachers Ofer Dekel Microsoft Research Microsoft Way Redond WA 985 USA Ohad Shair The Hebrew University Jerusale 994 Israel OFERD@MICROSOFT.COM OHADSH@CS.HUJI.AC.IL Abstract We consider a supervised achine learning

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo

More information

A Probabilistic and RIPless Theory of Compressed Sensing

A Probabilistic and RIPless Theory of Compressed Sensing A Probabilistic and RIPless Theory of Copressed Sensing Eanuel J Candès and Yaniv Plan 2 Departents of Matheatics and of Statistics, Stanford University, Stanford, CA 94305 2 Applied and Coputational Matheatics,

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010 A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING By Eanuel J Candès Yaniv Plan Technical Report No 200-0 Noveber 200 Departent of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA)

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA) Bayesian Learning Chapter 6: Bayesian Learning CS 536: Machine Learning Littan (Wu, TA) [Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6] Bayes Theore MAP, ML hypotheses MAP learners Miniu

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

Using a De-Convolution Window for Operating Modal Analysis

Using a De-Convolution Window for Operating Modal Analysis Using a De-Convolution Window for Operating Modal Analysis Brian Schwarz Vibrant Technology, Inc. Scotts Valley, CA Mark Richardson Vibrant Technology, Inc. Scotts Valley, CA Abstract Operating Modal Analysis

More information

A Unified Approach to Universal Prediction: Generalized Upper and Lower Bounds

A Unified Approach to Universal Prediction: Generalized Upper and Lower Bounds 646 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL 6, NO 3, MARCH 05 A Unified Approach to Universal Prediction: Generalized Upper and Lower Bounds Nuri Denizcan Vanli and Suleyan S Kozat,

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information