Efficient Learning with Partially Observed Attributes


Nicolò Cesa-Bianchi, DSI, Università degli Studi di Milano, Italy (cesa-bianchi@dsi.unimi.it)
Shai Shalev-Shwartz, The Hebrew University, Jerusalem, Israel (shais@cs.huji.ac.il)
Ohad Shamir, The Hebrew University, Jerusalem, Israel (ohadsh@cs.huji.ac.il)

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

We describe and analyze efficient algorithms for learning a linear predictor from examples when the learner can only view a few attributes of each training example. This is the case, for instance, in medical research, where each patient participating in the experiment is only willing to go through a small number of tests. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on each training example. We demonstrate the efficiency of our algorithms by showing that when running on digit recognition data, they obtain a high prediction accuracy even when the learner gets to see only four pixels of each image.

1 Introduction

Suppose we would like to predict if a person has some disease based on medical tests. Theoretically, we may choose a sample of the population, perform a large number of medical tests on each person in the sample, and learn from this information. In many situations this is unrealistic, since patients participating in the experiment are not willing to go through a large number of medical tests. The above example motivates the problem studied in this paper, namely learning when there is a hard constraint on the number of attributes the learner may view for each training example. We propose an efficient algorithm for dealing with this partial information problem, and bound the number of additional training examples sufficient to compensate for the lack of full information on each training example. Roughly speaking, we actively pick which attributes to observe in a randomized way so as to construct a noisy version of all attributes. Intuitively, we can still learn despite the error of this estimate because, instead of receiving the exact value of each individual example in a small set, it suffices to get noisy estimates of many examples.

1.1 Related Work

Many methods have been proposed for dealing with missing or partial information. Most of the approaches do not come with formal guarantees on the risk of the resulting algorithm, and are not guaranteed to converge in polynomial time. The difficulty stems from the exponential number of ways to complete the missing information. In the framework of generative models, a popular approach is the Expectation-Maximization (EM) procedure (Dempster et al., 1977). The main drawback of the EM approach is that it might find sub-optimal solutions. In contrast, the methods we propose in this paper are provably efficient and come with finite sample guarantees on the risk. Our technique for dealing with missing information borrows ideas from algorithms for the adversarial multi-armed bandit problem (Auer et al., 2003; Cesa-Bianchi and Lugosi, 2006). Our learning algorithms actively choose which attributes to observe for each example. This and similar protocols were studied in the context of active learning (Cohn et al., 1994; Balcan et al., 2006; Hanneke, 2007; 2009; Beygelzimer et al., 2009), where the learner can ask for the target associated with specific examples. The specific learning task we consider in this paper was first proposed in (Ben-David and Dichterman, 1998), where it is called "learning with restricted focus of attention". Ben-David and Dichterman (1998) considered the classification setting and showed learnability

of several hypothesis classes in this model, like k-DNF and axis-aligned rectangles (Ben-David and Dichterman (1998) do describe learnability results for similar classes, but only under the restricted family of product distributions). However, to the best of our knowledge, no efficient algorithm for the class of linear predictors has been proposed. A related setting, called budgeted learning, was recently studied; see for example (Deng et al., 2007; Kapoor and Greiner, 2005) and the references therein. In budgeted learning, the learner purchases attributes at some fixed cost subject to an overall budget. Besides lacking formal guarantees, this setting is different from the one we consider in this paper, because we impose a budget constraint on the number of attributes that can be obtained for every individual example, as opposed to a global budget. In some applications, such as the medical application discussed previously, our constraint leads to a more realistic data acquisition process: the global budget allows asking for many attributes of some individual patients, while our protocol guarantees a constant number of medical tests for all the patients. Our technique is reminiscent of methods used in the compressed learning framework (Calderbank et al., 2009; Zhou et al., 2009), where data is accessed via a small set of random linear measurements. Unlike compressed learning, where learners are both trained and evaluated in the compressed domain, our techniques are mainly designed for a scenario in which only the access to training data is restricted. The opposite setting, in which full information is given at training time and the goal is to train a predictor that depends only on a small number of attributes at test time, was studied in the context of learning sparse predictors; see for example (Tibshirani, 1996) and the wide literature on sparsity properties of ℓ1 regularization. Since our algorithms also enforce a low ℓ1 norm, many of those results can be combined with our techniques to yield an algorithm that views only a constant number of attributes per training example, and a number of attributes comparable to the achievable sparsity at test time. Since our focus in this work is on constrained information at training time, we do not elaborate on this subject. Furthermore, in some real-world situations, it is reasonable to assume that attributes are very expensive at training time but are easier to obtain at test time. Returning to the example of medical applications, it is unrealistic to convince patients to participate in a medical experiment in which they need to go through a lot of medical tests, but once the system is trained, at testing time, patients who need the prediction of the system will agree to perform as many medical tests as needed. A variant of the above setting is the one studied by Greiner et al. (2002), where the learner has all the information at training time and at test time tries to actively choose a small number of attributes to form a prediction. Note that active learning at training time, as we do here, may give more learning power than active learning at testing time. For example, we formally prove that while it is possible to learn a consistent predictor accessing at most 2 attributes of each example at training time, it is not possible (even with an infinite amount of training examples) to build an active classifier that uses at most 2 attributes of each example at test time, and whose error will be smaller than a constant.

2 Main Results

In this section we outline the main results. We start with a formal description of the learning problem. In linear regression each example is an instance-target pair, (x, y) ∈ R^d × R. We refer to x as a vector of attributes, and the goal of the learner is to find a linear predictor x ↦ ⟨w, x⟩, where we refer to w ∈ R^d as the predictor. The performance of a predictor w on an instance-target pair (x, y) ∈ R^d × R is measured by a loss function ℓ(⟨w, x⟩, y). For simplicity, we focus on the squared loss function, ℓ(a, b) = (a − b)^2, and briefly discuss other loss functions in Section 5. Following the standard framework of statistical learning (Haussler, 1992; Devroye et al., 1996; Vapnik, 1998), we model the environment as a joint distribution D over the set of instance-target pairs, R^d × R. The goal of the learner is to find a predictor with low risk, defined as the expected loss:

L_D(w) := E_{(x,y)~D} [ ℓ(⟨w, x⟩, y) ].

Since the distribution D is unknown, the learner relies on a training set of m examples S = (x_1, y_1), …, (x_m, y_m), which are assumed to be sampled i.i.d. from D. We denote the training loss by

L_S(w) := (1/m) Σ_{i=1}^m ( ⟨w, x_i⟩ − y_i )^2.

We now distinguish between two scenarios:

Full information: The learner receives the entire training set. This is the traditional linear regression setting.

Partial information: For each individual example (x_i, y_i), the learner receives the target y_i but is only allowed to see k attributes of x_i, where k is a parameter of the problem. The learner has the freedom to actively choose which of the attributes will be revealed, as long as at most k of them are given.
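To make the access protocol concrete, the following minimal Python sketch (our own illustration; the function name and toy data are not from the paper) enforces the budget of k observed attributes per example, while the target y remains visible.

    import numpy as np

    def reveal_attributes(x, indices, k):
        """Partial-information oracle (hypothetical helper): return only the
        requested attribute values of x, enforcing a budget of at most k
        observed attributes for this example."""
        indices = list(indices)
        if len(indices) > k:
            raise ValueError("attribute budget exceeded")
        return {int(i): float(x[i]) for i in indices}

    # Toy usage: the learner sees the target y, but only k = 4 chosen pixels of x.
    rng = np.random.default_rng(0)
    x, y = rng.uniform(-1.0, 1.0, size=784), 1.0
    observed = reveal_attributes(x, rng.choice(784, size=4, replace=False), k=4)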

While the full information case was extensively studied, the partial information case is more challenging. Our approach for dealing with the problem of partial information is to rely on algorithms for the full information case and to fill in the missing information in a randomized, data- and algorithm-dependent, way. As a simple baseline, we begin by describing a straightforward adaptation of Lasso (Tibshirani, 1996), based on a direct nonadaptive estimate of the loss function. We then turn to describe a more effective approach, which combines a stochastic gradient descent algorithm called Pegasos (Shalev-Shwartz et al., 2007) with the active sampling of attributes in order to estimate the gradient of the loss at each step.

2.1 Baseline Algorithm

A popular approach for learning a linear regressor is to minimize the empirical loss on the training set plus a regularization term taking the form of a norm of the predictor. For example, in ridge regression the regularization term is ‖w‖_2^2 and in Lasso the regularization term is ‖w‖_1. Instead of regularization, we can include a constraint of the form ‖w‖_1 ≤ B or ‖w‖_2 ≤ B. With an adequate tuning of parameters, the regularization form is equivalent to the constraint form. In the constraint form, the predictor is a solution to the following optimization problem:

min_{w ∈ R^d} (1/|S|) Σ_{(x,y) ∈ S} ( ⟨w, x⟩ − y )^2   s.t. ‖w‖_p ≤ B,   (1)

where S = {(x_1, y_1), …, (x_m, y_m)} is a training set of m examples, B is a regularization parameter, and p is 1 for Lasso and 2 for ridge regression. Standard risk bounds for Lasso imply that if ŵ is a minimizer of (1) (with p = 1), then with probability greater than 1 − δ over the choice of a training set of size m we have

L_D(ŵ) ≤ min_{w: ‖w‖_1 ≤ B} L_D(w) + O( B^2 √( ln(d/δ) / m ) ).   (2)

To adapt Lasso to the partial information case, we first rewrite the squared loss as follows:

( ⟨w, x⟩ − y )^2 = w^T (x x^T) w − 2 y x^T w + y^2,

where w, x are column vectors and w^T, x^T are their corresponding transposes (i.e., row vectors). Next, we estimate the matrix x x^T and the vector x using the partial information we have, and then we solve the optimization problem given in (1) with the estimated values of x x^T and x. To estimate the vector x we can pick an index i uniformly at random from [d] = {1, …, d} and define the estimate to be a vector v such that

v_r = d x_r if r = i, and v_r = 0 otherwise.   (3)

It is easy to verify that v is an unbiased estimate of x, namely E[v] = x, where the expectation is with respect to the choice of the index i. When we are allowed to see k > 1 attributes, we simply repeat the above process (without replacement) and set v to be the averaged vector. To estimate the matrix x x^T we could pick two indices i, j independently and uniformly at random from [d], and define the estimate to be a matrix with all zeros except d^2 x_i x_j in the (i, j) entry. However, this yields a non-symmetric matrix, which would make our optimization problem with the estimated matrix non-convex. To overcome this obstacle, we symmetrize the matrix by adding its transpose and dividing by 2. The resulting baseline procedure is given in Algorithm 1.

Algorithm 1 Baseline(S, k)
  Input: S - full information training set with m examples;
         k - can view only k elements of each instance in S
  Parameter: B
  Initialize: Ā = 0 ∈ R^{d×d}; v̄ = 0 ∈ R^d; ȳ = 0
  for each (x, y) ∈ S
    v = 0 ∈ R^d; A = 0 ∈ R^{d×d}
    choose C uniformly at random from all subsets of [d] × [d] of size k/2
    for each (i, j) ∈ C
      v_i = v_i + (d/k) x_i
      v_j = v_j + (d/k) x_j
      A_{i,j} = A_{i,j} + (d^2/k) x_i x_j
      A_{j,i} = A_{j,i} + (d^2/k) x_i x_j
    end
    Ā = Ā + A/m
    v̄ = v̄ − (2y/m) v
    ȳ = ȳ + y^2/m
  end
  Let L̃_S(w) = w^T Ā w + ⟨w, v̄⟩ + ȳ
  Output: solution of min_{w: ‖w‖_1 ≤ B} L̃_S(w)

(We note that an even simpler approach is to arbitrarily assume that the correlation matrix is the identity matrix; the solution to the loss minimization problem is then simply the averaged vector w = (1/m) Σ_{(x,y) ∈ S} y x. In that case, we can simply replace x by its estimated vector as defined in (3). While this naive approach can work on very simple classification tasks, it performs poorly on realistic data sets, in which the correlation matrix is not likely to be the identity. Indeed, in our experiments with the MNIST data set, we found that this approach performed poorly relative to the algorithms proposed in this paper.)
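The per-example estimates used by Algorithm 1 are straightforward to implement. The sketch below is our own Python illustration (not the authors' code): it builds the unbiased estimates of x x^T and of x for a single example while reading at most k attributes; Algorithm 1 then accumulates A/m, subtracts (2y/m) v, and averages y^2/m over the training set.

    import numpy as np

    def baseline_estimates(x, k, rng):
        """One inner iteration of the Baseline procedure (sketch): sample k/2
        cells of [d] x [d], and form the symmetrized unbiased estimate A of
        x x^T and the unbiased estimate v of x, reading at most k attributes."""
        d = x.shape[0]
        A = np.zeros((d, d))
        v = np.zeros(d)
        for c in rng.choice(d * d, size=k // 2, replace=False):
            i, j = divmod(int(c), d)
            v[i] += (d / k) * x[i]
            v[j] += (d / k) * x[j]
            A[i, j] += (d ** 2 / k) * x[i] * x[j]
            A[j, i] += (d ** 2 / k) * x[i] * x[j]
        return A, v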

The following theorem shows that, similarly to Lasso, the Baseline algorithm is competitive with the optimal linear predictor having a bounded ℓ1 norm.

Theorem 1 Let D be a distribution such that P[ x ∈ [−1, +1]^d ∧ y ∈ [−1, +1] ] = 1. Let ŵ be the output of Baseline(S, k), where |S| = m. Then, with probability of at least 1 − δ over the choice of the training set and the algorithm's own randomization, we have

L_D(ŵ) ≤ min_{w: ‖w‖_1 ≤ B} L_D(w) + O( ((d B)^2 / k) √( ln(d/δ) / m ) ).

The above theorem tells us that for a sufficiently large training set we can find a very good predictor. Put another way, a large number of examples can compensate for the lack of full information on each individual example. In particular, to overcome the extra factor d^2/k in the bound, which does not appear in the full information bound given in (2), we need to increase m by a factor of d^4/k^2. Note that when k = d we do not recover the full information bound. This is because we try to estimate a matrix with d^2 entries using only k = d < d^2 samples. In the next subsection, we describe a better, adaptive procedure for the partial information case.

2.2 Gradient-based Attribute Efficient Regression

In this section, by avoiding the estimation of the matrix x x^T, we significantly decrease the number of additional examples sufficient for learning with k attributes per training example. To do so, we do not try to estimate the loss function but rather estimate the gradient ∇ℓ(w) = 2(⟨w, x⟩ − y) x, with respect to w, of the squared loss function (⟨w, x⟩ − y)^2. Each vector w defines a probability distribution over [d] by letting P[i] = |w_i| / ‖w‖_1. We can estimate the gradient using 2 attributes as follows. First, we randomly pick j from [d] according to the distribution defined by w. Using j, we estimate the term ⟨w, x⟩ by sgn(w_j) ‖w‖_1 x_j. It is easy to verify that the expectation of this estimate equals ⟨w, x⟩. Second, we randomly pick i from [d] according to the uniform distribution over [d]. Based on i, we estimate the vector x as in (3). Overall, we obtain the following unbiased estimate of the gradient:

∇̃ℓ(w) = 2 ( sgn(w_j) ‖w‖_1 x_j − y ) v,   (4)

where v is as defined in (3). The advantage of the above approach over the loss-based approach we took before is that the magnitude of each element of the gradient estimate is of order d ‖w‖_1. This is in contrast to what we had for the loss-based approach, where the magnitude of each element of the matrix A was of order d^2. In many situations, the ℓ1 norm of a good predictor is significantly smaller than d, and in these cases the gradient-based estimate is better than the loss-based estimate. However, while in the previous approach our estimate did not depend on a specific w, now the estimate depends on w. We therefore need an iterative learning method in which at each iteration we use the gradient of the loss function on an individual example. Luckily, the stochastic gradient descent approach conveniently fits our needs. Concretely, below we describe a variant of the Pegasos algorithm (Shalev-Shwartz et al., 2007) for learning linear regressors. Pegasos tries to minimize the regularized risk

min_w E_{(x,y)~D} [ ( ⟨w, x⟩ − y )^2 ] + λ ‖w‖_2^2.   (5)

Of course, the distribution D is unknown, and therefore we cannot hope to solve the above problem exactly. Instead, Pegasos finds a sequence of weight vectors that (on average) converge to the solution of (5). We start with the all-zeros vector w = 0 ∈ R^d. Then, at each iteration Pegasos picks the next example in the training set (which is equivalent to sampling a fresh example according to D) and calculates the gradient of the loss function on this example with respect to the current weight vector w. In our case, the gradient is simply 2(⟨w, x⟩ − y) x. We denote this gradient vector by ∇. Finally, Pegasos updates the predictor according to the rule w = (1 − 1/t) w − (1/(λt)) ∇, where t is the current iteration number. To apply Pegasos in the partial information case we could simply replace the gradient vector ∇ with its estimate given in (4). However, our analysis shows that it is desirable to maintain an estimation vector with small magnitude. Since the magnitude of the estimate is of order d ‖w‖_1, where w is the current weight vector maintained by the algorithm, we would like to ensure that ‖w‖_1 is always smaller than some threshold B. We achieve this goal by adding an additional projection step at the end of each Pegasos iteration. Formally, after performing the update we set

w ← argmin_{u: ‖u‖_1 ≤ B} ‖u − w‖_2.   (6)

This projection step can be performed efficiently in time O(d) using the technique described in (Duchi et al., 2008). A pseudo-code of the resulting Attribute Efficient Regression (AER) algorithm is given in Algorithm 2.
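Putting the pieces together, here is our own Python sketch of a single update (an illustration under assumptions, not the paper's code): the ⟨w, x⟩ and x estimates each use k/2 observed attributes, the step follows the Pegasos rule with the estimated gradient 2(ŷ − y)v, and the projection uses the standard O(d log d) sort-based method rather than the O(d) algorithm of Duchi et al. (2008). A toy driver loop at the end mirrors the iterate averaging done in Algorithm 2 below.

    import numpy as np

    def l1_projection(w, B):
        """Euclidean projection of w onto the l1 ball of radius B
        (sort-based, O(d log d))."""
        if np.abs(w).sum() <= B:
            return w
        u = np.sort(np.abs(w))[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, w.size + 1) > (css - B))[0][-1]
        theta = (css[rho] - B) / (rho + 1.0)
        return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

    def aer_step(w, x, y, t, k, lam, B, rng):
        """One AER iteration (sketch): estimate x and <w, x> from at most k
        observed attributes, take a Pegasos-style step on the squared loss
        with the estimated gradient, then project onto the l1 ball."""
        d = w.shape[0]
        # unbiased estimate of x from k/2 uniformly chosen attributes, as in (3)
        v = np.zeros(d)
        for j in rng.choice(d, size=k // 2, replace=False):
            v[j] += (2.0 * d / k) * x[j]
        # unbiased estimate of <w, x> from k/2 attributes sampled prop. to |w_i|
        y_hat = 0.0
        norm1 = np.abs(w).sum()
        if norm1 > 0.0:                      # at w = 0 the estimate is simply 0
            probs = np.abs(w) / norm1
            for i in rng.choice(d, size=k // 2, p=probs):
                y_hat += (2.0 / k) * np.sign(w[i]) * norm1 * x[i]
        # Pegasos-style update with the gradient estimate 2 * (y_hat - y) * v
        w = (1.0 - 1.0 / t) * w - (2.0 / (lam * t)) * (y_hat - y) * v
        return l1_projection(w, B)

    # Toy driver (assumed parameters): run over a synthetic stream and average
    # the iterates, as Algorithm 2 does, to obtain the output w_bar.
    rng = np.random.default_rng(0)
    d, m, k, B, lam = 50, 5000, 4, 1.0, 0.1
    w_true = np.zeros(d); w_true[:5] = 0.2
    w = np.zeros(d); w_bar = np.zeros(d)
    for t in range(1, m + 1):
        x = rng.uniform(-1.0, 1.0, size=d)
        y = float(x @ w_true)
        w = aer_step(w, x, y, t, k, lam, B, rng)
        w_bar += w / m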

Algorithm 2 AER(S, k)
  Input: S - full information training set with m examples;
         k - access only k elements of each instance in S
  Parameters: λ, B
  w = (0, …, 0); w̄ = w; t = 1
  for each (x, y) ∈ S
    v = 0 ∈ R^d
    choose C uniformly at random from all subsets of [d] of size k/2
    for each j ∈ C
      v_j = v_j + (2d/k) x_j
    end
    ŷ = 0
    for r = 1, …, k/2
      sample i from [d] based on P[i] = |w_i| / ‖w‖_1
      ŷ = ŷ + (2/k) sgn(w_i) ‖w‖_1 x_i
    end
    w = (1 − 1/t) w − (2/(λt)) (ŷ − y) v
    w = argmin_{u: ‖u‖_1 ≤ B} ‖u − w‖_2
    w̄ = w̄ + w/m
    t = t + 1
  end
  Output: w̄

The following theorem provides convergence guarantees for AER.

Theorem 2 Let D be a distribution such that P[ x ∈ [−1, +1]^d ∧ y ∈ [−1, +1] ] = 1. Let w* be any vector such that ‖w*‖_1 ≤ B and ‖w*‖_2 ≤ B_2. Then,

E[ L_D(w̄) ] ≤ L_D(w*) + O( d (B + 1) B_2 √( ln(m) / (k m) ) ),

where |S| = m, w̄ is the output of AER(S, k) run with λ = ((B + 1) d / B_2) √( log(m) / (k m) ), and the expectation is over the choice of S and over the algorithm's own randomization.

For simplicity and readability, in the above theorem we only bounded the expected risk. It is possible to obtain similar guarantees with high probability by relying on Azuma's inequality; see for example (Cesa-Bianchi et al., 2004). Note that ‖w*‖_2 ≤ ‖w*‖_1 ≤ B, so Theorem 2 implies that

L_D(w̄) ≤ min_{w: ‖w‖_1 ≤ B} L_D(w) + O( d B^2 √( ln(m) / (k m) ) ).

Therefore, the bound for AER is much better than the bound for Baseline: instead of d^2/k we have d/√k. (When comparing bounds, we ignore logarithmic terms; also, in this discussion we assume that B and B_2 are at least 1.) It is interesting to compare the bound for AER to the Lasso bound in the full information case given in (2). As can be seen, to achieve the same level of risk, AER needs a factor of d^2/k more examples than the full information Lasso. (We note that when d = k we still do not recover the full information bound. However, it is possible to improve the analysis and replace the factor d/√k with a factor of order √( d max_t ‖x_t‖_2^2 / k ).) Since each AER example uses only k attributes while each Lasso example uses all d attributes, the ratio between the total number of attributes AER needs and the number of attributes Lasso needs to achieve the same error is O(d). Intuitively, when given d times the total number of attributes, we can fully compensate for the partial information protocol. However, in some situations even this extra d factor is not needed. Suppose we know that the vector w*, which minimizes the risk, is dense; that is, it satisfies ‖w*‖_1 ≈ √d ‖w*‖_2. In this case, choosing B_2 = B/√d, the bound in Theorem 2 becomes of order B^2 √(d/k) / √m. Therefore, the number of examples AER needs in order to achieve the same error as Lasso is only a factor d/k more than the number of examples Lasso uses. But this implies that both AER and Lasso need the same number of attributes in order to achieve the same level of error! Crucially, the above holds only if w* is dense. When w* is sparse we have ‖w*‖_1 ≈ ‖w*‖_2, and then AER needs more attributes than Lasso. One might wonder whether a more clever active sampling strategy could attain, in the sparse case, the performance of Lasso while using the same number of attributes. The next subsection shows that this is not possible in general.

2.3 Lower bounds and negative results

We now show (proof in the appendix) that any attribute efficient algorithm needs in general order of d/ε examples for learning an ε-accurate sparse linear predictor. Recall that the upper bound of AER implies that order of d^2 (B + 1)^2 B_2^2 / ε^2 examples are sufficient for learning a predictor with L_D(w) − L_D(w*) < ε. Specializing this sample complexity bound of AER to the w* described in Theorem 3 below yields that O(d^2/ε) examples are sufficient for AER to learn a good predictor in this case. That is, we have a gap of a factor d between the lower bound and the upper bound, and it remains open to bridge this gap.

Theorem 3 For any ε ∈ (0, 1/16), k, and d ≥ 4k,

there exists a distribution over examples and a weight vector w*, with ‖w*‖_0 = 1 and ‖w*‖_2 = ‖w*‖_1 = 2√ε, such that any attribute efficient regression algorithm accessing at most k attributes per training example must see (in expectation) at least Ω( d / (k ε) ) examples in order to learn a linear predictor w with L_D(w) − L_D(w*) < ε.

Recall that in our setting, while at training time the learner can only view k attributes of each example, at test time all attributes can be observed. The setting of Greiner et al. (2002), instead, assumes that at test time the learner cannot observe all the attributes. The following theorem shows that if a learner can view at most 2 attributes at test time, then it is impossible to give accurate predictions at test time, even when the optimal linear predictor is known.

Theorem 4 There exists a weight vector w* and a distribution D such that L_D(w*) = 0, while any algorithm A that gives predictions A(x) while viewing only 2 attributes of each x must have L_D(A) ≥ 1/9.

The proof is given in the appendix. This negative result highlights an interesting phenomenon: we can learn an arbitrarily accurate predictor w from partially observed examples; however, even if we know the optimal w*, we might not be able to accurately predict a new partially observed example.

3 Proof Sketch of Theorem 2

Here we only sketch the proof of Theorem 2. A complete proof of all our theorems is given in the appendix. We start with a general logarithmic regret bound for strongly convex functions (Hazan et al., 2006; Kakade and Shalev-Shwartz, 2008). The regret bound implies the following. Let z_1, …, z_m be a sequence of vectors, each of which has norm bounded by G. Let λ > 0 and consider the sequence of functions g_1, …, g_m such that g_t(w) = (λ/2) ‖w‖^2 + ⟨z_t, w⟩. Each g_t is λ-strongly convex (meaning, it is not too flat), and therefore regret bounds for strongly convex functions tell us that there is a way to construct a sequence of vectors w_1, …, w_m such that for any w* that satisfies ‖w*‖_1 ≤ B we have

(1/m) Σ_{t=1}^m g_t(w_t) − (1/m) Σ_{t=1}^m g_t(w*) ≤ O( G^2 log(m) / (λ m) ).

With an appropriate choice of λ, and with the assumption ‖w*‖_2 ≤ B_2, the above inequality implies that

(1/m) Σ_{t=1}^m ⟨ z_t, w_t − w* ⟩ ≤ α,   where α = O( G B_2 √( log(m) / m ) ).

This holds for any sequence z_1, …, z_m, and in particular we can set z_t = 2(ŷ_t − y_t) v_t. Note that z_t is a random vector that depends both on the value of w_t and on the random bits chosen on round t. Taking the conditional expectation of z_t with respect to the random bits chosen on round t, we obtain that E[z_t | w_t] is exactly the gradient of (⟨w, x_t⟩ − y_t)^2 at w_t, which we denote by ∇_t. From the convexity of the squared loss, we can lower bound ⟨∇_t, w_t − w*⟩ by (⟨w_t, x_t⟩ − y_t)^2 − (⟨w*, x_t⟩ − y_t)^2. That is, in expectation we have

E[ (1/m) Σ_{t=1}^m ( (⟨w_t, x_t⟩ − y_t)^2 − (⟨w*, x_t⟩ − y_t)^2 ) ] ≤ α.

Taking expectation with respect to the random choice of the examples from D, denoting w̄ = (1/m) Σ_{t=1}^m w_t, and using Jensen's inequality, we get that E[L_D(w̄)] ≤ L_D(w*) + α. Finally, we need to make sure that α is not too large. The only potential danger is that G, the bound on the norms of z_1, …, z_m, will be large. We make sure this cannot happen by restricting each w_t to the ℓ1 ball of radius B, which ensures that ‖z_t‖ ≤ O((B + 1) d) for all t.

4 Experiments

We performed some preliminary experiments to test the behavior of our algorithms on the well-known MNIST digit recognition dataset (Cun et al., 1998), which contains 70,000 images (28 × 28 pixels each) of the digits 0-9. The advantages of this dataset for our purposes are that it is not a small-scale dataset, it has a reasonable dimensionality-to-data-size ratio, and the setting is clearly interpretable graphically. While this dataset is designed for classification (e.g., recognizing the digit in the image), we can still apply our algorithms to it by regressing to the label. First, to demonstrate the hardness of our setting, we provide in Figure 1 below some examples of images from the dataset, in the full information setting and the partial information setting. The upper row contains six images from the dataset, as available to a full-information algorithm. A partial-information algorithm, however, will have a much more limited access to these images. In particular, if the algorithm may only choose k = 4 pixels from each image, the same six images as available to it might look like the bottom row of Figure 1. We began by looking at a dataset composed of "3" vs. "5", where all the 3 digits were labeled as −1 and all the 5 digits were labeled as +1. We ran four different algorithms on this dataset: the simple Baseline algorithm, AER, as well as ridge regression and Lasso for comparison (for Lasso, we solved (1) with p = 1). Both ridge regression and Lasso were run in the full information setting: namely, they enjoyed full access to

[Figure 1: In the upper row, six examples from the training set (of digits 3 and 5) are shown. In the lower row we show the same six examples, where only four randomly sampled pixels from each original image are displayed.]

all attributes of all examples in the training set. The Baseline algorithm and AER, however, were given access to only 4 attributes from each training example. We randomly split the dataset into a training set and a test set (with the test set being 10% of the original dataset). For each algorithm, parameter tuning was performed using 10-fold cross validation. Then, we ran the algorithm on increasingly long prefixes of the training set, and measured the average regression error (⟨w, x⟩ − y)^2 on the test set. The results (averaged over runs on 10 random train-test splits) are presented in Figure 2. In the upper plot, we see how the test regression error improves with the number of examples. The Baseline algorithm is highly unstable at the beginning, probably due to the ill-conditioning of the estimated covariance matrix, although it eventually stabilizes (to prevent a graphical mess at the left-hand side of the figure, we removed the error bars from the corresponding plot). Its performance is worse than that of AER, completely in line with our earlier theoretical analysis. The bottom plot of Figure 2 is similar, only that now the x-axis represents the cumulative number of attributes seen by each algorithm rather than the number of examples. For the partial-information algorithms, the graph ends at approximately 49,000 attributes, which is the total number of attributes accessed by the algorithm after running over all training examples, seeing k = 4 pixels from each example. However, for the full-information algorithms, 49,000 attributes are already seen after just 62 examples. When we compare the algorithms in this way, we see that our AER algorithm achieves excellent performance for a given attribute budget, significantly better than the other ℓ1-based algorithms, and even comparable to full-information ridge regression.

[Figure 2: Test regression error for each of the 4 algorithms (Ridge Regression, Lasso, AER, Baseline), over increasing prefixes of the training set for 3 vs. 5, plotted against the number of examples (upper plot) and against the cumulative number of features (bottom plot). The results are averaged over 10 runs.]

Finally, we tested the algorithms over 45 datasets generated from MNIST, one for each possible pair of digits. For each dataset and each of 10 random train-test splits, we performed parameter tuning for each algorithm separately, and checked the average squared error on the test set. The median test errors over all datasets are presented in the table below.

                        Test Error
  Full information      Ridge      0.110
                        Lasso      0.222
  Partial information   AER        0.320
                        Baseline   0.815

As can be seen, the AER algorithm manages to achieve good performance, not much worse than the full-information Lasso algorithm. The Baseline algorithm, however, achieves a substantially worse performance, in line with our theoretical analysis above. We also calculated the test classification error of AER, i.e. the frequency of sign(⟨w, x⟩) ≠ y, and found that AER, which can see only 4 pixels per image, usually performs only a little worse than the full-information algorithms (ridge regression and Lasso), which enjoy full access to all 784 pixels in each image. In particular, the median test classification errors of AER, Lasso, and Ridge are

3.5%, 1%, and 1.3%, respectively.

5 Discussion and Extensions

In this paper, we provided an efficient algorithm for learning when only a few attributes of each training example can be seen. The algorithm comes with formal guarantees, is provably competitive with algorithms which enjoy full access to the data, and seems to perform well in practice. We also presented sample complexity lower bounds, which are only a factor d smaller than the upper bound achieved by our algorithm, and it remains open to bridge this gap. Our approach easily extends to other gradient-based algorithms besides Pegasos, for example generalized additive algorithms such as p-norm Perceptrons and Winnow; see, e.g., (Cesa-Bianchi and Lugosi, 2006). An obvious direction for future research is how to deal with loss functions other than the squared loss. In upcoming work on a related problem, we develop a technique which allows us to deal with arbitrary analytic loss functions, but which in the setting of this paper would lead to sample complexity bounds that are exponential in d. Another interesting extension we are considering is connecting our results to the field of privacy-preserving learning (Dwork, 2008), where the goal is to exploit the attribute efficiency property in order to prevent acquisition of information about individual data instances.

References

P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32, 2003.
M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of ICML, 2006.
S. Ben-David and E. Dichterman. Learning with restricted focus of attention. Journal of Computer and System Sciences, 56, 1998.
A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of ICML, 2009.
R. Calderbank, S. Jafarpour, and R. Schapire. Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. Manuscript, 2009.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), September 2004.
D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201-221, 1994.
Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), November 1998.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
K. Deng, C. Bourke, S. Scott, J. Sunderman, and Y. Zheng. Bandit-based algorithms for budgeted learning. In Proceedings of ICDM. IEEE Computer Society, 2007.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of ICML, 2008.
C. Dwork. Differential privacy: A survey of results. In M. Agrawal, D.-Z. Du, Z. Duan, and A. Li, editors, TAMC, volume 4978 of Lecture Notes in Computer Science, pages 1-19. Springer, 2008.
R. Greiner, A. Grove, and D. Roth. Learning cost-sensitive active classifiers. Artificial Intelligence, 139(2):137-174, 2002.
S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of ICML, 2007.
S. Hanneke. Adaptive rates of convergence in active learning. In Proceedings of COLT, 2009.
D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.
E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of ICML, 2006.
S. Kakade and S. Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Proceedings of NIPS, 2008.
A. Kapoor and R. Greiner. Learning and classifying under hard budgets. In Proceedings of ECML, pages 170-181, 2005.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-gradient SOlver for SVM. In Proceedings of ICML, 2007.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.
S. Zhou, J. Lafferty, and L. Wasserman. Compressed and privacy-sensitive sparse regression. IEEE Transactions on Information Theory, 55(2), 2009.

A Proofs

A.1 Proof of Theorem 1

To ease our calculations, we first show that sampling k elements without replacement and then averaging the result has the same expectation as sampling just once. In the lemma below, for a set C we denote the uniform distribution over C by U(C).

Lemma 1 Let C be a finite set and let f : C → R be an arbitrary function. Let C_k = {C' ⊆ C : |C'| = k}. Then,

E_{C' ~ U(C_k)} [ (1/k) Σ_{c ∈ C'} f(c) ] = E_{c ~ U(C)} [ f(c) ].

Proof Denote |C| = n, and write (n choose k) for the binomial coefficient. We have:

E_{C' ~ U(C_k)} [ (1/k) Σ_{c ∈ C'} f(c) ]
  = (1 / (n choose k)) Σ_{C' ∈ C_k} (1/k) Σ_{c ∈ C'} f(c)
  = (1 / (k (n choose k))) Σ_{c ∈ C} f(c) |{C' ∈ C_k : c ∈ C'}|
  = (1 / (k (n choose k))) Σ_{c ∈ C} f(c) (n−1 choose k−1)
  = ( (n−1)! k! (n−k)! / (k n! (k−1)! (n−k)!) ) Σ_{c ∈ C} f(c)
  = (1/n) Σ_{c ∈ C} f(c)
  = E_{c ~ U(C)} [ f(c) ].

To prove Theorem 1, we first show that the estimated matrix constructed by the Baseline algorithm is likely to be close to the true correlation matrix over the training set.

Lemma 2 Let A_t be the matrix constructed at iteration t of the Baseline algorithm, and note that Ā = (1/m) Σ_{t=1}^m A_t. Let X̄ = (1/m) Σ_{t=1}^m x_t x_t^T. Then, with probability of at least 1 − δ over the algorithm's own randomness, we have that for all r, s,

| Ā_{r,s} − X̄_{r,s} | ≤ (d^2 / k) √( 2 ln(2 d^2 / δ) / m ).

Proof Based on Lemma 1, it is easy to verify that E[A_t] = x_t x_t^T. Additionally, since we sample without replacement, each element of A_t is in [−d^2/k, d^2/k] (because we assume ‖x_t‖_∞ ≤ 1). Therefore, we can apply Hoeffding's inequality to each element of Ā and obtain that

P[ | Ā_{r,s} − X̄_{r,s} | > ε ] ≤ 2 exp( −m k^2 ε^2 / (2 d^4) ).

Combining the above with the union bound, we obtain that

P[ ∃(r, s) : | Ā_{r,s} − X̄_{r,s} | > ε ] ≤ 2 d^2 exp( −m k^2 ε^2 / (2 d^4) ).

Setting the right-hand side of the above to δ and rearranging terms, we conclude the proof.

Next, we show that the estimate of the linear part of the objective function is also likely to be accurate.
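As a quick numerical sanity check of Lemma 1 (our own illustration, with arbitrary toy values), one can enumerate all k-subsets of a small set and verify that the two expectations coincide exactly:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(1)
    n, k = 7, 3
    f = rng.normal(size=n)              # arbitrary f: C -> R, stored as a lookup table
    subsets = list(combinations(range(n), k))
    lhs = np.mean([f[list(s)].mean() for s in subsets])   # average over all k-subsets
    rhs = f.mean()                                        # f at a uniform element of C
    assert np.isclose(lhs, rhs)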

Lemma 3 Let v_t be the vector constructed at iteration t of the Baseline algorithm, and note that v̄ = −(1/m) Σ_{t=1}^m 2 y_t v_t. Let x̄ = −(1/m) Σ_{t=1}^m 2 y_t x_t. Then, with probability of at least 1 − δ over the algorithm's own randomness, we have

‖ v̄ − x̄ ‖_∞ ≤ (d / k) √( 8 ln(2 d / δ) / m ).

Proof Based on Lemma 1, it is easy to verify that E[2 y_t v_t] = 2 y_t x_t. Additionally, since we sample k/2 pairs without replacement, each element of v_t is in [−2d/k, 2d/k] (because we assume ‖x_t‖_∞ ≤ 1), and thus each element of 2 y_t v_t is in [−4d/k, 4d/k] (because we assume |y_t| ≤ 1). Therefore, we can apply Hoeffding's inequality to each element of v̄ and obtain that

P[ | v̄_r − x̄_r | > ε ] ≤ 2 exp( −m k^2 ε^2 / (8 d^2) ).

Combining the above with the union bound, we obtain that

P[ ∃r : | v̄_r − x̄_r | > ε ] ≤ 2 d exp( −m k^2 ε^2 / (8 d^2) ).

Setting the right-hand side of the above to δ and rearranging terms, we conclude the proof.

We next show that the estimated training loss found by the Baseline algorithm, L̃_S(w), is close to the true training loss.

Lemma 4 With probability greater than 1 − δ over the Baseline algorithm's own randomization, for all w such that ‖w‖_1 ≤ B we have

| L̃_S(w) − L_S(w) | ≤ O( (B^2 d^2 / k) √( ln(d/δ) / m ) ).

Proof Combining Lemma 2 with the boundedness of ‖w‖_1 and using Hölder's inequality twice, we easily get that

| w^T (Ā − X̄) w | ≤ B^2 (d^2 / k) √( 2 ln(2 d^2 / δ) / m ).

Similarly, using Lemma 3 and Hölder's inequality,

| w^T (v̄ − x̄) | ≤ B (d / k) √( 8 ln(2 d / δ) / m ).

Combining the above inequalities with the union bound and the triangle inequality, we conclude the proof.

We are now ready to prove Theorem 1. First, using standard risk bounds (based on Rademacher complexities), we know that with probability greater than 1 − δ over the choice of a training set of m examples, for all w s.t. ‖w‖_1 ≤ B, we have

| L_S(w) − L_D(w) | ≤ O( B^2 √( ln(d/δ) / m ) ).

(To bound the Rademacher complexity, we use the boundedness of w, x, y to get that the squared loss is O(B)-Lipschitz on the domain. Combining this with the contraction principle yields the desired Rademacher bound.) Combining the above with Lemma 4, we obtain that for any w s.t. ‖w‖_1 ≤ B,

| L_D(w) − L̃_S(w) | ≤ | L_D(w) − L_S(w) | + | L_S(w) − L̃_S(w) | ≤ O( (B^2 d^2 / k) √( ln(d/δ) / m ) ).

The proof of Theorem 1 follows since the Baseline algorithm minimizes L̃_S(w).
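The concentration stated in Lemma 2 is also easy to observe numerically. The following self-contained simulation (our own illustration, with arbitrary toy parameters) rebuilds the per-example matrix estimate of Algorithm 1 and compares the largest entrywise deviation of Ā from X̄ with the bound of Lemma 2:

    import numpy as np

    rng = np.random.default_rng(3)
    d, k, m, delta = 8, 4, 2000, 0.05
    A_bar = np.zeros((d, d))
    X_bar = np.zeros((d, d))
    for _ in range(m):
        x = rng.uniform(-1.0, 1.0, size=d)
        X_bar += np.outer(x, x) / m
        A = np.zeros((d, d))
        for c in rng.choice(d * d, size=k // 2, replace=False):   # k/2 cells of [d] x [d]
            i, j = divmod(int(c), d)
            A[i, j] += (d ** 2 / k) * x[i] * x[j]
            A[j, i] += (d ** 2 / k) * x[i] * x[j]
        A_bar += A / m
    bound = (d ** 2 / k) * np.sqrt(2 * np.log(2 * d ** 2 / delta) / m)
    print(np.abs(A_bar - X_bar).max(), "<=", bound)   # holds with probability >= 1 - delta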

A.2 Proof of Theorem 2

We start with the following lemma.

Lemma 5 Let y_t, ŷ_t, v_t, w_t be the values of y, ŷ, v, w, respectively, at iteration t of the AER algorithm. Then, for any vector w* s.t. ‖w*‖_1 ≤ B we have

(1/m) Σ_{t=1}^m ( (λ/2) ‖w_t‖^2 + ⟨ 2(ŷ_t − y_t) v_t, w_t ⟩ )
  ≤ (1/m) Σ_{t=1}^m ( (λ/2) ‖w*‖^2 + ⟨ 2(ŷ_t − y_t) v_t, w* ⟩ ) + O( ((B+1) d)^2 log(m) / (k λ m) ).

Proof The proof follows directly from logarithmic regret bounds for strongly convex functions (Hazan et al., 2006; Kakade and Shalev-Shwartz, 2008), by noting that, according to our construction, max_t ‖ 2(ŷ_t − y_t) v_t ‖_2 ≤ O( (B+1) d / √k ).

Let B_2 be such that ‖w*‖_2 ≤ B_2 and choose λ = ((B+1) d / B_2) √( log(m) / (k m) ). Since λ ‖w_t‖^2 ≥ 0, we obtain from Lemma 5 that

(1/m) Σ_{t=1}^m ⟨ 2(ŷ_t − y_t) v_t, w_t − w* ⟩ ≤ λ ‖w*‖^2 + O( ((B+1) d)^2 log(m) / (k λ m) ) = O( (d / √(k m)) (B+1) B_2 √(log(m)) ) =: α.   (7)

For each t, let ∇_t = 2(⟨w_t, x_t⟩ − y_t) x_t and z_t = 2(ŷ_t − y_t) v_t. Taking expectation of (7) with respect to the algorithm's own randomization, and noting that the conditional expectation of z_t equals ∇_t, we obtain

E[ (1/m) Σ_{t=1}^m ⟨ ∇_t, w_t − w* ⟩ ] ≤ α.   (8)

From the convexity of the squared loss we know that

(⟨w_t, x_t⟩ − y_t)^2 − (⟨w*, x_t⟩ − y_t)^2 ≤ ⟨ ∇_t, w_t − w* ⟩.

Combining with (8) yields

E[ (1/m) Σ_{t=1}^m ( (⟨w_t, x_t⟩ − y_t)^2 − (⟨w*, x_t⟩ − y_t)^2 ) ] ≤ α.   (9)

Taking expectation again, this time with respect to the randomness in choosing the training set, and using the fact that w_t only depends on previous examples in the training set, we obtain that

E[ (1/m) Σ_{t=1}^m L_D(w_t) ] − L_D(w*) ≤ α.   (10)

Finally, from Jensen's inequality we know that E[ (1/m) Σ_{t=1}^m L_D(w_t) ] ≥ E[ L_D(w̄) ], and this concludes the proof.

A.3 Proof of Theorem 3

The outline of the proof is as follows. We define a specific distribution such that only one "good" feature is slightly correlated with the label. We then show that if some algorithm learns a linear predictor with an extra risk of at most ε, then it must know the value of the good feature. Next, we construct a variant of a multi-armed bandit problem out of our distribution, and show that a good learner yields a good prediction strategy. Finally, we adapt a lower bound for the multi-armed bandit problem given in (Auer et al., 2003) to conclude that in our case no learner can be too good.

The distribution: We generate a joint distribution over R^d × R as follows. Choose some j ∈ [d]. First, each feature is generated i.i.d. according to P[x_i = 1] = P[x_i = −1] = 1/2. Next, given x and j, the label y is generated according to P[y = x_j] = 1/2 + p and P[y = −x_j] = 1/2 − p, where p is set to be √ε. Denote by P_j the distribution described above when the good feature is j. Also denote by P_u the uniform distribution over {±1}^{d+1}. Analogously, we denote by E_j and E_u expectations w.r.t. P_j and P_u.

A good regressor knows j: We now show that if we have a good linear regressor then we can know the value of j. The optimal linear predictor is w* = 2p e_j, and the risk of w* is

L_D(w*) = E[ (⟨w*, x⟩ − y)^2 ] = (1/2 + p)(1 − 2p)^2 + (1/2 − p)(1 + 2p)^2 = 1 + 4p^2 − 8p^2 = 1 − 4p^2.

The risk of an arbitrary weight vector w under the aforementioned distribution is

L_D(w) = E_{x,y}[ (⟨w, x⟩ − y)^2 ] = Σ_{i ≠ j} w_i^2 + E[ (w_j x_j − y)^2 ] = Σ_{i ≠ j} w_i^2 + 1 + w_j^2 − 4 p w_j.   (11)

Suppose that L_D(w) − L_D(w*) < ε. This implies that:

1. For all i ≠ j we have w_i^2 < ε, or equivalently |w_i| < √ε.
2. 1 + w_j^2 − 4 p w_j − (1 − 4p^2) < ε, and thus (w_j − 2p)^2 < ε, which gives w_j > 2p − √ε.

Since we set p = √ε, the above implies that we can identify the value of j from any w whose risk is strictly smaller than L_D(w*) + ε.

Constructing a variant of a multi-armed bandit problem: We now construct a variant of the multi-armed bandit problem out of the distribution P_j. Each i ∈ [d] is an arm, and the reward of pulling i is (x_i y + 1)/2 ∈ {0, 1}. Unlike standard multi-armed bandit problems, here at each round the learner chooses K arms a_{t,1}, …, a_{t,K}, which correspond to the K attributes accessed at round t, and its reward is defined to be the average of the rewards of the chosen arms. At the end of each round the learner observes the value of x_t at a_{t,1}, …, a_{t,K}, as well as the value of y_t. Note that the expected reward at round t is 1/2 + (p/K) Σ_{i=1}^K 1[a_{t,i} = j]. Therefore, the total expected reward of an algorithm that runs for T rounds is upper bounded by T/2 + p E[N_j], where N_j is the number of times j ∈ {a_{t,1}, …, a_{t,K}}.

A good learner yields a strategy: Suppose that we have a learner that can learn a linear predictor with L_D(w) − L_D(w*) < ε using m examples (on average). Since we have shown that once L_D(w) − L_D(w*) < ε we know the value of j, we can construct a strategy for the multi-armed bandit problem in a straightforward way: simply use the first m examples to learn w, and from then on always pull arm j, namely a_{t,1} = … = a_{t,K} = j. The expected reward of this algorithm is at least m/2 + (T − m)(1/2 + p) = T/2 + (T − m) p.

An upper bound on the reward of any strategy: Consider an arbitrary prediction algorithm. At round t the algorithm uses the history (and its own random bits, which we can assume are set in advance) to ask for the current K attributes a_{t,1}, …, a_{t,K}. The history is the value of x_s at a_{s,1}, …, a_{s,K}, as well as the value of y_s, for all s < t. That is, we can denote the history at round t by r^t = (r_{1,1}, …, r_{1,K+1}), …, (r_{t−1,1}, …, r_{t−1,K+1}). Therefore, on round t the algorithm uses a mapping from r^t to [d]^K. We use r as a shorthand for r^{T+1}. The following lemma shows that any function of the history cannot distinguish too well between the distribution P_j and the uniform distribution.

Lemma 6 Let f : {−1, 1}^{(K+1)T} → [0, M] be any function defined on a history sequence r = (r_{1,1}, …, r_{1,K+1}), …, (r_{T,1}, …, r_{T,K+1}). Let N_j be the number of times the algorithm calculating f picks action j among the selected arms. Then,

E_j[ f(r) ] ≤ E_u[ f(r) ] + M √( −log(1 − 4p^2) E_u[N_j] ).

Proof For any two distributions P, Q we let ‖P − Q‖_1 = Σ_r |P[r] − Q[r]| be the total variation distance, and let KL(P, Q) = Σ_r P[r] log(P[r]/Q[r]) be the KL divergence. Using Hölder's inequality we know that

| E_j[f(r)] − E_u[f(r)] | ≤ M ‖P_j − P_u‖_1.

Additionally, using Pinsker's inequality we have

(1/2) ‖P_j − P_u‖_1^2 ≤ KL(P_u, P_j).

Finally, the chain rule and simple calculations yield

KL(P_u, P_j)
  = Σ_r (1/2)^{(K+1)T} Σ_{t=1}^T log( P_u[r_t | r^t] / P_j[r_t | r^t] )
  = Σ_r (1/2)^{(K+1)T} Σ_{t=1}^T log( (1/2)^{K+1} / ( (1/2)^{K+1} (1 + 2p sgn(x_{t,j} y_t))^{1[∨_{i=1}^K a_{t,i} = j]} ) )
  = − Σ_{t=1}^T E_u[ 1[∨_{i=1}^K a_{t,i} = j] log(1 + 2p sgn(x_{t,j} y_t)) ]
  ≤ − Σ_{t=1}^T Σ_{i=1}^K P_u(a_{t,i} = j) E_u[ log(1 + 2p sgn(x_{t,j} y_t)) ]   (since x_{t,j} y_t is independent of a_{t,1}, …, a_{t,K})
  = − ( (1/2) log(1 + 2p) + (1/2) log(1 − 2p) ) Σ_{t=1}^T Σ_{i=1}^K P_u(a_{t,i} = j)
  = − (1/2) log(1 − 4p^2) E_u[N_j].

Combining all the above, we conclude the proof.

We have shown previously that the expected reward of any algorithm is bounded above by T/2 + p E_j[N_j]. Applying Lemma 6 to f(r) = N_j ∈ {0, 1, …, T}, we get that

E_j[N_j] ≤ E_u[N_j] + T √( −log(1 − 4p^2) E_u[N_j] ).

Therefore, the expected reward of any algorithm is at most

T/2 + p ( E_u[N_j] + T √( −log(1 − 4p^2) E_u[N_j] ) ).

Since the adversary will choose j to minimize the above, and since the minimum over j is smaller than the expectation over a uniformly random choice of j, the reward against an adversarial choice of j is at most

T/2 + (p/d) Σ_{j=1}^d ( E_u[N_j] + T √( −log(1 − 4p^2) E_u[N_j] ) ).   (12)

Note that

(1/d) Σ_{j=1}^d E_u[N_j] = (1/d) E_u[ N_1 + ⋯ + N_d ] ≤ K T / d.

Combining this with (12) and using Jensen's inequality, we obtain the following upper bound on the reward:

T/2 + p ( (K/d) T + T √( −log(1 − 4p^2) (K/d) T ) ).

Assuming that ε ≤ 1/16, we have that 4p^2 = 4ε ≤ 1/4, and thus, using the inequality −log(1 − q) ≤ (3/2) q, which holds for q ∈ [0, 1/4], we get the upper bound

T/2 + p ( (K/d) T + T √( 6 (K/d) p^2 T ) ).   (13)

Concluding the proof: Take a learning algorithm that finds an ε-good predictor using m examples. Since the reward of the strategy based on this learning algorithm cannot exceed the upper bound given in (13), we obtain that

T/2 + (T − m) p ≤ T/2 + p ( (K/d) T + T √( 6 (K/d) p^2 T ) ),

which, solved for m, gives

m ≥ T ( 1 − K/d − √( 6 (K/d) p^2 T ) ).

Since we assume d ≥ 4K, choosing T = d / (96 K p^2), and recalling that p^2 = ε, gives

m ≥ T/2 = d / (192 K ε).

A.4 Proof of Theorem 4

Let w* = (1/3, 1/3, 1/3). Let x ∈ {±1}^3 be distributed uniformly at random, and let y be determined deterministically as y = ⟨w*, x⟩. Then L_D(w*) = 0. However, any algorithm that views only 2 attributes has an uncertainty of at least ±1/3 about the label, and therefore its expected squared error is at least 1/9. Formally, suppose the algorithm asks for the first two attributes and outputs the prediction ŷ. Since the attributes are generated independently, the value of x_3 does not depend on x_1, x_2, and ŷ, and therefore

E[ (ŷ − ⟨w*, x⟩)^2 ] = E[ (ŷ − w*_1 x_1 − w*_2 x_2 − w*_3 x_3)^2 ] = E[ (ŷ − w*_1 x_1 − w*_2 x_2)^2 ] + E[ (w*_3 x_3)^2 ] ≥ 0 + (1/3)^2 E[x_3^2] = 1/9,

which concludes the proof.
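The construction in the proof of Theorem 4 is easy to check numerically; the small simulation below (our own illustration) confirms that predicting from the first two attributes alone incurs squared error 1/9, even though all three attributes determine y exactly.

    import numpy as np

    rng = np.random.default_rng(4)
    w_star = np.full(3, 1.0 / 3.0)
    X = rng.choice([-1.0, 1.0], size=(100_000, 3))   # uniform x in {-1, +1}^3
    Y = X @ w_star                                    # y = <w*, x>, so L_D(w*) = 0
    y_hat = X[:, :2] @ w_star[:2]                     # best prediction given only x_1, x_2
    print(np.mean((y_hat - Y) ** 2))                  # equals (1/3)^2 = 1/9 exactly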


More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Multiple Instance Learning with Query Bags

Multiple Instance Learning with Query Bags Multiple Instance Learning with Query Bags Boris Babenko UC San Diego bbabenko@cs.ucsd.edu Piotr Dollár California Institute of Technology pdollar@caltech.edu Serge Belongie UC San Diego sjb@cs.ucsd.edu

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Symmetrization and Rademacher Averages

Symmetrization and Rademacher Averages Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and

More information

Tight Complexity Bounds for Optimizing Composite Objectives

Tight Complexity Bounds for Optimizing Composite Objectives Tight Coplexity Bounds for Optiizing Coposite Objectives Blake Woodworth Toyota Technological Institute at Chicago Chicago, IL, 60637 blake@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Detection and Estimation Theory

Detection and Estimation Theory ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016/2017 Lessons 9 11 Jan 2017 Outline Artificial Neural networks Notation...2 Convolutional Neural Networks...3

More information

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13 CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

Multi-Dimensional Hegselmann-Krause Dynamics

Multi-Dimensional Hegselmann-Krause Dynamics Multi-Diensional Hegselann-Krause Dynaics A. Nedić Industrial and Enterprise Systes Engineering Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu B. Touri Coordinated Science Laboratory

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

Machine Learning in the Data Revolution Era

Machine Learning in the Data Revolution Era Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

Good Learners for Evil Teachers

Good Learners for Evil Teachers Ofer Dekel Microsoft Research Microsoft Way Redond WA 985 USA Ohad Shair The Hebrew University Jerusale 994 Israel OFERD@MICROSOFT.COM OHADSH@CS.HUJI.AC.IL Abstract We consider a supervised achine learning

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo

More information

A Probabilistic and RIPless Theory of Compressed Sensing

A Probabilistic and RIPless Theory of Compressed Sensing A Probabilistic and RIPless Theory of Copressed Sensing Eanuel J Candès and Yaniv Plan 2 Departents of Matheatics and of Statistics, Stanford University, Stanford, CA 94305 2 Applied and Coputational Matheatics,

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010 A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING By Eanuel J Candès Yaniv Plan Technical Report No 200-0 Noveber 200 Departent of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA)

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA) Bayesian Learning Chapter 6: Bayesian Learning CS 536: Machine Learning Littan (Wu, TA) [Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6] Bayes Theore MAP, ML hypotheses MAP learners Miniu

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

Using a De-Convolution Window for Operating Modal Analysis

Using a De-Convolution Window for Operating Modal Analysis Using a De-Convolution Window for Operating Modal Analysis Brian Schwarz Vibrant Technology, Inc. Scotts Valley, CA Mark Richardson Vibrant Technology, Inc. Scotts Valley, CA Abstract Operating Modal Analysis

More information

A Unified Approach to Universal Prediction: Generalized Upper and Lower Bounds

A Unified Approach to Universal Prediction: Generalized Upper and Lower Bounds 646 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL 6, NO 3, MARCH 05 A Unified Approach to Universal Prediction: Generalized Upper and Lower Bounds Nuri Denizcan Vanli and Suleyan S Kozat,

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information