A theory of learning from different domains


Shai Ben-David · John Blitzer · Koby Crammer · Alex Kulesza · Fernando Pereira · Jennifer Wortman Vaughan

Received: 28 February 2009 / Revised: 12 September 2009 / Accepted: 18 September 2009
The Author(s). This article is published with open access at Springerlink.com

Abstract  Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. Often, however, we have plentiful labeled training data from a source domain but wish to learn a classifier which performs well on a target domain with a different distribution and little or no labeled training data. In this work we investigate two questions. First, under what conditions can a classifier trained from source data be expected to perform well on target data? Second, given a small amount of labeled target data, how should we combine it during training with the large amount of labeled source data to achieve the lowest target error at test time?

Editors: Nicolò Cesa-Bianchi, David R. Hardoon, and Gayle Leen.

Preliminary versions of the work contained in this article appeared in Advances in Neural Information Processing Systems (Ben-David et al. 2006; Blitzer et al. 2007a).

S. Ben-David, David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada. e-mail: shai@cs.uwaterloo.ca
J. Blitzer, Department of Computer Science, UC Berkeley, Berkeley, CA, USA. e-mail: blitzer@cs.berkeley.edu
K. Crammer, Department of Electrical Engineering, The Technion, Haifa, Israel. e-mail: koby@ee.technion.ac.il
A. Kulesza, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA. e-mail: kulesza@cis.upenn.edu
F. Pereira, Google Research, Mountain View, CA, USA. e-mail: pereira@google.com
J.W. Vaughan, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA. e-mail: jenn@seas.harvard.edu

We address the first question by bounding a classifier's target error in terms of its source error and the divergence between the two domains. We give a classifier-induced divergence measure that can be estimated from finite, unlabeled samples from the domains. Under the assumption that there exists some hypothesis that performs well in both domains, we show that this quantity together with the empirical source error characterize the target error of a source-trained classifier.

We answer the second question by bounding the target error of a model which minimizes a convex combination of the empirical source and target errors. Previous theoretical work has considered minimizing just the source error, just the target error, or weighting instances from the two domains equally. We show how to choose the optimal combination of source and target error as a function of the divergence, the sample sizes of both domains, and the complexity of the hypothesis class. The resulting bound generalizes the previously studied cases and is always at least as tight as a bound which considers minimizing only the target error or an equal weighting of source and target errors.

Keywords  Domain adaptation · Transfer learning · Learning theory · Sample-selection bias

1 Introduction

Most research in machine learning, both theoretical and empirical, assumes that models are trained and tested using data drawn from some fixed distribution. This single-domain setting has been well studied, and uniform convergence theory guarantees that a model's empirical training error is close to its true error under such assumptions. In many practical cases, however, we wish to train a model in one or more source domains and then apply it to a different target domain. For example, we might have a spam filter trained from a large email collection received by a group of current users (the source domain) and wish to adapt it for a new user (the target domain). Intuitively this should improve filtering performance for the new user, under the assumption that users generally agree on what is spam and what is not. The challenge is that each user receives a unique distribution of email.

Many other examples arise in natural language processing. In general, labeled data for tasks like part-of-speech tagging (Ratnaparkhi 1996), parsing (Collins 1999), information extraction (Bikel et al. 1997), and sentiment analysis (Pang et al. 2002) are drawn from a limited set of document types and genres in a given language due to availability, cost, and the specific goals of the project. However, useful applications for the trained systems may involve documents of different types or genres. We can hope to successfully adapt the systems in these cases since parts of speech, syntactic structure, entity mentions, and positive or negative sentiment are to a large extent stable across different domains, as they depend on general properties of language.

In this work we investigate the problem of domain adaptation. We analyze a setting in which we have plentiful labeled training data drawn from one or more source distributions but little or no labeled training data drawn from the target distribution of interest. This work answers two main questions. First, under what conditions on the source and target distributions can we expect to learn well? We give a bound on a classifier's target domain error in terms of its source domain error and a divergence measure between the two domains. In a distribution-free setting, we cannot obtain accurate estimates of common measures of divergence, such as the L1 or Kullback-Leibler divergence, from finite samples. Instead, we show that when learning a hypothesis from a class of finite complexity, it is sufficient to use a classifier-induced divergence we call the HΔH-divergence (Kifer et al. 2004; Ben-David et al. 2006).

Finite sample estimates of the HΔH-divergence converge uniformly to the true HΔH-divergence, allowing us to estimate the domain divergence from unlabeled data in both domains. Our final bound on the target error is in terms of the empirical source error, the empirical HΔH-divergence between unlabeled samples from the domains, and the combined error of the best single hypothesis for both domains.

A second important question is how to learn when the large quantity of labeled source data is augmented with a small amount of labeled target data, for example, when our new email user has begun to manually mark a few received messages as spam. Given a source domain S and a target domain T, we consider hypotheses h which minimize a convex combination of the empirical target and source errors, ε̂_T(h) and ε̂_S(h) respectively, which we refer to as the empirical α-error:

α ε̂_T(h) + (1 − α) ε̂_S(h).

Setting α involves trading off the ideal but small target dataset against the large but less relevant source dataset. Baseline choices for α include α = 0 (using only source data; Ben-David et al. 2006), α = 1 (using only target data), and the equal weighting of source and target instances (Crammer et al. 2008), which sets α to the fraction of the instances that are from the target domain.

We give a bound on a classifier's target error in terms of its empirical α-error. The α that minimizes the bound depends on the divergence between the domains as well as on the sizes of the source and target training datasets. The optimal bound is always at least as tight as the bounds using only source, only target, or equally-weighted source and target instances. We show that for a real-world problem of sentiment classification, nontrivial settings of α perform better than the three baseline settings.

In the next section, we give a brief overview of related work. We then specify precisely our model of domain adaptation. Section 4 shows how to bound the target error of a hypothesis in terms of its source error and the source-target divergence. Section 5 gives our main result, a bound on the target error of a classifier which minimizes a convex combination of empirical errors on the two domains, and in Sect. 6 we investigate the properties of the best convex combination in that bound. In Sect. 7, we illustrate experimentally the above bounds on sentiment classification data. Section 8 describes how to extend the previous results to the case of multiple data sources. Finally, we conclude with a brief discussion of future directions for research in Sect. 9.

2 Related work

Crammer et al. (2008) introduced a PAC-style model of learning from multiple sources in which the distribution over input points is assumed to be the same across sources but each source may have its own deterministic labeling function. They derive bounds on the target error of the function that minimizes the empirical error on (uniformly weighted) data from any subset of the sources. As discussed in Sect. 8.2, the bounds that they derive are equivalent to ours in certain restricted settings, but their theory is significantly less general.

Daumé (2007) and Finkel (2009) suggest an empirically successful method for domain adaptation based on multi-task learning. The crucial difference between our domain adaptation setting and analyses of multi-task methods is that multi-task bounds require labeled data from each task, and make no attempt to exploit unlabeled data. Although these bounds have a more limited scope than ours, they can sometimes yield useful results even when the optimal predictors for each task (or domain, in the case of Daumé 2007) are quite different (Baxter 2000; Ando and Zhang 2005).

Li and Bilmes (2007) give PAC-Bayesian learning bounds for adaptation using divergence priors. In particular, they place a source-centered prior on the parameters of a model learned in the target domain. Like our model, the divergence prior emphasizes the tradeoff between source hypotheses trained on large (but biased) data sets and target hypotheses trained from small (but unbiased) data sets. In our model, however, we measure the divergence (and consequently the bias) of the source domain from unlabeled data. This allows us to choose a tradeoff parameter for source and target labeled data before training begins.

More recently, Mansour et al. (2009a, 2009b) introduced a theoretical model for the multiple source adaptation problem. This model operates under assumptions very similar to our multiple source analysis (Sect. 8), and we address their work in more detail there.

Finally, domain adaptation is closely related to the setting of sample selection bias (Heckman 1979). A well-studied variant of this is covariate shift, which has seen significant work in recent years (Huang et al. 2007; Sugiyama et al. 2008; Cortes et al. 2008). This line of work leads to algorithms based on instance weighting, which have also been explored empirically in the machine learning and natural language processing communities (Jiang and Zhai 2007; Bickel et al. 2007). Our work differs from covariate shift primarily in two ways. First, we do not assume the labeling rule is identical for the source and target data (although there must exist some good labeling rule for both in order to achieve low error). Second, our HΔH-divergence can be computed from finite samples of unlabeled data, allowing us to directly estimate the error of a source-trained classifier on the target domain.

A point of general contrast is that we work in an agnostic setting, in which we do not make strong assumptions about the data generation model, such as a specific relationship between the source and target data distributions, which would be needed to obtain absolute error bounds. Instead, we assume only that the samples from each of the two domains are generated i.i.d. according to the respective data distributions, and as a result our bounds must be relative to the error of some benchmark predictor rather than absolute; specifically, they are relative to the combined error on both domains of an optimal joint predictor.

3 A rigorous model of domain adaptation

We formalize the problem of domain adaptation for binary classification as follows. We define a domain¹ as a pair consisting of a distribution D on inputs X and a labeling function f : X → [0, 1], which can have a fractional (expected) value when labeling occurs nondeterministically. Initially, we consider two domains, a source domain and a target domain. We denote by ⟨D_S, f_S⟩ the source domain and ⟨D_T, f_T⟩ the target domain.

A hypothesis is a function h : X → {0, 1}. The probability according to the distribution D_S that a hypothesis h disagrees with a labeling function f (which can also be a hypothesis) is defined as

ε_S(h, f) = E_{x∼D_S}[ |h(x) − f(x)| ].

When we want to refer to the source error (sometimes called risk) of a hypothesis, we use the shorthand ε_S(h) = ε_S(h, f_S). We write the empirical source error as ε̂_S(h). We use the parallel notation ε_T(h, f), ε_T(h), and ε̂_T(h) for the target domain.

¹ Note that this notion of domain is not the domain of a function. We always mean a specific distribution and function pair when we say domain.

4 A bound relating the source and target error

We now proceed to develop bounds on the target domain generalization performance of a classifier trained in the source domain. We first show how to bound the target error in terms of the source error, the difference between labeling functions f_S and f_T, and the divergence between the distributions D_S and D_T. Since we expect the labeling function difference to be small in practice, we focus here on measuring distribution divergence, and especially on how to estimate it with finite samples of unlabeled data from D_S and D_T. That is the role of the H-divergence introduced in Sect. 4.1.

A natural measure of divergence for distributions is the L1 or variation divergence

d_1(D, D′) = 2 sup_{B ∈ B} | Pr_D[B] − Pr_{D′}[B] |,

where B is the set of measurable subsets under D and D′. We make use of this measure to state an initial bound on the target error of a classifier.

Theorem 1  For a hypothesis h,

ε_T(h) ≤ ε_S(h) + d_1(D_S, D_T) + min{ E_{D_S}[ |f_S(x) − f_T(x)| ], E_{D_T}[ |f_S(x) − f_T(x)| ] }.

Proof  See Appendix.

In this bound, the first term is the source error, which a training algorithm might seek to minimize, and the third is the difference in labeling functions across the two domains, which we expect to be small. The problem is the remaining term. Bounding the error in terms of the L1 divergence between distributions has two disadvantages. First, it cannot be accurately estimated from finite samples of arbitrary distributions (Batu et al. 2000; Kifer et al. 2004) and therefore has limited usefulness in practice. Second, for our purposes the L1 divergence is an overly strict measure that unnecessarily inflates the bound, since it involves a supremum over all measurable subsets. We are only interested in the error of a hypothesis from some class of finite complexity, thus we can restrict our attention to the subsets on which this type of hypothesis can commit errors. The divergence measure introduced in the next section addresses both of these concerns.

4.1 The H-divergence

Definition 1  (Based on Kifer et al. 2004)  Given a domain X with D and D′ probability distributions over X, let H be a hypothesis class on X and denote by I(h) the set for which h ∈ H is the characteristic function; that is, x ∈ I(h) ⇔ h(x) = 1. The H-divergence between D and D′ is

d_H(D, D′) = 2 sup_{h ∈ H} | Pr_D[ I(h) ] − Pr_{D′}[ I(h) ] |.

The H-divergence resolves both problems associated with the L1 divergence. First, for hypothesis classes H of finite VC dimension, the H-divergence can be estimated from finite samples (see Lemma 1 below). Second, the H-divergence for any class H is never larger than the L1 divergence, and is in general smaller when H has finite VC dimension.

Since it plays an important role in the rest of this work, we now state a slight modification of Theorem 3.4 of Kifer et al. (2004) as a lemma.

Lemma 1  Let H be a hypothesis space on X with VC dimension d. If U and U′ are samples of size m from D and D′ respectively and d̂_H(U, U′) is the empirical H-divergence between the samples, then for any δ ∈ (0, 1), with probability at least 1 − δ,

d_H(D, D′) ≤ d̂_H(U, U′) + 4 √( (d log(2m) + log(2/δ)) / m ).

Lemma 1 shows that the empirical H-divergence between two samples from distributions D and D′ converges uniformly to the true H-divergence for hypothesis classes H of finite VC dimension.

The next lemma shows that we can compute the H-divergence by finding a classifier which attempts to separate one domain from the other. Our basic plan of attack will be as follows: label each unlabeled source instance with 0 and each unlabeled target instance with 1. Then train a classifier to discriminate between source and target instances. The H-divergence is immediately computable from its error.

Lemma 2  For a symmetric hypothesis class H (one where for every h ∈ H, the inverse hypothesis 1 − h is also in H) and samples U, U′ of size m,

d̂_H(U, U′) = 2 ( 1 − min_{h ∈ H} [ (1/m) Σ_{x: h(x)=0} I[x ∈ U] + (1/m) Σ_{x: h(x)=1} I[x ∈ U′] ] ),

where I[x ∈ U] is the binary indicator variable which is 1 when x ∈ U.

Proof  See Appendix.

This lemma leads directly to a procedure for computing the H-divergence. We first find a hypothesis in H which has minimum error for the binary classification problem of distinguishing source from target instances. The error of this hypothesis is related to the H-divergence by Lemma 2. Of course, minimizing the error for most reasonable hypothesis classes is a computationally intractable problem. Nonetheless, as we shall see in Sect. 7, the errors of hypotheses trained to minimize convex upper bounds on the error are useful in approximating the H-divergence.

4.2 Bounding the difference in error using the H-divergence

The H-divergence allows us to estimate divergence from unlabeled data, but in order to use it in a bound we must have tools to represent error relative to other hypotheses in our class. We introduce two new definitions.

Definition 2  The ideal joint hypothesis is the hypothesis which minimizes the combined error:

h* = argmin_{h ∈ H} ( ε_S(h) + ε_T(h) ).

We denote the combined error of the ideal hypothesis by

λ = ε_S(h*) + ε_T(h*).

The ideal joint hypothesis explicitly embodies our notion of adaptability. When this hypothesis performs poorly, we cannot expect to learn a good target classifier by minimizing source error. On the other hand, we will show that if the ideal joint hypothesis performs well, we can measure the adaptability of a source-trained classifier using the HΔH-divergence between the marginal distributions D_S and D_T. Next we define the symmetric difference hypothesis space HΔH for a hypothesis space H, which will be very useful in reasoning about error.

Definition 3  For a hypothesis space H, the symmetric difference hypothesis space HΔH is the set of hypotheses

HΔH = { g : g(x) = h(x) ⊕ h′(x) for some h, h′ ∈ H },

where ⊕ is the XOR function. In words, every hypothesis g ∈ HΔH labels as positive the set of disagreements between two hypotheses in H.

The following simple lemma shows how we can make use of the HΔH-divergence in bounding the error of our hypothesis.

Lemma 3  For any hypotheses h, h′ ∈ H,

| ε_S(h, h′) − ε_T(h, h′) | ≤ (1/2) d_{HΔH}(D_S, D_T).

Proof  By the definition of the HΔH-distance,

d_{HΔH}(D_S, D_T) = 2 sup_{h,h′ ∈ H} | Pr_{x∼D_S}[ h(x) ≠ h′(x) ] − Pr_{x∼D_T}[ h(x) ≠ h′(x) ] |
                 = 2 sup_{h,h′ ∈ H} | ε_S(h, h′) − ε_T(h, h′) |
                 ≥ 2 | ε_S(h, h′) − ε_T(h, h′) |.

We are now ready to give a bound on the target error in terms of the new divergence measure we have defined.

Theorem 2  Let H be a hypothesis space of VC dimension d. If U_S, U_T are unlabeled samples of size m′ each, drawn from D_S and D_T respectively, then for any δ ∈ (0, 1), with probability at least 1 − δ over the choice of the samples, for every h ∈ H:

ε_T(h) ≤ ε_S(h) + (1/2) d̂_{HΔH}(U_S, U_T) + 4 √( (2d log(2m′) + log(2/δ)) / m′ ) + λ.

Proof  This proof relies on Lemma 3 and the triangle inequality for classification error (Ben-David et al. 2006; Crammer et al. 2008), which implies that for any labeling functions f_1, f_2, and f_3, we have ε(f_1, f_2) ≤ ε(f_1, f_3) + ε(f_2, f_3). Then

ε_T(h) ≤ ε_T(h*) + ε_T(h, h*)
       ≤ ε_T(h*) + ε_S(h, h*) + | ε_T(h, h*) − ε_S(h, h*) |
       ≤ ε_T(h*) + ε_S(h, h*) + (1/2) d_{HΔH}(D_S, D_T)

       ≤ ε_T(h*) + ε_S(h) + ε_S(h*) + (1/2) d_{HΔH}(D_S, D_T)
       = ε_S(h) + (1/2) d_{HΔH}(D_S, D_T) + λ
       ≤ ε_S(h) + (1/2) d̂_{HΔH}(U_S, U_T) + 4 √( (2d log(2m′) + log(2/δ)) / m′ ) + λ.

The last step is an application of Lemma 1, together with the observation that since we can represent every g ∈ HΔH as a linear threshold network of depth 2 with 2 hidden units, the VC dimension of HΔH is at most twice the VC dimension of H (Anthony and Bartlett 1999).

The bound in Theorem 2 is relative to λ, and we briefly comment that the form λ = ε_S(h*) + ε_T(h*) comes from the use of the triangle inequality for classification error. Other losses result in other forms for this bound (Crammer et al. 2008). When the combined error of the ideal joint hypothesis is large, there is no classifier that performs well on both the source and target domains, so we cannot hope to find a good target hypothesis by training only on the source domain. On the other hand, for small λ (the most relevant case for domain adaptation), the bound shows that the source error and the unlabeled HΔH-divergence are the important quantities in computing the target error.

5 A learning bound combining source and target training data

Theorem 2 shows how to relate source and target error. We now proceed to give a learning bound for empirical risk minimization using combined source and target training data. At training time a learner receives a sample S = (S_T, S_S) of m instances, where S_T consists of βm instances drawn independently from D_T and S_S consists of (1 − β)m instances drawn independently from D_S. The goal of a learner is to find a hypothesis that minimizes the target error ε_T(h). When βm is small, as in domain adaptation, minimizing the empirical target error may not be the best choice. We analyze learners that instead minimize a convex combination of the empirical source and target errors,

ε̂_α(h) = α ε̂_T(h) + (1 − α) ε̂_S(h),

for some α ∈ [0, 1]. We denote by ε_α(h) the corresponding weighted combination of true source and target errors, measured with respect to D_S and D_T.

We bound the target error of a domain adaptation algorithm that minimizes ε̂_α(h). The proof of the bound has two main components, which we state as lemmas below. First we bound the difference between the target error ε_T(h) and the weighted error ε_α(h). Then we bound the difference between the true and empirical weighted errors ε_α(h) and ε̂_α(h).
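To make the learning rule analyzed in this section concrete, the following is a minimal sketch (ours, not the paper's) of minimizing the empirical α-error with an instance-weighted convex surrogate loss. It assumes scikit-learn's SGDClassifier, and the arrays Xs, ys, Xt, yt holding labeled source and target data are hypothetical placeholders.

# Minimal sketch: minimize the empirical alpha-error
#   alpha * target_error + (1 - alpha) * source_error
# by reweighting instances so that each target example carries weight
# alpha / (beta * m) and each source example (1 - alpha) / ((1 - beta) * m).
# Xs, ys (source) and Xt, yt (target) are hypothetical labeled arrays.
import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_alpha_weighted(Xs, ys, Xt, yt, alpha):
    m_s, m_t = len(ys), len(yt)
    m = m_s + m_t
    beta = m_t / m
    X = np.vstack([Xt, Xs])
    y = np.concatenate([yt, ys])
    # Per-instance weights implementing the alpha-error; the common 1/m factor
    # does not change the minimizer, but we keep it for clarity.
    w = np.concatenate([
        np.full(m_t, alpha / (beta * m)),
        np.full(m_s, (1 - alpha) / ((1 - beta) * m)),
    ])
    clf = SGDClassifier(loss="modified_huber", max_iter=1000, tol=1e-3)
    clf.fit(X, y, sample_weight=w)
    return clf

The modified Huber surrogate stands in for the 0-1 loss here; any convex upper bound on the classification error plays the same role.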

Lemma 4  Let h be a hypothesis in class H. Then

| ε_α(h) − ε_T(h) | ≤ (1 − α) ( (1/2) d_{HΔH}(D_S, D_T) + λ ).

Proof  See Appendix.

The lemma shows that as α approaches 1, we rely increasingly on the target data, and the distance between domains matters less and less. The uniform convergence bound on the α-error is nearly identical to the standard uniform convergence bound for hypothesis classes of finite VC dimension (Vapnik 1998; Anthony and Bartlett 1999), only with target and source errors weighted differently. The key part of the proof relies on a slight modification of Hoeffding's inequality for our setup, which we state here:

Lemma 5  For a fixed hypothesis h, if a random labeled sample of size m is generated by drawing βm points from D_T and (1 − β)m points from D_S, labeling them according to f_T and f_S respectively, then for any ε > 0,

Pr[ | ε̂_α(h) − ε_α(h) | ≥ ε ] ≤ 2 exp( −2mε² / ( α²/β + (1 − α)²/(1 − β) ) ).

Before giving the proof, we first restate Hoeffding's inequality for completeness.

Proposition 1  (Hoeffding's inequality)  If X_1, ..., X_n are independent random variables with a_i ≤ X_i ≤ b_i for all i, then for any ε > 0,

Pr[ | X̄ − E[X̄] | ≥ ε ] ≤ 2 exp( −2n²ε² / Σ_{i=1}^n (b_i − a_i)² ),

where X̄ = (X_1 + ... + X_n)/n.

We are now ready to prove the lemma.

Proof (Lemma 5)  Let X_1, ..., X_{βm} be random variables that take on the values (α/β)|h(x) − f_T(x)| for the βm instances x ∈ S_T. Similarly, let X_{βm+1}, ..., X_m be random variables that take on the values ((1 − α)/(1 − β))|h(x) − f_S(x)| for the (1 − β)m instances x ∈ S_S. Note that X_1, ..., X_{βm} ∈ [0, α/β] and X_{βm+1}, ..., X_m ∈ [0, (1 − α)/(1 − β)]. Then

ε̂_α(h) = α ε̂_T(h) + (1 − α) ε̂_S(h)
        = (α/(βm)) Σ_{x ∈ S_T} |h(x) − f_T(x)| + ((1 − α)/((1 − β)m)) Σ_{x ∈ S_S} |h(x) − f_S(x)|
        = (1/m) Σ_{i=1}^m X_i.

Furthermore, by linearity of expectations,

E[ ε̂_α(h) ] = (1/m) ( βm (α/β) ε_T(h) + (1 − β)m ((1 − α)/(1 − β)) ε_S(h) ) = α ε_T(h) + (1 − α) ε_S(h) = ε_α(h).

So by Hoeffding's inequality, the following holds for every h:

Pr[ | ε̂_α(h) − ε_α(h) | ≥ ε ] ≤ 2 exp( −2m²ε² / Σ_{i=1}^m range²(X_i) )
 = 2 exp( −2m²ε² / ( βm (α/β)² + (1 − β)m ((1 − α)/(1 − β))² ) )
 = 2 exp( −2mε² / ( α²/β + (1 − α)²/(1 − β) ) ).

This lemma shows that as α moves away from β (where each instance is weighted equally), our finite sample approximation to ε_α(h) becomes less reliable. We can now move on to the main theorem of this section.

Theorem 3  Let H be a hypothesis space of VC dimension d. Let U_S and U_T be unlabeled samples of size m′ each, drawn from D_S and D_T respectively. Let S be a labeled sample of size m generated by drawing βm points from D_T and (1 − β)m points from D_S and labeling them according to f_T and f_S, respectively. If ĥ ∈ H is the empirical minimizer of ε̂_α(h) on S and h*_T = argmin_{h ∈ H} ε_T(h) is the target error minimizer, then for any δ ∈ (0, 1), with probability at least 1 − δ over the choice of the samples,

ε_T(ĥ) ≤ ε_T(h*_T) + 4 √( (α²/β + (1 − α)²/(1 − β)) (2d log(2(m + 1)) + 2 log(8/δ)) / m )
        + 2(1 − α) ( (1/2) d̂_{HΔH}(U_S, U_T) + 4 √( (2d log(2m′) + log(8/δ)) / m′ ) + λ ).

The proof follows the standard set of steps for proving learning bounds (Anthony and Bartlett 1999), using Lemma 4 to bound the difference between target and weighted errors and Lemma 5 for the uniform convergence of empirical and true weighted errors. The full proof is given in the Appendix.

When α = 0 (that is, we ignore target data), the bound is identical to that of Theorem 2, but with an empirical estimate for the source error. Similarly, when α = 1 (that is, we use only target data), the bound is the standard learning bound using only target data. At the optimal α (which minimizes the right hand side), the bound is always at least as tight as either of these two settings. Finally, note that by choosing different values of α, the bound allows us to effectively trade off the small amount of target data against the large amount of less relevant source data. We remark that when it is known that λ = 0, the dependence on m in Theorem 3 can be improved; this corresponds to the restricted or realizable setting.

6 Optimal mixing value

We now examine the bound of Theorem 3 in more detail to illustrate some interesting properties. Writing the bound as a function of α and omitting additive constants, we obtain

f(α) = 2B √( α²/β + (1 − α)²/(1 − β) ) + 2(1 − α) A,   (1)

where

A = (1/2) d̂_{HΔH}(U_S, U_T) + 4 √( (2d log(2m′) + log(4/δ)) / m′ ) + λ

is the total divergence between source and target, and

B = 2 √( (2d log(2(m + 1)) + 2 log(8/δ)) / m )

is the complexity term, which is approximately √(d/m). The optimal value α* is a function of the number of target examples m_T = βm, the number of source examples m_S = (1 − β)m, and the ratio D = √d / A:

α*(m_T, m_S; D) = 1            if m_T ≥ D²,
                  min{1, ν}    if m_T < D²,      (2)

where

ν = ( m_T / (m_T + m_S) ) ( 1 + m_S / √( D²(m_S + m_T) − m_S m_T ) ).

Several observations follow from this analysis. First, if m_T = 0 (β = 0) then α* = 0, and if m_S = 0 (β = 1) then α* = 1. That is, if we have only source or only target data, the best combination is to use exactly what we have. Second, if we are certain that the source and target are the same, that is, if A = 0 (or D = ∞), then α* = β; the optimal combination is to use the training data with uniform weighting across all examples, as in Crammer et al. (2008), who always enforce such a uniform weighting.

Finally, two phase transitions occur in the value of α*. First, if there are enough target data (specifically, if m_T ≥ D² = d/A²), then no source data are needed, and in fact using any source data will yield suboptimal performance. This is because the possible reduction in error due to additional source data is always less than the increase in error caused by the source data being too far from the target data. Second, even if there are few target examples, it might be the case that we do not have enough source data to justify using it, and this small amount of source data should be ignored. Once we have enough source data, we get a non-trivial value for α*. These two phase transitions are illustrated in Fig. 1. The intensity of a point reflects the value of α* and ranges from 0 (white) to 1 (black). In this plot α* is a function of m_S (x-axis) and m_T (y-axis), and we fix the complexity to d = 1,601 and the divergence between source and target to A = 0.715. We chose these values to correspond more closely to real data (see Sect. 7). Observe first that D² = 1,601/0.715² ≈ 3,130. When m_T ≥ D², the first case of (2) predicts that α* = 1 for all values of m_S, which is illustrated by the black region above the line m_T = 3,130. Furthermore, fixing a value of m_T < 3,130, the second case of (2) predicts that α* will either be one (1) if m_S is small enough, or go smoothly to zero as m_S increases. This is illustrated by any horizontal line with m_T < 3,130. Each such line is black for small values of m_S and then gradually becomes white as m_S increases (left to right).
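As a concrete reading of (2), the following sketch (our own, written against the reconstruction of the formula given above) computes the optimal mixing value α* from the numbers of source and target examples, the complexity d, and the divergence A.

# Sketch of the optimal mixing value alpha* from (2), with D = sqrt(d) / A.
# m_t and m_s are the numbers of target and source examples.
import math

def optimal_alpha(m_t, m_s, d, A):
    if m_s == 0:
        return 1.0
    if m_t == 0:
        return 0.0
    if A == 0:                      # domains identical: uniform weighting
        return m_t / (m_t + m_s)
    D2 = d / (A * A)                # D^2 = d / A^2
    if m_t >= D2:                   # first phase transition: ignore source data
        return 1.0
    nu = (m_t / (m_t + m_s)) * (
        1 + m_s / math.sqrt(D2 * (m_s + m_t) - m_s * m_t)
    )
    return min(1.0, nu)

# Example matching Fig. 1: d = 1601, A = 0.715 gives D^2 of roughly 3130,
# so any m_t above that threshold yields alpha* = 1 regardless of m_s.
print(optimal_alpha(2000, 50000, 1601, 0.715))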

7 Results on sentiment classification

In this section we illustrate our theory on the natural language processing task of sentiment classification (Pang et al. 2002). The point of these experiments is not to instantiate the bound from Theorem 3 directly, since the amount of data we use here is much too small for the bound to yield meaningful numerical results. Instead, we show how the two main principles of our theory from Sects. 4 and 5 can be applied to a real-world problem. First, we show that an approximation to the H-distance, obtained by training a linear model to discriminate between instances from different domains, correlates well with the loss incurred by training in one domain and testing in another. Second, we investigate minimizing the α-error as suggested by Theorem 3. We show that the optimal value of α for a given amount of source and target data is closely related to our approximate H-distance.

Fig. 1  An illustration of the phase transition in the balance between source and target training data. The value of α minimizing the bound is indicated by the intensity, where black means α = 1. We fix d = 1,601 and A = 0.715, approximating the empirical setup in Fig. 3. The x-axis shows the number of source instances (log scale). The y-axis shows the number of target instances. A phase transition occurs at 3,130 target instances. With more target instances than this, it is more effective to ignore even an infinite amount of source data.

The next subsection describes the problem of sentiment classification, along with our dataset, features, and learning algorithms. Then we show experimentally how our approximate H-distance is related to the adaptation performance and to the optimal value of α.

7.1 Sentiment classification

Given a piece of text (usually a review or an essay), automatic sentiment classification is the task of determining whether the sentiment expressed by the text is positive or negative (Pang et al. 2002; Turney 2002). While movie reviews are the most commonly studied domain, sentiment analysis has been extended to a number of new domains, ranging from stock message boards to congressional floor debates (Das and Chen 2001; Thomas et al. 2006). Research results have been deployed industrially in systems that gauge market reaction and summarize opinion from Web pages, discussion boards, and blogs.

We used the publicly available data set from Blitzer et al. (2007b) to examine our theory.² The data set consists of reviews from the Amazon website for several different types of products. We chose reviews from the domains apparel, books, DVDs, kitchen & housewares, and electronics. Each review consists of a rating (1-5 stars), a title, and review text. We created a binary classification problem by binning reviews with 1-2 stars as negative and 4-5 stars as positive. Reviews with 3 stars were discarded. Classifying product reviews as having either positive or negative sentiment fits well into our theory of domain adaptation.

² Available at

We note that reviews for different products have widely different vocabularies, so classifiers trained on one domain are likely to miss out on important lexical cues in a different domain. On the other hand, a single good universal sentiment classifier is likely to exist, namely the classifier that assigns high positive weight to all positive words and high negative weight to all negative words, regardless of product type. We illustrate the type of text in this dataset in Fig. 2, which shows one positive and one negative review each from the domains books and kitchen & housewares.

Fig. 2  Some sample product reviews for sentiment classification. The top row shows reviews from the books domain. The bottom row shows reviews from kitchen & housewares.

Positive books review.  Title: A great find during an annual summer shopping trip.  Review: I found this novel at a bookstore on the boardwalk I visit every summer... the narrative was brilliantly told, the dialogue completely believable and the plot totally heartwrenching. If I had made it to the end without some tears, I would believe myself made of stone!

Negative books review.  Title: The Hard Way.  Review: I am not sure whatever possessed me to buy this book. Honestly, it was a complete waste of my time. To quote a friend, it was not the best use of my entertainment dollar. If you are a fan of pedestrian writing, lack-luster plots and hackneyed character development, this is your book.

Positive kitchen & housewares review.  Title: no more doggy feuds with neighbor.  Review: i absolutely love this product. my neighbor has four little yippers and my shepard/chow mix was antagonized by the yipping on our side of the fence. I hung the device on my side of the fence and the noise keeps the neighbors dog from picking arguments with my dog. all barking and fighting has ceased.

Negative kitchen & housewares review.  Title: cooks great, lid does not work well...  Review: I Love the way the Tefal deep fryer cooks, however, I am returning my second one due to a defective lid closure. The lid may close initially, but after a few uses it no longer stays closed. Since I have small children in my home, I will not be purchasing this one again.

For each domain, the data set contains 1,600 labeled documents and between 5,000 and 6,000 unlabeled documents. We follow Pang et al. (2002) and represent each instance (review) by a sparse vector containing the counts of its unigrams and bigrams, and we normalize the vectors in L1. Finally, we discard all but the 1,600 most frequent unigrams and bigrams from each data set. In all of the learning problems of the next section, including those that require us to estimate an approximate H-distance, we use signed linear classifiers. To estimate the parameters of these classifiers, we minimize a Huber loss with stochastic gradient descent (Zhang 2004).
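The feature pipeline and learner described above can be sketched as follows; this is our own illustration, assuming scikit-learn, and the variables reviews and labels are hypothetical placeholders for the raw review strings and their binary labels.

# Sketch of the Sect. 7.1 pipeline: unigram + bigram counts, truncated to the
# 1,600 most frequent features and L1-normalized, fed to a linear classifier
# trained with a (modified) Huber loss by stochastic gradient descent.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.linear_model import SGDClassifier

reviews = ["a great find during an annual summer shopping trip",
           "it was a complete waste of my time"]      # placeholder examples
labels = [1, -1]                                       # +1 positive, -1 negative

vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=1600)
X = normalize(vectorizer.fit_transform(reviews).astype(float), norm="l1")

clf = SGDClassifier(loss="modified_huber", max_iter=1000, tol=1e-3)
clf.fit(X, labels)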

7.2 Experiments

We explore Theorem 3 further by comparing its predictions to the predictions of an approximation that can be computed from finite labeled source and unlabeled source and target samples. As we shall see, our approximation is a finite-sample analog of (1).

We first address λ, the error of the ideal hypothesis. Unfortunately, in general we cannot assume any relationship between the labeling functions f_S and f_T. Thus in order to estimate λ, we must estimate ε_T(h*) independently of the source data. If we had enough target data to do this accurately, we would not need to adapt a source classifier in the first place. For our sentiment task, however, λ is small enough to be a negligible term in the bound. Thus we ignore it here.

We approximate the divergence between two domains by training a linear classifier to discriminate between unlabeled instances from the source and target domains. Then we apply Lemma 2 to get an estimate of d̂_H that we denote by ζ(U_S, U_T). ζ(U_S, U_T) is a lower bound on d̂_H, which is in turn a lower bound on d̂_{HΔH}. For Theorem 3 to be valid, we need an upper bound on d̂_{HΔH}. Unfortunately, this is computationally intractable for linear threshold classifiers, since finding a minimum-error classifier is hard in general (Ben-David et al. 2003). We chose the ζ(U_S, U_T) estimate because it requires no new machinery beyond an algorithm for empirical risk minimization on H.

Finally, we note that the unlabeled sample size is large, so we leave out the finite sample error term for the HΔH-divergence. We set C to be 1,601, the VC dimension of a 1,600-dimensional linear classifier, and ignore the log term in the numerator of the bound. The complete approximation to the bound is

f(α) = 2 √( C (α²/β + (1 − α)²/(1 − β)) / m ) + 2(1 − α) ζ(U_S, U_T).   (3)

Note that √(C/m) in (3) corresponds to B from (1), and ζ(U_S, U_T) is a finite-sample approximation to A when λ is negligible and we have large unlabeled samples from both the source and target domains.

We compare (3) to experimental results for the sentiment classification task. All of our experiments use the apparel domain as the target.

Fig. 3  Comparing the bound from Theorem 3 with test error for sentiment classification. Each column varies one component of the bound. For all plots, the y-axis shows the error and the x-axis shows α. Plots on the top row show the value given by our approximation to the bound, and plots on the bottom row show the empirical test set error. Column (a) depicts different distances among domains. Column (b) depicts different numbers of target instances, and column (c) depicts different numbers of source instances.
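To make ζ(U_S, U_T) and the approximation (3) concrete, here is a minimal sketch (ours, not the paper's code) that trains a linear domain discriminator on the unlabeled samples, converts its held-out error into a divergence estimate via Lemma 2, and evaluates (3) as reconstructed above. The held-out split fraction and function names are our own choices, and Xs_unlab, Xt_unlab are hypothetical unlabeled feature matrices.

# Sketch: estimate zeta(U_S, U_T) by training a linear classifier to tell
# source from target unlabeled instances (Lemma 2), then plug it into the
# finite-sample approximation (3) of the Theorem 3 bound.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

def proxy_divergence(Xs_unlab, Xt_unlab):
    X = np.vstack([Xs_unlab, Xt_unlab])
    y = np.concatenate([np.zeros(len(Xs_unlab)), np.ones(len(Xt_unlab))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y)
    clf = SGDClassifier(loss="modified_huber", max_iter=1000, tol=1e-3)
    clf.fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)    # domain-discrimination error
    return 2.0 * (1.0 - 2.0 * err)       # Lemma 2, assuming balanced samples

def f_alpha(alpha, beta, m, C, zeta):
    # Approximation (3): 2*sqrt(C*(alpha^2/beta + (1-alpha)^2/(1-beta))/m)
    #                    + 2*(1-alpha)*zeta
    comp = np.sqrt(C * (alpha**2 / beta + (1 - alpha) ** 2 / (1 - beta)) / m)
    return 2 * comp + 2 * (1 - alpha) * zeta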

We obtain empirical curves for the error as a function of α by training a classifier using a weighted hinge loss. Suppose the target domain has weight α and there are βm target training instances. Then we scale the loss of a target training instance by α/(βm) and the loss of a source training instance by (1 − α)/((1 − β)m).

Figure 3 shows a series of plots of (3) (top row) coupled with corresponding plots of test error (bottom row) as a function of α for different amounts of source and target data and different distances between domains. In each column, a single parameter (distance, number of target instances m_T, or number of source instances m_S) is varied while the other two are held constant. Note that β = m_T/(m_T + m_S).

The plots on the top row of Fig. 3 are not meant to be numerical proxies for the true error (for the source domains books and dvd, the distance alone is well above 1/2). However, they illustrate that the bound is similar in shape to the true error curve and that important relationships are preserved. Note that in every pair of plots, the empirical error curves, like the bounds, have an essentially convex shape. Furthermore, the value of α that minimizes the bound also yields low empirical error in each case. This suggests that choosing α to minimize the bound of Theorem 3 and subsequently training a classifier to minimize the empirical error ε̂_α(h) can work well in practice, provided we have a reasonable measure of complexity and λ is small.

Column (a) shows that more distant source domains result in higher target error. Column (b) illustrates that for more target data, we have not only lower error in general, but also a higher minimizing α. Finally, column (c) demonstrates the limitations of distant source data. When enough labeled target data exists, we always prefer to use only the target data, no matter how much source data is available. Intuitively this is because any biased source domain cannot help to reduce error beyond some positive constant. When the target data alone is sufficient to surpass this level of performance, the source data ceases to be useful. Thus column (c) illustrates empirically one of the phase transitions we discuss in Sect. 6.

8 Combining data from multiple sources

We now explore an extension of our theory to the case of multiple source domains. In this setting, the learner is presented with data from N distinct sources. Each source S_j is associated with an unknown distribution D_j over input points and an unknown labeling function f_j. The learner receives a total of m labeled samples, with m_j = β_j m from each source S_j, and the objective is to use these samples to train a model to perform well on a target domain ⟨D_T, f_T⟩, which may or may not be one of the sources. This setting is motivated by several domain adaptation algorithms (Huang et al. 2007; Bickel et al. 2007; Jiang and Zhai 2007; Dai et al. 2007) that weigh the loss from training instances depending on how far they are from the target domain. (That is, each training instance is its own source domain.)

As before, we examine algorithms that minimize convex combinations of training error over the labeled examples from each source domain. Given a vector α = (α_1, ..., α_N) of domain weights with Σ_{j=1}^N α_j = 1, we define the empirical α-weighted error of a function h as

ε̂_α(h) = Σ_{j=1}^N α_j ε̂_j(h) = Σ_{j=1}^N (α_j/(β_j m)) Σ_{x ∈ S_j} | h(x) − f_j(x) |.

The true α-weighted error ε_α(h) is defined analogously. We use D_α to denote the mixture of the N source distributions with mixing weights equal to the components of α.
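As a small illustration of the quantities just defined (our own sketch), the following computes the empirical α-weighted error from per-source empirical errors and the term Σ_j α_j²/β_j that governs the uniform convergence rate analyzed in Sect. 8.1 below; the input arrays are hypothetical.

# Sketch: empirical alpha-weighted error across N sources, and the quantity
# sum_j alpha_j^2 / beta_j controlling uniform convergence (Sect. 8.1).
# per_source_errors[j] is the empirical error of h on source j's sample,
# and sizes[j] = beta_j * m is that sample's size.
import numpy as np

def alpha_weighted_error(alpha, per_source_errors):
    alpha = np.asarray(alpha, dtype=float)
    return float(np.dot(alpha, per_source_errors))

def convergence_term(alpha, sizes):
    alpha = np.asarray(alpha, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    beta = sizes / sizes.sum()           # beta_j = m_j / m
    return float(np.sum(alpha**2 / beta))

# Uniform weighting alpha_j = beta_j minimizes the convergence term (value 1),
# matching the remark after Lemma 6.
sizes = [200, 800, 1000]
beta = np.asarray(sizes) / sum(sizes)
print(convergence_term(beta, sizes))     # -> 1.0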

We present in turn two alternative generalizations of the bounds in Sect. 5. The first bound considers the quality and quantity of data available from each source individually, ignoring the relationships between sources. In contrast, the second bound depends directly on the HΔH-distance between the target domain and the weighted combination of source domains. This dependence allows us to achieve significantly tighter bounds when there exists a mixture of sources that approximates the target better than any single source. Both results require the derivation of uniform convergence bounds for the empirical α-error. We begin with those.

8.1 Uniform convergence

The following lemma provides a uniform convergence bound for the empirical α-error.

Lemma 6  For each j ∈ {1, ..., N}, let S_j be a labeled sample of size β_j m generated by drawing β_j m points from D_j and labeling them according to f_j. For any fixed weight vector α, let ε̂_α(h) be the empirical α-weighted error of some fixed hypothesis h on this sample, and let ε_α(h) be the true α-weighted error. Then for any ε > 0,

Pr[ | ε̂_α(h) − ε_α(h) | ≥ ε ] ≤ 2 exp( −2mε² / Σ_{j=1}^N (α_j²/β_j) ).

Proof  See Appendix.

Note that this bound is minimized when α_j = β_j for all j. In other words, convergence is fastest when all data instances are weighted equally.

8.2 A bound using pairwise divergence

The first bound we present considers the pairwise HΔH-distance between each source and the target, and illustrates the trade-off that exists between minimizing the average divergence of the training data from the target and weighting all points equally to encourage faster convergence. The term Σ_{j=1}^N α_j λ_j that appears in this bound plays a role corresponding to λ in the previous section. Somewhat surprisingly, this term can be small even when there is no single hypothesis that works well for all heavily weighted sources.

Theorem 4  Let H be a hypothesis space of VC dimension d. For each j ∈ {1, ..., N}, let S_j be a labeled sample of size β_j m generated by drawing β_j m points from D_j and labeling them according to f_j. If ĥ ∈ H is the empirical minimizer of ε̂_α(h) for a fixed weight vector α on these samples and h*_T = argmin_{h ∈ H} ε_T(h) is the target error minimizer, then for any δ ∈ (0, 1), with probability at least 1 − δ,

ε_T(ĥ) ≤ ε_T(h*_T) + 2 √( ( Σ_{j=1}^N α_j²/β_j ) (d log(2m) − log δ) / (2m) ) + Σ_{j=1}^N α_j ( 2λ_j + d_{HΔH}(D_j, D_T) ),

where λ_j = min_{h ∈ H} { ε_T(h) + ε_j(h) }.

Proof  See Appendix.

In the special case where the HΔH-divergence between each source and the target is 0 and all data instances are weighted equally, the bound in Theorem 4 becomes

ε_T(ĥ) ≤ ε_T(h*_T) + 2 √( (2d log(2(m + 1)) + log(4/δ)) / m ) + 2 Σ_{j=1}^N α_j λ_j.

This bound is nearly identical to the multiple source classification bound given in Theorem 6 of Crammer et al. (2008). Aside from the constants in the complexity term, the only difference is that the quantity λ_j that appears here is replaced by an alternate measure of the label error between source S_j and the target. Furthermore, these measures are equivalent when the true target function is a member of H. However, the bound of Crammer et al. (2008) is less general. In particular, it does not handle positive HΔH-divergence or non-uniform weighting of the data.

8.3 A bound using combined divergence

In the previous bound, divergence between domains is measured only on pairs, so it is not necessary to have a single hypothesis that is good for every source domain. However, that bound does not give us the flexibility to take advantage of domain structure when calculating unlabeled divergence. The alternate bound given in Theorem 5 allows us to effectively alter the source distribution by changing α. This has two consequences. First, we must now demand that there exists a hypothesis h* which has low error on both the α-weighted convex combination of sources and the target domain. Second, we measure the HΔH-divergence between the target and a mixture of sources, rather than between the target and each single source.

Theorem 5  Let H be a hypothesis space of VC dimension d. For each j ∈ {1, ..., N}, let S_j be a labeled sample of size β_j m generated by drawing β_j m points from D_j and labeling them according to f_j. If ĥ ∈ H is the empirical minimizer of ε̂_α(h) for a fixed weight vector α on these samples and h*_T = argmin_{h ∈ H} ε_T(h) is the target error minimizer, then for any δ ∈ (0, 1), with probability at least 1 − δ,

ε_T(ĥ) ≤ ε_T(h*_T) + 4 √( ( Σ_{j=1}^N α_j²/β_j ) (d log(2m) − log δ) / (2m) ) + 2γ_α + d_{HΔH}(D_α, D_T),

where γ_α = min_h { ε_T(h) + ε_α(h) } = min_h { ε_T(h) + Σ_{j=1}^N α_j ε_j(h) }.

Proof  See Appendix.

Theorem 5 reduces to Theorem 3 when N = 2 and one of the two source domains is the target domain (that is, when we have some small number of target instances).

8.4 Discussion

One might ask whether there exist settings where a non-uniform weighting can lead to a significantly lower value of the bound than a uniform weighting. Indeed, this can happen if some non-uniform weighting of sources accurately approximates the target distribution. This is true, for example, in the setting studied by Mansour et al. (2009a, 2009b), who derive results for combining pre-computed hypotheses. In particular, they show that for arbitrary convex losses, if the Rényi divergence between the target and a mixture of sources is small, it is possible to combine low-error source hypotheses to create a low-error target hypothesis. They then show that if for each domain j there exists a hypothesis h_j with error less than ε, it is possible to achieve an error less than ε on the target by weighting the predictions of h_1, ..., h_N appropriately.

The Rényi divergence is not directly comparable to the HΔH-divergence in general; however, it is possible to exhibit source and target distributions which have low HΔH-divergence and high (even infinite) Rényi divergence. For example, the Rényi divergence is infinite when the source and target distributions do not share support, but the HΔH-divergence is only large when these regions of differing support also coincide with classifier disagreement regions. On the other hand, we require that a single hypothesis be trained on the mixture of sources. Mansour et al. (2009a, 2009b) give algorithms which do not require the original training data at all, but only a single hypothesis from each source.

9 Conclusion

We presented a theoretical investigation of the task of domain adaptation, in which we have a large amount of training data from a source domain but wish to apply a model in a target domain with a much smaller amount of training data. Our main result is a uniform convergence learning bound for algorithms which minimize convex combinations of empirical source and target error. Our bound reflects the trade-off between the size of the source data and the accuracy of the target data, and we give a simple approximation to it that is computable from finite labeled and unlabeled samples. This approximation makes correct predictions about model test error for a sentiment classification task.

Our theory also extends in a straightforward manner to a multi-source setting, which we believe helps to explain the success of recent empirical work in domain adaptation. There are two interesting open problems that deserve future exploration. First, our bounds on the divergence between source and target distributions are in terms of VC dimension. We do not yet know whether our divergence measure admits tighter data-dependent bounds (McAllester 2003; Bartlett and Mendelson 2002), or whether there are other, more appropriate divergence measures which do. Second, it would be interesting to investigate algorithms that choose a convex combination of multiple sources to minimize the bound in Theorem 5 as possible approaches to adaptation from multiple sources.

Acknowledgements  This material is based upon work partially supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. NBCHD (CALO), by the National Science Foundation under grants ITR and RI, and by a gift from Google, Inc. to the University of Pennsylvania. Koby Crammer is a Horev fellow, supported by the Taub Foundations. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, the Department of the Interior-National Business Center (DOI-NBC), NSF, the Taub Foundations, or Google, Inc.

Open Access  This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Appendix: Proofs

Theorem 1  For a hypothesis h,

ε_T(h) ≤ ε_S(h) + d_1(D_S, D_T) + min{ E_{D_S}[ |f_S(x) − f_T(x)| ], E_{D_T}[ |f_S(x) − f_T(x)| ] }.

Proof  Recall that ε_T(h) = ε_T(h, f_T) and ε_S(h) = ε_S(h, f_S). Let φ_S and φ_T be the density functions of D_S and D_T respectively. Then

ε_T(h) = ε_T(h) + ε_S(h) − ε_S(h) + ε_S(h, f_T) − ε_S(h, f_T)
 ≤ ε_S(h) + | ε_S(h, f_T) − ε_S(h, f_S) | + | ε_T(h, f_T) − ε_S(h, f_T) |
 ≤ ε_S(h) + E_{D_S}[ |f_S(x) − f_T(x)| ] + | ε_T(h, f_T) − ε_S(h, f_T) |
 ≤ ε_S(h) + E_{D_S}[ |f_S(x) − f_T(x)| ] + ∫ | φ_S(x) − φ_T(x) | |h(x) − f_T(x)| dx
 ≤ ε_S(h) + E_{D_S}[ |f_S(x) − f_T(x)| ] + d_1(D_S, D_T).

In the first line, we could instead choose to add and subtract ε_T(h, f_S) rather than ε_S(h, f_T), which would result in the same bound only with the expectation taken with respect to D_T instead of D_S. Choosing the smaller of the two gives us the bound.

Lemma 2  For a symmetric hypothesis class H (one where for every h ∈ H, the inverse hypothesis 1 − h is also in H) and samples U, U′ of size m, the empirical H-distance is

d̂_H(U, U′) = 2 ( 1 − min_{h ∈ H} [ (1/m) Σ_{x: h(x)=0} I[x ∈ U] + (1/m) Σ_{x: h(x)=1} I[x ∈ U′] ] ),

where I[x ∈ U] is the binary indicator variable which is 1 when x ∈ U.

Proof  We will show that for any hypothesis h and corresponding set I(h) of positively labeled instances,

1 − [ (1/m) Σ_{x: h(x)=0} I[x ∈ U] + (1/m) Σ_{x: h(x)=1} I[x ∈ U′] ] = Pr_U[ I(h) ] − Pr_{U′}[ I(h) ].

We have

1 − (1/m) ( Σ_{x: h(x)=0} I[x ∈ U] + Σ_{x: h(x)=1} I[x ∈ U′] )
 = (1/m) ( Σ_{x: h(x)=1} I[x ∈ U] − Σ_{x: h(x)=1} I[x ∈ U′] )
 = Pr_U[ I(h) ] − Pr_{U′}[ I(h) ],

where the first equality uses the fact that Σ_{x: h(x)=0} I[x ∈ U] + Σ_{x: h(x)=1} I[x ∈ U] = m. The absolute value in the statement of the lemma follows from the symmetry of H.

Lemma 4  Let h be a hypothesis in class H. Then

| ε_α(h) − ε_T(h) | ≤ (1 − α) ( (1/2) d_{HΔH}(D_S, D_T) + λ ).

Proof  Similarly to the proof of Theorem 2, this proof relies heavily on the triangle inequality for classification error.

| ε_α(h) − ε_T(h) | = (1 − α) | ε_S(h) − ε_T(h) |
 ≤ (1 − α) [ | ε_S(h) − ε_S(h, h*) | + | ε_S(h, h*) − ε_T(h, h*) | + | ε_T(h, h*) − ε_T(h) | ]
 ≤ (1 − α) [ ε_S(h*) + | ε_S(h, h*) − ε_T(h, h*) | + ε_T(h*) ]
 ≤ (1 − α) ( (1/2) d_{HΔH}(D_S, D_T) + λ ).

Theorem 3  Let H be a hypothesis space of VC dimension d. Let U_S and U_T be unlabeled samples of size m′ each, drawn from D_S and D_T respectively. Let S be a labeled sample of size m generated by drawing βm points from D_T and (1 − β)m points from D_S, labeling them according to f_T and f_S, respectively. If ĥ ∈ H is the empirical minimizer of ε̂_α(h) on S and h*_T = argmin_{h ∈ H} ε_T(h) is the target error minimizer, then for any δ ∈ (0, 1), with probability at least 1 − δ over the choice of the samples,

ε_T(ĥ) ≤ ε_T(h*_T) + 4 √( (α²/β + (1 − α)²/(1 − β)) (2d log(2(m + 1)) + 2 log(8/δ)) / m )
        + 2(1 − α) ( (1/2) d̂_{HΔH}(U_S, U_T) + 4 √( (2d log(2m′) + log(8/δ)) / m′ ) + λ ).

Proof  The complete proof is mostly identical to the standard proof of uniform convergence for empirical risk minimizers. We show here the steps that are different. Below we use (L4) and (Th2) to indicate that a line of the proof follows by application of Lemma 4 or Theorem 2, respectively. (L5) indicates that a step follows by Lemma 5, but also relies on sample symmetrization and bounding the growth function by the VC dimension (Anthony and Bartlett 1999).


More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition Upper bound on false alar rate for landine detection and classification using syntactic pattern recognition Ahed O. Nasif, Brian L. Mark, Kenneth J. Hintz, and Nathalia Peixoto Dept. of Electrical and

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes

Rademacher Complexity Margin Bounds for Learning with a Large Number of Classes Radeacher Coplexity Margin Bounds for Learning with a Large Nuber of Classes Vitaly Kuznetsov Courant Institute of Matheatical Sciences, 25 Mercer street, New York, NY, 002 Mehryar Mohri Courant Institute

More information

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers Ocean 40 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers 1. Hydrostatic Balance a) Set all of the levels on one of the coluns to the lowest possible density.

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

1 Bounding the Margin

1 Bounding the Margin COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

arxiv: v3 [stat.ml] 9 Aug 2016

arxiv: v3 [stat.ml] 9 Aug 2016 with Specialization to Linear Classifiers Pascal Gerain Aaury Habrard François Laviolette 3 ilie Morvant INRIA, SIRRA Project-Tea, 75589 Paris, France, et DI, École Norale Supérieure, 7530 Paris, France

More information

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies OPERATIONS RESEARCH Vol. 52, No. 5, Septeber October 2004, pp. 795 803 issn 0030-364X eissn 1526-5463 04 5205 0795 infors doi 10.1287/opre.1040.0130 2004 INFORMS TECHNICAL NOTE Lost-Sales Probles with

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 11 10/15/2008 ABSTRACT INTEGRATION I

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 11 10/15/2008 ABSTRACT INTEGRATION I MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 11 10/15/2008 ABSTRACT INTEGRATION I Contents 1. Preliinaries 2. The ain result 3. The Rieann integral 4. The integral of a nonnegative

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

A Theoretical Framework for Deep Transfer Learning

A Theoretical Framework for Deep Transfer Learning A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

The Transactional Nature of Quantum Information

The Transactional Nature of Quantum Information The Transactional Nature of Quantu Inforation Subhash Kak Departent of Coputer Science Oklahoa State University Stillwater, OK 7478 ABSTRACT Inforation, in its counications sense, is a transactional property.

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

Chapter 6: Economic Inequality

Chapter 6: Economic Inequality Chapter 6: Econoic Inequality We are interested in inequality ainly for two reasons: First, there are philosophical and ethical grounds for aversion to inequality per se. Second, even if we are not interested

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation

Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation Algorithic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation Michael Kearns AT&T Labs Research Murray Hill, New Jersey kearns@research.att.co Dana Ron MIT Cabridge, MA danar@theory.lcs.it.edu

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

STOPPING SIMULATED PATHS EARLY

STOPPING SIMULATED PATHS EARLY Proceedings of the 2 Winter Siulation Conference B.A.Peters,J.S.Sith,D.J.Medeiros,andM.W.Rohrer,eds. STOPPING SIMULATED PATHS EARLY Paul Glasseran Graduate School of Business Colubia University New Yor,

More information

Curious Bounds for Floor Function Sums

Curious Bounds for Floor Function Sums 1 47 6 11 Journal of Integer Sequences, Vol. 1 (018), Article 18.1.8 Curious Bounds for Floor Function Sus Thotsaporn Thanatipanonda and Elaine Wong 1 Science Division Mahidol University International

More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

PAC-Bayesian Learning of Linear Classifiers

PAC-Bayesian Learning of Linear Classifiers Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique

More information

On Conditions for Linearity of Optimal Estimation

On Conditions for Linearity of Optimal Estimation On Conditions for Linearity of Optial Estiation Erah Akyol, Kuar Viswanatha and Kenneth Rose {eakyol, kuar, rose}@ece.ucsb.edu Departent of Electrical and Coputer Engineering University of California at

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

Revealed Preference and Stochastic Demand Correspondence: A Unified Theory

Revealed Preference and Stochastic Demand Correspondence: A Unified Theory Revealed Preference and Stochastic Deand Correspondence: A Unified Theory Indraneel Dasgupta School of Econoics, University of Nottingha, Nottingha NG7 2RD, UK. E-ail: indraneel.dasgupta@nottingha.ac.uk

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Multiple Instance Learning with Query Bags

Multiple Instance Learning with Query Bags Multiple Instance Learning with Query Bags Boris Babenko UC San Diego bbabenko@cs.ucsd.edu Piotr Dollár California Institute of Technology pdollar@caltech.edu Serge Belongie UC San Diego sjb@cs.ucsd.edu

More information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information Cite as: Straub D. (2014). Value of inforation analysis with structural reliability ethods. Structural Safety, 49: 75-86. Value of Inforation Analysis with Structural Reliability Methods Daniel Straub

More information

Strategic Classification. Moritz Hardt Nimrod Megiddo Christos Papadimitriou Mary Wootters. November 24, 2015

Strategic Classification. Moritz Hardt Nimrod Megiddo Christos Papadimitriou Mary Wootters. November 24, 2015 Strategic Classification Moritz Hardt Nirod Megiddo Christos Papadiitriou Mary Wootters Noveber 24, 2015 arxiv:1506.06980v2 [cs.lg] 22 Nov 2015 Abstract Machine learning relies on the assuption that unseen

More information

An Extension to the Tactical Planning Model for a Job Shop: Continuous-Time Control

An Extension to the Tactical Planning Model for a Job Shop: Continuous-Time Control An Extension to the Tactical Planning Model for a Job Shop: Continuous-Tie Control Chee Chong. Teo, Rohit Bhatnagar, and Stephen C. Graves Singapore-MIT Alliance, Nanyang Technological Univ., and Massachusetts

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Tight Complexity Bounds for Optimizing Composite Objectives

Tight Complexity Bounds for Optimizing Composite Objectives Tight Coplexity Bounds for Optiizing Coposite Objectives Blake Woodworth Toyota Technological Institute at Chicago Chicago, IL, 60637 blake@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Stability Bounds for Non-i.i.d. Processes

Stability Bounds for Non-i.i.d. Processes tability Bounds for Non-i.i.d. Processes Mehryar Mohri Courant Institute of Matheatical ciences and Google Research 25 Mercer treet New York, NY 002 ohri@cis.nyu.edu Afshin Rostaiadeh Departent of Coputer

More information

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs International Cobinatorics Volue 2011, Article ID 872703, 9 pages doi:10.1155/2011/872703 Research Article On the Isolated Vertices and Connectivity in Rando Intersection Graphs Yilun Shang Institute for

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Revealed Preference with Stochastic Demand Correspondence

Revealed Preference with Stochastic Demand Correspondence Revealed Preference with Stochastic Deand Correspondence Indraneel Dasgupta School of Econoics, University of Nottingha, Nottingha NG7 2RD, UK. E-ail: indraneel.dasgupta@nottingha.ac.uk Prasanta K. Pattanaik

More information