arXiv: v3 [stat.ML] 9 Aug 2016


with Specialization to Linear Classifiers

Pascal Germain (1), Amaury Habrard (2), François Laviolette (3), Emilie Morvant (2)

(1) INRIA, SIERRA Project-Team, Paris, France, and DI, École Normale Supérieure, 75230 Paris, France
(2) Univ Lyon, UJM-Saint-Etienne, CNRS, Institut d'Optique Graduate School, Laboratoire Hubert Curien UMR 5516, F-42023, Saint-Etienne, France
(3) Département d'informatique et de génie logiciel, Université Laval, Québec, Canada

October 2018

This report is a long version of our paper entitled "A PAC-Bayesian Approach for Domain Adaptation with Specialization to Linear Classifiers", published in the proceedings of the International Conference on Machine Learning (ICML) 2013. We improved our main results, extended our experiments, and proposed an extension to multisource domain adaptation.

Abstract

In this paper, we provide two main contributions in PAC-Bayesian theory for domain adaptation, where the objective is to learn, from a source distribution, a well-performing majority vote on a different target distribution. On the one hand, we propose an improvement of the previous approach proposed by Germain et al. (2013), which relies on a novel distribution pseudodistance based on a disagreement averaging, allowing us to derive a new, tighter PAC-Bayesian domain adaptation bound for the stochastic Gibbs classifier. We specialize it to linear classifiers, and design a learning algorithm which shows interesting results on a synthetic problem and on a popular sentiment annotation task. On the other hand, we generalize these results to multisource domain adaptation, allowing us to take into account different source domains. This study opens the door to tackling domain adaptation tasks by making use of all the PAC-Bayesian tools.

1 Introduction

As human beings, we learn from what we saw before. Think about our education process: when a student attends a new course, he has to make use of the knowledge he acquired during previous courses. However, in machine learning the most common assumption is based on the fact that the
learning and test data are drawn from the same probability distribution. This strong assumption may be clearly irrelevant for a lot of real tasks, including those where we desire to adapt a model from one task to another one. For instance, a spam filtering system suitable for one user can be poorly adapted to another who receives significantly different emails. In other words, the learning data associated with one or several users could be unrepresentative of the test data coming from another one. This underlines the need to design methods for adapting a classifier from learning (source) data to test (target) data. One solution to tackle this issue is to consider the domain adaptation framework, which arises when the distribution generating the target data (the target domain) differs from the one generating the source data (the source domain). In such a situation, it is well known that domain adaptation is a hard and challenging task even under strong assumptions (Ben-David and Urner, 2012; Ben-David et al., 2010b; Ben-David and Urner, 2014). Note that domain adaptation with learning data coming from different source domains is referred to as multisource or multiple sources domain adaptation (Crammer et al., 2007; Mansour et al., 2009c; Ben-David et al., 2010a). See the surveys proposed by Jiang (2008), Quionero-Candela et al. (2009), and Margolis (2011).

Among the existing approaches in the literature to address domain adaptation, the instance weighting-based methods allow one to deal with the covariate-shift problem (e.g., Huang et al., 2006; Sugiyama et al., 2008), where source and target domains diverge only in their marginals, i.e., they share the same labeling function. Another technique is to exploit self-labeling procedures, where the objective is to transfer the source labels to the target unlabeled points (e.g., Bruzzone and Marconcini, 2010; Habrard et al., 2013; Morvant, 2014). A third solution is to learn a new common representation from the unlabeled part of source and target data. Then, a standard supervised learning algorithm can be executed on the source labeled instances (e.g., Glorot et al., 2011; Chen et al., 2012).

The work presented in this paper belongs to a popular class of approaches, which relies on a distance between the source distribution and the target distribution. Such a distance depends on the set $\mathcal{H}$ of hypotheses (or classifiers) considered by the learning algorithm. The intuition behind this approach is that one must look for a set $\mathcal{H}$ that minimizes the distance while preserving good performances on the source data; if the distributions are close under this measure, then generalization ability may be easier to quantify. In fact, defining such a measure to quantify how much the domains are related is a major issue in domain adaptation. For example, in the context of binary classification with the 0-1 loss function, Ben-David et al. (2010a) and Ben-David et al. (2006) have considered the $\mathcal{H}\Delta\mathcal{H}$-divergence between the marginal distributions. This quantity is based on the maximal disagreement between two classifiers, allowing them to deduce a domain adaptation generalization bound based on the VC-dimension theory. The discrepancy distance proposed by Mansour et al. (2009a) generalizes this divergence to real-valued functions and more general losses, and is used to obtain a generalization bound based on the Rademacher complexity. In this context, Cortes and Mohri (2011, 2014) have specialized the
minimization of the discrepancy to regression with kernels. In these situations, domain adaptation can be viewed as a multiple trade-off between the complexity of the hypothesis class $\mathcal{H}$, the adaptation ability of $\mathcal{H}$ according to the divergence between the marginals, and the empirical source risk. Moreover, other measures have been exploited under different assumptions, such as the Rényi divergence suitable for importance weighting (Mansour et al., 2009b), or the measure proposed by C. Zhang et al. (2012), which takes into account the source and target true labeling, or the Bayesian divergence prior (Li and Bilmes, 2007), which favors classifiers closer to the best source model. However, a majority of methods prefer to perform a two-step approach: (i) first construct a suitable representation by minimizing the divergence, then (ii) learn a model on the source domain in the new representation space.

The novelty of our contribution is to explore the PAC-Bayesian framework to tackle domain adaptation in a binary classification situation without target labels (sometimes called unsupervised domain adaptation). Given a prior distribution over a family of classifiers $\mathcal{H}$, PAC-Bayesian theory (introduced by McAllester, 1999) focuses on algorithms that output a posterior distribution $\rho$ over $\mathcal{H}$ (i.e., a $\rho$-average over $\mathcal{H}$) rather than just a single classifier $h \in \mathcal{H}$. Following this principle, we propose a pseudometric which evaluates the domain divergence according to the $\rho$-average disagreement of the classifiers over the domains. This disagreement measure shows many advantages. First, it is ideal for the PAC-Bayesian setting, since it is expressed as a $\rho$-average over $\mathcal{H}$. Second, we prove that it is always lower than the popular $\mathcal{H}\Delta\mathcal{H}$-divergence. Last but not least, our measure can be easily estimated from samples. Indeed, based on this disagreement measure, we derived in a previous work (Germain et al., 2013) a first PAC-Bayesian domain adaptation bound expressed as a $\rho$-averaging. In this paper, we provide a new version of this result, that does not change the philosophy
supported by the previous bound, but clearly improves the theoretical result: the domain adaptation bound is now tighter and easier to interpret. Thanks to this new result, we also derive three new PAC-Bayesian domain adaptation generalization bounds. Then, in contrast to the majority of methods that perform a two-step procedure, we design an algorithm tailored to linear classifiers, called PBDA, which jointly minimizes the multiple trade-offs implied by the bounds. The first two quantities are, as usual in the PAC-Bayesian approach, the complexity of the majority vote (measured by a Kullback-Leibler divergence) and the empirical risk (measured by the $\rho$-average errors on the source sample). The third quantity corresponds to our domain divergence and assesses the capacity of the posterior distribution to distinguish some structural difference between the source and target samples. Finally, we extend our results to domain adaptation with multiple sources by considering a mixture of different source domains, as done by Ben-David et al. (2010a).

The rest of the paper is structured as follows. Section 2 deals with two seminal works on domain adaptation. The PAC-Bayesian framework is then recalled in Section 3. Note that, for the sake of completeness, we provide for the first time the explicit derivation of the algorithm PBGD3 (Germain et al., 2009a) tailored to linear classifiers in supervised learning. Our main contribution, which consists in a domain adaptation bound suitable for PAC-Bayesian learning, is presented in Section 4. Then, we derive our new algorithm for PAC-Bayesian domain adaptation in Section 5, which we evaluate experimentally in Section 6. Afterwards, we generalize this analysis to multisource domain adaptation in Section 7. Before concluding in Section 9, we discuss two important points in Section 8: (i) two different results for the multisource setting that raise open questions for deriving new algorithms, and (ii) the comparison between our new result and the one provided in Germain et al. (2013).

Footnote: In this paper, we were very keen to improve the readability of our proofs, particularly those provided by Germain et al. (2013) as supplementary material. The proof techniques may be of independent interest.

Technical Report V

2 Domain Adaptation Related Works

In this section, we review the two seminal works in domain adaptation that are based on a divergence measure between the domains (Ben-David et al., 2010a; Ben-David et al., 2006; Mansour et al., 2009a).

2.1 Notations and Setting

We consider domain adaptation for binary classification tasks where $X \subseteq \mathbb{R}^d$ is the input space of dimension $d$ and $Y = \{-1, +1\}$ is the label set. The source domain $P_S$ and the target domain $P_T$ are two different distributions over $X \times Y$ (unknown and fixed), $D_S$ and $D_T$ being the respective marginal distributions over $X$. We tackle the challenging task where we have no target labels. A learning algorithm is then provided with a labeled source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^{m}$ consisting of $m$ examples drawn i.i.d. (footnote 3) from $P_S$, and an unlabeled target sample $T = \{x_j^t\}_{j=1}^{m'}$ consisting of $m'$ examples drawn i.i.d. from $D_T$. Note that we denote the distribution of an $m$-sample by $(P_S)^m$. We suppose that $\mathcal{H}$ is a set of hypothesis functions from $X$ to $Y$. The expected source error and the expected target error of $h \in \mathcal{H}$ over $P_S$, respectively $P_T$, are the probability that $h$ errs on the entire distribution $P_S$, respectively $P_T$,

$$R_{P_S}(h) = \mathbb{E}_{(x^s, y^s) \sim P_S} L_{0\text{-}1}\big(h(x^s), y^s\big), \quad \text{and} \quad R_{P_T}(h) = \mathbb{E}_{(x^t, y^t) \sim P_T} L_{0\text{-}1}\big(h(x^t), y^t\big),$$

where $L_{0\text{-}1}(a, b) = I[a \neq b]$ is the 0-1 loss function, which returns 1 if $a \neq b$ and 0 otherwise. The empirical source error $R_S(\cdot)$ on the learning sample $S$ is

$$R_S(h) = \frac{1}{m} \sum_{(x^s, y^s) \in S} L_{0\text{-}1}\big(h(x^s), y^s\big).$$

The main objective in domain adaptation is then to learn, without target labels, a classifier $h \in \mathcal{H}$
leading to the lowest expected target error $R_{P_T}(h)$.

We also introduce the expected source disagreement $R_{D_S}(h, h')$ and the expected target disagreement $R_{D_T}(h, h')$ of $(h, h') \in \mathcal{H}^2$, which measure the probability that two classifiers $h$ and $h'$ do not agree on the respective marginal distributions, and are defined by

$$R_{D_S}(h, h') = \mathbb{E}_{x^s \sim D_S} L_{0\text{-}1}\big(h(x^s), h'(x^s)\big) \quad \text{and} \quad R_{D_T}(h, h') = \mathbb{E}_{x^t \sim D_T} L_{0\text{-}1}\big(h(x^t), h'(x^t)\big).$$

The empirical source disagreement $R_S(h, h')$ on $S$ and the empirical target disagreement $R_T(h, h')$ on $T$ are

$$R_S(h, h') = \frac{1}{m} \sum_{x^s \in S} L_{0\text{-}1}\big(h(x^s), h'(x^s)\big) \quad \text{and} \quad R_T(h, h') = \frac{1}{m'} \sum_{x^t \in T} L_{0\text{-}1}\big(h(x^t), h'(x^t)\big).$$

Note that, depending on the context, $S$ denotes either the source labeled sample $\{(x_i^s, y_i^s)\}_{i=1}^{m}$ or its unlabeled part $\{x_i^s\}_{i=1}^{m}$. Note also that the expected error $R_P(h)$ on a distribution $P$ can be viewed as a shortcut notation for the expected disagreement between a hypothesis $h$ and a labeling function $f_P$ that assigns the true label to an example description with respect to $P$. We have

$$R_P(h) = R_D(h, f_P) = \mathbb{E}_{x \sim D} L_{0\text{-}1}\big(h(x), f_P(x)\big),$$

where $D$ is the marginal distribution of $P$ over $X$.

Footnote 3: i.i.d. stands for independent and identically distributed.
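The empirical quantities defined above are plain averages of 0-1 losses, so they are easy to compute. The following is a minimal sketch, assuming NumPy and two hypothetical threshold classifiers `h1` and `h2` on a toy sample (none of these names or data come from the paper):

```python
import numpy as np

def zero_one_loss(a, b):
    """Elementwise 0-1 loss: 1.0 where a != b, 0.0 elsewhere."""
    return (np.asarray(a) != np.asarray(b)).astype(float)

def empirical_risk(h, X, y):
    """R_S(h): average 0-1 loss of h on the labeled sample (X, y)."""
    return float(zero_one_loss(h(X), y).mean())

def empirical_disagreement(h, h_prime, X):
    """R_S(h, h'): fraction of points of X on which h and h' disagree."""
    return float(zero_one_loss(h(X), h_prime(X)).mean())

# Hypothetical voters: each one thresholds a single coordinate.
def h1(X):
    return np.where(X[:, 0] >= 0, 1, -1)

def h2(X):
    return np.where(X[:, 1] >= 0, 1, -1)

# Toy labeled source sample (illustrative data only).
X_src = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y_src = np.array([1, 1, -1, -1])

risk_h1 = empirical_risk(h1, X_src, y_src)
risk_h2 = empirical_risk(h2, X_src, y_src)
disagreement = empirical_disagreement(h1, h2, X_src)
```

Note that the disagreement needs no labels, which is why it can also be evaluated on the unlabeled target sample $T$.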

2.2 Necessity of a Domain Divergence

The domain adaptation objective is to find a low-error target hypothesis, even if the target labels are not available. Even under strong assumptions, this task can be impossible to solve (Ben-David and Urner, 2012; Ben-David et al., 2010b). However, for deriving generalization ability in a domain adaptation situation (with the help of a domain adaptation bound), it is critical to make use of a divergence between the source and the target domains: the more similar the domains, the easier the adaptation appears. Some previous works have proposed different quantities to estimate how close a domain is to another one (C. Zhang et al., 2012; Ben-David et al., 2010a; Mansour et al., 2009a,b; Ben-David et al., 2006; Li and Bilmes, 2007). Concretely, two domains $P_S$ and $P_T$ differ if their marginals $D_S$ and $D_T$ are different, or if the source labeling function differs from the target one, or if both happen. This suggests taking into account two divergences: one between $D_S$ and $D_T$, and one between the labelings. If we have some target labels, we can combine the two distances as done by C. Zhang et al. (2012). Otherwise, we preferably consider two separate measures, since it is impossible to estimate the best target hypothesis in such a situation. Usually, we suppose that the source labeling function is somehow related to the target one; then we look for a representation where the marginals $D_S$ and $D_T$ appear closer without losing performances on the source domain.

2.3 Domain Adaptation Bounds for Binary Classification

We now review the first two seminal works which propose domain adaptation bounds based on a marginal divergence. First, under the assumption that there exists a hypothesis in $\mathcal{H}$ that performs well on both the source and the target domain, Ben-David et al. (2010a) and Ben-David et al. (2006) have provided the following domain adaptation bound.

Theorem 1 (Ben-David et al., 2010a; Ben-David et al., 2006). Let $\mathcal{H}$ be a symmetric (footnote 4) hypothesis class. We have

$$\forall h \in \mathcal{H}, \quad R_{P_T}(h) \leq R_{P_S}(h) + \tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) + \mu_{h^*}, \qquad (1)$$

where

$$\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) = \sup_{(h, h') \in \mathcal{H}^2} \big| R_{D_T}(h, h') - R_{D_S}(h, h') \big|$$

is the $\mathcal{H}\Delta\mathcal{H}$-distance between the marginals $D_S$ and $D_T$, and $\mu_{h^*} = R_{P_S}(h^*) + R_{P_T}(h^*)$ is the error of the best hypothesis overall, denoted $h^*$ and defined by

$$h^* = \operatorname*{argmin}_{h \in \mathcal{H}} \big( R_{P_S}(h) + R_{P_T}(h) \big).$$

This bound depends on three terms. $R_{P_S}(h)$ is the classical source domain expected error. $\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T)$ depends on $\mathcal{H}$ and corresponds to the maximum disagreement between two hypotheses of $\mathcal{H}$. In other words, it quantifies how well hypotheses from $\mathcal{H}$ can detect differences between these marginals: the lower this measure is for a given $\mathcal{H}$, the better the generalization guarantees. The last term $\mu_{h^*} = R_{P_S}(h^*) + R_{P_T}(h^*)$ is related to the best hypothesis $h^*$ over the domains and acts as a quality measure of $\mathcal{H}$ in terms of labeling information. If $h^*$ does not have a good performance on both the source and the target domain, then there is no way one can adapt from this source to this target. Hence, as pointed out by the authors, Equation (1), together with the usual VC-bound theory, expresses a multiple trade-off between the accuracy of some particular hypothesis $h$, the complexity of $\mathcal{H}$, and the incapacity of hypotheses of $\mathcal{H}$ to detect differences between the source and the target domain.

Second, Mansour et al. (2009a) have extended the $\mathcal{H}\Delta\mathcal{H}$-distance to the discrepancy divergence for regression and any symmetric loss $L$ fulfilling the triangle inequality. Given $L : [-1, +1]^2 \to \mathbb{R}_+$ such a loss, the discrepancy $\mathrm{disc}_L(D_S, D_T)$ between $D_S$ and $D_T$ is

$$\mathrm{disc}_L(D_S, D_T) = \sup_{(h, h') \in \mathcal{H}^2} \Big| \mathbb{E}_{x^t \sim D_T} L\big(h(x^t), h'(x^t)\big) - \mathbb{E}_{x^s \sim D_S} L\big(h(x^s), h'(x^s)\big) \Big|.$$

Footnote 4: In a symmetric hypothesis space $\mathcal{H}$, for every $h \in \mathcal{H}$, its inverse $-h$ is also in $\mathcal{H}$.

Note that with the 0-1 loss in binary classification, we have

$$\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) = \mathrm{disc}_{L_{0\text{-}1}}(D_S, D_T).$$

Even if these two divergences may coincide, the following domain adaptation bound of Mansour et al. (2009a) differs from Theorem 1.

Theorem 2 (Mansour et al., 2009a). Let $\mathcal{H}$ be a symmetric hypothesis class. We have

$$\forall h \in \mathcal{H}, \quad R_{P_T}(h) - R_{P_T}(h_T^*) \leq R_{D_S}(h_S^*, h) + \mathrm{disc}_{L_{0\text{-}1}}(D_S, D_T) + \nu_{(h_S^*, h_T^*)}, \qquad (2)$$

where $\nu_{(h_S^*, h_T^*)} = R_{D_S}(h_S^*, h_T^*)$ is the disagreement between the ideal hypotheses on the target and source domains, defined respectively as

$$h_T^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{P_T}(h), \quad \text{and} \quad h_S^* = \operatorname*{argmin}_{h \in \mathcal{H}} R_{P_S}(h).$$

In this context, Equation (2) can be tighter (footnote 5), since it bounds the difference between the target error of a classifier and the one of the optimal $h_T^*$. This bound expresses a trade-off between the disagreement between $h$ and the best source hypothesis $h_S^*$, the complexity of $\mathcal{H}$ (with the Rademacher complexity), and again the incapacity of hypotheses to detect differences between the domains.

To conclude, the domain adaptation bounds (1) and (2) suggest that if the divergence between the domains is low, a low-error classifier over the source domain might perform well on the target one. These divergences compute the worst case of the disagreement between a pair of hypotheses. We propose in Section 4 an average-case approach by making use of the essence of the PAC-Bayesian theory, which is known to offer tight generalization bounds (McAllester, 1999; Germain et al., 2009a; Parrado-Hernández et al., 2012).

3 PAC-Bayesian Theory in Supervised Learning

Let us now review the classical supervised binary classification framework called the PAC-Bayesian theory, first introduced by McAllester (1999). This theory succeeds in providing tight generalization guarantees on majority vote classifiers, without relying on any validation set. Throughout this section, we adopt an algorithm design perspective: we interpret the various forms of the PAC-Bayesian theorem as a guide to derive new machine learning algorithms. Indeed, the PAC-Bayesian analysis of domain adaptation provided in the forthcoming sections is oriented by the
motivation of creating new adaptive algorithms.

3.1 Notations and Setting

Traditionally, the PAC-Bayesian theory considers weighted majority votes over a set $\mathcal{H}$ of binary hypotheses. Given a prior distribution $\pi$ over $\mathcal{H}$ and a training set $S$, the learner aims at finding the posterior distribution $\rho$ over $\mathcal{H}$ leading to a $\rho$-weighted majority vote $B_\rho$ (also called the Bayes classifier) with good generalization guarantees and defined by

$$B_\rho(x) = \operatorname{sign}\Big[ \mathbb{E}_{h \sim \rho} h(x) \Big].$$

Minimizing $R_{P_S}(B_\rho)$, the risk of $B_\rho$, is known to be NP-hard. In the PAC-Bayesian approach, it is replaced by the risk of the stochastic Gibbs classifier $G_\rho$ associated with $\rho$. In order to predict the label of an example $x$, the Gibbs classifier first draws a hypothesis $h$ from $\mathcal{H}$ according to $\rho$, then returns $h(x)$ as label. Note that the error of the Gibbs classifier on a domain $P_S$ corresponds to the expectation of the errors over $\rho$:

$$R_{P_S}(G_\rho) = \mathbb{E}_{h \sim \rho} R_{P_S}(h). \qquad (3)$$

Footnote 5: Equation (1) can lead to an error term 3 times higher than Equation (2) in some cases (Mansour et al., 2009a).
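To make the two classifiers concrete, here is a small sketch for a finite set of voters, assuming NumPy, a posterior given as a probability vector `rho`, and a toy vote matrix. The data and the tie-breaking rule are assumptions made for illustration, not part of the paper:

```python
import numpy as np

def gibbs_risk(votes, y, rho):
    """Empirical Gibbs risk: rho-average of the individual voter errors."""
    voter_errors = (votes != y).mean(axis=1)  # R_S(h) for each voter h
    return float(rho @ voter_errors)

def majority_vote(votes, rho):
    """B_rho(x) = sign(E_{h~rho} h(x)); ties are sent to +1 (an assumption)."""
    return np.where(rho @ votes >= 0, 1, -1)

def majority_vote_risk(votes, y, rho):
    """Empirical risk of the rho-weighted majority vote."""
    return float((majority_vote(votes, rho) != y).mean())

# votes[k, i] is the label in {-1, +1} given by voter k to example i (toy data).
votes = np.array([[ 1,  1, -1, -1],
                  [ 1, -1, -1,  1],
                  [-1,  1, -1, -1]])
y = np.array([1, 1, -1, -1])
rho = np.array([0.5, 0.25, 0.25])

r_gibbs = gibbs_risk(votes, y, rho)
r_vote = majority_vote_risk(votes, y, rho)
```

On this toy sample the majority vote is perfect even though two of the three voters make mistakes, which illustrates why the Gibbs risk is only a surrogate for the risk of $B_\rho$.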

In this setting, if $B_\rho$ misclassifies $x$, then at least half of the classifiers (under $\rho$) err on $x$. Hence, we have $R_{P_S}(B_\rho) \leq 2 R_{P_S}(G_\rho)$. Another result on the relation between $R_{P_S}(B_\rho)$ and $R_{P_S}(G_\rho)$ is the C-bound of Lacasse et al. (2006), expressed as

$$R_{P_S}(B_\rho) \leq 1 - \frac{\big(1 - 2 R_{P_S}(G_\rho)\big)^2}{1 - 2 R_{D_S}(G_\rho, G_\rho)}, \qquad (4)$$

where $R_{D_S}(G_\rho, G_\rho)$ corresponds to the disagreement of the classifiers over $\rho$:

$$R_{D_S}(G_\rho, G_\rho) = \mathbb{E}_{(h, h') \sim \rho^2} R_{D_S}(h, h'). \qquad (5)$$

Equation (4) suggests that for a fixed numerator, i.e., a fixed risk of the Gibbs classifier, the best majority vote is the one with the lowest denominator, i.e., with the greatest disagreement between its voters (see Laviolette et al., 2011, for further analysis). Finally, we introduce the notion of expected joint error of a pair of classifiers $(h, h')$ drawn according to the distribution $\rho$, defined as

$$e_{P_S}(G_\rho, G_\rho) = \mathbb{E}_{(h, h') \sim \rho^2} \, \mathbb{E}_{(x, y) \sim P_S} L_{0\text{-}1}\big(h(x), y\big) \times L_{0\text{-}1}\big(h'(x), y\big). \qquad (6)$$

The PAC-Bayesian theory allows one to bound the expected error $R_{P_S}(G_\rho)$ in terms of two major quantities: the empirical error $R_S(G_\rho) = \mathbb{E}_{h \sim \rho} R_S(h)$ estimated on a sample $S$ drawn i.i.d. from $P_S$, and the Kullback-Leibler divergence $\mathrm{KL}(\rho \| \pi) = \mathbb{E}_{h \sim \rho} \ln \frac{\rho(h)}{\pi(h)}$ (let us recall that $\pi$ and $\rho$ are respectively the prior and the posterior distributions). The three main PAC-Bayes theorems, which we present in the next section, have been proposed by McAllester (1999), Seeger (2002), Langford (2005), and Catoni (2007).

3.2 Three Versions of the PAC-Bayesian Theorem

First, let us consider the KL-divergence $\mathrm{kl}(a \| b)$ between two Bernoulli distributions with success probability $a$ and $b$, defined by

$$\mathrm{kl}(a \| b) = a \ln \frac{a}{b} + (1 - a) \ln \frac{1 - a}{1 - b}.$$

Seeger (2002) and Langford (2005) have derived the following PAC-Bayesian theorem, in which the trade-off between the complexity and the risk is handled by $\mathrm{kl}(\cdot \| \cdot)$.

Theorem 3 (Seeger, 2002; Langford, 2005). For any domain $P_S$ over $X \times Y$, any set of hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, and any $\delta \in (0, 1]$, with a probability at least $1 - \delta$ over the choice of $S \sim (P_S)^m$, for every $\rho$ over $\mathcal{H}$, we have

$$\mathrm{kl}\big( R_S(G_\rho) \,\big\|\, R_{P_S}(G_\rho) \big) \leq \frac{1}{m} \left[ \mathrm{KL}(\rho \| \pi) + \ln \frac{2\sqrt{m}}{\delta} \right].$$

This version of the PAC-Bayes theorem offers a tight bound, especially for low empirical risk. However, due to the $\mathrm{kl}\big( R_S(G_\rho) \| R_{P_S}(G_\rho) \big)$ term, this bound remains difficult to interpret: the link between the empirical risk $R_S(G_\rho)$ and the true risk $R_{P_S}(G_\rho)$ is not given in closed form. Thus, from an algorithmic point of view, finding the distribution $\rho$ that minimizes the bound on $R_{P_S}(G_\rho)$ given by Theorem 3 might be a difficult task. The following version of the PAC-Bayes theorem, which was the first proposed (McAllester, 1999), appears easier to interpret, since it links the terms $R_S(G_\rho)$ and $R_{P_S}(G_\rho)$ by a linear relation. Note that Theorem 4 can be straightforwardly obtained from Theorem 3 using Pinsker's inequality:

$$2 (q - p)^2 \leq \mathrm{kl}(q \| p). \qquad (7)$$

Theorem 4 (McAllester, 1999). For any domain $P_S$ over $X \times Y$, any set of hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, and any $\delta \in (0, 1]$, with a probability at least $1 - \delta$ over the choice of $S \sim (P_S)^m$, for every $\rho$ over $\mathcal{H}$, we have

$$R_{P_S}(G_\rho) \leq R_S(G_\rho) + \sqrt{ \frac{1}{2m} \left[ \mathrm{KL}(\rho \| \pi) + \ln \frac{2\sqrt{m}}{\delta} \right] }.$$
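Although the bound of Theorem 3 is implicit, it can be inverted numerically. The sketch below, which uses only the Python standard library, is one possible way to do so by bisection and to compare the result with the explicit bound of Theorem 4. The function names and the numerical values are illustrative assumptions, not taken from the paper:

```python
import math

def kl_bernoulli(q, p, eps=1e-12):
    """kl(q || p) between Bernoulli(q) and Bernoulli(p), with p clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    value = 0.0
    if q > 0.0:
        value += q * math.log(q / p)
    if q < 1.0:
        value += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return value

def bound_rhs(kl_rho_pi, m, delta):
    """Right-hand side of Theorem 3: (KL(rho||pi) + ln(2 sqrt(m) / delta)) / m."""
    return (kl_rho_pi + math.log(2.0 * math.sqrt(m) / delta)) / m

def seeger_bound(emp_risk, kl_rho_pi, m, delta, n_iter=100):
    """Largest p >= emp_risk such that kl(emp_risk || p) <= rhs, found by bisection."""
    rhs = bound_rhs(kl_rho_pi, m, delta)
    low, high = emp_risk, 1.0
    for _ in range(n_iter):
        mid = 0.5 * (low + high)
        if kl_bernoulli(emp_risk, mid) <= rhs:
            low = mid   # mid is still admissible, move up
        else:
            high = mid  # mid violates the constraint, move down
    return low

def mcallester_bound(emp_risk, kl_rho_pi, m, delta):
    """Explicit bound of Theorem 4, obtained from Theorem 3 via Pinsker's inequality."""
    return emp_risk + math.sqrt(bound_rhs(kl_rho_pi, m, delta) / 2.0)

# Illustrative values (assumptions).
emp_risk, kl_rho_pi, m, delta = 0.1, 5.0, 1000, 0.05
b_seeger = seeger_bound(emp_risk, kl_rho_pi, m, delta)
b_mcallester = mcallester_bound(emp_risk, kl_rho_pi, m, delta)
```

The bisection relies on $\mathrm{kl}(q \| p)$ being increasing in $p$ for $p \geq q$. By Pinsker's inequality, the inverted value can never exceed the McAllester bound.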

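Theorems 3 and 4 suggest that, in order to minimize the expected risk, a learning algorithm should perform a trade-off between the empirical risk minimization $R_S(G_\rho)$ and the KL-divergence minimization $\mathrm{KL}(\rho \| \pi)$ (roughly speaking, the complexity term). The nature of this trade-off can be explicitly controlled in Theorem 5 below. This PAC-Bayesian result, first proposed by Catoni (2007), is defined with a hyperparameter (here named $c$). It appears to be a natural tool to design PAC-Bayesian algorithms. We present this result in the simplified form suggested by Germain et al. (2009b).

Theorem 5 (Catoni, 2007). For any domain $P_S$ over $X \times Y$, any set of hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0, 1]$, and any real number $c > 0$, with a probability at least $1 - \delta$ over the choice of $S \sim (P_S)^m$, for every $\rho$ on $\mathcal{H}$, we have

$$R_{P_S}(G_\rho) \leq \frac{c}{1 - e^{-c}} \left[ R_S(G_\rho) + \frac{\mathrm{KL}(\rho \| \pi) + \ln \frac{1}{\delta}}{m \times c} \right].$$

The bound given by Theorem 5 has two interesting characteristics. First, choosing $c = \frac{1}{\sqrt{m}}$, the bound becomes consistent: it converges to $1 \times [R_S(G_\rho) + 0]$ as $m$ grows. Second, as described in Section 3.3, its minimization is closely related to the minimization problem associated with the SVM when $\rho$ is an isotropic Gaussian over the space of linear classifiers (Germain et al., 2009a). Hence, the value $c$ allows us to control the trade-off between the empirical risk $R_S(G_\rho)$ and the complexity term $\mathrm{KL}(\rho \| \pi)$.

3.3 Supervised PAC-Bayesian Learning of Linear Classifiers

Let us consider $\mathcal{H}$ as a set of linear classifiers in a $d$-dimensional space. Each $h_{w'} \in \mathcal{H}$ is defined by a weight vector $w' \in \mathbb{R}^d$:

$$h_{w'}(x) = \operatorname{sgn}(w' \cdot x),$$

where $\cdot$ denotes the dot product. By restricting the prior and the posterior distributions over $\mathcal{H}$ to be Gaussian distributions, Langford and Shawe-Taylor (2002), Ambroladze et al. (2006), and Parrado-Hernández et al. (2012) have specialized the PAC-Bayesian theory in order to bound the expected risk of any linear classifier $h_w \in \mathcal{H}$. More precisely, given a prior $\pi_0$ and a posterior $\rho_w$ defined as spherical Gaussians with identity covariance matrix, respectively centered on vectors $0$ and $w$, for any $h_{w'} \in \mathcal{H}$, we have

$$\pi_0(h_{w'}) = \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left( -\tfrac{1}{2} \|w'\|^2 \right), \quad \text{and} \quad \rho_w(h_{w'}) = \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left( -\tfrac{1}{2} \|w' - w\|^2 \right).$$

An interesting property of these Gaussian distributions is that the prediction of the $\rho_w$-weighted majority vote $B_{\rho_w}$ coincides with the one of the linear classifier $h_w$. Indeed, we have

$$\forall x \in X, \ \forall h_w \in \mathcal{H}, \quad h_w(x) = B_{\rho_w}(x) = \operatorname{sign}\Big[ \mathbb{E}_{h_{w'} \sim \rho_w} h_{w'}(x) \Big].$$

Moreover, the expected risk of the Gibbs classifier $G_{\rho_w}$ on a domain $P_S$ is then given by

$$\begin{aligned}
R_{P_S}(G_{\rho_w}) &= \mathbb{E}_{(x, y) \sim P_S} \, \mathbb{E}_{h_{w'} \sim \rho_w} L_{0\text{-}1}\big(h_{w'}(x), y\big) \\
&= \mathbb{E}_{(x, y) \sim P_S} \, \mathbb{E}_{h_{w'} \sim \rho_w} I\big( h_{w'}(x) \neq y \big) \\
&= \mathbb{E}_{(x, y) \sim P_S} \, \mathbb{E}_{h_{w'} \sim \rho_w} I\big( y \, w' \cdot x \leq 0 \big) \\
&= \mathbb{E}_{(x, y) \sim P_S} \int_{\mathbb{R}^d} \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left( -\tfrac{1}{2} \|w' - w\|^2 \right) I\big( y \, w' \cdot x \leq 0 \big) \, dw' \\
&= \mathbb{E}_{(x, y) \sim P_S} \Pr_{t \sim \mathcal{N}(0, 1)} \left( t \leq - y \frac{w \cdot x}{\|x\|} \right) \\
&= \mathbb{E}_{(x, y) \sim P_S} \Phi\left( y \frac{w \cdot x}{\|x\|} \right),
\end{aligned}$$

where we defined

$$\Phi(a) = \frac{1}{2} \left[ 1 - \operatorname{Erf}\left( \frac{a}{\sqrt{2}} \right) \right], \qquad (8)$$

with $\operatorname{Erf}(\cdot)$ the Gauss error function, defined as $\operatorname{Erf}(b) = \frac{2}{\sqrt{\pi}} \int_0^b \exp(-t^2) \, dt$. Finally, the KL-divergence between $\rho_w$ and $\pi_0$ becomes simply

$$\mathrm{KL}(\rho_w \| \pi_0) = \tfrac{1}{2} \|w\|^2.$$

3.3.1 Objective Function and Gradient

Based on the specialization of the PAC-Bayesian theory to linear classifiers, Germain et al. (2009a) suggested minimizing a PAC-Bayesian bound on $R_{P_S}(G_{\rho_w})$. For the sake of completeness, we provide here more mathematical details than in the original conference paper (Germain et al., 2009a). We will build on this PAC-Bayesian learning algorithm for supervised learning in our domain adaptation work.

Given a sample $S = \{(x_i, y_i)\}_{i=1}^{m}$ and a hyperparameter $C > 0$, the learning algorithm performs a gradient descent in order to find an optimal weight vector $w$ that minimizes

$$F(w) = C \, m \, R_S(G_{\rho_w}) + \mathrm{KL}(\rho_w \| \pi_0) = C \sum_{i=1}^{m} \Phi\left( y_i \frac{w \cdot x_i}{\|x_i\|} \right) + \tfrac{1}{2} \|w\|^2. \qquad (9)$$

It turns out that the optimal vector $w$ corresponds to the distribution $\rho_w$ that minimizes the value of the bound on $R_{P_S}(G_{\rho_w})$ given by Theorem 5, with the parameter $c$ of the theorem being the hyperparameter $C$ of the learning algorithm. It is important to point out that PAC-Bayesian theorems bound simultaneously $R_{P_S}(G_{\rho_w})$ for every $\rho_w$ on $\mathcal{H}$. Therefore, one can freely explore the domain of the objective function $F$ to choose a posterior distribution $\rho_w$ that gives, thanks to Theorem 5, a bound valid with probability $1 - \delta$.

The minimization of Equation (9) by gradient descent corresponds to the learning algorithm called PBGD3 of Germain et al. (2009a). The gradient of $F(w)$ is given by the vector $\nabla F(w)$:

$$\nabla F(w) = C \sum_{i=1}^{m} \Phi'\left( y_i \frac{w \cdot x_i}{\|x_i\|} \right) \frac{y_i \, x_i}{\|x_i\|} + w,$$

where $\Phi'(a) = -\frac{1}{\sqrt{2\pi}} \exp\left( -\tfrac{1}{2} a^2 \right)$ is the derivative of $\Phi$ at point $a$.

Similarly to the SVM, the learning algorithm PBGD3 realizes a trade-off between the empirical risk (expressed by the loss $\Phi$) and the complexity of the learned linear classifier (expressed by the regularizer $\|w\|^2$). This similarity increases when we use a kernel function, as described next.

3.3.2 Using a kernel function

The kernel trick allows one to substitute inner products by a kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ in Equation (9). If $k$ is a Mercer kernel, it implicitly represents a function $\phi : X \to \mathbb{R}^{d'}$ that maps an example of $X$ into an arbitrary $d'$-dimensional space (footnote 6), such that

$$\forall (x, x') \in X^2, \quad k(x, x') = \phi(x) \cdot \phi(x').$$

Then, a dual weight vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m) \in \mathbb{R}^m$ encodes the linear classifier $w \in \mathbb{R}^{d'}$ as a linear combination of examples of $S$:

$$w = \sum_{i=1}^{m} \alpha_i \, \phi(x_i), \quad \text{and thus} \quad h_w(x) = \operatorname{sgn}\left[ \sum_{i=1}^{m} \alpha_i \, k(x_i, x) \right].$$

Footnote 6: We consider here that the induced space is finite-dimensional.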
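Returning to the linear (non-kernelized) objective of Equation (9), the following sketch shows how $F(w)$ and $\nabla F(w)$ could be evaluated and minimized with a fixed-step gradient descent. It assumes NumPy, and the step size, number of iterations, and starting point are illustrative choices rather than the settings of the original PBGD3 implementation:

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise Gauss error function

def phi(a):
    """Phi(a) = (1 - Erf(a / sqrt(2))) / 2, the Gaussian tail probability of Equation (8)."""
    return 0.5 * (1.0 - _erf(np.asarray(a, dtype=float) / math.sqrt(2.0)))

def phi_prime(a):
    """Derivative of Phi: -exp(-a^2 / 2) / sqrt(2 pi)."""
    a = np.asarray(a, dtype=float)
    return -np.exp(-0.5 * a ** 2) / math.sqrt(2.0 * math.pi)

def objective(w, X, y, C):
    """F(w) = C * sum_i Phi(y_i w.x_i / ||x_i||) + ||w||^2 / 2."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)
    return float(C * phi(margins).sum() + 0.5 * (w @ w))

def gradient(w, X, y, C):
    """Gradient of F: C * sum_i Phi'(margin_i) y_i x_i / ||x_i|| + w."""
    norms = np.linalg.norm(X, axis=1)
    margins = y * (X @ w) / norms
    coefficients = C * phi_prime(margins) * y / norms
    return coefficients @ X + w

def pbgd3_sketch(X, y, C=1.0, lr=0.1, n_steps=200, w0=None):
    """Plain fixed-step gradient descent on F (hypothetical settings)."""
    w = np.zeros(X.shape[1]) if w0 is None else np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = w - lr * gradient(w, X, y, C)
    return w
```

A typical call would be `w = pbgd3_sketch(X, y, C=1.0)` with `X` of shape `(m, d)` and `y` in $\{-1, +1\}^m$. Since $F$ is non-convex, the returned vector is only a stationary point reached from the chosen starting point.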

By the representer theorem (Schölkopf et al., 2001), the vector $w$ minimizing Equation (9) can be recovered by finding the vector $\alpha$ that minimizes

$$F(\alpha) = C \sum_{i=1}^{m} \Phi\left( y_i \frac{\sum_{j=1}^{m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) + \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j K_{i,j}, \qquad (10)$$

where $K$ is the kernel matrix of size $m \times m$. That is, $K_{i,j} = k(x_i, x_j)$. The gradient of $F(\alpha)$ is simply given by the vector $\nabla F(\alpha) = (\alpha'_1, \alpha'_2, \ldots, \alpha'_m)$, with

$$\alpha'_{\#} = C \sum_{i=1}^{m} \Phi'\left( y_i \frac{\sum_{j=1}^{m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) \frac{y_i \, K_{i,\#}}{\sqrt{K_{i,i}}} + \sum_{i=1}^{m} \alpha_i K_{i,\#}, \quad \text{for } \# \in \{1, 2, \ldots, m\}.$$

3.3.3 Improving the Algorithm Using a Convex Objective

An annoying drawback of PBGD3 is that the objective function is non-convex, and the gradient descent implementation needs many random restarts. In fact, we made extensive empirical experiments after the ones described by Germain et al. (2009a) and saw that PBGD3 achieves an equivalent accuracy, at a fraction of the running time, by replacing the loss function $\Phi$ of Equations (9) and (10) by its convex relaxation, which is

$$\Phi_{\mathrm{cvx}}(a) = \max\left\{ \Phi(a), \ \frac{1}{2} - \frac{a}{\sqrt{2\pi}} \right\} = \begin{cases} \frac{1}{2} - \frac{a}{\sqrt{2\pi}} & \text{if } a \leq 0, \\ \Phi(a) & \text{otherwise.} \end{cases}$$

The derivative of $\Phi_{\mathrm{cvx}}$ at point $a$ is then $\Phi'_{\mathrm{cvx}}(a) = -\frac{1}{\sqrt{2\pi}}$ if $a < 0$, and $\Phi'(a)$ otherwise. Note that Figure 1 in Section 5 illustrates the functions $\Phi$ and $\Phi_{\mathrm{cvx}}$.

In the following, we present our contributions on PAC-Bayesian domain adaptation.

4

The originality of our contribution is to theoretically design a domain adaptation framework for the PAC-Bayesian approach. In Section 4.1, we propose a domain comparison pseudometric suitable in this context. We then derive PAC-Bayesian domain adaptation bounds in Section 4.2, which improve the result proposed in Germain et al. (2013). Finally, in Section 5, we show that using the previous approach in a domain adaptation way is a relevant strategy: we specialize our result to linear classifiers.

4.1 A Domain Divergence for PAC-Bayesian Analysis

In the following, while the domain adaptation bounds presented in Section 2 focus on a single classifier, we first define a $\rho$-average disagreement measure to compare the marginals. Then, this leads us to derive our domain adaptation bound suitable for the PAC-Bayesian approach. As discussed in Section 2, the derivation of generalization ability in domain adaptation
critically needs a divergence measure between the source and target marginals.

4.1.1 Designing the Divergence

We define a domain disagreement pseudometric (footnote 7) to measure the structural difference between domain marginals in terms of the posterior distribution $\rho$ over $\mathcal{H}$. Since we are interested in learning a $\rho$-weighted majority vote $B_\rho$ leading to good generalization guarantees, we propose to follow the idea behind the C-bound presented in Equation (4): given $P_S$, $P_T$, and $\rho$, if $R_{P_S}(G_\rho)$ and $R_{P_T}(G_\rho)$ are similar, then $R_{P_S}(B_\rho)$ and $R_{P_T}(B_\rho)$ are similar when $\mathbb{E}_{(h, h') \sim \rho^2} R_{D_S}(h, h')$ and $\mathbb{E}_{(h, h') \sim \rho^2} R_{D_T}(h, h')$ are also similar. Thus, the domains $P_S$ and $P_T$ are close according to $\rho$ if the divergence between $\mathbb{E}_{(h, h') \sim \rho^2} R_{D_S}(h, h')$ and $\mathbb{E}_{(h, h') \sim \rho^2} R_{D_T}(h, h')$ tends to be low. Our pseudometric is defined as follows.

Footnote 7: A pseudometric $d$ is a metric for which the property $d(x, y) = 0 \iff x = y$ is relaxed to $d(x, y) = 0 \Longleftarrow x = y$.

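Definition 1. Let $\mathcal{H}$ be a hypothesis class. For any marginal distributions $D_S$ and $D_T$ over $X$, and any distribution $\rho$ on $\mathcal{H}$, the domain disagreement $\mathrm{dis}_\rho(D_S, D_T)$ between $D_S$ and $D_T$ is defined by

$$\mathrm{dis}_\rho(D_S, D_T) = \left| \mathbb{E}_{(h, h') \sim \rho^2} \big[ R_{D_T}(h, h') - R_{D_S}(h, h') \big] \right| = \big| R_{D_T}(G_\rho, G_\rho) - R_{D_S}(G_\rho, G_\rho) \big|.$$

Note that $\mathrm{dis}_\rho(\cdot, \cdot)$ is symmetric and fulfills the triangle inequality.

4.1.2 Comparison of the $\mathcal{H}\Delta\mathcal{H}$-divergence and our domain disagreement

While the $\mathcal{H}\Delta\mathcal{H}$-divergence of Theorem 1 is difficult to jointly optimize with the empirical source error, our empirical disagreement measure is easier to manipulate: we simply need to compute the $\rho$-average of the classifiers' disagreement instead of finding the pair of classifiers that maximizes the disagreement. Indeed, $\mathrm{dis}_\rho(\cdot, \cdot)$ depends on the majority vote, which suggests that we can directly minimize it via the empirical $\mathrm{dis}_\rho(S, T)$ and the KL-divergence. This can be done without instance reweighting, space representation changing, or family of classifiers modification. On the contrary, $d_{\mathcal{H}\Delta\mathcal{H}}(\cdot, \cdot)$ is a supremum over all $h \in \mathcal{H}$ and hence does not depend on the $h$ on which the risk is considered. Moreover, $\mathrm{dis}_\rho(\cdot, \cdot)$ (the $\rho$-average) is lower than $\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\cdot, \cdot)$ (the worst case). Indeed, for every $\mathcal{H}$ and $\rho$ over $\mathcal{H}$, we have

$$\tfrac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) = \sup_{(h, h') \in \mathcal{H}^2} \big| R_{D_T}(h, h') - R_{D_S}(h, h') \big| \ \geq \ \mathbb{E}_{(h, h') \sim \rho^2} \big| R_{D_T}(h, h') - R_{D_S}(h, h') \big| \ \geq \ \mathrm{dis}_\rho(D_S, D_T).$$

4.1.3 PAC-Bayesian bounds for our domain disagreement

The following theorems show that $\mathrm{dis}_\rho(D_S, D_T)$ can be bounded in terms of the classical PAC-Bayesian quantities: the empirical disagreement $\mathrm{dis}_\rho(S, T)$ estimated on the source and target samples, and the KL-divergence between the prior and posterior distributions on $\mathcal{H}$. For the sake of simplicity, let us first suppose that $m = m'$, i.e., the sizes of $S$ and $T$ are equal. Here is a Seeger-type PAC-Bayesian bound for our domain disagreement $\mathrm{dis}_\rho$.

Theorem 6. For any distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, and any $\delta \in (0, 1]$, with a probability at least $1 - \delta$ over the choice of $S \times T \sim (D_S \times D_T)^m$, for every $\rho$ on $\mathcal{H}$, we have

$$\mathrm{kl}\left( \frac{\mathrm{dis}_\rho(S, T) + 1}{2} \ \Big\| \ \frac{\mathrm{dis}_\rho(D_S, D_T) + 1}{2} \right) \leq \frac{1}{m} \left[ 2 \, \mathrm{KL}(\rho \| \pi) + \ln \frac{2\sqrt{m}}{\delta} \right].$$

Proof. Deferred to Appendix B.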
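The empirical disagreement $\mathrm{dis}_\rho(S, T)$ that appears in these bounds is straightforward to evaluate for a finite set of voters. The sketch below assumes NumPy, votes in $\{-1, +1\}$, and toy vote matrices (all illustrative assumptions). It uses the fact that, for two voters drawn independently from $\rho$, the probability of disagreeing on a point $x$ is $\frac{1}{2}\big(1 - (\mathbb{E}_{h \sim \rho} h(x))^2\big)$, and it compares the $\rho$-average with the worst case over pairs of voters:

```python
import numpy as np

def expected_disagreement(votes, rho):
    """R(G_rho, G_rho) on a sample: average over points of (1 - margin^2) / 2."""
    margin = rho @ votes  # E_{h~rho} h(x) for each point x
    return float(np.mean(0.5 * (1.0 - margin ** 2)))

def domain_disagreement(votes_src, votes_tgt, rho):
    """Empirical dis_rho(S, T) = |R_T(G_rho, G_rho) - R_S(G_rho, G_rho)|."""
    return abs(expected_disagreement(votes_tgt, rho)
               - expected_disagreement(votes_src, rho))

def pairwise_disagreements(votes):
    """Matrix of R(h, h') for every pair of voters on one sample."""
    return (votes[:, None, :] != votes[None, :, :]).mean(axis=2)

def worst_case_disagreement(votes_src, votes_tgt):
    """max over pairs (h, h') of |R_T(h, h') - R_S(h, h')| for a finite voter set."""
    diff = pairwise_disagreements(votes_tgt) - pairwise_disagreements(votes_src)
    return float(np.abs(diff).max())

# votes[k, i]: label in {-1, +1} given by voter k to point i (toy data).
votes_src = np.array([[1,  1, -1, -1],
                      [1, -1,  1, -1]])
votes_tgt = np.array([[ 1,  1,  1, -1],
                      [-1, -1, -1,  1]])
rho = np.array([0.5, 0.5])

dis_avg = domain_disagreement(votes_src, votes_tgt, rho)
dis_sup = worst_case_disagreement(votes_src, votes_tgt)
```

Unlike the supremum, the averaged quantity is a smooth function of $\rho$, which is what makes it convenient to minimize jointly with the source risk.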
Here is a McAllester-type PAC-Bayesian bound for our domain disagreement $\mathrm{dis}_\rho$, obtained straightforwardly from Theorem 6.

Corollary 1. For any distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, and any $\delta \in (0, 1]$, with a probability at least $1 - \delta$ over the choice of $S \times T \sim (D_S \times D_T)^m$, for every $\rho$ on $\mathcal{H}$, we have

$$\mathrm{dis}_\rho(D_S, D_T) \leq \mathrm{dis}_\rho(S, T) + \sqrt{ \frac{2}{m} \left[ 2 \, \mathrm{KL}(\rho \| \pi) + \ln \frac{2\sqrt{m}}{\delta} \right] }.$$

Proof. The result is obtained by using Pinsker's inequality (Equation (7)) on Theorem 6.

Here is a Catoni-type PAC-Bayesian bound, which helps us to derive a domain adaptation algorithm in the following.

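Theorem 7. For any distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, any $\delta \in (0, 1]$, and any real number $\alpha > 0$, with a probability at least $1 - \delta$ over the choice of $S \times T \sim (D_S \times D_T)^m$, for every $\rho$ on $\mathcal{H}$, we have

$$\mathrm{dis}_\rho(D_S, D_T) \leq \frac{2\alpha}{1 - e^{-2\alpha}} \left[ \mathrm{dis}_\rho(S, T) + \frac{2 \, \mathrm{KL}(\rho \| \pi) + \ln \frac{2}{\delta}}{m \times \alpha} + 1 \right] - 1.$$

Proof. Deferred to Appendix C.

Similarly to the empirical risk bound of Catoni (2007) shown by Theorem 5, the above domain disagreement bound is consistent if one puts $\alpha = \frac{1}{2\sqrt{m}}$. Indeed, it converges to $1 \times [\mathrm{dis}_\rho(S, T) + 0 + 1] - 1$ as $m$ grows.

The last result of this section tackles the situation where $m \neq m'$, i.e., the sizes of $S$ and $T$ are different.

Theorem 8. For any marginal distributions $D_S$ and $D_T$ over $X$, any set of hypotheses $\mathcal{H}$, any prior distribution $\pi$ over $\mathcal{H}$, and any $\delta \in (0, 1]$, with a probability at least $1 - \delta$ over the choice of $S \sim (D_S)^m$ and $T \sim (D_T)^{m'}$, for every $\rho$ over $\mathcal{H}$, we have

$$\mathrm{dis}_\rho(D_S, D_T) \leq \mathrm{dis}_\rho(S, T) + \sqrt{ \frac{1}{2m} \left[ 2 \, \mathrm{KL}(\rho \| \pi) + \ln \frac{4\sqrt{m}}{\delta} \right] } + \sqrt{ \frac{1}{2m'} \left[ 2 \, \mathrm{KL}(\rho \| \pi) + \ln \frac{4\sqrt{m'}}{\delta} \right] }.$$

Proof. Deferred to Appendix D.

Note that Theorem 8 is very similar to the result of Corollary 1. In fact, in the particular case $m = m'$, Theorem 8 differs from Corollary 1 only by the term inside the logarithm, $\frac{4\sqrt{m}}{\delta}$ instead of $\frac{2\sqrt{m}}{\delta}$.

We now derive our main result in the following theorem: a domain adaptation bound relevant in a PAC-Bayesian setting.

4.2 A domain adaptation bound for the stochastic Gibbs classifier

Theorem 9 below relies on the domain disagreement of Definition 1, and also on the expected joint error of Equation (6).

Theorem 9. Let $\mathcal{H}$ be a hypothesis class. We have

$$\forall \rho \text{ on } \mathcal{H}, \quad R_{P_T}(G_\rho) \leq R_{P_S}(G_\rho) + \tfrac{1}{2} \mathrm{dis}_\rho(D_S, D_T) + \lambda_\rho,$$

where $\lambda_\rho$ is the deviation between the expected joint errors of $G_\rho$ on the target and source domains:

$$\begin{aligned}
\lambda_\rho &= \left| \mathbb{E}_{(h, h') \sim \rho^2} \left[ \mathbb{E}_{(x, y) \sim P_T} L_{0\text{-}1}\big(h(x), y\big) L_{0\text{-}1}\big(h'(x), y\big) - \mathbb{E}_{(x, y) \sim P_S} L_{0\text{-}1}\big(h(x), y\big) L_{0\text{-}1}\big(h'(x), y\big) \right] \right| \\
&= \big| e_{P_T}(G_\rho, G_\rho) - e_{P_S}(G_\rho, G_\rho) \big|.
\end{aligned}$$

Proof. First, notice that for any distribution $P$ on $X \times Y$ and corresponding marginal distribution $D$ on $X$, we have

$$R_P(G_\rho) = \tfrac{1}{2} R_D(G_\rho, G_\rho) + e_P(G_\rho, G_\rho),$$

as

$$\begin{aligned}
R_P(G_\rho) &= \mathbb{E}_{(h, h') \sim \rho^2} \, \mathbb{E}_{(x, y) \sim P} \, \tfrac{1}{2} \Big[ L_{0\text{-}1}\big(h(x), y\big) + L_{0\text{-}1}\big(h'(x), y\big) \Big] \\
&= \mathbb{E}_{(h, h') \sim \rho^2} \, \mathbb{E}_{(x, y) \sim P} \, \tfrac{1}{2} \Big[ L_{0\text{-}1}\big(h(x), h'(x)\big) + 2 \, L_{0\text{-}1}\big(h(x), y\big) L_{0\text{-}1}\big(h'(x), y\big) \Big] \\
&= \tfrac{1}{2} R_D(G_\rho, G_\rho) + e_P(G_\rho, G_\rho).
\end{aligned}$$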

Therefore,

$$\begin{aligned}
R_{P_T}(G_\rho) - R_{P_S}(G_\rho) &= \tfrac{1}{2}\left( R_{D_T}(G_\rho,G_\rho) - R_{D_S}(G_\rho,G_\rho) \right) + \left( e_{P_T}(G_\rho,G_\rho) - e_{P_S}(G_\rho,G_\rho) \right) \\
&\le \tfrac{1}{2}\left| R_{D_T}(G_\rho,G_\rho) - R_{D_S}(G_\rho,G_\rho) \right| + \left| e_{P_T}(G_\rho,G_\rho) - e_{P_S}(G_\rho,G_\rho) \right| \\
&= \tfrac{1}{2}\,\mathrm{dis}_\rho(D_S, D_T) + \lambda_\rho.
\end{aligned}$$

Our bound is, in general, incomparable with the ones of Theorems 1 and 2. It can be seen as a trade-off between different quantities. The terms $R_{P_S}(G_\rho)$ and $\mathrm{dis}_\rho(D_S, D_T)$ are similar to the first two terms of the domain adaptation bound of Ben-David et al. (2010a): $R_{P_S}(G_\rho)$ is the $\rho$-average risk over $H$ on the source domain, and $\mathrm{dis}_\rho(D_S, D_T)$ measures the $\rho$-average disagreement between the marginals but is specific to the current $\rho$. The other term $\lambda_\rho$ measures the deviation between the expected joint target and source errors of $G_\rho$. According to this theory, a good domain adaptation is possible if this deviation is low. However, since we suppose that we do not have any label in the target sample, we cannot control or estimate it. In practice, we suppose that $\lambda_\rho$ is low and we neglect it. In other words, we assume that the labeling information between the two domains is related and that considering only the marginal agreement and the source labels is sufficient to find a good majority vote. Another important point is that this bound does not degenerate when the source and target distributions are the same or close; see Section 8 for a discussion on this point. In the next section, we provide three PAC-Bayesian theorems that justify the empirical optimization of the bound of Theorem 9.

4 PAC-Bayesian theorems for domain adaptation

Finally, our Theorem 9 leads to a PAC-Bayesian bound based on both the empirical source error of the Gibbs classifier and the empirical domain disagreement pseudometric estimated on source and target samples. From the preceding Seeger-type results, one can then obtain the following PAC-Bayesian domain adaptation bound.

Theorem 10. For any domains $P_S$ and $P_T$ (respectively with marginals $D_S$ and $D_T$) over $X \times Y$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, and any $\delta \in (0,1]$, with a probability at least $1-\delta$ over the choice of $S \times T \sim (P_S \times D_T)^m$, for every $\rho$ on $H$, we have

$$R_{P_T}(G_\rho) \;\le\; \sup \mathcal{R}_\rho + \tfrac{1}{2} \sup \mathcal{D}_\rho + \lambda_\rho,$$

where $\lambda_\rho$ is defined as in Theorem 9, and

$$\mathcal{R}_\rho = \left\{ r : \mathrm{kl}\!\left( R_S(G_\rho) \,\|\, r \right) \le \frac{1}{m}\left[ \mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m}}{\delta} \right] \right\},$$

$$\mathcal{D}_\rho = \left\{ d : \mathrm{kl}\!\left( \frac{\mathrm{dis}_\rho(S,T)+1}{2} \,\Big\|\, \frac{d+1}{2} \right) \le \frac{1}{m}\left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m}}{\delta} \right] \right\}.$$

Proof. The result is obtained by inserting Theorems 3 and 6 (with $\delta := \delta/2$) in Theorem 9.

The following bound is based on Catoni's approach and corresponds to the one from which we derive, in Section 5, our algorithm for PAC-Bayesian domain adaptation.

Theorem 11. For any domains $P_S$ and $P_T$ (respectively with marginals $D_S$ and $D_T$) over $X \times Y$, any set of hypotheses $H$, any prior distribution $\pi$ over $H$, any $\delta \in (0,1]$, and any real numbers $\alpha > 0$ and $c > 0$, with a probability at least $1-\delta$ over the choice of $S \times T \sim (P_S \times D_T)^m$, for every posterior distribution $\rho$ on $H$, we have

$$R_{P_T}(G_\rho) \;\le\; c'\, R_S(G_\rho) + \alpha'\, \tfrac{1}{2}\,\mathrm{dis}_\rho(S,T) + \left( \frac{c'}{c} + \frac{\alpha'}{\alpha} \right) \frac{\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta}}{m} + \lambda_\rho + \tfrac{1}{2}(\alpha' - 1),$$

where $\lambda_\rho$ is defined as in Theorem 9, and where $c' = \dfrac{c}{1-e^{-c}}$ and $\alpha' = \dfrac{2\alpha}{1-e^{-2\alpha}}$.
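To get a feel for how the constants of this Catoni-type bound behave, the sketch below evaluates its right-hand side with the $\lambda_\rho$ term left out. It assumes the constants have the form $c' = c/(1-e^{-c})$ and $\alpha' = 2\alpha/(1-e^{-2\alpha})$; the function names and the example values are ours and purely illustrative.

```python
import math


def c_prime(c):
    """c' = c / (1 - exp(-c)), computed with expm1 for small c."""
    return c / -math.expm1(-c)


def alpha_prime(alpha):
    """alpha' = 2 alpha / (1 - exp(-2 alpha))."""
    return 2.0 * alpha / -math.expm1(-2.0 * alpha)


def catoni_da_bound(source_risk, disagreement, kl, m, delta, c, alpha):
    """Right-hand side of the bound, without the lambda_rho term.

    source_risk  : empirical Gibbs risk on the source sample
    disagreement : empirical domain disagreement between S and T
    kl           : KL(rho || pi)
    m            : common size of the source and target samples
    """
    cp, ap = c_prime(c), alpha_prime(alpha)
    complexity = (cp / c + ap / alpha) * (kl + math.log(3.0 / delta)) / m
    return cp * source_risk + ap * 0.5 * disagreement + complexity + 0.5 * (ap - 1.0)


# Illustrative values only (not taken from the experiments).
example_value = catoni_da_bound(0.1, 0.2, kl=5.0, m=2000, delta=0.05, c=1.0, alpha=0.5)
```

Both $c'$ and $\alpha'$ tend to 1 as $c$ and $\alpha$ tend to 0, but the complexity term then grows like $1/c + 1/\alpha$, which is the trade-off these two parameters control.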

Proof. In Theorem 9, we replace $R_{P_S}(G_\rho)$ and $\mathrm{dis}_\rho(D_S, D_T)$ by their upper bounds, obtained from Theorem 5 and Theorem 7, with $\delta$ chosen respectively as $\delta/3$ and $2\delta/3$. In the latter case, we use

$$2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{2}{2\delta/3} = 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta} < 2\left( \mathrm{KL}(\rho\|\pi) + \ln\frac{3}{\delta} \right).$$

We now present a result based on the McAllester bound, which allows us to easily deal with different sizes of samples.

Theorem 12. For any domains $P_S$ and $P_T$ (respectively with marginals $D_S$ and $D_T$) over $X \times Y$, any set $H$ of hypotheses, any prior distribution $\pi$ over $H$, and any $\delta \in (0,1]$, with a probability at least $1-\delta$ over the choice of a labeled source sample $S \sim (P_S)^m$ (whose unlabeled part follows $(D_S)^m$) and of $T \sim (D_T)^{m'}$, for every $\rho$ over $H$, we have

$$R_{P_T}(G_\rho) \;\le\; R_S(G_\rho) + \tfrac{1}{2}\,\mathrm{dis}_\rho(S,T) + \lambda_\rho + \sqrt{\frac{1}{2m}\left[ \mathrm{KL}(\rho\|\pi) + \ln\frac{4\sqrt{m}}{\delta} \right]} + \frac{1}{2}\sqrt{\frac{1}{2m}\left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{8\sqrt{m}}{\delta} \right]} + \frac{1}{2}\sqrt{\frac{1}{2m'}\left[ 2\,\mathrm{KL}(\rho\|\pi) + \ln\frac{8\sqrt{m'}}{\delta} \right]},$$

where $\lambda_\rho$ is defined as in Theorem 9.

Proof. We insert Theorems 4 and 8 (with $\delta := \delta/2$) in Theorem 9.

Under the assumption that the domains are somehow related in terms of labeling agreement on $P_S$ and $P_T$ for every distribution $\rho$ over $H$, i.e., a low $\mathrm{dis}_\rho(D_S, D_T)$ implies a negligible $\lambda_\rho$, a natural solution for a PAC-Bayesian domain adaptation algorithm without target labels is to minimize such a bound while disregarding $\lambda_\rho$. Notice that a major advantage of our domain adaptation bound is that we can jointly optimize the risk and the divergence with a theoretical justification.

5 PAC-Bayesian Domain Adaptation Learning of Linear Classifiers

In this section, we design a learning algorithm for domain adaptation inspired by the PAC-Bayesian learning algorithm of Germain et al. (2009a). That is, we adopt the specialization of the PAC-Bayesian theory to linear classifiers described in Section 3.3. Note that the code of our algorithm is available on-line.

5 Minimizing the PAC-Bayesian Domain Adaptation Bound

Let us consider a prior $\pi_0$ and a posterior $\rho_w$ that are spherical Gaussian distributions over a space of linear classifiers, exactly as defined in Section 3.3. Given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^m$ and a target sample $T = \{x_i^t\}_{i=1}^m$, we focus on the minimization of the bound given by Theorem 11. We work under the assumption that the term $\lambda_{\rho_w}$ of the bound is negligible. Thus, the posterior distribution $\rho_w$ that minimizes the bound on $R_{P_T}(G_{\rho_w})$ is the same that minimizes

$$C\, m\, R_S(G_{\rho_w}) + A\, m\, \mathrm{dis}_{\rho_w}(S,T) + \mathrm{KL}(\rho_w\|\pi_0). \qquad (3)$$

The values $A > 0$ and $C > 0$ are hyperparameters of the algorithm. Note that the constants $\alpha$ and $c$ of Theorem 11 can be recovered from any $A$ and $C$.

5 Domain Disagreement of Linear Classifiers

We know from Equation 9 how to compute the terms $R_S(G_{\rho_w})$ and $\mathrm{KL}(\rho_w\|\pi_0)$ of Equation 3. Let us now derive the value of $\mathrm{dis}_{\rho_w}(S,T)$, i.e., the empirical domain disagreement between $S$ and $T$ of a distribution $\rho_w$ over linear classifiers.

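First, for any marginal $D$, we obtain

$$\begin{aligned}
R_D(G_{\rho_w}, G_{\rho_w}) &= \mathop{\mathbf{E}}_{x\sim D}\, \mathop{\mathbf{E}}_{(h,h')\sim\rho_w^2} L_{0\text{-}1}(h(x), h'(x)) \\
&= \mathop{\mathbf{E}}_{x\sim D}\, \mathop{\mathbf{E}}_{(h,h')\sim\rho_w^2} I[h(x) \ne h'(x)] \\
&= \mathop{\mathbf{E}}_{x\sim D}\, \mathop{\mathbf{E}}_{(h,h')\sim\rho_w^2} \left( I[h(x)=1]\, I[h'(x)=-1] + I[h(x)=-1]\, I[h'(x)=1] \right) \\
&= 2 \mathop{\mathbf{E}}_{x\sim D}\, \mathop{\mathbf{E}}_{(h,h')\sim\rho_w^2} I[h(x)=1]\, I[h'(x)=-1] \\
&= 2 \mathop{\mathbf{E}}_{x\sim D}\, \mathop{\mathbf{E}}_{h\sim\rho_w} I[h(x)=1]\, \mathop{\mathbf{E}}_{h'\sim\rho_w} I[h'(x)=-1] \\
&= 2 \mathop{\mathbf{E}}_{x\sim D} \Phi\!\left( \frac{w\cdot x}{\|x\|} \right) \Phi\!\left( -\frac{w\cdot x}{\|x\|} \right) \\
&= \mathop{\mathbf{E}}_{x\sim D} \Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x}{\|x\|} \right),
\end{aligned}$$

where

$$\Phi_{\mathrm{dis}}(a) = 2\,\Phi(a)\,\Phi(-a). \qquad (4)$$

Thus,

$$\mathrm{dis}_{\rho_w}(S,T) = \left| R_S(G_{\rho_w}, G_{\rho_w}) - R_T(G_{\rho_w}, G_{\rho_w}) \right| = \left| \frac{1}{m}\sum_{i=1}^m \Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^s}{\|x_i^s\|} \right) - \frac{1}{m}\sum_{i=1}^m \Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^t}{\|x_i^t\|} \right) \right|.$$

5 Objective Function and Gradient

From the results of Sections 3.3 and 5, we obtain that Equation 3 equals

$$C \sum_{i=1}^m \Phi\!\left( y_i^s \frac{w\cdot x_i^s}{\|x_i^s\|} \right) + A \left| \sum_{i=1}^m \left[ \Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^s}{\|x_i^s\|} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^t}{\|x_i^t\|} \right) \right] \right| + \frac{\|w\|^2}{2},$$

which is highly non-convex. To make the optimization problem more tractable, we replace the loss function $\Phi$ by its convex relaxation $\Phi_{\mathrm{cvx}}$ (as in Section 3.3.3) and minimize the resulting cost function by gradient descent. Even if this optimization task is still not convex ($\Phi_{\mathrm{dis}}$ is quasiconcave), our empirical study shows no need to perform many restarts to find a suitable solution (we observe empirically that a good strategy is to first find the vector $w$ minimizing the convex problem of PBGD3 described in Section 3.3.3, and then use this $w$ as a starting point for the gradient descent of PBDA). We name this domain adaptation algorithm PBDA.

To sum up, given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^m$, a target sample $T = \{x_i^t\}_{i=1}^m$, and hyperparameters $A$ and $C$, the algorithm PBDA performs gradient descent to minimize the following objective function:

$$G(w) = C \sum_{i=1}^m \Phi_{\mathrm{cvx}}\!\left( y_i^s \frac{w\cdot x_i^s}{\|x_i^s\|} \right) + A \left| \sum_{i=1}^m \left[ \Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^s}{\|x_i^s\|} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^t}{\|x_i^t\|} \right) \right] \right| + \frac{\|w\|^2}{2}, \qquad (5)$$

where

$$\Phi(a) = \tfrac{1}{2}\left[ 1 - \mathrm{Erf}\!\left( \frac{a}{\sqrt{2}} \right) \right], \qquad \Phi_{\mathrm{cvx}}(a) = \max\left\{ \Phi(a),\ \tfrac{1}{2} - \frac{a}{\sqrt{2\pi}} \right\}, \qquad \Phi_{\mathrm{dis}}(a) = 2\,\Phi(a)\,\Phi(-a),$$

with Erf the Gauss error function defined in Equation 8. Figure 1 illustrates these three functions.

Figure 1: Behavior of functions $\Phi$, $\Phi_{\mathrm{cvx}}$ and $\Phi_{\mathrm{dis}}$.

The gradient $\nabla G(w)$ of Equation 5 is then given by

$$\nabla G(w) = C \sum_{i=1}^m \Phi'_{\mathrm{cvx}}\!\left( y_i^s \frac{w\cdot x_i^s}{\|x_i^s\|} \right) \frac{y_i^s\, x_i^s}{\|x_i^s\|} + s \times A \sum_{i=1}^m \left[ \Phi'_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^s}{\|x_i^s\|} \right) \frac{x_i^s}{\|x_i^s\|} - \Phi'_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^t}{\|x_i^t\|} \right) \frac{x_i^t}{\|x_i^t\|} \right] + w,$$

where $\Phi'_{\mathrm{cvx}}(a)$ and $\Phi'_{\mathrm{dis}}(a)$ are respectively the derivatives of functions $\Phi_{\mathrm{cvx}}$ and $\Phi_{\mathrm{dis}}$ evaluated at point $a$, and

$$s = \mathrm{sgn}\!\left( \sum_{i=1}^m \left[ \Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^s}{\|x_i^s\|} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{w\cdot x_i^t}{\|x_i^t\|} \right) \right] \right).$$

We extend these equations to kernels in the following subsection.

5.3 Using a Kernel Function

The kernel trick allows us to work with a dual weight vector $\alpha \in \mathbb{R}^{2m}$ that is a linear classifier in an augmented space. Given a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we have

$$h_w(x) = \mathrm{sgn}\!\left[ \sum_{i=1}^m \alpha_i\, k(x_i^s, x) + \sum_{i=1}^m \alpha_{i+m}\, k(x_i^t, x) \right].$$

Let us denote by $K$ the kernel matrix of size $2m \times 2m$ such that $K_{i,j} = k(x_i, x_j)$, where $x_\# = x_\#^s$ if $\# \le m$, and $x_\# = x_{\#-m}^t$ otherwise. In that case, the objective function of Equation 5 is rewritten in terms of the vector $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{2m})$ as

$$G(\alpha) = C \sum_{i=1}^m \Phi_{\mathrm{cvx}}\!\left( y_i^s \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) + A \left| \sum_{i=1}^m \left[ \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i+m,j}}{\sqrt{K_{i+m,i+m}}} \right) \right] \right| + \frac{1}{2} \sum_{i=1}^{2m} \sum_{j=1}^{2m} \alpha_i \alpha_j K_{i,j}.$$

The gradient of the latter equation is given by the vector $\nabla G(\alpha) = (\alpha'_1, \alpha'_2, \ldots, \alpha'_{2m})$, with

$$\alpha'_\# = C \sum_{i=1}^m \Phi'_{\mathrm{cvx}}\!\left( y_i^s \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) \frac{y_i^s K_{i,\#}}{\sqrt{K_{i,i}}} + s \times A \sum_{i=1}^m \left[ \Phi'_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) \frac{K_{i,\#}}{\sqrt{K_{i,i}}} - \Phi'_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i+m,j}}{\sqrt{K_{i+m,i+m}}} \right) \frac{K_{i+m,\#}}{\sqrt{K_{i+m,i+m}}} \right] + \sum_{i=1}^{2m} \alpha_i K_{i,\#},$$

where

$$s = \mathrm{sgn}\!\left( \sum_{i=1}^m \left[ \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i,j}}{\sqrt{K_{i,i}}} \right) - \Phi_{\mathrm{dis}}\!\left( \frac{\sum_{j=1}^{2m} \alpha_j K_{i+m,j}}{\sqrt{K_{i+m,i+m}}} \right) \right] \right).$$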
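The primal (linear) objective and gradient described in this section can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' released code: it assumes source and target samples of equal size with no zero-norm rows, labels in {-1, +1}, and a zero starting point instead of the PBGD3 warm start mentioned in the text. The names `pbda_objective` and `fit_pbda` are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import erf

SQRT_2 = np.sqrt(2.0)
SQRT_2PI = np.sqrt(2.0 * np.pi)


def phi(a):
    """Probit loss: Phi(a) = (1 - erf(a / sqrt(2))) / 2."""
    return 0.5 * (1.0 - erf(a / SQRT_2))


def phi_cvx(a):
    """Convex relaxation: max(Phi(a), 1/2 - a / sqrt(2 pi))."""
    return np.maximum(phi(a), 0.5 - a / SQRT_2PI)


def phi_dis(a):
    """Disagreement loss: 2 Phi(a) Phi(-a)."""
    return 2.0 * phi(a) * phi(-a)


def d_phi(a):
    """Derivative of Phi: minus the standard normal density."""
    return -np.exp(-0.5 * a ** 2) / SQRT_2PI


def d_phi_cvx(a):
    """Derivative of phi_cvx: Phi'(a) for a >= 0, constant slope otherwise."""
    return np.where(a >= 0.0, d_phi(a), -1.0 / SQRT_2PI)


def d_phi_dis(a):
    """Derivative of phi_dis: 2 Phi'(a) (Phi(-a) - Phi(a)) = 2 Phi'(a) erf(a / sqrt(2))."""
    return 2.0 * d_phi(a) * erf(a / SQRT_2)


def pbda_objective(w, Xs, ys, Xt, A, C):
    """Return (G(w), gradient of G at w) for equally sized samples."""
    norm_s = np.linalg.norm(Xs, axis=1)
    norm_t = np.linalg.norm(Xt, axis=1)
    a_s = Xs @ w / norm_s  # normalized source margins
    a_t = Xt @ w / norm_t  # normalized target margins
    gap = phi_dis(a_s).sum() - phi_dis(a_t).sum()
    value = C * phi_cvx(ys * a_s).sum() + A * abs(gap) + 0.5 * (w @ w)
    grad_risk = (d_phi_cvx(ys * a_s) * ys / norm_s) @ Xs
    grad_dis = (d_phi_dis(a_s) / norm_s) @ Xs - (d_phi_dis(a_t) / norm_t) @ Xt
    grad = C * grad_risk + np.sign(gap) * A * grad_dis + w
    return value, grad


def fit_pbda(Xs, ys, Xt, A=1.0, C=1.0, w0=None):
    """Minimize G with BFGS; w0 defaults to zeros (an assumption of this sketch)."""
    if w0 is None:
        w0 = np.zeros(Xs.shape[1])
    return minimize(pbda_objective, w0, args=(Xs, ys, Xt, A, C),
                    jac=True, method="BFGS")
```

Passing `jac=True` tells `scipy.optimize.minimize` that the objective returns the value and the gradient together, so the margins are computed only once per evaluation. The learned weight vector is available as the `x` attribute of the returned result.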

6 Experiments

6.1 General Setup

PBDA has been evaluated on a toy problem and a sentiment dataset (we made our code available on-line). For our experiments, we minimize the objective function using a Broyden-Fletcher-Goldfarb-Shanno method (BFGS) implemented in the scipy python library. PBDA has been compared with:

- SVM learned only from the source domain, i.e., without adaptation. We made use of the SVM-light library (Joachims, 1999).
- PBGD3, presented in Section 3.3, and learned only from the source domain, i.e., without adaptation.
- DASVM of Bruzzone and Marconcini (2010), an iterative domain adaptation algorithm which tries to maximize iteratively a notion of margin on self-labeled target examples. We implemented DASVM with the LibSVM library (Chang and Lin, 2001).
- CODA of Chen et al. (2011), a co-training domain adaptation algorithm, which looks iteratively for target features related to the training set. We used the implementation provided by the authors. Note that Chen et al. (2011) have shown best results on the dataset considered in our Section 6.4.

Each parameter is selected with a grid search via a classical 5-fold cross-validation (CV) on the source sample for PBGD3 and SVM, and via a 5-fold reverse/circular validation (RCV) on the source and the unlabeled target samples for CODA, DASVM, and PBDA. We describe this latter point in the following section. Note that for PBDA we search on a 20 × 20 parameter grid for a parameter A between 0.01 and 10^6 and a parameter C between 1.0 and 10^8, both on a logarithmic scale.

6.2 A Note about the Reverse Validation

A crucial question in domain adaptation is the validation of the hyperparameters. One solution is to follow the principle proposed by Zhong et al. (2010), which relies on the use of a reverse validation approach. This approach is based on a so-called reverse classifier evaluated on the source domain. We propose to follow it for tuning the parameters of PBDA, DASVM and CODA. Note that Bruzzone and Marconcini (2010) have proposed a similar method, called circular validation, in the context of DASVM.

Concretely, in our setting, given k folds on the source labeled sample ($S = S_1 \cup \ldots \cup S_k$), k folds on the unlabeled target sample ($T = T_1 \cup \ldots \cup T_k$), and a learning algorithm parametrized by a fixed tuple of hyperparameters, the reverse cross-validation risk on the i-th fold is computed as follows. Firstly, the source set $S \setminus S_i$ is used as a labeled sample and the target set $T \setminus T_i$ is used as an unlabeled sample for learning a classifier $h$. Secondly, using the same algorithm, a reverse classifier $h^r$ is learned using the self-labeled sample $\{(x, h(x))\}_{x \in T \setminus T_i}$ as the source set and the unlabeled part of $S \setminus S_i$ as target sample. Finally, the reverse classifier $h^r$ is evaluated on $S_i$. We summarize this principle on Figure 2. The process is repeated k times to obtain the reverse cross-validation risk averaged across all folds.

Figure 2: The principle of the reverse/circular validation in our setting. (1) Learning $h$ from labeled $S$ and unlabeled $T$; (2) auto-labeling of $T$ with $h$; (3) learning the reverse classifier $h^r$ from unlabeled $S$ and auto-labeled $T$; (4) evaluation of $h^r$ on labeled $S$.

6.3 Toy Problem: Two Inter-Twinning Moons

The source domain considered here is the classical binary problem with two inter-twinning moons, each class corresponding to one moon (Figure 3). We then consider seven different target domains by rotating anticlockwise the source domain according to seven angles of up to 90°. The higher the angle, the more difficult the problem becomes. For each domain, we generate 300 instances (150 of each class). Moreover, to assess the generalization ability of our approach, we evaluate each algorithm on an independent test set of 1,000 target points (not provided to the algorithms). We make use of a Gaussian kernel for all the methods. Each domain adaptation problem is repeated ten times, and we report the average error rates on Table 1. Note that since CODA decomposes features for applying co-training, it is not appropriate here (we have only two features).

Table 1: Average error rate results for seven rotation angles (methods compared: PBGD3 CV, SVM CV, DASVM RCV, PBDA RCV).

We remark that our PBDA provides the best performances except for two of the rotation angles (one of them being 50°), indicating that PBDA accurately tackles domain adaptation tasks. It shows a nice adaptation ability, especially for the hardest problem, probably due to the fact that $\mathrm{dis}_\rho$ is tighter and seems to be a good regularizer in a domain adaptation situation.

The adaptation versus risk minimization trade-off suggested by our bound appears in Figure 3. Indeed, the plot illustrates that PBDA accepts to have a lower source accuracy to maintain its performance on the target domain, at least when the source and the target domains are not so different. Note, however, that for large angles, PBDA prefers to focus on the source accuracy. We claim that this is a reasonable behavior for a domain adaptation algorithm.

6.4 Sentiment Analysis Dataset

We consider the popular Amazon reviews dataset (Blitzer et al., 2006) composed of reviews of four types of Amazon.com products (books, DVDs, electronics, kitchen appliances). Originally, the reviews corresponded to a rate between one and five stars and the feature space (of unigrams and bigrams) has on average a dimension of 100,000. For the sake of simplicity and for considering a binary classification task, we propose to follow a setting similar to the one proposed by Chen et al. (2011). Then the two possible classes are: +1 for the products with a rank higher than 3 stars, −1 for those with a rank lower or equal to 3 stars. The dimensionality is reduced in the following way: Chen et al. (2011) only kept the features that appear at least ten times in a particular DA task (about 40,000 features remain), and pre-processed the data with a standard tf-idf re-weighting. One type of product is a domain, so we perform twelve domain adaptation tasks. For example, books→DVDs corresponds to the task for which books is the source domain and DVDs the target one. The algorithms use a linear kernel and consider 2,000 labeled source examples and 2,000 unlabeled target examples. We evaluate them on separate target test sets proposed by Chen et al. (2011) (between 3,000 and 6,000 examples), and we report the results on Table 2.

We make the following observations. First, as expected, the domain adaptation approaches provide the best average results. Then, PBDA is on average better than CODA, but less accurate than DASVM. However, PBDA is competitive: the results are not significantly different from CODA and DASVM. Moreover, we have observed that PBDA is significantly faster than CODA and DASVM: these two algorithms are based on costly iterative procedures increasing the running time by at least a factor of five in comparison with PBDA. In fact, the clear advantage of PBDA is that we jointly optimize the terms of our bound in one step.
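The reverse (circular) validation procedure of Section 6.2 can be sketched as follows. This is a minimal illustration under our own assumptions: `fit` is a hypothetical callable that takes a labeled sample plus an unlabeled sample and returns a prediction function, and the fold assignment is a plain random split. None of these names come from the paper's code.

```python
import numpy as np


def reverse_validation_risk(fit, Xs, ys, Xt, k=5, seed=0):
    """Average reverse-validation risk over k folds.

    fit(X_labeled, y_labeled, X_unlabeled) must return a function that maps
    an array of inputs to an array of predicted labels (hypothetical API).
    """
    rng = np.random.default_rng(seed)
    s_folds = np.array_split(rng.permutation(len(Xs)), k)
    t_folds = np.array_split(rng.permutation(len(Xt)), k)
    risks = []
    for i in range(k):
        s_train = np.concatenate([s_folds[j] for j in range(k) if j != i])
        t_train = np.concatenate([t_folds[j] for j in range(k) if j != i])
        # Step 1: learn h from labeled source and unlabeled target data.
        h = fit(Xs[s_train], ys[s_train], Xt[t_train])
        # Step 2: self-label the target data with h, then learn the reverse
        # classifier with the roles of source and target swapped.
        pseudo_labels = h(Xt[t_train])
        h_reverse = fit(Xt[t_train], pseudo_labels, Xs[s_train])
        # Step 3: evaluate the reverse classifier on the held-out source fold.
        held_out = s_folds[i]
        risks.append(np.mean(h_reverse(Xs[held_out]) != ys[held_out]))
    return float(np.mean(risks))
```

Hyperparameters are then chosen by calling this function once per grid point, with `fit` closed over the candidate values, and keeping the point with the lowest returned risk.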

Figure 3: Illustration of the decision boundary of PBDA on three rotation angles for fixed parameters A and C. The two classes of the source sample are green and pink, and the target (unlabeled) sample is gray. The bottom plot shows the corresponding source and target errors. We intentionally avoid tuning PBDA parameters to highlight its inherent adaptation behavior.

6.5 Combining PBDA and Representation Learning

As discussed in the introduction, there exist several families of approaches used to tackle the domain adaptation problem. The present work focuses on the minimization of a distance metric between the source and target distributions. Now, we ask ourselves whether it can be fruitful to combine our PBDA algorithm with another approach. To do so, we executed PBDA on top of the Marginalized Stacked Denoising Autoencoders (mSDA) introduced by Chen et al. (2012).

In brief, mSDA is an unsupervised algorithm that learns a new representation of the training samples. As a denoising autoencoder algorithm, it finds a representation from which one can approximately reconstruct the original features of an example from its noisy counterpart. The originality of mSDA is to learn a representation that allows reconstructing both source and target unlabeled examples. Then, one can execute any supervised learning algorithm on the new representation of source samples, for which the labels are known. That is, given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^m$ and a target sample $T = \{x_i^t\}_{i=1}^m$, mSDA takes the unlabeled parts of $S$ and $T$, $\{x_1^s, \ldots, x_m^s, x_1^t, \ldots, x_m^t\}$, and learns a feature map $f : X \to X'$, where $X'$ is a new input space of real-valued vectors. In Chen et al. (2012), a linear SVM is executed using $S_f = \{(f(x_i^s), y_i^s)\}_{i=1}^m$ as training data, and the hyperparameter C is selected by standard cross-validation.

We compare the performance of SVM on the mSDA representation to PBDA on the same representation. That is, we obtain a new representation of both source $S_f = \{(f(x_i^s), y_i^s)\}_{i=1}^m$ and target $T_f = \{f(x_i^t)\}_{i=1}^m$ data, using mSDA. Then, we execute PBDA using $S_f$ and $T_f$. This comparison is done using the Amazon reviews dataset. For the sake of comparison, we used the dataset pre-processed by Chen et al. (2012), which is slightly different from the one used in Section 6.4. Indeed, all domains share the same 5,000 features, and no tf-idf re-weighting is applied. For each source-target pair, mSDA representations are generated using a corruption probability of 50% and 5 layers. Then, SVM and PBDA are executed on the same representations.

The results are reported in Table 3. The PBDA algorithm, when we select the hyperparameters by reverse cross-validation (PBDA RCV), is not always as good as the cross-validated SVM (SVM CV). However, by looking closer at the results, we notice that there often exist hyperparameters for which PBDA is better on the testing set than the best achievable SVM (as reported by the columns PBDA TEST and SVM TEST). This suggests that it might be advantageous to mix mSDA and PBDA learning strategies. However, the selection of hyperparameters is still a challenge in domain adaptation when we do not have any target labels, even if the reverse cross-validation method is a sound strategy. For exploratory purposes, we report on Table 3 the risk of PBDA while performing the model selection by standard cross-validation (PBDA CV) and


Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

arxiv: v3 [stat.ml] 13 Jul 2017

arxiv: v3 [stat.ml] 13 Jul 2017 PAC-Bayesian Analysis for a two-step Hierarchical Multiview Learning Approach Anil Goyal,2 ilie Morvant Pascal Gerain 3 Massih-Reza Aini 2 Univ Lyon, UJM-aint-tienne, CNR, Institut d Optique Graduate chool,

More information

E. Alpaydın AERFAISS

E. Alpaydın AERFAISS E. Alpaydın AERFAISS 00 Introduction Questions: Is the error rate of y classifier less than %? Is k-nn ore accurate than MLP? Does having PCA before iprove accuracy? Which kernel leads to highest accuracy

More information

Two transfer learning approaches for domain adaptation

Two transfer learning approaches for domain adaptation Two transfer learning approaches for domain adaptation Amaury Habrard amaury.habrard@univ-st-etienne.fr Laboratoire Hubert Curien, UMR CNRS 5516 University Jean Monnet - Saint-Étienne (France) Seminar,

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016/2017 Lessons 9 11 Jan 2017 Outline Artificial Neural networks Notation...2 Convolutional Neural Networks...3

More information

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA)

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA) Bayesian Learning Chapter 6: Bayesian Learning CS 536: Machine Learning Littan (Wu, TA) [Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6] Bayes Theore MAP, ML hypotheses MAP learners Miniu

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

UNIVERSITY OF TRENTO ON THE USE OF SVM FOR ELECTROMAGNETIC SUBSURFACE SENSING. A. Boni, M. Conci, A. Massa, and S. Piffer.

UNIVERSITY OF TRENTO ON THE USE OF SVM FOR ELECTROMAGNETIC SUBSURFACE SENSING. A. Boni, M. Conci, A. Massa, and S. Piffer. UIVRSITY OF TRTO DIPARTITO DI IGGRIA SCIZA DLL IFORAZIO 3823 Povo Trento (Italy) Via Soarive 4 http://www.disi.unitn.it O TH US OF SV FOR LCTROAGTIC SUBSURFAC SSIG A. Boni. Conci A. assa and S. Piffer

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are,

are equal to zero, where, q = p 1. For each gene j, the pairwise null and alternative hypotheses are, Page of 8 Suppleentary Materials: A ultiple testing procedure for ulti-diensional pairwise coparisons with application to gene expression studies Anjana Grandhi, Wenge Guo, Shyaal D. Peddada S Notations

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

PAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers

PAC-Bayes Risk Bounds for Sample-Compressed Gibbs Classifiers PAC-Bayes Ris Bounds for Sample-Compressed Gibbs Classifiers François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Département d informatique et de génie logiciel,

More information

Geometrical intuition behind the dual problem

Geometrical intuition behind the dual problem Based on: Geoetrical intuition behind the dual proble KP Bennett, EJ Bredensteiner, Duality and Geoetry in SVM Classifiers, Proceedings of the International Conference on Machine Learning, 2000 1 Geoetrical

More information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information Cite as: Straub D. (2014). Value of inforation analysis with structural reliability ethods. Structural Safety, 49: 75-86. Value of Inforation Analysis with Structural Reliability Methods Daniel Straub

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

Ph 20.3 Numerical Solution of Ordinary Differential Equations

Ph 20.3 Numerical Solution of Ordinary Differential Equations Ph 20.3 Nuerical Solution of Ordinary Differential Equations Due: Week 5 -v20170314- This Assignent So far, your assignents have tried to failiarize you with the hardware and software in the Physics Coputing

More information

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon Lecture 2: Enseble Methods Isabelle Guyon guyoni@inf.ethz.ch Introduction Book Chapter 7 Weighted Majority Mixture of Experts/Coittee Assue K experts f, f 2, f K (base learners) x f (x) Each expert akes

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University

More information

Homework 3 Solutions CSE 101 Summer 2017

Homework 3 Solutions CSE 101 Summer 2017 Hoework 3 Solutions CSE 0 Suer 207. Scheduling algoriths The following n = 2 jobs with given processing ties have to be scheduled on = 3 parallel and identical processors with the objective of iniizing

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Tracking using CONDENSATION: Conditional Density Propagation

Tracking using CONDENSATION: Conditional Density Propagation Tracking using CONDENSATION: Conditional Density Propagation Goal Model-based visual tracking in dense clutter at near video frae rates M. Isard and A. Blake, CONDENSATION Conditional density propagation

More information

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe PROPERTIES OF MULTIVARIATE HOMOGENEOUS ORTHOGONAL POLYNOMIALS Brahi Benouahane y Annie Cuyt? Keywords Abstract It is well-known that the denoinators of Pade approxiants can be considered as orthogonal

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

Training an RBM: Contrastive Divergence. Sargur N. Srihari

Training an RBM: Contrastive Divergence. Sargur N. Srihari Training an RBM: Contrastive Divergence Sargur N. srihari@cedar.buffalo.edu Topics in Partition Function Definition of Partition Function 1. The log-likelihood gradient 2. Stochastic axiu likelihood and

More information

Research Article Robust ε-support Vector Regression

Research Article Robust ε-support Vector Regression Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,

More information

PAC-Bayesian Generalization Bound for Multi-class Learning

PAC-Bayesian Generalization Bound for Multi-class Learning PAC-Bayesian Generalization Bound for Multi-class Learning Loubna BENABBOU Department of Industrial Engineering Ecole Mohammadia d Ingènieurs Mohammed V University in Rabat, Morocco Benabbou@emi.ac.ma

More information