Advanced Machine Learning


1 Advanced Machine Learning Domain Adaptation MEHRYAR MOHRI COURANT INSTITUTE & GOOGLE RESEARCH.

2 Non-Ideal World [Figure: the ideal learning scenario vs. the real world, which differ in domain, in sampling, and over time.] page 2

3 Outline Domain adaptation. Multiple-source domain adaptation. page 3

4 Domain Adaptation Sentiment analysis. Language modeling, part-of-speech tagging. Statistical parsing. Speech recognition. Computer vision. Solution critical for applications. page 4

5 This Talk Domain adaptation Discrepancy Theoretical guarantees Algorithm Enhancements page 5

6 Domain Adaptation Problem Domains: source $(Q, f_Q)$, target $(P, f_P)$. Input: labeled sample $S$ drawn from the source, unlabeled sample $T$ drawn from the target. Problem: find a hypothesis $h$ in $H$ with small expected loss with respect to the target domain, that is $L_P(h, f_P) = \mathbb{E}_{x \sim P}\big[L\big(h(x), f_P(x)\big)\big]$. page 6

7 Sample Bias Correction Problem: special case of domain adaptation with $f_Q = f_P$ and $\mathrm{supp}(Q) \subseteq \mathrm{supp}(P)$. page 7

8 Related Work in Theory Single-source adaptation: relation between adaptation and the $d_A$ distance (Devroye et al. (1996); Kifer et al. (2004); Ben-David et al. (2007)). a few negative examples of adaptation (Ben-David et al. (AISTATS 2010)). analysis and learning guarantees for importance weighting (Cortes, Mansour, and MM (NIPS 2010)). page 8

9 Related Work in Theory Multiple-source: same input distribution, but different labels (Crammer et al., 2005, 2006). theoretical analysis and method for multiple-source adaptation (Mansour, MM, Rostamizadeh, 2008). page 9

10 Distribution Mismatch [Figure: source distribution $Q$ vs. target distribution $P$.] Which distance should we use to compare these distributions? page 10

11 Simple Analysis Proposition: assume that the loss $L$ is bounded by $M$, then $|L_Q(h, f) - L_P(h, f)| \le M\, L_1(Q, P)$. Proof: $|L_P(h, f) - L_Q(h, f)| = \big|\mathbb{E}_{x \sim P}[L(h(x), f(x))] - \mathbb{E}_{x \sim Q}[L(h(x), f(x))]\big| = \big|\sum_x L(h(x), f(x))\,(P(x) - Q(x))\big| \le M \sum_x |P(x) - Q(x)|$. But, is this bound informative? page 11
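As a quick sanity check of this proposition (an illustration, not part of the original slides), the following Python sketch compares the two sides of the bound on a small discrete domain with the zero-one loss, so that M = 1.

import numpy as np

rng = np.random.default_rng(0)
n = 10                                  # size of the discrete domain
P = rng.dirichlet(np.ones(n))           # target distribution
Q = rng.dirichlet(np.ones(n))           # source distribution
f = rng.integers(0, 2, size=n)          # target labels
h = rng.integers(0, 2, size=n)          # an arbitrary hypothesis

loss = (h != f).astype(float)           # zero-one loss, bounded by M = 1
L_P = P @ loss                          # L_P(h, f)
L_Q = Q @ loss                          # L_Q(h, f)
l1 = np.abs(P - Q).sum()                # L1(Q, P)
assert abs(L_Q - L_P) <= l1 + 1e-12
print(f"|L_Q - L_P| = {abs(L_Q - L_P):.4f} <= M * L1(Q, P) = {l1:.4f}")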

12 Example - Zero-One Loss [Figure: hypotheses $h$ and $f$ differ only on a small region $a$ of the input space.] $|L_Q(h, f) - L_P(h, f)| = |Q(a) - P(a)|$, which can be small even when the $L_1$ distance between $Q$ and $P$ is large. page 12

13 Discrepancy Definition: (Mansour, MM, Rostami, COLT 2009) $\mathrm{disc}(P, Q) = \max_{h, h' \in H} \big|L_P(h, h') - L_Q(h, h')\big|$. symmetric, verifies the triangle inequality, in general not a distance. helps compare distributions for arbitrary losses, e.g. hinge loss or $L_p$ loss. generalization of the $d_A$ distance (Devroye et al. (1996); Kifer et al. (2004); Ben-David et al. (2007)). page 13
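For the zero-one loss and threshold classifiers on the real line, the empirical discrepancy can be computed by brute force, since the loss between two thresholds is just the sample mass of the interval separating them. The sketch below is illustrative; the function name empirical_disc_thresholds is ours, not from the slides.

import numpy as np

def empirical_disc_thresholds(source, target):
    # disc(P_hat, Q_hat) for the zero-one loss and threshold hypotheses
    # h_t(x) = 1[x >= t]: two thresholds t < t2 disagree exactly on [t, t2),
    # so the discrepancy is the largest gap between the empirical masses
    # that the two samples put on such an interval.
    thresholds = np.concatenate(([-np.inf], np.sort(np.concatenate([source, target])), [np.inf]))
    best = 0.0
    for i, t in enumerate(thresholds):
        for t2 in thresholds[i + 1:]:
            q_mass = np.mean((source >= t) & (source < t2))
            p_mass = np.mean((target >= t) & (target < t2))
            best = max(best, abs(p_mass - q_mass))
    return best

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=100)   # sample from Q
target = rng.normal(0.5, 1.0, size=100)   # sample from P
print("empirical discrepancy:", empirical_disc_thresholds(source, target))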

14 Discrepancy - Properties Theorem: for the $L_q$ loss bounded by $M > 0$, for any $\delta > 0$, with probability at least $1 - \delta$, $\mathrm{disc}(P, Q) \le \mathrm{disc}(\widehat{P}, \widehat{Q}) + 4q\big(R_S(H) + R_T(H)\big) + 3M\Big(\sqrt{\tfrac{\log\frac{4}{\delta}}{2m}} + \sqrt{\tfrac{\log\frac{4}{\delta}}{2n}}\Big)$. page 14

15 Discrepancy = Distance Theorem: let $K$ be a universal kernel (e.g., Gaussian kernel) and $H = \{h \in H_K : \|h\|_K \le \Lambda\}$. Then, for the $L_2$ loss, the discrepancy is a distance over a compact set $X$. (Cortes & MM (TCS 2013)) Proof: $\Phi \colon h \mapsto \mathbb{E}_{x \sim P}[h^2(x)] - \mathbb{E}_{x \sim Q}[h^2(x)]$ is continuous for the sup norm, thus continuous on $C(X)$. $\mathrm{disc}(P, Q) = 0$ implies $\Phi(h) = 0$ for all $h \in H$. since $H_K$ is dense in $C(X)$, $\Phi = 0$ over $C(X)$. thus, $\mathbb{E}_P[f] - \mathbb{E}_Q[f] = 0$ for all $f \ge 0$ in $C(X)$. this implies $P = Q$. page 15

16 Theoretical Guarantees Two types of questions: difference between the average loss of a hypothesis $h$ on $Q$ versus on $P$? difference of loss (measured on $P$) between the hypothesis $h$ obtained when training on $(Q, f_Q)$ versus the hypothesis $h_P$ obtained when training on $(P, f_P)$? page 16

17 Generalization Bound (Mansour, MM, Rostami (COLT 2009)) Notation: $L_Q(h_Q^*, f_Q) = \min_{h \in H} L_Q(h, f_Q)$, $L_P(h_P^*, f_P) = \min_{h \in H} L_P(h, f_P)$. Theorem: assume that the loss $L$ obeys the triangle inequality, then for any $h \in H$, $L_P(h, f_P) \le L_Q(h, h_Q^*) + L_P(h_P^*, f_P) + \mathrm{disc}(P, Q) + L_Q(h_Q^*, h_P^*)$. page 17

18 Some Natural Cases When $h^* = h_Q^* = h_P^*$, $L_P(h, f_P) \le L_Q(h, h^*) + L_P(h^*, f_P) + \mathrm{disc}(P, Q)$. When $f_P \in H$ (consistent case), $|L_P(h, f_P) - L_Q(h, f_P)| \le \mathrm{disc}(Q, P)$. Bound of (Ben-David et al., NIPS 2006) or (Blitzer et al., NIPS 2007): always worse in these cases. page 18

19 Regularized ERM Algorithms Objective function: $F_{\widehat{Q}}(h) = \lambda \|h\|_K^2 + \widehat{R}_{\widehat{Q}}(h)$, where $K$ is a PDS kernel, $\lambda > 0$ is a trade-off parameter, and $\widehat{R}_{\widehat{Q}}(h)$ is the empirical error of $h$. broad family of algorithms including SVM, SVR, kernel ridge regression, etc. page 19
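A minimal instance of this family, sketched below for illustration (not from the slides), is kernel ridge regression: with the squared loss, the objective $\lambda\|h\|_K^2 + \frac{1}{m}\sum_i (h(x_i) - y_i)^2$ has the closed-form minimizer $h = \sum_j \alpha_j K(x_j, \cdot)$ with $\alpha = (K + \lambda m I)^{-1} y$.

import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=0.1, gamma=1.0):
    # minimize lam * ||h||_K^2 + (1/m) * sum_i (h(x_i) - y_i)^2 over the RKHS
    m = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def krr_predict(alpha, X_train, X_test, gamma=1.0):
    return gaussian_kernel(X_test, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = krr_fit(X, y)
print(krr_predict(alpha, X, X[:5]))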

20 Guarantees for Reg. ERM (Cortes & MM (TCS 2013)) Theorem: let $K$ be a PDS kernel with $K(x, x) \le R^2$ and $L$ a loss function such that $L(\cdot, y)$ is $\mu$-Lipschitz. Let $h'$ be the minimizer of $F_{\widehat{P}}$ and $h$ that of $F_{\widehat{Q}}$, then, for all $(x, y) \in X \times Y$, $\big|L(h'(x), y) - L(h(x), y)\big| \le \mu R \sqrt{\frac{\mathrm{disc}(\widehat{P}, \widehat{Q}) + \mu\, \eta_H(f_P, f_Q)}{\lambda}}$, where $\eta_H(f_P, f_Q) = \inf_{h \in H} \Big[\max_{x \in \mathrm{supp}(\widehat{P})} |f_P(x) - h(x)| + \max_{x \in \mathrm{supp}(\widehat{Q})} |f_Q(x) - h(x)|\Big]$. page 20

21 Proof By the property of the minimizers, there exist subgradients such that $2\lambda h' = -\partial \widehat{R}_{\widehat{P}}(h')$ and $2\lambda h = -\partial \widehat{R}_{\widehat{Q}}(h)$. Thus, $2\lambda \|h' - h\|^2 = \big\langle h' - h,\ \partial \widehat{R}_{\widehat{Q}}(h) - \partial \widehat{R}_{\widehat{P}}(h')\big\rangle = -\big\langle h' - h, \partial \widehat{R}_{\widehat{P}}(h')\big\rangle + \big\langle h' - h, \partial \widehat{R}_{\widehat{Q}}(h)\big\rangle \le \widehat{R}_{\widehat{P}}(h) - \widehat{R}_{\widehat{P}}(h') + \widehat{R}_{\widehat{Q}}(h') - \widehat{R}_{\widehat{Q}}(h) \le 2\,\mathrm{disc}(\widehat{P}, \widehat{Q}) + 2\mu\, \eta_H(f_P, f_Q)$. page 21

22 Guarantees for Reg. ERM (Cortes & MM (TCS 2013)) Theorem: let $K$ be a PDS kernel with $K(x, x) \le R^2$ and the $L_2$ loss bounded by $M$. Then, for all $(x, y)$, $\big|L(h'(x), y) - L(h(x), y)\big| \le \frac{R\sqrt{M}}{\lambda}\big(\mathrm{disc}(\widehat{P}, \widehat{Q}) + \delta\big)$, where $\delta = \min_{h \in H} \Big\| \mathbb{E}_{x \sim \widehat{Q}}\big[(h(x) - f_Q(x))\, \Phi_K(x)\big] - \mathbb{E}_{x \sim \widehat{P}}\big[(h(x) - f_P(x))\, \Phi_K(x)\big] \Big\|_K$. For $f_P = f_Q = f$: $\delta = 0$ if $f \in H$; $\delta$ small if $f$ is $\epsilon$-close to $H$ on the samples, e.g., for a Gaussian kernel and $f$ continuous. page 22

23 Empirical Discrepancy Discrepancy measure $\mathrm{disc}(\widehat{P}, \widehat{Q})$: critical term in the bounds. Smaller empirical discrepancy guarantees closeness of the pointwise losses of $h'$ and $h$. But, can we further reduce the discrepancy? page 23

24 Algorithm - Idea Search for a new empirical distribution $q^*$ with the same support: $q^* = \mathrm{argmin}_{\mathrm{supp}(q) \subseteq \mathrm{supp}(\widehat{Q})} \mathrm{disc}(\widehat{P}, q)$. Solve the modified optimization problem: $\min_h F_{q^*}(h) = \sum_{i=1}^m q^*(x_i)\, L(h(x_i), y_i) + \lambda \|h\|_K^2$. page 24
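For the squared loss and an RKHS hypothesis set, the reweighted problem above is a weighted kernel ridge regression and again has a closed form; the sketch below is one possible implementation under those assumptions (the name weighted_krr_fit and the stand-in weights are ours, not from the slides).

import numpy as np

def weighted_krr_fit(K, y, q, lam=0.1):
    # minimize sum_i q_i (h(x_i) - y_i)^2 + lam * ||h||_K^2 over the RKHS;
    # with h = sum_j alpha_j K(x_j, .), a minimizer solves
    # (diag(q) K + lam I) alpha = diag(q) y.
    W = np.diag(q)
    return np.linalg.solve(W @ K + lam * np.eye(K.shape[0]), W @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = X @ X.T                          # linear kernel matrix on the source sample
y = X[:, 0] + 0.1 * rng.normal(size=20)
q = np.full(20, 1 / 20)              # stand-in for the weights q* from the DM step
alpha = weighted_krr_fit(K, y, q)
print(alpha[:5])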

25 Case of Halfspaces page 25

26 Min-Max Problem Reformulation: $\widehat{Q}^* = \mathrm{argmin}_{\widehat{q} \in \mathcal{Q}} \max_{h, h' \in H} \big|L_{\widehat{P}}(h, h') - L_{\widehat{q}}(h, h')\big|$. game-theoretic interpretation. gives a lower bound: $\max_{h, h' \in H} \min_{\widehat{q} \in \mathcal{Q}} \big|L_{\widehat{P}}(h, h') - L_{\widehat{q}}(h, h')\big| \le \min_{\widehat{q} \in \mathcal{Q}} \max_{h, h' \in H} \big|L_{\widehat{P}}(h, h') - L_{\widehat{q}}(h, h')\big|$. page 26

27 Classification - 0/1 Loss Problem: $\min_{Q'} \max_{a \in H \Delta H} \big|Q'(a) - \widehat{P}(a)\big|$ subject to: $\forall x \in S_Q,\ Q'(x) \ge 0$; $\sum_{x \in S_Q} Q'(x) = 1$. page 27

28 Classification - 0/1 Loss Linear program (LP): $\min_{Q', \delta}\ \delta$ subject to: $\forall a \in H \Delta H,\ Q'(a) - \widehat{P}(a) \le \delta$; $\forall a \in H \Delta H,\ \widehat{P}(a) - Q'(a) \le \delta$; $\forall x \in S_Q,\ Q'(x) \ge 0$; $\sum_{x \in S_Q} Q'(x) = 1$. Number of constraints bounded by the shattering coefficient $\Pi_{H \Delta H}(m_0 + n_0)$. page 28
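Assuming the regions $a \in H \Delta H$ restricted to the samples have already been enumerated (which the slide bounds by the shattering coefficient), the LP itself is small; here is an illustrative sketch with scipy, where min_disc_lp and the toy regions are our own names, not from the slides.

import numpy as np
from scipy.optimize import linprog

def min_disc_lp(regions, p_hat_masses, m):
    # variables: q_1, ..., q_m (weights on the source points) and the slack delta;
    # minimize delta subject to |q(a) - P_hat(a)| <= delta for every region a,
    # q >= 0 and sum_i q_i = 1.
    n_reg = len(regions)
    c = np.zeros(m + 1)
    c[-1] = 1.0
    A_ub = np.zeros((2 * n_reg, m + 1))
    b_ub = np.zeros(2 * n_reg)
    for j, (idx, p_a) in enumerate(zip(regions, p_hat_masses)):
        A_ub[2 * j, list(idx)] = 1.0       #  q(a) - delta <= P_hat(a)
        A_ub[2 * j, -1] = -1.0
        b_ub[2 * j] = p_a
        A_ub[2 * j + 1, list(idx)] = -1.0  # -q(a) - delta <= -P_hat(a)
        A_ub[2 * j + 1, -1] = -1.0
        b_ub[2 * j + 1] = -p_a
    A_eq = np.ones((1, m + 1))
    A_eq[0, -1] = 0.0                      # sum_i q_i = 1 (delta not included)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (m + 1))
    return res.x[:m], res.x[-1]

# toy usage: 3 source points; two regions of H Delta H with target masses 0.5 and 0.2
q, delta = min_disc_lp([[0, 1], [2]], [0.5, 0.2], m=3)
print("q =", q, "min discrepancy =", delta)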

29 Algorithm - 1D page 29

30 Regression - L2 Loss Problem: $\min_{\widehat{q}' \in \mathcal{Q}} \max_{h, h' \in H} \big|\mathbb{E}_{\widehat{P}}[(h'(x) - h(x))^2] - \mathbb{E}_{\widehat{q}'}[(h'(x) - h(x))^2]\big|$. For linear hypotheses, this equals $\min_{\widehat{q}' \in \mathcal{Q}} \max_{\|w\| \le 1, \|w'\| \le 1} \big|\mathbb{E}_{\widehat{P}}[((w' - w)^\top x)^2] - \mathbb{E}_{\widehat{q}'}[((w' - w)^\top x)^2]\big| = \min_{\widehat{q}' \in \mathcal{Q}} \max_{\|w\| \le 1, \|w'\| \le 1} \Big|\sum_{x \in S} (\widehat{P}(x) - \widehat{q}'(x))\,[(w' - w)^\top x]^2\Big| = \min_{\widehat{q}' \in \mathcal{Q}} \max_{\|u\| \le 2} \Big|\sum_{x \in S} (\widehat{P}(x) - \widehat{q}'(x))\,[u^\top x]^2\Big| = \min_{\widehat{q}' \in \mathcal{Q}} \max_{\|u\| \le 2} \Big|u^\top \Big(\sum_{x \in S} (\widehat{P}(x) - \widehat{q}'(x))\, x x^\top\Big) u\Big|$. page 30

31 Regression - L2 Loss Problem equivalent to $\min_{\|z\|_1 = 1,\, z \ge 0} \max_{\|u\| = 1} \big|u^\top M(z)\, u\big|$, with: $M(z) = M_0 - \sum_{i=1}^{m_0} z_i M_i$, $M_0 = \sum_{x \in S} \widehat{P}(x)\, x x^\top$, $M_i = s_i s_i^\top$, where $s_1, \ldots, s_{m_0}$ are the elements of $\mathrm{supp}(\widehat{Q})$. page 31

32 Regression - L2 Loss Semi-definite program (SDP), linear hypotheses: $\min_{z, \delta}\ \delta$ subject to: $\delta I - M(z) \succeq 0$; $\delta I + M(z) \succeq 0$; $\mathbf{1}^\top z = 1$; $z \ge 0$, where the matrix $M(z)$ is defined by $M(z) = \sum_{x \in S} \widehat{P}(x)\, x x^\top - \sum_{i=1}^{m_0} z_i s_i s_i^\top$. page 32
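The problem can be stated almost verbatim with a modeling tool such as cvxpy, as in the sketch below (our own illustration, assuming cvxpy with an SDP-capable solver such as SCS is installed); minimizing the spectral norm of $M(z)$ is the same SDP, and the returned value is the minimized $\max_{\|u\|=1} |u^\top M(z) u|$.

import cvxpy as cp
import numpy as np

def min_disc_sdp(X_target, p_hat, S_support):
    # X_target: target sample points (rows); p_hat: their empirical weights;
    # S_support: support points s_i of Q_hat (rows).  Returns the weights z
    # minimizing the spectral norm of M(z) and the minimal value.
    M0 = (X_target * p_hat[:, None]).T @ X_target   # sum_x P_hat(x) x x^T
    m0 = S_support.shape[0]
    z = cp.Variable(m0, nonneg=True)
    M = M0 - sum(z[i] * np.outer(S_support[i], S_support[i]) for i in range(m0))
    prob = cp.Problem(cp.Minimize(cp.sigma_max(M)), [cp.sum(z) == 1])
    prob.solve()
    return z.value, prob.value

rng = np.random.default_rng(0)
X_t = rng.normal(0.5, 1.0, size=(30, 2))            # target sample
S_q = rng.normal(0.0, 1.0, size=(20, 2))            # support of Q_hat
z, val = min_disc_sdp(X_t, np.full(30, 1 / 30), S_q)
print("minimized value:", val)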

33 Regression - L2 Loss SDP: generalization to an RKHS $H$ for some kernel $K$: $\min_{z, \delta}\ \delta$ subject to: $\delta I - M(z) \succeq 0$; $\delta I + M(z) \succeq 0$; $\mathbf{1}^\top z = 1$; $z \ge 0$, with: $M(z) = M_0 - \sum_{i=1}^{m_0} z_i M_i$, $M_0 = K^{1/2} \mathrm{diag}\big(\widehat{P}(s_1), \ldots, \widehat{P}(s_{p_0})\big) K^{1/2}$, $M_i = K^{1/2} I_i K^{1/2}$. page 33

34 Discrepancy Min. Algorithm Convex optimization: cast as a semi-definite programming (SDP) problem. efficient solution using smooth optimization. Algorithm and solution for arbitrary kernels. Outperforms other algorithms in experiments. (Cortes & MM (TCS 2013)) page 34

35 Experiments Classification: Q and P Gaussians. H: halfspaces. f: interval [-1, +1]. [Plot: error vs. number of training points, comparing training with the minimized discrepancy ("w/ min disc") and with the original discrepancy ("w/ orig disc").] page 35

36 Experiments Regression. [Plot: error vs. number of training points for the source Q and the target P.] SDP solved in about 15s using SeDuMi on a 3GHz CPU with 2GB of memory. page 36

37 Experiments page 37

38 Enhancement Shortcomings: the discrepancy depends on a maximizing pair of hypotheses. the DM algorithm is too conservative. Ideas: a finer, hypothesis-dependent quantity: the generalized discrepancy. reweighting depending on the hypothesis. (Cortes, MM, and Muñoz (2014)) page 38

39 Choose $Q_h$ Ideally: choose $Q_h$ such that the objectives are uniformly close: $\lambda\|h\|_K^2 + L_{Q_h}(h, f_Q) \approx \lambda\|h\|_K^2 + L_{\widehat{P}}(h, f_P)$, i.e. $Q_h = \mathrm{argmin}_q \big|L_q(h, f_Q) - L_{\widehat{P}}(h, f_P)\big|$. (Cortes, MM, and Muñoz (2014)) Using a convex surrogate set $H''$: $Q_h = \mathrm{argmin}_q \max_{h'' \in H''} \big|L_q(h, f_Q) - L_{\widehat{P}}(h, h'')\big|$. page 39

40 Optimization (Cortes, MM, and Muñoz (2014)) $L_{Q_h}(h, f_Q) = \mathrm{argmin}_{l \in \{L_q(h, f_Q) : q \in F(S_X, \mathbb{R})\}} \max_{h'' \in H''} \big|l - L_{\widehat{P}}(h, h'')\big| = \mathrm{argmin}_{l \in \mathbb{R}} \max_{h'' \in H''} \big|l - L_{\widehat{P}}(h, h'')\big| = \frac{1}{2}\Big[\max_{h'' \in H''} L_{\widehat{P}}(h, h'') + \min_{h'' \in H''} L_{\widehat{P}}(h, h'')\Big]$. Convex optimization problem (loss jointly convex): $\min_h\ \lambda\|h\|_K^2 + \frac{1}{2}\Big[\max_{h'' \in H''} L_{\widehat{P}}(h, h'') + \min_{h'' \in H''} L_{\widehat{P}}(h, h'')\Big]$. page 40

41 Convex Surrogate Hyp. Set Choice of $H''$ among balls $B(r) = \{h'' \in H : L_q(h'', f_Q) \le r\}$. (Cortes, MM, and Muñoz (2014)) Generalization bound proven to be more favorable than DM for some choices of the radius $r$. Radius $r$ chosen via cross-validation using a small amount of labeled data from the target. Further improvement of empirical results. page 41

42 Conclusion Theory of adaptation based on discrepancy: key term in analysis of adaptation and drifting. discrepancy minimization algorithm DM. compares favorably to other adaptation algorithms in experiments. Generalized discrepancy: extension to hypothesis-dependent reweighting. convex optimization problem. further empirical improvements. page 42

43 Outline Domain adaptation. Multiple-source domain adaptation. page 43

44 Problem Formulation Given distributions $D_1, D_2, \ldots, D_k$ and corresponding hypotheses $h_1, h_2, \ldots, h_k$. Notation: $f$ unknown target function; $L(D_i, h_i, f) = \mathbb{E}_{x \sim D_i}[L(h_i(x), f(x))]$. each hypothesis performs well in its domain: $L(D_i, h_i, f) \le \epsilon$. Loss $L$ assumed non-negative, bounded, convex and continuous. page 44

45 Problem Formulation The unknown target distribution $D_T$ is a mixture of the input distributions: $D_T(x) = \sum_{i=1}^k \lambda_i D_i(x)$. Task: choose a hypothesis combination that performs well on the target distribution. convex combination rule: $h_z(x) = \sum_{i=1}^k z_i h_i(x)$. distribution weighted combination rule: $h_z(x) = \sum_{i=1}^k \frac{z_i D_i(x)}{\sum_{j=1}^k z_j D_j(x)}\, h_i(x)$. page 45
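Both rules are straightforward to implement; the following numpy sketch (ours, for illustration only) evaluates them at a set of points given per-domain densities and base predictions, and includes the smoothing parameter eta used later in the talk to handle points of vanishing density.

import numpy as np

def convex_combination(H, z):
    # H: (k, n) base predictions h_i(x); z: (k,) mixture weights
    return z @ H

def distribution_weighted(H, D, z, eta=0.0):
    # D: (k, n) densities D_i(x); with eta > 0 this is the smoothed rule h_z^eta
    k = D.shape[0]
    num = z[:, None] * D + eta / k                  # z_i D_i(x) + eta/k
    weights = num / num.sum(axis=0, keepdims=True)  # normalize over the domains i
    return (weights * H).sum(axis=0)

H = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0]])
D = np.array([[0.4, 0.4, 0.1, 0.1],
              [0.1, 0.1, 0.4, 0.4]])
z = np.array([0.5, 0.5])
print(convex_combination(H, z))
print(distribution_weighted(H, D, z))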

46 Known Target Distribution For some distributions, any convex combination performs poorly. [Figure: distribution weights of $D_T$, $D_0$, $D_1$ and the outputs of $f$, $h_0$, $h_1$ over two regions $a$ and $b$.] base hypotheses have no error within their domain. any convex combination has error 1/2! page 46

47 Main Results Thus, although convex combinations seem natural, they can perform very poorly. We will show that distribution weighted combinations define the right combination rule. There exists a single robust distribution weighted hypothesis that does well for any target mixture: $\forall f,\ \exists z,\ \forall \lambda,\ L(D_\lambda, h_z, f) \le \epsilon$. page 47

48 Known Target Distribution If the mixture distribution is known, the distribution weighted rule will always do well. Choose $z = \lambda$: $h_\lambda(x) = \sum_{i=1}^k \frac{\lambda_i D_i(x)}{\sum_{j=1}^k \lambda_j D_j(x)}\, h_i(x)$. Proof: $L(D_T, h_\lambda, f) = \sum_{x \in X} L(h_\lambda(x), f(x))\, D_T(x) \le \sum_{x \in X} \sum_{i=1}^k \frac{\lambda_i D_i(x)}{D_T(x)}\, L(h_i(x), f(x))\, D_T(x) = \sum_{i=1}^k \lambda_i L(D_i, h_i, f) \le \sum_{i=1}^k \lambda_i\, \epsilon = \epsilon$, where the first inequality holds by convexity of the loss. page 48

49 Unknown Target Mixture Zero-sum game: NATURE: select a target distribution $D_i$. LEARNER: select a $z$, i.e. a distribution weighted hypothesis $h_z$. Payoff: $L(D_i, h_z, f)$. Already shown: the game value is at most $\epsilon$. Minimax theorem (modulo discretization of $z$): there exists a mixture of distribution weighted hypotheses $\sum_j \mu_j h_{z_j}$ that does well for any distribution mixture. page 49

50 Balancing Losses Brouwer's Fixed Point theorem: for any compact, convex, non-empty set $A$ and any continuous function $f \colon A \to A$, there exists $x \in A$ such that $f(x) = x$. Notation: $L_i^z := L(D_i, h_z, f)$. Define the mapping $\phi$ by: $[\phi(z)]_i = \frac{z_i L_i^z}{\sum_j z_j L_j^z}$. By the fixed point theorem (modulo continuity), there exists $z$ such that: $\forall i,\ z_i = \frac{z_i L_i^z}{\sum_j z_j L_j^z}$, thus $\forall i,\ L_i^z = \sum_j z_j L_j^z =: \gamma$. page 50

51 Bounding Loss For the fixed point $z$, $L(D_z, h_z, f) = \sum_{x \in X} L(h_z(x), f(x)) \sum_{i=1}^k z_i D_i(x) = \sum_{i=1}^k z_i \sum_{x \in X} D_i(x)\, L(h_z(x), f(x)) = \sum_{i=1}^k z_i L_i^z = \sum_{i=1}^k z_i\, \gamma = \gamma$. Also, by convexity, $L(D_z, h_z, f) = \sum_{x \in X} L(h_z(x), f(x))\, D_z(x) \le \sum_{x \in X} \sum_{i=1}^k \frac{z_i D_i(x)}{D_z(x)}\, L(h_i(x), f(x))\, D_z(x) = \sum_{i=1}^k z_i L(D_i, h_i, f) \le \epsilon$. page 51

52 Bounding Loss Thus, $L(D_i, h_z, f) = \gamma \le \epsilon$ for every $i$, and for any mixture $D_\lambda$, $L(D_\lambda, h_z, f) = \sum_{i=1}^k \lambda_i L(D_i, h_z, f) \le \sum_{i=1}^k \lambda_i\, \epsilon = \epsilon$. [Figure: a single hypothesis $h_z$ performing well for all domains $D_1, D_2, \ldots, D_k$ and target $f$.] page 52

53 Details To deal with non-continuity, refine the hypotheses: $h_z^\eta(x) = \sum_{i=1}^k \frac{z_i D_i(x) + \eta/k}{\sum_{j=1}^k z_j D_j(x) + \eta}\, h_i(x)$. Theorem: for any target function $f$ and any $\delta > 0$, $\exists \eta > 0, z \colon \forall \lambda,\ L(D_\lambda, h_z^\eta, f) \le \epsilon + \delta$. If the loss obeys the triangle inequality, the result holds for all admissible target functions: $\forall \delta > 0,\ \exists z, \eta > 0 \colon \forall \lambda,\ \forall f \in F,\ L(D_\lambda, h_z^\eta, f) \le 3\epsilon + \delta$. page 53

54 A Simple Algorithm A simple constructive algorithm: choose $z = u$ with uniform weights. Then, for any mixture $D_\lambda$, $L(D_\lambda, h_u, f) = \sum_x D_\lambda(x)\, L\Big(\sum_{i=1}^k \frac{D_i(x)}{\sum_{j=1}^k D_j(x)}\, h_i(x), f(x)\Big) \le \sum_x \frac{\sum_{m=1}^k \lambda_m D_m(x)}{\sum_{j=1}^k D_j(x)} \sum_{i=1}^k D_i(x)\, L(h_i(x), f(x)) \le \sum_x \sum_{i=1}^k D_i(x)\, L(h_i(x), f(x)) = \sum_{i=1}^k L(D_i, h_i, f) \le k\epsilon$. page 54
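This $k\epsilon$ guarantee is easy to check numerically; the toy sketch below (ours, not from the slides) builds two discrete source domains with near-accurate base hypotheses, forms the uniform distribution weighted rule, and verifies the bound for several mixtures under the squared loss.

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2
D = rng.dirichlet(np.ones(n), size=k)       # densities D_i over n points
f = rng.normal(size=n)                      # target labels
H = f + 0.05 * rng.normal(size=(k, n))      # base hypotheses with small in-domain error

w = D / D.sum(axis=0, keepdims=True)        # D_i(x) / sum_j D_j(x)
h_u = (w * H).sum(axis=0)                   # uniform distribution weighted rule

eps = max((D[i] * (H[i] - f) ** 2).sum() for i in range(k))
for lam in np.linspace(0.0, 1.0, 5):
    D_lam = lam * D[0] + (1 - lam) * D[1]
    assert (D_lam * (h_u - f) ** 2).sum() <= k * eps + 1e-12
print("L(D_lambda, h_u, f) <= k * eps holds on this example")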

55 Preliminary Empirical Results Sentiment Analysis - given a product review (text string), predict a rating (between 1.0 and 5.0). 4 Domains: Books, DVDs, Electronics and Kitchen Appliances. Base hypotheses are trained within each domain (Support Vector Regression). We are not given the distributions. We model each distribution using a bag of words model. We then test the distribution combination rule on known target mixture domains. page 55

56 Empirical Results [Bar chart: MSE of the uniform mixture over the 4 domains, comparing in-domain and out-of-domain performance.] page 56

57 Empirical Results [Plot: MSE as a function of $\alpha$ for the target mixture $\alpha\,\mathrm{book} + (1 - \alpha)\,\mathrm{kitchen}$, comparing the distribution weighted and linear combination rules with the book and kitchen base hypotheses.] page 57

58 Conclusion Formulation of the multiple source adaptation problem. Theoretical analysis for mixture distributions. Efficient algorithm for finding distribution weighted combination hypothesis? Beyond mixture distributions? page 58

59 Rényi Divergences Definition: for $\alpha \ge 0$, $D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \sum_x P(x) \Big[\frac{P(x)}{Q(x)}\Big]^{\alpha - 1}$. $\alpha = 1$: coincides with the relative entropy. $\alpha = 2$: logarithm of the expected probability ratio, $D_2(P \,\|\, Q) = \log \mathbb{E}_{x \sim P}\Big[\frac{P(x)}{Q(x)}\Big]$. $\alpha = +\infty$: logarithm of the maximum probability ratio, $D_\infty(P \,\|\, Q) = \log \sup_x \frac{P(x)}{Q(x)}$. page 59
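For discrete distributions the definition translates directly into a few lines of numpy; the helper below (an illustration, not from the slides) also handles the $\alpha = 1$ and $\alpha = \infty$ limits.

import numpy as np

def renyi_divergence(P, Q, alpha):
    # D_alpha(P || Q) for discrete distributions; terms with P(x) = 0 vanish
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0
    if alpha == 1:
        return np.sum(P[mask] * np.log(P[mask] / Q[mask]))      # relative entropy
    if alpha == np.inf:
        return np.log(np.max(P[mask] / Q[mask]))                # log of max ratio
    ratio = (P[mask] / Q[mask]) ** (alpha - 1)
    return np.log(np.sum(P[mask] * ratio)) / (alpha - 1)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
for a in (0.5, 1, 2, np.inf):
    print(a, renyi_divergence(P, Q, a))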

60 Extensions - Arbitrary Target (Mansour, MM, and Rostami, 2009) Theorem: for any $\delta > 0$ and $\alpha > 1$, $\exists \eta, z \colon \forall P$, $L(P, h_z^\eta, f) \le \big[d_\alpha(P \,\|\, Q)\,(\epsilon + \delta)\big]^{\frac{\alpha - 1}{\alpha}} M^{\frac{1}{\alpha}}$, with $Q = \sum_{i=1}^k \lambda_i D_i$: divergence measured in terms of the (exponentiated) Rényi divergence, $d_\alpha(P, Q) = \Big[\sum_x \frac{P^\alpha(x)}{Q^{\alpha - 1}(x)}\Big]^{\frac{1}{\alpha - 1}}$. page 60

61 Proof By Hölder's inequality, for any hypothesis $h$, $L(P, h, f) = \sum_x P(x)\, L(h(x), f(x)) = \sum_x \frac{P(x)}{Q^{\frac{\alpha-1}{\alpha}}(x)}\, Q^{\frac{\alpha-1}{\alpha}}(x)\, L(h(x), f(x)) \le \Big[\sum_x \frac{P^\alpha(x)}{Q^{\alpha-1}(x)}\Big]^{\frac{1}{\alpha}} \Big[\sum_x Q(x)\, L^{\frac{\alpha}{\alpha-1}}(h(x), f(x))\Big]^{\frac{\alpha-1}{\alpha}} = \big(d_\alpha(P \,\|\, Q)\big)^{\frac{\alpha-1}{\alpha}} \Big[\mathbb{E}_{x \sim Q}\big[L^{\frac{\alpha}{\alpha-1}}(h(x), f(x))\big]\Big]^{\frac{\alpha-1}{\alpha}} = \big(d_\alpha(P \,\|\, Q)\big)^{\frac{\alpha-1}{\alpha}} \Big[\mathbb{E}_{x \sim Q}\big[L(h(x), f(x))\, L^{\frac{1}{\alpha-1}}(h(x), f(x))\big]\Big]^{\frac{\alpha-1}{\alpha}} \le \big(d_\alpha(P \,\|\, Q)\big)^{\frac{\alpha-1}{\alpha}} \big[L(Q, h, f)\, M^{\frac{1}{\alpha-1}}\big]^{\frac{\alpha-1}{\alpha}}$. page 61

62 Other Extensions Approximate (estimated) distributions: similar results shown, depending on the divergence between the true and estimated distributions. Different source target functions $f_i$: similar results when the target functions are close to $f$ on the target distribution. (Mansour, MM, and Rostami, 2009) page 62

63 References
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS 2006.
S. Ben-David, T. Lu, T. Luu, and D. Pal. Impossibility theorems for domain adaptation. Journal of Machine Learning Research - Proceedings Track, 9, 2010.
S. Bickel, M. Bruckner, and T. Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10.
J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In NIPS 2007.
Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In NIPS 2010.
Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519.
page 63

64 References
Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In Proceedings of ALT 2008.
Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from data of variable quality. In Proceedings of NIPS, 2005.
Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. In Proceedings of NIPS, 2006.
L. Devroye, L. Gyorfi, and G. Lugosi. A probabilistic theory of pattern recognition. Springer, 1996.
M. Dredze, J. Blitzer, P. P. Talukdar, K. Ganchev, J. Graca, and F. Pereira. Frustratingly hard domain adaptation for parsing. In CoNLL.
J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Scholkopf. Correcting sample selection bias by unlabeled data. In NIPS, volume 19.
page 64

65 References
D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of VLDB, 2004.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: learning bounds and algorithms. In COLT 2009.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in NIPS, 2008.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of UAI 2009.
Mehryar Mohri and Andres Muñoz Medina. New analysis and algorithm for learning with drifting distributions. In Proceedings of ALT, volume 7568.
page 65

66 References
H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 2000.
M. Sugiyama, S. Nakajima, H. Kashima, P. von Bunau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS.
M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bunau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60.
page 66
