Foundations of Machine Learning

1 Maximum Entropy Models, Logistic Regression Mehryar Mohri Courant Institute and Google Research page 1

2 Motivation Probabilistic models: density estimation. classification. page 2

3 This Lecture Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models. page 3

4 Entropy (Shannon, 1948) Definition: the entropy of a discrete random variable $X$ with probability mass function $p(x) = \Pr[X = x]$ is $H(X) = -E[\log p(X)] = -\sum_{x\in\mathcal{X}} p(x)\log p(x)$. Properties: measure of the uncertainty of $X$; $H(X)\ge 0$; maximal for the uniform distribution: for a support of size $N$, by Jensen's inequality, $H(X) = E\big[\log\frac{1}{p(X)}\big] \le \log E\big[\frac{1}{p(X)}\big] = \log N$. page 4
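A quick numerical illustration of these properties (a minimal sketch; the distributions below are made up, and the natural logarithm is used):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0 (natural log)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

p_uniform = np.ones(4) / 4                      # uniform over N = 4 outcomes
p_skewed = np.array([0.7, 0.1, 0.1, 0.1])

print(entropy(p_uniform), np.log(4))            # equal: the uniform attains log N
print(entropy(p_skewed))                        # strictly smaller, but still >= 0
```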

5 Entropy Base of the logarithm: not critical; for base 2, $-\log_2(p(x))$ is the number of bits needed to represent $p(x)$. Definition and notation: the entropy of a distribution $p$ is defined by the same quantity and denoted by $H(p)$. Special case of the Rényi entropy (Rényi, 1961). Binary entropy: $H(p) = -p\log p - (1-p)\log(1-p)$. page 5

6 Relative Entropy (Shannon, 1948; Kullback and Leibler, 1951) Definition: the relative entropy (or Kullback-Leibler divergence) between two distributions $p$ and $q$ (discrete case) is $D(p\,\|\,q) = E_p\big[\log\frac{p(X)}{q(X)}\big] = \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}$, with the conventions $0\log\frac{0}{q} = 0$ and $p\log\frac{p}{0} = +\infty$. Properties: asymmetric: in general, $D(p\,\|\,q)\neq D(q\,\|\,p)$ for $p\neq q$; non-negative: $D(p\,\|\,q)\ge 0$ for all $p$ and $q$; definite: $(D(p\,\|\,q) = 0)\Rightarrow(p = q)$. page 6
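The three properties can be checked numerically on small hand-picked distributions (a minimal sketch; the conventions of the definition are handled explicitly):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)), with 0 log(0/q) = 0 and p log(p/0) = +inf."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    if np.any(q[nz] == 0):
        return np.inf
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric, both non-negative
print(kl_divergence(p, p))                        # 0, and 0 only when p = q
```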

7 Non-Negativity of Rel. Entropy By the concavity of log and Jensen's inequality, $D(p\,\|\,q) = -\sum_{x\colon p(x)>0} p(x)\log\frac{q(x)}{p(x)} \ge -\log\sum_{x\colon p(x)>0} p(x)\frac{q(x)}{p(x)} = -\log\sum_{x\colon p(x)>0} q(x) \ge -\log(1) = 0$. page 7

8 Bregman Divergence (Bregman, 1967) Definition: let $F$ be a convex and differentiable function defined over a convex set $C$ in a Hilbert space $\mathbb{H}$. Then, the Bregman divergence $B_F$ associated to $F$ is defined by $B_F(x\,\|\,y) = F(x) - F(y) - \langle\nabla F(y),\, x - y\rangle$. [Figure: $B_F(y\,\|\,x)$ is the gap at $y$ between $F(y)$ and the tangent to $F$ at $x$, $F(x) + \nabla F(x)\cdot(y-x)$.] page 8

9 Bregman Divergence Examples: $F(x) = \|x\|^2$ gives the squared $L_2$-distance $B_F(x\,\|\,y) = \|x-y\|^2$; $F(x) = x^\top K^{-1}x$ gives the Mahalanobis distance $B_F(x\,\|\,y) = (x-y)^\top K^{-1}(x-y)$; $F(x) = \sum_{i\in I} x_i\log x_i - x_i$ gives the unnormalized relative entropy $B_F(x\,\|\,y) = \widetilde D(x\,\|\,y)$. note: the relative entropy is not a Bregman divergence since it is not defined over an open set; but, on the simplex, it coincides with the unnormalized relative entropy $\widetilde D(p\,\|\,q) = \sum_{x\in\mathcal{X}}\big[p(x)\log\frac{p(x)}{q(x)} + q(x) - p(x)\big]$. page 9
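The first and last of these examples can be verified directly from the definition (a minimal sketch; the generating functions and test points below are made up):

```python
import numpy as np

def bregman(F, grad_F, x, y):
    """B_F(x || y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - np.dot(grad_F(y), x - y)

# F(x) = ||x||^2 generates the squared L2-distance.
F = lambda x: np.dot(x, x)
grad_F = lambda x: 2.0 * x
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(bregman(F, grad_F, x, y), np.sum((x - y) ** 2))   # identical values

# F(x) = sum_i x_i log x_i - x_i generates the unnormalized relative entropy.
G = lambda x: np.sum(x * np.log(x) - x)
grad_G = lambda x: np.log(x)
p, q = np.array([0.2, 0.8]), np.array([0.5, 0.5])
D_tilde = np.sum(p * np.log(p / q) + q - p)
print(bregman(G, grad_G, p, q), D_tilde)                # equal; = D(p || q) on the simplex
```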

10 Conditional Relative Entropy Definition: let $p$ and $q$ be two probability distributions over $\mathcal{X}\times\mathcal{Y}$. Then, the conditional relative entropy of $p$ and $q$ with respect to a distribution $r$ over $\mathcal{X}$ is defined by $E_{x\sim r}\big[D\big(p(\cdot\mid X)\,\|\,q(\cdot\mid X)\big)\big] = \sum_{x\in\mathcal{X}} r(x)\sum_{y\in\mathcal{Y}} p(y\mid x)\log\frac{p(y\mid x)}{q(y\mid x)} = D(\widetilde p\,\|\,\widetilde q)$, with $\widetilde p(x,y) = r(x)p(y\mid x)$, $\widetilde q(x,y) = r(x)q(y\mid x)$, and the conventions $0\log 0 = 0$, $0\log\frac{0}{0} = 0$, and $p\log\frac{p}{0} = +\infty$. note: the definition of the conditional relative entropy is not intrinsic; it depends on a third distribution $r$. page 10

11 This Lecture Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models. page 11

12 Density Estimation Problem Training data: sample $S$ of size $m$ drawn i.i.d. from a set $\mathcal{X}$ according to some distribution $D$: $S = (x_1,\ldots,x_m)$. Problem: find a distribution $p$ out of a hypothesis set $\mathcal{P}$ that best estimates $D$. page 12

13 Maximum Likelihood Solution Maximum Likelihood principle: select the distribution $p\in\mathcal{P}$ maximizing the likelihood of the observed sample $S$: $p_{\mathrm{ML}} = \operatorname{argmax}_{p\in\mathcal{P}}\Pr[S\mid p] = \operatorname{argmax}_{p\in\mathcal{P}}\prod_{i=1}^m p(x_i) = \operatorname{argmax}_{p\in\mathcal{P}}\sum_{i=1}^m\log p(x_i)$. page 13

14 Relative Entropy Formulation Lemma: let $\widehat p_S$ be the empirical distribution for sample $S$; then $p_{\mathrm{ML}} = \operatorname{argmin}_{p\in\mathcal{P}} D(\widehat p_S\,\|\,p)$. Proof: $D(\widehat p_S\,\|\,p) = \sum_x\widehat p_S(x)\log\widehat p_S(x) - \sum_x\widehat p_S(x)\log p(x) = -H(\widehat p_S) - \sum_x\frac{\sum_{i=1}^m 1_{x=x_i}}{m}\log p(x) = -H(\widehat p_S) - \frac{1}{m}\sum_{i=1}^m\sum_x 1_{x=x_i}\log p(x) = -H(\widehat p_S) - \frac{\sum_{i=1}^m\log p(x_i)}{m}$. page 14
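A tiny numerical illustration of the lemma for a Bernoulli family over $\mathcal{X} = \{0,1\}$ (the sample below is made up): the likelihood maximizer and the minimizer of $D(\widehat p_S\,\|\,p_\theta)$ coincide on any grid of candidate parameters, since the two objectives differ only by the constant $-H(\widehat p_S)$ and a factor $1/m$.

```python
import numpy as np

# Hypothetical sample over X = {0, 1} and the one-parameter family p_theta(1) = theta.
S = np.array([1, 0, 1, 1, 0, 1, 1, 0])
p_hat = np.array([np.mean(S == 0), np.mean(S == 1)])     # empirical distribution

thetas = np.linspace(0.01, 0.99, 99)
log_lik = np.array([np.log(np.where(S == 1, t, 1 - t)).sum() for t in thetas])
kl = np.array([np.sum(p_hat * np.log(p_hat / np.array([1 - t, t]))) for t in thetas])

# Both print the same grid point, near the exact maximum-likelihood value 5/8.
print(thetas[np.argmax(log_lik)], thetas[np.argmin(kl)])
```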

15 Maximum a Posteriori (MAP) Maximum a Posteriori principle: select the distribution $p\in\mathcal{P}$ that is the most likely, given the observed sample $S$ and assuming a prior distribution $\Pr[p]$ over $\mathcal{P}$: $p_{\mathrm{MAP}} = \operatorname{argmax}_{p\in\mathcal{P}}\Pr[p\mid S] = \operatorname{argmax}_{p\in\mathcal{P}}\frac{\Pr[S\mid p]\Pr[p]}{\Pr[S]} = \operatorname{argmax}_{p\in\mathcal{P}}\Pr[S\mid p]\Pr[p]$. note: for a uniform prior, ML = MAP. page 15

16 This Lecture Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models. page 16

17 Density Estimation + Features Training data: sample $S$ of size $m$ drawn i.i.d. from a set $\mathcal{X}$ according to some distribution $D$: $S = (x_1,\ldots,x_m)$. Features: feature mapping associated to the elements of $\mathcal{X}$, $\Phi\colon\mathcal{X}\to\mathbb{R}^N$, $x\mapsto\Phi(x) = [\Phi_1(x),\ldots,\Phi_N(x)]^\top$. Problem: find a distribution $p$ out of a hypothesis set $\mathcal{P}$ that best estimates $D$. For simplicity, in what follows, $\mathcal{X}$ is assumed to be finite. page 17

18 Features Feature functions assumed to be in a family $\mathcal{H}$ and uniformly bounded. Examples of $\mathcal{H}$: family of threshold functions defined over the $N$ variables, $\{x\mapsto 1_{x_j\le\theta}\colon j\in[1,N],\ \theta\in\mathbb{R}\}$; functions defined via decision trees with larger depths; $k$-degree monomials of the original features; zero-one features (often used in NLP, e.g., presence/absence of a word or POS tag). page 18

19 Maximum Entropy Principle (E. T. Jaynes, 1957, 1983) Idea: the empirical feature vector average is close to its expectation: for any $\delta>0$, with probability at least $1-\delta$, $\big\|E_{x\sim D}[\Phi(x)] - E_{x\sim\widehat D}[\Phi(x)]\big\|_\infty \le 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\tfrac{\log\frac{2}{\delta}}{2m}}$. Maxent principle: find the distribution $p$ that is closest to a prior distribution $p_0$ (typically the uniform distribution) while verifying $\big\|E_{x\sim p}[\Phi(x)] - E_{x\sim\widehat D}[\Phi(x)]\big\|_\infty\le\lambda$. Closeness is measured using the relative entropy. note: no hypothesis set $\mathcal{P}$ needs to be specified. page 19

20 Maxent Formulation Optimization problem: $\min_{p\in\Delta} D(p\,\|\,p_0)$ subject to $\big\|E_{x\sim p}[\Phi(x)] - E_{x\sim S}[\Phi(x)]\big\|_\infty\le\lambda$, where $\Delta$ is the simplex of distributions over $\mathcal{X}$. Convex optimization problem, unique solution. $\lambda = 0$: standard Maxent (or unregularized Maxent); $\lambda>0$: regularized Maxent. page 20

21 Relation with Entropy Relationship with entropy: for a uniform prior $p_0$, $D(p\,\|\,p_0) = \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{p_0(x)} = -\sum_{x\in\mathcal{X}} p(x)\log p_0(x) + \sum_{x\in\mathcal{X}} p(x)\log p(x) = \log|\mathcal{X}| - H(p)$. page 21

22 Maxent Problem Optimization: convex optimization problem. $\min_p\sum_{x\in\mathcal{X}} p(x)\log p(x)$ subject to: $p(x)\ge 0$ for all $x\in\mathcal{X}$; $\sum_{x\in\mathcal{X}} p(x) = 1$; $\big|\sum_{x\in\mathcal{X}} p(x)\Phi_j(x) - \frac{1}{m}\sum_{i=1}^m\Phi_j(x_i)\big|\le\lambda$, for all $j\in[1,N]$. page 22

23 Gibbs Distributions Gibbs distributions: set $\mathcal{Q}$ of distributions $p_w$, $w\in\mathbb{R}^N$, with $p_w[x] = \frac{p_0[x]\exp(w\cdot\Phi(x))}{Z} = \frac{p_0[x]\exp\big(\sum_{j=1}^N w_j\Phi_j(x)\big)}{Z}$, where $Z = \sum_x p_0[x]\exp(w\cdot\Phi(x))$. Rich family: for linear and quadratic features, it includes Gaussians and other distributions with non-PSD quadratic forms in the exponent; for higher-degree polynomials of the raw features, more complex multi-modal distributions. page 23
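A small sketch of this family over a hypothetical finite $\mathcal{X}$ (the points, features, prior, and weight vector below are made up); the max-shift only serves numerical stability and cancels in the normalization:

```python
import numpy as np

def gibbs(w, Phi, p0=None):
    """p_w[x] proportional to p0[x] exp(w . Phi(x)) over a finite set X.
    Phi has shape (|X|, N); w has shape (N,)."""
    if p0 is None:
        p0 = np.ones(Phi.shape[0]) / Phi.shape[0]        # uniform prior p0
    scores = Phi @ w
    unnorm = p0 * np.exp(scores - scores.max())          # stable; the shift cancels below
    return unnorm / unnorm.sum()

# Hypothetical finite X with 5 points and N = 2 features.
Phi = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.]])
w = np.array([0.5, -1.0])
p_w = gibbs(w, Phi)
print(p_w, p_w.sum())                                    # a valid member of the Gibbs family
```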

24 Dual Problems Regularized Maxent problem: $\min_p F(p) = \widetilde D(p\,\|\,p_0) + I_C(E_p[\Phi])$, with $\widetilde D(p\,\|\,p_0) = D(p\,\|\,p_0)$ if $p\in\Delta$, $+\infty$ otherwise; $C = \{u\colon\|u - E_S[\Phi]\|_\infty\le\lambda\}$; and $I_C(u) = 0$ if $u\in C$, $I_C(u) = +\infty$ otherwise. Regularized Maximum Likelihood problem with Gibbs distributions: $\sup_w G(w) = \frac{1}{m}\sum_{i=1}^m\log\frac{p_w[x_i]}{p_0[x_i]} - \lambda\|w\|_1$. page 24

25 Duality Theorem (Della Pietra et al., 1997; Dudík et al., 2007; Cortes et al., 2015) Theorem: the regularized Maxent and regularized ML with Gibbs distributions problems are equivalent, $\sup_{w\in\mathbb{R}^N} G(w) = \min_p F(p)$. Furthermore, let $p^* = \operatorname{argmin}_p F(p)$; then, for any $\epsilon>0$, $\big|G(w) - \sup_{w\in\mathbb{R}^N} G(w)\big| < \epsilon \Rightarrow D(p^*\,\|\,p_w)\le\epsilon$. page 25

26 Notes Maxent formulation: no explicit restriction to a family of distributions $\mathcal{P}$, but the solution coincides with regularized ML for a specific family $\mathcal{P}$ (Gibbs distributions); more general Bregman divergence-based formulations exist. page 26

27 L1-Regularized Maxent Optimization problem: $\inf_{w\in\mathbb{R}^N}\lambda\|w\|_1 - \frac{1}{m}\sum_{i=1}^m\log p_w[x_i]$, where $p_w[x] = \frac{1}{Z}\exp(w\cdot\Phi(x))$. Bayesian interpretation: equivalent to MAP with a Laplacian prior $q_{\mathrm{prior}}(w)$ (Williams, 1994; Kazama and Tsujii, 2003): $\max_w\log\big[\prod_{i=1}^m p_w[x_i]\,q_{\mathrm{prior}}(w)\big] = \max_w\sum_{i=1}^m\log p_w[x_i] + \log q_{\mathrm{prior}}(w)$, with $q_{\mathrm{prior}}(w) = \prod_{j=1}^N\frac{\beta_j}{2}\exp(-\beta_j|w_j|)$. page 27
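A minimal sketch of this optimization for a uniform prior over a small hypothetical finite $\mathcal{X}$, minimized by plain (sub)gradient descent (the feature matrix, sample, step size, and $\lambda$ are all made up). The gradient of the negative log-likelihood term is the feature-expectation gap $E_{p_w}[\Phi] - E_S[\Phi]$, which is what the code computes:

```python
import numpy as np

def l1_maxent_obj_grad(w, Phi, sample_idx, lam):
    """Objective lam * ||w||_1 - (1/m) sum_i log p_w[x_i] for a uniform prior p0,
    and the gradient of its smooth part, E_{p_w}[Phi] - E_S[Phi]."""
    scores = Phi @ w                                     # w . Phi(x) for every x in X
    log_Z = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
    log_p = scores - log_Z                               # log p_w[x]
    nll = -np.mean(log_p[sample_idx])                    # -(1/m) sum_i log p_w[x_i]
    grad = Phi.T @ np.exp(log_p) - Phi[sample_idx].mean(axis=0)
    return nll + lam * np.abs(w).sum(), grad

# Hypothetical finite X (4 points, N = 2 features); the sample is a list of indices into X.
Phi = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
sample = np.array([1, 1, 3, 2, 1, 3])
lam, eta, w = 0.01, 0.5, np.zeros(2)
for _ in range(500):                                     # plain (sub)gradient descent sketch
    _, g = l1_maxent_obj_grad(w, Phi, sample, lam)
    w -= eta * (g + lam * np.sign(w))
print(w, l1_maxent_obj_grad(w, Phi, sample, lam)[0])
```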

28 Generalization Guarantee (Dudík et al., 2007) Notation: $L_D(w) = E_{x\sim D}[-\log p_w[x]]$, $L_S(w) = E_{x\sim S}[-\log p_w[x]]$. Theorem: fix $\delta>0$. Let $\widehat w$ be the solution of the L1-regularized Maxent problem for $\lambda = 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\log(\frac{2}{\delta})/2m}$. Then, with probability at least $1-\delta$, $L_D(\widehat w)\le\inf_w\Big\{L_D(w) + 2\|w\|_1\Big[2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\tfrac{\log\frac{2}{\delta}}{2m}}\Big]\Big\}$. page 28

29 Proof By Hölder's inequality and the concentration bound for average feature vectors, $L_D(\widehat w) - L_S(\widehat w) = \widehat w\cdot\big[E_S[\Phi] - E_D[\Phi]\big] \le \|\widehat w\|_1\,\big\|E_S[\Phi] - E_D[\Phi]\big\|_\infty \le \lambda\|\widehat w\|_1$. Since $\widehat w$ is a minimizer of the regularized objective, for any $w$, $L_D(\widehat w) - L_D(w) = L_D(\widehat w) - L_S(\widehat w) + L_S(\widehat w) - L_D(w) \le \lambda\|\widehat w\|_1 + L_S(\widehat w) - L_D(w) \le \lambda\|w\|_1 + L_S(w) - L_D(w) \le 2\lambda\|w\|_1$. page 29

30 L2-Regularized Maxent (Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001) Different relaxations: L1-style (per-coordinate) constraints: for all $j\in[1,N]$, $\big|E_{x\sim p}[\Phi_j(x)] - E_{x\sim\widehat p}[\Phi_j(x)]\big|\le\lambda_j$; L2 constraints: $\big\|E_{x\sim p}[\Phi(x)] - E_{x\sim\widehat p}[\Phi(x)]\big\|_2\le B$. page 30

31 L2-Regularized Maxent Optimization problem: $\inf_{w\in\mathbb{R}^N}\lambda\|w\|_2^2 - \frac{1}{m}\sum_{i=1}^m\log p_w[x_i]$, where $p_w[x] = \frac{1}{Z}\exp(w\cdot\Phi(x))$. Bayesian interpretation: equivalent to MAP with a Gaussian prior $q_{\mathrm{prior}}(w)$ (Goodman, 2004): $\max_w\log\big[\prod_{i=1}^m p_w[x_i]\,q_{\mathrm{prior}}(w)\big]$ with $q_{\mathrm{prior}}(w) = \prod_{j=1}^N\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{w_j^2}{2\sigma^2}}$. page 31

32 This Lecture Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models. page 32

33 Conditional Maxent Models Maxent models for conditional probabilities: model the conditional probability of each class; used in multi-class classification; can use different features for each class; a.k.a. multinomial logistic regression. Logistic regression: special case of two classes. page 33

34 Problem Data: sample drawn i.i.d. according to some distribution $D$: $S = ((x_1,y_1),\ldots,(x_m,y_m))\in(\mathcal{X}\times\mathcal{Y})^m$, with $\mathcal{Y} = \{1,\ldots,k\}$, or $\mathcal{Y} = \{0,1\}^k$ in the multi-label case. Features: mapping $\Phi\colon\mathcal{X}\times\mathcal{Y}\to\mathbb{R}^N$. Problem: find accurate conditional probability models $\Pr[\cdot\mid x]$, $x\in\mathcal{X}$, based on $\Phi$. page 34

35 Conditional Maxent Principle (Berger et al., 1996; Cortes et al., 2015) Idea: the empirical feature vector average is close to its expectation: for any $\delta>0$, with probability at least $1-\delta$, $\big\|E_{(x,y)\sim D}[\Phi(x,y)] - E_{(x,y)\sim\widehat D}[\Phi(x,y)]\big\|_\infty\le 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\tfrac{\log\frac{2}{\delta}}{2m}}$. Maxent principle: find conditional distributions $p[\cdot\mid x]$ that are closest to the priors $p_0[\cdot\mid x]$ (typically uniform distributions) while verifying $\big\|E_{x\sim\widehat p,\,y\sim p[\cdot\mid x]}[\Phi(x,y)] - E_{x\sim\widehat p,\,y\sim\widehat p[\cdot\mid x]}[\Phi(x,y)]\big\|_\infty\le\lambda$. Closeness is measured using the conditional relative entropy based on $\widehat p$. page 35

36 Cond. Maxent Formulation (Berger et al., 1996; Cortes et al., 2015) Optimization problem: $\min_{p[\cdot\mid x]\in\Delta}\sum_{x\in\mathcal{X}}\widehat p[x]\,D\big(p[\cdot\mid x]\,\|\,p_0[\cdot\mid x]\big)$ subject to $\big\|E_{x\sim\widehat p}\big[E_{y\sim p[\cdot\mid x]}[\Phi(x,y)]\big] - E_{(x,y)\sim S}[\Phi(x,y)]\big\|_\infty\le\lambda$. Solution of a convex optimization problem, unique solution. $\lambda = 0$: unregularized conditional Maxent; $\lambda>0$: regularized conditional Maxent. page 36

37 Dual Problems Regularized conditional Maxent problem: $\min_p\widetilde F(p) = E_{x\sim\widehat p}\Big[D\big(p[\cdot\mid x]\,\|\,p_0[\cdot\mid x]\big) + I_\Delta\big(p[\cdot\mid x]\big)\Big] + I_C\Big(E_{x\sim\widehat p,\,y\sim p[\cdot\mid x]}[\Phi]\Big)$. Regularized Maximum Likelihood problem with conditional Gibbs distributions: $\sup_w\widetilde G(w) = \frac{1}{m}\sum_{i=1}^m\log\frac{p_w[y_i\mid x_i]}{p_0[y_i\mid x_i]} - \lambda\|w\|_1$, where for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$, $p_w[y\mid x] = \frac{p_0[y\mid x]\exp(w\cdot\Phi(x,y))}{Z(x)}$ with $Z(x) = \sum_{y\in\mathcal{Y}} p_0[y\mid x]\exp(w\cdot\Phi(x,y))$. page 37
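For a fixed $x$ and a uniform prior $p_0[\cdot\mid x]$, the conditional Gibbs distribution is just a softmax of the per-class scores $w\cdot\Phi(x,y)$; a minimal sketch (the joint features and weights below are made up):

```python
import numpy as np

def conditional_gibbs(w, Phi_xy):
    """p_w[y | x] proportional to exp(w . Phi(x, y)), for a uniform prior p0[. | x].
    Phi_xy has shape (k, N): the joint features Phi(x, y) of one x for all k classes."""
    scores = Phi_xy @ w
    scores -= scores.max()                  # numerical stability; cancels in the ratio
    p = np.exp(scores)
    return p / p.sum()

# Hypothetical joint features for one x, with k = 3 classes and N = 2.
Phi_xy = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
w = np.array([2.0, -1.0])
print(conditional_gibbs(w, Phi_xy))         # a conditional distribution over the 3 classes
```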

38 Duality Theorem (Cortes et al., 2015) Theorem: the regularized conditional Maxent and regularized ML with conditional Gibbs distributions problems are equivalent, $\sup_{w\in\mathbb{R}^N}\widetilde G(w) = \min_p\widetilde F(p)$. Furthermore, let $p^* = \operatorname{argmin}_p\widetilde F(p)$; then, for any $\epsilon>0$, $\big|\widetilde G(w) - \sup_{w\in\mathbb{R}^N}\widetilde G(w)\big| < \epsilon \Rightarrow E_{x\sim\widehat p}\big[D\big(p^*[\cdot\mid x]\,\|\,p_w[\cdot\mid x]\big)\big]\le\epsilon$. page 38

39 Regularized Cond. Maxent (Berger et al., 1996; Cortes et al., 2015) Optimization problem: convex optimization, regularization parameter $\lambda\ge 0$: $\min_{w\in\mathbb{R}^N}\lambda\|w\|_1 - \frac{1}{m}\sum_{i=1}^m\log p_w[y_i\mid x_i]$ or $\min_{w\in\mathbb{R}^N}\lambda\|w\|_2^2 - \frac{1}{m}\sum_{i=1}^m\log p_w[y_i\mid x_i]$, where for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$, $p_w[y\mid x] = \frac{\exp(w\cdot\Phi(x,y))}{Z(x)}$ with $Z(x) = \sum_{y\in\mathcal{Y}}\exp(w\cdot\Phi(x,y))$. page 39

40 More Explicit Forms Optimization problem: $\min_{w\in\mathbb{R}^N}\lambda\big(\|w\|_1\ \text{or}\ \|w\|_2^2\big) + \frac{1}{m}\sum_{i=1}^m\log\sum_{y\in\mathcal{Y}}\exp\big(w\cdot\Phi(x_i,y) - w\cdot\Phi(x_i,y_i)\big)$, or equivalently $\min_{w\in\mathbb{R}^N}\lambda\big(\|w\|_1\ \text{or}\ \|w\|_2^2\big) - \frac{1}{m}\sum_{i=1}^m w\cdot\Phi(x_i,y_i) + \frac{1}{m}\sum_{i=1}^m\log\Big[\sum_{y\in\mathcal{Y}} e^{w\cdot\Phi(x_i,y)}\Big]$. page 40

41 Related Problem Optimization problem: log-sum-exp replaced by a max, $\min_{w\in\mathbb{R}^N}\lambda\big(\|w\|_1\ \text{or}\ \|w\|_2^2\big) + \frac{1}{m}\sum_{i=1}^m\max_{y\in\mathcal{Y}}\big(w\cdot\Phi(x_i,y) - w\cdot\Phi(x_i,y_i)\big)$; the max term is the negative of the margin of $w$ at $(x_i,y_i)$. page 41

42 Common Feature Choice Multi-class features: $\Phi(x,y)$ places the base feature vector $\Phi(x)$ in the block of class $y$ (zeros elsewhere), and $w = (w_1,\ldots,w_{|\mathcal{Y}|})$ stacks one weight vector per class, so that $w\cdot\Phi(x,y) = w_y\cdot\Phi(x)$. L2-regularized cond. Maxent optimization: $\min_w\lambda\sum_{y\in\mathcal{Y}}\|w_y\|_2^2 + \frac{1}{m}\sum_{i=1}^m\log\Big[\sum_{y\in\mathcal{Y}}\exp\big(w_y\cdot\Phi(x_i) - w_{y_i}\cdot\Phi(x_i)\big)\Big]$. page 42
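A sketch of this L2-regularized objective with one weight vector per class, minimized by full-batch gradient descent on made-up 2-D data (the data, step size, iteration count, and $\lambda$ are all illustrative assumptions):

```python
import numpy as np

def softmax_obj_grad(W, X, y, lam):
    """lam * sum_y ||w_y||^2 + (1/m) sum_i [log sum_y exp(w_y . x_i) - w_{y_i} . x_i],
    with per-class weights stacked in W of shape (k, d)."""
    m = X.shape[0]
    scores = X @ W.T                                     # entry (i, y) = w_y . x_i
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    log_Z = np.log(np.exp(scores).sum(axis=1))
    obj = lam * np.sum(W ** 2) + np.mean(log_Z - scores[np.arange(m), y])
    P = np.exp(scores - log_Z[:, None])                  # conditional probabilities p_W[y | x_i]
    Y = np.zeros_like(P)
    Y[np.arange(m), y] = 1.0
    grad = 2 * lam * W + (P - Y).T @ X / m
    return obj, grad

# Hypothetical 2-D data with k = 3 classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ([0, 0], [2, 0], [0, 2])])
y = np.repeat([0, 1, 2], 30)
W = np.zeros((3, 2))
for _ in range(300):                                     # full-batch gradient descent sketch
    obj, G = softmax_obj_grad(W, X, y, lam=1e-3)
    W -= 0.5 * G
print(obj, np.mean(np.argmax(X @ W.T, axis=1) == y))     # objective and training accuracy
```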

43 Prediction Prediction with $p_w[y\mid x] = \frac{\exp(w\cdot\Phi(x,y))}{Z(x)}$: $\widehat y(x) = \operatorname{argmax}_{y\in\mathcal{Y}} p_w[y\mid x] = \operatorname{argmax}_{y\in\mathcal{Y}} w\cdot\Phi(x,y)$. page 43

44 Binary Classification Simpler expression: $\sum_{y\in\mathcal{Y}}\exp\big(w\cdot\Phi(x_i,y) - w\cdot\Phi(x_i,y_i)\big) = e^{w\cdot\Phi(x_i,+1) - w\cdot\Phi(x_i,y_i)} + e^{w\cdot\Phi(x_i,-1) - w\cdot\Phi(x_i,y_i)} = 1 + e^{-y_i\,w\cdot[\Phi(x_i,+1) - \Phi(x_i,-1)]} = 1 + e^{-y_i\,w\cdot\Psi(x_i)}$, with $\Psi(x) = \Phi(x,+1) - \Phi(x,-1)$. page 44
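A quick numerical check of this identity, with random vectors standing in for the two feature values $\Phi(x,+1)$ and $\Phi(x,-1)$:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=4)
Phi_plus, Phi_minus = rng.normal(size=4), rng.normal(size=4)   # Phi(x, +1), Phi(x, -1)
Psi = Phi_plus - Phi_minus

for y, Phi_y in [(+1, Phi_plus), (-1, Phi_minus)]:
    lhs = sum(np.exp(w @ Phi - w @ Phi_y) for Phi in (Phi_plus, Phi_minus))
    rhs = 1.0 + np.exp(-y * (w @ Psi))
    print(np.isclose(lhs, rhs))             # True for both labels
```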

45 Logistic Regression (Berkson, 1944) Binary case of conditional Maxent. Optimization problem: $\min_{w\in\mathbb{R}^N}\lambda\big(\|w\|_1\ \text{or}\ \|w\|_2^2\big) + \frac{1}{m}\sum_{i=1}^m\log\big[1 + e^{-y_i\,w\cdot\Psi(x_i)}\big]$. Convex optimization; variety of solutions: SGD, coordinate descent, etc. Coordinate descent: similar to AdaBoost but with the logistic loss $u\mapsto\log_2(1 + e^{-u})\ge 1_{u\le 0}$ instead of the exponential loss. page 45
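Among the solution methods listed above, SGD is the simplest to sketch. Below is a minimal L2-regularized version on made-up, roughly separable 2-D data, where $\Psi(x_i)$ is just the raw feature vector (labels in $\{-1,+1\}$; the step size, $\lambda$, and epoch count are illustrative assumptions):

```python
import numpy as np

def sgd_logistic(X, y, lam=1e-3, eta=0.1, epochs=50, seed=0):
    """SGD on lam * ||w||^2 + (1/m) sum_i log(1 + exp(-y_i w . x_i)), labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * (w @ X[i])
            grad_i = -y[i] * X[i] / (1.0 + np.exp(margin)) + 2 * lam * w
            w -= eta * grad_i
    return w

# Hypothetical two-class data.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(+1, 1, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)
w = sgd_logistic(X, y)
print(np.mean(np.sign(X @ w) == y))         # training accuracy of the learned linear rule
```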

46 Generalization Bound Theorem: assume that $\pm\Psi_j\in\mathcal{H}$ for all $j\in[1,N]$. Then, for any $\delta>0$, with probability at least $1-\delta$ over the draw of a sample $S$ of size $m$, for all $f\colon x\mapsto w\cdot\Psi(x)$, $R(f)\le\frac{1}{m}\sum_{i=1}^m\log_{u_0}\big(1 + e^{-y_i\,w\cdot\Psi(x_i)}\big) + 4\|w\|_1\mathfrak{R}_m(\mathcal{H}) + \sqrt{\tfrac{\log\log_2(2\|w\|_1)}{m}} + \sqrt{\tfrac{\log\frac{2}{\delta}}{2m}}$, where $u_0 = \log(1 + 1/e)$. page 46

47 Proof Proof: by the learning bound for convex ensembles holding uniformly for all $\rho$, with probability at least $1-\delta$, for all $f$ and $\rho>0$, $R(f)\le\frac{1}{m}\sum_{i=1}^m 1_{\frac{y_i\,w\cdot\Psi(x_i)}{\|w\|_1}\le\rho} + \frac{4}{\rho}\mathfrak{R}_m(\mathcal{H}) + \sqrt{\tfrac{\log\log_2\frac{2}{\rho}}{m}} + \sqrt{\tfrac{\log\frac{2}{\delta}}{2m}}$. Choosing $\rho = \frac{1}{\|w\|_1}$ and using $1_{u\le 0}\le\log_{u_0}(1 + e^{-u-1})$ yields immediately the learning bound of the theorem. page 47

48 Logistic Regression (Berkson, 1944) Logistic model: $\Pr[y = +1\mid x] = \frac{e^{w\cdot\Phi(x,+1)}}{Z(x)}$, where $Z(x) = e^{w\cdot\Phi(x,+1)} + e^{w\cdot\Phi(x,-1)}$. Properties: linear decision rule, sign of the log-odds ratio: $\log\frac{\Pr[y=+1\mid x]}{\Pr[y=-1\mid x]} = w\cdot\big[\Phi(x,+1) - \Phi(x,-1)\big] = w\cdot\Psi(x)$; logistic form: $\Pr[y=+1\mid x] = \frac{1}{1 + e^{-w\cdot[\Phi(x,+1)-\Phi(x,-1)]}} = \frac{1}{1 + e^{-w\cdot\Psi(x)}}$. page 48
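Both properties can be checked numerically, again with random vectors standing in for the feature values (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.normal(size=5)
Phi_plus, Phi_minus = rng.normal(size=5), rng.normal(size=5)   # Phi(x, +1), Phi(x, -1)
Psi = Phi_plus - Phi_minus

p_plus = np.exp(w @ Phi_plus) / (np.exp(w @ Phi_plus) + np.exp(w @ Phi_minus))
print(np.isclose(np.log(p_plus / (1 - p_plus)), w @ Psi))      # log-odds ratio = w . Psi(x)
print(np.isclose(p_plus, 1.0 / (1.0 + np.exp(-(w @ Psi)))))    # logistic form of Pr[y=+1|x]
```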

49 Logistic/Sigmoid Function $f\colon x\mapsto\frac{1}{1 + e^{-x}}$; $\Pr[y = +1\mid x] = f\big(w\cdot\Psi(x)\big)$. page 49

50 Applications Natural language processing (Berger et al., 1996; Rosenfeld, 1996; Della Pietra et al., 1997; Malouf, 2002; Manning and Klein, 2003; Mann et al., 2009; Ratnaparkhi, 2010). Species habitat modeling (Phillips et al., 2004, 2006; Dudík et al., 2007; Elith et al., 2011). Computer vision (Jeon and Manmatha, 2004). page 50

51 Extensions Extensive theoretical study of alternative regularizations: (Dudík et al., 2007) (see also (Altun and Smola, 2006) though some proofs unclear). Maxent models with other Bregman divergences (see for example (Altun and Smola, 2006)). Structural Maxent models (Cortes et al., 2015): extension to the case of multiple feature families. empirically outperform Maxent and L1-Maxent. conditional structural Maxent: coincide with deep boosting using the logistic loss. page 51

52 Conclusion Logistic regression/Maxent models: theoretical foundation. natural solution when probabilities are required. widely used for density estimation/classification. often very effective in practice. distributed optimization solutions. no natural non-linear L1-version (use of kernels). connections with boosting. connections with neural networks. page 52

53 References Yasemin Altun and Alexander J. Smola. Unifying divergence minimization and statistical inference via convex duality. In COLT, 2006. Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996. Joseph Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39, 1944. Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7, 1967. Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Umar Syed. Structural Maxent models. In Proceedings of ICML, 2015. page 53

54 References Imre Csiszár and Gábor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1, 1984. Imre Csiszár. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3), 1989. J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), 1972. Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), April 1997. Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. JMLR, 8, 2007. page 54

55 References E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106, 1957. E. T. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983. Solomon Kullback and Richard A. Leibler. On information and sufficiency. Ann. Math. Statist., 22(1):79-86, 1951. O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer. page 55

56 References Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. University of California Press, 1961. Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10, 1996. Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948. page 56
