Foundations of Machine Learning
1 Maximum Entropy Models, Logistic Regression
Mehryar Mohri, Courant Institute and Google Research
page 1
2 Motivation
Probabilistic models: density estimation; classification.
3 This Lecture
Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models.
4 Entropy (Shannon, 1948)
Definition: the entropy of a discrete random variable X with probability mass function p(x) = \Pr[X = x] is
  H(X) = \mathrm{E}[-\log p(X)] = -\sum_{x \in X} p(x) \log p(x).
Properties:
- measure of uncertainty of X.
- H(X) \geq 0.
- maximal for the uniform distribution: for a finite support of size N, by Jensen's inequality,
  H(X) = \mathrm{E}\left[\log \frac{1}{p(X)}\right] \leq \log \mathrm{E}\left[\frac{1}{p(X)}\right] = \log N.
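As a quick numerical check of the definition above, a minimal sketch in Python (the function name and example distributions are illustrative, not from the lecture):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), natural log, 0 log 0 = 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

# The uniform distribution over N = 4 points attains the maximum log N.
uniform = [0.25] * 4
peaked = [0.97, 0.01, 0.01, 0.01]
print(entropy(uniform))                     # equals log 4 ≈ 1.386
print(entropy(peaked) < entropy(uniform))   # True: less uncertainty
```

A point mass has zero entropy, matching H(X) ≥ 0 with equality for a deterministic variable.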
5 Entropy
Base of logarithm: not critical; for base 2, -\log_2(p(x)) is the number of bits needed to represent p(x).
Definition and notation: the entropy of a distribution p is defined by the same quantity and denoted by H(p).
Special case of Rényi entropy (Rényi, 1961).
Binary entropy: for p \in [0, 1], H(p) = -p \log p - (1 - p) \log(1 - p).
6 Relative Entropy (Shannon, 1948; Kullback and Leibler, 1951)
Definition: the relative entropy (or Kullback-Leibler divergence) between two distributions p and q (discrete case) is
  D(p \| q) = \mathrm{E}_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)},
with the conventions 0 \log \frac{0}{q} = 0 and p \log \frac{p}{0} = +\infty.
Properties:
- asymmetric: in general, D(p \| q) \neq D(q \| p) for p \neq q.
- non-negative: D(p \| q) \geq 0 for all p and q.
- definite: D(p \| q) = 0 implies p = q.
7 Non-Negativity of Rel. Entropy
By the concavity of log and Jensen's inequality,
  -D(p \| q) = \sum_{x : p(x) > 0} p(x) \log \frac{q(x)}{p(x)}
             \leq \log \sum_{x : p(x) > 0} p(x) \frac{q(x)}{p(x)}
             = \log \sum_{x : p(x) > 0} q(x)
             \leq \log(1) = 0.
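The three properties of relative entropy can be checked numerically; a small sketch (function name illustrative) implementing the definition with its two conventions:

```python
import math

def kl(p, q):
    """D(p || q) with conventions 0 log(0/q) = 0 and p log(p/0) = +inf."""
    d = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue           # 0 log(0/q) = 0
        if qx == 0:
            return math.inf    # p log(p/0) = +inf
        d += px * math.log(px / qx)
    return d

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q), kl(q, p))  # asymmetric: the two values differ
print(kl(p, p))            # definite: 0 iff p = q
```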
8 Bregman Divergence (Bregman, 1967)
Definition: let F be a convex and differentiable function defined over a convex set C in a Hilbert space H. Then, the Bregman divergence B_F associated to F is defined by
  B_F(x \| y) = F(x) - F(y) - \langle \nabla F(y), x - y \rangle.
[Figure: B_F(x \| y) is the gap at x between F(x) and the tangent to F at y, F(y) + \langle \nabla F(y), x - y \rangle.]
9 Bregman Divergence
Examples:
                            B_F(x \| y)                        F(x)
  Squared L2-distance       \|x - y\|^2                        \|x\|^2
  Mahalanobis distance      (x - y)^T K^{-1} (x - y)           x^T K^{-1} x
  Unnorm. relative entropy  \widetilde{D}(x \| y)              \sum_{i \in I} x_i \log x_i - x_i
note: the relative entropy is not a Bregman divergence since it is not defined over an open set; but, on the simplex, it coincides with the unnormalized relative entropy
  \widetilde{D}(p \| q) = \sum_{x \in X} \left[ p(x) \log \frac{p(x)}{q(x)} + q(x) - p(x) \right].
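A minimal sketch of the general Bregman construction, checking two rows of the table above: F(x) = \|x\|^2 recovers the squared L2-distance, and F(x) = \sum_i x_i \log x_i - x_i recovers the unnormalized relative entropy (which equals the relative entropy on the simplex):

```python
import math

def bregman(F, gradF, x, y):
    """B_F(x || y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - sum(g * (xi - yi) for g, xi, yi in zip(gradF(y), x, y))

# F(x) = ||x||^2  =>  B_F(x || y) = ||x - y||^2
sq = lambda x: sum(v * v for v in x)
grad_sq = lambda x: [2 * v for v in x]

# F(x) = sum_i x_i log x_i - x_i  =>  unnormalized relative entropy
neg_ent = lambda x: sum(v * math.log(v) - v for v in x)
grad_neg_ent = lambda x: [math.log(v) for v in x]

x, y = [0.2, 0.8], [0.5, 0.5]   # both on the simplex
print(bregman(sq, grad_sq, x, y))            # = ||x - y||^2 = 0.18
print(bregman(neg_ent, grad_neg_ent, x, y))  # = D(x || y) since x, y sum to 1
```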
10 Conditional Relative Entropy
Definition: let p and q be two probability distributions over X \times Y. Then, the conditional relative entropy of p and q with respect to a distribution r over X is defined by
  \mathrm{E}_{x \sim r}\left[ D\big(p(\cdot \mid x) \| q(\cdot \mid x)\big) \right] = \sum_{x \in X} r(x) \sum_{y \in Y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)} = D(\tilde{p} \| \tilde{q}),
with \tilde{p}(x, y) = r(x) p(y \mid x), \tilde{q}(x, y) = r(x) q(y \mid x), and the conventions 0 \log 0 = 0, 0 \log \frac{0}{0} = 0, and p \log \frac{p}{0} = +\infty.
note: the definition of conditional relative entropy is not intrinsic; it depends on a third distribution r.
11 This Lecture
Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models.
12 Density Estimation Problem
Training data: sample S of size m drawn i.i.d. from set X according to some distribution D,
  S = (x_1, \ldots, x_m).
Problem: find the distribution p out of a hypothesis set P that best estimates D.
13 Maximum Likelihood Solution
Maximum Likelihood principle: select the distribution p \in P maximizing the likelihood of the observed sample S,
  p_{ML} = \mathrm{argmax}_{p \in P} \Pr[S \mid p] = \mathrm{argmax}_{p \in P} \prod_{i=1}^m p(x_i) = \mathrm{argmax}_{p \in P} \sum_{i=1}^m \log p(x_i).
14 Relative Entropy Formulation
Lemma: let \hat{p}_S be the empirical distribution for sample S; then
  p_{ML} = \mathrm{argmin}_{p \in P} D(\hat{p}_S \| p).
Proof:
  D(\hat{p}_S \| p) = \sum_x \hat{p}_S(x) \log \hat{p}_S(x) - \sum_x \hat{p}_S(x) \log p(x)
                    = -H(\hat{p}_S) - \sum_x \frac{\sum_{i=1}^m 1_{x = x_i}}{m} \log p(x)
                    = -H(\hat{p}_S) - \frac{1}{m} \sum_{i=1}^m \sum_x 1_{x = x_i} \log p(x)
                    = -H(\hat{p}_S) - \frac{\sum_{i=1}^m \log p(x_i)}{m}.
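The identity in the proof is easy to verify numerically on a toy sample (the sample and candidate model below are illustrative): the relative entropy to the empirical distribution differs from the negative average log-likelihood only by the constant H(\hat{p}_S), so minimizing one maximizes the other.

```python
import math
from collections import Counter

# Check: D(p̂_S || p) = -H(p̂_S) - (1/m) sum_i log p(x_i).
S = ["a", "a", "b", "c", "a", "b"]
m = len(S)
emp = {x: c / m for x, c in Counter(S).items()}   # empirical distribution p̂_S
p = {"a": 0.5, "b": 0.3, "c": 0.2}                # candidate model

D = sum(px * math.log(px / p[x]) for x, px in emp.items())
H = -sum(px * math.log(px) for px in emp.values())
avg_ll = sum(math.log(p[x]) for x in S) / m
print(abs(D - (-H - avg_ll)) < 1e-12)  # True
```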
15 Maximum a Posteriori (MAP)
Maximum a Posteriori principle: select the distribution p \in P that is the most likely, given the observed sample S and assuming a prior distribution \Pr[p] over P,
  p_{MAP} = \mathrm{argmax}_{p \in P} \Pr[p \mid S] = \mathrm{argmax}_{p \in P} \frac{\Pr[S \mid p] \Pr[p]}{\Pr[S]} = \mathrm{argmax}_{p \in P} \Pr[S \mid p] \Pr[p].
note: for a uniform prior, ML = MAP.
16 This Lecture
Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models.
17 Density Estimation + Features
Training data: sample S of size m drawn i.i.d. from set X according to some distribution D,
  S = (x_1, \ldots, x_m).
Features: associated to elements of X,
  \Phi: X \to R^N, \quad x \mapsto \Phi(x) = (\Phi_1(x), \ldots, \Phi_N(x))^T.
Problem: find the distribution p out of a hypothesis set P that best estimates D.
For simplicity, in what follows, X is assumed to be finite.
18 Features
Feature functions \Phi_j assumed to be in H, with \|\Phi\|_\infty \leq \Lambda.
Examples of H:
- family of threshold functions defined over the N variables: \{x \mapsto 1_{x_j \leq \theta} : j \in [1, N], \theta \in R\}.
- functions defined via decision trees with larger depths.
- k-degree monomials of the original features.
- zero-one features (often used in NLP, e.g., presence/absence of a word or POS tag).
19 Maximum Entropy Principle (E. T. Jaynes, 1957, 1983)
Idea: the empirical feature vector average is close to its expectation. For any \delta > 0, with probability at least 1 - \delta,
  \left\| \mathrm{E}_{x \sim D}[\Phi(x)] - \mathrm{E}_{x \sim \hat{D}}[\Phi(x)] \right\|_\infty \leq 2 R_m(H) + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.
Maxent principle: find the distribution p that is closest to a prior distribution p_0 (typically the uniform distribution) while verifying
  \left\| \mathrm{E}_{x \sim p}[\Phi(x)] - \mathrm{E}_{x \sim \hat{D}}[\Phi(x)] \right\|_\infty \leq \lambda.
Closeness is measured using relative entropy.
note: no family P needs to be specified.
20 Maxent Formulation
Optimization problem:
  \min_{p \in \Delta} D(p \| p_0)
  subject to: \left\| \mathrm{E}_{x \sim p}[\Phi(x)] - \mathrm{E}_{x \sim S}[\Phi(x)] \right\|_\infty \leq \lambda.
Convex optimization problem, unique solution.
\lambda = 0: standard Maxent (or unregularized Maxent).
\lambda > 0: regularized Maxent.
21 Relation with Entropy
Relationship with entropy: for a uniform prior p_0,
  D(p \| p_0) = \sum_{x \in X} p(x) \log \frac{p(x)}{p_0(x)} = -\sum_{x \in X} p(x) \log p_0(x) + \sum_{x \in X} p(x) \log p(x) = \log |X| - H(p).
22 Maxent Problem
Optimization: convex optimization problem,
  \min_p \sum_{x \in X} p(x) \log p(x)
  subject to: p(x) \geq 0, \forall x \in X;
              \sum_{x \in X} p(x) = 1;
              \left| \sum_{x \in X} p(x) \Phi_j(x) - \frac{1}{m} \sum_{i=1}^m \Phi_j(x_i) \right| \leq \lambda, \forall j \in [1, N].
23 Gibbs Distributions
Gibbs distributions: set Q of distributions p_w, with w \in R^N,
  p_w[x] = \frac{p_0[x] \exp(w \cdot \Phi(x))}{Z} = \frac{p_0[x] \exp\left(\sum_{j=1}^N w_j \Phi_j(x)\right)}{Z},
with Z = \sum_x p_0[x] \exp(w \cdot \Phi(x)).
Rich family:
- for linear and quadratic features: includes Gaussians and other distributions with non-PSD quadratic forms in the exponents.
- for higher-degree polynomials of the raw features: more complex multi-modal distributions.
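On a small finite set, a Gibbs distribution is just an exponential reweighting of the prior followed by normalization. A minimal sketch (the feature map and weights are illustrative):

```python
import math

# Gibbs distribution p_w[x] ∝ p_0[x] exp(w · Φ(x)) over a small finite set X.
X = [0, 1, 2, 3]
p0 = [0.25] * 4                 # uniform prior
phi = lambda x: [x, x * x]      # two illustrative features
w = [0.3, -0.1]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
weights = [p0[i] * math.exp(dot(w, phi(x))) for i, x in enumerate(X)]
Z = sum(weights)                # normalization constant
p_w = [v / Z for v in weights]
print(sum(p_w))                 # 1.0: normalized by construction
```

With w = 0 this reduces to the prior p_0, matching the Maxent view of p_w as a perturbation of p_0 by the features.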
24 Dual Problems
Regularized Maxent problem:
  \min_p F(p) = \widetilde{D}(p \| p_0) + I_C\big(\mathrm{E}_p[\Phi]\big),
with \widetilde{D}(p \| p_0) = D(p \| p_0) if p \in \Delta, +\infty otherwise;
  C = \{u : \| u - \mathrm{E}_S[\Phi] \|_\infty \leq \lambda \};
  I_C(x) = 0 if x \in C, I_C(x) = +\infty otherwise.
Regularized Maximum Likelihood problem with Gibbs distributions:
  \sup_w G(w) = \frac{1}{m} \sum_{i=1}^m \log \frac{p_w[x_i]}{p_0[x_i]} - \lambda \|w\|_1.
25 Duality Theorem (Della Pietra et al., 1997; Dudík et al., 2007; Cortes et al., 2015)
Theorem: the regularized Maxent and regularized ML with Gibbs distributions problems are equivalent,
  \sup_{w \in R^N} G(w) = \min_p F(p).
Furthermore, let p^* = \mathrm{argmin}_p F(p); then, for any \epsilon > 0,
  \left| G(w) - \sup_{w \in R^N} G(w) \right| < \epsilon \Rightarrow D(p^* \| p_w) \leq \epsilon.
26 Notes
Maxent formulation: no explicit restriction to a family of distributions P.
But the solution coincides with regularized ML over a specific family P!
More general Bregman divergence-based formulations exist.
27 L1-Regularized Maxent
Optimization problem:
  \inf_{w \in R^N} \lambda \|w\|_1 - \frac{1}{m} \sum_{i=1}^m \log p_w[x_i],
where p_w[x] = \frac{1}{Z} \exp\left(\sum_{j=1}^N w_j \Phi_j(x)\right).
Bayesian interpretation: equivalent to MAP with a Laplacian prior q_{prior}(w) (Williams, 1994; Kazama and Tsujii, 2003),
  \max_w \log \left[ \prod_{i=1}^m p_w[x_i] \, q_{prior}(w) \right],
with q_{prior}(w) = \prod_{j=1}^N \frac{\beta_j}{2} \exp(-\beta_j |w_j|).
28 Generalization Guarantee (Dudík et al., 2007)
Notation: L_D(w) = \mathrm{E}_{x \sim D}[-\log p_w[x]], L_S(w) = \mathrm{E}_{x \sim S}[-\log p_w[x]].
Theorem: fix \delta > 0. Let \hat{w} be the solution of the L1-regularized Maxent problem for \lambda = 2 R_m(H) + \sqrt{\log(2/\delta)/(2m)}. Then, with probability at least 1 - \delta,
  L_D(\hat{w}) \leq \inf_w L_D(w) + 2 \|w\|_1 \left[ 2 R_m(H) + \sqrt{\frac{\log \frac{2}{\delta}}{2m}} \right].
29 Proof
By Hölder's inequality and the concentration bound for average feature vectors,
  L_D(\hat{w}) - L_S(\hat{w}) = \hat{w} \cdot \left[\mathrm{E}_S[\Phi] - \mathrm{E}_D[\Phi]\right] \leq \|\hat{w}\|_1 \left\| \mathrm{E}_S[\Phi] - \mathrm{E}_D[\Phi] \right\|_\infty \leq \lambda \|\hat{w}\|_1.
Since \hat{w} is a minimizer,
  L_D(\hat{w}) - L_D(w) = L_D(\hat{w}) - L_S(\hat{w}) + L_S(\hat{w}) - L_D(w)
                        \leq \lambda \|\hat{w}\|_1 + L_S(\hat{w}) - L_D(w)
                        \leq \lambda \|w\|_1 + L_S(w) - L_D(w)
                        \leq 2 \lambda \|w\|_1.
30 L2-Regularized Maxent (Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001)
Different relaxations:
- L1 constraints: \forall j \in [1, N], \left| \mathrm{E}_{x \sim p}[\Phi_j(x)] - \mathrm{E}_{x \sim \hat{p}}[\Phi_j(x)] \right| \leq \beta_j.
- L2 constraints: \left\| \mathrm{E}_{x \sim p}[\Phi(x)] - \mathrm{E}_{x \sim \hat{p}}[\Phi(x)] \right\|_2 \leq B.
31 L2-Regularized Maxent
Optimization problem:
  \inf_{w \in R^N} \lambda \|w\|_2^2 - \frac{1}{m} \sum_{i=1}^m \log p_w[x_i],
where p_w[x] = \frac{1}{Z} \exp(w \cdot \Phi(x)).
Bayesian interpretation: equivalent to MAP with a Gaussian prior q_{prior}(w) (Goodman, 2004),
  \max_w \log \left[ \prod_{i=1}^m p_w[x_i] \, q_{prior}(w) \right],
with q_{prior}(w) = \prod_{j=1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{w_j^2}{2 \sigma^2}}.
32 This Lecture
Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models.
33 Conditional Maxent Models
Maxent models for conditional probabilities:
- conditional probability modeling each class.
- used in multi-class classification.
- can use different features for each class.
- a.k.a. multinomial logistic regression.
Logistic regression: special case of two classes.
34 Problem
Data: sample drawn i.i.d. according to some distribution D,
  S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m,
with Y = \{1, \ldots, k\}, or Y = \{0, 1\}^k in the multi-label case.
Features: mapping \Phi: X \times Y \to R^N.
Problem: find accurate conditional probability models \Pr[\cdot \mid x], x \in X, based on \Phi.
35 Conditional Maxent Principle (Berger et al., 1996; Cortes et al., 2015)
Idea: the empirical feature vector average is close to its expectation. For any \delta > 0, with probability at least 1 - \delta,
  \left\| \mathrm{E}_{(x,y) \sim D}[\Phi(x, y)] - \mathrm{E}_{(x,y) \sim \hat{D}}[\Phi(x, y)] \right\|_\infty \leq 2 R_m(H) + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.
Maxent principle: find conditional distributions p[\cdot \mid x] that are closest to priors p_0[\cdot \mid x] (typically uniform distributions) while verifying
  \left\| \mathrm{E}_{x \sim \hat{p}, y \sim p[\cdot \mid x]}[\Phi(x, y)] - \mathrm{E}_{(x,y) \sim \hat{D}}[\Phi(x, y)] \right\|_\infty \leq \lambda.
Closeness is measured using the conditional relative entropy based on \hat{p}.
36 Cond. Maxent Formulation (Berger et al., 1996; Cortes et al., 2015)
Optimization problem:
  \min_{p[\cdot \mid x] \in \Delta} \sum_{x \in X} \hat{p}[x] \, D\big(p[\cdot \mid x] \| p_0[\cdot \mid x]\big)
  s.t. \left\| \mathrm{E}_{x \sim \hat{p}} \mathrm{E}_{y \sim p[\cdot \mid x]}[\Phi(x, y)] - \mathrm{E}_{(x,y) \sim S}[\Phi(x, y)] \right\|_\infty \leq \lambda.
Solution of a convex optimization problem, unique solution.
\lambda = 0: unregularized conditional Maxent.
\lambda > 0: regularized conditional Maxent.
37 Dual Problems
Regularized conditional Maxent problem:
  \widetilde{F}(p) = \mathrm{E}_{x \sim \hat{p}}\left[ D\big(p[\cdot \mid x] \| p_0[\cdot \mid x]\big) \right] + I_C\left( \mathrm{E}_{x \sim \hat{p}} \mathrm{E}_{y \sim p[\cdot \mid x]}[\Phi] \right).
Regularized Maximum Likelihood problem with conditional Gibbs distributions:
  \widetilde{G}(w) = \frac{1}{m} \sum_{i=1}^m \log \frac{p_w[y_i \mid x_i]}{p_0[y_i \mid x_i]} - \lambda \|w\|_1,
where, for all (x, y) \in X \times Y,
  p_w[y \mid x] = \frac{p_0[y \mid x] \exp(w \cdot \Phi(x, y))}{Z(x)}, \quad Z(x) = \sum_{y \in Y} p_0[y \mid x] \exp(w \cdot \Phi(x, y)).
38 Duality Theorem (Cortes et al., 2015)
Theorem: the regularized conditional Maxent and regularized ML with conditional Gibbs distributions problems are equivalent,
  \sup_{w \in R^N} \widetilde{G}(w) = \min_p \widetilde{F}(p).
Furthermore, let p^* = \mathrm{argmin}_p \widetilde{F}(p); then, for any \epsilon > 0,
  \left| \widetilde{G}(w) - \sup_{w \in R^N} \widetilde{G}(w) \right| < \epsilon \Rightarrow \mathrm{E}_{x \sim \hat{p}}\left[ D\big(p^*[\cdot \mid x] \| p_w[\cdot \mid x]\big) \right] \leq \epsilon.
39 Regularized Cond. Maxent (Berger et al., 1996; Cortes et al., 2015)
Optimization problem: convex optimizations, regularization parameter \lambda \geq 0,
  \min_{w \in R^N} \lambda \|w\|_1 - \frac{1}{m} \sum_{i=1}^m \log p_w[y_i \mid x_i]
or
  \min_{w \in R^N} \lambda \|w\|_2^2 - \frac{1}{m} \sum_{i=1}^m \log p_w[y_i \mid x_i],
where, for all (x, y) \in X \times Y,
  p_w[y \mid x] = \frac{\exp(w \cdot \Phi(x, y))}{Z(x)}, \quad Z(x) = \sum_{y \in Y} \exp(w \cdot \Phi(x, y)).
40 More Explicit Forms
Optimization problem:
  \min_{w \in R^N} \lambda \|w\|_1 \text{ (or } \lambda \|w\|_2^2 \text{)} - \frac{1}{m} \sum_{i=1}^m w \cdot \Phi(x_i, y_i) + \frac{1}{m} \sum_{i=1}^m \log \left[ \sum_{y \in Y} e^{w \cdot \Phi(x_i, y)} \right].
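The explicit objective above is straightforward to evaluate on a toy problem. A minimal sketch (function and feature map are illustrative) computing the L2-regularized conditional Maxent objective; at w = 0 every conditional is uniform, so the objective equals \log |Y|:

```python
import math

def cond_maxent_objective(w, data, phi, labels, lam):
    """lam ||w||^2 - (1/m) sum_i w·Φ(x_i, y_i) + (1/m) sum_i log sum_y exp(w·Φ(x_i, y))."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    m = len(data)
    reg = lam * sum(v * v for v in w)
    fit = 0.0
    for x, y in data:
        scores = {yy: dot(w, phi(x, yy)) for yy in labels}
        fit += math.log(sum(math.exp(s) for s in scores.values())) - scores[y]
    return reg + fit / m

# Toy setup: Y = {0, 1}; Φ(x, y) places the input features in block y
# (the common multi-class feature choice described on a later slide).
phi = lambda x, y: [x[0], x[1], 0, 0] if y == 0 else [0, 0, x[0], x[1]]
data = [((1.0, 0.5), 0), ((-1.0, 0.2), 1)]
print(cond_maxent_objective([0.0] * 4, data, phi, [0, 1], lam=0.1))  # = log 2 at w = 0
```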
41 Related Problem
Optimization problem: log-sum-exp replaced by a max,
  \min_{w \in R^N} \lambda \|w\|_1 \text{ (or } \lambda \|w\|_2^2 \text{)} + \frac{1}{m} \sum_{i=1}^m \max_{y \in Y} \left[ w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i) \right].
42 Common Feature Choice
Multi-class features: \Phi(x, y) is the vector whose block y equals \Phi(x) and whose other blocks are zero, with w = (w_1, \ldots, w_{|Y|}) stacked accordingly; then
  w \cdot \Phi(x, y) = w_y \cdot \Phi(x).
L2-regularized cond. Maxent optimization:
  \min_{w \in R^N} \lambda \sum_{y \in Y} \|w_y\|_2^2 + \frac{1}{m} \sum_{i=1}^m \log \left[ \sum_{y \in Y} \exp\left( w_y \cdot \Phi(x_i) - w_{y_i} \cdot \Phi(x_i) \right) \right].
43 Prediction
Prediction with p_w[y \mid x] = \frac{\exp(w \cdot \Phi(x, y))}{Z(x)}:
  \hat{y}(x) = \mathrm{argmax}_{y \in Y} p_w[y \mid x] = \mathrm{argmax}_{y \in Y} w \cdot \Phi(x, y).
44 Binary Classification
Simpler expression:
  \sum_{y \in Y} \exp\left( w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i) \right)
    = e^{w \cdot \Phi(x_i, +1) - w \cdot \Phi(x_i, y_i)} + e^{w \cdot \Phi(x_i, -1) - w \cdot \Phi(x_i, y_i)}
    = 1 + e^{-y_i w \cdot [\Phi(x_i, +1) - \Phi(x_i, -1)]}
    = 1 + e^{-y_i w \cdot \Psi(x_i)},
with \Psi(x) = \Phi(x, +1) - \Phi(x, -1).
45 Logistic Regression (Berkson, 1944)
Binary case of conditional Maxent.
Optimization problem:
  \min_{w \in R^N} \lambda \|w\|_1 \text{ (or } \lambda \|w\|_2^2 \text{)} + \frac{1}{m} \sum_{i=1}^m \log \left[ 1 + e^{-y_i w \cdot \Psi(x_i)} \right].
Convex optimization; variety of solutions: SGD, coordinate descent, etc.
Coordinate descent: similar to AdaBoost with the logistic loss \log_2(1 + e^{-u}) \geq 1_{u \leq 0} instead of the exponential loss.
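The lecture mentions SGD and coordinate descent; a minimal sketch of the simplest alternative, plain gradient descent on the L2-regularized logistic objective above (hyperparameters and the toy data are illustrative, not from the lecture):

```python
import math

def train_logreg(data, lam=0.01, lr=0.5, steps=500):
    """Gradient descent on lam ||w||^2 + (1/m) sum_i log(1 + exp(-y_i w·x_i))."""
    n = len(data[0][0])
    w = [0.0] * n
    m = len(data)
    for _ in range(steps):
        grad = [2 * lam * wi for wi in w]              # gradient of the regularizer
        for x, y in data:                              # labels y in {-1, +1}
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            coef = -y / (1.0 + math.exp(margin))       # d/du log(1 + e^{-u}) chain rule
            for j in range(n):
                grad[j] += coef * x[j] / m
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

data = [((1.0, 2.0), +1), ((2.0, 1.0), +1), ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]
w = train_logreg(data)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x, _ in data]
print(preds)  # separable toy data: matches the labels
```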
46 Generalization Bound
Theorem: assume that \pm \Phi_j \in H for all j \in [1, N]. Then, for any \delta > 0, with probability at least 1 - \delta over the draw of a sample S of size m, for all f: x \mapsto w \cdot \Psi(x),
  R(f) \leq \frac{1}{m} \sum_{i=1}^m \frac{1}{u_0} \log\left( 1 + e^{-y_i w \cdot \Psi(x_i)} \right) + 4 \|w\|_1 R_m(H) + \sqrt{\frac{\log \log_2 (2 \|w\|_1)}{m}} + \sqrt{\frac{\log \frac{2}{\delta}}{2m}},
where u_0 = \log(1 + 1/e).
47 Proof
Proof: by the learning bound for convex ensembles holding uniformly for all \rho, with probability at least 1 - \delta, for all f and \rho > 0,
  R(f) \leq \frac{1}{m} \sum_{i=1}^m 1_{\frac{y_i w \cdot \Psi(x_i)}{\|w\|_1} \leq \rho} + \frac{4}{\rho} R_m(H) + \sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}} + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.
Choosing \rho = \frac{1}{\|w\|_1} and using 1_{u \leq 1} \leq \frac{1}{u_0} \log(1 + e^{-u}) yields immediately the learning bound of the theorem.
48 Logistic Regression (Berkson, 1944)
Logistic model:
  \Pr[y = +1 \mid x] = \frac{e^{w \cdot \Phi(x, +1)}}{Z(x)}, \quad Z(x) = e^{w \cdot \Phi(x, +1)} + e^{w \cdot \Phi(x, -1)}.
Properties:
- linear decision rule, sign of the log-odds ratio:
  \log \frac{\Pr[y = +1 \mid x]}{\Pr[y = -1 \mid x]} = w \cdot [\Phi(x, +1) - \Phi(x, -1)] = w \cdot \Psi(x).
- logistic form:
  \Pr[y = +1 \mid x] = \frac{1}{1 + e^{-w \cdot [\Phi(x, +1) - \Phi(x, -1)]}} = \frac{1}{1 + e^{-w \cdot \Psi(x)}}.
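The reduction from the two-class Gibbs form to the sigmoid of w·Ψ(x) can be checked numerically; a small sketch with an illustrative feature map (not from the lecture):

```python
import math

# Two-class conditional Maxent probability vs. sigmoid of w·Ψ(x),
# with Ψ(x) = Φ(x, +1) - Φ(x, -1).
phi = lambda x, y: [x * y, y]        # toy Φ(x, y) for y in {-1, +1}
w = [0.7, -0.3]
x = 1.5

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
s_plus, s_minus = dot(w, phi(x, +1)), dot(w, phi(x, -1))
softmax_form = math.exp(s_plus) / (math.exp(s_plus) + math.exp(s_minus))
psi = [a - b for a, b in zip(phi(x, +1), phi(x, -1))]
sigmoid_form = 1.0 / (1.0 + math.exp(-dot(w, psi)))
print(abs(softmax_form - sigmoid_form) < 1e-12)  # True: the two forms agree
```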
49 Logistic/Sigmoid Function
f: x \mapsto \frac{1}{1 + e^{-x}}; \quad \Pr[y = +1 \mid x] = f(w \cdot \Psi(x)).
50 Applications
Natural language processing (Berger et al., 1996; Rosenfeld, 1996; Della Pietra et al., 1997; Malouf, 2002; Manning and Klein, 2003; Mann et al., 2009; Ratnaparkhi, 2010).
Species habitat modeling (Phillips et al., 2004, 2006; Dudík et al., 2007; Elith et al., 2011).
Computer vision (Jeon and Manmatha, 2004).
51 Extensions
Extensive theoretical study of alternative regularizations (Dudík et al., 2007); see also (Altun and Smola, 2006), though some proofs there are unclear.
Maxent models with other Bregman divergences (see for example (Altun and Smola, 2006)).
Structural Maxent models (Cortes et al., 2015):
- extension to the case of multiple feature families.
- empirically outperform Maxent and L1-Maxent.
- conditional structural Maxent: coincides with deep boosting using the logistic loss.
52 Conclusion
Logistic regression/Maxent models:
- theoretical foundation.
- natural solution when probabilities are required.
- widely used for density estimation/classification.
- often very effective in practice.
- distributed optimization solutions.
- no natural non-linear L1-version (use of kernels).
- connections with boosting.
- connections with neural networks.
53 References
Yasemin Altun and Alexander J. Smola. Unifying divergence minimization and statistical inference via convex duality. In COLT, 2006.
Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.
Joseph Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39, 1944.
Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7, 1967.
Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Umar Syed. Structural Maxent models. In Proceedings of ICML, 2015.
54 References
Imre Csiszár and Gábor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1, 1984.
Imre Csiszár. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3), 1989.
J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), 1972.
Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), April 1997.
Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. JMLR, 8, 2007.
55 References
E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106, 1957.
E. T. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
Solomon Kullback and Richard A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79-86, 1951.
J. A. O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
56 References
Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, University of California Press, 1961.
Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10, 1996.
Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948.
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationLecture 2: August 31
0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: August 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationMaximum Mean Discrepancy
Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia
More informationSpeech Recognition Lecture 5: N-gram Language Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri
Speech Recognition Lecture 5: N-gram Language Models Eugene Weinstein Google, NYU Courant Institute eugenew@cs.nyu.edu Slide Credit: Mehryar Mohri Components Acoustic and pronunciation model: Pr(o w) =
More informationIntroduction to Statistical Learning Theory
Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated
More informationInformation Theory in Intelligent Decision Making
Information Theory in Intelligent Decision Making Adaptive Systems and Algorithms Research Groups School of Computer Science University of Hertfordshire, United Kingdom June 7, 2015 Information Theory
More informationMachine Learning for Structured Prediction
Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for
More informationChapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye
Chapter 2: Entropy and Mutual Information Chapter 2 outline Definitions Entropy Joint entropy, conditional entropy Relative entropy, mutual information Chain rules Jensen s inequality Log-sum inequality
More informationComputer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationThe Method of Types and Its Application to Information Hiding
The Method of Types and Its Application to Information Hiding Pierre Moulin University of Illinois at Urbana-Champaign www.ifp.uiuc.edu/ moulin/talks/eusipco05-slides.pdf EUSIPCO Antalya, September 7,
More informationSequential Supervised Learning
Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given
More informationLecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University
Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationPerceptron Mistake Bounds
Perceptron Mistake Bounds Mehryar Mohri, and Afshin Rostamizadeh Google Research Courant Institute of Mathematical Sciences Abstract. We present a brief survey of existing mistake bounds and introduce
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationLearning Binary Classifiers for Multi-Class Problem
Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationIntroduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.
L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate
More informationArtificial Intelligence Roman Barták
Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their
More informationDept. of Linguistics, Indiana University Fall 2015
L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationMachine learning - HT Maximum Likelihood
Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce
More informationAdvanced Machine Learning
Advanced Machine Learning Learning with Large Expert Spaces MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Problem Learning guarantees: R T = O( p T log N). informative even for N very large.
More informationReducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning
More informationAlgorithms for Predicting Structured Data
1 / 70 Algorithms for Predicting Structured Data Thomas Gärtner / Shankar Vembu Fraunhofer IAIS / UIUC ECML PKDD 2010 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure
More informationConvexity/Concavity of Renyi Entropy and α-mutual Information
Convexity/Concavity of Renyi Entropy and -Mutual Information Siu-Wai Ho Institute for Telecommunications Research University of South Australia Adelaide, SA 5095, Australia Email: siuwai.ho@unisa.edu.au
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationSurrogate loss functions, divergences and decentralized detection
Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationAdvanced Machine Learning
Advanced Machine Learning Learning Kernels MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Kernel methods. Learning kernels scenario. learning bounds. algorithms. page 2 Machine Learning
More information3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.
Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that
More informationSTATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationSample Selection Bias Correction
Sample Selection Bias Correction Afshin Rostamizadeh Joint work with: Corinna Cortes, Mehryar Mohri & Michael Riley Courant Institute & Google Research Motivation Critical Assumption: Samples for training
More information