Foundations of Machine Learning
1 Maximum Entropy Models, Logistic Regression
Mehryar Mohri, Courant Institute and Google Research
page 1
2 Motivation
Probabilistic models: density estimation; classification.
3 This Lecture
Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models.
4 Entropy (Shannon, 1948)
Definition: the entropy of a discrete random variable X with probability mass function p(x) = \Pr[X = x] is
  H(X) = \mathrm{E}[-\log p(X)] = -\sum_{x \in X} p(x) \log p(x).
Properties:
- measure of uncertainty of X.
- H(X) \geq 0.
- maximal for the uniform distribution: for a finite support of size N, by Jensen's inequality,
  H(X) = \mathrm{E}\left[\log \frac{1}{p(X)}\right] \leq \log \mathrm{E}\left[\frac{1}{p(X)}\right] = \log N.
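As a quick numerical check of the definition above, a minimal sketch in Python (the function name and example distributions are illustrative, not from the lecture):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), natural log, 0 log 0 = 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

# The uniform distribution over N = 4 points attains the maximum log N.
uniform = [0.25] * 4
peaked = [0.97, 0.01, 0.01, 0.01]
print(entropy(uniform))                     # equals log 4 ≈ 1.386
print(entropy(peaked) < entropy(uniform))   # True: less uncertainty
```

A point mass has zero entropy, matching H(X) ≥ 0 with equality for a deterministic variable.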
5 Entropy
Base of logarithm: not critical; for base 2, -\log_2(p(x)) is the number of bits needed to represent p(x).
Definition and notation: the entropy of a distribution p is defined by the same quantity and denoted by H(p).
Special case of Rényi entropy (Rényi, 1961).
Binary entropy: for p \in [0, 1], H(p) = -p \log p - (1 - p) \log(1 - p).
6 Relative Entropy (Shannon, 1948; Kullback and Leibler, 1951)
Definition: the relative entropy (or Kullback-Leibler divergence) between two distributions p and q (discrete case) is
  D(p \| q) = \mathrm{E}_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)},
with the conventions 0 \log \frac{0}{q} = 0 and p \log \frac{p}{0} = +\infty.
Properties:
- asymmetric: in general, D(p \| q) \neq D(q \| p) for p \neq q.
- non-negative: D(p \| q) \geq 0 for all p and q.
- definite: D(p \| q) = 0 implies p = q.
7 Non-Negativity of Rel. Entropy
By the concavity of log and Jensen's inequality,
  -D(p \| q) = \sum_{x : p(x) > 0} p(x) \log \frac{q(x)}{p(x)}
             \leq \log \sum_{x : p(x) > 0} p(x) \frac{q(x)}{p(x)}
             = \log \sum_{x : p(x) > 0} q(x)
             \leq \log(1) = 0.
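The three properties of relative entropy can be checked numerically; a small sketch (function name illustrative) implementing the definition with its two conventions:

```python
import math

def kl(p, q):
    """D(p || q) with conventions 0 log(0/q) = 0 and p log(p/0) = +inf."""
    d = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue           # 0 log(0/q) = 0
        if qx == 0:
            return math.inf    # p log(p/0) = +inf
        d += px * math.log(px / qx)
    return d

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q), kl(q, p))  # asymmetric: the two values differ
print(kl(p, p))            # definite: 0 iff p = q
```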
8 Bregman Divergence (Bregman, 1967)
Definition: let F be a convex and differentiable function defined over a convex set C in a Hilbert space H. Then, the Bregman divergence B_F associated to F is defined by
  B_F(x \| y) = F(x) - F(y) - \langle \nabla F(y), x - y \rangle.
[Figure: B_F(x \| y) is the gap at x between F(x) and the tangent to F at y, F(y) + \langle \nabla F(y), x - y \rangle.]
9 Bregman Divergence
Examples:
                            B_F(x \| y)                        F(x)
  Squared L2-distance       \|x - y\|^2                        \|x\|^2
  Mahalanobis distance      (x - y)^T K^{-1} (x - y)           x^T K^{-1} x
  Unnorm. relative entropy  \widetilde{D}(x \| y)              \sum_{i \in I} x_i \log x_i - x_i
note: the relative entropy is not a Bregman divergence since it is not defined over an open set; but, on the simplex, it coincides with the unnormalized relative entropy
  \widetilde{D}(p \| q) = \sum_{x \in X} \left[ p(x) \log \frac{p(x)}{q(x)} + q(x) - p(x) \right].
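A minimal sketch of the general Bregman construction, checking two rows of the table above: F(x) = \|x\|^2 recovers the squared L2-distance, and F(x) = \sum_i x_i \log x_i - x_i recovers the unnormalized relative entropy (which equals the relative entropy on the simplex):

```python
import math

def bregman(F, gradF, x, y):
    """B_F(x || y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - sum(g * (xi - yi) for g, xi, yi in zip(gradF(y), x, y))

# F(x) = ||x||^2  =>  B_F(x || y) = ||x - y||^2
sq = lambda x: sum(v * v for v in x)
grad_sq = lambda x: [2 * v for v in x]

# F(x) = sum_i x_i log x_i - x_i  =>  unnormalized relative entropy
neg_ent = lambda x: sum(v * math.log(v) - v for v in x)
grad_neg_ent = lambda x: [math.log(v) for v in x]

x, y = [0.2, 0.8], [0.5, 0.5]   # both on the simplex
print(bregman(sq, grad_sq, x, y))            # = ||x - y||^2 = 0.18
print(bregman(neg_ent, grad_neg_ent, x, y))  # = D(x || y) since x, y sum to 1
```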
10 Conditional Relative Entropy
Definition: let p and q be two probability distributions over X \times Y. Then, the conditional relative entropy of p and q with respect to a distribution r over X is defined by
  \mathrm{E}_{x \sim r}\left[ D\big(p(\cdot \mid x) \| q(\cdot \mid x)\big) \right] = \sum_{x \in X} r(x) \sum_{y \in Y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)} = D(\tilde{p} \| \tilde{q}),
with \tilde{p}(x, y) = r(x) p(y \mid x), \tilde{q}(x, y) = r(x) q(y \mid x), and the conventions 0 \log 0 = 0, 0 \log \frac{0}{0} = 0, and p \log \frac{p}{0} = +\infty.
note: the definition of conditional relative entropy is not intrinsic; it depends on a third distribution r.
11 This Lecture
Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models.
12 Density Estimation Problem
Training data: sample S of size m drawn i.i.d. from set X according to some distribution D,
  S = (x_1, \ldots, x_m).
Problem: find the distribution p out of a hypothesis set P that best estimates D.
13 Maximum Likelihood Solution
Maximum Likelihood principle: select the distribution p \in P maximizing the likelihood of the observed sample S,
  p_{ML} = \mathrm{argmax}_{p \in P} \Pr[S \mid p] = \mathrm{argmax}_{p \in P} \prod_{i=1}^m p(x_i) = \mathrm{argmax}_{p \in P} \sum_{i=1}^m \log p(x_i).
14 Relative Entropy Formulation
Lemma: let \hat{p}_S be the empirical distribution for sample S; then
  p_{ML} = \mathrm{argmin}_{p \in P} D(\hat{p}_S \| p).
Proof:
  D(\hat{p}_S \| p) = \sum_x \hat{p}_S(x) \log \hat{p}_S(x) - \sum_x \hat{p}_S(x) \log p(x)
                    = -H(\hat{p}_S) - \sum_x \frac{\sum_{i=1}^m 1_{x = x_i}}{m} \log p(x)
                    = -H(\hat{p}_S) - \frac{1}{m} \sum_{i=1}^m \sum_x 1_{x = x_i} \log p(x)
                    = -H(\hat{p}_S) - \frac{\sum_{i=1}^m \log p(x_i)}{m}.
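The identity in the proof is easy to verify numerically on a toy sample (the sample and candidate model below are illustrative): the relative entropy to the empirical distribution differs from the negative average log-likelihood only by the constant H(\hat{p}_S), so minimizing one maximizes the other.

```python
import math
from collections import Counter

# Check: D(p̂_S || p) = -H(p̂_S) - (1/m) sum_i log p(x_i).
S = ["a", "a", "b", "c", "a", "b"]
m = len(S)
emp = {x: c / m for x, c in Counter(S).items()}   # empirical distribution p̂_S
p = {"a": 0.5, "b": 0.3, "c": 0.2}                # candidate model

D = sum(px * math.log(px / p[x]) for x, px in emp.items())
H = -sum(px * math.log(px) for px in emp.values())
avg_ll = sum(math.log(p[x]) for x in S) / m
print(abs(D - (-H - avg_ll)) < 1e-12)  # True
```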
15 Maximum a Posteriori (MAP)
Maximum a Posteriori principle: select the distribution p \in P that is the most likely, given the observed sample S and assuming a prior distribution \Pr[p] over P,
  p_{MAP} = \mathrm{argmax}_{p \in P} \Pr[p \mid S] = \mathrm{argmax}_{p \in P} \frac{\Pr[S \mid p] \Pr[p]}{\Pr[S]} = \mathrm{argmax}_{p \in P} \Pr[S \mid p] \Pr[p].
note: for a uniform prior, ML = MAP.
16 This Lecture
Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models.
17 Density Estimation + Features
Training data: sample S of size m drawn i.i.d. from set X according to some distribution D,
  S = (x_1, \ldots, x_m).
Features: associated to elements of X,
  \Phi: X \to R^N, \quad x \mapsto \Phi(x) = (\Phi_1(x), \ldots, \Phi_N(x))^T.
Problem: find the distribution p out of a hypothesis set P that best estimates D.
For simplicity, in what follows, X is assumed to be finite.
18 Features
Feature functions \Phi_j assumed to be in H, with \|\Phi\|_\infty \leq \Lambda.
Examples of H:
- family of threshold functions defined over the N variables: \{x \mapsto 1_{x_j \leq \theta} : j \in [1, N], \theta \in R\}.
- functions defined via decision trees with larger depths.
- k-degree monomials of the original features.
- zero-one features (often used in NLP, e.g., presence/absence of a word or POS tag).
19 Maximum Entropy Principle (E. T. Jaynes, 1957, 1983)
Idea: the empirical feature vector average is close to its expectation. For any \delta > 0, with probability at least 1 - \delta,
  \left\| \mathrm{E}_{x \sim D}[\Phi(x)] - \mathrm{E}_{x \sim \hat{D}}[\Phi(x)] \right\|_\infty \leq 2 R_m(H) + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.
Maxent principle: find the distribution p that is closest to a prior distribution p_0 (typically the uniform distribution) while verifying
  \left\| \mathrm{E}_{x \sim p}[\Phi(x)] - \mathrm{E}_{x \sim \hat{D}}[\Phi(x)] \right\|_\infty \leq \lambda.
Closeness is measured using relative entropy.
note: no family P needs to be specified.
20 Maxent Formulation
Optimization problem:
  \min_{p \in \Delta} D(p \| p_0)
  subject to: \left\| \mathrm{E}_{x \sim p}[\Phi(x)] - \mathrm{E}_{x \sim S}[\Phi(x)] \right\|_\infty \leq \lambda.
Convex optimization problem, unique solution.
\lambda = 0: standard Maxent (or unregularized Maxent).
\lambda > 0: regularized Maxent.
21 Relation with Entropy
Relationship with entropy: for a uniform prior p_0,
  D(p \| p_0) = \sum_{x \in X} p(x) \log \frac{p(x)}{p_0(x)} = -\sum_{x \in X} p(x) \log p_0(x) + \sum_{x \in X} p(x) \log p(x) = \log |X| - H(p).
22 Maxent Problem
Optimization: convex optimization problem,
  \min_p \sum_{x \in X} p(x) \log p(x)
  subject to: p(x) \geq 0, \forall x \in X;
              \sum_{x \in X} p(x) = 1;
              \left| \sum_{x \in X} p(x) \Phi_j(x) - \frac{1}{m} \sum_{i=1}^m \Phi_j(x_i) \right| \leq \lambda, \forall j \in [1, N].
23 Gibbs Distributions
Gibbs distributions: set Q of distributions p_w, with w \in R^N,
  p_w[x] = \frac{p_0[x] \exp(w \cdot \Phi(x))}{Z} = \frac{p_0[x] \exp\left(\sum_{j=1}^N w_j \Phi_j(x)\right)}{Z},
with Z = \sum_x p_0[x] \exp(w \cdot \Phi(x)).
Rich family:
- for linear and quadratic features: includes Gaussians and other distributions with non-PSD quadratic forms in the exponents.
- for higher-degree polynomials of the raw features: more complex multi-modal distributions.
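On a small finite set, a Gibbs distribution is just an exponential reweighting of the prior followed by normalization. A minimal sketch (the feature map and weights are illustrative):

```python
import math

# Gibbs distribution p_w[x] ∝ p_0[x] exp(w · Φ(x)) over a small finite set X.
X = [0, 1, 2, 3]
p0 = [0.25] * 4                 # uniform prior
phi = lambda x: [x, x * x]      # two illustrative features
w = [0.3, -0.1]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
weights = [p0[i] * math.exp(dot(w, phi(x))) for i, x in enumerate(X)]
Z = sum(weights)                # normalization constant
p_w = [v / Z for v in weights]
print(sum(p_w))                 # 1.0: normalized by construction
```

With w = 0 this reduces to the prior p_0, matching the Maxent view of p_w as a perturbation of p_0 by the features.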
24 Dual Problems
Regularized Maxent problem:
  \min_p F(p) = \widetilde{D}(p \| p_0) + I_C\big(\mathrm{E}_p[\Phi]\big),
with \widetilde{D}(p \| p_0) = D(p \| p_0) if p \in \Delta, +\infty otherwise;
  C = \{u : \| u - \mathrm{E}_S[\Phi] \|_\infty \leq \lambda \};
  I_C(x) = 0 if x \in C, I_C(x) = +\infty otherwise.
Regularized Maximum Likelihood problem with Gibbs distributions:
  \sup_w G(w) = \frac{1}{m} \sum_{i=1}^m \log \frac{p_w[x_i]}{p_0[x_i]} - \lambda \|w\|_1.
25 Duality Theorem (Della Pietra et al., 1997; Dudík et al., 2007; Cortes et al., 2015)
Theorem: the regularized Maxent and regularized ML with Gibbs distributions problems are equivalent,
  \sup_{w \in R^N} G(w) = \min_p F(p).
Furthermore, let p^* = \mathrm{argmin}_p F(p); then, for any \epsilon > 0,
  \left| G(w) - \sup_{w \in R^N} G(w) \right| < \epsilon \Rightarrow D(p^* \| p_w) \leq \epsilon.
26 Notes
Maxent formulation: no explicit restriction to a family of distributions P.
But the solution coincides with regularized ML over a specific family P!
More general Bregman divergence-based formulations exist.
27 L1-Regularized Maxent
Optimization problem:
  \inf_{w \in R^N} \lambda \|w\|_1 - \frac{1}{m} \sum_{i=1}^m \log p_w[x_i],
where p_w[x] = \frac{1}{Z} \exp\left(\sum_{j=1}^N w_j \Phi_j(x)\right).
Bayesian interpretation: equivalent to MAP with a Laplacian prior q_{prior}(w) (Williams, 1994; Kazama and Tsujii, 2003),
  \max_w \log \left[ \prod_{i=1}^m p_w[x_i] \, q_{prior}(w) \right],
with q_{prior}(w) = \prod_{j=1}^N \frac{\beta_j}{2} \exp(-\beta_j |w_j|).
28 Generalization Guarantee (Dudík et al., 2007)
Notation: L_D(w) = \mathrm{E}_{x \sim D}[-\log p_w[x]], L_S(w) = \mathrm{E}_{x \sim S}[-\log p_w[x]].
Theorem: fix \delta > 0. Let \hat{w} be the solution of the L1-regularized Maxent problem for \lambda = 2 R_m(H) + \sqrt{\log(2/\delta)/(2m)}. Then, with probability at least 1 - \delta,
  L_D(\hat{w}) \leq \inf_w L_D(w) + 2 \|w\|_1 \left[ 2 R_m(H) + \sqrt{\frac{\log \frac{2}{\delta}}{2m}} \right].
29 Proof
By Hölder's inequality and the concentration bound for average feature vectors,
  L_D(\hat{w}) - L_S(\hat{w}) = \hat{w} \cdot \left[\mathrm{E}_S[\Phi] - \mathrm{E}_D[\Phi]\right] \leq \|\hat{w}\|_1 \left\| \mathrm{E}_S[\Phi] - \mathrm{E}_D[\Phi] \right\|_\infty \leq \lambda \|\hat{w}\|_1.
Since \hat{w} is a minimizer,
  L_D(\hat{w}) - L_D(w) = L_D(\hat{w}) - L_S(\hat{w}) + L_S(\hat{w}) - L_D(w)
                        \leq \lambda \|\hat{w}\|_1 + L_S(\hat{w}) - L_D(w)
                        \leq \lambda \|w\|_1 + L_S(w) - L_D(w)
                        \leq 2 \lambda \|w\|_1.
30 L2-Regularized Maxent (Chen and Rosenfeld, 2000; Lebanon and Lafferty, 2001)
Different relaxations:
- L1 constraints: \forall j \in [1, N], \left| \mathrm{E}_{x \sim p}[\Phi_j(x)] - \mathrm{E}_{x \sim \hat{p}}[\Phi_j(x)] \right| \leq \beta_j.
- L2 constraints: \left\| \mathrm{E}_{x \sim p}[\Phi(x)] - \mathrm{E}_{x \sim \hat{p}}[\Phi(x)] \right\|_2 \leq B.
31 L2-Regularized Maxent
Optimization problem:
  \inf_{w \in R^N} \lambda \|w\|_2^2 - \frac{1}{m} \sum_{i=1}^m \log p_w[x_i],
where p_w[x] = \frac{1}{Z} \exp(w \cdot \Phi(x)).
Bayesian interpretation: equivalent to MAP with a Gaussian prior q_{prior}(w) (Goodman, 2004),
  \max_w \log \left[ \prod_{i=1}^m p_w[x_i] \, q_{prior}(w) \right],
with q_{prior}(w) = \prod_{j=1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{w_j^2}{2 \sigma^2}}.
32 This Lecture
Notions of information theory. Introduction to density estimation. Maxent models. Conditional Maxent models.
33 Conditional Maxent Models
Maxent models for conditional probabilities:
- conditional probability modeling each class.
- used in multi-class classification.
- can use different features for each class.
- a.k.a. multinomial logistic regression.
Logistic regression: special case of two classes.
34 Problem
Data: sample drawn i.i.d. according to some distribution D,
  S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m,
with Y = \{1, \ldots, k\}, or Y = \{0, 1\}^k in the multi-label case.
Features: mapping \Phi: X \times Y \to R^N.
Problem: find accurate conditional probability models \Pr[\cdot \mid x], x \in X, based on \Phi.
35 Conditional Maxent Principle (Berger et al., 1996; Cortes et al., 2015)
Idea: the empirical feature vector average is close to its expectation. For any \delta > 0, with probability at least 1 - \delta,
  \left\| \mathrm{E}_{(x,y) \sim D}[\Phi(x, y)] - \mathrm{E}_{(x,y) \sim \hat{D}}[\Phi(x, y)] \right\|_\infty \leq 2 R_m(H) + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.
Maxent principle: find conditional distributions p[\cdot \mid x] that are closest to priors p_0[\cdot \mid x] (typically uniform distributions) while verifying
  \left\| \mathrm{E}_{x \sim \hat{p}, y \sim p[\cdot \mid x]}[\Phi(x, y)] - \mathrm{E}_{(x,y) \sim \hat{D}}[\Phi(x, y)] \right\|_\infty \leq \lambda.
Closeness is measured using the conditional relative entropy based on \hat{p}.
36 Cond. Maxent Formulation (Berger et al., 1996; Cortes et al., 2015)
Optimization problem:
  \min_{p[\cdot \mid x] \in \Delta} \sum_{x \in X} \hat{p}[x] \, D\big(p[\cdot \mid x] \| p_0[\cdot \mid x]\big)
  s.t. \left\| \mathrm{E}_{x \sim \hat{p}} \mathrm{E}_{y \sim p[\cdot \mid x]}[\Phi(x, y)] - \mathrm{E}_{(x,y) \sim S}[\Phi(x, y)] \right\|_\infty \leq \lambda.
Solution of a convex optimization problem, unique solution.
\lambda = 0: unregularized conditional Maxent.
\lambda > 0: regularized conditional Maxent.
37 Dual Problems
Regularized conditional Maxent problem:
  \widetilde{F}(p) = \mathrm{E}_{x \sim \hat{p}}\left[ D\big(p[\cdot \mid x] \| p_0[\cdot \mid x]\big) \right] + I_C\left( \mathrm{E}_{x \sim \hat{p}} \mathrm{E}_{y \sim p[\cdot \mid x]}[\Phi] \right).
Regularized Maximum Likelihood problem with conditional Gibbs distributions:
  \widetilde{G}(w) = \frac{1}{m} \sum_{i=1}^m \log \frac{p_w[y_i \mid x_i]}{p_0[y_i \mid x_i]} - \lambda \|w\|_1,
where, for all (x, y) \in X \times Y,
  p_w[y \mid x] = \frac{p_0[y \mid x] \exp(w \cdot \Phi(x, y))}{Z(x)}, \quad Z(x) = \sum_{y \in Y} p_0[y \mid x] \exp(w \cdot \Phi(x, y)).
38 Duality Theorem (Cortes et al., 2015)
Theorem: the regularized conditional Maxent and regularized ML with conditional Gibbs distributions problems are equivalent,
  \sup_{w \in R^N} \widetilde{G}(w) = \min_p \widetilde{F}(p).
Furthermore, let p^* = \mathrm{argmin}_p \widetilde{F}(p); then, for any \epsilon > 0,
  \left| \widetilde{G}(w) - \sup_{w \in R^N} \widetilde{G}(w) \right| < \epsilon \Rightarrow \mathrm{E}_{x \sim \hat{p}}\left[ D\big(p^*[\cdot \mid x] \| p_w[\cdot \mid x]\big) \right] \leq \epsilon.
39 Regularized Cond. Maxent (Berger et al., 1996; Cortes et al., 2015)
Optimization problem: convex optimizations, regularization parameter \lambda \geq 0,
  \min_{w \in R^N} \lambda \|w\|_1 - \frac{1}{m} \sum_{i=1}^m \log p_w[y_i \mid x_i]
or
  \min_{w \in R^N} \lambda \|w\|_2^2 - \frac{1}{m} \sum_{i=1}^m \log p_w[y_i \mid x_i],
where, for all (x, y) \in X \times Y,
  p_w[y \mid x] = \frac{\exp(w \cdot \Phi(x, y))}{Z(x)}, \quad Z(x) = \sum_{y \in Y} \exp(w \cdot \Phi(x, y)).
40 More Explicit Forms
Optimization problem:
  \min_{w \in R^N} \lambda \|w\|_1 \text{ (or } \lambda \|w\|_2^2 \text{)} - \frac{1}{m} \sum_{i=1}^m w \cdot \Phi(x_i, y_i) + \frac{1}{m} \sum_{i=1}^m \log \left[ \sum_{y \in Y} e^{w \cdot \Phi(x_i, y)} \right].
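The explicit objective above is straightforward to evaluate on a toy problem. A minimal sketch (function and feature map are illustrative) computing the L2-regularized conditional Maxent objective; at w = 0 every conditional is uniform, so the objective equals \log |Y|:

```python
import math

def cond_maxent_objective(w, data, phi, labels, lam):
    """lam ||w||^2 - (1/m) sum_i w·Φ(x_i, y_i) + (1/m) sum_i log sum_y exp(w·Φ(x_i, y))."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    m = len(data)
    reg = lam * sum(v * v for v in w)
    fit = 0.0
    for x, y in data:
        scores = {yy: dot(w, phi(x, yy)) for yy in labels}
        fit += math.log(sum(math.exp(s) for s in scores.values())) - scores[y]
    return reg + fit / m

# Toy setup: Y = {0, 1}; Φ(x, y) places the input features in block y
# (the common multi-class feature choice described on a later slide).
phi = lambda x, y: [x[0], x[1], 0, 0] if y == 0 else [0, 0, x[0], x[1]]
data = [((1.0, 0.5), 0), ((-1.0, 0.2), 1)]
print(cond_maxent_objective([0.0] * 4, data, phi, [0, 1], lam=0.1))  # = log 2 at w = 0
```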
41 Related Problem
Optimization problem: log-sum-exp replaced by a max,
  \min_{w \in R^N} \lambda \|w\|_1 \text{ (or } \lambda \|w\|_2^2 \text{)} + \frac{1}{m} \sum_{i=1}^m \max_{y \in Y} \left[ w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i) \right].
42 Common Feature Choice
Multi-class features: \Phi(x, y) is the vector whose block y equals \Phi(x) and whose other blocks are zero, with w = (w_1, \ldots, w_{|Y|}) stacked accordingly; then
  w \cdot \Phi(x, y) = w_y \cdot \Phi(x).
L2-regularized cond. Maxent optimization:
  \min_{w \in R^N} \lambda \sum_{y \in Y} \|w_y\|_2^2 + \frac{1}{m} \sum_{i=1}^m \log \left[ \sum_{y \in Y} \exp\left( w_y \cdot \Phi(x_i) - w_{y_i} \cdot \Phi(x_i) \right) \right].
43 Prediction
Prediction with p_w[y \mid x] = \frac{\exp(w \cdot \Phi(x, y))}{Z(x)}:
  \hat{y}(x) = \mathrm{argmax}_{y \in Y} p_w[y \mid x] = \mathrm{argmax}_{y \in Y} w \cdot \Phi(x, y).
44 Binary Classification
Simpler expression:
  \sum_{y \in Y} \exp\left( w \cdot \Phi(x_i, y) - w \cdot \Phi(x_i, y_i) \right)
    = e^{w \cdot \Phi(x_i, +1) - w \cdot \Phi(x_i, y_i)} + e^{w \cdot \Phi(x_i, -1) - w \cdot \Phi(x_i, y_i)}
    = 1 + e^{-y_i w \cdot [\Phi(x_i, +1) - \Phi(x_i, -1)]}
    = 1 + e^{-y_i w \cdot \Psi(x_i)},
with \Psi(x) = \Phi(x, +1) - \Phi(x, -1).
45 Logistic Regression (Berkson, 1944)
Binary case of conditional Maxent.
Optimization problem:
  \min_{w \in R^N} \lambda \|w\|_1 \text{ (or } \lambda \|w\|_2^2 \text{)} + \frac{1}{m} \sum_{i=1}^m \log \left[ 1 + e^{-y_i w \cdot \Psi(x_i)} \right].
Convex optimization; variety of solutions: SGD, coordinate descent, etc.
Coordinate descent: similar to AdaBoost with the logistic loss \log_2(1 + e^{-u}) \geq 1_{u \leq 0} instead of the exponential loss.
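The lecture mentions SGD and coordinate descent; a minimal sketch of the simplest alternative, plain gradient descent on the L2-regularized logistic objective above (hyperparameters and the toy data are illustrative, not from the lecture):

```python
import math

def train_logreg(data, lam=0.01, lr=0.5, steps=500):
    """Gradient descent on lam ||w||^2 + (1/m) sum_i log(1 + exp(-y_i w·x_i))."""
    n = len(data[0][0])
    w = [0.0] * n
    m = len(data)
    for _ in range(steps):
        grad = [2 * lam * wi for wi in w]              # gradient of the regularizer
        for x, y in data:                              # labels y in {-1, +1}
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            coef = -y / (1.0 + math.exp(margin))       # d/du log(1 + e^{-u}) chain rule
            for j in range(n):
                grad[j] += coef * x[j] / m
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

data = [((1.0, 2.0), +1), ((2.0, 1.0), +1), ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]
w = train_logreg(data)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x, _ in data]
print(preds)  # separable toy data: matches the labels
```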
46 Generalization Bound
Theorem: assume that \pm \Phi_j \in H for all j \in [1, N]. Then, for any \delta > 0, with probability at least 1 - \delta over the draw of a sample S of size m, for all f: x \mapsto w \cdot \Psi(x),
  R(f) \leq \frac{1}{m} \sum_{i=1}^m \frac{1}{u_0} \log\left( 1 + e^{-y_i w \cdot \Psi(x_i)} \right) + 4 \|w\|_1 R_m(H) + \sqrt{\frac{\log \log_2 (2 \|w\|_1)}{m}} + \sqrt{\frac{\log \frac{2}{\delta}}{2m}},
where u_0 = \log(1 + 1/e).
47 Proof
Proof: by the learning bound for convex ensembles holding uniformly for all \rho, with probability at least 1 - \delta, for all f and \rho > 0,
  R(f) \leq \frac{1}{m} \sum_{i=1}^m 1_{\frac{y_i w \cdot \Psi(x_i)}{\|w\|_1} \leq \rho} + \frac{4}{\rho} R_m(H) + \sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}} + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.
Choosing \rho = \frac{1}{\|w\|_1} and using 1_{u \leq 1} \leq \frac{1}{u_0} \log(1 + e^{-u}) yields immediately the learning bound of the theorem.
48 Logistic Regression (Berkson, 1944)
Logistic model:
  \Pr[y = +1 \mid x] = \frac{e^{w \cdot \Phi(x, +1)}}{Z(x)}, \quad Z(x) = e^{w \cdot \Phi(x, +1)} + e^{w \cdot \Phi(x, -1)}.
Properties:
- linear decision rule, sign of the log-odds ratio:
  \log \frac{\Pr[y = +1 \mid x]}{\Pr[y = -1 \mid x]} = w \cdot [\Phi(x, +1) - \Phi(x, -1)] = w \cdot \Psi(x).
- logistic form:
  \Pr[y = +1 \mid x] = \frac{1}{1 + e^{-w \cdot [\Phi(x, +1) - \Phi(x, -1)]}} = \frac{1}{1 + e^{-w \cdot \Psi(x)}}.
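The reduction from the two-class Gibbs form to the sigmoid of w·Ψ(x) can be checked numerically; a small sketch with an illustrative feature map (not from the lecture):

```python
import math

# Two-class conditional Maxent probability vs. sigmoid of w·Ψ(x),
# with Ψ(x) = Φ(x, +1) - Φ(x, -1).
phi = lambda x, y: [x * y, y]        # toy Φ(x, y) for y in {-1, +1}
w = [0.7, -0.3]
x = 1.5

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
s_plus, s_minus = dot(w, phi(x, +1)), dot(w, phi(x, -1))
softmax_form = math.exp(s_plus) / (math.exp(s_plus) + math.exp(s_minus))
psi = [a - b for a, b in zip(phi(x, +1), phi(x, -1))]
sigmoid_form = 1.0 / (1.0 + math.exp(-dot(w, psi)))
print(abs(softmax_form - sigmoid_form) < 1e-12)  # True: the two forms agree
```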
49 Logistic/Sigmoid Function
f: x \mapsto \frac{1}{1 + e^{-x}}; \quad \Pr[y = +1 \mid x] = f(w \cdot \Psi(x)).
50 Applications
Natural language processing (Berger et al., 1996; Rosenfeld, 1996; Della Pietra et al., 1997; Malouf, 2002; Manning and Klein, 2003; Mann et al., 2009; Ratnaparkhi, 2010).
Species habitat modeling (Phillips et al., 2004, 2006; Dudík et al., 2007; Elith et al., 2011).
Computer vision (Jeon and Manmatha, 2004).
51 Extensions
Extensive theoretical study of alternative regularizations (Dudík et al., 2007); see also (Altun and Smola, 2006), though some proofs there are unclear.
Maxent models with other Bregman divergences (see for example (Altun and Smola, 2006)).
Structural Maxent models (Cortes et al., 2015):
- extension to the case of multiple feature families.
- empirically outperform Maxent and L1-Maxent.
- conditional structural Maxent: coincides with deep boosting using the logistic loss.
52 Conclusion
Logistic regression/Maxent models:
- theoretical foundation.
- natural solution when probabilities are required.
- widely used for density estimation/classification.
- often very effective in practice.
- distributed optimization solutions.
- no natural non-linear L1-version (use of kernels).
- connections with boosting.
- connections with neural networks.
53 References
Yasemin Altun and Alexander J. Smola. Unifying divergence minimization and statistical inference via convex duality. In COLT, 2006.
Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.
Joseph Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39, 1944.
Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7, 1967.
Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Umar Syed. Structural Maxent models. In Proceedings of ICML, 2015.
54 References
Imre Csiszár and Gábor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1, 1984.
Imre Csiszár. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3), 1989.
J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), 1972.
Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), April 1997.
Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. JMLR, 8, 2007.
55 References
E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106, 1957.
E. T. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
Solomon Kullback and Richard A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79-86, 1951.
J. A. O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
56 References
Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, University of California Press, 1961.
Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10, 1996.
Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 1948.
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationLecture 2: August 31
0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: August 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationMaximum Mean Discrepancy
Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia
More informationSpeech Recognition Lecture 5: N-gram Language Models. Eugene Weinstein Google, NYU Courant Institute Slide Credit: Mehryar Mohri
Speech Recognition Lecture 5: N-gram Language Models Eugene Weinstein Google, NYU Courant Institute eugenew@cs.nyu.edu Slide Credit: Mehryar Mohri Components Acoustic and pronunciation model: Pr(o w) =
More informationIntroduction to Statistical Learning Theory
Introduction to Statistical Learning Theory In the last unit we looked at regularization - adding a w 2 penalty. We add a bias - we prefer classifiers with low norm. How to incorporate more complicated
More informationInformation Theory in Intelligent Decision Making
Information Theory in Intelligent Decision Making Adaptive Systems and Algorithms Research Groups School of Computer Science University of Hertfordshire, United Kingdom June 7, 2015 Information Theory
More informationMachine Learning for Structured Prediction
Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for
More informationChapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye
Chapter 2: Entropy and Mutual Information Chapter 2 outline Definitions Entropy Joint entropy, conditional entropy Relative entropy, mutual information Chain rules Jensen s inequality Log-sum inequality
More informationComputer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationThe Method of Types and Its Application to Information Hiding
The Method of Types and Its Application to Information Hiding Pierre Moulin University of Illinois at Urbana-Champaign www.ifp.uiuc.edu/ moulin/talks/eusipco05-slides.pdf EUSIPCO Antalya, September 7,
More informationSequential Supervised Learning
Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given
More informationLecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University
Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationPerceptron Mistake Bounds
Perceptron Mistake Bounds Mehryar Mohri, and Afshin Rostamizadeh Google Research Courant Institute of Mathematical Sciences Abstract. We present a brief survey of existing mistake bounds and introduce
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationLearning Binary Classifiers for Multi-Class Problem
Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationIntroduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.
L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate
More informationArtificial Intelligence Roman Barták
Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their
More informationDept. of Linguistics, Indiana University Fall 2015
L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationMachine learning - HT Maximum Likelihood
Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce
More informationAdvanced Machine Learning
Advanced Machine Learning Learning with Large Expert Spaces MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Problem Learning guarantees: R T = O( p T log N). informative even for N very large.
More informationReducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning
More informationAlgorithms for Predicting Structured Data
1 / 70 Algorithms for Predicting Structured Data Thomas Gärtner / Shankar Vembu Fraunhofer IAIS / UIUC ECML PKDD 2010 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure
More informationConvexity/Concavity of Renyi Entropy and α-mutual Information
Convexity/Concavity of Renyi Entropy and -Mutual Information Siu-Wai Ho Institute for Telecommunications Research University of South Australia Adelaide, SA 5095, Australia Email: siuwai.ho@unisa.edu.au
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationSurrogate loss functions, divergences and decentralized detection
Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationAdvanced Machine Learning
Advanced Machine Learning Learning Kernels MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Kernel methods. Learning kernels scenario. learning bounds. algorithms. page 2 Machine Learning
More information3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.
Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that
More informationSTATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationSample Selection Bias Correction
Sample Selection Bias Correction Afshin Rostamizadeh Joint work with: Corinna Cortes, Mehryar Mohri & Michael Riley Courant Institute & Google Research Motivation Critical Assumption: Samples for training
More information