Introduction to Machine Learning Lecture 14. Mehryar Mohri Courant Institute and Google Research

1 Introduction to Machine Learning Lecture 14 Mehryar Mohri Courant Institute and Google Research

2 Density Estimation. Maxent Models.

3 Entropy
Definition: the entropy of a random variable $X$ with probability distribution $p(x) = \Pr[X = x]$ is
$$H(X) = -\mathrm{E}[\log p(X)] = -\sum_{x} p(x) \log p(x).$$
Properties:
Measure of uncertainty of $p(x)$.
$H(X) \geq 0$.
Maximal for the uniform distribution. For a finite support of size $N$, by Jensen's inequality:
$$H(X) = \mathrm{E}[\log 1/p(X)] \leq \log \mathrm{E}[1/p(X)] = \log N.$$
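The entropy definition and its two stated properties can be checked numerically. The following is a minimal sketch, assuming natural logarithms and three hypothetical distributions over four outcomes chosen only for illustration.

```python
import math

def entropy(p):
    """Shannon entropy H = -sum_x p(x) log p(x), in nats, with 0 log 0 = 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

# Hypothetical distributions over N = 4 outcomes.
uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
point_mass = [1.0, 0.0, 0.0, 0.0]

print(entropy(uniform), math.log(4))  # the uniform distribution attains log N
print(entropy(skewed))                # strictly between 0 and log N
print(entropy(point_mass))            # no uncertainty
```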

4 Relative Entropy
Definition: the relative entropy (or Kullback-Leibler divergence) of two distributions $p$ and $q$ is
$$D(p \,\|\, q) = \mathrm{E}_p\!\left[\log \frac{p(x)}{q(x)}\right] = \sum_{x} p(x) \log \frac{p(x)}{q(x)},$$
with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$.
Properties:
Asymmetric measure of deviation between two distributions.
It is convex in $p$ and $q$.
$D(p \,\|\, q) \geq 0$ for all $p$ and $q$.
$D(p \,\|\, q) = 0$ iff $p = q$.
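The conventions for zero probabilities and the asymmetry of the divergence are easy to see on a small example. This is an illustrative sketch with two hypothetical distributions `p` and `q`; the function name `kl_divergence` is not from the slides.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)), in nats.

    Terms with p(x) = 0 contribute 0; a term with p(x) > 0 and q(x) = 0
    makes the divergence infinite.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total

# Hypothetical distributions over three outcomes.
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q))  # finite: q covers the support of p
print(kl_divergence(q, p))  # infinite: q puts mass where p has none
print(kl_divergence(p, p))  # zero when both arguments coincide
```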

5 Jensen's Inequality
Theorem: let $X$ be a random variable and $f$ a measurable convex function. Then,
$$f(\mathrm{E}[X]) \leq \mathrm{E}[f(X)].$$
Proof: For a distribution over a finite set, the property follows directly from the definition of convexity. The general case is a consequence of the continuity of convex functions and the density of finite distributions.
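For a finite distribution the inequality can be verified directly. The sketch below uses a hypothetical three-point distribution and two convex functions picked for illustration.

```python
import math

# Hypothetical finite distribution: values and their probabilities.
values = [0.0, 1.0, 4.0]
probs = [0.5, 0.3, 0.2]

def expect(g):
    """E[g(X)] under the finite distribution above."""
    return sum(p * g(v) for p, v in zip(probs, values))

mean = expect(lambda v: v)
for name, f in [("square", lambda v: v * v), ("exp", math.exp)]:
    # For convex f, f(E[X]) should not exceed E[f(X)].
    print(name, f(mean), "<=", expect(f))
```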

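6 Applications
Non-negativity of relative entropy: since $\log$ is concave, Jensen's inequality gives
$$-D(p \,\|\, q) = \mathrm{E}_p\!\left[\log \frac{q(x)}{p(x)}\right] \leq \log \mathrm{E}_p\!\left[\frac{q(x)}{p(x)}\right] = \log \sum_{x} q(x) = 0.$$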

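7 Hoeffding's Bounds
Theorem: let $X_1, X_2, \ldots, X_m$ be a sequence of independent random variables taking values in $[0, 1]$ (for example, Bernoulli trials). Then, for all $\epsilon > 0$, the following inequalities hold for $\overline{X}_m = \frac{1}{m} \sum_{i=1}^{m} X_i$:
$$\Pr[\overline{X}_m - \mathrm{E}[\overline{X}_m] \geq \epsilon] \leq e^{-2m\epsilon^2},$$
$$\Pr[\overline{X}_m - \mathrm{E}[\overline{X}_m] \leq -\epsilon] \leq e^{-2m\epsilon^2}.$$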
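A quick simulation can illustrate how the upper-tail bound compares with an observed deviation frequency. This is a sketch under assumptions: fair Bernoulli trials, a fixed seed, and hypothetical values `m = 50`, `eps = 0.1`; a small tolerance guards the comparison against floating-point rounding.

```python
import math
import random

def hoeffding_bound(m, eps):
    """One-sided Hoeffding bound exp(-2 m eps^2) for variables in [0, 1]."""
    return math.exp(-2 * m * eps ** 2)

def empirical_tail(m, eps, mean=0.5, trials=5000, seed=0):
    """Fraction of trials whose sample mean exceeds its expectation by >= eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.random() < mean for _ in range(m)) / m
        if sample_mean - mean >= eps - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / trials

m, eps = 50, 0.1
tail = empirical_tail(m, eps)
bound = hoeffding_bound(m, eps)
print(tail, "<=", bound)
```

The bound is distribution-free, so it is typically loose for any particular distribution, as the gap in the printed values suggests.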

8 Problem
Data: sample $x_1, \ldots, x_m \in X$ drawn i.i.d. from set $X$ according to some distribution $D$.
Problem: find distribution $p$ out of a set $P$ that best estimates $D$.

9 Maximum Likelihood
Likelihood: probability of observing the sample under distribution $p \in P$, which, given the independence assumption, is
$$\Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).$$
Principle: select the distribution maximizing the sample probability,
$$p^\star = \operatorname*{argmax}_{p \in P} \prod_{i=1}^{m} p(x_i), \quad \text{or} \quad p^\star = \operatorname*{argmax}_{p \in P} \sum_{i=1}^{m} \log p(x_i).$$
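The principle can be illustrated by scoring a few candidate distributions by log-likelihood. A minimal sketch, assuming a hypothetical sample over three symbols and a hypothetical candidate set of three distributions:

```python
import math
from collections import Counter

def log_likelihood(p, sample):
    """Sum of log p(x_i) over the sample; -inf if some point has probability 0."""
    total = 0.0
    for x in sample:
        if p.get(x, 0.0) == 0.0:
            return -math.inf
        total += math.log(p[x])
    return total

sample = ["a", "b", "a", "c", "a", "b"]  # hypothetical i.i.d. sample

# Hypothetical candidate set P: three distributions over {a, b, c}.
candidates = {
    "uniform": {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3},
    "empirical": {x: c / len(sample) for x, c in Counter(sample).items()},
    "skewed": {"a": 0.8, "b": 0.1, "c": 0.1},
}

best = max(candidates, key=lambda name: log_likelihood(candidates[name], sample))
print(best)
```

When the candidate set contains the empirical frequencies, they are selected, which matches the relative entropy view of maximum likelihood.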

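10 Relative Entropy Formulation
Empirical distribution: distribution $\hat{p}$ that assigns to each point the frequency of its occurrence in the sample.
Lemma: $p^\star$ has maximum likelihood $l(p^\star)$ iff
$$p^\star = \operatorname*{argmin}_{p \in P} D(\hat{p} \,\|\, p).$$
Proof: with $N_0$ the total number of sample points,
$$D(\hat{p} \,\|\, p) = \sum_{z_i \text{ observed}} \hat{p}(z_i) \log \hat{p}(z_i) - \sum_{z_i \text{ observed}} \hat{p}(z_i) \log p(z_i)$$
$$= -H(\hat{p}) - \sum_{z_i} \frac{\mathrm{count}(z_i)}{N_0} \log p(z_i)$$
$$= -H(\hat{p}) - \frac{1}{N_0} \log \prod_{z_i} p(z_i)^{\mathrm{count}(z_i)}$$
$$= -H(\hat{p}) - \frac{1}{N_0}\, l(p).$$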
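The identity at the end of the proof can be confirmed numerically. The sketch below assumes a hypothetical sample and a hypothetical model distribution `p` that is strictly positive on the observed points.

```python
import math
from collections import Counter

sample = ["a", "b", "a", "c", "a", "b"]  # hypothetical sample
n0 = len(sample)                         # N_0 in the proof
counts = Counter(sample)
p_hat = {z: c / n0 for z, c in counts.items()}

# Hypothetical model distribution, positive on every observed point.
p = {"a": 0.6, "b": 0.3, "c": 0.1}

kl = sum(p_hat[z] * math.log(p_hat[z] / p[z]) for z in counts)
entropy_hat = -sum(p_hat[z] * math.log(p_hat[z]) for z in counts)
log_lik = sum(math.log(p[x]) for x in sample)  # l(p)

# Both sides of D(p_hat || p) = -H(p_hat) - l(p) / N_0.
print(kl, -entropy_hat - log_lik / n0)
```

Since $H(\hat{p})$ does not depend on $p$, minimizing the left side over $p$ is the same as maximizing $l(p)$.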

11 Maximum A Posteriori (MAP)
Principle: select the most likely hypothesis $h \in H$ given the sample $S$, with some prior distribution $\Pr[h]$ over the hypotheses:
$$h^\star = \operatorname*{argmax}_{h \in H} \Pr[h \mid S] = \operatorname*{argmax}_{h \in H} \frac{\Pr[S \mid h] \Pr[h]}{\Pr[S]} = \operatorname*{argmax}_{h \in H} \Pr[S \mid h] \Pr[h].$$
Note: for a uniform prior, MAP coincides with maximum likelihood.
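A coin-bias example shows how a prior can move the MAP choice away from the maximum likelihood one. This is a sketch with a hypothetical finite hypothesis set, a hypothetical prior concentrated on a fair coin, and a hypothetical sample of 7 heads and 3 tails.

```python
import math

def log_likelihood(theta, heads, tails):
    """log Pr[S | h] for a coin with Pr[heads] = theta."""
    return heads * math.log(theta) + tails * math.log(1 - theta)

hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]  # hypothetical finite set H
prior = {0.1: 0.05, 0.3: 0.1, 0.5: 0.7, 0.7: 0.1, 0.9: 0.05}
heads, tails = 7, 3                      # hypothetical sample S

ml = max(hypotheses, key=lambda h: log_likelihood(h, heads, tails))
map_choice = max(
    hypotheses,
    key=lambda h: log_likelihood(h, heads, tails) + math.log(prior[h]),
)
# With a uniform prior the added term is constant, so MAP equals ML.
map_uniform = max(
    hypotheses,
    key=lambda h: log_likelihood(h, heads, tails) + math.log(1 / len(hypotheses)),
)
print(ml, map_choice, map_uniform)
```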

12 Problem
Data: sample $x_1, \ldots, x_m \in X$ drawn i.i.d. from set $X$ according to some distribution $D$.
Features: mappings associated to elements of $X$, $\phi_1, \ldots, \phi_N \colon X \to \mathbb{R}$.
Problem: how do we estimate distribution $D$? Uniform distribution $u$ over $X$?

13 Features
Examples: statistical language modeling.
n-grams, distance-d n-grams.
class-based n-grams, word triggers.
sentence length.
number and type of verbs.
various grammatical information (e.g., agreement, POS tags).
dialog-level information.

14 Maximum Entropy Principle (E. T. Jaynes, 1957, 1983)
For large $m$, we can give a fairly good estimate of the expected value of each feature (Hoeffding's inequality):
$$\forall j \in [1, N], \quad \mathrm{E}_{x \sim D}[\phi_j(x)] \approx \frac{1}{m} \sum_{i=1}^{m} \phi_j(x_i).$$
Find the distribution that is closest to the uniform distribution $u$ and that preserves the expected values of features.
Closeness is measured using relative entropy (or Kullback-Leibler divergence).

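15 Maximum Entropy Formulation
Distributions: let $P$ denote the set of distributions
$$P = \left\{ p : \mathrm{E}_{x \sim p}[\phi_j(x)] = \frac{1}{m} \sum_{i=1}^{m} \phi_j(x_i),\ j \in [1, N] \right\}.$$
Optimization problem: find distribution $p^\star$ verifying
$$p^\star = \operatorname*{argmin}_{p \in P} D(p \,\|\, u).$$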

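16 Relation with Entropy Maximization
Relationship with entropy:
$$D(p \,\|\, u) = \sum_{x \in X} p(x) \log \frac{p(x)}{1/|X|} = \log |X| + \sum_{x \in X} p(x) \log p(x) = \log |X| - H(p).$$
Optimization problem: convex problem.
$$\text{minimize} \quad \sum_{x \in X} p(x) \log p(x)$$
$$\text{subject to} \quad \sum_{x \in X} p(x) = 1,$$
$$\sum_{x \in X} p(x) \phi_j(x) = \frac{1}{m} \sum_{i=1}^{m} \phi_j(x_i), \quad \forall j \in [1, N].$$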
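A short derivation sketch, not taken from the slides, suggests why solutions of this convex problem take an exponential form. It assumes Lagrange multipliers $w_j$ for the feature constraints and $\mu$ for the normalization constraint (both names are introduced here for illustration) and a solution with $p(x) > 0$ everywhere.

```latex
\begin{align*}
L(p, w, \mu) &= \sum_{x \in X} p(x) \log p(x)
  - \sum_{j=1}^{N} w_j \Big( \sum_{x \in X} p(x)\phi_j(x) - \frac{1}{m}\sum_{i=1}^{m} \phi_j(x_i) \Big)
  + \mu \Big( \sum_{x \in X} p(x) - 1 \Big), \\
\frac{\partial L}{\partial p(x)} &= \log p(x) + 1 - \sum_{j=1}^{N} w_j \phi_j(x) + \mu = 0, \\
p(x) &= \exp\Big( \sum_{j=1}^{N} w_j \phi_j(x) - 1 - \mu \Big)
  = \frac{1}{Z} \exp\Big( \sum_{j=1}^{N} w_j \phi_j(x) \Big),
  \qquad Z = \sum_{x \in X} \exp\Big( \sum_{j=1}^{N} w_j \phi_j(x) \Big).
\end{align*}
```

The constant $e^{1+\mu}$ is fixed by normalization, which gives the partition function $Z$.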

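17 Maximum Likelihood Gibbs Distrib.
Gibbs distributions: set $Q$ of distributions defined by
$$p(x) = \frac{1}{Z} \exp(w \cdot \Phi(x)) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{N} w_j \Phi_j(x) \Big), \quad \text{with } Z = \sum_{x} \exp(w \cdot \Phi(x)).$$
Maximum likelihood with Gibbs distributions:
$$p^\star = \operatorname*{argmax}_{q \in \bar{Q}} \sum_{i=1}^{m} \log q(x_i) = \operatorname*{argmin}_{q \in \bar{Q}} D(\hat{p} \,\|\, q),$$
where $\bar{Q}$ is the closure of $Q$.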
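Maximum likelihood over Gibbs distributions can be sketched on a tiny finite domain. The example below uses plain gradient ascent, which is a choice made here for illustration and not a method named in the slides; the domain, features, sample, step size, and iteration count are all hypothetical. The gradient of the average log-likelihood is the empirical feature mean minus the model feature mean, so the fitted model ends up matching the empirical expectations.

```python
import math

def gibbs(w, features):
    """p_w(x) = exp(w . Phi(x)) / Z over a finite domain; features[x] is Phi(x)."""
    scores = [math.exp(sum(wj * fj for wj, fj in zip(w, phi))) for phi in features]
    z = sum(scores)
    return [s / z for s in scores]

def expectation(p, features):
    """E_{x ~ p}[Phi(x)], one value per feature."""
    n = len(features[0])
    return [sum(px * phi[j] for px, phi in zip(p, features)) for j in range(n)]

def fit_maxent(features, target, lr=0.2, steps=5000):
    """Gradient ascent on the average log-likelihood of the Gibbs model."""
    w = [0.0] * len(target)
    for _ in range(steps):
        model = expectation(gibbs(w, features), features)
        w = [wj + lr * (t - e) for wj, t, e in zip(w, target, model)]
    return w

# Hypothetical domain X = {0, 1, 2, 3} with features Phi(x) = (x, [x is even]).
features = [(float(x), 1.0 if x % 2 == 0 else 0.0) for x in range(4)]
sample = [0, 1, 1, 2, 1, 3, 1, 0]
target = [sum(features[x][j] for x in sample) / len(sample) for j in range(2)]

w = fit_maxent(features, target)
p_star = gibbs(w, features)
print(p_star)
print(expectation(p_star, features), target)
```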

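18 Duality Theorem (Della Pietra et al., 1997)
Theorem: Assume that $D(\hat{p} \,\|\, u) < \infty$. Then, there exists a unique probability distribution $p^\star$ satisfying
1. $p^\star \in P \cap \bar{Q}$;
2. $D(p \,\|\, q) = D(p \,\|\, p^\star) + D(p^\star \,\|\, q)$ for any $p \in P$ and $q \in \bar{Q}$ (Pythagorean equality);
3. $p^\star = \operatorname*{argmin}_{q \in \bar{Q}} D(\hat{p} \,\|\, q)$ (maximum likelihood);
4. $p^\star = \operatorname*{argmin}_{p \in P} D(p \,\|\, u)$.
Each of these properties determines $p^\star$ uniquely.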
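The Pythagorean equality can be checked numerically on a small example. This is a sketch under assumptions: a hypothetical four-point domain, a hypothetical empirical distribution `p_hat` (which lies in $P$ by construction), a Gibbs distribution `p_star` fitted by gradient ascent as a stand-in for the exact solution, and an arbitrary Gibbs distribution `q`.

```python
import math

def gibbs(w, features):
    """p_w(x) proportional to exp(w . Phi(x)) on a finite domain."""
    scores = [math.exp(sum(a * b for a, b in zip(w, phi))) for phi in features]
    z = sum(scores)
    return [s / z for s in scores]

def kl(p, q):
    """D(p || q) for distributions with q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def feature_means(p, features):
    n = len(features[0])
    return [sum(a * phi[j] for a, phi in zip(p, features)) for j in range(n)]

# Hypothetical domain {0, 1, 2, 3}, features (x, [x is even]).
features = [(float(x), float(x % 2 == 0)) for x in range(4)]
p_hat = [0.25, 0.5, 0.125, 0.125]
target = feature_means(p_hat, features)

# Fit p_star in Q so that its feature means match those of p_hat.
w = [0.0, 0.0]
for _ in range(5000):
    model = feature_means(gibbs(w, features), features)
    w = [wj + 0.2 * (t - e) for wj, t, e in zip(w, target, model)]
p_star = gibbs(w, features)

q = gibbs([0.3, -1.0], features)  # arbitrary member of Q
lhs = kl(p_hat, q)
rhs = kl(p_hat, p_star) + kl(p_star, q)
print(lhs, rhs)
```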

19 Regularization
Overfitting:
Features with low counts;
Very large feature weights.
Approximation parameters: $\beta_j \geq 0$, $j \in [1, N]$;
$$\forall j \in [1, N], \quad \left| \mathrm{E}_{x \sim D}[\phi_j(x)] - \frac{1}{m} \sum_{i=1}^{m} \phi_j(x_i) \right| \leq \beta_j.$$

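20 Regularized Maxent
Optimization problem:
$$\operatorname*{argmin}_{w} \ \lambda \|w\|^2 - \frac{1}{m} \sum_{i=1}^{m} \log p_w[x_i], \quad \text{with } p_w[x] = \frac{1}{Z} \exp(w \cdot \Phi(x)).$$
Regularization: $L_1$ or $L_2$ norm, also interpreted as Laplacian or Gaussian priors, respectively (Bayesian view). E.g., with a Gaussian prior,
$$p(x) \leftarrow p_w(x) \Pr[w], \quad \text{where } \Pr[w] = \prod_{j=1}^{N} \frac{e^{-w_j^2/(2\sigma_j^2)}}{\sqrt{2\pi\sigma_j^2}},$$
$$l(w) \leftarrow l(w) - \sum_{j=1}^{N} \frac{w_j^2}{2\sigma_j^2} - \frac{1}{2} \sum_{j=1}^{N} \log(2\pi\sigma_j^2).$$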
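The effect of the regularization parameter can be seen by minimizing $\lambda \|w\|^2 - \frac{1}{m}\sum_i \log p_w[x_i]$ for several values of $\lambda$. The following is a sketch using plain gradient descent (an illustrative choice), with a hypothetical four-point domain and hypothetical empirical feature means `target`; the gradient is $2\lambda w$ minus the empirical feature means plus the model feature means.

```python
import math

def model_means(w, features):
    """E_{x ~ p_w}[Phi(x)] for the Gibbs distribution p_w on a finite domain."""
    scores = [math.exp(sum(a * b for a, b in zip(w, phi))) for phi in features]
    z = sum(scores)
    return [sum(s / z * phi[j] for s, phi in zip(scores, features))
            for j in range(len(w))]

def fit(features, target, lam, lr=0.1, steps=5000):
    """Gradient descent on lam * ||w||^2 - (average log-likelihood)."""
    w = [0.0] * len(target)
    for _ in range(steps):
        means = model_means(w, features)
        grad = [2 * lam * wj - t + e for wj, t, e in zip(w, target, means)]
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))

# Hypothetical domain {0, 1, 2, 3}, features (x, [x is even]).
features = [(float(x), float(x % 2 == 0)) for x in range(4)]
target = [1.125, 0.375]  # hypothetical empirical feature means

norms = []
for lam in (0.0, 0.1, 1.0):
    w = fit(features, target, lam)
    norms.append(norm(w))
    print(lam, w, norms[-1])
```

Larger values of $\lambda$ shrink the weight vector, at the price of matching the empirical feature means less exactly.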

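21 Extensions - Bregman Divergences
Definition: let $F$ be a convex and differentiable function, then the Bregman divergence based on $F$ is defined as
$$B_F(y \,\|\, x) = F(y) - F(x) - (y - x) \cdot \nabla_x F(x).$$
Examples:
Unnormalized relative entropy.
Squared Euclidean distance.
[Figure: graph of $F$ with its tangent $F(x) + (y - x) \cdot \nabla_x F(x)$ at $x$; the divergence is the gap at $y$ between $F(y)$ and the tangent.]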
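Both examples can be recovered from the definition with a generic helper. This is a minimal sketch: the function name `bregman` and the two generating functions are choices made here for illustration, with $F(x) = \|x\|^2$ for the squared Euclidean distance and $F(x) = \sum_i (x_i \log x_i - x_i)$ for the unnormalized relative entropy, on hypothetical strictly positive vectors.

```python
import math

def bregman(F, grad_F, y, x):
    """B_F(y || x) = F(y) - F(x) - (y - x) . grad F(x)."""
    g = grad_F(x)
    return F(y) - F(x) - sum((yi - xi) * gi for yi, xi, gi in zip(y, x, g))

# F(x) = ||x||^2 yields the squared Euclidean distance ||y - x||^2.
def sq_norm(x):
    return sum(xi * xi for xi in x)

def sq_norm_grad(x):
    return [2 * xi for xi in x]

# F(x) = sum_i (x_i log x_i - x_i) yields the unnormalized relative entropy
# sum_i (y_i log(y_i / x_i) - y_i + x_i), for strictly positive vectors.
def neg_entropy(x):
    return sum(xi * math.log(xi) - xi for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) for xi in x]

y, x = [1.0, 2.0, 0.5], [0.5, 1.0, 2.0]  # hypothetical positive vectors
euclid = bregman(sq_norm, sq_norm_grad, y, x)
unnorm_kl = bregman(neg_entropy, neg_entropy_grad, y, x)
print(euclid, unnorm_kl)
```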

22 Conditional Maxent Models
Generalization of Maxent models:
multi-class classification.
conditional probability modeling each class.
different features for each class.
logistic regression: special case of two classes.
also known as multinomial logistic regression.

23 Problem
Data: sample drawn i.i.d. according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{1, \ldots, k\})^m.$$
Multi-label case:
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{-1, +1\}^k)^m.$$
Features: for any $y \in \{1, \ldots, k\}$, feature vector $x \mapsto \Phi(x, y)$.
Problem: find distribution $p$ out of set $P$ that best estimates the probability distributions $\Pr[y \mid x]$, for all classes $y \in \{1, \ldots, k\}$.

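24 Conditional Maxent
Optimization problem: maximizing conditional entropy.
$$\text{minimize} \quad \sum_{y \in [1, k]} \sum_{x \in X} p(y \mid x) \log p(y \mid x)$$
$$\text{subject to} \quad \forall x \in X, \ \sum_{y \in [1, k]} p(y \mid x) = 1,$$
$$\forall j \in [1, N], \ \sum_{x \in X} \sum_{y \in [1, k]} p(y \mid x) \Phi_j(x, y) = \frac{1}{m} \sum_{i=1}^{m} \Phi_j(x_i, y_i).$$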

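25 Equivalent ML Formulation
Optimization problem: maximize conditional log-likelihood.
$$\max_{w \in H} \ \frac{1}{m} \sum_{i=1}^{m} \log p_w[y_i \mid x_i]$$
$$\text{subject to: } \forall (x, y) \in X \times [1, k], \quad p_w[y \mid x] = \frac{1}{Z(x)} \exp(w \cdot \Phi(x, y)), \quad Z(x) = \sum_{y \in Y} \exp(w \cdot \Phi(x, y)).$$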
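The conditional log-likelihood and its gradient can be written out for a small multi-class problem. This is a sketch under stated assumptions: a hypothetical block feature map `phi` that copies `x` into the slot for class `y`, classes labeled `0..K-1` instead of `1..k`, a hypothetical six-point dataset, and plain gradient ascent as the optimizer.

```python
import math

K = 3  # number of classes, labeled 0..K-1 here (the slides use 1..k)

def phi(x, y):
    """Hypothetical joint feature map: copy x into the block for class y."""
    out = [0.0] * (len(x) * K)
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def cond_probs(w, x):
    """p_w[y | x] = exp(w . Phi(x, y)) / Z(x), with Z(x) summing over classes."""
    scores = [sum(a * b for a, b in zip(w, phi(x, y))) for y in range(K)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def avg_log_likelihood(w, data):
    return sum(math.log(cond_probs(w, x)[y]) for x, y in data) / len(data)

def gradient(w, data):
    """(1/m) sum_i [Phi(x_i, y_i) - E_{y ~ p_w[.|x_i]} Phi(x_i, y)]."""
    g = [0.0] * len(w)
    for x, y in data:
        p = cond_probs(w, x)
        for j, v in enumerate(phi(x, y)):
            g[j] += v
        for c in range(K):
            for j, v in enumerate(phi(x, c)):
                g[j] -= p[c] * v
    return [gj / len(data) for gj in g]

# Hypothetical labeled sample S.
data = [([1.0, 0.0], 0), ([0.9, 0.2], 0), ([0.0, 1.0], 1),
        ([0.1, 0.8], 1), ([1.0, 1.0], 2), ([0.9, 1.1], 2)]

w = [0.0] * (2 * K)
before = avg_log_likelihood(w, data)
for _ in range(500):
    w = [wj + 0.5 * gj for wj, gj in zip(w, gradient(w, data))]
after = avg_log_likelihood(w, data)
print(before, after)
```

At `w = 0` every class has probability `1/K`, so the starting value is `log(1/K)`; the ascent steps then increase the average conditional log-likelihood.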

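26 Regularized Conditional Maxent
Optimization problem: maximizing conditional entropy, regularization parameter $\lambda \geq 0$.
$$\min_{w \in H} \ \lambda \|w\|^2 - \frac{1}{m} \sum_{i=1}^{m} \log p_w[y_i \mid x_i]$$
$$\text{subject to: } \forall (x, y) \in X \times [1, k], \quad p_w[y \mid x] = \frac{1}{Z(x)} \exp(w \cdot \Phi(x, y)), \quad Z(x) = \sum_{y \in Y} \exp(w \cdot \Phi(x, y)).$$
Different norms used for regularization.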

27 Logistic Regression (Berkson, 1944)
Logistic model: $k = 2$.
$$\Pr[y \mid x] = \frac{1}{Z(x)} e^{w \cdot \Phi(x, y)}, \quad \text{with } Z(x) = \sum_{y \in [1, k]} e^{w \cdot \Phi(x, y)}.$$
Properties:
linear log-odds ratio (or logit):
$$\log \frac{\Pr[y_2 \mid x]}{\Pr[y_1 \mid x]} = w \cdot (\Phi(x, y_2) - \Phi(x, y_1)).$$
decision rule: sign of log-odds ratio.
$$\Pr[+1 \mid x] = \frac{1}{1 + e^{-w \cdot (\Phi(x, +1) - \Phi(x, -1))}}.$$
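The two-class model and the sigmoid form can be compared numerically. This is a sketch with a hypothetical feature map $\Phi(x, y) = \frac{y}{2} x$ for $y \in \{-1, +1\}$ and hypothetical values of `w` and `x`; any other feature map would work the same way.

```python
import math

def phi(x, y):
    """Hypothetical feature map for y in {-1, +1}: Phi(x, y) = (y / 2) * x."""
    return [0.5 * y * xi for xi in x]

def softmax_prob(w, x, y):
    """Pr[y | x] = exp(w . Phi(x, y)) / Z(x) with two classes."""
    scores = {c: sum(a * b for a, b in zip(w, phi(x, c))) for c in (-1, +1)}
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[y]) / z

def sigmoid_prob(w, x):
    """Pr[+1 | x] = 1 / (1 + exp(-w . (Phi(x, +1) - Phi(x, -1))))."""
    diff = [a - b for a, b in zip(phi(x, +1), phi(x, -1))]
    return 1.0 / (1.0 + math.exp(-sum(wj * d for wj, d in zip(w, diff))))

w, x = [0.4, -1.2], [2.0, 0.5]  # hypothetical weights and input
log_odds = math.log(softmax_prob(w, x, +1) / softmax_prob(w, x, -1))
prediction = +1 if log_odds > 0 else -1  # decision rule: sign of the log-odds
print(softmax_prob(w, x, +1), sigmoid_prob(w, x), log_odds, prediction)
```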

28 References
Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.
J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39, 1944.
Imre Csiszar and Tusnady. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1.
Imre Csiszar. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3).
J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5).
Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), April 1997.

29 References
Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS, School of Computer Science, CMU.
E. Jaynes. Information theory and statistical mechanics. Physical Review, 106, 1957.
E. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
O'Sullivan. Alternating minimization algorithms: From Blahut-Arimoto to expectation-maximization. Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer.
Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer Speech and Language, 10.
