Speech Recognition Lecture 7: Maximum Entropy Models. Mehryar Mohri Courant Institute and Google Research

Size: px

Start display at page:

Download "Speech Recognition Lecture 7: Maximum Entropy Models. Mehryar Mohri Courant Institute and Google Research"

Sibyl McBride
6 years ago
Views:

1 Speech Recognition Lecture 7: Maximum Entropy Models Mehryar Mohri Courant Institute and Google Research

2 This Lecture Information theory basics Maximum entropy models Duality theorem 2

3 Convexity Definition: X R N is said to be convex if for any two points x, y X the segment [x, y] lies in X: {αx +(1 α)y, 0 α 1} X. Definition: let X be a convex set. A function is said to be convex if for all x, y X and α (0, 1), f(αx +(1 α)y) αf(x)+(1 α)f(y). f : X R With a strict inequality, f is said to be strictly convex. f is said to be concave when f is convex. 3

4 Properties of Convex Functions Theorem: let f is convex iff be a differentiable function. Then, f is convex and dom(f) x, y dom(f), f(y) f(x) f(x) (y x). f(y) (x, f(x)) f(x)+ f(x) (y x). Theorem: let f be a twice differentiable function. Then, f is convex iff its Hessian is positive semidefinite: x dom(f), 2 f(x) 0. 4

5 Jensen s Inequality Theorem: let X be a random variable and f a measurable convex function. Then, Proof: f(e[x]) E[f(X)]. For a distribution over a finite set, the property follows directly the definition of convexity. The general case is a consequence of the continuity of convex functions and the density of finite distributions. 5

6 Entropy Definition: the entropy of a random variable with probability distribution X p(x) =Pr[X = x] H(X) = E[log p(x)] = x X p(x)logp(x). is Properties: measure of uncertainty of p(x). H(X) 0. maximal for uniform distribution. For a finite support, by Jensen s inequality: H(X) =E log 1 p(x) 1 log E p(x) =logn. 6

7 Relative Entropy Definition: the relative entropy (or Kullback- Leibler divergence) of two distributions p and q is with D(p q) =E p log p(x) q(x) 0 log 0 q = 0 and p log p 0 =. Properties: assymetric measure of divergence between two distributions. It is convex in p and q. D(p q) 0 for all p and q. = x p(x)log p(x) q(x), D(p q) =0 iff 7 p = q.

8 Non-Negativity of Relative Entropy Treat separately the case q(x)=0 for x supp(p), or use the same proof by extending domain of log: D(p q) =E p log q(x) p(x) q(x) log E p p(x) =log x q(x) =0. 8

9 This Lecture Information theory basics Maximum entropy models Duality theorem 9

10 Problem Data: sample S drawn i.i.d. from set some distribution D, x 1,...,x m X. X according to Problem: find distribution p out of a set P that best estimates D. 10

11 Maximum Likelihood Likelihood: probability of observing sample under distribution p P, which, given the independence assumption is Principle: select distribution maximizing likelihood, or, equivalently Pr[S] = p =argmax p P p =argmax p P m p(x i ). i=1 m i=1 m i=1 p(x i ), log p(x i ). 11

12 Relative Entropy Formulation Empirical distribution: distribution to sample. Lemma: Proof: p S has maximum likelihood iff p =argmin p P D(ˆp p) = ˆp(x)logˆp(x) ˆp(x)logp(x) x x = H(ˆp) x S log p(x) m x = H(ˆp) 1 m log p(x) x S x in S p D(p p). corresponding = H(ˆp) 1 m log Pr S p m[s]. 12

13 Problem Data: sample S drawn i.i.d. from set some distribution D, x 1,...,x m X. Features: associated to elements of X, Φ: X R N. according to Problem: how do we estimate distribution D? Uniform distribution u over X? X 13

14 Features Examples: n-grams, distance-d n-grams. sentence length. number and type of verbs. class-based n-grams, word triggers. various grammatical information (e.g., agreement, POS tags). dialog-level information. 14

15 Hoeffding s Bounds Theorem: let X 1,X 2,...,X m be a sequence of independent Bernoulli trials taking values in [0, 1], then for all >0, the following inequalities hold for : X m = 1 m m i=1 X i Pr X m E[X m ] e 2m2 Pr X m E[X m ] e 2m2. 15

16 Maximum Entropy Principle (E. T. Jaynes, 1957, 1983) For large m, empirical average is a good estimate of the expected values of the features: E [Φ j(x)] E [Φ j (x)], x D x bp j [1,N]. Principle: find distribution that is closest to the uniform distribution u and that preserves the expected values of features. Closeness is measured using relative entropy. 16

17 Maximum Entropy Formulation Distributions: let P = P denote the set of distributions p : E [Φ(x)] = E [Φ(x)]. x p x bp Optimization problem: find distribution verifying p =argmin p P D(p u). p 17

18 Relation with Entropy Maximization Relationship with entropy: D(p u) = x X p(x)log p(x) 1/ X =log X + x X p(x)logp(x) =log X H(p). Optimization problem: minimize x X p(x)logp(x) subject to p(x) 0, x X, p(x) =1, x X p(x)φ(x) = 1 m m Φ(x i ). x X i=1 18

19 Maximum Likelihood Gibbs Distrib. Gibbs distributions: set by of distributions p defined p(x) = 1 Z exp w Φ(x) = 1 Z exp N with Z = x Maximum likelihood Gibbs distribution: p =argmax q Q exp w Φ(x). m i=1 Q where Q is the closure of Q. j=1 log q(x i )=argmin q Q w j Φ j (x), D(ˆp q), 19

20 Duality Theorem Theorem: assume that D(ˆp u)<. Then, there exists a unique probability distribution p satisfying 1. p P Q; 2. D(p q)=d(p p )+D(p q) for any and q Q (Pythagorean equality); 3. p =argmind(ˆp q) (maximum likelihood); q Q 4. p =argmind(p u) (maximum entropy). p P Each of these properties determines (Della Pietra et al., 1997) p p P uniquely. 20

21 Regularization Relaxation: sample size too small to impose equality constraints. Instead, L 1 constraints: E [Φ j (x)] E [Φ j (x)] β j, x D x bp L 2 constraints: E [Φ(x)] E [Φ(x)] Λ 2. x D x bp 2 j [1,N]. 21

22 L1 Maxent Optimization problem: arginf w R N N β j w j 1 m j=1 with p w [x] = 1 Z exp w Φ(x). m i=1 log p w [x i ], Bayesian interpretation: Laplacian prior. with p(w) = p w [x] p(w) p w [x] N β j 2 exp( β j w j ). j=1 22

23 L2 Maxent Optimization problem: arginf w R N λw 2 1 m with p w [x] = 1 Z exp w Φ(x). m i=1 log p w [x i ]. with Bayesian interpretation: Gaussian prior. p(w) = p w [x] p(w) p w [x] N 1 exp w2 j 2πσ 2 2σ 2 j=1. 23

24 Extensions - Bregman Divergences Definition: let F be a convex and differentiable function, then the Bregman divergence based on is defined as F B F (y, x) =F (y) F (x) (y x) x F (x). Examples: Unnormalized relative entropy. Euclidean distance. F (y) F (x)+(y x) x F (x) x y Mehryar Mohri - Introduction to Machine Learning 24

25 This Lecture Information theory basics Maximum entropy models Duality theorem 25

26 References Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, (22-1), March 1996; Berkson, J. (1944). Application of the logistic function to bio-assay. Journal of the American Statistical Association 39, Imre Csiszar and Tusnady. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1, , Imre Csiszar. A geometric interpretation of Darroch and Ratchliff's generalized iterative scaling. The Annals of Statisics, 17(3), pp J. Darroch and D. Ratchliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5), pp , Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19:4, pp , April,

27 References Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS , School of Computer Science, CMU, E. Jaynes. Information theory and statistical mechanics. Physics Reviews, 106: , E. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, O'Sullivan. Alternating minimzation algorithms: From Blahut-Arimoto to expectationmaximization. Codes, Curves and Signals: Common Threads in Communications, A. Vardy, (editor), Kluwer, Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language 10: ,

Introduction to Machine Learning Lecture 14. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 14 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Density Estimation Maxent Models 2 Entropy Definition: the entropy of a random variable