Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2)


Lectures on Machine Learning (Fall 2017)
Hyeong In Choi, Seoul National University

Topics to be covered:

- Exponential family of distributions
- Mean and (canonical) link functions
- Convexity of the log partition function
- Generalized linear model (GLM)
- Various GLM models

1 Exponential family of distributions

In this section, we study a family of probability distributions called the exponential family (of distributions). It is of a special form, but most, if not all, of the well-known probability distributions belong to this class.

1.1 Definition

Definition 1. A probability distribution (PDF or PMF) is said to belong to the exponential family of distributions (in natural or canonical form) if it is of the form

    P_θ(y) = (1/Z(θ)) h(y) e^{θ·T(y)},   (1)

where y = (y_1, ..., y_m) is a point in R^m; θ = (θ_1, ..., θ_k) ∈ R^k is a parameter called the canonical (natural) parameter; T : R^m → R^k is a map T(y) = (T_1(y), ..., T_k(y)); and

    Z(θ) = ∫ h(y) e^{θ·T(y)} dy

is called the partition function, while its logarithm, A(θ) = log Z(θ), is called the log partition (cumulant) function.

Remark. In this lecture and throughout this course, the dot notation, as in θ·T(y), always means the inner (dot) product of two vectors.

Equivalently, P_θ(y) can be written in the form

    P_θ(y) = exp[ θ·T(y) − A(θ) + C(y) ],   (2)

where C(y) = log h(y). More generally, one sometimes introduces an extra parameter φ, called the dispersion parameter, to control the shape of P_θ(y), via

    P_θ(y) = exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ].

1.2 Standing assumption

In our discussion of the exponential family of distributions, we always assume the following:

- In case P_θ(y) is a probability density function (PDF), it is assumed to be continuous as a function of y; this means there is no singular part in the probability measure P_θ(y).
- In case P_θ(y) is a probability mass function (PMF), the discrete set of values on which P_θ(y) is supported is the same for all θ.

If P_θ(y) satisfies the relevant condition, we say it is regular, which is always assumed throughout this course.
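As a quick numerical illustration of Definition 1 (an added sketch, with an arbitrary parameter value), the following computes Z(θ) and A(θ) by quadrature for one concrete choice of h and T, namely h(y) = I(y ≥ 0) and T(y) = −y, which anticipates the exponential-distribution example in the next subsection.

```python
import numpy as np
from scipy.integrate import quad

# Concrete choice: h(y) = I(y >= 0), T(y) = -y, so the integrand is e^{-theta*y} on [0, inf).
theta = 1.7                          # arbitrary example value of the canonical parameter
Z, _ = quad(lambda y: np.exp(-theta * y), 0, np.inf)
A = np.log(Z)                        # log partition (cumulant) function A(theta) = log Z(theta)

print(Z, 1 / theta)                  # numerical Z(theta) agrees with the closed form 1/theta
print(A, -np.log(theta))             # hence A(theta) = -log(theta)
```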

Remark. Sometimes people use the more general form of P_θ(y),

    P_θ(y) = (1/Z(θ)) h(y) e^{η(θ)·T(y)}.

But most of the time the same results can be obtained without the use of the general form η(θ). So it is not much of a loss of generality to stick to our convention of using just θ.

1.3 Examples

Let us now look at a few illustrative examples.

(1) Bernoulli(µ)

The Bernoulli distribution is perhaps the simplest member of the exponential family. Let Y be a random variable taking its binary value in {0, 1}, and let µ = P[Y = 1]. Its distribution (in fact, the probability mass function, PMF) is then succinctly written as P(y) = µ^y (1−µ)^{1−y}. Then,

    P(y) = µ^y (1−µ)^{1−y}
         = exp[ y log µ + (1−y) log(1−µ) ]
         = exp[ y log( µ/(1−µ) ) + log(1−µ) ].

Letting T(y) = y and θ = log( µ/(1−µ) ), and recalling the definitions of the logit and sigmoid functions given in Lecture 3, we have

    logit(µ) = log( µ/(1−µ) ).

Thus

    θ = logit(µ).   (3)

Its inverse function is the sigmoid function:

    µ = σ(θ),   (4)

where σ(θ) = 1/(1 + e^{−θ}). Therefore, we have

    A(θ) = −log(1−µ) = log(1 + e^θ).   (5)

Thus P_θ(y), written in canonical form as in (2), becomes

    P_θ(y) = exp[ θy − log(1 + e^θ) ].
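Here is a minimal illustrative sketch of (3)-(5) (the value µ = 0.3 is arbitrary): the canonical form exp[θy − A(θ)] reproduces µ^y (1−µ)^{1−y}.

```python
import numpy as np

def bernoulli_pmf(y, mu):
    # Standard form: mu^y (1 - mu)^(1 - y).
    return mu**y * (1.0 - mu)**(1 - y)

def bernoulli_canonical(y, theta):
    # Canonical form: exp[theta*y - A(theta)] with A(theta) = log(1 + e^theta).
    A = np.log1p(np.exp(theta))
    return np.exp(theta * y - A)

mu = 0.3                        # arbitrary example value
theta = np.log(mu / (1 - mu))   # theta = logit(mu), as in (3)
for y in (0, 1):
    print(y, bernoulli_pmf(y, mu), bernoulli_canonical(y, theta))
# The two columns agree for y = 0 and y = 1.
```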

(2) Exponential distribution

The exponential distribution models the waiting time between independent arrivals. Its distribution (the probability density function, PDF) is given as

    P_θ(y) = θ e^{−θy} I(y ≥ 0).

To put it in the exponential family form, we use the same θ as the canonical parameter and let T(y) = −y and h(y) = I(y ≥ 0). Since

    Z(θ) = 1/θ = ∫ e^{−θy} I(y ≥ 0) dy,

it is already in the canonical form given in (1).

(3) Normal distribution

The normal (Gaussian) distribution, given by

    P(y) = (1/√(2πσ^2)) exp[ −(y−µ)^2 / (2σ^2) ],

is the single most well-known distribution. As far as its relation to the exponential family is concerned, there are two views. (Here, this σ is a number, not the sigmoid function.)

1st view (σ^2 as a dispersion parameter)

This is the case when the main focus is the mean. In this case the variance σ^2 is regarded as known, or as a parameter that can be fiddled with as if known. Writing P(y) in the form

    P_θ(y) = exp[ ( −(1/2)y^2 + µy − (1/2)µ^2 ) / σ^2 − (1/2) log(2πσ^2) ],

one can see right away that it is in the exponential family (dispersion) form if we set

    θ = (θ_1, θ_2) = (µ, −1)
    T(y) = (y, (1/2)y^2)
    φ = σ^2
    A(θ) = (1/2)µ^2 = (1/2)θ_1^2
    C(y, φ) = −(1/2) log(2πσ^2).

2nd view (φ = 1)

When both µ and σ are parameters to be treated as unknown, we take this point of view. Here we set the dispersion parameter φ = 1. Writing out P_θ(y), we have

    P_θ(y) = exp[ −(1/(2σ^2)) y^2 + (µ/σ^2) y − (1/(2σ^2)) µ^2 − log σ − (1/2) log 2π ].

Thus it is easy to see the following:

    T(y) = (y, (1/2)y^2)
    θ = (θ_1, θ_2) = ( µ/σ^2, −1/σ^2 )
    A(θ) = µ^2/(2σ^2) + log σ = −θ_1^2/(2θ_2) − (1/2) log(−θ_2)
    C(y) = −(1/2) log(2π).
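As a sanity check on the 2nd view (an added sketch; the values µ = 1.5 and σ = 2.0 are arbitrary), the canonical form exp[θ·T(y) − A(θ) + C(y)] should reproduce the N(µ, σ^2) density:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0                                    # arbitrary example values
theta1, theta2 = mu / sigma**2, -1.0 / sigma**2         # canonical parameters of the 2nd view
A = -theta1**2 / (2 * theta2) - 0.5 * np.log(-theta2)   # log partition function
C = -0.5 * np.log(2 * np.pi)

y = np.linspace(-5.0, 8.0, 7)
T1, T2 = y, 0.5 * y**2                                  # sufficient statistics T(y)
expfam = np.exp(theta1 * T1 + theta2 * T2 - A + C)
print(np.allclose(expfam, norm.pdf(y, loc=mu, scale=sigma)))   # True
```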

1.4 Properties of the exponential family

The log partition function A(θ) plays a key role, so let us now look at it more carefully. First, since

    ∫ P_θ(y) dy = ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] dy = 1,

taking ∇_θ of both sides, we have

    ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] (T(y) − ∇_θ A(θ))/φ dy = 0.

Multiplying through by φ and rearranging,

    ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T(y) dy
        = ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] ∇_θ A(θ) dy
        = ∇_θ A(θ) ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] dy.

Since

    ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] dy = 1,

we have the following:

Proposition 1. ∇_θ A(θ) = ∫ P_θ(y) T(y) dy = E[T(Y)], where Y is the random variable with distribution P_θ(y).

Writing this componentwise, we have

    ∂A/∂θ_i = ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T_i(y) dy.

Taking the second partial derivative, we have

    ∂^2A/∂θ_i∂θ_j
        = (1/φ) ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] { T_j(y) − ∂A(θ)/∂θ_j } T_i(y) dy
        = (1/φ) ( ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T_i(y) T_j(y) dy
                  − ∂A(θ)/∂θ_j ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T_i(y) dy ).

Using Proposition 1 once more, we have

    ∂^2A/∂θ_i∂θ_j = (1/φ) { E[T_i(Y) T_j(Y)] − E[T_i(Y)] E[T_j(Y)] }
                  = (1/φ) E[ (T_i(Y) − E[T_i(Y)]) (T_j(Y) − E[T_j(Y)]) ]
                  = (1/φ) Cov(T_i(Y), T_j(Y)).

Therefore we have the following result on the Hessian matrix of A.

Proposition 2. D^2_θ A = ( ∂^2A/∂θ_i∂θ_j ) = (1/φ) Cov(T(Y), T(Y)) = (1/φ) Cov(T(Y)), where Y is the random variable with distribution P_θ(y). Here, Cov(T(Y), T(Y)) denotes the covariance matrix of T(Y) with itself, which is sometimes called the variance matrix.

Since the covariance matrix is always positive semi-definite, we have

Corollary 1. A(θ) is a convex function of θ.

Remark. In most exponential family models, Cov(T(Y)) is positive definite, in which case A(θ) is strictly convex. In this lecture and throughout the course, we always assume that Cov(T(Y)) is positive definite and therefore that A(θ) is strictly convex.
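The following Monte Carlo sketch (an added illustration; the rate θ = 2.0 and the sample size are arbitrary) checks Propositions 1 and 2 for the exponential distribution of Section 1.3, where T(y) = −y, A(θ) = log Z(θ) = −log θ and φ = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                                  # arbitrary canonical parameter (rate)
samples = rng.exponential(scale=1.0 / theta, size=200_000)
T = -samples                                 # sufficient statistic T(y) = -y

# Proposition 1: A'(theta) = E[T(Y)], with A(theta) = -log(theta).
print(-1.0 / theta, T.mean())                # both approximately -0.5

# Proposition 2 (phi = 1): A''(theta) = Var(T(Y)).
print(1.0 / theta**2, T.var())               # both approximately 0.25
```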

1.5 Maximum likelihood estimation

Let D = {y^(i)}_{i=1}^N be a given data set of IID samples, where y^(i) ∈ R^m. Then its likelihood function L(θ) and log likelihood function l(θ) are given by

    L(θ) = ∏_{i=1}^N exp[ (θ·T(y^(i)) − A(θ))/φ + C(y^(i), φ) ]

    l(θ) = (1/φ) { Σ_{i=1}^N θ·T(y^(i)) − N A(θ) } + Σ_{i=1}^N C(y^(i), φ).

Since A(θ) is strictly convex, l(θ) has a unique maximum at θ̂, which is the unique solution of ∇_θ l(θ) = 0. Note that

    ∇_θ l(θ) = (1/φ) { Σ_{i=1}^N T(y^(i)) − N ∇_θ A(θ) }.

So we have the following:

Proposition 3. There is a unique θ̂ that maximizes l(θ), at which

    ∇_θ A(θ) |_{θ = θ̂} = (1/N) Σ_{i=1}^N T(y^(i)).

1.5.1 Example: Bernoulli distribution

Let us apply the above argument to the Bernoulli distribution Bernoulli(µ) and see where it leads. First, differentiating A(θ) as in (5), we get

    A'(θ) = 1/(1 + e^{−θ}) = σ(θ).

Thus by Proposition 3 we have a unique θ̂ satisfying

    σ(θ̂) = (1/N) Σ_{i=1}^N T(y^(i)).

Since T(y) = y, and y takes on 0 or 1 as its value, we have T(y) = I(y = 1). Defining µ̂ = σ(θ̂) as in (4), we then have

    µ̂ = σ(θ̂) = (1/N) Σ_{i=1}^N I(y^(i) = 1) = (number of 1's in the sample {y^(i)}) / N.

This confirms the usual estimate of µ.
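A tiny sketch of this estimate (the true µ = 0.7 and the sample size are made up), solving σ(θ̂) = (1/N) Σ T(y^(i)) as in Proposition 3:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = 0.7                              # made-up value
y = rng.binomial(1, mu_true, size=1000)    # IID Bernoulli samples

mu_hat = y.mean()                          # fraction of 1's = (1/N) sum of T(y^(i))
theta_hat = np.log(mu_hat / (1 - mu_hat))  # invert sigma: theta_hat = logit(mu_hat)

print(mu_hat, 1 / (1 + np.exp(-theta_hat)))   # sigma(theta_hat) recovers mu_hat
```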

Remark. T is a sufficient statistic. It is well known that the statistic S(y_1, ..., y_n) = Σ_{i=1}^n T(y^(i)) is sufficient, which means that θ̂ is constant on each level set {(y_1, ..., y_n) : S(y_1, ..., y_n) = const}. This provides many advantages in estimating θ, as one need not look inside the level sets.

1.6 Maximum entropy distribution

The exponential family is also useful in that it arises as the most random distribution under moment constraints. Let Y be a random variable with an unknown distribution P(y). Suppose the expected values of certain features f_1(y), ..., f_k(y) are nonetheless known. Namely,

    ∫ f_i(y) P(y) dy = C_i   (6)

is known for some given constant C_i, for i = 1, ..., k. The question is how to fix P(y) so that it is as random as possible while satisfying the above constraints. One approach is to construct a maximum entropy distribution satisfying the constraints (6). The entropy is a measure of randomness: in general, the higher it is, the more random the distribution is deemed. The definition of entropy in the discrete case is

    H = −Σ_y P(y) log P(y);

in the continuous case, the sum becomes an integral, so that

    H = −∫ P(y) log P(y) dy.

We will use the integral notation here without loss of generality. To solve the problem, we use the Lagrange multiplier method to set up the following variational problem of minimizing J(P, λ):

    J(P, λ) = ∫ P(y) log P(y) dy + λ_0 ( 1 − ∫ P(y) dy ) + Σ_{i=1}^k λ_i ( C_i − ∫ f_i(y) P(y) dy ).

The solution is found as a critical point. Namely, we set the variational (functional, or Fréchet) derivative equal to 0:

    δJ/δP = 1 + log P(y) − λ_0 − Σ_{i=1}^k λ_i f_i(y) = 0.   (7)

To see where this comes from, let η be any smooth function with compact support and define

    J(P + ɛη, λ) = ∫ (P + ɛη) log(P + ɛη) dy + λ_0 ( 1 − ∫ (P + ɛη) dy ) + Σ_{i=1}^k λ_i ( C_i − ∫ f_i (P + ɛη) dy ).

Taking the derivative with respect to ɛ at ɛ = 0 and setting it equal to 0, we get

    d/dɛ J(P + ɛη, λ) |_{ɛ=0} = ∫ (η + η log P) dy − λ_0 ∫ η dy − Σ_{i=1}^k λ_i ∫ f_i η dy
                              = ∫ ( 1 + log P − λ_0 − Σ_{i=1}^k λ_i f_i ) η dy = 0.

Since the above formula is true for any η, we have

    1 + log P − λ_0 − Σ_{i=1}^k λ_i f_i = 0,

which is (7). Solving (7) for P(y), we have

    P(y) = (1/Z) exp( Σ_{i=1}^k λ_i f_i(y) ),

where Z = e^{1−λ_0}. Now

    1 = ∫ P(y) dy = (1/Z) ∫ exp( Σ_{i=1}^k λ_i f_i(y) ) dy.

Thus

    Z = ∫ exp( Σ_{i=1}^k λ_i f_i(y) ) dy.

Therefore P(y) belongs to the exponential family of distributions. This kind of distribution is in general called a Gibbs distribution.
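The following sketch illustrates this on a finite support {0, ..., 9} with a single feature f(y) = y and a made-up target mean C = 3.0: the maximum entropy solution has the Gibbs form p(y) ∝ exp(λ y), and the multiplier λ is chosen numerically so that the constraint holds.

```python
import numpy as np
from scipy.optimize import brentq

ys = np.arange(10)      # finite support (made-up)
C = 3.0                 # target mean E[f(Y)] with f(y) = y (made-up)

def gibbs(lam):
    # Gibbs / exponential-family form: p(y) = exp(lam * f(y)) / Z(lam).
    w = np.exp(lam * ys)
    return w / w.sum()

# Pick the Lagrange multiplier so that the mean constraint is satisfied.
lam = brentq(lambda l: gibbs(l) @ ys - C, -10.0, 10.0)
p = gibbs(lam)

print(p @ ys)                      # ~3.0: the constraint (6) holds
print(-(p * np.log(p)).sum())      # entropy of the resulting maximum entropy distribution
```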

2 Generalized linear model (GLM)

2.1 Mean parameter and canonical link function

Before we embark on the generalized linear model, we need the following elementary fact.

Lemma 1. Let f : R^k → R be a strictly convex C^1 function. Then ∇_x f : R^k → R^k is an invertible function.

Proof. Let p ∈ R^k be any given vector. By the strict convexity and the absence of corners (the C^1 condition), there is a unique hyperplane H in R^{k+1} that is tangent to the graph of y = f(x) and has a normal vector of the form (p, −1). Let (x, y) be the point of contact. Since the graph is the zero set of F(x, y) = f(x) − y, its normal vector at (x, y) is (∇f(x), −1). Therefore we must have p = ∇f(x).

[Figure 1: Graph of a strictly convex function with a tangent plane]

Let us now look at the exponential family

    P_θ(y) = exp[ θ·T(y) − A(θ) + C(y) ],   (8)

where the dispersion parameter is set to 1 for the sake of simplicity of presentation. (The argument does not change much even in the presence of the dispersion parameter.) Recall that Proposition 1 says that ∇_θ A(θ) = E[T(Y)]. We always assume that the log partition function A : R^k → R is strictly convex and C^1. Thus, by the above lemma, ∇_θ A : R^k → R^k is invertible. We now set up a few pieces of terminology.

Definition 2. The mean parameter µ is defined as µ = E[T(Y)] = ∇_θ A(θ), and the function ∇_θ A : R^k → R^k is called the mean function.

Since ∇_θ A is invertible, we may define the following.

Definition 3. The inverse function (∇_θ A)^{−1} of ∇_θ A is called the canonical link function and is denoted by ψ. Thus

    θ = ψ(µ) = (∇_θ A)^{−1}(µ).

Therefore, combining the above definitions, the mean function can also be written as

    µ = ψ^{−1}(θ) = ∇_θ A(θ).

Example: Bernoulli(µ)

Recall that for Bernoulli(µ), P(y) is given by

    P(y) = µ^y (1−µ)^{1−y}
         = exp[ y log( µ/(1−µ) ) + log(1−µ) ]
         = exp[ θy − log(1 + e^θ) ]
         = exp[ θy − A(θ) ],

where θ = log( µ/(1−µ) ) and A(θ) = log(1 + e^θ). Thus

    µ = sigmoid(θ) = A'(θ)
    θ = logit(µ) = (A')^{−1}(µ).

Therefore, for the Bernoulli distribution, the sigmoid function is the mean function and the logit function is the canonical link function.
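As a small sketch of Definitions 2 and 3 (the value θ = 0.8 is arbitrary): starting only from A(θ) = log(1 + e^θ), one can recover the mean function by numerical differentiation and the canonical link by numerical inversion, and they agree with the sigmoid and the logit.

```python
import numpy as np
from scipy.optimize import brentq

A = lambda t: np.log1p(np.exp(t))            # Bernoulli log partition function

def mean_fn(theta, h=1e-6):
    # Mean function mu = A'(theta), via a central finite difference.
    return (A(theta + h) - A(theta - h)) / (2 * h)

def link_fn(mu):
    # Canonical link psi = (A')^{-1}, by numerically inverting the mean function.
    return brentq(lambda t: mean_fn(t) - mu, -30.0, 30.0)

theta = 0.8                                  # arbitrary example value
mu = mean_fn(theta)
print(mu, 1 / (1 + np.exp(-theta)))          # mean function vs. sigmoid
print(link_fn(mu), np.log(mu / (1 - mu)))    # canonical link vs. logit; both recover theta
```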

2.2 Conditional probability in GLM

From now on in this lecture, to simplify notation, we use ω to stand for both w and b, and extend x ∈ R^d to x ∈ R^{d+1} by adding x_0 = 1. (In the previous lectures, we used θ as a generic term to represent both w and b. But in the exponential family notation, θ is reserved for the canonical parameter, so we are forced to use ω. One more note: although the typography is difficult to discern, this ω is the Greek lowercase omega, not the English w.)

2.2.1 GLM recipe

With this notation, writing ω = (b, w_1, ..., w_d), the linear expression in (1) of Lecture 3 becomes ω·x, and thus the conditional probability of Lecture 3 is written as

    P(y = 1 | x) = e^{ω·x} / (1 + e^{ω·x})
    P(y = 0 | x) = 1 / (1 + e^{ω·x}),

which can be simplified as

    P(y | x) = e^{(ω·x)y} / (1 + e^{ω·x}),  for y ∈ {0, 1}.

Using (5), this is easily seen to be equivalent to the following conditional probability model:

    P(y | x) = exp[ (ω·x)y − A(ω·x) ] = exp[ θy − A(θ) ].

Summarizing this process, we have the following:

GLM Recipe: P(y | x) is obtained from P(y) by replacing the canonical parameter θ in (8) with a linear expression in x. In the binary case, the linear expression is ω·x.

The multiclass case mimics this in exactly the same way. To describe it correctly, first define the parameter matrix

    W = [ (ω^1)^T ; ... ; (ω^i)^T ; ... ; (ω^k)^T ],   (9)

i.e., the k × (d+1) matrix whose i-th row is (ω^i)^T = (ω_{i0}, ..., ω_{ij}, ..., ω_{id}). Then replace θ in (8) with W x to get the conditional probability

    P(y | x) = exp[ (W x)·T(y) − A(W x) + C(y) ].   (10)

Written this way, we can see that

    (W x)·T(y) = Σ_{l=1}^k Σ_{j=0}^d ω_{lj} x_j T_l(y).

Now, let D = {(x^(i), y^(i))}_{i=1}^N be given data, where the x-part of the i-th data point is represented by the column vector x^(i) = [x^(i)_0, ..., x^(i)_d]^T. Using the conditional probability as in (10), one can write the likelihood function and the log likelihood function as

    L(W) = ∏_{i=1}^N exp[ (W x^(i))·T(y^(i)) − A(W x^(i)) + C(y^(i)) ]

    l(W) = log L(W) = Σ_{i=1}^N [ (W x^(i))·T(y^(i)) − A(W x^(i)) + C(y^(i)) ]
         = Σ_{i=1}^N [ Σ_{l=1}^k Σ_{j=0}^d ω_{lj} x^(i)_j T_l(y^(i)) − A(ω^1·x^(i), ..., ω^k·x^(i)) + C(y^(i)) ].

Thus, taking the derivative with respect to the entry ω_{rs} and using ∂(ω^l·x^(i))/∂ω_{rs} = δ_{lr} x^(i)_s, one gets

    ∂l(W)/∂ω_{rs} = Σ_{i=1}^N [ Σ_{l=1}^k Σ_{j=0}^d x^(i)_j δ_{lr} δ_{js} T_l(y^(i)) − Σ_{l=1}^k (∂A/∂θ_l)(W x^(i)) δ_{lr} x^(i)_s ]
                  = Σ_{i=1}^N [ x^(i)_s T_r(y^(i)) − (∂A/∂θ_r)(W x^(i)) x^(i)_s ]
                  = Σ_{i=1}^N [ T_r(y^(i)) − (∂A/∂θ_r)(W x^(i)) ] x^(i)_s.   (11)

This is a generalized form of what we got for the logistic regression (see (10) of Lecture 3).
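To make (11) concrete, here is a gradient ascent sketch for the binary case: k = 1, T(y) = y, A(θ) = log(1 + e^θ), so ∂A/∂θ = σ(θ) and the gradient reduces to Σ_i [y_i − σ(ω·x^(i))] x^(i). The data, step size, and iteration count below are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary data; x is extended with x_0 = 1.
N, d = 500, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
omega_true = np.array([-0.5, 2.0, -1.0])            # made-up "true" parameters
sigmoid = lambda t: 1 / (1 + np.exp(-t))
y = rng.binomial(1, sigmoid(X @ omega_true))

# Gradient ascent on l(omega); by (11) the gradient is sum_i [y_i - sigmoid(omega . x_i)] x_i.
omega = np.zeros(d + 1)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ omega)) / N
    omega += lr * grad

print(omega)      # roughly recovers omega_true
```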

2.3 Multiclass classification (softmax regression)

We now look at the multiclass classification problem. The formula (11) is the derivative formula with which the gradient descent algorithm can be used. But for multiclass regression, the categorical distribution has the redundancy we talked about in Lecture 3. As usual, let y_i = I(y = i) and let Prob[Y = i] = µ_i. Then we must have y_1 + ... + y_k = 1 and µ_1 + ... + µ_k = 1. Thus the probability (PMF) can be written as

    P(y) = µ_1^{y_1} µ_2^{y_2} ⋯ µ_k^{y_k}.

Rewrite P(y) as

    P(y) = µ_1^{y_1} ⋯ µ_{k−1}^{y_{k−1}} µ_k^{ 1 − Σ_{i=1}^{k−1} y_i }
         = exp[ y_1 log µ_1 + ⋯ + y_{k−1} log µ_{k−1} + ( 1 − Σ_{i=1}^{k−1} y_i ) log µ_k ]
         = exp[ Σ_{i=1}^{k−1} y_i log( µ_i / µ_k ) + log µ_k ].

Note that when k = 2, this is exactly the Bernoulli distribution. Define θ_i = log(µ_i/µ_k) and T_i(y) = y_i for i = 1, ..., k−1. Using the facts that µ_i = µ_k e^{θ_i} and 1 − µ_k = Σ_{i=1}^{k−1} µ_i, and solving for µ_k and then for µ_j, we get

    µ_k = 1 / ( 1 + Σ_{i=1}^{k−1} e^{θ_i} )

    µ_j = e^{θ_j} / ( 1 + Σ_{i=1}^{k−1} e^{θ_i} ),   (12)

for j = 1, ..., k−1. The expression on the right-hand side of (12) is called the generalized sigmoid (softmax) function. Therefore P(y) can be written in the exponential family form as

    P_θ(y) = exp[ θ·T(y) − A(θ) ],

where

    θ = (θ_1, ..., θ_{k−1}),   T(y) = (y_1, ..., y_{k−1}),

and

    A(θ) = −log µ_k = log( 1 + Σ_{i=1}^{k−1} e^{θ_i} ).

Note that the mean parameter µ_i is given as

    µ_i = E[T_i(Y)] = E[Y_i] = Prob[Y_i = 1] = Prob[Y = i],

which again can be calculated from the fact µ = ∇_θ A. Indeed, one can verify that

    ∂A/∂θ_j = e^{θ_j} / ( 1 + Σ_{i=1}^{k−1} e^{θ_i} ) = µ_j.

We have shown above that θ_j = log(µ_j/µ_k). Thus for j = 1, ..., k−1, we have

    θ_j = log( µ_j / ( 1 − Σ_{i=1}^{k−1} µ_i ) ).

The expression on the right side of the above equation is called the generalized logit function.
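A small sketch of the generalized sigmoid (softmax) mean function (12) and the generalized logit link in the (k−1)-parameterization; the values of θ below are arbitrary and correspond to k = 4 classes.

```python
import numpy as np

def softmax_mean(theta):
    # Mean function (12): mu_j = exp(theta_j) / (1 + sum_i exp(theta_i)), j = 1, ..., k-1.
    e = np.exp(theta)
    return e / (1.0 + e.sum())

def generalized_logit(mu):
    # Canonical link: theta_j = log(mu_j / mu_k), with mu_k = 1 - sum_j mu_j.
    mu_k = 1.0 - mu.sum()
    return np.log(mu / mu_k)

theta = np.array([0.2, -1.0, 0.7])     # arbitrary values; k = 4 classes
mu = softmax_mean(theta)
print(mu, 1.0 - mu.sum())              # mu_1, ..., mu_{k-1} and mu_k; they sum to 1
print(generalized_logit(mu))           # recovers theta
```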

2.4 Probit regression

Recall that the essence of the GLM for the exponential family is the way it links the (outside) features x_1, ..., x_d to the probability model. To be specific, given the probability model (8), the linking was done by setting θ = ψ(µ) = ω·x, so that the conditional probability becomes

    P(y | x) = exp[ (ω·x)·T(y) − A(ω·x) + C(y) ].

But there is no a priori reason why µ has to be related to ω·x via the canonical link function only. One may use any function as long as it is invertible. So set g(µ) = ω·x, where g : R^k → R^k is an invertible function. We call such a g a link function, as opposed to the canonical link function, and its inverse g^{−1} the mean function. So we have

    g(µ) = ω·x : link function
    µ = g^{−1}(ω·x) : mean function.

Let Φ(t) be the cumulative distribution function (CDF) of the standard normal distribution N(0, 1), i.e.,

    Φ(t) = (1/√(2π)) ∫_{−∞}^t e^{−s^2/2} ds.

Probit regression deals with the following situation: the output y is binary, taking values in {0, 1}, and its probability model is Bernoulli. So P(y) = µ^y (1−µ)^{1−y}, where µ = P(y = 1). Its link to the outside variables (features) is via the link function g = Φ^{−1}. Thus

    g(µ) = ω·x = ω_0 + ω_1 x_1 + ... + ω_d x_d
    µ = g^{−1}(ω·x) = Φ(ω·x).   (13)

Therefore,

    P(y | x) = Φ(ω·x)^y ( 1 − Φ(ω·x) )^{1−y}.

Since this is still a Bernoulli, hence an exponential family distribution, the general machinery of GLM can be employed. But in probit regression we change tack and take a different approach. First, we let y ∈ {−1, 1}. Then, since

    µ = E[Y] = 1·P(y = 1) + (−1)·P(y = −1) = P(y = 1) − [ 1 − P(y = 1) ],

we have µ = 2P(y = 1) − 1. Thus P(y = 1) = (1/2)(1 + µ) and P(y = −1) = (1/2)(1 − µ). Therefore,

    P(y) = (1/2)(1 + yµ).

Thus, applying the link (13) to this mean parameter µ = E[Y], we have

    P(y | x) = (1/2)[ 1 + y Φ(ω·x) ].

Now let the data D = {(x_i, y_i)}_{i=1}^n be given. Then the log likelihood function is

    l(ω) = Σ_{i=1}^n log P(y_i | x_i) = Σ_{i=1}^n log( (1/2)[ 1 + y_i Φ(ω·x_i) ] ).

Thus

    ∂l(ω)/∂ω_k = Σ_{i=1}^n y_i Φ'(ω·x_i) x_{ik} / ( 1 + y_i Φ(ω·x_i) ),

which is in a neat form for applying the numerical methods introduced in Lecture 3, where

    Φ'(ω·x) = (1/√(2π)) e^{−(ω·x)^2/2}.
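Below is a minimal sketch of this log likelihood and its gradient, using scipy.stats.norm for Φ and Φ' and plain gradient ascent in place of the methods of Lecture 3; the labels are in {−1, +1}, x is extended with x_0 = 1, and the data, step size, and iteration count are made up.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Synthetic data generated from the model above: P(y=+1|x) = (1 + Phi(omega . x)) / 2.
n, d = 400, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
omega_true = np.array([0.3, 1.0, -0.8])               # made-up parameters
p_plus = 0.5 * (1 + norm.cdf(X @ omega_true))
y = np.where(rng.uniform(size=n) < p_plus, 1, -1)

def log_likelihood(omega):
    return np.sum(np.log(0.5 * (1 + y * norm.cdf(X @ omega))))

def gradient(omega):
    z = X @ omega
    w = y * norm.pdf(z) / (1 + y * norm.cdf(z))       # per-sample factor from the derivative above
    return X.T @ w

omega = np.zeros(d + 1)
for _ in range(3000):
    omega += 0.01 * gradient(omega) / n               # plain gradient ascent step

print(omega, log_likelihood(omega))
```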

Comments.

(1) When one uses a link function which is not canonical, the probability model does not even have to belong to the exponential family. As long as one can get hold of the conditional distribution P(y | x), one is in business: with it, one can form the likelihood function and the rest of the maximum likelihood machinery.

(2) One may not even need the probability distribution, as long as one can somehow estimate ω so that the decisions can be made.

(3) Historically, the GLM was first developed for the exponential family but was later extended to non-exponential families and even to cases where the distribution is not completely known.

2.5 GLM for other distributions

2.5.1 GLM for Poisson(µ)

Recall that the Poisson distribution is the probability distribution (PMF) on the non-negative integers given by

    P(n) = e^{−µ} µ^n / n!,

for n = 0, 1, 2, .... It is an exponential family distribution, as it can be written as

    P(y) = e^{−µ} µ^y / y! = exp{ y log µ − µ − log(y!) },

where y = 0, 1, 2, .... It is put in the exponential family form by setting

    θ = log µ : canonical link function
    µ = e^θ : mean function.

One can independently check that E[Y] = Σ_{n=0}^∞ n P(n) = µ.

2.5.2 GLM for Γ(α, λ)

Recall that the gamma distribution is written in the following form:

    P(y) = λ e^{−λy} (λy)^{α−1} / Γ(α),  for y ≥ 0
         = exp{ log λ − λy + (α−1) log(λy) − log Γ(α) }
         = exp{ α log λ − λy + (α−1) log y − log Γ(α) }
         = exp{ ( −(λ/α) y + log λ ) / (1/α) + (α − 1) log y − log Γ(α) }.

Set φ = 1/α as the dispersion parameter and let θ = −λ/α. Then P(y) is an exponential family distribution written as

    P_θ(y) = exp{ ( θy + log(−θ) ) / φ + C(y, φ) },

where

    C(y, φ) = (1/φ) log(1/φ) + (1/φ − 1) log y − log Γ(1/φ).

Therefore the log partition function is

    A(θ) = −log(−θ).

Differentiating this, we get the mean parameter

    µ = A'(θ) = −1/θ.

Since E[Y] = α/λ for Y ~ Γ(α, λ), this fact is independently verified. To recap, the mean function and the canonical link function are given by

    µ = ψ^{−1}(θ) = −1/θ : mean function
    θ = ψ(µ) = −1/µ : canonical link function.

In practice, this canonical link function is rarely used. Instead, one uses one of the following three alternatives:

(i) the inverse link g(µ) = 1/µ,
(ii) the log link g(µ) = log µ,
(iii) the identity link g(µ) = µ.

2.5.3 GLM summary

The GLM link and inverse link (mean) functions for the various probability models are summarized in Table 1. Here, the range of the index j in the link and the inverse link of the categorical distribution is j = 1, ..., k−1, although µ = (µ_1, ..., µ_k).

    Model          | Range             | Link                                         | Inverse link (mean)                            | Dispersion
    N(µ, σ^2)      | (−∞, ∞)           | θ = µ                                        | µ = θ                                          | σ^2
    Bernoulli(µ)   | {0, 1}            | θ = logit(µ)                                 | µ = σ(θ)                                       | 1
    Categorical(µ) | {1, ..., k}       | θ_j = log( µ_j / (1 − Σ_{i=1}^{k−1} µ_i) )   | µ_j = e^{θ_j} / ( 1 + Σ_{i=1}^{k−1} e^{θ_i} )  | 1
    Probit         | {0, 1} or {−1, 1} | Φ^{−1}                                       | Φ                                              | 1
    Poisson(µ)     | Z_+               | θ = log µ                                    | µ = e^θ                                        | 1
    Gamma(α, λ)    | R_+               | θ = −1/µ                                     | µ = −1/θ                                       | 1/α

    Table 1: GLM summary
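To close, a short sketch of fitting a Poisson GLM with the canonical log link from Table 1 by gradient ascent: by (11), with T(y) = y and dA/dθ = e^θ, the gradient of the log likelihood is Σ_i [y_i − exp(ω·x^(i))] x^(i). The data, step size, and iteration count below are made up.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic count data; canonical (log) link, so E[y|x] = exp(omega . x); x_0 = 1.
N, d = 600, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
omega_true = np.array([0.2, 0.5, -0.3])                  # made-up parameters
y = rng.poisson(np.exp(X @ omega_true))

# Gradient ascent; the gradient is sum_i [y_i - exp(omega . x_i)] x_i.
omega = np.zeros(d + 1)
for _ in range(5000):
    omega += 0.05 * X.T @ (y - np.exp(X @ omega)) / N

print(omega)      # roughly recovers omega_true
```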