Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2)


Lectures on Machine Learning (Fall 2017)
Hyeong In Choi, Seoul National University

Topics to be covered:

- Exponential family of distributions
- Mean and (canonical) link functions
- Convexity of the log partition function
- Generalized linear model (GLM)
- Various GLM models

1 Exponential family of distributions

In this section, we study a family of probability distributions called the exponential family (of distributions). It is of a special form, but most, if not all, of the well-known probability distributions belong to this class.

1.1 Definition

Definition 1. A probability distribution (PDF or PMF) is said to belong to the exponential family of distributions (in natural or canonical form) if it is of the form

    P_θ(y) = (1/Z(θ)) h(y) e^{θ·T(y)},   (1)

where y = (y_1, ..., y_m) is a point in R^m; θ = (θ_1, ..., θ_k) ∈ R^k is a parameter called the canonical (natural) parameter; T : R^m → R^k is a map T(y) = (T_1(y), ..., T_k(y)); and

    Z(θ) = ∫ h(y) e^{θ·T(y)} dy

is called the partition function, while its logarithm, A(θ) = log Z(θ), is called the log partition (cumulant) function.

Remark. In this lecture and throughout this course, the dot notation, as in θ·T(y), always means the inner (dot) product of two vectors.

Equivalently, P_θ(y) can be written in the form

    P_θ(y) = exp[ θ·T(y) − A(θ) + C(y) ],   (2)

where C(y) = log h(y). More generally, one sometimes introduces an extra parameter φ, called the dispersion parameter, to control the shape of P_θ(y), via

    P_θ(y) = exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ].

1.2 Standing assumption

In our discussion of the exponential family of distributions, we always assume the following:

- In case P_θ(y) is a probability density function (PDF), it is assumed to be continuous as a function of y; this means there is no singular part in the probability measure P_θ(y).
- In case P_θ(y) is a probability mass function (PMF), the discrete set of values on which P_θ(y) is supported is the same for all θ.

If P_θ(y) satisfies the relevant condition, we say it is regular, which is always assumed throughout this course.
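As a quick numerical illustration of Definition 1 (an added sketch, with an arbitrary parameter value), the following computes Z(θ) and A(θ) by quadrature for one concrete choice of h and T, namely h(y) = I(y ≥ 0) and T(y) = −y, which anticipates the exponential-distribution example in the next subsection.

```python
import numpy as np
from scipy.integrate import quad

# Concrete choice: h(y) = I(y >= 0), T(y) = -y, so the integrand is e^{-theta*y} on [0, inf).
theta = 1.7                          # arbitrary example value of the canonical parameter
Z, _ = quad(lambda y: np.exp(-theta * y), 0, np.inf)
A = np.log(Z)                        # log partition (cumulant) function A(theta) = log Z(theta)

print(Z, 1 / theta)                  # numerical Z(theta) agrees with the closed form 1/theta
print(A, -np.log(theta))             # hence A(theta) = -log(theta)
```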

Remark. Sometimes people use the more general form of P_θ(y),

    P_θ(y) = (1/Z(θ)) h(y) e^{η(θ)·T(y)}.

But most of the time the same results can be obtained without the use of the general form η(θ). So it is not much of a loss of generality to stick to our convention of using just θ.

1.3 Examples

Let us now look at a few illustrative examples.

(1) Bernoulli(µ)

The Bernoulli distribution is perhaps the simplest member of the exponential family. Let Y be a random variable taking its binary value in {0, 1}, and let µ = P[Y = 1]. Its distribution (in fact, the probability mass function, PMF) is then succinctly written as P(y) = µ^y (1−µ)^{1−y}. Then,

    P(y) = µ^y (1−µ)^{1−y}
         = exp[ y log µ + (1−y) log(1−µ) ]
         = exp[ y log( µ/(1−µ) ) + log(1−µ) ].

Letting T(y) = y and θ = log( µ/(1−µ) ), and recalling the definitions of the logit and sigmoid functions given in Lecture 3, we have

    logit(µ) = log( µ/(1−µ) ).

Thus

    θ = logit(µ).   (3)

Its inverse function is the sigmoid function:

    µ = σ(θ),   (4)

where σ(θ) = 1/(1 + e^{−θ}). Therefore, we have

    A(θ) = −log(1−µ) = log(1 + e^θ).   (5)

Thus P_θ(y), written in canonical form as in (2), becomes

    P_θ(y) = exp[ θy − log(1 + e^θ) ].
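Here is a minimal illustrative sketch of (3)-(5) (the value µ = 0.3 is arbitrary): the canonical form exp[θy − A(θ)] reproduces µ^y (1−µ)^{1−y}.

```python
import numpy as np

def bernoulli_pmf(y, mu):
    # Standard form: mu^y (1 - mu)^(1 - y).
    return mu**y * (1.0 - mu)**(1 - y)

def bernoulli_canonical(y, theta):
    # Canonical form: exp[theta*y - A(theta)] with A(theta) = log(1 + e^theta).
    A = np.log1p(np.exp(theta))
    return np.exp(theta * y - A)

mu = 0.3                        # arbitrary example value
theta = np.log(mu / (1 - mu))   # theta = logit(mu), as in (3)
for y in (0, 1):
    print(y, bernoulli_pmf(y, mu), bernoulli_canonical(y, theta))
# The two columns agree for y = 0 and y = 1.
```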

(2) Exponential distribution

The exponential distribution models the waiting time between independent arrivals. Its distribution (the probability density function, PDF) is given as

    P_θ(y) = θ e^{−θy} I(y ≥ 0).

To put it in the exponential family form, we use the same θ as the canonical parameter and let T(y) = −y and h(y) = I(y ≥ 0). Since

    Z(θ) = 1/θ = ∫ e^{−θy} I(y ≥ 0) dy,

it is already in the canonical form given in (1).

(3) Normal distribution

The normal (Gaussian) distribution, given by

    P(y) = (1/√(2πσ^2)) exp[ −(y−µ)^2 / (2σ^2) ],

is the single most well-known distribution. As far as its relation to the exponential family is concerned, there are two views. (Here, this σ is a number, not the sigmoid function.)

1st view (σ^2 as a dispersion parameter)

This is the case when the main focus is the mean. In this case the variance σ^2 is regarded as known, or as a parameter that can be fiddled with as if known. Writing P(y) in the form

    P_θ(y) = exp[ ( −(1/2)y^2 + µy − (1/2)µ^2 ) / σ^2 − (1/2) log(2πσ^2) ],

one can see right away that it is in the exponential family (dispersion) form if we set

    θ = (θ_1, θ_2) = (µ, −1)
    T(y) = (y, (1/2)y^2)
    φ = σ^2
    A(θ) = (1/2)µ^2 = (1/2)θ_1^2
    C(y, φ) = −(1/2) log(2πσ^2).

2nd view (φ = 1)

When both µ and σ are parameters to be treated as unknown, we take this point of view. Here we set the dispersion parameter φ = 1. Writing out P_θ(y), we have

    P_θ(y) = exp[ −(1/(2σ^2)) y^2 + (µ/σ^2) y − (1/(2σ^2)) µ^2 − log σ − (1/2) log 2π ].

Thus it is easy to see the following:

    T(y) = (y, (1/2)y^2)
    θ = (θ_1, θ_2) = ( µ/σ^2, −1/σ^2 )
    A(θ) = µ^2/(2σ^2) + log σ = −θ_1^2/(2θ_2) − (1/2) log(−θ_2)
    C(y) = −(1/2) log(2π).
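As a sanity check on the 2nd view (an added sketch; the values µ = 1.5 and σ = 2.0 are arbitrary), the canonical form exp[θ·T(y) − A(θ) + C(y)] should reproduce the N(µ, σ^2) density:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0                                    # arbitrary example values
theta1, theta2 = mu / sigma**2, -1.0 / sigma**2         # canonical parameters of the 2nd view
A = -theta1**2 / (2 * theta2) - 0.5 * np.log(-theta2)   # log partition function
C = -0.5 * np.log(2 * np.pi)

y = np.linspace(-5.0, 8.0, 7)
T1, T2 = y, 0.5 * y**2                                  # sufficient statistics T(y)
expfam = np.exp(theta1 * T1 + theta2 * T2 - A + C)
print(np.allclose(expfam, norm.pdf(y, loc=mu, scale=sigma)))   # True
```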

1.4 Properties of the exponential family

The log partition function A(θ) plays a key role, so let us now look at it more carefully. First, since

    ∫ P_θ(y) dy = ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] dy = 1,

taking ∇_θ of both sides, we have

    ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] (T(y) − ∇_θ A(θ))/φ dy = 0.

Multiplying through by φ and rearranging,

    ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T(y) dy
        = ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] ∇_θ A(θ) dy
        = ∇_θ A(θ) ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] dy.

Since

    ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] dy = 1,

we have the following:

Proposition 1. ∇_θ A(θ) = ∫ P_θ(y) T(y) dy = E[T(Y)], where Y is the random variable with distribution P_θ(y).

Writing this componentwise, we have

    ∂A/∂θ_i = ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T_i(y) dy.

Taking the second partial derivative, we have

    ∂^2A/∂θ_i∂θ_j
        = (1/φ) ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] { T_j(y) − ∂A(θ)/∂θ_j } T_i(y) dy
        = (1/φ) ( ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T_i(y) T_j(y) dy
                  − ∂A(θ)/∂θ_j ∫ exp[ (θ·T(y) − A(θ))/φ + C(y, φ) ] T_i(y) dy ).

Using Proposition 1 once more, we have

    ∂^2A/∂θ_i∂θ_j = (1/φ) { E[T_i(Y) T_j(Y)] − E[T_i(Y)] E[T_j(Y)] }
                  = (1/φ) E[ (T_i(Y) − E[T_i(Y)]) (T_j(Y) − E[T_j(Y)]) ]
                  = (1/φ) Cov(T_i(Y), T_j(Y)).

Therefore we have the following result on the Hessian matrix of A.

Proposition 2. D^2_θ A = ( ∂^2A/∂θ_i∂θ_j ) = (1/φ) Cov(T(Y), T(Y)) = (1/φ) Cov(T(Y)), where Y is the random variable with distribution P_θ(y). Here, Cov(T(Y), T(Y)) denotes the covariance matrix of T(Y) with itself, which is sometimes called the variance matrix.

Since the covariance matrix is always positive semi-definite, we have

Corollary 1. A(θ) is a convex function of θ.

Remark. In most exponential family models, Cov(T(Y)) is positive definite, in which case A(θ) is strictly convex. In this lecture and throughout the course, we always assume that Cov(T(Y)) is positive definite and therefore that A(θ) is strictly convex.
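The following Monte Carlo sketch (an added illustration; the rate θ = 2.0 and the sample size are arbitrary) checks Propositions 1 and 2 for the exponential distribution of Section 1.3, where T(y) = −y, A(θ) = log Z(θ) = −log θ and φ = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                                  # arbitrary canonical parameter (rate)
samples = rng.exponential(scale=1.0 / theta, size=200_000)
T = -samples                                 # sufficient statistic T(y) = -y

# Proposition 1: A'(theta) = E[T(Y)], with A(theta) = -log(theta).
print(-1.0 / theta, T.mean())                # both approximately -0.5

# Proposition 2 (phi = 1): A''(theta) = Var(T(Y)).
print(1.0 / theta**2, T.var())               # both approximately 0.25
```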

1.5 Maximum likelihood estimation

Let D = {y^(i)}_{i=1}^N be a given data set of IID samples, where y^(i) ∈ R^m. Then its likelihood function L(θ) and log likelihood function l(θ) are given by

    L(θ) = ∏_{i=1}^N exp[ (θ·T(y^(i)) − A(θ))/φ + C(y^(i), φ) ]

    l(θ) = (1/φ) { Σ_{i=1}^N θ·T(y^(i)) − N A(θ) } + Σ_{i=1}^N C(y^(i), φ).

Since A(θ) is strictly convex, l(θ) has a unique maximum at θ̂, which is the unique solution of ∇_θ l(θ) = 0. Note that

    ∇_θ l(θ) = (1/φ) { Σ_{i=1}^N T(y^(i)) − N ∇_θ A(θ) }.

So we have the following:

Proposition 3. There is a unique θ̂ that maximizes l(θ), at which

    ∇_θ A(θ) |_{θ = θ̂} = (1/N) Σ_{i=1}^N T(y^(i)).

1.5.1 Example: Bernoulli distribution

Let us apply the above argument to the Bernoulli distribution Bernoulli(µ) and see where it leads. First, differentiating A(θ) as in (5), we get

    A'(θ) = 1/(1 + e^{−θ}) = σ(θ).

Thus by Proposition 3 we have a unique θ̂ satisfying

    σ(θ̂) = (1/N) Σ_{i=1}^N T(y^(i)).

Since T(y) = y, and y takes on 0 or 1 as its value, we have T(y) = I(y = 1). Defining µ̂ = σ(θ̂) as in (4), we then have

    µ̂ = σ(θ̂) = (1/N) Σ_{i=1}^N I(y^(i) = 1) = (number of 1's in the sample {y^(i)}) / N.

This confirms the usual estimate of µ.
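A tiny sketch of this estimate (the true µ = 0.7 and the sample size are made up), solving σ(θ̂) = (1/N) Σ T(y^(i)) as in Proposition 3:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = 0.7                              # made-up value
y = rng.binomial(1, mu_true, size=1000)    # IID Bernoulli samples

mu_hat = y.mean()                          # fraction of 1's = (1/N) sum of T(y^(i))
theta_hat = np.log(mu_hat / (1 - mu_hat))  # invert sigma: theta_hat = logit(mu_hat)

print(mu_hat, 1 / (1 + np.exp(-theta_hat)))   # sigma(theta_hat) recovers mu_hat
```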

Remark. T is a sufficient statistic. It is well known that the statistic S(y_1, ..., y_n) = Σ_{i=1}^n T(y^(i)) is sufficient, which means that θ̂ is constant on each level set {(y_1, ..., y_n) : S(y_1, ..., y_n) = const}. This provides many advantages in estimating θ, as one need not look inside the level sets.

1.6 Maximum entropy distribution

The exponential family is also useful in that it arises as the most random distribution under moment constraints. Let Y be a random variable with an unknown distribution P(y). Suppose the expected values of certain features f_1(y), ..., f_k(y) are nonetheless known. Namely,

    ∫ f_i(y) P(y) dy = C_i   (6)

is known for some given constant C_i, for i = 1, ..., k. The question is how to fix P(y) so that it is as random as possible while satisfying the above constraints. One approach is to construct a maximum entropy distribution satisfying the constraints (6). The entropy is a measure of randomness: in general, the higher it is, the more random the distribution is deemed. The definition of entropy in the discrete case is

    H = −Σ_y P(y) log P(y);

in the continuous case, the sum becomes an integral, so that

    H = −∫ P(y) log P(y) dy.

We will use the integral notation here without loss of generality. To solve the problem, we use the Lagrange multiplier method to set up the following variational problem of minimizing J(P, λ):

    J(P, λ) = ∫ P(y) log P(y) dy + λ_0 ( 1 − ∫ P(y) dy ) + Σ_{i=1}^k λ_i ( C_i − ∫ f_i(y) P(y) dy ).

The solution is found as a critical point. Namely, we set the variational (functional, or Fréchet) derivative equal to 0:

    δJ/δP = 1 + log P(y) − λ_0 − Σ_{i=1}^k λ_i f_i(y) = 0.   (7)

To see where this comes from, let η be any smooth function with compact support and define

    J(P + ɛη, λ) = ∫ (P + ɛη) log(P + ɛη) dy + λ_0 ( 1 − ∫ (P + ɛη) dy ) + Σ_{i=1}^k λ_i ( C_i − ∫ f_i (P + ɛη) dy ).

Taking the derivative with respect to ɛ at ɛ = 0 and setting it equal to 0, we get

    d/dɛ J(P + ɛη, λ) |_{ɛ=0} = ∫ (η + η log P) dy − λ_0 ∫ η dy − Σ_{i=1}^k λ_i ∫ f_i η dy
                              = ∫ ( 1 + log P − λ_0 − Σ_{i=1}^k λ_i f_i ) η dy = 0.

Since the above formula is true for any η, we have

    1 + log P − λ_0 − Σ_{i=1}^k λ_i f_i = 0,

which is (7). Solving (7) for P(y), we have

    P(y) = (1/Z) exp( Σ_{i=1}^k λ_i f_i(y) ),

where Z = e^{1−λ_0}. Now

    1 = ∫ P(y) dy = (1/Z) ∫ exp( Σ_{i=1}^k λ_i f_i(y) ) dy.

Thus

    Z = ∫ exp( Σ_{i=1}^k λ_i f_i(y) ) dy.

Therefore P(y) belongs to the exponential family of distributions. This kind of distribution is in general called a Gibbs distribution.
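The following sketch illustrates this on a finite support {0, ..., 9} with a single feature f(y) = y and a made-up target mean C = 3.0: the maximum entropy solution has the Gibbs form p(y) ∝ exp(λ y), and the multiplier λ is chosen numerically so that the constraint holds.

```python
import numpy as np
from scipy.optimize import brentq

ys = np.arange(10)      # finite support (made-up)
C = 3.0                 # target mean E[f(Y)] with f(y) = y (made-up)

def gibbs(lam):
    # Gibbs / exponential-family form: p(y) = exp(lam * f(y)) / Z(lam).
    w = np.exp(lam * ys)
    return w / w.sum()

# Pick the Lagrange multiplier so that the mean constraint is satisfied.
lam = brentq(lambda l: gibbs(l) @ ys - C, -10.0, 10.0)
p = gibbs(lam)

print(p @ ys)                      # ~3.0: the constraint (6) holds
print(-(p * np.log(p)).sum())      # entropy of the resulting maximum entropy distribution
```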

2 Generalized linear model (GLM)

2.1 Mean parameter and canonical link function

Before we embark on the generalized linear model, we need the following elementary fact.

Lemma 1. Let f : R^k → R be a strictly convex C^1 function. Then ∇_x f : R^k → R^k is an invertible function.

Proof. Let p ∈ R^k be any given vector. By the strict convexity and the absence of corners (the C^1 condition), there is a unique hyperplane H in R^{k+1} that is tangent to the graph of y = f(x) and has a normal vector of the form (p, −1). Let (x, y) be the point of contact. Since the graph is the zero set of F(x, y) = f(x) − y, its normal vector at (x, y) is (∇f(x), −1). Therefore we must have p = ∇f(x).

[Figure 1: Graph of a strictly convex function with a tangent plane]

Let us now look at the exponential family

    P_θ(y) = exp[ θ·T(y) − A(θ) + C(y) ],   (8)

where the dispersion parameter is set to 1 for the sake of simplicity of presentation. (The argument does not change much even in the presence of the dispersion parameter.) Recall that Proposition 1 says that ∇_θ A(θ) = E[T(Y)]. We always assume that the log partition function A : R^k → R is strictly convex and C^1. Thus, by the above lemma, ∇_θ A : R^k → R^k is invertible. We now set up a few pieces of terminology.

Definition 2. The mean parameter µ is defined as µ = E[T(Y)] = ∇_θ A(θ), and the function ∇_θ A : R^k → R^k is called the mean function.

Since ∇_θ A is invertible, we may define the following.

Definition 3. The inverse function (∇_θ A)^{−1} of ∇_θ A is called the canonical link function and is denoted by ψ. Thus

    θ = ψ(µ) = (∇_θ A)^{−1}(µ).

Therefore, combining the above definitions, the mean function can also be written as

    µ = ψ^{−1}(θ) = ∇_θ A(θ).

Example: Bernoulli(µ)

Recall that for Bernoulli(µ), P(y) is given by

    P(y) = µ^y (1−µ)^{1−y}
         = exp[ y log( µ/(1−µ) ) + log(1−µ) ]
         = exp[ θy − log(1 + e^θ) ]
         = exp[ θy − A(θ) ],

where θ = log( µ/(1−µ) ) and A(θ) = log(1 + e^θ). Thus

    µ = sigmoid(θ) = A'(θ)
    θ = logit(µ) = (A')^{−1}(µ).

Therefore, for the Bernoulli distribution, the sigmoid function is the mean function and the logit function is the canonical link function.
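As a small sketch of Definitions 2 and 3 (the value θ = 0.8 is arbitrary): starting only from A(θ) = log(1 + e^θ), one can recover the mean function by numerical differentiation and the canonical link by numerical inversion, and they agree with the sigmoid and the logit.

```python
import numpy as np
from scipy.optimize import brentq

A = lambda t: np.log1p(np.exp(t))            # Bernoulli log partition function

def mean_fn(theta, h=1e-6):
    # Mean function mu = A'(theta), via a central finite difference.
    return (A(theta + h) - A(theta - h)) / (2 * h)

def link_fn(mu):
    # Canonical link psi = (A')^{-1}, by numerically inverting the mean function.
    return brentq(lambda t: mean_fn(t) - mu, -30.0, 30.0)

theta = 0.8                                  # arbitrary example value
mu = mean_fn(theta)
print(mu, 1 / (1 + np.exp(-theta)))          # mean function vs. sigmoid
print(link_fn(mu), np.log(mu / (1 - mu)))    # canonical link vs. logit; both recover theta
```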

2.2 Conditional probability in GLM

From now on in this lecture, to simplify notation, we use ω to stand for both w and b, and extend x ∈ R^d to x ∈ R^{d+1} by adding x_0 = 1. (In the previous lectures, we used θ as a generic term to represent both w and b. But in the exponential family notation, θ is reserved for the canonical parameter, so we are forced to use ω. One more note: although the typography is difficult to discern, this ω is the Greek lowercase omega, not the English w.)

2.2.1 GLM recipe

With this notation, writing ω = (b, w_1, ..., w_d), the linear expression in (1) of Lecture 3 becomes ω·x, and thus the conditional probability of Lecture 3 is written as

    P(y = 1 | x) = e^{ω·x} / (1 + e^{ω·x})
    P(y = 0 | x) = 1 / (1 + e^{ω·x}),

which can be simplified as

    P(y | x) = e^{(ω·x)y} / (1 + e^{ω·x}),  for y ∈ {0, 1}.

Using (5), this is easily seen to be equivalent to the following conditional probability model:

    P(y | x) = exp[ (ω·x)y − A(ω·x) ] = exp[ θy − A(θ) ].

Summarizing this process, we have the following:

GLM Recipe: P(y | x) is obtained from P(y) by replacing the canonical parameter θ in (8) with a linear expression in x. In the binary case, the linear expression is ω·x.

The multiclass case mimics this in exactly the same way. To describe it correctly, first define the parameter matrix

    W = [ (ω^1)^T ; ... ; (ω^i)^T ; ... ; (ω^k)^T ],   (9)

i.e., the k × (d+1) matrix whose i-th row is (ω^i)^T = (ω_{i0}, ..., ω_{ij}, ..., ω_{id}). Then replace θ in (8) with W x to get the conditional probability

    P(y | x) = exp[ (W x)·T(y) − A(W x) + C(y) ].   (10)

Written this way, we can see that

    (W x)·T(y) = Σ_{l=1}^k Σ_{j=0}^d ω_{lj} x_j T_l(y).

Now, let D = {(x^(i), y^(i))}_{i=1}^N be given data, where the x-part of the i-th data point is represented by the column vector x^(i) = [x^(i)_0, ..., x^(i)_d]^T. Using the conditional probability as in (10), one can write the likelihood function and the log likelihood function as

    L(W) = ∏_{i=1}^N exp[ (W x^(i))·T(y^(i)) − A(W x^(i)) + C(y^(i)) ]

    l(W) = log L(W) = Σ_{i=1}^N [ (W x^(i))·T(y^(i)) − A(W x^(i)) + C(y^(i)) ]
         = Σ_{i=1}^N [ Σ_{l=1}^k Σ_{j=0}^d ω_{lj} x^(i)_j T_l(y^(i)) − A(ω^1·x^(i), ..., ω^k·x^(i)) + C(y^(i)) ].

Thus, taking the derivative with respect to the entry ω_{rs} and using ∂(ω^l·x^(i))/∂ω_{rs} = δ_{lr} x^(i)_s, one gets

    ∂l(W)/∂ω_{rs} = Σ_{i=1}^N [ Σ_{l=1}^k Σ_{j=0}^d x^(i)_j δ_{lr} δ_{js} T_l(y^(i)) − Σ_{l=1}^k (∂A/∂θ_l)(W x^(i)) δ_{lr} x^(i)_s ]
                  = Σ_{i=1}^N [ x^(i)_s T_r(y^(i)) − (∂A/∂θ_r)(W x^(i)) x^(i)_s ]
                  = Σ_{i=1}^N [ T_r(y^(i)) − (∂A/∂θ_r)(W x^(i)) ] x^(i)_s.   (11)

This is a generalized form of what we got for the logistic regression (see (10) of Lecture 3).
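To make (11) concrete, here is a gradient ascent sketch for the binary case: k = 1, T(y) = y, A(θ) = log(1 + e^θ), so ∂A/∂θ = σ(θ) and the gradient reduces to Σ_i [y_i − σ(ω·x^(i))] x^(i). The data, step size, and iteration count below are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary data; x is extended with x_0 = 1.
N, d = 500, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
omega_true = np.array([-0.5, 2.0, -1.0])            # made-up "true" parameters
sigmoid = lambda t: 1 / (1 + np.exp(-t))
y = rng.binomial(1, sigmoid(X @ omega_true))

# Gradient ascent on l(omega); by (11) the gradient is sum_i [y_i - sigmoid(omega . x_i)] x_i.
omega = np.zeros(d + 1)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ omega)) / N
    omega += lr * grad

print(omega)      # roughly recovers omega_true
```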

2.3 Multiclass classification (softmax regression)

We now look at the multiclass classification problem. The formula (11) is the derivative formula with which the gradient descent algorithm can be used. But for multiclass regression, the categorical distribution has the redundancy we talked about in Lecture 3. As usual, let y_i = I(y = i) and let Prob[Y = i] = µ_i. Then we must have y_1 + ... + y_k = 1 and µ_1 + ... + µ_k = 1. Thus the probability (PMF) can be written as

    P(y) = µ_1^{y_1} µ_2^{y_2} ⋯ µ_k^{y_k}.

Rewrite P(y) as

    P(y) = µ_1^{y_1} ⋯ µ_{k−1}^{y_{k−1}} µ_k^{ 1 − Σ_{i=1}^{k−1} y_i }
         = exp[ y_1 log µ_1 + ⋯ + y_{k−1} log µ_{k−1} + ( 1 − Σ_{i=1}^{k−1} y_i ) log µ_k ]
         = exp[ Σ_{i=1}^{k−1} y_i log( µ_i / µ_k ) + log µ_k ].

Note that when k = 2, this is exactly the Bernoulli distribution. Define θ_i = log(µ_i/µ_k) and T_i(y) = y_i for i = 1, ..., k−1. Using the facts that µ_i = µ_k e^{θ_i} and 1 − µ_k = Σ_{i=1}^{k−1} µ_i, and solving for µ_k and then for µ_j, we get

    µ_k = 1 / ( 1 + Σ_{i=1}^{k−1} e^{θ_i} )

    µ_j = e^{θ_j} / ( 1 + Σ_{i=1}^{k−1} e^{θ_i} ),   (12)

for j = 1, ..., k−1. The expression on the right-hand side of (12) is called the generalized sigmoid (softmax) function. Therefore P(y) can be written in the exponential family form as

    P_θ(y) = exp[ θ·T(y) − A(θ) ],

where

    θ = (θ_1, ..., θ_{k−1}),   T(y) = (y_1, ..., y_{k−1}),

and

    A(θ) = −log µ_k = log( 1 + Σ_{i=1}^{k−1} e^{θ_i} ).

Note that the mean parameter µ_i is given as

    µ_i = E[T_i(Y)] = E[Y_i] = Prob[Y_i = 1] = Prob[Y = i],

which again can be calculated from the fact µ = ∇_θ A. Indeed, one can verify that

    ∂A/∂θ_j = e^{θ_j} / ( 1 + Σ_{i=1}^{k−1} e^{θ_i} ) = µ_j.

We have shown above that θ_j = log(µ_j/µ_k). Thus for j = 1, ..., k−1, we have

    θ_j = log( µ_j / ( 1 − Σ_{i=1}^{k−1} µ_i ) ).

The expression on the right side of the above equation is called the generalized logit function.
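A small sketch of the generalized sigmoid (softmax) mean function (12) and the generalized logit link in the (k−1)-parameterization; the values of θ below are arbitrary and correspond to k = 4 classes.

```python
import numpy as np

def softmax_mean(theta):
    # Mean function (12): mu_j = exp(theta_j) / (1 + sum_i exp(theta_i)), j = 1, ..., k-1.
    e = np.exp(theta)
    return e / (1.0 + e.sum())

def generalized_logit(mu):
    # Canonical link: theta_j = log(mu_j / mu_k), with mu_k = 1 - sum_j mu_j.
    mu_k = 1.0 - mu.sum()
    return np.log(mu / mu_k)

theta = np.array([0.2, -1.0, 0.7])     # arbitrary values; k = 4 classes
mu = softmax_mean(theta)
print(mu, 1.0 - mu.sum())              # mu_1, ..., mu_{k-1} and mu_k; they sum to 1
print(generalized_logit(mu))           # recovers theta
```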

2.4 Probit regression

Recall that the essence of the GLM for the exponential family is the way it links the (outside) features x_1, ..., x_d to the probability model. To be specific, given the probability model (8), the linking was done by setting θ = ψ(µ) = ω·x, so that the conditional probability becomes

    P(y | x) = exp[ (ω·x)·T(y) − A(ω·x) + C(y) ].

But there is no a priori reason why µ has to be related to ω·x via the canonical link function only. One may use any function as long as it is invertible. So set g(µ) = ω·x, where g : R^k → R^k is an invertible function. We call such a g a link function, as opposed to the canonical link function, and its inverse g^{−1} the mean function. So we have

    g(µ) = ω·x : link function
    µ = g^{−1}(ω·x) : mean function.

Let Φ(t) be the cumulative distribution function (CDF) of the standard normal distribution N(0, 1), i.e.,

    Φ(t) = (1/√(2π)) ∫_{−∞}^t e^{−s^2/2} ds.

Probit regression deals with the following situation: the output y is binary, taking values in {0, 1}, and its probability model is Bernoulli. So P(y) = µ^y (1−µ)^{1−y}, where µ = P(y = 1). Its link to the outside variables (features) is via the link function g = Φ^{−1}. Thus

    g(µ) = ω·x = ω_0 + ω_1 x_1 + ... + ω_d x_d
    µ = g^{−1}(ω·x) = Φ(ω·x).   (13)

Therefore,

    P(y | x) = Φ(ω·x)^y ( 1 − Φ(ω·x) )^{1−y}.

Since this is still a Bernoulli, hence an exponential family distribution, the general machinery of GLM can be employed. But in probit regression we change tack and take a different approach. First, we let y ∈ {−1, 1}. Then, since

    µ = E[Y] = 1·P(y = 1) + (−1)·P(y = −1) = P(y = 1) − [ 1 − P(y = 1) ],

we have µ = 2P(y = 1) − 1. Thus P(y = 1) = (1/2)(1 + µ) and P(y = −1) = (1/2)(1 − µ). Therefore,

    P(y) = (1/2)(1 + yµ).

Thus, applying the link (13) to this mean parameter µ = E[Y], we have

    P(y | x) = (1/2)[ 1 + y Φ(ω·x) ].

Now let the data D = {(x_i, y_i)}_{i=1}^n be given. Then the log likelihood function is

    l(ω) = Σ_{i=1}^n log P(y_i | x_i) = Σ_{i=1}^n log( (1/2)[ 1 + y_i Φ(ω·x_i) ] ).

Thus

    ∂l(ω)/∂ω_k = Σ_{i=1}^n y_i Φ'(ω·x_i) x_{ik} / ( 1 + y_i Φ(ω·x_i) ),

which is in a neat form for applying the numerical methods introduced in Lecture 3, where

    Φ'(ω·x) = (1/√(2π)) e^{−(ω·x)^2/2}.
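Below is a minimal sketch of this log likelihood and its gradient, using scipy.stats.norm for Φ and Φ' and plain gradient ascent in place of the methods of Lecture 3; the labels are in {−1, +1}, x is extended with x_0 = 1, and the data, step size, and iteration count are made up.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Synthetic data generated from the model above: P(y=+1|x) = (1 + Phi(omega . x)) / 2.
n, d = 400, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
omega_true = np.array([0.3, 1.0, -0.8])               # made-up parameters
p_plus = 0.5 * (1 + norm.cdf(X @ omega_true))
y = np.where(rng.uniform(size=n) < p_plus, 1, -1)

def log_likelihood(omega):
    return np.sum(np.log(0.5 * (1 + y * norm.cdf(X @ omega))))

def gradient(omega):
    z = X @ omega
    w = y * norm.pdf(z) / (1 + y * norm.cdf(z))       # per-sample factor from the derivative above
    return X.T @ w

omega = np.zeros(d + 1)
for _ in range(3000):
    omega += 0.01 * gradient(omega) / n               # plain gradient ascent step

print(omega, log_likelihood(omega))
```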

Comments.

(1) When one uses a link function which is not canonical, the probability model does not even have to belong to the exponential family. As long as one can get hold of the conditional distribution P(y | x), one is in business: with it, one can form the likelihood function and the rest of the maximum likelihood machinery.

(2) One may not even need the probability distribution, as long as one can somehow estimate ω so that the decisions can be made.

(3) Historically, the GLM was first developed for the exponential family but was later extended to non-exponential families and even to cases where the distribution is not completely known.

2.5 GLM for other distributions

2.5.1 GLM for Poisson(µ)

Recall that the Poisson distribution is the probability distribution (PMF) on the non-negative integers given by

    P(n) = e^{−µ} µ^n / n!,

for n = 0, 1, 2, .... It is an exponential family distribution, as it can be written as

    P(y) = e^{−µ} µ^y / y! = exp{ y log µ − µ − log(y!) },

where y = 0, 1, 2, .... It is put in the exponential family form by setting

    θ = log µ : canonical link function
    µ = e^θ : mean function.

One can independently check that E[Y] = Σ_{n=0}^∞ n P(n) = µ.

2.5.2 GLM for Γ(α, λ)

Recall that the gamma distribution is written in the following form:

    P(y) = λ e^{−λy} (λy)^{α−1} / Γ(α),  for y ≥ 0
         = exp{ log λ − λy + (α−1) log(λy) − log Γ(α) }
         = exp{ α log λ − λy + (α−1) log y − log Γ(α) }
         = exp{ ( −(λ/α) y + log λ ) / (1/α) + (α − 1) log y − log Γ(α) }.

Set φ = 1/α as the dispersion parameter and let θ = −λ/α. Then P(y) is an exponential family distribution written as

    P_θ(y) = exp{ ( θy + log(−θ) ) / φ + C(y, φ) },

where

    C(y, φ) = (1/φ) log(1/φ) + (1/φ − 1) log y − log Γ(1/φ).

Therefore the log partition function is

    A(θ) = −log(−θ).

Differentiating this, we get the mean parameter

    µ = A'(θ) = −1/θ.

Since E[Y] = α/λ for Y ~ Γ(α, λ), this fact is independently verified. To recap, the mean function and the canonical link function are given by

    µ = ψ^{−1}(θ) = −1/θ : mean function
    θ = ψ(µ) = −1/µ : canonical link function.

In practice, this canonical link function is rarely used. Instead, one uses one of the following three alternatives:

(i) the inverse link g(µ) = 1/µ,
(ii) the log link g(µ) = log µ,
(iii) the identity link g(µ) = µ.

2.5.3 GLM summary

The GLM link and inverse link (mean) functions for the various probability models are summarized in Table 1. Here, the range of the index j in the link and the inverse link of the categorical distribution is j = 1, ..., k−1, although µ = (µ_1, ..., µ_k).

    Model          | Range             | Link                                         | Inverse link (mean)                            | Dispersion
    N(µ, σ^2)      | (−∞, ∞)           | θ = µ                                        | µ = θ                                          | σ^2
    Bernoulli(µ)   | {0, 1}            | θ = logit(µ)                                 | µ = σ(θ)                                       | 1
    Categorical(µ) | {1, ..., k}       | θ_j = log( µ_j / (1 − Σ_{i=1}^{k−1} µ_i) )   | µ_j = e^{θ_j} / ( 1 + Σ_{i=1}^{k−1} e^{θ_i} )  | 1
    Probit         | {0, 1} or {−1, 1} | Φ^{−1}                                       | Φ                                              | 1
    Poisson(µ)     | Z_+               | θ = log µ                                    | µ = e^θ                                        | 1
    Gamma(α, λ)    | R_+               | θ = −1/µ                                     | µ = −1/θ                                       | 1/α

    Table 1: GLM summary
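To close, a short sketch of fitting a Poisson GLM with the canonical log link from Table 1 by gradient ascent: by (11), with T(y) = y and dA/dθ = e^θ, the gradient of the log likelihood is Σ_i [y_i − exp(ω·x^(i))] x^(i). The data, step size, and iteration count below are made up.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic count data; canonical (log) link, so E[y|x] = exp(omega . x); x_0 = 1.
N, d = 600, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
omega_true = np.array([0.2, 0.5, -0.3])                  # made-up parameters
y = rng.poisson(np.exp(X @ omega_true))

# Gradient ascent; the gradient is sum_i [y_i - exp(omega . x_i)] x_i.
omega = np.zeros(d + 1)
for _ in range(5000):
    omega += 0.05 * X.T @ (y - np.exp(X @ omega)) / N

print(omega)      # roughly recovers omega_true
```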