Generalized Linear Models
David Rosenberg
New York University
April 12, 2015
Conditional Gaussian Regression
Gaussian Regression
Input space X = R^d, output space Y = R.
Hypothesis space consists of functions f : x \mapsto N(w^T x, σ^2). For each x, f(x) returns a particular Gaussian density with variance σ^2. The choice of w determines the function.
For some parameter w \in R^d and σ^2 > 0, we can write our prediction function as
[f_w(x)](y) = p_w(y | x) = N(y | w^T x, σ^2).
Given some i.i.d. data D = {(x_1, y_1), ..., (x_n, y_n)}, how do we assess the fit?
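A minimal sketch in Python (with hypothetical values of w and σ) of the idea that, for a fixed w, the prediction at x is an entire Gaussian density over y rather than a single number:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameter values chosen only for illustration.
w = np.array([1.0, -2.0, 0.5])   # parameter vector in R^3
sigma = 1.5                      # known noise standard deviation

def predict_density(x):
    """Return the predicted conditional density y -> p_w(y | x) = N(y; w^T x, sigma^2)."""
    mean = w @ x
    return lambda y: norm.pdf(y, loc=mean, scale=sigma)

x = np.array([0.2, 1.0, -0.3])
density = predict_density(x)
print(density(0.0), density(w @ x))  # the density peaks at the predicted mean w^T x
```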
Conditional Gaussian Regression
Gaussian Regression: Likelihood Scoring
Suppose we have data D = {(x_1, y_1), ..., (x_n, y_n)}.
Compute the model likelihood for D:
p_w(D) = \prod_{i=1}^n p_w(y_i | x_i)   [by independence]
Maximum Likelihood Estimation (MLE) finds the w maximizing p_w(D).
Equivalently, maximize the data log-likelihood:
w* = \arg\max_{w \in R^d} \sum_{i=1}^n \log p_w(y_i | x_i)
Let's start solving this!
Conditional Gaussian Regression
Gaussian Regression: MLE
The conditional log-likelihood is:
\sum_{i=1}^n \log p_w(y_i | x_i)
  = \sum_{i=1}^n \log [ (1 / (σ \sqrt{2π})) \exp( -(y_i - w^T x_i)^2 / (2σ^2) ) ]
  = \underbrace{-n \log(σ \sqrt{2π})}_{independent of w} - \sum_{i=1}^n (y_i - w^T x_i)^2 / (2σ^2)
The MLE is the w where this is maximized.
Note that σ^2 is irrelevant to finding the maximizing w.
Can drop the negative sign and make it a minimization problem.
Conditional Gaussian Regression
Gaussian Regression: MLE
The MLE is
w* = \arg\min_{w \in R^d} \sum_{i=1}^n (y_i - w^T x_i)^2.
This is exactly the objective function for least squares.
From here, can use the usual approaches to solve for w* (linear algebra, calculus, iterative methods, etc.).
NOTE: The parameter vector w only interacts with x through an inner product.
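A minimal sketch in Python of the equivalence above: solving the least-squares problem recovers the Gaussian-regression MLE. The data here is synthetic, generated only for illustration.

```python
import numpy as np

# Synthetic data: y = X w_true + Gaussian noise.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.3, size=n)

# Minimize sum_i (y_i - w^T x_i)^2 directly with a least-squares solver.
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle)  # should be close to w_true
```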
Poisson Regression
Poisson Regression: Setup
Input space X = R^d, output space Y = {0, 1, 2, 3, 4, ...}.
Hypothesis space consists of functions f : x \mapsto Poisson(λ(x)). That is, for each x, f(x) returns a Poisson distribution with mean λ(x).
What function should λ(x) be? Recall λ > 0. GLMs (of which Poisson regression is a special case) have a linear dependence on x.
The standard approach is to take
λ(x) = \exp(w^T x),
for some parameter vector w.
Note that the range of λ(x) is (0, ∞), which is appropriate for the Poisson parameter.
Poisson Regression
Poisson Regression: Likelihood Scoring
Suppose we have data D = {(x_1, y_1), ..., (x_n, y_n)}.
Last time we found the log-likelihood for the Poisson was:
\log p(D; λ) = \sum_{i=1}^n [ y_i \log λ - λ - \log(y_i!) ]
Plugging in λ(x_i) = \exp(w^T x_i), we get
\log p(D; w) = \sum_{i=1}^n [ y_i \log(\exp(w^T x_i)) - \exp(w^T x_i) - \log(y_i!) ]
             = \sum_{i=1}^n [ y_i w^T x_i - \exp(w^T x_i) - \log(y_i!) ]
Maximize this w.r.t. w to find the Poisson regression fit.
There is no closed form for the optimum, but the objective is concave, so it is easy to optimize.
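A sketch of fitting Poisson regression numerically, on synthetic data and with a generic solver (scipy.optimize.minimize); this is one of several reasonable ways to optimize the concave log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ w_true))

def neg_log_likelihood(w):
    eta = X @ w  # eta_i = w^T x_i
    # Negative of sum_i [ y_i w^T x_i - exp(w^T x_i) - log(y_i!) ]; gammaln(y+1) = log(y!)
    return -np.sum(y * eta - np.exp(eta) - gammaln(y + 1))

# The negative log-likelihood is convex, so a generic solver works well.
w_hat = minimize(neg_log_likelihood, np.zeros(d), method="BFGS").x
print(w_hat)  # should be close to w_true
```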
Bernoulli Regression
Linear Probabilistic Classifiers
Setting: X = R^d, Y = {0, 1}.
For each x, p(y = 1 | x) = θ (i.e., Y has a Bernoulli(θ) distribution), where θ may vary with x.
For each x \in R^d, we just want to predict θ \in [0, 1]. Two steps:
x \mapsto w^T x \mapsto f(w^T x),
where x \in R^d, w^T x \in R, and f(w^T x) \in [0, 1]; here f : R \to [0, 1] is called the transfer or inverse link function.
The probability model is then
p(y = 1 | x) = f(w^T x).
Bernoulli Regression
Inverse Link Functions
Two commonly used inverse link functions map from w^T x to θ:
[Figure: the logistic function and the normal CDF, two S-shaped curves mapping w^T x \in R to θ \in [0, 1]]
Logistic function => Logistic Regression
Normal CDF => Probit Regression
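A small Python sketch of the two transfer functions (the grid of evaluation points is arbitrary):

```python
import numpy as np
from scipy.stats import norm

# Two common transfer (inverse link) functions, mapping w^T x in R to theta in [0, 1].
def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))   # logistic regression

def probit(s):
    return norm.cdf(s)                # probit regression

s = np.linspace(-5, 5, 5)
print(logistic(s))
print(probit(s))  # similar S-shape; the normal CDF has lighter tails than the logistic
```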
Multinomial Logistic Regression
Multinomial Logistic Regression
Setting: X = R^d, Y = {1, ..., K}.
The numbers (θ_1, ..., θ_K) with \sum_{c=1}^K θ_c = 1 represent a multinoulli or categorical distribution.
For each x, we want to produce a distribution on the K classes.
That is, for each x and each y \in {1, ..., K}, we want to produce a probability p(y | x) = θ_y, where \sum_{y=1}^K θ_y = 1.
Multinomial Logistic Regression
Multinomial Logistic Regression: Classic Setup
Classically we write multinomial logistic regression (cf. KPM Sec. 8.3.7) as
p(y | x) = \exp(w_y^T x) / \sum_{c=1}^K \exp(w_c^T x),
where we've introduced parameter vectors w_1, ..., w_K \in R^d.
The log of this likelihood is concave and straightforward to optimize.
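A minimal sketch of the classic parameterization in Python, with hypothetical weight vectors W and input x; subtracting the max score is a standard numerical-stability detail, not part of the model.

```python
import numpy as np

# One weight vector per class, stacked as rows of W (shape K x d); values are hypothetical.
K, d = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))
x = np.array([0.2, -1.0, 0.5])

scores = W @ x                        # w_c^T x for each class c
p = np.exp(scores - scores.max())     # subtract max for numerical stability
p /= p.sum()                          # p[c] = exp(w_c^T x) / sum_c' exp(w_c'^T x)
print(p, p.sum())                     # a distribution over the K classes
```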
Multinomial Logistic Regression
More Convenient to Flatten This
Dropping the normalization constant Z(x) = \sum_{c=1}^K \exp(w_c^T x), we have
p(y | x) \propto \exp(w_y^T x)
         = \exp( \sum_{c=1}^K 1(y = c) w_c^T x )
         = \exp( \sum_{c=1}^K \sum_{j=1}^d 1(y = c) (w_c)_j x_j )
         = \exp( \sum_{c=1}^K \sum_{j=1}^d (w_c)_j [1(y = c) x_j] )
Create a feature for every term 1(y = c) x_j, for c \in {1, ..., K} and j \in {1, ..., d}.
Define the feature functions g_r(x, y) = 1(y = c) x_j, one for each pair (c, j).
Multinomial Logistic Regression
More Convenient to Flatten This
So
p(y | x) \propto \exp( \sum_{c=1}^K \sum_{j=1}^d (w_c)_j 1(y = c) x_j )
         = \exp( \sum_{r=1}^R µ_r g_r(x, y) ).
What is R? What are the µ_r's?
R = Kd, and the µ_r's are just a flattening of w_1, ..., w_K into a single vector.
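A sketch of the flattening in Python, with hypothetical weights: the flattened vector µ paired with joint features g(x, y) reproduces the per-class scores w_y^T x.

```python
import numpy as np

# Classic per-class parameterization: W has shape (K, d); values are hypothetical.
K, d = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))
mu = W.reshape(-1)                   # flattened parameter vector, R = K * d

def g(x, y):
    """Joint feature vector: x placed in the block belonging to class y, zeros elsewhere."""
    feat = np.zeros(K * d)
    feat[y * d:(y + 1) * d] = x
    return feat

x = np.array([0.2, -1.0, 0.5])
y = 2
print(np.allclose(mu @ g(x, y), W[y] @ x))   # True: mu^T g(x, y) = w_y^T x
```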
Multinomial Logistic Regression
More Convenient to Flatten This
Why did we do this?
Computational reason: to plug into an optimization algorithm, it is easier to have a single parameter vector. The original version had K parameter vectors.
Conceptual reason: it introduces the idea of features that depend jointly on the input and the output. These features measure compatibility between an input and a particular label. We could call them compatibility functions, but we usually call them features.
Example from natural language processing (part-of-speech tagging):
g_r(x, y) = 1 if y = "NOUN" and x_i = "apple", and 0 otherwise.
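A toy Python version of this compatibility feature (the word and tag values are illustrative):

```python
# Fires only when the label is "NOUN" and the observed word is "apple".
def g_noun_apple(x_word, y_tag):
    return 1.0 if y_tag == "NOUN" and x_word == "apple" else 0.0

print(g_noun_apple("apple", "NOUN"), g_noun_apple("apple", "VERB"))  # 1.0 0.0
```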
Generalized Linear Models (Lite)
Natural Exponential Families
{ p_θ(y) : θ \in Θ \subset R^d } is a family of pdfs or pmfs on Y.
The family is a natural exponential family with parameter θ if
p_θ(y) = (1 / Z(θ)) h(y) \exp(θ^T y).
h(y) is a nonnegative function called the base measure.
Z(θ) = \int_Y h(y) \exp(θ^T y) dy (a sum for pmfs) is the partition function.
The natural parameter space is the set Θ = {θ : Z(θ) < ∞}, i.e., the set of θ for which \exp(θ^T y) can be normalized to have integral 1.
θ is called the natural parameter.
Note: in exponential family form, a family typically has a different parameterization than its standard form.
Generalized Linear Models (Lite)
Specifying a Natural Exponential Family
The family is a natural exponential family with parameter θ if
p_θ(y) = (1 / Z(θ)) h(y) \exp(θ^T y).
To specify a natural exponential family, we only need to choose h(y); everything else is determined.
Implicit in choosing h(y) is the choice of the support of the distribution.
Generalized Linear Models (Lite)
Natural Exponential Families: Examples
The following are univariate natural exponential families:
1. Normal distribution with known variance
2. Poisson distribution
3. Gamma distribution (with known shape parameter k)
4. Bernoulli distribution (and Binomial with known number of trials)
Generalized Linear Models (Lite)
Example: Poisson Distribution
For the Poisson, we found the log probability mass function is:
\log p(y; λ) = y \log λ - λ - \log(y!).
Exponentiating this, we get
p(y; λ) = \exp(y \log λ - λ - \log(y!)).
If we reparametrize, taking θ = \log λ, we can write this as
p(y; θ) = \exp(yθ - e^θ - \log(y!)) = (1 / e^{e^θ}) (1 / y!) \exp(yθ),
which is in natural exponential family form, with
Z(θ) = \exp(e^θ),   h(y) = 1 / y!.
θ = \log λ is the natural parameter.
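A quick numerical check (a sketch, using scipy) that the natural-exponential-family form above reproduces the standard Poisson pmf:

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import gammaln

# With theta = log(lambda), Z(theta) = exp(e^theta), h(y) = 1/y!, the density is
# p(y; theta) = exp(y*theta - e^theta - log(y!)).
lam = 3.7
theta = np.log(lam)
y = np.arange(10)

p_nef = np.exp(y * theta - np.exp(theta) - gammaln(y + 1))
print(np.allclose(p_nef, poisson.pmf(y, lam)))  # True
```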
Generalized Linear Models (Lite)
Generalized Linear Models [with Canonical Link]
In GLMs, we first choose a natural exponential family. (This amounts to choosing h(y).)
The idea is to plug in w^T x for the natural parameter. This gives models of the form
p(y | x) = (1 / Z(w^T x)) h(y) \exp((w^T x) y).
This is the form we had for Poisson regression.
Note: this is very convenient, but it only works if Θ = R.
Generalized Linear Models (Lite)
Generalized Linear Models [with General Link]
More generally, choose a function ψ : R → Θ so that
x \mapsto w^T x \mapsto ψ(w^T x),
where θ = ψ(w^T x) is the natural parameter for the family.
So our final prediction function (for one-parameter families) is
p(y | x) = (1 / Z(ψ(w^T x))) h(y) \exp(ψ(w^T x) y).
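A sketch of one way to instantiate the general-link form, casting probit regression as a Bernoulli GLM: the Bernoulli family has h(y) = 1 and Z(θ) = 1 + e^θ, and choosing ψ(s) = logit(Φ(s)) recovers p(y = 1 | x) = Φ(w^T x). The parameter values below are illustrative.

```python
import numpy as np
from scipy.stats import norm

def psi(s):
    """Non-canonical link: map the score s = w^T x to the Bernoulli natural parameter (log-odds)."""
    p1 = norm.cdf(s)
    return np.log(p1 / (1.0 - p1))

def p_y_given_x(y, x, w):
    """p(y | x) = (1 / Z(psi(w^T x))) * h(y) * exp(psi(w^T x) * y), with h(y) = 1, Z(theta) = 1 + e^theta."""
    theta = psi(w @ x)
    return np.exp(theta * y) / (1.0 + np.exp(theta))

w = np.array([1.0, -0.5])           # hypothetical parameters
x = np.array([0.3, 0.8])
print(p_y_given_x(1, x, w), norm.cdf(w @ x))   # the two agree
```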