Generalized Linear Models

Generalized Linear Models
David Rosenberg
New York University
DS-GA 1003, April 12, 2015

Conditional Gaussian Regression
Gaussian Regression

Input space X = R^d, output space Y = R.
The hypothesis space consists of functions f : x ↦ N(w^T x, σ^2). For each x, f(x) returns a particular Gaussian density with variance σ^2. The choice of w determines the function.
For some parameter w ∈ R^d and σ^2 > 0, we can write our prediction function as
    [f_w(x)](y) = p_w(y | x) = N(y | w^T x, σ^2).
Given some i.i.d. data D = {(x_1, y_1), ..., (x_n, y_n)}, how do we assess the fit?

Conditional Gaussian Regression
Gaussian Regression: Likelihood Scoring

Suppose we have data D = {(x_1, y_1), ..., (x_n, y_n)}.
Compute the model likelihood for D:
    p_w(D) = ∏_{i=1}^n p_w(y_i | x_i)   [by independence].
Maximum likelihood estimation (MLE) finds the w maximizing p_w(D).
Equivalently, maximize the data log-likelihood:
    w* = argmax_{w ∈ R^d} ∑_{i=1}^n log p_w(y_i | x_i).
Let's start solving this!

Conditional Gaussian Regression
Gaussian Regression: MLE

The conditional log-likelihood is:
    ∑_{i=1}^n log p_w(y_i | x_i)
      = ∑_{i=1}^n log [ (1/(σ√(2π))) exp( −(y_i − w^T x_i)^2 / (2σ^2) ) ]
      = ∑_{i=1}^n [ −log(σ√(2π)) ]  +  ∑_{i=1}^n [ −(y_i − w^T x_i)^2 / (2σ^2) ],
where the first sum is independent of w.
The MLE is the w where this is maximized.
Note that σ^2 is irrelevant to finding the maximizing w.
Dropping the constant term and the negative sign turns this into a minimization problem.

Conditional Gaussian Regression
Gaussian Regression: MLE

The MLE is
    w* = argmin_{w ∈ R^d} ∑_{i=1}^n (y_i − w^T x_i)^2.
This is exactly the objective function for least squares.
From here, we can use the usual approaches to solve for w (linear algebra, calculus, iterative methods, etc.).
NOTE: The parameter vector w only interacts with x through an inner product.
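
Since the Gaussian-regression MLE reduces to least squares, here is a minimal numerical sketch (not from the slides; it assumes NumPy and synthetic data of my own) that recovers w with an ordinary least-squares solve:

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)      # y | x ~ N(w^T x, sigma^2), sigma = 0.1 here

# The MLE minimizes sum_i (y_i - w^T x_i)^2, i.e. ordinary least squares.
w_mle, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle)                                   # should be close to w_true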

Poisson Regression
Poisson Regression: Setup

Input space X = R^d, output space Y = {0, 1, 2, 3, 4, ...}.
The hypothesis space consists of functions f : x ↦ Poisson(λ(x)). That is, for each x, f(x) returns a Poisson distribution with mean λ(x).
Which function should λ(x) be? Recall that we need λ > 0.
GLMs (of which Poisson regression is a special case) have a linear dependence on x.
The standard approach is to take
    λ(x) = exp(w^T x)
for some parameter vector w.
Note that the range of λ(x) is (0, ∞), which is appropriate for the Poisson parameter.

Poisson Regression
Poisson Regression: Likelihood Scoring

Suppose we have data D = {(x_1, y_1), ..., (x_n, y_n)}.
Last time we found that the log-likelihood for the Poisson is:
    log p(D; λ) = ∑_{i=1}^n [ y_i log λ − λ − log(y_i!) ].
Plugging in λ(x_i) = exp(w^T x_i), we get
    log p(D; w) = ∑_{i=1}^n [ y_i log(exp(w^T x_i)) − exp(w^T x_i) − log(y_i!) ]
                = ∑_{i=1}^n [ y_i w^T x_i − exp(w^T x_i) − log(y_i!) ].
Maximize this with respect to w to fit the Poisson regression.
There is no closed form for the optimum, but the objective is concave, so it is easy to optimize.
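
A minimal sketch of fitting this numerically (my own illustration, not the lecture's code; it assumes NumPy/SciPy and synthetic Poisson data): minimize the negative of the log-likelihood above with a general-purpose optimizer.

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln              # log(y!) = gammaln(y + 1)

def neg_log_lik(w, X, y):
    eta = X @ w                                # linear predictor w^T x_i for each i
    # negative of sum_i [ y_i w^T x_i - exp(w^T x_i) - log(y_i!) ]
    return -np.sum(y * eta - np.exp(eta) - gammaln(y + 1))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([0.8, -0.3])
y = rng.poisson(np.exp(X @ w_true))

res = minimize(neg_log_lik, x0=np.zeros(2), args=(X, y))   # concave log-likelihood, so no local-optimum trouble
print(res.x)                                   # approximately w_true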

Bernoulli Regression
Linear Probabilistic Classifiers

Setting: X = R^d, Y = {0, 1}.
For each x, p(y = 1 | x) = θ (i.e. y has a Bernoulli(θ) distribution), and θ may vary with x.
For each x ∈ R^d, we just want to predict θ ∈ [0, 1].
Two steps:
    x (∈ R^d)  ↦  w^T x (∈ R)  ↦  f(w^T x) (∈ [0, 1]),
where f : R → [0, 1] is called the transfer function or inverse link function.
The probability model is then
    p(y = 1 | x) = f(w^T x).

Bernoulli Regression
Inverse Link Functions

Two commonly used inverse link functions map w^T x to θ:
[Figure: the logistic function and the standard normal CDF plotted against the linear score w^T x, both mapping R into (0, 1).]
Logistic function ⇒ logistic regression.
Normal CDF ⇒ probit regression.
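
As a small illustration (my own, assuming NumPy/SciPy), the two transfer functions can be written as:

import numpy as np
from scipy.stats import norm

def logistic(a):
    # Logistic (sigmoid) transfer function -> logistic regression.
    return 1.0 / (1.0 + np.exp(-a))

def probit(a):
    # Standard normal CDF transfer function -> probit regression.
    return norm.cdf(a)

a = np.linspace(-5.0, 5.0, 5)
print(logistic(a))   # values in (0, 1)
print(probit(a))     # values in (0, 1)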

Multinomial Logistic Regression
Multinomial Logistic Regression

Setting: X = R^d, Y = {1, ..., K}.
The numbers (θ_1, ..., θ_K) with ∑_{c=1}^K θ_c = 1 represent a multinoulli or categorical distribution.
For each x, we want to produce a distribution on the K classes.
That is, for each x and each y ∈ {1, ..., K}, we want to produce a probability p(y | x) = θ_y, where ∑_{y=1}^K θ_y = 1.

Multinomial Logistic Regression
Multinomial Logistic Regression: Classic Setup

Classically, we write multinomial logistic regression (cf. KPM Sec. 8.3.7) as
    p(y | x) = exp(w_y^T x) / ∑_{c=1}^K exp(w_c^T x),
where we have introduced parameter vectors w_1, ..., w_K ∈ R^d.
The log of this likelihood is concave and straightforward to optimize.
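
A minimal sketch of this classic parameterization (my own illustration, assuming NumPy; the max-subtraction trick for numerical stability is an addition of mine, not discussed in the slides):

import numpy as np

def multinomial_logistic_probs(W, x):
    # W has shape (K, d), rows are w_1, ..., w_K; returns p(y = c | x) for each c.
    scores = W @ x
    scores = scores - scores.max()   # subtract the max for numerical stability; cancels in the ratio
    expw = np.exp(scores)
    return expw / expw.sum()

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])         # K = 3 classes, d = 2
x = np.array([0.5, -0.2])
print(multinomial_logistic_probs(W, x))   # nonnegative, sums to 1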

Multinomial Logistic Regression
More Convenient to Flatten This

Dropping the proportionality constant Z(x) = ∑_{c=1}^K exp(w_c^T x), we have
    p(y | x) ∝ exp(w_y^T x)
             = exp( ∑_{c=1}^K 1(y = c) w_c^T x )
             = exp( ∑_{c=1}^K ∑_{j=1}^d (w_c)_j 1(y = c) x_j ).
Create a feature for every term 1(y = c) x_j, for c ∈ {1, ..., K} and j ∈ {1, ..., d}.
Define the feature functions g_r(x, y) = 1(y = c) x_j.

Multinomial Logistic Regression
More Convenient to Flatten This

So
    p(y | x) ∝ exp( ∑_{c=1}^K ∑_{j=1}^d (w_c)_j 1(y = c) x_j )
             = exp( ∑_{r=1}^R μ_r g_r(x, y) ).
What is R? What are the μ_r's?
R = Kd, and the μ_r's are just a flattening of w_1, ..., w_K into a single vector.
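
A quick numerical check of the flattening (my own sketch, assuming NumPy and a particular row-major ordering of μ, which is just one of several valid flattenings): the score w_y^T x equals μ^T g(x, y).

import numpy as np

def g(x, y, K):
    # Feature vector of length R = K*d: the block for class y holds x,
    # all other entries are 0, i.e. g_r(x, y) = 1(y = c) x_j.
    d = x.shape[0]
    feats = np.zeros(K * d)
    feats[y * d:(y + 1) * d] = x
    return feats

K, d = 3, 2
W = np.arange(K * d, dtype=float).reshape(K, d)   # rows are w_1, ..., w_K
mu = W.reshape(-1)                                # flattened parameter vector of length K*d
x = np.array([0.5, -1.0])
y = 2                                             # class index (0-based here)
print(W[y] @ x, mu @ g(x, y, K))                  # identical scores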

Multinomial Logistic Regression
More Convenient to Flatten This

Why did we do this?
Computational reason: to plug into an optimization algorithm, it is easier to have a single parameter vector. The original version had K parameter vectors.
Conceptual reason: it introduces the idea of features that depend jointly on the input and the output. These features measure the compatibility between an input and a particular label. We could call them compatibility functions, but we usually call them features.
Example from natural language processing (part-of-speech tagging):
    g_r(x, y) = 1 if y = "NOUN" and x_i = "apple", and 0 otherwise.

Generalized Linear Models (Lite)
Natural Exponential Families

{ p_θ(y) : θ ∈ Θ ⊂ R^d } is a family of pdfs or pmfs on Y.
The family is a natural exponential family with parameter θ if
    p_θ(y) = (1/Z(θ)) h(y) exp(θ^T y).
h(y) is a nonnegative function called the base measure.
Z(θ) = ∫_Y h(y) exp(θ^T y) dy is the partition function (a sum over Y for pmfs).
The natural parameter space is the set Θ = {θ : Z(θ) < ∞}, i.e. the set of θ for which exp(θ^T y) can be normalized to have integral 1.
θ is called the natural parameter.
Note: in exponential family form, the family typically has a different parameterization than its standard form.

Generalized Linear Models (Lite)
Specifying a Natural Exponential Family

The family is a natural exponential family with parameter θ if
    p_θ(y) = (1/Z(θ)) h(y) exp(θ^T y).
To specify a natural exponential family, we only need to choose h(y); everything else is determined.
Implicit in choosing h(y) is the choice of the support of the distribution.

Generalized Linear Models (Lite)
Natural Exponential Families: Examples

The following are univariate natural exponential families:
1. Normal distribution with known variance
2. Poisson distribution
3. Gamma distribution (with known shape parameter k)
4. Bernoulli distribution (and Binomial with a known number of trials)

Generalized Linear Models (Lite)
Example: Poisson Distribution

For the Poisson, we found that the log probability mass function is:
    log p(y; λ) = y log λ − λ − log(y!).
Exponentiating this, we get
    p(y; λ) = exp(y log λ − λ − log(y!)).
If we reparametrize, taking θ = log λ, we can write this as
    p(y; θ) = exp(yθ − e^θ − log(y!)) = (1/e^{e^θ}) (1/y!) exp(yθ),
which is in natural exponential family form, with
    Z(θ) = exp(e^θ),  h(y) = 1/y!.
θ = log λ is the natural parameter.
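
A small numerical check (my own sketch, assuming NumPy/SciPy) that the standard Poisson pmf matches this natural-exponential-family form with θ = log λ:

import numpy as np
from scipy.special import gammaln
from scipy.stats import poisson

lam = 2.5
theta = np.log(lam)                 # natural parameter
y = np.arange(6)

pmf_standard = poisson.pmf(y, lam)
# (1/Z(theta)) h(y) exp(theta*y) with Z(theta) = exp(e^theta), h(y) = 1/y!
pmf_expfam = np.exp(-np.exp(theta)) * np.exp(-gammaln(y + 1)) * np.exp(theta * y)
print(np.allclose(pmf_standard, pmf_expfam))   # True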

Generalized Linear Models (Lite)
Generalized Linear Models [with Canonical Link]

In a GLM, we first choose a natural exponential family (this amounts to choosing h(y)).
The idea is then to plug in w^T x for the natural parameter. This gives models of the form
    p_w(y | x) = (1/Z(w^T x)) h(y) exp((w^T x) y).
This is exactly the form we had for Poisson regression.
Note: this is very convenient, but it only works if Θ = R.

Generalized Linear Models (Lite)
Generalized Linear Models [with General Link]

More generally, choose a function ψ : R → Θ so that
    x  ↦  w^T x  ↦  ψ(w^T x),
where θ = ψ(w^T x) is the natural parameter for the family.
So our final prediction (for one-parameter families) is
    p_w(y | x) = (1/Z(ψ(w^T x))) h(y) exp(ψ(w^T x) y).
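
As a hedged sketch of the general-link form (my own illustration, assuming a Bernoulli natural exponential family with h(y) = 1 and Z(θ) = 1 + e^θ, which is not derived in these slides): taking ψ to be the logit of the normal CDF gives probit regression, while ψ(a) = a would recover the canonical (logistic) link.

import numpy as np
from scipy.stats import norm

def log_partition(theta):
    # log Z(theta) for the Bernoulli natural exponential family: Z(theta) = 1 + e^theta.
    return np.log1p(np.exp(theta))

def psi_probit(a):
    # Map the score w^T x to the natural parameter (the log-odds) via the normal CDF;
    # psi(a) = a would be the canonical (logistic) link instead.
    p = norm.cdf(a)
    return np.log(p) - np.log(1.0 - p)

def log_lik(w, X, y, psi):
    theta = psi(X @ w)
    # sum_i [ theta_i y_i - log Z(theta_i) ], using h(y) = 1.
    return np.sum(theta * y - log_partition(theta))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w = np.array([1.0, -0.5])
y = (rng.random(50) < norm.cdf(X @ w)).astype(float)   # synthetic probit data
print(log_lik(w, X, y, psi_probit))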