Lecture 16 Solving GLMs via IRWLS

09 November 2015
Taylor B. Arnold
Yale Statistics, STAT 312/612

Notes:
- problem set 5 posted; due next class
- problem set 6 due November 18th

Goals for today:
- fixed PCA example from last time
- how to solve logistic regression via weighted least squares
- classification problem on image corpus

fixed PCA example from last time

solving logistic regression

GLMs. Recall that we define generalized linear models such that the mean of y is some function of $X\beta$, rather than being directly equal to it:
$$\mathbb{E}(y \mid X) = g^{-1}(X\beta)$$
where g, called the link function, is some fixed and known function.

Logistic regression. Logistic regression uses g equal to the logit function to describe a distribution for $y \in \{0, 1\}$. Specifically, we have the following description of the statistical model:
$$\mathbb{E}(y \mid X) = \text{logit}^{-1}(X\beta) = \frac{1}{1 + e^{-X\beta}}$$
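As a concrete reference point, the mean function above is straightforward to compute numerically. The following is a minimal sketch, assuming NumPy is available; the names inv_logit and logistic_mean are just illustrative choices, not anything defined in the lecture.

    import numpy as np

    def inv_logit(eta):
        # inverse logit (logistic) function: 1 / (1 + exp(-eta))
        return 1.0 / (1.0 + np.exp(-eta))

    def logistic_mean(X, beta):
        # E(y | X) = logit^{-1}(X beta), one value per row of X
        return inv_logit(X @ beta)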

Now that we have discussed how to solve the ordinary least squares equation, you may wonder how we would go about solving the logistic regression problem.

Recall that the likelihood of the data point $y_i$, given its mean $p_i$, is equal to:
$$L_i(y_i \mid p_i) = \exp\left\{ y_i \log\left(\frac{p_i}{1 - p_i}\right) + \log(1 - p_i) \right\}$$

There are two terms here that depend on $p_i$ in different ways:
$$L_i(y_i \mid p_i) = \exp\left\{ y_i \log\left(\frac{p_i}{1 - p_i}\right) + \log(1 - p_i) \right\}$$
The first term is simply $y_i$ times the linear predictor $x_i^t \beta$, since $\log\left(p_i / (1 - p_i)\right) = \text{logit}(p_i) = x_i^t \beta$. We can rewrite the second term using:
$$\frac{1}{p_i} = 1 + e^{-x_i^t \beta}$$
$$1 - p_i = \frac{e^{-x_i^t \beta}}{1 + e^{-x_i^t \beta}} = \frac{1}{1 + e^{x_i^t \beta}}$$
$$\log(1 - p_i) = -\log\left(1 + e^{x_i^t \beta}\right)$$

We can then write the log-likelihood of the full model, in vector form, as:
$$\ell(y \mid X, \beta) = \sum_i \left( y_i x_i^t \beta - \log\left(1 + e^{x_i^t \beta}\right) \right) = y^t X\beta - \mathbf{1}^t \log\left(1 + e^{X\beta}\right)$$
where, in the last expression, the exponential and logarithm are applied componentwise.
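This log-likelihood translates almost directly into code. Below is a small sketch, assuming NumPy; np.logaddexp(0, eta) is used in place of log(1 + exp(eta)) purely for numerical stability, a detail the slides do not address.

    import numpy as np

    def logistic_loglik(beta, X, y):
        # l(y | X, beta) = y^t X beta - sum_i log(1 + exp(x_i^t beta))
        eta = X @ beta
        return y @ eta - np.sum(np.logaddexp(0.0, eta))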

To find the maximum likelihood estimator, we want to find the zeros of the gradient of the log-likelihood. The gradient is given by:
$$\nabla_\beta \, \ell(y \mid X, \beta) = \nabla_\beta \left( y^t X\beta - \mathbf{1}^t \log\left(1 + e^{X\beta}\right) \right) = X^t y - \nabla_\beta \, \mathbf{1}^t \log\left(1 + e^{X\beta}\right)$$
The second term can be calculated by writing the gradient out componentwise.

In other words, we see from a simple application of the chain rule that:
$$\frac{\partial}{\partial \beta_k} \, \mathbf{1}^t \log\left(1 + e^{X\beta}\right) = x_{\cdot k}^t \left(\frac{e^{X\beta}}{1 + e^{X\beta}}\right)$$
which can be simplified in terms of p:
$$x_{\cdot k}^t \left(\frac{e^{X\beta}}{1 + e^{X\beta}}\right) = x_{\cdot k}^t p$$
where $x_{\cdot k}$ denotes the k-th column of X and the exponential and division are applied componentwise. In vector form this gives the gradient as:
$$\nabla_\beta \, \mathbf{1}^t \log\left(1 + e^{X\beta}\right) = X^t p$$

So now we have an explicit form of the gradient of the log-likelihood:
$$\nabla_\beta \, \ell(y \mid X, \beta) = X^t y - \nabla_\beta \, \mathbf{1}^t \log\left(1 + e^{X\beta}\right) = X^t y - X^t p = X^t (y - p)$$
We cannot simply set this equal to zero and solve, because p is a non-linear function of $\beta$. It does tell us, however, that the residual in logistic regression should be uncorrelated with the columns of X (just as was the case with linear regression).
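The gradient formula $X^t(y - p)$ is just as direct to compute; a sketch under the same NumPy assumption:

    import numpy as np

    def logistic_gradient(beta, X, y):
        # gradient of the log-likelihood: X^t (y - p), with p = logit^{-1}(X beta)
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        return X.T @ (y - p)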

To find a numerical solution to the zeros of the function $\nabla_\beta \, \ell(y \mid X, \beta)$, the Newton-Raphson method can be used. The technique is best described by a series of illustrations: http://euler.stat.yale.edu/~tba3/stat612/lectures/lec16/img/newtoniteration_ani.gif

We saw that the scalar version of Newton-Raphson for finding the optimal value of f is given by (the derivatives are one order higher than in the illustration because we want critical points of f rather than zeros of f):
$$\beta^{(k+1)} = \beta^{(k)} - \frac{f'(\beta^{(k)})}{f''(\beta^{(k)})}$$
The multivariate version of Newton-Raphson applies the following set of updates:
$$\beta^{(k+1)} = \beta^{(k)} - H^{-1}(\beta^{(k)}) \, \nabla f(\beta^{(k)})$$
where H is the Hessian matrix of all second-order partial derivatives of f.
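The multivariate update can be written generically in code. The sketch below (illustrative function names, NumPy assumed) takes callables returning the gradient and Hessian, and solves the linear system $H \, \text{step} = \nabla f$ rather than explicitly inverting H.

    import numpy as np

    def newton_raphson(grad, hess, beta0, n_iter=50, tol=1e-10):
        # generic multivariate Newton-Raphson for locating a critical point
        beta = np.asarray(beta0, dtype=float)
        for _ in range(n_iter):
            step = np.linalg.solve(hess(beta), grad(beta))  # H^{-1} grad(f), via a linear solve
            beta = beta - step
            if np.linalg.norm(step) < tol:
                break
        return beta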

What is the Hessian of the log-likelihood? The gradient was $X^t(y - p)$, and the term $X^t y$ does not depend on $\beta$, so we can quickly see that the Hessian depends only on the term $X^t p$.

In componentwise form, we see that:
$$\frac{\partial^2 \ell(\beta)}{\partial \beta_j \, \partial \beta_k} = \frac{\partial}{\partial \beta_j} \sum_i x_{i,k} (y_i - p_i) = -\sum_i x_{i,k} \, \frac{\partial p_i}{\partial \beta_j}$$

The partial derivative of $p_i$ with respect to $\beta_j$ is given by the following (it is just calculus, with a clever grouping of terms at the end):
$$\frac{\partial p_i}{\partial \beta_j} = \frac{\partial}{\partial \beta_j} \frac{1}{1 + e^{-x_i^t \beta}} = \left(1 + e^{-x_i^t \beta}\right)^{-2} x_{i,j} \, e^{-x_i^t \beta} = x_{i,j} \left(\frac{1}{1 + e^{-x_i^t \beta}}\right) \left(\frac{e^{-x_i^t \beta}}{1 + e^{-x_i^t \beta}}\right) = x_{i,j} \, p_i (1 - p_i) = x_{i,j} \, \text{Var}(y_i)$$
We don't actually need the final equality, but we include it because it helps give some intuition for the next set of steps.
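A quick finite-difference check of this identity can be reassuring. The snippet below uses made-up random inputs (NumPy assumed) and compares the numerical derivative of $p_i$ with respect to $\beta_j$ against $x_{i,j} \, p_i (1 - p_i)$.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    beta = rng.normal(size=3)
    i, j, h = 2, 1, 1e-6

    def p_of(b):
        # fitted probabilities p = logit^{-1}(X b)
        return 1.0 / (1.0 + np.exp(-(X @ b)))

    e = np.zeros(3)
    e[j] = h
    numeric = (p_of(beta + e)[i] - p_of(beta - e)[i]) / (2 * h)
    analytic = X[i, j] * p_of(beta)[i] * (1 - p_of(beta)[i])
    print(numeric, analytic)  # the two values should agree to several decimal places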

This now yields the Hessian as:
$$H\ell(y \mid X, \beta) = -X^t D X$$
where D is an n-by-n diagonal matrix with diagonal entries $D_{ii} = p_i (1 - p_i)$.
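The Hessian $-X^t D X$ can be assembled without ever forming the n-by-n matrix D explicitly; a sketch, again assuming NumPy:

    import numpy as np

    def logistic_hessian(beta, X):
        # Hessian of the log-likelihood: -X^t D X with D_ii = p_i (1 - p_i)
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        d = p * (1.0 - p)
        return -(X.T * d) @ X  # same as -X.T @ np.diag(d) @ X, without building diag(d)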

And therefore the Newton-Raphson step is given by:
$$\beta^{(k+1)} = \beta^{(k)} - H^{-1}(\beta^{(k)}) \, \nabla \ell(\beta^{(k)}) = \beta^{(k)} + (X^t D X)^{-1} X^t (y - p^{(k)})$$
Look familiar?

We can rewrite this, for a suitable working response z, as:
$$\beta^{(k+1)} = (X^t D X)^{-1} (X^t D X) \beta^{(k)} + (X^t D X)^{-1} X^t (y - p^{(k)})$$
$$= (X^t D X)^{-1} X^t D \left( X\beta^{(k)} + D^{-1} (y - p^{(k)}) \right)$$
$$= (X^t D X)^{-1} X^t D z$$
So solving the logistic regression amounts to solving a sequence of weighted least squares problems:
$$\beta^{(k+1)} = \arg\min_b \, \left\| D^{1/2} (z - X b) \right\|_2^2$$
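Putting the pieces together gives the IRWLS loop itself. The sketch below (NumPy assumed; irwls_logistic is an illustrative name) forms the working response z and solves the weighted least squares problem at every iteration; it does not guard against fitted probabilities hitting exactly 0 or 1, which a production implementation would need to handle.

    import numpy as np

    def irwls_logistic(X, y, n_iter=25, tol=1e-10):
        # iteratively reweighted least squares for logistic regression
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ beta                     # current linear predictor X beta
            p = 1.0 / (1.0 + np.exp(-eta))     # current fitted probabilities
            w = p * (1.0 - p)                  # diagonal entries of D
            z = eta + (y - p) / w              # working response z = X beta + D^{-1}(y - p)
            # weighted least squares step: argmin_b || D^{1/2} (z - X b) ||_2^2
            beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
            if np.max(np.abs(beta_new - beta)) < tol:
                return beta_new
            beta = beta_new
        return beta

Each pass through the loop is exactly one Newton-Raphson step written as a weighted least squares solve; on a well-behaved data set the result should match the coefficients from any standard logistic regression routine up to numerical tolerance.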

Notice that the scheme weights observations more heavily if they have predicted probabilities close to 0 or 1: the working residual $z_i - x_i^t \beta^{(k)} = (y_i - p_i) / (p_i(1 - p_i))$ inflates the raw residual for exactly those points, so their raw residuals $y_i - p_i$ are effectively weighted by $1 / (p_i(1 - p_i))$. Does this make sense based on our other lecture? Another interpretation is that the scheme puts the highest weight on the points with the lowest variance.

This method generalizes to other generalized linear models built from exponential families, and has some fairly deep connections to Fisher information.

Why is this important?
1. It explains many of the results on problem set 4; in particular, when linear and logistic regression can be expected to give similar predictions and when they cannot.
2. It helps explain and address convergence properties of generalized linear models.
3. The logistic regression version of the normal equations can be used to establish large-sample convergence results.
4. It gives us an idea of how the geometric and analytic concepts underlying ridge regression, principal component analysis, and (later) lasso regression connect to generalized linear models.