Logistic Regression Maximum Likelihood Estimation

Harvard-MIT Division of Health Sciences and Technology, HST.951J: Medical Decision Support, Fall 2005. Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo. 6.873/HST.951 Medical Decision Support, Fall 2005. Logistic Regression: Maximum Likelihood Estimation. Lucila Ohno-Machado

Risk Score of Death from Angioplasty: bar chart of number of cases and mortality risk by risk score category (0 to 2, 3 to 4, 5 to 6, 7 to 8, 9 to 10, >10). Unadjusted overall mortality rate = 2.1%; mortality risk rises from 0.4% in the lowest category to 53.6% in the highest.

Linear Regression: Ordinary Least Squares (OLS). Minimize the Sum of Squared Errors (SSE) over the n data points (i is the subscript for each point):

$$\hat{y}_i = \beta_0 + \beta_1 x_i$$
$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_i)\right]^2$$
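As a quick illustration of the OLS objective above, here is a minimal sketch (the toy x and y values are invented) that computes β₀ and β₁ in closed form and evaluates the SSE being minimized:

```python
import numpy as np

# Toy data (invented for illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form OLS estimates for y_hat = b0 + b1 * x.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)   # the quantity OLS minimizes
print(b0, b1, sse)
```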

Logit:

$$p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}} = \frac{e^{\beta_0 + \beta_1 x_i}}{e^{\beta_0 + \beta_1 x_i} + 1}$$
$$\log\frac{p_i}{1 - p_i} = \beta_0 \cdot 1 + \beta_1 x_i \qquad \text{(the logit)}$$
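A minimal sketch of the two equivalent forms above, assuming invented example coefficients: the logistic function giving p from β₀ + β₁x, and the logit (log-odds) recovering the linear predictor:

```python
import numpy as np

def sigmoid(z):
    """p = 1 / (1 + exp(-z)), the logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -0.5, 0.8              # example coefficients (invented)
x = np.linspace(-3, 3, 7)

p = sigmoid(beta0 + beta1 * x)        # probability that y = 1
log_odds = np.log(p / (1 - p))        # the logit; recovers beta0 + beta1 * x
print(np.allclose(log_odds, beta0 + beta1 * x))  # True
```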

Increasing β₁: three plots of the logistic curve p = 1/(1 + e^{-(β₀ + β₁x)}) over the same x range (0 to 30), showing a steeper transition from 0 to 1 as β₁ increases.

Finding β₀: baseline case (the green group, x = 0):

$$p = \frac{1}{1 + e^{-\beta_0}}$$

            Blue (1)   Green (0)   Total
  Death        28          22        50
  Life         45          52        97
  Total        73          74       147

$$0.297 = \frac{1}{1 + e^{-\beta_0}} \quad\Rightarrow\quad \beta_0 = -0.8616$$
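The intercept can be checked directly from the table: with the green (x = 0) group as the baseline, p is the death rate among greens and β₀ is its log-odds. A quick sketch:

```python
import math

deaths_green, total_green = 22, 74                # from the 2x2 table above
p_baseline = deaths_green / total_green           # ~0.297
beta0 = math.log(p_baseline / (1 - p_baseline))   # log-odds of the baseline rate
print(round(p_baseline, 3), round(beta0, 4))      # 0.297, about -0.86 (the slide's -0.8616 uses the rounded 0.297)
```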

Odds ratio. Odds: p/(1-p). Odds ratio for death, blue vs. green:

$$OR = \frac{p_{death|blue}\,/\,(1 - p_{death|blue})}{p_{death|green}\,/\,(1 - p_{death|green})} = \frac{28/45}{22/52} = 1.47$$

(using the same 2x2 table: Death 28 blue, 22 green, 50 total; Life 45, 52, 97; Total 73, 74, 147)

What do the coefficients mean?

$$e^{\beta_{color}} = OR_{color}, \qquad OR = \frac{28/45}{22/52} = 1.47, \qquad e^{\beta_{color}} = 1.47 \;\Rightarrow\; \beta_{color} = 0.385$$
$$p_{blue} = \frac{1}{1 + e^{0.8616 - 0.385}} = 0.383, \qquad p_{green} = \frac{1}{1 + e^{0.8616}} = 0.297$$

(same 2x2 table as above)
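The slide's arithmetic can be replayed from the table: the odds ratio for blue vs. green, its log as β_color, and the fitted probabilities for each group. A minimal sketch (function and variable names are mine):

```python
import math

# Counts from the 2x2 table on the slide.
deaths_blue, lives_blue = 28, 45
deaths_green, lives_green = 22, 52

OR = (deaths_blue / lives_blue) / (deaths_green / lives_green)  # (28/45)/(22/52) ~ 1.47
beta_color = math.log(OR)                                       # ~0.385
beta0 = -0.8616                        # baseline log-odds from the previous slide

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

p_green = sigmoid(beta0)               # ~0.297
p_blue = sigmoid(beta0 + beta_color)   # ~0.383
print(round(OR, 2), round(beta_color, 3), round(p_green, 3), round(p_blue, 3))
```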

What do the coefficients mean for a continuous predictor such as age?

$$e^{\beta_{age}} = OR_{age}, \qquad OR = \frac{p_{death|age=50}\,/\,(1 - p_{death|age=50})}{p_{death|age=49}\,/\,(1 - p_{death|age=49})}$$

(illustrated with the same 2x2 counts, now labeled Age 49 vs. Age 50)

Why not search using OLS?

$$\hat{y}_i = \beta_0 + \beta_1 x_i, \qquad SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

versus the logit:

$$\log\frac{p_i}{1 - p_i} = \beta_0 \cdot 1 + \beta_1 x_i$$

P(model | data)?

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$

If only the intercept is allowed, which value would it have? (Plot of the observed y values.)

P(data | model)?

P(data | model) = [P(model | data) P(data)] / P(model)

When comparing models:
- P(model): assume all models are equally likely a priori (i.e., the chance of a model with high coefficients is the same as one with low coefficients, etc.)
- P(data): assume it is the same for all models
Then P(data | model) ∝ P(model | data).

Maximum Likelihood Estimation. Maximize P(data | model): maximize the probability that we would observe what we observed (given the assumption of a particular model), choosing the best parameters for that particular model (the logit).

Maximum Likelihood Estimation. Steps:
- Define an expression for the probability of the data as a function of the parameters.
- Find the values of the parameters that maximize this expression.

Likelihood Function

$$L = \Pr(Y) = \Pr(y_1, y_2, \ldots, y_n) = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_n) = \prod_{i=1}^{n}\Pr(y_i)$$

Likelihood Function

$$L = \Pr(Y) = \Pr(y_1, y_2, \ldots, y_n) = \Pr(y_1)\Pr(y_2)\cdots\Pr(y_n) = \prod_{i=1}^{n}\Pr(y_i)$$

Binomial:

$$\Pr(y_i = 1) = p_i, \qquad \Pr(y_i = 0) = 1 - p_i, \qquad \Pr(y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$$

Likelihood Function

$$L = \prod_{i=1}^{n}\Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i} = \prod_{i=1}^{n}\left(\frac{p_i}{1 - p_i}\right)^{y_i}(1 - p_i)$$
$$\log L = \sum_i \left[ y_i \log\frac{p_i}{1 - p_i} + \log(1 - p_i) \right] = \sum_i \left[ y_i(\beta x_i) - \log(1 + e^{\beta x_i}) \right]$$

since the model is the logit.

Log Likelihood Function

$$L = \prod_{i=1}^{n}\Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i} = \prod_{i=1}^{n}\left(\frac{p_i}{1 - p_i}\right)^{y_i}(1 - p_i)$$
$$\log L = \sum_i \left[ y_i \log\frac{p_i}{1 - p_i} + \log(1 - p_i) \right]$$

Log Likelihood Function

$$\log L = \sum_i \left[ y_i \log\frac{p_i}{1 - p_i} + \log(1 - p_i) \right] = \sum_i \left[ y_i(\beta x_i) - \log(1 + e^{\beta x_i}) \right]$$

since the model is the logit.
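A minimal sketch of the log likelihood above on invented toy data, evaluating both the generic Bernoulli form and the logit-specific form, which agree:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (invented for illustration).
x = np.array([0.5, 1.5, 2.0, 3.0, 4.5])
y = np.array([0,   0,   1,   0,   1  ])
beta0, beta1 = -2.0, 0.8

eta = beta0 + beta1 * x                 # linear predictor (beta * x)
p = sigmoid(eta)

# Generic Bernoulli log likelihood: sum of y*log(p) + (1-y)*log(1-p).
loglik_bernoulli = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Logit-specific form from the slide: sum of y*eta - log(1 + exp(eta)).
loglik_logit = np.sum(y * eta - np.log(1 + np.exp(eta)))

print(np.allclose(loglik_bernoulli, loglik_logit))  # True
```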

Maximize

$$\log L = \sum_i \left[ y_i(\beta x_i) - \log(1 + e^{\beta x_i}) \right]$$
$$\frac{\partial \log L}{\partial \beta} = \sum_i (y_i - \hat{y}_i)\, x_i = 0, \qquad \hat{y}_i = \frac{1}{1 + e^{-\beta x_i}}$$

Not easy to solve because ŷ is non-linear in β; we need iterative methods. The most popular is Newton-Raphson.

Newton-Raphson
- Start with random or zero βs.
- Walk in the direction that maximizes the log likelihood: the gradient (score) gives the direction, and the algorithm determines how big a step to take.

Maximizing the Log Likelihood: plot of log likelihood versus β; the first iteration moves from the initial LL at β_j up to a higher LL at β_(j+1).

Maximizing the Log Likelihood: the same plot after the second iteration; the new β becomes the starting point (the new "initial" LL) for the next step.

This is a similar iterative method to minimizing the error in gradient descent (neural nets): plot of an error surface over a weight w, with the initial error at w_initial, the negative derivative indicating the direction of change, and the final error at a local minimum for w_trained.

Newton-Raphson Algorithm

$$\log L = \sum_i \left[ y_i(\beta x_i) - \log(1 + e^{\beta x_i}) \right]$$
$$U(\beta) = \frac{\partial \log L}{\partial \beta} = \sum_i (y_i - \hat{y}_i)\, x_i \qquad \text{(Gradient / Score)}$$
$$I(\beta) = -\frac{\partial^2 \log L}{\partial \beta\, \partial \beta'} = \sum_i x_i x_i'\, \hat{y}_i (1 - \hat{y}_i) \qquad \text{(Hessian / Information)}$$
$$\beta_{j+1} = \beta_j + I(\beta_j)^{-1}\, U(\beta_j) \qquad \text{(a step)}$$
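A minimal sketch of the update above for a one-predictor logistic model with an intercept, on invented toy data; `newton_raphson_logistic` is my own naming, and the stopping rule adapts the relative-change criterion on the next slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_raphson_logistic(x, y, tol=1e-4, max_iter=25):
    """Fit beta = (beta0, beta1) for logit(p) = beta0 + beta1*x by Newton-Raphson."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
    beta = np.zeros(2)                           # start with zero betas
    for _ in range(max_iter):
        p = sigmoid(X @ beta)                    # current fitted probabilities
        U = X.T @ (y - p)                        # gradient (score)
        W = p * (1 - p)                          # Bernoulli variances y_hat*(1 - y_hat)
        I = X.T @ (X * W[:, None])               # information matrix (negative Hessian)
        step = np.linalg.solve(I, U)             # I^{-1} U
        beta_new = beta + step                   # one Newton-Raphson step
        # Relative-change convergence check (small epsilon avoids dividing by zero).
        if np.max(np.abs(step) / (np.abs(beta) + 1e-12)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Toy data (invented for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < sigmoid(-0.5 + 1.2 * x)).astype(float)
print(newton_raphson_logistic(x, y))  # should be roughly (-0.5, 1.2)
```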

Convergence Criterion

$$\left|\frac{\beta_{j+1} - \beta_j}{\beta_j}\right| < 0.0001$$

Convergence problems: complete and quasi-complete separation.

Complete separation: the MLE does not exist (i.e., it is infinite); each iteration pushes β further (β, β+1, ...) without converging. (Plot of an outcome perfectly separated by the predictor.)

Quasi-complete separation: the same values of the predictors with different outcomes. (Plot.)

No (quasi-)complete separation: it is fine to find the MLE. (Plot of overlapping outcomes.)

How good is the model?
- Is it better than predicting the same prior probability for everyone (i.e., a model with just β₀)?
- How well do the training data fit?
- How well does it generalize?

Generalized likelihood-ratio test. Are β₁, β₂, ..., βₚ different from 0?

$$L = \prod_{i=1}^{n}\Pr(y_i) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}$$
$$\log L = \sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$$
$$G = -2\log L_0 + 2\log L_1$$

G has a χ² distribution. Note the related cross-entropy error:

$$\text{cross entropy error} = -\sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$$
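A minimal sketch of the G statistic on invented toy data, fitting the full model with a few Newton-Raphson steps and comparing it to the intercept-only model (the same prior probability for everyone); the chi-square p-value uses one degree of freedom for the one extra parameter:

```python
import numpy as np
from scipy.stats import chi2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(p, y):
    # Bernoulli log likelihood: sum of y*log(p) + (1-y)*log(1-p)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data (invented for illustration).
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = (rng.random(300) < sigmoid(-0.3 + 1.0 * x)).astype(float)

# Fit the full model (intercept + x) with a few Newton-Raphson steps.
X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for _ in range(20):
    p = sigmoid(X @ beta)
    beta = beta + np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))

# Null model: the same prior probability (the observed mean of y) for everyone.
loglik_null = log_lik(np.full_like(y, y.mean()), y)
loglik_full = log_lik(sigmoid(X @ beta), y)

G = -2 * loglik_null + 2 * loglik_full   # G = -2 log L_0 + 2 log L_1
p_value = chi2.sf(G, df=1)               # chi-square with 1 extra parameter
print(round(G, 2), p_value)
```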

AIC, SC, BIC: used to compare models.
- Akaike's Information Criterion, with k parameters: AIC = -2 log L + 2k.
- Schwarz Criterion (Bayesian Information Criterion), with n cases: BIC = -2 log L + k log n.
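A minimal sketch of the two criteria as functions of a fitted log likelihood; `loglik`, `k`, and `n` are placeholder values, not results from a real fit:

```python
import math

def aic(loglik, k):
    """Akaike's Information Criterion: -2 log L + 2k, with k parameters."""
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    """Schwarz / Bayesian Information Criterion: -2 log L + k log n, with n cases."""
    return -2 * loglik + k * math.log(n)

# Example with placeholder values for a fitted model.
loglik, k, n = -180.4, 2, 300
print(aic(loglik, k), bic(loglik, k, n))
```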

Summary
- Maximum Likelihood Estimation is used to find the parameters of models.
- MLE maximizes the probability that the observed data would have been generated by the model.
- Coming up: goodness of fit (how good are the predictions?). How well do the training data fit? How well does the model generalize?