Logistic Regression and Maximum Likelihood. Marek Petrik. Feb 09, 2017



So Far in ML: regression vs. classification; linear regression; the bias-variance decomposition; practical methods for linear regression.

Simple Linear Regression. We have only one feature: $Y \approx \beta_0 + \beta_1 X$, modeled as $Y = \beta_0 + \beta_1 X + \epsilon$. Example: $\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}$. [Plot: Sales against TV advertising budget with the fitted line.]

Multiple Linear Regression. [Plot: the fitted regression plane for $Y$ over two features $X_1$ and $X_2$.]

Types of Function $f$. Regression: continuous target, $f : \mathcal{X} \to \mathbb{R}$ (e.g., Income as a function of Years of Education and Seniority). Classification: discrete target, $f : \mathcal{X} \to \{1, 2, 3, \ldots, k\}$ (e.g., class regions over features $X_1, X_2$).

Today: why not to use linear regression for classification; logistic regression; the maximum likelihood principle; maximum likelihood for linear regression. Reading: ISL 4.1-4.3, ESL 2.6 (max likelihood).

Examples of Classification, 1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?

Examples of Classification, 2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user's IP address, past transaction history, and so forth.

Examples of Classification, 3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

IBM Watson (fair use, https://en.wikipedia.org/w/index.php?curid=31142331): logistic regression + clever feature engineering.

Predicting Default: $\text{default} \approx f(\text{income}, \text{balance})$. [Scatterplot: Income against Balance, by default status.]

Predicting Default: $\text{default} \approx f(\text{income}, \text{balance})$. [Boxplots: Balance and Income split by default status (No/Yes).]

Casting Classification as Regression. Regression: $f : \mathcal{X} \to \mathbb{R}$. Classification: $f : \mathcal{X} \to \{1, 2, 3\}$. But $\{1, 2, 3\} \subset \mathbb{R}$, so do we even need classification? Yes! In regression, values that are close are similar; in classification, the distance between classes is meaningless.

Casting Classification as Regression: Example. Predict a possible diagnosis from $\{\text{stroke}, \text{overdose}, \text{seizure}\}$. Assign class labels: $Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if overdose} \\ 3 & \text{if seizure} \end{cases}$ and fit linear regression. Then make predictions: if uncertain whether the symptoms point to stroke or seizure, we predict overdose!

Linear Regression for 2-class Classification. $Y = \begin{cases} 1 & \text{if default} \\ 0 & \text{otherwise} \end{cases}$ [Plots: $P[\text{default} = \text{yes} \mid \text{balance}]$ against balance, fit by linear regression (left) and by logistic regression (right).]

Logistic Regression. Predict the probability of a class: $p(X)$. Example: $p(\text{balance})$ is the probability of default for a person with a given balance. Linear regression: $p(X) = \beta_0 + \beta_1 X$. Logistic regression: $p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$, which is the same as $\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$. Odds: $\frac{p(X)}{1 - p(X)}$.

Logistic Function: $y = \frac{e^x}{1 + e^x}$. [Plot: the logistic curve, rising from 0 to 1 over $x \in [-10, 10]$.]
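As a quick illustration (a minimal Python sketch, not part of the original slides), the logistic function squashes any real number into $(0, 1)$, and the logit from the next slide maps it back:

```python
import numpy as np

def logistic(x):
    # Logistic (sigmoid) function: maps any real x into (0, 1).
    return np.exp(x) / (1.0 + np.exp(x))

def logit(p):
    # Logit (log-odds) function: maps a probability in (0, 1) back to the reals.
    return np.log(p / (1.0 - p))

x = np.linspace(-10.0, 10.0, 5)
p = logistic(x)
print(p)         # values squashed into (0, 1)
print(logit(p))  # recovers x: the logit is the inverse of the logistic function
```

That the two functions are inverses is exactly the equivalence between the $p(X)$ form and the log-odds form on the previous slide.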

Logit Function: $\log\left(\frac{p(X)}{1 - p(X)}\right)$. [Plot: the logit curve over $p(X) \in (0, 1)$.]

Logistic Regression: $P[\text{default} = \text{yes} \mid \text{balance}] = \frac{e^{\beta_0 + \beta_1 \cdot \text{balance}}}{1 + e^{\beta_0 + \beta_1 \cdot \text{balance}}}$. [Plots: the probability of default against balance, fit by linear regression (left) and by logistic regression (right).]

Estimating Coefficients: Maximum Likelihood. Likelihood: the probability that the data is generated from a model, $\ell(\text{model}) = P[\text{data} \mid \text{model}]$. Find the most likely model: $\max_{\text{model}} \ell(\text{model}) = \max_{\text{model}} P[\text{data} \mid \text{model}]$. The likelihood function is difficult to maximize, so transform it using $\log$ (strictly increasing): $\max_{\text{model}} \log \ell(\text{model})$. A strictly increasing transformation does not change the maximizer.

Example: Maximum Likelihood. Assume a coin with $p$ as the probability of heads. Data: $h$ heads, $t$ tails. The likelihood function is $\ell(p) = p^h (1 - p)^t$. [Plot: the likelihood as a function of $p$.]

Likelihood Function: 2 coin flips. Heads $h = 1$, tails $t = 1$. [Plot: likelihood over $p$, peaking at $p = 0.5$ with maximum $0.25$.]

Likelihood Function: 20 coin flips. Heads $h = 10$, tails $t = 10$. [Plot: likelihood over $p$, still peaking at $p = 0.5$ but much more sharply.]

Likelihood Function: 200 coin flips. Heads $h = 100$, tails $t = 100$. [Plot: likelihood over $p$, concentrated tightly around $p = 0.5$.]
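The behavior in the three plots above is easy to reproduce. Here is a minimal Python sketch (an illustration, not from the slides) that evaluates $\ell(p) = p^h (1 - p)^t$ on a grid; the maximizer stays at $p = 0.5$ while the peak sharpens as the number of flips grows:

```python
import numpy as np

def likelihood(p, h, t):
    # Coin likelihood l(p) = p^h * (1 - p)^t for h heads and t tails.
    return p**h * (1.0 - p)**t

grid = np.linspace(0.001, 0.999, 999)
for h, t in [(1, 1), (10, 10), (100, 100)]:
    vals = likelihood(grid, h, t)
    p_hat = grid[np.argmax(vals)]
    print(f"{h + t:3d} flips: argmax ~ {p_hat:.3f}, peak value = {vals.max():.2e}")
```

Note how quickly the peak value shrinks (from $0.25$ down to roughly $6 \times 10^{-61}$ for 200 flips) even though the maximizer never moves; this is why raw likelihoods are numerically awkward to work with.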

Maximizing Likelihood. The likelihood function $\ell(p) = p^h (1 - p)^t$ is not concave: hard to maximize. Maximize the log-likelihood instead: $\log \ell(p) = h \log(p) + t \log(1 - p)$. [Plot: the log-likelihood is concave in $p$.]

Log-likelihood: Biased Coin. Heads $h = 20$, tails $t = 50$. [Plot: log-likelihood over $p$, peaking below $p = 0.5$.]

Maximize Log-likelihood. Log-likelihood: $\log \ell(p) = h \log(p) + t \log(1 - p)$. The maximum is where the derivative equals zero: $\frac{d}{dp} \left[ h \log(p) + t \log(1 - p) \right] = \frac{h}{p} - \frac{t}{1 - p} = 0$. Maximum likelihood solution: $p = \frac{h}{h + t}$.
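A quick numerical sanity check (a Python sketch; the counts are the biased coin from the earlier slide): the grid maximizer of the log-likelihood agrees with the closed-form solution $p = h/(h + t)$:

```python
import numpy as np

h, t = 20, 50  # the biased coin from the earlier slide

grid = np.linspace(0.001, 0.999, 99999)
log_lik = h * np.log(grid) + t * np.log(1.0 - grid)

print(grid[np.argmax(log_lik)])  # ~0.286, found numerically
print(h / (h + t))               # 0.2857..., the closed-form solution
```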

Max-likelihood: Logistic Regression. Features $x_i$ and labels $y_i$. Likelihood: $\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i : y_i = 0} (1 - p(x_i))$. Log-likelihood: $\log \ell(\beta_0, \beta_1) = \sum_{i : y_i = 1} \log p(x_i) + \sum_{i : y_i = 0} \log(1 - p(x_i))$. This is a concave maximization problem and can be solved using gradient descent.
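A minimal sketch of this maximization in Python, using gradient ascent on the concave log-likelihood (equivalently, gradient descent on its negative). The synthetic data, true coefficients, step size, and iteration count are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-feature data (hypothetical, for illustration only).
x = rng.normal(size=500)
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x)))  # true beta0 = -1, beta1 = 2
y = rng.binomial(1, p_true)

b0, b1 = 0.0, 0.0
step = 0.1  # step size (arbitrary choice)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))  # model probabilities p(x_i)
    # Average gradient of the log-likelihood: (1/n) * sum_i (y_i - p(x_i)) * [1, x_i].
    b0 += step * np.mean(y - p)
    b1 += step * np.mean((y - p) * x)

print(b0, b1)  # should approach the true coefficients (-1, 2)
```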

Multiple Logistic Regression. Multiple features: $p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p}}$, equivalent to $\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p$.
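In practice, standard libraries fit these coefficients by maximum likelihood. A small sketch using scikit-learn (assumptions for illustration: the data is synthetic, and a large C is passed because scikit-learn applies L2 regularization by default, so a large C approximates the plain maximum likelihood fit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                # two features X1, X2
logits = -1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

model = LogisticRegression(C=1e6).fit(X, y)  # large C: (almost) unregularized
print(model.intercept_, model.coef_)         # estimates of beta0 and (beta1, beta2)
print(model.predict_proba(X[:3]))            # p(x) for the first three points
```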

Multinomial Logistic Regression. Predicting multiple classes, e.g., medical diagnosis ($Y = 1$ if stroke, $2$ if overdose, $3$ if seizure) or predicting which products a customer purchases. A straightforward generalization of simple logistic regression: replace $\frac{e^{c_1}}{1 + e^{c_1}}$ with $\frac{e^{c_1}}{e^{c_1} + e^{c_2} + \ldots + e^{c_k}}$.
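A minimal Python sketch of this generalization (commonly called the softmax): it turns $k$ class scores $c_1, \ldots, c_k$ into probabilities that sum to 1:

```python
import numpy as np

def softmax(c):
    # Map class scores c_1..c_k to probabilities e^{c_j} / sum_i e^{c_i}.
    c = c - np.max(c)  # shift for numerical stability; does not change the result
    e = np.exp(c)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for stroke/overdose/seizure
print(softmax(scores))              # class probabilities, summing to 1
```

With two classes and scores $(c_1, 0)$, this reduces exactly to the logistic function from the earlier slides.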