Logistic Regression and Maximum Likelihood
Marek Petrik
Feb 9, 2017
So Far in ML
Regression vs. classification
Linear regression
Bias-variance decomposition
Practical methods for linear regression
Simple Linear Regression
We have only one feature:
Y ≈ β0 + β1 X
Y = β0 + β1 X + ε
Example: sales ≈ β0 + β1 × TV
[Figure: scatter plot of Sales vs. TV advertising budget with fitted regression line]
Multiple Linear Regression
[Figure: regression plane for response Y over two features X1, X2]
Types of Function f
Regression: continuous target, f : X → R
[Figure: Income as a smooth function of Years of Education and Seniority]
Classification: discrete target, f : X → {1, 2, 3, ..., k}
[Figure: two-class decision boundary in features X1, X2]
Today
Why not use linear regression for classification
Logistic regression
Maximum likelihood principle
Maximum likelihood for linear regression
Reading: ISL 4.1-3, ESL 2.6 (max likelihood)
Examples of Classification
1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
Examples of Classification
2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user's IP address, past transaction history, and so forth.
Examples of Classification
3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
IBM Watson
Logistic regression + clever function engineering
Image: Fair use, https://en.wikipedia.org/w/index.php?curid=31142331
Predicting Default
default ≈ f(income, balance)
[Figure: scatter plot of Income vs. Balance, colored by default status]
Predicting Default
default ≈ f(income, balance)
[Figure: boxplots of Balance and of Income, split by default status (No / Yes)]
Casting Classification as Regression
Regression: f : X → R
Classification: f : X → {1, 2, 3}
Casting Classification as Regression
Regression: f : X → R
Classification: f : X → {1, 2, 3}
But {1, 2, 3} ⊂ R. Do we even need classification?
Casting Classification as Regression
Regression: f : X → R
Classification: f : X → {1, 2, 3}
But {1, 2, 3} ⊂ R. Do we even need classification?
Yes!
Regression: values that are close are similar
Classification: distance between classes is meaningless
Casting Classification as Regression: Example
Predict possible diagnosis: {stroke, overdose, seizure}
Assign class labels:
Y = 1 if stroke, 2 if overdose, 3 if seizure
Fit linear regression
Casting Classification as Regression: Example
Predict possible diagnosis: {stroke, overdose, seizure}
Assign class labels:
Y = 1 if stroke, 2 if overdose, 3 if seizure
Fit linear regression
Make predictions: if uncertain whether symptoms point to stroke or seizure, we predict overdose
Linear Regression for 2-class Classification
Y = 1 if default, 0 otherwise
[Figure: P[default = yes | balance] vs. Balance; linear regression (left) vs. logistic regression (right)]
Logistic Regression
Predict the probability of a class: p(X)
Example: p(balance) is the probability of default for a person with that balance
Linear regression: p(X) = β0 + β1 X
Logistic regression: p(X) = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))
The same as: log( p(X) / (1 - p(X)) ) = β0 + β1 X
Odds: p(X) / (1 - p(X))
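A minimal numerical sketch of these two equivalent forms; the coefficients below are invented for illustration, not fitted to any data:

```python
import math

def logistic(x, b0, b1):
    """Class probability p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def log_odds(p):
    """Log-odds (logit): log(p / (1 - p)) recovers the linear part b0 + b1*x."""
    return math.log(p / (1.0 - p))

# Purely illustrative coefficients (not fitted)
b0, b1 = -4.0, 0.005
p = logistic(1500.0, b0, b1)   # probability at balance = 1500
# log_odds(p) recovers b0 + b1 * 1500 = 3.5 up to rounding
```

Note that the output is always strictly between 0 and 1, which is exactly what the linear model on the previous slide fails to guarantee.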
Logistic Function
y = e^x / (1 + e^x)
[Figure: logistic function for x in [-10, 10]; values lie in (0, 1)]
Logit Function
log( p(X) / (1 - p(X)) )
[Figure: logit function for p(X) in (0, 1)]
Logistic Regression
P[default = yes | balance] = e^(β0 + β1 × balance) / (1 + e^(β0 + β1 × balance))
[Figure: probability of default vs. Balance; linear regression (left) vs. logistic regression (right)]
Estimating Coefficients: Maximum Likelihood
Likelihood: probability that the data is generated from a model
l(model) = P[data | model]
Find the most likely model: max_model l(model) = max_model P[data | model]
The likelihood function is difficult to maximize directly
Transform it using log (strictly increasing): max_model log l(model)
A strictly increasing transformation does not change the maximizer
Example: Maximum Likelihood
Assume a coin with p as the probability of heads
Data: h heads, t tails
The likelihood function is: l(p) = p^h (1 - p)^t
[Figure: likelihood as a function of p]
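The coin likelihood is easy to evaluate directly. A small sketch in pure Python, using a grid of p values for the symmetric case h = t = 10:

```python
def likelihood(p, h, t):
    """Coin likelihood l(p) = p^h * (1 - p)^t."""
    return p ** h * (1.0 - p) ** t

# With h = t = 10 the likelihood is symmetric around p = 0.5,
# where it attains its maximum
h, t = 10, 10
grid = [i / 100.0 for i in range(1, 100)]
p_best = max(grid, key=lambda p: likelihood(p, h, t))
```

Evaluating this on finer grids also shows how the peak sharpens as the number of flips grows, as the next slides illustrate.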
Likelihood Function: 2 coin flips
heads h = 1, tails t = 1
[Figure: likelihood as a function of p, maximized at p = 0.5]
Likelihood Function: 20 coin flips
heads h = 10, tails t = 10
[Figure: likelihood as a function of p, maximized at p = 0.5 and more sharply peaked]
Likelihood Function: 200 coin flips
heads h = 100, tails t = 100
[Figure: likelihood as a function of p, maximized at p = 0.5 and very sharply peaked]
Maximizing Likelihood
The likelihood function l(p) = p^h (1 - p)^t is not concave: hard to maximize
Maximize the log-likelihood instead: log l(p) = h log(p) + t log(1 - p)
[Figure: log-likelihood as a function of p]
Log-likelihood: Biased Coin
heads h = 20, tails t = 50
[Figure: log-likelihood as a function of p, maximized at p = 2/7 ≈ 0.29]
Maximize Log-likelihood
Log-likelihood: log l(p) = h log(p) + t log(1 - p)
Maximize Log-likelihood
Log-likelihood: log l(p) = h log(p) + t log(1 - p)
Maximum where the derivative equals 0
Derivative: d/dp [ h log(p) + t log(1 - p) ] = h/p - t/(1 - p)
Maximize Log-likelihood
Log-likelihood: log l(p) = h log(p) + t log(1 - p)
Maximum where the derivative equals 0
Derivative: d/dp [ h log(p) + t log(1 - p) ] = h/p - t/(1 - p)
Maximum likelihood solution: p = h / (h + t)
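A quick numerical check of the closed-form solution, using a simple grid search over p (a sketch, not an efficient optimizer):

```python
import math

def log_likelihood(p, h, t):
    """Coin-flip log-likelihood: h*log(p) + t*log(1 - p)."""
    return h * math.log(p) + t * math.log(1.0 - p)

h, t = 20, 50
# Grid search over p in (0, 1); the maximizer should be near h / (h + t)
grid = [i / 1000.0 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, h, t))
# Closed form: p* = h / (h + t) = 20 / 70
```

The grid maximizer lands within grid resolution of h / (h + t), matching the derivative calculation above.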
Max-likelihood: Logistic Regression
Features x_i and labels y_i
Likelihood: l(β0, β1) = ∏_{i : y_i = 1} p(x_i) × ∏_{i : y_i = 0} (1 - p(x_i))
Log-likelihood: log l(β0, β1) = Σ_{i : y_i = 1} log p(x_i) + Σ_{i : y_i = 0} log(1 - p(x_i))
Concave maximization problem
Can be solved using gradient methods (gradient descent on the negative log-likelihood)
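The concave log-likelihood can be maximized by gradient ascent. Below is a minimal sketch for one feature on invented toy data; the gradient formula (sum of residuals y_i - p(x_i) times the inputs) follows from differentiating the log-likelihood above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.05, iters=2000):
    """Maximize the log-likelihood by gradient ascent (a minimal sketch).
    The gradient w.r.t. (b0, b1) is sum_i (y_i - p(x_i)) * (1, x_i)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)   # residual y_i - p(x_i)
            g0 += err
            g1 += err * x
        b0 += lr * g0                        # ascent: move *up* the gradient
        b1 += lr * g1
    return b0, b1

# Toy one-feature data: class 1 occurs at larger x
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

In practice one would use a library optimizer (e.g. Newton's method / IRLS), but the fixed-step loop above is enough to see the fitted slope separate the two classes.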
Multiple Logistic Regression
Multiple features:
p(X) = e^(β0 + β1 X1 + β2 X2 + ... + βp Xp) / (1 + e^(β0 + β1 X1 + β2 X2 + ... + βp Xp))
Equivalent to: log( p(X) / (1 - p(X)) ) = β0 + β1 X1 + β2 X2 + ... + βp Xp
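The multi-feature form is the same logistic function applied to a linear combination. A small sketch with made-up coefficients, checking the equivalent log-odds form:

```python
import math

def logistic_multi(x, beta):
    """p(X) for multiple features: logistic of beta[0] + sum_j beta[j+1] * x[j]."""
    z = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative coefficients (intercept plus two hypothetical features)
beta = [-1.0, 0.8, -0.3]
p = logistic_multi([2.0, 1.0], beta)
# log-odds log(p / (1 - p)) equals -1.0 + 0.8*2.0 - 0.3*1.0 = 0.3
```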
Multinomial Logistic Regression
Predicting multiple classes:
Medical diagnosis: Y = 1 if stroke, 2 if overdose, 3 if seizure
Predicting which products a customer purchases
Straightforward generalization of simple logistic regression:
e^(c1) / (1 + e^(c1))  becomes  e^(c1) / (e^(c1) + e^(c2) + ... + e^(ck))
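This generalization is the softmax function. A minimal sketch with hypothetical per-class scores:

```python
import math

def softmax(scores):
    """Multinomial class probabilities: e^(c_k) / (e^(c_1) + ... + e^(c_K)).
    Subtracting the max score avoids overflow without changing the result."""
    m = max(scores)
    exps = [math.exp(c - m) for c in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the three classes {stroke, overdose, seizure}
probs = softmax([2.0, 1.0, 0.1])
```

The outputs sum to 1 and preserve the ordering of the scores, so the class with the largest score gets the largest probability.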