Tufts COMP 135: Introduction to Machine Learning


Tufts COMP 135: Introduction to Machine Learning
https://www.cs.tufts.edu/comp/135/2019s/
Logistic Regression
Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books) 1

Logistics
- Waitlist: We have room, contact me ASAP
- HW3 due Wed. Please annotate pages in Gradescope! Remember: turn in on time!
- Recitation tonight (7:30-8:30pm, Room 111B): practical binary classifiers in Python with sklearn; numerical issues and how to address them 2

Objectives: Logistic Regression Unit
- Refresher: a taste of 3 methods: Logistic Regression, k-NN, Decision Trees
- Logistic Regression: a probabilistic classifier
- 3 views on why we optimize log loss: upper bound on error rate, minimize (cross) entropy, maximize (log) likelihood
- Computing gradients
- Training via gradient descent 3

What will we learn? [diagram: the three paradigms (Supervised, Unsupervised, Reinforcement Learning), with the supervised learning pipeline highlighted: a training task over data-label pairs $\{x_n, y_n\}_{n=1}^N$, an evaluation/performance measure, and prediction of a label y from data x] 4

Task: Binary Classification [diagram: binary classification shown under Supervised Learning; a scatter plot with axes $x_1$ and $x_2$, where y is a binary variable (red or blue)] 5

Example: Hotdog or Not https://www.theverge.com/tldr/2017/5/14/15639784/hbosilicon-valley-not-hotdog-app-download 6

Binary Prediction
Goal: Predict label (0 or 1) given features x
Input: $x_i \triangleq [x_{i1}, x_{i2}, \ldots, x_{if}, \ldots, x_{iF}]$, called features, covariates, or attributes. Entries can be real-valued, or other numeric types (e.g. integer, binary).
Output: $y_i \in \{0, 1\}$, called responses or labels. Binary label (0 or 1).
>>> yhat_n = model.predict(x_nf)
>>> yhat_n[:5]
[0, 0, 1, 0, 1] 7

Probability Prediction
Goal: Predict probability of label given features
Input: $x_i \triangleq [x_{i1}, x_{i2}, \ldots, x_{if}, \ldots, x_{iF}]$, called features, covariates, or attributes. Entries can be real-valued, or other numeric types (e.g. integer, binary).
Output: $\hat{p}_i \triangleq p(y_i = 1 \mid x_i)$, a probability value between 0 and 1, e.g. 0.001, 0.513, 0.987
>>> yproba_n2 = model.predict_proba(x_nf)
>>> yproba1_n = model.predict_proba(x_nf)[:,1]
>>> yproba1_n[:5]
[0.143, 0.432, 0.523, 0.003, 0.994] 8
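
The `>>>` snippets above assume an already-fitted `model`. A minimal sketch, assuming a made-up six-point dataset and sklearn's LogisticRegression (the classifier this unit builds toward), of how such a model could be fit and queried:

```python
# Sketch only: hypothetical toy data with N=6 examples and F=2 features.
import numpy as np
from sklearn.linear_model import LogisticRegression

x_nf = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5],
                 [3.0, 3.0], [4.0, 1.5], [5.0, 4.0]])
y_n = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression(C=1.0)        # C is the inverse regularization strength
model.fit(x_nf, y_n)

yhat_n = model.predict(x_nf)             # hard 0/1 labels (thresholded at 0.5)
yproba_n2 = model.predict_proba(x_nf)    # one column of probabilities per class
yproba1_n = yproba_n2[:, 1]              # probability of the positive class
```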

Decision Tree Classifier Goal: Does patient have heart disease? Leaves make binary predictions! (but can be made probabilistic) 9

Decision Tree Classifier
Parameters:
- at each internal node: x variable id and threshold
- at each leaf: probability of positive y label
Prediction:
- identify the rectangular region for input x
- predict: most common y value in region
- predict_proba: report fraction of each label in region
Training:
- minimize error on training set
- often, use greedy heuristics 10
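
A minimal sketch, assuming the same kind of made-up toy data as above, of these ideas in sklearn's DecisionTreeClassifier:

```python
# Sketch only: depth-limited decision tree on hypothetical data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

x_nf = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5],
                 [3.0, 3.0], [4.0, 1.5], [5.0, 4.0]])
y_n = np.array([0, 0, 0, 1, 1, 1])

# max_depth and min_samples_leaf limit how finely the tree can partition the feature space
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=1, criterion='gini')
tree.fit(x_nf, y_n)

print(tree.predict(x_nf))          # most common y in each leaf's rectangular region
print(tree.predict_proba(x_nf))    # fraction of each label within that region
```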

Decision Tree: Predicted Probas Pretty flexible! Function is piecewise constant and axis-aligned 11

K nearest neighbor classifier
Parameters: K, the number of neighbors
Prediction:
- find the K nearest training vectors to input x
- predict: vote for the most common y in the neighborhood
- predict_proba: report fraction of labels
Training: none needed (use training data as a lookup table) 12
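
A minimal sketch, again on made-up data, using sklearn's KNeighborsClassifier; the number of neighbors, the distance metric, and how neighbors vote are the main knobs:

```python
# Sketch only: k-NN classifier on hypothetical data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

x_nf = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5],
                 [3.0, 3.0], [4.0, 1.5], [5.0, 4.0]])
y_n = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean', weights='uniform')
knn.fit(x_nf, y_n)                  # "training" just stores the data as a lookup table

x_test = np.array([[2.5, 1.0]])
print(knn.predict(x_test))          # majority vote among the 3 nearest neighbors
print(knn.predict_proba(x_test))    # fraction of each label among those neighbors
```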

KNN: Predicted Probas Very flexible! Function is piecewise constant 13

Logistic Regression classifier
Parameters:
- weight vector $w = [w_1, w_2, \ldots, w_f, \ldots, w_F]$
- bias scalar $b$
Prediction: $\hat{p}(x_i, w, b) \triangleq p(y_i = 1 \mid x_i) = \mathrm{sigmoid}\left( \sum_{f=1}^{F} w_f x_{if} + b \right)$
Training: find weights and bias that minimize error 14

Logistic Sigmoid Function
Goal: Transform real values into probabilities
$\mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$ maps any real z to a probability in (0, 1) 15
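
A minimal NumPy sketch of this function; the helper name and the two-branch form are choices made here to avoid overflow for large negative z (one of the numerical issues flagged in the recitation note above):

```python
import numpy as np

def sigmoid(z):
    """Map an array of real scores z to probabilities in (0, 1), avoiding overflow."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))     # exp(-z) <= 1 here, safe
    expz = np.exp(z[~pos])                       # z < 0, so exp(z) <= 1, also safe
    out[~pos] = expz / (1.0 + expz)
    return out

print(sigmoid([-500.0, -1.0, 0.0, 1.0, 500.0]))  # [~0, 0.269, 0.5, 0.731, ~1]
```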

Logistic Regression: Training
Optimization: Minimize total log loss on the training set: $\min_{w,b} \sum_{n=1}^{N} \text{log loss}\big(y_n, \hat{p}(x_n, w, b)\big)$
Algorithm: Gradient descent (today!)
Avoid overfitting: Use L2 or L1 penalty on weights 16

Logistic Regr: Predicted Probas Function is monotonically increasing in one direction Decision boundaries will be linear 17

Summary of Methods
- Logistic Regression: linear function class; knobs to tune: L2/L1 penalty on weights; interpret by inspecting weights
- Decision Tree Classifier: axis-aligned, piecewise constant function class; knobs to tune: max. depth, min. leaf size, goal criteria; interpret by inspecting the tree
- K Nearest Neighbors Classifier: piecewise constant function class; knobs to tune: number of neighbors, distance metric, how neighbors vote; interpret by inspecting neighbors 18

Optimization Objective Why minimize log loss? An upper bound justification 19

Log loss upper bounds error rate
$\mathrm{error}(y, \hat{y}) = \begin{cases} 1 & \text{if } y \neq \hat{y} \\ 0 & \text{if } y = \hat{y} \end{cases}$
$\text{log loss}(y, \hat{p}) = -y \log \hat{p} - (1-y) \log(1 - \hat{p})$
Plot assumes:
- True label is 1
- Threshold is 0.5
- Log base 2 20
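
A small numeric check of the bound, under the same assumptions as the plot (true label 1, threshold 0.5, log base 2); the array names here are made up:

```python
import numpy as np

p_hat = np.linspace(0.01, 0.99, 99)      # candidate predicted probabilities for y=1
log2_loss = -np.log2(p_hat)              # log loss when the true label is 1, in bits
error = (p_hat < 0.5).astype(float)      # 0/1 error after thresholding at 0.5

assert np.all(log2_loss >= error)        # the log loss curve never dips below the error rate
```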

Optimization Objective Why minimize log loss? An information-theory justification 21

Entropy of Binary Random Var.
Goal: Entropy of a distribution captures the amount of uncertainty
$\mathrm{entropy}(X) = -p(X=1)\log_2 p(X=1) - p(X=0)\log_2 p(X=0)$
Log base 2: units are bits. Log base e: units are nats.
1 bit of information is always enough to represent a binary variable X; entropy tells us how much of this one bit is uncertain 22

Entropy of Binary Random Var.
Goal: Entropy of a distribution captures the amount of uncertainty
$\mathrm{entropy}(X) = -p(X=1)\log_2 p(X=1) - p(X=0)\log_2 p(X=0)$
$H[X] = -\sum_{x \in \{0,1\}} p(X=x)\log_2 p(X=x) = \mathbb{E}_{x \sim p(X)}\left[-\log_2 p(X=x)\right]$
Entropy is the average number of bits needed to encode an outcome. Want: low entropy (low-cost storage and transmission!) 23
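
A minimal sketch of this formula in NumPy; the helper name is made up, and values at 0 or 1 are clipped to dodge log2(0):

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a binary variable with p(X=1) = p."""
    p = np.asarray(p, dtype=float)
    p = np.clip(p, 1e-12, 1 - 1e-12)       # avoid log2(0); endpoints contribute ~0 bits anyway
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(binary_entropy([0.5, 0.9, 0.99]))    # ~[1.0, 0.47, 0.08]: most uncertain at p = 0.5
```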

Cross Entropy
Goal: Measure the cost of using an estimated distribution q to capture the true distribution p
$\mathrm{Entropy}[p(X)] = -\sum_{x \in \{0,1\}} p(X=x)\log_2 p(X=x)$
$\text{Cross-Entropy}[p(X), q(X)] = -\sum_{x \in \{0,1\}} p(X=x)\log_2 q(X=x)$
Info theory interpretation: the average number of bits needed to encode samples from a true distribution p(X) with codes defined by a model q(X)
Goal: Want a model q that uses fewer bits! Lower cross-entropy! 24

Log loss is cross entropy!
Let our true distribution p(Y) be the empirical distribution of labels in the training set.
Let our model distribution q(Y) be the estimated probabilities from logistic regression.
$\text{Cross-Entropy}[p(Y), q(Y)] = \mathbb{E}_{y \sim p(Y)}\left[-\log q(Y=y)\right] = \frac{1}{N}\sum_{n=1}^{N} \left[ -y_n \log \hat{p}_n - (1-y_n)\log(1-\hat{p}_n) \right]$
Same as the log loss!
Info theory justification for log loss: we want to set logistic regression weights to provide the best encoding of the training data's label distribution 25

The log loss metric
Log loss (aka "binary cross entropy")
from sklearn.metrics import log_loss
$\text{log loss}(y, \hat{p}) = -y\log\hat{p} - (1-y)\log(1-\hat{p})$
Lower is better!
Advantages: smooth, easy to take derivatives! 26
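
A minimal sketch (made-up labels and probabilities) comparing sklearn's log_loss to the formula above; note that sklearn averages over examples and uses the natural log:

```python
import numpy as np
from sklearn.metrics import log_loss

y_n = np.array([0, 0, 1, 1])
proba1_n = np.array([0.1, 0.4, 0.35, 0.8])    # predicted p(y=1 | x) for each example

print(log_loss(y_n, proba1_n))                # mean log loss in nats

manual = np.mean(-y_n * np.log(proba1_n) - (1 - y_n) * np.log(1 - proba1_n))
print(manual)                                 # same value, written out from the formula
```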

Optimization Objective Why minimize log loss? A probabilistic justification 27

Likelihood of labels under LR
We can write the probability for each outcome of Y as:
$p(y_i = 1 \mid x_i) = \mathrm{sigmoid}(w^T x_i + b)$
$p(y_i = 0 \mid x_i) = 1 - \mathrm{sigmoid}(w^T x_i + b)$
We can write the probability mass function of Y as (writing $\sigma$ for the sigmoid):
$p(Y_i = y_i \mid x_i) = \sigma(w^T x_i + b)^{y_i}\,\big(1 - \sigma(w^T x_i + b)\big)^{1 - y_i}$
Interpretation: $p(y \mid x)$ is the likelihood of label y given input features x
Goal: Fit the model to make the training data as likely as possible 28

Maximizing likelihood
$\max_{w,b} \prod_{n=1}^{N} p(Y_n = y_n \mid x_n, w, b)$
Why might this be hard in practice? Think about datasets with 1000s of examples N 29
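
A small numeric illustration of the difficulty (hypothetical per-example probabilities): the raw product of thousands of values below 1 underflows, while the sum of logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
per_example_probs = rng.uniform(0.5, 0.9, size=5000)   # stand-ins for p(y_n | x_n, w, b)

print(np.prod(per_example_probs))          # 0.0: the product underflows double precision
print(np.sum(np.log(per_example_probs)))   # a finite negative number: work in log space instead
```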

Maximizing log likelihood
The logarithm (with any base) is a monotonic transform: a > b implies log(a) > log(b).
Thus, the following are equivalent problems:
$w^*, b^* = \arg\max_{w,b} \prod_{n=1}^{N} p(Y_n = y_n \mid x_n, w, b)$
$w^*, b^* = \arg\max_{w,b} \log\left[\prod_{n=1}^{N} p(Y_n = y_n \mid x_n, w, b)\right] = \arg\max_{w,b} \sum_{n=1}^{N} \log p(Y_n = y_n \mid x_n, w, b)$ 30

Log likelihood for LR
We can write the probability mass function of Y as:
$p(Y_i = y_i \mid x_i) = \sigma(w^T x_i + b)^{y_i}\,\big(1 - \sigma(w^T x_i + b)\big)^{1 - y_i}$
Our training objective is to maximize the log likelihood:
$w^*, b^* = \arg\max_{w,b} \; J(w, b), \quad J(w, b) = \log\left[\prod_{n=1}^{N} p(Y_n = y_n \mid x_n, w, b)\right]$
Pair Exercise: Simplify the training objective J(w, b)! Can you recover a familiar form? 31

Minimize negative log likelihood
Two equivalent optimization problems:
$w^*, b^* = \arg\max_{w,b} \sum_{n=1}^{N} \log p(Y_n = y_n \mid x_n, w, b)$
$w^*, b^* = \arg\min_{w,b} \; -\sum_{n=1}^{N} \log p(Y_n = y_n \mid x_n, w, b)$ 32

Summary of Likelihood interpretation
LR defines a probabilistic model for Y given x.
We want to maximize the probability of the training data (the "likelihood") under this model.
We can show that another optimization problem ("maximize log likelihood") is easier numerically but produces the same optimal values for the weights and bias.
It turns out that minimizing log loss is precisely the same thing as minimizing the negative log likelihood. 33

Computing gradients 34

Simplified LR notation
- Feature vector with first entry constant: $\tilde{x}_n = [1\; x_{n1}\; x_{n2} \ldots x_{nF}]$
- Weight vector (first entry is the bias): $w = [w_0\; w_1\; w_2 \ldots w_F]$
- Score value $z_n = w^T \tilde{x}_n$ (a real number, from $-\infty$ to $+\infty$) 35

Simplifying the log likelihood 36
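
The worked derivation on this slide is not captured in the transcript; a sketch consistent with the expression on the next slide, using the notation above with $z_n = w^T \tilde{x}_n$ and $\sigma$ for the sigmoid:
$\log p(Y_n = y_n \mid x_n, w) = y_n \log \sigma(z_n) + (1 - y_n)\log\big(1 - \sigma(z_n)\big)$
Using $\log \sigma(z) = z - \log(1 + e^{z})$ and $\log\big(1 - \sigma(z)\big) = -\log(1 + e^{z})$:
$= y_n\big(z_n - \log(1 + e^{z_n})\big) - (1 - y_n)\log(1 + e^{z_n}) = y_n z_n - \log(1 + e^{z_n})$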

Gradient of the log likelihood
Log likelihood (per example): $J(z_n(w)) = y_n z_n - \log(1 + e^{z_n})$
Gradient w.r.t. the weight on feature f (chain rule): $\frac{d}{dw_f} J(z_n(w)) = \frac{d}{dz_n} J(z_n) \cdot \frac{d z_n}{d w_f}$
Simplifying yields: $\frac{d}{dw_f} J(z_n(w)) = \big(y_n - \sigma(z_n)\big)\, x_{nf}$ 37
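
A minimal sketch (made-up data; the bias is folded into w via a constant first feature, as in the simplified notation) that computes this gradient and checks it against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(w, x_nf, y_n):
    z_n = x_nf @ w
    return np.sum(y_n * z_n - np.log1p(np.exp(z_n)))   # sum_n [ y_n z_n - log(1 + e^{z_n}) ]

def grad_log_lik(w, x_nf, y_n):
    return x_nf.T @ (y_n - sigmoid(x_nf @ w))           # sum_n (y_n - sigma(z_n)) x_nf

rng = np.random.default_rng(0)
x_nf = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])   # first column is the constant 1
y_n = (rng.uniform(size=20) < 0.5).astype(float)
w = rng.normal(size=3)

eps = 1e-6
w_plus = w.copy(); w_plus[0] += eps
numeric = (log_lik(w_plus, x_nf, y_n) - log_lik(w, x_nf, y_n)) / eps
print(numeric, grad_log_lik(w, x_nf, y_n)[0])   # should agree to several decimal places
```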

Partner Activity
Try the notebook here: https://github.com/tufts-ml-courses/comp135-19sassignments/blob/master/labs/logisticregressiondemo.ipynb
Goals: Build understanding
- What is the optimal w for the 1D example?
- What is the optimal w for the 2D example?
- Why is regularization important here? 38

STOP: End of class 2/11 39

Gradient descent for LR 40

Gradient descent for L2 penalized LR
Objective: $\min_{w, w_0} J(w, w_0)$, where $J(w, w_0) = \sum_{n=1}^{N} \text{log loss}\big(y_n, \hat{p}(x_n, w, w_0)\big) + \alpha \lVert w \rVert_2^2$
Algorithm (gradient descent with step size s):
- Start with $w^{(0)} = 0$, $w_0^{(0)} = 0$
- for $t = 0, \ldots, T-1$:
  - $w^{(t+1)} = w^{(t)} - s\, \nabla_w J(w^{(t)}, w_0^{(t)})$
  - $w_0^{(t+1)} = w_0^{(t)} - s\, \nabla_{w_0} J(w^{(t)}, w_0^{(t)})$
  - if $J(w^{(t+1)}, w_0^{(t+1)})$ has stopped improving (change below a tolerance), break
- return $w^{(t+1)}, w_0^{(t+1)}$ 41
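
A minimal NumPy sketch of this loop (function and variable names are made up; the bias w0 is left unpenalized, one common convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_loss(w, w0, x_nf, y_n, alpha):
    p_n = sigmoid(x_nf @ w + w0)
    eps = 1e-12                                   # guard against log(0)
    log_loss = -np.sum(y_n * np.log(p_n + eps) + (1 - y_n) * np.log(1 - p_n + eps))
    return log_loss + alpha * np.sum(w ** 2)

def fit_lr_gd(x_nf, y_n, alpha=1.0, step_size=0.01, max_iters=5000, tol=1e-8):
    w, w0 = np.zeros(x_nf.shape[1]), 0.0
    prev = penalized_loss(w, w0, x_nf, y_n, alpha)
    for t in range(max_iters):
        err_n = sigmoid(x_nf @ w + w0) - y_n      # derivative of log loss w.r.t. the score z_n
        grad_w = x_nf.T @ err_n + 2 * alpha * w   # gradient of log loss plus L2 penalty
        grad_w0 = np.sum(err_n)
        w, w0 = w - step_size * grad_w, w0 - step_size * grad_w0
        loss = penalized_loss(w, w0, x_nf, y_n, alpha)
        if abs(prev - loss) < tol:                # stop once progress stalls
            break
        prev = loss
    return w, w0

# Usage on the same kind of toy data as the earlier sketches
x_nf = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, 3.0], [4.0, 1.5], [5.0, 4.0]])
y_n = np.array([0., 0., 0., 1., 1., 1.])
w, w0 = fit_lr_gd(x_nf, y_n, alpha=0.1, step_size=0.05)
print(np.round(sigmoid(x_nf @ w + w0), 2))        # predicted probabilities per example
```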

Will gradient descent always find same solution? 42

Log loss is convex! 43

Intuition: 1D minimization [figure: plots of a 1D objective f(x) versus its parameter x] 44

Log likelihood vs iterations 45

If step size is too small 46

If step size is large 47

If step size is too large 48

If step size is way too large 49

Rule for picking step sizes
Never try just one! Try several values (exponentially spaced) until:
- you find one clearly too small
- you find one clearly too large (oscillation / divergence)
Always make trace plots! Show the loss, the norm of the gradient, and the parameters.
Smarter choices for step size:
- decaying methods
- search methods
- adaptive methods 50

Decaying step sizes Decay over iterations 51

Searching for good step sizes: Line Search (scipy.optimize.line_search) 52
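
A minimal sketch of that SciPy routine on a toy quadratic (the objective and starting point here are made up): it returns a step size along a chosen descent direction satisfying the Wolfe conditions.

```python
import numpy as np
from scipy.optimize import line_search

def f(x):
    return 0.5 * np.sum(x ** 2)        # simple convex objective

def grad_f(x):
    return x

x = np.array([3.0, -2.0])
direction = -grad_f(x)                 # steepest-descent direction at x

result = line_search(f, grad_f, x, direction)
alpha = result[0]                      # chosen step size (first element of the returned tuple)
print(alpha, x + alpha * direction)    # for this quadratic, a full step lands at the minimum
```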