Logistic Regression Introduction to Machine Learning. Matt Gormley Lecture 8 Feb. 12, 2018


10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 8 Feb. 12, 2018 1

10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Probabilistic Learning Matt Gormley Lecture 8 Feb. 12, 2018 2

Reminders Homework 3: KNN, Perceptron, Lin.Reg. Out: Wed, Feb 7 Due: Wed, Feb 14 at 11:59pm 4

STOCHASTIC GRADIENT DESCENT 5

Stochastic Gradient Descent (SGD)
Algorithm 2: Stochastic Gradient Descent (SGD)
1: procedure SGD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     for i ∈ shuffle({1, 2, ..., N}) do
5:       θ ← θ − λ ∇_θ J^(i)(θ)
6:   return θ
We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ).
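The pseudocode above maps directly onto a short loop. Below is a minimal Python sketch of generic SGD (not from the slides); grad_Ji, lam, and num_epochs are hypothetical names, and a fixed epoch budget stands in for the "while not converged" test:

```python
import numpy as np

def sgd(grad_Ji, theta0, N, lam=0.1, num_epochs=10, seed=0):
    """Generic SGD: repeatedly step opposite the per-example gradient.

    grad_Ji(theta, i) is assumed to return the gradient of J^(i) at theta.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(num_epochs):            # stand-in for "while not converged"
        for i in rng.permutation(N):       # visit the examples in a random order each epoch
            theta -= lam * grad_Ji(theta, i)   # theta <- theta - lambda * grad J^(i)(theta)
    return theta
```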

Expectations of Gradients 8

Convergence of Optimizers 9

Optimization Objectives
You should be able to:
- Apply gradient descent to optimize a function
- Apply stochastic gradient descent (SGD) to optimize a function
- Apply knowledge of zero derivatives to identify a closed-form solution (if one exists) to an optimization problem
- Distinguish between convex, concave, and nonconvex functions
- Obtain the gradient (and Hessian) of a (twice) differentiable function

Linear Regression Objectives
You should be able to:
- Design k-NN Regression and Decision Tree Regression
- Implement learning for Linear Regression using three optimization techniques: (1) closed form, (2) gradient descent, (3) stochastic gradient descent
- Choose a Linear Regression optimization technique that is appropriate for a particular dataset by analyzing the tradeoff of computational complexity vs. convergence speed
- Distinguish the three sources of error identified by the bias-variance decomposition: bias, variance, and irreducible error

PROBABILISTIC LEARNING 14

Probabilistic Learning
Function Approximation: Previously, we assumed that our output was generated using a deterministic target function, y = c*(x). Our goal was to learn a hypothesis h(x) that best approximates c*(x).
Probabilistic Learning: Today, we assume that our output is sampled from a conditional probability distribution, y ~ p*(y|x). Our goal is to learn a probability distribution p(y|x) that best approximates p*(y|x).

Robotic Farming
- Deterministic, Classification (binary output): Is this a picture of a wheat kernel?
- Deterministic, Regression (continuous output): How many wheat kernels are in this picture?
- Probabilistic, Classification (binary output): Is this plant drought resistant?
- Probabilistic, Regression (continuous output): What will the yield of this plant be?

Maximum Likelihood Estimation 17

Learning from Data (Frequentist)
Whiteboard:
- Principle of Maximum Likelihood Estimation (MLE)
- Strawmen:
  - Example: Bernoulli
  - Example: Gaussian
  - Example: Conditional #1 (Bernoulli conditioned on Gaussian)
  - Example: Conditional #2 (Gaussians conditioned on Bernoulli)
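As a concrete instance of the Bernoulli example listed above, here is a minimal sketch (mine, not from the whiteboard) of the Bernoulli MLE, which works out to the fraction of positive outcomes in the data:

```python
import numpy as np

def bernoulli_mle(y):
    """MLE for the Bernoulli parameter phi given binary outcomes y in {0, 1}.

    Maximizing the log-likelihood sum_i [y_i*log(phi) + (1-y_i)*log(1-phi)]
    and setting the derivative to zero gives phi_hat = mean(y).
    """
    return np.asarray(y).mean()

# Example: 7 heads out of 10 flips -> phi_hat = 0.7
print(bernoulli_mle([1, 1, 1, 0, 1, 0, 1, 1, 0, 1]))
```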

Outline
- Motivation: Choosing the right classifier
  - Example: Image Classification
- Logistic Regression
  - Background: Hyperplanes
  - Data, Model, Learning, Prediction
  - Log-odds
  - Bernoulli interpretation
  - Maximum Conditional Likelihood Estimation
  - Gradient descent for Logistic Regression
  - Stochastic Gradient Descent (SGD)
  - Computing the gradient
  - Details (learning rate, finite differences)

MOTIVATION: LOGISTIC REGRESSION 20

Example: Image Classification ImageNet LSVRC-2010 contest: Dataset: 1.2 million labeled images, 1000 classes Task: Given a new image, label it with the correct class Multiclass classification problem Examples from http://image-net.org/ 23


Example: Image Classification
CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on the ImageNet LSVRC-2010 contest.
- Input image (pixels)
- Five convolutional layers (w/ max-pooling)
- Three fully connected layers
- 1000-way softmax
This softmax layer is Logistic Regression! The rest is just some fancy feature extraction (discussed later in the course).
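To make the "softmax layer is Logistic Regression" point concrete, here is a minimal sketch (not from the slides) of a K-way softmax, the multiclass generalization of the logistic function:

```python
import numpy as np

def softmax(scores):
    """Map a vector of K real-valued scores to a K-way probability distribution.

    Subtracting the max first is a standard trick for numerical stability;
    it does not change the result.
    """
    z = scores - np.max(scores)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Example: 3 class scores -> probabilities that sum to 1
print(softmax(np.array([2.0, 1.0, -1.0])))
```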

LOGISTIC REGRESSION 29

Logistic Regression
Data: Inputs are continuous vectors of length K. Outputs are discrete.
We are back to classification, despite the name "logistic regression."

Linear Models for Classification
Key idea: Try to learn this hyperplane directly.
Looking ahead: We'll see a number of commonly used Linear Classifiers. These include:
- Perceptron
- Logistic Regression
- Naïve Bayes (under certain conditions)
- Support Vector Machines
Directly modeling the hyperplane would use a decision function h(x) = sign(θ^T x), for y ∈ {−1, +1}.

Background: Hyperplanes
Notation Trick: fold the bias b and the weights w into a single vector θ by prepending a constant to x and increasing dimensionality by one!
Hyperplane (Definition 1): H = {x : w^T x = b}
Hyperplane (Definition 2): H = {x : θ^T x = 0 and x₁ = 1}
Half-spaces: the two regions on either side of the hyperplane, {x : θ^T x > 0} and {x : θ^T x < 0}
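The notation trick is easy to check numerically. A minimal sketch (not from the slides), assuming the convention θ = [−b, w] with a constant 1 prepended to x; the slide leaves the exact sign convention to its figure:

```python
import numpy as np

# Notation trick: fold bias b and weights w into one parameter vector theta.
# Assumed convention (not spelled out in the transcript): theta = [-b, w_1, ..., w_K]
# and each input x gets a constant 1 prepended.
w = np.array([2.0, -1.0, 0.5])
b = 0.75
x = np.array([1.0, 3.0, -2.0])

theta = np.concatenate(([-b], w))       # length K + 1
x_aug = np.concatenate(([1.0], x))      # prepend the constant feature

# theta^T x_aug equals w^T x - b, so the two decision functions agree.
print(np.isclose(theta @ x_aug, w @ x - b))   # True
```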

Using gradient ascent for linear classifiers
Key idea behind today's lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn parameters
4. Predict the class with highest probability under the model

Using gradient ascent for linear classifiers
This decision function isn't differentiable: h(x) = sign(θ^T x)
Use a differentiable function instead: p_θ(y = 1 | x) = 1 / (1 + exp(−θ^T x)), where logistic(u) = 1 / (1 + e^(−u)).

Logistic Regression
Data: Inputs are continuous vectors of length K. Outputs are discrete.
Model: Logistic function applied to dot product of parameters with input vector:
p_θ(y = 1 | x) = 1 / (1 + exp(−θ^T x))
Learning: Finds the parameters that minimize some objective function: θ* = argmin_θ J(θ)
Prediction: Output is the most probable class: ŷ = argmax_{y ∈ {0,1}} p_θ(y | x)
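A minimal sketch of this model (mine, not from the slides; the function and variable names are hypothetical), covering the logistic function, p_θ(y = 1 | x), and the argmax prediction rule:

```python
import numpy as np

def logistic(u):
    """logistic(u) = 1 / (1 + e^(-u))"""
    return 1.0 / (1.0 + np.exp(-u))

def predict_proba(theta, x):
    """p_theta(y = 1 | x) for logistic regression."""
    return logistic(theta @ x)

def predict(theta, x):
    """y_hat = argmax over y in {0, 1} of p_theta(y | x)."""
    return 1 if predict_proba(theta, x) >= 0.5 else 0

theta = np.array([0.5, -2.0, 1.0])
x = np.array([1.0, 0.2, 0.7])   # here x[0] = 1 is the folded-in bias feature
print(predict_proba(theta, x), predict(theta, x))
```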

Logistic Regression
Whiteboard:
- Bernoulli interpretation
- Logistic Regression Model
- Decision boundary

Logistic Regression 38


LEARNING LOGISTIC REGRESSION 41

Maximum Conditional Likelihood Estimation
Learning: Finds the parameters that minimize some objective function: θ* = argmin_θ J(θ)
We minimize the negative log conditional likelihood:
J(θ) = −log Π_{i=1}^N p_θ(y^(i) | x^(i)) = −Σ_{i=1}^N log p_θ(y^(i) | x^(i))
Why?
1. We can't maximize likelihood (as in Naïve Bayes) because we don't have a joint model p(x, y).
2. It worked well for Linear Regression (least squares is MCLE).
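A minimal sketch (not from the slides) of this objective, writing the negative log conditional likelihood as a sum of per-example terms:

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def neg_log_cond_likelihood(theta, X, y):
    """J(theta) = -log prod_i p_theta(y_i | x_i) = -sum_i log p_theta(y_i | x_i).

    X: (N, K) array of inputs; y: (N,) array of labels in {0, 1}.
    """
    p = logistic(X @ theta)                  # p_theta(y = 1 | x_i) for each i
    p_of_y = np.where(y == 1, p, 1.0 - p)    # p_theta(y_i | x_i)
    return -np.sum(np.log(p_of_y))
```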

Maximum Conditional Likelihood Estimation
Learning: Four approaches to solving θ* = argmin_θ J(θ)
- Approach 1: Gradient Descent (take larger, more certain steps opposite the gradient)
- Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
- Approach 3: Newton's Method (use second derivatives to better follow curvature)
- Approach 4: Closed Form??? (set derivatives equal to zero and solve for parameters)
Logistic Regression does not have a closed form solution for MLE parameters.
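Approaches 1 and 2 are spelled out as pseudocode on the next slides. For completeness, a minimal sketch (not from the slides) of a single Newton's-method update, which uses second derivatives (the Hessian) to follow curvature; grad and hessian are assumed to be caller-supplied functions:

```python
import numpy as np

def newton_step(theta, grad, hessian):
    """One Newton's-method update: theta <- theta - H(theta)^{-1} grad(theta).

    Solving the linear system H d = grad avoids forming the inverse explicitly.
    """
    return theta - np.linalg.solve(hessian(theta), grad(theta))
```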

Gradient Descent
Algorithm 1: Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ − λ ∇_θ J(θ)
5:   return θ
In order to apply GD to Logistic Regression, all we need is the gradient of the objective function (i.e. the vector of partial derivatives):
∇_θ J(θ) = [dJ(θ)/dθ_1, dJ(θ)/dθ_2, ..., dJ(θ)/dθ_N]^T
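The outline also promised "finite differences"; a common use is sanity-checking a gradient like the one above. A minimal sketch (not from the slides) that approximates each partial derivative of J numerically:

```python
import numpy as np

def finite_difference_grad(J, theta, eps=1e-5):
    """Approximate the vector of partial derivatives of J at theta.

    Each component uses a central difference:
    dJ/dtheta_m ~= (J(theta + eps*e_m) - J(theta - eps*e_m)) / (2*eps)
    """
    grad = np.zeros_like(theta)
    for m in range(len(theta)):
        e = np.zeros_like(theta)
        e[m] = eps
        grad[m] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# Example: J(theta) = theta_1^2 + 3*theta_2, so the gradient is [2*theta_1, 3]
J = lambda t: t[0] ** 2 + 3 * t[1]
print(finite_difference_grad(J, np.array([1.0, -2.0])))   # approximately [2.0, 3.0]
```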

Stochastic Gradient Descent (SGD)
We can also apply SGD to solve the MCLE problem for Logistic Regression.
We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = −log p_θ(y^(i) | x^(i)).

GRADIENT FOR LOGISTIC REGRESSION 48

Learning for Logistic Regression
Whiteboard:
- Partial derivative for Logistic Regression
- Gradient for Logistic Regression
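For reference, the per-example gradient derived on the whiteboard has a compact closed form; a minimal sketch (mine, not from the slides), assuming y ∈ {0, 1} and J^(i)(θ) = −log p_θ(y^(i) | x^(i)) as defined earlier:

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_Ji(theta, x_i, y_i):
    """Gradient of J^(i)(theta) = -log p_theta(y_i | x_i) for logistic regression.

    Working out the calculus gives
    dJ^(i)/dtheta_m = (p_theta(y = 1 | x_i) - y_i) * x_i[m],
    i.e. the residual times the input vector.
    """
    return (logistic(theta @ x_i) - y_i) * x_i
```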

Details: Picking the learning rate
- Use grid search in log-space over small values on a tuning set: e.g., 0.01, 0.001, ...
- Sometimes, decrease it after each pass: e.g. by a factor of 1/(1 + d·t), where t is the epoch number; sometimes 1/t^2
- Fancier techniques I won't talk about: adaptive gradient methods that scale the gradient differently for each dimension (AdaGrad, Adam, ...)
Slide courtesy of William Cohen
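A minimal sketch (not from the slide) of the two decay schedules mentioned above; the initial rate lr0 and decay constant d are hypothetical example values:

```python
def lr_inverse(t, lr0=0.01, d=0.5):
    """Decay by a factor of 1/(1 + d*t), where t is the epoch number."""
    return lr0 / (1.0 + d * t)

def lr_inverse_squared(t, lr0=0.01):
    """Alternative mentioned on the slide: decay like 1/t^2 (for t >= 1)."""
    return lr0 / (t ** 2)

for t in range(1, 5):
    print(t, lr_inverse(t), lr_inverse_squared(t))
```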

SGD for Logistic Regression
We can also apply SGD to solve the MCLE problem for Logistic Regression.
We need a per-example objective: let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = −log p_θ(y^(i) | x^(i)).
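Putting the pieces together, a minimal end-to-end sketch (not from the slides) of SGD for Logistic Regression, combining the per-example gradient with the SGD update; the toy data at the bottom is made up purely for illustration:

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_logreg_sgd(X, y, lam=0.1, num_epochs=50, seed=0):
    """SGD on J(theta) = sum_i -log p_theta(y_i | x_i).

    X: (N, K) inputs (include a constant 1 column if you want a bias term);
    y: (N,) labels in {0, 1}; lam: learning rate.
    """
    rng = np.random.default_rng(seed)
    N, K = X.shape
    theta = np.zeros(K)
    for _ in range(num_epochs):
        for i in rng.permutation(N):
            grad_i = (logistic(theta @ X[i]) - y[i]) * X[i]   # per-example gradient
            theta -= lam * grad_i                             # step opposite the gradient
    return theta

# Tiny toy example: 1D data with a bias column, separable around x = 0
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
theta = train_logreg_sgd(X, y)
print(theta, logistic(X @ theta))   # probabilities: low for the first two, high for the last two
```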

Summary
1. Discriminative classifiers directly model the conditional, p(y|x).
2. Logistic regression is a simple linear classifier that retains a probabilistic semantics.
3. Parameters in LR are learned by iterative optimization (e.g. SGD).

Probabilistic Interpretation of Linear Regression
Whiteboard:
- Conditional Likelihood
- Case #1: 1D Linear Regression
- Case #2: Multiple Linear Regression
- Equivalence: Predictions
- Equivalence: Learning