Learning From Data Lecture 9 Logistic Regression and Gradient Descent

Learning From Data Lecture 9: Logistic Regression and Gradient Descent

Topics: Logistic Regression; Gradient Descent

M. Magdon-Ismail, CSCI 4100/6100

Recap: Linear Classification and Regression

The linear signal: s = w^t x.

Good features are important: before looking at the data, we can reason that symmetry and intensity should be good features based on our knowledge of the problem.

Linear classification: the pocket algorithm can tolerate errors; simple and efficient.

Linear regression: single-step learning, w = X^† y = (X^t X)^{-1} X^t y; a very efficient O(N d^2) exact algorithm.
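
As a concrete illustration of the single-step regression solution above, here is a minimal numpy sketch; the data X, y and the helper name fit_linear_regression are illustrative choices, not part of the lecture.

import numpy as np

def fit_linear_regression(X, y):
    # w = X^dagger y = (X^t X)^{-1} X^t y, computed via the pseudoinverse for stability
    return np.linalg.pinv(X) @ y

# toy usage: N = 100 points, a bias column plus d = 2 features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=100)
w = fit_linear_regression(X, y)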

Predicting a Probability

Will someone have a heart attack over the next year?

age: 62 years
gender: male
blood sugar: 120 mg/dl
HDL: 50
LDL: 120
mass: 190 lbs
height: 5'10"
...

Classification: yes/no. Logistic regression: the likelihood of a heart attack, y ∈ [0, 1], with

h(x) = θ( Σ_{i=0}^{d} w_i x_i ) = θ(w^t x).

Predicting a Probability (continued)

What is θ? The logistic function:

θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s}).

(Figure: plot of θ(s), which rises smoothly from 0 to 1 as s increases.)

It satisfies

θ(-s) = e^{-s} / (1 + e^{-s}) = 1 / (1 + e^s) = 1 - θ(s).
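
A minimal numpy sketch of θ; the function name sigmoid and the clipping guard are my choices for illustration.

import numpy as np

def sigmoid(s):
    # theta(s) = 1 / (1 + e^{-s}); clipping guards against overflow in exp
    s = np.clip(s, -500, 500)
    return 1.0 / (1.0 + np.exp(-s))

# check the identity theta(-s) = 1 - theta(s)
s = np.linspace(-5, 5, 11)
assert np.allclose(sigmoid(-s), 1.0 - sigmoid(s))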

The Data Is Still Binary, ±1

D = (x_1, y_1), ..., (x_N, y_N), with each y_n = ±1.

x_n: a person's health information; y_n = ±1: did they have a heart attack or not.

We cannot measure a probability. We can only observe the occurrence of an event and try to infer a probability.

The Target Function Is Inherently Noisy

f(x) = P[y = +1 | x].

The data is generated from a noisy target:

P(y | x) = f(x) for y = +1; 1 - f(x) for y = -1.

What Makes an h Good?

Fitting the data means finding a good h. h is good if:

h(x_n) ≈ 1 whenever y_n = +1;
h(x_n) ≈ 0 whenever y_n = -1.

A simple error measure that captures this:

E_in(h) = (1/N) Σ_{n=1}^{N} ( h(x_n) - (1 + y_n)/2 )^2.

Not very convenient (hard to minimize).
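
A sketch of this squared-error measure, assuming h(x) = θ(w^t x), a data matrix X of shape (N, d+1) and ±1 labels y; all names here are illustrative.

import numpy as np

def squared_error(w, X, y):
    # E_in(h) = (1/N) sum_n ( h(x_n) - (1 + y_n)/2 )^2, with h(x) = theta(w^t x)
    h = 1.0 / (1.0 + np.exp(-(X @ w)))
    target = (1.0 + y) / 2.0   # maps y = +1 to 1 and y = -1 to 0
    return np.mean((h - target) ** 2)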

The Cross-Entropy Error Measure

E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^t x_n}).

It looks complicated and ugly (ln, e^(·), ...), but:

it is based on an intuitive probabilistic interpretation of h;
it is very convenient and mathematically friendly (easy to minimize).

Verify: y_n = +1 encourages w^t x_n ≫ 0, so θ(w^t x_n) ≈ 1;
y_n = -1 encourages w^t x_n ≪ 0, so θ(w^t x_n) ≈ 0.
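
A minimal numpy sketch of this error; the array names X, y and the helper cross_entropy_error are assumptions for illustration.

import numpy as np

def cross_entropy_error(w, X, y):
    # E_in(w) = (1/N) sum_n ln(1 + e^{-y_n w^t x_n})
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))  # log(1 + e^{-margin}), numerically stable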

The Probabilistic Interpretation

Suppose that h(x) = θ(w^t x) closely captures P[+1 | x]:

P(y | x) = θ(w^t x) for y = +1; 1 - θ(w^t x) for y = -1.

The Probabilistic Interpretation (continued)

Using 1 - θ(s) = θ(-s):

P(y | x) = θ(w^t x) for y = +1; θ(-w^t x) for y = -1.

The Probabilistic Interpretation (continued)

...or, more compactly,

P(y | x) = θ(y w^t x).

The Likelihood

With P(y | x) = θ(y w^t x), and recalling that (x_1, y_1), ..., (x_N, y_N) are independently generated, the likelihood is the probability of getting the y_1, ..., y_N in D from the corresponding x_1, ..., x_N:

P(y_1, ..., y_N | x_1, ..., x_N) = Π_{n=1}^{N} P(y_n | x_n).

The likelihood measures the probability that the data would have been generated if f were h.

Maximizing the Likelihood (why?)

max_w Π_{n=1}^{N} P(y_n | x_n)
≡ max_w ln( Π_{n=1}^{N} P(y_n | x_n) )
≡ max_w Σ_{n=1}^{N} ln P(y_n | x_n)
≡ min_w -(1/N) Σ_{n=1}^{N} ln P(y_n | x_n)
≡ min_w (1/N) Σ_{n=1}^{N} ln( 1 / P(y_n | x_n) )
≡ min_w (1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n w^t x_n) )   (here we specialize to our model)
≡ min_w (1/N) Σ_{n=1}^{N} ln( 1 + e^{-y_n w^t x_n} ),

since θ(s) = 1 / (1 + e^{-s}) gives 1/θ(s) = 1 + e^{-s}. So maximizing the likelihood is minimizing

E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^t x_n}).
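
A quick numerical check of the last step above, that ln(1/θ(s)) = ln(1 + e^{-s}); purely illustrative.

import numpy as np

s = np.linspace(-5, 5, 101)
theta = 1.0 / (1.0 + np.exp(-s))
# ln(1/theta(s)) should equal ln(1 + e^{-s}) for every s
assert np.allclose(np.log(1.0 / theta), np.log1p(np.exp(-s)))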

How to Minimize E_in(w)

Classification (PLA/Pocket): iterative.
Regression (pseudoinverse): analytic, from solving ∇_w E_in(w) = 0.
Logistic regression: an analytic solution won't work; numerically/iteratively set ∇_w E_in(w) ≈ 0.

Finding the Best Weights: Hill Descent

A ball on a complicated hilly terrain rolls down to a local valley; this is called a local minimum.

Questions: How do we get to the bottom of the deepest valley? How do we do this when we don't have gravity?

Our E_in Has Only One Valley

(Figure: in-sample error E_in plotted against the weights w has a single valley.)

...because E_in(w) is a convex function of w. (So who cares if it looks ugly!)

How to Roll Down?

Assume you are at weights w(t) and you take a step of size η in the direction v̂:

w(t+1) = w(t) + η v̂.

We get to pick v̂. What is the best direction in which to take the step? Pick v̂ to make E_in(w(t+1)) as small as possible.

The Gradient Is the Fastest Way to Roll Down

Approximating the change in E_in (Taylor's approximation):

ΔE_in = E_in(w(t+1)) - E_in(w(t))
      = E_in(w(t) + η v̂) - E_in(w(t))
      = η ∇E_in(w(t))^t v̂ + O(η^2)
      ≥ -η ‖∇E_in(w(t))‖,

with equality attained at v̂ = -∇E_in(w(t)) / ‖∇E_in(w(t))‖.

The best (steepest) direction to move is the negative gradient:

v̂ = -∇E_in(w(t)) / ‖∇E_in(w(t))‖.
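
For the sketches below we need ∇E_in(w) for the cross-entropy error. Differentiating E_in term by term gives -(1/N) Σ_n y_n x_n / (1 + e^{y_n w^t x_n}); this formula is not written on the slide, but it is consistent with the single-point SGD update shown later. A numpy sketch, together with the unit steepest-descent direction v̂; all names are illustrative.

import numpy as np

def cross_entropy_gradient(w, X, y):
    # grad E_in(w) = -(1/N) sum_n y_n x_n / (1 + e^{y_n w^t x_n})
    margins = y * (X @ w)
    coeffs = -y / (1.0 + np.exp(margins))          # shape (N,)
    return (X * coeffs[:, None]).mean(axis=0)      # shape (d+1,)

def steepest_direction(w, X, y):
    # v_hat = -grad / ||grad||, the best unit-length step direction
    g = cross_entropy_gradient(w, X, y)
    return -g / np.linalg.norm(g)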

Rolling Down: Iterating the Negative Gradient

w(0) → w(1) → w(2) → w(3) → ..., each step taken along the negative gradient.

(Figure: the descent trajectory on the error surface; η = 0.5, 15 steps.)

The Goldilocks Step Size

(Figure: three plots of in-sample error E_in versus the weights w, tracing the descent path.)

η too small: η = 0.1 needs 75 steps.
η too large: η = 2, 10 steps.
A variable η_t is just right: 10 steps.

Fixed Learning Rate Gradient Descent

Choose the step size proportional to the size of the gradient, η_t = η ‖∇E_in(w(t))‖, which goes to 0 as we get closer to the minimum. The step then becomes

η_t v̂ = -η ‖∇E_in(w(t))‖ · ∇E_in(w(t)) / ‖∇E_in(w(t))‖ = -η ∇E_in(w(t)),

i.e., a fixed learning rate η times the (un-normalized) negative gradient.

1. Initialize at step t = 0 to w(0).
2. for t = 0, 1, 2, ... do
3.   Compute the gradient g_t = ∇E_in(w(t)).
4.   Move in the direction v_t = -g_t.
5.   Update the weights: w(t+1) = w(t) + η v_t.
6.   Iterate until it is time to stop.
7. end for
8. Return the final weights.

(Ex. 3.7 in LFD.) Gradient descent can minimize any smooth function, for example the logistic regression error

E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^t x_n}).
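
A minimal, self-contained sketch of this fixed-learning-rate loop for logistic regression; the toy data, the iteration count, and η = 0.1 are my choices for illustration, not the lecture's.

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-np.clip(s, -500, 500)))

def gradient_descent(X, y, eta=0.1, max_iter=1000):
    # minimize E_in(w) = (1/N) sum_n ln(1 + e^{-y_n w^t x_n})
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        # grad E_in(w) = -(1/N) sum_n y_n x_n * theta(-y_n w^t x_n)
        g = -np.mean((y * sigmoid(-y * (X @ w)))[:, None] * X, axis=0)
        w = w - eta * g            # w(t+1) = w(t) + eta * v_t with v_t = -g_t
    return w

# toy usage on roughly linearly separable data
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
y = np.sign(X @ np.array([-0.2, 1.5, -1.0]) + 0.3 * rng.normal(size=200))
w = gradient_descent(X, y)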

Stochastic Gradient Descent (SGD)

A variation of GD that considers the error on only one data point at a time:

E_in(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{-y_n w^t x_n}) = (1/N) Σ_{n=1}^{N} e(w, x_n, y_n).

Pick a random data point (x_*, y_*) and run one iteration of GD on e(w, x_*, y_*):

w(t+1) ← w(t) - η ∇_w e(w, x_*, y_*).

1. The average move is the same as for GD.
2. Computation: a fraction 1/N cheaper per step.
3. Stochastic: helps escape local minima.
4. Simple.
5. Similar to PLA.

For logistic regression:

w(t+1) ← w(t) + y_* x_* · η / (1 + e^{y_* w(t)^t x_*}).

(Recall PLA: w(t+1) ← w(t) + y_* x_*.)
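
A sketch of the single-point SGD update for logistic regression; the number of steps, the learning rate, and the random-sampling scheme are illustrative assumptions.

import numpy as np

def sgd_logistic(X, y, eta=0.1, n_steps=5000, seed=0):
    # per-step update: w <- w + eta * y_n x_n / (1 + e^{y_n w^t x_n})
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        n = rng.integers(len(y))                   # pick a random data point
        margin = np.clip(y[n] * (X[n] @ w), -500, 500)
        w = w + eta * y[n] * X[n] / (1.0 + np.exp(margin))
    return w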

GD Versus SGD: A Picture

(Figure: descent trajectories for GD and SGD on N = 10 data points. GD: η = 6, 10 steps. SGD: η = 2, 30 steps.)