Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

COMP-652 and ECSE-608, September 18, 2017

Outline:
Linear models for classification
Logistic regression
Gradient descent and second-order methods
Kernelizing linear methods

Regression vs classification
Regression: map $x \in \mathbb{R}^n$ to $y \in \mathbb{R}$.
Classification: map $x \in \mathbb{R}^n$ to a label $y \in \mathcal{Y} = \{1, \dots, k\}$ ($k$ classes).
The input space is divided into decision regions.
Linear classification: the decision surfaces are linear (or affine) in $x$.
How to represent the output? If $k = 2$, $\mathcal{Y} = \{-1, +1\}$ can be convenient; one-hot encoding is another option.
We will mainly talk about binary classification, i.e. $k = 2$.

Linear models for classification
Hyperplane in $\mathbb{R}^n$: defined by the equation $w^\top x = 0$ for some $w \in \mathbb{R}^n$.
Least-squares for classification: $\mathcal{Y} = \{1, \dots, k\}$ and $h_W(x) = \arg\max_{i=1,\dots,k} (W^\top x)_i$, where $W = [\,w_1 \; \cdots \; w_k\,] \in \mathbb{R}^{n \times k}$.
Learning: encode each output sample $y_i$ as a vector in $\mathbb{R}^k$ using one-hot encoding and minimize the squared error (closed-form solution).
Very sensitive to outliers.
Recall that minimizing the squared error corresponds to maximizing the likelihood under the assumption of a Gaussian conditional distribution.
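To make the least-squares approach concrete, here is a minimal NumPy sketch (not from the slides; the function names and the tiny ridge term are my own choices): it one-hot encodes the labels and solves the normal equations in closed form.

```python
import numpy as np

def least_squares_classifier(X, y, k, reg=1e-8):
    """Fit W by least squares on one-hot targets; X is (m, n), y takes values in {0, ..., k-1}."""
    m, n = X.shape
    T = np.zeros((m, k))
    T[np.arange(m), y] = 1.0                                   # one-hot encoding of the labels
    # Closed-form solution of the (slightly regularized) least-squares problem
    W = np.linalg.solve(X.T @ X + reg * np.eye(n), X.T @ T)    # shape (n, k)
    return W

def predict(W, X):
    return np.argmax(X @ W, axis=1)                            # h_W(x) = argmax_i (W^T x)_i
```

A handful of outliers far from the decision boundary can noticeably tilt $W$, which is the sensitivity the slide refers to.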

Linear models for classification
Hyperplane in $\mathbb{R}^n$: defined by the equation $w^\top x = 0$ for some $w \in \mathbb{R}^n$.
Binary linear discriminant function: $\mathcal{Y} = \{-1, +1\}$ and $h_w(x) = \mathrm{sign}(w^\top x)$.
Learning: the perceptron algorithm (Rosenblatt, 1957), an iterative algorithm closely related to stochastic gradient descent (more on that later); it converges when the data is linearly separable.
Converges... ok, but to which hyperplane? What is the best one? Support vector machines (next lecture).
More generally, generalized linear models: $h_w(x) = \psi(w^\top x)$ for some activation function $\psi$.
If $\psi$ takes its values in $(0, 1)$, we could interpret $h_w(x)$ as the conditional probability $P(y = 1 \mid x)$!
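The perceptron update rule is not written out on the slide; a minimal sketch of the classical rule, assuming labels in $\{-1, +1\}$ (my own naming), is:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Rosenblatt's perceptron: y must be in {-1, +1}; X is (m, n)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:       # misclassified (or exactly on the boundary)
                w = w + yi * xi          # move the hyperplane toward the example
                mistakes += 1
        if mistakes == 0:                # converged: all training points separated
            break
    return w
```

When the data are linearly separable the loop stops after finitely many mistakes, but which separating hyperplane it stops at depends on the order of the examples; this is exactly the question that motivates support vector machines.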

Logistic sigmoid (recall):
$\sigma(z) = \dfrac{1}{1 + \exp(-z)}$
Symmetry: $\sigma(-z) = 1 - \sigma(z)$
Inverse: $z = \ln\left(\dfrac{\sigma(z)}{1 - \sigma(z)}\right)$
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
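A quick numerical sanity check of these three identities (an illustrative snippet, not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))                  # symmetry
assert np.allclose(np.log(sigmoid(z) / (1 - sigmoid(z))), z)     # inverse (logit)
h = 1e-6                                                         # finite-difference check of the derivative
assert np.allclose((sigmoid(z + h) - sigmoid(z - h)) / (2 * h),
                   sigmoid(z) * (1 - sigmoid(z)), atol=1e-6)
```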

Logistic regression
Suppose we represent the hypothesis itself as a logistic function of a linear combination of the inputs:
$h(x) = \sigma(w^\top x) = \dfrac{1}{1 + \exp(-w^\top x)}$
This is also known as a sigmoid neuron.
Suppose we interpret $h(x)$ as $P(y = 1 \mid x)$. Then the log-odds ratio
$\ln\left(\dfrac{P(y = 1 \mid x)}{P(y = 0 \mid x)}\right) = w^\top x$
is linear in $x$ (nice!).
The optimum weights will maximize the conditional likelihood of the outputs, given the inputs.

The cross-entropy error function
Suppose we interpret the output of the hypothesis, $h(x_i)$, as the probability that $y_i = 1$.
Setting $\mathcal{Y} = \{0, 1\}$ for convenience, the log-likelihood of a hypothesis $h$ is:
$\log L(h) = \sum_{i=1}^m \log P(y_i \mid x_i, h) = \sum_{i=1}^m \begin{cases} \log h(x_i) & \text{if } y_i = 1 \\ \log(1 - h(x_i)) & \text{if } y_i = 0 \end{cases} = \sum_{i=1}^m y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i))$
The cross-entropy error function is the opposite (negated) quantity:
$J_D(w) = -\sum_{i=1}^m \left( y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i)) \right)$
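Transcribing $J_D$ directly into NumPy (an illustrative sketch; the small `eps` clamp that avoids $\log 0$ is my addition):

```python
import numpy as np

def cross_entropy(w, X, y, eps=1e-12):
    """J_D(w) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ], with y in {0, 1}."""
    h = 1.0 / (1.0 + np.exp(-X @ w))
    h = np.clip(h, eps, 1 - eps)      # avoid log(0) at saturated predictions
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```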

Cross-entropy error surface for the logistic function
$J_D(w) = -\sum_{i=1}^m \left( y_i \log \sigma(w^\top x_i) + (1 - y_i) \log(1 - \sigma(w^\top x_i)) \right)$
Nice error surface with a unique minimum, but it cannot be solved in closed form.

Gradient descent
The gradient of $J$ at a point $w$ can be thought of as a vector indicating which way is uphill.
[Figure: 3D surface plot of the sum-of-squares error (SSQ) over the weights $w_0$ and $w_1$.]
If this is an error function, we want to move downhill on it, i.e., in the direction opposite to the gradient.

Example gradient descent traces
If the error function is convex, gradient descent converges to the global minimum (under some conditions on the learning rates).
For more general hypothesis classes, there may be local optima; in this case, the final solution may depend on the initial parameters.

Gradient descent algorithm
The basic algorithm assumes that $\nabla J$ is easily computed.
We want to produce a sequence of vectors $w_1, w_2, w_3, \dots$ with the goal that:
$J(w_1) > J(w_2) > J(w_3) > \dots$
$\lim_{i \to \infty} w_i = w$, and $w$ is locally optimal.
The algorithm: given $w_0$, do for $i = 0, 1, 2, \dots$
$w_{i+1} = w_i - \alpha_i \nabla J(w_i)$,
where $\alpha_i > 0$ is the step size or learning rate for iteration $i$.
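A generic implementation of this loop might look as follows (a sketch with an arbitrary constant step size and stopping rule, neither of which is prescribed by the slides):

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.1, max_iters=1000, tol=1e-8):
    """Iterate w_{i+1} = w_i - alpha * grad_J(w_i) until the gradient is small."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_J(w)
        if np.linalg.norm(g) < tol:      # (approximately) stationary point
            break
        w = w - alpha * g
    return w
```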

Maximization procedure: Gradient ascent
First we compute the gradient of $\log L(w)$ with respect to $w$:
$\nabla \log L(w) = \sum_i \left[ \frac{y_i}{h_w(x_i)} h_w(x_i)(1 - h_w(x_i))\, x_i - \frac{1 - y_i}{1 - h_w(x_i)} h_w(x_i)(1 - h_w(x_i))\, x_i \right]$
$= \sum_i x_i \left( y_i - y_i h_w(x_i) - h_w(x_i) + y_i h_w(x_i) \right) = \sum_i (y_i - h_w(x_i))\, x_i = X^\top (y - \hat{y})$
where $\hat{y} \in \mathbb{R}^m$ is defined by $(\hat{y})_i = h_w(x_i)$.

The update rule (because we maximize) is:
$w \leftarrow w + \alpha \nabla \log L(w) = w + \alpha X^\top (y - \hat{y})$
where $\alpha \in (0, 1)$ is a step-size or learning-rate parameter.
This is called logistic regression.
If one uses features of the input, we simply have:
$w \leftarrow w + \alpha \Phi^\top (y - \hat{y})$
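Putting the gradient and the update rule together, a batch gradient-ascent trainer for logistic regression could be sketched as follows (illustrative; the step size and iteration count are arbitrary choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=5000):
    """Batch gradient ascent on the log-likelihood: w <- w + alpha * X^T (y - y_hat)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y_hat = sigmoid(X @ w)             # predicted P(y = 1 | x) for each row
        w = w + alpha * X.T @ (y - y_hat)  # gradient of the log-likelihood
        # in practice, scale the gradient by 1/m or shrink alpha to keep updates stable
    return w
```

In practice one usually scales the gradient by $1/m$ or decays $\alpha$ so that the updates do not overshoot.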

Roadmap: theoretical guarantees for gradient descent and second-order methods
If the cost function is convex, then gradient descent will converge to the optimal solution for an appropriate choice of the learning rates.
We will show that the cross-entropy error function is convex.
We will see how we can use a second-order method to choose optimal learning rates.

Convexity
A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if for all $a, b \in \mathbb{R}^d$ and $\lambda \in [0, 1]$:
$f(\lambda a + (1 - \lambda) b) \le \lambda f(a) + (1 - \lambda) f(b)$.
$f$ is concave if $-f$ is convex.
If $f$ and $g$ are convex functions, $\alpha f + \beta g$ is also convex for any non-negative real numbers $\alpha$ and $\beta$.

Characterizations of convexity
First-order characterization: $f$ is convex $\iff$ for all $a, b$: $f(a) \ge f(b) + \nabla f(b)^\top (a - b)$ (the function is globally above the tangent at $b$).
Second-order characterization: $f$ is convex $\iff$ the Hessian of $f$ is positive semi-definite.
The Hessian contains the second-order derivatives of $f$: $H_{i,j} = \dfrac{\partial^2 f}{\partial x_i \partial x_j}$.
$H$ is positive semi-definite $\iff$ $a^\top H a \ge 0$ for all $a \in \mathbb{R}^d$ $\iff$ $H$ has only non-negative eigenvalues.

Logistic regression: convexity of the cost function
$J(w) = -\sum_{i=1}^m \left( y_i \log \sigma(w^\top x_i) + (1 - y_i) \log(1 - \sigma(w^\top x_i)) \right)$
where $\sigma$ is the sigmoid function.
We show that $-\log \sigma(w^\top x)$ and $-\log(1 - \sigma(w^\top x))$ are convex in $w$:
$\nabla_w \left( -\log \sigma(w^\top x) \right) = -\dfrac{\nabla_w \, \sigma(w^\top x)}{\sigma(w^\top x)} = -\dfrac{\sigma'(w^\top x) \, \nabla_w (w^\top x)}{\sigma(w^\top x)} = (\sigma(w^\top x) - 1)\, x$
$\nabla^2_w \left( -\log \sigma(w^\top x) \right) = \nabla_w \left( (\sigma(w^\top x) - 1)\, x \right) = \sigma(w^\top x)(1 - \sigma(w^\top x)) \, x x^\top$
It is easy to check that this matrix is positive semi-definite for any $x$.

Convexity of the cost function
$J(w) = -\sum_{i=1}^m \left( y_i \log \sigma(w^\top x_i) + (1 - y_i) \log(1 - \sigma(w^\top x_i)) \right)$
where $\sigma(z) = 1/(1 + e^{-z})$ (check that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$).
Similarly, you can show that
$\nabla_w \left( -\log(1 - \sigma(w^\top x)) \right) = \sigma(w^\top x)\, x$
$\nabla^2_w \left( -\log(1 - \sigma(w^\top x)) \right) = \sigma(w^\top x)(1 - \sigma(w^\top x)) \, x x^\top$
Hence $J(w)$ is convex in $w$.
The gradient of $J$ is $X^\top (\hat{y} - y)$, where $\hat{y}_i = \sigma(w^\top x_i) = h(x_i)$.
The Hessian of $J$ is $X^\top R X$, where $R$ is diagonal with entries $R_{i,i} = h(x_i)(1 - h(x_i))$.
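To connect the formulas to code, the following sketch (mine, not from the slides) builds the gradient $X^\top(\hat{y} - y)$ and the Hessian $X^\top R X$ on random data and checks numerically that the Hessian is positive semi-definite:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_and_hessian(w, X, y):
    """Gradient X^T (y_hat - y) and Hessian X^T R X of the cross-entropy error."""
    y_hat = sigmoid(X @ w)
    grad = X.T @ (y_hat - y)
    R = np.diag(y_hat * (1 - y_hat))      # R_ii = h(x_i)(1 - h(x_i))
    H = X.T @ R @ X
    return grad, H

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
_, H = grad_and_hessian(rng.normal(size=3), X, y)
print(np.linalg.eigvalsh(H) >= -1e-10)    # all eigenvalues non-negative => PSD
```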

Another algorithm for optimization
Recall Newton's method for finding the zero of a function $g : \mathbb{R} \to \mathbb{R}$:
At point $w^t$, approximate the function by a straight line (its tangent).
Solve the linear equation for where the tangent equals 0, and move the parameter to this point:
$w^{t+1} = w^t - \dfrac{g(w^t)}{g'(w^t)}$

Application to machine learning
Suppose for simplicity that the error function $J$ has only one parameter.
We want to optimize $J$, so we can apply Newton's method to find the zeros of $J' = \dfrac{d}{dw} J$.
We obtain the iteration:
$w^{t+1} = w^t - \dfrac{J'(w^t)}{J''(w^t)}$
Note that there is no step size parameter!
This is a second-order method, because it requires computing the second derivative.
But if our error function is quadratic, this will find the global optimum in one step!
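A one-dimensional illustration (not from the slides): on the quadratic $J(w) = (w - 3)^2$, a single Newton step lands exactly on the minimum.

```python
def newton_1d(J_prime, J_double_prime, w, steps=1):
    """Newton's method applied to J': w <- w - J'(w) / J''(w)."""
    for _ in range(steps):
        w = w - J_prime(w) / J_double_prime(w)
    return w

# J(w) = (w - 3)^2  =>  J'(w) = 2(w - 3), J''(w) = 2
print(newton_1d(lambda w: 2 * (w - 3), lambda w: 2.0, w=10.0))   # 3.0 after one step
```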

Second-order methods: Multivariate setting
If we have an error function $J$ that depends on many variables, we can compute the Hessian matrix, which contains the second-order derivatives of $J$:
$H_{ij} = \dfrac{\partial^2 J}{\partial w_i \partial w_j}$
The inverse of the Hessian gives the optimal learning rates.
The weights are updated as:
$w \leftarrow w - H^{-1} \nabla_w J$
This is also called the Newton-Raphson method; for logistic regression it coincides with Fisher scoring.

Which method is better?
Newton's method usually requires significantly fewer iterations than gradient descent.
Computing the Hessian requires a batch of data, so there is no natural on-line algorithm.
Inverting the Hessian explicitly is expensive, but almost never necessary.
Computing the product of the Hessian with a vector can be done in linear time (Pearlmutter, 1993), which also helps to compute the product of the inverse Hessian with a vector (e.g., via iterative solvers such as conjugate gradient) without explicitly forming $H^{-1}$.

Newton-Raphson for logistic regression
Leads to a nice algorithm called iteratively reweighted least squares (IRLS).
The Hessian has the form:
$H = \Phi^\top R \Phi$
where $R$ is the diagonal matrix with entries $h(x_i)(1 - h(x_i))$ (you can check that this is the form of the second derivative).
The weight update becomes:
$w \leftarrow w - (\Phi^\top R \Phi)^{-1} \Phi^\top (\hat{y} - y)$
which can be rewritten as the solution of a weighted least squares problem:
$w \leftarrow (\Phi^\top R \Phi)^{-1} \Phi^\top R \left( \Phi w - R^{-1} (\hat{y} - y) \right)$
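A compact IRLS sketch following these formulas (illustrative; the small ridge term added to $\Phi^\top R \Phi$ guards against singular matrices and is not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(Phi, y, iters=20, ridge=1e-8):
    """Newton-Raphson / IRLS: w <- w - (Phi^T R Phi)^{-1} Phi^T (y_hat - y)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y_hat = sigmoid(Phi @ w)
        R = np.diag(y_hat * (1 - y_hat))                  # R_ii = h(x_i)(1 - h(x_i))
        H = Phi.T @ R @ Phi + ridge * np.eye(Phi.shape[1])
        w = w - np.linalg.solve(H, Phi.T @ (y_hat - y))   # one Newton step
    return w
```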

Regularization for logistic regression
One can regularize logistic regression just as in the case of linear regression.
Recall that the regularizer makes a statement about the weights only, so it does not affect the data-fit (error) term.
E.g., L2 regularization gives the optimization criterion:
$J(w) = J_D(w) + \dfrac{\lambda}{2} w^\top w$
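With this criterion the gradient simply acquires a $\lambda w$ term; a one-line extension of the earlier gradient computation (my own naming) is:

```python
import numpy as np

def grad_regularized(w, X, y, lam):
    """Gradient of J(w) = J_D(w) + (lambda/2) w^T w for logistic regression."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (y_hat - y) + lam * w
```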

Probabilistic view of logistic regression
Consider the additive noise model we discussed before:
$y_i = h_w(x_i) + \epsilon$, where the $\epsilon$ are drawn i.i.d. from some distribution.
At first glance, logistic regression does not fit this view very well.
We will instead think of a latent variable $\hat{y}_i$ such that:
$\hat{y}_i = h_w(x_i) + \epsilon$
Then the output is generated as: $y_i = 1$ iff $\hat{y}_i > 0$.

Generalized Linear Models
Logistic regression is a special case of a generalized linear model:
$E[Y \mid x] = g^{-1}(w^\top x)$
$g$ is called the link function; it relates the mean of the response to the linear predictor.
Linear regression: $E[Y \mid x] = E[w^\top x + \varepsilon \mid x] = w^\top x$ ($g$ is the identity).
Logistic regression: $E[Y \mid x] = P(Y = 1 \mid x) = \sigma(w^\top x)$
Poisson regression: $E[Y \mid x] = \exp(w^\top x)$ (for count data).
...

Linear regression with feature vectors revisited
Recall: we want to minimize the (regularized) error function:
$J(w) = \frac{1}{2} (\Phi w - y)^\top (\Phi w - y) + \frac{\lambda}{2} w^\top w$
Using the identity $(M^\top M + \alpha I)^{-1} M^\top = M^\top (M M^\top + \alpha I)^{-1}$, the solution can be rewritten as
$w = (\Phi^\top \Phi + \lambda I_n)^{-1} \Phi^\top y = \Phi^\top (\Phi \Phi^\top + \lambda I_m)^{-1} y = \Phi^\top a$
with $a \in \mathbb{R}^m$.
Inversion of an $m \times m$ matrix instead of $n \times n$!
The solution $w$ is a linear combination of the input points!

Linear regression with feature vectors revisited (cont'd)
Let $K = \Phi \Phi^\top \in \mathbb{R}^{m \times m}$ be the so-called Gram matrix.
$K_{i,j} = \phi(x_i)^\top \phi(x_j)$ measures the similarity between inputs $i$ and $j$.
No need to compute the feature map; we just need to know how to compute $\phi(u)^\top \phi(v)$ for any $u, v$.
Solution of regularized least squares: $w = \Phi^\top a$ with $a = (K + \lambda I)^{-1} y$.
The predictions for the input data are given by $\hat{y} = \Phi w = \Phi \Phi^\top a = K (K + \lambda I)^{-1} y$.
The prediction for a new input point $x$ is given by $y = \phi(x)^\top \Phi^\top a = k_x^\top a$, where $k_x \in \mathbb{R}^m$ is defined by $(k_x)_i = \phi(x)^\top \phi(x_i)$.
We never need to compute the feature map $\phi$ explicitly!
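These formulas translate almost line by line into a kernel ridge regression sketch (illustrative; the Gaussian kernel is just one possible choice of $k$, not something fixed by the slides):

```python
import numpy as np

def gaussian_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def fit_dual(X, y, kernel, lam=1.0):
    """a = (K + lambda I)^{-1} y, where K_ij = k(x_i, x_j)."""
    m = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * np.eye(m), y)

def predict_dual(X_train, a, x_new, kernel):
    """y = k_x^T a, with (k_x)_i = k(x_new, x_i)."""
    k_x = np.array([kernel(x_new, xi) for xi in X_train])
    return k_x @ a
```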

Another derivation: Linear regression with feature vectors revisited
Find the weight vector $w$ which minimizes the (regularized) error function:
$J(w) = \frac{1}{2} (\Phi w - y)^\top (\Phi w - y) + \frac{\lambda}{2} w^\top w$
Instead of the closed-form solution, take the gradient and rearrange the terms.
Setting the learning rate $\alpha = \frac{1}{\lambda}$, the solution takes the form:
$w = -\frac{1}{\lambda} \sum_{i=1}^m (w^\top \phi(x_i) - y_i)\, \phi(x_i) = \sum_{i=1}^m a_i \phi(x_i) = \Phi^\top a$
where $a$ is a vector of size $m$, with $a_i = -\frac{1}{\lambda} (w^\top \phi(x_i) - y_i)$.
Main idea: use $a$ instead of $w$ as the parameter vector.

Another derivation: Re-writing the error function
Instead of $J(w)$ we have $J(a)$:
$J(a) = \frac{1}{2} a^\top \Phi \Phi^\top \Phi \Phi^\top a - a^\top \Phi \Phi^\top y + \frac{1}{2} y^\top y + \frac{\lambda}{2} a^\top \Phi \Phi^\top a$
Denote $\Phi \Phi^\top = K$. Hence, we can re-write this as:
$J(a) = \frac{1}{2} a^\top K K a - a^\top K y + \frac{1}{2} y^\top y + \frac{\lambda}{2} a^\top K a$
This is quadratic in $a$, and we can set the gradient to 0 and solve.

Another derivation: Dual-view regression
By setting the gradient to 0 we get: $a = (K + \lambda I_m)^{-1} y$
Note that this is similar to re-formulating a weight vector in terms of a linear combination of instances.
Again, the feature mapping is not needed, either to learn or to make predictions!
This approach is useful if the feature space is very large.

Kernel functions
Whenever a learning algorithm can be written in terms of dot products, it can be generalized to kernels.
A kernel is any function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ which corresponds to a dot product for some feature mapping $\phi$:
$K(x_1, x_2) = \phi(x_1)^\top \phi(x_2)$ for some $\phi$.
Conversely, by choosing a feature map $\phi : \mathbb{R}^n \to \mathbb{R}^l$, we implicitly choose a kernel function ($l$ may even be infinite!).
Recall that $\phi(x_1)^\top \phi(x_2) = \|\phi(x_1)\| \, \|\phi(x_2)\| \cos \angle(\phi(x_1), \phi(x_2))$, where $\angle$ denotes the angle between the feature vectors, so a kernel function can be thought of as a notion of similarity.

Example: Quadratic kernel
Let $K(x, z) = (x^\top z)^2$. Is this a kernel?
$K(x, z) = \left( \sum_{i=1}^n x_i z_i \right) \left( \sum_{j=1}^n x_j z_j \right) = \sum_{i,j \in \{1, \dots, n\}} x_i z_i x_j z_j = \sum_{i,j \in \{1, \dots, n\}} (x_i x_j)(z_i z_j)$
Hence, it is a kernel, with feature mapping:
$\phi(x) = \left( x_1^2, \; x_1 x_2, \; \dots, \; x_1 x_n, \; x_2 x_1, \; x_2^2, \; \dots, \; x_n^2 \right)$
The feature vector includes all squares of elements and all cross terms.
Note that computing $\phi$ takes $O(n^2)$ but computing $K$ takes only $O(n)$!
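A quick numerical check of this identity (illustrative snippet): build $\phi$ explicitly and confirm that $\phi(x)^\top \phi(z) = (x^\top z)^2$.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel: all products x_i * x_j (O(n^2) features)."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(1)
x, z = rng.normal(size=4), rng.normal(size=4)
assert np.isclose(phi(x) @ phi(z), (x @ z) ** 2)   # kernel trick: O(n) instead of O(n^2)
```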