GRADIENT DESCENT AND LOCAL MINIMA

GRADIENT DESCENT AND LOCAL MINIMA

[Figure: two one-dimensional functions, each with the starting point of gradient descent marked in red. The left function has a single minimum; the right function has several.]

Suppose for both functions above, gradient descent is started at the point marked red. It will walk downhill as far as possible, then terminate. For the function on the left, the minimum it finds is global. For the function on the right, it is only a local minimum. Since the derivative at both minima is 0, gradient descent cannot detect whether they are global or local.

For smooth functions, gradient descent finds local minima. If the function is complicated, there may be no way to tell whether the solution is also a global minimum.
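As a rough illustration of this behaviour, the sketch below runs plain gradient descent on a made-up one-dimensional function with one global and one local minimum; the function, step size, and stopping rule are illustrative choices, not taken from the slides.

```python
# A minimal sketch of gradient descent in one dimension (illustrative, not from the slides).
# f has one global minimum (near x = -1.30) and one local minimum (near x = +1.14);
# which one gradient descent finds depends only on where it starts.

def f(x):
    return x**4 - 3 * x**2 + x

def f_prime(x):
    return 4 * x**3 - 6 * x + 1      # analytic derivative of f

def gradient_descent(x0, step=0.01, tol=1e-8, max_iter=100_000):
    x = x0
    for _ in range(max_iter):
        g = f_prime(x)
        if abs(g) < tol:             # derivative is (numerically) 0: a local minimum
            break
        x = x - step * g             # move a small step downhill
    return x

print(gradient_descent(-2.0))   # about -1.30, the global minimum
print(gradient_descent(+2.0))   # about +1.14, only a local minimum
```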

OUTLOOK

Summary so far
The derivative/gradient provides local information about how a function changes around a point x.
Optimization algorithms: If we know the gradient at our current location x, we can use this information to make a step in the downhill direction and move closer to a (local) minimum.

What we do not know yet
That assumes that we can compute the gradient. There are two possibilities:
For some functions, we are able to derive the gradient $\nabla f(x)$ as a function of x. Gradient descent can then evaluate the gradient by evaluating that function.
Otherwise, we have to estimate $\nabla f(x)$ by evaluating the function f at points close to x.
For now, we will assume that we can compute the gradient as a function.
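For the second possibility, estimating the gradient from nearby function values, a common choice is a central-difference approximation. The sketch below only illustrates the idea; the test function and the step size h are assumptions of mine.

```python
# A minimal sketch of estimating a derivative numerically via central differences.

def numerical_derivative(f, x, h=1e-5):
    # Evaluate f at two points close to x and take the slope of the secant line.
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x**4 - 3 * x**2 + x
print(numerical_derivative(f, 2.0))   # close to the analytic value f'(2) = 21
```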

RECAP: OPTIMIZATION AND LEARNING

Given
Training data $(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_n, \tilde{y}_n)$.

The data analyst chooses
A class $\mathcal{H}$ of possible solutions (e.g. $\mathcal{H}$ = all linear classifiers).
A loss function L (e.g. 0-1 loss) that measures how well the solution represents the data.

Learning from training data
Define the empirical risk of a solution $f \in \mathcal{H}$ on the training data,
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(\tilde{x}_i), \tilde{y}_i).$$
Find an optimal solution $f^*$ by minimizing the empirical risk using gradient descent (or another optimization algorithm).
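Concretely, the empirical risk is just an average of per-example losses. A minimal sketch, assuming 0-1 loss and labels in {-1, +1} (all names are illustrative):

```python
# A minimal sketch of the empirical risk R_n(f) under 0-1 loss (names are illustrative).

def zero_one_loss(prediction, label):
    return 0.0 if prediction == label else 1.0

def empirical_risk(f, data, loss=zero_one_loss):
    # data is a list of (x_i, y_i) pairs; f maps a point x_i to a predicted label.
    return sum(loss(f(x), y) for x, y in data) / len(data)

# Example: a trivial classifier that always predicts +1.
data = [([0.5, 1.0], +1), ([2.0, -1.0], -1), ([1.5, 0.3], +1)]
always_plus = lambda x: +1
print(empirical_risk(always_plus, data))   # 1/3: one of the three points is misclassified
```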

THE PERCEPTRON ALGORITHM

TASK: TRAIN A LINEAR CLASSIFIER

We want to determine a linear classifier in $\mathbb{R}^d$ using 0-1 loss.

Recall
A solution is determined by a normal vector $v_H \in \mathbb{R}^d$ and an offset $c > 0$.
For training data $(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_n, \tilde{y}_n)$, the empirical risk is
$$\hat{R}_n(v_H, c) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{\operatorname{sgn}(\langle v_H, \tilde{x}_i \rangle - c) \neq \tilde{y}_i\}.$$
The empirical risk minimization approach would choose a classifier given by $(v_H^*, c^*)$ for which
$$(v_H^*, c^*) = \arg\min_{v_H, c} \hat{R}_n(v_H, c).$$

Idea
Can we use gradient descent to find the minimizer of $\hat{R}_n$?
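For a linear classifier, this 0-1 empirical risk takes only a few lines of code; the sketch also hints at the problem discussed on the next slide, namely that a tiny change of $(v_H, c)$ usually leaves the risk unchanged. The toy data and names are my own.

```python
# A minimal sketch of the 0-1 empirical risk of a linear classifier (v_H, c).
import numpy as np

def risk_01(v_H, c, X, y):
    # Compare sgn(<v_H, x_i> - c) with the labels y_i in {-1, +1}.
    predictions = np.sign(X @ v_H - c)
    return np.mean(predictions != y)

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])
v_H, c = np.array([1.0, 1.0]), 0.0

print(risk_01(v_H, c, X, y))            # 0.0 on this toy data
print(risk_01(v_H + 1e-6, c, X, y))     # a tiny perturbation leaves the risk unchanged
```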

PROBLEM

Example
In the example on the right, the two dashed classifiers both get a single (blue) training point wrong. The two have different values of $(v_H, c)$, but for both of these values the empirical risk is identical.
Suppose we shift one of the two dashed lines to obtain the dotted line. On the way, the line moves over a single red point. The moment it passes that point, the empirical risk jumps from $1/n$ to $2/n$.

Conclusion
Consider the empirical risk function $\hat{R}_n(v_H, c)$:
If $(v_H, c)$ defines an affine plane that contains one of the training points, $\hat{R}_n$ is discontinuous at $(v_H, c)$ (it "jumps"). That means it is not differentiable there.
At all other points $(v_H, c)$, the function is locally constant. It is differentiable, but the gradient has length 0.

The empirical risk of a linear classifier under 0-1 loss is piece-wise constant.

[Figure: surface plot of the 0-1 empirical risk $J(a)$ over the classifier parameters $(a_1, a_2)$ for three training points $y_1, y_2, y_3$, with the flat "solution region" marked; the surface consists of constant plateaus separated by jumps.]

CONSEQUENCES FOR OPTIMIZATION

Formal problem
Even if we can avoid the points where $\hat{R}_n$ jumps, the gradient is always 0. Gradient descent never moves anywhere.

Intuition
Remember that we can only evaluate local information about $\hat{R}_n$ around a given point $(v_H, c)$. In every direction around $(v_H, c)$, the function looks identical, so the algorithm cannot tell what a good direction to move in would be. Note that this is also the case for every other optimization algorithm, since optimization algorithms depend on local information.

Solution idea
Find an approximation to $\hat{R}_n$ that is not piece-wise constant and that decreases in the direction of an optimal solution. We try to keep the approximation as simple as possible.

PERCEPTRON COST FUNCTION

We replace the empirical risk
$$\hat{R}_n(v_H, c) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{\operatorname{sgn}(\langle v_H, \tilde{x}_i \rangle - c) \neq \tilde{y}_i\}$$
by the piece-wise linear function
$$\hat{S}_n(v_H, c) = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\mathbb{I}\{\operatorname{sgn}(\langle v_H, \tilde{x}_i \rangle - c) \neq \tilde{y}_i\}}_{\text{switches off correctly classified points}} \; \underbrace{\left| \langle v_H, \tilde{x}_i \rangle - c \right|}_{\text{measures distance to plane}}$$
$\hat{S}_n$ is called the perceptron cost function.

[Figure: cost surfaces over the parameters $(a_1, a_2)$ for three training points, comparing the piece-wise constant 0-1 risk $J(a)$ with the piece-wise linear perceptron criterion $J_p(a)$; the solution region is marked in each plot (figure taken from a textbook chapter on linear discriminant functions).]
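Translated directly into code, the perceptron cost function might look as follows; the NumPy-based formulation and the names are my own choices.

```python
# A minimal sketch of the perceptron cost function S_n(v_H, c) defined above.
# Assumes labels y_i in {-1, +1}.
import numpy as np

def perceptron_cost(v_H, c, X, y):
    margins = X @ v_H - c                             # <v_H, x_i> - c for every point
    misclassified = np.sign(margins) != y             # the indicator switches off correct points
    return np.mean(misclassified * np.abs(margins))   # distance to the plane, averaged
```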

THE PERCEPTRON

Training
Given: Training data $(\tilde{x}_1, \tilde{y}_1), \ldots, (\tilde{x}_n, \tilde{y}_n)$.
Training: Fix a precision $\varepsilon$ and a learning rate $\alpha$, and run gradient descent on the perceptron cost function to find an affine plane $(v_H^*, c^*)$.
Define a classifier as $f(x) := \operatorname{sgn}(\langle v_H^*, x \rangle - c^*)$. (If the gradient algorithm returns $c^* < 0$, flip signs: use $(-v_H^*, -c^*)$.)

Prediction
For a given data point $x \in \mathbb{R}^d$, predict the class label $y := f(x)$.

This classifier is called the perceptron. It was first proposed by Rosenblatt in 1962.
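The classifier defined in the training step, including the sign-flip convention for a negative offset, could be written as the following small sketch (names are illustrative):

```python
# A minimal sketch of the perceptron classifier f(x) = sgn(<v_H, x> - c).
import numpy as np

def make_classifier(v_H, c):
    if c < 0:                      # flip signs so that the offset is non-negative
        v_H, c = -v_H, -c
    return lambda x: int(np.sign(np.dot(v_H, x) - c))
```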

THE PERCEPTRON GRADIENT

One can show that the gradient of the cost function is
$$\nabla \hat{S}_n(v_H, c) = -\sum_{i=1}^{n} \mathbb{I}\{f(\tilde{x}_i) \neq \tilde{y}_i\} \, \tilde{y}_i \begin{pmatrix} \tilde{x}_i \\ -1 \end{pmatrix}$$
(the constant factor $\frac{1}{n}$ is dropped, since it can be absorbed into the learning rate). This is an example of gradient descent where we do not have to approximate the derivative numerically. Gradient descent steps then look like this:
$$\begin{pmatrix} v_H^{(k+1)} \\ c^{(k+1)} \end{pmatrix} := \begin{pmatrix} v_H^{(k)} \\ c^{(k)} \end{pmatrix} + \sum_{\tilde{x}_i \text{ misclassified}} \tilde{y}_i \begin{pmatrix} \tilde{x}_i \\ -1 \end{pmatrix}$$
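Putting the update rule into a loop gives a complete, if minimal, training procedure. The sketch below stops as soon as no point is misclassified; the learning rate, iteration cap, and toy data are assumptions of mine, not part of the slides.

```python
# A minimal sketch of perceptron training with the gradient step shown above.
# Labels must be in {-1, +1}; alpha and max_iter are illustrative defaults.
import numpy as np

def train_perceptron(X, y, alpha=1.0, max_iter=1000):
    d = X.shape[1]
    v_H, c = np.zeros(d), 0.0
    for _ in range(max_iter):
        wrong = np.sign(X @ v_H - c) != y        # indicator of misclassified points
        if not wrong.any():                      # perceptron cost is 0: stop
            break
        # Gradient step: add the sum over misclassified points of y_i * (x_i, -1).
        v_H += alpha * (y[wrong] @ X[wrong])
        c += alpha * np.sum(-y[wrong])
    return v_H, c

# Toy usage on linearly separable data (illustrative values):
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
v_H, c = train_perceptron(X, y)
print(np.sign(X @ v_H - c))   # reproduces the labels y
```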

ILLUSTRATION

$$\begin{pmatrix} v_H^{(k+1)} \\ c^{(k+1)} \end{pmatrix} := \begin{pmatrix} v_H^{(k)} \\ c^{(k)} \end{pmatrix} + \sum_{\tilde{x}_i \text{ misclassified}} \tilde{y}_i \begin{pmatrix} \tilde{x}_i \\ -1 \end{pmatrix}$$

Effect for a single training point
Step $k$: $x$ (in class $-1$) is classified incorrectly by the hyperplane $H_k$ with normal $v_H^k$.
Step $k+1$: since $\tilde{y} = -1$, the update subtracts the point from the normal vector, $v_H^{k+1} = v_H^k - x$, so the hyperplane rotates towards a position that classifies $x$ correctly.

[Figure: the hyperplane and its normal vector before and after the update.]

Simplifying assumption: $H$ contains the origin, so $c = 0$.

DOES THE PERCEPTRON WORK?

Theorem: Perceptron convergence
If the training data is linearly separable, the perceptron learning algorithm (with fixed step size) terminates after a finite number of steps with a valid solution $(v_H, c)$, i.e. a solution which classifies all training data points correctly.

Issues
The perceptron selects some hyperplane between the two classes. The choice depends on initialization, step size, etc. The solution on the right will probably predict better than the one on the left, but the perceptron may return either.
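To illustrate the last point, the sketch below runs a slightly modified version of the earlier hypothetical train_perceptron (it additionally takes an initial normal vector) on separable toy data; both runs terminate with zero training error but return different separating hyperplanes. All values are illustrative.

```python
# A small sketch of how different initializations lead the perceptron to different,
# equally valid separating hyperplanes (labels in {-1, +1}; values are illustrative).
import numpy as np

def train_perceptron(X, y, v_init, alpha=1.0, max_iter=1000):
    v_H, c = v_init.astype(float), 0.0
    for _ in range(max_iter):
        wrong = np.sign(X @ v_H - c) != y
        if not wrong.any():
            break
        v_H += alpha * (y[wrong] @ X[wrong])
        c += alpha * np.sum(-y[wrong])
    return v_H, c

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
print(train_perceptron(X, y, np.array([0.0, 0.0])))   # one separating hyperplane
print(train_perceptron(X, y, np.array([5.0, -5.0])))  # a different one, also with zero training error
```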