Deep Learning for Computer Vision


Deep Learning for Computer Vision, Lecture 4: Curse of Dimensionality, High-Dimensional Feature Spaces, Linear Classifiers, Linear Regression, Python, and Jupyter Notebooks. Peter Belhumeur, Computer Science, Columbia University.

If we want to build a minimum-error-rate classifier, then we need a very good estimate of $P(\omega_i \mid x)$.

Let's say our feature space is just 1-dimensional and our feature $x \in [0, 1]$. And let's say we have 10,000 training samples from which to estimate our a posteriori probabilities. We could estimate these probabilities using a histogram in which we divide the interval into 100 evenly spaced bins.

On average each bin would have 100 samples. We could estimate $P(\omega_i \mid x)$ as the number of samples from class $i$ that fall in the same bin that $x$ falls into, divided by the total number of samples in that bin.
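A minimal sketch of this 1-D histogram estimator in Python (the function name, NumPy usage, and default bin count are my own illustration, not from the lecture):

```python
import numpy as np

def histogram_posterior(x_train, y_train, n_bins=100):
    """Estimate P(class | x) for 1-D features in [0, 1] using equal-width histogram bins."""
    bins = np.minimum((x_train * n_bins).astype(int), n_bins - 1)   # bin index of each sample
    classes = np.unique(y_train)
    counts = np.zeros((len(classes), n_bins))
    for k, c in enumerate(classes):
        counts[k] = np.bincount(bins[y_train == c], minlength=n_bins)
    totals = counts.sum(axis=0)               # total samples landing in each bin
    return counts / np.maximum(totals, 1)     # row k, column j: estimate of P(class_k | bin j)

# usage: P = histogram_posterior(x, y); P[:, min(int(x_new * 100), 99)] estimates P(class | x_new)
```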

But this plan does not scale as we increase the dimensionality of the feature space!

The Curse of Dimensionality!

Let's say our feature space is just 3-dimensional and our feature $x \in [0, 1]^3$. Let's say we still have 10,000 training samples from which to estimate our a posteriori probabilities. If we estimate these probabilities using a histogram in which we divide the volume into bins of the same width as before...

On average each bin would only have 0.01 samples! Ugh.

So we never have enough training samples to densely sample high-dimensional feature spaces!

And to make matters worse, intuition about how things behave in high-dimensional spaces is mostly incorrect.


What fraction of the volume of the unit hypercube is occupied by the biggest hypersphere that fits within it in $n$ dimensions? Say, for $n = 20$?

The volume of an $n$-dimensional hypersphere of radius $R$ is $V_n R^n$, where
$$V_n = \frac{\pi^{n/2}}{\Gamma\!\left(\frac{n}{2} + 1\right)}.$$
So if $n = 20$, the inscribed hypersphere has radius $\frac{1}{2}$ and its volume is
$$\frac{\pi^{10}}{\Gamma(10 + 1)} \cdot \frac{1}{2^{20}} = \frac{\pi^{10}}{10!\, 2^{20}} \approx 0.0000000246.$$

Another way to think of this is if we uniformly sampled 100 million points in the unit hypercube, then we should expect less than 3 of them would lie within its contained hypersphere.
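As a quick sanity check of these numbers, here is a small Python snippet (my own, not from the lecture) that evaluates the formula above in log space:

```python
import math

def inscribed_sphere_fraction(n):
    """Fraction of the unit hypercube occupied by its inscribed hypersphere (radius 1/2)."""
    # V_n * R^n with V_n = pi**(n/2) / Gamma(n/2 + 1) and R = 1/2, computed in log space
    log_vol = (n / 2) * math.log(math.pi) - math.lgamma(n / 2 + 1) + n * math.log(0.5)
    return math.exp(log_vol)

print(inscribed_sphere_fraction(20))        # ~2.46e-08
print(1e8 * inscribed_sphere_fraction(20))  # ~2.46 expected hits out of 100 million uniform points
```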

Back to building classifiers...

What if we gave up on finding the probabilities and instead found discriminant functions that define surfaces separating our classes?

So let's find a separating surface by seeing if we can produce a function that is always positive for the features of one class and negative for the features of the other.

Linear Regression

[Figure: sample points plotted in the $x$-$y$ plane.]

Find the line that minimizes the sum of the squared vertical distances of the line to the sample points. Let the line be parameterized by $y = \theta_0 + \theta_1 x = \theta^T \mathbf{x} = h_\theta(\mathbf{x})$, where $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$ and $\mathbf{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}$.

Now, given sample points $(x^{(i)}, y^{(i)})$, we can write our loss function to minimize as
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
and its gradient as
$$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}.$$

This is a convex optimization problem, so we can easily find the solution using gradient descent: $\theta(t+1) = \theta(t) - \alpha\, \nabla_\theta J(\theta(t))$, where $\alpha$ is the learning rate.
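A minimal sketch of this gradient descent for the line fit, assuming NumPy; the learning rate and iteration count are illustrative choices, not values from the lecture:

```python
import numpy as np

def fit_line_gd(x, y, alpha=0.1, n_iters=1000):
    """Fit y = theta_0 + theta_1 * x by batch gradient descent on J(theta)."""
    m = len(x)
    X = np.column_stack([np.ones(m), x])   # each row is [1, x^(i)]
    theta = np.zeros(2)
    for _ in range(n_iters):
        residual = X @ theta - y           # h_theta(x^(i)) - y^(i)
        grad = (X.T @ residual) / m        # gradient of J(theta)
        theta = theta - alpha * grad       # theta(t+1) = theta(t) - alpha * grad J
    return theta                           # [theta_0, theta_1]
```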

Course Programming Language: Python

Course Coding Environment: Jupyter Notebooks

Binary Classifier. Let's say we have a set of training samples $(x_i, y_i)$ with $i = 1, \ldots, N$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. The goal is to learn a classifier $f(x_i)$ such that $f(x_i) \ge 0$ if $y_i = +1$ and $f(x_i) < 0$ if $y_i = -1$.

[Figure: decision regions $f(x) \ge 0$ and $f(x) < 0$ separated by the surface $f(x) = 0$.]

But let's find a simple parametric separating surface by fitting it to the training data using some pre-chosen criterion of optimality. If we restrict the decision surface to a line, plane, or hyperplane, then we call this a linear classifier.

Linear Classifier. Let's say we have a set of training samples $(x_i, y_i)$ with $i = 1, \ldots, N$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. The goal is to learn a classifier $f(x_i)$ such that $f(x_i) \ge 0$ if $y_i = +1$ and $f(x_i) < 0$ if $y_i = -1$, where $f(x) = w^T x + b$.

Linear Classifier. Note: the classifier $f(x) = w^T x + b$ is a line, plane, or hyperplane depending on the dimension $d$; $w$ is normal to the line; $b$ is known as the bias.

[Figure: decision regions $w^T x + b \ge 0$ and $w^T x + b < 0$ separated by the hyperplane $w^T x + b = 0$, with normal vector $w$.]
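For concreteness, a tiny NumPy sketch (my own illustration) of evaluating this linear classifier on a batch of points:

```python
import numpy as np

def linear_classify(X, w, b):
    """Return +1 where f(x) = w^T x + b >= 0 and -1 otherwise, for each row x of X."""
    return np.where(X @ w + b >= 0, 1, -1)
```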

Linearly Separable Data. [Figure: two classes separated by the hyperplane $w^T x + b = 0$.]

NOT Linearly Separable Data. [Figure: two classes that cannot be perfectly separated by the hyperplane $w^T x + b = 0$.]

NOT Linearly Separable Data

Linear Classifier. How do we find the weights given by $w$ and $b$ in $f(x) = w^T x + b$?

Perceptron Classifier. Stack $b$ onto the end of the weight vector $w$ and stack a 1 onto the end of the feature vector $x$; then we can write $f(x) = w^T x$. The perceptron algorithm cycles through the data points until all points are correctly classified, updating the weights for each misclassified point $x_i$ as $w \leftarrow w + y_i x_i$.
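A minimal sketch of this perceptron loop in Python (the max_epochs guard and function name are my additions; without the guard, the loop only terminates for linearly separable data):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train a perceptron; a 1 is stacked onto each x so the bias b lives inside w."""
    Xh = np.column_stack([X, np.ones(len(X))])   # augmented features [x, 1]
    w = np.zeros(Xh.shape[1])                    # augmented weights [w, b]
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xh, y):
            if yi * (w @ xi) <= 0:               # misclassified (or on the boundary)
                w = w + yi * xi                  # nudge the hyperplane toward xi's correct side
                mistakes += 1
        if mistakes == 0:                        # all points correctly classified: stop
            break
    return w
```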

We can do better. We need to choose our weights according to some notion of optimality. So what is the best w?

Support Vector Machine. Choose weights $w$ and $b$ to maximize the margin between classes, with $f(x) = w^T x + b$.

Linearly Separable Data. [Figure: separating hyperplane $w^T x + b = 0$ with normal $w$, regions $w^T x + b \ge 0$ and $w^T x + b < 0$, and the margin between the two classes.]

We can choose the normalization of $w$ however we please, since this is a free parameter. Let's choose it so that for the positive and negative support vectors we have $w^T x_+ + b = +1$ and $w^T x_- + b = -1$. This makes the margin width $\frac{2}{\|w\|}$.

Linearly Separable Data. [Figure: hyperplanes $w^T x + b = +1$, $w^T x + b = 0$, and $w^T x + b = -1$, with normal $w$ and margin width $\frac{2}{\|w\|}$.]

We can now set this choice of $w$ up as an optimization:
$$\max_w \frac{2}{\|w\|} \quad \text{subject to} \quad w^T x_i + b \ge +1 \text{ if } y_i = +1, \qquad w^T x_i + b \le -1 \text{ if } y_i = -1.$$
Or equivalently,
$$\min_w \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1.$$
This is a quadratic optimization with linear constraints and a unique minimum.
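As one way to solve this quadratic program in practice, here is a hedged sketch using the cvxpy package (cvxpy is not part of the lecture; X is an N x d array, y is a vector of +/-1 labels, and the data are assumed separable):

```python
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve min ||w||^2 subject to y_i (w^T x_i + b) >= 1 for separable data."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]    # the linear margin constraints
    cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
    return w.value, b.value
```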

What if the data is not linearly separable or has some bad apples?

Soft Margin and Slack Variables. We soften the optimization with slack variables $\xi_i$:
$$\min_w \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0.$$

If we make our slack variables large enough, we can satisfy all of the constraints. $C$ is the regularization parameter: a large $C$ makes the constraints hard to ignore (small margin), while a small $C$ makes the constraints easy to ignore (large margin). There is still a unique minimum.

NOT Linearly Separable Data. [Figure: hyperplanes $w^T x + b = +1$, $0$, and $-1$ with normal $w$ and margin width $\frac{2}{\|w\|}$; a point inside the margin has $0 < \xi_i < 1$, while a misclassified point beyond the opposite margin has $\xi_i > 2$.]

Soft Margin and Slack Variables. And with a little manipulation we get
$$\min_w \|w\|^2 + C \sum_{i=1}^{N} \max\left(0,\, 1 - y_i f(x_i)\right),$$
i.e., regularization + loss function.
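A small NumPy sketch (my own) of evaluating this regularization-plus-hinge-loss objective:

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i f(x_i)), with f(x) = w^T x + b."""
    margins = y * (X @ w + b)                 # y_i f(x_i) for every training sample
    hinge = np.maximum(0.0, 1.0 - margins)    # per-sample hinge loss
    return w @ w + C * hinge.sum()
```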

If a point is on the correct side of its class's support vector(s) (outside the margin), there is no contribution to the loss, since $y_i f(x_i) \ge 1$. If the point is within the margin or beyond the opposite support vector(s), then there IS a contribution to the loss, since $y_i f(x_i) < 1$.

Hinge Loss. [Figure: hinge loss plotted against $y_i f(x_i)$; it is zero for $y_i f(x_i) \ge 1$ and grows linearly as $y_i f(x_i)$ decreases below 1.]


Optimization. The optimal $w$ for
$$\min_w \|w\|^2 + C \sum_{i=1}^{N} \max\left(0,\, 1 - y_i f(x_i)\right) \qquad \text{(regularization + loss function)}$$
can be found with a standard quadratic-program solver. Alternatively, you can use gradient descent!

Optimization. [Figure: a convex function vs. a non-convex function (the SVM cost above is convex).]

Gradient Descent for SVM. Our cost function can be re-written as
$$\min_w C(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\, 1 - y_i f(x_i)\right)$$
with $\lambda = \frac{2}{NC}$ and $f(x) = w^T x + b$.

Gradient Descent for SVM. So gradient descent might look like
$$w_{t+1} \leftarrow w_t - \eta_t \nabla_w C(w_t)$$
with learning rate $\eta_t$; but the hinge loss is not differentiable everywhere, so the gradient $\nabla_w C(w)$ is not defined at every point.

(Sub-)Gradient of Hinge Loss. For $L(x, y; w) = \max(0,\, 1 - y f(x))$, the sub-gradient is $\nabla_w L = -y\,x$ when $y f(x) < 1$ and $\nabla_w L = 0$ when $y f(x) > 1$. [Figure: hinge loss plotted against $y f(x)$.]

So now the whole gradient becomes
$$\nabla_w C(w) = \lambda w - \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[y_i\, w^T x_i < 1\right] y_i x_i,$$
where $\mathbb{1}[t]$ is an indicator function that equals 1 if its argument is true and 0 if it is false.
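A NumPy sketch of this sub-gradient, assuming the bias has been folded into $w$ by appending a 1 to each $x_i$ (names are my own):

```python
import numpy as np

def svm_cost_gradient(w, X, y, lam):
    """Sub-gradient of C(w) = (lam/2)||w||^2 + (1/N) sum_i max(0, 1 - y_i w^T x_i)."""
    N = len(y)
    margins = y * (X @ w)                     # y_i w^T x_i (bias folded into w)
    active = (margins < 1).astype(float)      # indicator 1[y_i w^T x_i < 1]
    return lam * w - ((active * y) @ X) / N   # lam * w - (1/N) sum_i 1[...] y_i x_i
```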

Pegasos Algorithm

Pegasos is a stochastic (sub-)gradient descent algorithm with a mini-batch size of 1!
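A minimal Pegasos-style sketch in Python (my own illustration: the bias is folded into $w$, the optional projection step of the original algorithm is omitted, and the parameter values are illustrative):

```python
import numpy as np

def pegasos(X, y, lam=0.01, n_iters=10_000, seed=0):
    """Pegasos-style stochastic sub-gradient descent on the hinge-loss SVM cost C(w)."""
    rng = np.random.default_rng(seed)
    Xh = np.column_stack([X, np.ones(len(X))])        # fold the bias into w via a constant feature
    w = np.zeros(Xh.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(len(y))                      # mini-batch of size 1: one random sample
        eta = 1.0 / (lam * t)                         # decaying learning rate
        if y[i] * (w @ Xh[i]) < 1:                    # hinge loss is active for this sample
            w = (1 - eta * lam) * w + eta * y[i] * Xh[i]
        else:                                         # only the regularizer contributes
            w = (1 - eta * lam) * w
    return w
```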