Neural networks and support vector machines

Perceptron. Inputs x_1, x_2, x_3, ..., x_D with weights w_1, w_2, w_3, ..., w_D. Output: sgn(w · x + b). The bias can be incorporated as a component of the weight vector by always including a feature with value set to 1.

Loose inspiration: Human neurons

Linear separability

Perceptron training algorithm. Initialize the weights. Cycle through the training examples in multiple passes (epochs). For each training example: if it is classified correctly, do nothing; if it is classified incorrectly, update the weights.

Perceptron update rule. For each training instance x with label y: classify with the current weights: y' = sgn(w · x). Update the weights: w ← w + α(y − y')x. Here α is a learning rate that should decay as a function of the epoch t, e.g., 1000/(1000 + t). What happens if y' is correct? Nothing changes, since y − y' = 0. Otherwise, consider what happens to the individual weights: w_i ← w_i + α(y − y')x_i. If y = 1 and y' = −1, w_i will be increased if x_i is positive and decreased if x_i is negative, so w · x will get bigger. If y = −1 and y' = 1, w_i will be decreased if x_i is positive and increased if x_i is negative, so w · x will get smaller.
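A minimal NumPy sketch of this training loop and update rule, assuming X is an (N, D) array of training instances (with a constant bias feature already appended), y holds labels in {−1, +1}, and the function name, epoch count, and exact decay schedule are illustrative choices rather than anything prescribed by the slides:

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    w = np.zeros(X.shape[1])                     # initialize weights (all zeros)
    for t in range(epochs):                      # multiple passes (epochs)
        alpha = 1000.0 / (1000.0 + t)            # decaying learning rate
        for i in np.random.permutation(len(y)):  # random order each epoch
            y_pred = 1.0 if w @ X[i] > 0 else -1.0   # y' = sgn(w . x)
            if y_pred != y[i]:                   # update only on mistakes
                w += alpha * (y[i] - y_pred) * X[i]
    return w
```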

Convergence of the perceptron update rule. Linearly separable data: converges to a perfect solution. Non-separable data: converges to a minimum-error solution, assuming the learning rate decays as O(1/t) and the examples are presented in random order.

Implementation details. Bias (add a feature dimension with value fixed to 1) vs. no bias. Initialization of weights: all zeros vs. random. Learning rate decay function. Number of epochs (passes through the training data). Order of cycling through the training examples (random).

Multi-class perceptrons. One-vs-others framework: keep a weight vector w_c for each class c. Decision rule: c = argmax_c w_c · x. Update rule: suppose an example from class c gets misclassified as c'. Update for c: w_c ← w_c + αx. Update for c': w_c' ← w_c' − αx.
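A small sketch of this decision rule and update, assuming a hypothetical (num_classes, D) weight matrix W whose rows are the per-class weight vectors w_c:

```python
import numpy as np

def multiclass_perceptron_step(W, x, c_true, alpha=1.0):
    c_pred = int(np.argmax(W @ x))   # decision rule: argmax_c  w_c . x
    if c_pred != c_true:             # example from c_true misclassified as c_pred
        W[c_true] += alpha * x       # pull the true class's weights toward x
        W[c_pred] -= alpha * x       # push the predicted class's weights away
    return W
```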

Multi-class perceptrons. One-vs-others framework: keep a weight vector w_c for each class c. Decision rule: c = argmax_c w_c · x. Softmax: P(c | x) = exp(w_c · x) / Σ_{k=1}^{C} exp(w_k · x). (Figure: inputs connected to one perceptron / weight vector w_c per class.)
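The softmax mapping can be sketched as follows, with W again a hypothetical per-class weight matrix; subtracting the maximum score is a standard numerical-stability trick and does not change the resulting probabilities:

```python
import numpy as np

def softmax_probabilities(W, x):
    scores = W @ x                        # one score w_c . x per class
    scores = scores - scores.max()        # stabilize exp() against overflow
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # P(c|x) = exp(w_c.x) / sum_k exp(w_k.x)
```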

Visualizing linear classifiers. Source: Andrej Karpathy, http://cs231n.github.io/linear-classify/

Differentiable perceptron. Inputs x_1, x_2, x_3, ..., x_d with weights w_1, w_2, w_3, ..., w_d. Output: σ(w · x + b). Sigmoid function: σ(t) = 1 / (1 + e^(−t)).

Update rule for differentiable perceptron. Define the total classification error or loss on the training set: E(w) = Σ_{j=1}^{N} (y_j − f(x_j))², where f(x) = σ(w · x). Update the weights by gradient descent: w ← w − α ∂E/∂w.

Update rule for differentiable perceptron. Define the total classification error or loss on the training set: E(w) = Σ_{j=1}^{N} (y_j − f(x_j))², where f(x) = σ(w · x). The gradient is ∂E/∂w = −Σ_{j=1}^{N} 2(y_j − f(x_j)) σ'(w · x_j) x_j = −Σ_{j=1}^{N} 2(y_j − f(x_j)) σ(w · x_j)(1 − σ(w · x_j)) x_j. Update the weights by gradient descent: w ← w − α ∂E/∂w. For a single training point, the update is: w ← w + α (y − f(x)) σ(w · x)(1 − σ(w · x)) x.

Update rule for differentiable perceptron. For a single training point, the update is: w ← w + α (y − f(x)) σ(w · x)(1 − σ(w · x)) x. This is called stochastic gradient descent. Compare with the update rule for the non-differentiable perceptron: w ← w + α (y − f(x)) x.
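As a sketch, the single-point update above might look like this in NumPy; sgd_step and alpha are illustrative names, and the bias is assumed to be folded into w via a constant feature:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sgd_step(w, x, y, alpha=0.1):
    f = sigmoid(w @ x)                               # f(x) = sigma(w . x)
    return w + alpha * (y - f) * f * (1.0 - f) * x   # w <- w + a (y - f) s (1 - s) x
```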

Network with a hidden layer (one set of weights from the inputs to the hidden layer, another from the hidden layer to the output). It can learn nonlinear functions (provided each perceptron is nonlinear and differentiable): f(x) = σ(Σ_k w_k σ(v_k · x)), where the v_k are the hidden-unit weight vectors and the w_k are the output weights.

Network with a hidden layer. Hidden layer size and network capacity. Source: http://cs231n.github.io/neural-networks-1/

Training of multi-layer networks. Find the network weights to minimize the error between the true and estimated labels of the training examples: E(w) = Σ_{j=1}^{N} (y_j − f_w(x_j))². Update the weights by gradient descent: w ← w − α ∂E/∂w. This requires perceptrons with a differentiable nonlinearity. Sigmoid: g(t) = 1 / (1 + e^(−t)). Rectified linear unit (ReLU): g(t) = max(0, t).

Training of multi-layer networks. Find the network weights to minimize the error between the true and estimated labels of the training examples: E(w) = Σ_{j=1}^{N} (y_j − f_w(x_j))². Update the weights by gradient descent: w ← w − α ∂E/∂w. Back-propagation: gradients are computed in the direction from the output to the input layers and combined using the chain rule. Stochastic gradient descent: compute the weight update w.r.t. one training example (or a small batch of examples) at a time, and cycle through the training examples in random order over multiple epochs.
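A very small sketch of back-propagation for a one-hidden-layer network with sigmoid units and the squared-error loss from these slides. V (hidden-layer weights, shape (H, D)) and w (output weights, shape (H,)) are hypothetical names, and the factor of 2 from the squared error is absorbed into the learning rate:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_step(V, w, x, y, alpha=0.1):
    # Forward pass: hidden activations h, then scalar output f.
    h = sigmoid(V @ x)                          # hidden layer
    f = sigmoid(w @ h)                          # output layer
    # Backward pass (chain rule) for the loss (y - f)^2.
    delta_out = (y - f) * f * (1.0 - f)         # error signal at the output
    delta_hid = delta_out * w * h * (1.0 - h)   # error signal at each hidden unit
    # Gradient-descent updates for both layers.
    w = w + alpha * delta_out * h
    V = V + alpha * np.outer(delta_hid, x)
    return V, w
```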

Regularization. It is common to add a penalty on weight magnitudes to the objective function: E(w) = Σ_{i=1}^{N} (y_i − f_w(x_i))² + (λ/2) ‖w‖². This encourages the network to use all of its inputs a little rather than a few inputs a lot. Source: http://cs231n.github.io/neural-networks-1/
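In terms of the gradient-descent updates above, the L2 penalty only adds a simple extra term. A sketch, where gradient_step_with_l2, data_grad, and lam are illustrative names and lam plays the role of λ:

```python
import numpy as np

def gradient_step_with_l2(w, data_grad, alpha=0.1, lam=1e-3):
    # data_grad is the gradient of the data term (e.g., the squared error above);
    # the penalty (lam/2) ||w||^2 contributes lam * w to the gradient.
    return w - alpha * (data_grad + lam * w)
```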

Multi-Layer Network Demo: http://playground.tensorflow.org/

Neural networks: pros and cons.
Pros: flexible and general function approximation framework; can build extremely powerful models by adding more layers.
Cons: hard to analyze theoretically (e.g., training is prone to local optima); huge amounts of training data and computing power are required to get good performance; the space of implementation choices is huge (network architectures, parameters).

Support vector machines. When the data is linearly separable, there may be more than one separator (hyperplane). Which separator is best?

Support vector machines. Find the hyperplane that maximizes the margin between the positive and negative examples; the training points that lie on the margin are the support vectors. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

SVM parameter learning. Separable data: min_{w,b} (1/2)‖w‖² subject to y_i(w · x_i + b) ≥ 1, i.e., maximize the margin while classifying the training data correctly. Non-separable data: min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{n} max(0, 1 − y_i(w · x_i + b)), i.e., maximize the margin while minimizing classification mistakes.

SVM parameter learning. min_{w,b} (1/2)‖w‖² + C Σ_{i=1}^{n} max(0, 1 − y_i(w · x_i + b)). (Figure: margin between the lines w · x + b = +1 and w · x + b = −1, with the decision boundary at 0.) Demo: http://cs.stanford.edu/people/karpathy/svms/demo
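A toy sketch of this objective and of one subgradient-descent step on it, assuming X is an (n, D) array and y holds labels in {−1, +1}; real SVM solvers typically use quadratic programming or SMO rather than this simple scheme:

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    margins = y * (X @ w + b)                   # y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)      # max(0, 1 - y_i (w . x_i + b))
    return 0.5 * (w @ w) + C * hinge.sum()

def svm_subgradient_step(w, b, X, y, C, alpha=1e-3):
    margins = y * (X @ w + b)
    active = margins < 1.0                      # points inside or violating the margin
    grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    return w - alpha * grad_w, b - alpha * grad_b
```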

Nonlinear SVMs. General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x).

Nonlinear SVMs. A dataset that is linearly separable in 1D vs. one that is not separable in 1D. We can map the non-separable data to a higher-dimensional space, e.g., x → (x, x²), where it becomes separable. Slide credit: Andrew Moore

The kernel trick. General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable. The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(x, y) = φ(x) · φ(y). (To be valid, the kernel function must satisfy Mercer's condition.)

The kernel trick. Linear SVM decision function: w · x + b = Σ_i α_i y_i x_i · x + b, where w is the learned weight vector and the x_i with nonzero α_i are the support vectors. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

The kernel trick. Linear SVM decision function: w · x + b = Σ_i α_i y_i x_i · x + b. Kernel SVM decision function: Σ_i α_i y_i φ(x_i) · φ(x) + b = Σ_i α_i y_i K(x_i, x) + b. This gives a nonlinear decision boundary in the original feature space. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
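A sketch of evaluating the kernel decision function, assuming the α_i, labels y_i, support vectors x_i, and bias b have already been learned by some solver; only kernel evaluations are needed, never the explicit lifting φ(x):

```python
import numpy as np

def kernel_svm_decision(x, support_vectors, alphas, labels, b, kernel):
    k_vals = np.array([kernel(sv, x) for sv in support_vectors])      # K(x_i, x)
    return float(np.sum(alphas * labels * k_vals) + b)  # sum_i alpha_i y_i K(x_i, x) + b
```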

Polynomial kernel: K(x, y) = (c + x · y)^d

Gaussian kernel. Also known as the radial basis function (RBF) kernel: K(x, y) = exp(−(1/σ²) ‖x − y‖²). (Figure: K(x, y) as a function of the distance ‖x − y‖.)
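Both kernels are one-liners; a sketch in which c, d, and sigma follow the slide notation, and either function could be passed as the kernel argument of the decision-function sketch above:

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=3):
    return (c + x @ y) ** d                     # K(x, y) = (c + x . y)^d

def gaussian_kernel(x, y, sigma=1.0):
    diff = x - y
    return np.exp(-(diff @ diff) / sigma ** 2)  # K(x, y) = exp(-||x - y||^2 / sigma^2)
```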

Gaussian kernel. (Figure: decision boundary of an SVM with a Gaussian kernel, with the support vectors (SVs) highlighted.)

SVMs: pros and cons.
Pros: the kernel-based framework is very powerful and flexible; training is a convex optimization, so a globally optimal solution can be found; amenable to theoretical analysis; SVMs work very well in practice, even with very small training sample sizes.
Cons: no direct multi-class SVM, must combine two-class SVMs (e.g., with one-vs-others); computation and memory costs (especially for nonlinear SVMs).

Neural networks vs. SVMs (a.k.a. deep vs. shallow learning)