Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang

Review: machine learning basics

Math formulation. Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$, find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$, such that the expected loss $L(f) = \mathbb{E}_{(x,y)\sim D}[l(f, x, y)]$ is small.
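As a quick illustration (not part of the slides), the empirical loss can be written in a few lines of Python; NumPy is assumed, and the hypothesis and loss below are arbitrary placeholders:

```python
import numpy as np

def empirical_loss(f, loss, X, y):
    """Average of loss(f, x_i, y_i) over the n training points."""
    return np.mean([loss(f, x_i, y_i) for x_i, y_i in zip(X, y)])

# Illustrative choices only: a linear hypothesis with squared loss.
w = np.array([0.5, -1.0])
f = lambda x: w @ x
squared_loss = lambda f, x, y: (f(x) - y) ** 2

X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
y = np.array([1.0, -1.0, 2.0])
print(empirical_loss(f, squared_loss, X, y))
```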

Machine learning 1-2-3: (1) collect data and extract features; (2) build model: choose hypothesis class $\mathcal{H}$ and loss function $l$; (3) optimization: minimize the empirical loss.

Loss function. $\ell_2$ loss: linear regression. Cross-entropy: logistic regression. Hinge loss: Perceptron. General principle: maximum likelihood estimation (MLE). $\ell_2$ loss corresponds to a Normal distribution; logistic regression corresponds to a sigmoid conditional distribution.
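A hedged sketch of these per-example losses in Python, with labels $y \in \{+1, -1\}$ for the classification losses (the $\ell_2$ case uses a real-valued target); function names are mine, not the lecture's:

```python
import numpy as np

def l2_loss(score, target):
    # Squared error, as in linear regression (real-valued target).
    return (score - target) ** 2

def cross_entropy_loss(score, y):
    # Logistic loss with y in {+1, -1}: -log sigmoid(y * score).
    return np.log1p(np.exp(-y * score))

def perceptron_hinge_loss(score, y):
    # Perceptron-style hinge loss: penalize only misclassified points.
    return max(0.0, -y * score)
```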

Optimization. Linear regression: closed-form solution. Logistic regression: gradient descent. Perceptron: stochastic gradient descent (SGD). General principle: local improvement. SGD: used for the Perceptron; can also be applied to linear regression/logistic regression.
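A minimal SGD sketch for the perceptron (a toy implementation under my own conventions, not code from the course):

```python
import numpy as np

def perceptron_sgd(X, y, epochs=10, lr=1.0):
    """Stochastic gradient descent on the perceptron loss max(0, -y w^T x)."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (w @ X[i]) <= 0:      # misclassified: nonzero (sub)gradient
                w += lr * y[i] * X[i]       # local improvement step
    return w
```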

Principle for hypothesis class? Yes, there exists a general principle (at least philosophically). Different names/faces/connections: Occam's razor; VC dimension theory; minimum description length; tradeoff between bias and variance; uniform convergence; the curse of dimensionality. Running example: Support Vector Machine (SVM).

Motivation

Linear classification. The hyperplane $(w^*)^T x = 0$ separates the classes: $(w^*)^T x > 0$ for Class +1 and $(w^*)^T x < 0$ for Class -1. Assume perfect separation between the two classes.

Attempt. Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$. Hypothesis: $y = \text{sign}(f_w(x)) = \text{sign}(w^T x)$, i.e., $y = +1$ if $w^T x > 0$ and $y = -1$ if $w^T x < 0$. Let's assume that we can optimize to find $w$.
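The hypothesis is one line of code; a small sketch (the tie case $w^T x = 0$ is mapped to $-1$ here, an arbitrary choice):

```python
import numpy as np

def predict(w, X):
    """y = sign(w^T x) for each row x of X; w^T x = 0 is mapped to -1 (arbitrary)."""
    return np.where(X @ w > 0, 1, -1)
```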

Multiple optimal solutions? (Figure: three separating hyperplanes $w_1$, $w_2$, $w_3$ between Class +1 and Class -1.) Same on empirical loss; different on test/expected loss.

What about $w_1$? (Figure: new test data, Class +1 vs. Class -1, with separator $w_1$.)

What about $w_3$? (Figure: new test data, Class +1 vs. Class -1, with separator $w_3$.)

Most confident: $w_2$. (Figure: new test data, Class +1 vs. Class -1, with separator $w_2$.)

Intuition: margin. (Figure: $w_2$ has a large margin between Class +1 and Class -1.)

Margin

Margin. Lemma 1: $x$ has distance $\frac{|f_w(x)|}{\|w\|}$ to the hyperplane $f_w(x) = w^T x = 0$. Proof: $w$ is orthogonal to the hyperplane. The unit direction is $\frac{w}{\|w\|}$. The projection of $x$ onto it is $\left(\frac{w}{\|w\|}\right)^T x = \frac{f_w(x)}{\|w\|}$.

Margin: with bias. Claim 1: $w$ is orthogonal to the hyperplane $f_{w,b}(x) = w^T x + b = 0$. Proof: pick any $x_1$ and $x_2$ on the hyperplane; then $w^T x_1 + b = 0$ and $w^T x_2 + b = 0$, so $w^T(x_1 - x_2) = 0$.

Margin: with bias. Claim 2: the origin $0$ has distance $\frac{|b|}{\|w\|}$ to the hyperplane $w^T x + b = 0$. Proof: pick any $x_1$ on the hyperplane. Project $x_1$ onto the unit direction $\frac{w}{\|w\|}$ to get the distance: $\left|\left(\frac{w}{\|w\|}\right)^T x_1\right| = \frac{|b|}{\|w\|}$, since $w^T x_1 + b = 0$.

Margin: with bias. Lemma 2: $x$ has distance $\frac{|f_{w,b}(x)|}{\|w\|}$ to the hyperplane $f_{w,b}(x) = w^T x + b = 0$. Proof: let $x = x_\perp + r\frac{w}{\|w\|}$, where $x_\perp$ is the projection of $x$ onto the hyperplane; then $|r|$ is the distance. Multiply both sides by $w^T$ and add $b$. Left-hand side: $w^T x + b = f_{w,b}(x)$. Right-hand side: $w^T x_\perp + b + r\frac{w^T w}{\|w\|} = 0 + r\|w\|$.
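Lemma 2 translates directly into a distance function; a small sketch with a hypothetical point and hyperplane:

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """|w^T x + b| / ||w||, the distance from x to the hyperplane w^T x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Hypothetical example: the point (0, 2) and the hyperplane x1 + x2 - 1 = 0.
print(distance_to_hyperplane(np.array([0.0, 2.0]), np.array([1.0, 1.0]), -1.0))
# ~0.7071 = 1 / sqrt(2)
```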

The notation here is: $y(x) = w^T x + w_0$. Figure from Pattern Recognition and Machine Learning, Bishop.

Support Vector Machine (SVM)

SVM: objective. Margin over all training data points: $\gamma = \min_i \frac{|f_{w,b}(x_i)|}{\|w\|}$. Since we only want correct $f_{w,b}$, and recalling that $y_i \in \{+1, -1\}$, we have $\gamma = \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|}$. If $f_{w,b}$ is incorrect on some $x_i$, the margin is negative.
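The margin over a training set, following this definition (a sketch; a negative value means $(w, b)$ misclassifies some point):

```python
import numpy as np

def margin(w, b, X, y):
    """min_i y_i (w^T x_i + b) / ||w||; negative if (w, b) misclassifies some x_i."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```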

SVM: objective. Maximize the margin over all training data points: $\max_{w,b} \gamma = \max_{w,b} \min_i \frac{y_i f_{w,b}(x_i)}{\|w\|} = \max_{w,b} \min_i \frac{y_i (w^T x_i + b)}{\|w\|}$. A bit complicated.

SVM: simplified objective. Observation: when $(w, b)$ is scaled by a factor $c > 0$, the margin is unchanged: $\frac{y_i(c\,w^T x_i + c\,b)}{\|c\,w\|} = \frac{y_i(w^T x_i + b)}{\|w\|}$. Let's consider a fixed scale such that $y_{i^*}(w^T x_{i^*} + b) = 1$, where $x_{i^*}$ is the point closest to the hyperplane.
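A quick numerical check of this scaling invariance on made-up numbers (just an illustration):

```python
import numpy as np

w, b, c = np.array([1.0, -2.0]), 0.5, 3.0
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])

m = lambda w, b: np.min(y * (X @ w + b)) / np.linalg.norm(w)
print(np.isclose(m(w, b), m(c * w, c * b)))  # True: scaling (w, b) by c leaves the margin unchanged
```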

SVM: simplified objective. Let's consider a fixed scale such that $y_{i^*}(w^T x_{i^*} + b) = 1$, where $x_{i^*}$ is the point closest to the hyperplane. Now we have for all data $y_i(w^T x_i + b) \ge 1$, and for at least one $i$ the equality holds. Then the margin is $\frac{1}{\|w\|}$.

SVM: simplified objective. Optimization simplified to $\min_{w,b} \frac{1}{2}\|w\|^2$ s.t. $y_i(w^T x_i + b) \ge 1, \forall i$. How to find the optimum $w^*$?
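One way to find the optimum is to hand this quadratic program to an off-the-shelf solver; a sketch using cvxpy (assumed to be installed) on hypothetical separable data, not the lecture's code:

```python
import cvxpy as cp
import numpy as np

# Hypothetical linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))    # min (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]      # y_i (w^T x_i + b) >= 1 for all i
problem = cp.Problem(objective, constraints)
problem.solve()
print(w.value, b.value)
```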

SVM: principle for hypothesis class

Thought experiment. Suppose we pick an $R$, and suppose we can decide whether there exists $w$ satisfying $\frac{1}{2}\|w\|^2 \le R$ and $y_i(w^T x_i + b) \ge 1, \forall i$. Decrease $R$ until we cannot find a $w$ satisfying the inequalities.
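A sketch of this thought experiment as a feasibility check (again with cvxpy and made-up data; the function name and the grid of $R$ values are mine):

```python
import cvxpy as cp
import numpy as np

def feasible(R, X, y):
    """Is there (w, b) with 0.5 ||w||^2 <= R and y_i (w^T x_i + b) >= 1 for all i?"""
    w, b = cp.Variable(X.shape[1]), cp.Variable()
    problem = cp.Problem(cp.Minimize(0),
                         [0.5 * cp.sum_squares(w) <= R,
                          cp.multiply(y, X @ w + b) >= 1])
    problem.solve()
    return problem.status == cp.OPTIMAL

X = np.array([[2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
for R in [1.0, 0.1, 0.01]:       # shrink R until the constraints become infeasible
    print(R, feasible(R, X, y))
```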

Thought experiment. $w^*$ is the best weight (i.e., the one satisfying the smallest $R$). (Figure sequence: the constraint region $\frac{1}{2}\|w\|^2 \le R$ shrinking around $w^*$ as $R$ decreases.)

Thought experiment. To handle the difference between empirical and expected losses: choose a large-margin hypothesis (high confidence); choose a small hypothesis class. (Figure: the region around $w^*$ corresponds to the hypothesis class.)

Thought experiment. Principle: use the smallest hypothesis class that still contains a correct/good hypothesis. Also true beyond SVM. Also true for the case without perfect separation between the two classes. Math formulation: VC-dimension theory, etc. (Figure: the region around $w^*$ corresponds to the hypothesis class.)

Thought experiment. Principle: use the smallest hypothesis class that still contains a correct/good hypothesis. Whatever you know about the ground truth, add it as a constraint/regularizer. (Figure: the region around $w^*$ corresponds to the hypothesis class.)

SVM: optimization. Optimization (Quadratic Programming): $\min_{w,b} \frac{1}{2}\|w\|^2$ s.t. $y_i(w^T x_i + b) \ge 1, \forall i$. Solved by the Lagrange multiplier method: $\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[y_i(w^T x_i + b) - 1\right]$, where the $\alpha_i$ are the Lagrange multipliers. Details in the next lecture.
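For reference, the Lagrangian above written out as a plain function (a transcription of the formula, not an optimization routine):

```python
import numpy as np

def lagrangian(w, b, alpha, X, y):
    """L(w, b, alpha) = 0.5 ||w||^2 - sum_i alpha_i [y_i (w^T x_i + b) - 1]."""
    return 0.5 * w @ w - np.sum(alpha * (y * (X @ w + b) - 1.0))
```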

Reading. Review the Lagrange multiplier method, e.g., Section 5 in Andrew Ng's note on SVM posted on the course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/