COMP 875 Announcements

Announcements
Tentative presentation order is out.
Remember: by the Monday before the week of your presentation, you must send me the final paper list (for posting on the class website) and draft slides.
Exception: Rahul and Brendan should send draft slides by next Friday.

A few more presentation tips
Make your presentation interesting and accessible to everybody in the class:
Define your problem so it makes sense to people outside of your area.
Clearly explain where machine learning techniques come in.
Emphasize high-level and conceptual content.
Make sure you understand everything on your slides; don't put in any equations you can't explain.
Discuss connections to previous topics covered in class.
Best presentation contest! Students who are present in class will score each presentation. The scores will not be publicly announced and will not affect the presenter's grade. The popular favorite and runner-up(s) will receive prizes at the end of the course.

Review: Bias-variance tradeoff

Review: Classifiers
Bayes classifier: $f(x) = \arg\max_y \Pr[Y = y \mid x]$. This is the optimal classifier for 0-1 loss.
Nearest-neighbor classifier.
Linear classifiers:
Logistic regression: assume that the regression function $\eta(x) = \Pr[Y = 1 \mid x]$ satisfies
$$\log \frac{\Pr[1 \mid x]}{1 - \Pr[1 \mid x]} = w_0 + w^\top x, \qquad \text{so that} \qquad \eta(x) = \frac{1}{1 + e^{-(w_0 + w^\top x)}}.$$
Perceptron (Rosenblatt 1957): find parameters $w_0, w$ that minimize $\sum_{i\ \text{misclassified}} -y_i (w_0 + w^\top x_i)$, i.e., the (negated) classifier output on the misclassified examples.
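
As a quick illustration of the logistic regression model above, here is a minimal NumPy sketch that evaluates $\eta(x)$ for made-up weights and a made-up input (none of these numbers come from the lecture):

```python
import numpy as np

# Hypothetical parameters and input, chosen only for illustration.
w0 = -1.0                      # bias term
w = np.array([2.0, -0.5])      # weight vector
x = np.array([1.5, 0.3])       # a single input point

# Logistic regression posterior eta(x) = Pr[Y = 1 | x]
eta = 1.0 / (1.0 + np.exp(-(w0 + w @ x)))
print(eta)                     # value in (0, 1); predict class 1 if eta > 0.5
```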

Review: Perceptron
Optimization: starting with some initial values of $w_0, w$, iterate over the misclassified examples and update the parameter values to reduce the error.
Problems:
When the data is separable, the solution depends on the starting parameter values.
It may take a long time to converge (depending on the learning rate).
When the data is not separable, it does not converge at all!
Historical note: because of the problems with perceptrons (as described by Minsky & Papert, 1969), the field of neural networks fell out of favor for over ten years.
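
To make the update rule concrete, here is a minimal NumPy sketch of the perceptron described above; the learning rate, epoch count, and toy data are arbitrary choices for illustration, not part of the lecture.

```python
import numpy as np

def perceptron(X, y, lr=1.0, epochs=100):
    """Perceptron training for labels y in {-1, +1}; returns (w0, w)."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        updated = False
        for i in range(n):
            # Example i is misclassified if y_i (w0 + w^T x_i) <= 0.
            if y[i] * (w0 + w @ X[i]) <= 0:
                w += lr * y[i] * X[i]   # nudge the boundary toward x_i's correct side
                w0 += lr * y[i]
                updated = True
        if not updated:                 # no misclassified examples left: converged
            break
    return w0, w

# Toy separable data (invented for illustration).
X = np.array([[2.0, 2.0], [1.5, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```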

Recall: Geometry of hyperplanes
A hyperplane is defined by the equation $w^\top x + w_0 = 0$.
The unit vector $w / \|w\|$ is normal to the hyperplane.
The signed distance of any point $x_i$ to the hyperplane is given by $\frac{1}{\|w\|}(w^\top x_i + w_0)$.
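
A one-line numerical check of the signed-distance formula, with made-up $w$, $w_0$, and points:

```python
import numpy as np

w = np.array([3.0, 4.0])      # normal vector, ||w|| = 5
w0 = -5.0                     # offset
X = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.5]])

# Signed distance of each point to the hyperplane w^T x + w0 = 0
dist = (X @ w + w0) / np.linalg.norm(w)
print(dist)   # [4., -1., 0.]: positive side, negative side, and on the hyperplane
```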

Maximum-margin separating hyperplane
Margin maximization (for linearly separable data) is formulated as follows:
$$\max_{(w, w_0)} M \quad \text{subject to} \quad y_i(w^\top x_i + w_0) \ge M\|w\|, \quad i = 1, \dots, n.$$
Explanation: $\frac{1}{\|w\|}(w^\top x_i + w_0)$ is the signed distance between $x_i$ and the hyperplane $w^\top x + w_0 = 0$. The constraints require that each training point is on the correct side of the decision boundary and at least an unsigned distance $M$ from it. The goal is to find the hyperplane with parameters $w$ and $w_0$ that attains the largest such $M$.

Maximum-margin separating hyperplane
Constrained optimization problem:
$$\max_{(w, w_0)} M \quad \text{subject to} \quad y_i(w^\top x_i + w_0) \ge M\|w\|, \quad i = 1, \dots, n.$$
We can choose $M = 1/\|w\|$ and instead solve
$$\min_{(w, w_0)} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + w_0) \ge 1, \quad i = 1, \dots, n.$$
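
The rescaled problem above is a quadratic program, so a general-purpose convex solver can handle it directly. Below is a hedged sketch using CVXPY with invented, linearly separable toy data (if the data were not separable, this hard-margin problem would be infeasible):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data with labels in {-1, +1} (illustrative only).
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

w = cp.Variable(d)
w0 = cp.Variable()

# Hard-margin SVM primal: min (1/2)||w||^2  s.t.  y_i (w^T x_i + w0) >= 1
constraints = [cp.multiply(y, X @ w + w0) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

print(w.value, w0.value)
```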

Support vectors
$$\min_{(w, w_0)} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + w_0) \ge 1, \quad i = 1, \dots, n.$$
The quantity $y_i(w^\top x_i + w_0)$ is the (functional) margin of $x_i$. Points for which $y_i(w^\top x_i + w_0) = 1$ are support vectors.

Lagrange multipliers (Source: G. Shakhnarovich)
$$\min_{(w, w_0)} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w_0 + w^\top x_i) - 1 \ge 0, \quad i = 1, \dots, n.$$
We want to transform this constrained problem into an unconstrained problem. We associate with each constraint the loss
$$\max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i(w_0 + w^\top x_i) \right] = \begin{cases} 0 & \text{if } y_i(w_0 + w^\top x_i) \ge 1, \\ \infty & \text{if the constraint is violated.} \end{cases}$$
We can now reformulate our problem:
$$\min_{(w, w_0)} \left\{ \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i(w_0 + w^\top x_i) \right] \right\}.$$

Optimization (Source: G. Shakhnarovich)
We want all the constraint terms to be zero:
$$\min_{(w, w_0)} \left\{ \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i(w_0 + w^\top x_i) \right] \right\}$$
$$= \min_{(w, w_0)} \max_{\{\alpha_i \ge 0\}} \left\{ \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \alpha_i \left[ 1 - y_i(w_0 + w^\top x_i) \right] \right\}$$
$$= \max_{\{\alpha_i \ge 0\}} \min_{(w, w_0)} \underbrace{\left\{ \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \alpha_i \left[ 1 - y_i(w_0 + w^\top x_i) \right] \right\}}_{L(w, w_0; \alpha)}.$$
(Note: in general, it is not always valid to exchange min and max.)

Strategy for optimization (Source: G. Shakhnarovich)
We need to find $\max_{\{\alpha_i \ge 0\}} \min_{(w, w_0)} L(w, w_0; \alpha)$.
We first fix $\alpha = [\alpha_1, \dots, \alpha_n]$ and treat $L(w, w_0; \alpha)$ as a function of $w, w_0$; we find the functions $w(\alpha), w_0(\alpha)$ that attain the minimum.
Next, we treat $L(w(\alpha), w_0(\alpha); \alpha)$ as a function of $\alpha$ and find the $\alpha^*$ that attains the maximum.
In the end, the solution is given by $\alpha^*$, $w(\alpha^*)$, and $w_0(\alpha^*)$.

Minimizing $L(w, w_0; \alpha)$ w.r.t. $w, w_0$ (Source: G. Shakhnarovich)
For fixed $\alpha$ we can minimize
$$L(w, w_0; \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^n \alpha_i \left[ 1 - y_i(w_0 + w^\top x_i) \right]$$
by setting the derivatives w.r.t. $w, w_0$ to zero:
$$\nabla_w L(w, w_0; \alpha) = w - \sum_{i=1}^n \alpha_i y_i x_i = 0, \qquad \frac{\partial}{\partial w_0} L(w, w_0; \alpha) = -\sum_{i=1}^n \alpha_i y_i = 0.$$
Note that the bias term $w_0$ has dropped out, but it has produced a global constraint on $\alpha$.

Solving for $\alpha$ (Source: G. Shakhnarovich)
$$w(\alpha) = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0.$$
Now we can substitute this solution into the objective:
$$\max_{\{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0\}} \left\{ \frac{1}{2}\|w(\alpha)\|^2 + \sum_{i=1}^n \alpha_i \left[ 1 - y_i\big(w_0(\alpha) + w(\alpha)^\top x_i\big) \right] \right\}
= \max_{\{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0\}} \left\{ \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j \right\}.$$

Max-margin and quadratic programming (Source: G. Shakhnarovich)
We started by writing down the max-margin problem and arrived at the dual problem in $\alpha$:
$$\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad \alpha_i \ge 0 \text{ for all } i = 1, \dots, n.$$
Solving this quadratic program yields $\alpha^*$. We substitute $\alpha^*$ back to get $w$:
$$\hat{w} = w(\alpha^*) = \sum_{i=1}^n \alpha_i^* y_i x_i.$$

Maximum margin decision boundary (Source: G. Shakhnarovich)
$$\hat{w} = w(\alpha^*) = \sum_{i=1}^n \alpha_i^* y_i x_i.$$
Recall that, at the optimal solution, we must have $\alpha_i^* \left[ 1 - y_i(\hat{w}_0 + \hat{w}^\top x_i) \right] = 0$.
Suppose that, under the optimal solution, the margin of $x_i$ is $y_i(\hat{w}_0 + \hat{w}^\top x_i) > 1$ ($x_i$ is not a support vector). Then, necessarily, $\alpha_i^* = 0$. Thus, we can express the direction of the max-margin decision boundary as a function of the support vectors alone:
$$\hat{w} = \sum_{\alpha_i^* > 0} \alpha_i^* y_i x_i.$$
We have $\hat{w}_0 = y_i - \hat{w}^\top x_i$ for any support vector $x_i$. Alternatively, we can compute $\hat{w}_0$ by making sure the margin is balanced between the two classes.

Support vectors (Source: G. Shakhnarovich)
$$\hat{w} = \sum_{\alpha_i^* > 0} \alpha_i^* y_i x_i.$$
Given a test example $x$, it is classified by
$$\hat{y} = \operatorname{sign}\left( \hat{w}_0 + \hat{w}^\top x \right) = \operatorname{sign}\left( \hat{w}_0 + \Big( \sum_{\alpha_i^* > 0} \alpha_i^* y_i x_i \Big)^{\!\top} x \right) = \operatorname{sign}\left( \hat{w}_0 + \sum_{\alpha_i^* > 0} \alpha_i^* y_i\, x_i^\top x \right).$$
The classifier is based on an expansion in terms of dot products of $x$ with the support vectors.
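
Putting the last few slides together, here is a hedged CVXPY/NumPy sketch that solves the dual QP on invented toy data, recovers $\hat{w}$ and $\hat{w}_0$ from a support vector, and classifies test points using only dot products with the support vectors. The tiny ridge added to the Gram matrix and the tolerance used to decide which $\alpha_i^*$ are nonzero are numerical conveniences, not part of the formulation.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data with labels in {-1, +1} (illustrative only).
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[0]

# Dual QP: max sum(alpha) - 1/2 alpha^T Q alpha, alpha >= 0, sum_i alpha_i y_i = 0,
# where Q_ij = y_i y_j x_i^T x_j (a small ridge keeps Q numerically PSD).
Q = np.outer(y, y) * (X @ X.T) + 1e-9 * np.eye(n)
alpha = cp.Variable(n)
prob = cp.Problem(
    cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q)),
    [alpha >= 0, y @ alpha == 0],
)
prob.solve()
a = alpha.value

# Recover w_hat = sum_i alpha_i y_i x_i and w0_hat = y_i - w_hat^T x_i for any SV.
w_hat = (a * y) @ X
sv = a > 1e-6                      # support vectors (tolerance is arbitrary)
i = np.argmax(sv)                  # index of one support vector
w0_hat = y[i] - w_hat @ X[i]

def classify(x_new):
    # Decision rule using dot products with the support vectors only.
    return np.sign(w0_hat + np.sum(a[sv] * y[sv] * (X[sv] @ x_new)))

print(w_hat, w0_hat)
print(classify(np.array([2.0, 1.5])), classify(np.array([-1.5, -0.5])))
```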

Non-separable case
What if the training data are not linearly separable? We can no longer require exact margin constraints.
One idea: minimize
$$\min_{w} \frac{1}{2}\|w\|^2 + C \cdot (\#\text{mistakes}).$$
This is the 0-1 loss. The parameter $C$ determines the penalty paid for violating margin constraints (a tradeoff between the number of mistakes and the margin).
Problem: this is not a QP anymore, and it does not distinguish between near misses and bad mistakes.

Non-separable case
Another idea: rewrite the constraints with slack variables $\xi_i \ge 0$:
$$\min_{(w, w_0)} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(w_0 + w^\top x_i) - 1 + \xi_i \ge 0.$$
Whenever the margin is $\ge 1$ (the original constraint is satisfied), $\xi_i = 0$. Whenever the margin is $< 1$ (the constraint is violated), we pay a linear penalty.
This is called the hinge loss: $\max\left(0,\ 1 - y_i(w_0 + w^\top x_i)\right)$.
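
Since the optimal slacks satisfy $\xi_i = \max(0, 1 - y_i(w_0 + w^\top x_i))$, the slack formulation is equivalent to minimizing $\frac{1}{2}\|w\|^2 + C \sum_i \max(0, 1 - y_i(w_0 + w^\top x_i))$ without constraints, so one simple (if crude) way to fit it is subgradient descent on the hinge-loss objective. A minimal NumPy sketch with an invented step size, iteration count, and toy data:

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w0 + w^T x_i))."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (w0 + X @ w)
        violated = margins < 1                        # points paying a hinge penalty
        grad_w = w - C * (y[violated] @ X[violated])  # subgradient w.r.t. w
        grad_w0 = -C * np.sum(y[violated])            # subgradient w.r.t. w0
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w0, w

# Toy, slightly overlapping data (invented for illustration).
X = np.array([[2.0, 2.0], [2.5, 1.0], [1.2, 0.8],
              [-1.0, -1.0], [-2.0, 0.5], [0.5, 0.2]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
print(soft_margin_svm(X, y, C=1.0))
```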

Connection between SVMs and logistic regression
Support vector machines use the hinge loss: $\max\left(0,\ 1 - y_i(w_0 + w^\top x_i)\right)$.
Logistic regression models $P(y_i \mid x_i; w, w_0) = \dfrac{1}{1 + e^{-y_i(w_0 + w^\top x_i)}}$ and incurs the log loss: $\log\left(1 + e^{-y_i(w_0 + w^\top x_i)}\right)$.
Both losses are decreasing functions of the margin $y_i(w_0 + w^\top x_i)$; the hinge loss is exactly zero once the margin is at least 1, while the log loss is positive everywhere.
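
To see the connection numerically, here is a small sketch that evaluates both losses on a few margin values $m = y_i(w_0 + w^\top x_i)$ (np.log1p is used only for numerical stability):

```python
import numpy as np

m = np.linspace(-2, 3, 6)              # margin values y_i (w0 + w^T x_i)
hinge = np.maximum(0.0, 1.0 - m)       # SVM hinge loss
log_loss = np.log1p(np.exp(-m))        # logistic regression log loss

for mi, h, lg in zip(m, hinge, log_loss):
    print(f"margin={mi:5.1f}  hinge={h:.3f}  log={lg:.3f}")
# Both decrease with the margin; the hinge loss is exactly 0 for margins >= 1,
# while the log loss stays positive but decays toward 0.
```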

Non-separable case: solution (Source: G. Shakhnarovich)
$$\min_{(w, w_0)} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i.$$
We can solve this using Lagrange multipliers, introducing additional multipliers for the $\xi_i$. The resulting dual problem is
$$\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C \text{ for all } i = 1, \dots, n.$$
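
Compared with the hard-margin dual solved earlier, the only change is the box constraint $\alpha_i \le C$. A hedged CVXPY sketch on invented, non-separable toy data (one label deliberately flipped so that some $\alpha_i$ hit the upper bound):

```python
import numpy as np
import cvxpy as cp

# Toy data that is NOT linearly separable (last point sits among the positives).
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5], [1.8, 1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
n = X.shape[0]
C = 1.0                                        # slack penalty (arbitrary choice)

Q = np.outer(y, y) * (X @ X.T) + 1e-9 * np.eye(n)
alpha = cp.Variable(n)
prob = cp.Problem(
    cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q)),
    [alpha >= 0, alpha <= C, y @ alpha == 0],  # the box constraint is the only change
)
prob.solve()
print(alpha.value)   # entries at (or near) C correspond to margin violators
```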

SVM with slack variables (Source: G. Shakhnarovich)
[Figure: two-class scatter plot with the margin drawn; points are annotated with their $(\alpha, \xi)$ values, e.g. $0 < \alpha < C,\ \xi = 0$ for points on the margin, $\alpha = C,\ 0 < \xi \le 1$ for points inside the margin, and $\alpha = C,\ \xi > 1$ for misclassified points.]
Support vectors: points with $\alpha_i > 0$.
If $0 < \alpha_i < C$: support vectors on the margin, $\xi_i = 0$.
If $\alpha_i = C$: support vectors over the margin, either misclassified ($\xi_i > 1$) or not ($0 < \xi_i \le 1$).
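
In practice, library implementations expose exactly these quantities. A hedged sketch with scikit-learn's SVC (linear kernel, arbitrary C, invented toy data): support_vectors_ holds the points with $\alpha_i > 0$, dual_coef_ holds the products $y_i \alpha_i$, and entries whose magnitude equals C correspond to margin-violating points.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, slightly overlapping data with labels in {-1, +1} (illustrative only).
X = np.array([[2.0, 2.0], [2.5, 1.0], [1.2, 0.8],
              [-1.0, -1.0], [-2.0, 0.5], [0.5, 0.2]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_)               # indices of support vectors (alpha_i > 0)
print(clf.support_vectors_)       # the support vectors themselves
print(clf.dual_coef_)             # y_i * alpha_i; |value| == C marks margin violators
print(clf.coef_, clf.intercept_)  # w and w0 for the linear kernel
print(clf.predict([[2.0, 1.5], [-1.5, -0.5]]))
```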