
Cheng Soon Ong & Christian Walder, Research Group and College of Engineering and Computer Science, Canberra, February - June 2018. Outline: Overview; Introduction; Linear Algebra; Probability; Linear Regression 1; Linear Regression 2; Linear Classification 1; Linear Classification 2; Kernel Methods; Sparse Kernel Methods; Mixture Models and EM 1; Mixture Models and EM 2; Neural Networks 1; Neural Networks 2; Principal Component Analysis; Autoencoders; Graphical Models 1; Graphical Models 2; Graphical Models 3; Sampling; Sequential Data 1; Sequential Data 2. (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)

Part X: Sparse Kernel Methods

Motivation. Nonlinear kernels extended our toolbox of methods considerably: wherever an inner product is used in an algorithm, we can replace it with a kernel. Kernels act as a kind of similarity measure and can be defined over graphs, sets, strings, and documents. But the kernel matrix is a square matrix with dimensions equal to the number of data points N, and in order to calculate it, the kernel function must be evaluated for all pairs of training inputs. We therefore want learning algorithms where, for prediction, the kernel is only evaluated at a subset of the training data.
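A minimal Python sketch (not part of the lecture) of why the full kernel matrix becomes the bottleneck: the Gram matrix K is N x N and needs every pair of training inputs. The function name rbf_kernel_matrix and the value of gamma are illustrative assumptions.

import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # Gram matrix K[n, m] = exp(-gamma * ||x_n - x_m||^2) over all N^2 pairs
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

X = np.random.randn(500, 2)        # N = 500 training inputs
K = rbf_kernel_matrix(X, gamma=0.45)
print(K.shape)                     # (500, 500): memory and time grow as N^2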

Maximum Margin Classifiers. Return to two-class classification with a linear model y(x) = w^T φ(x) + b, where φ(x) is some fixed feature space mapping and b is the bias. Training data are N input vectors x_1, ..., x_N with corresponding targets t_1, ..., t_N, where t_n ∈ {−1, +1}. The class of a new point is predicted as sign(y(x)). Assume there exists a linear decision boundary; that is, there exist w and b such that t_n sign(y(x_n)) > 0 for n = 1, ..., N.

The Multiple Separator Problem. There may exist many solutions w and b for which the linear classifier perfectly separates the two classes!

Maximum Margin Classifiers. There may exist many solutions w and b for which the linear classifier perfectly separates the two classes. The perceptron algorithm can find a solution, but which one it finds depends on the initial choice of parameters. Which decision boundary results in the best generalisation (smallest generalisation error)?

Maximum Margin Classifiers. The margin is the smallest distance between the decision boundary and any of the samples. (We will see later why y = ±1 on the margin.) (Figure: decision boundary y = 0 with margin boundaries y = −1 and y = +1; the margin is indicated.)

Maximum Margin Classifiers. Support Vector Machines (SVMs) choose the decision boundary which maximises the margin. (Figure: the maximum margin decision boundary y = 0 with margin boundaries y = −1 and y = +1.)

Reminder - Linear Classification. For a linear model with weights w and bias b, the decision boundary is w^T z + b = 0. Any x has a projection x_proj onto the boundary, and x − x_proj = λ w, since w is orthogonal to the decision boundary. But we also have w^T x − w^T x_proj = w^T x + b = λ w^T w. (Figure: a point x, its projection x_proj onto the hyperplane w^T z + b = 0, and the origin 0.)

Reminder - Linear Classification. So ‖x − x_proj‖ = λ ‖w‖ = (w^T x + b) / ‖w‖. The distance of x from the decision boundary is therefore (w^T x + b) / ‖w‖ = y(x) / ‖w‖.
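A small numeric check of this distance formula (not part of the lecture): the values of w, b, and x below are arbitrary illustrations, and the projection cross-check uses x − x_proj = λ w as derived above.

import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])

y_x = w @ x + b                       # signed value of the linear model at x
dist = y_x / np.linalg.norm(w)        # signed distance of x from w^T z + b = 0

# cross-check: project x onto the hyperplane and measure the gap
lam = y_x / (w @ w)                   # x - x_proj = lam * w
x_proj = x - lam * w
print(dist, np.sign(y_x) * np.linalg.norm(x - x_proj))   # both agree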

Maximum Margin Classifiers. Support Vector Machines choose the decision boundary which maximises the smallest distance to samples in both classes. We calculated the distance of a point x from the hyperplane y(x) = 0 as |y(x)| / ‖w‖. Perfect classification means t_n y(x_n) > 0 for all n. Thus, the distance of x_n from the decision boundary is t_n y(x_n) / ‖w‖ = t_n (w^T φ(x_n) + b) / ‖w‖.

Maximum Margin Classifiers. Support Vector Machines choose the decision boundary which maximises the smallest distance to samples in both classes. For the maximum margin solution, solve arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }. How can we solve this?

Maximum Margin Classifiers. Maximum margin solution: arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }, or equivalently arg max_{w,b,γ} { γ / ‖w‖ } s.t. t_n (w^T φ(x_n) + b) ≥ γ, n = 1, ..., N. By rescaling any (w, b) by 1/γ, we don't change the answer, so we can assume γ = 1. The canonical representation of the decision hyperplane is therefore t_n (w^T φ(x_n) + b) ≥ 1, n = 1, ..., N.

Maximum Margin Classifiers. Maximum margin solution: arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }. Transformed once: arg max_{w,b} { 1/‖w‖ } s.t. t_n (w^T φ(x_n) + b) ≥ 1. Transformed again: arg min_{w,b} { (1/2) ‖w‖² } s.t. t_n (w^T φ(x_n) + b) ≥ 1.
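A hedged sketch (not part of the lecture) of this canonical QP on a tiny linearly separable toy data set, with the identity feature map φ(x) = x. The cvxpy library stands in for any off-the-shelf QP solver; the data values are arbitrary assumptions.

import numpy as np
import cvxpy as cp

# linearly separable toy data with the identity feature map phi(x) = x
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-1.5, -2.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(t, X @ w + b) >= 1]       # t_n (w^T x_n + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                              # maximum margin separator; margin = 1 / ||w||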

Lagrange Multipliers

Lagrange Multipliers. Find the stationary points of a function f(x_1, x_2) subject to one or more constraints on the variables x_1 and x_2, written in the form g(x_1, x_2) = 0. Direct approach: 1. Solve g(x_1, x_2) = 0 for one of the variables to get x_2 = h(x_1). 2. Insert the result into f(x_1, x_2) to get a function of one variable, f(x_1, h(x_1)). 3. Find the stationary point(s) x_1* of f(x_1, h(x_1)), with corresponding value x_2* = h(x_1*). Drawbacks: finding x_2 = h(x_1) may be hard, and the symmetry between the variables x_1 and x_2 is lost.

Lagrange Multipliers. To solve max_x f(x) s.t. g(x) = 0, introduce the Lagrangian function L(x, λ) = f(x) + λ g(x), from which we get the stationarity conditions and the constraint itself: ∇_x L(x, λ) = ∇f(x) + λ ∇g(x) = 0 and ∂L(x, λ)/∂λ = g(x) = 0. These are D equations resulting from ∇_x L(x, λ) and one equation from ∂L(x, λ)/∂λ, together determining x* and λ.

Lagrange Multipliers - Example. Given f(x_1, x_2) = 1 − x_1² − x_2², subject to the constraint g(x_1, x_2) = x_1 + x_2 − 1 = 0. Define the Lagrangian function L(x, λ) = 1 − x_1² − x_2² + λ(x_1 + x_2 − 1). A stationary solution with respect to x_1, x_2, and λ must satisfy −2x_1 + λ = 0, −2x_2 + λ = 0, x_1 + x_2 − 1 = 0. Therefore (x_1*, x_2*) = (1/2, 1/2) and λ = 1.
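A quick symbolic check of this worked example (not part of the lecture), solving the three stationarity equations with sympy:

import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam')
f = 1 - x1**2 - x2**2
g = x1 + x2 - 1
L = f + lam * g

# stationarity in x1 and x2, plus the constraint itself
sol = sp.solve([sp.diff(L, x1), sp.diff(L, x2), g], [x1, x2, lam], dict=True)
print(sol)    # [{x1: 1/2, x2: 1/2, lam: 1}]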

Lagrange Multipliers - Example. Given f(x_1, x_2) = 1 − x_1² − x_2² subject to the constraint g(x_1, x_2) = x_1 + x_2 − 1 = 0; Lagrangian L(x, λ) = 1 − x_1² − x_2² + λ(x_1 + x_2 − 1); solution (x_1*, x_2*) = (1/2, 1/2). (Figure: contours of f in the (x_1, x_2) plane, the constraint line g(x_1, x_2) = 0, and the stationary point (x_1*, x_2*).)

Lagrange Multipliers for Inequality Constraints. To solve max_x f(x) s.t. g(x) ≥ 0, define the Lagrangian L(x, λ) = f(x) + λ g(x). Solve for x and λ subject to the constraints (Karush-Kuhn-Tucker or KKT conditions): g(x) ≥ 0, λ ≥ 0, λ g(x) = 0.

Lagrange Multipliers for the General Case. Maximise f(x) subject to the constraints g_j(x) = 0 for j = 1, ..., J, and h_k(x) ≥ 0 for k = 1, ..., K. Define the Lagrange multipliers {λ_j} and {µ_k}, and the Lagrangian L(x, {λ_j}, {µ_k}) = f(x) + Σ_{j=1}^J λ_j g_j(x) + Σ_{k=1}^K µ_k h_k(x). Solve for x, {λ_j}, and {µ_k} subject to the constraints (Karush-Kuhn-Tucker or KKT conditions) µ_k ≥ 0 and µ_k h_k(x) = 0 for k = 1, ..., K. For minimisation of f(x), change the sign in front of the Lagrange multipliers.

Maximum Margin Classifiers

Maximum Margin Classifiers. This is a Quadratic Programming (QP) problem: minimise a quadratic function subject to linear constraints: arg min_{w,b} { (1/2) ‖w‖² } s.t. t_n (w^T φ(x_n) + b) ≥ 1. Introduce Lagrange multipliers {a_n ≥ 0} to get the Lagrangian L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }. (Note the negative sign in front of the multipliers, since this is a minimisation problem.)

Maximum Margin Classifiers. Lagrangian: L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }. Setting the derivatives with respect to w and b to zero gives w = Σ_{n=1}^N a_n t_n φ(x_n) and 0 = Σ_{n=1}^N a_n t_n. Eliminating w and b gives the dual representation (again a QP problem): maximise L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m) subject to Σ_{n=1}^N a_n t_n = 0 and a_n ≥ 0 for n = 1, ..., N, where k(x_n, x_m) = φ(x_n)^T φ(x_m).

Maximum Margin Classifiers. Dual representation (again a QP problem): maximise L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m) subject to Σ_{n=1}^N a_n t_n = 0 and a_n ≥ 0 for n = 1, ..., N. Vector form: maximise L(a) = 1^T a − (1/2) a^T Q a subject to a^T t = 0 and a_n ≥ 0 for n = 1, ..., N, where Q_{nm} = t_n t_m k(x_n, x_m).
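A hedged sketch (not part of the lecture) of assembling Q and maximising the dual with cvxpy as a stand-in QP solver; the toy data, the RBF kernel with gamma = 0.45, and the small ridge added so the solver accepts Q as positive semidefinite are all illustrative assumptions.

import numpy as np
import cvxpy as cp

# toy two-class data; the RBF kernel and gamma are illustrative choices
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 0.5, (10, 2)), rng.normal(1.5, 0.5, (10, 2))])
t = np.array([-1.0] * 10 + [1.0] * 10)

sq = np.sum(X**2, axis=1)
K = np.exp(-0.45 * (sq[:, None] + sq[None, :] - 2 * X @ X.T))   # Gram matrix
Q = np.outer(t, t) * K                                          # Q[n, m] = t_n t_m k(x_n, x_m)

a = cp.Variable(len(t))
ridge = 1e-8 * np.eye(len(t))               # tiny jitter so the solver accepts Q as PSD
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, Q + ridge))
cp.Problem(objective, [a >= 0, a @ t == 0]).solve()

a_opt = a.value                             # most entries are ~0; the rest mark support vectors
print(np.sum(a_opt > 1e-6), "support vectors")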

Maximum Margin Classifiers - Prediction. Predict via y(x) = w^T φ(x) + b = Σ_{n=1}^N a_n t_n k(x, x_n) + b. The Karush-Kuhn-Tucker (KKT) conditions are a_n ≥ 0, t_n y(x_n) − 1 ≥ 0, a_n { t_n y(x_n) − 1 } = 0. Therefore, either a_n = 0 or t_n y(x_n) − 1 = 0. If a_n = 0, the point makes no contribution to the prediction! After training, use only the set S of points for which a_n > 0 (and therefore t_n y(x_n) = 1) holds. S contains only the support vectors.

Maximum Margin Classifiers - Support Vectors. S contains only the support vectors. (Figure: decision and margin boundaries for a two-class problem using Gaussian kernel functions.)

Maximum Margin Classifiers - Support Vectors. Now for the bias b. Observe that for the support vectors, t_n y(x_n) = 1. Therefore t_n ( Σ_{m∈S} a_m t_m k(x_n, x_m) + b ) = 1, and we can use any support vector to calculate b. Numerically more stable: multiply by t_n, observe t_n² = 1, and average over all support vectors: b = (1/N_S) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) ), where N_S is the number of support vectors.
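A hedged sketch (not part of the lecture) of the bias formula and the support-vector prediction y(x) = Σ_{n∈S} a_n t_n k(x, x_n) + b; the function names, the threshold on a_n, and the RBF kernel with gamma = 0.45 are illustrative assumptions, and X, t, a are assumed to come from a dual solution such as the one sketched above.

import numpy as np

def rbf(x, z, gamma=0.45):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def fit_bias_and_predictor(X, t, a, tol=1e-6, gamma=0.45):
    # keep only the support vectors (a_n > 0)
    sv = a > tol
    X_s, t_s, a_s = X[sv], t[sv], a[sv]
    # b = (1/N_S) sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )
    K_s = np.array([[rbf(xn, xm, gamma) for xm in X_s] for xn in X_s])
    b = np.mean(t_s - K_s @ (a_s * t_s))

    def predict(x_new):
        # y(x) = sum_{n in S} a_n t_n k(x, x_n) + b, class = sign(y(x))
        k = np.array([rbf(x_new, xn, gamma) for xn in X_s])
        return np.sign(k @ (a_s * t_s) + b)

    return b, predict

# usage with the dual solution from the previous sketch:
# b, predict = fit_bias_and_predictor(X, t, a_opt)
# print(predict(np.array([0.5, 0.5])))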

SVMs: Summary. Maximise the dual objective L(a) = 1^T a − (1/2) a^T Q a subject to a^T t = 0 and a_n ≥ 0 for n = 1, ..., N. Make predictions via y(x) = Σ_{n=1}^N a_n t_n k(x, x_n) + b. The prediction depends only on data points with a_n > 0.

Non-Separable Case

Non-Separable Data. What if the training data in the feature space are not linearly separable? Can we find a separator that makes the fewest possible errors?

Non-Separable Data. What if the training data in the feature space are not linearly separable? Allow some data points to be on the wrong side of the decision boundary, and increase a penalty with distance from the decision boundary; assume the penalty increases linearly with distance. Introduce a slack variable ξ_n ≥ 0 for each data point n: ξ_n = 0 means the data point is correctly classified and on the margin boundary or beyond; 0 < ξ_n < 1 means the data point is inside the margin but on the correct side of the decision boundary; ξ_n = 1 means the data point is on the decision boundary; ξ_n > 1 means the data point is misclassified.

Non-Separable Data. The constraints of the separable classification, t_n y(x_n) ≥ 1 for n = 1, ..., N, now change to t_n y(x_n) ≥ 1 − ξ_n for n = 1, ..., N. In sum, we have the constraints ξ_n ≥ 0 and ξ_n ≥ 1 − t_n y(x_n).

Non-Separable Data. Now solve min_{w,b,ξ} C Σ_{n=1}^N ξ_n + (1/2) ‖w‖² s.t. ξ_n ≥ 0, ξ_n ≥ 1 − t_n y(x_n), where C > 0 controls the trade-off between the slack variable penalty and the margin. We minimise the total slack associated with all data points. Any misclassified point contributes ξ_n > 1, therefore Σ_{n=1}^N ξ_n is an upper bound on the number of misclassified points.
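A hedged sketch (not part of the lecture) of this soft-margin primal with explicit slack variables on a deliberately non-separable toy data set; cvxpy again stands in for any QP solver, and the data, C, and the identity feature map are illustrative assumptions.

import numpy as np
import cvxpy as cp

# non-separable toy data (one point of each class sits inside the other cluster)
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5],
              [-2.0, -1.0], [-1.5, -2.5], [2.2, 1.8]])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0                                              # slack penalty vs. margin trade-off

w = cp.Variable(2)
b = cp.Variable()
xi = cp.Variable(len(t))
objective = cp.Minimize(C * cp.sum(xi) + 0.5 * cp.sum_squares(w))
constraints = [xi >= 0, xi >= 1 - cp.multiply(t, X @ w + b)]
cp.Problem(objective, constraints).solve()

print(np.round(xi.value, 3))     # xi_n > 1 indicates a misclassified point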

Non-Separable Data. The Lagrangian with the Lagrange multipliers a = (a_1, ..., a_N)^T and µ = (µ_1, ..., µ_N)^T is now L(w, b, ξ, a, µ) = (1/2) ‖w‖² + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n { t_n y(x_n) − 1 + ξ_n } − Σ_{n=1}^N µ_n ξ_n. The KKT conditions for n = 1, ..., N are a_n ≥ 0, t_n y(x_n) − 1 + ξ_n ≥ 0, a_n ( t_n y(x_n) − 1 + ξ_n ) = 0, µ_n ≥ 0, ξ_n ≥ 0, µ_n ξ_n = 0.

Non-Separable Data. The dual Lagrangian after eliminating w, b, ξ, and µ is then L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m), to be maximised subject to Σ_{n=1}^N a_n t_n = 0 and 0 ≤ a_n ≤ C for n = 1, ..., N. The only change from the separable case is the box constraint via the parameter C.

Non-Separable Data. Equivalent formulation by Schölkopf et al. (2000): maximise L(a) = −(1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m k(x_n, x_m) subject to Σ_{n=1}^N a_n t_n = 0, 0 ≤ a_n ≤ 1/N for n = 1, ..., N, and Σ_{n=1}^N a_n ≥ ν, where ν is both (1) an upper bound on the fraction of margin errors, and (2) a lower bound on the fraction of support vectors.
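This ν-parameterisation is available off the shelf; a hedged sketch (not part of the lecture) using scikit-learn's NuSVC, with the non-separable toy data from the soft-margin sketch above and an illustrative choice of nu (gamma = 0.45 matches the figure on the next slide).

import numpy as np
from sklearn.svm import NuSVC

X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5],
              [-2.0, -1.0], [-1.5, -2.5], [2.2, 1.8]])   # same non-separable toy data as above
t = np.array([1, 1, 1, -1, -1, -1])

clf = NuSVC(nu=0.3, kernel='rbf', gamma=0.45)
clf.fit(X, t)
print(clf.support_)                  # indices of the support vectors
print(clf.predict([[0.0, 0.0]]))     # class of a new point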

Non-Separable Data. (Figure: the ν-SVM algorithm using Gaussian kernels exp(−γ ‖x − x′‖²) with γ = 0.45 applied to a non-separable data set in two dimensions; support vectors are indicated by circles.)

Support Vector Machines - Limitations. Outputs are decisions, not posterior probabilities. Extension to classification with more than two classes is problematic. There is a complexity parameter C (or ν) which must be found (e.g. via cross-validation). Predictions are expressed as linear combinations of kernel functions that are centred on the training points. The kernel matrix is required to be positive definite.

Optimization View. Recall from decision theory that we want to minimise the expected loss: p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx. For finite data, we minimise the negative log likelihood, appropriately regularised. More generally, we minimise the empirical risk (defined by a loss function) plus a regulariser: R_emp(w, X, t) + Ω(w) = Σ_{n=1}^N l(x_n, t_n; w) + Ω(w), where the sum is the loss per example and Ω(w) is the regulariser.

Logistic Regression Loss Function? Consider labels t_n ∈ {−1, +1}. Recall the definition of the sigmoid σ(·); then p(t_n = +1 | x_n) = σ(w^T φ(x_n)) = 1 / (1 + exp(−w^T φ(x_n))) and p(t_n = −1 | x_n) = 1 − σ(w^T φ(x_n)) = 1 / (1 + exp(+w^T φ(x_n))), which can be compactly written as p(t_n | x_n) = σ(t_n w^T φ(x_n)) = 1 / (1 + exp(−t_n w^T φ(x_n))). Assuming independence, the negative log likelihood is Σ_{n=1}^N log(1 + exp(−t_n w^T φ(x_n))).
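A minimal numpy sketch (not part of the lecture) of this negative log likelihood for labels in {−1, +1}; the function name and the tiny check data are illustrative assumptions.

import numpy as np

def negative_log_likelihood(w, Phi, t):
    # sum_n log(1 + exp(-t_n w^T phi(x_n))), labels t_n in {-1, +1}
    margins = t * (Phi @ w)
    return np.sum(np.logaddexp(0.0, -margins))   # numerically stable log(1 + exp(-m))

# tiny check: confidently correct predictions give a small loss
Phi = np.array([[1.0, 2.0], [-1.0, -2.0]])
t = np.array([1.0, -1.0])
print(negative_log_likelihood(np.array([1.0, 1.0]), Phi, t))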

SVM Loss Function? Soft margin SVM: min_{w,ξ} C Σ_{n=1}^N ξ_n + (1/2) ‖w‖² s.t. ξ_n ≥ 0, ξ_n ≥ 1 − t_n y(x_n). Observe the equivalent unconstrained loss: min_{ξ,y} ξ s.t. ξ ≥ 0, ξ ≥ 1 − y is equivalent to min_y max{0, 1 − y}. We denote the hinge loss by [1 − y]_+ = max{0, 1 − y}.

Linear regression (squared loss): (1/2) Σ_{n=1}^N { t_n − w^T φ(x_n) }² + (λ/2) Σ_{d=1}^D w_d² = (1/2) (t − Φw)^T (t − Φw) + (λ/2) w^T w. Logistic regression (log loss): Σ_{n=1}^N log(1 + exp(−t_n w^T φ(x_n))) + (λ/2) w^T w, where the first term is − ln p(t | w). Support vector machine (hinge loss): C Σ_{n=1}^N [1 − t_n w^T φ(x_n)]_+ + (1/2) w^T w.
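A hedged side-by-side evaluation (not part of the lecture) of the three losses as functions of the margin value z = t · w^T φ(x), matching the plot on the next slide; the grid of z values is an illustrative assumption, and the squared loss uses (t − y)² = (1 − z)² for t ∈ {−1, +1}.

import numpy as np

z = np.linspace(-2.0, 2.0, 9)              # z = t * w^T phi(x)

hinge = np.maximum(0.0, 1.0 - z)           # [1 - z]_+
log_loss = np.logaddexp(0.0, -z)           # log(1 + exp(-z))
squared = (1.0 - z) ** 2                   # (t - w^T phi(x))^2 = (1 - z)^2 for t in {-1, +1}

for zi, h, l, s in zip(z, hinge, log_loss, squared):
    print(f"z={zi:5.2f}  hinge={h:5.2f}  log={l:5.2f}  squared={s:5.2f}")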

(Figure: loss functions E(z) plotted against z; hinge loss in blue, cross-entropy loss in red.)