Neural Networks
Prof. Dr. Rudolf Kruse
Computational Intelligence Group, Faculty for Computer Science
kruse@iws.cs.uni-magdeburg.de

Supervised Learning / Support Vector Machines

Supervised Learning: a Diagnosis System for Diseases
Training data: expression profiles of patients with known diagnosis.
The known diagnosis gives the data a structure which we want to generalize to future data.
Learning/Training: derive a decision rule from the training data which separates the two classes.
Ability to generalize: how useful is the decision rule when it comes to diagnosing patients in the future?
Aim: find a decision rule with a high ability to generalize!

Learning from Examples
Given: training data of patients with known diagnosis, $X = \{(x_i, y_i)\}_{i=1}^{n}$, consisting of
$x_i \in \mathbb{R}^g$ (points, expression profiles) and $y_i \in \{+1, -1\}$ (classes, e.g. two kinds of cancer).
Decision function: $f_X : \mathbb{R}^g \to \{+1, -1\}$, diagnosis $= f_X(\text{new patient})$.

Underfitting / Overfitting (figure)

Linear Separation of Training Data
Begin with linear separation and increase the complexity in a second step with a kernel function.
A separating hyperplane is defined by a normal vector $w$ and an offset $b$:
Hyperplane $H = \{x \mid \langle w, x \rangle + b = 0\}$,
where $\langle \cdot, \cdot \rangle$ is called the inner product or scalar product.

Predicting the Class of a New Point
Training: choose $w$ and $b$ such that the hyperplane separates the training data.
Prediction: which side of the hyperplane is the new point located on?
Points on the side that the normal vector points to are diagnosed as POSITIVE.
Points on the other side are diagnosed as NEGATIVE.
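
The prediction step can be written down in a few lines. The following is a minimal sketch (not part of the original slides) of the decision rule $\mathrm{sgn}(\langle w, x \rangle + b)$; the hyperplane parameters and the test point are made-up values for illustration.

```python
import numpy as np

def predict(w, b, x):
    """Classify a new point by the side of the hyperplane <w, x> + b = 0 it lies on."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hypothetical hyperplane and patient profile
w = np.array([0.5, -1.0])
b = 0.25
print(predict(w, b, np.array([2.0, 1.0])))  # +1: on the side the normal vector points to
```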

Motivation
Origin in Statistical Learning Theory; class of optimal classifiers.
Core problem of Statistical Learning Theory: the ability to generalize. When does a low training error lead to a low real error?
Binary classification problem: classification mapping function $f(x, u): x \mapsto y \in \{+1, -1\}$,
where $x$ is a sample from one of the two classes and $u$ is the parameter vector of the classifier.
Given a learning sample with $l$ observations $x_1, x_2, \ldots, x_l$ along with their class labels $y_1, y_2, \ldots, y_l$, the empirical risk (error rate) for the training dataset is
$R_{\mathrm{emp}}(u) = \frac{1}{2l} \sum_{i=1}^{l} |y_i - f(x_i, u)| \in [0, 1]$.
Many classifiers minimize the empirical risk, e.g. neural networks.
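
As a small illustration of the formula above, the following sketch (not from the slides) computes $R_{\mathrm{emp}}$ for hypothetical label vectors; since each misclassified sample contributes $|y_i - f(x_i, u)| = 2$, dividing by $2l$ yields the error rate.

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """R_emp = 1/(2l) * sum |y_i - f(x_i, u)|; every wrong label contributes 2, so this is the error rate."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.abs(y_true - y_pred).sum() / (2 * len(y_true))

print(empirical_risk([+1, -1, +1, -1], [+1, +1, +1, -1]))  # 0.25: one of four samples misclassified
```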

Motivation
Expected value of the classification error (expected risk):
$R(u) = E\{R_{\mathrm{test}}(u)\} = E\{\tfrac{1}{2}|y - f(x, u)|\} = \int \tfrac{1}{2}|y - f(x, u)|\, p(x, y)\, dx\, dy$
$p(x, y)$: distribution density of all possible samples $x$ along with their class label $y$.
(This expression cannot be evaluated directly, as $p(x, y)$ is not available.)
Optimal sample classification: search for a deterministic mapping function $f(x, u): x \mapsto y \in \{+1, -1\}$ that minimizes the expected risk.
Core question of sample classification: how close do we get to the real error after seeing $l$ training samples? How well can we estimate the real risk $R(u)$ from the empirical risk $R_{\mathrm{emp}}(u)$?
(Structural Risk Minimization instead of Empirical Risk Minimization)
The answer is given by the learning theory of Vapnik and Chervonenkis, which leads to SVMs.

SVMs for linearly separable classes
Previous solution:
General hyperplane: $wx + b = 0$
Classification: $\mathrm{sgn}(wx + b)$
Training, e.g. by the perceptron algorithm (iterative learning, correction after every misclassification; no unique solution), as sketched below.
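
For reference, here is a minimal sketch of the perceptron training rule mentioned above (my own illustration, not taken from the slides); for separable data it finds some separating hyperplane, but not a unique or maximal-margin one.

```python
import numpy as np

def perceptron_train(X, y, epochs=100, eta=1.0):
    """Iterative perceptron learning: correct w and b after every misclassification."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or exactly on the hyperplane)
                w += eta * yi * xi
                b += eta * yi
                errors += 1
        if errors == 0:                          # all training samples separated
            break
    return w, b
```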

Which hyperplane is the best, and why? (figure)

No exact cut, but a ... (figure, continued on the next slide)

Separate the training data with a maximal separation margin (figure)

Separate the training data with a maximal separation margin
Try linear separation, but accept errors:
Penalty for errors: distance to the hyperplane times an error weight C.

SVMs for linearly separable classes
With SVMs we search for a separating hyperplane with maximal margin.
Optimum: the hyperplane with the largest margin $2\delta$ among all possible separating hyperplanes.
This is intuitively meaningful: at constant intra-class scatter, the confidence of correct classification grows with increasing inter-class distance.
SVMs are theoretically justified by Statistical Learning Theory.

SVMs for linearly separable classes
Large-margin classifier: separating line 2 is better than line 1. (figure)

SVMs for linearly separable classes
(Figure: separating hyperplane $wx + b = 0$ with the canonical hyperplanes $wx + b = +1$ and $wx + b = -1$; the points $x_1$ and $x_2$ lie on the canonical hyperplanes, each at distance $\delta$ from the separating hyperplane.)
Training samples are classified correctly if: $y_i(wx_i + b) > 0$.
Invariance of this expression under positive scaling leads to $y_i(wx_i + b) \geq 1$ with the canonical hyperplanes:
$wx_i + b = +1$ (class with $y_i = +1$)
$wx_i + b = -1$ (class with $y_i = -1$)
The distance between the canonical hyperplanes results from projecting $x_1 - x_2$ onto the unit-length normal vector $\frac{w}{\|w\|}$:
$2\delta = \frac{2}{\|w\|}$, i.e. $\delta = \frac{1}{\|w\|}$.
Maximizing $\delta$ $\Leftrightarrow$ minimizing $\|w\|^2$.

SVMs for linearly separable classes
The optimal separating hyperplane is found by minimizing a quadratic function subject to linear constraints.
Primal optimization problem:
minimize $J(w, b) = \frac{1}{2}\|w\|^2$
subject to the constraints $y_i(wx_i + b) \geq 1$, $i = 1, 2, \ldots, l$.
Introducing the Lagrange function
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i [y_i(wx_i + b) - 1]$, $\alpha_i \geq 0$,
leads to the dual problem: maximize $L(w, b, \alpha)$ with respect to $\alpha$, under the conditions
$\frac{\partial L(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i$
$\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0$

SVMs for linearly separable classes
Inserting these terms into $L(w, b, \alpha)$:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i [y_i(wx_i + b) - 1]$
$= \frac{1}{2}\, w \cdot w - w \cdot \sum_{i=1}^{l} \alpha_i y_i x_i - b \sum_{i=1}^{l} \alpha_i y_i + \sum_{i=1}^{l} \alpha_i$
$= \frac{1}{2}\, w \cdot w - w \cdot w + \sum_{i=1}^{l} \alpha_i$
$= -\frac{1}{2}\, w \cdot w + \sum_{i=1}^{l} \alpha_i$
$= -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j\, x_i x_j + \sum_{i=1}^{l} \alpha_i$

SVMs for linearly separable classes
Dual optimization problem:
maximize $L'(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j\, x_i x_j$
subject to the constraints $\alpha_i \geq 0$ and $\sum_{i=1}^{l} y_i \alpha_i = 0$.
This optimization problem can be solved numerically with standard quadratic programming techniques.
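
As an illustration of how this is done in practice, here is a sketch that feeds the dual into the quadratic programming solver of the cvxopt library; the library choice and the tiny regularization term are my assumptions, not part of the lecture. Since cvxopt minimizes $\frac{1}{2}\alpha^T P \alpha + q^T \alpha$, the maximization above is written as the equivalent minimization.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(X, y):
    """Solve: maximize sum_i a_i - 1/2 sum_ij y_i y_j a_i a_j (x_i . x_j)
    subject to a_i >= 0 and sum_i y_i a_i = 0, in cvxopt's standard QP form."""
    l = len(y)
    K = X @ X.T                                          # matrix of scalar products x_i . x_j
    P = matrix(np.outer(y, y) * K + 1e-10 * np.eye(l))   # quadratic term (tiny ridge for numerical stability)
    q = matrix(-np.ones(l))                              # linear term, negated because cvxopt minimizes
    G = matrix(-np.eye(l))                               # -a_i <= 0  <=>  a_i >= 0
    h = matrix(np.zeros(l))
    A = matrix(y.astype(float).reshape(1, -1))           # equality constraint sum_i y_i a_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                            # the Lagrange multipliers alpha_i
```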

SVMs for linearly separable classes
Solution of the optimization problem:
$w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i = \sum_{x_i \in SV} \alpha_i^* y_i x_i$
$b^* = -\frac{1}{2}\, w^* (x_p + x_m)$ for arbitrary $x_p \in SV$ with $y_p = +1$ and $x_m \in SV$ with $y_m = -1$,
where $SV = \{x_i \mid \alpha_i^* > 0,\ i = 1, 2, \ldots, l\}$ is the set of all support vectors.
Classification rule: $\mathrm{sgn}(w^* x + b^*) = \mathrm{sgn}\left[\left(\sum_{x_i \in SV} \alpha_i^* y_i x_i\right) x + b^*\right]$
The classification depends only on the support vectors!
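
The formulas above translate directly into code. A small sketch of my own, assuming the $\alpha_i^*$ have already been obtained from the dual problem:

```python
import numpy as np

def recover_hyperplane(alphas, X, y, tol=1e-8):
    """Build w* and b* from the dual solution; only support vectors (alpha_i > 0) contribute."""
    sv = alphas > tol                                   # mask of the support vectors
    w = np.sum((alphas[sv] * y[sv])[:, None] * X[sv], axis=0)
    x_p = X[sv & (y == +1)][0]                          # an arbitrary positive support vector
    x_m = X[sv & (y == -1)][0]                          # an arbitrary negative support vector
    b = -0.5 * np.dot(w, x_p + x_m)
    return w, b

def classify(w, b, x):
    """Classification rule sgn(w* x + b*)."""
    return int(np.sign(np.dot(w, x) + b))
```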

SVMs for linearly separable classes
Example: support vectors (figure)

SVMs for linearly separable classes
Example: class $+1$ contains $x_1 = (0, 0)$ and $x_2 = (1, 0)$; class $-1$ contains $x_3 = (2, 0)$ and $x_4 = (0, 2)$. (figure)

SVMs for linearly separable classes
The dual optimization problem is:
maximize $L'(\alpha) = (\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4) - \frac{1}{2}(\alpha_2^2 - 4\alpha_2\alpha_3 + 4\alpha_3^2 + 4\alpha_4^2)$
subject to the constraints $\alpha_i \geq 0$ and $\alpha_1 + \alpha_2 - \alpha_3 - \alpha_4 = 0$.
Solution: $\alpha_1^* = 0$, $\alpha_2^* = 1$, $\alpha_3^* = \frac{3}{4}$, $\alpha_4^* = \frac{1}{4}$
$SV = \{(1, 0), (2, 0), (0, 2)\}$
$w^* = 1 \cdot (1, 0) - \frac{3}{4}(2, 0) - \frac{1}{4}(0, 2) = (-\frac{1}{2}, -\frac{1}{2})$
$b^* = -\frac{1}{2}\, (-\frac{1}{2}, -\frac{1}{2}) \cdot ((1, 0) + (2, 0)) = \frac{3}{4}$
Optimal separating line: $x + y = \frac{3}{2}$
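
The numbers in this example can be checked with a few lines of code; the following sketch plugs the stated $\alpha^*$ values into the formulas for $w^*$ and $b^*$:

```python
import numpy as np

X = np.array([[0, 0], [1, 0], [2, 0], [0, 2]], dtype=float)
y = np.array([+1, +1, -1, -1])
alphas = np.array([0.0, 1.0, 0.75, 0.25])

w = np.sum((alphas * y)[:, None] * X, axis=0)   # -> [-0.5, -0.5]
b = -0.5 * np.dot(w, X[1] + X[2])               # x_p = (1, 0), x_m = (2, 0) -> 0.75
print(w, b)
print(np.sign(X @ w + b))                       # [+1, +1, -1, -1]: all training points correctly classified
```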

SVMs for linearly separable classes
Observations:
For the support vectors: $\alpha_i > 0$.
For all training samples outside the margin: $\alpha_i = 0$.
The support vectors form a sparse representation of the sample; they are sufficient for classification.
The solution is the global optimum and is unique.
The optimization procedure only requires the scalar products $x_i x_j$.

SVMs for non-linearly separable classes
In this example there is no separating line such that $y_i(wx_i + b) \geq 1$ for all $i$. (Figure: points relative to the lines $y = -1$, $y = 0$, $y = +1$ with slack values $\xi = 0$, $\xi < 1$ and $\xi > 1$.)
Three possible cases:
A) Vectors beyond the margin which are correctly classified, i.e. $y_i(wx_i + b) \geq 1$
B) Vectors within the margin which are correctly classified, i.e. $0 \leq y_i(wx_i + b) < 1$
C) Vectors that are not correctly classified, i.e. $y_i(wx_i + b) < 0$
All three cases can be interpreted as $y_i(wx_i + b) \geq 1 - \xi_i$ with:
A) $\xi_i = 0$, B) $0 < \xi_i \leq 1$, C) $\xi_i > 1$

SVMs for non-linearly separable classes
Motivation for the generalization:
The previous approach gives no solution for classes that are not linearly separable.
Improve generalization in the presence of outliers within the margin.
Soft-margin SVM: introduce slack variables $\xi_i$. (Figure: outliers with slack variables $\xi_i$ and $\xi_j$.)

SVMs for non-linearly separable classes
Penalty for outliers via slack variables.
Primal optimization problem:
minimize $J(w, b, \xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$
subject to the constraints $y_i(wx_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$ for all $i$.
Dual optimization problem:
maximize $L'(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j\, x_i x_j$
subject to the constraints $0 \leq \alpha_i \leq C$ and $\sum_{i=1}^{l} y_i \alpha_i = 0$.
(Neither the slack variables nor their Lagrange multipliers occur in the dual optimization problem.)
The only difference compared to the linearly separable case: the constant C in the constraints.
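
In practice the soft-margin dual is rarely solved by hand; libraries implement it directly. A sketch using scikit-learn's SVC, which is an external tool not mentioned in the lecture; the toy data are made up.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, slightly overlapping two-dimensional classes
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0)      # C is the error weight from the primal objective
clf.fit(X, y)
print(clf.coef_, clf.intercept_)       # w and b of the separating hyperplane
print(len(clf.support_vectors_))       # number of support vectors found by the solver
```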

SVMs for non-linearly separable classes
Solution of the optimization problem:
$w^* = \sum_{i=1}^{l} \alpha_i^* y_i x_i = \sum_{x_i \in SV} \alpha_i^* y_i x_i$
$b^* = y_k (1 - \xi_k) - w^* x_k$ with $k = \arg\max_i \alpha_i^*$,
where $SV = \{x_i \mid \alpha_i^* > 0,\ i = 1, 2, \ldots, l\}$ describes the set of all support vectors.

SVMs for non-linearly separable classes
Example: non-linearly separable classes (figure)

Non-linear SVMs
Non-linear class boundaries $\Rightarrow$ a linear classifier has low precision.
Example: the transformation $\Psi(x) = (x, x^2)$ makes $C_1$ and $C_2$ linearly separable (see the sketch below).
Idea: transform the attributes $x \in \mathbb{R}^n$ into a higher-dimensional space $\mathbb{R}^m$, $m > n$, by $\Psi: \mathbb{R}^n \to \mathbb{R}^m$ and search for an optimal linear separating hyperplane in this space.
The transformation $\Psi$ increases the linear separability.
A separating hyperplane in $\mathbb{R}^m$ corresponds to a non-linear separating surface in $\mathbb{R}^n$.
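
A tiny sketch (with toy data of my own) of the transformation $\Psi(x) = (x, x^2)$ mentioned above: one-dimensional data that no single threshold can separate become linearly separable after the mapping.

```python
import numpy as np

def psi(x):
    """Transformation from the slide: map a 1-D point to (x, x^2)."""
    return np.array([x, x ** 2])

# Hypothetical 1-D data: class C1 lies between the two C2 clusters, so no threshold separates them.
C1 = np.array([-1.0, 0.0, 1.0])
C2 = np.array([-3.0, -2.5, 2.5, 3.0])

# After the transformation, the line x2 = 3 (i.e. w = (0, 1), b = -3) separates the classes.
print([psi(x)[1] < 3 for x in C1])   # [True, True, True]
print([psi(x)[1] < 3 for x in C2])   # [False, False, False, False]
```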

Non-linear SVMs
Problem: high dimensionality of the attribute space $\mathbb{R}^m$.
E.g. polynomials of degree $p$ over $\mathbb{R}^n$ lead to $\mathbb{R}^m$ with $m = O(n^p)$.
The kernel trick:
Originally, in $\mathbb{R}^n$: only the scalar products $x_i x_j$ are required.
Now, in $\mathbb{R}^m$: only the scalar products $\Psi(x_i)\Psi(x_j)$ are required.
Solution: there is no need to compute $\Psi(x_i)\Psi(x_j)$ explicitly; instead, express it at reduced complexity with a kernel function $K(x_i, x_j) = \Psi(x_i)\Psi(x_j)$.
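
Since the classification rule from the linearly separable case only uses scalar products, replacing each product $x_i x$ by $K(x_i, x)$ yields the kernelized decision rule. A minimal sketch of my own formulation (this explicit form is not shown on the slides):

```python
import numpy as np

def kernel_decision(x, sv_x, sv_y, sv_alpha, b, kernel):
    """Kernelized classification rule: sgn( sum_i alpha_i y_i K(x_i, x) + b ),
    obtained by replacing every scalar product x_i . x with K(x_i, x)."""
    value = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(sv_alpha, sv_y, sv_x)) + b
    return int(np.sign(value))
```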

Non-linear SVMs
Example: For the transformation $\Psi: \mathbb{R}^2 \to \mathbb{R}^6$,
$\Psi((y_1, y_2)) = (y_1^2,\ y_2^2,\ \sqrt{2}\,y_1,\ \sqrt{2}\,y_2,\ \sqrt{2}\,y_1 y_2,\ 1)$,
the kernel function
$K(x_i, x_j) = (x_i x_j + 1)^2$
$= ((y_{i1}, y_{i2}) \cdot (y_{j1}, y_{j2}) + 1)^2$
$= (y_{i1} y_{j1} + y_{i2} y_{j2} + 1)^2$
$= (y_{i1}^2,\ y_{i2}^2,\ \sqrt{2}\,y_{i1},\ \sqrt{2}\,y_{i2},\ \sqrt{2}\,y_{i1} y_{i2},\ 1) \cdot (y_{j1}^2,\ y_{j2}^2,\ \sqrt{2}\,y_{j1},\ \sqrt{2}\,y_{j2},\ \sqrt{2}\,y_{j1} y_{j2},\ 1)$
$= \Psi(x_i)\Psi(x_j)$
computes the scalar product in the new attribute space $\mathbb{R}^6$.
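
This identity is easy to verify numerically; the following sketch (my own check) compares the kernel value with the explicit scalar product in $\mathbb{R}^6$ for two arbitrary points:

```python
import numpy as np

def psi6(y):
    """Explicit feature map Psi: R^2 -> R^6 from the slide."""
    y1, y2 = y
    return np.array([y1**2, y2**2, np.sqrt(2)*y1, np.sqrt(2)*y2, np.sqrt(2)*y1*y2, 1.0])

def K(x, z):
    """Polynomial kernel (x . z + 1)^2."""
    return (np.dot(x, z) + 1.0) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, z), np.dot(psi6(x), psi6(z)))   # both 4.0: the kernel equals the scalar product in R^6
```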

Non-linear SVMs
Example: $\Psi: \mathbb{R}^2 \to \mathbb{R}^3$, $\Psi((y_1, y_2)) = (y_1^2,\ \sqrt{2}\,y_1 y_2,\ y_2^2)$
The kernel function $K(x_i, x_j) = (x_i x_j)^2 = \Psi(x_i)\Psi(x_j)$ computes the scalar product in the new attribute space $\mathbb{R}^3$. It is possible to compute the scalar product of $\Psi(x_i)$ and $\Psi(x_j)$ without applying the function $\Psi$.

Non-linear SVMs
Commonly used kernel functions:
Polynomial kernel: $K(x_i, x_j) = (x_i x_j)^d$
Gauss kernel: $K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{c}}$
Sigmoid kernel: $K(x_i, x_j) = \tanh(\beta_1 x_i x_j + \beta_2)$
Linear combinations of valid kernels yield new kernel functions.
We do not need to know what the new attribute space $\mathbb{R}^m$ looks like. The only thing we need is the kernel function as a measure of similarity.
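
These kernels are straightforward to implement; a short sketch of the formulas above (the parameter defaults are arbitrary choices of mine):

```python
import numpy as np

def polynomial_kernel(xi, xj, d=2):
    """K(x_i, x_j) = (x_i . x_j)^d"""
    return np.dot(xi, xj) ** d

def gauss_kernel(xi, xj, c=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / c)"""
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / c)

def sigmoid_kernel(xi, xj, beta1=1.0, beta2=0.0):
    """K(x_i, x_j) = tanh(beta1 * x_i . x_j + beta2)"""
    return np.tanh(beta1 * np.dot(xi, xj) + beta2)
```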

Non-linear SVMs
Example: Gauss kernel ($c = 1$). The support vectors are marked by an extra circle. (figure)

Non-linear SVMs
Example: Gauss kernel ($c = 1$) for a soft-margin SVM. (figure)

Final Remarks
Advantages of SVMs:
According to current knowledge, SVMs yield very good classification results; on some tasks they are considered the top performer.
Sparse representation of the solution by support vectors.
Easy to apply: few parameters, no a priori knowledge required.
Geometrically intuitive operation.
Theoretical statements about the results: global optimum, ability to generalize.
Disadvantages of SVMs:
The learning process is slow and memory-intensive.
Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done by trial and error.

Final Remarks
A list of SVM implementations can be found at http://www.kernel-machines.org/software
The most common one is LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
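
For completeness, a hedged sketch of how LIBSVM can be called from its official Python bindings; the package name, the svm_train/svm_predict interface, and the toy data (reusing the earlier example) are assumptions on my part, not content from the lecture.

```python
# Assumes: pip install libsvm  (official Python bindings for LIBSVM)
from libsvm.svmutil import svm_train, svm_predict

# Toy data in LIBSVM's sparse format: one dict {feature_index: value} per sample
y = [+1, +1, -1, -1]
x = [{1: 0, 2: 0}, {1: 1, 2: 0}, {1: 2, 2: 0}, {1: 0, 2: 2}]

model = svm_train(y, x, '-t 0 -c 10')        # -t 0: linear kernel, -c: error weight C
labels, accuracy, values = svm_predict(y, x, model)
print(labels)                                # predicted classes for the training points
```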