Machine Learning 2010


Machine Learning 2010. Michael M. Richter. Support Vector Machines. Email: mrichter@ucalgary.ca

Topic: This chapter deals with concept learning the numerical way: all concepts, problems and decisions are numerically coded, so one deals with sets of numbers. Support vector machines are a special technique for this.

Classification in ℝ^n. Situation: Objects are coded as points in ℝ^n; the learning domain X is a subset of ℝ^n. There are two classes, denoted by the values {+1, -1}. Examples are therefore of the form (x_i, b_i) with x_i = (x_i1, ..., x_in) ∈ X and b_i ∈ {+1, -1}. Task: generate a hypothesis h describing a classifier. First approach: Assumption: positive and negative examples are linearly separable, i.e. they can be separated by a linear function. Then h has to be a hyperplane in X that separates the positive from the negative examples.

Example (1). [Figure: a two-dimensional example (n = 2), axes x_i1 and x_i2, with positive (+1) and negative (-1) points separated by the line h with normal vector w; the distance of h from the origin is -d / ‖w‖.] The linear function h separates the examples. Description of h: ⟨w, z⟩ + d = 0, i.e. h = { z | ⟨w, z⟩ + d = 0 }. Scalar product (inner product): ⟨w, z⟩ = w_1 z_1 + w_2 z_2 ( = ‖w‖ ‖z‖ cos(angle[w, z]) ).

Example (2). [Same figure as before.] h = { z | ⟨w, z⟩ + d = 0 }. Suppose x is a point not yet classified. Then: classify x as +1 if ⟨w, x⟩ + d ≥ 0, and as -1 if ⟨w, x⟩ + d < 0. The classification with b ∈ {+1, -1} is correct for x iff b(⟨w, x⟩ + d) ≥ 0.
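
A minimal sketch of this decision rule in Python (assuming NumPy; the weight vector w, offset d, test point x and label b below are made up purely for illustration):

    import numpy as np

    # Hypothetical hyperplane parameters for the n = 2 example
    w = np.array([1.0, 2.0])   # normal vector of the hyperplane h
    d = -3.0                   # offset term

    def classify(x):
        # classify x as +1 if <w, x> + d >= 0, else as -1
        return 1 if np.dot(w, x) + d >= 0 else -1

    x = np.array([2.0, 1.5])
    b = 1                                    # assumed true label of x
    correct = b * (np.dot(w, x) + d) >= 0    # correctness criterion from the slide
    print(classify(x), correct)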

Goal: Find a decision plane that separates a set of objects having different class memberships. Minimize the empirical classification error and maximize the geometric margin at the same time.

Hyperplanes (1). [Figure: several hyperplanes, each of which separates the training examples; axes x_i1 and x_i2.] There are several classifying hyperplanes: which one is the best? Criterion 1: robustness when new examples are inserted. Criterion 2: quality of prediction (e.g. minimal cost of prediction errors).

Hyperplanes (2). Criterion 1: robustness if new examples are inserted. Criterion 2: quality of prediction. [Figure: the maximum-margin hyperplane h with the two parallel boundary planes h1 and h2, enclosing a margin (boundary) of breadth 2q.] Maximum-margin hyperplane: the plane for which the minimal distance to positive and negative examples is maximal.

Properties of the maximum-margin hyperplane: minimal probability that h changes if new examples are inserted; maximal expected robustness of the prediction; maximal expected quality of prediction. Consequence: the distance q = min_i b_i(⟨w, x_i⟩ + d) / ‖w‖ to the nearest positive and negative examples is maximal. Learning task: construct the maximum-margin hyperplane.

Construction of the Maximum-Margin Hyperplane. Observe: the maximum-margin hyperplane depends only on those positive and negative examples with minimal distance! The corresponding vectors are called support vectors; methods for determining the maximum-margin hyperplane are called support vector machines. [Figure: the margin of breadth q on each side of the hyperplane.]

Which Points Determine the Margin? For b_i = +1 we have ⟨w, x_i⟩ + d ≥ 0; for b_i = -1 we have ⟨w, x_i⟩ + d < 0; hence b_i(⟨w, x_i⟩ + d) ≥ 0 for all i (with h), and b_i(⟨w, x_i⟩ + d) / ‖w‖ ≥ q for all i. [Figure: the hyperplane h with the two margin planes h1 and h2.] For b_i = +1: ⟨w, x_i⟩ + d ≥ ε (with h1); for b_i = -1: ⟨w, x_i⟩ + d ≤ -ε (with h2); hence b_i(⟨w, x_i⟩ + d) ≥ ε for all i (with h1 and h2). q maximal, q = ε / ‖w‖.

Search w and d with Maximal Margin. Look for w and d such that q = min_i b_i(⟨w, x_i⟩ + d) / ‖w‖ is maximal, resp. ε = min_i b_i(⟨w, x_i⟩ + d) is maximal. Idea: for the hyperplane h the direction of w is fixed, while d and the length ‖w‖ are variable. Normal form: choose w such that ε = 1; then q = 1 / ‖w‖ and the margin has a width of 2 / ‖w‖. So 2 / ‖w‖ should be maximal with b_i(⟨w, x_i⟩ + d) ≥ 1 for all i; equivalently ½ ‖w‖ should be minimal with b_i(⟨w, x_i⟩ + d) ≥ 1 for all i; equivalently ½ ⟨w, w⟩ should be minimal with b_i(⟨w, x_i⟩ + d) ≥ 1 for all i.

Learning = Optimizing. The optimization problem: determine w and d such that ½ ⟨w, w⟩ is minimal (goal) and b_i(⟨w, x_i⟩ + d) - 1 ≥ 0 for all i from {1, ..., m} (condition). How to solve such an optimization problem in general? Given: f, c_1, ..., c_m : ℝ^n → ℝ. Wanted: w in ℝ^n such that f(w) is minimal/maximal (goal) and c_j(w) = 0 for all j from {1, ..., m} (condition). General approach: Lagrange multipliers! L(w) = f(w) - α_1 c_1(w) - ... - α_m c_m(w) with variables α_j (the Lagrange multipliers): n + m unknowns w_1, ..., w_n, α_1, ..., α_m and n + m equations for the extrema: ∂L/∂w_i = 0 and c_j(w) = 0.
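
A minimal sketch of this optimization problem, assuming the cvxpy and numpy libraries and a small linearly separable toy data set chosen here only for illustration (the QP solver handles the Lagrangian machinery internally):

    import numpy as np
    import cvxpy as cp

    X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # examples x_i
    b_lab = np.array([1, 1, -1, -1])                                # labels b_i

    w = cp.Variable(2)
    d = cp.Variable()

    # Goal: minimize 1/2 <w, w>; condition: b_i(<w, x_i> + d) - 1 >= 0 for all i
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                         [cp.multiply(b_lab, X @ w + d) >= 1])
    problem.solve()

    print("w =", w.value, "d =", d.value,
          "margin width =", 2 / np.linalg.norm(w.value))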

Noise, Not Representative Data. [Figure: a separating hyperplane h shown with margins of width 1/‖w‖, 2/‖w‖ and 3/‖w‖.] Basic idea: soft margin instead of (hard) margin. See the chapter on PAC learning.

Weakly Separating Hyperplane. Choose c ∈ ℝ, c > 0, and minimize ‖w‖² + c · Σ_{i=1..n} ξ_i such that for all i: f(x_i) = ⟨w, x_i⟩ + b ≥ 1 - ξ_i for y_i = +1 and f(x_i) = ⟨w, x_i⟩ + b ≤ -1 + ξ_i for y_i = -1, where the ξ_i ≥ 0 are slack variables. Equivalent: y_i · f(x_i) ≥ 1 - ξ_i.

Meaning of ξ. [Figure: the lines f(x) = -1, f(x) = 0 and f(x) = +1, with three kinds of points: ξ = 0 (on or outside the margin, correctly classified), 0 < ξ < 1 (inside the margin but on the correct side), ξ > 1 (misclassified).]
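
A minimal sketch of this soft-margin formulation, assuming scikit-learn: SVC with a linear kernel solves an equivalent problem (up to a constant factor in the objective), and the slack values ξ_i can be recovered from the decision function. The toy data set is made up, with one deliberately noisy point:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 2.0], [3.0, 3.0], [1.8, 1.8]])
    y = np.array([-1, -1, +1, +1, -1])           # the last point is noise

    clf = SVC(kernel='linear', C=1.0).fit(X, y)  # smaller C = softer margin

    # Slack xi_i = max(0, 1 - y_i * f(x_i)); xi_i > 1 means x_i is misclassified
    f = clf.decision_function(X)
    xi = np.maximum(0.0, 1.0 - y * f)
    print("w =", clf.coef_, "b =", clf.intercept_, "slack =", xi)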

Not Linearly Separable Data. In applications linearly separable data are rare. One approach: remove a minimal set of points such that the remaining points are linearly separable (i.e. minimal classification error). Problem: such an algorithm is exponential.

More Complex Examples (1). [Figure: a data set (axes x_i1, x_i2) in which the positive (+1) examples are surrounded by the negative (-1) examples, so no line separates them.] Here we consider examples where the nonlinearity is not a consequence of noise but results from the nature of the problem.

More Complex Examples (2). Idea: transformation φ of the domain X into another space X' such that in X' the positive and negative training examples are linearly separable. Remark: the dimensions of X and X' may be different! X' = φ(X).

More Complex Examples (3). [Figure: on the left the original space (n = 2, axes x_i1 and x_i2), where the classifier is a non-linear ellipse; on the right the transformed space X' = φ(X) (n = 3, axes z_1, z_2, z_3), where the classifier is a linear hyperplane.] φ(x) = φ(x_1, x_2) = (x_1², √2·x_1·x_2, x_2²).
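
A minimal sketch of this feature map in Python (assuming NumPy); it also previews the kernel idea of the next slides, since the inner product in the transformed space equals the squared inner product in the original space:

    import numpy as np

    def phi(x):
        # phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 0.5])

    lhs = np.dot(phi(x), phi(y))   # inner product in the transformed space X'
    rhs = np.dot(x, y) ** 2        # homogeneous polynomial kernel (d = 2) in X
    print(lhs, rhs)                # both equal 16.0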

More General: Kernels. Kernel function = inner product in some space (which may be very complex). Kernel methods exploit the properties of an inner product space. Kernels occur in many machine learning methods, not only in this chapter.

General Kernel Functions Instead of the Scalar Product. Examples: polynomials, homogeneous: K(x, y) = (⟨x, y⟩)^d; polynomials, inhomogeneous: K(x, y) = (⟨x, y⟩ + 1)^d; radial basis function, for g > 0: K(x, y) = exp(-g ‖x - y‖²). They describe situations of non-linearly separable character.
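
A minimal sketch of these three kernel functions (assuming NumPy; the parameter values d and g are arbitrary examples):

    import numpy as np

    def poly_homogeneous(x, y, d=2):
        # K(x, y) = (<x, y>)^d
        return np.dot(x, y) ** d

    def poly_inhomogeneous(x, y, d=2):
        # K(x, y) = (<x, y> + 1)^d
        return (np.dot(x, y) + 1) ** d

    def rbf(x, y, g=0.5):
        # K(x, y) = exp(-g * ||x - y||^2), g > 0
        return np.exp(-g * np.sum((x - y) ** 2))

    x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    print(poly_homogeneous(x, y), poly_inhomogeneous(x, y), rbf(x, y))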

The Kernel Trick (1). The kernel trick is a method for using a linear classifier algorithm to solve a non-linear problem by mapping the original non-linear observations into a higher-dimensional space, where the linear classifier is subsequently used. This makes a linear classification in the new space equivalent to a non-linear classification in the original space.

The Kernel Trick (2). This is done using Mercer's theorem: any continuous, symmetric, positive semi-definite function K(x, y) can be expressed as a scalar product in a high-dimensional space, i.e. there exists a function φ such that K(x, y) = ⟨φ(x), φ(y)⟩.
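
A minimal sketch of the kernel trick in practice, assuming scikit-learn: on a data set that is not linearly separable in the original space (two concentric circles), a linear SVM fails while an RBF-kernel SVM separates the classes without ever computing φ explicitly:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

    linear = SVC(kernel='linear').fit(X, y)
    kernelized = SVC(kernel='rbf', gamma=1.0).fit(X, y)

    print("linear kernel accuracy:", linear.score(X, y))      # roughly chance level
    print("RBF kernel accuracy:   ", kernelized.score(X, y))  # close to 1.0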

Example. [Figure: a data set in the original space X and its image in the transformed space X' = φ(X), where it becomes linearly separable.]

Simplicity. Aside: the principle of structural risk minimization. [Figure: three classifiers of different complexity: too simple (errors), the right simplicity, and one that is correct but not simple enough.]

Complexity Problem (1). Training a support vector machine (SVM) requires solving a quadratic programming (QP) problem in a number of coefficients equal to the number of training examples. For very large datasets, standard numeric techniques for QP become infeasible. Practical techniques decompose the problem into manageable subproblems over parts of the data. A disadvantage of this technique is that it may give an approximate solution and may require many passes through the dataset to reach a reasonable level of convergence.

Complexity Problem (2). An on-line alternative is training an SVM incrementally. However, adding new data by discarding all previous data except their support vectors gives only approximate results. A better way is to do incremental learning as an exact on-line method that constructs the solution recursively, one point at a time. The key is to keep the optimization conditions satisfied on all previously seen data while adiabatically adding a new data point to the solution.
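
A minimal sketch of the naive incremental scheme mentioned above (keep only the current support vectors and retrain when a new batch arrives), assuming scikit-learn and a made-up data stream; as the slide notes, this is only an approximation of the exact incremental method:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    def make_batch(n):
        # two Gaussian clouds shifted apart by their label
        y = rng.choice([-1, 1], size=n)
        X = rng.normal(size=(n, 2)) + 1.5 * y[:, None]
        return X, y

    X, y = make_batch(200)
    clf = SVC(kernel='linear', C=1.0).fit(X, y)

    # Keep only the support vectors, then retrain together with the new batch
    X_kept, y_kept = X[clf.support_], y[clf.support_]
    X_new, y_new = make_batch(50)
    clf = SVC(kernel='linear', C=1.0).fit(np.vstack([X_kept, X_new]),
                                          np.concatenate([y_kept, y_new]))
    print("support vectors after the update:", len(clf.support_))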

Complexity Problem (3). In adiabatic increments the margin vector coefficients change value during each incremental step to keep all elements in equilibrium, i.e. to keep the optimization conditions satisfied. The examples are added one by one. At each step, the valid margins are updated by expressing the new solution in terms of the old solution plus a new term. A MATLAB package implements the methods for exact incremental/decremental SVM learning, regularization parameter perturbation and kernel parameter perturbation presented in "SVM Incremental Learning, Adaptation and Optimization" by Christopher Diehl and Gert Cauwenberghs.

Applications (1). There is a large and increasing number of applications; we mention some typical ones. Face recognition. Text classification. Generalized predictive control: controlling chaotic dynamics with small parameter perturbations. Statistical learning theory for geospatial and spatio-temporal environmental data analysis and modelling, with comparisons to geostatistical predictions and simulations. Personalized and learner-centered learning is receiving increasing importance; here SVMs stand out due to their good performance, especially in handling the high dimensionality that text content possesses.

Applications (2). NewsRec is an SVM-driven personal recommender system designed for news websites; it uses SVMs to predict whether articles are interesting or not. Bioinformatics applications: coding sequences in DNA encode proteins. Protein remote homology detection is a central problem in computational biology; supervised learning algorithms based on support vector machines are currently among the most effective methods for remote homology detection. Such applications are typical of the kind of problem where SVMs do well.

Typical Applications. SVMs in OCR (optical character recognition): databases with handwritten digits.


Typical Applications. SVMs in image recognition.

Tools. LIBSVM toolbox: the function svmtrain can be used to train the model and the function svmpredict to classify the test data. svmtrain has a kernel-type option with four values: linear, polynomial, radial basis, and sigmoid; the radial basis kernel type is often recommended. DTREG generates SVM, decision tree and logistic regression models: http://www.dtreg.com
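
A minimal sketch of the same train/predict workflow in Python, assuming scikit-learn (whose SVC class wraps LIBSVM); the kernel argument mirrors the kernel-type option mentioned above, and the toy data split is made up:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = SVC(kernel='rbf', C=1.0, gamma='scale')        # radial basis kernel, as recommended
    model.fit(X_train, y_train)                            # corresponds to svmtrain
    print("test accuracy:", model.score(X_test, y_test))   # corresponds to svmpredict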

Summary (1): Main Elements. Chosen parameters (kernel, ...). Form of admissible hypotheses. Efficiency requirements for learning. Relevant aspects: quality criteria, efficiency requirements for classifiers, interpretability of classifiers, correctness of data.

Summary (2). Geometric interpretation. Hyperplanes and hypersurfaces. Kernels. Best separation. Non-linear separability. Applications and tools.

Recommended Literature
T. Mitchell: Machine Learning. McGraw-Hill, 1997.
B. Schölkopf: Support Vector Learning. Oldenbourg, 1997.
http://www.kernel-machines.org
I. Bratko, I. Kononenko: Learning Diagnostic Rules from Incomplete and Noisy Data. AI Methods in Statistics, 16-17 Dec. 1986, London. In: B. Phelps (ed.), Interactions in Artificial Intelligence and Statistical Methods, Technical Press, 1987.
Serdar Iplikci: Support Vector Machines-Based Generalized Predictive Control. International Journal of Robust and Nonlinear Control, Vol. 16, pp. 843-862, 2006.
M. Kanevski, N. Gilardi, E. Mayoraz, M. Maignan: Spatial Data Classification with Support Vector Machines. Geostat 2000 Congress, South Africa, April 2000.
Gert Cauwenberghs, Tomaso Poggio: Incremental and Decremental Support Vector Machine Learning. In: Advances in Neural Information Processing Systems, Volume 13, 2001.