Machine learning and pattern recognition Part 2: Classifiers


Machine learning and pattern recognition, Part 2: Classifiers. [Figure: bulbous iris I. histrio (photo F. Vrecoulon); botanical dwarf iris I. pumila attica (Vrecoulon); tall iris 'Ecstatic Echo'. http://www.iris-bulbeuses.org/iris/] Tarik AL ANI, Département Informatique et Télécommunication, ESIEE-Paris. E-mail: t.alani@esiee.fr, Url: http://www.esiee.fr/~alanit

0. Training-based modelling. Machine learning has largely been devoted to solving problems related to data mining, text categorisation [6], biomedical problems such as data analysis [7], magnetic resonance imaging [8, 9], signal processing [10], speech recognition [11, 12], image processing [13-19] and other fields. In general, machine learning or pattern recognition is used as a technique for modelling data, patterns or a physical process.

0. Training-based modelling. It is only after raw data acquisition, preprocessing, and the extraction and selection of the most informative features from representative data (see the first part of this course, RF1) that we are finally ready to choose the type of classifier and its corresponding training algorithm to construct a model of the object or process of interest.

0. Training-based classifiers and regressors. Supervised learning framework. Consider the problem of separating (according to some given criterion, by a line, a hyperplane, ...) a set of training vectors $\{p_{iq} \in \mathbb{R}^R\}$, $i \in \{1, 2, \dots, n_q\}$, $q \in \{1, 2, \dots, Q\}$, called training feature vectors, where Q is the maximum number of classes. Given a set of pairs $(p_{iq}, y_{iq})$, $i = 1, 2, \dots, n_q$, $q = 1, 2, \dots, Q$:

$$D = \{(p_{11}, y_{11}), (p_{21}, y_{21}), \dots, (p_{n_1 1}, y_{n_1 1}),\; (p_{12}, y_{12}), (p_{22}, y_{22}), \dots, (p_{n_2 2}, y_{n_2 2}),\; \dots,\; (p_{1Q}, y_{1Q}), \dots, (p_{n_Q Q}, y_{n_Q Q})\}$$

$n_q$ could be the same for every q. $y_{iq}$ is the desired output (target) corresponding to the input feature vector $p_{iq}$ in class q. For example, $y_{iq} \in [0, 1]$ or $y_{iq} \in [-1, 1]$ in two-class classification, or $y_{iq} \in \mathbb{R}$ in regression.

0. Training-based classifiers and regressors. In theory, the problem of classification (or regression) is to find a function f that maps an R×1 input feature vector to an output: a class label (in the classification case) or a real value (in the regression case), in which the information is encoded in an appropriate manner. [Figure: feature input space p mapped by y = f(p) to the output space y.]

0. Training-based classifiers and regressors. Once the problem of classification (or regression) is defined, a variety of mathematical tools, such as optimisation algorithms, can be used to build a model.

0. Training-based classifiers and regressors. The classification problem. Recall that a classifier considers a set of feature vectors $\{p_i \in \mathbb{R}^R\}$ (or scalars), $i = 1, 2, \dots, N$, from objects or processes, each of which belongs to a known class $c_q$, $q \in \{1, \dots, Q\}$. This set is called the training feature vectors. Once the classifier is trained, the problem is then to assign to new given feature vectors (field feature vectors) $p_i = [p_{i1}\ p_{i2}\ \dots\ p_{iR}]^T \in \mathbb{R}^R$, $i \in \{1, 2, \dots, M\}$, the best class labels (classifier) or the best real values (regressor). In this course we focus mainly on the classification problem.

0. Training-based classifiers and regressors. Example: classification of Iris flowers. Fisher's Iris data is a set of multivariate data introduced by Sir Ronald Aylmer Fisher (1936) as an example of discriminant analysis. Sir Ronald Aylmer Fisher FRS (17 February 1890 - 29 July 1962) was an English statistician, evolutionary biologist, eugenicist and geneticist. http://en.wikipedia.org/wiki/Ronald_A._Fisher [Figure: bulbous iris I. histrio (photo F. Vrecoulon); botanical dwarf iris I. pumila attica (Vrecoulon); tall iris 'Ecstatic Echo'. http://www.iris-bulbeuses.org/iris/]

0. Training-based classifiers and regressors. Example: classification of Iris flowers (cont.). In botany, a sepal is one of the leafy, generally green pieces which together make up the calyx and support the flower's corolla. A petal is a floral piece that surrounds the reproductive organs of a flower; constituting one of the foliose pieces which together make up the corolla, it is a modified leaf.

0. Training-based classifiers and regressors. Example: classification of Iris flowers (cont.). The data consist of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). Four features were measured for each sample: the length and width of the sepals and petals, in centimetres. The data set contains 3 classes, where each class refers to a type of iris plant. One class is linearly separable from the other 2; class 2 and class 3 ('versicolor' and 'virginica') are NOT linearly separable from each other.

Sample   Ls      Ws      Lp      Wp      Species
1        5.1     3.5     1.4     0.2     'setosa'
2        4.9     3.0     1.4     0.2     'setosa'
3        4.7     3.2     1.3     0.2     'setosa'
4        4.6     3.1     1.5     0.2     'setosa'
...
51       7.0     3.2     4.7     1.4     'versicolor'
52       6.4     3.2     4.5     1.5     'versicolor'
53       6.9     3.1     4.9     1.5     'versicolor'
54       5.5     2.3     4.0     1.3     'versicolor'
...
101      6.3     3.3     6.0     2.5     'virginica'
102      5.8     2.7     5.1     1.9     'virginica'
103      7.1     3.0     5.9     2.1     'virginica'
104      6.3     2.9     5.6     1.8     'virginica'
...
150      5.9     3.0     5.1     1.8     'virginica'

[Figure: Iris setosa, Iris versicolor, Iris virginica.]

0. Training-based classifiers.
1) Statistical Pattern Recognition approaches [26]. Several approaches exist; the most important are:
1.1 Bayes classifier [26]
1.2 Naive Bayes classifier [26]
1.3 Linear and Quadratic Discriminant Analysis [26]
1.4 Support vector machines (SVM) [1, 2, 26]
1.5 Hidden Markov models (HMMs) [27, 26, 31, 32]
2) Neural networks [26]
3) Decision trees [26]
In this course, we introduce only 1.1, 1.2, 1.4 and 2.

1. Statistical classifiers. Although the most common pattern recognition algorithms are classified as statistical approaches as opposed to neural network approaches, it is possible to show that they are closely related, and even that there is a certain equivalence relation between statistical approaches and their corresponding neural networks.

1. Statistical classifiers. 1.1 Bayes classifier. Thomas Bayes (c. 1701 - 7 April 1761) was an English mathematician and Presbyterian minister, known for having formulated a specific case of the theorem that bears his name: Bayes' theorem. Bayes never published what would eventually become his most famous accomplishment; his notes were edited and published after his death by Richard Price. http://en.wikipedia.org/wiki/Thomas_Bayes We introduce the techniques inspired by Bayes decision theory. In statistical approaches, feature instances (data samples) are treated as random variables (scalars or vectors) drawn from a probability distribution, where each instance has a certain probability of belonging to a class, determined by its probability distribution in the class. To build a classifier, these distributions must either be known in advance or be learned from data.

1.1 Bayes classifier. The feature vector $p_i$ belonging to class $c_q$ is considered as an observation drawn at random from the class-conditional probability distribution $pr(p_i \mid c_q)$. Remark: pr denotes a probability in the discrete feature case or a probability density function in the continuous feature case. This distribution is called the likelihood: it is the conditional probability of observing a feature vector $p_i$ given that the true class is $c_q$.

1.1 Bayes classifier. Maximum a posteriori probability (MAP). Two cases:
1. All classes have equal prior probabilities $pr(c_q)$. In this case, the class with the greatest likelihood is the most likely to be the right class: the decision selects the class $c_{q^*}$ such that $pr(p_i \mid c_{q^*}) = \max_{q} pr(p_i \mid c_q)$, $q \in \{1, 2, 3, \dots, Q\}$, which then also maximises the posterior probability, i.e. the conditional probability that the true class is $c_{q^*}$ given the feature vector $p_i$.
2. The classes do not all have equal prior probabilities (some classes may be inherently more likely). The likelihood is then converted by Bayes' theorem into a posterior probability $pr(c_q \mid p_i)$.

1.1 Bayes classifier. Bayes' theorem:

$$pr(c_q \mid p_i) = \frac{pr(p_i \mid c_q)\, pr(c_q)}{pr(p_i)} = \frac{pr(p_i \mid c_q)\, pr(c_q)}{\sum_{q=1}^{Q} pr(p_i \mid c_q)\, pr(c_q)}$$

where $pr(c_q \mid p_i)$ is the a posteriori probability, $pr(p_i \mid c_q)$ the likelihood, $pr(c_q)$ the a priori probability and $pr(p_i)$ the evidence (total probability). Note that the evidence

$$pr(p_i) = \sum_{q=1}^{Q} pr(p_i \mid c_q)\, pr(c_q)$$

is the same for all classes, and therefore its value is inconsequential to the final classification.
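As a small numerical illustration of the theorem (all numbers below are made up), the posterior of each class can be computed in MATLAB from the likelihoods and the priors; the evidence is just the normalising sum and does not change which class wins:

% Posterior from likelihoods and priors via Bayes' theorem (illustrative numbers)
lik   = [0.02; 0.10; 0.05];      % pr(p_i | c_q), q = 1..3, at an observed p_i
prior = [0.5; 0.3; 0.2];         % pr(c_q)
evidence  = sum(lik .* prior);   % pr(p_i), identical for every class
posterior = (lik .* prior) / evidence
[~, q_map] = max(posterior)      % MAP class; the evidence does not affect the argmax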

1.1 Bayes classifier. The prior probability can be estimated from prior experience. If such an experiment is not possible, it can be estimated: either by the ratio between the number of feature vectors in each class and the total number of feature vectors; or by considering all these probabilities to be equal, if the number of feature vectors is not sufficient to make this estimation.

1.1 Bayes classifier. A classifier constructed in this way is usually called a Bayes classifier or Bayes decision rule, and it can be shown that this classifier is optimal, with minimal error in the statistical sense.

1.1 Bayes classifier. More general form. This approach does not assume that all errors are equally costly; it instead tries to minimise the expected risk $R(a_q \mid p_i)$, the expected loss of taking action $a_q$.

1.1 Bayes classifier. While taking action $a_q$ is usually understood as selecting a class $c_q$, refusing to take an action may also be considered as an action, allowing the classifier not to make a decision if the estimated risk of doing so is smaller than that of selecting one of the classes.

1.1 Bayes classifier. The expected risk can be calculated as

$$R(c_q \mid p_i) = \sum_{q'=1}^{Q} \lambda(c_q \mid c_{q'})\, pr(c_{q'} \mid p_i)$$

where $\lambda(c_q \mid c_{q'})$ is the loss incurred in taking action $c_q$ when the correct class is $c_{q'}$. If one associates the action $a_q$ with the selection of $c_q$, and if all errors are equally costly, the zero-one loss is obtained:

$$\lambda(c_q \mid c_{q'}) = \begin{cases} 0 & \text{if } q = q' \\ 1 & \text{if } q \neq q' \end{cases}$$

1.1 Bayes classifier. This loss function assigns no loss to a correct classification and a loss of 1 to a misclassification. The risk corresponding to this loss function is then

$$R(c_q \mid p_i) = \sum_{q' \neq q} pr(c_{q'} \mid p_i) = 1 - pr(c_q \mid p_i), \qquad q = 1, \dots, Q$$

proving that the class that maximises the posterior probability minimises the expected risk.
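The following MATLAB sketch (again with made-up posterior values) evaluates the expected risk of each candidate action for a given loss matrix and picks the minimum-risk action; with the zero-one loss above this reduces to the MAP rule:

% Bayes decision by minimum expected risk (illustrative numbers)
posterior = [0.6; 0.3; 0.1];     % pr(c_q' | p_i) for Q = 3 classes
lambda    = [0 1 1;              % zero-one loss: lambda(q,q') = 1 if q ~= q'
             1 0 1;
             1 1 0];
risk = lambda * posterior;       % R(c_q | p_i) for each candidate action
[~, q_star] = min(risk)          % with the zero-one loss, risk = 1 - posterior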

1.1 Bayes classifier. Out of the three terms in the optimal Bayes decision rule, the evidence is unnecessary and the prior probability can be easily estimated, but we have not mentioned how to obtain the key third term, the likelihood. Yet it is this critical likelihood term whose estimation is usually very difficult, particularly for high-dimensional data, rendering the Bayes classifier impractical for most applications of practical interest.

1.1 Bayes classifier. One cannot discard the Bayes classifier outright, however, as several ways exist in which it can still be used: (1) if the likelihood is known, it is the optimal classifier; (2) if the form of the likelihood function is known (e.g., Gaussian) but its parameters are unknown, they can be estimated using the parametric approach: maximum likelihood estimation (MLE) [26]*. * See the Appendix in the first part of the lecture (RF1).

1.1 Bayes classifier. (3) Even the form of the likelihood function can be estimated from the training data using a non-parametric approach, for example by using Parzen windows [1]*; however, this approach becomes computationally expensive as dimensionality increases. (4) The Bayes classifier can be used as a benchmark for the performance of new classifiers by using artificially generated data whose distributions are known. * See the first part of the lecture (RF1).

1. Statistical classifiers. 1.2 Naïve Bayes classifier. As mentioned above, the main disadvantage of the Bayes classifier is the difficulty of estimating the likelihood (class-conditional) probabilities, particularly for high-dimensional data, because of the curse of dimensionality: a large number of training instances must be available to obtain a reliable estimate of the corresponding multidimensional probability density function (pdf) when the features may be statistically dependent on each other.

1.2 Naïve Bayes classifier. There is a highly practical solution to this problem, however, which is to assume class-conditional independence of the primitives $p_i$ in $p = [p_1, \dots, p_R]^T$:

$$pr(p \mid c_q) = \prod_{i=1}^{R} pr(p_i \mid c_q)$$

which yields the so-called Naïve Bayes classifier. This equation basically requires that the i-th primitive $p_i$ of instance p is independent of all other primitives in p, given the class information.

1.2 Naïve Bayes classifier. It should be noted that this is not nearly as restrictive as assuming full independence, that is,

$$pr(p) = \prod_{i=1}^{R} pr(p_i)$$

1.2 Naïve Bayes classifier. The classification rule corresponding to the Naïve Bayes classifier is then to compute the discriminant function representing the posterior probabilities,

$$g_q(p) = pr(c_q) \prod_{i=1}^{R} pr(p_i \mid c_q)$$

for each class $c_q$, and then to choose the class for which the discriminant function $g_q(p)$ is largest. The main advantage of this approach is that it only requires the univariate densities $pr(p_i \mid c_q)$ to be computed, which are much easier to estimate than the multivariate density $pr(p \mid c_q)$.
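A minimal MATLAB sketch of this rule, assuming Gaussian univariate class-conditional densities estimated from a training matrix Ptrain (one instance per row) and a numeric label vector lab; all names are illustrative, and log-discriminants are used for numerical stability (save as naive_bayes_predict.m):

function yhat = naive_bayes_predict(Ptrain, lab, Ptest)
% Gaussian Naive Bayes: pick the class maximising g_q(p) = pr(c_q) * prod_i pr(p_i|c_q)
classes = unique(lab);               % lab: numeric class labels
Q = numel(classes); [M, R] = size(Ptest);
logg = zeros(M, Q);
for q = 1:Q
    Pq    = Ptrain(lab == classes(q), :);
    mu    = mean(Pq, 1);                       % univariate means
    sg2   = var(Pq, 0, 1) + 1e-9;              % univariate variances (regularised)
    prior = size(Pq, 1) / size(Ptrain, 1);     % pr(c_q)
    loglik = -0.5 * sum((Ptest - repmat(mu, M, 1)).^2 ./ repmat(sg2, M, 1) ...
                        + repmat(log(2*pi*sg2), M, 1), 2);
    logg(:, q) = log(prior) + loglik;          % log g_q(p) for every test instance
end
[~, idx] = max(logg, [], 2);
yhat = classes(idx);
end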

1.2 Naïve Bayes classifier. In practice, Naïve Bayes has been shown to provide respectable performance, comparable with that of neural networks, even under mild violations of the independence assumptions.

1.3 Linear and Quadratic Discriminant Analysis. 1.3.1 Linear discriminant analysis (LDA). In the first part of this course (RF1), we introduced the principle of LDA, which can be used for linear classification. In the following we deal briefly with the problems of its practical implementation as a classifier.

1.3.1 Linear discriminant analysis (LDA). In practice, the means and covariances of a given class are not known. They can, however, be estimated from the training set. Either the maximum likelihood estimate or the maximum a posteriori estimate can be used in place of the exact values in the equations given in the first part of the course. Although the covariance estimate can be considered optimal in some sense, this does not mean that the discrimination obtained by substituting these parameters is optimal in all directions, even if the hypothesis of a normal distribution of the classes is correct.

1.3.1 Linear discriminant analysis (LDA). Another complication in the application of LDA and Fisher discrimination to real data occurs when the number of features in the feature vectors of a class is greater than the number of instances in this class. In this case, the covariance estimate does not have full rank and therefore cannot be inverted.

1.3.1 Linear discriminant analysis (LDA). There are a number of methods to address this problem. The first is to use a pseudo-inverse instead of the usual inverse of the matrix $S_W$. The second is to use a shrinkage estimator of the covariance matrix, with a parameter $\delta \in [0, 1]$ called the shrinkage intensity or regularisation parameter. For more details, see e.g. [29, 30].
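A minimal MATLAB sketch of both remedies, assuming the within-class scatter (or covariance) matrix Sw has already been computed as in the first part of the course; the shrinkage target and the value of delta are illustrative choices:

% Remedy 1: pseudo-inverse of a rank-deficient scatter matrix
Sw_inv = pinv(Sw);                              % usable even when Sw cannot be inverted
% Remedy 2: shrinkage estimator of the covariance matrix
delta     = 0.1;                                % shrinkage intensity, chosen in [0, 1]
target    = mean(diag(Sw)) * eye(size(Sw,1));   % simple diagonal shrinkage target
Sw_shrunk = (1 - delta) * Sw + delta * target;  % well conditioned and invertible
Sw_inv2   = inv(Sw_shrunk);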

1.3 Linear and Quadratic Discriminant Analysis. 1.3.2 Quadratic Discriminant Analysis (QDA). A quadratic classifier is used in machine learning and statistical classification to classify data from two or more classes of objects or events by a quadric surface (*). It is a more general version of the linear classifier. (*) A quadric is a second-order algebraic surface given by the general equation $ax^2 + by^2 + cz^2 + 2fyz + 2gzx + 2hxy + 2px + 2qy + 2rz + d = 0$. Quadratic surfaces are also called quadrics, and there are 17 standard-form types. A quadratic surface intersects every plane in a (proper or degenerate) conic section. In addition, the cone consisting of all tangents from a fixed point to a quadratic surface cuts every plane in a conic section, and the points of contact of this cone with the surface form a conic section.

1.3.2 Quadratic Discriminant Analysis (QDA). In statistics, if p is a feature vector consisting of R random features, and A is an R×R square symmetric matrix, then the scalar quantity $p^T A p$ is known as a quadratic form in p.

1.3.2 Quadratic Discriminant Analysis (QDA). The classification problem. For a quadratic classifier, the correct classification is assumed to be of second degree in the features; the class $c_q$ is then decided on the basis of the quadratic discriminant function $g(p) = p^T A p + b^T p + c$. In the special case where each feature vector consists of two features (R = 2), this means that the separation surfaces of the classes are conic sections (a line, a circle or an ellipse, a parabola or a hyperbola). For more details, see e.g. [26].

1.3.2 Quadratic Discriminant Analysis (QDA). Types of conic sections: 1. parabola, 2. circle or ellipse, 3. hyperbola. http://en.wikipedia.org/wiki/File:Conic_sections_with_plane.svg

1. Statistical classifiers. Using the Matlab Statistics toolbox for linear, quadratic and naïve Bayes classifiers. Run Matlab. 1. In the Matlab menu click on "Help". A separate help window will open; then click on "Product Help" and wait for the window that displays all the toolboxes to open. 2. In the search field on the left, type "classification". You get a tutorial on classification on the right side of this window.

1.3 Linear, quadratic and naïve Bayes classifiers. 3. Start with the introduction and follow the tutorial, which guides you through using these methods, by clicking the arrow at the bottom right each time. Learn the use of the two methods introduced in the lecture: "Naive Bayes Classification" and "Discriminant Analysis". Exercise: explore the other methods.
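For orientation, here is a minimal sketch of where the tutorial leads, applied to the Fisher iris data shipped with the toolbox. The function names fitcdiscr and fitcnb are those of recent Statistics Toolbox releases and are given as an assumption; older releases expose the same functionality as ClassificationDiscriminant.fit and NaiveBayes.fit.

load fisheriris                              % meas: 150x4 features, species: labels
lda = fitcdiscr(meas, species);              % discriminant analysis ('DiscrimType','quadratic' for QDA)
nb  = fitcnb(meas, species);                 % naive Bayes with Gaussian univariate densities
err_lda = resubLoss(lda)                     % resubstitution (training) error
err_nb  = resubLoss(nb)
predict(lda, [5.9 3.0 5.1 1.8])              % classify a new flower (measurements in cm)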

1. Statistical classifiers. 1.4 Support vector machines (SVM). Since the 1990s, Support Vector Machines (SVM) have been a major theme in theoretical developments and applications (see for example [1-5]). The theory of SVM is based on the combined contributions of optimisation theory, statistical learning, kernel theory and algorithmics. Recently, SVMs have been applied successfully to solve problems in different areas. Vladimir Vapnik is a leading developer of the theory of SVM. http://clrc.rhul.ac.uk/people/vlad/

1.4 Support vector machines (SVM). Notation. Let $p_i$: input feature vector (point), $p_i = [p_{i1}\ p_{i2}\ \dots\ p_{iR}]^T$, where R is the maximum number of attributes and $i \in \{1, 2, \dots, n_q\}$; $P = [p_1\ p_2\ \dots\ p_Q]$: matrix grouping the feature vectors of the Q classes, where Q is the maximum number of classes ($q \in \{1, 2, \dots, Q\}$); w: weight vector (trained classifier parameters), $w = [w_1\ w_2\ \dots\ w_R]^T$.

1.4 Support vector machines (SVM) / 2. Neural networks. General scheme for training a classifier:
1. Given a pair (P, y_d) of an input matrix P containing the feature vectors of all classes, $P = [p_i]$, $p_i \in \mathbb{R}^R$, $i = 1, 2, \dots, (n_1 + n_2 + \dots + n_Q)$, and a desired output vector $y_d = [y_{1d}, y_{2d}, \dots, y_{(n_1+n_2+\dots+n_Q)d}]$;
2. when an input $p_i$ is presented to the classifier, the stable output $y_i$ of the classifier is calculated;
3. the error vector $E = [e_1, e_2, \dots, e_{(n_1+n_2+\dots+n_Q)}] = [y_1 - y_{1d}, y_2 - y_{2d}, \dots, y_{(n_1+n_2+\dots+n_Q)} - y_{(n_1+n_2+\dots+n_Q)d}]$ is calculated;
4. E is minimised by adjusting the vector w using a specific training algorithm based on some optimisation method.
[Figure: block diagram of the classifier with parameters (weights) w, input $p_i$, output $y_i$, desired output $y_{id}$, and error $e_i = y_i - y_{id}$ fed to the training algorithm.]

1.4 Support vector machines (SVM). Traditional optimisation approaches apply a procedure based on the minimum mean square error (MMSE) between the desired result (desired classifier output: $y_{id} = +1$ for samples $p_i$ from class 1 and $y_{jd} = -1$ for samples $p_j$ from class 2) and the actual result (classifier output: $y_i$).

1.4 Support vector machines (SVM). Linear discriminant functions and decision hyperplanes, two-class case. The decision hypersurface in the R-dimensional feature space is a hyperplane, that is, the linear discriminant function $g(p) = w^T p + b = 0$, where b is known as the threshold or bias. Let $p_1$ and $p_2$ be two points on the decision hyperplane; then the following holds: $w^T p_1 + b = w^T p_2 + b = 0 \Rightarrow w^T (p_1 - p_2) = 0$. Since the difference vector $p_1 - p_2$ obviously lies on the decision hyperplane (for any $p_1$, $p_2$), it is apparent from the final expression that the vector w is always orthogonal to the decision hyperplane. [Figure: hyperplane $w^T p + b = 0$ with bias b and normal vector w.]

1.4 Support vector machines (SVM). Linear discriminant functions and decision hyperplanes, two-class case. Let $w_1 > 0$, $w_2 > 0$ and $b < 0$. Then we can show that the distance of the hyperplane from the origin is

$$z(w, b) = \frac{|b|}{\|w\|} = \frac{|b|}{\sqrt{w_1^2 + w_2^2}}$$

and the distance of a point p from the hyperplane is

$$d(w, b; p) = \frac{|w^T p + b|}{\|w\|} = \frac{|g(p)|}{\sqrt{w_1^2 + w_2^2}}$$

i.e., $g(p)$ is a measure of the Euclidean distance of the point p from the decision hyperplane. On one side of the plane $g(p)$ takes positive values and on the other negative. In the special case b = 0, the hyperplane passes through the origin. [Figure: line crossing the axes at $-b/w_1$ and $-b/w_2$, with the distances z and d indicated.]
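A three-line MATLAB check of these formulas, with arbitrary values for w, b and p:

w = [3; 4]; b = -6; p = [2; 1];     % arbitrary hyperplane parameters and point
z = abs(b) / norm(w)                % distance of the hyperplane from the origin
d = abs(w'*p + b) / norm(w)         % distance of p from the hyperplane g(p) = 0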

1.4 Support vector machines (SVM). Two-class linear SVM. If the hyperplane passes through the origin ($w^T p = 0$): a point $p_1$ forming an angle of less than 90° with w gives $w^T p_1 > 0$; a point at exactly 90° lies on the hyperplane; a point $p_2$ forming an angle greater than 90° gives $w^T p_2 < 0$. [Figure: hyperplane through the origin, with w and the two half-spaces.]

1.4 Support vector machines (SVM). Linear discriminant functions and decision hyperplanes, two-class case. If the hyperplane is biased (does not pass through the origin), the discriminant function is $g(p) = w^T p + b = 0$. Let $w_1 > 0$, $w_2 > 0$ and $b < 0$; as before,

$$z(w, b) = \frac{|b|}{\sqrt{w_1^2 + w_2^2}}, \qquad d(w, b; p) = \frac{|w^T p + b|}{\sqrt{w_1^2 + w_2^2}} = \frac{|g(p)|}{\|w\|}$$

i.e., $g(p)$ is a measure of the Euclidean distance of the point p from the decision hyperplane. On one side of the plane $g(p)$ takes positive values and on the other negative. In the special case b = 0, the hyperplane passes through the origin. [Figure: biased hyperplane crossing the axes at $-b/w_1$ and $-b/w_2$.]

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Traditional approach to adjusting the separating plane. Generalisation capacity: the main question now is how to find a separating hyperplane that classifies the data in an optimal way. What we really want is to minimise the probability of misclassification when classifying a set of feature vectors (field feature vectors) that are different from those used to adjust the weight parameters w and b of the hyperplane (i.e. the training feature vectors).

1.4 Support vector machines (SVM). Suppose two possible hyperplane solutions. Both hyperplanes do the job for the training set. However, which of the two hyperplanes should one choose as the classifier for operation in practice, where data outside the training set (from the field data set) will be fed to it? No doubt the answer is: the full-line one. This hyperplane leaves more space on either side, so that data in both classes can move a bit more freely, with less risk of causing an error. Such a hyperplane can therefore be trusted more when it is faced with the challenge of operating with unknown data, i.e. it increases the generalisation performance of the classifier. We can now accept that a very sensible choice for the hyperplane classifier is the one that leaves the maximum margin from both classes. [Figure: two candidate separating lines (full and dashed) between the two classes.]

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Linear classifiers. We consider that the data are linearly separable and wish to find the best line (or hyperplane) separating them into two classes:

$$w^T p_i + b \ge +1, \;\; p_i \in \text{class 1}, \; y_i = +1; \qquad w^T p_i + b \le -1, \;\; p_i \in \text{class 2}, \; y_i = -1$$

The hypothesis space is then defined by the set of functions (decision surfaces)

$$y_i = f_{w,b}(p_i) = \mathrm{sign}(w^T p_i + b), \qquad y_i \in \{-1, 1\}$$

If the parameters w and b are scaled by the same amount, the decision surface is not changed.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Why is the number x equal to +1 or -1 in $w^T p_i + b \gtrless \pm x$ and not any other number? The parameter x can take any value, which means that the two planes can be close to or distant from one another. By fixing the value of x, and dividing both sides of the above equation by x, we obtain ±1 on the right-hand side; the direction and position in space of the hyperplane do not change. The same applies to the hyperplane described by the equation $w^T p_i + b = 0$: normalising by a constant value x has no effect on the points that are on (and define) the hyperplane. [Figure: the hyperplanes $w^T p + b = +1$, $w^T p + b = 0$ and $w^T p + b = -1$.]

1.4 Support vector machines (SVM) - Support vector classifier (SVC). To avoid this redundancy, and to match each decision surface to a unique pair (w, b), it is appropriate to constrain the parameters w, b by

$$\min_i |w^T p_i + b| = 1$$

The hyperplanes defined by this constraint are called canonical hyperplanes [1]. This constraint is just a normalisation that is convenient for the optimisation problem.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Here we assume that the data are linearly separable, which means that we can draw a line on a graph of $p_1$ vs. $p_2$ separating the two classes when R = 2, and a hyperplane on the graph of $p_1, p_2, \dots, p_R$ when R > 2. As shown before, the distance from the nearest instance in the data set to the line or hyperplane is

$$d(w, b; p_i) = \frac{|w^T p_i + b|}{\|w\|}$$

[Figure: separating line or hyperplane $w^T p + b = 0$ with bias b and normal vector w.]

1.4 Support vector machines (SVM) - Support vector classifier (SVC). In this traditional view, the optimal separating hyperplane is the one that minimises the mean square error (MSE) between the desired results (+1 or -1) and the actual results obtained when classifying the given data into classes 1 and 2 respectively. This MSE criterion turns out to be optimal when the statistical properties of the data are Gaussian. But if the data are not Gaussian, the result will be biased.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Example: in the following figure, two Gaussian clusters of data are separated by a hyperplane adjusted using a minimum-MSE criterion. The samples of both classes have the minimum possible mean squared distance to the hyperplanes $w^T p + b = \pm 1$. [Figure: two Gaussian clusters with the hyperplanes $w^T p + b = +1$, $w^T p + b = 0$ and $w^T p + b = -1$.]

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Example (cont.): in the following figure, the same procedure is applied to a data set that is non-Gaussian (or Gaussian corrupted by some outliers far from the central group), thus biasing the result. [Figure: outlier-contaminated data with the hyperplanes $w^T p + b = +1$, $w^T p + b = 0$ and $w^T p + b = -1$.]

1.4 Support vector machines (SVM) - Support vector classifier (SVC). SVM approach. In the classical classification approaches, a classification error is committed by a point if it is on the wrong side of the decision hyperplane formed by the classifier. In the SVM approach more constraints are imposed: not only the instances on the wrong side of the classifier contribute to the error count, but also any instance lying between the hyperplanes $w^T p + b = \pm 1$, even if it is on the right side of the classifier. Only instances that are outside these limits and on the right side of the classifier do not contribute to the error cost.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Example: two overlapping classes and two linear classifiers, denoted by a dash-dotted and a solid line, respectively. In both cases, the limits have been chosen to include five points. Observe that for the dash-dotted classifier, in order to include five points the margin had to be made narrower.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Imagine that the open and filled circles in the previous figure are houses in two nearby villages, and that a road must be constructed between the two villages. One has to decide where to construct the road so that it will be as wide as possible and incur the least cost (in the sense of demolishing the smallest number of houses). No sensible engineer would choose the dash-dotted option. The idea is similar when designing a classifier: it should be placed between the highly populated (high probability density) areas of the two classes, in a region that is sparse in data, leaving the largest possible margin. This is dictated by the requirement for good generalisation performance that any classifier has to exhibit; that is, the classifier must show good error performance when it is faced with data outside the training set (validation, test or field data).

1.4 Support vector machines (SVM) - Support vector classifier (SVC). To solve the above problem, we keep for now the assumption that the data are separable without misclassification by a linear hyperplane. The optimality criterion is: place the separating hyperplane as far as possible from the nearest instances, while keeping all instances on their correct side. [Figure: separating hyperplane $w^T p + b = 0$.] (*) CAUTION: in some books and papers, the margin is taken to be the distance 2d.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). This translates into: maximise the margin d between the separating hyperplane and the nearest instances, now placing the margin hyperplanes $w^T p + b = \pm 1$ at the edges of the separation margin. [Figure: separating hyperplane $w^T p + b = 0$ with margin hyperplanes $w^T p + b = +1$ and $w^T p + b = -1$.]

1.4 Support vector machines (SVM) - Support vector classifier (SVC). One can reformulate the SVM criterion as: maximise the distance d between the separating hyperplane and the nearest samples, subject to the constraints $y_i [w^T p_i + b] \ge 1$, where $y_i \in \{+1, -1\}$ is the class label associated with the instance $p_i$.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). The margin width (= 2d) between the margin hyperplanes is

$$M(w, b) = 2d = \frac{2}{\|w\|}$$

where $\|\cdot\|$ (sometimes denoted $\|\cdot\|_2$) denotes the Euclidean norm. Demonstration:

$$M(w, b) = \min_{\{p_i : y_i = +1\}} d(w, b; p_i) + \min_{\{p_i : y_i = -1\}} d(w, b; p_i)
= \frac{1}{\|w\|}\Big(\min_{\{p_i : y_i = +1\}} |w^T p_i + b| + \min_{\{p_i : y_i = -1\}} |w^T p_i + b|\Big)
= \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}$$

using the canonical-hyperplane normalisation $\min_i |w^T p_i + b| = 1$.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Maximisation of d is equivalent to a quadratic optimisation: minimise the norm of the vector w. This gives a more useful expression of the SVM criterion:

$$\min_{w, b} \|w\| \quad \text{subject to the constraints} \quad y_i [w^T p_i + b] \ge 1, \; i = 1, 2, \dots, n_q$$

Minimising $\|w\|$ is equivalent to minimising $\frac{1}{2}\|w\|^2$, and the use of this term then allows optimisation by quadratic programming:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{subject to the constraints} \quad y_i [w^T p_i + b] - 1 \ge 0, \; i = 1, 2, \dots, n_q$$

Reminder: $\|w\|^2 = w^T w$.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). In practical situations the samples are not linearly separable, so the previous constraints cannot all be satisfied. For that reason, slack variables must be introduced to account for the non-separable samples [33]. The optimisation criterion then consists of minimising the (primal) functional [33, 21]

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n_c} \xi_i \quad \text{subject to the constraints} \quad y_i [w^T p_i + b] \ge 1 - \xi_i, \;\; \xi_i \ge 0, \; i = 1, 2, 3, \dots, n_c$$

For a simple introduction to the derivation of the SVM optimisation procedures, see for example [20-23].

1.4 Support vector machines (SVM) - Support vector classifier (SVC). If the instance $p_i$ is correctly classified by the hyperplane and lies outside the margin, its corresponding slack variable is $\xi_i = 0$. If it is correctly classified but lies inside the margin, then $0 < \xi_i < 1$. If the sample is misclassified, then $\xi_i > 1$. The value of C is a trade-off between the maximisation of the margin and the minimisation of the errors. [Figure: SVM of 2 non-linearly separable classes.]

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Once the criterion of optimality has been established, we need a method for finding the parameter vector w which meets it. The optimisation problem in the last equations is a classical constrained optimisation problem. In order to solve it, one applies a Lagrange optimisation procedure with as many Lagrange multipliers $\lambda_i$ as there are constraints [22].

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Optimal estimation of w (for a demonstration, see for example [34, 21]). Minimising the cost is a compromise between a large margin and few margin errors. The solution is given as a weighted average of the training instances:

$$w^* = \sum_{i=1}^{n_q} \lambda_i y_i p_i$$

The coefficients $\lambda_i$, with $0 \le \lambda_i \le C$, are the Lagrange multipliers of the optimisation task; they are zero for all instances outside the margin and on the right side of the classifier. These instances do not contribute to the determination of the direction of the classifier (the direction of the hyperplanes defined by w). The remaining instances, with non-zero $\lambda_i$, which contribute to the construction of $w^*$, are called support vectors.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Selecting a value for the parameter C. The free parameter C controls the relative importance of minimising the norm of w (which is equivalent to maximising the margin) and satisfying the margin constraint for each data point. The margin of the solution increases as C decreases. This is natural, because reducing C makes the margin term in the functional $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n_c} \xi_i$ more important. In practice, several SVM classifiers must be trained, using training as well as test data, with different values of C (e.g. from min. to max. in {0.1, 0.2, 0.5, 1, 2, 20}), and the classifier giving the minimum test error is selected.
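A minimal sketch of this selection loop, assuming a helper train_svc(Ptrain, ytrain, C) that returns the hyperplane parameters (w, b) as in the quadprog-based implementation given a few slides below; the helper name and the data variables are illustrative:

Cvals = [0.1 0.2 0.5 1 2 20];
best_err = inf;
for k = 1:numel(Cvals)
    [w, b] = train_svc(Ptrain, ytrain, Cvals(k));   % hypothetical trainer for one value of C
    ypred  = sign(Ptest * w + b);                   % classify the test set (one instance per row)
    err    = mean(ypred ~= ytest);                  % test error rate
    if err < best_err
        best_err = err; C_best = Cvals(k); w_best = w; b_best = b;
    end
end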

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Estimation of b. Any instance $p_s$ which is a support vector, together with its desired response $y_s$, satisfies

$$y_s [w^{*T} p_s + b] = 1 \quad \text{or} \quad y_s \Big( \sum_{j \in S} \lambda_j y_j p_j^T p_s + b \Big) = 1$$

where S is the index set of support vectors ($\lambda_j > 0$). Multiplying by $y_s$ and using $(y_s)^2 = 1$ gives

$$b = y_s - \sum_{j \in S} \lambda_j y_j p_j^T p_s$$

Instead of using an arbitrary support vector $p_s$, it is better to use the average over all support vectors in S:

$$b = \frac{1}{N_S} \sum_{s \in S} \Big( y_s - \sum_{j \in S} \lambda_j y_j p_j^T p_s \Big)$$

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Practical implementation of the SVM algorithm for the separation of 2 linearly separable classes. Create a matrix H with $H_{ij} = y_i y_j p_i^T p_j$, $i, j = 1, 2, \dots, n_c$.
1. Select a value for the parameter C (from min to max, e.g. C in {0.1, 0.2, 0.5, 1, 2, 20}).
2. Find $\Lambda = \{\lambda_1, \lambda_2, \dots, \lambda_{n_c}\}$ such that the quantity

$$\sum_{i=1}^{n_c} \lambda_i - \frac{1}{2} \lambda^T H \lambda$$

is maximised (using a quadratic programming solver), subject to the constraints $0 \le \lambda_i \le C$ and $\sum_{i=1}^{n_c} \lambda_i y_i = 0$.

Practical implementation of the SVM algorithm for the separation of 2 linearly separable classes (cont.):
4. Calculate $w^* = \sum_{i=1}^{n_q} \lambda_i y_i p_i$.
5. Determine the set of support vectors S by finding the indexes i such that $0 < \lambda_i \le C$.
6. Calculate $b = \frac{1}{N_S} \sum_{s \in S} \big( y_s - \sum_{j \in S} \lambda_j y_j p_j^T p_s \big)$.
7. Each new vector p is classified by the following evaluation: if $w^T p + b \ge 1$, then $y = +1$ and $p \in$ class 1; if $w^T p + b \le -1$, then $y = -1$ and $p \in$ class 2.
8. Calculate the training and test errors using test data.
9. Repeat from step 1 (construct another classifier) with the next value of C.
10. Choose the best classifier: the one that minimises the test error with the minimum number of support vectors.

1.4 Support vector machines (SVM) - Support vector classifier (SVC). Example [20]: linear SVM classifier (SVC) in MATLAB. An easy way to program a linear SVC is to use the MATLAB quadratic programming function quadprog.m. First, generate a two-dimensional data set with a few instances from two classes with this simple code:

p = [randn(1,10)-1 randn(1,10)+1; randn(1,10)-1 randn(1,10)+1]';
y = [-ones(1,10) ones(1,10)]';

This generates a matrix of n = 20 row vectors in two dimensions. We study the performance of the SVC on a non-separable set. The first 10 samples are labelled as vectors of class -1, and the rest as vectors of class +1.

1.4 Support vector machines (SVM) - Support vector classifier (SVC).

%% Linear Support Vector Classifier
%%%%%%%%%%%%%%%%%%%%%%%
%%% Data Generation %%%
%%%%%%%%%%%%%%%%%%%%%%%
x=[randn(1,10)-1 randn(1,10)+1;randn(1,10)-1 randn(1,10)+1]';
y=[-ones(1,10) ones(1,10)]';
%%%%%%%%%%%%%%%%%%%%%%%%
%%% SVC Optimization %%%
%%%%%%%%%%%%%%%%%%%%%%%%
R=x*x'; % Dot products
Y=diag(y);
H=Y*R*Y+1e-6*eye(length(y)); % Matrix H, regularized
f=-ones(size(y));
a=y';
K=0;
Kl=zeros(size(y));
C=100; % Functional trade-off
Ku=C*ones(size(y));
alpha=quadprog(H,f,[],[],a,K,Kl,Ku); % Solver
w=x'*(alpha.*y); % Parameters of the hyperplane
%%% Computation of the bias b %%%
e=1e-6; % Tolerance to errors in alpha
ind=find(alpha>e & alpha<C-e) % Search for 0 < alpha_i < C
b=mean(y(ind) - x(ind,:)*w) % Averaged result
%%%%%%%%%%%%%%%%%%%%%%
%%% Representation %%%
%%%%%%%%%%%%%%%%%%%%%%
data1=x(find(y==1),:);
data2=x(find(y==-1),:);
svc=x(find(alpha>e),:);
plot(data1(:,1),data1(:,2),'o')
hold on
plot(data2(:,1),data2(:,2),'*')
plot(svc(:,1),svc(:,2),'s')
% Separating hyperplane
plot([-3 3],[(3*w(1)-b)/w(2) (-3*w(1)-b)/w(2)])
% Margin hyperplanes (offset by 1/w(2) in the vertical direction)
plot([-3 3],[(3*w(1)-b+1)/w(2) (-3*w(1)-b+1)/w(2)],'--')
plot([-3 3],[(3*w(1)-b-1)/w(2) (-3*w(1)-b-1)/w(2)],'--')
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Test Data Generation %%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
x=[randn(1,10)-1 randn(1,10)+1;randn(1,10)-1 randn(1,10)+1]';
y=[-ones(1,10) ones(1,10)]';
y_pred=sign(x*w+b); % Test
err=mean(y_pred~=y); % Error computation

[Figure: generated data.]

[Figure: the values of the Lagrange multipliers $\lambda_i$. Circles and squares correspond to the circles and stars of the data in the previous and next figures.]

[Figure: resulting margin and separating hyperplanes. Support vectors are marked by squares.]

Linear support vector regressor (LSVR). A linear regressor is a function $f(p) = w^T p + b$ which provides an approximation of a mapping from a set of vectors $p_i \in \mathbb{R}^R$ to a set of scalars $y_i \in \mathbb{R}$. [Figure: linear regression, with the fitted line $w^T p + b$.]

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). Instead of trying to classify new variables into two categories $y_i = \pm 1$, we now want to predict a real-valued output $y_i \in \mathbb{R}$. The main idea of the SVR is to find a function which fits the data with a deviation of less than a given quantity $\varepsilon$ for every single pair $(p_i, y_i)$. At the same time, we want the solution to have a minimum norm $\|w\|$. This means that the SVR does not minimise errors smaller than $\varepsilon$, but only larger errors.

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). Formulation of the SVR. The idea of adjusting the linear regressor can be formulated in the following primal functional, in which we minimise the norm of w plus the total error:

$$L_p = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i')$$

subject to the constraints

$$y_i - w^T p_i - b \le \varepsilon + \xi_i, \qquad -y_i + w^T p_i + b \le \varepsilon + \xi_i', \qquad \xi_i, \xi_i' \ge 0$$

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). The preceding constraints mean: for each instance, if the error is > 0 and exceeds $\varepsilon$, then it is forced to be less than $\xi_i + \varepsilon$; if the error is < 0 and exceeds $\varepsilon$ in magnitude, then it is forced to be less than $\xi_i' + \varepsilon$. If the error is smaller than $\varepsilon$, then the corresponding slack variable will be zero, as this is the minimum allowed value for the slack variables in the previous constraints. This is the concept of $\varepsilon$-insensitivity [2].

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). [Figure: the concept of $\varepsilon$-insensitivity.] Only instances lying outside the ±$\varepsilon$ band around the regressor have a nonzero slack variable, so they are the only ones that are part of the solution.

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). The functional is intended to minimise the sum of the slack variables $\xi_i$ and $\xi_i'$. Only the losses of samples for which the error is greater than $\varepsilon$ appear, so the solution is a function of those samples only. The applied cost function is linear, so the described procedure is equivalent to the application of the so-called Vapnik or $\varepsilon$-insensitive cost function

$$l_\varepsilon(e_i) = \begin{cases} 0 & \text{for } |e_i| < \varepsilon \\ |e_i| - \varepsilon & \text{for } |e_i| \ge \varepsilon \end{cases}$$

[Figure: Vapnik or $\varepsilon$-insensitive cost function.]

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). This procedure is similar to the one applied to the SVC. In principle, we should force the errors to be less than $\varepsilon$ while minimising the norm of the parameters. Nevertheless, in practical situations it may not be possible to force all the errors to be less than $\varepsilon$. In order to be able to solve the functional, we introduce slack variables in the constraints and then minimise them.

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). To solve this constrained optimisation problem, we can apply the Lagrange optimisation procedure to convert it into an unconstrained one. The resulting dual functional is [20-23, 34]:

$$L_d = -\frac{1}{2} \sum_{i=1}^{n_c} \sum_{j=1}^{n_c} (\lambda_i - \lambda_i')\, p_i^T p_j\, (\lambda_j - \lambda_j') + \sum_{i=1}^{n_c} \big( (\lambda_i - \lambda_i') y_i - (\lambda_i + \lambda_i')\varepsilon \big)$$

with the additional constraints $0 \le \lambda_i, \lambda_i' \le C$.

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). The important result of this derivation is that the expression for the parameters w is

$$w = \sum_{i=1}^{n_c} (\lambda_i - \lambda_i')\, p_i \qquad \text{with} \qquad \sum_{i=1}^{n_c} (\lambda_i - \lambda_i') = 0$$

In order to find the bias, we just need to recall that for all samples lying on one of the two margins the error is exactly $\varepsilon$; for those samples $\lambda_i$ and $\lambda_i' < C$. Once these samples are identified, we can solve for b from the following equations:

$$y_i - w^T p_i - b + \varepsilon = 0, \qquad -y_i + w^T p_i + b + \varepsilon = 0$$

for the instances $p_i$ for which $0 < \lambda_i, \lambda_i' < C$.

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). In matrix notation we get

$$L_d = -\frac{1}{2} (\lambda - \lambda')^T R (\lambda - \lambda') + (\lambda - \lambda')^T y - (\lambda + \lambda')^T \mathbf{1}\,\varepsilon$$

where $R = [p_i^T p_j]$ is the dot-product matrix. This functional can be maximised using the same procedure used for the SVC. Very small eigenvalues may appear in the matrix, so it is convenient to regularise it numerically by adding a small diagonal matrix to it. The functional becomes

$$L_d = -\frac{1}{2} (\lambda - \lambda')^T [R + \gamma I] (\lambda - \lambda') + (\lambda - \lambda')^T y - (\lambda + \lambda')^T \mathbf{1}\,\varepsilon$$

where $\gamma$ is a regularisation constant. This numerical regularisation is equivalent to the application of a modified version of the cost function.

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). We need to compute the dot-product matrix R and then the quadratic term $(\lambda - \lambda')^T [R + \gamma I](\lambda - \lambda')$, but the Lagrange multipliers $\lambda$ and $\lambda'$ should be kept separate so that they can be identified after the optimisation. To achieve this we use the equivalent form

$$\begin{bmatrix} \lambda \\ \lambda' \end{bmatrix}^T \left( \begin{bmatrix} R & -R \\ -R & R \end{bmatrix} + \gamma \begin{bmatrix} I & -I \\ -I & I \end{bmatrix} \right) \begin{bmatrix} \lambda \\ \lambda' \end{bmatrix}$$

We can use the Matlab function quadprog.m to solve the optimisation problem.

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). Example [20]: linear SVR in MATLAB. We start by writing a simple linear model of the form $y(x_i) = a x_i + b + n_i$ (15), where $x_i$ is a random variable and $n_i$ is a Gaussian process.

P = rand(100,1) % Generate 100 uniform instances between 0 and 1
y = 1.5*P+1+0.1*randn(100,1) % Linear model plus noise

[Figure: generated data.]

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR).

%% Linear Support Vector Regressor
%%%%%%%%%%%%%%%%%%%%%%%
%%% Data Generation %%%
%%%%%%%%%%%%%%%%%%%%%%%
x=rand(30,1); % Generate 30 samples
y=1.5*x+1+0.2*randn(30,1); % Linear model plus noise
%%%%%%%%%%%%%%%%%%%%%%%%
%%% SVR Optimization %%%
%%%%%%%%%%%%%%%%%%%%%%%%
R_=x*x';
R=[R_ -R_;-R_ R_];
a=[ones(size(y')) -ones(size(y'))];
y2=[y;-y];
H=(R+1e-9*eye(size(R,1)));
epsilon=0.1;
C=100;
f=-y2'+epsilon*ones(size(y2'));
K=0;
K1=zeros(size(y2'));
Ku=C*ones(size(y2'));
alpha=quadprog(H,f,[],[],a,K,K1,Ku); % Solver
beta=(alpha(1:end/2)-alpha(end/2+1:end));
w=beta'*x;
%% Computation of bias b %%
e=1e-6; % Tolerance to errors in alpha
ind=find(abs(beta)>e & abs(beta)<C-e) % Search for 0 < alpha_i < C
b=mean(y(ind) - x(ind,:)*w) % Averaged result
%%%%%%%%%%%%%%%%%%%%%%
%%% Representation %%%
%%%%%%%%%%%%%%%%%%%%%%
plot(x,y,'.') % All data
hold on
ind=find(abs(beta)>e);
plot(x(ind),y(ind),'s') % Support vectors
plot([0 1],[b w+b]) % Regression line
plot([0 1],[b+epsilon w+b+epsilon],'--') % Margins
plot([0 1],[b-epsilon w+b-epsilon],'--')
plot([0 1],[1 1.5+1],':') % True model

1.4 Support vector machines (SVM) - Linear support vector regressor (LSVR). Results: continuous line: SVR; dotted line: true linear model; dashed lines: margins; square points: support vectors. [Figure adapted from [20].]

1. Statistical classifiers. Linear multiclass SVM (LMCSVM). Although mathematical generalisations to the multiclass case are available, the task tends to become rather complex. When more than two classes are present, there are several different approaches that revolve around the 2-class case. The most used methods are called one-versus-all and one-versus-one. These techniques are not tailored to the SVM; they are general and can be used with any classifier developed for the 2-class problem.

1.4 Support vector machines (SVM) - Linear multiclass SVM (LMCSVM). One-versus-all. The one-versus-all method builds Q binary classifiers by assigning a label +1 to instances from one class and a label -1 to instances from all the others. For example, in a 4-class problem we construct 4 binary classifiers: {c1/{c2, c3, c4}, c2/{c1, c3, c4}, c3/{c1, c2, c4}, c4/{c1, c2, c3}}. For each of the classes, we seek to design an optimal discriminant function $g_q(p)$, q = 1, 2, ..., Q, so that $g_q(p) > g_{q'}(p)$, $\forall q' \ne q$, if $p \in c_q$. Adopting the SVM methodology, we can design the discriminant functions so that $g_q(p) = 0$ is the optimal hyperplane separating class $c_q$ from all the others. Thus each classifier is designed to give $g_q(p) > 0$ for $p \in c_q$ and $g_q(p) < 0$ otherwise.

1.4 Support vector machines (SVM) - Linear multiclass SVM (LMCSVM). According to the one-against-all method, Q classifiers have to be designed, each of them separating one class from the rest. For the SVM paradigm, we have to design Q linear classifiers:

$$w_k^T p + b_k, \qquad k = 1, 2, \dots, Q$$

For example, to design classifier $c_1$, we consider the training data of all classes other than $c_1$ as forming the second class. Obviously, unless an error is committed, we expect all points from class $c_1$ to give $w_1^T p + b_1 \ge +1$, and the data from the rest of the classes to give negative outcomes, $w_m^T p + b_m \le -1$, $m \ne 1$. p is classified in $c_l$ if $w_l^T p + b_l > w_m^T p + b_m$, $\forall m \ne l$, $m = 1, 2, \dots, Q$.

1.4 Support vector machines (SVM) - Linear multiclass SVM (LMCSVM). The classifier giving the highest margin wins the vote: assign p to $c_q$ if $q = \arg\max_{q'} \{ g_{q'}(p) \}$. A drawback of one-against-all is that after training there are regions in the space, where no training data lie, for which more than one hyperplane gives a positive value, or all of them give negative values.

1.4 Support vector machines (SVM) - Linear multiclass SVM (LMCSVM). One-versus-one. The more widely used one-versus-one method constructs Q(Q-1)/2 binary classifiers (each classifier separates a pair of classes) by confronting each pair of the Q classes. For example, in a 4-class problem we construct 6 binary classifiers: {{c1/c2}, {c1/c3}, {c1/c4}, {c2/c3}, {c2/c4}, {c3/c4}}. In a 3-class problem we construct 3 binary classifiers: {{c1/c2}, {c1/c3}, {c2/c3}}. In the classification phase, the instance to classify is analysed by each classifier, and a majority vote determines its class. The obvious disadvantage of the technique is that a relatively large number of binary classifiers has to be trained. In [Plat 00] a methodology is suggested that may speed up the procedure. [Plat 00] Platt J.C., Cristianini N., Shawe-Taylor J., "Large margin DAGs for multiclass classification", in Advances in Neural Information Processing Systems (Solla S.A., Leen T.K., Müller K.R., eds.), Vol. 12, pp. 547-553, MIT Press, 2000.
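A minimal MATLAB sketch of the one-versus-one voting phase, assuming the Q(Q-1)/2 pairwise linear SVCs have already been trained and stored as weight vectors W{a,b} and biases B{a,b} for each pair a < b; all names are illustrative (save as ovo_predict.m):

function q_hat = ovo_predict(p, W, B, Q)
% One-versus-one prediction of the class of p by majority vote
votes = zeros(Q, 1);
for a = 1:Q-1
    for b = a+1:Q
        if W{a,b}' * p + B{a,b} >= 0   % pairwise decision: class a against class b
            votes(a) = votes(a) + 1;
        else
            votes(b) = votes(b) + 1;
        end
    end
end
[~, q_hat] = max(votes);               % majority vote (ties broken by the first maximum)
end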

1.4.1 Nonlinear SVM. Non-linear mapping of the feature vectors p into a high-dimensional space. We adopt the philosophy of non-linear mapping of the feature vectors into a space of higher dimension, where we expect, with high probability, that the classes are linearly separable (*). This is guaranteed by the famous theorem of Cover [34, 35]. (*) See http://www.youtube.com/watch?v=3lcbrzprza

1.4.1 Nonlinear SVM. [Figure: the non-linear low-dimensional space (p) is mapped by a non-linear kernel function φ(p) to a linear high-dimensional space, where a linear SVM is applied.]

1.4.1 Nonlinear SVM. Mapping: $p_i \in \mathbb{R}^R \;\xrightarrow{\varphi(\cdot)}\; \varphi(p_i) \in \mathbb{R}^H$, where the dimension H is greater than R, depending on the choice of the nonlinear function $\varphi(\cdot)$. In addition, if the function $\varphi(\cdot)$ is carefully chosen from a known family of functions with specific desirable properties, the inner (or dot) product $\langle \varphi(p_i), \varphi(p_j) \rangle$ corresponding to two input vectors $p_i$, $p_j$ can be written as $\langle \varphi(p_i), \varphi(p_j) \rangle = k(p_i, p_j)$, where $\langle \cdot, \cdot \rangle$ denotes the inner product in H and $k(\cdot, \cdot)$ is a known function, called the kernel function.

1.4.1 Nonlinear SVM. In other words, inner products in the high-dimensional space can be computed in terms of the kernel function acting in the original low-dimensional space. The space H associated with $k(\cdot, \cdot)$ is known as a reproducing kernel Hilbert space (RKHS) [35, 36].

1.4.1 Nonlinear SVM. Two typical examples of kernel functions: (a) the radial basis function (RBF) is a real-valued function $k(p_i, p_j) = \varphi(\|p_i - p_j\|)$. The norm is usually the Euclidean distance, although other distance functions are also possible. Sums of radial functions are typically used to approximate a given function. This approximation process can also be interpreted as a kind of simple neural network.

1.4.1 Nonlinear SVM. Examples of RBFs. Let $r = \|p_i - p_j\|$:
- Gaussian: $\varphi(r) = e^{-r^2/\sigma}$, where σ is a user-defined parameter which specifies the decay rate of $k(p_i, p_j)$ towards zero.
- Multiquadratic: $\varphi(r) = \sqrt{1 + (\varepsilon r)^2}$
- Inverse quadratic: $\varphi(r) = \dfrac{1}{1 + (\varepsilon r)^2}$
- Inverse multiquadratic: $\varphi(r) = \dfrac{1}{\sqrt{1 + (\varepsilon r)^2}}$
- Polyharmonic spline: $\varphi(r) = r^k$, $k = 1, 3, 5, \dots$; $\varphi(r) = r^k \ln(r)$, $k = 2, 4, 6, \dots$
- Special polyharmonic spline (thin plate spline): $\varphi(r) = r^2 \ln(r)$

1.4.1 Nonlinear SVM. (b) Polynomial function (PF): $k(p_i, p_j) = (p_i^T p_j + \beta)^n$, where β and n are user-defined parameters. Note that the solution of a linear problem in the high-dimensional space is equivalent to the solution of a non-linear problem in the original space.
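A minimal MATLAB sketch of the two kernels above, computed as full Gram matrices on a data matrix x with one instance per row; sigma, beta and n are the user-defined parameters mentioned above, and the data are arbitrary:

x = randn(20, 2);                                   % toy data, one instance per row
sigma = 1; beta = 1; n = 2;                         % user-defined kernel parameters
nrm = sum(x.^2, 2);
D2  = bsxfun(@plus, nrm, nrm') - 2*(x*x');          % squared distances ||p_i - p_j||^2
K_rbf  = exp(-D2 / sigma);                          % Gaussian RBF kernel matrix
K_poly = (x*x' + beta).^n;                          % polynomial kernel matrix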

Although a (linear) classifier is formed in the RKHS, due to the non-linearity of the mapping function φ(·) the method is equivalent to a nonlinear function in the original space. Moreover, since every operation can be expressed in terms of inner products, explicit knowledge of φ(·) is not necessary. All that is needed is to adopt the kernel function which defines the inner product.

(c) Sigmoid: K(p_i, p_j) = tanh(γ p_i^T p_j + μ)
(d) Dirichlet: K(p_i, p_j) = sin((n + 1/2)(p_i − p_j)) / (2 sin((p_i − p_j)/2))
More on Nonlinear PCA (NLPCA): http://www.nlpca.de/

Construction of a nonlinear SVC
The solution to the linear SVC is given by a linear combination of a subset of the training data:
    w = Σ_{i=1}^{nc} λ_i y_i p_i
If, before optimization, the data are mapped into a Hilbert space, then the solution becomes
    w = Σ_{i=1}^{nc} λ_i y_i φ(p_i)    (A)
where φ is a nonlinear mapping function.

The parameter vector w is a combination of vectors in the Hilbert space, but recall that many transformations φ(·) are unknown, so we may not have an explicit form for them. The problem can still be solved, because the SVM only needs the dot products of the vectors, not an explicit form for them.

We cannot use the expression
    y_j = w^T φ(p_j) + b    (B)
directly, because the parameters w live in a possibly infinite-dimensional space, so no explicit expression exists for them. However, by substituting equation (A) into equation (B), we obtain
    y_j = Σ_{i=1}^{nc} y_i λ_i φ(p_i)^T φ(p_j) + b = Σ_{i=1}^{nc} y_i λ_i K(p_i, p_j) + b    (C)

In linear algebra, the Gram matrix (or Gramian) of a set of vectors v_1, …, v_n in an inner product space is the Hermitian matrix of inner products, whose entries are given by
    G(v_1, …, v_n)_{ij} = <v_i, v_j>    (D)
Jørgen Pedersen Gram (June 27, 1850 - April 29, 1916) was a Danish actuary and mathematician who was born in Nustrup, Duchy of Schleswig, Denmark and died in Copenhagen, Denmark.
http://en.wikipedia.org/wiki/J%C3%B8rgen_Pedersen_Gram

The resulting SVM can now be expressed directly in terms of the Lagrange multipliers and the kernel dot products. In order to solve the dual functional which determines the Lagrange multipliers, the transformed vectors φ(p_i) and φ(p_j) are not required either, only the Gram matrix K of the dot products between them. Again, the kernel is used to compute this matrix:
    K_ij = K(p_i, p_j)    (E)
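An illustrative sketch (base MATLAB, with toy data) that builds the Gram matrix K_ij = K(p_i, p_j) for a Gaussian kernel and checks its positive semi-definiteness through its smallest eigenvalue:

X = randn(30, 2);                         % toy data: 30 training vectors in R^2 (rows)
sigma = 1;
sq = sum(X.^2, 2);
D  = bsxfun(@plus, sq, sq') - 2*(X*X');   % squared pairwise Euclidean distances
K  = exp(-D/(2*sigma^2));                 % Gram matrix of the Gaussian kernel
min(eig(K))                               % non-negative (up to rounding), as Mercer's theorem guarantees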

Once this matrix has been computed, solving for a nonlinear SVM is as easy as solving for a linear one, as long as the matrix is positive definite. It can be shown that if the kernel satisfies Mercer's theorem, the matrix will be positive definite [25]. In order to compute the bias b, we can still make use of the expression y_j (w^T p_j + b) − 1 = 0, which for the nonlinear SVC becomes
    y_j ( Σ_{i=1}^{nc} y_i λ_i φ(p_i)^T φ(p_j) + b ) − 1 = y_j ( Σ_{i=1}^{nc} y_i λ_i K(p_i, p_j) + b ) − 1 = 0    (F)
for every sample p_j for which 0 < λ_j < C. We just need to extract b from expression (F) and average it over all samples with 0 < λ_j < C.
(*) Explicit conditions that must be met by a kernel function: it must be symmetric and positive semi-definite.

Example [20]: Nonlinear Support Vector Classifier (NLSVC) in MATLAB
In this example, we try to classify a set of data which cannot reasonably be classified using a linear hyperplane. We generate a set of 40 training vectors using this code:
k=20; % Number of training data per class
ro=2*pi*rand(k,1);
r=5+randn(k,1);
x1=[r.*cos(ro) r.*sin(ro)];
x2=[randn(k,1) randn(k,1)];
x=[x1;x2];
y=[-ones(1,k) ones(1,k)]';

Example: NLSVC (cont.)
We generate a set of 100 test vectors using this code:
ktest=50; % Number of test data per class
ro=2*pi*rand(ktest,1);
r=5+randn(ktest,1);
p1=[r.*cos(ro) r.*sin(ro)];
p2=[randn(ktest,1) randn(ktest,1)];
ptest=[p1;p2];
ytest=[-ones(1,ktest) ones(1,ktest)]';

Example: NLSVC (cont.)
(Figure: an example of the generated data in the (x_1, x_2) plane.)

Example: NLSVC (cont.)
The steps of the SVC procedure:
1. Calculate the inner product matrix K_ij = K(p_i, p_j). Since we want a non-linear classifier, we compute the inner product matrix using a kernel.
Choose a kernel: in this example, we choose a Gaussian kernel
    K(p_i, p_j) = exp(−γ ||p_i − p_j||²), with γ = 1/(2σ²)

The steps of the SVC procedure (cont.)
nc=2*k;    % Number of data
sigma=1;   % Parameter of the kernel
D=buffer(sum([kron(x,ones(nc,1)) - kron(ones(1,nc),x')'].^2,2),nc,0);
% This is a recipe for fast computation of a matrix of distances in MATLAB
% using the Kronecker product
R=exp(-D/(2*sigma)); % kernel matrix
* In mathematics the Kronecker product is an operation on matrices, a special case of the tensor product. It is named in honor of the German mathematician Leopold Kronecker.

The steps of the SVC procedure (cont.)
2. Optimization procedure. Once the matrix has been obtained, the optimization procedure is exactly the same as in the linear case, except that we cannot have an explicit expression for the parameters w.

Example: NLSVC (cont.) — training
%%%%%%%%%%%%%%%%%%%%%%%
%%% Data Generation %%%
%%%%%%%%%%%%%%%%%%%%%%%
k=20; % Number of training data per class
ro=2*pi*rand(k,1);
r=5+randn(k,1);
x1=[r.*cos(ro) r.*sin(ro)];
x2=[randn(k,1) randn(k,1)];
x=[x1;x2];
y=[-ones(1,k) ones(1,k)]';
ktest=50; % Number of test data per class
ro=2*pi*rand(ktest,1);
r=5+randn(ktest,1);
x1=[r.*cos(ro) r.*sin(ro)];
x2=[randn(ktest,1) randn(ktest,1)];
xtest=[x1;x2];
ytest=[-ones(1,ktest) ones(1,ktest)]';
%%%%%%%%%%%%%%%%%%%%%%%%
%%% SVC Optimization %%%
%%%%%%%%%%%%%%%%%%%%%%%%
N=2*k;   % Number of data
sigma=2; % Parameter of the kernel
D=buffer(sum([kron(x,ones(N,1)) ...
    - kron(ones(1,N),x')'].^2,2),N,0);
% This is a recipe for fast computation
% of a matrix of distances in MATLAB
R=exp(-D/(2*sigma)); % Kernel matrix
Y=diag(y);
H=Y*R*Y+1e-6*eye(length(y)); % Matrix H regularized
f=-ones(size(y));
a=y';
K=0;
Kl=zeros(size(y));
C=100;               % Functional trade-off
Ku=C*ones(size(y));
e=1e-6;              % Tolerance to errors in alpha
alpha=quadprog(H,f,[],[],a,K,Kl,Ku); % Solver
ind=find(alpha>e);
x_sv=x(ind,:);       % Extraction of the support vectors
N_SV=length(ind);    % Number of SV
%%% Computation of the bias b %%%
ind_margin=find(alpha>e & alpha<C-e); % Search for 0 < alpha_i < C
N_margin=length(ind_margin);
D=buffer(sum([kron(x_sv,ones(N_margin,1)) ...
    - kron(ones(1,N_SV),x(ind_margin,:)')'].^2,2),N_margin,0);
% Computation of the kernel matrix
R_margin=exp(-D/(2*sigma));
y_margin=R_margin*(y(ind).*alpha(ind));
b=mean(y(ind_margin) - y_margin); % Averaged result

Example: NLSVC (cont.) — test
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Support Vector Classifier %%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N_test=2*ktest; % Number of test data
%% Computation of the kernel matrix %%
D=buffer(sum([kron(x_sv,ones(N_test,1)) ...
    - kron(ones(1,N_SV),xtest')'].^2,2),N_test,0);
R_test=exp(-D/(2*sigma));
% Output of the classifier
y_output=sign(R_test*(y(ind).*alpha(ind))+b);
errors=sum(ytest~=y_output) % Error computation
%%%%%%%%%%%%%%%%%%%%%%
%%% Representation %%%
%%%%%%%%%%%%%%%%%%%%%%
data1=x(find(y==1),:);
data2=x(find(y==-1),:);
svc=x(find(alpha>e),:);
plot(data1(:,1),data1(:,2),'o')
hold on
plot(data2(:,1),data2(:,2),'*')
plot(svc(:,1),svc(:,2),'s')
g=(-8:0.1:8)'; % Grid between -8 and 8
x_grid=[kron(g,ones(length(g),1)) kron(ones(length(g),1),g)];
N_grid=length(x_grid);
D=buffer(sum([kron(x_sv,ones(N_grid,1)) ...
    - kron(ones(1,N_SV),x_grid')'].^2,2),N_grid,0);
% Computation of the kernel matrix
R_grid=exp(-D/(2*sigma));
y_grid=(R_grid*(y(ind).*alpha(ind))+b);
contour(g,g,buffer(y_grid,length(g),0),[0 0]) % Boundary drawing

(Figure: separating boundary, margins and support vectors for the nonlinear SVC example.)

Nonlinear support vector regressor (NLSVR)
The solution of the linear SVR is
    w = Σ_{i=1}^{nc} (λ_i − λ_i') p_i
Its nonlinear counterpart will have the expression
    w = Σ_{i=1}^{nc} (λ_i − λ_i') φ(p_i)
Following the same procedure as for the SVC, one finds the expression of the nonlinear SVR
    y_j = Σ_{i=1}^{nc} (λ_i − λ_i') φ(p_i)^T φ(p_j) + b = Σ_{i=1}^{nc} (λ_i − λ_i') K(p_i, p_j) + b
The construction of a nonlinear SVR is almost identical to the construction of the nonlinear SVC.
Exercise: write MATLAB code for this NLSVR.

Some useful Web sites on SVM:
http://svmlight.joachims.org/
http://www.support-vector-machines.org/index.html

2. Neural networks

Artificial neural networks (ANN) are composed of simple elements which operate in parallel. These elements are inspired by biological nervous systems. As in nature, the connections between these elements largely determine the functioning of the network. We can build a neural network to perform a particular function by adjusting the connections (weights) between the elements.
In the following, we will use the MATLAB toolbox, version 2011a or more recent, to illustrate this course. For more details, see:
1. Neural Network Toolbox, Getting Started Guide
2. Neural Network Toolbox, User's Guide
An ANN is an assembly of elementary processing elements. The processing capacity of the network is stored in the weights of the interconnections, obtained by a process of adaptation or training from a set of training examples.

Types of neural networks:
1. Perceptron (P): one or more (Adaline) formal neurons in one layer;
2. Static networks such as "multilayer feed-forward neural networks" (MLFFNN) or "multilayer perceptron" (MLP), without feedback from the outputs to the first input layer;
3. Static networks such as "radial basis function neural networks" (RBFNN), with one layer and without feedback to the input layer:
   3.1 Generalized regression neural networks (GRNN)
   3.2 Probabilistic neural networks (PNN)
4. Partially recurrent multilayer feed-forward networks (PRFNN) (Elman or Jordan), where only the outputs are looped back to the first layer;
5. Recurrent networks with one layer and total connectivity (associative networks);
6. Self-organizing neural networks (SONN) or competitive neural networks (CNN);
7. Dynamic neural networks (DNN).

Supervised training: given couples of input feature vectors P and their associated desired outputs (targets) Y_d,
    (P, Y_d) = ([p_1, p_2, …, p_N]_{R×N}, [y_d1, y_d2, …, y_dN]_{S×N})
and an error matrix E = [E_1, E_2, …, E_N] containing the error vectors E_i = [e_1, e_2, …, e_S]^T = Y_i − Y_di.

1. Perceptron
1. Given a couple of input and desired output vectors (p, y_d);
2. when an input p is presented to the network, the activations of the neurons are calculated once the network has stabilized;
3. the errors E are calculated;
4. E is minimized by a given training algorithm.
Fig. 1. Training of a neural network. (Figure adapted from "Neural Network Toolbox 6 User's Guide", by The MathWorks, Inc.)

Types of neural networks (cont.)
1. Perceptron (P): one or more formal neurons in one layer
One neuron with one scalar input (figure: input/neuron without bias, input/neuron with bias, biological neuron)
p : input (scalar)
w : weight (scalar)
b : bias (scalar), considered as a threshold with constant input = 1; it acts as an activation threshold of the neuron
n : activation of the neuron, the sum of the weighted inputs: n = wp, or n = wp + b
f : transfer function (or activation function)

1. Perceptron (cont.)
Transfer functions
Step functions: A = hardlim(N) and A = hardlims(N) take an SxQ matrix N of net input column vectors and return an SxQ matrix A of output vectors, with a 1 in each position where the corresponding element of N is 0 or greater, and 0 (hardlim) or -1 (hardlims) elsewhere.
hardlims: A = 1 if N >= 0, A = -1 if N < 0
Example:
N = -5:0.1:5;
A = hardlims(N);
plot(N,A)
hardlim: A = 1 if N >= 0, A = 0 if N < 0
Example:
N = -5:0.1:5;
A = hardlim(N);
plot(N,A)

1. Perceptron (cont.)
Saturating linear transfer functions: A = satlin(N) and A = satlins(N) take an SxQ matrix N of net input column vectors and return an SxQ matrix A of output vectors.
satlin: A = 0 where N <= 0, A = N where 0 <= N <= 1, A = 1 where N >= 1
Example:
N = -5:0.1:5;
A = satlin(N);
plot(N,A)
satlins: A = -1 where N <= -1, A = N where -1 <= N <= 1, A = 1 where N >= 1
Example:
N = -5:0.1:5;
A = satlins(N);
plot(N,A)

1. Perceptron (cont.)
Linear transfer function: A = purelin(N) = N takes an SxQ matrix N of net input column vectors and returns an SxQ matrix A of output vectors equal to N.
Example:
N = -5:0.1:5;
A = purelin(N);
plot(N,A)

1. Perceptron (cont.)
Logarithmic sigmoid transfer function: A = logsig(N) = 1./(1+exp(-N)) takes an SxQ matrix N of net input column vectors and returns an SxQ matrix A of output vectors, where each element of N is squashed from the interval [-inf, inf] into the interval [0, 1] with an "S-shaped" function.
Example:
N = -5:0.01:5;
plot(N,logsig(N))
set(gca,'dataaspectratio',[1 1 1],'xgrid','on','ygrid','on')

1. Perceptron (cont.)
Symmetric sigmoid transfer function: A = tansig(N) = 2./(1+exp(-2*N)) - 1 takes an SxQ matrix N of net input column vectors and returns an SxQ matrix A of output vectors, where each element of N is squashed from the interval [-inf, inf] into the interval [-1, 1] with an "S-shaped" function.
Example:
N = -5:0.01:5;
plot(N,tansig(N))
set(gca,'dataaspectratio',[1 1 1],'xgrid','on','ygrid','on')

1. Perceptron (cont.)
One neuron with a feature vector input (R features) and a bias
P : input feature vector, P = [p_1 p_2 … p_R]' (' denotes transpose)
W : weight vector, W = [w_{1,1} w_{1,2} … w_{1,R}]
b : bias (scalar), considered as a threshold with constant input = 1
n : activation of the neuron, the sum of the weighted inputs: n = W P + b = Σ_{j=1}^{R} w_{1,j} p_j + b
f : transfer function (or activation function)
a : output of the formal neuron, a = f(n)
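A minimal sketch of a single formal neuron with an R-element input vector (the weights, bias and transfer function below are arbitrary illustrative choices):

P = [0.5; -1.0; 2.0];   % R = 3 input features (column vector)
W = [0.2, -0.4, 0.1];   % 1 x R weight vector
b = 0.3;                % bias
n = W*P + b;            % activation: weighted sum of the inputs plus the bias
a = 1/(1+exp(-n))       % output with a logistic (logsig-like) transfer function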

1. Perceptron (cont.)
Abbreviated notation (figure: feature vector input with R features, one neuron with bias).

1. Perceptron (cont.)
Linear separation of classes
Critical condition of separation: activation = threshold. Let b = −w_0; then W·p = w_0 is the equation of the decision line. In 2D space, if w_2 ≠ 0, we obtain the straight-line equation
    p_2 = −(w_1/w_2) p_1 + w_0/w_2
with slope −w_1/w_2, intercept w_0/w_2 on the p_2 axis and intercept w_0/w_1 on the p_1 axis.

1. Perceptron (cont.)
Example of a 2D input space with 2 classes, p = [p_1 p_2], with w_1 = w_2 = 1 and threshold w_0 = 1.5:

p_1  p_2  activation  output
0    0    0           0
0    1    1           0
1    0    1           0
1    1    2           1

(Figure: the decision line separates the point (1,1) from (0,0), (0,1) and (1,0) in the (p_1, p_2) plane.)
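A quick check of the table above in MATLAB, with w_1 = w_2 = 1 and threshold 1.5:

P = [0 0 1 1; 0 1 0 1];           % the four input points (one per column)
w = [1 1]; w0 = 1.5;              % weights and threshold
activation = w*P                  % 0 1 1 2
output = double(activation > w0)  % 0 0 0 1 -> only (1,1) exceeds the threshold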

1. Perceptron (cont.)
MATLAB demo: to study the effect of changing the bias (b) for a given weight (w), or vice versa, type in the MATLAB window:
>> nnd2n1

1. Perceptron (cont.)
Abbreviated model of one layer composed of S neurons (figure): a feature-vector input layer (R features), one weighted-sum layer (S weighted-sum units) and one output layer (S transfer functions). The weight matrix is denoted IW^{1,1}, where the superscripts give the number of the input layer and the number of the layer containing the weighted-sum units.

1. Perceptron (cont.)
Historical: ADALINE (single-layer feedforward network). (Figure: units 1 to n, each with inputs x_1 … x_N, a bias weight ω_0 and weights ω_1 … ω_N, a ±1 quantizer at the output, and the delta-rule algorithm driven by the error between the output s and the desired output d(X_e).)

1. Perceptron (cont.)
Supervised training: couples of input feature vectors P and their associated desired outputs (targets) Y_d, as defined above. (Figure: an input layer of R features receives a training example p; one layer of S neurons produces the outputs y_1 … y_S, which are compared with the desired outputs y_d1 … y_dS to give the errors e_1 … e_S.)


Types of neural networks (cont.)
2. Static networks: "multilayer feed-forward neural networks" (MLFFNN) or "multilayer perceptron" (MLP), without feedback from the outputs to the first input layer
Some nonlinear problems (or logical ones, e.g. the exclusive OR) are not solvable by a one-layer perceptron. Solution: one or more intermediate (hidden) layers are added between the input and output layers to allow the network to create its own representation of the inputs. In this case it is possible to approximate a nonlinear function or perform any sort of logic function.

2. MLP (cont.)
Model with two formal neurons (figure: a hidden-layer neuron i and an output-layer neuron j, with inputs x_1 … x_N, bias inputs -1 and weights w_0 … w_N).
α : the adaptation step (learning rate); w_ij : synaptic weight; f_i and f_j : activation functions; s_i and s_j : activation values; x_n : one component of the network input.

2. MLP (cont.)
(Figure: an input layer of R features receives a training example p, followed by one or more hidden layers and one output layer of S neurons; the outputs y_1 … y_S are compared with the desired outputs y_d1 … y_dS to give the errors e_1 … e_S.)

Types of neural networks (cont.)
3. Static networks: "radial basis function neural networks" (RBFNN), with one layer and without feedback to the input layer
An RBF network (which needs only one hidden layer) may require more neurons than an FFNN, but it can often be trained more quickly. These networks work well for a limited number of training examples.

3. RBFNN (cont.)
RBF networks can be used for:
- regression: generalized regression networks (GRNN)
- classification: probabilistic neural networks (PNN)

3.1 Generalized regression networks (GRNN)
An arbitrary continuous function can be approximated by a linear combination of a large number of well-chosen Gaussian functions.
Regression: build a good approximation of a function that is known only through a finite number of "experimental", slightly noisy couples {x_i, y_i}.
Local regression: each Gaussian basis function affects only a small area around its mean value.
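An illustrative sketch, assuming the Neural Network Toolbox function newgrnn is available: a GRNN approximating a noisy 1-D function (the spread value is an arbitrary choice).

x = 0:0.1:2*pi;                    % training inputs
y = sin(x) + 0.1*randn(size(x));   % noisy targets
spread = 0.3;                      % width of the Gaussian basis functions
net = newgrnn(x, y, spread);       % build the generalized regression network
xtest = 0:0.02:2*pi;
yhat = sim(net, xtest);            % network approximation of the function
plot(x, y, 'o', xtest, yhat, '-'); grid on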

3. RBFNN (cont.)
When an input is presented, the first (radial basis) layer calculates the distances between the input vector and the weight vectors and produces a vector, which is then multiplied by the bias. (Figure from the MATLAB NN toolbox.)

3.2 Probabilistic neural networks (PNN)
This network can be used for classification problems. When an input is presented, the first layer calculates the distances between the input vector and all training vectors and produces a vector whose elements describe how close this input vector is to each training vector. The second layer adds these contributions for each class of inputs to produce, at the output of the network, a vector of probabilities. Finally, a competitive transfer function at the output of the second layer selects the maximum of these probabilities and produces a 1 for the corresponding class and a 0 for the other classes. (Figure from the MATLAB NN toolbox.)
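An illustrative sketch, assuming the Neural Network Toolbox functions newpnn, ind2vec and vec2ind are available: a PNN classifying 2-D points into 3 classes (the data and spread are toy choices).

P = [1 2 3 4 5 6; 1 2 3 4 5 6];   % six training vectors (one per column)
Tc = [1 1 2 2 3 3];               % class index of each training vector
T = ind2vec(Tc);                  % convert indices to one-of-Q target vectors
net = newpnn(P, T, 1);            % spread of the radial basis functions = 1
ptest = [1.5 5.5; 1.5 5.5];       % two test vectors
yc = vec2ind(sim(net, ptest))     % predicted class indices (expected: 1 and 3)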

3. RBFNN (cont.)
General remarks: replicating the function throughout the data space means scanning the space with a large number of Gaussians. In practice, the RBF is centred and normalised:
    f(P) = exp(−P^T P)

3. RBFNN (cont.)
RBF networks are inefficient:
- in a large input feature space (large R),
- on very noisy data.
Local reconstruction of the function prevents the network from "averaging out" the noise over the whole space (in contrast with linear regression, whose objective is precisely to average out the noise on the data).

3. RBFNN (cont.)
Training RBF networks
Training by optimization procedures requires considerable computation time: very slow or even impossible training in practice. Solution: use a heuristic training approximation. The construction of an RBF network then becomes quick and easy, but such networks are less efficient than multilayer perceptron (MLP) networks.

3. RBFNN (cont.)
Conclusion on RBFNN: a credible alternative to the MLP on problems that are not too difficult; speed and ease of use.
For more on RBFNN, see:
[I] Chen, S., C.F.N. Cowan, and P.M. Grant, "Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks," IEEE Transactions on Neural Networks, Vol. 2, No. 2, March 1991, pp. 302-309. http://eprints.ecs.soton.ac.uk/1135/1/00080341.pdf
[II] P.D. Wasserman, Advanced Methods in Neural Computing, New York: Van Nostrand Reinhold, 1993, pp. 155-161 and pp. 35-55, respectively.

Types of neural networks (cont.)
4. Partially recurrent multilayer feed-forward networks (PRFNN) (Elman or Jordan), where only the outputs are looped back to the first layer
These are "feedforward" networks, except that feedback is performed between the output layer and a hidden layer, or between the hidden layers themselves, through additional layers called the state layer (Jordan) or the context layer (Elman).

4. PRFNN (cont.)
Since the information processing in recurrent networks depends on the network state at the previous iteration, these networks can be used to model temporal sequences (dynamic systems).

4. PRFNN (cont.)
Jordan network [Jordan 86a, b]. (Figure: input layer, one hidden layer, output layer and desired-output layer, with state units fed back from the outputs; e denotes the error.)

4. PRFNN (cont.)
Elman network [Elman 1990]. In this case an additional layer of context units is introduced. The inputs of these units are the activations of the units in the hidden layer.

(Figure: Elman network with context units (fixed weights 1.0) fed by the hidden layer; input layer, one hidden layer, output layer and desired-output layer; e denotes the error.)

4. PRFNN (cont.)
Extended Elman network. There is, however, a limitation in the Elman network: it cannot deal with complex structure such as long-distance dependencies, hence the following extension: increasing the number of generations in the context layers.

Types of neural networks (cont.)
5. Recurrent networks (RNN) with one layer and total connectivity (associative networks)
Training type: unsupervised. In these models each neuron is connected to all the others and, theoretically (but not in practice), has a connection back onto itself. These models are motivated not by a biological analogy but by their analogy with statistical mechanics.

5. RNN (cont.)
(Figure: a fully connected single-layer network; each neuron j receives the weighted sum of the other neurons' outputs through weights w_ij and applies a transfer function f.)

Types of neural networks (cont.)
6. Self-organizing neural networks (SONN) or competitive neural networks (CNN)
These networks are similar to static single-layer networks, except that there are connections, usually with negative signs, between the output units. (Figure: input layer p, output layer y.)

6. SONN (cont.)
Training: a set of examples is presented to the network, one example at a time, and for each presented example the weights are modified. If a degraded version of one of these examples is later presented to the network, the network will then rebuild the degraded example. Through these connections, the output units tend to compete to represent the current example presented at the network input. A minimal sketch with a competitive layer is given below.
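An illustrative sketch, assuming the Neural Network Toolbox function competlayer is available: unsupervised competitive learning on 2-D data drawn from three toy clusters.

X = [randn(2,30), randn(2,30)+4, [randn(1,30)+8; randn(1,30)]];  % three clusters
net = competlayer(3);      % competitive layer with 3 output units
net = train(net, X);       % unsupervised training (no targets)
y = net(X);                % one-hot outputs of the competition
classes = vec2ind(y);      % index of the winning unit for each example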

Function fitting and pattern recognition problems
In fact, it has been proved that a simple neural network can fit any practical function.
Defining a problem (supervised learning): to define a fitting problem, arrange a set of Q input vectors as the columns of a matrix. Then arrange another series of Q target vectors (the right output vectors for each input vector) in a second matrix. For example, a logical "AND" function:
Q=4;
Inputs = [0 1 0 1; 0 0 1 1];
Outputs = [0 0 0 1];

Supervised learning
We can construct an ANN in 3 different ways:
- using command-line functions *
- using the graphical user interface nftool **
- using nntool ***
* "Using Command-Line Functions", Neural Network Toolbox 6 User's Guide, 1992-2009, The MathWorks, Inc., page 1-7.
** "Using the Neural Network Fitting Tool GUI", Neural Network Toolbox 6 User's Guide, 1992-2009, The MathWorks, Inc., page 1-13.
*** "Graphical User Interface", Neural Network Toolbox 6 User's Guide, 1992-2009, The MathWorks, Inc., page 3-23.

Input-output processing
Most processing functions are provided by default when you create a network. You can override the default functions for processing inputs and outputs when you call a network creation function, or by setting network properties after creating the network.
net.inputs{1}.processFcns : network property that displays the list of input processing functions.
net.outputs{2}.processFcns : network property that displays the list of output processing functions of a 2-layer network.
You can use these properties to change the processing functions applied to the inputs and outputs of your network (but MATLAB recommends using the default properties).

Input-output processing (cont.)
Several processing functions have default parameters which define their operation. You can access or change the i-th parameter of the input or output processing:
net.inputs{1}.processParams{i} for the input processing functions,
net.outputs{2}.processParams{i} for the output processing functions of a 2-layer network.

Input-output processing (cont.)
For MLP networks the default functions are:
IPF - cell array of input processing functions. Default: IPF = {'fixunknowns','removeconstantrows','mapminmax'}.
OPF - cell array of output processing functions. Default: OPF = {'removeconstantrows','mapminmax'}.
fixunknowns : this function recodes unknown data (represented in the user data by NaN values) in a numerical form usable by the network. It preserves information about which values are known and which are unknown.
removeconstantrows : this function removes rows with constant values from a matrix.

Pre- and post-processing in MATLAB:
1. Min and max (mapminmax)
Prior to training, it is often useful to scale the inputs and targets so that they always fall in a specified range, e.g. [-1, 1] (normalized inputs and targets).
[pn,ps] = mapminmax(p);
[tn,ts] = mapminmax(t);
net = train(net,pn,tn); % training an already created network
an = sim(net,pn);       % simulation (an : normalized outputs)

To convert the outputs back to the same units as the original targets:
a = mapminmax('reverse',an,ts);
If mapminmax has already been used to preprocess the training set data, then whenever the trained network is used with new inputs, these inputs must be pre-processed with mapminmax. Let pnew be a new input set for the already trained network:
pnewn = mapminmax('apply',pnew,ps);
anewn = sim(net,pnewn);
anew = mapminmax('reverse',anewn,ts);

Other pre- and post-processing functions:
2. Mean and standard deviation (mapstd)
4. Principal component analysis (processpca)
5. Processing unknown inputs (fixunknowns)
6. Processing unknown targets (or "don't care") (fixunknowns)
7. Post-training analysis (postreg)

The performance of a trained network can be measured, in some sense, by the errors on the training, validation and test sets, but it is often useful to analyse the network response further. One way to do this is to perform a linear regression between outputs and targets:
a = sim(net,p);
[m,b,r] = postreg(a,t)
m : slope of the regression line
b : intersection of the best straight line (relating outputs to target values) with the y-axis
r : Pearson's correlation coefficient

Supervised learning — PERCEPTRON

Perceptron
Creation of a one-layer perceptron with R inputs and S outputs:
net = newp(p,t,tf,lf)
p : RxQ1 matrix of Q1 input feature vectors, each with R input features
t : SxQ2 matrix of Q2 target vectors
tf : transfer function, default = 'hardlim' (a = 1 if n >= 0, a = 0 if n < 0)
lf : learning function, default = 'learnp'

Perceptron — classification example:
% DEFINITION
% Creation of a new perceptron using net = newp(pr,s,tf,lf)
% Description
% Perceptrons are used to solve simple classification problems
% (i.e. linearly separable classes)
% net = newp(pr,s,tf,lf)
% pr - Rx2 matrix of min and max values of the R input elements
% s  - number of neurons
% tf - transfer function, default = 'hardlim': hard limit transfer function
% lf - training function, default = 'learnp': perceptron weight/bias learning function
p1 = 7*rand(2,50);
p2 = 7*rand(2,50)+5;
p = [p1,p2];
t = [zeros(1,50),ones(1,50)];
pr = minmax(p); % pr is an Rx2 matrix of min and max values of the RxQ matrix p
net = newp(pr,t,'hardlim','learnpn');
% Display the initial values of the network
w0=net.IW{1,1}
b0=net.b{1}
E=1;
iter=0;
% sse: sum squared error performance function
while (sse(E)>.1)&(iter<1000)
    [net,y,E] = adapt(net,p,t); % adapt: adapt the neural network
    iter = iter+1;
end
% Display the final values of the network
w=net.IW{1,1}
b=net.b{1}
% TEST
test = rand(2,1000)*15;
ctest = sim(net,test);
figure
% scatter: scatter/bubble plot
% scatter(x,y,s,c) displays colored circles at the locations specified by the vectors x and y (same size)
% s determines the area of each marker (in points^2)
% c determines the colors of the markers
% 'filled' fills the markers
scatter(p(1,:),p(2,:),100,t,'filled')
hold on
scatter(test(1,:),test(2,:),10,ctest,'filled');
hold off
% plotpc(W,B): draw a classification line
% W - SxR weight matrix (R <= 3)
% B - Sx1 bias vector
plotpc(net.IW{1},net.b{1});
% Plot regression
figure
[m,b,r] = postreg(y,t);

Perceptron — example (cont.):
>> run('perceptron.m')
w0 = 0 0
b0 = 0
w = 17.5443 12.6618
b = -175.3625

(Figure: linear regression between the perceptron outputs and the targets.)

Data structures
To study the effect of data formats on the network structure: there are two main types of input vectors, those that are independent of time, or concurrent vectors (e.g. static images), and those that occur sequentially in time, or sequential vectors (e.g. time signals or dynamic images). For concurrent vectors the ordering of the vectors is not important; if there were a number of networks operating in parallel (static networks), we could present one input vector to each network. For sequential vectors the order in which the vectors appear is important, and the networks used in this case are called dynamic networks.

Static network
These networks have neither delays nor feedback.
Simulation with concurrent input vectors in batch mode: when the presentation order of the inputs is not important, all the inputs can be presented simultaneously (batch mode). For example, suppose the target values are 100, 50, -100 and 25 for Q = 4 concurrent input vectors:
P = [1 2 2 3; 2 1 3 1];   % form a batch matrix P
T = [100 50 -100 25];     % form a batch matrix T
net = newlin(P, T);       % all input vectors and their target values are given at once
W = [1 2]; b = 0;         % assign values to the weights and to b (without training)
net.IW{1,1} = W;
net.b{1} = b;
y = sim(net, P);
After simulation we obtain: y = 5 4 8 5

Simulation with concurrent input vectors in batch mode (cont.)
In the previous network a single matrix containing all the concurrent vectors is presented to the network, and the network simultaneously produces a single matrix of output vectors. The result would be the same if there were four networks operating in parallel, each receiving one of the input vectors and producing one output. The ordering of the input vectors is not important, because they do not interact with each other. (Figure: four input/target pairs (p_1, t_1) … (p_4, t_4) processed in parallel.)

Dynamic network
These networks have delays and feedback.
Simulation with inputs in incremental mode: when the order of presentation of the inputs is important, the inputs may be introduced sequentially (on-line mode). The network has a delay. For example, p(1)=[1], p(2)=[2], p(3)=[3], p(4)=[4]:
P = {1 2 3 4};             % form a cell array, the input sequence
Suppose the target values are given in the order 10, 3, 3, -7: t(1)=[10], t(2)=[3], t(3)=[3], t(4)=[-7]:
T = {10, 3, 3, -7};
net = newlin(P, T, [0 1]); % create the network with delays of 0 and 1
net.biasConnect = 0;
net.IW{1,1} = [1 2];       % assign the weights
y = sim(net, P);
After simulation we obtain a cell array, the output sequence: y = [1] [4] [7] [10]

Dynamic network (cont.)
The presentation order of the inputs matters when they are presented as a sequence. In this case, the current output is obtained by multiplying the current input by 1 and the previous input by 2 and adding the results. If you were to change the order of the inputs, the numbers obtained at the output would change.

Dynamic network (cont.)
Simulation with concurrent input vectors in a dynamic network: although this choice is questionable, we can always apply concurrent input vectors to a dynamic network (the order of presentation is then not important). With p_1=[1], p_2=[2], p_3=[3], p_4=[4]:
P = [1 2 3 4];     % form the batch matrix P
y = sim(net, P);   % each vector is treated independently, so the delayed input stays at zero
After simulation we obtain: y = 1 2 3 4

Multiple-layer neural network (MLNN) or feed-forward backpropagation network (FFNN). (Figure: input, layer 1, layer 2, layer 3.)

FFNN (abbreviated notation, figure: input, layer 1, layer 2, layer 3).

Creating an MLNN with N layers: feedforwardnet
Two (or more) layer feed-forward networks can implement any finite input-output function arbitrarily well, given enough hidden neurons.
feedforwardnet(hiddenSizes,trainFcn) takes a 1xN vector of N hidden layer sizes and a backpropagation training function, and returns a feed-forward neural network with N+1 layers. The input and output sizes are set to 0; they are automatically configured to match particular data by train, or the user can manually configure inputs and outputs with configure. Defaults are used if feedforwardnet is called with fewer arguments; the default arguments are (10,'trainlm').
Here a feed-forward network is used to solve a simple fitting problem:
[x,t] = simplefit_dataset;
net = feedforwardnet(10);
net = train(net,x,t);
view(net)
y = net(x);
perf = perform(net,t,y)

MLNN or FFNN — Example 1: Regression
Suppose, for example, you have data from a housing application [HaRu78]. The goal is to design a network to predict the value of a house (in thousands of U.S. dollars) given 13 geographic and real-estate features. We have a total of 506 example houses for which we have these 13 features and their associated market values.
[HaRu78] Harrison, D., and Rubinfeld, D.L., "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, Vol. 5, 1978, pp. 81-102.

MLNN orffnn Example 1 (cont.) Gven p nput vectors and t target vectors load housng; % Load the datap (13x506 batch nput matrx) and t (1x506 batch target matrx) [P mm, Ps] = mapmnmax(p); % assgn the values mn and max of the rows n the matrx P wth values n [-1 1] [t mm, ts] = mapmnmax(t); Dvde data nto three sets: tranng, valdaton, and testng. The valdaton set s used to ensure that there wll be no overfttng n the fnal results. The test set provdes an ndependent measure of tranng data. Take 20% of the data for the valdaton set and 20% for the test set, leavng 60% for the tranng set. Choose sets randomly from the orgnal data. [tranv, val, test] = dvdevec(p mm, t mm, 0.20, 0.20); % 3 structures : tranng (60%), valdaton (20%) and testng (20%) pr = mnmax(p mm ); % pr s Rx2 matrx of mn and max values of the RxQ matrx P mm net = newff(pr, [20 1]); % create a «feed-forward backpropagaton network» wth one hdden layer of 20 neurons and one output layer wth 1 neuron. The default tranng functon s 'tranlm' 18/03/2014 189

MLNN orffnn Example 1 (cont.) Tranng vectors set (tranv.p) : presented to the network durng tranng and the network s adjusted accordng to ts errors. valdaton vectors set (tranv.t, vald) : used to measure the generalsaton of the network and nterrupt the learnng when the generalsaton stops to mprove. test vectors set (tranv.t, test) : They have no effect on tranng and thus provde an ndependent measure of network performance durng and after tranng. [net, tr]=tran(net, tranv.p, tranv.t, [ ], [ ],val,test); ensemble d apprentssage Structures des ensembles de valdaton et de test % Tran a neural network. Cette foncton présente smultanément tous les vecteurs d entrée et de cbles au réseau en mode «batch». Pour évaluer les performances, elle utlse la foncton mse (mean squared error). net est la structure du réseau obtenu, tr est tranng record (epoches et performance) 18/03/2014 190

MLNN orffnn Example 1 (cont.) 18/03/2014 191

MLNN orffnn Example 1 (cont.) The tranng s stopped at teraton 9 because the performance measured by mse) of the valdaton start ncreasng after ths teraton. Performance qute suffcent. tranng several tmes wll produce dfferent results due to dfferent ntal condtons. The mean square error (mse) s the average of squares of the dfference between outputs (standard) and targets. Zero means no error, whle an error more than 0.6667 sgnfes hgh error. 18/03/2014 192

MLNN orffnn Example 1 (cont.) Analyss of the network response Present the entre set of data to the network (tranng, valdaton and test) and perform a lnear regresson between network outputs after they were brought back to the orgnal range of outputs and related targets. y mm = sm(net, P mm ); % smulate an ANN 20 y = mapmnmax('reverse', y mm, ts); % Replace the values between [-1 1] of the matrx y mm by ther real mnmal and maxmale [m, b, r] = postreg(y, t); % Make a lnear regresson between the outputs and the targets. m Slope of the lnear regresson. b Y-ntercept of the lnear regresson. r value of the lnear regresson. P mm net y mm 18/03/2014 193

MLNN orffnn Example 1 (cont.) The regresson r of values measures the correlaton between the outputs (not normalsed) and targets. A value r of 1 means a perfect relatonshp, 0 a random relatonshp. The output follow the target, r = 0.9. If greater accuracy s requred, then: - Reset the weghts and the bas of the network agan usng the functons nt (net) and tran - Increase the number of neurons n the hdden layer - Increase the number of tranng feature vectors - Increase the number of features f more useful nformaton are avalable - Try another tranng algorthm 18/03/2014 194