Content. Learning. Regression vs Classification. Regression a.k.a. function approximation and Classification a.k.a. pattern recognition

Transcription:

Content
Andrew Kusiak, Intelligent Systems Laboratory, 239 Seamans Center, The University of Iowa, Iowa City, IA 52242-527
andrew-kusiak@uiowa.edu, http://www.icaen.uiowa.edu/~ankusiak
- Introduction to learning
- Support Vector Machines vs Neural Networks
- Quadratic Programming (QP)-based learning
- Linear Programming-based learning
- Regression and classification by Linear Programming
- Illustrative examples
(Based on the material provided by Professor V. Kecman)

Learning
Learning from data, i.e., examples, samples, measurements, records, observations, patterns: getting the data, transforming it, filtering it, compressing it, using it, reusing it, etc.

Regression vs Classification
Regression, a.k.a. function approximation, and classification, a.k.a. pattern recognition.

Support Vector Machines
- SVMs for multi-class problems (Weston and Watkins 1998; Kindermann and Paass 2000)
- SVMs for density estimation (Smola and Schoelkopf 1998)
- The theory of VC bounds (Vapnik 1995 and 1998)

SVM Context
Relationship between SVMs, NNs, and classical techniques such as Fourier series and polynomial approximations.

Fourier Series
A Fourier series can be represented as a NN. The AMPLITUDES and PHASES of the sine (cosine) waves are not known, but the frequencies are known [because Joseph Fourier has selected the frequencies for us], and they are INTEGER multipliers of some pre-selected base frequency.
[Figure: a Fourier series drawn as a single-hidden-layer network; the input x feeds sine units at the prescribed integer frequencies, and the output weights w (the amplitudes) are learned. Learning the amplitudes is linear.]

F(x) = Σ_{k=1}^{N} a_k sin(kx), or b_k cos(kx), or both

Note: Learning frequencies is nonlinear.
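To make the "linear learning" point concrete, here is a minimal sketch (not from the slides; the target function, frequencies and sample sizes are illustrative assumptions) that fits the amplitudes a_k of a truncated sine series with fixed integer frequencies by ordinary least squares:

```python
import numpy as np

# Noisy samples of a target that is itself a short sine series.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 200)
d = 1.5 * np.sin(x) - 0.7 * np.sin(3.0 * x) + 0.05 * rng.standard_normal(x.size)

# Design matrix of sine waves at the prescribed integer frequencies k = 1..N.
N = 5
Phi = np.column_stack([np.sin(k * x) for k in range(1, N + 1)])

# Because the frequencies are fixed, the amplitudes a_k enter linearly and the
# sum-of-errors-squares cost is minimized by linear least squares (pseudoinverse).
a, *_ = np.linalg.lstsq(Phi, d, rcond=None)
print(np.round(a, 3))   # amplitudes near [1.5, 0, -0.7, 0, 0]
```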

Example (1)
Assume the following model, y = 2.5 sin(1.5x), is to be learned as the Fourier series model o = y = w_2 sin(w_1 x).

Example (2)
Known: the function is a sine. Not known: frequency and amplitude.
[Figure: single-neuron network; input x, net input net = w_1 x, hidden-layer output o_HL = sin(net), output o = w_2 o_HL, compared with the desired value d.]

Example (3)
Use a NN model with a single neuron in the hidden layer (having the sine as its activation function). Use the training data set {x, d}. Learn the Fourier series model o = y = w_2 sin(w_1 x).

Examples (4) and (5)
The cost function is J = sum(e^2) = sum(d - o)^2 = sum(d - w_2 sin(w_1 x))^2.
[Figures: the dependence of the cost function J upon the amplitude A = w_2 (dashed) and upon the frequency w_1 (solid).]
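A minimal sketch of this example (illustrative only; it assumes the target y = 2.5 sin(1.5x) reconstructed above). For a fixed frequency w_1 the model is linear in the amplitude w_2, so the best amplitude has a closed form; scanning the frequency shows why learning it is the nonlinear part:

```python
import numpy as np

# Training data {x, d} from the target y = 2.5*sin(1.5*x) (values assumed, see text).
x = np.linspace(0.0, 2.0 * np.pi, 100)
d = 2.5 * np.sin(1.5 * x)

def best_amplitude(w1):
    """For a fixed frequency w1, o = w2*sin(w1*x) is linear in w2, so the
    sum-of-errors-squares cost has a closed-form minimizer for w2."""
    s = np.sin(w1 * x)
    return float(s @ d / (s @ s))

def cost(w1, w2):
    """J = sum(e^2) = sum((d - w2*sin(w1*x))^2)."""
    return float(np.sum((d - w2 * np.sin(w1 * x)) ** 2))

# Scan the frequency: the profile J(w1) = min_w2 J(w1, w2) is non-convex,
# which is the sense in which learning the frequency is nonlinear.
grid = np.linspace(0.1, 6.0, 600)
profile = [cost(w1, best_amplitude(w1)) for w1 in grid]
w1_hat = grid[int(np.argmin(profile))]
print(round(w1_hat, 3), round(best_amplitude(w1_hat), 3))   # approx. 1.5 and 2.5
```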

Example 2
F(x) = Σ_{i=0}^{N} w_i x^i, a polynomial approximation: the powers x^i are prescribed basis functions and the weights w_i are again learned linearly.

SVMs and NNs
The learning machine that determines the APPROXIMATION FUNCTION (regression) or the SEPARATION BOUNDARY (classification, pattern recognition) is the same in high-dimensional data sets.
[Figure: a single-hidden-layer network with input x, prescribed hidden-layer parameters V (RBF = radial basis function), hidden outputs y_1, ..., y_j, y_j+1, ..., y_J, and learned output weights w.]

Neural Network / Support Vector Machine
Both implement the same model:
F(x) = Σ_{j=1}^{J} w_j φ_j(x, c_j, Σ_j)
[Figures: identical network diagrams for the NN and the SVM; inputs x_1, ..., x_i, ..., x_n, hidden outputs y_1, ..., y_J, output weights w.]
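As a concrete instance of F(x) = Σ_j w_j φ_j(x, c_j, Σ_j), here is a minimal sketch (not from the slides; centers, width and target are assumptions) of a Gaussian RBF expansion with prescribed centers, fitted by linear least squares:

```python
import numpy as np

# RBF model F(x) = sum_j w_j * phi_j(x, c_j, sigma) with Gaussian basis functions
# at prescribed centers; only the output weights w_j are learned (linearly).
rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 150)
d = np.tanh(2.0 * x) + 0.05 * rng.standard_normal(x.size)   # training targets

centers = np.linspace(-3.0, 3.0, 9)   # c_j, fixed in advance
sigma = 0.8                           # shared width

Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * sigma ** 2))
w, *_ = np.linalg.lstsq(Phi, d, rcond=None)                 # linear-in-w fit
rms = float(np.sqrt(np.mean((Phi @ w - d) ** 2)))
print(np.round(w, 3), round(rms, 3))   # learned weights and training RMS error
```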

Comparison: NN vs SVM
Both compute F(x) = Σ_{j=1}^{J} w_j φ_j(x, c_j, Σ_j). There are no structural differences between NNs and SVMs, i.e., in the representational capacity; the important differences are in LEARNING.
[Figures: the same network diagram shown for the NN and for the SVM.]

Note
Identification, estimation, regression, classification, pattern recognition, function approximation, curve fitting, surface fitting, etc., are different names for the same learning-from-data task.

Question

Classical Regression
The classical regression and (Bayesian) classification statistical techniques are based on the strict assumption that the probability distribution models (probability-density functions) are known.

Statistical Inference
- Data can be modeled by a set of linear-in-parameters functions (e.g., linear regression); this is a foundation of the parametric paradigm in learning from experimental data.
- The assumption of a normal probability distribution law, i.e., the underlying joint probability distribution is Gaussian.
- Due to the second assumption above, the induction paradigm for parameter estimation is the maximum likelihood method, which reduces to the minimization of the sum-of-errors-squares cost function in most engineering applications (a short derivation of this reduction is sketched after the next slide).

Why SVM?
The three assumptions of the classical statistical paradigm are too strict for many contemporary real-life problems (Vapnik 1998).

Reasons for SVMs
- Modern problems are of high dimensionality (many features). The underlying mapping is often not smooth, and therefore the linear paradigm calls for an exponential increase in the number of terms with increasing dimensionality of the input space X, i.e., with an increase in the number of independent variables. This is known as the curse of dimensionality.
- The underlying application data-generation laws may not follow the normal distribution function, and a model builder must consider this in the construction of an effective learning algorithm.
- From the first two reasons it follows that the maximum likelihood estimator (and consequently the sum-of-errors-squares cost function) should be replaced by a new induction paradigm that is uniformly better, in order to model non-Gaussian distributions.
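The reduction mentioned on the Statistical Inference slide is the standard one (not spelled out on the slides): with additive i.i.d. Gaussian noise, maximizing the likelihood is the same as minimizing the sum of squared errors.

```latex
\log L(w) \;=\; -\tfrac{l}{2}\log\!\left(2\pi\sigma^{2}\right)
\;-\;\frac{1}{2\sigma^{2}}\sum_{i=1}^{l}\bigl(d_i - f(x_i; w)\bigr)^{2}
\quad\Longrightarrow\quad
\arg\max_{w}\,\log L(w) \;=\; \arg\min_{w}\,\sum_{i=1}^{l}\bigl(d_i - f(x_i; w)\bigr)^{2}
```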

It Is Also True That 1(2)
The probability-density functions are unknown, and a question arises: HOW TO PERFORM a distribution-free REGRESSION or CLASSIFICATION?

It Is Also True That 2(2)
The available EXPERIMENTAL DATA (examples, training patterns, samples, observations, records) are high-dimensional and scarce. High-dimensional spaces are often terrifyingly empty, and the learning algorithms (i.e., machines) should be able to operate in such spaces and to learn from sparse data. There is an old saying that redundancy provides knowledge. Stated simply, the more data available at hand, the better the results that will be produced.

Terrifying Emptiness and/or Data Sparseness
Consider the 1D y = f(x), 2D z = f(x, y), and 3D u = f(x, y, z) functions for 10 samples (points) in the domain (0, 1). The density of the sampled space decreases as the dimensionality increases, and the average distance between the points increases with the dimensionality! (A sketch of this effect follows below.)
[Figure (Illustrative Example): the same 10 points plotted on a line (x), in the plane (x, y), and in the cube (x, y, z).]

Error Analysis
[Figure: dependency of the modeling error on the size of the training data set; the error decreases through small, medium and large samples toward a final error as the data size l grows, with a noisy data set leveling off above a noiseless one.]
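A minimal sketch of the sparseness effect referred to above (not from the slides; sample counts and trial numbers are illustrative): the average pairwise distance between 10 uniform random points in (0, 1)^D grows with D.

```python
import numpy as np

# Average pairwise distance between 10 uniform random points in (0, 1)^D,
# for D = 1, 2, 3: the "emptiness" grows with the dimensionality.
rng = np.random.default_rng(0)
n_points, n_trials = 10, 2000

for D in (1, 2, 3):
    dists = []
    for _ in range(n_trials):
        pts = rng.uniform(0.0, 1.0, size=(n_points, D))
        diff = pts[:, None, :] - pts[None, :, :]      # pairwise differences
        dmat = np.sqrt((diff ** 2).sum(axis=-1))      # Euclidean distances
        iu = np.triu_indices(n_points, k=1)           # each pair counted once
        dists.append(dmat[iu].mean())
    print(D, round(float(np.mean(dists)), 3))   # roughly 0.33, 0.52, 0.66
```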

Error Analysis: Glivenko-Cantelli-Kolmogorov Results
The Glivenko-Cantelli theorem states that the empirical distribution function P_emp(x) → P(x) as the number of data l → ∞. However, for both regression and classification we need probability-density functions p(x), i.e., p(x|ω), rather than the distribution P(x). (A small sketch of this convergence is given after the Models slides below.)

Models
Nonlinear and nonparametric models, illustrated by NNs and SVMs, are discussed. Nonlinear implies: 1) the model class is not restricted to linear input-output maps, and 2) the cost function that measures the goodness of a model is nonlinear with respect to the unknown parameters.

Models (continued)
Nonparametric does not imply that the models have no parameters at all. On the contrary, parameter learning (meaning selection, identification, estimation, fitting or tuning) is the crucial issue here. However, unlike in classical statistical inference, the parameters are not predefined; rather, their number depends on the training data used. In other words, the parameters that define the capability of the model are data driven in such a way as to match the model capability with the data complexity. This is the basic paradigm of the structural risk minimization (SRM) approach introduced by Vapnik and Chervonenkis and their coworkers.
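The Glivenko-Cantelli convergence referred to above can be seen numerically. A minimal sketch (illustrative assumptions: data drawn from N(0, 1), a few sample sizes) of the worst-case gap between P_emp(x) and P(x):

```python
import numpy as np
from math import erf, sqrt

# Empirical distribution function P_emp(x) converging to P(x) (Glivenko-Cantelli):
# the worst-case gap sup_x |P_emp(x) - P(x)| shrinks as the sample size l grows.
rng = np.random.default_rng(0)

def sup_gap(l):
    """Kolmogorov-Smirnov-style gap for l samples drawn from N(0, 1)."""
    x = np.sort(rng.standard_normal(l))
    true = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in x])  # P(x) for N(0,1)
    emp_hi = np.arange(1, l + 1) / l       # P_emp just after each sample point
    emp_lo = np.arange(0, l) / l           # P_emp just before each sample point
    return float(max(np.max(emp_hi - true), np.max(true - emp_lo)))

for l in (10, 100, 1000, 10000):
    print(l, round(sup_gap(l), 3))         # the gap decreases roughly like 1/sqrt(l)
```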

CLASSIFICATION (PATTERN RECOGNITION) EXAMPLE
Assume normally distributed classes with the same covariance matrices. The solution is easy: the decision boundary is linear and defined by the parameter vector w = X*D in the case there is plenty of data (infinity). X* denotes the PSEUDOINVERSE. Note that this solution follows from the last two assumptions of classical inference: Gaussian data and minimization of the sum-of-errors-squares.
[Figure: two Gaussian classes in the (x_1, x_2) plane, labeled d_1 = +1 and d_2 = -1, separated by a linear decision boundary.]

Example (1)
For the data set below (five points per class; the third column of X is the bias input), w = X*D gives w_opt = [-0.5209  -0.5480  6.973]^T.

X =
5.7948  5.9797  1.0000
5.9568  5.274   1.0000
5.5226  5.2523  1.0000
5.880   5.8757  1.0000
5.730   5.7373  1.0000
7.365   7.664   1.0000
7.08    7.2844  1.0000
7.8939  7.4692  1.0000
7.99    7.0648  1.0000
7.2987  7.9883  1.0000

D = [+1 +1 +1 +1 +1 -1 -1 -1 -1 -1]^T

[Figure: scatter plot of the two classes over roughly x_1 in (4, 10) and x_2 in (4, 9).]

Example (2)
However, for a small sample, the solution defined by w = X*D is NO LONGER a GOOD one: for this data set the separation boundary obtained is x_2 = -0.95 x_1 + 12.725.
[Figure: the data set with this separation line drawn.]
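A minimal sketch of the w = X*D construction (illustrative only; the two Gaussian clusters below are synthetic stand-ins, not the slides' data):

```python
import numpy as np

# Least-squares (pseudoinverse) linear classifier on two Gaussian classes,
# mirroring the slides' w = X* D construction.
rng = np.random.default_rng(0)
n = 5                                                       # points per class (a small sample)
c1 = rng.normal(loc=[5.5, 5.5], scale=0.4, size=(n, 2))     # class 1, d = +1
c2 = rng.normal(loc=[7.5, 7.5], scale=0.4, size=(n, 2))     # class 2, d = -1

X = np.hstack([np.vstack([c1, c2]), np.ones((2 * n, 1))])   # bias column of ones
D = np.hstack([np.ones(n), -np.ones(n)])

w = np.linalg.pinv(X) @ D                                   # w = X* D (pseudoinverse)
# Decision boundary w1*x1 + w2*x2 + w3 = 0, i.e. x2 = -(w1/w2)*x1 - w3/w2.
print(np.round(w, 4))
print("boundary: x2 = %.2f * x1 + %.2f" % (-w[0] / w[1], -w[2] / w[1]))
```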

Example (3)
For a different data set, another separation line is obtained. Again, for a small sample the solution defined by w = X*D is NO LONGER a GOOD one. What is common to both separation lines, the red and the blue one? Both have a SMALL MARGIN.

WHAT'S WRONG WITH A SMALL MARGIN?
Look at the BLUE line! It is very likely that new examples will be wrongly classified.

SVM
The question is how to FIND the OPTIMAL SEPARATION HYPERPLANE GIVEN (scarce) DATA SAMPLES. STATISTICAL LEARNING THEORY IS DEVELOPED TO SOLVE THE PROBLEM of FINDING THE OPTIMAL SEPARATION HYPERPLANE for small samples. The OPTIMAL SEPARATION HYPERPLANE is the one that has the LARGEST MARGIN on the given DATA SET.

MAXIMAL MARGIN CLASSIFIER
The maximal margin classifier is an alternative to the perceptron:
- it also assumes that the data are linearly separable;
- it aims at finding the separating hyperplane with the maximal geometric margin (and not just any one, which is typical of perceptron solutions).

SVM
[Figure: two classes (Class 1, y = +1; Class 2, y = -1) in the (x_1, x_2) plane, shown with two different separating lines, i.e., decision boundaries, i.e., hyperplanes: one with a small margin and one with a large margin.]
The larger the margin, the smaller the probability of misclassification. (A minimal sketch of a maximal-margin fit closes this transcription.)

Reference
V. Kecman, Learning and Soft Computing, MIT Press, Cambridge, MA, 2001.
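To close, a minimal sketch of a maximal-margin linear classifier on the same kind of synthetic data used in the pseudoinverse sketch above. The slides do not prescribe a tool; scikit-learn's SVC with a large C is used here as an assumption to approximate the separable, hard-margin case.

```python
import numpy as np
from sklearn.svm import SVC

# Near-hard-margin linear SVM: it picks the separating hyperplane with the
# LARGEST margin on the given data set.
rng = np.random.default_rng(0)
c1 = rng.normal(loc=[5.5, 5.5], scale=0.4, size=(5, 2))   # Class 1, y = +1
c2 = rng.normal(loc=[7.5, 7.5], scale=0.4, size=(5, 2))   # Class 2, y = -1
X = np.vstack([c1, c2])
y = np.hstack([np.ones(5), -np.ones(5)])

svm = SVC(kernel="linear", C=1e6).fit(X, y)     # large C: (almost) no slack allowed
w, b = svm.coef_[0], svm.intercept_[0]
margin = 2.0 / np.linalg.norm(w)                # geometric margin width 2/||w||
print("boundary: %.3f*x1 + %.3f*x2 + %.3f = 0" % (w[0], w[1], b))
print("margin width:", round(float(margin), 3))
print("support vectors:", svm.support_)         # only the margin-defining points
```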