Andrew Kusiak
Intelligent Systems Laboratory
2139 Seamans Center
The University of Iowa
Iowa City, IA 52242-1527
andrew-kusiak@uiowa.edu
http://www.icaen.uiowa.edu/~ankusiak

Content
- Introduction to learning
- Support Vector Machines vs Neural Networks
- Quadratic Programming (QP)-based learning
- Linear Programming (LP)-based learning
- Regression and classification by Linear Programming
- Illustrative examples
(Based on the material provided by Professor V. Kecman)

Learning
Learning from data, i.e., examples, samples, measurements, records, observations, patterns. Getting the data, transforming it, filtering it, compressing it, using it, reusing it, etc.

Regression vs Classification
Regression, a.k.a. function approximation, and Classification, a.k.a. pattern recognition.
Support Vector Machines
- SVMs for multi-class problems (Weston and Watkins 1998, Kindermann and Paass 2000)
- SVMs for density estimation (Smola and Schoelkopf 1998)
- The theory of VC bounds (Vapnik 1995 and 1998)

SVM Context
Relationship between SVMs, NNs, and classical techniques such as Fourier series and polynomial approximations.

Fourier Series
A Fourier series can be represented as a NN. The AMPLITUDES and PHASES of the sine (cosine) waves are not known, but the frequencies are known [because Joseph Fourier has selected the frequencies for us], and they are INTEGER multiples of some pre-selected base frequency:

F(x) = \sum_{k=1}^{N} a_k \sin(kx), \quad \text{or} \quad \sum_{k=1}^{N} b_k \cos(kx), \quad \text{or both}

[Figure: the Fourier series drawn as a single-hidden-layer network; the input weights v (the frequencies) are prescribed, and only the output-layer weights w (the amplitudes) are learned. Learning the amplitudes is linear.]

Note: Learning frequencies is nonlinear.
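To make the "linear learning" point concrete, here is a minimal sketch (not from the slides; the target function, the number of harmonics N, and the sample grid are assumptions). With the frequencies fixed in advance, the amplitudes a_k of F(x) = sum_k a_k sin(kx) follow from ordinary linear least squares:

```python
# Minimal sketch: fitting Fourier AMPLITUDES is linear least squares,
# because the frequencies k = 1..N are prescribed, not learned.
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 200)
d = 2.0 * np.sin(x) - 0.5 * np.sin(3.0 * x)   # hypothetical target function

N = 5                                          # assumed number of harmonics
Phi = np.column_stack([np.sin(k * x) for k in range(1, N + 1)])  # design matrix

a, *_ = np.linalg.lstsq(Phi, d, rcond=None)    # amplitudes in closed form
print(np.round(a, 3))                          # ~ [2, 0, -0.5, 0, 0]
```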
Example (1)
Assume the following model, y = 2.5 sin(1.5x), is to be learned as the Fourier series model o = y = w2 sin(w1 x).

Example (2)
Known: the function is a sine. Not known: its frequency and amplitude.
[Figure: the single-neuron network x -> net -> o_HL -> o, with input weight w1, output weight w2, and error e = d - o.]

Example (3)
Use a NN model with a single neuron in the hidden layer (having a sine as the activation function). Use the training data set {x, d} and learn the Fourier series model o = y = w2 sin(w1 x) by minimizing the cost function

J = \sum e^2 = \sum (d - o)^2 = \sum (d - w_2 \sin(w_1 x))^2

[Figure (Examples 4 and 5): the dependence of the cost function J upon the amplitude A = w2 (dashed) and the frequency w1 (solid); J is quadratic in the amplitude but not in the frequency.]
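A minimal sketch of Example (3), assuming a plain gradient-descent learning rule, a learning rate, and an initialization near the optimum (none of which are specified on the slides). It illustrates the point of the cost-surface figure: the amplitude w2 enters J quadratically, while the frequency w1 enters nonlinearly, so descent started far from the optimum can stall in a local minimum:

```python
# Minimal sketch: gradient descent on J = mean(e^2) for o = w2*sin(w1*x).
# Learning rate and initialization are assumptions.
import numpy as np

x = np.linspace(0.0, 4.0, 100)
d = 2.5 * np.sin(1.5 * x)              # target from Example (1)

w1, w2 = 1.3, 2.0                      # assumed start, near the optimum
eta = 2e-3                             # assumed learning rate
for _ in range(50000):
    o = w2 * np.sin(w1 * x)            # single hidden neuron, sine activation
    e = d - o
    # gradients of J = mean(e^2) with respect to the two weights
    grad_w2 = -2.0 * np.mean(e * np.sin(w1 * x))
    grad_w1 = -2.0 * np.mean(e * w2 * x * np.cos(w1 * x))
    w2 -= eta * grad_w2
    w1 -= eta * grad_w1

print(round(w1, 3), round(w2, 3))      # expected to approach 1.5 and 2.5
```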
Example 2
Polynomial approximation:

F(x) = \sum_{i=0}^{N} w_i x^i

SVMs and NNs
The learning machine that determines the APPROXIMATION FUNCTION (regression) or the SEPARATION BOUNDARY (classification, pattern recognition) is the same in high-dimensional data sets.

Neural Network and Support Vector Machine
Both are described by the same model (RBF = radial basis function):

F(x) = \sum_{j=1}^{J} w_j \varphi_j(\mathbf{x}, \mathbf{c}_j, \Sigma_j)

[Figure: identical network diagrams for the NN and the SVM, with inputs x_1 ... x_n, hidden-layer basis functions y_1 ... y_J, prescribed hidden-layer parameters v, and learned output weights w.]
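A minimal sketch of the shared model F(x) = sum_j w_j phi_j(x, c_j, Sigma_j) with Gaussian radial basis functions. The target function, the fixed centers c_j, and the common width sigma are assumptions; with those fixed, finding the output weights is again a linear least-squares problem:

```python
# Minimal sketch: an RBF expansion with prescribed centers and width,
# so only the output weights w_j are learned (linearly).
import numpy as np

def rbf(x, c, sigma):
    # Gaussian radial basis function phi(x, c, sigma)
    return np.exp(-(x - c) ** 2 / (2.0 * sigma ** 2))

x = np.linspace(-3.0, 3.0, 120)
d = np.sinc(x)                              # hypothetical target function

centers = np.linspace(-3.0, 3.0, 10)        # assumed fixed centers c_j
sigma = 0.6                                 # assumed common width
Phi = np.column_stack([rbf(x, c, sigma) for c in centers])

w, *_ = np.linalg.lstsq(Phi, d, rcond=None) # output weights
F = Phi @ w                                 # model output F(x)
print("max abs error:", np.max(np.abs(F - d)).round(3))
```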
Comparison: NN vs SVM

F(x) = \sum_{j=1}^{J} w_j \varphi_j(\mathbf{x}, \mathbf{c}_j, \Sigma_j)

There are no structural differences between NNs and SVMs, i.e., they have the same representational capacity. There are, however, important differences in LEARNING.

Note
Related terms:
- Identification
- Estimation
- Regression
- Classification
- Pattern recognition
- Function approximation
- Curve fitting
- Surface fitting
- etc.

Question
Classical Regression
The classical regression and (Bayesian) classification statistical techniques are based on the strict assumption that the probability distribution models (probability-density functions) are known.

Statistical Inference
- Data can be modeled by a set of linear-in-parameters functions (e.g., linear regression); this is a foundation of the parametric paradigm in learning from experimental data.
- The assumption of a normal probability distribution law, i.e., the underlying joint probability distribution is Gaussian.
- Due to the second assumption above, the induction paradigm for parameter estimation is the maximum likelihood method, which reduces to the minimization of the sum-of-errors-squares cost function in most engineering applications (see the derivation below).

Why SVM?
The three assumptions of the classical statistical paradigm are too strict for many contemporary real-life problems (Vapnik 1998).

Reasons for SVMs
- Modern problems are of high dimensionality (many features). The underlying mapping is often not smooth, and therefore the linear paradigm calls for an exponential increase in the number of terms with an increasing dimensionality of the input space X, i.e., with an increase in the number of independent variables. This is known as the curse of dimensionality.
- The underlying application data generation laws may not follow the normal distribution function, and a model-builder must consider this in the construction of an effective learning algorithm.
- From the first two reasons it follows that the maximum likelihood estimator (and consequently the sum-of-errors-squares cost function) should be replaced by a new induction paradigm that is uniformly better, in order to model non-Gaussian distributions.
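The reduction of maximum likelihood to the sum-of-errors-squares cost mentioned above is the standard derivation (not spelled out on the slides): with i.i.d. Gaussian noise, maximizing the likelihood is the same as minimizing the sum of squared errors.

```latex
% Maximum likelihood under the Gaussian assumption (standard derivation).
% Model: d_i = f(x_i, \mathbf{w}) + e_i, with e_i \sim N(0, \sigma^2) i.i.d.
\begin{align*}
L(\mathbf{w}) &= \prod_{i=1}^{l} \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{(d_i - f(x_i,\mathbf{w}))^2}{2\sigma^2}\right)\\
-\ln L(\mathbf{w}) &= \frac{l}{2}\ln(2\pi\sigma^2)
  + \frac{1}{2\sigma^2}\sum_{i=1}^{l}\bigl(d_i - f(x_i,\mathbf{w})\bigr)^2\\
\hat{\mathbf{w}}_{\mathrm{ML}}
  &= \arg\min_{\mathbf{w}} \sum_{i=1}^{l}\bigl(d_i - f(x_i,\mathbf{w})\bigr)^2
\end{align*}
```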
It Is Also True That 1(2)
The probability-density functions are unknown, and a question arises: HOW TO PERFORM a distribution-free REGRESSION or CLASSIFICATION?

It Is Also True That 2(2)
The available EXPERIMENTAL DATA (examples, training patterns, samples, observations, records) are high-dimensional and scarce. High-dimensional spaces are often terrifyingly empty, and the learning algorithms (i.e., machines) should be able to operate in such spaces and to learn from sparse data. There is an old saying that redundancy provides knowledge. Stated simply, the more data at hand, the better the results will be.

Terrifying emptiness and/or data sparseness
Consider the 1D y = f(x), 2D z = f(x, y), and 3D u = f(x, y, z) functions, each sampled at 10 points in the domain (0, 1). The density of the sampled space decreases as the dimensionality increases, and the average distance between the points increases with the dimensionality! (See the sketch below.)

[Figure: Illustrative Example - the same 10 samples plotted in 1D, 2D, and 3D.]

Error Analysis
[Figure: dependency of the modeling error on the size l of the training data set - small, medium, and large samples - for a noisy and a noiseless data set; both error curves approach a final error as the data size grows.]
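A minimal numerical sketch of the 10-sample example above (the random seed and the choice of the nearest-neighbor distance as the sparsity measure are assumptions):

```python
# Minimal sketch: 10 random samples in the unit domain of dimension 1, 2, 3.
# The average nearest-neighbor distance grows with the dimensionality D.
import numpy as np

rng = np.random.default_rng(42)
n = 10                                    # samples, as on the slide
for D in (1, 2, 3):
    pts = rng.random((n, D))              # 10 points in (0, 1)^D
    # pairwise distances; ignore the zero self-distances on the diagonal
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    print(f"D={D}: mean nearest-neighbor distance = {dist.min(axis=1).mean():.3f}")
```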
Error Analysis
Glivenko-Cantelli-Kolmogorov results. The Glivenko-Cantelli theorem states that the empirical distribution function converges to the true one, P_emp(x) -> P(x), as the number of data l -> infinity (see the numerical sketch below). However, for both regression and classification we need the probability density functions p(x), i.e., p(x|omega), rather than the distribution P(x).

Models
Nonlinear and nonparametric models, illustrated by NNs and SVMs, are discussed. Nonlinear implies:
1) The model class is not restricted to linear input-output maps, and
2) The cost function that measures the goodness of a model is nonlinear with respect to the unknown parameters.

Models
Nonparametric does not imply that the models have no parameters at all. On the contrary, parameter learning (meaning selection, identification, estimation, fitting, or tuning) is the crucial issue here.

Models
However, unlike in the classical statistical inference, the parameters are not predefined; rather, their number depends on the training data used. In other words, the parameters that define the capability of the model are data-driven in such a way as to match the model capability with the data complexity. This is a basic paradigm of the structural risk minimization (SRM) approach introduced by Vapnik and Chervonenkis and their coworkers.
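A minimal sketch of the Glivenko-Cantelli statement above, assuming standard normal samples for illustration: the sup-norm gap between P_emp and P shrinks as l grows.

```python
# Minimal sketch: the empirical distribution function P_emp converges
# uniformly to the true distribution P as the number of data l grows.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
grid = np.linspace(-4.0, 4.0, 400)
for l in (10, 100, 10000):
    sample = rng.standard_normal(l)
    # empirical distribution function evaluated on the grid
    P_emp = (sample[None, :] <= grid[:, None]).mean(axis=1)
    gap = np.max(np.abs(P_emp - norm.cdf(grid)))   # sup |P_emp - P|
    print(f"l={l}: sup-norm gap = {gap:.3f}")
```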
CLASSIFICATION (PATTERN RECOGNITION) EXAMPLE
Assume normally distributed classes with the same covariance matrices. The solution is easy: the decision boundary is linear and defined by the parameter vector w = X*D when there is plenty of data (approaching infinity). X* denotes the PSEUDOINVERSE of the data matrix X, and D is the vector of desired outputs (d1 = +1 for class 1, d2 = -1 for class 2); the boundary is the line w1 x1 + w2 x2 + w3 = 0 (a code sketch of this computation follows below). Note that this solution follows from the last two assumptions of classical inference: Gaussian data and minimization of the sum-of-errors-squares.

Example (1)
[Figure: a scatter plot of ten training points, five per class, listed in the data matrix X; each row holds x1, x2, and a bias entry of 1.0000, and the desired outputs are D = [+1 +1 +1 +1 +1 -1 -1 -1 -1 -1]^T.]
w = X*D yields w_opt = [-0.5209 -0.5480 6.973]^T.

Example (2)
However, for a small sample, the solution defined by w = X*D is NO LONGER A GOOD ONE. For this data set, the separation boundary x2 = -0.95 x1 + 12.725 is obtained.
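A minimal sketch of the slides' pseudoinverse solution w = X*D. The Gaussian blobs stand in for the slide's data (the exact numbers there are not reproduced); the bias column of ones matches the slide's data matrix:

```python
# Minimal sketch: linear decision boundary from the pseudoinverse, w = X*D.
import numpy as np

rng = np.random.default_rng(3)
class1 = rng.normal(loc=[5.5, 5.5], scale=0.4, size=(5, 2))   # d = +1
class2 = rng.normal(loc=[7.5, 7.5], scale=0.4, size=(5, 2))   # d = -1

X = np.vstack([class1, class2])
X = np.column_stack([X, np.ones(len(X))])     # append bias column of ones
D = np.array([1] * 5 + [-1] * 5, dtype=float)

w = np.linalg.pinv(X) @ D                     # w = X* D
print("w =", np.round(w, 4))
# decision boundary: w[0]*x1 + w[1]*x2 + w[2] = 0
```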
Example (3)
For a different data set, another separation line is obtained. Again, for a small sample, the solution defined by w = X*D is NO LONGER A GOOD ONE. What is common to both separation lines, the red and the blue one? Both have a SMALL MARGIN.

WHAT'S WRONG WITH A SMALL MARGIN?
Look at the BLUE line! It is very likely that new examples will be wrongly classified.

SVM
The question is how to FIND the OPTIMAL SEPARATION HYPERPLANE GIVEN (scarce) DATA SAMPLES?

The STATISTICAL LEARNING THEORY IS DEVELOPED TO SOLVE PROBLEMS of FINDING THE OPTIMAL SEPARATION HYPERPLANE for small samples. The OPTIMAL SEPARATION HYPERPLANE is the one that has the LARGEST MARGIN on the given DATA SET (a code sketch follows below).
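A minimal sketch of finding the largest-margin hyperplane, using scikit-learn's linear SVC as a stand-in for the QP-based learning named in the outline. The toy data and the large C value (to approximate a hard margin on separable data) are assumptions:

```python
# Minimal sketch: a maximal margin classifier on linearly separable data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
class1 = rng.normal(loc=[1.0, 1.0], scale=0.3, size=(20, 2))   # y = +1
class2 = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(20, 2))   # y = -1
X = np.vstack([class1, class2])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)               # geometric margin width
print("support vectors:", len(clf.support_), " margin:", round(margin, 3))
```

Only the points lying on the margin (the support vectors) determine the solution; the remaining data could be removed without changing the hyperplane.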
MAXIMAL MARGIN CLASSIFIER
The maximal margin classifier is an alternative to the perceptron:
- it also assumes that the data are linearly separable
- it aims at finding the separating hyperplane with the maximal geometric margin (and not just any separating hyperplane, which is typical of perceptron solutions)

SVM
[Figure: two classes (Class 1, y = +1; Class 2, y = -1) in the (x1, x2) plane, shown with a small-margin and a large-margin separating line, i.e., decision boundary, i.e., hyperplane.]
The larger the margin, the smaller the probability of misclassification.

Reference
V. Kecman, Learning and Soft Computing, MIT Press, Cambridge, MA, 2001.