Content. Learning Goal. Regression vs Classification. Support Vector Machines. SVM Context


Content

Andrew Kusiak, Intelligent Systems Laboratory, 2139 Seamans Center, The University of Iowa, Iowa City, IA 52242-1527, andrew-kusiak@uiowa.edu, http://www.icaen.uiowa.edu/~ankusiak (based on the material provided by Professor V. Kecman)

- Introduction to learning from examples
- Support Vector Machines vs. neural networks
- Quadratic programming (QP)-based learning
- Linear programming (LP)-based learning
- Regression and classification by linear programming
- Examples

Learning Goal

Learning from data, i.e., examples, samples, measurements, records, observations, patterns: getting the data, transferring it, filtering it, compressing it, using it, reusing it, etc.

Regression vs Classification

Regression, a.k.a. function approximation, and classification, a.k.a. pattern recognition.

Support Vector Machines

- SVMs for multi-class problems (Weston and Watkins 1998; Kindermann and Paass)
- SVMs for density estimation (Smola and Schoelkopf 1998)
- The theory of VC bounds (Vapnik 1995 and 1998)

SVM Context

The relationship between SVMs, NNs, and classical techniques such as Fourier series and polynomial approximations.

Fourier Series

Fourier Series Represented in NN Form: the AMPLITUDES and PHASES of the sine (cosine) waves are not known, but the frequencies are known [because Joseph Fourier has selected the frequencies for us], and they are INTEGER multiples of some pre-selected base frequency:

F(x) = Σ_k a_k sin(kx), or Σ_k b_k cos(kx), or both.

In the NN form the hidden-layer weights v_ji, i.e., the frequencies, are prescribed, so learning the output-layer weights (the amplitudes) is linear. Note: learning the frequencies is nonlinear.

Example (1)

Assume that the following model, y = 2.5 sin(1.5x), is to be learned as the Fourier series model o = y = w2 sin(w1 x).

Example (2)

Known: the function is a sine. Not known: the frequency and the amplitude of o = y = w2 sin(w1 x). [Diagram: a network with a single hidden neuron, x -> net_HL -> o_HL -> net_o -> o, trained on the error o - d.]

Example (3)

Use an NN model with a single neuron in the hidden layer (having a sine as the activation function), use the training data set {x, d}, and learn the Fourier series model o = y = w2 sin(w1 x).

Example (4)

The cost function is J = sum(e^2) = sum((d - o)^2) = sum((d - w2 sin(w1 x))^2). [Figure: the dependence of the cost function J upon the amplitude A (dashed) and the frequency w (solid).]

Example (5)

[Figure: the cost surface J over the weight vector w = [A; w], i.e., the amplitude A = w2 and the frequency w1.]
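To make the linear/nonlinear distinction concrete, here is a minimal Python sketch (my own addition, not from the slides; the target amplitude 2.5 and frequency 1.5 follow the example as reconstructed above). For each candidate frequency the optimal amplitude is a closed-form linear least-squares solution, while the frequency itself has to be found by a nonlinear search, here a plain grid scan over the cost J.

```python
# A minimal sketch: learn o = w2*sin(w1*x) from samples of y = 2.5*sin(1.5*x).
# For a FIXED frequency w1 the optimal amplitude w2 is a linear least-squares
# solution; the frequency must be found by a nonlinear search (grid scan here).
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)
d = 2.5 * np.sin(1.5 * x)                      # desired outputs

best = (np.inf, None, None)                    # (cost J, w1, w2)
for w1 in np.linspace(0.1, 3.0, 300):          # nonlinear parameter: scan it
    phi = np.sin(w1 * x)                       # fixed basis for this frequency
    w2 = (phi @ d) / (phi @ phi)               # closed-form linear LS amplitude
    J = np.sum((d - w2 * phi) ** 2)            # sum-of-error-squares cost
    if J < best[0]:
        best = (J, w1, w2)

J, w1, w2 = best
print(f"J = {J:.4f} at frequency w1 = {w1:.3f}, amplitude w2 = {w2:.3f}")
```

Scanning the frequency axis is what the wavy J-versus-frequency plot on the slide suggests: the cost is convex in the amplitude but has many local minima along the frequency.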

Example: Polynomial Approximation

F(x) = Σ_{i=0}^N w_i x^i, where the basis functions x^i are prescribed. Prescribing the (integer) exponents results in a LINEAR APPROXIMATION SCHEME: the model is linear in the parameters w_i to be learned, but not in terms of the resulting approximation function, which is nonlinear (NL) for i > 1.

Example: RBF Approximation

RBF = radial basis function. Approximation of a 1D NL function by a Gaussian radial basis function (RBF) network:

F(x) = Σ_{i=1}^N w_i φ_i(x, c_i).

In the 1D case, ignore the two inputs shown in the network diagram; they only denote that the basic structure of the NN is the same for an ANY-DIMENSIONAL INPUT. [Figure: a 1D nonlinear function F(x), known only through the data (measurements, images, records, observations), approximated by a Gaussian RBF NN: bumps φ_1, ..., φ_N with centers c_i and widths σ_i, scaled up by positive weights (w_{i+1} > 0) and down by negative ones (w_{i+2} < 0).]

SVMs and NNs

The learning machine that determines the APPROXIMATION FUNCTION (regression) or the SEPARATION BOUNDARY (classification, pattern recognition) is the same for high-dimensional data sets. For FIXED Gaussian RBFs, LEARNING is LINEAR. If the centers and the covariance matrices are to be learned as well, the problem becomes NONLINEAR (extremely difficult).
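The linearity of learning with fixed RBFs fits in a few lines; the sketch below (my own, assuming hand-picked centers, a common width, and np.sinc as a stand-in target) fits the weights by ordinary least squares via the pseudoinverse.

```python
# A minimal sketch of approximating a 1D nonlinear function by a Gaussian RBF
# network F(x) = sum_i w_i * phi_i(x, c_i). With the centers c_i and width sigma
# FIXED, learning the weights w is LINEAR: an ordinary least-squares problem.
import numpy as np

def gaussian_design(x, centers, sigma):
    """One Gaussian bump phi_i(x) = exp(-(x - c_i)^2 / (2 sigma^2)) per column."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * sigma**2))

x = np.linspace(-3.0, 3.0, 60)
y = np.sinc(x)                                  # some 1D nonlinear target function

centers = np.linspace(-3.0, 3.0, 10)            # fixed RBF centers
Phi = gaussian_design(x, centers, sigma=0.7)
w = np.linalg.pinv(Phi) @ y                     # linear learning: w = Phi^+ y

y_hat = Phi @ w
print(f"max approximation error: {np.max(np.abs(y - y_hat)):.4f}")
```

Moving the centers c_i or the width σ into the trainable parameters would turn this into the nonlinear problem the slide warns about.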

Neural Network vs Support Vector Machine

A neural network and a support vector machine implement the same model,

F(x) = Σ_{j=1}^J w_j φ_j(x, c_j, Σ_j).

There are no structural differences between NNs and SVMs, i.e., no differences in representational capacity. The important differences are in LEARNING.

Note

Identification, estimation, regression, classification, pattern recognition, function approximation, curve fitting, surface fitting, etc. Question: do the new learning concepts differ from classical statistical inference?

Classical Regression

The classical regression and (Bayesian) classification statistical techniques are based on the strict assumption that the probability distribution models (probability-density functions) are known.

Statistical Inference

- Data can be modeled by a set of linear-in-parameters functions; this is the foundation of the parametric paradigm in learning from experimental data.
- In most real-life problems, the stochastic component of the data follows the normal probability distribution law, i.e., the underlying joint probability distribution is Gaussian.
- Due to the second assumption, the induction paradigm for parameter estimation is the maximum likelihood method, which in most engineering applications reduces to the minimization of the sum-of-errors-squares cost function.

Why SVM?

All three assumptions of the classical statistical paradigm are inappropriate for many contemporary real-life problems (Vapnik 1998). The reasons for SVMs:

- Modern problems are high-dimensional, and the underlying mapping is often not smooth, so the linear paradigm calls for an exponentially increasing number of terms with an increasing dimensionality of the input space X, i.e., with an increase in the number of independent variables. This is known as the curse of dimensionality.
- The underlying data-generation laws may not follow the normal distribution, and a model builder must consider this in the construction of an effective learning algorithm.
- From the first two reasons it follows that the maximum likelihood estimator (and consequently the sum-of-errors-squares cost function) should be replaced by a new induction paradigm that is uniformly better, in order to model non-Gaussian distributions.

It Is Also True That (1)

The probability-density functions are unknown, and a question arises: HOW TO PERFORM a distribution-free REGRESSION or CLASSIFICATION?

It Is Also True That (2)

What is available are EXPERIMENTAL DATA (examples, training patterns, samples, observations, records), which are high-dimensional and scarce. High-dimensional spaces are often terrifyingly empty, and learning algorithms (i.e., machines) should be able to operate in such spaces and to learn from sparse data. There is an old saying that redundancy provides knowledge. Stated more simply: the more data at hand, the better the results.
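To back up the claim in the Statistical Inference list that maximum likelihood reduces to the sum-of-errors-squares cost, here is the standard derivation (my addition, not on the slides) for i.i.d. Gaussian noise of variance σ²:

```latex
% Model: d_i = f(x_i, w) + \varepsilon_i, \varepsilon_i ~ N(0, \sigma^2) i.i.d., l data pairs.
\begin{align}
  L(w) &= \prod_{i=1}^{l} \frac{1}{\sqrt{2\pi}\,\sigma}
          \exp\!\Big(-\frac{(d_i - f(x_i, w))^2}{2\sigma^2}\Big), \\
  -\ln L(w) &= \frac{1}{2\sigma^2}\sum_{i=1}^{l} \big(d_i - f(x_i, w)\big)^2
               + l\,\ln\!\big(\sqrt{2\pi}\,\sigma\big), \\
  \hat{w}_{\mathrm{ML}} &= \arg\min_{w} \sum_{i=1}^{l} \big(d_i - f(x_i, w)\big)^2 .
\end{align}
```

The constant term does not depend on w, so maximizing the likelihood is exactly minimizing the sum of squared errors; the reduction breaks down as soon as the noise is not Gaussian.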

Terrifying Emptiness and/or Data Sparseness

Consider a 1D function y = f(x), a 2D function z = f(x, y), and a 3D function u = f(x, y, z), each given by the same number of samples (points) over its domain. Illustrative example: the density of the samples in the 1D, 2D, and 3D cases decreases as the dimension increases, and the average distance between the points increases with the dimensionality!

Error Analysis

[Figure: the dependence of the modeling error on the size l of the training data set; for both a noisy and a noiseless data set the error falls from the small-sample through the medium-sample to the large-sample regime, approaching a final error that is higher for the noisy set.]

Error Analysis: Glivenko-Cantelli-Kolmogorov Results

The Glivenko-Cantelli theorem states that the empirical distribution function P_emp(x) -> P(x) as the number of data l -> infinity. However, for both regression and classification we need probability-density functions p(x), i.e., p(x|w), rather than the distribution P(x). Therefore, a question arises whether the probability density p_emp(x) -> p(x) as the number of data l -> infinity. The answer is neither straightforward nor guaranteed, despite the fact that integrating p(x) yields P(x). (An analogy to the classical inverse problem: Ax = y, x = ?, x = A^(-1) y.)

The theory of UNIFORM CONVERGENCE is needed for the set of functions implemented by a model, i.e., by a learning machine (Vapnik and Chervonenkis, in the 1960s and 1970s). Nonlinear and nonparametric models, as illustrated by NNs and SVMs, are discussed. Nonlinear implies: 1) the model class is not restricted to linear input-output maps, and 2) the cost function that measures the goodness of a model is nonlinear with respect to the unknown parameters.
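A quick numerical sketch of this emptiness (my own; the sample size of 10 points and the unit-cube domain are assumptions, since the slide's numbers were lost in transcription): holding the number of points fixed while raising the dimension makes the average nearest-neighbour distance grow.

```python
# Illustration of the "terrifying emptiness": for a fixed number of points l,
# the average nearest-neighbour distance grows with the dimensionality.
import numpy as np

rng = np.random.default_rng(1)
l = 10                                          # assumed sample size

for dim in (1, 2, 3, 10):
    pts = rng.uniform(0.0, 1.0, size=(l, dim))  # l points in the unit cube [0, 1]^dim
    # pairwise Euclidean distances, with the diagonal masked out
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    print(f"dim = {dim:2d}: mean nearest-neighbour distance = {dist.min(axis=1).mean():.3f}")
```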

Note that the second nonlinearity, that of the cost function in the parameters, is the component of modeling that causes most of the computational problems. Nonparametric does not imply that the models have no parameters at all. On the contrary, parameter learning (meaning selection, identification, estimation, fitting, or tuning) is the crucial issue here. However, unlike in classical statistical inference, the parameters are not predefined; rather, their number depends on the training data used. In other words, the parameters that define the capacity of the model are data-driven in such a way as to match the model capacity to the data complexity. This is the basic paradigm of structural risk minimization (SRM), introduced by Vapnik and Chervonenkis and their coworkers.

The main characteristic of all MODERN problems is the mapping between high-dimensional spaces.

Pattern Recognition

Gender recognition problem: are these two faces female or male? F or M? M or F? Each face is represented by 18 input variables (features). (Problem from Brunelli and Poggio 1993.)

Approximation and classification are the same for any dimensionality of the input space. Nothing but the size changes, but the change is DRAMATIC. High dimensionality means both an EXPLOSION in the number OF PARAMETERS to learn and SPARSITY of the training data set. High-dimensional spaces appear to be terrifyingly empty.

[Figure: approximation and classification for a plant with inputs T and P and output H.] For a single input, N data points may suffice; however, for two inputs (T and P), are 2N or N^2 data needed? For n inputs, the required amount of data grows toward N^n.

CURSE of DIMENSIONALITY and SPARSITY OF DATA

The recent promising tool FOR WORKING UNDER THESE CONSTRAINTS is the SUPPORT VECTOR MACHINE, based on STATISTICAL LEARNING THEORY (VAPNIK and CHERVONENKIS). WHAT IS THE contemporary BASIC LEARNING PROBLEM? LEARN THE DEPENDENCY (FUNCTION, MAPPING) from SPARSE DATA, under NOISE, in a HIGH-DIMENSIONAL SPACE! Recall: redundancy provides the knowledge. A lot of data makes for an easy problem.

CURSE of DIMENSIONALITY

Illustrate THE IMPACT OF THE DATA SET SIZE ON THE SIMPLEST RECOGNITION PROBLEM: BINARY CLASSIFICATION, i.e., DICHOTOMIZATION.

CLASSIFICATION (PATTERN RECOGNITION) EXAMPLE

Assume normally distributed classes with the same covariance matrices. The solution is easy: the decision boundary is linear and defined by the parameter vector w = X^+ D in the case where there is plenty of data (approaching infinity). X^+ denotes the PSEUDOINVERSE of the data matrix X, and D holds the desired outputs, d = +1 for one class and d = -1 for the other.
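A minimal Python sketch of this example (my own construction; the slides show a figure, not code): two Gaussian classes with equal covariance and plenty of data, and the linear decision boundary obtained from w = X^+ D.

```python
# Two normally distributed classes with equal covariance; the linear decision
# boundary comes from w = X^+ D, where X^+ is the pseudoinverse of the data
# matrix and D holds the desired outputs +1 / -1.
import numpy as np

rng = np.random.default_rng(2)
n = 100                                          # plenty of data -> a good boundary
class1 = rng.normal(loc=[+1.0, +1.0], scale=0.5, size=(n, 2))   # d = +1
class2 = rng.normal(loc=[-1.0, -1.0], scale=0.5, size=(n, 2))   # d = -1

# augment the inputs with a constant 1 so that w also carries the bias term
X = np.hstack([np.vstack([class1, class2]), np.ones((2 * n, 1))])
D = np.hstack([np.ones(n), -np.ones(n)])

w = np.linalg.pinv(X) @ D                        # w = X^+ D
pred = np.sign(X @ w)
print(f"w = {w}, training accuracy = {(pred == D).mean():.2%}")
```

With an abundant sample the boundary lands near the optimal one between the class means; the next page shows how this breaks down for small samples.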

CLASSIFICATION (PATTERN RECOGNITION) EXAMPLE

Note that this solution follows from the last two assumptions of classical inference: Gaussian data and minimization of the sum-of-errors-squares, which together yield w = X^+ D.

Example 1(3): [Worked example: for a given small data matrix X, w_opt = X^+ D defines the separation boundary; the numerical values were lost in transcription.] For a small sample, the solution defined by w = X^+ D is NO LONGER A GOOD ONE, because for this data set the separation line shown is obtained.

Examples 2(3) and 3(3): For a different data set, another separation line is obtained. Again, for a small sample, the solution defined by w = X^+ D is NO LONGER A GOOD ONE.

What is common to both separation lines, the red and the blue one? Both have a SMALL MARGIN. WHAT'S WRONG WITH A SMALL MARGIN? Look at the BLUE line: it is very likely that new examples (x1, x2) will be wrongly classified.
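A small numerical illustration of this point (my own; the slides' data sets were lost in transcription, so the four points below are hypothetical): on a small sample, w = X^+ D still separates the classes, but with a visibly smaller margin than the best separating line.

```python
# On a small sample the least-squares solution w = X^+ D separates the classes
# but can leave a needlessly SMALL margin.
import numpy as np

X = np.array([[0.5, 0.0], [4.0, 4.0], [-0.5, 0.0], [-4.0, -4.0]])
D = np.array([1.0, 1.0, -1.0, -1.0])

Xa = np.hstack([X, np.ones((4, 1))])             # augment inputs with a bias term
w = np.linalg.pinv(Xa) @ D                       # least-squares solution w = X^+ D

# geometric margin: smallest signed distance of the samples to the boundary
margin = np.min(D * (Xa @ w)) / np.linalg.norm(w[:2])
print(f"w = X^+ D gives {w}, margin = {margin:.3f}")
print("the vertical line x1 = 0 would give margin = 0.500 on the same sample")
```

Here the least-squares line is tilted by the two far-away points, shrinking the margin from the achievable 0.5 to about 0.38.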

SVM

STATISTICAL LEARNING THEORY IS DEVELOPED TO SOLVE THE PROBLEM of FINDING THE OPTIMAL SEPARATION HYPERPLANE for small samples. The question is how to FIND the OPTIMAL SEPARATION HYPERPLANE GIVEN (scarce) DATA SAMPLES.

MAXIMAL MARGIN CLASSIFIER

The maximal margin classifier is an alternative to the perceptron: it also assumes that the data are linearly separable, but it aims at finding the separating hyperplane with the maximal geometric margin (and not just any one, which is typical of perceptron solutions). The OPTIMAL SEPARATION HYPERPLANE is the one that has the LARGEST MARGIN on the given DATA SET.

[Figure: two classes, y = +1 and y = -1, with two separating lines, i.e., decision boundaries (hyperplanes): one with a small margin and one with a large margin.] The larger the margin, the smaller the probability of misclassification.
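A hedged sketch of the maximal margin classifier (my addition; the slides name no software, and scikit-learn is assumed to be available): a linear SVM with a very large C approximates the hard-margin optimal separating hyperplane on linearly separable data.

```python
# A linear SVM with a large C approximates the hard-margin maximal margin
# classifier on linearly separable data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
class1 = rng.normal(loc=[+2.0, +2.0], scale=0.4, size=(20, 2))   # y = +1
class2 = rng.normal(loc=[-2.0, -2.0], scale=0.4, size=(20, 2))   # y = -1
X = np.vstack([class1, class2])
y = np.hstack([np.ones(20), -np.ones(20)])

svm = SVC(kernel="linear", C=1e6).fit(X, y)       # large C ~ hard margin
w, b = svm.coef_[0], svm.intercept_[0]
margin = 2.0 / np.linalg.norm(w)                  # geometric margin width
print(f"hyperplane: w = {w}, b = {b:.3f}")
print(f"margin = {margin:.3f}, support vectors: {len(svm.support_vectors_)}")
```

Only the points closest to the boundary end up as support vectors; they alone determine the optimal separating hyperplane.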
