Statistical Learning Theory: A Primer

International Journal of Computer Vision 38(1), 9–13, 2000.
© 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.

THEODOROS EVGENIOU, MASSIMILIANO PONTIL AND TOMASO POGGIO
Center for Biological and Computational Learning, Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA
theos@ai.mit.edu
pontil@ai.mit.edu
tp@ai.mit.edu

Abstract. In this paper we first overview the main concepts of Statistical Learning Theory, a framework in which learning from examples can be studied in a principled way. We then briefly discuss well known as well as emerging learning techniques such as Regularization Networks and Support Vector Machines which can be justified in terms of the same induction principle.

Keywords: VC-dimension, structural risk minimization, regularization networks, support vector machines

1. Introduction

The goal of this paper is to provide a short introduction to Statistical Learning Theory (SLT), which studies problems and techniques of supervised learning. For a more detailed review of SLT see Evgeniou et al. (1999). In supervised learning, or learning-from-examples, a machine is trained, instead of programmed, to perform a given task on a number of input-output pairs. According to this paradigm, training means choosing a function which best describes the relation between the inputs and the outputs. The central question of SLT is how well the chosen function generalizes, that is, how well it estimates the output for previously unseen inputs.

We will consider techniques which lead to solutions of the form

    f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i),    (1)

where the x_i, i = 1, ..., ℓ, are the input examples, K a certain symmetric positive definite function named kernel, and c_i a set of parameters to be determined from the examples. This function is found by minimizing functionals of the type

    H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2,

where V is a loss function which measures the goodness of the predicted output f(x_i) with respect to the given output y_i, \|f\|_K^2 is a smoothness term which can be thought of as a norm in the Reproducing Kernel Hilbert Space defined by the kernel K, and λ is a positive parameter which controls the relative weight between the data and the smoothness term. The choice of the loss function determines different learning techniques, each leading to a different learning algorithm for computing the coefficients c_i.
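
For illustration, Eq. (1) and the functional H[f] can be evaluated with a short numerical sketch. The sketch below (in Python) assumes a Gaussian kernel and the square loss for V, and uses the standard identity \|f\|_K^2 = c^T G c for functions of the form (1), where G is the matrix of kernel values G_ij = K(x_i, x_j); these choices are only for concreteness and are not prescribed by the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-dimensional training set (x_i, y_i), i = 1, ..., l (illustration only).
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

def K(a, b, beta=10.0):
    # Gaussian kernel K(x, y) = exp(-beta * ||x - y||^2), symmetric and positive definite.
    return np.exp(-beta * (a[:, None] - b[None, :]) ** 2)

G = K(x, x)                          # matrix of kernel values, G_ij = K(x_i, x_j)
c = rng.standard_normal(len(x))      # some coefficient vector c_i (not yet optimized)

def f(t):
    # Eq. (1): f(t) = sum_i c_i K(t, x_i)
    return K(np.atleast_1d(t), x) @ c

lam = 0.1                                        # regularization parameter lambda
empirical_term = np.mean((y - f(x)) ** 2)        # (1/l) sum_i V(y_i, f(x_i)), square loss
norm_term = c @ G @ c                            # ||f||_K^2 = c^T G c for f of the form (1)
H = empirical_term + lam * norm_term             # the functional H[f]
print(H)
```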

The rest of the paper is organized as follows. Section 2 presents the main ideas and concepts of the theory. Section 3 discusses Regularization Networks and Support Vector Machines, two important techniques which produce outputs of the form of Eq. (1).

2. Statistical Learning Theory

We consider two sets of random variables x ∈ X ⊆ R^d and y ∈ Y ⊆ R related by a probabilistic relationship. The relationship is probabilistic because generally an element of X does not determine uniquely an element of Y, but rather a probability distribution on Y. This can be formalized by assuming that an unknown probability distribution P(x, y) is defined over the set X × Y. We are provided with examples of this probabilistic relationship, that is with a data set D_ℓ ≡ {(x_i, y_i) ∈ X × Y, i = 1, ..., ℓ} called the training set, obtained by sampling ℓ times the set X × Y according to P(x, y). The problem of learning consists in, given the data set D_ℓ, providing an estimator, that is a function f: X → Y, that can be used, given any value of x ∈ X, to predict a value y.

For example, X could be the set of all possible images, Y the set {−1, 1}, and f(x) an indicator function which specifies whether image x contains a certain object (y = 1) or not (y = −1) (see for example Papageorgiou et al. (1998)). Another example is the case where x is a set of parameters, such as pose or facial expressions, y is a motion field relative to a particular reference image of a face, and f(x) is a regression function which maps parameters to motion (see for example Ezzat and Poggio (1996)).

In SLT, the standard way to solve the learning problem consists in defining a risk functional, which measures the average amount of error or risk associated with an estimator, and then looking for the estimator with the lowest risk. If V(y, f(x)) is the loss function measuring the error we make when we predict y by f(x), then the average error, the so called expected risk, is:

    I[f] \equiv \int_{X,Y} V(y, f(x)) P(x, y) \, dx \, dy.

We assume that the expected risk is defined on a large class of functions F and we will denote by f_0 the function which minimizes the expected risk in F. The function f_0 is our ideal estimator, and it is often called the target function. This function cannot be found in practice, because the probability distribution P(x, y) that defines the expected risk is unknown, and only a sample of it, the data set D_ℓ, is available.

To overcome this shortcoming we need an induction principle that we can use to learn from the limited number of training data we have. SLT, as developed by Vapnik (Vapnik, 1998), builds on the so-called empirical risk minimization (ERM) induction principle. The ERM method consists in using the data set D_ℓ to build a stochastic approximation of the expected risk, which is usually called the empirical risk and is defined as:

    I_{emp}[f; \ell] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)).

Straight minimization of the empirical risk in F can be problematic. First, it is usually an ill-posed problem (Tikhonov and Arsenin, 1977), in the sense that there might be many, possibly infinitely many, functions minimizing the empirical risk. Second, it can lead to overfitting, meaning that although the minimum of the empirical risk can be very close to zero, the expected risk, which is what we are really interested in, can be very large.

SLT provides probabilistic bounds on the distance between the empirical and expected risk of any function (therefore including the minimizer of the empirical risk in a function space that can be used to control overfitting). The bounds involve the number of examples ℓ and the capacity h of the function space, a quantity measuring the complexity of the space. Appropriate capacity quantities are defined in the theory, the most popular one being the VC-dimension (Vapnik and Chervonenkis, 1971) or scale-sensitive versions of it (Kearns and Schapire, 1994; Alon et al., 1993). The bounds have the following general form: with probability at least 1 − η,

    I[f] < I_{emp}[f] + \Phi(h/\ell, \eta),    (2)

where h is the capacity and Φ is an increasing function of h/ℓ and η. For more information and for exact forms of the function Φ we refer the reader to (Vapnik and Chervonenkis, 1971; Vapnik, 1998; Alon et al., 1993). Intuitively, if the capacity of the function space in which we perform empirical risk minimization is very large and the number of examples is small, then the distance between the empirical and expected risk can be large and overfitting is very likely to occur.

Since the space F is usually very large (e.g. F could be the space of square integrable functions), one typically considers smaller hypothesis spaces H. Moreover, inequality (2) suggests an alternative method for achieving good generalization: instead of minimizing the empirical risk, find the best trade off between the empirical risk and the complexity of the hypothesis space, measured by the second term on the r.h.s. of inequality (2). This observation leads to the method of Structural Risk Minimization (SRM).
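
The overfitting phenomenon that inequality (2) quantifies is easy to reproduce numerically. The following minimal sketch (in Python) uses polynomial hypothesis spaces of increasing degree as a proxy for increasing capacity and a large independent sample as a stand-in for the expected risk; the data generating distribution is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Draw n pairs (x, y) from a fixed P(x, y): y = sin(2*pi*x) + noise (illustration only).
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

x_train, y_train = sample(12)      # a small training set D_l
x_test, y_test = sample(10000)     # a large independent sample, approximating I[f]

for degree in (1, 3, 9):           # hypothesis spaces of increasing capacity
    coeff = np.polyfit(x_train, y_train, degree)   # empirical risk minimizer, square loss
    emp_risk = np.mean((np.polyval(coeff, x_train) - y_train) ** 2)
    test_risk = np.mean((np.polyval(coeff, x_test) - y_test) ** 2)
    print(degree, emp_risk, test_risk)

# Typically the empirical risk keeps decreasing with the degree, while the
# held-out risk eventually increases: the gap bounded in (2) grows with capacity.
```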

The idea of SRM is to define a nested sequence of hypothesis spaces H_1 ⊂ H_2 ⊂ ··· ⊂ H_M, where each hypothesis space H_m has finite capacity h_m, larger than that of all the previous sets, that is: h_1 ≤ h_2 ≤ ... ≤ h_M. For example, H_m could be the set of polynomials of degree m, or a set of splines with m nodes, or some more complicated nonlinear parameterization. Using such a nested sequence of more and more complex hypothesis spaces, SRM consists of choosing the minimizer of the empirical risk in the space H_m for which the bound on the structural risk, as measured by the right hand side of inequality (2), is minimized. Further information about the statistical properties of SRM can be found in Devroye et al. (1996) and Vapnik (1998).

To summarize, in SLT the problem of learning from examples is solved in three steps: (a) we define a loss function V(y, f(x)) measuring the error of predicting the output of input x with f(x) when the actual output is y; (b) we define a nested sequence of hypothesis spaces H_m, m = 1, ..., M, whose capacity is an increasing function of m; (c) we minimize the empirical risk in each of the H_m and choose, among the solutions found, the one with the best trade off between the empirical risk and the capacity, as given by the right hand side of inequality (2).

3. Learning Machines

3.1. Learning as Functional Minimization

We now consider hypothesis spaces which are subsets of a Reproducing Kernel Hilbert Space (RKHS) (Wahba, 1990). A RKHS is a Hilbert space of functions f of the form f(x) = \sum_{n=1}^{N} a_n \phi_n(x), where {φ_n(x)}, n = 1, ..., N, is a set of given, linearly independent basis functions and N can possibly be infinite. A RKHS is equipped with a norm which is defined as:

    \|f\|_K^2 = \sum_{n=1}^{N} \frac{a_n^2}{\lambda_n},

where {λ_n}, n = 1, ..., N, is a decreasing, positive sequence of real values whose sum is finite. The constants λ_n and the basis functions {φ_n} define the symmetric positive definite kernel function:

    K(x, y) = \sum_{n=1}^{N} \lambda_n \phi_n(x) \phi_n(y).

A nested sequence of spaces of functions in the RKHS can be constructed by bounding the RKHS norm of the functions in the space. This can be done by defining a set of constants A_1 < A_2 < ··· < A_M and considering spaces of the form:

    H_m = \{ f \in \mathrm{RKHS} : \|f\|_K \le A_m \}.

It can be shown that the capacity of the hypothesis spaces H_m is an increasing function of A_m (see for example Evgeniou et al. (1999)). According to the scheme given at the end of Section 2, the solution of the learning problem is found by solving, for each A_m, the following constrained optimization problem:

    \min_f \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) \quad \text{subject to} \quad \|f\|_K \le A_m,

and choosing, among the solutions found for each A_m, the one with the best trade off between empirical risk and capacity, i.e. the one which minimizes the bound on the structural risk as given by inequality (2).

The implementation of the SRM method described above is not practical because it requires one to look for the solution of a large number of constrained optimization problems. This difficulty is overcome by searching instead for the minimum of:

    H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2.    (3)

The functional H[f] contains both the empirical risk and the norm (complexity or smoothness) of f in the RKHS, similarly to functionals considered in regularization theory (Tikhonov and Arsenin, 1977). The regularization parameter λ penalizes functions with high capacity: the larger λ, the smaller the RKHS norm of the solution will be.
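
This effect of λ can be checked directly. The minimal sketch below (in Python) assumes a small explicit basis φ_n(x) = x^{n-1} with fixed weights λ_n = 1/n², n = 1, ..., 6, and the square loss; under these assumptions the minimization of Eq. (3) reduces to a weighted ridge regression solved by its normal equations, and the RKHS norm of the minimizer can be read off directly.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)
l = len(x)

# Explicit basis phi_n(x) = x^(n-1) with weights lambda_n = 1/n^2, n = 1, ..., 6
# (an arbitrary choice, only to illustrate the effect of the regularization parameter).
N = 6
Phi = np.vander(x, N, increasing=True)        # Phi[i, n-1] = phi_n(x_i)
lam_n = 1.0 / np.arange(1, N + 1) ** 2        # decreasing, summable sequence lambda_n
D_inv = np.diag(1.0 / lam_n)

for lam in (1e-4, 1e-2, 1.0):
    # Minimize (1/l) sum_i (y_i - f(x_i))^2 + lam * sum_n a_n^2 / lambda_n
    # over the coefficients a_n; the normal equations give the solution directly.
    a = np.linalg.solve(Phi.T @ Phi / l + lam * D_inv, Phi.T @ y / l)
    print(lam, a @ D_inv @ a)                 # ||f||_K^2 = sum_n a_n^2 / lambda_n

# The RKHS norm of the minimizer shrinks as lambda grows.
```
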
When implementing SRM, the key issue is the choice of the hypothesis space, i.e. of the index m of the space H_m in which the structural risk is minimized. In the case of the functional of Eq. (3), the key issue becomes the choice of the regularization parameter λ. These two problems, as discussed in Evgeniou et al. (1999), are related, and the SRM method can in principle be used to choose λ (Vapnik, 1998). In practice, instead of using SRM, other methods are used, such as cross-validation (Wahba, 1990), Generalized Cross Validation, Finite Prediction Error and the MDL criteria (see Vapnik (1998) for a review and comparison).

An important feature of the minimizer of H[f] is that, independently of the loss function V, the minimizer has the same general form (Wahba, 1990):

    f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i).    (4)

Notice that Eq. (4) establishes a representation of the function f as a linear combination of kernels centered at each data point. Using different kernels we get functions such as Gaussian radial basis functions (K(x, y) = exp(−β \|x − y\|^2)), or polynomials of degree d (K(x, y) = (1 + x · y)^d) (Girosi et al., 1995; Vapnik, 1998).

We now turn to discuss a few learning techniques based on the minimization of functionals of the form (3), obtained by specifying the loss function V. In particular, we will consider Regularization Networks and Support Vector Machines (SVM), a learning technique which has recently been proposed for both classification and regression problems (see Vapnik (1998) and references therein):

Regularization Networks

    V(y_i, f(x_i)) = (y_i - f(x_i))^2,    (5)

SVM Classification

    V(y_i, f(x_i)) = |1 - y_i f(x_i)|_+,    (6)

where |x|_+ = x if x > 0 and zero otherwise.

SVM Regression

    V(y_i, f(x_i)) = |y_i - f(x_i)|_\epsilon,    (7)

where the function |·|_ε, called the ε-insensitive loss, is defined as:

    |x|_\epsilon = \begin{cases} 0 & \text{if } |x| < \epsilon \\ |x| - \epsilon & \text{otherwise.} \end{cases}    (8)

We now briefly discuss each of these three techniques.

3.2. Regularization Networks

The approximation scheme that arises from the minimization of the quadratic functional

    \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + \lambda \|f\|_K^2    (9)

for a fixed λ is a special form of regularization. It is possible to show (see for example Girosi et al. (1995)) that the coefficients c_i of the minimizer of (9) in Eq. (4) satisfy the following linear system of equations:

    (G + \lambda \ell I) c = y,    (10)

where I is the identity matrix, and we have defined (y)_i = y_i, (c)_i = c_i, and (G)_ij = K(x_i, x_j). Since the coefficients c_i satisfy a linear system, Eq. (4) can be rewritten as:

    f(x) = \sum_{i=1}^{\ell} y_i b_i(x),    (11)

with b_i(x) = \sum_{j=1}^{\ell} (G + \lambda \ell I)^{-1}_{ij} K(x_j, x). Equation (11) gives the dual representation of RN. Notice the difference between Eqs. (4) and (11): in the first one the coefficients c_i are learned from the data, while in the second one the basis functions b_i are learned, the coefficients of the expansion being equal to the outputs of the examples. We refer to (Girosi et al., 1995) for more information on the dual representation.
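
As a concrete illustration of Eqs. (9), (10) and (4), the following minimal sketch (in Python) assumes a Gaussian kernel and a synthetic one-dimensional data set: the coefficients are obtained by solving the linear system of Eq. (10), and predictions at new inputs follow Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic one-dimensional training data (illustration only).
x = rng.uniform(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
l = len(x)

def K(a, b, beta=20.0):
    # Gaussian radial basis function kernel K(x, y) = exp(-beta * ||x - y||^2).
    return np.exp(-beta * (a[:, None] - b[None, :]) ** 2)

lam = 1e-3
G = K(x, x)                                        # (G)_ij = K(x_i, x_j)
c = np.linalg.solve(G + lam * l * np.eye(l), y)    # Eq. (10): (G + lambda*l*I) c = y

def f(t):
    # Eq. (4): f(t) = sum_i c_i K(t, x_i)
    return K(np.atleast_1d(t), x) @ c

print(f(np.linspace(0.0, 1.0, 5)))                 # predictions at a few new inputs
```

This quadratic-loss case is the Regularization Network; the SVM cases below change only the loss term in the functional, and require a QP solver rather than a linear solve.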

3.3. Support Vector Machines

We now discuss Support Vector Machines (SVM) (Cortes and Vapnik, 1995; Vapnik, 1998). We distinguish between real output (regression) and binary output (classification) problems. The method of SVM regression corresponds to the following minimization:

    \min_f \frac{1}{\ell} \sum_{i=1}^{\ell} |y_i - f(x_i)|_\epsilon + \lambda \|f\|_K^2,    (12)

while the method of SVM classification corresponds to:

    \min_f \frac{1}{\ell} \sum_{i=1}^{\ell} |1 - y_i f(x_i)|_+ + \lambda \|f\|_K^2.    (13)

It turns out that for both problems (12) and (13) the coefficients c_i in Eq. (4) can be found by solving a Quadratic Programming (QP) problem with linear constraints. The regularization parameter λ appears only in the linear constraints: the absolute values of the coefficients c_i are bounded by 1/(2λℓ). The QP problem is nontrivial since the size of the matrix of the quadratic form is equal to ℓ × ℓ and the matrix is dense. A number of algorithms for training SVM have been proposed: some are based on a decomposition approach where the QP problem is attacked by solving a sequence of smaller QP problems (Osuna et al., 1997), others on sequential updates of the solution (Platt, 1998).

A remarkable property of SVMs is that the loss functions (7) and (6) lead to sparse solutions. This means that, unlike in the case of Regularization Networks, typically only a small fraction of the coefficients c_i in Eq. (4) are nonzero. The data points x_i associated with the nonzero c_i are called support vectors. If all data points which are not support vectors were to be discarded from the training set, the same solution would be found.

In this context, an interesting perspective on SVM is to consider its information compression properties. The support vectors represent the most informative data points and compress the information contained in the training set: for the purpose of, say, classification only the support vectors need to be stored, while all other training examples can be discarded. This, along with some geometric properties of SVMs such as the interpretation of the RKHS norm of their solution as the inverse of the margin (Vapnik, 1998), is a key property of SVM and might explain why this technique works well in many practical applications.

3.4. Kernels and Data Representations

We conclude this short review with a discussion on kernels and data representations. A key issue when using the learning techniques discussed above is the choice of the kernel K in Eq. (4). The kernel K(x_i, x_j) defines a dot product between the projections of the two inputs x_i and x_j in the feature space (the features being {φ_1(x), φ_2(x), ..., φ_N(x)}, with N the dimensionality of the RKHS). Therefore its choice is closely related to the choice of the effective representation of the data, i.e. the image representation in a vision application. The problem of choosing the kernel for the machines discussed here, and more generally the issue of finding appropriate data representations for learning, is an important and open one. The theory does not provide a general method for finding good data representations, but suggests representations that lead to simple solutions. Although there is not a general solution to this problem, a number of recent experimental and theoretical works provide insights for specific applications (Evgeniou et al., 2000; Jaakkola and Haussler, 1998; Mohan, 1999; Vapnik, 1998).
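
To make the connection between kernels and representations concrete, the following minimal sketch (in Python) assumes only the degree-2 polynomial kernel on R², whose explicit feature map is known in closed form, and checks that the kernel value coincides with an ordinary dot product between the projected feature vectors φ(x).

```python
import numpy as np

def poly_kernel(x, y):
    # Degree-2 polynomial kernel K(x, y) = (1 + x . y)^2.
    return (1.0 + x @ y) ** 2

def phi(x):
    # Explicit feature map of the same kernel on R^2:
    # phi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2).
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

rng = np.random.default_rng(3)
x, y = rng.standard_normal(2), rng.standard_normal(2)
print(poly_kernel(x, y), phi(x) @ phi(y))   # the two values coincide
```
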
References

Alon, N., Ben-David, S., Cesa-Bianchi, N., and Haussler, D. 1993. Scale-sensitive dimensions, uniform convergence, and learnability. In Symposium on Foundations of Computer Science.
Cortes, C. and Vapnik, V. 1995. Support vector networks. Machine Learning, 20:1–25.
Devroye, L., Györfi, L., and Lugosi, G. 1996. A Probabilistic Theory of Pattern Recognition, No. 31 in Applications of Mathematics. Springer: New York.
Evgeniou, T., Pontil, M., Papageorgiou, C., and Poggio, T. 2000. Image representations for object detection using kernel classifiers. In Proceedings ACCV, Taiwan, to appear.
Evgeniou, T., Pontil, M., and Poggio, T. 1999. A unified framework for Regularization Networks and Support Vector Machines. A.I. Memo No. 1654, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Ezzat, T. and Poggio, T. 1996. Facial analysis and synthesis using image-based models. In Face and Gesture Recognition, pp. 116–121.
Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Computation, 7:219–269.
Jaakkola, T. and Haussler, D. 1998. Probabilistic kernel regression models. In Proc. of Neural Information Processing Conference.
Kearns, M. and Schapire, R. 1994. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and Systems Sciences, 48(3):464–497.
Mohan, A. 1999. Robust object detection in images by components. Master's Thesis, Massachusetts Institute of Technology.
Osuna, E., Freund, R., and Girosi, F. 1997. An improved training algorithm for support vector machines. In IEEE Workshop on Neural Networks and Signal Processing, Amelia Island, FL.
Papageorgiou, C., Oren, M., and Poggio, T. 1998. A general framework for object detection. In Proceedings of the International Conference on Computer Vision, Bombay, India.
Platt, J.C. 1998. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research.
Tikhonov, A.N. and Arsenin, V.Y. 1977. Solutions of Ill-posed Problems. W.H. Winston: Washington, D.C.
Vapnik, V.N. 1998. Statistical Learning Theory. Wiley: New York.
Vapnik, V.N. and Chervonenkis, A.Y. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Th. Prob. and its Applications, 17(2):264–280.
Wahba, G. 1990. Spline Models for Observational Data. Vol. 59, Series in Applied Mathematics, SIAM: Philadelphia.