Statistical Learning Theory: a Primer

THEODOROS EVGENIOU AND MASSIMILIANO PONTIL
Center for Biological and Computational Learning, MIT

Abstract. In this paper we first overview the main concepts of Statistical Learning Theory, a framework in which learning from examples can be studied in a principled way. We then briefly discuss well known as well as emerging learning techniques such as Regularization Networks and Support Vector Machines, which can be justified in terms of the same induction principle.

Keywords: VC-dimension, Structural Risk Minimization, Regularization Networks, Support Vector Machines

This report describes research done within the Center for Biological and Computational Learning in the Department of Brain and Cognitive Sciences and at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. This research is sponsored by grants from the National Science Foundation, ONR and DARPA. Additional support is provided by Eastman Kodak Company, Daimler-Chrysler, Siemens, ATR, AT&T, Compaq, Honda R&D Co., Ltd., Merrill-Lynch, NTT and Central Research Institute of Electric Power Industry.

1. Introduction

The goal of this paper is to provide a short introduction to Statistical Learning Theory (SLT), which studies problems and techniques of supervised learning. For a more detailed review of SLT see [5]. In supervised learning, or learning-from-examples, a machine is trained, instead of programmed, to perform a given task on a number of input-output pairs. According to this paradigm, training means choosing a function which best describes the relation between the inputs and the outputs. The central question of SLT is how well the chosen function generalizes, that is, how well it estimates the output for previously unseen inputs.

We will consider techniques which lead to solutions of the form

    f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i),    (1)

where the x_i, i = 1, ..., ℓ, are the input examples, K a certain symmetric positive definite function named kernel, and c_i a set of parameters to be determined from the examples. This function is found by minimizing functionals of the type

    H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2,

where V is a loss function which measures the goodness of the predicted output f(x_i) with respect to the given output y_i, \|f\|_K^2 is a smoothness term which can be thought of as a norm in the Reproducing Kernel Hilbert Space defined by the kernel K, and λ is a positive parameter which controls the relative weight between the data term and the smoothness term. The choice of the loss function determines different learning techniques, each leading to a different learning algorithm for computing the coefficients c_i.

The rest of the paper is organized as follows. Section 2 presents the main ideas and concepts of the theory. Section 3 discusses Regularization Networks and Support Vector Machines, two important techniques which produce outputs of the form of equation (1).
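To make the solution form (1) and the regularized functional concrete, here is a minimal Python sketch (not from the paper) that evaluates a kernel expansion and the corresponding objective. The Gaussian kernel and the squared loss used here are only illustrative choices, both discussed later in the paper; the identity ||f||^2_K = c^T G c used below holds for functions of the form (1), with G the kernel (Gram) matrix.

```python
import numpy as np

def gaussian_kernel(X1, X2, beta=1.0):
    # K(x, y) = exp(-beta * ||x - y||^2), one of the kernels mentioned in Section 3
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-beta * d2)

def f(x_new, X_train, c, beta=1.0):
    # Kernel expansion of equation (1): f(x) = sum_i c_i K(x, x_i)
    return gaussian_kernel(x_new, X_train, beta) @ c

def H(c, X_train, y, lam, beta=1.0):
    # Regularized functional: empirical risk (squared loss, illustrative) + lambda * ||f||_K^2,
    # where ||f||_K^2 = c^T G c for f of the form (1).
    G = gaussian_kernel(X_train, X_train, beta)
    preds = G @ c
    emp_risk = np.mean((y - preds) ** 2)
    return emp_risk + lam * c @ G @ c
```

Section 3.2 shows that, for the squared loss, the minimizing coefficients c are obtained from a simple linear system.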

2. Statistical Learning Theory

We consider two sets of random variables x ∈ X ⊆ R^d and y ∈ Y ⊆ R related by a probabilistic relationship. The relationship is probabilistic because generally an element of X does not determine uniquely an element of Y, but rather a probability distribution on Y. This can be formalized by assuming that an unknown probability distribution P(x, y) is defined over the set X × Y. We are provided with examples of this probabilistic relationship, that is with a data set D_ℓ ≡ {(x_i, y_i) ∈ X × Y}_{i=1}^{ℓ}, called the training set, obtained by sampling ℓ times the set X × Y according to P(x, y). The problem of learning consists in, given the data set D_ℓ, providing an estimator, that is a function f : X → Y, that can be used, given any value of x ∈ X, to predict a value y.

For example, X could be the set of all possible images, Y the set {−1, 1}, and f(x) an indicator function which specifies whether image x contains a certain object (y = 1) or not (y = −1) (see for example [12]). Another example is the case where x is a set of parameters, such as pose or facial expressions, y is a motion field relative to a particular reference image of a face, and f(x) is a regression function which maps parameters to motion (see for example [6]).

In SLT, the standard way to solve the learning problem consists in defining a risk functional, which measures the average amount of error or risk associated with an estimator, and then looking for the estimator with the lowest risk. If V(y, f(x)) is the loss function measuring the error we make when we predict y by f(x), then the average error, the so called expected risk, is:

    I[f] ≡ \int_{X,Y} V(y, f(x)) \, P(x, y) \, dx \, dy.

We assume that the expected risk is defined on a large class of functions F and we will denote by f_0 the function which minimizes the expected risk in F. The function f_0 is our ideal estimator, and it is often called the target function. This function cannot be found in practice, because the probability distribution P(x, y) that defines the expected risk is unknown, and only a sample of it, the data set D_ℓ, is available. To overcome this shortcoming we need an induction principle that we can use to learn from the limited number of training data we have.

SLT, as developed by Vapnik [15], builds on the so-called empirical risk minimization (ERM) induction principle. The ERM method consists in using the data set D_ℓ to build a stochastic approximation of the expected risk, which is usually called the empirical risk and is defined as

    I_{emp}[f; \ell] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)).

Straight minimization of the empirical risk in F can be problematic. First, it is usually an ill-posed problem [14], in the sense that there might be many, possibly infinitely many, functions minimizing the empirical risk. Second, it can lead to overfitting, meaning that although the minimum of the empirical risk can be very close to zero, the expected risk, which is what we are really interested in, can be very large.

SLT provides probabilistic bounds on the distance between the empirical and expected risk of any function (therefore including the minimizer of the empirical risk in a function space), which can be used to control overfitting. The bounds involve the number of examples ℓ and the capacity h of the function space, a quantity measuring the complexity of the space. Appropriate capacity quantities are defined in the theory, the most popular one being the VC-dimension [16] or scale sensitive versions of it [9], [1]. The bounds have the following general form: with probability at least η,

    I[f] < I_{emp}[f] + \Phi\left(\frac{h}{\ell}, \eta\right),    (2)

where h is the capacity and Φ is an increasing function of h/ℓ and η. For more information and for exact forms of the function Φ we refer the reader to [16], [15], [1].
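As a small illustration of the gap that bound (2) controls, the following sketch (a hypothetical setup, not taken from the paper) computes the empirical risk of an estimator on a small training set and a Monte Carlo estimate of its expected risk on fresh samples from the same distribution. With few examples and a flexible estimator the two quantities can differ widely, which is exactly the overfitting scenario described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Toy P(x, y): y is a noisy function of scalar x (illustrative only).
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.3 * rng.standard_normal(n)
    return x, y

def squared_loss(y, f_x):
    return (y - f_x) ** 2

# "Train" by fitting a high-degree polynomial through few points (prone to overfitting).
x_tr, y_tr = sample(10)
f = np.poly1d(np.polyfit(x_tr, y_tr, deg=9))

empirical_risk = np.mean(squared_loss(y_tr, f(x_tr)))          # I_emp[f; l]
x_te, y_te = sample(100_000)
expected_risk_estimate = np.mean(squared_loss(y_te, f(x_te)))  # Monte Carlo estimate of I[f]

print(f"I_emp = {empirical_risk:.4f}, I[f] approx. {expected_risk_estimate:.4f}")
```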
Intuitively, if the capacity of the function space in which we perform empirical risk minimization is very large and the number of examples is small, then the distance between the empirical and expected risk can be large and overfitting is very likely to occur. Since the space F is usually very large (e.g. F could be the space of square integrable functions), one typically considers smaller hypothesis spaces H. Moreover, inequality (2) suggests an alternative method for achieving good generalization: instead of minimizing the empirical risk, find the best trade off between the empirical risk and the complexity of the hypothesis space, measured by the second term on the r.h.s. of inequality (2). This observation leads to the method of Structural Risk Minimization (SRM).

The idea of SRM is to define a nested sequence of hypothesis spaces H_1 ⊂ H_2 ⊂ ... ⊂ H_M, where each hypothesis space H_m has finite capacity h_m larger than that of all the previous sets, that is: h_1 ≤ h_2 ≤ ... ≤ h_M. For example H_m could be the set of polynomials of degree m, or a set of splines with m nodes, or some more complicated nonlinear parameterization. Using such a nested sequence of more and more complex hypothesis spaces, SRM consists of choosing the minimizer of the empirical risk in the space H_m for which the bound on the structural risk, as measured by the right hand side of inequality (2), is minimized. Further information about the statistical properties of SRM can be found in [3], [15].

To summarize, in SLT the problem of learning from examples is solved in three steps: (a) we define a loss function V(y, f(x)) measuring the error of predicting the output of input x with f(x) when the actual output is y; (b) we define a nested sequence of hypothesis spaces H_m, m = 1, ..., M, whose capacity is an increasing function of m; (c) we minimize the empirical risk in each of the H_m and choose, among the solutions found, the one with the best trade off between the empirical risk and the capacity, as given by the right hand side of inequality (2).
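The three-step recipe above can be sketched in code. The example below is purely illustrative (neither the data nor the capacity penalty come from the paper): it uses nested spaces of polynomials of increasing degree and selects the degree minimizing empirical risk plus a schematic capacity penalty standing in for the second term of bound (2), whose exact form the paper leaves to the references.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + 0.2 * rng.standard_normal(30)   # toy training set

def empirical_risk(deg):
    # Step (c), inner part: minimize the empirical (squared) risk within H_deg,
    # the space of polynomials of degree deg (step (b): nested hypothesis spaces).
    f = np.poly1d(np.polyfit(x, y, deg))
    return np.mean((y - f(x)) ** 2)                  # step (a): squared loss

def capacity_penalty(deg, n=len(x)):
    # Schematic stand-in for Phi(h/l, eta) in bound (2): grows with the capacity
    # of H_deg (taken here as deg + 1) and shrinks with the sample size.
    return np.sqrt((deg + 1) / n)

degrees = range(1, 11)
srm_choice = min(degrees, key=lambda d: empirical_risk(d) + capacity_penalty(d))
print("degree selected by the SRM-style trade-off:", srm_choice)
```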
3. Learning machines

3.1. Learning as functional minimization

We now consider hypothesis spaces which are subsets of a Reproducing Kernel Hilbert Space (RKHS) [17]. A RKHS is a Hilbert space of functions f of the form f(x) = \sum_{n=1}^{N} a_n \phi_n(x), where {φ_n(x)}_{n=1}^{N} is a set of given, linearly independent basis functions and N can possibly be infinite. A RKHS is equipped with a norm which is defined as:

    \|f\|_K^2 = \sum_{n=1}^{N} \frac{a_n^2}{\lambda_n},

where {λ_n}_{n=1}^{N} is a decreasing, positive sequence of real values whose sum is finite. The constants λ_n and the basis functions {φ_n}_{n=1}^{N} define the symmetric positive definite kernel function:

    K(x, y) = \sum_{n=1}^{N} \lambda_n \phi_n(x) \phi_n(y).

A nested sequence of spaces of functions in the RKHS can be constructed by bounding the RKHS norm of the functions in the space. This can be done by defining a set of constants A_1 < A_2 < ... < A_M and considering spaces of the form:

    H_m = {f ∈ RKHS : \|f\|_K ≤ A_m}.

It can be shown that the capacity of the hypothesis spaces H_m is an increasing function of A_m (see for example [5]). According to the scheme given at the end of section 2, the solution of the learning problem is found by solving, for each A_m, the following optimization problem:

    \min_f \sum_{i=1}^{\ell} V(y_i, f(x_i))  subject to  \|f\|_K ≤ A_m,

and choosing, among the solutions found for each A_m, the one with the best trade off between empirical risk and capacity, i.e. the one which minimizes the bound on the structural risk as given by inequality (2).

The implementation of the SRM method described above is not practical because it requires searching for the solution of a large number of constrained optimization problems. This difficulty is overcome by searching for the minimum of:

    H[f] = \frac{1}{\ell} \sum_{i=1}^{\ell} V(y_i, f(x_i)) + \lambda \|f\|_K^2.    (3)

The functional H[f] contains both the empirical risk and the norm (complexity or smoothness) of f in the RKHS, similarly to functionals considered in regularization theory [14]. The regularization parameter λ penalizes functions with high capacity: the larger λ, the smaller the RKHS norm of the solution will be.

When implementing SRM, the key issue is the choice of the hypothesis space, i.e. the index m of the space H_m in which the structural risk is minimized. In the case of the functional of equation (3), the key issue becomes the choice of the regularization parameter λ. These two problems, as discussed in [5], are related, and the SRM method can in principle be used to choose λ [15]. In practice, instead of using SRM, other methods are used, such as cross-validation ([17]), Generalized Cross Validation, Finite Prediction Error and the MDL criteria (see [15] for a review and comparison).
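The relation K(x, y) = Σ_n λ_n φ_n(x) φ_n(y) can be checked numerically. The sketch below is an illustration with arbitrarily chosen cosine basis functions and eigenvalues λ_n = 1/n² (none of which come from the paper): it builds such a kernel on scalar inputs and verifies that the resulting Gram matrix is symmetric positive semidefinite, as required of a kernel.

```python
import numpy as np

N = 20                                    # number of basis functions (finite, for illustration)
lambdas = 1.0 / np.arange(1, N + 1) ** 2  # decreasing, positive, summable sequence

def phi(x):
    # Basis functions phi_n(x) = cos(n * x), n = 1..N (an arbitrary illustrative choice).
    n = np.arange(1, N + 1)
    return np.cos(np.outer(x, n))         # shape (len(x), N)

def kernel(x, y):
    # K(x, y) = sum_n lambda_n * phi_n(x) * phi_n(y)
    return phi(x) @ (lambdas[:, None] * phi(y).T)

x = np.linspace(-2, 2, 50)
G = kernel(x, x)
eigvals = np.linalg.eigvalsh(G)
print("symmetric:", np.allclose(G, G.T), " min eigenvalue:", eigvals.min())
```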

An important feature of the minimizer of H[f] is that, independently of the loss function V, the minimizer has the same general form ([17])

    f(x) = \sum_{i=1}^{\ell} c_i K(x, x_i).    (4)

Notice that equation (4) establishes a representation of the function f as a linear combination of kernels centered at each data point. Using different kernels we get functions such as Gaussian radial basis functions (K(x, y) = \exp(-\beta \|x - y\|^2)) or polynomials of degree d (K(x, y) = (1 + x \cdot y)^d) [7], [15].

We now turn to a few learning techniques based on the minimization of functionals of the form (3), obtained by specifying the loss function V. In particular, we will consider Regularization Networks and Support Vector Machines (SVM), a learning technique which has recently been proposed for both classification and regression problems (see [15] and references therein):

Regularization Networks:

    V(y_i, f(x_i)) = (y_i - f(x_i))^2.    (5)

SVM Classification:

    V(y_i, f(x_i)) = |1 - y_i f(x_i)|_+,    (6)

where |x|_+ = x if x > 0 and zero otherwise.

SVM Regression:

    V(y_i, f(x_i)) = |y_i - f(x_i)|_\epsilon,    (7)

where the function |·|_ε, called the ε-insensitive loss, is defined as:

    |x|_\epsilon = 0 if |x| < \epsilon, and |x| - \epsilon otherwise.    (8)

We now briefly discuss each of these three techniques.

3.2. Regularization Networks

The approximation scheme that arises from the minimization of the quadratic functional

    \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2 + \lambda \|f\|_K^2    (9)

for a fixed λ is a special form of regularization. It is possible to show (see for example [7]) that the coefficients c_i of the minimizer of (9) in equation (4) satisfy the following linear system of equations:

    (G + \lambda I) c = y,    (10)

where I is the identity matrix, and we have defined (y)_i = y_i, (c)_i = c_i, (G)_{ij} = K(x_i, x_j). Since the coefficients c_i satisfy a linear system, equation (4) can be rewritten as:

    f(x) = \sum_{i=1}^{\ell} y_i b_i(x),    (11)

with b_i(x) = \sum_{j=1}^{\ell} (G + \lambda I)^{-1}_{ij} K(x, x_j). Equation (11) gives the dual representation of RN. Notice the difference between equations (4) and (11): in the first one the coefficients c_i are learned from the data, while in the second one the basis functions b_i are learned, the coefficients of the expansion being equal to the outputs of the examples. We refer to [7] for more information on the dual representation.
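A minimal Regularization Network therefore amounts to a single linear solve. The sketch below is illustrative (the Gaussian kernel width, λ and the toy data are arbitrary choices, not from the paper): it trains on a small regression problem by solving (G + λI)c = y and predicts new points with equation (4).

```python
import numpy as np

def gaussian_gram(X1, X2, beta=2.0):
    # (G)_ij = K(x_i, x_j) with the Gaussian kernel K(x, y) = exp(-beta * ||x - y||^2)
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-beta * d2)

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(40, 1))                 # toy training inputs
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(40)

lam = 1e-2
G = gaussian_gram(X, X)
c = np.linalg.solve(G + lam * np.eye(len(X)), y)     # equation (10): (G + lambda*I) c = y

X_new = np.linspace(-1, 1, 5).reshape(-1, 1)
f_new = gaussian_gram(X_new, X) @ c                  # equation (4): f(x) = sum_i c_i K(x, x_i)
print(f_new)
```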

3.3. Support Vector Machines

We now discuss Support Vector Machines (SVM) [2], [15]. We distinguish between real output (regression) and binary output (classification) problems. The method of SVM regression corresponds to the following minimization:

    \min_f \frac{1}{\ell} \sum_{i=1}^{\ell} |y_i - f(x_i)|_\epsilon + \lambda \|f\|_K^2,    (12)

while the method of SVM classification corresponds to:

    \min_f \frac{1}{\ell} \sum_{i=1}^{\ell} |1 - y_i f(x_i)|_+ + \lambda \|f\|_K^2.    (13)

It turns out that for both problems (12) and (13) the coefficients c_i in equation (4) can be found by solving a Quadratic Programming (QP) problem with linear constraints; the size of the box constraining the dual variables is inversely proportional to the regularization parameter λ. The QP problem is non-trivial since the size of the matrix of the quadratic form is equal to ℓ and the matrix is dense. A number of algorithms for training SVM have been proposed: some are based on a decomposition approach where the QP problem is attacked by solving a sequence of smaller QP problems [11], others on sequential updates of the solution [13].

A remarkable property of SVMs is that the loss functions (6) and (7) lead to sparse solutions. This means that, unlike in the case of Regularization Networks, typically only a small fraction of the coefficients c_i in equation (4) are nonzero. The data points x_i associated with the nonzero c_i are called support vectors. If all data points which are not support vectors were to be discarded from the training set, the same solution would be found. In this context, an interesting perspective on SVM is to consider its information compression properties. The support vectors represent the most informative data points and compress the information contained in the training set: for the purpose of, say, classification only the support vectors need to be stored, while all other training examples can be discarded. This, along with some geometric properties of SVMs such as the interpretation of the RKHS norm of their solution as the inverse of the margin [15], is a key property of SVM and might explain why this technique works well in many practical applications.

3.4. Kernels and data representations

We conclude this short review with a brief discussion of kernels and data representations. A key issue when using the learning techniques discussed above is the choice of the kernel K in equation (4). The kernel K(x_i, x_j) defines a dot product between the projections of the two inputs x_i and x_j in the feature space (the features being {φ_1(x), φ_2(x), ..., φ_N(x)}, with N the dimensionality of the RKHS). Therefore its choice is closely related to the choice of the effective representation of the data, e.g. the image representation in a vision application. The problem of choosing the kernel for the machines discussed here, and more generally the issue of finding appropriate data representations for learning, is an important and open one. The theory does not provide a general method for finding good data representations, but suggests representations that lead to simple solutions. Although there is no general solution to this problem, a number of recent experimental and theoretical works provide insights for specific applications [4], [8], [10], [15].
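To see the sparsity described above in practice, the short example below trains a kernel SVM classifier with scikit-learn (an external library, not used in the paper) and counts its support vectors. The library's parameter C acts as an inverse regularization weight, playing roughly the role of 1/(2λℓ) in problem (13), and its RBF kernel corresponds to the Gaussian kernel of Section 3; all data and parameter values here are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 2))                    # toy binary classification data
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=1.0, C=10.0)           # hinge-loss SVM with Gaussian kernel
clf.fit(X, y)

# Only the support vectors carry nonzero coefficients c_i in the expansion (4);
# the remaining training points could be discarded without changing the solution.
print("support vectors:", len(clf.support_), "out of", len(X), "training points")
```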
References

1. N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. In Symposium on Foundations of Computer Science, 1993.
2. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1-25, 1995.
3. L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York, 1996.
4. T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio. Image representations for object detection using kernel classifiers. In Proceedings ACCV, page (to appear), Taiwan, January 2000.
5. T. Evgeniou, M. Pontil, and T. Poggio. A unified framework for regularization networks and support vector machines. A.I. Memo No. 1654, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1999.
6. T. Ezzat and T. Poggio. Facial analysis and synthesis using image-based models. In Face and Gesture Recognition, pages 116-121, 1996.
7. F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219-269, 1995.
8. T. Jaakkola and D. Haussler. Probabilistic kernel regression models. In Proc. of Neural Information Processing Conference, 1998.
9. M. Kearns and R. E. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and Systems Sciences, 48(3):464-497, 1994.
10. A. Mohan. Robust object detection in images by components. Master's thesis, Massachusetts Institute of Technology, May 1999.
11. E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In IEEE Workshop on Neural Networks and Signal Processing, Amelia Island, FL, September 1997.
12. C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Proceedings of the International Conference on Computer Vision, Bombay, India, January 1998.
13. J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, April 1998.
14. A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. W. H. Winston, Washington, D.C., 1977.
15. V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
16. V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 17(2):264-280, 1971.
17. G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.