Kernel Matching Pursuit


Pascal Vincent and Yoshua Bengio
Dept. IRO, Université de Montréal
C.P. 6128, Montreal, Qc, H3C 3J7, Canada

Technical Report #1179
Département d'Informatique et Recherche Opérationnelle, Université de Montréal
August 28th, 2000

Abstract

Matching Pursuit algorithms learn a function that is a weighted sum of basis functions, by sequentially appending functions to an initially empty basis, to approximate a target function in the least-squares sense. We show how matching pursuit can be extended to use non-squared error loss functions, and how it can be used to build kernel-based solutions to machine-learning problems, while keeping control of the sparsity of the solution. We also derive MDL-motivated generalization bounds for this type of algorithm, and compare them to related SVM (Support Vector Machine) bounds. Finally, links to boosting algorithms and RBF training procedures, as well as an extensive experimental comparison with SVMs for classification, are given, showing comparable results with typically sparser models.

1 Introduction

Recently, there has been a renewed interest in kernel-based methods, due in great part to the success of the Support Vector Machine approach (Boser, Guyon and Vapnik, 1992; Vapnik, 1995). Kernel-based learning algorithms represent the function f(x) to be learnt with a linear combination of terms of the form K(x, x_i), where x_i is generally the input vector associated with one of the training examples, and K is a symmetric positive definite kernel function. Support Vector Machines (SVMs) are kernel-based learning algorithms in which only a fraction of the training examples are used in the solution (these are called the Support Vectors), and where the objective of learning is to maximize a margin around the decision surface (in the case of classification). Matching Pursuit was originally introduced in the signal-processing community as an algorithm that decomposes any signal into a linear expansion of waveforms selected from a redundant dictionary of functions (Mallat and Zhang, 1993).

It is a general, greedy, sparse function approximation scheme with the squared error loss, which iteratively adds new functions (i.e. basis functions) to the linear expansion. If we take as dictionary of functions the functions d_i(x) of the form K(x, x_i), where x_i is the input part of a training example, then the linear expansion has essentially the same form as a Support Vector Machine. Matching Pursuit and its variants were developed primarily in the signal-processing and wavelets community, but there are many interesting links with the research on kernel-based learning algorithms developed in the machine-learning community. Connections between a related algorithm (basis pursuit (Chen, 1995)) and SVMs had already been reported in (Poggio and Girosi, 1998). More recently, (Smola and Schölkopf, 2000) shows connections between Matching Pursuit, Kernel-PCA, Sparse Kernel Feature Analysis, and how this kind of greedy algorithm can be used to compress the design matrix in SVMs to allow handling of huge data-sets.

Sparsity of representation is an important issue, both for the computational efficiency of the resulting representation, and for its theoretical and practical influence on generalization performance (see (Graepel, Herbrich and Shawe-Taylor, 2000) and (Floyd and Warmuth, 1995)). However, the sparsity of the solutions found by the SVM algorithm is hardly controllable, and often these solutions are not very sparse. Our research started as a search for a flexible alternative framework that would allow us to directly control the sparsity (in terms of number of support vectors) of the solution and remove the requirement of positive definiteness of K (and the representation of K as a dot product in a high-dimensional "feature space"). It led us to uncover connections between greedy Matching Pursuit algorithms, Radial Basis Function training procedures, and boosting algorithms (section 4). We will discuss these together with a description of the proposed algorithm and extensions thereof to use margin loss functions.

We first (section 2) give an overview of the Matching Pursuit family of algorithms (the basic version and two refinements thereof), as a general framework, taking a machine-learning viewpoint. We also give a detailed description of our particular implementation that chooses the next basis function to add to the expansion by minimizing simultaneously over the expansion weights and the choice of the basis function, in a computationally efficient manner. We then show (section 3) how this framework can be extended to allow the use of differentiable loss functions other than the squared error to which the original algorithms are limited. This might be more appropriate for some classification problems (although, in our experiments, we have used the squared loss for many classification problems, always with successful results). This is followed by a discussion of margin loss functions, underlining their similarity with more traditional loss functions that are commonly used for neural networks. In section 4 we explain how the matching pursuit family of algorithms can be used to build kernel-based solutions to machine-learning problems, and how this relates to other machine-learning algorithms, namely SVMs, boosting algorithms, and Radial Basis Function training procedures. In section 5, we use previous theoretical work on the minimum description length principle to construct generalization error bounds for the proposed algorithm. Basically, the generalization error is bounded by the training error plus terms that grow with the fraction of support vectors.
These bounds are compared with bounds obtained for Support Vector Machines. Finally, in section 6, we provide an experimental comparison between SVMs and different variants of Matching Pursuit, performed on artificial data, USPS digit classification, and UCI machine-learning database benchmarks. The main experimental result is that Kernel Matching Pursuit algorithms can yield generalization performance as good as Support Vector Machines, but often using significantly fewer support vectors.

2 Three flavors of Matching Pursuit

In this section we first describe the basic Matching Pursuit algorithm, as it was introduced by (Mallat and Zhang, 1993), but from a machine-learning perspective rather than a signal-processing one. We then present two successive refinements of the basic algorithm.

2.1 Basic Matching Pursuit

We are given noisy observations {y_1, ..., y_ℓ} of a target function f ∈ H at points {x_1, ..., x_ℓ}. We are also given a finite dictionary D = {d_1, ..., d_m} of functions in a Hilbert space H, and we are interested in sparse approximations of f that are expansions of the form

    \hat{f}_N = \sum_{n=1}^{N} \alpha_n g_n    (1)

where (α_1, ..., α_N) ∈ ℝ^N and {g_1, ..., g_N} ⊂ D are chosen to minimize the squared norm of the residue, ‖R_N‖² = ‖f − \hat{f}_N‖². We shall call the set {g_1, ..., g_N} our basis, and N the number of basis functions in the expansion.

Notice that, in a typical machine-learning framework, all we have are noisy observations of the target function f at the data points x_{1..ℓ}. So we sometimes abuse the notation, using f to actually mean (y_1, ..., y_ℓ). Also, throughout this article, for all practical purposes, during training, any function in H can be associated to an ℓ-dimensional vector that represents the function evaluated at the x_{1..ℓ} data points. We will make extensive use of this abuse of notation for convenience; in particular the notation ⟨g, h⟩ will be used to represent the dot product between the two ℓ-dimensional vectors associated with functions g and h, and ‖h‖ is used to represent the L_2 norm of the vector associated to a function h. Only when using the learnt approximation on new test data do we use the dictionary functions as actual functions.

Now, finding the optimal basis {g_1, ..., g_N} for a given number N of allowed basis functions is in general an NP-complete problem. So the matching pursuit algorithm proceeds in a greedy, constructive fashion: it starts at stage 0 with \hat{f}_0 = 0, and recursively appends functions to an initially empty basis, at each stage n trying to reduce the norm of the residue R_n = f − \hat{f}_n. Given \hat{f}_n we build \hat{f}_{n+1} = \hat{f}_n + α_{n+1} g_{n+1} by searching for g_{n+1} ∈ D and α_{n+1} ∈ ℝ that minimize the squared norm of the residue, ‖R_{n+1}‖² = ‖R_n − α_{n+1} g_{n+1}‖², i.e.

    (g_{n+1}, \alpha_{n+1}) = \arg\min_{g \in D,\ \alpha \in \mathbb{R}} \Big\| \underbrace{\sum_{k=1}^{n} \alpha_k g_k}_{\hat{f}_n} + \alpha g - f \Big\|^2    (2)

INPUT:
  data set {(x_1, y_1), ..., (x_ℓ, y_ℓ)}
  dictionary of functions D = {d_1, ..., d_m}
  number N of basis functions desired in the expansion (or, alternatively, a validation set to decide when to stop)
INITIALIZE: residue vector R and dictionary matrix D
  R ← (y_1, ..., y_ℓ)ᵀ   and   D(i, k) ← d_k(x_i)  for i = 1..ℓ, k = 1..m
FOR n = 1..N (or until performance on validation set stops improving):
  γ_n ← arg max_{k=1..m} |⟨D(., k), R⟩| / ‖D(., k)‖
  α_n ← ⟨D(., γ_n), R⟩ / ‖D(., γ_n)‖²
  R ← R − α_n D(., γ_n)
RESULT: the solution found is defined by
  \hat{f}_N(x) = \sum_{n=1}^{N} \alpha_n d_{\gamma_n}(x)

Figure 1: Basic Matching Pursuit Algorithm

The g_{n+1} that minimizes this expression is the one that maximizes |⟨g_{n+1}, R_n⟩| / ‖g_{n+1}‖, and the corresponding α_{n+1} is

    \alpha_{n+1} = \frac{\langle g_{n+1}, R_n \rangle}{\|g_{n+1}\|^2}

We have not yet specified how to choose N (i.e. when to stop). In the signal-processing literature the algorithm is usually stopped when the reconstruction error (‖R‖²) goes below a predefined threshold. For machine-learning problems, we shall rather use the error estimated on an independent validation set¹ to decide when to stop. In any case, N can be seen as the primary capacity-control parameter of the algorithm. In section 5, we show that the generalization error of matching pursuit algorithms can be directly linked to the ratio N/ℓ (ℓ is the number of training examples). The pseudo-code for the corresponding algorithm is given in figure 1 (there are slight differences in the notation, in particular g_n in the above explanations corresponds to vector D(., γ_n) in the more detailed pseudo-code).

¹ or a more computationally intensive cross-validation technique if the data is scarce.
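To make the procedure of figure 1 concrete, here is a minimal NumPy sketch of the basic matching pursuit loop; the function and variable names are illustrative assumptions, not code from this report.

import numpy as np

def basic_matching_pursuit(D, y, N):
    """D: (l, m) matrix with D[i, k] = d_k(x_i); y: (l,) targets; N: number of basis functions."""
    R = y.astype(float).copy()             # residue, initially the target vector
    norms = np.linalg.norm(D, axis=0)      # ||D(., k)|| for every dictionary column
    gammas, alphas = [], []
    for _ in range(N):
        scores = np.abs(D.T @ R) / norms              # |<D(., k), R>| / ||D(., k)||
        k = int(np.argmax(scores))                    # gamma_n: best-matching column
        a = (D[:, k] @ R) / (norms[k] ** 2)           # alpha_n = <D(., k), R> / ||D(., k)||^2
        R = R - a * D[:, k]                           # shrink the residue
        gammas.append(k)
        alphas.append(a)
    return np.array(gammas), np.array(alphas)

Given the dictionary matrix of figure 1, basic_matching_pursuit(D, y, N) returns the chosen column indexes γ_{1..N} and the corresponding weights α_{1..N}; note that, as discussed below for the 2D experiments, the basic version may repeatedly re-select the same columns to refine their weights.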

2.2 Matching Pursuit with backfitting

In the basic version of the algorithm, not only is the set of basis functions g_{1..n} obtained at every step n suboptimal, but so are also their α_{1..n} coefficients. This can be corrected in a step often called back-fitting or back-projection, and the resulting algorithm is known as Orthogonal Matching Pursuit (OMP) (Pati, Rezaiifar and Krishnaprasad, 1993; Davis, Mallat and Zhang, 1994): while still choosing g_{n+1} as previously (equation 2), we recompute the optimal set of coefficients α_{1..n+1} at each step instead of only the last α_{n+1}:

    \alpha^{(n+1)}_{1..n+1} = \arg\min_{\alpha_{1..n+1} \in \mathbb{R}^{n+1}} \Big\| \sum_{k=1}^{n+1} \alpha_k g_k - f \Big\|^2    (3)

Note that this is just like a linear regression with parameters α_{1..n+1}. This back-projection step also has a geometrical interpretation: let B_n be the sub-space of H spanned by the basis (g_1, ..., g_n), and let B_n^⊥ be its orthogonal complement in H. Let P_{B_n} and P_{B_n^⊥} denote the projection operators on these subspaces. Then any g ∈ H can be decomposed as g = P_{B_n} g + P_{B_n^⊥} g (see figure 2). Ideally, we want the residue R_n to be as small as possible, so given the basis at step n, we want \hat{f}_n = P_{B_n} f and R_n = P_{B_n^⊥} f. This is what (3) ensures. But whenever we append the next α_{n+1} g_{n+1} found by (2) to the expansion, we actually add its two orthogonal components:

P_{B_n^⊥} α_{n+1} g_{n+1}, which contributes to reducing the norm of the residue;

P_{B_n} α_{n+1} g_{n+1}, which increases the norm of the residue. However, as this latter part belongs to B_n, it can be compensated for by adjusting the previous coefficients of the expansion: this is what the back-projection does.

Figure 2: Geometrical interpretation of Matching Pursuit and back-projection.

(Davis, Mallat and Zhang, 1994) suggest maintaining an additional orthogonal basis of the B_n space to facilitate this back-projection, which results in a computationally efficient algorithm².

² In our implementation, we used a slightly modified version of this approach, described in the prefitting algorithm below.
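A minimal sketch of the backfitting variant, under the assumption that the re-estimation of equation (3) is simply done with an ordinary least-squares solve at every step; the names are illustrative and the efficient orthogonal-basis bookkeeping of (Davis, Mallat and Zhang, 1994) is deliberately omitted.

import numpy as np

def omp_backfitting(D, y, N):
    """Choose columns as in basic MP, then re-fit all coefficients by least squares (eq. 3)."""
    R = y.astype(float).copy()
    norms = np.linalg.norm(D, axis=0)
    support = []                               # indexes gamma_1..gamma_n of chosen columns
    alphas = np.zeros(0)
    for _ in range(N):
        scores = np.abs(D.T @ R) / norms
        scores[support] = -np.inf              # optional: after backfitting these scores are ~0 anyway
        support.append(int(np.argmax(scores)))
        B = D[:, support]                      # current basis g_1..g_n as columns
        alphas, *_ = np.linalg.lstsq(B, y, rcond=None)   # equation (3): optimal alpha_{1..n}
        R = y - B @ alphas                     # residue is now orthogonal to the chosen basis
    return support, alphas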

2.3 Matching Pursuit with prefitting

With backfitting, the choice of the function to append at each step is made regardless of the later possibility to update all weights: as we find g_{n+1} using (2) and only then optimize (3), we might be picking a dictionary function other than the one that would give the best fit. Instead, it is possible to directly optimize

    \left(g_{n+1}, \alpha^{(n+1)}_{1..n+1}\right) = \arg\min_{g \in D,\ \alpha_{1..n+1} \in \mathbb{R}^{n+1}} \Big\| \sum_{k=1}^{n} \alpha_k g_k + \alpha_{n+1} g - f \Big\|^2    (4)

We shall call this procedure prefitting, to distinguish it from the former backfitting (as backfitting is done only after the choice of g_{n+1}). This can be achieved almost as efficiently as backfitting. Our implementation maintains a representation of both the target and all dictionary vectors as a decomposition into their projections on B_n and B_n^⊥. As before, let B_n = span(g_1, ..., g_n). We maintain at each step a representation of each dictionary vector d as the sum of two orthogonal components:

the component d_{B_n} = P_{B_n} d lies in the space B_n spanned by the current basis and is expressed as a linear combination of the current basis vectors (it is an n-dimensional vector);

the component d_{B_n^⊥} = P_{B_n^⊥} d lies in B_n's orthogonal complement and is expressed in the original ℓ-dimensional vector space coordinates.

We also maintain the same representation for the target y, namely its decomposition into the current expansion \hat{f}_n ∈ B_n plus the orthogonal residue R_n ∈ B_n^⊥. Prefitting is then achieved easily by considering only the components in B_n^⊥: we choose g_{n+1} as the g ∈ D whose g_{B_n^⊥} is most collinear with R_n ∈ B_n^⊥. This procedure requires, at every step, only two passes through the dictionary (searching for g_{n+1}, then updating the representation) where basic matching pursuit requires one. The detailed pseudo-code for this algorithm is given in figure 3.

2.4 Summary of the three variations of MP

Regardless of the computational tricks that use orthogonality properties for efficient computation, the three versions of matching pursuit differ only in the way the next function to append to the basis is chosen and the α coefficients are updated at each step n:

Basic version: we find the optimal g_n to append to the basis and its optimal α_n, while keeping all other coefficients fixed (equation 2).

Backfitting version: we find the optimal g_n while keeping all coefficients fixed (equation 2). Then we find the optimal set of coefficients α^{(n)}_{1..n} for the new basis (equation 3).

Prefitting version: we find at the same time the optimal g_n and the optimal set of coefficients α^{(n)}_{1..n} (equation 4).

INPUT:
  data set {(x_1, y_1), ..., (x_ℓ, y_ℓ)}
  dictionary of functions D = {d_1, ..., d_m}
  number N of basis functions desired in the expansion (or, alternatively, a validation set to decide when to stop)
INITIALIZE: residue vector R and the two dictionary matrix components D_{B⊥} and D_B
  R ← (y_1, ..., y_ℓ)ᵀ   and   D_{B⊥}(i, k) ← d_k(x_i)  for i = 1..ℓ, k = 1..m
  D_B is initially empty, and gets appended an additional row at each step (thus, ignore the expressions that involve D_B during the first iteration, when n = 1)
FOR n = 1..N (or until performance on validation set stops improving):
  γ_n ← arg max_{k=1..m} |⟨D_{B⊥}(., k), R⟩| / ‖D_{B⊥}(., k)‖
  α_n ← ⟨D_{B⊥}(., γ_n), R⟩ / ‖D_{B⊥}(., γ_n)‖²
  the B⊥ component of α_n d_{γ_n} reduces the residue:
    R ← R − α_n D_{B⊥}(., γ_n)
  compensate for the B component of α_n d_{γ_n} by adjusting the previous α:
    (α_1, ..., α_{n−1}) ← (α_1, ..., α_{n−1}) − α_n D_B(., γ_n)
  now update the dictionary representation to take into account the new basis function d_{γ_n}:
  FOR i = 1..m AND i ≠ γ_n:
    β_i ← ⟨D_{B⊥}(., γ_n), D_{B⊥}(., i)⟩ / ‖D_{B⊥}(., γ_n)‖²
    D_{B⊥}(., i) ← D_{B⊥}(., i) − β_i D_{B⊥}(., γ_n)
    D_B(., i) ← D_B(., i) − β_i D_B(., γ_n)
  D_{B⊥}(., γ_n) ← 0
  D_B(., γ_n) ← 0
  β_{γ_n} ← 1
  D_B ← [ D_B ; (β_1, ..., β_m) ]   (append the row of β's)
RESULT: the solution found is defined by
  \hat{f}_N(x) = \sum_{n=1}^{N} \alpha_n d_{\gamma_n}(x)

Figure 3: Matching Pursuit with prefitting
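The following NumPy sketch mirrors the bookkeeping of figure 3, keeping for every dictionary column its component inside the span of the current basis (in basis coordinates) and its orthogonal remainder; the variable names and the dense-matrix representation are assumptions made for illustration.

import numpy as np

def kmp_prefitting(D, y, N):
    l, m = D.shape
    R = y.astype(float).copy()
    D_perp = D.astype(float).copy()     # components of each d_i orthogonal to the current basis
    D_B = np.zeros((0, m))              # components of each d_i inside the basis (basis coordinates)
    gammas, alphas = [], np.zeros(0)
    for n in range(N):
        norms = np.linalg.norm(D_perp, axis=0)
        norms[norms < 1e-12] = np.inf                  # columns already in the span cannot be chosen
        k = int(np.argmax(np.abs(D_perp.T @ R) / norms))
        g = D_perp[:, k].copy()
        a = (g @ R) / (g @ g)
        R = R - a * g                                  # the B-perp part of a*d_k reduces the residue
        alphas = np.append(alphas - a * D_B[:, k], a)  # back-project the B part onto old coefficients
        beta = (D_perp.T @ g) / (g @ g)                # projections of every column on the new direction
        D_perp = D_perp - np.outer(g, beta)            # remove that direction from the orthogonal parts
        D_B = np.vstack([D_B - np.outer(D_B[:, k], beta), beta])  # re-express the basis coordinates
        gammas.append(k)
    return gammas, alphas

Each iteration amounts to two passes through the dictionary (column selection, then the rank-one updates), in line with the O(N·m·ℓ) complexity discussed next.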

When making use of orthogonality properties for efficient implementations of the backfitting and prefitting versions (as in our previously described implementation of the prefitting algorithm), all three algorithms have a computational complexity of the same order O(N·m·ℓ).

3 Extension to non-squared error loss

3.1 Gradient descent in function space

It has already been noticed that boosting algorithms perform a form of gradient descent in function space with respect to particular loss functions (Schapire et al., 1998; Mason et al., 2000). Following (Friedman, 1999), the technique can be adapted to extend the Matching Pursuit family of algorithms to optimize arbitrary differentiable loss functions, instead of doing least-squares fitting.

Given a loss function L(y_i, \hat{f}_n(x_i)) that computes the cost of predicting a value of \hat{f}_n(x_i) when the true target was y_i, we use an alternative residue \tilde{R}_n rather than the usual R_n = y − \hat{f}_n when searching for the next dictionary element to append to the basis at each step. \tilde{R}_n is the direction of steepest descent in function space (evaluated at the data points) with respect to L, i.e. the opposite of the gradient:

    \tilde{R}_n = -\left( \frac{\partial L\big(y_1, \hat{f}_n(x_1)\big)}{\partial \hat{f}_n(x_1)}, \ldots, \frac{\partial L\big(y_\ell, \hat{f}_n(x_\ell)\big)}{\partial \hat{f}_n(x_\ell)} \right)    (5)

i.e. g_{n+1} is chosen such that it is most collinear with this direction:

    g_{n+1} = \arg\max_{g \in D} \frac{\left|\langle g, \tilde{R}_n \rangle\right|}{\|g\|}    (6)

A line-minimization procedure can then be used to find the corresponding coefficient:

    \alpha_{n+1} = \arg\min_{\alpha \in \mathbb{R}} \sum_{i=1}^{\ell} L\big(y_i, \hat{f}_n(x_i) + \alpha\, g_{n+1}(x_i)\big)    (7)

This corresponds to basic matching pursuit (notice how the original squared-error algorithm is recovered when L is the squared error: L(a, b) = (a − b)²). It is also possible to do backfitting, by re-optimizing all α_{1..n+1} (instead of only α_{n+1}) to minimize the target cost (with a conjugate gradient optimizer for instance):

    \alpha^{(n+1)}_{1..n+1} = \arg\min_{\alpha_{1..n+1} \in \mathbb{R}^{n+1}} \sum_{i=1}^{\ell} L\Big(y_i, \sum_{k=1}^{n+1} \alpha_k g_k(x_i)\Big)    (8)

But as this can be quite time-consuming (we cannot use any orthogonality property in this general case), it may be desirable to do it every few steps instead of every single step. The corresponding algorithm is described in more detail in the pseudo-code of figure 4 (as previously, there are slight differences in the notation, in particular g_k in the above explanation corresponds to vector D(., γ_k) in the more detailed pseudo-code).
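A sketch of a single greedy step with an arbitrary differentiable loss, following equations (5)–(7); the squared-error loss used as the example and the use of SciPy's scalar minimizer for the line search are assumptions made for illustration.

import numpy as np
from scipy.optimize import minimize_scalar

def loss(y, f):
    """Example loss L(y, f) = (y - f)^2, summed over the data points."""
    return np.sum((y - f) ** 2)

def loss_grad(y, f):
    """dL/df for the example loss: -2 (y - f)."""
    return -2.0 * (y - f)

def kmp_step_any_loss(D, y, f_hat):
    """One greedy step; returns the chosen column index and its weight."""
    R_tilde = -loss_grad(y, f_hat)                      # steepest-descent direction (eq. 5)
    norms = np.linalg.norm(D, axis=0)
    k = int(np.argmax(np.abs(D.T @ R_tilde) / norms))   # most collinear column (eq. 6)
    line = lambda a: loss(y, f_hat + a * D[:, k])       # 1-D line minimization (eq. 7)
    a = minimize_scalar(line).x
    return k, a

Repeating this step, and occasionally re-optimizing all weights with a generic optimizer, gives the loop of figure 4 below.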

Finally, let us mention that it should in theory also be possible to do prefitting with an arbitrary loss function, but finding the optimal {g_{k+1} ∈ D, α_{1..k+1} ∈ ℝ^{k+1}} in the general case (when we cannot use any orthogonal decomposition) would involve solving equation 8 in turn for each dictionary function in order to choose the next one to append to the basis, which is computationally prohibitive.

3.2 Margin loss functions versus traditional loss functions for classification

Now that we have seen how the matching pursuit family of algorithms can be extended to use arbitrary loss functions, let us discuss the merits of various loss functions. In particular, the relationship between loss functions and the notion of margin is of primary interest here, as we wanted to build an alternative to SVMs³.

While the original notion of margin in classification problems comes from the geometrically inspired hard margin of linear SVMs (the smallest Euclidean distance between the decision surface and the training points), a slightly different perspective has emerged in the boosting community along with the notion of margin loss function. The margin quantity m = y\hat{f}(x) of an individual data point (x, y), with y ∈ {−1, +1}, can be understood as a confidence measure of its classification by the function \hat{f}, while the class decided for is given by sign(\hat{f}(x)). A margin loss function is simply a function of this margin quantity m that is being optimized.

It is possible to formulate SVM training so as to expose the SVM margin loss function. Let ϕ be the mapping into the feature space of SVMs, such that ⟨ϕ(x_i), ϕ(x_j)⟩ = K(x_i, x_j). The SVM solution can be expressed in this feature space as \hat{f}(x) = ⟨w, ϕ(x)⟩ + b, where

    w = \sum_{x_i \in SV} \alpha_i y_i \varphi(x_i)

where SV is the set of support vectors, and the solution is the one that minimizes

    \sum_{i=1}^{\ell} \left[1 - y_i \hat{f}(x_i)\right]_+ + \frac{1}{C}\,\|w\|^2    (9)

where C is the box-constraint parameter of SVMs, and the notation [x]_+ is to be understood as the function that gives [x]_+ = x when x > 0 and 0 otherwise. Let m = y\hat{f}(x) be the individual margin at point x. (9) is clearly the sum of a margin loss function and a regularization term.

It is interesting to compare this margin loss function to those used in boosting algorithms and to the more traditional cost functions. The loss functions that boosting algorithms optimize are typically expressed as functions of m. Thus AdaBoost (Schapire et al., 1998) uses an exponential (e^{−m}) margin loss function, LogitBoost (Friedman, Hastie and Tibshirani, 1998) uses the negative binomial log-likelihood, log₂(1 + e^{−2m}), whose shape is similar to a smoothed version of the soft-margin SVM loss function [1 − m]_+, and Doom II (Mason et al., 2000) approximates a theoretically motivated margin loss with 1 − tanh(m).

³ whose good generalization abilities are believed to be due to margin maximization.
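For reference, the margin loss functions discussed above can be written directly as functions of m = y·f̂(x); a small sketch follows (the function names are ours, not the paper's).

import numpy as np

def adaboost_loss(m):            # exponential loss
    return np.exp(-m)

def logitboost_loss(m):          # negative binomial log-likelihood
    return np.log2(1.0 + np.exp(-2.0 * m))

def doom2_loss(m):               # approximation used in Doom II
    return 1.0 - np.tanh(m)

def svm_hinge_loss(m):           # soft-margin SVM loss [1 - m]_+
    return np.maximum(0.0, 1.0 - m)

def squared_loss_as_margin(m):   # (f(x) - y)^2 = (1 - m)^2 for y in {-1, +1}
    return (1.0 - m) ** 2

def squared_loss_after_tanh(m):  # (tanh(f(x)) - 0.65 y)^2 = (0.65 - tanh(m))^2
    return (0.65 - np.tanh(m)) ** 2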

INPUT:
  data set {(x_1, y_1), ..., (x_ℓ, y_ℓ)}
  dictionary of functions D = {d_1, ..., d_m}
  number N of basis functions desired in the expansion (or, alternatively, a validation set to decide when to stop)
  how often to do a full backfitting: every p update steps
  a loss function L
INITIALIZE: current approximation \hat{f} and dictionary matrix D
  \hat{f} ← (0, ..., 0)ᵀ   and   D(i, k) ← d_k(x_i)  for i = 1..ℓ, k = 1..m
FOR n = 1..N (or until performance on validation set stops improving):
  \tilde{R} ← -\left( \frac{\partial L(y_1, \hat{f}_1)}{\partial \hat{f}_1}, \ldots, \frac{\partial L(y_\ell, \hat{f}_\ell)}{\partial \hat{f}_\ell} \right)
  γ_n ← arg max_{k=1..m} |⟨D(., k), \tilde{R}⟩| / ‖D(., k)‖
  If n is not a multiple of p, do a simple line minimization:
    α_n ← arg min_{α∈ℝ} Σ_{i=1}^{ℓ} L(y_i, \hat{f}_i + α D(i, γ_n))
    and update \hat{f}:  \hat{f} ← \hat{f} + α_n D(., γ_n)
  If n is a multiple of p, do a full backfitting (for example with gradient descent):
    α_{1..n} ← arg min_{α_{1..n}∈ℝⁿ} Σ_{i=1}^{ℓ} L(y_i, Σ_{k=1}^{n} α_k D(i, γ_k))
    and recompute  \hat{f} ← Σ_{k=1}^{n} α_k D(., γ_k)
RESULT: the solution found is defined by
  \hat{f}_N(x) = \sum_{n=1}^{N} \alpha_n d_{\gamma_n}(x)

Figure 4: Backfitting Matching Pursuit Algorithm with non-squared loss

As can be seen in figure 5 (left), all these functions encourage large positive margins, and differ mainly in how they penalize large negative ones. In particular, 1 − tanh(m) is expected to be more robust, as it won't penalize outliers to excess.

It is enlightening to compare these with the more traditional loss functions that have been used for neural networks in classification tasks (i.e. y ∈ {−1, +1}), when we express them as functions of m:

Squared loss: (\hat{f}(x) - y)^2 = (1 - m)^2

Squared loss after tanh with modified target: (\tanh(\hat{f}(x)) - 0.65\,y)^2 = (0.65 - \tanh(m))^2

Both are illustrated in figure 5 (right). Notice how the squared loss after tanh appears similar to the margin loss function used in Doom II, except that it slightly increases for large positive margins, which is why it behaves well and does not saturate even with unconstrained weights (boosting and SVM algorithms impose constraints on the weights, here denoted α's).

Figure 5: Boosting and SVM margin loss functions (left: exp(−m) [AdaBoost], log(1+exp(−m)) [LogitBoost], 1−tanh(m) [Doom II], (1−m)_+ [SVM]) vs. traditional loss functions (right: squared error as a margin cost function, squared error after tanh with 0.65 target), viewed as functions of the margin m = y·f(x). Interestingly, the latest of the margin-motivated loss functions (used in Doom II) is similar to the traditional squared error after tanh.

4 Kernel Matching Pursuit and links with other paradigms

4.1 Matching pursuit with a kernel-based dictionary

Kernel Matching Pursuit (KMP) is simply the idea of applying the Matching Pursuit family of algorithms to problems in machine learning, using a kernel-based dictionary: given a kernel function K : ℝ^d × ℝ^d → ℝ, we use as our dictionary the kernel centered on the training points: D = {d_i = K(·, x_i) | i = 1..ℓ}. Optionally, the constant function can also be included in the dictionary, which accounts for a bias term b; the functional form of the approximation \hat{f}_N then becomes

    \hat{f}_N(x) = b + \sum_{n=1}^{N} \alpha_n K(x, x_{\gamma_n})    (10)

where the γ_{1..N} are the indexes of the support points.
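A minimal sketch of how such a kernel-based dictionary can be assembled and how the resulting expansion (10) is evaluated on new points, assuming a Gaussian kernel; the helper names are ours and the greedy routines sketched in section 2 are reused unchanged.

import numpy as np

def gaussian_kernel(X1, X2, sigma):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def kernel_dictionary(X, sigma, with_bias=True):
    """Rows are training points, columns are dictionary functions evaluated at those points."""
    D = gaussian_kernel(X, X, sigma)                       # column i is K(., x_i) at the training points
    if with_bias:
        D = np.hstack([np.ones((X.shape[0], 1)), D])       # constant column recovers the bias term b
    return D

def predict(X_new, X, sigma, gammas, alphas, with_bias=True):
    """Evaluate f_N(x) = b + sum_n alpha_n K(x, x_{gamma_n}) on new points."""
    cols = gaussian_kernel(X_new, X, sigma)
    if with_bias:
        cols = np.hstack([np.ones((X_new.shape[0], 1)), cols])
    return cols[:, gammas] @ alphas

For example, D = kernel_dictionary(X, sigma) followed by the prefitting sketch of section 2.3 yields the indexes γ and weights α that predict() then consumes on test data.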

During training we only consider the values of the dictionary functions at the training points, so that it amounts to doing Matching Pursuit in a vector space of dimension ℓ. When using a squared error loss⁴, the complexity of all three variations of KMP (basic, backfitting and prefitting) is O(N·m·ℓ) = O(N·ℓ²) if we use all the training data as candidate support points. But it is also possible to use a random subset of the training points as support candidates (which yields an m < ℓ).

We would also like to emphasize the fact that the use of a dictionary gives a lot of additional flexibility to this framework, as it is possible to include any kind of function in it, in particular:

There is no restriction on the shape of the kernel (no positive-definiteness constraint, it could be asymmetrical, etc.).

The dictionary could include more than a single fixed kernel shape: it could mix different kernel types to choose from at each point, allowing for instance the algorithm to choose among several widths of a Gaussian for each support point.

Similarly, the dictionary could easily be used to constrain the algorithm to use a kernel shape specific to each class, based on prior knowledge.

The dictionary can incorporate non-kernel-based functions (we already mentioned the constant function to recover the bias term b, but this could also be used to incorporate prior knowledge).

For huge data-sets, a reduced subset can be used as the dictionary to speed up the training.

However, in this study we restrict ourselves to using a single fixed kernel, so that the resulting functional form is the same as the one obtained with SVMs.

4.2 Similarities and differences with SVMs

The functional form (10) is very similar to the one obtained with the Support Vector Machine (SVM) algorithm (Boser, Guyon and Vapnik, 1992), the main difference being that SVMs impose further constraints on α_{1..N}. However, the quantity optimized by the SVM algorithm is quite different from the KMP greedy optimization, especially when using a squared error loss. Consequently, the support vectors and coefficients found by the two types of algorithms are usually different (see our experimental results in section 6).

Another important difference, and one that was a motivation for this research, is that in KMP capacity control is achieved by directly controlling the sparsity of the solution, i.e. the number N of support vectors, whereas the capacity of SVMs is controlled through the box-constraint parameter C, which has an indirect and hardly controllable influence on sparsity. See (Graepel, Herbrich and Shawe-Taylor, 2000) for a discussion of the merits of sparsity and margin, and ways to combine them.

⁴ The algorithms generalized to arbitrary loss functions can be much more computationally intensive, as they imply a non-linear optimization step.
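Referring back to the remark in section 4.1 that a random subset of m < ℓ training points can serve as candidate support points, here is a minimal sketch of that reduction; the subset size and the seeding are assumptions.

import numpy as np

def subsampled_kernel_dictionary(X, sigma, n_candidates, seed=0):
    """Use only m < l randomly chosen training points as candidate support points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_candidates, replace=False)
    d2 = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1)   # l x m squared distances
    return np.exp(-d2 / sigma ** 2), idx                       # l x m dictionary matrix, candidate indexes

With such a reduced dictionary, the cost of a squared-error KMP run drops from O(N·ℓ²) to O(N·m·ℓ).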

4.3 Link with Radial Basis Functions

Squared-error KMP with a Gaussian kernel and prefitting appears to be identical to a particular Radial Basis Functions training algorithm called Orthogonal Least Squares RBF (Chen, Cowan and Grant, 1991) (OLS-RBF). In (Schölkopf et al., 1997) SVMs were compared to classical RBFs, where the RBF centers were chosen by unsupervised k-means clustering, and SVMs gave better results. To our knowledge, however, there has been no experimental comparison between OLS-RBF and SVMs, although their resulting functional forms are very much alike. Such an empirical comparison is one of the contributions of this paper. Basically, our results (section 6) show OLS-RBF (i.e. squared-error KMP) to perform as well as Gaussian SVMs, while allowing a tighter control of the number of support vectors used in the solution.

4.4 Boosting with kernels

KMP in its basic form generalized to using non-squared error is also very similar to boosting algorithms (Freund and Schapire, 1996; Friedman, Hastie and Tibshirani, 1998), in which the chosen class of weak learners would be the set of kernels centered on the training points. These algorithms differ mainly in the loss function they optimize, which we have already discussed in section 3.2.

5 Bounds on generalization error

The results of Vapnik on the Minimum Description Length principle (Vapnik, 1995; Vapnik, 1998) provide a possible framework for establishing bounds on the expected generalization error of KMP algorithms. One can also simply use the results on the generalization error obtained when the number of possible functions is a finite number M (and the capacity is therefore bounded by log M). We will show that, essentially, the bound depends linearly on the number of support vectors and logarithmically on the total number of training examples.

Vapnik's result (theorem 4.3, (Vapnik, 1995)) states that the expected generalization error rate E_gen, for binary classification, when training with ℓ examples, is less than 2C log(2) − 2 log(η)/ℓ with probability greater than 1 − η, where C is the compression rate: the number of bits to transfer the compressed conditional value of the training target classes (given the training input points) divided by the number of bits required to transmit them without compression, i.e. ℓ. When there are training errors, we can incorporate them into the compressed message by sending the identity (and the labels, in the multiclass case) of the wrongly labeled examples. The compression is due to the representation learned by the training algorithm. A good representation is one that requires few bits to represent the learned function, while keeping the training error low. This assumes that the number of possible functions is finite (which we will obtain by quantizing the α coefficients). To obtain compression, we take advantage of the sparse representation of the learned function in terms of only N support points. To obtain a rough bound we will encode the target outputs using three sets of bits, corresponding to three terms for C:

1. The first one is due to the classification errors: we have to send the identity and the correct class of the training errors. If the number of errors is e = ℓE_emp, that will cost log₂(ℓ choose e) bits.

In the case where the number of classes is N_c > 2, there is an increase in the number of bits by a factor log₂(N_c − 1), but there is a similar increase in the denominator of C (to encode the correct classes of all the training examples).

2. The second term is required to encode the identity of the support points: to choose N among ℓ examples requires log₂(ℓ choose N) bits.

3. The third term is to encode the quantized weights α_k associated with each support point, which will cost Np bits, where p is the number of bits of precision used to quantize the weights; it can be chosen as the smallest number that allows to obtain, with the discretized α's, the same classes on the training set as with the undiscretized α's.

To summarize, for KMP we have, for e training errors and N support vectors out of ℓ examples, with probability greater than 1 − η (over the choice of training set),

    E_{gen} < \frac{2\left(\log\binom{\ell}{e} + \log\binom{\ell}{N} + Np\log 2\right) - 2\log\eta}{\ell}    (11)

Note that \log\binom{\ell}{n} is poorly bounded by n\log_2\ell, in which the e/ℓ and N/ℓ ratios become apparent, but where a too large log ℓ factor appears. Slightly tighter bounds can be obtained using the result (Vapnik, 1995; Vapnik, 1998) for learning by choosing one function among M < ∞ functions: with probability at least 1 − η,

    E_{gen} \le E_{emp} + \frac{\log M - \log\eta}{\ell}\left(1 + \sqrt{1 + \frac{2 E_{emp}\,\ell}{\log M - \log\eta}}\right)    (12)

Using the same quantization of the α's (with precision p), one obtains, with \log M = \log\left(\binom{\ell}{N}\, 2^{Np}\right),

    E_{gen} < E_{emp} + \frac{\log\binom{\ell}{N} + Np\log 2 - \log\eta}{\ell}\left(1 + \sqrt{1 + \frac{2 E_{emp}\,\ell}{\log\binom{\ell}{N} + Np\log 2 - \log\eta}}\right)    (13)

In contrast, one can obtain an expectation bound (Vapnik, 1995) for SVMs that is E[E_gen] < E[E_emp] + E[N/ℓ], where E is the expectation over training sets (note that for SVMs, N is random because it depends on the training set). Note that the probability bounds can readily be converted into expectation bounds. For example, in the case of the MDL bound (eq. 11), one obtains that in expectation,

    E_{gen} < \frac{2\left(E\!\left[\log\binom{\ell}{e}\right] + \log\binom{\ell}{N} + Np\log 2 + 1\right)}{\ell}

To see the role of the ratio N/ℓ in the above, one can note that \log\binom{\ell}{N} < N\log\ell (but keep in mind that this is a rather poor bound).

Note that several related compression bounds have been studied, e.g. (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995; Graepel, Herbrich and Shawe-Taylor, 2000). The results of (Graepel, Herbrich and Shawe-Taylor, 2000) are meant for maximum margin classifiers and draw interesting connections between sparsity and maximum margin. The results in (Littlestone and Warmuth, 1986; Floyd and Warmuth, 1995) are very general (and very much linked to the above discussion), but they apply to classifiers which can be specified using only a subset of the training examples. However, note that in the case of Matching Pursuit, the classifier requires not only the support vectors but also the weights α_i, which in general depend on the whole training set.
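The bound (11) is straightforward to evaluate numerically; the following small sketch does so for illustrative values of ℓ, e, N, p and η (the numbers are arbitrary examples, not results from the paper).

import math

def log_binomial(n, k):
    """Natural log of (n choose k), computed via lgamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def kmp_mdl_bound(l, e, N, p, eta):
    """Right-hand side of equation (11)."""
    bits = log_binomial(l, e) + log_binomial(l, N) + N * p * math.log(2)
    return (2.0 * bits - 2.0 * math.log(eta)) / l

# Example: 10000 training points, 20 errors, 50 support vectors, 8-bit weights, eta = 0.05
print(kmp_mdl_bound(10000, 20, 50, 8, 0.05))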

6 Experimental results on binary classification

Throughout this section:

any mention of KMP without further specification of the loss function means least-squares KMP (also sometimes written KMP-mse);

KMP-tanh refers to KMP using squared error after a hyperbolic tangent with modified targets (which behaves more like a typical margin loss function, as we discussed earlier in section 3.2).

Unless otherwise specified, we used the prefitting matching pursuit algorithm of figure 3 to train least-squares KMP. To train KMP-tanh we always used the backfitting matching pursuit with non-squared loss algorithm of figure 4, with a conjugate gradient optimizer to optimize the α_{1..n}⁵.

6.1 2D experiments

Figure 6 shows a simple 2D binary classification problem with the decision surface found by the three versions of squared-error KMP (basic, backfitting and prefitting) and a hard-margin SVM, when using the same Gaussian kernel. We fixed the number N of support points for the prefitting and backfitting versions to be the same as the number of support points found by the SVM algorithm. The aim of this experiment was to illustrate the following points:

Basic KMP, after 100 iterations, during which it mostly cycled back to previously chosen support points to improve their weights, is still unable to separate the data points. This shows that the backfitting and prefitting versions are a useful improvement, while the basic algorithm appears to be a bad choice if we want sparse solutions.

The backfitting and prefitting KMP algorithms are able to find a reasonable solution (the solution found by prefitting looks slightly better in terms of margin), but choose different support vectors than the SVM, which are not necessarily close to the decision surface (as they are in SVMs). It should be noted that the Relevance Vector Machine (Tipping, 2000) similarly produces⁶ solutions in which the relevance vectors do not lie close to the border.

Figure 7, where we used a simple dot-product kernel (i.e. linear decision surfaces), illustrates a problem that can arise when using a least-squares fit: since the squared error penalizes large positive margins, the decision surface is drawn towards the cluster on the lower right, at the expense of a few misclassified points. As expected, the use of a tanh loss function appears to correct this problem.

⁵ We tried several frequencies at which to do the full backfitting, but it did not seem to have a real impact, as long as it was done often enough.
⁶ however, in a much more computationally intensive fashion.

Figure 6: From left to right: 100 iterations of basic KMP, 7 iterations of KMP backfitting, 7 iterations of KMP prefitting, and SVM. Classes are + and . Support vectors are circled. Prefitting KMP and SVM appear to find equally reasonable solutions, though using different support vectors. Only SVM chooses its support vectors close to the decision surface. Backfitting chooses yet another support set, and its decision surface appears to have a slightly worse margin. As for basic KMP, after 100 iterations during which it mostly cycled back to previously chosen support points to improve their weights, it appears to use more support vectors than the others while still being unable to separate the data points, and is thus a bad choice if we want sparse solutions.

Figure 7: Problem with the least-squares fit that leads KMP-mse (center) to misclassify points, but does not affect SVMs (left), and is successfully treated by KMP-tanh (right).

6.2 US Postal Service Database

The main purpose of this experiment was to complement the results of (Schölkopf et al., 1997) with those obtained using KMP-mse, which, as already mentioned, is equivalent to orthogonal least squares RBF (Chen, Cowan and Grant, 1991). In (Schölkopf et al., 1997) the RBF centers were chosen by unsupervised k-means clustering, in what they referred to as "Classical RBF", and a gradient descent optimization procedure was used to train the kernel weights.

We repeated the experiment using KMP-mse (equivalent to OLS-RBF) to find the support centers, with the same Gaussian kernel and the same training set (7300 patterns) and independent test set (2007 patterns) of preprocessed handwritten digits. Table 1 gives the number of errors obtained by the various algorithms on the tasks consisting of discriminating each digit versus all the others (see (Schölkopf et al., 1997) for more details). No validation data was used to choose the number of bases (support vectors) for the KMP. Instead, we trained with N equal to the number of support vectors obtained with the SVM, and also with N equal to half that number, to see whether a sparser KMP model would still yield good results. As can be seen, results obtained with KMP are comparable to those obtained for SVMs, contrary to the results obtained with k-means RBFs, and there is only a slight loss of performance when using as few as half the number of support vectors.

Table 1: USPS results: number of errors on the test set (2007 patterns), when using the same number of support vectors as found by the SVM (except the last row, which uses half #sv). Squared-error KMP (same as OLS-RBF) appears to perform as well as SVM.

Digit class | #sv | SVM | k-means RBF | KMP (same #sv) | KMP (half #sv)

6.3 Benchmark datasets

We did some further experiments on 5 well-known datasets from the UCI machine-learning databases, using Gaussian kernels of the form K(x_1, x_2) = e^{−‖x_1 − x_2‖² / σ²}.

A first series of experiments used the machinery of the Delve (Rasmussen et al., 1996) system to assess performance on the Mushrooms dataset. Hyper-parameters (the σ of the kernel, the box-constraint parameter C for soft-margin SVM, and the number of support points for KMP) were chosen automatically for each run using 10-fold cross-validation. The results for varying sizes of the training set are summarized in table 2. The p-values reported in the table are those computed automatically by the Delve system⁷.

Table 2: Results obtained on the mushrooms data set with the Delve system. KMP requires fewer support vectors, while none of the differences in error rates are significant.

size of train | KMP error | SVM error | p-value (t-test) | KMP #s.v. | SVM #s.v.

⁷ For each size, the Delve system did its estimations based on 8 disjoint training sets of the given size and 8 disjoint test sets of size 503, except for 1024, in which case it used 4 disjoint training sets of size 1024 and 4 test sets of size 1007.

For Wisconsin Breast Cancer, Sonar, Pima Indians Diabetes and Ionosphere, we used a slightly different procedure. The σ of the kernel was first fixed to a reasonable value for the given data set⁸. Then we used the following procedure: the dataset was randomly split into three equal-sized subsets for training, validation and testing. SVM, KMP-mse and KMP-tanh were then trained on the training set, while the validation set was used to choose the optimal box-constraint parameter C for SVMs⁹, and to do early stopping (decide on the number N of s.v.) for KMP. Finally, the trained models were tested on the independent test set.

This procedure was repeated 50 times over 50 different random splits of the dataset into train/validation/test to estimate confidence measures (p-values were computed using the resampled t-test (Nadeau and Bengio, 2000)). Table 3 reports the average error rate measured on the test sets, and the rounded average number of support vectors found by each algorithm.

As can be seen from these experiments, the error rates obtained are comparable, but the KMP versions appear to require fewer support vectors than SVMs. On these datasets, however (contrary to what we saw previously on 2D artificial data), KMP-tanh did not seem to give any significant improvement over KMP-mse. Even in other experiments where we added label noise, KMP-tanh didn't seem to improve generalization performance¹⁰.

Table 3: Results on 4 UCI-MLDB datasets. Again, error rates are not significantly different (values in parentheses are the p-values for the difference with SVMs), but KMPs require fewer support vectors.

Dataset | SVM error | KMP-mse error | KMP-tanh error | SVM #s.v. | KMP-mse #s.v. | KMP-tanh #s.v.
Wisc. Cancer | 3.41% | 3.40% (0.49) | 3.49% (0.45) | | |
Sonar | 20.6% | 21.0% (0.45) | 26.6% (0.16) | | |
Pima Indians | 24.1% | 23.9% (0.44) | 24.0% (0.49) | | |
Ionosphere | 6.51% | 6.87% (0.41) | 6.85% (0.40) | | |

⁸ These were chosen by trial and error using SVMs with a validation set and several values of C, and keeping what seemed the best σ; thus this choice was made to the advantage of SVMs (although they did not seem too sensitive to it) rather than KMP. The values used were: 4.0 for Wisconsin Breast Cancer, 6.0 for Pima Indians Diabetes, 2.0 for Ionosphere and Sonar.
⁹ Values of 0.02, 0.05, 0.07, 0.1, 0.5, 1, 2, 3, 5, 10, 20, 100 were tried for C.
¹⁰ We do not give a detailed account of these experiments here, as their primary intent was to show that the tanh error function could have an advantage over squared error in the presence of label noise, but the results were inconclusive.
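A sketch of the early-stopping part of this protocol, choosing the number N of support vectors on a validation set; it reuses the kmp_prefitting sketch from section 2.3 and exploits the fact that the n-step prefitting solution equals the least-squares fit on the first n greedily chosen columns. The split sizes, error measure and names are assumptions.

import numpy as np

def choose_N_by_validation(D_train, y_train, D_valid, y_valid, N_max):
    """Return the N with lowest validation classification error (y in {-1, +1})."""
    gammas, _ = kmp_prefitting(D_train, y_train, N_max)     # greedy order of support points
    best_N, best_err = 1, np.inf
    for n in range(1, N_max + 1):
        cols = gammas[:n]
        # the n-step prefitting solution is the least-squares fit on the first n chosen columns
        coef, *_ = np.linalg.lstsq(D_train[:, cols], y_train, rcond=None)
        err = np.mean(np.sign(D_valid[:, cols] @ coef) != y_valid)
        if err < best_err:
            best_N, best_err = n, err
    return best_N, best_err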

7 Conclusion

We have shown how Matching Pursuit provides a flexible framework to build and study alternative kernel-based methods, how it can be extended to use arbitrary differentiable loss functions, and how it relates to SVMs, RBF training procedures, and boosting methods. We have also provided experimental evidence that such greedy constructive algorithms can perform as well as SVMs, while allowing better control of the sparsity of the solution, and thus often lead to solutions with far fewer support vectors. It should also be mentioned that the use of a dictionary gives additional flexibility, as it can be used, for instance, to mix several kernel shapes to choose from, similar to what has been done in (Weston et al., 1999), or to include other non-kernel functions based on prior knowledge, which opens the way to further research.

References

Boser, B., Guyon, I., and Vapnik, V. (1992). An algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, Pittsburgh.

Chen, S. (1995). Basis Pursuit. PhD thesis, Department of Statistics, Stanford University.

Chen, S., Cowan, F., and Grant, P. (1991). Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on Neural Networks, 2(2).

Davis, G., Mallat, S., and Zhang, Z. (1994). Adaptive time-frequency decompositions. Optical Engineering, 33(7).

Floyd, S. and Warmuth, M. (1995). Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3).

Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference.

Friedman, J. (1999). Greedy function approximation: a gradient boosting machine. IMS 1999 Reitz Lecture, February 24, 1999, Dept. of Statistics, Stanford University.

Friedman, J., Hastie, T., and Tibshirani, R. (1998). Additive logistic regression: a statistical view of boosting. Technical report, August 1998, Department of Statistics, Stanford University.

Graepel, T., Herbrich, R., and Shawe-Taylor, J. (2000). Generalization error bounds for sparse linear classifiers. In Thirteenth Annual Conference on Computational Learning Theory, 2000, in press. Morgan Kaufmann.

Littlestone, N. and Warmuth, M. (1986). Relating data compression and learnability. Unpublished manuscript, University of California Santa Cruz. An extended version can be found in (Floyd and Warmuth, 1995).

Mallat, S. and Zhang, Z. (1993). Matching pursuit with time-frequency dictionaries. IEEE Trans. Signal Proc., 41(12).

Mason, L., Baxter, J., Bartlett, P., and Frean, M. (2000). Boosting algorithms as gradient descent. In Solla, S. A., Leen, T. K., and Müller, K.-R., editors, Advances in Neural Information Processing Systems, volume 12. MIT Press.

Nadeau, C. and Bengio, Y. (2000). Inference for the generalization error. In Solla, S. A., Leen, T. K., and Müller, K.-R., editors, Advances in Neural Information Processing Systems, volume 12. MIT Press.

Pati, Y., Rezaiifar, R., and Krishnaprasad, P. (1993). Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers.

Poggio, T. and Girosi, F. (1998). A sparse representation for function approximation. Neural Computation, 10(6).

Rasmussen, C., Neal, R., Hinton, G., van Camp, D., Ghahramani, Z., Kustra, R., and Tibshirani, R. (1996). The DELVE manual. DELVE can be found at delve.

Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5).

Schölkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., and Vapnik, V. (1997). Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45.

Smola, A. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Langley, P., editor, International Conference on Machine Learning, San Francisco. Morgan Kaufmann.

Tipping, M. (2000). The relevance vector machine. In Solla, S. A., Leen, T. K., and Müller, K.-R., editors, Advances in Neural Information Processing Systems, volume 12. MIT Press.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.

Vapnik, V. (1998). Statistical Learning Theory. Wiley, Lecture Notes in Economics and Mathematical Systems, volume 454.

Weston, J., Gammerman, A., Stitson, M., Vapnik, V., Vovk, V., and Watkins, C. (1999). Density estimation using support vector machines. In Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors, Advances in Kernel Methods — Support Vector Learning, Cambridge, MA. MIT Press.


More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Fast Blind Recognition of Channel Codes

Fast Blind Recognition of Channel Codes Fast Bind Recognition of Channe Codes Reza Moosavi and Erik G. Larsson Linköping University Post Print N.B.: When citing this work, cite the origina artice. 213 IEEE. Persona use of this materia is permitted.

More information

Problem set 6 The Perron Frobenius theorem.

Problem set 6 The Perron Frobenius theorem. Probem set 6 The Perron Frobenius theorem. Math 22a4 Oct 2 204, Due Oct.28 In a future probem set I want to discuss some criteria which aow us to concude that that the ground state of a sef-adjoint operator

More information

Soft Clustering on Graphs

Soft Clustering on Graphs Soft Custering on Graphs Kai Yu 1, Shipeng Yu 2, Voker Tresp 1 1 Siemens AG, Corporate Technoogy 2 Institute for Computer Science, University of Munich kai.yu@siemens.com, voker.tresp@siemens.com spyu@dbs.informatik.uni-muenchen.de

More information

Haar Decomposition and Reconstruction Algorithms

Haar Decomposition and Reconstruction Algorithms Jim Lambers MAT 773 Fa Semester 018-19 Lecture 15 and 16 Notes These notes correspond to Sections 4.3 and 4.4 in the text. Haar Decomposition and Reconstruction Agorithms Decomposition Suppose we approximate

More information

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA ON THE SYMMETRY OF THE POWER INE CHANNE T.C. Banwe, S. Gai {bct, sgai}@research.tecordia.com Tecordia Technoogies, Inc., 445 South Street, Morristown, NJ 07960, USA Abstract The indoor power ine network

More information

$, (2.1) n="# #. (2.2)

$, (2.1) n=# #. (2.2) Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier

More information

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron Neura Information Processing - Letters and Reviews Vo. 5, No. 2, November 2004 LETTER A Soution to the 4-bit Parity Probem with a Singe Quaternary Neuron Tohru Nitta Nationa Institute of Advanced Industria

More information

BP neural network-based sports performance prediction model applied research

BP neural network-based sports performance prediction model applied research Avaiabe onine www.jocpr.com Journa of Chemica and Pharmaceutica Research, 204, 6(7:93-936 Research Artice ISSN : 0975-7384 CODEN(USA : JCPRC5 BP neura networ-based sports performance prediction mode appied

More information

https://doi.org/ /epjconf/

https://doi.org/ /epjconf/ HOW TO APPLY THE OPTIMAL ESTIMATION METHOD TO YOUR LIDAR MEASUREMENTS FOR IMPROVED RETRIEVALS OF TEMPERATURE AND COMPOSITION R. J. Sica 1,2,*, A. Haefee 2,1, A. Jaai 1, S. Gamage 1 and G. Farhani 1 1 Department

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

Melodic contour estimation with B-spline models using a MDL criterion

Melodic contour estimation with B-spline models using a MDL criterion Meodic contour estimation with B-spine modes using a MDL criterion Damien Loive, Ney Barbot, Oivier Boeffard IRISA / University of Rennes 1 - ENSSAT 6 rue de Kerampont, B.P. 80518, F-305 Lannion Cedex

More information

II. PROBLEM. A. Description. For the space of audio signals

II. PROBLEM. A. Description. For the space of audio signals CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time

More information

8 Digifl'.11 Cth:uits and devices

8 Digifl'.11 Cth:uits and devices 8 Digif'. Cth:uits and devices 8. Introduction In anaog eectronics, votage is a continuous variabe. This is usefu because most physica quantities we encounter are continuous: sound eves, ight intensity,

More information

FRIEZE GROUPS IN R 2

FRIEZE GROUPS IN R 2 FRIEZE GROUPS IN R 2 MAXWELL STOLARSKI Abstract. Focusing on the Eucidean pane under the Pythagorean Metric, our goa is to cassify the frieze groups, discrete subgroups of the set of isometries of the

More information

A Novel Learning Method for Elman Neural Network Using Local Search

A Novel Learning Method for Elman Neural Network Using Local Search Neura Information Processing Letters and Reviews Vo. 11, No. 8, August 2007 LETTER A Nove Learning Method for Eman Neura Networ Using Loca Search Facuty of Engineering, Toyama University, Gofuu 3190 Toyama

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, as we as various

More information

PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR. Pierrick Bruneau, Marc Gelgon and Fabien Picarougne

PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR. Pierrick Bruneau, Marc Gelgon and Fabien Picarougne 17th European Signa Processing Conference (EUSIPCO 2009) Gasgow, Scotand, August 24-28, 2009 PARSIMONIOUS VARIATIONAL-BAYES MIXTURE AGGREGATION WITH A POISSON PRIOR Pierric Bruneau, Marc Gegon and Fabien

More information

AST 418/518 Instrumentation and Statistics

AST 418/518 Instrumentation and Statistics AST 418/518 Instrumentation and Statistics Cass Website: http://ircamera.as.arizona.edu/astr_518 Cass Texts: Practica Statistics for Astronomers, J.V. Wa, and C.R. Jenkins, Second Edition. Measuring the

More information

Paragraph Topic Classification

Paragraph Topic Classification Paragraph Topic Cassification Eugene Nho Graduate Schoo of Business Stanford University Stanford, CA 94305 enho@stanford.edu Edward Ng Department of Eectrica Engineering Stanford University Stanford, CA

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 3 3 33 34 35 36 37 38 39 4 4 4 43 44 45 46 47 48 49 5 5 5 53 54 Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing

More information

4 1-D Boundary Value Problems Heat Equation

4 1-D Boundary Value Problems Heat Equation 4 -D Boundary Vaue Probems Heat Equation The main purpose of this chapter is to study boundary vaue probems for the heat equation on a finite rod a x b. u t (x, t = ku xx (x, t, a < x < b, t > u(x, = ϕ(x

More information

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees Improving the Accuracy of Booean Tomography by Expoiting Path Congestion Degrees Zhiyong Zhang, Gaoei Fei, Fucai Yu, Guangmin Hu Schoo of Communication and Information Engineering, University of Eectronic

More information

BDD-Based Analysis of Gapped q-gram Filters

BDD-Based Analysis of Gapped q-gram Filters BDD-Based Anaysis of Gapped q-gram Fiters Marc Fontaine, Stefan Burkhardt 2 and Juha Kärkkäinen 2 Max-Panck-Institut für Informatik Stuhsatzenhausweg 85, 6623 Saarbrücken, Germany e-mai: stburk@mpi-sb.mpg.de

More information

Turbo Codes. Coding and Communication Laboratory. Dept. of Electrical Engineering, National Chung Hsing University

Turbo Codes. Coding and Communication Laboratory. Dept. of Electrical Engineering, National Chung Hsing University Turbo Codes Coding and Communication Laboratory Dept. of Eectrica Engineering, Nationa Chung Hsing University Turbo codes 1 Chapter 12: Turbo Codes 1. Introduction 2. Turbo code encoder 3. Design of intereaver

More information

A Better Way to Pretrain Deep Boltzmann Machines

A Better Way to Pretrain Deep Boltzmann Machines A Better Way to Pretrain Deep Botzmann Machines Rusan Saakhutdino Department of Statistics and Computer Science Uniersity of Toronto rsaakhu@cs.toronto.edu Geoffrey Hinton Department of Computer Science

More information

Chemical Kinetics Part 2

Chemical Kinetics Part 2 Integrated Rate Laws Chemica Kinetics Part 2 The rate aw we have discussed thus far is the differentia rate aw. Let us consider the very simpe reaction: a A à products The differentia rate reates the rate

More information

Universal Consistency of Multi-Class Support Vector Classification

Universal Consistency of Multi-Class Support Vector Classification Universa Consistency of Muti-Cass Support Vector Cassification Tobias Gasmachers Dae Moe Institute for rtificia Inteigence IDSI, 6928 Manno-Lugano, Switzerand tobias@idsia.ch bstract Steinwart was the

More information

Appendix A: MATLAB commands for neural networks

Appendix A: MATLAB commands for neural networks Appendix A: MATLAB commands for neura networks 132 Appendix A: MATLAB commands for neura networks p=importdata('pn.xs'); t=importdata('tn.xs'); [pn,meanp,stdp,tn,meant,stdt]=prestd(p,t); for m=1:10 net=newff(minmax(pn),[m,1],{'tansig','purein'},'trainm');

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

MONOCHROMATIC LOOSE PATHS IN MULTICOLORED k-uniform CLIQUES

MONOCHROMATIC LOOSE PATHS IN MULTICOLORED k-uniform CLIQUES MONOCHROMATIC LOOSE PATHS IN MULTICOLORED k-uniform CLIQUES ANDRZEJ DUDEK AND ANDRZEJ RUCIŃSKI Abstract. For positive integers k and, a k-uniform hypergraph is caed a oose path of ength, and denoted by

More information

Emmanuel Abbe Colin Sandon

Emmanuel Abbe Colin Sandon Detection in the stochastic bock mode with mutipe custers: proof of the achievabiity conjectures, acycic BP, and the information-computation gap Emmanue Abbe Coin Sandon Abstract In a paper that initiated

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing theorem to characterize the stationary distribution of the stochastic process with SDEs in (3) Theorem 3

More information

Some Properties of Regularized Kernel Methods

Some Properties of Regularized Kernel Methods Journa of Machine Learning Research 5 (2004) 1363 1390 Submitted 12/03; Revised 7/04; Pubished 10/04 Some Properties of Reguarized Kerne Methods Ernesto De Vito Dipartimento di Matematica Università di

More information

arxiv: v1 [cs.lg] 31 Oct 2017

arxiv: v1 [cs.lg] 31 Oct 2017 ACCELERATED SPARSE SUBSPACE CLUSTERING Abofaz Hashemi and Haris Vikao Department of Eectrica and Computer Engineering, University of Texas at Austin, Austin, TX, USA arxiv:7.26v [cs.lg] 3 Oct 27 ABSTRACT

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information

Asynchronous Control for Coupled Markov Decision Systems

Asynchronous Control for Coupled Markov Decision Systems INFORMATION THEORY WORKSHOP (ITW) 22 Asynchronous Contro for Couped Marov Decision Systems Michae J. Neey University of Southern Caifornia Abstract This paper considers optima contro for a coection of

More information

Nonlinear Gaussian Filtering via Radial Basis Function Approximation

Nonlinear Gaussian Filtering via Radial Basis Function Approximation 51st IEEE Conference on Decision and Contro December 10-13 01 Maui Hawaii USA Noninear Gaussian Fitering via Radia Basis Function Approximation Huazhen Fang Jia Wang and Raymond A de Caafon Abstract This

More information

LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL HARMONICS

LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL HARMONICS MASSACHUSETTS INSTITUTE OF TECHNOLOGY Physics Department Physics 8.07: Eectromagnetism II October 7, 202 Prof. Aan Guth LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL

More information

<C 2 2. λ 2 l. λ 1 l 1 < C 1

<C 2 2. λ 2 l. λ 1 l 1 < C 1 Teecommunication Network Contro and Management (EE E694) Prof. A. A. Lazar Notes for the ecture of 7/Feb/95 by Huayan Wang (this document was ast LaT E X-ed on May 9,995) Queueing Primer for Muticass Optima

More information

V.B The Cluster Expansion

V.B The Cluster Expansion V.B The Custer Expansion For short range interactions, speciay with a hard core, it is much better to repace the expansion parameter V( q ) by f(q ) = exp ( βv( q )) 1, which is obtained by summing over

More information

Adaptive Regularization for Transductive Support Vector Machine

Adaptive Regularization for Transductive Support Vector Machine Adaptive Reguarization for Transductive Support Vector Machine Zengin Xu Custer MMCI Saarand Univ. & MPI INF Saarbrucken, Germany zxu@mpi-inf.mpg.de Rong Jin Computer Sci. & Eng. Michigan State Univ. East

More information

More Scattering: the Partial Wave Expansion

More Scattering: the Partial Wave Expansion More Scattering: the Partia Wave Expansion Michae Fower /7/8 Pane Waves and Partia Waves We are considering the soution to Schrödinger s equation for scattering of an incoming pane wave in the z-direction

More information

Kernel Trick Embedded Gaussian Mixture Model

Kernel Trick Embedded Gaussian Mixture Model Kerne Trick Embedded Gaussian Mixture Mode Jingdong Wang, Jianguo Lee, and Changshui Zhang State Key Laboratory of Inteigent Technoogy and Systems Department of Automation, Tsinghua University Beijing,

More information

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers Ant Coony Agorithms for Constructing Bayesian Muti-net Cassifiers Khaid M. Saama and Aex A. Freitas Schoo of Computing, University of Kent, Canterbury, UK. {kms39,a.a.freitas}@kent.ac.uk December 5, 2013

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, we as various

More information

V.B The Cluster Expansion

V.B The Cluster Expansion V.B The Custer Expansion For short range interactions, speciay with a hard core, it is much better to repace the expansion parameter V( q ) by f( q ) = exp ( βv( q )), which is obtained by summing over

More information

Reichenbachian Common Cause Systems

Reichenbachian Common Cause Systems Reichenbachian Common Cause Systems G. Hofer-Szabó Department of Phiosophy Technica University of Budapest e-mai: gszabo@hps.ete.hu Mikós Rédei Department of History and Phiosophy of Science Eötvös University,

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

Chemical Kinetics Part 2. Chapter 16

Chemical Kinetics Part 2. Chapter 16 Chemica Kinetics Part 2 Chapter 16 Integrated Rate Laws The rate aw we have discussed thus far is the differentia rate aw. Let us consider the very simpe reaction: a A à products The differentia rate reates

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

THE OUT-OF-PLANE BEHAVIOUR OF SPREAD-TOW FABRICS

THE OUT-OF-PLANE BEHAVIOUR OF SPREAD-TOW FABRICS ECCM6-6 TH EUROPEAN CONFERENCE ON COMPOSITE MATERIALS, Sevie, Spain, -6 June 04 THE OUT-OF-PLANE BEHAVIOUR OF SPREAD-TOW FABRICS M. Wysocki a,b*, M. Szpieg a, P. Heström a and F. Ohsson c a Swerea SICOMP

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

4 Separation of Variables

4 Separation of Variables 4 Separation of Variabes In this chapter we describe a cassica technique for constructing forma soutions to inear boundary vaue probems. The soution of three cassica (paraboic, hyperboic and eiptic) PDE

More information

Control Chart For Monitoring Nonparametric Profiles With Arbitrary Design

Control Chart For Monitoring Nonparametric Profiles With Arbitrary Design Contro Chart For Monitoring Nonparametric Profies With Arbitrary Design Peihua Qiu 1 and Changiang Zou 2 1 Schoo of Statistics, University of Minnesota, USA 2 LPMC and Department of Statistics, Nankai

More information

A Simple and Efficient Algorithm of 3-D Single-Source Localization with Uniform Cross Array Bing Xue 1 2 a) * Guangyou Fang 1 2 b and Yicai Ji 1 2 c)

A Simple and Efficient Algorithm of 3-D Single-Source Localization with Uniform Cross Array Bing Xue 1 2 a) * Guangyou Fang 1 2 b and Yicai Ji 1 2 c) A Simpe Efficient Agorithm of 3-D Singe-Source Locaization with Uniform Cross Array Bing Xue a * Guangyou Fang b Yicai Ji c Key Laboratory of Eectromagnetic Radiation Sensing Technoogy, Institute of Eectronics,

More information