Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels


Vikas Sindhwani and Aurélie C. Lozano
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598

Abstract

We consider regularized risk minimization in a large dictionary of Reproducing Kernel Hilbert Spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as Sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using the ℓ1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended for group sparsity, the sparse MKL literature has so far mainly adopted the former with mild empirical success. In this paper, we close this gap by proposing a Group-OMP based framework for sparse MKL. Unlike ℓ1-MKL, our approach decouples the sparsity regularizer (via a direct ℓ0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16].

1 Introduction

Kernel methods are widely used to address a variety of learning problems including classification, regression, structured prediction, data fusion, clustering and dimensionality reduction [22, 23]. However, choosing an appropriate kernel and tuning the corresponding hyper-parameters can be highly challenging, especially when little is known about the task at hand. In addition, many modern problems involve multiple heterogeneous data sources (e.g., gene functional classification, prediction of protein-protein interactions), each necessitating the use of a different kernel. This strongly suggests avoiding the risks and limitations of single kernel selection by considering flexible combinations of multiple kernels. Furthermore, it is appealing to impose sparsity to discard noisy data sources. As several papers have provided evidence in favor of using multiple kernels (e.g., [19, 14, 7]), the multiple kernel learning problem (MKL) has generated a large body of recent work [13, 5, 24, 33], and become the focal point of the intersection between non-parametric function estimation and sparse learning methods traditionally explored in linear settings.

Given a convex loss function, the MKL problem is usually formulated as the minimization of empirical risk together with a mixed norm regularizer, e.g., the square of the sum of individual RKHS norms, or variants thereof, that have a close relationship to the Group Lasso criterion [30, 2]. Equivalently, this formulation may be viewed as simultaneous optimization of both the non-negative convex combination of kernels, as well as the prediction functions induced by this combined kernel. In constraining the combination of kernels, the ℓ1 penalty is of particular interest as it encourages sparsity in the supporting kernels, which is highly desirable when the number of kernels considered is large. The MKL literature has rapidly evolved along two directions.

One direction concerns the scalability of optimization algorithms, beyond the early pioneering proposals based on semi-definite programming or second-order cone programming [13, 5], to simpler and more efficient alternating optimization schemes [20, 29, 24]; the other concerns the use of ℓp norms [10, 29] to construct complex non-sparse kernel combinations with the goal of outperforming ℓ1-norm MKL which, as reported in several papers, has demonstrated mild success in practical applications.

The class of Orthogonal Matching Pursuit techniques has recently received considerable attention as a competitive alternative to Lasso. The basic OMP algorithm originates from the signal-processing community and is similar to forward greedy feature selection, except that it performs re-estimation of the model parameters in each iteration, which has been shown to contribute to improved accuracy. For linear models, some strong theoretical performance guarantees and empirical support have been provided for OMP [31] and its extension for variable group selection, Group-OMP [16]. In particular, it was shown in [25, 9] that OMP and Lasso exhibit competitive theoretical performance guarantees. It is therefore desirable to investigate the use of Matching Pursuit techniques in the MKL framework and whether one may be able to improve upon existing MKL methods.

Our contributions in this paper are as follows. We propose a non-parametric kernel-based extension to Group-OMP [16]. In terms of the feature space (as opposed to function space) perspective of kernel methods, this allows Group-OMP to handle groups that can potentially contain infinitely many features. By adding regularization to Group-OMP, we allow it to handle settings where the sample size might be smaller than the number of features in any group. Rather than imposing a mixed ℓ1/RKHS-norm regularizer as in Group-Lasso based MKL, a Group-OMP based approach allows us to consider the exact sparse kernel selection problem via ℓ0 regularization instead. Note that, in contrast to the group-lasso penalty, the ℓ0 penalty by itself has no effect on the smoothness of each individual component. This allows for a clear decoupling between the role of the smoothness regularizer (namely, an RKHS regularizer) and the sparsity regularizer (via the ℓ0 penalty). Our greedy algorithms allow for simple and flexible optimization schemes that only require a black-box solver for standard learning algorithms. In this paper, we focus on multiple kernel learning with Regularized Least Squares (RLS). We provide a bound on the Rademacher complexity of the hypothesis sets considered by our formulation. We derive conditions analogous to those for OMP [27] and Group-OMP [16] to guarantee the correctness of kernel selection. We close this paper with empirical studies on simulated and real-world datasets that confirm the value of our methods.

2 Learning Over an RKHS Dictionary

In this section, we set up some notation and give a brief background before introducing our main objective function and describing our algorithm in the next section. Let H_1 ... H_N be a collection of Reproducing Kernel Hilbert Spaces with associated kernel functions k_1 ... k_N defined on the input space X ⊂ R^d. Let H denote the sum space of functions,
H = H_1 + H_2 + ... + H_N = { f : X → R | f(x) = ∑_{j=1}^N f_j(x), x ∈ X, f_j ∈ H_j, j = 1 ... N }

Let us equip this space with the following ℓp norms,

‖f‖_{ℓp(H)} = inf { ( ∑_{j=1}^N ‖f_j‖_{H_j}^p )^{1/p} : f(x) = ∑_{j=1}^N f_j(x), x ∈ X, f_j ∈ H_j, j = 1 ... N }   (1)

It is now natural to consider a regularized risk minimization problem over such an RKHS dictionary, given a collection of training examples {x_i, y_i}_{i=1}^ℓ,

argmin_{f ∈ H} ∑_{i=1}^ℓ V(y_i, f(x_i)) + λ ‖f‖²_{ℓp(H)}   (2)

where V(·,·) is a convex loss function, such as the squared loss in the Regularized Least Squares (RLS) algorithm or the hinge loss in the SVM method. If this problem again has elements of an RKHS structure, then, via the Representer Theorem, it can again be reduced to a finite dimensional problem and efficiently solved.
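For the squared loss and a fixed kernel, this reduction is the familiar RLS linear system, summarized in the minimal sketch below (our own NumPy illustration with hypothetical function names, not code from the paper; `K` is a precomputed Gram matrix). The closed-form solution and objective value it computes are restated when the algorithm is described in the next section.

```python
import numpy as np

def rls_fit(K, y, lam):
    """Regularized least squares for a fixed kernel (a sketch).

    By the Representer Theorem, the minimizer of
        sum_i (y_i - f(x_i))^2 + lam * ||f||^2
    is f*(x) = sum_i alpha_i k(x, x_i) with alpha = (K + lam*I)^{-1} y.
    """
    ell = K.shape[0]
    alpha = np.linalg.solve(K + lam * np.eye(ell), y)
    # Objective value attained by the minimizer: lam * y^T (K + lam*I)^{-1} y.
    objective = lam * float(y @ alpha)
    return alpha, objective
```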

Let q = p/(2−p) and let us define the q-convex hull of the set of kernel functions to be the following,

co_q(k_1 ... k_N) = { k_γ : X × X → R | k_γ(x, z) = ∑_{j=1}^N γ_j k_j(x, z), ∑_{j=1}^N γ_j^q = 1, γ_j ≥ 0 }

where γ ∈ R^N. It is easy to see that the non-negative combination of kernels, k_γ, is itself a valid kernel with an associated RKHS H_{k_γ}. With this definition, [17] show the following,

‖f‖_{ℓp(H)} = inf_γ { ‖f‖_{H_{k_γ}} : k_γ ∈ co_q(k_1 ... k_N) }   (3)

This relationship connects Tikhonov regularization with ℓp norms over H to regularization over RKHSs parameterized by the kernel functions k_γ. This leads to a large family of multiple kernel learning algorithms (whose variants are also sometimes referred to as ℓq-MKL) where the basic idea is to solve an equivalent problem,

argmin_{f ∈ H_{k_γ}, γ ∈ Δ_q} ∑_{i=1}^ℓ V(y_i, f(x_i)) + λ ‖f‖²_{H_{k_γ}}   (4)

where Δ_q = { γ ∈ R^N : ‖γ‖_q = 1, γ_j ≥ 0, j = 1 ... N }. For a fixed γ, the optimization over f ∈ H_{k_γ} is recognizable as an RKHS problem for which a standard black-box solver may be used. The weights γ may then be optimized in an alternating minimization scheme, although several other optimization procedures can also be used (see, e.g., [4]). The case where p = 1 is of particular interest in the setting when the size of the RKHS dictionary is large but the unknown target function can be approximated in a much smaller number of RKHSs. This leads to a large family of sparse multiple kernel learning algorithms that have a strong connection to the Group Lasso [2, 20, 29].

3 Multiple Kernel Learning with Group Orthogonal Matching Pursuit

Let us recall the ℓ0 pseudo-norm, which is the cardinality of the sparsest representation of f in the dictionary: ‖f‖_{ℓ0(H)} = min{ |J| : f = ∑_{j∈J} f_j }. We now pose the following exact sparse kernel selection problem,

argmin_{f ∈ H} ∑_{i=1}^ℓ V(y_i, f(x_i)) + λ ‖f‖²_{ℓ2(H)}   subject to   ‖f‖_{ℓ0(H)} ≤ s   (5)

It is important to note the following: when using a dictionary of universal kernels, e.g., Gaussian kernels with different bandwidths, the presence of the regularization term ‖f‖²_{ℓ2(H)} is critical (i.e., λ > 0), since otherwise the labeled data can be perfectly fit by any single kernel. In other words, the kernel selection problem is ill-posed. While conceptually simple, our formulation is quite different from those proposed earlier, since the role of a smoothness regularizer (via the ‖f‖²_{ℓ2(H)} penalty) is decoupled from the role of a sparsity regularizer (via the constraint ‖f‖_{ℓ0(H)} ≤ s). Moreover, the latter is imposed directly, as opposed to through a p = 1 penalty, making the spirit of our approach closer to Group Orthogonal Matching Pursuit (Group-OMP [16]), where groups are formed by very high-dimensional (infinite for Gaussian kernels) feature spaces associated with the kernels. It has been observed in recent work [10, 29] on ℓ1-MKL that sparsity alone does not lead to improvements in real-world empirical tasks, and hence several methods have been proposed to explore ℓq-norm MKL with q > 1 in Eqn. 4, making MKL depart from sparsity in kernel combinations. By contrast, we note that as q → ∞, p → 2. Our approach gives a direct knob both on smoothness (via λ) and sparsity (via s), with a solution path along these dimensions that differs from that offered by Group-Lasso based ℓq-MKL as q is varied. By combining the ℓ0 pseudo-norm with RKHS norms, our method is conceptually reminiscent of the elastic net [32] (also see [26, 12, 21]). If kernels arise from different subsets of input variables, our approach is also related to sparse additive models [18]. Our algorithm, MKL-GOMP, is outlined below for regularized least squares. Extensions for other loss functions, e.g., the hinge loss for SVMs, can also be similarly derived.
In the description of the algorithm, our notation is as follows: for any function f belonging to an RKHS F_k with kernel function k(·,·), we denote the regularized objective function as R_λ(f, y) = ∑_{i=1}^ℓ (y_i − f(x_i))² + λ ‖f‖²_{F_k}, where ‖·‖_{F_k} denotes the RKHS norm.

Recall that the minimizer f* = argmin_{f ∈ F_k} R_λ(f, y) is given by solving the linear system α = (K + λI)^{−1} y, where K is the Gram matrix of the kernel on the labeled data, and by setting f*(x) = ∑_{i=1}^ℓ α_i k(x, x_i). Moreover, the objective value achieved by the minimizer is R_λ(f*, y) = λ yᵀ(K + λI)^{−1} y. Note that MKL-GOMP should not be confused with Kernel Matching Pursuit [28], whose goal is different: it is designed to sparsify α in a single-kernel setting.

The MKL-GOMP procedure iteratively expands the hypothesis space, H_{G^(1)} ⊂ H_{G^(2)} ⊂ ... ⊂ H_{G^(i)}, by greedily selecting kernels from a given dictionary, where G^(i) ⊂ {1 ... N} is a subset of indices and H_G = ⊕_{j∈G} H_j. Note that each H_G is an RKHS with kernel k_G = ∑_{j∈G} k_j (see Section 6 in [1]). The selection criterion is the best improvement, I(f^(i), H_j), given by a new hypothesis space H_j in reducing the norm of the current residual r^(i) = y − f^(i), where f^(i) = [f^(i)(x_1) ... f^(i)(x_ℓ)]ᵀ, by finding the best regularized (smooth) approximation. Note that since min_{g∈H_j} R_λ(g, r) ≤ R_λ(0, r) = ‖r‖²₂, the value of the improvement function,

I(f^(i), H_j) = ‖r^(i)‖²₂ − min_{g∈H_j} R_λ(g, r^(i))

is always non-negative. Once a kernel is selected, the function is re-estimated by learning in H_{G^(i)}. Note that since H_G is an RKHS whose kernel function is the sum ∑_{j∈G} k_j, we can use a simple RLS linear system solver for refitting. Unlike Group-Lasso based MKL, we do not need an iterative kernel reweighting step, which essentially arises as a mechanism to transform the less convenient group sparsity norms into reweighted squared RKHS norms. MKL-GOMP converges when the best improvement is no better than ε (or, in practice, if a maximum allowed number s of kernels has been selected).

Input: data matrix X = [x_1 ... x_ℓ]ᵀ, label vector y ∈ R^ℓ, kernel dictionary {k_j(·,·)}_{j=1}^N, precision ε > 0, maximum sparsity s
Output: selected kernels G^(i) and a function f^(i) ∈ H_{G^(i)}
Initialization: G^(0) = ∅, f^(0) = 0, set residual r^(0) = y
for i = 0, 1, 2, ..., s
1. Kernel Selection: For all j ∉ G^(i), set I(f^(i), H_j) = ‖r^(i)‖²₂ − min_{g∈H_j} R_λ(g, r^(i)) = r^(i)ᵀ(I − λ(K_j + λI)^{−1}) r^(i). Pick j^(i) = argmax_{j∉G^(i)} I(f^(i), H_j).
2. Convergence Check: if I(f^(i), H_{j^(i)}) ≤ ε, return f^(i). end
3. Refitting: Set G^(i+1) = G^(i) ∪ {j^(i)}. Set f^(i+1)(x) = ∑_{j=1}^ℓ α_j k(x, x_j), where k = ∑_{j∈G^(i+1)} k_j and α = ( ∑_{j∈G^(i+1)} K_j + λI )^{−1} y.
4. Update Residual: r^(i+1) = y − f^(i+1), where f^(i+1) = [f^(i+1)(x_1) ... f^(i+1)(x_ℓ)]ᵀ.

Remarks: Note that our algorithm can be applied to multivariate problems with group structure among outputs, similar to Multivariate Group-OMP [15]. In particular, in our experiments on multiclass datasets, we treat all outputs as a single group and evaluate each kernel for selection based on how well the total residual is reduced across all outputs simultaneously. Kernel matrices are normalized to unit trace or to have uniform variance of data points in their associated feature spaces, as in [10, 33]. In practice, we can also monitor error on a validation set to decide the optimal degree of sparsity. For efficiency, we can precompute the matrices Q_j = (I − λ(K_j + λI)^{−1})^{1/2}, so that I(f^(i), H_j) = ‖Q_j r^(i)‖²₂ can be very quickly evaluated at selection time, and/or reduce the search space by considering a random subsample of the dictionary.
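The following is a compact sketch of the MKL-GOMP loop above for regularized least squares, assuming precomputed Gram matrices. It is our own illustration, not the authors' code; the API (NumPy, function and variable names) is an assumption.

```python
import numpy as np

def mkl_gomp(Ks, y, lam, eps=1e-6, max_sparsity=None):
    """Sketch of MKL-GOMP for regularized least squares.

    Ks -- list of N precomputed ell x ell Gram matrices K_j
    y  -- target vector of length ell
    Returns the selected kernel indices G and the coefficients alpha
    for the combined kernel k = sum_{j in G} k_j.
    """
    ell, N = len(y), len(Ks)
    s = N if max_sparsity is None else max_sparsity
    I = np.eye(ell)
    # Precompute M_j = I - lam*(K_j + lam*I)^{-1}; the improvement of kernel j
    # on residual r is then I(f, H_j) = r^T M_j r  (equivalently ||Q_j r||^2).
    Ms = [I - lam * np.linalg.inv(K + lam * I) for K in Ks]

    G, alpha, r = [], np.zeros(ell), y.astype(float).copy()
    for _ in range(s):
        # 1. Kernel selection: best regularized reduction of the residual norm.
        gains = [-np.inf if j in G else float(r @ Ms[j] @ r) for j in range(N)]
        j_best = int(np.argmax(gains))
        # 2. Convergence check.
        if gains[j_best] <= eps:
            break
        # 3. Refitting in the sum-kernel RKHS H_G: alpha = (K_G + lam*I)^{-1} y.
        G.append(j_best)
        K_G = sum(Ks[j] for j in G)
        alpha = np.linalg.solve(K_G + lam * I, y)
        # 4. Residual update: the fitted values f^{(i+1)}(x_i) are K_G @ alpha.
        r = y - K_G @ alpha
    return G, alpha
```

As in the remarks above, a validation set can be monitored across iterations to pick the sparsity level, and the per-kernel matrices M_j need only be computed once.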

4 Theoretical Analysis

Our analysis is composed of two parts. In the first part, we establish generalization bounds for the hypothesis spaces considered by our formulation, based on the notion of Rademacher complexity. The second component of our theoretical analysis consists of deriving conditions under which MKL-GOMP can recover good solutions. While the first part can be seen as characterizing the statistical convergence of our method, the second part characterizes its numerical convergence as an optimization method, and is required to complement the first part. This is because matching pursuit methods can be deemed to solve an exact sparse problem approximately, while regularized methods (e.g., ℓ1-norm MKL) solve an approximate problem exactly. We therefore need to show that MKL-GOMP recovers a solution that is close to an optimum solution of the exact sparse problem.

4.1 Rademacher Bounds

Theorem 1. Consider the hypothesis space of sufficiently sparse and smooth functions,

H_{τ,s} = { f ∈ H : ‖f‖²_{ℓ2(H)} ≤ τ, ‖f‖_{ℓ0(H)} ≤ s }¹

Let δ ∈ (0,1) and κ = sup_{x∈X, j=1...N} k_j(x,x). Let ρ be any probability distribution on (x,y) ∈ X × R satisfying |y| ≤ M almost surely, and let {x_i, y_i}_{i=1}^ℓ be randomly sampled according to ρ. Define f̂ = argmin_{f∈H_{τ,s}} ∑_{i=1}^ℓ (y_i − f(x_i))² to be the empirical risk minimizer, and f* = argmin_{f∈H_{τ,s}} R(f) to be the true risk minimizer in H_{τ,s}, where R(f) = E_{(x,y)∼ρ} (y − f(x))² denotes the true risk. Then, with probability at least 1 − δ over random draws of samples of size ℓ,

R(f̂) ≤ R(f*) + 8L √(sκτ/ℓ) + 4L² √(log(3/δ)/(2ℓ))

where L = M + √(sκτ) bounds |y − f(x)|.

The proof is given in the supplementary material, but can also be reasoned as follows. In the standard single-RKHS case, the Rademacher complexity can be upper bounded by a quantity that is proportional to the square root of the trace of the Gram matrix, which is in turn bounded in terms of κ. In our case, any collection of s-sparse functions from a dictionary of N RKHSs reduces to a single RKHS whose kernel is the sum of s base kernels, and hence the corresponding trace can be bounded by sκℓ for all possible subsets of size s. Once it is established that the empirical Rademacher complexity of H_{τ,s} is upper bounded by √(sκτ/ℓ), the generalization bound follows from well-known results [6] tailored to regularized least squares regression with a bounded target variable. For ℓ1-norm MKL, in the context of margin-based loss functions, Cortes et al. (2010) [8] bound the Rademacher complexity as

√( c e ⌈log N⌉ κτ / ℓ )   (6)

where ⌈·⌉ is the ceiling function that rounds to the next integer, e is the base of the natural logarithm, and c = 23/22. Using VC-based lower-bound arguments, they point out that the log(N) dependence on N is essentially optimal. By contrast, our greedy approach with sequential regularized risk minimization imposes direct control over the degree of sparsity as well as smoothness, and hence the Rademacher complexity in our case is independent of N. If s = O(log N), the bounds are similar. A critical difference between ℓ1-norm MKL and sparse greedy approximations, however, is that the former is convex, and hence the empirical risk can be minimized exactly in the hypothesis space whose complexity is bounded by the Rademacher analysis. This is not true in our case, and therefore, to complement the Rademacher analysis, we need conditions under which good solutions can be recovered.
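To summarize the trace argument above in symbols (our compact restatement of the proof sketch, using the standard kernel-trace bound on the Rademacher complexity of an RKHS ball [6], and taking the worst support of size s):

```latex
\hat{\mathfrak{R}}_\ell(H_{\tau,s})
  \;\le\; \max_{|G|\le s}\; \frac{\sqrt{\tau}}{\ell}\,\sqrt{\operatorname{tr}(K_G)}
  \;\le\; \frac{\sqrt{\tau}}{\ell}\,\sqrt{s\kappa\ell}
  \;=\; \sqrt{\frac{s\kappa\tau}{\ell}},
\qquad K_G=\textstyle\sum_{j\in G}K_j,\quad \operatorname{tr}(K_G)\le s\kappa\ell .
```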
4.2 Exact Recovery Conditions in Noiseless Settings

We now assume that the regression function f_ρ(x) = ∫ y dρ(y|x) is sparse, i.e., f_ρ ∈ H_{G_good} for some subset G_good of s good kernels, and that it is sufficiently smooth in the sense that, for some λ > 0 and given sufficient samples, the empirical minimizer f̂ = argmin_{f∈H_{G_good}} R_λ(f, y) gives near-optimal generalization as per Theorem 1. In this section, our main concern is to characterize Group-OMP-like conditions under which MKL-GOMP will be able to learn f̂ by recovering the support G_good exactly. Recall our notation that k_{G_good} = ∑_{j∈G_good} k_j is the kernel associated with H_{G_good}.

¹Note that Tikhonov regularization, which uses a penalty term λ‖f‖²_{ℓ2(H)}, and Ivanov regularization, which uses a ball constraint ‖f‖²_{ℓ2(H)} ≤ τ, return identical solutions for some one-to-one correspondence between λ and τ.

Let us denote by r^(i) = f̂ − f^(i) the residual function at step i of the algorithm. Initially, r^(0) = f̂ ∈ H_{G_good}. In fact, by the Representer Theorem, r^(0) = f̂ ∈ Ĥ_{G_good} ⊂ H_{G_good}, where we use the notation Ĥ_{G_good} = span{ k_{G_good}(x_i, ·), i = 1 ... ℓ }. Our argument is inductive: if at any step i, r^(i) ∈ Ĥ_{G_good}, and, under this assumption, we can always guarantee that (a) max_{j∈G_good} I(f^(i), H_j) > max_{j∉G_good} I(f^(i), H_j), i.e., a good kernel offers better greedy improvement and is therefore selected, and (b) after refitting, the new residual r^(i+1) ∈ Ĥ_{G_good}, then by induction it is clear that the algorithm correctly expands the hypothesis space and never makes a mistake. Without loss of generality, let us rearrange the dictionary so that G_good = {1 ... s}. For any function f ∈ Ĥ_{G_good}, we now wish to derive the following upper bound,

max( I(f, H_{s+1}), ..., I(f, H_N) ) / max( I(f, H_1), ..., I(f, H_s) ) ≤ μ_H(G_good)²   (7)

Clearly, a sufficient condition for exact recovery is μ_H(G_good) < 1. We need some notation to state our main result. Let s = |G_good|, i.e., the number of good kernels. For any matrix A ∈ R^{sℓ×(N−s)ℓ}, let ‖A‖_{(2,1)} denote the matrix norm induced by the following vector norms: for any vector u = [u_1 ... u_s] ∈ R^{sℓ}, define ‖u‖_{(2,1)} = ∑_{i=1}^s ‖u_i‖₂; and similarly, for any vector v = [v_1 ... v_{N−s}] ∈ R^{(N−s)ℓ}, define ‖v‖_{(2,1)} = ∑_{i=1}^{N−s} ‖v_i‖₂. Then, ‖A‖_{(2,1)} = sup_{v∈R^{(N−s)ℓ}} ‖Av‖_{(2,1)} / ‖v‖_{(2,1)}. We can now state the following:

Theorem 2. Given the kernel dictionary {k_j(·,·)}_{j=1}^N with associated Gram matrices {K_j}_{j=1}^N over the labeled data, MKL-GOMP correctly recovers the good kernels, i.e., G^(s) = G_good, if

μ_H(G_good) = ‖C_{λ,H}(G_good)‖_{(2,1)} < 1

where C_{λ,H}(G_good) ∈ R^{sℓ×(N−s)ℓ} is a coherence matrix whose (i,j)-th block of size ℓ×ℓ, i ∈ G_good, j ∉ G_good, is given by,

C_{λ,H}(G_good)_{i,j} = K_{G_good} Q_i ( ∑_{k∈G_good} Q_k K²_{G_good} Q_k )^{−1} Q_j K_{G_good}   (8)

where K_{G_good} = ∑_{j∈G_good} K_j and Q_j = (I − λ(K_j + λI)^{−1})^{1/2}, j = 1 ... N.

The proof is given in the supplementary material. This result is analogous to sparse recovery conditions for OMP and ℓ1 methods, and their (linear) group counterparts. In the noiseless setting, Tropp [27] gives an exact recovery condition of the form ‖X⁺_good X_bad‖₁ < 1, where X_good and X_bad refer to the restriction of the data matrix to good and bad features, X⁺_good denotes the pseudo-inverse of X_good, and ‖·‖₁ refers to the induced matrix norm. Intriguingly, the same paper shows that this condition is also sufficient for the ℓ1 Basis Pursuit minimization problem. For Group-OMP [16], the condition generalizes to involve a group-sensitive matrix norm on the same matrix objects. Likewise, Bach [2] generalizes the Lasso variable selection consistency conditions to apply to Group Lasso, and then further to non-parametric ℓ1-MKL. The above result is similar in spirit. A stronger sufficient condition can be derived by requiring ‖Q_j K_{G_good}‖₂ to be sufficiently small for all j ∉ G_good. Intuitively, this means that smooth functions in H_{G_good} cannot be well approximated using smooth functions induced by the bad kernels, so that MKL-GOMP is never led to making a mistake.

5 Empirical Studies

We report empirical results on a collection of simulated datasets and 3 classification problems from computational cell biology. In all experiments, as in [10, 33], candidate kernels are normalized multiplicatively to have uniform variance of data points in their associated feature spaces.
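The multiplicative normalization just mentioned (unit variance of the data in each kernel's feature space) can be sketched as follows. This is our reading of the normalization used in [10, 33], not code from the paper, and the function name is hypothetical.

```python
import numpy as np

def normalize_kernel(K):
    """Rescale a Gram matrix so the data has unit variance in feature space.

    The feature-space variance of the mapped points is
        (1/ell) * trace(K) - (1/ell^2) * sum(K),
    i.e., the mean squared norm minus the squared norm of the mean.
    """
    ell = K.shape[0]
    variance = np.trace(K) / ell - K.sum() / ell**2
    return K / variance
```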

5.1 Adaptability to Data Sparsity - Simulated Setting

We adapt the experimental setting proposed by [10], where the sparsity of the target function is explicitly controlled, and the optimal subset of kernels is varied from requiring the entire dictionary to requiring a single kernel. Our goal is to study the solution paths offered by MKL-GOMP in comparison to ℓq-norm MKL. For consistency, we use squared loss in all experiments (ℓq-MKL with the SVM hinge loss behaves similarly). We implemented ℓq-norm MKL for regularized least squares (RLS) using an alternating minimization scheme adapted from [17, 29]. Different binary classification datasets (provided by the authors of [10] at mldata.org/repository/data/viewslug/mkl-toy/) with 50 labeled examples are randomly generated by sampling the two classes from 50-dimensional isotropic Gaussian distributions with equal covariance matrices (identity) and equal but opposite means, μ₁ = 1.75 θ/‖θ‖₂ and μ₂ = −μ₁, where θ is a binary vector encoding the true underlying sparsity. The fraction of zero components in θ is a measure of the feature sparsity of the learning problem. For each dataset, a linear kernel (normalized as in [10]) is generated from each feature, and the resulting dictionary is input to MKL-GOMP and ℓq-norm MKL. For each level of sparsity, a training set of size 50 and validation and test sets of size 10000 are generated 10 times, and average classification errors are reported. For each run, the validation error is monitored as kernel selection progresses in MKL-GOMP, and the number of kernels with smallest validation error is chosen. The regularization parameters for both MKL-GOMP and ℓq-norm MKL are similarly chosen using the validation set.

[Figure 1: Simulated setting - adaptability to data sparsity. Left: test error of ℓ1-, ℓ4/3-, ℓ2-, ℓ4- and ℓ∞-norm MKL (= RLS), MKL-GOMP, and the Bayes error, as functions of v(θ), the fraction of noise kernels (in %). Right: the percentage of kernels selected and the value of λ chosen by MKL-GOMP, as functions of v(θ).]

Figure 1 shows test error rates as a function of the sparsity of the target function: from non-sparse (all kernels needed) to extremely sparse (only 1 kernel needed). We recover the observations also made in [10]: ℓ1-norm MKL excels in extremely sparse settings where a single kernel carries the whole discriminative information of the learning problem. However, in the other scenarios it mostly performs worse than the other ℓq, q > 1, variants, despite the fact that the vector θ remains sparse in all but the uniform scenario. As q is increased, the error rate in these settings improves but deteriorates in sparse settings. As reported in [11], the elastic net MKL approach of [26] performs similarly to ℓ1-MKL in the hinge loss case. As can be seen in the figure, the error curve of MKL-GOMP tends to be below the lower envelope of the error rates given by ℓq-MKL solutions. To adapt to the sparsity of the problem, ℓq methods clearly need to tune q, requiring several fresh invocations of the appropriate ℓq-MKL solver. On the other hand, in MKL-GOMP the hypothesis space grows as a function of the iteration number, and the solution trajectory naturally expands sequentially in the direction of decreasing sparsity. The right plot in Figure 1 shows the number of kernels selected by MKL-GOMP and the optimal value of λ, suggesting that MKL-GOMP adapts to the sparsity and smoothness of the learning problem.
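As a concrete illustration of the data-generation recipe described at the start of this subsection, the sketch below produces one dataset and its per-feature linear kernels. The function name and the exact scaling of the means are our assumptions; the normalization of [10] would then be applied to each kernel.

```python
import numpy as np

def make_toy_data(n=50, d=50, n_relevant=10, rng=None):
    """Two isotropic Gaussian classes with opposite sparse means (a sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    theta = np.zeros(d)
    theta[:n_relevant] = 1.0                    # binary sparsity pattern
    mu = 1.75 * theta / np.linalg.norm(theta)   # assumed scaling, as in the text
    labels = rng.choice([-1, 1], size=n)
    # Class +1 is centered at +mu, class -1 at -mu, identity covariance.
    X = rng.standard_normal((n, d)) + np.outer(labels, mu)
    # One linear kernel per feature: K_j = x_j x_j^T.
    Ks = [np.outer(X[:, j], X[:, j]) for j in range(d)]
    return X, labels, Ks
```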
5.2 Protein Subcellular Localization

The multiclass generalization of ℓ1-MKL proposed in [33] (MCMKL) is state-of-the-art methodology in predicting protein subcellular localization, an important cell biology problem that concerns the estimation of where a protein resides in a cell so that, for example, the identification of drug targets can be aided. We use three multiclass datasets, PSORT+, PSORT- and PLANT, provided by the authors of [33], together with a dictionary of 69 kernels derived with biological insight: 2 kernels on phylogenetic trees, 3 kernels based on similarity to known proteins (BLAST E-values), and 64 kernels based on amino-acid sequence patterns.

[Figure 2: Protein subcellular localization results. Bars compare performance (higher is better) of mklgomp, mcmkl, sum, single, and other methods on psort+, psort- and plant.]

The statistics of the three datasets are as follows: PSORT+ has 541 proteins labeled with 4 location classes, PSORT- has 1444 proteins in 5 classes, and PLANT is a 4-class problem with 940 proteins. For each dataset, results are averaged over 10 splits of the dataset into training and test sets. We used exactly the same experimental protocol, data splits and evaluation methodology as given in [33]: the hyper-parameters of MKL-GOMP (sparsity and the regularization parameter λ) were tuned based on 3-fold cross-validation; results on PSORT+ and PSORT- are F-scores averaged over the classes, while those on PLANT are Matthews correlation coefficients. Figure 2 compares MKL-GOMP against MCMKL, baselines such as using the sum of all the kernels and using the best single kernel, and results from other prediction systems proposed in the literature. As can be seen, MKL-GOMP slightly outperforms MCMKL on the PSORT+ and PSORT- datasets and is slightly worse on PLANT, where RLS with the sum of all the kernels also performs very well. On the two PSORT datasets, [33] report selecting 25 kernels using MCMKL. On the other hand, on average, MKL-GOMP selects 14 kernels on PSORT+, 15 on PSORT- and 24 kernels on PLANT. Note that MKL-GOMP is applied in multivariate mode: the kernels are selected based on their utility in reducing the total residual error across all target classes.

6 Conclusion

By proposing a Group-OMP based framework for sparse multiple kernel learning, analyzing theoretically the performance of the resulting methods in relation to the dominant convex relaxation-based approach, and demonstrating the value of our framework through extensive experimental studies, we believe greedy methods arise as a natural alternative for tackling MKL problems. Relevant directions for future research include extending our theoretical analysis to the stochastic setting, investigating complex multivariate structures and groupings over outputs, e.g., by generalizing the multivariate version of Group-OMP [15], and extending our algorithm to incorporate interesting structured kernel dictionaries [3].

Acknowledgments: We thank Rick Lawrence, Ha Quang Minh and David Rosenberg for insightful conversations and enthusiastic support for this work.

References

[1] N. Aronszajn. Theory of reproducing kernel Hilbert spaces. Transactions of the American Mathematical Society, 68(3):337-404, 1950.
[2] F. Bach. Consistency of the group lasso and multiple kernel learning. JMLR, 9:1179-1225, 2008.
[3] F. Bach. High-dimensional non-linear variable selection through hierarchical kernel learning. Technical report, HAL, 2009.
[4] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Technical report, HAL, 2011.

[5] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
[6] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.
[7] A. Ben-Hur and W. S. Noble. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21, January 2005.
[8] C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In ICML, 2010.
[9] A. K. Fletcher and S. Rangan. Orthogonal matching pursuit from noisy measurements: A new analysis. In NIPS, 2009.
[10] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. lp-norm multiple kernel learning. JMLR, 12:953-997, 2011.
[11] M. Kloft, U. Rückert, and P. Bartlett. A unifying view of multiple kernel learning. In European Conference on Machine Learning (ECML), 2010.
[12] V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. The Annals of Statistics, 38(6):3660-3695, 2010.
[13] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res., 5:27-72, December 2004.
[14] G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20, November 2004.
[15] A. C. Lozano and V. Sindhwani. Block variable selection in multivariate regression and high-dimensional causal inference. In NIPS, 2010.
[16] A. C. Lozano, G. Swirszcz, and N. Abe. Group orthogonal matching pursuit for variable selection and prediction. In NIPS, 2009.
[17] C. Micchelli and M. Pontil. Learning the kernel function via regularization. JMLR, 6:1099-1125, 2005.
[18] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (JRSSB), 71(5):1009-1030, 2009.
[19] P. Pavlidis, J. Cai, J. Weston, and W. S. Noble. Learning gene functional classifications from multiple data types. Journal of Computational Biology, 9:401-411, 2002.
[20] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491-2521, 2008.
[21] G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Technical Report 795, Statistics Department, UC Berkeley, 2010.
[22] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[23] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[24] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. J. Mach. Learn. Res., 7, December 2006.
[25] T. Zhang. Sparse recovery with orthogonal matching pursuit under RIP. Computing Research Repository, 2010.
[26] R. Tomioka and T. Suzuki. Sparsity-accuracy trade-off in MKL. In NIPS Workshop: Understanding Multiple Kernel Learning Methods. Technical report, arXiv:1001.2615v1, 2010.
[27] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inform. Theory, 50(10):2231-2242, 2004.
[28] P. Vincent and Y. Bengio. Kernel matching pursuit. Machine Learning, 48:165-188, 2002.
[29] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu. Simple and efficient multiple kernel learning by group lasso. In ICML, 2010.
[30] M. Yuan, A. Ekici, Z. Lu, and R. Monteiro. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society, Series B, 69(3):329-346, 2007.
[31] T. Zhang. On the consistency of feature selection using greedy least squares regression. J. Mach. Learn. Res., 10, June 2009.
[32] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301-320, 2005.
[33] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In ICML, 2007.


More information

Lasso and probabilistic inequalities for multivariate point processes

Lasso and probabilistic inequalities for multivariate point processes Submitted to the Bernoui arxiv: arxiv:128.57 Lasso and probabiistic inequaities for mutivariate point processes NIELS RICHARD HANSEN 1, PATRICIA REYNAUD-BOURET 2 and VINCENT RIVOIRARD 3 1 Department of

More information

arxiv: v2 [stat.ml] 17 Mar 2015

arxiv: v2 [stat.ml] 17 Mar 2015 Journa of Machine Learning Research 1x (201x) x-xx Submitted x/0x; Pubished x/0x A Unifying Framework in Vector-vaued Reproducing Kerne Hibert Spaces for Manifod Reguarization and Co-Reguarized Muti-view

More information

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems Bayesian Unscented Kaman Fiter for State Estimation of Noninear and Non-aussian Systems Zhong Liu, Shing-Chow Chan, Ho-Chun Wu and iafei Wu Department of Eectrica and Eectronic Engineering, he University

More information

Another Look at Linear Programming for Feature Selection via Methods of Regularization 1

Another Look at Linear Programming for Feature Selection via Methods of Regularization 1 Another Look at Linear Programming for Feature Seection via Methods of Reguarization Yonggang Yao, The Ohio State University Yoonkyung Lee, The Ohio State University Technica Report No. 800 November, 2007

More information

Adaptive Localization in a Dynamic WiFi Environment Through Multi-view Learning

Adaptive Localization in a Dynamic WiFi Environment Through Multi-view Learning daptive Locaization in a Dynamic WiFi Environment Through Muti-view Learning Sinno Jiain Pan, James T. Kwok, Qiang Yang, and Jeffrey Junfeng Pan Department of Computer Science and Engineering Hong Kong

More information

On the Goal Value of a Boolean Function

On the Goal Value of a Boolean Function On the Goa Vaue of a Booean Function Eric Bach Dept. of CS University of Wisconsin 1210 W. Dayton St. Madison, WI 53706 Lisa Heerstein Dept of CSE NYU Schoo of Engineering 2 Metrotech Center, 10th Foor

More information

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems Consistent inguistic fuzzy preference reation with muti-granuar uncertain inguistic information for soving decision making probems Siti mnah Binti Mohd Ridzuan, and Daud Mohamad Citation: IP Conference

More information

arxiv: v2 [cs.lg] 4 Sep 2014

arxiv: v2 [cs.lg] 4 Sep 2014 Cassification with Sparse Overapping Groups Nikhi S. Rao Robert D. Nowak Department of Eectrica and Computer Engineering University of Wisconsin-Madison nrao2@wisc.edu nowak@ece.wisc.edu ariv:1402.4512v2

More information

arxiv: v2 [stat.ml] 19 Oct 2016

arxiv: v2 [stat.ml] 19 Oct 2016 Sparse Quadratic Discriminant Anaysis and Community Bayes arxiv:1407.4543v2 [stat.ml] 19 Oct 2016 Ya Le Department of Statistics Stanford University ye@stanford.edu Abstract Trevor Hastie Department of

More information

The Binary Space Partitioning-Tree Process Supplementary Material

The Binary Space Partitioning-Tree Process Supplementary Material The inary Space Partitioning-Tree Process Suppementary Materia Xuhui Fan in Li Scott. Sisson Schoo of omputer Science Fudan University ibin@fudan.edu.cn Schoo of Mathematics and Statistics University of

More information

Distributed average consensus: Beyond the realm of linearity

Distributed average consensus: Beyond the realm of linearity Distributed average consensus: Beyond the ream of inearity Usman A. Khan, Soummya Kar, and José M. F. Moura Department of Eectrica and Computer Engineering Carnegie Meon University 5 Forbes Ave, Pittsburgh,

More information

Learning Gaussian Processes from Multiple Tasks

Learning Gaussian Processes from Multiple Tasks Kai Yu kai.yu@siemens.com Information and Communication, Corporate Technoogy, Siemens AG, Munich, Germany Voker Tresp voker.tresp@siemens.com Information and Communication, Corporate Technoogy, Siemens

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

On a geometrical approach in contact mechanics

On a geometrical approach in contact mechanics Institut für Mechanik On a geometrica approach in contact mechanics Aexander Konyukhov, Kar Schweizerhof Universität Karsruhe, Institut für Mechanik Institut für Mechanik Kaiserstr. 12, Geb. 20.30 76128

More information

Efficient Similarity Search across Top-k Lists under the Kendall s Tau Distance

Efficient Similarity Search across Top-k Lists under the Kendall s Tau Distance Efficient Simiarity Search across Top-k Lists under the Kenda s Tau Distance Koninika Pa TU Kaisersautern Kaisersautern, Germany pa@cs.uni-k.de Sebastian Miche TU Kaisersautern Kaisersautern, Germany smiche@cs.uni-k.de

More information

Turbo Codes. Coding and Communication Laboratory. Dept. of Electrical Engineering, National Chung Hsing University

Turbo Codes. Coding and Communication Laboratory. Dept. of Electrical Engineering, National Chung Hsing University Turbo Codes Coding and Communication Laboratory Dept. of Eectrica Engineering, Nationa Chung Hsing University Turbo codes 1 Chapter 12: Turbo Codes 1. Introduction 2. Turbo code encoder 3. Design of intereaver

More information

A Novel Learning Method for Elman Neural Network Using Local Search

A Novel Learning Method for Elman Neural Network Using Local Search Neura Information Processing Letters and Reviews Vo. 11, No. 8, August 2007 LETTER A Nove Learning Method for Eman Neura Networ Using Loca Search Facuty of Engineering, Toyama University, Gofuu 3190 Toyama

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

Statistical Inference, Econometric Analysis and Matrix Algebra

Statistical Inference, Econometric Analysis and Matrix Algebra Statistica Inference, Econometric Anaysis and Matrix Agebra Bernhard Schipp Water Krämer Editors Statistica Inference, Econometric Anaysis and Matrix Agebra Festschrift in Honour of Götz Trenker Physica-Verag

More information