Factorized Sparse Learning Models with Interpretable High Order Feature Interactions
Sanjay Purushotham, University of Southern California, Los Angeles, CA, USA
Martin Renqiang Min, NEC Labs America, Princeton, NJ, USA
Rachel Ostroff, SomaLogic, Inc., Boulder, CO, USA
C.-C. Jay Kuo, University of Southern California, Los Angeles, CA, USA

ABSTRACT

Identifying interpretable discriminative high-order feature interactions given limited training data in high dimensions is challenging in both machine learning and data mining. In this paper, we propose a factorization based sparse learning framework termed FHIM for identifying high-order feature interactions in linear and logistic regression models, and study several optimization methods for solving them. Unlike previous sparse learning methods, our model FHIM recovers both the main effects and the interaction terms accurately without imposing tree-structured hierarchical constraints. Furthermore, we show that FHIM has oracle properties when extended to generalized linear regression models with pairwise interactions. Experiments on simulated data show that FHIM outperforms the state-of-the-art sparse learning techniques. Further experiments on our experimentally generated data from patient blood samples using a novel SOMAmer (Slow Off-rate Modified Aptamer) technology show that FHIM performs blood-based cancer diagnosis and biomarker discovery for Renal Cell Carcinoma much better than other competing methods, and it identifies interpretable block-wise high-order gene interactions predictive of the cancer stages of samples. A literature survey shows that the interactions identified by FHIM play important roles in cancer development.
Categories and Subject Descriptors

J.3 [Computer Applications]: Life and Medical Sciences; I.2.6 [Computing Methodologies]: Artificial Intelligence - sparse learning, feature selection

Co-first author. Co-first author, corresponding author. To whom data requests should be sent.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. KDD '14, August 24-27, 2014, New York, NY, USA. Copyright 2014 ACM.

Keywords

Sparse Learning; High-order Interactions; Biomarker Discovery; Blood-based Cancer Diagnosis

1. INTRODUCTION

Identifying interpretable high-order feature interactions is an important problem in machine learning, data mining, and biomedical informatics, because feature interactions often help reveal some hidden domain knowledge and the structure of the problems under consideration. For example, genes and proteins seldom perform their functions independently, so many human diseases are often manifested as the dysfunction of some pathways or functional gene modules, and the disrupted patterns due to diseases are often more obvious at a pathway or module level. Identifying these disrupted gene interactions for different diseases such as cancer will help us understand the underlying mechanisms of the diseases and develop effective drugs to cure them.
However, identifying reliable discriminative high-order gene/protein or SNP interactions for accurate disease diagnosis, such as early cancer diagnosis directly based on patient blood samples, is still a challenging problem, because we often have very limited patient samples but a huge number of complex feature interactions to consider. In this paper, we propose a sparse learning framework based on weight matrix factorizations and regularizations for identifying discriminative high-order feature interactions in linear and logistic regression models, and we study several optimization methods for solving them. Experimental results on synthetic and real-world datasets show that our method outperforms the state-of-the-art sparse learning techniques, and it provides interpretable block-wise high-order interactions for disease status prediction. Our proposed sparse learning framework is general, and can be used to identify any discriminative complex system input interactions that are predictive of system outputs given limited high-dimensional training data. Our contributions are as follows: (1) We propose a method capable of simultaneously identifying both informative single
discriminative features and discriminative block-wise high-order interactions in a sparse learning framework, which can be easily extended to handle arbitrarily high-order feature interactions; (2) Our method works on high-dimensional input feature spaces and ill-posed problems with many more features than data points, which is typical for biomedical applications such as biomarker discovery and cancer diagnosis; (3) Our method has interesting theoretical properties for generalized linear regression models; (4) The interactions identified by our method lead to biomedical insight into understanding blood-based cancer diagnosis.

2. RELATED WORK

Variable selection has been a well studied topic in the statistics, machine learning, and data mining literature. Generally, variable selection approaches focus on identifying discriminative features using regularization techniques. Most recent methods focus on identifying discriminative features or groups of discriminative features based on the Lasso penalty [8], Group Lasso [2], Trace-norm [6], Dirty model [8] and Support Vector Machines (SVMs) [6]. A recent approach [20] heuristically adds some possible high-order interactions into the input feature set in a greedy way based on lasso penalized logistic regression. Some recent approaches [2], [3] enforce strong and/or weak heredity constraints to recover the pairwise interactions in linear regression models. In strong heredity, an interaction term can be included in the model only if the corresponding main terms are also included in the model, while in weak heredity, an interaction term is included when either of the main terms is included in the model. However, recent studies in bioinformatics have shown that feature interactions need not follow heredity constraints for the manifestation of diseases, and thus the above approaches [2], [3] have limited chance of recovering relevant interactions.
Kernel methods such as Gaussian Processes [4] and Multiple Kernel Learning [10] can be used to model high-order feature interactions, but they can only tell which orders are important. Thus, all these previous approaches either failed to identify specific high-order interactions for prediction or identified sporadic pairwise interactions in a greedy way, which is very unlikely to recover the interpretable block-wise high-order interactions among features in different sub-components (for example, pathways or gene functional modules) of systems. Recently, [4] proposed an efficient way to identify combinatorial interactions among interactive genes in complex diseases by using overlapping group lasso and screening. However, they use prior information such as gene ontology in their approach, which is generally not available or difficult to collect for some machine learning problems. Thus, there is a need to develop new efficient techniques to automatically capture the important block-wise high-order feature interactions in regression models, which is the focus of this paper.

The remainder of the paper is organized as follows: in Section 3 we discuss our problem formulation and the relevant notation used in the paper. In Section 4, we discuss the main idea of our approach, and in Section 5 we give an overview of the theoretical properties associated with our method. In Section 6, we present the optimization methods which we use to solve our optimization problem. In Section 7, we discuss our experimental setup and present our results on synthetic and real datasets. Finally, in Section 8 we conclude the paper with discussions and future research directions.

3. PROBLEM FORMULATION

Consider a regression setup with a training set of n samples and p features, {(X^{(i)}, y^{(i)})}, where X^{(i)} ∈ R^p is the i-th instance (column) of the design matrix X (p × n), y^{(i)} ∈ R is the i-th instance of the response variable y (n × 1), and i = 1, ..., n.
To model the response in terms of the predictors, we can set up a linear regression model or a logistic regression model:

$$y^{(i)} = \beta^T X^{(i)} + \epsilon^{(i)}, \qquad (1)$$

$$p(y^{(i)} = 1 \mid X^{(i)}) = \frac{1}{1 + \exp(-\beta^T X^{(i)} - \beta_0)}, \qquad (2)$$

where β ∈ R^p is the weight vector associated with single features (also called main effects), ε ∈ R^n is a noise vector, and β_0 ∈ R is the bias term. In many practical fields such as bioinformatics and medical informatics, the main terms (the terms only involving single features) are not enough to capture the complex relationship between the response and the predictors, and thus high-order interactions are necessary. In this paper, we consider regression models with both main effects and high-order interaction terms. Equation (3) shows a linear regression model with pairwise interaction terms:

$$y^{(i)} = \beta^T X^{(i)} + X^{(i)T} W X^{(i)} + \epsilon^{(i)}, \qquad (3)$$

where W (p × p) is the weight matrix associated with the pairwise feature interactions. The corresponding loss function (the sum of squared errors) is as follows (we center the data to avoid an additional bias term):

$$L_{sqerr}(\beta, W) = \frac{1}{2} \sum_{i=1}^{n} \left\| y^{(i)} - \beta^T X^{(i)} - X^{(i)T} W X^{(i)} \right\|_2^2. \qquad (4)$$

We can similarly write the logistic regression model with pairwise interactions as follows:

$$p(y^{(i)} \mid X^{(i)}) = \frac{1}{1 + \exp\left(-y^{(i)}(\beta^T X^{(i)} + X^{(i)T} W X^{(i)} + \beta_0)\right)}, \qquad (5)$$

and the corresponding loss function (the sum of the negative log-likelihood of the training data) is

$$L_{logistic}(\beta, W, \beta_0) = \sum_{i=1}^{n} \log\left(1 + \exp\left(-y^{(i)}(\beta^T X^{(i)} + X^{(i)T} W X^{(i)} + \beta_0)\right)\right). \qquad (6)$$

4. OUR APPROACH

In this section, we propose an optimization-driven sparse learning framework to identify discriminative single features and groups of high-order interactions among input features for output prediction in the setting of limited training data. When the number of input features is huge (e.g., in biomedical applications), it is practically impossible to explicitly consider quadratic or even higher-order interactions among all the input features based on simple lasso penalized linear regression or logistic regression.
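To make the pairwise-interaction model of Equation (3) concrete, here is a small numerical sketch (ours, not from the paper; all values are illustrative):

```python
import numpy as np

def predict_pairwise(x, beta, W):
    """Eq. (3) without noise: y = beta^T x + x^T W x."""
    return beta @ x + x @ W @ x

beta = np.array([1.0, 0.0, -2.0])   # main effects
W = np.array([[0.0, 0.5, 0.0],      # a single pairwise interaction
              [0.5, 0.0, 0.0],      # between features 0 and 1
              [0.0, 0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0])
print(predict_pairwise(x, beta, W))  # -3.0  (main effects -5, interaction +2)
```

Note that a dense symmetric W already costs p(p + 1)/2 parameters, which motivates the factorization introduced next.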
To solve this problem, we propose to factorize the weight matrix W associated with high-order interactions between input features as a sum of K rank-one matrices for pairwise interactions, or as a sum of low-rank high-order tensors for higher-order interactions.
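The computational payoff of this factorization can be seen directly: with W = Σ_k a_k a_kᵀ, the quadratic term xᵀWx equals Σ_k (a_kᵀx)², so the p × p matrix W never needs to be materialized. A small sketch of this identity (ours; shapes are illustrative):

```python
import numpy as np

def interaction_term(x, A):
    """x^T W x with W = sum_k a_k a_k^T, computed as sum_k (a_k^T x)^2.
    A is (K, p): row k holds the factor a_k, so only K*p parameters
    are stored instead of the p(p+1)/2 of a full symmetric W."""
    return float(np.sum((A @ x) ** 2))

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 4))   # K = 2 rank-one factors, p = 4 features
x = rng.normal(size=4)
W = A.T @ A                   # explicit W = sum_k a_k a_k^T, for comparison only
assert np.isclose(x @ W @ x, interaction_term(x, A))
```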
Each rank-one matrix for pairwise feature interactions is represented by an outer product of two identical vectors, and each m-order (m > 2) tensor is represented by the outer product of m identical vectors. Besides minimizing the loss function of linear regression or logistic regression, we penalize the norm of both the weights associated with single input features and the weights associated with high-order feature interactions. Mathematically, we solve the following optimization problem to identify the discriminative single and pairwise interaction features:

$$\{\hat\beta, \hat a_k\} = \arg\min_{a_k, \beta} \; L_{sqerr}(\beta, W) + \lambda_\beta \|\beta\|_1 + \sum_{k=1}^{K} \lambda_{a_k} \|a_k\|_1, \qquad (7)$$

where $W = \sum_{k=1}^{K} a_k \otimes a_k$, ⊗ represents the tensor (outer) product, and β̂, â_k represent the estimated parameters of our model; let Q represent the objective function of (7). For logistic regression, we replace L_sqerr(β, W) in (7) by L_logistic(β, W, β_0). We call our model the Factorization-based High-order Interaction Model (FHIM).

Proposition 4.1. The optimization problem in Equation (7) is convex in β and non-convex in a_k.

Because of the non-convexity of our optimization problem, it is difficult to propose optimization algorithms which guarantee convergence to the global optimum. Here, we adopt a greedy alternating optimization method to find a local optimum for our problem. In the case of pairwise interactions, fixing the other weights, we solve for one rank-one weight matrix at a time. Please note that our symmetric positive semidefinite factorization of W makes this sub-optimization problem very easy. Moreover, for a particular rank-one weight matrix a_k ⊗ a_k, the nonzero entries of the corresponding vector a_k can be interpreted as the block-wise interaction feature indices of a densely interacting feature group. In the case of higher-order interactions, the optimization procedure is similar to the one for pairwise interactions except that we have more rounds of alternating optimization.
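A minimal sketch of this greedy alternating scheme (ours: the exact coordinate-wise sub-problem solvers are replaced here by proximal-gradient steps with backtracking, and all names, step sizes and stopping rules are illustrative, not the paper's implementation):

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def _obj(X, y, beta, factors, lam_beta, lam_a):
    r = y - X @ beta - sum((X @ f) ** 2 for f in factors)
    pen = lam_beta * np.abs(beta).sum() + lam_a * sum(np.abs(f).sum() for f in factors)
    return 0.5 * r @ r + pen

def fit_fhim_greedy(X, y, lam_beta=0.1, lam_a=0.1, max_K=5, n_iter=50):
    """Greedily add rank-one factors a_k a_k^T; alternate proximal-gradient
    updates on beta and the newest a_k; stop when the newest factor is zero.
    Backtracking guarantees each step never increases the objective."""
    n, p = X.shape
    beta, factors = np.zeros(p), []
    while len(factors) < max_K:
        factors.append(np.ones(p))            # new factor initialized to 1
        for _ in range(n_iter):
            # proximal step on beta, with backtracking line search
            r = y - X @ beta - sum((X @ f) ** 2 for f in factors)
            g, t = -X.T @ r, 1.0
            f0 = _obj(X, y, beta, factors, lam_beta, lam_a)
            while t > 1e-12:
                cand = soft(beta - t * g, t * lam_beta)
                if _obj(X, y, cand, factors, lam_beta, lam_a) <= f0:
                    beta = cand
                    break
                t *= 0.5
            # proximal step on the newest factor a, with backtracking
            a = factors[-1]
            r = y - X @ beta - sum((X @ f) ** 2 for f in factors)
            g, t = -2.0 * X.T @ (r * (X @ a)), 1.0
            f0 = _obj(X, y, beta, factors, lam_beta, lam_a)
            while t > 1e-12:
                factors[-1] = soft(a - t * g, t * lam_a)
                if _obj(X, y, beta, factors, lam_beta, lam_a) <= f0:
                    break
                factors[-1] = a
                t *= 0.5
        if np.allclose(factors[-1], 0.0):     # newest a_K shrank to zero: stop
            factors.pop()
            break
    return beta, factors

# tiny demo on data with one true rank-one interaction factor
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
a_true = np.array([1., 1., 0., 0., 0., 0.])
y = X @ np.array([1., 0., 0., 0., 0., -1.]) + (X @ a_true) ** 2
beta_hat, F = fit_fhim_greedy(X, y, lam_beta=0.5, lam_a=0.5, max_K=3)
```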
The parameter K of W is generally unknown in real datasets; thus, we greedily estimate K during the alternating optimization algorithm. In fact, the combination of our factorization formulation and the greedy algorithm is effective for estimating the interaction weight matrix W. β is re-estimated when K is greedily incremented during the alternating optimization, as shown in Algorithm 1.

Algorithm 1 Greedy Alternating Optimization
1: Initialize β to 0, K = 1 and a_K = 1
2: While (K == 1) or (a_K ≠ 0 for K > 1)
3:   Repeat until convergence
4:     β_j^t = arg min_{β_j} Q((β_1^t, ..., β_{j-1}^t, β_j, β_{j+1}^{t-1}, ..., β_p^{t-1}), a_k^{t-1})
5:     a_{k,j}^t = arg min_{a_{k,j}} Q((a_{k,1}^t, ..., a_{k,j-1}^t, a_{k,j}, a_{k,j+1}^{t-1}, ..., a_{k,p}^{t-1}), β^t)
6:   End Repeat
7:   K = K + 1; a_K = 1
8: End While
9: Remove a_K from a

5. THEORETICAL PROPERTIES

In this section, we study the asymptotic behavior of FHIM for likelihood-based generalized linear regression models. The lemmas and theorems proved here are similar to the ones shown in [3]. However, in that paper the authors make a strong heredity assumption (i.e., interaction term coefficients are dependent on the main effects), which is not assumed in our model, since we are interested in identifying all high-order interactions irrespective of heredity constraints. Here, we discuss the asymptotic properties w.r.t. the main effects and the factorized coefficients.

Problem Setup: Assume that the data V_i = (X_i, y_i), i = 1, ..., n are collected independently and Y_i has a density f(g(X_i), y_i) conditioned on X_i, where g is a known regression function with main effects and all possible pairwise interactions. Let β*_j and a*_{k,j} denote the underlying true parameters satisfying the block-wise properties implied by our factorization. Let Q_n(θ) denote the objective with the negative log-likelihood and θ = (β^T, α^T)^T, where α = (a_1^T, ..., a_K^T)^T.
We consider the estimates for FHIM as θ̂_n:

$$\hat\theta_n = \arg\min_\theta Q_n(\theta) = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(g(X_i), y_i) + \lambda_\beta \|\beta\|_1 + \sum_k \lambda_{\alpha_k} \|\alpha_k\|_1, \qquad (8)$$

where L(g(X_i), y_i) is the loss function of generalized linear regression models with pairwise interactions. In the case of linear regression, g(·) takes the form of Equation (3) without the noise term ε, and L(·) takes the form of Equation (4). Now, let us define

$$A_1 = \{j : \beta_j^* \neq 0\}, \quad A_2 = \{(k, l) : \alpha_{k,l}^* \neq 0\}, \quad A = A_1 \cup A_2, \qquad (9)$$

where A_1 contains the indices of the main terms whose true coefficients are nonzero, and similarly A_2 contains the indices of the factorized interaction terms whose true coefficients are nonzero. Let us define

$$a_n = \max\{\lambda_{\beta_j}, \lambda_{\alpha_{k,l}} : j \in A_1, (k, l) \in A_2\},$$
$$b_n = \min\{\lambda_{\beta_j}, \lambda_{\alpha_{k,l}} : j \in A_1^c, (k, l) \in A_2^c\}. \qquad (10)$$

Now, we show that our model possesses the oracle properties for (i) n → ∞ with fixed p, and (ii) p_n → ∞ as n → ∞, under some regularity conditions. Please refer to the Appendix for the proofs of the lemmas and theorems of Sections 5.1 and 5.2.

5.1 Asymptotic Oracle Properties when n → ∞

The asymptotic properties when the sample size increases and the number of predictors is fixed are described in the following lemmas and theorems. FHIM possesses oracle properties [3] under the regularity conditions (C1)-(C3) shown below. Let Ω denote the parameter space for θ.

(C1) The observations V_i : i = 1, ..., n are independent and identically distributed with a probability density f(V, θ), which has a common support. We assume the density f satisfies the following equations:

$$E_\theta\left[\frac{\partial \log f(V, \theta)}{\partial \theta_j}\right] = 0 \quad \text{for } j = 1, \ldots, p(K+1),$$

and

$$I_{jk}(\theta) = E_\theta\left[\frac{\partial \log f(V, \theta)}{\partial \theta_j} \frac{\partial \log f(V, \theta)}{\partial \theta_k}\right] = E_\theta\left[-\frac{\partial^2 \log f(V, \theta)}{\partial \theta_j \partial \theta_k}\right].$$
(C2) The Fisher information matrix

$$I(\theta) = E\left[\left(\frac{\partial \log f(V, \theta)}{\partial \theta}\right)\left(\frac{\partial \log f(V, \theta)}{\partial \theta}\right)^T\right]$$

is finite and positive definite at θ = θ*.

(C3) There exists an open set ω of Ω that contains the true parameter point θ* such that for almost all V the density f(V, θ) admits all third derivatives ∂³f(V, θ)/(∂θ_j ∂θ_k ∂θ_l) for all θ ∈ ω and any j, k, l = 1, ..., p(K+1). Furthermore, there exist functions M_{jkl} such that

$$\left|\frac{\partial^3 \log f(V, \theta)}{\partial \theta_j \partial \theta_k \partial \theta_l}\right| \le M_{jkl}(V) \quad \text{for all } \theta \in \omega,$$

where m_{jkl} = E_{θ*}[M_{jkl}(V)] < ∞.

These regularity conditions are the existence of a common support and of first and second derivatives for f(V, θ); the Fisher information matrix being finite and positive definite; and the existence of a bounded third derivative for f(V, θ). These regularity conditions guarantee the asymptotic normality of the ordinary maximum likelihood estimates.

Lemma 5.1. Assume a_n = o(1) as n → ∞. Then under regularity conditions (C1)-(C3), there exists a local minimizer θ̂_n of Q_n(θ) such that ‖θ̂_n − θ*‖ = O_P(n^{−1/2} + a_n).

Theorem 5.2. Assume √n·b_n → ∞ and the minimizer θ̂_n given in Lemma 5.1 satisfies ‖θ̂_n − θ*‖ = O_P(n^{−1/2}). Then under regularity conditions (C1)-(C3), we have P(β̂_{A_1^c} = 0) → 1 and P(α̂_{A_2^c} = 0) → 1.

Lemma 5.1 implies that when the tuning parameters associated with the nonzero coefficients of the main effects and pairwise interactions tend to 0 at a rate faster than n^{−1/2}, there exists a local minimizer of Q_n(θ) which is √n-consistent (the sampling error is O_P(n^{−1/2})). Theorem 5.2 shows that our model removes noise consistently with probability tending to 1. If √n·a_n → 0 and √n·b_n → ∞, then Lemma 5.1 and Theorem 5.2 imply that the √n-consistent estimator θ̂_n satisfies P(θ̂_{A^c} = 0) → 1.

Theorem 5.3. Assume √n·a_n → 0 and √n·b_n → ∞. Then under the regularity conditions (C1)-(C3), the component θ̂_A of the local minimizer θ̂_n (given in Lemma 5.1) satisfies √n(θ̂_A − θ*_A) →_d N(0, I^{−1}(θ*_A)), where I(θ*_A) is the Fisher information matrix of θ_A at θ_A = θ*_A, assuming that θ*_{A^c} = 0 is known in advance.
Theorem 5.3 shows that our model estimates the nonzero coefficients of the true model with the same asymptotic distribution as if the zero coefficients were known in advance. Based on Theorems 5.2 and 5.3, we can say that our model has the oracle property [3], [5] when the tuning parameters satisfy the conditions √n·a_n → 0 and √n·b_n → ∞. To satisfy these conditions, we have to consider adaptive weights w_{β_j}, w_{α_{k,l}} [23] for our tuning parameters λ_β, λ_{α_k} (see the Appendix for more details). Thus, our tuning parameters are:

$$\lambda_{\beta_j} = \frac{\log n}{n} \lambda_\beta w_{\beta_j}, \qquad \lambda_{\alpha_{k,l}} = \frac{\log n}{n} \lambda_{\alpha_k} w_{\alpha_{k,l}}.$$

5.2 Asymptotic Oracle Properties when p_n → ∞ as n → ∞

In this section, we consider the asymptotic behavior of our model when the number of predictors p_n grows to infinity along with the sample size n. If certain regularity conditions (C4)-(C6) (shown below) hold, then we can show that our model possesses the oracle property. We denote the total number of predictors by p_n, and we denote all quantities that change with sample size by adding n as a subscript. A_1, A_2, A are defined as in Section 5, and let s_n = |A_n|. The asymptotic properties of our model when the number of predictors increases along with the sample size are described in the following lemma and theorem. The regularity conditions (C4)-(C6) are given below. Let Ω_n denote the parameter space for θ_n.

(C4) The observations V_{ni} : i = 1, ..., n are independent and identically distributed with a probability density f_n(V_n, θ_n), which has a common support. We assume the density f_n satisfies the following equations:

$$E_{\theta_n}\left[\frac{\partial \log f_n(V_n, \theta_n)}{\partial \theta_{nj}}\right] = 0 \quad \text{for } j = 1, \ldots, p_n,$$

and

$$I_{jk}(\theta_n) = E_{\theta_n}\left[\frac{\partial \log f_n(V_n, \theta_n)}{\partial \theta_{nj}} \frac{\partial \log f_n(V_n, \theta_n)}{\partial \theta_{nk}}\right] = E_{\theta_n}\left[-\frac{\partial^2 \log f_n(V_n, \theta_n)}{\partial \theta_{nj} \partial \theta_{nk}}\right].$$

(C5) $I_n(\theta_n) = E\left[\left(\frac{\partial \log f_n(V_n, \theta_n)}{\partial \theta_n}\right)\left(\frac{\partial \log f_n(V_n, \theta_n)}{\partial \theta_n}\right)^T\right]$ satisfies

$$0 < C_1 < \lambda_{\min}(I_n(\theta_n)) \le \lambda_{\max}(I_n(\theta_n)) < C_2 < \infty \quad \text{for all } n,$$

where λ_min(·) and λ_max(·) represent the smallest and largest eigenvalues of a matrix, respectively.
Moreover, for any j, k = 1, ..., p_n,

$$E_{\theta_n}\left\{\left(\frac{\partial \log f_n(V_n, \theta_n)}{\partial \theta_{nj}} \frac{\partial \log f_n(V_n, \theta_n)}{\partial \theta_{nk}}\right)^2\right\} < C_3 < \infty$$

and

$$E_{\theta_n}\left\{\left(\frac{\partial^2 \log f_n(V_n, \theta_n)}{\partial \theta_{nj} \partial \theta_{nk}}\right)^2\right\} < C_4 < \infty.$$

(C6) There exists a large open set ω_n ⊂ Ω_n ⊂ R^{p_n} which contains the true parameter θ*_n such that for almost all V_{ni} the density admits all third derivatives ∂³f_n(V_{ni}, θ_n)/(∂θ_{nj} ∂θ_{nk} ∂θ_{nl}) for all θ_n ∈ ω_n. Furthermore, there are functions M_{njkl} such that

$$\left|\frac{\partial^3 \log f_n(V_{ni}, \theta_n)}{\partial \theta_{nj} \partial \theta_{nk} \partial \theta_{nl}}\right| \le M_{njkl}(V_{ni})$$

for all θ_n ∈ ω_n, and

$$E_{\theta_n}\left[M_{njkl}^2(V_{ni})\right] < C_5 < \infty$$

for all p_n, n, and j, k, l.

Lemma 5.4. Assume that the density f_n(V_n, θ_n) satisfies the regularity conditions (C4)-(C6). If √n·a_n → 0 and p_n^5/n → 0 as n → ∞, then there exists a local minimizer θ̂_n of Q_n(θ) such that ‖θ̂_n − θ*_n‖ = O_P(√p_n (n^{−1/2} + a_n)).
Theorem 5.5. Suppose that the density f_n(V_n, θ_n) satisfies the regularity conditions (C4)-(C6). If √(n p_n)·a_n → 0, √(n/p_n)·b_n → ∞ and p_n^5/n → 0 as n → ∞, then with probability tending to 1, the √(n/p_n)-consistent local minimizer θ̂_n in Lemma 5.4 satisfies the following:

Sparsity: θ̂_{n A_n^c} = 0.

Asymptotic normality: √n A_n I_n^{1/2}(θ*_{nA_n})(θ̂_{nA_n} − θ*_{nA_n}) →_d N(0, G), where A_n is an arbitrary m × s_n matrix with finite m such that A_n A_n^T → G, G is an m × m nonnegative symmetric matrix, and I_n(θ*_{nA_n}) is the Fisher information matrix of θ_{nA_n} at θ_{nA_n} = θ*_{nA_n}.

Since the dimension of θ̂_{nA_n} grows as the sample size n → ∞, we consider arbitrary linear combinations A_n θ̂_{nA_n} for the asymptotic normality of our model's estimates. Similar to Section 5.1, to satisfy the oracle property we have to consider adaptive weights w_{β_nj}, w_{α_{k,nl}} [23] for our tuning parameters λ_β, λ_{α_k}:

$$\lambda_{\beta_{nj}} = \frac{\log(n)\, p_n}{n} \lambda_\beta w_{\beta_{nj}}, \qquad \lambda_{\alpha_{k,nl}} = \frac{\log(n)\, p_n}{n} \lambda_{\alpha_k} w_{\alpha_{k,nl}}.$$

6. OPTIMIZATION

In this section, we outline three optimization methods that we employ to solve our objective function (7), which corresponds to Lines 4 and 5 in Algorithm 1. [5] provides a good survey of several optimization approaches for solving ℓ1-regularized regression problems. In this paper, we use the sub-gradient and coordinate-wise soft-thresholding based optimization methods, since they work well and are easy to implement. We compare these methods in the experimental results in Section 7.

6.1 Sub-Gradient Methods

Sub-gradient based strategies treat the non-differentiable objective as a non-smooth optimization problem and use sub-gradients of the objective function at the non-differentiable points. For our model, the optimality conditions w.r.t. the parameter vectors β and a_k can be written out separately based on the objective function (7). The optimality conditions w.r.t. a_k are:

$$\nabla_j L(a_k) + \lambda_{a_k} \mathrm{sgn}(a_{kj}) = 0, \quad |a_{kj}| > 0,$$
$$|\nabla_j L(a_k)| \le \lambda_{a_k}, \quad a_{kj} = 0,$$

where L(a_k) is the loss function of our linear regression model or logistic regression model in Equation (7) w.r.t. a_k. Similarly, optimality conditions can be written for β.
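The optimality conditions above rely on the gradient of the squared-error loss with respect to a_k. The following sketch (ours, with illustrative names) states it for a single rank-one factor and verifies it by central finite differences:

```python
import numpy as np

def loss(X, y, beta, a):
    """Squared-error loss of Eq. (4) with a single rank-one factor W = a a^T."""
    r = y - X @ beta - (X @ a) ** 2
    return 0.5 * r @ r

def grad_a(X, y, beta, a):
    """Analytic gradient: dL/da_j = -2 * sum_i X_ij (X_i^T a) r_i."""
    r = y - X @ beta - (X @ a) ** 2
    return -2.0 * X.T @ (r * (X @ a))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)
beta = rng.normal(size=5)
a = rng.normal(size=5)

# central finite differences agree with the analytic gradient
eps = 1e-6
g_num = np.array([(loss(X, y, beta, a + eps * e) - loss(X, y, beta, a - eps * e)) / (2 * eps)
                  for e in np.eye(5)])
assert np.allclose(g_num, grad_a(X, y, beta, a), rtol=1e-4, atol=1e-6)
```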
The sub-gradient s_j f(a_k) for each a_{kj} is given by

$$s_j f(a_k) = \begin{cases} \nabla_j L(a_k) + \lambda_{a_k} \mathrm{sgn}(a_{kj}), & |a_{kj}| > 0 \\ \nabla_j L(a_k) + \lambda_{a_k}, & a_{kj} = 0,\; \nabla_j L(a_k) < -\lambda_{a_k} \\ \nabla_j L(a_k) - \lambda_{a_k}, & a_{kj} = 0,\; \nabla_j L(a_k) > \lambda_{a_k} \\ 0, & -\lambda_{a_k} \le \nabla_j L(a_k) \le \lambda_{a_k} \end{cases}$$

where, for our linear regression model,

$$\nabla_j L(a_k) = -2 \sum_i X_j^{(i)} \left(X^{(i)T} a_k\right) \left[y^{(i)} - \beta^T X^{(i)} - X^{(i)T} W X^{(i)}\right].$$

The negation of the sub-gradient represents the steepest descent direction. Similarly, the sub-gradients s_j f(β) for β can be calculated. The differential of the loss function of the linear regression in Equation (7) w.r.t. β is given by

$$\nabla_j L(\beta) = -\sum_i X_j^{(i)} \left[y^{(i)} - \beta^T X^{(i)} - X^{(i)T} W X^{(i)}\right].$$

6.1.1 Orthant-Wise Descent (OWD)

Andrew and Gao [1] proposed an effective strategy for solving large-scale ℓ1-regularized regression problems based on choosing an appropriate steepest descent direction for the objective function and taking a step like a Newton iteration in this direction (with an L-BFGS Hessian approximation [2]). The orthant-wise learning descent method for our model takes the following form:

$$\beta \leftarrow P_O\left[\beta - \gamma_\beta P_S\left[H_\beta^{-1} s f(\beta)\right]\right],$$
$$a_k \leftarrow P_O\left[a_k - \gamma_{a_k} P_S\left[H_{a_k}^{-1} s f(a_k)\right]\right],$$

where P_O and P_S are two projection operators, H_β is the positive definite approximation of the Hessian of the quadratic approximation of the objective function f(β), and γ_β and γ_{a_k} are step sizes. P_S projects the Newton-like direction to guarantee that it is a descent direction. P_O projects the step onto the orthant containing β or a_k and ensures that the line search does not cross points of non-differentiability.

6.1.2 Projected Scaled Sub-Gradient (PSS)

Schmidt [5] proposed optimization methods called Projected Scaled Sub-Gradient methods, where the iterations can be written as the projection of a scaling of a sub-gradient of the objective. Please refer to [1] and [5] for more details on the OWD and PSS methods.

6.2 Soft-thresholding

A soft-thresholding based coordinate descent optimization method can be used to find the β and a_k updates in the alternating optimization algorithm for our FHIM model.
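The soft-thresholding operator S referred to here is the proximal map of the ℓ1 penalty, S(z, λ) = sign(z)·max(|z| − λ, 0); a direct implementation (ours):

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0): shrink toward zero,
    and set exactly to zero whenever |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(soft_threshold(3.0, 1.0))    # 2.0
print(soft_threshold(-0.5, 1.0))   # -0.0 (zeroed out: |z| <= lam)
print(soft_threshold(np.array([-2.0, 0.3, 1.5]), 1.0))  # [-1.   0.   0.5]
```

It is this hard zeroing of small coordinates that produces the sparse supports of β and a_k.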
The β updates β̃_j are given by

$$\tilde\beta_j(\lambda_\beta) \leftarrow S\left(\tilde\beta_j(\lambda_\beta) + \frac{1}{n} \sum_{i=1}^{n} X_{ij}\left(y_i - \sum_{k \neq j} X_{ik} \tilde\beta_k - X_i^T W X_i\right),\; \lambda_\beta\right),$$

where $W = \sum_k a_k \otimes a_k$ and S is the soft-thresholding operator [7]. Similarly, the updates ã_{kj} for a_k are given by

$$\tilde a_{kj}(\lambda_{a_k}) \leftarrow S\left(\tilde a_{kj}(\lambda_{a_k}) + \frac{1}{n} \sum_{i=1}^{n} X_{ij}\left(\sum_{r=1}^{p} a_{kr} X_{ir}\right)\left[y_i - \sum_k X_{ik} \tilde\beta_k - X_i^T W_{-j} X_i\right],\; \lambda_{a_k}\right),$$

where W_{−j} is W with the j-th column and j-th row elements all set to zero.

7. EXPERIMENTS

In this section, we use synthetic and real datasets to demonstrate the efficacy of our model (FHIM), and compare its performance with LASSO [8], All-Pairs Lasso [2], Hierarchical LASSO [2], Group Lasso [2], Trace-norm [9], Dirty model [8] and QUIRE [4]. For all these models, we perform 5 runs of 5-fold cross-validation on a training dataset (80%)
to find the optimal parameters, and we evaluate prediction error on a test dataset (20%). We search tuning parameters for all methods using grid search; for our model, the parameters λ_β and λ_{a_k} are searched in the range [0.01, 10]. We also discuss the support recovery of β and W for our model.

7.1 Datasets

We use synthetic datasets and a real dataset for classification and support recovery experiments. We give a detailed description of these datasets below.

7.1.1 Synthetic Dataset

We generate the predictors of the design matrix X using a normal distribution with mean zero and variance one. The weight matrix W was generated as a sum of K rank-one matrices, i.e. W = Σ_{k=1}^K a_k a_k^T. β and a_k were generated as sparse vectors from a normal distribution with mean 0 and variance 1, while the noise vector ε was generated from a normal distribution with mean 0 and variance 0.1. Finally, the response vectors y of the linear and logistic regression models with pairwise interactions were generated using Equations (3) and (5), respectively. We generated several synthetic datasets by varying the number of instances (n), the number of variables/predictors (p), the rank K of W, and the sparsity level of β and a_k. We denote the combined total number of predictors (that is, main-effect predictors plus predictors for interaction terms) by q; here q = p(p + 1)/2. The sparsity level (non-zeros) was chosen as 2-4% for large p (> 100), and 5-10% for small p (< 100), for both β and a_k. In this paper, we show results for synthetic data in these settings: Case (1) n > p and q > n (high-dimensional setting w.r.t. the combined predictors), and Case (2) p > n (high-dimensional w.r.t. the original predictors).

7.1.2 Real Dataset

To predict cancer progression status directly from blood samples, we generated our own dataset. All samples and clinical information were collected under Health Insurance Portability and Accountability Act compliance from study participants after obtaining written informed consent under clinical research protocols approved by the institutional review boards of each site.
Blood was processed within 2 hours of collection according to established standard operating procedures. To predict RCC status, serum samples were collected at a single study site from patients diagnosed with RCC or a benign renal mass prior to treatment. A definitive pathology diagnosis of RCC and the cancer stage were made after resection. Outcome data were obtained through follow-up from 3 months to 5 years after initial treatment. Our RCC dataset contains 122 RCC samples from benign and 4 different stages of tumor. Expression levels of 1092 proteins were collected based on a high-throughput SOMAmer protein quantification technology. The numbers of Benign, Stage 1, Stage 2, Stage 3 and Stage 4 tumor samples are 40, 10, 17, 24 and 31, respectively.

7.1.3 Experimental Design

We use linear regression models (Equation 3) for all the following experiments, and we only use logistic regression models (Equation 5) for the synthetic data experiments shown in Table 2. We evaluate the performance of our method (FHIM) by the following experiments:

1. Prediction error and support recovery experiments on synthetic datasets.

2. Classification experiments using RCC samples: we perform three stage-wise binary classification experiments using RCC samples:
(a) Case 1: Classification of Benign samples from Stage 4 samples.
(b) Case 2: Classification of Benign and Stage 1 samples from Stage 2-4 samples.
(c) Case 3: Classification of Benign, Stage 1, 2 samples from Stage 3, 4 samples.

7.1.4 Performance on Synthetic Dataset

We evaluate the performance of our model (FHIM) on the synthetic datasets by the following experiments: (i) comparison of the optimization methods presented in Section 6, (ii) prediction error on the test data for q > n and p > n (high-dimensional settings), (iii) support recovery accuracy of β and W, and (iv) prediction of the rank of W using the greedy approach. Table 3 shows the prediction error on test data when the different optimization methods (discussed in Section 6) are used for our model (FHIM).
From Table 3, we see that both the OWD and PSS methods perform nearly the same (OWD is marginally better), and both are better than the soft-thresholding method. This is because, in soft-thresholding, the coordinate updates of the variables might not be accurate in high-dimensional settings (i.e., the solution is affected by the path taken during the updates). We observed that soft-thresholding is in general slower than the OWD and PSS methods. For all the other experiments discussed in this paper, we choose OWD as the optimization method for FHIM.

Figure 1: Support recovery of β (90% sparse) and W (99% sparse) for synthetic data Case 1: n > p and q > n, where n = 1000, p = 50, q = 1275.

Figure 2: Support recovery of W (99.5% sparse) for synthetic data Case 2: p > n, where p = 500, n = 100. Online supplementary materials contain high-quality images for this figure.

Table 1 and Table 2 show the performance comparison (in terms of prediction error on the test dataset) of FHIM for linear and logistic regression models with respect to state-of-the-art approaches such as Lasso, Fused Lasso, Trace-Norm and Hierarchical Lasso (HLasso is a general version of SHIM [3]). From Tables 1 and 2, we see that FHIM generally outperforms all the state-of-the-art approaches for both linear and logistic pairwise regression models. For q > n, we see that the test data prediction
| n, p, K | FHIM | Fused Lasso | Lasso | HLasso | Trace norm | Dirty Model |
|---|---|---|---|---|---|---|
| 1000, 50 (q > n) | 338.4 (4.5) | 425.9 (20.7) | 474.7 (5.3) | (24.82) | 464.4 (36.3) | 63.5 (0.76) |
| 1000, 50 (q > n) | (2.9) | 888.3 (2.) | 922.9 (43.9) | 889. (2.5) | 822.6 (99.8) | (0.76) |
| 10000, 500 (q > n) | 093. (9.5) | (55.) | (29.5) | - | (0.) | (0.8) |
| 10000, 500 (q > n) | (2.2) | 22720 (597.8) | (23.3) | - | (32.4) | 2924 (0.8) |
| 100, 500 (p > n) | (50.3) | 57.2 (355.0) | 335.0 (59.2) | - | (299.7) | 65.9 (62.6) |
| 100, 1000 (p > n) | 340. (40.02) | 770.9 (27.6) | 879. (80.3) | - | (208.7) | 808. (5.) |
| 100, 2000 (p > n) | (00.) | 022.3 (406.2) | 99.2 (32.) | - | (47.6) | 96.7 (63.4) |

Table 1: Performance comparison for synthetic data on the linear regression model with high-order interactions. Prediction error (MSE) and the standard deviation of the MSE (shown inside brackets) on test data are used to measure the model's performance. For p >= 500, Hierarchical Lasso (HLasso) has heavy computational complexity, hence we don't show its results here.

| n, p, K | FHIM | Fused Lasso | Lasso | HLasso | Trace norm |
|---|---|---|---|---|---|
| 1000, 50 (q > n) | 0.27 (0.009) | 0.28 (0.07) | 0.56 (0.07) | 0.36 (0.02) | 0.28 (0.06) |
| 1000, 50 (q > n) | (0.03) | (0.024) | (0.042) | (0.022) | (0.027) |
| 10000, 500 (q > n) | 0.35 (0.002) | (0.007) | 0.6 (0.02) | - | (0.077) |
| 10000, 500 (q > n) | (0.05) | 0.54 (0.006) | 0.507 (0.08) | - | (0.006) |
| 100, 500 (p > n) | (0.04) | (0.086) | (0.054) | - | (0.079) |
| 100, 1000 (p > n) | (0.056) | 0.409 (0.086) | 0.458 (0.083) | - | (0.0) |

Table 2: Performance comparison for the synthetic dataset on the logistic regression model with high-order interactions. Misclassification error on test data is used to measure the model's performance.

error for FHIM is consistently lower compared to all other approaches. For p > n, FHIM performs slightly better than the other approaches; however, the prediction error for all the approaches is high, since it is hard to accurately recover the coefficients of the main effects and pairwise interactions in very high-dimensional settings. From Figure 1 and Table 4, we see that our model performs very well (F1 score close to 1) in the support recovery of β and W for the q > n setting. From Figure 2 and Table 5, we see that our model performs fairly well in the support recovery of W for the p > n setting.
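For reference, the synthetic-data protocol of Section 7.1.1 and an F1-style support-recovery score of the kind reported above can be sketched as follows (ours; sizes, sparsity levels and function names are illustrative):

```python
import numpy as np

def sparse_vec(rng, p, nnz):
    """A p-vector with nnz randomly placed N(0,1) entries, rest zero."""
    v = np.zeros(p)
    idx = rng.choice(p, size=nnz, replace=False)
    v[idx] = rng.normal(size=nnz)
    return v

def make_synthetic(n, p, K, nnz=2, seed=0):
    """Synthetic data per Section 7.1.1: X ~ N(0,1), sparse beta and a_k,
    W = sum_k a_k a_k^T, response from Eq. (3) with N(0, 0.1) noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    beta = sparse_vec(rng, p, nnz)
    A = np.vstack([sparse_vec(rng, p, nnz) for _ in range(K)])
    W = A.T @ A
    y = X @ beta + np.einsum('ij,jk,ik->i', X, W, X) + rng.normal(scale=np.sqrt(0.1), size=n)
    return X, y, beta, W

def support_f1(true_mat, est_mat, tol=1e-8):
    """F1 score of the recovered nonzero pattern (support)."""
    t, e = np.abs(true_mat) > tol, np.abs(est_mat) > tol
    tp = np.sum(t & e)
    if tp == 0:
        return 0.0
    prec, rec = tp / e.sum(), tp / t.sum()
    return 2 * prec * rec / (prec + rec)

X, y, beta, W = make_synthetic(n=200, p=20, K=2)
print(support_f1(W, W))   # 1.0: a perfect estimate recovers the full support
```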
We observe that when the tuning parameters are correctly chosen, support recovery of W works very well when W is low-rank (see Tables 4 and 5), and the F1 score for the support recovery of W decreases as the rank of W increases. Table 5 shows that for q > n the greedy strategy of FHIM accurately recovers the rank K of W, while for p > n the greedy strategy might not correctly recover K. This is because the factorization of W is not unique, and slightly correlated variables can enter our model during the optimization.

RCC Dataset

Classification Performance on RCC

In this section, we report systematic experimental results on the classification of samples from different stages of RCC. The predictive performance of the markers and pairwise interactions selected by our model (FHIM) is compared against the markers selected by Lasso, All-Pairs Lasso [2], Group Lasso, Dirty model [8] and QUIRE. We use the SLEP [13], MALSAR [22] and QUIRE packages for the implementation of these models. The overall performance of the algorithms is shown in Figure 3. In this figure, we report the average AUC score over five runs of 5-fold cross-validation experiments for cancer-stage prediction in RCC. In the 5-fold cross-validation experiments, we train our model on four folds to identify the main effects and pairwise interactions, and we use the remaining fold to test prediction. The average AUC achieved by the features selected with our model is 0.68, 0.89 and 0.84, respectively, for the three classification cases shown in Figure 3. We performed pairwise t-tests comparing our method with each of the other methods, and all p-values are below the chosen significance level. From Figure 3, it is clear that our model outperforms all the other algorithms that do not use prior feature-group information in all three classification cases of RCC prediction. In addition, our model has performance similar to that of the state-of-the-art technique QUIRE [14], which uses Gene Ontology based functional annotation for the grouping and clustering of genes to identify high-order interactions.

Figure 3: Comparison of the classification performance of different feature-selection approaches (APLasso, Lasso, Trace-norm, Group Lasso, Dirty Model, QUIRE, and FHIM (ours)) in identifying the different stages of RCC (cases: Benign vs. Stage 1-4; Benign, Stage 1 vs. Stage 2-4; Benign, Stage 1-2 vs. Stage 3-4). We perform five-fold cross-validation five times and report the average AUC score.

Figure 4: Examples of functional modules for RCC Case 3, induced by markers and interactions discovered by our model and enriched in pathways and functions associated with RCC.

7.1.6 Informative interactions discovered by FHIM

An investigation of the pairwise interactions identified by our model on the RCC dataset reveals that many of these interactions are indeed relevant to the prediction of cancer. Figure 4 shows some of the interactions associated with higher
weighted pairwise coefficients for Case 3 of the RCC classification experiment. The interactions include CX3CL1-CD97 and CHEK2-IL5RA, which are known to be related to proteins in blood. CX3CL1 was recently found to promote breast cancer [17], while CD97 was found to promote colorectal cancer [19]. We believe these protein interactions might lead to renal cell cancer. Further investigations of the interactions identified by our model might reveal novel protein interactions associated with renal cell cancer, and thus lead to testable hypotheses.

Time Complexity

FHIM has O(np) time complexity for Algorithm 1. In general, FHIM takes more time than the Lasso approach, since we perform an alternating optimization of β and the a_k. For the q > n setting with n = 1000, q = 1275, our OWD learning optimization method in Matlab takes around 1 minute for 5-fold cross-validation, while for p > n with p = 2000, n = 100, our FHIM model took around 2 hours for 5-fold cross-validation. Our experiments were run on an Intel i3 dual-core 2.9 GHz CPU with 8 GB RAM.

Table 3 (columns: n, p; OWD; Soft-thresholding; PSS): Comparison of optimization methods for our FHIM model, based on test-data prediction error.

Table 4 (columns: n, p; sparsity; K; support recovery of β, a_k and of β, W (F1 score)): Support recovery of β and W.

Table 5 (columns: n, p; true K; estimated K; F1 score of the support recovery of W): Recovering K using the greedy strategy.

8. CONCLUSIONS

In this paper, we proposed a factorization-based sparse learning framework called FHIM for identifying high-order feature interactions in linear and logistic regression models, and studied several optimization methods for our model. Empirical experiments on synthetic and real datasets showed that our model outperforms several well-known techniques such as Lasso, Trace-norm and Group Lasso, and achieves performance comparable to the current state-of-the-art method QUIRE, while not assuming any prior knowledge about the data.
Our model gives interpretable results for high-order feature interactions on the RCC dataset, which can be used for biomarker discovery for disease diagnosis. In the future, we will consider the following directions: (i) we will consider a factorization of the weight matrix W as W = Σ_k a_k b_kᵀ, together with higher-order feature interactions, which is more general but makes the optimization problem non-convex; (ii) we will extend our optimization methods from single-task learning to multi-task learning; (iii) we will consider groupings of features for both single-task and multi-task learning.

References

[1] G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
[2] J. Bien, J. Taylor, and R. Tibshirani. A lasso for hierarchical interactions. The Annals of Statistics, 41(3):1111-1141, 2013.
[3] N. H. Choi, W. Li, and J. Zhu. Variable selection with the strong heredity constraint and its oracle property. Journal of the American Statistical Association, 105(489), 2010.
[4] D. K. Duvenaud, H. Nickisch, and C. E. Rasmussen. Additive Gaussian processes. In NIPS, 2011.
[5] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.
[6] R. Foygel, N. Srebro, and R. Salakhutdinov. Matrix reconstruction with the local max norm. arXiv preprint arXiv:1210.5196, 2012.
[7] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302-332, 2007.
[8] A. Jalali, P. Ravikumar, and S. Sanghavi. A dirty model for multiple sparse regression. arXiv preprint, 2011.
[9] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[10] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. Mach. Learn.
Res., 5:27-72, Dec. 2004.
[11] E. L. Lehmann and G. Casella. Theory of Point Estimation, volume 3. Springer, 1998.
[12] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503-528, 1989.
[13] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[14] R. Min, S. Chowdhury, Y. Qi, A. Stewart, and R. Ostroff. An integrated approach to blood-based cancer diagnosis and biomarker discovery. In Proceedings of the Pacific Symposium on Biocomputing, 2014.
[15] M. Schmidt. Graphical model structure learning with l1-regularization. PhD thesis, University of British Columbia, 2010.
[16] J. A. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293-300, 1999.
[17] M. Tardaguila, E. Mira, M. A. García-Cabezas, A. M. Feijoo, M. Quintela-Fandino, I. Azcoitia, S. A. Lira, and S. Manes. CX3CL1 promotes breast cancer via transactivation of the EGF pathway. Cancer Research, 2013.
[18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[19] M. Wobus, O. Huber, J. Hamann, and G. Aust. CD97 overexpression in tumor cells at the invasion front in colorectal cancer (CC) is independently regulated of the canonical Wnt pathway. Molecular Carcinogenesis, 45(11):881-886, 2006.
[20] T. T. Wu, Y. F. Chen, T. Hastie, E. Sobel, and K. Lange. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714-721, 2009.
[21] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.
[22] J. Zhou, J. Chen, and J. Ye. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2011.
[23] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418-1429, 2006.

APPENDIX
A. PROOFS FOR SECTION 5

Proof of Lemma 5.1: Let η_n = n^(−1/2) + a_n, and let {θ* + η_n δ : ||δ|| ≤ d} be the ball around θ*, where δ = (u_1, ..., u_p, v_1, ..., v_{Kp})ᵀ = (uᵀ, vᵀ)ᵀ. Define

D_n(δ) ≡ Q_n(θ* + η_n δ) − Q_n(θ*),

where Q_n(θ) is defined in equation (8).
For δ satisfying ||δ|| = d, we have

D_n(δ) = −L_n(θ* + η_n δ) + L_n(θ*)
  + n Σ_j λ_{β_j} (|β*_j + η_n u_j| − |β*_j|)
  + n Σ_{k,l} λ_{α_k} (|α*_{k,l} + η_n v_{k,l}| − |α*_{k,l}|)
≥ −L_n(θ* + η_n δ) + L_n(θ*)
  + n Σ_{j ∈ A_1} λ_{β_j} (|β*_j + η_n u_j| − |β*_j|)
  + n Σ_{(k,l) ∈ A_2} λ_{α_k} (|α*_{k,l} + η_n v_{k,l}| − |α*_{k,l}|)
≥ −L_n(θ* + η_n δ) + L_n(θ*) − n η_n Σ_{j ∈ A_1} λ_{β_j} |u_j| − n η_n Σ_{(k,l) ∈ A_2} λ_{α_k} |v_{k,l}|
≥ −L_n(θ* + η_n δ) + L_n(θ*) − n η_n² ( Σ_{j ∈ A_1} |u_j| + Σ_{(k,l) ∈ A_2} |v_{k,l}| )
≥ −L_n(θ* + η_n δ) + L_n(θ*) − n η_n² (|A_1| + |A_2|) d
= −[∇L_n(θ*)]ᵀ (η_n δ) − (1/2) (η_n δ)ᵀ [∇² L_n(θ*)] (η_n δ) (1 + o_p(1)) − n η_n² (|A_1| + |A_2|) d,

where we used a Taylor expansion in the last step. We split the expression above into three parts:

K_1 = −η_n [∇L_n(θ*)]ᵀ δ = −n η_n (n^(−1) ∇L_n(θ*))ᵀ δ = O_p(n η_n²) ||δ||,
K_2 = (1/2) n η_n² { δᵀ [−n^(−1) ∇² L_n(θ*)] δ (1 + o_p(1)) } = (1/2) n η_n² { δᵀ I(θ*) δ (1 + o_p(1)) },
K_3 = −n η_n² (|A_1| + |A_2|) d,

so that D_n(δ) ≥ K_1 + K_2 + K_3. For d large enough, K_2 dominates the other terms and is positive, since I(θ) is positive definite at θ = θ* by regularity condition (C2). Therefore, for any given ε > 0 there exists a large enough constant d such that

P{ inf_{||δ|| = d} Q_n(θ* + η_n δ) > Q_n(θ*) } ≥ 1 − ε.

This implies that, with probability at least 1 − ε, there exists a local minimizer in the ball {θ* + η_n δ : ||δ|| ≤ d}. Thus, there exists a local minimizer θ̂_n of Q_n(θ) such that ||θ̂_n − θ*|| = O_p(η_n).

Proof of Theorem 5.2: Let us first consider P(α̂_{A_2^c} = 0). It is sufficient to show that, for any (k,l) ∈ A_2^c,

∂Q_n(θ̂_n)/∂α_{k,l} < 0 for −ε_n < α̂_{k,l} < 0,  (1)
∂Q_n(θ̂_n)/∂α_{k,l} > 0 for ε_n > α̂_{k,l} > 0,   (2)

with probability tending to 1, where ε_n = C n^(−1/2) and C > 0 is any constant. To show (2), notice that

∂Q_n(θ̂_n)/∂α_{k,l} = −∂L_n(θ̂_n)/∂α_{k,l} + n λ_{α_k} sgn(α̂_{k,l})
= −∂L_n(θ*)/∂α_{k,l} − Σ_{j=1}^{p(K+1)} [∂² L_n(θ*)/∂α_{k,l} ∂θ_j] (θ̂_j − θ*_j)
  − (1/2) Σ_{j=1}^{p(K+1)} Σ_{m=1}^{p(K+1)} [∂³ L_n(θ̃)/∂α_{k,l} ∂θ_j ∂θ_m] (θ̂_j − θ*_j)(θ̂_m − θ*_m)
  + n λ_{α_k} sgn(α̂_{k,l}),

where θ̃ lies between θ̂_n and θ*. By regularity conditions (C1)-(C3) and Lemma 5.1, we have

∂Q_n(θ̂_n)/∂α_{k,l} = n { O_p(n^(−1/2)) + λ_{α_k} sgn(α̂_{k,l}) }.

Since √n λ_{α_k} → ∞ for (k,l) ∈ A_2^c by assumption, the sign of ∂Q_n(θ̂_n)/∂α_{k,l} is dominated by sgn(α̂_{k,l}). Thus,

P( ∂Q_n(θ̂_n)/∂α_{k,l} > 0 for 0 < α̂_{k,l} < ε_n ) → 1 as n → ∞.

(1) has an identical proof. Also, P(β̂_{A_1^c} = 0) can be proved similarly, since in our model β and α are independent of each other.

Proof of Theorem 5.3: Let Q_n(θ_A) denote the objective function Q_n restricted to the A components of θ, that is, Q_n(θ) with θ_{A^c} = 0. Based on Lemma 5.1 and Theorem 5.2, we have P(θ̂_{A^c} = 0) → 1. Thus,

P( argmin Q_n(θ_A) = (A component of argmin_θ Q_n(θ)) ) → 1.

Thus, θ̂_A should satisfy

∂Q_n(θ_A)/∂θ_j |_{θ_A = θ̂_A} = 0 for all j ∈ A  (3)

with probability tending to 1. Let L_n(θ_A) and P_λ(θ_A) denote the log-likelihood function of θ_A and the penalty function of θ_A, respectively, so that Q_n(θ_A) = −L_n(θ_A) + n P_λ(θ_A). From (3), we have

∇_A Q_n(θ̂_A) = −∇_A L_n(θ̂_A) + n ∇_A P_λ(θ̂_A) = 0,  (4)

with probability tending to 1. Now, Taylor expansions of the first and second terms at θ_A = θ*_A give

∇_A L_n(θ̂_A) = ∇_A L_n(θ*_A) + [∇²_A L_n(θ*_A) + o_p(1)] (θ̂_A − θ*_A)
            = √n [ n^(−1/2) ∇_A L_n(θ*_A) − I(θ*_A) √n (θ̂_A − θ*_A) + o_p(1) ],

n ∇_A P_λ(θ̂_A) = n { λ_{β_j} sgn(β̂_j), λ_{α_k} sgn(α̂_{k,l}) }_{j ∈ A_1, (k,l) ∈ A_2} + o_p(1)(θ̂_A − θ*_A) = √n · o_p(1),

since √n a_n = o(1) and ||θ̂_A − θ*_A|| = O_p(n^(−1/2)). Thus, we get

0 = √n [ −n^(−1/2) ∇_A L_n(θ*_A) + I(θ*_A) √n (θ̂_A − θ*_A) + o_p(1) ].

Therefore, from the central limit theorem,

√n (θ̂_A − θ*_A) →_d N(0, I^(−1)(θ*_A)).

The proofs for the lemma and theorem of Section 5.2 are along the same lines as above. Please refer to Appendix Section C for more details.

B. COMPUTING ADAPTIVE WEIGHTS

Here, we explain how the adaptive weights w^β_j, w^α_k can be calculated for the tuning parameters λ_{β_j}, λ_{α_k} in the theoretical properties (Theorems 5.3 & 5.5) of Section 5. Let q be the total number of predictors, and let n be the total number of instances.
When n > q, we can compute the adaptive weights w^β_j, w^α_k for the tuning parameters λ_{β_j}, λ_{α_k} using the ordinary least squares (OLS) estimates on the training observations:

λ_{β_j} = (log n / n) λ_β w^β_j,  λ_{α_k} = (log n / n) λ_α w^α_k,

where w^β_j = 1 / |β̂^OLS_j| and w^α_k = 1 / |α̂^OLS_k|.

When q > n, the OLS estimates are not available, so we compute the weights using ridge-regression estimates; that is, we replace all the OLS estimates above with ridge-regression estimates. The tuning parameter for ridge regression can be selected using cross-validation. Note that we find α̂^OLS_k by taking least squares with respect to each α_k, where k ∈ [0, K] for some K ≥ trueK. Without loss of generality, we can assume K = trueK when proving the theoretical properties in Section 5. Even if K ≠ trueK, it does not affect the theoretical properties, since the cardinality of A_2 (|A_2|) does not affect the root-n consistency (see the proof of Lemma 5.1). In practice, K is greedily chosen by Algorithm 1 in our paper.

C. SUPPLEMENTARY MATERIALS

For interested readers, we provide online supplementary materials (PDF) with detailed proofs for Lemma 5.4 and Theorem 5.5. These proofs do not affect the understanding of this paper. We also provide high-quality images for Figure 1 in the supplementary materials, and more details about the experimental settings of the state-of-the-art techniques used in this paper.
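The pilot-estimate step above (OLS when n > q, ridge otherwise, with weights inversely proportional to the pilot coefficients) can be sketched as follows. This is a minimal illustration of our own; the function name, the `ridge_lambda` default (which would in practice be chosen by cross-validation), and the small epsilon guard are assumptions, not the authors' code:

```python
import numpy as np

def adaptive_weights(X, y, ridge_lambda=1.0):
    """Adaptive-lasso-style weights w_j = 1 / |coef_j| from pilot estimates.

    Uses ordinary least squares when n > q; otherwise falls back to ridge
    regression, since X^T X + lambda*I is invertible even when q > n.
    """
    n, q = X.shape
    if n > q:
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS pilot estimate
    else:
        coef = np.linalg.solve(X.T @ X + ridge_lambda * np.eye(q), X.T @ y)
    return 1.0 / (np.abs(coef) + 1e-12)   # eps guards against zero coefficients
```

Large pilot coefficients thus receive small penalties and small pilot coefficients large ones, which is what drives the oracle behavior of the adaptive weighting.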
More informationA Simple and Efficient Algorithm of 3-D Single-Source Localization with Uniform Cross Array Bing Xue 1 2 a) * Guangyou Fang 1 2 b and Yicai Ji 1 2 c)
A Simpe Efficient Agorithm of 3-D Singe-Source Locaization with Uniform Cross Array Bing Xue a * Guangyou Fang b Yicai Ji c Key Laboratory of Eectromagnetic Radiation Sensing Technoogy, Institute of Eectronics,
More informationSupport Vector Machine and Its Application to Regression and Classification
BearWorks Institutiona Repository MSU Graduate Theses Spring 2017 Support Vector Machine and Its Appication to Regression and Cassification Xiaotong Hu As with any inteectua project, the content and views
More informationConvergence Property of the Iri-Imai Algorithm for Some Smooth Convex Programming Problems
Convergence Property of the Iri-Imai Agorithm for Some Smooth Convex Programming Probems S. Zhang Communicated by Z.Q. Luo Assistant Professor, Department of Econometrics, University of Groningen, Groningen,
More informationFORECASTING TELECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODELS
FORECASTING TEECOMMUNICATIONS DATA WITH AUTOREGRESSIVE INTEGRATED MOVING AVERAGE MODES Niesh Subhash naawade a, Mrs. Meenakshi Pawar b a SVERI's Coege of Engineering, Pandharpur. nieshsubhash15@gmai.com
More informationDetermining The Degree of Generalization Using An Incremental Learning Algorithm
Determining The Degree of Generaization Using An Incrementa Learning Agorithm Pabo Zegers Facutad de Ingeniería, Universidad de os Andes San Caros de Apoquindo 22, Las Condes, Santiago, Chie pzegers@uandes.c
More informationHILBERT? What is HILBERT? Matlab Implementation of Adaptive 2D BEM. Dirk Praetorius. Features of HILBERT
Söerhaus-Workshop 2009 October 16, 2009 What is HILBERT? HILBERT Matab Impementation of Adaptive 2D BEM joint work with M. Aurada, M. Ebner, S. Ferraz-Leite, P. Godenits, M. Karkuik, M. Mayr Hibert Is
More informationPartial permutation decoding for MacDonald codes
Partia permutation decoding for MacDonad codes J.D. Key Department of Mathematics and Appied Mathematics University of the Western Cape 7535 Bevie, South Africa P. Seneviratne Department of Mathematics
More informationTesting for the Existence of Clusters
Testing for the Existence of Custers Caudio Fuentes and George Casea University of Forida November 13, 2008 Abstract The detection and determination of custers has been of specia interest, among researchers
More informationNEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION
NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION Hsiao-Chang Chen Dept. of Systems Engineering University of Pennsyvania Phiadephia, PA 904-635, U.S.A. Chun-Hung Chen
More informationAn Information Geometrical View of Stationary Subspace Analysis
An Information Geometrica View of Stationary Subspace Anaysis Motoaki Kawanabe, Wojciech Samek, Pau von Bünau, and Frank C. Meinecke Fraunhofer Institute FIRST, Kekuéstr. 7, 12489 Berin, Germany Berin
More informationSome Measures for Asymmetry of Distributions
Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester
More informationProcess Capability Proposal. with Polynomial Profile
Contemporary Engineering Sciences, Vo. 11, 2018, no. 85, 4227-4236 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ces.2018.88467 Process Capabiity Proposa with Poynomia Profie Roberto José Herrera
More informationAlgorithms to solve massively under-defined systems of multivariate quadratic equations
Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations
More informationReichenbachian Common Cause Systems
Reichenbachian Common Cause Systems G. Hofer-Szabó Department of Phiosophy Technica University of Budapest e-mai: gszabo@hps.ete.hu Mikós Rédei Department of History and Phiosophy of Science Eötvös University,
More informationNonlinear Gaussian Filtering via Radial Basis Function Approximation
51st IEEE Conference on Decision and Contro December 10-13 01 Maui Hawaii USA Noninear Gaussian Fitering via Radia Basis Function Approximation Huazhen Fang Jia Wang and Raymond A de Caafon Abstract This
More informationII. PROBLEM. A. Description. For the space of audio signals
CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time
More informationEfficiently Generating Random Bits from Finite State Markov Chains
1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown
More informationFUSED MULTIPLE GRAPHICAL LASSO
FUSED MULTIPLE GRAPHICAL LASSO SEN YANG, ZHAOSONG LU, XIAOTONG SHEN, PETER WONKA, JIEPING YE Abstract. In this paper, we consider the probem of estimating mutipe graphica modes simutaneousy using the fused
More information18-660: Numerical Methods for Engineering Design and Optimization
8-660: Numerica Methods for Engineering esign and Optimization in i epartment of ECE Carnegie Meon University Pittsburgh, PA 523 Side Overview Conjugate Gradient Method (Part 4) Pre-conditioning Noninear
More informationMATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES
MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES Separation of variabes is a method to sove certain PDEs which have a warped product structure. First, on R n, a inear PDE of order m is
More informationTraffic data collection
Chapter 32 Traffic data coection 32.1 Overview Unike many other discipines of the engineering, the situations that are interesting to a traffic engineer cannot be reproduced in a aboratory. Even if road
More informationAvailable online at ScienceDirect. Procedia Computer Science 96 (2016 )
Avaiabe onine at www.sciencedirect.com ScienceDirect Procedia Computer Science 96 (206 92 99 20th Internationa Conference on Knowedge Based and Inteigent Information and Engineering Systems Connected categorica
More informationOptimality of Inference in Hierarchical Coding for Distributed Object-Based Representations
Optimaity of Inference in Hierarchica Coding for Distributed Object-Based Representations Simon Brodeur, Jean Rouat NECOTIS, Département génie éectrique et génie informatique, Université de Sherbrooke,
More informationCategories and Subject Descriptors B.7.2 [Integrated Circuits]: Design Aids Verification. General Terms Algorithms
5. oward Eicient arge-scae Perormance odeing o Integrated Circuits via uti-ode/uti-corner Sparse Regression Wangyang Zhang entor Graphics Corporation Ridder Park Drive San Jose, CA 953 wangyan@ece.cmu.edu
More information8 APPENDIX. E[m M] = (n S )(1 exp( exp(s min + c M))) (19) E[m M] n exp(s min + c M) (20) 8.1 EMPIRICAL EVALUATION OF SAMPLING
8 APPENDIX 8.1 EMPIRICAL EVALUATION OF SAMPLING We wish to evauate the empirica accuracy of our samping technique on concrete exampes. We do this in two ways. First, we can sort the eements by probabiity
More informationApproximated MLC shape matrix decomposition with interleaf collision constraint
Approximated MLC shape matrix decomposition with intereaf coision constraint Thomas Kainowski Antje Kiese Abstract Shape matrix decomposition is a subprobem in radiation therapy panning. A given fuence
More informationUniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete
Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity
More informationA Separability Index for Distance-based Clustering and Classification Algorithms
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL.?, NO.?, JANUARY 1? 1 A Separabiity Index for Distance-based Custering and Cassification Agorithms Arka P. Ghosh, Ranjan Maitra and Anna
More informationAnother Look at Linear Programming for Feature Selection via Methods of Regularization 1
Another Look at Linear Programming for Feature Seection via Methods of Reguarization Yonggang Yao, The Ohio State University Yoonkyung Lee, The Ohio State University Technica Report No. 800 November, 2007
More information