arxiv: v2 [cs.lg] 4 Sep 2014


Classification with Sparse Overlapping Groups

Nikhil S. Rao, Robert D. Nowak
Department of Electrical and Computer Engineering
University of Wisconsin-Madison

Christopher R. Cox (crcox@wisc.edu), Timothy T. Rogers (ttrogers@wisc.edu)
Department of Psychology
University of Wisconsin-Madison

Abstract

Classification with a sparsity constraint on the solution plays a central role in many high dimensional machine learning applications. In some cases, the features can be grouped together, so that entire subsets of features can be selected or not selected. In many applications, however, this can be too restrictive. In this paper, we are interested in a less restrictive form of structured sparse feature selection: we assume that while features can be grouped according to some notion of similarity, not all features in a group need be selected for the task at hand. When the groups are comprised of disjoint sets of features, this is sometimes referred to as the sparse group lasso, and it allows for working with a richer class of models than traditional group lasso methods. Our framework generalizes conventional sparse group lasso further by allowing for overlapping groups, an additional flexibility needed in many applications and one that presents further challenges. The main contribution of this paper is a new procedure called Sparse Overlapping Group (SOG) lasso, a convex optimization program that automatically selects similar features for classification in high dimensions. We establish model selection error bounds for SOGlasso classification problems under a fairly general setting. In particular, the error bounds are the first such results for classification using the sparse group lasso. Furthermore, the general SOGlasso bound specializes to results for the lasso and the group lasso, some known and some new.
The SOGlasso is motivated by multi-subject fMRI studies in which functional activity is classified using brain voxels as features, source localization problems in Magnetoencephalography (MEG), and the analysis of gene activation patterns in microarray data. Experiments with real and synthetic data demonstrate the advantages of SOGlasso compared to the lasso and group lasso.

1. Introduction

Binary classification plays a major role in many machine learning and signal processing applications. In many modern applications where the number of features far exceeds the number of samples, one typically wishes to select only a few features, meaning only a few coefficients are nonzero in the solution (a zero coefficient in the solution implies the corresponding feature is not selected). This corresponds to the case of searching for

sparse solutions. The notion of sparsity prevents over-fitting and leads to more interpretable solutions in high dimensional machine learning, and has been extensively studied in (Bach, 2010; Plan and Vershynin, 2013; Negahban et al., 2012; Bunea, 2008), among others.

In many applications, we wish to impose structure on the sparsity pattern of the coefficients recovered. In particular, often it is known a priori that the optimal sparsity pattern will tend to involve clusters or groups of coefficients, corresponding to pre-existing groups of features. The form of the groups is known, but the subset of groups that is relevant to the classification task at hand is unknown. This prior knowledge reduces the space of possible sparse coefficients, thereby potentially leading to better results than simple lasso methods. In such cases, the group lasso, with or without overlapping groups (Yuan and Lin, 2006), is used to recover the coefficients. The group lasso forces all the coefficients in a group to be active at once: if a coefficient is selected for the task at hand, then all the coefficients in that group are selected. When the groups overlap, a modification of the penalty allows one to recover coefficients that can be expressed as a union of groups (Jacob et al., 2009; Obozinski et al., 2011).

While the group lasso has enjoyed tremendous success in high dimensional feature selection applications, we are interested in a much less restrictive form of structured feature selection for classification. Suppose that the features can be arranged into (possibly) overlapping groups based on some notion of similarity, depending on the application. The notion of similarity can be loosely defined, and it is used to reflect the prior knowledge that if a feature is relevant for the learning task at hand, then features similar to it may also be relevant. It is known that while many features may be similar to each other, not all similar features are relevant for the specific learning problem.
We propose a new procedure called Sparse Overlapping Group (SOG) lasso to reflect this form of structured sparsity.

As an example, consider the task of identifying relevant genes that play a role in predicting a disease. Genes are organized into pathways (Subramanian et al., 2005), but not every gene in a pathway might be relevant for prediction. At the same time, it is reasonable to assume that if a gene from a particular pathway is relevant, then other genes from the same pathway may also be relevant. In such applications, the group lasso may be too constraining while the lasso may be too under-constrained.

1.1 Model and Results

We first present the main results of this paper at a glance. Uppercase and lowercase bold letters indicate matrices and vectors respectively. We assume a sparse learning framework, with a feature matrix $\Phi \in R^{n \times p}$, $n \ll p$. We assume each element of $\Phi$ to be distributed as a standard Gaussian random variable. Assuming the data to arise from a Gaussian distribution simplifies analysis, and allows us to leverage tools from existing literature. Later in the paper, we will allow for correlations in the features as well, reflecting a more realistic setting. In the results that follow, $C$ is a positive constant, the value of which can be different from one result to the other.

We focus on binary classification settings, and assume that each observation $y_i \in \{-1, +1\}$, $i = 1, 2, \ldots, n$, is randomly distributed according to the model (Plan and Vershynin, 2013)

$$E[y_i \mid \phi_i] = f(\langle \phi_i, x^\star \rangle), \qquad (1)$$

where $\phi_i$ is the $i$-th row of $\Phi$ corresponding to the features of data point $i$, $x^\star$ is the true coefficient vector of interest, and $f$ is a function mapping from $R$ to $[-1, +1]$. The argument of $f$ is the Euclidean inner product: $\langle \phi_i, x^\star \rangle = \phi_i^T x^\star$. The function $f$ need not be known precisely. We only assume that it satisfies, for $g \sim N(0, 1)$,

$$E[f(g) g] > 0. \qquad (2)$$

Without loss of generality, we assume $x^\star$ to have unit Euclidean norm, since the normalization can be absorbed into the function $f$. The value

$$\sigma_f := 1 / E[f(g) g] \qquad (3)$$

quantifies the strength of the correlation between the labels $y_i$ and inner products $\langle \phi_i, x^\star \rangle$. It plays the role of the noise level and will appear in our error bounds, but it need not be known to compute our proposed estimator of $x^\star$.

This set-up allows for the consideration of a very general setting for classification, and subsumes many interesting cases. For example, the logistic model

$$P(y_i = 1) = \frac{\exp(\beta \langle \phi_i, x^\star \rangle)}{1 + \exp(\beta \langle \phi_i, x^\star \rangle)} \qquad (4)$$

is equivalent to $f(\langle \phi_i, x^\star \rangle) = \tanh(\beta \langle \phi_i, x^\star \rangle / 2)$, for any constant $\beta > 0$, yielding a corresponding

$$\sigma_f = \frac{2}{\beta \, E[\mathrm{sech}^2(\beta g / 2)]} \le \frac{6}{\min\{\beta, 1\}}$$

(see footnote 2). The constant $\beta$ accounts for the fact that we consider $\|x^\star\| = 1$. Indeed, for a general vector $z$, we can write $\langle \phi_i, z \rangle = \beta \langle \phi_i, z / \|z\| \rangle$, where $\beta = \|z\|$. The second argument in the inner product is now a vector of unit norm, and it gives rise to the expression in (4).

The framework also allows for the quantized 1-bit measurement model $f(\langle \phi_i, x^\star \rangle) = \mathrm{sign}(\langle \phi_i, x^\star \rangle)$, which can be seen as the limiting case of the logistic model as $\beta \to \infty$, and with $\sigma_f = \sqrt{\pi / 2}$.

We work with this general formulation since for classification problems of interest, the logistic (or any other) model may be chosen somewhat arbitrarily. Existing theoretical results often apply only to the chosen model (for example, (Negahban et al., 2012)). Our estimator only requires that the observations are correlated with the features, in the sense of (3). In any such case, the underlying form of $f$ need not be known to compute our estimator of $x^\star$ and will enter in the error bounds only through $\sigma_f$.
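The constant in (3) is easy to evaluate numerically for a given link function. The following is a minimal sketch, not part of the original analysis: it approximates $E[f(g) g]$ by a Riemann sum over a fine grid and takes the reciprocal. The helper names `sigma_f` and `logistic_link`, the grid width, and the chosen values of the logistic constant are all illustrative assumptions.

```python
import numpy as np

def sigma_f(f, half_width=10.0, num=200_001):
    """Approximate 1 / E[f(g) g] for g ~ N(0, 1) with a Riemann sum.

    `half_width` and `num` are illustrative grid settings (assumptions).
    """
    g = np.linspace(-half_width, half_width, num)
    dg = g[1] - g[0]
    pdf = np.exp(-0.5 * g**2) / np.sqrt(2.0 * np.pi)
    corr = np.sum(f(g) * g * pdf) * dg  # E[f(g) g], positive by condition (2)
    return 1.0 / corr

def logistic_link(beta):
    """Logistic model written as f(z) = tanh(beta * z / 2)."""
    return lambda z: np.tanh(beta * z / 2.0)

# 1-bit model: f = sign, for which E[f(g) g] = E|g| = sqrt(2 / pi).
sigma_sign = sigma_f(np.sign)

# Logistic model for a few hypothetical values of beta.
sigma_logistic = {b: sigma_f(logistic_link(b)) for b in (0.1, 1.0, 10.0, 100.0)}

for b, s in sigma_logistic.items():
    print(f"beta = {b:6.1f}  value = {s:.4f}  6 / min(beta, 1) = {6 / min(b, 1):.1f}")
print(f"sign model value = {sigma_sign:.4f}")
```

Under these assumptions the logistic values stay below $6 / \min\{\beta, 1\}$ and approach the 1-bit value as $\beta$ grows, which is the limiting behaviour described in the text.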
Footnote 2: See Corollary 3.3 in (Plan and Vershynin, 2013) for a derivation.

We are interested in the following form of structured sparsity. Assume that the features can be organized into $K$ possibly overlapping groups, each consisting of $L$ features, based on a user-defined measure of similarity, depending on the application. Moreover, assume that if a certain feature is relevant for the learning task at hand, then features similar to it may

also be relevant. Note that we assume groups of equal size $L$ for convenience. It is easy to relax this assumption. These assumptions suggest a structured pattern of sparsity in the coefficients wherein a subset of the groups are relevant to the learning task, and within the relevant groups a subset of the features are selected. In other words, $x^\star \in R^p$ has the following structure: (i) its support is localized to a union of a subset of the groups, and (ii) its support is localized to a sparse subset within each such group.

Armed with these preliminaries, we state the theoretical sample complexity bounds proved later in the paper. If $x^\star \in R^p$ has $k \le K$ nonzero groups and $l \le L$ nonzero coefficients within each nonzero group, then

$$O\left( \sigma_f^2 \, k \left( \log\frac{K}{k} + l \log\frac{L}{l} + l \right) \right)$$

independent Gaussian measurements of the form (1) are sufficient to accurately estimate $x^\star$ by solving a convex program.

The statement above merits further explanation. We show in Lemma 12 (Section 4.1) via a combinatorial argument that to estimate any vector with parameters $k, K, l, L$ as stated above, $O( \sigma_f^2 k ( \log(K/k) + l \log(L/l) + l ) )$ samples would suffice. However, looking for vectors with these properties amounts to solving a non-convex program. When the groups do not overlap, we show that the solution to a convex program also succeeds in accurately estimating $x^\star$ using the same number of measurements. When the groups overlap, we show that the measurement bound holds with an additional factor of $R^2$, where $R \ge 1$ is the maximum number of groups that contain any one feature. In most applications of interest (e.g., the fMRI example discussed below), $R$ is small. Nonetheless, we also show that no matter how many groups a particular coefficient belongs to, the sample complexity of our proposed estimator is never greater than that of the standard lasso or the overlapping group lasso. These statements will be made more precise in the sequel.

1.2 Motivation: The SOGlasso for Multitask Learning

The SOGlasso is motivated in part by multitask learning applications.
The group lasso is a commonly used tool in multitask learning, and it encourages the same set of features to be selected across all tasks. We wish to focus on a less restrictive version of multitask learning, where the main idea is to encourage selection of features that are similar, but not identical, across tasks. This is accomplished by defining subsets of similar features and searching for solutions that select only a few subsets (common across tasks) and a sparse number of features within each subset (possibly different across tasks). Figure 1 shows an example of the patterns that typically arise in sparse multitask learning applications, along with the one we are interested in.

A major application that we are motivated by is the analysis of multi-subject fMRI data, where the goal is to predict a cognitive state from measured neural activity using voxels as features. Because brains vary in size and shape, neural structures can be aligned only crudely. Moreover, neural codes can vary somewhat across individuals (Feredoes et al., 2007). Thus, neuroanatomy provides only an approximate guide as to where relevant information is located across individuals: a voxel useful for prediction in one participant suggests


the general anatomical neighborhood where useful voxels may be found, but not the precise voxel. Past work in inferring sparsity patterns across subjects has involved the use of groupwise regularization (van Gerven et al., 2009), using the logistic lasso to infer sparsity patterns without taking into account the relationships across different subjects (Ryali et al., 2010), or using the elastic net penalty to account for groupings among coefficients (Rish et al., 2012). None of these methods accounts for both the common macrostructure and the differences in microstructure across brains, whereas the SOGlasso allows one to model both the commonalities and the differences across brains. Figure 2 sheds light on the motivation, and the grouping of voxels across brains into overlapping groups.

Figure 1 panels: (a) Sparse, (b) Group sparse, (c) Group sparse plus sparse, (d) Group sparse and sparse.

Figure 1: A comparison of different sparsity patterns in the multitask learning setting. Figure (a) shows a standard sparsity pattern. An example of group sparse patterns promoted by Glasso (Yuan and Lin, 2006) is shown in Figure (b). In Figure (c), we show the patterns considered in (Jalali et al., 2010). Finally, in Figure (d), we show the patterns we are interested in in this paper. The groups are sets of rows of the matrix, and can overlap with each other.

Figure 2: SOGlasso for fMRI inference. The figure shows three brains, and voxels in a particular anatomical region are grouped together, across all individuals (red and green ellipses). For example, the green ellipse in the brains represents a single group. The groups denote anatomically similar regions in the brain that may be co-activated. However, within activated regions, the exact location and number of voxels may differ, as seen from the green spots.

1.3 Our Contributions

In this paper, we consider binary classification with a constraint on the structure of the sparsity pattern of the coefficients. We assume that the coefficients can be arranged (according to a predefined notion of similarity) into (overlapping) groups. Not only are only a few groups selected, but the selected groups themselves are also sparse. In this sense, our constraint can be seen as an extension of the sparse group lasso (Simon et al., 2013) to overlapping groups, where the sparsity pattern lies in a union of groups. We are mainly interested in classification problems, but the method can also be applied to regression settings with an appropriate change of the loss function.

We consider a union-of-groups formulation as in (Jacob et al., 2009), but with an additional sparsity constraint on the selected groups. To this end, we analyze the Sparse Overlapping Group (SOG) lasso, where the overlapping groups might correspond to coefficients of features arbitrarily grouped according to the notion of similarity. We also consider a very general classification setting, and do not make restrictive assumptions on the observation model.

We introduce a constraint that promotes sparsity patterns that can be expressed as a union of sparsely activated groups. We show that the constraint is a tight convex relaxation of the set of coefficients having the sparsity pattern we are interested in. The main contribution of this paper is a theoretical analysis of the model selection consistency of the SOGlasso estimator, under a very general binary classification setting. Based on certain parameter values, our method reduces to other known cases of penalization for sparse high dimensional recovery. Specifically, under the logistic regression model, our method generalizes the group lasso (Meier et al., 2008; Jacob et al., 2009), and also extends to handle groups that can arbitrarily overlap with each other.
We also recover results for the lasso for logistic regression (Bunea, 2008; Negahban et al., 2012; Plan and Vershynin, 2013; Bach, 2010). In this sense, our work unifies the lasso, the group lasso, and the sparse group lasso, and extends them to handle overlapping groups. At the same time, our methods apply to settings beyond logistic regression, to include a far richer class of models. To the best of our knowledge, this is the first paper that provides such a unified theory and sample complexity bounds for all these methods.

In the case of linear regression and multitask learning, the authors in (Sprechmann et al., 2010, 2011) consider a similar situation with non-overlapping subsets of features. We assume that the groups of features can arbitrarily overlap. When the groups overlap, the methods mentioned above suffer from a drawback: entire groups are set to zero, in effect zeroing out many coefficients that might be relevant to the tasks at hand. This has undesirable effects in many applications of interest, and the authors in (Jacob et al., 2009) propose a version of the group lasso to circumvent this issue.

We also test our regularizer on both toy and real datasets. Our experiments reinforce our theoretical results, and demonstrate the advantages of the SOGlasso over standard lasso and group lasso methods, when the features can indeed be grouped according to some notion of similarity. We show that the SOGlasso is especially useful in multitask Functional Magnetic Resonance Imaging (fMRI) applications, and gene selection applications in computational biology.

To summarize, the main contributions of this paper are the following:

1. New regularizers for structured sparsity: We propose the Sparse Overlapping Group (SOG) lasso, a convex optimization problem that encourages the selection of coefficients that are both within- and across-group sparse. The groups can arbitrarily overlap, and the pattern obtained can be represented as a union of a small number of groups. This generalizes other known methods, and provides a common regularizer that can be used for any structured sparse problem with two levels of hierarchy (see footnote 3): groups at a higher level, and singletons at the lower level.

2. New theory for classification with structured sparsity: We provide a theoretical analysis for the model selection consistency of the SOGlasso estimator, for binary classification. The general results we obtain specialize to the lasso, the group lasso (with or without overlapping groups) and the sparse group lasso. We also make minimal assumptions on the measurement model, allowing for theory that is applicable in a wide range of classification settings. We obtain a bound on the sample complexity of the SOGlasso under both independent and correlated Gaussian measurement designs, and this in turn also translates to corresponding results for the lasso and the group lasso. In this sense, we obtain a unified theory for performing structured feature selection in high dimensions.

3. Applications: A major motivating application for this work is the analysis of multi-subject fMRI data. We apply the SOGlasso to fMRI data, and show that the results we obtain not only yield lower errors on hold-out test sets compared to previous methods, but also lead to more interpretable results. To show its applicability to other domains, we also apply the method to breast cancer data to detect genes that are relevant in the prediction of metastasis in breast cancer tumors.

In (Rao et al., 2013), the authors introduced the SOGlasso problem, emphasizing the motivating fMRI application. The authors also derived theoretical consistency results under the linear regression setting.
This paper further elaborates on the reasons for considering the SOGlasso penalty, and derives consistency results for classification. We also present some novel applications in computational biology where similar notions can be applied to achieve significant gains over existing methods. Our work here presents novel results for the group lasso with potentially overlapping groups as well as the sparse group lasso for classification settings, as special cases of the theory we develop.

1.4 Past Work

When the groups of features do not overlap, (Simon et al., 2013) proposed the Sparse Group Lasso (SGL) to recover coefficients that are both within- and across-group sparse. SGL and its variants for multitask learning have found applications in character recognition (Sprechmann et al., 2011, 2010), climate and oceanology applications (Chatterjee et al., 2011), and in gene selection in computational biology (Simon et al., 2013). In (Jenatton et al., 2011), the authors extended the method to handle tree structured sparsity patterns, and showed that the resulting optimization problem admits an efficient implementation in terms of proximal point operators.

Footnote 3: Further levels can also be added as in (Jenatton et al., 2011), but that is beyond the scope of this paper.

Along related lines, the exclusive lasso (Zhou et al.,

2010) can be used when it is explicitly known that features in certain groups are negatively correlated. When the groups overlap, (Jacob et al., 2009; Obozinski et al., 2011) proposed a modification of the group lasso penalty so that the resulting coefficients can be expressed as a union of groups. They proposed a replication-based strategy for solving the problem, which has since found application in computational biology (Jacob et al., 2009) and image processing (Rao et al., 2011), among others. The authors in (Mosci et al., 2010) proposed a method to solve the same problem in a primal-dual framework that does not require coefficient replication. Risk bounds for problems with structured sparsity inducing penalties (including the lasso and group lasso) were obtained by (Maurer and Pontil, 2012) using Rademacher complexities. Sample complexity bounds for model selection in linear regression using the group lasso (with possibly overlapping groups) also exist (Rao et al., 2012). The results naturally hold for the standard group lasso (Yuan and Lin, 2006), since non-overlapping groups are a special case.

For the non-overlapping case, (Stojnic et al., 2009) characterized the sample complexity of the group lasso, and also gave a semidefinite program to solve the group lasso under a block sparsity setting.

For logistic regression, (Bach, 2010; Bunea, 2008; Negahban et al., 2012; Plan and Vershynin, 2013) and references therein have extensively characterized the sample complexity of identifying the correct model using $\ell_1$ regularized optimization. The authors in (Negahban et al., 2012) extended their results to include Generalized Linear Models (GLMs) as well. In (Plan and Vershynin, 2013), the authors introduced a new framework to solve the classification problem: minimize a linear cost function (see footnote 4) subject to a constraint on the $\ell_1$ norm of the solution.

1.5 Organization

The rest of the paper is organized as follows: in Section 2, we formally state our structured sparse feature selection problem and the main results of this paper.
Then in Section 3, we argue that the regularizer we propose does indeed help in recovering coefficient sparsity patterns that are both within- and across-group sparse, even when the groups overlap. In Section 4, we leverage ideas from (Plan and Vershynin, 2013) and derive measurement bounds and consistency results for the SOGlasso under a logistic regression setting. We also extend these results to handle data with correlations in their entries. We perform experiments on real and toy data in Section 5, before concluding the paper and mentioning avenues for future research in Section 6.

Footnote 4: The authors in (Plan and Vershynin, 2013) write the problem as a maximization. We minimize the negative of the same function.

2. Main Results: Classification with Structured Sparsity

We now return to the problem that we wish to solve in this paper, and state our main results in a more formal way. Recall that we are interested in recovering a coefficient vector $x^\star$ from (corrupted) linear observations of the form $\langle \phi_i, x^\star \rangle$. The coefficient vector of interest is assumed to have a special structure. Specifically, we assume that $x^\star \in \mathcal{C} \subseteq B_2^p$, where $B_2^p$ is the unit Euclidean ball in $R^p$. This motivates the

following optimization problem (Plan and Vershynin, 2013):

$$\hat{x} = \arg\min_x \sum_{i=1}^n -y_i \langle \phi_i, x \rangle \quad \text{s.t.} \quad x \in \mathcal{C}. \qquad (5)$$

The function to be optimized has a very natural interpretation. We assume without loss of generality that the observations are positively correlated with the inner products between the features and the coefficient vector (if the correlation is negative, the signs of $y_i$ can be reversed). Hence, a natural thing to do would be to maximize the number of sign agreements between $y_i$ and $\langle \phi_i, x \rangle$. Minimizing the objective in (5) maximizes the sum of the products $y_i \langle \phi_i, x \rangle$, a linear relaxation of the quantity we wish to optimize.

The statistical accuracy of $\hat{x}$ can be characterized in terms of the mean width of $\mathcal{C}$, which is defined as follows.

Definition 1 Let $g \sim N(0, I)$. The mean width of a set $\mathcal{C}$ is defined as

$$\omega(\mathcal{C}) = E_g \left[ \sup_{x \in \mathcal{C} - \mathcal{C}} \langle x, g \rangle \right],$$

where $\mathcal{C} - \mathcal{C}$ denotes the Minkowski set difference.

We now restate a result from (Plan and Vershynin, 2013).

Theorem 2 (Corollary 1.2 in (Plan and Vershynin, 2013)) Let $\Phi \in R^{n \times p}$ be a matrix with i.i.d. standard Gaussian entries, and let $\mathcal{C} \subseteq B_2^p$. Fix $x^\star \in \mathcal{C}$, and assume the observations follow the model (1) above. Then, for $\delta > 0$, if

$$n \ge C \delta^{-2} \sigma_f^2 \, \omega(\mathcal{C})^2,$$

then with probability at least $1 - 8 \exp(-c (\delta / \sigma_f)^2 n)$, the solution $\hat{x}$ to the problem (5) satisfies

$$\|\hat{x} - x^\star\|_2^2 \le \delta,$$

with $\sigma_f$ defined in (3).

We abuse notation and define the mean width as

$$\omega(\mathcal{C}) = E_g \left[ \sup_{x \in \mathcal{C}} \langle x, g \rangle \right]. \qquad (6)$$

The quantity defined above is a constant multiple of that in the original definition for centrally symmetric sets $\mathcal{C}$ (Plan and Vershynin, 2013), which will be the case for the remainder of this paper.

In this paper, we construct a new penalty that produces a convex set $\mathcal{C}$ that encourages structured sparsity in the solution of (5). We show that the resulting optimization can be efficiently solved. We bound the mean width of the set, which yields new bounds for classification with structured sparsity, via Theorem 2. We state the main results in this section, and defer the proofs to Section 4.
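To make the estimator in (5) concrete, here is a minimal sketch for the simplest possible constraint set. It assumes that the constraint set is the whole unit ball $B_2^p$ (an illustrative simplification, not the SOGlasso set constructed later), in which case the linear objective is minimized in closed form by $\hat{x} = \Phi^T y / \|\Phi^T y\|_2$. The problem sizes, the random seed and the choice of the 1-bit model are likewise assumptions made only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n, p = 2000, 50                 # illustrative problem sizes

# Unit-norm ground truth with a few nonzero coefficients.
x_star = np.zeros(p)
x_star[:5] = 1.0
x_star /= np.linalg.norm(x_star)

# Gaussian features and 1-bit observations y_i = sign(<phi_i, x_star>).
Phi = rng.standard_normal((n, p))
y = np.sign(Phi @ x_star)

# With the unit ball as constraint set, minimizing -sum_i y_i <phi_i, x>
# is the same as maximizing <Phi^T y, x> over ||x||_2 <= 1, so the
# minimizer is the normalized vector Phi^T y.
v = Phi.T @ y
x_hat = v / np.linalg.norm(v)

err = np.linalg.norm(x_hat - x_star) ** 2
print(f"squared estimation error: {err:.4f}")
```

A smaller constraint set that still contains the true vector has a smaller mean width, which is how Theorem 2 turns structural knowledge into fewer required measurements.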

2.1 A New Penalty for Structured Sparsity

Assume that the features can be grouped according to similarity into $K$ (possibly overlapping) groups $\mathcal{G} = \{G_1, G_2, \ldots, G_K\}$ with the largest group being of size $L$, and consider the following definition of structured sparsity.

Definition 3 We say that a vector $x$ is $(k, l)$-group sparse if $x$ is supported on at most $k \le K$ groups and at most $l$ elements in each active group are nonzero. Note that $l = 0$ corresponds to $x = 0$.

To encourage such sparsity patterns we define the following penalty. Given a group $G \in \mathcal{G}$, we define the set

$$W_G = \{ w \in R^p : w_i = 0 \text{ if } i \notin G \}.$$

We can then define

$$\mathcal{W}(x) = \left\{ w_{G_1} \in W_{G_1}, w_{G_2} \in W_{G_2}, \ldots, w_{G_K} \in W_{G_K} : \sum_{G \in \mathcal{G}} w_G = x \right\}.$$

That is, each element of $\mathcal{W}(x)$ is a set of vectors, one from each $W_G$, such that the vectors sum to $x$. As shorthand, in the sequel we write $\{w_G\} \in \mathcal{W}(x)$ to mean a set of vectors that form an element in $\mathcal{W}(x)$.

For any $x \in R^p$, define

$$h(x) := \inf_{\{w_G\} \in \mathcal{W}(x)} \sum_{G \in \mathcal{G}} \left( \alpha_G \|w_G\|_2 + \beta_G \|w_G\|_1 \right), \qquad (7)$$

where the $\alpha_G, \beta_G > 0$ are constants that trade off the contributions of the $\ell_2$ and the $\ell_1$ norm terms per group, respectively. The SOGlasso is the optimization in (5) with $h(x)$ as defined in (7) determining the structure of the constraint set $\mathcal{C}$, and hence the form of the solution $\hat{x}$. The $\ell_2$ penalty promotes the selection of only a subset of the groups, and the $\ell_1$ penalty promotes the selection of only a subset of the features within a group.

To keep the exposition simple, we will work with the following definition of $h(x)$ in the rest of the paper:

$$h(x) := \inf_{\{w_G\} \in \mathcal{W}(x)} \sum_{G \in \mathcal{G}} \left( \|w_G\|_2 + \frac{\lambda_1}{\sqrt{l}} \|w_G\|_1 \right). \qquad (8)$$

Note that the value of $\lambda_1 \ge 0$ can be varied to either emphasize or de-emphasize the $\ell_1$ penalty. In almost all applications of interest, the value of $l$ will be unknown, and the quantity $\lambda_1 / \sqrt{l}$ needs to be tuned via cross validation. However, for the sake of proving our theorems, we will assume that the quantity is known (our goal is to recover vectors that are $(k, l)$-group sparse).
This is consistent with other results in the literature where it is assumed that the parameters are known.

Definition 4 We say the set of vectors $\{w_G\} \in \mathcal{W}(x)$ is an optimal representation of $x$ if they achieve the inf in (8).
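When the groups do not overlap, the decomposition of $x$ into group-supported vectors is unique, so the infimum in (8) is attained by the restrictions of $x$ to each group and the penalty can be evaluated directly. The sketch below covers only this special case; the overlapping case requires solving the optimization over decompositions and is not attempted here. The function name `h_nonoverlapping`, the example vector and the parameter values are illustrative assumptions, and indices are zero-based, unlike the one-based indexing of the text.

```python
import numpy as np

def h_nonoverlapping(x, groups, lam1, l):
    """Evaluate the penalty in (8) for disjoint groups.

    Assumes `groups` is a list of disjoint index arrays, so the only
    decomposition of x is w_G = x restricted to G.
    """
    mu = lam1 / np.sqrt(l)
    return sum(
        np.linalg.norm(x[g], 2) + mu * np.linalg.norm(x[g], 1) for g in groups
    )

# Three disjoint groups of four features each (hypothetical example).
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]

# A unit-norm vector using k = 2 groups with at most l = 2 nonzeros per group.
x = np.zeros(12)
x[[0, 1]] = [3.0, 4.0]
x[8] = 12.0
x /= np.linalg.norm(x)

k, l, lam1 = 2, 2, 1.0
value = h_nonoverlapping(x, groups, lam1, l)

# Since ||w||_1 <= sqrt(l) ||w||_2 for an l-sparse w, and the k active group
# norms of a unit vector sum to at most sqrt(k), the penalty of a unit-norm
# (k, l)-group sparse vector is at most sqrt(k) * (1 + lam1).
threshold = np.sqrt(k) * (1.0 + lam1)
print(f"h(x) = {value:.4f}, sqrt(k) (1 + lam1) = {threshold:.4f}")
```

The division by $\sqrt{l}$ in (8) is what makes the two terms comparable: for an $l$-sparse block the scaled $\ell_1$ term never exceeds $\lambda_1$ times the $\ell_2$ term.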

The objective function in (8) is convex and coercive. Hence, for all $x$, an optimal representation always exists.

The function $h(x)$ yields a convex relaxation for $(k, l)$-group sparsity. Define the constraint set

$$\mathcal{C} = \{ x : h(x) \le \sqrt{k} (1 + \lambda_1), \ \|x\|_2 \le 1 \}. \qquad (9)$$

We show that $\mathcal{C}$ is convex and contains all $(k, l)$-group sparse vectors. We compute the mean width of $\mathcal{C}$ in (9), and subsequently obtain the following result:

Theorem 5 Suppose there exists a coefficient vector $x^\star$ that is $(k, l)$-group sparse. Suppose the data matrix $\Phi \in R^{n \times p}$ and observation model follow the setting in Theorem 2. Suppose we solve (5) for the constraint set given by (9). For $\delta > 0$ and a constant $C$, if the number of measurements satisfies

$$n \ge C \delta^{-2} \sigma_f^2 \, k \, \min\left\{ (1 + \lambda_1)^2 (\log(K) + L), \ \left( \frac{1 + \lambda_1}{\lambda_1} \right)^2 l \log(p) \right\},$$

then with high probability, the solution of the SOGlasso satisfies

$$\|\hat{x} - x^\star\|_2^2 \le \delta.$$

Remarks

The results of Theorem 5 yield new results for classification with structured sparsity under the general binary observation setting (1). Specifically, note that the SOGlasso interpolates between the standard $\ell_1$ regularized (lasso) and the group $\ell_1$ regularized (group lasso) classification techniques:

- When $\lambda_1 = 0$, we obtain results for the group lasso. The result remains the same whether or not the groups overlap. The bound is given by $n \ge C \delta^{-2} \sigma_f^2 \, k (\log(K) + L)$. Note that this result is similar to that obtained for the linear regression case by the authors in (Rao et al., 2012).

- When all the groups are singletons ($L = l = 1$), the bound reduces to that for the standard lasso, with $KL = p$ being the ambient dimension. In this case, the signal sparsity is $s := kl = k$ and the bound becomes $n \ge C \delta^{-2} \sigma_f^2 \, k \log(p)$.

The SOGlasso generalizes the lasso and the group lasso, and allows one to recover signals that are sparse, group sparse, or a combination of the two structures. Moreover, since the model we consider subsumes the logistic regression setting, we obtain results for logistic regression with a general structured sparsity constraint on the solution.
To the best of our knowledge, these are the first known sample complexity bounds for the group lasso for logistic regression with overlapping groups, and the sparse group lasso, both of which arise as special cases of the SOGlasso.
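The two special cases in the remarks can be checked by evaluating the bound directly. The sketch below reads the bound of Theorem 5 as $C \delta^{-2} \sigma_f^2 \, k \min\{ (1 + \lambda_1)^2 (\log K + L), ((1 + \lambda_1) / \lambda_1)^2 \, l \log p \}$. The function name `sog_bound`, the default constants set to 1 and the example sizes are assumptions for illustration only; the true constant $C$ is unspecified, so the returned numbers are meaningful only relative to each other.

```python
import math

def sog_bound(k, l, K, L, p, lam1, sigma_f=1.0, delta=1.0, C=1.0):
    """Evaluate the measurement bound, up to the unknown constant C.

    The second term of the minimum is taken to be infinite when lam1 == 0.
    """
    group_term = (1.0 + lam1) ** 2 * (math.log(K) + L)
    if lam1 == 0:
        sparse_term = math.inf
    else:
        sparse_term = ((1.0 + lam1) / lam1) ** 2 * l * math.log(p)
    return C * sigma_f**2 / delta**2 * k * min(group_term, sparse_term)

# Group lasso case: lam1 = 0 leaves only k (log K + L).
group_case = sog_bound(k=5, l=3, K=100, L=20, p=2000, lam1=0.0)

# Lasso case: singleton groups (L = l = 1, K = p); a large lam1 makes the
# second term approach k log p.
lasso_case = sog_bound(k=5, l=1, K=2000, L=1, p=2000, lam1=1e6)

print(f"group lasso case: {group_case:.2f}")
print(f"lasso case:       {lasso_case:.2f}")
```

In this reading, increasing $\lambda_1$ moves the bound from the group lasso regime toward the lasso regime, matching the interpolation described in the remarks.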

Problem (5) admits an efficient solution. Specifically, we can use the feature replication strategy as in (Jacob et al., 2009) to reduce the problem to a sparse group lasso, and use proximal point methods to recover the coefficient vector. We elaborate on this in more detail later in the paper.

Theorem 5 bounds the number of measurements sufficient for accurate estimation using the overlapping group lasso and the lasso. A natural question to then ask is whether one can do better when it is known that the vectors we are interested in are further constrained by the number of nonzero entries in active groups. When it is known that each coefficient belongs to at most $R$ groups (the value of $R$ will always be known, since we assume that the groups $\mathcal{G}$ are known), we obtain the following sample complexity bound.

Theorem 6 Suppose the coefficient vector is $(k, l)$-group sparse, with everything else the same as in Theorem 5. Suppose that each coefficient belongs to at most $R$ groups. Then, with high probability, if the number of measurements satisfies

$$n \ge C \delta^{-2} \sigma_f^2 R^2 k \left[ \log\frac{K}{k} + l \log\frac{L}{l} + l + 2 \right],$$

we have

$$\|\hat{x} - x^\star\|_2^2 \le \delta.$$

The above result yields a tight bound when the groups do not overlap. Indeed, when $R = 1$ we see that the sample complexity bound is a function of the logarithm of the number of groups, and the overall sparsity of the signal $kl$. This is a tighter result than the bound obtained for the group lasso, where the second term would be $O(kL)$. We pay a price of $R^2$ for overlapping groups, but in most practical examples we are interested in, $R$ is typically small, and is a constant.

3. Analysis of the SOGlasso Penalty

Recall the definition of $h(x)$ from (8):

$$h(x) = \inf_{\{w_G\} \in \mathcal{W}(x)} \sum_{G \in \mathcal{G}} \left( \|w_G\|_2 + \mu \|w_G\|_1 \right), \qquad (10)$$

where we set $\mu = \lambda_1 / \sqrt{l}$.

Remarks: The SOGlasso penalty can be seen as a generalization of different penalty functions previously explored in the context of sparse linear regression and/or classification:

- If each group in $\mathcal{G}$ is a singleton, then the SOGlasso penalty reduces to the standard $\ell_1$ norm, and the problem reduces to the lasso (Tibshirani, 1996; Bunea, 2008).

- If $\lambda_1 = 0$ in (8), then we are left with the latent group lasso (Jacob et al., 2009; Obozinski et al., 2011; Rao et al., 2012). This allows us to recover sparsity patterns that can be expressed as lying in a union of groups. If a group is selected, then all the coefficients in the group are selected.

- If the groups $G \in \mathcal{G}$ are non-overlapping, then (8) reduces to the sparse group lasso (Simon et al., 2013). Of course, for non-overlapping groups, if $\lambda_1 = 0$, then we get the standard group lasso (Yuan and Lin, 2006).

Figure 3 shows the effect that the parameter $\mu$ has on the shape of the ball $\|w_G\| + \mu \|w_G\|_1 \le \delta$, for a single two-dimensional group $G$.

[Figure 3 panels: (a) µ = 0, (b) µ = 0.2, (c) µ = 1, (d) µ = 10]
Figure 3: Effect of $\mu$ on the shape of the set $\|w_G\| + \mu \|w_G\|_1 \le \delta$, for a two-dimensional group $G$. $\mu = 0$ (a) yields level sets of the $\ell_2$ norm ball. As the value of $\mu$ is increased, the effect of the $\ell_1$ norm term increases (b), (c). Finally, as $\mu$ gets very large, the level sets resemble those of the $\ell_1$ ball (d).

3.1 Properties of SOGlasso Penalty

The example in Table 1 gives an insight into the kind of sparsity patterns preferred by the function $h(x)$. We will tend to prefer solutions that have a small value of $h(\cdot)$. Consider 3 instances of $x \in \mathbb{R}^{10}$, and the corresponding group lasso, $\ell_1$ norm, and $h(x)$ function values. The vector is assumed to be made up of two groups, $G_1 = \{1, 2, 3, 4, 5\}$ and $G_2 = \{6, 7, 8, 9, 10\}$. $h(x)$ is smallest when the support set is sparse within groups, and also when only one of the two groups is selected (column 5). The $\ell_1$ norm does not take into account sparsity across groups (column 4), while the group lasso norm does not take into account sparsity within groups (column 3). Since the groups do not overlap, the latent group lasso penalty reduces to the group lasso penalty and $h(x)$ reduces to the sparse group lasso penalty.

  Support            Values             sum_G ||x_G||   ||x||_1   sum_G (||x_G|| + ||x_G||_1)
  {1, 4, 9}          {3, 4, 7}          12              14        26
  {1, 2, 3, 4, 5}    {2, 5, 2, 4, 5}    8.602           18        26.602
  {1, 3, 4}          {3, 4, 7}          8.602           14        22.602

Table 1: Different instances of a 10-d vector and their corresponding norms.

The next table shows that $h(x)$ indeed favors solutions that are not only group sparse, but also exhibit sparsity within groups when the groups overlap. Consider again a 10-dimensional vector $x$ with three overlapping groups $\{1, 2, 3, 4\}$, $\{3, 4, 5, 6, 7\}$ and $\{7, 8, 9, 10\}$.
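As a quick check of Table 1 before turning to the overlapping case, the three penalties can be recomputed directly from the supports and values listed there. This is a minimal sketch, assuming $\mu = 1$ (so the last column is the plain sum of the group lasso and $\ell_1$ terms) and the 1-based indexing used in the text; the function name `penalties` is hypothetical.

```python
import math

# The two non-overlapping groups of the Table 1 example (1-based indices).
GROUPS = [range(1, 6), range(6, 11)]

def penalties(support, values, mu=1.0):
    """Return (group lasso norm, l1 norm, sparse group lasso penalty)."""
    x = dict(zip(support, values))  # index -> value; all other entries are zero
    group_l2 = [math.sqrt(sum(x.get(i, 0.0) ** 2 for i in g)) for g in GROUPS]
    group_l1 = [sum(abs(x.get(i, 0.0)) for i in g) for g in GROUPS]
    glasso = sum(group_l2)
    l1 = sum(group_l1)
    return glasso, l1, glasso + mu * l1

rows = [
    penalties([1, 4, 9], [3, 4, 7]),
    penalties([1, 2, 3, 4, 5], [2, 5, 2, 4, 5]),
    penalties([1, 3, 4], [3, 4, 7]),
]
for row in rows:
    print(tuple(round(v, 3) for v in row))
```

The third instance, which is sparse within a single group, attains the smallest combined penalty, while the $\ell_1$ norm alone cannot separate it from the first instance and the group lasso norm alone cannot separate it from the second.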
Suppose the vector $x = [\ ]^T$. From the form of the function in (8), we see that the vector can be seen as a sum of three vectors $w_i$, $i = 1, 2, 3$, corresponding to the three groups listed above. Consider the following instances of the $w_i$ vectors, which are all feasible solutions for the optimization problem in (10):

1. $w_1 = [\ ]^T$, $w_2 = [\ ]^T$, $w_3 = [\ ]^T$
2. $w_1 = [\ ]^T$, $w_2 = [\ ]^T$, $w_3 = [\ ]^T$
3. $w_1 = [\ ]^T$, $w_2 = [\ ]^T$, $w_3 = [\ ]^T$
4. $w_1 = [\ ]^T$, $w_2 = [\ ]^T$, $w_3 = [\ ]^T$

In the above list, the first instance corresponds to the case where the support is localized to two groups, and one of these groups (group 2) has only one zero. The second case corresponds to the case where all 3 groups have nonzeros in them. The third case has support localized to two groups, and both groups are sparse. Finally, the fourth case has only the second group having nonzero coefficients, and this group is also sparse. Table 2 shows that the smallest value of the sum of the terms is achieved by the fourth decomposition, and hence that will correspond to the optimal representation.

  A = ||w_1|| + µ||w_1||_1   B = ||w_2|| + µ||w_2||_1   C = ||w_3|| + µ||w_3||_1   A + B + C
  1 + µ                      2 + 4µ                     0                          3 + 5µ
  1 + µ                      1 + µ                      1 + µ                      3 + 3µ
  0                          sqrt(2) + 2µ               1 + µ                      1 + sqrt(2) + 3µ
  0                          sqrt(3) + 3µ               0                          sqrt(3) + 3µ

Table 2: Values of the sum of the $\ell_1$ and $\ell_2$ norms corresponding to the decompositions listed above. Note that the optimal representation corresponds to the case $w_1 = w_3 = 0$, with $w_2$ being a sparse vector.

Lastly, we can show that $h(x)$ is a norm. This implies that $h(x)$ is convex, hence the penalty we consider is convex, and the optimization problem we are interested in solving is a convex program.

Lemma 7 The function

  $h(x) = \inf_{\{w_G\} \in W(x)} \sum_{G \in \mathcal{G}} \left( \|w_G\|_2 + \mu \|w_G\|_1 \right)$

is a norm.

Proof It is trivial to show that $h(x) \ge 0$ with equality iff $x = 0$. We now show positive homogeneity. Suppose $\{w_G\} \in W(x)$ is an optimal representation (Definition 4) of $x$, and let $\gamma \in \mathbb{R} \setminus \{0\}$. Then $\sum_G w_G = x \Rightarrow \sum_G \gamma w_G = \gamma x$. This leads to the following set of inequalities:

  $h(x) = \sum_G \left( \|w_G\|_2 + \mu \|w_G\|_1 \right) = \frac{1}{|\gamma|} \sum_G \left( \|\gamma w_G\|_2 + \mu \|\gamma w_G\|_1 \right) \ge \frac{1}{|\gamma|} h(\gamma x)$   (11)

Now, assuming $\{v_G\} \in W(\gamma x)$ is an optimal representation of $\gamma x$, we have that $\sum_G v_G / \gamma = x$, and we get

  $h(\gamma x) = \sum_G \left( \|v_G\|_2 + \mu \|v_G\|_1 \right) = |\gamma| \sum_G \left( \left\| \frac{v_G}{\gamma} \right\|_2 + \mu \left\| \frac{v_G}{\gamma} \right\|_1 \right) \ge |\gamma| h(x)$   (12)

Positive homogeneity follows from (11) and (12). The inequalities are a result of the possibility of the vectors not corresponding to the respective optimal representations. For the triangle inequality, again let $\{w_G\} \in W(x)$ and $\{v_G\} \in W(y)$ correspond to the optimal representations of $x$ and $y$ respectively. Then by definition,

  $h(x + y) \le \sum_G \left( \|w_G + v_G\|_2 + \mu \|w_G + v_G\|_1 \right) \le \sum_G \left( \|w_G\|_2 + \|v_G\|_2 + \mu \|w_G\|_1 + \mu \|v_G\|_1 \right) = h(x) + h(y)$

The first and second inequalities follow by definition and the triangle inequality respectively.

3.2 Solving the SOGlasso Problem

We solve the Lagrangian version of the SOGlasso problem:

  $\hat{x} = \arg\min_x \left( \sum_{i=1}^n -y_i \langle \phi_i, x \rangle \right) + \lambda_1 h(x) + \lambda_2 \|x\|^2,$

where $\lambda_1 > 0$ controls the amount by which we regularize the coefficients to have a structured sparsity pattern, and $\lambda_2 > 0$ prevents the coefficients from taking very large values. We use the covariate duplication method of (Jacob et al., 2009) to first reduce the problem to the non-overlapping sparse group lasso in an expanded space. One can then use proximal methods to recover the coefficients.

Proximal point methods progress by taking a step in the direction of the negative gradient, and applying a shrinkage/proximal point mapping to the iterate. This mapping can be computed efficiently for the non-overlapping sparse group lasso, as it is a special case of general hierarchical structured penalties (Jenatton et al., 2011). The proximal point mapping can be seen as the composition of the standard soft thresholding and the group soft thresholding operators:

  $\tilde{w} = \mathrm{sign}(w_r) \, [\, |w_r| - \lambda_1 \mu \,]_+$
  $(w_{t+1})_G = \frac{\tilde{w}_G}{\|\tilde{w}_G\|} \, [\, \|\tilde{w}_G\| - \lambda_1 \,]_+$ if $\|\tilde{w}_G\| \ne 0$
  $(w_{t+1})_G = 0$ otherwise

where $w_r$ corresponds to the iterate after a gradient step and $[\cdot]_+ = \max(0, \cdot)$. Once the solution is obtained in the duplicated space, we recombine the duplicates to obtain the solution in the original space. Finally, we perform a debiasing step to obtain the final solution.

4. Proof of Theorem 5, Theorem 6 and Extensions to Correlated Data

In this section, we compute the mean width of the constraint set $C$ in (9), which will be used to prove Theorems 5 and 6. First we define the following function, analogous to the $\ell_0$ pseudo-norm:

Definition 8 Given a set of $K$ groups $\mathcal{G}$, for any vector $x$ and its optimal representation $\{w_G\} \in W(x)$, noting that $x = \sum_{G \in \mathcal{G}} w_G$, define

  $\|x\|_{\mathcal{G},0} = \sum_{G \in \mathcal{G}} \mathbb{1}_{\{\|w_G\| \ne 0\}}.$

In the above definition, $\mathbb{1}_{\{\cdot\}}$ is the indicator function. Define the set

  $C_{nc}(k, \alpha) = \left\{ x : x = \sum_{G \in \mathcal{G}} w_G, \ \|x\|_{\mathcal{G},0} \le k, \ \|w_G\|_0 \le \alpha k \ \forall G \in \mathcal{G} \right\}.$   (13)

We see that $C_{nc}(k, \alpha)$ contains $(k, \alpha)$-group sparse signals (Definition 3). From the above definitions and our problem setup, our aim is ideally to solve the following optimization problem:

  $\hat{x} = \arg\min_x \sum_{i=1}^n -y_i \langle \phi_i, x \rangle \quad \text{s.t.} \quad x \in C_{nc}(k, \alpha)$   (14)

However, the set $C_{nc}(k, \alpha)$ is not convex, and hence solving (14) will be hard in general. We instead consider a convex relaxation of the above problem. The convex relaxation of the (overlapping) group $\ell_0$ pseudo-norm is the (overlapping) group $\ell_1/\ell_2$ norm. This leads to the following result:

Lemma 9 The SOGlasso penalty (10) admits a convex relaxation of $C_{nc}(k, \alpha)$. Specifically, we can consider the set

  $C(k, \alpha) = \{ x : h(x) \le \sqrt{k} (1 + \lambda_1) \|x\|_2 \}$

as a tight convex relaxation containing the set $C_{nc}(k, \alpha)$.

Proof Consider a $(k, \alpha)$-group sparse vector $x \in C_{nc}(k, \alpha)$. For any such vector, there exist vectors $\{v_G\} \in W(x)$ such that the supports of the $v_G$ do not overlap. We then have the following set of inequalities:

  $h(x) = \inf_{\{w_G\} \in W(x)} \sum_{G \in \mathcal{G}} \|w_G\|_2 + \frac{\lambda_1}{\sqrt{\alpha}} \|w_G\|_1$
  $\overset{(i)}{\le} \sum_{G \in \mathcal{G}} \|v_G\|_2 + \frac{\lambda_1}{\sqrt{\alpha}} \|v_G\|_1$
  $\overset{(ii)}{\le} \sum_{G \in \mathcal{G}} \|v_G\|_2 + \frac{\lambda_1}{\sqrt{\alpha}} \sqrt{\alpha} \|v_G\|_2 = (1 + \lambda_1) \sum_{G \in \mathcal{G}} \|v_G\|_2$
  $\overset{(iii)}{\le} \sqrt{k} (1 + \lambda_1) \left( \sum_{G \in \mathcal{G}} \|v_G\|_2^2 \right)^{1/2} = \sqrt{k} (1 + \lambda_1) \|x\|_2$

where (i) follows from the definition of the function $h(x)$ in (8), and (ii) and (iii) follow from the fact that for any vector $v \in \mathbb{R}^d$ we have $\|v\|_1 \le \sqrt{d} \|v\|_2$. This, coupled with the fact that $h(x)$ is a norm (Lemma 7), ensures that the set $C(k, \alpha)$ is convex.

To show that the relaxation is tight, we consider a $(k, \alpha)$-sparse vector $x$ and show that the inequality in the definition of the set holds with equality. Specifically, let $x \in \mathbb{R}^p$ with non-overlapping groups, and let the first $k$ vectors $w_G$ in its representation be active. Moreover, suppose the first $\alpha$ entries in each of these $w_G$ are nonzero. Let the nonzero entries all be equal to $1/\sqrt{\alpha k}$. Then $\|x\| = 1$, $\sum_G \|w_G\|_2 = \sqrt{k}$ and $\sum_G \|w_G\|_1 = \sqrt{\alpha k}$. The result follows.

4.1 Mean Widths for the SOGlasso

The mean width of the constraint set plays a crucial role in determining the consistency of the solution of the optimization problem. We now aim to find the mean width of the constraint set in (9), and as a result prove Theorems 5 and 6. Before we do so, we restate Lemma 3.2 in (Rao et al., 2012) for the sake of completeness:

Lemma 10 Let $q_1, \ldots, q_K$ be $K$ $\chi^2$ random variables with $d$ degrees of freedom. Then

  $E\left[ \max_{1 \le i \le K} q_i \right] \le \left( \sqrt{2 \log K} + \sqrt{d} \right)^2.$

First, we prove Theorem 6.

Lemma 11 Suppose that it is known that each coefficient is part of at most $R$ groups, and suppose we let

  $h(x) = \inf_{\{w_G\} \in W(x)} \sum_{G \in \mathcal{G}} \|w_G\|_2 + \frac{\lambda_1}{\sqrt{\alpha}} \|w_G\|_1.$

Then the mean width of the set $C = \{ x : h(x) \le \sqrt{k} (1 + \lambda_1), \ \|x\|_2 \le 1 \}$ is bounded as

  $\omega(C)^2 \le C R^2 k \left[ \log\frac{K}{k} + \alpha \log\frac{L}{\alpha} + \alpha + 2 \right].$

Proof The intuition behind this proof is as follows. We first consider a non-convex set, which is the ideal set of $(k, \alpha)$-sparse vectors that we are interested in. We then show that $C$ is contained in the scaled convex hull of the non-convex set, and hence, by the properties of the mean width, $\omega(C)$ can be bounded by a scaling of that of the non-convex set (footnote 7). To this end, let us consider the following non-convex ideal set:

  $C_{nc} = \{ x : \|x\| \le 1, \ \|x\|_{\mathcal{G},0} \le k, \ \|w_G\|_0 \le \alpha k \}.$   (15)

Consider $x \in C$. We now define vectors $x_i$ as follows: $x_1 = \sum_{r=1}^k w_1^r$, where the vectors $w_1^r$ are the $k$ vectors $w_G$ with largest norm. Along these lines, we define $x_i = \sum_{r=1}^k w_i^r$. For a fixed $i$, let $x_{i1}$ be the vector containing the top $\alpha k$ entries of $x_i$ by magnitude, and define a general $x_{ij}$ in this manner as well. Note that $x_i = \sum_j x_{ij}$ and $x = \sum_i x_i$. Also, note that $x_{ij} / \|x_{ij}\|_2 \in C_{nc}$, since it has at most $k$ active groups and at most $\alpha k$ nonzero elements. Finally, note the following. By construction, we have for a fixed $i$ and $j > 1$

  $\|x_{ij}\|_2 \le \frac{1}{\sqrt{\alpha k}} \|x_{i(j-1)}\|_1.$   (16)

This follows since each element of $x_{ij}$ is smaller than the average of the entries of the vector $x_{i(j-1)}$. Using the exact same argument, we also have for $i > 1$

  $\left( \sum_{r=1}^k \|w_i^r\|_2^2 \right)^{1/2} \le \frac{1}{\sqrt{k}} \sum_{r=1}^k \|w_{i-1}^r\|_2.$   (17)

7. Lemma 9 showed that the set C is a tight convex outer relaxation of the non-convex set.

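Now,

  $\sum_{i,j} \|x_{ij}\|_2 = \|x_{11}\|_2 + \sum_{i>1} \|x_{i1}\|_2 + \sum_i \sum_{j>1} \|x_{ij}\|_2$
  $\le \|x_{11}\|_2 + \sum_{i>1} \|x_{i1}\|_2 + \frac{1}{\sqrt{\alpha k}} \sum_i \sum_{j>1} \|x_{i(j-1)}\|_1$   from (16)
  $\le \|x_{11}\|_2 + \sum_{i>1} \|x_{i1}\|_2 + \frac{1}{\sqrt{\alpha k}} \sum_i \sum_j \|x_{ij}\|_1$
  $\le \|x_{11}\|_2 + \sum_{i>1} \|x_{i1}\|_2 + \frac{1}{\sqrt{\alpha k}} \sum_i \|x_i\|_1$   since the indices for $j$ are disjoint
  $= \|x_{11}\|_2 + \sum_{i>1} \|x_{i1}\|_2 + \frac{1}{\sqrt{\alpha k}} \sum_i \left\| \sum_{r=1}^k w_i^r \right\|_1$
  $\le \|x_{11}\|_2 + \sum_{i>1} \|x_{i1}\|_2 + \frac{1}{\sqrt{\alpha k}} \sum_i \sum_{r=1}^k \|w_i^r\|_1$   triangle inequality
  $= \|x_{11}\|_2 + \sum_{i>1} \|x_{i1}\|_2 + \frac{1}{\sqrt{\alpha k}} \sum_{G \in \mathcal{G}} \|w_G\|_1$   (18)

For $i > 1$, we have the following bound:

  $\|x_{i1}\|_2^2 \le \left\| \sum_{r=1}^k w_i^r \right\|_2^2 = \sum_{m=1}^p \left( \sum_{r=1}^k (w_i^r)_m \right)^2 \le \sum_{m=1}^p R^2 \max_r (w_i^r)_m^2 \le R^2 \sum_{m=1}^p \sum_{r=1}^k (w_i^r)_m^2 = R^2 \sum_{r=1}^k \|w_i^r\|_2^2$

The above inequality and (17) combine to give

  $\|x_{i1}\|_2 \le \frac{R}{\sqrt{k}} \sum_{r=1}^k \|w_{i-1}^r\|_2.$   (19)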

Substituting this in (18) and noting that $\|x_{11}\|_2 \le \|x\|_2 \le 1$, we have

  $\sum_i \sum_j \|x_{ij}\|_2 \le 1 + \frac{R}{\sqrt{k}} \sum_{G \in \mathcal{G}} \|w_G\|_2 + \frac{1}{\sqrt{\alpha k}} \sum_{G \in \mathcal{G}} \|w_G\|_1 = 1 + \frac{R}{\sqrt{k}} \left( \sum_{G \in \mathcal{G}} \|w_G\|_2 + \frac{1}{R \sqrt{\alpha}} \sum_{G \in \mathcal{G}} \|w_G\|_1 \right)$   (20)

If $\lambda_1 R \ge 1$ then the term in the parentheses is bounded by $\sqrt{k} (1 + \lambda_1)$, and if not, then it is bounded by $\sqrt{k} (1 + 1/R)$. This gives:

  $\sum_i \sum_j \|x_{ij}\|_2 \le 1 + R + \max(1, R \lambda_1).$   (21)

The following argument finishes the proof. Set $\beta = 1 + R + \max(1, R \lambda_1)$.

1. By construction, we have $x = \sum_i \sum_j x_{ij}$.
2. Also by construction, $x_{ij} / \|x_{ij}\|_2 \in C_{nc}$.
3. Now, letting $\theta_{ij} = \|x_{ij}\|_2 / \beta$, we showed $x = \beta \sum_i \sum_j \theta_{ij} \frac{x_{ij}}{\|x_{ij}\|_2}$.
4. We showed that $\sum_i \sum_j \theta_{ij} \le 1$, so that $x / \beta$ can be written as a convex combination of the $x_{ij} / \|x_{ij}\|_2$, which are elements of $C_{nc}$.

This means that $x \in \beta \, \mathrm{conv}(C_{nc})$. We then have the following bound for the mean width of $C$:

  $\omega(C)^2 \le C R^2 \omega(C_{nc})^2.$

It now remains to compute $\omega(C_{nc})$. Lemma 12 yields the desired result.

Lemma 12 For $C_{nc} = \{ x : \|x\| \le 1, \ \|x\|_{\mathcal{G},0} \le k, \ \|w_G\|_0 \le \alpha k \}$, we have

  $\omega(C_{nc})^2 \le C k \left[ \log\frac{K}{k} + \alpha \log\frac{L}{\alpha} + \alpha + 2 \right].$

We prove this in Appendix A, making use of Lemma 10 to obtain the result.

We now proceed to prove Theorem 5. To do so, we adopt a different strategy from the one used to prove Theorem 6. Instead of considering the non-convex ideal set, we directly consider the convex set $C$ and show that it is a subset of appropriately scaled versions of the overlapping group lasso or the lasso balls. The result then follows.
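Inequality (16) in the proof of Lemma 11 rests on a standard sorted-block argument: once the entries are ordered by magnitude and cut into blocks of equal size, every entry of a block is no larger than the average magnitude of the previous block. The following is a minimal numerical sketch of that one step only, with a generic block size `s` standing in for the block size used in the proof; the helper name `sorted_blocks` and the test vector are hypothetical.

```python
import numpy as np

def sorted_blocks(x, s):
    """Split x into consecutive blocks of s entries, ordered by decreasing magnitude."""
    order = np.argsort(-np.abs(x))
    return [x[order[start:start + s]] for start in range(0, len(x), s)]

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
s = 8
blocks = sorted_blocks(x, s)

# For each block j > 0, compare ||block_j||_2 with ||block_{j-1}||_1 / sqrt(s).
gaps = [
    np.linalg.norm(blocks[j - 1], 1) / np.sqrt(s) - np.linalg.norm(blocks[j], 2)
    for j in range(1, len(blocks))
]
print(all(gap >= 0 for gap in gaps))
```

The blocks partition the entries of `x`, which is what allows the sum of block $\ell_1$ norms to be collapsed into a single $\ell_1$ norm in the chain leading to (18).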

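Lemma 13 Consider the same set as that considered in Lemma 11. The mean width of the set can also be shown to satisfy

  $\omega(C)^2 \le C k \min\{ \log K + L, \ \log p \}.$

Proof Let $g \sim \mathcal{N}(0, I)$, and for a given $x$, let $\{w_G\} \in W(x)$ be its optimal representation (Definition 4). Since $x = \sum_G w_G$, we have

  $\max_{x \in C} g^T x = \max_{x \in C} g^T \sum_{G \in \mathcal{G}} w_G$
  $= \max_{\{w_G\} \in W(x)} \sum_{G \in \mathcal{G}} g_G^T w_G \quad \text{s.t.} \quad \sum_{G \in \mathcal{G}} \|w_G\|_2 + \frac{\lambda_1}{\sqrt{\alpha}} \|w_G\|_1 \le \sqrt{k} (1 + \lambda_1)$   (22)
  $\overset{(i)}{\le} \max_{\{w_G\} \in W(x)} \sum_{G \in \mathcal{G}} g_G^T w_G \quad \text{s.t.} \quad \sum_{G \in \mathcal{G}} \left( 1 + \frac{\lambda_1}{\sqrt{\alpha}} \right) \|w_G\|_2 \le \sqrt{k} (1 + \lambda_1)$
  $= \max_{\{w_G\} \in W(x)} \sum_{G \in \mathcal{G}} g_G^T w_G \quad \text{s.t.} \quad \sum_{G \in \mathcal{G}} \|w_G\|_2 \le \frac{\sqrt{\alpha k} (1 + \lambda_1)}{\sqrt{\alpha} + \lambda_1}$   (23)
  $\overset{(ii)}{=} \frac{\sqrt{\alpha k} (1 + \lambda_1)}{\sqrt{\alpha} + \lambda_1} \max_{G \in \mathcal{G}} \|g_G\|_2$
  $\le \sqrt{k} (1 + \lambda_1) \max_{G \in \mathcal{G}} \|g_G\|_2$   (24)

where we define $g_G$ to be the sub-vector of $g$ indexed by group $G$. (i) follows since the constraint set is a superset of the constraint in the expression above it, from the fact that $\|a\|_2 \le \|a\|_1 \ \forall a$, and (ii) is a result of simple convex analysis. The mean width is then bounded as

  $\omega(C) \le \sqrt{k} (1 + \lambda_1) \, E\left[ \max_{G \in \mathcal{G}} \|g_G\|_2 \right].$   (25)

Squaring both sides of (25), we get

  $\omega(C)^2 \le k (1 + \lambda_1)^2 \left( E\left[ \max_{G \in \mathcal{G}} \|g_G\|_2 \right] \right)^2 \overset{(iii)}{\le} k (1 + \lambda_1)^2 E\left[ \left( \max_{G \in \mathcal{G}} \|g_G\|_2 \right)^2 \right] \overset{(iv)}{=} k (1 + \lambda_1)^2 E\left[ \max_{G \in \mathcal{G}} \|g_G\|_2^2 \right]$

where (iii) follows from Jensen's inequality and (iv) follows from the fact that the square of the maximum of non-negative numbers is the same as the maximum of the squares. Now, note that since $g$ is Gaussian, $\|g_G\|^2$ is a $\chi^2$ random variable with at most $L$ degrees of freedom. From Lemma 10, we have

  $\omega(C)^2 \le k (1 + \lambda_1)^2 \left( \sqrt{2 \log K} + \sqrt{L} \right)^2.$   (26)

This gives us one of the two terms in the $\min\{\cdot, \cdot\}$ in the statement of the lemma. Since $\lambda_1$ is bounded away from 0, we can treat the term in the parentheses as a constant. For the second term, let us revisit (22) and obtain the following inequalities:

  $\max_{x \in C} g^T x = \max_{\{w_G\} \in W(x)} \sum_{G \in \mathcal{G}} g_G^T w_G \quad \text{s.t.} \quad \sum_{G \in \mathcal{G}} \|w_G\|_2 + \frac{\lambda_1}{\sqrt{\alpha}} \|w_G\|_1 \le \sqrt{k} (1 + \lambda_1)$
  $\overset{(v)}{\le} \max_{\{w_G\} \in W(x)} \sum_{G \in \mathcal{G}} g_G^T w_G \quad \text{s.t.} \quad \sum_{G \in \mathcal{G}} \frac{1}{\sqrt{L}} \|w_G\|_1 + \frac{\lambda_1}{\sqrt{\alpha}} \|w_G\|_1 \le \sqrt{k} (1 + \lambda_1)$
  $= \max_{\{w_G\} \in W(x)} \sum_{G \in \mathcal{G}} g_G^T w_G \quad \text{s.t.} \quad \sum_{G \in \mathcal{G}} \|w_G\|_1 \le \frac{\sqrt{k} (1 + \lambda_1)}{1/\sqrt{L} + \lambda_1/\sqrt{\alpha}}$
  $= \frac{\sqrt{k} (1 + \lambda_1)}{1/\sqrt{L} + \lambda_1/\sqrt{\alpha}} \max_{G \in \mathcal{G}} \max_{i \in G} |(g_G)_i|$
  $\le \frac{\sqrt{\alpha k} (1 + \lambda_1)}{\lambda_1} \max_i |g_i|$

where the constraint set in (v) is a superset of that in the statement above it. Again, after squaring both sides, taking expectations and applying Jensen's inequality, we obtain

  $\omega(C)^2 \le \alpha k \left( \frac{1 + \lambda_1}{\lambda_1} \right)^2 E\left[ \max_i g_i^2 \right].$

The quantity inside the expectation is the maximum of $\chi^2$ variables with one degree of freedom, and from Lemma 10 we obtain

  $\omega(C)^2 \le C k \log(p).$

This gives the second term in the $\min\{\cdot, \cdot\}$, and finishes the proof.

Lemma 13 and Theorem 2 lead directly to Theorem 5.

The results in the proof above shed some more light on our regularizer $h(x)$. If $\lambda_1 = 0$, then the problem reduces to that of classification using the overlapping group lasso penalty, and we obtain the corresponding sample complexity bound. For simple sparsity without any structure, we would want $\lambda_1$ to be large, in which case $1 + \lambda_1 \to \infty$ and $(1 + \lambda_1)/\lambda_1 \to 1$. This would then entail the bounds for the $\ell_1$ regularized problem taking over, keeping all other parameters $k, K, \alpha, L$ fixed.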
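The step from the expected maximum of group energies to (26) relies on Lemma 10. As a sanity check, the expectation $E[\max_G \|g_G\|_2^2]$ can be estimated by simulation and compared with $(\sqrt{2 \log K} + \sqrt{L})^2$. The sketch below is only illustrative: the group layout (100 groups of size 6, neighbors sharing two indices) is borrowed from the toy experiments of Section 5.2, and the helper names, trial count and seed are assumptions.

```python
import numpy as np

def overlapping_groups(num_groups, size, stride):
    """Index sets {0..size-1}, {stride..stride+size-1}, ... (0-based)."""
    return [np.arange(j * stride, j * stride + size) for j in range(num_groups)]

def mean_max_group_energy(groups, p, trials=2000, seed=0):
    """Monte Carlo estimate of E[max_G ||g_G||_2^2] for g ~ N(0, I_p)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((trials, p))
    energies = np.stack([(g[:, idx] ** 2).sum(axis=1) for idx in groups], axis=1)
    return energies.max(axis=1).mean()

K, L, stride = 100, 6, 4
groups = overlapping_groups(K, L, stride)
p = int(groups[-1][-1]) + 1
estimate = mean_max_group_energy(groups, p)
bound = (np.sqrt(2 * np.log(K)) + np.sqrt(L)) ** 2
print(p, round(float(estimate), 2), round(float(bound), 2))
```

The group energies are dependent here because the groups overlap, so this probes the bound in the overlapping setting the paper cares about rather than in the independent case.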

4.2 Extensions to Data with Correlated Entries

The results we proved above can be extended to data with correlated Gaussian entries as well (see (Raskutti et al., 2010) for results in linear regression settings). Indeed, in most practical applications we are interested in, the features are expected to contain correlations. For example, in the fMRI application that is one of the major motivating applications of our work, it is reasonable to assume that voxels in the brain will exhibit correlation amongst themselves at a given time instant. This entails scaling the number of measurements by the condition number of the covariance matrix, where we assume that each row of the measurement matrix is sampled from a Gaussian $(0, \Sigma)$ distribution. Specifically, we obtain a generalization of the result in (Plan and Vershynin, 2013) for the SOGlasso with a correlated Gaussian design. We now consider the following constraint set:

  $C_{corr} = \left\{ x : h(x) \le \frac{1}{\sigma_{\min}(\Sigma^{1/2})} \sqrt{k} (1 + \lambda_1), \ \|\Sigma^{1/2} x\| \le 1 \right\}.$   (27)

We consider the set $C_{corr}$ and not $C$ in (9), since we require the constraint set to be a subset of the unit Euclidean ball. In the proof of Corollary 14 below, we will reduce the problem to an optimization over variables of the form $z = \Sigma^{1/2} x$, and hence we require $\|\Sigma^{1/2} x\|_2 \le 1$. Enforcing this constraint leads to the corresponding upper bound on $h(x)$. We now obtain the following generalization of Theorem 5 for correlated data.

Corollary 14 Let the entries of the data matrix $\Phi$ be sampled from a $\mathcal{N}(0, \Sigma)$ distribution. Suppose the measurements follow the model in (1). Suppose we wish to recover a $(k, \alpha)$-group sparse vector from the set $C_{corr}$ in (27). Suppose the true coefficient vector $x^\star$ satisfies $\|\Sigma^{1/2} x^\star\| = 1$. Then, so long as the number of measurements $n$ satisfies

  $n \ge C \delta^{-2} \kappa(\Sigma) \, k \min\{ \log(K) + L, \ \log(p) \},$

the solution to (5) satisfies

  $\|\hat{x} - x^\star\|^2 \le \frac{\delta}{\sigma_{\min}(\Sigma)},$

where $\sigma_{\min}(\cdot)$, $\sigma_{\max}(\cdot)$ and $\kappa(\cdot)$ denote the minimum and maximum singular values and the condition number of the corresponding matrices respectively.

We prove this result in Appendix B. The proof is a straightforward modification of the proof of Theorem 5. A similar result along the lines of Theorem 6 can also be proved.

5. Applications and Experiments

In this section, we perform experiments on both real and toy data, and show that the function proposed in (8) indeed recovers the kind of sparsity patterns we are interested in. First, we experiment with some toy data to understand the properties of the function $h(x)$ and, in turn, the solutions yielded by the optimization problem (5). Here, we take the opportunity to report results on linear regression problems as well. We then present results using two datasets from cognitive neuroscience and computational biology.

5.1 The SOGlasso for Multitask Learning

The SOGlasso is motivated in part by multitask learning applications. The group lasso is a commonly used tool in multitask learning, and it encourages the same set of features to be selected across all tasks. As mentioned before, we wish to focus on a less restrictive version of multitask learning, where the main idea is to encourage sparsity patterns that are similar, but not identical, across tasks. Such a restriction corresponds to a scenario where the different tasks are related to each other, in that they use similar features, but are not exactly identical. This is accomplished by defining subsets of similar features and searching for solutions that select only a few subsets (common across tasks) and a sparse number of features within each subset (possibly different across tasks). Figure 1 shows an example of the patterns that typically arise in sparse multitask learning applications, along with the one we are interested in. We see that the SOGlasso, with its ability to select a few groups and only a few nonzero coefficients within those groups, lends itself well to the scenario we are interested in.

In the multitask learning setting, suppose the features are given by $\Phi_t$, for tasks $t \in \{1, 2, \ldots, T\}$, with corresponding sparse vectors $x_t^\star \in \mathbb{R}^p$. These vectors can be arranged as columns of a matrix $X^\star$. Suppose we are now given $M$ groups $\tilde{\mathcal{G}} = \{\tilde{G}_1, \tilde{G}_2, \ldots\}$ with maximum size $\tilde{B}$. Note that the groups will now correspond to sets of rows of $X^\star$.

Let $x^\star = [x_1^{\star T} \ x_2^{\star T} \ldots x_T^{\star T}]^T \in \mathbb{R}^{Tp}$, and $y = [y_1^T \ y_2^T \ldots y_T^T]^T \in \mathbb{R}^{Tn}$. We also define $\mathcal{G} = \{G_1, G_2, \ldots, G_M\}$ to be the set of groups defined on $\mathbb{R}^{Tp}$ formed by aggregating the rows of $X$ that were originally in $\tilde{\mathcal{G}}$, so that $x$ is composed of groups $G \in \mathcal{G}$, and let the corresponding maximum group size be $B = T \tilde{B}$. By organizing the coefficients in this fashion, we can reduce the multitask learning problem to the standard form considered in (5).
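The stacking just described can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a block-diagonal stacked design (so that each task's measurements depend only on that task's coefficients), 0-based indices, and the hypothetical helper name `stack_tasks`; the entry for feature `j` of task `t` is placed at position `t * p + j` of the stacked vector.

```python
import numpy as np

def stack_tasks(Phis, ys, row_groups):
    """Stack T tasks into one problem on R^{T p}.

    Phis: list of T arrays of shape (n, p); ys: list of T arrays of shape (n,);
    row_groups: list of index lists over the p features (rows of the coefficient matrix).
    """
    T = len(Phis)
    n, p = Phis[0].shape
    Phi = np.zeros((T * n, T * p))
    for t, Phi_t in enumerate(Phis):
        Phi[t * n:(t + 1) * n, t * p:(t + 1) * p] = Phi_t  # block-diagonal design (assumption)
    y = np.concatenate(ys)
    # Each row group is replicated across tasks, so its size grows by a factor of T.
    groups = [np.array([t * p + j for t in range(T) for j in g]) for g in row_groups]
    return Phi, y, groups

rng = np.random.default_rng(0)
T, n, p = 3, 4, 5
Phis = [rng.standard_normal((n, p)) for _ in range(T)]
xs = [rng.standard_normal(p) for _ in range(T)]
ys = [Phi_t @ x_t for Phi_t, x_t in zip(Phis, xs)]
row_groups = [[0, 1], [1, 2, 3]]
Phi, y, groups = stack_tasks(Phis, ys, row_groups)
print(Phi.shape, y.shape, [len(g) for g in groups])
```

With this layout, a group on the stacked vector has size $T$ times the size of the original row group, matching $B = T \tilde{B}$ above.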
All the results we obtain in this paper can therefore be extended to the multitask learning setting as well.

Results on fMRI dataset

In this experiment, we compared SOGlasso, lasso, standard multitask Glasso (with each feature grouped across tasks), the overlapping group lasso (Jacob et al., 2009) (with the same groups as in SOGlasso) and the Elastic Net (Zou and Hastie, 2005) in an analysis of the star-plus dataset (Wang et al., 2003). 6 subjects made judgements that involved processing 40 sentences and 40 pictures while their brains were scanned in half-second intervals using fMRI (footnote 8). We retained the 16 time points following each stimulus, yielding 1280 measurements at each voxel. The task is to distinguish, at each point in time, which kind of stimulus a subject was processing. (Wang et al., 2003) showed that there exists cross-subject consistency in the cortical regions useful for prediction in this task. Specifically, experts partitioned each dataset into 24 non-overlapping regions of interest (ROIs), then reduced the data by discarding all but 7 ROIs and, for each subject, averaging the BOLD response across voxels within each ROI. With the resulting data, the authors showed that a classifier trained on data from 5 participants generalized above chance when applied to data from a 6th, thus proving some degree of consistency across subjects in how the different kinds of information were encoded.

8. Data and documentation available at

We assessed whether SOGlasso could leverage this cross-individual consistency to aid in the discovery of predictive voxels without requiring expert pre-selection of ROIs, or data reduction, or any alignment of voxels beyond that existing in the raw data. Note that, unlike (Wang et al., 2003), we do not aim to learn a solution that generalizes to a withheld subject. Rather, we aim to discover a group sparsity pattern that suggests a similar set of voxels in all subjects, before optimizing a separate solution for each individual. If SOGlasso can exploit cross-individual anatomical similarity from this raw, coarsely-aligned data, it should show reduced cross-validation error relative to the lasso applied separately to each individual. If the solution is sparse within groups and highly variable across individuals, SOGlasso should show reduced cross-validation error relative to Glasso. Finally, if SOGlasso is finding useful cross-individual structure, the features it selects should align at least somewhat with the expert-identified ROIs shown by (Wang et al., 2003) to carry consistent information.

We trained the 5 classifiers using 4-fold cross validation to select the regularization parameters, considering all available voxels without preselection. We grouped regions of voxels and considered overlapping groups shifted by 2 voxels in the first 2 dimensions (footnote 9). Figure 4 shows the prediction error (misclassification rate) of each classifier for every individual subject. SOGlasso shows the smallest error. The substantial gains over lasso indicate that the algorithm is successfully leveraging cross-subject consistency in the location of the informative features, allowing the model to avoid over-fitting individual subject data. We also note that the SOGlasso classifier, despite being trained without any voxel pre-selection, averaging, or alignment, performed comparably to the best-performing classifier reported by Wang et al. (2003), which was trained on features averaged over 7 expert pre-selected ROIs.

To assess how well the clusters selected by SOGlasso align with the anatomical regions thought a priori to be involved in sentence and picture representation, we calculated the proportion of selected voxels falling within the 7 ROIs identified by (Wang et al., 2003) as relevant to the classification task (Table 3). For SOGlasso an average of 61.9% of identified voxels fell within these ROIs, significantly more than for lasso, group lasso (with or without overlap) and the elastic net. The overlapping group lasso, despite returning a very large number of predictors, hardly overlaps with the regions of interest to cognitive neuroscientists. The lasso and the elastic net make use of the fact that a separate classifier can be trained for each subject, but even in this case, the overlap with the regions of interest is low. The group lasso also fares badly in this regard, since the same voxels are forced to be selected across individuals, and this means that the regions of interest, which will be misaligned across subjects, will not in general be selected for each subject. All these drawbacks are circumvented by the SOGlasso. This shows that even without expert knowledge about the relevant regions of interest, our method partially succeeds in isolating the voxels that play a part in the classification task.

  Method      Avg. Overlap with ROI %
  OGlasso
  ENet
  Lasso
  Glasso
  SOGlasso

Table 3: Mean sparsity levels of the methods considered, and the average overlap with the precomputed ROIs in (Wang et al., 2003).

[Figure 4 x-axis: OGlasso, E-Net, Lasso, Glasso, SOGlasso]
Figure 4: Misclassification error on a hold-out set for different methods, on a per-subject basis. Each solid line connects the different errors obtained for a particular subject in the dataset.

We make the following observations from Figure 4 and Figure 5:

- The overlapping group lasso (Jacob et al., 2009) is ill suited for this problem. This is natural, since the premise is that the brains of different subjects can only be crudely aligned, and the overlapping group lasso will force the same voxel to be selected across all individuals. It will also force all the voxels in a group to be selected, which is again undesirable from our perspective. This leads to a high number of voxels selected, and a high error.
- The elastic net (Zou and Hastie, 2005) treats each subject independently, and hence does not leverage the inter-subject similarity that we know exists across brains. The fact that all correlated voxels are also picked, coupled with a highly noisy signal, means that a large number of voxels are selected, and this not only makes the result hard to interpret, but also leads to a large generalization error.
- The lasso (Tibshirani, 1996) is similar to the elastic net in that it does not leverage the inter-subject similarities. At the same time, it enforces sparsity in the solutions, and hence fewer voxels are selected across individuals. It allows any task-correlated voxel to be selected, regardless of its spatial location, and that leads to a highly distributed sparsity pattern (Figure 5(a)). It leads to a higher cross-validation error, indicating that the ungrouped voxels are inferior predictors. Like the elastic net, this leads to a poor generalization error (Figure 4). The distributed sparsity pattern, the low overlap with predetermined regions of interest, and the high error on the hold-out set are what we believe make the lasso a suboptimal procedure to use.
- The group lasso (Lounici et al., 2009) groups a single voxel across individuals. This allows for taking into account the similarities between subjects, but not the minor differences across subjects. Like the overlapping group lasso, if a voxel is selected for one person, the same voxel is forced to be selected for all people. This means that if a voxel encodes picture or sentence in a particular subject, then the same voxel is forced to be selected across subjects, and can arbitrarily encode picture or sentence. This gives rise to a purple haze in Figure 5(b), and makes the result hard to interpret. The purple haze manifests itself due to the large number of ambiguous voxels in Figure 5(d).
- Finally, the SOGlasso, as we have argued, helps in accounting for both the similarities and the differences across subjects. This leads to the learning of a code that is very sparse and hence interpretable, and to an error on the test set that is the best among the different methods considered. The SOGlasso (Figure 5(c)) overcomes the drawbacks of lasso and Glasso by allowing different voxels to be selected per group. This gives rise to a spatially clustered sparsity pattern, while at the same time selecting a negligible number of voxels that encode both picture and sentences (Figure 5(d)). Also, the resulting sparsity pattern has a larger overlap with the ROIs than the other methods considered.

9. The irregular group size compensates for voxels being larger and scanner coverage being smaller in the z-dimension (only 8 slices relative to 64 in the x- and y-dimensions).

5.2 Toy Data, Linear Regression

Although not the primary focus of this paper, we show that the method we propose can also be applied to the linear regression setting. To this end, we consider simulated data and a multitask linear regression setting, and look to recover the coefficient matrix. We also use the simulated data to study the properties of the function we propose in (8).
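A minimal sketch of such a simulation is given below, using the parameter values this section reports (20 tasks, 100 groups of size 6 with neighboring groups sharing two indices, 10 active groups, 100 noisy Gaussian measurements with noise standard deviation 0.1). Several details are assumptions made for illustration and are not stated in the paper: indices are 0-based, each task gets its own independent Gaussian design, exactly a rounded $\alpha$ fraction of each active group's entries is retained, and the function name `make_toy_data` is hypothetical.

```python
import numpy as np

def make_toy_data(T=20, M=100, size=6, stride=4, k=10, alpha=0.2,
                  m=100, sigma=0.1, seed=0):
    """Generate a group-sparse coefficient matrix and noisy linear measurements."""
    rng = np.random.default_rng(seed)
    # Neighboring groups overlap by (size - stride) indices.
    groups = [np.arange(j * stride, j * stride + size) for j in range(M)]
    p = int(groups[-1][-1]) + 1
    X = np.zeros((p, T))
    active = rng.choice(M, size=k, replace=False)
    n_keep = int(round(alpha * size * T))
    for j in active:
        # Retain an alpha fraction of the (size x T) block, chosen at random.
        flat = rng.choice(size * T, size=n_keep, replace=False)
        rows, cols = np.unravel_index(flat, (size, T))
        X[groups[j][rows], cols] = rng.uniform(-1.0, 1.0, size=n_keep)
    Phi = rng.standard_normal((T, m, p))               # one design per task (assumption)
    noise = sigma * rng.standard_normal((m, T))
    Y = np.einsum("tmp,pt->mt", Phi, X) + noise        # column t holds task t's measurements
    return Phi, Y, X, groups, active

Phi, Y, X, groups, active = make_toy_data(alpha=0.2)
print(X.shape, Y.shape, np.count_nonzero(X))
```

Sweeping `alpha` from near 0 to 1 moves the active groups from very sparse to fully dense, which is the axis along which the methods are compared in this section.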
The toy data is generated as follows: we consider $T = 20$ tasks, and consider overlapping groups of size $\tilde{B} = 6$. The groups are defined so that neighboring groups overlap ($G_1 = \{1, 2, \ldots, 6\}$, $G_2 = \{5, 6, \ldots, 10\}$, $G_3 = \{9, 10, \ldots, 14\}$, ...). We consider a case with $M = 100$ groups, and set $k = 10$ groups to be active. We vary the sparsity level $\alpha$ of the active groups and obtain $m = 100$ Gaussian linear measurements corrupted with additive white Gaussian noise of standard deviation $\sigma = 0.1$. We repeat this procedure 100 times and average the results. To generate the coefficient matrices $X^\star$, we select $k$ groups at random, and within the active groups, only retain an $\alpha$ fraction of the coefficients, again at random. The retained locations are then populated with uniform $[-1, 1]$ random variables. The regularization parameters were clairvoyantly picked to minimize the Mean Squared Error (MSE) over a range of parameter values. The results of applying the lasso, the standard latent group lasso (Jacob et al., 2009), the group lasso where each group corresponds to a row of the sparse matrix (Lounici et al., 2009), and our SOGlasso to these data are plotted in Figure 6(a), varying $\alpha$.

Figure 6(a) shows that, as the sparsity within the active group reduces (i.e. the active groups become more dense), the overlapping group lasso performs progressively better. This is because the overlapping group lasso does not account for sparsity within groups, and hence the resulting solutions are far from the true solutions for small values of $\alpha$. The SOGlasso however does take this into account, and hence has a lower error when the active groups are sparse. Note that as $\alpha \to 1$, the SOGlasso approaches O-Glasso (Jacob et al., 2009). The lasso (Tibshirani, 1996) does not account for group structure at all and performs poorly when $\alpha$ is large, whereas the group lasso (Lounici et al., 2009) does not account for overlapping groups, and hence performs worse than O-Glasso and SOGlasso.

[Figure 5 panels: (a) Lasso, (b) Group Lasso, (c) SOG Lasso, (d) Voxel Encodings (%): percentage of selected voxels encoding PICTURE, SENTENCE or BOTH for OGLASSO, E-NET, LASSO, GLASSO, SOGLASSO]
Figure 5: [Best seen in color]. Aggregated sparsity patterns across subjects per brain slice. All the voxels selected across subjects in each slice are colored in red, blue or purple. Red indicates voxels that exhibit a picture response in at least one subject and never exhibit a sentence response. Blue indicates the opposite. Purple indicates voxels that exhibited a picture response in at least one subject and a sentence response in at least one more subject. (d) shows the percentage of selected voxels that encode picture, sentence or both.

[Figure 6 panels: (a) Varying α: MSE versus α for LASSO, GLASSO, OGLASSO, SOGLASSO; (b) Sample pattern]
Figure 6: Figure (a) shows the result of varying $\alpha$. The SOGlasso accounts for both inter and intra group sparsity, and hence performs the best. The Glasso achieves good performance only when the active groups are non-sparse. Figure (b) shows a toy sparsity pattern, with different colors and brackets denoting different overlapping groups.

5.3 SOGlasso for Gene Selection

As explained in the introduction, another motivating application for the SOGlasso arises in computational biology, where one needs to predict whether a particular breast cancer tumor will lead to metastasis or not, from gene expression profiles. We used the breast cancer dataset compiled by (Van De Vijver et al., 2002) and grouped the genes into pathways as in (Subramanian et al., 2005). To make the dataset balanced, we perform a 3-way replication of one of the classes as in (Jacob et al., 2009), and also restrict our analysis to genes that are in at least one pathway. Again as in (Jacob et al., 2009), we ensure that all the replicates are in the same fold for cross validation. We do not perform any preprocessing of the data, other than the replication to balance the dataset. We compared our method to the standard lasso and the overlapping group lasso.
The standard group lasso (Yuan and Lin, 2006) is ill-suited for this experiment, since the groups overlap and the sparsity pattern we expect is a union of groups, and it has been shown that the group lasso method will not recover the signal in such cases. We trained a model using 4-fold cross validation on 80% of the data, and used the remaining 20% as a final test set. Table 4 shows the results obtained. We see that the SOGlasso penalty leads to lower classification errors as compared to the lasso or the latent group lasso. The errors reported are the ones obtained on the final (held out) test set. We refrain from performing simple ridge regression (Hoerl and Kennard, 1970) since the
