The Bayesian Group-Lasso for Analyzing Contingency Tables


Sudhir Raman, Department of Computer Science, University of Basel, Bernoullistr. 16, CH-4056 Basel, Switzerland
Thomas J. Fuchs, Institute of Computational Science, ETH Zurich, Universitaetstrasse 6, CH-8092 Zurich, Switzerland & Competence Center for Systems Physiology and Metabolic Diseases, Schafmattstr. 18, CH-8093 Zurich, Switzerland
Peter J. Wild (PETER.WILD@CELL.BIOL.ETHZ.CH), Institute of Pathology, University Hospital Zurich, Schmelzbergstrasse 12, CH-8091 Zurich, Switzerland
Edgar Dahl (EDAHL@UKAACHEN.DE), Institute of Pathology, University Hospital, Pauwelsstrasse 30, 52074 Aachen, Germany
Volker Roth (VOLKER.ROTH@UNIBAS.CH), Department of Computer Science, University of Basel, Bernoullistr. 16, CH-4056 Basel, Switzerland

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Group-Lasso estimators, useful in many applications, suffer from a lack of meaningful variance estimates for the regression coefficients. To overcome this problem, we propose a full Bayesian treatment of the Group-Lasso that extends the standard Bayesian Lasso by means of a hierarchical expansion. The method is then applied to Poisson models for contingency tables using a highly efficient MCMC algorithm. Simulated experiments validate the performance of the method on artificial datasets with known ground truth. When applied to a breast cancer dataset, the method demonstrates its capability of identifying differences in the interaction patterns of marker proteins between different patient groups.

1. Introduction and Related Work

The identification of important explanatory factors for a process is a key task in many practical learning problems. In the context of standard linear regression, the Lasso (Tibshirani, 1996) has become a popular method for this purpose in recent years. The Lasso optimizes a regression functional under an l1-constraint on the coefficient vector: given an n x 1 vector of responses y = (y_1, ..., y_n)^t, y_i ∈ R, and observation vectors x_i ∈ R^p arranged as rows of an n x p data matrix X, the Lasso solves

  minimize ‖y − Xβ‖²  subject to  ‖β‖₁ ≤ κ.   (1)

In many applications, however, explanatory factors do not necessarily have a one-to-one correspondence with single features (main effects) but rather require a more complex representation, for instance by considering not only single features but higher-order interactions between some or all features. As a further extension of this idea, the factors (main effects plus higher-order interactions) may need to be represented as groups of variables. A popular example of this kind is the representation of categorical variables (i.e. factors in the usual statistical terminology and/or their interactions) as groups of dummy variables. To address such situations, the Group-Lasso (Yuan & Lin, 2006) is an ideal choice, since it serves as a natural extension of the Lasso by finding solutions that are sparse at the level of groups of variables which, for instance, might represent categorical variables and/or interaction terms. Despite the fact that Group-Lasso estimators have proven useful in many applications (Kim et al., 2006; Meier et al., 2008; Roth & Fischer, 2008), their main problem concerns the definition of meaningful variance estimates for the regression coefficients. This problem arises because the Hessian is not defined at the optimal solution. In this paper, we suggest a full Bayesian treatment of the Group-Lasso to overcome this problem.
The proposed model uses a hierarchical expansion and directly extends corresponding Bayesian versions of the standard Lasso (Figueiredo & Jain, 2001; Park & Casella, 2008) to handle grouped predictors.
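To make the grouped penalty concrete before the formal derivation, the following minimal Python sketch (our own illustration; the toy coefficient vector and grouping are hypothetical) contrasts the l1 penalty of eq. (1) with the sum of group-wise l2 norms used by the Group-Lasso, showing how an entire group can drop out of the model at once.

```python
import numpy as np

def lasso_penalty(beta):
    """l1 penalty of eq. (1): sum of absolute coefficients."""
    return np.sum(np.abs(beta))

def group_lasso_penalty(beta, groups):
    """Group-Lasso penalty: sum of l2 norms of the coefficient subvectors.

    `groups` is a list of index arrays, one per group of dummy variables
    (e.g. the columns encoding one categorical factor or interaction term).
    """
    return sum(np.linalg.norm(beta[idx]) for idx in groups)

# Toy example: three groups of sizes 2, 3, 2 (a hypothetical grouping).
beta = np.array([0.5, -0.2, 0.0, 0.0, 0.0, 1.0, 0.3])
groups = [np.arange(0, 2), np.arange(2, 5), np.arange(5, 7)]

print(lasso_penalty(beta))                 # treats all 7 coefficients individually
print(group_lasso_penalty(beta, groups))   # the middle group contributes exactly 0
```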

While the Bayesian Group-Lasso is applicable to many different likelihood models, the focus of this paper is on Poisson models for contingency tables. In applications of this kind, the dummy covariates x represent factor interactions up to a certain order, and the sparsity induced by the Group-Lasso corresponds to selecting edges in the hypergraph defining the interaction structure among these factors. Implementation approaches include the method described in (Figueiredo & Jain, 2001), which basically reduces to MAP point estimation without essential Bayesian features such as posterior variances, whereas (Park & Casella, 2008) describes a full Bayesian approach based on MCMC sampling. A third class of methods falls in between these extremes by utilizing approximation schemes like expectation propagation (Seeger, 2008). By exploiting a special orthogonality property of design matrices in the Poisson model for contingency tables, we are able to derive a highly efficient MCMC sampling algorithm that makes it possible to follow the full Bayesian route without resorting to approximation schemes, even in large real-world examples. The availability of posterior variances and correlations overcomes several problems of a related penalized likelihood (or MAP) approach introduced in (Dahinden et al., 2007). Moreover, our algorithm is applicable to contingency tables stemming from arbitrary factors rather than being restricted to binary variables. We show that our method is capable of identifying relevant higher-order interaction patterns both in toy examples and in large-scale medical applications.

2. The Bayesian Group-Lasso

For the following derivation, it is convenient to partition the data matrix X and the coefficients β into G subgroups:

  X = (X_1, ..., X_G),  β^t = (β_1^t, ..., β_G^t).   (2)

The size of the g-th subgroup will be denoted by p_g. The Group-Lasso minimizes the negative log-likelihood viewed as a function of β, l = l(β), under a constraint on the sum of the l2-norms of the subvectors β_g:

  minimize l(β)  s.t.  Σ_{g=1}^G ‖β_g‖₂ ≤ κ.   (3)

From a probabilistic perspective, the Group-Lasso with Gaussian likelihood can be understood as a standard linear regression model with Gaussian noise and a product of multivariate Laplacian priors over the regression coefficients. The observations y = (y_1, ..., y_n)^t are assumed to be generated according to

  y_i = x_i^t β + ε_i,  with ε_i ~ N(0, σ²),   (4)

which implies a likelihood of the form

  p(y | X, β, σ²) ∝ exp{−‖y − Xβ‖² / (2σ²)}
                  ∝ (σ²)^{−p/2} exp{−(1/(2σ²)) (β − β̂)^t X^t X (β − β̂)} · (σ²)^{−ν/2} exp{−SSE / (2σ²)},   (5)

where β̂ is the least-squares solution, SSE = (y − Xβ̂)^t (y − Xβ̂) is the sum of squared errors, and ν = n − p. The last equation results from completing the squares, which is standard in Bayesian regression, see for instance (Gelman et al., 1995). Assuming a multivariate (spherical) p_g-dimensional Multi-Laplacian prior over each group of regression coefficients,

  M-Laplace(β_g | 0, c^{−1}) ∝ c^{p_g/2} exp(−√c ‖β_g‖),   (6)

the classical Group-Lasso in eq. (3) is recovered as the MAP solution in log-space, with σ²√c playing the role of a fixed Lagrange parameter. For a full Bayesian treatment, on the other hand, we would like to place (hyper)priors over c and σ² and integrate out these parameters. In practice, these integrations are not possible analytically, and we propose to use a Gibbs sampling strategy for stochastic integration.
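As a quick illustration of the MAP correspondence just described (a sketch under our own naming conventions, not the paper's implementation), the negative log-posterior under the Gaussian likelihood of eq. (4) and the Multi-Laplace prior of eq. (6) differs from the Lagrangian form of the Group-Lasso objective in eq. (3) only by a constant factor, so both have the same minimizer:

```python
import numpy as np

def neg_log_posterior(beta, y, X, groups, sigma2, c):
    """-log p(beta | y) up to additive constants, under eqs. (4) and (6)."""
    resid = y - X @ beta
    nll = resid @ resid / (2.0 * sigma2)                              # Gaussian likelihood
    nlp = np.sqrt(c) * sum(np.linalg.norm(beta[g]) for g in groups)   # Multi-Laplace prior
    return nll + nlp

def group_lasso_objective(beta, y, X, groups, lam):
    """Penalized (Lagrangian) form of eq. (3) with multiplier lam."""
    resid = y - X @ beta
    return 0.5 * resid @ resid + lam * sum(np.linalg.norm(beta[g]) for g in groups)

# With lam = sigma2 * sqrt(c), group_lasso_objective equals sigma2 * neg_log_posterior,
# so both functions share the same minimizer: the MAP estimate is the classical Group-Lasso.
```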
Hierarchical expansion. To find a representation in which all conditionals have a standard form, we use the following hierarchical expansion, which extends the corresponding derivation for the standard Lasso in (Figueiredo & Jain, 2001) to grouped predictors: the prior can be expressed as a two-level hierarchical model involving independent zero-mean Gaussian priors over β_g and Gamma priors over λ_g. Defining a_g = p_g ρ and b_g = ‖β_g‖² / σ² (for each group g), the Multi-Laplacian prior on β_g can be expressed as a hierarchical Normal-Gamma model:

  p(β_g | ρ) = ∫₀^∞ N(β_g | 0, σ² λ_g I) · Gamma(λ_g | (p_g+1)/2, a_g/2) dλ_g
             ∝ (a_g/σ²)^{p_g/2} exp(−√(a_g/σ²) ‖β_g‖)
             ∝ M-Laplace(β_g | 0, (a_g/σ²)^{−1}),   (7)

where the integral is evaluated in closed form by recognizing that the integrand is proportional to a generalized inverse Gaussian density in λ_g. In the above derivation we have used the definition of the generalized inverse Gaussian distribution

  GIG(λ_g | 1/2, a_g, b_g) = (a_g/b_g)^{1/4} / (2 K_{1/2}(√(a_g b_g))) · λ_g^{−1/2} exp[−½ (b_g/λ_g + λ_g a_g)],   (8)

with K_{1/2}(x) = √(π/(2x)) exp(−x) denoting a special case of the spherical Bessel functions. For the standard linear model we can thus expand the Group-Lasso in terms of a Gaussian likelihood, Gaussian priors over the regression coefficients, and Gamma priors over the corresponding variances. The model is completed by introducing a prior on σ² (we use the standard conjugate joint prior p(β, σ²) = p(β | σ²) p(σ²) = N(β | µ, σ²Σ) · Inv-χ²(σ² | ν₀, s₀²)) and a conjugate Gamma(ρ | r, s) prior on ρ. Multiplying the priors with the likelihood and rearranging the relevant terms yields the full conditional posteriors, which are needed in the Gibbs sampler for carrying out the stochastic integrations. Concerning β and σ², the resulting conditionals have the standard form in Bayesian regression: p(σ² | ·) = Inv-χ², and

  p(β | ·) = N(β | µ, σ²Σ)  with  Σ = (X^t X + Λ^{−1})^{−1}  and  µ = Σ X^t X β̂,   (9)

where Λ is a diagonal matrix consisting of the λ_g as diagonal elements, with each λ_g repeated p_g times. The conditional posterior of ρ is again Gamma due to conjugacy,

  p(ρ | ·) ∝ ρ^{r−1+Σ_g (p_g+1)/2} exp[−ρ (1/s + ½ Σ_{g=1}^G λ_g p_g)],   (10)

where r and s are the shape and scale hyperparameters of the Gamma prior on ρ. Finally, the conditional of λ_g is generalized inverse Gaussian:

  p(λ_g | ·) = GIG(λ_g | 1/2, a_g, b_g).   (11)

3. Poisson Models for Contingency Tables

While in principle the Bayesian Group-Lasso introduced in the last section is applicable to many different likelihoods, in this paper we focus on Poisson models for analyzing count data in contingency tables. The reason for this choice is threefold: (i) count data for certain compositions of discrete properties occur frequently in practical applications, particularly in a bio-medical context; (ii) feature selection for categorical variables is directly related to methods inferring sparsity on the level of groups of predictors; (iii) the use of certain encoding schemes for categorical variables allows us to derive a highly efficient sampling algorithm that makes full Bayesian inference practical in large-scale applications. Denote by z = (z_1, ..., z_n) the observed counts in a contingency table with n cells. The standard approach to modeling count data is Poisson regression, which involves a log-linear model with independent terms:

  for i = 1, ..., n:  z_i | µ_i ~ Poisson(µ_i) = µ_i^{z_i} e^{−µ_i} / z_i!,   (12)

with the Poisson mean µ_i = e^{η_i} and η_i = x_i^t β. Using a random link

  η_i = x_i^t β + ε_i,  ε_i ~ N(0, σ²),   (13)

makes it possible to allow for deviations from the log-linear model. For instance, overdispersed Poisson models (which are common in practice) can be modeled with random link functions. In the Bayesian context such random links correspond to a conditional Gaussian prior on η:

  p(η_i | β, σ²) = N(η_i | x_i^t β, σ²).   (14)

Note that inputs are available only as counts z. Assuming that the total number of counts is fixed, the observed vector of individual counts z = (z_1, ..., z_n) can be considered a realization drawn from a multinomial distribution. This is the approach taken in (Dahinden et al., 2007) in a penalized likelihood framework. If, on the other hand, we assume a sampling model in which the total number of counts itself is random and the time period of observing the counts is fixed, we arrive at the Poisson model (12). This sampling model is plausible for many practical situations, for example a clinical study of fixed length where the counts correspond to the number of patients with certain properties visiting the hospital. The main technical advantage of the Poisson model lies in the factorization over the cells, given the means µ_i (see eq. (12)).
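Before turning to the dummy encoding of the cells, the following sketch shows one Gibbs sweep over the conditionals in eqs. (9)-(11). It is our own illustration under simplifying assumptions: the design is taken to be orthonormal (X^t X = I), σ² is held fixed rather than sampled from its Inv-χ² conditional, the latent link variables η of eq. (13) are omitted, and the GIG draw uses scipy's geninvgauss with the usual rescaling.

```python
import numpy as np
from scipy.stats import geninvgauss

def gibbs_sweep(beta, lam, rho, y, X, groups, sigma2, r, s):
    """One sweep over the conditionals of eqs. (9)-(11), assuming X^t X = I.

    `groups` is a list of column-index arrays, one per group; sigma2 is kept
    fixed here to keep the sketch short. All variable names are ours.
    """
    p_g = np.array([len(g) for g in groups])

    # --- beta | rest: N(mu, sigma2 * Sigma) with Sigma = (X^t X + Lambda^{-1})^{-1}.
    # With an orthonormal design this matrix is diagonal, so no linear solve is needed.
    lam_expanded = np.concatenate([np.full(len(g), l) for g, l in zip(groups, lam)])
    Sigma_diag = 1.0 / (1.0 + 1.0 / lam_expanded)
    mu = Sigma_diag * (X.T @ y)
    beta = mu + np.sqrt(sigma2 * Sigma_diag) * np.random.randn(X.shape[1])

    # --- lambda_g | rest: GIG(1/2, a_g, b_g), a_g = p_g * rho, b_g = ||beta_g||^2 / sigma2.
    for k, g in enumerate(groups):
        a_g = p_g[k] * rho
        b_g = max(beta[g] @ beta[g] / sigma2, 1e-12)   # guard against a zero-norm group
        lam[k] = geninvgauss.rvs(0.5, np.sqrt(a_g * b_g)) * np.sqrt(b_g / a_g)

    # --- rho | rest: Gamma by conjugacy, cf. eq. (10).
    shape = r + 0.5 * np.sum(p_g + 1)
    rate = 1.0 / s + 0.5 * np.sum(p_g * lam)
    rho = np.random.gamma(shape, 1.0 / rate)

    return beta, lam, rho
```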
To understand the nature of the dummy covariates x, the following notation is helpful: assume our contingency table is based on observations of d categorical random variables or factors, C_1, ..., C_d, where each C_j has K_j levels. The dummy covariates x represent factor interactions, starting with the interactions of order zero (the main effects) up to all interactions of a certain maximum order Q. Interactions are denoted by the colon operator (:). There is one group of dummy variables used for encoding each interaction term. This corresponds to assuming that the means of the η_i (which are in turn the log Poisson means) can be expressed (in vector form) as η = Xβ, with the design matrix X composed of individual submatrices:

  X = [1, X_{C_1}, ..., X_{C_d}, X_{C_1:C_2}, ..., X_{C_{d−1}:C_d}, ..., X_{C_1:⋯:C_{Q+1}}, ..., X_{C_{d−Q}:⋯:C_d}],   (15)

where the first blocks contain the main effects, the following blocks the first-order interactions, and the last blocks the highest-order interactions. To avoid over-parametrization, identifiability constraints are imposed on the individual submatrices in the form of contrast codes, which encode a factor with K levels into a matrix with K−1 columns. Interaction terms are basically column-wise product expansions of these individual (main-effect) matrices. In many practical applications we are given ordered factors, i.e. ordinal variables for which a natural ordering is involved. Examples of this kind are, for instance, intensity levels.

For such ordinal categorical variables, the use of polynomial contrast codes is a natural choice. These encodings employ orthogonal polynomials and have the practical advantage that the (typically huge) design matrix is orthogonal, i.e. X^t X = I. The use of random links not only allows for deviations from the parametric model, but also greatly simplifies Gibbs sampling: when conditioning on η_i, sampling of β and σ² follows the standard procedure in Bayesian regression. The orthogonality property of the design matrix, X^t X = I, makes it possible to sample from very high-dimensional models in an efficient and numerically stable way, since only diagonal matrices have to be inverted, see eq. (9). Updating η_i, on the other hand, is more difficult because the corresponding conditional is not of recognized form:

  p(η_i | β, σ², x_i, z_i) ∝ exp[η_i z_i − exp(η_i) − (1/(2σ²)) (x_i^t β − η_i)²].   (16)

However, the above conditional posterior is log-concave, which makes it possible to use black-box sampling methods like adaptive rejection sampling. Alternatively, we propose to use a Laplace approximation similar to that in (Green & Park, 2004), which in practice turns out to give results that are almost indistinguishable from adaptive rejection sampling while speeding up the computations considerably. Figure 1 summarizes the hierarchical structure of the Poisson Group-Lasso model.

Figure 1. Dependency structure of the hierarchical Poisson Group-Lasso model. On top of this hierarchy are the hyperparameters (r, s) for the Gamma prior on ρ and (ν₀, s₀²) for the Inv-χ² prior on σ². At the other end there are the observed vector of counts z and the dummy covariates x. The core of the diagram represents the hierarchical Normal-Gamma prior on the regression coefficients β and the Normal random link between η and the covariates and coefficients.

Hyperparameter selection. A side effect of using the Laplace approximation for η_i is a good intuition about reasonable priors on σ². Such a prior should be roughly centered around the reciprocal value of the average of all counts, see (Green & Park, 2004) for details. In our implementation we use this rule of thumb by setting s₀² in the Inv-χ²(n₀, s₀²) prior on σ² to 1/(1 + median(z)). Note that the influence of the Inv-χ² prior can be interpreted as adding n₀ virtual samples with variance s₀². The only other hyperparameters in the model are the shape r and scale s in the Gamma(r, s) prior on ρ. Testing for suitable values can be done straightforwardly by first sampling ρ from the hyperprior and in turn sampling λ_g from Gamma((p_g+1)/2, p_g ρ/2). The fraction of large λ_g values encodes our prior belief about the sparsity of the model: since λ_g is a variance parameter for the g-th group of regression coefficients, the fraction of large λ_g values essentially determines the expected sparsity. As a default setting in our implementation we use the criterion that roughly 1% of the λ_g values should exceed 5 · median(λ_g).

4. Simulated Example

In order to illustrate the performance of the Bayesian Group-Lasso, experiments were carried out on simulated data. The data was constructed by assuming 8 categorical variables with 3 levels each (main effects) and all higher-order interaction terms up to 2nd order (a total of 92 groups representing the interaction terms, and 6561 combinations of levels). The orthogonal design matrix X was generated with polynomial contrast codes as described in the previous section.
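Polynomial contrast codes of this kind can be built, for instance, by orthonormalizing a Vandermonde matrix of the ordered factor levels and dropping the constant column; the sketch below (our own construction, not the paper's implementation) does this for 3-level factors and forms an interaction block as a column-wise product, yielding mutually orthogonal design columns.

```python
import numpy as np

def poly_contrasts(K):
    """Polynomial contrast code for an ordered factor with K levels.

    Returns a K x (K-1) matrix whose columns are orthonormal polynomials
    (linear, quadratic, ...) evaluated at the levels, with the constant
    column removed -- the usual identifiability constraint.
    """
    levels = np.arange(1, K + 1, dtype=float)
    V = np.vander(levels, K, increasing=True)   # columns: 1, x, x^2, ...
    Q, _ = np.linalg.qr(V)                      # Gram-Schmidt orthonormalization
    return Q[:, 1:]                             # drop the constant column

def interaction_columns(code_a, cell_a, code_b, cell_b):
    """Dummy columns of an interaction term A:B, one row per table cell.

    Column-wise products of the two main-effect contrast codes.
    """
    A = code_a[cell_a]
    B = code_b[cell_b]
    return np.einsum('ni,nj->nij', A, B).reshape(len(cell_a), -1)

# Example: two ordered factors with 3 levels each -> a 9-cell contingency table.
K = 3
cells = np.array([(i, j) for i in range(K) for j in range(K)])
C = poly_contrasts(K)
X = np.hstack([np.ones((K * K, 1)),                                   # intercept
               C[cells[:, 0]],                                        # main effect 1
               C[cells[:, 1]],                                        # main effect 2
               interaction_columns(C, cells[:, 0], C, cells[:, 1])])  # interaction 1:2
print(np.allclose(X.T @ X, np.diag(np.diag(X.T @ X))))                # columns are orthogonal
```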
Then three factors were chosen for generating the counts, namely one main effect (variable 1), a first-order interaction (1:5) and a second-order interaction (6:7:8). For these factors, positive values of β were taken, with all other β values fixed to zero. The counts were then generated using eq. (12) and eq. (13) with σ² = 0.1. Hyperparameters were specified using our default procedure described above. The example traceplot in the lower panel of Figure 2 indicates that the Markov chain converges almost immediately, an observation which is corroborated by a run-length control diagnosis according to (Raftery & Lewis, 1992), indicating that the necessary burn-in period is probably 100 samples. Gibbs sampling was then run, and the posterior distributions of the coefficients β_i were analyzed based on every 5th sample. The upper panel of Figure 2 shows the resulting estimation of significant interaction terms, with significance measured by either the fraction of positive or negative samples, depending on the sign of the mean value of β_i. The size of the circles encodes this significance measure for the main effects; the linewidth of the blue edges (1st-order interactions) and reddish triangles (2nd-order interactions) represents the importance of the higher-order terms, ranging upward from 0.7 (the thinnest lines plotted); a level of 0.5 represents complete randomness.
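The significance measure just described, i.e. the fraction of posterior samples that share the sign of the posterior mean, can be computed directly from the thinned MCMC output; the helper below is a minimal sketch with hypothetical array shapes.

```python
import numpy as np

def sign_significance(samples):
    """Fraction of posterior samples on the dominant side of zero.

    `samples` has shape (n_draws, n_terms); for each term the value is the
    fraction of positive or negative draws, whichever matches the sign of
    the posterior mean. A value of 0.5 corresponds to complete randomness.
    """
    pos_frac = (samples > 0).mean(axis=0)
    means = samples.mean(axis=0)
    return np.where(means >= 0, pos_frac, 1.0 - pos_frac)

# Example with fake draws for three terms: clearly positive, clearly negative,
# and centred at zero.
rng = np.random.default_rng(0)
draws = np.column_stack([rng.normal(1.0, 0.5, 4000),
                         rng.normal(-0.8, 0.5, 4000),
                         rng.normal(0.0, 0.5, 4000)])
print(sign_significance(draws))   # roughly [0.98, 0.95, 0.5]
```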

Figure 2. Simulated data: 8 categorical variables with 3 levels each, and three truly nonzero interaction terms: 1, 1:5, 6:7:8. Upper panel: interactions identified from the posterior distributions. The size of the circles indicates the estimated significance of the main effects: 97.5% of the posterior samples for variable 1 have a positive sign. Correspondingly, the linewidth of the interactions (blue lines: 1st-order, reddish triangles: 2nd-order) indicates their significance. Lower panel: example traceplot and moving average for the 2nd-order interaction 6:7:8.

All three interaction terms could be clearly identified. There are 3 additional spurious interaction terms whose significance level, however, is very low (0.7).

5. Application to Breast Cancer Studies

Breast cancer and immunohistochemistry. In Western societies breast cancer is one of the leading causes of tumor-induced death in women. Despite improvements in the identification of prognostic and predictive parameters, novel biomarkers are needed to improve patient risk stratification and to optimize patient outcome. Furthermore, the identification of molecules that are differentially regulated during tumorigenesis may lead to the development of personalized therapeutic approaches. Recently, independent research groups were able to identify five distinct gene expression patterns which (i) are highly predictive of the patients' prognosis and (ii) may reflect the biological behavior better than established parameters. According to this model, a basal as well as two distinct luminal-like expression patterns, in addition to a Her2 (ERBB2) overexpressing and a normal breast-like group, could be distinguished (Perou et al., 2000). Even though results from mRNA expression profiling are very convincing, there are still some limitations to its clinical application due to the high costs. Several studies have shown that biologically distinct classes of breast cancer as defined by mRNA expression analysis can also be identified with cost-efficient immunohistochemistry (Abd El-Rehim et al., 2005; Diallo-Danebrock et al., 2007). Definitions of the basal phenotype by the former and other groups using different cytokeratin antibodies have proven robust and allow the identification of this tumor type on a routine basis.

Tissue microarrays. In the present study, intensity levels of the following immunohistochemical markers in tissue samples were measured utilizing the tissue microarray (TMA) technology: the estrogen receptor (ER), karyopherin-alpha-2 (KPNA2), an anti-cytokeratin marker, the fibrous structural protein Collagen-6, the membrane-associated tetraspanin protein Claudin-7, the inter-α-trypsin inhibitor, and the human epidermal growth factor receptor Her2. The TMA technology promises to significantly accelerate studies seeking associations between molecular changes and clinical endpoints (Kononen & Bubendorf et al., 1998).

Figure 3. Kaplan-Meier curves of overall survival for the low-risk (upper red curve) and the high-risk group (lower blue curve) of breast cancer patients. Error bars define standard 95% confidence intervals.

Experimental design. After histopathological grading of tumors according to (Elston & Ellis, 1991), patients were divided into a low-risk (grade 1-2) and a high-risk (grade 3) group. The overall goal of this experiment was to identify differences in the interaction patterns of marker proteins between these groups.
Kaplan-Meier survival analysis in Figure 3 shows that the chosen split is meaningful in the sense that the survival probability differs significantly between the two groups. This observation was corroborated by the analysis of mean protein expression levels in the two groups (Figure 4). As expected, for the low-risk class of patients (grade 1-2) there was marked estrogen receptor (ER) expression, whereas KPNA2 was negative.

Figure 5. Identified interaction patterns for the low-risk group (left) and the high-risk group (right). The size of the circles indicates the estimated significance of the main effects. For instance, the largest circle means that more than 95% of the posterior samples of the corresponding main effect are negative. Correspondingly, the linewidth of the interactions (blue lines: 1st-order, reddish triangles: 2nd-order) indicates their significance, see Figure 6 for an example.

Figure 4. Distribution of protein expression levels (3 bins corresponding to low, intermediate, high) in the low-risk group (left) and the high-risk group (right).

Data analysis and interpretation. Having identified a meaningful subdivision into patient groups, we focused on identifying interaction patterns in each of the groups. The observed expression of each of the proteins was represented as a factor with 3 levels (low, intermediate, high). The resulting contingency tables were analyzed separately for each group with our Bayesian Poisson Group-Lasso model. Interaction terms up to the second order (i.e. interactions between triples of factors) were analyzed. One million Gibbs samples were drawn, the burn-in phase contained the first 200,000 samples, and every 5th of the remaining samples was used for computing the posterior densities. Due to our efficient algorithm, it took less than one hour to compute all samples on a standard computer. Traceplots indicated that convergence of the Markov chain was not really an issue in this experiment (see Figure 6 for an example), an observation which is corroborated by a run-length control diagnosis according to (Raftery & Lewis, 1992). In the low-risk group the following interaction terms appeared to be highly significant: the two main effects of KPNA2 and ER expression, the first-order interaction KPNA2:ER and the second-order interaction KPNA2:ER:Her2, see Figure 5. Interpreting higher-order interaction terms can be a complex problem. A closer analysis of the contrast codes and the signs of the regression coefficients showed, however, that all these interaction terms explain the observed counts by either marginal or joint negative immunoreactivity for KPNA2, ER and Her2, respectively. The high-risk group showed a distinctly different interaction pattern, which is dominated by the main effect of ER expression, the interaction ER:KPNA2 and one weak second-order interaction. Again, looking into the contrast codes and the signs, we concluded that the main effect of ER explains the counts by an over-represented intermediate bin and both under-represented low and high bins. The interaction term ER:KPNA2 explains counts mainly by a joint intermediate expression. These interaction patterns are in line with known gene expression patterns of breast cancer. Non-high-grade breast cancers (grade 1-2) were mainly hormone-receptor positive, and negative for the high-grade markers KPNA2 and Her2. In a study by Dahl et al. (Dahl et al., 2006), high rates of KPNA2 expression were significantly associated with positive TP53 and Her2 immunoreactivity and a high proliferation index. Besides, KPNA2 seemed to be characteristic of the basal-like subtype of breast cancers, possibly representing a different clinical entity of breast tumors, which is associated with short survival times and a high frequency of TP53 mutations.

Control experiments. In order to compare our results with other analysis methods we conducted two control experiments. For the low-risk group, Figure 7 shows the solution path computed by the non-Bayesian analogue of our method, the standard Group-Lasso with Poisson likelihood.
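Such a solution path can be traced, for example, with a generic proximal-gradient loop that applies group-wise soft-thresholding to the penalized Poisson log-likelihood; the sketch below is our own illustration of the idea (with a fixed step size chosen purely for illustration) and is not the algorithm of (Roth & Fischer, 2008) that was actually used for Figure 7.

```python
import numpy as np

def prox_group(beta, groups, t):
    """Group-wise soft-thresholding: proximal map of t * sum_g ||beta_g||_2."""
    out = beta.copy()
    for g in groups:
        nrm = np.linalg.norm(beta[g])
        out[g] = 0.0 if nrm <= t else (1.0 - t / nrm) * beta[g]
    return out

def poisson_group_lasso_path(z, X, groups, lambdas, step=1e-3, iters=2000):
    """Group norms of group-penalized Poisson regression along a penalty grid.

    Columns not listed in `groups` (e.g. the intercept) are left unpenalized.
    """
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:                                # warm-start each penalty value
        for _ in range(iters):
            grad = X.T @ (np.exp(X @ beta) - z)        # gradient of -log-likelihood
            beta = prox_group(beta - step * grad, groups, step * lam)
        path.append([np.linalg.norm(beta[g]) for g in groups])
    return np.array(path)
```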
We used the algorithm described in (Roth & Fischer, 2008). The solution path shows the evolution of the individual group norms when relaxing the constraint κ, see eq. (3). The plot indicates that the main effects ER and KPNA2 and the interactions KPNA2:ER and KPNA2:ER:Her2 have a dominating role, which is in perfect agreement with our results. At the same time, the more diffuse picture for large constraint values κ > 100, together with the difficulty of defining meaningful variance estimates, nicely demonstrates the inherent interpretation problems of classical Group-Lasso solutions.

Figure 6. Example traceplot for the very strong 1st-order interaction KPNA2:ER in the low-risk group, with moving average (upper panel). Corresponding posterior density (lower panel). More than 90% of the samples exceed zero (red area).

Figure 7. Comparison with the non-Bayesian analogue for the low-risk group: evolution of group norms ("solution path") obtained by relaxing the constraint in the standard Group-Lasso with Poisson likelihood.

The test for uniqueness/completeness of solutions proposed in (Roth & Fischer, 2008) reveals another problem: for any reasonable numerical tolerance parameter in the optimization process, the solutions found by the Group-Lasso are probably not uniquely identifiable. For constraint values κ > 90 there is an increasing number of inactive (i.e. zero-norm) groups that might become active in alternative solutions which are ε-close (in terms of likelihood) to the found optimal solution. This problem might be viewed as another strong argument for following the Bayesian paradigm of averaging over Group-Lasso solutions, instead of focusing on a single (penalized) maximum likelihood solution. For a second control experiment we used the same data to estimate a Bayesian network. Concerning the identification of interactions, the main technical differences to our Group-Lasso model are the restriction to a graph (instead of a hypergraph), and the use of directed edges, which in some cases can be used for inferring causal relations.

Figure 8. Two examples of the top-scoring Bayes nets with almost indistinguishable scores, showing typical variations in topology. For quantifying the score differences we made a perturbation experiment: when randomly leaving out 10% of the observations, the standard deviation of the individual maximum scores is σ ≈ 2. More than 50 different networks in the unperturbed problem lie within the highest one-σ region.

We used the deal package (Bøttcher & Dethlefsen, 2003; 2009), which finds the topology on the basis of the network score, basically the log of the joint probability of the graph and the data. In the resulting analysis, we observe many similarities between the Markov blankets from the Bayes nets and from our model. On the other hand, neither the network topology nor the direction of edges seems to be very stable. Among the top-scoring models many variants of the network have almost indistinguishable scores. Most of these fluctuations concern the dependencies between the three variables KPNA2, ER and Her2, see Figure 8 for an example. The clear identification of a second-order interaction between these three variables in our model (the bold red triangle in the left panel of Figure 5) might be interpreted as a strong advantage of explicitly modeling higher-order interactions in a hypergraph.

6. Conclusion

This paper has presented a full Bayesian treatment of the Group-Lasso which is applicable to many real-world learning problems.
Despite the generality of the proposed model, the main focus was on Poisson likelihood models for analyzing count data in contingency tables, with feature selection, because (i) they occur frequently in a bio-medical context, (ii) they are endowed with an inherent group structure representing the interaction terms of the categorical variables, and (iii) they allow for efficient Gibbs sampling, making full Bayesian inference practical in real-world examples. On the theoretical side we have derived a hierarchical Normal-Gamma expansion of the Multi-Laplace prior on the regression coefficients. This expansion is one of the key components for rewriting the Group-Lasso model in a form that is suitable for efficient Gibbs sampling. The second key component is the orthogonality property of the design matrix representing dummy encodings for categorical variables, which ensures that sampling becomes numerically stable and highly efficient.

When applied to a real-world breast cancer study, the proposed method identifies differences in the interaction patterns of marker proteins between patient groups. To our knowledge, this is the first study which systematically analyzes the influence of higher-order interactions based on immunohistochemical data for breast cancer. Interpretation of the results by pathologists further validates our approach, since the findings are in line with known gene expression patterns of breast cancer and hence support the usage of cost-effective immunohistochemical markers. The comparison of the method with standard approaches (standard Group-Lasso, Bayesian networks) validates the results and also highlights the advantages of the proposed method over existing approaches. The related software is going to be published as an R package.

Acknowledgments

The work was partly financed with a grant of the Swiss SystemsX.ch Initiative to the project LiverX of the Competence Center for Systems Physiology and Metabolic Diseases. The LiverX project was evaluated by the Swiss National Science Foundation.

References

Abd El-Rehim, D., Ball, G., & Pinder, S. E. et al. (2005). High-throughput protein expression analysis using tissue microarray technology of a large well-characterised series identifies biologically distinct classes of breast cancer confirming recent cDNA expression analyses. Int J Cancer, 116.

Bøttcher, S. G., & Dethlefsen, C. (2003). deal: A package for learning Bayesian networks. Journal of Statistical Software, 8.

Bøttcher, S. G., & Dethlefsen, C. (2009). deal: Learning Bayesian networks with mixed variables. R package.

Dahinden, C., Parmigiani, G., Emerick, M., & Bühlmann, P. (2007). Penalized likelihood for sparse contingency tables with an application to full-length cDNA libraries. BMC Bioinformatics, 8, 476.

Dahl, E., Kristiansen, G., & Gottlob, K. et al. (2006). Molecular profiling of laser-microdissected matched tumor and normal breast tissue identifies karyopherin alpha-2 as a potential novel prognostic marker in breast cancer. Clin Cancer Res, 12.

Diallo-Danebrock, R., Ting, E., & Gluz, O. et al. (2007). Protein expression profiling in high-risk breast cancer patients treated with high-dose or conventional dose-dense chemotherapy. Clin Cancer Res, 13.

Elston, C., & Ellis, I. (1991). Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology, 19.

Figueiredo, M., & Jain, A. (2001). Bayesian learning of sparse classifiers. Proc. IEEE Comp. Soc. Conf. Computer Vision and Pattern Recognition.

Gelman, A., Carlin, J., Stern, H., & Rubin, D. (1995). Bayesian Data Analysis. Chapman & Hall.

Green, P., & Park, T. (2004). Bayesian methods for contingency tables using Gibbs sampling. Statistical Papers, 45.

Kim, Y., Kim, J., & Kim, Y. (2006). Blockwise sparse regression. Statistica Sinica, 16.

Kononen, J., & Bubendorf, L. et al. (1998). Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nature Medicine, 4(7).

Meier, L., van de Geer, S., & Bühlmann, P. (2008). The Group Lasso for logistic regression. J. Roy. Stat. Soc. B, 70.

Park, T., & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103.

Perou, C., Sorlie, T., & Eisen, M. B. et al. (2000). Molecular portraits of human breast tumours. Nature, 406.

Raftery, A., & Lewis, S. (1992). One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Statistical Science, 7.

Roth, V., & Fischer, B. (2008). The Group-Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. ICML '08: Proceedings of the 25th International Conference on Machine Learning. New York, NY, USA: ACM.

Seeger, M. (2008). Bayesian inference and optimal design in the sparse linear model. Journal of Machine Learning Research, 9.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Stat. Soc. B, 58.

Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. B, 68.


More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Describing Contingency tables

Describing Contingency tables Today s topics: Describing Contingency tables 1. Probability structure for contingency tables (distributions, sensitivity/specificity, sampling schemes). 2. Comparing two proportions (relative risk, odds

More information

Machine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart

Machine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart Machine Learning Bayesian Regression & Classification learning as inference, Bayesian Kernel Ridge regression & Gaussian Processes, Bayesian Kernel Logistic Regression & GP classification, Bayesian Neural

More information

A Bayesian Treatment of Linear Gaussian Regression

A Bayesian Treatment of Linear Gaussian Regression A Bayesian Treatment of Linear Gaussian Regression Frank Wood December 3, 2009 Bayesian Approach to Classical Linear Regression In classical linear regression we have the following model y β, σ 2, X N(Xβ,

More information

Lecture 10: Logistic Regression

Lecture 10: Logistic Regression BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics Lecture 10: Logistic Regression Jie Wang Department of Computational Medicine & Bioinformatics University of Michigan 1 Outline An

More information

Graph Wavelets to Analyze Genomic Data with Biological Networks

Graph Wavelets to Analyze Genomic Data with Biological Networks Graph Wavelets to Analyze Genomic Data with Biological Networks Yunlong Jiao and Jean-Philippe Vert "Emerging Topics in Biological Networks and Systems Biology" symposium, Swedish Collegium for Advanced

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Linear Models for Regression CS534

Linear Models for Regression CS534 Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models

Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Kohta Aoki 1 and Hiroshi Nagahashi 2 1 Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

Estimating Sparse High Dimensional Linear Models using Global-Local Shrinkage

Estimating Sparse High Dimensional Linear Models using Global-Local Shrinkage Estimating Sparse High Dimensional Linear Models using Global-Local Shrinkage Daniel F. Schmidt Centre for Biostatistics and Epidemiology The University of Melbourne Monash University May 11, 2017 Outline

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Least Squares Regression

Least Squares Regression CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu 1,2, Daniel F. Schmidt 1, Enes Makalic 1, Guoqi Qian 2, John L. Hopper 1 1 Centre for Epidemiology and Biostatistics,

More information

Generalized Linear Models

Generalized Linear Models York SPIDA John Fox Notes Generalized Linear Models Copyright 2010 by John Fox Generalized Linear Models 1 1. Topics I The structure of generalized linear models I Poisson and other generalized linear

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information