Penalized methods in genome-wide association studies


University of Iowa, Iowa Research Online: Theses and Dissertations, Summer 2011.

Penalized methods in genome-wide association studies. Jin Liu, University of Iowa. Copyright 2011 Jin Liu. This dissertation is available at Iowa Research Online.

Recommended Citation: Liu, Jin. "Penalized methods in genome-wide association studies." PhD (Doctor of Philosophy) thesis, University of Iowa, 2011.

PENALIZED METHODS IN GENOME-WIDE ASSOCIATION STUDIES by Jin Liu. An Abstract Of a thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Statistics in the Graduate College of The University of Iowa, July 2011. Thesis Supervisors: Professor Jian Huang, Associate Professor Kai Wang

ABSTRACT

Penalized regression methods are becoming increasingly popular in genome-wide association studies (GWAS) for identifying genetic markers associated with disease. However, standard penalized methods such as the LASSO do not take into account the possible linkage disequilibrium (LD) between adjacent markers. We propose a novel penalized approach for GWAS using a dense set of single nucleotide polymorphisms (SNPs). The proposed method uses the minimax concave penalty (MCP) for marker selection and incorporates LD information by penalizing the difference of the genetic effects at adjacent SNPs with high correlation. A coordinate descent algorithm is derived to implement the proposed method. This algorithm is efficient and stable in dealing with a large number of SNPs. A multi-split method is used to calculate p-values for the selected SNPs in order to assess their significance. We refer to the proposed penalty function as the smoothed minimax concave penalty (SMCP) and the proposed approach as the SMCP method. The performance of the SMCP method and its comparison with a LASSO approach are evaluated through simulation studies, which demonstrate that the proposed method is more accurate in selecting associated SNPs. Its applicability to real data is illustrated using data from a GWAS on rheumatoid arthritis.

Based on the idea of the SMCP, we propose a new penalized method for group variable selection in GWAS that respects the correlation between adjacent groups. The proposed method uses the group LASSO for encouraging group sparsity and a

quadratic difference penalty for adjacent group smoothing. We call it the smoothed group LASSO, or SGL for short. Canonical correlations between two adjacent groups of SNPs are used as the weights in the quadratic difference penalty. Principal components are used to reduce dimensionality locally within groups. We derive a group coordinate descent algorithm for computing the solution path of the SGL. Simulation studies are used to evaluate the finite sample performance of the SGL and the group LASSO. We also demonstrate its applicability on rheumatoid arthritis data.

PENALIZED METHODS IN GENOME-WIDE ASSOCIATION STUDIES by Jin Liu. A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Statistics in the Graduate College of The University of Iowa, July 2011. Thesis Supervisors: Professor Jian Huang, Associate Professor Kai Wang

Copyright by JIN LIU 2011. All Rights Reserved.

Graduate College, The University of Iowa, Iowa City, Iowa. CERTIFICATE OF APPROVAL: PH.D. THESIS. This is to certify that the Ph.D. thesis of Jin Liu has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Statistics at the July 2011 graduation. Thesis Committee: Jian Huang, Thesis Supervisor; Kai Wang, Thesis Supervisor; Kung-Sik Chan; Aixin Tan; Dale Zimmermann

ACKNOWLEDGEMENTS

I am deeply grateful to all those people for their advice, help, and support. It has been a great opportunity for me to learn many important things from their example, both in academia and in everyday life. Without them, this work would have been impossible. I gratefully acknowledge the invaluable supervision of Dr. Jian Huang during this work. His supervision has been like a torchlight in the darkness, giving me the courage and confidence to walk through it, and he sets an example through his attitude toward research. His insight and patience were vital in inspiring my interest in research and shaping my future career path. I am also deeply indebted to my co-advisor, Dr. Kai Wang, for his patience in correcting my manuscripts and commenting on them. His knowledge of genetics played a vital role in this work. I would like to express my warm and sincere thanks to all committee members, Dr. Kung-Sik Chan, Dr. Aixin Tan, and Dr. Dale Zimmermann. I would also like to thank my mom and dad for their great support in my life.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1  INTRODUCTION
    Literature Review
    Proposed Penalized Methods
    Overview of this thesis

CHAPTER 2  SMOOTHED MINIMAX CONCAVE PENALIZATION
    The SMCP method
        Comparison with fused LASSO
    Genome-wide screening incorporating LD
        Computation
    Joint SMCP Model in GWAS
        Joint Loss Function
        Coordinate Descent Algorithm
    Methods
        Tuning parameter selection
        P-values for the selected SNPs
    Simulation and Empirical Studies
        Simulation Studies and Results
        Application to GAW 16 RA Data
        Application to GAW 17 Data

CHAPTER 3  SMOOTHED GROUP LASSO
    The SGL method
    Genome-wide group screening
        A GCD Algorithm in Marginal Model
        Properties of the Second Penalty
    Joint SGL Model in GWAS
    Methods
        Principal Components
        Canonical Correlation
        Selection of the Tuning Parameters
        P-values for the Selected SNPs
    Simulation and Empirical Studies
        Simulation Studies and Results
        Application to GAW 16 RA Data

CHAPTER 4  REGULARIZED LOGISTIC REGRESSION
    Unpenalized Logistic Regression
    Logistic Regression Model with SMCP
        Marginalized Logistic Regression with SMCP
        Joint Logistic Regression with SMCP
    MM algorithm in Logistic Regression
    Logistic Regression with SGL penalty
    Simulation and Empirical Studies
        Simulation Studies and Results
        Rheumatoid Arthritis Data Example

CHAPTER 5  DISCUSSION AND FUTURE WORK

REFERENCES
APPENDIX

LIST OF TABLES

- Simulation results for marginalized linear model with SMCP
- Simulation results for joint linear model with SMCP
- List of 50 SNPs selected by the marginalized SMCP method and marginalized LASSO method for a simulated data set
- SNPs found significant other than chromosome 6
- SNPs selected by the SMCP and LASSO for trait Q1 in replicate
- Mean and standard error of true positives and false positives for selected SNPs over 200 replicates for trait Q
- True positive, false discovery rate (FDR) and false negative rate (FNR) for simulated data from example 1
- True positive, false discovery rate (FDR) and false negative rate (FNR) for simulated data from example 2
- True positive, false discovery rate (FDR) and false negative rate (FNR) for simulated data from example 3 under joint model
- True positive, false discovery rate (FDR) and false negative rate (FNR) for simulated data from example 3 under marginal model
- Multi-split p-values for a simulated data set in example
- True positive, false discovery rate (FDR) and false negative rate (FNR) for simulated data under marginalized logistic regression model with SMCP
- True positive, false discovery rate (FDR) and false negative rate (FNR) for simulated data under joint logistic regression model with SMCP
- True positive (number of groups), false discovery rate (FDR) and false negative rate (FNR) for simulated data under logistic regression model with SGL

LIST OF FIGURES

- The l1 penalty for the LASSO along with the MCP and SCAD, λ = 1, γ = 3
- Correlation plots in Chromosome 6 from Genetic Analysis Workshop 16 Rheumatoid Arthritis dataset
- Graph of the function R under different a_j, b_j and c_j
- Plot of -log10(p-value) for SNPs on chromosome 6 selected by (a) the SMCP method and (b) the LASSO method for the rheumatoid arthritis data; p-values generated using the multi-split method, with the horizontal line at the significance level
- Genome-wide plot of β estimates for the SMCP and the MCP methods on binary trait
- Genome-wide plot of β estimates for the LASSO and regular single-SNP linear regression on binary trait
- Genome-wide plot of β estimates for the SMCP and MCP methods on quantitative trait
- Genome-wide plot of β estimates for the LASSO and regular single-SNP linear regression on quantitative trait
- Correlation plots along the genome from Genetic Analysis Workshop 17 simulated dataset
- Genome-wide plot of β estimates for regular single-SNP linear regression
- Boxplots for comparison of residuals on genotypes of selected SNPs; the y axis is for residuals and the x axis for genotypes
- Solution path for a simulated data set for (a) group LASSO, and SGL with (b) η = 0.5, (c) η = 0.2 and (d) η = 0.1; black lines are paths of non-zero groups and grey lines are paths of irrelevant groups
- Genome-wide plot of β for SGL and group LASSO and β for simple linear regression
- β estimates on chromosome 6 for marginalized logistic regression with the SMCP and the LASSO, and the regular single-SNP linear regression
- β estimates on chromosome 6 for joint logistic regression with the SMCP and the LASSO
- β estimates on chromosome 6 for logistic regression with the SGL and the group LASSO

CHAPTER 1
INTRODUCTION

Variable selection is an important topic in statistics. In practice, a large number of predictors are measured in order to capture all the factors possibly related to the outcome, but usually only a small fraction of these predictors are important. As a model becomes more complex, its bias decreases but its variance increases, so one needs to balance bias against variance. Traditionally, researchers have used forward selection, backward elimination, and best-subset methods for model selection, but the computation for these methods grows exponentially as the number of predictors increases. In genome-wide association studies (GWAS), there are usually hundreds of thousands of single nucleotide polymorphisms (SNPs) measured on only hundreds of subjects. The high dimensionality of SNP data makes these traditional methods computationally infeasible. Penalized methods are effective tools for variable selection that have been shown, both theoretically and empirically, to be suitable for high-dimensional data. This chapter first introduces the background of GWAS and the development of penalized methods. Several widely used penalized methods are briefly presented, and then the proposed penalized methods are briefly introduced. More specific details are presented in the relevant chapters.

1.1 Literature Review

A genome-wide association study (GWAS) involves rapidly scanning genetic markers across the genomes of individuals of a particular species to find genetic variants associated with a particular disease. GWAS are particularly useful for finding genetic variants that contribute to common or complex diseases, such as rheumatoid arthritis, glaucoma, and diabetes. Currently, GWAS rely on single-marker analysis, testing the association between each individual SNP and the trait, whether binary or quantitative, since such analysis is simple and readily applicable (Cho et al., 2010). While single-marker analysis can be used when a single genetic variant is responsible for a particular trait, it may not be appropriate for investigating a complex polygenic trait, for the following reasons. First, due to the large number of genetic markers, some form of multiple-testing adjustment is required to control the increased probability of false positive findings. Second, single-marker analysis does not take into account the correlation structure, i.e., the linkage disequilibrium (LD), in the SNP data. Therefore, it is important to develop a method capable of considering all genetic factors jointly. From another perspective, this kind of problem can also be viewed as variable selection in high-dimensional data, and penalized methods are well suited to this purpose.

Model selection is a classical topic in statistics and is well studied for traditional models; much work has been done in this field. Penalized regression methods have arisen in the last decade. These approaches include the LASSO (Tibshirani, 1996), the bridge penalty (Frank and Friedman, 1993), the SCAD (Fan and Li, 2001), the nonnegative garrote (Breiman, 1995), the elastic net (Zou and Hastie, 2005), the adaptive LASSO

(Zou, 2006), the fused LASSO (Tibshirani et al., 2005), and the minimax concave penalty (MCP) (Zhang, 2010). Much work has been done to understand the properties of these penalties in both the p < n and p >> n settings for various models.

The LASSO (least absolute shrinkage and selection operator) is a penalized method imposing an l1-penalty on the regression coefficients (Tibshirani, 1996). It was shown in Tibshirani (1996) that the LASSO can shrink regression coefficients to 0. Knight and Fu (2000) studied the asymptotic properties of LASSO-type estimators and showed that the estimates have positive probability of being exactly 0. Therefore, the LASSO can be used for estimation and automatic variable selection in regression analysis. Consider a linear regression model with p variables:

    y_i = β_0 + x_i'β + ε_i,    (1.1)

where y_i is the ith response, ε_i is the ith error term, x_i is a p x 1 covariate vector, and β is the vector of regression coefficients. Without loss of generality, I assume that the features have been standardized so that Σ_{i=1}^n x_ij = 0 and (1/n) Σ_{i=1}^n x_ij^2 = 1. The objective function Q for a general loss function L_n(β) with the LASSO penalty can then be written as

    Q(β) = (1/n) L_n(β) + λ Σ_j |β_j|,

where λ ≥ 0 is the penalty parameter. The LASSO provides a computationally feasible way to perform variable selection in high-dimensional settings (Tibshirani, 1996).
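As a concrete illustration of how a criterion of this form is optimized in practice, the following minimal sketch (not taken from the thesis; the function names and settings are illustrative) applies the coordinate-wise soft-thresholding update for the squared-error loss with standardized predictors.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/2n)||y - X beta||^2 + lam * ||beta||_1,
    assuming each column of X is standardized so that (1/n) * sum_i x_ij^2 = 1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                      # residuals
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * beta[j]     # partial residual (remove SNP j)
            z = X[:, j] @ r / n           # univariate least-squares coefficient
            beta[j] = soft_threshold(z, lam)
            r = r - X[:, j] * beta[j]     # restore full residual
    return beta
```

Each coordinate update is a univariate least-squares fit followed by soft-thresholding, which is why this type of algorithm scales to very large numbers of predictors.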

Recently, this approach has been applied to GWAS for selecting associated SNPs (Wu et al., 2009). It has been shown that the LASSO is selection consistent if the predictors satisfy the irrepresentable condition, and that this condition is almost necessary (Zhao and Yu, 2006). The condition is stringent, and there is no known mechanism to verify it in GWAS. Zhang and Huang (2008) studied the sparsity and the bias of the LASSO in high-dimensional linear regression models. They showed that, under reasonable conditions, the LASSO selects a model of the correct order of dimensionality. However, the LASSO tends to over-select unimportant variables, so direct application of the LASSO to GWAS tends to generate findings with high false positive rates. Another limitation of the LASSO is that, if there is a group of variables among which the pairwise correlations are high, the LASSO tends to select only one variable from the group, without regard to which one is selected (Zou and Hastie, 2005).

As shown by Fan and Li (2001), a good penalty function should have three properties: unbiasedness, sparsity, and continuity. The LASSO lacks unbiasedness: its estimates are always biased, since its penalization rate is constant no matter how large the parameters are. Hence, the LASSO can perform variable selection but always produces biased estimates.

Several methods attempting to improve on the LASSO have been proposed. The adaptive LASSO (Zou, 2006) uses data-dependent weights for penalizing the coefficients in the l1 penalty; Zou (2006) chose the inverses of the ordinary least-squares estimates as the weights. It enjoys the oracle properties under some mild regularity conditions; in other words, it performs as well as if the underlying model were known beforehand.

When the number of predictors is much larger than the sample size, however, the adaptive weights cannot be initialized easily. The elastic net (Zou and Hastie, 2005) can effectively deal with certain correlation structures in the predictors by using a combination of ridge and LASSO penalties. Fan and Li (2001) introduced the smoothly clipped absolute deviation (SCAD) method. Zhang (2010) proposed the flexible minimax concave penalty (MCP), which attenuates the shrinkage effect that leads to bias. Both the SCAD and the MCP belong to the same family of quadratic spline penalties, and both lead to oracle selection results (Zhang, 2010). The penalties are illustrated in Fig. 1.1. The MCP has a simpler form and requires weaker conditions for the oracle property; we refer to Zhang (2010) and Mazumder et al. (2009) for detailed discussion.

For SNP data, some of the correlations among markers are very high due to LD. In the context of GWAS, none of the methods mentioned above considers the natural ordering of SNPs along the genome with respect to their physical positions and the possible linkage disequilibrium (LD) between adjacent SNPs, and this inadequacy limits their direct application in GWAS. The fused LASSO considers the ordering of the coefficients, but it is inappropriate here, since the effect of association for a SNP (as measured by its regression coefficient) is only identifiable up to its absolute value: a homozygous genotype can be equivalently coded as either 0 or 2 depending on the choice of the reference allele. A new penalized method incorporating linkage disequilibrium, suitable for GWAS, is briefly introduced in the next section, and more details can be found in Chapter 2.

To accommodate situations with grouping structure among the variables,

[Figure 1.1: The l1 penalty for the LASSO along with the MCP and SCAD, λ = 1, γ = 3. Panels: (a) penalty; (b) penalization rate.]

the group LASSO (Bakin, 1999) was introduced as an extension of the LASSO penalty to a group norm. In the group LASSO, an l2 norm of the coefficients associated with a group of variables is used in the penalty function. Yuan and Lin (2006) considered group selection methods based on the LASSO, the LARS, and the nonnegative garrote and proposed algorithms for these types of group selection. Kim et al. (2006) considered group selection in generalized linear models and proposed an efficient algorithm and a blockwise standardization method. Meier et al. (2008) extended the group LASSO to logistic regression models; they showed that the group LASSO for logistic regression is consistent under certain conditions and developed a block coordinate descent algorithm that can be applied to high-dimensional data. Huang et al. (2009) proposed a group bridge method that can carry out feature selection at the group level and at the within-group individual-variable level simultaneously. Wei and Huang (2008) studied the selection and estimation properties of the group

LASSO in high-dimensional settings when the number of groups exceeds the sample size. Huang et al. (2010) later extended the group LASSO to the group MCP (gMCP). They showed that the gMCP enjoys an oracle property, meaning that with high probability the gMCP estimator equals the oracle estimator obtained as if the true model were known. They also proposed an efficient and stable group coordinate descent algorithm for computing the solution path of the gMCP.

Another important aspect of penalized methods is solving the penalized objective function efficiently. Closed-form solutions are not available for penalized methods in high-dimensional settings, so optimization algorithms must be developed. Least angle regression (LARS) (Efron et al., 2004) is an algorithm that can be easily modified to produce the LASSO solutions; LARS produces a full piecewise-linear solution path, which is very useful in cross-validation. For the SCAD or the MCP, optimization is more complicated. Fan and Li (2001) proposed a local quadratic approximation (LQA) to the penalty function, after which a modified Newton-Raphson algorithm can optimize the penalized objective function. Zou and Li (2008) proposed a local linear approximation (LLA) algorithm that linearizes the penalty function, after which the LARS algorithm can be used to compute the solution. The LLA algorithm is inefficient in that it cannot produce the whole solution path in one pass. The coordinate descent algorithm has been found to be efficient for penalized methods, and much work has been done in this field (Breheny and Huang, 2011, Friedman et al., 2007, 2010, Mazumder et al., 2009, Wu and Lange, 2007).

1.2 Proposed Penalized Methods

Before presenting the proposed penalized methods, the genetic background relevant to our study is introduced first. DNA is a nucleic acid that contains the instructions for the development and functioning of all known living organisms except for some viruses. To some extent, DNA is like a blueprint, since it contains the instructions needed to construct other components of cells. The DNA segments carrying this genetic information are called genes. DNA consists of two chains of subunits twisted around one another to form a double-stranded helix. The subunits of each strand are nucleotides, each of which contains one of four chemical constituents called bases: adenine (A), thymine (T), guanine (G), and cytosine (C). Bases are paired: adenine pairs with thymine, while guanine pairs with cytosine. Inside the nucleus of a living cell, genes are arranged in linear order along thread-like objects called chromosomes. Normally, there are 23 pairs of chromosomes in human beings.

In genetics, SNP data are obtained from chips that assay DNA at a dense set of predetermined genetic markers. In most cases, those genetic markers are single-nucleotide polymorphisms (SNPs), single-base variants taking one of the four bases A, T, G, and C. Because chromosomes come in pairs, a given marker can be homozygous or heterozygous. Assume that A and T are the two alleles at a locus; the genotypes at this locus can then be AA, AT, or TT, where AA and TT are homozygous and AT is heterozygous. Since researchers choose the reference allele (A or T) arbitrarily, the codings of the homozygous genotypes AA and TT are interchangeable. Usually, homozygous genotypes are coded as 0 or 2 depending on

the choice of reference allele, while heterozygous genotypes are coded as 1.

Most often, the alleles at two linked loci are associated through linkage disequilibrium (LD). LD can arise for a number of reasons, including genetic linkage, selection, the rate of recombination, the rate of mutation, genetic drift, non-random mating, and population structure. It is rare that the disease variants themselves are among the assayed markers, but the LD between disease variants and physically close SNPs makes it possible to detect association between SNP markers and a phenotypic trait. LD decays roughly exponentially over generations because of recombination; this is why newly arisen mutations are easier to detect than mutations that occurred hundreds of generations ago. If the physical loci of a disease variant and a SNP marker are close enough, however, it is reasonable to assume that the recombination rate between them is essentially zero, so the LD between the disease variant and the SNP marker remains almost the same as it was at the time the mutation occurred.

SNPs are naturally ordered along the genome with respect to their physical positions. In the presence of linkage disequilibrium (LD), adjacent SNPs are expected to show similar strength of association. Making use of LD information from adjacent SNPs is highly desirable, as it should help to better delineate association signals while reducing the randomness seen in single-SNP analysis. The fused LASSO (Tibshirani et al., 2005) is not appropriate for this purpose, since the effect of association for a SNP (as measured by its regression coefficient) is only identifiable up to its absolute value: a homozygous genotype can be equivalently coded as either 0 or 2 depending on the

choice of the reference allele. Further discussion of this point is provided in Chapter 2.

A new penalized regression method is proposed for identifying associated SNPs in genome-wide association studies. The proposed method uses a novel penalty, which I shall refer to as the smoothed minimax concave penalty, or SMCP, that promotes sparsity and smoothness in absolute values. The SMCP is a combination of the MCP and a penalty consisting of the squared differences of the absolute effects of adjacent markers. The MCP promotes sparsity in the model and performs automatic selection of associated SNPs. The penalty on the squared differences of the absolute effects takes into account the natural ordering of SNPs and adaptively incorporates possible LD information between adjacent SNPs: it explicitly uses the correlation between adjacent markers and penalizes the differences of the genetic effects at adjacent SNPs with high correlation. I derive a coordinate descent algorithm for implementing the SMCP. Furthermore, I use a resampling method for computing p-values of the selected SNPs in order to assess their significance.

The proposed SMCP model is similar to the fused LASSO, which uses an l1 norm of the differences of adjacent parameters. The SMCP model, however, differs from the fused LASSO in two respects. First, it uses the MCP penalty in place of the LASSO penalty; the MCP has been demonstrated to be superior to the LASSO in terms of estimation bias and selection accuracy. Second, it imposes the penalty on the differences of adjacent |β|'s instead of the differences of the adjacent β's themselves. Therefore, the penalty is not affected by which allele is used as the reference

to score genotypes: although β would change sign, |β| remains the same.

In the same manner, the proposed SMCP model is extended to models with grouping structure. SNPs are correlated through LD, so there is an inherent grouping structure in SNP or sequencing data. I propose a new group selection approach using a combination of the group LASSO and a quadratic difference penalty on the norms of adjacent groups. I call this approach the smoothed group LASSO, or simply the SGL model. In the SGL model, I use canonical correlation to measure the correlation between two groups of variables. Principal components are used to reduce the dimension within groups; this helps eliminate highly correlated SNPs within a group, which otherwise produce small eigenvalues in the Cholesky decomposition and make the within-group transformation unstable.

1.3 Overview of this thesis

In Chapter 2, I introduce the proposed SMCP method and present genome-wide screening incorporating the proposed SMCP method. I also describe a coordinate descent algorithm for estimating the model parameters and discuss selection of the values of the tuning parameters and the p-value calculation. In Chapter 3, I present the model framework and setting of the smoothed group LASSO. A group coordinate descent algorithm for fitting the proposed model is described. I also present the auxiliary methods used in the model, e.g., canonical correlation, principal components, selection of the tuning parameters, and evaluation

of the significance of the selected groups. In Chapter 4, I implement the SMCP method in the logistic regression model. Both the marginalized and the joint logistic regression models with the SMCP are studied, and the coordinate descent algorithms for them are sketched. Moreover, an MM algorithm is used to convert the penalized weighted least-squares problem into a penalized least-squares problem. I also implement the SGL in the logistic regression model in that chapter. Within these chapters, the empirical properties of the methods are investigated in simulation studies and through application to real data sets from the Genetic Analysis Workshop (GAW) 16 rheumatoid arthritis (RA) data and the GAW 17 simulated data.

CHAPTER 2
SMOOTHED MINIMAX CONCAVE PENALIZATION

This chapter is organized as follows. Section 2.1 introduces the proposed SMCP method. Section 2.2 presents genome-wide screening incorporating the proposed SMCP method, and Section 2.2.1 describes a coordinate descent algorithm for estimating the model parameters. Section 2.3 presents the joint linear model with the SMCP and describes a coordinate descent algorithm for estimating its parameters. Section 2.4 discusses selection of the values of the tuning parameters and the p-value calculation. Section 2.5.1 evaluates the proposed method and a LASSO method using simulated data, and Section 2.5.2 applies the proposed method to a GWAS on rheumatoid arthritis.

2.1 The SMCP method

Let p be the number of SNPs included in the study, and let β_j denote the effect of the jth SNP in a working model that describes the relationship between phenotype and markers. Here we assume that the SNPs are ordered according to their physical locations on the chromosomes. Adjacent SNPs in high LD are expected to have similar strength of association with the phenotype. To adaptively incorporate LD information, we propose the following penalty, which encourages smoothness in the |β|'s at neighboring SNPs:

    (λ_2 / 2) Σ_{j=1}^{p-1} ζ_j (|β_j| - |β_{j+1}|)^2,    (2.1)

where the weight ζ_j is a measure of LD between SNP j and SNP j+1.
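As a small, self-contained illustration (not code from the thesis), the weights ζ_j, which the next paragraph takes to be absolute lag-one Pearson correlations of the genotype scores, can be computed directly from the genotype matrix:

```python
import numpy as np

def adjacent_ld_weights(X):
    """zeta_j = |Pearson correlation| between genotype scores of SNP j and SNP j+1.

    X is an n-by-p matrix of genotype scores (0/1/2), one column per SNP,
    with SNPs ordered by physical position. Returns a vector of length p-1.
    """
    n, p = X.shape
    zeta = np.empty(p - 1)
    for j in range(p - 1):
        zeta[j] = abs(np.corrcoef(X[:, j], X[:, j + 1])[0, 1])
    return zeta
```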

This penalty encourages |β_j| and |β_{j+1}| to be similar to an extent inversely proportional to the LD strength between the two corresponding SNPs. Adjacent SNPs in weak LD are allowed to have a larger difference in their |β|'s than SNPs in stronger LD. The effect of this penalty is to encourage smoothness in the |β|'s for SNPs in strong LD. By using this penalty, we expect a better delineation of the association pattern in LD blocks that harbor disease variants, while reducing randomness in the |β|'s in LD blocks that do not. We note that there is no monotone relationship between ζ and the physical distance between two SNPs. While it is possible to use other LD measures, we choose ζ_j to be the absolute value of Pearson's correlation coefficient between the genotype scores of SNP j and SNP j+1. The values of ζ_j for the rheumatoid arthritis data used by Genetic Analysis Workshop 16, the data set used in our simulation and empirical studies, are plotted for chromosome 6 (Fig. 2.1(a)), along with the proportion of ζ_j > 0.5 over non-overlapping 100-SNP windows (Fig. 2.1(b)).

For the purpose of SNP selection, we use the MCP, which is defined as

    ρ(t; λ_1, γ) = λ_1 ∫_0^{|t|} (1 - x/(γλ_1))_+ dx,

where λ_1 is a penalty parameter and γ is a regularization parameter that controls the concavity of ρ. Here x_+ is the nonnegative part of x, i.e., x_+ = x 1{x ≥ 0}. The MCP can be easily understood by considering its derivative,

    ρ'(t; λ_1, γ) = λ_1 (1 - |t|/(γλ_1))_+ sgn(t),

where sgn(t) = -1, 0, or 1 according as t < 0, t = 0, or t > 0. As t increases from 0, the MCP begins by applying the same rate of penalization as the LASSO, but continuously relaxes that penalization until |t| > γλ_1, at which point the rate of penalization drops to 0.

[Figure 2.1: Correlation plots in Chromosome 6 from the Genetic Analysis Workshop 16 Rheumatoid Arthritis dataset. (a) Absolute lag-one autocorrelation; (b) proportion of absolute lag-one autocorrelation coefficients > 0.5 per 100-SNP segment.]

The MCP provides a continuum of penalties in which the LASSO penalty corresponds to γ = ∞ and the hard-thresholding penalty corresponds to γ → 1+. We note that other penalties, such as the LASSO penalty or the SCAD penalty, could be used in place of the MCP. We choose the MCP because it possesses all the basic desired properties of a penalty function and is computationally simple (Mazumder et al., 2009, Zhang, 2010).

Given the parameter vector β = (β_1, ..., β_p) and a loss function g(β) based on a working model for the relationship between the phenotype and the markers, the SMCP estimate in a working model is obtained by minimizing the criterion

    L_n(β) = g(β) + Σ_{j=1}^p ρ(|β_j|; λ_1, γ) + (λ_2 / 2) Σ_{j=1}^{p-1} ζ_j (|β_j| - |β_{j+1}|)^2.    (2.2)

We minimize this objective function with respect to β, while using a bisection method to determine the regularization parameters (λ_1, λ_2). SNPs with β̂_j ≠ 0 are selected as being potentially associated with disease. These selected SNPs are then subject to further analysis using a multi-split sampling method to determine their statistical significance, as described later.
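To make the pieces of criterion (2.2) concrete, the following sketch (illustrative only; not the thesis software, and the function names are assumptions) evaluates the MCP and the full SMCP objective for a generic loss value supplied by the caller.

```python
import numpy as np

def mcp(t, lam1, gamma):
    """Minimax concave penalty rho(t; lam1, gamma), applied elementwise to |t|."""
    a = np.abs(t)
    # For |t| <= gamma*lam1: lam1*|t| - t^2/(2*gamma); otherwise constant gamma*lam1^2/2.
    return np.where(a <= gamma * lam1,
                    lam1 * a - a ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam1 ** 2)

def smcp_objective(beta, loss_value, zeta, lam1, lam2, gamma):
    """SMCP criterion (2.2): loss + MCP sparsity penalty + LD-weighted
    smoothness penalty on differences of adjacent |beta|'s."""
    sparsity = mcp(beta, lam1, gamma).sum()
    diffs = np.abs(beta[:-1]) - np.abs(beta[1:])
    smooth = 0.5 * lam2 * np.sum(zeta * diffs ** 2)
    return loss_value + sparsity + smooth
```

The closed form used for the MCP follows directly from integrating its definition: λ_1|t| - t^2/(2γ) for |t| ≤ γλ_1 and γλ_1^2/2 otherwise.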

2.1.1 Comparison with fused LASSO

The fused LASSO was designed for settings in which the ordering of the parameters is meaningful. Its objective function with loss function g(β) can be written as

    g(β) + λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=1}^{p-1} |β_{j+1} - β_j|.

As one can see, the biggest difference between the SMCP and the fused LASSO lies in the second penalty. The SMCP uses an l2 penalty on the differences of absolute values, which produces soft smoothing. The fused LASSO uses an l1 penalty for smoothing and therefore forces adjacent parameters to be exactly equal; it is a form of hard smoothing. Furthermore, the SMCP respects the nature of GWAS data: not only does the ordering of the SNPs matter, but so do the signs of the effects; the effect at one locus may be positive while the effect at the adjacent locus is negative. Finally, in GWAS the number of SNPs can reach hundreds of thousands, so a fast fitting algorithm is needed. Although Friedman et al. (2007) proposed a generalized algorithm that yields a solution to the fused LASSO, it is not as efficient as the coordinate descent algorithm for the SMCP, which has an explicit single-coordinate solution.

2.2 Genome-wide screening incorporating LD

A basic strategy for GWAS is to conduct genome-wide screening of a large number of dense SNPs individually and look for those with significant association with the phenotype. Although several important considerations, such as adjustment for multiple comparisons and possible population stratification, need to be taken into account in the analysis, the essence of the existing genome-wide screening approach is single-marker analysis that ignores the structure of the SNP data. In particular, the possible LD between adjacent SNPs is not incorporated in the analysis.

Our proposed SMCP method can be used for screening a dense set of SNPs while incorporating LD information in a natural way. To be specific, here we consider the standard case-control design for identifying SNPs that are potentially associated with disease. Let the phenotype be scored as 1 for cases and -1 for controls. Let n_j be the number of subjects whose genotypes are non-missing at SNP j. The standardized phenotype of the ith subject with non-missing genotype at SNP j is denoted by y_ij. The genotype at SNP j is scored as 0, 1, or 2 according to the number of copies of a reference allele carried by the subject. Let x_ij denote the standardized genotype score, satisfying Σ_i x_ij = 0 and Σ_{i=1}^{n_j} x_ij^2 = n_j. Consider the penalized criterion

    L_n(β) = (1/2) Σ_{j=1}^p (1/n_j) Σ_{i=1}^{n_j} (y_ij - x_ij β_j)^2 + Σ_{j=1}^p ρ(|β_j|; λ_1, γ) + (λ_2 / 2) Σ_{j=1}^{p-1} ζ_j (|β_j| - |β_{j+1}|)^2.    (2.3)

Here the loss function is

    g(β) = (1/2) Σ_{j=1}^p (1/n_j) Σ_{i=1}^{n_j} (y_ij - x_ij β_j)^2.    (2.4)

We note that switching the reference allele used for scoring the genotypes changes the sign of β_j, but |β_j| remains the same. It may seem counter-intuitive to use a quadratic loss in (2.4) for case-control designs; however, we now show that this is appropriate. Regardless of how the phenotype is scored, the least squares regression slope of the phenotype on the genotype score at SNP j (i.e., a regular single-SNP analysis) equals

    Σ_{i=1}^{n_j} y_ij x_ij / Σ_{i=1}^{n_j} x_ij^2 = 2 (p̂_1j - p̂_2j) / {φ_j (1 - φ_j)},

where φ_j is the proportion of cases among the subjects with non-missing genotype at SNP j, and p̂_1j and p̂_2j are the allele frequencies of SNP j in cases and controls, respectively. This shows that β_j in the squared loss function (2.4) can be interpreted as the effect size of SNP j. In the classification literature, quadratic loss has also been used for indicator response variables (Hastie et al., 2009). An alternative loss function for a binary phenotype would be the sum of negative marginal log-likelihoods based on a working logistic regression model. We have found that the selection results using this loss function are in general similar to those based on (2.4). In addition, the computational implementation of the coordinate descent algorithm described in the next subsection using the loss function (2.4) is much more stable and efficient and can easily handle tens of thousands of SNPs.

2.2.1 Computation

In this section, we derive a coordinate descent algorithm for computing the solution to objective function (2.3). This type of algorithm was originally proposed for criteria with convex penalties such as the LASSO (Friedman et al., 2010, Knight and Fu, 2000, Wu and Lange, 2007), and it has since been used to compute nonconvex penalized regression estimates (Breheny and Huang, 2011, Mazumder et al., 2009). The algorithm optimizes the target function with respect to one parameter at a time, iteratively cycling through all parameters until convergence is reached. It is particularly suitable for problems such as the current one that have a simple closed-form solution in a single dimension but lack one in higher dimensions.

We wish to minimize the objective function L_n(β) in (2.3) with respect to β_j while keeping all other β_k, k ≠ j, fixed at their current estimates.

Then only the terms involving β_j in L_n matter; that is, the problem is equivalent to minimizing R(β_j) defined as

    R(β_j) = (1/(2n_j)) Σ_{i=1}^{n_j} (y_ij - x_ij β_j)^2 + ρ(|β_j|; λ_1, γ) + (λ_2/2) [ζ_j (|β_j| - |β̃_{j+1}|)^2 + ζ_{j-1} (|β̃_{j-1}| - |β_j|)^2]
           = C + a_j β_j^2 + b_j β_j + c_j |β_j|,    j = 2, ..., p-1,    (2.5)

where C is a term free of β_j, β̃_{j+1} and β̃_{j-1} are the current estimates of β_{j+1} and β_{j-1}, respectively, and a_j, b_j, and c_j are determined as follows. For |β_j| < γλ_1,

    a_j = (1/2) [(1/n_j) Σ_{i=1}^{n_j} x_ij^2 - 1/γ + λ_2 (ζ_{j-1} + ζ_j)],
    b_j = -(1/n_j) Σ_{i=1}^{n_j} x_ij y_ij,
    c_j = λ_1 - λ_2 (|β̃_{j+1}| ζ_j + |β̃_{j-1}| ζ_{j-1}).    (2.6)

For |β_j| ≥ γλ_1,

    a_j = (1/2) [(1/n_j) Σ_{i=1}^{n_j} x_ij^2 + λ_2 (ζ_{j-1} + ζ_j)],
    c_j = -λ_2 (|β̃_{j+1}| ζ_j + |β̃_{j-1}| ζ_{j-1}),    (2.7)

while b_j remains the same as in the previous case. The function R(β_j) as written is for j ≠ 1, p; it is defined for j = 1 by setting |β̃_{j-1}| = 0 and for j = p by setting |β̃_{j+1}| = 0 in the two cases above.

Minimizing R(β_j) with respect to β_j is equivalent to minimizing a_j β_j^2 + b_j β_j + c_j |β_j|, or equivalently,

    a_j (β_j + b_j/(2a_j))^2 + c_j |β_j|.    (2.8)

The first term is convex in β_j if a_j > 0. In the case |β_j| ≥ γλ_1, a_j > 0 holds trivially; in the case |β_j| < γλ_1, a_j > 0 holds when γ > 1. Let β̂_j denote the minimizer of R(β_j). It has the explicit expression

    β̂_j = -sign(b_j) (|b_j| - c_j)_+ / (2a_j).    (2.9)

This is because, if c_j > 0, minimizing (2.8) is a regular one-dimensional LASSO problem and the SMCP estimate β̂_j is given by the soft-thresholding operator. If c_j < 0, it can be shown that β̂_j and b_j have opposite signs: if b_j ≥ 0, expression (2.8) becomes a_j (β_j + b_j/(2a_j))^2 - c_j β_j and hence β̂_j = -(b_j - c_j)/(2a_j) < 0; if b_j < 0, by the same argument β̂_j = -(b_j + c_j)/(2a_j) > 0. In summary, expression (2.9) holds in all situations. From Fig. 2.2 one can observe that the optimizer of the objective function depends on the values of b_j and c_j: if c_j > 0 the estimate equals the LASSO estimate, while if c_j ≤ 0 the objective function has two local modes and the minimizer of R(β_j) depends on the sign of b_j. Fig. 2.2 represents equation (2.5) graphically.

We note that a_j and b_j do not depend on any of the β's; they only need to be computed once for each SNP. Only c_j needs to be updated after all the β_j's are updated. In the special case λ_2 = 0, the SMCP method becomes the MCP method; in that case even c_j no longer depends on β̃_{j-1} and β̃_{j+1}: c_j = λ_1 if |β_j| < γλ_1 and c_j = 0 otherwise.

Expression (2.9) gives the explicit solution for β_j. In general, an iterative algorithm is required to estimate the parameters. Let β̃^(0) = (β̃_1^(0), ..., β̃_p^(0)) be the initial value of the estimate of β. The proposed coordinate descent algorithm proceeds as follows:

1. Compute a_j and b_j for j = 1, ..., p.
2. Set s = 0.
3. For j = 1, ..., p:
   (a) Compute c_j according to expression (2.6) or (2.7).
   (b) Update β̃_j^(s+1) according to expression (2.9).
4. Update s ← s + 1.
5. Repeat steps (3) and (4) until the estimate of β converges.

In practice, the initial values β̃_j^(0), j = 1, ..., p, are set to 0. Each β_j is then updated in turn using the coordinate descent algorithm described above; one iteration is completed when all the β_j's have been updated. In our experience, convergence is typically reached after about 30 iterations for the SMCP method.
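A minimal sketch of the single-coordinate update in steps 3(a)-3(b), written out from expressions (2.6)-(2.9), is given below. It is for illustration only: it assumes complete genotype data for SNP j, the names are hypothetical, and it is not the thesis implementation.

```python
import numpy as np

def update_beta_j(xj, yj, beta_prev, beta_next, zeta_prev, zeta_next,
                  lam1, lam2, gamma, beta_j_current):
    """One SMCP coordinate update for the marginal model, expressions (2.6)-(2.9).

    xj, yj: standardized genotype scores and phenotypes at SNP j (length n_j);
    beta_prev, beta_next: current estimates of beta_{j-1} and beta_{j+1}
    (pass 0.0 at the chromosome ends); zeta_prev, zeta_next: adjacent LD weights.
    """
    nj = len(xj)
    b = -np.dot(xj, yj) / nj
    sum_x2 = np.dot(xj, xj) / nj              # ~1 after standardization
    cross = abs(beta_next) * zeta_next + abs(beta_prev) * zeta_prev
    if abs(beta_j_current) < gamma * lam1:    # MCP still penalizing, case (2.6)
        a = 0.5 * (sum_x2 - 1.0 / gamma + lam2 * (zeta_prev + zeta_next))
        c = lam1 - lam2 * cross
    else:                                     # flat MCP region, case (2.7)
        a = 0.5 * (sum_x2 + lam2 * (zeta_prev + zeta_next))
        c = -lam2 * cross
    # Explicit minimizer (2.9): beta_j = -sign(b) * (|b| - c)_+ / (2a)
    return -np.sign(b) * max(abs(b) - c, 0.0) / (2.0 * a)
```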

We now discuss a property of the second penalty. Assume that λ_1 and λ_2 are fixed and that we are minimizing the objective function (2.2). Suppose that in the most recent step (s-1), β_{j-1} was updated, and compare its estimates in adjacent steps, δ = β̃_{j-1}^(s) - β̃_{j-1}^(s-1). We further assume that at step (s-1) only β̃_{j-1}^(s-1) is non-zero and that δ is positive. We now move to step s to update β_j. If corr(x_j, x_{j-1}) > 0, then ζ_{j-1} = corr(x_j, x_{j-1}) and we have

    c_j^(s) = c_j^(s-1) - λ_2 δ ζ_{j-1}.

Note that c_j^(s) < c_j^(s-1), since ζ_{j-1} > 0. From expression (2.9), we know that β̃_j^(s) will be non-zero if c_j is less than |b_j|. One can see that with stronger correlation (i.e., larger ζ_{j-1}) and/or larger λ_2, c_j^(s) will be smaller; consequently, β̃_j^(s) is more likely to be non-zero. Moreover, when it is non-zero its sign is positive, which is sensible given that the correlation between the (j-1)th and the jth predictors is positive. The situation is similar when corr(x_j, x_{j-1}) < 0. Thus, incorporating the second penalty increases the chance that adjacent SNPs with high correlation are selected together.

2.3 Joint SMCP Model in GWAS

Here we describe the proposed penalized regression approach for GWAS under joint modeling. We first describe the loss function used in the objective function, and then describe a coordinate descent algorithm for fitting the joint linear model with the SMCP.

2.3.1 Joint Loss Function

Let p be the number of SNPs and n the number of subjects. The centered phenotype of the ith subject is denoted by y_i; for quantitative trait analysis, the phenotype is continuous. The genotype at SNP j is scored as 0, 1, or 2 according to the number of copies of a reference allele carried by the subject. Let x_ij denote the standardized genotype score, such that Σ_i x_ij = 0 and (1/n) Σ_{i=1}^n x_ij^2 = 1, and let y be the standardized phenotype with Σ_i y_i = 0. We consider the linear loss function

    g(β) = (1/(2n)) Σ_{i=1}^n (y_i - Σ_{j=1}^p x_ij β_j)^2.    (2.10)

Thus, for the joint linear model with loss function (2.10), the objective function (2.2) is minimized with respect to β. The coordinate descent algorithm for this joint least squares problem is given in the following section.

2.3.2 Coordinate Descent Algorithm

We wish to minimize the objective function (2.2) with respect to β_j while keeping the other β_k, k ≠ j, fixed at their current estimates. Then only the terms involving β_j in L_n matter; that is, the problem is equivalent to minimizing R(β_j) defined as

    R(β_j) = (1/(2n)) Σ_{i=1}^n (r_{i(-j)} - x_ij β_j)^2 + ρ(|β_j|; λ_1, γ) + (λ_2/2) [ζ_j (|β_j| - |β̃_{j+1}|)^2 + ζ_{j-1} (|β̃_{j-1}| - |β_j|)^2]
           = C + a_j β_j^2 + b_j β_j + c_j |β_j|,    j = 2, ..., p-1,

where r_{i(-j)} = y_i - β_0 - Σ_{k≠j} x_ik β̃_k is the partial residual for fitting β_j, C is a term free of β_j, β̃_{j+1} and β̃_{j-1} are the current estimates of β_{j+1} and β_{j-1}, respectively, and a_j, b_j, and c_j are determined as follows.

For |β_j| < γλ_1,

    a_j = (1/2) [(1/n) Σ_{i=1}^n x_ij^2 - 1/γ + λ_2 (ζ_{j-1} + ζ_j)],
    b_j = -(1/n) Σ_{i=1}^n x_ij r_{i(-j)},
    c_j = λ_1 - λ_2 (|β̃_{j+1}| ζ_j + |β̃_{j-1}| ζ_{j-1}).    (2.11)

For |β_j| ≥ γλ_1,

    a_j = (1/2) [(1/n) Σ_{i=1}^n x_ij^2 + λ_2 (ζ_{j-1} + ζ_j)],
    c_j = -λ_2 (|β̃_{j+1}| ζ_j + |β̃_{j-1}| ζ_{j-1}),

while b_j remains the same as in the previous case. As before, R(β_j) as written is for j ≠ 1, p; it is defined for j = 1 by setting |β̃_{j-1}| = 0 and for j = p by setting |β̃_{j+1}| = 0.

Overall, minimizing R(β_j) with respect to β_j is equivalent to minimizing a_j β_j^2 + b_j β_j + c_j |β_j|, or equivalently, a_j (β_j + b_j/(2a_j))^2 + c_j |β_j|. The solution is the same as the one given in expression (2.9). The algorithm amounts to the following steps. Initialize the vector of residuals r_i = y_i - ỹ_i, where ỹ_i = Σ_{j=1}^p x_ij β̃_j^(0).

1. Set s = 0.

2. Inner loop:
   (a) Set j = 1 and calculate a_j.
   (b) Update b_j and c_j according to expression (2.11).
   (c) Update β̃_j^(s+1) according to expression (2.9).
   (d) Update r_i ← r_i - x_ij (β̃_j^(s+1) - β̃_j^(s)) for i = 1, ..., n, and set j ← j + 1.
   (e) Repeat (b) to (d) until j = p, then evaluate the convergence criterion.
   (f) If the criterion is met, go to step 3; otherwise, repeat (a) to (e).
3. Update s ← s + 1.
4. Repeat steps 2 and 3 until convergence.

To make the computation more efficient, we write out the term b_j in detail:

    b_j = -(1/n) Σ_{i=1}^n x_ij (y_i - Σ_{l≠j} x_il β̃_l)
        = -(1/n) Σ_{i=1}^n y_i x_ij + (1/n) Σ_{l≠j} (Σ_{i=1}^n x_ij x_il) β̃_l.

Hence, we can use inner products between the x_j's to compute b_j, which makes it possible to obtain the solution path more efficiently. Starting from λ_max, the β_j's enter the model gradually. Once we calculate the inner product between x_j and x_l, we save and store it for use at smaller values of λ. This step saves a large amount of computing time and storage space, since only a small number of β parameters enter the model, so only the correlations between the selected predictors and all other predictors are needed.
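The following sketch puts the joint-model pieces together as a cyclic coordinate descent loop with in-place residual updates. It is a simplified illustration under the stated standardization assumptions, not the thesis implementation, and it recomputes b_j from the residuals rather than caching inner products.

```python
import numpy as np

def smcp_joint_fit(X, y, zeta, lam1, lam2, gamma, max_iter=50, tol=1e-4):
    """Cyclic coordinate descent for the joint SMCP model (Section 2.3).

    X: n-by-p standardized genotype matrix, y: centered phenotype,
    zeta: length p-1 vector of adjacent LD weights."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()                    # residuals y - X @ beta
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            xj = X[:, j]
            s2 = xj @ xj / n                      # ~1 after standardization
            # partial residual r_{i(-j)} = r_i + x_ij * beta_j, so
            # b_j = -(1/n) * x_j' r_{(-j)}
            b = -(xj @ r) / n - beta[j] * s2
            zp = zeta[j - 1] if j > 0 else 0.0    # weight toward left neighbor
            zn = zeta[j] if j < p - 1 else 0.0    # weight toward right neighbor
            prev = abs(beta[j - 1]) if j > 0 else 0.0
            nxt = abs(beta[j + 1]) if j < p - 1 else 0.0
            if abs(beta[j]) < gamma * lam1:       # case (2.11), MCP active
                a = 0.5 * (s2 - 1.0 / gamma + lam2 * (zp + zn))
                c = lam1 - lam2 * (nxt * zn + prev * zp)
            else:                                 # flat MCP region
                a = 0.5 * (s2 + lam2 * (zp + zn))
                c = -lam2 * (nxt * zn + prev * zp)
            new = -np.sign(b) * max(abs(b) - c, 0.0) / (2.0 * a)
            r -= xj * (new - beta[j])             # keep residuals in sync
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta
```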

We now discuss how the coefficients a_j, b_j, and c_j behave and how they affect the update of β_j. Suppose that in the most recent step β_{j-1} was updated, and let δ = β̃_{j-1}^(s) - β̃_{j-1}^(s-1). We assume that at step (s-1) only β̃_{j-1}^(s-1) is non-zero and positive, and that δ is positive. We now move to the step of updating β_j. The coefficient a_j is constant, because the design matrix and the weights ζ_j are fixed; it only affects the scale of the parameter estimate. If corr(x_j, x_{j-1}) > 0, then ζ_{j-1} = corr(x_j, x_{j-1}) and we have

    b_j^(s) = b_j^(s-1) - δ ζ_{j-1}   and   c_j^(s) = c_j^(s-1) - λ_2 δ ζ_{j-1}.

One should note that b_j^(s-1) is negative, so |b_j^(s)| becomes larger. Inspecting equation (2.9), the probability that β_j becomes non-zero therefore increases, since c_j^(s) becomes smaller while |b_j^(s)| grows. The situation is similar when corr(x_j, x_{j-1}) < 0. Thus, incorporating the smoothing penalty enables adjacent SNPs with high correlation to be selected together.

In practice, the initial values of β_j, j = 1, ..., p, are set to 0. Each β_j is then updated in turn using the cyclic coordinate descent algorithm described above; one iteration is complete when all the β_j's have been updated.

2.4 Methods

2.4.1 Tuning parameter selection

Selecting appropriate values for the tuning parameters is important: it affects not only the number of selected variables but also the estimates of the model parameters and the selection consistency. Various methods could be applied, including AIC (Akaike, 1974), BIC (Chen and Chen, 2008, Schwarz, 1978), cross-validation, and generalized cross-validation. However, these are all based on prediction error. In GWAS, the disease variants may not be among the SNP markers; in practice it is rare that they are, so the working model for the SNP data is not the true model. Hence, the methods mentioned above may be inadequate in GWAS. Wu et al. (2009) instead selected the tuning parameter so that a predetermined number of predictors is chosen, using a combination of bracketing and bisection to search for the optimal value. We adopt the method of Wu et al. (2009) to select the tuning parameters.

For this purpose, the tuning parameters λ_1 and λ_2 are re-parameterized through τ = λ_1 + λ_2 and η = λ_1 / τ. The value of η is fixed beforehand; when η = 1, the SMCP method becomes the MCP method. The optimal value of τ that selects the predetermined number of predictors is determined through bisection as follows. Let r(τ) denote the number of predictors selected under τ, and let τ_max be the smallest value for which all coefficients are 0; τ_max is the upper bound for τ. From expression (2.6), τ_max = max_j |Σ_{i=1}^{n_j} x_ij y_ij| / (n_j η). To avoid an undefined, saturated linear model, τ cannot be 0 or close to 0; its lower bound, denoted τ_min, is set at τ_min = ε τ_max for a preselected ε. Setting ε = 0.1 seems to work well with the SMCP method. Initially, we set τ_l = τ_min and τ_u = τ_max. If r(τ_u) < s < r(τ_l), where s is the desired number of predictors, we employ bisection: we test the midpoint τ_m = (τ_l + τ_u)/2; if r(τ_m) < s, we replace τ_u by τ_m; if r(τ_m) > s, we replace τ_l by τ_m. This process repeats until r(τ_m) = s. From the simulation studies we also find that the regularization parameter γ has an important impact on the analysis; based on our experience, γ = 6 is a reasonable choice for the SMCP method.
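A compact sketch of this bracketing-and-bisection search is given below (illustrative only; fit_count stands in for whatever routine fits the SMCP model at a given τ and counts the nonzero coefficients).

```python
def select_tau(fit_count, tau_max, target, eps=0.1, max_steps=50):
    """Bisection search for the tuning parameter tau (Section 2.4.1).

    fit_count(tau) must return the number of predictors selected at tau.
    Larger tau means stronger penalization and fewer selected predictors."""
    tau_lo, tau_hi = eps * tau_max, tau_max       # tau_min = eps * tau_max
    tau_mid = 0.5 * (tau_lo + tau_hi)
    for _ in range(max_steps):
        tau_mid = 0.5 * (tau_lo + tau_hi)
        r = fit_count(tau_mid)
        if r == target:
            break
        if r < target:        # too few predictors selected: decrease the penalty
            tau_hi = tau_mid
        else:                 # too many selected: increase the penalty
            tau_lo = tau_mid
    return tau_mid
```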

2.4.2 P-values for the selected SNPs

The use of p-values is a traditional way to evaluate the significance of estimates. However, there is no straightforward way to compute standard errors of penalized linear regression estimates. Wu et al. (2009) proposed a leave-one-out approach for computing p-values by assessing the correlations among the selected SNPs in the reduced model. We instead use the multi-split method proposed by Meinshausen et al. (2009) to obtain reproducible p-values. This is a resampling-based method that automatically adjusts for multiple comparisons. In each iteration, the multi-split method proceeds as follows:

1. Randomly split the data into two disjoint sets of equal size, D_in and D_out. The case:control ratio in each set is the same as in the original data.
2. Fit the SMCP method with the data in D_in. Denote the set of selected SNPs by S.
3. Assign a p-value P_j to SNP j in the following way:
   (a) If SNP j is in the set S, set P_j to its p-value on D_out in the simple linear regression in which SNP j is the only predictor.
   (b) If SNP j is not in the set S, set P_j = 1.
4. Define the adjusted p-value by P̃_j = min(|S| P_j, 1), j = 1, ..., p, where |S| is the size of the set S.

This procedure is repeated B times. Let P̃_j^(b) denote the adjusted p-value for SNP j in the bth iteration. For π in (0, 1), let q_π be the π-quantile of {P̃_j^(b)/π; b = 1, ..., B}, and define Q_j(π) = min(1, q_π). Meinshausen et al. (2009) proved that Q_j(π) is an asymptotically correct p-value, adjusted for multiplicity. They also proposed an adaptive version that selects a suitable quantile based on the data:

    Q_j = min{1, (1 - log π_0) inf_{π in (π_0, 1)} Q_j(π)},

where π_0 in (0, 1) is chosen beforehand. It was shown that Q_j, j = 1, ..., p, can be used for both FWER (family-wise error rate) and FDR control (Meinshausen et al., 2009).
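A sketch of this multi-split procedure follows. It is illustrative only: select_fn is a hypothetical stand-in for fitting the SMCP model on the first half of the data, the split ignores the case:control stratification for brevity, and the quantile aggregation is evaluated on a small grid of π values.

```python
import numpy as np
from scipy import stats

def multisplit_pvalues(X, y, select_fn, B=100, pi0=0.05, rng=None):
    """Multi-split p-values in the spirit of Meinshausen et al. (2009).

    select_fn(X_in, y_in) must return the indices of the SNPs selected on the
    first half of the data. Held-out p-values come from single-SNP regression."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    P = np.ones((B, p))                           # adjusted p-values per split
    for b in range(B):
        perm = rng.permutation(n)
        d_in, d_out = perm[: n // 2], perm[n // 2:]
        S = select_fn(X[d_in], y[d_in])
        for j in S:
            fit = stats.linregress(X[d_out, j], y[d_out])
            P[b, j] = min(len(S) * fit.pvalue, 1.0)   # Bonferroni-style adjustment
    # Aggregate over splits: Q_j(pi) = min(1, pi-quantile of P_j^(b)/pi)
    grid = np.linspace(pi0, 1.0, 20)
    Q = np.ones(p)
    for j in range(p):
        q = np.array([min(1.0, np.quantile(P[:, j] / g, g)) for g in grid])
        Q[j] = min(1.0, (1.0 - np.log(pi0)) * q.min())
    return Q
```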

2.5 Simulation and Empirical Studies

2.5.1 Simulation Studies and Results

To make the LD structure as realistic as possible, genotypes are taken from the rheumatoid arthritis (RA) study provided by the Genetic Analysis Workshop (GAW) 16. This study involves 2062 individuals, of whom four hundred are randomly chosen, and five thousand SNPs are selected from chromosome 6. For individual i, the phenotype y_i is generated as

    y_i = x_i'β + ε_i,    i = 1, ..., 400,

where x_i (a vector of length 5000) represents the genotype data of individual i and β is the vector of genetic effects, whose elements are all 0 except that

    (β_2287, ..., β_2298) = (0.1, 0.2, 0.1, 0.2, 1, 0.1, 1, 0.1, 1, 0.1, 0.6, 0.2)

and

    (β_2300, ..., β_2318) = (0.1, 0.6, 0.2, 0.3, 0.1, 0.3, 0.4, 1.2, 0.1, 0.3, 0.7, 0.1, 1, 0.2, 0.4, 0.1, 0.5, 0.2, 0.1).

The residual ε_i is sampled from a normal distribution with mean 0 and standard deviation 1.5. The loss function g(β) is given in expression (2.4). To evaluate the performance of the SMCP, we use the false discovery rate (FDR) and the false negative rate (FNR), defined as follows. Let β̂_j denote the estimated value of β_j. Then

    FDR = #{j : β̂_j ≠ 0 but β_j = 0} / #{j : β̂_j ≠ 0},
    FNR = #{j : β̂_j = 0 but β_j ≠ 0} / #{j : β_j ≠ 0}.

Table 2.1 and Table 2.2 show the mean and standard deviation of the number of true positives, the FDR, and the FNR for various values of η for the SMCP method and a LASSO method over 100 replications, under the marginal model and the joint model respectively. In all replications, the number of features selected in the model is 50, and the loss function used for the LASSO is the one in expression (2.4).
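For completeness, a tiny helper computing these two rates from an estimated and a true coefficient vector (illustrative only):

```python
import numpy as np

def fdr_fnr(beta_hat, beta_true):
    """False discovery rate and false negative rate as defined above."""
    selected = beta_hat != 0
    causal = beta_true != 0
    fdr = np.sum(selected & ~causal) / max(np.sum(selected), 1)
    fnr = np.sum(~selected & causal) / max(np.sum(causal), 1)
    return fdr, fnr
```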

Table 2.1: Simulation results for marginalized linear model with SMCP (standard errors in parentheses).

Method   η    True Positive    FDR              FNR
SMCP          (0.78)           0.418 (0.016)    0.061 (0.025)
SMCP          (0.87)           0.425 (0.017)    0.073 (0.028)
SMCP          (0.81)           0.435 (0.016)    0.089 (0.026)
SMCP          (0.75)           0.443 (0.015)    0.101 (0.024)
SMCP          (0.67)           0.460 (0.013)    0.128 (0.021)
SMCP          (0.45)           0.477 (0.009)    0.157 (0.015)
SMCP          (0.36)           0.481 (0.007)    0.162 (0.012)
SMCP          (0.61)           0.486 (0.012)    0.171 (0.020)
SMCP          (0.61)           0.492 (0.012)    0.180 (0.020)
SMCP          (0.56)           0.497 (0.011)    0.189 (0.018)
SMCP          (0.55)           0.504 (0.012)    0.199 (0.018)
SMCP          (0.65)           0.507 (0.013)    0.205 (0.021)
SMCP          (0.84)           0.516 (0.017)    0.219 (0.027)
LASSO         24.31 (0.83)     0.514 (0.016)    0.216 (0.027)

Table 2.2: Simulation results for joint linear model with SMCP (standard errors in parentheses).

Method   η    True Positive    FDR              FNR
SMCP          (1.68)           0.559 (0.035)    0.288 (0.054)
SMCP          (1.73)           0.585 (0.035)    0.331 (0.056)
SMCP          (1.87)           0.599 (0.037)    0.354 (0.060)
SMCP          (1.92)           0.662 (0.038)    0.455 (0.062)
SMCP          (1.83)           0.709 (0.036)    0.541 (0.059)
SMCP          (1.66)           0.767 (0.033)    0.624 (0.054)
SMCP          (1.37)           0.804 (0.027)    0.684 (0.044)
SMCP          (1.15)           0.828 (0.023)    0.722 (0.037)
SMCP          (1.11)           0.847 (0.022)    0.753 (0.036)
SMCP          (1.12)           0.861 (0.022)    0.775 (0.036)
SMCP          (1.07)           0.872 (0.022)    0.794 (0.035)
SMCP          (0.97)           0.876 (0.019)    0.800 (0.031)
LASSO         11.62 (1.21)     0.768 (0.024)    0.625 (0.039)

The FDR and FNR move in the same direction, since the number of selected predictors is fixed: as the number of true positives increases, the numbers of false negatives and false positives decrease. The SMCP method outperforms the LASSO in terms of true positives and FDR. The results also suggest that η = 0.05 gives acceptable true positive and FDR values at the same time.

To investigate further the performance of the SMCP method and the LASSO method for genome-wide screening purposes, we use one simulated data set, from which 50 SNPs are selected by the SMCP and by the LASSO (Table 2.3). For both methods, p-values for the selected SNPs are computed using the multi-split method. From Table 2.3, it is apparent that the SMCP method selects many more true positives than the marginal LASSO method: the SMCP selects 25 of the 31 truly disease-associated SNPs, while the LASSO selects 21. Note that the multi-split method effectively produces p-values for the selected SNPs: it assigns insignificant p-values to all the non-associated SNPs selected by the SMCP method. Note also that SNPs 2320 and 2321 are selected by both the SMCP and the LASSO; they have insignificant p-values under the SMCP method but not under the LASSO. From this point on, all the real data analyses below are conducted using the marginal model.

2.5.2 Application to GAW 16 RA Data

Rheumatoid arthritis (RA) is a complex human disorder with a prevalence ranging from around 0.8% in Caucasians to 10% in some Native American groups (Amos, 2009). Its risk is generally higher in females than in males, and some studies have identified smoking as a risk factor. Genetic factors underlying RA have been mapped to the HLA region at 6p21 (Newton et al., 2004), the PTPN22 locus at

Table 2.3: List of 50 SNPs selected by the marginalized SMCP method and the marginalized LASSO method for a simulated data set, giving β̂ and its p-value under the SMCP, the LASSO, and single-SNP regression. The SMCP and LASSO p-values are computed using the multi-split method; the single-SNP regression p-values are from single-SNP analysis and are not corrected for multiple testing.

2.4.2 Application to GAW 16 RA Data

Rheumatoid arthritis (RA) is a complex human disorder with a prevalence ranging from around 0.8% in Caucasians to 10% in some Native American groups (Amos, 2009). Its risk is generally higher in females than in males, and some studies have identified smoking as a risk factor.

Genetic factors underlying RA have been mapped to the HLA region at 6p21 (Newton et al., 2004), the PTPN22 locus at 1p13 (BEG, 2004), and the CTLA4 locus at 2q33 (Plenge, 2005). Other reported loci, at 6q (TNFAIP3), 9p13 (CCL21), 10p15 (PRKCQ), and 20q13 (CD40), appear to have weaker effects (Amos, 2009). The GAW 16 RA data are from the North American Rheumatoid Arthritis Consortium (NARAC). They are the initial batch of whole-genome association data for the NARAC cases (N=868) and controls (N=1194) after removing duplicated and contaminated samples, giving a total sample size of 2062. After quality control and removing SNPs with low minor allele frequency, the remaining SNPs span 22 autosomes, with a large number on chromosome 6. The marginalized loss function implemented with the SMCP is used for analyzing the real data for GAW 16 and GAW 17, and SNPs on the whole genome are analyzed simultaneously. By trying different values for the predetermined number of selected SNPs, we found that 800 SNPs along the genome are appropriate for the GAW 16 RA dataset; a sketch of how the tuning parameter can be chosen to yield a fixed number of selected SNPs is given below. For the SMCP method, the tuning parameters τ and η are set to their optimal values for this setting, and p-values of the selected SNPs are computed using the multi-split method. The majority of the SNPs (539 out of 800) selected by the SMCP method are on chromosome 6, 293 of which are significant at significance level 0.05. The plot of -log10(p-value) for the selected SNPs against their physical positions on chromosome 6 is shown in Fig. 2.3(a). For the LASSO method (i.e., η = 1 and γ = ∞), the same procedure is implemented to select 800 SNPs across the genome, with the tuning parameter τ set to its optimal value for this setting.
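Fixing the number of selected SNPs (here 800) can be implemented by computing the solution path over a grid of τ values and keeping the value whose fit retains the desired number of nonzero coefficients. The sketch below illustrates this under simplifying assumptions; `fit_path` is a hypothetical interface standing in for the coordinate descent solver of the SMCP or the LASSO, not the thesis software.

```python
import numpy as np

def tau_for_target_size(X, y, fit_path, tau_grid, target=800):
    """Pick the tuning parameter whose fit selects (approximately) `target` SNPs.

    fit_path(X, y, tau) -> coefficient vector beta_hat at penalty level tau
    (hypothetical interface; in practice the path is computed by warm-started
    coordinate descent from the largest tau downward).
    """
    best_tau, best_beta, best_gap = None, None, np.inf
    for tau in sorted(tau_grid, reverse=True):          # decreasing tau -> growing models
        beta_hat = fit_path(X, y, tau)
        size = np.count_nonzero(beta_hat)
        gap = abs(size - target)
        if gap < best_gap:
            best_tau, best_beta, best_gap = tau, beta_hat, gap
        if size >= target:                              # model has reached the target size
            break
    return best_tau, best_beta
```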

There are 537 SNPs selected on chromosome 6, and 280 of them are significant with multi-split p-values less than 0.05. The plot of -log10(p-value) for chromosome 6 is shown in Fig. 2.3(b). We also analyzed the data using the MCP method (i.e., η = 1). It selects the same set of SNPs as the LASSO method. The difference between the LASSO and the MCP lies in the magnitude of the estimates: the MCP is unbiased for large coefficients under a proper choice of γ, whereas the LASSO is always biased; see the thresholding sketch below. The two sets of SNPs selected by the SMCP and the LASSO on chromosome 6 both lie in the region of the HLA-DRB1 gene, which has been found to be associated with RA (Newton et al., 2004). There are SNPs on other chromosomes that are significant or close to significant (Table 2.4). In particular, association with rheumatoid arthritis at SNP rs in gene PTPN22 has been reported previously (BEG, 2004). Other noteworthy SNPs include SNP rs in the RAB28 region, 6 SNPs in the TRAF1 region, SNP rs in the CA5A region, SNP rs in the RNF126P1 region, and SNP rs in the PHACTR3 region. On chromosome 9, 6 SNPs in the region of the TRAF1 gene are identified by both the SMCP method and the LASSO method. Among them, 2 insignificant SNPs (rs , rs ) have much smaller p-values under the SMCP method than under the LASSO method. The estimates of the βs obtained from the SMCP, the MCP, the LASSO, and the regular single-SNP linear regression analysis are presented in Fig. 2.4 and Fig. 2.5. One can see from Fig. 2.4 and Fig. 2.5 that the MCP method produces larger estimates than the LASSO method, but the estimates from the SMCP method are smaller than those from the LASSO. This is caused by the (side) shrinkage effect of the proposed smoothing penalty.
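The bias difference can be seen from the univariate thresholding operators that coordinate descent applies at each update. The sketch below (an illustration, not the thesis code) contrasts the LASSO's soft-thresholding with the standard MCP thresholding rule for a standardized predictor: for |z| > γτ the MCP leaves z unshrunken, soft-thresholding always subtracts τ, and letting γ → ∞ recovers the LASSO.

```python
import numpy as np

def soft_threshold(z, tau):
    """LASSO (soft-thresholding) operator: always shrinks by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def mcp_threshold(z, tau, gamma=3.0):
    """MCP thresholding operator for a standardized predictor (gamma > 1).

    For |z| <= gamma * tau the soft-thresholded value is rescaled by 1/(1 - 1/gamma);
    for |z| > gamma * tau the estimate equals z, i.e. no shrinkage (unbiased).
    As gamma -> infinity this reduces to soft-thresholding, i.e. the LASSO.
    """
    z = np.asarray(z, dtype=float)
    small = np.abs(z) <= gamma * tau
    return np.where(small, soft_threshold(z, tau) / (1.0 - 1.0 / gamma), z)

# Example: a large signal is left untouched by the MCP but shrunken by the LASSO.
z = np.array([0.1, 0.5, 2.0])
print(soft_threshold(z, tau=0.3))   # [0.  0.2 1.7]
print(mcp_threshold(z, tau=0.3))    # [0.  0.3 2. ]  (gamma = 3)
```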

Table 2.4: SNPs found significant on chromosomes other than chromosome 6, listing for each SNP its gene, chromosome, position, and the estimate and p-value under the SMCP and the LASSO. The genes represented are PTPN22, RAB28, a LOC region, TRAF1 (6 SNPs), CA5A, RNF126P1, and PHACTR3.

In terms of model selection, the SMCP tends to select more adjacent SNPs that are in high LD. The quantitative traits anti-CCP and rheumatoid factor IgM are measured on the case group in the study. Here we choose anti-CCP as the response because it shows great promise as a diagnostic marker of RA (Khosla et al., 2004). The total sample size after removing subjects with missing anti-CCP is 867. After quality control and removing SNPs with low minor allele frequency, the remaining SNPs span 22 autosomes. Genome scans using the SMCP, the MCP, the LASSO, and regular linear regression are shown in Fig. 2.6 and Fig. 2.7. All chromosomes are analyzed at one time for the SMCP, the MCP, and the LASSO, and the number of SNPs selected is 500 for these methods. One can observe that the SMCP produces more clustered estimates: for example, on chromosome 12 the SMCP has three clustered spots, whereas the MCP and the LASSO are much noisier.

The same pattern holds for other chromosomes. This is the effect of the smoothing penalty, which takes the LD information into account. For both the binary and the quantitative trait, the MCP method produces larger estimates than the LASSO. However, neither the MCP nor the LASSO takes LD information into account, and their estimates are therefore noisier, especially for the quantitative trait.

2.4.3 Application to GAW 17 Data

The GAW17 data set consists of 24,487 SNP markers throughout the genome for 697 individuals. We analyze the data on unrelated individuals with quantitative trait Q1 in replicate 1; more details can be found in Liu et al. (2011). Fig. 2.8(a) shows the absolute lag-one autocorrelation coefficients over the whole genome, and Fig. 2.8(b) shows the proportion of absolute lag-one autocorrelation coefficients > 0.5 over non-overlapping 100-SNP windows. One can see that even for this partially selected set of SNPs across the genome, there are strong correlations between adjacent SNPs, although the correlations are not as strong as in a dense set of SNPs (Fig. 2.1). Although it might be more informative to use pairwise correlations among all SNPs, the computational burden makes this impractical for a real dataset. All SNPs are included in the analysis. We coded the seven population groups as dummy variables. The quantitative trait Q1 is first regressed on gender, age, smoking status, and the group dummy variables in order to remove their confounding effects; this procedure helps adjust for population stratification. Then the residuals from this regression are used as the response and fitted by the SMCP model and the LASSO model, as sketched below.
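The two-stage adjustment just described can be written compactly: regress Q1 on the covariates and dummy-coded population groups, then carry the residuals forward as the working response for the penalized genome scan. The Python sketch below illustrates this under simplifying assumptions about the input format; column names such as 'gender', 'age', 'smoke', and 'group' are placeholders, not the actual GAW 17 variable names.

```python
import numpy as np
import pandas as pd

def covariate_adjusted_response(pheno: pd.DataFrame, trait="Q1"):
    """Regress the trait on gender, age, smoking status and population-group dummies,
    and return the residuals to be used as the response in the penalized genome scan.

    Column names are illustrative placeholders; 'group' is a categorical column with
    the seven population groups, expanded into dummy variables.
    """
    covars = pd.concat(
        [pheno[["gender", "age", "smoke"]],
         pd.get_dummies(pheno["group"], prefix="group", drop_first=True)],
        axis=1,
    ).astype(float)
    Z = np.column_stack([np.ones(len(pheno)), covars.to_numpy()])  # add intercept
    y = pheno[trait].to_numpy(dtype=float)
    coef, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)              # ordinary least squares
    residuals = y - Z @ coef
    return residuals   # working response, then fitted by the SMCP / LASSO on the SNP matrix
```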

The extended Bayesian information criterion (EBIC) (Chen and Chen, 2008, 2010) is used to select the tuning parameter; a sketch of this criterion is given below. The tuning parameter τ is selected in this way for the SMCP with η = 0.1 and for the LASSO. Absolute values of the estimates from simple linear regression are plotted in Fig. 2.9, and the estimation results are presented in Table 2.5. Both the SMCP and the LASSO selected two SNPs (C13S522 and C13S523) from gene FLT1, and for each method these two SNPs have significant LOO p-values. The SMCP selected three more SNPs, one (C13S524) from gene FLT1 and the other two (C12S707 and C12S711) from gene PRR4; of these, only the SNP from gene FLT1 (C13S524) is significant. The box plots for these 5 SNPs selected by the SMCP and the LASSO are shown in Fig. 2.10. Using knowledge of the underlying model, we have computed the numbers of true positives and false positives for the SMCP, the LASSO, and regular single-SNP regression on trait Q1 using all 200 replicates (Table 2.6). For regular single-SNP regression, the Benjamini-Hochberg method is used to control the FDR in multiple testing. The SMCP tends to select more SNPs than the LASSO, with both more true positives and more false positives. Although the regular method attains more true positives, its false positive count is far higher than that of the SMCP and the LASSO. Further simulation studies can be found in Liu et al. (2011). For trait Q1 in replicate 1, the SMCP selected 3 SNPs (C13S522, C13S523 and C13S524) from the associated gene FLT1 and 2 SNPs that are false positives. In comparison, the LASSO selected 2 SNPs (C13S522 and C13S523), both of which are true positives. We note that the SNPs provided for GAW 17 are a small subset of the SNPs that are genotyped.
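For reference, the EBIC of Chen and Chen combines the usual BIC with an extra penalty on model size that grows with the number of candidate predictors. The sketch below assumes a Gaussian linear model with n observations, p candidate SNPs, residual sum of squares RSS, and k selected SNPs, and uses the common approximation 2ξ k log p for the extra term; the EBIC parameter is written ξ here (not γ, to avoid clashing with the MCP's γ). It is a hedged illustration, not the exact form or software used in the thesis.

```python
import numpy as np

def ebic(rss, n, k, p, xi=0.5):
    """Extended BIC for a Gaussian linear model (sketch of the Chen and Chen criterion).

    rss : residual sum of squares of the fitted model
    n   : sample size,  k : number of selected SNPs,  p : number of candidate SNPs
    xi  : EBIC parameter in [0, 1]
    """
    return n * np.log(rss / n) + k * np.log(n) + 2.0 * xi * k * np.log(p)

def select_tau_by_ebic(fits, n, p, xi=0.5):
    """fits: iterable of (tau, rss, k) along the solution path; returns the tau minimizing EBIC."""
    scored = [(ebic(rss, n, k, p, xi), tau) for tau, rss, k in fits]
    return min(scored)[1]
```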

Table 2.5: SNPs selected by the SMCP and the LASSO for trait Q1 in replicate 1 (C12S707 and C12S711 in gene PRR4; C13S522, C13S523, and C13S524 in gene FLT1), giving the estimate and LOO p-value under the SMCP and the LASSO together with the univariate estimate and p-value.

Table 2.6: Mean and standard error of true positives and false positives for selected SNPs over 200 replicates for trait Q1.
                  SMCP             LASSO            Regular Regression
True Positive     3.35 (1.52)      2.48 (1.19)      7.03 (1.81)
False Positive    18.42 (36.20)    8.64 (21.53)     (87.87)

The strength of LD for this set of SNPs has been greatly reduced. In addition, the GAW 17 data were simulated to mimic rare variants, and the SMCP method is not specially designed to map rare variants. Even so, the SMCP is able to select 3 SNPs, more than the LASSO does. In comparison, the results of the regular simple linear regression are much noisier.
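The regular single-SNP comparison above relies on per-SNP tests with Benjamini-Hochberg control of the FDR. The following is a minimal sketch of that step under the usual assumptions (independent or positively dependent tests); it is illustrative, not the thesis implementation.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of rejected hypotheses using the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * (np.arange(1, m + 1) / m)   # alpha * rank / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest rank meeting its threshold
        reject[order[: k + 1]] = True         # reject all hypotheses up to that rank
    return reject

# Example: single-SNP p-values for five markers at FDR level 0.05
print(benjamini_hochberg([0.001, 0.009, 0.04, 0.2, 0.8]))
```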

Figure 2.2: Graph of the function R under different a_j, b_j, and c_j. Panels (a) and (b): a_j = 1, b_j = 1, c_j = 5; panels (c) and (d): a_j = 1, b_j = 5, c_j = 2.

Figure 2.3: Plot of -log10(p-value) for SNPs on chromosome 6 selected by (a) the SMCP method and (b) the LASSO method for the rheumatoid arthritis data. These p-values are generated using the multi-split method. The horizontal line corresponds to significance level 0.05.

Figure 2.4: Genome-wide plot of β estimates for (a) the SMCP and (b) the MCP methods on the binary trait.

Figure 2.5: Genome-wide plot of β estimates for (a) the LASSO and (b) regular single-SNP linear regression on the binary trait.

Figure 2.6: Genome-wide plot of β estimates for (a) the SMCP and (b) the MCP methods on the quantitative trait.

Figure 2.7: Genome-wide plot of β estimates for (a) the LASSO and (b) regular single-SNP linear regression on the quantitative trait.

Figure 2.8: Correlation plots along the genome from the Genetic Analysis Workshop 17 simulated dataset: (a) absolute lag-one autocorrelation; (b) proportion of absolute lag-one autocorrelation coefficients > 0.5 per 100-SNP segment.

Figure 2.9: Genome-wide plot of β estimates for regular single-SNP linear regression.

Figure 2.10: Boxplots comparing residuals across genotypes of the selected SNPs (panel (a): C13S522). The y-axis shows residuals and the x-axis shows genotypes.
