COMPARISON OF STATISTICAL LEARNING METHODS FOR PREDICTION IN GENOME-WIDE ASSOCIATION STUDIES


COMPARISON OF STATISTICAL LEARNING METHODS FOR PREDICTION IN GENOME-WIDE ASSOCIATION STUDIES

A thesis presented to the faculty of San Francisco State University in partial fulfillment of the requirements for the degree

Master of Arts in Mathematics

by Rachel Hartley
San Francisco, California
August 2017

Copyright by Rachel Hartley 2017

CERTIFICATION OF APPROVAL

I certify that I have read COMPARISON OF STATISTICAL LEARNING METHODS FOR PREDICTION IN GENOME-WIDE ASSOCIATION STUDIES by Rachel Hartley and that in my opinion this work meets the criteria for approving a thesis submitted in partial fulfillment of the requirements for the degree: Master of Arts in Mathematics at San Francisco State University.

Dr. Tao He, Assistant Professor of Mathematics
Dr. Serkan Hoşten, Professor of Mathematics
Dr. Alexandra Piryatinska, Associate Professor of Mathematics

COMPARISON OF STATISTICAL LEARNING METHODS FOR PREDICTION IN GENOME-WIDE ASSOCIATION STUDIES

Rachel Hartley
San Francisco State University
2017

Results from Genome-Wide Association Studies (GWAS) show that the detected Single Nucleotide Polymorphisms (SNPs) only explain a small fraction of heritability, and although identifying those missing SNPs is important, it is sometimes more important to be able to predict whether a person will develop a certain disease or not. In this thesis, a comparison analysis of three different variable selection methods (LASSO, χ² test for independence, Random Forest) and seven classification methods (Logistic Classification, Linear Discriminant Analysis, Random Forest, Support Vector Machines with linear, radial, and polynomial kernels, and K-Nearest Neighbor) is given on simulated GWAS datasets under different disease models. After a discussion of the methods, the best model for each scenario is chosen based on prediction error rate and area under the Receiver Operating Characteristic (ROC) curve.

I certify that the Abstract is a correct representation of the content of this thesis. Chair, Thesis Committee

ACKNOWLEDGMENTS

I would love to thank my amazing thesis advisor Dr. Tao He for helping me through this process, my husband for always pushing me to succeed, and my parents who have supported my math career since high school.

TABLE OF CONTENTS

1 Introduction
  1.1 Biological Background
    1.1.1 Overview
    1.1.2 Genetic Variation
    1.1.3 Genome-wide Association Studies
  1.2 Motivation and Objectives
2 Data Preparation
  2.1 Simulating Genotypes
  2.2 Simulating Phenotypes
  2.3 Correlation Structure in Genotypes
3 Variable Selection Methods
  3.1 The LASSO
  3.2 χ² Test for Independence
  3.3 Random Forest
4 Classification Methods
  4.1 Logistic Regression
  4.2 Linear Discriminant Analysis
  4.3 Random Forest
  4.4 Support Vector Machine
  4.5 K-Nearest Neighbor
5 Results
  5.1 Methods Evaluation and Tuning Parameter Selection
    5.1.1 Cross-Validation and Out of Bag Error
    5.1.2 Classification Error Rate and the ROC Curve
  5.2 Variable Selection Results
    5.2.1 LASSO
    5.2.2 χ² Test
    5.2.3 Random Forest
    5.2.4 Overall
  5.3 Classification Method Results
    5.3.1 Method Details
    5.3.2 Comparisons Within Variable Selection Method
    5.3.3 Comparisons Across Variable Selection Method
6 Conclusion
Appendix A: Detailed Description of Simulation Scenarios
Appendix B: Variable Selection Detailed Results
Appendix C: No Selection Model Details
Bibliography

LIST OF TABLES

2.1 Table of variables removed due to high correlation
3.1 Example of table of values used for a χ² test
5.1 Details of models created in the LASSO Sparse scenario
5.2 Details of models created in the LASSO Dense scenario
5.3 Details of models created in the LASSO Interaction scenario
5.4 Details of models created in the χ² Sparse scenario
5.5 Details of models created in the χ² Dense scenario
5.6 Details of models created in the χ² Interaction scenario
5.7 Details of models created in the Random Forest Sparse scenario
5.8 Details of models created in the Random Forest Dense scenario
5.9 Details of models created in the Random Forest Interaction scenario
5.10 Logistic Classification Summary
5.11 Linear Discriminant Analysis Summary
5.12 Random Forest Summary
5.13 Linear Support Vector Machine Summary
5.14 Radial Support Vector Machine Summary
5.15 Polynomial Support Vector Machine Summary
5.16 K-Nearest Neighbor Summary
6.1 More data about the chromosomes and genes that the SNPs are from
6.2 Variable selection results for LASSO in the Sparse, Dense, and Interaction scenarios
6.3 Variable selection results for χ² in the Sparse scenario
6.4 Variable selection results for χ² in the Dense and Interaction scenarios
6.5 Variable selection results for Random Forest in the Sparse and Dense scenarios
6.6 Variable selection results for Random Forest in the Interaction scenario
6.7 NO SELECTION - SPARSE
6.8 NO SELECTION - DENSE
6.9 NO SELECTION - INTERACTION

LIST OF FIGURES

1.1 Visualization of the process of DNA to protein production
1.2 Diagram of the transition from chromosome to DNA to gene
1.3 Diagram of a gene pathway
1.4 Representation of process from raw data to coded data
1.5 An outline of the different methods used and how they are applied to the simulated data
2.1 How rows of SNP data are organized before and after phasing
2.2 A representation of the relationships between data in each scenario
2.3 Correlation map of the SNPs from chromosome 15
3.1 On the left is the two-dimensional LASSO solution set, the right is Ridge Regression
3.2 A graph of χ² distribution based on degrees of freedom
3.3 A decision tree on the left, with the region it creates on the right
4.1 A plot showing the difference between linear regression and logistic classification
4.2 A plot showing an example of an LDA classification
4.3 Example of ε values in an SVC graph
4.4 Example of cost differences in an SVC graph
4.5 Example of nearest neighbor graphs with different K values
5.1 A representation of how Cross Validation makes training and test sets
5.2 An example of possible ROC curves
5.3 Results of the LASSO λ selection process for the three scenarios
5.4 Collection of graphs from Random Forest error output
5.5 Top nineteen SNPs by variable importance for LDA and Random Forest in the LASSO Sparse scenario
5.6 Top twenty SNPs by variable importance for LDA and Random Forest in the LASSO Dense scenario
5.7 Top twenty SNPs by variable importance for LDA and Random Forest in the LASSO Interaction scenario
5.8 Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Sparse scenario
5.9 Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Dense scenario
5.10 Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Interaction scenario
5.11 Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Sparse scenario
5.12 Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Dense scenario
5.13 Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Interaction scenario
5.14 ROC curves for LASSO scenarios
5.15 ROC curves for χ² scenarios
5.16 ROC curves for Random Forest scenarios

Chapter 1

Introduction

In this chapter, some biological background knowledge is given first, followed by the motivation and objectives.

1.1 Biological Background

1.1.1 Overview

A biological trait is an aspect of an organism that can be described or measured. For all types of organisms, trait differences exist and can vary widely. For example, eye color or susceptibility to certain diseases in humans, yield level of crops, and meat or milk quality in livestock all differ from animal to animal or plant to plant. The different levels that a trait can take on are called phenotypes [24]. Phenotypes of eye color could be blue, brown, or green, while phenotypes of blood type would include O, A, and B.

Figure 1.1: Visualization of the process of DNA to protein production. Image created by Madeleine Price Ball on 23 January 2013, and downloaded in August.

Genotype is the complete DNA sequence of an individual inherited from the parents, which carries the majority of biological information. For simple traits, a section of DNA sequence is responsible for deciding what phenotype the trait exhibits. The process of translating from DNA to phenotype is referred to as the central dogma of molecular biology and can be summarized as: DNA encodes RNA, RNA encodes protein, and the trait is exhibited through biosynthesis of protein [25]. Figure 1.1 illustrates the transition from DNA to protein. In humans, DNA is spread across 23 chromosome pairs, and stored as a sequence of repetitions of four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). At each individual base, every individual has two nucleotides, one inherited from the mother while the other is passed down by the father. A gene is a section of DNA sequence that is the basic physical and functional unit

Figure 1.2: Diagram of the transition from chromosome to DNA to gene. [30]

of heredity. Figure 1.2 shows the breakdown of chromosomes to genes to nucleotides. The size of a gene can vary from a few hundred to 2 million nucleotide bases, and humans have around 25,000 genes. A series of complex interactions among the genes that lead to a certain product or a change in a cell is called a biological pathway. Figure 1.3 shows the pathway for type II diabetes. Each box represents a gene that is part of the structure that controls systems like insulin production.

1.1.2 Genetic Variation

For most of the complex traits like cancer or body height, the difference in phenotype can come from variation in genotype, environmental factors, or the interactions between them (normally referred to as GxE interaction). Humans share 99.9% of their DNA sequences, so only 0.1% of the nucleotides are different. The single-base variation in DNA sequence is called a Single Nucleotide Polymorphism (SNP). This occurs when one nucleotide is accidentally replaced with another. For example, in Figure 1.4 the sequence AGCTAC has become AGCTAG for Individual 1 in the top row and Individual 3 instead has AGCCAG in the bottom row. For each SNP location, there is a more common nucleotide and a less common nucleotide. To decide which nucleotide is more common, the frequency of the nucleotides is calculated for each location; the more frequent is called the major allele, and the other is called the minor allele. In Figure 1.4, SNP1 has a major allele

Figure 1.3: Diagram of a gene pathway [21].

of A since it shows up six times as opposed to T which only shows up twice. Figure 1.4 also shows the transition from nucleotide bases to SNP count. Several coding methods are available to quantify the SNP from the raw genotype. One of the most frequently used coding methods is called additive coding, where the total number of minor alleles is used to represent this SNP for each person. Since humans have two copies of the DNA sequence, this number could be 0, 1, or 2. Individuals 1, 3, and 4 in Figure 1.4 all have a single minor allele in one SNP, represented by the 1 in the respective positions. Individual 2 has a minor allele for SNP1 and SNP2, but also has a minor allele in both spots for SNP3, so the sequence would be coded 1, 1, 2. Another type of genetic variation is copy number variation (CNV), where the number of times a section of DNA is repeated differs from person to person.

Figure 1.4: Representation of process from raw data to coded data. (SNP1: major allele A, minor allele T; SNP2: major allele T, minor allele C; SNP3: major allele C, minor allele G.)
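To make the additive coding concrete, here is a small R sketch using hypothetical data mirroring Individual 2 of Figure 1.4 (this is an illustration, not code from the thesis):

    # Two copies of the DNA sequence for one person, one base per SNP.
    copy1 <- c(SNP1 = "A", SNP2 = "C", SNP3 = "G")
    copy2 <- c(SNP1 = "T", SNP2 = "T", SNP3 = "G")

    # Minor allele at each SNP location (from Figure 1.4).
    minor <- c(SNP1 = "T", SNP2 = "C", SNP3 = "G")

    # Additive coding: count of minor alleles, so each entry is 0, 1, or 2.
    coded <- (copy1 == minor) + (copy2 == minor)
    coded
    # SNP1 SNP2 SNP3
    #    1    1    2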

Like SNPs, CNVs can also be quantified by using similar coding methods. For simplicity, this thesis mainly focuses on the scenarios where a set of SNPs contributes to the risk of disease (i.e., a binary trait of interest), but the analysis can easily be generalized to cases that include more complex factors or more types of genetic variation.

1.1.3 Genome-wide Association Studies

With the advancement of micro-array and sequencing technology, individuals can obtain detailed and accurate genetic markers (e.g., SNPs) of the human genome. In Genome-Wide Association Studies (GWAS), scientists search all the SNPs across the genome to identify the special ones that are associated with certain traits of interest. These could be SNPs for complex diseases, like diabetes and cancer, or quantitative traits like birth weight [42]. In traditional GWAS, a statistical test is first performed for each of the SNPs and a p-value is obtained accordingly. The SNPs whose p-values are less than a very small threshold, commonly $10^{-7}$, are considered significant markers that are associated with that trait. GWAS have successfully identified SNPs related to several complex conditions. However, the single-marker-based analysis of each SNP has a major drawback [41]. In reality, SNPs tend to work together as a system rather than individually to realize certain functions, and testing each SNP separately ignores the complex interactions between them. Moreover, since every single SNP that is truly involved in the complex system only

contributes a small effect toward the trait, it is very likely that the individual tests will not be able to identify those SNPs, as the genome-wide p-value threshold is too stringent.

1.2 Motivation and Objectives

Although the original goal of GWAS was to identify the disease-related SNPs, more recent studies show that those identified using traditional methods can only explain a small fraction of heritability. So there must be many other SNPs functioning in the complex system which are extremely difficult to uncover. Instead of finding the whole set of associated markers, the new objective is to make individual predictions about the disease risk for each person. Many statistical learning methods have been applied to GWAS for the prediction purpose, such as Random Forest [32], Group LASSO [37], and Bayesian Neural Networks [1]. Each method is optimized for different conditions, so if the structure of the data is well known, it may be obvious which method would work best. But if little is known about the data, it may be difficult to decide how to analyze it. Since there is not one best method to use in every situation, multiple methods are almost always used and the results compiled together to make conclusions [31]. This thesis work is mainly focused on applying several statistical learning methods to different sets of simulated data to discover which methods perform best under different scenarios. Three different scenarios were built with increasing levels of

Figure 1.5: An outline of the different methods used and how they are applied to the simulated data.

complexity in the structure of the response variable: Sparse, Dense, and Interaction. Since GWAS data is normally high-dimensional (i.e., the total number of SNPs is large relative to the number of individuals), variable selection is almost always recommended before applying the algorithms, to improve the prediction performance. Here, three variable selection methods were used for each scenario: LASSO, χ² test, and Random Forest. Finally, for each of the nine subsets of data created by the variable selection, seven classification methods were run and the outcomes compared. Figure 1.5 gives an overview of the process. The thesis is organized as follows. Chapter 2 describes how the data was prepared, including the simulation of genotypes and phenotypes as well as the design of three different scenarios. The general ideas and principles of the three variable selection methods are briefly reviewed in Chapter 3. Chapter 4 provides an overview

about the seven statistical learning methods for classification. The results of variable selection and prediction are summarized in Chapter 5. A brief discussion of the paper is given in Chapter 6. Some simulation details are relegated to the appendices.

Chapter 2

Data Preparation

2.1 Simulating Genotypes

The genotypes were simulated based on two nested case-control cohort type II diabetes datasets, the Nurses' Health Study (NHS) and the Health Professionals Follow-up Study (HPFS), which are part of the Gene Environment Association Studies (GENEVA) [23]. For more detailed information about the datasets, please refer to "Diet, lifestyle, and the risk of type 2 diabetes mellitus in women" [19] and "Dietary patterns and risk for type 2 diabetes mellitus in U.S. men" [39]. The raw datasets originally include 3391 female and 2599 male individuals, respectively. Following the conventional data-cleaning procedures, individuals with a large proportion of missing SNPs (>10%) or any kinship relationship with others in the datasets were removed, and SNPs with small minor allele frequency (<0.05) were also removed. SNPs that are located within 50kb up- and down-stream of a gene were

first mapped to the corresponding gene, based on Human Genome Build 37.3, and then a group of genes was mapped to a biological pathway. Specifically, we considered a Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway named maturity-onset-diabetes-of-the-young, which contains 599 SNPs over 25 genes from a total of 5961 individuals. When the raw data was stored, the different nucleotides at each spot were recorded in two rows, but which data came from the mother and which from the father was not kept track of. The process of sorting the data back into the mother section and father section is called phasing and is pictured in Figure 2.1. The phased sequence tends to be inherited as a whole from one of the parents, and is called a haplotype. The 25 genes covering 599 SNPs were mapped to 14 chromosomes (1, 2, 3, 4, 7, 8, 10, 11, 12, 13, 15, 17, 19, 20) and then phased using fastPHASE [36]. Based on the phased data, haplotype frequencies on each chromosome can be estimated and used to simulate new genotypes. Specifically, 14 files were generated corresponding to the phased data for all the individuals, one for each chromosome.

Figure 2.1: How rows of SNP data are organized before and after phasing.

Next, the 14 output files are fed into the R package hapsim [28], which utilizes the haplotype frequency information and produces a simulated set of any size required. While the new data is heavily based on the original, none of the genotypes are copied directly into the simulated data; it is all new combinations of nucleotides. For each chromosome, 60,000 haplotypes were first simulated and then a random subset of 40,000 was selected, paired, and assigned to 20,000 individuals. Finally, the simulated haplotypes for the 14 chromosomes are combined into one data matrix with 599 columns and 20,000 rows, representing the genotypes of 599 SNPs for the 20,000 individuals. Additive coding was used to quantify the SNP genotypes, i.e., each entry in the matrix is the total number of minor alleles for each SNP and each individual.

2.2 Simulating Phenotypes

Though the original data did have phenotypes associated with it, the details of how those phenotypes related back to the genotypes were unknown. In order to draw conclusions about the different classification methods used, the exact structure of the response variable must be known, so new response variables were created based on the simulated data. Here, the theoretical disease status will be generated under three different simulation scenarios: Sparse, Dense, and Interaction. More specifically, the

disease status of the $i$th individual, $Y_i$, was generated through a Bernoulli distribution

$$Y_i \sim \text{Ber}(p_i) \quad \text{with} \quad \text{logit}(p_i \mid X_i) = h(X_i), \quad \text{or} \quad p_i \mid X_i = \frac{1}{1 + \exp(-h(X_i))}$$

where $X_i$ represents the $i$th individual's SNP genotype vector, $Y_i$ is a binary variable with $Y_i = 1$ indicating the person has the disease (case) and $Y_i = 0$ indicating no such disease (control), $p_i$ represents the risk of having the disease for the $i$th person, and the function $h(\cdot)$ controls how genotypes contribute to the disease risk. Under different scenarios, the function takes different forms, corresponding to different disease models. The three scenarios are described in Figure 2.2. The Sparse scenario considers the case where five SNPs contribute to the disease risk, which come from five distinct genes, each with a moderate marginal effect. The Dense scenario assumes that from the five genes there are 32 causal SNPs, each with a smaller marginal effect. The Interaction scenario considers a more complex model where interactions between and within genes also exist, in addition to the marginal effects. More detailed information about the five genes and the exact forms of $h(\cdot)$ are provided in Appendix A. After the phenotypes of the 20,000 individuals are generated, 3,000 individuals with 1,500 disease and 1,500 non-disease are randomly selected as training data, and 400 individuals with 200 disease and 200 non-disease

are randomly selected as test data.

Figure 2.2: A representation of the relationships between data in each scenario. (The five genes involved are GCK, NR5A2, HNF1B, NEUROG3, and HNF1A.)

2.3 Correlation Structure in Genotypes

To reduce the amount of computational resources required, the correlation of the 599 SNPs was calculated. If variables are highly correlated, the predictive model could be a bit unstable or even infeasible [2]. In the case of SNPs, variables that are close to each other on the gene tend to be highly correlated. In the correlation diagram in Figure 2.3, a darker color corresponds to a stronger correlation. It is clear that the darker cells tend to gather around the center diagonal. For every pair of variables correlated at a higher than 0.95 level, the variable with the higher index was removed. This led to a few of the original variables getting cut out very early in the process. Table 2.1 lists which variables were removed, and

which variable will be followed instead through the process. Variables V84 and V302 were already part of the 32 chosen to build the response variable, so the maximum number of variables that could be picked up by a model is reduced to 30. A total of 179 variables were removed, leaving 420 SNPs to study.

Table 2.1: Table of variables removed due to high correlation.

Original Variable   Surrogate Variable
V86                 V84
V190                V187
V192                V188
V200                V199
V305                V302
V308                V306
V490                V489
V515                V514

Figure 2.3: Correlation map of the SNPs from chromosome 15.
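A rough R sketch of this preparation pipeline is given below. The genotype matrix geno and the risk function h() are placeholders (the exact forms of h() are in Appendix A), and caret's findCorrelation is one way to apply the 0.95 rule, though its tie-breaking differs slightly from the drop-the-higher-index convention used in this thesis:

    library(caret)  # for findCorrelation()

    set.seed(1)
    n <- 20000; p <- 599
    geno <- matrix(rbinom(n * p, 2, 0.3), nrow = n)  # placeholder 0/1/2 counts

    # Placeholder Sparse-style risk function; the real h() is in Appendix A.
    h <- function(X) -1 + 0.4 * rowSums(X[, c(100, 199, 300, 400, 500)])

    # Disease status: Y_i ~ Bernoulli(p_i) with logit(p_i) = h(X_i).
    pr <- 1 / (1 + exp(-h(geno)))
    y  <- rbinom(n, 1, pr)

    # Prune one SNP from every pair with correlation above 0.95.
    drop <- findCorrelation(cor(geno), cutoff = 0.95)
    geno.pruned <- geno[, -drop]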

Chapter 3

Variable Selection Methods

Because the high dimensionality of the predictors normally negatively impacts classification performance (the so-called curse of high-dimensionality), a variable selection procedure will first be applied to reduce the dimension of the variables. Specifically, three different selection methods are utilized to narrow down the 599 SNP variables to a smaller subset. In the following sections, a brief description will be given for each of the three methods (LASSO, χ² test, Random Forest).

3.1 The LASSO

The LASSO, or least absolute shrinkage and selection operator, is an analysis method proposed by Robert Tibshirani in 1996 as an extreme form of the Ridge Regression technique [38]. Ridge Regression starts with a basic linear regression model

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$

but adds an extra term when calculating the values of the coefficients that acts as a penalty on the size of the $\beta_j$s.

Least Squares: choose $\beta_0, \beta_1, \dots, \beta_p$ to minimize $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$

Ridge Regression: choose $\beta$s to minimize $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$

Here $n$ is the number of observations, $y_i$ is the $i$th observation's response value, and $x_{ij}$ is the $j$th input for the $i$th observation. The $\lambda$ in the second equation is a tuning parameter that is usually determined by resampling, which is the process of taking the known data and making subsets to emulate a test set. Each choice of $\lambda$ will result in a different set of parameters, with a larger $\lambda$ giving smaller coefficients overall. But Ridge Regression can never force a coefficient to be exactly zero, so all variables are still included in the model. The LASSO, on the other hand, can force coefficients to zero by using an $\ell_1$ penalty instead of the $\ell_2$ from Ridge Regression.

LASSO: choose $\beta_0, \beta_1, \dots, \beta_p$ to minimize $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$ [38]

To see why the LASSO can cancel out variables completely, two slightly different forms of the Ridge Regression and LASSO equations are considered. Both methods can be seen as the solutions to the following problems:

Ridge Regression: minimize over $\beta_0, \beta_1, \dots, \beta_p$ the quantity $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le t$ [18]

LASSO: minimize over $\beta_0, \beta_1, \dots, \beta_p$ the quantity $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le t$ [38]

Say there are only two coefficients, $\beta_1$ and $\beta_2$. Then the constraint region for Ridge Regression is $\beta_1^2 + \beta_2^2 \le t$ and for LASSO is $|\beta_1| + |\beta_2| \le t$ [15]. Pictured in Figure 3.1 are those regions mapped along with the elliptical solution curves.

Figure 3.1: On the left is the two-dimensional LASSO solution set, the right is Ridge Regression. $\hat\beta$ represents the coefficient values that would be chosen by a least squares model. The solid shapes centered on the origin are the constraint areas, while the ellipses centered on $\hat\beta$ are the solution curves. [15]

For Ridge Regression, since the constraint region is circular, the intersection will almost never lie on a point where one of the betas is zero [20]. But the pointed corners of the LASSO region make the intersection much more likely to occur at an axis, where one of the betas is exactly zero [20]. If there are more $\beta$s and the dimension increases, the constraint shapes will have similar properties; Ridge Regression will have a smooth $n$-dimensional sphere and LASSO will have a polytope with many places for betas that are zero [20]. The clear advantage of LASSO over least squares and Ridge Regression is the variable selection property. With a sparse data set, a few important variables hidden within many, cutting out unnecessary variables is a critical step. The loss of variables can cause a decrease in prediction accuracy, but it is normally offset by the increased interpretability of the model. And though the LASSO was first used with least squares models, it has since been extended to cover a wide variety of generalized linear models. In particular for this study, because of the categorical response variable, the Logistic LASSO is used. In this setting the LASSO is a solution to:

minimize over $\beta_0, \beta_1, \dots, \beta_p$ the quantity $\sum_{i=1}^{n} \left[ -y_i \left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right) + \log\left( 1 + e^{\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}} \right) \right]$ subject to $\sum_{j=1}^{p} |\beta_j| \le t$ [15].
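In R, the logistic LASSO is available through the glmnet package, which the thesis uses later via cv.glmnet. A minimal sketch, where x is a placeholder numeric SNP matrix and y the 0/1 disease status:

    library(glmnet)

    # alpha = 1 selects the l1 (LASSO) penalty; family = "binomial"
    # gives the logistic form shown above.
    fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)

    # Coefficients at the CV-chosen lambda; SNPs whose coefficients were
    # forced to exactly zero have been dropped from the model.
    b <- coef(fit, s = "lambda.min")
    selected <- setdiff(rownames(b)[as.vector(b) != 0], "(Intercept)")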

3.2 χ² Test for Independence

The Chi-Square, or χ², test is a classic method to test if two categorical variables are independent. The null hypothesis states that patterns in the data sets occurred by chance, while the alternative hypothesis states that a change in one set of data is matched by a change in the other data set [33]. Before the χ² statistic is calculated, a contingency table is created. Table 3.1 gives an example using the data from SNP1 and the response variable from the Sparse scenario. Let $O_{ij}$ represent the number of observations in the $i$th row and $j$th column, $r$ represent the number of rows, $c$ represent the number of columns, and $N$ the sum of all the entries in the table. Then $E_{ij}$ represents the expected value of each entry $ij$, with

$$E_{ij} = \frac{(\text{sum of row } i)(\text{sum of column } j)}{N}.$$

For example, if the expected value of the $x_i = 0$, $y_i = 0$ entry was to be calculated, first the sum of the first row is found, then multiplied by the sum of the first column, and then divided by the total sum of all the entries.

Table 3.1: Example of table of values used for a χ² test (rows: counts of the SNP1 genotype values $x_i = 0, 1, 2$; columns: the response values $y_i = 0, 1$).

Figure 3.2: A graph of χ² distribution based on degrees of freedom.

So if SNP1 and the response variable were independent, we would expect to see approximately 843 observations fall into the first entry of the table. Finally the χ² statistic is given by

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \quad [33].$$

This χ² value is then compared to the χ² distribution to calculate a p-value. Figure 3.2 shows a few different distributions, based on the degrees of freedom. The degrees of freedom are calculated by $df = (r-1)(c-1)$, so for our model the degrees of freedom is 2.
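A sketch of this test in R for a single SNP, and then for every SNP in a genotype matrix; snp, y, and geno are placeholders:

    # 3x2 contingency table of minor-allele counts (0/1/2) against disease.
    tab <- table(snp, y)

    # Pearson's chi-squared test of independence; here df = (3-1)(2-1) = 2.
    chisq.test(tab)$p.value

    # Repeat for every SNP column and keep those significant at 0.05.
    pvals <- apply(geno, 2, function(s) chisq.test(table(s, y))$p.value)
    keep  <- which(pvals < 0.05)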

Figure 3.3: A decision tree on the left, with the region it creates on the right. [20]

3.3 Random Forest

Decision trees are a very flexible method of splitting the feature space into sections based on criteria on certain variables. They make no assumptions about the structure of the data and, for interpretation, produce a list of variables that are most important in grouping the data. A decision tree is a graph made up of splitting nodes starting at the top of the tree, and terminal nodes, or leaves, at the bottom. Each node represents a grouping of the data into two parts based on some variable criterion. An example of a decision tree from a two-variable set and the regions it creates is given in Figure 3.3. The general process of building a decision tree starts with dividing the data into

$J$ distinct non-overlapping regions, $R_1, R_2, \dots, R_J$, and assigning the same response variable value to each observation in the region [20]. These regions, normally simple $p$-dimensional rectangles, are chosen in order to minimize the error given by the residual sum of squares

$$RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$$

where $\hat{y}_{R_j}$ is the mean response of the observations in $R_j$ [20]. Since it is impossible to try every combination of regions, a top-down, greedy approach called recursive binary splitting is used [34]. The process is top-down because one split begins the tree, and then two more splits are made from those two original branches. Greedy refers to the fact that the best choice is made for the current split without considering future decisions. For the regression setting, at each split a variable $X_j$ is chosen, and the best cutpoint $s$ in the region is found such that

$$\sum_{i:\, x_i \in R_1} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2} (y_i - \hat{y}_{R_2})^2$$

is minimized, where $R_1 = \{x \mid x_j < s\}$ and $R_2 = \{x \mid x_j \ge s\}$ [20]. To predict a response, a given observation is placed in the region that fits the correct conditions, and the mean of the region is given as the response. A few changes are needed for a classification setting, but the overall ideas are the same. Instead of calculating the mean of the observations in a region, the most commonly occurring class is found. For the splitting criterion, a measure of node purity called the Gini index is used in place of the RSS. Let $\hat{p}_{mk}$ represent the proportion of

observations in the $m$th region that are in the $k$th class; then the Gini index is

$$G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}) \quad [20].$$

A small value of $G$ means that the $\hat{p}_{mk}$s are all close to 0 or 1, so the split has made the observations in the region almost all one class. One disadvantage of decision trees is that on their own they tend to not have the predictive power of other methods. The Random Forest method is one way to improve the predictive accuracy by growing multiple trees and averaging the results. Tin Kam Ho first developed Random Decision Forests in 1995 to increase the accuracy and complexity of decision trees [17]. Each tree is grown on a random bootstrap sample of the training observations, a process known as bootstrapping, to simulate multiple training sets instead of just one [29]. To keep the trees from becoming too highly correlated, whenever a split is being calculated only a random subset of predictors is allowed to be used for the split [20]. Generally, if there are $p$ predictors, $\sqrt{p}$ variables are chosen for each decision node [15]. By using fewer than half of the variables at each spot, it gives a chance for weaker, but still important, features to show through, instead of being overrun [5].
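The thesis later runs this selection step with the randomForest package [26]; a minimal sketch of ranking SNPs by decrease in accuracy, on placeholder data and with the top-100 cutoff used in Chapter 5:

    library(randomForest)

    # importance = TRUE adds the permutation-based decrease-in-accuracy
    # measure alongside the default Gini-based importance.
    rf <- randomForest(x = geno, y = factor(y), ntree = 500, importance = TRUE)

    # Rank SNPs by mean decrease in accuracy and keep the top 100.
    acc <- importance(rf, type = 1)  # type 1 = mean decrease in accuracy
    top100 <- order(acc, decreasing = TRUE)[1:100]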

Chapter 4

Classification Methods

After the variable selection methods have been applied, the following data analysis methods are run on each set of the remaining variables.

4.1 Logistic Regression

Logistic Classification is the extension of linear regression to a data set with a qualitative response variable. Instead of modeling the qualitative variable directly, the probability that the response falls into one of the categories is modeled. To keep the probability values between 0 and 1, a logistic function is used, where $p(x_i)$ is the probability that an observation is a $y_i = 1$ case:

$$p(x_i) = \frac{e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}}$$
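A minimal sketch of this model in R (the thesis fits it with glm(), described in Chapter 5); train and test are placeholder data frames with a 0/1 column y:

    # family = binomial fits logistic regression by maximum likelihood.
    fit <- glm(y ~ ., data = train, family = binomial)

    # Predicted probabilities p(x_i) on new data, then a 0.5 cutoff.
    p.hat <- predict(fit, newdata = test, type = "response")
    y.hat <- as.integer(p.hat > 0.5)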

Figure 4.1: A plot showing the difference between linear regression and logistic classification.

In Figure 4.1, an example of basic linear regression and logistic classification curves is given for comparison.

4.2 Linear Discriminant Analysis

Linear Discriminant Analysis, or LDA, is a classification technique that can give estimates that are more stable than logistic regression in certain cases, like when the classes of the response variable are well separated [20]. LDA uses Bayes' theorem to turn the distribution of the predictors into estimates for the probability of the

response, given the prior probabilities [27]. Bayes' theorem states:

$$\Pr(Y = k \mid X = x) = \frac{\pi_k \Pr(X = x \mid Y = k)}{\sum_{l=1}^{K} \pi_l \Pr(X = x \mid Y = l)}$$

where $\pi_k$ is the prior, or overall, probability that an observation belongs to class $k$, and $K$ is the number of classes [20]. This estimated probability is often called the posterior probability that observation $X$ is in class $k$. Estimating $\Pr(X = x \mid Y = k)$ is the main job of the LDA. The setup does assume that $X = (X_1, X_2, \dots, X_p)$ comes from a multivariate Gaussian distribution, with class mean $\mu_k$ and common covariance $\mathrm{Cov}(X) = \Sigma$. In this case

$$\Pr(X = x \mid Y = k) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) \right).$$

Figure 4.2: A plot showing an example of an LDA classification.

It can be shown by combining this formula with Bayes' theorem that an observation $X$ will be assigned to the class for which

$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$$

is largest, where $\mu_k$ is the mean vector of all observations with class $k$ [20]. LDA estimates the $\Sigma$, $\mu_k$s, and $\pi_k$s needed to calculate $\delta_k(x)$ and assigns the observation to its class. Figure 4.2 shows a plot for a two-variable data set with three factors for the response.

4.3 Random Forest

Logistic Regression and LDA have linear equations at their base; the next three methods are much more flexible and assume almost nothing about the structure of the original data. The same Random Forest as described in Chapter 3 was used again to fine-tune the data, even in the case where Random Forest was originally used for variable selection, which is encouraged [6].

4.4 Support Vector Machine

The Support Vector Classifier, SVC, is a classification technique that was developed for a two-level response variable [9]. The method attempts to split the data points into two groups with a hyperplane separator, $\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = 0$. Any points whose response variable value does not match the label of the group are considered misclassified. In most cases the data cannot be perfectly separated, so a margin is

also built around the hyperplane. The classifier comes from optimizing the following system of equations:

maximize $M$ over $\beta_0, \beta_1, \dots, \beta_p, \epsilon_1, \dots, \epsilon_n$

subject to $\sum_{j=1}^{p} \beta_j^2 = 1$,

$y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i)$ for $i = 1, \dots, n$,

$\epsilon_i \ge 0$ for $i = 1, \dots, n$, and $\sum_{i=1}^{n} \epsilon_i \le C$,

where $(x_{i1}, x_{i2}, \dots, x_{ip}, y_i)$ is a data point, $C$ is a non-negative tuning parameter, and the $\epsilon_i$s are slack variables [20]. The slack variables are terms that allow the data point room to be misclassified. If $\epsilon_i = 0$ then the $i$th data point is on the correct side of the hyperplane, if $0 < \epsilon_i < 1$ the data point is in the margin but still on the correct side of the hyperplane, and if $\epsilon_i > 1$ the data point is on the wrong side of the hyperplane [9]. Figure 4.3 shows the $\epsilon$ values of a few points, one from each possibility. $C$ is related to the cost parameter, and it is a bound on how many data points are allowed to be misclassified [15]. The higher the cost used in the model, the smaller the margins are forced to be. In Figure 4.4, the same data is modeled with different cost values, causing different classifications and margins to appear.

Figure 4.3: Example of ε values in an SVC graph.

The most interesting aspect of the SVC is that the observations that are outside of the margin, on the correct side, have no effect on the placement of the hyperplane. All the points that are not circled in the SVC plots in Figure 4.4 could be shifted around without changing the boundaries, as long as they did not move into the margin while shifting. The data points that do define the hyperplane and margin are called support vectors [20]. The Support Vector Machine, SVM, extends the idea of an SVC to a model with a non-linear decision boundary [20]. This is accomplished by enlarging the feature space using kernels so that the decision boundary is linear in the enlarged space, but non-linear in the original space [9]. The kernel uses the solution to the SVC optimization problem that involves only the inner products of the observations [20].

Figure 4.4: Example of cost differences in an SVC graph (panels: Cost=1, Cost=5, and Cost=20, among others).

In this study, the linear, polynomial, and radial kernels are all considered. Below are the kernels used in each case:

Linear: $K(X_i, X_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}$

Polynomial: $K_P(X_i, X_{i'}) = \left( 1 + \sum_{j=1}^{p} x_{ij} x_{i'j} \right)^d$

Radial: $K_R(X_i, X_{i'}) = \exp\left( -\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right)$

Here $d$ is the desired degree of the polynomial, and $\gamma$ is another positive tuning parameter for the radial kernel.
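These kernels map directly onto the kernel argument of svm() in R's e1071 package, which the thesis tunes in Chapter 5. A sketch on placeholder data (y must be a factor for classification); note that e1071 actually parameterizes the polynomial kernel as (γ u'v + coef0)^d rather than the textbook form above:

    library(e1071)

    # train has a factor response column y; parameter values are placeholders.
    fit.lin  <- svm(y ~ ., data = train, kernel = "linear",     cost = 1)
    fit.rad  <- svm(y ~ ., data = train, kernel = "radial",     cost = 1,
                    gamma = 0.1)
    fit.poly <- svm(y ~ ., data = train, kernel = "polynomial", cost = 1,
                    gamma = 0.1, degree = 2)

    # Predicted classes on the test set from the radial-kernel machine.
    y.hat <- predict(fit.rad, newdata = test)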

4.5 K-Nearest Neighbor

K-Nearest Neighbors Classification, or KNN, is a non-parametric model that chooses the assignment of an observation based on the other observations closest to it. The $K$ tells how many neighbors to consider, and its value can greatly change the model. Figure 4.5 shows how the classification can change with the value of $K$. Though there are multiple ways to define distance, Euclidean distance is usually the most common.

Figure 4.5: Example of nearest neighbor graphs with different K values (panels: 1-, 3-, 10-, and 15-nearest neighbor).

KNN estimates the conditional probability with

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$$

where $x_0$ is the observation being classified, $N_0$ is the set of $K$ observations closest to $x_0$, and $I(\cdot)$ is the indicator function [20]. Then the observation is assigned to whichever class has the highest probability. KNN is strong in that there are almost no assumptions made by the model, so it can work when very little is known about

the structure of the data set [10]. Its drawback, though, is that it suffers from the curse of dimensionality; the more predictor variables present, the more dimensions are used, and the sparser the data gets in the $p$-dimensional mapping [3].
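A sketch of KNN with the class package (the thesis calls its knn() function [40]); unlike the previous methods there is no fitted model, since the training data itself is used at prediction time:

    library(class)

    # Each test row is assigned the majority class among its k nearest
    # training rows in Euclidean distance; inputs are placeholder matrices.
    y.hat <- knn(train = x.train, test = x.test, cl = factor(y.train), k = 15)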

Chapter 5

Results

5.1 Methods Evaluation and Tuning Parameter Selection

5.1.1 Cross-Validation and Out of Bag Error

In order to estimate the test error and tuning parameter values, the resampling method of Cross Validation is employed. In most statistical analysis situations, there is no predetermined validation, or test, set to assess the model with. Cross Validation works by segmenting the data and creating temporary test sets to estimate the error rate of the model. There are several different types of Cross Validation, depending on how many segments are created in the process. One method is called leave-one-out cross validation, LOOCV, where all but one observation is used as the training set and the test error is calculated from the single observation. However, this method has a high computational cost and may not be reasonable for certain methods or large data sets. On the other end, using only one train and test set will make the

estimated error have high variance. The most common compromise between the two is 10-Fold Cross Validation [22]. First the data is randomly broken up into 10 groups, or folds. Then 9 of the folds are used as a training set, and the 10th is used as a test set, as diagrammed in Figure 5.1. A model is built on the training set, and then the error is calculated on the test set; most commonly used in regression is the mean square error

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad [20].$$

For the classification setting, the classification error rate is used, which is described in the next section. Then a new unique choice of the 9 sets is made, and the new test set is used again to calculate the error. Finally, all 10 of the error values are averaged together and given as the Cross Validation error. These error values are uncorrelated enough to keep the variance reasonably low, but they also vary enough to keep bias low as well, so the 10-Fold Cross Validation error is accepted as an accurate estimate of the true test error [20]. It is used to estimate the test error for all classification methods.

Figure 5.1: A representation of how Cross Validation makes training and test sets.
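A hand-rolled sketch of the 10-fold procedure for one classifier (logistic regression here); packages such as boot wrap the same idea, and the fold assignment below is an illustrative choice:

    set.seed(1)
    n    <- nrow(train)
    fold <- sample(rep(1:10, length.out = n))  # random fold labels

    errs <- sapply(1:10, function(k) {
      fit <- glm(y ~ ., data = train[fold != k, ], family = binomial)
      p   <- predict(fit, newdata = train[fold == k, ], type = "response")
      mean(as.integer(p > 0.5) != train$y[fold == k])  # held-out fold error
    })
    cv.error <- mean(errs)  # 10-fold CV estimate of the test error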

For estimating parameters, a list of possible values must be chosen beforehand; usually a range for the values is selected through practice. Then for each choice of parameter value, a 10-fold Cross Validation is run and the error is recorded. The parameter from the model with the least CV error is then chosen as the best value for the parameter. 10-fold CV is used in this study to estimate λ for the LASSO, K for KNN, and cost, γ, and degree for the SVM. For the Random Forest Variable Selection, as the bootstrapping process builds the model, it can estimate the error of the model, so no extra Cross Validation calculation is needed [16]. The random sample of observations that bootstrapping takes for each tree creates a training set, and the left-out, or Out Of Bag, observations are used as a testing set [4]. If enough trees are grown for the Random Forest, the Out Of Bag error is very similar to the 10-fold Cross Validation error [20].

5.1.2 Classification Error Rate and the ROC Curve

For each variable selection method, the seven different classification methods were run on the remaining variable sets. There are then two ways the results are compared: classification error rate and area under the ROC curve. The classification error rate is used in place of the regression's mean squared error, but the interpretations are equivalent. Classification error gives the percent of observations whose predicted class does not match their true class. A low classification error rate means the model is accurate at predicting the response value of an observation.

Figure 5.2: An example of possible ROC curves.

The Receiver Operating Characteristic, ROC, Curve is a plot of the true positive rate of a model against the false positive rate [14]. It works by taking the probabilities calculated in a model, sliding the cutoff percentage from 0 to 1, and calculating the error rates for each cutoff. The area under the ROC curve, AUC, is a measure of how accurate the model is: the more area, the more accurate the model [8]. A strong ROC curve will hug the upper left corner of the graph, while a curve close to the diagonal is not much more use than randomly guessing the class a data point falls into. Figure 5.2 shows an example of each case.
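One way to trace the curve and compute its area in R is the pROC package (an illustrative choice; the thesis does not name the tool it used):

    library(pROC)

    # p.hat: model probabilities on the test set; y.test: true 0/1 labels.
    r <- roc(response = y.test, predictor = p.hat)
    auc(r)   # area under the ROC curve
    plot(r)  # the curve itself: TPR against FPR as the cutoff slides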

5.2 Variable Selection Results

To start the process of analyzing the SNP data, the train and test sets are read into R, and the highly correlated variables are removed. Then for each method, the selection is run and a test error is calculated for later comparisons. Finally the train and test sets are reduced to the subset of variables selected for further study.

5.2.1 LASSO

To begin the LASSO, the cv.glmnet function [12] is used to find the best value for λ by a 10-fold Cross Validation. For all three scenarios, a range of 100 λs from $10^{-5}$ to $10^{1}$ is checked. In Figure 5.3, the left graphs show the coefficients decreasing to zero as λ increases. The right graphs show the calculations of the CV process to find the best λ. The top row is the Sparse results, the middle is the Dense, and the bottom is the Interaction. In the Sparse coefficient graph, there are five lines that are reduced to zero much more slowly than the rest of the set. These lines most likely correspond to the five variables used in that scenario: V100, V199, V300, V400, V500. The Dense scenario coefficients are not as separated as the Sparse, but there does appear to be a set of three variables that take longer to reach zero. The Interaction case looks similar to the Dense, except for one variable that has a very high coefficient value compared to the rest. For each scenario, the λ value chosen is the one minimizing the CV error curve shown in Figure 5.3.

Figure 5.3: Results of the LASSO λ selection process for the three scenarios (left: coefficient paths against the L1 norm; right: CV error against log(λ); rows: Sparse, Dense, Interaction).
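A sketch matching the search just described: cv.glmnet [12] over a grid of 100 λ values from $10^{-5}$ to $10^{1}$, on placeholder training objects:

    library(glmnet)

    grid <- 10^seq(-5, 1, length.out = 100)
    cv.fit <- cv.glmnet(x.train, y.train, family = "binomial", alpha = 1,
                        lambda = grid, nfolds = 10)

    plot(cv.fit)                      # CV error against log(lambda)
    lambda.best <- cv.fit$lambda.min  # value minimizing the CV error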

With these λ values, a final LASSO model is made for each scenario. The table below gives the results of the variables selected. The details of which variables are selected are given in Table 6.2 in Appendix B. A neighbor is counted when a true variable is missed but the SNP right next to it is selected; for example, if V405 is not selected by the model but V404 is, V404 would count as a neighbor variable. For the Sparse case, the LASSO places the five true variables at the top of the list, with an extra 14 variables after. For the Dense case, the first 20 variables on the list are true variables, and most of the rest follow shortly. The last true variable, V412, is about two-thirds down the list of 79 total variables. The Interaction model selects 97 total variables, with the first 23 being true variables. Here the model misses 5 of the variables but picks up 4 neighbors, so it manages to miss only one variable completely.

LASSO Results (number of true, neighbor, and missing variables per scenario):

Scenario       True Variables   Neighbor Variables   Missing Variables
Sparse
Dense
Interaction

5.2.2 χ² Test

In the second variable selection method, the χ² tests, there is no model built and no tuning parameters that need calculating. In R, the function chisq.test() [35] is used

to run the χ² tests, and any SNP with a p-value less than 0.05 is considered dependent. The table below gives a brief summary of the results, with the details appearing in Appendix B in Tables 6.3 and 6.4. Each variable that is significant at the 0.05 level is kept to be further analyzed. As with the previous variable selection method, the χ² tests find all five true variables in the Sparse scenario. They are among 93 variables that are deemed significant. In the Dense scenario, all but three variables are found, but those three do have neighbors that are included in the model. A total of 135 variables are listed as significant. The Interaction scenario has 6 true variables missing, but only 4 have neighbors in the model. Variables V310 and V412 are missed by the model. There are only 99 variables in total selected, less than the Dense scenario, whereas the opposite is true for the LASSO.

χ² Results (number of true, neighbor, and missing variables per scenario):

Scenario       True Variables   Neighbor Variables   Missing Variables
Sparse
Dense
Interaction

5.2.3 Random Forest

The Random Forest is run using the randomForest [26] function in R. As described earlier in the chapter, the Out Of Bag error is calculated as the model is being created. In Figure 5.4 the error is graphed based on the number of trees currently grown. For all scenarios, as the number of trees approaches 500, the error levels off. Sparse ends up with about 30% classification error. Dense stays at about 22% error for most of the process. The Interaction case quickly reaches a classification error rate of about 15%. The error rate in both Dense and Interaction appears to be better for the disease case than the non-disease case, as the red lines in both graphs are above the OOB error and the green is below. This is acceptable, as it would be more important to get the disease cases correctly identified than the non-disease cases. The variables are ordered by a measure of variable importance called decrease in accuracy. It measures how much accuracy in the model is lost when the variable is removed. For the Random Forest, the top 100 variables by variable importance are chosen in each scenario. A quick summary of the results is given in the table below, and the particulars are in Appendix B, Tables 6.5 and 6.6.

Random Forest Results (number of true, neighbor, and missing variables per scenario):

Scenario       True Variables   Neighbor Variables   Missing Variables
Sparse
Dense
Interaction

Figure 5.4: Collection of graphs from Random Forest error output (error against number of trees for the Sparse, Dense, and Interaction scenarios).
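A sketch of reading off the quantities shown in Figure 5.4, plus the importance ranking, from a fitted randomForest object (placeholder data):

    library(randomForest)

    rf <- randomForest(x = x.train, y = factor(y.train), ntree = 500,
                       importance = TRUE)

    # OOB error by number of trees, as in Figure 5.4; the "OOB" column is
    # the overall rate, the remaining columns are the per-class rates.
    plot(rf$err.rate[, "OOB"], type = "l", xlab = "trees", ylab = "OOB error")

    varImpPlot(rf, type = 1)  # ranking by mean decrease in accuracy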

In the Sparse scenario, the five true variables are listed first, in almost the same order as in the LASSO model. Unfortunately the Dense scenario is not as clear-cut. Only four true variables are listed before others start mixing into the list, and the model misses six variables, with only four having neighbors in the model. The Interaction scenario similarly has other variables mixing in after the fourth spot, and is missing eight total variables, with five having neighbors.

5.2.4 Overall

All three methods were very successful at picking up the true variables in the Sparse case. For the Dense case, the LASSO and χ² pick up all the variables, but the Random Forest drops two. The Interaction case is the hardest for the models to define; only the LASSO identifies all the variables or a neighbor. The errors from the Random Forest models suggest that the more complex the scenario is, the better the prediction results that can be achieved. When there are so few true variables compared to the total variables, the extra noise variables have more chance to be included. For the Sparse scenario, all three models include, along with the five true variables, eight other variables: V79, V99, V105, V172, V182, V416, V421, V501. In the Dense scenario, the χ² and Random Forest both have trouble with V195 and V492. Both models include V194 and V493 in their place. Finally, for the Interaction scenario, all three models miss V187 but do

get V188, and only include one of V197 or V199. In addition, two out of the three models replace V80 with V79, miss V195, and replace V405 with V404. At this stage of the analysis, it is also hard to draw conclusions about how many true variables the models are suggesting. LASSO and χ² both select more than twice the number of true variables, and for the Random Forest the number of variables selected was chosen beforehand. The classification methods used next can narrow down the possibilities.

5.3 Classification Method Results

After the variable selection methods have been applied, the seven data analysis methods are run on each set of remaining variables. For each classification method, first any tuning parameters are estimated. Then a model is built and the 10-fold Cross Validation error is calculated. Finally the model is used with the predict() function [35] and the test set to calculate the classification test error of the model.

5.3.1 Method Details

For Logistic Regression, the R function glm() [13] is used to set up a generalized linear model of the binomial family, which equates to logistic regression. Then the model is run through the function cv.glm() [7] to find the 10-fold CV error of

the model. The list of variables from the model that are significant at the 0.05 level is compiled using the output of the glm object, and the test error is calculated. Linear Discriminant Analysis uses the lda() function [40] to build the model. There is no built-in function that will run a 10-fold CV on the LDA model, so the error is calculated manually. Then the top twenty variables are determined based on the absolute value of the LDA coefficients and the test error is found. The Random Forest is created using the randomForest() function [26], and then the rfcv() function [26] gives the 10-fold CV error. Then the top twenty variables based on decrease in accuracy are listed and the test error calculated. The three Support Vector Machine models have tuning parameters that need to be estimated for the models. Cost is used in all types, γ for the radial and polynomial kernels, and degree for only the polynomial. Cost values tested are {0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10, 100}, γ values are {0.005, 0.01, 0.05, 0.1, 0.5, 1.0}, and degree values are {2, 3}. The function tune.svm() [11] is used to run 10-fold CV to estimate the parameters. Then an SVM model with the selected parameters is built using the svm() function and the test error is found. K-Nearest Neighbor has one tuning parameter to calculate before the model is run, K. The function tune.knn() [11] is used to run 10-fold CV to find the best value for K. Then the knn() function [40] builds a model with the chosen K, and the test error is calculated.
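A sketch of the tuning calls just described, with the grids from the text (e1071's tune.svm() and tune.knn() [11] both run 10-fold CV over every parameter combination; data objects are placeholders):

    library(e1071)

    costs  <- c(0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100)
    gammas <- c(0.005, 0.01, 0.05, 0.1, 0.5, 1)

    svm.tune <- tune.svm(y ~ ., data = train, kernel = "radial",
                         cost = costs, gamma = gammas)
    svm.tune$best.parameters  # CV-chosen cost and gamma

    knn.tune <- tune.knn(x = x.train, y = factor(y.train), k = 1:30)
    knn.tune$best.parameters  # CV-chosen K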

Figure 5.5: Top nineteen SNPs by variable importance for LDA and Random Forest in the LASSO Sparse scenario.

5.3.2 Comparisons Within Variable Selection Method

For the LASSO Sparse scenario, the methods all have very similar test errors, 28-30%. Logistic Classification and the Linear SVM are the lowest with 28.25% classification error and KNN the highest at 30.00%. Logistic Classification, LDA, and Random Forest all select the five true variables, but they all include extra variables as well. For Logistic Classification, V22 and V172 both have p-values under 0.05, but the five true variables all have smaller p-values still. For the LDA and Random Forest, the top twenty variables were considered, so there are an extra fifteen variables listed in each. Figure 5.5 shows the coefficient values from the models. There is a large gap in the coefficient values after the first five for LDA,

but the separation is not as clear in the Random Forest. LDA also chooses V22 and V172 as the next two most important variables, like Logistic Classification, but Random Forest has V416 and V99 instead. The full results for the LASSO Sparse scenario are given in Table 5.1. The LASSO Dense scenario has a slightly larger range of test classification errors, from 16.00% to 20.50%. Three of the methods have the 16.00% error: Logistic Classification, Linear SVM, and Radial SVM. The highest error is from the KNN model. The Logistic Classification model selects 41 total variables, 27 true variables and 14 extra, missing V306, V412, and V510. In this set, the p-values are more mixed than in the Sparse scenario: the extra variable V525 has a p-value very close to that of the true variable V505. The coefficient values for the LDA and Random Forest are shown in Figure 5.6. The LDA model chooses all true variables for the top twenty, but the Random Forest mixes together 16 true variables with 4 others: V98, V398, V507, V512. Table 5.2 gives the details of the LASSO Dense analysis. The LASSO Interaction models range from 17.75% to 25.50% test classification error. Three models have the 17.75% error: LDA, Random Forest, and Linear SVM; the highest error is from the KNN model. Logistic Classification only finds 23 true variables out of 32 variables picked. The p-values are also mixed; for example, the extra variable V390 has a p-value comparable to that of the true variable V405. LDA has one extra variable in the top twenty, V404, but Random Forest

Table 5.1: Details of models created in the LASSO Sparse scenario.

Method                         CV Error   Test Error   Variables or Parameters Chosen
Logistic Classification        17.19%     28.25%       V22 V100 V172 V199 V300 V400 V500
Linear Discriminant Analysis              28.50%       V300 V199 V500 V400 V100 V22 V172 V416 V421 V99 V118 V395 V182 V545 V501 V599 V79 V105 ...
Random Forest                             27.50%       V300 V400 V199 V500 V100 V99 V421 V182 V501 V545 V172 V393 V22 V599 V79 V105 V118 ...
SVM: Linear                               28.25%       Cost:
SVM: Radial                               29.75%       Cost: 5.0, γ:
SVM: Polynomial                           29.00%       Cost: 0.1, γ: 0.1, Degree: 2
K-Nearest Neighbor             27.17%     30.00%       K: 29

Table 5.2: Details of models created in the LASSO Dense scenario.

Method                         CV Error   Test Error   Variables or Parameters Chosen
Logistic Classification        9.99%      16.00%       V80 V84 V90 V95 V100 V187 V188 V191 V195 V197 V199 V220 V277 V292 V293 V295 V300 V302 V310 V333 V339 V400 V402 V405 V408 V410 V416 V427 V469 V489 V492 V495 V498 V500 V505 V508 V514 V517 V525 V526 V576
Linear Discriminant Analysis   13.50%     16.25%       V302 V84 V416 V90 V514 V492 V187 V498 V199 V500 V410 V95 V495 V295 V306 V195 V489 V400 V408 V80
Random Forest                  17.37%     19.25%       V84 V95 V400 V416 V410 V90 V98 V100 V408 V398 V508 V507 V510 V514 V495 V302 V306 V512 V500 V505
SVM: Linear                    13.23%     16.00%       Cost: 0.5
SVM: Radial                    13.67%     16.00%       Cost: 1.0, γ:
SVM: Polynomial                13.93%     16.25%       Cost: 0.01, γ: 0.1, Degree: 2
K-Nearest Neighbor             18.80%     20.50%       K: 30

[Figure 5.6: Top twenty SNPs by variable importance for LDA and Random Forest in the LASSO Dense scenario.]

Figure 5.7 shows the graphs of the LDA and Random Forest coefficients. Both graphs show that variable V302 is extremely influential in the model, and no other variables come close, most likely because of all the interaction terms V302 is involved in. Table 5.3 gives the details of the models.

The χ² Sparse methods have approximately the same classification error rates as the LASSO models, 28.75% to 30.75%. Here the lowest error is from the Radial SVM model, and the highest is the Linear SVM. The Logistic Classification model identifies all five true variables along with four extra: V79, V172, V194, V304. There is a large separation of p-values between the two types of variables, similar to the LASSO; the largest p-value among the true variables is for V100 and the smallest among the extra variables is for V172.

[Figure 5.7: Top twenty SNPs by variable importance for LDA and Random Forest in the LASSO Interaction scenario.]

[Figure 5.8: Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Sparse scenario.]

LASSO - INTERACTION

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 8.91% | 18.00% | V80 V84 V90 V95 V137 V160 V188 V197 V247 V295 V300 V302 V310 V358 V390 V400 V404 V405 V412 V416 V483 V489 V495 V496 V498 V500 V503 V506 V508 V510 V514 V526
Linear Discriminant Analysis | 12.13% | 17.75% | V302 V90 V84 V295 V197 V416 V405 V188 V300 V510 V514 V500 V495 V404 V489 V306 V412 V95 V310 V498
Random Forest | 14.50% | 17.75% | V302 V306 V95 V90 V84 V100 V97 V495 V416 V99 V88 V400 V510 V500 V502 V514 V508 V404 V512 V89
Support Vector Machine: Linear | 12.33% | 17.75% | Cost: 0.1
Support Vector Machine: Radial | 11.90% | 18.75% | Cost: 1.0, γ: 0.01
Support Vector Machine: Polynomial | 12.37% | 18.75% | Cost: 0.01, γ: 0.1, Degree: 2
K-Nearest Neighbor | 19.33% | 25.50% | K: 28

Table 5.3: Details of models created in the LASSO Interaction scenario.

[Figure 5.9: Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Dense scenario.]

Figure 5.8 shows the coefficients for LDA and Random Forest. For LDA we have a fairly clear distinction between the five true variables and the others. This Random Forest model has a bit more of a gap than the LASSO Random Forest, but it is not as clear as either LDA model. All three models choose different noise variables to follow the top five. The details of the models' results are given in Table 5.4.

The χ² Dense models all have test classification error between 17.00% and 21.00%, with the Polynomial SVM the lowest and KNN the highest. The Logistic Classification model chooses about the same number of variables as the LASSO: 38 total variables, with 25 true and 13 extra. The p-values do not show any clear distinction; the true variable V405 has a higher p-value, 0.040, than the noise variable V525.

χ² - SPARSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 17.92% | 30.25% | V79 V100 V172 V194 V199 V300 V304 V400 V500
Linear Discriminant Analysis | 27.00% | 30.25% | V300 V199 V500 V100 V400 V173 V172 V420 V97 V304 V418 V424 V508 V510 V410 V402 V412 V174 V405 …
Random Forest | 26.37% | 30.00% | V300 V199 V500 V400 V100 V416 V99 V98 V502 V398 V304 V408 V410 V97 V501 V95 V421 V90 V493 V495
Support Vector Machine: Linear | 26.43% | 30.75% | Cost: …
Support Vector Machine: Radial | 26.97% | 28.75% | Cost: 1.0, γ: …
Support Vector Machine: Polynomial | 26.87% | 29.00% | Cost: 0.01, γ: 0.1, Degree: 2
K-Nearest Neighbor | 31.37% | 30.50% | K: 23

Table 5.4: Details of models created in the χ² Sparse scenario.

[Figure 5.10: Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Interaction scenario.]

Figure 5.9 shows the LDA and Random Forest variable coefficient values. LDA is able to identify 17 true variables in the top twenty, with V509, V506, and V409 scattered throughout. The coefficient values appear to be separated into distinct layers on the graph, almost like steps, whereas the other graphs tend to have only one gap. The Random Forest model is only able to select 11 true variables out of the top twenty, compared to the 16 that the LASSO version finds. Table 5.5 gives the details of the models.

The test classification errors from the χ² Interaction scenario models are all within 1.00% of each other, except for the KNN model. The Linear and Radial SVM models have 15.50% error; Logistic Classification, LDA, and Random Forest have 16.50% error; and the KNN model has 21% error. Logistic Classification selects only 20 variables in total: 16 true variables plus V301, V313, V401, and V568.

χ² - DENSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 10.94% | 17.25% | V75 V84 V90 V95 V100 V187 V188 V191 V197 V199 V293 V295 V300 V302 V310 V313 V333 V400 V402 V405 V410 V416 V427 V467 V489 V493 V494 V495 V498 V500 V503 V505 V508 V509 V514 V525 V526 V568
Linear Discriminant Analysis | 15.43% | 17.75% | V84 V302 V509 V506 V405 V498 V416 V90 V495 V187 V95 V199 V310 V409 V508 V505 V306 V400 V100 V410
Random Forest | 18.47% | 20.75% | V416 V84 V95 V400 V410 V90 V408 V98 V100 V97 V404 V508 V88 V398 V514 V99 V507 V510 V415 V512
Support Vector Machine: Linear | 14.90% | 18.00% | Cost: 0.01
Support Vector Machine: Radial | 15.30% | 17.25% | Cost: 0.5, γ: 0.01
Support Vector Machine: Polynomial | 15.33% | 17.00% | Cost: 0.01, γ: 0.1, Degree: 2
K-Nearest Neighbor | 20.27% | 21.00% | K: 23

Table 5.5: Details of models created in the χ² Dense scenario.

[Figure 5.11: Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Sparse scenario.]

As with the dense model, the p-values do not show clear separation; the p-values of the true variable V510 and the noise variable V568 are comparable. LDA has 17 true variables in its top twenty, while Random Forest has 14. Both are pictured in Figure 5.10. All three models have different noise variables present: LDA has V85, V83, and V301, and Random Forest has V88, V98, V97, V99, V493, and V507. Both graphs show V302 as being very important in the models, just as the LASSO Interaction models did. The details of the models are given in Table 5.6.

The test classification error range for the Random Forest Variable Selection Sparse scenario is slightly larger than for the other two variable selection methods, from 28.25% for the Polynomial SVM to 34% for the KNN. This Logistic Classification model chooses the most noise variables of the three sparse models: 9 extra variables, compared to 4 with χ² and 2 with the LASSO.

χ² - INTERACTION

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 10.33% | 16.50% | V84 V90 V95 V188 V295 V300 V301 V302 V313 V401 V404 V416 V489 V495 V498 V500 V508 V510 V514 V568
Linear Discriminant Analysis | 13.57% | 16.50% | V302 V90 V188 V416 V84 V85 V295 V510 V300 V514 V500 V95 V495 V83 V489 V498 V508 V306 V404 V301
Random Forest | 15.17% | 16.50% | V302 V306 V95 V90 V100 V84 V88 V416 V98 V97 V99 V400 V495 V510 V404 V500 V508 V493 V514 V507
Support Vector Machine: Linear | 14.10% | 15.50% | Cost: …
Support Vector Machine: Radial | 13.93% | 15.50% | Cost: 5.0, γ: …
Support Vector Machine: Polynomial | 13.87% | 15.75% | Cost: 0.1, γ: 0.05, Degree: 2
K-Nearest Neighbor | 19.87% | 21.00% | K: 25

Table 5.6: Details of models created in the χ² Interaction scenario.

[Figure 5.12: Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Dense scenario.]

There is still a separation between the true variables and the noise variables of a factor of approximately 10^5. LDA and Random Forest, plotted in Figure 5.11, both choose the five true variables as the top five out of twenty, but again only the LDA model appears to make a clear gap between the variable types. Table 5.7 shows that the noise variables of LDA and Random Forest are quite different, but the extra LDA variables appear to align with the Logistic Classification noise variables.

For the Random Forest Variable Selection Dense scenario, the test classification error ranges from 15.25% to 21.00%. The Polynomial SVM has the lowest error, while the Random Forest has the largest. Logistic Classification chooses a total of 33 variables, 22 true and 11 noise variables, the lowest ratio and lowest total number of the three variable selection models.

RANDOM FOREST - SPARSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 17.97% | 29.00% | V100 V194 V199 V273 V300 V304 V379 V387 V393 V400 V421 V500 V562 …
Linear Discriminant Analysis | 26.70% | 28.75% | V300 V199 V500 V100 V400 V273 V304 V562 V97 V508 V402 V159 V414 V379 V387 V412 V510 V404 V249 …
Random Forest | 26.57% | 31.00% | V300 V199 V500 V400 V100 V416 V99 V502 V304 V98 V398 V410 V495 V404 V95 V408 V84 V421 V188 …
Support Vector Machine: Linear | 26.03% | 29.00% | Cost: …
Support Vector Machine: Radial | 26.50% | 29.00% | Cost: 1.0, γ: …
Support Vector Machine: Polynomial | 27.17% | 28.25% | Cost: 0.01, γ: 0.1, Degree: 2
K-Nearest Neighbor | 31.60% | 34.00% | K: 21

Table 5.7: Details of models created in the Random Forest Sparse scenario.

[Figure 5.13: Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Interaction scenario.]

The p-values between the true and noise variables are also mixed; the true variable V408 and the noise variable V421 have comparable p-values. Figure 5.12 shows the comparisons of LDA and Random Forest. LDA has 18 true variables in its top twenty, with the extra variables V509 and V506 taking quite high spots on the list. The Random Forest model chooses only 14 true variables out of the top twenty, with noise variables showing up in the bottom two thirds of the list. The full details of the models are given in Table 5.8.

And finally, the Random Forest Variable Selection Interaction scenario models have test classification errors from 15.75% to 22.00%. Both the Radial SVM and the LDA models have the low error, while KNN again has the highest error. Logistic Classification lists 25 variables as important, but only 19 are true variables.

RANDOM FOREST - DENSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 10.75% | 16.00% | V15 V75 V84 V90 V95 V98 V100 V187 V188 V195 V199 V292 V300 V302 V309 V400 V402 V408 V410 V416 V421 V466 V489 V492 V493 V495 V498 V500 V505 … V509 V514 …
Linear Discriminant Analysis | 15.07% | 17.00% | V302 V84 V509 V416 V495 V506 V187 V90 V498 V95 V492 V410 V100 V400 V505 V514 … V489 V199 V188
Random Forest | 18.43% | 21.00% | V84 V410 V416 V95 V90 V400 V408 V514 V98 V507 V100 V88 V508 V404 V398 V97 V302 V99 V415 V510
Support Vector Machine: Linear | 15.03% | 15.75% | Cost: 0.5
Support Vector Machine: Radial | 14.90% | 17.00% | Cost: 1.0, γ: …
Support Vector Machine: Polynomial | 14.97% | 15.25% | Cost: 0.1, γ: 0.05, Degree: 2
K-Nearest Neighbor | 20.10% | 19.00% | K: 25

Table 5.8: Details of models created in the Random Forest Dense scenario.

The p-values are mixed once again, with V306 having a p-value of 0.041, which is more than that of the noise variable V23. LDA has 18 true variables in its top twenty, with the extra variables V85 and V83. Random Forest has true variables in the top six spots, but mixes in 6 noise variables after that. As with the two other Interaction setups, the variable V302 is found to be much stronger than any other variable, as seen in Figure 5.13. Table 5.9 gives the details of the models.

For all the LASSO models, all methods but KNN have approximately the same test classification error to within ±1%. In Figure 5.14 the ROC curves are shown for the different scenarios. In the Sparse case the curves are practically indistinguishable, and the AUC values are all very close. In the Dense scenario, two models fall slightly below the others, Random Forest and KNN; the AUC for these two models is about .03 less than the others. For the Interaction scenario, KNN clearly struggles to match the other methods in accuracy, and the Random Forest curve is missing some key areas as well. Overall, the Linear SVM appears to have the best predictive power in the LASSO scenarios, but the LDA models are better at finding the most important variables.

For the χ² models, all the test classification errors are very similar, except for the KNN error in the Interaction scenario. The ROC curves are given in Figure 5.15 for the three scenarios. Though KNN does not have the highest test error in the Sparse scenario, it does have the lowest AUC value.

RANDOM FOREST - INTERACTION

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 9.60% | 17.00% | V23 V84 V85 V90 V95 V188 V295 V300 V302 V306 V310 V373 V388 V404 V412 V416 V437 V489 V495 V498 V500 V503 V508 V510 V514
Linear Discriminant Analysis | 13.17% | 15.75% | V302 V90 V84 V416 V510 V85 V295 V188 V300 V405 V514 V500 V83 V495 V95 V498 V412 V508 V306 V489
Random Forest | 15.33% | 17.25% | V302 V306 V95 V90 V416 V84 V97 V495 V99 V100 V88 V98 V400 V514 V404 V510 V500 V507 V502 V493
Support Vector Machine: Linear | 13.27% | 16.25% | Cost: 0.5
Support Vector Machine: Radial | 13.23% | 15.75% | Cost: 5.0, γ: …
Support Vector Machine: Polynomial | 13.23% | 16.25% | Cost: 0.001, γ: 0.5, Degree: 2
K-Nearest Neighbor | 20.27% | 22.00% | K: 25

Table 5.9: Details of models created in the Random Forest Interaction scenario.

[Figure 5.14: ROC curves for LASSO scenarios (Sparse, Dense, and Interaction panels, with per-method AUC values in the legends).]

The Linear SVM had the highest error, but it has only slightly lower AUC than the other methods. In the Dense scenario, it appears again that the Random Forest and KNN models do not predict as well as the others, which matches the test error values. Finally, for the Interaction case, KNN is much lower than the other ROC curves, with an AUC value .05 less than the best performing model, the Polynomial SVM. The Radial and Polynomial SVM models appear to work best with the χ² variable selection method. For variable selection, Random Forest is the most accurate when looking for the top five to seven most important variables, though LDA is better at finding the most out of twenty.

Once again, within the Random Forest Variable Selection scenarios, all the test classification errors are very close, except for KNN and the Random Forest. Looking at the ROC curves in Figure 5.16, it is clear in the Sparse scenario that KNN falls short of the rest of the models. In the Dense scenario, both the KNN and Random Forest curves fall below the others, which is consistent with the test classification errors. In the final Interaction scenario, KNN's AUC value is more than .06 less than the best curve, the Polynomial SVM. Overall, the SVM models, either the Radial or the Polynomial, appear to work best with the Random Forest Variable Selection for prediction. For variable selection, the situation is the same as with χ²: Random Forest is best for finding the top few variables, but if searching for a larger number, LDA collects more true variables. Most often Random Forest and LDA pick different noise variables, so a comparison of the two might improve the accuracy of the variable selection if more than the top twenty terms were compared.
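For reference, an ROC curve and its AUC can be computed directly from predicted class probabilities. The base-R helpers below are a sketch under assumed inputs (`prob`, a vector of predicted probabilities, and `y`, the true 0/1 labels), not the plotting code used for the figures:

```r
# Sweep the classification threshold to trace the ROC curve, then
# integrate it with the trapezoid rule to obtain the AUC.
roc_points <- function(prob, y) {
  cuts <- sort(unique(prob), decreasing = TRUE)
  tpr  <- sapply(cuts, function(t) mean(prob[y == 1] >= t))  # sensitivity
  fpr  <- sapply(cuts, function(t) mean(prob[y == 0] >= t))  # 1 - specificity
  data.frame(fpr = c(0, fpr, 1), tpr = c(0, tpr, 1))
}
auc <- function(prob, y) {
  r <- roc_points(prob, y)
  sum(diff(r$fpr) * (head(r$tpr, -1) + tail(r$tpr, -1)) / 2)
}
```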

[Figure 5.15: ROC curves for χ² scenarios (Sparse, Dense, and Interaction panels, with per-method AUC values in the legends).]


[Figure 5.16: ROC curves for Random Forest scenarios (Sparse, Dense, and Interaction panels, with per-method AUC values in the legends).]

Table 5.10: Logistic Classification Summary

Type | CV Error | Test Error | Proportion of True Variables Found | AUC
NS - Sparse | 21.48% | 33.25% | 5/5 | …
LASSO - Sparse | 17.19% | 28.25% | 5/5 | …
χ² - Sparse | 17.92% | 30.25% | 5/5 | …
RF - Sparse | 17.97% | 29.00% | 5/5 | …
NS - Dense | 13.97% | 17.50% | 27/… | …
LASSO - Dense | 9.99% | 16.00% | 27/… | …
χ² - Dense | 10.94% | 17.25% | 25/… | …
RF - Dense | 10.75% | 16.00% | 22/… | …
NS - Inter | 13.84% | 17.50% | 23/… | …
LASSO - Inter | 8.91% | 18.00% | 23/… | …
χ² - Inter | 10.33% | 16.50% | 16/… | …
RF - Inter | 9.60% | 17.00% | 16/… | …

Comparisons Across Variable Selection Method

In this section, the performance of the different classification methods will be compared over all three variable selection methods and a fourth set where no variable selection was performed. The details of the No Selection models are given in Appendix C.

Logistic Classification is a very simple and powerful model that works decently across all situations. Unfortunately, all the models appear to have a large difference between CV error and actual test error; on average the error increases by 10%. Table 5.10 shows the details of how the variable selection methods affect the Logistic Classification model. The Cross-Validation error improves quite a bit from No Selection to the others, but the test error stays relatively similar.
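The CV errors in these summary tables come from k-fold cross-validation. The sketch below shows the general shape of that computation for the logistic model, assuming a data frame `train` with a 0/1 response `y` and a selected-SNP vector `keep`; the thesis' own fold counts and random seeds are not reproduced here:

```r
# Minimal sketch of 10-fold cross-validated classification error.
k    <- 10
fold <- sample(rep(1:k, length.out = nrow(train)))   # random fold labels
cv_err <- sapply(1:k, function(i) {
  fit  <- glm(y ~ ., data = train[fold != i, c("y", keep)],
              family = binomial)
  prob <- predict(fit, newdata = train[fold == i, ], type = "response")
  mean((prob > 0.5) != train$y[fold == i])           # held-out fold error
})
mean(cv_err)   # CV estimate of the classification error
```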

Table 5.11: Linear Discriminant Analysis Summary

Type | CV Error | Test Error | Proportion of True Variables Found | AUC
NS - Sparse | 31.43% | 32.75% | 5/5 | …
LASSO - Sparse | 25.10% | 28.50% | 5/5 | …
χ² - Sparse | 27.00% | 30.25% | 5/5 | …
RF - Sparse | 26.70% | 28.75% | 5/5 | …
NS - Dense | 17.60% | 17.50% | 17/20 | …
LASSO - Dense | 13.50% | 16.25% | 20/20 | …
χ² - Dense | 15.43% | 17.75% | 17/20 | …
RF - Dense | 15.07% | 17.00% | 18/20 | …
NS - Inter | 15.83% | 18.50% | 14/20 | …
LASSO - Inter | 12.13% | 17.75% | 19/20 | …
χ² - Inter | 13.57% | 16.50% | 17/20 | …
RF - Inter | 13.17% | 15.75% | 14/20 | …

The proportion of variables found stays the same or even decreases, most likely due to the variable selection method missing variables. The AUC of the models does increase with all of the methods; it appears to increase the most with the LASSO for the Sparse and Dense scenarios, and with Random Forest for the Interaction. Overall, Logistic Regression was not much improved by the variable selection methods.

The Linear Discriminant Analysis models are summarized in Table 5.11. LDA has much higher CV classification error than Logistic Classification, but the error increases much less from CV to test, so the methods end up looking very similar.
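A hedged sketch of these LDA fits follows, using MASS::lda; the response `y`, the selected columns `keep`, and the data frames are assumed names, and the coefficient ranking only loosely mirrors the importance orderings plotted in the figures:

```r
# Minimal sketch: LDA on the selected SNPs, ranking variables by the
# magnitude of their discriminant coefficients.
library(MASS)
train$y <- factor(train$y); test$y <- factor(test$y)
ld <- lda(y ~ ., data = train[, c("y", keep)])
sort(abs(ld$scaling[, 1]), decreasing = TRUE)   # coefficient magnitudes
pred <- predict(ld, newdata = test)$class
mean(pred != test$y)                            # test classification error
```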

Table 5.12: Random Forest Summary

Type | CV Error | Test Error | Proportion of True Variables Found | AUC
NS - Sparse | 27.63% | 28.00% | 5/5 | …
LASSO - Sparse | 27.47% | 27.50% | 5/5 | …
χ² - Sparse | 26.37% | 30.00% | 5/5 | …
RF - Sparse | 26.57% | 31.00% | 5/5 | …
NS - Dense | 18.83% | 20.25% | 14/20 | …
LASSO - Dense | 17.37% | 19.25% | 16/20 | …
χ² - Dense | 18.47% | 20.75% | 12/20 | …
RF - Dense | 18.43% | 21.00% | 13/20 | …
NS - Inter | 15.73% | 18.50% | 15/20 | …
LASSO - Inter | 14.50% | 17.75% | 13/20 | …
χ² - Inter | 15.17% | 16.50% | 14/20 | …
RF - Inter | 15.33% | 17.25% | 14/20 | …

In the No Selection case, the errors for the Sparse and Dense scenarios barely change, and the No Selection errors appear comparable to the other variable selection errors. Since only the top twenty variables of LDA were chosen, the proportion of variables is out of twenty for the Dense and Interaction scenarios. Here we do see an improvement with the use of the LASSO; the Dense scenario gains 3 variables and the Interaction scenario gains 5. It appears that LDA and the LASSO work well together. All of the methods also show an increase in AUC over the No Selection models, though it varies in each scenario which model has the highest AUC.

Table 5.12 details the results of the Random Forest classification. The CV classification error approximates the test error well, unlike Logistic Classification.
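A minimal sketch of the Random Forest fit follows, with the randomForest package; the out-of-bag (OOB) error is the internal analogue of the CV error, and MeanDecreaseAccuracy is the importance score plotted in the figures. The data frame `train` is an assumed name:

```r
# Minimal sketch: Random Forest classification with OOB error and
# permutation importance (MeanDecreaseAccuracy).
library(randomForest)
train$y <- factor(train$y)
rf <- randomForest(y ~ ., data = train, importance = TRUE)
rf$err.rate[rf$ntree, "OOB"]          # OOB estimate of the error rate
imp <- importance(rf, type = 1)       # type 1 = MeanDecreaseAccuracy
imp[order(imp[, 1], decreasing = TRUE)[1:20], , drop = FALSE]  # top 20 SNPs
```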

Table 5.13: Linear Support Vector Machine Summary

Type | CV Error | Test Error | AUC
NS - Sparse | 30.43% | 30.00% | …
LASSO - Sparse | 25.13% | 28.25% | …
χ² - Sparse | 26.43% | 30.75% | …
RF - Sparse | 26.03% | 29.00% | …
NS - Dense | 15.87% | 16.25% | …
LASSO - Dense | 13.23% | 16.00% | …
χ² - Dense | 14.90% | 18.00% | …
RF - Dense | 15.03% | 15.75% | …
NS - Inter | 15.23% | 19.50% | …
LASSO - Inter | 12.33% | 17.75% | …
χ² - Inter | 14.10% | 15.50% | …
RF - Inter | 13.27% | 16.25% | …

The Sparse and Interaction scenarios have similar test error to the previous two models, but Random Forest has a higher error for the Dense scenario. Like LDA, the Random Forest was limited to a maximum of 20 selected variables, and it usually has fewer variables than LDA. Comparing the No Selection variables to the others, the only improvement is from the LASSO Dense scenario. The AUC does not always increase for the Sparse scenario, but it does increase slightly for all models in the Dense and Interaction scenarios. It would appear the Random Forest classification models did not improve much with the variable selection methods.

The Linear Support Vector Machine model details are compiled in Table 5.13. The CV classification errors and the test errors are similar to the previous models.

Table 5.14: Radial Support Vector Machine Summary

Type | CV Error | Test Error | AUC
NS - Sparse | 29.40% | 27.25% | …
LASSO - Sparse | 24.93% | 29.75% | …
χ² - Sparse | 26.97% | 28.75% | …
RF - Sparse | 26.50% | 29.00% | …
NS - Dense | 16.23% | 16.50% | …
LASSO - Dense | 13.67% | 16.00% | …
χ² - Dense | 15.30% | 17.25% | …
RF - Dense | 14.90% | 17.00% | …
NS - Inter | 14.83% | 18.00% | …
LASSO - Inter | 11.90% | 18.75% | …
χ² - Inter | 13.93% | 15.50% | …
RF - Inter | 13.23% | 15.75% | …

The test errors for the Interaction scenario all decrease compared to the No Selection model, but the Sparse and Dense models vary; the χ² error increases and the others decrease. The same patterns follow in the AUC values. There are no variables shown for the SVM models. The Linear SVM analysis benefits from the LASSO and Random Forest variable selections, but not from χ².

The details of the Radial Support Vector Machine are listed in Table 5.14. There is a slightly larger increase from the CV error to the test error in the Interaction cases, but the final errors are similar to the other methods. In the Sparse scenario, the CV error of the No Selection model decreases by about 2% for the test error, so all the other Sparse test errors are higher. The errors for the Dense and Interaction methods do not appear to follow any patterns.
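The cost, γ, and degree values in the SVM rows come from a grid search. A sketch with e1071's tune.svm follows, under an assumed grid and assumed data frame `train` (the thesis' exact grid is not reproduced; the radial kernel is shown, but the same call handles the linear and polynomial kernels):

```r
# Minimal sketch: choose cost and gamma for the radial-kernel SVM by
# cross-validated grid search.
library(e1071)
train$y <- factor(train$y)
tuned <- tune.svm(y ~ ., data = train, kernel = "radial",
                  cost  = c(0.01, 0.1, 0.5, 1, 5),
                  gamma = c(0.001, 0.01, 0.1))
tuned$best.parameters    # cost/gamma pair with the lowest CV error
tuned$best.performance   # the corresponding cross-validation error
```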

Table 5.15: Polynomial Support Vector Machine Summary

Type | CV Error | Test Error | AUC
NS - Sparse | 29.10% | 27.75% | …
LASSO - Sparse | 25.20% | 29.00% | …
χ² - Sparse | 26.87% | 29.00% | …
RF - Sparse | 27.17% | 28.25% | …
NS - Dense | 15.80% | 16.75% | …
LASSO - Dense | 13.93% | 16.25% | …
χ² - Dense | 15.33% | 17.00% | …
RF - Dense | 14.97% | 15.25% | …
NS - Inter | 15.07% | 17.75% | …
LASSO - Inter | 12.37% | 18.75% | …
χ² - Inter | 13.87% | 15.75% | …
RF - Inter | 13.23% | 16.25% | …

The AUC values are slightly more ordered: in all three scenarios, the LASSO and Random Forest improve the AUC of the No Selection model. With very similar errors and an increase in AUC, it appears that the Radial SVM benefits from the LASSO and Random Forest Variable Selection methods.

The Polynomial Support Vector Machine models are detailed in Table 5.15. The No Selection Sparse error drops, just as it does in the Radial case. In fact, the other errors are almost identical, with less than a 1% difference in each pair. The AUC values are different from the Radial case, with improvements occurring in only the Dense and Interaction scenarios. The Polynomial SVM is also unique in that it was the method that took the longest to run.

Table 5.16: K-Nearest Neighbor Summary

Type | CV Error | Test Error | AUC
NS - Sparse | 36.00% | 35.75% | …
LASSO - Sparse | 27.17% | 30.00% | …
χ² - Sparse | 31.37% | 30.50% | …
RF - Sparse | 31.60% | 34.00% | …
NS - Dense | 22.73% | 23.00% | …
LASSO - Dense | 18.80% | 20.50% | …
χ² - Dense | 20.27% | 21.00% | …
RF - Dense | 20.10% | 19.00% | …
NS - Inter | 24.67% | 28.50% | …
LASSO - Inter | 19.33% | 25.50% | …
χ² - Inter | 19.87% | 21.00% | …
RF - Inter | 20.27% | 22.00% | …

Using the variable selection methods cut down on the processing time considerably. The Random Forest creates small decreases in error rate and increases in AUC for most scenarios, so it appears to be an improvement over the No Selection models for the Polynomial SVM.

Finally, the details of the K-Nearest Neighbor models are given in Table 5.16. Like the SVMs, KNN has no variable selection to explore. KNN almost always had the highest error of all the models in each scenario, but the models do have a relatively small increase between CV error and test error. The KNN models are also the only ones to be positively affected by all the variable selection methods: the error drops in every case, and the AUC value also increases in every case.
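A sketch of the KNN classifier with class::knn follows (assumed inputs as in the earlier sketches; the genotype columns must be numeric, since KNN works on Euclidean distances):

```r
# Minimal sketch: k-nearest-neighbor prediction and its test error.
library(class)
pred <- knn(train = train[, keep], test = test[, keep],
            cl = factor(train$y), k = 25)
mean(pred != factor(test$y))   # test classification error
```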

For the Sparse and Dense scenarios, the LASSO appears to make the biggest improvement, while the χ² method makes the biggest improvement in the Interaction scenario. There is no one variable selection method that improves all the classification analyses, but the LASSO most frequently improves the methods, while χ² rarely does.

Chapter 6

Conclusion

In this thesis, GWAS datasets with different disease models were first simulated, and the effectiveness of multiple statistical learning methods was then thoroughly evaluated and compared. The goal was to give recommendations about which variable selection and classification methods might perform best under a given disease model.

For the Sparse scenario, most models were able to identify the proper variables, but the LASSO variable selection had the most models with better than 29% test classification error. In that case, the Random Forest model had the lowest test error at 27.50%, though not in the other two variable selection cases. For the Dense scenario, the Random Forest Variable Selection had the lowest test classification error rates; in that case and both others, the Polynomial SVM had the lowest error. The Interaction scenario was the toughest for the models to interpret. The χ² variable selection models are just a touch better than the Random Forest Variable Selection.

In both of those variable selection cases, the Radial SVM had the smallest error, while in the LASSO case a few other methods beat the Radial SVM. Overall, the Radial and Polynomial SVMs were very reliable, while the KNN model tended to have relatively high CV and test error.

For future work, I would like to study the details of the methods further, and possibly develop a novel statistical method that could improve the prediction accuracy on GWAS datasets. One change I would like to research is utilizing a better p-value cutoff for the χ² test of independence. The value of 0.05 does not control the family-wise testing error rate well, so many false positive SNPs could enter the classification analysis. Second, I would like to study incorporating environmental impact and gene-environment interaction into the model. And finally, I would consider a more complex model where SNPs contribute in a non-linear way to disease risk.

Appendix A: Detailed Description of Simulation Scenarios

The following model was used for all three scenarios:

$$\operatorname{logit}(p(X_i)) = h(X_i) = \beta_0 + \beta^T X_i + X_i^T A X_i$$

Here $\beta_0$ is an intercept, the other $\beta_j$'s are the model's coefficients, and the $A_{ij}$'s control the interaction terms. The following sets indicate which variables were used in each scenario: the Sparse scenario uses $S_1$, the Dense scenario uses $S_1$ and $S_2$, and the Interaction scenario uses all three sets.

S1 = {100, 200, 300, 400, 500}

S2 = {80, 84, 86, 90, 95, 190, 192, 195, 197, 295, 302, 305, 308, 310, 405, 408, 410, 412, 416, 490, 492, 495, 498, 505, 508, 510, 515}

S3 = {(90,100), (195,200), (410,492), (492,505), (498,510), (80,302), (86,302), (190,302), (400,302), (408,302)}

The coefficients for the three scenarios were set as follows:

Sparse: $\beta_i = \log 2.7$ for $i \in S_1$ and $0$ otherwise; $A_{ij} = 0$.

Dense: $\beta_i = \log 1.9$ for $i \in S_1, S_2$ and $0$ otherwise; $A_{ij} = 0$.

Interaction: $\beta_i = \log 2.5$ for $i \in S_1, S_2$ and $0$ otherwise; $A_{ij} = \log 2.6$ for $(i,j) \in S_3$ and $0$ otherwise.
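As an illustration, disease status can be simulated from this model along the following lines; `G` (an assumed n × 600 genotype matrix coded 0/1/2) and the intercept value are assumptions, and only the Sparse coefficients are shown:

```r
# Minimal sketch: draw case/control phenotypes from the logit model
# h(X) = b0 + beta'X + X'AX (A = 0 in the Sparse scenario).
S1   <- c(100, 200, 300, 400, 500)
beta <- rep(0, ncol(G)); beta[S1] <- log(2.7)
b0   <- -1                          # assumed illustrative intercept
h    <- as.vector(b0 + G %*% beta)  # add the quadratic term for Interaction
p    <- 1 / (1 + exp(-h))           # inverse logit
y    <- rbinom(nrow(G), 1, p)       # simulated disease status
```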

Table 6.1 shows more information about the chromosomes and genes that the SNPs come from.

SNP Locations | Gene name | Chromosome | GeneID | Gene size
… | KR5A | … | … | …
… | GCK | … | … | …
… | NEUROG | … | … | …
… | HNF1A | … | … | …
… | HNF1B | … | … | …

Table 6.1: More data about the chromosomes and genes that the SNPs are from.

Appendix B: Variable Selection Detailed Results

The true variables are marked in red, while the variables marked in blue are neighbors of a true variable not picked up by the analysis. A few, marked in purple, are both true variables and neighbors of missing variables. The variables are listed in order of decreasing coefficient value for the LASSO and Random Forest, so a variable higher on the list would be considered to have more impact in the model than one towards the end. For χ² the results do not have as strong an ordering as the others, so they are listed numerically.
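For orientation, the LASSO lists below can be produced along these lines with the glmnet package; the genotype matrix `X` and 0/1 phenotype `y` are assumed names, and the thesis' exact λ choice may differ:

```r
# Minimal sketch: LASSO-penalized logistic regression; keep the SNPs
# with nonzero coefficients, ordered by absolute coefficient size.
library(glmnet)
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1: LASSO
b     <- coef(cvfit, s = "lambda.min")[-1, 1]             # drop the intercept
kept  <- sort(abs(b[b != 0]), decreasing = TRUE)
names(kept)                                               # selected SNPs
```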

Variables kept by LASSO

Sparse (no true variables missing): V300 V199 V500 V400 V100 V416 V421 V172 V99 V118 V79 V22 V501 V395 V105 V393 V545 V182 V599

Dense (no true variables missing): V302 V84 V416 V95 V410 V498 V514 V400 V90 V508 V495 V492 V500 V100 V505 V199 V489 V187 V408 V188 V402 V310 V80 V195 V300 V510 V197 V98 V507 V191 V468 V405 V295 V512 V333 V525 V306 V526 V220 V509 V339 V304 V427 V576 V469 V293 V520 V517 V598 V531 V523 V144 V352 V583 V398 V412 V230 V503 V292 V184 V277 V29 V592 V329 V313 V221 V365 V425 V240 V466 V564 V173 V146 V588 V309 V126 V446 V519 V590

Interaction (missing variables either one step off or with no neighbors selected): V302 V84 V416 V90 V188 V514 V295 V300 V405 V95 V197 V510 V310 V500 V187 V495 V489 V508 V498 V404 V412 V306 V199 V100 V400 V99 V80 V503 V506 V485 V408 V97 V496 V502 V373 V501 V483 V137 V505 V407 V309 V160 V358 V568 V526 V428 V275 V576 V410 V88 V388 V172 V217 V195 V486 V85 V247 V429 V208 V342 V290 V37 V89 V251 V50 V467 V343 V390 V23 V427 V349 V513 V512 V540 V34 V492 V299 V335 V599 V384 V40 V301 V532 V367 V355 V564 V66 V118 V224 V456 V393 V426 V151 V107 V357 V365 V449 V523 V423 V580

Table 6.2: Variable selection results for LASSO in the Sparse, Dense, and Interaction scenarios.

Variables kept by χ²

Sparse (no true variables missing): V43 V60 V79 V83 V84 V85 V88 V89 V90 V95 V97 V98 V99 V100 V105 V141 V156 V159 V172 V173 V174 V180 V182 V183 V187 V188 V193 V194 V195 V199 V220 V292 V299 V300 V301 V304 V309 V310 V311 V314 V320 V321 V322 V323 V324 V326 V327 V329 V331 V338 V373 V396 V397 V398 V400 V401 V402 V403 V404 V405 V408 V410 V412 V414 V415 V416 V417 V418 V419 V420 V421 V422 V424 V481 V492 V495 V499 V500 V501 V502 V503 V504 V507 V508 V510 V512 V513 V514 V518 V541 V547 V580

Table 6.3: Variable selection results for χ² in the Sparse scenario.
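The χ² lists are produced by a per-SNP test of independence between genotype counts and case/control status. A base-R sketch, under the same assumed `X` and `y` as above:

```r
# Minimal sketch: chi-squared screening of each SNP at the 0.05 cutoff.
# Each SNP's 0/1/2 genotype counts are cross-tabulated against y.
pvals <- apply(X, 2, function(snp) chisq.test(table(snp, y))$p.value)
kept  <- names(pvals)[pvals < 0.05]   # SNPs passing the cutoff
```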

Table 6.4: Variable selection results for χ² in the Dense and Interaction scenarios.

Variables kept by Random Forest

Sparse (no true variables missing): V300 V199 V400 V500 V100 V416 V99 V304 V502 V398 V410 V98 V415 V404 V421 V301 V495 V501 V95 V97 V493 V188 V408 V194 V401 V88 V309 V84 V310 V79 V570 V183 V326 V412 V503 V293 V294 V141 V298 V219 V146 V90 V599 V367 V531 V108 V519 V532 V255 V43 V349 V390 V107 V335 V105 V516 V217 V446 V89 V486 V182 V425 V299 V161 V514 V137 V263 V426 V270 V180 V485 V483 V352 V80 V373 V314 V13 V393 V586 V357 V394 V395 V369 V545 V427 V489 V513 V172 V333 V360 V248 V520 V512 V543 V511 V441 V464 V578 V327 V508

Dense (missing variables either one step off or with no neighbors selected): V95 V84 V90 V416 V98 V97 V410 V88 V100 V400 V99 V404 V408 V398 V195 V510 V495 V514 V507 V415 V508 V512 V197 V493 V89 V85 V502 V500 V302 V501 V405 V83 V498 V306 V516 V513 V505 V79 V492 V520 V503 V199 V188 V421 V499 V504 V511 V304 V523 V401 V300 V486 V187 V295 V506 V509 V489 V75 V194 V522 V519 V412 V309 V352 V468 V578 V80 V586 V485 V263 V107 V367 V255 V217 V429 V310 V314 V466 V265 V360 V342 V137 V583 V532 V43 V333 V588 V390 V594 V173 V37 V209 V369 V579 V441 V146 V313 V397 V13 V108 V34 V106 V454 V105 V395

Table 6.5: Variable selection results for Random Forest in the Sparse and Dense scenarios.

Variables kept by Random Forest

Interaction (missing variables either one step off or with no neighbors selected): V302 V306 V95 V90 V97 V98 V84 V99 V495 V400 V507 V493 V80 V510 V500 V416 V508 V502 V514 V89 V187 V321 V512 V404 V410 V501 V85 V304 V295 V83 V408 V503 V188 V499 V398 V504 V405 V415 V486 V79 V511 V300 V516 V513 V492 V421 V485 V489 V520 V313 V326 V498 V137 V75 V412 V532 V106 V203 V61 V195 V50 V453 V209 V586 V294 V46 V505 V197 V324 V108 V360 V40 V333 V34 V483 V199 V310 V107 V599 V576 V342 V441 V588 V314 V337 V118 V73 V523 V219 V467 V208 V594 V522 V519 V402 V265 V41 V390 V217 V277 V299 V425 V451 V355 V570

Table 6.6: Variable selection results for Random Forest in the Interaction scenario.

Appendix C: No Selection Model Details

Table 6.7: NO SELECTION - SPARSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 21.48% | 33.25% | V22 V24 V33 V100 V120 V125 V132 V163 V169 V172 V194 V197 V199 V231 V269 V292 V293 V295 V300 V301 V304 V365 V380 V393 V400 V402 V407 V412 V451 V453 V461 V471 V496 V500 V529 V544 V545 V573 V580 V581 V589
Linear Discriminant Analysis | 31.43% | 32.75% | V300 V199 V500 V100 V400 V120 V133 V156 V407 V273 V130 V553 V22 V529 V409 V569 V380 V169 V472 V306
Random Forest | 27.63% | 28.00% | V300 V199 V400 V500 V100 V416 V99 V502 V98 V304 V410 V398 V415 V404 V95 V408 V401 V495 V421 V501
Support Vector Machine: Linear | 30.43% | 30.00% | Cost: 0.01
Support Vector Machine: Radial | 29.40% | 27.25% | Cost: 1.0, γ: …
Support Vector Machine: Polynomial | 29.10% | 27.75% | Cost: 0.001, γ: 0.1, Degree: 2
K-Nearest Neighbor | 36.00% | 35.75% | K: 30

Table 6.8: NO SELECTION - DENSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 13.97% | 17.50% | V15 V29 V84 V90 V95 V98 V100 V115 V117 V144 V145 V184 V187 V188 V191 V193 V195 V197 V199 V202 V209 V211 V220 V229 V248 V270 V277 V293 V295 V300 V302 V310 V313 V322 V327 V333 V339 V355 V400 V402 V405 V410 V411 V416 V421 V458 V489 V492 V493 V495 V498 V500 V503 V505 V506 V508 V509 V514 V526 V534 V542 V550 V576 V598
Linear Discriminant Analysis | 17.60% | 17.50% | V84 V302 V15 V506 V509 V416 V405 V264 V495 V90 V498 V492 V295 V514 V187 V399 V306 V505 V115 V409
Random Forest | 18.83% | 20.25% | V95 V84 V400 V416 V410 V98 V90 V408 V88 V100 V97 V404 V398 V99 V415 V508 V514 V510 V507 V495
Support Vector Machine: Linear | 15.87% | 16.25% | Cost: 0.01
Support Vector Machine: Radial | 16.23% | 16.50% | Cost: 1.0, γ: …
Support Vector Machine: Polynomial | 15.80% | 16.75% | Cost: 0.001, γ: 0.1, Degree: 2
K-Nearest Neighbor | 22.73% | 23.00% | K: 29

Table 6.9: NO SELECTION - INTERACTION

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 13.84% | 17.50% | V40 V59 V66 V71 V72 V80 V84 V85 V90 V95 V137 V147 V155 V160 V172 V183 V188 V197 V217 V221 V251 V295 V300 V302 V310 V320 V373 V400 V404 V405 V412 V416 V419 V428 V442 V447 V454 V456 V483 V489 V492 V495 V496 V498 V500 V506 V508 V510 V514 V526 V568 V571 V576 V582 V591 V599
Linear Discriminant Analysis | 15.83% | 18.50% | V302 V90 V554 V428 V295 V84 V510 V188 V197 V85 V553 V416 V506 V405 V437 V568 V300 V262 V500 V514
Random Forest | 15.73% | 18.50% | V302 V306 V95 V90 V84 V97 V88 V100 V98 V495 V99 V510 V493 V416 V400 V514 V507 V404 V508 V410
Support Vector Machine: Linear | 15.27% | 19.50% | Cost: 0.01
Support Vector Machine: Radial | 14.83% | 18.00% | Cost: 5.0, γ: …
Support Vector Machine: Polynomial | 15.07% | 17.75% | Cost: 0.001, γ: 0.1, Degree: 2
K-Nearest Neighbor | 24.67% | 28.50% | K: 28



Introduction to Quantitative Genetics II: Resemblance Between Relatives Intrductin t Quantitative Genetics II: Resemblance Between Relatives Bruce Walsh 8 Nvember 006 EEB 600A The heritability f a trait, a central cncept in quantitative genetics, is the prprtin f variatin

More information

The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition

The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition The Kullback-Leibler Kernel as a Framewrk fr Discriminant and Lcalized Representatins fr Visual Recgnitin Nun Vascncels Purdy H Pedr Mren ECE Department University f Califrnia, San Dieg HP Labs Cambridge

More information

Chemistry 20 Lesson 11 Electronegativity, Polarity and Shapes

Chemistry 20 Lesson 11 Electronegativity, Polarity and Shapes Chemistry 20 Lessn 11 Electrnegativity, Plarity and Shapes In ur previus wrk we learned why atms frm cvalent bnds and hw t draw the resulting rganizatin f atms. In this lessn we will learn (a) hw the cmbinatin

More information

Support Vector Machines and Flexible Discriminants

Support Vector Machines and Flexible Discriminants 12 Supprt Vectr Machines and Flexible Discriminants This is page 417 Printer: Opaque this 12.1 Intrductin In this chapter we describe generalizatins f linear decisin bundaries fr classificatin. Optimal

More information

NUMBERS, MATHEMATICS AND EQUATIONS

NUMBERS, MATHEMATICS AND EQUATIONS AUSTRALIAN CURRICULUM PHYSICS GETTING STARTED WITH PHYSICS NUMBERS, MATHEMATICS AND EQUATIONS An integral part t the understanding f ur physical wrld is the use f mathematical mdels which can be used t

More information

Lead/Lag Compensator Frequency Domain Properties and Design Methods

Lead/Lag Compensator Frequency Domain Properties and Design Methods Lectures 6 and 7 Lead/Lag Cmpensatr Frequency Dmain Prperties and Design Methds Definitin Cnsider the cmpensatr (ie cntrller Fr, it is called a lag cmpensatr s K Fr s, it is called a lead cmpensatr Ntatin

More information

Hypothesis Tests for One Population Mean

Hypothesis Tests for One Population Mean Hypthesis Tests fr One Ppulatin Mean Chapter 9 Ala Abdelbaki Objective Objective: T estimate the value f ne ppulatin mean Inferential statistics using statistics in rder t estimate parameters We will be

More information

7 TH GRADE MATH STANDARDS

7 TH GRADE MATH STANDARDS ALGEBRA STANDARDS Gal 1: Students will use the language f algebra t explre, describe, represent, and analyze number expressins and relatins 7 TH GRADE MATH STANDARDS 7.M.1.1: (Cmprehensin) Select, use,

More information

Determining the Accuracy of Modal Parameter Estimation Methods

Determining the Accuracy of Modal Parameter Estimation Methods Determining the Accuracy f Mdal Parameter Estimatin Methds by Michael Lee Ph.D., P.E. & Mar Richardsn Ph.D. Structural Measurement Systems Milpitas, CA Abstract The mst cmmn type f mdal testing system

More information

Differentiation Applications 1: Related Rates

Differentiation Applications 1: Related Rates Differentiatin Applicatins 1: Related Rates 151 Differentiatin Applicatins 1: Related Rates Mdel 1: Sliding Ladder 10 ladder y 10 ladder 10 ladder A 10 ft ladder is leaning against a wall when the bttm

More information

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression 4th Indian Institute f Astrphysics - PennState Astrstatistics Schl July, 2013 Vainu Bappu Observatry, Kavalur Crrelatin and Regressin Rahul Ry Indian Statistical Institute, Delhi. Crrelatin Cnsider a tw

More information

MODULE 1. e x + c. [You can t separate a demominator, but you can divide a single denominator into each numerator term] a + b a(a + b)+1 = a + b

MODULE 1. e x + c. [You can t separate a demominator, but you can divide a single denominator into each numerator term] a + b a(a + b)+1 = a + b . REVIEW OF SOME BASIC ALGEBRA MODULE () Slving Equatins Yu shuld be able t slve fr x: a + b = c a d + e x + c and get x = e(ba +) b(c a) d(ba +) c Cmmn mistakes and strategies:. a b + c a b + a c, but

More information

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9. Sectin 7 Mdel Assessment This sectin is based n Stck and Watsn s Chapter 9. Internal vs. external validity Internal validity refers t whether the analysis is valid fr the ppulatin and sample being studied.

More information

Guide to Using the Rubric to Score the Klf4 PREBUILD Model for Science Olympiad National Competitions

Guide to Using the Rubric to Score the Klf4 PREBUILD Model for Science Olympiad National Competitions Guide t Using the Rubric t Scre the Klf4 PREBUILD Mdel fr Science Olympiad 2010-2011 Natinal Cmpetitins These instructins are t help the event supervisr and scring judges use the rubric develped by the

More information

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data Outline IAML: Lgistic Regressin Charles Suttn and Victr Lavrenk Schl f Infrmatics Semester Lgistic functin Lgistic regressin Learning lgistic regressin Optimizatin The pwer f nn-linear basis functins Least-squares

More information

Linear Classification

Linear Classification Linear Classificatin CS 54: Machine Learning Slides adapted frm Lee Cper, Jydeep Ghsh, and Sham Kakade Review: Linear Regressin CS 54 [Spring 07] - H Regressin Given an input vectr x T = (x, x,, xp), we

More information

Pipetting 101 Developed by BSU CityLab

Pipetting 101 Developed by BSU CityLab Discver the Micrbes Within: The Wlbachia Prject Pipetting 101 Develped by BSU CityLab Clr Cmparisns Pipetting Exercise #1 STUDENT OBJECTIVES Students will be able t: Chse the crrect size micrpipette fr

More information

T Algorithmic methods for data mining. Slide set 6: dimensionality reduction

T Algorithmic methods for data mining. Slide set 6: dimensionality reduction T-61.5060 Algrithmic methds fr data mining Slide set 6: dimensinality reductin reading assignment LRU bk: 11.1 11.3 PCA tutrial in mycurses (ptinal) ptinal: An Elementary Prf f a Therem f Jhnsn and Lindenstrauss,

More information

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date AP Statistics Practice Test Unit Three Explring Relatinships Between Variables Name Perid Date True r False: 1. Crrelatin and regressin require explanatry and respnse variables. 1. 2. Every least squares

More information

making triangle (ie same reference angle) ). This is a standard form that will allow us all to have the X= y=

making triangle (ie same reference angle) ). This is a standard form that will allow us all to have the X= y= Intrductin t Vectrs I 21 Intrductin t Vectrs I 22 I. Determine the hrizntal and vertical cmpnents f the resultant vectr by cunting n the grid. X= y= J. Draw a mangle with hrizntal and vertical cmpnents

More information

Eric Klein and Ning Sa

Eric Klein and Ning Sa Week 12. Statistical Appraches t Netwrks: p1 and p* Wasserman and Faust Chapter 15: Statistical Analysis f Single Relatinal Netwrks There are fur tasks in psitinal analysis: 1) Define Equivalence 2) Measure

More information

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa There are tw parts t this lab. The first is intended t demnstrate hw t request and interpret the spatial diagnstics f a standard OLS regressin mdel using GeDa. The diagnstics prvide infrmatin abut the

More information

Kinetic Model Completeness

Kinetic Model Completeness 5.68J/10.652J Spring 2003 Lecture Ntes Tuesday April 15, 2003 Kinetic Mdel Cmpleteness We say a chemical kinetic mdel is cmplete fr a particular reactin cnditin when it cntains all the species and reactins

More information

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers LHS Mathematics Department Hnrs Pre-alculus Final Eam nswers Part Shrt Prblems The table at the right gives the ppulatin f Massachusetts ver the past several decades Using an epnential mdel, predict the

More information

Least Squares Optimal Filtering with Multirate Observations

Least Squares Optimal Filtering with Multirate Observations Prc. 36th Asilmar Cnf. n Signals, Systems, and Cmputers, Pacific Grve, CA, Nvember 2002 Least Squares Optimal Filtering with Multirate Observatins Charles W. herrien and Anthny H. Hawes Department f Electrical

More information

WRITING THE REPORT. Organizing the report. Title Page. Table of Contents

WRITING THE REPORT. Organizing the report. Title Page. Table of Contents WRITING THE REPORT Organizing the reprt Mst reprts shuld be rganized in the fllwing manner. Smetime there is a valid reasn t include extra chapters in within the bdy f the reprt. 1. Title page 2. Executive

More information

Part 3 Introduction to statistical classification techniques

Part 3 Introduction to statistical classification techniques Part 3 Intrductin t statistical classificatin techniques Machine Learning, Part 3, March 07 Fabi Rli Preamble ØIn Part we have seen that if we knw: Psterir prbabilities P(ω i / ) Or the equivalent terms

More information

The standards are taught in the following sequence.

The standards are taught in the following sequence. B L U E V A L L E Y D I S T R I C T C U R R I C U L U M MATHEMATICS Third Grade In grade 3, instructinal time shuld fcus n fur critical areas: (1) develping understanding f multiplicatin and divisin and

More information

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets Department f Ecnmics, University f alifrnia, Davis Ecn 200 Micr Thery Prfessr Giacm Bnann Insurance Markets nsider an individual wh has an initial wealth f. ith sme prbability p he faces a lss f x (0

More information

Enhancing Performance of MLP/RBF Neural Classifiers via an Multivariate Data Distribution Scheme

Enhancing Performance of MLP/RBF Neural Classifiers via an Multivariate Data Distribution Scheme Enhancing Perfrmance f / Neural Classifiers via an Multivariate Data Distributin Scheme Halis Altun, Gökhan Gelen Nigde University, Electrical and Electrnics Engineering Department Nigde, Turkey haltun@nigde.edu.tr

More information

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION NUROP Chinese Pinyin T Chinese Character Cnversin NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION CHIA LI SHI 1 AND LUA KIM TENG 2 Schl f Cmputing, Natinal University f Singapre 3 Science

More information

Testing Groups of Genes

Testing Groups of Genes Testing Grups f Genes Part II: Scring Gene Ontlgy Terms Manuela Hummel, LMU München Adrian Alexa, MPI Saarbrücken NGFN-Curses in Practical DNA Micrarray Analysis Heidelberg, March 6, 2008 Bilgical questins

More information

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation III-l III. A New Evaluatin Measure J. Jiner and L. Werner Abstract The prblems f evaluatin and the needed criteria f evaluatin measures in the SMART system f infrmatin retrieval are reviewed and discussed.

More information

Five Whys How To Do It Better

Five Whys How To Do It Better Five Whys Definitin. As explained in the previus article, we define rt cause as simply the uncvering f hw the current prblem came int being. Fr a simple causal chain, it is the entire chain. Fr a cmplex

More information

Chapter 15 & 16: Random Forests & Ensemble Learning

Chapter 15 & 16: Random Forests & Ensemble Learning Chapter 15 & 16: Randm Frests & Ensemble Learning DD3364 Nvember 27, 2012 Ty Prblem fr Bsted Tree Bsted Tree Example Estimate this functin with a sum f trees with 9-terminal ndes by minimizing the sum

More information

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards:

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards: MODULE FOUR This mdule addresses functins SC Academic Standards: EA-3.1 Classify a relatinship as being either a functin r nt a functin when given data as a table, set f rdered pairs, r graph. EA-3.2 Use

More information

Activity Guide Loops and Random Numbers

Activity Guide Loops and Random Numbers Unit 3 Lessn 7 Name(s) Perid Date Activity Guide Lps and Randm Numbers CS Cntent Lps are a relatively straightfrward idea in prgramming - yu want a certain chunk f cde t run repeatedly - but it takes a

More information

Multiple Source Multiple. using Network Coding

Multiple Source Multiple. using Network Coding Multiple Surce Multiple Destinatin Tplgy Inference using Netwrk Cding Pegah Sattari EECS, UC Irvine Jint wrk with Athina Markpulu, at UCI, Christina Fraguli, at EPFL, Lausanne Outline Netwrk Tmgraphy Gal,

More information

NGSS High School Physics Domain Model

NGSS High School Physics Domain Model NGSS High Schl Physics Dmain Mdel Mtin and Stability: Frces and Interactins HS-PS2-1: Students will be able t analyze data t supprt the claim that Newtn s secnd law f mtin describes the mathematical relatinship

More information

Weathering. Title: Chemical and Mechanical Weathering. Grade Level: Subject/Content: Earth and Space Science

Weathering. Title: Chemical and Mechanical Weathering. Grade Level: Subject/Content: Earth and Space Science Weathering Title: Chemical and Mechanical Weathering Grade Level: 9-12 Subject/Cntent: Earth and Space Science Summary f Lessn: Students will test hw chemical and mechanical weathering can affect a rck

More information

Name: Block: Date: Science 10: The Great Geyser Experiment A controlled experiment

Name: Block: Date: Science 10: The Great Geyser Experiment A controlled experiment Science 10: The Great Geyser Experiment A cntrlled experiment Yu will prduce a GEYSER by drpping Ments int a bttle f diet pp Sme questins t think abut are: What are yu ging t test? What are yu ging t measure?

More information

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp THE POWER AND LIMIT OF NEURAL NETWORKS T. Y. Lin Department f Mathematics and Cmputer Science San Jse State University San Jse, Califrnia 959-003 tylin@cs.ssu.edu and Bereley Initiative in Sft Cmputing*

More information

SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST. Mark C. Otto Statistics Research Division, Bureau of the Census Washington, D.C , U.S.A.

SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST. Mark C. Otto Statistics Research Division, Bureau of the Census Washington, D.C , U.S.A. SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST Mark C. Ott Statistics Research Divisin, Bureau f the Census Washingtn, D.C. 20233, U.S.A. and Kenneth H. Pllck Department f Statistics, Nrth Carlina State

More information

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA Mental Experiment regarding 1D randm walk Cnsider a cntainer f gas in thermal

More information

2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS

2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS 2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS 6. An electrchemical cell is cnstructed with an pen switch, as shwn in the diagram abve. A strip f Sn and a strip f an unknwn metal, X, are used as electrdes.

More information

Cells though to send feedback signals from the medulla back to the lamina o L: Lamina Monopolar cells

Cells though to send feedback signals from the medulla back to the lamina o L: Lamina Monopolar cells Classificatin Rules (and Exceptins) Name: Cell type fllwed by either a clumn ID (determined by the visual lcatin f the cell) r a numeric identifier t separate ut different examples f a given cell type

More information

IB Sports, Exercise and Health Science Summer Assignment. Mrs. Christina Doyle Seneca Valley High School

IB Sports, Exercise and Health Science Summer Assignment. Mrs. Christina Doyle Seneca Valley High School IB Sprts, Exercise and Health Science Summer Assignment Mrs. Christina Dyle Seneca Valley High Schl Welcme t IB Sprts, Exercise and Health Science! This curse incrprates the traditinal disciplines f anatmy

More information

1 The limitations of Hartree Fock approximation

1 The limitations of Hartree Fock approximation Chapter: Pst-Hartree Fck Methds - I The limitatins f Hartree Fck apprximatin The n electrn single determinant Hartree Fck wave functin is the variatinal best amng all pssible n electrn single determinants

More information

Module 4: General Formulation of Electric Circuit Theory

Module 4: General Formulation of Electric Circuit Theory Mdule 4: General Frmulatin f Electric Circuit Thery 4. General Frmulatin f Electric Circuit Thery All electrmagnetic phenmena are described at a fundamental level by Maxwell's equatins and the assciated

More information

Inference in the Multiple-Regression

Inference in the Multiple-Regression Sectin 5 Mdel Inference in the Multiple-Regressin Kinds f hypthesis tests in a multiple regressin There are several distinct kinds f hypthesis tests we can run in a multiple regressin. Suppse that amng

More information

BLAST / HIDDEN MARKOV MODELS

BLAST / HIDDEN MARKOV MODELS CS262 (Winter 2015) Lecture 5 (January 20) Scribe: Kat Gregry BLAST / HIDDEN MARKOV MODELS BLAST CONTINUED HEURISTIC LOCAL ALIGNMENT Use Cmmnly used t search vast bilgical databases (n the rder f terabases/tetrabases)

More information

You need to be able to define the following terms and answer basic questions about them:

You need to be able to define the following terms and answer basic questions about them: CS440/ECE448 Sectin Q Fall 2017 Midterm Review Yu need t be able t define the fllwing terms and answer basic questins abut them: Intr t AI, agents and envirnments Pssible definitins f AI, prs and cns f

More information

B. Definition of an exponential

B. Definition of an exponential Expnents and Lgarithms Chapter IV - Expnents and Lgarithms A. Intrductin Starting with additin and defining the ntatins fr subtractin, multiplicatin and divisin, we discvered negative numbers and fractins.

More information

BIOLOGY 101. CHAPTER 17: Gene Expression: From Gene to Protein. The Flow of Genetic Information

BIOLOGY 101. CHAPTER 17: Gene Expression: From Gene to Protein. The Flow of Genetic Information BIOLOGY 101 CHAPTER 17: Gene Expressin: Frm Gene t Prtein Gene Expressin: Frm Gene t Prtein: CONCEPTS: 17.1 Genes specify prteins via transcriptin and translatin 17.2 Transcriptin is the DNA-directed synthesis

More information