COMPARISON OF STATISTICAL LEARNING METHODS FOR PREDICTION IN GENOME-WIDE ASSOCIATION STUDIES


COMPARISON OF STATISTICAL LEARNING METHODS FOR PREDICTION IN GENOME-WIDE ASSOCIATION STUDIES

A thesis presented to the faculty of San Francisco State University in partial fulfillment of the requirements for the degree

Master of Arts in Mathematics

by Rachel Hartley
San Francisco, California
August 2017

Copyright by Rachel Hartley 2017

CERTIFICATION OF APPROVAL

I certify that I have read COMPARISON OF STATISTICAL LEARNING METHODS FOR PREDICTION IN GENOME-WIDE ASSOCIATION STUDIES by Rachel Hartley and that in my opinion this work meets the criteria for approving a thesis submitted in partial fulfillment of the requirements for the degree: Master of Arts in Mathematics at San Francisco State University.

Dr. Tao He, Assistant Professor of Mathematics
Dr. Serkan Hoşten, Professor of Mathematics
Dr. Alexandra Piryatinska, Associate Professor of Mathematics

COMPARISON OF STATISTICAL LEARNING METHODS FOR PREDICTION IN GENOME-WIDE ASSOCIATION STUDIES

Rachel Hartley
San Francisco State University
2017

Results from Genome-Wide Association Studies (GWAS) show that the detected Single Nucleotide Polymorphisms (SNPs) only explain a small fraction of heritability, and although identifying those missing SNPs is important, it is sometimes more important to be able to predict whether a person will develop a certain disease or not. In this thesis, a comparison analysis of three different variable selection methods (LASSO, χ² test for independence, Random Forest) and seven classification methods (Logistic Classification, Linear Discriminant Analysis, Random Forest, Support Vector Machines with linear, radial, and polynomial kernels, and K-Nearest Neighbor) is given on simulated GWAS datasets under different disease models. After a discussion of the methods, the best model for each scenario is chosen based on prediction error rate and area under the Receiver Operating Characteristic (ROC) curve.

I certify that the Abstract is a correct representation of the content of this thesis. Chair, Thesis Committee

ACKNOWLEDGMENTS

I would love to thank my amazing thesis advisor Dr. Tao He for helping me through this process, my husband for always pushing me to succeed, and my parents who have supported my math career since high school.

TABLE OF CONTENTS

1 Introduction
  1.1 Biological Background
    1.1.1 Overview
    1.1.2 Genetic Variation
    1.1.3 Genome-wide Association Studies
  1.2 Motivation and Objectives
2 Data Preparation
  2.1 Simulating Genotypes
  2.2 Simulating Phenotypes
  2.3 Correlation Structure in Genotypes
3 Variable Selection Methods
  3.1 The LASSO
  3.2 χ² Test for Independence
  3.3 Random Forest
4 Classification Methods
  4.1 Logistic Regression
  4.2 Linear Discriminant Analysis
  4.3 Random Forest
  4.4 Support Vector Machine
  4.5 K-Nearest Neighbor
5 Results
  5.1 Methods Evaluation and Tuning Parameter Selection
    5.1.1 Cross-Validation and Out of Bag Error
    5.1.2 Classification Error Rate and the ROC Curve
  5.2 Variable Selection Results
    5.2.1 LASSO
    5.2.2 χ² Test
    5.2.3 Random Forest
    5.2.4 Overall
  5.3 Classification Method Results
    5.3.1 Method Details
    5.3.2 Comparisons Within Variable Selection Method
    5.3.3 Comparisons Across Variable Selection Method
6 Conclusion
Appendix A: Detailed Description of Simulation Scenarios
Appendix B: Variable Selection Detailed Results
Appendix C: No Selection Model Details
Bibliography

LIST OF TABLES

2.1 Table of variables removed due to high correlation
3.1 Example of table of values used for a χ² test
5.1 Details of models created in the LASSO Sparse scenario
5.2 Details of models created in the LASSO Dense scenario
5.3 Details of models created in the LASSO Interaction scenario
5.4 Details of models created in the χ² Sparse scenario
5.5 Details of models created in the χ² Dense scenario
5.6 Details of models created in the χ² Interaction scenario
5.7 Details of models created in the Random Forest Sparse scenario
5.8 Details of models created in the Random Forest Dense scenario
5.9 Details of models created in the Random Forest Interaction scenario
5.10 Logistic Classification Summary
5.11 Linear Discriminant Analysis Summary
5.12 Random Forest Summary
5.13 Linear Support Vector Machine Summary
5.14 Radial Support Vector Machine Summary
5.15 Polynomial Support Vector Machine Summary
5.16 K-Nearest Neighbor Summary
6.1 More data about the chromosomes and genes that the SNPs are from
6.2 Variable selection results for LASSO in the Sparse, Dense, and Interaction scenarios
6.3 Variable selection results for χ² in the Sparse scenario
6.4 Variable selection results for χ² in the Dense and Interaction scenarios
6.5 Variable selection results for Random Forest in the Sparse and Dense scenarios
6.6 Variable selection results for Random Forest in the Interaction scenario
6.7 NO SELECTION - SPARSE
6.8 NO SELECTION - DENSE
6.9 NO SELECTION - INTERACTION

LIST OF FIGURES

1.1 Visualization of the process of DNA to protein production
1.2 Diagram of the transition from chromosome to DNA to gene
1.3 Diagram of a gene pathway
1.4 Representation of process from raw data to coded data
1.5 An outline of the different methods used and how they are applied to the simulated data
2.1 How rows of SNP data are organized before and after phasing
2.2 A representation of the relationships between data in each scenario
2.3 Correlation map of the SNPs from chromosome 15
3.1 On the left is the two-dimensional LASSO solution set, the right is Ridge Regression
3.2 A graph of χ² distribution based on degrees of freedom
3.3 A decision tree on the left, with the region it creates on the right
4.1 A plot showing the difference between linear regression and logistic classification
4.2 A plot showing an example of an LDA classification
4.3 Example of ε values in an SVC graph
4.4 Example of cost differences in an SVC graph
4.5 Example of nearest neighbor graphs with different K values
5.1 A representation of how Cross Validation makes training and test sets
5.2 An example of possible ROC curves
5.3 Results of the LASSO λ selection process for the three scenarios
5.4 Collection of graphs from Random Forest error output
5.5 Top nineteen SNPs by variable importance for LDA and Random Forest in the LASSO Sparse scenario
5.6 Top twenty SNPs by variable importance for LDA and Random Forest in the LASSO Dense scenario
5.7 Top twenty SNPs by variable importance for LDA and Random Forest in the LASSO Interaction scenario
5.8 Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Sparse scenario
5.9 Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Dense scenario
5.10 Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Interaction scenario
5.11 Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Sparse scenario
5.12 Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Dense scenario
5.13 Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Interaction scenario
5.14 ROC curves for LASSO scenarios
5.15 ROC curves for χ² scenarios
5.16 ROC curves for Random Forest scenarios

Chapter 1

Introduction

In this chapter, some biological background knowledge is given first, followed by the motivation and objectives.

1.1 Biological Background

1.1.1 Overview

A biological trait is an aspect of an organism that can be described or measured. For all types of organisms, trait differences exist and can vary widely. For example, eye color or susceptibility to certain diseases in humans, yield level of crops, and meat or milk quality in livestock all differ from animal to animal or plant to plant. The different levels that a trait can take on are called phenotypes [24]. Phenotypes of eye color could be blue, brown, or green, while phenotypes of blood type would include O, A, and B.

Figure 1.1: Visualization of the process of DNA to protein production. Image created by Madeleine Price Ball on 23 January 2013, and downloaded in August.

Genotype is the complete DNA sequence of an individual inherited from the parents, which carries the majority of biological information. For simple traits, a section of DNA sequence is responsible for deciding what phenotype the trait exhibits. The process of translating from DNA to phenotype is referred to as the central dogma of molecular biology and can be summarized as: DNA encodes RNA, RNA encodes protein, and the trait is exhibited through biosynthesis of protein [25]. Figure 1.1 illustrates the transition from DNA to protein. In humans, DNA is spread across 23 chromosome pairs, and stored as a sequence of repetitions of four nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). At each individual base, every individual has two nucleotides, one inherited from the mother while the other is passed down by the father. A gene is a section of DNA sequence that is the basic physical and functional unit

Figure 1.2: Diagram of the transition from chromosome to DNA to gene. [30]

of heredity. Figure 1.2 shows the breakdown of chromosomes to genes to nucleotides. The size of a gene can vary from a few hundred to 2 million nucleotide bases, and humans have around 25,000 genes. A series of complex interactions among the genes that lead to a certain product or a change in a cell is called a biological pathway. Figure 1.3 shows the pathway for type II diabetes. Each box represents a gene that is part of the structure that controls systems like insulin production.

1.1.2 Genetic Variation

For most of the complex traits like cancer or body height, the difference in phenotype can come from variation in genotype, environmental factors, or the interactions between them (normally referred to as GxE interaction). Humans share 99.9% of their DNA sequences, so only 0.1% of the nucleotides are different. The single-base variation in DNA sequence is called a Single Nucleotide Polymorphism (SNP). This occurs when one nucleotide is accidentally replaced with another. For example, in Figure 1.4 the sequence AGCTAC has become AGCTAG for Individual 1 in the top row and Individual 3 instead has AGCCAG in the bottom row. For each SNP location, there is a more common nucleotide and a less common nucleotide. To decide which nucleotide is more common, the frequency of the nucleotides is calculated for each location; the more frequent is called the major allele, and the other is called the minor allele. In Figure 1.4, SNP1 has a major allele

Figure 1.3: Diagram of a gene pathway [21].

of A since it shows up six times as opposed to T which only shows up twice. Figure 1.4 also shows the transition from nucleotide bases to SNP count. Several coding methods are available to quantify the SNP from the raw genotype. One of the most frequently used coding methods is called additive coding, where the total number of minor alleles is used to represent this SNP for each person. Since humans have two copies of the DNA sequence, this number could be 0, 1, or 2. Individuals 1, 3, and 4 in Figure 1.4 all have a single minor allele in one SNP, represented by the 1 in the respective positions. Individual 2 has a minor allele for SNP1 and SNP2, but also has a minor allele in both spots for SNP3, so the sequence would be coded 1, 1, 2. Another type of genetic variation is copy number variation (CNV), where the number of times a section of DNA is repeated differs from person to person.

Figure 1.4: Representation of process from raw data to coded data. (SNP1: major allele A, minor allele T; SNP2: major allele T, minor allele C; SNP3: major allele C, minor allele G.)
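To make the additive coding concrete, here is a small R sketch using hypothetical data mirroring Individual 2 of Figure 1.4 (this is an illustration, not code from the thesis):

    # Two copies of the DNA sequence for one person, one base per SNP.
    copy1 <- c(SNP1 = "A", SNP2 = "C", SNP3 = "G")
    copy2 <- c(SNP1 = "T", SNP2 = "T", SNP3 = "G")

    # Minor allele at each SNP location (from Figure 1.4).
    minor <- c(SNP1 = "T", SNP2 = "C", SNP3 = "G")

    # Additive coding: count of minor alleles, so each entry is 0, 1, or 2.
    coded <- (copy1 == minor) + (copy2 == minor)
    coded
    # SNP1 SNP2 SNP3
    #    1    1    2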

Like SNPs, CNVs can also be quantified by using similar coding methods. For simplicity, this thesis mainly focuses on the scenarios where a set of SNPs contributes to the risk of disease (i.e., a binary trait of interest), but the analysis can easily be generalized to cases that include more complex factors or more types of genetic variation.

1.1.3 Genome-wide Association Studies

With the advancement of micro-array and sequencing technology, individuals can obtain detailed and accurate genetic markers (e.g., SNPs) of the human genome. In Genome-Wide Association Studies (GWAS), scientists search all the SNPs across the genome to identify the special ones that are associated with certain traits of interest. These could be SNPs for complex diseases, like diabetes and cancer, or quantitative traits like birth weight [42]. In traditional GWAS, a statistical test is first performed for each of the SNPs and a p-value is obtained accordingly. The SNPs whose p-values are less than a very small threshold, commonly $10^{-7}$, are considered significant markers that are associated with that trait. GWAS have successfully identified SNPs related to several complex conditions. However, the single-marker-based analysis of each SNP has a major drawback [41]. In reality, SNPs tend to work together as a system rather than individually to realize certain functions, and testing each SNP separately ignores the complex interactions between them. Moreover, since every single SNP that is truly involved in the complex system only

contributes a small effect toward the trait, it is very likely that the individual tests will not be able to identify those SNPs, as the genome-wide p-value threshold is too stringent.

1.2 Motivation and Objectives

Although the original goal of GWAS was to identify the disease-related SNPs, more recent studies show that those identified using traditional methods can only explain a small fraction of heritability. So there must be many other SNPs functioning in the complex system which are extremely difficult to uncover. Instead of finding the whole set of associated markers, the new objective is to make individual predictions about the disease risk for each person. Many statistical learning methods have been applied to GWAS for the prediction purpose, such as Random Forest [32], Group LASSO [37], and Bayesian Neural Networks [1]. Each method is optimized for different conditions, so if the structure of the data is well known, it may be obvious which method would work best. But if little is known about the data, it may be difficult to decide how to analyze it. Since there is not one best method to use in every situation, multiple methods are almost always used and the results compiled together to make conclusions [31]. This thesis work is mainly focused on applying several statistical learning methods to different sets of simulated data to discover which methods perform best under different scenarios. Three different scenarios were built with increasing levels of

Figure 1.5: An outline of the different methods used and how they are applied to the simulated data.

complexity in the structure of the response variable: Sparse, Dense, and Interaction. Since GWAS data is normally high-dimensional (i.e., the total number of SNPs is large relative to the number of individuals), variable selection is almost always recommended before applying the algorithms, to improve the prediction performance. Here, three variable selection methods were used for each scenario: LASSO, χ² test, and Random Forest. Finally, for each of the nine subsets of data created by the variable selection, seven classification methods were run and the outcomes compared. Figure 1.5 gives an overview of the process. The thesis is organized as follows. Chapter 2 describes how the data was prepared, including the simulation of genotypes and phenotypes as well as the design of three different scenarios. The general ideas and principles of the three variable selection methods are briefly reviewed in Chapter 3. Chapter 4 provides an overview

about the seven statistical learning methods for classification. The results of variable selection and prediction are summarized in Chapter 5. A brief discussion of the paper is given in Chapter 6. Some simulation details are relegated to the appendices.

Chapter 2

Data Preparation

2.1 Simulating Genotypes

The genotypes were simulated based on two nested case-control cohort type II diabetes datasets, the Nurses' Health Study (NHS) and the Health Professionals Follow-up Study (HPFS), which are part of the Gene Environment Association Studies (GENEVA) [23]. For more detailed information about the datasets, please refer to "Diet, lifestyle, and the risk of type 2 diabetes mellitus in women" [19] and "Dietary patterns and risk for type 2 diabetes mellitus in U.S. men" [39]. The raw datasets originally include 3391 female and 2599 male individuals, respectively. Following the conventional data-cleaning procedures, individuals with a large proportion of missing SNPs (>10%) or any kinship relationship with others in the datasets were removed, and SNPs with small minor allele frequency (<0.05) were also removed. SNPs that are located within 50kb up- and down-stream of a gene were

first mapped to the corresponding gene, based on Human Genome Build 37.3, and then a group of genes was mapped to a biological pathway. Specifically, we considered a Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway named maturity-onset-diabetes-of-the-young, which contains 599 SNPs over 25 genes from a total of 5961 individuals. When the raw data was stored, the different nucleotides at each spot were recorded in two rows, but which data came from the mother and which from the father was not kept track of. The process of sorting the data back into the mother section and father section is called phasing and is pictured in Figure 2.1. The phased sequence tends to be inherited as a whole from one of the parents, and is called a haplotype. The 25 genes covering 599 SNPs were mapped to 14 chromosomes (1, 2, 3, 4, 7, 8, 10, 11, 12, 13, 15, 17, 19, 20) and then phased using fastPHASE [36]. Based on the phased data, haplotype frequencies on each chromosome can be estimated and used to simulate new genotypes. Specifically, 14 files were generated corresponding to the phased data for all the individuals, one for each chromosome.

Figure 2.1: How rows of SNP data are organized before and after phasing.

Next, the 14 output files are fed into the R package hapsim [28], which utilizes the haplotype frequency information and produces a simulated set of any size required. While the new data is heavily based on the original, none of the genotypes are copied directly into the simulated data; it is all new combinations of nucleotides. For each chromosome, 60,000 haplotypes were first simulated and then a random subset of 40,000 was selected, paired, and assigned to 20,000 individuals. Finally, the simulated haplotypes for the 14 chromosomes are combined into one data matrix with 599 columns and 20,000 rows, representing the genotypes of 599 SNPs for the 20,000 individuals. Additive coding was used to quantify the SNP genotypes, i.e., each entry in the matrix is the total number of minor alleles for each SNP and each individual.

2.2 Simulating Phenotypes

Though the original data did have phenotypes associated with it, the details of how those phenotypes related back to the genotypes were unknown. In order to draw conclusions about the different classification methods used, the exact structure of the response variable must be known, so new response variables were created based on the simulated data. Here, the theoretical disease status will be generated under three different simulation scenarios: Sparse, Dense, and Interaction. More specifically, the

disease status of the $i$th individual, $Y_i$, was generated through a Bernoulli distribution

$$Y_i \sim \text{Ber}(p_i) \quad \text{with} \quad \text{logit}(p_i \mid X_i) = h(X_i), \quad \text{or} \quad p_i \mid X_i = \frac{1}{1 + \exp(-h(X_i))}$$

where $X_i$ represents the $i$th individual's SNP genotype vector, $Y_i$ is a binary variable with $Y_i = 1$ indicating the person has the disease (case) and $Y_i = 0$ indicating no such disease (control), $p_i$ represents the risk of having the disease for the $i$th person, and the function $h(\cdot)$ controls how genotypes contribute to the disease risk. Under different scenarios, the function takes different forms, corresponding to different disease models. The three scenarios are described in Figure 2.2. The Sparse scenario considers the case where five SNPs contribute to the disease risk, which come from five distinct genes, each with a moderate marginal effect. The Dense scenario assumes that from the five genes there are 32 causal SNPs, each with a smaller marginal effect. The Interaction scenario considers a more complex model where interactions between and within genes also exist, in addition to the marginal effects. More detailed information about the five genes and the exact forms of $h(\cdot)$ are provided in Appendix A. After the phenotypes of the 20,000 individuals are generated, 3,000 individuals with 1,500 disease and 1,500 non-disease are randomly selected as training data, and 400 individuals with 200 disease and 200 non-disease

are randomly selected as test data.

Figure 2.2: A representation of the relationships between data in each scenario. (The five genes involved are GCK, NR5A2, HNF1B, NEUROG3, and HNF1A.)

2.3 Correlation Structure in Genotypes

To reduce the amount of computational resources required, the correlation of the 599 SNPs was calculated. If variables are highly correlated, the predictive model could be a bit unstable or even infeasible [2]. In the case of SNPs, variables that are close to each other on the gene tend to be highly correlated. In the correlation diagram in Figure 2.3, a darker color corresponds to a stronger correlation. It is clear that the darker cells tend to gather around the center diagonal. For every pair of variables correlated at a higher than 0.95 level, the variable with the higher index was removed. This led to a few of the original variables getting cut out very early in the process. Table 2.1 lists which variables were removed, and

which variable will be followed instead through the process. Variables V84 and V302 were already part of the 32 chosen to build the response variable, so the maximum number of variables that could be picked up by a model is reduced to 30. A total of 179 variables were removed, leaving 420 SNPs to study.

Table 2.1: Table of variables removed due to high correlation.

Original Variable   Surrogate Variable
V86                 V84
V190                V187
V192                V188
V200                V199
V305                V302
V308                V306
V490                V489
V515                V514

Figure 2.3: Correlation map of the SNPs from chromosome 15.
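A rough R sketch of this preparation pipeline is given below. The genotype matrix geno and the risk function h() are placeholders (the exact forms of h() are in Appendix A), and caret's findCorrelation is one way to apply the 0.95 rule, though its tie-breaking differs slightly from the drop-the-higher-index convention used in this thesis:

    library(caret)  # for findCorrelation()

    set.seed(1)
    n <- 20000; p <- 599
    geno <- matrix(rbinom(n * p, 2, 0.3), nrow = n)  # placeholder 0/1/2 counts

    # Placeholder Sparse-style risk function; the real h() is in Appendix A.
    h <- function(X) -1 + 0.4 * rowSums(X[, c(100, 199, 300, 400, 500)])

    # Disease status: Y_i ~ Bernoulli(p_i) with logit(p_i) = h(X_i).
    pr <- 1 / (1 + exp(-h(geno)))
    y  <- rbinom(n, 1, pr)

    # Prune one SNP from every pair with correlation above 0.95.
    drop <- findCorrelation(cor(geno), cutoff = 0.95)
    geno.pruned <- geno[, -drop]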

Chapter 3

Variable Selection Methods

Because the high dimensionality of the predictors normally negatively impacts classification performance (the so-called curse of high-dimensionality), a variable selection procedure will first be applied to reduce the dimension of the variables. Specifically, three different selection methods are utilized to narrow down the 599 SNP variables to a smaller subset. In the following sections, a brief description will be given for each of the three methods (LASSO, χ² test, Random Forest).

3.1 The LASSO

The LASSO, or least absolute shrinkage and selection operator, is an analysis method proposed by Robert Tibshirani in 1996 as an extreme form of the Ridge Regression technique [38]. Ridge Regression starts with a basic linear regression model

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$

but adds an extra term when calculating the values of the coefficients that acts as a penalty on the size of the $\beta_j$s.

Least Squares: choose $\beta_0, \beta_1, \dots, \beta_p$ to minimize $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$

Ridge Regression: choose $\beta$s to minimize $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$

Here $n$ is the number of observations, $y_i$ is the $i$th observation's response value, and $x_{ij}$ is the $j$th input for the $i$th observation. The $\lambda$ in the second equation is a tuning parameter that is usually determined by resampling, which is the process of taking the known data and making subsets to emulate a test set. Each choice of $\lambda$ will result in a different set of parameters, with a larger $\lambda$ giving smaller coefficients overall. But Ridge Regression can never force a coefficient to be exactly zero, so all variables are still included in the model. The LASSO, on the other hand, can force coefficients to zero by using an $\ell_1$ penalty instead of the $\ell_2$ from Ridge Regression.

LASSO: choose $\beta_0, \beta_1, \dots, \beta_p$ to minimize $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$ [38]

To see why the LASSO can cancel out variables completely, two slightly different forms of the Ridge Regression and LASSO equations are considered. Both methods can be seen as the solutions to the following problems:

Ridge Regression: minimize over $\beta_0, \beta_1, \dots, \beta_p$ the quantity $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$ subject to $\sum_{j=1}^{p} \beta_j^2 \le t$ [18]

LASSO: minimize over $\beta_0, \beta_1, \dots, \beta_p$ the quantity $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$ subject to $\sum_{j=1}^{p} |\beta_j| \le t$ [38]

Say there are only two coefficients, $\beta_1$ and $\beta_2$. Then the constraint region for Ridge Regression is $\beta_1^2 + \beta_2^2 \le t$ and for LASSO is $|\beta_1| + |\beta_2| \le t$ [15]. Pictured in Figure 3.1 are those regions mapped along with the elliptical solution curves.

Figure 3.1: On the left is the two-dimensional LASSO solution set, the right is Ridge Regression. $\hat\beta$ represents the coefficient values that would be chosen by a least squares model. The solid shapes centered on the origin are the constraint areas, while the ellipses centered on $\hat\beta$ are the solution curves. [15]

For Ridge Regression, since the constraint region is circular, the intersection will almost never lie on a point where one of the betas is zero [20]. But the pointed corners of the LASSO region make the intersection much more likely to occur at an axis, where one of the betas is exactly zero [20]. If there are more $\beta$s and the dimension increases, the constraint shapes will have similar properties; Ridge Regression will have a smooth $n$-dimensional sphere and LASSO will have a polytope with many places for betas that are zero [20]. The clear advantage of LASSO over least squares and Ridge Regression is the variable selection property. With a sparse data set, a few important variables hidden within many, cutting out unnecessary variables is a critical step. The loss of variables can cause a decrease in prediction accuracy, but it is normally offset by the increased interpretability of the model. And though the LASSO was first used with least squares models, it has since been extended to cover a wide variety of generalized linear models. In particular for this study, because of the categorical response variable, the Logistic LASSO is used. In this setting the LASSO is a solution to:

minimize over $\beta_0, \beta_1, \dots, \beta_p$ the quantity $\sum_{i=1}^{n} \left[ -y_i \left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right) + \log\left( 1 + e^{\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}} \right) \right]$ subject to $\sum_{j=1}^{p} |\beta_j| \le t$ [15].
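In R, the logistic LASSO is available through the glmnet package, which the thesis uses later via cv.glmnet. A minimal sketch, where x is a placeholder numeric SNP matrix and y the 0/1 disease status:

    library(glmnet)

    # alpha = 1 selects the l1 (LASSO) penalty; family = "binomial"
    # gives the logistic form shown above.
    fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)

    # Coefficients at the CV-chosen lambda; SNPs whose coefficients were
    # forced to exactly zero have been dropped from the model.
    b <- coef(fit, s = "lambda.min")
    selected <- setdiff(rownames(b)[as.vector(b) != 0], "(Intercept)")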

3.2 χ² Test for Independence

The Chi-Square, or χ², test is a classic method to test if two categorical variables are independent. The null hypothesis states that patterns in the data sets occurred by chance, while the alternative hypothesis states that a change in one set of data is matched by a change in the other data set [33]. Before the χ² statistic is calculated, a contingency table is created. Table 3.1 gives an example using the data from SNP1 and the response variable from the Sparse scenario. Let $O_{ij}$ represent the number of observations in the $i$th row and $j$th column, $r$ represent the number of rows, $c$ represent the number of columns, and $N$ the sum of all the entries in the table. Then $E_{ij}$ represents the expected value of each entry $ij$, with

$$E_{ij} = \frac{(\text{sum of row } i)(\text{sum of column } j)}{N}.$$

For example, if the expected value of the $x_i = 0$, $y_i = 0$ entry was to be calculated, first the sum of the first row is found, then multiplied by the sum of the first column, and then divided by the total sum of all the entries.

Table 3.1: Example of table of values used for a χ² test (rows: counts of the SNP1 genotype values $x_i = 0, 1, 2$; columns: the response values $y_i = 0, 1$).

Figure 3.2: A graph of χ² distribution based on degrees of freedom.

So if SNP1 and the response variable were independent, we would expect to see approximately 843 observations fall into the first entry of the table. Finally the χ² statistic is given by

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \quad [33].$$

This χ² value is then compared to the χ² distribution to calculate a p-value. Figure 3.2 shows a few different distributions, based on the degrees of freedom. The degrees of freedom are calculated by $df = (r-1)(c-1)$, so for our model the degrees of freedom is 2.
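A sketch of this test in R for a single SNP, and then for every SNP in a genotype matrix; snp, y, and geno are placeholders:

    # 3x2 contingency table of minor-allele counts (0/1/2) against disease.
    tab <- table(snp, y)

    # Pearson's chi-squared test of independence; here df = (3-1)(2-1) = 2.
    chisq.test(tab)$p.value

    # Repeat for every SNP column and keep those significant at 0.05.
    pvals <- apply(geno, 2, function(s) chisq.test(table(s, y))$p.value)
    keep  <- which(pvals < 0.05)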

Figure 3.3: A decision tree on the left, with the region it creates on the right. [20]

3.3 Random Forest

Decision trees are a very flexible method of splitting the feature space into sections based on criteria on certain variables. They make no assumptions about the structure of the data and, for interpretation, produce a list of variables that are most important in grouping the data. A decision tree is a graph made up of splitting nodes starting at the top of the tree, and terminal nodes, or leaves, at the bottom. Each node represents a grouping of the data into two parts based on some variable criterion. An example of a decision tree from a two-variable set and the regions it creates is given in Figure 3.3. The general process of building a decision tree starts with dividing the data into

$J$ distinct non-overlapping regions, $R_1, R_2, \dots, R_J$, and assigning the same response variable value to each observation in the region [20]. These regions, normally simple $p$-dimensional rectangles, are chosen in order to minimize the error given by the residual sum of squares

$$RSS = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2$$

where $\hat{y}_{R_j}$ is the mean response of the observations in $R_j$ [20]. Since it is impossible to try every combination of regions, a top-down, greedy approach called recursive binary splitting is used [34]. The process is top-down because one split begins the tree, and then two more splits are made from those two original branches. Greedy refers to the fact that the best choice is made for the current split without considering future decisions. For the regression setting, at each split a variable $X_j$ is chosen, and the best cutpoint $s$ in the region is found such that

$$\sum_{i:\, x_i \in R_1} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2} (y_i - \hat{y}_{R_2})^2$$

is minimized, where $R_1 = \{x \mid x_j < s\}$ and $R_2 = \{x \mid x_j \ge s\}$ [20]. To predict a response, a given observation is placed in the region that fits the correct conditions, and the mean of the region is given as the response. A few changes are needed for a classification setting, but the overall ideas are the same. Instead of calculating the mean of the observations in a region, the most commonly occurring class is found. For the splitting criterion, a measure of node purity called the Gini index is used in place of the RSS. Let $\hat{p}_{mk}$ represent the proportion of

observations in the $m$th region that are in the $k$th class; then the Gini index is

$$G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}) \quad [20].$$

A small value of $G$ means that the $\hat{p}_{mk}$s are all close to 0 or 1, so the split has made the observations in the region almost all one class. One disadvantage of decision trees is that on their own they tend to not have the predictive power of other methods. The Random Forest method is one way to improve the predictive accuracy by growing multiple trees and averaging the results. Tin Kam Ho first developed Random Decision Forests in 1995 to increase the accuracy and complexity of decision trees [17]. Each tree is grown on a random bootstrap sample of the training observations, a process known as bootstrapping, to simulate multiple training sets instead of just one [29]. To keep the trees from becoming too highly correlated, whenever a split is being calculated only a random subset of predictors is allowed to be used for the split [20]. Generally, if there are $p$ predictors, $\sqrt{p}$ variables are chosen for each decision node [15]. By using fewer than half of the variables at each spot, it gives a chance for weaker, but still important, features to show through, instead of being overrun [5].
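The thesis later runs this selection step with the randomForest package [26]; a minimal sketch of ranking SNPs by decrease in accuracy, on placeholder data and with the top-100 cutoff used in Chapter 5:

    library(randomForest)

    # importance = TRUE adds the permutation-based decrease-in-accuracy
    # measure alongside the default Gini-based importance.
    rf <- randomForest(x = geno, y = factor(y), ntree = 500, importance = TRUE)

    # Rank SNPs by mean decrease in accuracy and keep the top 100.
    acc <- importance(rf, type = 1)  # type 1 = mean decrease in accuracy
    top100 <- order(acc, decreasing = TRUE)[1:100]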

Chapter 4

Classification Methods

After the variable selection methods have been applied, the following data analysis methods are run on each set of the remaining variables.

4.1 Logistic Regression

Logistic Classification is the extension of linear regression to a data set with a qualitative response variable. Instead of modeling the qualitative variable directly, the probability that the response falls into one of the categories is modeled. To keep the probability values between 0 and 1, a logistic function is used, where $p(x_i)$ is the probability that an observation is a $y_i = 1$ case:

$$p(x_i) = \frac{e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}}{1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}}$$
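A minimal sketch of this model in R (the thesis fits it with glm(), described in Chapter 5); train and test are placeholder data frames with a 0/1 column y:

    # family = binomial fits logistic regression by maximum likelihood.
    fit <- glm(y ~ ., data = train, family = binomial)

    # Predicted probabilities p(x_i) on new data, then a 0.5 cutoff.
    p.hat <- predict(fit, newdata = test, type = "response")
    y.hat <- as.integer(p.hat > 0.5)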

Figure 4.1: A plot showing the difference between linear regression and logistic classification.

In Figure 4.1, an example of basic linear regression and logistic classification curves is given for comparison.

4.2 Linear Discriminant Analysis

Linear Discriminant Analysis, or LDA, is a classification technique that can give estimates that are more stable than logistic regression in certain cases, like when the classes of the response variable are well separated [20]. LDA uses Bayes' theorem to turn the distribution of the predictors into estimates for the probability of the

response, given the prior probabilities [27]. Bayes' theorem states:

$$\Pr(Y = k \mid X = x) = \frac{\pi_k \Pr(X = x \mid Y = k)}{\sum_{l=1}^{K} \pi_l \Pr(X = x \mid Y = l)}$$

where $\pi_k$ is the prior, or overall, probability that an observation belongs to class $k$, and $K$ is the number of classes [20]. This estimated probability is often called the posterior probability that observation $X$ is in class $k$. Estimating $\Pr(X = x \mid Y = k)$ is the main job of the LDA. The setup does assume that $X = (X_1, X_2, \dots, X_p)$ comes from a multivariate Gaussian distribution, with class mean $\mu_k$ and common covariance $\mathrm{Cov}(X) = \Sigma$. In this case

$$\Pr(X = x \mid Y = k) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) \right).$$

Figure 4.2: A plot showing an example of an LDA classification.

It can be shown by combining this formula with Bayes' theorem that an observation $X$ will be assigned to the class for which

$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$$

is largest, where $\mu_k$ is the mean vector of all observations with class $k$ [20]. LDA estimates the $\Sigma$, $\mu_k$s, and $\pi_k$s needed to calculate $\delta_k(x)$ and assigns the observation to its class. Figure 4.2 shows a plot for a two-variable data set with three factors for the response.

4.3 Random Forest

Logistic Regression and LDA have linear equations at their base; the next three methods are much more flexible and assume almost nothing about the structure of the original data. The same Random Forest as described in Chapter 3 was used again to fine-tune the data, even in the case where Random Forest was originally used for variable selection, which is encouraged [6].

4.4 Support Vector Machine

The Support Vector Classifier, SVC, is a classification technique that was developed for a two-level response variable [9]. The method attempts to split the data points into two groups with a hyperplane separator, $\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = 0$. Any points whose response variable value does not match the label of the group are considered misclassified. In most cases the data cannot be perfectly separated, so a margin is

also built around the hyperplane. The classifier comes from optimizing the following system of equations:

maximize $M$ over $\beta_0, \beta_1, \dots, \beta_p, \epsilon_1, \dots, \epsilon_n$

subject to $\sum_{j=1}^{p} \beta_j^2 = 1$,

$y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) \ge M(1 - \epsilon_i)$ for $i = 1, \dots, n$,

$\epsilon_i \ge 0$ for $i = 1, \dots, n$, and $\sum_{i=1}^{n} \epsilon_i \le C$,

where $(x_{i1}, x_{i2}, \dots, x_{ip}, y_i)$ is a data point, $C$ is a non-negative tuning parameter, and the $\epsilon_i$s are slack variables [20]. The slack variables are terms that allow the data point room to be misclassified. If $\epsilon_i = 0$ then the $i$th data point is on the correct side of the hyperplane, if $0 < \epsilon_i < 1$ the data point is in the margin but still on the correct side of the hyperplane, and if $\epsilon_i > 1$ the data point is on the wrong side of the hyperplane [9]. Figure 4.3 shows the $\epsilon$ values of a few points, one from each possibility. $C$ is related to the cost parameter, and it is a bound on how many data points are allowed to be misclassified [15]. The higher the cost used in the model, the smaller the margins are forced to be. In Figure 4.4, the same data is modeled with different cost values, causing different classifications and margins to appear.

Figure 4.3: Example of ε values in an SVC graph.

The most interesting aspect of the SVC is that the observations that are outside of the margin, on the correct side, have no effect on the placement of the hyperplane. All the points that are not circled in the SVC plots in Figure 4.4 could be shifted around without changing the boundaries, as long as they did not move into the margin while shifting. The data points that do define the hyperplane and margin are called support vectors [20]. The Support Vector Machine, SVM, extends the idea of an SVC to a model with a non-linear decision boundary [20]. This is accomplished by enlarging the feature space using kernels so that the decision boundary is linear in the enlarged space, but non-linear in the original space [9]. The kernel uses the solution to the SVC optimization problem that involves only the inner products of the observations [20].

Figure 4.4: Example of cost differences in an SVC graph (panels: Cost=1, Cost=5, and Cost=20, among others).

In this study, the linear, polynomial, and radial kernels are all considered. Below are the kernels used in each case:

Linear: $K(X_i, X_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}$

Polynomial: $K_P(X_i, X_{i'}) = \left( 1 + \sum_{j=1}^{p} x_{ij} x_{i'j} \right)^d$

Radial: $K_R(X_i, X_{i'}) = \exp\left( -\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right)$

Here $d$ is the desired degree of the polynomial, and $\gamma$ is another positive tuning parameter for the radial kernel.
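These kernels map directly onto the kernel argument of svm() in R's e1071 package, which the thesis tunes in Chapter 5. A sketch on placeholder data (y must be a factor for classification); note that e1071 actually parameterizes the polynomial kernel as (γ u'v + coef0)^d rather than the textbook form above:

    library(e1071)

    # train has a factor response column y; parameter values are placeholders.
    fit.lin  <- svm(y ~ ., data = train, kernel = "linear",     cost = 1)
    fit.rad  <- svm(y ~ ., data = train, kernel = "radial",     cost = 1,
                    gamma = 0.1)
    fit.poly <- svm(y ~ ., data = train, kernel = "polynomial", cost = 1,
                    gamma = 0.1, degree = 2)

    # Predicted classes on the test set from the radial-kernel machine.
    y.hat <- predict(fit.rad, newdata = test)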

4.5 K-Nearest Neighbor

K-Nearest Neighbors Classification, or KNN, is a non-parametric model that chooses the assignment of an observation based on the other observations closest to it. The $K$ tells how many neighbors to consider, and its value can greatly change the model. Figure 4.5 shows how the classification can change with the value of $K$. Though there are multiple ways to define distance, Euclidean distance is usually the most common.

Figure 4.5: Example of nearest neighbor graphs with different K values (panels: 1-, 3-, 10-, and 15-nearest neighbor).

KNN estimates the conditional probability with

$$\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j)$$

where $x_0$ is the observation being classified, $N_0$ is the set of $K$ observations closest to $x_0$, and $I(\cdot)$ is the indicator function [20]. Then the observation is assigned to whichever class has the highest probability. KNN is strong in that there are almost no assumptions made by the model, so it can work when very little is known about

the structure of the data set [10]. Its drawback, though, is that it suffers from the curse of dimensionality; the more predictor variables present, the more dimensions are used, and the sparser the data gets in the $p$-dimensional mapping [3].
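A sketch of KNN with the class package (the thesis calls its knn() function [40]); unlike the previous methods there is no fitted model, since the training data itself is used at prediction time:

    library(class)

    # Each test row is assigned the majority class among its k nearest
    # training rows in Euclidean distance; inputs are placeholder matrices.
    y.hat <- knn(train = x.train, test = x.test, cl = factor(y.train), k = 15)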

Chapter 5

Results

5.1 Methods Evaluation and Tuning Parameter Selection

5.1.1 Cross-Validation and Out of Bag Error

In order to estimate the test error and tuning parameter values, the resampling method of Cross Validation is employed. In most statistical analysis situations, there is no predetermined validation, or test, set to assess the model with. Cross Validation works by segmenting the data and creating temporary test sets to estimate the error rate of the model. There are several different types of Cross Validation, depending on how many segments are created in the process. One method is called leave-one-out cross validation, LOOCV, where all but one observation is used as the training set and the test error is calculated from the single observation. However, this method has a high computational cost and may not be reasonable for certain methods or large data sets. On the other end, using only one train and test set will make the

estimated error have high variance. The most common compromise between the two is 10-Fold Cross Validation [22]. First the data is randomly broken up into 10 groups, or folds. Then 9 of the folds are used as a training set, and the 10th is used as a test set, as diagrammed in Figure 5.1. A model is built on the training set, and then the error is calculated on the test set; most commonly used in regression is the mean square error

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad [20].$$

For the classification setting, the classification error rate is used, which is described in the next section. Then a new unique choice of the 9 sets is made, and the new test set is used again to calculate the error. Finally, all 10 of the error values are averaged together and given as the Cross Validation error. These error values are uncorrelated enough to keep the variance reasonably low, but they also vary enough to keep bias low as well, so the 10-Fold Cross Validation error is accepted as an accurate estimate of the true test error [20]. It is used to estimate the test error for all classification methods.

Figure 5.1: A representation of how Cross Validation makes training and test sets.
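A hand-rolled sketch of the 10-fold procedure for one classifier (logistic regression here); packages such as boot wrap the same idea, and the fold assignment below is an illustrative choice:

    set.seed(1)
    n    <- nrow(train)
    fold <- sample(rep(1:10, length.out = n))  # random fold labels

    errs <- sapply(1:10, function(k) {
      fit <- glm(y ~ ., data = train[fold != k, ], family = binomial)
      p   <- predict(fit, newdata = train[fold == k, ], type = "response")
      mean(as.integer(p > 0.5) != train$y[fold == k])  # held-out fold error
    })
    cv.error <- mean(errs)  # 10-fold CV estimate of the test error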

For estimating parameters, a list of possible values must be chosen beforehand; usually a range for the values is selected through practice. Then for each choice of parameter value, a 10-fold Cross Validation is run and the error is recorded. The parameter from the model with the least CV error is then chosen as the best value for the parameter. 10-fold CV is used in this study to estimate λ for the LASSO, K for KNN, and cost, γ, and degree for the SVM. For the Random Forest Variable Selection, as the bootstrapping process builds the model, it can estimate the error of the model, so no extra Cross Validation calculation is needed [16]. The random sample of observations that bootstrapping takes for each tree creates a training set, and the left-out, or Out Of Bag, observations are used as a testing set [4]. If enough trees are grown for the Random Forest, the Out Of Bag error is very similar to the 10-fold Cross Validation error [20].

5.1.2 Classification Error Rate and the ROC Curve

For each variable selection method, the seven different classification methods were run on the remaining variable sets. There are then two ways the results are compared: classification error rate and area under the ROC curve. The classification error rate is used in place of the regression's mean squared error, but the interpretations are equivalent. Classification error gives the percent of observations whose predicted class does not match their true class. A low classification error rate means the model is accurate at predicting the response value of an observation.

Figure 5.2: An example of possible ROC curves.

The Receiver Operating Characteristic, ROC, Curve is a plot of the true positive rate of a model against the false positive rate [14]. It works by taking the probabilities calculated in a model, sliding the cutoff percentage from 0 to 1, and calculating the error rates for each cutoff. The area under the ROC curve, AUC, is a measure of how accurate the model is: the more area, the more accurate the model [8]. A strong ROC curve will hug the upper left corner of the graph, while a curve close to the diagonal is not much more use than randomly guessing the class a data point falls into. Figure 5.2 shows an example of each case.
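One way to trace the curve and compute its area in R is the pROC package (an illustrative choice; the thesis does not name the tool it used):

    library(pROC)

    # p.hat: model probabilities on the test set; y.test: true 0/1 labels.
    r <- roc(response = y.test, predictor = p.hat)
    auc(r)   # area under the ROC curve
    plot(r)  # the curve itself: TPR against FPR as the cutoff slides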

5.2 Variable Selection Results

To start the process of analyzing the SNP data, the train and test sets are read into R, and the highly correlated variables are removed. Then for each method, the selection is run and a test error is calculated for later comparisons. Finally the train and test sets are reduced to the subset of variables selected for further study.

5.2.1 LASSO

To begin the LASSO, the cv.glmnet function [12] is used to find the best value for λ by a 10-fold Cross Validation. For all three scenarios, a range of 100 λs from $10^{-5}$ to $10^{1}$ is checked. In Figure 5.3, the left graphs show the coefficients decreasing to zero as λ increases. The right graphs show the calculations of the CV process to find the best λ. The top row is the Sparse results, the middle is the Dense, and the bottom is the Interaction. In the Sparse coefficient graph, there are five lines that are reduced to zero much more slowly than the rest of the set. These lines most likely correspond to the five variables used in that scenario: V100, V199, V300, V400, V500. The Dense scenario coefficients are not as separated as the Sparse, but there does appear to be a set of three variables that take longer to reach zero. The Interaction case looks similar to the Dense, except for one variable that has a very high coefficient value compared to the rest. For each scenario, the λ value chosen is the one minimizing the CV error curve shown in Figure 5.3.

Figure 5.3: Results of the LASSO λ selection process for the three scenarios (left: coefficient paths against the L1 norm; right: CV error against log(λ); rows: Sparse, Dense, Interaction).
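A sketch matching the search just described: cv.glmnet [12] over a grid of 100 λ values from $10^{-5}$ to $10^{1}$, on placeholder training objects:

    library(glmnet)

    grid <- 10^seq(-5, 1, length.out = 100)
    cv.fit <- cv.glmnet(x.train, y.train, family = "binomial", alpha = 1,
                        lambda = grid, nfolds = 10)

    plot(cv.fit)                      # CV error against log(lambda)
    lambda.best <- cv.fit$lambda.min  # value minimizing the CV error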

With these λ values, a final LASSO model is made for each scenario. The table below gives the results of the variables selected. The details of which variables are selected are given in Table 6.2 in Appendix B. A neighbor is counted when a true variable is missed but the SNP right next to it is selected; for example, if V405 is not selected by the model but V404 is, V404 would count as a neighbor variable. For the Sparse case, the LASSO places the five true variables at the top of the list, with an extra 14 variables after. For the Dense case, the first 20 variables on the list are true variables, and most of the rest follow shortly. The last true variable, V412, is about two-thirds down the list of 79 total variables. The Interaction model selects 97 total variables, with the first 23 being true variables. Here the model misses 5 of the variables but picks up 4 neighbors, so it manages to miss only one variable completely.

LASSO Results (number of true, neighbor, and missing variables per scenario):

Scenario       True Variables   Neighbor Variables   Missing Variables
Sparse
Dense
Interaction

5.2.2 χ² Test

In the second variable selection method, the χ² tests, there is no model built and no tuning parameters that need calculating. In R, the function chisq.test() [35] is used

to run the χ² tests, and any SNP with a p-value less than 0.05 is considered dependent. The table below gives a brief summary of the results, with the details appearing in Appendix B in Tables 6.3 and 6.4. Each variable that is significant at the 0.05 level is kept to be further analyzed. As with the previous variable selection method, the χ² tests find all five true variables in the Sparse scenario. They are among 93 variables that are deemed significant. In the Dense scenario, all but three variables are found, but those three do have neighbors that are included in the model. A total of 135 variables are listed as significant. The Interaction scenario has 6 true variables missing, but only 4 have neighbors in the model. Variables V310 and V412 are missed by the model. There are only 99 variables in total selected, less than the Dense scenario, whereas the opposite is true for the LASSO.

χ² Results (number of true, neighbor, and missing variables per scenario):

Scenario       True Variables   Neighbor Variables   Missing Variables
Sparse
Dense
Interaction

5.2.3 Random Forest

The Random Forest is run using the randomForest [26] function in R. As described earlier in the chapter, the Out Of Bag error is calculated as the model is being created. In Figure 5.4 the error is graphed based on the number of trees currently grown. For all scenarios, as the number of trees approaches 500, the error levels off. Sparse ends up with about 30% classification error. Dense stays at about 22% error for most of the process. The Interaction case quickly reaches a classification error rate of about 15%. The error rate in both Dense and Interaction appears to be better for the disease case than the non-disease case, as the red lines in both graphs are above the OOB error and the green is below. This is acceptable, as it would be more important to get the disease cases correctly identified than the non-disease cases. The variables are ordered by a measure of variable importance called decrease in accuracy. It measures how much accuracy in the model is lost when the variable is removed. For the Random Forest, the top 100 variables by variable importance are chosen in each scenario. A quick summary of the results is given in the table below, and the particulars are in Appendix B, Tables 6.5 and 6.6.

Random Forest Results (number of true, neighbor, and missing variables per scenario):

Scenario       True Variables   Neighbor Variables   Missing Variables
Sparse
Dense
Interaction

Figure 5.4: Collection of graphs from Random Forest error output (error against number of trees for the Sparse, Dense, and Interaction scenarios).
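A sketch of reading off the quantities shown in Figure 5.4, plus the importance ranking, from a fitted randomForest object (placeholder data):

    library(randomForest)

    rf <- randomForest(x = x.train, y = factor(y.train), ntree = 500,
                       importance = TRUE)

    # OOB error by number of trees, as in Figure 5.4; the "OOB" column is
    # the overall rate, the remaining columns are the per-class rates.
    plot(rf$err.rate[, "OOB"], type = "l", xlab = "trees", ylab = "OOB error")

    varImpPlot(rf, type = 1)  # ranking by mean decrease in accuracy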

In the Sparse scenario, the five true variables are listed first, in almost the same order as in the LASSO model. Unfortunately the Dense scenario is not as clear-cut. Only four true variables are listed before others start mixing into the list, and the model misses six variables, with only four having neighbors in the model. The Interaction scenario similarly has other variables mixing in after the fourth spot, and is missing eight total variables, with five having neighbors.

5.2.4 Overall

All three methods were very successful at picking up the true variables in the Sparse case. For the Dense case, the LASSO and χ² pick up all the variables, but the Random Forest drops two. The Interaction case is the hardest for the models to define; only the LASSO identifies all the variables or a neighbor. The errors from the Random Forest models suggest that the more complex the scenario is, the better the prediction results that can be achieved. When there are so few true variables compared to the total variables, the extra noise variables have more chance to be included. For the Sparse scenario, all three models include, along with the five true variables, eight other variables: V79, V99, V105, V172, V182, V416, V421, V501. In the Dense scenario, the χ² and Random Forest both have trouble with V195 and V492. Both models include V194 and V493 in their place. Finally, for the Interaction scenario, all three models miss V187 but do

get V188, and only include one of V197 or V199. In addition, two out of the three models replace V80 with V79, miss V195, and replace V405 with V404. At this stage of the analysis, it is also hard to draw conclusions about how many true variables the models are suggesting. LASSO and χ² both select more than twice the number of true variables, and for the Random Forest the number of variables selected was chosen beforehand. The classification methods used next can narrow down the possibilities.

5.3 Classification Method Results

After the variable selection methods have been applied, the seven data analysis methods are run on each set of remaining variables. For each classification method, first any tuning parameters are estimated. Then a model is built and the 10-fold Cross Validation error is calculated. Finally the model is used with the predict() function [35] and the test set to calculate the classification test error of the model.

5.3.1 Method Details

For Logistic Regression, the R function glm() [13] is used to set up a generalized linear model of the binomial family, which equates to logistic regression. Then the model is run through the function cv.glm() [7] to find the 10-fold CV error of

the model. The list of variables from the model that are significant at the 0.05 level is compiled using the output of the glm object, and the test error is calculated. Linear Discriminant Analysis uses the lda() function [40] to build the model. There is no built-in function that will run a 10-fold CV on the LDA model, so the error is calculated manually. Then the top twenty variables are determined based on the absolute value of the LDA coefficients and the test error is found. The Random Forest is created using the randomForest() function [26], and then the rfcv() function [26] gives the 10-fold CV error. Then the top twenty variables based on decrease in accuracy are listed and the test error calculated. The three Support Vector Machine models have tuning parameters that need to be estimated for the models. Cost is used in all types, γ for the radial and polynomial kernels, and degree for only the polynomial. Cost values tested are {0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10, 100}, γ values are {0.005, 0.01, 0.05, 0.1, 0.5, 1.0}, and degree values are {2, 3}. The function tune.svm() [11] is used to run 10-fold CV to estimate the parameters. Then an SVM model with the selected parameters is built using the svm() function and the test error is found. K-Nearest Neighbor has one tuning parameter to calculate before the model is run, K. The function tune.knn() [11] is used to run 10-fold CV to find the best value for K. Then the knn() function [40] builds a model with the chosen K, and the test error is calculated.
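A sketch of the tuning calls just described, with the grids from the text (e1071's tune.svm() and tune.knn() [11] both run 10-fold CV over every parameter combination; data objects are placeholders):

    library(e1071)

    costs  <- c(0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100)
    gammas <- c(0.005, 0.01, 0.05, 0.1, 0.5, 1)

    svm.tune <- tune.svm(y ~ ., data = train, kernel = "radial",
                         cost = costs, gamma = gammas)
    svm.tune$best.parameters  # CV-chosen cost and gamma

    knn.tune <- tune.knn(x = x.train, y = factor(y.train), k = 1:30)
    knn.tune$best.parameters  # CV-chosen K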

Figure 5.5: Top nineteen SNPs by variable importance for LDA and Random Forest in the LASSO Sparse scenario.

5.3.2 Comparisons Within Variable Selection Method

For the LASSO Sparse scenario, the methods all have very similar test errors, 28-30%. Logistic Classification and the Linear SVM are the lowest with 28.25% classification error and KNN the highest at 30.00%. Logistic Classification, LDA, and Random Forest all select the five true variables, but they all include extra variables as well. For Logistic Classification, V22 and V172 both have p-values under 0.05, but the five true variables all have smaller p-values still. For the LDA and Random Forest, the top twenty variables were considered, so there are an extra fifteen variables listed in each. Figure 5.5 shows the coefficient values from the models. There is a large gap in the coefficient values after the first five for LDA,

but the separation is not as clear in the Random Forest. LDA also chooses V22 and V172 as the next two most important variables, like Logistic Classification, but Random Forest has V416 and V99 instead. The full results for the LASSO Sparse scenario are given in Table 5.1. The LASSO Dense scenario has a slightly larger range of test classification errors, from 16.00% to 20.50%. Three of the methods have the 16.00% error: Logistic Classification, Linear SVM, and Radial SVM. The highest error is from the KNN model. The Logistic Classification model selects 41 total variables, 27 true variables and 14 extra, missing V306, V412, and V510. In this set, the p-values are more mixed than in the Sparse scenario: the extra variable V525 has a p-value very close to that of the true variable V505. The coefficient values for the LDA and Random Forest are shown in Figure 5.6. The LDA model chooses all true variables for the top twenty, but the Random Forest mixes together 16 true variables with 4 others: V98, V398, V507, V512. Table 5.2 gives the details of the LASSO Dense analysis. The LASSO Interaction models range from 17.75% to 25.50% test classification error. Three models have the 17.75% error: LDA, Random Forest, and Linear SVM; the highest error is from the KNN model. Logistic Classification only finds 23 true variables out of 32 variables picked. The p-values are also mixed; for example, the extra variable V390 has a p-value comparable to that of the true variable V405. LDA has one extra variable in the top twenty, V404, but Random Forest

Table 5.1: Details of models created in the LASSO Sparse scenario.

Method                         CV Error   Test Error   Variables or Parameters Chosen
Logistic Classification        17.19%     28.25%       V22 V100 V172 V199 V300 V400 V500
Linear Discriminant Analysis              28.50%       V300 V199 V500 V400 V100 V22 V172 V416 V421 V99 V118 V395 V182 V545 V501 V599 V79 V105 ...
Random Forest                             27.50%       V300 V400 V199 V500 V100 V99 V421 V182 V501 V545 V172 V393 V22 V599 V79 V105 V118 ...
SVM: Linear                               28.25%       Cost:
SVM: Radial                               29.75%       Cost: 5.0, γ:
SVM: Polynomial                           29.00%       Cost: 0.1, γ: 0.1, Degree: 2
K-Nearest Neighbor             27.17%     30.00%       K: 29

Table 5.2: Details of models created in the LASSO Dense scenario.

Method                         CV Error   Test Error   Variables or Parameters Chosen
Logistic Classification        9.99%      16.00%       V80 V84 V90 V95 V100 V187 V188 V191 V195 V197 V199 V220 V277 V292 V293 V295 V300 V302 V310 V333 V339 V400 V402 V405 V408 V410 V416 V427 V469 V489 V492 V495 V498 V500 V505 V508 V514 V517 V525 V526 V576
Linear Discriminant Analysis   13.50%     16.25%       V302 V84 V416 V90 V514 V492 V187 V498 V199 V500 V410 V95 V495 V295 V306 V195 V489 V400 V408 V80
Random Forest                  17.37%     19.25%       V84 V95 V400 V416 V410 V90 V98 V100 V408 V398 V508 V507 V510 V514 V495 V302 V306 V512 V500 V505
SVM: Linear                    13.23%     16.00%       Cost: 0.5
SVM: Radial                    13.67%     16.00%       Cost: 1.0, γ:
SVM: Polynomial                13.93%     16.25%       Cost: 0.01, γ: 0.1, Degree: 2
K-Nearest Neighbor             18.80%     20.50%       K: 30

[Figure 5.6: Top twenty SNPs by variable importance for LDA and Random Forest in the LASSO Dense scenario.]

Figure 5.7 shows the graphs of the LDA and Random Forest coefficients. Both graphs show that variable V302 is extremely influential in the model, and no other variables come close, most likely because of all the interaction terms V302 is involved in. Table 5.3 gives the details of the models.

The χ² Sparse methods have approximately the same classification error rates as the LASSO models, 28.75% to 30.75%. Here the lowest error is from the Radial SVM model, and the highest is the Linear SVM. The Logistic Classification model identifies all five true variables along with four extra: V79, V172, V194, V304. There is a large separation of p-values between the two types of variables, similar to the LASSO; the largest p-value among the true variables is for V100 and the smallest among the extra variables is for V172.

[Figure 5.7: Top twenty SNPs by variable importance for LDA and Random Forest in the LASSO Interaction scenario.]

[Figure 5.8: Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Sparse scenario.]

LASSO - INTERACTION

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 8.91% | 18.00% | V80 V84 V90 V95 V137 V160 V188 V197 V247 V295 V300 V302 V310 V358 V390 V400 V404 V405 V412 V416 V483 V489 V495 V496 V498 V500 V503 V506 V508 V510 V514 V526
Linear Discriminant Analysis | 12.13% | 17.75% | V302 V90 V84 V295 V197 V416 V405 V188 V300 V510 V514 V500 V495 V404 V489 V306 V412 V95 V310 V498
Random Forest | 14.50% | 17.75% | V302 V306 V95 V90 V84 V100 V97 V495 V416 V99 V88 V400 V510 V500 V502 V514 V508 V404 V512 V89
Support Vector Machine: Linear | 12.33% | 17.75% | Cost: 0.1
Support Vector Machine: Radial | 11.90% | 18.75% | Cost: 1.0, γ: 0.01
Support Vector Machine: Polynomial | 12.37% | 18.75% | Cost: 0.01, γ: 0.1, Degree: 2
K-Nearest Neighbor | 19.33% | 25.50% | K: 28

Table 5.3: Details of models created in the LASSO Interaction scenario.

[Figure 5.9: Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Dense scenario.]

Figure 5.8 shows the coefficients for LDA and Random Forest. For LDA we have a fairly clear distinction between the five true variables and the others. This Random Forest model has a bit more of a gap than the LASSO Random Forest, but it is not as clear as either LDA model. All three models choose different noise variables to follow the top five. The details of the models' results are given in Table 5.4.

The χ² Dense models all have test classification error between 17.00% and 21.00%, with the Polynomial SVM the lowest and KNN the highest. The Logistic Classification model chooses about the same number of variables as the LASSO: 38 total variables, with 25 true and 13 extra. The p-values do not show any clear distinction; the true variable V405 has a higher p-value, 0.040, than the noise variable V525.

χ² - SPARSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 17.92% | 30.25% | V79 V100 V172 V194 V199 V300 V304 V400 V500
Linear Discriminant Analysis | 27.00% | 30.25% | V300 V199 V500 V100 V400 V173 V172 V420 V97 V304 V418 V424 V508 V510 V410 V402 V412 V174 V405 …
Random Forest | 26.37% | 30.00% | V300 V199 V500 V400 V100 V416 V99 V98 V502 V398 V304 V408 V410 V97 V501 V95 V421 V90 V493 V495
Support Vector Machine: Linear | 26.43% | 30.75% | Cost: …
Support Vector Machine: Radial | 26.97% | 28.75% | Cost: 1.0, γ: …
Support Vector Machine: Polynomial | 26.87% | 29.00% | Cost: 0.01, γ: 0.1, Degree: 2
K-Nearest Neighbor | 31.37% | 30.50% | K: 23

Table 5.4: Details of models created in the χ² Sparse scenario.

[Figure 5.10: Top twenty SNPs by variable importance for LDA and Random Forest in the χ² Interaction scenario.]

Figure 5.9 shows the LDA and Random Forest variable coefficient values. LDA is able to identify 17 true variables in the top twenty, with V509, V506, and V409 scattered throughout. The coefficient values appear to be separated into distinct layers on the graph, almost like steps, whereas the other graphs tend to have only one gap. The Random Forest model is only able to select 11 true variables out of the top twenty, compared to the 16 that the LASSO version finds. Table 5.5 gives the details of the models.

The test classification errors from the χ² Interaction scenario models are all within 1.00% of each other, except for the KNN model. The Linear and Radial SVM models have 15.50% error; Logistic Classification, LDA, and Random Forest have 16.50% error; and the KNN model has 21% error. Logistic Classification selects only 20 variables in total: 16 true variables plus V301, V313, V401, and V568.

χ² - DENSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 10.94% | 17.25% | V75 V84 V90 V95 V100 V187 V188 V191 V197 V199 V293 V295 V300 V302 V310 V313 V333 V400 V402 V405 V410 V416 V427 V467 V489 V493 V494 V495 V498 V500 V503 V505 V508 V509 V514 V525 V526 V568
Linear Discriminant Analysis | 15.43% | 17.75% | V84 V302 V509 V506 V405 V498 V416 V90 V495 V187 V95 V199 V310 V409 V508 V505 V306 V400 V100 V410
Random Forest | 18.47% | 20.75% | V416 V84 V95 V400 V410 V90 V408 V98 V100 V97 V404 V508 V88 V398 V514 V99 V507 V510 V415 V512
Support Vector Machine: Linear | 14.90% | 18.00% | Cost: 0.01
Support Vector Machine: Radial | 15.30% | 17.25% | Cost: 0.5, γ: 0.01
Support Vector Machine: Polynomial | 15.33% | 17.00% | Cost: 0.01, γ: 0.1, Degree: 2
K-Nearest Neighbor | 20.27% | 21.00% | K: 23

Table 5.5: Details of models created in the χ² Dense scenario.

[Figure 5.11: Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Sparse scenario.]

As with the dense model, the p-values do not show clear separation; the p-values of the true variable V510 and the noise variable V568 are comparable. LDA has 17 true variables in its top twenty, while Random Forest has 14. Both are pictured in Figure 5.10. All three models have different noise variables present: LDA has V85, V83, and V301, and Random Forest has V88, V98, V97, V99, V493, and V507. Both graphs show V302 as being very important in the models, just as the LASSO Interaction models did. The details of the models are given in Table 5.6.

The test classification error range for the Random Forest Variable Selection Sparse scenario is slightly larger than for the other two variable selection methods, from 28.25% for the Polynomial SVM to 34% for the KNN. This Logistic Classification model chooses the most noise variables of the three sparse models: 9 extra variables, compared to 4 with χ² and 2 with the LASSO.

χ² - INTERACTION

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 10.33% | 16.50% | V84 V90 V95 V188 V295 V300 V301 V302 V313 V401 V404 V416 V489 V495 V498 V500 V508 V510 V514 V568
Linear Discriminant Analysis | 13.57% | 16.50% | V302 V90 V188 V416 V84 V85 V295 V510 V300 V514 V500 V95 V495 V83 V489 V498 V508 V306 V404 V301
Random Forest | 15.17% | 16.50% | V302 V306 V95 V90 V100 V84 V88 V416 V98 V97 V99 V400 V495 V510 V404 V500 V508 V493 V514 V507
Support Vector Machine: Linear | 14.10% | 15.50% | Cost: …
Support Vector Machine: Radial | 13.93% | 15.50% | Cost: 5.0, γ: …
Support Vector Machine: Polynomial | 13.87% | 15.75% | Cost: 0.1, γ: 0.05, Degree: 2
K-Nearest Neighbor | 19.87% | 21.00% | K: 25

Table 5.6: Details of models created in the χ² Interaction scenario.

[Figure 5.12: Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Dense scenario.]

There is still a separation between the true variables and the noise variables of a factor of approximately 10^5. LDA and Random Forest, plotted in Figure 5.11, both choose the five true variables as the top five out of twenty, but again only the LDA model appears to make a clear gap between the variable types. Table 5.7 shows that the noise variables of LDA and Random Forest are quite different, but the extra LDA variables appear to align with the Logistic Classification noise variables.

For the Random Forest Variable Selection Dense scenario, the test classification error ranges from 15.25% to 21.00%. The Polynomial SVM has the lowest error, while the Random Forest has the largest. Logistic Classification chooses a total of 33 variables, 22 true and 11 noise variables, the lowest ratio and lowest total number of the three variable selection models.

RANDOM FOREST - SPARSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 17.97% | 29.00% | V100 V194 V199 V273 V300 V304 V379 V387 V393 V400 V421 V500 V562 …
Linear Discriminant Analysis | 26.70% | 28.75% | V300 V199 V500 V100 V400 V273 V304 V562 V97 V508 V402 V159 V414 V379 V387 V412 V510 V404 V249 …
Random Forest | 26.57% | 31.00% | V300 V199 V500 V400 V100 V416 V99 V502 V304 V98 V398 V410 V495 V404 V95 V408 V84 V421 V188 …
Support Vector Machine: Linear | 26.03% | 29.00% | Cost: …
Support Vector Machine: Radial | 26.50% | 29.00% | Cost: 1.0, γ: …
Support Vector Machine: Polynomial | 27.17% | 28.25% | Cost: 0.01, γ: 0.1, Degree: 2
K-Nearest Neighbor | 31.60% | 34.00% | K: 21

Table 5.7: Details of models created in the Random Forest Sparse scenario.

[Figure 5.13: Top twenty SNPs by variable importance for LDA and Random Forest in the Random Forest Variable Selection Interaction scenario.]

The p-values between the true and noise variables are also mixed; the true variable V408 and the noise variable V421 have comparable p-values. Figure 5.12 shows the comparisons of LDA and Random Forest. LDA has 18 true variables in its top twenty, with the extra variables V509 and V506 taking quite high spots on the list. The Random Forest model chooses only 14 true variables out of the top twenty, with noise variables showing up in the bottom two thirds of the list. The full details of the models are given in Table 5.8.

And finally, the Random Forest Variable Selection Interaction scenario models have test classification errors from 15.75% to 22.00%. Both the Radial SVM and the LDA models have the low error, while KNN again has the highest error. Logistic Classification lists 25 variables as important, but only 19 are true variables.

RANDOM FOREST - DENSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 10.75% | 16.00% | V15 V75 V84 V90 V95 V98 V100 V187 V188 V195 V199 V292 V300 V302 V309 V400 V402 V408 V410 V416 V421 V466 V489 V492 V493 V495 V498 V500 V505 … V509 V514 …
Linear Discriminant Analysis | 15.07% | 17.00% | V302 V84 V509 V416 V495 V506 V187 V90 V498 V95 V492 V410 V100 V400 V505 V514 … V489 V199 V188
Random Forest | 18.43% | 21.00% | V84 V410 V416 V95 V90 V400 V408 V514 V98 V507 V100 V88 V508 V404 V398 V97 V302 V99 V415 V510
Support Vector Machine: Linear | 15.03% | 15.75% | Cost: 0.5
Support Vector Machine: Radial | 14.90% | 17.00% | Cost: 1.0, γ: …
Support Vector Machine: Polynomial | 14.97% | 15.25% | Cost: 0.1, γ: 0.05, Degree: 2
K-Nearest Neighbor | 20.10% | 19.00% | K: 25

Table 5.8: Details of models created in the Random Forest Dense scenario.

The p-values are mixed once again, with V306 having a p-value of 0.041, which is more than that of the noise variable V23. LDA has 18 true variables in its top twenty, with the extra variables V85 and V83. Random Forest has true variables in the top six spots, but mixes in 6 noise variables after that. As with the two other Interaction setups, the variable V302 is found to be much stronger than any other variable, as seen in Figure 5.13. Table 5.9 gives the details of the models.

For all the LASSO models, all methods but KNN have approximately the same test classification error to within ±1%. In Figure 5.14 the ROC curves are shown for the different scenarios. In the Sparse case the curves are practically indistinguishable, and the AUC values are all very close. In the Dense scenario, two models fall slightly below the others, Random Forest and KNN; the AUC for these two models is about .03 less than the others. For the Interaction scenario, KNN clearly struggles to match the other methods in accuracy, and the Random Forest curve is missing some key areas as well. Overall, the Linear SVM appears to have the best predictive power in the LASSO scenarios, but the LDA models are better at finding the most important variables.

For the χ² models, all the test classification errors are very similar, except for the KNN error in the Interaction scenario. The ROC curves are given in Figure 5.15 for the three scenarios. Though KNN does not have the highest test error in the Sparse scenario, it does have the lowest AUC value.

RANDOM FOREST - INTERACTION

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 9.60% | 17.00% | V23 V84 V85 V90 V95 V188 V295 V300 V302 V306 V310 V373 V388 V404 V412 V416 V437 V489 V495 V498 V500 V503 V508 V510 V514
Linear Discriminant Analysis | 13.17% | 15.75% | V302 V90 V84 V416 V510 V85 V295 V188 V300 V405 V514 V500 V83 V495 V95 V498 V412 V508 V306 V489
Random Forest | 15.33% | 17.25% | V302 V306 V95 V90 V416 V84 V97 V495 V99 V100 V88 V98 V400 V514 V404 V510 V500 V507 V502 V493
Support Vector Machine: Linear | 13.27% | 16.25% | Cost: 0.5
Support Vector Machine: Radial | 13.23% | 15.75% | Cost: 5.0, γ: …
Support Vector Machine: Polynomial | 13.23% | 16.25% | Cost: 0.001, γ: 0.5, Degree: 2
K-Nearest Neighbor | 20.27% | 22.00% | K: 25

Table 5.9: Details of models created in the Random Forest Interaction scenario.

[Figure 5.14: ROC curves for LASSO scenarios (Sparse, Dense, and Interaction panels, with per-method AUC values in the legends).]

The Linear SVM had the highest error, but it has only slightly lower AUC than the other methods. In the Dense scenario, it appears again that the Random Forest and KNN models do not predict as well as the others, which matches the test error values. Finally, for the Interaction case, KNN is much lower than the other ROC curves, with an AUC value .05 less than the best performing model, the Polynomial SVM. The Radial and Polynomial SVM models appear to work best with the χ² variable selection method. For variable selection, Random Forest is the most accurate when looking for the top five to seven most important variables, though LDA is better at finding the most out of twenty.

Once again, within the Random Forest Variable Selection scenarios, all the test classification errors are very close, except for KNN and the Random Forest. Looking at the ROC curves in Figure 5.16, it is clear in the Sparse scenario that KNN falls short of the rest of the models. In the Dense scenario, both the KNN and Random Forest curves fall below the others, which is consistent with the test classification errors. In the final Interaction scenario, KNN's AUC value is more than .06 less than the best curve, the Polynomial SVM. Overall, the SVM models, either the Radial or the Polynomial, appear to work best with the Random Forest Variable Selection for prediction. For variable selection, the situation is the same as with χ²: Random Forest is best for finding the top few variables, but if searching for a larger number, LDA collects more true variables. Most often Random Forest and LDA pick different noise variables, so a comparison of the two might improve the accuracy of the variable selection if more than the top twenty terms were compared.
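For reference, an ROC curve and its AUC can be computed directly from predicted class probabilities. The base-R helpers below are a sketch under assumed inputs (`prob`, a vector of predicted probabilities, and `y`, the true 0/1 labels), not the plotting code used for the figures:

```r
# Sweep the classification threshold to trace the ROC curve, then
# integrate it with the trapezoid rule to obtain the AUC.
roc_points <- function(prob, y) {
  cuts <- sort(unique(prob), decreasing = TRUE)
  tpr  <- sapply(cuts, function(t) mean(prob[y == 1] >= t))  # sensitivity
  fpr  <- sapply(cuts, function(t) mean(prob[y == 0] >= t))  # 1 - specificity
  data.frame(fpr = c(0, fpr, 1), tpr = c(0, tpr, 1))
}
auc <- function(prob, y) {
  r <- roc_points(prob, y)
  sum(diff(r$fpr) * (head(r$tpr, -1) + tail(r$tpr, -1)) / 2)
}
```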

[Figure 5.15: ROC curves for χ² scenarios (Sparse, Dense, and Interaction panels, with per-method AUC values in the legends).]


[Figure 5.16: ROC curves for Random Forest scenarios (Sparse, Dense, and Interaction panels, with per-method AUC values in the legends).]

Table 5.10: Logistic Classification Summary

Type | CV Error | Test Error | Proportion of True Variables Found | AUC
NS - Sparse | 21.48% | 33.25% | 5/5 | …
LASSO - Sparse | 17.19% | 28.25% | 5/5 | …
χ² - Sparse | 17.92% | 30.25% | 5/5 | …
RF - Sparse | 17.97% | 29.00% | 5/5 | …
NS - Dense | 13.97% | 17.50% | 27/… | …
LASSO - Dense | 9.99% | 16.00% | 27/… | …
χ² - Dense | 10.94% | 17.25% | 25/… | …
RF - Dense | 10.75% | 16.00% | 22/… | …
NS - Inter | 13.84% | 17.50% | 23/… | …
LASSO - Inter | 8.91% | 18.00% | 23/… | …
χ² - Inter | 10.33% | 16.50% | 16/… | …
RF - Inter | 9.60% | 17.00% | 16/… | …

Comparisons Across Variable Selection Method

In this section, the performance of the different classification methods will be compared over all three variable selection methods and a fourth set where no variable selection was performed. The details of the No Selection models are given in Appendix C.

Logistic Classification is a very simple and powerful model that works decently across all situations. Unfortunately, all the models appear to have a large difference between CV error and actual test error; on average the error increases by 10%. Table 5.10 shows the details of how the variable selection methods affect the Logistic Classification model. The Cross-Validation error improves quite a bit from No Selection to the others, but the test error stays relatively similar.
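The CV errors in these summary tables come from k-fold cross-validation. The sketch below shows the general shape of that computation for the logistic model, assuming a data frame `train` with a 0/1 response `y` and a selected-SNP vector `keep`; the thesis' own fold counts and random seeds are not reproduced here:

```r
# Minimal sketch of 10-fold cross-validated classification error.
k    <- 10
fold <- sample(rep(1:k, length.out = nrow(train)))   # random fold labels
cv_err <- sapply(1:k, function(i) {
  fit  <- glm(y ~ ., data = train[fold != i, c("y", keep)],
              family = binomial)
  prob <- predict(fit, newdata = train[fold == i, ], type = "response")
  mean((prob > 0.5) != train$y[fold == i])           # held-out fold error
})
mean(cv_err)   # CV estimate of the classification error
```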

Table 5.11: Linear Discriminant Analysis Summary

Type | CV Error | Test Error | Proportion of True Variables Found | AUC
NS - Sparse | 31.43% | 32.75% | 5/5 | …
LASSO - Sparse | 25.10% | 28.50% | 5/5 | …
χ² - Sparse | 27.00% | 30.25% | 5/5 | …
RF - Sparse | 26.70% | 28.75% | 5/5 | …
NS - Dense | 17.60% | 17.50% | 17/20 | …
LASSO - Dense | 13.50% | 16.25% | 20/20 | …
χ² - Dense | 15.43% | 17.75% | 17/20 | …
RF - Dense | 15.07% | 17.00% | 18/20 | …
NS - Inter | 15.83% | 18.50% | 14/20 | …
LASSO - Inter | 12.13% | 17.75% | 19/20 | …
χ² - Inter | 13.57% | 16.50% | 17/20 | …
RF - Inter | 13.17% | 15.75% | 14/20 | …

The proportion of variables found stays the same or even decreases, most likely due to the variable selection method missing variables. The AUC of the models does increase with all of the methods; it appears to increase the most with the LASSO for the Sparse and Dense scenarios, and with Random Forest for the Interaction. Overall, Logistic Regression was not much improved by the variable selection methods.

The Linear Discriminant Analysis models are summarized in Table 5.11. LDA has much higher CV classification error than Logistic Classification, but the error increases much less from CV to test, so the methods end up looking very similar.
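A hedged sketch of these LDA fits follows, using MASS::lda; the response `y`, the selected columns `keep`, and the data frames are assumed names, and the coefficient ranking only loosely mirrors the importance orderings plotted in the figures:

```r
# Minimal sketch: LDA on the selected SNPs, ranking variables by the
# magnitude of their discriminant coefficients.
library(MASS)
train$y <- factor(train$y); test$y <- factor(test$y)
ld <- lda(y ~ ., data = train[, c("y", keep)])
sort(abs(ld$scaling[, 1]), decreasing = TRUE)   # coefficient magnitudes
pred <- predict(ld, newdata = test)$class
mean(pred != test$y)                            # test classification error
```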

Table 5.12: Random Forest Summary

Type | CV Error | Test Error | Proportion of True Variables Found | AUC
NS - Sparse | 27.63% | 28.00% | 5/5 | …
LASSO - Sparse | 27.47% | 27.50% | 5/5 | …
χ² - Sparse | 26.37% | 30.00% | 5/5 | …
RF - Sparse | 26.57% | 31.00% | 5/5 | …
NS - Dense | 18.83% | 20.25% | 14/20 | …
LASSO - Dense | 17.37% | 19.25% | 16/20 | …
χ² - Dense | 18.47% | 20.75% | 12/20 | …
RF - Dense | 18.43% | 21.00% | 13/20 | …
NS - Inter | 15.73% | 18.50% | 15/20 | …
LASSO - Inter | 14.50% | 17.75% | 13/20 | …
χ² - Inter | 15.17% | 16.50% | 14/20 | …
RF - Inter | 15.33% | 17.25% | 14/20 | …

In the No Selection case, the errors for the Sparse and Dense scenarios barely change, and the No Selection errors appear comparable to the other variable selection errors. Since only the top twenty variables of LDA were chosen, the proportion of variables is out of twenty for the Dense and Interaction scenarios. Here we do see an improvement with the use of the LASSO; the Dense scenario gains 3 variables and the Interaction scenario gains 5. It appears that LDA and the LASSO work well together. All of the methods also show an increase in AUC over the No Selection models, though it varies in each scenario which model has the highest AUC.

Table 5.12 details the results of the Random Forest classification. The CV classification error approximates the test error well, unlike Logistic Classification.
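A minimal sketch of the Random Forest fit follows, with the randomForest package; the out-of-bag (OOB) error is the internal analogue of the CV error, and MeanDecreaseAccuracy is the importance score plotted in the figures. The data frame `train` is an assumed name:

```r
# Minimal sketch: Random Forest classification with OOB error and
# permutation importance (MeanDecreaseAccuracy).
library(randomForest)
train$y <- factor(train$y)
rf <- randomForest(y ~ ., data = train, importance = TRUE)
rf$err.rate[rf$ntree, "OOB"]          # OOB estimate of the error rate
imp <- importance(rf, type = 1)       # type 1 = MeanDecreaseAccuracy
imp[order(imp[, 1], decreasing = TRUE)[1:20], , drop = FALSE]  # top 20 SNPs
```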

Table 5.13: Linear Support Vector Machine Summary

Type | CV Error | Test Error | AUC
NS - Sparse | 30.43% | 30.00% | …
LASSO - Sparse | 25.13% | 28.25% | …
χ² - Sparse | 26.43% | 30.75% | …
RF - Sparse | 26.03% | 29.00% | …
NS - Dense | 15.87% | 16.25% | …
LASSO - Dense | 13.23% | 16.00% | …
χ² - Dense | 14.90% | 18.00% | …
RF - Dense | 15.03% | 15.75% | …
NS - Inter | 15.23% | 19.50% | …
LASSO - Inter | 12.33% | 17.75% | …
χ² - Inter | 14.10% | 15.50% | …
RF - Inter | 13.27% | 16.25% | …

The Sparse and Interaction scenarios have similar test error to the previous two models, but Random Forest has a higher error for the Dense scenario. Like LDA, the Random Forest was limited to a maximum of 20 selected variables, and it usually has fewer variables than LDA. Comparing the No Selection variables to the others, the only improvement is from the LASSO Dense scenario. The AUC does not always increase for the Sparse scenario, but it does increase slightly for all models in the Dense and Interaction scenarios. It would appear the Random Forest classification models did not improve much with the variable selection methods.

The Linear Support Vector Machine model details are compiled in Table 5.13. The CV classification errors and the test errors are similar to the previous models.

Table 5.14: Radial Support Vector Machine Summary

Type | CV Error | Test Error | AUC
NS - Sparse | 29.40% | 27.25% | …
LASSO - Sparse | 24.93% | 29.75% | …
χ² - Sparse | 26.97% | 28.75% | …
RF - Sparse | 26.50% | 29.00% | …
NS - Dense | 16.23% | 16.50% | …
LASSO - Dense | 13.67% | 16.00% | …
χ² - Dense | 15.30% | 17.25% | …
RF - Dense | 14.90% | 17.00% | …
NS - Inter | 14.83% | 18.00% | …
LASSO - Inter | 11.90% | 18.75% | …
χ² - Inter | 13.93% | 15.50% | …
RF - Inter | 13.23% | 15.75% | …

The test errors for the Interaction scenario all decrease compared to the No Selection model, but the Sparse and Dense models vary; the χ² error increases and the others decrease. The same patterns follow in the AUC values. There are no variables shown for the SVM models. The Linear SVM analysis benefits from the LASSO and Random Forest variable selections, but not from χ².

The details of the Radial Support Vector Machine are listed in Table 5.14. There is a slightly larger increase from the CV error to the test error in the Interaction cases, but the final errors are similar to the other methods. In the Sparse scenario, the CV error of the No Selection model decreases by about 2% for the test error, so all the other Sparse test errors are higher. The errors for the Dense and Interaction methods do not appear to follow any patterns.
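The cost, γ, and degree values in the SVM rows come from a grid search. A sketch with e1071's tune.svm follows, under an assumed grid and assumed data frame `train` (the thesis' exact grid is not reproduced; the radial kernel is shown, but the same call handles the linear and polynomial kernels):

```r
# Minimal sketch: choose cost and gamma for the radial-kernel SVM by
# cross-validated grid search.
library(e1071)
train$y <- factor(train$y)
tuned <- tune.svm(y ~ ., data = train, kernel = "radial",
                  cost  = c(0.01, 0.1, 0.5, 1, 5),
                  gamma = c(0.001, 0.01, 0.1))
tuned$best.parameters    # cost/gamma pair with the lowest CV error
tuned$best.performance   # the corresponding cross-validation error
```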

Table 5.15: Polynomial Support Vector Machine Summary

Type | CV Error | Test Error | AUC
NS - Sparse | 29.10% | 27.75% | …
LASSO - Sparse | 25.20% | 29.00% | …
χ² - Sparse | 26.87% | 29.00% | …
RF - Sparse | 27.17% | 28.25% | …
NS - Dense | 15.80% | 16.75% | …
LASSO - Dense | 13.93% | 16.25% | …
χ² - Dense | 15.33% | 17.00% | …
RF - Dense | 14.97% | 15.25% | …
NS - Inter | 15.07% | 17.75% | …
LASSO - Inter | 12.37% | 18.75% | …
χ² - Inter | 13.87% | 15.75% | …
RF - Inter | 13.23% | 16.25% | …

The AUC values are slightly more ordered: in all three scenarios, the LASSO and Random Forest improve the AUC of the No Selection model. With very similar errors and an increase in AUC, it appears that the Radial SVM benefits from the LASSO and Random Forest Variable Selection methods.

The Polynomial Support Vector Machine models are detailed in Table 5.15. The No Selection Sparse error drops, just as it does in the Radial case. In fact, the other errors are almost identical, with less than a 1% difference in each pair. The AUC values are different from the Radial case, with improvements occurring in only the Dense and Interaction scenarios. The Polynomial SVM is also unique in that it was the method that took the longest to run.

Table 5.16: K-Nearest Neighbor Summary

Type | CV Error | Test Error | AUC
NS - Sparse | 36.00% | 35.75% | …
LASSO - Sparse | 27.17% | 30.00% | …
χ² - Sparse | 31.37% | 30.50% | …
RF - Sparse | 31.60% | 34.00% | …
NS - Dense | 22.73% | 23.00% | …
LASSO - Dense | 18.80% | 20.50% | …
χ² - Dense | 20.27% | 21.00% | …
RF - Dense | 20.10% | 19.00% | …
NS - Inter | 24.67% | 28.50% | …
LASSO - Inter | 19.33% | 25.50% | …
χ² - Inter | 19.87% | 21.00% | …
RF - Inter | 20.27% | 22.00% | …

Using the variable selection methods cut down on the processing time considerably. The Random Forest creates small decreases in error rate and increases in AUC for most scenarios, so it appears to be an improvement over the No Selection models for the Polynomial SVM.

Finally, the details of the K-Nearest Neighbor models are given in Table 5.16. Like the SVMs, KNN has no variable selection to explore. KNN almost always had the highest error of all the models in each scenario, but the models do have a relatively small increase between CV error and test error. The KNN models are also the only ones to be positively affected by all the variable selection methods: the error drops in every case, and the AUC value also increases in every case.
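A sketch of the KNN classifier with class::knn follows (assumed inputs as in the earlier sketches; the genotype columns must be numeric, since KNN works on Euclidean distances):

```r
# Minimal sketch: k-nearest-neighbor prediction and its test error.
library(class)
pred <- knn(train = train[, keep], test = test[, keep],
            cl = factor(train$y), k = 25)
mean(pred != factor(test$y))   # test classification error
```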

For the Sparse and Dense scenarios, the LASSO appears to make the biggest improvement, while the χ² method makes the biggest improvement in the Interaction scenario. There is no one variable selection method that improves all the classification analyses, but the LASSO most frequently improves the methods, while χ² rarely does.

Chapter 6

Conclusion

In this thesis, GWAS datasets with different disease models were first simulated, and the effectiveness of multiple statistical learning methods was then thoroughly evaluated and compared. The goal was to give recommendations about which variable selection and classification methods might perform best under a given disease model.

For the Sparse scenario, most models were able to identify the proper variables, but the LASSO variable selection had the most models with better than 29% test classification error. In that case, the Random Forest model had the lowest test error at 27.50%, though not in the other two variable selection cases. For the Dense scenario, the Random Forest Variable Selection had the lowest test classification error rates; in that case and both others, the Polynomial SVM had the lowest error. The Interaction scenario was the toughest for the models to interpret. The χ² variable selection models are just a touch better than the Random Forest Variable Selection.

In both of those variable selection cases, the Radial SVM had the smallest error, while in the LASSO case a few other methods beat the Radial SVM. Overall, the Radial and Polynomial SVMs were very reliable, while the KNN model tended to have relatively high CV and test error.

For future work, I would like to study the details of the methods further, and possibly develop a novel statistical method that could improve the prediction accuracy on GWAS datasets. One change I would like to research is utilizing a better p-value cutoff for the χ² test of independence. The value of 0.05 does not control the family-wise testing error rate well, so many false positive SNPs could enter the classification analysis. Second, I would like to study incorporating environmental impact and gene-environment interaction into the model. And finally, I would consider a more complex model where SNPs contribute in a non-linear way to disease risk.

Appendix A: Detailed Description of Simulation Scenarios

The following model was used for all three scenarios:

$$\operatorname{logit}(p(X_i)) = h(X_i) = \beta_0 + \beta^T X_i + X_i^T A X_i$$

Here $\beta_0$ is an intercept, the other $\beta_j$'s are the model's coefficients, and the $A_{ij}$'s control the interaction terms. The following sets indicate which variables were used in each scenario: the Sparse scenario uses $S_1$, the Dense scenario uses $S_1$ and $S_2$, and the Interaction scenario uses all three sets.

S1 = {100, 200, 300, 400, 500}

S2 = {80, 84, 86, 90, 95, 190, 192, 195, 197, 295, 302, 305, 308, 310, 405, 408, 410, 412, 416, 490, 492, 495, 498, 505, 508, 510, 515}

S3 = {(90,100), (195,200), (410,492), (492,505), (498,510), (80,302), (86,302), (190,302), (400,302), (408,302)}

The coefficients for the three scenarios were set as follows:

Sparse: $\beta_i = \log 2.7$ for $i \in S_1$ and $0$ otherwise; $A_{ij} = 0$.

Dense: $\beta_i = \log 1.9$ for $i \in S_1, S_2$ and $0$ otherwise; $A_{ij} = 0$.

Interaction: $\beta_i = \log 2.5$ for $i \in S_1, S_2$ and $0$ otherwise; $A_{ij} = \log 2.6$ for $(i,j) \in S_3$ and $0$ otherwise.
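As an illustration, disease status can be simulated from this model along the following lines; `G` (an assumed n × 600 genotype matrix coded 0/1/2) and the intercept value are assumptions, and only the Sparse coefficients are shown:

```r
# Minimal sketch: draw case/control phenotypes from the logit model
# h(X) = b0 + beta'X + X'AX (A = 0 in the Sparse scenario).
S1   <- c(100, 200, 300, 400, 500)
beta <- rep(0, ncol(G)); beta[S1] <- log(2.7)
b0   <- -1                          # assumed illustrative intercept
h    <- as.vector(b0 + G %*% beta)  # add the quadratic term for Interaction
p    <- 1 / (1 + exp(-h))           # inverse logit
y    <- rbinom(nrow(G), 1, p)       # simulated disease status
```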

Table 6.1 shows more information about the chromosomes and genes that the SNPs come from.

SNP Locations | Gene name | Chromosome | GeneID | Gene size
… | KR5A | … | … | …
… | GCK | … | … | …
… | NEUROG | … | … | …
… | HNF1A | … | … | …
… | HNF1B | … | … | …

Table 6.1: More data about the chromosomes and genes that the SNPs are from.

Appendix B: Variable Selection Detailed Results

The true variables are marked in red, while the variables marked in blue are neighbors of a true variable not picked up by the analysis. A few, marked in purple, are both true variables and neighbors of missing variables. The variables are listed in order of decreasing coefficient value for the LASSO and Random Forest, so a variable higher on the list would be considered to have more impact in the model than one towards the end. For χ² the results do not have as strong an ordering as the others, so they are listed numerically.
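For orientation, the LASSO lists below can be produced along these lines with the glmnet package; the genotype matrix `X` and 0/1 phenotype `y` are assumed names, and the thesis' exact λ choice may differ:

```r
# Minimal sketch: LASSO-penalized logistic regression; keep the SNPs
# with nonzero coefficients, ordered by absolute coefficient size.
library(glmnet)
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1: LASSO
b     <- coef(cvfit, s = "lambda.min")[-1, 1]             # drop the intercept
kept  <- sort(abs(b[b != 0]), decreasing = TRUE)
names(kept)                                               # selected SNPs
```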

Variables kept by LASSO

Sparse (no true variables missing): V300 V199 V500 V400 V100 V416 V421 V172 V99 V118 V79 V22 V501 V395 V105 V393 V545 V182 V599

Dense (no true variables missing): V302 V84 V416 V95 V410 V498 V514 V400 V90 V508 V495 V492 V500 V100 V505 V199 V489 V187 V408 V188 V402 V310 V80 V195 V300 V510 V197 V98 V507 V191 V468 V405 V295 V512 V333 V525 V306 V526 V220 V509 V339 V304 V427 V576 V469 V293 V520 V517 V598 V531 V523 V144 V352 V583 V398 V412 V230 V503 V292 V184 V277 V29 V592 V329 V313 V221 V365 V425 V240 V466 V564 V173 V146 V588 V309 V126 V446 V519 V590

Interaction (missing variables either one step off or with no neighbors selected): V302 V84 V416 V90 V188 V514 V295 V300 V405 V95 V197 V510 V310 V500 V187 V495 V489 V508 V498 V404 V412 V306 V199 V100 V400 V99 V80 V503 V506 V485 V408 V97 V496 V502 V373 V501 V483 V137 V505 V407 V309 V160 V358 V568 V526 V428 V275 V576 V410 V88 V388 V172 V217 V195 V486 V85 V247 V429 V208 V342 V290 V37 V89 V251 V50 V467 V343 V390 V23 V427 V349 V513 V512 V540 V34 V492 V299 V335 V599 V384 V40 V301 V532 V367 V355 V564 V66 V118 V224 V456 V393 V426 V151 V107 V357 V365 V449 V523 V423 V580

Table 6.2: Variable selection results for LASSO in the Sparse, Dense, and Interaction scenarios.

Variables kept by χ²

Sparse (no true variables missing): V43 V60 V79 V83 V84 V85 V88 V89 V90 V95 V97 V98 V99 V100 V105 V141 V156 V159 V172 V173 V174 V180 V182 V183 V187 V188 V193 V194 V195 V199 V220 V292 V299 V300 V301 V304 V309 V310 V311 V314 V320 V321 V322 V323 V324 V326 V327 V329 V331 V338 V373 V396 V397 V398 V400 V401 V402 V403 V404 V405 V408 V410 V412 V414 V415 V416 V417 V418 V419 V420 V421 V422 V424 V481 V492 V495 V499 V500 V501 V502 V503 V504 V507 V508 V510 V512 V513 V514 V518 V541 V547 V580

Table 6.3: Variable selection results for χ² in the Sparse scenario.
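The χ² lists are produced by a per-SNP test of independence between genotype counts and case/control status. A base-R sketch, under the same assumed `X` and `y` as above:

```r
# Minimal sketch: chi-squared screening of each SNP at the 0.05 cutoff.
# Each SNP's 0/1/2 genotype counts are cross-tabulated against y.
pvals <- apply(X, 2, function(snp) chisq.test(table(snp, y))$p.value)
kept  <- names(pvals)[pvals < 0.05]   # SNPs passing the cutoff
```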

Table 6.4: Variable selection results for χ² in the Dense and Interaction scenarios.

Variables kept by Random Forest

Sparse (no true variables missing): V300 V199 V400 V500 V100 V416 V99 V304 V502 V398 V410 V98 V415 V404 V421 V301 V495 V501 V95 V97 V493 V188 V408 V194 V401 V88 V309 V84 V310 V79 V570 V183 V326 V412 V503 V293 V294 V141 V298 V219 V146 V90 V599 V367 V531 V108 V519 V532 V255 V43 V349 V390 V107 V335 V105 V516 V217 V446 V89 V486 V182 V425 V299 V161 V514 V137 V263 V426 V270 V180 V485 V483 V352 V80 V373 V314 V13 V393 V586 V357 V394 V395 V369 V545 V427 V489 V513 V172 V333 V360 V248 V520 V512 V543 V511 V441 V464 V578 V327 V508

Dense (missing variables either one step off or with no neighbors selected): V95 V84 V90 V416 V98 V97 V410 V88 V100 V400 V99 V404 V408 V398 V195 V510 V495 V514 V507 V415 V508 V512 V197 V493 V89 V85 V502 V500 V302 V501 V405 V83 V498 V306 V516 V513 V505 V79 V492 V520 V503 V199 V188 V421 V499 V504 V511 V304 V523 V401 V300 V486 V187 V295 V506 V509 V489 V75 V194 V522 V519 V412 V309 V352 V468 V578 V80 V586 V485 V263 V107 V367 V255 V217 V429 V310 V314 V466 V265 V360 V342 V137 V583 V532 V43 V333 V588 V390 V594 V173 V37 V209 V369 V579 V441 V146 V313 V397 V13 V108 V34 V106 V454 V105 V395

Table 6.5: Variable selection results for Random Forest in the Sparse and Dense scenarios.

Variables kept by Random Forest

Interaction (missing variables either one step off or with no neighbors selected): V302 V306 V95 V90 V97 V98 V84 V99 V495 V400 V507 V493 V80 V510 V500 V416 V508 V502 V514 V89 V187 V321 V512 V404 V410 V501 V85 V304 V295 V83 V408 V503 V188 V499 V398 V504 V405 V415 V486 V79 V511 V300 V516 V513 V492 V421 V485 V489 V520 V313 V326 V498 V137 V75 V412 V532 V106 V203 V61 V195 V50 V453 V209 V586 V294 V46 V505 V197 V324 V108 V360 V40 V333 V34 V483 V199 V310 V107 V599 V576 V342 V441 V588 V314 V337 V118 V73 V523 V219 V467 V208 V594 V522 V519 V402 V265 V41 V390 V217 V277 V299 V425 V451 V355 V570

Table 6.6: Variable selection results for Random Forest in the Interaction scenario.

Appendix C: No Selection Model Details

Table 6.7: NO SELECTION - SPARSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 21.48% | 33.25% | V22 V24 V33 V100 V120 V125 V132 V163 V169 V172 V194 V197 V199 V231 V269 V292 V293 V295 V300 V301 V304 V365 V380 V393 V400 V402 V407 V412 V451 V453 V461 V471 V496 V500 V529 V544 V545 V573 V580 V581 V589
Linear Discriminant Analysis | 31.43% | 32.75% | V300 V199 V500 V100 V400 V120 V133 V156 V407 V273 V130 V553 V22 V529 V409 V569 V380 V169 V472 V306
Random Forest | 27.63% | 28.00% | V300 V199 V400 V500 V100 V416 V99 V502 V98 V304 V410 V398 V415 V404 V95 V408 V401 V495 V421 V501
Support Vector Machine: Linear | 30.43% | 30.00% | Cost: 0.01
Support Vector Machine: Radial | 29.40% | 27.25% | Cost: 1.0, γ: …
Support Vector Machine: Polynomial | 29.10% | 27.75% | Cost: 0.001, γ: 0.1, Degree: 2
K-Nearest Neighbor | 36.00% | 35.75% | K: 30

Table 6.8: NO SELECTION - DENSE

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 13.97% | 17.50% | V15 V29 V84 V90 V95 V98 V100 V115 V117 V144 V145 V184 V187 V188 V191 V193 V195 V197 V199 V202 V209 V211 V220 V229 V248 V270 V277 V293 V295 V300 V302 V310 V313 V322 V327 V333 V339 V355 V400 V402 V405 V410 V411 V416 V421 V458 V489 V492 V493 V495 V498 V500 V503 V505 V506 V508 V509 V514 V526 V534 V542 V550 V576 V598
Linear Discriminant Analysis | 17.60% | 17.50% | V84 V302 V15 V506 V509 V416 V405 V264 V495 V90 V498 V492 V295 V514 V187 V399 V306 V505 V115 V409
Random Forest | 18.83% | 20.25% | V95 V84 V400 V416 V410 V98 V90 V408 V88 V100 V97 V404 V398 V99 V415 V508 V514 V510 V507 V495
Support Vector Machine: Linear | 15.87% | 16.25% | Cost: 0.01
Support Vector Machine: Radial | 16.23% | 16.50% | Cost: 1.0, γ: …
Support Vector Machine: Polynomial | 15.80% | 16.75% | Cost: 0.001, γ: 0.1, Degree: 2
K-Nearest Neighbor | 22.73% | 23.00% | K: 29

Table 6.9: NO SELECTION - INTERACTION

Method | CV Error | Test Error | Variables or Parameters Chosen
Logistic Classification | 13.84% | 17.50% | V40 V59 V66 V71 V72 V80 V84 V85 V90 V95 V137 V147 V155 V160 V172 V183 V188 V197 V217 V221 V251 V295 V300 V302 V310 V320 V373 V400 V404 V405 V412 V416 V419 V428 V442 V447 V454 V456 V483 V489 V492 V495 V496 V498 V500 V506 V508 V510 V514 V526 V568 V571 V576 V582 V591 V599
Linear Discriminant Analysis | 15.83% | 18.50% | V302 V90 V554 V428 V295 V84 V510 V188 V197 V85 V553 V416 V506 V405 V437 V568 V300 V262 V500 V514
Random Forest | 15.73% | 18.50% | V302 V306 V95 V90 V84 V97 V88 V100 V98 V495 V99 V510 V493 V416 V400 V514 V507 V404 V508 V410
Support Vector Machine: Linear | 15.27% | 19.50% | Cost: 0.01
Support Vector Machine: Radial | 14.83% | 18.00% | Cost: 5.0, γ: …
Support Vector Machine: Polynomial | 15.07% | 17.75% | Cost: 0.001, γ: 0.1, Degree: 2
K-Nearest Neighbor | 24.67% | 28.50% | K: 28



Introduction to Quantitative Genetics II: Resemblance Between Relatives Intrductin t Quantitative Genetics II: Resemblance Between Relatives Bruce Walsh 8 Nvember 006 EEB 600A The heritability f a trait, a central cncept in quantitative genetics, is the prprtin f variatin

More information

The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition

The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition The Kullback-Leibler Kernel as a Framewrk fr Discriminant and Lcalized Representatins fr Visual Recgnitin Nun Vascncels Purdy H Pedr Mren ECE Department University f Califrnia, San Dieg HP Labs Cambridge

More information

Chemistry 20 Lesson 11 Electronegativity, Polarity and Shapes

Chemistry 20 Lesson 11 Electronegativity, Polarity and Shapes Chemistry 20 Lessn 11 Electrnegativity, Plarity and Shapes In ur previus wrk we learned why atms frm cvalent bnds and hw t draw the resulting rganizatin f atms. In this lessn we will learn (a) hw the cmbinatin

More information

Support Vector Machines and Flexible Discriminants

Support Vector Machines and Flexible Discriminants 12 Supprt Vectr Machines and Flexible Discriminants This is page 417 Printer: Opaque this 12.1 Intrductin In this chapter we describe generalizatins f linear decisin bundaries fr classificatin. Optimal

More information

NUMBERS, MATHEMATICS AND EQUATIONS

NUMBERS, MATHEMATICS AND EQUATIONS AUSTRALIAN CURRICULUM PHYSICS GETTING STARTED WITH PHYSICS NUMBERS, MATHEMATICS AND EQUATIONS An integral part t the understanding f ur physical wrld is the use f mathematical mdels which can be used t

More information

Lead/Lag Compensator Frequency Domain Properties and Design Methods

Lead/Lag Compensator Frequency Domain Properties and Design Methods Lectures 6 and 7 Lead/Lag Cmpensatr Frequency Dmain Prperties and Design Methds Definitin Cnsider the cmpensatr (ie cntrller Fr, it is called a lag cmpensatr s K Fr s, it is called a lead cmpensatr Ntatin

More information

Hypothesis Tests for One Population Mean

Hypothesis Tests for One Population Mean Hypthesis Tests fr One Ppulatin Mean Chapter 9 Ala Abdelbaki Objective Objective: T estimate the value f ne ppulatin mean Inferential statistics using statistics in rder t estimate parameters We will be

More information

7 TH GRADE MATH STANDARDS

7 TH GRADE MATH STANDARDS ALGEBRA STANDARDS Gal 1: Students will use the language f algebra t explre, describe, represent, and analyze number expressins and relatins 7 TH GRADE MATH STANDARDS 7.M.1.1: (Cmprehensin) Select, use,

More information

Determining the Accuracy of Modal Parameter Estimation Methods

Determining the Accuracy of Modal Parameter Estimation Methods Determining the Accuracy f Mdal Parameter Estimatin Methds by Michael Lee Ph.D., P.E. & Mar Richardsn Ph.D. Structural Measurement Systems Milpitas, CA Abstract The mst cmmn type f mdal testing system

More information

Differentiation Applications 1: Related Rates

Differentiation Applications 1: Related Rates Differentiatin Applicatins 1: Related Rates 151 Differentiatin Applicatins 1: Related Rates Mdel 1: Sliding Ladder 10 ladder y 10 ladder 10 ladder A 10 ft ladder is leaning against a wall when the bttm

More information

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression 4th Indian Institute f Astrphysics - PennState Astrstatistics Schl July, 2013 Vainu Bappu Observatry, Kavalur Crrelatin and Regressin Rahul Ry Indian Statistical Institute, Delhi. Crrelatin Cnsider a tw

More information

MODULE 1. e x + c. [You can t separate a demominator, but you can divide a single denominator into each numerator term] a + b a(a + b)+1 = a + b

MODULE 1. e x + c. [You can t separate a demominator, but you can divide a single denominator into each numerator term] a + b a(a + b)+1 = a + b . REVIEW OF SOME BASIC ALGEBRA MODULE () Slving Equatins Yu shuld be able t slve fr x: a + b = c a d + e x + c and get x = e(ba +) b(c a) d(ba +) c Cmmn mistakes and strategies:. a b + c a b + a c, but

More information

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9. Sectin 7 Mdel Assessment This sectin is based n Stck and Watsn s Chapter 9. Internal vs. external validity Internal validity refers t whether the analysis is valid fr the ppulatin and sample being studied.

More information

Guide to Using the Rubric to Score the Klf4 PREBUILD Model for Science Olympiad National Competitions

Guide to Using the Rubric to Score the Klf4 PREBUILD Model for Science Olympiad National Competitions Guide t Using the Rubric t Scre the Klf4 PREBUILD Mdel fr Science Olympiad 2010-2011 Natinal Cmpetitins These instructins are t help the event supervisr and scring judges use the rubric develped by the

More information

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data Outline IAML: Lgistic Regressin Charles Suttn and Victr Lavrenk Schl f Infrmatics Semester Lgistic functin Lgistic regressin Learning lgistic regressin Optimizatin The pwer f nn-linear basis functins Least-squares

More information

Linear Classification

Linear Classification Linear Classificatin CS 54: Machine Learning Slides adapted frm Lee Cper, Jydeep Ghsh, and Sham Kakade Review: Linear Regressin CS 54 [Spring 07] - H Regressin Given an input vectr x T = (x, x,, xp), we

More information

Pipetting 101 Developed by BSU CityLab

Pipetting 101 Developed by BSU CityLab Discver the Micrbes Within: The Wlbachia Prject Pipetting 101 Develped by BSU CityLab Clr Cmparisns Pipetting Exercise #1 STUDENT OBJECTIVES Students will be able t: Chse the crrect size micrpipette fr

More information

T Algorithmic methods for data mining. Slide set 6: dimensionality reduction

T Algorithmic methods for data mining. Slide set 6: dimensionality reduction T-61.5060 Algrithmic methds fr data mining Slide set 6: dimensinality reductin reading assignment LRU bk: 11.1 11.3 PCA tutrial in mycurses (ptinal) ptinal: An Elementary Prf f a Therem f Jhnsn and Lindenstrauss,

More information

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date AP Statistics Practice Test Unit Three Explring Relatinships Between Variables Name Perid Date True r False: 1. Crrelatin and regressin require explanatry and respnse variables. 1. 2. Every least squares

More information

making triangle (ie same reference angle) ). This is a standard form that will allow us all to have the X= y=

making triangle (ie same reference angle) ). This is a standard form that will allow us all to have the X= y= Intrductin t Vectrs I 21 Intrductin t Vectrs I 22 I. Determine the hrizntal and vertical cmpnents f the resultant vectr by cunting n the grid. X= y= J. Draw a mangle with hrizntal and vertical cmpnents

More information

Eric Klein and Ning Sa

Eric Klein and Ning Sa Week 12. Statistical Appraches t Netwrks: p1 and p* Wasserman and Faust Chapter 15: Statistical Analysis f Single Relatinal Netwrks There are fur tasks in psitinal analysis: 1) Define Equivalence 2) Measure

More information

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa There are tw parts t this lab. The first is intended t demnstrate hw t request and interpret the spatial diagnstics f a standard OLS regressin mdel using GeDa. The diagnstics prvide infrmatin abut the

More information

Kinetic Model Completeness

Kinetic Model Completeness 5.68J/10.652J Spring 2003 Lecture Ntes Tuesday April 15, 2003 Kinetic Mdel Cmpleteness We say a chemical kinetic mdel is cmplete fr a particular reactin cnditin when it cntains all the species and reactins

More information

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers LHS Mathematics Department Hnrs Pre-alculus Final Eam nswers Part Shrt Prblems The table at the right gives the ppulatin f Massachusetts ver the past several decades Using an epnential mdel, predict the

More information

Least Squares Optimal Filtering with Multirate Observations

Least Squares Optimal Filtering with Multirate Observations Prc. 36th Asilmar Cnf. n Signals, Systems, and Cmputers, Pacific Grve, CA, Nvember 2002 Least Squares Optimal Filtering with Multirate Observatins Charles W. herrien and Anthny H. Hawes Department f Electrical

More information

WRITING THE REPORT. Organizing the report. Title Page. Table of Contents

WRITING THE REPORT. Organizing the report. Title Page. Table of Contents WRITING THE REPORT Organizing the reprt Mst reprts shuld be rganized in the fllwing manner. Smetime there is a valid reasn t include extra chapters in within the bdy f the reprt. 1. Title page 2. Executive

More information

Part 3 Introduction to statistical classification techniques

Part 3 Introduction to statistical classification techniques Part 3 Intrductin t statistical classificatin techniques Machine Learning, Part 3, March 07 Fabi Rli Preamble ØIn Part we have seen that if we knw: Psterir prbabilities P(ω i / ) Or the equivalent terms

More information

The standards are taught in the following sequence.

The standards are taught in the following sequence. B L U E V A L L E Y D I S T R I C T C U R R I C U L U M MATHEMATICS Third Grade In grade 3, instructinal time shuld fcus n fur critical areas: (1) develping understanding f multiplicatin and divisin and

More information

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets Department f Ecnmics, University f alifrnia, Davis Ecn 200 Micr Thery Prfessr Giacm Bnann Insurance Markets nsider an individual wh has an initial wealth f. ith sme prbability p he faces a lss f x (0

More information

Enhancing Performance of MLP/RBF Neural Classifiers via an Multivariate Data Distribution Scheme

Enhancing Performance of MLP/RBF Neural Classifiers via an Multivariate Data Distribution Scheme Enhancing Perfrmance f / Neural Classifiers via an Multivariate Data Distributin Scheme Halis Altun, Gökhan Gelen Nigde University, Electrical and Electrnics Engineering Department Nigde, Turkey haltun@nigde.edu.tr

More information

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION NUROP Chinese Pinyin T Chinese Character Cnversin NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION CHIA LI SHI 1 AND LUA KIM TENG 2 Schl f Cmputing, Natinal University f Singapre 3 Science

More information

Testing Groups of Genes

Testing Groups of Genes Testing Grups f Genes Part II: Scring Gene Ontlgy Terms Manuela Hummel, LMU München Adrian Alexa, MPI Saarbrücken NGFN-Curses in Practical DNA Micrarray Analysis Heidelberg, March 6, 2008 Bilgical questins

More information

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation III-l III. A New Evaluatin Measure J. Jiner and L. Werner Abstract The prblems f evaluatin and the needed criteria f evaluatin measures in the SMART system f infrmatin retrieval are reviewed and discussed.

More information

Five Whys How To Do It Better

Five Whys How To Do It Better Five Whys Definitin. As explained in the previus article, we define rt cause as simply the uncvering f hw the current prblem came int being. Fr a simple causal chain, it is the entire chain. Fr a cmplex

More information

Chapter 15 & 16: Random Forests & Ensemble Learning

Chapter 15 & 16: Random Forests & Ensemble Learning Chapter 15 & 16: Randm Frests & Ensemble Learning DD3364 Nvember 27, 2012 Ty Prblem fr Bsted Tree Bsted Tree Example Estimate this functin with a sum f trees with 9-terminal ndes by minimizing the sum

More information

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards:

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards: MODULE FOUR This mdule addresses functins SC Academic Standards: EA-3.1 Classify a relatinship as being either a functin r nt a functin when given data as a table, set f rdered pairs, r graph. EA-3.2 Use

More information

Activity Guide Loops and Random Numbers

Activity Guide Loops and Random Numbers Unit 3 Lessn 7 Name(s) Perid Date Activity Guide Lps and Randm Numbers CS Cntent Lps are a relatively straightfrward idea in prgramming - yu want a certain chunk f cde t run repeatedly - but it takes a

More information

Multiple Source Multiple. using Network Coding

Multiple Source Multiple. using Network Coding Multiple Surce Multiple Destinatin Tplgy Inference using Netwrk Cding Pegah Sattari EECS, UC Irvine Jint wrk with Athina Markpulu, at UCI, Christina Fraguli, at EPFL, Lausanne Outline Netwrk Tmgraphy Gal,

More information

NGSS High School Physics Domain Model

NGSS High School Physics Domain Model NGSS High Schl Physics Dmain Mdel Mtin and Stability: Frces and Interactins HS-PS2-1: Students will be able t analyze data t supprt the claim that Newtn s secnd law f mtin describes the mathematical relatinship

More information

Weathering. Title: Chemical and Mechanical Weathering. Grade Level: Subject/Content: Earth and Space Science

Weathering. Title: Chemical and Mechanical Weathering. Grade Level: Subject/Content: Earth and Space Science Weathering Title: Chemical and Mechanical Weathering Grade Level: 9-12 Subject/Cntent: Earth and Space Science Summary f Lessn: Students will test hw chemical and mechanical weathering can affect a rck

More information

Name: Block: Date: Science 10: The Great Geyser Experiment A controlled experiment

Name: Block: Date: Science 10: The Great Geyser Experiment A controlled experiment Science 10: The Great Geyser Experiment A cntrlled experiment Yu will prduce a GEYSER by drpping Ments int a bttle f diet pp Sme questins t think abut are: What are yu ging t test? What are yu ging t measure?

More information

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp THE POWER AND LIMIT OF NEURAL NETWORKS T. Y. Lin Department f Mathematics and Cmputer Science San Jse State University San Jse, Califrnia 959-003 tylin@cs.ssu.edu and Bereley Initiative in Sft Cmputing*

More information

SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST. Mark C. Otto Statistics Research Division, Bureau of the Census Washington, D.C , U.S.A.

SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST. Mark C. Otto Statistics Research Division, Bureau of the Census Washington, D.C , U.S.A. SIZE BIAS IN LINE TRANSECT SAMPLING: A FIELD TEST Mark C. Ott Statistics Research Divisin, Bureau f the Census Washingtn, D.C. 20233, U.S.A. and Kenneth H. Pllck Department f Statistics, Nrth Carlina State

More information

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA Mental Experiment regarding 1D randm walk Cnsider a cntainer f gas in thermal

More information

2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS

2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS 2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS 6. An electrchemical cell is cnstructed with an pen switch, as shwn in the diagram abve. A strip f Sn and a strip f an unknwn metal, X, are used as electrdes.

More information

Cells though to send feedback signals from the medulla back to the lamina o L: Lamina Monopolar cells

Cells though to send feedback signals from the medulla back to the lamina o L: Lamina Monopolar cells Classificatin Rules (and Exceptins) Name: Cell type fllwed by either a clumn ID (determined by the visual lcatin f the cell) r a numeric identifier t separate ut different examples f a given cell type

More information

IB Sports, Exercise and Health Science Summer Assignment. Mrs. Christina Doyle Seneca Valley High School

IB Sports, Exercise and Health Science Summer Assignment. Mrs. Christina Doyle Seneca Valley High School IB Sprts, Exercise and Health Science Summer Assignment Mrs. Christina Dyle Seneca Valley High Schl Welcme t IB Sprts, Exercise and Health Science! This curse incrprates the traditinal disciplines f anatmy

More information

1 The limitations of Hartree Fock approximation

1 The limitations of Hartree Fock approximation Chapter: Pst-Hartree Fck Methds - I The limitatins f Hartree Fck apprximatin The n electrn single determinant Hartree Fck wave functin is the variatinal best amng all pssible n electrn single determinants

More information

Module 4: General Formulation of Electric Circuit Theory

Module 4: General Formulation of Electric Circuit Theory Mdule 4: General Frmulatin f Electric Circuit Thery 4. General Frmulatin f Electric Circuit Thery All electrmagnetic phenmena are described at a fundamental level by Maxwell's equatins and the assciated

More information

Inference in the Multiple-Regression

Inference in the Multiple-Regression Sectin 5 Mdel Inference in the Multiple-Regressin Kinds f hypthesis tests in a multiple regressin There are several distinct kinds f hypthesis tests we can run in a multiple regressin. Suppse that amng

More information

BLAST / HIDDEN MARKOV MODELS

BLAST / HIDDEN MARKOV MODELS CS262 (Winter 2015) Lecture 5 (January 20) Scribe: Kat Gregry BLAST / HIDDEN MARKOV MODELS BLAST CONTINUED HEURISTIC LOCAL ALIGNMENT Use Cmmnly used t search vast bilgical databases (n the rder f terabases/tetrabases)

More information

You need to be able to define the following terms and answer basic questions about them:

You need to be able to define the following terms and answer basic questions about them: CS440/ECE448 Sectin Q Fall 2017 Midterm Review Yu need t be able t define the fllwing terms and answer basic questins abut them: Intr t AI, agents and envirnments Pssible definitins f AI, prs and cns f

More information

B. Definition of an exponential

B. Definition of an exponential Expnents and Lgarithms Chapter IV - Expnents and Lgarithms A. Intrductin Starting with additin and defining the ntatins fr subtractin, multiplicatin and divisin, we discvered negative numbers and fractins.

More information

BIOLOGY 101. CHAPTER 17: Gene Expression: From Gene to Protein. The Flow of Genetic Information

BIOLOGY 101. CHAPTER 17: Gene Expression: From Gene to Protein. The Flow of Genetic Information BIOLOGY 101 CHAPTER 17: Gene Expressin: Frm Gene t Prtein Gene Expressin: Frm Gene t Prtein: CONCEPTS: 17.1 Genes specify prteins via transcriptin and translatin 17.2 Transcriptin is the DNA-directed synthesis

More information