NOVEL METHODS FOR INCREASING EFFICIENCY OF QUANTITATIVE TRAIT LOCUS MAPPING ZHIGANG GUO. M. S., Nanjing Agricultural University, 1998

NOVEL METHODS FOR INCREASING EFFICIENCY OF QUANTITATIVE TRAIT LOCUS MAPPING
by
ZHIGANG GUO
M. S., Nanjing Agricultural University, 1998
AN ABSTRACT OF A DISSERTATION
submitted in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Plant Pathology
College of Agriculture
KANSAS STATE UNIVERSITY
Manhattan, Kansas
2007

Abstract

The aim of quantitative trait locus (QTL) mapping is to identify association between DNA marker genotype and trait phenotype in experimental populations. Many QTL mapping methods have been developed to improve QTL detecting power and estimation of QTL location and effect. Recently, shrinkage Bayesian and penalized maximum-likelihood estimation approaches have been shown to give increased power and resolution for estimating QTL main or epistatic effect. Here I describe a new method, shrinkage interval mapping, that combines the advantages of these two methods while avoiding the computing load associated with them. Studies based on simulated and real data show that shrinkage interval mapping provides higher resolution for differentiating closely linked QTLs and higher power for identifying QTLs of small effect than conventional interval-mapping methods, with no greater computing time.

A second new method developed in the course of this research toward increasing QTL mapping efficiency is the extension of multi-trait QTL mapping to accommodate incomplete phenotypic data. I describe an EM-based algorithm for exploiting all the phenotypic and genotypic information contained in the data. This method supports conventional hypothesis tests for QTL main effect, pleiotropy, and QTL-by-environment interaction. Simulations confirm improved QTL detection power and precision of QTL location and effect estimation in comparison with casewise deletion or imputation methods.

NOVEL METHODS FOR INCREASING EFFICIENCY OF QUANTITATIVE TRAIT LOCUS MAPPING
by
ZHIGANG GUO
M. S., Nanjing Agricultural University, 1998
A DISSERTATION
submitted in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Plant Pathology
College of Agriculture
KANSAS STATE UNIVERSITY
Manhattan, Kansas
2007
Approved by: Major Professor James C. Nelson

Table of Contents
List of Figures
List of Tables
Acknowledgements
CHAPTER 1 - Quantitative trait locus mapping methods: a review
    Single-marker tests
    Interval mapping
    Bayesian QTL mapping
    References
CHAPTER 2 - Shrinkage interval mapping for QTL and QTL epistasis analysis in line crosses
    Abstract
    Introduction
    Methods
    Results
    Discussion
    References
CHAPTER 3 - Multiple-trait quantitative trait locus mapping with incomplete phenotypic data
    Abstract
    Introduction
    Methods
    Results
    Discussion
    References

List of Figures
Figure 1.1 Histograms of qualitative and quantitative traits
Figure 1.2 A rice linkage map
Figure 1.3 LOD profiles produced by SM, SIM and CIM methods for QTL mapping
Figure 1.4 Posterior distributions of QTL parameters from Bayesian QTL mapping with simulated data
Figure 2.1 Estimated QTL-effect profiles for single-marker, multiple-marker, and shrinkim analyses
Figure 2.2 Estimated QTL-effect and LOD profiles for SIM, CIM and shrinkim
Figure 2.3 3D plots of QTL epistatic effects against chromosome positions for a simulated RIL population
Figure 2.4 3D plots of LOD score of QTL epistasis analysis using 2D IM and shrinkim
Figure 2.5 The statistical power of QTL detection at three significance levels using SIM, CIM and shrinkim
Figure 2.6 LOD profiles produced in the analysis of rice data by SIM, CIM and shrinkim
Figure 3.1 Statistical power of five multiple-trait QTL-mapping methods with four levels of missing data
Figure 3.2 Power of QTL 1 detection after casewise deletion and by the EM method as a function of the number of complete trait records
Figure 3.3 Means and standard deviations (SDs) of estimates of QTL position by multi-trait QTL analyses
Figure 3.4 Means and standard deviations (SDs) of estimates of QTL effects by multi-trait QTL analyses
Figure 3.5 LOD profile produced by multiple-imputation method for multi-trait QTL analysis based on simulated data

List of Tables
Table 2.1 The true values and estimates of QTL parameters in simulation experiment I
Table 2.2 Computing time required for SIM, CIM and shrinkim in simulation experiment I
Table 2.3 QTL parameters used for simulation experiment II
Table 2.4 Estimates of QTL positions and effects for rice data using shrinkim
Table 2.5 Comparison of SIM, CIM and shrinkim in simulation experiment II
Table 3.1 QTL effects and variances for two traits used for simulation of multi-trait QTL mapping
Table 3.2 Observed statistical power of five multi-trait QTL mapping methods
Table 3.3 Observed statistical specificity of multi-trait QTL analyses

Acknowledgements

First, I thank my advisor Dr. James C. Nelson, for his continuous support and access to academic thinking during my doctoral research. He has been a friend and mentor, and has given me inspiration, encouragement and confidence to solve research problems and write scientific papers. Without his support and guidance, I could not have finished this dissertation. Secondly, I would like to thank my committee members: Drs. Guihua Bai, Haiyan Wang, Shizhong Xu, and outside chair S. Muthukrishnan, for their friendship, good questions, consideration and encouragement. Finally, I would like to dedicate this dissertation to my parents Qingen Guo and Cuimei Song, my uncle Qingling Guo and my brother Xiaoqiang Guo. Special thanks go to Xinyan Li, my wife and a caring friend.

CHAPTER 1 - Quantitative trait locus mapping methods: a review

Quantitative traits have been a major area of genetic studies for over a century (Fisher 1918; Wright 1934; Mather 1949; Falconer 1960). In general, observable traits are of two types: quantitative and qualitative. A quantitative trait such as crop yield or human hypertension shows continuous variation, while a qualitative trait such as eye color shows discrete variation. The expression of a trait is called its phenotype. The phenotype of a qualitative trait is usually determined by a single gene, while the phenotype of a quantitative trait may be determined by many genes and environmental factors. Early studies of quantitative traits were focused on inferring numbers of genes from the mean, variance, and covariance of progenies, with no knowledge of the location of the genes that underlie these traits (Kearsey and Farquhar 1997). Recent development of DNA marker technology allows localizing a gene on a chromosome at the DNA level.

To introduce the genetic background for QTL mapping, I begin by reviewing some basic genetic terminology. In eukaryotes, a chromosome is a linear macromolecule composed of DNA. A diploid eukaryotic somatic cell contains multiple pairs of homologous chromosomes. Homology means similarity by descent from the same ancestral chromosome. For example, corn somatic cells contain 10 pairs of homologous chromosomes. One chromosome of each pair comes from the mother and the other from the father. A parental corn plant produces female or male gametes through a process called meiosis. Each gamete contains a single copy of each chromosome. During meiosis, two homologous chromosomes first physically pair and exchange segments of homologous DNA, resulting in recombination of genes (discussed below) on each chromosome. The paired chromosomes segregate into different cells to form gametes. Male and female gametes fuse to regenerate a plant.

A gene is a unit of inheritance. Each gene is a DNA sequence that carries the genetic information determining the expression of a trait.
Within a living cell, genes are arranged in linear order along chromosomes. Each chromosome may contain several thousand genes. The position of a gene on a chromosome is called the locus of the gene. At each locus, variants of the DNA sequence are called alleles. A diploid organism carries two alleles at each locus, one on each of two homologous chromosomes. If these two alleles are identical, the organism is said to be homozygous at the gene locus. Otherwise, the organism is said to be heterozygous. DNA segments used as genetic markers to distinguish different alleles at a given locus are called DNA markers. A DNA marker is not necessarily a gene itself, but it provides genetic information to help identify genes close to this marker on the same chromosome.

The genetic constitution of an individual is called its genotype. For one gene, the genotype is described by the two alleles at the locus. For example, if there are two alleles A and a at a locus, there are three possible single-locus genotypes AA, Aa and aa in a population. For multiple genes, the genotype is described by a list of the genotypes at all loci. For example, if there are two genes, and each has two alleles, there are nine possible genotypes in a population: AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, and aabb.

Genetic recombination generates allele combinations different from those of either parent. Consider two markers on homologous chromosomes. Marker 1 has two alleles A and a, while marker 2 has B and b. Suppose the genotype of P1 is AABB and that of P2 is aabb. P1 and P2 produce gametes AB and ab by meiosis. The gametes AB and ab combine to form an F1 progeny cell with genotype AaBb. By meiosis, an F1 progeny produces four kinds of gametes: AB, ab, Ab and aB. Among these, AB and ab are parental gametes, and Ab and aB are recombinant gametes carrying alleles from different parents. The ratio of the number of recombinant gametes to the total number of gametes is the recombination fraction between the two loci. Loci with recombination fraction below 0.5 are said to be linked. A linear representation of the chromosome with ordered loci is called a linkage map. The unit of a linkage map is the centimorgan (cM), which is a genetic distance calculated from the recombination fraction. If there are many loci on the same chromosome, a linkage map (Fig. 1.2) is constructed by arranging these loci on the chromosome according to the recombination fractions between all pairs of loci.
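As a small worked illustration of the recombination fraction and its conversion to map distance, the sketch below counts recombinant gametes and applies the Haldane map function. The gamete counts are hypothetical, and the choice of the Haldane function (which assumes no crossover interference) is an assumption made here; the text does not say which map function is used.

```python
import math


def recombination_fraction(n_recombinant, n_total):
    """Ratio of recombinant gametes to all gametes scored."""
    return n_recombinant / n_total


def haldane_cm(r):
    """Map distance in cM under the Haldane function (assumed here).

    Valid for 0 <= r < 0.5; r = 0.5 corresponds to unlinked loci.
    """
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical gamete counts from an AaBb F1 parent.
gametes = {"AB": 42, "ab": 38, "Ab": 11, "aB": 9}
r = recombination_fraction(gametes["Ab"] + gametes["aB"], sum(gametes.values()))
distance = haldane_cm(r)
print(f"r = {r:.2f}, Haldane distance = {distance:.1f} cM")
```

For small r the map distance in cM is close to 100r; the two diverge as r approaches 0.5.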
A gene locus on a chromosome determining the phenotype of a quantitative trait is called a quantitative trait locus (QTL). QTL mapping is the process of identifying statistical association between the trait phenotype and marker genotype. For QTL mapping, this association is modeled as

$$y = \mu + G a + e,$$

where y is the phenotype, µ the overall mean of the phenotype, G the genotype of the gene, a the effect of the gene, and e the residual error following a normal distribution e ~ N(0, σ²). If there are interactions between different genes (epistasis), these interactions are easily incorporated as covariates into the model.

The first requirement for QTL mapping is making a mapping population. Suppose AA and aa are the genotypes of parents 1 (P1) and 2 (P2) at each locus. Making a cross between P1 and P2 leads to F1 progeny with genotype Aa. Selfing F1 results in F2 progeny with the expected genotype proportions AA (0.25) : Aa (0.50) : aa (0.25), and continued selfing of progeny for several generations results in recombinant inbred lines (RILs) with the expected genotype proportions AA (0.50) : aa (0.50). Backcrossing the F1 to parent P1 yields BC1 progeny segregating AA (0.50) : Aa (0.50). These can be backcrossed in turn to give BC2 progeny segregating AA (0.75) : Aa (0.25). F2, RIL and BC populations are among several types of QTL-mapping population.

Many statistical methods have been developed for QTL mapping. These methods may be classified into least squares, maximum likelihood, and Bayesian estimation. In the following discussion, the main ideas of these methods are introduced in the historical order of their development. Complex statistical details are omitted for simplicity.

Single-marker tests

Single-marker (SM) analysis includes the t test, ANOVA (ANalysis Of VAriance) and simple regression. The t test and ANOVA focus on testing the difference between phenotypic means of marker genotype classes, while simple regression provides an estimate of marker effect. At a marker, the progeny are split into distinct groups according to marker genotype and the phenotypic means of the groups are compared. The t test can be used in populations such as RIL or BC that have only two genotype classes, while ANOVA is used for populations such as F2 that have three. A marker showing a significant t or F test is presumed to be linked to a QTL.
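The single-marker t test can be sketched as follows for a population with two genotype classes (RIL or BC). This is a minimal illustration using the pooled-variance two-sample t statistic; the +1/-1 genotype coding, the function name and the data are hypothetical, and turning t into a p-value would additionally need the t distribution.

```python
import math
from statistics import mean, variance


def single_marker_t(phenotypes, genotypes):
    """Pooled-variance t statistic comparing the two marker classes.

    genotypes are coded +1 and -1 (an assumed coding for illustration).
    Returns (t, degrees_of_freedom).
    """
    plus = [y for y, g in zip(phenotypes, genotypes) if g == 1]
    minus = [y for y, g in zip(phenotypes, genotypes) if g == -1]
    n1, n2 = len(plus), len(minus)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(plus) + (n2 - 1) * variance(minus)) / df
    t = (mean(plus) - mean(minus)) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return t, df


# Hypothetical phenotypes and marker genotypes for six lines.
y = [12.1, 11.4, 12.8, 9.9, 10.3, 9.5]
m = [1, 1, 1, -1, -1, -1]
t_stat, df = single_marker_t(y, m)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

Repeating this test at every marker and plotting the statistic against map position gives the single-marker scan described above.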
Simple regression for SM is based on the linear model

$$y = \mu + m a + e, \qquad (1)$$

where y is the phenotype, µ the overall mean of the phenotype, m the genotype of a marker, a the marker effect, and e the residual error following a normal distribution e ~ N(0, σ²).

Based on this model, the unknown parameters µ, a, and σ² are estimated by the least-squares method, which minimizes the sum of squared residual errors, each obtained as the difference between the phenotype and the fitted value. The advantage of SM lies in its simplicity and fast computation. The t test, ANOVA, and simple regression are easily implemented in standard software such as SAS, S-PLUS, R or MATLAB. However, this method fails to localize a QTL that lies between two markers.

Interval mapping

Simple interval mapping: Simple interval mapping (SIM) (Lander and Botstein 1989) allows localizing a QTL between two markers. Suppose there is a QTL located between markers 1 and 2. At best, SM returns its highest test statistic for the marker closest to the QTL. With SIM, candidate positions at 1- or 2-cM intervals are tested. At a candidate position, if the QTL genotype could be observed, simple regression could be used to identify association between phenotype and genotype based on the genetic model

$$y = \mu + z a + e, \qquad (2)$$

where z is the genotype of the putative QTL and a is the QTL effect. However, the QTL genotype z is unobservable. But its probability distribution conditional on the flanking markers may be inferred, and the expectation of z may then be calculated as

$$E(z) = (+1)\, p(z = +1 \mid M_{\text{left}}, M_{\text{right}}) + (-1)\, p(z = -1 \mid M_{\text{left}}, M_{\text{right}}).$$

Now a test can be done by the regression of y on E(z) based on model (2). Substitution of the unobserved z with its expectation E(z) increases the variance of the fitted phenotype value by the variance caused by uncertainty of the predicted QTL genotype, leading to reduced test statistics especially at testing positions in wide intervals (Xu 1995).

Better estimates of QTL parameters are obtained by an application of the EM algorithm (Lander and Botstein 1989). EM is a variant of maximum likelihood estimation (MLE), performed by iteration of expectation (E) and maximization (M) steps.
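The expectation E(z) used in regression-based interval mapping can be computed from the flanking-marker genotypes as sketched below. The sketch assumes a two-class population with genotypes coded +1/-1, one recombination fraction on each side of the candidate position (r_left, r_right), and no crossover interference; these are illustrative assumptions, and a RIL population would need recombination fractions adjusted for multiple generations of meiosis. The function names are hypothetical.

```python
def _transmission(a, b, r):
    """Probability that adjacent loci carry genotypes a and b (coded +1/-1)."""
    return 1.0 - r if a == b else r


def qtl_expectation(m_left, m_right, r_left, r_right):
    """E(z) for a putative QTL with z in {+1, -1} given its flanking markers.

    r_left:  recombination fraction between the left marker and the QTL.
    r_right: recombination fraction between the QTL and the right marker.
    """
    w_plus = _transmission(m_left, 1, r_left) * _transmission(1, m_right, r_right)
    w_minus = _transmission(m_left, -1, r_left) * _transmission(-1, m_right, r_right)
    p_plus = w_plus / (w_plus + w_minus)  # p(z = +1 | M_left, M_right)
    return (+1) * p_plus + (-1) * (1.0 - p_plus)


# Both flanking markers +1: z is almost certainly +1.
print(qtl_expectation(1, 1, 0.1, 0.1))
# Recombinant interval with the QTL in the middle: no information about z.
print(qtl_expectation(1, -1, 0.1, 0.1))
```

Regressing the phenotype on these expectations at each candidate position gives the regression form of interval mapping; wide intervals pull E(z) toward zero, which is the source of the reduced test statistics noted above.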
In the E-step, instead of using only flanking markers to infer the conditional probability of QTL genotype (prior probability), this method uses both flanking markers and phenotype to infer the posterior probability based on Bayes' Theorem. In the M-step, model parameters µ, a, and σ² are estimated by the regression of phenotype on the expectation of QTL genotype calculated from the posterior probability. E and M steps are repeated until the change in likelihood or parameter estimates is less than a specified value.

The evidence used for the presence of a QTL is the LOD (logarithm of odds). It is calculated from the null hypothesis H0 of no QTL and the alternative hypothesis HA of a QTL at the tested position as

$$\mathrm{LOD} = -\log_{10}\!\left(\frac{L_{\text{reduced}}}{L_{\text{full}}}\right),$$

where L_reduced is the likelihood of the reduced model, corresponding to H0, and L_full is that of the full model, corresponding to HA (Lander and Botstein 1989). Repeating this calculation at every point along a chromosome produces a LOD profile on which peaks indicate the presence of QTLs. Fig. 1.3 shows a LOD profile based on a simulated RIL population.

SIM gives more power for QTL mapping than SM due to exploitation of information from a linkage map (Lander and Botstein 1989; Haley and Knott 1992; Zeng 1994). It allows inferring the missing genotype of a marker given its flanking markers. However, SIM considers only one QTL at a time, and does not model multiple QTLs.

Composite interval mapping: Composite interval mapping (CIM) provides a way to model multiple QTLs (Zeng 1993, 1994; Jansen 1993). The genetic model for CIM is

$$y = \mu + z a + \sum_{j=1}^{c} M_j b_j + e, \qquad (3)$$

where M_j is the genotype of cofactor marker j of the individual, b_j the effect of marker j, and c the number of cofactor markers. The basic idea of CIM is that, when testing for a putative QTL at a testing position, one uses other cofactor markers as covariates to remove variation from other QTLs. QTL parameters are estimated by the ECM (Expectation/Conditional Maximization) algorithm (Zeng 1993, 1994). ECM is a combination of EM and multiple regression in which the E step is the same as that of the EM used by SIM, while the CM step involves estimates of cofactor effects by least squares. ECM produces unbiased estimates of QTL and cofactor effects (Zeng 1993, 1994). Compared with SIM, CIM provides improved power and precision of estimates of QTL location and effect (Zeng 1993, 1994). However, CIM does not determine automatically the number of cofactor markers to be included in the model.
If too many are included, they will overestimate the phenotypic variation caused by background QTLs, reducing the significance of tested QTLs. If too few are included, the advantage of CIM over SIM may be insignificant.

Moreover, the amount of QTL variation explained by a cofactor marker decreases with increasing genetic distance between the QTL and the marker.

Multi-trait QTL mapping: Multiple-trait composite interval mapping (multi-trait CIM) provides increased power over single-trait mapping by taking into account the correlated structure of multiple traits (Jiang and Zeng 1995; Korol et al. 1995, 1998). Correlation between different traits is caused by QTLs controlling the expression of those traits (pleiotropic QTLs). In multi-trait CIM, these traits are assumed to follow a multivariate normal distribution. The correlation between them is represented by the covariance component in the variance-covariance matrix. Multi-trait CIM provides formal procedures to test biologically interesting hypotheses concerning the nature of genetic correlation (Jiang and Zeng 1995). These hypothesis tests include QTL main effect, pleiotropy, QTL-by-environment interaction, and pleiotropy vs. close linkage. However, this method fails to accommodate incomplete phenotypic data. Chapter 3 describes an EM-based algorithm for exploiting all the phenotypic and genotypic information contained in the incomplete phenotypic data.

Multiple-interval mapping: Multiple-interval mapping (MIM) uses multiple marker intervals simultaneously to fit multiple QTLs directly in the model for mapping QTL (Kao et al. 1999). With MIM, a stepwise selection procedure with the likelihood ratio test statistic as a criterion is used to identify QTL. The procedure begins with no QTL, and then adds or drops QTL one at a time. In the first QTL analysis, one QTL identified using SIM or CIM is incorporated into the model and used as a cofactor for mapping the next QTL. In the next QTL analysis, the intervals with a putative QTL and the QTL identified in the first analysis are tested simultaneously in the model. A stepwise regression procedure is used to determine which QTL should be included or dropped from the model for the next QTL search. This process is repeated until the likelihood ratio test for a putative QTL is lower than a critical value.
Thus, for a candidate QTL at a testing position, MIM uses QTLs identified in the previous analyses instead of cofactor markers as covariates to adjust for genetic background. For this reason it provides better power and precision of QTL mapping than SIM and CIM.
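The EM iteration and LOD score that underlie SIM, described earlier in this section, can be sketched for a single test position as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: genotypes are coded +1/-1, prior_plus[i] is the probability p(z = +1) for individual i inferred from its flanking markers, the residual variance is common to both genotype classes, and the M-step uses the closed-form maximizer of the expected complete-data likelihood. Function names and data are hypothetical.

```python
import math


def normal_pdf(y, mean, var):
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_loglik(y, prior_plus, mu, a, var):
    return sum(
        math.log(p * normal_pdf(v, mu + a, var) + (1.0 - p) * normal_pdf(v, mu - a, var))
        for v, p in zip(y, prior_plus)
    )


def em_interval_test(y, prior_plus, max_iter=500, tol=1e-10):
    """Return (mu, a, var, lod) for one candidate QTL position."""
    n = len(y)
    y_bar = sum(y) / n
    # Reduced model (H0, no QTL): a single normal distribution.
    var0 = sum((v - y_bar) ** 2 for v in y) / n
    loglik_null = -0.5 * n * (math.log(2.0 * math.pi * var0) + 1.0)
    mu, a, var = y_bar, 0.0, var0  # start from the null fit
    loglik = loglik_null
    for _ in range(max_iter):
        # E-step: posterior probability of z = +1 given markers and phenotype.
        post = []
        for v, p in zip(y, prior_plus):
            up = p * normal_pdf(v, mu + a, var)
            down = (1.0 - p) * normal_pdf(v, mu - a, var)
            post.append(up / (up + down))
        # M-step: update mu, a and var from the posterior expectations of z.
        ez = [2.0 * w - 1.0 for w in post]
        ez_bar = sum(ez) / n
        a = (sum(e * v for e, v in zip(ez, y)) / n - ez_bar * y_bar) / (1.0 - ez_bar ** 2)
        mu = y_bar - a * ez_bar
        var = sum(
            w * (v - mu - a) ** 2 + (1.0 - w) * (v - mu + a) ** 2
            for w, v in zip(post, y)
        ) / n
        new_loglik = mixture_loglik(y, prior_plus, mu, a, var)
        converged = abs(new_loglik - loglik) < tol
        loglik = new_loglik
        if converged:
            break
    # LOD = log10(L_full / L_reduced), computed from log-likelihoods.
    lod = (loglik - loglik_null) / math.log(10.0)
    return mu, a, var, lod


# Hypothetical phenotypes and prior probabilities p(z = +1 | flanking markers).
phenotype = [11.2, 10.8, 11.5, 10.1, 8.9, 9.2, 8.7, 10.9]
prior = [0.95, 0.95, 0.90, 0.50, 0.10, 0.05, 0.05, 0.90]
mu_hat, a_hat, var_hat, lod = em_interval_test(phenotype, prior)
print(f"mu = {mu_hat:.2f}, a = {a_hat:.2f}, sigma2 = {var_hat:.3f}, LOD = {lod:.2f}")
```

Because the iteration starts from the null fit and EM never decreases the likelihood, the resulting LOD is non-negative. Running the function at each candidate position, with priors recomputed for that position, would trace out a LOD profile.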

Bayesian QTL mapping

Bayesian QTL mapping provides a flexible way to search for multiple QTLs simultaneously. This method makes inferences about parameters in a way different from the MLE or regression-based methods used by SIM or CIM. Based on a probabilistic model with a parameter vector Φ = [θ1, θ2], where θ1 and θ2 are parameters in the model, the likelihood function L is defined as the conditional probability of the observations given Φ. Formally, L can be written as

$$L(\Phi; Y) = p(Y \mid \Phi),$$

where Y represents a sample from the model. A point estimate of Φ can be obtained by maximizing L with respect to θ1 and θ2. In the Bayesian approach, inference is based on the posterior probability of Φ. According to Bayes' Theorem, this is

$$p(\Phi \mid Y) = \frac{p(\Phi, Y)}{p(Y)} = \frac{p(\Phi, Y)}{\sum_{\Phi} p(\Phi, Y)} = \frac{p(\Phi)\, p(Y \mid \Phi)}{\sum_{\Phi} p(\Phi)\, p(Y \mid \Phi)}, \qquad (4)$$

where p(Φ), the prior probability of Φ, quantifies the knowledge we have about θ1 and θ2 prior to analysis. In general, it is difficult to calculate the joint posterior probability p(θ1, θ2 | Y) in closed form from equation (4), but easy to calculate the conditional posterior probability of θ1 or θ2 as

$$p(\theta_1 \mid Y, \theta_2) = \frac{p(Y \mid \theta_1, \theta_2)\, p(\theta_1)}{\sum_{\theta_1} p(Y \mid \theta_1, \theta_2)\, p(\theta_1)} \qquad (5)$$

given fixed θ2, and

$$p(\theta_2 \mid Y, \theta_1) = \frac{p(Y \mid \theta_1, \theta_2)\, p(\theta_2)}{\sum_{\theta_2} p(Y \mid \theta_1, \theta_2)\, p(\theta_2)} \qquad (6)$$

given fixed θ1. Sampling Φ from p(Φ | Y) is replaced with drawing θ1 and θ2 in turn from their conditional posterior probability distributions [equations (5) and (6)]. This strategy is called Gibbs sampling. Continued sampling of this kind is known as the Markov-chain Monte Carlo (MCMC) method, because the previous sample values are used as parameters to sample the next values, generating a Markov chain. Fig. 1.4 gives an example of Bayesian QTL mapping based on simulation.

With Bayesian QTL mapping methods, the most difficult problem is sampling the posterior probability of QTL number. While QTL location and effect are relatively easy to sample, determining QTL number is a problem of model selection (Broman and Speed 2002).

Models with different numbers of QTLs are compared, and the best one is selected based on a specific selection criterion such as AIC or BIC. In Bayesian analysis, the optimal model is selected by a probabilistic jump of MCMC from a model with m QTLs to a new one with m + 1 or m − 1 QTLs. Reversible-jump MCMC (RJMCMC) (Green 1995) provides a method for realizing this jump between models with different numbers of QTLs. RJMCMC has been applied in many Bayesian QTL mapping methods for identifying multiple QTLs (Thomas et al. 1997; Sillanpää and Arjas 1998; Stephens and Fisch 1998; Yi and Xu 2000; Gaffney 2001; Yi and Xu 2002; Yi et al. 2003; Narita and Sasaki 2004). However, it requires much more computation than SIM or CIM, and its convergence is very sensitive to the specification of prior probabilities of parameters.

A recent development in Bayesian QTL mapping, the shrinkage Bayesian method, includes all markers in a model simultaneously in a single test (Xu 2003). When the number of markers is larger than that of individuals, the model is oversaturated. The problem of the oversaturated model is that it cannot provide unique estimates of marker effects. With the shrinkage Bayesian method, the problem is solved by the assumption that the effect of each marker follows a normal distribution with its own mean and variance. The assumption is used to limit large fluctuation of marker effect estimates, and obtain unique estimates. This leads to shrinkage estimates of marker effects, resulting in clear signals of QTL effects. Based on shrinkage estimation, spurious QTL effects are shrunk towards zero, while real QTL effects are estimated with virtually no shrinkage. Penalized MLE (PMLE) (Zhang and Xu 2005), an extension of the shrinkage method in MLE, was developed to reduce the computation associated with the shrinkage method and to analyze marker-marker interaction. However, PMLE and shrinkage Bayesian mapping are marker-based mapping methods. They cannot be used for interval mapping.
The shrinkage interval mapping (shrinkim) method (see more details in Chapter 2) extends PMLE and the shrinkage Bayesian method to interval mapping. It combines the advantages of the shrinkage Bayesian method and PMLE. This method allows analyzing QTL and QTL epistasis based on mapping populations.
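The Gibbs sampling strategy described in the Bayesian section of this chapter, drawing θ1 and θ2 in turn from their conditional distributions, can be illustrated with a toy target. The sketch below is not a QTL model: it assumes the joint posterior of (θ1, θ2) is a standard bivariate normal with correlation rho, for which each conditional is the normal N(rho × other, 1 − rho²). The function names, seed and settings are all illustrative choices.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_draws, burn_in=500, seed=1):
    """Draw (theta1, theta2) pairs by alternating the two conditional draws."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    theta1, theta2 = 0.0, 0.0
    draws = []
    for iteration in range(burn_in + n_draws):
        # Each draw conditions on the most recent value of the other parameter,
        # so successive samples form a Markov chain.
        theta1 = rng.gauss(rho * theta2, cond_sd)
        theta2 = rng.gauss(rho * theta1, cond_sd)
        if iteration >= burn_in:
            draws.append((theta1, theta2))
    return draws


def sample_correlation(draws):
    n = len(draws)
    m1 = sum(d[0] for d in draws) / n
    m2 = sum(d[1] for d in draws) / n
    cov = sum((d[0] - m1) * (d[1] - m2) for d in draws) / n
    v1 = sum((d[0] - m1) ** 2 for d in draws) / n
    v2 = sum((d[1] - m2) ** 2 for d in draws) / n
    return cov / math.sqrt(v1 * v2)


draws = gibbs_bivariate_normal(rho=0.6, n_draws=20000)
print(f"sample correlation of the draws: {sample_correlation(draws):.3f}")
```

The joint distribution is never sampled directly, yet the retained draws approximate it. In Bayesian QTL mapping the same alternation runs over QTL effects, locations and variances, with an additional model-jumping step when the QTL number is itself unknown.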

References

Broman K. W., and Speed T. P., 2002 A model selection approach for the identification of quantitative trait loci in experimental crosses. Journal of the Royal Statistical Society 64:
Calinski T., Kaczmarek Z., Krajewski P., Frova C. and Sari-Gorla M., 1999 A multivariate approach to the problem of QTL localization. Heredity 84:
Doerge R. W., 2001 Mapping and analysis of quantitative trait loci in experimental populations. Nature Genetics 3:43-52
Falconer D. S., 1960 Introduction to Quantitative Genetics, Oliver and Boyd, Edinburgh.
Fisher R. A., 1918 The correlation between relatives on the supposition of Mendelian inheritance. Philosophical Transactions of the Royal Society of Edinburgh 52:
Green P. J., 1995 Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 57:
Haley C. S. and Knott S. A., 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69:
Jansen R. C. Genotype-by-environment interaction in genetic mapping of multiple quantitative trait loci. Theoretical and Applied Genetics 91:
Jiang C. J., and Zeng Z. B., 1995 Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140:
Kearsey M. J., Farquhar G. L. F., 1997 QTL analysis in plants; where are we now? Heredity 80:
Korol A. B., Ronin Y. I. and Kirzhner V. M., 1995 Interval mapping of quantitative trait loci employing correlated trait complexes. Genetics 140:
Korol A. B., Ronin Y. I., Nevo E. and Hayes P. M., 1998 Multi-interval mapping of correlated trait complexes. Heredity 80:
Lander E. S., and Botstein D., 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121:
Mather K., 1949 Biometrical Genetics, 1st edition. Methuen, London.
Sillanpää M. J. and Arjas E., 1998 Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics 148:
Sillanpää M. J. and Arjas E., 1999 Bayesian mapping of multiple quantitative trait loci from incomplete outbred offspring data. Genetics 151:
Wright S., 1934 An analysis of variability in number of digits in an inbred strain of guinea pigs. Genetics 19:
Xu S., 1995 A comment on the simple regression method for interval mapping. Genetics 141:
Xu S., 2003 Estimating polygenic effects using markers of the entire genome. Genetics 163:
Zeng Z. B., 1993 Theoretical basis of precision mapping of quantitative trait loci. Proceedings of the National Academy of Sciences USA 90:
Zeng Z. B., 1994 Precision mapping of quantitative trait loci. Genetics 136:

Figure 1.1 Histograms of qualitative and quantitative traits. Figures a and b show the phenotypic frequency distributions of a qualitative and a quantitative trait in a sample with 100 individuals. In b, the phenotype of a trait was simulated from a normal distribution with mean 40 or 100 and standard deviation

Figure 1.2 A rice linkage map. The genotypic data used was from a rice QTL study focused on improving grain yield of U.S. rice varieties (

Figure 1.3 LOD profiles produced by SM, SIM and CIM methods for QTL mapping. SM: single-marker mapping; EM-based SIM: EM-based simple interval mapping; regression-based SIM: regression-based simple interval mapping; CIM: composite interval mapping. The horizontal black dashed line represents the 0.05 significance level LOD threshold of 2.17 estimated from 1000 permutation tests with regression-based SIM. The blue dots on the SM curve show the effect and location of each marker.

Figure 1.4 Posterior distributions of QTL parameters from Bayesian QTL mapping with simulated data. Posterior frequencies of QTL number and locations were calculated from 2000 MCMC iterations based on a simulated RIL population with 300 individuals. Fig. a: a plot of QTL number over iterations. Fig. b: posterior frequency of QTL number. Fig. c: posterior frequency of QTL location on chromosome 1. Fig. d: posterior frequency of QTL location on chromosome 2. Asterisks show the true positions of simulated QTLs.

CHAPTER 2 - Shrinkage interval mapping for QTL and QTL epistasis analysis in line crosses

Abstract

QTL modeling is an example of an oversaturation problem, requiring the choice of a subset from an excess of explanatory variables. Shrinkage Bayesian and penalized maximum likelihood estimation (PMLE) approaches have been shown to give increased power and resolution for estimating QTL main or epistatic parameters. However, Bayesian methods are computationally expensive and PMLE cannot localize a QTL within an interval. We describe a two-step shrinkage interval-mapping method, shrinkim, which addresses both weaknesses. In the first step, PMLE is used to select cofactor markers or pairwise marker-marker interactions, reducing the dimensionality of the oversaturated model. In the second step, partially penalized maximum likelihood estimation (PPMLE) is used for QTL interval mapping or QTL epistasis analysis. PPMLE, in which only the parameter of interest (QTL main or epistatic effect) is penalized, provides shrinkage estimates of these effects as well as least-squares estimates of other parameters in the model. Studies based on simulated and real data show that shrinkim provides higher resolution for differentiating closely linked QTLs and higher power for identifying QTLs of small effect than conventional interval mapping methods, with no greater computing time.

Introduction

Interval-mapping methods for finding a predictive relationship between DNA marker genotypes and quantitative-trait phenotypes fall into three general statistical classes: likelihood maximization by EM algorithm used for simultaneous estimation of genotype and trait distribution parameters (Lander and Botstein 1989); least-squares estimation by regression of phenotypes on QTL genotype expectations (Haley and Knott 1992); and Bayesian methods. The last approach treats all parameters as random variables and constructs their posterior distributions given priors, using Markov chain Monte Carlo estimation (Satagopan et al. 1996; Sillanpää and Arjas 1997, 1999; Yi and Xu 2001; Wang et al. 2005). Various extensions of the first two approaches have been developed for modeling multiple QTL (Zeng 1994), QTL × environment interaction (Jansen 1994), multiple traits (Jiang and Zeng 1995; Hackett et al. 2001; Korol et al. 2001) and multiple interval mapping (Kao et al. 1999). All approaches face the difficult model-selection problem: finding a reduced model to explain the response (phenotype data) in the presence of numbers of explanatory variables (DNA markers) that exceed the number of observational units (individuals) such that there is no unique solution to a full model.

Recent approaches to this problem, while incorporating all the markers, apply shrinkage (Groß, 2003, p. 150) to reduce the effective dimension of the model. Shrinkage methods penalize model coefficients by treating them as drawn from a normal distribution centered on zero, thereby shrinking them toward a prior mean of zero (Boer et al. 2002). Two shrinkage approaches have been suggested: Bayesian and penalized likelihood, the latter including penalized regression such as ridge regression. Typical of shrinkage methods is a QTL profile scan showing a near-zero baseline over most of the genome map, with a few QTL signals standing out conspicuously.

Bayesian shrinkage method: Xu (2003) developed a Bayesian regression method, multiple-marker analysis, for simultaneously estimating the genetic effect associated with the markers along the whole genome map. Each marker effect is allowed to have its own variance parameters so that the variance can be estimated from the data. Wang et al. (2005) extended this method to allow localizing a QTL within an interval, using Metropolis-Hastings sampling since the QTL location parameter does not have an explicit posterior distribution. However, the Bayesian method is time-consuming to compute.

Ridge regression: Consider the linear model Y = Xβ + ε, where Y is an n × 1 trait vector, X an n × m marker matrix, β an m × 1 vector of regression coefficients, and ε an n × 1 random error vector with ε ~ N(0, I_n σ²).
For an oversaturated model, ordinary least-squares estimates of $\beta$ cannot be calculated as $(X'X)^{-1}X'Y$ because the matrix $X'X$ is singular. However, ridge regression can provide a restricted least-squares estimate as $(X'X + \tau I_m)^{-1}X'Y$ under the quadratic constraint $\sum_j \beta_j^2 < \tau$ ($\tau$: a penalty parameter) on $\beta$. Boer et al. (2002) proposed the use of ridge regression for QTL epistasis analysis, allowing the penalties to vary with regression coefficient. However, inversion of the $m \times m$ matrix $X'X + \tau I_m$ becomes time-consuming with increasing numbers of regression coefficients $\beta$.
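The ridge estimate described above can be illustrated with a short sketch. This Python is not from the original work (the analyses reported here were run in MATLAB); the helper names `solve` and `ridge`, the toy marker matrix, and the value $\tau = 1$ are assumptions made only for illustration. It shows that with more markers than individuals the unpenalized normal equations are singular, while adding $\tau$ to the diagonal makes them solvable.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def ridge(x_rows, y, tau):
    """Return (X'X + tau * I_m)^{-1} X'y for an n x m marker matrix given as rows."""
    m = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) + (tau if i == j else 0.0)
            for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * yi for r, yi in zip(x_rows, y)) for i in range(m)]
    return solve(xtx, xty)


# Hypothetical oversaturated data: n = 3 individuals, m = 5 markers coded +1 / -1.
X = [[1, 1, -1, -1, 1],
     [1, -1, -1, 1, 1],
     [-1, 1, 1, -1, -1]]
y = [2.0, 0.5, -1.5]

beta_ridge = ridge(X, y, tau=1.0)   # solvable because of the penalty
# ridge(X, y, tau=0.0) raises ValueError: X'X alone is singular here.
```

In this toy matrix markers 1 and 5 are identical and marker 3 is their mirror image, so the penalized estimates for markers 1 and 5 coincide and that for marker 3 has the opposite sign. The cost noted in the text is visible in `solve`, which works on an $m \times m$ system.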

Penalized maximum-likelihood estimation: The penalized maximum-likelihood estimation (PMLE) method suggested by Zhang and Xu (2005) imposes a prior normal distribution $N(\mu_j, \sigma_j^2)$ penalty on each $\beta_j$, allowing the penalty to vary across $\beta$. An iterative algorithm is used to estimate the regression coefficients $\beta$ and other parameters. In essence, PMLE is an extension of the multiple-marker Bayesian method of Xu (2003). However, PMLE can localize a QTL only to a marker and not between markers.

Shrinkage interval mapping: The foregoing efforts demonstrated that shrinkage estimation methods can provide increased resolution and power as well as low background, but have a few disadvantages. To deal with these we have developed shrinkage interval mapping (shrinkIM), a two-step method. In the first, dimension-reducing step, cofactor markers or marker × marker interactions are selected as suggested by Zhang and Xu (2005) using PMLE, turning the oversaturated model into a regular model. In the second step, a partially penalized maximum likelihood estimation (PPMLE) method, a hybrid of PMLE and least squares, is used to estimate parameters. Instead of penalizing all $\beta$s in a model as PMLE does, PPMLE imposes a prior normal-distribution penalty only on the parameter of interest (QTL main or epistatic effect) so that a shrinkage estimate can be obtained. Estimates of the other $\beta$s are calculated by least squares. In the following description, since PMLE, the method used for cofactor selection, is identical with Zhang and Xu's method (2005), we will focus on PPMLE as used in the second step of shrinkIM.

Methods

One-QTL model for shrinkIM: The method described here is based on a RIL (recombinant inbred line) design but is easily extended to backcross, F2, or other designs.
The linear model for shrinkIM is

$$y_i = \mu + z_i \alpha + \sum_{j=1}^{p} x_{ij} c_j + \varepsilon_i \qquad (1)$$

Here $y_i$ is the trait value of individual $i$; $\mu$ is the overall mean; $z_i$ is the genotype of a QTL for individual $i$; $\alpha$ is the additive effect of the QTL; $x_{ij}$ is the genotype of the $j$th cofactor marker in the $i$th individual and is a dummy variable taking the values 1, 0, and -1 for genotypes A1A1, A1A2 (rare in RILs) and A2A2; $c_j$ is the effect of the $j$th cofactor marker; $\varepsilon_i$ is the residual error of the $i$th individual with a $N(0, \sigma^2)$ distribution; and $p$ is the total number of cofactor markers. QTL

genotype $z_i$ is not observed and is replaced in the model with its expectation, calculated from the probability distribution of QTL genotype conditional on the closest flanking markers (Haley and Knott 1992). Missing $x_{ij}$ genotype data are similarly imputed. In this model, the parameter in which we are interested is the QTL effect $\alpha$, while the other regression coefficients, including the overall mean and the effects of cofactor markers, are treated as nuisance parameters, included only to account for background (polygenic variation). We may combine these and rewrite model (1) in matrix form as

$$Y = Z\alpha + X\beta + \varepsilon \qquad (2)$$

where $n$ is the number of individuals, $Y$ an $n \times 1$ vector of trait values, $Z$ an $n \times 1$ vector of QTL genotype expectations, $\alpha$ the additive effect of the QTL, $X$ an $n \times (p + 1)$ matrix with the first column composed of $n$ ones, $\beta$ a vector of regression coefficients $(\mu, c_1, c_2, \ldots, c_p)$, and $\varepsilon \sim N(0, I_n \sigma^2)$.

To estimate the parameters $\alpha$, $\beta$ and $\sigma^2$, we introduce partially penalized maximum likelihood estimation (PPMLE), a hybrid of penalized maximum likelihood and least-squares estimation. Our aim is to obtain shrinkage estimates of the parameters of interest in order to realize the advantages associated with shrinkage Bayesian or PMLE methods, including increased QTL resolution, high power and low background. First we apply to $\alpha$ the penalty function $N(\mu_\alpha, \sigma_\alpha^2)$ from PMLE, imposing a normal distribution in order to limit the fluctuation of $\alpha$. Then we specify the direction of shrinkage of $\alpha$ by placing the second penalty $N(0, \sigma_\alpha^2/\eta)$ on the mean $\mu_\alpha$ of $\alpha$, where $\eta > 0$ denotes a prior sample size (Zhang and Xu 2005). In this way we force $\alpha$ to shrink towards zero. The log likelihood functions for model (1) before and after penalization are

$$\log(L) = -0.5\,n \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y_i - \big(\mu + z_i\alpha + \sum_{j} x_{ij} c_j\big) \Big)^2 \qquad (3)$$

and

$$\log(L_{penalized}) = -0.5\,n \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( y_i - \mu - z_i\alpha - \sum_{j} x_{ij} c_j \Big)^2 - 0.5 \log(2\pi\sigma_\alpha^2) - \frac{1}{2\sigma_\alpha^2} (\alpha - \mu_\alpha)^2 - 0.5 \log\Big(2\pi\frac{\sigma_\alpha^2}{\eta}\Big) - \frac{\eta}{2\sigma_\alpha^2}\,\mu_\alpha^2. \qquad (4)$$

In practice, an iterative two-step algorithm may be used to estimate the parameters. It starts with initial values $\theta^{(0)} = (\alpha^{(0)}, \beta^{(0)}, \sigma^{2(0)}, \mu_\alpha^{(0)}, \sigma_\alpha^{2(0)})$, setting the iteration counter $k = 0$. In step 1, we calculate the least-squares estimate of $\beta$ given $\alpha$,

$$\beta^{(k+1)} = (X'X)^{-1} X'(Y - Z\alpha^{(k)}).$$

In step 2, estimates of $\alpha$ and of the hyperparameters $\mu_\alpha$ and $\sigma_\alpha^2$ are calculated by maximizing the penalized log likelihood function (4) given $\beta$ as

$$\alpha^{(k+1)} = \frac{Z'(Y - X\beta^{(k)})\,\sigma_\alpha^{2(k)} + \sigma^{2(k)} \mu_\alpha^{(k)}}{Z'Z\,\sigma_\alpha^{2(k)} + \sigma^{2(k)}},$$

$$\sigma^{2(k+1)} = \frac{1}{n}\,(Y - Z\alpha^{(k)} - X\beta^{(k)})'(Y - Z\alpha^{(k)} - X\beta^{(k)}),$$

$$\mu_\alpha^{(k+1)} = \frac{\alpha^{(k)}}{\eta + 1},$$

$$\sigma_\alpha^{2(k+1)} = 0.5\,\big[(\alpha^{(k)} - \mu_\alpha^{(k)})^2 + \eta\,\mu_\alpha^{2(k)}\big].$$

Steps 1 and 2 are repeated until $\lVert \theta^{(k)} - \theta^{(k-1)} \rVert < \tau$, where $\tau$ is a given critical value; we used [value missing].

A likelihood ratio test under the null hypothesis $H_0: \alpha = 0$ and the alternative hypothesis $H_A: \alpha \neq 0$ is $LRT = -2 \ln(L_{reduced} / L_{full})$, where $L_{reduced}$ is the likelihood of the reduced model, corresponding to the null hypothesis, and $L_{full}$ is that of the full model, corresponding to the alternative hypothesis (Lander and Botstein 1989). Both are obtained from the log likelihood in equation (3), and a LOD score is calculated as $LRT/(2 \ln 10)$.

QTL epistasis model for shrinkIM: The linear model for pairwise QTL interaction is

$$y_i = \mu + z_{ir}\alpha_r + z_{is}\alpha_s + z_{ir} z_{is}\alpha_{rs} + \sum_{j=1}^{p} x_{ij} c_j + \sum_{u \neq v}^{q} x_{iu} x_{iv} w_{uv} + \varepsilon_i \qquad (5)$$

where $\alpha_{rs}$ is the interaction effect between QTL $r$ and $s$ ($r \neq s$) and $w_{uv}$ the interaction effect between markers $u$ and $v$ ($u \neq v$). Now the parameters of interest are $\alpha_r$, $\alpha_s$ and $\alpha_{rs}$, and the other regression coefficients are treated as nuisance parameters. Model (5) can be rewritten in the form of model (2) and its parameters estimated using PPMLE. The hypothesis test for QTL epistasis has null hypothesis $H_0: \alpha_{rs} = 0$ and alternative hypothesis $H_A: \alpha_{rs} \neq 0$. The LOD may be obtained as in the one-QTL model.
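The two-step PPMLE iteration and the LOD calculation for the one-QTL model can be sketched as follows. This is a minimal illustration under stated assumptions, not the original MATLAB implementation: the function names (`solve`, `least_squares`, `ppmle`, `lod`), the starting values for $\alpha$ and $\sigma^2$, the tolerance `tol`, and the toy data are all hypothetical. Within one sweep the sketch uses the most recent value of each quantity, which is one reading of the iteration superscripts; each log likelihood in the LOD is evaluated at its own maximum-likelihood residual variance.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def least_squares(x_rows, target):
    """Ordinary least squares, (X'X)^{-1} X' target."""
    p = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(x_rows, target)) for i in range(p)]
    return solve(xtx, xty)


def ppmle(y, z, x_rows, eta=5.0, tol=1e-8, max_iter=500):
    """Repeat steps 1 and 2 until the parameter vector stops changing."""
    n = len(y)
    ztz = sum(v * v for v in z)
    # mu_alpha = 0 and sigma_alpha^2 = 0.1 follow the text; alpha = 0, sigma^2 = 1 are assumed.
    alpha, s2, mu_a, s2_a = 0.0, 1.0, 0.0, 0.1
    beta = [0.0] * len(x_rows[0])
    for _ in range(max_iter):
        old = [alpha, s2, mu_a, s2_a] + beta
        # Step 1: least-squares estimate of beta given alpha.
        beta = least_squares(x_rows, [yi - zi * alpha for yi, zi in zip(y, z)])
        fit = [sum(b * v for b, v in zip(beta, r)) for r in x_rows]
        # Step 2: shrinkage estimate of alpha, then sigma^2 and the hyperparameters.
        zr = sum(zi * (yi - fi) for zi, yi, fi in zip(z, y, fit))
        alpha = (zr * s2_a + s2 * mu_a) / (ztz * s2_a + s2)
        s2 = sum((yi - zi * alpha - fi) ** 2 for yi, zi, fi in zip(y, z, fit)) / n
        mu_a = alpha / (eta + 1.0)
        s2_a = 0.5 * ((alpha - mu_a) ** 2 + eta * mu_a ** 2)
        new = [alpha, s2, mu_a, s2_a] + beta
        if math.sqrt(sum((a - b) ** 2 for a, b in zip(new, old))) < tol:
            break
    return alpha, beta, s2


def lod(y, z, x_rows, alpha, beta):
    """LOD = LRT / (2 ln 10), where LRT = n * ln(RSS_reduced / RSS_full)."""
    fit = [sum(b * v for b, v in zip(beta, r)) for r in x_rows]
    rss_full = sum((yi - zi * alpha - fi) ** 2 for yi, zi, fi in zip(y, z, fit))
    b0 = least_squares(x_rows, y)                      # reduced model: alpha = 0
    rss_red = sum((yi - sum(b * v for b, v in zip(b0, r))) ** 2
                  for yi, r in zip(y, x_rows))
    return len(y) * math.log(rss_red / rss_full) / (2.0 * math.log(10.0))


# Hypothetical data: 40 lines, one cofactor marker, true QTL effect 3.
n = 40
z = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
cof = [1.0 if (i // 4) % 2 == 0 else -1.0 for i in range(n)]
noise = [0.5 if (i // 2) % 2 == 0 else -0.5 for i in range(n)]
X = [[1.0, c] for c in cof]
y = [10.0 + 3.0 * zi + 1.5 * ci + ei for zi, ci, ei in zip(z, cof, noise)]

alpha_hat, beta_hat, s2_hat = ppmle(y, z, X)
lod_hat = lod(y, z, X, alpha_hat, beta_hat)
```

On this toy data the estimate of $\alpha$ settles slightly below the simulated value of 3, while a very small simulated effect (for example 0.05) is pulled essentially to zero, which is the near-zero baseline behavior the text attributes to shrinkage.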

It will be noted that pairwise interactions may be detected even between QTLs neither of which exerts a main effect.

Simulation studies: The properties of the shrinkIM algorithm were compared with those of conventional interval-mapping methods, based on simulated and real data. The prior value $\eta = 5$ was used in the analysis of simulated and real data; in tests, no difference was found with values of 10 or 20, echoing the finding of Zhang and Xu (2005). The initial values of the prior parameters $\mu_\alpha$ and $\sigma_\alpha^2$ were set to 0 and 0.1. Power to detect a given QTL was calculated as the proportion of replicates showing a LOD peak above threshold within the interval containing the QTL (Haley and Knott 1992; Jiang and Zeng 1995; Zhang and Xu 2005). All calculations were implemented in MATLAB (The MathWorks, Inc.), a mathematical and statistical computing language.

In each of two simulation experiments, RIL populations of 300 individuals were generated based on a 300-cM chromosome with 31 evenly spaced markers. The model for the simulation is

$$y_i = \sum_{j=1}^{n_{QTL}} \alpha_j q_{ij} + \sum_{k=1}^{n_{EPI}} \alpha_{mn} q_{im} q_{in}$$

where $y_i$ is the phenotype of individual $i$, $\alpha_j$ is the main effect of QTL $j$, $\alpha_{mn}$ is the epistatic effect of QTL $m$ and $n$, $n_{QTL}$ is the number of main effects, $n_{EPI}$ is the number of epistatic effects, and $q_{ij}$ is the genotype of QTL $j$ of individual $i$. Environmental error for $y_i$ was sampled from a normal distribution with mean zero and variance $\sigma^2$. In both experiments, the calculation interval (step size) used for interval mapping was 1 cM. Cofactors for CIM were selected by forward stepwise regression; those for PMLE by the criterion $|b_j|/\sigma > 10^{-6}$, where $b_j$ is the estimate of the effect of marker $j$ and $\sigma$ is the estimate of the error standard deviation.

Experiment I: The resolution and background level for the detection of QTL or QTL epistasis in a single simulated population were examined. A RIL population was simulated according to the QTL parameters given in Table 2.1. Two types of analyses were performed to identify QTL main and epistatic effects respectively.
Analysis 1: A one-QTL model was used to detect QTL main effect using shrinkIM and marker-based analyses including single-marker analysis (regression of phenotype on genotypes of individual markers), multiple-marker analysis using PMLE (Zhang and Xu 2005) and the Bayesian approach (Xu 2003). ShrinkIM was compared with simple interval mapping by

regression (SIM) (Haley and Knott 1992), and CIM (Zeng 1994). The EM-based version EM-SIM (Lander and Botstein 1989) was also computed, but since the results were virtually identical to those of SIM, we used this method only for speed comparison. The evidence for the identification of a QTL was evaluated based on QTL effect and LOD score. The same three cofactor markers were used in both shrinkIM and CIM. We also calculated a variant of the main-effect model that included four marker × marker interaction cofactors calculated by PMLE. In order to test the sensitivity of the estimate of QTL effect to the choice of the prior parameters $\mu_\alpha$ and $\sigma_\alpha^2$, we ran a separate set of shrinkIM analyses in which the initial $\mu_\alpha$ and $\sigma_\alpha^2$ were varied independently along the respective ranges [-5:5] and [0.1:1], and the means and standard deviations of QTL effect estimates at each point on the map were computed.

Analysis 2: The QTL-epistasis model was used. ShrinkIM was first compared with PMLE and then with a two-dimensional scan by SIM. In the comparison of shrinkIM and PMLE, only QTL epistatic effect was used as evidence to claim QTL interaction, since a LOD test statistic is not available for PMLE.

Experiment II: We simulated 500 replicates of 300 individuals according to the QTL positions and effects given in Table 2.3. The statistical power, accuracy, and precision of QTL detection using the same three interval-mapping methods were compared at three significance levels: $\alpha$ = 0.05, 0.01 or [value missing]. The LOD threshold for each method was calculated from an additional 2000 simulations with the same total variance of [value missing] but no QTLs.

Analysis of rice data: The phenotypic and genotypic data used for QTL mapping came from a QTL study in rice. A population of 129 RILs from the cross of U.S. rice lines RT0034 × Cypress, genotyped at 155 SSR marker loci along a 1500-cM map of 12 chromosomes, was used for the detection of QTL affecting days to heading. The mean length of marker intervals was 10.6 cM, with the longest interval 40.5 cM.
The population was phenotyped at three locations in Arkansas, Texas and Louisiana with two replicates for each location. Two QTL have been identified from the data of Texas and Louisiana using CIM (results not shown). This prior knowledge was used as a reference for the analysis of the Arkansas data. For simplicity, we analyzed only one replicate from Arkansas to illustrate the difference between results from SIM, CIM and shrinkIM. As with the simulated data, we used the same set of cofactor markers for both shrinkIM and CIM.
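The empirical LOD threshold and the power definition used in the simulation studies above can be sketched as follows. The function names, the toy LOD profiles and the chosen significance level are hypothetical, and the sketch assumes one particular reading of the power definition: a replicate counts as a detection when the highest LOD inside the marker interval containing the simulated QTL exceeds the threshold.

```python
def empirical_threshold(null_max_lods, level):
    """LOD cutoff exceeded by at most a fraction `level` of the null replicates.

    null_max_lods holds the genome-wide maximum LOD from each replicate
    simulated (or permuted) with no QTL present.
    """
    ranked = sorted(null_max_lods)
    n_exceed = int(level * len(ranked) + 1e-9)   # null maxima allowed above the cutoff
    return ranked[len(ranked) - 1 - n_exceed]


def detection_power(profiles, positions, threshold, interval):
    """Proportion of replicates with a LOD above threshold inside the QTL interval."""
    lo, hi = interval
    hits = 0
    for lod in profiles:
        inside = [v for v, pos in zip(lod, positions) if lo <= pos <= hi]
        if inside and max(inside) > threshold:
            hits += 1
    return hits / len(profiles)


# Hypothetical null maxima from 10 no-QTL replicates and a 20% level.
null_maxima = [1.2, 2.8, 0.9, 3.1, 2.0, 1.7, 2.5, 1.1, 2.2, 1.9]
threshold = empirical_threshold(null_maxima, level=0.2)

# Hypothetical LOD profiles from 3 replicates scanned at 4 map positions (cM).
positions = [0, 10, 20, 30]
profiles = [[0.5, 4.0, 1.0, 0.2],
            [0.3, 1.0, 0.8, 0.1],
            [2.0, 3.5, 0.4, 5.0]]
power = detection_power(profiles, positions, threshold, interval=(10, 20))
```

In the third toy replicate the largest LOD lies outside the interval, yet the replicate still counts as a detection because the LOD inside the interval exceeds the threshold; a stricter reading that requires the global peak to fall in the interval would give a lower power.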

Results

Simulation experiment I: Fig. 2.1 shows the more accurate estimation of QTL positions and effects using shrinkIM compared to single-marker or multiple-marker analysis. The background signal from PMLE or Bayesian-based multiple-marker analysis is the same as that of shrinkIM. Fig. 2.2 shows the increased resolution by shrinkIM of closely linked QTLs 1 and 2, based on QTL effect and on LOD score, compared with SIM or CIM. ShrinkIM gave sharper separation than CIM of closely linked QTLs 1 and 2 based on either effect (Fig. 2.2a) or LOD score (Fig. 2.2b) and reduced the background effect to baseline, while SIM was unable to separate the linked QTLs and consistently overestimated QTL and background effects (Fig. 2.2a). For this simulated dataset, the inclusion of marker × marker interactions as cofactors made no appreciable difference to the results.

Table 2.2 compares the computing times used for 1000 permutations in SIM, EM-SIM, CIM, and shrinkIM in analysis 1, showing that shrinkIM is faster than CIM and EM-SIM. We attribute this to the fewer iterations required in the PPMLE step. QTL effect estimates proved to be very insensitive to variation in the initial values for the hyperparameters $\mu_\alpha$ and $\sigma_\alpha^2$. Their standard deviation across at least ten values was less than $10^{-5}$, negligible in comparison with the estimated effect size of ~3.

Fig. 2.3 shows 3D plots of QTL epistatic effect against chromosome positions; not visible is a spurious close double peak produced by the PMLE method. As with main QTL effects, shrinkIM is expected to provide more accurate estimates of the positions of QTL interactions than PMLE, since the latter is limited to testing marker positions, while shrinkIM can localize QTL at any position on the genetic map. Table 2.1 compares position and effect estimates from shrinkIM with those of PMLE for the detection of QTL main and epistatic effect. The background signal of shrinkIM is comparable to that of PMLE (Fig. 2.3a).
2D SIM was not able to identify QTL-QTL interaction based on QTL epistatic effect alone because of strong background, whereas shrinkIM clearly identified three QTL epistatic effects. Fig. 2.4 shows the 3D LOD surface of QTL epistasis using 2D SIM and shrinkIM. 2D SIM finds two QTL interactions, while shrinkIM finds three (Fig. 2.4b). Moreover, the LOD surface produced by shrinkIM is much clearer than that of 2D SIM because of decreased background (Fig. 2.4a).

Experiment II: For the detection of QTL 1 and 2, which have higher heritability than QTL 3, the power of SIM, CIM and shrinkIM was similar. Fig. 2.5 shows the increased power of shrinkIM, compared with SIM or CIM, for the detection of QTL 3 with its relatively lower heritability. The accuracy and precision of estimates of QTL effects and positions are very close for CIM and shrinkIM (Table 2.4).

Analysis of rice data: Fig. 2.6 shows the increased power of shrinkIM for the detection of the QTL on chromosome 6 based on QTL effect or LOD score. In SIM and CIM analysis, the QTL on chromosome 8 was identified, but neither method found the second one, a QTL expressed strongly in the other growing locations and possibly representing Hd6a, a QTL identified near the rice Waxy locus in several other crosses. Position and effect estimates for the two QTL are given in Table 2.5.

Discussion

We have shown the advantages of shrinkIM over conventional SIM and CIM in the detection and identification of QTL or QTL epistasis. ShrinkIM offered higher resolution of closely linked QTL, greater power to identify QTL and more accurate estimates of QTL parameters without increased cost in execution time. The improved statistical properties are due to the control of polygenic background by two steps. The first step is similar to CIM except for the use of PMLE for the selection of cofactors for markers or interactions between markers, which accounts for the genetic variance due to QTL or QTL epistasis elsewhere in the genome. The additional power of shrinkIM is conferred by the reduction of background toward zero in the case of no QTL at a map position.

More than an extension of PMLE from marker-based mapping to interval mapping, shrinkIM inherits the advantages of shrinkage Bayesian, PMLE and penalized regression methods. Though a variant of PMLE, PPMLE offers two apparent improvements on PMLE. First, it limits penalization to the parameters of interest in order to obtain shrinkage estimation, while exploiting the simplicity of least squares.
Second, it eases the dependence of parameter estimates on the prior parameters in PMLE by decreasing the number of penalized parameters in the model.

As a hybrid of shrinkage estimation and least squares, PPMLE is readily extended to handle multi-environment data if the factor effects are treated as fixed. It can also be used for the discovery of genotype-by-environment interaction or for combined analysis based on families

from multiple crosses. If collinearity of factors of a genetic model is problematic, we suggest replacing the ordinary least-squares estimate in the first step of PPMLE with ridge regression. As with other regression-based interval mapping methods, parameter estimates are subject to some bias in the case of sparse marker maps. This is easily remedied by incorporation of the EM algorithm, in which the probability distribution of QTL genotypes is posterior-updated using the flanking markers and phenotype.

The clean background produced by shrinkIM results from shrinkage estimation of QTL main or epistatic effect. It is reasonable to ask whether QTLs of small effect can be excluded by shrinkage of these effects to zero in the whole-genome scan. Wang et al. (2005) showed that the Bayesian shrinkage method could detect a QTL accounting for 2% of phenotypic variance, while Zhang and Xu (2005) showed that PMLE could detect a QTL epistatic effect accounting for only 0.5%. In our simulation shrinkIM detected QTL accounting for 6% of phenotypic variance. In practice, the power of shrinkIM may approximate that of the Bayesian shrinkage method and PMLE because of the similar penalty distribution used in these methods. Further simulation studies should resolve the question.

ShrinkIM combines the merits of the other QTL mapping methods we have considered, in being able to identify QTL or QTL epistasis based on either QTL effect or LOD score. Though the shrinkage Bayesian method and PMLE show excellent performance for the detection of QTL or QTL interactions from their effect estimates, the absence of test statistics for the tested QTL remains an obstacle to applying these methods (Wang et al. 2005). In contrast, for conventional interval mapping such as SIM and CIM, LOD is commonly used as evidence to claim a QTL, but the QTL effect profile cannot be used for this purpose because of noisy background. Like the Bayesian approach, shrinkIM supplies QTL evidence by sharpening the QTL effect profile.

The method proposed here may be extended to ECM-based QTL mapping.
ShrinkIM is a combination of shrinkage and least-squares estimates. Regression-based QTL mapping, though easier to implement and faster to compute, gives biased parameter estimates with sparse markers (Xu 1995) or when QTLs interact or are closely linked (Kao 2001). If we include the posterior probability of QTL genotype given flanking markers and observation in step 1 of our algorithm, the method is easily adapted to ECM-based mapping.

ShrinkIM is being incorporated into QGene 4.0, an open-source Java platform for QTL mapping.

References

Boer M. P., Braak C. J. F. and Jansen R. C., 2002 A penalized likelihood method for mapping epistatic quantitative trait loci with one-dimensional genome searches. Genetics 163:
Churchill G. A. and Doerge R. W., 1994 Empirical threshold values for quantitative trait mapping. Genetics 138:
Groß J., 2003 Linear Regression. Springer, Berlin.
Hackett C. A., Meyer R. C. and Thomas W. T. B., 2001 Multi-trait QTL mapping in barley using multivariate regression. Genetic Research 77:
Haley C. S. and Knott S. A., 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69:
Jansen R. C. Genotype-by-environment interaction in genetic mapping of multiple quantitative trait loci. Theoretical and Applied Genetics 91:
Jiang C. and Zeng Z. B., 1995 Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140:
Korol A. B., Ronin Y. I., Itskovich A. M., Peng J. and Nevo E., 2001 Enhanced efficiency of quantitative trait loci mapping analysis based on multivariate complexes of quantitative traits. Genetics 157:
Lander E. S. and Botstein D., 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121:
Satagopan J. M., Yandell B. S., Newton M. A. and Osborn T. G., 1996 A Bayesian approach to detect quantitative trait loci using Markov chain Monte Carlo. Genetics 144:
Wang H., Zhang Y. M., Li X., Masinde G. L. and Xu S., 2005 Bayesian shrinkage estimation of QTL parameters. Genetics 170:
Xu S., 2003 Estimating polygenic effects using markers of the entire genome. Genetics 163:
Yi N. J. and Xu S., 2000 Bayesian mapping of quantitative trait loci under complicated mating designs. Genetics 157:
Zeng Z. B., 1994 Precision mapping of quantitative trait loci. Genetics 136:
Zhang Y. M. and Xu S., 2005 A penalized maximum likelihood method for estimating epistatic effect of QTL. Heredity 95:

Figure 2.1 Estimated QTL-effect profiles for single-marker, multiple-marker, and shrinkIM analyses. a: single-marker; b: multiple-marker using PMLE; c: multiple-marker Bayesian; d: shrinkIM. Asterisks show the true positions and effects of simulated QTL.

Figure 2.2 Estimated QTL-effect and LOD profiles for SIM, CIM and shrinkIM. a: SIM; b: CIM; c: shrinkIM. Asterisks show the true positions of simulated QTL in a2, b2, c2 and their effects in a1, b1, c1. The horizontal dotted lines represent the empirical p = 0.05 LOD thresholds from 1000 permutations.

Figure 2.3 3D plots of QTL epistatic effects against chromosome positions for a simulated RIL population. a: Left of main diagonal: PMLE analysis; right, shrinkIM. b: Left, 2D IM; right, shrinkIM.

Figure 2.4 3D plots of LOD score of QTL epistasis analysis using 2D IM and shrinkIM. In a and b, the left-hand side of the figure shows 2D IM and the right-hand side shrinkIM. In b, the horizontal surface at LOD 3.39 represents the threshold calculated for 2D IM from 1000 permutations, giving a conservative comparison since the calculated threshold for QTL epistasis analysis using shrinkIM was actually [value missing].

Figure 2.5 The statistical power of QTL detection at three significance levels using SIM, CIM and shrinkIM. a: QTL 1; b: QTL 2; c: QTL 3. The white, gray and black bars represent SIM, CIM and shrinkIM.

Figure 2.6 LOD profiles produced in the analysis of rice data by SIM, CIM and shrinkIM. a: SIM; b: CIM; c: shrinkIM. The horizontal dotted lines in the right-hand plots represent empirical LOD thresholds for the three methods, calculated at significance level 0.05 from 1000 permutation tests. Horizontal axes are on a cM scale; labels indicate rice chromosomes.

Table 2.1 The true values and estimates of QTL parameters in simulation experiment I. Positions are in cM.

Columns: QTL; Main effect (Position, Value); Interaction effect (Position 1, Position 2, Value)
Rows: True values; Estimates from shrinkIM; Estimates from PMLE
Environmental variance: 30

Table 2.2 Computing time required for SIM, CIM and shrinkIM in simulation experiment I. Computing time was evaluated from 1000 permutations for SIM, CIM and shrinkIM. The computer used has a 2-GHz CPU; times are expected to scale similarly on a faster machine.

Method      Computing time (sec)
SIM         32
EM-SIM      697
CIM         855
shrinkIM

Table 2.3 QTL parameters used for simulation experiment II. Positions are in cM.

Columns: QTL; Position; Additive effect; Genetic variance; Total variance; Proportion
Final row: Total

Table 2.4 Estimates of QTL positions and effects for rice data using shrinkIM. Positions are in cM.

Columns: QTL; Chromosome; Position; QTL effect; R

Table 2.5 Comparison of SIM, CIM and shrinkIM in simulation experiment II. Positions are in cM.

Columns: Significance level; Method; LOD threshold; then, for each of QTL 1, QTL 2 and QTL 3: Power (%), Position, SD, Effect, SD
Rows: SIM, CIM and shrinkIM at each of the three significance levels

CHAPTER 3 - Multiple-trait quantitative trait locus mapping with incomplete phenotypic data

Abstract

Conventional multiple-trait quantitative trait locus (QTL) mapping methods must discard cases (individuals) with incomplete phenotypic data, thereby sacrificing other phenotypic and genotypic information contained in the discarded cases. Under standard assumptions about the missing-data mechanism, it is possible to exploit these cases. We present an EM-based algorithm that supports conventional hypothesis tests for QTL main effect, pleiotropy, and QTL-by-environment interaction. Simulations confirm improved QTL detection power and precision of QTL location and effect estimation in comparison with case deletion or imputation methods. The EM method may be incorporated into any least-squares or likelihood-maximization QTL-mapping approach.

Introduction

Statistical methods for identifying and mapping genes controlling complex traits, commonly known as quantitative trait loci or QTL, have been developed to a high degree. The primary focus has been on methods for single traits (Lander and Botstein 1989; Haley and Knott 1992; Jansen 1993; Zeng 1994; Satagopan et al. 1996; Kao and Zeng 1999; Yi and Xu 2003; Wang et al. 2005; and many others). It was proposed (Jiang and Zeng 1995; Korol et al. 1995) that QTL mapping methods that consider simultaneously several correlated phenotypic traits, or a single trait measured in several environments, offer increased detection power and precision of location and effect estimation over single-trait QTL mapping. This is because trait-by-trait QTL searching neglects information contained in the data about the common influence of a QTL on more than one trait or in more than one environment. With the promise of increased power from a multivariate approach comes an interesting problem: what to do when some of the multivariate data are missing.

Two main statistical approaches have been elaborated for multi-trait QTL analysis: regression (Korol et al. 1995, 1998; Calinski et al. 1999; Knott and Haley 2000; Hackett et al. 2001) and maximum likelihood or ML (Jiang and Zeng 1995). Regression QTL-mapping methods, though easier to implement and faster to compute, give biased parameter estimates with sparse markers (Xu 1995) or when QTLs interact or are closely linked (Kao 2001), while ML methods are free of these defects (Kao 2001). It has also been proposed to transform multiple traits into canonical variates so that conventional univariate interval QTL mapping can be applied (Weller et al. 1996; Mangin et al. 1998; Calinski et al. 2000), but interpretation of the results may be difficult.

Though QTL-mapping data are often incomplete, information-recovery methods are at present applied only to genotypic data. For incompletely informative marker-genotype data, posterior distributions are readily estimated from flanking markers in the same individual (Jiang and Zeng 1997). For unknown QTL genotypes at tested positions in map intervals, maximum-likelihood (ML) methods estimate posterior distributions simultaneously with the parameters of a phenotypic mixture distribution (Lander and Botstein 1989), while regression methods (Haley and Knott 1992) replace missing QTL genotypes with their expectations given flanking markers. Variations based on sampling include multiple imputation as described by Sen and Churchill (2001) and Bayesian approaches (e.g. Satagopan et al. 1996; Sillanpää and Arjas 1998, 1999; Yi and Xu 2001; Wang et al. 2005).

In contrast to genotypic data, missing phenotypic data for any trait result in discarding all cases (individuals) lacking even one value, sacrificing all other phenotypic and genotypic information available for these cases. The problem was recognized by Knott and Haley (2000), but they provided no solution. Is there an alternative to this casewise deletion (Allison 2002)? Methods for completion of incomplete multivariate data are of two kinds: by imputation (single or multiple) and by EM algorithm.
Single imputation typically replaces missing data with one of three kinds of values: a value drawn from a specific model-based distribution, a mean calculated from other observations of the same variable, or a conditional mean calculated by least-squares regression on predictors. Multiple imputation (Rubin 1987, 1996) fills in missing data multiple (e.g. 3 to 5) times to produce several complete datasets, with parameter estimates calculated as the average over the results from these datasets. The defect of imputation methods, in analyses such as QTL mapping where we want ML estimates of statistics, is that bias is introduced by

maximization of the likelihood over both original and imputed data. In contrast, the EM algorithm as described by Dempster et al. (1977) focuses not on replacing a missing value with its expectation, but on using the information available in the original dataset. In the framework of EM, the imputed missing data are in effect integrated out of the complete-data log likelihood by iterative refinement of their expectation. Little and Rubin (2001) provided an EM algorithm for incomplete multivariate data, and extended it to accommodate multiple regression with missing responses.

Here we describe an adaptation of Little and Rubin's EM method (2001) to the case of multi-trait QTL mapping with incomplete phenotypic data. We show that the tests for QTL main effects may be constructed as in Jiang and Zeng (1995), and we describe the properties and behavior of the test statistics and of the QTL effect and position estimates based on simulation studies.

Methods

Missing-data mechanism is ignored: Several kinds of missingness have been defined (Rubin 1976). Here we consider only MAR, missing at random, meaning for our purposes that the probability of missing phenotypic data within any genotype class is unrelated to the phenotypic value. Under either MAR or the stronger assumption MCAR, missing completely at random (missingness also independent of genotype), estimation methods need not model a missing-data mechanism.

Multivariate regression with incomplete data: Consider the linear model

$$Y_{n \times m} = X_{n \times p} B_{p \times m} + E_{n \times m}, \qquad (1)$$

where $Y$ is an $(n \times m)$ response matrix with $n$ the number of individuals and $m$ the number of traits (or environments); $X$ is an $(n \times p)$ design matrix with $p$ predictors; $E$ is an error matrix, and each row $E_i$ ($i = 1, 2, \ldots, n$) follows a multivariate normal distribution with means zero and variance-covariance matrix

$$V = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{m1} & \sigma_{m2} & \cdots & \sigma_{mm} \end{pmatrix} \qquad (2)$$

Suppose there are some missing entries in $Y_i$ ($i = 1, 2, \ldots, n$). Now the matrices $Y_i$, $\mu_i = X_i B$, and $V$ may be partitioned as

$$Y_i = [y_i^{obs}, y_i^{mis}], \qquad (3)$$

$$\mu_i = [\mu_i^{obs}, \mu_i^{mis}], \qquad (4)$$

$$V = \begin{pmatrix} V_{obs(i),obs(i)} & V_{obs(i),mis(i)} \\ V_{mis(i),obs(i)} & V_{mis(i),mis(i)} \end{pmatrix}. \qquad (5)$$

For a random sample with $n$ individuals, the log likelihood of the observations is given by

$$l(B, V; Y_{obs}) = -\frac{nm}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{n} \ln|V_{obs,i}| - \frac{1}{2}\sum_{i=1}^{n} (y_i^{obs} - X_i B^{obs})^T\, V_{obs,i}^{-1}\, (y_i^{obs} - X_i B^{obs}) \qquad (6)$$

Since in general it is difficult to calculate the MLEs of the parameters directly by maximizing (6) with respect to the individual parameters, we may adapt Little and Rubin's (2001) EM algorithm to obtain the MLEs of the parameters in model (1) as follows.

ALGORITHM 1: Starting with initial values $\hat\theta^{(0)} = [\hat B^{(0)}, \hat\mu^{(0)}, \hat V^{(0)}]$, iterate the following two steps until convergence.

E step:

$$\hat y_i^{mis(k+1)} = E(y_i^{mis} \mid y_i^{obs}, \hat\theta^{(k)}) = \hat\mu_i^{mis(k)} + (y_i^{obs} - \hat\mu_i^{obs(k)})\, \hat V_{obs,obs}^{-1(k)}\, \hat V_{obs,mis}^{(k)}, \qquad (7)$$

$$y_i^{(k+1)} = (y_i^{obs}, \hat y_i^{mis(k+1)}). \qquad (8)$$

M step:

$$\hat B^{(k+1)} = (X^T X)^{-1} X^T Y^{(k+1)}, \qquad (9)$$

$$\hat\mu^{(k+1)} = X \hat B^{(k+1)}, \qquad (10)$$

$$\hat V^{(k+1)} = \frac{(Y^{(k+1)} - \hat\mu^{(k+1)})^T (Y^{(k+1)} - \hat\mu^{(k+1)})}{n}. \qquad (11)$$

Multi-trait QTL mapping with incomplete phenotypic data by regression: We now describe our multi-trait QTL mapping method with incomplete data. Though the method given is based on a recombinant inbred line (RIL) population, it is easily extended to other mating designs such as F2 or BC. The statistical model for multiple-trait analysis (Jiang and Zeng 1995, Korol et al. 1995, Hackett et al. 2001) based on complete phenotypic data is

$$Y_{n \times m} = z_{n \times 1}\, a_{1 \times m} + x_{n \times (p+1)}\, b_{(p+1) \times m} + E_{n \times m} \qquad (12)$$

where $Y$ is an $n \times m$ matrix of phenotypic observations with $n$ lines and $m$ traits, $Y = [y_1, y_2, \ldots, y_n]$, and $y_1, y_2, \ldots, y_n$ are $1 \times m$ vectors; $z$ is an $n \times 1$ matrix of QTL genotypes represented as 2 for QQ and 0 for qq; $a$ is a $1 \times m$ matrix of additive effects of a putative QTL at a tested position; $x$

is an $n \times (p + 1)$ matrix of genotypes of $p$ cofactor markers with the first column ones; $b$ is a $(p + 1) \times m$ matrix of cofactor marker effects; and $E$ is an $n \times m$ matrix of residual errors $e_{ij}$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$), which are assumed to be correlated between traits and to follow a multivariate normal distribution with means zero and covariance matrix as in (2). In this model, the QTL genotype is replaced with its conditional expectation given flanking-marker genotypes. Least-squares estimates of the parameters can then be obtained by multiple regression.

Now suppose missing values occur in some lines for some traits. Model (12) may be rewritten as model (1), with

$$z_{(n \times 1)}\, a_{(1 \times m)} + x_{(n \times (p+1))}\, b_{((p+1) \times m)} = X_{(n \times (p+2))}\, B_{((p+2) \times m)}, \qquad (13)$$

and parameter estimates obtained by ALGORITHM 1.

Multi-trait QTL mapping with incomplete phenotypic data by ECM: Instead of replacing a missing QTL genotype with its expectation given flanking markers, ECM (expectation/conditional maximization) treats QTL genotype as missing data included in model (12) and estimates parameters at a QTL position by repeatedly updating the posterior probability of QTL genotype given both flanking-marker genotypes and phenotypes. Since we now have two types of missing data in model (12), QTL genotype and phenotype, we may extend Jiang and Zeng's (1995) ECM method for multi-trait QTL mapping as follows:

ALGORITHM 2: Starting with initial values of the parameters $\hat\theta^{(0)} = [\hat a^{(0)}, \hat b^{(0)}, \hat\mu^{(0)}, \hat V^{(0)}]$, iterate the following two steps until convergence.

E step:

$$q_{i1}^{(k+1)} = \frac{p_{i1}\, f(y_i^{obs} \mid \hat\mu_{i,QQ}^{(k)}, \hat V^{(k)})}{p_{i1}\, f(y_i^{obs} \mid \hat\mu_{i,QQ}^{(k)}, \hat V^{(k)}) + p_{i2}\, f(y_i^{obs} \mid \hat\mu_{i,qq}^{(k)}, \hat V^{(k)})}, \qquad (14)$$

$$q_{i2}^{(k+1)} = \frac{p_{i2}\, f(y_i^{obs} \mid \hat\mu_{i,qq}^{(k)}, \hat V^{(k)})}{p_{i1}\, f(y_i^{obs} \mid \hat\mu_{i,QQ}^{(k)}, \hat V^{(k)}) + p_{i2}\, f(y_i^{obs} \mid \hat\mu_{i,qq}^{(k)}, \hat V^{(k)})}, \qquad (15)$$

where $p_{i1}$ and $p_{i2}$ are the conditional probabilities of QTL genotypes QQ and qq given flanking markers, $f$ is the multivariate normal probability density function, and $q_{i1}$ and $q_{i2}$ are the posterior probabilities of the QTL genotypes given flanking markers and phenotypes (Jiang and Zeng 1995).
$$\hat y_i^{mis(k+1)} = E(y_i^{mis} \mid y_i^{obs}, \hat\theta^{(k)}) = \hat\mu_{i,E}^{mis(k)} + (y_i^{obs} - \hat\mu_{i,E}^{obs(k)})\,\hat V_{obs,obs}^{-1(k)}\,\hat V_{obs,mis}^{(k)}, \tag{16}$$

$$y_i^{(k+1)} = (y_i^{obs},\ \hat y_i^{mis(k+1)}). \tag{17}$$
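The conditional-expectation fill-in of (16), which has the same form as (7) in ALGORITHM 1, can be sketched for one line as follows. The function name `fill_missing` and the NaN coding of missing values are assumptions made for illustration; `mu` stands for the current mean of the line (the corresponding row of $\hat\mu_E$ in ALGORITHM 2) and `V` for the current covariance estimate.

```python
import numpy as np

def fill_missing(y, mu, V):
    """Return a copy of the trait vector y with NaNs replaced by E(y_mis | y_obs).

    y  : 1-D array of m trait values, missing values coded as NaN (assumed)
    mu : 1-D array of current means for this line
    V  : m x m current covariance matrix
    """
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    mis = ~obs
    filled = y.copy()
    if mis.any() and obs.any():
        V_oo = V[np.ix_(obs, obs)]
        V_om = V[np.ix_(obs, mis)]
        resid = y[obs] - mu[obs]
        # mu_mis + (y_obs - mu_obs) V_oo^{-1} V_om; V_oo is symmetric, so
        # solve(V_oo, resid) equals resid @ inv(V_oo).
        filled[mis] = mu[mis] + np.linalg.solve(V_oo, resid) @ V_om
    elif mis.any():
        # Nothing observed: the conditional mean reduces to the marginal mean.
        filled[mis] = mu[mis]
    return filled
```

For two traits with correlation 0.5, unit variances and zero means, an observed value of 2 for the first trait gives a filled-in value of 1 for the second.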

M step:

$$\hat a^{(k+1)} = \frac{0.5\,q_1^{(k+1)T}\,(Y^{(k+1)} - x\hat b^{(k)})}{q_1^{(k+1)T}\,l}, \tag{18}$$

where $q_1^{(k+1)}$ is the $n \times 1$ vector of the posterior probabilities $q_{1i}^{(k+1)}$ and $l$ is an $(n \times 1)$ matrix of ones.

$$\hat b^{(k+1)} = (x^{T}x)^{-1}x^{T}\,[\,Y^{(k+1)} - 2q_1^{(k+1)}\hat a^{(k+1)}\,], \tag{19}$$

$$\hat\mu_E^{(k+1)} = x\hat b^{(k+1)} + 2q_1^{(k+1)}\hat a^{(k+1)}, \tag{20}$$

$$\hat\mu_{QQ}^{(k+1)} = x\hat b^{(k+1)} + 2\,l\,\hat a^{(k+1)}, \tag{21}$$

$$\hat\mu_{qq}^{(k+1)} = x\hat b^{(k+1)}, \tag{22}$$

$$\hat V^{(k+1)} = \frac{(Y^{(k+1)} - \hat\mu_E^{(k+1)})^{T}(Y^{(k+1)} - \hat\mu_E^{(k+1)})}{n}. \tag{23}$$

Hypothesis tests: Hypothesis tests for QTL main effects, pleiotropy effects and close linkage vs. pleiotropy are constructed according to Jiang and Zeng (1995) and can be carried out with ALGORITHM 1 if regression is chosen or ALGORITHM 2 if the ECM method is used. The test statistic LR or LOD follows an asymptotic chi-square distribution with degrees of freedom determined by the specific hypothesis test (Jiang and Zeng 1995). For example, to test main QTL effects in a two-trait example, the hypotheses can be formulated as $H_0$: $a_1 = 0$, $a_2 = 0$ and $H_1$: $a_1 \neq 0$, $a_2 \neq 0$. For the regression method, parameters under $H_0$ or $H_1$ are estimated by ALGORITHM 1 (Equations 7-11) depending on whether or not QTL effects are included in model (13). If the ECM method is used, first these quantities are estimated under $H_0$ by ALGORITHM 1 without inclusion of QTL effect and then those of the full model under $H_1$ can be obtained by ALGORITHM 2 (Equations 14-23). Then the likelihood ratio (LR) can be obtained as $LR = -2(l_{reduced} - l_{full})$, where $l_{reduced}$ is the log likelihood of the reduced model, corresponding to $H_0$, and $l_{full}$ is that of the full model, corresponding to $H_1$ (Lander and Botstein 1989). Both are calculated from (6) and a LOD score is calculated as $LR/(2 \ln 10)$.

Simulation methods: To compare the properties of the EM method with those of casewise deletion (CaD), mean substitution (MS), conditional mean substitution (CMS) and complete data (CoD), we performed simulation experiments. RIL populations from line crosses with 100, 200 and 300 individuals were generated based on a 300-cM chromosome with 31 evenly spaced markers.
For CMS, missing data were replaced with their conditional expectations calculated by regression of each trait on the other(s). Three pleiotropic QTLs controlling two traits were simulated at cM positions 53, 182, and 258 with effects listed in Table 3.1. Trait values of each line were calculated as the sum of QTL effects plus a random vector of environmental effects with means zero and variance given in Table 3.1. Then a specified proportion (0.05, 0.10, 0.20, or 0.40) of values for each trait independently were set to missing. Lines lacking data for both traits were dropped. Analyses were performed on 500 replicates. In the QTL analyses, the calculation interval (step size) used was 1 cM. Cofactor markers for each trait were selected by forward stepwise regression at a significance level of 0.01 and combined for multi-trait analysis. Cofactors lying within 10 cM of a QTL testing position were dropped from the model. Genome-wide LOD thresholds of 3.71, 3.54 and 3.43 for n = 100, 200, and 300 at significance level 0.05 were calculated from 5000 simulations under the null hypothesis of no QTL (Knott and Haley 2000). When sample size or heritability is relatively small, the effect of a QTL may extend to adjacent intervals due to limited recombination between these intervals and the QTL. So a QTL was declared if a LOD peak higher than threshold was found within the interval containing the simulated QTL and its two flanking intervals. Power of QTL detection was calculated as the number of correctly declared ("true positive") QTLs divided by the number of actual QTLs simulated, while specificity was calculated as the number of true positive QTLs divided by the number of QTLs declared.

Results

Power: As expected, power was highest when data were complete (Table 3.2, Figure 3.1). When data were missing, EM, MS and CMS gave power superior to CaD in all cases. MS and CMS gave similar power, equal to or lower than that of EM. The gain in power for EM over CaD increased with the proportion of missing data. This trend was also seen for gain in power of EM over MS or CMS, but to a lower degree.
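The declaration rule and the power and specificity definitions given in the simulation methods can be sketched as below. The function names are hypothetical, the 10-cM marker spacing follows from 31 evenly spaced markers on 300 cM, and the sketch simplifies "LOD peak" to the maximum LOD value inside the three-interval window, which is an assumption rather than the exact rule used in the simulations.

```python
import numpy as np

def qtl_declared(positions, lod, qtl_pos, threshold, spacing=10.0):
    """True if the LOD profile exceeds `threshold` within the marker interval
    containing `qtl_pos` or either of its two flanking intervals."""
    left = np.floor(qtl_pos / spacing) * spacing - spacing   # start of left flank
    right = left + 3 * spacing                                # end of right flank
    window = (positions >= left) & (positions <= right)
    return bool(window.any() and lod[window].max() > threshold)

def power_and_specificity(n_true_positive, n_simulated, n_declared):
    """Power = true positives / simulated QTLs;
    specificity = true positives / declared QTLs (NaN if none declared)."""
    power = n_true_positive / n_simulated
    specificity = n_true_positive / n_declared if n_declared else float("nan")
    return power, specificity
```

For the QTL simulated at 53 cM, the window under these assumptions runs from 40 to 70 cM, so a LOD value of 4.0 at 65 cM would count as a detection at the n = 200 threshold of 3.54, while the same value at 75 cM would not.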
In Table 3.2, it is seen that EM gave QTL detection power about equal to that supplied by CaD with half the proportion of missing data. Simple probability calculations yield the numbers to which this power relationship corresponds. As an example, in a population of size 300 with 0.4 of the data missing from each of two traits, the EM method was operating on only 108 lines carrying complete data and another 144 lines with partial data, but achieved power corresponding to approximately 192 lines with complete data. The increase in effective

(equivalent-power) number of complete records achieved by the EM method can be estimated graphically from Figure 3.2. Here the effective complete-data sample sizes achieved by EM were about 271, 255, 230, and 190, representing gains of 1, 12, 38 and 82 over the number of complete records available for CaD at missing levels of 0.05, 0.1, 0.2 and 0.4.

Specificity and QTL position: All the methods gave similar specificity for QTL detection, except that CaD gave decreased specificity with increasing proportions of missing data (Table 3.3).

Accuracy and precision of effect estimation: All methods gave reasonable estimates of QTL positions. CoD and CaD provided the highest and lowest precisions for QTL position estimation (Figure 3.3), while those of MS, CMS, and EM were very similar and intermediate. For QTL effects (Figure 3.4), CoD, CaD and EM provided unbiased estimates, while both MS and CMS underestimated these parameters, CMS by slightly less. The extent of underestimation tended to increase with missing percentage and decrease with sample size (not shown here).

Discussion

The EM-based multi-trait QTL mapping method we propose here is superior to mean substitution and conditional mean substitution for several reasons. MS underestimates phenotypic variation and QTL effect due to fill-in of missing data with a single value, resulting in decreased power compared with our method, especially when amounts of missing data are relatively large. The same trend can be observed for CMS, which, as a precursor of the EM algorithm, is closely related to a single EM iteration (Little and Rubin 2001). Although CMS improved estimates of QTL effect compared with MS, it still underestimates variance (Little and Rubin 2001). While we did not include multiple imputation (MI) (Rubin 1987, 1996) in the simulation study, we doubt its potential utility for multi-trait QTL mapping with missing trait data. We investigated MI by filling in missing trait data with values sampled from their conditional distribution under the null and alternative hypotheses given the observed trait values.
Resulting LOD profiles were sawtoothed (Fig. 3.5) due to random sampling, and a different profile could be obtained with each analysis even with many imputations (e.g. 100 compared with 3-5 in regular MI) performed at each QTL test position. For these reasons, apart from the high computational cost, we did not pursue this method further.
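The record counts quoted in the Results and in this Discussion (for example, 108 complete and 144 partially observed lines when n = 300 and 0.4 of each trait is missing) follow from the simulation design, in which values of the two traits are set to missing independently. A minimal sketch of that arithmetic, with a hypothetical function name, is:

```python
def expected_records(n, p):
    """Expected numbers of lines with both traits, exactly one trait, and
    neither trait observed, when each of two traits is missing independently
    with probability p in a sample of n lines."""
    complete = n * (1 - p) ** 2        # usable by casewise deletion
    partial = n * 2 * p * (1 - p)      # additionally usable by EM
    neither = n * p ** 2               # dropped by every method
    return complete, partial, neither
```

With n = 200 and p = 0.1 this gives 162 complete and 36 partial lines, so EM works with 198 lines; with p = 0.4 it gives 72 and 96.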

For MS, CMS, and even MI, the effects of introducing imputed data on QTL mapping need further study. Although simulation results showed specificities close to those of our method, complete-data analysis, and casewise deletion, the bias imposed on the LOD test statistic by introduction of these artificial data remains unknown. In fact, imputation of missing data is also performed in the E step of our EM algorithm. But this kind of imputation only furnishes a pivot to facilitate parameter estimation and is actually not involved in the likelihood calculation. Thus, theoretically, the EM-based method does not bias QTL detection and parameter estimation as imputation methods may.

The information gain of our method over CaD, MS, and CMS depends on the amount of missing trait data. The reason is readily explained by the following example for CaD. Consider a sample of 200 individuals with missing proportion 0.1 for each of two traits independently. The average number of individuals available for CaD is 162 and that for EM 198, and the difference is 36. This difference expands to 96 with a missing proportion of 0.4. In other words, power is lost more slowly with data loss when the information-recovering EM method is applied.

Some extensions of the EM method are promising. First, we have derived the EM calculation of the hypothesis test for QTL main effect. By following the procedure of Jiang and Zeng (1995), one may derive specific EM implementations for other hypothesis tests, including those for QTL-by-environment interaction, pleiotropy, and pleiotropy vs. close linkage. Second, the EM method may be extended to multiple interval mapping (Kao et al. 1999) with multiple traits and incomplete phenotypic data. Third, mixed-model QTL mapping as recommended by Jiang and Zeng (1995) can now be applied to incomplete trait data as an alternative method for multi-trait QTL mapping. When multiple traits are actually different expressions of a single trait in different environments (locations or years), a mixed model allows treating environmental effect as a random and QTL effect as a fixed factor (Wang et al. 1999; Piepho 2000).
One of the advantages of the mixed model is in accommodating both balanced and unbalanced data structure.

The method we have presented requires more computing time than the conventional EM or ECM interval-mapping algorithm. There are two reasons for this. First, to obtain parameter estimates, the EM algorithm must be applied under both null and alternative hypotheses, because the trait data are missing in both cases. In contrast, conventional methods require EM iteration only under the alternative hypothesis. Second, our EM algorithm is used to complete both QTL

genotype and phenotype in the case of ML-based QTL mapping, while the conventional method must complete only QTL genotype. The computing load increases with the proportion of missing data, but the extreme amounts of missing data we have simulated are unusual in real experiments.

References

Allison P. D., 2002 Missing Data. Sage Publications, Thousand Oaks, Calif.
Calinski T., Kaczmarek Z., Krajewski P., Frova C. and Sari-Gorla M., 1999 A multivariate approach to the problem of QTL localization. Heredity 84:
Churchill G. A. and Doerge R. W., 1994 Empirical threshold values for quantitative trait mapping. Genetics 138:
Dempster A. P., Laird N. M. and Rubin D. B., 1977 Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39:
Hackett C. A., Meyer R. C. and Thomas W. T. B., 2001 Multi-trait QTL mapping in barley using multivariate regression. Genetical Research 77:
Haley C. S. and Knott S. A., 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69:
Jansen R. C., 1993 Interval mapping of multiple quantitative trait loci. Genetics 135:
Jiang C. J. and Zeng Z. B., 1995 Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140:
Jiang C. J. and Zeng Z. B., 1997 Mapping quantitative trait loci with dominant and missing markers in various crosses from two inbred lines. Genetics 101:
Kao C. H., 2000 On the difference between maximum likelihood and regression interval mapping in the analysis of quantitative trait loci. Genetics 156:
Kao C. H., Zeng Z. B. and Teasdale R. D., 1999 Multiple interval mapping for quantitative trait loci. Genetics 152:
Knott S. A. and Haley C. S., 2000 Multitrait least squares for quantitative trait loci detection. Genetics 156:
Korol A. B., Ronin Y. I. and Kirzhner V. M., 1995 Interval mapping of quantitative trait loci employing correlated trait complexes. Genetics 140:

Korol A. B., Ronin Y. I., Nevo E. and Hayes P. M. Multi-interval mapping of correlated trait complexes. Heredity 80:
Lander E. S. and Botstein D., 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121:
Little R. J. A. and Rubin D. B., 2001 Statistical Analysis with Missing Data. John Wiley & Sons, Hoboken, New Jersey.
Mangin B., Thoquet P. and Grimsley N., 1998 Pleiotropic QTL analysis. Biometrics 54:
Sillanpää M. J. and Arjas E., 1998 Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics 148:
Sillanpää M. J. and Arjas E., 1999 Bayesian mapping of multiple quantitative trait loci from incomplete outbred offspring data. Genetics 151:
Piepho H. P., 2000 A mixed-model approach to mapping quantitative trait loci in barley on the basis of multiple environment data. Genetics 156:
Rubin D. B., 1976 Inference and missing data. Biometrika 63:
Rubin D. B., 1987 Multiple Imputation for Nonresponse in Surveys. Wiley, New York.
Rubin D. B., 1996 Multiple imputation after 18+ years. Journal of the American Statistical Association 91:
Satagopan J. M., Yandell B. S., Newton M. A. and Osborn T. G., 1996 A Bayesian approach to detect quantitative trait loci using Markov chain Monte Carlo. Genetics 144:
Sen S. and Churchill G. A., 2001 A statistical framework for quantitative trait mapping. Genetics 159:
Wang D. L., Zhu J., Li Z. K. and Paterson A. H., 1999 Mapping QTLs with epistatic effects and QTL × environment interactions by mixed linear model approaches. Theoretical and Applied Genetics 99:
Wang H., Zhang Y. M., Li X., Masinde G. L. and Mohan S., 2005 Bayesian shrinkage estimation of QTL parameters. Genetics 170:
Weller J. I., Wiggans G. R., Van Raden P. M. and Ron M., 1996 Application of a canonical transformation to detection of quantitative trait loci with the aid of genetic markers in a multi-trait experiment. Theoretical and Applied Genetics 92:
Yi N. J. and Xu S., 2001 Bayesian mapping of quantitative trait loci under complicated mating designs. Genetics 157:

Figure 3.1 Statistical power of five multiple-trait QTL-mapping methods with four levels of missing data.

Figure 3.2 Power of QTL 1 detection after casewise deletion and by the EM method as a function of the number of complete trait records. The power used is evaluated over 500 replicates of simulations with 200 RILs.


Figure 3.3 Means and standard deviations (SDs) of estimates of QTL position by multi-trait QTL analyses. Means and SDs of estimates of QTL position were calculated over 500 replicates of simulations with 200 RILs. Missing percentage for each trait is ... White, gray and black bars represent QTLs 1, 2 and 3. CoD: complete data analysis; CaD: casewise deletion; MS: mean substitution; CMS: conditional mean substitution; EM: EM algorithm.
