Linkage analysis and QTL mapping in autotetraploid species. Christine Hackett Biomathematics and Statistics Scotland Dundee DD2 5DA


Collaborators: John Bradshaw, Zewei Luo, Iain Milne, Jim McNicol. Data and useful biological discussions from Barnaly Pande, Robbie Waugh, Dan Milbourne, Glenn Bryan, Karen McLean and Rhonda Meyer.

Outline. Part 1: Segregation analysis, Cluster analysis, Linkage analysis. Part 2: QTL analysis.

Part 1: Segregation analysis and Linkage analysis. Segregation analysis: identify parental genotypes from parent and offspring phenotypes. Cluster analysis: partition markers into linkage groups. Linkage analysis: (1) estimate the most likely phase between each pair of markers and calculate recombination frequencies and lod scores; (2) order markers to form linkage maps; (3) establish marker phases for the completed linkage group.

Inheritance in tetraploids - random chromosomal segregation. [Figure: the four homologous chromosomes 1 2 3 4 of a parent pair at random, as 1-2 and 3-4, or 1-3 and 2-4, or 1-4 and 2-3; recombination then takes place, giving the alternative gametes.]

Segregation Analysis: Gamete formation. A parent with 4 alleles abcd can produce gametes ab, ac, ad, bc, bd, cd with equal probability. There is also a small probability of producing gametes aa, bb, cc, dd by double reduction: probability of aa etc. = α/4 and probability of ab etc. = (1 - α)/6, where α is the coefficient of double reduction. When crossed with a second parent efgh, there are 36 offspring genotypes if there is no double reduction and 100 if double reduction occurs. In practice it is very unusual to have 8 different alleles. Null alleles can also occur.
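The gamete probabilities above can be checked by direct enumeration. The following is a minimal sketch, not TetraploidMap code; the function name gamete_probabilities and the example value α = 1/10 are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

def gamete_probabilities(alleles, alpha):
    """Return {gamete: probability} for a tetraploid parent with 4 alleles.

    alpha is the coefficient of double reduction (supplied, not estimated).
    """
    alpha = Fraction(alpha)
    probs = {}
    # Six gametes carrying two different parental alleles, each (1 - alpha)/6.
    for pair in combinations(alleles, 2):
        probs["".join(pair)] = (1 - alpha) / 6
    # Four double-reduction gametes (aa, bb, cc, dd), each alpha/4.
    for a in alleles:
        probs[a + a] = alpha / 4
    return probs

# Hypothetical example value for alpha; alpha = 0 means no double reduction.
with_dr = gamete_probabilities("abcd", Fraction(1, 10))
no_dr = {g: p for g, p in gamete_probabilities("abcd", 0).items() if p > 0}

n_offspring_no_dr = len(no_dr) ** 2   # 6 x 6 gamete combinations
n_offspring_dr = len(with_dr) ** 2    # 10 x 10 gamete combinations
```

With these definitions the six ordinary gametes and four double-reduction gametes sum to probability 1, and the offspring counts reproduce the 36 and 100 quoted on the slide.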

Theoretical segregation ratios (no double reduction): simplex, duplex, double-simplex. [Table of ratios not shown.]
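For a dominant marker scored as band present or absent, the expected ratios can be worked out by enumerating the six equally likely gametes of each parent. This sketch assumes genotypes written as Aooo (simplex), AAoo (duplex) and Aooo x Aooo (double-simplex), with o a null allele; the function name presence_ratio is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def presence_ratio(parent1, parent2):
    """Probability that an offspring shows the band (carries at least one 'A')."""
    gametes1 = list(combinations(parent1, 2))  # 6 equally likely gametes
    gametes2 = list(combinations(parent2, 2))
    present = sum(1 for g1 in gametes1 for g2 in gametes2 if "A" in g1 + g2)
    return Fraction(present, len(gametes1) * len(gametes2))

simplex = presence_ratio("Aooo", "oooo")         # 1/2, i.e. 1:1 present:absent
duplex = presence_ratio("AAoo", "oooo")          # 5/6, i.e. 5:1
double_simplex = presence_ratio("Aooo", "Aooo")  # 3/4, i.e. 3:1
```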


Use of cluster analysis. (a) Cluster analysis of simplex markers to determine homologous chromosomes: the distance between markers is related to recombination frequency; label markers to show their cluster. (b) Cluster analysis of all markers: calculate a χ² test for independent segregation; the distance between markers is related to significance. Dendrograms are based on single linkage and average linkage cluster analysis.
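As an illustration of the test for independent segregation in (b), here is a minimal Pearson χ² statistic for two presence/absence markers. It is a sketch only: the function name and the example scores are made up, and treating any score other than 0 or 1 as missing is an assumption.

```python
def chi_square_2x2(marker_a, marker_b):
    """Pearson chi-square statistic (1 d.f.) for two lists of 0/1 marker scores.

    Offspring with any other score for either marker are skipped (assumed missing).
    Raises ZeroDivisionError if a marker is all 0 or all 1 among the scored offspring.
    """
    counts = [[0, 0], [0, 0]]
    for a, b in zip(marker_a, marker_b):
        if a in (0, 1) and b in (0, 1):
            counts[a][b] += 1
    n = sum(map(sum, counts))
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (counts[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical scores: m1 and m2 tend to co-occur, m1 and m3 do not.
m1 = [1, 1, 1, 1, 0, 0, 0, 0]
m2 = [1, 1, 1, 0, 0, 0, 0, 1]
m3 = [1, 0, 1, 0, 1, 0, 1, 0]
linked_stat = chi_square_2x2(m1, m2)
unlinked_stat = chi_square_2x2(m1, m3)
```

A larger statistic means stronger evidence against independent segregation, so a clustering distance can be taken as any decreasing function of it.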

Cluster analysis. Groups of simplex markers on homologous chromosomes are pulled together by other markers.


Linkage Analysis. Assume no double reduction. Calculate the recombination frequency and lod score between each pair of markers in a linkage group, for each possible phase.
[Figure: chromosomes 1 2 3 4 of one parent carrying markers A, B and C.] A, B, C are simplex markers, each present on one chromosome of one parent.

             B present   B absent
A present    n_1         n_2
A absent     n_3         n_4

If the markers are in coupling phase (e.g. A, B), r.f. = (n_2 + n_3)/n, where n = n_1 + n_2 + n_3 + n_4. In repulsion phase, this will give an r.f. > 0.5. Here r.f. = 3(n_1 + n_4)/n - 1. NB: for diploid repulsion, r.f. = (n_1 + n_4)/n.
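The repulsion estimator can be motivated as follows, assuming random bivalent pairing. The two chromosomes carrying the markers pair with probability 1/3, in which case "both present" or "both absent" requires a recombinant (probability r). Otherwise, with probability 2/3, the markers segregate independently and these two classes have total probability 1/2. Hence (n_1 + n_4)/n estimates r/3 + 1/3, giving r = 3(n_1 + n_4)/n - 1. A small sketch with hypothetical counts and function names:

```python
def rf_coupling(n1, n2, n3, n4):
    """Simplex markers in coupling: recombinants are the discordant classes."""
    n = n1 + n2 + n3 + n4
    return (n2 + n3) / n

def rf_repulsion_tetraploid(n1, n2, n3, n4):
    """Simplex markers in repulsion in a tetraploid, no double reduction."""
    n = n1 + n2 + n3 + n4
    return 3 * (n1 + n4) / n - 1

def rf_repulsion_diploid(n1, n2, n3, n4):
    """Diploid repulsion, shown for comparison."""
    n = n1 + n2 + n3 + n4
    return (n1 + n4) / n

# Hypothetical counts (n1, n2, n3, n4), each chosen to correspond to r = 0.2.
r_coupling = rf_coupling(80, 20, 20, 80)
r_repulsion = rf_repulsion_tetraploid(40, 60, 60, 40)
r_diploid = rf_repulsion_diploid(10, 40, 40, 10)
```

The factor of 3 shows why repulsion-phase simplex pairs carry less information in a tetraploid than in a diploid: the same sampling error in (n_1 + n_4)/n is tripled in the estimate of r.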

Linkage Analysis. The most informative situation would be 8 different alleles among the parents at each locus, i.e. aa/bb/cc/dd x ee/ff/gg/hh. This gives 36 x 36 = 1296 types of offspring! For each phase, we can classify the number of recombinants. Parent 1 gametes: 6 of type aa/bb (0 recombinants), 24 of type ac/bb (1 recombinant), 6 of type ac/bd (2 recombinants). The same holds for parent 2.
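The 6 / 24 / 6 split of gamete types can be reproduced by enumeration. In this sketch a gamete carries a pair of parental labels at each locus, and a chromosome counts as non-recombinant when the same label appears at both loci. The variable names are illustrative, and the counts are numbers of types, not probabilities.

```python
from collections import Counter
from itertools import combinations

chromosomes = "abcd"
pairs = list(combinations(chromosomes, 2))  # label pairs a gamete can carry at one locus

gamete_classes = Counter()
for locus1 in pairs:
    for locus2 in pairs:
        # Labels shared between the two loci mark non-recombinant chromosomes.
        shared = len(set(locus1) & set(locus2))
        gamete_classes[2 - shared] += 1

# An offspring combines one gamete from each parent, so recombinant counts add up.
offspring_classes = Counter()
for k1, c1 in gamete_classes.items():
    for k2, c2 in gamete_classes.items():
        offspring_classes[k1 + k2] += c1 * c2
```

Here gamete_classes holds 6, 24 and 6 types for 0, 1 and 2 recombinants, and offspring_classes splits the 1296 offspring types into classes with 0 to 4 recombinants.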

Linkage Analysis. Offspring from parents aa/bb/cc/dd x ee/ff/gg/hh: the recombination frequency is [equation not shown]. In general, not all alleles will be different. However, we can estimate the frequencies associated with 0-4 recombinants, and hence the recombination frequency, via an EM algorithm.
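When every offspring can be assigned a number of recombinant chromosomes, one natural estimate is the proportion of recombinant chromosomes among the four inherited by each offspring. The sketch below assumes that each chromosome is recombinant independently with probability r (a binomial model); it is not necessarily the formula shown on the slide, and the counts are hypothetical.

```python
def recombination_frequency(counts):
    """counts[k] = number of offspring with k recombinant chromosomes, k = 0..4."""
    n = sum(counts)
    recombinant = sum(k * c for k, c in enumerate(counts))
    # Each offspring contributes 4 chromosomes (2 from each parent).
    return recombinant / (4 * n)

# Hypothetical counts for 100 offspring with 0, 1, 2, 3, 4 recombinants.
r_hat = recombination_frequency([41, 41, 15, 3, 0])
```

In the general case, where the class of each offspring is ambiguous, an EM algorithm would replace the observed counts by their expected values given the current estimate of r and then re-apply the same formula.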

Linkage phase. The linkage phase must also be taken into consideration, e.g. aa/bb/cc/dd x ee/ff/gg/hh and ab/ba/cc/dd x ee/ff/gg/hh are different phases and will give different probabilities for the offspring classes. There are up to 24 phases for each parent, so at most 24 x 24 = 576 combinations; we have to estimate the recombination frequency for each and compare the likelihoods to see which phase is the most likely. The information associated with r also depends on the phase.
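The count of 24 comes from the 4! ways of assigning the alleles at the second locus to the four chromosomes. A short sketch, using the slide's aa/bb/cc/dd notation and a hypothetical helper name:

```python
from itertools import permutations

def phases(locus2_alleles, locus1_alleles="abcd"):
    """Distinct phases: assignments of second-locus alleles to chromosomes 1-4."""
    found = {
        "/".join(x + y for x, y in zip(locus1_alleles, perm))
        for perm in permutations(locus2_alleles)
    }
    return sorted(found)

all_distinct = phases("abcd")   # 24 phases when all four alleles differ
simplex_like = phases("Aooo")   # fewer phases when alleles are repeated
max_phase_pairs = len(all_distinct) ** 2
```

This is why the slide says "up to" 24: repeated or null alleles make some assignments indistinguishable, so the number of phases that actually need to be compared is often much smaller.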

Marker ordering. After estimation of the recombination frequency between all pairs of markers, the markers are ordered by optimising a weighted least squares criterion (the same as JoinMap 3). Three ordering methods are available: an initial run (based on seriation), a ripple search, and a simulated annealing search (slow). Recommendation: run a ripple search initially to see whether there are any spurious markers, then use a simulated annealing search to get the final order.
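To make the idea of a ripple search concrete, the sketch below slides a window of three markers along the order and keeps any permutation within the window that improves the criterion. For brevity the criterion is the sum of adjacent recombination fractions, used here as a simple stand-in for the weighted least squares criterion. The marker names, positions and function names are all hypothetical.

```python
from itertools import permutations

def sarf(order, rf):
    """Sum of recombination fractions between adjacent markers in an order."""
    return sum(rf[a][b] for a, b in zip(order, order[1:]))

def ripple(order, rf, window=3):
    """Try every permutation inside a sliding window; repeat until no improvement."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for start in range(len(order) - window + 1):
            for perm in permutations(order[start:start + window]):
                candidate = order[:start] + list(perm) + order[start + window:]
                if sarf(candidate, rf) < sarf(order, rf) - 1e-12:
                    order, improved = candidate, True
    return order

# Hypothetical markers on a line; recombination fraction taken as distance / 100.
positions = {"M1": 0, "M2": 10, "M3": 20, "M4": 30, "M5": 40}
rf = {a: {b: abs(positions[a] - positions[b]) / 100 for b in positions} for a in positions}

start_order = ["M1", "M3", "M2", "M4", "M5"]
final_order = ripple(start_order, rf)
```

A ripple search only repairs local misplacements, which is consistent with the recommendation to use it as a quick check before the slower simulated annealing search.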

Reconstructed map of parents

Locus  Position   P1         P2
                  1 2 3 4    5 6 7 8
L1      0.0       C B O D    E A D C
L2     11.9       C D D A    C D A O
L3     20.0       E E D D    C O D A
L4     24.1       D C D O    E O C E
L5     30.3       C D C O    A E D D
L6     35.7       D A D C    D B C E
L7     46.9       A O C B    E O B B
L8     48.8       B E E A    A O A O
L9     67.1       D E E O    B C B B
L10    73.3       B E B E    E A C O

Slides of data and software

Part of potato.loc file

228 73
PAGMAGT_205.0 1
1 0 0 0 1 0 1 0 0 1 9 0 1 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 0
PAGMAGT_174.0 1
1 0 1 1 0 0 0 1 1 1 9 0 1 0 0 1 0 0 0 1 1 1 1 1 0 0 0 0 1 0 1 1 1 1 1
PCAMAGG_114.5 1
1 0 1 1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 1 0 1
s148_v 4
1 0 9 1 1 1 1 1 1 0 0 1 1 9 1 1 1 1 1 1 1 9 0 1 1 1 1 1 1 1 1 0 0 1 1
1 1 9 1 1 1 1 1 1 1 1 1 1 9 1 1 1 1 1 1 1 9 1 1 1 1 1 1 1 1 1 1 1 1 1
0 1 9 1 0 0 0 0 0 0 1 0 0 9 0 0 0 0 1 0 0 9 0 1 1 0 0 0 1 1 0 0 1 1 1
1 0 9 0 1 0 1 0 0 1 1 1 1 9 1 0 0 1 1 0 0 9 1 0 0 0 0 0 0 0 0 1 1 1 0
gp179_vb 1
1 1 0 9 9 9 1 1 0 0 0 9 1 1 9 0 1 1 1 1 1 9 0 1 0 1 1 1 1 1 1 0 0 1 0

Segregation analysis

Summary information

Parent 1 linkages

Select markers dialogue box

Dendrograms

Marker ordering dialogue box

Details of marker ordering

Part 2: QTL analysis

Locating a QTL: Preliminary ANOVA. For each marker, test for different trait means in the different marker classes. TetraploidMap has two options: the usual ANOVA, or a Kruskal-Wallis test for differences in trait medians. This is a useful first scan but not fully informative. For example, for P1 x P2 marker genotypes AOOO x OOOO there are 2 classes (presence/absence of A), whereas ABCD x EFGH gives 36 classes. If different QTL alleles occur in P2, they will be detected only by the ANOVA for a more distant marker. Solution: use all the marker information on each chromosome to infer QTL genotypes at each position.
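The single-marker scan amounts to a one-way ANOVA of the trait on the marker class. A minimal sketch of the F statistic follows. The function name and data are hypothetical, missing values are assumed to have been removed beforehand, and this is not the TetraploidMap implementation.

```python
from collections import defaultdict

def anova_f(marker_classes, trait_values):
    """One-way ANOVA F statistic for trait values grouped by marker class."""
    groups = defaultdict(list)
    for cls, y in zip(marker_classes, trait_values):
        groups[cls].append(y)
    n = len(trait_values)
    k = len(groups)
    grand_mean = sum(trait_values) / n
    ss_between = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values())
    ss_within = sum((y - sum(v) / len(v)) ** 2 for v in groups.values() for y in v)
    # Between-class mean square over within-class mean square.
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical simplex marker (1 = band present) and trait scores.
marker = [1, 1, 1, 0, 0, 0]
trait = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
f_stat = anova_f(marker, trait)
```

For a simplex marker there are only two classes, so the test compares a single allele against everything else, which is the limitation the slide points out.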

QTL genotypes. The QTL genotype cannot be observed like a marker. We assume 8 different QTL alleles: Q1-Q4 from parent 1 and Q5-Q8 from parent 2. There are 36 offspring genotypes, such as Q1Q4Q5Q6, and each genotype may have a different effect on the trait. The most general model has main effects of alleles, two-way interactions etc.

Linear model for trait. Let X_i be the indicator of allele Q_i being present/absent in a genotype. The full model for offspring trait values is: [equation not shown]. Each offspring receives 2 alleles from each parent, which constrains the model. Possible models are effects of each allele, or of each QTL genotype.
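One plausible way to write such a model, given the description of main effects and two-way interactions, is sketched below. The symbols μ, α_i, β_ij and ε are assumptions introduced here for illustration and may differ from the notation on the original slide.

```latex
y = \mu + \sum_{i=1}^{8} \alpha_i X_i
      + \sum_{i<j} \beta_{ij} X_i X_j + \dots + \varepsilon,
\qquad
X_1 + X_2 + X_3 + X_4 = 2, \quad X_5 + X_6 + X_7 + X_8 = 2 .
```

The two constraints express that exactly two of the four alleles of each parent are present in any offspring. The allele effects are therefore not all separately estimable, and the model with one mean per QTL genotype is the most general one that can be fitted.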

QTL model. There are too few shared markers to be confident of the alignment of the parental maps, so the trait data are analysed for each parent separately. Fit the effects of the 6 possible QTL genotypes and compare with reduced models, e.g. a simplex allele or a dominant duplex allele. Assess significance by a permutation test.
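The permutation test works by shuffling trait values among offspring, which breaks any genotype-trait association while keeping the genotype data intact. The sketch below uses a deliberately simple test statistic (absolute difference in class means) in place of the QTL lod score; all names and data are hypothetical.

```python
import random

def mean_difference(genotypes, trait):
    """Absolute difference in mean trait between genotype classes 1 and 0."""
    present = [y for g, y in zip(genotypes, trait) if g == 1]
    absent = [y for g, y in zip(genotypes, trait) if g == 0]
    return abs(sum(present) / len(present) - sum(absent) / len(absent))

def permutation_p_value(statistic, genotypes, trait, n_perm=1000, seed=0):
    """Proportion of shuffled data sets whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = statistic(genotypes, trait)
    shuffled = list(trait)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(genotypes, shuffled) >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)

# Hypothetical data with a large genotype effect.
genotypes = [1] * 10 + [0] * 10
trait = [10 + 0.1 * i for i in range(10)] + [0.1 * i for i in range(10)]
p_value = permutation_p_value(mean_difference, genotypes, trait)
```

In a genome scan the statistic would normally be the maximum over all tested positions, so that the resulting threshold accounts for multiple testing along the chromosome.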

Reconstruction of offspring: 1

Locus  P1         P2         Offspring  Compatible configurations
       1 2 3 4    5 6 7 8
L1     C B O D    E A D C    ACD        {1367, 1467, 1468, 3468}
L2     C D D A    C D A O    ACD
L3     E E D D    C O D A    CDE
L4     D C D O    E O C E    CDE
L5     C D C O    A E D D    CDE
L6     D A D C    D B C E    ABDE       {1268, 2368}
L7     A O C B    E O B B    AE         {1256}
L8     B E E A    A O A O    ABE
L9     D E E O    B C B B    BE
L10    B E B E    E A C O    BCE

Reconstruction of offspring: 2. For each offspring: identify the configurations for each marker locus, then search for complete chromosome configurations that have the minimum number of crossovers (branch and bound algorithm) and are compatible with bivalent pairing. This is biologically realistic in potato, which has few crossovers. We can represent the result as a graphical genotype.
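The first step, identifying the configurations compatible with an offspring phenotype at one locus, can be sketched by brute force over the 36 ways of choosing two chromosomes from each parent. The allele strings are copied from the reconstructed parental map for loci L1, L6 and L7 (chromosomes 1-8, with O a null allele). The function name is hypothetical and double reduction is ignored.

```python
from itertools import combinations

# Alleles on chromosomes 1-8 (P1: 1-4, P2: 5-8) at three loci of the example map.
parent_map = {
    "L1": "CBODEADC",
    "L6": "DADCDBCE",
    "L7": "AOCBEOBB",
}

def compatible_configurations(locus_alleles, phenotype):
    """Chromosome sets (2 from each parent) whose visible alleles match the phenotype."""
    target = set(phenotype)
    configs = []
    for p1 in combinations(range(1, 5), 2):        # two chromosomes from parent 1
        for p2 in combinations(range(5, 9), 2):    # two chromosomes from parent 2
            chosen = p1 + p2
            # Null alleles ("O") are invisible in the phenotype.
            alleles = {locus_alleles[c - 1] for c in chosen} - {"O"}
            if alleles == target:
                configs.append("".join(map(str, chosen)))
    return configs

l1 = compatible_configurations(parent_map["L1"], "ACD")
l6 = compatible_configurations(parent_map["L6"], "ABDE")
l7 = compatible_configurations(parent_map["L7"], "AE")
```

These sets agree with those annotated in the example (four configurations at L1, two at L6, one at L7). The search for a minimum-crossover path would then pick one configuration per locus from these sets.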

Reconstruction of offspring: 3
These configurations have 6 crossovers. There are 3 areas of uncertainty, giving 8 configurations. We can trace chromosome sections from parent to offspring.

Locus  Offspring  Alleles on the four inherited chromosomes
                  (chromosomes 1 4 6 8 at the top of the map)
L1     ACD        C  D  A    C
L2     ACD        C  A  D    C/O
L3     CDE        E  D  O    C
L4     CDE        D  C  O    E/E
L5     CDE        C  D  E    D
L6     ABDE       D  A  B    E
L7     AE         A  O  O    E
L8     ABE        B  E  O/A  A
L9     BE         E  E  B    B
L10    BCE        B  E  C    E
                  (chromosomes 3 2 7 5 at the bottom of the map)

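Inference of QTL genotype for each configuration. [Figure: graphical genotype of the offspring across loci L1 (ACD), L2 (ACD), L3 (CDE), L4 (CDE), L5 (CDE), L6 (ABDE), L7 (AE), L8 (ABE), L9 (BE) and L10 (BCE), running from chromosomes 1 4 6 8 at the top to 3 2 7 5 at the bottom.] Annotations on the figure: "4 genotypes are possible here"; "QTL genotype is 1268 or 1265, with probability 0.5 halfway between loci"; "QTL genotype is 3275, probability 1".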

Model fitting (1). In practice, we consider each position along the chromosome in turn and assess the likelihood of a QTL. For individual i, write y_i = trait value, o_i = marker data, G_i = set of compatible configurations, Q_i = set of QTL genotypes at that location. The likelihood equation is: [equation not shown]. A regression of trait value on QTL genotype, weighted by QTL genotype probability.
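The idea of a regression weighted by QTL genotype probability can be sketched as follows. Each (individual, genotype) pair is treated as an observation with weight equal to the genotype probability, so the least squares estimate of each genotype mean is a probability-weighted average of the trait values. This is a simplified illustration with hypothetical names and data, not the full likelihood used in the software.

```python
def weighted_genotype_means(trait, genotype_probs):
    """Weighted least squares estimates of the mean trait value per QTL genotype.

    trait[i] is y_i; genotype_probs[i] maps each possible QTL genotype of
    individual i to its probability given the marker data.
    """
    totals, weights = {}, {}
    for y, probs in zip(trait, genotype_probs):
        for genotype, p in probs.items():
            totals[genotype] = totals.get(genotype, 0.0) + p * y
            weights[genotype] = weights.get(genotype, 0.0) + p
    return {g: totals[g] / weights[g] for g in totals if weights[g] > 0}

# Hypothetical example: the second individual is 1268 or 1265 with probability 0.5.
trait = [4.0, 6.0, 8.0]
genotype_probs = [
    {"1268": 1.0},
    {"1268": 0.5, "1265": 0.5},
    {"1265": 1.0},
]
means = weighted_genotype_means(trait, genotype_probs)
```

Repeating such a fit at each position along the chromosome and comparing it with a model with no QTL would give a profile of support for the QTL location.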

Lod profile for QTL location

Slides of data and software

Trait data

2
      maturity   tub_blight%
 1    4.977      82.50
 2    3.442      85.00
 3    5.976      50.71
 4    6.842      19.17
 5    5.334      28.36
 6    2.705      97.50
 7    4.025      68.66
 8    2.657      90.00
 9    4.337     -99.0
10    6.667      60.00
11    2.966      70.00
12    6.217      18.66
13    6.211     -99.0
14    4.620      92.86
15    4.763      61.11
16    5.763     -99.0
17    4.930      21.54
18    5.465      63.33

ANOVA of trait data

QTL analysis of chromosome V

Comparison with simpler model

For further details and references consult the TetraploidMap documentation.