One-week Course on Genetic Analysis and Plant Breeding January 2013, CIMMYT, Mexico LOD Threshold and QTL Detection Power Simulation

One-week Course on Genetic Analysis and Plant Breeding 21-2 January 213, CIMMYT, Mexico LOD Threshold and QTL Detection Power Simulation Jiankang Wang, CIMMYT China and CAAS E-mail: jkwang@cgiar.org; wangjiankang@caas.cn Web: http://www.isbreeding.net 1

Outlines Hypothesis testing and two types of associated error LOD threshold in QTL mapping QTL detection power simulation Avoid the over fitting problem in 2

Hypothesis testing and two types of associated error 3

Hypothesis testing A hypothesis is a statement that something is true. Null hypothesis: A hypothesis to be tested. We use the symbol H to represent the null hypothesis Alternative hypothesis: A hypothesis to be considered as an alternative to the null hypothesis. We use the symbol H a to represent the alternative hypothesis. The alternative hypothesis is the one believe to be true, or what you are trying to prove is true.

Two types of error We may make mistakes in the test. Type I error: reject the null hypothesis when it is true. Probability of type I error is denoted by α Type II error: accept the null hypothesis when it is wrong. Probability of type II error is denoted by β

Power of a statistical test P(reject the null hypothesis when it is false)=1-β (1-α) is the probability we accept the null when it was in fact true (1-β) is the probability we reject when the null is in fact false - this is the power of the test. The power changes depending on what the actual population parameter is.

Factors affecting power For example: H : µ=µ, H a : µ>µ Test statistic Z X µ σ n If we want to reject H, we need X σ µ n = Z So the power depends on δ =, σ, n, and α α x µ

δ is small δ is large The larger the difference δ is, the higher the power is X ~ N(µ,σ 2 /n) If H is true, X ~N(µ,σ 2 /n) If H a is true, X ~N(µ +δ,σ 2 /n) H H α α μ μ Ha Ha β 1-β β 1-β μ+δ μ+δ

Large standard error or small population size Small standard error or large population size The smaller the standard error or the larger the population size is, the higher the power is H H α α μ μ Ha Ha β 1-β β 1-β μ+δ μ+δ

Small α Large α The larger α is, the higher the power is H H α α μ Ha Ha β 1-β β 1-β μ+δ

Let s find some Type I and II errors Consider the Binomial distribution H: p=.; Ha: p.; n=6, X~B(n=6, p) Reject H when X=, or 6 Accept H when X=1,..., α =P(X= p=.) + P(X=6 p=.) =.16+.16=.312 Given p=.7, find the Type II error β, and the statistical power β= P(1<=X<= p=.7)=.8218; Power=.1782 Given p=.9, find the Type II error β, and the statistical power β= P(1<=X<= p=.9)=.4686; Power=.314

X~Binomial (n, p), n=6 {, } is the rejection region, i.e. EstP=. or 1. {1,2,3,4} is the acceptance region X H: p=. p=.7 p=.9.162.244.1 1.937.439.4 2.23437.3299.121 3.312.131836.148 4.23437.296631.9841.937.397.34294 6.162.177979.31441 Type I error Power Power.312.178223.31442 Type II error Type II error.821777.4688

Let s find some Type I and II errors Binomial distribution H: p=.; Ha: p.; n=3, X~B(3, p) Reject H when X<=9, or >=21 Accept H when X=1,..., 2 Find α Given p=.7, find the Type II error β, and the statistical power Given p=.9, find the Type II error β, and the statistical power

X H: p=. p=.7 p=.9... 1... 2... 3.4.. 4.26...133.. 6.3.. 7.1896.. 8.41.. 9.1332.. 1.27982.2. 11.876.8. 12.83.4. 13.1113.166. 14.1343.63. 1.144464.1931. 16.1343.43. 17.1113.13414.2 18.83.296.13 19.876.7.74 2.27982.986.36 21.1332.12987.16 22.41.1939.764 23.1896.166236.1843 24.3.1446.47363 2.133.14728.123 26.26.642.17766 27.4.2683.23688 28..8631.22766 29..1786.14134 3..179.42391 Type I error Power Power.42774.8347.99946 Type II error Type II error.19693.44

LOD threshold in QTL mapping Sun, Z., H. Li, L. Zhang, J. Wang*. 213. Properties of the test statistic under null hypothesis and the calculation of LOD threshold in quantitative trait loci (QTL) mapping. Acta Agronomica Sinica (accepted) 1

Threshold is used to control Type I error, say no greater than..1 Probability density.8.6.4.2 Pr{T>T. )=α=. α=. T α=. =18.3 Say we know a test under H hypothesis has the χ 2 (df=1) distribution, the use of threshold 18.3 can make sure the Type I error <. 16

The reason to control Type I error High LOD score can be simply caused by chance! We simulated five DH populations and five F2 population. But we did not assume any QTL on the six chromosomes 17

LOD LOD LOD LOD LOD 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 No QTL in DH populations 1111111111111222222222222333333333333444444444444666666666666 18 No QTL on the six chromosomes, DH population

LOD LOD LOD LOD LOD 3 2 1 3 2 1 3 2 1 3 2 1 3 2 1 No QTL in F2 populations 1111111111111222222222222333333333333444444444444666666666666 19 No QTL on the six chromosomes, F2 population

Distribution of LRT under H at each scanning position Probability..4.3.2.1 DH population 1 2 3 4 6 7 8 9 1 LRT Probability In DH populations, LRT ~ χ 2 (df=1) In F2 populations, LRT ~ χ 2 (df=2) D.F. is equal to the number of independent genetic effects to be estimated 2..4.3.2.1 F2 population 1 2 3 4 6 7 8 9 1 LRT

Number of independent tests DH population, genome-wide Type I error =. DH population, genome-wide Type I error =.1 Indepedent tests 4 3 2 1 MD=1 cm MD=2 cm MD= cm MD=1 cm MD=2 cm 8 11 14 17 2 Length of chromosome (cm) y =.13x y =.126x y =.99x y =.72x y =.4x Indepedent tests 4 3 2 1 MD=1 cm MD=2 cm MD= cm MD=1 cm MD=2 cm 8 11 14 17 2 Length of chromosome (cm) y =.213x y =.16x y =.113x y =.84x y =.82x 21

LOD threshold, assuming marker density is 1 cm Genome Genome-wide α=. Genome-wide α=.1 size DH RIL F2 DH RIL F2 1.61 1.84 2.4 2.37 2.6 3.18 7 1.77 2.1 2.7 2.3 2.73 3.36 1 1.88 2.12 2.7 2.6 2.84 3.49 1 2.4 2.28 2.87 2.81 3.1 3.66 2 2.16 2.4 3. 2.93 3.13 3.79 2 2.24 2.49 3.1 3.2 3.22 3.88 3 2.32 2.6 3.17 3.1 3.29 3.96 2.2 2.77 3.4 3.31 3. 4.18 1 2.8 3. 3.7 3.9 3.79 4.49 1 2.97 3.22 3.87 3.76 3.9 4.66 2 3.9 3.33 4. 3.88 4.7 4.79 3 3.2 3. 4.17 4.4 4.24 4.96 4 3.37 3.62 4.3 4.16 4.36.9 22

LOD threshold from permutation test Highest LOD per test 6 4 3 2 1 Barly DH population LOD threshold = 2.72 at level. Highest LOD per test 12 1 8 6 4 2 Soybean F2 population LOD threshold = 2.44 at level. 1 2 3 4 6 7 8 9 1 Permutation tests 1 2 3 4 6 7 8 9 1 Permutation test 23

QTL detection power simulation Zhang, L., H. Li, J. Wang*. 212. Statistical power of inclusive composite interval mapping in detecting digenic epistasis showing common F2 segregation ratios. Journal of Integrative Plant Biology 4: 27-279 Li, H., S. Hearne, M. Bänziger, Z. Li, and J. Wang*. 21. Statistical properties of QTL linkage mapping in biparental genetic populations. Heredity 1: 27-267. 24

Two independent QTL models Chromosome Position (cm) Additive PVE (%) Independent model I Q1 1 3.316. Q2 2 3.447 1. Q3 3 3.48 1. Q4 4 3.633 2. Genetic variance 1. Error variance 1. Heritability. Independent model II Q1 1 3.316. Q2 2 3 -.447 1. Q3 3 3.48 1. Q4 4 3 -.633 2. Genetic variance 1. Error variance 1. Heritability. 2

Two linked QTL models Chromosome Position (cm) Additive PVE (%) Linkage model I Q1 1 3.316 3.9 Q2 1 6.447 7.9 Q3 2 3.48 11.8 Q4 2 6.633 1.8 Genetic variance 1.3 Error variance 1. Heritability.66 Linkage model II Q1 1 3.316 6.8 Q2 1 6 -.447 13.7 Q3 2 3.48 2. Q4 2 6 -.633 27.3 Genetic variance.46 Error variance 1. Heritability.317 26

LOD LOD LOD LOD LOD 2 1 1 2 1 1 2 1 1 2 1 1 2 1 1 QTL mapping in simulation runs of the two independent models Independent model I Run 1 IM 11111111122222222333333334444444466666666 Run 2 IM 11111111122222222333333334444444466666666 Run 3 11111111122222222333333334444444466666666 Run 4 IM 11111111122222222333333334444444466666666 Run IM IM 11111111122222222333333334444444466666666 LOD score LOD score LOD score LOD score LOD score 2 1 1 2 1 1 2 1 1 2 1 1 2 1 1 Independent model II Run 1 IM 11111111122222222333333334444444466666666 Run 2 IM 11111111122222222333333334444444466666666 Run 3 IM 11111111122222222333333334444444466666666 Run 4 IM 11111111122222222333333334444444466666666 Run IM 11111111122222222333333334444444466666666 27

LOD score LOD score LOD score LOD score LOD score 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 QTL mapping in simulation runs of the two linkage models Linkage model I Run 1 IM 11111111122222222333333334444444466666666 Run 2 IM 11111111122222222333333334444444466666666 Run 3 11111111122222222333333334444444466666666 Run 4 IM 11111111122222222333333334444444466666666 Run IM IM 11111111122222222333333334444444466666666 LOD score LOD score LOD score LOD score LOD score 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 Linkage model II Run 1 IM 11111111122222222333333334444444466666666 Run 2 IM 11111111122222222333333334444444466666666 Run 3 IM 11111111122222222333333334444444466666666 Run 4 IM 11111111122222222333333334444444466666666 Run IM 11111111122222222333333334444444466666666 28

Count of power and false QTL for IM Run QTL identified Support interval Chrom. Position LOD PVE (%) Additive 1 cm 1 cm 1 2 2 4.97 11.44.3 False Q2 3 3.61 13.3.41 Q3 Q3 4 4 13.21 26.22.761 Q4 Q4 2 2 34.36 13.1.9 Q2 Q2 3 34.82 13.72.21 Q3 Q3 4 3 11.9 23.43.682 Q4 Q4 3 1 39. 11.22.8 Q1 Q1 2 32 4.3 1.9.482 Q2 Q2 3 4 8.3 18.42.61 False False 4 36 8.6 18..63 Q4 Q4 4 1 4 3.97 1.21.42 False Q1 2 36 2.69 6.81.343 Q2 Q2 3 34 8.92 19.66.83 Q3 Q3 4 36 8.79 2.1.91 Q4 Q4 3 33 3.8 8.16.389 Q3 Q3 4 3 11.71 26.6.71 Q4 Q4 29

Count of power and false QTL for Run QTL identified Support interval Chrom. Position LOD PVE (%) Additive 1 cm 1 cm 1 1 47 3.8.6.33 False False 2 38 6.79 9.11.448 Q2 Q2 3 33 9.7 13.81.1 Q3 Q3 4 38 16.72 2..73 Q4 Q4 2 1 3 4.6 6.26.32 Q1 Q1 2 36 9.7 12.6. Q2 Q2 3 31 7.93 1.41.44 Q3 Q3 4 27 16.77 24.93.73 False Q4 3 1 36 7.2 1.23.486 Q1 Q1 2 32 6. 8.1.432 Q2 Q2 3 38 9.2 13.63.6 Q3 Q3 4 38 13. 19.18.664 Q4 Q4 4 1 3 3.99.13.298 Q1 Q1 2 37 4.4.89.319 Q2 Q2 3 33 14.21 21.68.613 Q3 Q3 4 36 13.73 21.23.67 Q4 Q4 1 3 4.91 8.4.384 Q1 Q1 2 1 4.3 6.87.36 False False 3 34.3 9.4.419 Q3 Q3 4 3 17.46 31.6.764 Q4 Q4 3

Power and false QTL from the runs Method QTL Times to be detected Detection power (%) 1cM 2cM 1cM 2cM IM Q1 1 2 2 4 Q2 3 4 6 8 Q3 4 4 8 8 Q4 1 1 False QTL 3 1 19 6 Q1 4 4 8 8 Q2 4 4 8 8 Q3 1 1 Q4 4 8 1 False QTL 3 2 1 1 31

The best method Has the highest power Has the lowest false discovery rate 32

Independent model I Method QTL Power Pos. (%) (cm) SE LOD SE Additive SE IM Q1 2.8 3.182 3.461 3.849 1.3.421.6 Q2 62.6 34.861 3.14.84 1.692.48.82 Q3 77.7 3.6 2.669 7.13 2.178.6.92 Q4 8.4 3.67 2.464 9.2 2.84.63.9 FDR (%) 32.4 Q1 49. 34.867 3.184 4.667 1.66.34.62 Q2 73.9 34.874 2.769 7.16 2.29.4.77 Q3 82.8 34.98 2.21 1.161 2.71.48.78 Q4 89. 3.16 2.278 13.87 3.229.632.83 FDR (%) 22.6 33

Independent model II Method QTL Power Pos. (%) (cm) SE LOD SE Additive SE IM Q1 27.3 34.83 3.414 3.86 1.11.424.62 Q2 64.2 3.62 3.3.6 1.67 -.478.78 Q3 78.6 34.96 2.76 6.963 2.143.8.9 Q4 84.6 34.86 2.481 9.374 2.441 -.64.89 FDR (%) 31.1 Q1 49.2 34.831 3.24 4.89 1.64.32.63 Q2 76.1 3.3 2.861 7.142 2.328 -.448.76 Q3 8.6 3.1 2.484 1.193 2.7.48.81 Q4 9. 34.939 2.32 13.23 3.221 -.634.82 FDR (%) 21.3 34

Linkage model I Method QTL Power Pos. (%) (cm) SE LOD SE Additive SE IM Q1 24.1 3.448 2.77 6.944 2.92.62.99 Q2 49. 64.79 2.49 7.89 2.278.66.14 Q3 4.2 3.79 1.648 17.17 3.1.937.9 Q4 6. 64.16 1.624 18.682 3.39.971.93 FDR (%) 3.1 Q1 26.9 3.33 3.1 7.33 3.466.449.118 Q2. 64.872 2.71 1.19 4.184.8.133 Q3 77. 34.92 2.618 1.6 3.89.9.113 Q4 84.2 64.828 2.33 13.668 4.761.649.13 FDR (%) 26.4 3

Linkage model II Method QTL Power Pos. (%) (cm) SE LOD SE Additive SE IM Q1.3 33.333 4.714 2.96.431.313.13 Q2 2.3 66.34 3.22 3.667 1.29 -.34. Q3 6.8 31.691 2.68 3.176.3.334.31 Q4 4.6 67.746 2.68 4.169 1.32 -.373.61 FDR (%) 38.9 Q1 11.6 34.216 3.61.1 1.91.37.66 Q2 33. 66.179 3.3.872 3.188 -.42.18 Q3 6.2 34.383 2.894 8.332 3.381.492.14 Q4 6.9 6.984 2.429 11.413 4.131 -.91.114 FDR (%) 23.8 36

Size of the mapping population PVE (%) Marker density cm Marker density 1 cm Power.8 Power.9 Power.8 Power.9 1 3 6 4 >6 2 16 3 28 32 3 11 2 18 2 4 1 16 14 18 8 14 12 14 1 8 7 8 2 4 6 6 3 4 4 4 4 37

Avoid the over fitting problem in 38

Over-fitting can cause fake QTL LOD 6 4 QTL by overfitting qkwt2h-2 qkwt2h-1 qkwt2h-3 qkwt3h QTL by overfitting qkwth QTL by overfitting qkwt7h-1 qkwt7h-2 2 1. 1H 1H 1H 1H 1H 1H 1H 2H 2H 2H 2H 2H 2H 2H 2H 3H 3H 3H 3H 3H 3H 3H 4H 4H 4H 4H 4H 4H H H H H H H H H H H 6H 6H 6H 6H 6H 6H 7H 7H 7H 7H 7H 7H 7H 7H Additive effect -. -1-1. 1H 1H 1H 1H 1H 1H 1H 2H 2H 2H 2H 2H 2H 2H 2H 3H 3H 3H 3H 3H 3H 3H 4H 4H 4H 4H 4H 4H H H H H H H H H H H 6H 6H 6H 6H 6H 6H 7H 7H 7H 7H 7H 7H 7H 7H One-dimensional scaning on the barley genome, step = 1 cm 39

How can I know there is an overfitting problem? R 2 in step-wise regression exceeds the broadsense heritability There are closely linked QTL identified, especially the QTL are linked in repulsion PIN in step-wise.1.1..1 regression R 2.7289.7963.8131.8886 Use smaller PIN to avoid over-fitting problem 4