Y-STR: Haplotype Frequency Estimation and Evidence Calculation


1 Y-STR: Haplotype Frequency Estimation and Evidence Calculation. Mikkel, MSc Student. Supervised by associate professor Poul Svante Eriksen, Department of Mathematical Sciences, Aalborg University, Denmark. June 16th 2010, presentation of Master of Science Thesis.

2 Outline: 1. Short biological recap; 2. Motivation and aims of using Y-STR; 3. Haplotype frequency estimation; 3.1 Methods and comments to the methods; 3.2 Model control; 4. Evidence calculation; 5. Further work.

3 Biological framework. Example of (autosomal) STR DNA-type: ({14, 13}, {22}, {15, 16}, ..., {22}). Example of Y-STR DNA-type: (17, 14, 22, ..., 15). The slide also shows two illustrative images.

4 Motivation. In some situations Y-STR is more sensible to use than A-STR (autosomal STR), e.g. to avoid noise in the trace from a rape victim. Y-STR and A-STR differ in several areas, e.g. the number of alleles at each locus and the statistical dependence between loci. The statistical methods developed to handle A-STR cannot be applied to Y-STR directly, so reformulation is required (e.g. for calculating evidence) or new methods must be developed (to estimate Y-STR haplotype frequencies).

5 Aims. Be able to calculate statistical evidence in trials. Estimating frequencies for Y-STR haplotypes (also unobserved ones) is required to do this. First, methods for estimating frequencies for Y-STR haplotypes will be discussed, and afterwards calculation of evidence will be introduced.


7 Neither principal component analysis nor factor analysis yields good results, so that path has not been followed any further.

8 Simple count estimates: not precise enough. One published method, introduced in "A new method for the evaluation of matches in non-recombining genomes: application to Y-chromosomal short tandem repeat (STR) haplotypes in European males" by L. Roewer et al. (2000). Several problems exist; some will be presented in this presentation (some were also presented in a talk at the 7th International Y Chromosome User Workshop in Berlin, Germany, April 22-24, 2010).

9 New methods. Graphical models would be an obvious choice: structure based learning (e.g. the PC algorithm) or score based learning (e.g. AIC and BIC). Standard tests for conditional independence (e.g. G² = 2n·CE(A, B | {S_i}_{i∈I}), where n is the sample size and CE is the cross entropy, which is χ²_ν distributed when A and B are independent given {S_i}_{i∈I}) neither exploit the ordering in the data nor incorporate prior knowledge such as the single step mutation model. Better independence tests (e.g. classification trees, ordered logistic regression, and support vector machines) and model-based clustering are required.


11 The idea: Bayesian approach. Notation: N: number of observations; M: number of haplotypes (i.e. unique observations); N_i: the number of times the i-th haplotype has been observed; f_i = (N_i − 1)/(N − M): the frequency for the i-th haplotype. Model using Bayesian inference: 1. A priori: assume f_i is Beta distributed with non-stochastic parameters u_i and v_i. 2. Likelihood: given f_i, N_i is Binomial distributed. 3. Posterior: given N_i, f_i is (still) Beta distributed (the Beta distribution is a conjugate prior for the Binomial distribution). Model expressed in densities using generic notation: p(f_i | N_i) ∝ p(N_i | f_i) p(f_i).
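As a minimal illustration of steps 1-3 (a sketch, not the thesis's implementation), the Beta-Binomial update can be written directly in Python; the prior parameters and counts below are invented:

```python
from scipy import stats

# Invented prior parameters for haplotype i (in the method above they would
# come from the regression on W_i described on the next slide) and invented counts.
u_i, v_i = 0.8, 1500.0   # prior: f_i ~ Beta(u_i, v_i)
N, N_i = 1000, 3         # database size and count of haplotype i

# Posterior: f_i | N_i ~ Beta(u_i + N_i, v_i + N - N_i)
posterior = stats.beta(u_i + N_i, v_i + N - N_i)

print("posterior mean of f_i:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```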

12 u_i and v_i. 1. Calculate W_i = (1/(N − N_i)) ∑_{j≠i} N_j d_{ij} for i = 1, 2, ..., M, where d_{ij} denotes the Manhattan distance/L1 norm between haplotypes i and j. 2. Order the W_i's by size, divide them into 15 (?) groups, and calculate the mean and variance of the f_i's in each group. 3. Fit the regressions µ(W) = β₁ + exp(β₂W + β₃) and σ²(W) = β₄ + exp(β₅W + β₆) based on the 15 estimates. 4. Calculate µ_i = µ(W_i) and σ²_i = σ²(W_i) and use these to calculate the prior parameters u_i and v_i. 5. Apply the Bayesian approach to obtain the posterior distribution, e.g. to estimate f_i using the posterior mean. Only dane could fit the full regression, and the fit is not too comforting; the others resulted in µ_i < 0 or σ²_i < 0 for some i's.
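A rough Python sketch of steps 1-4 (assumptions: haplotypes are stored as rows of integer repeat numbers, groups are equally sized, SciPy's curve_fit is used for the exponential regressions, and the u_i expression quoted later on slide 33 is used for the moment matching; this is not the reference implementation):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_prior_parameters(haplotypes, counts, n_groups=15):
    """haplotypes: (M, r) array of unique haplotypes; counts: (M,) observation counts."""
    N, M = counts.sum(), len(counts)
    f = (counts - 1) / (N - M)                       # frequencies as defined on slide 11

    # Step 1: W_i = (1 / (N - N_i)) * sum_{j != i} N_j * d_ij (Manhattan distance)
    d = np.abs(haplotypes[:, None, :] - haplotypes[None, :, :]).sum(axis=2)
    W = (d * counts[None, :]).sum(axis=1) / (N - counts)

    # Step 2: order by W, split into groups, take group means and variances of f
    groups = np.array_split(np.argsort(W), n_groups)
    W_grp = np.array([W[g].mean() for g in groups])
    mu_grp = np.array([f[g].mean() for g in groups])
    var_grp = np.array([f[g].var(ddof=1) for g in groups])

    # Step 3: fit mu(W) = b1 + exp(b2*W + b3) and sigma^2(W) = b4 + exp(b5*W + b6)
    expreg = lambda w, b1, b2, b3: b1 + np.exp(b2 * w + b3)
    mu_par, _ = curve_fit(expreg, W_grp, mu_grp, p0=(0.0, 1.0, np.log(mu_grp.mean())))
    var_par, _ = curve_fit(expreg, W_grp, var_grp, p0=(0.0, 1.0, np.log(var_grp.mean())))

    # Step 4: Beta prior parameters by moment matching (u_i as quoted on slide 33)
    mu_i, var_i = expreg(W, *mu_par), expreg(W, *var_par)
    u = mu_i**2 * (1 - mu_i) / var_i
    v = u * (1 - mu_i) / mu_i
    return W, u, v
```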

13 Plot of the regression: µ(W) for dane divided into 15 groups, with the fitted exponential curve.

14 Plot of the regression: σ²(W) for dane divided into 15 groups, with the fitted exponential curve.

15 Modified regression. Set β₁ = β₄ = 0 in the regression, yielding µ(W) = exp(β₂W + β₃) and σ²(W) = exp(β₅W + β₆). Now berlin makes the best fits, which seems quite reasonable for µ(W), but more doubtful for σ²(W); dane fits almost as before.

16 Plot of the regression: µ(W) for berlin divided into 16 groups, with the fitted reduced regression µ(W) = 0 + exp(β₂W + β₃).

17 Plot of the regression: σ²(W) for berlin divided into 16 groups, with the fitted reduced regression σ²(W) = 0 + exp(β₅W + β₆).

18 Plot of the regression: µ(W) for somali divided into 17 groups, with the fitted reduced regression µ(W) = 0 + exp(β₂W − 9.286).

19 Plot of the regression: σ²(W) for somali divided into 17 groups, with the fitted reduced regression σ²(W) = 0 + exp(10.29·W − 9.328).

20 Changes made in the online implementation. At the 7th International Y Chromosome User Workshop in Berlin, 2010, Sascha Willuweit (one of the persons behind the online implementation) mentioned a couple of changes between their implementation and the original article: the reduced regression is used, i.e. without the intercepts β₁ and β₄; and the number of groups is determined by fitting several regressions and choosing the best one (details for selecting the minimum number of groups were not mentioned).

21 Comments and proposals for changes. Not a statistical model, more an ad-hoc method. µ_i = µ(W_i) = exp(aW_i + b) is not bounded above, so µ(W) ≥ 1 for W ≥ −b/a; for berlin the fitted a and b give µ_i ≥ 1 for some of the W_i (even though 0 ≤ W_i ≤ 1 and 0 < µ_i < 1 by definition). Fitting u_i and v_i: only W_i is available to fit two exponential regressions. The model is not consistent: generalisation to a Dirichlet prior and a multinomial likelihood might solve this.

22 Fitting two parameters (u_i and v_i) using only one (W_i): plot of fitted u vs. fitted v for berlin based on 16 groups, using the regression without β₁ and β₄.

23 Fitting two parameters (u_i and v_i) using only one (W_i): plot of fitted u vs. fitted v for dane based on 15 groups, using the regression with β₁ and β₄.

24 Fitting two parameters (u_i and v_i) using only one (W_i): plot of fitted u vs. fitted v for dane based on 15 groups, using the regression without β₁ and β₄.

25 Fitting two parameters (u_i and v_i) using only one (W_i): plot of fitted u vs. fitted v for somali based on 17 groups, using the regression without β₁ and β₄.

26 Idea: use the second moment. W_i = (1/(N − N_i)) ∑_{j≠i} N_j d_{ij} = (1/(N − N_i)) ∑_{j≠i} N_j ∑_{k=1}^{r} d_{ijk} uses the first moment of the allele differences on each locus of the r loci. Maybe also use the second moment by introducing Z_i = (1/(N − N_i)) ∑_{j≠i} N_j (1/r) ∑_{k=1}^{r} (d_{ijk} − d̄_{ij})², where d̄_{ij} = d_{ij}/r. Fit the µ_i's and σ²_i's by multiple regression using a grid of W_i and Z_i values. 15 groups only correspond to a 4×4 grid, which is way too coarse; requiring 15 groups of both W_i's and Z_i's, the grid would have size 15 × 15 = 225, which requires a lot of observations. Too few observations in berlin, dane, and somali, but it would be interesting to see how it would perform compared to just using the W_i's.

27 Generalised frequency estimation. Assume a priori that f ~ Dirichlet(α), where f = (f₁, f₂, ..., f_K) is the vector of frequencies for all possible haplotypes. Use the likelihood N | f ~ Multinomial(N₊, f). The posterior becomes f | N ~ Dirichlet(α₁ + N₁, ..., α_K + N_K) = Dirichlet(α₁ + N₁, ..., α_n + N_n, α_{n+1}, ..., α_K), where f₁, f₂, ..., f_n are the frequencies for the observed haplotypes and f_{n+1}, f_{n+2}, ..., f_K are the frequencies for the unobserved haplotypes.

28 Marginal distribution. Let α₊ = ∑_{i=1}^{K} α_i and N₊ = ∑_{i=1}^{K} N_i. The marginal posterior distribution for the i-th haplotype is f_i | N_i ~ Beta(α_i + N_i, ∑_{j=1}^{K}(α_j + N_j) − (α_i + N_i)) = Beta(α_i + N_i, α₊ − α_i + N₊ − N_i), so E[f_i | N_i] = (α_i + N_i) / (∑_{j=1}^{K}(α_j + N_j) − (α_i + N_i) + (α_i + N_i)) = (α_i + N_i)/(α₊ + N₊), and ∑_{i=1}^{K} E[f_i | N_i] = (α₊ + N₊)⁻¹ ∑_{i=1}^{K}(α_i + N_i) = 1. Incorporating prior knowledge can be done by specifying the prior parameter α_i for all possible haplotypes, but with this approach α₊ might be problematic to calculate for large K.
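A small numeric sketch of the posterior mean under this Dirichlet-multinomial formulation (the α values and counts below are invented; in practice α_i would come from prior knowledge, e.g. α_i = h(W_i) as on the following slides):

```python
import numpy as np

# Invented prior parameters and counts: K possible haplotypes, only the first
# three of them observed (the remaining counts are 0).
alpha = np.array([0.5, 0.3, 0.2, 0.1, 0.1])   # alpha_1, ..., alpha_K
counts = np.array([7, 2, 1, 0, 0])            # N_1, ..., N_K

alpha_plus, N_plus = alpha.sum(), counts.sum()

# E[f_i | N] = (alpha_i + N_i) / (alpha_+ + N_+), also for unobserved haplotypes
post_mean = (alpha + counts) / (alpha_plus + N_plus)

print(post_mean)         # posterior mean frequencies
print(post_mean.sum())   # sums to 1, as stated on the slide
```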

29 Approximating α₊: histogram of the W_i's for berlin with fitted Gamma and Beta densities and the mean of the W_i's indicated.

30 Approximating α₊: histogram of the W_i's for dane with fitted Gamma and Beta densities and the mean of the W_i's indicated.

31 Approximating α₊: histogram of the W_i's for somali with fitted Gamma (s = 3.112) and Beta (s1 = 2.44) densities and the mean of the W_i's indicated.

32 Approximating α₊. As earlier stated, 0 ≤ W_i ≤ 1, so the Beta distribution is theoretically the right choice. Assume that α_i = h(W_i) for some function h, such that α₊ = ∑_{i=1}^{K} h(W_i). Denote by f_β the density of a fitted Beta distribution; then α₊ ≈ K ∫₀¹ f_β(W) h(W) dW.
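Numerically, the approximation can be evaluated with one-dimensional quadrature; a sketch with an invented choice of h, invented Beta parameters and an invented K:

```python
import numpy as np
from scipy import stats, integrate

K = 10**7                              # assumed number of possible haplotypes (invented)
s1, s2 = 2.4, 12.0                     # Beta parameters fitted to the W_i's (invented)
h = lambda w: np.exp(4.0 * w - 9.0)    # invented choice of alpha_i = h(W_i)

# alpha_+ approximately K * integral of f_beta(W) * h(W) over [0, 1]
f_beta = stats.beta(s1, s2).pdf
integral, _ = integrate.quad(lambda w: f_beta(w) * h(w), 0.0, 1.0)
alpha_plus = K * integral
print(alpha_plus)
```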

33 Approximating α₊. To get equality between the first prior parameter in the original method and in the generalised one, let α_i = u_i = µ²_i(1 − µ_i)/σ²_i. Because µ_i = µ(W_i) = exp(aW_i + b) can result in µ_i ≥ 1, then 1 − µ_i ≤ 0, such that α_i ≤ 0, which is not allowed. For berlin, µ_i ≥ 1 for W_i above −b/a, so that all contributions to the integral in the α₊ approximation are negative for such W_i; the slide reports −b/a, α₊ and the estimated uncovered probability mass for berlin, dane, and somali. With α_i = u_i and the exponential regression the approach is unusable, but the distribution of the W_i's might be helpful for other choices of h.

34 Problem. The method is maybe getting too much attention because it is the only published method for estimating haplotype frequencies. At the 7th International Y Chromosome User Workshop in Berlin, 2010, Michael Krawczak (one of the authors of the original articles) gave a talk where the associated slides included the statement that [the frequency estimation method has] never [been] thoroughly studied and validated.


36 The idea: approximate once a common (partial) ancestor has been identified. The basic idea is to find I = {i₁, i₂, ..., i_q} ⊆ {1, 2, ..., r} such that P(⋂_{j∉I} L_j = a_j | ⋂_{i∈I} L_i = a_i) ≈ ∏_{j∉I} P(L_j = a_j | ⋂_{i∈I} L_i = a_i) is a good approximation.

37 Example. Assume that we have loci L₁, L₂, L₃, L₄ and I = {1, 2}. Then P(L₁, L₂, L₃, L₄) = P(L₁) P(L₂ | L₁) P(L₃, L₄ | L₁, L₂) (1) ≈ P(L₁) P(L₂ | L₁) ∏_{j=3}^{4} P(L_j | L₁, L₂) (2).

38 How to choose I. I is called an ancestral set, because it can be interpreted as a set of alleles that is shared with one's ancestors. I can be found using a greedy approach, adding the j to I that maximises P(L_j = a_j | ⋂_{i∈I} L_i = a_i). Stop adding elements to I, e.g. when only a small percentage of the observations is left to use for calculating the marginal probabilities conditional on I.
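A minimal sketch of this greedy selection on an observed database (assuming haplotypes are rows of a NumPy integer array; the stopping fraction of 10% is an invented choice):

```python
import numpy as np

def greedy_ancestral_set(db, a, min_fraction=0.10):
    """db: (N, r) array of observed haplotypes; a: target haplotype of length r.
    Greedily adds the locus j whose allele a_j has the highest empirical
    conditional probability given the loci already in I, and stops when too
    few observations would remain for the conditional estimates."""
    N, r = db.shape
    I, mask = [], np.ones(N, dtype=bool)     # mask: observations matching a on I
    while len(I) < r:
        remaining = [j for j in range(r) if j not in I]
        # Empirical P(L_j = a_j | L_i = a_i for i in I)
        probs = {j: (db[mask, j] == a[j]).mean() for j in remaining}
        j_best = max(probs, key=probs.get)
        new_mask = mask & (db[:, j_best] == a[j_best])
        if new_mask.sum() < min_fraction * N:   # stopping rule from the slide
            break
        I.append(j_best)
        mask = new_mask
    return I, mask
```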

39 Drawbacks. The approach is simple, but like the previous method it is not a statistical model.


41 The idea: alternately perceive the different loci as a response variable. Let L₁, L₂, ..., L_r be the r different loci available in the haplotype. Then fit L_{i₁} given {L_k : k ∉ {i₁}} (3), L_{i₂} given {L_k : k ∉ {i₁, i₂}} (4), ..., (5) L_{i_{r−2}} given {L_k : k ∉ {i₁, i₂, ..., i_{r−2}}} (6), L_{i_{r−1}} given {L_k : k ∉ {i₁, i₂, ..., i_{r−1}}} = L_{i_r} (7), and use the empirical distribution for L_{i_r}.

42 Properties. A class of statistical models. The classification can be done with any of the several available classification methods, such as classification trees, ordered logistic regression, or support vector machines. Selection of the i_j's should be done using standard model selection criteria, depending on the classification model used. Does not incorporate prior knowledge.
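A rough sketch of the chain idea using classification trees (scikit-learn's DecisionTreeClassifier standing in for one of the classifiers mentioned above); the ordering i₁, ..., i_r is assumed to be given rather than chosen by a model selection criterion, and all tuning choices are invented:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_chain(db, order):
    """db: (N, r) integer array of haplotypes; order: the loci i_1, ..., i_r.
    Fits one classifier per locus in `order`, each predicting that locus from
    the loci not yet used as a response, plus the empirical distribution of
    the last locus."""
    models = []
    for k, target in enumerate(order[:-1]):
        predictors = list(order[k + 1:])              # loci not in {i_1, ..., i_k}
        clf = DecisionTreeClassifier(min_samples_leaf=5)
        clf.fit(db[:, predictors], db[:, target])
        models.append((target, predictors, clf))
    values, cnt = np.unique(db[:, order[-1]], return_counts=True)
    empirical = dict(zip(values, cnt / cnt.sum()))
    return models, order[-1], empirical

def chain_probability(z, models, last, empirical):
    """Estimated probability of haplotype z under the fitted chain."""
    p = empirical.get(z[last], 0.0)
    for target, predictors, clf in models:
        proba = clf.predict_proba(np.asarray(z)[predictors].reshape(1, -1))[0]
        classes = list(clf.classes_)
        p *= proba[classes.index(z[target])] if z[target] in classes else 0.0
    return p
```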

43 Kernels and model based clustering.

44 The idea: create a density around each haplotype. Put a scaled density/mass (called a kernel) around each haplotype with mass equal to its relative frequency N_i/N₊. In this way unobserved haplotypes get probability mass from the (near) neighbours.

45 Choice of kernel. A straightforward approach is the Gaussian kernel K(z | x_i, λ) = (2πλ²)^(−r/2) det(Σ)^(−1/2) exp(−(1/(2λ²)) (x_i − z)ᵀ Σ⁻¹ (x_i − z)), where λ is a bandwidth parameter that has to be chosen. A frequency estimate for any given haplotype z is g(z) = (1/N₊) ∑_{i=1}^{n} N_i K(z | x_i, λ). To incorporate prior knowledge, K(z | x_i, N_i, λ) = (2πλ²/N_i)^(−r/2) det(Σ)^(−1/2) exp(−(N_i/(2λ²)) (x_i − z)ᵀ Σ⁻¹ (x_i − z)) could be used.
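A direct NumPy transcription of g(z) above (a sketch: Σ is taken to be the identity here purely for illustration, and λ must be supplied):

```python
import numpy as np

def kernel_frequency(z, haplotypes, counts, lam, Sigma=None):
    """g(z) = (1/N_+) * sum_i N_i * K(z | x_i, lambda) with the Gaussian kernel.
    haplotypes: (M, r) array of unique haplotypes; counts: (M,) counts N_i."""
    M, r = haplotypes.shape
    Sigma = np.eye(r) if Sigma is None else Sigma
    Sigma_inv, det = np.linalg.inv(Sigma), np.linalg.det(Sigma)
    diff = haplotypes - np.asarray(z)                       # x_i - z for every i
    quad = np.einsum("ij,jk,ik->i", diff, Sigma_inv, diff)  # (x_i - z)' Sigma^-1 (x_i - z)
    K = (2 * np.pi * lam**2) ** (-r / 2) * det ** (-0.5) * np.exp(-quad / (2 * lam**2))
    return (counts * K).sum() / counts.sum()
```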

46 Problem. The model can be inaccurate if the kernel has small variance, because then the actual mass, when evaluated over the discrete grid, can differ greatly from the intended relative frequency. Discrete kernels could be tried instead, e.g. the multinomial.

47 Model based clustering. Estimating a frequency for a haplotype using kernels requires evaluating as many densities as there are haplotypes in the database. Model based clustering can be used to perform clustering first, to minimise the required number of density evaluations. Same problem as with the kernels if the variances are too small.

48 Model control: unobserved probability mass and marginal deviances.

49 Model verification is as always crucial. One important feature of a model is to be able to efficiently obtain further samples of haplotypes according to their probability under the model.

50 Different ways of comparing models: the estimated unobserved probability mass; marginal deviances (a model's single and pairwise marginals compared to the observed ones); several more should definitely be considered.

51 Unobserved probability mass: point estimate. K₀: the number of singletons; N: the number of observations. In 1968, Robbins showed that V = K₀/(N + 1) is an unbiased estimate of the unobserved probability mass.

52 Unobserved probability mass: limiting consistent variance estimate. K₁: the number of doubletons. In 1986, Bickel and Yahav showed that, under some regularity conditions, σ̂² = K₀/N² − (K₀ − 2K₁)²/N³ is a limiting consistent estimate of the variance of the unobserved probability mass. Both V and the variance estimate can be verified by simulation.
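Both estimators are one-liners; a small sketch (the formula for σ̂² is as reconstructed above, and the counts in the example are invented):

```python
import math

def unobserved_mass(k0, k1, n, z=1.96):
    """Robbins' point estimate V = K0 / (N + 1), the variance estimate
    sigma^2 = K0 / N^2 - (K0 - 2*K1)^2 / N^3 quoted above, and a symmetric
    approximate confidence interval for the unobserved probability mass."""
    V = k0 / (n + 1)
    var = k0 / n**2 - (k0 - 2 * k1) ** 2 / n**3
    se = math.sqrt(max(var, 0.0))
    return V, se, (V - z * se, V + z * se)

# Invented example: 120 singletons, 35 doubletons, 500 observations
print(unobserved_mass(120, 35, 500))
```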

53 Unobserved probability mass: simulation study. Plot of confidence interval coverage against N for the symmetric standard confidence interval and the logit transformed confidence interval, for several true population sizes Q, based on simulations.

54 Unobserved probability mass. The estimate seems like a good and simple way of performing model verification, but it cannot stand alone, as we shall soon see. It can also be used to fit model parameters, e.g. the parameter λ in the kernel model.

55 Unobserved probability mass: approximate confidence intervals. Table with one column per dataset (berlin, dane, somali): V, σ̂_V and the 95% confidence interval ([0.321; 0.408] for berlin, [0.510; 0.694] for dane, [0.212; 0.342] for somali), together with the corresponding estimates for S with β₁/β₄ ≠ 0, S with β₁/β₄ = 0, the classification models (rpart, svm, polr; polr NA for one dataset) and the ancestor method at 10%, 15% and 20% (remaining numeric entries not captured).

56 Marginal deviances. Depending on the model, exact marginals can be difficult to obtain. If haplotypes can be sampled according to their probability under a model, then the marginals can be approximated by simulating a huge number of haplotypes under that model. Using only the observed marginals should correspond to this, but at least for small databases this is not the case according to simulation studies performed with the classification models.

57 Deviance for pairwise marginals. Let {u_ij} be the two-way table with the observation counts, {p̃_ij} the table of predicted probabilities under a model M₀, and {p̂_ij} the relative frequencies such that p̂_ij = u_ij/u₊₊. For the pairwise marginal tables, the deviance is d = −2 log(L({p̃_ij})/L({p̂_ij})), where L({p_ij}) = ∏_{i,j} p_ij^{u_ij} is proportional to the likelihood (the multinomial constant u₊₊!/∏_{i,j} u_ij! cancels out in the ratio). Then d = 2 ∑_{i,j} u_ij log(p̂_ij/p̃_ij), which is χ²_ν distributed.

58 A deviance is calculated for each pair of loci. To compare models, these deviances can be summed and used for relative comparisons (the sum is not χ² distributed). The deviance is calculated similarly for the single marginals.
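A direct implementation of the per-pair deviance above (the two-way table and the model probabilities are invented); the per-pair values can then be summed as described:

```python
import numpy as np

def pairwise_deviance(u, p_model):
    """d = 2 * sum_ij u_ij * log(p_hat_ij / p_model_ij) for one pair of loci.
    u: observed two-way count table; p_model: predicted probabilities under M0.
    Cells with u_ij = 0 contribute nothing to the sum."""
    u = np.asarray(u, dtype=float)
    p_hat = u / u.sum()
    mask = u > 0
    return 2.0 * np.sum(u[mask] * np.log(p_hat[mask] / np.asarray(p_model)[mask]))

# Invented 2x3 table and model probabilities for illustration
u = np.array([[30, 10, 5], [20, 25, 10]])
p_model = np.array([[0.28, 0.12, 0.06], [0.22, 0.22, 0.10]])
print(pairwise_deviance(u, p_model))
```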

59 Deviance sums for single marginals. Table with columns berlin, dane, somali and rows rpart, svm, polr (polr NA for one dataset): sum of deviances for observed single marginals vs. simulated single marginals for the classification method specified in each row (numeric entries not captured).

60 Deviance sums for pairwise marginals. Table with columns berlin, dane, somali and rows rpart, svm, polr: sum of deviances for observed pairwise marginals vs. simulated pairwise marginals for the classification method specified in each row; several entries are Inf and polr is NA for one dataset (remaining numeric entries not captured).

61 Calculating evidence: two contributors; n contributors.

62 Evidence: motivation of usage. The purpose is to get an unbiased opinion in a trial. Often formulated as the hypotheses H_p: the suspect left the crime stain together with n additional contributors; H_d: some other person left the crime stain together with n additional contributors. Then the likelihood ratio given by LR = P(E | H_p)/P(E | H_d) is calculated. In a courtroom it can then be stated that the evidence E is LR times more likely to have arisen under H_p than under H_d (the formulation is from Interpreting DNA Mixtures by Weir et al., 1997).

63 Computational challenge. The actual evaluation of the LR can be a computationally heavy task: the number of combinations giving rise to the same trace grows with the number of contributors.

64 Two contributors. Assume H_p: the suspect left the crime stain together with one additional contributor; H_d: some other person left the crime stain together with one additional contributor.

65 Two contributors: notation. Let a = (a₁, a₂, ..., a_n) and b = (b₁, b₂, ..., b_n) and define T = a ⊕ b = ({a₁, b₁}, {a₂, b₂}, ..., {a_n, b_n}) (8) and T ⊖ a = ({b₁}, {b₂}, ..., {b_n}) (9), where the sets are multisets.

66 Two contributors. Let T be the trace, h_s the suspect's haplotype, and h₁ the additional contributor's haplotype. Then T = h_s ⊕ h₁ and h₁ = T ⊖ h_s. Say that (h₁, h₂) is consistent with the trace T if h₁ ⊕ h₂ = T, which is denoted (h₁, h₂) ∼ T. This makes
LR = P(E | H_p)/P(E | H_d) (10)
= P(h_s, T ⊖ h_s) / ∑_{(h₁,h₂) ∼ T} P(h_s, h₁, h₂) (11)
= P(h_s) P(T ⊖ h_s) / (P(h_s) ∑_{(h₁,h₂) ∼ T} P(h₁) P(h₂)) (12)
= P(T ⊖ h_s) / ∑_{(h₁,h₂) ∼ T} P(h₁) P(h₂) (13)
by assuming that haplotypes are independent.

67 Two contributors: number of terms in the denominator. Let T = (T₁, T₂, ..., T_r), where T_i is a set of alleles such that |T_i| ∈ {1, 2}. Let H_T = T₁ × T₂ × ⋯ × T_r. In the non-trivial case a j ∈ {1, 2, ..., r} exists such that T_j = {a₁, a₂} with a₁ ≠ a₂. Let T′_j = {a₁} (such that one of the alleles is removed) and H′_T = T₁ × ⋯ × T_{j−1} × T′_j × T_{j+1} × ⋯ × T_r.

68 Two contributors: number of terms in the denominator. Now the denominator of the LR can be written as 2 ∑_{h₁ ∈ H′_T} P(h₁) P(T ⊖ h₁). If k denotes the number of loci in the trace with only one allele, and we assume that we have the non-trivial case with 0 ≤ k < r, we have that |H_T| = ∏_{i=1}^{r} |T_i| = 2^{r−k}, such that |H′_T| = |H_T|/2 = 2^{r−k−1} ≤ 2^{r−1}. This means that for r loci, a maximum of 2·2^{r−1} = 2^r haplotype frequencies have to be calculated, e.g. 2^{10} = 1024 for r = 10. If a trace has two contributors and no known suspects, the two most likely contributors can be chosen as argmax_{h₁ ∈ H′_T} P(h₁) P(T ⊖ h₁).
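A small enumeration sketch of this denominator (the trace and the frequency function `freq` below are placeholders; any of the haplotype frequency models above could be plugged in):

```python
import itertools

def two_contributor_denominator(trace, freq):
    """trace: list of allele sets T_1, ..., T_r (each of size 1 or 2);
    freq(h): estimated frequency of the haplotype h (a tuple of alleles).
    Computes 2 * sum_{h1 in H'_T} P(h1) * P(T minus h1) from the slide."""
    T = [sorted(t) for t in trace]
    j = next(i for i, t in enumerate(T) if len(t) == 2)   # non-trivial case assumed
    # H'_T: at locus j only the first allele is allowed for h1
    choices = [t if i != j else t[:1] for i, t in enumerate(T)]
    total = 0.0
    for h1 in itertools.product(*choices):
        # h2 = T "minus" h1: the other allele where T_i has two, the same one otherwise
        h2 = tuple(t[0] if len(t) == 1 else (t[1] if h1[i] == t[0] else t[0])
                   for i, t in enumerate(T))
        total += freq(h1) * freq(h2)
    return 2.0 * total

# Tiny illustration with an invented trace and a constant placeholder frequency
trace = [{15, 16}, {22}, {14, 15}]
freq = lambda h: 0.01            # placeholder: plug in any haplotype frequency model
print(two_contributor_denominator(trace, freq))   # 2 * |H'_T| * 0.01**2 here
```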

69 n contributors: formulation of the LR for one locus. The LR is defined generally in Forensic interpretation of Y-chromosomal DNA mixtures by Wolf et al., 2005. At a given locus, let E_t, E_s, and E_k be the sets of alleles from the trace, the suspect, and the known contributors, respectively. Assume n unknown contributors and let A_n denote the set of alleles carried by these n unknown contributors. Let P_n(V; W) = P(W ⊆ A_n ⊆ V). Then LR = P_n(E_t; E_t \ (E_s ∪ E_k)) / P_{n+1}(E_t; E_t \ E_k) (14) = P(E_t \ (E_s ∪ E_k) ⊆ A_n ⊆ E_t) / P(E_t \ E_k ⊆ A_{n+1} ⊆ E_t) (15).

70 n contributors: formulation of the LR for m loci. For m loci we have LR = P_n({E_{t,i}; E_{t,i} \ (E_{s,i} ∪ E_{k,i})}_{i=1}^{m}) / P_{n+1}({E_{t,i}; E_{t,i} \ E_{k,i}}_{i=1}^{m}).

71 n contributors: example from thesis, page 18. Let E_{t,1} = {1, 2}, E_{t,2} = {2}, E_{s,1} = {1}, E_{s,2} = {2}, E_{k,1} = E_{k,2} = ∅. Then
LR = P_1({E_{t,i}; E_{t,i} \ (E_{s,i} ∪ E_{k,i})}_{i=1}^{2}) / P_2({E_{t,i}; E_{t,i} \ E_{k,i}}_{i=1}^{2})
= P(⋂_{i=1}^{2} {E_{t,i} \ (E_{s,i} ∪ E_{k,i}) ⊆ A_1^i ⊆ E_{t,i}}) / P(⋂_{i=1}^{2} {E_{t,i} \ E_{k,i} ⊆ A_2^i ⊆ E_{t,i}})
= P({{2}^1 ⊆ A_1^1 ⊆ {1, 2}^1} ∩ {A_1^2 ⊆ {2}^2}) / P({{1, 2}^1 ⊆ A_2^1 ⊆ {1, 2}^1} ∩ {{2}^2 ⊆ A_2^2 ⊆ {2}^2})
= P({2}^1, {2}^2) / P({1, 2}^1, {2, 2}^2)
= P(h₁ = (2, 2)) / (P(h₁ = (1, 2), h₂ = (2, 2)) + P(h₁ = (2, 2), h₂ = (1, 2))).

72 n contributors: challenges. It is complicated. One key ingredient in calculating the LR is to be able to estimate frequencies for unobserved haplotypes. The next step is to be able to calculate the LR efficiently, even for a large number of contributors.

73 Further work.

74 Further work. Y-STR haplotype frequencies: better incorporation of prior knowledge in a statistical model, e.g. graphical models with other test statistics (ordinal data, and incorporating prior knowledge such as the single step mutation model); more and better ways to verify models; larger datasets (a lot of data has been gathered, both publicly available in journals and directly from laboratories, but none of it is available to others yet). Y-STR mixtures: efficient calculation of the LR; use quantitative information (the amount of DNA material, which can be seen in the EPG) instead of only the qualitative information; model the signal in the EPG (electropherogram).

75 Questions?
