Y-STR: Haplotype Frequency Estimation and Evidence Calculation


Mikkel, MSc Student
Supervised by Associate Professor Poul Svante Eriksen, Department of Mathematical Sciences, Aalborg University, Denmark
June 16th 2010, Presentation of Master of Science Thesis

Outline
1. Short biological recap
2. Motivation and aims of using Y-STR
3. Haplotype frequency estimation
   3.1 Methods and comments to the methods
   3.2 Model control
4. Evidence calculation
5. Further work

Biological framework
Example of (autosomal) STR DNA-type: ({14, 13}, {22}, {15, 16}, ..., {22})
Example of Y-STR DNA-type: (17, 14, 22, ..., 15)
LHS image is from http://ghr.nlm.nih.gov and RHS image is from http://history.earthsci.carleton.ca.

Motivation
In some situations Y-STR is more sensible to use than A-STR (autosomal STR), e.g. to avoid noise in the trace from a rape victim.
Y-STR and A-STR differ in several areas, e.g. the number of alleles at each locus and the statistical dependence between loci.
The statistical methods developed to handle A-STR cannot be applied to Y-STR directly, so reformulation is required (e.g. for calculating evidence) or new methods must be developed (e.g. to estimate Y-STR haplotype frequencies).

Aims
Be able to calculate statistical evidence in trials.
Estimating frequencies for Y-STR haplotypes (also unobserved ones) is required to do this.
First, methods for estimating frequencies for Y-STR haplotypes will be discussed; afterwards, calculation of evidence will be introduced.


Neither principal component analysis nor factor analysis yields good results, so that path has not been followed any further.

Simple count estimates: not precise enough.
One published method (used at http://www.yhrd.org): introduced in "A new method for the evaluation of matches in non-recombining genomes: application to Y-chromosomal short tandem repeat (STR) haplotypes in European males" from 2000 by L. Roewer et al.
Several problems exist; some will be presented in this presentation (some were also presented in a talk at the 7th International Y Chromosome User Workshop in Berlin, Germany, April 22-24, 2010).

Graphical models would be an obvious choice:
Structure-based learning (e.g. the PC algorithm) or score-based learning (e.g. AIC and BIC).
Standard tests for conditional independence (e.g. G^2 = 2n CE(A, B | {S_i}_{i in I}), where n is the sample size and CE is the cross entropy; G^2 is chi^2_nu distributed when A and B are independent given {S_i}_{i in I}) do not exploit the ordering in the data, nor do they incorporate prior knowledge such as the single step mutation model.
Better independence tests are required (e.g. classification trees, ordered logistic regression, and support vector machines), as well as model-based clustering.


The idea: Bayesian approach
Notation:
- N: number of observations
- M: number of haplotypes (i.e. unique observations)
- N_i: the number of times the i-th haplotype has been observed
- f_i, i = 1, 2, ..., M: the frequency of the i-th haplotype
Model using Bayesian inference:
1. A priori: assume f_i is Beta distributed with non-stochastic parameters u_i and v_i
2. Likelihood: given f_i, N_i is Binomial distributed
3. Posterior: given N_i, f_i is (still) Beta distributed (the Beta distribution is a conjugate prior for the Binomial distribution)
Model expressed in densities using generic notation: p(f_i | N_i) proportional to p(N_i | f_i) p(f_i)
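The conjugate Beta-Binomial update in steps 1-3 can be sketched as follows; the prior parameters and counts here are illustrative assumptions, not values from the talk.

```python
# Sketch of the conjugate Beta-Binomial update: Beta(u, v) prior on f_i plus a
# Binomial(N, f_i) likelihood for N_i gives a Beta(u + N_i, v + N - N_i) posterior.

def posterior_params(u, v, N_i, N):
    """Return the Beta posterior parameters after observing N_i out of N."""
    return u + N_i, v + N - N_i

def posterior_mean(u, v, N_i, N):
    """Posterior mean, used as the frequency estimate for haplotype i."""
    a, b = posterior_params(u, v, N_i, N)
    return a / (a + b)

# Example: prior Beta(1, 99) (prior mean 0.01), haplotype seen 3 times in 200.
print(posterior_mean(1.0, 99.0, 3, 200))  # (1 + 3) / (1 + 99 + 200) = 4/300
```

The posterior mean shrinks the raw count estimate N_i/N towards the prior mean u/(u + v), which is exactly what gives unobserved or rare haplotypes non-zero estimated frequencies.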

u_i and v_i
1. Calculate W_i = (1/(N - N_i)) sum_{j != i} N_j d_ij for i = 1, 2, ..., M, where d_ij denotes the Manhattan distance/L1 norm between haplotypes i and j
2. Order the W_i's by size, divide them into 15 (?) groups, and calculate the mean and variance of the f_i's in each group
3. Fit the regressions mu(W) = beta_1 + exp(beta_2 W + beta_3) and sigma^2(W) = beta_4 + exp(beta_5 W + beta_6) based on the 15 estimates
4. Calculate mu_i = mu(W_i) and sigma^2_i = sigma^2(W_i) and use these to calculate the prior parameters u_i and v_i
5. Apply the Bayesian approach to obtain the posterior distribution, e.g. to estimate f_i using the posterior mean
Only dane could fit the full regression, and the fit is not too comforting; the others resulted in mu_i < 0 or sigma^2_i < 0 for some i's.
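Steps 1 and 4 can be sketched as below. The haplotype data are made up, the grouping and regression steps (2-3) are omitted, and the moment-matching for v_i is an assumption chosen so that the prior mean equals mu_i (the talk's slides only state u_i explicitly, on a later slide).

```python
# Hedged sketch: compute W_i (count-weighted mean Manhattan distance to all
# other haplotypes) and moment-match Beta prior parameters from (mu_i, sigma2_i).

def manhattan(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def W(haplotypes, counts):
    """W_i = (1 / (N - N_i)) * sum_{j != i} N_j * d_ij."""
    N = sum(counts)
    return [sum(counts[j] * manhattan(hi, hj)
                for j, hj in enumerate(haplotypes) if j != i) / (N - counts[i])
            for i, hi in enumerate(haplotypes)]

def beta_params(mu, sigma2):
    """u follows the slides' u = mu^2 (1 - mu) / sigma2; v is chosen here so
    that the prior mean u / (u + v) equals mu (a moment-matching assumption)."""
    u = mu ** 2 * (1.0 - mu) / sigma2
    v = u * (1.0 - mu) / mu
    return u, v

haps = [(17, 14), (17, 15), (18, 14)]
counts = [3, 2, 1]
print(W(haps, counts))  # [1.0, 1.25, 1.4]
```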

Plot of the regression
[Figure: mu(W) for dane divided into 15 groups; fitted curve mu(W) = 0.002 + exp(81.389 W - 21.454)]

Plot of the regression
[Figure: sigma^2(W) for dane divided into 15 groups; fitted curve sigma^2(W) = 0 + exp(114.043 W - 31.492)]

Modified regression
Set beta_1 = beta_4 = 0 in the regression, yielding mu(W) = exp(beta_2 W + beta_3) and sigma^2(W) = exp(beta_5 W + beta_6).
Now berlin makes the best fits, which seem quite reasonable for mu(W) but more doubtful for sigma^2(W); dane fits almost as before.

Plot of the regression
[Figure: mu(W) for berlin divided into 16 groups; fitted curve mu(W) = 0 + exp(34.442 W - 12.972)]

Plot of the regression
[Figure: sigma^2(W) for berlin divided into 16 groups; fitted curve sigma^2(W) = 0 + exp(21.217 W - 13.662)]

Plot of the regression
[Figure: mu(W) for somali divided into 17 groups; fitted curve mu(W) = 0 + exp(14.328 W - 9.286)]

Plot of the regression
[Figure: sigma^2(W) for somali divided into 17 groups; fitted curve sigma^2(W) = 0 + exp(10.29 W - 9.328)]

Changes made on http://www.yhrd.org
At the 7th International Y Chromosome User Workshop in Berlin, 2010, Sascha Willuweit (one of the people behind http://www.yhrd.org) mentioned a couple of differences between the implementation at http://www.yhrd.org and the original article:
- The reduced regression is used, i.e. without the intercepts beta_1 and beta_4
- The number of groups is determined by fitting several regressions and choosing the best one (details on selecting the minimum number of groups were not mentioned)

Comments and proposals for changes
- Not a statistical model; more an ad-hoc method.
- mu_i = mu(W_i) = exp(a W_i + b) is not bounded above, so mu(W) >= 1 for W >= -b/a: for berlin, a = 34.44 and b = -12.97, so mu_i >= 1 for W_i >= 0.377 (whereas 0 <= W_i <= 1 and 0 < mu_i < 1 by definition).
- Fitting u_i and v_i: only W_i is available to fit two exponential regressions.
- The model is not consistent: generalisation to a Dirichlet prior and a multinomial likelihood might solve this.

Fitting two parameters (u_i and v_i) using only one (W_i)
[Figure: fitted u vs. fitted v for berlin based on 16 groups, using the regression without beta_1 and beta_4]

Fitting two parameters (u_i and v_i) using only one (W_i)
[Figure: fitted u vs. fitted v for dane based on 15 groups, using the regression with beta_1 and beta_4]

Fitting two parameters (u_i and v_i) using only one (W_i)
[Figure: fitted u vs. fitted v for dane based on 15 groups, using the regression without beta_1 and beta_4]

Fitting two parameters (u_i and v_i) using only one (W_i)
[Figure: fitted u vs. fitted v for somali based on 17 groups, using the regression without beta_1 and beta_4]

Idea: use the second moment
W_i = (1/(N - N_i)) sum_{j != i} N_j d_ij = (1/(N - N_i)) sum_{j != i} N_j sum_{k=1}^r d_ijk uses the first moment of the allele differences on each of the r loci.
Maybe also use the second moment by introducing Z_i = (1/(N - N_i)) sum_{j != i} N_j sum_{k=1}^r (d_ijk - dbar_ij)^2 / r, where dbar_ij = d_ij / r.
Fit the mu_i's and sigma^2_i's by multiple regression using a grid of W_i and Z_i values.
15 groups only correspond to a 4x4 grid, which is way too coarse; requiring 15 groups of both the W_i's and the Z_i's, the grid would have size 15 x 15 = 225, which requires a lot of observations!
Too few observations in berlin, dane, and somali, but it would be interesting to see how it would perform compared to just using the W_i's.

Generalised frequency model
Assume a priori that f ~ Dirichlet(alpha), where f = (f_1, f_2, ..., f_K) is the vector of frequencies for all possible haplotypes.
Use the likelihood N | f ~ Multinomial(N_+, f).
The posterior becomes f | N ~ Dirichlet(alpha_1 + N_1, ..., alpha_K + N_K) = Dirichlet(alpha_1 + N_1, ..., alpha_n + N_n, alpha_{n+1}, ..., alpha_K), where f_1, f_2, ..., f_n are the frequencies of the observed haplotypes and f_{n+1}, f_{n+2}, ..., f_K are those of the unobserved haplotypes.

Marginal distribution
Let alpha_+ = sum_{i=1}^K alpha_i.
The marginal posterior distribution for the i-th haplotype is
f_i | N_i ~ Beta(alpha_i + N_i, sum_{j=1}^K (alpha_j + N_j) - (alpha_i + N_i)) = Beta(alpha_i + N_i, alpha_+ - alpha_i + N_+ - N_i)
E[f_i | N_i] = (alpha_i + N_i) / ((sum_{j=1}^K (alpha_j + N_j) - (alpha_i + N_i)) + (alpha_i + N_i)) = (alpha_i + N_i) / (alpha_+ + N_+)
sum_{i=1}^K E[f_i | N_i] = (alpha_+ + N_+)^{-1} sum_{i=1}^K (alpha_i + N_i) = 1
Prior knowledge can be incorporated by specifying the prior parameter alpha_i for all possible haplotypes, but with this approach alpha_+ might be problematic to calculate for large K.
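The Dirichlet-multinomial posterior mean (alpha_i + N_i)/(alpha_+ + N_+) is a one-liner; the counts and prior weights below are illustrative assumptions.

```python
# Minimal sketch of the generalised estimator: with a Dirichlet(alpha) prior and
# multinomial counts N, each marginal posterior mean is (alpha_i + N_i) / (alpha_+ + N_+).

def dirichlet_posterior_means(alpha, counts):
    a_plus = sum(alpha)
    n_plus = sum(counts)
    return [(a + n) / (a_plus + n_plus) for a, n in zip(alpha, counts)]

# Three observed haplotypes and one unobserved (count 0), which still gets mass:
means = dirichlet_posterior_means([0.5, 0.5, 0.5, 0.5], [5, 3, 2, 0])
print(means)       # last entry is 0.5/12, so the unobserved haplotype has mass
print(sum(means))  # sums to 1 (up to floating point), as required
```

Note how the estimates automatically sum to one over all K haplotypes, which is exactly the consistency property the Beta-per-haplotype method lacks.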

Approximating alpha_+
[Figure: histogram of W_i for berlin; fitted Gamma (s = 25.926, r = 139.146) and Beta (s1 = 21.296, s2 = 93.016) densities; mean(W) = 0.186]

Approximating alpha_+
[Figure: histogram of W_i for dane; fitted Gamma (s = 15.291, r = 100.14) and Beta (s1 = 13.002, s2 = 72.151) densities; mean(W) = 0.153]

Approximating alpha_+
[Figure: histogram of W_i for somali; fitted Gamma (s = 3.112, r = 12.84) and Beta (s1 = 2.44, s2 = 7.636) densities; mean(W) = 0.242]

Approximating alpha_+
As stated earlier, 0 <= W_i <= 1, so the Beta distribution is theoretically the right choice.
Assume that alpha_i = h(W_i) for some function h, such that alpha_+ = sum_{i=1}^K h(W_i).
Denote by f_beta the density of a fitted Beta distribution; then alpha_+ is approximately K int_0^1 f_beta(W) h(W) dW.
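The integral approximation can be evaluated numerically with nothing but the standard library; the Beta parameters, the choice h(W) = W, and K below are illustrative assumptions (with h(W) = W the integral is just the Beta mean s1/(s1 + s2), which makes the sketch easy to check).

```python
# Numerical sketch of alpha_+ ~ K * int_0^1 f_beta(W) h(W) dW, using the
# trapezoidal rule and a hand-rolled Beta density.

import math

def beta_pdf(w, s1, s2):
    if w <= 0.0 or w >= 1.0:
        return 0.0
    logb = math.lgamma(s1) + math.lgamma(s2) - math.lgamma(s1 + s2)
    return math.exp((s1 - 1) * math.log(w) + (s2 - 1) * math.log(1 - w) - logb)

def approx_alpha_plus(h, s1, s2, K, n=10_000):
    vals = [beta_pdf(i / n, s1, s2) * h(i / n) for i in range(n + 1)]
    integral = sum((vals[i] + vals[i + 1]) / 2 for i in range(n)) / n
    return K * integral

est = approx_alpha_plus(lambda w: w, 2.0, 8.0, K=1.0)
print(est)  # ~ 0.2, the mean of a Beta(2, 8)
```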

Approximating alpha_+
To get equality between the first prior parameter in the Beta model and in the generalised model, let alpha_i = u_i = mu_i^2 (1 - mu_i) / sigma^2_i.
Because mu_i = mu(W_i) = exp(a W_i + b) can result in mu_i >= 1, then 1 - mu_i <= 0, such that alpha_i <= 0, which is not allowed.
For berlin, mu_i >= 1 for W_i >= 0.377, so all contributions to the integral in the alpha_+ approximation are negative for W_i >= 0.377.
- berlin: alpha_+ = 50205.04 and the uncovered probability mass is estimated to 0.986
- dane: -b/a = 0.271 and alpha_+ = 7897176
- somali: -b/a = 0.648 and alpha_+ = 1535169
Using alpha_i = u_i with the exponential regression is unusable, but the distribution of the W_i's might be helpful for other choices.

Problem
Maybe the method gets too much attention simply because it is the only published method for estimating haplotype frequencies.
At the 7th International Y Chromosome User Workshop in Berlin, 2010, Michael Krawczak (one of the authors of the original articles) gave a talk whose slides included the statement "[frequency estimation has] never [been] thoroughly studied and validated".


The idea: approximate when a common (partial) ancestor has been identified
The basic idea is to find I = {i_1, i_2, ..., i_q} subset of {1, 2, ..., r} such that
P(intersection_{j not in I} L_j = a_j | intersection_{i in I} L_i = a_i) ~ prod_{j not in I} P(L_j = a_j | intersection_{i in I} L_i = a_i)
is a good approximation.

Example
Assume that we have loci L_1, L_2, L_3, L_4 and I = {1, 2}. Then
P(L_1, L_2, L_3, L_4) = P(L_1) P(L_2 | L_1) P(L_3, L_4 | L_1, L_2)  (1)
~ P(L_1) P(L_2 | L_1) prod_{j=3}^4 P(L_j | L_1, L_2)  (2)

How to choose I
I is called an ancestral set, because it can be interpreted as a set of alleles that is shared with one's ancestors.
I can be found using a greedy approach: add to I the j that maximises P(L_j = a_j | intersection_{i in I} L_i = a_i).
Stop adding elements to I, e.g. when only a certain percentage of the observations is left for calculating the marginal probabilities conditional on I.
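The greedy search can be sketched as below. The database, the 10% stopping fraction, and the tie-breaking (first locus wins) are illustrative assumptions.

```python
# Hedged sketch of the greedy ancestral-set search: repeatedly add the locus j
# whose observed allele has the highest conditional probability given the loci
# already in I, stopping when too few matching observations remain.

def greedy_ancestral_set(db, target, min_fraction=0.10):
    """db: list of haplotype tuples; target: the haplotype whose probability
    we want to approximate. Returns (I, matching observations)."""
    r = len(target)
    I, matching = [], list(db)
    while len(I) < r:
        best_j, best_p, best_match = None, -1.0, None
        for j in range(r):
            if j in I:
                continue
            match = [h for h in matching if h[j] == target[j]]
            p = len(match) / len(matching)
            if p > best_p:
                best_j, best_p, best_match = j, p, match
        if len(best_match) < min_fraction * len(db):
            break  # too few observations left for reliable conditional marginals
        I.append(best_j)
        matching = best_match
    return I, matching

db = [(17, 14, 22), (17, 14, 23), (17, 15, 22), (18, 14, 22)] * 5
I, rest = greedy_ancestral_set(db, (17, 14, 22))
print(I)  # [0, 1, 2]: every locus survives the 10% cut in this toy database
```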

Drawbacks
The approach is simple, but, as before, it is not a statistical model.


The idea: alternately perceive different loci as a response variable
Let L_1, L_2, ..., L_r be the r different loci available in the haplotype. Then fit
L_{i_1} given {L_k}_{k not in {i_1}}  (3)
L_{i_2} given {L_k}_{k not in {i_1, i_2}}  (4)
...  (5)
L_{i_{r-2}} given {L_k}_{k not in {i_1, i_2, ..., i_{r-2}}}  (6)
L_{i_{r-1}} given {L_k}_{k not in {i_1, i_2, ..., i_{r-1}}} = L_{i_r}  (7)
and use the empirical distribution for L_{i_r}.

Properties
- A class of statistical models.
- The classifications can be done with any of several available classification methods, such as classification trees, ordered logistic regression, or support vector machines.
- Selection of the i_j's should be done using standard model selection criteria, depending on the classification model used.
- Does not incorporate prior knowledge.
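The chain decomposition can be sketched with plain empirical conditional frequencies standing in for a fitted classifier (rpart/polr/svm in the talk); the database, the ordering, and the zero fallback for unseen conditioning patterns are illustrative assumptions.

```python
# Illustrative sketch of the chain decomposition: the probability of a haplotype
# is the product of one conditional term per locus in the chosen order, times
# the empirical distribution of the last locus.

def chain_probability(db, target, order):
    p = 1.0
    remaining = list(range(len(target)))
    for i in order[:-1]:
        rest = [k for k in remaining if k != i]
        # observations agreeing with the target on all conditioning loci
        cond = [h for h in db if all(h[k] == target[k] for k in rest)]
        if not cond:
            return 0.0  # a real classifier would extrapolate here instead
        p *= sum(h[i] == target[i] for h in cond) / len(cond)
        remaining = rest
    last = order[-1]
    p *= sum(h[last] == target[last] for h in db) / len(db)
    return p

db = [(1, 2), (1, 2), (1, 3), (2, 2)]
print(chain_probability(db, (1, 2), [0, 1]))  # (2/3) * (3/4) = 2/4, the empirical frequency
```

Because the decomposition is the exact chain rule, using the empirical conditionals reproduces the empirical haplotype frequency; the point of plugging in a classifier is to smooth those conditionals so unobserved haplotypes get positive probability.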

Kernel methods and model based clustering

The idea: create a density around each haplotype
Put a scaled density/mass (called a kernel) around each haplotype, with mass equal to its relative frequency N_i/N_+.
In this way unobserved haplotypes get probability mass from their (near) neighbours.

Choice of kernel
A straightforward approach is the Gaussian kernel
K(z | x_i, lambda) = (2 pi lambda^2)^{-r/2} det(Sigma)^{-1/2} exp(-(1/(2 lambda^2)) (x_i - z)' Sigma^{-1} (x_i - z))
where lambda is a bandwidth parameter that has to be chosen.
A frequency estimate for any given haplotype z is g(z) = (1/N_+) sum_{i=1}^n N_i K(z | x_i, lambda).
To incorporate prior knowledge,
K(z | x_i, N_i, lambda) = (2 pi lambda^2 / N_i)^{-r/2} det(Sigma)^{-1/2} exp(-(N_i/(2 lambda^2)) (x_i - z)' Sigma^{-1} (x_i - z))
could be used.
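The estimator g(z) can be sketched as below, assuming Sigma = I (identity) to keep the example self-contained; the database, counts, and lambda are illustrative.

```python
# Sketch of the kernel frequency estimate g(z) = (1/N_+) * sum_i N_i K(z | x_i, lambda),
# with Sigma = I so that the quadratic form is a plain squared Euclidean distance.

import math

def gaussian_kernel(z, x, lam):
    r = len(z)
    sq = sum((zi - xi) ** 2 for zi, xi in zip(z, x))  # (x - z)' Sigma^{-1} (x - z), Sigma = I
    return (2 * math.pi * lam ** 2) ** (-r / 2) * math.exp(-sq / (2 * lam ** 2))

def g(z, haplotypes, counts, lam):
    n_plus = sum(counts)
    return sum(n * gaussian_kernel(z, x, lam)
               for x, n in zip(haplotypes, counts)) / n_plus

hs = [(17, 14), (17, 15), (19, 18)]
ns = [6, 3, 1]
# An unobserved haplotype close to observed ones still gets non-zero mass:
print(g((17, 16), hs, ns, lam=1.0))
```

Note that g((17, 15)) > g((17, 16)) here: observed haplotypes keep more mass than unobserved neighbours, which is the intended behaviour.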

Problem
The model can be inaccurate if the kernel has small variance, because then the actual mass, when evaluated over the discrete grid, can differ greatly from the relative frequency.
Discrete kernels could be tried instead, e.g. the multinomial.

Model based clustering
Estimating a frequency for a haplotype using the kernel method requires evaluating as many densities as there are haplotypes in the database.
Model based clustering can be used to cluster the haplotypes first, minimising the required number of density evaluations.
Same problem as with the kernel method if the variances are too small.

Model control: unobserved probability mass and marginal deviances

Model verification is, as always, crucial.
One important feature of a model is the ability to efficiently obtain further samples of haplotypes according to their probability under the model.

Different ways of comparing models
- Estimated unobserved probability mass
- Marginal deviances (a model's single and pairwise marginals compared to the observed ones)
- Several more should definitely be considered

Unobserved probability mass: point estimate
- K_0: the number of singletons
- N: the number of observations
In 1968, Robbins showed that V = K_0 / (N + 1) is an unbiased estimate of the unobserved probability mass.

Unobserved probability mass: limiting consistent variance estimate
- K_1: the number of doubletons
In 1986, Bickel and Yahav showed that, under some regularity conditions, sigma-hat^2 = K_0/N^2 - (K_0 - 2 K_1)^2/N^3 is a limiting consistent estimate of the variance of the unobserved probability mass.
Both V and the variance estimate can be verified by simulation.
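Both estimates are a few lines of code; the toy observations are illustrative, and the variance formula follows the reconstruction above (the operator between the two terms is not fully legible in the slides).

```python
# Sketch of Robbins' V = K0 / (N + 1) for the unobserved probability mass and
# the Bickel-Yahav-style variance estimate sigma^2 = K0/N^2 - (K0 - 2*K1)^2/N^3.

from collections import Counter

def unobserved_mass(observations):
    N = len(observations)
    counts = Counter(observations)
    K0 = sum(1 for c in counts.values() if c == 1)  # singletons
    K1 = sum(1 for c in counts.values() if c == 2)  # doubletons
    V = K0 / (N + 1)
    var = K0 / N ** 2 - (K0 - 2 * K1) ** 2 / N ** 3
    return V, var

obs = ["h1"] * 5 + ["h2"] * 2 + ["h3", "h4", "h5"]  # N = 10, K0 = 3, K1 = 1
V, var = unobserved_mass(obs)
print(V)  # 3/11, roughly 0.273
```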

Unobserved probability mass: simulation study
[Figure: confidence interval coverage for true population size Q = 10000, comparing the symmetric standard confidence interval and the logit-transformed confidence interval as N ranges from 10 to 910; based on 10000 simulations and 10-90 probabilities; reference line at 0.95]

The estimate seems like a good and simple way of performing model verification, but it cannot stand alone, as we shall soon see.
It can also be used to fit model parameters, e.g. the parameter lambda in the kernel model.

Unobserved probability mass: approximate confidence intervals

                      berlin          dane            somali
V                     0.364           0.602           0.277
sigma-hat             0.061           0.078           0.120
V 95% conf. int.      [0.321; 0.408]  [0.510; 0.694]  [0.212; 0.342]
S: beta_1/beta_4 != 0 NA              0.643           NA
S: beta_1/beta_4 = 0  0.479           0.703           0.335
rpart                 0.478           0.71            0.42
svm                   0.526           0.792           0.43
polr                  0.639           0.886           NA
Ancestor: 10%         0.39            0.589           0.182
Ancestor: 15%         0.454           0.668           0.22
Ancestor: 20%         0.466           0.713           0.246

Marginal deviances
Depending on the model, exact marginals can be difficult to obtain.
If haplotypes can be sampled according to their probability under a model, then the marginals can be approximated by simulating a huge number of haplotypes under that model.
Using only the observed marginals should correspond to this, but at least for small databases this is not the case, according to simulation studies performed with the classification models.

Deviance for pairwise marginals
Let {u}_ij be the two-way table of observation counts, {p-tilde}_ij the table of predicted probabilities under a model M_0, and {p-hat}_ij the relative frequencies such that p-hat_ij = u_ij / u_++.
For the pairwise marginal tables, the deviance is d = -2 log(L({p-tilde}_ij) / L({p-hat}_ij)), where L({p}_ij) = prod_{i,j} p_ij^{u_ij} is proportional to the likelihood (the constant u_++! / prod_{i,j} u_ij! cancels out in the fraction).
Then d = 2 sum_{i,j} u_ij log(p-hat_ij / p-tilde_ij), which is chi^2_nu distributed.
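The deviance formula can be sketched directly; the count table and model probabilities are illustrative, and cells with u_ij = 0 are taken to contribute 0 (the usual 0 log 0 convention).

```python
# Sketch of the pairwise-marginal deviance d = 2 * sum_ij u_ij * log(phat_ij / ptilde_ij),
# where phat_ij = u_ij / u_++ are the observed relative frequencies.

import math

def deviance(u, p_model):
    u_pp = sum(sum(row) for row in u)
    d = 0.0
    for i, row in enumerate(u):
        for j, u_ij in enumerate(row):
            if u_ij > 0:
                p_hat = u_ij / u_pp
                d += 2 * u_ij * math.log(p_hat / p_model[i][j])
    return d

u = [[30, 10], [20, 40]]
p_perfect = [[0.3, 0.1], [0.2, 0.4]]  # model equal to the observed relative frequencies
print(deviance(u, p_perfect))         # 0.0: a perfect fit has zero deviance
```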

A deviance is calculated for each pair of loci.
To compare models, these deviances can be summed and used for relative comparisons (the sum is not chi^2 distributed).
The deviance is calculated similarly for single marginals.

Deviance sums for single marginals

        berlin      dane      somali
rpart   1.236       1.449     0.268
svm     13434.632   2363.861  5664.264
polr    3.114       4.623     NA

Sum of deviances for observed single marginals vs. simulated single marginals, for the classification method specified in each row.

Deviance sums for pairwise marginals

        berlin     dane       somali
rpart   Inf        768.444    772.936
svm     Inf        24971.264  53797.852
polr    1761.153   Inf        NA

Sum of deviances for observed pairwise marginals vs. simulated pairwise marginals, for the classification method specified in each row.

Calculating evidence

Evidence: motivation of usage
The purpose is to get an unbiased opinion in a trial. The problem is often formulated via the hypotheses
H_p: The suspect left the crime stain together with n additional contributors.
H_d: Some other person left the crime stain together with n additional contributors.
Then the likelihood ratio LR = P(E | H_p) / P(E | H_d) is calculated.
In a courtroom it can then be stated that the evidence E is LR times more likely to have arisen under H_p than under H_d (the formulation is from Interpreting DNA Mixtures by Weir et al., 1997).

Computational challenge
The actual evaluation of the LR can be a computation-heavy task: the number of combinations giving rise to the same trace grows with the number of contributors.

Two contributors
Assume
H_p: The suspect left the crime stain together with one additional contributor.
H_d: Some other person left the crime stain together with one additional contributor.

Two contributors: notation
Let a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n) and define
T = a (+) b = ({a_1, b_1}, {a_2, b_2}, ..., {a_n, b_n})  (8)
T (-) a = ({b_1}, {b_2}, ..., {b_n})  (9)
where the sets are multisets.

Two contributors
Let T be the trace, h_s the suspect's haplotype, and h_1 the additional contributor's haplotype. Then T = h_s (+) h_1 and h_1 = T (-) h_s.
Say that (h_1, h_2) is consistent with the trace T if h_1 (+) h_2 = T, which is denoted (h_1, h_2) in T. This makes
LR = P(E | H_p) / P(E | H_d)  (10)
= P(h_s, T (-) h_s) / sum_{(h_1, h_2) in T} P(h_s, h_1, h_2)  (11)
= P(h_s) P(T (-) h_s) / (P(h_s) sum_{(h_1, h_2) in T} P(h_1) P(h_2))  (12)
= P(T (-) h_s) / sum_{(h_1, h_2) in T} P(h_1) P(h_2)  (13)
by assuming that haplotypes are independent.

Two contributors: number of terms in the denominator
Let T = (T_1, T_2, ..., T_r), where T_i is a set of alleles such that |T_i| in {1, 2}.
Let H_T = T_1 x T_2 x ... x T_r.
In the non-trivial case there exists a j in {1, 2, ..., r} such that T_j = {a_1, a_2} with a_1 != a_2.
Let T'_j = {a_1} (such that one of the alleles is removed) and H'_T = T_1 x ... x T_{j-1} x T'_j x T_{j+1} x ... x T_r.

Two contributors: number of terms in the denominator
Now the denominator of the LR can be written as 2 sum_{h_1 in H'_T} P(h_1) P(T (-) h_1).
If k denotes the number of loci in the trace with only one allele, and we assume the non-trivial case with 0 <= k < r, we have |H_T| = prod_{i=1}^r |T_i| = 2^{r-k}, such that |H'_T| = |H_T| / 2 = 2^{r-k-1} <= 2^{r-1}.
This means that for r loci, at most 2 * 2^{r-1} = 2^r haplotype frequencies have to be calculated, e.g. 2^10 = 1024 for r = 10.
If a trace has two contributors and no known suspects, the two most likely contributors can be chosen as argmax_{h_1 in H'_T} P(h_1) P(T (-) h_1).
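The counting argument can be checked by brute-force enumeration. Here the trace is represented as a tuple of per-locus allele tuples, and P is a toy lookup table with a floor for unseen haplotypes; both representations are assumptions for illustration, not the thesis' estimator.

```python
# Sketch of the two-contributor calculation: enumerate every ordered pair
# (h1, h2) whose per-locus merge gives the trace, then form the LR.

from itertools import product

def other(a, t):
    """The allele the second contributor carries at a locus with alleles t."""
    return t[0] if len(t) == 1 else (t[1] if a == t[0] else t[0])

def consistent_pairs(trace):
    return [(h1, tuple(other(a, t) for a, t in zip(h1, trace)))
            for h1 in product(*trace)]

def lr_two_contributors(trace, h_s, P):
    h1 = tuple(other(a, t) for a, t in zip(h_s, trace))  # h1 = T "minus" h_s
    denom = sum(P(x) * P(y) for x, y in consistent_pairs(trace))
    return P(h1) / denom

freqs = {(1, 2): 0.1, (2, 2): 0.05}
P = lambda h: freqs.get(h, 1e-6)   # toy frequency model with a floor
trace = ((1, 2), (2,))             # locus 1 shows alleles {1, 2}; locus 2 only {2}
print(len(consistent_pairs(trace)))  # 2 ordered pairs = 2^(r - k) with r = 2, k = 1
lr = lr_two_contributors(trace, (1, 2), P)  # 0.05 / (2 * 0.1 * 0.05)
```

For this trace and a suspect (1, 2), the formula reduces to LR = P((2, 2)) / (2 P((1, 2)) P((2, 2))), which also matches the n-contributor example later in the talk.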

n contributors: formulation of the LR for one locus
The LR is defined generally in Forensic interpretation of Y-chromosomal DNA mixtures by Wolf et al., 2005.
At a given locus, let E_t, E_s, and E_k be the sets of alleles from the trace, the suspect, and the known contributors, respectively.
Assume n unknown contributors and let A_n denote the set of alleles carried by these n unknown contributors.
Let P_n(V; W) = P(W subset of A_n subset of V). Then
LR = P_n(E_t; E_t \ (E_s union E_k)) / P_{n+1}(E_t; E_t \ E_k)  (14)
= P(E_t \ (E_s union E_k) subset of A_n subset of E_t) / P(E_t \ E_k subset of A_{n+1} subset of E_t)  (15)

n contributors: formulation of the LR for m loci
For m loci we have
LR = P_n(intersection_{i=1}^m {E_{t,i}; E_{t,i} \ (E_{s,i} union E_{k,i})}) / P_{n+1}(intersection_{i=1}^m {E_{t,i}; E_{t,i} \ E_{k,i}})

n contributors: example from the thesis, page 18
Let E_{t,1} = {1, 2}, E_{t,2} = {2}, E_{s,1} = {1}, E_{s,2} = {2}, and E_{k,1} = E_{k,2} = the empty set. Then (with A_n^i the allele set of the n unknown contributors at locus i)
LR = P_1(intersection_{i=1}^2 {E_{t,i}; E_{t,i} \ (E_{s,i} union E_{k,i})}) / P_2(intersection_{i=1}^2 {E_{t,i}; E_{t,i} \ E_{k,i}})
= P(intersection_{i=1}^2 {E_{t,i} \ (E_{s,i} union E_{k,i}) subset of A_1^i subset of E_{t,i}}) / P(intersection_{i=1}^2 {E_{t,i} \ E_{k,i} subset of A_2^i subset of E_{t,i}})
= P({E_{t,1} \ E_{s,1} subset of A_1^1 subset of E_{t,1}} intersect {E_{t,2} \ E_{s,2} subset of A_1^2 subset of E_{t,2}}) / P({E_{t,1} \ E_{k,1} subset of A_2^1 subset of E_{t,1}} intersect {E_{t,2} \ E_{k,2} subset of A_2^2 subset of E_{t,2}})
= P({{2} subset of A_1^1 subset of {1, 2}} intersect {A_1^2 subset of {2}}) / P({{1, 2} subset of A_2^1 subset of {1, 2}} intersect {{2} subset of A_2^2 subset of {2}})
= P(A_1 = ({2}, {2})) / P(A_2 = ({1, 2}, {2, 2}))
= P(h_1 = (2, 2)) / (P(h_1 = (1, 2), h_2 = (2, 2)) + P(h_1 = (2, 2), h_2 = (1, 2)))

n contributors: challenges
- It is complicated.
- One key ingredient in calculating the LR is being able to estimate frequencies for unobserved haplotypes.
- The next step is to be able to calculate the LR efficiently, even for a large number of contributors.

Further work

Further work
Y-STR haplotype frequencies:
- Better incorporation of prior knowledge in a statistical model, e.g. graphical models with other test statistics (ordinal data and incorporation of prior knowledge such as the single step mutation model)
- More and better ways to verify models
- Larger datasets (http://www.yhrd.org has gathered a lot of data, both publicly available in journals and directly from laboratories, but none of it is available to others, yet)
Y-STR mixtures:
- Efficient calculation of the LR
- Use of quantitative information (the amount of DNA material, which can be seen in the EPG) instead of only the qualitative
- Modelling of the signal in the EPG (electropherogram)

Questions?