Expected complete data log-likelihood and EM


Julius Copeland
6 years ago

In our EM algorithm, the expected complete data log-likelihood $Q$ is a function of a set of model parameters $\tau$, i.e.,

$$Q(\tau) = \sum_{m=1}^{M} \sum_{z_m, l_m} \log f(b_m, r_m, g_m \mid z_m, l_m, \tau)\, p_m(z_m, l_m),$$

where $M$ is the total number of markers, $m$ is the SNP marker index, $b_m$ is the observed BAF, $r_m$ is the observed LRR, $g_m$ is the error-free genotype, $z_m = (z_{m1}, z_{m2})$ is the ordered pair of haplotype cluster memberships, $l_m$ is the aberration type, $\tau$ is the set of model parameters, and $p_m(z_m, l_m) \equiv p(z_m, l_m \mid \tau', b, r, g)$ is the conditional marginal distribution given the current parameter estimates $\tau'$. We further assume that, conditioned on $z_m$ and $l_m$, $r_m$ and $(g_m, b_m)$ are independent (see Materials and Methods). Thus

$$Q(\tau) = \sum_{m=1}^{M} \sum_{z_m, l_m} \left[ \log f(r_m \mid l_m, \tau) + \log f(b_m, g_m \mid z_m, l_m, \tau) \right] p_m(z_m, l_m).$$

We maximize $Q$ at each EM cycle by setting its partial derivative with respect to each parameter to zero and solving the resulting equation. For some parameters a closed-form solution is available; for others a numerical method must be applied. In our experience, when the tumor mixture proportion is high (e.g., above 10%), we can approximate the M-step by maximizing $Q$ with respect to each individual parameter in $\tau$ marginally, rather than maximizing in a multivariate manner. For extremely low tumor purity (e.g., about 3%), however, we must use expected conditional maximization (ECM) to avoid convergence problems: after maximizing each parameter, we re-compute the posterior probabilities of the latent states with the updated estimates. ECM is computationally more expensive.

Estimation of the mixture proportion

The derivative of $Q$ with respect to the tumor DNA mixture proportion $w$ is composed of two summations, involving the derivatives of the LRR and BAF densities respectively:

$$\frac{\partial}{\partial w} Q(w) = \sum_{m=1}^{M} \sum_{z_m, l_m} \frac{\partial}{\partial w} \log f(r_m \mid l_m, \tau)\, p_m(z_m, l_m) + \sum_{m \in \{i \text{ s.t. } g_i = 1\}} \sum_{z_m, l_m} \frac{\partial}{\partial w} \log f(b_m, g_m \mid z_m, l_m, \tau)\, p_m(z_m, l_m), \quad (1)$$

where $M$ is the total number of SNP markers and the inner sum is over all combinations of $z$ and $l$. Since BAFs are informative at heterozygous sites only (germline homozygous sites have a derivative of zero with respect to $w$), the second summation in equation (1) is limited to germline heterozygous sites.

We assume LRRs follow the same normal distribution as defined in GPHMM, except for the addition of a sample-specific scale factor, i.e.,

$$f(r \mid l, w, o_r, \sigma_r, q) = \frac{1}{\sigma_r}\, \phi\!\left( \frac{r - \mu_r(l, w, q) - o_r}{\sigma_r} \right),$$

where

$$\mu_r(l, w, q) \equiv q \log_2 \frac{2(1 - w) + w\,(\alpha(l) + \beta(l))}{2},$$

$\phi$ is the pdf of the standard normal distribution, $l$ is the latent aberration type, $\sigma_r^2$ is the variance, $o_r$ is the global baseline shift, and $q$ is the LRR scale. The functions $\alpha(l_m)$ and $\beta(l_m)$ are defined on the state space of $l$ and give the parent-specific allele copy numbers. The derivative in the first summation of equation (1) is

$$\frac{\partial}{\partial w} \log f(r_m \mid l_m, \tau) = \frac{r_m - o_r - q \log_2 \frac{2(1 - w) + w(\alpha(l_m) + \beta(l_m))}{2}}{\sigma_r^2} \cdot \frac{q\,(\alpha(l_m) + \beta(l_m) - 2)}{2(1 - w) + w(\alpha(l_m) + \beta(l_m))}\, \log_2 e.$$

We focus on low purity samples, where the perturbed BAF remains relatively close to one-half and the truncation of BAFs at 0 or 1 for heterozygotes is of minimal concern. Thus, at germline heterozygous sites, we assume the potentially mixed BAF is distributed as

$$f(b \mid h, l, w, o_b, \sigma_b) = \frac{1}{\sigma_b}\, \phi\!\left( \frac{b - \mu_b(h, l, w) - o_b}{\sigma_b} \right),$$

where $\phi$ is the pdf of the standard normal distribution, $\sigma_b^2$ is the variance of BAF, $o_b$ is a global baseline shift, $h$ is the inherited allele configuration (either "AB" or "BA"), and

$$\mu_b(h, l, w) \equiv \frac{0.5\, w\, (\beta(l) - \alpha(l))\, (-1)^{1\{h = \text{"AB"}\}}}{2(1 - w) + w\,(\alpha(l) + \beta(l))} + 0.5.$$

For simplicity, we subtract 0.5 from the observed BAFs. We can then drop the 0.5 from the expression for $\mu_b(h, l, w)$, which has opposite signs for the allele configurations "AB" and "BA". The derivative in the second summation of equation (1) is

$$\frac{\partial}{\partial w} \log f(b_m, g_m = 1 \mid z_m = (j, k), l_m, w) = \frac{1}{\sigma_b^2} \left[ (b_m - o_b)\, \frac{1 - \Omega_m}{1 + \Omega_m} - \mu^{AB}_m \right] \frac{\partial}{\partial w} \mu^{AB}_m,$$

where

$$\Omega_m \equiv \exp\!\left( -\frac{2\,(b_m - o_b)\, \mu^{AB}_m}{\sigma_b^2} \right) \frac{p(h_m = BA \mid z_m = (j, k))}{p(h_m = AB \mid z_m = (j, k))} = \exp\!\left( -\frac{2\,(b_m - o_b)\, \mu^{AB}_m}{\sigma_b^2} \right) \frac{\theta_{jm}\,(1 - \theta_{km})}{\theta_{km}\,(1 - \theta_{jm})},$$

$$\mu^{AB}_m \equiv \mu_b(h_m = AB, l_m, w) = \frac{0.5\,(\alpha(l_m) - \beta(l_m))\, w}{2(1 - w) + (\alpha(l_m) + \beta(l_m))\, w},$$

$$\frac{\partial}{\partial w} \mu^{AB}_m = \frac{\alpha(l_m) - \beta(l_m)}{\left[ (\alpha(l_m) + \beta(l_m) - 2)\, w + 2 \right]^2},$$

and $\theta_{im}$ is the probability that the allele is B given that the haplotype cluster membership is $i$ at marker $m$, as defined in the fastPHASE model [1]. After substituting the two derivatives into equation (1), we do not have a closed-form solution, so we rely on numerical root-finding methods. In practice, we use the secant method with the previous $w$ estimates as initial values.

Estimation of BAF global baseline shift $o_b$

The derivative of $Q$ with respect to $o_b$ is

$$\frac{\partial}{\partial o_b} Q(o_b) = \sum_{m \in \{i \text{ s.t. } g_i = 1\}} \sum_{z_m, l_m} \frac{\partial}{\partial o_b} \log f(b_m, g_m \mid z_m, l_m, \tau)\, p_m(z_m, l_m).$$

Therefore, the new estimate of $o_b$ is

$$\hat{o}_b = \frac{1}{M_{het}} \sum_{m \in \{i \text{ s.t. } g_i = 1\}} \sum_{z_m, l_m} p_m(z_m, l_m) \left[ b_m - \mu^{AB}_m\, \frac{1 - \Omega_m}{1 + \Omega_m} \right],$$

where $M_{het}$ is the number of germline heterozygous SNP markers.

Estimation of BAF variance $\sigma_b^2$

The derivative of $Q$ with respect to $\sigma_b^2$ is

$$\frac{\partial}{\partial \sigma_b^2} Q(\sigma_b^2) = \sum_{m \in \{i \text{ s.t. } g_i = 1\}} \sum_{z_m, l_m} \frac{\partial}{\partial \sigma_b^2} \log f(b_m, g_m \mid z_m, l_m, \tau)\, p_m(z_m, l_m).$$

Using the normality assumption for the BAF distribution,

$$\frac{\partial}{\partial \sigma_b^2} \log f(b_m, g_m = 1 \mid z_m = (j, k), l_m, w) = \frac{1}{2 \sigma_b^4} \left[ -\sigma_b^2 + (b_m - o_b)^2 + (\mu^{AB}_m)^2 - 2\,(b_m - o_b)\, \mu^{AB}_m\, \frac{1 - \Omega_m}{1 + \Omega_m} \right].$$

We apply a numerical root-finding method to obtain the new estimate.
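The secant update for the mixture proportion $w$ can be illustrated with a minimal Python sketch. This is not the authors' implementation. To stay short it solves only the LRR part of the score equation (the BAF term and the haplotype configuration weights are left out), it assumes base-2 logarithms, and it collapses the posterior to the aberration type, $p_m(l)$. The names `mu_r`, `lrr_score`, `secant_root`, `copies` and `post`, and the toy data, are hypothetical.

```python
import math

def mu_r(w, total_copies, q):
    """Expected LRR at tumor proportion w for a state with
    total_copies = alpha(l) + beta(l). Base-2 log is an assumption."""
    return q * math.log2((2.0 * (1.0 - w) + w * total_copies) / 2.0)

def lrr_score(w, r, post, copies, o_r, sigma2_r, q):
    """LRR part of dQ/dw: sum over markers m and states l of
    p_m(l) * d/dw log f(r_m | l, tau)."""
    total = 0.0
    for r_m, post_m in zip(r, post):
        for p_ml, c in zip(post_m, copies):
            x = 2.0 * (1.0 - w) + w * c              # average copy number in the mixture
            resid = r_m - o_r - mu_r(w, c, q)        # LRR residual under state l
            dmu_dw = q * (c - 2.0) / x * math.log2(math.e)
            total += p_ml * resid / sigma2_r * dmu_dw
    return total

def secant_root(f, x0, x1, tol=1e-10, max_iter=100):
    """Secant iteration for f(x) = 0 started from two initial guesses."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:                                 # flat secant line, stop
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
        if abs(x1 - x0) < tol:
            break
    return x1

# Toy data (hypothetical): two states with total copy number 2 (normal) and
# 3 (single-copy gain); every marker is assigned to the gain state.
copies = [2.0, 3.0]
true_w, q, o_r, sigma2_r = 0.3, 1.0, 0.0, 0.04
noise = [-0.05, 0.05, -0.02, 0.02]                   # symmetric, so the mean LRR is exact
r = [mu_r(true_w, 3.0, q) + e for e in noise]
post = [[0.0, 1.0] for _ in r]

# Two previous w estimates serve as the starting values.
w_hat = secant_root(lambda w: lrr_score(w, r, post, copies, o_r, sigma2_r, q),
                    0.10, 0.15)
```

In a full M-step the function passed to `secant_root` would be the complete derivative in equation (1), including the BAF summation over germline heterozygous markers. The same routine could be reused for the $\sigma_b^2$ update.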

Estimation of variance and global baseline shift for LRR ($\sigma_r^2$, $o_r$)

It is easy to show that the solutions that maximize $Q$ with respect to $o_r$ and $\sigma_r^2$ are

$$\hat{o}_r = \frac{1}{M} \sum_{m=1}^{M} \sum_{l_m} \left( r_m - \mu_r(l_m, w) \right) p_m(l_m)$$

and

$$\hat{\sigma}_r^2 = \frac{1}{M} \sum_{m=1}^{M} \sum_{l_m} \left( r_m - \mu_r(l_m, w) - o_r \right)^2 p_m(l_m),$$

where $p_m(l_m) = \sum_{z_m} p_m(z_m, l_m)$.

Estimation of LRR scale coefficient $q$

It has been pointed out that the amplitude of LRR varies from sample to sample and that the observed amplitude is usually smaller than the standard value $\log_2 \frac{\text{tumor copy number}}{2}$ [2]. In GAP, this is modeled with a simple coefficient of contraction that is specific to the sample. GPHMM models the expected LRR $\mu_r(l, w)$ as a fixed $\log_{10}$-based constant times $\log_2 \frac{\text{average allele copy number in mixture}}{2}$. In our model, extra flexibility is achieved by replacing this constant with an LRR scale parameter $q$. The new estimate for updating $q$ is

$$\hat{q} = \frac{\sum_{m=1}^{M} \sum_{z_m, l_m} p_m(z_m, l_m)\, (r_m - o_r) \log_2 \frac{2(1 - w) + w(\alpha(l_m) + \beta(l_m))}{2}}{\sum_{m=1}^{M} \sum_{z_m, l_m} p_m(z_m, l_m) \left[ \log_2 \frac{2(1 - w) + w(\alpha(l_m) + \beta(l_m))}{2} \right]^2}.$$

Estimation of a GC content coefficient

Local GC content may induce a wave effect in the LRR data [3]. Adjusting for GC content can therefore reduce the noise in the LRR signal, as demonstrated in GPHMM [4]. Similar to GPHMM, we use the average GC percentage in a 1 Mb window around each SNP marker. Let $x_m$, $m = 1, \ldots, M$, denote the average GC content at marker $m$, and let $t$ be a global coefficient for GC content. Then we can rewrite the density for the LRR data as

$$f(r_m \mid x_m, l_m, w, o_r, \sigma_r, q, t) = \frac{1}{\sigma_r}\, \phi\!\left( \frac{r_m - \mu_r(l_m, w, q) - o_r - t\, x_m}{\sigma_r} \right).$$
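The closed-form LRR updates for $o_r$, $q$ and $\sigma_r^2$ amount to posterior-weighted averages and a weighted least-squares slope. The sketch below shows one way they might be coded, without the GC term. It is an illustration under stated assumptions, not the authors' code: logarithms are taken base 2, the posterior is already marginalized over $z_m$ (so each row of `post` sums to one), and the three parameters are updated one after another in the order shown. The names `log_copy_ratio`, `update_lrr_params`, `post` and `copies`, and the toy data, are hypothetical.

```python
import math

def log_copy_ratio(w, c):
    """log2(average copy number in the mixture / 2) for total tumor copies c.
    Base 2 is an assumption."""
    return math.log2((2.0 * (1.0 - w) + w * c) / 2.0)

def weighted_terms(r, post, logs):
    """Yield (p_m(l), r_m, log ratio for state l) over all markers and states."""
    for r_m, row in zip(r, post):
        for p, L in zip(row, logs):
            yield p, r_m, L

def update_lrr_params(r, post, copies, w, q):
    """One pass of closed-form updates: o_r, then q, then sigma_r^2."""
    M = len(r)
    logs = [log_copy_ratio(w, c) for c in copies]
    # o_r: posterior-weighted mean of r_m - mu_r(l, w), with mu_r = q * log ratio
    o_new = sum(p * (r_m - q * L) for p, r_m, L in weighted_terms(r, post, logs)) / M
    # q: weighted least-squares slope of (r_m - o_r) on the log ratio
    num = sum(p * (r_m - o_new) * L for p, r_m, L in weighted_terms(r, post, logs))
    den = sum(p * L * L for p, r_m, L in weighted_terms(r, post, logs))
    q_new = num / den
    # sigma_r^2: posterior-weighted mean squared residual
    sigma2_new = sum(p * (r_m - q_new * L - o_new) ** 2
                     for p, r_m, L in weighted_terms(r, post, logs)) / M
    return o_new, q_new, sigma2_new

# Toy data (hypothetical): noise-free LRRs from three states (total copies 1, 2, 3).
w, q_true, o_true = 0.4, 0.8, 0.05
copies = [1.0, 2.0, 3.0]
states = [0, 1, 2, 2, 0, 1]                       # hard state assignment per marker
post = [[1.0 if l == s else 0.0 for l in range(3)] for s in states]
r = [q_true * log_copy_ratio(w, copies[s]) + o_true for s in states]

q_new = 1.0                                       # start from a deliberately wrong scale
for _ in range(20):                               # repeat the pass until it settles
    o_new, q_new, sigma2_new = update_lrr_params(r, post, copies, w, q_new)
```

On this noise-free toy input the repeated passes recover the scale and baseline shift used to generate the data and drive the variance estimate to zero. With real data the posteriors would be recomputed in the E-step between passes.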

It is easy to show that the estimate for $t$ is

$$\hat{t} = \frac{\sum_{m=1}^{M} \sum_{z_m, l_m} p_m(z_m, l_m) \left( r_m - o_r - \mu_r(l_m, w, q) \right) x_m}{\sum_{m=1}^{M} \sum_{z_m, l_m} p_m(z_m, l_m)\, x_m^2}.$$

The above estimates for the rest of the parameters remain valid if we replace $r_m$ with $r_m - t\, x_m$.

Identification of over-represented allele in tumor DNA

After the EM algorithm converges, the latent aberration state and haplotype cluster membership at marker $m$ have joint posterior probability $p^c_m(z_m, l_m) = p(z_m, l_m \mid g, r, b, \nu, \hat{\tau})$. We then compute the probability that allele B is over-represented at a germline heterozygous marker $m$ as follows:

$$\sum_{z_m, l_m} p(\text{B is over-represented} \mid z_m, l_m)\, p^c_m(z_m, l_m) = \sum_{z_m, l_m} \sum_{h_m \in \{(A,B), (B,A)\}} 1\{\text{B is over-represented} \mid h_m, l_m\}\, p(h_m \mid z_m)\, p^c_m(z_m, l_m),$$

where $1\{\cdot\}$ is an indicator function. The probability for allele A can be obtained similarly.

Mean copy of haplotype cluster in tumor DNA

It is possible that a causal factor is correlated with a particular haplotype background, either because an untyped causal germline allele is well tagged by a haplotype or because of a haplotype effect itself. It may therefore be helpful to test the association of phenotypes with the mean copy number of a haplotype cluster. Given the posterior probability $p^c_m(z_m, l_m)$ defined above, the mean copy of haplotype cluster $k$ at marker $m$ is

$$\sum_{z_m, l_m} \left[ 1\{z_{m1} = k\}\, \alpha(l_m) + 1\{z_{m2} = k\}\, \beta(l_m) \right] p^c_m(z_m, l_m),$$

where $z_m = (z_{m1}, z_{m2})$.

References

[1] P. Scheet and M. Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics, 78(4):629–644, 2006.

[2] T. Popova, E. Manié, D. Stoppa-Lyonnet, G. Rigaill, E. Barillot, M.H. Stern, et al. Genome Alteration Print (GAP): a tool to visualize and mine complex cancer genomic profiles obtained by SNP arrays. Genome Biology, 10(11):R128, 2009.

[3] Sharon J. Diskin, Mingyao Li, Cuiping Hou, Shuzhang Yang, Joseph Glessner, Hakon Hakonarson, Maja Bucan, John M. Maris, and Kai Wang. Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Research, 36(19):e126, 2008.

[4] A. Li, Z. Liu, K. Lezon-Geyda, S. Sarkar, D. Lannin, V. Schulz, I. Krop, E. Winer, L. Harris, and D. Tuck. GPHMM: an integrated hidden Markov model for identification of copy number alteration and loss of heterozygosity in complex tumor samples using whole genome SNP arrays. Nucleic Acids Research, 39(12), 2011.
