Allele Frequency Estimation


Example: ABO blood types

The ABO genetic locus exhibits three alleles: A, B, and O, which give rise to four phenotypes: A, B, AB, and O.

Genotype:   A/A   A/O   A/B   B/B   B/O   O/O
Phenotype:  A     A     AB    B     B     O

Data: observed counts of the four phenotypes A, B, AB, and O:

n_A, n_B, n_AB, n_O,   with n = n_A + n_B + n_AB + n_O.

Aim: estimate the frequencies p_A, p_B, and p_O of the alleles A, B, and O.

Modelling:
Observed data: N_A, N_B, N_AB, N_O
Complete data: N_AA, N_AO, N_BB, N_BO, N_AB, N_O

According to the Hardy-Weinberg law, the genotype frequencies are

Genotype:   A/A     A/O         A/B         B/B     B/O         O/O
Frequency:  p_A^2   2 p_A p_O   2 p_A p_B   p_B^2   2 p_B p_O   p_O^2

The genotype counts N = (N_AA, N_AO, N_AB, N_BB, N_BO, N_O) are jointly multinomially distributed.
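As a quick check of the Hardy-Weinberg table, the six genotype probabilities can be computed in R for any allele-frequency vector; this is a small illustrative sketch (the function name hwprobs is ours, not from the original):

# Hardy-Weinberg genotype probabilities for allele frequencies p = (pa, pb, po).
# The six probabilities sum to (pa + pb + po)^2 = 1.
hwprobs <- function(p) {
  c(AA = p[1]^2, AO = 2*p[1]*p[3], AB = 2*p[1]*p[2],
    BB = p[2]^2, BO = 2*p[2]*p[3], OO = p[3]^2)
}
hwprobs(c(1, 1, 1)/3)   # uniform allele frequencies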

Complete-data log-likelihood function:

l_n(p | N) = N_AA log(p_A^2) + N_BB log(p_B^2) + N_O log(p_O^2)
           + N_AB log(2 p_A p_B) + N_AO log(2 p_A p_O) + N_BO log(2 p_B p_O)
           + log( n! / (N_AA! N_AO! N_AB! N_BB! N_BO! N_O!) )

Application of the EM algorithm. Let N_obs = (N_A, N_B, N_AB, N_O).

E-step: Since N_AA + N_AO = N_A, we have

N_AA | N_A ~ Bin( N_A, p_A^2 / (p_A^2 + 2 p_A p_O) ),

which yields the expectations

N_AA^(k) = E(N_AA | N_obs, p^(k)) = N_A * p_A^2 / (p_A^2 + 2 p_A p_O)
N_AO^(k) = E(N_AO | N_obs, p^(k)) = N_A * 2 p_A p_O / (p_A^2 + 2 p_A p_O)

(all allele frequencies evaluated at p^(k)), and similarly

N_BB^(k) = E(N_BB | N_obs, p^(k)) = N_B * p_B^2 / (p_B^2 + 2 p_B p_O)
N_BO^(k) = E(N_BO | N_obs, p^(k)) = N_B * 2 p_B p_O / (p_B^2 + 2 p_B p_O)

while obviously E(N_AB | N_obs, p^(k)) = N_AB and E(N_O | N_obs, p^(k)) = N_O.
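The E-step can be written as a small helper that turns observed phenotype counts into expected genotype counts; a minimal sketch of the formulas above (the function name estep_counts is ours):

# E-step: expected genotype counts given observed phenotype counts
# N = (Na, Nb, Nab, No) and current allele frequencies p = (pa, pb, po).
estep_counts <- function(N, p) {
  pAA <- p[1]^2 / (p[1]^2 + 2*p[1]*p[3])   # P(A/A | phenotype A)
  pBB <- p[2]^2 / (p[2]^2 + 2*p[2]*p[3])   # P(B/B | phenotype B)
  c(Naa = N[1]*pAA, Nao = N[1]*(1 - pAA),
    Nbb = N[2]*pBB, Nbo = N[2]*(1 - pBB),
    Nab = N[3], No = N[4])                 # N_AB and N_O are observed directly
}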

M-step: Maximize Q(p | p^(k)) under the restriction p_A + p_B + p_O = 1. Introduce a Lagrange multiplier (Rice, p. 9) and maximize

Q_L(p, λ) = Q(p | p^(k)) + λ (p_A + p_B + p_O - 1)

with respect to p and λ:

∂Q_L/∂p_A = 2 N_AA^(k)/p_A + N_AO^(k)/p_A + N_AB/p_A + λ
∂Q_L/∂p_B = 2 N_BB^(k)/p_B + N_BO^(k)/p_B + N_AB/p_B + λ
∂Q_L/∂p_O = N_AO^(k)/p_O + N_BO^(k)/p_O + 2 N_O/p_O + λ
∂Q_L/∂λ  = p_A + p_B + p_O - 1

Setting the derivatives to zero, multiplying the first three equations by p_A, p_B and p_O, respectively, and taking their sum, we get (using p_A + p_B + p_O = 1)

2n + λ = 0,  i.e.  λ = -2n,

which yields for the first three equations the solutions

p_A^(k+1) = (2 N_AA^(k) + N_AO^(k) + N_AB) / (2n)
p_B^(k+1) = (2 N_BB^(k) + N_BO^(k) + N_AB) / (2n)
p_O^(k+1) = (N_AO^(k) + N_BO^(k) + 2 N_O) / (2n)
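The numerators of the three updates count the expected number of A, B and O alleles among the 2n alleles carried by the n individuals, so the updates automatically sum to one. A quick numeric check (the expected counts below are made-up illustrative values):

# Each update is (expected allele count)/(2n); the A, B and O allele
# counts together total 2n, so the three updates sum to one.
Naa <- 30; Nao <- 60; Nbb <- 5; Nbo <- 25; Nab <- 10; No <- 70
n  <- Naa + Nao + Nbb + Nbo + Nab + No
pa <- (2*Naa + Nao + Nab)/(2*n)
pb <- (2*Nbb + Nbo + Nab)/(2*n)
po <- (Nao + Nbo + 2*No)/(2*n)
pa + pb + po   # = 1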

Starting values: p_A^(0) = p_B^(0) = p_O^(0) = 1/3.

Iterations: [table of iterates k, p_A^(k), p_B^(k), p_O^(k) not reproduced]

Starting values: p_A^(0) = p_O^(0) = 0.01, p_B^(0) = 0.98.

Iterations: [table of iterates not reproduced]

Implementation in R:

#EM iteration
# Arguments:
#   N = (Na, Nb, Nab, No)  observed phenotype counts
#   p = (pa, pb, po)       current allele frequencies
emstep <- function(N, p) {
  #E-step: expected genotype counts given N and p
  Naa <- N[1]*p[1]^2/(p[1]^2 + 2*p[1]*p[3])
  Nao <- N[1]*2*p[1]*p[3]/(p[1]^2 + 2*p[1]*p[3])
  Nbb <- N[2]*p[2]^2/(p[2]^2 + 2*p[2]*p[3])
  Nbo <- N[2]*2*p[2]*p[3]/(p[2]^2 + 2*p[2]*p[3])
  #M-step: updated allele frequencies
  n <- sum(N)
  p[1] <- (2*Naa + Nao + N[3])/(2*n)
  p[2] <- (2*Nbb + Nbo + N[3])/(2*n)
  p[3] <- (Nao + Nbo + 2*N[4])/(2*n)
  p  # return the updated parameter vector
}

#Data
N <- c(186, 38, 13, 284)
#Starting value
p <- c(1, 1, 1)/3
#First iteration
p <- emstep(N, p)
#Second iteration
p <- emstep(N, p)
#Repeat until convergence

Note: the results do not change for different starting values.
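The "repeat until convergence" step can be wrapped in a loop with a stopping tolerance; a minimal sketch (the wrapper em and its tolerance are our additions):

# Iterate emstep until successive parameter vectors agree to within tol.
em <- function(N, p, tol = 1e-8, maxit = 1000) {
  for (k in 1:maxit) {
    pnew <- emstep(N, p)
    if (max(abs(pnew - p)) < tol) break
    p <- pnew
  }
  pnew
}
phat <- em(N, c(1, 1, 1)/3)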

Mixtures

Example: Old Faithful

Data: 272 waiting times between eruptions for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

[Figure: histogram of the waiting times between eruptions (min), frequency on the vertical axis]

Model: mixture of two Gaussian populations (short/long waiting times), with φ the standard normal density:

f_Y(y | θ) = p (1/σ_1) φ((y - µ_1)/σ_1) + (1 - p) (1/σ_2) φ((y - µ_2)/σ_2)

Parameters: θ = (p, µ_1, µ_2, σ_1^2, σ_2^2)^T

Idea: if we knew the group to which each observation belongs, we could simply fit a normal distribution to each group.

Missing data: group indicator

Z_i = 1 if Y_i belongs to the group of long waiting times,
Z_i = 0 if Y_i belongs to the group of short waiting times.

Z_i is Bernoulli distributed with parameter p: Z_i ~ iid Bin(1, p).

Complete-data likelihood:

L_n(θ | Y, Z) = ∏_{i=1}^n p^{Z_i} (1 - p)^{1 - Z_i} [ (1/σ_1) φ((Y_i - µ_1)/σ_1) ]^{Z_i} [ (1/σ_2) φ((Y_i - µ_2)/σ_2) ]^{1 - Z_i}
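To see what the model looks like, the mixture density can be evaluated and plotted for a given parameter vector; a small sketch (the parameter values below are illustrative, not fitted):

# Two-component Gaussian mixture density with
# theta = (p, mu1, mu2, sigma1^2, sigma2^2).
dmix <- function(y, theta) {
  theta[1]*dnorm(y, theta[2], sqrt(theta[4])) +
    (1 - theta[1])*dnorm(y, theta[3], sqrt(theta[5]))
}
curve(dmix(x, c(0.4, 55, 80, 30, 30)), from = 40, to = 100,
      xlab = "Waiting time (min)", ylab = "Density")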

Log-likelihood function:

l_n(θ | Y, Z) = Σ_{i=1}^n Z_i log(p) + Σ_{i=1}^n (1 - Z_i) log(1 - p)
              - (1/2) Σ_{i=1}^n Z_i log(2πσ_1^2) - (1/(2σ_1^2)) Σ_{i=1}^n Z_i (Y_i - µ_1)^2
              - (1/2) Σ_{i=1}^n (1 - Z_i) log(2πσ_2^2) - (1/(2σ_2^2)) Σ_{i=1}^n (1 - Z_i)(Y_i - µ_2)^2

Application of the EM algorithm

E-step: l_n(θ | Y, Z) is linear in Z_i. It therefore suffices to find the conditional mean E(Z_i | Y_i, θ^(k)). The conditional distribution of Z_i given Y_i is

Z_i | Y_i, θ^(k) ~ Bin(1, p_i^(k))

with

p_i^(k) = [ p^(k) (1/σ_1^(k)) φ((Y_i - µ_1^(k))/σ_1^(k)) ]
          / [ p^(k) (1/σ_1^(k)) φ((Y_i - µ_1^(k))/σ_1^(k)) + (1 - p^(k)) (1/σ_2^(k)) φ((Y_i - µ_2^(k))/σ_2^(k)) ].

Thus the conditional mean is E(Z_i | Y_i, θ^(k)) = p_i^(k).
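The weight p_i^(k) is just Bayes' rule applied to the group indicator Z_i; a minimal sketch computing it for a few observations (the parameter values are illustrative):

# Posterior probability that Y_i belongs to component 1,
# given theta = (p, mu1, mu2, sigma1^2, sigma2^2).
posterior_z <- function(y, theta) {
  f1 <- theta[1]*dnorm(y, theta[2], sqrt(theta[4]))
  f2 <- (1 - theta[1])*dnorm(y, theta[3], sqrt(theta[5]))
  f1/(f1 + f2)
}
posterior_z(c(50, 70, 95), c(0.4, 55, 80, 30, 30))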

M-step: Substituting p_i^(k) for Z_i we obtain

Q(θ | θ^(k)) = Σ_{i=1}^n p_i^(k) log(p) + Σ_{i=1}^n q_i^(k) log(1 - p)
             - (1/2) Σ p_i^(k) log(2πσ_1^2) - (1/(2σ_1^2)) Σ p_i^(k) (Y_i - µ_1)^2
             - (1/2) Σ q_i^(k) log(2πσ_2^2) - (1/(2σ_2^2)) Σ q_i^(k) (Y_i - µ_2)^2,

where q_i^(k) = 1 - p_i^(k). Setting the first derivatives of Q(θ | θ^(k)) equal to zero we obtain

p^(k+1) = (1/n) Σ_{i=1}^n p_i^(k)
µ_1^(k+1) = Σ p_i^(k) Y_i / Σ p_i^(k)
µ_2^(k+1) = Σ q_i^(k) Y_i / Σ q_i^(k)
(σ_1^(k+1))^2 = Σ p_i^(k) (Y_i - µ_1^(k+1))^2 / Σ p_i^(k)
(σ_2^(k+1))^2 = Σ q_i^(k) (Y_i - µ_2^(k+1))^2 / Σ q_i^(k)

Starting values: p^(0) = 0.4, µ_1^(0) = 40, µ_2^(0) = 90, σ_1^2(0) = σ_2^2(0) = 100.

Iterations: [table of iterates k, p, µ_1, µ_2, σ_1^2, σ_2^2 not reproduced]

Implementation in R:

p <- c(0.4, 40, 90, 100, 100)

emstep <- function(Y, p) {
  #E-step: conditional expectations of the group indicators Z_i
  EZ <- p[1]*dnorm(Y, p[2], sqrt(p[4])) /
        (p[1]*dnorm(Y, p[2], sqrt(p[4])) + (1 - p[1])*dnorm(Y, p[3], sqrt(p[5])))
  #M-step: weighted means and variances
  p[1] <- mean(EZ)
  p[2] <- sum(EZ*Y)/sum(EZ)
  p[3] <- sum((1 - EZ)*Y)/sum(1 - EZ)
  p[4] <- sum(EZ*(Y - p[2])^2)/sum(EZ)
  p[5] <- sum((1 - EZ)*(Y - p[3])^2)/sum(1 - EZ)
  p  # return the updated parameter vector
}

emiteration <- function(Y, p, n = 10) {
  for (i in (1:n)) {
    p <- emstep(Y, p)
  }
  p
}

p <- c(0.4, 40, 90, 100, 100)
p <- emiteration(Y, p, 10)
p <- emstep(Y, p)

[Figure: histogram of the waiting times between eruptions (min) with the fitted mixture density]
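R ships the Old Faithful measurements as the built-in data frame faithful, so the code above can be run directly on faithful$waiting; a usage sketch (the number of iterations is our choice):

# Fit the two-component mixture to the built-in Old Faithful waiting times.
Y <- faithful$waiting            # 272 waiting times (min)
p <- c(0.4, 40, 90, 100, 100)    # starting values as above
p <- emiteration(Y, p, 50)
round(p, 3)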

Example: Bivariate distribution

Data:
Waiting times between eruptions (in min) for the Old Faithful geyser
Eruption times (in min) for the Old Faithful geyser

Example: EM algorithm for bivariate Gaussian mixtures (Java applet):
http://dowww.epfl.ch/mantra/tutorial/english/gaussian/html

Convergence of the EM algorithm

Example: Bivariate t-distribution

Suppose that Y_i = (Y_{i1}, Y_{i2})^T, i = 1, ..., 20 + m, are independently sampled from a bivariate t-distribution with likelihood function

L_n(µ | Y) = ∏_i ( 1 + (Y_{i1} - µ_1)^2 + (Y_{i2} - µ_2)^2 )^{-3/2}.

Furthermore, suppose that only the first 20 values are observed.

Convergence of the EM algorithm depends on the amount of missing data: the more data are missing and have to be estimated, the slower the EM algorithm converges. Here

µ_j^(k+1) = ( Σ_{i=1}^{20} Y_{ij} + m µ_j^(k) ) / (20 + m),   j = 1, 2,

is a weighted mean which puts strong weight on the previous value µ_j^(k) if the proportion of missing data is large.

[Figure: contour plots of the log-likelihood function l_n(µ | y) over (µ_1, µ_2) with the EM iteration paths. Caption: Convergence of the EM algorithm for three values of m.]
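Assuming the weighted-mean update above, the error of µ^(k) shrinks by the factor m/(20 + m) at every step, which makes the slowdown for large m explicit; a scalar toy sketch (our construction, with made-up numbers):

# One coordinate of the update: mu <- (s + m*mu)/(20 + m), where s is the
# sum of the 20 observed values. The fixed point is s/20 and the error
# contracts by the factor m/(20 + m) per iteration.
em_path <- function(s, m, mu0 = 0, steps = 10) {
  mu <- mu0
  for (k in 1:steps) mu <- (s + m*mu)/(20 + m)
  mu
}
s <- 20*1.5   # twenty observed values with mean 1.5
sapply(c(2, 20, 200), function(m) em_path(s, m))
# the larger m, the further the iterate still is from 1.5 after 10 steps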
