Phasing via the Expectation Maximization (EM) Algorithm

Similar documents
The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

Introduction to Linkage Disequilibrium

2. Map genetic distance between markers

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

Lab 12. Linkage Disequilibrium. November 28, 2012

Chapter 6 Linkage Disequilibrium & Gene Mapping (Recombination)

Affected Sibling Pairs. Biostatistics 666

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

An introduction to PRISM and its applications

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

AUTHORIZATION TO LEND AND REPRODUCE THE THESIS. Date Jong Wha Joanne Joo, Author

Computational statistics

Learning gene regulatory networks Statistical methods for haplotype inference Part I

The Quantitative TDT

Click-Through Rate prediction: TOP-5 solution for the Avazu contest

Calculation of IBD probabilities

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M.

STAT 536: Genetic Statistics

p(d g A,g B )p(g B ), g B

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Parameter Estimation in the Spatio-Temporal Mixed Effects Model Analysis of Massive Spatio-Temporal Data Sets

Calculation of IBD probabilities

Statistical learning. Chapter 20, Sections 1 4 1

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Haplotyping. Biostatistics 666

EM algorithm and applications Lecture #9

Finite Singular Multivariate Gaussian Mixture

Last lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda

Genetic Association Studies in the Presence of Population Structure and Admixture

Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

The universal validity of the possible triangle constraint for Affected-Sib-Pairs

Lecture 6: Gaussian Mixture Models (GMM)

A new algorithm for deriving optimal designs

Problems for 3505 (2011)

Association studies and regression

Genotype Imputation and Haplotype Inference for Genome-wide Association Studies

Statistical learning. Chapter 20, Sections 1 3 1

Population Genetics: a tutorial

MAXIMUM LIKELIHOOD ESTIMATION IN A MULTINOMIAL MIXTURE MODEL. Charles E. McCulloch Cornell University, Ithaca, N. Y.

Closed-form sampling formulas for the coalescent with recombination

Linear Regression (1/1/17)

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Computing the MLE and the EM Algorithm

Expectation maximization tutorial

Expectation Maximization, and Learning from Partly Unobserved Data (part 2)

Stage-structured Populations

Math 152. Rumbos Fall Solutions to Exam #2

Haplotype-based variant detection from short-read sequencing

Maximum likelihood in log-linear models

Frequency Spectra and Inference in Population Genetics

EXERCISES FOR CHAPTER 3. Exercise 3.2. Why is the random mating theorem so important?

Computational Systems Biology: Biology X

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

QTL model selection: key players

theta H H H H H H H H H H H K K K K K K K K K K centimorgans

EM algorithm. Rather than jumping into the details of the particular EM algorithm, we ll look at a simpler example to get the idea of how it works

Learning with Probabilities

Discussion of Dimension Reduction... by Dennis Cook

Mathematical models in population genetics II

THE GENETICS OF CERTAIN COMMON VARIATIONS IN COLEUS 1

Statistical learning. Chapter 20, Sections 1 3 1

Lecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016

The genomes of recombinant inbred lines

Maximum Likelihood Estimation in Latent Class Models for Contingency Table Data

A brief introduction to Conditional Random Fields

CSE 150. Assignment 6 Summer Maximum likelihood estimation. Out: Thu Jul 14 Due: Tue Jul 19

Linkage and Chromosome Mapping

The implications of neutral evolution for neutral ecology. Daniel Lawson Bioinformatics and Statistics Scotland Macaulay Institute, Aberdeen

Introduction to population genetics & evolution

Testing for Homogeneity in Genetic Linkage Analysis

URN MODELS: the Ewens Sampling Lemma

Modeling IBD for Pairs of Relatives. Biostatistics 666 Lecture 17

Estimation of Parameters in Random. Effect Models with Incidence Matrix. Uncertainty

Empirical Risk Minimization

Ph.D. Qualifying Exam Friday Saturday, January 3 4, 2014

1.5 MM and EM Algorithms

ML Testing (Likelihood Ratio Testing) for non-gaussian models

Lesson 4: Understanding Genetics

KEY: Chapter 9 Genetics of Animal Breeding.

Algorithmic approaches to fitting ERG models

QTL Mapping I: Overview and using Inbred Lines

CSCE 471/871 Lecture 3: Markov Chains and

Quantitative trait evolution with mutations of large effect

CS1820 Notes. hgupta1, kjline, smechery. April 3-April 5. output: plausible Ancestral Recombination Graph (ARG)

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

STAT 536: Genetic Statistics

Clustering. Léon Bottou COS 424 3/4/2010. NEC Labs America

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Lecture 2. Basic Population and Quantitative Genetics

CSC411 Fall 2018 Homework 5

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

Stephen Scott.

Lecture 4: Probabilistic Learning

CSE446: Clustering and EM Spring 2017

CS446: Machine Learning Fall Final Exam. December 6 th, 2016

EM-algorithm for motif discovery

Lecture 1 Hardy-Weinberg equilibrium and key forces affecting gene frequency

Goodness of Fit Goodness of fit - 2 classes

Transcription:

Computing Haplotype Frequencies and Haplotype Phasing via the Expectation Maximization (EM) Algorithm Department of Computer Science Brown University, Providence sorin@cs.brown.edu September 14, 2010

Outline 1 Outline 2 Problem definition 3 The Input The number of haplotypes in the input Computing θ (t+1) 4

EM by one Example Problem definition Problem: Consider two loci with two allele 0 and 1 at each locus. Given: (We observe) the genotypes of the individuals at both loci. Find: The estimate at the haplotype frequencies.

Solution Outline The Input The number of haplotypes in the input Computing θ (t+1) There are a total of four possible haplotypes, 01, 10, 11 at the two loci. Let us denote their frequencies by θ, θ 01, θ 10, θ 11. Suppose that we have computed already θ (t), θ(t) 01, θ(t) 10, θ(t) 11. We want to compute θ (t+1) as a function of θ (t), θ(t) 01, θ(t) 10, θ(t) 11.

The Input The number of haplotypes in the input Computing θ (t+1) The Genotype Sample: several types A, B, C, D, E, F There are n A genotypes or individuals of type 22 - we denote Y A the set of such genotypes There are n B genotypes or individuals of type 02 There are n C genotypes or individuals of type 20 There are n D genotypes or individuals of type There are n E genotypes or individuals of type 11

The Input The number of haplotypes in the input Computing θ (t+1) The fraction of the genotypes in each category that contains the haplotype (A) For the A group of n A individuals the possible haplotypes show as follows in explanations of the genotypes: 01 11 or 10 (the fractions represent the separation of mother-father) chromosomes. P(Y A ) = 2θ (t) θ(t) 11 + 2θ(t) 01 θ(t) 10 P( 11 Y A) = 2θ (t) θ(t) 11 2θ (t) θ(t) 11 +2θ(t) 01 θ(t) 10

The Input The number of haplotypes in the input Computing θ (t+1) The fraction of the genotypes in each category that contains the haplotype (continued) For group B one haplotype is and the other one is 01 For group C one haplotype is and the other one is 10 For group D both haplotypes are For group E both haplotypes are 11

Computing θ (t+1) Outline The Input The number of haplotypes in the input Computing θ (t+1) Therefore the total expected number of haplotypes are: n (t+1) = n A P( 11 Y A) + n B + n C + 2n D so we update θ (t+1) = n(t+1) 2n where n = n A + n B + n C + n D + n E

The EM algorithm is an iterative method to compute successive sets of haplotype frequencies p 1, p 2,..., p T starting with some initial arbitrary values p (0) 1, p(0) 2,..., p(0) T Those initial values are used as used as if they were the unknown true frequencies to estimate the explanation frequencies P(h k h l ) (0). This is the Expectation step. These expected explanation frequencies are used in turn to estimate haplotype frequencies at the next iteration p (1) 1, p(1) 2,..., p(1) T. This is the Maximization step.... and so on until convergence is reached (i.e., when the changes in haplotype frequency in consecutive iterations are less than some small value (ɛ).

EM Algorithm initialization 1 All explanations are equally likely P j (h k h l ) (0) = 1 c j, 1 j m where m is the total number of genotypes in the input; and n 1, n 2,..., n m are the counts for each genotype type. 2 All haplotypes are equally frequent in the sample. 3 Complete Linkage Equilibrium: Haplotype frequencies = the product of single locus allele frequencies 4 Initial haplotype frequencies are picked at random.

The E Step The Expectation step at the tth iteration consists of using the haplotype frequencies in the previous iteration to calculate the probability of resolving each genotype into different possible explanations: P j = c j i=1 P(explanation i) = c j i=1 P(h ikh il ) if k = l then P(h k h l ) = pk 2 if k l then P(h k h l ) = 2p k p l where a 1 is a constant term and p ik and p il are the population frequencies of the corresponding haplotypes.

The E Step (continued) The likelihood of the haplotype frequencies given the genotype counts n 1, n 2,..., n m is m L(p 1,..., p T ) = a 1 ( P(h ik h il )) n j where T i=1 = 1, and(h ikh il ), 1 i c j are the set of explanations of the jth genotype that occurs n j times in the input. Let P (t) j = c j i=1 P(h ikh il ) (t) j=1 c j i=1

The E Step formula The E Step formula is: P j (h k h l ) (t) = P(h kh l ) (t) cj i=1 P(t) j

The M Step Haplotype frequencies are then computed for each Maximization step: for 1 r T p (t+1) r = 1 2 c m j δ ir P j (h ik h il ) (t) j=1 i=1 where δ it is an indicator variable equal to the number of times haplotype t is present in explanation i; and this number can be 0, 1 or 2.