Learning gene regulatory networks Statistical methods for haplotype inference Part I

Statistical methods for haplotype inference, Part I
Lecture 3, May 21st, 2013
GENOME 541, Spring 2013
Su-In Lee, GS & CSE, UW, suinlee@uw.edu

Learning gene regulatory networks
- Input: measurements of mRNA levels of all genes from microarray or RNA sequencing, across samples (e.g. 200 patients with lung cancer)
- Goal: reconstruct the gene regulatory network underlying genome-wide gene expression
- Method: probabilistic models to represent the regulatory network

Unknown structure, complete data
- Network structure is not specified
- Learner needs to estimate both structure and parameters
- Data does not contain missing values
- Data over (E, B, A): <H,L,L>, <H,L,H>, <L,L,H>, <L,H,H>, ..., <L,H,H>
- Before learning, every entry of the table P(A | E, B) is unknown ("?"). The learner returns a structure over E, B and A together with an estimated table:

  E  B   P(A | E, B)
  H  H   .9    .1
  H  L   .7    .3
  L  H   .8    .2
  L  L   .99   .01

Score-based learning
- Define a scoring function that measures how well a certain structure fits the observed data
- Data over (E, B, A): <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>
- Example scores for three candidate structures over E, B and A: Score(G1) = 10, Score(G2) = 1.5, Score(G3) = 0.01
- Search for a structure that maximizes the score

Challenges
- Too large a search space: what is the number of possible structures of n genes? Approximately $3^{n^2/2}$
- Computationally costly
- Heuristic approaches may be trapped in local maxima
- Biologically motivated constraints can alleviate the problems:
  - Module-based approach
  - Only the genes in the candidate regulators list can be parents of other variables

The module networks concept
- Linear CPDs: a module's expression is a weighted sum of regulator expression, e.g. -3 x X3 + 0.5 x X5 = Module 3
- Tree CPDs: a regulation program for the Module 3 genes that branches on whether Activator X3 is expressed (false/true) and whether Repressor X4 is expressed (false/true); target gene expression is induced or repressed accordingly
- [Figure: regulators X1 to X6 assigned to Module 1, Module 2 and Module 3]
- Segal et al., Nat Genet 2003; JMLR 2005

Linear module network
- Candidate regulators $(X_1, \dots, X_N)$: expression levels of genes that have regulatory roles (transcription factors, signal transduction proteins, mRNA processing factors, ...)
- Candidate regulators (features): about 350 genes in yeast, 700 genes in mouse
- [Figure: candidate regulators linked by a regulatory program to a module's targets (PHO3, PHO5, PHO84, RIM15, PHM6, PHO2, PHO4, SPL2, ...)]
- Lee et al., PLoS Genet 2009

Feature selection via regularization
- Assume a linear Gaussian CPD with parameters $w_1, \dots, w_N$: $P(Y \mid x; w) = N\left(\sum_i w_i x_i,\ \varepsilon^2\right)$
- MLE: solve $\max_w\ -\left(\sum_i w_i x_i - Y\right)^2$, i.e. minimize the squared error
- Problem: this objective learns too many regulators

L1 regularization
- Select a subset of regulators. Combinatorial search?
- Effective feature selection algorithm: L1 regularization (LASSO) [Tibshirani, J. Royal. Statist. Soc. B, 1996]
- $\min_w\ \left(\sum_i w_i x_i - Y\right)^2 + C \sum_i |w_i|$: convex optimization!
- Induces sparsity in the solution w (many $w_i$'s set to zero)

Linear module network: iterative procedure
- Learn a regulatory program for each module by L1-regularized optimization: $\min_w\ \left(\sum_i w_i x_i - E_{\text{Targets}}\right)^2 + C \sum_i |w_i|$
- Cluster genes into modules
- Example regulatory program: -3 x P1 + 0.5 x MF1 = Module
- Model as before: $P(Y \mid x; w) = N\left(\sum_i w_i x_i,\ \varepsilon^2\right)$, with about 350 candidate regulators in yeast and 700 in mouse
- Lee et al., PLoS Genet 2009

Summary
- Basic concepts on Bayesian networks
- Probabilistic models of gene regulatory networks
- Learning algorithms
  - Parameter learning
  - Structure learning
- Structure discovery
- Evaluation
- Recent probabilistic approaches to reconstructing the regulatory networks

Haplotype inference (5/21, 5/23)
- Background & motivation
- Problem statement
- Statistical methods for haplotype inference
  - Clark's algorithm (today)
  - Expectation Maximization (EM) algorithm (today)
  - Coalescent-based methods and HMM
- Haplotype inference on sequence data
- Example applications
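The L1-regularized objective from the linear module network slides can be illustrated with a small solver. This is a minimal sketch, not the implementation used by Lee et al.: it assumes the objective is summed over samples, it uses plain coordinate descent with soft-thresholding (a standard way to solve the LASSO, chosen here for brevity), and the function names and toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, C, n_sweeps=100):
    """Minimize sum_s (sum_i w_i * X[s][i] - y[s])**2 + C * sum_i |w_i|.

    X is a list of samples (rows), each a list of regulator expression values.
    Assumes no feature column is all zeros.
    """
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    for _ in range(n_sweeps):
        for j in range(n_features):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for s in range(n_samples):
                pred_without_j = sum(
                    w[k] * X[s][k] for k in range(n_features) if k != j
                )
                rho += X[s][j] * (y[s] - pred_without_j)
                z += X[s][j] ** 2
            # Closed-form minimizer of the objective in coordinate j.
            w[j] = soft_threshold(rho, C / 2.0) / z
    return w


# Hypothetical toy data: the target depends (almost) only on regulator 0.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.1, 1.9, -2.0, -2.0]

w_ols = lasso_coordinate_descent(X, y, C=0.0)    # no penalty: all weights nonzero
w_lasso = lasso_coordinate_descent(X, y, C=1.0)  # penalty zeroes the weak regulators
```

With the penalty switched on, the two regulators that only explain noise receive a weight of exactly zero, which is the sparsity effect the slide refers to.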

Genetic variation
- Single nucleotide polymorphism (SNP)
- Each variant is called an allele; each allele has a frequency
- Example (8 chromosomes, SNP 1 to SNP 3): each of the two alleles of SNP 1 has frequency 4/8
- How about the relationship between alleles of neighboring SNPs? We need to know about haplotypes

Haplotype
- Haplotype: the combination of alleles present in a chromosome
- Each haplotype has a frequency, which is the proportion of chromosomes of that type in the population
- There are $2^N$ possible haplotypes for N biallelic SNPs, but in fact far fewer are seen in human populations
- Example (8 chromosomes, haplotype over SNPs 1 to 3): four distinct haplotypes are observed, with frequencies 3/8, 2/8, 2/8 and 1/8

History of two neighboring alleles
- Alleles that exist today arose through ancient mutation events
- One allele arose first, and then the other
- [Figure: chromosomes before and after each of the two mutations]

Recombination can create more haplotypes
- Without recombination: no recombination (or 2n recombination events)
- With recombination: 2n+1 recombination events produce a recombinant haplotype

Haplotype: what determines haplotype frequencies?
- Recombination rate (r) between neighboring alleles in the population
  - r is different for different regions in the genome
- Linkage disequilibrium (LD)
  - Non-random association of alleles at two or more loci, not necessarily on the same chromosome

How can we measure haplotypes?
- Haplotypes can be generated through laboratory-based experimental methods
  - X chromosome in males
  - Sperm typing
  - Hybrid cell lines
  - Other molecular techniques
- Computational approaches
  - Input: genotype data from individuals in a population
  - Output: haplotypes of each individual in the population

Haplotype inference problem
- Sequence and SNP array data generally take the form of unphased genotypes
- [Figure: unphased genotypes across markers]

Motivation
- Enormous amounts of genotype data are being generated
  - Inexpensive genome-wide SNP microarrays
  - Whole-genome and whole-exome sequencing tools
- Determination of haplotype phase is increasingly important
  - Characterizing the relationship between genetic variation and disease susceptibility
  - Imputing low-frequency variants

Why useful in GWAS?
- In a typical short chromosome segment, there are only a few distinct haplotypes
- Carefully selected markers can determine the status of others
- We can test for association of untyped SNPs
- [Figure: five haplotypes over SNPs S1 to SN, with frequencies 30%, 20%, 20%, 20% and 10%, showing the different alleles of each SNP and a marker (tag SNP)]

Example
- Holm et al., Nat Genet (2011)
- Used the inferred haplotypes for accurate imputation of a putative rare causal variant in other individuals, to obtain a stronger association signal

Example
- Sick sinus syndrome (SSS): characterized by slow heart rate, sinus arrest and/or failure to increase heart rate with exercise
- Genome-wide association scan of 7.2M SNPs with 792 SSS cases and 37,592 controls
- The association analysis yielded association with several correlated SNPs in and near MYH6 and MYH7 (never before associated with SSS)

Typical genotype data
- Two alleles for each individual for each marker
- Chromosome origin for each allele is unknown
- Multiple haplotype pairs can fit the observed genotype
- [Figure: an observed genotype at markers 1 to 3 and the possible haplotype pairs (states) consistent with it]

Use information on relatives?
- Family information can help determine phase at many markers
- Can you propose examples?
- [Figure: a genotype shown with the maternal and paternal genotypes, from which the haplotype pair is determined]
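The idea that parental genotypes can fix phase can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not code from the lecture: genotypes are coded per marker as the number of copies of one allele (0, 1 or 2), the trio is assumed to be Mendelian-consistent with no genotyping errors, and the function name `phase_child_with_parents` is hypothetical.

```python
def phase_child_with_parents(child, mother, father):
    """Return (maternal_haplotype, paternal_haplotype) for the child.

    Each genotype is a list of allele counts (0, 1 or 2) per marker.
    Haplotype entries are 0 or 1, or None where phase cannot be determined.
    """
    maternal, paternal = [], []
    for c, m, f in zip(child, mother, father):
        if c == 0:            # homozygous child: phase is trivial
            mat, pat = 0, 0
        elif c == 2:
            mat, pat = 1, 1
        elif m == 0:          # heterozygous child, mother can only transmit 0
            mat, pat = 0, 1
        elif m == 2:          # mother can only transmit 1
            mat, pat = 1, 0
        elif f == 0:          # father can only transmit 0
            mat, pat = 1, 0
        elif f == 2:          # father can only transmit 1
            mat, pat = 0, 1
        else:                 # child and both parents heterozygous: ambiguous
            mat, pat = None, None
        maternal.append(mat)
        paternal.append(pat)
    return maternal, paternal


# Hypothetical trio: the child is heterozygous at all three markers.
maternal_hap, paternal_hap = phase_child_with_parents(
    child=[1, 1, 1], mother=[0, 2, 1], father=[1, 1, 1]
)
```

In this toy trio the first two markers are phased because the mother is homozygous there, while the third marker stays unresolved because all three family members are heterozygous.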

Example: inferring haplotypes
- Still, many ambiguities might not be resolved
- The problem is more serious with larger numbers of markers
- [Figure: a genotype with maternal and paternal genotypes for which a unique haplotype pair cannot be determined]

What if there are no relatives?
- Rely on linkage disequilibrium (LD)
  - LD: non-random association of variants at different sites in the genome
- Assume that the population consists of a small number of distinct haplotypes
- Problem: determine haplotypes without parental genotypes

Haplotype reconstruction
- Also called phasing, haplotype inference or haplotyping
- Data: genotypes on N markers from M individuals
- Goals
  - Frequency estimation of all possible haplotypes
  - Haplotype reconstruction for individuals
  - How many out of all possible haplotypes are plausible in a population?

Statistical methods for haplotype inference
- Let's focus on the methods that are most widely used or historically important
- Browning and Browning, Nat Rev Genet 2011

Clark's haplotyping algorithm
- Clark (1990) Mol Biol Evol 7:111-122
- One of the first published haplotyping algorithms
- Computationally efficient
- Very fast and widely used in the 1990s
- More accurate methods are now available

Clark's haplotyping algorithm: first step
- Find unambiguous individuals
- Initialize a list of known haplotypes

Unambiguous vs. ambiguous
- Haplotypes for 2 SNPs (alleles: A/a, B/b). What kinds of genotypes will these have?
- Unambiguous individuals
  - Homozygous at every locus: both haplotypes are identical
  - Heterozygous at just one locus: the two haplotypes differ only at that locus

Clark's haplotyping algorithm
- Find unambiguous individuals
  - Initialize a list of known haplotypes
- Resolve ambiguous individuals
  - If possible, use two haplotypes from the list
  - Otherwise, use one known haplotype and augment the list
- If unphased individuals remain
  - Assign phase randomly to one individual
  - Augment the haplotype list and continue from the previous step

Parsimonious phasing example
- Notation (more compact representation)
  - 0/1: homozygous at a locus (00, 11)
  - h: heterozygous at a locus (01)

  Genotype       Haplotype 1    Haplotype 2
  1 0 1 0 0 h    1 0 1 0 0 0    1 0 1 0 0 1
  h 0 1 h 0 0    1 0 1 0 0 0    0 0 1 1 0 0
  0 h h 1 h 0    0 0 1 1 0 0    0 1 0 1 1 0

Parsimony algorithm
- Pros
  - Very fast
  - Can deal with very long sequences
- Cons
  - Cannot start if there are no homozygotes or single-SNP heterozygotes in the data
  - Some haplotypes may remain unresolved
  - Outcome depends on the order in which the lists are traversed
  - Naive, not very accurate (no modeling)
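The steps of Clark's procedure can be sketched as follows, using the compact 0/1/h notation from the parsimonious phasing example. This is a simplified sketch, not Clark's published implementation: the function names are hypothetical, genotypes are processed in input order, and the final fallback (assigning phase randomly to one remaining individual) is deliberately left out, so such individuals simply stay unresolved.

```python
def compatible(hap, genotype):
    """A haplotype fits a genotype if it matches every homozygous locus."""
    return all(g == "h" or g == a for a, g in zip(hap, genotype))


def complement(hap, genotype):
    """The partner haplotype: opposite allele at 'h' loci, same allele elsewhere."""
    return "".join(
        ("1" if a == "0" else "0") if g == "h" else g
        for a, g in zip(hap, genotype)
    )


def clark_phase(genotypes):
    known = []    # ordered list of known haplotypes
    phased = {}   # individual index -> (haplotype 1, haplotype 2)

    # Step 1: unambiguous individuals (at most one heterozygous locus).
    for i, g in enumerate(genotypes):
        if g.count("h") <= 1:
            pair = (g.replace("h", "0"), g.replace("h", "1"))
            phased[i] = pair
            for h in pair:
                if h not in known:
                    known.append(h)

    # Step 2: resolve ambiguous individuals with known haplotypes.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in phased:
                continue
            candidates = [h for h in known if compatible(h, g)]
            if not candidates:
                continue
            # Prefer a solution in which both haplotypes are already known.
            both_known = [h for h in candidates if complement(h, g) in known]
            h1 = (both_known or candidates)[0]
            h2 = complement(h1, g)
            phased[i] = (h1, h2)
            if h2 not in known:
                known.append(h2)   # augment the list
            progress = True
    return phased, known


# The three genotypes of the parsimonious phasing example.
phased, known = clark_phase(["10100h", "h01h00", "0hh1h0"])
```

On these three genotypes the sketch resolves the individuals in the order shown on the slide: the first is unambiguous, the second reuses 101000, and the third reuses 001100.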

The EM haplotyping algorithm
- Excoffier and Slatkin, Mol Biol Evol (1995); Qin et al., Am J Hum Genet (2002); Excoffier and Lischer, Molecular Ecology Resources (2010)
- Why EM for haplotyping? EM is a method for MLE with hidden variables
- What are the hidden variables and parameters?
  - Hidden variables: haplotype state of each individual (for individual n, z = 0 or z = 1)
  - Parameters: haplotype frequencies $p_{Ab}, p_{aB}, p_{AB}, p_{ab}$

Assume that we know haplotype frequencies
- For example, if $P_{AB} = 0.3$, $P_{ab} = 0.3$, $P_{Ab} = 0.3$, $P_{aB} = 0.1$
- Probability of the first outcome: $2 P_{Ab} P_{aB} = 0.06$
- Probability of the second outcome: $2 P_{AB} P_{ab} = 0.18$

Conditional probabilities are
- Conditional probability of the first outcome: $2 P_{Ab} P_{aB} / (2 P_{Ab} P_{aB} + 2 P_{AB} P_{ab}) = 0.25$
- Conditional probability of the second outcome: $2 P_{AB} P_{ab} / (2 P_{Ab} P_{aB} + 2 P_{AB} P_{ab}) = 0.75$

Assume that we know the haplotype state of each individual
- Computing haplotype frequencies is straightforward
- [Figure: phased haplotypes of individuals 1 to 4, from which $p_{AB}$, $p_{aB}$, $p_{Ab}$ and $p_{ab}$ are obtained by counting]
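The conditional probabilities for a double heterozygote can be checked numerically. The snippet below is only a small illustration of the arithmetic on the slide; the function name is hypothetical, and the frequencies are the ones used in the example ($P_{AB} = 0.3$, $P_{ab} = 0.3$, $P_{Ab} = 0.3$, $P_{aB} = 0.1$).

```python
def double_het_posteriors(p_AB, p_Ab, p_aB, p_ab):
    """Posterior probability of each phase for an individual heterozygous at both SNPs.

    First outcome:  haplotype pair Ab / aB
    Second outcome: haplotype pair AB / ab
    """
    first = 2 * p_Ab * p_aB    # probability of drawing Ab and aB in either order
    second = 2 * p_AB * p_ab   # probability of drawing AB and ab in either order
    total = first + second     # probability of the observed unphased genotype
    return first / total, second / total


post_first, post_second = double_het_posteriors(
    p_AB=0.3, p_Ab=0.3, p_aB=0.1, p_ab=0.3
)
```

These two numbers are exactly the fractional counts that the E-step assigns to the two possible phasings of this individual.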

EM as Chicken vs Egg
- If we know the haplotype frequencies p's (parameters), we can estimate the haplotype states of individuals z's (hidden variables)
- If we know the haplotype states of individuals z's (hidden variables), we can estimate the haplotype frequencies p's (parameters)
- BUT we know neither; iterate
  - Expectation step: estimate z's, given haplotype frequencies p's
  - Maximization step: estimate p's, given the haplotype states of individuals z's
- Overall, a clever hill-climbing strategy

Phasing by EM
- EM: method for maximum likelihood parameter inference with hidden variables
- Start from a guess of the parameters (haplotype frequencies p's), then alternate:
  - E: find expected values of the hidden variables (inferring the haplotype state z of each individual)
  - M: maximize the likelihood over the parameters (estimating haplotype frequencies)

EM algorithm for haplotyping
1. "Guesstimate" haplotype frequencies
2. Use current frequency estimates to replace ambiguous genotypes with fractional counts of phased genotypes
3. Estimate the frequency of each haplotype by counting
4. Repeat steps 2 and 3 until frequencies are stable

Phasing by EM: worked example

Data: three unphased genotypes, each with two possible haplotype pairs

  Genotype     Pair 1           Pair 2
  1 0 h h 1    10001 / 10111    10011 / 10101
  h 0 0 1 h    00010 / 10011    00011 / 10010
  1 h h 1 1    10011 / 11111    10111 / 11011

Initial frequencies: count each haplotype over all 12 listed haplotypes

  Haplotype    Initial frequency
  0 0 0 1 0    1/12
  0 0 0 1 1    1/12
  1 0 0 0 1    1/12
  1 0 0 1 0    1/12
  1 0 0 1 1    3/12
  1 0 1 0 1    1/12
  1 0 1 1 1    2/12
  1 1 0 1 1    1/12
  1 1 1 1 1    1/12

Expectation: probability of each haplotype pair given the current frequencies

  Genotype     Pair 1    Pair 2
  1 0 h h 1    0.4       0.6
  h 0 0 1 h    0.75      0.25
  1 h h 1 1    0.6       0.4

Maximization: re-estimate frequencies from the fractional counts (6 chromosomes in total)

  Haplotype    Initial    Updated
  0 0 0 1 0    1/12       .125
  0 0 0 1 1    1/12       .042
  1 0 0 0 1    1/12       .067
  1 0 0 1 0    1/12       .042
  1 0 0 1 1    3/12       .325
  1 0 1 0 1    1/12       .1
  1 0 1 1 1    2/12       .133
  1 1 0 1 1    1/12       .067
  1 1 1 1 1    1/12       .1
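The four EM steps can be written compactly and run on the three genotypes of the worked example (10hh1, h001h, 1hh11). This is a minimal sketch under stated assumptions, not the code behind the slides: function names are hypothetical, the starting frequencies are obtained by counting every haplotype that appears in any possible pair (which reproduces the 1/12, 2/12 and 3/12 values of the example), and a fixed number of iterations replaces a proper convergence test.

```python
from collections import defaultdict
from itertools import product


def phase_configs(genotype):
    """All unordered haplotype pairs compatible with a 0/1/h genotype string."""
    het = [i for i, g in enumerate(genotype) if g == "h"]
    if not het:
        return [(genotype, genotype)]
    configs = []
    # Fix the first heterozygous locus of haplotype 1 to '0' so that each
    # unordered pair is listed once (2**(n-1) pairs for n heterozygous loci).
    for rest in product("01", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, a in zip(het, ("0",) + rest):
            h1[i] = a
            h2[i] = "1" if a == "0" else "0"
        configs.append(("".join(h1), "".join(h2)))
    return configs


def initial_freq(configs_per_ind):
    """Step 1: guess frequencies by counting haplotypes over all listed pairs."""
    counts = defaultdict(float)
    for configs in configs_per_ind:
        for h1, h2 in configs:
            counts[h1] += 1
            counts[h2] += 1
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}


def e_step(configs_per_ind, freq):
    """Step 2: posterior probability of each pair for each individual."""
    posteriors = []
    for configs in configs_per_ind:
        weights = [(1 if h1 == h2 else 2) * freq[h1] * freq[h2]
                   for h1, h2 in configs]
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors


def m_step(configs_per_ind, posteriors):
    """Step 3: haplotype frequencies from fractional counts."""
    counts = defaultdict(float)
    for configs, post in zip(configs_per_ind, posteriors):
        for (h1, h2), p in zip(configs, post):
            counts[h1] += p
            counts[h2] += p
    n_chromosomes = 2 * len(configs_per_ind)
    return {h: c / n_chromosomes for h, c in counts.items()}


def em_haplotypes(genotypes, n_iter=50):
    """Step 4: repeat the E and M steps a fixed number of times."""
    configs = [phase_configs(g) for g in genotypes]
    freq = initial_freq(configs)
    for _ in range(n_iter):
        freq = m_step(configs, e_step(configs, freq))
    return freq, e_step(configs, freq)


genotypes = ["10hh1", "h001h", "1hh11"]
configs = [phase_configs(g) for g in genotypes]
first_posteriors = e_step(configs, initial_freq(configs))
first_update = m_step(configs, first_posteriors)
final_freq, final_posteriors = em_haplotypes(genotypes)
```

The first E-step reproduces the 0.4/0.6, 0.75/0.25 and 0.6/0.4 split of the example, and after iterating the estimate for haplotype 10011 approaches 1/2. Because the result can depend on the starting frequencies, a practical run would restart from several different initial guesses.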

Phasing by EM: result after convergence

  Genotype     Pair 1 (10001/10111, 00010/10011, 10011/11111)    Pair 2 (10011/10101, 00011/10010, 10111/11011)
  1 0 h h 1    0                                                 1
  h 0 0 1 h    1                                                 0
  1 h h 1 1    1                                                 0

  Haplotype    Frequency
  0 0 0 1 0    1/6
  0 0 0 1 1    0
  1 0 0 0 1    0
  1 0 0 1 0    0
  1 0 0 1 1    1/2
  1 0 1 0 1    1/6
  1 0 1 1 1    0
  1 1 0 1 1    0
  1 1 1 1 1    1/6

Computational cost (for SNPs)
- Consider sets of m unphased genotypes (markers 1..m)
- If markers are bi-allelic:
  - $2^m$ possible haplotypes
  - $2^{m-1}(2^m + 1)$ possible haplotype pairs
  - $3^m$ distinct observed genotypes
  - $2^{n-1}$ reconstructions for n heterozygous loci
- For example, if m = 10 (and n = 10): 1024 haplotypes, 524,800 haplotype pairs, 59,049 genotypes and 512 reconstructions

EM algorithm
- Pros
  - More accurate than Clark's method
  - Fully or partially phased individuals contribute most of the information
- Cons
  - Estimate depends on the starting point: need to run multiple times from different starting points
  - Implementation may become computationally expensive: cost grows rapidly with the number of markers
    - For each individual, the number of possible haplotypes is $2^m$, where m is the number of markers
    - Typically run for short sequences with < 25 SNPs
  - No modeling on haplotypes
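The counts on the computational cost slide follow directly from the formulas and can be reproduced with exact integer arithmetic. The helper below is a small illustrative sketch; its name and return layout are hypothetical.

```python
def phasing_search_space(m, n_het):
    """Sizes of the search space for m bi-allelic markers.

    n_het is the number of heterozygous loci in one individual's genotype.
    """
    n_haplotypes = 2 ** m
    # Unordered pairs of haplotypes, including pairs of identical haplotypes.
    n_haplotype_pairs = 2 ** (m - 1) * (2 ** m + 1)
    n_genotypes = 3 ** m
    # A genotype with no heterozygous locus has exactly one reconstruction.
    n_reconstructions = 2 ** (n_het - 1) if n_het > 0 else 1
    return n_haplotypes, n_haplotype_pairs, n_genotypes, n_reconstructions


counts_m10 = phasing_search_space(m=10, n_het=10)
counts_m25 = phasing_search_space(m=25, n_het=25)
```

For m = 10 this gives the figures quoted on the slide; for m = 25 there are already 33,554,432 possible haplotypes, which illustrates why the EM approach is typically limited to short stretches of markers.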