Learning gene regulatory networks. Statistical methods for haplotype inference, Part I

Learning gene regulatory networks
Statistical methods for haplotype inference, Part I
Lecture 3, May 21, 2013
GENOME 541, Spring 2013
Su-In Lee, GS & CSE, UW
suinlee@uw.edu

Input: Measurements of mRNA levels of all genes, from microarray or RNA sequencing, across samples (e.g. 200 patients with lung cancer)
Goal: Reconstruct the gene regulatory network underlying genome-wide gene expression
Method: Probabilistic models to represent the regulatory network
[Figure: network over expression variables e1, ..., e6, ..., ep]

Unknown structure, complete data
- Network structure is not specified
- Learner needs to estimate both structure and parameters
- Data does not contain missing values
[Figure: a learner takes complete data over (E, B, A), e.g. <H,L,L>, <H,L,H>, <L,L,H>, <L,H,H>, ..., <L,H,H>, and returns both the graph over E, B and A and the table P(A | E,B). Before learning every table entry is unknown ("?"). After learning:]
E  B  P(A | E,B)
H  H  .9  .1
H  L  .7  .3
L  H  .8  .2
L  L  (values missing in the transcription)

Score-based learning
- Define a scoring function that measures how well a certain structure fits the observed data.
- Search for a structure that maximizes the score.
[Figure: data over (E, B, A), e.g. <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>, scored against three candidate structures: Score(G1) = 10, Score(G2) = 1.5, Score(G3) = 0.01]

Challenges
- Too large search space: what is the number of possible structures of n genes? ~3^(n^2/2)
- Computationally costly
- Heuristic approaches may be trapped in local maxima.
Biologically motivated constraints can alleviate the problems
- Module-based approach
- Only the genes in the candidate regulators list can be parents of other variables

The module networks concept (Segal et al., Nat Genet 03, JMLR 05)
[Figure: variables X1 to X6 grouped into Module 1, Module 2 and Module 3. The regulation program of Module 3 has activator X3 and repressor X4 as parents. It can be written with linear CPDs (a weighted sum of regulator expression) or with tree CPDs that branch on whether activator X3 and repressor X4 are expressed (false/true). Panels: activator expression, repressor expression, target gene expression (induced/repressed), Module 3 genes.]

Linear module network (Lee et al., PLoS Genet)
- Candidate regulators (X_1, ..., X_N): expression levels of genes that have regulatory roles (transcription factors, signal transduction proteins, mRNA processing factors, ...)
- Candidate regulators (features): Yeast: 350 genes; Mouse: 700 genes
- Regulatory program: parameters w_1, w_2, ..., w_N weight the regulators x_1, x_2, ..., x_N to predict the module expression Y
- $P(Y \mid x; w) = \mathcal{N}\left(\sum_i w_i x_i, \varepsilon^2\right)$
[Figure: regulatory network linking candidate regulators (gene labels garbled in the transcription) through a regulatory program to the module's targets, which include PHO3, PHO5, PHO84, RIM15, PHM6, PHO2, PHO4 and SPL2]

Feature selection via regularization
- Assume linear Gaussian CPD
- MLE: solve $\max_w \; -\left(\sum_i w_i x_i - Y\right)^2$
- Problem: this objective learns too many regulators

L1 regularization
- Select a subset of regulators. Combinatorial search?
- Effective feature selection algorithm: L1 regularization (LASSO) [Tibshirani, J. Royal. Statist. Soc. B, 1996]
- $\min_w \left(\sum_i w_i x_i - Y\right)^2 + C \sum_i |w_i|$: convex optimization!
- Induces sparsity in the solution w (many w_i's set to zero)
- Candidate regulators (features): Yeast: 350 genes; Mouse: 700 genes
- $P(Y \mid x; w) = \mathcal{N}\left(\sum_i w_i x_i, \varepsilon^2\right)$
[Figure: candidate regulators x_1, x_2, ..., x_N with weights w_1, w_2, ..., w_N predicting Y; regulator gene labels garbled in the transcription]

Linear module network: iterative procedure (Lee et al., PLoS Genet)
- Cluster genes into modules
- Learn a regulatory program for each module, of the form Module = -3 x (one regulator) + 0.5 x (another regulator)
- L1-regularized optimization: $\min_w \left(\sum_i w_i x_i - E_{\text{Targets}}\right)^2 + C \sum_i |w_i|$
[Figure: module whose targets include PHO3, PHO5, PHO84, RIM15, PHM6, PHO2, PHO4 and SPL2; regulator gene labels garbled in the transcription]

Summary
- Basic concepts on Bayesian networks
- Probabilistic models of gene regulatory networks
- Learning algorithms
  - Parameter learning
  - Structure learning
- Structure discovery
- Evaluation
- Recent probabilistic approaches to reconstructing the regulatory networks

Haplotype inference (5/21, 5/23)
- Background & motivation
- Problem statement
- Statistical methods for haplotype inference
  - Clark's algorithm
  - Expectation Maximization (EM) algorithm [Today]
  - Coalescent-based methods and HMM
- Haplotype inference on sequence data
- Example applications
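To make the sparsity claim concrete, here is a minimal sketch of the L1-regularized objective solved by cyclic coordinate descent. The solver choice, the function names, the penalty weight `penalty` and the toy data are all assumptions made for illustration; the slides do not say which optimizer was used.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=50):
    """Minimize sum_n (sum_i w_i x_ni - y_n)^2 + penalty * sum_i |w_i|.

    X is a list of samples, each a list of regulator expression values.
    Hypothetical helper for illustration only.
    """
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    for _ in range(n_sweeps):
        for j in range(n_features):
            rho = 0.0  # correlation of feature j with the residual that ignores j
            z = 0.0    # squared norm of feature j
            for n in range(n_samples):
                pred_without_j = sum(
                    w[k] * X[n][k] for k in range(n_features) if k != j
                )
                rho += X[n][j] * (y[n] - pred_without_j)
                z += X[n][j] ** 2
            # Closed-form minimizer of the objective along coordinate j.
            w[j] = soft_threshold(rho, penalty / 2.0) / z if z > 0 else 0.0
    return w


# Toy data (made up): three candidate regulators, four samples.
# The target follows regulator 0 plus a little noise.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.1, 1.9, -2.0, -2.0]

w_least_squares = lasso_coordinate_descent(X, y, penalty=0.0)
w_lasso = lasso_coordinate_descent(X, y, penalty=1.0)
print(w_least_squares)  # every regulator gets a nonzero weight
print(w_lasso)          # the two noise regulators are set exactly to zero
```

With the penalty switched off, the fit assigns small weights to the two irrelevant regulators. With the penalty on, those weights become exactly zero, which is the feature-selection behaviour the slide describes.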

Genetic variation
- Single nucleotide polymorphism (SNP)
- Each variant is called an allele; each allele has a frequency
[Figure: 8 chromosomes typed at SNP 1, SNP 2 and SNP 3. Each of the two alleles of SNP 1 has frequency 4/8; the allele letters are missing in the transcription.]
Question: How about the relationship between alleles of neighboring SNPs? We need to know about haplotypes.

Haplotype
- A combination of alleles present in a chromosome
- Each haplotype has a frequency, which is the proportion of chromosomes of that type in the population
- There are 2^N possible haplotypes for N bi-allelic SNPs, but in fact far fewer are seen in the human population
[Figure: the 8 chromosomes carry four distinct haplotypes over SNPs 1 to 3, with frequencies 3/8, 2/8, 2/8 and 1/8; the allele strings are missing in the transcription.]

History of two neighboring alleles
- Alleles that exist today arose through ancient mutation events
[Figure: chromosomes before mutation and after mutation]

History of two neighboring alleles
- One allele arose first, and then the other
[Figure: chromosomes before mutation, after the first mutation and after the second mutation]
- Haplotype: combination of alleles present in a chromosome
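As a small illustration of allele and haplotype frequencies, the sketch below counts both from a sample of phased chromosomes. The eight chromosomes are hypothetical, chosen only so that the counts match the frequencies quoted on the slides (4/8 for each SNP 1 allele; 3/8, 2/8, 2/8 and 1/8 for the haplotypes); the actual alleles on the slides are not recoverable.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical sample: 8 chromosomes typed at 3 SNPs, one letter per SNP.
chromosomes = ["ACG", "ACG", "ACG", "ATG", "GCA", "GCA", "GTA", "GTA"]


def allele_frequencies(chromosomes, snp_index):
    """Frequency of each allele at one SNP (position snp_index)."""
    counts = Counter(chrom[snp_index] for chrom in chromosomes)
    return {allele: Fraction(n, len(chromosomes)) for allele, n in counts.items()}


def haplotype_frequencies(chromosomes):
    """Proportion of chromosomes carrying each distinct haplotype."""
    counts = Counter(chromosomes)
    return {hap: Fraction(n, len(chromosomes)) for hap, n in counts.items()}


print(allele_frequencies(chromosomes, 0))
print(haplotype_frequencies(chromosomes))
print(len(set(chromosomes)), "distinct haplotypes out of", 2 ** 3, "possible")
```

Only four of the eight possible three-SNP haplotypes occur in this toy sample, which mirrors the point that far fewer haplotypes are observed than are combinatorially possible.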

Recombination can create more haplotypes
- Without recombination: no recombination (or an even number, 2n, of recombination events)
- With recombination: an odd number (2n+1) of recombination events yields a recombinant haplotype

Haplotype
What determines haplotype frequencies?
- Recombination rate (r) between neighboring alleles in the population
  - r is different for different regions in the genome
- Linkage disequilibrium (LD)
  - Non-random association of alleles at two or more loci, not necessarily on the same chromosome
[Figure: haplotypes over SNP 1, SNP 2 and SNP 3 with frequencies 3/8, 2/8, 2/8 and 1/8; the allele strings are missing in the transcription.]

How can we measure haplotypes?
- Haplotypes can be generated through laboratory-based experimental methods
  - X chromosome in males
  - Sperm typing
  - Hybrid cell lines
  - Other molecular techniques
- Computational approaches
  - Input: genotype data from individuals in a population
  - Output: haplotypes of each individual in the population
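The slides define LD only in words. One standard way to quantify it for two SNPs on the same chromosome is the coefficient D = P(AB) - P(A) P(B), which is zero when the alleles associate at random. The sketch below computes D from phased two-SNP haplotypes; the measure, the function name and the sample data are additions for illustration and do not appear in the slides.

```python
from fractions import Fraction


def ld_coefficient(haplotypes, allele_1, allele_2):
    """D = P(allele_1 and allele_2 together) - P(allele_1) * P(allele_2).

    haplotypes: two-character strings, first character = SNP 1 allele,
    second character = SNP 2 allele.
    """
    n = len(haplotypes)
    p_1 = Fraction(sum(h[0] == allele_1 for h in haplotypes), n)
    p_2 = Fraction(sum(h[1] == allele_2 for h in haplotypes), n)
    p_12 = Fraction(sum(h == allele_1 + allele_2 for h in haplotypes), n)
    return p_12 - p_1 * p_2


# Hypothetical samples of 8 and 4 chromosomes.
associated = ["AB", "AB", "AB", "ab", "ab", "ab", "Ab", "aB"]
independent = ["AB", "Ab", "aB", "ab"]

print(ld_coefficient(associated, "A", "B"))   # positive: A and B co-occur more than by chance
print(ld_coefficient(independent, "A", "B"))  # zero: random association
```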

Haplotype inference problem
- Sequence and SNP array data generally take the form of unphased genotypes
[Figure: unphased genotypes across markers]

Motivation
- Enormous amounts of genotype data are being generated
  - Inexpensive genome-wide SNP microarrays
  - Whole-genome and whole-exome sequencing tools
- Determination of haplotype phase is increasingly important
  - Characterizing the relationship between genetic variation and disease susceptibility
  - Imputing low-frequency variants

Why useful in GWAS?
- In a typical short chromosome segment, there are only a few distinct haplotypes
- Carefully selected markers can determine the status of others
- We can test for association of untyped SNPs
[Figure: Haplotypes 1 to 5 over SNPs S1, S2, S3, S4, S5, ..., SN, with frequencies 30%, 20%, 20%, 20% and 10%; different alleles of each SNP are shown, and one marker is highlighted as the tag SNP]

Example: Holm et al., Nat Genet (2011)
- Used the inferred haplotypes for accurate imputation of a putative rare causal variant in other individuals, to obtain a stronger association signal

Example: Sick sinus syndrome (SSS)
- Characterized by slow heart rate, sinus arrest and/or failure to increase heart rate with exercise
- Genome-wide association scan of 7.2M SNPs with 792 SSS cases and 37,592 controls
- The association analysis yielded association with several correlated SNPs in and near MYH6 and MYH7 (never before associated with SSS)

Outline
- Background & motivation
- Problem statement
- Statistical methods for haplotype inference
  - Clark's algorithm
  - Expectation Maximization (EM) algorithm [Today]
  - Coalescent-based methods and HMM
- Haplotype inference on sequence data
- Example applications

Typical genotype data
- Two alleles for each individual for each marker
- Chromosome origin for each allele is unknown
- Multiple haplotype pairs can fit the observed genotype
[Figure: the observation is an unordered pair of alleles at each of marker 1, marker 2 and marker 3; the possible states are the haplotype pairs consistent with it. Allele letters are missing in the transcription.]

Use information on relatives?
- Family information can help determine phase at many markers
- Can you propose examples?
[Figure: genotype, maternal genotype and paternal genotype at three markers, and the haplotype pair they imply for the child; allele letters are missing in the transcription.]
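The statement that multiple haplotype pairs can fit one observed genotype can be checked by enumeration. The sketch below lists every unordered haplotype pair consistent with an unphased genotype; the representation (one tuple of two alleles per marker) and the function name are assumptions made for this example.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Return all unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one per marker.
    """
    het_loci = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # Try every way of assigning the two alleles at each heterozygous marker.
    for flips in product([False, True], repeat=len(het_loci)):
        hap_1 = [a for a, _ in genotype]
        hap_2 = [b for _, b in genotype]
        for locus, flip in zip(het_loci, flips):
            if flip:
                hap_1[locus], hap_2[locus] = hap_2[locus], hap_1[locus]
        # Sort so that (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted(("".join(hap_1), "".join(hap_2)))))
    return sorted(pairs)


# Hypothetical genotypes with alleles A/a, B/b, C/c.
print(compatible_haplotype_pairs([("A", "A"), ("B", "B")]))              # 1 pair
print(compatible_haplotype_pairs([("A", "a"), ("B", "b")]))              # 2 pairs
print(len(compatible_haplotype_pairs([("A", "a"), ("B", "b"), ("C", "c")])))
```

A genotype that is heterozygous at n markers has 2^(n-1) consistent pairs, so the ambiguity doubles with every additional heterozygous marker.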

Example: inferring haplotypes
- Still, many ambiguities might not be resolved
- Problem more serious with larger numbers of markers
[Figure: genotype, maternal genotype and paternal genotype at three markers, for which a unique haplotype cannot be determined; allele letters are missing in the transcription.]

What if there are no relatives?
- Rely on linkage disequilibrium (LD)
  - LD: non-random association of variants at different sites in the genome
- Assume that the population consists of a small number of distinct haplotypes

Haplotype reconstruction
- Also called phasing, haplotype inference or haplotyping
- Problem: determine haplotypes without parental genotypes
- Data: genotypes on N markers from M individuals
- Goals
  - Frequency estimation of all possible haplotypes
  - Haplotype reconstruction for individuals
  - How many out of all possible haplotypes are plausible in a population?
[Figure: genotype of individual i at marker 1, marker 2 and marker 3]

Statistical methods for haplotype inference
- Let's focus on the methods that are most widely used or historically important
- Browning and Browning, Nat Rev Genet

Outline
- Background & motivation
- Problem statement
- Statistical methods for haplotype inference
  - Clark's algorithm
  - Expectation Maximization (EM) algorithm
  - Coalescent-based methods and HMM
- Haplotype inference on sequence data
- Example applications

Clark's haplotyping algorithm
- Clark (1990) Mol Biol Evol 7
- One of the first published haplotyping algorithms
- Computationally efficient
- Very fast and widely used in the 1990s
- More accurate methods are now available

Clark's haplotyping algorithm
- Find unambiguous individuals
  - What kinds of genotypes will these have?
- Initialize a list of known haplotypes

Unambiguous vs. ambiguous
- Haplotypes for 2 SNPs (alleles: A/a, B/b)
- Unambiguous individuals
  - Homozygous at every locus: a single haplotype, carried twice
  - Heterozygous at just one locus: two haplotypes that differ only at that locus
[Figure: example genotypes and their haplotypes; allele letters are missing in the transcription.]

Clark's haplotyping algorithm
- Find unambiguous individuals
  - Initialize a list of known haplotypes
- Resolve ambiguous individuals
  - If possible, use two haplotypes from the list
  - Otherwise, use one known haplotype and augment the list
- If unphased individuals remain
  - Assign phase randomly to one individual
  - Augment the haplotype list and continue from the previous step

Parsimonious phasing example
- Notation (more compact representation)
  - 0/1: homozygous at a locus (00, 11)
  - h: heterozygous at a locus (01)
[Figure: example genotypes written in 0/1/h notation; the table layout is not recoverable from the transcription.]

Parsimony algorithm
- Pros
  - Very fast
  - Can deal with very long sequences
- Cons
  - Cannot start if there are no homozygotes or single-SNP heterozygotes in the data
  - Some haplotypes may remain unresolved
  - Outcome depends on the order in which lists are traversed
  - Naive, not very accurate (no modeling)

Outline
- Background & motivation
- Problem statement
- Statistical methods for haplotype inference
  - Clark's algorithm
  - Expectation Maximization (EM) algorithm
  - Coalescent-based methods and HMM
- Haplotype inference on sequence data
- Example applications
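The steps of Clark's algorithm can be sketched in a few lines using the compact 0/1/h notation. This is a simplified illustration rather than Clark's published procedure: it omits the final step of assigning phase randomly when individuals remain unphased, and the function names and example genotypes are hypothetical.

```python
def complement(genotype, haplotype):
    """Haplotype that pairs with `haplotype` to give `genotype`, or None if incompatible."""
    other = []
    for g, h in zip(genotype, haplotype):
        if g == "h":                      # heterozygous: partner carries the other allele
            other.append("1" if h == "0" else "0")
        elif g == h:                      # homozygous and consistent
            other.append(g)
        else:                             # homozygous but haplotype disagrees
            return None
    return "".join(other)


def clark_phase(genotypes):
    """Phase genotypes written over '0', '1', 'h'. Returns (phased, known_haplotypes)."""
    known = []    # haplotype list, in order of discovery
    phased = {}   # individual index -> (haplotype, haplotype)

    def add(hap):
        if hap not in known:
            known.append(hap)

    # Step 1: unambiguous individuals (at most one heterozygous locus).
    for i, g in enumerate(genotypes):
        if g.count("h") <= 1:
            phased[i] = (g.replace("h", "0"), g.replace("h", "1"))
            add(phased[i][0])
            add(phased[i][1])

    # Step 2: resolve ambiguous individuals with known haplotypes.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in phased:
                continue
            candidates = [(h, complement(g, h)) for h in known]
            candidates = [(h, c) for h, c in candidates if c is not None]
            if not candidates:
                continue
            # Prefer a pair made of two known haplotypes, else augment the list.
            both_known = [(h, c) for h, c in candidates if c in known]
            h, c = (both_known or candidates)[0]
            phased[i] = (h, c)
            add(c)
            progress = True
    return phased, known


phased, known = clark_phase(["000", "h11", "hh1", "h0h"])
print(phased)
print(known)
```

In this toy run the third individual is resolved with one known haplotype and adds a new one to the list, which then lets the fourth individual be resolved with two known haplotypes. Because candidates are taken in list order, a different ordering of individuals can give a different answer, and a data set with no unambiguous individual (for example only "hh") cannot be started at all, matching the cons listed above.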

The EM haplotyping algorithm
- Excoffier and Slatkin, Mol Biol Evol (1995); Qin et al., Am J Hum Genet (2002); Excoffier and Lischer, Molecular Ecology Resources (2010)
- Why EM for haplotyping? EM is a method for MLE with hidden variables.
- What are the hidden variables and the parameters?
  - Hidden variables: haplotype state of each individual
  - Parameters: haplotype frequencies
[Figure: individual n with haplotype status (hidden variable) z = 0 or z = 1, and haplotype frequencies (parameters) $p_{Ab}$, $p_{aB}$, $p_{AB}$, $p_{ab}$]

Assume that we know haplotype frequencies
- The two outcomes are the two possible phasings of an individual heterozygous at both SNPs: Ab/aB and AB/ab
- For example, if $P_{AB} = 0.3$, $P_{ab} = 0.3$, $P_{Ab} = 0.3$, $P_{aB} = 0.1$
- Probability of first outcome: $2 P_{Ab} P_{aB} = 0.06$
- Probability of second outcome: $2 P_{AB} P_{ab} = 0.18$

Conditional probabilities are
- For example, if $P_{AB} = 0.3$, $P_{ab} = 0.3$, $P_{Ab} = 0.3$, $P_{aB} = 0.1$
- Conditional probability of first outcome: $2 P_{Ab} P_{aB} / (2 P_{Ab} P_{aB} + 2 P_{AB} P_{ab}) = 0.25$
- Conditional probability of second outcome: $2 P_{AB} P_{ab} / (2 P_{Ab} P_{aB} + 2 P_{AB} P_{ab}) = 0.75$

Assume that we know the haplotype state of each individual
- Computing haplotype frequencies is straightforward
[Figure: Individual 1, Individual 2, Individual 3, ... with known haplotype pairs; $p_{AB}$ = ?, $p_{ab}$ = ?, $p_{Ab}$ = ?, $p_{aB}$ = ?]
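The conditional probabilities for an individual heterozygous at both SNPs follow directly from the haplotype frequencies. The sketch below reproduces the slide's arithmetic with the example values P_AB = 0.3, P_ab = 0.3, P_Ab = 0.3 and P_aB = 0.1; the function name and the dictionary layout are assumptions for illustration.

```python
def phase_posteriors(freq):
    """Posterior probability of each phasing of a double heterozygote (AaBb).

    freq: haplotype frequencies keyed by "AB", "Ab", "aB", "ab".
    Returns (P(Ab/aB | genotype), P(AB/ab | genotype)).
    """
    p_first = 2 * freq["Ab"] * freq["aB"]    # first outcome: haplotypes Ab and aB
    p_second = 2 * freq["AB"] * freq["ab"]   # second outcome: haplotypes AB and ab
    total = p_first + p_second
    return p_first / total, p_second / total


freq = {"AB": 0.3, "ab": 0.3, "Ab": 0.3, "aB": 0.1}
first, second = phase_posteriors(freq)
print(round(first, 2), round(second, 2))  # 0.25 0.75
```

This calculation is what the Expectation step performs for every ambiguous individual.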

EM as Chicken vs Egg
- If we know haplotype frequencies p's (parameters), we can estimate the haplotype states of individuals z's (hidden variables)
- If we know the haplotype states of individuals z's (hidden variables), we can estimate the haplotype frequencies p's (parameters)
- BUT we know neither; iterate
  - Expectation step: estimate z's, given haplotype frequencies p's
  - Maximization step: estimate p's, given the haplotype states of individuals z's
- Overall, a clever hill-climbing strategy
[Figure: individual n with haplotype status (hidden variable) z = 0 or z = 1, and haplotype frequencies (parameters) $p_{Ab}$, $p_{aB}$, $p_{AB}$, $p_{ab}$]

Phasing by EM
- EM: method for maximum likelihood parameter inference with hidden variables
[Figure: a cycle that starts from a guess of the parameters (haplotype frequencies p's). E step: infer the haplotype state of each individual, i.e. find expected values of the hidden variables z's. M step: estimate haplotype frequencies, i.e. maximize the likelihood over the parameters p's.]

EM algorithm for haplotyping
1. Guesstimate haplotype frequencies
2. Use current frequency estimates to replace ambiguous genotypes with fractional counts of phased genotypes
3. Estimate frequency of each haplotype by counting
4. Repeat steps 2 and 3 until frequencies are stable
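The four steps of the EM haplotyping algorithm can be written as a short loop. The sketch below uses the 0/1/h genotype notation, starts from uniform frequencies as the initial guess, and runs a fixed number of iterations instead of testing for stability; these choices, the function names and the example data are assumptions for illustration, not the implementation of any published tool.

```python
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype over '0', '1', 'h'."""
    het_loci = [i for i, g in enumerate(genotype) if g == "h"]
    pairs = set()
    for alleles in product("01", repeat=len(het_loci)):
        hap_1, hap_2 = list(genotype), list(genotype)
        for locus, allele in zip(het_loci, alleles):
            hap_1[locus] = allele
            hap_2[locus] = "1" if allele == "0" else "0"
        pairs.add(tuple(sorted(("".join(hap_1), "".join(hap_2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iterations=50):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    # Step 1: initial guess (uniform over all haplotypes that could occur).
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iterations):
        counts = dict.fromkeys(haplotypes, 0.0)
        for pairs in pair_lists:
            # Step 2 (E step): fractional counts of each possible phasing.
            weights = [
                (1 if h1 == h2 else 2) * freq[h1] * freq[h2] for h1, h2 in pairs
            ]
            total = sum(weights)
            for (h1, h2), weight in zip(pairs, weights):
                counts[h1] += weight / total
                counts[h2] += weight / total
        # Step 3 (M step): frequencies by counting over 2 chromosomes per individual.
        n_chromosomes = 2 * len(genotypes)
        freq = {h: c / n_chromosomes for h, c in counts.items()}
    return freq


# Hypothetical data: four homozygotes and one double heterozygote.
freq = em_haplotype_frequencies(["00", "00", "11", "11", "hh"])
print({h: round(p, 3) for h, p in freq.items()})
```

In this toy data set the homozygous individuals supply evidence for haplotypes 00 and 11, so the ambiguous individual is pulled toward the 00/11 phasing and the frequencies of 01 and 10 shrink toward zero.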

Phasing by EM (worked example, four slides)
[Figure: genotype data in 0/1/h notation, the candidate haplotypes, and their frequencies in twelfths, updated over alternating Expectation and Maximization steps. The table values are not recoverable from the transcription.]

Phasing by EM
[Figure: final slide of the worked example, showing the data in 0/1/h notation, the haplotypes and their frequencies in sixths. The table values are not recoverable from the transcription.]

Computational cost (for SNPs)
- Consider sets of m unphased genotypes (markers 1..m)
- If markers are bi-allelic
  - $2^m$ possible haplotypes
  - $2^{m-1}(2^m + 1)$ possible haplotype pairs
  - $3^m$ distinct observed genotypes
  - $2^{n-1}$ reconstructions for n heterozygous loci
- For example, if m = 10: $2^{10} = 1024$ haplotypes, $2^{9}(2^{10} + 1) = 524{,}800$ haplotype pairs, $3^{10} = 59{,}049$ genotypes, and $2^{9} = 512$ reconstructions when all 10 loci are heterozygous (n = 10)

EM algorithm
- Pros
  - More accurate than Clark's method
  - Fully or partially phased individuals contribute most of the information
- Cons
  - Estimate depends on starting point: need to run multiple times from different starting points
  - Implementation may become computationally expensive: cost grows rapidly with the number of markers
    - For each individual, the number of possible haplotypes is $2^m$, where m is the number of markers
  - Typically run for short sequences with < 25 SNPs
  - No modeling on haplotypes

Outline
- Background & motivation
- Problem statement
- Statistical methods for haplotype inference
  - Clark's algorithm
  - Expectation Maximization (EM) algorithm
  - Coalescent-based methods and HMM
- Haplotype inference on sequence data
- Example applications
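The counts on the computational cost slide can be reproduced with integer arithmetic. The sketch below evaluates the four formulas for m bi-allelic markers and n heterozygous loci; the function and key names are made up for this example, and the m = 10 case assumes all 10 loci are heterozygous for the last quantity.

```python
def haplotype_search_space(m, n_het):
    """Sizes of the search space for m bi-allelic markers, n_het of them heterozygous."""
    return {
        "haplotypes": 2 ** m,
        # Unordered pairs with repetition: 2^m * (2^m + 1) / 2.
        "haplotype_pairs": 2 ** (m - 1) * (2 ** m + 1),
        # Each marker is homozygous for one allele, homozygous for the other, or heterozygous.
        "genotypes": 3 ** m,
        # Phasings of one genotype; a genotype with no heterozygous locus has exactly one.
        "reconstructions": 2 ** (n_het - 1) if n_het > 0 else 1,
    }


for name, size in haplotype_search_space(m=10, n_het=10).items():
    print(f"{name}: {size:,}")
```

Raising m to 25, the upper end of what the slide calls typical for EM, gives more than 33 million haplotypes, which shows why the cost grows so quickly with the number of markers.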


More information

Population Genetics II (Selection + Haplotype analyses)

Population Genetics II (Selection + Haplotype analyses) 26 th Oct 2015 Poulation Genetics II (Selection + Halotye analyses) Gurinder Singh Mickey twal Center for Quantitative iology Natural Selection Model (Molecular Evolution) llele frequency Embryos Selection

More information

Lecture WS Evolutionary Genetics Part I 1

Lecture WS Evolutionary Genetics Part I 1 Quantitative genetics Quantitative genetics is the study of the inheritance of quantitative/continuous phenotypic traits, like human height and body size, grain colour in winter wheat or beak depth in

More information

Introduction to Linkage Disequilibrium

Introduction to Linkage Disequilibrium Introduction to September 10, 2014 Suppose we have two genes on a single chromosome gene A and gene B such that each gene has only two alleles Aalleles : A 1 and A 2 Balleles : B 1 and B 2 Suppose we have

More information

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics.

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics. Evolutionary Genetics (for Encyclopedia of Biodiversity) Sergey Gavrilets Departments of Ecology and Evolutionary Biology and Mathematics, University of Tennessee, Knoxville, TN 37996-6 USA Evolutionary

More information

Learning in Bayesian Networks

Learning in Bayesian Networks Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks

More information

Calculation of IBD probabilities

Calculation of IBD probabilities Calculation of IBD probabilities David Evans University of Bristol This Session Identity by Descent (IBD) vs Identity by state (IBS) Why is IBD important? Calculating IBD probabilities Lander-Green Algorithm

More information

CHAPTER 23 THE EVOLUTIONS OF POPULATIONS. Section C: Genetic Variation, the Substrate for Natural Selection

CHAPTER 23 THE EVOLUTIONS OF POPULATIONS. Section C: Genetic Variation, the Substrate for Natural Selection CHAPTER 23 THE EVOLUTIONS OF POPULATIONS Section C: Genetic Variation, the Substrate for Natural Selection 1. Genetic variation occurs within and between populations 2. Mutation and sexual recombination

More information

Outline of lectures 3-6

Outline of lectures 3-6 GENOME 453 J. Felsenstein Evolutionary Genetics Autumn, 009 Population genetics Outline of lectures 3-6 1. We want to know what theory says about the reproduction of genotypes in a population. This results

More information

Quiz Section 4 Molecular analysis of inheritance: An amphibian puzzle

Quiz Section 4 Molecular analysis of inheritance: An amphibian puzzle Genome 371, Autumn 2018 Quiz Section 4 Molecular analysis of inheritance: An amphibian puzzle Goals: To illustrate how molecular tools can be used to track inheritance. In this particular example, we will

More information

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo

Friday Harbor From Genetics to GWAS (Genome-wide Association Study) Sept David Fardo Friday Harbor 2017 From Genetics to GWAS (Genome-wide Association Study) Sept 7 2017 David Fardo Purpose: prepare for tomorrow s tutorial Genetic Variants Quality Control Imputation Association Visualization

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees: MCMC for the analysis of genetic data on pedigrees: Tutorial Session 2 Elizabeth Thompson University of Washington Genetic mapping and linkage lod scores Monte Carlo likelihood and likelihood ratio estimation

More information

Inferring Protein-Signaling Networks

Inferring Protein-Signaling Networks Inferring Protein-Signaling Networks Lectures 14 Nov 14, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 1

More information

Question: If mating occurs at random in the population, what will the frequencies of A 1 and A 2 be in the next generation?

Question: If mating occurs at random in the population, what will the frequencies of A 1 and A 2 be in the next generation? October 12, 2009 Bioe 109 Fall 2009 Lecture 8 Microevolution 1 - selection The Hardy-Weinberg-Castle Equilibrium - consider a single locus with two alleles A 1 and A 2. - three genotypes are thus possible:

More information

SNP Association Studies with Case-Parent Trios

SNP Association Studies with Case-Parent Trios SNP Association Studies with Case-Parent Trios Department of Biostatistics Johns Hopkins Bloomberg School of Public Health September 3, 2009 Population-based Association Studies Balding (2006). Nature

More information

Parts 2. Modeling chromosome segregation

Parts 2. Modeling chromosome segregation Genome 371, Autumn 2018 Quiz Section 2 Meiosis Goals: To increase your familiarity with the molecular control of meiosis, outcomes of meiosis, and the important role of crossing over in generating genetic

More information

SNP-SNP Interactions in Case-Parent Trios

SNP-SNP Interactions in Case-Parent Trios Detection of SNP-SNP Interactions in Case-Parent Trios Department of Biostatistics Johns Hopkins Bloomberg School of Public Health June 2, 2009 Karyotypes http://ghr.nlm.nih.gov/ Single Nucleotide Polymphisms

More information

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase Humans have two copies of each chromosome Inherited from mother and father. Genotyping technologies do not maintain the phase Genotyping technologies do not maintain the phase Recall that proximal SNPs

More information

Causal Discovery Methods Using Causal Probabilistic Networks

Causal Discovery Methods Using Causal Probabilistic Networks ausal iscovery Methods Using ausal Probabilistic Networks MINFO 2004, T02: Machine Learning Methods for ecision Support and iscovery onstantin F. liferis & Ioannis Tsamardinos iscovery Systems Laboratory

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Haplotype-based variant detection from short-read sequencing

Haplotype-based variant detection from short-read sequencing Haplotype-based variant detection from short-read sequencing Erik Garrison and Gabor Marth July 16, 2012 1 Motivation While statistical phasing approaches are necessary for the determination of large-scale

More information

Parts 2. Modeling chromosome segregation

Parts 2. Modeling chromosome segregation Genome 371, Autumn 2017 Quiz Section 2 Meiosis Goals: To increase your familiarity with the molecular control of meiosis, outcomes of meiosis, and the important role of crossing over in generating genetic

More information

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease

On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease Yuehua Cui 1 and Dong-Yun Kim 2 1 Department of Statistics and Probability, Michigan State University,

More information

Time allowed: 2 hours Answer ALL questions in Section A, ALL PARTS of the question in Section B and ONE question from Section C.

Time allowed: 2 hours Answer ALL questions in Section A, ALL PARTS of the question in Section B and ONE question from Section C. UNIVERSITY OF EAST ANGLIA School of Biological Sciences Main Series UG Examination 2017-2018 GENETICS BIO-5009A Time allowed: 2 hours Answer ALL questions in Section A, ALL PARTS of the question in Section

More information

Predicting Protein Functions and Domain Interactions from Protein Interactions

Predicting Protein Functions and Domain Interactions from Protein Interactions Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput

More information

Notes for MCTP Week 2, 2014

Notes for MCTP Week 2, 2014 Notes for MCTP Week 2, 2014 Lecture 1: Biological background Evolutionary biology and population genetics are highly interdisciplinary areas of research, with many contributions being made from mathematics,

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Human vs mouse Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University www.biostat.jhsph.edu/~kbroman [ Teaching Miscellaneous lectures] www.daviddeen.com

More information

Problems for 3505 (2011)

Problems for 3505 (2011) Problems for 505 (2011) 1. In the simplex of genotype distributions x + y + z = 1, for two alleles, the Hardy- Weinberg distributions x = p 2, y = 2pq, z = q 2 (p + q = 1) are characterized by y 2 = 4xz.

More information

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation Instructor: Arindam Banerjee November 26, 2007 Genetic Polymorphism Single nucleotide polymorphism (SNP) Genetic Polymorphism

More information

EXERCISES FOR CHAPTER 3. Exercise 3.2. Why is the random mating theorem so important?

EXERCISES FOR CHAPTER 3. Exercise 3.2. Why is the random mating theorem so important? Statistical Genetics Agronomy 65 W. E. Nyquist March 004 EXERCISES FOR CHAPTER 3 Exercise 3.. a. Define random mating. b. Discuss what random mating as defined in (a) above means in a single infinite population

More information

Introduction to Genetics

Introduction to Genetics Introduction to Genetics We ve all heard of it, but What is genetics? Genetics: the study of gene structure and action and the patterns of inheritance of traits from parent to offspring. Ancient ideas

More information

Inferring Genetic Architecture of Complex Biological Processes

Inferring Genetic Architecture of Complex Biological Processes Inferring Genetic Architecture of Complex Biological Processes BioPharmaceutical Technology Center Institute (BTCI) Brian S. Yandell University of Wisconsin-Madison http://www.stat.wisc.edu/~yandell/statgen

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Outline. P o purple % x white & white % x purple& F 1 all purple all purple. F purple, 224 white 781 purple, 263 white

Outline. P o purple % x white & white % x purple& F 1 all purple all purple. F purple, 224 white 781 purple, 263 white Outline - segregation of alleles in single trait crosses - independent assortment of alleles - using probability to predict outcomes - statistical analysis of hypotheses - conditional probability in multi-generation

More information

Supporting Information

Supporting Information Supporting Information Hammer et al. 10.1073/pnas.1109300108 SI Materials and Methods Two-Population Model. Estimating demographic parameters. For each pair of sub-saharan African populations we consider

More information

Efficient Haplotype Inference with Boolean Satisfiability

Efficient Haplotype Inference with Boolean Satisfiability Efficient Haplotype Inference with Boolean Satisfiability Joao Marques-Silva 1 and Ines Lynce 2 1 School of Electronics and Computer Science University of Southampton 2 INESC-ID/IST Technical University

More information

Outline for today s lecture (Ch. 14, Part I)

Outline for today s lecture (Ch. 14, Part I) Outline for today s lecture (Ch. 14, Part I) Ploidy vs. DNA content The basis of heredity ca. 1850s Mendel s Experiments and Theory Law of Segregation Law of Independent Assortment Introduction to Probability

More information

Classical Selection, Balancing Selection, and Neutral Mutations

Classical Selection, Balancing Selection, and Neutral Mutations Classical Selection, Balancing Selection, and Neutral Mutations Classical Selection Perspective of the Fate of Mutations All mutations are EITHER beneficial or deleterious o Beneficial mutations are selected

More information

COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics

COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics Thorsten Dickhaus University of Bremen Institute for Statistics AG DANK Herbsttagung

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National

More information

Q Expected Coverage Achievement Merit Excellence. Punnett square completed with correct gametes and F2.

Q Expected Coverage Achievement Merit Excellence. Punnett square completed with correct gametes and F2. NCEA Level 2 Biology (91157) 2018 page 1 of 6 Assessment Schedule 2018 Biology: Demonstrate understanding of genetic variation and change (91157) Evidence Q Expected Coverage Achievement Merit Excellence

More information

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Xiaoquan Wen Department of Biostatistics, University of Michigan A Model

More information

(Write your name on every page. One point will be deducted for every page without your name!)

(Write your name on every page. One point will be deducted for every page without your name!) POPULATION GENETICS AND MICROEVOLUTIONARY THEORY FINAL EXAMINATION (Write your name on every page. One point will be deducted for every page without your name!) 1. Briefly define (5 points each): a) Average

More information

Natural Selection. Population Dynamics. The Origins of Genetic Variation. The Origins of Genetic Variation. Intergenerational Mutation Rate

Natural Selection. Population Dynamics. The Origins of Genetic Variation. The Origins of Genetic Variation. Intergenerational Mutation Rate Natural Selection Population Dynamics Humans, Sickle-cell Disease, and Malaria How does a population of humans become resistant to malaria? Overproduction Environmental pressure/competition Pre-existing

More information