Calculation of IBD probabilities


David Evans and Stacey Cherny
University of Oxford, Wellcome Trust Centre for Human Genetics

This Session
  IBD vs IBS
  Why is IBD important?
  Calculating IBD probabilities
    Lander-Green Algorithm (MERLIN)
      Single locus probabilities
      Hidden Markov Model
  Other ways of calculating IBD status
    Elston-Stewart Algorithm
    MCMC approaches
  MERLIN
    Practical Example
      IBD determination
      Information content mapping
    SNPs vs micro-satellite markers?

Aim of Gene Mapping Experiments
  Identify variants that control interesting traits
    Susceptibility to human disease
    Phenotypic variation in the population
  The hypothesis
    Individuals sharing these variants will be more similar for the traits they control
  The difficulty
    Testing ~10 million variants is impractical

Identity-by-Descent (IBD)
  Two alleles are IBD if they are descended from the same ancestral allele
  If a stretch of chromosome is IBD among a set of individuals, ALL variants within that stretch will also be shared IBD (markers, QTLs, disease genes)
  Allows surveys of large amounts of variation even when only a few polymorphisms are measured

A Segregating Disease Allele
[Figure: pedigree with individuals labelled +/+ or +/mut]
All affected individuals are IBD for the disease-causing mutation

Segregating Chromosomes
[Figure: chromosomes segregating in a pedigree, with the MARKER and the DISEASE LOCUS indicated]
Affected individuals tend to share adjacent areas of chromosome IBD

Marker Shared Among Affecteds
[Figure: pedigree with marker genotypes 1/2, 3/4, 3/4, 1/3, 2/4, 1/4, 4/4, 4/4, 1/4]
The 4 allele segregates with disease

Why is IBD sharing important?
  IBD sharing forms the basis of nonparametric linkage statistics
  Affected relatives tend to share marker alleles close to the disease locus IBD more often than chance
[Figure: the same pedigree and marker genotypes as on the previous slide]

Linkage between QTL and marker
[Figure: QTL and Marker; sib pairs classed as IBD 0, IBD 1 and IBD 2]

NO Linkage between QTL and marker Marker

IBD vs IBS
[Figure: two pedigrees with marker alleles. In one, the shared allele is Identical by Descent and Identical by State; in the other, it is Identical by State only]

Example: IBD in Siblings
Consider a mating between mother AB x father CD:

                Sib 1
Sib 2     AC    AD    BC    BD
  AC       2     1     1     0
  AD       1     2     0     1
  BC       1     0     2     1
  BD       0     1     1     2

IBD 0 : 1 : 2 = 25% : 50% : 25%
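The 25% : 50% : 25% split can be checked by brute-force enumeration. The following is a minimal sketch, assuming the four parental alleles are distinct labels (so that a shared label implies IBD) and that each parent transmits either allele with probability 1/2; the function names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Parental alleles carry distinct labels, so sharing a label means IBD.
MOTHER = ("A", "B")
FATHER = ("C", "D")

def offspring():
    """All four equally likely (maternal allele, paternal allele) pairs."""
    return list(product(MOTHER, FATHER))

def ibd_distribution():
    """Probability that two sibs share 0, 1 or 2 alleles IBD."""
    counts = Counter()
    kids = offspring()
    for sib1, sib2 in product(kids, kids):
        # One allele shared per parent from whom both sibs got the same allele.
        shared = int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])
        counts[shared] += 1
    total = sum(counts.values())
    return {k: Fraction(counts[k], total) for k in (0, 1, 2)}

print(ibd_distribution())
```

The 16 equally likely sib-pair combinations correspond to the 16 cells of the table: 4 with IBD 0, 8 with IBD 1 and 4 with IBD 2.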

IBD can be trivial
[Figure: parents 1/2 and 1/2; sibs 1/1 and 2/2]
IBD = 0

Two Other Simple Cases
[Figure: two further sib-pair pedigrees with marker genotypes over alleles 1 and 2; label: IBD = 2]

A little more complicated
[Figure: parents 1/2 and 2/2; sibs 1/2 and 1/2]
IBD = 1 (50% chance)
IBD = 2 (50% chance)

And even more complicated
[Figure: sibs 1/1 and 1/1; no parental genotypes shown]
IBD = ?

Bayes Theorem
\[
P(A_i \mid B) = \frac{P(A_i, B)}{P(B)} = \frac{P(B \mid A_i)\, P(A_i)}{P(B)} = \frac{P(B \mid A_i)\, P(A_i)}{\sum_j P(B \mid A_j)\, P(A_j)}
\]

Bayes Theorem for IBD Probabilities
\[
P(IBD = i \mid G) = \frac{P(IBD = i, G)}{P(G)} = \frac{P(IBD = i)\, P(G \mid IBD = i)}{P(G)} = \frac{P(IBD = i)\, P(G \mid IBD = i)}{\sum_j P(IBD = j)\, P(G \mid IBD = j)}
\]

P(Marker Genotype | IBD State)

Sib 1    Sib 2    k = 0              k = 1            k = 2
A1A1     A1A1     $p_1^4$            $p_1^3$          $p_1^2$
A1A1     A1A2     $2p_1^3 p_2$       $p_1^2 p_2$      0
A1A1     A2A2     $p_1^2 p_2^2$      0                0
A1A2     A1A1     $2p_1^3 p_2$       $p_1^2 p_2$      0
A1A2     A1A2     $4p_1^2 p_2^2$     $p_1 p_2$        $2p_1 p_2$
A1A2     A2A2     $2p_1 p_2^3$       $p_1 p_2^2$      0
A2A2     A1A1     $p_1^2 p_2^2$      0                0
A2A2     A1A2     $2p_1 p_2^3$       $p_1 p_2^2$      0
A2A2     A2A2     $p_2^4$            $p_2^3$          $p_2^2$

Each entry is P(observing genotypes | k alleles IBD)

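Worked Example
$p_1 = 0.5$; sibs 1/1 and 1/1

\[ P(G \mid IBD = 0) = p_1^4 = \tfrac{1}{16} \]
\[ P(G \mid IBD = 1) = p_1^3 = \tfrac{1}{8} \]
\[ P(G \mid IBD = 2) = p_1^2 = \tfrac{1}{4} \]
\[ P(G) = \tfrac{1}{4} p_1^4 + \tfrac{1}{2} p_1^3 + \tfrac{1}{4} p_1^2 = \tfrac{9}{64} \]
\[ P(IBD = 0 \mid G) = \frac{\tfrac{1}{4} p_1^4}{P(G)} = \tfrac{1}{9} \]
\[ P(IBD = 1 \mid G) = \frac{\tfrac{1}{2} p_1^3}{P(G)} = \tfrac{4}{9} \]
\[ P(IBD = 2 \mid G) = \frac{\tfrac{1}{4} p_1^2}{P(G)} = \tfrac{4}{9} \]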
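The same posterior can be computed programmatically. This is a hedged sketch rather than MERLIN's implementation: it assumes Hardy-Weinberg proportions, a sib-pair prior of 1/4, 1/2, 1/4 for IBD 0, 1, 2, and unordered genotypes written as allele pairs. The names `lik` and `posterior` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Prior IBD probabilities for a sib pair.
PRIOR = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def geno_prob(g, freq):
    """Hardy-Weinberg probability of an unordered genotype g = (a, b)."""
    a, b = g
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]

def lik(g1, g2, k, freq):
    """P(sib genotypes g1, g2 | k alleles IBD)."""
    g1, g2 = tuple(sorted(g1)), tuple(sorted(g2))
    if k == 0:
        # Four independent draws from the population.
        return geno_prob(g1, freq) * geno_prob(g2, freq)
    if k == 2:
        # Both sibs carry the same two alleles.
        return geno_prob(g1, freq) if g1 == g2 else Fraction(0)
    # k == 1: one shared allele s plus one independent allele per sib.
    total = Fraction(0)
    for s, a, b in product(freq, repeat=3):
        if tuple(sorted((s, a))) == g1 and tuple(sorted((s, b))) == g2:
            total += freq[s] * freq[a] * freq[b]
    return total

def posterior(g1, g2, freq):
    """Bayes theorem: P(IBD = k | genotypes) for k = 0, 1, 2."""
    joint = {k: PRIOR[k] * lik(g1, g2, k, freq) for k in PRIOR}
    p_g = sum(joint.values())
    return {k: joint[k] / p_g for k in joint}

freq = {1: Fraction(1, 2), 2: Fraction(1, 2)}
print(posterior((1, 1), (1, 1), freq))
```

With $p_1 = 0.5$ and both sibs 1/1, this gives 1/9, 4/9 and 4/9 for IBD 0, 1 and 2.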

For ANY PEDIGREE the inheritance pattern at every point in the genome can be completely described by a binary inheritance vector:
\[ v(x) = (p_1, m_1, p_2, m_2, \ldots, p_n, m_n) \]
whose coordinates describe the outcome of the 2n paternal and maternal meioses giving rise to the n non-founders in the pedigree:
  $p_i$ ($m_i$) is 0 if the grandpaternal allele is transmitted
  $p_i$ ($m_i$) is 1 if the grandmaternal allele is transmitted
[Figure: parents a/b and c/d; sibs a/c and b/d, with meioses labelled $p_1$, $m_1$, $p_2$, $m_2$]
v(x) = [0,0,1,1]
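For a sib pair (n = 2), the IBD state follows directly from the inheritance vector. The sketch below assumes the coordinate order $(p_1, m_1, p_2, m_2)$ used above and counts a shared allele whenever both sibs received the same grandparental allele through the same parent; `ibd_from_vector` is an illustrative name.

```python
from itertools import product

def ibd_from_vector(v):
    """IBD count for a sib pair from v = (p1, m1, p2, m2)."""
    p1, m1, p2, m2 = v
    # Sibs share the paternal allele IBD if p1 == p2, the maternal if m1 == m2.
    return int(p1 == p2) + int(m1 == m2)

# All 2^(2n) inheritance vectors for n = 2 non-founders.
vectors = list(product((0, 1), repeat=4))

# Vectors that correspond to sharing both alleles IBD.
ibd2 = ["".join(map(str, v)) for v in vectors if ibd_from_vector(v) == 2]
print(len(vectors), ibd2)
```

Under this convention the vector [0,0,1,1] gives IBD 0, and the vectors 0000, 0101, 1010 and 1111 give IBD 2.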

Inheritance Vector
In practice, it is not possible to determine the true inheritance vector at every point in the genome; rather, we represent partial information as a probability distribution over the $2^{2n}$ possible inheritance vectors.
[Figure: five-person pedigree with genotypes over alleles a, b and c, and meioses labelled $p_1$, $m_1$, $p_2$, $m_2$]

Inheritance vector    Prior    Posterior
------------------------------------------
0000                  1/16     1/8
0001                  1/16     1/8
0010                  1/16     0
0011                  1/16     0
0100                  1/16     1/8
0101                  1/16     1/8
0110                  1/16     0
0111                  1/16     0
1000                  1/16     1/8
1001                  1/16     1/8
1010                  1/16     0
1011                  1/16     0
1100                  1/16     1/8
1101                  1/16     1/8
1110                  1/16     0
1111                  1/16     0

Computer Representation
  Define inheritance vector $v_l$ at each marker location l
  Each inheritance vector is indexed by a different memory location
  Store the likelihood for each gene flow pattern, conditional on observed genotypes at location l
  $2^{2n}$ elements!!!
[Figure: array of 16 likelihood entries L, indexed 0000, 0001, ..., 1111]

[Figure: three ways of storing the likelihoods: (a) bit-indexed array over 0000, 0001, ..., 1111, (b) packed tree, (c) sparse tree. Legend: node with zero likelihood; node identical to sibling; likelihood for this branch ($L_1$, $L_2$)]
Abecasis et al (2002) Nat Genet 30:97-101

Multipoint IBD
  IBD status cannot always be ascertained with certainty because, e.g., the mating is not informative or parental information is not available
  IBD information at uninformative loci can be made more precise by examining nearby linked loci

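Multipoint IBD
[Figure: sib pair typed at two linked markers. At the first marker (parents a/b and c/d; sibs a/c and b/d) IBD = 0. At the second marker (alleles 1 and 2) the genotypes alone leave the question: IBD = 0 or IBD = 1?]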

Complexity of the Problem in Larger Pedigrees
  2n meioses in a pedigree with n nonfounders
    Each meiosis has 2 possible outcomes
    Therefore $2^{2n}$ possibilities for each locus
  For each genetic locus
    One location for each of m genetic markers
    Distinct, non-independent meiotic outcomes
  Up to $4^{nm}$ distinct outcomes!!!

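Example: Sib-pair Genotyped at 10 Markers
[Figure: grid of inheritance vectors (0000, 0001, 0010, ..., 1111) against markers 1, 2, 3, 4, ..., m = 10]
\[ (2^{2n})^{m} = (2^{2 \times 2})^{10} \approx 10^{12} \text{ possible paths!!!} \]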
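The size of the problem is quick to check numerically. This sketch only reproduces the arithmetic for a sib pair at 10 markers and contrasts the number of paths with the roughly m x (2^(2n))^2 operations needed when the likelihood is computed one marker at a time; the variable names are illustrative.

```python
n_nonfounders = 2      # a sib pair
n_markers = 10

# Inheritance vectors per locus: 2 outcomes for each of the 2n meioses.
vectors_per_locus = 2 ** (2 * n_nonfounders)

# Brute force: every combination of vectors across all markers.
paths = vectors_per_locus ** n_markers

# Marker-by-marker (hidden Markov) calculation: one matrix-vector
# product of size 2^(2n) x 2^(2n) per marker.
hmm_ops = n_markers * vectors_per_locus ** 2

print(vectors_per_locus, paths, hmm_ops)
```

Here `paths` is 2^40, just over 10^12, while `hmm_ops` is 2560.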

Lander-Green Algorithm
  The inheritance vector at a locus is conditionally independent of the inheritance vectors at all preceding loci, given the inheritance vector at the immediately preceding locus ("Hidden Markov chain")
  The conditional probability of an inheritance vector $v_{i+1}$ at locus i+1, given the inheritance vector $v_i$ at locus i, is
  \[ \theta_i^{\,j} (1 - \theta_i)^{2n - j} \]
  where $\theta_i$ is the recombination fraction between locus i and locus i+1, and j is the number of changes in elements of the inheritance vector ("transition probabilities")
  Example:
    Locus 1  [0000]
    Locus 2  [0001]
    Conditional probability = $(1 - \theta)^3 \theta$
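The transition probabilities can be generated from the number of positions in which two inheritance vectors differ. A minimal sketch, assuming vectors are encoded as integers whose bits are the vector elements and that the same recombination fraction applies to every meiosis; the function names are hypothetical.

```python
def transition_prob(v, w, theta, n_meioses):
    """P(vector w at next locus | vector v at this locus)."""
    # j = number of elements that change, i.e. the Hamming distance.
    j = bin(v ^ w).count("1")
    return theta ** j * (1 - theta) ** (n_meioses - j)

def transition_matrix(theta, n_meioses):
    """Full 2^(2n) x 2^(2n) matrix of transition probabilities."""
    size = 1 << n_meioses
    return [
        [transition_prob(v, w, theta, n_meioses) for w in range(size)]
        for v in range(size)
    ]

theta = 0.1
T = transition_matrix(theta, 4)   # sib pair: 2n = 4 meioses

# [0000] -> [0001]: one change, three non-changes.
print(T[0b0000][0b0001], (1 - theta) ** 3 * theta)
```

Each row of the matrix sums to 1, since the entries are the terms of a binomial expansion of $(\theta + (1 - \theta))^{2n}$.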

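Total Likelihood
[Figure: grid of inheritance vectors (0000, 0001, 0010, ...) against markers 1, 2, 3, ..., m]
\[ L = \mathbf{1}^{T} Q_1 T_1 Q_2 T_2 \cdots T_{m-1} Q_m \mathbf{1} \]
$Q_i$ is the $2^{2n} \times 2^{2n}$ diagonal matrix of single locus probabilities at locus i:
\[ Q_i = \begin{pmatrix} P[0000] & 0 & \cdots & 0 \\ 0 & P[0001] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P[1111] \end{pmatrix} \]
$T_i$ is the $2^{2n} \times 2^{2n}$ matrix of transition probabilities between locus i and locus i+1:
\[ T_i = \begin{pmatrix} (1-\theta)^4 & (1-\theta)^3\theta & \cdots & \theta^4 \\ (1-\theta)^3\theta & (1-\theta)^4 & \cdots & (1-\theta)\theta^3 \\ \vdots & \vdots & \ddots & \vdots \\ \theta^4 & (1-\theta)\theta^3 & \cdots & (1-\theta)^4 \end{pmatrix} \]
~$10 \times (2^{2 \times 2})^2$ operations = 2560 for this case!!!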
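The product of Q and T matrices can be evaluated left to right as a sequence of matrix-vector products, which is where the operation count above comes from. The sketch below is an illustration under stated assumptions, not MERLIN's code: each Q_i is passed as a plain list of its diagonal entries, vectors are integer-encoded, and any constant prior over the first vector is left out.

```python
def transition_prob(v, w, theta, n_meioses):
    """theta^j (1 - theta)^(2n - j), with j the number of changed elements."""
    j = bin(v ^ w).count("1")
    return theta ** j * (1 - theta) ** (n_meioses - j)

def total_likelihood(Q, thetas, n_meioses):
    """Evaluate 1' Q_1 T_1 Q_2 ... T_{m-1} Q_m 1.

    Q[i][v]   : single locus probability at marker i for vector v
    thetas[i] : recombination fraction between markers i and i + 1
    """
    size = 1 << n_meioses
    alpha = list(Q[0])                                   # 1' Q_1
    for q, theta in zip(Q[1:], thetas):
        alpha = [
            q[w] * sum(alpha[v] * transition_prob(v, w, theta, n_meioses)
                       for v in range(size))
            for w in range(size)
        ]                                                # ... T_i Q_{i+1}
    return sum(alpha)                                    # ... 1

# Two markers; the data pin the vector to 0000 at the first and 0001 at the second.
at_0000 = [1.0] + [0.0] * 15
at_0001 = [0.0, 1.0] + [0.0] * 14
print(total_likelihood([at_0000, at_0001], [0.1], 4))
```

In this toy case the likelihood reduces to the single transition probability $(1 - \theta)^3 \theta$, matching the example of locus 1 at [0000] and locus 2 at [0001].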

P(IBD) = 2 at Marker Three
[Figure: grid of inheritance vectors (0000, 0001, 0010, ...) against markers 1, 2, 3, 4, ..., m = 10]
\[ \frac{L[IBD = 2 \text{ at marker } 3]}{L[ALL]} = \frac{L[0000] + L[0101] + L[1010] + L[1111]}{L[ALL]} \]

P(IBD) = 2 at arbitrary position on the chromosome
[Figure: grid of inheritance vectors (0000, 0001, 0010, ...) against markers 1, 2, 3, 4, ..., m = 10]
\[ \frac{L[0000] + L[0101] + L[1010] + L[1111]}{L[ALL]} \]
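One way to obtain these ratios is to combine the likelihood of the markers to the left of a position with that of the markers to the right. The following sketch assumes a sib pair (4 meioses), integer-encoded vectors in the order $(p_1, m_1, p_2, m_2)$, and each Q_i given as a list of its diagonal entries. An arbitrary position can be handled in the same way by inserting an uninformative locus (all entries 1) there. The function names are hypothetical.

```python
N_MEIOSES = 4
SIZE = 1 << N_MEIOSES
IBD2 = (0b0000, 0b0101, 0b1010, 0b1111)   # p1 == p2 and m1 == m2

def transition_prob(v, w, theta):
    j = bin(v ^ w).count("1")
    return theta ** j * (1 - theta) ** (N_MEIOSES - j)

def step(vec, theta):
    """Multiply a likelihood vector by the (symmetric) transition matrix."""
    return [sum(vec[v] * transition_prob(v, w, theta) for v in range(SIZE))
            for w in range(SIZE)]

def vector_posteriors(Q, thetas, k):
    """P(inheritance vector at marker k | genotypes at all markers)."""
    left = list(Q[0])
    for i in range(1, k + 1):                 # markers up to and including k
        left = [q * t for q, t in zip(Q[i], step(left, thetas[i - 1]))]
    right = [1.0] * SIZE
    for i in range(len(Q) - 1, k, -1):        # markers to the right of k
        right = step([q * r for q, r in zip(Q[i], right)], thetas[i - 1])
    joint = [l * r for l, r in zip(left, right)]
    total = sum(joint)                        # L[ALL]
    return [x / total for x in joint]

def prob_ibd2(Q, thetas, k):
    post = vector_posteriors(Q, thetas, k)
    return sum(post[v] for v in IBD2)

# Marker 0 pins the vector to 0000; marker 1 is uninformative.
known = [1.0] + [0.0] * 15
blank = [1.0] * SIZE
print(prob_ibd2([known, blank], [0.05], 1))
```

With a completely linked informative neighbour (θ = 0) the uninformative marker inherits P(IBD = 2) = 1, and with an unlinked neighbour (θ = 0.5) it falls back to the prior of 0.25.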

Further speedups
  Trees summarize redundant information
    Portions of inheritance vector that are repeated
    Portions of inheritance vector that are constant or zero
    Use sparse-matrix by vector multiplication
  Regularities in transition matrices
    Use symmetries in divide and conquer algorithm (Idury & Elston, 1997)

Lander-Green Algorithm Summary
  Factorize likelihood by marker
    Complexity $m \cdot e^{n}$
  Large number of markers (e.g. dense SNP data)
  Relatively small pedigrees
  MERLIN, GENEHUNTER, ALLEGRO etc

Elston-Stewart Algorithm
  Factorize likelihood by individual
    Complexity $n \cdot e^{m}$
  Small number of markers
  Large pedigrees
    With little inbreeding
  VITESSE etc

Other methods
  A number of MCMC methods have been proposed
    ~Linear in # markers
    ~Linear in # people
  Hard to guarantee convergence on very large datasets
    Many widely separated local minima
  E.g. SIMWALK, LOKI

MERLIN: Multipoint Engine for Rapid Likelihood Inference

Capabilities
  Linkage Analysis
    NPL and K&C LOD
    Variance Components
  Haplotypes
    Most likely
    Sampling
    All
  IBD and info content
  Error Detection
    Most SNP typing errors are Mendelian consistent
  Recombination
    No. of recombinants per family per interval can be controlled
  Simulation

MERLIN Website
www.sph.umich.edu/csg/abecasis/merlin
  Reference
  FAQ
  Source
  Binaries
  Tutorial
    Linkage
    Haplotyping
    Simulation
    Error detection
    IBD calculation

Test Case Pedigrees

Timings: Marker Locations

Top Generation Genotyped
              A (x1000)    B        C          D
Genehunter    38s          37s      8m6s       *
Allegro       8s           2m7s     3h54m3s    *
Merlin        s            8s       3m55s      *

Top Generation Not Genotyped
              A (x1000)    B        C          D
Genehunter    45s          m54s     *          *
Allegro       8s           m08s     h2m38s     *
Merlin        3s           25s      5m50s      *

Intuition: Approximate Sparse T
  Dense maps, closely spaced markers
  Small recombination fractions θ
  Reasonable to replace $\theta^k$ with zero
    Produces a very sparse transition matrix
  Consider only elements of v separated by <k recombination events
    At consecutive locations
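The effect of dropping high-order terms can be seen on a single row of the transition matrix. This is a sketch of the idea only, assuming integer-encoded vectors and a cut-off expressed as the maximum number of recombinants allowed between consecutive locations; `sparse_row` and `max_recomb` are illustrative names, not MERLIN options.

```python
def sparse_row(v, theta, n_meioses, max_recomb):
    """Transition probabilities from vector v, keeping only targets
    that differ from v in at most max_recomb elements."""
    row = {}
    for w in range(1 << n_meioses):
        j = bin(v ^ w).count("1")          # recombination events implied
        if j <= max_recomb:
            row[w] = theta ** j * (1 - theta) ** (n_meioses - j)
    return row

theta = 0.01                                # closely spaced markers
row = sparse_row(0b0000, theta, 4, 1)       # allow at most one recombinant
print(len(row), sum(row.values()))
```

For a sib pair, allowing at most one recombinant keeps 5 of the 16 entries in each row while retaining more than 99.9% of the probability mass when θ = 0.01.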

Additional Speedup

                      Time     Memory
Exact                 40s      100 MB
No recombination      <1s      4 MB
1 recombinant         2s       7 MB
2 recombinants        5s       54 MB
Genehunter 2.1        6min     1024 MB

Keavney et al (1998) ACE data, 10 SNPs within gene, 4-8 individuals per family

Input Files
  Pedigree File
    Relationships
    Genotype data
    Phenotype data
  Data File
    Describes contents of pedigree file
  Map File
    Records location of genetic markers

Example Pedigree File
<contents of example.ped>
1 1 0 0 1 x     3 3 x x
1 2 0 0 2 x     4 4 x x
1 3 0 0 1 x     1 2 x x
1 4 1 2 2 x     4 3 x x
1 5 3 4 2 2.234 1 3 2 2
1 6 3 4 2 4.32  2 4 2 2
<end of example.ped>
Encodes family relationships, marker and phenotype information

Example Data File
<contents of example.dat>
T some_trait_of_interest
M some_marker
M another_marker
<end of example.dat>
Provides information necessary to decode pedigree file

Data File Field Codes

Code    Description
M       Marker Genotype
A       Affection Status
T       Quantitative Trait
C       Covariate
Z       Zygosity

Example Map File
<contents of example.map>
CHROMOSOME MARKER POSITION
2 D2S60 60.0
2 D2S308 65.0
<end of example.map>
Indicates location of individual markers, necessary to derive recombination fractions between them
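To illustrate how map positions turn into recombination fractions, the sketch below applies Haldane's map function, θ = (1 - exp(-2d)) / 2 with d in Morgans. The choice of Haldane's function is an assumption made for illustration; the slides do not say which map function is applied, and the function name is hypothetical.

```python
import math

def haldane_theta(distance_cm):
    """Recombination fraction for a map distance in centimorgans,
    under Haldane's map function (no crossover interference)."""
    d = distance_cm / 100.0                 # centimorgans -> Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))

# Two markers 5 cM apart.
print(haldane_theta(5.0))
```

For small distances θ is close to the distance in Morgans, and it approaches 0.5 as the markers become unlinked.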

Worked Example
$p_1 = 0.5$; sibs 1/1 and 1/1
  P(IBD = 0 | G) = 1/9
  P(IBD = 1 | G) = 4/9
  P(IBD = 2 | G) = 4/9
merlin -d example.dat -p example.ped -m example.map --ibd

Application: Information Content Mapping
  Information content: provides a measure of how well a marker set approaches the goal of completely determining the inheritance outcome
  Based on the concept of entropy
  \[ E = -\sum_i P_i \log_2 P_i \]
  where $P_i$ is the probability of the ith outcome
  \[ I_E(x) = 1 - \frac{E(x)}{E_0} \]
  Always lies between 0 and 1
  Does not depend on test for linkage
  Scales linearly with power
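The entropy-based measure is straightforward to compute from a distribution over inheritance vectors. In this sketch $E_0$ is taken to be the entropy of the prior distribution over vectors, which is an assumption made for illustration; the baseline used by a particular program may be defined differently. The function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(posterior, prior):
    """I_E = 1 - E / E_0, with E_0 taken as the entropy of the prior."""
    return 1.0 - entropy(posterior) / entropy(prior)

# Sib pair: 16 vectors a priori, narrowed by the data to 8 equally likely ones.
prior = [1 / 16] * 16
posterior = [1 / 8] * 8 + [0.0] * 8
print(information_content(posterior, prior))
```

Here the data remove one of the four bits of uncertainty, giving an information content of 0.25. A posterior concentrated on a single vector gives 1, and a posterior equal to the prior gives 0.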

Application: Information Content Mapping
  Simulations (sib-pairs with/without parental genotypes)
    1 microsatellite per 10 cM (ABI)
    1 microsatellite per 3 cM (deCODE)
    1 SNP per 0.5 cM (Illumina)
    1 SNP per 0.2 cM (Affymetrix)
  Which panel performs best in terms of extracting marker information?
  Do the results depend upon the presence of parental genotypes?
merlin -d file.dat -p file.ped -m file.map --information --step --markernames

SNPs vs Microsatellites with parents
[Figure: Information Content (0.0 to 1.0) against Position (0 to 100 cM) for SNPs + parents at densities of 0.2 cM and 0.5 cM, and microsat + parents at densities of 3 cM and 10 cM]

SNPs vs Microsatellites without parents
[Figure: Information Content (0.0 to 1.0) against Position (0 to 100 cM) for SNPs - parents at densities of 0.2 cM and 0.5 cM, and microsat - parents at densities of 3 cM and 10 cM]