Calculation of IBD probabilities

Calculation of IBD probabilities David Evans and Stacey Cherny University of Oxford Wellcome Trust Centre for Human Genetics

This Session IBD vs IBS Why is IBD important? Calculating IBD probabilities Lander-Green Algorithm MERLIN) Single locus probabilities Hidden Markov Model Other ways of calculating IBD status Elston-Stewart Algorithm MCMC approaches MERLIN Practical Example IBD determination Information content mapping SNPs vs micro-satellite markers?

Aim of Gene Mapping Experiments Identify variants that control interesting traits Susceptibility to human disease Phenotypic variation in the population The hypothesis Individuals sharing these variants will be more similar for traits they control The difficulty Testing ~0 million variants is impractical

Identity-by-Descent IBD) Two alleles are IBD if they are descended from the same ancestral allele If a stretch of chromosome is IBD among a set of individuals, ALL variants within that stretch will also be shared IBD markers, QTLs, disease genes) Allows surveys of large amounts of variation even when a few polymorphisms measured

A Segregating Disease Allele +/+ +/mut +/mut +/+ +/mut +/mut +/+ +/mut +/+ All affected individuals IBD for disease causing mutation

Segregating Chromosomes MARKER DISEASE LOCUS Affected individuals tend to share adjacent areas of chromosome IBD

Marker Shared Among Affecteds /2 3/4 3/4 /3 2/4 /4 4/4 4/4 /4 4 allele segregates with disease

Why is IBD sharing important? /2 3/4 IBD sharing forms the basis of nonparametric linkage statistics 3/4 /3 2/4 /4 4/4 4/4 /4 Affected relatives tend to share marker alleles close to the disease locus IBD more often than chance

Linkage between QTL and marker QTL Marker IBD 0 IBD IBD 2

NO Linkage between QTL and marker Marker

IBD vs IBS 2 3 4 2 3 3 4 3 2 Identical by Descent and Identical by State Identical by state only

Example: IBD in Siblings Consider a mating between mother AB x father CD: Sib AC AD BC BD Sib 2 AC AD 2 2 0 0 BC 0 2 BD 0 2 IBD 0 : : 2 25% : 50% : 25%

IBD can be trivial / 2 / 2 IBD0 / 2 / 2

Two Other Simple Cases / 2 / 2 / 2 / 2 IBD2 / / 2 2 / 2 2 /

A little more complicated / 2 2 / 2 IBD 50% chance) / 2 / 2 IBD2 50% chance)

And even more complicated IBD? / /

Bayes Theorem j A j B P A j P A i B P A i P B P A i B P A i P B P B B A i P ) ) ) ) ) ) ) ), P ) ) A i

Bayes Theorem for IBD Probabilities j j IBD G P j IBD P i IBD G P i IBD P G P i IBD G P i IBD P G P G i G i IBD P ) ) ) ) ) ) ) ) ), PIBD )

PMarker Genotype IBD State) Sib Sib 2 k0 k k2 A A A A p 4 p 3 p 2 A A A A 2 2p 3 p 2 p 2 p 2 0 A A A 2 A 2 p 2 p 2 2 0 0 A A 2 A A 2p 3 p 2 p 2 p 2 0 A A 2 A A 2 4p 2 p 2 2 p p 2 2p p 2 A A 2 A 2 A 2 2p p 3 2 p p 2 2 0 A 2 A 2 A A p 2 p 2 2 0 0 A 2 A 2 A 2p p p 2 p 3 A 2 2 2 0 A p 4 p 3 p 2 2 A 2 2 2 2 A 2 A 2 Pobserving genotypes / k alleles IBD)

Worked Example p 0.5 / /

Worked Example / / 9 4 ) 4 ) 2 9 4 ) 2 ) 9 ) 4 ) 0 64 9 4 2 4 ) 4 2) 8 ) 6 0) 0.5 2 3 4 2 3 4 2 3 4 + + P G p G IBD P P G p G IBD P P G p G IBD P p p p P G p IBD P G p IBD P G p IBD P G p

For ANY PEDIGREE the inheritance pattern at every point in the genome can be completely described by a binary inheritance vector: vx) p, m, p 2, m 2,,p n,m n ) whose coordinates describe the outcome of the 2n paternal and maternal meioses giving rise to the n non-founders in the pedigree p i m i ) is 0 if the grandpaternal allele transmitted p i m i ) is if the grandmaternal allele is transmitted a / b c / d p p 2 m m 2 vx) [0,0,,] a / c b / d

Inheritance Vector In practice, it is not possible to determine the true inheritance vector at every point in the genome, rather we represent partial information as a probability distribution over the 2 2n possible inheritance vectors a b 2 a c a c p m p 2 m 2 a b 3 4 5 b b Inheritance vector Prior Posterior ------------------------------------------------------------------ 0000 /6 /8 000 /6 /8 000 /6 0 00 /6 0 000 /6 /8 00 /6 /8 00 /6 0 0 /6 0 000 /6 /8 00 /6 /8 00 /6 0 0 /6 0 00 /6 /8 0 /6 /8 0 /6 0 /6 0

Computer Representation Define inheritance vector v l Each inheritance vector indexed by a different memory location Likelihood for each gene flow pattern Conditional on observed genotypes at location l 2 2n elements!!! At each marker location l 0000 000 000 00 000 00 00 0 000 00 00 0 00 0 0 L L L L L L L L L L L L L L L L

a) bit-indexed array 0000 000 000 00 000 00 00 0 000 00 00 0 00 0 0 L L 2 L L 2 L L 2 L L 2 b) packed tree L L 2 L L 2 L L 2 L L 2 c) sparse tree Legend Node with zero likelihood Node identical to sibling L L 2 L L 2 Likelihood for this branch Abecasis et al 2002) Nat Genet 30:97-0

Multipoint IBD IBD status may not be able to be ascertained with certainty because e.g. the mating is not informative, parental information is not available IBD information at uninformative loci can be made more precise by examining nearby linked loci

Multipoint IBD a / b / c / d 2 / IBD 0 a / c b / d / / 2 IBD 0 or IBD?

Complexity of the Problem in Larger Pedigrees 2n meioses in pedigree with n nonfounders Each meiosis has 2 possible outcomes Therefore 2 2n possibilities for each locus For each genetic locus One location for each of m genetic markers Distinct, non-independent meiotic outcomes Up to 4 nm distinct outcomes!!!

Example: Sib-pair Genotyped at 0 Markers Inheritance vector 0000 000 000 2 3 4 m 0 Marker 2 2xn ) m 2 2 x 2 ) 0 0 2 possible paths!!!

Lander-Green Algorithm The inheritance vector at a locus is conditionally independent of the inheritance vectors at all preceding loci given the inheritance vector at the immediately preceding locus Hidden Markov chain ) The conditional probability of an inheritance vector v i+ at locus i+, given the inheritance vector v i at locus i is θ ij -θ i ) 2n-j where θ is the recombination fraction and j is the number of changes in elements of the inheritance vector transition probabilities ) Example: Locus [0000] Locus 2 [000] Conditional probability θ) 3 θ

0000 000 000 2 3 m Total Likelihood Q T Q 2 T 2 T m- Q m Q i P[0000] 0 0 0 0 P[000] 0 0 0 0 0 0 0 0 P[] T i -θ) 4 -θ) 3 θ θ 4 -θ) 3 θ -θ) 4 -θ)θ 3 θ 4 -θ)θ 3 -θ) 4 2 2n x 2 2n diagonal matrix of single locus probabilities at locus i 2 2n x 2 2n matrix of transitional probabilities between locus i and locus i+ ~0 x 2 2 x 2 ) 2 operations 2560 for this case!!!

PIBD) 2 at Marker Three Inheritance vector 0000 000 000 2 3 4 m 0 Marker L[IBD 2 at marker 3] / L[ALL] L[0000] + L[00] + L[00] + L[] ) / L[ALL]

PIBD) 2 at arbitrary position on the chromosome Inheritance vector 0000 000 000 2 3 4 m 0 Marker L[0000] + L[00] + L[00] + L[] ) / L[ALL]

Further speedups Trees summarize redundant information Portions of inheritance vector that are repeated Portions of inheritance vector that are constant or zero Use sparse-matrix by vector multiplication Regularities in transition matrices Use symmetries in divide and conquer algorithm Idury & Elston, 997)

Lander-Green Algorithm Summary Factorize likelihood by marker Complexity m e n Large number of markers e.g. dense SNP data) Relatively small pedigrees MERLIN, GENEHUNTER, ALLEGRO etc

Elston-Stewart Algorithm Factorize likelihood by individual Complexity n e m Small number of markers Large pedigrees With little inbreeding VITESSE etc

Other methods Number of MCMC methods proposed ~Linear on # markers ~Linear on # people Hard to guarantee convergence on very large datasets Many widely separated local minima E.g. SIMWALK, LOKI

MERLIN-- Multipoint Engine for Rapid Likelihood Inference

Capabilities Linkage Analysis NPL and K&C LOD Variance Components Haplotypes Most likely Sampling All IBD and info content Error Detection Most SNP typing errors are Mendelian consistent Recombination No. of recombinants per family per interval can be controlled Simulation

MERLIN Website www.sph.umich.edu/csg/abecasis/merlin Reference FAQ Source Tutorial Linkage Haplotyping Simulation Error detection IBD calculation Binaries

Test Case Pedigrees

Timings Marker Locations Top Generation Genotyped A x000) B C D Genehunter 38s 37s 8m6s * Allegro 8s 2m7s 3h54m3s * Merlin s 8s 3m55s * Top Generation Not Genotyped A x000) B C D Genehunter 45s m54s * * Allegro 8s m08s h2m38s * Merlin 3s 25s 5m50s *

Intuition: Approximate Sparse T Dense maps, closely spaced markers Small recombination fractions θ Reasonable to set θ k with zero Produces a very sparse transition matrix Consider only elements of v separated by <k recombination events At consecutive locations

Additional Speedup Time Memory Exact 40s 00 MB No recombination <s 4 MB recombinant 2s 7 MB 2 recombinants 5s 54 MB Genehunter 2. 6min 024MB Keavney et al 998) ACE data, 0 SNPs within gene, 4-8 individuals per family

Input Files Pedigree File Relationships Genotype data Phenotype data Data File Describes contents of pedigree file Map File Records location of genetic markers

Example Pedigree File <contents of example.ped> 0 0 x 3 3 x x 2 0 0 2 x 4 4 x x 3 0 0 x 2 x x 4 2 2 x 4 3 x x 5 3 4 2 2.234 3 2 2 6 3 4 2 4.32 2 4 2 2 <end of example.ped> Encodes family relationships, marker and phenotype information

Example Data File <contents of example.dat> T some_trait_of_interest M some_marker M another_marker <end of example.dat> Provides information necessary to decode pedigree file

Data File Field Codes Code M A T C Z Description Marker Genotype Affection Status. Quantitative Trait. Covariate. Zygosity.

Example Map File <contents of example.map> CHROMOSOME MARKER POSITION 2 D2S60 60.0 2 D2S308 65.0 <end of example.map> Indicates location of individual markers, necessary to derive recombination fractions between them

Worked Example p 0.5 P IBD 0 G) P IBD G) P IBD 2 G) 9 4 9 4 9 / / merlin d example.dat p example.ped m example.map --ibd

Application: Information Content Mapping Information content: Provides a measure of how well a marker set approaches the goal of completely determining the inheritance outcome Based on concept of entropy E -ΣP i log 2 P i where P i is probability of the ith outcome I E x) Ex)/E 0 Always lies between 0 and Does not depend on test for linkage Scales linearly with power

Application: Information Content Mapping Simulations sib-pairs with/out parental genotypes) micro-satellite per 0cM ABI) microsatellite per 3cM decode) SNP per 0.5cM Illumina) SNP per 0.2 cm Affymetrix) Which panel performs best in terms of extracting marker information? Do the results depend upon the presence of parental genotypes? merlin d file.dat p file.ped m file.map --information --step --markernames

SNPs vs Microsatellites with parents Information Content.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0. 0.0 Densities SNP microsat 0.2 cm 3 cm 0.5 cm 0 cm 0 0 20 30 40 50 60 70 80 90 00 Position cm) SNPs + parents microsat + parents

SNPs vs Microsatellites without parents.0 0.9 0.8 Information Content 0.7 0.6 0.5 0.4 0.3 0.2 0. 0.0 Densities SNP Densities microsat 0.2 cm SNP 3 microsat cm 0.50.2 cm cm 0 cm 3 cm 0.5 cm 0 cm 0 0 20 30 40 50 60 70 80 90 00 SNPs - parents microsat - parents Position cm)