Similarity Measures and Clustering In Genetics


Similarity Measures and Clustering In Genetics
Daniel Lawson
Heilbronn Institute for Mathematical Research, School of Mathematics, University of Bristol
www.paintmychromosomes.com

Talk outline. Introduction: genetic data and overview. 1. Generic approaches: similarity measures; similarity-based clustering; spectral methods. 2. Genetics models: direct model-based clustering; model-based similarity measures; ChromoPainter/FineSTRUCTURE clustering. 3. Results for real data. 2/40

Motivation: Models and Clustering. Models are essential for clustering, and the choice of measure strongly influences the clustering obtained. Example: random walk. 3/40

Genetics data. Individual i: 01111101111001111000000000100011, 11100101101011000001111100100011. Individuals: O(10^3+), with correlations due to relatedness. SNPs l, at genome positions O(10^6-9): correlations due to linkage. 4/40

Outline: The process. Step 1: SNPs, O(10^6-9), are converted to a similarity matrix between individuals, O(10^3-4). Step 2: cluster to analyse the population structure, populations O(10^2-3). 5/40

Simulated data Simulate data using 'real' conditions Sequence data including linkage disequilibrium, random mating within populations, 'complex' demography Change how much data we show the models 6/40

Part 1. Generic Approaches. Treat SNPs as independent features; cluster individuals as independent samples from some clustering distribution on the data matrix (individual number i by SNP location l). 7/40

Clustering. K clusters (where K should be inferred): find the 'best' assignment of samples into clusters. 'Best' is some score, potentially using: the raw data (direct); distances between points (similarity-based); dimensionality reduction (spectral). 8/40

Similarity Measures (between SNPs). An almost infinite number of choices, e.g.: cosine distance; TF-IDF (Term Frequency - Inverse Document Frequency); allele sharing distance / identity by state; edit distance / norm distance, e.g. L1, L2; covariance; exponential; normalised covariance (normalised count of shared alleles). 9/40
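Several of the measures listed above are simple functions of the raw 0/1 SNP vectors. A minimal pure-Python sketch of three of them, applied to two made-up individuals (a real analysis computes the full individual-by-individual matrix):

```python
import math

def allele_sharing(x, y):
    """Fraction of SNPs at which two individuals carry the same allele
    (identity by state)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def covariance(x, y):
    """Sample covariance of two SNP vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def cosine(x, y):
    """Cosine similarity (angle between the SNP vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

xi = [0, 1, 1, 1, 0, 1, 0, 1]  # made-up 8-SNP vectors for two individuals
xj = [0, 1, 1, 0, 0, 1, 0, 1]
print(allele_sharing(xi, xj))  # 0.875: 7 of the 8 alleles are shared
```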

Similarity measures of simulated data 10/40

NCOV (normalised covariance) similarity matrix figure. 11/40

Dimensionality reduction. SVD of the data matrix = PCA of the (suitably normalised) covariance = MDS on the (suitably normalised) distance matrix: all give the same eigenvalues and eigenvectors (the left eigenvectors do differ), ordering components from low dimensional structure to high dimensional detail. 12/40
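The equivalence claimed above can be checked numerically: the non-zero eigenvalues of the SNP-side matrix XᵀX and the individual-side matrix XXᵀ coincide, so SVD of X, PCA, and metric MDS recover the same spectrum. A pure-Python power-iteration sketch on toy data (a real analysis would use numpy/LAPACK):

```python
import math, random

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def top_eigenvalue(M, iters=500):
    """Largest eigenvalue of a symmetric non-negative matrix, by power iteration."""
    random.seed(1)
    v = [random.random() for _ in M]
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Mv = [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]
    return sum(vi * mvi for vi, mvi in zip(v, Mv))  # Rayleigh quotient

X = [[0, 1, 1, 0, 1],   # 3 individuals x 5 SNPs (toy data)
     [1, 1, 0, 0, 1],
     [0, 0, 1, 1, 0]]
XXt = matmul(X, transpose(X))  # individual-by-individual similarity (3 x 3)
XtX = matmul(transpose(X), X)  # SNP-by-SNP matrix (5 x 5)
print(abs(top_eigenvalue(XXt) - top_eigenvalue(XtX)) < 1e-6)  # True
```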

How many dimensions? All eigenvalue-oriented criteria: Kaiser (1960) criterion (EV > 1); scree test (Cattell 1966: large jump in the EV spectrum); Velicer's MAP criterion; Horn's Parallel Analysis (PA) criterion. Also consider the eigenvectors: Tracy-Widom distribution. 13/40

How many dimensions? MAP (Minimum Average Partial correlation; Velicer 1976): remove the largest eigenvalue; compute the (partial) correlation between the remaining eigenvectors and the data (accounting for previous eigenvectors); repeat. Parallel Analysis (Horn 1965): simulate many matrices of the same shape, and with the same mean and variance, as the data; keep all components bigger than some quantile of the simulated values. Tracy-Widom distribution (TW 1994, Patterson et al 2006): the theoretical distribution of the EVs of an L by N matrix (L is the (effective) number of SNPs); remove the biggest EV if it is bigger than some quantile; repeat. 14/40
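Two of the eigenvalue-oriented rules above are easy to state in code. A sketch applying the Kaiser (EV > 1, for a correlation matrix) and scree (retain everything before the largest gap) rules to a made-up, sorted eigenvalue spectrum:

```python
def kaiser(eigenvalues):
    """Number of components with eigenvalue > 1 (Kaiser 1960)."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

def scree(eigenvalues):
    """Number of components before the largest drop in the spectrum (Cattell 1966)."""
    gaps = [eigenvalues[i] - eigenvalues[i + 1]
            for i in range(len(eigenvalues) - 1)]
    return gaps.index(max(gaps)) + 1

evs = [5.2, 3.1, 0.9, 0.4, 0.3, 0.1]  # toy sorted eigenvalue spectrum
print(kaiser(evs), scree(evs))        # 2 2
```

Both rules agree on two retained components here; on real spectra they often disagree, which is why the slide lists several criteria.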

Clustering: MVN. Multivariate Normal ('soft' K-means, implemented by MCLUST in R). Infer a mean and variance for each cluster; BIC model selection for K. 15/40

Clustering: K-Means. Minimise the Euclidean distance to cluster centres. 'Hard' K-means: it uses the same distance penalty as MVN, but imposes a strict boundary. K estimated using the Calinski (1974) criterion (comparing the variance within clusters to that between clusters). 16/40
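The Calinski (1974) criterion compares between- to within-cluster variance, CH(K) = [B/(K-1)] / [W/(N-K)], and picks the K that maximises it. A minimal hard K-means plus this criterion on toy 1-D data containing three simulated groups (deterministic initialisation for reproducibility; real runs operate on the dimensionality-reduced genetic data):

```python
def kmeans(points, k, iters=50):
    pts = sorted(points)
    # deterministic initialisation: spread the centres over the sorted data
    centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: (p - centers[c]) ** 2)
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers

def calinski_harabasz(points, labels, centers, k):
    n = len(points)
    mean = sum(points) / n
    within = sum((p - centers[l]) ** 2 for p, l in zip(points, labels))
    between = sum(labels.count(c) * (centers[c] - mean) ** 2 for c in range(k))
    return (between / (k - 1)) / (within / (n - k))

data = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9, 10.0, 10.2, 9.8]  # three toy groups
scores = {}
for k in (2, 3):
    labels, centers = kmeans(data, k)
    scores[k] = calinski_harabasz(data, labels, centers, k)
print(scores[3] > scores[2])  # True: K=3 matches the three simulated groups
```

On tiny, well-separated toy data the criterion can keep rising past the true K, so only the K=2 vs K=3 comparison is shown.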

Clustering: Hierarchical methods. Successively merge the closest samples (or successively split; there are lots of ways to define 'close'), e.g. UPGMA (Unweighted Pair Group Method with Arithmetic mean). 'Close' here is the distance to the centroid of the clusters (e.g. Ward's (1963) minimum variance criterion). K estimated using the Calinski (1974) criterion. 17/40
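A bare-bones UPGMA sketch: repeatedly merge the pair of clusters with the smallest average inter-cluster distance. The 4-sample distance matrix below is invented for illustration; real runs use the similarity matrices from Part 1.

```python
def upgma(dist, labels):
    """Average-linkage agglomeration; returns the merge order."""
    clusters = [[i] for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average pairwise distance between the two clusters
                d = sum(dist[a][b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(labels[x] for x in clusters[i] + clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

d = [[0.0, 0.1, 0.8, 0.9],
     [0.1, 0.0, 0.7, 0.85],
     [0.8, 0.7, 0.0, 0.2],
     [0.9, 0.85, 0.2, 0.0]]
merges = upgma(d, ["A", "B", "C", "D"])
print(merges)  # A+B merge first, then C+D, then everything
```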

NCOV clustering results figures. 18/40

An algorithm to generate papers from student projects: roll 3 dice and refer to the table.

Dice roll | Similarity Measure                 | Dimensionality Reduction      | Clustering Algorithm
1         | IBD/ASD                            | None                          | MVN
2         | Covariance                         | PCA - MAP                     | K-Means
3         | Normalised Covariance              | PCA - Parallel Analysis       | Hierarchical (standard)
4         | Something from document clustering | PCA - Tracy-Widom             | Hierarchical (iteratively modifying data)
5         | Something model-based              | Spectral Graph Theory         | Something from CS literature
6         | Something else...                  | Something from image analysis | ???

19/40

Part 2: Genetics models. Similarity measures matter more than the clustering model: the MVN model seems best, and on PCA all models do similarly well, but no similarity measure is good enough. Time to understand why... 20/40

Ancestry process. Each generation, every individual randomly chooses a parent; the red individuals in the figure are the ancestors of those sampled. Take the large-population limit, keeping the rate of coalescence between pairs constant. Lots is known about the resulting coalescent tree distribution. 21/40
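The pairwise coalescence rate described above makes simulation very simple: while k lineages remain, the waiting time to the next coalescence is Exponential with rate k(k-1)/2 (in coalescent time units). A sketch checking the standard result E[T_MRCA] = 2(1 - 1/n) by simulation:

```python
import random

def tmrca(n, rng):
    """Time to the most recent common ancestor of n sampled lineages."""
    t, k = 0.0, n
    while k > 1:
        t += rng.expovariate(k * (k - 1) / 2)  # waiting time with k lineages
        k -= 1
    return t

rng = random.Random(42)
n = 10
sims = [tmrca(n, rng) for _ in range(20000)]
mean = sum(sims) / len(sims)
print(abs(mean - 2 * (1 - 1 / n)) < 0.05)  # True: close to the theory value 1.8
```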

Genetic model - Ancestral Tree (figure; time runs towards the present). Hein, Schierup and Wiuf, 'Gene Genealogies, Variation and Evolution', OUP 2005. 22/40

Ancestry process with recombination. With some probability an individual gets DNA from two parents, creating a graph structure. Coalescence rates are unchanged... but it is now a birth/death process: easily simulated, but with few analytical results. 23/40

Ancestral Recombination Graph (figure; time runs towards the present). Hein, Schierup and Wiuf, 'Gene Genealogies, Variation and Evolution', OUP 2005. 24/40

Ancestral Recombination Graph Summary. The Ancestral Recombination Graph (ARG) model, run backwards in time and ignoring unobserved ancestors, is equivalent to the forwards-in-time model: random mating within populations of known size, with no selection. Inference under the ARG is impossible for reasonable datasets. But when recombination is large, each SNP is independently drawn from a random tree. 25/40

Time to Most Recent Common Ancestor. Recall that each SNP has a random tree: ancestral population, mutation, some SNP distribution, present populations. McVean (2009) showed a result of the form E[similarity] = f(coalescence times), where f is simple and known; i.e. counting identical SNPs captures all the information in the tree. How do the times in the tree relate to population structure? 26/40

Genetic drift. 'Independent drift' of the SNP distribution (recall the relationship to diffusion): ancestral population, mutation, some SNP distribution, present populations. Kimura derived the exact distribution for this case... (nasty). Wright studied a related model, which leads to a Beta distribution of SNP frequency. A Normal approximation also exists, and is simpler for handling covariance effects. 27/40
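The drift process above is easy to simulate directly: under the Wright-Fisher model, the allele count each generation is Binomial(2N, p). A sketch (with toy population size and run length) showing that frequencies starting at 0.5 mostly fix at 0 or 1, while the mean across replicates stays near 0.5:

```python
import random

def drift(p0, N, generations, rng):
    """Wright-Fisher drift: the allele count each generation is Binomial(2N, p)."""
    p = p0
    for _ in range(generations):
        count = sum(rng.random() < p for _ in range(2 * N))
        p = count / (2 * N)
    return p

rng = random.Random(7)
finals = [drift(0.5, N=30, generations=300, rng=rng) for _ in range(200)]
fixed = sum(f in (0.0, 1.0) for f in finals)
mean = sum(finals) / len(finals)
print(fixed > 150, abs(mean - 0.5) < 0.15)  # True True
```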

Direct model-based clustering. Population model: a Beta distribution for the SNP frequencies in each population; assume individuals are exchangeable within populations. This gives a likelihood for the frequency of SNPs: a Binomial distribution. Assume no linkage (linkage approximations exist). This gives the popular STRUCTURE* model, which still can't cope with large datasets. Can we do this well on genomic (linked) data? *Pritchard, Stephens and Donnelly, Genetics, 155:945-959, 2000. 28/40
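In its simplest unlinked form, the STRUCTURE-style likelihood above is a product of Binomial(2, p_l) terms over SNPs. A sketch assigning one individual to the population maximising this log-likelihood; the allele frequencies and genotype are invented for illustration:

```python
import math

def loglik(genotype, freqs):
    """genotype[l] in {0, 1, 2}: copies of allele 1 at SNP l;
    freqs[l]: the population frequency of allele 1 at SNP l."""
    ll = 0.0
    for g, p in zip(genotype, freqs):
        # Binomial(2, p) log-probability of observing g copies
        ll += math.log(math.comb(2, g)) + g * math.log(p) + (2 - g) * math.log(1 - p)
    return ll

pop_freqs = {              # made-up per-population allele frequencies
    "pop1": [0.9, 0.8, 0.1, 0.2],
    "pop2": [0.1, 0.2, 0.9, 0.8],
}
genotype = [2, 2, 0, 1]
best = max(pop_freqs, key=lambda k: loglik(genotype, pop_freqs[k]))
print(best)  # pop1
```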

ChromoPainter. 'Coancestry' similarity matrix. Unlinked limit: normalised allele sharing. See: Li and Stephens, Genetics 165:2213-2233, 2003. 29/40

ESU 30/40

FineSTRUCTURE. Population structure model: individuals are exchangeable within populations, and populations donate chunks independently at a characteristic rate. Ingredients: the coancestry matrix; the population assignment; the donation frequency of each population; the number of individuals to donate from. 31/40

Probability of a partition. A Dirichlet Process prior for the partition. The rows of the donation matrix (i.e. the biological parameters) are Dirichlet, so conjugate, and we integrate them out. (Idea: add each individual, update the Dirichlet posterior, and use it as the prior for the next individual.) 32/40
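The conjugacy idea in parentheses above can be written out: with a Dirichlet prior on a row of donation probabilities, the marginal likelihood of a set of count vectors is a product of Dirichlet-multinomial predictives, updating the pseudo-counts after each individual. A sketch with toy counts and hyperparameters (note the result is invariant to the order in which individuals are added, as exchangeability requires):

```python
import math

def log_dirmult_predictive(counts, alpha):
    """log p(counts | alpha) for one individual's multinomial count vector,
    under the Dirichlet-multinomial (Polya) distribution."""
    n, a0 = sum(counts), sum(alpha)
    ll = math.lgamma(n + 1) + math.lgamma(a0) - math.lgamma(a0 + n)
    for c, a in zip(counts, alpha):
        ll += math.lgamma(c + a) - math.lgamma(a) - math.lgamma(c + 1)
    return ll

def sequential_marginal(individuals, alpha):
    """Add individuals one at a time; the posterior becomes the prior
    for the next individual."""
    alpha = list(alpha)
    total = 0.0
    for counts in individuals:
        total += log_dirmult_predictive(counts, alpha)
        alpha = [a + c for a, c in zip(alpha, counts)]  # conjugate update
    return total

inds = [[3, 1, 0], [2, 2, 0], [0, 1, 3]]  # toy per-individual counts
print(round(sequential_marginal(inds, [1.0, 1.0, 1.0]), 4))
```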

Proven theoretical results. To O(N), the coancestry matrix is a rotation of the eigenvector matrix, if SNPs are uncorrelated and the number of individuals is large. To O(N), the FineSTRUCTURE likelihood is equivalent to the STRUCTURE* likelihood (a MVN likelihood with a structured covariance) if SNPs are uncorrelated, drift is weak, and genotyped SNPs are not very rare. With the linkage model we do better. *Pritchard, Stephens and Donnelly, Genetics, 155:945-959, 2000. Calculations due to Simon Myers. 33/40

Comparison of linked methods 34/40

HGDP data. 938 individuals worldwide; 650k SNPs, linked (but relatively weakly). Known to contain structure at all scales, but previous models missed this. Similarity approaches can analyse the whole dataset. Focus: East Asian individuals. 35/40

ChromoPainter 36/40

ChromoPainter Eigenvector space 37/40

ChromoPainter Correlation 38/40

Conclusions. In general: models matter; summaries are part of the model; 'model-free' approaches are still making assumptions! Genetics: normalising the variance is important (the other unlinked measures are all worse); linked models extract more information; there is no 'correct' linked model at this stage! The ChromoPainter/FineSTRUCTURE pipeline is the most robust option. 39/40

Acknowledgements: FineSTRUCTURE. Garrett Hellenthal (Oxford): CP algorithm. Simon Myers (Oxford): theory. Daniel Falush (Max Planck Institute): CP/FS concept. Peter Green (Bristol). Grant and support: Bluecrystal HPC facilities @ Bristol. FineSTRUCTURE code & GUI: www.paintmychromosomes.com. 40/40

FastIBD 41/40

FastIBD (Browning and Browning 2011). An alternative linked model: identify the r closest segments of DNA for each pair of individuals; the genetic length of each is related to the time since their common ancestor. Similarity measure: the sum of the genetic lengths found for each pair. Somewhat heuristic, with some tuning parameters, but it empirically works well. 42/40
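The FastIBD-style similarity just described can be sketched in a few lines, interpreting the 'r closest' segments as the r longest shared segments for each pair. The segment lists below are invented for illustration; FastIBD infers them from haplotype data.

```python
def fastibd_similarity(segments, r=3):
    """segments: genetic lengths (cM) of shared segments for one pair of
    individuals; the similarity is the summed length of the r longest."""
    return sum(sorted(segments, reverse=True)[:r])

pair_segments = {                      # toy per-pair shared segments (cM)
    ("i", "j"): [5.2, 0.4, 3.1, 0.2],
    ("i", "k"): [0.5, 0.3],
}
sim = {pair: fastibd_similarity(segs) for pair, segs in pair_segments.items()}
print(sim[("i", "j")] > sim[("i", "k")])  # True: i and j share longer segments
```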

Comparison of clusterings 43/40

44/40

Structure in the copy matrix (figure; populations: Tuscan, Italian, French, Orcadian, Russian, Adygei, Sardinian, Basque). 45

Weak Biological Model for the prior. The 'correct' Ancestral Recombination Graph for the limit of large populations at large time, with simple population structure. (Figure: an ancestral population splits into populations P1, P2, ..., PK, each with a frequency of donating; time increases with drift F.) 46/40

Posterior evaluation. MCMC updates of the hyperparameters and the partition. Partition moves: move an individual; merge; split; merge and resplit. The merge/split 'nearly Gibbs' move builds the partition sequentially, p(q_{1:m}) = p(q_1) p(q_2 | q_1) ... p(q_m | q_{1:m-1}), assigning individual m to population a with probability p(q_m = a) ∝ n_a ∫ F(x_m | P_a) dH_{<m,S_a}(P_a). (Not exact, as the 'unsplit' population interacts with the remaining dataset.) Simple case: Pella and Masuda, Canadian J. Fish. Aquatic Science 63:576-596, 2003. 47/40

Clustering into K clusters: find similar individuals. Three main approaches: cluster on the raw data; cluster on the similarity matrix; cluster on a dimensionality-reduced version of the data, e.g. MDS/PCA/SVD. Recall: the lowest-dimensional description is usually best... The raw-data approach is terrible here (without a good model). 48/40

The future. Admixture model: pure population structure is not correct; recent mixing leads to admixture. Seek a conjugate mixture model for individuals: a Hierarchical Dirichlet Process! Interpretation: pure populations are created by drift, and we observe mixtures of them. Better model: allow drift and admixture to both occur in real time. This requires a more sophisticated model; can we keep conjugacy? (Matrix Coalescent* results available; Dirichlet diffusion tree** concept.) *Wooding and Rogers, Genetics, 161:1641-1650, 2002. **Neal, in J. M. Bernardo et al. (eds.), Bayesian Statistics 7, pp. 619-629, 2003. 49/40

Mixing: Pairwise coincidence. HGDP data (650k SNPs, 938 individuals; shown for the hardest continent, Central South Asia). Run 1 vs Run 2; individual labels not shown. 50/40
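The pairwise coincidence summary used here is straightforward to compute from MCMC output: entry (i, j) is the fraction of sampled partitions in which individuals i and j fall in the same population. A sketch with toy partition samples:

```python
def coincidence(partitions, n):
    """Pairwise coincidence matrix from a list of partition samples,
    each a length-n list of population labels."""
    M = [[0.0] * n for _ in range(n)]
    for part in partitions:
        for i in range(n):
            for j in range(n):
                if part[i] == part[j]:
                    M[i][j] += 1
    s = len(partitions)
    return [[v / s for v in row] for row in M]

samples = [          # three toy MCMC partition samples of 4 individuals
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
]
M = coincidence(samples, 4)
print(M[0][1])  # 1.0: individuals 0 and 1 always cluster together
```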

MAP tree: whole world. HGDP data: ~650,000 SNPs, 938 individuals. Scales shown: continuous populations, continents, self-ID'd populations. 51/40

Posterior evaluation: building block. Sample from the posterior by building the partition sequentially: p(q_{1:m}) = p(q_1) p(q_2 | q_1) ... p(q_m | q_{1:m-1}). Metropolis-Hastings proposal for a split: random individuals create populations a and b from c, then move the rest from c with probability p(q_m = a) ∝ n_a ∫ F(x_m | P_a) dH_{<m,S_a}(P_a) = n_a [P(S_a, {i=1,...,m}) P(S_c, {i=1,...,m})] / [P(S_a, {i=1,...,m-1}) P(S_c, {i=1,...,m-1})]. (Not exact, as the 'unsplit' population interacts with the remaining dataset.) Exact case: Pella and Masuda, Canadian J. Fish. Aquatic Science 63:576-596, 2003. 52/40

Probability of a partition. The rows of P_ab are Dirichlet: conjugate to the multinomial, summing to 1, and a weak prior. Compute the posterior incrementally due to conjugacy: p(x_a | q) = ∏_{m ∈ a} ∫ F(x_m | P_a, q) dH_{<m,S_a}(P_a), where H_{<m,S_a}(P_a) = Dirichlet(P_a; {β_ab + x_{<m,b}}_{b=1,...,K}). (Idea: add each individual, update the Dirichlet posterior, and use it as the prior for the next individual.) 53/40

Final model. Posterior: p(X | q) is a product over populations a, b = 1,...,K of terms in the coancestry counts x_ab (normalised by c_ab) and the population sizes n_a, n_b. Prior for the hyperparameters: β_ab takes different forms for a ≠ b and a = b, combining the ancestral donation frequency with V_b, where 1/V_b = (1 - F)/F; F is drift due to mutation, and V_b is drift in allele frequency. 54/40

Posterior visualisation. Too many populations! Use the pairwise coincidence matrix (a population summary of the data matrix X); create a MAP (maximum a posteriori) tree from the MAP partition; show the posterior support for each partition split. 55