Multivariate analysis of genetic data: exploring group diversity

Multivariate analysis of genetic data: exploring group diversity. Thibaut Jombart, Marie-Pauline Beugin. MRC Centre for Outbreak Analysis and Modelling, Imperial College London. Genetic data analysis with PR Statistics, Millport Field Station, 18 Aug 2016.

Outline: Introduction; Identifying groups (Hierarchical clustering, K-means); Exploring group diversity (Aggregating data, Optimizing group differences, Discriminant Analysis of Principal Components).

Genetic data: introducing group data. (Illustration: an individuals-by-alleles table of allele frequencies, with alleles grouped by marker so that frequencies sum to 1 within each marker; a second view shows the same individuals arranged into groups 1-3.) How to identify groups? How to explore group diversity?

Hierarchical clustering: a variety of algorithms, including single linkage, complete linkage, UPGMA, Ward, and others.

Rationale: 1. compute pairwise genetic distances D (or similarities); 2. group the closest pair(s) together; 3. (optional) update D; 4. return to 2) until no new group can be made.

Rationale (illustration): individuals a-e plotted in allele space, and the corresponding pairwise distance matrix D.
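A minimal base R sketch of this rationale (toy data only; X, the individuals-by-alleles table of allele frequencies, is a made-up stand-in): dist() computes the pairwise distance matrix D and hclust() performs the successive merges.

```r
## Toy individuals-by-alleles table of allele frequencies (hypothetical data)
set.seed(1)
X <- matrix(runif(20 * 6), nrow = 20,
            dimnames = list(letters[1:20], paste0("allele", 1:6)))

D   <- dist(X)                        # 1. pairwise distances between individuals
tre <- hclust(D, method = "single")   # 2.-4. successive merges (single linkage here)

plot(tre)                             # dendrogram of the merges
cutree(tre, k = 3)                    # cut the tree into 3 groups
```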

Differences between algorithms: when individuals i and j are merged into a group g, the distance from any other element k to g is recomputed as: single linkage, D(k,g) = min(D(k,i), D(k,j)); complete linkage, D(k,g) = max(D(k,i), D(k,j)); UPGMA, D(k,g) = (D(k,i) + D(k,j)) / 2.

Differences between algorithms (illustration): dendrograms of the same data under single linkage, complete linkage, and UPGMA (average linkage).
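The three update rules map onto the method argument of hclust(); a short sketch comparing them on the same toy distance matrix (UPGMA corresponds to method = "average"):

```r
## Same kind of toy data as above; only the linkage rule changes between panels
set.seed(1)
X <- matrix(runif(20 * 6), nrow = 20)
D <- dist(X)

op <- par(mfrow = c(1, 3))
plot(hclust(D, method = "single"),   main = "Single linkage")
plot(hclust(D, method = "complete"), main = "Complete linkage")
plot(hclust(D, method = "average"),  main = "UPGMA (average linkage)")
par(op)
```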

K-means underlying model (ANOVA decomposition): total variance = (variance between groups) + (variance within groups). (Illustration: variability between groups of individuals versus variability within groups.)

K-means rationale: find groups which minimize the within-group variance (equivalently, maximize the between-group variance). In other words, identify a partition G = {g_1, ..., g_K} solving:

argmin over G = {g_1, ..., g_K} of sum_k sum_{i in g_k} ||x_i - mu_k||^2

with x_i in R^p the vector of allele frequencies of individual i, and mu_k in R^p the vector of mean allele frequencies of group k.
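As a sanity check on this objective, a short sketch computing the within-group sum of squares for an arbitrary partition of a toy table (X and grp are hypothetical stand-ins, not data from the slides):

```r
## Toy allele frequencies and an arbitrary partition into K = 3 groups
set.seed(1)
X   <- matrix(runif(20 * 6), nrow = 20)
grp <- factor(sample(1:3, 20, replace = TRUE))

## Within-group sum of squares: sum_k sum_{i in g_k} ||x_i - mu_k||^2
within_ss <- sum(sapply(levels(grp), function(k) {
  Xk <- X[grp == k, , drop = FALSE]
  mu <- colMeans(Xk)                  # group mean allele frequencies
  sum(sweep(Xk, 2, mu)^2)             # squared distances to the group mean
}))
within_ss
```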

K-means algorithm: the K-means problem is solved by the following algorithm: 1. select random group means (mu_k, k = 1, ..., K); 2. assign each individual x_i to the closest group g_k; 3. update the group means mu_k; 4. go back to 2) until convergence (groups no longer change).

K-means algorithm (illustration): individuals and group means in allele space at initialization, after step 1, and after step 2.
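A minimal sketch of this loop, for illustration only (X is a toy table; in practice stats::kmeans() implements the same idea far more efficiently):

```r
## Toy allele frequencies (hypothetical data)
set.seed(1)
X <- matrix(runif(50 * 4), nrow = 50)
K <- 3

mu <- X[sample(nrow(X), K), , drop = FALSE]          # 1. random group means

repeat {
  ## 2. assign each individual to the closest group mean
  d   <- as.matrix(dist(rbind(mu, X)))[-(1:K), 1:K]
  grp <- apply(d, 1, which.min)

  ## 3. update the group means (keep the old mean if a group becomes empty)
  new_mu <- t(sapply(1:K, function(k) {
    if (any(grp == k)) colMeans(X[grp == k, , drop = FALSE]) else mu[k, ]
  }))

  ## 4. stop at convergence (means no longer change)
  if (max(abs(new_mu - mu)) < 1e-8) break
  mu <- new_mu
}

table(grp)                     # resulting group sizes
## stats::kmeans(X, centers = 3) does the same job in one call
```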

K-means: limitations and extensions. Limitations: slower for large numbers of alleles (e.g. 100,000); K-means does not identify the number of clusters (K). Extensions: run K-means after dimension reduction using PCA; try increasing values of K; use the Bayesian Information Criterion (BIC) for model selection.
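These extensions are what adegenet's find.clusters() implements (PCA reduction, then K-means for increasing K, with BIC to compare solutions). A hedged sketch; argument names follow recent adegenet versions, and microbov is the cattle dataset shipped with the package:

```r
library(adegenet)
data(microbov)   # cattle microsatellite data distributed with adegenet

## Interactive use: inspect the BIC curve and choose the number of PCs and K
## clust <- find.clusters(microbov, max.n.clust = 20)

## Non-interactive use: retain 100 PCs and ask directly for 15 clusters
clust <- find.clusters(microbov, n.pca = 100, n.clust = 15)

table(clust$grp)   # inferred group memberships
```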

Genetic clustering using K-means & BIC (Jombart et al. 2010, BMC Genetics). Simulated data: island model with 6 populations. Performance: K-means matches STRUCTURE on simulated data (various island and stepping-stone models) while being orders of magnitude faster (seconds vs hours/days).

Why identifying clusters is not the whole story. Example: cattle breed diversity (30 microsatellites, 704 individuals). (Illustration: group membership probabilities for each breed, and a multivariate analysis, a scatterplot with DA eigenvalues, of the breeds Borgou, Somba, Lagunaire, NDama, Zebu, Aubrac, Bazadais, BlondeAquitaine, BretPieNoire, Charolais, Gascon, Limousin, MaineAnjou, Montbeliard and Salers.) It is important to assess the relationships between clusters.

Aggregating data by groups (illustration): average the individuals-by-alleles table within each group to obtain a groups-by-alleles table, then run a multivariate analysis of the group allele frequencies.
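A small base R sketch of this aggregation step (X and grp are hypothetical stand-ins); with adegenet objects, genind2genpop() performs the equivalent individual-to-group aggregation:

```r
## Toy individuals-by-alleles frequency table and group labels (hypothetical)
set.seed(1)
X   <- matrix(runif(20 * 6), nrow = 20)
grp <- factor(sample(paste0("group", 1:4), 20, replace = TRUE))

## Groups-by-alleles table of mean allele frequencies
X_grp <- rowsum(X, grp) / as.vector(table(grp))
round(X_grp, 2)
```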

Analysing group data. Available methods: Principal Component Analysis (PCA) of the allele frequency table; genetic distances between populations followed by Principal Coordinates Analysis (PCoA); Correspondence Analysis (CA) of allele counts. Criticisms: individual information is lost; within-group diversity is neglected; CA can produce artefactual outliers.
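A hedged sketch of these group-level analyses using ade4 (which adegenet builds on) and base R; X_grp stands for a hypothetical groups-by-alleles frequency table such as the one computed above:

```r
library(ade4)

## Toy groups-by-alleles frequency table (hypothetical)
set.seed(1)
X_grp <- matrix(runif(4 * 6), nrow = 4)

## PCA of the group allele frequency table
pca_grp <- dudi.pca(as.data.frame(X_grp), scannf = FALSE, nf = 2)

## PCoA (classical multidimensional scaling) on pairwise distances between groups
pco_grp <- cmdscale(dist(X_grp), k = 2)

## CA would instead be applied to a table of allele counts, e.g.
## coa_grp <- dudi.coa(as.data.frame(count_table), scannf = FALSE, nf = 2)
```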

Multivariate analysis: reminder. Find principal components with maximum total variance.

But total variance may not reflect group differences. We need to optimize different criteria.

Optimizing different criteria. Approaches similar to PCA can be used to optimize different quantities: PCA: total variance; between-group PCA: variance between groups; within-group PCA: variance within groups; Discriminant Analysis: variance between groups / variance within groups.
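These criteria correspond to existing ade4 functions; a hedged sketch on toy data (X and grp are hypothetical stand-ins), assuming the usual bca(), wca() and discrimin() interfaces:

```r
library(ade4)

## Toy individual-level table and grouping factor (hypothetical)
set.seed(1)
X   <- data.frame(matrix(runif(60 * 6), nrow = 60))
grp <- factor(rep(paste0("group", 1:3), each = 20))

pca_ind <- dudi.pca(X, scannf = FALSE, nf = 5)                 # total variance

bpca <- bca(pca_ind, fac = grp, scannf = FALSE, nf = 2)        # between-group PCA
wpca <- wca(pca_ind, fac = grp, scannf = FALSE, nf = 2)        # within-group PCA
disc <- discrimin(pca_ind, fac = grp, scannf = FALSE, nf = 2)  # discriminant analysis
```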

From PCA to DA: increasing group differentiation (illustration: densities of individuals along axis 1 for each method). Classical PCA maximizes total diversity; between-group PCA maximizes diversity between groups; Discriminant Analysis maximizes the separation of groups.

Discriminant Analysis: limitations and extensions. Limitations: DA requires fewer variables (alleles) than observations (individuals); DA requires uncorrelated variables, a condition violated by allele frequencies and linkage disequilibrium. Discriminant Analysis of Principal Components (DAPC; Jombart et al. 2010, BMC Genetics): data orthogonalisation/reduction using PCA before DA; overcomes the limitations of DA; provides group membership probabilities and group prediction.

Rationale of DAPC (schematic): the n x p data table is first reduced by PCA to r principal components (orthogonal variables) with their principal axes (allele loadings); Discriminant Analysis is then applied to these components to produce s discriminant functions (synthetic variables).
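A hedged sketch of DAPC with adegenet's dapc() on the cattle data used earlier (argument names follow recent adegenet versions):

```r
library(adegenet)
data(microbov)

## PCA reduction to 40 components, then DA retaining 5 discriminant functions
dapc_cattle <- dapc(microbov, pop(microbov), n.pca = 40, n.da = 5)

scatter(dapc_cattle)            # individuals on the discriminant functions
head(dapc_cattle$posterior)     # group membership probabilities
compoplot(dapc_cattle)          # membership probabilities per individual
```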

PCA of seasonal influenza (A/H3N2) data. Data: seasonal influenza (A/H3N2), 500 HA segments. PCA suggests little temporal evolution and an apparent burst of diversity in 2002, which seems suspicious.

DAPC of seasonal influenza (A/H3N2) data. DAPC reveals a strong temporal signal and the originality of the 2006 isolates (new alleles).

Other features. DAPC can also: provide group assignment probabilities; use supplementary individuals; predict group membership of new data; be used for variable selection.
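A hedged sketch of the prediction feature, using the predict method for dapc objects; the split of microbov into "training" and "new" individuals is purely illustrative:

```r
library(adegenet)
data(microbov)

train <- microbov[1:654, ]      # individuals used to fit the DAPC
new   <- microbov[655:704, ]    # individuals treated as 'new' data

dapc_train <- dapc(train, pop(train), n.pca = 40, n.da = 5)
pred <- predict(dapc_train, newdata = new)

head(pred$assign)       # predicted group for each new individual
head(pred$posterior)    # membership probabilities for the new individuals
```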

Time to get your hands dirty (again)! The pdf of the practical is online at http://adegenet.r-forge.r-project.org/ or can be found by searching for the adegenet documents (GDAR August 2016).