Multivariate analysis of genetic data: exploring group diversity
T. Jombart, Imperial College London
Bogotá, 01-12-2010
Outline
- Introduction
- Clustering algorithms
  - Hierarchical clustering
  - K-means
- Multivariate Analysis with group information
  - Analysis of population data
  - Between-group PCA
  - Discriminant Analysis
  - Discriminant Analysis of Principal Components
Genetic data: recall
[Figure: table of individuals x alleles for several markers; within each marker, allele frequencies sum to 1. The same individuals can be partitioned into groups (group 1, group 2, group 3).]
How to define groups? How to handle group information?
Finding and using group information
Finding groups:
- hierarchical clustering: single linkage, complete linkage, UPGMA
- K-means
Using group information:
- multivariate analysis of group frequencies
- using groups as partitions: discriminant analysis
A variety of algorithms
- single linkage
- complete linkage
- UPGMA
- Ward...
Rationale
1. compute pairwise genetic distances $D$ (or similarities)
2. group the closest pair(s) together
3. (optional) update $D$
4. return to 2) until no new group can be made
[Figure: toy example, five individuals (a-e) scored for several alleles; their pairwise distance matrix $D$ is computed.]
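A minimal sketch of steps 1-4 in Python, assuming Euclidean distance on a toy allele-frequency matrix (any genetic distance could be substituted):

```python
# Hierarchical clustering on toy data (rows = individuals, columns = alleles).
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.random((5, 10))            # 5 individuals (a-e) x 10 alleles, toy data

D = pdist(X)                       # step 1: pairwise distances (condensed form)
Z = linkage(D, method="average")   # steps 2-4: merge closest clusters, update D

dendrogram(Z, labels=list("abcde"))
plt.show()                         # the resulting tree
```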
Differences between algorithms
When clusters $i$ and $j$ (at distance $D_{i,j}$) are merged into a new group $g$, the distance from any other cluster $k$ to $g$ is updated as:
- single linkage: $D_{k,g} = \min(D_{k,i}, D_{k,j})$
- complete linkage: $D_{k,g} = \max(D_{k,i}, D_{k,j})$
- UPGMA: $D_{k,g} = \frac{D_{k,i} + D_{k,j}}{2}$
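The three update rules side by side, as a small sketch (this follows the simple-average form of UPGMA given above; implementations often weight the average by cluster sizes):

```python
# Distance from an outside cluster k to the newly merged group g = {i, j},
# given D(k, i) and D(k, j).
def update_distance(d_ki: float, d_kj: float, method: str) -> float:
    if method == "single":      # single linkage: closest member
        return min(d_ki, d_kj)
    if method == "complete":    # complete linkage: furthest member
        return max(d_ki, d_kj)
    if method == "upgma":       # UPGMA (simple-average form shown above)
        return (d_ki + d_kj) / 2
    raise ValueError(f"unknown method: {method}")
```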
Differences between algorithms
[Figure: the same toy data clustered with single linkage, complete linkage, and UPGMA (average linkage); the three methods yield different dendrograms.]
Variability between and within groups
[Figure: individuals scattered in clusters, illustrating variability between groups versus variability within groups.]
K-means: underlying model
Univariate ANOVA model (ss: sum of squares):
$$sst(x) = ssb(x) + ssw(x)$$
with:
- $k = 1, \ldots, K$: number of groups
- $\mu_k$: mean of group $k$; $\mu$: mean of all data
- $g_k$: set of individuals in group $k$ ($i \in g_k \Leftrightarrow i$ is in group $k$)
- $sst(x) = \sum_i (x_i - \mu)^2$: total variation
- $ssb(x) = \sum_k \sum_{i \in g_k} (\mu_k - \mu)^2$: variation between groups
- $ssw(x) = \sum_k \sum_{i \in g_k} (x_i - \mu_k)^2$: (residual) variation within groups
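A tiny worked example with made-up numbers: $x = (1, 2, 5, 6)$ split into $g_1 = \{1, 2\}$ and $g_2 = \{5, 6\}$, so $\mu_1 = 1.5$, $\mu_2 = 5.5$, $\mu = 3.5$:

```latex
\begin{align*}
sst(x) &= 2.5^2 + 1.5^2 + 1.5^2 + 2.5^2 = 17\\
ssb(x) &= 2 \times (1.5 - 3.5)^2 + 2 \times (5.5 - 3.5)^2 = 16\\
ssw(x) &= 0.5^2 + 0.5^2 + 0.5^2 + 0.5^2 = 1\\
ssb(x) + ssw(x) &= 16 + 1 = 17 = sst(x)
\end{align*}
```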
K-means: underlying model
Extension to multivariate data ($X \in \mathbb{R}^{n \times p}$, rows $x_i$ = individuals, columns = alleles):
$$SST(X) = SSB(X) + SSW(X)$$
with:
- $\mu_k$: vector of means of group $k$; $\mu$: vector of means of all data
- $SST(X) = \sum_i \|x_i - \mu\|^2$: total variation
- $SSB(X) = \sum_k \sum_{i \in g_k} \|\mu_k - \mu\|^2$: variation between groups
- $SSW(X) = \sum_k \sum_{i \in g_k} \|x_i - \mu_k\|^2$: (residual) variation within groups
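The decomposition is easy to check numerically; a sketch with arbitrary toy data and an arbitrary grouping:

```python
# Numerical check of SST(X) = SSB(X) + SSW(X).
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((30, 8))                    # 30 individuals x 8 alleles
groups = rng.integers(0, 3, size=30)       # 3 arbitrary groups

mu = X.mean(axis=0)                        # overall mean vector
sst = ((X - mu) ** 2).sum()

ssb = ssw = 0.0
for k in np.unique(groups):
    Xk = X[groups == k]
    mu_k = Xk.mean(axis=0)
    ssb += len(Xk) * ((mu_k - mu) ** 2).sum()  # sum over i in g_k of ||mu_k - mu||^2
    ssw += ((Xk - mu_k) ** 2).sum()

assert np.isclose(sst, ssb + ssw)          # the decomposition holds exactly
```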
K-means rationale
Find $K$ groups $G = \{g_1, \ldots, g_K\}$ minimizing the within-groups sum of squares (SSW):
$$\arg\min_{G = \{g_1, \ldots, g_K\}} \sum_k \sum_{i \in g_k} \|x_i - \mu_k\|^2$$
Note: since the total sum of squares is fixed, this equates to finding groups maximizing the between-groups sum of squares.
K-means algorithm
The K-means problem is solved by the following algorithm:
1. select random group means ($\mu_k$, $k = 1, \ldots, K$)
2. assign each individual $x_i$ to the closest group $g_k$
3. update the group means $\mu_k$
4. go back to 2) until convergence (groups no longer change)
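A minimal sketch of this algorithm (often called Lloyd's algorithm), assuming Euclidean distance; library implementations add multiple random restarts and smarter initialisation:

```python
import numpy as np

def kmeans(X: np.ndarray, K: int, seed: int = 0, max_iter: int = 100):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # 1. random initial means
    groups = None
    for _ in range(max_iter):
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # n x K
        new_groups = dists.argmin(axis=1)          # 2. assign to closest mean
        if groups is not None and np.array_equal(new_groups, groups):
            break                                  # 4. converged: groups unchanged
        groups = new_groups
        for k in range(K):                         # 3. update group means
            if np.any(groups == k):
                mu[k] = X[groups == k].mean(axis=0)
    return groups, mu
```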
K-means algorithm
[Figure: K-means on a toy individuals x alleles dataset: random initialization of group means, then alternating assignment and mean-update steps (step 1, step 2) until convergence.]
Which K?
- K-means does not identify the number of clusters ($K$)
- each K-means solution is a model with a likelihood
- model selection can be used to select $K$
Using the Bayesian Information Criterion (BIC)
Defined as:
$$BIC = -2\log(L) + k\log(n)$$
with:
- $L$: likelihood
- $k$: number of parameters
- $n$: number of observations (individuals)
Smallest BIC = best model.
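In practice this needs a likelihood for each K-means solution. A hedged sketch under a simple spherical-Gaussian approximation, where the log-likelihood reduces to a function of SSW and BIC simplifies to roughly $n\log(SSW/n) + K\log(n)$ (an assumption made here for illustration, not necessarily the exact criterion of the original work):

```python
# Score K-means solutions over a range of K; kmeans() is the sketch above.
import numpy as np

def bic_kmeans(X: np.ndarray, K: int) -> float:
    groups, mu = kmeans(X, K)
    ssw = sum(((X[groups == k] - mu[k]) ** 2).sum() for k in range(K))
    n = len(X)
    return n * np.log(ssw / n) + K * np.log(n)

# bics = {K: bic_kmeans(X, K) for K in range(1, 21)}
# best_K = min(bics, key=bics.get)     # smallest BIC = best model
```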
K-means and BIC: example
Simulated data: 6 populations in an island model.
[Figure: value of BIC versus number of clusters.]
(Jombart et al. 2010, BMC Genetics)
Aggregating data by groups
[Figure: the individuals x alleles table is averaged within each group (groups 1-4), giving a smaller groups x alleles table.]
→ multivariate analysis of group allele frequencies.
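A minimal sketch of the aggregation step, assuming a matrix `X` of individual allele frequencies and a vector `groups` of group labels as in the earlier sketches:

```python
# Average allele frequencies within groups: (groups x alleles) table.
import numpy as np

def group_frequencies(X: np.ndarray, groups: np.ndarray) -> np.ndarray:
    labels = np.unique(groups)
    return np.vstack([X[groups == k].mean(axis=0) for k in labels])

# F = group_frequencies(X, groups)   # rows = groups, columns = alleles
# F can then be analysed with PCA, PCoA, CA, ...
```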
Analysing group data
Available methods:
- Principal Component Analysis (PCA) of the allele frequency table
- genetic distances between populations → Principal Coordinates Analysis (PCoA)
- Correspondence Analysis (CA) of allele counts
Criticisms:
- loses individual information
- neglects within-group diversity
- CA: possible artefactual outliers
Using groups as partitions
Groups are coded as dummy vectors in $H$: for the individuals x alleles table $X$, $H$ is the $n \times K$ matrix with $h_{ik} = 1$ if individual $i$ belongs to group $k$ and $0$ otherwise (here, four groups, one indicator column each).
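Building the dummy matrix $H$ from a vector of group labels is a one-liner; a sketch:

```python
# One-hot group membership: H[i, k] == 1.0 iff individual i is in group k.
import numpy as np

def dummy_matrix(groups: np.ndarray) -> np.ndarray:
    labels = np.unique(groups)
    return (groups[:, None] == labels[None, :]).astype(float)  # n x K

# H = dummy_matrix(groups)
```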
The multivariate ANOVA model
The data matrix $X$ can be decomposed as:
$$X = PX + (I - P)X$$
where:
- $P$ is the projector onto $H$: $P = H(H^T D H)^{-1} H^T D$, $D$ being a metric in $\mathbb{R}^n$
- $PX$ is an $n \times p$ matrix in which each observation is replaced by its group average
The multivariate ANOVA model
Variation partition (same as in K-means):
$$VAR(X) = B(X) + W(X)$$
where:
- $VAR(X) = \mathrm{trace}(X^T D X)$
- $B(X) = \mathrm{trace}(X^T P^T D P X)$
- $W(X) = \mathrm{trace}(X^T (I - P)^T D (I - P) X)$
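A numerical sketch of the projector and the partition, assuming uniform row weights $D = I/n$ (a common choice; any metric would do):

```python
# P replaces each row by its group average, and the traces add up.
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((30, 8))                       # toy individuals x alleles table
groups = rng.integers(0, 3, size=30)
H = (groups[:, None] == np.unique(groups)[None, :]).astype(float)

n = len(X)
Xc = X - X.mean(axis=0)                       # centre the data
D = np.eye(n) / n                             # uniform metric in R^n
P = H @ np.linalg.inv(H.T @ D @ H) @ H.T @ D  # projector onto H
I = np.eye(n)

var = np.trace(Xc.T @ D @ Xc)
b   = np.trace(Xc.T @ P.T @ D @ P @ Xc)
w   = np.trace(Xc.T @ (I - P).T @ D @ (I - P) @ Xc)
assert np.isclose(var, b + w)                 # VAR(X) = B(X) + W(X)
```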
Between-group analysis
$$VAR(X) = B(X) + W(X)$$
- Classical PCA: decompose $VAR(X)$, i.e. find $u$ so that $var(Xu)$ is maximum
- Between-group PCA: decompose $B(X)$, i.e. find $u$ so that $b(Xu)$ is maximum
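A minimal sketch of between-group PCA: run PCA on the group-averaged table $PX$, i.e. diagonalise the between-group matrix $X^T P^T D P X$ (same toy setup and uniform metric as above):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((30, 8))
groups = rng.integers(0, 3, size=30)
H = (groups[:, None] == np.unique(groups)[None, :]).astype(float)

n = len(X)
Xc = X - X.mean(axis=0)
D = np.eye(n) / n                             # uniform metric
P = H @ np.linalg.inv(H.T @ D @ H) @ H.T @ D  # group-averaging projector

B = (P @ Xc).T @ D @ (P @ Xc)                 # p x p between-group matrix
eigval, eigvec = np.linalg.eigh(B)            # ascending eigenvalues
u = eigvec[:, -1]                             # leading axis: maximises b(Xu)
scores = Xc @ u                               # individual scores on that axis
```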
Between-group PCA
[Figure: density of individual scores on PC1 under classical PCA versus between-group PCA.]
Between-group PCA looks at between-group variability.
PCA, between-group PCA, and Discriminant Analysis
$$VAR(X) = B(X) + W(X)$$
Maximizing different quantities:
- PCA: maximizes overall diversity (max $var(Xu)$)
- Between-group PCA: maximizes group diversity (max $b(Xu)$)
- Discriminant Analysis: maximizes group separation (max $b(Xu)$, min $w(Xu)$)
Discriminant Analysis
[Figure: densities of individual scores on axis 1 for classical PCA (max. total diversity), between-group PCA (max. diversity between groups), and Discriminant Analysis (max. separation of groups: between-group spread maximised, within-group spread minimised).]
Technical issues
Discriminant Analysis requires $X^T D X$ to be invertible, i.e.:
- fewer variables than observations
- uncorrelated variables
Genetic data:
- (almost) always (many) more alleles than individuals
- allele frequencies are by definition correlated (within a locus they sum to 1)
- linkage disequilibrium → correlated alleles
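A quick illustration of the first problem: with more alleles than individuals ($p > n$), the $p \times p$ matrix $X^T X$ is rank-deficient and cannot be inverted:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((20, 50))             # n = 20 individuals, p = 50 alleles
C = X.T @ X                          # p x p cross-product matrix
print(np.linalg.matrix_rank(C))      # at most n = 20 < p = 50: singular
```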
Discriminant Analysis of Principal Components (DAPC)
- new method (Jombart et al. 2010, BMC Genetics)
- aim: modify DA for genetic data
- relies on data orthogonalisation/reduction using PCA
Rationale of DAPC
[Diagram: the $n \times p$ data table is reduced by PCA to $n \times r$ principal components (orthogonal variables), with $p \times r$ principal axes (allele loadings); Discriminant Analysis of the components then yields $n \times s$ discriminant functions (synthetic variables).]
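A hedged sketch of this idea using scikit-learn, chaining PCA and Linear Discriminant Analysis on toy data (the reference implementation is the dapc function of the adegenet R package):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = rng.random((60, 40))                 # 60 individuals x 40 alleles, toy data
groups = rng.integers(0, 3, size=60)     # 3 groups

dapc = make_pipeline(
    PCA(n_components=10),                # r retained PCs: fewer than n, uncorrelated
    LinearDiscriminantAnalysis(),        # DA on the PCs
)
scores = dapc.fit_transform(X, groups)   # n x s discriminant functions
```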
DAPC: summary
Discriminant Analysis requires:
- fewer variables than observations
- uncorrelated variables
Advantages of DAPC:
- always fewer PCs than observations
- PCs are uncorrelated
- still possible to compute allele contributions
DAPC: example
Seasonal influenza (A/H3N2) data.
[Figure: PCA of the data, samples grouped by year (2001-2006); eigenvalues inset.]
[Figure: DAPC of the same data, samples grouped by year (2001-2006); eigenvalues inset.]