Multivariate analysis of genetic data: exploring groups diversity


T. Jombart, Imperial College London. Bogota, 01-12-2010.

Outline
1. Introduction
2. Clustering algorithms
   - Hierarchical clustering
   - K-means
3. Multivariate analysis with group information
   - Analysis of population data
   - Between-group PCA
   - Discriminant Analysis
   - Discriminant Analysis of Principal Components

Introduction

Genetic data: recall
[Figure: individuals × alleles table, columns grouped by marker; within each marker, allele frequencies sum to 1. A second view partitions the individuals (rows) into groups 1-3.]
How to define groups? How to handle group information?

Finding and using group information
Finding groups:
- hierarchical clustering: single linkage, complete linkage, UPGMA
- K-means
Using group information:
- multivariate analysis of group frequencies
- using groups as partitions
- discriminant analysis

Clustering algorithms

Hierarchical clustering

A variety of algorithms:
- single linkage
- complete linkage
- UPGMA
- Ward
- ...

Rationale
1. compute pairwise genetic distances D (or similarities)
2. group the closest pair(s) together
3. (optional) update D
4. return to 2) until no new group can be made

[Figure: the individuals × alleles table for individuals a-e is converted into a pairwise distance matrix D between the five individuals.]

Differences between algorithms
When individuals i and j (at distance $D_{i,j}$) are merged into a group g, the distance from any other individual or group k to g is updated differently:
- single linkage: $D_{k,g} = \min(D_{k,i}, D_{k,j})$
- complete linkage: $D_{k,g} = \max(D_{k,i}, D_{k,j})$
- UPGMA: $D_{k,g} = (D_{k,i} + D_{k,j})/2$

Differences between algorithms
[Figure: one-dimensional data for individuals a-e, and the dendrograms obtained with single linkage, complete linkage, and UPGMA (average linkage); the three update rules group the same data differently.]
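To see the three update rules in action, here is a minimal sketch (not from the slides) using scipy; the five individuals and their one-dimensional coordinates are invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# invented 1-D coordinates for five individuals a-e
x = np.array([[0.0], [0.15], [0.9], [1.0], [0.45]])

D = pdist(x)  # pairwise Euclidean distances (condensed form)
print(squareform(D).round(2))

# same distances, three different group-update rules
for method in ("single", "complete", "average"):  # "average" is UPGMA
    Z = linkage(D, method=method)
    print(method, "merge order:", Z[:, :2].astype(int).tolist())
```

The merge orders (and hence the dendrograms) can differ between methods even though the input distances are identical.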

K-means

Variability between and within groups
[Figure: scatterplot of individuals forming several clouds, illustrating variability between groups versus variability within groups.]

K-means: underlying model
Univariate ANOVA model (ss: sum of squares):
$sst(x) = ssb(x) + ssw(x)$
with:
- $k = 1, \ldots, K$: number of groups
- $\mu_k$: mean of group k; $\mu$: mean of all data
- $g_k$: set of individuals in group k ($i \in g_k \Leftrightarrow$ i is in group k)
- $sst(x) = \sum_i (x_i - \mu)^2$: total variation
- $ssb(x) = \sum_k \sum_{i \in g_k} (\mu_k - \mu)^2$: variation between groups
- $ssw(x) = \sum_k \sum_{i \in g_k} (x_i - \mu_k)^2$: (residual) variation within groups

K-means: underlying model
Extension to multivariate data ($X \in \mathbb{R}^{n \times p}$: n individuals, p alleles, rows $x_i$):
$SST(X) = SSB(X) + SSW(X)$
with:
- $\mu_k$: vector of means of group k; $\mu$: vector of means of all data
- $SST(X) = \sum_i \|x_i - \mu\|^2$: total variation
- $SSB(X) = \sum_k \sum_{i \in g_k} \|\mu_k - \mu\|^2$: variation between groups
- $SSW(X) = \sum_k \sum_{i \in g_k} \|x_i - \mu_k\|^2$: (residual) variation within groups
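The decomposition is easy to verify numerically; below is a minimal sketch (the random data matrix and the arbitrary grouping are both made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))          # n = 30 individuals, p = 5 variables
groups = rng.integers(0, 3, size=30)  # arbitrary partition into K = 3 groups

mu = X.mean(axis=0)                   # overall mean vector
sst = ((X - mu) ** 2).sum()           # total variation

ssb = ssw = 0.0
for k in np.unique(groups):
    Xk = X[groups == k]
    mu_k = Xk.mean(axis=0)
    ssb += len(Xk) * ((mu_k - mu) ** 2).sum()  # between-group part
    ssw += ((Xk - mu_k) ** 2).sum()            # within-group part

print(np.isclose(sst, ssb + ssw))     # True: SST = SSB + SSW
```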

K-means rationale
Find K groups $G = \{g_1, \ldots, g_K\}$ minimizing the within-groups sum of squares (SSW):
$\arg\min_{G} \sum_k \sum_{i \in g_k} \|x_i - \mu_k\|^2$
Note: since SST(X) is fixed by the data, this is equivalent to finding the groups maximizing the between-groups sum of squares.

K-means algorithm
The K-means problem is solved by the following algorithm:
1. select random group means ($\mu_k$, $k = 1, \ldots, K$)
2. assign each individual $x_i$ to the closest group $g_k$
3. update the group means $\mu_k$
4. go back to 2) until convergence (groups no longer change)

[Figure: K-means iterations on a two-allele scatterplot: random initialization of group means, then alternating assignment (step 1) and mean-update (step 2) steps until the groups stabilise.]
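A minimal implementation of this loop, offered as a sketch rather than production code (the function name and defaults are my own; empty clusters and random restarts are not handled):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm) following the four steps above."""
    rng = np.random.default_rng(seed)
    # 1. pick K distinct individuals as initial group means
    mu = X[rng.choice(len(X), size=K, replace=False)]
    groups = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2. assign each individual to the closest group mean
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_groups = d.argmin(axis=1)
        # 4. stop at convergence (groups no longer change)
        if np.array_equal(new_groups, groups):
            break
        groups = new_groups
        # 3. update the group means
        mu = np.array([X[groups == k].mean(axis=0) for k in range(K)])
    return groups, mu
```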

Which K?
- K-means does not identify the number of clusters (K)
- each K-means solution is a model with a likelihood
- model selection can be used to select K

Using the Bayesian Information Criterion (BIC)
Defined as:
$BIC = -2\log(L) + k\log(n)$
with:
- $L$: likelihood
- $k$: number of parameters
- $n$: number of observations (individuals)
Smallest BIC = best model.

K-means and BIC: example
Simulated data: 6 populations in an island model.
[Figure: BIC (y axis, roughly 1100-1250) versus number of clusters (x axis, 1-20). (Jombart et al. 2010, BMC Genetics)]
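As a sketch of BIC-based selection of K, the snippet below treats each cluster as a spherical Gaussian with a shared variance, so that $-2\log(L)$ reduces (up to a constant) to $n\log(SSW/n)$; this likelihood, and counting the parameters as k = K, are simplifying assumptions for illustration, not necessarily the exact criterion of Jombart et al. (2010). It reuses the kmeans() sketch above.

```python
import numpy as np

def bic_kmeans(X, K):
    """Approximate BIC = -2 log(L) + k log(n) for a K-means solution,
    under a spherical shared-variance Gaussian model (an assumption)."""
    groups, mu = kmeans(X, K)  # the sketch defined above
    n = len(X)
    ssw = sum(((X[groups == k] - mu[k]) ** 2).sum() for k in range(K))
    return n * np.log(ssw / n) + K * np.log(n)

# toy data: three separated blobs standing in for populations
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(c, 0.5, size=(30, 2)) for c in (0.0, 4.0, 8.0)])
best_K = min(range(1, 8), key=lambda K: bic_kmeans(X, K))
print(best_K)  # smallest BIC should point to K = 3 here
```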

Multivariate analysis with group information

Analysis of population data

Aggregating data by groups
[Figure: the individuals × alleles table is averaged within each of four groups, yielding a smaller groups × alleles table of allele frequencies.]
→ multivariate analysis of group allele frequencies.

Analysing group data
Available methods:
- Principal Component Analysis (PCA) of the allele frequency table
- genetic distances between populations → Principal Coordinates Analysis (PCoA)
- Correspondence Analysis (CA) of allele counts
Criticism:
- loses individual information
- neglects within-group diversity
- CA: possible artefactual outliers

Between-group PCA

Using groups as partitions
Groups are coded as dummy (0/1) vectors in a matrix H: entry (i, k) is 1 if individual i belongs to group k, and 0 otherwise.
[Figure: the n × p table X (individuals × alleles) alongside the n × 4 indicator matrix H coding membership of groups 1-4.]

The multivariate ANOVA model
The data matrix X can be decomposed as:
$X = PX + (I - P)X$
where:
- $P$ is the projector onto H: $P = H(H^T D H)^{-1} H^T D$, with $D$ a metric in $\mathbb{R}^n$
- $PX$ is an $n \times p$ matrix in which each observation is replaced by its group average

The multivariate ANOVA model
Variation partition (the same as in K-means):
$VAR(X) = B(X) + W(X)$
where:
- $VAR(X) = \mathrm{trace}(X^T D X)$
- $B(X) = \mathrm{trace}(X^T P^T D P X)$
- $W(X) = \mathrm{trace}(X^T (I - P)^T D (I - P) X)$

Between-group analysis
$VAR(X) = B(X) + W(X)$
- Classical PCA: decomposes VAR(X), finding u such that var(Xu) is maximal.
- Between-group PCA: decomposes B(X), finding u such that b(Xu) is maximal.

Between-group PCA
[Figure: group densities along the first axis of a classical PCA and of a between-group PCA; the groups are better separated on the between-group axis.]
Between-group PCA looks at between-group variability.
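The following sketch implements the idea in the simplest setting, assuming uniform row weights (D = I/n): replace each row by its group mean (the projection PX), diagonalise the resulting between-group covariance, and project the individuals on its axes. The function name and the uniform-metric assumption are mine.

```python
import numpy as np

def between_group_pca(X, groups, n_axes=2):
    """Sketch of between-group PCA with uniform weights (D = I/n)."""
    Xc = X - X.mean(axis=0)            # centre the data
    PX = np.empty_like(Xc)
    for k in np.unique(groups):        # each row -> its group mean
        PX[groups == k] = Xc[groups == k].mean(axis=0)
    B = PX.T @ PX / len(X)             # between-group covariance
    vals, vecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    U = vecs[:, np.argsort(vals)[::-1][:n_axes]]  # leading axes u
    return Xc @ U, U                   # individual scores, allele loadings
```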

Discriminant Analysis

PCA, between-group PCA, and Discriminant Analysis
$VAR(X) = B(X) + W(X)$
Each method maximises a different quantity:
- PCA: overall diversity (max var(Xu))
- between-group PCA: group diversity (max b(Xu))
- Discriminant Analysis: group separation (max b(Xu), min w(Xu))

[Figure: group densities on axis 1 for classical PCA (max. total diversity), between-group PCA (max. diversity between groups), and Discriminant Analysis (max. separation of groups: between-group variance maximal and within-group variance minimal on the axis).]

Technical issues
Discriminant Analysis requires:
- $X^T D X$ to be invertible → fewer variables than observations
- $X^T D X$ to be invertible → uncorrelated variables
Genetic data:
- (almost) always (many) more alleles than individuals
- allele frequencies are correlated by definition (within each marker they sum to 1)
- linkage disequilibrium → correlated alleles
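The first requirement fails as soon as there are more alleles than individuals, as the following sketch (with made-up dimensions) demonstrates:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 50                      # more "alleles" (p) than individuals (n)
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)            # centred data

C = Xc.T @ Xc                      # the p x p matrix X^T D X (here D = I)
print(np.linalg.matrix_rank(C))    # at most n - 1 = 19, far below p = 50
# C is singular: it cannot be inverted, so DA cannot run on X directly.
```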

Discriminant Analysis of Principal Components

Discriminant Analysis of Principal Components (DAPC)
- a new method (Jombart et al. 2010, BMC Genetics)
- aim: modify DA for genetic data
- relies on data orthogonalisation/reduction using PCA

Rationale of DAPC
[Figure: the n × p data matrix is first reduced by PCA to n × r principal components (orthogonal variables), with p × r principal axes (allele loadings); Discriminant Analysis is then applied to the components, yielding n × s discriminant functions (synthetic variables).]

DAPC: summary
Discriminant Analysis requires:
- fewer variables than observations
- uncorrelated variables
Advantages of DAPC:
- always fewer PCs than observations
- PCs are uncorrelated
- allele contributions can still be computed
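The pipeline can be sketched with scikit-learn, as below; this is an illustration of the idea only (the reference implementation is, to my knowledge, the dapc() function in the authors' adegenet R package), and the toy data are made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def dapc(X, groups, n_pcs=10):
    """Sketch of the DAPC pipeline: reduce/orthogonalise with PCA,
    then run Discriminant Analysis on the retained components."""
    pcs = PCA(n_components=n_pcs).fit_transform(X)  # uncorrelated, n_pcs < n
    lda = LinearDiscriminantAnalysis().fit(pcs, groups)
    return lda.transform(pcs)                       # discriminant functions

# made-up data: 60 individuals, 200 "allele" columns (p >> n, where plain
# DA would fail), 3 groups with shifted means
rng = np.random.default_rng(3)
groups = np.repeat([0, 1, 2], 20)
X = rng.normal(size=(60, 200)) + 0.4 * groups[:, None]
print(dapc(X, groups).shape)  # (60, 2): at most K - 1 discriminant functions
```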

DAPC: example
[Figure: scatterplot of the first principal components of a PCA of seasonal influenza (A/H3N2) data, individuals labelled by year (2001-2006), with eigenvalue screeplot; scale d = 10.]

DAPC: example
[Figure: scatterplot of the discriminant functions from a DAPC of the same seasonal influenza (A/H3N2) data; the yearly groups (2001-2006) are more clearly separated than in the PCA view, with eigenvalue screeplot; scale d = 5.]