Multivariate analysis of genetic data: an introduction

Similar documents
Multivariate analysis of genetic data an introduction

Multivariate analysis of genetic data exploring group diversity

Multivariate analysis of genetic data: exploring groups diversity

Multivariate analysis of genetic data: investigating spatial structures

Multivariate analysis of genetic data: investigating spatial structures

Multivariate analysis of genetic data: exploring group diversity

A (short) introduction to phylogenetics

Principal component analysis

Introduction to multivariate analysis Outline

Multivariate Statistics 101. Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis

Multivariate Statistics Fundamentals Part 1: Rotation-based Techniques

Experimental Design and Data Analysis for Biologists

PCA and admixture models

G E INTERACTION USING JMP: AN OVERVIEW

DIMENSION REDUCTION AND CLUSTER ANALYSIS

Populations in statistical genetics

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Spatial genetics analyses using

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the

Evolution AP Biology

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Multivariate Analysis of Ecological Data using CANOCO

Statistics Toolbox 6. Apply statistical algorithms and probability models

4/4/2018. Stepwise model fitting. CCA with first three variables only Call: cca(formula = community ~ env1 + env2 + env3, data = envdata)

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Eigenfaces. Face Recognition Using Principal Components Analysis

Methods for Cryptic Structure. Methods for Cryptic Structure

Overview of clustering analysis. Yuehua Cui

Asymptotic distribution of the largest eigenvalue with application to genetic data

Multivariate Data Analysis a survey of data reduction and data association techniques: Principal Components Analysis

Horizontal transfer and pathogenicity

What is Principal Component Analysis?

Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson

Algebra of Principal Component Analysis

Clusters. Unsupervised Learning. Luc Anselin. Copyright 2017 by Luc Anselin, All Rights Reserved

Chapter 11 Canonical analysis

Interpreting principal components analyses of spatial population genetic variation

Lecture WS Evolutionary Genetics Part I 1

Principal Component Analysis

Table of Contents. Multivariate methods. Introduction II. Introduction I

EE16B Designing Information Devices and Systems II

Bioinformatics. Genotype -> Phenotype DNA. Jason H. Moore, Ph.D. GECCO 2007 Tutorial / Bioinformatics.

Simplifying Drug Discovery with JMP

Multivariate Statistics Summary and Comparison of Techniques. Multivariate Techniques

Revision: Chapter 1-6. Applied Multivariate Statistics Spring 2012

Correlation Preserving Unsupervised Discretization. Outline

MEMGENE package for R: Tutorials

Genetic Variation: The genetic substrate for natural selection. Horizontal Gene Transfer. General Principles 10/2/17.

1.3. Principal coordinate analysis. Pierre Legendre Département de sciences biologiques Université de Montréal

Multivariate Statistics (I) 2. Principal Component Analysis (PCA)

8. FROM CLASSICAL TO CANONICAL ORDINATION

Principal Components Analysis (PCA)

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics

Introduction to population genetics & evolution

Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation)

Basics of Multivariate Modelling and Data Analysis

Quantitative Trait Variation

FINM 331: MULTIVARIATE DATA ANALYSIS FALL 2017 PROBLEM SET 3

4/2/2018. Canonical Analyses Analysis aimed at identifying the relationship between two multivariate datasets. Cannonical Correlation.

CLASSIFICATION UNIT GUIDE DUE WEDNESDAY 3/1

Statistical Machine Learning

Dimensionality Reduction Techniques (DRT)

Mutual Information between Discrete Variables with Many Categories using Recursive Adaptive Partitioning

INTRODUCCIÓ A L'ANÀLISI MULTIVARIANT. Estadística Biomèdica Avançada Ricardo Gonzalo Sanz 13/07/2015

Dimension Reduction (PCA, ICA, CCA, FLD,

Enduring Understanding: Change in the genetic makeup of a population over time is evolution Pearson Education, Inc.

Chapters AP Biology Objectives. Objectives: You should know...

Dimension Reduction and Low-dimensional Embedding

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Principal Component Analysis (PCA) Principal Component Analysis (PCA)

BIOLOGY STANDARDS BASED RUBRIC

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, data types 3 Data sources and preparation Project 1 out 4

EE16B Designing Information Devices and Systems II

20 Unsupervised Learning and Principal Components Analysis (PCA)

PCA Advanced Examples & Applications

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)

Principal Component Analysis

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

Lecture 2: Diversity, Distances, adonis. Lecture 2: Diversity, Distances, adonis. Alpha- Diversity. Alpha diversity definition(s)

Computation. For QDA we need to calculate: Lets first consider the case that

Introduction to the SNP/ND concept - Phylogeny on WGS data

A recipe for the perfect salsa tomato

UNIVERSITY OF THE PHILIPPINES LOS BAÑOS INSTITUTE OF STATISTICS BS Statistics - Course Description

Experimental design. Matti Hotokka Department of Physical Chemistry Åbo Akademi University

1 Principal Components Analysis

Deriving Principal Component Analysis (PCA)

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Outline Classes of diversity measures. Species Divergence and the Measurement of Microbial Diversity. How do we describe and compare diversity?

Principal Component Analysis. Applied Multivariate Statistics Spring 2012

Temporal eigenfunction methods for multiscale analysis of community composition and other multivariate data

GENETICS - CLUTCH CH.1 INTRODUCTION TO GENETICS.

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING

Eigenvalues, Eigenvectors, and an Intro to PCA

Unsupervised Learning: Dimensionality Reduction

Eigenvalues, Eigenvectors, and an Intro to PCA

Data reduction for multivariate analysis

Transcription:

Multivariate analysis of genetic data: an introduction Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London XXIV Simposio Internacional De Estadística Bogotá, 25th July 2014 1/34

Outline Multivariate analysis in a nutshell Applications to genetic data Genetic diversity of pathogen populations 2/34

Outline Multivariate analysis in a nutshell Applications to genetic data Genetic diversity of pathogen populations 3/34

Multivariate data: some examples Association between individuals? Correlations between variables? 4/34

Multivariate data: some examples Association between individuals? Correlations between variables? 4/34

Multivariate analysis to summarize diversity 5/34

Multivariate analysis to summarize diversity 5/34

Multivariate analysis to summarize diversity 5/34

Multivariate analysis to summarize diversity 5/34

Multivariate analysis: an overview Multivariate analysis, a.k.a: dimension reduction techniques ordinations in reduced space factorial methods Purposes: summarize diversity amongst observations summarize correlations between variables 6/34

Multivariate analysis: an overview Multivariate analysis, a.k.a: dimension reduction techniques ordinations in reduced space factorial methods Purposes: summarize diversity amongst observations summarize correlations between variables 6/34

Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/34

Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/34

Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/34

Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/34

Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/34

1 dimension, 2 dimensions, P dimensions Need to find most informative directions in a P-dimensional space. 8/34

1 dimension, 2 dimensions, P dimensions Need to find most informative directions in a P-dimensional space. 8/34

1 dimension, 2 dimensions, P dimensions Need to find most informative directions in a P-dimensional space. 8/34

Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix Q R P P metric in R P ; D R N N metric in R N u R P ; u = [u 1,..., u P ]: principal axis ( u 2 Q = 1) v R N ; v = XQu: principal component find u so that v 2 D is maximum. 9/34

Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix Q R P P metric in R P ; D R N N metric in R N u R P ; u = [u 1,..., u P ]: principal axis ( u 2 Q = 1) v R N ; v = XQu: principal component find u so that v 2 D is maximum. 9/34

Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix Q R P P metric in R P ; D R N N metric in R N u R P ; u = [u 1,..., u P ]: principal axis ( u 2 Q = 1) v R N ; v = XQu: principal component find u so that v 2 D is maximum. 9/34

Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix Q R P P metric in R P ; D R N N metric in R N u R P ; u = [u 1,..., u P ]: principal axis ( u 2 Q = 1) v R N ; v = XQu: principal component find u so that v 2 D is maximum. 9/34

Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 (i.e., u 1, u 2 Q = 0) find u 2 so that v 2 2 D is maximum 10/34

Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 (i.e., u 1, u 2 Q = 0) find u 2 so that v 2 2 D is maximum 10/34

Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 (i.e., u 1, u 2 Q = 0) find u 2 so that v 2 2 D is maximum 10/34

Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 (i.e., u 1, u 2 Q = 0) find u 2 so that v 2 2 D is maximum 10/34

How do we do this? Things that don t change: take u i the i-th eigenvector of the Q-symmetric matrix X T DXQ (alternatively) take v i the i-th eigenvector of the D-symmetric matrix XQX T D Things that change: pre-transformations of X (recoding, standardisation, etc.) metrics Q and D (implicitely distances in R P and R N ) most usual analyses are defined by (X, Q, D) 11/34

How do we do this? Things that don t change: take u i the i-th eigenvector of the Q-symmetric matrix X T DXQ (alternatively) take v i the i-th eigenvector of the D-symmetric matrix XQX T D Things that change: pre-transformations of X (recoding, standardisation, etc.) metrics Q and D (implicitely distances in R P and R N ) most usual analyses are defined by (X, Q, D) 11/34

Things that don t change: How do we do this? take u i the i-th eigenvector of the Q-symmetric matrix X T DXQ (alternatively) take v i the i-th eigenvector of the D-symmetric matrix XQX T D Things that change: pre-transformations of X (recoding, standardisation, etc.) metrics Q and D (implicitely distances in R P and R N ) most usual analyses are defined by (X, Q, D) packages: ade4, vegan 11/34

How many principal components to retain? Choice based on screeplot : barplot of eigenvalues Retain only significant structures... but not trivial ones. 12/34

Outputs of multivariate analyses: an overview Main outputs: principal components: diversity amongst individuals principal axes: nature of the structures eigenvalues: magnitude of structures 13/34

Outputs of multivariate analyses: an overview Main outputs: principal components: diversity amongst individuals principal axes: nature of the structures eigenvalues: magnitude of structures 13/34

Outputs of multivariate analyses: an overview Main outputs: principal components: diversity amongst individuals principal axes: nature of the structures eigenvalues: magnitude of structures 13/34

Usual summary of an analysis: the biplot Biplot: principal components (points) + loadings (arrows) groups of individuals structuring variables (longest arrows) magnitude of the structures 14/34

Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 15/34

Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 15/34

Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 15/34

Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 15/34

Outline Multivariate analysis in a nutshell Applications to genetic data Genetic diversity of pathogen populations 16/34

From DNA sequences to patterns of biological diversity 17/34

From DNA sequences to patterns of biological diversity 17/34

From DNA sequences to patterns of biological diversity 17/34

From DNA sequences to patterns of biological diversity 17/34

From DNA sequences to patterns of biological diversity 17/34

From DNA sequences to patterns of biological diversity 17/34

From DNA sequences to patterns of biological diversity 17/34

From DNA sequences to patterns of biological diversity 17/34

From DNA sequences to patterns of biological diversity 17/34

DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) more generally, most genetic data can be treated as frequencies Multivariate analysis use to summarize genetic diversity. 18/34

DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) more generally, most genetic data can be treated as frequencies Multivariate analysis use to summarize genetic diversity. 18/34

DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) more generally, most genetic data can be treated as frequencies Multivariate analysis use to summarize genetic diversity. 18/34

DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) more generally, most genetic data can be treated as frequencies Multivariate analysis use to summarize genetic diversity. 18/34

First application of multivariate analysis in genetics PCA of genetic data, native human populations (Cavalli-Sforza 1966, Proc B) First 2 principal components separate populations into continents. 19/34

First application of multivariate analysis in genetics PCA of genetic data, native human populations (Cavalli-Sforza 1966, Proc B) First 2 principal components separate populations into continents. 19/34

Applications: some examples PCA of genetic data + colored maps of principal components (Cavalli-Sforza et al. 1993, Science) Signatures of Human expansion out-of-africa. 20/34

Since then... Multivariate methods used in genetics Principal Component Analysis (PCA) Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Correspondance Analysis (CA) Discriminant Analysis (DA) Canonical Correlation Analysis (CCA)... 21/34

Since then... Multivariate methods used in genetics Principal Component Analysis (PCA) Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Correspondance Analysis (CA) Discriminant Analysis (DA) Canonical Correlation Analysis (CCA)... packages: adegenet, ade4, pegas 21/34

Since then... Applications reveal spatial structures (historical spread) explore genetic diversity identify cryptic species discover genotype-phenotype association... review in Jombart et al. 2009, Heredity 102: 330-341 Applications in genetics of pathogen populations. 22/34

Since then... Applications reveal spatial structures (historical spread) explore genetic diversity identify cryptic species discover genotype-phenotype association... review in Jombart et al. 2009, Heredity 102: 330-341 Applications in genetics of pathogen populations. 22/34

Outline Multivariate analysis in a nutshell Applications to genetic data Genetic diversity of pathogen populations 23/34

Why investigate the diversity of pathogen populations? Genetic data: increasingly important in infectious disease epidemiology Purposes classify pathogens, describe their relationships assess the spatio-temporal dynamics of infectious diseases reconstruct epidemiological processes (transmission) 24/34

Why investigate the diversity of pathogen populations? Genetic data: increasingly important in infectious disease epidemiology Purposes classify pathogens, describe their relationships assess the spatio-temporal dynamics of infectious diseases reconstruct epidemiological processes (transmission) 24/34

Why investigate the diversity of pathogen populations? Genetic data: increasingly important in infectious disease epidemiology Purposes classify pathogens, describe their relationships assess the spatio-temporal dynamics of infectious diseases reconstruct epidemiological processes (transmission) 24/34

Why investigate the diversity of pathogen populations? Genetic data: increasingly important in infectious disease epidemiology Purposes classify pathogens, describe their relationships assess the spatio-temporal dynamics of infectious diseases reconstruct epidemiological processes (transmission) 24/34

Different questions at different scales Where and how can multivariate analysis of pathogen genetic data be useful? 25/34

Different questions at different scales Where and how can multivariate analysis of pathogen genetic data be useful? 25/34

Describing pathogen populations Population genetics: identify populations of organisms and describe their relationships What is a population? Usual definition: set of organisms mating at random Problem: no mating in most pathogens (e.g. viruses, bacteria) Genetic clusters: set of genetically related pathogens (e.g. same outbreak, same epidemic). aim: identify and describe genetic clusters 26/34

Describing pathogen populations Population genetics: identify populations of organisms and describe their relationships What is a population? Usual definition: set of organisms mating at random Problem: no mating in most pathogens (e.g. viruses, bacteria) Genetic clusters: set of genetically related pathogens (e.g. same outbreak, same epidemic). aim: identify and describe genetic clusters 26/34

Describing pathogen populations Population genetics: identify populations of organisms and describe their relationships What is a population? Usual definition: set of organisms mating at random Problem: no mating in most pathogens (e.g. viruses, bacteria) Genetic clusters: set of genetically related pathogens (e.g. same outbreak, same epidemic). aim: identify and describe genetic clusters 26/34

Describing pathogen populations Population genetics: identify populations of organisms and describe their relationships What is a population? Usual definition: set of organisms mating at random Problem: no mating in most pathogens (e.g. viruses, bacteria) Genetic clusters: set of genetically related pathogens (e.g. same outbreak, same epidemic). aim: identify and describe genetic clusters 26/34

Describing pathogen populations Population genetics: identify populations of organisms and describe their relationships What is a population? Usual definition: set of organisms mating at random Problem: no mating in most pathogens (e.g. viruses, bacteria) Genetic clusters: set of genetically related pathogens (e.g. same outbreak, same epidemic). aim: identify and describe genetic clusters 26/34

Genetic clustering using K-means & BIC (Jombart et al. 2010, BMC Genetics) Variance partitioning model (ANOVA): tot. variance = (bet. groups) + (wit. groups) Performances: K-means STRUCTURE on simulated data (various island and stepping stone models) orders of magnitude faster (seconds vs hours/days) 27/34

Genetic clustering using K-means & BIC (Jombart et al. 2010, BMC Genetics) Variance partitioning model (ANOVA): tot. variance = (bet. groups) + (wit. groups) Performances: K-means STRUCTURE on simulated data (various island and stepping stone models) orders of magnitude faster (seconds vs hours/days) 27/34

Genetic clustering using K-means & BIC (Jombart et al. 2010, BMC Genetics) Variance partitioning model (ANOVA): tot. variance = (bet. groups) + (wit. groups) Performances: K-means STRUCTURE on simulated data (various island and stepping stone models) orders of magnitude faster (seconds vs hours/days) package: adegenet, function find.clusters 27/34

PCA of seasonal influenza (A/H3N2) data Data: seasonal influenza (A/H3N2), 500 HA segments. Little temporal evolution, burst of diversity in 2002?? 28/34

PCA of seasonal influenza (A/H3N2) data Data: seasonal influenza (A/H3N2), 500 HA segments. Little temporal evolution, burst of diversity in 2002?? 28/34

Which diversity to represent? Total diversity not relevant to analyse clusters. Discriminant Analysis of Principal Components (DAPC): (Jombart et al. 2010, BMC Genetics) maximizes group discrimination ( between/within ratio) provides group membership probabilities (prediction possible) as computer-efficient as PCA 29/34

Which diversity to represent? Total diversity not relevant to analyse clusters. Discriminant Analysis of Principal Components (DAPC): (Jombart et al. 2010, BMC Genetics) maximizes group discrimination ( between/within ratio) provides group membership probabilities (prediction possible) as computer-efficient as PCA 29/34

Which diversity to represent? Total diversity not relevant to analyse clusters. Discriminant Analysis of Principal Components (DAPC): (Jombart et al. 2010, BMC Genetics) maximizes group discrimination ( between/within ratio) provides group membership probabilities (prediction possible) as computer-efficient as PCA package: adegenet, function dapc 29/34

DAPC of seasonal influenza (A/H3N2) data Strong temporal signal, originality of 2006 isolates (new alleles). 30/34

DAPC of seasonal influenza (A/H3N2) data Strong temporal signal, originality of 2006 isolates (new alleles). 30/34

Identifying antigenic clusters in influenza (A/H3N2) Antigenic clusters identified directly from AA sequences. 31/34

Identifying antigenic clusters in influenza (A/H3N2) Antigenic clusters identified directly from AA sequences. 31/34

DAPC to identify structuring alleles DAPC finds combinations of alleles most differing between groups. Simulated data: (Jombart & Ahmed 2011, Bioinformatics) 2 clusters, 50 isolates each 1,000,000 non structured SNPs 1,000 structured SNPs (i.e. different frequencies between groups) Possible applications to pathogen GWAS (e.g. SNPs related to antibiotic resistance in bacteria). 32/34

DAPC to identify structuring alleles DAPC finds combinations of alleles most differing between groups. Simulated data: (Jombart & Ahmed 2011, Bioinformatics) 2 clusters, 50 isolates each 1,000,000 non structured SNPs 1,000 structured SNPs (i.e. different frequencies between groups) Possible applications to pathogen GWAS (e.g. SNPs related to antibiotic resistance in bacteria). 32/34

Limits of multivariate analysis Methicillin-resistant Staphylococcus aureus (MRSA) outbreak within hospital, Thailand. 200 full-genome sequences. 1, 000 SNPs. Observations: greater diversity than expected genetic clusters can be defined transmissions at within-cluster level multivariate analysis = loss of information 33/34

Limits of multivariate analysis Methicillin-resistant Staphylococcus aureus (MRSA) outbreak within hospital, Thailand. 200 full-genome sequences. 1, 000 SNPs. Observations: greater diversity than expected genetic clusters can be defined transmissions at within-cluster level multivariate analysis = loss of information 33/34

Limits of multivariate analysis Methicillin-resistant Staphylococcus aureus (MRSA) outbreak within hospital, Thailand. 200 full-genome sequences. 1, 000 SNPs. Observations: greater diversity than expected genetic clusters can be defined transmissions at within-cluster level multivariate analysis = loss of information 33/34

Limits of multivariate analysis Methicillin-resistant Staphylococcus aureus (MRSA) outbreak within hospital, Thailand. 200 full-genome sequences. 1, 000 SNPs. Observations: greater diversity than expected genetic clusters can be defined transmissions at within-cluster level multivariate analysis = loss of information 33/34

Limits of multivariate analysis Methicillin-resistant Staphylococcus aureus (MRSA) outbreak within hospital, Thailand. 200 full-genome sequences. 1, 000 SNPs. Observations: greater diversity than expected genetic clusters can be defined transmissions at within-cluster level multivariate analysis = loss of information Multivariate analysis usually not informative on small-scale processes. 33/34

Summary multivariate analysis used for 50 years in genetics, still an active field for methodological development increasingly useful as datasets grow specific applications to pathogen genetic data limits reached when reconstructing fine-scale processes more at: http://adegenet.r-forge.r-project.org/ 34/34

Summary multivariate analysis used for 50 years in genetics, still an active field for methodological development increasingly useful as datasets grow specific applications to pathogen genetic data limits reached when reconstructing fine-scale processes more at: http://adegenet.r-forge.r-project.org/ 34/34

Summary multivariate analysis used for 50 years in genetics, still an active field for methodological development increasingly useful as datasets grow specific applications to pathogen genetic data limits reached when reconstructing fine-scale processes more at: http://adegenet.r-forge.r-project.org/ 34/34

Summary multivariate analysis used for 50 years in genetics, still an active field for methodological development increasingly useful as datasets grow specific applications to pathogen genetic data limits reached when reconstructing fine-scale processes more at: http://adegenet.r-forge.r-project.org/ 34/34

Summary multivariate analysis used for 50 years in genetics, still an active field for methodological development increasingly useful as datasets grow specific applications to pathogen genetic data limits reached when reconstructing fine-scale processes more at: http://adegenet.r-forge.r-project.org/ 34/34