Multivariate analysis of genetic data an introduction

Similar documents
Multivariate analysis of genetic data: an introduction

Multivariate analysis of genetic data exploring group diversity

Microsatellite data analysis. Tomáš Fér & Filip Kolář

Multivariate analysis of genetic data: investigating spatial structures

Multivariate analysis of genetic data: investigating spatial structures

Multivariate analysis of genetic data: exploring groups diversity

A (short) introduction to phylogenetics

Multivariate Statistics 101. Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis

Experimental Design and Data Analysis for Biologists

Introduction to multivariate analysis Outline

Spatial genetics analyses using

Detecting selection from differentiation between populations: the FLK and hapflk approach.

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the

Levels of genetic variation for a single gene, multiple genes or an entire genome

Statistics Toolbox 6. Apply statistical algorithms and probability models

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Multivariate Statistics Fundamentals Part 1: Rotation-based Techniques

COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics

Package HierDpart. February 13, 2019

Visualizing Population Genetics

Interpreting principal components analyses of spatial population genetic variation

Multivariate Analysis of Ecological Data using CANOCO

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

INTRODUCCIÓ A L'ANÀLISI MULTIVARIANT. Estadística Biomèdica Avançada Ricardo Gonzalo Sanz 13/07/2015

Asymptotic distribution of the largest eigenvalue with application to genetic data

Lecture WS Evolutionary Genetics Part I 1

Revision: Chapter 1-6. Applied Multivariate Statistics Spring 2012

Algebra of Principal Component Analysis

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

G E INTERACTION USING JMP: AN OVERVIEW

Principles of QTL Mapping. M.Imtiaz

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Multivariate Data Analysis a survey of data reduction and data association techniques: Principal Components Analysis

Variations in pelagic bacterial communities in the North Atlantic Ocean coincide with water bodies

BIOL 502 Population Genetics Spring 2017

Principal Component Analysis

Multivariate Analysis of Ecological Data

Lecture 13: Population Structure. October 8, 2012

Correlation Preserving Unsupervised Discretization. Outline

Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium. November 12, 2012

Methods for Cryptic Structure. Methods for Cryptic Structure

BTRY 7210: Topics in Quantitative Genomics and Genetics

MEMGENE package for R: Tutorials

NETWORK BIOLOGY AND COMPLEX DISEASES. Ahto Salumets

Combining SEM & GREML in OpenMx. Rob Kirkpatrick 3/11/16

Populations in statistical genetics

Nature Genetics: doi: /ng Supplementary Figure 1. The phenotypes of PI , BR121, and Harosoy under short-day conditions.

BIO 682 Multivariate Statistics Spring 2008

Whole genome sequencing (WGS) as a tool for monitoring purposes. Henrik Hasman DTU - Food

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Table of Contents. Multivariate methods. Introduction II. Introduction I

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Molecular Markers, Natural History, and Evolution

Lecture 2: Diversity, Distances, adonis. Lecture 2: Diversity, Distances, adonis. Alpha- Diversity. Alpha diversity definition(s)

Lecture 13: Variation Among Populations and Gene Flow. Oct 2, 2006

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

How to analyze many contingency tables simultaneously?

Genetic Drift in Human Evolution

Introduction to the SNP/ND concept - Phylogeny on WGS data

Basics of Multivariate Modelling and Data Analysis

Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Unit 2 Lesson 4 - Heredity. 7 th Grade Cells and Heredity (Mod A) Unit 2 Lesson 4 - Heredity

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Multivariate analysis of genetic data: exploring group diversity

DIMENSION REDUCTION AND CLUSTER ANALYSIS

Epigenetic vs. genetic diversity of stenoendemic short toothed sage (Salvia brachyodon Vandas)

Fei Lu. Post doctoral Associate Cornell University

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Microsatellites as genetic tools for monitoring escapes and introgression

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

The problem Lineage model Examples. The lineage model

1.3. Principal coordinate analysis. Pierre Legendre Département de sciences biologiques Université de Montréal

Overview of clustering analysis. Yuehua Cui

FINM 331: MULTIVARIATE DATA ANALYSIS FALL 2017 PROBLEM SET 3

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Mathematical models in population genetics II

Conservation genetics of the Ozark pocket gopher

Linear Regression (1/1/17)

Application Evolution: Part 1.1 Basics of Coevolution Dynamics

Molecular Plant Breeding 2015, Vol.6, No.15, 1-22

Correspondence Analysis & Related Methods

Introduction to population genetics & evolution

PCA and admixture models

Multidimensional heritability analysis of neuroanatomical shape. Jingwei Li

Principal component analysis

4/4/2018. Stepwise model fitting. CCA with first three variables only Call: cca(formula = community ~ env1 + env2 + env3, data = envdata)

D. Incorrect! That is what a phylogenetic tree intends to depict.

Research Statement on Statistics Jun Zhang

Functional divergence 1: FFTNS and Shifting balance theory

AEC 550 Conservation Genetics Lecture #2 Probability, Random mating, HW Expectations, & Genetic Diversity,

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Package AssocTests. October 14, 2017

Processes of Evolution

Ordination & PCA. Ordination. Ordination

Transcription:

Multivariate analysis of genetic data an introduction Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Population genomics in Lausanne 23 Aug 2016 1/25

Outline Multivariate analysis in a nutshell 2/25

Outline Multivariate analysis in a nutshell 3/25

Multivariate data: some examples Association between individuals? Correlations between variables? 4/25

Multivariate data: some examples Association between individuals? Correlations between variables? 4/25

Multivariate analysis to summarize diversity 5/25

Multivariate analysis to summarize diversity 5/25

Multivariate analysis to summarize diversity 5/25

Multivariate analysis to summarize diversity 5/25

Multivariate analysis: an overview Multivariate analysis, a.k.a: dimension reduction techniques ordinations in reduced space factorial methods Purposes: summarize diversity amongst observations summarize correlations between variables 6/25

Multivariate analysis: an overview Multivariate analysis, a.k.a: dimension reduction techniques ordinations in reduced space factorial methods Purposes: summarize diversity amongst observations summarize correlations between variables 6/25

Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/25

Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/25

Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/25

Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/25

Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/25

1 dimension, 2 dimensions, P dimensions Need to find most informative directions in a P -dimensional space. 8/25

1 dimension, 2 dimensions, P dimensions Need to find most informative directions in a P -dimensional space. 8/25

1 dimension, 2 dimensions, P dimensions Need to find most informative directions in a P -dimensional space. 8/25

Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix u R P ; u = [u 1,..., u P ]: principal axis ( u 2 = P j=1 u2 j = 1) v R N ; v = Xu = P j=1 u jx j : principal component find u so that 1 N v 2 = var(v) is maximum. 9/25

Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix u R P ; u = [u 1,..., u P ]: principal axis ( u 2 = P j=1 u2 j = 1) v R N ; v = Xu = P j=1 u jx j : principal component find u so that 1 N v 2 = var(v) is maximum. 9/25

Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix u R P ; u = [u 1,..., u P ]: principal axis ( u 2 = P j=1 u2 j = 1) v R N ; v = Xu = P j=1 u jx j : principal component find u so that 1 N v 2 = var(v) is maximum. 9/25

Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix u R P ; u = [u 1,..., u P ]: principal axis ( u 2 = P j=1 u2 j = 1) v R N ; v = Xu = P j=1 u jx j : principal component find u so that 1 N v 2 = var(v) is maximum. 9/25

Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 ( cor(v 1, v 2 ) = 0) find u 2 so that 1 N v 2 2 = var(v 2 ) is maximum 10/25

Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 ( cor(v 1, v 2 ) = 0) find u 2 so that 1 N v 2 2 = var(v 2 ) is maximum 10/25

Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 ( cor(v 1, v 2 ) = 0) find u 2 so that 1 N v 2 2 = var(v 2 ) is maximum 10/25

Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 ( cor(v 1, v 2 ) = 0) find u 2 so that 1 N v 2 2 = var(v 2 ) is maximum 10/25

How many principal components to retain? Choice based on screeplot : barplot of eigenvalues Retain only significant structures... but not trivial ones. 11/25

Outputs of multivariate analyses: an overview Main outputs: principal components: diversity amongst individuals principal axes: nature of the structures eigenvalues: magnitude of structures 12/25

Outputs of multivariate analyses: an overview Main outputs: principal components: diversity amongst individuals principal axes: nature of the structures eigenvalues: magnitude of structures 12/25

Outputs of multivariate analyses: an overview Main outputs: principal components: diversity amongst individuals principal axes: nature of the structures eigenvalues: magnitude of structures 12/25

Usual summary of an analysis: the biplot Biplot: principal components (points) + loadings (arrows) groups of individuals discriminating variables (longest arrows) magnitude of the structures 13/25

Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 14/25

Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 14/25

Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 14/25

Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 14/25

Outline Multivariate analysis in a nutshell 15/25

From DNA sequences to patterns of biological diversity 16/25

From DNA sequences to patterns of biological diversity 16/25

From DNA sequences to patterns of biological diversity 16/25

From DNA sequences to patterns of biological diversity 16/25

From DNA sequences to patterns of biological diversity 16/25

From DNA sequences to patterns of biological diversity 16/25

From DNA sequences to patterns of biological diversity 16/25

From DNA sequences to patterns of biological diversity 16/25

From DNA sequences to patterns of biological diversity 16/25

DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) Multivariate analysis use to summarize genetic diversity. 17/25

DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) Multivariate analysis use to summarize genetic diversity. 17/25

DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) Multivariate analysis use to summarize genetic diversity. 17/25

DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) Multivariate analysis use to summarize genetic diversity. 17/25

First application of multivariate analysis in genetics PCA of genetic data, native human populations (Cavalli-Sforza 1966, Proc B) First 2 principal components separate populations into continents. 18/25

First application of multivariate analysis in genetics PCA of genetic data, native human populations (Cavalli-Sforza 1966, Proc B) First 2 principal components separate populations into continents. 18/25

Applications: some examples PCA of genetic data + colored maps of principal components (Cavalli-Sforza et al. 1993, Science) Signatures of Human expansion out-of-africa. 19/25

Since then... Multivariate methods used in genetics Principal Component Analysis (PCA) Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Correspondance Analysis (CA) Discriminant Analysis (DA) Canonical Correlation Analysis (CCA)... 20/25

Since then... Applications reveal spatial structures (historical spread) explore genetic diversity identify cryptic species discover genotype-phenotype association... review in Jombart et al. 2009, Heredity 102: 330-341 21/25

In practice Multivariate analysis of genetic data using Usual workflow 1. read data in (adegenet) 2. convert data into numeric values (adegenet) 3. replace missing values (adegenet) 4. use classical methods (ade4/adegenet) 5. graphics and interpretation (ade4/adegenet) 22/25

In practice Multivariate analysis of genetic data using Usual workflow 1. read data in (adegenet) 2. convert data into numeric values (adegenet) 3. replace missing values (adegenet) 4. use classical methods (ade4/adegenet) 5. graphics and interpretation (ade4/adegenet) 22/25

In practice Multivariate analysis of genetic data using Usual workflow 1. read data in (adegenet) 2. convert data into numeric values (adegenet) 3. replace missing values (adegenet) 4. use classical methods (ade4/adegenet) 5. graphics and interpretation (ade4/adegenet) 22/25

In practice Multivariate analysis of genetic data using Usual workflow 1. read data in (adegenet) 2. convert data into numeric values (adegenet) 3. replace missing values (adegenet) 4. use classical methods (ade4/adegenet) 5. graphics and interpretation (ade4/adegenet) 22/25

In practice Multivariate analysis of genetic data using Usual workflow 1. read data in (adegenet) 2. convert data into numeric values (adegenet) 3. replace missing values (adegenet) 4. use classical methods (ade4/adegenet) 5. graphics and interpretation (ade4/adegenet) 22/25

Recoding data numerically Presence/absence (e.g. RFLP, AFLP) and SNPs: binary coding Multiallelic data (e.g. microsatellites) are recoded as counts/frequencies Example using microsatellites: Raw data: Recoded data (allele counts): locus1 locus2 1 80/80 30/30 2 50/55 30/30 3 80/50 29/30 4 50/50 30/30 5 50/50 29/29 locus1.50 locus1.55 locus1.80 locus2.29 locus2.30 1 0 0 2 0 2 2 1 1 0 0 2 3 1 0 1 1 1 4 2 0 0 0 2 5 2 0 0 2 0 23/25

Recoding data numerically Presence/absence (e.g. RFLP, AFLP) and SNPs: binary coding Multiallelic data (e.g. microsatellites) are recoded as counts/frequencies Example using microsatellites: Raw data: Recoded data (allele counts): locus1 locus2 1 80/80 30/30 2 50/55 30/30 3 80/50 29/30 4 50/50 30/30 5 50/50 29/29 locus1.50 locus1.55 locus1.80 locus2.29 locus2.30 1 0 0 2 0 2 2 1 1 0 0 2 3 1 0 1 1 1 4 2 0 0 0 2 5 2 0 0 2 0 23/25

Recoding data numerically Presence/absence (e.g. RFLP, AFLP) and SNPs: binary coding Multiallelic data (e.g. microsatellites) are recoded as counts/frequencies Example using microsatellites: Raw data: Recoded data (allele counts): locus1 locus2 1 80/80 30/30 2 50/55 30/30 3 80/50 29/30 4 50/50 30/30 5 50/50 29/29 locus1.50 locus1.55 locus1.80 locus2.29 locus2.30 1 0 0 2 0 2 2 1 1 0 0 2 3 1 0 1 1 1 4 2 0 0 0 2 5 2 0 0 2 0 23/25

Recoding data numerically Presence/absence (e.g. RFLP, AFLP) and SNPs: binary coding Multiallelic data (e.g. microsatellites) are recoded as counts/frequencies Example using microsatellites: Raw data: Recoded data (allele counts): locus1 locus2 1 80/80 30/30 2 50/55 30/30 3 80/50 29/30 4 50/50 30/30 5 50/50 29/29 locus1.50 locus1.55 locus1.80 locus2.29 locus2.30 1 0 0 2 0 2 2 1 1 0 0 2 3 1 0 1 1 1 4 2 0 0 0 2 5 2 0 0 2 0 23/25

How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

(Almost) time to get your hands dirty! And after lunch, the pdf of the practical is online: http://adegenet.r-forge.r-project.org/ or Google adegenet documents Lausanne August 2016 25/25