# Multivariate analysis of genetic data an introduction

Size: px
Start display at page:

## Transcription

1 Multivariate analysis of genetic data an introduction Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London Population genomics in Lausanne 23 Aug /25

2 Outline Multivariate analysis in a nutshell 2/25

3 Outline Multivariate analysis in a nutshell 3/25

4 Multivariate data: some examples Association between individuals? Correlations between variables? 4/25

5 Multivariate data: some examples Association between individuals? Correlations between variables? 4/25

6 Multivariate analysis to summarize diversity 5/25

7 Multivariate analysis to summarize diversity 5/25

8 Multivariate analysis to summarize diversity 5/25

9 Multivariate analysis to summarize diversity 5/25

10 Multivariate analysis: an overview Multivariate analysis, a.k.a: dimension reduction techniques ordinations in reduced space factorial methods Purposes: summarize diversity amongst observations summarize correlations between variables 6/25

11 Multivariate analysis: an overview Multivariate analysis, a.k.a: dimension reduction techniques ordinations in reduced space factorial methods Purposes: summarize diversity amongst observations summarize correlations between variables 6/25

12 Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/25

13 Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/25

14 Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/25

15 Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/25

16 Most common methods Differences lie in input data: quantitative/binary variables: Principal Component Analysis (PCA) 2 categorical variables: Correspondance Analysis (CA) >2 categorical variables: Multiple Correspondance Analysis (MCA) Euclidean distance matrix: Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Many other methods for 2 data tables, spatial analysis, phylogenetic analysis, etc. 7/25

17 1 dimension, 2 dimensions, P dimensions Need to find most informative directions in a P -dimensional space. 8/25

18 1 dimension, 2 dimensions, P dimensions Need to find most informative directions in a P -dimensional space. 8/25

19 1 dimension, 2 dimensions, P dimensions Need to find most informative directions in a P -dimensional space. 8/25

20 Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix u R P ; u = [u 1,..., u P ]: principal axis ( u 2 = P j=1 u2 j = 1) v R N ; v = Xu = P j=1 u jx j : principal component find u so that 1 N v 2 = var(v) is maximum. 9/25

21 Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix u R P ; u = [u 1,..., u P ]: principal axis ( u 2 = P j=1 u2 j = 1) v R N ; v = Xu = P j=1 u jx j : principal component find u so that 1 N v 2 = var(v) is maximum. 9/25

22 Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix u R P ; u = [u 1,..., u P ]: principal axis ( u 2 = P j=1 u2 j = 1) v R N ; v = Xu = P j=1 u jx j : principal component find u so that 1 N v 2 = var(v) is maximum. 9/25

23 Reducing P dimensions into 1 X R N P ; X = [x 1... x P ]: data matrix u R P ; u = [u 1,..., u P ]: principal axis ( u 2 = P j=1 u2 j = 1) v R N ; v = Xu = P j=1 u jx j : principal component find u so that 1 N v 2 = var(v) is maximum. 9/25

24 Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 ( cor(v 1, v 2 ) = 0) find u 2 so that 1 N v 2 2 = var(v 2 ) is maximum 10/25

25 Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 ( cor(v 1, v 2 ) = 0) find u 2 so that 1 N v 2 2 = var(v 2 ) is maximum 10/25

26 Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 ( cor(v 1, v 2 ) = 0) find u 2 so that 1 N v 2 2 = var(v 2 ) is maximum 10/25

27 Keeping more than one principal component u 1 and v 1 : 1st principal axis and component u 2 and v 2 : 2nd principal axis and component constraint: u 1 u 2 ( cor(v 1, v 2 ) = 0) find u 2 so that 1 N v 2 2 = var(v 2 ) is maximum 10/25

28 How many principal components to retain? Choice based on screeplot : barplot of eigenvalues Retain only significant structures... but not trivial ones. 11/25

29 Outputs of multivariate analyses: an overview Main outputs: principal components: diversity amongst individuals principal axes: nature of the structures eigenvalues: magnitude of structures 12/25

30 Outputs of multivariate analyses: an overview Main outputs: principal components: diversity amongst individuals principal axes: nature of the structures eigenvalues: magnitude of structures 12/25

31 Outputs of multivariate analyses: an overview Main outputs: principal components: diversity amongst individuals principal axes: nature of the structures eigenvalues: magnitude of structures 12/25

32 Usual summary of an analysis: the biplot Biplot: principal components (points) + loadings (arrows) groups of individuals discriminating variables (longest arrows) magnitude of the structures 13/25

33 Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 14/25

34 Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 14/25

35 Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 14/25

36 Multivariate analysis in a nutshell variety of methods for different types of variables principal components (PCs) summarize diversity variable loadings identify discriminating variables other uses of PCs: maps (spatial structures), models (response variables or predictors),... 14/25

37 Outline Multivariate analysis in a nutshell 15/25

38 From DNA sequences to patterns of biological diversity 16/25

39 From DNA sequences to patterns of biological diversity 16/25

40 From DNA sequences to patterns of biological diversity 16/25

41 From DNA sequences to patterns of biological diversity 16/25

42 From DNA sequences to patterns of biological diversity 16/25

43 From DNA sequences to patterns of biological diversity 16/25

44 From DNA sequences to patterns of biological diversity 16/25

45 From DNA sequences to patterns of biological diversity 16/25

46 From DNA sequences to patterns of biological diversity 16/25

47 DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) Multivariate analysis use to summarize genetic diversity. 17/25

48 DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) Multivariate analysis use to summarize genetic diversity. 17/25

49 DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) Multivariate analysis use to summarize genetic diversity. 17/25

50 DNA sequences: a rich source of information hundreds/thousands individuals up to millions of single nucleotide polymorphism (SNPs) Multivariate analysis use to summarize genetic diversity. 17/25

51 First application of multivariate analysis in genetics PCA of genetic data, native human populations (Cavalli-Sforza 1966, Proc B) First 2 principal components separate populations into continents. 18/25

52 First application of multivariate analysis in genetics PCA of genetic data, native human populations (Cavalli-Sforza 1966, Proc B) First 2 principal components separate populations into continents. 18/25

53 Applications: some examples PCA of genetic data + colored maps of principal components (Cavalli-Sforza et al. 1993, Science) Signatures of Human expansion out-of-africa. 19/25

54 Since then... Multivariate methods used in genetics Principal Component Analysis (PCA) Principal Coordinates Analysis (PCoA) / Metric Multidimensional Scaling (MDS) Correspondance Analysis (CA) Discriminant Analysis (DA) Canonical Correlation Analysis (CCA)... 20/25

55 Since then... Applications reveal spatial structures (historical spread) explore genetic diversity identify cryptic species discover genotype-phenotype association... review in Jombart et al. 2009, Heredity 102: /25

61 Recoding data numerically Presence/absence (e.g. RFLP, AFLP) and SNPs: binary coding Multiallelic data (e.g. microsatellites) are recoded as counts/frequencies Example using microsatellites: Raw data: Recoded data (allele counts): locus1 locus2 1 80/80 30/ /55 30/ /50 29/ /50 30/ /50 29/29 locus1.50 locus1.55 locus1.80 locus2.29 locus /25

62 Recoding data numerically Presence/absence (e.g. RFLP, AFLP) and SNPs: binary coding Multiallelic data (e.g. microsatellites) are recoded as counts/frequencies Example using microsatellites: Raw data: Recoded data (allele counts): locus1 locus2 1 80/80 30/ /55 30/ /50 29/ /50 30/ /50 29/29 locus1.50 locus1.55 locus1.80 locus2.29 locus /25

63 Recoding data numerically Presence/absence (e.g. RFLP, AFLP) and SNPs: binary coding Multiallelic data (e.g. microsatellites) are recoded as counts/frequencies Example using microsatellites: Raw data: Recoded data (allele counts): locus1 locus2 1 80/80 30/ /55 30/ /50 29/ /50 30/ /50 29/29 locus1.50 locus1.55 locus1.80 locus2.29 locus /25

64 Recoding data numerically Presence/absence (e.g. RFLP, AFLP) and SNPs: binary coding Multiallelic data (e.g. microsatellites) are recoded as counts/frequencies Example using microsatellites: Raw data: Recoded data (allele counts): locus1 locus2 1 80/80 30/ /55 30/ /50 29/ /50 30/ /50 29/29 locus1.50 locus1.55 locus1.80 locus2.29 locus /25

65 How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

66 How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

67 How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

68 How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

69 How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

70 How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

71 How are data handled in adegenet? Types of data: codominant markers (e.g. microsatellites) with any ploidy level allele counts dominant markers (e.g. RAPD) presence/absence nucleotide / amino-acids variation allele counts purely biallelic SNPs binary data (bits) Formats: software: GENETIX, Fstat, Genepop, STRUCTURE, PLINK data.frame of raw allelic data data.frame of allelic frequencies SNPs/amino-acids extracted from DNA/protein alignments 24/25

72 (Almost) time to get your hands dirty! And after lunch, the pdf of the practical is online: or Google adegenet documents Lausanne August /25

### Multivariate analysis of genetic data: an introduction

Multivariate analysis of genetic data: an introduction Thibaut Jombart MRC Centre for Outbreak Analysis and Modelling Imperial College London XXIV Simposio Internacional De Estadística Bogotá, 25th July

### Multivariate analysis of genetic data exploring group diversity

Multivariate analysis of genetic data exploring group diversity Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR

### Microsatellite data analysis. Tomáš Fér & Filip Kolář

Microsatellite data analysis Tomáš Fér & Filip Kolář Multilocus data dominant heterozygotes and homozygotes cannot be distinguished binary biallelic data (fragments) presence (dominant allele/heterozygote)

### Multivariate analysis of genetic data: investigating spatial structures

Multivariate analysis of genetic data: investigating spatial structures Thibaut Jombart Imperial College London MRC Centre for Outbreak Analysis and Modelling August 19, 2016 Abstract This practical provides

### Multivariate analysis of genetic data: investigating spatial structures

Multivariate analysis of genetic data: investigating spatial structures Thibaut Jombart Imperial College London MRC Centre for Outbreak Analysis and Modelling March 26, 2014 Abstract This practical provides

### Multivariate analysis of genetic data: exploring groups diversity

Multivariate analysis of genetic data: exploring groups diversity T. Jombart Imperial College London Bogota 01-12-2010 1/42 Outline Introduction Clustering algorithms Hierarchical clustering K-means Multivariate

### A (short) introduction to phylogenetics

A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

### Multivariate Statistics 101. Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis

Multivariate Statistics 101 Ordination (PCA, NMDS, CA) Cluster Analysis (UPGMA, Ward s) Canonical Correspondence Analysis Multivariate Statistics 101 Copy of slides and exercises PAST software download

### Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

### Introduction to multivariate analysis Outline

Introduction to multivariate analysis Outline Why do a multivariate analysis Ordination, classification, model fitting Principal component analysis Discriminant analysis, quickly Species presence/absence

### Spatial genetics analyses using

Practical course using the software Spatial genetics analyses using Thibaut Jombart Abstract This practical course illustrates some methodological aspects of spatial genetics. In the following we shall

### Detecting selection from differentiation between populations: the FLK and hapflk approach.

Detecting selection from differentiation between populations: the FLK and hapflk approach. Bertrand Servin bservin@toulouse.inra.fr Maria-Ines Fariello, Simon Boitard, Claude Chevalet, Magali SanCristobal,

### -Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the

1 2 3 -Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1950's. -PCA is based on covariance or correlation

### Levels of genetic variation for a single gene, multiple genes or an entire genome

From previous lectures: binomial and multinomial probabilities Hardy-Weinberg equilibrium and testing HW proportions (statistical tests) estimation of genotype & allele frequencies within population maximum

### Statistics Toolbox 6. Apply statistical algorithms and probability models

Statistics Toolbox 6 Apply statistical algorithms and probability models Statistics Toolbox provides engineers, scientists, researchers, financial analysts, and statisticians with a comprehensive set of

### Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies Mathieu Emily November 18, 2014 Caen mathieu.emily@agrocampus-ouest.fr - Agrocampus

### Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 18: Introduction to covariates, the QQ plot, and population structure II + minimal GWAS steps Jason Mezey jgm45@cornell.edu April

### Multivariate Statistics Fundamentals Part 1: Rotation-based Techniques

Multivariate Statistics Fundamentals Part 1: Rotation-based Techniques A reminded from a univariate statistics courses Population Class of things (What you want to learn about) Sample group representing

### COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics

COMBI - Combining high-dimensional classification and multiple hypotheses testing for the analysis of big data in genetics Thorsten Dickhaus University of Bremen Institute for Statistics AG DANK Herbsttagung

### Package HierDpart. February 13, 2019

Version 0.3.5 Date 2019-02-10 Package HierDpart February 13, 2019 Title Partitioning Hierarchical Diversity and Differentiation Across Metrics and Scales, from Genes to Ecosystems Miscellaneous R functions

### Visualizing Population Genetics

Visualizing Population Genetics Supervisors: Rachel Fewster, James Russell, Paul Murrell Louise McMillan 1 December 2015 Louise McMillan Visualizing Population Genetics 1 December 2015 1 / 29 Outline 1

### Interpreting principal components analyses of spatial population genetic variation

Supplemental Information for: Interpreting principal components analyses of spatial population genetic variation John Novembre 1 and Matthew Stephens 1,2 1 Department of Human Genetics, University of Chicago,

### Multivariate Analysis of Ecological Data using CANOCO

Multivariate Analysis of Ecological Data using CANOCO JAN LEPS University of South Bohemia, and Czech Academy of Sciences, Czech Republic Universitats- uric! Lanttesbibiiothek Darmstadt Bibliothek Biologie

### 1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

### INTRODUCCIÓ A L'ANÀLISI MULTIVARIANT. Estadística Biomèdica Avançada Ricardo Gonzalo Sanz 13/07/2015

INTRODUCCIÓ A L'ANÀLISI MULTIVARIANT Estadística Biomèdica Avançada Ricardo Gonzalo Sanz ricardo.gonzalo@vhir.org 13/07/2015 1. Introduction to Multivariate Analysis 2. Summary Statistics for Multivariate

### Asymptotic distribution of the largest eigenvalue with application to genetic data

Asymptotic distribution of the largest eigenvalue with application to genetic data Chong Wu University of Minnesota September 30, 2016 T32 Journal Club Chong Wu 1 / 25 Table of Contents 1 Background Gene-gene

### Lecture WS Evolutionary Genetics Part I 1

Quantitative genetics Quantitative genetics is the study of the inheritance of quantitative/continuous phenotypic traits, like human height and body size, grain colour in winter wheat or beak depth in

### Revision: Chapter 1-6. Applied Multivariate Statistics Spring 2012

Revision: Chapter 1-6 Applied Multivariate Statistics Spring 2012 Overview Cov, Cor, Mahalanobis, MV normal distribution Visualization: Stars plot, mosaic plot with shading Outlier: chisq.plot Missing

### Algebra of Principal Component Analysis

Algebra of Principal Component Analysis 3 Data: Y = 5 Centre each column on its mean: Y c = 7 6 9 y y = 3..6....6.8 3. 3.8.6 Covariance matrix ( variables): S = -----------Y n c ' Y 8..6 c =.6 5.8 Equation

### Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture16: Population structure and logistic regression I Jason Mezey jgm45@cornell.edu April 11, 2017 (T) 8:40-9:55 Announcements I April

### Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Homework Assignment, Evolutionary Systems Biology, Spring 2009. Homework Part I: Phylogenetics: Introduction. The objective of this assignment is to understand the basics of phylogenetic relationships

### G E INTERACTION USING JMP: AN OVERVIEW

G E INTERACTION USING JMP: AN OVERVIEW Sukanta Dash I.A.S.R.I., Library Avenue, New Delhi-110012 sukanta@iasri.res.in 1. Introduction Genotype Environment interaction (G E) is a common phenomenon in agricultural

### Principles of QTL Mapping. M.Imtiaz

Principles of QTL Mapping M.Imtiaz Introduction Definitions of terminology Reasons for QTL mapping Principles of QTL mapping Requirements For QTL Mapping Demonstration with experimental data Merit of QTL

### Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of

### The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

The E-M Algorithm in Genetics Biostatistics 666 Lecture 8 Maximum Likelihood Estimation of Allele Frequencies Find parameter estimates which make observed data most likely General approach, as long as

### Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, 2012 1 / 1 2 / 1 Other datatypes Other datatypes Document data You might start with

### Multivariate Data Analysis a survey of data reduction and data association techniques: Principal Components Analysis

Multivariate Data Analysis a survey of data reduction and data association techniques: Principal Components Analysis For example Data reduction approaches Cluster analysis Principal components analysis

### Variations in pelagic bacterial communities in the North Atlantic Ocean coincide with water bodies

The following supplement accompanies the article Variations in pelagic bacterial communities in the North Atlantic Ocean coincide with water bodies Richard L. Hahnke 1, Christina Probian 1, Bernhard M.

### BIOL 502 Population Genetics Spring 2017

BIOL 502 Population Genetics Spring 2017 Lecture 1 Genomic Variation Arun Sethuraman California State University San Marcos Table of contents 1. What is Population Genetics? 2. Vocabulary Recap 3. Relevance

### Principal Component Analysis

I.T. Jolliffe Principal Component Analysis Second Edition With 28 Illustrations Springer Contents Preface to the Second Edition Preface to the First Edition Acknowledgments List of Figures List of Tables

### Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

### Lecture 13: Population Structure. October 8, 2012

Lecture 13: Population Structure October 8, 2012 Last Time Effective population size calculations Historical importance of drift: shifting balance or noise? Population structure Today Course feedback The

### Correlation Preserving Unsupervised Discretization. Outline

Correlation Preserving Unsupervised Discretization Jee Vang Outline Paper References What is discretization? Motivation Principal Component Analysis (PCA) Association Mining Correlation Preserving Discretization

### Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium. November 12, 2012

Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium November 12, 2012 Last Time Sequence data and quantification of variation Infinite sites model Nucleotide diversity (π) Sequence-based

### Methods for Cryptic Structure. Methods for Cryptic Structure

Case-Control Association Testing Review Consider testing for association between a disease and a genetic marker Idea is to look for an association by comparing allele/genotype frequencies between the cases

### BTRY 7210: Topics in Quantitative Genomics and Genetics

BTRY 7210: Topics in Quantitative Genomics and Genetics Jason Mezey Biological Statistics and Computational Biology (BSCB) Department of Genetic Medicine jgm45@cornell.edu February 12, 2015 Lecture 3:

### MEMGENE package for R: Tutorials

MEMGENE package for R: Tutorials Paul Galpern 1,2 and Pedro Peres-Neto 3 1 Faculty of Environmental Design, University of Calgary 2 Natural Resources Institute, University of Manitoba 3 Département des

### NETWORK BIOLOGY AND COMPLEX DISEASES. Ahto Salumets

NETWORK BIOLOGY AND COMPLEX DISEASES Ahto Salumets CENTRAL DOGMA OF BIOLOGY https://en.wikipedia.org/wiki/central_dogma_of_molecular_biology http://www.qaraqalpaq.com/genetics.html CHROMOSOMES SINGLE-NUCLEOTIDE

### Combining SEM & GREML in OpenMx. Rob Kirkpatrick 3/11/16

Combining SEM & GREML in OpenMx Rob Kirkpatrick 3/11/16 1 Overview I. Introduction. II. mxgreml Design. III. mxgreml Implementation. IV. Applications. V. Miscellany. 2 G V A A 1 1 F E 1 VA 1 2 3 Y₁ Y₂

### Populations in statistical genetics

Populations in statistical genetics What are they, and how can we infer them from whole genome data? Daniel Lawson Heilbronn Institute, University of Bristol www.paintmychromosomes.com Work with: January

### Nature Genetics: doi: /ng Supplementary Figure 1. The phenotypes of PI , BR121, and Harosoy under short-day conditions.

Supplementary Figure 1 The phenotypes of PI 159925, BR121, and Harosoy under short-day conditions. (a) Plant height. (b) Number of branches. (c) Average internode length. (d) Number of nodes. (e) Pods

### BIO 682 Multivariate Statistics Spring 2008

BIO 682 Multivariate Statistics Spring 2008 Steve Shuster http://www4.nau.edu/shustercourses/bio682/index.htm Lecture 11 Properties of Community Data Gauch 1982, Causton 1988, Jongman 1995 a. Qualitative:

### Whole genome sequencing (WGS) as a tool for monitoring purposes. Henrik Hasman DTU - Food

Whole genome sequencing (WGS) as a tool for monitoring purposes Henrik Hasman DTU - Food The Challenge Is to: Continue to increase the power of surveillance and diagnostic using molecular tools Develop

### Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative

Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

### Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis Univariate testing vs. community analysis Univariate testing deals with hypotheses concerning individual taxa Is this taxon differentially

### Molecular Markers, Natural History, and Evolution

Molecular Markers, Natural History, and Evolution Second Edition JOHN C. AVISE University of Georgia Sinauer Associates, Inc. Publishers Sunderland, Massachusetts Contents PART I Background CHAPTER 1:

### Lecture 2: Diversity, Distances, adonis. Lecture 2: Diversity, Distances, adonis. Alpha- Diversity. Alpha diversity definition(s)

Lecture 2: Diversity, Distances, adonis Lecture 2: Diversity, Distances, adonis Diversity - alpha, beta (, gamma) Beta- Diversity in practice: Ecological Distances Unsupervised Learning: Clustering, etc

### Lecture 13: Variation Among Populations and Gene Flow. Oct 2, 2006

Lecture 13: Variation Among Populations and Gene Flow Oct 2, 2006 Questions about exam? Last Time Variation within populations: genetic identity and spatial autocorrelation Today Variation among populations:

### Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

### How to analyze many contingency tables simultaneously?

How to analyze many contingency tables simultaneously? Thorsten Dickhaus Humboldt-Universität zu Berlin Beuth Hochschule für Technik Berlin, 31.10.2012 Outline Motivation: Genetic association studies Statistical

### Genetic Drift in Human Evolution

Genetic Drift in Human Evolution (Part 2 of 2) 1 Ecology and Evolutionary Biology Center for Computational Molecular Biology Brown University Outline Introduction to genetic drift Modeling genetic drift

### Introduction to the SNP/ND concept - Phylogeny on WGS data

Introduction to the SNP/ND concept - Phylogeny on WGS data Johanne Ahrenfeldt PhD student Overview What is Phylogeny and what can it be used for Single Nucleotide Polymorphism (SNP) methods CSI Phylogeny

### Basics of Multivariate Modelling and Data Analysis

Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 2. Overview of multivariate techniques 2.1 Different approaches to multivariate data analysis 2.2 Classification of multivariate techniques

### Lecture 2: Genetic Association Testing with Quantitative Traits. Summer Institute in Statistical Genetics 2017

Lecture 2: Genetic Association Testing with Quantitative Traits Instructors: Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 29 Introduction to Quantitative Trait Mapping

### Unit 2 Lesson 4 - Heredity. 7 th Grade Cells and Heredity (Mod A) Unit 2 Lesson 4 - Heredity

Unit 2 Lesson 4 - Heredity 7 th Grade Cells and Heredity (Mod A) Unit 2 Lesson 4 - Heredity Give Peas a Chance What is heredity? Traits, such as hair color, result from the information stored in genetic

### Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

### Multivariate analysis of genetic data: exploring group diversity

Practical course using the software Multivariate analysis of genetic data: exploring group diversity Thibaut Jombart Abstract This practical course tackles the question of group diversity in genetic data

### DIMENSION REDUCTION AND CLUSTER ANALYSIS

DIMENSION REDUCTION AND CLUSTER ANALYSIS EECS 833, 6 March 2006 Geoff Bohling Assistant Scientist Kansas Geological Survey geoff@kgs.ku.edu 864-2093 Overheads and resources available at http://people.ku.edu/~gbohling/eecs833

### Epigenetic vs. genetic diversity of stenoendemic short toothed sage (Salvia brachyodon Vandas)

Epigenetic vs. genetic diversity of stenoendemic short toothed sage (Salvia brachyodon Vandas) Biruš, I., Liber, Z., Radosavljević, I., Bogdanović, S., Jug Dujaković, M., Zoldoš, V., Šatović, Z. Balkan

### Fei Lu. Post doctoral Associate Cornell University

Fei Lu Post doctoral Associate Cornell University http://www.maizegenetics.net Genotyping by sequencing (GBS) is simple and cost effective 1. Digest DNA 2. Ligate adapters with barcodes 3. Pool DNAs 4.

### Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Similarities Start with X which we assume is centered and standardized. The PCA loadings were

### Microsatellites as genetic tools for monitoring escapes and introgression

Microsatellites as genetic tools for monitoring escapes and introgression Alexander TRIANTAFYLLIDIS & Paulo A. PRODÖHL What are microsatellites? Microsatellites (SSR Simple Sequence Repeats) The repeat

### Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis Univariate testing vs. community analysis Univariate testing deals with hypotheses concerning individual taxa Is this taxon differentially

### The problem Lineage model Examples. The lineage model

The lineage model A Bayesian approach to inferring community structure and evolutionary history from whole-genome metagenomic data Jack O Brien Bowdoin College with Daniel Falush and Xavier Didelot Cambridge,

### 1.3. Principal coordinate analysis. Pierre Legendre Département de sciences biologiques Université de Montréal

1.3. Pierre Legendre Département de sciences biologiques Université de Montréal http://www.numericalecology.com/ Pierre Legendre 2018 Definition of principal coordinate analysis (PCoA) An ordination method

### Overview of clustering analysis. Yuehua Cui

Overview of clustering analysis Yuehua Cui Email: cuiy@msu.edu http://www.stt.msu.edu/~cui A data set with clear cluster structure How would you design an algorithm for finding the three clusters in this

### FINM 331: MULTIVARIATE DATA ANALYSIS FALL 2017 PROBLEM SET 3

FINM 331: MULTIVARIATE DATA ANALYSIS FALL 2017 PROBLEM SET 3 The required files for all problems can be found in: http://www.stat.uchicago.edu/~lekheng/courses/331/hw3/ The file name indicates which problem

### Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 20: Epistasis and Alternative Tests in GWAS Jason Mezey jgm45@cornell.edu April 16, 2016 (Th) 8:40-9:55 None Announcements Summary

### Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

OEB 242 Exam Practice Problems Answer Key Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. First, recall

### Mathematical models in population genetics II

Mathematical models in population genetics II Anand Bhaskar Evolutionary Biology and Theory of Computing Bootcamp January 1, 014 Quick recap Large discrete-time randomly mating Wright-Fisher population

### Conservation genetics of the Ozark pocket gopher

Conservation genetics of the Ozark pocket gopher Project Summary The Ozark pocket gopher (Geomys bursarius ozarkensis) is a range-restricted subspecies of the broadly distributed plains pocket gopher (G.

### Linear Regression (1/1/17)

STA613/CBB540: Statistical methods in computational biology Linear Regression (1/1/17) Lecturer: Barbara Engelhardt Scribe: Ethan Hada 1. Linear regression 1.1. Linear regression basics. Linear regression

### Application Evolution: Part 1.1 Basics of Coevolution Dynamics

Application Evolution: Part 1.1 Basics of Coevolution Dynamics S. chilense S. peruvianum Summer Semester 2013 Prof Aurélien Tellier FG Populationsgenetik Color code Color code: Red = Important result or

### Molecular Plant Breeding 2015, Vol.6, No.15, 1-22

Research Article Open Access Genetic Diversity Analysis in Tropical Maize Germplasm for Stem Borer and Storage Pest Resistance using Molecular Markers and Phenotypic traits Mwololo J.K. 1, Munyiri S.W.

### Correspondence Analysis & Related Methods

Correspondence Analysis & Related Methods Michael Greenacre SESSION 9: CA applied to rankings, preferences & paired comparisons Correspondence analysis (CA) can also be applied to other types of data:

### Introduction to population genetics & evolution

Introduction to population genetics & evolution Course Organization Exam dates: Feb 19 March 1st Has everybody registered? Did you get the email with the exam schedule Summer seminar: Hot topics in Bioinformatics

PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

### Multidimensional heritability analysis of neuroanatomical shape. Jingwei Li

Multidimensional heritability analysis of neuroanatomical shape Jingwei Li Brain Imaging Genetics Genetic Variation Behavior Cognition Neuroanatomy Brain Imaging Genetics Genetic Variation Neuroanatomy

### Principal component analysis

Principal component analysis Motivation i for PCA came from major-axis regression. Strong assumption: single homogeneous sample. Free of assumptions when used for exploration. Classical tests of significance

### 4/4/2018. Stepwise model fitting. CCA with first three variables only Call: cca(formula = community ~ env1 + env2 + env3, data = envdata)

0 Correlation matrix for ironmental matrix 1 2 3 4 5 6 7 8 9 10 11 12 0.087451 0.113264 0.225049-0.13835 0.338366-0.01485 0.166309-0.11046 0.088327-0.41099-0.19944 1 1 2 0.087451 1 0.13723-0.27979 0.062584

### D. Incorrect! That is what a phylogenetic tree intends to depict.

Genetics - Problem Drill 24: Evolutionary Genetics No. 1 of 10 1. A phylogenetic tree gives all of the following information except for. (A) DNA sequence homology among species. (B) Protein sequence similarity

### Research Statement on Statistics Jun Zhang

Research Statement on Statistics Jun Zhang (junzhang@galton.uchicago.edu) My interest on statistics generally includes machine learning and statistical genetics. My recent work focus on detection and interpretation

### Functional divergence 1: FFTNS and Shifting balance theory

Functional divergence 1: FFTNS and Shifting balance theory There is no conflict between neutralists and selectionists on the role of natural selection: Natural selection is the only explanation for adaptation

### AEC 550 Conservation Genetics Lecture #2 Probability, Random mating, HW Expectations, & Genetic Diversity,

AEC 550 Conservation Genetics Lecture #2 Probability, Random mating, HW Expectations, & Genetic Diversity, Today: Review Probability in Populatin Genetics Review basic statistics Population Definition

### Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

### DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

DNA Phylogeny Signals and Systems in Biology Kushal Shah @ EE, IIT Delhi Phylogenetics Grouping and Division of organisms Keeps changing with time Splitting, hybridization and termination Cladistics :

### Package AssocTests. October 14, 2017

Type Package Title Genetic Association Studies Version 0.0-4 Date 2017-10-14 Author Lin Wang [aut], Wei Zhang [aut], Qizhai Li [aut], Weicheng Zhu [ctb] Package AssocTests October 14, 2017 Maintainer Lin