Multivariate analysis of genetic data: exploring group diversity


1 Multivariate analysis of genetic data: exploring group diversity. T. Jombart, Imperial College London. Bogota 1/42

2 Outline Introduction Clustering algorithms Hierarchical clustering K-means Multivariate Analysis with group information Analysis of population data Between-group PCA Discriminant Analysis Discriminant Analysis of Principal Components 2/42

4 Genetic data: recall [figure: individuals × alleles table of frequencies (rows summing to 1 within each marker), with individuals falling into groups 1-3] How to define groups? How to handle group information? 4/42

6 Finding and using group information. Finding groups: hierarchical clustering (single linkage, complete linkage, UPGMA); K-means. Using group information: multivariate analysis of group frequencies; using groups as partitions; discriminant analysis. 5/42

10 A variety of algorithms: single linkage, complete linkage, UPGMA, Ward, ... 8/42

11 Rationale 1. compute pairwise genetic distances D (or similarities) 2. group the closest pair(s) together 3. (optional) update D 4. return to 2) until no new group can be made 9/42
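The four steps above can be sketched with SciPy's hierarchical clustering tools; the toy allele-frequency matrix and the choice of 3 clusters below are illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((6, 4))                 # 6 individuals x 4 alleles (toy data)

D = pdist(X)                           # 1. pairwise (Euclidean) distances
Z = linkage(D, method="average")       # 2.-4. iterative merging (here UPGMA)
groups = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 groups
print(groups)
```

Swapping `method="single"` or `method="complete"` reproduces the other linkage rules discussed next.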

15 Rationale [figure: individuals a-e × alleles table converted into a pairwise distance matrix D] 10/42

16 Differences between algorithms. For a point k and a merged group g = {i, j}: single linkage: D(k,g) = min(D(k,i), D(k,j)); complete linkage: D(k,g) = max(D(k,i), D(k,j)); UPGMA: D(k,g) = (D(k,i) + D(k,j)) / 2. 11/42
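The three cluster-distance updates can be written as one-line functions; a didactic sketch for the two-member case shown on the slide (real implementations update a whole distance matrix at each merge, and general UPGMA weights by cluster sizes):

```python
def single_linkage(d_ki, d_kj):
    """Distance from k to merged group g = {i, j}: the closer of the two."""
    return min(d_ki, d_kj)

def complete_linkage(d_ki, d_kj):
    """The farther of the two."""
    return max(d_ki, d_kj)

def upgma(d_ki, d_kj):
    """The average of the two (exact when i and j are singletons)."""
    return (d_ki + d_kj) / 2

print(single_linkage(2.0, 5.0), complete_linkage(2.0, 5.0), upgma(2.0, 5.0))
# prints: 2.0 5.0 3.5
```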

20 Differences between algorithms [figure: the same data clustered by single linkage, complete linkage, and UPGMA (average linkage), giving different dendrograms for points a-e] 12/42

22 Variability between and within groups [figure: individuals scattered in clusters, illustrating variability between groups vs. variability within groups] 14/42

23 K-means: underlying model. Univariate ANOVA model (ss: sum of squares): sst(x) = ssb(x) + ssw(x), with: k = 1, ..., K indexing the K groups; μ_k: mean of group k; μ: mean of all data; g_k: set of individuals in group k (i ∈ g_k means i is in group k); sst(x) = Σ_i (x_i − μ)²: total variation; ssb(x) = Σ_k Σ_{i∈g_k} (μ_k − μ)²: variation between groups; ssw(x) = Σ_k Σ_{i∈g_k} (x_i − μ_k)²: (residual) variation within groups. 15/42

27 K-means: underlying model. Extension to multivariate data (X ∈ R^{n×p}, n individuals × p alleles, with rows x_i): SST(X) = SSB(X) + SSW(X), with: μ_k: vector of means of group k; μ: vector of means of all data; SST(X) = Σ_i ||x_i − μ||²: total variation; SSB(X) = Σ_k Σ_{i∈g_k} ||μ_k − μ||²: variation between groups; SSW(X) = Σ_k Σ_{i∈g_k} ||x_i − μ_k||²: (residual) variation within groups. 16/42
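The decomposition SST(X) = SSB(X) + SSW(X) can be checked numerically; the toy matrix and group labels below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((9, 4))                   # 9 individuals x 4 alleles (toy data)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

mu = X.mean(axis=0)                      # grand mean vector
SST = ((X - mu) ** 2).sum()              # total variation
SSB = 0.0
SSW = 0.0
for k in np.unique(groups):
    Xk = X[groups == k]
    mu_k = Xk.mean(axis=0)
    SSB += len(Xk) * ((mu_k - mu) ** 2).sum()   # between-group part
    SSW += ((Xk - mu_k) ** 2).sum()             # within-group part
print(SST, SSB + SSW)                    # the two numbers coincide
```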

31 K-means rationale. Find K groups G = {g_1, ..., g_K} minimizing the within-group sum of squares (SSW): argmin over G = {g_1, ..., g_K} of Σ_k Σ_{i∈g_k} ||x_i − μ_k||². Note: since SST is fixed, this is equivalent to finding the groups maximizing the between-group sum of squares. 17/42

33 K-means algorithm. The K-means problem is solved by the following iterative algorithm: 1. select random group means (μ_k, k = 1, ..., K) 2. assign each individual x_i to the closest group mean 3. update the group means μ_k 4. go back to 2) until convergence (groups no longer change) 18/42
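A minimal NumPy sketch of this iteration (Lloyd's algorithm), assuming a toy 2-dimensional dataset with two well-separated groups; production code would use an optimised implementation:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd's algorithm following steps 1-4 above (didactic sketch)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]        # 1. random initial means
    for _ in range(n_iter):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                       # 2. assign to closest mean
        new_mu = np.array([X[labels == k].mean(axis=0)  # 3. update group means
                           if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                     # 4. stop once groups settle
            break
        mu = new_mu
    return labels, mu

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, mu = kmeans(X, 2)
print(labels)
```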

37 K-means algorithm [figure: individuals in allele space with group means at initialization, after step 1, and after step 2] 19/42

38 Which K? K-means does not identify the number of clusters (K); each K-means solution is a model with a likelihood; model selection can therefore be used to choose K. 20/42

41 Using the Bayesian Information Criterion (BIC). Defined as: BIC = −2 log(L) + k log(n), with: L: likelihood; k: number of parameters; n: number of observations (individuals). Smallest BIC = best model. 21/42
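The criterion itself is one line of code; the likelihoods and parameter counts below are made-up numbers, just to show that a better-fitting model can still lose on BIC once its extra parameters are penalised:

```python
import numpy as np

def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 log(L) + k log(n); the smallest value flags the best model."""
    return -2.0 * log_likelihood + n_params * np.log(n_obs)

# Model 2 fits better (higher likelihood) but pays for 7 extra parameters:
print(bic(-100.0, 5, 50))    # model 1
print(bic(-95.0, 12, 50))    # model 2: larger BIC, so model 1 is preferred
```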

42 K-means and BIC: example. Simulated data: 6 populations in an island model. [figure: value of BIC versus number of clusters] (Jombart et al. 2010, BMC Genetics) 22/42

45 Aggregating data by groups [figure: individuals × alleles table averaged within groups 1-4 to give a groups × alleles table] → multivariate analysis of group allele frequencies. 25/42

46 Analysing group data. Available methods: Principal Component Analysis (PCA) of the allele frequency table; genetic distances between populations + Principal Coordinates Analysis (PCoA); Correspondence Analysis (CA) of allele counts. Criticisms: individual information is lost; within-group diversity is neglected; CA can produce artefactual outliers. 26/42

49 Using groups as partitions [figure: data matrix X (individuals × alleles) alongside the matrix H in which groups 1-4 are coded as dummy (0/1) indicator vectors] 28/42

50 The multivariate ANOVA model. The data matrix X can be decomposed as: X = PX + (I − P)X, where: P is the projector onto the space spanned by H: P = H(HᵀDH)⁻¹HᵀD, D being a metric in Rⁿ; PX is an n × p matrix in which each observation is replaced by its group average. 29/42
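The projector P and its group-averaging effect can be verified directly; the uniform metric D = I/n and the toy data below are assumptions made for the sketch:

```python
import numpy as np

n = 6
groups = np.array([0, 0, 0, 1, 1, 1])
H = np.zeros((n, 2))
H[np.arange(n), groups] = 1            # dummy group indicators
D = np.eye(n) / n                      # uniform weights as the metric in R^n

P = H @ np.linalg.inv(H.T @ D @ H) @ H.T @ D
assert np.allclose(P @ P, P)           # P is idempotent, i.e. a projector

X = np.random.default_rng(2).random((n, 3))
print(P @ X)                           # every row is replaced by its group mean
```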

51 The multivariate ANOVA model. Variation partition (same as in K-means): VAR(X) = B(X) + W(X), where: VAR(X) = trace(XᵀDX); B(X) = trace(XᵀPᵀDPX); W(X) = trace(Xᵀ(I − P)ᵀD(I − P)X). 30/42

52 Between-group analysis. VAR(X) = B(X) + W(X). Classical PCA: decompose VAR(X), i.e. find u so that var(Xu) is maximum. Between-group PCA: decompose B(X), i.e. find u so that b(Xu) is maximum. 31/42
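A hedged sketch of between-group PCA: replace each individual by its group mean and diagonalise the resulting between-group covariance (uniform weights assumed); this is illustrative, not the exact code of any particular package.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 9, 4
X = rng.random((n, p))
X = X - X.mean(axis=0)                       # centre the allele table
groups = np.repeat([0, 1, 2], 3)

# Replace each individual by its group mean, then diagonalise the
# between-group covariance.
M = np.array([X[groups == k].mean(axis=0) for k in groups])
B = M.T @ M / n
evals, evecs = np.linalg.eigh(B)             # ascending eigenvalues
u = evecs[:, -1]                             # leading between-group axis
print(X @ u)                                 # individual scores on that axis
```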

54 Between-group PCA [figure: density of individual scores on PC1 under classical PCA vs. under between-group PCA] Between-group PCA looks at between-group variability. 32/42

56 PCA, between-group PCA, and Discriminant Analysis. VAR(X) = B(X) + W(X). Maximising different quantities: PCA maximises overall diversity (max var(Xu)); between-group PCA maximises group diversity (max b(Xu)); Discriminant Analysis maximises group separation (max b(Xu), min w(Xu)). 34/42

59 Discriminant Analysis [figure: densities on axis 1 under classical PCA (max. total diversity), between-group PCA (max. diversity between groups), and Discriminant Analysis (max. separation of groups)] 35/42

60 Technical issues. Discriminant Analysis requires XᵀDX to be invertible, which means: fewer variables than observations; uncorrelated variables. Genetic data: (almost) always (many) more alleles than individuals; allele frequencies are by definition correlated (they sum to 1 within each locus); linkage disequilibrium causes correlated alleles. 36/42
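The first requirement fails as soon as there are more alleles (columns) than individuals (rows), which is easy to demonstrate with a toy matrix: the cross-product cannot be of full rank.

```python
import numpy as np

X = np.random.default_rng(5).random((5, 8))  # 5 individuals, 8 alleles (p > n)
G = X.T @ X                                   # 8 x 8 matrix, but rank(G) <= 5
print(np.linalg.matrix_rank(G))               # < 8: G is singular, DA breaks down
```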

63 Discriminant Analysis of Principal Components (DAPC). A new method (Jombart et al. 2010, BMC Genetics); aim: adapt DA to genetic data; relies on data orthogonalisation/reduction using PCA. 38/42

64 Rationale of DAPC [figure: PCA reduces the n × p data to n × r principal components (orthogonal variables), with r × p principal axes (allele loadings); Discriminant Analysis of the components then yields n × s discriminant functions (synthetic variables)] 39/42

65 DAPC: summary. Discriminant Analysis requires: fewer variables than observations; uncorrelated variables. Advantages of DAPC: there are always fewer retained PCs than observations; PCs are uncorrelated; allele contributions can still be computed. 40/42
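The two-step rationale (PCA reduction, then discriminant analysis on the retained PCs) can be sketched in NumPy; the dimensions, the number of retained PCs (r = 4), and the toy data are all illustrative assumptions, not the adegenet implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, K = 12, 20, 3                      # more alleles (p) than individuals (n)
X = rng.random((n, p))
groups = np.repeat(np.arange(K), n // K)

# Step 1: PCA reduction to r uncorrelated principal components.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
r = 4                                    # retain r < n components
PC = U[:, :r] * s[:r]                    # principal components (scores)

# Step 2: discriminant analysis on the PCs: solve W^{-1} B u = lambda u
# in the reduced space, where W is now invertible.
mu = PC.mean(axis=0)
B = np.zeros((r, r))
W = np.zeros((r, r))
for k in range(K):
    Pk = PC[groups == k]
    dk = (Pk.mean(axis=0) - mu)[:, None]
    B += len(Pk) * (dk @ dk.T)           # between-group scatter
    Ck = Pk - Pk.mean(axis=0)
    W += Ck.T @ Ck                       # within-group scatter
evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
df1 = evecs[:, np.argmax(evals.real)].real   # first discriminant function
print(PC @ df1)                          # individual coordinates on DF1
```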

67 DAPC: example [figure: seasonal influenza (A/H3N2) data, PCA scatterplot with eigenvalues inset] 41/42

68 DAPC: example [figure: seasonal influenza (A/H3N2) data, DAPC scatterplot with eigenvalues inset] 42/42


More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

EDAMI DIMENSION REDUCTION BY PRINCIPAL COMPONENT ANALYSIS

EDAMI DIMENSION REDUCTION BY PRINCIPAL COMPONENT ANALYSIS EDAMI DIMENSION REDUCTION BY PRINCIPAL COMPONENT ANALYSIS Mario Romanazzi October 29, 2017 1 Introduction An important task in multidimensional data analysis is reduction in complexity. Recalling that

More information

Eigenvalues, Eigenvectors, and an Intro to PCA

Eigenvalues, Eigenvectors, and an Intro to PCA Eigenvalues, Eigenvectors, and an Intro to PCA Eigenvalues, Eigenvectors, and an Intro to PCA Changing Basis We ve talked so far about re-writing our data using a new set of variables, or a new basis.

More information

Matrices and Multivariate Statistics - II

Matrices and Multivariate Statistics - II Matrices and Multivariate Statistics - II Richard Mott November 2011 Multivariate Random Variables Consider a set of dependent random variables z = (z 1,..., z n ) E(z i ) = µ i cov(z i, z j ) = σ ij =

More information

Course topics (tentative) The role of random effects

Course topics (tentative) The role of random effects Course topics (tentative) random effects linear mixed models analysis of variance frequentist likelihood-based inference (MLE and REML) prediction Bayesian inference The role of random effects Rasmus Waagepetersen

More information

Applying cluster analysis to 2011 Census local authority data

Applying cluster analysis to 2011 Census local authority data Applying cluster analysis to 2011 Census local authority data Kitty.Lymperopoulou@manchester.ac.uk SPSS User Group Conference November, 10 2017 Outline Basic ideas of cluster analysis How to choose variables

More information

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Pearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies throughout the world

More information

Eigenvalues, Eigenvectors, and an Intro to PCA

Eigenvalues, Eigenvectors, and an Intro to PCA Eigenvalues, Eigenvectors, and an Intro to PCA Eigenvalues, Eigenvectors, and an Intro to PCA Changing Basis We ve talked so far about re-writing our data using a new set of variables, or a new basis.

More information

Multivariate analysis of genetic data: investigating spatial structures

Multivariate analysis of genetic data: investigating spatial structures Multivariate analysis of genetic data: investigating spatial structures Thibaut Jombart Imperial College London MRC Centre for Outbreak Analysis and Modelling August 19, 2016 Abstract This practical provides

More information

Introduction to Machine Learning

Introduction to Machine Learning 1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer

More information

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

Discrimination of Dyed Cotton Fibers Based on UVvisible Microspectrophotometry and Multivariate Statistical Analysis

Discrimination of Dyed Cotton Fibers Based on UVvisible Microspectrophotometry and Multivariate Statistical Analysis Discrimination of Dyed Cotton Fibers Based on UVvisible Microspectrophotometry and Multivariate Statistical Analysis Elisa Liszewski, Cheryl Szkudlarek and John Goodpaster Department of Chemistry and Chemical

More information

Selection on Multiple Traits

Selection on Multiple Traits Selection on Multiple Traits Bruce Walsh lecture notes Uppsala EQG 2012 course version 7 Feb 2012 Detailed reading: Chapter 30 Genetic vs. Phenotypic correlations Within an individual, trait values can

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Introduction Edps/Psych/Stat/ 584 Applied Multivariate Statistics Carolyn J Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN c Board of Trustees,

More information

FuncICA for time series pattern discovery

FuncICA for time series pattern discovery FuncICA for time series pattern discovery Nishant Mehta and Alexander Gray Georgia Institute of Technology The problem Given a set of inherently continuous time series (e.g. EEG) Find a set of patterns

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

X t = a t + r t, (7.1)

X t = a t + r t, (7.1) Chapter 7 State Space Models 71 Introduction State Space models, developed over the past 10 20 years, are alternative models for time series They include both the ARIMA models of Chapters 3 6 and the Classical

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Bayesian Decision and Bayesian Learning

Bayesian Decision and Bayesian Learning Bayesian Decision and Bayesian Learning Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 30 Bayes Rule p(x ω i

More information

Experimental design. Matti Hotokka Department of Physical Chemistry Åbo Akademi University

Experimental design. Matti Hotokka Department of Physical Chemistry Åbo Akademi University Experimental design Matti Hotokka Department of Physical Chemistry Åbo Akademi University Contents Elementary concepts Regression Validation Hypotesis testing ANOVA PCA, PCR, PLS Clusters, SIMCA Design

More information

Applied Multivariate Analysis

Applied Multivariate Analysis Department of Mathematics and Statistics, University of Vaasa, Finland Spring 2017 Dimension reduction Exploratory (EFA) Background While the motivation in PCA is to replace the original (correlated) variables

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Overview. Background

Overview. Background Overview Implementation of robust methods for locating quantitative trait loci in R Introduction to QTL mapping Andreas Baierl and Andreas Futschik Institute of Statistics and Decision Support Systems

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture16: Population structure and logistic regression I Jason Mezey jgm45@cornell.edu April 11, 2017 (T) 8:40-9:55 Announcements I April

More information

Vector Space Models. wine_spectral.r

Vector Space Models. wine_spectral.r Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components

More information

Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation)

Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) PCA transforms the original input space into a lower dimensional space, by constructing dimensions that are linear combinations

More information

Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson

Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson Linear & Non-Linear Discriminant Analysis! Hugh R. Wilson PCA Review! Supervised learning! Fisher linear discriminant analysis! Nonlinear discriminant analysis! Research example! Multiple Classes! Unsupervised

More information

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007 MIT OpenCourseWare http://ocw.mit.edu HST.582J / 6.555J / 16.456J Biomedical Signal and Image Processing Spring 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by

More information

High-dimensional Markov Chain Models for Categorical Data Sequences with Applications Wai-Ki CHING AMACL, Department of Mathematics HKU 19 March 2013

High-dimensional Markov Chain Models for Categorical Data Sequences with Applications Wai-Ki CHING AMACL, Department of Mathematics HKU 19 March 2013 High-dimensional Markov Chain Models for Categorical Data Sequences with Applications Wai-Ki CHING AMACL, Department of Mathematics HKU 19 March 2013 Abstract: Markov chains are popular models for a modelling

More information

Machine Learning (CS 567) Lecture 5

Machine Learning (CS 567) Lecture 5 Machine Learning (CS 567) Lecture 5 Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Statistical Analysis. G562 Geometric Morphometrics PC 2 PC 2 PC 3 PC 2 PC 1. Department of Geological Sciences Indiana University

Statistical Analysis. G562 Geometric Morphometrics PC 2 PC 2 PC 3 PC 2 PC 1. Department of Geological Sciences Indiana University PC 2 PC 2 G562 Geometric Morphometrics Statistical Analysis PC 2 PC 1 PC 3 Basic components of GMM Procrustes Whenever shapes are analyzed together, they must be superimposed together This aligns shapes

More information

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

PCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given

More information

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING

FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING FACTOR ANALYSIS AND MULTIDIMENSIONAL SCALING Vishwanath Mantha Department for Electrical and Computer Engineering Mississippi State University, Mississippi State, MS 39762 mantha@isip.msstate.edu ABSTRACT

More information

On the Variance of Eigenvalues in PCA and MCA

On the Variance of Eigenvalues in PCA and MCA On the Variance of in PCA and MCA Jean-Luc Durand jean-luc.durand@univ-paris13.fr Laboratoire d Éthologie Expérimentale et Comparée - LEEC EA 4443 Université Paris 13 Sorbonne Paris Cité Naples, CARME

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

ABTEKNILLINEN KORKEAKOULU

ABTEKNILLINEN KORKEAKOULU Two-way analysis of high-dimensional collinear data 1 Tommi Suvitaival 1 Janne Nikkilä 1,2 Matej Orešič 3 Samuel Kaski 1 1 Department of Information and Computer Science, Helsinki University of Technology,

More information

Lecture 2: Data Analytics of Narrative

Lecture 2: Data Analytics of Narrative Lecture 2: Data Analytics of Narrative Data Analytics of Narrative: Pattern Recognition in Text, and Text Synthesis, Supported by the Correspondence Analysis Platform. This Lecture is presented in three

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Multivariate analysis of genetic data: investigating spatial structures

Multivariate analysis of genetic data: investigating spatial structures Multivariate analysis of genetic data: investigating spatial structures Thibaut Jombart Imperial College London MRC Centre for Outbreak Analysis and Modelling March 26, 2014 Abstract This practical provides

More information

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS NOTES FROM PRE- LECTURE RECORDING ON PCA PCA and EFA have similar goals. They are substantially different in important ways. The goal

More information