Multivariate analysis of genetic data exploring group diversity
1 Multivariate analysis of genetic data: exploring group diversity Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling, Imperial College London Genetic data analysis with PR Statistics, Millport Field Station, 18 Aug 1/30
2 Outline Introduction Identifying groups Hierarchical clustering K-means Exploring group diversity Aggregating data Optimizing group differences Discriminant Analysis of Principal Components 2/30
4 Genetic data: introducing group data [Figure: individuals × alleles table, with alleles grouped by marker; each individual's allele frequencies sum to 1 within a marker] How to identify groups? How to explore group diversity? 4/30
5 Genetic data: introducing group data [Figure: the same table with individuals partitioned into group 1, group 2 and group 3] How to identify groups? How to explore group diversity? 4/30
6 Outline Introduction Identifying groups Hierarchical clustering K-means Exploring group diversity Aggregating data Optimizing group differences Discriminant Analysis of Principal Components 5/30
7 Hierarchical clustering: a variety of algorithms single linkage, complete linkage, UPGMA, Ward... 6/30
8 Rationale 1. compute pairwise genetic distances D (or similarities) 2. group the closest pair(s) together 3. (optional) update D 4. return to 2) until no new group can be made 7/30
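The four steps above can be sketched in a few lines of Python; the pairwise distances below are made-up toy values, and single linkage is used when comparing clusters:

```python
import itertools

# toy pairwise distances between four individuals a-d (illustrative values)
D = {
    ("a", "b"): 1.0, ("a", "c"): 4.0, ("a", "d"): 5.0,
    ("b", "c"): 3.0, ("b", "d"): 6.0, ("c", "d"): 2.0,
}

def dist(x, y):
    # look up the symmetric distance between individuals x and y
    return D[(x, y)] if (x, y) in D else D[(y, x)]

def single_linkage(items):
    # step 1: each individual starts as its own cluster
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        # step 2: find the closest pair of clusters
        ci, cj = min(
            itertools.combinations(clusters, 2),
            key=lambda p: min(dist(x, y) for x in p[0] for y in p[1]),
        )
        # steps 3-4: merge them and repeat until no new group can be made
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci | cj]
        merges.append(ci | cj)
    return merges

result = single_linkage(["a", "b", "c", "d"])
print(result)
```

Here step 3 (updating D) happens implicitly: cluster-to-cluster distances are recomputed from member pairs at each pass.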
12 Rationale [Figure: individuals × alleles table for individuals a-e, converted into a pairwise distance matrix D] 8/30
13 Differences between algorithms For a new group g = {i, j} and an outside group k, the updated distance D(k, g) is: single linkage: D(k, g) = min(D(k, i), D(k, j)); complete linkage: D(k, g) = max(D(k, i), D(k, j)); UPGMA: D(k, g) = (D(k, i) + D(k, j)) / 2 9/30
17 Differences between algorithms [Figure: the same toy data clustered with single linkage, complete linkage and UPGMA (average linkage), giving three different dendrograms over individuals a-e] 10/30
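The three update rules can be checked numerically; D_ki and D_kj below are illustrative values for the distances from an outside cluster k to the two clusters i and j being merged:

```python
# distances from an outside cluster k to the two merged clusters i and j
# (illustrative toy values)
D_ki, D_kj = 3.0, 5.0

single = min(D_ki, D_kj)       # single linkage
complete = max(D_ki, D_kj)     # complete linkage
upgma = (D_ki + D_kj) / 2      # UPGMA (average linkage)

print(single, complete, upgma)   # 3.0 5.0 4.0
```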
18 K-means underlying model: ANOVA total var. = (var. between groups) + (var. within groups) [Figure: individuals plotted to contrast variability between groups with variability within groups] 11/30
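The ANOVA decomposition above can be verified numerically on toy data (one variable, six individuals, two groups; all values illustrative):

```python
import numpy as np

# one variable measured on six individuals in two groups (toy values)
x = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
g = np.array([0, 0, 0, 1, 1, 1])

grand = x.mean()
total = ((x - grand) ** 2).sum()                 # total sum of squares
between = sum((g == k).sum() * (x[g == k].mean() - grand) ** 2
              for k in (0, 1))                   # between-group part
within = sum(((x[g == k] - x[g == k].mean()) ** 2).sum()
             for k in (0, 1))                    # within-group part

print(total, between + within)   # 58.0 58.0 -- the two sides match
```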
19 K-means rationale Find groups which minimize the within-group variance (equivalently: maximize the between-group variance). In other words, identify a partition G = {g_1, ..., g_K} solving: argmin over G of Σ_k Σ_{i ∈ g_k} ||x_i − μ_k||², with: x_i ∈ R^p the vector of allele frequencies of individual i, and μ_k ∈ R^p the vector of mean allele frequencies of group k 12/30
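The K-means criterion can be evaluated directly for any candidate partition; the allele-frequency vectors and the partition below are toy values:

```python
import numpy as np

# toy allele-frequency vectors for four individuals (rows sum to 1)
X = np.array([[1.0, 0.0], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]])
groups = {0: [0, 1], 1: [2, 3]}   # one candidate partition G

def within_ss(X, groups):
    # sum over groups k and members i of ||x_i - mu_k||^2
    total = 0.0
    for members in groups.values():
        mu = X[members].mean(axis=0)            # group mean mu_k
        total += ((X[members] - mu) ** 2).sum()
    return total

print(within_ss(X, groups))
```

K-means searches over partitions for the one minimizing this quantity.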
22 K-means algorithm The K-means problem is solved by the following algorithm: 1. select random group means (μ_k, k = 1, ..., K) 2. assign each individual x_i to the closest group g_k 3. update the group means μ_k 4. go back to 2) until convergence (groups no longer change) 13/30
26 K-means algorithm [Figure: individuals in allele space, showing the initial group means, then assignment to the closest mean (step 1) and the mean update (step 2)] 14/30
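A minimal sketch of the four steps above, using fixed (rather than random) initial means so the toy run is reproducible:

```python
import numpy as np

def kmeans(X, init_means, max_iter=100):
    mu = np.asarray(init_means, dtype=float)  # step 1: initial group means
    labels = None
    for _ in range(max_iter):
        # step 2: assign each individual x_i to the closest group mean
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and (new_labels == labels).all():
            break                             # step 4: groups no longer change
        labels = new_labels
        # step 3: update the group means mu_k
        for k in range(len(mu)):
            if (labels == k).any():
                mu[k] = X[labels == k].mean(axis=0)
    return labels, mu

# toy data: two clear clusters of two individuals each
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
labels, mu = kmeans(X, init_means=[[0.0, 0.0], [1.0, 1.0]])
print(labels)   # [0 0 1 1]
```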
27 K-means: limitations and extensions Limitations: slow for large numbers of alleles (e.g. 100,000); K-means does not identify the number of clusters (K). Extensions: run K-means after dimension reduction using PCA; try increasing values of K; use the Bayesian Information Criterion (BIC) for model selection 15/30
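The "try increasing K, pick by BIC" idea can be sketched as follows. The within-group sums of squares (WSS) per K are assumed toy values, and the BIC formula used here (n·ln(WSS/n) + K·ln(n)) is one common approximation, not necessarily the exact criterion used by adegenet's find.clusters:

```python
import numpy as np

n = 100                                                     # toy sample size
wss = {1: 500.0, 2: 200.0, 3: 120.0, 4: 118.0, 5: 117.0}   # assumed WSS per K

def bic(w, K, n):
    # fit term n*ln(WSS/n) penalised by K*ln(n); lower is better
    return n * np.log(w / n) + K * np.log(n)

scores = {K: bic(w, K, n) for K, w in wss.items()}
best_K = min(scores, key=scores.get)
print(best_K)   # 3 -- the elbow: extra clusters past K=3 barely reduce WSS
```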
29 Genetic clustering using K-means & BIC (Jombart et al. 2010, BMC Genetics) Simulated data: island model with 6 populations Performance: K-means is comparable to STRUCTURE on simulated data (various island and stepping-stone models), and orders of magnitude faster (seconds vs hours/days) 16/30
31 Outline Introduction Identifying groups Hierarchical clustering K-means Exploring group diversity Aggregating data Optimizing group differences Discriminant Analysis of Principal Components 17/30
32 Why identifying clusters is not the whole story Example of cattle breed diversity (30 microsatellites, 704 individuals). [Figure: group membership probabilities for 15 breeds: Borgou, Somba, BretPieNoire, MaineAnjou, Zebu, Aubrac, Charolais, Montbeliard, Lagunaire, Bazadais, Gascon, Salers, NDama, BlondeAquitaine, Limousin] [Figure: multivariate analysis of the same breeds (scatterplot with DA eigenvalues)] Important to assess the relationships between clusters. 18/30
35 Aggregating data by groups [Figure: individuals × alleles table averaged within each of groups 1-4, giving a groups × alleles table of average allele frequencies] → multivariate analysis of group allele frequencies. 19/30
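Aggregation by group is just a per-group average of the allele-frequency rows; the data and group labels below are toy values:

```python
import numpy as np

# toy individuals × alleles frequency table and group labels
freqs = np.array([[1.0, 0.0], [0.8, 0.2],    # group 1
                  [0.2, 0.8], [0.0, 1.0]])   # group 2
group = np.array([1, 1, 2, 2])

# one row of mean allele frequencies per group
group_freqs = np.vstack([freqs[group == g].mean(axis=0)
                         for g in np.unique(group)])
print(group_freqs)
```

The resulting groups × alleles table is what the group-level multivariate analyses operate on.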
36 Analysing group data Available methods: Principal Component Analysis (PCA) of the allele frequency table; genetic distances between populations + Principal Coordinates Analysis (PCoA); Correspondence Analysis (CA) of allele counts. Criticisms: loses individual information; neglects within-group diversity; CA: possible artefactual outliers 20/30
38 Multivariate analysis: reminder Find principal components with maximum total variance. 21/30
42 But total variance may not reflect group differences Need to optimize different criteria. 22/30
45 Optimizing different criteria Approaches similar to PCA can be used to optimize different quantities: PCA: total variance; between-group PCA: variance between groups; within-group PCA: variance within groups; Discriminant Analysis: (variance between groups) / (variance within groups) 23/30
49 From PCA to DA: increasing group differentiation [Figure: densities of individuals on axis 1 for classical PCA (max. total diversity), between-group PCA (max. diversity between groups) and Discriminant Analysis (max. separation of groups)] 24/30
50 Discriminant Analysis: limitations and extensions Limitations: DA requires fewer variables (alleles) than observations (individuals); DA requires uncorrelated variables (no frequencies, no linkage disequilibrium). Discriminant Analysis of Principal Components (DAPC) [1]: data orthogonalisation/reduction using PCA before DA; overcomes the limitations of DA; gives group membership probabilities and group prediction. [1] Jombart et al. 2010, BMC Genetics 25/30
52 Rationale of DAPC [Figure: an n × p data table is reduced by PCA to n × r principal components (orthogonal variables), with p × r principal axes (allele loadings); Discriminant Analysis then turns the retained components into n × s discriminant functions (synthetic variables)] 26/30
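A hedged sketch of that pipeline in pure numpy: reduce with PCA, then run a discriminant analysis on the retained components. This is a two-group Fisher discriminant on simulated toy data, not the adegenet implementation:

```python
import numpy as np

# toy data: two groups of 20 individuals in 5 variables, well separated
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 5)),
               rng.normal(0.5, 0.1, size=(20, 5))])
y = np.array([0] * 20 + [1] * 20)

# PCA step: centre, then keep the top r orthogonal components
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
r = 2
scores = Xc @ Vt[:r].T            # n × r principal components

# DA step on the components: Fisher direction w ∝ Sw^-1 (mu1 - mu0)
m0 = scores[y == 0].mean(axis=0)
m1 = scores[y == 1].mean(axis=0)
Sw = np.cov(scores[y == 0].T) + np.cov(scores[y == 1].T)
w = np.linalg.solve(Sw, m1 - m0)

# project onto the discriminant function and classify by nearest group mean
proj = scores @ w
pred = (np.abs(proj - m1 @ w) < np.abs(proj - m0 @ w)).astype(int)
print((pred == y).mean())
```

With groups this well separated, the single discriminant function recovers the grouping perfectly.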
53 PCA of seasonal influenza (A/H3N2) data Data: seasonal influenza (A/H3N2), 500 HA segments. Little temporal evolution, burst of diversity in 2002?? 27/30
55 DAPC of seasonal influenza (A/H3N2) data Strong temporal signal, originality of 2006 isolates (new alleles). 28/30
57 Other features DAPC can also: provide group assignment probabilities; use supplementary individuals; predict group membership of new data; be used for variable selection 29/30
58 Time to get your hands dirty (again)! The pdf of the practical is online, or Google "adegenet documents". GDAR August 30/30
More informationMULTIVARIATE HOMEWORK #5
MULTIVARIATE HOMEWORK #5 Fisher s dataset on differentiating species of Iris based on measurements on four morphological characters (i.e. sepal length, sepal width, petal length, and petal width) was subjected
More informationData reduction for multivariate analysis
Data reduction for multivariate analysis Using T 2, m-cusum, m-ewma can help deal with the multivariate detection cases. But when the characteristic vector x of interest is of high dimension, it is difficult
More informationSpatial genetics analyses using
Practical course using the software Spatial genetics analyses using Thibaut Jombart Abstract This practical course illustrates some methodological aspects of spatial genetics. In the following we shall
More informationBasics of Multivariate Modelling and Data Analysis
Basics of Multivariate Modelling and Data Analysis Kurt-Erik Häggblom 6. Principal component analysis (PCA) 6.1 Overview 6.2 Essentials of PCA 6.3 Numerical calculation of PCs 6.4 Effects of data preprocessing
More informationCS 340 Lec. 6: Linear Dimensionality Reduction
CS 340 Lec. 6: Linear Dimensionality Reduction AD January 2011 AD () January 2011 1 / 46 Linear Dimensionality Reduction Introduction & Motivation Brief Review of Linear Algebra Principal Component Analysis
More informationAn application of the GAM-PCA-VAR model to respiratory disease and air pollution data
An application of the GAM-PCA-VAR model to respiratory disease and air pollution data Márton Ispány 1 Faculty of Informatics, University of Debrecen Hungary Joint work with Juliana Bottoni de Souza, Valdério
More informationMultivariate Analysis Cluster Analysis
Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Cluster Analysis System Samples Measurements Similarities Distances Clusters
More informationCovariance and Correlation Matrix
Covariance and Correlation Matrix Given sample {x n } N 1, where x Rd, x n = x 1n x 2n. x dn sample mean x = 1 N N n=1 x n, and entries of sample mean are x i = 1 N N n=1 x in sample covariance matrix
More informationOn the Variance of Eigenvalues in PCA and MCA
On the Variance of in PCA and MCA Jean-Luc Durand jean-luc.durand@univ-paris13.fr Laboratoire d Éthologie Expérimentale et Comparée - LEEC EA 4443 Université Paris 13 Sorbonne Paris Cité Naples, CARME
More information-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the
1 2 3 -Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1950's. -PCA is based on covariance or correlation
More informationRevision: Chapter 1-6. Applied Multivariate Statistics Spring 2012
Revision: Chapter 1-6 Applied Multivariate Statistics Spring 2012 Overview Cov, Cor, Mahalanobis, MV normal distribution Visualization: Stars plot, mosaic plot with shading Outlier: chisq.plot Missing
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More informationCluster Analysis (Sect. 9.6/Chap. 14 of Wilks) Notes by Hong Li
77 Cluster Analysis (Sect. 9.6/Chap. 14 of Wilks) Notes by Hong Li 1) Introduction Cluster analysis deals with separating data into groups whose identities are not known in advance. In general, even the
More informationResearch Methods II MICHAEL BERNSTEIN CS 376
Research Methods II MICHAEL BERNSTEIN CS 376 Goal Understand and use statistical techniques common to HCI research 2 Last time How to plan an evaluation What is a statistical test? Chi-square t-test Paired
More informationMODEL BASED CLUSTERING FOR COUNT DATA
MODEL BASED CLUSTERING FOR COUNT DATA Dimitris Karlis Department of Statistics Athens University of Economics and Business, Athens April OUTLINE Clustering methods Model based clustering!"the general model!"algorithmic
More informationGraph Functional Methods for Climate Partitioning
Graph Functional Methods for Climate Partitioning Mathilde Mougeot - with D. Picard, V. Lefieux*, M. Marchand* Université Paris Diderot, France *Réseau Transport Electrique (RTE) Buenos Aires, 2015 Mathilde
More informationDimension Reduction Methods
Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.
More informationLecture 2: Data Analytics of Narrative
Lecture 2: Data Analytics of Narrative Data Analytics of Narrative: Pattern Recognition in Text, and Text Synthesis, Supported by the Correspondence Analysis Platform. This Lecture is presented in three
More informationPCA, Kernel PCA, ICA
PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationApplied Multivariate Analysis
Department of Mathematics and Statistics, University of Vaasa, Finland Spring 2017 Dimension reduction Exploratory (EFA) Background While the motivation in PCA is to replace the original (correlated) variables
More informationDimensionality Reduction Using PCA/LDA. Hongyu Li School of Software Engineering TongJi University Fall, 2014
Dimensionality Reduction Using PCA/LDA Hongyu Li School of Software Engineering TongJi University Fall, 2014 Dimensionality Reduction One approach to deal with high dimensional data is by reducing their
More informationPattern Recognition 2
Pattern Recognition 2 KNN,, Dr. Terence Sim School of Computing National University of Singapore Outline 1 2 3 4 5 Outline 1 2 3 4 5 The Bayes Classifier is theoretically optimum. That is, prob. of error
More informationMultivariate Statistics Fundamentals Part 1: Rotation-based Techniques
Multivariate Statistics Fundamentals Part 1: Rotation-based Techniques A reminded from a univariate statistics courses Population Class of things (What you want to learn about) Sample group representing
More information5. Discriminant analysis
5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density
More informationIndependent Component Analysis and Its Applications. By Qing Xue, 10/15/2004
Independent Component Analysis and Its Applications By Qing Xue, 10/15/2004 Outline Motivation of ICA Applications of ICA Principles of ICA estimation Algorithms for ICA Extensions of basic ICA framework
More information