Dimension Reduction: PCA, SVD, MDS, and clustering. Example: height of identical twins. [ RI ]

Dimension Reduction: PCA, SVD, MDS, and clustering. Example: height of identical twins. [Scatterplot: Twin 1 height vs. Twin 2 height, both in inches away from the average.]

Expression between two ethnic groups. [Histogram of log10(p-values), distribution of p-values, and effect sizes for genes compared between the ASN and CEU groups.] Ethnicity is confounded with year. [Table: counts of ASN and CEU samples by processing year.]

Two batches within ethnic groups. [Histogram of log10(p-values), distribution of p-values, and effect sizes.] Males and females, processed months apart. [Table: counts of Female and Male samples in the June 2005 and October 2005 batches.]

Finding an unknown batch. For gene $i$, compare the average in one candidate batch with the average in the other:
$$(Y_{i,1},\dots,Y_{i,N})\,(1/n_1,\dots,1/n_1,\,-1/n_2,\dots,-1/n_2)' \;=\; \frac{1}{n_1}\sum_{j=1}^{n_1} Y_{i,j} \;-\; \frac{1}{n_2}\sum_{j=n_1+1}^{n_1+n_2} Y_{i,j}$$
Find $n_1$ and $n_2$ that make this difference large for many genes. More precisely, maximize $\frac{1}{m}\sum_{i=1}^{m} d_i^2$, where $d_i$ is the difference above for gene $i$.

Finding an unknown batch (continued). More generally, let $v$ be any vector with mean 0 and variance 1, and find the $v$ that maximizes
$$\frac{1}{m}\sum_{i=1}^{m}\left\{\sum_{j=1}^{n} Y_{i,j}\,v_j\right\}^2 \;=\; \frac{1}{m}\,(Y_{m\times n}\,v_{n\times 1})'\,(Y_{m\times n}\,v_{n\times 1})$$
The $v$ that maximizes this variance is called the first principal component direction, or eigenvector, and $Y_{m\times n}\,v_{n\times 1}$ is the first principal component.
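As a hedged illustration (not part of the original slides), the first principal component direction can be computed with R's built-in svd(); the matrix below is simulated and its dimensions are made up:

    # simulate an m x n matrix of expression values and center each gene (row)
    set.seed(1)
    m <- 1000; n <- 6
    Y <- matrix(rnorm(m * n), nrow = m)
    Y <- sweep(Y, 1, rowMeans(Y))      # remove each gene's mean

    s   <- svd(Y)
    v1  <- s$v[, 1]                    # first eigenvector / PC direction
    pc1 <- Y %*% v1                    # first principal component
    sum(pc1^2)                         # the maximized (unnormalized) variance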

Principal components. We can remove the variability explained by $v_1$ and find the vector $v_2$ that maximizes the variability in these residuals. By continuing this process we end up with $n$ eigenvectors: $V_{n\times n} = (v_1\ \cdots\ v_n)$.

Singular value decomposition (SVD). SVD is a powerful mathematical approach that permits us to compute matrices $U$, $D$ and $V$ such that
$$Y_{m\times n} \;=\; U_{m\times n}\,D_{n\times n}\,V'_{n\times n}$$
where the columns of $V$ are the eigenvectors, $U$ and $V$ are both orthogonal matrices, and $D$ is diagonal. $U$ orthogonal means that the columns of $U$ satisfy $U_i'U_i = 1$ and $U_i'U_j = 0$; in other words, the sample standard deviation of each column is 1 and the sample correlation of any two columns is 0.

RMSD from SVD. [Figure; PMID 8588.]

Principal components from SVD. Notice that we can get the principal components from $U$ and $D$:
$$Y_{m\times n}\,V_{n\times n} \;=\; U_{m\times n}\,D_{n\times n}$$
and their variance from $D$:
$$(Y_{m\times n}\,V_{n\times n})'\,(Y_{m\times n}\,V_{n\times n}) \;=\; D\,U'U\,D \;=\; D^2_{n\times n}$$
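A minimal sketch (an illustration, not course code) checking these identities in R with svd(), reusing the simulated matrix Y from the sketch above:

    s <- svd(Y)
    U <- s$u; D <- diag(s$d); V <- s$v

    max(abs(Y - U %*% D %*% t(V)))       # ~ 0 : Y = U D V'
    max(abs(Y %*% V - U %*% D))          # ~ 0 : principal components = U D
    round(t(Y %*% V) %*% (Y %*% V), 6)   # diagonal matrix D^2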

Example: height of identical twins. [Scatterplot: Twin 1 vs. Twin 2 height, inches away from the average.]

Example: principal components. [Scatterplot of the same data after rotation: First PC vs. Second PC.]

Example: eigenvectors and SDs.
$$V = \begin{pmatrix} 0.706 & -0.708 \\ 0.708 & 0.706 \end{pmatrix}, \qquad \sqrt{\tfrac{1}{m}}\,D = \begin{pmatrix} s_1 & 0 \\ 0 & s_2 \end{pmatrix}$$
The first column of $V$ is approximately $(1,1)/\sqrt{2}$, the average of the two twins, and the second is approximately the difference; $s_1$ and $s_2$ are the SDs of the two principal components.

Gene expression example. [Plot of the first eigenvector, with samples colored by batch 1 vs. batch 2.]
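To see where numbers like 0.706 and 0.708 come from, here is an illustrative sketch with prcomp() on simulated twin heights (the data are invented; only the structure mirrors the slide):

    set.seed(2)
    twin1 <- rnorm(100, sd = 3)
    twin2 <- twin1 + rnorm(100, sd = 1)   # second twin close to the first
    pca <- prcomp(cbind(twin1, twin2))
    pca$rotation   # columns ~ V: first PC ~ the average, second ~ the difference
    pca$sdev       # SDs of the PCs, i.e. diag(D) / sqrt(sample size - 1)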

Genetic heterogeneity. [Scatterplot of the first two principal components; points are parents from cleft trios, parents from control trios, and parents from HapMap CEU, JPT/CHB, and YRI trios. PMID 5899.]

Genetic heterogeneity. [Figure; PMID 8758.]

A heatmap. [Image: commons.wikimedia.org/wiki/File:Heatmap.png]

Another heatmap. [Image from www.fluidigm.com]

Distance. Clustering organizes things that are close into groups. What does it mean for two genes to be close? What does it mean for two samples to be close? Once we know this, how do we define groups?

Distance in two dimensions. [Figure.]

Gene expression. Subset of a 22,215 x 189 gene expression table.

Distances. There are 17,766 pairs of samples for which we can compute a distance:
$$d(j,k) = \sqrt{\sum_{i=1}^{22{,}215} (X_{i,j} - X_{i,k})^2}$$
There are 246,742,005 pairs of genes for which we can compute a distance:
$$d(h,i) = \sqrt{\sum_{j=1}^{N} (X_{h,j} - X_{i,j})^2}$$
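In R both kinds of distances can be computed with dist(); a small sketch, using a random matrix in place of the real 22,215 x 189 expression table:

    set.seed(3)
    X <- matrix(rnorm(500 * 10), nrow = 500)   # 500 "genes" x 10 "samples"

    d_samples <- dist(t(X))   # Euclidean distances between columns (samples)
    d_genes   <- dist(X)      # Euclidean distances between rows (genes)

    # the distance between samples 1 and 2, written out:
    sqrt(sum((X[, 1] - X[, 2])^2))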

The similarity / distance matrices. [Diagram: a G x N data matrix yields a G x G gene similarity matrix (comparing rows) and an N x N sample similarity matrix (comparing columns).]

Multidimensional scaling. We can find a linear transformation of the data, $Z = AX$, such that
$$\sqrt{\sum_{i=1}^{22{,}215} (X_{i,j} - X_{i,k})^2} \;\approx\; \sqrt{(Z_{1,j} - Z_{1,k})^2 + (Z_{2,j} - Z_{2,k})^2}$$

Multidimensional scaling. [MDS plot: mds[,1] vs. mds[,2]; samples labeled endometrium, hippocampus, placenta, etc.]
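A plot like this can be produced with classical MDS via cmdscale(); a minimal sketch, continuing with the random matrix X from the distance sketch above:

    d   <- dist(t(X))             # sample-to-sample distances
    mds <- cmdscale(d, k = 2)     # best 2-dimensional approximation
    plot(mds[, 1], mds[, 2],
         xlab = "First dimension", ylab = "Second dimension")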

Single cell RNA-seq. [Figures; PMIDs 6586, 6059.]

Single cell RNA-seq. [Figure; PMID 6995.]

Single cell RNA-seq. [Figure; PMID 600088.]

Hierarchical clustering vs. partitioning (K-means). [Diagram.]

K-means. We start with some data. For example, we are showing expression for two samples across many genes, or expression for two genes across many samples; this is a simplification. [Iteration = 0.]

K-means. Choose K centroids. These are starting values that the user picks; there are some data-driven ways to choose them. [Iteration = 0.]

K-means. Make the first partition by finding the closest centroid for each point; this is where distance is used. [K-means iteration figure.]

K-means. Now re-compute the centroids by taking the middle of each cluster. [K-means iteration figure.]

K-means. Repeat until the centroids stop moving, or until you get tired of waiting. [K-means iteration figure.]

Hierarchical clustering algorithm. Say every point is its own cluster; merge the closest points; repeat.

Distance between two sets of points: centroids, single linkage, ... [Diagram.]
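The iterative K-means procedure described above is essentially what R's kmeans() function does; here is a minimal sketch on invented two-dimensional toy data (K = 3 and the data themselves are assumptions for illustration only):

    set.seed(4)
    toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                 matrix(rnorm(100, mean = 3), ncol = 2))
    fit <- kmeans(toy, centers = 3, nstart = 25)   # nstart: several random starts
    fit$centers                                    # final centroids
    table(fit$cluster)                             # cluster sizes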

Linkage. Single linkage defines the distance between clusters as the distance between the closest two points; it can lead to a lot of singleton clusters, and to clusters that look string-like in high dimensions. Complete linkage defines the distance between clusters as the distance between the farthest two points; it tends to lead to more compact, spherical structures. Average linkage is the average of all the pairwise distances between points in the two clusters; it is between single and complete linkage in terms of the type of clusters it outputs. [Figure: compbio.pbworks.com]

A dendrogram. [Cluster dendrogram of the tissue samples (placenta, etc.), with height on the vertical axis.]
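A dendrogram like this can be produced with hclust(); the sketch below reuses the toy data assumed in the kmeans() sketch above and compares the three linkage rules just described:

    d <- dist(toy)
    plot(hclust(d, method = "single"))     # stringy clusters, many singletons
    plot(hclust(d, method = "complete"))   # more compact, spherical clusters
    plot(hclust(d, method = "average"))    # in between the other two
    cutree(hclust(d, method = "average"), k = 3)   # cut the tree into 3 clusters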

A heatmap. [Figure; PMID 97008.]

[Additional heatmap figures from the same study; PMID 97008.]

Result. The distance is equivalent to the correlation when the data are standardized:
$$\frac{1}{M}\sum_{i=1}^{M}\left(\frac{X_i - \bar X}{s_X} - \frac{Y_i - \bar Y}{s_Y}\right)^2
= \frac{1}{M}\sum_{i=1}^{M}\left(\frac{X_i - \bar X}{s_X}\right)^2
+ \frac{1}{M}\sum_{i=1}^{M}\left(\frac{Y_i - \bar Y}{s_Y}\right)^2
- \frac{2}{M}\sum_{i=1}^{M}\frac{X_i - \bar X}{s_X}\cdot\frac{Y_i - \bar Y}{s_Y}
= 2(1 - r)$$

Result. The difference in the averages can drive the distance:
$$\frac{1}{M}\sum_{i=1}^{M}(X_i - Y_i)^2
= \frac{1}{M}\sum_{i=1}^{M}\left[(X_i - \bar X) - (Y_i - \bar Y) + (\bar X - \bar Y)\right]^2$$
$$= \frac{1}{M}\sum_{i=1}^{M}\left[(X_i - \bar X) - (Y_i - \bar Y)\right]^2
+ \frac{2(\bar X - \bar Y)}{M}\sum_{i=1}^{M}\left[(X_i - \bar X) - (Y_i - \bar Y)\right]
+ \frac{1}{M}\sum_{i=1}^{M}(\bar X - \bar Y)^2$$
The middle term is zero because deviations from the mean sum to zero, so
$$= 2(1 - r) + \frac{1}{M}\sum_{i=1}^{M}(\bar X - \bar Y)^2 = 2(1 - r) + (\bar X - \bar Y)^2$$
(assuming $\frac{1}{M}\sum_{i=1}^{M}(X_i - \bar X)^2 = 1$, and similarly for $Y$).
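A quick numeric check of the first identity (a sketch, not course material; the data are simulated):

    set.seed(5)
    x <- rnorm(1000); y <- x + rnorm(1000)
    xs <- (x - mean(x)) / sd(x)
    ys <- (y - mean(y)) / sd(y)
    mean((xs - ys)^2)        # approximately 2 * (1 - r)
    2 * (1 - cor(x, y))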

Four-gene cluster, no mean removal. [Figure.]

Four-gene cluster, after mean removal. [Figure.]

Simulation: only differentially expressed genes. [Dendrogram; height on the vertical axis.]

Simulation: all genes. [Dendrogram; height on the vertical axis.]

Batch effects. [Cluster dendrogram of the tissue samples (placenta, etc.).]

Color represents tissue. [MDS plot: mds[,1] vs. mds[,2], with the hippocampus samples labeled.]

Color represents study. [MDS plot of the hippocampus samples: mds[,1] vs. mds[,2], colored by GEO study (GSE accession).]

Null distribution of p-values. [Histogram of tt$pvalue under the null.]