Dimension Reduction: PCA, SVD, MDS, and clustering


Albert Doyle
6 years ago

1 Dimension Reduction: PCA, SVD, MDS, and clustering

Example: height of identical twins. [Figure: scatterplot of Twin 2 against Twin 1, each in inches away from average]

2 Expression between two ethnic groups. [Figures: histogram of p values (y-axis: Frequency); volcano plot of -log10(p value) against effect size]

Ethnicity is confounded with year. [Figure: year by ethnic group, ASN and CEU]

3 Two batches within ethnic groups. [Figures: histogram of p values (y-axis: Frequency); volcano plot of -log10(p value) against effect size]

Comparisons shown: males and females, and months (Female, Male; June, October).

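4 Finding an unknown batch

For gene $i$, the difference between the averages of two candidate batches is

$$(Y_{i,1},\dots,Y_{i,n})\,(1/n_1,\dots,1/n_1,-1/n_2,\dots,-1/n_2)^\top = \frac{1}{n_1}\sum_{j=1}^{n_1} Y_{i,j} - \frac{1}{n_2}\sum_{j=n_1+1}^{n_1+n_2} Y_{i,j}$$

Find $n_1$ and $n_2$ that make this difference large for many genes. More precisely, maximize:

$$\frac{1}{m}\sum_{i=1}^{m}\left\{\frac{1}{n_1}\sum_{j=1}^{n_1} Y_{i,j} - \frac{1}{n_2}\sum_{j=n_1+1}^{n_1+n_2} Y_{i,j}\right\}^2$$

Finding an unknown batch

More generally, let $v$ be any vector with mean 0 and variance 1, and find the $v$ that maximizes

$$\sum_{i=1}^{m}\left\{\sum_{j=1}^{n} Y_{i,j}\,v_j\right\}^2 = (Y_{m\times n}\,v_{n\times 1})^\top (Y_{m\times n}\,v_{n\times 1})$$

The $v$ that maximizes this variance is called the first principal component direction or eigenvector, and $Y_{m\times n}\,v_{n\times 1}$ is the first principal component.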
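The search for the direction $v$ that maximizes $(Yv)^\top(Yv)$ can be sketched with power iteration. This is only an illustration, not necessarily how the slides compute it: it assumes the constraint on $v$ is unit length (rather than the mean-0, variance-1 normalization stated above), and the `twins` data and the function name `first_pc_direction` are hypothetical.

```python
import math

def first_pc_direction(Y, iters=200):
    """Power iteration on Y'Y: approximates the unit-length v maximizing (Yv)'(Yv)."""
    m, n = len(Y), len(Y[0])
    v = [1.0] + [0.0] * (n - 1)  # arbitrary starting direction (assumption)
    for _ in range(iters):
        scores = [sum(Y[i][j] * v[j] for j in range(n)) for i in range(m)]  # Y v
        w = [sum(Y[i][j] * scores[i] for i in range(m)) for j in range(n)]  # Y'(Y v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # renormalize to unit length
    return v

# Hypothetical twin heights, in inches away from the average (rows: twin pairs).
twins = [(1.0, 0.8), (0.8, 1.0), (-1.0, -0.8), (-0.8, -1.0),
         (2.0, 1.5), (1.5, 2.0), (-2.0, -1.5), (-1.5, -2.0)]

v = first_pc_direction(twins)
first_pc = [sum(y * vj for y, vj in zip(row, v)) for row in twins]  # Y v
print(v)
```

For these symmetric toy data both entries of `v` come out equal, so the first principal component is proportional to the sum of the two twins' heights.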

5 Principal components

We can remove the variability explained by $v_1$, and find the vector $v_2$ that maximizes the variability in these residuals. By continuing this process we end up with $n$ eigenvectors: $V_{n\times n} = (v_1 \dots v_n)$.

Singular value decomposition (SVD)

SVD is a powerful mathematical approach that permits us to compute matrices $U$, $D$ and $V$ such that

$$Y_{m\times n} = U_{m\times n}\,D_{n\times n}\,V^\top_{n\times n}$$

and the columns of $V$ are the eigenvectors. $U$ and $V$ are both orthogonal matrices and $D$ is diagonal.

$U$ orthogonal means that the columns of $U$ are such that $U_i^\top U_i = 1$ and $U_i^\top U_j = 0$ for $i \neq j$. In other words, for centered columns and up to a constant scale factor, the sample standard deviation of each column is 1 and the sample correlation of any two columns is 0.

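6 RMSD from SVD. PMID 8588

Principal components from SVD

Notice that we can get the principal components from $U$ and $D$. Because $V^\top V = I$, multiplying $Y = UDV^\top$ on the right by $V$ gives

$$Y_{m\times n}\,V_{n\times n} = U_{m\times n}\,D_{n\times n}$$

and the variance comes from $D$:

$$(Y_{m\times n}V_{n\times n})^\top (Y_{m\times n}V_{n\times n}) = D\,U^\top U\,D = D^2_{n\times n}$$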
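The identities $YV = UD$ and $(YV)^\top(YV) = D^2$ can be checked numerically on a small case. The sketch below does not call a general SVD routine. It assumes a hypothetical, symmetric twin-height dataset for which the eigenvectors are known to be $(1,1)/\sqrt{2}$ and $(1,-1)/\sqrt{2}$, and the helper names `matmul` and `transpose` are illustrative.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical twin heights (inches away from average); rows are twin pairs.
Y = [[1.0, 0.8], [0.8, 1.0], [-1.0, -0.8], [-0.8, -1.0],
     [2.0, 1.5], [1.5, 2.0], [-2.0, -1.5], [-1.5, -2.0]]

s = 1 / math.sqrt(2)
V = [[s, s], [s, -s]]          # assumed eigenvectors (columns), valid for this symmetric toy data

PC = matmul(Y, V)              # principal components: Y V = U D
G = matmul(transpose(PC), PC)  # (Y V)'(Y V), expected to be diagonal and equal to D^2
D = [math.sqrt(G[k][k]) for k in range(2)]
U = [[row[k] / D[k] for k in range(2)] for row in PC]
UtU = matmul(transpose(U), U)  # expected to be the 2 x 2 identity
```

The off-diagonal entries of `G` are zero up to rounding, which is the statement that the principal components are uncorrelated.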

7 Example: height of identical twins. [Figure: scatterplot of Twin 2 against Twin 1, each in inches away from average]

Example: principal components. [Figure: second PC against first PC]

8 Example: eigenvectors and SDs. [Matrices $V$ and $D$ for the twin example; entries not recoverable]

Gene expression example. [Figure: first eigenvector, samples marked by batch]

9 Genetic heterogeneity. [Figure: scatterplot of two principal components; legend: Parents from cleft trios, Parents from control trios, Parents from HapMap CEU trios, Parents from HapMap JPT/CHB trios, Parents from HapMap YRI trios] PMID 5899

Genetic heterogeneity. PMID 8758

10 A heatmap: commons.wikimedia.org/wiki/file:heatmap.png

Another heatmap: www.fluidigm.com

11 Distance

Clustering organizes things that are close into groups. What does it mean for two genes to be close? What does it mean for two samples to be close? Once we know this, how do we define groups?

Distance in two dimensions. [Figure]

12 Gene expression

Subset of a 22,215 x 189 gene expression table. [Table not shown]

Distances

There are 17,766 pairs of samples for which we can compute a distance:

$$d(j,k) = \sqrt{\sum_{i=1}^{22{,}215} (X_{i,j} - X_{i,k})^2}$$

There are 246,742,005 pairs of genes for which we can compute a distance:

$$d(h,i) = \sqrt{\sum_{j=1}^{N} (X_{h,j} - X_{i,j})^2}$$
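The two distance definitions, and the pair counts, can be illustrated on a toy table. This is a minimal sketch: the 4 x 3 table `X` is hypothetical, and `euclidean` and `n_pairs` are illustrative helper names.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def n_pairs(n):
    """Number of unordered pairs among n items: n(n - 1) / 2."""
    return n * (n - 1) // 2

# Hypothetical 4-gene x 3-sample expression table (rows: genes, columns: samples).
X = [[5.0, 5.5, 9.0],
     [2.0, 2.5, 7.0],
     [8.0, 8.0, 1.0],
     [3.0, 3.5, 3.0]]

samples = list(zip(*X))  # one tuple per sample (column)

# d(j, k): distance between samples j and k, summing over genes.
sample_dist = {(j, k): euclidean(samples[j], samples[k])
               for j, k in combinations(range(len(samples)), 2)}

# d(h, i): distance between genes h and i, summing over samples.
gene_dist = {(h, i): euclidean(X[h], X[i])
             for h, i in combinations(range(len(X)), 2)}
```

For a table with 189 columns and 22,215 rows, `n_pairs` gives 17,766 sample pairs and 246,742,005 gene pairs.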

13 The similarity / distance matrices. [Figure: the N x G data matrix and the G x G gene similarity matrix] [ 0688 ]

The similarity / distance matrices. [Figure: the N x G data matrix and the N x N sample similarity matrix] [ 0688 ]

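14 Multidimensional scaling

We can find a linear transformation for the data, $Z = AX$, such that

$$\sqrt{\sum_{i=1}^{22{,}215} (X_{i,j} - X_{i,k})^2} \approx \sqrt{(Z_{1,j} - Z_{1,k})^2 + (Z_{2,j} - Z_{2,k})^2}$$

Multidimensional scaling. [Figure: mds[, 2] against mds[, 1]; labeled tissues include endometrium, hippocampus, placenta]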
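One way to obtain two-dimensional coordinates whose distances approximate the original ones is classical (Torgerson) MDS. The sketch below is an assumption about the technique, not necessarily what produced the `mds` object in the plot: it works from a distance matrix rather than from $Z = AX$ directly, extracts eigenvectors by power iteration with deflation, and the name `classical_mds` is hypothetical.

```python
import math

def classical_mds(D, k=2, iters=500):
    """Classical MDS: embed n points in k dimensions from an n x n distance matrix D."""
    n = len(D)
    D2 = [[d * d for d in row] for row in D]
    row_mean = [sum(r) / n for r in D2]
    total = sum(row_mean) / n
    # Double-centered matrix B = -1/2 * J * D^2 * J.
    B = [[-0.5 * (D2[i][j] - row_mean[i] - row_mean[j] + total) for j in range(n)]
         for i in range(n)]
    axes = []
    for c in range(k):
        v = [math.sin(i + 1 + c) for i in range(n)]  # deterministic start (assumption)
        for _ in range(iters):  # power iteration for the leading eigenvector of B
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        axes.append([math.sqrt(max(lam, 0.0)) * x for x in v])
        # Deflate: remove the component just found before looking for the next one.
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return [list(p) for p in zip(*axes)]

# Hypothetical points; only their pairwise distances are passed to the function.
pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
D = [[math.dist(p, q) for q in pts] for p in pts]
Z = classical_mds(D)
```

Because these toy points already lie in a plane, the distances between rows of `Z` reproduce `D`. With high-dimensional expression data the match is only approximate.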

15 Single cell RNAseq PMIDs 6586, 6059 Single cell RNAseq PMID 6995

16 Single cell RNAseq. PMID

Clustering approaches: Hierarchical Clustering; Partitioning (K-means). [ 0688 ]

17 K-means

We start with some data. For example, we are showing expression for two samples across several genes, or equivalently expression for two genes across several samples. This is a simplification. [Figure: Iteration = 0] [ 0688 ]

K-means

Choose K centroids. These are starting values that the user picks. There are some data-driven ways to do it. [Figure: Iteration = 0] [ 0688 ]

18 K-means

Make the first partition by finding the closest centroid for each point. This is where distance is used. [Figure: K-means iteration] [ 0688 ]

K-means

Now re-compute the centroids by taking the middle of each cluster. [Figure: K-means iteration] [ 0688 ]

19 K-means

Repeat until the centroids stop moving or until you get tired of waiting. [Figure: K-means iteration] [ 0688 ]

Hierarchical clustering algorithm

Say every point is its own cluster. Merge the closest points. Repeat.

Distance Between Two Sets of Points: Centroids; Single Linkage. [Figure]
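The K-means steps described earlier (choose starting centroids, assign each point to its closest centroid, recompute the centroids, repeat until they stop moving) can be sketched as follows. This is a minimal illustration assuming Euclidean distance and user-supplied starting centroids. The function name `kmeans`, the `max_iter` cap, and the data are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic K-means; `centroids` are the user-chosen starting values."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):  # "until you get tired of waiting"
        # Partition step: assign each point to its closest centroid.
        labels = [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the middle (mean) of its cluster.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical expression values for two genes measured on six samples.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Different starting centroids can lead to different final partitions, which is why the choice of starting values matters.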

20 Linkage

Single linkage defines the distance between clusters as the distance between the closest two points. Single linkage can lead to a lot of singleton clusters, and to clusters that look stringlike in high dimensions.

Complete linkage defines the distance between clusters as the distance between the farthest two points. Complete linkage tends to lead to more compact, spherical structures.

Average linkage is the average of all the pairwise distances between points in the two clusters. Average linkage is between single and complete linkage in terms of the type of clusters it outputs.

compbio.pbworks.com

A dendrogram. [Figure: Cluster Dendrogram, y-axis Height, leaves labeled by tissue, for example placenta]
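The three linkage rules, and the merge loop of hierarchical clustering, can be written out directly. This is a minimal sketch assuming Euclidean distance between points. The names `linkage_distance` and `agglomerate` and the example points are hypothetical, and the quadratic search over cluster pairs is chosen for clarity, not speed.

```python
import math
from itertools import combinations

def linkage_distance(A, B, method="average"):
    """Distance between two sets of points under single, complete or average linkage."""
    d = [math.dist(a, b) for a in A for b in B]
    if method == "single":      # closest two points
        return min(d)
    if method == "complete":    # farthest two points
        return max(d)
    if method == "average":     # mean of all pairwise distances
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(points, n_clusters, method="average"):
    """Start with every point as its own cluster and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method))
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i (i < j)
        del clusters[j]
    return clusters

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
single = linkage_distance(A, B, "single")
complete = linkage_distance(A, B, "complete")
average = linkage_distance(A, B, "average")

clusters = agglomerate([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)], 2, method="single")
```

Recording the linkage distance at each merge gives the heights drawn in a dendrogram.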

21 A heatmap PMID 97008

22 PMID PMID 97008

23 PMID PMID 97008

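24 Result

The distance is equivalent to the correlation when the data are standardized:

$$\frac{1}{M}\sum_{i=1}^{M}\left(\frac{X_i-\bar X}{s_X}-\frac{Y_i-\bar Y}{s_Y}\right)^2 = \frac{1}{M}\sum_{i=1}^{M}\left(\frac{X_i-\bar X}{s_X}\right)^2 + \frac{1}{M}\sum_{i=1}^{M}\left(\frac{Y_i-\bar Y}{s_Y}\right)^2 - 2\,\frac{1}{M}\sum_{i=1}^{M}\left(\frac{X_i-\bar X}{s_X}\right)\left(\frac{Y_i-\bar Y}{s_Y}\right) = 2(1-r)$$

Result

The difference in the averages can drive the distance:

$$\frac{1}{M}\sum_{i=1}^{M}(X_i-Y_i)^2 = \frac{1}{M}\sum_{i=1}^{M}\left\{(X_i-\bar X)-(Y_i-\bar Y)+(\bar X-\bar Y)\right\}^2$$

$$= \frac{1}{M}\sum_{i=1}^{M}\left\{(X_i-\bar X)-(Y_i-\bar Y)\right\}^2 + 2(\bar X-\bar Y)\,\frac{1}{M}\sum_{i=1}^{M}\left\{(X_i-\bar X)-(Y_i-\bar Y)\right\} + \frac{1}{M}\sum_{i=1}^{M}(\bar X-\bar Y)^2$$

$$= 2(1-r) + \frac{1}{M}\sum_{i=1}^{M}(\bar X-\bar Y)^2 = 2(1-r) + (\bar X-\bar Y)^2$$

(assuming $\frac{1}{M}\sum_{i=1}^{M}(X_i-\bar X)^2 = 1$, and likewise for $Y$). The middle term vanishes because deviations from the mean sum to zero.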
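Both results can be checked numerically. The sketch below assumes standard deviations computed with a divisor of M, so that the equalities hold exactly. The data and the helper names `standardize` and `mean_sq_dist` are hypothetical.

```python
import math

def mean(v):
    return sum(v) / len(v)

def standardize(v):
    """Center and scale using the SD with divisor M, matching the 1/M sums."""
    m = mean(v)
    s = math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    return [(x - m) / s for x in v]

def mean_sq_dist(x, y):
    """(1/M) * sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

# Hypothetical expression values for two genes across M = 5 samples.
x = [1.0, 2.0, 4.0, 7.0, 6.0]
y = [3.0, 3.5, 6.0, 6.5, 9.0]

zx, zy = standardize(x), standardize(y)
r = mean([a * b for a, b in zip(zx, zy)])  # correlation between x and y

# First result: for standardized data the distance is 2 * (1 - r).
d_standardized = mean_sq_dist(zx, zy)

# Second result: keep variance 1 but make the averages differ by `shift`.
shift = 3.0
zy_shifted = [b - shift for b in zy]        # mean of X minus mean of Y equals shift
d_shifted = mean_sq_dist(zx, zy_shifted)    # 2 * (1 - r) + shift ** 2
```

With `shift = 3.0` the squared mean difference contributes 9 to the distance, which dominates the `2 * (1 - r)` term since that term can never exceed 4.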

25 Four gene cluster, no mean removal. [Figure]

Four gene cluster, after mean removal. [Figure]

26 Simulation: only differentially expressed genes. [Figure: dendrogram, y-axis Height]

Simulation: all genes. [Figure: dendrogram, y-axis Height]

27 Batch effects. [Figure: Cluster Dendrogram, y-axis Height, leaves labeled by tissue, for example placenta]

Color represents tissue. [Figure: MDS plot, axes mds[, 1] and mds[, 2]; hippocampus labeled]

28 Color represents study. [Figure: MDS plot of hippocampus samples, axes mds[, 1] and mds[, 2]; legend: GSE, GSE90, GSE97, GSE6, GSE]

Null distribution of p-values. [Figure: histogram of p values, tt$pvalue]
