Dimension Reduc-on PCA, SVD, MDS, and clustering Example: height of iden-cal twins Twin (inches away from avg) 0 5 0 5 0 5 0 5 0 Twin (inches away from avg)
Expression between two ethnic groups Frequency 0 000 000 000 000 5000 6000 log0 (p value) 0 0 0 0 0 00 0 0 06 08 0 p values 0 Effect size Ethnicity is confounded with year Year ASN CEU 00 0 00 0 5 00 0 005 80 006 0
Two batches within ethnic groups Frequency 0 000 000 000 000 5000 6000 log0 (p value) 0 6 8 0 00 0 0 06 08 0 p values 0 Effect size males and females, months, 09 genes Female Male June 005 9 October 005 9
Finding an unknown batch i (Y i,,y i,n )(/n,,/n, /n,, /n )= n Xn j= Y i,j n nx +n j=n + Y i,j Find n and n that make this difference large for many genes More precisely, maximize: m mx i= i i= Finding an unknown batch More generally, let v be any vector with mean 0 and variance, find the v that maximizes 8 9 mx < nx = Y : i,j v j =(Y ; m n v n ) 0 (Y m n v n ) j= The v that maximizes this variance is called the first principal component direc0on or eigenvector, and Y m n v n is the first principal component
Principal components We can remove the variability explained by v, and find the vector v that maximizes the variability in these residuals By con0nuing this process we end up with n eigenvectors: v n n =(v v n ) Singular value decomposi-on (SVD) SVD is a powerful mathema0cal approach that permits us to compute matrices U, D and V such that and V are the eigenvectors U and V are both orthogonal matrices and D is diagonal Y m n = U m n D n n V 0 n n U orthogonal means that the columns of U are such that U 0 iu i = and U 0 iu j =0 In other words, the sample standard devia0on of each column is and the sample correla0on of any two columns is 0
RMSD from SVD PMID 8588 Principal components from SVD No0ce that we can get the principal components from U and D Y m n V n n = U m n D n n and the variance from D: (Y m n V n n ) 0 (Y m n V n n )=DU 0 UD = D n n
Example: height of iden-cal twins Twin (inches away from avg) 0 5 0 5 0 5 0 5 0 Twin (inches away from avg) Example: principal components Second PC 0 5 0 5 0 5 0 5 0 5 0 5 First PC
Example: eigenvectors and SDs V = 0706 0708 0708 0706 p p m D = 0 0 Gene expression example batch batch First eigenvector 0 0 00 0
Gene-c heterogeneity Principal Compenent Parents from cleft trios Parents from control trios Parents from HapMap CEU trios Parents from HapMap JPT/CHB trios Parents from HapMap YRI trios Principal Compenent PMID 5899 Gene-c heterogeneity PMID 8758
A heatmap commonswikimediaorg/wiki/file:heatmappng Another heatmap wwwfluidigmcom
Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be close? Once we know this, how do we define groups? Distance in two dimension
Gene expression Subset of a,5 x 89 gene expression table: Distances There are 7,776 pairs of samples for which we can compute a distance: d(j, k) = v u X t,5 i= (X i,j X i,k ) There are 6,7,005 pairs of genes for which we can compute a distance: d(h, i) = v ux t N (X h,j X i,j ) j=
The similarity / distance matrices N G G G DATA MATRIX GENE SIMILARITY MATRIX [ 0688 ] The similarity / distance matrices N N G N SAMPLE SIMILARITY MATRIX DATA MATRIX [ 0688 ]
Mul-dimensional scaling We can find a linear transforma0on for the data Z = AX such that v u t,5 X (X i,j X i,k ) i= q (Z,j Z,k ) +(Z,j Z,k ) Mul-dimensional scaling mds[, ] 00 50 0 50 endometriu hippocamp placenta 50 0 50 00 mds[, ]
Single cell RNAseq PMIDs 6586, 6059 Single cell RNAseq PMID 6995
Single cell RNAseq PMID 600088 Hierarchical Clustering Par00oning (K-means) [ 0688 ]
K-means We start with some data For example: We are showing expression for two samples for genes We are showing expression for two genes for samples This is simplifac0on Iteration = 0 [ 0688 ] K-means Choose K centroids These are star0ng values that the user picks There are some data driven ways to do it Iteration = 0 [ 0688 ]
Make first paron by finding the closest centroid for each point This is where distance is used K-means Iteration = [ 0688 ] Now re-compute the centroids by taking the middle of each cluster K-means Iteration = [ 0688 ]
Repeat un0l the centroids stop moving or un0l you get 0red of wai0ng K-means Iteration = [ 0688 ] Hierarchical clustering algorithm Say every point is its own cluster Merge closest points Repeat Distance Between Two Sets of Points Centroids Single Linkage
Linkage Single linkage defines the distance between clusters as the distance between the closest two points Single linkage can lead to a lot of singleton clusters, and to clusters that look stringlike in high dimensions Complete linkage defines the distance between clusters as the distance between the farthest two points Complete linkage tends to lead to more compact spherical structures Average linkage is the average of all the pairwise distances between points in the two clusters Average linkage is between single and complete linkage in terms of the type of clusters it outputs compbiopbworkscom A dendogram Height 0 50 00 50 00 50 Cluster Dendrogram placenta placenta placenta placenta placenta placenta
A heatmap PMID 97008
PMID 97008 PMID 97008
PMID 97008 PMID 97008
Result The distance is equivalent to the correla0on when the data are standardized M MX i= Xi X s X Y i Ȳ s Y = M MX i= Xi s X X + M MX i= Yi s Y Ȳ M MX i= Xi s X X Yi s Y Ȳ = ( r) Result The difference in the averages can drive the distance M MX (X i Y i ) = M i= MX i= (X i X) (Yi Ȳ )+( X Ȳ ) = M MX i= (X i X) (Yi Ȳ ) + ( X Ȳ ) M MX i= (X i X) (Yi Ȳ ) + M MX ( X Ȳ ) i= = ( r)+ M MX ( X Ȳ ) i= = ( r)+( X Ȳ ) MX M (assuming (X i X) =) i=
Four gene cluster no mean removal Four gene cluster awer mean removal
Simula-on only differen-ally expressed genes 6 8 0 Height Simula-on all genes 98 99 00 0 0 Height
Batch effects Height 0 50 00 50 00 50 Cluster Dendrogram placenta placenta placenta placenta placenta placenta Color represents -ssue mds[,] 00 50 0 50 hippocamp 60 0 0 0 0 0 60 mds[,]
Color represents study hippocamp mds[,] 00 50 0 50 GSE GSE90 GSE97 GSE6 GSE790 60 0 0 0 0 0 60 mds[,] Null distribu-on of p-values only p value 0 5000 0000 5000 00 0 0 06 08 0 tt$pvalue