Fast algorithms for dimensionality reduction and data visualization


1 Fast algorithms for dimensionality reduction and data visualization
Manas Rachh, Yale University

2 Acknowledgements
George Linderman (Yale), Jeremy Hoskins (Yale), Stefan Steinerberger (Yale), Yuval Kluger (Yale), Vladimir Rokhlin (Yale), Mark Tygert (Facebook)

3 Introduction
Applications: single-cell RNA-sequencing (scRNA-seq), latent representations in deep learning, astronomy... and much more.
- t-SNE implementations scale poorly to large datasets (e.g. 8 hours for a dataset of 1 million points in 500-dimensional space).
- FFT-accelerated Interpolation-based t-SNE (FIt-SNE): faster t-SNE (30 min for the same dataset).
- Out-of-Core PCA (oocPCA): for datasets that don't fit in memory.

4 Applications: scRNA-seq
Bulk RNA-seq averages expression across all cells; single-cell RNA-seq measures expression in individual cells. Results are tabulated as an expression matrix: columns are genes (~30,000), rows are cells (~10^3 to 10^6).
[Figure: number of cells per study vs. year (Islam et al., Tang et al., Jaitin et al., Macosko et al., Dixit et al., 10X); the number of cells is growing rapidly.]

5 Applications: scRNA-seq
For example, t-SNE of 1.3 million brain cells (10X Genomics, 2016).

6 t-SNE Optimization
Input: d-dimensional dataset $X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^d$.
Output: s-dimensional embedding $Y = \{y_1, y_2, \ldots, y_N\} \subset \mathbb{R}^s$, $s \ll d$.
Goal: $x_i$ and $x_j$ close in the input space $\implies$ $y_i$ and $y_j$ are also close.
Affinities between points $x_i$ and $x_j$ in the input space (Gaussian):
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}.$$
Affinities between points $y_i$ and $y_j$ (Cauchy kernel):
$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}.$$
Minimize the Kullback-Leibler divergence
$$C(Y) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$
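For concreteness, here is a direct O(N^2) NumPy sketch of these quantities (our own illustration, not the FIt-SNE code; the bandwidths sigma_i are taken as given, though in practice they are chosen per point to hit a target perplexity):

```python
import numpy as np

def tsne_kl(X, Y, sigma):
    """Direct O(N^2) evaluation of the t-SNE objective C(Y) = KL(P || Q)."""
    N = X.shape[0]
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-D / (2 * sigma[:, None] ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)      # conditional affinities p_{j|i}
    P = (P + P.T) / (2 * N)                # symmetrized p_{ij}
    Q = 1.0 / (1.0 + np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()                           # Cauchy affinities q_{ij}
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```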

7 Gradient Descent
Minimize $C(Y)$ via gradient descent; $Z$ is a global normalization constant:
$$\frac{\partial C}{\partial y_i} = 4Z \sum_{j \neq i} (p_{ij} - q_{ij})\, q_{ij}\, (y_i - y_j), \qquad Z = \sum_{j=1}^{N} \sum_{\substack{l=1 \\ l \neq j}}^{N} \frac{1}{1 + \|y_l - y_j\|^2}.$$
Split into two parts:
$$\frac{\partial C}{\partial y_i} = \underbrace{4Z \sum_{j \neq i} p_{ij}\, q_{ij}\, (y_i - y_j)}_{F_{\mathrm{attr},i}} - \underbrace{4Z \sum_{j \neq i} q_{ij}^2\, (y_i - y_j)}_{F_{\mathrm{rep},i}}.$$
Direct calculation: $O(N^2)$.
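The same split in code, again as a direct O(N^2) reference sketch of ours (it assumes the symmetrized affinities P from the previous slide):

```python
import numpy as np

def tsne_gradient(P, Y):
    """Direct O(N^2) gradient: F_attr - F_rep, as in the split above."""
    D2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D2)                   # unnormalized Cauchy kernel q_ij * Z
    np.fill_diagonal(W, 0.0)
    Z = W.sum()
    Q = W / Z
    diff = Y[:, None, :] - Y[None, :, :]   # diff[i, j] = y_i - y_j
    F_attr = 4 * Z * np.sum((P * Q)[:, :, None] * diff, axis=1)
    F_rep = 4 * Z * np.sum((Q ** 2)[:, :, None] * diff, axis=1)
    return F_attr - F_rep
```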

8 Repulsion term - F_rep
$$F_{\mathrm{rep},k}(m) = \left. \sum_{\substack{l=1 \\ l \neq k}}^{N} \frac{y_l(m) - y_k(m)}{(1 + \|y_l - y_k\|^2)^2} \;\middle/\; \sum_{j=1}^{N} \sum_{\substack{l=1 \\ l \neq j}}^{N} \frac{1}{1 + \|y_l - y_j\|^2} \right. ,$$
i.e. combinations of
$$\sum_{j=1}^{N} K(y_i, y_j)\,\sigma_j, \qquad \text{where } K(y,z) = \frac{1}{1 + \|y - z\|^2} \text{ or } K(y,z) = \frac{1}{(1 + \|y - z\|^2)^2}.$$
Existing methods: tree codes / fast multipole methods (FMMs); L. Greengard, V. Rokhlin (1987).

9-12 FMM illustration
[Figure: FMM illustration, built up over four animation frames.]

13 FMM matrices
$$F_i = \sum_{j=1}^{N} K(y_i, z_j)\,\sigma_j$$
[Figure: the kernel matrix partitioned into blocks; off-diagonal blocks are low rank.]
$K(y,z)$ is singular when $y = z$, so the self-interaction blocks are full rank. Tree refinement strategy: $O(1)$ particles per leaf box.

14 Self-interaction - smooth kernels
t-SNE kernels are smooth even for $y = z$: even the self-interaction can be compressed!
[Figure: singular values $\sigma(k)$ of the kernel $1/(1 + \|y - z\|^2)$ sampled on the unit square, showing rapid decay.]

15 Polynomial interpolation based fast algorithms
Let $K_p(y,z)$ be the polynomial interpolant of $K(y,z)$ of order $p$, with interpolation nodes $\tilde y_l, \tilde z_m$ and Lagrange polynomials $L$; then
$$K_p(y,z) = \sum_{l=1}^{p^2} \sum_{m=1}^{p^2} K(\tilde y_l, \tilde z_m)\, L_{l,\tilde y}(y)\, L_{m,\tilde z}(z).$$
Replace
$$\phi_l = \sum_{j=1}^{N} K(y_l, z_j)\,\sigma_j \quad \text{with} \quad \tilde\phi_l = \sum_{j=1}^{N} K_p(y_l, z_j)\,\sigma_j.$$
Relative error:
$$\frac{|\phi_l - \tilde\phi_l|}{|\phi_l|} \lesssim \sup_{y,z} |K(y,z) - K_p(y,z)|.$$
For fixed tolerance $\varepsilon$, $p$ depends on the smoothness of $K$ and is independent of $N$.
Greengard, Rokhlin, Gimbutas, Ying, Darve, Zorin, Biros, Barnett, Ho, Gillman, Martinsson, ...

16
$$\tilde\phi_l = \sum_{j=1}^{N} K_p(y_l, z_j)\,\sigma_j = \sum_{m=1}^{p^2} \sum_{n=1}^{p^2} \sum_{j=1}^{N} K(\tilde y_m, \tilde z_n)\, L_{m,\tilde y}(y_l)\, L_{n,\tilde z}(z_j)\,\sigma_j = \sum_{m=1}^{p^2} L_{m,\tilde y}(y_l) \left( \sum_{n=1}^{p^2} K(\tilde y_m, \tilde z_n) \sum_{j=1}^{N} L_{n,\tilde z}(z_j)\,\sigma_j \right)$$
Step 1: $w_n = \sum_{j=1}^{N} L_{n,\tilde z}(z_j)\,\sigma_j$. Work: $O(N p^2)$.
Step 2: $v_m = \sum_{n=1}^{p^2} K(\tilde y_m, \tilde z_n)\, w_n$. Work: $O(p^4)$.
Step 3: $\tilde\phi_l = \sum_{m=1}^{p^2} L_{m,\tilde y}(y_l)\, v_m$. Work: $O(M p^2)$.
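Spelling out the three steps as code: below is a minimal 1-D analogue (our own sketch, not the FIt-SNE implementation) with p equispaced midpoint nodes on [0, 1]; in 1-D the node count is p rather than p^2, and the names lagrange_basis and interp_sum are ours.

```python
import numpy as np

def lagrange_basis(nodes, x):
    """L[n, i] = n-th Lagrange polynomial of the nodes, evaluated at x[i]."""
    L = np.ones((len(nodes), len(x)))
    for n, xn in enumerate(nodes):
        for m, xm in enumerate(nodes):
            if m != n:
                L[n] *= (x - xm) / (xn - xm)
    return L

def interp_sum(y, z, sigma, p=8):
    """phi[l] ~= sum_j K(y[l], z[j]) sigma[j] for a smooth kernel K."""
    K = lambda a, b: 1.0 / (1.0 + (a[:, None] - b[None, :]) ** 2) ** 2
    nodes = (np.arange(p) + 0.5) / p            # p equispaced midpoints
    w = lagrange_basis(nodes, z) @ sigma        # Step 1: spread,      O(N p)
    v = K(nodes, nodes) @ w                     # Step 2: node-node,   O(p^2)
    return lagrange_basis(nodes, y).T @ v       # Step 3: interpolate, O(M p)
```

For smooth K the result matches the direct O(NM) sum `K(y, z) @ sigma` to the interpolation accuracy, which is how a sketch like this can be sanity-checked.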

17 Algorithm illustration
[Figure: source points $z_j$ are spread to nodes $\tilde z$ (Step 1), node-to-node interactions $\tilde z \to \tilde y$ are computed (Step 2), and results are interpolated back to targets $y_i$ (Step 3).]

18 FFT accelerated interpolation based t-SNE (FIt-SNE)
- Subdivide the domain into $N_{int} \times N_{int}$ boxes.
- Given $\varepsilon$, determine $p$; use equispaced interpolation nodes.
- In each box $B_l$, compute effective charges at the interpolation nodes:
$$w_{n,l} = \sum_{y_j \in B_l} L_{n,\tilde y_l}(y_j)\,\sigma_j. \quad \text{Work: } O(N p^2).$$
- Interactions between equispaced nodes are computed via FFT (see the sketch after this slide):
$$v_{m,n} = \sum_{l=1}^{N_{int}^2} \sum_{j=1}^{p^2} K(\tilde y_{m,n}, \tilde y_{l,j})\, w_{l,j}. \quad \text{Work: } O\big((N_{int}\, p)^2 \log(N_{int}\, p)\big).$$
- Interpolate: for $y_i \in B_l$,
$$\phi_i = \sum_{m=1}^{p^2} L_{m,\tilde y_l}(y_i)\, v_{m,l}. \quad \text{Work: } O(N p^2).$$
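Because the nodes are equispaced, the node-to-node matrix depends only on coordinate differences (it is Toeplitz up to blocking), so it can be applied by circulant embedding and the FFT. Below is a minimal NumPy sketch of that step for a single n x n grid with spacing h; the function names are ours, and the actual FIt-SNE code additionally handles the per-box charge layout and several kernels at once.

```python
import numpy as np

def cauchy_sq_kernel(d2):
    """Squared-Cauchy kernel (1 + ||y - z||^2)^(-2) from the repulsion term."""
    return 1.0 / (1.0 + d2) ** 2

def grid_convolve_fft(w, h):
    """v[m] = sum_l K(x_m - x_l) w[l] on an n x n equispaced grid with
    spacing h, in O(n^2 log n) via embedding the Toeplitz operator in a
    (2n x 2n) circulant."""
    n = w.shape[0]
    # Signed offsets 0, h, ..., (n-1)h, *, -(n-1)h, ..., -h in FFT order
    dx = np.fft.fftfreq(2 * n, d=1.0 / (2 * n)) * h
    d2 = dx[:, None] ** 2 + dx[None, :] ** 2
    c = cauchy_sq_kernel(d2)               # kernel on the embedded torus
    wpad = np.zeros((2 * n, 2 * n))
    wpad[:n, :n] = w                       # zero-pad the charges
    v = np.fft.ifft2(np.fft.fft2(c) * np.fft.fft2(wpad)).real
    return v[:n, :n]                       # discard the padding region
```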

19 Choosing N_int and p
Large $p$ with equispaced nodes is unstable: the t-SNE kernels are archetypical examples of the Runge phenomenon. Boxes of side $L \approx 1.4$ with $p < 10$ work. For fixed accuracy, the product $N_{int}\, p$ is constant $\implies$ computational complexity $O(N p^2)$.

20 Runge phenomenon and equispaced interpolation
[Figure: interpolation error vs. $p$ for direct equispaced interpolation and for SVD-based compression, for box sizes L = 0.5, L = 1.5, ...]

21 Error estimates
1-D interpolation nodes:
$$\tilde x_j = -\frac{L}{2} + \left(j - \frac{1}{2}\right)\frac{L}{p}, \qquad j = 1, 2, \ldots, p,$$
with $f(x) = 1/(1+x^2)$ or $f(x) = 1/(1+x^2)^2$.
Interpolation error:
$$\left| f(x) - \sum_{j=1}^{p} L_{j,\{\tilde x_j\}}(x)\, f(\tilde x_j) \right| \le \frac{|f^{(p)}(\zeta)|}{p!} \underbrace{\prod_{j=1}^{p} |x - \tilde x_j|}_{\pi_p(x)}.$$
Estimates:
$$\|f^{(p)}\|_\infty \le \frac{(p+2)\, p!}{2}, \qquad \sup_x |\pi_p(x)| \le \frac{(2p)!}{2^{2p}\, p!} \left(\frac{L}{p}\right)^p.$$
Error in 1-D:
$$\left| f(x) - \sum_{j=1}^{p} L_{j,\{\tilde x_j\}}(x)\, f(\tilde x_j) \right| \le \frac{p+2}{2} \left(\frac{L}{2e}\right)^p e^{\frac{1}{24p}}.$$

22 Error estimates - II
In $d$ dimensions, $f(x) = 1/(1+\|x\|^2)$ or $f(x) = 1/(1+\|x\|^2)^2$. In $d$-dimensional interpolation, the estimates follow from the error estimates along lines. Interpolation error:
$$\left| f(x) - \sum_{j} L_{j,\{\tilde x_j\}}(x)\, f(\tilde x_j) \right| \le \frac{p+2}{2} \left(\frac{\sqrt{2d}\, L}{2e}\right)^p e^{\frac{1}{24p}}.$$
Not sharp for $d > 1$.

23 Algorithm Illustration - Step 1
$$\sum_{m=1}^{p^2} L_{m,\tilde y}(y_l) \left( \sum_{n=1}^{p^2} K(\tilde y_m, \tilde z_n) \underbrace{\left( \sum_{j=1}^{N} L_{n,\tilde z}(z_j)\,\sigma_j \right)}_{w_n} \right)$$
Spread.

24 Algorithm Illustration - Step 2
$$\sum_{m=1}^{p^2} L_{m,\tilde y}(y_l) \underbrace{\left( \sum_{n=1}^{p^2} K(\tilde y_m, \tilde z_n)\, w_n \right)}_{v_m}$$
FFT.

25 Algorithm Illustration - Step 3
$$\sum_{m=1}^{p^2} L_{m,\tilde y}(y_l)\, v_m$$
Interpolate.

26 Matrix decomposition
The matrix $K$ is block separable, and all submatrices are low rank: $K_{i,j} = U_i S_{i,j} U_j^T$, i.e.
$$K = U S U^T, \qquad U = \begin{pmatrix} U_1 & & \\ & \ddots & \\ & & U_{N_{int}^2} \end{pmatrix}, \qquad S = \begin{pmatrix} S_{1,1} & S_{1,2} & \cdots & S_{1,N_{int}^2} \\ S_{2,1} & S_{2,2} & \cdots & S_{2,N_{int}^2} \\ \vdots & & & \vdots \\ S_{N_{int}^2,1} & S_{N_{int}^2,2} & \cdots & S_{N_{int}^2,N_{int}^2} \end{pmatrix},$$
where $U_i$ is an $n_i \times p^2$ matrix and $S$ is (almost) Toeplitz.

27 Attractive forces - F_attr
$p_{ij} \propto \exp(-\|x_i - x_j\|^2/\sigma)$, so computing $p_{ij}$ is a local calculation. Attractive forces:
$$F_{\mathrm{attr},i} = \sum_{j \neq i} p_{ij}\, q_{ij}\, Z\, (y_i - y_j) \approx \sum_{j \in \mathrm{kNN}(i)} p_{ij}\, q_{ij}\, Z\, (y_i - y_j).$$
This is a one-time computation: the truncated $p_{ij}$ don't need to be recomputed at every iteration of gradient descent.
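A sketch of this sparse evaluation (our own, with names of our choosing), assuming the kNN-restricted affinities have already been assembled once into a SciPy sparse matrix P before gradient descent begins:

```python
import numpy as np
from scipy.sparse import coo_matrix

def attractive_forces(P, Y):
    """Returns sum_{j in kNN(i)} p_ij (q_ij Z) (y_i - y_j) for each i;
    multiply by 4 to get the attractive part of the gradient."""
    P = coo_matrix(P)                              # nonzeros only at kNN pairs
    diff = Y[P.row] - Y[P.col]                     # y_i - y_j per stored pair
    qz = 1.0 / (1.0 + np.sum(diff ** 2, axis=1))   # q_ij * Z (Cauchy kernel)
    contrib = (P.data * qz)[:, None] * diff
    F = np.zeros_like(Y)
    np.add.at(F, P.row, contrib)                   # accumulate over neighbors
    return F
```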

28 Nearest Neighbors
bhtsne: exact nearest neighbors, using vantage-point trees; slows down in high dimensions.
FIt-SNE: approximate nearest neighbors, using ANNOY (random projections).
Smoothing effect from using near (rather than exact nearest) neighbors? G. Linderman and S. Steinerberger (2017), arXiv.

29 oocPCA for Big Data
What if the dataset is extremely large? Computers without enough memory to load the data cannot visualize it; e.g. 1 million cells with 30,000 genes requires 240 GB!
Out-of-core implementation of randomized PCA: compute the top few (~50) principal components of a dataset without loading it entirely, so mundane computers can visualize/analyze the largest datasets.
[Table: oocPCA runtime (min) for computing the top 50 principal components under varying memory limits (GB).]
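A compact sketch of the two-pass randomized idea (our own simplification: no mean-centering or power iterations, which a production oocPCA would include; chunk_iter is a hypothetical callable that yields row blocks of X from disk):

```python
import numpy as np

def oocpca_sketch(chunk_iter, d, k=50, p=10, seed=0):
    """Approximate top-k SVD of an n x d matrix X streamed as row blocks.
    chunk_iter() must yield (row_offset, X_block) pairs; X is never held
    in memory all at once."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((d, k + p))        # random test matrix
    # Pass 1: Y = X @ omega, accumulated block by block
    Y = np.vstack([X_c @ omega for _, X_c in chunk_iter()])
    Q, _ = np.linalg.qr(Y)                         # orthonormal range basis
    # Pass 2: B = Q.T @ X, accumulated block by block ((k+p) x d)
    B = np.zeros((k + p, d))
    for off, X_c in chunk_iter():
        B += Q[off:off + X_c.shape[0]].T @ X_c
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_b)[:, :k], s[:k], Vt[:k]         # approximate top-k factors
```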

30 MNIST data
$10^6$ digit images from the Infinite MNIST data set. Late exaggeration is used to separate clusters more effectively. G. Linderman and S. Steinerberger (2017).

31 Retinal cells and t-SNE heatmaps
[Figure: 1-D t-SNE heatmaps (left) vs. 2-D t-SNE (right) for retinal cells, with per-cluster expression of VSX1, OPN1MW, and PECAM1.] Data from Macosko et al. (2016).

32 Numerical results - FIt-SNE
[Figure: runtime (hours) vs. number of points (10k to 1M) for 1-dimensional and 2-dimensional embeddings, comparing Barnes-Hut (BH) and FFT-accelerated interpolation (FI).]

33 Numerical results - Fast nearest neighbors
[Figure: runtime (minutes) vs. number of points (10k to 1M), across several input dimensions, comparing exact nearest neighbors (vptree) with approximate nearest neighbors.]

34 Summary
- We developed fast algorithms for data visualization and dimensionality reduction using t-SNE, roughly 15 times faster than the state of the art.
- We presented interpolation-based fast algorithms for N-body interactions with smooth kernels.
- Late exaggeration for better separation of clusters.
- Out-of-core PCA for visualizing extremely large datasets on laptops.
Github:

35 Future work
- Better convergence estimates and theoretical framework
- Different affinities for input and target spaces
- Fast multipole style multi-level schemes

36 Questions?

37 Questions? Thank you
