t-SNE and its Theoretical Guarantee

t-SNE and its Theoretical Guarantee
Ziyuan Zhong, Columbia University
July 4, 2018

Overview
Timeline:
- PCA (Karl Pearson, 1901)
- Manifold learning (Isomap by Tenenbaum et al., LLE by Roweis and Saul, ..., 2000)
The following topics will be covered in today's presentation:
- SNE (Hinton and Roweis, 2002)
- t-SNE (van der Maaten and Hinton, 2008)
- Accelerating t-SNE (van der Maaten, 2014)
- A first step towards a theoretical guarantee for t-SNE (Linderman and Steinerberger, 2017)
- Accelerating t-SNE further (Linderman et al., 2017)
- A theoretical guarantee for t-SNE (Arora et al., 2018)
- Generalization of t-SNE to manifolds (Verma et al., preprint)

Preliminaries and Notation, Part 1
Matrix norms: $\|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}$, $\|A\|_2 = \sqrt{\lambda_{\max}(A^\top A)} = \sigma_{\max}(A)$, $\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2}$.
$\mathrm{Diam}(S)$ denotes the diameter of a bounded set $S \subset \mathbb{R}^s$, i.e., $\mathrm{Diam}(S) := \sup_{x,y \in S} \|x - y\|_2$.

Preliminaries and Notation, Part 2
In a clustering setting, $\pi : [n] \to [k]$ denotes the function that maps a data index $i \in [n]$ to its cluster index $\pi(i) \in [k]$; $i \sim j$ means $\pi(i) = \pi(j)$.
KL divergence: a measure of how dissimilar two distributions are. Let $p$ and $q$ be two joint probability distributions over pairs $(i, j)$; then
$\mathrm{KL}(p \,\|\, q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$

Problem Formulation (A Sketch)
Given a collection of points $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$, find a collection of points $Y = \{y_1, \dots, y_n\} \subset \mathbb{R}^{d'}$, where $d' \ll d$, such that the lower-dimensional embedding preserves the relationships among the points in the original space.

What is a good visualization? Illustration
Intuitive idea: help explore the inherent structure of the dataset (e.g., discover natural clusters, find linear relationships).

What is a good visualization for an embedding? Technical requirements
- Visualizability: usually 2D or 3D.
- Fidelity: relationships among points are preserved (i.e., similar points remain close; distinct points remain distinct).
- Scalability: can deal with large, high-dimensional data sets.

Motivation for a better visualization algorithm
For many real-world datasets with non-linear inherent structure (e.g., MNIST), both linear methods like PCA and classical manifold learning algorithms like Isomap and LLE fail.

SNE
G.E. Hinton and S.T. Roweis. Stochastic Neighbor Embedding. NIPS 2002.

SNE Summary
- Compute an N × N similarity matrix in the high-dimensional input space.
- Define an N × N similarity matrix in the low-dimensional embedding space.
- Define the cost function as the sum, over points, of the KL divergence between the two probability distributions.
- Iteratively learn the low-dimensional embedding by minimizing the cost function with gradient descent.

SNE
Similarity in the high-dimensional space:
$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\tau_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\tau_i^2)}$
where $\tau_i^2$ is the variance of the Gaussian distribution centered around $x_i$.
Similarity in the low-dimensional space ($\tau_i^2$ is set to $\frac{1}{2}$ for all $i$):
$q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$
The cost function is defined as
$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$
SNE focuses on local structure (small $p_{j|i}$: small penalty; large $p_{j|i}$: large penalty). The gradient is
$\frac{dC}{dy_i} = 2 \sum_j (y_i - y_j)(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})$
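
To make these formulas concrete, here is a minimal NumPy sketch (my own, not the authors' code; the function names are invented) of the conditional similarities and the SNE gradient for given bandwidths:

    import numpy as np

    def sne_conditional_P(X, tau2):
        """Row i holds p_{j|i} = exp(-||x_i-x_j||^2 / (2 tau_i^2)) / sum_{k != i} exp(-||x_i-x_k||^2 / (2 tau_i^2))."""
        D = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # pairwise squared distances
        P = np.exp(-D / (2.0 * tau2[:, None]))
        np.fill_diagonal(P, 0.0)                                # exclude k = i
        return P / P.sum(axis=1, keepdims=True)

    def sne_conditional_Q(Y):
        """Low-dimensional q_{j|i} with bandwidth tau_i^2 = 1/2, i.e. exp(-||y_i-y_j||^2), row-normalized."""
        D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
        Q = np.exp(-D)
        np.fill_diagonal(Q, 0.0)
        return Q / Q.sum(axis=1, keepdims=True)

    def sne_gradient(P, Q, Y):
        """dC/dy_i = 2 * sum_j (y_i - y_j)(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})."""
        M = (P - Q) + (P - Q).T
        return 2.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)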

How to choose the variance τ_i?
$\tau_i^2$ is the variance of the Gaussian centered around each high-dimensional datapoint $x_i$. In dense regions, a small $\tau_i$ is more appropriate. SNE performs a binary search for the value of $\tau_i$ that produces a $P_i$ with a fixed perplexity that is specified by the user. The perplexity can be interpreted as a smooth measure of the effective number of neighbors. It is defined as
$\mathrm{Perp}(P_i) = 2^{H(P_i)}$
where $H(P_i)$ is the Shannon entropy of $P_i$ measured in bits:
$H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$
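
A common way to implement this search is a bisection over the precision $\beta_i = 1/(2\tau_i^2)$ until $2^{H(P_i)}$ matches the target perplexity. A sketch (my own; the tolerance and iteration cap are arbitrary choices):

    import numpy as np

    def conditional_p_row(dist2_i, beta):
        """p_{j|i} for one point, given squared distances to the other points and beta = 1/(2 tau_i^2)."""
        p = np.exp(-dist2_i * beta)
        return p / p.sum()

    def find_beta(dist2_i, target_perplexity, tol=1e-5, max_iter=50):
        """Binary search for beta so that Perp(P_i) = 2^{H(P_i)} matches the target perplexity."""
        beta, lo, hi = 1.0, 0.0, np.inf
        target_H = np.log2(target_perplexity)
        for _ in range(max_iter):
            p = conditional_p_row(dist2_i, beta)
            H = -np.sum(p * np.log2(p + 1e-12))           # Shannon entropy in bits
            if abs(H - target_H) < tol:
                break
            if H > target_H:                               # too many effective neighbors -> sharpen the Gaussian
                lo, beta = beta, (beta * 2 if np.isinf(hi) else (beta + hi) / 2)
            else:                                          # too few effective neighbors -> widen it
                hi, beta = beta, (beta + lo) / 2
        return beta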

Result
The result of running the SNE algorithm on 3000 256-dimensional grayscale images of handwritten digits (not all points are shown).

Problem with SNE: the crowding problem
SNE suffers from the "crowding problem": the area of the 2D map that is available to accommodate moderately distant data points is not large enough compared with the area available to accommodate nearby data points.

t-SNE
L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. JMLR 2008.
- A symmetrized version of the SNE cost function with simpler gradients.
- A Student-t distribution rather than a Gaussian to compute the similarity in the low-dimensional space, which alleviates the crowding problem and the optimization problems of SNE.

t-SNE: Make the similarity symmetric
Recall
$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\tau_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\tau_i^2)}$
One might be tempted to define a symmetric similarity matrix in the high-dimensional space as
$p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\tau^2)}{\sum_{k \neq l} \exp(-\|x_k - x_l\|^2 / 2\tau^2)}$
However, an outlier $x_i$ in the high-dimensional space causes problems by making $p_{ij}$ very small for all $j$. Instead, define $p_{ij}$ by symmetrizing the two conditional probabilities:
$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$
In this way, $\sum_j p_{ij} > \frac{1}{2n}$ for every data point $x_i$, so each $x_i$ makes a significant contribution to the cost function. The main advantage of the symmetric form is the simpler gradient (shown later).

t-SNE: Fix crowding (illustration)

t-SNE: Fix crowding
Define
$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$
The cost function of t-SNE is now defined as
$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij} \log \frac{p_{ij}}{q_{ij}}$
The heavy tails of the normalized Student-t kernel allow dissimilar input objects $x_i$ and $x_j$ to be modeled by low-dimensional counterparts $y_i$ and $y_j$ that are far apart, because $q_{ij}$ is not very small for two embedded points that are far apart.
Note: since $q$ is what is being learned, the outlier problem does not exist in the low-dimensional space.
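
A sketch of the low-dimensional similarities with the Student-t kernel and of the cost (again my own minimal NumPy, not the reference implementation):

    import numpy as np

    def tsne_Q(Y):
        """Return q_ij, the unnormalized kernel W (= q_ij * Z), and Z = sum_{k != l} (1 + ||y_k - y_l||^2)^-1."""
        D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
        W = 1.0 / (1.0 + D)               # Student-t (Cauchy) kernel
        np.fill_diagonal(W, 0.0)
        Z = W.sum()
        return W / Z, W, Z

    def tsne_cost(P, Q):
        """KL(P || Q) = sum_{i != j} p_ij log(p_ij / q_ij)."""
        mask = P > 0
        return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12)))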

t-SNE: Gradient
The gradient of the cost function is
$\frac{dC}{dy_i} = 4 \sum_{j=1, j \neq i}^{n} (p_{ij} - q_{ij})(1 + \|y_i - y_j\|^2)^{-1}(y_i - y_j) = 4 \sum_{j \neq i} (p_{ij} - q_{ij})\, q_{ij} Z\, (y_i - y_j) = 4\Big(\sum_{j \neq i} p_{ij} q_{ij} Z (y_i - y_j) - \sum_{j \neq i} q_{ij}^2 Z (y_i - y_j)\Big) = 4(F_{\mathrm{attraction}} + F_{\mathrm{repulsion}})$
where $Z = \sum_{l \neq s} (1 + \|y_l - y_s\|^2)^{-1}$.
The derivation can be found in the appendix of the t-SNE paper. Exercise: there are two small errors in that derivation, but they cancel out, so the result is correct. Can you find them?
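
The same gradient, split into its attractive and repulsive parts; a sketch reusing tsne_Q from the previous block:

    def tsne_gradient(P, Y):
        """dC/dy_i = 4 [ sum_j p_ij q_ij Z (y_i - y_j) - sum_j q_ij^2 Z (y_i - y_j) ]."""
        Q, W, Z = tsne_Q(Y)                           # W[i, j] = q_ij * Z
        A = P * W                                     # attraction weights p_ij q_ij Z
        R = Q * W                                     # repulsion weights  q_ij^2 Z
        attraction = A.sum(axis=1, keepdims=True) * Y - A @ Y
        repulsion = R.sum(axis=1, keepdims=True) * Y - R @ Y
        return 4.0 * (attraction - repulsion)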

After Fix

t-SNE: Physics analogy (N-body system)

t-SNE: Early exaggeration
In the initial iterations, multiply every $p_{ij}$ by a coefficient $\alpha > 1$. This encourages the optimization to focus on modeling the large $p_{ij}$ by fairly large $q_{ij}$. The natural result is that tight, widely separated clusters form in the map, which makes it easier for the clusters to move around relative to each other in order to find a good global organization.
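
Putting the pieces together, a bare-bones t-SNE loop with early exaggeration might look like the following sketch (plain gradient descent with illustrative defaults; real implementations add momentum and adaptive gains, which are omitted here):

    import numpy as np

    def tsne(P, n_iter=1000, lr=100.0, alpha=12.0, exag_iter=250, seed=0):
        """P: symmetrized joint similarities, e.g. P = (P_cond + P_cond.T) / (2 * n). Returns a 2D embedding."""
        rng = np.random.default_rng(seed)
        n = P.shape[0]
        Y = 1e-4 * rng.standard_normal((n, 2))         # small random initialization
        for t in range(n_iter):
            P_eff = alpha * P if t < exag_iter else P  # early exaggeration phase
            Y = Y - lr * tsne_gradient(P_eff, Y)       # tsne_gradient from the earlier sketch
        return Y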

Result on MNIST

Comparison on MNIST

Caveats I
- Perplexity needs to be chosen carefully.
- Relative sizes (of clusters) in the embedding are usually not meaningful.

Caveats II
Global structure is preserved only sometimes.

Limitations
- Dimensionality reduction for other tasks: due to the heavy tail of the t-distribution, t-SNE does not preserve local structure as well when the embedding dimension is larger (say, 100).
- Curse of dimensionality: t-SNE employs Euclidean distances between near neighbors, so it implicitly depends on local linearity of the manifold.
- O(N^2) computational complexity: evaluating the joint distributions involves N(N-1) pairs of objects, so the method is limited to roughly 10k points.
- The perplexity, the number of iterations, and the magnitude of the early exaggeration parameter have to be chosen manually.

Accelerating t-SNE
L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. JMLR 2014.
Observation: many of the pairwise interactions between points are very similar.
Idea: approximate groups of similar interactions by a single interaction using a tree, and keep only O(uN) non-zero input similarities.
Result: the Barnes-Hut-SNE (tree-based) algorithm reduces the complexity to O(N log N) and can deal with up to tens of millions of data points.

Preliminary: Quadtree
A quadtree (in 2D) is defined as follows:
- Each node represents a rectangular cell with a particular center, width, and height. It stores the number of points inside the cell, N_cell, and their center of mass, y_cell.
- Non-leaf nodes have four children that split the cell into four smaller cells of the current embedding.
- The tree can be constructed in O(N log N) time by inserting the points one by one, splitting a leaf node whenever a second point is inserted into its cell, and updating y_cell and N_cell of all visited nodes.
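
A minimal quadtree of this kind might look like the following sketch (my own simplification; duplicate points would need special handling in a real implementation):

    import numpy as np

    class QuadNode:
        def __init__(self, center, half):
            self.center, self.half = np.asarray(center, float), float(half)  # square cell of side 2*half
            self.n_cell = 0                      # number of points in the cell
            self.com = np.zeros(2)               # center of mass y_cell
            self.point = None                    # single point stored at a leaf
            self.children = None                 # four sub-cells once the node is split

        def insert(self, y):
            y = np.asarray(y, float)
            self.com = (self.com * self.n_cell + y) / (self.n_cell + 1)
            self.n_cell += 1
            if self.children is None:
                if self.point is None:           # empty leaf: just store the point
                    self.point = y
                    return
                old, self.point = self.point, None   # occupied leaf: split and push both points down
                self._split()
                self._child_for(old).insert(old)
            self._child_for(y).insert(y)

        def _split(self):
            h = self.half / 2.0
            self.children = [QuadNode(self.center + np.array([dx, dy]) * h, h)
                             for dx in (-1, 1) for dy in (-1, 1)]

        def _child_for(self, y):
            return self.children[2 * int(y[0] > self.center[0]) + int(y[1] > self.center[1])]

A root covering the whole embedding can be built with, e.g., root = QuadNode(Y.mean(0), np.abs(Y - Y.mean(0)).max() + 1e-9), followed by root.insert(y) for each row y of Y.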

Approximating the high-dimensional similarities
Compute a sparse approximation by finding the 3u nearest neighbors of each point, where u is the perplexity of the conditional distributions. With $N_i$ denoting this neighbor set,
$p_{j|i} = \begin{cases} \frac{\exp(-\|x_i - x_j\|^2 / 2\tau_i^2)}{\sum_{k \in N_i} \exp(-\|x_i - x_k\|^2 / 2\tau_i^2)} & \text{if } j \in N_i \\ 0 & \text{otherwise} \end{cases}$
$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$
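
A sketch of these sparse input similarities (using scikit-learn's NearestNeighbors purely for illustration; any exact or approximate k-NN search would do, and a real implementation would store P as a sparse matrix rather than densely):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def sparse_joint_P(X, perplexity):
        """p_{j|i} restricted to the 3*perplexity nearest neighbors of each point, then symmetrized."""
        n, k = X.shape[0], int(3 * perplexity)
        dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        dist2, idx = dist[:, 1:] ** 2, idx[:, 1:]           # drop each point itself (column 0)
        P = np.zeros((n, n))
        for i in range(n):
            beta = find_beta(dist2[i], perplexity)          # bisection from the earlier sketch
            P[i, idx[i]] = conditional_p_row(dist2[i], beta)
        return (P + P.T) / (2.0 * n)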

The problem with the low-dimensional repulsion gradient
Recall the t-SNE gradient:
$\frac{dC}{dy_i} = 4(F_{\mathrm{attraction}} + F_{\mathrm{repulsion}}) = 4\Big(\sum_{j \neq i} p_{ij} q_{ij} Z (y_i - y_j) - \sum_{j \neq i} q_{ij}^2 Z (y_i - y_j)\Big)$
where $Z = \sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}$, so $q_{ij} Z = (1 + \|y_i - y_j\|^2)^{-1}$.
Problem: $F_{\mathrm{attraction}}$ can be computed in O(uN), but naive computation of $F_{\mathrm{repulsion}}$ is O(N^2).

Barnes-Hut approximation for the low-dimensional repulsion gradient
Solution: construct a quadtree (in 2D) on the current embedding and traverse it with DFS. At every node, decide whether the corresponding cell can be used as a summary of the contribution to $F_{\mathrm{repulsion}}$ of all points in that cell; if so, replace $\sum_{j \in \mathrm{cell}} q_{ij}^2 Z (y_i - y_j)$ by $N_{\mathrm{cell}}\, q_{i,\mathrm{cell}}^2 Z\, (y_i - y_{\mathrm{cell}})$. The quantities $q_{ij} Z$ and (an estimate of) $Z$ are accumulated along the way, so $F_{\mathrm{repulsion}} = \sum_j \frac{(q_{ij} Z)^2}{Z}(y_i - y_j)$ can be computed in O(N log N).
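
Using the QuadNode sketch from a few slides back, the unnormalized Barnes-Hut repulsion for one point might look like this (θ is the accuracy/speed trade-off discussed on the next slide; self-interaction handling is omitted):

    import numpy as np

    def bh_repulsion(node, y_i, theta=0.5):
        """Accumulate sum over cells of N_cell * (q_{i,cell} Z)^2 * (y_i - y_cell) and the partial Z estimate."""
        if node is None or node.n_cell == 0:
            return np.zeros(2), 0.0
        diff = y_i - node.com
        d2 = float(diff @ diff)
        r_cell = 2.0 * node.half * np.sqrt(2.0)           # diagonal of the square cell
        if node.children is None or (d2 > 0 and r_cell / np.sqrt(d2) < theta):
            w = 1.0 / (1.0 + d2)                          # q_{i,cell} * Z
            return node.n_cell * w * w * diff, node.n_cell * w
        force, Z = np.zeros(2), 0.0
        for child in node.children:
            f, z = bh_repulsion(child, y_i, theta)
            force, Z = force + f, Z + z
        return force, Z

Summing the second return value over all points i gives the estimate of Z; dividing each accumulated force by that Z yields the repulsive term of the gradient.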

Illustration of the BH approximation
A cell is used as a summary whenever $r_{\mathrm{cell}} / \|y_i - y_{\mathrm{cell}}\|_2 < \theta$, the condition proposed by Barnes and Hut (1986), where $r_{\mathrm{cell}}$ represents the length of the diagonal of the cell under consideration and θ is a threshold that trades off speed and accuracy (higher values of θ lead to faster but coarser approximations). Note that when θ = 0, all pairwise interactions are computed (equivalent to naive t-SNE).

Experiments, Part 1

Experiments, Part 2

Further speedup
Limitation: O(N log N) is still not fast enough when the dataset is very, very large (e.g., some datasets in biology have on the order of 1,000,000,000 data points).
Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding. Linderman et al. arXiv 2017.
Solution: further approximation to achieve O(N).
Summary: use ANNOY to search for nearest neighbors in the high-dimensional similarity approximation, and use interpolation-based methods to approximate the low-dimensional similarities.
Result: more than 10x speedup on 1D and 2D visualization tasks.
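
For scale, the approximate neighbor search can be delegated to a library such as ANNOY; a sketch with illustrative parameter values:

    import numpy as np
    from annoy import AnnoyIndex

    def approximate_neighbors(X, k, n_trees=50):
        """Indices of the k approximate nearest neighbors of every row of X (Euclidean metric)."""
        n, d = X.shape
        index = AnnoyIndex(d, 'euclidean')
        for i in range(n):
            index.add_item(i, X[i])
        index.build(n_trees)
        # each query usually returns the point itself first, so ask for k+1 and drop it
        return np.array([index.get_nns_by_item(i, k + 1)[1:] for i in range(n)])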

A step towards a theoretical guarantee for t-SNE
Clustering with t-SNE, provably. George C. Linderman, Stefan Steinerberger. arXiv 2017.
Contribution: at a high level, this paper shows that points in the same cluster move towards each other.
Limitation: the result is insufficient to establish that t-SNE succeeds in finding a full visualization, as it does not rule out multiple clusters merging into each other.

First theoretical guarantee for t-SNE
An Analysis of the t-SNE Algorithm for Data Visualization. Sanjeev Arora, Wei Hu, Pravesh K. Kothari. COLT 2018.
Contribution: the first provable guarantees on t-SNE for computing visualizations of clusterable data. The proof builds on results from Linderman and Steinerberger's paper.
Proof technique: they obtain an update rule for the centroids of the embeddings of the underlying clusters and show that the distance between distinct centroids remains lower-bounded whenever the data is γ-spherical and γ-well-separated. Combined with the shrinkage result for points in the same cluster, this implies that t-SNE outputs a full visualization of the data.

Preliminary
A distribution $D$ with density function $f$ on $\mathbb{R}^d$ is said to be log-concave if $\log f$ is a concave function. Examples: Gaussian, uniform on any convex set. $D$ is said to be isotropic if its covariance is $I$.
A mixture of $k$ log-concave distributions is described by $k$ positive mixing weights $w_1, \dots, w_k$ (with $\sum_{l=1}^{k} w_l = 1$) and $k$ log-concave distributions $D_1, \dots, D_k$ on $\mathbb{R}^d$. To sample a point from this model, pick cluster $l$ with probability $w_l$ and draw $x$ from $D_l$.

Formalizing visualization
Given a collection of points $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$, suppose there exists a ground-truth clustering described by a partition $C_1, \dots, C_k$ of $[n]$ into $k$ clusters. A visualization is a 2-dimensional embedding $Y = \{y_1, \dots, y_n\} \subset \mathbb{R}^2$ of $X$, where each $x_i \in X$ is mapped to the corresponding $y_i \in Y$. Intuitively, a cluster $C_l$ in the original data is visualized if the corresponding points in the 2-dimensional embedding $Y$ are well separated from all the rest.

Formalizing visualization
Definition 1.1 (Visible cluster). Let $Y$ be a 2-dimensional embedding of a dataset $X$ with ground-truth clustering $C_1, \dots, C_k$. Given $\epsilon \geq 0$, a cluster $C_l$ in $X$ is said to be $(1-\epsilon)$-visible in $Y$ if there exist $P, P_{\mathrm{err}} \subseteq [n]$ such that:
1. $|(P \setminus C_l) \cup (C_l \setminus P)| \leq \epsilon |C_l|$, i.e., the numbers of false-positive and false-negative points are small relative to the size of the ground-truth cluster;
2. for every $i, i' \in P$ and $j \in [n] \setminus (P \cup P_{\mathrm{err}})$, $\|y_i - y_{i'}\| \leq \frac{1}{2}\|y_i - y_j\|$, i.e., apart from some mistakenly embedded points, the other clusters are far away from the current cluster.
In such a case, we say that $P$ $(1-\epsilon)$-visualizes $C_l$ in $Y$.
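
Definition 1.1 can be checked mechanically; a brute-force O(n^2) sketch (my own naming, purely illustrative):

    import numpy as np

    def is_visible(Y, cluster, P, P_err, eps):
        """Check whether the index set P (1 - eps)-visualizes `cluster` in the embedding Y (Definition 1.1)."""
        C, P, P_err = set(cluster), set(P), set(P_err)
        if len(P - C) + len(C - P) > eps * len(C):             # condition 1: few false positives/negatives
            return False
        outside = [j for j in range(len(Y)) if j not in P and j not in P_err]
        for i in P:                                            # condition 2: P is tight relative to the rest
            for i2 in P:
                for j in outside:
                    if np.linalg.norm(Y[i] - Y[i2]) > 0.5 * np.linalg.norm(Y[i] - Y[j]):
                        return False
        return True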

Formalizing visualization
Definition 1.2 (Visualization). Let $Y$ be a 2-dimensional embedding of a dataset $X$ with ground-truth clustering $C_1, \dots, C_k$. Given $\epsilon \geq 0$, we say that $Y$ is a $(1-\epsilon)$-visualization of $X$ if there exists a partition $P_1, \dots, P_k, P_{\mathrm{err}}$ of $[n]$ such that:
(i) for each $i \in [k]$, $P_i$ $(1-\epsilon)$-visualizes $C_i$ in $Y$;
(ii) $|P_{\mathrm{err}}| \leq \epsilon n$, i.e., the proportion of mistakenly embedded points is small.
When $\epsilon = 0$, we call $Y$ a full visualization of $X$.

Definition of well-separated and spherical data
Definition 1.4 (Well-separated, spherical data). Let $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ be clusterable data with $C_1, \dots, C_k$ defining the individual clusters such that $|C_l| \geq 0.1(n/k)$ for each $l \in [k]$. We say that $X$ is γ-spherical and γ-well-separated if for some $b_1, \dots, b_k > 0$ we have:
1. γ-spherical: for any $l \in [k]$ and $i, j \in C_l$ ($i \neq j$), $\|x_i - x_j\|^2 \geq \frac{b_l}{1+\gamma}$, and for any $i \in C_l$, $|\{j \in C_l \setminus \{i\} : \|x_i - x_j\|^2 \leq b_l\}| \geq 0.51 |C_l|$; i.e., for any point, points from the same cluster are not too close to it, and at least half of them are not too far away.
2. γ-well-separated: for any $l, l' \in [k]$ ($l \neq l'$), $i \in C_l$ and $j \in C_{l'}$, we have $\|x_i - x_j\|^2 \geq (1 + \gamma \log n) \max\{b_l, b_{l'}\}$; i.e., for any point, points from other clusters are far away.

Main result
Theorem 3.1. Let $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ be γ-spherical and γ-well-separated clusterable data with $C_1, \dots, C_k$ defining $k$ individual clusters of size at least $0.1(n/k)$, where $k \ll n^{1/5}$. Choose $\tau_i^2 = \frac{\gamma}{4} \min_{j \in [n] \setminus \{i\}} \|x_i - x_j\|^2$ for each $i \in [n]$, step size $h = 1$, and any early exaggeration coefficient $\alpha$ satisfying $k^2 \sqrt{n \log n} \ll \alpha \ll n$. Let $Y^{(T)}$ be the output of t-SNE after $T = \Theta(\frac{n \log n}{\alpha})$ iterations on input $X$ with the above parameters. Then, with probability at least 0.99 over the choice of the initialization, $Y^{(T)}$ is a full visualization of $X$.
As γ becomes smaller, t-SNE requires more separation among points in the same cluster but less separation between individual clusters in $X$ in order to succeed in finding a full visualization of $X$.
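
The bandwidths prescribed by Theorem 3.1 are straightforward to compute; a small sketch (γ must be supplied, and α is assumed to be chosen separately in the stated range):

    import numpy as np

    def theorem_3_1_bandwidths(X, gamma):
        """tau_i^2 = (gamma / 4) * min_{j != i} ||x_i - x_j||^2, as prescribed by Theorem 3.1."""
        D = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
        np.fill_diagonal(D, np.inf)
        return (gamma / 4.0) * D.min(axis=1)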

Main result
Corollary 3.2. Let $X = \{x_1, \dots, x_n\}$ be generated i.i.d. from a mixture of $k$ Gaussians $N(\mu_i, I)$ whose means $\mu_1, \dots, \mu_k$ satisfy $\|\mu_l - \mu_{l'}\| = \Omega(d^{1/4})$ for any $l \neq l'$ (where $d$ is the dimension of the original space). Let $Y$ be the output of the t-SNE algorithm with early exaggeration when run on input $X$ with the parameters from Theorem 3.1. Then, with high probability over the draw of $X$ and the choice of the random initialization, $Y$ is a full visualization of $X$.

Proof Road-map

Recall: h is the step size and α is the early exaggeration coefficient.
Remarks: for any point, (i) most points from the same cluster are not far away; (ii) no point from the same cluster is too close to it; (iii) all points from other clusters are far away from it; (iv) is needed for the math in Lemma 3.6 to work out.

Proof of Theorem 3.1
Proof. Lemma 3.3 identifies sufficient conditions on the pairwise affinities $p_{ij}$ that imply that t-SNE outputs a full visualization, and Lemma 3.4 shows that the $p_{ij}$ computed for γ-spherical, γ-well-separated data satisfy the requirements of Lemma 3.3. Combining Lemmas 3.3 and 3.4 therefore gives Theorem 3.1.

Lemmas needed for Lemma 3.3

Proof of Lemma 3.3
Using Lemmas 3.5 and 3.6, we know that after $T = \Theta(\frac{\log(1/\epsilon)}{\delta\eta})$ iterations, for any $i, j \in [n]$ we have:
if $i \sim j$, then $\|y_i^{(T)} - y_j^{(T)}\| \leq \mathrm{Diam}(\{y_l^{(T)} : l \in C_{\pi(i)}\}) = O(\frac{\epsilon}{\delta\eta})$;
if $i \not\sim j$, then
$\|y_i^{(T)} - y_j^{(T)}\| \geq \|\mu_{\pi(i)}^{(T)} - \mu_{\pi(j)}^{(T)}\| - \|y_i^{(T)} - \mu_{\pi(i)}^{(T)}\| - \|y_j^{(T)} - \mu_{\pi(j)}^{(T)}\| \geq \|\mu_{\pi(i)}^{(T)} - \mu_{\pi(j)}^{(T)}\| - \mathrm{Diam}(\{y_l^{(T)} : l \in C_{\pi(i)}\}) - \mathrm{Diam}(\{y_l^{(T)} : l \in C_{\pi(j)}\}) \geq \Omega(\frac{1}{k^2\sqrt{n}}) - O(\frac{\epsilon}{\delta\eta}) - O(\frac{\epsilon}{\delta\eta}) = \Omega(\frac{1}{k^2\sqrt{n}}) \gg O(\frac{\epsilon}{\delta\eta})$

Recall Lemma 3.6

Lemmas needed for Lemma 3.6
Random initialization ensures that the cluster centroids are initially well separated with high probability (Lemma 3.7). The centroid of each cluster moves by no more than $\epsilon$ in each of the first $\frac{0.01}{\epsilon}$ iterations (Lemma 3.9).

Proof of Lemma 3.6
By condition (iv) in Lemma 3.3 ($\frac{\epsilon \log(1/\epsilon)}{\delta\eta} \ll \frac{1}{k^2\sqrt{n}}$), we have $O(\frac{\log(1/\epsilon)}{\delta\eta}) \ll \frac{1}{k^2\sqrt{n}\,\epsilon} < \frac{0.01}{\epsilon}$. Hence we can apply Lemma 3.9 for all $t \leq T$.
Lemma 3.7 says that the initial distance between $\mu_l^{(0)}$ and $\mu_{l'}^{(0)}$ is at least $\Omega(\frac{1}{k^2\sqrt{n}})$ with high probability. Lemma 3.9 says that after each iteration every centroid moves by at most $\epsilon$, so the distance between any two centroids changes by at most $2\epsilon$. We conclude that after $T$ rounds, with high probability,
$\|\mu_l^{(T)} - \mu_{l'}^{(T)}\| \geq \Omega(\frac{1}{k^2\sqrt{n}}) - T \cdot 2\epsilon = \Omega(\frac{1}{k^2\sqrt{n}}) - O(\frac{\epsilon\log(1/\epsilon)}{\delta\eta}) = \Omega(\frac{1}{k^2\sqrt{n}})$,
where the last step is due to condition (iv) in Lemma 3.6.

Recall Lemma 3.7
It suffices to prove that $|(\mu_l)_1 - (\mu_{l'})_1| = \Omega(\frac{1}{k^2\sqrt{n}})$ for all $l \neq l'$.

Lemma needed for Lemma 3.7
Remark: this is similar to the Central Limit Theorem. It says that the CDF of the (standardized) average of i.i.d. random variables is close to the CDF of the standard normal.

Proof of Lemma 3.7, Part 1
Consider a fixed $l \in [k]$. Note that the $(y_i)_1$ are i.i.d. with uniform distribution over $[-0.01, 0.01]$, which clearly has zero mean and finite second and third absolute moments. Since $(\mu_l)_1 = \frac{1}{|C_l|}\sum_{i \in C_l} (y_i)_1$, using the Berry-Esseen theorem we know that
$|F(x) - \Phi(x)| \leq O(1/\sqrt{|C_l|})$
where $F$ is the CDF of $\frac{(\mu_l)_1 \sqrt{|C_l|}}{\sigma}$ ($\sigma$ is the standard deviation of the uniform distribution over $[-0.01, 0.01]$), and $\Phi$ is the CDF of $N(0, 1)$.

Proof of Lemma 3.7, Part 2
It follows that for any fixed $a \in \mathbb{R}$ and $b > 0$, we have:
$\Pr\big[|(\mu_l)_1 - a| \leq \frac{b}{k^2\sqrt{|C_l|}}\big] = \Pr\big[\frac{a\sqrt{|C_l|} - b/k^2}{\sigma} \leq \frac{(\mu_l)_1\sqrt{|C_l|}}{\sigma} \leq \frac{a\sqrt{|C_l|} + b/k^2}{\sigma}\big] = F\big(\frac{a\sqrt{|C_l|} + b/k^2}{\sigma}\big) - F\big(\frac{a\sqrt{|C_l|} - b/k^2}{\sigma}\big) \leq \Phi\big(\frac{a\sqrt{|C_l|} + b/k^2}{\sigma}\big) - \Phi\big(\frac{a\sqrt{|C_l|} - b/k^2}{\sigma}\big) + O\big(\frac{1}{\sqrt{|C_l|}}\big) = \int_{\frac{a\sqrt{|C_l|} - b/k^2}{\sigma}}^{\frac{a\sqrt{|C_l|} + b/k^2}{\sigma}} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx + O(\sqrt{k/n})$
Remark: the last equality is by the definition of the CDF of the standard normal distribution and the assumption that $|C_l| \geq 0.1\frac{n}{k}$.

Proof of Lemma 3.7, Part 3
$\int_{\frac{a\sqrt{|C_l|} - b/k^2}{\sigma}}^{\frac{a\sqrt{|C_l|} + b/k^2}{\sigma}} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx + O(\sqrt{k/n}) \leq \frac{1}{\sqrt{2\pi}} \cdot \frac{2b}{k^2\sigma} + O(\sqrt{k/n})$
From $k \ll n^{1/5}$ we have $\sqrt{k/n} \ll 1/k^2$. Therefore, letting $b$ be a sufficiently small constant, for any $a \in \mathbb{R}$ we can ensure
$\Pr\big[|(\mu_l)_1 - a| \leq \frac{b}{k^2\sqrt{|C_l|}}\big] \leq \frac{0.01}{k^2}$

Proof of Lemma 3.7, Part 4
For any $l \neq l'$, because $(\mu_l)_1$ and $(\mu_{l'})_1$ are independent, we can let $a = (\mu_{l'})_1$:
$\Pr\big[|(\mu_l)_1 - (\mu_{l'})_1| \leq \frac{b}{k^2\sqrt{|C_l|}}\big] \leq \frac{0.01}{k^2}$
The above inequality holds for any $l, l' \in [k]$ ($l \neq l'$). Taking a union bound over all $l$ and $l'$, we know that with probability at least 0.99 we have
$|(\mu_l)_1 - (\mu_{l'})_1| \geq \frac{b}{k^2\sqrt{|C_l|}} = \Omega\big(\frac{1}{k^2\sqrt{n}}\big)$
for all $l, l' \in [k]$ ($l \neq l'$) simultaneously (recall $|C_l| \geq 0.1(n/k)$).

Claim needed for Lemma 3.9
where $\epsilon_i^{(t)} := \alpha h \sum_{j \neq i} p_{ij}\, q_{ij}^{(t)} Z^{(t)} (y_j^{(t)} - y_i^{(t)}) - h \sum_{j \neq i} (q_{ij}^{(t)})^2 Z^{(t)} (y_j^{(t)} - y_i^{(t)})$

Proof of Lemma 3.9, Part 1
Taking the average of $y_i^{(t+1)}$ over $i \in C_l$, we obtain:
$\frac{1}{|C_l|}\sum_{i \in C_l} y_i^{(t+1)} = \frac{1}{|C_l|}\sum_{i \in C_l} y_i^{(t)} + \frac{h}{|C_l|}\sum_{i \in C_l}\sum_{j \neq i} (\alpha p_{ij} - q_{ij}^{(t)})\, q_{ij}^{(t)} Z^{(t)} (y_j^{(t)} - y_i^{(t)}) = \frac{1}{|C_l|}\sum_{i \in C_l} y_i^{(t)} + \frac{h}{|C_l|}\sum_{i \in C_l}\sum_{j \in C_l, j \neq i} (\alpha p_{ij} - q_{ij}^{(t)})\, q_{ij}^{(t)} Z^{(t)} (y_j^{(t)} - y_i^{(t)}) + \frac{h}{|C_l|}\sum_{i \in C_l}\sum_{j \notin C_l} (\alpha p_{ij} - q_{ij}^{(t)})\, q_{ij}^{(t)} Z^{(t)} (y_j^{(t)} - y_i^{(t)})$
The within-cluster double sum cancels pairwise: the $(i, j)$ and $(j, i)$ terms are opposite, since $p_{ij}$ and $q_{ij}^{(t)}$ are symmetric.

Proof of Lemma 3.9, Part 2
Thus we have:
$\|\mu_l^{(t+1)} - \mu_l^{(t)}\| = \Big\|\frac{h}{|C_l|}\sum_{i \in C_l}\sum_{j \notin C_l} (\alpha p_{ij} - q_{ij}^{(t)})\, q_{ij}^{(t)} Z^{(t)} (y_j^{(t)} - y_i^{(t)})\Big\| \leq \frac{h}{|C_l|}\sum_{i \in C_l}\sum_{j \notin C_l} (\alpha p_{ij} + q_{ij}^{(t)})\, q_{ij}^{(t)} Z^{(t)} \|y_j^{(t)} - y_i^{(t)}\|$
Since $t \leq \frac{0.01}{\epsilon}$ and $\alpha h \sum_{j \notin C_l} p_{ij} + \frac{h}{n} \leq \epsilon$ for all $i \in C_l$ (condition (iii) in Lemma 3.3), we can apply Claim 3.10 and get:
$\|\mu_l^{(t+1)} - \mu_l^{(t)}\| \leq \frac{h}{|C_l|}\sum_{i \in C_l}\sum_{j \notin C_l} \Big(\alpha p_{ij} + \frac{1}{0.9\, n(n-1)}\Big) \cdot 0.06 \leq \frac{0.06}{|C_l|}\sum_{i \in C_l}\Big(\alpha h \sum_{j \notin C_l} p_{ij} + \frac{h}{0.9\, n}\Big) \leq \frac{0.06}{0.9}\,\epsilon \leq \epsilon$

Results on mixtures of log-concave distributions
The results for mixtures of isotropic Gaussian or isotropic log-concave distributions can be generalized to mixtures of (non-isotropic) log-concave distributions.

Results on partial visualization via t-SNE

Limitations
- The requirement $k \ll n^{1/5}$ can potentially be improved.
- There is an implicit assumption that $k^2 \sqrt{n} > 100$.

Generalizing t-SNE
Stochastic Neighbor Embedding under f-divergences. Verma et al. Preprint.

Results

Results

Results

Summary
- SNE → t-SNE (fixes the crowding problem).
- t-SNE → BH-tSNE (O(N^2) → O(N log N)).
- BH-SNE → FIt-SNE (O(N log N) → O(N)).
- t-SNE guarantee (clusters shrink) → more t-SNE guarantees (centroids stay well separated: full visualization for mixtures of well-separated log-concave distributions, partial visualization in general settings).
- t-SNE with KL is for datasets with cluster-like structure; RKL is for datasets with manifold-like structure.

Future Work
- The requirement $k \ll n^{1/5}$ can potentially be improved.
- Improve the visualization of t-SNE using the insight gained from its theoretical guarantee.
- Automatically adjust the early/late exaggeration parameter α and choose an appropriate perplexity.
- Modify t-SNE for general dimensionality reduction problems (not just visualization).
- Theoretical guarantees for t-SNE with RKL on manifold data.