Clustering in kernel embedding spaces and organization of documents

Clustering in kernel embedding spaces and organization of documents Stéphane Lafon Collaborators: Raphy Coifman (Yale), Yosi Keller (Yale), Ioannis G. Kevrekidis (Princeton), Ann B. Lee (CMU), Boaz Nadler (Weizmann)

Motivation. Data everywhere: scientific databases (biomedical/physics); the web (web search, ads targeting, ads spam/fraud detection...); books (conventional libraries + Amazon, Google Print...); military applications (intelligence, ATR, undersea mapping, situational awareness...); corporate and financial data.

Motivation. Tremendous need for automated ways of: organizing this information (clustering, parametrization of data sets, adapted metrics on the data); extracting knowledge (learning, clustering, pattern recognition, dimensionality reduction); making it useful to end-users (dimensionality reduction). All of this in the most efficient way.

Challenges: data sets are high-dimensional (need for dimension reduction); data sets are nonlinear (need to learn these nonlinearities); data is likely to be noisy (need for robustness); data sets are massive.

Highlights of this talk: Review the diffusion framework: parametrization of data sets, dimensionality reduction, diffusion metric Present a few results on clustering in the diffusion space Extend these results to non-symmetric graphs: beyond diffusions.

From data sets to graphs. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a data set. Construct a graph $G = (X, W)$ where each point $x_i$ corresponds to a node, and every two nodes are connected by an edge with a non-negative weight $w(x, y)$.

Choice of the weight matrix. The quantity $w(x, y)$ should reflect the degree of similarity or interaction between x and y. The choice of the weight is crucial and application-driven. Examples: take a correlation matrix and throw away values that are not close to 1; if we have a distance metric $d(x, y)$ on the data, consider using $w(x, y) = e^{-d(x, y)^2/\varepsilon}$; sometimes the data already comes in the form of a graph (e.g. social networks). In practice, this is a difficult problem! Feature selection, dynamical data sets...
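
As a concrete illustration of the Gaussian-kernel choice, here is a minimal NumPy sketch (the function name, the toy data set and the value of eps are ours, not from the talk) that builds $w(x, y) = e^{-\|x-y\|^2/\varepsilon}$ from a point cloud:

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_weights(X, eps):
    """Weight matrix w(x, y) = exp(-||x - y||^2 / eps) for a point cloud X (n x d)."""
    D2 = cdist(X, X, metric="sqeuclidean")   # pairwise squared Euclidean distances
    return np.exp(-D2 / eps)

# Toy example: 200 noisy points on a circle
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((200, 2))
W = gaussian_weights(X, eps=0.1)
```

The kernel scale eps plays the role of $\varepsilon$ above and has to be tuned to the data; the later sketches reuse this W.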

Markov chain on the data. Define the degree of node x as $d(x) = \sum_{z \in X} w(x, z)$. Form the $n \times n$ matrix $P$ with entries $p(x, y) = w(x, y)/d(x)$; in other words, $P = D^{-1}W$. Because $\sum_{y \in X} p(x, y) = 1$ and $p(x, y) \geq 0$, $P$ is the transition matrix of a Markov chain on the graph of the data. $I - P$ is called the "normalized graph Laplacian" [Chung'97]. Notation: the entries of $P^t$ are denoted $p_t(x, y)$.
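
The normalization $P = D^{-1}W$ is a one-liner; a sketch continuing from the W built above (the names are ours):

```python
import numpy as np

def markov_matrix(W):
    """Row-normalize a non-negative weight matrix into a Markov matrix P = D^{-1} W."""
    d = W.sum(axis=1)          # degrees d(x) = sum_z w(x, z)
    P = W / d[:, None]         # p(x, y) = w(x, y) / d(x); each row sums to 1
    return P, d

P, d = markov_matrix(W)
assert np.allclose(P.sum(axis=1), 1.0)   # rows of a transition matrix sum to 1
```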

Powers of P Main idea: the behavior of the Markov chain over different time scales will reveal geometric structures of the data.

Time parameter t. $p_t(x, y)$ is the probability of transition from x to y in t time steps. Therefore, it is close to 1 if y is easily reachable from x in t steps, which happens if there are many paths connecting these two points. t defines the granularity of the analysis. Increasing the value of t is a way to integrate the local geometric information: transitions are likely to happen between similar data points and occur rarely otherwise.

Assumptions on the weights. Our object of interest: powers of P. We now assume that the weight function is symmetric, i.e., $w(x, y) = w(y, x)$, so that the graph is symmetric.

Spectral decomposition of P. With these conditions, P has a sequence of eigenvalues $1 = \lambda_0 \geq \lambda_1 \geq \ldots \geq \lambda_{n-1}$ and a collection $\{\psi_m\}$ of corresponding (right) eigenvectors: $P^t \psi_m = \lambda_m^t \psi_m$. In addition, it can be checked that $\lambda_0 = 1$ and $\psi_0 \equiv 1$.

Diffusion coordinates. Each eigenvector is a function defined on the data points. Therefore, we can think of the (right) eigenvectors $\{\psi_m\}$ as forming a set of coordinates on the data set X. For any choice of t, define the mapping $\Psi_t : x \mapsto \left(\lambda_1^t \psi_1(x), \lambda_2^t \psi_2(x), \ldots, \lambda_{n-1}^t \psi_{n-1}(x)\right)$. [Figure: $\Psi_t$ maps the original space into the diffusion space.]
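
In practice the eigenpairs of P can be obtained from the symmetric conjugate $D^{-1/2} W D^{-1/2}$, which has the same eigenvalues as P. A minimal sketch of the embedding, under the naming and the toy W of the earlier snippets (not the speaker's own code):

```python
import numpy as np

def diffusion_coordinates(W, t=1, n_coords=2):
    """Diffusion map Psi_t(x) = (lambda_1^t psi_1(x), ..., lambda_m^t psi_m(x)).

    Right eigenvectors of P = D^{-1} W are recovered from the symmetric
    matrix S = D^{-1/2} W D^{-1/2}, which shares its eigenvalues with P.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(S)                 # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]                 # re-sort in decreasing order
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs * d_inv_sqrt[:, None]               # right eigenvectors of P
    psi = psi / psi[:, [0]]                        # rescale so that psi_0 == 1
    return (vals[1:n_coords + 1] ** t) * psi[:, 1:n_coords + 1]

embedding = diffusion_coordinates(W, t=4, n_coords=2)   # one 2-D point per data point
```

With this rescaling the $\psi_m$ are orthonormal with respect to the stationary distribution, which is the normalization used for the diffusion distance below.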

Over the past 5 years, new techniques have emerged for manifold learning: Isomap [Tenenbaum-de Silva-Langford '00], L.L.E. [Roweis-Saul '00], Laplacian eigenmaps [Belkin-Niyogi], Hessian eigenmaps [Donoho-Grimes '03]. They all aim at finding coordinates on data sets by computing the eigenfunctions of a psd matrix.

Diffusion coordinates. Data set: unordered collection of images of handwritten digits. Weight kernel $w(x, y) = e^{-\|x - y\|^2/\varepsilon}$, where $\|x - y\|$ is the $L^2$ distance between two images x and y.

Diffusion coordinates

Diffusion distance. Meaning of the mapping $\Psi_t$: two data points x and y are mapped to $\Psi_t(x)$ and $\Psi_t(y)$ so that the distance between them is equal to the so-called "diffusion distance": $\|\Psi_t(x) - \Psi_t(y)\| = D_t(x, y) = \|p_t(x, \cdot) - p_t(y, \cdot)\|_{L^2(X, 1/\pi)}$. Two points are close in this metric if they are highly connected in the graph. [Figure: the diffusion metric in the original space corresponds to the Euclidean distance in the diffusion space.]
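
This identity can be checked numerically. The sketch below (reusing W, markov_matrix and diffusion_coordinates from the earlier snippets, with our naming) compares the distance computed from the definition with the Euclidean distance in the full diffusion embedding:

```python
import numpy as np

n = W.shape[0]
P, d = markov_matrix(W)
pi = d / d.sum()                         # stationary distribution of the chain
t = 4
Pt = np.linalg.matrix_power(P, t)        # t-step transition probabilities

# Diffusion distance between nodes 0 and 1, straight from the definition
x, y = 0, 1
D_direct = np.sqrt(np.sum((Pt[x] - Pt[y]) ** 2 / pi))

# The same quantity as a Euclidean distance, keeping all non-trivial coordinates
emb = diffusion_coordinates(W, t=t, n_coords=n - 1)
D_embedded = np.linalg.norm(emb[x] - emb[y])

print(D_direct, D_embedded)              # agree up to numerical error
```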

Diffusion distance. The diffusion metric measures proximity in terms of connectivity in the graph. In particular: it is useful to detect and characterize clusters; it allows one to develop learning algorithms based on the preponderance of evidence; it is very robust to noise, unlike the geodesic distance. [Figure: two points A and B in a noisy data set, compared under the geodesic distance and under the diffusion distance.] Related notions: resistance distance, mean commute time, Green's function.

Dimension reduction. Since $1 = \lambda_0 \geq \lambda_1 \geq \ldots$, not all terms are numerically significant in $\Psi_t : x \mapsto \left(\lambda_1^t \psi_1(x), \lambda_2^t \psi_2(x), \ldots, \lambda_{n-1}^t \psi_{n-1}(x)\right)$. Therefore, for a given value of t, one only needs to embed using the coordinates for which $\lambda_m^t$ is non-negligible. The key to dimensionality reduction: decay of the spectrum of the powers of the transition matrix. [Figure: spectra of P, P^2, P^4, P^8, P^16, decaying increasingly fast.]
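
The truncation rule is easy to state in code; a small sketch (the threshold name delta and the 5% default are ours):

```python
import numpy as np

def embedding_dimension(eigenvalues, t, delta=0.05):
    """Number of coordinates m with lambda_m^t >= delta * lambda_1^t.

    `eigenvalues` is assumed sorted in decreasing order, with the trivial
    eigenvalue lambda_0 = 1 in first position.
    """
    lam = np.asarray(eigenvalues, dtype=float)[1:]   # drop the trivial lambda_0
    return int(np.sum(lam ** t >= delta * lam[0] ** t))
```

Because the retained eigenvalues are raised to the power t, the effective dimension shrinks as t grows.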

Example: image analysis. 1. Take an image. 2. Form the set of all 7x7 patches. 3. Compute the diffusion maps. [Figure: parametrization of the patch set.]
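
Step 2 can be done with a sliding-window view; a sketch (the image here is a random placeholder, and the names are ours):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def patch_set(image, size=7):
    """All overlapping size x size patches of a 2-D image, one patch per row."""
    patches = sliding_window_view(image, (size, size))   # shape (H-6, W-6, 7, 7)
    return patches.reshape(-1, size * size)

image = np.random.default_rng(1).random((64, 64))        # placeholder grayscale image
X_patches = patch_set(image)                             # each row is one data point
```

Each row of X_patches can then be fed to the weight-matrix and diffusion-map sketches above.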

Example: image analysis. [Figure: regular diffusion maps vs. density-normalized diffusion maps of the patch set.] Now, if we drop some patches: [Figure: regular vs. density-normalized diffusion maps of the subsampled patch set.]

Example: Molecular dynamics (with Raphy R. Coifman, Ioannis Kevrekidis, Mauro Maggioni and Boaz Nadler) Output of the simulation of a molecule of 7 atoms in a solvent. The molecule evolves in a potential, and is also subject to interactions with the solvent. Data set: millions of snapshots of the molecule (coordinates of atoms).

Graph matching and data alignment (with Yosi Keller) Suppose we have two datasets X and Y with approximately the same geometry. Question: how can we match/align/register X and Y? Since we have coordinates for X and Y, we can embed both sets, and align them in diffusion space.

Graph matching and data alignment. Illustration: moving heads. The heads of 3 subjects wearing strange masks are recorded in 3 movies. We obtain 3 data sets where each frame is a data point. Because this is a constrained system (head-neck mechanical articulation), all 3 sets exhibit approximately the same geometry.

Graph matching and data alignment. We compute 3 density-invariant embeddings. [Figure: embeddings of the yellow, red, and black data sets.] We then align the 3 data sets in diffusion space using a limited number of landmarks.

Graph matching and data alignment

Science News articles (reloaded). Data set: collection of documents from Science News. Data modeling: we form a mutual information feature matrix Q. For each document i and word j, we compute the information shared by i and j: $Q(i, j) = \log \frac{N_{i,j}}{N_i N_j}$, where $N_{i,j}$ is the number of occurrences of word j in document i, $N_i$ is the total number of words in document i, and $N_j$ is the total number of occurrences of word j in all documents. We can work on the rows of Q in order to organize documents, or on its columns to organize words. [Figure: the matrix Q, with documents indexing rows and words indexing columns.]
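
A sketch of this feature matrix as reconstructed above (counts is a hypothetical document-by-word count matrix; leaving zero-count entries at 0 is our choice for the sketch, not something stated in the talk):

```python
import numpy as np

def mutual_information_matrix(counts):
    """Q(i, j) = log( N_ij / (N_i * N_j) ), computed where N_ij > 0.

    counts: (documents x words) array of raw counts N_ij.
    N_i: total number of words in document i; N_j: total count of word j.
    Entries with N_ij = 0 are simply left at 0 in this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    N_i = counts.sum(axis=1, keepdims=True)
    N_j = counts.sum(axis=0, keepdims=True)
    denom = N_i * N_j
    Q = np.zeros_like(counts)
    nz = counts > 0
    Q[nz] = np.log(counts[nz] / denom[nz])
    return Q
```

Rows of Q then serve as document features and columns as word features for the diffusion constructions above.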

Clustering and data subsampling Organizing documents: we work on rows of Q. Automatic extraction of themes, organization of documents according to these themes. Can work at different levels of granularity by changing the number of diffusion coordinates considered. [TODO: Matlab demo]

Clustering of documents in diffusion space (with Ann B. Lee). In the diffusion space, we can use classical geometric objects (hyperplanes, balls, the Euclidean distance) and apply the usual geometric algorithms (k-means, coverings). We now show that clustering in the diffusion space is equivalent to compressing the diffusion matrix P.
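
Since the diffusion space is an ordinary Euclidean space, off-the-shelf k-means applies directly; a minimal scikit-learn sketch reusing the diffusion_coordinates helper defined earlier (the number of coordinates and clusters are arbitrary here):

```python
from sklearn.cluster import KMeans

# Embed the data, then run plain k-means in the diffusion space
emb = diffusion_coordinates(W, t=4, n_coords=5)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
```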

Learning nonlinear structures. [Figure: k-means in the original space vs. k-means in the diffusion space.]

Coarsening of the graph and Markov chain lumping. Given an arbitrary partition $\{S_i\}_{1 \leq i \leq k}$ of the nodes into k subsets, we coarsen the graph G into a new graph $\tilde{G}$ with k meta-nodes. Each meta-node is identified with a subset $S_i$. The new edge-weight function is defined by $\tilde{w}(S_i, S_j) = \sum_{x \in S_i} \sum_{y \in S_j} w(x, y)$. We form a new Markov chain on this new graph: $\tilde{d}(S_i) = \sum_{j \leq k} \tilde{w}(S_i, S_j)$ and $\tilde{p}(S_i, S_j) = \tilde{w}(S_i, S_j)/\tilde{d}(S_i)$.

Coarsening of the graph and Markov chain lumping. [Figure: a graph with edge weights w(x, y) is coarsened into meta-nodes S_1, S_2, S_3 with edge weights $\tilde{w}(S_i, S_j)$.]
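
In matrix form, with Q the n x k indicator matrix of the partition, the coarse weights are $Q^T W Q$ and the coarse chain is their row normalization. A sketch (names are ours; the labels could, for instance, come from the k-means sketch above):

```python
import numpy as np

def coarsen(W, labels, k):
    """Coarse-grain a weight matrix given a partition of the nodes.

    labels[x] = i means node x belongs to subset S_i (0 <= i < k).
    Returns the coarse weights w~(S_i, S_j) and the coarse Markov matrix p~.
    """
    n = W.shape[0]
    Q = np.zeros((n, k))
    Q[np.arange(n), labels] = 1.0      # row x indicates the subset containing x
    W_coarse = Q.T @ W @ Q             # w~(S_i, S_j) = sum_{x in S_i, y in S_j} w(x, y)
    d_coarse = W_coarse.sum(axis=1)    # d~(S_i)
    return W_coarse, W_coarse / d_coarse[:, None]

W_c, P_c = coarsen(W, labels, k=3)
```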

Equivalence with a matrix approximation problem. The new Markov matrix $\tilde{P}$ is $k \times k$, and it is an approximation of P. Are the spectral properties of P preserved by the coarsening operation? Define the approximate eigenvectors of $\tilde{P}$ by $\tilde{\psi}_l(S_i) = \sum_{x \in S_i} \frac{\pi(x)}{\tilde{\pi}(S_i)} \psi_l(x)$, where $\tilde{\pi}(S_i) = \sum_{x \in S_i} \pi(x)$. These are not actual eigenvectors of $\tilde{P}$, but only approximations. The quality of the approximation depends on the choice of the partition $\{S_i\}$.

Error of approximation. Proposition: we have, for all $l \leq k$, $\tilde{P}\tilde{\psi}_l = \lambda_l \tilde{\psi}_l + O(D)$, where D is the distortion in the diffusion space: $D^2 = \sum_{i \leq k} \tilde{\pi}(S_i) \sum_{x \in S_i} \sum_{z \in S_i} \frac{\pi(x)}{\tilde{\pi}(S_i)} \frac{\pi(z)}{\tilde{\pi}(S_i)} \|\Psi_t(x) - \Psi_t(z)\|^2$. This is the quantization distortion induced by $\{S_i\}$ when the points carry the mass distribution $\pi$. This result justifies the use of k-means in diffusion space (Meila-Shi) and of coverings for data clustering. Any clustering scheme that aims at minimizing this distortion preserves the spectral properties of the Markov chain.

Clustering of words: automatic lexicon organization. Organizing words: we work on columns of Q. We can subsample the word collection by quantizing the diffusion space. Clusters can be interpreted as "wordlets", or semantic units. Centers of the clusters are representative of a concept. Automatic extraction of concepts from a collection of documents.

Clustering and data subsampling

Clustering and data subsampling

Going multiscale. [Figure: embeddings onto the coordinates $(\lambda_2^t \psi_2, \lambda_3^t \psi_3)$ for increasing values of t (three panels, up to t = 64), showing how the structure coarsens as t grows.]

Beyond diffusions. The previous ideas can be generalized to the wider setting of a directed graph, with signed edge-weights $w(i, j)$ and a set of positive node-weights $\pi(i)$. This time, we consider oriented interactions between data points (e.g. source-target). No spectral theory is available! The node-weight can be used to rescale the relative importance of each data point (e.g. a density estimate). This framework contains the diffusion matrices as a special case, in which the node-weight is equal to the out-degree of each node.

Coarsening. This time we have no spectral embedding. However, for a given partitioning $\{S_i\}$ of the nodes, the same coarsening procedure can still be applied to the graph: $\tilde{w}(S_i, S_j) = \sum_{x \in S_i} \sum_{y \in S_j} w(x, y)$ and $\tilde{\pi}(S_i) = \sum_{x \in S_i} \pi(x)$. [Figure: coarsening of a directed graph into meta-nodes S_1, S_2, S_3.]

Beyond spectral embeddings. Idea: form the matrix $A = \Pi^{-1} W$ and view each row $a(x, \cdot)$ as a feature vector representing x. The graph G now lives as a cloud of points in $\mathbb{R}^n$. Similarly, for the coarser graph $\tilde{G}$, form $\tilde{A} = \tilde{\Pi}^{-1} \tilde{W}$ and obtain a cloud of points in $\mathbb{R}^k$. Question: how do these point sets compare?

Comparison between graphs. A is $n \times n$, whereas $\tilde{A}$ is $k \times k$. They can be related by the $n \times k$ matrix Q whose rows are the indicators of the partitioning. In other words, the diagram relating A, $\tilde{A}$ and Q should approximately commute: $Q\tilde{A} \approx AQ$.
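
A quick way to measure how far this diagram is from commuting (a sketch; it uses a plain Frobenius norm rather than the weighted norms of the proposition below, and the indicator-matrix construction mirrors the coarsen helper above):

```python
import numpy as np

def commutation_defect(A, A_coarse, labels, k):
    """Frobenius norm of Q @ A_coarse - A @ Q, where Q is the n x k
    indicator matrix of the partition (row x marks the subset containing x)."""
    n = A.shape[0]
    Q = np.zeros((n, k))
    Q[np.arange(n), labels] = 1.0
    return np.linalg.norm(Q @ A_coarse - A @ Q)

# e.g. call with A built from the node weights and W, and A_coarse from the coarsened graph
```

A small defect indicates that the chosen partition preserves the row-feature geometry of A.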

Error of approximation. We now have a result similar to the one before. Proposition: $\sup_{f \in \mathbb{R}^n,\ \|f\|_{1/\pi} = 1} \|f^T (Q\tilde{A} - AQ)\|_{1/\tilde{\pi}} \leq D$, where $D^2 = \sum_{i \leq k} \tilde{\pi}(S_i) \sum_{x \in S_i} \sum_{z \in S_i} \frac{\pi(x)}{\tilde{\pi}(S_i)} \frac{\pi(z)}{\tilde{\pi}(S_i)} \|a(x, \cdot) - a(z, \cdot)\|^2$. D is the distortion resulting from quantizing the set of rows of A, with mass distribution $\pi$.

Antarctic food web. [Figure: food-web graph whose nodes include: Squid, Krill, Zooplankton, Phytoplankton, Cormorant, Large fish, Gull, Petrel, Sheathbill, Sperm whale, CO2, Sun, Minke whale, Blue whale, Albatross, Tern, Right whale, Fulmar, Fin whale, Leopard seal, Killer whale, Weddell seal, Ross seal, Small fish, Adelie penguin, Gentoo penguin, Emperor penguin, Macaroni penguin, King penguin, Chinstrap penguin, Sei whale, Humpback whale, Octopus, Elephant seal, Crabeater seal, Skua.]

Partitioning the forward graph Gull King penguin Large fish Leopard seal Octopus Weddell seal Ross seal Krill Small fish Squid Adelie penguin Chinstrap penguin Emperor penguin Gentoo penguin Macaroni penguin CO2, Sun Cormorant, Albatross Crabeater seal Elephant seal Fin whale, Fulmar Blue whale Humpback whale Southern right whale Minke whale Sei whale Sperm whale Killer whale, Petrel Phytoplankton Sheathbill Skua,Tern Zooplankton

Partitioning the reverse graph CO2, Sun Phytoplankton Zooplankton Krill Blue whale Fin whale Humpback whale Southern right whale Minke whale Small fish Sei whale Gull Elephant seal Petrel Sheathbill Killer whale Leopard seal Adelie penguin Albatross, Fulmar Chinstrap penguin Crabeater seal Emperor penguin Gentoo penguin King penguin Large fish, Octopus Macaroni penguin Ross seal, Cormorant Skua, Squid, Tern Sperm whale Weddell seal

Conclusion Very flexible framework providing coordinates on data sets Diffusion distance relevant to data mining and machine learning applications Clustering meaningful in this embedding space Can be extended to directed graphs Multiscale bases and analyses on graphs Thank you!