Random projection trees and low dimensional manifolds Sanjoy Dasgupta and Yoav Freund University of California, San Diego

I. The new nonparametrics

The new nonparametrics. The traditional bane of nonparametric statistics is the curse of dimensionality: for data in R^D, convergence rates are roughly n^{-1/D}. But recently, some sources of rejuvenation: (1) data near low-dimensional manifolds; (2) sparsity in data space or parameter space.

Low dimensional manifolds. Motion capture: N markers on a human body yield data in R^{3N}.

Benefits of intrinsic low-dimensionality. Benefits you need to work for: learning the structure of the manifold, either (a) find an explicit embedding R^D → R^d and then work in the low-dimensional space, or (b) use the manifold structure for regularization. This talk: simple tweaks that make standard methods manifold-adaptive.

The k-d tree. Problem: curse of dimensionality, as usual. Key notion in the statistical theory of tree estimators: at what rate does cell diameter decrease as you move down the tree?

Rate of diameter decrease. Consider X = ∪_{i=1}^{D} {t·e_i : −1 ≤ t ≤ 1} ⊂ R^D. A k-d tree needs at least D levels to halve the diameter. Yet the intrinsic dimension of this set is d = log D (or perhaps even 1, depending on your definition).

Random projection trees.
K-d tree: pick a coordinate direction, split at the median.
RP tree: pick a random direction, split at the median plus noise.
If the data in R^D has intrinsic dimension d, then an RP tree halves the diameter in just d levels: no dependence on D.

II. RP trees and Assouad dimension

Assouad dimension. A set S ⊂ R^D has Assouad dimension d if, for any ball B, the subset S ∩ B can be covered by 2^d balls of half the radius. Also called doubling dimension. Examples: S = line, Assouad dimension 1; S = k-dimensional affine subspace, Assouad dimension O(k); S = set of N points, Assouad dimension ≤ log N; S = k-dimensional submanifold of R^D with finite condition number, Assouad dimension O(k) in small enough neighborhoods. Crucially: if S has Assouad dimension d, so do subsets of S.

RP trees. [Figure: a spatial partitioning of the data into cells, alongside the corresponding binary tree; each cell of the partition corresponds to a node of the tree.]

RP tree algorithm.
procedure MAKETREE(S)
  if |S| < MinSize: return (Leaf)
  else:
    Rule ← CHOOSERULE(S)
    LeftTree ← MAKETREE({x ∈ S : Rule(x) = true})
    RightTree ← MAKETREE({x ∈ S : Rule(x) = false})
    return ([Rule, LeftTree, RightTree])
procedure CHOOSERULE(S)
  choose a random unit direction v ∈ R^D
  pick any point x ∈ S, and let y be the farthest point from it in S
  choose δ uniformly at random in [−1, 1] · 6‖x − y‖/√D
  Rule(x) := x·v ≤ (median({z·v : z ∈ S}) + δ)
  return (Rule)
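
For concreteness, here is a minimal NumPy sketch of the two procedures above. It is not the authors' code: the names (make_tree, choose_rule), the MIN_SIZE value, and the guard against degenerate splits are my own choices; only the median-plus-noise split rule follows CHOOSERULE.

```python
import numpy as np

MIN_SIZE = 10  # leaf-size threshold; the slides call it MinSize

def choose_rule(S, rng):
    """CHOOSERULE: a random-direction split at the median plus noise."""
    n, D = S.shape
    v = rng.standard_normal(D)
    v /= np.linalg.norm(v)                               # random unit direction
    x = S[rng.integers(n)]                               # any point x in S
    y = S[np.argmax(np.linalg.norm(S - x, axis=1))]      # farthest point from x
    delta = rng.uniform(-1, 1) * 6 * np.linalg.norm(x - y) / np.sqrt(D)
    threshold = np.median(S @ v) + delta
    return lambda pts: pts @ v <= threshold

def make_tree(S, rng=None):
    """MAKETREE: recursively split S; returns nested dicts, leaves hold their points."""
    if rng is None:
        rng = np.random.default_rng()
    if len(S) < MIN_SIZE:
        return {"leaf": S}
    rule = choose_rule(S, rng)
    mask = rule(S)
    if mask.all() or (~mask).all():                      # guard against a degenerate split
        return {"leaf": S}
    return {"rule": rule,
            "left": make_tree(S[mask], rng),
            "right": make_tree(S[~mask], rng)}
```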

Performance guarantee. There is a constant c_0 with the following property. Build an RP tree using data set S ⊂ R^D. Pick any cell C in the tree such that S ∩ C has Assouad dimension d. Then, with probability at least 1/2 (over the construction of the subtree rooted at C): for every descendant C′ that is more than c_0 d log d levels below C, we have radius(C′) ≤ radius(C)/2.

One-dimensional random projections. Projection from R^D onto a random line R^1: how does this affect the lengths of vectors? Very roughly, it shrinks them by a factor of √D (a vector of length 1 typically projects to length about 1/√D).
Lemma: Fix any vector x ∈ R^D and pick a random unit vector U ∼ S^{D-1}. Then for any α, β > 0:
(a) Pr[ |x·U| ≤ α‖x‖/√D ] ≤ α·√(2/π)
(b) Pr[ |x·U| ≥ β‖x‖/√D ] ≤ (2/β)·e^{-β²/2}
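
A quick empirical sanity check of this shrinkage claim, assuming a uniformly random unit direction; the dimension, sample count, and thresholds below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 200
x = rng.standard_normal(D)                       # an arbitrary fixed vector in R^D

U = rng.standard_normal((50_000, D))             # many random unit directions
U /= np.linalg.norm(U, axis=1, keepdims=True)
ratios = np.abs(U @ x) / (np.linalg.norm(x) / np.sqrt(D))

print(np.median(ratios))                                         # order 1: |x.U| is typically about ||x||/sqrt(D)
print(np.mean(ratios <= 0.1), 0.1 * np.sqrt(2 / np.pi))          # part (a): empirical frequency vs bound
print(np.mean(ratios >= 3.0), (2 / 3.0) * np.exp(-3.0**2 / 2))   # part (b): empirical frequency vs bound
```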

Effect of RP on diameter. A set S ⊂ R^D is subjected to a random projection U: how does the diameter of S·U compare to that of S? If S is full-dimensional: diam(S·U) ≈ diam(S). If S has Assouad dimension d: diam(S·U) ≲ diam(S)·√(d/D) (with high probability). [Figure: S, its bounding ball, and its projection S·U onto the line U.]

Diameter of projected set. S ⊂ R^D has Assouad dimension d; pick a random projection U. With high probability, diam(S·U) ≤ diam(S)·O(√(d log D / D)). Proof sketch, taking the bounding ball of S to be (wlog) B(0,1):
1. Cover S by (D/d)^{d/2} balls of radius √(d/D): need 2^d balls of radius 1/2, 4^d balls of radius 1/4, 8^d balls of radius 1/8, ..., (1/ε)^d balls of radius ε = √(d/D).
2. Pick any of these balls. Its projected center is fairly close to the origin: with constant probability, within 1/√D; with probability 1 − 1/D^d, within √(d log D / D).
3. Do a union bound over all the balls.
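
A small illustration of this shrinkage (mine, not from the slides): points confined to a random d-dimensional subspace of R^D, which have Assouad dimension d, see their diameter drop by roughly √(d/D) under projection onto a random line.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, n = 500, 5, 2000

basis, _ = np.linalg.qr(rng.standard_normal((D, d)))   # a random d-dimensional subspace of R^D
S = rng.standard_normal((n, d)) @ basis.T              # points with intrinsic dimension d

def diameter(points):
    """Brute-force diameter via pairwise squared distances."""
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * points @ points.T
    return np.sqrt(np.maximum(d2, 0).max())

U = rng.standard_normal(D)
U /= np.linalg.norm(U)
proj = S @ U                                           # projection onto a random line

print(diameter(S))                                     # original diameter
print(proj.max() - proj.min())                         # projected diameter: much smaller
print(diameter(S) * np.sqrt(d / D))                    # the sqrt(d/D) scale from the slide
```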

Proof outline. Pick any cell in the RP tree, and let S ⊂ R^D be the data in it; suppose S has Assouad dimension d and lies in a ball of radius 1. Show: in every descendant cell d log d levels below, the data is contained in a ball of radius 1/2.

Proof outline. Current cell (radius 1):
1. Cover S by d^{d/2} balls B_i of radius 1/√d.
2. Consider any pair of balls B_i, B_j at distance ≥ 1/2 apart. A single random split has constant probability of cleanly separating them.
3. There are at most d^d such pairs B_i, B_j. So after d log d splits, every faraway pair of balls will be separated, which means all cells will have radius ≤ 1/2.

Big picture. [Figure: within the current cell of radius 1, two covering balls of radius about 1/(10√d) at distance > 1/2; projected onto the random line U they become intervals of radius about 1/(10√D), and the split point ("median in here", plus noise) falls in the gap of width roughly 1/(2√D) between them.] Recall the effect of a random projection: lengths shrink by a factor of 1/√D, diameters of d-dimensional sets by √(d/D).

III. RP trees and local covariance dimension

More general: intrinsic low dimensionality of sets (empirically verifiable? / conducive to analysis? / summary):
Small covering numbers: kind of / yes, but too weak in some ways / small global covers.
Small Assouad dimension: not really / yes / AND small local covers.
Low-dimensional manifold: no / to some extent / AND smoothness (local flatness).
Low-dimensional affine subspace: yes / yes / AND global flatness.
Obvious extension to distributions: at least 1 − δ of the probability mass lies within distance ε of a set of low intrinsic dimension.

Local covariance dimension. A distribution over R^D has covariance dimension (d, ε) if its covariance matrix has eigenvalues λ_1 ≥ ... ≥ λ_D that satisfy λ_1 + ... + λ_d ≥ (1 − ε)(λ_1 + ... + λ_D). That is, there is a d-dimensional affine subspace such that (avg squared distance from the subspace) ≤ ε · (avg squared distance from the mean). We are interested in distributions that locally have this property, i.e., for some partition of R^D, the restriction of the distribution to each region of the partition has covariance dimension (d, ε).
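
One way to check this property on a sample, using the eigenvalue characterization above; the function name and the synthetic example are my own illustration.

```python
import numpy as np

def covariance_dimension_eps(X, d):
    """Smallest eps such that the sample X has covariance dimension (d, eps):
    the fraction of total variance not captured by the top d covariance eigenvalues."""
    cov = np.cov(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # lambda_1 >= ... >= lambda_D
    return 1.0 - eigvals[:d].sum() / eigvals.sum()

# example: points near a 3-dimensional subspace of R^50, plus small full-dimensional noise
rng = np.random.default_rng(2)
X = (rng.standard_normal((5000, 3)) @ rng.standard_normal((3, 50))
     + 0.01 * rng.standard_normal((5000, 50)))
print(covariance_dimension_eps(X, d=3))   # close to 0: covariance dimension (3, eps) for tiny eps
```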

Performance guarantee. Instead of cell diameter, use the vector quantization error: VQ(cell) = average squared distance from a point in the cell to mean(cell). [Using a slightly different RP tree construction.] There are constants c_1, c_2 for which the following holds. Build an RP tree from data S ⊂ R^D. Suppose a cell C has covariance dimension (d, c_1). Then for each of its children C′: E[VQ(C′)] ≤ VQ(C)·(1 − c_2/d), where the expectation is over the split at C.

Proof outline. Pick any cell in the RP tree, and let S ⊂ R^D be the data in it. Show that the VQ error of the cell decreases by a factor of roughly (1 − 1/d) as a result of the split.

The change in VQ error. If a set S is split into two pieces S_1 and S_2 with equal numbers of points, by how much does its VQ error drop? By exactly (1/4)·‖mean(S_1) − mean(S_2)‖².
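
A quick numerical check of this identity, with the factor 1/4 that follows from the bias-variance decomposition when the two halves have equal size; the data and split direction below are arbitrary choices of mine.

```python
import numpy as np

def vq_error(S):
    """Average squared distance from the points in S to their mean."""
    return np.mean(np.sum((S - S.mean(axis=0)) ** 2, axis=1))

rng = np.random.default_rng(3)
S = rng.standard_normal((1000, 20))
order = np.argsort(S @ rng.standard_normal(20))    # split at the median of a random projection
S1, S2 = S[order[:500]], S[order[500:]]

drop = vq_error(S) - 0.5 * (vq_error(S1) + vq_error(S2))
print(drop)
print(0.25 * np.sum((S1.mean(axis=0) - S2.mean(axis=0)) ** 2))   # matches the drop
```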

Proof outline -- 3. VQ(S) = average squared distance to mean(S) = (1/2) · (average squared interpoint distance) = variance of S. Projection onto U shrinks distances by √D, so shrinks the variance by D: the variance of the projected S is roughly VQ(S)/D. The distance between the projected means of S_1 and S_2 is therefore at least about √(VQ(S)/D).

Proof outline -- 4. S is close to a d-dimensional affine subspace, so mean(S_1) and mean(S_2) lie very close to this subspace. The subspace has Assouad dimension O(d), so all vectors in it shrink to √(d/D) of their original length when projected onto U. Therefore the distance between mean(S_1) and mean(S_2) is at least about √(VQ(S)/d), not just √(VQ(S)/D).

IV. Connections and open problems

The uses of k-d trees.
1. Classification and regression: given data points (x_1, y_1), ..., (x_n, y_n), build a tree on the x_i. For any subsequent query x, assign it a y-label that is an average or majority vote of the y_i values in cell(x).
2. Near neighbor search: build the tree on a database x_1, ..., x_n. Given a query x, find an x_i close to it: return the nearest neighbor in cell(x).
3. Nearest neighbor search: like (2), but may need to look beyond cell(x).
4. Speeding up geometric computations: for instance, N-body problems in which all interactions between nearby pairs of particles must be computed.
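
A hedged sketch of uses (1) and (2) on top of the make_tree sketch from earlier: route a query to its leaf cell and answer from the points stored there. The helper names find_cell and nn_in_cell are mine; for classification one would store labels alongside the leaf points and return their average or majority vote instead.

```python
import numpy as np

def find_cell(tree, x):
    """Route a single query x down an RP tree built by make_tree (earlier sketch)."""
    while "leaf" not in tree:
        tree = tree["left"] if tree["rule"](x[None, :])[0] else tree["right"]
    return tree["leaf"]

def nn_in_cell(tree, x):
    """Approximate nearest neighbor: the closest database point within cell(x)."""
    cell = find_cell(tree, x)
    return cell[np.argmin(np.linalg.norm(cell - x, axis=1))]
```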

Vector quantization. Setting: lossy data compression. Data are generated from some distribution P over R^D. Pick a finite codebook C ⊂ R^D and an encoding function α: R^D → C such that E‖X − α(X)‖² is small. Tree-based VQ is used in applications with large C. Typical rate: VQ error ∝ e^{-r/D} (r = depth of tree). RP trees have VQ error ∝ e^{-r/d}.
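
A rough way to see this behavior empirically, reusing the make_tree sketch from earlier (the slides note that the VQ guarantee is for a slightly different construction, so this is only illustrative): measure the average quantization error when each point is encoded by the mean of its depth-r cell. The function and data below are my own.

```python
import numpy as np

def vq_error_at_depth(tree, S, depth):
    """Total squared quantization error when each point in S is encoded by the mean
    of its cell at the given depth (or of its leaf, if that is reached sooner)."""
    if depth == 0 or "leaf" in tree:
        return np.sum((S - S.mean(axis=0)) ** 2)
    mask = tree["rule"](S)
    return (vq_error_at_depth(tree["left"], S[mask], depth - 1) +
            vq_error_at_depth(tree["right"], S[~mask], depth - 1))

# data with intrinsic dimension d = 5 sitting inside R^100
rng = np.random.default_rng(4)
basis, _ = np.linalg.qr(rng.standard_normal((100, 5)))
S = rng.standard_normal((4000, 5)) @ basis.T
tree = make_tree(S, rng)                               # make_tree from the earlier sketch
for r in range(0, 13, 3):
    print(r, vq_error_at_depth(tree, S, r) / len(S))   # average VQ error shrinks with depth
```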

Compressed sensing. A new model for working with D-dimensional data: never look at the original data X; work exclusively with a few random projections Φ(X). Candès-Tao, Donoho: sparse X can be reconstructed from Φ(X). There is a cottage industry of algorithms working exclusively with Φ(X). RP trees are compatible with this viewpoint: use the same random projection across a level of the tree, and precompute the random projections.

What next?
1. Other tree data structures? e.g., nearest neighbor search [such as cover trees].
2. Other nonparametric estimators, e.g., kernel density estimation.
3. Other structure (such as clustering) that can be exploited to improve convergence rates of statistical estimators.

Thanks: Yoav Freund, my co-author; Nakul Verma and Mayank Kabra, students who ran many experiments; the National Science Foundation, for support under grants IIS-0347646 and IIS-0713540.