Some Useful Background for Talk on the Fast Johnson-Lindenstrauss Transform

Nir Ailon
Institute for Advanced Study, Princeton NJ. Email: nailon@math.ias.edu

May 22, 2007

This writeup includes very basic background material for the talk on the Fast Johnson-Lindenstrauss Transform to be given on May 28th, 2007 by the author at the Summer School on Algorithmic Data Analysis in Helsinki, Finland. The talk is about the results in [2], but will be based on a newer proof (joint work with Edo Liberty) using more modern tools, such as probability in Banach spaces and error-correcting codes. These tools also give stronger results. A written version of the new proof will be available for download as a technical report shortly. The material here only touches the surface of deep, classic theories, but should be useful for those who want a quick review. The talk will fuse the seemingly disparate ideas below into one algorithm. Some exercises are included.

1 Dimension Reduction

In many applications we are given points in a high-dimensional space as input and wish to reduce the dimension while (approximately, with high probability) preserving certain properties. Such a procedure can be used to save space or time, or to remove redundancy from data. This definition is extremely loose and can encompass buzzwords such as Principal Component Analysis (PCA), sparse approximation, and even compression. I will talk about dimension reduction from the perspective of algorithmic metric embedding. This means that the property we preserve is the metric on the points, and that we also care about the computational resources required for the reduction. There are two metrics in question: the domain (original) metric, which here will always be the $\ell_2$ (Euclidean) metric on $\mathbb{R}^d$, and the range (reduced) metric, which will be either $\ell_1$ ("Manhattan distance") or $\ell_2$. Recall that the Euclidean distance between two vectors $u, v \in \mathbb{R}^d$ is $\|u - v\|_2 = \left(\sum_i (u_i - v_i)^2\right)^{1/2}$ and the Manhattan distance is $\|u - v\|_1 = \sum_i |u_i - v_i|$.

Given a set $V$ of $n$ points in $\mathbb{R}^d$ for some large dimension $d$, the goal is to apply a random map $\Phi$ from $\mathbb{R}^d$ to $\mathbb{R}^k$ (for $k \ll d$) so that with probability at least $2/3$ (say), simultaneously for all pairs of points $u, v \in V$,
$$(1 - \varepsilon)\|u - v\|_2 \le \|\Phi u - \Phi v\|_p \le (1 + \varepsilon)\|u - v\|_2,$$
where $p = 1$ or $p = 2$. We will use $\|\Phi u - \Phi v\|_p \approx_\varepsilon \|u - v\|_2$ as shorthand for this sandwich inequality. A classic result by Johnson and Lindenstrauss [6] and subsequent simplifications [1, 3, 5, 4] show that if $k = O(\varepsilon^{-2} \log n)$, then a random transformation does the trick.
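To make this concrete, here is a minimal sketch of the simplest such transformation: a dense Gaussian random projection. This illustration is mine, not part of the original writeup; the constant 8 in the choice of $k$ is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n, eps = 10_000, 100, 0.25
# Target dimension k = O(eps^-2 log n); the constant 8 is an arbitrary illustrative choice.
k = int(np.ceil(8 * np.log(n) / eps**2))

# Dense Gaussian projection: i.i.d. N(0, 1/k) entries, so E ||Phi x||_2^2 = ||x||_2^2.
Phi = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))

V = rng.normal(size=(n, d))  # n arbitrary points in R^d
W = V @ Phi.T                # project all n points at once: O(nkd) time total

# Worst relative distortion over all pairs (the p = 2 sandwich inequality).
worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        orig = np.linalg.norm(V[i] - V[j])
        reduced = np.linalg.norm(W[i] - W[j])
        worst = max(worst, abs(reduced / orig - 1.0))

print(f"k = {k}, worst pairwise distortion = {worst:.3f} (target: <= {eps})")
```

Note that reducing a single vector costs $O(kd)$ time here; the point of the fast transform discussed in the talk is to beat this.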

The main argument in all of these proofs is: show, using a measure concentration principle, that
$$\forall x \in \mathbb{R}^d, \quad \Pr\left[\, \|\Phi x\|_p \not\approx_\varepsilon \|x\|_2 \,\right] = e^{-\Omega(k \varepsilon^2)}. \qquad (1)$$
Now plug in $k$ as above and apply a union bound over all $\binom{n}{2}$ possibilities $x = u - v$ for $u, v \in V$.

It makes sense to abstract this and say that a distribution on linear transformations $\mathbb{R}^d \to \mathbb{R}^k$ has the Johnson-Lindenstrauss property if (1) holds for $\Phi$ drawn from this distribution. Note the quantification: for any fixed (adversarial) $x$, $\|\Phi x\|_p \approx_\varepsilon \|x\|_2$ with high probability. By this we actually mean that a family of distributions, for increasingly large $k, d$, has this property with the same constant hiding in the $\Omega$-notation. Once we have this definition, and since we already know it is not an empty one, it makes sense to ask: what is the best distribution satisfying the J-L property in terms of computational resources? More precisely, we try to optimize the time it takes to reduce a vector and the number of random bits used. (The latter may not seem very important for applications, but reducing this quantity can in fact reduce the amount of space required when applying the dimension reduction to data streams.) For example, the constructions in [1, 3, 5, 4] require $O(kd)$ time and random bits. The construction in [2] requires $O(\max\{d \log d, k^3\})$ time for $p = 2$ and $O(\max\{d \log d, k^2\})$ time for $p = 1$. The new proof that will be presented in the talk offers better results. We review some relevant background in the remainder of this writeup.

2 Finite Dimensional Real Banach Spaces

A Banach space is a normed vector space satisfying a certain topological property (completeness) which we will not really need in the talk. Here we only consider finite-dimensional real Banach spaces and some standard norms. Some definitions and facts:

For $1 \le p < \infty$, the space $\ell_p^d$ is $\mathbb{R}^d$ equipped with the norm $\|x\| = \|x\|_p = \left(\sum |x_i|^p\right)^{1/p}$. For $p = \infty$, the space $\ell_\infty^d$ is $\mathbb{R}^d$ equipped with the norm $\|x\| = \|x\|_\infty = \max |x_i|$. We call $p$ the norm index or norm exponent.

For a norm index $p$, the dual norm index $q$ is defined by $1/p + 1/q = 1$.

If $p, q$ are dual norm indices, then for all $x, y \in \mathbb{R}^d$,
$$\langle x, y \rangle = \sum x_i y_i \le \|x\|_p \|y\|_q.$$
This is called Hölder's inequality. For $p = q = 2$ it is commonly known as the Cauchy-Schwarz inequality.

If $p, q$ are dual norm indices, then for $x \in \ell_p^d$,
$$\|x\|_p = \sup_{y \in \ell_q^d,\ \|y\|_q \le 1} \langle x, y \rangle.$$
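As a quick numerical sanity check of the last two facts, the following sketch (mine, not part of the writeup) verifies Hölder's inequality on random vectors and confirms that the supremum in the duality formula is attained at the classical extremal vector $y^\star$ with $y^\star_i = \mathrm{sign}(x_i)\,|x_i|^{p-1} / \|x\|_p^{p-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 50, 3.0
q = p / (p - 1.0)  # dual norm index: 1/p + 1/q = 1

x = rng.normal(size=d)
y = rng.normal(size=d)

# Hölder's inequality: <x, y> <= ||x||_p ||y||_q.
assert np.dot(x, y) <= np.linalg.norm(x, p) * np.linalg.norm(y, q) + 1e-12

# Duality: the supremum of <x, y> over the unit q-ball equals ||x||_p,
# attained at the classical extremal vector below.
y_star = np.sign(x) * np.abs(x) ** (p - 1) / np.linalg.norm(x, p) ** (p - 1)
print(np.linalg.norm(y_star, q))                 # ~1.0 (on the unit sphere of l_q)
print(np.dot(x, y_star), np.linalg.norm(x, p))   # the two values coincide
```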

For a matrix $A \in \mathbb{R}^{k \times d}$ and two norm indices $p_1, p_2$ we define the operator norm
$$\|A\|_{p_1 \to p_2} = \sup_{x \in \ell_{p_1}^d,\ \|x\|_{p_1} \le 1} \|Ax\|_{p_2}.$$
Equivalently, using the duality above,
$$\|A\|_{p_1 \to p_2} = \sup_{x \in \ell_{p_1}^d,\ \|x\|_{p_1} \le 1}\ \sup_{y \in \ell_{q_2}^k,\ \|y\|_{q_2} \le 1} y^T A x,$$
where $q_2$ is the norm index dual to $p_2$.

Exercise: Convince yourself that $\|A\|_{p_1 \to p_2} = \|A^T\|_{q_2 \to q_1}$, where $q_1, q_2$ are dual to $p_1, p_2$, respectively.

Riesz-Thorin interpolation theorem: Let $p_1, r_1, p_2, r_2$ be norm indices. Let $A \in \mathbb{R}^{k \times d}$, and assume $\|A\|_{p_1 \to r_1} \le C_1$ and $\|A\|_{p_2 \to r_2} \le C_2$. Let $0 \le \lambda \le 1$, and let $p, r$ be such that $\frac{1}{p} = \lambda \frac{1}{p_1} + (1 - \lambda) \frac{1}{p_2}$ and $\frac{1}{r} = \lambda \frac{1}{r_1} + (1 - \lambda) \frac{1}{r_2}$. (Geometrically, the point $(1/p, 1/r)$ lies on the line segment connecting $(1/p_1, 1/r_1)$ and $(1/p_2, 1/r_2)$.) Then
$$\|A\|_{p \to r} \le C_1^{\lambda} C_2^{1 - \lambda}.$$

3 The Walsh-Hadamard Matrix

It is sometimes useful to assume that the coordinates of $\mathbb{R}^d$ are structured. One commonly used structure is the discrete line. Often it is useful to assume that $d = 2^t$ for some integer $t$, so that coordinate $i$ can be identified with a $t$-dimensional vector over $\mathbb{F}_2$ (the field with two elements $0, 1$) representing the number $i$ in binary. The Walsh-Hadamard matrix $H_d \in \mathbb{R}^{d \times d}$ is defined by $H_d(i, j) = d^{-1/2} (-1)^{\langle i, j \rangle}$, where $\langle i, j \rangle$ is the scalar product of $i$ and $j$ over $\mathbb{F}_2^t$. It is a type of Fourier transform matrix.

Exercise: Prove that $H_d$ can be written recursively as
$$H_d = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{d/2} & H_{d/2} \\ H_{d/2} & -H_{d/2} \end{pmatrix}.$$
Using this, convince yourself that $x \mapsto H_d x$ can be computed in time $O(d \log d)$ for a vector $x \in \mathbb{R}^d$ (see the sketch after these exercises).

Exercise: Prove that $H_d$ is an orthogonal matrix (the columns have unit Euclidean length and are orthogonal to each other).

Exercise: Let $1 \le p \le 2$ be a norm index, and let $q$ be its dual. Prove that $\|H_d\|_{p \to q} \le d^{1/2 - 1/p}$. (Hint: find $\|H_d\|_{1 \to \infty}$ and use Riesz-Thorin.) This is often known as the Hausdorff-Young inequality.
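The following sketch (mine, not part of the writeup) implements the $O(d \log d)$ transform from the first exercise iteratively, and checks it against the explicit definition of $H_d$ on a small example.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of 2."""
    x = np.asarray(x, dtype=float).copy()
    d = len(x)
    h = 1
    while h < d:
        # One level of the recursion: a butterfly (a, b) -> (a + b, a - b)
        # between the two halves of each block of size 2h.
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)  # fold the log2(d) factors of 1/sqrt(2) into one d^{-1/2}

# Check against the explicit matrix H_d(i, j) = d^{-1/2} (-1)^{<i, j>}, where the
# F_2 scalar product <i, j> of the binary expansions is the parity of popcount(i & j).
d = 16
H = np.array([[(-1.0) ** bin(i & j).count("1") for j in range(d)]
              for i in range(d)]) / np.sqrt(d)
x = np.random.default_rng(0).normal(size=d)
assert np.allclose(fwht(x), H @ x)
print("fwht agrees with the explicit Walsh-Hadamard matrix")
```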

4 Measure Concentration on the Hypercube

Let $f$ be a function from $\ell_2^d$ to $\mathbb{R}$ that is $C$-Lipschitz (meaning that $|f(x) - f(y)| \le C \|x - y\|_2$ for all $x, y \in \ell_2^d$) and convex.

Example: $f(x) = \|Ax\|_p$ for some matrix $A \in \mathbb{R}^{k \times d}$. Then $f$ is convex and $\|A\|_{2 \to p}$-Lipschitz.

Now let $X_1, \ldots, X_d$ be $d$ independent $\{\pm 1\}$ random variables, and let $M_f$ be a median of $f(X_1, \ldots, X_d)$. That is, $M_f$ is a number such that both $\Pr[f(X_1, \ldots, X_d) \le M_f] \ge 1/2$ and $\Pr[f(X_1, \ldots, X_d) \ge M_f] \ge 1/2$. Then
$$\Pr\left[\, |f(X_1, \ldots, X_d) - M_f| > t \,\right] \le 4 \exp(-t^2 / (8 C^2)).$$
This was proved by Talagrand (see the proof in [7]) and is a powerful generalization of many other measure concentration theorems.

5 Error Correcting Codes

Error-correcting codes are combinatorial objects with many nice properties, useful in engineering and also in theoretical computer science. A good introduction can be found in [8]. Another great resource is Madhu Sudan's lecture notes: http://theory.lcs.mit.edu/~madhu/ft01/. Here we concentrate on the definition of linear binary error-correcting codes; we will not discuss constructions. The field of interest is $\mathbb{F}_2$.

A $[k, m, D]$-code is a linear subspace of $\mathbb{F}_2^k$ of dimension $m \le k$ with the property that $w(x) \ge D$ for every nonzero $x$ in the subspace, where $w(x)$ denotes the Hamming weight of $x$ (the number of nonzero coordinates). If $V$ is a $[k, m, D]$-code, then the dual code $V^\perp$ is the space of all vectors $y \in \mathbb{F}_2^k$ such that the scalar product $\langle x, y \rangle$ equals $0$ for all $x \in V$. By standard linear algebra, $\dim V^\perp = k - m$.

Exercise:

1. A set of vectors $U \subseteq \mathbb{F}_2^k$ is $s$-wise independent if for any set of coordinates $i_1 < i_2 < \cdots < i_s$ in $[k]$ and any $y \in \mathbb{F}_2^s$, the number of vectors $x \in U$ such that $(x_{i_1}, \ldots, x_{i_s}) = y$ is exactly $|U| / 2^s$. Let $V \subseteq \mathbb{F}_2^k$ be the dual of a $[k, m, D]$-code. Prove that $V$ is $(D - 1)$-wise independent. (A numerical check appears in the sketch after this exercise list.)

2. Let $V$ be as above, and assume in addition that $D \ge 3$ and $D$ is odd. We transform $V$ into a matrix $A$ of size $2^{k-m} \times k$ by writing each codeword $x \in V$ as a row of $A$, turning $0$'s into $(+1)$'s and $1$'s into $(-1)$'s (note that $V$ contains exactly $2^{k-m}$ vectors). Prove that $\|A\|_{2 \to D-1} \le c \, 2^{(k-m)/(D-1)}$, where $c$ is a constant that depends only on $D$ (and not on $k, m$). (Hint: consider the random variable $\langle y, x \rangle^{D-1}$ for a random row $y$ of $A$ and a fixed $x \in \ell_2^k$.) Find the best bound you can on $\|A\|_{2 \to p}$ for $2 \le p \le D - 1$. (Hint: find $\|A\|_{2 \to 2}$ exactly and then apply Riesz-Thorin.)

3. Let $A$ be as above. Prove that if we permute the rows of $A$, then (up to a constant factor) the columns of $A$ are a subset of the columns of $H_{2^{k-m}}$. (Hint: we only need the fact that $V$ is a vector space.) Conclude that $x \mapsto A^T x$ can be computed efficiently for any $x \in \mathbb{R}^{2^{k-m}}$.
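As a numerical check of the first exercise, here is a sketch (mine, not part of the writeup; the $[7, 4, 3]$ Hamming code is my example choice) that enumerates the dual of the Hamming code and verifies directly that it is $(D - 1) = 2$-wise independent.

```python
import itertools
import numpy as np

# Generator matrix G = [I_4 | P] of the [7, 4, 3] Hamming code (k = 7, m = 4, D = 3).
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

# The dual code is generated by the parity-check matrix H = [P^T | I_3].
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

# Enumerate the dual: all F_2-linear combinations of the rows of H (2^{k-m} = 8 codewords).
dual = {tuple(np.mod(c @ H, 2)) for c in itertools.product([0, 1], repeat=3)}
assert len(dual) == 8

# Sanity check: every dual codeword has zero F_2 scalar product with every row of G.
for v in dual:
    assert not np.mod(G @ np.array(v), 2).any()

# (D - 1)-wise = 2-wise independence: for every pair of coordinates, each of the
# four patterns in F_2^2 appears exactly |dual| / 2^2 = 2 times.
k, s = 7, 2
for coords in itertools.combinations(range(k), s):
    counts = {}
    for v in dual:
        pattern = tuple(v[i] for i in coords)
        counts[pattern] = counts.get(pattern, 0) + 1
    assert all(c == len(dual) // 2**s for c in counts.values())

print("the dual of the [7, 4, 3] Hamming code is 2-wise independent")
```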

References

[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671-687, 2003.

[2] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC), pages 557-563, Seattle, WA, 2006.

[3] S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report 99-006, UC Berkeley, 1999.

[4] P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory, Series A, 44:355-362, 1987.

[5] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC), pages 604-613, 1998.

[6] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

[7] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, 1991.

[8] F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-Correcting Codes. North-Holland, 1983.