Dimensionality Reduction Notes 1

Jelani Nelson
minilek@seas.harvard.edu

August 10, 2015

1 Preliminaries

Here we collect some notation and basic lemmas used throughout this note. Throughout, for a random variable $X$, $\|X\|_p$ denotes $(\mathbb{E}|X|^p)^{1/p}$. It is known that $\|\cdot\|_p$ is a norm for any $p \ge 1$ (Minkowski's inequality). It is also known that $\|X\|_p \le \|X\|_q$ whenever $p \le q$. Henceforth, whenever we discuss $\|\cdot\|_p$, we will assume $p \ge 1$.

Lemma 1 (Khintchine inequality). For any $p \ge 1$, $x \in \mathbb{R}^n$, and $(\sigma_i)$ independent Rademachers,
\[ \Big\| \sum_i \sigma_i x_i \Big\|_p \lesssim \sqrt{p} \cdot \|x\|_2 . \]

Proof. Without loss of generality we can assume $p$ is an even integer. Consider $(g_i)$ independent gaussians of mean zero and variance $1$. Expand $\mathbb{E}(\sum_i \sigma_i x_i)^p$ into a sum of monomials. Any monomial with odd exponents vanishes, as in the gaussian case. Meanwhile the other monomials are nonnegative with all Rademacher moments being $1$, while in the gaussian case the corresponding moments are at least $1$. Thus the Rademacher case is term-by-term dominated by the gaussian case, and $\|\sum_i \sigma_i x_i\|_p \le \|\sum_i g_i x_i\|_p$. But $\sum_i g_i x_i$ is a gaussian with mean zero and variance $\|x\|_2^2$, and hence its $p$-norm is $\|x\|_2 \cdot (p!/(2^{p/2}(p/2)!))^{1/p} \lesssim \sqrt{p}\,\|x\|_2$.

We often use Jensen's inequality below, especially for $F(x) = |x|^p$ ($p \ge 1$).

Lemma 2 (Jensen's inequality). For $F$ convex, $F(\mathbb{E} X) \le \mathbb{E}\, F(X)$.
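As a quick numerical sanity check of Lemma 1 (this example is not part of the original notes), the following snippet estimates $\|\sum_i \sigma_i x_i\|_p$ by Monte Carlo and compares it against $\sqrt{p}\,\|x\|_2$. The vector $x$, the choice $p = 4$, and the trial count are arbitrary; for a gaussian test vector and $p = 4$ the implicit constant $1$ already suffices.

```python
import numpy as np

# Monte Carlo check of Khintchine's inequality (Lemma 1).
rng = np.random.default_rng(0)
n, p, trials = 50, 4, 200_000

x = rng.standard_normal(n)                         # an arbitrary fixed vector
sigma = rng.choice([-1.0, 1.0], size=(trials, n))  # independent Rademacher signs
S = sigma @ x                                      # samples of sum_i sigma_i x_i

lhs = (np.abs(S) ** p).mean() ** (1.0 / p)         # empirical ||sum_i sigma_i x_i||_p
rhs = np.sqrt(p) * np.linalg.norm(x)               # sqrt(p) * ||x||_2
print(lhs <= rhs)                                  # prints True
```

For even $p$ the exact comparison in the proof gives $\|\sum_i \sigma_i x_i\|_4 \le 3^{1/4}\|x\|_2 \approx 1.32\,\|x\|_2$, so the check passes with a comfortable margin.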

Before proving a couple of concentration inequalities, we first prove a lemma which lets us freely obtain tail bounds from moment bounds and vice versa (often we prove a moment bound and later invoke a tail bound, or vice versa, without even mentioning any justification).

Lemma 3. Let $Z$ be a scalar random variable. Consider the following statements:

(1a) There exists $\sigma > 0$ s.t. for all $p \ge 1$, $\|Z\|_p \le C_1 \sigma \sqrt{p}$.
(1b) There exists $\sigma > 0$ s.t. for all $\lambda > 0$, $\mathbb{P}(|Z| > \lambda) \le C_2\, e^{-C_2' \lambda^2/\sigma^2}$.
(2a) There exists $K > 0$ s.t. for all $p \ge 1$, $\|Z\|_p \le C_3 K p$.
(2b) There exists $K > 0$ s.t. for all $\lambda > 0$, $\mathbb{P}(|Z| > \lambda) \le C_4\, e^{-C_4' \lambda/K}$.
(3a) There exist $\sigma, K > 0$ s.t. for all $p \ge 1$, $\|Z\|_p \le C_5(\sigma\sqrt{p} + K p)$.
(3b) There exist $\sigma, K > 0$ s.t. for all $\lambda > 0$, $\mathbb{P}(|Z| > \lambda) \le C_6(e^{-C_6'\lambda^2/\sigma^2} + e^{-C_6'\lambda/K})$.

Then 1a is equivalent to 1b, 2a is equivalent to 2b, and 3a is equivalent to 3b, where in each case the constants $C_i, C_i'$ change by at most some absolute constant factor.

Proof. We will show only that 1a is equivalent to 1b; the other cases are argued identically. To show that 1a implies 1b, by Markov's inequality
\[ \mathbb{P}(|Z| > \lambda) \le \lambda^{-p}\,\mathbb{E}|Z|^p \le \left(\frac{C_1^2 \sigma^2 p}{\lambda^2}\right)^{p/2} . \]
Statement 1b follows by choosing $p = \max\{1,\ \lambda^2/(2C_1^2\sigma^2)\}$. To show that 1b implies 1a, by integration by parts we have
\[ \mathbb{E}|Z|^p = \int_0^\infty p\lambda^{p-1}\,\mathbb{P}(|Z| > \lambda)\,d\lambda \le C_2 \int_0^\infty p\lambda^{p-1}\, e^{-C_2'\lambda^2/\sigma^2}\,d\lambda . \]
The integral on the right hand side is, up to a constant factor, the $p$th moment of a gaussian random variable with mean zero and variance $\tilde\sigma^2 = \sigma^2/(2C_2')$. Statement 1a then follows since such a gaussian has $p$-norm $\Theta(\tilde\sigma\sqrt{p})$.
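To illustrate the direction 1b $\Rightarrow$ 1a of Lemma 3 numerically (my example, not from the notes): a centered gaussian with standard deviation $\sigma$ satisfies the tail bound 1b, and its empirical $p$-norms indeed grow like $\sigma\sqrt{p}$. The constant $C = 1$ in the comparison happens to suffice for a gaussian; the choice of $\sigma$ and the sample size are arbitrary.

```python
import numpy as np

# Z ~ N(0, sigma^2) satisfies the gaussian tail bound (1b); check that its
# empirical moments then grow as in (1a): ||Z||_p <= C * sigma * sqrt(p).
rng = np.random.default_rng(1)
sigma = 2.0
Z = sigma * rng.standard_normal(500_000)

ok = True
for p in [1, 2, 4, 8]:
    Zp = (np.abs(Z) ** p).mean() ** (1.0 / p)  # empirical ||Z||_p
    ok = ok and (Zp <= sigma * np.sqrt(p))     # C = 1 suffices for a gaussian
print(ok)                                      # prints True
```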

Now, the following is a bread-and-butter trick for bounding $p$th moments of sums of independent random variables. A more general version of this lemma can be found as Lemma 6.3 in [LT91].

Lemma 4 (Symmetrization / desymmetrization). Let $Z_1, \ldots, Z_n$ be independent random variables and let $r_1, \ldots, r_n$ be independent Rademachers. Then
\[ \Big\|\sum_i Z_i - \mathbb{E}\sum_i Z_i\Big\|_p \le 2\,\Big\|\sum_i r_i Z_i\Big\|_p \quad \text{(symmetrization inequality)} \]
and
\[ \frac12\,\Big\|\sum_i r_i (Z_i - \mathbb{E}\, Z_i)\Big\|_p \le \Big\|\sum_i Z_i\Big\|_p \quad \text{(desymmetrization inequality)} . \]

Proof. For the first inequality, let $Y_1, \ldots, Y_n$ be independent of the $Z_i$ but identically distributed to them. Then
\begin{align*}
\Big\|\sum_i Z_i - \mathbb{E}\sum_i Z_i\Big\|_p &= \Big\|\sum_i Z_i - \mathbb{E}_Y \sum_i Y_i\Big\|_p \\
&\le \Big\|\sum_i (Z_i - Y_i)\Big\|_p && \text{(Jensen)} \\
&= \Big\|\sum_i r_i (Z_i - Y_i)\Big\|_p && (1) \\
&\le 2\,\Big\|\sum_i r_i Z_i\Big\|_p && \text{(triangle inequality)}
\end{align*}
where (1) follows since the $Z_i - Y_i$ are independent across $i$ and symmetric.

For the second inequality, let the $Y_i$ be as before. Then
\begin{align*}
\Big\|\sum_i r_i (Z_i - \mathbb{E}\, Z_i)\Big\|_p &= \Big\|\,\mathbb{E}_Y \sum_i r_i (Z_i - Y_i)\Big\|_p \\
&\le \Big\|\sum_i r_i (Z_i - Y_i)\Big\|_p && \text{(Jensen)} \\
&= \Big\|\sum_i (Z_i - Y_i)\Big\|_p \\
&\le 2\,\Big\|\sum_i Z_i\Big\|_p && \text{(triangle inequality)}
\end{align*}
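A Monte Carlo sanity check of the symmetrization inequality in Lemma 4 (again my example, not part of the notes), taking the $Z_i$ to be i.i.d. exponentials so that the centered sum is visibly asymmetric; the parameters $n$, $p$, and the trial count are arbitrary.

```python
import numpy as np

# Monte Carlo check of the symmetrization inequality (Lemma 4) for
# Z_i ~ Exp(1) i.i.d. (so E sum_i Z_i = n) and p = 4.
rng = np.random.default_rng(2)
n, p, trials = 30, 4, 100_000

Z = rng.exponential(1.0, size=(trials, n))
r = rng.choice([-1.0, 1.0], size=(trials, n))  # independent Rademachers

# ||sum_i Z_i - E sum_i Z_i||_p versus 2 * ||sum_i r_i Z_i||_p
centered = (np.abs(Z.sum(axis=1) - n) ** p).mean() ** (1.0 / p)
symmetrized = (np.abs((r * Z).sum(axis=1)) ** p).mean() ** (1.0 / p)
print(centered <= 2 * symmetrized)             # prints True
```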

Lemma 5 (Decoupling [dlPG99]). Let $x_1, \ldots, x_n$ be independent and mean zero, and let $x_1', \ldots, x_n'$ be identically distributed as the $x_i$ and independent of them. Then for any $(a_{i,j})$ and for all $p \ge 1$,
\[ \Big\|\sum_{i \ne j} a_{i,j}\, x_i x_j\Big\|_p \le 4\,\Big\|\sum_{i,j} a_{i,j}\, x_i x_j'\Big\|_p . \]

Proof. Let $\eta_1, \ldots, \eta_n$ be independent Bernoulli random variables, each of expectation $1/2$. Since $\mathbb{E}\,\eta_i(1-\eta_j) = 1/4$ for $i \ne j$,
\[ \Big\|\sum_{i\ne j} a_{i,j}\, x_i x_j\Big\|_p = 4\,\Big\|\,\mathbb{E}_\eta \sum_{i\ne j} a_{i,j}\, x_i x_j\, \eta_i (1-\eta_j)\Big\|_p \le 4\,\Big\|\sum_{i\ne j} a_{i,j}\, x_i x_j\, \eta_i(1-\eta_j)\Big\|_p \quad \text{(Jensen)} \tag{2} \]
Hence there must be some fixed vector $\eta \in \{0,1\}^n$ which achieves
\[ \Big\|\sum_{i\ne j} a_{i,j}\, x_i x_j\, \eta_i (1-\eta_j)\Big\|_p \ge \frac14\,\Big\|\sum_{i \ne j} a_{i,j}\, x_i x_j\Big\|_p , \]
where $\sum_{i\ne j} a_{i,j} x_i x_j \eta_i(1-\eta_j) = \sum_{i \in S}\sum_{j \notin S} a_{i,j}\, x_i x_j$ for $S = \{i : \eta_i = 1\}$. Let $x_S$ denote the $|S|$-dimensional vector corresponding to the $x_i$ for $i \in S$. Since the $x_j$ for $j \notin S$ are independent of $x_S$ and identically distributed to the $x_j'$, we may replace them by the copies, and then
\[ \Big\|\sum_{i\in S}\sum_{j\notin S} a_{i,j}\, x_i x_j\Big\|_p = \Big\|\sum_{i\in S}\sum_{j \notin S} a_{i,j}\, x_i x_j'\Big\|_p = \Big\|\,\mathbb{E}\Big[\sum_{i,j} a_{i,j}\, x_i x_j' \,\Big|\, x_S,\ (x_j')_{j\notin S}\Big]\Big\|_p \le \Big\|\sum_{i,j} a_{i,j}\, x_i x_j'\Big\|_p \quad\text{(Jensen)}, \]
where the middle equality uses $\mathbb{E}\, x_i = \mathbb{E}\, x_j' = 0$.

The following proof of the Hanson-Wright inequality was shared with me by Sjoerd Dirksen (personal communication).

Theorem 1 (Hanson-Wright inequality [HW71]). For $\sigma_1, \ldots, \sigma_n$ independent Rademachers and $A \in \mathbb{R}^{n\times n}$ real and symmetric, for all $p \ge 1$,
\[ \|\sigma^T A \sigma - \mathbb{E}\,\sigma^T A \sigma\|_p \lesssim \sqrt{p}\,\|A\|_F + p\,\|A\| . \]

Proof. Without loss of generality we assume in this proof that $p \ge 2$ (so that $p/2 \ge 1$). Let $\sigma'$ and $\sigma''$ be independent copies of $\sigma$. Then
\begin{align*}
\|\sigma^T A \sigma - \mathbb{E}\,\sigma^T A \sigma\|_p
&\lesssim \|\sigma^T A \sigma'\|_p && \text{(Lemma 5)} \tag{3} \\
&\lesssim \sqrt{p}\,\big\|\,\|A\sigma'\|_2\,\big\|_p && \text{(Khintchine, conditionally on $\sigma'$)} \tag{4} \\
&= \sqrt{p}\,\big\|\,\|A\sigma'\|_2^2\,\big\|_{p/2}^{1/2} \tag{5} \\
&\le \sqrt{p}\,\big\|\,\|A\sigma'\|_2^2\,\big\|_p^{1/2} \\
&\le \sqrt{p}\,\big(\|A\|_F^2 + \big\|\,\|A\sigma'\|_2^2 - \mathbb{E}\|A\sigma'\|_2^2\,\big\|_p\big)^{1/2} && \text{(triangle inequality)} \\
&\le \sqrt{p}\,\|A\|_F + \sqrt{p}\,\big\|\,\|A\sigma'\|_2^2 - \mathbb{E}\|A\sigma'\|_2^2\,\big\|_p^{1/2} \\
&\lesssim \sqrt{p}\,\|A\|_F + \sqrt{p}\,\big\|\sigma'^T A^T A \sigma''\big\|_p^{1/2} && \text{(Lemma 5)} \\
&\lesssim \sqrt{p}\,\|A\|_F + p^{3/4}\,\big\|\,\|A^T A \sigma''\|_2\,\big\|_p^{1/2} && \text{(Khintchine)} \\
&\le \sqrt{p}\,\|A\|_F + p^{3/4}\,\|A\|^{1/2}\,\big\|\,\|A\sigma''\|_2\,\big\|_p^{1/2} \tag{6}
\end{align*}
Here we used $\mathbb{E}\|A\sigma'\|_2^2 = \|A\|_F^2$, the inequality $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$, and $\|A^T A \sigma''\|_2 \le \|A\|\cdot\|A\sigma''\|_2$.

Writing $E = \big\|\,\|A\sigma'\|_2\,\big\|_p^{1/2}$ (note $\sigma'$ and $\sigma''$ are identically distributed) and comparing (4) and (6), we see that for some constant $C > 0$,
\[ E^2 - C p^{1/4}\|A\|^{1/2} E - C\|A\|_F \le 0 . \]
Thus $E$ must be smaller than the larger root of the above quadratic equation, implying our desired upper bound on $E^2$.

Remark 1. The square root trick in the proof of the Hanson-Wright inequality above is quite handy and can be used to prove several moment inequalities (for example, you will see how to prove the Bernstein inequality with it in tomorrow's lecture). As far as I am aware, the trick was first used in a work of Rudelson [Rud99].

Remark 2. We could have upper bounded Eq. (5) by $\sqrt{p}\,\|A\|_F + \sqrt{p}\,\big\|\,\|A\sigma'\|_2^2 - \mathbb{E}\|A\sigma'\|_2^2\,\big\|_{p/2}^{1/2}$ by the triangle inequality. Notice that we would then have bounded the $p$th central moment of a symmetric quadratic form (3) by the $(p/2)$th moment, also of a symmetric quadratic form. Writing $p = 2^k$, this observation leads to a proof by induction on $k$, which was the approach used in [DKN10].
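As an empirical check of the Hanson-Wright moment bound (my example; the notes do not include it), the snippet below estimates $\|\sigma^T A \sigma - \mathbb{E}\,\sigma^T A \sigma\|_p$ for a fixed random symmetric $A$ and compares it against $\sqrt{p}\,\|A\|_F + p\,\|A\|$, taking the implicit constant to be $1$, which happens to suffice for this instance.

```python
import numpy as np

# Monte Carlo check of the Hanson-Wright moment bound for a fixed symmetric A.
rng = np.random.default_rng(3)
n, p, trials = 40, 4, 100_000

B = rng.standard_normal((n, n))
A = (B + B.T) / 2                                  # a fixed real symmetric matrix

sigma = rng.choice([-1.0, 1.0], size=(trials, n))  # rows are Rademacher vectors
quad = ((sigma @ A) * sigma).sum(axis=1)           # sigma^T A sigma, per trial
centered = quad - np.trace(A)                      # E sigma^T A sigma = tr(A)

emp = (np.abs(centered) ** p).mean() ** (1.0 / p)  # empirical p-th moment norm
bound = np.sqrt(p) * np.linalg.norm(A, "fro") + p * np.linalg.norm(A, 2)
print(emp <= bound)                                # prints True
```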

2 Johnson-Lindenstrauss (JL) lemma

First we prove the Distributional JL (DJL) lemma.

Lemma 6 (DJL lemma). For any integer $n > 1$ and $\varepsilon, \delta \in (0, 1/2)$, there exists a distribution $\mathcal{D}_{\varepsilon,\delta}$ over $\mathbb{R}^{m\times n}$ for $m \gtrsim \varepsilon^{-2}\log(1/\delta)$ such that for any $x \in \mathbb{R}^n$ of unit Euclidean norm,
\[ \mathbb{P}_{\Pi \sim \mathcal{D}_{\varepsilon,\delta}}\big( \big|\,\|\Pi x\|_2^2 - 1\,\big| > \varepsilon \big) < \delta . \]

Proof. Write $\Pi_{i,j} = \sigma_{i,j}/\sqrt{m}$, where the $\sigma_{i,j}$ are independent Rademachers. Also overload $\sigma$ to mean these Rademachers arranged as a vector of length $mn$, by concatenating rows of $\Pi$. Note then $\|\Pi x\|_2^2 = \|A_x \sigma\|_2^2$, where
\[ A_x = \frac{1}{\sqrt{m}} \begin{pmatrix} x^T & 0 & \cdots & 0 \\ 0 & x^T & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & x^T \end{pmatrix} \in \mathbb{R}^{m \times mn} . \tag{7} \]
Thus
\[ \mathbb{P}(|\,\|\Pi x\|_2^2 - 1\,| > \varepsilon) = \mathbb{P}(|\,\|A_x\sigma\|_2^2 - \mathbb{E}\|A_x\sigma\|_2^2\,| > \varepsilon), \]
where we see that the right-hand side is readily handled by the Hanson-Wright inequality (Theorem 1) with $A = A_x^T A_x$. Now observe that $A$ is a block-diagonal matrix with each block equaling $(1/m)\, x x^T$, and thus $\|A\| = \|x\|_2^2/m = 1/m$. We also have $\|A\|_F^2 = 1/m$. Thus Hanson-Wright (in tail-bound form, via Lemma 3) yields
\[ \mathbb{P}(|\,\|\Pi x\|_2^2 - 1\,| > \varepsilon) \lesssim e^{-C\varepsilon^2 m} + e^{-C\varepsilon m}, \]
which for $\varepsilon < 1$ is at most $\delta$ for $m \gtrsim \varepsilon^{-2}\log(1/\delta)$.

The following is what is usually referred to as the Johnson-Lindenstrauss (JL) lemma [JL84]. In this note we typically refer to it as the Metric JL (MJL) lemma to distinguish it from DJL above. At some points we simply say JL instead of DJL or MJL, but the version meant will be understood from context.
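The DJL construction above is easy to test empirically (this check is mine, not from the notes): draw $\Pi$ with i.i.d. $\pm 1/\sqrt{m}$ entries and measure how often $\|\Pi x\|_2^2$ deviates from $1$ by more than $\varepsilon$. The leading constant $8$ in the choice of $m$ below is illustrative, not a claimed optimal constant.

```python
import numpy as np

# Empirical check of DJL with Pi having i.i.d. +/- 1/sqrt(m) entries.
rng = np.random.default_rng(4)
n, eps, delta = 200, 0.25, 0.05
m = int(np.ceil(8 * np.log(1 / delta) / eps**2))   # m ~ eps^-2 log(1/delta)

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                             # unit-norm test vector

trials, failures = 100, 0
for _ in range(trials):
    Pi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
    if abs(np.linalg.norm(Pi @ x) ** 2 - 1) > eps:
        failures += 1
print(failures / trials < delta)                   # prints True
```

Note the failure rate observed is typically far below $\delta$, since the tail bound is loose for this generous choice of $m$.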

Corollary 1 (Metric JL (MJL) lemma). For any $X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^n$ and $0 < \varepsilon < 1/2$, there exists $f : X \to \mathbb{R}^m$ for $m = O(\varepsilon^{-2}\log N)$ such that for all $1 \le i < j \le N$,
\[ (1-\varepsilon)\,\|x_i - x_j\|_2 \le \|f(x_i) - f(x_j)\|_2 \le (1+\varepsilon)\,\|x_i - x_j\|_2 . \tag{8} \]

Proof. Let $\mathcal{D}_{\varepsilon,\delta}$ be as in DJL with $\delta < 1/\binom{N}{2}$. Consider a random $f$, where $f(x) = \Pi x$ for $\Pi$ drawn from $\mathcal{D}_{\varepsilon,\delta}$. By DJL, each vector of the form $(x_i - x_j)/\|x_i - x_j\|_2$ has its norm preserved up to a factor $1 \pm \varepsilon$ with probability strictly larger than $1 - 1/\binom{N}{2}$. Thus by a union bound over all $i, j$, all such vectors are preserved simultaneously with positive probability, showing existence of the desired $f$.

2.1 Example application: k-means clustering

In the $k$-means clustering problem the input consists of $x_1, \ldots, x_N \in \mathbb{R}^n$ and a positive integer $k$, and the goal is to output some partition $\mathcal{P}$ of $[N]$ into $k$ disjoint subsets $P_1, \ldots, P_k$ as well as some $y = (y_1, \ldots, y_k) \in (\mathbb{R}^n)^k$ (the $y_j$ need not be equal to any of the $x_i$ and can be chosen arbitrarily) so as to minimize the cost function
\[ \mathrm{cost}_{\mathcal{P},y}(x_1, \ldots, x_N) = \sum_{j=1}^k \sum_{i \in P_j} \|x_i - y_j\|_2^2 . \]
That is, the $x_i$ are clustered into $k$ clusters according to $\mathcal{P}$, and the cost of a given clustering is the sum of squared Euclidean distances from the points to their cluster centers (the $y_j$). Unfortunately, finding the optimal clustering for $k$-means is NP-hard; however, efficient approximation algorithms do exist which find clusterings that are close to optimal. It is easy to show, e.g. by taking the gradient of the cost function, that for a fixed partition $\mathcal{P}$ of $[N]$, the optimal choice of cluster centers $y$ for that given $\mathcal{P}$ is the one where, for the $P_j$ of positive size, $y_j = (1/|P_j|)\sum_{i \in P_j} x_i$. Thus we can restrict our attention to just optimizing over $\mathcal{P}$. For a set of input points $X$, we let $\mathrm{cost}_{\mathcal{P}}(X)$ denote $\inf_y \mathrm{cost}_{\mathcal{P},y}(X)$.

Lemma 7. Let the input points to $k$-means be $X = \{x_1, \ldots, x_N\}$. Then for any $0 < \varepsilon < 1/2$, if $f : X \to \mathbb{R}^m$ is such that for all $i, j$
\[ (1-\varepsilon)\,\|x_i - x_j\|_2^2 \le \|f(x_i) - f(x_j)\|_2^2 \le (1+\varepsilon)\,\|x_i - x_j\|_2^2 \]

then for $\hat{\mathcal{P}}$ a $\gamma$-approximate optimal clustering for $f(X)$ and $\mathcal{P}^*$ an optimal clustering for $X$, it holds that
\[ \mathrm{cost}_{\hat{\mathcal{P}}}(X) \le \gamma\left(\frac{1+\varepsilon}{1-\varepsilon}\right)\mathrm{cost}_{\mathcal{P}^*}(X) . \]

Proof. Fix a partition $\mathcal{P}$ of $[N]$ and write $\mathcal{P} = (P_1, \ldots, P_k)$. Then
\begin{align*}
\mathrm{cost}_{\mathcal{P}}(X) &= \sum_{j\in[k]} \sum_{i \in P_j} \Big\| x_i - \frac{1}{|P_j|}\sum_{i' \in P_j} x_{i'} \Big\|_2^2 \\
&= \sum_{j\in[k]} \sum_{i\in P_j} \Big( \|x_i\|_2^2 - \frac{2}{|P_j|}\sum_{i'\in P_j}\langle x_i, x_{i'}\rangle + \frac{1}{|P_j|^2}\sum_{i',i''\in P_j}\langle x_{i'}, x_{i''}\rangle \Big) \\
&= \sum_{j\in[k]} \frac{1}{|P_j|} \sum_{i,i'\in P_j} \Big( \frac12\big(\|x_i\|_2^2 + \|x_{i'}\|_2^2\big) - \langle x_i, x_{i'}\rangle \Big) \\
&= \sum_{j\in[k]} \frac{1}{2|P_j|} \sum_{i,i'\in P_j} \|x_i - x_{i'}\|_2^2 .
\end{align*}
In particular, $\mathrm{cost}_{\mathcal{P}}(X)$ depends only on pairwise squared distances within clusters. Thus if $f$ satisfies the condition of the lemma, then
\[ (1-\varepsilon)\,\mathrm{cost}_{\mathcal{P}}(X) \le \mathrm{cost}_{\mathcal{P}}(f(X)) \le (1+\varepsilon)\,\mathrm{cost}_{\mathcal{P}}(X) \]
for all partitions $\mathcal{P}$ simultaneously. Thus we have
\[ (1-\varepsilon)\,\mathrm{cost}_{\hat{\mathcal{P}}}(X) \le \mathrm{cost}_{\hat{\mathcal{P}}}(f(X)) \le \gamma\,\mathrm{cost}_{\mathcal{P}^*}(f(X)) \le \gamma(1+\varepsilon)\,\mathrm{cost}_{\mathcal{P}^*}(X) . \]
The lemma follows by comparing the right hand side with the left.

References

[DKN10] Ilias Diakonikolas, Daniel M. Kane, and Jelani Nelson. Bounded independence fools degree-2 threshold functions. In 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 11–20, 2010.

[dlPG99] Víctor de la Peña and Evarist Giné. Decoupling: From Dependence to Independence. Probability and its Applications. Springer-Verlag, New York, 1999.

[HW71] David Lee Hanson and Farroll Tim Wright. A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Statist., 42:1079–1083, 1971.

[JL84] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[LT91] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer-Verlag, Berlin, 1991.

[Rud99] Mark Rudelson. Random vectors in the isotropic position. J. Functional Analysis, 164(1):60–72, 1999.