Improved Bounds on the Dot Product under Random Projection and Random Sign Projection


Improved Bounds on the Dot Product under Random Projection and Random Sign Projection. Ata Kabán, School of Computer Science, The University of Birmingham, Birmingham B15 2TT, UK. http://www.cs.bham.ac.uk/~axk. KDD 2015, Sydney, 10-13 August 2015.

Outline
- Introduction & motivation
- A Johnson-Lindenstrauss lemma (JLL) for the dot product without union bound
- Corollaries & connections with previous results
- Numerical validation
- Application to bounding the generalisation error of compressive linear classifiers
- Conclusions and future work

Introduction
- The dot product is a key building block in data mining: classification, regression, retrieval, correlation-clustering, etc.
- Random projection (RP) is a universal dimensionality reduction method: independent of the data, computationally cheap, and it comes with low-distortion guarantees.
- The Johnson-Lindenstrauss lemma (JLL) for Euclidean distances is optimal, but for the dot product the guarantees have been looser; some work suggested that obtuse angles may not be preserved.

Background: JLL for Euclidean distance

Theorem [Johnson-Lindenstrauss lemma]. Let $x, y \in \mathbb{R}^d$. Let $R \in M_{k \times d}$, $k < d$, be a random projection matrix with entries drawn i.i.d. from a 0-mean subgaussian distribution with parameter $\sigma^2$, and let $Rx, Ry \in \mathbb{R}^k$ be the images of $x, y$ under $R$. Then, for all $\epsilon \in (0,1)$:

$$\Pr\{\|Rx - Ry\|^2 < (1 - \epsilon)\,\|x - y\|^2 k\sigma^2\} < \exp\!\left(-\frac{k\epsilon^2}{8}\right) \quad (1)$$
$$\Pr\{\|Rx - Ry\|^2 > (1 + \epsilon)\,\|x - y\|^2 k\sigma^2\} < \exp\!\left(-\frac{k\epsilon^2}{8}\right) \quad (2)$$

An elementary constructive proof is in [Dasgupta & Gupta, 2002]. These bounds are known to be optimal [Larsen & Nelson, 2014].
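The concentration in (1)-(2) is easy to probe numerically. Below is a minimal Monte Carlo sketch (my own, not from the talk; the parameter choices are illustrative and $\sigma^2$ is set to $1/k$):

```python
# Minimal Monte Carlo check of the JLL distance bounds (1)-(2)
# for a Gaussian RP matrix with N(0, sigma^2) entries.
import numpy as np

rng = np.random.default_rng(0)
d, k, eps, trials = 300, 100, 0.5, 2000
sigma2 = 1.0 / k

x, y = rng.normal(size=d), rng.normal(size=d)
dist2 = np.sum((x - y) ** 2)

fails = 0
for _ in range(trials):
    R = rng.normal(scale=np.sqrt(sigma2), size=(k, d))
    proj2 = np.sum((R @ x - R @ y) ** 2)
    # count events that fall outside the (1 +/- eps) * ||x-y||^2 * k * sigma^2 band
    fails += not ((1 - eps) * dist2 * k * sigma2 < proj2 < (1 + eps) * dist2 * k * sigma2)

print("empirical failure rate:", fails / trials)
print("analytic bound (both tails):", 2 * np.exp(-k * eps ** 2 / 8))
```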

The quick & loose JLL for the dot product

$$(Rx)^T Ry = \frac{1}{4}\left(\|R(x + y)\|^2 - \|R(x - y)\|^2\right)$$

Now, applying the JLL to both terms separately and applying the union bound yields:

$$\Pr\{(Rx)^T Ry < x^T y\, k\sigma^2 - \epsilon k\sigma^2 \|x\|\|y\|\} < 2\exp\!\left(-\frac{k\epsilon^2}{8}\right)$$
$$\Pr\{(Rx)^T Ry > x^T y\, k\sigma^2 + \epsilon k\sigma^2 \|x\|\|y\|\} < 2\exp\!\left(-\frac{k\epsilon^2}{8}\right)$$

Or, using $(Rx)^T Ry = \frac{1}{2}\left(\|R(x + y)\|^2 - \|Rx\|^2 - \|Ry\|^2\right)$, we get factors of 3 in front of the exponential.
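The polarization identities above hold exactly for any matrix $R$; the looseness comes only from union-bounding the two (or three) JLL events. A tiny numerical sketch of both identities (my own, not from the talk):

```python
# The two polarization identities behind the "quick & loose" argument
# hold exactly for any R; only the union bound over the JLL events is loose.
import numpy as np

rng = np.random.default_rng(1)
d, k = 300, 20
x, y = rng.normal(size=d), rng.normal(size=d)
R = rng.normal(scale=1 / np.sqrt(k), size=(k, d))

lhs = (R @ x) @ (R @ y)
rhs4 = (np.sum((R @ (x + y)) ** 2) - np.sum((R @ (x - y)) ** 2)) / 4
rhs2 = (np.sum((R @ (x + y)) ** 2) - np.sum((R @ x) ** 2) - np.sum((R @ y) ** 2)) / 2
print(lhs, rhs4, rhs2)  # all three agree up to floating point error
```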

Can we improve the JLL for dot products? The problems:
- Technical issue: the union bound.
- More fundamental issue: the ratio of the standard deviation of the projected dot product to the original dot product (the "coefficient of variation") is unbounded [Li et al. 2006].
- Other issue: some previous proofs were only applicable to acute angles [Shi et al, 2012]; obtuse angles were investigated only empirically, which is inevitably based on limited numerical tests.

Results: Improved bounds for the dot product

Theorem [Dot Product under Random Projection]. Let $x, y \in \mathbb{R}^d$. Let $R \in M_{k \times d}$, $k < d$, be a random projection matrix having i.i.d. 0-mean subgaussian entries with parameter $\sigma^2$, and let $Rx, Ry \in \mathbb{R}^k$ be the images of $x, y$ under $R$. Then, for all $\epsilon \in (0,1)$:

$$\Pr\{(Rx)^T Ry < x^T y\, k\sigma^2 - \epsilon k\sigma^2 \|x\|\|y\|\} < \exp\!\left(-\frac{k\epsilon^2}{8}\right) \quad (3)$$
$$\Pr\{(Rx)^T Ry > x^T y\, k\sigma^2 + \epsilon k\sigma^2 \|x\|\|y\|\} < \exp\!\left(-\frac{k\epsilon^2}{8}\right) \quad (4)$$

The proof uses elementary techniques: a standard Chernoff bounding argument that exploits the convexity of the exponential function. The union bound is eliminated. (Details in the paper.)
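A minimal Monte Carlo sketch (my own setup, with $\sigma^2 = 1/k$ and arbitrary test vectors) comparing the empirical tails of the projected dot product with the analytic bound $\exp(-k\epsilon^2/8)$ from (3)-(4):

```python
# Empirical tail probabilities of (Rx)^T(Ry) versus the bound exp(-k eps^2 / 8).
import numpy as np

rng = np.random.default_rng(2)
d, k, eps, trials = 300, 100, 0.5, 2000
sigma2 = 1.0 / k

x, y = rng.normal(size=d), rng.normal(size=d)
dot, norms = x @ y, np.linalg.norm(x) * np.linalg.norm(y)

low = high = 0
for _ in range(trials):
    R = rng.normal(scale=np.sqrt(sigma2), size=(k, d))
    pdot = (R @ x) @ (R @ y)
    low += pdot < dot * k * sigma2 - eps * k * sigma2 * norms
    high += pdot > dot * k * sigma2 + eps * k * sigma2 * norms

print("lower tail:", low / trials, " upper tail:", high / trials)
print("bound per tail:", np.exp(-k * eps ** 2 / 8))
```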

Corollaries (1): Clarifying the role of the angle

Corollary [Relative distortion bounds]. Denote by $\theta$ the angle between the vectors $x, y \in \mathbb{R}^d$. Then we have the following:

1. Relative distortion bound: Assume $x^T y \neq 0$. Then,
$$\Pr\left\{\left|\frac{x^T R^T Ry}{x^T y} - k\sigma^2\right| > \epsilon\right\} < 2\exp\!\left(-\frac{k\epsilon^2\cos^2(\theta)}{8(k\sigma^2)^2}\right) \quad (5)$$

2. Multiplicative form of the relative distortion bound:
$$\Pr\{x^T R^T Ry < x^T y\,(1 - \epsilon)\, k\sigma^2\} < \exp\!\left(-\frac{k}{8}\epsilon^2\cos^2(\theta)\right) \quad (6)$$
$$\Pr\{x^T R^T Ry > x^T y\,(1 + \epsilon)\, k\sigma^2\} < \exp\!\left(-\frac{k}{8}\epsilon^2\cos^2(\theta)\right) \quad (7)$$

Observations from the Corollary

The guarantees are the same for both obtuse and acute angles! Symmetric around orthogonal angles.

Relation to the coefficient of variation [Li et al.]:
$$\frac{\sqrt{\mathrm{Var}(x^T R^T Ry)}}{x^T y} \quad \text{(unbounded)} \quad (8)$$
Computing this (case of Gaussian $R$),
$$\frac{\sqrt{\mathrm{Var}(x^T R^T Ry)}}{x^T y} = \frac{1}{\sqrt{k}}\sqrt{1 + \frac{1}{\cos^2(\theta)}} \quad (9)$$
we see that an unbounded coefficient of variation occurs only when $x$ and $y$ are perpendicular. Again, symmetric around orthogonal angles.
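The closed form in (9) is easy to verify by simulation. A sketch (my own check, assuming Gaussian $R$ with $\sigma^2 = 1/k$ and unit-length vectors at a chosen angle):

```python
# Monte Carlo estimate of the coefficient of variation of x^T R^T R y
# versus the closed form (1/sqrt(k)) * sqrt(1 + 1/cos^2(theta)).
import numpy as np

rng = np.random.default_rng(3)
d, k, trials = 300, 50, 10000
theta = 2.0  # an obtuse angle, in radians

x = np.zeros(d); x[0] = 1.0
y = np.zeros(d); y[0], y[1] = np.cos(theta), np.sin(theta)  # unit vectors at angle theta

vals = np.empty(trials)
for t in range(trials):
    R = rng.normal(scale=1 / np.sqrt(k), size=(k, d))
    vals[t] = (R @ x) @ (R @ y)

emp_cov = vals.std() / abs(vals.mean())          # |std / mean|
analytic = np.sqrt(1 + 1 / np.cos(theta) ** 2) / np.sqrt(k)
print("empirical:", emp_cov, " analytic:", analytic)
```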

Corollaries (2)

Corollary [Margin type bounds and random sign projection]. Denote by $\theta$ the angle between the vectors $x, y \in \mathbb{R}^d$. Then,

1. Margin bound: Assume $x^T y \geq 0$. Then, for all $\rho$ s.t. $\rho < x^T y\, k\sigma^2$ and $\rho > (\cos(\theta) - 1)\,\|x\|\|y\|\, k\sigma^2$,
$$\Pr\{x^T R^T Ry < \rho\} < \exp\!\left(-\frac{k}{8}\left(\cos(\theta) - \frac{\rho}{\|x\|\|y\|\, k\sigma^2}\right)^2\right) \quad (10)$$
and for all $\rho$ s.t. $\rho > x^T y\, k\sigma^2$ and $\rho < (\cos(\theta) + 1)\,\|x\|\|y\|\, k\sigma^2$,
$$\Pr\{x^T R^T Ry > \rho\} < \exp\!\left(-\frac{k}{8}\left(\frac{\rho}{\|x\|\|y\|\, k\sigma^2} - \cos(\theta)\right)^2\right) \quad (11)$$

2. Dot product under random sign projection: Assume $x^T y \neq 0$. Then,
$$\Pr\left\{\frac{x^T R^T Ry}{x^T y} < 0\right\} < \exp\!\left(-\frac{k}{8}\cos^2(\theta)\right) \quad (12)$$

These forms of the bound, with $\rho > 0$, are useful for instance to bound the margin loss of compressive classifiers. Details to follow shortly. The random sign projection bound was used before to bound the error of compressive classifiers under the 0-1 loss [Durrant & Kabán, ICML 13] in the case of Gaussian RP; here subgaussian RP is allowed.
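A small sketch (my own; it uses $\pm 1/\sqrt{k}$ sign entries as the subgaussian RP and unit vectors at a chosen obtuse angle) comparing the empirical sign-flipping rate with the bound $\exp(-k\cos^2(\theta)/8)$ in (12):

```python
# Empirical probability that random (sign) projection flips the sign of the
# dot product, compared with the bound exp(-k cos^2(theta) / 8).
import numpy as np

rng = np.random.default_rng(4)
d, k, trials = 300, 40, 2000
theta = 2.4  # obtuse angle; the bound depends only on cos^2(theta)

x = np.zeros(d); x[0] = 1.0
y = np.zeros(d); y[0], y[1] = np.cos(theta), np.sin(theta)

flips = 0
for _ in range(trials):
    R = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)  # subgaussian sign entries
    flips += ((R @ x) @ (R @ y)) / (x @ y) < 0

print("empirical flip rate:", flips / trials)
print("analytic bound:", np.exp(-k * np.cos(theta) ** 2 / 8))
```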

Numerical validation

We will compute empirical estimates of the following probabilities from 2000 independently drawn instances of the RP. The target dimension varies from 1 to the original dimension $d = 300$.

Rejection probability for dot product preservation = the probability that the relative distortion of the dot product after RP falls outside the allowed error tolerance $\epsilon$:
$$1 - \Pr\left\{(1 - \epsilon) < \frac{(Rx)^T Ry}{x^T y} < (1 + \epsilon)\right\} \quad (13)$$

The sign flipping probability:
$$\Pr\left\{\frac{(Rx)^T Ry}{x^T y} < 0\right\} \quad (14)$$
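A sketch of this validation protocol (assumed details: unit-length vectors at one fixed angle, Gaussian $R$ with $\sigma^2 = 1/k$, and a coarse grid of target dimensions rather than every $k$ from 1 to $d$):

```python
# Empirical rejection probability (13) and sign-flipping probability (14),
# estimated over 2000 RP draws for several target dimensions k.
import numpy as np

rng = np.random.default_rng(5)
d, trials, eps, theta = 300, 2000, 0.3, 2.0

x = np.zeros(d); x[0] = 1.0
y = np.zeros(d); y[0], y[1] = np.cos(theta), np.sin(theta)
dot = x @ y

for k in (1, 10, 50, 150, 300):
    reject = flip = 0
    for _ in range(trials):
        R = rng.normal(scale=1 / np.sqrt(k), size=(k, d))
        ratio = ((R @ x) @ (R @ y)) / dot
        reject += not (1 - eps < ratio < 1 + eps)
        flip += ratio < 0
    print(f"k={k:4d}  reject={reject / trials:.3f}  flip={flip / trials:.3f}")
```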

Replicating the results in [Shi et al, ICML 12]. Left: two acute angles; Right: two obtuse angles. Preservation of these obtuse angles does indeed look worse... but not because they are obtuse (see next slide!).

Now take angles symmetrical around $\pi/2$ and observe the opposite behaviour. This is why the previous result in [Shi et al, ICML 12] has been misleading. Left: two acute angles; Right: two obtuse angles.

Numerical validation: the full picture. Left: empirical estimates of the rejection probability for dot product preservation; Right: our analytic upper bound. The error tolerance was set to $\epsilon = 0.3$. Darker means higher probability.

The same with $\epsilon = 0.1$. The bound matches the true behaviour: all of these probabilities are symmetric around the angles $\pi/2$ and $3\pi/2$ (i.e. orthogonal vectors before RP). Thus, the preservation of the dot product is symmetrically identical for both acute and obtuse angles.

Empirical estimates of the sign flipping probability vs. our analytic upper bound. Darker means higher probability.

An application in machine learning: Margin bound for compressive linear classification

Consider the hypothesis class of linear classifiers defined by a unit length parameter vector:
$$\mathcal{H} = \{x \mapsto h(x) = w^T x : w \in \mathbb{R}^d, \|w\|_2 = 1\} \quad (15)$$

The parameters $w$ are estimated from a training set of size $N$: $\mathcal{T}^N = \{(x_n, y_n)\}_{n=1}^N$, where $(x_n, y_n) \sim_{\text{i.i.d.}} \mathcal{D}$ over $\mathcal{X} \times \{-1, 1\}$, $\mathcal{X} \subseteq \mathbb{R}^d$.

We will work with the margin loss:
$$\ell_\rho(u) = \begin{cases} 0 & \text{if } \rho \leq u \\ 1 - u/\rho & \text{if } u \in [0, \rho] \\ 1 & \text{if } u \leq 0 \end{cases} \quad (16)$$
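For reference, the margin loss (16) can be written in one line; the helper below is hypothetical (my own, not code from the talk) but matches the piecewise definition:

```python
# Vectorised margin (ramp) loss l_rho(u) from (16).
import numpy as np

def margin_loss(u, rho):
    """0 if u >= rho, 1 - u/rho if 0 <= u < rho, 1 if u <= 0."""
    u = np.asarray(u, dtype=float)
    return np.clip(1.0 - u / rho, 0.0, 1.0)
```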

We are interested in the case when $d$ is large and $N$ not proportionately so. Use an RP matrix $R \in M_{k \times d}$, $k < d$, with entries $R_{ij}$ drawn i.i.d. from a subgaussian distribution with parameter $1/k$.

Analogous definitions hold in the reduced $k$-dimensional space. The hypothesis class:
$$\mathcal{H}_R = \{x \mapsto h_R(Rx) = w_R^T Rx : w_R \in \mathbb{R}^k, \|w_R\|_2 = 1\} \quad (17)$$
where the parameters $w_R \in \mathbb{R}^k$ are estimated from $\mathcal{T}_R^N = \{(Rx_n, y_n)\}_{n=1}^N$ by minimising the empirical margin error:
$$\hat{h}_R = \arg\min_{h_R \in \mathcal{H}_R} \frac{1}{N}\sum_{n=1}^N \ell_\rho\!\left(y_n\, h_R(Rx_n)\right) \quad (18)$$

The quantity of interest is the generalisation error of $\hat{h}_R$ as a random function of both $\mathcal{T}^N$ and $R$:
$$E_{(x,y)\sim\mathcal{D}}\left[\hat{h}_R(Rx) \neq y\right] \quad (19)$$
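A minimal end-to-end sketch of this compressive classification setup (my own; it uses synthetic Gaussian class data and an SVM hinge-loss surrogate in place of direct minimisation of (18)):

```python
# Project the data with a subgaussian R (sigma^2 = 1/k), fit a linear
# classifier in the k-dimensional space, and report its empirical margin error.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(6)
d, k, N, rho = 1000, 50, 500, 0.05

# toy data: two Gaussian classes in R^d
y = rng.choice([-1, 1], size=N)
X = rng.normal(size=(N, d)) + 0.5 * y[:, None]

R = rng.normal(scale=1 / np.sqrt(k), size=(k, d))
XR = X @ R.T                       # compressed inputs R x_n

clf = LinearSVC().fit(XR, y)       # hinge-loss surrogate for (18)
w_R = clf.coef_.ravel()
w_R /= np.linalg.norm(w_R)         # unit-length parameter vector, as in (17)

margins = y * (XR @ w_R)
print("empirical margin error:", np.mean(np.clip(1 - margins / rho, 0, 1)))
```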

Theorem. Let $R$ be a $k \times d$, $k < d$, matrix having i.i.d. 0-mean subgaussian entries with parameter $1/k$, and let $\mathcal{T}_R^N = \{(Rx_n, y_n)\}_{n=1}^N$ be a compressed training set, where $(x_n, y_n)$ are drawn i.i.d. from some distribution $\mathcal{D}$. For any $\delta \in (0,1)$, the following holds with probability at least $1 - 3\delta$ for the empirical minimiser of the margin loss in the RP space, $\hat{h}_R$, uniformly for any margin parameter $\rho \in (0,1)$:

$$E_{(x,y)\sim\mathcal{D}}\left[\hat{h}_R(Rx) \neq y\right] \leq \min_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^N \mathbf{1}(h(x_n)y_n < \rho) + S_k\right\} + \frac{4}{\rho}\sqrt{\frac{1}{N}\left(1 + \sqrt{\frac{8\log(1/\delta)}{k}}\right)\frac{\mathrm{Tr}(XX^T)}{N}} + \sqrt{\frac{\log\log_2(2/\rho)}{N}} + 3\sqrt{\frac{\log(4/\delta)}{2N}}$$

where
$$S_k = \frac{1}{N}\sum_{n=1}^N \mathbf{1}(h(x_n)y_n \geq \rho)\,\exp\!\left(-\frac{k}{8}\left(\cos(\theta_n) - \frac{\rho}{\|x_n\|\sqrt{1 + \sqrt{\frac{8\log(1/\delta)}{k}}}}\right)^2\right) + \delta,$$

$\theta_n$ is the angle between the parameter vector of $h$ and the vector $x_n y_n$, the function $\mathbf{1}(\cdot)$ takes value 1 if its argument is true and 0 otherwise, and $X$ is the $N \times d$ matrix that holds the input points.

Illustration of the bound

Illustration of the predictive behaviour of the bound ($\delta = 0.1$ and $\rho = 0.05$) on the Advert classification data set from the UCI repository ($d = 1554$ features and $N = 3279$ points). The empirical error was estimated on holdout sets using SVM with default settings and 30 random splits of the data (in proportion 2/3 training & 1/3 testing). We standardised the data first, and scaled it so that $\max_{n \in \{1,\dots,N\}} \|x_n\| = 1$.

Conclusions & future work
- We proved new bounds on the dot product under random projection that take the same form as the optimal bounds on the Euclidean distance in the Johnson-Lindenstrauss lemma. The dot product is ubiquitous in data mining, and the use of RP for this operation is now better justified.
- We cleared up the controversy about the preservation of obtuse angles, and clarified the precise role of the angle in the relative distortion of the dot product under random projection.
- We further discussed connections with the notion of margin in generalisation theory, and our connections with sign random projections generalise earlier results.
- Our proof technique applies to any subgaussian RP matrix with i.i.d. entries. In future work it would be of interest to see whether it could be adapted to fast JL transforms, whose entries are not i.i.d.

Selected References

[Achlioptas] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.
[Balcan & Blum] M.F. Balcan, A. Blum, S. Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning, 65(1):79-94, 2006.
[Bingham & Mannila] E. Bingham and H. Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Knowledge Discovery and Data Mining (KDD), pp. 245-250, ACM Press, 2001.
[Buldygin & Kozachenko] V.V. Buldygin, Y.V. Kozachenko. Metric Characterization of Random Variables and Random Processes. American Mathematical Society, 2000.
[Dasgupta & Gupta] S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Random Structures & Algorithms, 22:60-65, 2002.
[Durrant & Kabán] R.J. Durrant, A. Kabán. Sharp generalization error bounds for randomly-projected classifiers. ICML '13, Journal of Machine Learning Research - Proceedings Track, 28(3):693-701, 2013.
[Larsen & Nelson] K.G. Larsen, J. Nelson. The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction. arXiv preprint arXiv:1411.2404, 2014.
[Li et al.] P. Li, T. Hastie, K. Church. Improving random projections using marginal information. In Proc. Conference on Learning Theory (COLT), 4005:635-649, 2006.
[Shi et al.] Q. Shi, C. Shen, R. Hill, A. Hengel. Is margin preserved after random projection? Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 591-598, 2012.