Improved Bounds on the Dot Product under Random Projection and Random Sign Projection


Improved Bounds on the Dot Product under Random Projection and Random Sign Projection. Ata Kabán, School of Computer Science, The University of Birmingham, Birmingham B15 2TT, UK. http://www.cs.bham.ac.uk/~axk. KDD 2015, Sydney, 10-13 August 2015.

Outline
- Introduction & motivation
- A Johnson-Lindenstrauss lemma (JLL) for the dot product without union bound
- Corollaries & connections with previous results
- Numerical validation
- Application to bounding the generalisation error of compressive linear classifiers
- Conclusions and future work

Introduction
- The dot product is a key building block in data mining: classification, regression, retrieval, correlation-clustering, etc.
- Random projection (RP) is a universal dimensionality reduction method: independent of the data, computationally cheap, and it comes with low-distortion guarantees.
- The Johnson-Lindenstrauss lemma (JLL) for Euclidean distances is optimal, but for the dot product the guarantees have been looser; some work suggested that obtuse angles may not be preserved.

Background: JLL for Euclidean distance

Theorem [Johnson-Lindenstrauss lemma]. Let $x, y \in \mathbb{R}^d$. Let $R \in M_{k \times d}$, $k < d$, be a random projection matrix with entries drawn i.i.d. from a 0-mean subgaussian distribution with parameter $\sigma^2$, and let $Rx, Ry \in \mathbb{R}^k$ be the images of $x, y$ under $R$. Then, for all $\epsilon \in (0,1)$:

$$\Pr\{\|Rx - Ry\|^2 < (1 - \epsilon)\,\|x - y\|^2 k\sigma^2\} < \exp\!\left(-\frac{k\epsilon^2}{8}\right) \quad (1)$$
$$\Pr\{\|Rx - Ry\|^2 > (1 + \epsilon)\,\|x - y\|^2 k\sigma^2\} < \exp\!\left(-\frac{k\epsilon^2}{8}\right) \quad (2)$$

An elementary constructive proof is in [Dasgupta & Gupta, 2002]. These bounds are known to be optimal [Larsen & Nelson, 2014].
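The concentration in (1)-(2) is easy to probe numerically. Below is a minimal Monte Carlo sketch (my own, not from the talk; the parameter choices are illustrative and $\sigma^2$ is set to $1/k$):

```python
# Minimal Monte Carlo check of the JLL distance bounds (1)-(2)
# for a Gaussian RP matrix with N(0, sigma^2) entries.
import numpy as np

rng = np.random.default_rng(0)
d, k, eps, trials = 300, 100, 0.5, 2000
sigma2 = 1.0 / k

x, y = rng.normal(size=d), rng.normal(size=d)
dist2 = np.sum((x - y) ** 2)

fails = 0
for _ in range(trials):
    R = rng.normal(scale=np.sqrt(sigma2), size=(k, d))
    proj2 = np.sum((R @ x - R @ y) ** 2)
    # count events that fall outside the (1 +/- eps) * ||x-y||^2 * k * sigma^2 band
    fails += not ((1 - eps) * dist2 * k * sigma2 < proj2 < (1 + eps) * dist2 * k * sigma2)

print("empirical failure rate:", fails / trials)
print("analytic bound (both tails):", 2 * np.exp(-k * eps ** 2 / 8))
```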

The quick & loose JLL for the dot product

$$(Rx)^T Ry = \frac{1}{4}\left(\|R(x + y)\|^2 - \|R(x - y)\|^2\right)$$

Now, applying the JLL to both terms separately and applying the union bound yields:

$$\Pr\{(Rx)^T Ry < x^T y\, k\sigma^2 - \epsilon k\sigma^2 \|x\|\|y\|\} < 2\exp\!\left(-\frac{k\epsilon^2}{8}\right)$$
$$\Pr\{(Rx)^T Ry > x^T y\, k\sigma^2 + \epsilon k\sigma^2 \|x\|\|y\|\} < 2\exp\!\left(-\frac{k\epsilon^2}{8}\right)$$

Or, using $(Rx)^T Ry = \frac{1}{2}\left(\|R(x + y)\|^2 - \|Rx\|^2 - \|Ry\|^2\right)$, we get factors of 3 in front of the exponential.
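The polarization identities above hold exactly for any matrix $R$; the looseness comes only from union-bounding the two (or three) JLL events. A tiny numerical sketch of both identities (my own, not from the talk):

```python
# The two polarization identities behind the "quick & loose" argument
# hold exactly for any R; only the union bound over the JLL events is loose.
import numpy as np

rng = np.random.default_rng(1)
d, k = 300, 20
x, y = rng.normal(size=d), rng.normal(size=d)
R = rng.normal(scale=1 / np.sqrt(k), size=(k, d))

lhs = (R @ x) @ (R @ y)
rhs4 = (np.sum((R @ (x + y)) ** 2) - np.sum((R @ (x - y)) ** 2)) / 4
rhs2 = (np.sum((R @ (x + y)) ** 2) - np.sum((R @ x) ** 2) - np.sum((R @ y) ** 2)) / 2
print(lhs, rhs4, rhs2)  # all three agree up to floating point error
```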

Can we improve the JLL for dot products? The problems:
- Technical issue: the union bound.
- More fundamental issue: the ratio of the standard deviation of the projected dot product to the original dot product (the "coefficient of variation") is unbounded [Li et al. 2006].
- Other issue: some previous proofs were only applicable to acute angles [Shi et al, 2012]; obtuse angles were investigated only empirically, which is inevitably based on limited numerical tests.

Results: Improved bounds for the dot product

Theorem [Dot Product under Random Projection]. Let $x, y \in \mathbb{R}^d$. Let $R \in M_{k \times d}$, $k < d$, be a random projection matrix having i.i.d. 0-mean subgaussian entries with parameter $\sigma^2$, and let $Rx, Ry \in \mathbb{R}^k$ be the images of $x, y$ under $R$. Then, for all $\epsilon \in (0,1)$:

$$\Pr\{(Rx)^T Ry < x^T y\, k\sigma^2 - \epsilon k\sigma^2 \|x\|\|y\|\} < \exp\!\left(-\frac{k\epsilon^2}{8}\right) \quad (3)$$
$$\Pr\{(Rx)^T Ry > x^T y\, k\sigma^2 + \epsilon k\sigma^2 \|x\|\|y\|\} < \exp\!\left(-\frac{k\epsilon^2}{8}\right) \quad (4)$$

The proof uses elementary techniques: a standard Chernoff bounding argument that exploits the convexity of the exponential function. The union bound is eliminated. (Details in the paper.)
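A minimal Monte Carlo sketch (my own setup, with $\sigma^2 = 1/k$ and arbitrary test vectors) comparing the empirical tails of the projected dot product with the analytic bound $\exp(-k\epsilon^2/8)$ from (3)-(4):

```python
# Empirical tail probabilities of (Rx)^T(Ry) versus the bound exp(-k eps^2 / 8).
import numpy as np

rng = np.random.default_rng(2)
d, k, eps, trials = 300, 100, 0.5, 2000
sigma2 = 1.0 / k

x, y = rng.normal(size=d), rng.normal(size=d)
dot, norms = x @ y, np.linalg.norm(x) * np.linalg.norm(y)

low = high = 0
for _ in range(trials):
    R = rng.normal(scale=np.sqrt(sigma2), size=(k, d))
    pdot = (R @ x) @ (R @ y)
    low += pdot < dot * k * sigma2 - eps * k * sigma2 * norms
    high += pdot > dot * k * sigma2 + eps * k * sigma2 * norms

print("lower tail:", low / trials, " upper tail:", high / trials)
print("bound per tail:", np.exp(-k * eps ** 2 / 8))
```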

Corollaries (1): Clarifying the role of the angle

Corollary [Relative distortion bounds]. Denote by $\theta$ the angle between the vectors $x, y \in \mathbb{R}^d$. Then we have the following:

1. Relative distortion bound: Assume $x^T y \neq 0$. Then,
$$\Pr\left\{\left|\frac{x^T R^T Ry}{x^T y} - k\sigma^2\right| > \epsilon\right\} < 2\exp\!\left(-\frac{k\epsilon^2\cos^2(\theta)}{8(k\sigma^2)^2}\right) \quad (5)$$

2. Multiplicative form of the relative distortion bound:
$$\Pr\{x^T R^T Ry < x^T y\,(1 - \epsilon)\, k\sigma^2\} < \exp\!\left(-\frac{k}{8}\epsilon^2\cos^2(\theta)\right) \quad (6)$$
$$\Pr\{x^T R^T Ry > x^T y\,(1 + \epsilon)\, k\sigma^2\} < \exp\!\left(-\frac{k}{8}\epsilon^2\cos^2(\theta)\right) \quad (7)$$

Observations from the Corollary

The guarantees are the same for both obtuse and acute angles! Symmetric around orthogonal angles.

Relation to the coefficient of variation [Li et al.]:
$$\frac{\sqrt{\mathrm{Var}(x^T R^T Ry)}}{x^T y} \quad \text{(unbounded)} \quad (8)$$
Computing this (case of Gaussian $R$),
$$\frac{\sqrt{\mathrm{Var}(x^T R^T Ry)}}{x^T y} = \frac{1}{\sqrt{k}}\sqrt{1 + \frac{1}{\cos^2(\theta)}} \quad (9)$$
we see that an unbounded coefficient of variation occurs only when $x$ and $y$ are perpendicular. Again, symmetric around orthogonal angles.
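The closed form in (9) is easy to verify by simulation. A sketch (my own check, assuming Gaussian $R$ with $\sigma^2 = 1/k$ and unit-length vectors at a chosen angle):

```python
# Monte Carlo estimate of the coefficient of variation of x^T R^T R y
# versus the closed form (1/sqrt(k)) * sqrt(1 + 1/cos^2(theta)).
import numpy as np

rng = np.random.default_rng(3)
d, k, trials = 300, 50, 10000
theta = 2.0  # an obtuse angle, in radians

x = np.zeros(d); x[0] = 1.0
y = np.zeros(d); y[0], y[1] = np.cos(theta), np.sin(theta)  # unit vectors at angle theta

vals = np.empty(trials)
for t in range(trials):
    R = rng.normal(scale=1 / np.sqrt(k), size=(k, d))
    vals[t] = (R @ x) @ (R @ y)

emp_cov = vals.std() / abs(vals.mean())          # |std / mean|
analytic = np.sqrt(1 + 1 / np.cos(theta) ** 2) / np.sqrt(k)
print("empirical:", emp_cov, " analytic:", analytic)
```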

Corollaries (2)

Corollary [Margin type bounds and random sign projection]. Denote by $\theta$ the angle between the vectors $x, y \in \mathbb{R}^d$. Then,

1. Margin bound: Assume $x^T y \geq 0$. Then, for all $\rho$ s.t. $\rho < x^T y\, k\sigma^2$ and $\rho > (\cos(\theta) - 1)\,\|x\|\|y\|\, k\sigma^2$,
$$\Pr\{x^T R^T Ry < \rho\} < \exp\!\left(-\frac{k}{8}\left(\cos(\theta) - \frac{\rho}{\|x\|\|y\|\, k\sigma^2}\right)^2\right) \quad (10)$$
and for all $\rho$ s.t. $\rho > x^T y\, k\sigma^2$ and $\rho < (\cos(\theta) + 1)\,\|x\|\|y\|\, k\sigma^2$,
$$\Pr\{x^T R^T Ry > \rho\} < \exp\!\left(-\frac{k}{8}\left(\frac{\rho}{\|x\|\|y\|\, k\sigma^2} - \cos(\theta)\right)^2\right) \quad (11)$$

2. Dot product under random sign projection: Assume $x^T y \neq 0$. Then,
$$\Pr\left\{\frac{x^T R^T Ry}{x^T y} < 0\right\} < \exp\!\left(-\frac{k}{8}\cos^2(\theta)\right) \quad (12)$$

These forms of the bound, with $\rho > 0$, are useful for instance to bound the margin loss of compressive classifiers. Details to follow shortly. The random sign projection bound was used before to bound the error of compressive classifiers under the 0-1 loss [Durrant & Kabán, ICML 13] in the case of Gaussian RP; here subgaussian RP is allowed.
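A small sketch (my own; it uses $\pm 1/\sqrt{k}$ sign entries as the subgaussian RP and unit vectors at a chosen obtuse angle) comparing the empirical sign-flipping rate with the bound $\exp(-k\cos^2(\theta)/8)$ in (12):

```python
# Empirical probability that random (sign) projection flips the sign of the
# dot product, compared with the bound exp(-k cos^2(theta) / 8).
import numpy as np

rng = np.random.default_rng(4)
d, k, trials = 300, 40, 2000
theta = 2.4  # obtuse angle; the bound depends only on cos^2(theta)

x = np.zeros(d); x[0] = 1.0
y = np.zeros(d); y[0], y[1] = np.cos(theta), np.sin(theta)

flips = 0
for _ in range(trials):
    R = rng.choice([-1.0, 1.0], size=(k, d)) / np.sqrt(k)  # subgaussian sign entries
    flips += ((R @ x) @ (R @ y)) / (x @ y) < 0

print("empirical flip rate:", flips / trials)
print("analytic bound:", np.exp(-k * np.cos(theta) ** 2 / 8))
```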

Numerical validation

We will compute empirical estimates of the following probabilities from 2000 independently drawn instances of the RP. The target dimension varies from 1 to the original dimension $d = 300$.

Rejection probability for dot product preservation = the probability that the relative distortion of the dot product after RP falls outside the allowed error tolerance $\epsilon$:
$$1 - \Pr\left\{(1 - \epsilon) < \frac{(Rx)^T Ry}{x^T y} < (1 + \epsilon)\right\} \quad (13)$$

The sign flipping probability:
$$\Pr\left\{\frac{(Rx)^T Ry}{x^T y} < 0\right\} \quad (14)$$
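A sketch of this validation protocol (assumed details: unit-length vectors at one fixed angle, Gaussian $R$ with $\sigma^2 = 1/k$, and a coarse grid of target dimensions rather than every $k$ from 1 to $d$):

```python
# Empirical rejection probability (13) and sign-flipping probability (14),
# estimated over 2000 RP draws for several target dimensions k.
import numpy as np

rng = np.random.default_rng(5)
d, trials, eps, theta = 300, 2000, 0.3, 2.0

x = np.zeros(d); x[0] = 1.0
y = np.zeros(d); y[0], y[1] = np.cos(theta), np.sin(theta)
dot = x @ y

for k in (1, 10, 50, 150, 300):
    reject = flip = 0
    for _ in range(trials):
        R = rng.normal(scale=1 / np.sqrt(k), size=(k, d))
        ratio = ((R @ x) @ (R @ y)) / dot
        reject += not (1 - eps < ratio < 1 + eps)
        flip += ratio < 0
    print(f"k={k:4d}  reject={reject / trials:.3f}  flip={flip / trials:.3f}")
```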

Replicating the results in [Shi et al, ICML 12]. Left: two acute angles; Right: two obtuse angles. Preservation of these obtuse angles does indeed look worse... but not because they are obtuse (see next slide!).

Now take angles symmetrical around $\pi/2$ and observe the opposite behaviour. This is why the previous result in [Shi et al, ICML 12] has been misleading. Left: two acute angles; Right: two obtuse angles.

Numerical validation: the full picture. Left: empirical estimates of the rejection probability for dot product preservation; Right: our analytic upper bound. The error tolerance was set to $\epsilon = 0.3$. Darker means higher probability.

The same with $\epsilon = 0.1$. The bound matches the true behaviour: all of these probabilities are symmetric around the angles $\pi/2$ and $3\pi/2$ (i.e. orthogonal vectors before RP). Thus, the preservation of the dot product is symmetrically identical for both acute and obtuse angles.

Empirical estimates of the sign flipping probability vs. our analytic upper bound. Darker means higher probability.

An application in machine learning: Margin bound for compressive linear classification

Consider the hypothesis class of linear classifiers defined by a unit length parameter vector:
$$\mathcal{H} = \{x \mapsto h(x) = w^T x : w \in \mathbb{R}^d, \|w\|_2 = 1\} \quad (15)$$

The parameters $w$ are estimated from a training set of size $N$: $\mathcal{T}^N = \{(x_n, y_n)\}_{n=1}^N$, where $(x_n, y_n) \sim_{\text{i.i.d.}} \mathcal{D}$ over $\mathcal{X} \times \{-1, 1\}$, $\mathcal{X} \subseteq \mathbb{R}^d$.

We will work with the margin loss:
$$\ell_\rho(u) = \begin{cases} 0 & \text{if } \rho \leq u \\ 1 - u/\rho & \text{if } u \in [0, \rho] \\ 1 & \text{if } u \leq 0 \end{cases} \quad (16)$$
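For reference, the margin loss (16) can be written in one line; the helper below is hypothetical (my own, not code from the talk) but matches the piecewise definition:

```python
# Vectorised margin (ramp) loss l_rho(u) from (16).
import numpy as np

def margin_loss(u, rho):
    """0 if u >= rho, 1 - u/rho if 0 <= u < rho, 1 if u <= 0."""
    u = np.asarray(u, dtype=float)
    return np.clip(1.0 - u / rho, 0.0, 1.0)
```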

We are interested in the case when $d$ is large and $N$ not proportionately so. Use an RP matrix $R \in M_{k \times d}$, $k < d$, with entries $R_{ij}$ drawn i.i.d. from a subgaussian distribution with parameter $1/k$.

Analogous definitions hold in the reduced $k$-dimensional space. The hypothesis class:
$$\mathcal{H}_R = \{x \mapsto h_R(Rx) = w_R^T Rx : w_R \in \mathbb{R}^k, \|w_R\|_2 = 1\} \quad (17)$$
where the parameters $w_R \in \mathbb{R}^k$ are estimated from $\mathcal{T}_R^N = \{(Rx_n, y_n)\}_{n=1}^N$ by minimising the empirical margin error:
$$\hat{h}_R = \arg\min_{h_R \in \mathcal{H}_R} \frac{1}{N}\sum_{n=1}^N \ell_\rho\!\left(y_n\, h_R(Rx_n)\right) \quad (18)$$

The quantity of interest is the generalisation error of $\hat{h}_R$ as a random function of both $\mathcal{T}^N$ and $R$:
$$E_{(x,y)\sim\mathcal{D}}\left[\hat{h}_R(Rx) \neq y\right] \quad (19)$$
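A minimal end-to-end sketch of this compressive classification setup (my own; it uses synthetic Gaussian class data and an SVM hinge-loss surrogate in place of direct minimisation of (18)):

```python
# Project the data with a subgaussian R (sigma^2 = 1/k), fit a linear
# classifier in the k-dimensional space, and report its empirical margin error.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(6)
d, k, N, rho = 1000, 50, 500, 0.05

# toy data: two Gaussian classes in R^d
y = rng.choice([-1, 1], size=N)
X = rng.normal(size=(N, d)) + 0.5 * y[:, None]

R = rng.normal(scale=1 / np.sqrt(k), size=(k, d))
XR = X @ R.T                       # compressed inputs R x_n

clf = LinearSVC().fit(XR, y)       # hinge-loss surrogate for (18)
w_R = clf.coef_.ravel()
w_R /= np.linalg.norm(w_R)         # unit-length parameter vector, as in (17)

margins = y * (XR @ w_R)
print("empirical margin error:", np.mean(np.clip(1 - margins / rho, 0, 1)))
```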

Theorem. Let $R$ be a $k \times d$, $k < d$, matrix having i.i.d. 0-mean subgaussian entries with parameter $1/k$, and let $\mathcal{T}_R^N = \{(Rx_n, y_n)\}_{n=1}^N$ be a compressed training set, where $(x_n, y_n)$ are drawn i.i.d. from some distribution $\mathcal{D}$. For any $\delta \in (0,1)$, the following holds with probability at least $1 - 3\delta$ for the empirical minimiser of the margin loss in the RP space, $\hat{h}_R$, uniformly for any margin parameter $\rho \in (0,1)$:

$$E_{(x,y)\sim\mathcal{D}}\left[\hat{h}_R(Rx) \neq y\right] \leq \min_{h \in \mathcal{H}}\left\{\frac{1}{N}\sum_{n=1}^N \mathbf{1}(h(x_n)y_n < \rho) + S_k\right\} + \frac{4}{\rho}\sqrt{\frac{1}{N}\left(1 + \sqrt{\frac{8\log(1/\delta)}{k}}\right)\frac{\mathrm{Tr}(XX^T)}{N}} + \sqrt{\frac{\log\log_2(2/\rho)}{N}} + 3\sqrt{\frac{\log(4/\delta)}{2N}}$$

where
$$S_k = \frac{1}{N}\sum_{n=1}^N \mathbf{1}(h(x_n)y_n \geq \rho)\,\exp\!\left(-\frac{k}{8}\left(\cos(\theta_n) - \frac{\rho}{\|x_n\|\sqrt{1 + \sqrt{\frac{8\log(1/\delta)}{k}}}}\right)^2\right) + \delta,$$

$\theta_n$ is the angle between the parameter vector of $h$ and the vector $x_n y_n$, the function $\mathbf{1}(\cdot)$ takes value 1 if its argument is true and 0 otherwise, and $X$ is the $N \times d$ matrix that holds the input points.

Illustration of the bound

Illustration of the predictive behaviour of the bound ($\delta = 0.1$ and $\rho = 0.05$) on the Advert classification data set from the UCI repository ($d = 1554$ features and $N = 3279$ points). The empirical error was estimated on holdout sets using SVM with default settings and 30 random splits of the data (in proportion 2/3 training & 1/3 testing). We standardised the data first, and scaled it so that $\max_{n \in \{1,\dots,N\}} \|x_n\| = 1$.

Conclusions & future work
- We proved new bounds on the dot product under random projection that take the same form as the optimal bounds on the Euclidean distance in the Johnson-Lindenstrauss lemma. The dot product is ubiquitous in data mining, and the use of RP for this operation is now better justified.
- We cleared up the controversy about the preservation of obtuse angles, and clarified the precise role of the angle in the relative distortion of the dot product under random projection.
- We further discussed connections with the notion of margin in generalisation theory, and our connections with sign random projections generalise earlier results.
- Our proof technique applies to any subgaussian RP matrix with i.i.d. entries. In future work it would be of interest to see whether it could be adapted to fast JL transforms, whose entries are not i.i.d.

Selected References

[Achlioptas] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.
[Balcan & Blum] M.F. Balcan, A. Blum, S. Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning, 65(1):79-94, 2006.
[Bingham & Mannila] E. Bingham and H. Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Knowledge Discovery and Data Mining (KDD), pp. 245-250, ACM Press, 2001.
[Buldygin & Kozachenko] V.V. Buldygin, Y.V. Kozachenko. Metric Characterization of Random Variables and Random Processes. American Mathematical Society, 2000.
[Dasgupta & Gupta] S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Random Structures & Algorithms, 22:60-65, 2002.
[Durrant & Kabán] R.J. Durrant, A. Kabán. Sharp generalization error bounds for randomly-projected classifiers. ICML '13, Journal of Machine Learning Research - Proceedings Track, 28(3):693-701, 2013.
[Larsen & Nelson] K.G. Larsen, J. Nelson. The Johnson-Lindenstrauss lemma is optimal for linear dimensionality reduction. arXiv preprint arXiv:1411.2404, 2014.
[Li et al.] P. Li, T. Hastie, K. Church. Improving random projections using marginal information. In Proc. Conference on Learning Theory (COLT), 4005:635-649, 2006.
[Shi et al.] Q. Shi, C. Shen, R. Hill, A. Hengel. Is margin preserved after random projection? Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 591-598, 2012.