Sharp Generalization Error Bounds for Randomly-projected Classifiers
R.J. Durrant and A. Kabán
School of Computer Science, The University of Birmingham, Birmingham B15 2TT, UK
http://www.cs.bham.ac.uk/~axk
International Conference on Machine Learning (ICML 2013), Atlanta, 16-21 June 2013.
Outline
Introduction; Preliminaries
Main result: generalisation bound for a generic linear classifier trained on randomly-projected data
Main ingredient of the proof: the flipping probability
Geometric interpretation
Corollaries
Summary and future work
Introduction (1)
High-dimensional data is everywhere. Random projection (RP) is fast becoming a workhorse in high-dimensional learning: it is computationally cheap and amenable to theoretical analysis.
Little is known about its effect on the generalisation performance of a classifier, except for a few specific cases: results so far cover only specific families of classifiers, under constraints of some form on the data.
Seminal paper by (Arriaga & Vempala, 1999) on RP-perceptron: relies on the Johnson-Lindenstrauss lemma, so the bound unnaturally loosens as the sample size increases.
(Calderbank et al., 2009) on RP-SVM: uses techniques from compressed sensing; the bound only holds under an assumption of sparse data.
(Davenport et al., 2010): assumes spherical classes. (Durrant & Kabán, 2010) on RP-FLD: assumes subgaussian classes.
Introduction (2)
Our aim is to obtain a generalisation bound for a generic linear classifier trained by ERM on randomly-projected data, making no restrictive assumption other than that the original data is drawn i.i.d. from some unknown distribution $D$.
Nice idea in (Garg et al., 2002), where RP is used as an analytic tool to bound the error of classifiers trained in the original data space. The idea is to quantify the effect of RP by how it changes class label predictions for projected points relative to the predictions of the data space classifier. However, (Garg et al., 2002) derived a rather loose bound on this flipping probability, yielding a numerically trivial bound.
Our main result is an exact expression for the flipping probability; we turn the high-level approach of (Garg et al., 2002) around to obtain nontrivial bounds for RP-classifiers. The ingredients of our proof then also feed back to improve the bound of (Garg et al., 2002).
Preliminaries
2-class classification. Training set $T_N = \{(x_i, y_i)\}_{i=1}^N$; $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} D$ over $\mathbb{R}^d \times \{0,1\}$.
Let $\hat{h} \in H$ be the ERM linear classifier, so $\hat{h} \in \mathbb{R}^d$. W.l.o.g. we take it to pass through the origin, and we can take all data to lie on $S^{d-1} \subset \mathbb{R}^d$ with $\|\hat{h}\| = 1$.
For an unlabelled query point $x_q$, the label returned by $\hat{h}$ is $\hat{h}(x_q) = \mathbf{1}\{\hat{h}^T x_q > 0\}$, where $\mathbf{1}\{\cdot\}$ is the indicator function.
The generalisation error of $\hat{h}$ is defined as $E_{(x_q, y_q) \sim D}[L(\hat{h}(x_q), y_q)]$, where we use the (0,1)-loss: $L_{(0,1)}(\hat{h}(x_q), y_q) = 0$ if $\hat{h}(x_q) = y_q$, and $1$ otherwise.
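As a concrete reading of these definitions, here is a minimal sketch (illustrative only, not from the paper) of the label rule and the (0,1)-empirical risk:

```python
import numpy as np

def predict(h, X):
    # Label rule: h(x_q) = 1{h^T x_q > 0}, applied to each row of X
    return (X @ h > 0).astype(int)

def empirical_risk(h, X, y):
    # Average (0,1)-loss of the linear classifier h over the sample (X, y)
    return float(np.mean(predict(h, X) != y))

# Toy check: h separates these two points on S^1 perfectly
h = np.array([1.0, 0.0])                  # unit-norm classifier through the origin
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 0])
print(empirical_risk(h, X, y))            # 0.0
```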
Problem setting
Consider the case when $d$ is too large. There are many dimensionality reduction methods; RP is computationally cheap and non-adaptive.
Random projection: take a random matrix $R \in \mathcal{M}_{k \times d}$, $k \ll d$, with entries $r_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$, and pre-multiply the data points with it: $T_N^R = \{(Rx_i, y_i)\}_{i=1}^N$.
Side note: the $r_{ij}$ can be subgaussian in general, but in this work we exploit the rotation-invariance of the Gaussian.
We are interested in the generalisation error of the ERM linear classifier trained on $T_N^R$ rather than $T_N$.
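Randomly projecting a training set can be sketched in a few lines (illustrative; the dimensions and $\sigma = 1$ are arbitrary choices, and the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 1000, 20, 50             # ambient dimension, projected dimension, sample size

X = rng.standard_normal((N, d))    # stand-in for the training inputs x_1, ..., x_N
R = rng.normal(0.0, 1.0, (k, d))   # RP matrix: entries r_ij i.i.d. N(0, sigma^2), sigma = 1

X_R = X @ R.T                      # projected inputs Rx_i as rows; labels are unchanged
print(X_R.shape)                   # (50, 20)
```

Note that $R$ is drawn once and applied to every point; it is non-adaptive in that it never looks at the data.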
Denote the trained classifier by $\hat{h}_R \in \mathbb{R}^k$ (possibly not through the origin, but translation does not affect our proof technique). The label returned by $\hat{h}_R$ is therefore $\hat{h}_R(Rx_q) = \mathbf{1}\{\hat{h}_R^T Rx_q + b > 0\}$, where $b \in \mathbb{R}$.
We want to estimate, with high probability w.r.t. the random choice of $T_N$ and $R$:
$$E_{(x_q, y_q) \sim D}\left[L_{(0,1)}(\hat{h}_R(Rx_q), y_q)\right] = \Pr_{(x_q, y_q) \sim D}\left\{\hat{h}_R(Rx_q) \neq y_q\right\}$$
Results
Theorem [Generalisation error bound]
Let $T_N = \{(x_i, y_i) : x_i \in \mathbb{R}^d, y_i \in \{0,1\}\}_{i=1}^N$ be a training set, and let $\hat{h}$ be the linear ERM classifier estimated from $T_N$. Let $R \in \mathcal{M}_{k \times d}$, $k < d$, be a RP matrix with entries $r_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Denote by $T_N^R = \{(Rx_i, y_i)\}_{i=1}^N$ the random projection of $T_N$, and let $\hat{h}_R$ be the linear classifier estimated from $T_N^R$.
Then, for all $\delta \in (0,1]$, with probability at least $1 - 2\delta$:
$$\Pr_{x_q, y_q}\{\hat{h}_R(Rx_q) \neq y_q\} \leq \hat{E}(T_N, \hat{h}) + \frac{1}{N}\sum_{i=1}^N f_k(\theta_i) + \min\left\{\sqrt{3\log\tfrac{1}{\delta} \cdot \frac{1}{N}\sum_{i=1}^N f_k(\theta_i)},\;\; \frac{1-\delta}{\delta} \cdot \frac{1}{N}\sum_{i=1}^N f_k(\theta_i)\right\} + 2\sqrt{\frac{(k+1)\log\frac{2eN}{k+1} + \log\frac{1}{\delta}}{N}}$$
where $f_k(\theta_i) := \Pr_R\{\mathrm{sign}(\hat{h}_R^T Rx_i) \neq \mathrm{sign}(\hat{h}^T x_i)\}$ is the flipping probability for $x_i$, with $\theta_i$ the principal angle between $\hat{h}$ and $x_i$, and $\hat{E}(T_N, \hat{h})$ is the empirical risk of the data space classifier.
Proof (sketch). For a fixed instance of $R$, classical VC theory gives, for all $\delta \in (0,1)$, w.p. $1 - \delta$ over $T_N$:
$$\Pr_{x_q, y_q}\{\hat{h}_R(Rx_q) \neq y_q\} \leq \hat{E}(T_N^R, \hat{h}_R) + 2\sqrt{\frac{(k+1)\log(2eN/(k+1)) + \log(1/\delta)}{N}}$$
where $\hat{E}(T_N^R, \hat{h}_R) = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{\hat{h}_R(Rx_i) \neq y_i\}$ is the empirical error. We see RP reduces the complexity term but will increase the empirical error.
We bound the latter further: since $\hat{h}_R$ minimises the empirical risk on $T_N^R$, it does no worse than the projected data space classifier $R\hat{h}$, so
$$\hat{E}(T_N^R, \hat{h}_R) \leq \underbrace{\frac{1}{N}\sum_{i=1}^N \mathbf{1}\{\mathrm{sign}((R\hat{h})^T Rx_i) \neq \mathrm{sign}(\hat{h}^T x_i)\}}_{S} + \hat{E}(T_N, \hat{h})$$
Now, bound $S$ from $E_R[S]$ w.h.p. w.r.t. the random choice of $R$. This is a dependent sum: each term depends on the same $R$. We can use the Markov inequality, or a Chernoff bound for dependent variables (Siegel, 1995) that is numerically tighter for small $\delta$, and take the min of the two. The terms of $E_R[S]$ are what we call the flipping probabilities; we derive their exact form, which is the main technical ingredient.
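The quantity $S$ can be computed directly for any fixed $h$, sample, and draw of $R$. A sketch (with synthetic data for illustration; the points are placed at small angles to $h$, so few labels flip):

```python
import numpy as np

def flip_fraction(h, X, R):
    # S = (1/N) * sum_i 1{ sign((Rh)^T Rx_i) != sign(h^T x_i) }
    data_signs = np.sign(X @ h)
    proj_signs = np.sign((X @ R.T) @ (R @ h))
    return float(np.mean(data_signs != proj_signs))

rng = np.random.default_rng(0)
d, k, N = 50, 10, 200
h = np.zeros(d); h[0] = 1.0
X = h + 0.1 * rng.standard_normal((N, d))   # points clustered around the direction of h
R = rng.standard_normal((k, d))             # one shared draw of R, as in the proof
print(flip_fraction(h, X, R))               # small: small angles rarely flip
```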
Theorem [Flipping Probability]
Let $h, x \in \mathbb{R}^d$ be two vectors with angle $\theta \in [0, \pi/2]$ between them; without loss of generality take $\|h\| = \|x\| = 1$. Let $R \in \mathcal{M}_{k \times d}$, $k < d$, be a random projection matrix with entries $r_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$, and let $Rh, Rx \in \mathbb{R}^k$ be the images of $h, x$ under $R$, with angular separation $\theta_R$.
1. Denote by $f_k(\theta)$ the flipping probability $f_k(\theta) := \Pr\{(Rh)^T Rx < 0 \mid h^T x > 0\}$. Then:
$$f_k(\theta) = \frac{\Gamma(k)}{(\Gamma(k/2))^2} \int_0^{\psi} \frac{z^{(k-2)/2}}{(1+z)^k}\, dz \quad (1)$$
where $\psi = (1 - \cos(\theta))/(1 + \cos(\theta))$.
2. The expression above can be rewritten as the quotient of the surface area of a hyperspherical cap with an angle of $2\theta$ by the surface area of the corresponding hypersphere, namely:
$$f_k(\theta) = \frac{\int_0^{\theta} \sin^{k-1}(\phi)\, d\phi}{\int_0^{\pi} \sin^{k-1}(\phi)\, d\phi} \quad (2)$$
3. The flipping probability is monotonically decreasing as a function of $k$: fix $\theta \in [0, \pi/2]$; then $f_k(\theta) \geq f_{k+1}(\theta)$.
Proof is in the paper.
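As a numerical sanity check (illustrative code, simple trapezoidal quadrature), the two expressions can be evaluated and compared; for $k = 1$ the cap-ratio form reduces to $\theta/\pi$:

```python
import numpy as np
from math import gamma, cos, pi

def _trap(y, x):
    # plain trapezoidal rule
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def flip_prob_integral(k, theta, n=200001):
    # Eq. (1): f_k(theta) = Gamma(k)/Gamma(k/2)^2 * int_0^psi z^{(k-2)/2}/(1+z)^k dz
    psi = (1.0 - cos(theta)) / (1.0 + cos(theta))
    z = np.linspace(0.0, psi, n)[1:]   # skip z = 0, where the integrand may be singular
    return gamma(k) / gamma(k / 2) ** 2 * _trap(z ** ((k - 2) / 2) / (1 + z) ** k, z)

def flip_prob_cap(k, theta, n=200001):
    # Eq. (2): cap-to-sphere surface area ratio
    num = np.linspace(0.0, theta, n)
    den = np.linspace(0.0, pi, n)
    return _trap(np.sin(num) ** (k - 1), num) / _trap(np.sin(den) ** (k - 1), den)

print(abs(flip_prob_integral(5, 0.8) - flip_prob_cap(5, 0.8)))  # ~0: the two forms agree
print(abs(flip_prob_cap(1, 0.8) - 0.8 / pi))                    # ~0: k = 1 gives theta/pi
```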
Flip Probability - Illustration
[Figure: theoretical flip probability $f_k(\theta)$ plotted against the angle $\theta$ (radians) for $k \in \{1, 5, 10, 15, 20, 25\}$; Pr[flip] on the vertical axis.]
Flip Probability - d-invariance
[Figure: six panels of MC trials showing flip probability invariance w.r.t. $d$. Here $k = 5$, $d \in \{500, 450, \ldots, 250\}$; each panel plots Pr[flip] against $\theta$.]
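A small Monte Carlo sketch (illustrative; modest trial counts and arbitrary test angle) reproduces this $d$-invariance: the estimated flip probability at a fixed angle is essentially the same for different ambient dimensions $d$:

```python
import numpy as np

def mc_flip_prob(k, d, theta, trials=5000, seed=1):
    # Estimate Pr_R{(Rh)^T Rx < 0} for unit vectors h, x at angle theta in R^d
    rng = np.random.default_rng(seed)
    h = np.zeros(d); h[0] = 1.0
    x = np.zeros(d); x[0], x[1] = np.cos(theta), np.sin(theta)
    flips = 0
    for _ in range(trials):
        R = rng.standard_normal((k, d))   # fresh Gaussian RP each trial
        if (R @ h) @ (R @ x) < 0.0:
            flips += 1
    return flips / trials

print(mc_flip_prob(5, 100, 1.0))   # the two estimates are close to each other:
print(mc_flip_prob(5, 400, 1.0))   # the flip probability does not depend on d
```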
Geometric interpretation of Flipping Probability
The flip probability $f_k(\theta)$ recovers the known result for $k = 1$, namely $\theta/\pi$, as given in (Goemans & Williamson, 1995, Lemma 3.2). Geometrically, when $k = 1$ the flipping probability is the quotient of the length of the arc with angle $2\theta$ by the circumference of the unit circle, which is $2\pi$.
[Figure: unit circle showing the projection line, its normal, the vectors $h$ and $x$, and the flip region.]
For generic $k$, our flip probability equals the ratio of the surface area in $\mathbb{R}^{k+1}$ of a hyperspherical cap with angle $2\theta$,
$$2\pi \prod_{i=1}^{k-2} \int_0^{\pi} \sin^i(\phi)\, d\phi \cdot \int_0^{\theta} \sin^{k-1}(\phi)\, d\phi,$$
to the surface area of the unit hypersphere, $2\pi \prod_{i=1}^{k-1} \int_0^{\pi} \sin^i(\phi)\, d\phi$. Hence our result gives a natural generalisation of the result in (Goemans & Williamson, 1995).
Also, since this ratio is known to be upper bounded by $\exp(-\frac{1}{2} k \cos^2(\theta))$ (Ball, 1997), the cost of working with randomly-projected data decays exponentially with $k$.
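Numerically (an illustrative check via trapezoidal quadrature of Eq. (2)), the exact flip probability indeed sits below Ball's bound $\exp(-\frac{1}{2} k \cos^2(\theta))$ for every $k$, and shrinks quickly as $k$ grows:

```python
import numpy as np
from math import exp, cos, pi

def flip_prob(k, theta, n=200001):
    # Eq. (2): cap-to-sphere surface area ratio, via the trapezoidal rule
    t = lambda y, x: float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)
    num = np.linspace(0.0, theta, n)
    den = np.linspace(0.0, pi, n)
    return t(np.sin(num) ** (k - 1), num) / t(np.sin(den) ** (k - 1), den)

theta = 0.8
for k in (2, 5, 10, 20):
    # exact f_k(theta) vs. Ball's upper bound exp(-k cos^2(theta)/2)
    print(k, flip_prob(k, theta) <= exp(-0.5 * k * cos(theta) ** 2))  # True for each k
```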
Two straightforward Corollaries
Corollary 1 [Upper Bound on Generalisation Error for Data Separable with a Margin]
Key ingredients: $f_k(\theta) \leq \exp(-\frac{1}{2} k \cos^2(\theta))$ (Ball, 1997, Lemma 2.2), and separation with margin $m$ gives $\cos(\theta) \geq m$.
Corollary 1 [Upper Bound on Generalisation Error for Data Separable with a Margin]
If the conditions of the Theorem hold, and the data classes are also separable with a margin $m$ in the data space, then for all $\delta \in (0,1)$, with probability at least $1 - 2\delta$:
$$\Pr_{x_q, y_q}\{\hat{h}_R(Rx_q) \neq y_q\} \leq \hat{E}(T_N, \hat{h}) + \exp\left(-\frac{1}{2} k m^2\right) + \min\left\{\sqrt{3 \log\tfrac{1}{\delta}}\, \exp\left(-\frac{1}{4} k m^2\right),\;\; \frac{1-\delta}{\delta} \exp\left(-\frac{1}{2} k m^2\right)\right\} + 2\sqrt{\frac{(k+1)\log\frac{2eN}{k+1} + \log\frac{1}{\delta}}{N}}$$
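To get a feel for the magnitudes, here is an illustrative evaluation of the bound's terms beyond the empirical risk (under our reading of the corollary; all parameter values are arbitrary examples):

```python
from math import e, exp, log, sqrt

def margin_bound_excess(k, m, delta, N):
    # Terms of the margin corollary beyond the empirical risk (our reading)
    flip = exp(-0.5 * k * m * m)                               # expected flip term
    dev = min(sqrt(3.0 * log(1.0 / delta)) * exp(-0.25 * k * m * m),
              (1.0 - delta) / delta * flip)                    # Chernoff vs. Markov deviation
    vc = 2.0 * sqrt(((k + 1) * log(2 * e * N / (k + 1)) + log(1 / delta)) / N)
    return flip + dev + vc

# With a large margin the flip terms are negligible and the VC term dominates
print(margin_bound_excess(k=100, m=0.5, delta=0.05, N=10000))
```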
Corollary 2 [Upper Bound on Generalisation Error in Data Space]
Let $T_{2N} = \{(x_i, y_i)\}_{i=1}^{2N}$ be a set of $d$-dimensional labelled training examples drawn i.i.d. from some data distribution $D$, and let $\hat{h}$ be a linear classifier estimated from $T_{2N}$ by ERM. Let $k \in \{1, 2, \ldots, d\}$ be an integer and let $R \in \mathcal{M}_{k \times d}$ be a random projection matrix with entries $r_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Then for all $\delta \in (0,1]$, with probability at least $1 - 4\delta$ w.r.t. the random draws of $T_{2N}$ and $R$, the generalisation error of $\hat{h}$ w.r.t. the (0,1)-loss is bounded above by:
$$\Pr_{x_q, y_q}\{\hat{h}(x_q) \neq y_q\} \leq \hat{E}(T_{2N}, \hat{h}) + 2\min_k\left\{\frac{1}{N}\sum_{i=1}^{2N} f_k(\theta_i) + \min\left\{\sqrt{3\log\tfrac{1}{\delta} \cdot \frac{1}{N}\sum_{i=1}^{2N} f_k(\theta_i)},\;\; \frac{1-\delta}{\delta} \cdot \frac{1}{N}\sum_{i=1}^{2N} f_k(\theta_i)\right\} + \sqrt{\frac{(k+1)\log\frac{2eN}{k+1} + \log\frac{1}{\delta}}{2N}}\right\}$$
Conclusions & Future work
We derived the exact probability of label flipping as a result of Gaussian random projection, and used this to give sharp upper bounds on the generalisation error of a randomly-projected classifier.
We require neither a large margin nor data sparsity for our bounds to hold, and we make no assumption on the data distribution.
We saw that a margin implies low flipping probabilities; identifying other structures that imply low flipping probabilities is future work.
Our proof makes use of the orthogonal invariance of the standard Gaussian distribution, and so does not apply to random matrices whose entry distribution is not orthogonally invariant: extension to other random matrices is future work.
Further tightening is possible if we replace the VC complexity term in our bounds with the better guarantees of (Bartlett & Mendelson, 2002).
Selected References
Arriaga, R.I. and Vempala, S. An Algorithmic Theory of Learning: Robust Concepts and Random Projection. In 40th Annual Symposium on Foundations of Computer Science (FOCS 1999), pp. 616-623. IEEE, 1999.
Ball, K. An Elementary Introduction to Modern Convex Geometry. Flavors of Geometry, 31:1-58, 1997.
Bartlett, P.L. and Mendelson, S. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 3:463-482, 2002.
Calderbank, R., Jafarpour, S., and Schapire, R. Compressed Learning: Universal Sparse Dimensionality Reduction and Learning in the Measurement Domain. Technical Report, Rice University, 2009.
Durrant, R.J. and Kabán, A. Compressed Fisher Linear Discriminant Analysis: Classification of Randomly Projected Data. In Proc. 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2010), 2010.
Garg, A. and Roth, D. Margin Distribution and Learning Algorithms. In Proc. 20th International Conference on Machine Learning (ICML 2003), pp. 210-217, 2003.
Garg, A., Har-Peled, S., and Roth, D. On Generalization Bounds, Projection Profile, and Margin Distribution. In Proc. 19th International Conference on Machine Learning (ICML 2002), pp. 171-178, 2002.
Goemans, M.X. and Williamson, D.P. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems using Semidefinite Programming. Journal of the ACM, 42(6):1115-1145, 1995.