Sharp Generalization Error Bounds for Randomly-projected Classifiers


Sharp Generalization Error Bounds for Randomly-projected Classifiers
R.J. Durrant and A. Kabán, School of Computer Science, The University of Birmingham, Birmingham B15 2TT, UK (http://www.cs.bham.ac.uk/~axk)
International Conference on Machine Learning (ICML 2013), Atlanta, 16-21 June 2013.

Outline: Introduction; Preliminaries; Main result - a generalisation bound for a generic linear classifier trained on randomly-projected data; Main ingredient of the proof - the flipping probability; Geometric interpretation; Corollaries; Summary and future work.

Introduction (1). High-dimensional data is everywhere, and random projection (RP) is fast becoming a workhorse in high-dimensional learning: it is computationally cheap and amenable to theoretical analysis. Yet little is known about its effect on the generalisation performance of a classifier, except in a few specific cases: results so far cover only specific families of classifiers, and hold only under constraints of some form on the data. The seminal paper of (Arriaga & Vempala, 1999) on the RP-perceptron relies on the Johnson-Lindenstrauss lemma, so its bound unnaturally loosens as the sample size increases. (Calderbank et al., 2009) analyse RP-SVM using techniques from compressed sensing, but their bound only holds under an assumption of sparse data. (Davenport et al., 2010) assume spherical classes; (Durrant & Kabán, 2011) analyse RP-FLD and assume subgaussian classes.

Introduction (2). Our aim is to obtain a generalisation bound for a generic linear classifier trained by ERM on randomly-projected data, making no restrictive assumptions other than that the original data is drawn i.i.d. from some unknown distribution D. A nice idea appears in (Garg et al., 2002), where RP is used as an analytic tool to bound the error of classifiers trained in the original data space: the effect of RP is quantified by how it changes class-label predictions for projected points relative to the predictions of the data-space classifier. However, (Garg et al., 2002) derived a rather loose bound on this flipping probability, which yields a numerically trivial bound. Our main result is the exact expression for the flipping probability; with it we turn the high-level approach of (Garg et al., 2002) around to obtain non-trivial bounds for RP-classifiers. The ingredients of our proof then also feed back to improve the bound of (Garg et al., 2002).

Preliminaries: 2-class classification. Training set T_N = {(x_i, y_i)}_{i=1}^N, with (x_i, y_i) drawn i.i.d. from D over R^d × {0,1}. Let ĥ ∈ H be the ERM linear classifier, so ĥ ∈ R^d; w.l.o.g. we take it to pass through the origin, and we may take all data to lie on S^{d-1} ⊂ R^d and ‖ĥ‖ = 1. For an unlabelled query point x_q, the label returned by ĥ is ĥ(x_q) = 1{ĥ^T x_q > 0}, where 1{·} is the indicator function. The generalisation error of ĥ is defined as E_{(x_q,y_q)~D}[L(ĥ(x_q), y_q)], and we use the (0,1)-loss: L_(0,1)(ĥ(x_q), y_q) = 0 if ĥ(x_q) = y_q, and 1 otherwise.
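To make the decision rule and loss above concrete, here is a minimal sketch in Python; this is our own illustration, and the names predict and zero_one_loss are not from the paper:

```python
import numpy as np

def predict(h_hat, x_q):
    """Decision rule from the slides: label 1 iff h_hat^T x_q > 0 (hyperplane through the origin)."""
    return int(h_hat @ x_q > 0)

def zero_one_loss(y_pred, y_true):
    """(0,1)-loss: 0 if the predicted label matches the true label, 1 otherwise."""
    return int(y_pred != y_true)
```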

Problem setting. Consider the case when d is too large. There are many dimensionality-reduction methods; RP is computationally cheap and non-adaptive. Random projection: take a random matrix R ∈ M_{k×d}, k ≪ d, with entries r_ij i.i.d. N(0, σ²), and pre-multiply the data points with it: T_N^R = {(Rx_i, y_i)}_{i=1}^N. Side note: the r_ij can be subgaussian in general, but in this work we exploit the rotation invariance of the Gaussian. We are interested in the generalisation error of the ERM linear classifier trained on T_N^R rather than on T_N.
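A minimal sketch of the projection step just described, assuming NumPy; the helper name random_projection and the toy sizes are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, k, sigma=1.0, rng=rng):
    """Project the d-dimensional rows of X down to k dimensions with a Gaussian RP matrix.
    R has i.i.d. N(0, sigma^2) entries, as in the slides; X has shape (N, d)."""
    d = X.shape[1]
    R = rng.normal(0.0, sigma, size=(k, d))   # R in M_{k x d}
    return X @ R.T, R                         # rows of the first output are (R x_i)^T

# Toy usage: N = 100 points on S^{d-1}, projected from d = 500 down to k = 5.
d, k, N = 500, 5, 100
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # place the data on the unit sphere
X_R, R = random_projection(X, k)
```

The same R is then applied to every query point at test time, so only the k×d matrix R needs to be stored alongside the projected classifier.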

Denote the trained classifier by ĥ_R ∈ R^k (possibly not through the origin, but translation does not affect our proof technique). The label returned by ĥ_R is therefore ĥ_R(Rx_q) = 1{ĥ_R^T Rx_q + b > 0}, where b ∈ R. We want to estimate E_{(x_q,y_q)~D}[L_(0,1)(ĥ_R(Rx_q), y_q)] = Pr_{(x_q,y_q)~D}{ĥ_R(Rx_q) ≠ y_q} with high probability w.r.t. the random choice of T_N and R.

Results

Theorem [Generalization error bound]. Let T_N = {(x_i, y_i) : x_i ∈ R^d, y_i ∈ {0,1}}_{i=1}^N be a training set, and let ĥ be the linear ERM classifier estimated from T_N. Let R ∈ M_{k×d}, k < d, be a RP matrix with entries r_ij i.i.d. N(0, σ²). Denote by T_N^R = {(Rx_i, y_i)}_{i=1}^N the random projection of T_N, and let ĥ_R be the linear classifier estimated from T_N^R. Then, for all δ ∈ (0,1], with probability at least 1 − 2δ:

$$\Pr_{x_q,y_q}\{\hat{h}_R(Rx_q) \neq y_q\} \;\leq\; \hat{E}(T_N,\hat{h}) \;+\; \frac{1}{N}\sum_{i=1}^N f_k(\theta_i) \;+\; \min\left\{ \sqrt{\frac{3\log\frac{1}{\delta}}{N}\cdot\frac{1}{N}\sum_{i=1}^N f_k(\theta_i)},\ \frac{1-\delta}{\delta}\cdot\frac{1}{N}\sum_{i=1}^N f_k(\theta_i) \right\} \;+\; 2\sqrt{\frac{(k+1)\log\frac{2eN}{k+1} + \log\frac{1}{\delta}}{N}}$$

where f_k(θ_i) := Pr_R{sign((Rĥ)^T Rx_i) ≠ sign(ĥ^T x_i)} is the flipping probability for x_i, θ_i is the principal angle between ĥ and x_i, and Ê(T_N, ĥ) is the empirical risk of the data-space classifier.
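The following sketch simply evaluates the right-hand side of the bound as reconstructed above, given the empirical risk and the flipping probabilities f_k(θ_i) (which can be computed as in the sketch after the Flipping Probability theorem below). The function name and argument layout are ours, and the formula follows our reconstruction of the slide, so treat it as illustrative:

```python
import numpy as np

def rp_generalisation_bound(emp_err, flip_probs, k, delta):
    """Right-hand side of the reconstructed bound; holds with probability at least 1 - 2*delta.

    emp_err    -- empirical risk of the data-space classifier, E_hat(T_N, h_hat)
    flip_probs -- flipping probabilities f_k(theta_i), one per training point
    """
    flip_probs = np.asarray(flip_probs, dtype=float)
    N = flip_probs.size
    mean_flip = flip_probs.mean()                                    # (1/N) sum_i f_k(theta_i)
    chernoff = np.sqrt(3.0 * np.log(1.0 / delta) / N * mean_flip)    # Chernoff-type tail term
    markov = (1.0 - delta) / delta * mean_flip                       # Markov-inequality tail term
    vc = 2.0 * np.sqrt(((k + 1) * np.log(2.0 * np.e * N / (k + 1)) + np.log(1.0 / delta)) / N)
    return emp_err + mean_flip + min(chernoff, markov) + vc
```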

Proof (sketch). For a fixed instance of R, classical VC theory gives, for all δ ∈ (0,1), with probability 1 − δ over T_N:

$$\Pr_{x_q,y_q}\{\hat{h}_R(Rx_q)\neq y_q\} \;\leq\; \hat{E}(T_N^R,\hat{h}_R) + 2\sqrt{\frac{(k+1)\log(2eN/(k+1)) + \log(1/\delta)}{N}}$$

where Ê(T_N^R, ĥ_R) = (1/N) Σ_{i=1}^N 1{ĥ_R(Rx_i) ≠ y_i} is the empirical error. We see that RP reduces the complexity term but will increase the empirical error. We bound the latter further: since ĥ_R is the ERM classifier on T_N^R, its empirical error is at most that of Rĥ, the projection of the data-space classifier, so

$$\hat{E}(T_N^R,\hat{h}_R) \;\leq\; \underbrace{\frac{1}{N}\sum_{i=1}^N \mathbf{1}\{\operatorname{sign}((R\hat{h})^T Rx_i) \neq \operatorname{sign}(\hat{h}^T x_i)\}}_{S} \;+\; \hat{E}(T_N,\hat{h})$$

Now we bound S from E_R[S] with high probability w.r.t. the random choice of R. This is a dependent sum: each term depends on the same R. We can use the Markov inequality, or a Chernoff bound for dependent variables (Siegel, 1995), which is numerically tighter for small δ, and take the minimum of the two. The terms of E_R[S] are what we call the flipping probabilities; we derive their exact form, which is the main technical ingredient.
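A short sketch of the quantity S from the proof, for one fixed draw of R: it counts the fraction of training points whose predicted sign flips when both ĥ and the points are projected. Averaging this over many draws of R approximates E_R[S] = (1/N) Σ_i f_k(θ_i). The function name is ours:

```python
import numpy as np

def empirical_flip_fraction(h_hat, X, R):
    """S for one draw of R: fraction of rows of X whose decision sign changes under projection."""
    before = np.sign(X @ h_hat)                 # sign(h_hat^T x_i) in the data space
    after = np.sign((X @ R.T) @ (R @ h_hat))    # sign((R h_hat)^T R x_i) in the projected space
    return float(np.mean(before != after))
```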

Theorem [Flipping Probability]. Let h, x ∈ R^d be two vectors with angle θ ∈ [0, π/2] between them; without loss of generality take ‖h‖ = ‖x‖ = 1. Let R ∈ M_{k×d}, k < d, be a random projection matrix with entries r_ij i.i.d. N(0, σ²), and let Rh, Rx ∈ R^k be the images of h, x under R, with angular separation θ_R.

1. Denote by f_k(θ) the flipping probability f_k(θ) := Pr{(Rh)^T Rx < 0 | h^T x > 0}. Then:

$$f_k(\theta) = \frac{\Gamma(k)}{(\Gamma(k/2))^2}\int_0^{\psi}\frac{z^{(k-2)/2}}{(1+z)^k}\,dz \qquad (1)$$

where ψ = (1 − cos(θ))/(1 + cos(θ)).

2. The expression above can be rewritten as the quotient of the surface area of a hyperspherical cap with angle 2θ by the surface area of the corresponding hypersphere, namely:

$$f_k(\theta) = \frac{\int_0^{\theta}\sin^{k-1}(\phi)\,d\phi}{\int_0^{\pi}\sin^{k-1}(\phi)\,d\phi} \qquad (2)$$

3. The flipping probability is monotonically decreasing as a function of k: fix θ ∈ [0, π/2]; then f_k(θ) ≥ f_{k+1}(θ).

The proof is in the paper.
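To make equation (2) concrete, here is a sketch (ours) that evaluates f_k(θ) numerically and cross-checks it against a Monte Carlo estimate over Gaussian projections; the function names, the SciPy dependency, and the toy parameters are our assumptions, not part of the paper:

```python
import numpy as np
from scipy.integrate import quad

def flip_prob(theta, k):
    """f_k(theta) via equation (2): ratio of the cap integral to the full-sphere integral."""
    num, _ = quad(lambda phi: np.sin(phi) ** (k - 1), 0.0, theta)
    den, _ = quad(lambda phi: np.sin(phi) ** (k - 1), 0.0, np.pi)
    return num / den

def flip_prob_mc(theta, k, d=100, trials=10_000, seed=0):
    """Monte Carlo check: draw Gaussian R and count how often the sign of h^T x flips."""
    rng = np.random.default_rng(seed)
    h = np.zeros(d); h[0] = 1.0
    x = np.zeros(d); x[0], x[1] = np.cos(theta), np.sin(theta)   # unit vector at angle theta to h
    flips = 0
    for _ in range(trials):
        R = rng.normal(size=(k, d))     # sigma is irrelevant here: it cancels in the sign
        flips += (R @ h) @ (R @ x) < 0.0
    return flips / trials

theta = 1.0  # radians, in [0, pi/2]
for k in (1, 5, 25):
    print(k, round(flip_prob(theta, k), 4), flip_prob_mc(theta, k))
# For k = 1, flip_prob(theta, 1) equals theta / pi, matching (Goemans & Williamson, 1995),
# and the estimate is insensitive to d, illustrating the d-invariance shown in the next slide.
```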

Flip Probability - Illustration. [Figure: the theoretical flip probability Pr[flip] plotted against the angle θ (radians) for k ∈ {1, 5, 10, 15, 20, 25}.]

Flip Probability - d-invariance. [Figure: Monte Carlo trials showing the flip probability is invariant w.r.t. d; here k = 5 and d ∈ {500, 450, ..., 250}.]

Geometric interpretation of the flipping probability. The flip probability f_k(θ) recovers the known result for k = 1, namely θ/π, as given in (Goemans & Williamson, 1995, Lemma 3.2). Geometrically, when k = 1 the flipping probability is the quotient of the length of the arc with angle 2θ by the circumference 2π of the unit circle. [Figure: projection line, its normal, vectors h and x, and the flip region.]

For generic k, our flip probability equals the ratio of the surface area in R^{k+1} of a hyperspherical cap with angle 2θ,

$$2\pi\left(\prod_{i=1}^{k-2}\int_0^{\pi}\sin^i(\phi)\,d\phi\right)\int_0^{\theta}\sin^{k-1}(\phi)\,d\phi,$$

to the surface area of the unit hypersphere, $2\pi\prod_{i=1}^{k-1}\int_0^{\pi}\sin^i(\phi)\,d\phi$. Hence our result gives a natural generalisation of the result in (Goemans & Williamson, 1995). Moreover, since this ratio is known to be upper-bounded by exp(−(1/2) k cos²(θ)) (Ball, 1997), the cost of working with randomly-projected data decays exponentially with k.

Two straightforward corollaries.

Corollary 1 [Upper Bound on Generalisation Error for Data Separable with a Margin]. The ingredients are f_k(θ) ≤ exp(−(1/2) k cos²(θ)) (Ball, 1997, Lemma 2.2) and, at the margin, cos(θ) = m. If the conditions of the Theorem hold and the data classes are also separable with a margin m in the data space, then for all δ ∈ (0,1), with probability at least 1 − 2δ:

$$\Pr_{x_q,y_q}\{\hat{h}_R(Rx_q)\neq y_q\} \;\leq\; \hat{E}(T_N,\hat{h}) \;+\; \exp\left(-\tfrac{1}{2}km^2\right) \;+\; \min\left\{ \sqrt{\tfrac{3}{N}\log\tfrac{1}{\delta}}\,\exp\left(-\tfrac{1}{4}km^2\right),\ \tfrac{1-\delta}{\delta}\exp\left(-\tfrac{1}{2}km^2\right) \right\} \;+\; 2\sqrt{\frac{(k+1)\log\frac{2eN}{k+1}+\log\frac{1}{\delta}}{N}}$$
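A direct numerical transcription (ours) of the margin bound as reconstructed above; the function name and the example numbers are illustrative only:

```python
import numpy as np

def margin_bound(emp_err, N, k, m, delta):
    """Right-hand side of Corollary 1 as reconstructed above; holds w.p. at least 1 - 2*delta."""
    flip = np.exp(-0.5 * k * m**2)           # upper bound on f_k(theta) at margin m (Ball, 1997)
    chernoff = np.sqrt(3.0 * np.log(1.0 / delta) / N) * np.exp(-0.25 * k * m**2)
    markov = (1.0 - delta) / delta * flip
    vc = 2.0 * np.sqrt(((k + 1) * np.log(2.0 * np.e * N / (k + 1)) + np.log(1.0 / delta)) / N)
    return emp_err + flip + min(chernoff, markov) + vc

# Hypothetical example: 10,000 training points, projection dimension k = 25, margin m = 0.5.
print(margin_bound(emp_err=0.05, N=10_000, k=25, m=0.5, delta=0.05))
```

Note how the flipping-related terms decay exponentially in k·m², while the VC term grows only roughly like sqrt(k log N / N): this is the trade-off between empirical error and complexity noted in the proof sketch.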

Corollary 2 [Upper Bound on Generalisation Error in Data Space]. Let T_2N = {(x_i, y_i)}_{i=1}^{2N} be a set of d-dimensional labelled training examples drawn i.i.d. from some data distribution D, and let ĥ be a linear classifier estimated from T_2N by ERM. Let k ∈ {1, 2, ..., d} be an integer and let R ∈ M_{k×d} be a random projection matrix with entries r_ij i.i.d. N(0, σ²). Then for all δ ∈ (0,1], with probability at least 1 − 4δ w.r.t. the random draws of T_2N and R, the generalisation error of ĥ w.r.t. the (0,1)-loss is bounded above by:

$$\Pr_{x_q,y_q}\{\hat{h}(x_q) \neq y_q\} \;\leq\; \hat{E}(T_{2N},\hat{h}) \;+\; 2\min_k\left\{ \frac{1}{N}\sum_{i=1}^{2N} f_k(\theta_i) + \min\left\{ \sqrt{\frac{3\log\frac{1}{\delta}}{N}\cdot\frac{1}{N}\sum_{i=1}^{2N} f_k(\theta_i)},\ \frac{1-\delta}{\delta}\cdot\frac{1}{N}\sum_{i=1}^{2N} f_k(\theta_i) \right\} + \sqrt{\frac{(k+1)\log\frac{2eN}{k+1}+\log\frac{1}{\delta}}{2N}} \right\}$$

Conclusions & future work. We derived the exact probability of label flipping under Gaussian random projection and used it to give sharp upper bounds on the generalisation error of a randomly-projected classifier. We require neither a large margin nor data sparsity for our bounds to hold, and we make no assumption on the data distribution. We saw that a margin implies low flipping probabilities; identifying other structures that imply low flipping probabilities is the subject of future work. Our proof makes use of the orthogonal invariance of the standard Gaussian distribution, so it cannot be applied to other random matrices whose entry distribution is not orthogonally invariant: extension to other random matrices is also future work. Further tightening is possible if we replace the VC complexity term in our bounds with the better guarantees given in (Bartlett & Mendelson, 2002).

Selected References
Arriaga, R.I. and Vempala, S. An Algorithmic Theory of Learning: Robust Concepts and Random Projection. In 40th Annual Symposium on Foundations of Computer Science (FOCS 1999), pp. 616-623. IEEE, 1999.
Ball, K. An Elementary Introduction to Modern Convex Geometry. Flavors of Geometry, 31:1-58, 1997.
Bartlett, P.L. and Mendelson, S. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 3:463-482, 2002.
Calderbank, R., Jafarpour, S., and Schapire, R. Compressed Learning: Universal Sparse Dimensionality Reduction and Learning in the Measurement Domain. Technical Report, Rice University, 2009.
Durrant, R.J. and Kabán, A. Compressed Fisher Linear Discriminant Analysis: Classification of Randomly Projected Data. In Proc. 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2010), 2010.
Garg, A. and Roth, D. Margin Distribution and Learning Algorithms. In Proc. 20th International Conference on Machine Learning (ICML 2003), pp. 210-217, 2003.
Garg, A., Har-Peled, S., and Roth, D. On Generalization Bounds, Projection Profile, and Margin Distribution. In Proc. 19th International Conference on Machine Learning (ICML 2002), pp. 171-178, 2002.
Goemans, M.X. and Williamson, D.P. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems using Semidefinite Programming. Journal of the ACM, 42(6):1115-1145, 1995.