Sharp Generalization Error Bounds for Randomly-projected Classifiers
1 Sharp Generalization Error Bounds for Randomly-projected Classifiers. R.J. Durrant and A. Kabán, School of Computer Science, The University of Birmingham, Birmingham B15 2TT, UK. International Conference on Machine Learning (ICML 2013), Atlanta, June 2013.
2 Outline
- Introduction; Preliminaries
- Main result: generalisation bound for a generic linear classifier trained on randomly-projected data
- Main ingredient of the proof: flipping probability
- Geometric interpretation
- Corollaries
- Summary and future work
3-6 Introduction (1)
High dimensional data is everywhere. Random projection (RP) is fast becoming a workhorse in high dimensional learning: it is computationally cheap and amenable to theoretical analysis. Little is known about its effect on the generalisation performance of a classifier, except for a few specific cases. Results so far cover only specific families of classifiers, and impose constraints of some form on the data:
- The seminal paper by (Arriaga & Vempala, 1999) on RP-perceptron relies on the Johnson-Lindenstrauss lemma, so the bound unnaturally loosens as the sample size increases.
- (Calderbank et al., 2009) on RP-SVM uses techniques from Compressed Sensing; the bound only holds under the assumption of sparse data.
- (Davenport et al., 2010) assumes spherical classes; (Durrant & Kabán, 2011) on RP-FLD assumes subgaussian classes.
7-10 Introduction (2)
Our aim is to obtain a generalisation bound for a generic linear classifier trained by ERM on randomly projected data. We make no restrictive assumptions other than that the original data is drawn i.i.d. from some unknown distribution D.
A nice idea in (Garg et al., 2002) is to use RP as an analytic tool to bound the error of classifiers trained in the original data space: quantify the effect of RP by how it changes class label predictions for projected points relative to the predictions of the data space classifier. However, (Garg et al., 2002) derived a rather loose bound on this flipping probability, yielding a numerically trivial bound.
Our main result is the exact expression for the flipping probability; we turn the high-level approach of (Garg et al., 2002) around to obtain nontrivial bounds for RP-classifiers. The ingredients of our proof then also feed back to improve the bound of (Garg et al., 2002).
11-12 Preliminaries
Two-class classification. Training set $T^N = \{(x_i, y_i)\}_{i=1}^N$, with $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} D$ over $\mathbb{R}^d \times \{0,1\}$. Let $\hat{h} \in H$ be the ERM linear classifier, so $\hat{h} \in \mathbb{R}^d$; w.l.o.g. we take it to pass through the origin, and we may take all data to lie on $S^{d-1} \subset \mathbb{R}^d$ with $\|\hat{h}\| = 1$. For an unlabelled query point $x_q$, the label returned by $\hat{h}$ is
$$\hat{h}(x_q) = \mathbf{1}\{\hat{h}^T x_q > 0\},$$
where $\mathbf{1}\{\cdot\}$ is the indicator function.
The generalisation error of $\hat{h}$ is defined as $E_{(x_q, y_q) \sim D}[L(\hat{h}(x_q), y_q)]$, and we use the (0,1)-loss: $L_{(0,1)}(\hat{h}(x_q), y_q) = 0$ if $\hat{h}(x_q) = y_q$, and $1$ otherwise.
13-16 Problem setting
Consider the case when $d$ is too large. There are many dimensionality reduction methods; RP is computationally cheap and non-adaptive.
Random projection: take a random matrix $R \in M_{k \times d}$, $k < d$, with entries $r_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$, and pre-multiply the data points with it: $T_R^N = \{(Rx_i, y_i)\}_{i=1}^N$.
Side note: the $r_{ij}$ can be subgaussian in general, but in this work we exploit the rotation-invariance of the Gaussian.
We are interested in the generalisation error of the ERM linear classifier trained on $T_R^N$ rather than $T^N$.
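The following is a minimal sketch of the projection step just described, assuming numpy; the default scale $\sigma = 1/\sqrt{k}$ is only a common convention, since any $\sigma > 0$ merely rescales the projected points and leaves sign-based predictions unchanged.

```python
import numpy as np

def random_project(X, k, sigma=None, rng=None):
    """Project rows of X (N x d) to k dimensions with a Gaussian RP matrix.

    Entries of R are i.i.d. N(0, sigma^2); sigma defaults to 1/sqrt(k),
    a common normalisation (the scale does not affect sign-based decisions).
    """
    rng = np.random.default_rng(rng)
    _, d = X.shape
    if sigma is None:
        sigma = 1.0 / np.sqrt(k)
    R = rng.normal(0.0, sigma, size=(k, d))   # R in M_{k x d}
    return X @ R.T, R                         # projected data (N x k), and R

# Example: project a 1000-dimensional training set down to k = 25 dimensions.
X = np.random.default_rng(0).normal(size=(200, 1000))
X_proj, R = random_project(X, k=25)
print(X_proj.shape)  # (200, 25)
```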
17 Denote the trained classifier by $\hat{h}_R \in \mathbb{R}^k$ (possibly not passing through the origin, but translation does not affect our proof technique). The label returned by $\hat{h}_R$ is therefore $\hat{h}_R(Rx_q) = \mathbf{1}\{\hat{h}_R^T Rx_q + b > 0\}$, where $b \in \mathbb{R}$. We want to estimate
$$E_{(x_q, y_q) \sim D}\left[L_{(0,1)}(\hat{h}_R(Rx_q), y_q)\right] = \Pr_{(x_q, y_q) \sim D}\left\{\hat{h}_R(Rx_q) \neq y_q\right\}$$
with high probability w.r.t. the random choice of $T^N$ and $R$.
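Exact ERM under the (0,1)-loss is intractable in general, so as a hedged illustration the sketch below trains scikit-learn's logistic regression as a practical surrogate linear classifier on the projected training set $T_R^N$ and estimates $\Pr\{\hat{h}_R(Rx_q) \neq y_q\}$ on held-out data; the synthetic data generator and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d, k, N = 500, 20, 400

# Synthetic high-dimensional two-class data (illustrative stand-in for draws from D).
X, y = make_classification(n_samples=2 * N, n_features=d, n_informative=50,
                           random_state=1)
X_train, y_train, X_test, y_test = X[:N], y[:N], X[N:], y[N:]

# Gaussian RP matrix with i.i.d. N(0, 1/k) entries.
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))

# Train a linear classifier on T_R^N (a practical surrogate for the ERM classifier).
clf_R = LogisticRegression(max_iter=1000).fit(X_train @ R.T, y_train)

# Held-out estimate of Pr{h_R(R x_q) != y_q}.
err_R = np.mean(clf_R.predict(X_test @ R.T) != y_test)
print(f"estimated test error of the RP classifier: {err_R:.3f}")
```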
18 Results
19-20 Theorem [Generalization error bound]
Let $T^N = \{(x_i, y_i) : x_i \in \mathbb{R}^d, y_i \in \{0,1\}\}_{i=1}^N$ be a training set, and let $\hat{h}$ be the linear ERM classifier estimated from $T^N$. Let $R \in M_{k \times d}$, $k < d$, be a RP matrix with entries $r_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Denote by $T_R^N = \{(Rx_i, y_i)\}_{i=1}^N$ the random projection of $T^N$, and let $\hat{h}_R$ be the linear classifier estimated from $T_R^N$. Then, for all $\delta \in (0,1]$, with probability at least $1 - 2\delta$,
$$\Pr_{x_q, y_q}\{\hat{h}_R(Rx_q) \neq y_q\} \;\leq\; \hat{E}(T^N, \hat{h}) + \frac{1}{N}\sum_{i=1}^N f_k(\theta_i) + \min\left\{\sqrt{\frac{3\log\frac{1}{\delta}}{N}\cdot\frac{1}{N}\sum_{i=1}^N f_k(\theta_i)},\ \frac{1-\delta}{\delta}\cdot\frac{1}{N}\sum_{i=1}^N f_k(\theta_i)\right\} + 2\sqrt{\frac{(k+1)\log\frac{2eN}{k+1} + \log\frac{1}{\delta}}{N}}$$
where $f_k(\theta_i) := \Pr_R\{\operatorname{sign}((R\hat{h})^T Rx_i) \neq \operatorname{sign}(\hat{h}^T x_i)\}$ is the flipping probability for $x_i$, with $\theta_i$ the principal angle between $\hat{h}$ and $x_i$, and $\hat{E}(T^N, \hat{h})$ is the empirical risk of the data space classifier.
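As a hedged numerical sketch only, the helper below assembles the right-hand side of the bound term by term from precomputed flip probabilities $f_k(\theta_i)$, the empirical risk of the data space classifier, and $\delta$; the function name and the example values are illustrative assumptions.

```python
import numpy as np

def rp_generalisation_bound(flip_probs, emp_risk, k, delta):
    """Right-hand side of the generalisation error bound above.

    flip_probs : array of f_k(theta_i) for the N training points
    emp_risk   : empirical (0,1)-risk of the data-space classifier h-hat
    k          : projection dimension; delta : confidence parameter
    """
    flip_probs = np.asarray(flip_probs, dtype=float)
    N = len(flip_probs)
    mean_flip = flip_probs.mean()              # (1/N) sum_i f_k(theta_i)

    # Deviation of the empirical flip rate from its expectation:
    # min of a Chernoff-type term and a Markov-type term.
    chernoff = np.sqrt(3.0 * np.log(1.0 / delta) / N * mean_flip)
    markov = (1.0 - delta) / delta * mean_flip

    # VC complexity term for linear classifiers in k dimensions.
    vc = 2.0 * np.sqrt(((k + 1) * np.log(2 * np.e * N / (k + 1))
                        + np.log(1.0 / delta)) / N)

    return emp_risk + mean_flip + min(chernoff, markov) + vc

# Illustrative values only: 10000 points, all with flip probability 0.01.
print(rp_generalisation_bound(np.full(10000, 0.01), emp_risk=0.05, k=25, delta=0.05))
```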
21-23 Proof (sketch)
For a fixed instance of $R$, from classical VC theory we have, for all $\delta \in (0,1)$, w.p. $1-\delta$ over $T^N$,
$$\Pr_{x_q, y_q}\{\hat{h}_R(Rx_q) \neq y_q\} \;\leq\; \hat{E}(T_R^N, \hat{h}_R) + 2\sqrt{\frac{(k+1)\log(2eN/(k+1)) + \log(1/\delta)}{N}}$$
where $\hat{E}(T_R^N, \hat{h}_R) = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{\hat{h}_R(Rx_i) \neq y_i\}$ is the empirical error. We see that RP reduces the complexity term but will increase the empirical error. We bound the latter further:
$$\hat{E}(T_R^N, \hat{h}_R) \;\leq\; \ldots \;\leq\; \underbrace{\frac{1}{N}\sum_{i=1}^N \mathbf{1}\{\operatorname{sign}((R\hat{h})^T Rx_i) \neq \operatorname{sign}(\hat{h}^T x_i)\}}_{S} + \hat{E}(T^N, \hat{h}).$$
Now we bound $S$ from $E_R[S]$ w.h.p. w.r.t. the random choice of $R$. This is a dependent sum: each term depends on the same $R$. We can use a Markov inequality or a Chernoff bound for dependent variables (Siegel, 1995), which is numerically tighter for small $\delta$, and take the min of these. The terms of $E_R[S]$ are what we call the flipping probabilities; we derive the exact form of these, which is the main technical ingredient.
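A hedged Monte Carlo sketch of the quantity $S$ above: for a fixed classifier $\hat{h}$ and fixed (synthetic) training points, $S$ is the fraction of points whose sign under $(R\hat{h}, Rx_i)$ disagrees with the data space sign, and its distribution over independent draws of $R$ can be inspected directly; the dimensions, sample sizes, and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 300, 10, 500

# Fixed data-space classifier and fixed unit-norm training points (synthetic).
h = rng.normal(size=d); h /= np.linalg.norm(h)
X = rng.normal(size=(N, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
signs = np.sign(X @ h)                                  # sign(h^T x_i)

def S_of_R(R):
    """Fraction of training points whose sign flips under the projection R."""
    return np.mean(np.sign((X @ R.T) @ (R @ h)) != signs)

# Distribution of S over independent draws of R.
samples = np.array([S_of_R(rng.normal(size=(k, d))) for _ in range(300)])
print("mean of S over draws of R:", samples.mean())      # estimates E_R[S]
print("95th percentile of S:    ", np.quantile(samples, 0.95))
```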
24 Theorem [Flipping Probability]
Let $h, x \in \mathbb{R}^d$ be two vectors with angle $\theta \in [0, \pi/2]$ between them; without loss of generality take $\|h\| = \|x\| = 1$. Let $R \in M_{k \times d}$, $k < d$, be a random projection matrix with entries $r_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$, and let $Rh, Rx \in \mathbb{R}^k$ be the images of $h, x$ under $R$, with angular separation $\theta_R$.
1. Denote by $f_k(\theta)$ the flipping probability $f_k(\theta) := \Pr\{(Rh)^T Rx < 0 \mid h^T x > 0\}$. Then:
$$f_k(\theta) = \frac{\Gamma(k)}{(\Gamma(k/2))^2}\int_0^{\psi} \frac{z^{(k-2)/2}}{(1+z)^k}\,dz \qquad (1)$$
where $\psi = (1 - \cos\theta)/(1 + \cos\theta)$.
25 2. The expression above can be rewritten as the quotient of the surface area of a hyperspherical cap with an angle of $2\theta$ by the surface area of the corresponding hypersphere, namely:
$$f_k(\theta) = \frac{\int_0^{\theta} \sin^{k-1}(\phi)\,d\phi}{\int_0^{\pi} \sin^{k-1}(\phi)\,d\phi} \qquad (2)$$
3. The flipping probability is monotonically decreasing as a function of $k$: fix $\theta \in [0, \pi/2]$; then $f_k(\theta) \geq f_{k+1}(\theta)$.
Proof is in the paper.
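A hedged sketch computing the flip probability directly from expression (2) by numerical quadrature (assuming scipy), and checking the $k = 1$ case and the monotonicity in $k$ stated above.

```python
import numpy as np
from scipy.integrate import quad

def flip_probability(theta, k):
    """f_k(theta): ratio of the cap integral to the full integral in (2)."""
    cap, _ = quad(lambda phi: np.sin(phi) ** (k - 1), 0.0, theta)
    full, _ = quad(lambda phi: np.sin(phi) ** (k - 1), 0.0, np.pi)
    return cap / full

theta = np.pi / 6  # 30 degrees

# k = 1 recovers theta / pi (Goemans & Williamson, 1995).
print(flip_probability(theta, 1), theta / np.pi)

# Monotonic decrease with k.
print([round(flip_probability(theta, k), 4) for k in (1, 2, 5, 10, 25)])
```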
26 Flip Probability - Illustration. [Figure: theoretical flip probability Pr[flip] as a function of the angle θ (radians), for a range of values of k.]
27 Flip Probability - d-invariance. [Figure: Monte Carlo trials showing flip probability invariance w.r.t. d. Here k = 5 and d ∈ {500, 450, ..., 250}.]
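A hedged Monte Carlo sketch of the $d$-invariance illustrated above: for fixed $k$ and angle $\theta$, the empirical flip rate over random draws of $R$ should not depend on the ambient dimension $d$; the trial count, dimensions, and seed are arbitrary.

```python
import numpy as np

def mc_flip_rate(theta, k, d, n_trials=2000, seed=0):
    """Empirical Pr{(Rh)^T Rx < 0} for unit vectors h, x at angle theta in R^d."""
    rng = np.random.default_rng(seed)
    h = np.zeros(d); h[0] = 1.0
    x = np.zeros(d); x[0] = np.cos(theta); x[1] = np.sin(theta)
    flips = 0
    for _ in range(n_trials):
        R = rng.normal(size=(k, d))        # sigma = 1; the scale is irrelevant
        flips += (R @ h) @ (R @ x) < 0.0
    return flips / n_trials

theta, k = np.pi / 4, 5
print([mc_flip_rate(theta, k, d) for d in (250, 350, 500)])  # roughly equal rates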
28-29 Geometric interpretation of the Flipping Probability
The flip probability $f_k(\theta)$ recovers the known result for $k = 1$, namely $\theta/\pi$, as given in (Goemans & Williamson, 1995, Lemma 3.2). Geometrically, when $k = 1$ the flipping probability is the quotient of the length of the arc with angle $2\theta$ by the circumference of the unit circle, which is $2\pi$. [Figure: projection line, its normal, vectors h and x, and the flip region.]
For generic $k$, our flip probability equals the ratio of the surface area in $\mathbb{R}^{k+1}$ of a hyperspherical cap with angle $2\theta$,
$$2\pi \prod_{i=1}^{k-2}\int_0^{\pi}\sin^i(\phi)\,d\phi \cdot \int_0^{\theta}\sin^{k-1}(\phi)\,d\phi,$$
to the surface area of the unit hypersphere, $2\pi \prod_{i=1}^{k-1}\int_0^{\pi}\sin^i(\phi)\,d\phi$. Hence our result gives a natural generalisation of the result in (Goemans & Williamson, 1995). Also, since this ratio is known to be upper-bounded by $\exp(-\frac{1}{2}k\cos^2(\theta))$ (Ball, 1997), the cost of working with randomly-projected data decays exponentially with $k$.
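A hedged numerical check of the exponential decay just quoted, comparing the flip probability from expression (2) with the cap bound $\exp(-\frac{1}{2}k\cos^2\theta)$ for a few values of $k$ (the quadrature helper is redefined here so the snippet is self-contained).

```python
import numpy as np
from scipy.integrate import quad

def flip_probability(theta, k):
    cap, _ = quad(lambda phi: np.sin(phi) ** (k - 1), 0.0, theta)
    full, _ = quad(lambda phi: np.sin(phi) ** (k - 1), 0.0, np.pi)
    return cap / full

theta = np.pi / 3
for k in (5, 10, 25, 50):
    exact = flip_probability(theta, k)
    ball = np.exp(-0.5 * k * np.cos(theta) ** 2)
    print(k, round(exact, 6), round(ball, 6))   # exact value vs. Ball's upper bound
```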
30 Two straightforward Corollaries
Corollary 1 [Upper bound on generalisation error for data separable with a margin] relies on $f_k(\theta) \leq \exp(-\frac{1}{2}k\cos^2(\theta))$ (Ball, 1997, Lemma 2.2), with $\cos(\theta) = m$ the margin.
31 Corollary 1 [Upper Bound on Generalisation Error for Data Separable with a Margin]
If the conditions of the Theorem hold and the data classes are also separable with a margin $m$ in the data space, then for all $\delta \in (0,1)$, with probability at least $1 - 2\delta$, we have:
$$\Pr_{x_q, y_q}\{\hat{h}_R(Rx_q) \neq y_q\} \;\leq\; \hat{E}(T^N, \hat{h}) + \exp\!\left(-\tfrac{1}{2}km^2\right) + \min\left\{\sqrt{\frac{3\log\frac{1}{\delta}}{N}}\exp\!\left(-\tfrac{1}{4}km^2\right),\ \frac{1-\delta}{\delta}\exp\!\left(-\tfrac{1}{2}km^2\right)\right\} + 2\sqrt{\frac{(k+1)\log\frac{2eN}{k+1} + \log\frac{1}{\delta}}{N}}$$
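A hedged sketch evaluating the margin-based bound above for several projection dimensions $k$, making the trade-off visible between the flip terms (which shrink with $k$) and the VC term (which grows with $k$); the margin, sample size, and $\delta$ below are illustrative assumptions.

```python
import numpy as np

def margin_bound(m, k, N, delta, emp_risk=0.0):
    """Right-hand side of the margin-based bound of Corollary 1 above."""
    flip = np.exp(-0.5 * k * m ** 2)                                 # exp(-k m^2 / 2)
    chernoff = np.sqrt(3.0 * np.log(1.0 / delta) / N) * np.exp(-0.25 * k * m ** 2)
    markov = (1.0 - delta) / delta * flip
    vc = 2.0 * np.sqrt(((k + 1) * np.log(2 * np.e * N / (k + 1))
                        + np.log(1.0 / delta)) / N)
    return emp_risk + flip + min(chernoff, markov) + vc

# Illustrative: margin m = 0.3, N = 10000 examples, delta = 0.05.
for k in (50, 100, 200, 400):
    print(k, round(margin_bound(m=0.3, k=k, N=10000, delta=0.05), 4))
```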
32 Corollary 2 [Upper Bound on Generalisation Error in Data Space]
Let $T^{2N} = \{(x_i, y_i)\}_{i=1}^{2N}$ be a set of $d$-dimensional labelled training examples drawn i.i.d. from some data distribution $D$, and let $\hat{h}$ be a linear classifier estimated from $T^{2N}$ by ERM. Let $k \in \{1, 2, \ldots, d\}$ be an integer and let $R \in M_{k \times d}$ be a random projection matrix with entries $r_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Then for all $\delta \in (0,1]$, with probability at least $1 - 4\delta$ w.r.t. the random draws of $T^{2N}$ and $R$, the generalisation error of $\hat{h}$ w.r.t. the (0,1)-loss is bounded above by:
$$\Pr_{x_q, y_q}\{\hat{h}(x_q) \neq y_q\} \;\leq\; \hat{E}(T^{2N}, \hat{h}) + 2\min_k\left\{\frac{1}{N}\sum_{i=1}^{2N} f_k(\theta_i) + \min\left\{\sqrt{\frac{3\log\frac{1}{\delta}}{N}\cdot\frac{1}{N}\sum_{i=1}^{2N} f_k(\theta_i)},\ \frac{1-\delta}{\delta}\cdot\frac{1}{N}\sum_{i=1}^{2N} f_k(\theta_i)\right\} + \sqrt{\frac{(k+1)\log\frac{2eN}{k+1} + \log\frac{1}{\delta}}{2N}}\right\}$$
33-37 Conclusions & Future work
We derived the exact probability of label flipping as a result of Gaussian random projection, and used this to give sharp upper bounds on the generalisation error of a randomly-projected classifier.
We require neither a large margin nor data sparsity for our bounds to hold, and we make no assumption on the data distribution.
We saw that a margin implies low flipping probabilities; identifying other structures that imply low flipping probabilities is a subject for future work.
Our proof makes use of the orthogonal invariance of the standard Gaussian distribution, so it does not directly apply to random matrices whose entry distribution is not orthogonally invariant: extension to other random matrices is a subject for future work.
Further tightening is possible if we replace the VC complexity term in our bounds with the better guarantees given in (Bartlett & Mendelson, 2002).
38 Selected References
- Arriaga, R.I. and Vempala, S. An Algorithmic Theory of Learning: Robust Concepts and Random Projection. In 40th Annual Symposium on Foundations of Computer Science (FOCS 1999). IEEE, 1999.
- Ball, K. An Elementary Introduction to Modern Convex Geometry. Flavors of Geometry, 31:1-58, 1997.
- Bartlett, P.L. and Mendelson, S. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research, 3:463-482, 2002.
- Calderbank, R., Jafarpour, S., and Schapire, R. Compressed Learning: Universal Sparse Dimensionality Reduction and Learning in the Measurement Domain. Technical Report, Rice University, 2009.
- Durrant, R.J. and Kabán, A. Compressed Fisher Linear Discriminant Analysis: Classification of Randomly Projected Data. In Proc. 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2010), 2010.
- Garg, A. and Roth, D. Margin Distribution and Learning Algorithms. In Proc. 20th International Conference on Machine Learning (ICML 2003), 2003.
- Garg, A., Har-Peled, S., and Roth, D. On Generalization Bounds, Projection Profile, and Margin Distribution. In Proc. 19th International Conference on Machine Learning (ICML 2002), 2002.
- Goemans, M.X. and Williamson, D.P. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems using Semidefinite Programming. Journal of the ACM, 42(6):1115-1145, 1995.