Robustness and Regularization of Support Vector Machines
Huan Xu (ECE, McGill University, Montreal, QC, Canada), Constantine Caramanis (ECE, The University of Texas at Austin, Austin, TX, USA), Shie Mannor (ECE, McGill University, Montreal, QC, Canada)

Abstract

We consider a robust classification problem and show that the standard regularized SVM is a special case of our formulation, providing an explicit link between regularization and robustness. At the same time, the physical connection between noise and robustness suggests the potential for a broad new family of robust classification algorithms. Finally, we show that robustness is a fundamental property of classification algorithms, by re-proving the consistency of support vector machines using only robustness arguments (instead of VC dimension or stability).

1 Introduction

Support Vector Machines (SVMs) [1, 2] find the hyperplane in the feature space that achieves maximum sample margin in the separable case. When the samples are not separable, a penalty term that approximates the total training error is considered [3]. It is well known that minimizing the training error itself can lead to poor classification performance on new unlabeled data because of, essentially, overfitting [4]. One of the most popular methods proposed to combat this problem is minimizing a combination of the training error and a regularization term. The resulting regularized classifier performs better on new data. This phenomenon is often interpreted from a statistical learning theory view: the regularization term restricts the complexity of the classifier, hence the deviation between the testing error and the training error is controlled [5, 6, 7]. We consider a different setup, assuming that some non-i.i.d. (potentially adversarial) disturbance is added to the training samples we observe. We follow a robust optimization approach (e.g., [8, 9]), minimizing the worst possible empirical error under such disturbances. The use of robust optimization in classification is not new (e.g., [10, 11]).
Past robust classification models consider only box-type uncertainty sets, which allow the possibility that the data have all been skewed in some non-neutral manner. We develop a new robust classification framework that treats non-box-type uncertainty sets, mitigates conservatism, and provides an explicit connection to regularization. Our contributions include the following. We show that the standard regularized SVM classifier is a special case of our robust classification, thus explicitly relating robustness and regularization. This provides an alternative explanation of the success of regularization, and also suggests new physically motivated ways to construct regularizers. Our robust SVM formulation permits finer control of the adversarial disturbance, restricting it to satisfy aggregate constraints across data points, therefore controlling the conservatism.
We show that the robustness perspective, stemming from a non-i.i.d. analysis, can be useful in a standard i.i.d. setup, by using it to give a new proof of consistency for standard SVM classification. This result implies that generalization ability is a direct result of robustness to local disturbances, and that we can construct learning algorithms that generalize well by robustifying non-consistent algorithms.

We explain here how the explicit equivalence of robustness and regularization we derive differs from previous work, and why it is interesting. While certain equivalence relationships between robustness and regularization have been established for problems outside the machine learning field ([8, 9]), those results do not directly apply to the classification problem. Research on classifier regularization mainly focuses on bounding the complexity of the function class (e.g., [5, 6, 7]). Meanwhile, research on robust classification has not attempted to relate robustness and regularization (e.g., [10, 11, 12]), in part due to the robustness formulations used there. The connection between robustness and regularization in the SVM context is important for the following reasons. It gives an alternative and potentially powerful explanation of the generalization ability of the regularization term. In the classical machine learning literature, the regularization term bounds the complexity of the class of classifiers. The robust view of regularization regards the testing samples as a perturbed copy of the training samples. We show that when the total perturbation is given or bounded, the regularization term bounds the gap between the classification errors of the SVM on these two sets of samples. In contrast to the standard PAC approach, this bound depends neither on how rich the class of candidate classifiers is, nor on an assumption that all samples are picked in an i.i.d. manner.
In addition, this suggests novel approaches to designing good classification algorithms, in particular, designing the regularization term. In Section 3 we use this new view to provide a novel proof of consistency that does not rely on VC-dimension or stability arguments. In the PAC structural risk minimization approach, regularization is chosen to minimize a bound on the generalization error based on the training error and a complexity term. This complexity term typically leads to overly emphasizing the regularizer, and indeed this approach is known to often be too pessimistic ([13]). The robust approach offers another avenue. Since both noise and robustness are physical processes, a close investigation of the application and noise characteristics at hand can provide insight into how to properly robustify, and therefore regularize, the classifier. For example, it is widely known that normalizing the samples so that the variance among all features is roughly the same often leads to good generalization performance. From the robustness perspective, this simply says that the noise is skewed (ellipsoidal) rather than spherical, and hence an appropriate robustification must be designed to fit the skew of the physical noise process.

Notation: Capital letters and boldface letters are used to denote matrices and column vectors, respectively. For a given norm $\|\cdot\|$, we use $\|\cdot\|^*$ to denote its dual norm. The set of integers from 1 to $n$ is denoted by $[1:n]$.

2 Robust Classification and Regularization

We consider the standard binary classification problem, where we are given a finite number of training samples $\{x_i, y_i\}_{i=1}^m \subseteq \mathbb{R}^n \times \{-1,+1\}$, and must find a linear classifier, specified by the function $h(x) = \mathrm{sgn}(\langle w, x\rangle + b)$. For the standard regularized classifier, the parameters $(w, b)$ are obtained by solving the following convex optimization problem:
$$\min_{w,b}\Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big]\Big\},$$
where $r(w,b)$ is a regularization term.
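To make the objective concrete, here is a minimal sketch (not the paper's code) that evaluates the regularized objective above for the hypothetical choice $r(w,b) = c\|w\|_2$; the toy dataset and the constant $c$ are illustrative only.

```python
import math

def hinge_objective(w, b, X, y, c):
    """Regularized objective r(w,b) + sum_i max(1 - y_i(<w, x_i> + b), 0),
    with r(w, b) taken to be c * ||w||_2 for concreteness."""
    reg = c * math.sqrt(sum(wj * wj for wj in w))
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        loss += max(1.0 - margin, 0.0)
    return reg + loss

# Toy 2-D data (hypothetical): all four points have margin >= 1 under
# w = (1, 0), b = 0, so only the regularization term remains.
X = [(2.0, 0.0), (1.5, 0.5), (-2.0, 0.0), (-1.0, -0.5)]
y = [+1, +1, -1, -1]
print(hinge_objective((1.0, 0.0), 0.0, X, y, c=0.1))   # prints 0.1
```

With the all-zeros classifier every hinge term equals 1, so the same call returns 4.0: the objective trades margin violations against the norm of $w$.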
Previous robust classification work [10, 14] considers the classification problem where the inputs are subject to (unknown) disturbances $\delta = (\delta_1, \ldots, \delta_m)$ and essentially solves the following min-max problem:
$$\min_{w,b}\ \max_{\delta \in N_{\mathrm{box}}}\Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big]\Big\}, \qquad (1)$$
for a box-type uncertainty set $N_{\mathrm{box}}$. That is, let $N_i$ denote the projection of $N_{\mathrm{box}}$ onto the $\delta_i$ component; then $N_{\mathrm{box}} = N_1 \times \cdots \times N_m$. Effectively, this allows simultaneous worst-case disturbances across many samples, and leads to overly conservative solutions. The goal of this paper is to obtain a robust formulation where the disturbances $\{\delta_i\}$ may be meaningfully taken to be correlated, i.e., to solve for a non-box-type $N$:
$$\min_{w,b}\ \max_{\delta \in N}\Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big]\Big\}. \qquad (2)$$
We define explicitly the correlated disturbance (or uncertainty) which we study below.

Definition 1.
1. A set $N_0 \subseteq \mathbb{R}^n$ is called an Atomic Uncertainty Set if (I) $0 \in N_0$; (II) for any $w \in \mathbb{R}^n$: $\sup_{\delta \in N_0}[w^\top \delta] = \sup_{\delta \in N_0}[-w^\top \delta] < +\infty$.
2. Let $N_0$ be an atomic uncertainty set. A set $N \subseteq \mathbb{R}^{n \times m}$ is called a Concave Correlated Uncertainty Set (CCUS) of $N_0$ if $N^- \subseteq N \subseteq N^+$, where
$$N^- \triangleq \bigcup_{t=1}^m N_t; \qquad N_t \triangleq \{(\delta_1, \ldots, \delta_m) \mid \delta_t \in N_0;\ \delta_{i \neq t} = 0\};$$
$$N^+ \triangleq \Big\{(\alpha_1\delta_1, \ldots, \alpha_m\delta_m) \,\Big|\, \sum_{i=1}^m \alpha_i = 1;\ \alpha_i \geq 0,\ \delta_i \in N_0,\ \forall i\Big\}.$$

The concave correlated uncertainty definition models the case where the disturbances on each sample are treated identically, but their aggregate behavior across multiple samples is controlled. In particular, $\{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \|\delta_i\| \leq c\}$ is a CCUS with $N_0 = \{\delta \mid \|\delta\| \leq c\}$.

Theorem 1. Assume $\{x_i, y_i\}_{i=1}^m$ are non-separable, $r(\cdot): \mathbb{R}^{n+1} \to \mathbb{R}$ is an arbitrary function, and $N$ is a CCUS with corresponding atomic uncertainty set $N_0$. Then the following two problems are equivalent:
$$\min_{w,b}\ \sup_{(\delta_1,\ldots,\delta_m) \in N}\Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big]\Big\}; \qquad (3)$$
$$\begin{aligned}\min_{w,b,\xi}:\ & r(w,b) + \sup_{\delta \in N_0}(w^\top \delta) + \sum_{i=1}^m \xi_i,\\ \text{s.t.}:\ & \xi_i \geq 1 - y_i(\langle w, x_i\rangle + b),\quad i = 1, \ldots, m;\\ & \xi_i \geq 0,\quad i = 1, \ldots, m.\end{aligned} \qquad (4)$$

Proof. We outline the proof. Let $v(w,b) \triangleq \sup_{\delta \in N_0}(w^\top \delta) + \sum_{i=1}^m \max[1 - y_i(\langle w, x_i\rangle + b),\, 0]$. It suffices to show that for any $(\hat w, \hat b) \in \mathbb{R}^{n+1}$,
$$v(\hat w, \hat b) \leq \sup_{(\delta_1,\ldots,\delta_m) \in N^-} \sum_{i=1}^m \max\big[1 - y_i(\langle \hat w, x_i - \delta_i\rangle + \hat b),\, 0\big]; \qquad (5)$$
$$\sup_{(\delta_1,\ldots,\delta_m) \in N^+} \sum_{i=1}^m \max\big[1 - y_i(\langle \hat w, x_i - \delta_i\rangle + \hat b),\, 0\big] \leq v(\hat w, \hat b). \qquad (6)$$
Since the samples $\{x_i, y_i\}_{i=1}^m$ are not separable, there exists $t \in [1:m]$ such that $y_t(\langle \hat w, x_t\rangle + \hat b) < 0$. With some algebra, we have $\sup_{(\delta_1,\ldots,\delta_m) \in N_t} \sum_{i=1}^m \max[1 - y_i(\langle \hat w, x_i - \delta_i\rangle + \hat b),\, 0] = v(\hat w, \hat b)$. Since $N_t \subseteq N$, Inequality (5) follows. Establishing Inequality (6) is standard.

The following corollary thus shows that the regularized SVM is a special case of robust classification.
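The term $\sup_{\delta \in N_0}(w^\top \delta)$ appearing in (4) is the support function of the atomic uncertainty set; for the Euclidean ball $N_0 = \{\delta \mid \|\delta\|_2 \leq c\}$ it evaluates to $c\|w\|_2$ by the dual-norm identity. The sketch below (with an arbitrary $w$ and $c$, chosen only for illustration) checks the closed form against random feasible directions.

```python
import math, random

def norm2(v):
    return math.sqrt(sum(x * x for x in v))

def sup_linear_over_ball(w, c):
    """sup over {||delta||_2 <= c} of <w, delta> equals c * ||w||_2
    (the Euclidean norm is self-dual), attained at delta = c*w/||w||_2."""
    return c * norm2(w)

# Empirical check: no random feasible delta beats the closed form.
random.seed(0)
w, c = (3.0, -4.0), 0.5
best = 0.0
for _ in range(10000):
    d = [random.uniform(-1, 1) for _ in w]
    scale = c * random.random() / (norm2(d) or 1.0)   # keeps ||d||_2 <= c
    d = [scale * x for x in d]
    best = max(best, sum(wi * di for wi, di in zip(w, d)))
print(sup_linear_over_ball(w, c), best)  # 2.5, and an empirical max below it
```

For a different norm on $N_0$, the same term becomes $c$ times the corresponding dual norm of $w$, which is exactly how the regularizer in Corollary 1 arises.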
Corollary 1. Let $T \triangleq \{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \|\delta_i\| \leq c\}$. If the training samples $\{x_i, y_i\}_{i=1}^m$ are non-separable, then the following two optimization problems on $(w, b)$ are equivalent:
$$\min_{w,b}\ \max_{(\delta_1,\ldots,\delta_m) \in T} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big]; \qquad (7)$$
$$\min_{w,b}:\ c\|w\|^* + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big]. \qquad (8)$$
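Corollary 1 can be sanity-checked numerically at a fixed $(w, b)$ that misclassifies at least one training point: following the proof of Theorem 1, spending the entire budget $c$ on a single misclassified sample already attains the worst case, and the resulting loss matches the regularized objective (8) (using the Euclidean norm, which is self-dual). The data, $(w, b)$, and $c$ below are hypothetical, not from the paper.

```python
import math

def norm2(v):
    return math.sqrt(sum(x * x for x in v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hinge_sum(w, b, X, y):
    return sum(max(1.0 - yi * (dot(w, xi) + b), 0.0) for xi, yi in zip(X, y))

def worst_case_hinge(w, b, X, y, c):
    """Worst case of (7) at a fixed (w, b), via the attack from the proof of
    Theorem 1: put the whole budget c on one sample t, with
    delta_t = c * y_t * w / ||w||_2, so that term grows by c * ||w||_2."""
    nw = norm2(w)
    best = hinge_sum(w, b, X, y)          # delta = 0 is feasible
    for t, yt in enumerate(y):
        dt = [c * yt * wj / nw for wj in w]
        Xp = [[xj - dj for xj, dj in zip(xi, dt)] if i == t else list(xi)
              for i, xi in enumerate(X)]
        best = max(best, hinge_sum(w, b, Xp, y))
    return best

# Hypothetical non-separable toy data: (0.5, 0) with label -1 is misclassified
# by the fixed classifier w = (1, 0), b = 0.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [+1, -1, -1]
w, b, c = (1.0, 0.0), 0.0, 0.3
lhs = worst_case_hinge(w, b, X, y, c)
rhs = c * norm2(w) + hinge_sum(w, b, X, y)   # objective (8) at the same (w, b)
print(lhs, rhs)   # the two agree at this (w, b)
```

Attacking a correctly classified point with margin above 1 leaves its hinge term at zero, which is why the adversary concentrates the budget on a misclassified sample.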
This corollary explains the widely known fact that the regularized classifier tends to be robust. It also suggests that the appropriate way to regularize should come from a disturbance-robustness perspective, e.g., by examining the variation of the data and solving the corresponding robust classification problem. Corollary 1 can easily be generalized to a kernelized version, i.e., a linear classifier in the feature space $\mathcal{H}$, defined as a Hilbert space containing the range of some feature mapping $\Phi(\cdot)$.

Corollary 2. Let $T_{\mathcal{H}} \triangleq \{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \|\delta_i\|_{\mathcal{H}} \leq c\}$. If $\{\Phi(x_i), y_i\}_{i=1}^m$ are non-separable, then the following two optimization problems on $(w, b)$ are equivalent:
$$\min_{w,b}\ \max_{(\delta_1,\ldots,\delta_m) \in T_{\mathcal{H}}} \sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i) - \delta_i\rangle + b),\, 0\big]; \qquad (9)$$
$$\min_{w,b}:\ c\|w\|_{\mathcal{H}} + \sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i)\rangle + b),\, 0\big]. \qquad (10)$$
Here, $\|\cdot\|_{\mathcal{H}}$ stands for the RKHS norm, which is self-dual. Corollary 2 essentially says that the standard kernelized SVM is implicitly a robust classifier with disturbance in the feature space. Disturbance in the feature space is less intuitive than disturbance in the sample space. However, the next lemma relates these two setups: under certain conditions, a classifier that achieves robustness in the feature space (the SVM, for example) also achieves robustness in the sample space. The proof is straightforward and omitted.

Lemma 1. Suppose there exist $\mathcal{X} \subseteq \mathbb{R}^n$, $\rho > 0$, and a continuous non-decreasing function $f: \mathbb{R}^+ \to \mathbb{R}^+$ satisfying $f(0) = 0$, such that
$$k(x,x) + k(x',x') - 2k(x,x') \leq f(\|x - x'\|_2^2), \quad \forall x, x' \in \mathcal{X},\ \|x - x'\|_2 \leq \rho.$$
Then
$$\|\Phi(\hat x + \delta) - \Phi(\hat x)\|_{\mathcal{H}} \leq \sqrt{f(\|\delta\|_2^2)}, \quad \forall \|\delta\|_2 \leq \rho,\ \hat x, \hat x + \delta \in \mathcal{X}.$$

3 Consistency of Regularization

In this section we explore a fundamental connection between learning and robustness, by using robustness properties to re-prove the statistical consistency of the linear classifier, and then of the kernelized SVM. Indeed, our proof mirrors the consistency proof found in [15], with the key difference that we replace the metric entropy, VC-dimension, and stability arguments used there with robustness.
In contrast to these standard techniques, which often work only for a limited range of algorithms, the robustness argument works for a much wider range of algorithms and allows a unified approach to showing consistency. Thus far we have considered the setup where the training samples are corrupted by certain set-inclusive disturbances; we now turn to the standard statistical learning setup. That is, let $\mathcal{X} \subseteq \mathbb{R}^n$ be bounded, and suppose the training samples $(x_i, y_i)$ are generated i.i.d. according to an unknown distribution $P$ supported on $\mathcal{X} \times \{-1,+1\}$. The next theorem shows that our robust classifier setup, and equivalently the regularized SVM, minimizes an asymptotic upper bound on the expected classification error and hinge loss.

Theorem 2. Denote $K \triangleq \max_{x \in \mathcal{X}} \sqrt{k(x,x)}$. Suppose there exist $\rho > 0$ and a continuous non-decreasing function $f: \mathbb{R}^+ \to \mathbb{R}^+$ satisfying $f(0) = 0$, such that
$$k(x,x) + k(x',x') - 2k(x,x') \leq f(\|x - x'\|_2^2), \quad \forall x, x' \in \mathcal{X},\ \|x - x'\|_2 \leq \rho.$$
Then there exists a random sequence $\{\gamma_{m,c}\}$ independent of $P$ such that, $\forall c > 0$, $\lim_{m \to \infty} \gamma_{m,c} = 0$ almost surely, and $\forall (w, b) \in \mathcal{H} \times \mathbb{R}$, the following bounds on the Bayes loss and the hinge loss hold:
$$\mathbb{E}_P\big(\mathbf{1}_{y \neq \mathrm{sgn}(\langle w, \Phi(x)\rangle + b)}\big) \leq \gamma_{m,c} + c\|w\|_{\mathcal{H}} + \frac{1}{m}\sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i)\rangle + b),\, 0\big];$$
$$\mathbb{E}_{(x,y)\sim P}\big(\max(1 - y(\langle w, \Phi(x)\rangle + b),\, 0)\big) \leq \gamma_{m,c}\big(1 + K\|w\|_{\mathcal{H}} + |b|\big) + c\|w\|_{\mathcal{H}} + \frac{1}{m}\sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i)\rangle + b),\, 0\big].$$
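The smoothness condition shared by Lemma 1 and Theorem 2 can be checked for a concrete kernel via the kernel trick, since $\|\Phi(x) - \Phi(x')\|_{\mathcal{H}}^2 = k(x,x) + k(x',x') - 2k(x,x')$. Below is a sketch for the Gaussian RBF with bandwidth $\sigma = 1$, for which $f(t) = t$ works because $2 - 2e^{-t/2} \leq t$ for all $t \geq 0$; the point and perturbation are illustrative.

```python
import math

def rbf(x, xp, sigma=1.0):
    """Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def feature_dist(x, xp, k):
    """||Phi(x) - Phi(x')||_H computed via the kernel trick:
    sqrt(k(x,x) + k(x',x') - 2 k(x,x'))."""
    return math.sqrt(k(x, x) + k(xp, xp) - 2.0 * k(x, xp))

# For the RBF with sigma = 1 the condition holds with f(t) = t, so the
# feature-space displacement is bounded by the sample-space displacement.
x, delta = (0.3, -0.7), (0.05, 0.1)
xp = tuple(a + d for a, d in zip(x, delta))
d2 = sum(d * d for d in delta)
print(feature_dist(x, xp, rbf) <= math.sqrt(d2))  # True
```

The identity kernel discussed later fails exactly this check: its feature-space distance jumps to $\sqrt{2}$ for any nonzero perturbation, however small.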
Proof. Step 1: We first prove the theorem for the non-kernelized case. We fix $c > 0$ and drop the subscript $c$ of $\gamma_{m,c}$. A testing sample $(x', y')$ and a training sample $(x, y)$ are called a sample pair if $y = y'$ and $\|x - x'\|_2 \leq c$. We say that a set of training samples and a set of testing samples form $l$ pairings if there exist $l$ sample pairs with no data reused. Given $m$ training samples and $m$ testing samples, we use $M_m$ to denote the largest number of pairings.

Lemma 2. Given $c > 0$, $M_m/m \to 1$ almost surely as $m \to +\infty$, uniformly w.r.t. $P$.

Proof. We provide a proof sketch. Partition $\mathcal{X}$ into finitely many (say $T$) sets such that if a training sample and a testing sample fall into the same set, they form a pairing. This is doable due to the finite dimensionality of the sample space. Let $N_i^{tr}$ and $N_i^{te}$ be the number of training samples and testing samples falling in the $i$-th set, respectively. Thus, $(N_1^{tr}, \ldots, N_T^{tr})$ and $(N_1^{te}, \ldots, N_T^{te})$ are multinomially distributed random vectors following the same distribution. It is straightforward to show that $\sum_{i=1}^T |N_i^{tr} - N_i^{te}|/m \to 0$ with probability one (e.g., by the Bretagnolle-Huber-Carol inequality), and hence $M_m/m \to 1$ almost surely. Moreover, the convergence rate does not depend on $P$.

Now we proceed to prove the theorem. Let $N_0 = \{\delta \mid \|\delta\|_2 \leq c\}$. Given $m$ training samples and $m$ testing samples with $M_m$ sample pairs, for these paired samples, both the total testing error and the total testing hinge loss are upper bounded by
$$\max_{(\delta_1,\ldots,\delta_m) \in N_0 \times \cdots \times N_0} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \leq mc\|w\|_2 + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big].$$
Hence the average testing error (including unpaired samples) is upper bounded by
$$1 - M_m/m + c\|w\|_2 + \frac{1}{m}\sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big].$$
Since $\max_{x \in \mathcal{X}}(1 - y(\langle w, x\rangle + b)) \leq 1 + |b| + K\|w\|_2$, the average hinge loss is upper bounded by
$$(1 - M_m/m)(1 + K\|w\|_2 + |b|) + c\|w\|_2 + \frac{1}{m}\sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big].$$
The proof follows since $M_m/m \to 1$ almost surely.

Step 2: Now we generalize the result to the kernelized version. Similarly, we lower-bound the number of sample pairs in the feature space.
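The pairing argument of Lemma 2 is easy to simulate: draw the train and test cell counts as two i.i.d. multinomial vectors and pair $\min(N_i^{tr}, N_i^{te})$ samples within each cell. This gives a lower bound on $M_m/m$ that approaches 1 as $m$ grows. The cell probabilities below are arbitrary, chosen only for illustration.

```python
import random

def pairing_fraction(m, probs, rng):
    """Lower bound on M_m / m from the partition argument of Lemma 2:
    draw m train and m test samples over the cells, then pair
    min(#train, #test) samples within each cell."""
    def counts():
        c = [0] * len(probs)
        for _ in range(m):
            r, acc = rng.random(), 0.0
            for i, p in enumerate(probs):
                acc += p
                if r < acc or i == len(probs) - 1:  # last cell absorbs rounding
                    c[i] += 1
                    break
        return c
    tr, te = counts(), counts()
    return sum(min(a, b) for a, b in zip(tr, te)) / m

rng = random.Random(1)
probs = [0.5, 0.3, 0.2]   # arbitrary cell probabilities
small = pairing_fraction(100, probs, rng)
large = pairing_fraction(100000, probs, rng)
print(small, large)   # the paired fraction approaches 1 as m grows
```

The unpaired fraction scales like $O(1/\sqrt{m})$ per cell, which matches the multinomial concentration invoked in the proof.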
The multinomial random variable argument used in the proof of Lemma 2 breaks down, due to the possible infinite dimensionality of the feature space. Nevertheless, we are able to lower bound the number of sample pairs in the feature space by the number of sample pairs in the sample space. Define $f^{-1}(\alpha) \triangleq \max\{\beta \geq 0 \mid f(\beta) \leq \alpha\}$. Since $f(\cdot)$ is continuous, $f^{-1}(\alpha) > 0$ for any $\alpha > 0$. By Lemma 1, if $x$ and $x'$ belong to a hyper-cube of side length $\min(\rho, \sqrt{f^{-1}(c^2)})/\sqrt{n}$ in the sample space, then $\|\Phi(x) - \Phi(x')\|_{\mathcal{H}} \leq c$. Hence the number of sample pairs in the feature space is lower bounded by the number of pairs of samples that fall into the same hyper-cube in the sample space. We can cover $\mathcal{X}$ with finitely many such hyper-cubes since $f^{-1}(c^2) > 0$. The rest of the proof is identical to Step 1.

Notice that the condition in Theorem 2 requires that the feature mapping is smooth and hence preserves locality of the disturbance, i.e., a small disturbance in the sample space guarantees that the corresponding disturbance in the feature space is also small. This condition is satisfied by most widely used kernels, e.g., homogeneous polynomial kernels and the Gaussian RBF. It is easy to construct non-smooth kernel functions which do not generalize well. For example, consider the kernel $k(x, x') = \mathbf{1}_{(x = x')}$, i.e., $k(x, x') = 1$ if $x = x'$, and zero otherwise. A standard RKHS-regularized SVM leads to the decision function $\mathrm{sign}(\sum_{i=1}^m \alpha_i k(x, x_i) + b)$, which equals $\mathrm{sign}(b)$ and provides no meaningful prediction if the testing sample $x$ is not one of the training samples. Hence as $m$ increases, the testing error remains as large as 50% regardless of the tradeoff parameter used in the algorithm, while the training error can be made arbitrarily small by fine-tuning the parameter.
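The failure mode of the identity kernel can be exhibited directly: on any input outside the training set, every kernel evaluation vanishes and the decision function collapses to $\mathrm{sign}(b)$. The training points and coefficients below are arbitrary placeholders, not the output of an actual SVM solver.

```python
def identity_kernel(x, xp):
    """k(x, x') = 1 if x == x', else 0: a valid but non-smooth kernel."""
    return 1.0 if x == xp else 0.0

def decision(x, alphas, X_train, b, k=identity_kernel):
    """Kernel decision function sign(sum_i alpha_i * k(x, x_i) + b)."""
    s = sum(a * k(x, xi) for a, xi in zip(alphas, X_train)) + b
    return 1 if s >= 0 else -1

X_train = [(0.0,), (1.0,), (2.0,)]
alphas = [5.0, -5.0, 5.0]   # arbitrary placeholder coefficients
b = -0.1
# On any unseen point every kernel term vanishes, so the output is sign(b):
print(decision((0.5,), alphas, X_train, b))   # prints -1
```

On a training point the matching kernel term fires and the prediction can be anything, which is how the training error is driven down while the test error stays at chance.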
Theorem 2 can further lead to the standard consistency notion, i.e., convergence to the Bayes risk ([15]). The proof in [15] involves a step showing that the expected hinge loss of the minimizer of the regularized training hinge loss concentrates around the empirical regularized hinge loss, accomplished using concentration inequalities derived from VC-dimension considerations and stability considerations. Instead, we can use our robustness-based results of Theorem 2 to replace these approaches. The detailed proof is omitted due to space limits.

4 Concluding Remarks

This work considers the relationship between robustness and regularized SVM classification. In particular, we prove that the standard norm-regularized SVM classifier is in fact the solution to a robust classification setup, and thus known results about regularized classifiers extend to robust classifiers. To the best of our knowledge, this is the first explicit such link between regularization and robustness in pattern classification. This link suggests that norm-based regularization essentially builds in a robustness to sample noise whose probability level sets are symmetric, and moreover have the structure of the unit ball with respect to the dual of the regularizing norm. It would be interesting to understand the performance gains possible when the noise does not have such characteristics, and the robust setup is used in place of regularization with an appropriately defined uncertainty set. In addition, based on the robustness interpretation of the regularization term, we re-proved the consistency of SVMs without direct appeal to notions of metric entropy, VC-dimension, or stability. Our proof suggests that the ability to handle disturbance is crucial for an algorithm to achieve good generalization ability.

References

[1] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, New York, NY.
[2] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
[3] K. Bennett and O. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1(1):23-34.
[4] V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3).
[5] A. Smola, B. Schölkopf, and K. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11.
[6] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3.
[7] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1-50.
[8] L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18.
[9] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations Research Letters, 25(1):1-13.
[10] P. Shivaswamy, C. Bhattacharyya, and A. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7.
[11] G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3.
[12] T. Trafalis and R. Gilbert. Robust support vector machines for classification and computational issues. Optimization Methods and Software, 22(1).
[13] M. Kearns, Y. Mansour, A. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7-50.
[14] C. Bhattacharyya, K. Pannagadatta, and A. Smola. A second order cone programming formulation for classifying missing data. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems (NIPS 17), Cambridge, MA. MIT Press.
[15] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1).
Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung
More informationBayes Decision Rule and Naïve Bayes Classifier
Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.
More informationOn Constant Power Water-filling
On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives
More informationRotational Prior Knowledge for SVMs
Rotational Prior Knowledge for SVMs Arkady Epshteyn and Gerald DeJong University of Illinois at Urbana-Chapaign, Urbana, IL 68, USA {aepshtey,dejong}@uiuc.edu Abstract. Incorporation of prior knowledge
More informationStability Bounds for Non-i.i.d. Processes
tability Bounds for Non-i.i.d. Processes Mehryar Mohri Courant Institute of Matheatical ciences and Google Research 25 Mercer treet New York, NY 002 ohri@cis.nyu.edu Afshin Rostaiadeh Departent of Coputer
More informationEstimating Parameters for a Gaussian pdf
Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3
More informationBounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks
Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University
More informationA general forulation of the cross-nested logit odel Michel Bierlaire, Dpt of Matheatics, EPFL, Lausanne Phone: Fax:
A general forulation of the cross-nested logit odel Michel Bierlaire, EPFL Conference paper STRC 2001 Session: Choices A general forulation of the cross-nested logit odel Michel Bierlaire, Dpt of Matheatics,
More informationProbability Distributions
Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples
More informationarxiv: v1 [cs.ds] 3 Feb 2014
arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/
More informationLeast Squares Fitting of Data
Least Squares Fitting of Data David Eberly, Geoetric Tools, Redond WA 98052 https://www.geoetrictools.co/ This work is licensed under the Creative Coons Attribution 4.0 International License. To view a
More informationIntroduction to Discrete Optimization
Prof. Friedrich Eisenbrand Martin Nieeier Due Date: March 9 9 Discussions: March 9 Introduction to Discrete Optiization Spring 9 s Exercise Consider a school district with I neighborhoods J schools and
More informationPAC-Bayesian Learning of Linear Classifiers
Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique
More informationPolygonal Designs: Existence and Construction
Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G
More informationPrediction by random-walk perturbation
Prediction by rando-walk perturbation Luc Devroye School of Coputer Science McGill University Gábor Lugosi ICREA and Departent of Econoics Universitat Popeu Fabra lucdevroye@gail.co gabor.lugosi@gail.co
More informationKernel-Based Nonparametric Anomaly Detection
Kernel-Based Nonparaetric Anoaly Detection Shaofeng Zou Dept of EECS Syracuse University Eail: szou@syr.edu Yingbin Liang Dept of EECS Syracuse University Eail: yliang6@syr.edu H. Vincent Poor Dept of
More informationCS Lecture 13. More Maximum Likelihood
CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood
More informationNon-Parametric Non-Line-of-Sight Identification 1
Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,
More informationBootstrapping Dependent Data
Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly
More informationIntroduction to Machine Learning. Recitation 11
Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,
More informationLearnability and Stability in the General Learning Setting
Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu
More informationConvex Programming for Scheduling Unrelated Parallel Machines
Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly
More informationA MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION
A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University
More informationA Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay
A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer
More informationChapter 6 1-D Continuous Groups
Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:
More informationLearnability of Gaussians with flexible variances
Learnability of Gaussians with flexible variances Ding-Xuan Zhou City University of Hong Kong E-ail: azhou@cityu.edu.hk Supported in part by Research Grants Council of Hong Kong Start October 20, 2007
More informationA Theoretical Framework for Deep Transfer Learning
A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il
More informationGrafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space
Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing
More informationAccuracy at the Top. Abstract
Accuracy at the Top Stephen Boyd Stanford University Packard 264 Stanford, CA 94305 boyd@stanford.edu Mehryar Mohri Courant Institute and Google 25 Mercer Street New York, NY 002 ohri@cis.nyu.edu Corinna
More informationRecovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup)
Recovering Data fro Underdeterined Quadratic Measureents (CS 229a Project: Final Writeup) Mahdi Soltanolkotabi Deceber 16, 2011 1 Introduction Data that arises fro engineering applications often contains
More informationLecture 9: Multi Kernel SVM
Lecture 9: Multi Kernel SVM Stéphane Canu stephane.canu@litislab.eu Sao Paulo 204 April 6, 204 Roadap Tuning the kernel: MKL The ultiple kernel proble Sparse kernel achines for regression: SVR SipleMKL:
More informationModel Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon
Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential
More informationUnderstanding Machine Learning Solution Manual
Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y
More informationSupplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data
Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse
More informationOn the Impact of Kernel Approximation on Learning Accuracy
On the Ipact of Kernel Approxiation on Learning Accuracy Corinna Cortes Mehryar Mohri Aeet Talwalkar Google Research New York, NY corinna@google.co Courant Institute and Google Research New York, NY ohri@cs.nyu.edu
More informationLecture 9 November 23, 2015
CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)
More informationNeural Network Learning as an Inverse Problem
Neural Network Learning as an Inverse Proble VĚRA KU RKOVÁ, Institute of Coputer Science, Acadey of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague 8, Czech Republic. Eail: vera@cs.cas.cz
More informationDistributed Subgradient Methods for Multi-agent Optimization
1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions
More informationBest Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence
Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr
More informationComputable Shell Decomposition Bounds
Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago
More informationError Exponents in Asynchronous Communication
IEEE International Syposiu on Inforation Theory Proceedings Error Exponents in Asynchronous Counication Da Wang EECS Dept., MIT Cabridge, MA, USA Eail: dawang@it.edu Venkat Chandar Lincoln Laboratory,
More informationLecture 21. Interior Point Methods Setup and Algorithm
Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and
More informationRandomized Recovery for Boolean Compressed Sensing
Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch
More informationMachine Learning: Fisher s Linear Discriminant. Lecture 05
Machine Learning: Fisher s Linear Discriinant Lecture 05 Razvan C. Bunescu chool of Electrical Engineering and Coputer cience bunescu@ohio.edu Lecture 05 upervised Learning ask learn an (unkon) function
More informationRandomized Accuracy-Aware Program Transformations For Efficient Approximate Computations
Randoized Accuracy-Aware Progra Transforations For Efficient Approxiate Coputations Zeyuan Allen Zhu Sasa Misailovic Jonathan A. Kelner Martin Rinard MIT CSAIL zeyuan@csail.it.edu isailo@it.edu kelner@it.edu
More informationBoosting with log-loss
Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the
More informationFairness via priority scheduling
Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation
More informationTight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees
Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Saple Guarantees Jean Honorio CSAIL, MIT Cabridge, MA 0239, USA jhonorio@csail.it.edu Toi Jaakkola CSAIL, MIT Cabridge, MA
More information3.3 Variational Characterization of Singular Values
3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and
More informationarxiv: v1 [cs.lg] 8 Jan 2019
Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v
More informationNyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,
More informationUsing EM To Estimate A Probablity Density With A Mixture Of Gaussians
Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points
More information