Margin Maximizing Loss Functions


Saharon Rosset, Ji Zhu and Trevor Hastie
Department of Statistics
Stanford University
Stanford, CA

Abstract

Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and the logistic regression loss. We also generalize it to multiclass classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines.

1 Introduction

Assume we have a classification learning sample $\{x_i, y_i\}_{i=1}^n$ with $y_i \in \{-1, +1\}$. We wish to build a model $F(x)$ for this data by minimizing (exactly or approximately) a loss criterion $\sum_i C(y_i, F(x_i)) = \sum_i C(y_i F(x_i))$, which is a function of the margins $y_i F(x_i)$ of this model on this data. Most common classification modeling approaches can be cast in this framework: logistic regression, support vector machines, boosting and more. The model $F(x)$ which these methods actually build is a linear combination of dictionary functions coming from a dictionary $\mathcal{H}$, which can be large or even infinite: $F(x) = \sum_j \beta_j h_j(x)$, $h_j \in \mathcal{H}$, and our prediction at point $x$ based on this model is $\mathrm{sgn}\, F(x)$. When $\mathcal{H}$ is large, as is the case in most boosting or kernel SVM applications, some regularization is needed to control the complexity of the model $F(x)$ and the resulting overfitting.
Thus, it is common that the quantity actually minimized on the data is a regularized version of the loss function:

(1) $\hat\beta(\lambda) = \arg\min_\beta \sum_i C(y_i \beta' x_i) + \lambda \|\beta\|_p^p$

where the second term penalizes the $\ell_p$ norm of the coefficient vector ($p \geq 1$ for convexity, and in practice usually $p \in \{1, 2\}$), and $\lambda > 0$ is a tuning regularization parameter. The 1-norm and 2-norm support vector machine training problems with slack can be cast in this form ([6], chapter 12). In [8] we have shown that boosting approximately follows the
path of regularized solutions traced by (1) as the regularization parameter $\lambda$ varies, with the appropriate loss and an $\ell_1$ penalty.

The main question that we answer in this paper is: for what loss functions does $\hat\beta(\lambda)$ converge to an optimal separator as $\lambda \to 0$? The definition of optimality which we use depends on the $\ell_p$ norm used for regularization, and we term the optimum the $\ell_p$-margin-maximizing separating hyperplane. More concretely, we investigate for which loss functions and under which conditions we have:

(2) $\lim_{\lambda \to 0} \frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta' x_i$

This margin maximizing property is interesting for two distinct reasons. First, it gives us a geometric interpretation of the limiting model as we relax the regularization. It tells us that this loss seeks to optimally separate the data by maximizing a distance between a separating hyperplane and the closest points. A theorem by Mangasarian [7] allows us to interpret $\ell_p$ margin maximization as $\ell_q$ distance maximization, with $1/p + 1/q = 1$, and hence make a clear geometric interpretation. Second, from a learning theory perspective, large margins are an important quantity: generalization error bounds that depend on the margins have been derived for support vector machines ([10], using $\ell_2$ margins) and boosting ([9], using $\ell_1$ margins). Thus, showing that a loss function is margin maximizing in this sense is useful and promising information regarding this loss function's potential for generating good prediction models.

Our main result is a sufficient condition on the loss function which guarantees that (2) holds if the data is separable, i.e. if the maximum on the RHS of (2) is positive. This condition is presented and proven in section 2. It covers the hinge loss of support vector machines, the log-likelihood loss of logistic regression, and the exponential loss, most notably used in boosting. We discuss these and other examples in section 3. Our result generalizes elegantly to multiclass models and loss functions.
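The limit in (2) can also be checked numerically on a toy problem. The following sketch is not from the paper; the dataset, step size and iteration count are illustrative choices. It minimizes the regularized exponential loss by gradient descent for a small $\lambda$, on a symmetric dataset whose $\ell_2$-margin-maximizing direction is $(1,1)/\sqrt{2}$ with minimal margin $\sqrt{2}$, and checks that the normalized solution nearly attains that margin:

```python
import numpy as np

# Toy separable data; by symmetry the l2-margin-maximizing direction
# is (1,1)/sqrt(2), with minimal margin sqrt(2) ~ 1.414.
X = np.array([[1.0, 1.0], [3.0, -1.0], [-1.0, -1.0], [-3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def solve(lam, iters=60000, lr=0.2):
    """Gradient descent on sum_i exp(-y_i b'x_i) + lam * ||b||_2^2."""
    b = np.zeros(2)
    for _ in range(iters):
        m = y * (X @ b)                            # margins y_i b'x_i
        g = -(y * np.exp(-m)) @ X + 2 * lam * b    # gradient of loss + penalty
        b -= lr * g
    return b

b = solve(lam=1e-6)
direction = b / np.linalg.norm(b)
min_margin = np.min(y * (X @ direction))
# For small lambda the normalized solution is close to the
# margin-maximizing separator, so min_margin is close to sqrt(2).
print(direction, min_margin)
```

With larger $\lambda$ the normalized solution sits noticeably further from the max-margin direction, which is the regularization-path behavior that (2) describes in the limit.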
We present the resulting margin-maximizing versions of SVMs and logistic regression in section 4.

2 Sufficient condition for margin maximization

The following theorem shows that if the loss function vanishes quickly enough, then it will be margin-maximizing as the regularization vanishes. It provides us with a unified margin-maximization theory, covering SVMs, logistic regression and boosting.

Theorem 2.1 Assume the data $\{x_i, y_i\}_{i=1}^n$ is separable, i.e. $\exists \beta$ s.t. $\min_i y_i \beta' x_i > 0$. Let $C(y, f) = C(yf)$ be a monotone non-increasing loss function of the margin only. If $\exists T > 0$ (possibly $T = \infty$) such that:

(3) $\lim_{t \nearrow T} \frac{C(t(1 - \epsilon))}{C(t)} = \infty \quad \forall \epsilon > 0$

then $C$ is a margin maximizing loss function, in the sense that any convergence point of the normalized solutions $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$ to the regularized problems (1) as $\lambda \to 0$ is an $\ell_p$-margin-maximizing separating hyperplane. Consequently, if this margin-maximizing hyperplane is unique, then the solutions converge to it:

(4) $\lim_{\lambda \to 0} \frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p} = \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta' x_i$

Proof We prove the result separately for $T = \infty$ and $T < \infty$.
a. $T = \infty$:

Lemma 2.2 $\|\hat\beta(\lambda)\|_p \to \infty$ as $\lambda \to 0$.

Proof Since $T = \infty$, $C(m) > 0 \; \forall m$ and $\lim_{m \to \infty} C(m) = 0$. Therefore, for loss+penalty to vanish as $\lambda \to 0$, $\|\hat\beta(\lambda)\|_p$ must diverge, to allow the margins to diverge.

Lemma 2.3 Assume $\beta_1, \beta_2$ are two separating models, with $\|\beta_1\|_p = \|\beta_2\|_p = 1$, and $\beta_1$ separates the data better, i.e. $0 < m_2 = \min_i y_i \beta_2' x_i < m_1 = \min_i y_i \beta_1' x_i$. Then $\exists U = U(m_1, m_2)$ such that $\forall t > U$:

$\sum_i C(y_i t \beta_1' x_i) < \sum_i C(y_i t \beta_2' x_i)$

In words, if $\beta_1$ separates better than $\beta_2$, then scaled-up versions of $\beta_1$ will incur smaller loss than scaled-up versions of $\beta_2$, if the scaling factor is large enough.

Proof Since condition (3) holds with $T = \infty$, there exists $U$ such that $\forall t > U$: $C(t m_2)/C(t m_1) > n$. Thus, from $C$ being non-increasing, we immediately get $\forall t > U$:

$\sum_i C(y_i t \beta_1' x_i) \leq n \, C(t m_1) < C(t m_2) \leq \sum_i C(y_i t \beta_2' x_i)$

Proof of case a.: Assume $\beta^*$ is a convergence point of $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$ as $\lambda \to 0$, with $\|\beta^*\|_p = 1$. Now assume by contradiction that $\tilde\beta$ has $\|\tilde\beta\|_p = 1$ and bigger minimal $\ell_p$ margin. Denote the minimal margins for the two models by $m^*$ and $\tilde m$, respectively, with $m^* < \tilde m$. By continuity of the minimal margin in $\beta$, there exists some open neighborhood of $\beta^*$ on the $\ell_p$ sphere, $N_{\beta^*} = \{\beta : \|\beta\|_p = 1, \|\beta - \beta^*\| < \delta\}$, and an $\epsilon > 0$, such that $\min_i y_i \beta' x_i < \tilde m - \epsilon \; \forall \beta \in N_{\beta^*}$. Now by Lemma 2.3 we get that there exists $U = U(\tilde m, \tilde m - \epsilon)$ such that $t \tilde\beta$ incurs smaller loss than $t \beta$ for any $t > U$, $\beta \in N_{\beta^*}$. Therefore $\beta^*$ cannot be a convergence point of $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$.

b. $T < \infty$:

Lemma 2.4 $C(T) = 0$ and $C(T - \delta) > 0 \; \forall \delta > 0$.

Proof From condition (3), $C(T(1 - \epsilon))/C(T) = \infty$ for all $\epsilon > 0$. Both results follow immediately, with $\delta = T\epsilon$.

Lemma 2.5 $\lim_{\lambda \to 0} \min_i y_i \hat\beta(\lambda)' x_i = T$.

Proof Assume by contradiction that there is a sequence $\lambda_1, \lambda_2, \ldots \searrow 0$ and $\epsilon > 0$ s.t. $\min_i y_i \hat\beta(\lambda_k)' x_i \leq T(1 - \epsilon)$ for all $k$. Pick any separating normalized model $\tilde\beta$, i.e. $\|\tilde\beta\|_p = 1$ and $\tilde m = \min_i y_i \tilde\beta' x_i > 0$. Then for any $\lambda$, the scaled model $(T/\tilde m)\tilde\beta$ attains objective value

$\sum_i C\big(y_i (T/\tilde m) \tilde\beta' x_i\big) + \lambda (T/\tilde m)^p = \lambda (T/\tilde m)^p$
since the first term (the loss) is 0 and only the penalty remains. But $\exists \lambda_0 > 0$ s.t. $\forall \lambda < \lambda_0$, $\lambda (T/\tilde m)^p < C(T(1 - \epsilon))$, and so we get a contradiction to the optimality of $\hat\beta(\lambda_k)$ for $\lambda_k < \lambda_0$, since we assumed $\min_i y_i \hat\beta(\lambda_k)' x_i \leq T(1 - \epsilon)$ and thus:

$\sum_i C(y_i \hat\beta(\lambda_k)' x_i) \geq C(T(1 - \epsilon))$

We have thus proven that $\liminf_{\lambda \to 0} \min_i y_i \hat\beta(\lambda)' x_i \geq T$. It remains to prove equality. Assume by contradiction that for some value of $\lambda$ we have $m_\lambda = \min_i y_i \hat\beta(\lambda)' x_i > T$. Then the rescaled model $(T/m_\lambda)\hat\beta(\lambda)$ has the same zero loss as $\hat\beta(\lambda)$, but a smaller penalty, since $\|(T/m_\lambda)\hat\beta(\lambda)\|_p < \|\hat\beta(\lambda)\|_p$. So we get a contradiction to the optimality of $\hat\beta(\lambda)$.

Proof of case b.: Assume $\beta^*$ is a convergence point of $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$ as $\lambda \to 0$, with $\|\beta^*\|_p = 1$. Now assume by contradiction that $\tilde\beta$ has $\|\tilde\beta\|_p = 1$ and bigger minimal margin. Denote the minimal margins for the two models by $m^*$ and $\tilde m$, respectively, with $m^* < \tilde m$. Let $\lambda_1, \lambda_2, \ldots \searrow 0$ be a sequence along which $\hat\beta(\lambda_k)/\|\hat\beta(\lambda_k)\|_p \to \beta^*$. By Lemma 2.5 and our assumption, $\|\hat\beta(\lambda_k)\|_p \to T/m^* > T/\tilde m$. Thus, $\exists k_0$ such that $\forall k > k_0$, $\|\hat\beta(\lambda_k)\|_p > T/\tilde m$, and consequently:

$\sum_i C\big(y_i \tfrac{T}{\tilde m} \tilde\beta' x_i\big) + \lambda_k \big(\tfrac{T}{\tilde m}\big)^p = \lambda_k \big(\tfrac{T}{\tilde m}\big)^p < \lambda_k \|\hat\beta(\lambda_k)\|_p^p \leq \sum_i C(y_i \hat\beta(\lambda_k)' x_i) + \lambda_k \|\hat\beta(\lambda_k)\|_p^p$

So we get a contradiction to the optimality of $\hat\beta(\lambda_k)$.

Thus we conclude, for both cases a. and b., that any convergence point of $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$ must maximize the $\ell_p$ margin. Since $\|\hat\beta(\lambda)\|_p / \|\hat\beta(\lambda)\|_p = 1$, such convergence points obviously exist. If the $\ell_p$-margin-maximizing separating hyperplane is unique, then we can conclude:

$\frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p} \to \arg\max_{\|\beta\|_p = 1} \min_i y_i \beta' x_i$

2.1 Necessity results

A necessity result for margin maximization on any separable data seems to require either additional assumptions on the loss or a relaxation of condition (3). We conjecture that if we also require that the loss is convex and vanishing (i.e. $\lim_{m \to \infty} C(m) = 0$), then condition (3) is sufficient and necessary. However, this is still a subject for future research.

3 Examples

Support vector machines

Support vector machines (linear or kernel) can be described as a regularized problem:

(5) $\min_\beta \sum_i [1 - y_i \beta' x_i]_+ + \lambda \|\beta\|_p^p$

where $p = 2$ for the standard (2-norm) SVM and $p = 1$ for the 1-norm SVM.
This formulation is equivalent to the better-known norm-minimization SVM formulation, in the sense that the two problems have the same set of solutions as the regularization parameter $\lambda$ varies in (5) or the slack bound varies in the norm-minimization formulation.
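As a concrete illustration of the loss+penalty form (5), here is a minimal sketch (the function name, data and parameter values are illustrative, not from the paper) that evaluates the regularized hinge objective for either penalty norm:

```python
import numpy as np

def svm_objective(beta, X, y, lam, p=2):
    """Regularized hinge objective (5): sum_i [1 - y_i b'x_i]_+ + lam * ||b||_p^p."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ beta)).sum()
    penalty = lam * np.sum(np.abs(beta) ** p)
    return hinge + penalty

# Two points on opposite sides; beta = (0.5, 0) puts both at margin exactly 1.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
beta = np.array([0.5, 0.0])
print(svm_objective(beta, X, y, lam=0.1, p=2))  # hinge 0 + 0.1 * 0.25 -> 0.025
```

Setting `p=1` in the call gives the 1-norm SVM objective of the same text; only the penalty term changes.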
The loss in (5) is termed the hinge loss, since it is linear for margins less than 1, then fixed at 0 (see figure 1). The theorem obviously holds for this loss with $T = 1$, and it verifies our knowledge that the non-regularized SVM solution, which is the limit of the regularized solutions, maximizes the appropriate margin (Euclidean for the standard SVM, $\ell_1$ for the 1-norm SVM). Note that our theorem indicates that the squared hinge loss (also known as truncated squared loss), $C(y_i F(x_i)) = [1 - y_i F(x_i)]_+^2$, is also a margin-maximizing loss.

Logistic regression and boosting

The two loss functions we consider in this context are:

(6) Exponential: $C_e(m) = \exp(-m)$
(7) Log-likelihood: $C_l(m) = \log(1 + \exp(-m))$

These two loss functions are of great interest in the context of two-class classification: $C_l$ is used in logistic regression and more recently for boosting [4], while $C_e$ is the implicit loss function used by AdaBoost, the original and most famous boosting algorithm [3]. In [8] we showed that boosting approximately follows the regularized path of solutions $\hat\beta(\lambda)$ using these loss functions and $\ell_1$ regularization. We also proved that the two loss functions are very similar for positive margins, and that their regularized solutions converge to margin-maximizing separators. Theorem 2.1 provides a new proof of this result, since the theorem's condition holds with $T = \infty$ for both loss functions.

Some interesting non-examples

Commonly used classification loss functions which are not margin maximizing include polynomial loss functions, e.g. $C(m) = 1/m$ or $C(m) = 1/m^2$ (for $m > 0$): these do not guarantee convergence of regularized solutions to margin maximizing solutions, since the ratio in condition (3) stays bounded. Another interesting method in this context is linear discriminant analysis. Although it does not correspond to the loss+penalty formulation we have described, it does find a decision hyperplane in the predictor space. For both polynomial loss functions and linear discriminant analysis it is easy to find examples which show that they are not necessarily margin maximizing on separable data.
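The contrast between these examples and non-examples comes down to condition (3) with $T = \infty$: for the exponential loss the ratio $C(t(1-\epsilon))/C(t) = e^{\epsilon t}$ diverges, while for a polynomial tail such as $C(m) = 1/m^2$ it tends to the constant $(1-\epsilon)^{-2}$. A small numerical sketch (the values of $\epsilon$ and $t$ are chosen purely for illustration):

```python
import numpy as np

# Condition (3) with T = infinity: C(t(1-eps))/C(t) must diverge as t grows.
eps = 0.1
t = np.array([5.0, 20.0, 80.0])

exp_ratio = np.exp(-t * (1 - eps)) / np.exp(-t)   # = exp(eps * t): diverges
poly_ratio = (t * (1 - eps)) ** -2.0 / t ** -2.0  # = (1-eps)^-2: bounded

print(exp_ratio)   # grows without bound -> exponential loss is margin maximizing
print(poly_ratio)  # constant (0.9)^-2 -> condition (3) fails for 1/m^2
```

The same computation with the logistic loss $\log(1 + e^{-m})$ gives a ratio that also diverges, since for large $m$ the logistic loss behaves like $e^{-m}$.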
4 A multiclass generalization

Our main result can be elegantly extended to versions of multiclass logistic regression and support vector machines, as follows. Assume the response is now multiclass, with $K$ possible values, i.e. $y_i \in \{c_1, \ldots, c_K\}$. Our model consists of a prediction for each class: $F_k(x) = \sum_j \beta_j^{(k)} h_j(x)$, $h_j \in \mathcal{H}$, with the obvious prediction rule at $x$ being $\arg\max_k F_k(x)$. This gives rise to a $K - 1$ dimensional margin for each observation. For $y_i = c_k$, define the margin vector as:

(8) $m(F_1, \ldots, F_K) = (F_k - F_1, \ldots, F_k - F_{k-1}, F_k - F_{k+1}, \ldots, F_k - F_K)$

And our loss is a function of this $K - 1$ dimensional margin: $C(y_i, F_1, \ldots, F_K) = C(m(F_1, \ldots, F_K))$.
Figure 1: Margin maximizing loss functions for 2-class problems (left) and the SVM 3-class loss function of section 4.1 (right)

The $\ell_p$ regularized problem is now:

(9) $\hat\beta(\lambda) = \arg\min_{\beta^{(1)}, \ldots, \beta^{(K)}} \sum_i C\big(y_i,\, x_i'\beta^{(1)}, \ldots, x_i'\beta^{(K)}\big) + \lambda \sum_k \|\beta^{(k)}\|_p^p$

where $\beta = (\beta^{(1)}, \ldots, \beta^{(K)}) \in \mathbb{R}^{K \times |\mathcal{H}|}$.

In this formulation, the concept of margin maximization corresponds to maximizing the minimal of all $n(K-1)$ normalized $\ell_p$ margins generated by the data:

(10) $\max_{\sum_k \|\beta^{(k)}\|_p^p = 1} \; \min_i \; \min_{k \neq y_i} \; x_i'\big(\beta^{(y_i)} - \beta^{(k)}\big)$

Note that this margin maximization problem still has a natural geometric interpretation, as $x_i'(\beta^{(y_i)} - \beta^{(k)}) > 0 \; \forall i, k \neq y_i$ implies that the hyperplane $x'(\beta^{(k)} - \beta^{(l)}) = 0$ successfully separates classes $k$ and $l$, for any two classes.

Here is a generalization of the optimal separation theorem 2.1 to multiclass models:

Theorem 4.1 Assume $C(m)$ is commutative and decreasing in each coordinate. If $\exists T > 0$ (possibly $T = \infty$) such that:

(11) $\lim_{t \nearrow T} \frac{C\big(t(1-\epsilon), t u_1, \ldots, t u_{K-2}\big)}{C\big(t, t v_1, \ldots, t v_{K-2}\big)} = \infty \quad \forall \epsilon > 0, \; u_1, \ldots, u_{K-2} \geq 1 - \epsilon, \; v_1, \ldots, v_{K-2} \geq 1$

then $C$ is a margin-maximizing loss function for multiclass models, in the sense that any convergence point of the normalized solutions $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$ to (9) attains the optimal separation as defined in (10).

Idea of proof The proof is essentially identical to the two-class case, now considering the $n(K-1)$ margins on which the loss depends. Condition (11) implies that as the regularization vanishes, the model is determined by the minimal margin, and so an optimal model puts the emphasis on maximizing that margin.
Corollary 4.2 In the 2-class case, theorem 4.1 reduces to theorem 2.1.

Proof The loss depends on $\beta^{(1)} - \beta^{(2)}$, the penalty on $\|\beta^{(1)}\|_p^p + \|\beta^{(2)}\|_p^p$. An optimal solution to the regularized problem must thus have $\beta^{(1)} + \beta^{(2)} = 0$, since by transforming

$\beta^{(1)} \leftarrow \beta^{(1)} - \frac{\beta^{(1)} + \beta^{(2)}}{2}, \qquad \beta^{(2)} \leftarrow \beta^{(2)} - \frac{\beta^{(1)} + \beta^{(2)}}{2}$

we are not changing the loss, but reducing the penalty, by Jensen's inequality:

$\Big\|\frac{\beta^{(1)} - \beta^{(2)}}{2}\Big\|_p^p + \Big\|\frac{\beta^{(2)} - \beta^{(1)}}{2}\Big\|_p^p \leq \|\beta^{(1)}\|_p^p + \|\beta^{(2)}\|_p^p$

So we can conclude that $\hat\beta^{(1)}(\lambda) = -\hat\beta^{(2)}(\lambda)$, and consequently that the two margin maximization tasks (2) and (10) are equivalent.

4.1 Margin maximization in multiclass SVM and logistic regression

Here we apply theorem 4.1 to versions of multiclass logistic regression and SVM. For logistic regression, we use a slightly different formulation than the standard logistic regression model, which uses class $K$ as a reference class, i.e. assumes $\beta^{(K)} = 0$. This restriction is required for non-regularized fitting, since without it the solution is not uniquely defined. However, using regularization as in (9) guarantees that the solution will be unique, and consequently we can symmetrize the model, which allows us to apply theorem 4.1. So the loss function we use is (assume $y_i$ belongs to class $k$):

(12) $C(y_i, F_1, \ldots, F_K) = -\log \frac{e^{F_k}}{\sum_{l=1}^K e^{F_l}} = \log\Big(1 + \sum_{l \neq k} e^{-(F_k - F_l)}\Big)$

with the linear model $F_l(x_i) = x_i'\beta^{(l)}$. It is not difficult to verify that condition (11) holds for this loss function with $T = \infty$, using the fact that $\log(1 + z) = O(z)$ as $z \to 0$. The sum of exponentials which results from applying this first-order approximation satisfies (11), and as the regularization vanishes, the second-order term can be ignored.

For support vector machines, consider a multiclass loss which is a natural generalization of the two-class hinge loss:

(13) $C(m) = \sum_{j=1}^{K-1} [1 - m_j]_+$

where $m_j$ is the $j$-th component of the multi-margin $m$ as in (8). Figure 1 shows this loss for $K = 3$ classes as a function of the two margins. The loss+penalty formulation using (13) is equivalent to a standard optimization formulation of multiclass SVM (e.g. [11]):
$\max_\beta \; d \quad \text{s.t.} \quad x_i'\big(\beta^{(y_i)} - \beta^{(k)}\big) \geq d(1 - \xi_{ik}), \; \xi_{ik} \geq 0, \; i = 1, \ldots, n, \; k \in \{1, \ldots, K\} \setminus y_i; \quad \sum_{i,k} \xi_{ik} \leq B; \quad \sum_k \|\beta^{(k)}\|_p^p = 1$

As both theorem 4.1 (using $T = 1$) and the optimization formulation indicate, the regularized solutions to this problem converge to the $\ell_p$-margin-maximizing multiclass solution.
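The multiclass quantities above are easy to compute directly. The following sketch (function names and example scores are illustrative, not from the paper) evaluates the margin vector (8), the multiclass hinge loss (13), and the symmetrized logistic loss (12) for a single observation with $K = 3$ and $y$ in class 1 (index 0):

```python
import numpy as np

def margin_vector(f, k):
    """Multiclass margin (8): f_k - f_j for all j != k."""
    return np.delete(f[k] - f, k)

def multiclass_hinge(f, k):
    """Multiclass SVM loss (13): sum over the K-1 margins of [1 - m_j]_+."""
    return np.maximum(0.0, 1.0 - margin_vector(f, k)).sum()

def multiclass_logistic(f, k):
    """Symmetrized logistic loss (12): log(1 + sum_{j != k} exp(-(f_k - f_j)))."""
    return np.log1p(np.exp(-margin_vector(f, k)).sum())

f = np.array([2.0, 0.5, -1.0])   # class scores F_1, F_2, F_3 for one observation
print(margin_vector(f, 0))        # [1.5, 3.0]
print(multiclass_hinge(f, 0))     # both margins exceed 1, so the hinge loss is 0
print(multiclass_logistic(f, 0))  # log(1 + e^-1.5 + e^-3)
```

With $K = 2$ the margin vector has a single component $F_1 - F_2$, and these reduce to the two-class hinge and logistic losses of section 3, as Corollary 4.2 states.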
5 Discussion

What are the properties we would like to have in a classification loss function? Recently there has been a lot of interest in Bayes-consistency of loss functions and algorithms ([1] and references therein) as the data size increases. It turns out that practically all reasonable loss functions are consistent in that sense, although convergence rates and other measures of the degree of consistency may vary. Margin maximization, on the other hand, is a finite-sample optimality property of loss functions, which is potentially of decreasing interest as sample size grows, since the training dataset is then less likely to be separable. Note, however, that in very high dimensional predictor spaces, such as those typically used by boosting or kernel SVM, separability of any finite-size dataset is a mild assumption, which is violated only in pathological cases.

We have shown that the margin maximizing property is shared by some popular loss functions used in logistic regression, support vector machines and boosting. Knowing that these algorithms converge, as regularization vanishes, to the same model (provided they use the same regularization) is an interesting insight. So, for example, we can conclude that 1-norm support vector machines, exponential boosting and $\ell_1$-regularized logistic regression all facilitate the same non-regularized solution, which is an $\ell_1$-margin-maximizing separating hyperplane. From Mangasarian's theorem [7] we know that this hyperplane maximizes the $\ell_\infty$ distance from the closest points on either side.

The most interesting statistical question which arises is: are these optimal separating models really good for prediction, or should we expect regularized models to do better in practice? Statistical common sense, as well as the empirical evidence, seems to support the latter; for example, see the margin-maximizing experiments by Breiman [2] and Grove and Schuurmans [5].
Accepting this view implies that margin-dependent generalization error bounds cannot capture the essential elements of predicting future performance, and that theoretical efforts should rather be targeted at the loss functions themselves.

References

[1] Bartlett, P., Jordan, M. & McAuliffe, J. (2003). Convexity, classification and risk bounds. Technical report, Dept. of Statistics, UC Berkeley.
[2] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 11(7).
[3] Freund, Y. & Schapire, R.E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Proc. of 2nd European Conf. on Computational Learning Theory.
[4] Friedman, J.H., Hastie, T. & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics 28(2):337-407.
[5] Grove, A.J. & Schuurmans, D. (1998). Boosting in the limit: maximizing the margin of learned ensembles. Proc. of 15th National Conf. on Artificial Intelligence.
[6] Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag.
[7] Mangasarian, O.L. (1999). Arbitrary-norm separating plane. Operations Research Letters 24:15-23.
[8] Rosset, S., Zhu, J. & Hastie, T. (2003). Boosting as a regularized path to a maximum margin classifier. Technical report, Dept. of Statistics, Stanford University.
[9] Schapire, R.E., Freund, Y., Bartlett, P. & Lee, W.S. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 26(5):1651-1686.
[10] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer.
[11] Weston, J. & Watkins, C. (1998). Multi-class support vector machines. Technical report CSD-TR-98-04, Dept. of Computer Science, Royal Holloway, University of London.
An adaptive version of the boost by majority algorithm Yoav Freund AT&T Labs 180 Park Avenue Florham Park NJ 092 USA October 2 2000 Abstract We propose a new boosting algorithm This boosting algorithm
More informationBoosting Algorithms for Maximizing the Soft Margin
Boosting Algorithms for Maximizing the Soft Margin Manfred K. Warmuth Dept. of Engineering University of California Santa Cruz, CA, U.S.A. Karen Glocer Dept. of Engineering University of California Santa
More informationarxiv: v1 [stat.ml] 3 Apr 2017
Geometric Insights into SVM Tuning Geometric Insights into Support Vector Machine Behavior using the KKT Conditions arxiv:1704.00767v1 [stat.ml] 3 Apr 2017 Iain Carmichael Department of Statistics and
More informationF O R SOCI AL WORK RESE ARCH
7 TH EUROPE AN CONFERENCE F O R SOCI AL WORK RESE ARCH C h a l l e n g e s i n s o c i a l w o r k r e s e a r c h c o n f l i c t s, b a r r i e r s a n d p o s s i b i l i t i e s i n r e l a t i o n
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationLecture 3: Multiclass Classification
Lecture 3: Multiclass Classification KaiWei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in
More informationThe AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m
) Set W () i The AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m m =.5 1 n W (m 1) i y i h(x i ; 2 ˆθ
More informationAn Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI
An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process
More informationCOMS 4771 Regression. Nakul Verma
COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the
More informationEffective Dimension and Generalization of Kernel Learning
Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance
More informationSupport Vector Machines
Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halevhwartz he problem of characterizing learnability is the most basic question
More informationConsistency of Nearest Neighbor Methods
E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study
More informationAdaptive Boosting of Neural Networks for Character Recognition
Adaptive Boosting of Neural Networks for Character Recognition Holger Schwenk Yoshua Bengio Dept. Informatique et Recherche Opérationnelle Université de Montréal, Montreal, Qc H3C3J7, Canada fschwenk,bengioyg@iro.umontreal.ca
More informationFisher Consistency of Multicategory Support Vector Machines
Fisher Consistency o Multicategory Support Vector Machines Yueng Liu Department o Statistics and Operations Research Carolina Center or Genome Sciences University o North Carolina Chapel Hill, NC 7599360
More informationPAClearning, VC Dimension and Marginbased Bounds
More details: General: http://www.learningwithkernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAClearning, VC Dimension and Marginbased
More informationLinear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.
Preliminaries Linear models: the perceptron and closest centroid algorithms Chapter 1, 7 Definition: The Euclidean dot product beteen to vectors is the expression d T x = i x i The dot product is also
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwthaachen.de leibe@vision.rwthaachen.de Course Outline Fundamentals Bayes Decision Theory
More informationAn EM algorithm for Batch Markovian Arrival Processes and its comparison to a simpler estimation procedure
An EM algorithm for Batch Markovian Arrival Processes and its comparison to a simpler estimation procedure Lothar Breuer (breuer@info04.unitrier.de) University of Trier, Germany Abstract. Although the
More informationEnsemble learning 11/19/13. The wisdom of the crowds. Chapter 11. Ensemble methods. Ensemble methods
The wisdom of the crowds Ensemble learning Sir Francis Galton discovered in the early 1900s that a collection of educated guesses can add up to very accurate predictions! Chapter 11 The paper in which
More informationObject Recognition Using Local Characterisation and Zernike Moments
Object Recognition Using Local Characterisation and Zernike Moments A. Choksuriwong, H. Laurent, C. Rosenberger, and C. Maaoui Laboratoire Vision et Robotique  UPRES EA 2078, ENSI de Bourges  Université
More informationA Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007
Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized
More informationAdvanced Introduction to Machine Learning CMU10715
Advanced Introduction to Machine Learning CMU10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationReproducing Kernel Hilbert Spaces
9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we
More informationClassification: The rest of the story
U NIVERSITY OF ILLINOIS AT URBANACHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher
More informationLow Bias Bagged Support Vector Machines
Low Bias Bagged Support Vector Machines Giorgio Valentini Dipartimento di Scienze dell Informazione, Università degli Studi di Milano, Italy INFM, Istituto Nazionale per la Fisica della Materia, Italy.
More informationLoss Functions for Preference Levels: Regression with Discrete Ordered Labels
Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,
More informationOn the Performance of Random Vector Quantization Limited Feedback Beamforming in a MISO System
1 On the Performance of Random Vector Quantization Limited Feedback Beamforming in a MISO System Chun Kin AuYeung, Student Member, IEEE, and David J. Love, Member, IEEE Abstract In multiple antenna wireless
More informationFrom Lasso regression to Feature vector machine
From Lasso regression to Feature vector machine Fan Li, Yiming Yang and Eric P. Xing,2 LTI and 2 CALD, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA USA 523 {hustlf,yiming,epxing}@cs.cmu.edu
More informationBoosting: Foundations and Algorithms. Rob Schapire
Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and nonspam: From: yoav@ucsd.edu Rob, can you
More informationData Mining und Maschinelles Lernen
Data Mining und Maschinelles Lernen Ensemble Methods BiasVariance Tradeoff Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking ErrorCorrecting
More informationEnsembles of Classifiers.
Ensembles of Classifiers www.biostat.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts ensemble bootstrap sample bagging boosting random forests error correcting
More information1. Kernel ridge regression In contrast to ordinary least squares which has a cost function. m (θ T x (i) y (i) ) 2, J(θ) = 1 2.
CS229 Problem Set #2 Solutions 1 CS 229, Public Course Problem Set #2 Solutions: Theory Kernels, SVMs, and 1. Kernel ridge regression In contrast to ordinary least squares which has a cost function J(θ)
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppäaho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationLINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input
More informationCS446: Machine Learning Fall Final Exam. December 6 th, 2016
CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains
More informationModified Logistic Regression: An Approximation to SVM and Its Applications in LargeScale Text Categorization
Modified Logistic Regression: An Approximation to SVM and Its Applications in LargeScale Text Categorization Jian Zhang jian.zhang@cs.cmu.edu Rong Jin rong@cs.cmu.edu Yiming Yang yiming@cs.cmu.edu Alex
More informationA Bias Correction for the Minimum Error Rate in Crossvalidation
A Bias Correction for the Minimum Error Rate in Crossvalidation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by crossvalidation.
More informationConvex envelopes, cardinality constrained optimization and LASSO. An application in supervised learning: support vector machines (SVMs)
ORF 523 Lecture 8 Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Any typos should be emailed to a a a@princeton.edu. 1 Outline Convexitypreserving operations Convex envelopes, cardinality
More informationMIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE
MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on
More informationCollaborative Filtering via Ensembles of Matrix Factorizations
Collaborative Ftering via Ensembles of Matrix Factorizations Mingrui Wu Max Planck Institute for Biological Cybernetics Spemannstrasse 38, 72076 Tübingen, Germany mingrui.wu@tuebingen.mpg.de ABSTRACT We
More informationNearest Neighbors Methods for Support Vector Machines
Nearest Neighbors Methods for Support Vector Machines A. J. Quiroz, Dpto. de Matemáticas. Universidad de Los Andes joint work with María GonzálezLima, Universidad Simón Boĺıvar and Sergio A. Camelo, Universidad
More informationStatistical and Computational Learning Theory
Statistical and Computational Learning Theory Fundamental Question: Predict Error Rates Given: Find: The space H of hypotheses The number and distribution of the training examples S The complexity of the
More informationRegularized Linear Models in Stacked Generalization
Regularized Linear Models in Stacked Generalization Sam Reid and Greg Grudic University of Colorado at Boulder, Boulder CO 803090430, USA Abstract Stacked generalization is a flexible method for multiple
More informationSparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results
Sparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results Peter L. Bartlett 1 and Ambuj Tewari 2 1 Division of Computer Science and Department of Statistics University of California,
More informationDATA MINING AND MACHINE LEARNING
DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems
More information