# Margin Maximizing Loss Functions


Saharon Rosset, Ji Zhu and Trevor Hastie
Department of Statistics, Stanford University, Stanford, CA

**Abstract.** Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines.

## 1 Introduction

Assume we have a classification learning sample $\{x_i, y_i\}_{i=1}^n$ with $y_i \in \{-1, +1\}$. We wish to build a model $F(x)$ for this data by minimizing (exactly or approximately) a loss criterion $\sum_i C(y_i, F(x_i)) = \sum_i C(y_i F(x_i))$, which is a function of the margins $y_i F(x_i)$ of this model on this data. Most common classification modeling approaches can be cast in this framework: logistic regression, support vector machines, boosting and more.

The model $F(x)$ which these methods actually build is a linear combination of dictionary functions coming from a dictionary $\mathcal{H}$, which can be large or even infinite:

$$F(x) = \sum_{h_j \in \mathcal{H}} \beta_j h_j(x)$$

and our prediction at point $x$ based on this model is $\operatorname{sgn} F(x)$. When $\mathcal{H}$ is large, as is the case in most boosting or kernel SVM applications, some regularization is needed to control the complexity of the model $F(x)$ and the resulting overfitting.
Thus, it is common that the quantity actually minimized on the data is a regularized version of the loss function:

$$\hat\beta(\lambda) = \arg\min_\beta \sum_i C\big(y_i\,\beta' h(x_i)\big) + \lambda\|\beta\|_p^p \tag{1}$$

where $h(x_i)$ is the vector of evaluations of the dictionary functions at $x_i$, the second term penalizes the $\ell_p$ norm of the coefficient vector ($p \ge 1$ for convexity, and in practice usually $p \in \{1, 2\}$), and $\lambda > 0$ is a tuning regularization parameter. The 1- and 2-norm support vector machine training problems with slack can be cast in this form ([6], chapter 12). In [8] we have shown that boosting approximately follows the

path of regularized solutions traced by (1) as the regularization parameter varies, with the appropriate loss and an $\ell_1$ penalty. The main question that we answer in this paper is: for what loss functions does $\hat\beta(\lambda)$ converge to an optimal separator as $\lambda \to 0$? The definition of "optimal" which we will use depends on the $\ell_p$ norm used for regularization, and we will term it the $\ell_p$-margin maximizing separating hyper-plane. More concretely, we will investigate for which loss functions, and under which conditions, we have:

$$\lim_{\lambda \to 0} \frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p} = \arg\max_{\|\beta\|_p = 1}\ \min_i y_i\beta' h(x_i) \tag{2}$$

This margin maximizing property is interesting for two distinct reasons. First, it gives us a geometric interpretation of the limiting model as we relax the regularization. It tells us that this loss seeks to optimally separate the data by maximizing a distance between a separating hyper-plane and the closest points. A theorem by Mangasarian [7] allows us to interpret $\ell_p$ margin maximization as $\ell_q$ distance maximization, with $1/p + 1/q = 1$, and hence make a clear geometric interpretation. Second, from a learning theory perspective, large margins are an important quantity: generalization error bounds that depend on the margins have been generated for support vector machines ([10], using $\ell_2$ margins) and boosting ([9], using $\ell_1$ margins). Thus, showing that a loss function is margin maximizing in this sense is useful and promising information regarding this loss function's potential for generating good prediction models.

Our main result is a sufficient condition on the loss function which guarantees that (2) holds if the data is separable, i.e. if the maximum on the RHS of (2) is positive. This condition is presented and proven in section 2. It covers the hinge loss of support vector machines, the logistic log-likelihood loss of logistic regression, and the exponential loss, most notably used in boosting. We discuss these and other examples in section 3. Our result generalizes elegantly to multi-class models and loss functions.
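The limit (2) can be illustrated numerically. The following sketch is ours, not the paper's: it minimizes the exponential loss with an $\ell_2$ penalty (a particular instance of (1)) on an assumed toy two-Gaussian dataset, and tracks the normalized solution and its minimal margin as $\lambda$ shrinks.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (our own assumption, not from the paper).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, (20, 2)),   # class +1 cloud
               rng.normal(-2.0, 0.5, (20, 2))])  # class -1 cloud
y = np.concatenate([np.ones(20), -np.ones(20)])

def objective(beta, lam):
    margins = y * (X @ beta)
    # Exponential loss + l_2 penalty: an instance of (1) with C(m) = exp(-m), p = 2.
    return np.sum(np.exp(-margins)) + lam * np.sum(beta ** 2)

for lam in [1.0, 1e-3, 1e-6]:
    beta = minimize(objective, np.zeros(2), args=(lam,)).x
    beta_dir = beta / np.linalg.norm(beta)   # normalized solution beta_hat / ||beta_hat||
    min_margin = np.min(y * (X @ beta_dir))  # minimal l_2 margin of that direction
    print(f"lambda={lam:g}  direction={np.round(beta_dir, 3)}  min margin={min_margin:.3f}")
```

As $\lambda$ decreases, the printed direction stabilizes and its minimal normalized margin approaches the best achievable $\ell_2$ margin on this sample, as (2) predicts.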
We present the resulting margin-maximizing versions of SVMs and logistic regression in section 4.

## 2 Sufficient condition for margin maximization

The following theorem shows that if the loss function vanishes quickly enough, then it will be margin-maximizing as the regularization vanishes. It provides us with a unified margin-maximization theory, covering SVMs, logistic regression and boosting.

**Theorem 2.1** *Assume the data $\{x_i, y_i\}_{i=1}^n$ is separable, i.e. $\exists\beta$ s.t. $\min_i y_i\beta' h(x_i) > 0$. Let $C(y, f) = C(yf)$ be a monotone non-increasing loss function depending on the margin only. If $\exists\, T > 0$ (possibly $T = \infty$) such that:*

$$\lim_{t \nearrow T} \frac{C\big(t(1-\epsilon)\big)}{C(t)} = \infty,\qquad \forall\,\epsilon > 0 \tag{3}$$

*then $C$ is a margin maximizing loss function, in the sense that any convergence point of the normalized solutions $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$ to the regularized problems (1), as $\lambda \to 0$, is an $\ell_p$ margin-maximizing separating hyper-plane. Consequently, if this margin-maximizing hyper-plane is unique, then the solutions converge to it:*

$$\lim_{\lambda \to 0} \frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p} = \arg\max_{\|\beta\|_p = 1}\ \min_i y_i\beta' h(x_i) \tag{4}$$

**Proof** We prove the result separately for $T = \infty$ and $T < \infty$.

**a. $T = \infty$:**

**Lemma 2.2** $\|\hat\beta(\lambda)\|_p \to \infty$ as $\lambda \to 0$.

**Proof** Since $T = \infty$, $C(m) > 0\ \forall m$ and $\lim_{m\to\infty} C(m) = 0$. Therefore, for loss+penalty to vanish as $\lambda \to 0$, $\|\hat\beta(\lambda)\|_p$ must diverge, to allow the margins to diverge.

**Lemma 2.3** Assume $\beta_1, \beta_2$ are two separating models, with $\|\beta_1\|_p = \|\beta_2\|_p = 1$, and $\beta_1$ separates the data better, i.e.:

$$0 < m_2 := \min_i y_i\beta_2' h(x_i) < m_1 := \min_i y_i\beta_1' h(x_i)$$

Then $\exists\, U = U(m_1, m_2)$ such that

$$\forall t > U:\quad \sum_i C\big(y_i\, t\beta_1' h(x_i)\big) < \sum_i C\big(y_i\, t\beta_2' h(x_i)\big)$$

In words, if $\beta_1$ separates better than $\beta_2$, then scaled-up versions of $\beta_1$ will incur smaller loss than scaled-up versions of $\beta_2$, if the scaling factor is large enough.

**Proof** Since condition (3) holds with $T = \infty$, there exists $U$ such that $\forall t > U,\ C(t m_2)/C(t m_1) > n$. Thus, from $C$ being non-increasing, we immediately get:

$$\forall t > U:\quad \sum_i C\big(y_i\, t\beta_1' h(x_i)\big) \le n\, C(t m_1) < C(t m_2) \le \sum_i C\big(y_i\, t\beta_2' h(x_i)\big)$$

**Proof of case a.** Assume $\beta^*$ is a convergence point of $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$ as $\lambda \to 0$, with $\|\beta^*\|_p = 1$. Now assume by contradiction that $\tilde\beta$ has $\|\tilde\beta\|_p = 1$ and a bigger minimal $\ell_p$ margin. Denote the minimal margins for the two models by $m^*$ and $\tilde m$, respectively, with $m^* < \tilde m$. By continuity of the minimal margin in $\beta$, there exists some open neighborhood of $\beta^*$ on the $\ell_p$ sphere:

$$N_{\beta^*} = \{\beta : \|\beta\|_p = 1,\ \|\beta - \beta^*\| < \delta\}$$

and an $\epsilon > 0$, such that: $\beta \in N_{\beta^*} \Rightarrow \min_i y_i\beta' h(x_i) < \tilde m - \epsilon$. Now by lemma 2.3 we get that there exists $U = U(\tilde m, \tilde m - \epsilon)$ such that $t\tilde\beta$ incurs smaller loss than $t\beta$ for any $t > U,\ \beta \in N_{\beta^*}$. Therefore $\beta^*$ cannot be a convergence point of $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$.

**b. $T < \infty$:**

**Lemma 2.4** $C(T) = 0$ and $C(T - \delta) > 0\ \forall\delta > 0$.

**Proof** From condition (3), $\lim_{t \nearrow T} C(t(1-\epsilon))/C(t) = \infty$. Both results follow immediately, with $\delta = T\epsilon$.

**Lemma 2.5** $\lim_{\lambda \to 0}\ \min_i y_i \hat\beta(\lambda)' h(x_i) = T$.

**Proof** Assume by contradiction that there is a sequence $\lambda_1 > \lambda_2 > \cdots \searrow 0$ and $\epsilon > 0$ s.t. $\min_i y_i \hat\beta(\lambda_k)' h(x_i) \le T - \epsilon$. Pick any separating normalized model $\tilde\beta$, i.e. $\|\tilde\beta\|_p = 1$ and $\tilde m = \min_i y_i\tilde\beta' h(x_i) > 0$. Then for the scaled-up model $(T/\tilde m)\,\tilde\beta$ we get:

$$\sum_i C\big(y_i\,(T/\tilde m)\,\tilde\beta' h(x_i)\big) + \lambda\,(T/\tilde m)^p = \lambda\,(T/\tilde m)^p$$

since the first term (the loss) is 0, all margins of $(T/\tilde m)\tilde\beta$ being at least $T$. But $\exists\lambda_0 > 0$ s.t. $\forall\lambda < \lambda_0,\ \lambda(T/\tilde m)^p < C(T - \epsilon)$, and so we get a contradiction to the optimality of $\hat\beta(\lambda_k)$ for $k$ large enough, since we assumed $\min_i y_i \hat\beta(\lambda_k)' h(x_i) \le T - \epsilon$ and thus:

$$\sum_i C\big(y_i \hat\beta(\lambda_k)' h(x_i)\big) \ge C(T - \epsilon)$$

We have thus proven that $\liminf_{\lambda \to 0}\ \min_i y_i \hat\beta(\lambda)' h(x_i) \ge T$. It remains to prove equality. Assume by contradiction that for some value of $\lambda$ we have $m := \min_i y_i \hat\beta(\lambda)' h(x_i) > T$. Then the re-scaled model $(T/m)\hat\beta(\lambda)$ has the same zero loss as $\hat\beta(\lambda)$, but a smaller penalty, since $\|(T/m)\hat\beta(\lambda)\|_p < \|\hat\beta(\lambda)\|_p$. So we get a contradiction to the optimality of $\hat\beta(\lambda)$.

**Proof of case b.** Assume $\beta^*$ is a convergence point of $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$ as $\lambda \to 0$, with $\|\beta^*\|_p = 1$. Now assume by contradiction that $\tilde\beta$ has $\|\tilde\beta\|_p = 1$ and a bigger minimal margin. Denote the minimal margins for the two models by $m^*$ and $\tilde m$, respectively, with $m^* < \tilde m$. Let $\lambda_1 > \lambda_2 > \cdots \searrow 0$ be a sequence along which $\hat\beta(\lambda_k)/\|\hat\beta(\lambda_k)\|_p \to \beta^*$. By lemma 2.5 and our assumption, $\|\hat\beta(\lambda_k)\|_p \to T/m^* > T/\tilde m$. Thus, $\exists k_0$ such that $\forall k > k_0,\ \|\hat\beta(\lambda_k)\|_p > T/\tilde m$, and consequently:

$$\sum_i C\big(y_i\hat\beta(\lambda_k)' h(x_i)\big) + \lambda_k\|\hat\beta(\lambda_k)\|_p^p > \lambda_k (T/\tilde m)^p = \sum_i C\big(y_i (T/\tilde m)\tilde\beta' h(x_i)\big) + \lambda_k\big\|(T/\tilde m)\tilde\beta\big\|_p^p$$

So we get a contradiction to the optimality of $\hat\beta(\lambda_k)$.

Thus we conclude, for both cases a. and b., that any convergence point of $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$ must maximize the $\ell_p$ margin. Since $\big\|\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p\big\|_p = 1$, such convergence points obviously exist. If the $\ell_p$-margin-maximizing separating hyper-plane is unique, then we can conclude:

$$\lim_{\lambda\to 0} \frac{\hat\beta(\lambda)}{\|\hat\beta(\lambda)\|_p} = \arg\max_{\|\beta\|_p = 1}\ \min_i y_i\beta' h(x_i)$$

### 2.1 Necessity results

A necessity result for margin maximization on any separable data seems to require either additional assumptions on the loss or a relaxation of condition (3). We conjecture that if we also require that the loss is convex and vanishing (i.e. $\lim_{m\to\infty} C(m) = 0$), then condition (3) is sufficient and necessary. However, this is still a subject for future research.

## 3 Examples

**Support vector machines.** Support vector machines (linear or kernel) can be described as a regularized problem:

$$\hat\beta(\lambda) = \arg\min_\beta \sum_i \big(1 - y_i\beta' h(x_i)\big)_+ + \lambda\|\beta\|_p^p \tag{5}$$

where $p = 2$ for the standard ("2-norm") SVM and $p = 1$ for the 1-norm SVM.
This formulation is equivalent to the better-known norm-minimization SVM formulation, in the sense that the two have the same set of solutions as the regularization parameter $\lambda$ varies in (5) or the slack bound varies in the norm-minimization formulation.
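To make the regularized formulation (5) concrete, here is a minimal sketch (our own solver and toy data, not from the paper): plain subgradient descent on the hinge loss with an $\ell_2$ penalty, for a few values of $\lambda$.

```python
import numpy as np

# Toy separable data (an assumption for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.5, 0.4, (15, 2)), rng.normal(-1.5, 0.4, (15, 2))])
y = np.concatenate([np.ones(15), -np.ones(15)])

def solve_hinge(lam, steps=5000, lr=0.05):
    """Subgradient descent on sum_i (1 - y_i beta'x_i)_+ + lam * ||beta||_2^2."""
    beta = np.zeros(2)
    for t in range(steps):
        active = y * (X @ beta) < 1                        # points inside the hinge
        subgrad = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * beta
        beta -= lr / np.sqrt(t + 1) * subgrad              # decaying step size
    return beta

for lam in [1.0, 0.1, 0.01]:
    beta = solve_hinge(lam)
    direction = beta / np.linalg.norm(beta)
    print(f"lambda={lam}: min l2 margin of direction = "
          f"{np.min(y * (X @ direction)):.3f}")
```

For small $\lambda$ the normalized direction stabilizes near the hard-margin SVM direction, consistent with the paper's claim that the regularized solutions trace a path toward the margin-maximizing separator.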

The loss in (5) is termed the hinge loss, since it is linear for margins less than 1, then fixed at 0 (see figure 1). The theorem obviously holds for $T = 1$, and it verifies our knowledge that the non-regularized SVM solution, which is the limit of the regularized solutions, maximizes the appropriate margin (Euclidean for the standard SVM, $\ell_1$ for the 1-norm SVM). Note that our theorem indicates that the squared hinge loss (a.k.a. truncated squared loss)

$$C\big(y_i F(x_i)\big) = \big(1 - y_i F(x_i)\big)_+^2$$

is also a margin-maximizing loss.

**Logistic regression and boosting.** The two loss functions we consider in this context are:

$$\text{Exponential:}\quad C_e(m) = \exp(-m) \tag{6}$$

$$\text{Log-likelihood:}\quad C_l(m) = \log\big(1 + \exp(-m)\big) \tag{7}$$

These two loss functions are of great interest in the context of two-class classification: $C_l$ is used in logistic regression and more recently for boosting [4], while $C_e$ is the implicit loss function used by AdaBoost, the original and most famous boosting algorithm [3]. In [8] we showed that boosting approximately follows the regularized path of solutions $\hat\beta(\lambda)$ using these loss functions and $\ell_1$ regularization. We also proved that the two loss functions are very similar for positive margins, and that their regularized solutions converge to margin-maximizing separators. Theorem 2.1 provides a new proof of this result, since the theorem's condition holds with $T = \infty$ for both loss functions.

**Some interesting non-examples.** Classification loss functions which are not margin maximizing include polynomially decreasing loss functions, e.g. $C(m) = 1/m$, $C(m) = 1/m^2$, etc. for positive margins: for these the ratio in (3) converges to a finite constant, so they do not guarantee convergence of regularized solutions to margin maximizing solutions. Another interesting method in this context is linear discriminant analysis. Although it does not correspond to the loss+penalty formulation we have described, it does find a decision hyper-plane in the predictor space. For both polynomial loss functions and linear discriminant analysis, it is easy to find examples which show that they are not necessarily margin maximizing on separable data.
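Condition (3) can be probed numerically for the losses above. The sketch below (ours, not the paper's) evaluates the ratio $C(t(1-\epsilon))/C(t)$ for the exponential and logistic losses ($T = \infty$), a polynomially decreasing loss (which fails the condition), and the hinge loss ($T = 1$).

```python
import numpy as np

eps = 0.1  # the epsilon of condition (3)

# Losses with T = infinity: probe the ratio at growing t.
exponential = lambda m: np.exp(-m)
logistic    = lambda m: np.log1p(np.exp(-m))
polynomial  = lambda m: 1.0 / m          # decreasing, but only polynomially

results = {}
for name, C in [("exponential", exponential),
                ("logistic", logistic),
                ("polynomial", polynomial)]:
    results[name] = [C(t * (1 - eps)) / C(t) for t in (10.0, 50.0, 100.0)]
    print(name, np.round(results[name], 3))

# Hinge loss with T = 1: probe the ratio as t approaches 1 from below.
hinge = lambda m: np.maximum(1.0 - m, 0.0)
print("hinge", [round(float(hinge(t * (1 - eps)) / hinge(t)), 1)
                for t in (0.9, 0.99, 0.999)])
```

The exponential, logistic and hinge ratios blow up as $t$ approaches their respective $T$, while the polynomial ratio stays pinned at the constant $1/(1-\epsilon)$, which is exactly why the first three are margin maximizing and the last is not.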
## 4 A multi-class generalization

Our main result can be elegantly extended to versions of multi-class logistic regression and support vector machines, as follows. Assume the response is now multi-class, with $K$ possible values, i.e. $y_i \in \{1, \ldots, K\}$. Our model consists of a prediction for each class:

$$F_k(x) = \sum_{h_j \in \mathcal{H}} \beta_j^{(k)} h_j(x),\qquad k = 1, \ldots, K$$

with the obvious prediction rule at $x$ being $\arg\max_k F_k(x)$. This gives rise to a $K-1$ dimensional margin for each observation. For $y = k$, define the margin vector as:

$$m_k(f_1, \ldots, f_K) = \big(f_k - f_1,\ \ldots,\ f_k - f_{k-1},\ f_k - f_{k+1},\ \ldots,\ f_k - f_K\big) \tag{8}$$

And our loss is a function of this $K-1$ dimensional margin:

$$C(y, f_1, \ldots, f_K) = \sum_{k=1}^K I\{y = k\}\, C\big(m_k(f_1, \ldots, f_K)\big)$$
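The margin vector (8) is straightforward to compute; a small sketch with our own toy scores (nothing here comes from the paper's experiments):

```python
import numpy as np

def margin_vector(f, k):
    """m_k(f_1,...,f_K) of (8): the differences f_k - f_j for all j != k."""
    f = np.asarray(f, dtype=float)
    return np.delete(f[k] - f, k)

f = [2.0, 0.5, -1.0]        # model scores F_1(x), F_2(x), F_3(x) at some point x
print(margin_vector(f, 0))  # true class 1: all entries positive -> correctly classified
print(margin_vector(f, 2))  # true class 3: negative entries -> misclassified
```

An observation is correctly classified exactly when every coordinate of its margin vector is positive, i.e. when the true class's score beats all the others.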

*Figure 1: Margin maximizing loss functions for 2-class problems (left) and the SVM 3-class loss function of section 4.1 (right).*

The $\ell_p$-regularized problem is now:

$$\hat\beta(\lambda) = \arg\min_{\beta^{(1)}, \ldots, \beta^{(K)}} \sum_i C\big(y_i,\ h(x_i)'\beta^{(1)}, \ldots, h(x_i)'\beta^{(K)}\big) + \lambda\sum_{k=1}^K \|\beta^{(k)}\|_p^p \tag{9}$$

where $\hat\beta(\lambda) = \big(\hat\beta^{(1)}(\lambda)', \ldots, \hat\beta^{(K)}(\lambda)'\big)'$ collects the coefficient vectors of all $K$ classes. In this formulation, the concept of margin maximization corresponds to maximizing the minimal of all $n(K-1)$ normalized $\ell_p$-margins generated by the data:

$$\max_{\sum_k \|\beta^{(k)}\|_p^p = 1}\ \min_i\ \min_{k \ne y_i}\ h(x_i)'\big(\beta^{(y_i)} - \beta^{(k)}\big) \tag{10}$$

Note that this margin maximization problem still has a natural geometric interpretation, as $h(x_i)'(\beta^{(y_i)} - \beta^{(k)}) > 0\ \forall i,\ \forall k \ne y_i$ implies that the hyper-plane $h(x)'(\beta^{(j)} - \beta^{(k)}) = 0$ successfully separates classes $j$ and $k$, for any two classes.

Here is a generalization of the optimal separation theorem 2.1 to multi-class models:

**Theorem 4.1** *Assume $C(m)$ is commutative (i.e. symmetric in the $K-1$ margin coordinates) and decreasing in each coordinate. If $\exists\, T > 0$ (possibly $T = \infty$) such that:*

$$\lim_{t \nearrow T} \frac{C(t u_1, \ldots, t u_{K-1})}{C(t v_1, \ldots, t v_{K-1})} = \infty,\qquad \forall\ \min(u_1, \ldots, u_{K-1}) > \min(v_1, \ldots, v_{K-1}) \tag{11}$$

*then $C$ is a margin-maximizing loss function for multi-class models, in the sense that any convergence point of the normalized solutions $\hat\beta(\lambda)/\|\hat\beta(\lambda)\|_p$ to (9) attains the optimal separation as defined in (10).*

**Idea of proof** The proof is essentially identical to the two-class case, now considering the $n(K-1)$ margins on which the loss depends. The condition (11) implies that as the regularization vanishes the model is determined by the minimal margin, and so an optimal model puts the emphasis on maximizing that margin.

**Corollary 4.2** *In the 2-class case, theorem 4.1 reduces to theorem 2.1.*

**Proof** The loss depends on $\beta^{(1)} - \beta^{(2)}$, the penalty on $\|\beta^{(1)}\|_p^p + \|\beta^{(2)}\|_p^p$. An optimal solution to the regularized problem must thus have $\beta^{(1)} = -\beta^{(2)}$, since by transforming:

$$\beta^{(1)} \leftarrow \frac{\beta^{(1)} - \beta^{(2)}}{2},\qquad \beta^{(2)} \leftarrow \frac{\beta^{(2)} - \beta^{(1)}}{2}$$

we are not changing the loss, but reducing the penalty, by Jensen's inequality:

$$\left\|\frac{\beta^{(1)} - \beta^{(2)}}{2}\right\|_p^p \le \frac{\|\beta^{(1)}\|_p^p + \|\beta^{(2)}\|_p^p}{2}$$

So we can conclude that $\hat\beta^{(1)}(\lambda) = -\hat\beta^{(2)}(\lambda)$, and consequently that the two margin maximization tasks (2), (10) are equivalent.

### 4.1 Margin maximization in multi-class SVM and logistic regression

Here we apply theorem 4.1 to versions of multi-class logistic regression and SVM. For logistic regression, we use a slightly different formulation than the standard logistic regression model, which uses class $K$ as a reference class, i.e. assumes $\beta^{(K)} = 0$. This is required for non-regularized fitting, since without it the solution is not uniquely defined. However, using regularization as in (9) guarantees that the solution will be unique, and consequently we can symmetrize the model, which allows us to apply theorem 4.1. So the loss function we use is (assume $y_i$ belongs to class $k$):

$$C(y_i;\ f_1, \ldots, f_K) = -\log \frac{e^{f_k}}{\sum_{l=1}^K e^{f_l}} = \log\Big(1 + \sum_{l \ne k} e^{f_l - f_k}\Big) \tag{12}$$

with the linear model: $f_l(x_i) = h(x_i)'\beta^{(l)}$. It is not difficult to verify that condition (11) holds for this loss function with $T = \infty$, using the fact that $\log(1 + z) = O(z)$ as $z \to 0$. The sum of exponentials which results from applying this first-order approximation satisfies (11), and as $z \to 0$ the second-order term can be ignored.

For support vector machines, consider a multi-class loss which is a natural generalization of the two-class hinge loss:

$$C(m) = \sum_{j=1}^{K-1} (1 - m_j)_+ \tag{13}$$

where $m_j$ is the $j$-th component of the multi-margin $m$ as in (8). Figure 1 shows this loss for $K = 3$ classes as a function of the two margins. The loss+penalty formulation using (13) is equivalent to a standard optimization formulation of multi-class SVM (e.g. [11]):
$$\max_{\beta}\ M \quad \text{s.t.}\quad h(x_i)'\big(\beta^{(y_i)} - \beta^{(k)}\big) \ge M,\quad i = 1, \ldots, n,\ \ k \in \{1, \ldots, K\} \setminus \{y_i\};\qquad \sum_k \|\beta^{(k)}\|_p^p \le 1$$

As both theorem 4.1 (using $T = 1$) and the optimization formulation indicate, the regularized solutions to this problem converge to the $\ell_p$ margin maximizing multi-class solution.
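The two multi-class losses above can be sketched directly from the margin vector of (8). The following is our own illustrative encoding of (12) and (13) on assumed toy scores, not code from the paper:

```python
import numpy as np

def margin_vector(f, k):
    """m_k(f_1,...,f_K) of (8): differences f_k - f_j for all j != k."""
    f = np.asarray(f, dtype=float)
    return np.delete(f[k] - f, k)

def logistic_loss(f, k):
    # (12): log(1 + sum_{l != k} exp(f_l - f_k)), written via the margin vector
    return np.log1p(np.sum(np.exp(-margin_vector(f, k))))

def hinge_loss(f, k):
    # (13): sum of (1 - m_j)_+ over the K-1 margin components
    return np.sum(np.maximum(1.0 - margin_vector(f, k), 0.0))

f = [2.0, 0.5, -1.0]  # model scores for K = 3 classes at some point x
print(logistic_loss(f, 0), hinge_loss(f, 0))  # true class well ahead: small losses
print(logistic_loss(f, 2), hinge_loss(f, 2))  # true class trailing: large losses
```

Note that the hinge loss (13) is exactly zero once every margin component exceeds 1, whereas the logistic loss (12) is positive everywhere and only vanishes in the limit, mirroring the $T = 1$ versus $T = \infty$ distinction of theorem 4.1.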

## 5 Discussion

What are the properties we would like to have in a classification loss function? Recently there has been a lot of interest in Bayes-consistency of loss functions and algorithms ([1] and references therein) as the data size increases. It turns out that practically all reasonable loss functions are consistent in that sense, although convergence rates and other measures of degree of consistency may vary. Margin maximization, on the other hand, is a finite-sample optimality property of loss functions, which is potentially of decreasing interest as sample size grows, since the training data-set is less likely to be separable. Note, however, that in very high dimensional predictor spaces, such as those typically used by boosting or kernel SVM, separability of any finite-size data-set is a mild assumption, which is violated only in pathological cases.

We have shown that the margin maximizing property is shared by some popular loss functions used in logistic regression, support vector machines and boosting. Knowing that these algorithms converge, as regularization vanishes, to the same model (provided they use the same regularization) is an interesting insight. So, for example, we can conclude that 1-norm support vector machines, exponential boosting and $\ell_1$-regularized logistic regression all facilitate the same non-regularized solution, which is an $\ell_1$-margin maximizing separating hyper-plane. From Mangasarian's theorem [7] we know that this hyper-plane maximizes the $\ell_\infty$ distance from the closest points on either side.

The most interesting statistical question which arises is: are these optimal separating models really good for prediction, or should we expect regularized models to do better in practice? Statistical common sense, as well as the empirical evidence, seem to support the latter; for example, see the margin-maximizing experiments by Breiman [2] and Grove and Schuurmans [5].
Accepting this view implies that margin-dependent generalization error bounds cannot capture the essential elements of predicting future performance, and that theoretical efforts should rather be targeted at the loss functions themselves.

## References

[1] Bartlett, P., Jordan, M. & McAuliffe, J. (2003). Convexity, Classification and Risk Bounds. Technical report, Dept. of Statistics, UC Berkeley.

[2] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 7.

[3] Freund, Y. & Schapire, R.E. (1995). A decision theoretic generalization of on-line learning and an application to boosting. Proc. of 2nd European Conf. on Computational Learning Theory.

[4] Friedman, J.H., Hastie, T. & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics 28.

[5] Grove, A.J. & Schuurmans, D. (1998). Boosting in the limit: maximizing the margin of learned ensembles. Proc. of 15th National Conf. on AI.

[6] Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag.

[7] Mangasarian, O.L. (1999). Arbitrary-norm separating plane. Operations Research Letters, 15-23.

[8] Rosset, S., Zhu, J. & Hastie, T. (2003). Boosting as a regularized path to a maximum margin classifier. Technical report, Dept. of Statistics, Stanford Univ.

[9] Schapire, R.E., Freund, Y., Bartlett, P. & Lee, W.S. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 26(5).

[10] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer.

[11] Weston, J. & Watkins, C. (1998). Multi-class support vector machines. Technical report CSD-TR-98-04, Dept. of CS, Royal Holloway, University of London.


### arxiv: v1 [stat.ml] 3 Apr 2017

Geometric Insights into SVM Tuning Geometric Insights into Support Vector Machine Behavior using the KKT Conditions arxiv:1704.00767v1 [stat.ml] 3 Apr 2017 Iain Carmichael Department of Statistics and

### F O R SOCI AL WORK RESE ARCH

7 TH EUROPE AN CONFERENCE F O R SOCI AL WORK RESE ARCH C h a l l e n g e s i n s o c i a l w o r k r e s e a r c h c o n f l i c t s, b a r r i e r s a n d p o s s i b i l i t i e s i n r e l a t i o n

### Machine Learning Lecture 7

Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

### Lecture 3: Multiclass Classification

Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in

### The AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m

) Set W () i The AdaBoost algorithm =1/n for i =1,...,n 1) At the m th iteration we find (any) classifier h(x; ˆθ m ) for which the weighted classification error m m =.5 1 n W (m 1) i y i h(x i ; 2 ˆθ

### An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

### COMS 4771 Regression. Nakul Verma

COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the

### Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

### Support Vector Machines

Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6

### Generalization bounds

Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

### Consistency of Nearest Neighbor Methods

E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

### Adaptive Boosting of Neural Networks for Character Recognition

Adaptive Boosting of Neural Networks for Character Recognition Holger Schwenk Yoshua Bengio Dept. Informatique et Recherche Opérationnelle Université de Montréal, Montreal, Qc H3C-3J7, Canada fschwenk,bengioyg@iro.umontreal.ca

### Fisher Consistency of Multicategory Support Vector Machines

Fisher Consistency o Multicategory Support Vector Machines Yueng Liu Department o Statistics and Operations Research Carolina Center or Genome Sciences University o North Carolina Chapel Hill, NC 7599-360

### PAC-learning, VC Dimension and Margin-based Bounds

More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

### Linear models: the perceptron and closest centroid algorithms. D = {(x i,y i )} n i=1. x i 2 R d 9/3/13. Preliminaries. Chapter 1, 7.

Preliminaries Linear models: the perceptron and closest centroid algorithms Chapter 1, 7 Definition: The Euclidean dot product beteen to vectors is the expression d T x = i x i The dot product is also

### Machine Learning Lecture 5

Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

### An EM algorithm for Batch Markovian Arrival Processes and its comparison to a simpler estimation procedure

An EM algorithm for Batch Markovian Arrival Processes and its comparison to a simpler estimation procedure Lothar Breuer (breuer@info04.uni-trier.de) University of Trier, Germany Abstract. Although the

### Ensemble learning 11/19/13. The wisdom of the crowds. Chapter 11. Ensemble methods. Ensemble methods

The wisdom of the crowds Ensemble learning Sir Francis Galton discovered in the early 1900s that a collection of educated guesses can add up to very accurate predictions! Chapter 11 The paper in which

### Object Recognition Using Local Characterisation and Zernike Moments

Object Recognition Using Local Characterisation and Zernike Moments A. Choksuriwong, H. Laurent, C. Rosenberger, and C. Maaoui Laboratoire Vision et Robotique - UPRES EA 2078, ENSI de Bourges - Université

### A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

### Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear

### Reproducing Kernel Hilbert Spaces

9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

### Classification: The rest of the story

U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

### Low Bias Bagged Support Vector Machines

Low Bias Bagged Support Vector Machines Giorgio Valentini Dipartimento di Scienze dell Informazione, Università degli Studi di Milano, Italy INFM, Istituto Nazionale per la Fisica della Materia, Italy.

### Loss Functions for Preference Levels: Regression with Discrete Ordered Labels

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,

### On the Performance of Random Vector Quantization Limited Feedback Beamforming in a MISO System

1 On the Performance of Random Vector Quantization Limited Feedback Beamforming in a MISO System Chun Kin Au-Yeung, Student Member, IEEE, and David J. Love, Member, IEEE Abstract In multiple antenna wireless

### From Lasso regression to Feature vector machine

From Lasso regression to Feature vector machine Fan Li, Yiming Yang and Eric P. Xing,2 LTI and 2 CALD, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA USA 523 {hustlf,yiming,epxing}@cs.cmu.edu

### Boosting: Foundations and Algorithms. Rob Schapire

Boosting: Foundations and Algorithms Rob Schapire Example: Spam Filtering problem: filter out spam (junk email) gather large collection of examples of spam and non-spam: From: yoav@ucsd.edu Rob, can you

### Data Mining und Maschinelles Lernen

Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting

### Ensembles of Classifiers.

Ensembles of Classifiers www.biostat.wisc.edu/~dpage/cs760/ 1 Goals for the lecture you should understand the following concepts ensemble bootstrap sample bagging boosting random forests error correcting

### 1. Kernel ridge regression In contrast to ordinary least squares which has a cost function. m (θ T x (i) y (i) ) 2, J(θ) = 1 2.

CS229 Problem Set #2 Solutions 1 CS 229, Public Course Problem Set #2 Solutions: Theory Kernels, SVMs, and 1. Kernel ridge regression In contrast to ordinary least squares which has a cost function J(θ)

### Introduction to Machine Learning

1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer

### LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input

### CS446: Machine Learning Fall Final Exam. December 6 th, 2016

CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains

### Modified Logistic Regression: An Approximation to SVM and Its Applications in Large-Scale Text Categorization

Modified Logistic Regression: An Approximation to SVM and Its Applications in Large-Scale Text Categorization Jian Zhang jian.zhang@cs.cmu.edu Rong Jin rong@cs.cmu.edu Yiming Yang yiming@cs.cmu.edu Alex

### A Bias Correction for the Minimum Error Rate in Cross-validation

A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.

### Convex envelopes, cardinality constrained optimization and LASSO. An application in supervised learning: support vector machines (SVMs)

ORF 523 Lecture 8 Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Any typos should be emailed to a a a@princeton.edu. 1 Outline Convexity-preserving operations Convex envelopes, cardinality

### MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on

### Collaborative Filtering via Ensembles of Matrix Factorizations

Collaborative Ftering via Ensembles of Matrix Factorizations Mingrui Wu Max Planck Institute for Biological Cybernetics Spemannstrasse 38, 72076 Tübingen, Germany mingrui.wu@tuebingen.mpg.de ABSTRACT We

### Nearest Neighbors Methods for Support Vector Machines

Nearest Neighbors Methods for Support Vector Machines A. J. Quiroz, Dpto. de Matemáticas. Universidad de Los Andes joint work with María González-Lima, Universidad Simón Boĺıvar and Sergio A. Camelo, Universidad

### Statistical and Computational Learning Theory

Statistical and Computational Learning Theory Fundamental Question: Predict Error Rates Given: Find: The space H of hypotheses The number and distribution of the training examples S The complexity of the

### Regularized Linear Models in Stacked Generalization

Regularized Linear Models in Stacked Generalization Sam Reid and Greg Grudic University of Colorado at Boulder, Boulder CO 80309-0430, USA Abstract Stacked generalization is a flexible method for multiple