1 Foundations of Machine Learning: Multi-Class Classification
Mehryar Mohri, Courant Institute and Google Research

2 Motivation
Real-world problems often have multiple classes: text, speech, image, biological sequences.
Algorithms studied so far: designed for binary classification problems. How do we design multi-class classification algorithms?
- Can the algorithms used for binary classification be generalized to multi-class classification?
- Can we reduce multi-class classification to binary classification?

3 Multi-Class Classification Problem
Training data: sample drawn i.i.d. from $X \times Y$ according to some distribution $D$: $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$.
- mono-label case: $\mathrm{Card}(Y) = k$.
- multi-label case: $Y = \{-1, +1\}^k$.
Problem: find classifier $h \colon X \to Y$ in $H$ with small generalization error:
- mono-label case: $R(h) = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq f(x)}\big]$.
- multi-label case: $R(h) = \mathbb{E}_{x \sim D}\big[\frac{1}{k} \sum_{l=1}^{k} 1_{[h(x)]_l \neq [f(x)]_l}\big]$.
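A quick illustrative sketch (not from the slides) of how these two error notions are estimated on a finite sample; `h` and `f` are hypothetical callables returning a single label in the mono-label case and a vector in $\{-1,+1\}^k$ in the multi-label case.

```python
# Empirical estimates of the two generalization errors on a sample xs;
# h and f are placeholder callables (assumptions, not from the slides).
import numpy as np

def mono_label_error(h, f, xs):
    # R(h) ~ average of 1_{h(x) != f(x)} over the sample.
    return float(np.mean([h(x) != f(x) for x in xs]))

def multi_label_error(h, f, xs):
    # R(h) ~ average of (1/k) sum_l 1_{[h(x)]_l != [f(x)]_l}.
    return float(np.mean([np.mean(np.asarray(h(x)) != np.asarray(f(x)))
                          for x in xs]))
```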

4 Notes
In most tasks considered, the number of classes is $k \le 100$. For $k$ large, the problem is often not treated as a multi-class classification problem (ranking or density estimation, e.g., automatic speech recognition).
Computational efficiency issues arise for larger $k$.
In general, classes are not balanced.

5 Multi-Class Classification - Margin
Hypothesis set $H$: functions $h \colon X \times Y \to \mathbb{R}$; label returned: $x \mapsto \operatorname{argmax}_{y \in Y} h(x, y)$.
Margin: $\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y')$.
Error: $1_{\rho_h(x, y) \le 0} \le \Phi_\rho(\rho_h(x, y))$.
Empirical margin loss: $\widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho(\rho_h(x_i, y_i))$.
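A minimal sketch of these definitions, assuming `scores` is an $(m, k)$ array with `scores[i, y]` $= h(x_i, y)$ and using the $\rho$-margin function $\Phi_\rho(u) = \min(1, \max(0, 1 - u/\rho))$.

```python
# Multi-class margins rho_h(x_i, y_i) and the empirical margin loss.
import numpy as np

def margins(scores, y):
    m = scores.shape[0]
    correct = scores[np.arange(m), y]
    rival = scores.copy()
    rival[np.arange(m), y] = -np.inf       # exclude the correct class
    return correct - rival.max(axis=1)     # h(x, y) - max_{y' != y} h(x, y')

def empirical_margin_loss(scores, y, rho):
    u = margins(scores, y)
    return float(np.mean(np.clip(1.0 - u / rho, 0.0, 1.0)))
```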

6 Multi-Class Margin Bound (MM et al. 2012; Kuznetsov, MM, and Syed, 2014)
Theorem: let $H \subseteq \mathbb{R}^{X \times Y}$ with $Y = \{1, \ldots, k\}$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class classification bound holds for all $h \in H$:
$R(h) \le \widehat{R}_\rho(h) + \frac{4k}{\rho} \mathfrak{R}_m(\Pi_1(H)) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$,
with $\Pi_1(H) = \{x \mapsto h(x, y) \colon y \in Y, h \in H\}$.

7 Kernel-Based Hypotheses
Hypothesis set $H_{K,p}$:
- $\Phi$: feature mapping associated to PDS kernel $K$.
- functions $(x, y) \mapsto w_y \cdot \Phi(x)$, $y \in \{1, \ldots, k\}$.
- label returned: $x \mapsto \operatorname{argmax}_{y \in \{1, \ldots, k\}} w_y \cdot \Phi(x)$.
For any $p \ge 1$,
$H_{K,p} = \{(x, y) \in X \times [1, k] \mapsto w_y \cdot \Phi(x) \colon \mathbf{W} = (w_1, \ldots, w_k), \|\mathbf{W}\|_{\mathbb{H},p} \le \Lambda\}$.

8 Multi-Class Margin Bound - Kernels (MM et al. 2012)
Theorem: let $K \colon X \times X \to \mathbb{R}$ be a PDS kernel and let $\Phi \colon X \to \mathbb{H}$ be a feature mapping associated to $K$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class bound holds for all $h \in H_{K,p}$:
$R(h) \le \widehat{R}_\rho(h) + 4k \sqrt{\frac{r^2 \Lambda^2 / \rho^2}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}$,
where $r^2 = \sup_{x \in X} K(x, x)$.

9 Approaches
Single classifier: Multi-class SVMs. AdaBoost.MH. Conditional Maxent. Decision trees.
Combination of binary classifiers: One-vs-all. One-vs-one. Error-correcting codes.

10 Multi-Class SVMs (Weston and Watkins, 1999; Crammer and Singer, 2001)
Optimization problem:
$\min_{\mathbf{w}, \xi} \ \frac{1}{2} \sum_{l=1}^{k} \|w_l\|^2 + C \sum_{i=1}^{m} \xi_i$
subject to: $w_{y_i} \cdot x_i + \delta_{y_i, l} \ge w_l \cdot x_i + 1 - \xi_i$, $\forall (i, l) \in [1, m] \times Y$.
Decision function: $h \colon x \mapsto \operatorname{argmax}_{l \in Y} (w_l \cdot x)$.
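A minimal usage sketch, assuming scikit-learn is available: its `LinearSVC` exposes the Crammer-Singer formulation (linear kernel) via the `multi_class="crammer_singer"` option; the data set here is only a placeholder.

```python
# Crammer-Singer multi-class SVM via scikit-learn (a sketch, not the slides' code).
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
print(clf.score(X, y))  # decision rule: h(x) = argmax_l (w_l . x)
```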

11 Notes
Directly based on generalization bounds.
Comparison with (Weston and Watkins, 1999): single slack variable per point; maximum of slack variables (penalty for worst class): $\xi_i = \max_{l=1}^{k} \xi_{il}$ instead of $\sum_{l=1}^{k} \xi_{il}$.
PDS kernel instead of inner product.
Optimization: complex constraints, $mk$-size problem; specific solution based on decomposition into $m$ disjoint sets of constraints (Crammer and Singer, 2001).

12 Dual Formulation
Optimization problem ($\alpha_i$ denotes the $i$-th row of the matrix $\alpha = [\alpha_{ij}] \in \mathbb{R}^{m \times k}$):
$\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i \cdot e_{y_i} - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i \cdot \alpha_j)(x_i \cdot x_j)$
subject to: $\forall i \in [1, m]$, $(0 \le \alpha_{i y_i} \le C) \wedge (\forall j \neq y_i, \ \alpha_{ij} \le 0) \wedge (\alpha_i \cdot \mathbf{1} = 0)$.
Decision function: $h(x) = \operatorname{argmax}_{l=1}^{k} \sum_{i=1}^{m} \alpha_{il} (x_i \cdot x)$.
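A small sketch of evaluating this dual decision function from a learned coefficient matrix; `alpha` and the test/train kernel matrix are assumed inputs, not part of the slides.

```python
# h(x) = argmax_l sum_i alpha_il K(x_i, x), vectorized over test points.
import numpy as np

def dual_predict(alpha, K_test_train):
    # alpha: (m, k) dual variables; K_test_train[n, i] = K(x_n, x_i).
    return (K_test_train @ alpha).argmax(axis=1)  # class indices in [0, k)
```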

13 AdaBoost (Schapire and Singer, 2000)
Training data (multi-label case): $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{-1, +1\}^k$.
Reduction to binary classification: each example leads to $k$ binary examples:
$(x_i, y_i) \mapsto ((x_i, 1), y_i[1]), \ldots, ((x_i, k), y_i[k])$, $i \in [1, m]$.
Apply AdaBoost to the resulting problem; choice of $\alpha_t$.
Computational cost: $mk$ distribution updates at each round.

14 AdaBoost.MH
$H \subseteq (\{-1, +1\})^{X \times Y}$.

AdaBoost.MH($S = ((x_1, y_1), \ldots, (x_m, y_m))$)
1  for $i \leftarrow 1$ to $m$ do
2    for $l \leftarrow 1$ to $k$ do
3      $D_1(i, l) \leftarrow \frac{1}{mk}$
4  for $t \leftarrow 1$ to $T$ do
5    $h_t \leftarrow$ base classifier in $H$ with small error $\epsilon_t = \Pr_{D_t}[h_t(x_i, l) \neq y_i[l]]$
6    $\alpha_t \leftarrow$ choice to minimize $Z_t$
7    $Z_t \leftarrow \sum_{i,l} D_t(i, l) \exp(-\alpha_t y_i[l] h_t(x_i, l))$
8    for $i \leftarrow 1$ to $m$ do
9      for $l \leftarrow 1$ to $k$ do
10       $D_{t+1}(i, l) \leftarrow \frac{D_t(i, l) \exp(-\alpha_t y_i[l] h_t(x_i, l))}{Z_t}$
11 $f_T \leftarrow \sum_{t=1}^{T} \alpha_t h_t$
12 return $h_T = \operatorname{sgn}(f_T)$
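A compact runnable sketch of this pseudocode, under the assumption (not from the slides) that the base class consists of one-feature threshold stumps with an optimal per-label sign $s_l \in \{-1, +1\}$; labels `Y` are an $(m, k)$ matrix in $\{-1, +1\}$.

```python
# AdaBoost.MH with per-label signed threshold stumps (a sketch).
import numpy as np

def adaboost_mh(X, Y, T=50):
    m, k = Y.shape
    D = np.full((m, k), 1.0 / (m * k))            # D_1(i, l) = 1/(mk)
    ensemble = []
    for _ in range(T):
        best = None
        for j in range(X.shape[1]):
            for th in np.unique(X[:, j]):
                p = np.where(X[:, j] <= th, -1.0, 1.0)[:, None]  # (m, 1)
                wrong = np.sum(D * (p != Y), axis=0)             # per-label error
                col_w = D.sum(axis=0)
                s = np.where(wrong <= col_w / 2, 1.0, -1.0)      # optimal signs
                eps = np.sum(np.minimum(wrong, col_w - wrong))
                if best is None or eps < best[0]:
                    best = (eps, j, th, s)
        eps, j, th, s = best
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))      # minimizes Z_t
        P = np.where(X[:, j] <= th, -1.0, 1.0)[:, None] * s      # h_t(x_i, l)
        D = D * np.exp(-alpha * Y * P)
        D /= D.sum()                                             # division by Z_t
        ensemble.append((alpha, j, th, s))
    return ensemble

def predict_mh(ensemble, X):
    F = sum(a * np.where(X[:, j] <= th, -1.0, 1.0)[:, None] * s
            for a, j, th, s in ensemble)
    return np.sign(F)                                            # (m, k) in {-1, +1}
```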

15 Bound on Empirical Error
Theorem: the empirical error of the classifier output by AdaBoost.MH verifies: $\widehat{R}(h) \le \prod_{t=1}^{T} Z_t$.
Proof: similar to the proof for AdaBoost.
Choice of $\alpha_t$:
- for $H \subseteq (\{-1, +1\})^{X \times Y}$, as for AdaBoost, $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.
- for $H \subseteq ([-1, +1])^{X \times Y}$, same choice: minimize upper bound.
- other cases: numerical/approximation method.

16 Notes
Objective function: $F(\alpha) = \sum_{i=1}^{m} \sum_{l=1}^{k} e^{-y_i[l] f_n(x_i, l)} = \sum_{i=1}^{m} \sum_{l=1}^{k} e^{-y_i[l] \sum_{t=1}^{n} \alpha_t h_t(x_i, l)}$.
All comments and analysis given for AdaBoost apply here.
Alternative: AdaBoost.MR, which coincides with a special case of RankBoost (ranking lecture).

17 Decision Trees
[Figure: a binary decision tree asking threshold questions $X_1 < a_1$, $X_1 < a_2$, $X_2 < a_3$, $X_2 < a_4$ at its internal nodes, with leaves $R_1, \ldots, R_5$, shown next to the corresponding axis-aligned partition of the $(X_1, X_2)$ plane into regions $R_1$-$R_5$.]

18 Different Types of Questions
Decision trees:
- $X \in \{\text{blue}, \text{white}, \text{red}\}$: categorical questions.
- $X \le a$: continuous variables.
Binary space partition (BSP) trees: $\sum_{i=1}^{n} \alpha_i X_i \le a$: partitioning with convex polyhedral regions.
Sphere trees: $\|X - a_0\| \le a$: partitioning with pieces of spheres.

19 Hypotheses
In each region $R_t$:
- classification: majority vote, ties broken arbitrarily: $\hat{y}_t = \operatorname{argmax}_{y \in Y} |\{x_i \in R_t \colon i \in [1, m], y_i = y\}|$.
- regression: average value: $\hat{y}_t = \frac{1}{|\{i \colon x_i \in R_t\}|} \sum_{x_i \in R_t} y_i$.
Form of hypotheses: $h \colon x \mapsto \sum_{t} \hat{y}_t 1_{x \in R_t}$.

20 Training
Problem: the general problem of determining the partition with minimum empirical error is NP-hard.
Heuristics: greedy algorithm. For all $j \in [1, N]$, $\theta \in \mathbb{R}$:
$R^{+}(j, \theta) = \{x_i \in R \colon x_i[j] \ge \theta, i \in [1, m]\}$,
$R^{-}(j, \theta) = \{x_i \in R \colon x_i[j] < \theta, i \in [1, m]\}$.

Decision-Trees($S = ((x_1, y_1), \ldots, (x_m, y_m))$)
1 $P \leftarrow \{S\}$  (initial partition)
2 for each region $R \in P$ such that Pred($R$) do
3   $(j, \theta) \leftarrow \operatorname{argmin}_{(j, \theta)} [\text{error}(R^{-}(j, \theta)) + \text{error}(R^{+}(j, \theta))]$
4   $P \leftarrow (P - \{R\}) \cup \{R^{-}(j, \theta), R^{+}(j, \theta)\}$
5 return $P$
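A minimal sketch of the greedy split search in line 3 of the pseudocode, using misclassification count as the error and assuming labels are non-negative integers.

```python
# Best single split (j, theta) of one region by empirical error.
import numpy as np

def misclass_error(y):
    # Unnormalized empirical error of the majority-vote label.
    if len(y) == 0:
        return 0
    return len(y) - np.bincount(y).max()

def best_split(X, y):
    best = None
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            left = X[:, j] < theta          # R^-(j, theta); ~left is R^+(j, theta)
            err = misclass_error(y[left]) + misclass_error(y[~left])
            if best is None or err < best[0]:
                best = (err, j, theta)
    return best                              # (error, feature j, threshold theta)
```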

21 Splitting/Stopping Criteria
Problem: larger trees overfit the training sample.
Conservative splitting: split a node only if the loss is reduced by some fixed value $\lambda > 0$; issue: a seemingly bad split may dominate useful splits further down the tree.
Grow-then-prune technique (CART):
- grow a very large tree: Pred($R$): $|R| > n_0$.
- prune the tree based on $F(T) = \text{Loss}(T) + \lambda |T|$, with the parameter $\lambda \ge 0$ determined by cross-validation.
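Grow-then-prune in scikit-learn terms (a sketch, assuming scikit-learn and a placeholder data set): `cost_complexity_pruning_path` enumerates the effective values of $\lambda$ in $F(T) = \text{Loss}(T) + \lambda |T|$, and cross-validation picks among them.

```python
# Cost-complexity pruning with the parameter chosen by cross-validation.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
cv = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                      X, y, cv=5).mean() for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv))]
print(best_alpha)
```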

22 Decision Tree Tools
Most commonly used tools for learning decision trees:
- CART (classification and regression tree) (Breiman et al., 1984).
- C4.5 (Quinlan, 1986, 1993) and C5.0 (RuleQuest Research), a commercial system.
Differences: minor between latest versions.

23 Approaches
Single classifier: SVM-type algorithm. AdaBoost-type algorithm. Conditional Maxent. Decision trees.
Combination of binary classifiers: One-vs-all. One-vs-one. Error-correcting codes.

24 One-vs-All
Technique:
- for each class $l \in Y$, learn a binary classifier $h_l = \operatorname{sgn}(f_l)$.
- combine the binary classifiers via a voting mechanism, typically majority vote: $h \colon x \mapsto \operatorname{argmax}_{l \in Y} f_l(x)$.
Problem: poor justification (in general); calibration: classifier scores not comparable.
Nevertheless: simple and frequently used in practice, with computational advantages in some cases.
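A one-vs-all sketch built from any binary learner with a real-valued score; the binary learner here is a linear SVM via scikit-learn, which is an assumption for illustration.

```python
# One-vs-all: one binary problem per class, combined by argmax over scores.
import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, y, classes):
    # Class l vs. the rest, relabeled to {-1, +1}.
    return {l: LinearSVC().fit(X, np.where(y == l, 1, -1)) for l in classes}

def predict_ova(models, X):
    # h(x) = argmax_l f_l(x), using the (uncalibrated) decision scores.
    scores = np.column_stack([m.decision_function(X) for m in models.values()])
    return np.array(list(models))[scores.argmax(axis=1)]
```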

25 One-vs-One
Technique:
- for each pair $(l, l') \in Y \times Y$, $l \neq l'$, learn a binary classifier $h_{ll'} \colon X \to \{0, 1\}$.
- combine the binary classifiers via majority vote: $h(x) = \operatorname{argmax}_{l' \in Y} |\{l \colon h_{ll'}(x) = 1\}|$.
Problem:
- computational: train $k(k-1)/2$ binary classifiers.
- overfitting: the size of the training sample could become small for a given pair.
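A corresponding one-vs-one sketch (same assumed binary learner as above): $k(k-1)/2$ classifiers, each trained only on the points of its two classes, combined by counting wins per class.

```python
# One-vs-one with pairwise voting.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def train_ovo(X, y, classes):
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)               # only the two classes' points
        models[(a, b)] = LinearSVC().fit(X[mask], (y[mask] == a).astype(int))
    return models

def predict_ovo(models, X, classes):
    votes = {l: np.zeros(len(X)) for l in classes}
    for (a, b), m in models.items():
        pred = m.predict(X)                      # 1 votes for a, 0 votes for b
        votes[a] += pred
        votes[b] += 1 - pred
    V = np.column_stack([votes[l] for l in classes])
    return np.asarray(classes)[V.argmax(axis=1)]
```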

26 Computational Comparison

              Training                          Testing
One-vs-all    $O(k B_{\text{train}}(m))$        $O(k B_{\text{test}})$
One-vs-one    $O(k^2 B_{\text{train}}(m/k))$    $O(k^2 B_{\text{test}})$ (smaller $N_{SV}$ per $B$ on average)

Time complexity for SVMs: $B_{\text{train}}(m) = O(m^{\alpha})$ with $\alpha$ less than 3, giving $O(k m^{\alpha})$ training time for one-vs-all and $O(k^{2-\alpha} m^{\alpha})$ for one-vs-one.

27 Error-Correcting Code Approach (Dietterich and Bakiri, 1995)
Idea:
- assign an $F$-long binary code word to each class: $M = [M_{lj}] \in \{0, 1\}^{[1,k] \times [1,F]}$.
- learn a binary classifier $f_j \colon X \to \{0, 1\}$ for each column; example $x$ in class $l$ is labeled with $M_{lj}$.
- classifier output: $f(x) = (f_1(x), \ldots, f_F(x))$, $h \colon x \mapsto \operatorname{argmin}_{l \in Y} d_{\text{Hamming}}(M_l, f(x))$.
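An ECOC sketch under the assumptions (not from the slides) that classes are indexed $0, \ldots, k-1$ and that every column of the code matrix `M` contains both a 0 and a 1, so each binary problem is well posed; the binary learner is again a linear SVM for illustration.

```python
# Error-correcting output codes: one classifier per column, Hamming decoding.
import numpy as np
from sklearn.svm import LinearSVC

def train_ecoc(X, y, M):
    # Column j relabels example (x_i, y_i) as M[y_i, j].
    return [LinearSVC().fit(X, M[y, j]) for j in range(M.shape[1])]

def predict_ecoc(models, X, M):
    F = np.column_stack([m.predict(X) for m in models])    # f(x) in {0,1}^F
    dist = np.abs(F[:, None, :] - M[None, :, :]).sum(-1)   # Hamming distances
    return dist.argmin(axis=1)                             # closest code word
```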

28 Illustration
[Figure: 8 classes, code length 6: an $8 \times 6$ binary code matrix with one row (code word) per class and one column per binary classifier $f_1, \ldots, f_6$; a new example $x$ is assigned the class whose code word is closest to $(f_1(x), \ldots, f_6(x))$.]

29 Error-Correcting Codes - Design
Main ideas:
- independent columns: otherwise no effective discrimination.
- distance between rows: if the minimal Hamming distance between rows is $d$, then the multi-class classifier can correct up to $\lfloor \frac{d-1}{2} \rfloor$ errors.
- columns may correspond to features selected for the task.
- one-vs-all and one-vs-one (with ternary codes) are special cases.

30 Extensions (Allwein et al., 2000)
Matrix entries in $\{-1, 0, +1\}$:
- examples marked with 0 are disregarded during training.
- one-vs-one also becomes a special case.
Margin loss $L$: function of the margin $y f(x)$, e.g., hinge loss.
Hamming distance decoding: $h(x) = \operatorname{argmin}_{l \in \{1, \ldots, k\}} \sum_{j=1}^{F} \frac{1 - \operatorname{sgn}(M_{lj} f_j(x))}{2}$.
Loss-based decoding: $h(x) = \operatorname{argmin}_{l \in \{1, \ldots, k\}} \sum_{j=1}^{F} L(M_{lj} f_j(x))$.
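A loss-based decoding sketch for ternary codes $M \in \{-1, 0, +1\}^{k \times F}$ with the hinge loss $L(u) = \max(0, 1 - u)$; `scores` holds the real-valued outputs $f_j(x)$ and is an assumed input. Entries $M_{lj} = 0$ contribute the constant $L(0) = 1$, as in the decoding rule above.

```python
# Loss-based decoding: argmin_l sum_j L(M_lj * f_j(x)) with the hinge loss.
import numpy as np

def loss_based_decode(scores, M):
    # scores: (n, F); M: (k, F). Margins M_lj * f_j(x) for every class l.
    U = M[None, :, :] * scores[:, None, :]        # shape (n, k, F)
    return np.maximum(0.0, 1.0 - U).sum(-1).argmin(axis=1)
```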

31 Applications
The one-vs-all approach is the most widely used.
No clear empirical evidence of the superiority of other approaches (Rifkin and Klautau, 2004), except perhaps on small data sets with relatively large error rates.
Large structured multi-class problems: often treated as ranking problems (see ranking lecture).

32 References
- Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
- Koby Crammer and Yoram Singer. Improved output coding for classification using continuous relaxation. In Proceedings of NIPS, 2000.
- Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
- Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47, 2002.
- Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research (JAIR), 2:263-286, 1995.
- Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
- John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12 (NIPS 1999), pp. 547-553, 2000.

33 References
- Ryan Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. Ph.D. thesis, MIT, 2002.
- Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.
- Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
- Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
- Robert E. Schapire and Yoram Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
- Jason Weston and Chris Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 1999), 1999.
