Foundations of Machine Learning Lecture 9. Mehryar Mohri Courant Institute and Google Research


1 Foundations of Machine Learning Lecture 9 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

2 Multi-Class Classification

3 Motivation
Real-world problems often have multiple classes: text, speech, image, biological sequences.
Algorithms studied so far: designed for binary classification problems.
How do we design multi-class classification algorithms?
- can the algorithms used for binary classification be generalized to multi-class classification?
- can we reduce multi-class classification to binary classification?

4 Multi-Class Classification Problem
Training data: sample drawn i.i.d. from X according to some distribution D,
S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m.
- mono-label case: Card(Y) = k.
- multi-label case: Y = {-1, +1}^k.
Problem: find classifier h: X → Y in H with small generalization error,
- mono-label case: R_D(h) = E_{x∼D}[1_{h(x) ≠ f(x)}].
- multi-label case: R_D(h) = E_{x∼D}[(1/k) Σ_{l=1}^k 1_{[h(x)]_l ≠ [f(x)]_l}].

5 Notes
In most tasks, the number of classes is k ≤ 100.
For k large or infinite, the problem is often not treated as a multi-class classification problem, e.g., automatic speech recognition.
Computational efficiency issues arise for larger k.
In general, classes are not balanced.

6 Multiclass Classification - Margin
Hypothesis set H: functions h: X × Y → R.
Label returned: x ↦ argmax_{y∈Y} h(x, y).
Margin: ρ_h(x, y) = h(x, y) - max_{y'≠y} h(x, y').
Empirical margin loss: R̂_ρ(h) = (1/m) Σ_{i=1}^m Φ_ρ(ρ_h(x_i, y_i)).
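To make the margin definition concrete, here is a minimal NumPy sketch (an illustrative helper, not part of the lecture) computing ρ_h(x_i, y_i) for a batch of points from an m × k matrix of scores h(x_i, y):

```python
import numpy as np

def multiclass_margins(scores, y):
    """rho_h(x_i, y_i) = h(x_i, y_i) - max_{y' != y_i} h(x_i, y'),
    given an (m, k) score matrix and true labels y in {0, ..., k-1}."""
    m = scores.shape[0]
    true_scores = scores[np.arange(m), y]
    masked = scores.copy()
    masked[np.arange(m), y] = -np.inf      # exclude the true class
    runner_up = masked.max(axis=1)         # max over y' != y_i
    return true_scores - runner_up         # positive iff correctly classified

# Example: 3 points, 4 classes.
scores = np.array([[2.0, 0.5, -1.0, 0.0],
                   [0.1, 0.3,  0.2, 0.0],
                   [1.0, 2.0,  0.5, 0.5]])
y = np.array([0, 1, 0])
print(multiclass_margins(scores, y))       # [ 1.5  0.1 -1. ]
```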

7 Multiclass Margin Bound (Koltchinskii and Panchenko, 2002; MM et al., 2012)
Theorem: let H ⊆ R^{X×Y} with Y = {1, ..., k}. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 - δ, the following multi-class classification bound holds for all h ∈ H:
R(h) ≤ R̂_ρ(h) + (2k²/ρ) R_m(Π_1(H)) + √(log(1/δ) / (2m)),
with Π_1(H) = {x ↦ h(x, y): y ∈ Y, h ∈ H}.

8 Kernel-Based Hypotheses
Hypothesis set H_{K,p}:
- Φ feature mapping associated to PDS kernel K.
- functions (x, y) ↦ w_y · Φ(x), y ∈ {1, ..., k}.
- label returned: x ↦ argmax_{y∈{1,...,k}} w_y · Φ(x).
For any p ≥ 1,
H_{K,p} = {(x, y) ∈ X × [1, k] ↦ w_y · Φ(x): W = (w_1, ..., w_k), ‖W‖_{H,p} ≤ Λ}.

9 Multiclass Margin Bound - Kernels (MM et al., 2012)
Theorem: let K: X × X → R be a PDS kernel and let Φ: X → H be a feature mapping associated to K. Assume that K(x, x) ≤ r² for all x ∈ X. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 - δ, the following multiclass bound holds for all h ∈ H_{K,p}:
R(h) ≤ R̂_ρ(h) + 2k² √(r²Λ²/ρ² / m) + √(log(1/δ) / (2m)).

10 Approaches
Single classifier:
- Multi-class SVMs.
- AdaBoost.MH.
- Decision trees.
Combination of binary classifiers:
- One-vs-all.
- One-vs-one.
- Error-correcting codes.

11 Multi-Class SVMs (Weston and Watkins, 1999; Crammer and Singer, 2001)
Optimization problem:
min_{W,ξ} (1/2) Σ_{l=1}^k ‖w_l‖² + C Σ_{i=1}^m ξ_i
subject to: w_{y_i} · x_i + δ_{y_i,l} ≥ w_l · x_i + 1 - ξ_i, for all (i, l) ∈ [1, m] × Y.
Decision function:
h: x ↦ argmax_{l∈Y} (w_l · x).
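As a quick illustration, scikit-learn's LinearSVC exposes this joint formulation via multi_class='crammer_singer'; a minimal sketch (dataset and parameters chosen only for the example):

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Solves a single joint optimization over all weight vectors w_l
# (Crammer-Singer formulation), rather than k independent one-vs-rest problems.
clf = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy of the multi-class SVM
```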

12 Notes
Directly based on generalization bounds.
Comparison with (Weston and Watkins, 1999): single slack variable per point, maximum of slack variables instead of their sum (penalty for worst class): max_{l=1}^k ξ_{il} instead of Σ_{l=1}^k ξ_{il}.
PDS kernel instead of inner product.
Optimization: complex constraints, mk-size problem; specific solution based on decomposition into m disjoint sets of constraints (Crammer and Singer, 2001).

13 Dual Formulation
Optimization problem (α_i is the ith row of the matrix α = [α_{ij}] ∈ R^{m×k}):
max_{α∈R^{m×k}} Σ_{i=1}^m α_i · e_{y_i} - (1/2) Σ_{i,j=1}^m (α_i · α_j)(x_i · x_j)
subject to: ∀i ∈ [1, m], (0 ≤ α_{i y_i} ≤ C) ∧ (∀j ≠ y_i, α_{ij} ≤ 0) ∧ (α_i · 1 = 0).
Decision function:
h(x) = argmax_{l∈[1,k]} Σ_{i=1}^m α_{il} (x_i · x).

14 AdaBoost (Schapire and Singer, 2000)
Training data (multi-label case): (x_1, y_1), ..., (x_m, y_m) ∈ X × {-1, +1}^k.
Reduction to binary classification:
- each example leads to k binary examples:
(x_i, y_i) ↦ ((x_i, 1), y_i[1]), ..., ((x_i, k), y_i[k]), i ∈ [1, m].
- apply AdaBoost to the resulting problem.
- choice of α_t.
Computational cost: mk distribution updates at each round.

15 AdaBoost.MH
H ⊆ {-1, +1}^{X×Y}.

AdaBoost.MH(S = ((x_1, y_1), ..., (x_m, y_m)))
  for i ← 1 to m do
      for l ← 1 to k do
          D_1(i, l) ← 1/(mk)
  for t ← 1 to T do
      h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i, l) ≠ y_i[l]]
      α_t ← choose to minimize Z_t
      Z_t ← Σ_{i,l} D_t(i, l) exp(-α_t y_i[l] h_t(x_i, l))
      for i ← 1 to m do
          for l ← 1 to k do
              D_{t+1}(i, l) ← D_t(i, l) exp(-α_t y_i[l] h_t(x_i, l)) / Z_t
  f_T ← Σ_{t=1}^T α_t h_t
  return h_T = sgn(f_T)
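A compact Python sketch of the pseudocode above (an illustration, not the lecture's code, assuming depth-1 trees over the augmented points (x_i, l) as base classifiers, following the reduction of slide 14):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_mh(X, Y, T=50):
    """AdaBoost.MH sketch. X: (m, d) features; Y: (m, k) labels in {-1, +1}."""
    m, k = X.shape[0], Y.shape[1]
    # One binary example per (point, class) pair: features [x_i, one-hot(l)].
    Xa = np.hstack([np.repeat(X, k, axis=0), np.tile(np.eye(k), (m, 1))])
    ya = Y.reshape(-1)                      # y_i[l], same (i, l) ordering
    D = np.full(m * k, 1.0 / (m * k))       # D_1(i, l) = 1/(mk)
    hs, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(Xa, ya, sample_weight=D)
        pred = h.predict(Xa)
        eps = D[pred != ya].sum()           # epsilon_t, weighted error
        if eps <= 0.0 or eps >= 0.5:
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # minimizes Z_t here
        D = D * np.exp(-alpha * ya * pred)
        D /= D.sum()                        # normalization by Z_t
        hs.append(h)
        alphas.append(alpha)

    def predict(Xnew):
        n = Xnew.shape[0]
        Xna = np.hstack([np.repeat(Xnew, k, axis=0), np.tile(np.eye(k), (n, 1))])
        F = np.zeros(n * k)
        for a, h in zip(alphas, hs):
            F += a * h.predict(Xna)         # f_T = sum_t alpha_t h_t
        return np.sign(F).reshape(n, k)     # h_T = sgn(f_T)

    return predict
```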

16 Bound on Empirical Error
Theorem: the empirical error of the classifier output by AdaBoost.MH verifies:
R̂(h) ≤ Π_{t=1}^T Z_t.
Proof: similar to the proof for AdaBoost.
Choice of α_t:
- for H ⊆ {-1, +1}^{X×Y}: as for AdaBoost, α_t = (1/2) log((1 - ε_t)/ε_t).
- for H ⊆ [-1, +1]^{X×Y}: same choice, minimizes the upper bound.
- other cases: numerical/approximation method.

17 Notes
Objective function:
F(α) = Σ_{i=1}^m Σ_{l=1}^k e^{-y_i[l] f_n(x_i, l)} = Σ_{i=1}^m Σ_{l=1}^k e^{-y_i[l] Σ_{t=1}^n α_t h_t(x_i, l)}.
All comments and analysis given for AdaBoost apply here.
Alternative: AdaBoost.MR, which coincides with a special case of RankBoost (next lecture).

18 Decision Trees
[Figure: a decision tree with questions X1 < a1, X1 < a2, X2 < a3, X2 < a4 at the internal nodes and leaves R1, ..., R5, shown next to the corresponding axis-aligned partition of the (X1, X2) plane into the regions R1, ..., R5.]

19 Different Types of Questions
Decision trees:
- X ∈ {blue, white, red}: categorical questions.
- X ≤ a: continuous variables.
Binary space partition (BSP) trees:
- Σ_{i=1}^n β_i X_i ≤ a: partitioning with convex polyhedral regions.
Sphere trees:
- ‖X - a_0‖ ≤ a: partitioning with pieces of spheres.

20 Hypotheses
In each region R_t:
- classification: majority vote, ties broken arbitrarily:
ŷ_t = argmax_{y∈Y} |{x_i ∈ R_t: i ∈ [1, m], y_i = y}|.
- regression: average value:
ŷ_t = (1/|{i: x_i ∈ R_t}|) Σ_{x_i∈R_t} y_i.
Form of hypotheses: h: x ↦ Σ_t ŷ_t 1_{x∈R_t}.

21 Training
Problem: the general problem of determining the partition with minimum empirical error is NP-hard.
Heuristics: greedy algorithm. For all j ∈ [1, N], θ ∈ R:
R^+(j, θ) = {x_i ∈ R: x_i[j] ≥ θ, i ∈ [1, m]}
R^-(j, θ) = {x_i ∈ R: x_i[j] < θ, i ∈ [1, m]}.

Decision-Trees(S = ((x_1, y_1), ..., (x_m, y_m)))
  P ← {S}   (initial partition)
  for each region R ∈ P such that Pred(R) do
      (j, θ) ← argmin_{(j,θ)} [error(R^-(j, θ)) + error(R^+(j, θ))]
      P ← (P - {R}) ∪ {R^-(j, θ), R^+(j, θ)}
  return P
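A minimal sketch of one greedy step (illustrative only; error(R) here is the number of points in R misclassified by the region's majority vote, matching the classification hypotheses of slide 20):

```python
import numpy as np

def best_split(X, y):
    """Return (j, theta) minimizing error(R^-(j, theta)) + error(R^+(j, theta))."""
    def error(labels):
        if labels.size == 0:
            return 0
        _, counts = np.unique(labels, return_counts=True)
        return labels.size - counts.max()   # misclassified by majority vote
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            left = y[X[:, j] < theta]       # R^-(j, theta)
            right = y[X[:, j] >= theta]     # R^+(j, theta)
            e = error(left) + error(right)
            if e < best[2]:
                best = (j, theta, e)
    return best[:2]
```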

22 Splitting/Stopping Criteria
Problem: larger trees overfit the training sample.
Conservative splitting: split a node only if the loss is reduced by some fixed value η > 0.
- issue: a seemingly bad split dominating useful splits.
Grow-then-prune technique (CART):
- grow a very large tree, using Pred(R): |R| > n_0.
- prune the tree based on F(T) = Loss(T) + λ|T|, with the parameter λ ≥ 0 determined by cross-validation.

23 Decision Tree Tools
Most commonly used tools for learning decision trees:
- CART (classification and regression tree) (Breiman et al., 1984).
- C4.5 (Quinlan, 1986, 1993) and C5.0 (RuleQuest Research), a commercial system.
Differences: minor between latest versions.

24 Approaches
Single classifier:
- SVM-type algorithm.
- AdaBoost-type algorithm.
- Decision trees.
Combination of binary classifiers:
- One-vs-all.
- One-vs-one.
- Error-correcting codes.

25 One-vs-All
Technique:
- for each class l ∈ Y, learn a binary classifier h_l = sgn(f_l).
- combine binary classifiers via a voting mechanism, typically majority vote:
h: x ↦ argmax_{l∈Y} f_l(x).
Problem: poor justification.
- calibration: classifier scores not comparable.
- nevertheless: simple and frequently used in practice, computational advantages in some cases.
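A minimal one-vs-all sketch (illustrative; scikit-learn also ships this strategy as sklearn.multiclass.OneVsRestClassifier):

```python
import numpy as np
from sklearn.svm import LinearSVC

class OneVsAll:
    """One binary scorer f_l per class; prediction by argmax_l f_l(x).
    Note the scores are raw margins, not calibrated probabilities."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = [LinearSVC().fit(X, np.where(y == l, 1, -1))
                        for l in self.classes_]
        return self
    def predict(self, X):
        scores = np.column_stack([m.decision_function(X) for m in self.models_])
        return self.classes_[scores.argmax(axis=1)]
```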

26 One-vs-One
Technique:
- for each pair (l, l') ∈ Y × Y, l ≠ l', learn a binary classifier h_{ll'}: X → {0, 1}.
- combine binary classifiers via majority vote:
h(x) = argmax_{l'∈Y} |{l: h_{ll'}(x) = 1}|.
Problems:
- computational: train k(k-1)/2 binary classifiers.
- overfitting: size of training sample could become small for a given pair.
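And the corresponding one-vs-one sketch (illustrative; cf. sklearn.multiclass.OneVsOneClassifier), with one classifier per pair trained only on that pair's points:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

class OneVsOne:
    """k(k-1)/2 binary classifiers, one per class pair; majority vote."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.pairs_, self.models_ = [], []
        for l, lp in combinations(self.classes_, 2):
            mask = (y == l) | (y == lp)        # only this pair's points
            self.pairs_.append((l, lp))
            self.models_.append(LinearSVC().fit(X[mask], y[mask]))
        return self
    def predict(self, X):
        votes = np.zeros((X.shape[0], len(self.classes_)), dtype=int)
        index = {l: i for i, l in enumerate(self.classes_)}
        for m in self.models_:
            for i, winner in enumerate(m.predict(X)):
                votes[i, index[winner]] += 1   # each pair votes for its winner
        return self.classes_[votes.argmax(axis=1)]
```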

27 Computational Comparison

             Training                Testing (on average)
One-vs-all   O(k B_train(m))         O(k B_test)
One-vs-one   O(k² B_train(m/k))      O(k² B_test), but smaller N_SV per classifier

Time complexity for SVMs: with B_train(m) = O(m^α), α < 3, training is O(k m^α) for one-vs-all vs O(k^{2-α} m^α) for one-vs-one.

28 Heuristics
Training: reuse of computation between classifiers, e.g., sharing of kernel computations; caching.
Testing: directed acyclic graph (Platt et al., 2000).
- smaller number of tests.
- ordering?
[Figure: decision DAG for four classes; the root tests 1 vs 4, its "not 4" and "not 1" edges lead to 1 vs 3 and 2 vs 4, and from there, via "not 1", "not 3", "not 2", "not 4" edges, to the final tests 3 vs 4, 2 vs 3, and 1 vs 2.]
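The DAG testing idea can be sketched in a few lines (illustrative; h here is assumed to be a pairwise predictor built as in one-vs-one):

```python
def dag_predict(x, classes, h):
    """DAG testing sketch (Platt et al., 2000): h(l, lp, x) returns the winner
    of the binary classifier for pair (l, lp). Only k-1 tests per point,
    instead of evaluating all k(k-1)/2 classifiers as in plain voting."""
    remaining = list(classes)
    while len(remaining) > 1:
        l, lp = remaining[0], remaining[-1]
        winner = h(l, lp, x)
        # eliminate the loser; the winner stays in the running
        remaining.remove(lp if winner == l else l)
    return remaining[0]
```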

29 Error-Correcting Code Approach (Dietterich and Bakiri, 1995)
Technique:
- assign an F-long binary code word to each class: M = [M_{lj}] ∈ {0, 1}^{[1,k]×[1,F]}.
- learn a binary classifier f_j: X → {0, 1} for each column: example x in class l is labeled with M_{lj}.
- classifier output: f(x) = (f_1(x), ..., f_F(x)),
h: x ↦ argmin_{l∈Y} d_Hamming(M_l, f(x)).
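A minimal ECOC sketch with Hamming decoding (illustrative; assumes classes are indexed 0..k-1 so they index the rows of M, and that every column of M contains both labels so each binary problem is non-degenerate):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ecoc_fit_predict(X, y, Xtest, M):
    """ECOC with Hamming decoding. M is a (k, F) matrix in {0, 1}."""
    F = M.shape[1]
    # One binary classifier per column: a point in class l gets label M[l, j].
    fs = [LogisticRegression().fit(X, M[y, j]) for j in range(F)]
    fx = np.column_stack([f.predict(Xtest) for f in fs])   # f(x) in {0,1}^F
    # Assign each point the class whose code word is closest in Hamming distance.
    dists = (fx[:, None, :] != M[None, :, :]).sum(axis=2)  # (n_test, k)
    return dists.argmin(axis=1)
```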

30 Illustration
[Figure: 8 classes, code length 6: an 8×6 binary code matrix with one row (code word) per class and one column per binary classifier f_1(x), ..., f_6(x); a new example x is assigned the class whose code word is closest in Hamming distance to (f_1(x), ..., f_6(x)).]

31 Error-Correcting Codes - Design
Main ideas:
- independent columns: otherwise no effective discrimination.
- distance between rows: if the minimal Hamming distance between rows is d, then the multi-class code can correct ⌊(d - 1)/2⌋ errors.
- columns may correspond to features selected for the task.
- one-vs-all and one-vs-one (with ternary codes) are special cases.

32 Extensions (Allwein et al., 2000)
Matrix entries in {-1, 0, +1}:
- examples marked with 0 are disregarded during training.
- one-vs-one also becomes a special case.
Hamming loss:
h(x) = argmin_{l∈{1,...,k}} Σ_{j=1}^F (1 - sgn(M_{lj} f_j(x))) / 2.
Loss-based decoding, with margin loss L, a function of the margin yf(x) (e.g., hinge loss):
h(x) = argmin_{l∈{1,...,k}} Σ_{j=1}^F L(M_{lj} f_j(x)).
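Loss-based decoding in the same style (illustrative; hinge loss by default, and zero matrix entries simply contribute the constant L(0)):

```python
import numpy as np

def loss_based_decoding(scores, M, L=lambda z: np.maximum(0.0, 1.0 - z)):
    """scores: (n, F) real-valued outputs f_j(x); M: (k, F) in {-1, 0, +1}.
    Returns, for each point, the class l minimizing sum_j L(M[l, j] * f_j(x))."""
    total = L(scores[:, None, :] * M[None, :, :]).sum(axis=2)  # (n, k)
    return total.argmin(axis=1)
```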

33 Ideas
Continuous codes: real-valued matrix M.
- learn the matrix code M.
- similar optimization problems with other matrix norms.
- kernel K used for similarity between a matrix row and the prediction vector.

34 Continuous Codes (Crammer and Singer, 2000, 2002)
Optimization problem (M_l is the lth row of M):
min_{M,ξ} ‖M‖ + C Σ_{i=1}^m ξ_i
subject to: K(f(x_i), M_{y_i}) ≥ K(f(x_i), M_l) + 1 - ξ_i, for all (i, l) ∈ [1, m] × [1, k].
Decision function:
h: x ↦ argmax_{l∈{1,...,k}} K(f(x), M_l).

35 Applications
One-vs-all approach is the most widely used.
No clear empirical evidence of the superiority of other approaches (Rifkin and Klautau, 2004), except perhaps on small data sets with relatively large error rate.
Large structured multi-class problems: often treated as ranking problems (see next lecture).

36 References
Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
Koby Crammer and Yoram Singer. Improved output coding for classification using continuous relaxation. In Proceedings of NIPS, 2000.
Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47, 2002.
Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research (JAIR), 2:263-286, 1995.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
John C. Platt, Nello Cristianini, and John Shawe-Taylor. Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems 12 (NIPS 1999), pp. 547-553, 2000.

37 References
Ryan Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine Learning. Ph.D. thesis, MIT, 2002.
Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.
Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651-1686, 1998.
Robert E. Schapire and Yoram Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
Jason Weston and Chris Watkins. Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 99), 1999.
