Supervised Learning of Non-binary Problems Part I: Multiclass Categorization via Output Codes
1 Supervised Learning of Non-binary Problems Part I: Multiclass Categorization via Output Codes
Yoram Singer, Hebrew University
Based on joint work with: Koby Crammer, Hebrew Univ.; Rob Schapire, AT&T Labs; Erin Allwein, SWRI
NATO ASI on Learning Theory and Practice, July, K.U. Leuven, Belgium
2 Problem Setting
Given: multiclass data, e.g. predict children's favorite Teletubby:
(name=guy, age=5, sex=m), (name=neta, age=2, sex=f), (name=dana, age=6, sex=f), ...
(the multiclass labels were rendered as Teletubby icons and are lost in transcription)
Given: binary (2-class) learning algorithm A:
(x1, +), (x2, −), (x3, +), ... → A → h, where h(x) ∈ {−, +}
3 Problem Setting (cont.)
Goal: use the binary learning algorithm to build a classifier for multiclass data:
(x1, y1), (x2, y2), (x3, y3), ... → multiclass learner A → H, where H(x) is one of the four classes
(again, the class icons are lost in transcription)
4 The One-against-all Approach [folklore]
break the k-class problem into k binary problems and solve each separately
how to combine predictions?
[table: each binary problem h_1, ..., h_4 labels the examples x_1, ..., x_5 as + for one class and − for all the rest]
evaluate all h's and hope exactly one is +
e.g.: if h1(x) = h2(x) = h4(x) = − and h3(x) = + then predict class 3
problem: will give an incorrect prediction even if only one h is wrong
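A minimal Python sketch of this reduction (the helper names are hypothetical; fit_binary stands for the abstract binary learner A and is assumed to return a callable h):

```python
import numpy as np

def train_one_vs_all(X, y, k, fit_binary):
    # One binary problem per class r: relabel class r as +1 and all others
    # as -1, then hand each relabeled problem to the binary learner.
    return [fit_binary(X, np.where(y == r, 1, -1)) for r in range(k)]

def predict_one_vs_all(classifiers, x):
    # Evaluate every h_r and hope exactly one says +; with real-valued h_r,
    # taking the argmax also breaks the ties the slide warns about.
    scores = [h(x) for h in classifiers]
    return int(np.argmax(scores))
```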
5 The All-pairs Approach [Hastie & Tibshirani]
create one binary problem for each pair of classes (class i vs. class j)
[table: the six pairwise problems h_1, ..., h_6 for four classes; each labels only the examples of the two classes in its pair, one class as + and the other as −]
not obvious how to combine predictions
6 Error-correcting Output Codes [Dietterich & Bakiri]
reduce to binary using a coding matrix M; rows of M are code words
[table: a 4 × 5 coding matrix M over {−, +}; each column s defines a binary problem h_s]
given x, choose the row of M closest to h(x) = (h1(x), ..., h5(x))
e.g.: if h(x) = (−, +, +, +, −) then predict the class whose code word is closest
7 ECOC (continued)
if the rows of M are far from one another, the code will be highly robust to errors
potentially much faster when k (# of classes) is large
disadvantage: the binary problems may be unnatural and hard to solve
8 Outline
- general framework for studying such reductions from multiclass to binary, particularly for margin-based binary learners
- decoding methods for combining predictions of binary classifiers: Hamming decoding; loss-based decoding
- general training error bounds
- generalization error bound when the binary learner is AdaBoost
- finding a good output coding matrix: discrete output codes; continuous output codes
- learning continuous output codes using QP
- generalization error bound for continuous output codes
- experiments with discrete and continuous codes
- variation on a theme: multilabel problems
9 A (Slightly) More General Approach
choose a matrix M over {−1, 0, +1}
[table: a 4 × 5 coding matrix with entries in {−, 0, +}; a zero entry means the corresponding class is omitted from that binary problem]
generalizes all three earlier approaches
how to do decoding?
10 Hamming Decoding (generalized)
Hamming distance between u, v ∈ {−1, 0, +1}^l is
$$\Delta(u, v) = \frac{1}{l} \sum_{s=1}^{l} \begin{cases} 0 & \text{if } u_s = v_s,\ u_s v_s \neq 0 \\ 1 & \text{if } u_s \neq v_s,\ u_s v_s \neq 0 \\ 1/2 & \text{if } u_s v_s = 0 \end{cases}$$
e.g.: Δ((+, −, 0, −, +), (+, +, −, +, 0)) = (1/5)(0 + 1 + 1/2 + 1 + 1/2) = 3/5
to classify x: evaluate h(x) = (h1(x), ..., hl(x)) and choose the row r closest to h(x) in Hamming distance
weakness: ignores the confidence of predictions
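Per coordinate this distance compresses to (1 − u_s v_s)/2; a minimal sketch:

```python
import numpy as np

def hamming_distance(u, v):
    # Generalized Hamming distance over {-1, 0, +1}^l: per coordinate the
    # cost is 0 when the entries agree (and are nonzero), 1 when they
    # disagree, and 1/2 when either is 0, i.e. (1 - u_s v_s)/2, averaged
    # over the l coordinates.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.mean((1.0 - u * v) / 2.0))

def hamming_decode(M, predictions):
    # Choose the row of M closest to the vector of binary predictions.
    return int(np.argmin([hamming_distance(row, predictions) for row in M]))
```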
11 Margin-based Binary Learning Algorithms
a typical binary classifier produces real-valued predictions, i.e., h(x) ∈ ℝ rather than h(x) ∈ {−, +}
interpretation: sign(h(x)) = predicted class (− or +); |h(x)| = confidence
the margin of a labeled example (x, y) is y·h(x):
y·h(x) > 0 ⇔ correct prediction; |y·h(x)| = |h(x)| = confidence
many learning algorithms attempt to find a classifier h minimizing $\sum_{i=1}^{m} L(y_i h(x_i))$,
where (x1, y1), ..., (xm, ym) is the given training set and L is a loss function of the margin
12 Examples of Margin-based Learning Algorithms
algorithm                   L(z)
support-vector machines     max{1 − z, 0}
logistic regression         ln(1 + e^{−z})
AdaBoost                    e^{−z}
least-squares regression    (1 − z)^2
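The same table as vectorized Python functions of the margin z = y·h(x), for use in the decoding sketches below:

```python
import numpy as np

# The margin losses L(z) from the table above.
LOSSES = {
    "svm":           lambda z: np.maximum(1.0 - z, 0.0),  # hinge loss
    "logistic":      lambda z: np.log(1.0 + np.exp(-z)),
    "adaboost":      lambda z: np.exp(-z),
    "least_squares": lambda z: (1.0 - z) ** 2,
}
```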
13 Loss-based Decoding
can be used instead of Hamming decoding when using a margin-based learning algorithm
to classify x: evaluate h(x) = (h1(x), ..., hl(x))
for each class r compute the total loss that would have been suffered if x were labeled r:
$$\sum_{s=1}^{l} L(M(r, s)\, h_s(x))$$
choose the class r that minimizes the total loss
14 Loss-based Decoding (example)
[the 4 × 5 coding matrix M of this example was lost in transcription]
h(x) = (1.9, −3.1, 4.2, 0.7, 6.2)
compute the total loss $\sum_s L(M(r, s)\, h_s(x))$ for each of the four classes
Hamming decoding (sign(h(x)) = (+, −, +, +, +)) predicts one class; loss-based decoding with L(z) = e^{−z} predicts a different one
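A sketch of the rule for an arbitrary coding matrix M (the example's own matrix is lost, so M here is whatever matrix is in use):

```python
import numpy as np

def loss_based_decode(M, h_of_x, L):
    # Total loss of labeling x with class r: sum_s L(M[r, s] * h_s(x)).
    # Zero entries of M contribute the constant L(0).
    M = np.asarray(M, dtype=float)
    h_of_x = np.asarray(h_of_x, dtype=float)
    totals = [float(np.sum(L(M[r] * h_of_x))) for r in range(M.shape[0])]
    return int(np.argmin(totals))

# With AdaBoost's exponential loss and the predictions above:
# loss_based_decode(M, [1.9, -3.1, 4.2, 0.7, 6.2], lambda z: np.exp(-z))
```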
15 Training Error Bounds
let ε̄ be the expected loss on the training set, averaged over all l binary problems:
$$\bar{\varepsilon} = \frac{1}{ml} \sum_{i=1}^{m} \sum_{s=1}^{l} L(M(y_i, s)\, h_s(x_i))$$
let ρ be the minimum distance between pairs of rows of M:
$$\rho = \min\{\Delta(M(r_1, \cdot), M(r_2, \cdot)) : r_1 \neq r_2\}$$
assume L(0) = 1 and (L(z) + L(−z))/2 ≥ L(0), plus other minor technical assumptions
Theorem: for loss-based decoding, (# training mistakes)/m ≤ ε̄/ρ; for Hamming decoding, (# training mistakes)/m ≤ 2ε̄/ρ
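Under this reconstruction of the theorem, both quantities are directly computable; a sketch (distance is e.g. the hamming_distance defined earlier, and F[i, s] = h_s(x_i)):

```python
import numpy as np

def rho(M, distance):
    # Minimum generalized Hamming distance between any two rows of M.
    k = len(M)
    return min(distance(M[r1], M[r2])
               for r1 in range(k) for r2 in range(r1 + 1, k))

def eps_bar(M, F, y, L):
    # Average loss over all m examples and all l binary problems.
    M, F = np.asarray(M, dtype=float), np.asarray(F, dtype=float)
    return float(np.mean(L(M[np.asarray(y)] * F)))

# With L(0) = 1, the theorem bounds the loss-based training error by
# eps_bar(M, F, y, L) / rho(M, hamming_distance).
```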
16 Proof Sketch
loss-based decoding incorrectly classifies an example (x, y) ⟹ for some r ≠ y:
$$\sum_{s=1}^{l} L(M(r, s) f_s(x)) \leq \sum_{s=1}^{l} L(M(y, s) f_s(x))$$
Define:
S = {s : M(r, s) ≠ M(y, s), M(r, s) M(y, s) ≠ 0},  S0 = {s : M(r, s) = 0 or M(y, s) = 0},
z_s = M(y, s) f_s(x),  z'_s = M(r, s) f_s(x)
Then (using L ≥ 0):
$$\sum_{s=1}^{l} L(z_s) \geq \frac{1}{2}\left(\sum_{s=1}^{l} L(z_s) + \sum_{s=1}^{l} L(z'_s)\right) \geq \frac{1}{2}\sum_{s \in S \cup S_0} \left(L(z_s) + L(z'_s)\right)$$
17 Proof Sketch (cont.)
$$\frac{1}{2}\sum_{s \in S \cup S_0} (L(z_s) + L(z'_s)) = \frac{1}{2}\sum_{s \in S} (L(z_s) + L(z'_s)) + \frac{1}{2}\sum_{s \in S_0} (L(z_s) + L(z'_s)) \geq L(0)\,|S| + \frac{L(0)}{2}\,|S_0| = L(0)\, l\, \Delta(M(r), M(y)) \geq \rho\, l\, L(0)$$
(for s ∈ S, z'_s = −z_s, so L(z_s) + L(z'_s) ≥ 2L(0); for s ∈ S0, one of the two losses equals L(0))
so a mistake on training example (x_i, y_i) implies
$$\sum_{s=1}^{l} L(M(y_i, s) f_s(x_i)) \geq \rho\, l\, L(0)$$
hence the number of training mistakes is at most
$$\frac{1}{\rho\, l\, L(0)} \sum_{i=1}^{m} \sum_{s=1}^{l} L(M(y_i, s) f_s(x_i)) = \frac{m\,\bar{\varepsilon}}{\rho\, L(0)}$$
and the training error is at most ε̄/(ρ L(0))
18 Discussion
want good performance on the binary problems (so that ε̄ is small)
want rows far apart from each other (so that ρ is large): ρ ≈ 1/2 for ECOC, ρ = 2/k for one-against-all
but: making ρ large may make the binary problems more difficult to solve
many zeros in the matrix make the binary problems easier and faster to solve
but: adding zeros tends to increase the average error ε̄
19 Generalization Error for Boosting with Loss-based Decoding - Outline
analyze performance on unseen data when the binary learner is AdaBoost
in the binary case, margins are used to analyze boosting (Schapire, Freund, Bartlett, Lee 1998):
larger empirical margins on the training data imply a better bound on generalization error, and boosting tends to increase training-data margins
for loss-based decoding, modify the classification rule; modifying the definition of margin accordingly gives a two-part analysis
20 Generalization Error Bounds for AdaBoost.MO
loss-based decoding for AdaBoost:
$$H(x) = \arg\min_{y \in \mathcal{Y}} \sum_{s=1}^{l} \exp(-M(y, s) f(x, s)) = \arg\min_{y \in \mathcal{Y}} \sum_{s=1}^{l} \exp\left(-M(y, s) \sum_{t=1}^{T} \alpha_t h_t(x, s)\right)$$
scaling f to [−1, +1] and applying the transform x ↦ −(1/η) ln(x/l):
$$\eta = \sum_{t=1}^{T} \alpha_t, \qquad f(x, s) = \frac{1}{\eta} \sum_{t=1}^{T} \alpha_t h_t(x, s)$$
$$\nu(f, \eta, x, y) = -\frac{1}{\eta} \ln\left(\frac{1}{l} \sum_{s=1}^{l} e^{-\eta M(y, s) f(x, s)}\right), \qquad H(x) = \arg\max_{y \in \mathcal{Y}} \nu(f, \eta, x, y)$$
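A small sketch of this rescaled decoding rule, assuming f(x, ·) has already been scaled to [−1, +1] as above:

```python
import numpy as np

def nu(M, f_of_x, eta, y):
    # nu(f, eta, x, y) = -(1/eta) * ln( (1/l) * sum_s exp(-eta * M[y,s] f(x,s)) )
    M = np.asarray(M, dtype=float)
    f_of_x = np.asarray(f_of_x, dtype=float)
    return float(-(1.0 / eta) * np.log(np.mean(np.exp(-eta * M[y] * f_of_x))))

def decode_mo(M, f_of_x, eta):
    # H(x) = argmax_y nu(f, eta, x, y); the monotone transform leaves the
    # argmin of the exponential total loss unchanged.
    return int(np.argmax([nu(M, f_of_x, eta, r) for r in range(len(M))]))
```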
21 Generalization Error Bounds for AdaBoost.MO (cont.)
modified definition of margin:
$$M_{f,\eta}(x, y) = \frac{1}{2}\left(\nu(f, \eta, x, y) - \max_{r \neq y} \nu(f, \eta, x, r)\right)$$
H is the base-hypothesis space; its convex hull is
$$\mathrm{co}(\mathcal{H}) = \left\{ f : x \mapsto \sum_h \alpha_h h(x) \;\middle|\; \alpha_h \geq 0,\ \sum_h \alpha_h = 1 \right\}$$
Theorem: let D be a distribution over X × Y, S a sample of m examples, suppose the base-classifier space H has VC-dimension d, and let M be the coding matrix; then w.p. 1 − δ every function f ∈ co(H) and every η > 0 satisfy
$$\Pr_D\left[M_{f,\eta}(x, y) \leq 0\right] \leq \Pr_S\left[M_{f,\eta}(x, y) \leq \theta\right] + O\left(\frac{1}{\sqrt{m}}\left(\frac{d \log^2(lm/d)}{\theta^2} + \log(1/\delta)\right)^{1/2}\right)$$
22 Experiments
tested SVM on 8 benchmark problems using both decoding methods and 5 different coding matrices
overall, loss-based decoding is better than Hamming decoding:
[figure: test error difference between Hamming and loss-based decoding vs. number of classes, for one-against-all, all-pairs, complete, random dense, and random sparse codes]
one-against-all is consistently worse than the other methods, but there is no clear pattern among the other methods:
[figure: test error difference between one-against-all and each of all-pairs, complete, random dense, and random sparse codes vs. number of classes]
23 So far we assumed that the matrix M is predefined
problems:
all columns have the same influence in the decoding process, but some binary problems might be simpler than others
each binary classifier induces the same loss for all labels, but some labels are easier to predict than others
correlations between the binary classifiers are not taken into account, but similar columns of M might result in highly correlated classifiers
idea: change M in accordance with the predictions of the binary classifiers
24 Code design - discrete case
the following problem is NP-complete:
given: a training set S = {(x1, y1), ..., (xm, ym)}; a set of binary classifiers h(x) = (h1(x), ..., hl(x)); an error bound q
find: a matrix M ∈ {−1, +1}^{k×l} such that the classifier H(x) based on M and h(·) makes at most q mistakes on S
however, if l is fixed there might exist a poly-time algorithm for finding the matrix M that attains the smallest number of errors on S
25 Code design - continuous case
analogously to the binary classifiers, let M ∈ ℝ^{k×l}
column t of M defines a partition of the labels into two disjoint sets:
sign(M_{r,t}) is the set (+1 or −1) to which class r belongs; |M_{r,t}| is the confidence in the associated partition
the classification rule: H(x) = argmax_r (h̄(x) · M̄_r), where M̄_r is the r-th row of M
26 Continuous codes - example
[the real-valued coding matrix M of this example was lost in transcription]
h(x) = (1.1, −3.1, 4.2, 0.7, 6.2)
compute the similarity h̄(x) · M̄_r for each class and predict the argmax
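The decoding rule itself is a single matrix-vector product; a minimal sketch:

```python
import numpy as np

def continuous_decode(M, h_of_x):
    # H(x) = argmax_r h(x) . M_r : rows of the real-valued code matrix act as
    # prototypes, and decoding picks the row with the largest inner product.
    M = np.asarray(M, dtype=float)
    return int(np.argmax(M @ np.asarray(h_of_x, dtype=float)))
```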
27 Multiclass Margin
[figure: similarity values per class in three cases:
(similarity of correct label) − (largest similarity of the rest) > 1 → no loss;
0 < (similarity of correct label) − (largest similarity of the rest) < 1 → loss;
(similarity of correct label) − (largest similarity of the rest) < 0 → loss]
replace the discrete multiclass loss with a piecewise-linear bound:
$$\epsilon_S = \frac{1}{m} \sum_{i=1}^{m} [\![H(x_i) \neq y_i]\!] \leq \frac{1}{m} \sum_{i=1}^{m} \left( \max_r \left\{ (\bar{h}(x_i) \cdot \bar{M}_r) + 1 - \delta_{y_i, r} \right\} - (\bar{h}(x_i) \cdot \bar{M}_{y_i}) \right)$$
28 Optimization problem
Minimize: $\beta \|M\|_p + \sum_{i=1}^{m} \xi_i$
subject to the constraints: $\forall i, r: \; (\bar{h}(x_i) \cdot \bar{M}_{y_i}) - (\bar{h}(x_i) \cdot \bar{M}_r) \geq 1 - \delta_{y_i, r} - \xi_i$
problem sizes for the norms l1, l2, l∞:
                       l1         l2       l∞
primal variables       m + 2kl    m + kl   m + kl + 1
primal constraints     km + 2kl   km       km + 2kl
dual variables         km + 2kl   km       km + 2kl
dual ≥-constraints     km + 2kl   km       km + 2kl
dual =-constraints     m + 2kl    m        m + kl + 1
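A minimal sketch of the l2 version of this program using the off-the-shelf solver cvxpy rather than the SPOC algorithm of the next slide; the squared Frobenius norm stands in for ‖M‖ and H is assumed to be the m × l matrix whose i-th row is h̄(x_i):

```python
import numpy as np
import cvxpy as cp

def learn_code_l2(H, y, k, beta=1.0):
    # min  beta * ||M||_F^2 + sum_i xi_i
    # s.t. h(x_i).M_{y_i} - h(x_i).M_r >= 1 - delta(y_i, r) - xi_i  for all i, r
    m, l = H.shape
    M = cp.Variable((k, l))
    xi = cp.Variable(m, nonneg=True)
    constraints = [
        H[i] @ M[y[i]] - H[i] @ M[r] >= 1.0 - (1.0 if r == y[i] else 0.0) - xi[i]
        for i in range(m) for r in range(k)
    ]
    cp.Problem(cp.Minimize(beta * cp.sum_squares(M) + cp.sum(xi)),
               constraints).solve()
    return M.value
```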
29 Constructing the code
using the dual program we get
$$\bar{M}_r = \beta^{-1} \sum_{i=1}^{m} [\delta_{y_i, r} - \eta_{i, r}]\, \bar{h}(x_i)$$
an example i participates in the solution for M if and only if (η_{i,1}, ..., η_{i,k}) is not a point distribution concentrated on the correct label y_i
the code is constructed from examples whose labels are hard to predict
efficient algorithm: SPOC - Support Pattern Output Coding:
- can be combined with Mercer kernels
- chooses a single example and efficiently solves a mini-QP problem
- constructs a continuous code or a multiclass SVM (h̄(x) = x̄)
- the number of iterations needed to get ε-close to the optimal solution can be bounded
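Given dual variables from any solver, recovering M is one line of linear algebra; a sketch (eta is assumed to be the m × k matrix of dual variables):

```python
import numpy as np

def code_from_dual(H, y, eta, beta):
    # M_r = (1/beta) * sum_i (delta(y_i, r) - eta[i, r]) * h(x_i)
    # H: m x l matrix of binary predictions; eta: m x k dual variables.
    m, k = eta.shape
    delta = np.zeros_like(eta)
    delta[np.arange(m), y] = 1.0
    return (delta - eta).T @ H / beta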
30 Generalization Error for Continuous Codes and Multiclass SVM
suppose we perfectly classify a random sample S of m examples using the matrix M; then w.p. 1 − δ the generalization error is bounded by
$$O\left(\frac{R^2}{m} \left( k \|M\|^2 \log(m) + k^2 \log\frac{m}{\delta} \right)\right)$$
where R is the radius of the ball containing the support of S
31 Example
[figures: a training set and the three one-against-rest binary problems: square vs. rest, star vs. rest, circle vs. rest]
32 Old and new decision boundaries
[figures: decision boundaries under the old (predefined) code and the new learned code]
33 Old and new codes
[figures: the identity (ID) code matrix and the new learned code matrix]
34 Experiments
[figures: relative performance (%) of learned continuous codes vs. id, BCH, and random initial codes on the satimage, shuttle, mnist, isolet, letter, vowel, glass, and soybean datasets]
35 Multilabel Problems
joint work with Eleazar Eskin (Columbia) and Yair Weiss (HUJI)
topics: Earnings, Forecasts, Share-Capital, Debt-Issues, Loans, Credits, Credit-Ratings, Mergers, Acquisitions, Asset-Transfers, ...
labeled documents:
Doc#1: {Share-Capital, Credit-Ratings, Asset-Transfers}
Doc#2: {Mergers, Acquisitions}
Doc#3: {Earnings}
Doc#4: {Earnings, Forecasts}
Doc#5: ...
task: find the set of relevant topics for each new document
36 Multilabel problems - formal definition
instances: x_i ∈ X
set of possible topics: Y = {1, 2, ..., K}
relevant topics: t_i ∈ 2^Y
labels: y_i ∈ {−1, +1}^K
example: Doc#17 with relevant topics {Share-Capital, Acquisitions} is encoded as the vector in {−1, +1}^K with +1 exactly at those two topics
find f : X → {−1, +1}^K
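The encoding of a topic set as a ±1 vector is mechanical; a sketch (the topic ids in the usage comment are hypothetical):

```python
def topics_to_vector(relevant, K):
    # Encode a set of relevant topics t in 2^Y as a label vector y in {-1,+1}^K.
    return [1 if topic in relevant else -1 for topic in range(1, K + 1)]

# e.g. with hypothetical topic ids Share-Capital=3 and Acquisitions=9, K=10:
# topics_to_vector({3, 9}, 10) -> [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1]
```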
37 Naïve Approach
train K classifiers f_i : X → {−1, +1}; f(x) = (f_1(x), ..., f_K(x))
classifier i predicts whether topic i is relevant (+1) or not (−1)
apply each classifier independently to each new instance
pros: simple
cons: error prone; does not account for correlations between topics
employed by AdaBoost.MH and versions of SVM
38 Adding Fault Tolerance
train classifiers that take topic correlations into account:
if an article is about Mergers, learn whether it is also about Acquisitions; if it is not about Mergers, abstain
obtain K² classifiers f_{i→j} : X → {−1, +1}
view each possible vector of labels in {−1, +1}^K as a super-label
build the corresponding output code of size 2^K × K²
use loss-based decoding
39 Multilabel + ECOC: example
three classes, nine classifiers: f_{1→1}, f_{2→2}, f_{3→3}, f_{1→2}, f_{1→3}, f_{2→1}, f_{2→3}, f_{3→1}, f_{3→2}
[the 8 × 9 output code matrix was lost in transcription]
but... classification time is exponential in K
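A sketch of one plausible encoding of this scheme (the exact matrix is lost, so the abstention convention below is an assumption): each column (i, j) is the target of f_{i→j}, which predicts b_j when topic i is relevant and abstains otherwise.

```python
from itertools import product
import numpy as np

def multilabel_code(K):
    # One row per super-label b in {-1,+1}^K, one column per ordered pair
    # (i, j). Assumed convention: entry is b_j when b_i = +1 (topic i is
    # relevant), and 0 (abstain) otherwise.
    rows = list(product([-1, +1], repeat=K))  # the 2^K super-labels
    M = np.zeros((2 ** K, K * K))
    for r, b in enumerate(rows):
        for i in range(K):
            for j in range(K):
                M[r, i * K + j] = b[j] if b[i] == +1 else 0
    return M

# multilabel_code(3).shape -> (8, 9), matching the 8 x 9 code of the slide.
```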
40 Bayes to the Rescue
define the probability of the K labels corresponding to row r:
$$P(r \mid x) \propto \exp\left(-\sum_{s=1}^{l} L(M(r, s) f_s(x))\right)$$
note that if
$$\sum_{s=1}^{l} L(M(r, s) f_s(x)) > \sum_{s=1}^{l} L(M(y, s) f_s(x))$$
then P(r | x) < P(y | x)
construct a belief net with loops from the constraints
use loopy belief propagation (max-product) to find the best K labels
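For small K the correspondence can be checked by brute force; a sketch (the enumeration over all 2^K rows is exactly what loopy max-product is meant to avoid):

```python
import numpy as np

def decode_by_probability(M, f_of_x, L):
    # P(r | x) is proportional to exp(-sum_s L(M[r, s] f_s(x))), so the most
    # probable row coincides with the loss-based decoding.
    M = np.asarray(M, dtype=float)
    f_of_x = np.asarray(f_of_x, dtype=float)
    losses = np.array([float(np.sum(L(M[r] * f_of_x))) for r in range(len(M))])
    p = np.exp(-losses)
    return int(np.argmax(p / p.sum()))
```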
41 Simulation Results
[figure: ECOC error vs. raw error]
apps in the works: multilabel text classification; gene multi-functionality classification
42 Summary
- general framework for solving multiclass problems with margin classifiers using output codes
- general empirical loss analysis
- generalization error analysis when the binary learner is AdaBoost
- experimental results support loss-based decoding
- design of continuous output codes: efficient algorithm, generalization analysis
research directions:
- online models for multiclass problems
- hierarchical classification and coding
- VC analysis of learning with output codes
- learning output codes using Bregman-Legendre projections
43 Based On
T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of AI Research, 2:263-286, 1995.
R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.
E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of ML Research, 1:113-141, 2000.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of ML Research, 2:265-292, 2001.
K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2/3):201-233, 2002.