Infinite Ensemble Learning with Support Vector Machinery

Infinite Ensemble Learning with Support Vector Machinery
Hsuan-Tien Lin and Ling Li
Learning Systems Group, California Institute of Technology
ECML/PKDD, October 4, 2005

Outline
1. Motivation of Infinite Ensemble Learning
2. Connecting SVM and Ensemble Learning
3. SVM-Based Framework of Infinite Ensemble Learning
4. Concrete Instance of the Framework: Stump Kernel
5. Experimental Comparison
6. Conclusion

Motivation of Infinite Ensemble Learning - Learning Problem
notation: example $x \in \mathcal{X} \subseteq \mathbb{R}^D$ and label $y \in \{+1, -1\}$
hypotheses (classifiers): functions from $\mathcal{X}$ to $\{+1, -1\}$
binary classification problem: given training examples and labels $\{(x_i, y_i)\}_{i=1}^{N}$, find a classifier $g(x): \mathcal{X} \to \{+1, -1\}$ that predicts the label of unseen $x$ well

Motivation of Infinite Ensemble Learning - Ensemble Learning
$g(x): \mathcal{X} \to \{+1, -1\}$
ensemble learning: popular paradigm (bagging, boosting, etc.)
ensemble: weighted vote of a committee of hypotheses, $g(x) = \mathrm{sign}\left(\sum_t w_t h_t(x)\right)$
$h_t$: base hypotheses, usually chosen from a set $\mathcal{H}$
$w_t$: nonnegative weight for $h_t$
an ensemble is usually better than an individual $h_t(x)$ in stability/performance
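
To make the voting rule concrete, here is a minimal sketch (the three stumps and their weights below are invented for illustration) of $g(x) = \mathrm{sign}\left(\sum_t w_t h_t(x)\right)$:

```python
import numpy as np

# Hypothetical committee of three base hypotheses (decision stumps on x[0] and x[1]).
hypotheses = [
    lambda x: 1.0 if x[0] > 0.3 else -1.0,
    lambda x: 1.0 if x[1] > -0.5 else -1.0,
    lambda x: -1.0 if x[0] > 0.8 else 1.0,
]
weights = np.array([0.5, 0.3, 0.2])  # nonnegative weights w_t

def ensemble_predict(x):
    votes = np.array([h(x) for h in hypotheses])
    return np.sign(weights @ votes)  # weighted vote, as in g(x) above

print(ensemble_predict(np.array([0.5, 0.0])))  # -> 1.0
```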

Motivation of Infinite Ensemble Learning - Infinite Ensemble Learning
$g(x) = \mathrm{sign}\left(\sum_t w_t h_t(x)\right)$, $h_t \in \mathcal{H}$, $w_t \geq 0$
the set $\mathcal{H}$ can be of infinite size
traditional algorithms: assign a finite number of nonzero $w_t$
1. is finiteness a regularization and/or a restriction?
2. how to handle an infinite number of nonzero weights?
[Figure: a finite ensemble (left) versus an infinite ensemble (right)]

Motivation of Infinite Ensemble Learning - SVM for Infinite Ensemble Learning
Support Vector Machine (SVM): large-margin hyperplane in some feature space
SVM: possibly infinite-dimensional hyperplane $g(x) = \mathrm{sign}\left(\sum_d w_d \phi_d(x) + b\right)$
an important machinery to conquer infinity: the kernel trick
how can we use Support Vector Machinery for infinite ensemble learning?

Connecting SVM and Ensemble Learning - Properties of SVM
$g(x) = \mathrm{sign}\left(\sum_{d=1}^{\infty} w_d \phi_d(x) + b\right) = \mathrm{sign}\left(\sum_{i=1}^{N} \lambda_i y_i K(x_i, x) + b\right)$
a successful large-margin learning algorithm
goal: (infinite-dimensional) large-margin hyperplane
$\min_{w,b} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{N} \xi_i$, s.t. $y_i\left(\sum_{d=1}^{\infty} w_d \phi_d(x_i) + b\right) \geq 1 - \xi_i$, $\xi_i \geq 0$
optimal hyperplane: represented through duality
key for handling infinity: computation with the kernel trick, $K(x, x') = \sum_{d=1}^{\infty} \phi_d(x)\phi_d(x')$
regularization: controlled with the trade-off parameter $C$
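
To make the dual form concrete, here is a minimal sketch (not the authors' code; the Gaussian kernel and all toy values below are my own choices for illustration) of evaluating $g(x) = \mathrm{sign}\left(\sum_i \lambda_i y_i K(x_i, x) + b\right)$: the classifier touches the data only through kernel evaluations, so $\phi$ may be infinite-dimensional.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # any positive semi-definite kernel works here; the Gaussian kernel is one example
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_X, y, lam, b, kernel=rbf_kernel):
    # g(x) = sign( sum_i lambda_i * y_i * K(x_i, x) + b )
    s = sum(l * yi * kernel(xi, x) for xi, yi, l in zip(support_X, y, lam))
    return np.sign(s + b)

# toy values (hypothetical, for illustration only)
support_X = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0])
lam = np.array([0.7, 0.7])
print(svm_decision(np.array([0.9, 1.1]), support_X, y, lam, b=0.0))  # -> 1.0
```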

Connecting SVM and Ensemble Learning - Properties of AdaBoost
$g(x) = \mathrm{sign}\left(\sum_{t=1}^{T} w_t h_t(x)\right)$
a successful ensemble learning algorithm
goal: asymptotically, a large-margin ensemble
$\min_{w,h} \ \|w\|_1$, s.t. $y_i\left(\sum_{t=1}^{\infty} w_t h_t(x_i)\right) \geq 1$, $w_t \geq 0$
optimal ensemble: approximated by a finite one
keys for a good approximation:
finiteness: some $h_{t_1}(x_i) = h_{t_2}(x_i)$ for all $i$
sparsity: the optimal ensemble usually has many zero weights
regularization: finite approximation

Connecting SVM and Ensemble Learning - Connection between SVM and AdaBoost
component: $\phi_d(x)$ (SVM) vs. $h_t(x)$ (AdaBoost)
classifier: $G(x) = \sum_k w_k \phi_k(x) + b$ (SVM) vs. $G(x) = \sum_k w_k h_k(x)$ with $w_k \geq 0$ (AdaBoost)
hard goal: $\min_w \|w\|_p$, s.t. $y_i G(x_i) \geq 1$, with $p = 2$ (SVM) vs. $p = 1$ (AdaBoost)
key for infinity: kernel trick (SVM) vs. finiteness and sparsity (AdaBoost)
regularization: soft-margin trade-off (SVM) vs. finite approximation (AdaBoost)

SVM-Based Framework of Infinite Ensemble Learning - Challenge
challenge: how to design a good infinite ensemble learning algorithm?
traditional ensemble learning: iterative, and cannot be directly generalized
our main contribution: a novel and powerful infinite ensemble learning algorithm built on Support Vector Machinery
our approach: embedding an infinite number of hypotheses in the SVM kernel, i.e., $K(x, x') = \sum_{t=1}^{\infty} h_t(x) h_t(x')$; the SVM classifier is then $g(x) = \mathrm{sign}\left(\sum_{t=1}^{\infty} w_t h_t(x) + b\right)$
1. does the kernel exist?
2. how to ensure $w_t \geq 0$?
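
A small sanity check of the embedding idea (my own sketch with a hypothetical finite hypothesis set): when $\mathcal{H}$ is finite, $K(x, x') = \sum_t h_t(x) h_t(x')$ is just the inner product of hypothesis-output vectors, so any Gram matrix it produces is positive semi-definite by construction.

```python
import numpy as np

# hypothetical finite hypothesis set: 1-D decision stumps at fixed thresholds
thresholds = np.linspace(-1.0, 1.0, 21)

def h_outputs(x):
    # vector (h_1(x), ..., h_T(x)) of all hypothesis outputs on a scalar input x
    return np.where(x > thresholds, 1.0, -1.0)

def embedded_kernel(x, z):
    return h_outputs(x) @ h_outputs(z)  # K(x, z) = sum_t h_t(x) h_t(z)

X = np.array([-0.7, 0.1, 0.4, 0.9])
K = np.array([[embedded_kernel(a, b) for b in X] for a in X])
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))  # PSD, as any Gram matrix must be
```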

SVM-Based Framework of Infinite Ensemble Learning - Embedding Hypotheses into the Kernel
Definition. The kernel that embodies $\mathcal{H} = \{h_\alpha : \alpha \in \mathcal{C}\}$ is defined as $K_{\mathcal{H},r}(x, x') = \int_{\mathcal{C}} \phi_x(\alpha)\,\phi_{x'}(\alpha)\, d\alpha$, where $\mathcal{C}$ is a measure space, $\phi_x(\alpha) = r(\alpha)\, h_\alpha(x)$, and $r: \mathcal{C} \to \mathbb{R}^+$ is chosen such that the integral always exists.
integral instead of sum: works even for an uncountable $\mathcal{H}$
existence problem handled with a suitable $r(\cdot)$
$K_{\mathcal{H},r}(x, x')$: an inner product of $\phi_x$ and $\phi_{x'}$ in $\mathcal{F} = L_2(\mathcal{C})$
the classifier: $g(x) = \mathrm{sign}\left(\int_{\mathcal{C}} w(\alpha)\, r(\alpha)\, h_\alpha(x)\, d\alpha + b\right)$

SVM-Based Framework of Infinite Ensemble Learning - Negation Completeness and Constant Hypotheses
$g(x) = \mathrm{sign}\left(\int_{\mathcal{C}} w(\alpha)\, r(\alpha)\, h_\alpha(x)\, d\alpha + b\right)$
not an ensemble classifier yet
is $w(\alpha) \geq 0$? hard to handle: possibly uncountably many constraints
simple under the negation completeness assumption on $\mathcal{H}$ ($h \in \mathcal{H}$ if and only if $(-h) \in \mathcal{H}$), e.g., neural networks, perceptrons, decision trees, etc.
for any $w$, there exists a nonnegative $\tilde{w}$ that produces the same $g$
what is $b$? equivalently, the weight on a constant hypothesis
another assumption: $\mathcal{H}$ contains a constant hypothesis
with these mild assumptions, $g(x)$ is equivalent to an ensemble classifier

SVM-Based Framework of Infinite Ensemble Learning - Framework of Infinite Ensemble Learning
Algorithm:
1. Consider a hypothesis set $\mathcal{H}$ (negation complete and containing a constant hypothesis)
2. Construct a kernel $K_{\mathcal{H},r}$ with a proper $r(\cdot)$
3. Properly choose the other SVM parameters
4. Train SVM with $K_{\mathcal{H},r}$ and $\{(x_i, y_i)\}_{i=1}^{N}$ to obtain $\lambda_i$ and $b$
5. Output $g(x) = \mathrm{sign}\left(\sum_{i=1}^{N} y_i \lambda_i K_{\mathcal{H},r}(x_i, x) + b\right)$
hard part: kernel construction
SVM as an optimization machinery: training routines are widely available
SVM as a well-studied learning model: inherits its profound regularization properties
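
Steps 3 to 5 of the framework are ordinary SVM training once a Gram matrix is available. Below is a minimal sketch using scikit-learn's `SVC(kernel="precomputed")`; the kernel used here is the stump kernel of the next section (one concrete choice of $K_{\mathcal{H},r}$), the `stump_kernel_matrix` helper is my own name, and the data is synthetic.

```python
import numpy as np
from sklearn.svm import SVC

def stump_kernel_matrix(A, B, delta=3.0):
    # one concrete K_{H,r}: the stump kernel K_S(x, x') = Delta_S - ||x - x'||_1,
    # with Delta_S = 0.5 * sum_d (R_d - L_d) = 3.0 for X = [-1, 1]^3
    return delta - np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(80, 3))
y_train = np.sign(X_train[:, 0] + 0.5 * X_train[:, 1])
X_test = rng.uniform(-1, 1, size=(20, 3))

clf = SVC(C=1.0, kernel="precomputed")                      # step 3: choose C; step 4: train
clf.fit(stump_kernel_matrix(X_train, X_train), y_train)
y_pred = clf.predict(stump_kernel_matrix(X_test, X_train))  # step 5: evaluate g(x)
```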

Concrete Instance of the Framework: Stump Kernel - Decision Stump
decision stump: $s_{q,d,\alpha}(x) = q \cdot \mathrm{sign}\left((x)_d - \alpha\right)$
simplicity: popular for ensemble learning
[Figure: illustration of the decision stump $s_{+1,2,\alpha}(x)$; (a) decision process: answer $+1$ if $(x)_2 \geq \alpha$, else $-1$; (b) decision boundary $(x)_2 = \alpha$ in the $((x)_1, (x)_2)$ plane]
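
As a tiny illustration (my sketch; note the slide's index $d$ is 1-based while the array index below is 0-based, and the tie at $(x)_d = \alpha$ is broken toward $+1$ as one possible convention):

```python
import numpy as np

def decision_stump(x, q, d, alpha):
    # s_{q,d,alpha}(x) = q * sign(x_d - alpha); d is a 0-based array index here
    return q * (1.0 if x[d] >= alpha else -1.0)

print(decision_stump(np.array([0.2, 0.7]), q=+1, d=1, alpha=0.5))  # s_{+1,2,0.5}(x) -> 1.0
```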

Concrete Instance of the Framework: Stump Kernel - Stump Kernel
consider the set of decision stumps $\mathcal{S} = \{s_{q,d,\alpha_d} : q \in \{+1, -1\},\ d \in \{1, \ldots, D\},\ \alpha_d \in [L_d, R_d]\}$
when $\mathcal{X} \subseteq [L_1, R_1] \times [L_2, R_2] \times \cdots \times [L_D, R_D]$, $\mathcal{S}$ is negation complete and contains a constant hypothesis
Definition. The stump kernel $K_{\mathcal{S}}$ is defined for $\mathcal{S}$ with $r(q, d, \alpha_d) = \frac{1}{2}$:
$K_{\mathcal{S}}(x, x') = \Delta_{\mathcal{S}} - \sum_{d=1}^{D} \left|(x)_d - (x')_d\right| = \Delta_{\mathcal{S}} - \|x - x'\|_1$,
where $\Delta_{\mathcal{S}} = \frac{1}{2}\sum_{d=1}^{D}(R_d - L_d)$ is a constant
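
The closed form above can be checked numerically. The sketch below (my own, assuming $\mathcal{X} = [-1, 1]^2$ and using only numpy and scipy) compares the closed-form $K_{\mathcal{S}}$ against a direct quadrature of $\sum_q \sum_d \int_{L_d}^{R_d} r^2\, s_{q,d,\alpha}(x)\, s_{q,d,\alpha}(x')\, d\alpha$ with $r = \frac{1}{2}$.

```python
import numpy as np
from scipy.integrate import quad

L, R = -1.0, 1.0          # assume X = [-1, 1]^2, so L_d = -1 and R_d = 1 for every d

def stump_kernel(x, z):
    # closed form: K_S(x, x') = Delta_S - ||x - x'||_1, Delta_S = 0.5 * sum_d (R_d - L_d)
    delta_S = 0.5 * sum(R - L for _ in x)
    return delta_S - np.sum(np.abs(x - z))

def stump_kernel_quadrature(x, z, r=0.5):
    # integrate sum_q sum_d r^2 * s_{q,d,alpha}(x) * s_{q,d,alpha}(z) over alpha in [L, R]
    total = 0.0
    for q in (+1.0, -1.0):
        for d in range(len(x)):
            f = lambda a: (r * q * np.sign(x[d] - a)) * (r * q * np.sign(z[d] - a))
            total += quad(f, L, R, points=[x[d], z[d]])[0]
    return total

x, z = np.array([0.3, -0.2]), np.array([-0.1, 0.5])
print(stump_kernel(x, z), stump_kernel_quadrature(x, z))  # both ~ 2 - 1.1 = 0.9
```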

Concrete Instance of the Framework: Stump Kernel - Properties of Stump Kernel
simple to compute: can even use the simpler $\tilde{K}_{\mathcal{S}}(x, x') = -\|x - x'\|_1$ while getting the same solution; under the dual constraint $\sum_i y_i \lambda_i = 0$, using $K_{\mathcal{S}}$ or $\tilde{K}_{\mathcal{S}}$ is the same
feature-space explanation for the $\ell_1$-norm distance
infinite power: under mild assumptions, SVM-Stump with $C = \infty$ can perfectly classify all training examples
if there is a dimension for which all feature values are different, the kernel matrix $K$ with $K_{ij} = K(x_i, x_j)$ is strictly positive definite
similar power to the popular Gaussian kernel $\exp\left(-\gamma \|x - x'\|_2^2\right)$
suitable control on the power leads to good performance
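
A quick numerical illustration (my own sketch on synthetic data) of the strict positive definiteness claim: with continuous random features, every dimension has all-distinct values almost surely, so the smallest eigenvalue of the stump-kernel Gram matrix should be strictly positive.

```python
import numpy as np

def stump_gram(X, delta):
    # K_ij = Delta_S - ||x_i - x_j||_1
    return delta - np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(30, 4))   # distinct feature values in every dimension (a.s.)
delta_S = 0.5 * 4 * 2.0                # X subset of [-1, 1]^4
K = stump_gram(X, delta_S)
print(np.linalg.eigvalsh(K).min())     # strictly positive, per the slide's claim
```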

Concrete Instance of the Framework: Stump Kernel - Properties of Stump Kernel (Cont'd)
fast automatic parameter selection: only needs to search for a good soft-margin parameter $C$, since scaling the stump kernel is equivalent to scaling $C$
the Gaussian kernel depends on a good $(\gamma, C)$ pair (10 times slower)
well suited to some specific applications: cancer prediction with gene expressions (Lin and Li, ECML/PKDD Discovery Challenge, 2005)
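
A parameter-selection sketch under these observations (scikit-learn, synthetic data, and a hypothetical grid of $C$ values chosen by me): only the soft-margin parameter $C$ is searched, since rescaling the stump kernel would merely rescale the best $C$.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 3))
y = np.sign(X[:, 0] - X[:, 2])

# stump kernel Gram matrix for X in [-1, 1]^3 (Delta_S = 3)
K = 3.0 - np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

# only C needs to be searched
for C in [0.1, 1.0, 10.0, 100.0]:
    score = cross_val_score(SVC(C=C, kernel="precomputed"), K, y, cv=5).mean()
    print(C, round(score, 3))
```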

Concrete Instance of the Framework: Stump Kernel - Infinite Decision Stump Ensemble
$g(x) = \mathrm{sign}\left(\sum_{q \in \{+1,-1\}} \sum_{d=1}^{D} \int_{L_d}^{R_d} w_{q,d}(\alpha)\, s_{q,d,\alpha}(x)\, d\alpha + b\right)$
each $s_{q,d,\alpha}$: infinitesimal influence $w_{q,d}(\alpha)\, d\alpha$
equivalently, $g(x) = \mathrm{sign}\left(\sum_{q \in \{+1,-1\}} \sum_{d=1}^{D} \sum_{a=0}^{A_d} \hat{w}_{q,d,a}\, \hat{s}_{q,d,a}(x) + b\right)$
$\hat{s}$: a smoother variant of the decision stump
[Figure: the smooth stump $\hat{s}(x)$ as a function of $(x)_d$, with its transition between $(x_i)_d$ and $(x_j)_d$]

Concrete Instance of the Framework: Stump Kernel - Infinitesimal Influence
$g(x) = \mathrm{sign}\left(\sum_{q \in \{+1,-1\}} \sum_{d=1}^{D} \sum_{a=0}^{A_d} \hat{w}_{q,d,a}\, \hat{s}_{q,d,a}(x) + b\right)$
infinity becomes a dense combination of a finite number of smooth stumps
infinitesimal influence becomes a concrete weight on the smooth stumps
[Figure: the smooth stump $\hat{s}(x)$ and the middle stump $s(x)$ as functions of $(x)_d$, with their transition/threshold between $(x_i)_d$ and $(x_j)_d$]
SVM: dense combination of smoothed stumps
AdaBoost: sparse combination of middle stumps
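
For illustration only, here is one plausible shape of the smooth stump $\hat{s}$ suggested by the figure: $-q$ below $(x_i)_d$, $+q$ above $(x_j)_d$, and linear in between (a sketch under my own assumptions, not the paper's exact definition).

```python
import numpy as np

def smooth_stump(x_d, lo, hi, q=+1):
    # -q below lo, +q above hi, linear in between (one plausible smoothing)
    t = np.clip((x_d - lo) / (hi - lo), 0.0, 1.0)
    return q * (2.0 * t - 1.0)

print(smooth_stump(0.25, lo=0.0, hi=1.0))  # -> -0.5, partway between -1 and +1
```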

Experimental Comparison - Experiment Setting
ensemble learning algorithms:
SVM-Stump: infinite ensemble of decision stumps (dense ensemble of smooth stumps)
SVM-Mid: dense ensemble of middle stumps
AdaBoost-Stump: sparse ensemble of middle stumps
SVM algorithms: SVM-Stump versus SVM-Gauss
artificial, noisy, and real-world datasets
cross-validation for automatic parameter selection of SVM
evaluation on a hold-out test set, averaged over 100 different splits

Experimental Comparison - Comparison between SVM and AdaBoost
[Figure: test error (%) of SVM-Stump, SVM-Mid, AdaBoost-Stump(100), and AdaBoost-Stump(1000) on the tw, twn, th, thn, ri, rin, aus, bre, ger, hea, ion, pim, son, and vot datasets]
Results:
a fair comparison between AdaBoost and SVM
SVM-Stump is usually best: it benefits from going to infinity
SVM-Mid is also good: it benefits from having a dense ensemble
sparsity and finiteness are restrictions

Experimental Comparison - Comparison between SVM and AdaBoost (Cont'd)
[Figure: decision boundaries in the $((x)_1, (x)_2)$ plane; left to right: SVM-Stump, SVM-Mid, AdaBoost-Stump]
smoother boundary with the infinite ensemble (SVM-Stump)
still fits well with the dense ensemble (SVM-Mid)
cannot fit well when sparse and finite (AdaBoost-Stump)

Experimental Comparison - Comparison of SVM Kernels
[Figure: test error (%) of SVM-Stump and SVM-Gauss on the tw, twn, th, thn, ri, rin, aus, bre, ger, hea, ion, pim, son, and vot datasets]
Results:
SVM-Stump is only a bit worse than SVM-Gauss
still beneficial in some applications, thanks to faster parameter selection

Conclusion
a novel and powerful framework for infinite ensemble learning
derived a new and meaningful kernel, the stump kernel: succeeded in specific applications
infinite ensemble learning could be better: existing AdaBoost-Stump applications may switch
not the only kernel: the perceptron kernel gives an infinite ensemble of perceptrons; the Laplacian kernel gives an infinite ensemble of decision trees
SVM: our machinery for conquering infinity
possible to apply similar machinery to other areas that need infinite or massive aggregation