Machine Learning Techniques ( 機器學習技法 ) Lecture 7: Blending and Bagging Hsuan-Tien Lin ( 林軒田 ) htlin@csie.ntu.edu.tw Department of Computer Science & Information Engineering National Taiwan University ( 國立台灣大學資訊工程系 ) Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/23

Roadmap 1 Embedding Numerous Features: Kernel Models Lecture 6: Support Vector Regression kernel ridge regression (dense) via ridge regression + representer theorem; support vector regression (sparse) via regularized tube error + Lagrange dual 2 Combining Predictive Features: Aggregation Models Lecture 7: Blending and Bagging Motivation of Aggregation Uniform Blending Linear and Any Blending Bagging (Bootstrap Aggregation) 3 Distilling Implicit Features: Extraction Models Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 1/23

Motivation of Aggregation An Aggregation Story Your T friends g_1, ..., g_T predict whether the stock will go up, as g_t(x). You can... select the most trustworthy friend from their usual performance: validation! mix the predictions from all your friends uniformly: let them vote! mix the predictions from all your friends non-uniformly: let them vote, but give some friends more ballots combine the predictions conditionally: if [t satisfies some condition] give some ballots to friend t ... aggregation models: mix or combine hypotheses (for better performance) Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/23

Motivation of Aggregation Aggregation with Math Notations Your T friends g_1, ..., g_T predict whether the stock will go up as g_t(x). select the most trustworthy friend from their usual performance: G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,...,T}} E_val(g_t) mix the predictions from all your friends uniformly: G(x) = sign( Σ_{t=1}^T 1 · g_t(x) ) mix the predictions from all your friends non-uniformly: G(x) = sign( Σ_{t=1}^T α_t g_t(x) ) with α_t ≥ 0; includes select: α_t = [[E_val(g_t) smallest]]; includes uniform: α_t = 1 combine the predictions conditionally: G(x) = sign( Σ_{t=1}^T q_t(x) g_t(x) ) with q_t(x) ≥ 0; includes non-uniform: q_t(x) = α_t aggregation models: a rich family Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 3/23
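As a concrete companion to the notation above (an editorial sketch, not part of the lecture), the following minimal Python/NumPy snippet implements the four aggregation schemes for three toy hypotheses; the hypotheses, the validation errors E_val, the weights alpha_t, and the condition functions q_t are all made up for illustration:

```python
import numpy as np

# three toy hypotheses g_t: R -> {-1, +1}  (hypothetical stand-ins for the friends)
g_list = [lambda x: np.sign(1 - x),
          lambda x: np.sign(1 + x),
          lambda x: -np.ones_like(x)]
E_val = np.array([0.30, 0.25, 0.40])          # assumed validation errors E_val(g_t)
alpha = np.array([1.0, 2.0, 0.5])             # assumed non-negative ballots alpha_t
q_list = [lambda x: (x < 0).astype(float),    # assumed condition weights q_t(x) >= 0
          lambda x: (x >= 0).astype(float),
          lambda x: np.zeros_like(x)]

def G_select(x):        # pick the single best-validated hypothesis
    return g_list[int(np.argmin(E_val))](x)

def G_uniform(x):       # one ballot each
    return np.sign(sum(g(x) for g in g_list))

def G_linear(x):        # alpha_t ballots each
    return np.sign(sum(a * g(x) for a, g in zip(alpha, g_list)))

def G_conditional(x):   # q_t(x) ballots, depending on x
    return np.sign(sum(q(x) * g(x) for q, g in zip(q_list, g_list)))

x = np.array([-2.0, 0.0, 2.0])
print(G_select(x), G_uniform(x), G_linear(x), G_conditional(x))
```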

Motivation of Aggregation Recall: Selection by Validation G(x) = g_{t*}(x) with t* = argmin_{t ∈ {1,2,...,T}} E_val(g_t) simple and popular what if we use E_in(g_t) instead of E_val(g_t)? complexity price on d_VC, remember? :-) need one strong g_t to guarantee small E_val (and small E_out) selection: rely on one strong hypothesis aggregation: can we do better with many (possibly weaker) hypotheses? Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/23

Blending and Bagging Motivation of Aggregation Why Might Aggregation Work? mix different weak hypotheses uniformly ⇒ G(x) strong: aggregation = feature transform (?) mix different random-PLA hypotheses uniformly ⇒ G(x) moderate: aggregation = regularization (?) proper aggregation = better performance Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 5/23

Motivation of Aggregation Fun Time Consider three decision stump hypotheses from R to {-1, +1}: g_1(x) = sign(1 - x), g_2(x) = sign(1 + x), g_3(x) = -1. When mixing the three hypotheses uniformly, what is the resulting G(x)? (1) 2 [[|x| ≤ 1]] - 1 (2) 2 [[|x| ≥ 1]] - 1 (3) 2 [[x ≤ 1]] - 1 (4) 2 [[x ≥ +1]] - 1

Motivation of Aggregation Fun Time Consider three decision stump hypotheses from R to {-1, +1}: g_1(x) = sign(1 - x), g_2(x) = sign(1 + x), g_3(x) = -1. When mixing the three hypotheses uniformly, what is the resulting G(x)? (1) 2 [[|x| ≤ 1]] - 1 (2) 2 [[|x| ≥ 1]] - 1 (3) 2 [[x ≤ 1]] - 1 (4) 2 [[x ≥ +1]] - 1 Reference Answer: 1 The region that gets two positive votes from g_1 and g_2 is |x| ≤ 1, and thus G(x) is positive within that region only. We see that the three decision stumps g_t can be aggregated to form a more sophisticated hypothesis G. Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/23
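A quick numerical spot-check of the reference answer (an editorial sketch, not from the slides), evaluating the uniform vote away from the boundary points x = ±1:

```python
import numpy as np

g1 = lambda x: np.sign(1 - x)
g2 = lambda x: np.sign(1 + x)
g3 = lambda x: -1.0
G  = lambda x: np.sign(g1(x) + g2(x) + g3(x))   # uniform vote of the three stumps

for x in (-2.0, 0.0, 2.0):      # stay away from the boundary points x = -1, +1
    print(x, G(x))              # -> -1.0, then 1.0, then -1.0
```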

Uniform Blending Uniform Blending (Voting) for Classification uniform blending: known g_t, each with 1 ballot: G(x) = sign( Σ_{t=1}^T 1 · g_t(x) ) same g_t (autocracy): as good as one single g_t very different g_t (diversity + democracy): majority can correct minority similar results with uniform voting for multiclass: G(x) = argmax_{1 ≤ k ≤ K} Σ_{t=1}^T [[g_t(x) = k]] how about regression? Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 7/23
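For the multiclass case, the argmax over per-class ballot counts can be sketched as follows (an editorial illustration with hypothetical labels in {1, ..., K}; not from the lecture):

```python
import numpy as np

def uniform_vote_multiclass(preds, K):
    """preds: the T predicted labels g_t(x) in {1, ..., K} for one input x."""
    counts = np.bincount(preds, minlength=K + 1)   # counts[k] = number of ballots for class k
    return int(np.argmax(counts[1:]) + 1)          # argmax over k of the vote count

print(uniform_vote_multiclass([2, 3, 2, 1, 2], K=3))   # -> 2
```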

Uniform Blending Uniform Blending for Regression G(x) = (1/T) Σ_{t=1}^T g_t(x) same g_t (autocracy): as good as one single g_t very different g_t (diversity + democracy): some g_t(x) > f(x), some g_t(x) < f(x) ⇒ average could be more accurate than individual diverse hypotheses: even simple uniform blending can be better than any single hypothesis Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/23

Uniform Blending Theoretical Analysis of Uniform Blending G(x) = (1/T) Σ_{t=1}^T g_t(x) avg( (g_t(x) - f(x))^2 ) = avg( g_t^2 - 2 g_t f + f^2 ) = avg( g_t^2 ) - 2 G f + f^2 = avg( g_t^2 ) - G^2 + (G - f)^2 = avg( g_t^2 ) - 2 G^2 + G^2 + (G - f)^2 = avg( g_t^2 - 2 g_t G + G^2 ) + (G - f)^2 = avg( (g_t - G)^2 ) + (G - f)^2 avg( E_out(g_t) ) = avg( E[(g_t - G)^2] ) + E_out(G) ≥ E_out(G) Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/23
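The resulting identity avg((g_t - f)^2) = avg((g_t - G)^2) + (G - f)^2 can be verified numerically; this small sketch (editorial, with arbitrary toy values for g_t(x) and f(x) at a single point x) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
gt_x = rng.normal(size=10)      # toy values g_t(x) of T = 10 hypotheses at one point x
f_x  = 0.7                      # toy target value f(x)
G_x  = gt_x.mean()              # uniform blend G(x) = (1/T) sum_t g_t(x)

lhs = np.mean((gt_x - f_x) ** 2)                        # avg((g_t - f)^2)
rhs = np.mean((gt_x - G_x) ** 2) + (G_x - f_x) ** 2     # avg((g_t - G)^2) + (G - f)^2
print(np.isclose(lhs, rhs))                             # -> True
```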

Uniform Blending Some Special g_t consider a virtual iterative process that for t = 1, 2, ..., T: (1) request size-N data D_t from P^N (i.i.d.); (2) obtain g_t by A(D_t) ḡ = lim_{T→∞} G = lim_{T→∞} (1/T) Σ_{t=1}^T g_t = E_D[ A(D) ] avg( E_out(g_t) ) = avg( E[(g_t - ḡ)^2] ) + E_out(ḡ) expected performance of A = expected deviation to consensus + performance of consensus performance of consensus: called bias expected deviation to consensus: called variance uniform blending: reduces variance for more stable performance Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/23
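To make the virtual iterative process concrete, here is a toy simulation sketch (editorial, with an assumed target, noise level, and a least-squares line fit standing in for the base algorithm A); the averaged test error of the g_t splits exactly into the variance term plus the error of the empirical consensus:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)                   # toy target function
x_test = np.linspace(-1, 1, 200)

def A(x, y):                                      # toy base algorithm: least-squares line fit
    return np.polyfit(x, y, deg=1)

models = []
for t in range(2000):                             # "virtual" process: fresh D_t each round
    x = rng.uniform(-1, 1, size=20)
    y = f(x) + 0.1 * rng.normal(size=20)
    models.append(A(x, y))

preds = np.array([np.polyval(w, x_test) for w in models])   # g_t evaluated on a test grid
g_bar = preds.mean(axis=0)                                   # consensus (empirical g-bar)

avg_Eout = np.mean((preds - f(x_test)) ** 2)      # avg_t E_out(g_t) on the grid
variance = np.mean((preds - g_bar) ** 2)          # expected deviation to consensus
bias     = np.mean((g_bar - f(x_test)) ** 2)      # E_out of the consensus
print(np.isclose(avg_Eout, variance + bias))      # -> True (exact for the empirical consensus)
```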

Uniform Blending Consider applying uniform blending G(x) = (1/T) Σ_{t=1}^T g_t(x) on linear regression hypotheses g_t(x) = innerprod(w_t, x). Which of the following properties best describes the resulting G(x)? 1 a constant function of x 2 a linear function of x 3 a quadratic function of x 4 none of the other choices

Uniform Blending Consider applying uniform blending G(x) = (1/T) Σ_{t=1}^T g_t(x) on linear regression hypotheses g_t(x) = innerprod(w_t, x). Which of the following properties best describes the resulting G(x)? 1 a constant function of x 2 a linear function of x 3 a quadratic function of x 4 none of the other choices Reference Answer: 2 G(x) = innerprod( (1/T) Σ_{t=1}^T w_t, x ), which is clearly a linear function of x. Note that we write innerprod instead of the usual transpose notation to avoid symbol conflict with T (the number of hypotheses). Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 11/23

Linear and Any Blending Linear Blending linear blending: known g_t, each to be given α_t ballots: G(x) = sign( Σ_{t=1}^T α_t g_t(x) ) with α_t ≥ 0 computing good α_t: min_{α_t ≥ 0} E_in(α) linear blending for regression: min_{α_t ≥ 0} (1/N) Σ_{n=1}^N ( y_n - Σ_{t=1}^T α_t g_t(x_n) )^2 LinReg + transformation: min_{w_i} (1/N) Σ_{n=1}^N ( y_n - Σ_{i=1}^d w_i φ_i(x_n) )^2 like two-level learning, remember? :-) linear blending = LinModel + hypotheses as transform + constraints Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 12/23
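A minimal sketch of this two-level view (editorial; hypothetical g_t and toy data, with plain least squares as the second-level linear model, i.e. the α_t ≥ 0 constraint is dropped, as the next slide argues is often done in practice):

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical first-level hypotheses g_t (assumed already trained elsewhere)
g_list = [lambda x: np.sin(x), lambda x: x, lambda x: x ** 2]

# toy data playing the role of the blending set
x_blend = rng.uniform(-2, 2, size=100)
y_blend = 0.5 * np.sin(x_blend) + 0.3 * x_blend + 0.1 * rng.normal(size=100)

# two-level learning: feature transform z_n = (g_1(x_n), ..., g_T(x_n)), then linear regression
Phi = np.column_stack([g(x_blend) for g in g_list])
alpha, *_ = np.linalg.lstsq(Phi, y_blend, rcond=None)

def G_linb(x):                     # G(x) = sum_t alpha_t g_t(x)
    return sum(a * g(x) for a, g in zip(alpha, g_list))

print(alpha)                       # learned blending weights (roughly 0.5, 0.3, 0.0 here)
print(G_linb(0.5))
```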

Linear and Any Blending Constraint on α_t linear blending = LinModel + hypotheses as transform + constraints: min_{α_t ≥ 0} (1/N) Σ_{n=1}^N err( y_n, Σ_{t=1}^T α_t g_t(x_n) ) linear blending for binary classification: if α_t < 0 ⇒ α_t g_t(x) = |α_t| (-g_t(x)), so a negative α_t for g_t acts like a positive |α_t| for -g_t if you have a stock up/down classifier with 99% error, tell me! :-) in practice, the constraint is often simply dropped: linear blending = LinModel + hypotheses as transform (+ constraints) Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 13/23

Linear and Any Blending Linear Blending versus Selection g_1 ∈ H_1, g_2 ∈ H_2, ..., g_T ∈ H_T recall: selection by minimum E_in, i.e. best of best, paying d_VC( ∪_{t=1}^T H_t ) recall: linear blending includes selection as special case, by setting α_t = [[E_val(g_t) smallest]] complexity price of linear blending with E_in (aggregation of best): ≥ d_VC( ∪_{t=1}^T H_t ) like selection, blending practically done with (E_val instead of E_in) + (g_t from minimum E_train) Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 14/23

Linear and Any Blending Any Blending Given g_1, g_2, ..., g_T from D_train, transform (x_n, y_n) in D_val to (z_n = Φ(x_n), y_n), where Φ(x) = (g_1(x), ..., g_T(x)) Linear Blending: (1) compute α = LinearModel( {(z_n, y_n)} ) (2) return G_LINB(x) = LinearHypothesis_α(Φ(x)) Any Blending (Stacking): (1) compute g = AnyModel( {(z_n, y_n)} ) (2) return G_ANYB(x) = g(Φ(x)), where Φ(x) = (g_1(x), ..., g_T(x)) any blending: powerful, achieves conditional blending, but danger of overfitting, as always :-( Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 15/23
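A sketch of the any-blending (stacking) recipe, with a simple 1-nearest-neighbour rule standing in for "AnyModel" at the second level; the first-level classifiers and the validation data are toy assumptions, and as noted above the main practical risk is overfitting:

```python
import numpy as np

rng = np.random.default_rng(3)

# hypothetical first-level classifiers g_t, assumed trained on D_train
g_list = [lambda x: np.sign(x[..., 0]),
          lambda x: np.sign(x[..., 1]),
          lambda x: np.sign(x[..., 0] + x[..., 1])]
Phi = lambda x: np.stack([g(x) for g in g_list], axis=-1)   # Phi(x) = (g_1(x), ..., g_T(x))

# toy validation set D_val used only for blending (label = sign of the product of features)
X_val = rng.uniform(-1, 1, size=(200, 2))
y_val = np.sign(X_val[:, 0] * X_val[:, 1])
Z_val = Phi(X_val)                                          # transformed examples z_n

def G_anyb(x):
    """Second-level 'AnyModel': a 1-nearest-neighbour rule on the transformed space."""
    z = Phi(np.asarray(x))
    nearest = int(np.argmin(np.sum((Z_val - z) ** 2, axis=1)))
    return y_val[nearest]

print(G_anyb([0.5, -0.5]))    # -> -1.0
```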

Linear and Any Blending Blending in Practice (Chen et al., A linear ensemble of individual and blended models for music rating prediction, 2012) KDDCup 2011 Track 1: World Champion Solution by NTU validation set blending: a special any blending model, E_test (squared): 519.45 → 456.24, helped secure the lead in last two weeks test set blending: linear blending using Ẽ_test, E_test (squared): 456.24 → 442.06, helped turn the tables in last hour blending useful in practice, despite the computational burden Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/23

Linear and Any Blending Fun Time Consider three decision stump hypotheses from R to {-1, +1}: g_1(x) = sign(1 - x), g_2(x) = sign(1 + x), g_3(x) = -1. When x = 0, what is the resulting Φ(x) = (g_1(x), g_2(x), g_3(x)) used in the returned hypothesis of linear/any blending? (1) (+1, +1, +1) (2) (+1, +1, -1) (3) (+1, -1, -1) (4) (-1, -1, -1)

Linear and Any Blending Fun Time Consider three decision stump hypotheses from R to {-1, +1}: g_1(x) = sign(1 - x), g_2(x) = sign(1 + x), g_3(x) = -1. When x = 0, what is the resulting Φ(x) = (g_1(x), g_2(x), g_3(x)) used in the returned hypothesis of linear/any blending? (1) (+1, +1, +1) (2) (+1, +1, -1) (3) (+1, -1, -1) (4) (-1, -1, -1) Reference Answer: 2 Too easy? :-) Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 17/23

Bagging (Bootstrap Aggregation) What We Have Done blending: aggregate after getting g_t; learning: aggregate as well as getting g_t aggregation type (blending / learning): uniform: voting/averaging / ?; non-uniform: linear / ?; conditional: stacking / ? learning g_t for uniform aggregation: diversity important diversity by different models: g_1 ∈ H_1, g_2 ∈ H_2, ..., g_T ∈ H_T diversity by different parameters: GD with η = 0.001, 0.01, ..., 10 diversity by algorithmic randomness: random PLA with different random seeds diversity by data randomness: within-cross-validation hypotheses g_v⁻ next: diversity by data randomness without g⁻ Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/23

Bagging (Bootstrap Aggregation) Revisit of Bias-Variance expected performance of A = expected deviation to consensus + performance of consensus consensus ḡ = expected g_t from D_t ∼ P^N consensus more stable than direct A(D), but comes from many more D_t than the one D on hand want: approximate ḡ by finite (large) T; approximate g_t = A(D_t) from D_t ∼ P^N using only D bootstrapping: a statistical tool that re-samples from D to simulate D_t Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 19/23

Bagging (Bootstrap Aggregation) Bootstrap Aggregation bootstrapping: bootstrap sample D̃_t: re-sample N examples from D uniformly with replacement; can also use arbitrary N′ instead of the original N virtual aggregation: consider a virtual iterative process that for t = 1, 2, ..., T: (1) request size-N data D_t from P^N (i.i.d.); (2) obtain g_t by A(D_t); G = Uniform({g_t}) bootstrap aggregation: consider a physical iterative process that for t = 1, 2, ..., T: (1) request size-N′ data D̃_t from bootstrapping; (2) obtain g_t by A(D̃_t); G = Uniform({g_t}) bootstrap aggregation (BAGging): a simple meta algorithm on top of base algorithm A Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/23
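A compact sketch of the BAGging meta algorithm under toy assumptions (an editorial decision-stump base algorithm instead of pocket, and synthetic data): bootstrap D̃_t from D with replacement, run A on each, and let the resulting g_t vote uniformly.

```python
import numpy as np

rng = np.random.default_rng(4)

def base_algorithm(X, y):
    """Toy base algorithm A: a decision stump on feature 0 (hypothetical stand-in for pocket)."""
    threshold = np.median(X[:, 0])
    direction = 1.0 if np.mean(y * np.sign(X[:, 0] - threshold)) >= 0 else -1.0
    return lambda x: direction * np.sign(x[:, 0] - threshold)

def bagging(X, y, T=25):
    N = len(X)
    g_list = []
    for t in range(T):
        idx = rng.integers(0, N, size=N)               # bootstrap: N draws with replacement
        g_list.append(base_algorithm(X[idx], y[idx]))  # g_t = A(D~_t)
    return lambda x: np.sign(sum(g(x) for g in g_list))   # G = uniform vote of the g_t

# toy data: label is (noisily) the sign of the first feature
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
G = bagging(X, y, T=25)
print(np.mean(G(X) == y))     # in-sample agreement of the bagged classifier
```

Swapping the stump for a less data-sensitive base algorithm (for example a heavily regularized linear model) typically shrinks the benefit, matching the slide's remark that bagging helps most when A is sensitive to data randomness.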

Blending and Bagging Bagging (Bootstrap Aggregation) Bagging Pocket in Action T_POCKET = 1000; T_BAG = 25 very diverse g_t from bagging proper non-linear boundary after aggregating binary classifiers bagging works reasonably well if base algorithm sensitive to data randomness Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 21/23

Bagging (Bootstrap Aggregation) Fun Time When using bootstrapping to re-sample N examples D̃_t from a data set D with N examples, what is the probability of getting D̃_t exactly the same as D? (1) 0/N^N = 0 (2) 1/N^N (3) N!/N^N (4) N^N/N^N = 1

Bagging (Bootstrap Aggregation) Fun Time When using bootstrapping to re-sample N examples D̃_t from a data set D with N examples, what is the probability of getting D̃_t exactly the same as D? (1) 0/N^N = 0 (2) 1/N^N (3) N!/N^N (4) N^N/N^N = 1 Reference Answer: 3 Consider re-sampling in an ordered manner for N steps. Then there are N^N possible outcomes D̃_t, each with equal probability. Most importantly, N! of the outcomes are permutations of the original D, and thus the answer. Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 22/23
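As a quick sanity check of the answer (editorial addition): for N = 3 the formula gives 3!/3^3 = 6/27 ≈ 0.222, which brute-force enumeration of all ordered bootstrap draws confirms:

```python
from itertools import product
from math import factorial

N = 3
draws = list(product(range(N), repeat=N))                  # all N^N ordered bootstrap samples
hits  = sum(sorted(d) == list(range(N)) for d in draws)    # samples that are a permutation of D
print(hits / len(draws), factorial(N) / N ** N)            # -> 0.2222... 0.2222...
```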

Bagging (Bootstrap Aggregation) Summary 1 Embedding Numerous Features: Kernel Models 2 Combining Predictive Features: Aggregation Models Lecture 7: Blending and Bagging Motivation of Aggregation aggregated G strong and/or moderate Uniform Blending diverse hypotheses, one vote, one value Linear and Any Blending two-level learning with hypotheses as transform Bagging (Bootstrap Aggregation) bootstrapping for diverse hypotheses next: getting more diverse hypotheses to make G strong 3 Distilling Implicit Features: Extraction Models Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 23/23