Importance Sampling: An Alternative View of Ensemble Learning. Jerome H. Friedman and Bogdan Popescu, Stanford University

1 Importance Sampling: An Alternative View of Ensemble Learning
Jerome H. Friedman and Bogdan Popescu, Stanford University

2 PREDICTIVE LEARNING
Given data: $\{z_i\}_1^N = \{y_i, x_i\}_1^N \sim q(z)$
$y$ = output or response attribute (variable)
$x = \{x_1, \dots, x_n\}$ = inputs or predictors
and a loss function $L(y, F)$: estimate
$F^*(x) = \arg\min_{F(x)} E_{q(z)} L(y, F(x))$

3 WHY?
$F^*(x)$ is the best predictor of $y \mid x$ under $L$.
Examples:
Regression: $y, F \in \mathbb{R}$, $L(y, F) = |y - F|$ or $(y - F)^2$
Classification: $y, F \in \{c_1, \dots, c_K\}$, $L(y, F) = L_{y,F}$ (a $K \times K$ matrix)

4 $F^*(x)$ = target function (regression), concept (classification)
Estimate: $\hat F(x)$ = learning procedure$(\{z_i\}_1^N)$
Here: procedure = LEARNING ENSEMBLES

5 BASIC LINEAR MODEL
$F(x) = \int_{\mathcal{P}} a(p)\, f(x; p)\, dp$
$f(x; p)$ = base learner (basis function) with parameters $p = (p_1, p_2, \dots)$
$p \in \mathcal{P}$ indexes a particular function of $x$ from $\{f(x; p)\}_{p \in \mathcal{P}}$
$a(p)$ = coefficient of $f(x; p)$

6 Examples:
$f(x; p) = [1 + \exp(-p^t x)]^{-1}$ (neural nets)
$f(x; p)$ = multivariate splines (MARS)
$f(x; p)$ = decision trees (MART, random forests)

7 NUMERICAL QUADRATURE
$\int_{\mathcal{P}} I(p)\, dp \simeq \sum_{m=1}^{M} w_m I(p_m)$, here $I(p) = a(p)\, f(x; p)$
Quadrature rule defined by:
$\{p_m\}_1^M$ = evaluation points $\in \mathcal{P}$
$\{w_m\}_1^M$ = weights

8 $F(x) \simeq \sum_{m=1}^{M} w_m\, a(p_m)\, f(x; p_m) = \sum_{m=1}^{M} c_m f(x; p_m)$
Averaging over $x$: $\{c_m^*\}_1^M$ = linear regression (pop.) of $y$ on $\{f(x; p_m)\}_1^M$
Problem: find good $\{p_m\}_1^M$.
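
To make the finite-$M$ approximation concrete, here is a minimal numpy sketch (not from the talk) of fitting the coefficients $\{c_m\}$ by least squares of $y$ on the outputs of $M$ already-chosen base learners, with the empirical sample standing in for the population; the `base_learners` objects with a `.predict` method are hypothetical placeholders.

```python
import numpy as np

def fit_quadrature_coefficients(base_learners, X, y):
    """Least-squares fit of c_m in F(x) ~= sum_m c_m f(x; p_m).

    `base_learners` is any list of fitted models with a .predict method
    (hypothetical here); squared-error loss is assumed.
    """
    # Column m holds the base-learner evaluations f(x_i; p_m) over the sample.
    B = np.column_stack([f.predict(X) for f in base_learners])
    c, *_ = np.linalg.lstsq(B, y, rcond=None)
    return c

def ensemble_predict(base_learners, c, X):
    """Evaluate the M-term approximation sum_m c_m f(x; p_m) at new points."""
    B = np.column_stack([f.predict(X) for f in base_learners])
    return B @ c
```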

9 MONTE CARLO METHODS
$r(p)$ = sampling pdf of $p \in \mathcal{P}$; $\{p_m \sim r(p)\}_1^M$
Simple Monte Carlo: $r(p)$ = constant
Usually not very good

10 IMPORTANCE SAMPLING
Customize $r(p)$ for each particular problem ($F^*(x)$)
$r(p_m)$ big $\Rightarrow$ $p_m$ important to high accuracy when used with $\{p_{m'}\}_{m' \neq m}$

11 MONTE CARLO METHODS
(1) Random Monte Carlo: ignore other points; $p_m \sim r(p)$ iid
(2) Quasi Monte Carlo: $\{p_m\}_1^M$ = deterministic; account for other points (importance of groups of points)

12 RANDOM MONTE CARLO
(Lack of) importance $J(p)$ depends only on $p$.
One measure, partial importance: $J(p) = E_{q(z)} L(y, f(x; p))$
$p^* = \arg\min_p J(p)$ = best single point ($M = 1$) rule
$f(x; p^*)$ = optimal single base learner

13 Usually not very good, especially if $F^*(x) \notin \{f(x; p)\}_{p \in \mathcal{P}}$
BUT often used: single logistic regression or tree
Note: $J(p_m)$ ignores $\{p_{m'}\}_{m' \neq m}$
Hope: better than $r(p)$ = constant.

14 PARTIAL IMPORTANCE SAMPLING
$r(p) = g(J(p))$, with $g(\cdot)$ a monotone decreasing function
$r(p^*) = \max$: center (location)
$p \neq p^* \Rightarrow r(p) < r(p^*)$
$d(p, p^*) = J(p) - J(p^*)$

15 Besides location, the critical parameter for importance sampling is the scale (width) of $r(p)$:
$\sigma = \int_{\mathcal{P}} d(p, p^*)\, r(p)\, dp$
Controlled by the choice of $g(\cdot)$:
$\sigma$ too large $\Rightarrow$ $r(p)$ = constant
$\sigma$ too small $\Rightarrow$ best single point rule $p^*$

17 Questions:
(1) how to choose $g(\cdot)$, i.e. $\sigma$
(2) how to sample from $r(p) = g(J(p))$

18 TRICK: perturbation sampling. Repeatedly:
(1) randomly modify (perturb) the problem
(2) find the optimal $f(x; p_m)$ for the perturbed problem
$p_m = R_m\{\arg\min_p E_{q(z)} L(y, f(x; p))\}$
Control the width $\sigma$ of $r(p)$ by the degree of perturbation.
Perturb: $L(y, F)$, $q(z)$, the algorithm, or hybrids.

19 EXAMPLES
Perturb the loss function:
$L_m(y, f) = L(y, f) + \eta \cdot l_m(y, f)$, with $l_m(y, f)$ a random function
or $L_m(y, f) = L(y, f + \eta\, h_m(x))$, with $h_m(x)$ a random function of $x$
$p_m = \arg\min_p E_{q(z)} L_m(y, f(x; p))$
Width $\sigma$ of $r(p)$ $\leftrightarrow$ value of $\eta$

20 Perturb the sampling distribution:
$q_m(z) = [w_m(z)]^\eta\, q(z)$, with $w_m(z)$ a random function of $z$
$p_m = \arg\min_p E_{q_m(z)} L(y, f(x; p))$
Width $\sigma$ of $r(p)$ $\leftrightarrow$ value of $\eta$

21 Perturb the algorithm:
$p_m = \mathrm{rand}[\arg\min_p]\, E_{q(z)} L(y, f(x; p))$
Control the width $\sigma$ of $r(p)$ by the degree: repeated partial optimizations, perturbed partial solutions
Examples: Dietterich (randomized trees), Breiman (random forests)

22 GOAL
Produce a good $\{p_m\}_1^M$ so that $\sum_{m=1}^{M} c_m^* f(x; p_m) \simeq F^*(x)$,
where $\{c_m^*\}_1^M$ = population linear regression (under $L$) of $y$ on $\{f(x; p_m)\}_1^M$
Note: both depend on knowing the population $q(z)$.

23 FINITE DATA
$\{z_i\}_1^N \sim q(z)$, $\hat q(z) = \sum_{i=1}^{N} \frac{1}{N}\, \delta(z - z_i)$
Apply perturbation sampling based on $\hat q(z)$.
Loss function / algorithm: $q(z) \to \hat q(z)$; width $\sigma$ of $r(p)$ controlled as before

24 Sampling distribution: random reweighting
$\hat q_m(z) = \sum_{i=1}^{N} w_{im}\, \delta(z - z_i)$, $w_{im} \sim \Pr(w)$ with $E\, w_{im} = 1/N$
Width $\sigma$ of $r(p)$ controlled by $\mathrm{std}(w_{im})$
Fastest computation: $w_{im} \in \{0, 1/K\}$, draw $K$ from $N$ without replacement
$\sigma \sim \mathrm{std}(w) = (N/K - 1)^{1/2}/N$; computation $\sim K/N$
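
A minimal sketch (mine, not the authors' code) of the fastest option above: each perturbed problem is a draw of $K$ of the $N$ observations without replacement, and a base learner is fit to each subsample. Small scikit-learn regression trees stand in for $f(x; p)$; all parameter values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def perturbation_sample_learners(X, y, M=100, K=None, max_leaf_nodes=6, seed=0):
    """M base learners, each fit to K of N points drawn without replacement.

    std(w) ~ (N/K - 1)^(1/2) / N, so smaller K widens the sampling pdf r(p).
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    K = K if K is not None else max(2, N // 20)   # aggressive subsampling (arbitrary default)
    learners = []
    for _ in range(M):
        idx = rng.choice(N, size=K, replace=False)           # w_i in {0, 1/K}
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X[idx], y[idx])                             # optimal f(x; p_m) for the perturbed problem
        learners.append(tree)
    return learners
```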

25 Quadrature Coefficients
Population: linear regression of $y$ on $\{f(x; p_m)\}_1^M$:
$\{c_m^*\}_1^M = \arg\min_{\{c_m\}} E_{q(z)} L\big(y, \sum_{m=1}^{M} c_m f(x; p_m)\big)$

26 Finite data: regularized linear regression
$\{\hat c_m\}_1^M = \arg\min_{\{c_m\}} E_{\hat q(z)} L\big(y, \sum_{m=1}^{M} c_m f(x; p_m)\big) + \lambda \sum_{m=1}^{M} |c_m - c_m^{(0)}|$  (lasso)
Regularization reduces variance
$\{c_m^{(0)}\}_1^M$ = prior guess (usually $= 0$)
$\lambda > 0$ chosen by cross validation
Fast algorithm: solutions for all $\lambda$ (see HTF 2001, EHT 2002)
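
A sketch of the regularized coefficient fit using scikit-learn's cross-validated lasso path (squared-error loss, prior guess $c_m^{(0)} = 0$); this is my illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def fit_isle_coefficients(base_learners, X, y, cv=5):
    """Lasso regression of y on the base-learner outputs, lambda chosen by CV.

    Returns a fitted LassoCV object: .coef_ are the c_m, .intercept_ the constant.
    """
    B = np.column_stack([f.predict(X) for f in base_learners])
    return LassoCV(cv=cv).fit(B, y)
```

In practice one would prefer to fit the coefficients on observations (or cross-validation folds) not used to build the base learners, so the variance reduction is not undone by overfitting.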

27 Importance Sampled Learning Ensembles (ISLE)
Numerical integration: $F(x) = \int_{\mathcal{P}} a(p)\, f(x; p)\, dp \simeq \sum_{m=1}^{M} \hat c_m f(x; \hat p_m)$
$\{\hat p_m\}_1^M \sim \hat r(p)$: importance sampling via perturbation sampling on $\hat q(z)$
$\{\hat c_m\}_1^M$: regularized linear regression of $y$ on $\{f(x; \hat p_m)\}_1^M$

28 BAGGING (Breiman 1996)
Perturb the data distribution $\hat q(z)$:
$\hat q_m(z)$ = bootstrap sample = $\sum_{i=1}^{N} w_{im}\, \delta(z - z_i)$, $w_{im} \in \{0, 1/N, 2/N, \dots, 1\}$, multinomial$(1/N)$
$p_m = \arg\min_p E_{\hat q_m(z)} L(y, f(x; p))$
$\hat F(x) = \frac{1}{M} \sum_{m=1}^{M} f(x; p_m)$  (average)
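
For contrast with the ISLE recipe, a bare-bones bagging sketch (bootstrap resampling, unweighted average, i.e. $\lambda = \infty$ and $c_m^{(0)} = 1/M$); tree settings are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, M=100, seed=0):
    """Breiman-style bagging: one unpruned tree per bootstrap sample."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    trees = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)    # bootstrap draw: multinomial(1/N) counts
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Simple average: every tree gets the fixed coefficient 1/M."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```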

29 Width $\sigma$ of $r(p)$: $E(\mathrm{std}(w_{im})) = (1 - 1/N)^{1/2}/N \simeq 1/N$
Fixed $\Rightarrow$ no control
No joint fitting of coefficients: $\lambda = \infty$ and $c_m^{(0)} = 1/M$
Potential improvements:
different $\sigma$ (sampling strategy)
$\lambda < \infty$: jointly fit coefficients to the data

30 RANDOM FORESTS (Breiman 1998)
$f(x; p) = T(x)$ = largest possible decision tree
Hybrid sampling strategy:
(1) $\hat q_m(z)$ = bootstrap sample (bagging)
(2) random algorithm modification: select the variable for each split from a randomly chosen subset of size $n_s$
Breiman: $n_s = \lfloor \log_2 n + 1 \rfloor$

31 $\hat F(x) = \frac{1}{M} \sum_{m=1}^{M} T_m(x)$  (average)
As an ISLE: $\sigma(\mathrm{RF}) > \sigma(\mathrm{Bag})$ (increasing as $n_s$ decreases)
Potential improvements (same as bagging):
different $\sigma$ (sampling strategy)
$\lambda < \infty$: jointly fit coefficients to the data (more later)
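
One way to sketch the "jointly fit coefficients" improvement: fit an ordinary random forest, then replace the $1/M$ averaging weights with lasso coefficients fit to the individual trees' predictions. This is my illustration of the idea, not the authors' code; ideally the coefficient fit would use held-out data or cross-validation folds.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

def rf_isle(X, y, M=200, cv=5, seed=0):
    """Random forest whose trees are re-weighted by a cross-validated lasso."""
    rf = RandomForestRegressor(n_estimators=M, random_state=seed).fit(X, y)
    # Per-tree predictions form the basis for the coefficient regression.
    B = np.column_stack([t.predict(X) for t in rf.estimators_])
    post = LassoCV(cv=cv).fit(B, y)
    return rf, post

def rf_isle_predict(rf, post, X):
    B = np.column_stack([t.predict(X) for t in rf.estimators_])
    return post.predict(B)
```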

32 SEQUENTIAL SAMPLING
Random Monte Carlo: $\{p_m \sim r(p)\}_1^M$ iid
Quasi Monte Carlo: $\{p_m\}_1^M$ = deterministic
$J(\{p_m\}_1^M) = \min_{\{\alpha_m\}} E_{q(z)} L\big(y, \sum_{m=1}^{M} \alpha_m f(x; p_m)\big)$
Joint regression of $y$ on $\{f(x; p_m)\}_1^M$  (pop.)

33 Approximation: sequential sampling (forward stagewise)
$J_m(p \mid \{p_l\}_1^{m-1}) = \min_\alpha E_{q(z)} L(y, \alpha f(x; p) + h_m(x))$
$h_m(x) = \sum_{l=1}^{m-1} \alpha_l f(x; p_l)$, $\alpha_l$ = solution for $p_l$
$p_m = \arg\min_p J_m(p \mid \{p_l\}_1^{m-1})$
Repeatedly modify the loss function: similar to $L_m(y, f) = L(y, f + \eta\, h_m(x))$, but here $\eta = 1$ and $h_m(x)$ is deterministic
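
A minimal forward-stagewise loop under squared-error loss (my sketch, not the talk's): at each step the base learner is fit to the current residual $y - h_m(x)$ and $\alpha_m$ is the partial regression coefficient.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=100, max_leaf_nodes=6):
    """Sequential sampling: choose p_m given h_m(x) = sum of the previous terms.

    Under squared-error loss, minimizing L(y, alpha*f + h_m) over f reduces to
    fitting the residual, which is the familiar boosting recursion.
    """
    h = np.zeros(len(y), dtype=float)
    terms = []                                    # (alpha_m, f(.; p_m)) pairs
    for _ in range(M):
        r = y - h                                 # residual = current "perturbed" target
        f = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(X, r)
        pred = f.predict(X)
        # alpha_m = argmin_alpha sum (r - alpha*pred)^2, the partial regression coefficient
        alpha = float(pred @ r) / (float(pred @ pred) + 1e-12)
        h += alpha * pred
        terms.append((alpha, f))
    return terms

def stagewise_predict(terms, X):
    return sum(alpha * f.predict(X) for alpha, f in terms)
```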

34 Connection to Boosting
AdaBoost (Freund & Schapire 1996): $L(y, f) = \exp(-y f)$, $y \in \{-1, 1\}$
$\hat F(x) = \mathrm{sign}\big(\sum_{m=1}^{M} \alpha_m f(x; p_m)\big)$
$\{\alpha_m\}_1^M$ = sequential partial regression coefficients
Gradient Boosting (MART, Friedman 2001): general $y$ and $L(y, f)$, $\alpha_m$ shrunk ($\eta \ll 1$)
$\hat F(x) = \sum_{m=1}^{M} \alpha_m f(x; p_m)$
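
The shrunk ($\eta \ll 1$) MART-style variant is available off the shelf; a short usage sketch with scikit-learn, where the parameter values are illustrative and the data variables are hypothetical.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Gradient boosting with small shrinkage (learning_rate plays the role of eta).
mart = GradientBoostingRegressor(learning_rate=0.01, n_estimators=500, max_leaf_nodes=6)
# mart.fit(X_train, y_train)       # hypothetical training data
# y_hat = mart.predict(X_test)     # hypothetical test data
```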

35 Potential improvements (ISLE):
(1) $\hat F(x) = \sum_{m=1}^{M} \hat c_m f(x; \hat p_m)$
$\{\hat p_m\}_1^M$: sequential sampling on $\hat q(z)$
$\{\hat c_m\}_1^M$: regularized linear regression
(2) and/or hybrid with random $\hat q_m(z)$ (speed): sample $K$ from $N$ without replacement

36 ISLE Paradigm
Wide variety of ISLE methods, varying:
(1) base learner $f(x; p)$
(2) loss criterion $L(y, f)$
(3) perturbation method
(4) degree of perturbation: $\sigma$ of $r(p)$
(5) iid vs. sequential sampling
(6) hybrids
Examine several options.

37 Monte Carlo Study
100 data sets, each with $N = 10000$, $n = 40$:
$\{y_{il} = F_l(x_i) + \varepsilon_{il}\}_{i=1}^{N}$, $l = 1, \dots, 100$
$\{F_l(x)\}$ = different (random) target functions
$x_i \sim N(0, I_{40})$, $\varepsilon_{il} \sim N(0, \mathrm{Var}_x(F_l(x)))$
signal/noise = 1/1
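
The simulation design can be sketched as follows (my reconstruction; the actual random target functions $F_l$ used in the study are not specified on the slide, so any function of the 40 inputs can be plugged in).

```python
import numpy as np

def make_dataset(F, N=10000, n=40, seed=0):
    """One simulated data set: x ~ N(0, I_n), y = F(x) + eps, signal/noise = 1/1.

    `F` maps an (N, n) array of inputs to an (N,) array of target values;
    the study drew 100 different random targets, not reproduced here.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, n))
    signal = F(X)
    # Noise standard deviation equal to std_x(F(x)) gives signal/noise = 1/1.
    eps = rng.normal(0.0, np.std(signal), size=N)
    return X, signal + eps
```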

38 Evaluation Criteria
Relative RMS error: $\mathrm{rmse}(\hat F_{jl}) = [1 - R^2(F_l, \hat F_{jl})]^{1/2}$
Comparative RMS error: $\mathrm{cmse} = \mathrm{rmse}(\hat F_{jl}) / \min_k \{\mathrm{rmse}(\hat F_{kl})\}$  (adjusts for problem difficulty)
$j, k \in \{$respective methods$\}$; evaluated on independent observations
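
Both criteria written out in numpy (my transcription; $R^2$ is read here as the usual coefficient of determination, so the relative RMS error is the RMS error scaled by the standard deviation of the target, computed on independent test points).

```python
import numpy as np

def relative_rmse(F_true, F_hat):
    """rmse = [1 - R^2(F, F_hat)]^(1/2), with R^2 the coefficient of determination."""
    return np.sqrt(np.mean((F_true - F_hat) ** 2) / np.var(F_true))

def comparative_rmse(rmse_by_method):
    """cmse: each method's rmse divided by the smallest rmse on the same problem."""
    best = min(rmse_by_method.values())
    return {name: val / best for name, val in rmse_by_method.items()}
```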

39 Properties of $F_l(x)$:
(1) 30 noise variables
(2) wide variety of functions (difficulty)
(3) emphasize lower order interactions
(4) not in the span of the base learners
Base learners: decision trees

40 [Figure: Bagging, relative RMS error across methods Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P]

41 [Figure: Bagging, comparative RMS error across methods Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P]

42 [Figure: Random forests, comparative RMS error across methods RF, RF_P, RF_.05, RF_.05_P, RF_6, RF_6_P, RF_6_.05_P]

43 [Figure: Bag/RF comparative RMS error across methods Bag, RF, Bag_6_05_P, RF_6_05_P]

44 [Figure: Sequential sampling, comparative RMS error across methods Mart, Mart_P, Mart_10_P, Mart_.01_10_P, Mart_.01_20_P]

45 [Figure: Seq/Bag/RF comparative RMS error across methods Mart, Mart_P, Seq_.01_.2, Bag, Bag_6_.05_p, RF, RF_6_.05_p]

46 SUMMARY
Theory unifies (1) bagging, (2) random forests, (3) Bayesian model averaging, and (4) boosting under a single paradigm: Monte Carlo integration.
(1)-(3): iid Monte Carlo, $p \sim r(p)$; (1) and (2) via perturbation sampling, (3) via MCMC
(4): quasi Monte Carlo, approximate sequential sampling

47 Practice: fitting the coefficients $\{\hat c_m\}_1^M$ by lasso linear regression
(1) improves accuracy of RF and bagging (and is faster)
(2) combined with aggressive subsampling and weaker base learners, improves speed: bagging and RF by more than $10^2$, MART by about 5, allowing much bigger data sets. Also, prediction is many times faster.

48 FUTURE DIRECTIONS
(1) More thorough understanding (theory), leading to specific recommendations

49 (2) Multiple learning ensembles (MISLEs)
$F(x) = \sum_{k=1}^{K} \int_{\mathcal{P}_k} a_k(p_k)\, f_k(x; p_k)\, dp_k$
$\{f_k(x; p_k)\}_1^K$ = different (comp.) base learners
$\hat F(x) = \sum_{k,m} c_{km} f_k(x; p_{km})$, with $\{c_{km}\}$ from a combined lasso regression
Example: $f_1$ = decision trees, $f_2 = \{x_j\}_1^n$ (no sampling)
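
The trees-plus-linear example can be sketched by concatenating per-tree predictions and the raw predictors into one design matrix for the combined lasso; again this is my illustration under the same assumptions as the earlier sketches.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

def misle_trees_plus_linear(X, y, M=200, cv=5, seed=0):
    """Combined lasso over two base-learner families:
    f_1 = decision trees (from a forest), f_2 = the raw coordinates x_1..x_n."""
    rf = RandomForestRegressor(n_estimators=M, random_state=seed).fit(X, y)
    tree_preds = np.column_stack([t.predict(X) for t in rf.estimators_])
    design = np.hstack([tree_preds, X])          # tree columns followed by linear terms
    return rf, LassoCV(cv=cv).fit(design, y)
```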

50 SLIDES: jhf/isletalk.pdf
REFERENCES
HTF (2001): Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Springer.
EHT (2002): Efron, Hastie & Tibshirani, Least Angle Regression, Stanford preprint.
