Variance Reduction and Ensemble Methods


Nicholas Ruozzi, University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag.

Last Time
- PAC learning
- Bias/variance tradeoff: small hypothesis spaces (not enough flexibility) can have high bias; rich hypothesis spaces (too much flexibility) can have high variance

Today: more on this phenomenon and how to get around it.

Intuition
Bias measures the accuracy or quality of the algorithm: high bias means a poor match. Variance measures the precision or specificity of the match: high variance means a weak match. We would like to minimize each of these, but unfortunately we can't do so independently; there is a trade-off.

Bias-Variance Analysis in Regression
The true function is $y = f(x) + \varepsilon$, where $\varepsilon$ is normally distributed with zero mean and standard deviation $\sigma$. Given a set of training examples $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$, we fit a hypothesis $g(x) = w^T x + b$ to the data to minimize the squared error $\sum_i \left(y^{(i)} - g(x^{(i)})\right)^2$.

2-D Example
Sample 20 points from $f(x) = x + 2\sin(1.5x) + N(0, 0.2)$.

2-D Example
50 fits (20 examples each).
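A minimal sketch of this experiment in Python, assuming NumPy, scikit-learn, and matplotlib; the sampling interval $[0, 10]$ and the random seed are assumptions, since the slide only specifies the target function and the noise level.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def f(x):
    # True function from the slide: f(x) = x + 2 sin(1.5 x)
    return x + 2 * np.sin(1.5 * x)

xs = np.linspace(0, 10, 200)
for _ in range(50):                         # 50 independent training sets
    x = rng.uniform(0, 10, size=20)         # 20 examples each
    y = f(x) + rng.normal(0, 0.2, size=20)  # additive Gaussian noise
    g = LinearRegression().fit(x.reshape(-1, 1), y)
    plt.plot(xs, g.predict(xs.reshape(-1, 1)), color="gray", alpha=0.3)

plt.plot(xs, f(xs), color="black", label="f(x)")
plt.legend()
plt.show()
```

The spread of the gray lines around the true function is a direct picture of the variance of the fitted hypothesis across training sets.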

Bias-Variance Analysis
Given a new data point $x'$ with observed value $y = f(x') + \varepsilon$, we want to understand the expected prediction error. Suppose that the training samples $S$ are drawn independently from a distribution $p(S)$; we want to compute $E\left[(y - g_S(x'))^2\right]$.

Probability Reminder
Variance of a random variable $Z$:
$\mathrm{Var}(Z) = E\left[(Z - E[Z])^2\right] = E\left[Z^2 - 2 Z E[Z] + E[Z]^2\right] = E[Z^2] - E[Z]^2$
Properties of $\mathrm{Var}(Z)$:
$\mathrm{Var}(aZ) = E[a^2 Z^2] - E[aZ]^2 = a^2 \mathrm{Var}(Z)$

Bias-Variance-Noise Decomposition

$E\left[(y - g_S(x'))^2\right]$
$\quad = E\left[g_S(x')^2 - 2\, g_S(x')\, y + y^2\right]$
$\quad = E\left[g_S(x')^2\right] - 2\, E\left[g_S(x')\right] E[y] + E\left[y^2\right]$   (the samples $S$ and the noise $\varepsilon$ are independent)
$\quad = \mathrm{Var}(g_S(x')) + E\left[g_S(x')\right]^2 - 2\, E\left[g_S(x')\right] f(x') + \mathrm{Var}(y) + f(x')^2$   (by the definition of variance, and $E[y] = f(x')$)
$\quad = \mathrm{Var}(g_S(x')) + \left(E\left[g_S(x')\right] - f(x')\right)^2 + \mathrm{Var}(\varepsilon)$
$\quad = \mathrm{Var}(g_S(x')) + \left(E\left[g_S(x')\right] - f(x')\right)^2 + \sigma^2$

The three terms are the variance, the squared bias, and the noise.

Bias, Variance, and Noise
- Variance: $E\left[(g_S(x') - E[g_S(x')])^2\right]$ describes how much $g_S(x')$ varies from one training set $S$ to another.
- Bias: $E[g_S(x')] - f(x')$ describes the average error of $g_S(x')$.
- Noise: $E\left[(y - f(x'))^2\right] = E[\varepsilon^2] = \sigma^2$ describes how much $y$ varies from $f(x')$.
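To make these quantities concrete, here is a Monte Carlo sketch that estimates them for the linear fit from the 2-D example; the query point x_star, the sampling interval, and the number of simulated training sets are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
sigma = 0.2        # noise standard deviation from the 2-D example
x_star = 5.0       # hypothetical query point x'

def f(x):
    return x + 2 * np.sin(1.5 * x)

preds = []
for _ in range(2000):                        # many training sets S ~ p(S)
    x = rng.uniform(0, 10, size=20)
    y = f(x) + rng.normal(0, sigma, size=20)
    g = LinearRegression().fit(x.reshape(-1, 1), y)
    preds.append(g.predict([[x_star]])[0])   # g_S(x')

preds = np.array(preds)
variance = preds.var()             # E[(g_S(x') - E[g_S(x')])^2]
bias = preds.mean() - f(x_star)    # E[g_S(x')] - f(x')
noise = sigma ** 2                 # E[eps^2]
print(f"variance={variance:.4f}  bias^2={bias**2:.4f}  noise={noise:.4f}")
print(f"estimated expected error = {variance + bias**2 + noise:.4f}")
```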

2-D Example
50 fits (20 examples each); figure slides illustrating the bias, the variance, and the noise.

Bias
Low bias? High bias?

Bias
Low bias: linear regression applied to linear data; a 2nd-degree polynomial applied to quadratic data.
High bias: a constant function; linear regression applied to non-linear data.

Variance
Low variance? High variance?

Variance
Low variance: a constant function; a model that is independent of the training data.
High variance: a high-degree polynomial.

Bias/Variance Tradeoff
$(\text{bias}^2 + \text{variance})$ is what counts for prediction. As we saw in PAC learning, we often have low bias with high variance, and low variance with high bias. Is this a firm rule?

Reduce Variance Without Increasing Bias
Averaging reduces variance: let $Z_1, \ldots, Z_N$ be i.i.d. random variables. Then
$\mathrm{Var}\left(\frac{1}{N} \sum_i Z_i\right) = \frac{1}{N} \mathrm{Var}(Z_i)$
Idea: average models to reduce model variance.
The problem: we only have one training set. Where do multiple models come from?
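Why averaging scales the variance by $1/N$: a short derivation using the variance property recalled earlier, together with the fact that variances of independent random variables add,

$\mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} Z_i\right) = \frac{1}{N^2}\,\mathrm{Var}\left(\sum_{i=1}^{N} Z_i\right) = \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}(Z_i) = \frac{N\,\mathrm{Var}(Z_1)}{N^2} = \frac{1}{N}\,\mathrm{Var}(Z_1),$

where the second equality uses independence (all cross-covariance terms vanish) and the third uses the identical distribution of the $Z_i$.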

Bagging: Bootstrap Aggregation (Breiman, 1994)
Take repeated bootstrap samples from the training set $D$.
Bootstrap sampling: given a set $D$ containing $N$ training examples, create $D'$ by drawing $N$ examples at random with replacement from $D$.
Bagging:
- Create $k$ bootstrap samples $D_1, \ldots, D_k$
- Train a distinct classifier on each $D_i$
- Classify a new instance by majority vote / average
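A minimal sketch of this procedure in Python, assuming scikit-learn decision trees as the base classifiers; the names bagging_fit and bagging_predict and the default k = 25 are illustrative choices, not from the slides (scikit-learn's BaggingClassifier offers the same functionality directly).

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bagging_fit(X, y, k=25, base=DecisionTreeClassifier):
    """Train k classifiers, each on a bootstrap sample D_i of (X, y)."""
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)     # N draws with replacement
        models.append(base().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify each instance by majority vote over the k classifiers."""
    votes = np.stack([m.predict(X) for m in models])   # shape (k, n_test)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```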

[image from the slides of David Sontag]

Bagging

Data: 1  2  3  4   5  6  7  8  9   10
BS 1: 7  1  9  10  7  8  8  4  7   2
BS 2: 8  1  3  1   1  9  7  4  10  1
BS 3: 5  4  8  8   2  5  5  7  8   8

Build a classifier from each bootstrap sample. In each bootstrap sample, each data point has probability $(1 - \frac{1}{N})^N$ of not being selected, so the expected number of distinct data points in each sample is $N\left(1 - (1 - \frac{1}{N})^N\right) \approx N(1 - e^{-1}) \approx 0.632\,N$. If we have 1 TB of data, each bootstrap sample will contain roughly 632 GB of distinct data (this can present computational challenges).
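A quick numerical check of the 0.632 figure, assuming NumPy; the choice of N is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000  # hypothetical number of training examples

# One bootstrap sample: N draws with replacement from {0, ..., N-1}.
sample = rng.integers(0, N, size=N)
frac_unique = np.unique(sample).size / N

print(frac_unique)           # approximately 0.632
print(1 - (1 - 1/N) ** N)    # analytic value, tends to 1 - 1/e ≈ 0.632
```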

Decision Tree Bagging [image from the slides of David Sontag]

Decision Tree Bagging (100 Bagged Trees) [image from the slides of David Sontag]

Bagging Experiments

Bagging Results
Breiman, "Bagging Predictors", Berkeley Statistics Department TR #421, 1994.

Random Forests

An ensemble method specifically designed for decision tree classifiers. Random forests introduce two sources of randomness: bagging and random input vectors.
- Bagging method: each tree is grown using a bootstrap sample of the training data.
- Random vector method: the best split at each node is chosen from a random sample of $m$ attributes instead of all attributes.

Random Forest Algorithm
For $b = 1$ to $B$:
- Draw a bootstrap sample of size $N$ from the data.
- Grow a tree $T_b$ on the bootstrap sample as follows: at each node, choose $m$ attributes uniformly at random, choose the best attribute among the $m$ to split on, split on it, and recurse (until partitions have fewer than $s_{\min}$ nodes).

Prediction for a new data point $x$:
- Regression: $\frac{1}{B} \sum_b T_b(x)$
- Classification: the majority class label among $T_1(x), \ldots, T_B(x)$
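A sketch of this algorithm in Python, assuming scikit-learn: max_features plays the role of m (the attributes sampled at each node), and min_samples_leaf is used as a rough stand-in for the s_min stopping rule; scikit-learn's RandomForestClassifier implements the same idea directly.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def random_forest_fit(X, y, B=100, m="sqrt", min_samples_leaf=5):
    """Grow B trees, each on a bootstrap sample, considering a random
    subset of m attributes at every split."""
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap sample of size N
        tree = DecisionTreeClassifier(max_features=m, # m random attributes per node
                                      min_samples_leaf=min_samples_leaf)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    """Majority vote over the B trees (use the mean instead for regression)."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```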

Random Forest Demo
A demo of random forests implemented in JavaScript.

When Will Bagging Improve Accuracy?
It depends on the stability of the base-level classifiers. A learner is unstable if a small change to the training set causes a large change in the output hypothesis. If small changes in $D$ cause large changes in the output, then bagging will improve performance. Bagging helps unstable procedures, but it can hurt the performance of stable procedures.
- Decision trees are unstable.
- k-nearest neighbor is stable.
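A small experiment consistent with this claim, assuming scikit-learn; the synthetic dataset and all parameter values are arbitrary illustrative choices, and the exact numbers will vary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem (settings are arbitrary).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

for name, base in [("decision tree", DecisionTreeClassifier(random_state=0)),
                   ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
    single = cross_val_score(base, X, y, cv=5).mean()
    bagged = cross_val_score(BaggingClassifier(base, n_estimators=50,
                                               random_state=0), X, y, cv=5).mean()
    print(f"{name}: single={single:.3f}  bagged={bagged:.3f}")

# Typically bagging improves the unstable decision tree noticeably,
# while the stable k-NN classifier changes little.
```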