
PDEEC Machine Learning 2016/17 Lecture - Model assessment, selection and Ensemble Jaime S. Cardoso jaime.cardoso@inesctec.pt INESC TEC and Faculdade Engenharia, Universidade do Porto Nov. 07, 2017 1 / 65

Main Sources of the Slides Aarti Singh, Model Selection, http://www.cs.cmu.edu/~epxing/class/10701/lecture/lecture8.pdf Aarti Singh, Boosting, http://www.cs.cmu.edu/~epxing/class/10701/lecture/lecture19.pdf Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning (chap. 7, 10), Springer. 2 / 65

True vs. Empirical Risk True Risk: target performance measure. Classification - probability of misclassification $P(f(X) \neq Y)$. Regression - mean squared error $E[(f(X) - Y)^2]$. Also known as generalization error - performance on a random test point $(X, Y)$. Empirical Risk: performance on training data. Classification - proportion of misclassified examples, $\frac{1}{N}\sum_{i=1}^{N} I(f(x_i) \neq y_i) \approx P(f(X) \neq Y)$. Regression - average squared error $\frac{1}{N}\sum_{i=1}^{N} (f(x_i) - y_i)^2$. 3 / 65
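
As a minimal illustration (not from the slides), both empirical risks are just sample averages; `f` is assumed to be a fitted, vectorized predictor and `x`, `y` a labelled sample:

```python
import numpy as np

def empirical_risk_classification(f, x, y):
    """Proportion of misclassified examples: (1/N) * sum I(f(x_i) != y_i)."""
    return np.mean(f(x) != y)

def empirical_risk_regression(f, x, y):
    """Average squared error: (1/N) * sum (f(x_i) - y_i)^2."""
    return np.mean((f(x) - y) ** 2)
```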

Overfitting I Is the following predictor a good one? What is its empirical risk (performance on training data)? Zero! What about its true risk? Greater than zero. It will predict very poorly on a new random test point: large generalization error! 4 / 65

Overfitting II If we allow very complicated predictors, we could overfit the training data. Examples: Classification 5 / 65

Overfitting III If we allow very complicated predictors, we could overfit the training data. Examples: Regression 6 / 65

Effect of Model Complexity If we allow very complicated predictors, we could overfit the training data. 7 / 65

Behavior of True Risk 8 / 65

Behavior of True Risk 9 / 65

Behavior of True Risk 10 / 65

Behavior of True Risk 11 / 65

Behavior of True Risk 12 / 65

Behavior of True Risk 13 / 65

Behavior of True Risk 14 / 65

Bias-variance in Regression 1. The bias term measures how far the mean prediction is from the true value. 2. The variance term measures how far each prediction is from the mean prediction. [Figure: regularized fits to a sinusoid for three settings of the regularization parameter (ln λ = 2.6, −0.31, −2.4); left column shows individual fits, right column their average; axes x and t.] 15 / 65

Bias-variance Decomposition 1. Given an estimator $\hat\theta$ of a parameter $\theta$ underlying the distribution $P(X \mid \theta)$, the mean-squared error of $\hat\theta$ is defined as $MSE(\hat\theta) = E_\theta(\hat\theta - \theta)^2$. 2. A very general way to understand any machine learning model is to decompose its error into a bias and a variance term. 3. Let us define the bias of an estimator as $bias(\hat\theta) = E_\theta(\hat\theta) - \theta$. 4. An unbiased estimator has zero bias, of course, but often it turns out that the estimator with the lowest MSE may be biased! 5. The variance of an estimator is defined as $Var(\hat\theta) = E_\theta(\hat\theta - E_\theta(\hat\theta))^2$. 16 / 65

Bias-variance Decomposition 1. Bias-Variance Theorem: the mean-squared error of any estimator $\hat\theta$ can be decomposed into its (squared) bias and variance components:
$MSE(\hat\theta) = E_\theta(\hat\theta - \theta)^2$
$= E_\theta[\hat\theta - E_\theta(\hat\theta) + E_\theta(\hat\theta) - \theta]^2$
$= E_\theta[(\hat\theta - E_\theta(\hat\theta))^2 + (E_\theta(\hat\theta) - \theta)^2 + 2(\hat\theta - E_\theta(\hat\theta))(E_\theta(\hat\theta) - \theta)]$
$= E_\theta(\hat\theta - E_\theta(\hat\theta))^2 + (E_\theta(\hat\theta) - \theta)^2 + 2\,E_\theta[\hat\theta - E_\theta(\hat\theta)](E_\theta(\hat\theta) - \theta)$
$= E_\theta(\hat\theta - E_\theta(\hat\theta))^2 + (E_\theta(\hat\theta) - \theta)^2$ (the cross term has zero expectation)
$= Var(\hat\theta) + Bias^2(\hat\theta)$ 17 / 65
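
A quick numerical check of the theorem, using a deliberately biased shrunken-mean estimator as a made-up example (the estimator and parameter values are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 20, 100_000

# theta_hat = 0.9 * sample mean of n draws from N(theta, 1): biased but lower variance
estimates = np.array([0.9 * rng.normal(theta, 1.0, n).mean() for _ in range(trials)])

mse = np.mean((estimates - theta) ** 2)
bias2 = (estimates.mean() - theta) ** 2
var = estimates.var()
print(f"MSE={mse:.4f}  Bias^2+Var={bias2 + var:.4f}")  # agree up to Monte Carlo error
```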

Bias-variance Decomposition Bias-Variance Theorem. $y(x)$: true value to be predicted; $E(y \mid x)$: its conditional mean; $f(x)$: estimate of $y(x)$.
$E\{f(x) - y(x)\}^2 = E\{f(x) - E(y \mid x) + E(y \mid x) - y(x)\}^2$
$= E\{f(x) - E(y \mid x)\}^2 + E\{E(y \mid x) - y(x)\}^2 + 2E\{(f(x) - E(y \mid x))(E(y \mid x) - y(x))\}$
$= E\{f(x) - E(y \mid x)\}^2 + E\{E(y \mid x) - y(x)\}^2$
$= E\{f(x) - E(y \mid x)\}^2 + \text{noise}$
For an estimate $f(x; D)$ trained on a data set $D$:
$E\{f(x; D) - E(y \mid x)\}^2 = E\{f(x; D) - E\{f(x; D)\} + E\{f(x; D)\} - E(y \mid x)\}^2$
$= E\{f(x; D) - E\{f(x; D)\}\}^2 + E\{E\{f(x; D)\} - E(y \mid x)\}^2$
$= Var(f(x; D)) + Bias^2(f(x; D))$
Expected loss = (bias)$^2$ + variance + noise. 18 / 65
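
The same decomposition can be estimated by simulation in the regression setting. The sketch below fits a cubic polynomial to many noisy sinusoidal data sets; the target function, noise level and degree are assumptions chosen to mirror the kind of example these slides reference, not the lecture's own experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin
x_test = np.linspace(0, 2 * np.pi, 50)
degree, n, datasets, noise_sd = 3, 25, 2000, 0.3

preds = np.empty((datasets, x_test.size))
for d in range(datasets):
    x = rng.uniform(0, 2 * np.pi, n)
    y = true_f(x) + rng.normal(0, noise_sd, n)
    coeffs = np.polyfit(x, y, degree)        # fit f(x; D) on data set D
    preds[d] = np.polyval(coeffs, x_test)

bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)   # (bias)^2 averaged over x
variance = np.mean(preds.var(axis=0))                         # variance averaged over x
print(f"bias^2={bias2:.4f}  variance={variance:.4f}  noise={noise_sd**2:.4f}")
```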

Bias-variance Tradeoff Example 1. The bias term measures how far the mean prediction is from the true value. 2. The variance term measures how far each prediction is from the mean prediction. [Figure: (bias)$^2$, variance, (bias)$^2$ + variance, and test error plotted against ln λ.] 19 / 65

Examples of Model Spaces Model spaces with increasing complexity: Nearest-neighbor classifiers with varying neighborhood sizes k = 1, 2, 3, ... Small neighborhood => higher complexity. Decision trees with depth k or with k leaves. Higher depth / more leaves => higher complexity. Regression with polynomials of order k = 0, 1, 2, ... Higher degree => higher complexity. Kernel regression with bandwidth h. Small bandwidth => higher complexity. How can we select the right complexity model? 20 / 65

Model selection and assessment Data-rich situation 1. Selection: estimating the performance of different models in order to choose the (approximate) best one. 2. Assessment: having chosen a final model, estimating its prediction error (generalization error) on new data. 3. Data-rich situation: randomly divide the dataset into three parts: a training set, a validation set and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Training error: the average loss over the training sample, $\overline{err} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i))$. Test error or generalization error: the expected prediction error over an independent test sample, $Err = E[L(Y, f(X))]$. Typically the training error $\overline{err}$ is less than the true error $Err$, because the same data is being used to fit the method and assess its error. 21 / 65
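
A possible implementation of this three-way split; the 50/25/25 proportions are an assumption, since the slides leave them unspecified:

```python
import numpy as np

def train_val_test_split(X, y, frac=(0.5, 0.25, 0.25), seed=0):
    """Randomly partition (X, y) into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr = int(frac[0] * len(y))
    n_va = int(frac[1] * len(y))
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```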

Model selection and assessment Data-rich situation Hold-out method We would like to pick the model that has smallest generalization error. Can judge generalization error by using an independent sample of data. Hold - out procedure: 22 / 65

Model selection and assessment Data-rich situation Hold-out method 23 / 65

Model selection and assessment Data-rich situation Hold-out method 24 / 65

Estimates of Prediction In-Sample Estimates An obvious way to estimate the prediction error is to estimate the optimism of the training error and then add it to the training error $\overline{err}$: $\widehat{Err} = \overline{err} + \text{optimism}$. Akaike information criterion: $AIC = \overline{err} + 2\frac{d}{N}\hat\sigma_\epsilon^2$, where N is the number of samples, $\hat\sigma_\epsilon^2$ is an estimate of the noise variance, obtained from the mean-squared error of a low-bias model, and d is the effective number of parameters. The concept of number of parameters can be generalized to models with regularization. 25 / 65

Estimates of Prediction In-Sample Estimates Bayesian information criterion: $BIC = \frac{N}{\hat\sigma_\epsilon^2}\left[\overline{err} + (\log N)\frac{d}{N}\hat\sigma_\epsilon^2\right]$. Assuming $N > e^2 \approx 7.4$, BIC tends to penalize complex models more heavily than AIC, giving preference to simpler models in selection. Like AIC, BIC is applicable in settings where the fitting is carried out by maximization of a log-likelihood. It arises in a Bayesian approach to model selection, under some approximating conditions: $\frac{Pr(M_m \mid D)}{Pr(M_l \mid D)} = \frac{Pr(M_m)}{Pr(M_l)} \cdot \frac{Pr(D \mid M_m)}{Pr(D \mid M_l)}$. Assuming some simplifications to $Pr(D \mid M_m)$, one obtains the generic form of BIC: $BIC = -2\,\text{loglik} + (\log N)\,d$. 26 / 65
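
Both criteria reduce to one-line formulas. The sketch below assumes the training error `err`, the effective number of parameters `d`, and the noise-variance estimate `sigma2` have already been computed for the squared-error setting of the two slides above:

```python
import numpy as np

def aic(err, d, n, sigma2):
    """AIC = err + 2 * (d/N) * sigma_hat^2 (squared-error form)."""
    return err + 2.0 * d / n * sigma2

def bic(err, d, n, sigma2):
    """BIC = (N / sigma_hat^2) * [err + (log N) * (d/N) * sigma_hat^2]."""
    return (n / sigma2) * (err + np.log(n) * d / n * sigma2)
```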

Estimates of Prediction In-Sample Estimates Bayesian Model Comparison $p(M_i \mid D) \propto p(M_i)\, p(D \mid M_i)$, with the model evidence $p(D \mid M_i) = \int p(D \mid w, M_i)\, p(w \mid M_i)\, dw$. [Figure: schematic distributions $p(D)$ over data sets for three models $M_1$, $M_2$, $M_3$ of increasing complexity, with a particular data set $D_0$ marked.] 27 / 65

Estimates of Prediction In-Sample Estimates Bayesian Model Comparison the Bayesian framework avoids the problem of over-fitting and allows models to be compared on the basis of the training data alone. a Bayesian approach, like any approach to pattern recognition, needs to make assumptions about the form of the model, and if these are invalid then the results can be misleading. In particular, the model evidence can be sensitive to many aspects of the prior, such as the behaviour in the tails. In a practical application, therefore, it will be wise to keep aside an independent test set of data on which to evaluate the overall performance of the final system. 28 / 65

Estimates of Prediction In-Sample Estimates Minimum description length The minimum description length (MDL) approach gives a selection criterion formally identical to the BIC, but motivated from an optimal coding viewpoint. Assume we have a model with parameters θ and data Z = (X, y) consisting of both the inputs and outputs. Let the probability of the outputs under the model be $Pr(y \mid \theta, X)$, assume the receiver knows all the inputs, and suppose we wish to transmit the outputs. Then the message length required to transmit the outputs is $\text{length} = -\log Pr(y \mid \theta, X) - \log Pr(\theta)$. The preceding view of MDL for model selection says that we should choose the model with highest posterior probability. However, many Bayesians would instead do inference by sampling from the posterior probability. 29 / 65

Estimates of Prediction Vapnik-Chervonenkis Dimension A difficulty in using estimates of in-sample error is the need to specify the number of parameters (or the complexity) used in the fit; the effective number of parameters is not fully general. The Vapnik-Chervonenkis (VC) theory provides such a general measure of complexity. 30 / 65

Vapnik-Chervonenkis Dimension Shattering Idea: measure complexity by the effective number of distinct functions that can be defined. A hypothesis class is more complex if it can represent lots of possible classification decisions. Def: a set of points is shattered by H (the hypothesis space) if, for all possible binary labelings of the points, there exists h ∈ H that can represent this decision. We will measure complexity by the number of points that can be shattered by H. 31 / 65

Vapnik-Chervonenkis Dimension Shattering Def: a set of points is shattered by H (the hypothesis space) if, for all possible binary labelings of the points, there exists h ∈ H that can represent this decision. 32 / 65

Vapnik-Chervonenkis Dimension The Game VC dimension is a shattering game between us and an adversary. To show that H has VC dimension d: I choose d points positioned however I want; the adversary chooses a labeling of these d points; I choose h ∈ H that labels the points properly. The VC dimension of H is the maximum d I can choose so that I can always succeed in the game. A linear indicator function in p dimensions has VC dimension equal to p + 1, which is also the number of free parameters. The VC dimension can be extended to real-valued functions. VC dimension is a measure of complexity of infinite hypothesis classes. 33 / 65

Estimates of Prediction Extra-Sample Estimates Cross validation Ideally, if we had enough data, we would set aside a validation set and use it to assess the performance of our prediction model. K-fold cross-validation uses part of the available data to fit the model and a different part to test it. We split the data into K roughly equal-size parts. For k = 1, 2, ..., K, we fit the model to the data with the kth part removed and calculate the prediction error of the fitted model when predicting the kth part of the data. 34 / 65
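
A minimal sketch of this procedure; `fit` (assumed to return a model with a `predict` method) and `loss` (assumed to average L(y, ŷ) over a fold) are placeholder names, not part of the lecture:

```python
import numpy as np

def k_fold_cv(X, y, fit, loss, K=5, seed=0):
    """Return the K-fold cross-validation estimate of the prediction error."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])          # fit with the kth part removed
        errors.append(loss(y[test], model.predict(X[test])))  # error on the kth part
    return np.mean(errors)
```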

Estimates of Prediction Extra-Sample Estimates K-fold cross-validation 35 / 65

Estimates of Prediction Extra-Sample Estimates Leave-one-out (LOO) cross-validation 36 / 65

Estimates of Prediction Extra-Sample Estimates Cross-validation: Random subsampling 37 / 65

Estimates of Prediction Extra-Sample Estimates Estimating generalization error 38 / 65

Estimates of Prediction Extra-Sample Estimates Cross validation - What value should we choose for K? With K = N, CV is approximately unbiased for the true prediction error, but it can have high variance because the N training sets are so similar to one another; the computational burden is also considerable. On the other hand, with K = 5, CV has low variance but the bias could be a problem. 39 / 65

Estimates of Prediction Extra-Sample Estimates Practical Issues in Cross-validation 40 / 65

Evaluation Measures 41 / 65

Classification measures: basics 42 / 65

Classification Error 43 / 65

Misses and False Alarms 44 / 65

Evaluation (recap) 45 / 65

Thresholds in Classification 46 / 65

ROC curves 47 / 65

Evaluating regression 48 / 65

Mean Squared Error 49 / 65

Model Assessment Bootstrap methods The bootstrap is a general tool for assessing statistical accuracy, falling within a broader class of resampling methods. Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples of the observed dataset (and of equal size to the observed dataset), each of which is obtained by random sampling with replacement from the original dataset. 50 / 65

Model Assessment Bootstrap methods 51 / 65

Model Assessment Bootstrap methods S(Z) is any quantity computed from the data Z, for example the prediction at some point. The estimate of the variance of S(Z) is $\widehat{Var}[S(Z)] = \frac{1}{B-1}\sum_{b=1}^{B} \left(S(Z^{*b}) - \bar{S}^*\right)^2$. How to estimate the prediction error? $\widehat{Err} = \frac{1}{B}\frac{1}{N}\sum_{b=1}^{B}\sum_{n=1}^{N} L(y_n, \hat f^{*b}(x_n))$ is optimistic, because the bootstrap datasets are acting as the training samples while the original training set is acting as the test sample, and these two have observations in common. Leave-one-out bootstrap estimate of the prediction error: $\widehat{Err}^{(1)} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L(y_i, \hat f^{*b}(x_i))$, where $C^{-i}$ is the set of indices of the bootstrap samples that do not contain observation i. The .632 estimator: $\widehat{Err}^{(.632)} = 0.368\,\overline{err} + 0.632\,\widehat{Err}^{(1)}$, since $Pr(\text{observation } i \in \text{bootstrap sample } b) = 1 - (1 - 1/N)^N \approx 1 - e^{-1} = 0.632$. 52 / 65
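
A small sketch of the variance formula above; `statistic` plays the role of S(·), and the exponential sample with the median is only a made-up usage example:

```python
import numpy as np

def bootstrap_variance(z, statistic, B=1000, seed=0):
    """Bootstrap estimate of Var[S(Z)]: resample with replacement B times."""
    rng = np.random.default_rng(seed)
    n = len(z)
    stats = np.array([statistic(z[rng.integers(0, n, n)]) for _ in range(B)])
    return stats.var(ddof=1)   # (1/(B-1)) * sum (S(Z*b) - S_bar)^2

z = np.random.default_rng(1).exponential(size=50)
print(bootstrap_variance(z, np.median))
```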

Ensemble Methods Methods for Constructing Ensembles Subsampling the training data Bagging Boosting Manipulating the Input Features Manipulating the Output Targets Injecting Randomness 53 / 65

Model Averaging Bagging (bootstrap aggregating) Using the bootstrap to improve the estimate or prediction itself. Bagging is a meta-algorithm to improve machine learning of classification and regression models in terms of stability and classification accuracy; it also reduces variance and helps to avoid overfitting. Bagging is a special case of the model averaging approach. For each bootstrap sample $Z^{*b}$ we fit our model, giving prediction $\hat f^{*b}(x)$. The bagging estimate is defined by $\hat f_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f^{*b}(x)$. For classification, consider $\hat f^{*b}(x)$ as an indicator vector with a single one and K−1 zeros. Then the bagged estimate is a K-vector $(p_1, p_2, \ldots, p_K)$, with $p_k$ equal to the proportion of models predicting class k at x. Our predicted class is the one with the most votes from the models. 54 / 65
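
A hedged sketch of the voting scheme just described. The decision-tree base learner and integer class labels 0..K−1 are assumptions; the slides do not prescribe a base model:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, B=50, seed=0):
    """Bagging by majority vote over B trees fit on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = []
    for _ in range(B):
        idx = rng.integers(0, n, n)                  # bootstrap sample Z*b
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.stack(votes)                          # B x n_test matrix of class labels
    # majority vote per test point (labels assumed to be integers 0..K-1)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```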

Model Averaging Bagging (bootstrap aggregating) Under squared-error loss, bagging (averaging) helps reduce variance and leaves bias unchanged. The same does not hold for classification under the 0-1 loss, because of the nonadditivity of bias and variance: bagging a good classifier can make it better, but bagging a bad classifier can make it worse. When we bag a model, any simple structure in the model is lost; a bagged tree is no longer a tree. For interpretation of the model, this can be a drawback. More stable models are not much affected by bagging; the unstable models most helped by bagging are unstable because of the emphasis on interpretability, and this is lost in the bagging process. Bumping: choose the model that produces the smallest prediction error, averaged over the training set. 55 / 65

Boosting methods Fighting the bias-variance tradeoff 56 / 65

Boosting methods Voting (Ensemble Methods) 57 / 65

Boosting methods A procedure to combine the outputs of many weak classifiers to produce a powerful committee. Sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers $G_m(x)$, $m = 1, 2, \ldots, M$. Combine the predictions from all of them through a weighted majority vote, $G(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$. 58 / 65

Boosting methods AdaBoost 59 / 65

Boosting methods AdaBoost The data modifications at each boosting step consist of applying weights $w_1, w_2, \ldots, w_N$ to each of the training observations. 60 / 65

Boosting methods AdaBoost-Learning from weighted data 61 / 65

Boosting methods AdaBoost
1. Initialize the observation weights $w_i = 1/N$, $i = 1, \ldots, N$.
2. For m = 1 to M (the number of boosting steps):
2.1 fit a classifier $G_m(x)$ to the training data using the weights $w_i$;
2.2 compute $err_m = \frac{\sum_{i=1}^{N} w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$;
2.3 compute $\alpha_m = \log((1 - err_m)/err_m)$;
2.4 update $w_i \leftarrow w_i \cdot \exp(\alpha_m I(y_i \neq G_m(x_i)))$, $i = 1, \ldots, N$.
3. Output $G(x) = \operatorname{sign}\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$.
AdaBoost is forward stagewise additive modeling: it sequentially adds new basis functions to the expansion without adjusting the parameters and coefficients of those that have already been added. 62 / 65
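
The steps above map almost line by line onto code. The sketch below is one possible transcription, using a depth-1 decision tree (decision stump) as the weak learner; the choice of weak learner, the small epsilon guarding against $err_m = 0$, and labels in {−1, +1} are assumptions, not part of the slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    """AdaBoost.M1 with decision stumps; y is assumed to take values in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                         # step 1: uniform weights
    learners, alphas = [], []
    for _ in range(M):                              # step 2
        G = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (G.predict(X) != y)
        err = np.sum(w * miss) / np.sum(w)          # step 2.2: weighted error
        alpha = np.log((1 - err) / (err + 1e-12))   # step 2.3
        w *= np.exp(alpha * miss)                   # step 2.4: up-weight mistakes
        learners.append(G)
        alphas.append(alpha)

    def predict(X_new):                             # step 3: weighted majority vote
        agg = sum(a * g.predict(X_new) for a, g in zip(alphas, learners))
        return np.sign(agg)

    return predict
```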

Boosting methods AdaBoost 63 / 65

Boosting methods AdaBoost [Figure: evolution of the boosted classifier on a two-class toy data set after m = 1, 2, 3, 6, 10 and 150 boosting rounds.] 64 / 65

References Aarti Singh, Model Selection, http://www.cs.cmu.edu/~epxing/class/10701/lecture/lecture8.pdf Aarti Singh, Boosting, http://www.cs.cmu.edu/~epxing/class/10701/lecture/lecture19.pdf Wikipedia, http://en.wikipedia.org/wiki/Main_Page Tom Mitchell, Machine Learning, book. Bishop, Pattern Recognition and Machine Learning, book. Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, Springer. 65 / 65