Methods and Criteria for Model Selection. CS57300 Data Mining, Fall 2016. Instructor: Bruno Ribeiro


1 Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro

2 Goal
} Introduce classifier evaluation criteria
} Introduce the bias-variance duality
} Model assessment
} Model selection

3 In the Training Phase There Is Often a Better-Looking Model
(Diagram: data split into Train and Test; model assessment on the test split estimates the real-world error.)

4 Testing Classification Accuracy

5 Classification Error
True positive (TP): positive prediction that is correct
True negative (TN): negative prediction that is correct
False positive (FP): positive prediction that is incorrect
False negative (FN): negative prediction that is incorrect
} Accuracy = (TP + TN) / (TP + TN + FP + FN): % of predictions that are correct
} Misclassification = (FP + FN) / (TP + TN + FP + FN): % of predictions that are incorrect
} Recall (Sensitivity): R = TP / (TP + FN): % of positive instances that are predicted positive
} Precision: P = TP / (TP + FP): % of positive predictions that are correct
} Specificity = TN / (TN + FP): % of negative instances that are predicted negative
} F1 = 2PR / (P + R): harmonic mean of precision and recall
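A minimal sketch in Python (with made-up confusion counts, not from the slides) of how each metric follows from the four confusion-matrix entries:

```python
# Hypothetical counts for illustration.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
misclassification = (fp + fn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)            # sensitivity
precision = tp / (tp + fp)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
```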

6 Precision and Recall

7 More score functions
} Absolute loss: L(t, t̂) = |t − t̂|
} Squared loss: L(t, t̂) = (t − t̂)²
} Likelihood/conditional likelihood
} Area under the ROC curve
Receiver Operating Characteristic (ROC) curve: for classifiers for which a classification threshold can be defined (e.g., logistic regression, SVMs [vary the hyperplane offset b], ...), plots the true positive rate against the false positive rate for different classification thresholds

8 } Evaluates performance over varying costs and class distributions
} Can be summarized with the area under the curve (AUC): an AUC of 0.5 is random; an AUC of 1.0 is perfect
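As an illustration, here is a small Python sketch (hypothetical labels and scores) that sweeps a threshold over predicted probabilities, collects one (FPR, TPR) point per threshold, and approximates the AUC with the trapezoidal rule:

```python
import numpy as np

# Hypothetical data: y_true holds binary labels; y_score holds P(label = 1).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1])

# Sweep thresholds from high to low; each threshold gives one (FPR, TPR) point.
P, N = y_true.sum(), (1 - y_true).sum()
fpr, tpr = [0.0], [0.0]
for thr in np.sort(np.unique(y_score))[::-1]:
    pred = (y_score >= thr).astype(int)
    tpr.append(((pred == 1) & (y_true == 1)).sum() / P)
    fpr.append(((pred == 1) & (y_true == 0)).sum() / N)

# Trapezoidal rule over the piecewise-linear ROC curve.
fpr, tpr = np.array(fpr), np.array(tpr)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(f"AUC = {auc:.3f}")
```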

9 E.g., measuring the performance of logistic regression predicting labels {+, −}
(Worked example: instances are sorted by predicted P(Y) and the classification threshold is lowered step by step; each threshold yields one (FPR, TPR) point, e.g. FPR = 0/5, TPR = 1/4, then FPR = 1/5, TPR = 1/4, then FPR = 1/5, TPR = 2/4.)

10 ROC curve
(Figure: ROC curve, plotting the true positive rate (TPR) against the false positive rate (FPR).)

11 ROC Curves Should Not Look Like This
} If your ROC curve looks like a few large steps, you are likely doing something wrong:
Either you have too few discrete features,
or you are predicting labels 0 and 1 but gave the software library the hard predictions {0, 1} instead of the probability in [0, 1] that the label is 1.

12 Issues with conventional score functions
} They assume errors for all individuals are equally weighted:
May want to weight recent instances more heavily
May want to include information about the reliability of sets of measurements
} They assume errors in all contexts are equally weighted:
May want to weigh false positives and false negatives differently
There may be costs associated with acting on particular classifications (e.g., marketing to individuals)

13 Examples
} Loan decisions: the cost of lending to a defaulter is much greater than the lost business of refusing a loan to a non-defaulter
} Oil-slick detection: the cost of failing to detect an environmental disaster is far less than the cost of a false alarm
} Promotional mailing: the cost of sending junk mail to a household that doesn't respond is far less consequential than the lost business of not sending it to a household that would have responded

14 Cost-sensitive models
} Define a score function based on a cost matrix
} If ỹ is the predicted class and y is the true class, define a matrix of costs C(ỹ, y)
} C(ỹ, y) reflects the severity of classifying an instance with true class y as class ỹ
Example 2x2 matrix (entries from the slide; arrangement over predicted x true class):
C = [ +$10  −$1 ]
    [ −$5    0  ]
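A sketch of how such a matrix might be used at prediction time; the cell layout and the probability below are assumptions for illustration, not taken from the slide:

```python
import numpy as np

# Assumed layout: rows index the predicted class ~y, columns the true class y.
# Entries here are payoffs (dollar values from the slide), so we maximize;
# with a pure cost matrix (all entries are losses) you would argmin instead.
C = np.array([[+10.0, -1.0],   # predict positive
              [-5.0,   0.0]])  # predict negative

def predict_cost_sensitive(p_pos, payoff=C):
    """Choose the class with the best expected payoff given P(y = positive)."""
    p = np.array([p_pos, 1.0 - p_pos])   # [P(y = pos), P(y = neg)]
    expected = payoff @ p                # expected payoff of each prediction
    return int(np.argmax(expected))      # 0 = predict positive, 1 = negative

print(predict_cost_sensitive(0.7))  # leans positive when P(pos) is high
```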

15 Assessing Error with Training Data
We don't have the test data!
(Diagram: data split into Train and Test.)
Can we just use the training data to evaluate our classifier?

16 The Problem of Overfitting
(Figure from The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman: prediction error vs. model complexity (degrees of freedom). Training error decreases monotonically with complexity, while real-world error is U-shaped; low complexity gives high bias / low variance, high complexity gives low bias / high variance.)

17 Linear Regression
(Figure: a straight-line fit to n observations (x, y).)

18 Cubic Regression
(Figure: a cubic fit to the same n observations; a better fit.)

19 Degree n−1 Polynomial
(Figure: a degree-(n−1) polynomial passing through all n observations. Even better?)
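The progression above can be reproduced with a small Python experiment; the target function and noise level below are assumptions for illustration. A degree-(n−1) polynomial interpolates all n training points (near-zero training error) but generalizes poorly:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10
x_train = np.sort(rng.uniform(-1, 1, n))
y_train = np.sin(2 * x_train) + rng.normal(0, 0.2, n)  # hypothetical target + noise
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(2 * x_test)

for degree in (1, 3, n - 1):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```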

20 But while the term overfitting is widely used, be careful:
a) Overfitting is not just about the number of parameters
b) Overfitting is an easy-to-understand concept, but reality is more complex
c) Overfitting is also known as the variance of the generalization error

21 Overfitting, hypotheses, and priors
} Consider a logistic regression classifier.
} For a dataset {φ_n, t_n}, where φ_n is the transformed feature vector of item n and t_n ∈ {0, 1} is the class of item n, for n = 1, ..., N data points, with t = (t_1, ..., t_N)^T, y_n = σ(a_n), and a_n = w^T φ_n.
} If N is small but the dimension of w is large, we will likely overfit the data.
} But what happens if w ~ Normal(0, Σ)?
} The prior effectively reduces the number of hypotheses we are encouraged to test (w is no longer "free"); note that the number of model parameters stays the same.
Overfitting is just a symptom. The real underlying cause is the testing of multiple hypotheses.

22 Multiple Hypotheses View of Classification
} We will not cover the multiple-hypothesis view of classification; just a few pointers (the topic is covered in the Machine Learning class)
} The problem is that of Empirical Risk Minimization, VC dimension, and PAC learning:
We define a hypothesis class H
Given data points {(x^(i), y^(i))}_{i=1,...,m}, the empirical error of a hypothesis h ∈ H is
$$\hat{\varepsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\{h(x^{(i)}) \neq y^{(i)}\}$$
Goal: find the hypothesis that minimizes the empirical error:
$$\hat{h} = \arg\min_{h \in H} \hat{\varepsilon}(h)$$
Distinct theorems exist for finite and infinite hypothesis classes

23 Model Selection and Assessment
} Model selection: estimating the performance of different models in order to select the best one
} Model assessment: having chosen a model, estimating its prediction error on new data

24 Model Selection and Assessment
} In data-rich scenarios, split the data into Train / Validation / Test (validation for model selection, test for model assessment)
} In data-poor scenarios, approximate the validation step:
analytically: AIC, BIC, MDL
via sample re-use: cross-validation (leave-one-out, K-fold), bootstrap

25 Learning Curves (both data-rich and data-poor)
} A learning curve shows how a method performs with more training data: (a) over validation data; (b) over training data. Score: F1 = 2PR / (P + R)
} One of the most useful methods to debug classifiers:
Checking for overfitting? Look at the gap between the training and validation curves.
Checking for code bugs or a bad local optimum? Training-data accuracy must decrease with more data.
(Figure: training-data and validation-data scores vs. training set size.)
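A minimal sketch of how such a curve can be computed; `fit` and `score` are placeholders for whatever training routine and metric you are using, not names from the slides:

```python
def learning_curve(fit, score, X, y, X_val, y_val, sizes):
    """Train on growing prefixes of (X, y); score each model on the
    training subset and on a fixed validation set."""
    train_scores, val_scores = [], []
    for m in sizes:
        model = fit(X[:m], y[:m])
        train_scores.append(score(model, X[:m], y[:m]))
        val_scores.append(score(model, X_val, y_val))
    return train_scores, val_scores
```

Plotting the two returned curves against `sizes` gives the diagnostic described above: a persistent gap signals overfitting, and a training curve that fails to decrease signals a bug or a bad optimum.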

26 Data-poor Assessment of Classification Error

27 Model: the true target t is the predicted target f(x_0; θ*) plus random noise ε:
$$t = f(x_0; \theta^*) + \varepsilon, \qquad E[\varepsilon] = 0, \quad \mathrm{Var}(\varepsilon) = \sigma^2$$
For squared-error loss with additive noise, adding and subtracting $E\big[f(x_0; \hat{\theta})\big]$:
$$\begin{aligned}
\mathrm{Err}(x_0) &= E\big[(t - f(x_0; \hat{\theta}))^2 \mid X = x_0\big] \\
&= \sigma^2 + \Big(E\big[f(x_0; \hat{\theta})\big] - f(x_0; \theta^*)\Big)^2 + E\Big[\big(f(x_0; \hat{\theta}) - E[f(x_0; \hat{\theta})]\big)^2\Big] \\
&= \sigma^2 + \mathrm{Bias}^2\big(f(x_0; \hat{\theta})\big) + \mathrm{Var}\big(f(x_0; \hat{\theta})\big)
\end{aligned}$$
Here $\sigma^2$ is the true noise, the bias term is the deviation of the average estimate from the true function's mean, and the variance term is the expected squared deviation of the estimate around its own mean.

28 } Bias
Often related to the size of the model space: the smaller the model space M, the less likely the true model lies in M. More complex models tend to have lower bias.
} Variance
Often related to the size of the dataset and the quality of its information. When the data is large enough to estimate the true parameters θ*, we get lower variance.
} With big data, simple models can perform surprisingly well due to their lower variance.
$$\mathrm{Err}(x_0) = E\big[(t - f(x_0; \hat{\theta}))^2 \mid X = x_0\big] = \sigma^2 + \mathrm{Bias}^2\big(f(x_0; \hat{\theta})\big) + \mathrm{Var}\big(f(x_0; \hat{\theta})\big)$$
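These quantities can be estimated by simulation. The sketch below (synthetic data; the target function, noise level, and polynomial degree are assumptions) repeatedly redraws a training set and measures the bias and spread of the fitted value at a single point x_0:

```python
import numpy as np

rng = np.random.default_rng(1)

def f_true(x):
    return np.sin(2 * x)  # assumed true function

x0, sigma, degree, n, trials = 0.5, 0.2, 3, 30, 2000
fits = np.empty(trials)
for t in range(trials):
    x = rng.uniform(-1, 1, n)
    y = f_true(x) + rng.normal(0, sigma, n)           # fresh training set
    fits[t] = np.polyval(np.polyfit(x, y, degree), x0)  # f(x0; theta_hat)

bias2 = (fits.mean() - f_true(x0)) ** 2
var = fits.var()
print(f"sigma^2={sigma**2:.4f}  bias^2={bias2:.5f}  variance={var:.5f}")
# Err(x0) is approximately sigma^2 + bias^2 + variance.
```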

29 } Hypothesis space
The larger the search space, the smaller the bias
The larger the search space, the larger the variance
} Problem: looking for the BEST MODEL (a point estimate)
} By testing too many hypotheses, the best-looking hypothesis is likely not a good one
(Figure: our search space overlaid with the nested sets of models with error < ε, < 2ε, and < 3ε.)

30 Optimism of the Training Error Rate
} Typically the training error rate is smaller than the true error, because the same data is used both to fit the method and to assess its error:
$$\mathrm{Err}_{\mathrm{train}} = \frac{1}{N}\sum_{i=1}^{N} L\big(t(i), f(x_0(i); \hat{\theta})\big) < \mathrm{Err}(X) = E\big[L(t, f(x; \hat{\theta}))\big]$$
where L is the loss function and t the target; the training estimate is overly optimistic.

31 Estimating Test Error
} Can we estimate the discrepancy between Err(x_0) and Err(X)?
} Let D be the old observations and D^new the new observations.
} Suppose we observed new values of the label t(i) at the same inputs x(i). The in-sample error Err_in takes the expectation over N new responses at each x_i:
$$E_D[\mathrm{Err}_{\mathrm{in}}] = \frac{1}{N}\sum_{i=1}^{N} E_D\, E_{D^{\mathrm{new}}}\, L\big(t^{\mathrm{new}}(i), f(x_0(i); \hat{\theta})\big) = E_D[\mathrm{Err}_{\mathrm{train}}] + \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{t}_i, t_i)$$
where the model is trained with the old observations D; the covariance term is the adjustment for the optimism of the training error.

32 Measuring Optimism
If the model is $t = f(x; \theta) + \varepsilon$, with $E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\dim(\theta) = d$, and if f is linear, $f(x; \theta) = x^T\theta$, then
$$E_D[\mathrm{Err}_{\mathrm{in}}] = E_D[\mathrm{Err}_{\mathrm{train}}] + 2\frac{d}{N}\sigma^2$$
} Optimism grows linearly with the model dimension d
} Optimism decreases as the training sample size N increases

33 Ways to Estimate Prediction Error
} In-sample error estimates: AIC, BIC, MDL
} Extra-sample error estimates: cross-validation (leave-one-out, K-fold), bootstrap

34 Estimates of In-Sample Prediction Error
} General form of the in-sample estimate: the training error plus an estimate of the optimism, $\widehat{\mathrm{Err}_{\mathrm{in}}} = \mathrm{Err}_{\mathrm{train}} + \widehat{\mathrm{optimism}}$
} For additive-error models this gives $\mathrm{Err}_{\mathrm{train}} + 2\frac{d}{N}\hat{\sigma}^2$, known as Mallows's C_p statistic

35 Akaike Information Criterion (AIC)
$$\mathrm{AIC} = -\frac{2}{N} \ln P\big[t \mid f(x_0; \hat{\theta})\big] + 2\frac{d}{N}$$
} AIC does not assume the model is correct
} Better for a small number of samples
} May not converge to the correct model as N → ∞ (model bias may not go to 0)
Bayesian Information Criterion (BIC)
$$\mathrm{BIC} = -\frac{2}{N} \ln P\big[t \mid f(x_0; \hat{\theta})\big] + \frac{d}{N} \log N$$
} BIC assumes the model is correct
} Converges to the correct parameterization θ* of the model as N → ∞ (model bias = 0)
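As a worked illustration, the sketch below computes the per-sample AIC and BIC above for an ordinary least-squares fit, using the Gaussian log-likelihood at the MLE; the data is synthetic and the model choice is an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 200, 4
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ w_true + rng.normal(0, 1.0, N)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ w_hat
sigma2 = resid @ resid / N
# Gaussian log-likelihood evaluated at the MLE.
loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)

aic = -2 / N * loglik + 2 * d / N
bic = -2 / N * loglik + d / N * np.log(N)
print(f"AIC={aic:.3f}  BIC={bic:.3f}")
```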

36 MDL (Minimum Description Length)
} Find the hypothesis (model) that minimizes H(M) + H(D | M), where
H(M) is the length in bits of the description of model M
H(D | M) is the length in bits of the description of the data encoded by model M
} If the model is probabilistic, under optimal coding the length is
$$-\ln P\big[t \mid f(x_0; \hat{\theta}), M\big] - \ln P\big[\hat{\theta} \mid M\big]$$
the first term being the length of transmitting the discrepancy given the model, the second the description of the model itself.
} MDL principle: choose the model with the minimum description length. Equivalent to maximizing the posterior $P\big[t \mid f(x_0; \hat{\theta}), M\big]\, P\big[\hat{\theta} \mid M\big]$.

37 BIC is One Example of a Bayesian Approach
} While we will not cover Bayesian methods in this course, they tend to ameliorate some of these point-estimate issues
} They are harder to compute

38 } A Bayesian approach does not look for the best hypothesis; it is not interested in point estimates of the model parameters
} It seeks to take variance into account in model evaluation:
$$\mathrm{Err}(x_0) = \sum_{\theta} E\big[(t - f(x_0; \theta))^2 \mid X = x_0, \theta\big]\, P[\theta \mid \mathrm{Data}]$$
} If we test too many hypotheses with not enough data, $P[\theta \mid \mathrm{Data}]$ is less concentrated (higher entropy), leading to larger errors

39 Data-rich Assessment of Classification Error

40 Estimation of Extra-Sample Error
} Cross-validation
} Bootstrap
(Diagram: Train / Validation / Test split; validation for model selection, test for model assessment.)

41 Estimation of Extra-Sample Error via Cross-Validation
} Goal: estimate the test error of a supervised learning method
} Algorithm:
1. Split the data into two parts (training and validation)
2. Train the method on the training part
3. Compute the error on the validation part
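A minimal sketch of this holdout procedure; `fit` and `loss` stand in for your own training routine and loss function:

```python
import numpy as np

def holdout_error(fit, loss, X, y, frac_train=0.7, seed=0):
    """Steps 1-3 above: split, train, and score on the held-out part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # shuffle before splitting
    cut = int(frac_train * len(y))
    tr, va = idx[:cut], idx[cut:]
    model = fit(X[tr], y[tr])              # step 2: train
    return loss(model, X[va], y[va])       # step 3: validation error
```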

42 Wrong Way to Perform Cross-Validation
} Using cross-validation to estimate too many model decisions, or to select too many features
} Using cross-validation to hunt for too many new models
The same problem we have when hunting for p-values (too many hypotheses). The "human regression machine" is multiple hypothesis testing in disguise:
(Diagram: a loop; propose a new model, check CV, CV is bad, propose a new model, ...)

43 In Data-poor Scenarios We Should Use K-fold Cross-Validation

44 K-fold Cross-Validation
(Diagram: the data is split into K folds; in each round, one fold is held out for validation and the method is trained on the remaining folds.)
K-fold cross-validation error:
$$\mathrm{CV}_{\mathrm{err}}(K) = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i \in S_k} L\big(t(i), \hat{t}^{(-k)}(i)\big)$$
where $S_k$ is the k-th validation fold, L is the loss function, t(i) the target, and $\hat{t}^{(-k)}(i)$ the estimate from the model trained without the k-th fold.
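A sketch implementing the formula above; as before, `fit` and `loss` are placeholders, with `loss` assumed to average over the fold it is given:

```python
import numpy as np

def kfold_cv_error(fit, loss, X, y, K=5, seed=0):
    """Average per-fold loss of models trained without each fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])          # trained without fold k
        errors.append(loss(model, X[val], y[val]))
    return np.mean(errors)                       # CV_err(K)
```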

45 Choosing K
(Diagram: as K grows from small values toward N (leave-one-out), computation increases and the bias of the error estimate decreases; smaller K gives lower variance.)

46 Cross-Validation: Choosing K
(Figure: 1 − Err vs. size of the training set, used to reason about how much each fold's removal hurts training.)
Popular choices for K: 5, 10, and N (leave-one-out).

47 Evaluating Classification Algorithms A and B
} Use K-fold cross-validation to get K estimates of the error for algorithms A and B
(Diagram: the dataset is split into folds Train1/Valid.1 through Train6/Valid.6; each fold yields one error for A (Error A.1, ..., A.6) and one for B (Error B.1, ..., B.6).)
} The set of errors estimated over the validation folds provides an empirical estimate of the sampling distribution
} The mean is an estimate of the expected error
} Is the difference in mean error between A and B significant?

48 Assessing Significance
} Use a paired t-test to assess whether the two distributions of errors are statistically different from each other
(Diagram: errors paired by fold, (Error A.1, Error B.1) through (Error A.5, Error B.5).)
} Takes into account both the difference in means and the variability of the scores

49 Paired t-test
} Instead of comparing the difference of the two population means,
$$\frac{1}{N}\sum_{i=1}^{N} X_i^{(1)} - \frac{1}{N}\sum_{i=1}^{N} X_i^{(2)},$$
} a paired t-test compares the mean difference
$$\bar{d} = \frac{1}{N}\sum_{i=1}^{N} \big(X_i^{(1)} - X_i^{(2)}\big).$$
} The real effect comes from estimating the standard error of the mean difference:
$$SE(\bar{d}) = \frac{s_d}{\sqrt{N}}, \quad \text{where } s_d = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \big(X_i^{(1)} - X_i^{(2)} - \bar{d}\big)^2}.$$
} The t statistic has N − 1 degrees of freedom and is given by
$$t = \frac{\bar{d}}{SE(\bar{d})}.$$
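A sketch with hypothetical per-fold errors; note that the manual statistic below uses the N − 1 (sample) convention for s_d, which is also what scipy.stats.ttest_rel uses:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold errors of algorithms A and B (paired by fold).
err_A = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16])
err_B = np.array([0.10, 0.14, 0.09, 0.13, 0.12, 0.13])

d = err_A - err_B
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))  # N-1 degrees of freedom
t_scipy, p_value = stats.ttest_rel(err_A, err_B)
print(f"t={t_manual:.3f} (scipy: {t_scipy:.3f}), p={p_value:.4f}")
```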

50 Bootstrapping

51 Bootstrapping
} The bootstrap is a powerful statistical tool for quantifying the uncertainty associated with a given estimator or statistical learning method
} E.g., it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient
} Instead of sampling new datasets from the unknown population distribution, resample from the training data (the empirical population distribution)

52 Why "Bootstrapping"?
} The term bootstrap derives from "to pull oneself up by one's bootstraps", likely based on the eighteenth-century novel The Surprising Adventures of Baron Munchausen by Rudolph Erich Raspe: "The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps."
} It is not the same as the term bootstrap used in computer science, meaning to boot a computer from a set of core instructions, though the meaning is similar.

53 Bootstrap: Main Concept (Model Assessment and Selection)
Step 1: Draw N samples with replacement from the training sample Z = (z_1, z_2, ..., z_N) to obtain bootstrap samples Z*1, Z*2, ..., Z*B.
Step 2: Learn the model S(·) over each bootstrap sample, yielding the bootstrap replications S(Z*1), S(Z*2), ..., S(Z*B).
Step 3: Compute the variance of the estimated parameters and the error over the validation data.
(Figure adapted from The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman.)
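A minimal sketch of the three steps for a simple statistic (the sample mean), with synthetic data standing in for the training sample:

```python
import numpy as np

rng = np.random.default_rng(3)

z = rng.normal(loc=5.0, scale=2.0, size=50)  # training sample Z (synthetic)
B = 1000

replications = np.empty(B)
for b in range(B):
    sample = rng.choice(z, size=len(z), replace=True)  # step 1: resample
    replications[b] = sample.mean()                    # step 2: S(Z*b)

se_hat = replications.std(ddof=1)                      # step 3: variability
print(f"bootstrap SE of the mean: {se_hat:.3f}")
print(f"theory (s/sqrt(n)):       {z.std(ddof=1)/np.sqrt(len(z)):.3f}")
```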

54 Bootstrap Validation Data
} Two ways to use validation data:
1. Validation comes from whatever data points are not used in training
Issue: the validation data size is not consistent across bootstrap replications, so the uncertainty of the validation error is not consistent
Advantage: keeps all the training data potentially available for training the model
2. Validation is split from the training data before bootstrapping
Issue: must split validation from training, reducing the training data
Advantage: the validation data size is kept consistent across bootstrap replications
