Methods and Criteria for Model Selection. CS57300 Data Mining, Fall 2016. Instructor: Bruno Ribeiro
1 Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro
2 Goal } Introduce classifier evaluation criteria } Introduce Bias x Variance duality } Model Assessment } Model Selection 2015 Bruno Ribeiro
3 In Training Phase There is Often a Better Model
[Figure: data split into Train and Test; the Test split is used for model assessment, i.e. to estimate real-world error]
4 Testing Classification Accuracy
5 Classification Error
True positive (TP): positive prediction that is correct
True negative (TN): negative prediction that is correct
False positive (FP): positive prediction that is incorrect
False negative (FN): negative prediction that is incorrect
} Accuracy = (TP + TN) / (TP + TN + FP + FN): fraction of predictions that are correct
} Misclassification = (FP + FN) / (TP + TN + FP + FN): fraction of predictions that are incorrect
} Recall (Sensitivity): R = TP / (TP + FN): fraction of positive instances that are predicted positive
} Precision: P = TP / (TP + FP): fraction of positive predictions that are correct
} Specificity = TN / (TN + FP): fraction of negative instances that are predicted negative
} F1 = 2 (P R) / (P + R): harmonic mean of precision and recall
6 Precision and Recall
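The score functions on this slide can be sketched directly from the four confusion-matrix counts. The function name and the example counts below are illustrative assumptions, not part of the lecture.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the slide's score functions from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "recall": recall,
        "precision": precision,
        "specificity": tn / (tn + fp),
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the sketch does not guard against empty denominators (for example, no positive predictions at all); a production version would need to decide what precision means in that case.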
7 More score functions
} Absolute loss: $|t - f(x; \theta)|$
} Squared loss: $(t - f(x; \theta))^2$
} Likelihood/conditional likelihood: $P(t \mid x; \theta)$
} Area under the ROC curve
Receiver Operating Characteristic (ROC) curve
For classifiers where a classification threshold can be defined (e.g. Logistic Regression, SVMs [vary hyperplane offset b], ...)
Plots the true positive rate against the false positive rate for different classification thresholds
8 } The ROC curve evaluates performance over varying costs and class distributions
Can summarize with area under the curve (AUC)
AUC of 0.5 is random
AUC of 1.0 is perfect
9 E.g. Measuring the performance of Logistic Regression predicting labels {+,-}
[Figure: instances ranked by predicted P(Y) with their true classes (4 positive, 5 negative). Lowering the classification threshold one instance at a time gives FPR = 0/5, TPR = 1/4; then FPR = 1/5, TPR = 1/4; then FPR = 1/5, TPR = 2/4]
10 ROC curve
[Figure: ROC curve; x-axis: false positive rate (FPR), y-axis: true positive rate (TPR)]
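The threshold sweep in the example can be sketched as follows. The scores are hypothetical; they are chosen only so that the first three ROC points match the FPR/TPR values on the slide (4 positives, 5 negatives). Function names are assumptions.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold through
    the sorted scores. Labels are 1 (positive) or 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Instances tied at the same score cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predicted probabilities and true labels.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points[1:4], auc(points))

# Passing hard 0/1 predictions instead of probabilities collapses the
# curve to a single interior point.
hard_points = roc_points([1, 0, 1, 0], [1, 0, 0, 1])
```

With hard predictions every instance is tied at one of two scores, so the sweep produces only one point between (0, 0) and (1, 1).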
11 ROC Curves Should not Look Like This
} If your ROC curve looks like this, you are likely doing something wrong
Either you have too few discrete features
Or you are predicting labels 0 and 1 but gave the software library the hard predictions {0, 1} instead of the probability in [0, 1] that the label is 1
12 Issues with conventional score functions
} They assume errors for all individuals are equally weighted
May want to weight recent instances more heavily
May want to include information about the reliability of sets of measurements
} They assume errors for all contexts are equally weighted
May want to weigh false positives and false negatives differently
There may be costs associated with acting on particular classifications (e.g., marketing to individuals)
13 Examples
} Loan decisions
Cost of lending to a defaulter is much greater than the lost business of refusing a loan to a non-defaulter
} Oil-slick detection
Cost of failing to detect an environmental disaster is far greater than the cost of a false alarm
} Promotional mailing
Cost of sending junk mail to a household that doesn't respond is far less consequential than the lost business of not sending it to a household that would have responded
14 Cost-sensitive models
} Define a score function based on a cost matrix
} If ỹ is the predicted class and y is the true class, then we need to define a matrix of costs C(ỹ, y)
} C(ỹ, y) reflects the severity of classifying an instance with true class y as class ỹ
[Example cost matrix from the slide; the cell layout is not recoverable: +$10, -$1, -$5, 0]
15 Assessing Error with Training Data
We don't have the test data!
[Figure: Train / Test split]
Can we just use training data to evaluate our classifier?
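A cost-matrix score function can be sketched as below. The cost values are hypothetical (a false negative assumed five times as costly as a false positive); they are not the entries of the slide's matrix, whose layout did not survive transcription. Function names are also assumptions.

```python
# Hypothetical cost matrix C[(predicted, true)]; illustrative values only.
COST = {
    ("+", "+"): 0.0,  # correct positive
    ("+", "-"): 1.0,  # false positive
    ("-", "+"): 5.0,  # false negative (assumed 5x worse)
    ("-", "-"): 0.0,  # correct negative
}


def average_cost(predicted, actual, cost=COST):
    """Score a set of predictions by their average cost."""
    return sum(cost[(p, y)] for p, y in zip(predicted, actual)) / len(actual)


def min_cost_prediction(p_positive, cost=COST):
    """Pick the class with the lowest expected cost given P(y = '+')."""
    expected = {
        pred: p_positive * cost[(pred, "+")]
        + (1 - p_positive) * cost[(pred, "-")]
        for pred in ("+", "-")
    }
    return min(expected, key=expected.get)


print(average_cost(["+", "-", "-"], ["+", "+", "-"]))
print(min_cost_prediction(0.3), min_cost_prediction(0.1))
```

Under these assumed costs the decision threshold moves from 0.5 down to 1/6: predicting "+" is cheaper whenever P(y = "+") exceeds 1/6.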
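16 The Problem of Overfitting
[Figure: prediction error versus model complexity (df). Training error keeps decreasing as complexity grows, while real-world error is lowest at intermediate complexity; the left side is the high-bias, low-variance regime and the right side the low-bias, high-variance regime. Source: The Elements of Statistical Learning, Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie]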
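17 Linear Regression
[Figure: linear fit to n observations; axes x and y]
18 Cubic Regression
Better
[Figure: cubic fit to the same n observations]
19 Degree n-1 Polynomial
Even better?
[Figure: degree n-1 polynomial passing through all n observations]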
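The contrast between the linear fit and the degree n-1 polynomial can be sketched in plain Python. Everything about the data here is an assumption for illustration: a linear ground truth, Gaussian noise, eight training points. The degree n-1 fit is built with Lagrange interpolation, which passes through every training point.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Degree n-1 polynomial through all n points (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
truth = lambda x: 2.0 * x  # assumed ground-truth function
train_x = [i / 7 for i in range(8)]
train_y = [truth(x) + rng.gauss(0, 0.3) for x in train_x]
test_x = [(i + 0.5) / 7 for i in range(7)]
test_y = [truth(x) + rng.gauss(0, 0.3) for x in test_x]

line = fit_line(train_x, train_y)
interp = fit_interpolant(train_x, train_y)
print("train MSE:", mse(line, train_x, train_y), mse(interp, train_x, train_y))
print("test MSE: ", mse(line, test_x, test_y), mse(interp, test_x, test_y))
```

The interpolant always reaches zero training error, so training error alone cannot tell us whether it is "even better"; the held-out error is what answers the slide's question.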
20 But while the term overfitting is widely used, be careful
a) Overfitting is not just about the number of parameters
b) Overfitting is an easy-to-understand concept, but reality is more complex
c) Overfitting is also known as the variance of the generalization error
21 Overfitting, hypotheses, and priors
} Consider a logistic regression classifier.
} For a dataset $\{\phi_n, t_n\}$, where $\phi_n$ is the transformed feature vector of item n and $t_n \in \{0,1\}$ is the class of item n, for n = 1, ..., N data points, with $\mathbf{t} = (t_1, \ldots, t_N)^T$, $y_n = \sigma(a_n)$ and $a_n = w^T \phi_n$.
} If N is small but the dimension of w (and of $\phi_n$) is large, we will likely overfit the data
} But what happens if $w \sim \mathrm{Normal}(0, \Sigma)$?
} The prior effectively reduces the number of hypotheses we are encouraged to test (w is no longer "free"); note that the number of model parameters stays the same.
Overfitting is just a symptom
The real underlying cause is the testing of multiple hypotheses
22 Multiple Hypotheses view of Classification
} We will not cover the multiple hypothesis view of classification
Just a few pointers (topic covered in the Machine Learning class)
} The problem is that of Empirical Risk Minimization, VC-dimension and PAC learning
We define a hypothesis class H
Data points $\{(x^{(i)}, y^{(i)})\}_{i=1,\ldots,m}$ with empirical error for hypothesis $h \in H$:
$$\hat\varepsilon(h) = \frac{1}{m} \sum_{i=1}^{m} 1\{h(x^{(i)}) \neq y^{(i)}\}$$
Goal: find the hypothesis that minimizes the empirical error:
$$\hat h = \arg\min_{h \in H} \hat\varepsilon(h)$$
Distinct theorems for finite and infinite hypothesis classes
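For a finite hypothesis class, empirical risk minimization reduces to evaluating the empirical error of each hypothesis and taking the minimizer. The sketch below assumes a toy class of five threshold classifiers on one feature and a hypothetical dataset; none of these specifics come from the lecture.

```python
def make_threshold(c):
    """Hypothesis h_c(x) = 1 if x >= c else 0."""
    return lambda x: 1 if x >= c else 0


def empirical_error(h, data):
    """Fraction of data points (x, y) that h misclassifies."""
    return sum(1 for x, y in data if h(x) != y) / len(data)


def erm(thresholds, data):
    """Return the threshold whose classifier has the lowest empirical error."""
    return min(thresholds, key=lambda c: empirical_error(make_threshold(c), data))


# Hypothetical finite hypothesis class and labelled data.
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
data = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.9, 1), (0.8, 0)]

best_c = erm(thresholds, data)
best_error = empirical_error(make_threshold(best_c), data)
print(best_c, best_error)
```

The larger the class of thresholds we search, the more likely the minimizer's empirical error understates its true error, which is the multiple-hypotheses concern the slide points to.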
23 Model Selection and Assessment
} Model Selection: estimating the performance of different models in order to select the best one
} Model Assessment: having chosen a model, estimating its prediction error on new data
24 Model Selection and Assessment
} In data-rich scenarios split the data: Train | Validation | Test
Validation is used for model selection
Test is used for model assessment
} In data-poor scenarios: approximate the validation step
analytically: AIC, BIC, MDL
via sample re-use: cross-validation (leave-one-out, K-fold), bootstrap
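The data-rich split can be sketched as a single shuffle followed by slicing. The 60/20/20 proportions and the function name are assumptions; the lecture does not prescribe split sizes.

```python
import random


def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then slice into train / validation / test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # touched once, for model assessment
    val = shuffled[n_test:n_test + n_val]    # used repeatedly, for model selection
    train = shuffled[n_test + n_val:]        # used to fit each candidate model
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Keeping the test part out of every selection decision is what lets its error serve as an estimate of performance on new data.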
25 Learning Curves (both data-rich and data-poor)
} A learning curve shows how a method performs with more training data: (a) over validation data, (b) over training data
Score: F1 = 2 (P R) / (P + R)
} One of the most useful methods to debug classifiers.
Check for overfitting? Look at the gap between the training and validation scores
Check for code bugs or a bad local optimum? Training-data accuracy is expected to decrease as more data is added
[Figure: learning curves over training data and validation data]
26 Data-poor Assessment of Classification Error
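27 Model:
$t = f(x_0; \theta^\star) + \epsilon$, with $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma_\epsilon^2$
(t: true target; $f(x_0; \hat\theta)$: predicted target; $\epsilon$: random noise)
For squared-error loss with additive noise (add and subtract $E[f(x_0; \hat\theta)]$):
$$\mathrm{Err}(x_0) = E[(t - f(x_0; \hat\theta))^2 \mid X = x_0]$$
$$= E[\epsilon^2] + \left(E[f(x_0; \hat\theta)] - f(x_0; \theta^\star)\right)^2 + E\left[\left(f(x_0; \hat\theta) - E[f(x_0; \hat\theta)]\right)^2\right]$$
$$= \sigma_\epsilon^2 + \mathrm{Bias}^2(f(x_0; \hat\theta)) + \mathrm{Var}(f(x_0; \hat\theta))$$
$\sigma_\epsilon^2$: true noise
Bias: deviation of the average estimate from the true function's mean
Variance: expected squared deviation of the estimate around its own mean
28 } Bias
Often related to the size of the model space
The smaller the model space M, the less likely the true model lies in M
More complex models tend to have lower bias
} Variance
Often related to the size of the dataset and the quality of information
When the data is large enough to estimate the true parameters ($\theta^\star$) we get lower variance
} With big data simple models can perform surprisingly well due to lower variance
$$\mathrm{Err}(x_0) = E[(t - f(x_0; \hat\theta))^2 \mid X = x_0]$$
$$= \sigma_\epsilon^2 + \left(E[f(x_0; \hat\theta)] - f(x_0; \theta^\star)\right)^2 + E\left[\left(f(x_0; \hat\theta) - E[f(x_0; \hat\theta)]\right)^2\right]$$
$$= \sigma_\epsilon^2 + \mathrm{Bias}^2(f(x_0; \hat\theta)) + \mathrm{Var}(f(x_0; \hat\theta))$$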
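The decomposition can be checked numerically with a small Monte Carlo sketch. All specifics are assumptions for illustration: the true function is taken to be $x^2$, the model is a straight line (so it is biased), the noise standard deviation is 0.2, and the evaluation point is $x_0 = 0.5$.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


rng = random.Random(1)
f_true = lambda x: x * x   # assumed true function f(x; theta*)
sigma = 0.2                # assumed noise standard deviation
xs = [i / 9 for i in range(10)]
x0 = 0.5

# Refit the model on many independently drawn training sets and record
# its prediction at x0 each time.
preds = []
for _ in range(2000):
    ts = [f_true(x) + rng.gauss(0, sigma) for x in xs]
    preds.append(fit_line(xs, ts)(x0))

mean_pred = sum(preds) / len(preds)
bias_sq = (mean_pred - f_true(x0)) ** 2
variance = statistics.pvariance(preds)
mse_to_truth = sum((p - f_true(x0)) ** 2 for p in preds) / len(preds)
expected_err = sigma ** 2 + bias_sq + variance

print(bias_sq, variance, mse_to_truth, expected_err)
```

The squared distance of the predictions from the true function splits exactly into the bias-squared and variance terms; adding the noise variance gives the estimate of Err(x_0).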
29 } Hypothesis space
Larger search space, smaller bias
Larger search space, larger variance
} Problem: Looking for the BEST MODEL (point estimate)
} By testing too many hypotheses, the best-looking hypothesis is likely not a good one
[Figure: our search space shown against regions of models with error < ε, error < 2ε, and error < 3ε]
30 Optimism of the Training Error Rate
} Typically: training error rate < true error (the same data is being used to fit the method and assess its error)
$$\mathrm{Err}_{\text{train}} = \frac{1}{N} \sum_{i=1}^{N} L(t(i), f(x_0(i); \hat\theta)) < \mathrm{Err}(X) = E[L(t, f(x; \hat\theta))]$$
L: loss function; t(i): target; $f(x_0(i); \hat\theta)$: estimate from training data
$\mathrm{Err}_{\text{train}}$ is overly optimistic
31 Estimating Test Error
} Can we estimate the discrepancy between Err(x_0) and Err(X)?
} D: old observations; D_new: new observations
} Suppose we observed new values of the label t(i) for the same input x(i)
Err_in (in-sample error): expectation over N new responses at each x_i
$$E_D[\mathrm{Err}_{\text{in}}] = \frac{1}{N} \sum_{i=1}^{N} E_D E_{D_{\text{new}}} L(t_{\text{new}}(i), f(x_0(i); \hat\theta))$$
$$= E_D[\mathrm{Err}_{\text{train}}] + \frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat t_i, t_i)$$
$\hat\theta$: trained with the old observations D
$\frac{2}{N} \sum_{i=1}^{N} \mathrm{Cov}(\hat t_i, t_i)$: adjustment for the optimism of the training error
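32 Measuring Optimism
If the model is $t = f(x; \theta) + \epsilon$ with $|\theta| = d$, $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma_\epsilon^2$,
then, if f is linear, $f(x; \theta) = x^T \theta$,
$$E_D[\mathrm{Err}_{\text{in}}] = E_D[\mathrm{Err}_{\text{train}}] + 2 \frac{d}{N} \sigma_\epsilon^2$$
Optimism grows linearly with model dimension
Optimism decreases as training sample size increases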
33 Ways to Estimate Prediction Error
} In-sample error estimates: AIC, BIC, MDL
} Extra-sample error estimates: Cross-Validation (leave-one-out, K-fold), Bootstrap
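34 Estimates of In-Sample Prediction Error
} General form of the in-sample estimate: $\widehat{\mathrm{Err}}_{\text{in}} = \mathrm{Err}_{\text{train}} + \hat\omega$, where $\hat\omega$ is an estimate of the optimism
} For additive error: $C_p = \mathrm{Err}_{\text{train}} + 2 \frac{d}{N} \hat\sigma_\epsilon^2$
Known as Mallows' $C_p$ statistic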
35 Akaike Information Criterion (AIC)
$$\mathrm{AIC} = -\frac{2}{N} \ln P[t = f(x_0; \hat\theta)] + 2 \frac{d}{N}$$
AIC does not assume the model is correct
Better for a small number of samples
May not converge to the correct model (as $N \to \infty$, model bias $\neq 0$)
Bayesian Information Criterion (BIC)
$$\mathrm{BIC} = -\frac{2}{N} \ln P[t = f(x_0; \hat\theta)] + \frac{d}{N} \log N$$
BIC assumes the model is correct
Converges to the correct parameterization ($\theta^\star$) of the model (as $N \to \infty$, model bias = 0)
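Both criteria can be sketched as functions of the maximized log-likelihood, the number of parameters d, and the sample size N, using the per-sample normalization of this slide. The log-likelihood values below are hypothetical, standing in for two fitted candidate models.

```python
import math


def aic(loglik, d, n):
    """AIC in the slide's per-sample form: -(2/N) ln L + 2 d / N."""
    return -2.0 * loglik / n + 2.0 * d / n


def bic(loglik, d, n):
    """BIC in the slide's per-sample form: -(2/N) ln L + (d/N) log N."""
    return -2.0 * loglik / n + d * math.log(n) / n


# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {"small": (-120.0, 3), "large": (-118.5, 6)}
n = 100

for name, (loglik, d) in candidates.items():
    print(name, round(aic(loglik, d, n), 4), round(bic(loglik, d, n), 4))

best_by_aic = min(candidates, key=lambda k: aic(*candidates[k], n))
best_by_bic = min(candidates, key=lambda k: bic(*candidates[k], n))
```

The two criteria differ only in the penalty: once log N exceeds 2 (N of 8 or more), BIC charges more per parameter than AIC and so tends to pick smaller models.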
36 MDL (Minimum Description Length)
} Find the hypothesis (model) that minimizes H(M) + H(D | M), where
H(M): length in bits of the description of model M
H(D | M): length in bits of the description of the data encoded by model M
} If the model is probabilistic:
$$\text{length} = -\ln P[t = f(x_0; \hat\theta) \mid M] - \ln P[\hat\theta \mid M]$$
First term: length of transmitting the discrepancy given the model, under optimal coding for the given model
Second term: description of the model under optimal coding
MDL principle: choose the model with the minimum description length
Equivalent to maximizing the posterior: $P[t = f(x_0; \hat\theta) \mid M] \, P[\hat\theta \mid M]$
37 BIC is one example of a Bayesian approach
} While we will not cover Bayesian methods in this course, they tend to ameliorate some of these point-estimate issues
} Bayesian methods are harder to compute
38 } A Bayesian approach does not look for the Best Hypothesis
Not interested in point estimates of model parameters
} Seeks to take variance into account in model evaluation
$$\mathrm{Err}(x_0) = \int E[(t - f(x_0; \theta))^2 \mid X = x_0, \theta] \, P[\theta \mid \text{Data}] \, d\theta$$
} If testing too many hypotheses without enough data, $P[\theta \mid \text{Data}]$ is less concentrated (higher entropy), thus leading to larger errors
39 Data-rich Assessment of Classification Error
40 Estimation of Extra-Sample Error
} Cross Validation
} Bootstrap
[Figure: Train / Validation / Test split; Validation is used for model selection, Test for model assessment]
41 Estimation of Extra-Sample Error via Cross-Validation
} Goal: Estimate the test error for a supervised learning method
} Algorithm:
1. Split the data in two parts (training and validation)
2. Train the method on the training part
3. Compute the error on the validation part
42 Wrong Way to Perform Cross-validation
} Using cross-validation to make too many model decisions or to select among too many features
} Using cross-validation to hunt for too many new models
Same problem we have when hunting for p-values (too many hypotheses)
The human regression machine: same as multiple hypothesis testing
[Figure: loop in which a person proposes a new model, checks its CV score, and proposes another model when the CV score is bad]
43 In Data-poor scenarios we should use K-fold Cross-validation
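44 K-fold
[Figure: data split into K folds; each fold in turn serves as validation while the remaining folds are used for training]
K-fold cross-validation error:
$$\mathrm{CV}_{\text{err}}(K) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i \in S_k} L(t(i), \hat t^{(-k)}(i))$$
$S_k$: validation fold; L: loss function; t(i): target; $\hat t^{(-k)}(i)$: estimate from training data without the k-th fold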
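The K-fold error above can be sketched as follows. The "model" here is a deliberately trivial assumption (predict the mean training target) so the arithmetic is easy to follow; `fit` and `loss` are placeholder arguments for whatever learner and loss function are being evaluated.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_error(xs, ts, fit, loss, k):
    """Average over folds of the mean validation loss of a model
    trained without that fold."""
    n = len(xs)
    fold_errors = []
    for val_idx in k_fold_indices(n, k):
        held_out = set(val_idx)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_t = [ts[i] for i in range(n) if i not in held_out]
        model = fit(train_x, train_t)
        fold_loss = sum(loss(ts[i], model(xs[i])) for i in val_idx)
        fold_errors.append(fold_loss / len(val_idx))
    return sum(fold_errors) / k, fold_errors


def fit_mean(train_x, train_t):
    """Toy learner: always predict the mean training target."""
    mean_t = sum(train_t) / len(train_t)
    return lambda x: mean_t


def squared_loss(t, t_hat):
    return (t - t_hat) ** 2


xs = [0, 1, 2, 3, 4, 5]
ts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_error, per_fold = cross_val_error(xs, ts, fit_mean, squared_loss, k=3)
print(cv_error, per_fold)
```

The folds here are contiguous for readability; in practice the data would usually be shuffled first so that each fold is representative.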
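45 Choosing K: from small K to large K (K = N is leave-one-out)
Computation increases as K grows
Bias decreases as K grows
Variance decreases as K shrinks (large K, e.g. leave-one-out, tends to give a higher-variance estimate)
46 Cross-Validation: Choosing K
[Figure: 1 - Err as a function of the size of the training set]
Popular choices for K: 5, 10, N (leave-one-out)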
47 Evaluating classification algorithms A and B
} Use k-fold cross-validation to get k estimates of error for algorithms A and B
[Figure: a dataset with columns Y, X1, X2 is split into six train/validation pairs (Train1/Valid.1, ..., Train6/Valid.6); each pair yields one error estimate per algorithm: Error A.1, ..., Error A.6 and Error B.1, ..., Error B.6]
} The set of errors estimated over the validation folds provides an empirical estimate of the sampling distribution
} The mean is an estimate of the expected error
} Is the difference in mean error between A and B significant?
48 Assessing significance
} Use a paired t-test to assess whether the two distributions of errors are statistically different from each other
[Figure: per-fold pairs (Error A.1, Error B.1), ..., (Error A.5, Error B.5) and the error distributions of A and B]
} Takes into account both the difference in means and the variability of the scores
49 Paired t-test
} Instead of comparing the difference in means between populations 1 and 2
$$\frac{1}{N} \sum_{i=1}^{N} X_i^{(1)} - \frac{1}{N} \sum_{i=1}^{N} X_i^{(2)}$$
} A paired t-test will compare the mean difference
$$\bar d = \frac{1}{N} \sum_{i=1}^{N} \left(X_i^{(1)} - X_i^{(2)}\right)$$
} The real effect is when estimating the standard error of the mean difference
$$\mathrm{SE}(\bar d) = \frac{s_d}{\sqrt{N}}, \quad \text{where } s_d = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(X_i^{(1)} - X_i^{(2)} - \bar d\right)^2}$$
} The t statistic has N-1 degrees of freedom and is given by
$$t = \frac{\bar d}{\mathrm{SE}(\bar d)}$$
50 Bootstrapping
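The paired t statistic can be sketched directly from per-fold errors. The error values below are hypothetical, and the sample standard deviation uses the N-1 denominator, matching the N-1 degrees of freedom.

```python
import math


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees of freedom) for paired per-fold errors."""
    n = len(errors_a)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    d_bar = sum(diffs) / n
    # Sample standard deviation of the per-fold differences.
    s_d = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))
    standard_error = s_d / math.sqrt(n)
    return d_bar / standard_error, n - 1


# Hypothetical per-fold error rates for algorithms A and B (5 folds).
errors_a = [0.20, 0.22, 0.19, 0.24, 0.21]
errors_b = [0.18, 0.19, 0.18, 0.20, 0.19]

t_stat, dof = paired_t_statistic(errors_a, errors_b)
print(t_stat, dof)
```

The statistic is then compared against a t distribution with N-1 degrees of freedom; for 4 degrees of freedom the two-sided 5% critical value is about 2.776. Because folds share training data, the per-fold errors are not fully independent, so the resulting p-value is best read as approximate.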
51 Bootstrapping
} Bootstrap is a powerful statistical tool to quantify the uncertainty associated with a given estimator or statistical learning method
} E.g.: It can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient
} Instead of sampling new datasets from the unknown population distribution, resample from the training data (empirical population distribution)
52 Why Bootstrapping?
} The term bootstrap derives from "to pull oneself up by one's bootstraps", likely based on the eighteenth-century novel The Surprising Adventures of Baron Munchausen by Rudolph Erich Raspe: The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.
} It is not the same as the term bootstrap used in computer science, meaning to boot a computer from a set of core instructions, though the meaning is similar.
53 Bootstrap: Main Concept (Model Assessment and Selection)
Training sample: Z = (z_1, z_2, ..., z_N)
Step 1: Draw N samples with replacement from the training data, giving bootstrap samples Z_1, Z_2, ..., Z_B
Step 2: Learn model S(.) over each bootstrap sample, giving bootstrap replications S(Z_1), S(Z_2), ..., S(Z_B)
Step 3: Compute the variance of the estimated parameters and the error over validation data
Source: The Elements of Statistical Learning, Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
54 Bootstrap Validation Data
} Two ways to use validation data
1. Validation comes from whatever data points are not used in training
Issue: Validation data size is not consistent across bootstrap replications, so validation error uncertainty is not consistent either
Advantage: Keeps all training data potentially available for training the model
2. Validation is split from training before bootstrapping
Issue: Must split validation from training, reducing the training data
Advantage: Validation data size is kept consistent between bootstrap replications
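The three bootstrap steps, together with the first validation option (using the points not drawn into a bootstrap sample), can be sketched as below. The statistic being bootstrapped is the sample mean and the data values are hypothetical; both are stand-ins for the model S(.) and training sample of the slides.

```python
import random
import statistics


def bootstrap_indices(n, rng):
    """Step 1: draw n indices with replacement. Indices never drawn form
    the out-of-bag set, usable as validation data for this replication."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def bootstrap_standard_error(sample, statistic, n_boot=1000, seed=0):
    """Steps 2 and 3: recompute the statistic on each bootstrap sample and
    report the spread of the replications."""
    rng = random.Random(seed)
    replications = []
    for _ in range(n_boot):
        in_bag, _unused = bootstrap_indices(len(sample), rng)
        replications.append(statistic([sample[i] for i in in_bag]))
    return statistics.stdev(replications), replications


def mean(values):
    return sum(values) / len(values)


# Hypothetical training sample.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
se, reps = bootstrap_standard_error(sample, mean)
in_bag, oob = bootstrap_indices(len(sample), random.Random(1))
print(se, len(oob))
```

The size of the out-of-bag set changes from one replication to the next, which is the inconsistency the slide lists as the issue with this option.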
More informationEvaluating Classifiers. Lecture 2 Instructor: Max Welling
Evaluating Classifiers Lecture 2 Instructor: Max Welling Evaluation of Results How do you report classification error? How certain are you about the error you claim? How do you compare two algorithms?
More informationCS6375: Machine Learning Gautam Kunapuli. Decision Trees
Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationMS-C1620 Statistical inference
MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents
More informationSupport vector machines Lecture 4
Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationRegularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline
Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function
More informationStephen Scott.
1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training
More informationDiagnostics. Gad Kimmel
Diagnostics Gad Kimmel Outline Introduction. Bootstrap method. Cross validation. ROC plot. Introduction Motivation Estimating properties of an estimator. Given data samples say the average. x 1, x 2,...,
More informationMachine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationPerformance Evaluation
Performance Evaluation Confusion Matrix: Detected Positive Negative Actual Positive A: True Positive B: False Negative Negative C: False Positive D: True Negative Recall or Sensitivity or True Positive
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationDecision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro
Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,
More informationComputational Learning Theory. CS534 - Machine Learning
Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows
More informationLecture 3. STAT161/261 Introduction to Pattern Recognition and Machine Learning Spring 2018 Prof. Allie Fletcher
Lecture 3 STAT161/261 Introduction to Pattern Recognition and Machine Learning Spring 2018 Prof. Allie Fletcher Previous lectures What is machine learning? Objectives of machine learning Supervised and
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same
More informationPerformance Evaluation and Hypothesis Testing
Performance Evaluation and Hypothesis Testing 1 Motivation Evaluating the performance of learning systems is important because: Learning systems are usually designed to predict the class of future unlabeled
More informationNeuroComp Machine Learning and Validation
NeuroComp Machine Learning and Validation Michèle Sebag http://tao.lri.fr/tiki-index.php Nov. 16th 2011 Validation, the questions 1. What is the result? 2. My results look good. Are they? 3. Does my system
More informationGeneralization and Overfitting
Generalization and Overfitting Model Selection Maria-Florina (Nina) Balcan February 24th, 2016 PAC/SLT models for Supervised Learning Data Source Distribution D on X Learning Algorithm Expert / Oracle
More informationEnsemble Methods. NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan
Ensemble Methods NLP ML Web! Fall 2013! Andrew Rosenberg! TA/Grader: David Guy Brizan How do you make a decision? What do you want for lunch today?! What did you have last night?! What are your favorite
More informationMachine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables?
Linear Regression Machine Learning CSE546 Sham Kakade University of Washington Oct 4, 2016 1 What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do
More informationOnline Advertising is Big Business
Online Advertising Online Advertising is Big Business Multiple billion dollar industry $43B in 2013 in USA, 17% increase over 2012 [PWC, Internet Advertising Bureau, April 2013] Higher revenue in USA
More informationMODULE -4 BAYEIAN LEARNING
MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities
More information10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers
Computational Methods for Data Analysis Massimo Poesio SUPPORT VECTOR MACHINES Support Vector Machines Linear classifiers 1 Linear Classifiers denotes +1 denotes -1 w x + b>0 f(x,w,b) = sign(w x + b) How
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationLearning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin
Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationAnnouncements Kevin Jamieson
Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium
More informationESS2222. Lecture 3 Bias-Variance Trade-off
ESS2222 Lecture 3 Bias-Variance Trade-off Hosein Shahnas University of Toronto, Department of Earth Sciences, 1 Outline Bias-Variance Trade-off Overfitting & Regularization Ridge & Lasso Regression Nonlinear
More information