Cross-validation for detecting and preventing overfitting

Size: px
Start display at page:

Download "Cross-validation for detecting and preventing overfitting"

Transcription

1 Cross-validation for detecting and preventing overfitting A Regression Problem = f() + noise Can we learn f from this data? Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures. Feel free to use these slides verbatim, or to modif them to fit our own needs. PowerPoint originals are available. If ou make use of a significant portion of these slides in our own lecture, please include this message, or the following link to the source repositor of Andrew s tutorials http// Comments and corrections gratefull received. Andrew W. Moore Associate Professor School of Computer Science Carnegie Mellon Universit awm@cs.cmu.edu Let s consider three methods Copright 200, Andrew W. Moore Oct 5th, 200 Copright 200, Andrew W. Moore Cross- Slide 2 Univariate Linear regression with a constant term X Y X= = =().. =.. Originall discussed in the previous Andrew Lecture Neural Nets Copright 200, Andrew W. Moore Cross- Slide Copright 200, Andrew W. Moore Cross- Slide 4 Univariate Linear regression with a constant term X Y X= = =().. Z= = =.. z =(,).. z k =(, k ) =.. Univariate Linear regression with a constant term X Y X= = =().. =.. Z= = β=(z T Z) - (Z T ) z =(,).. z k =(, k ) =.. est = β 0 + β Copright 200, Andrew W. Moore Cross- Slide 5 Copright 200, Andrew W. Moore Cross- Slide 6

2 Quadratic Regression Quadratic Regression X Y Z= z=(,, 2,) 9 X= = =(,2).. =.. = Much more about this in the future Andrew Lecture Favorite Regression Algorithms β=(z T Z) - (Z T ) est = β 0 + β + β 2 2 Copright 200, Andrew W. Moore Cross- Slide Copright 200, Andrew W. Moore Cross- Slide 8 Join-the-dots Which is best? Also known as piecewise linear nonparametric regression if that makes ou feel better Wh not choose the method with the best fit to the data? Copright 200, Andrew W. Moore Cross- Slide 9 Copright 200, Andrew W. Moore Cross- Slide 0 What do we reall want? The test method. Randoml choose 0% of the data to be in a test training Wh not choose the method with the best fit to the data? How well are ou going to predict future data drawn from the same distribution? Copright 200, Andrew W. Moore Cross- Slide Copright 200, Andrew W. Moore Cross- Slide 2 2

3 The test method The test method. Randoml choose 0% of the data to be in a test. Randoml choose 0% of the data to be in a test training training. Perform our regression on the training. Perform our regression on the training (Linear regression eample) (Linear regression eample) Mean Squared Error = Estimate our future performance with the test Copright 200, Andrew W. Moore Cross- Slide Copright 200, Andrew W. Moore Cross- Slide 4 The test method The test method. Randoml choose 0% of the data to be in a test. Randoml choose 0% of the data to be in a test training training. Perform our regression on the training. Perform our regression on the training (Quadratic regression eample) 4. Estimate our future performance with the test Mean Squared Error = 0.9 (Join the dots eample) Mean Squared Error = Estimate our future performance with the test Copright 200, Andrew W. Moore Cross- Slide 5 Copright 200, Andrew W. Moore Cross- Slide 6 The test method Good news Ver ver simple Can then simpl choose the method with the best test- score Bad news What s the downside? The test method Good news Ver ver simple Can then simpl choose the method with the best test- score Bad news Wastes data we get an estimate of the best method to appl to 0% less data If we don t have much data, our test- might just be luck or unluck We sa the test- estimator of performance has high variance Copright 200, Andrew W. Moore Cross- Slide Copright 200, Andrew W. Moore Cross- Slide 8

4 LOOCV (Leave-one-out Cross ) For k= to R. Let ( k, k ) be the k th record LOOCV (Leave-one-out Cross ) For k= to R. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the data Copright 200, Andrew W. Moore Cross- Slide 9 Copright 200, Andrew W. Moore Cross- Slide 20 LOOCV (Leave-one-out Cross ) For k= to R. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the data. Train on the remaining R- LOOCV (Leave-one-out Cross ) For k= to R. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the data. Train on the remaining R- 4. Note our error ( k, k ) Copright 200, Andrew W. Moore Cross- Slide 2 Copright 200, Andrew W. Moore Cross- Slide 22 LOOCV (Leave-one-out Cross ) For k= to R. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the data. Train on the remaining R- 4. Note our error ( k, k ) When ou ve done all points, report the mean error. LOOCV (Leave-one-out Cross ) For k= to R. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the data. Train on the remaining R- 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV = 2.2 Copright 200, Andrew W. Moore Cross- Slide 2 Copright 200, Andrew W. Moore Cross- Slide 24 4

5 LOOCV for Quadratic Regression For k= to R. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the data. Train on the remaining R- 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV =0.962 LOOCV for Join The Dots For k= to R. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the data. Train on the remaining R- 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV =. Copright 200, Andrew W. Moore Cross- Slide 25 Copright 200, Andrew W. Moore Cross- Slide 26 Which kind of Cross? Test- Leaveone-out Downside Variance unreliable estimate of future performance Epensive. Has some weird behavior Upside Cheap Doesn t waste data Randoml break the data into k partitions (in our eample we ll have k=..can we get the best of both worlds? Copright 200, Andrew W. Moore Cross- Slide 2 Copright 200, Andrew W. Moore Cross- Slide 28 Randoml break the data into k partitions (in our eample we ll have k= For the red partition Train on all the the test- sum of errors on the red Randoml break the data into k partitions (in our eample we ll have k= For the red partition Train on all the the test- sum of errors on the red For the green partition Train on all the Find the test- sum of errors on the green Copright 200, Andrew W. Moore Cross- Slide 29 Copright 200, Andrew W. Moore Cross- Slide 0 5

6 Randoml break the data into k partitions (in our eample we ll have k= For the red partition Train on all the the test- sum of errors on the red Randoml break the data into k partitions (in our eample we ll have k= For the red partition Train on all the the test- sum of errors on the red For the green partition Train on all the Find the test- sum of errors on the green For the green partition Train on all the Find the test- sum of errors on the green For the blue partition Train on all the points not in the blue partition. Find the test- sum of errors on the blue MSE FOLD =2.05 For the blue partition Train on all the points not in the blue partition. Find the test- sum of errors on the blue Then report the mean error Copright 200, Andrew W. Moore Cross- Slide Copright 200, Andrew W. Moore Cross- Slide 2 Randoml break the data into k partitions (in our eample we ll have k= For the red partition Train on all the the test- sum of errors on the red Randoml break the data into k partitions (in our eample we ll have k= For the red partition Train on all the the test- sum of errors on the red For the green partition Train on all the Find the test- sum of errors on the green For the green partition Train on all the Find the test- sum of errors on the green Quadratic Regression MSE FOLD =. For the blue partition Train on all the points not in the blue partition. Find the test- sum of errors on the blue Then report the mean error Joint-the-dots MSE FOLD =2.9 For the blue partition Train on all the points not in the blue partition. Find the test- sum of errors on the blue Then report the mean error Copright 200, Andrew W. Moore Cross- Slide Copright 200, Andrew W. Moore Cross- Slide 4 Which kind of Cross? Test- Leaveone-out 0-fold -fold R-fold Downside Variance unreliable estimate of future performance Epensive. Has some weird behavior Wastes 0% of the data. 0 times more epensive than test Wastier than 0-fold. Epensivier than test Identical to Leave-one-out Upside Cheap Doesn t waste data Which kind of Cross? Test- Onl wastes 0%. Onl 0 times more epensive instead of R times. Slightl better than test Leaveone-out 0-fold -fold R-fold Downside Variance unreliable Cheap estimate of future performance But note One of Epensive. Andrew s Doesn t jos waste in data life is Has some weird behavioralgorithmic tricks for Wastes 0% of the data. making Onl wastes these cheap 0%. Onl 0 times more epensive 0 times more epensive than test instead of R times. Wastier than 0-fold. Epensivier than test Identical to Leave-one-out Upside Slightl better than test Copright 200, Andrew W. Moore Cross- Slide 5 Copright 200, Andrew W. Moore Cross- Slide 6 6

7 CV-based Model Selection We re tring to decide which algorithm to use. We train each machine and make a table Alternatives to CV-based model selection Model selection methods. Cross-validation 2. AIC (Akaike Information Criterion). BIC (Baesian Information Criterion) 4. VC-dimension (Vapnik-Chervonenkis Dimension) i f i TRAINERR 0-FOLD-CV-ERR Choice f 2 4 f 2 f f 4 Onl directl applicable to choosing classifiers 5 6 f 5 f 6 Described in a future Andrew Lecture Copright 200, Andrew W. Moore Cross- Slide Copright 200, Andrew W. Moore Cross- Slide 8 Which model selection method is best?. (CV) Cross-validation 2. AIC (Akaike Information Criterion). BIC (Baesian Information Criterion) 4. (SRMVC) Structural Risk Minimize with VC-dimension AIC, BIC and SRMVC advantage ou onl need the training error. CV error might have more variance SRMVC is wildl conservative Asmptoticall AIC and Leave-one-out CV should be the same Asmptoticall BIC and carefull chosen k-fold should be same You want BIC ou want the best structure instead of the best predictor (e.g. for clustering or Baes Net structure finding) Man alternatives---including proper Baesian approaches. It s an emotional issue. Copright 200, Andrew W. Moore Cross- Slide 9 Other Cross-validation issues Can do leave all pairs out or leave-allntuples-out if feeling resourceful. Some folks do k-folds in which each fold is an independentl-chosen sub of the data Do ou know what AIC and BIC are? If so LOOCV behaves like AIC asmptoticall. k-fold behaves like BIC if ou choose k carefull If not Nardel nardel noo noo Copright 200, Andrew W. Moore Cross- Slide 40 Cross- for regression Choosing the number of hidden units in a neural net Feature selection (see later) Choosing a polnomial degree Choosing which regressor to use Supervising Gradient Descent This is a weird but common use of Test- validation Suppose ou have a neural net with too man hidden units. It will overfit. As gradient descent progresses, maintain a graph of MSE-test-error vs. Iteration Mean Squared Error Use the weights ou found on this iteration Training Set Test Set Iteration of Gradient Descent Copright 200, Andrew W. Moore Cross- Slide 4 Copright 200, Andrew W. Moore Cross- Slide 42

8 Supervising Gradient Descent This is a weird but common use of Test- validation Suppose ou have a neural net with too Relies on an intuition that a not-full- man hidden units. It will overfit. As gradient descent having fewer progresses, parameters. maintain a graph of MSE-test-error Works prett well in vs. practice, Iteration apparentl minimized of weights is somewhat like Cross-validation for classification Instead of computing the sum squared errors on a test, ou should compute Mean Squared Error Use the weights ou found on this iteration Training Set Test Set Iteration of Gradient Descent Copright 200, Andrew W. Moore Cross- Slide 4 Copright 200, Andrew W. Moore Cross- Slide 44 Cross-validation for classification Instead of computing the sum squared errors on a test, ou should compute The total number of misclassifications on a test. Cross-validation for classification Instead of computing the sum squared errors on a test, ou should compute The total number of misclassifications on a test. But there s a more sensitive alternative Compute log P(all test outputs all test inputs, our model) Copright 200, Andrew W. Moore Cross- Slide 45 Copright 200, Andrew W. Moore Cross- Slide 46 Cross- for classification Choosing the pruning parameter for decision trees Feature selection (see later) What kind of Gaussian to use in a Gaussianbased Baes Classifier Choosing which classifier to use Copright 200, Andrew W. Moore Cross- Slide 4 Cross- for densit estimation Compute the sum of log-likelihoods of test points Eample uses Choosing what kind of Gaussian assumption to use Choose the densit estimator NOT Feature selection (test densit will almost alwas look better with fewer features) Copright 200, Andrew W. Moore Cross- Slide 48 8

9 Feature Selection Suppose ou have a learning algorithm LA and a of input attributes { X, X 2.. X m } You epect that LA will onl find some sub of the attributes useful. Question How can we use cross-validation to find a useful sub? Four ideas Forward selection Backward elimination Hill Climbing Another fun area in which Andrew has spent a lot of his wild outh Stochastic search (Simulated Annealing or GAs) Copright 200, Andrew W. Moore Cross- Slide 49 Ver serious warning Intensive use of cross validation can overfit. How? What can be done about it? Copright 200, Andrew W. Moore Cross- Slide 50 Ver serious warning Intensive use of cross validation can overfit. How? Imagine a data with 50 records and 000 attributes. You tr 000 linear regression models, each one using one of the attributes. Ver serious warning Intensive use of cross validation can overfit. How? Imagine a data with 50 records and 000 attributes. You tr 000 linear regression models, each one using one of the attributes. The best of those 000 looks good! What can be done about it? What can be done about it? Copright 200, Andrew W. Moore Cross- Slide 5 Copright 200, Andrew W. Moore Cross- Slide 52 Ver serious warning Intensive use of cross validation can overfit. How? Imagine a data with 50 records and 000 attributes. You tr 000 linear regression models, each one using one of the attributes. The best of those 000 looks good! But ou realize it would have looked good even if the output had been purel random! What can be done about it? Hold out an additional test before doing an model selection. Check the best model performs well even on the additional test. Copright 200, Andrew W. Moore Cross- Slide 5 What ou should know Wh ou can t use training--error to estimate the qualit of our learning algorithm on our data. Wh ou can t use training error to choose the learning algorithm Test- cross-validation Leave-one-out cross-validation k-fold cross-validation Feature selection methods CV for classification, regression & densities Copright 200, Andrew W. Moore Cross- Slide 54 9

VC-dimension for characterizing classifiers

VC-dimension for characterizing classifiers VC-dimension for characterizing classifiers Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to

More information

VC-dimension for characterizing classifiers

VC-dimension for characterizing classifiers VC-dimension for characterizing classifiers Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to

More information

Probability Densities in Data Mining

Probability Densities in Data Mining Probabilit Densities in Data Mining Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures. Feel free to use these

More information

Eight more Classic Machine Learning algorithms

Eight more Classic Machine Learning algorithms Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures. Feel free to use these slides verbatim, or to modif them

More information

VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms

VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms 03/Feb/2010 VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms Presented by Andriy Temko Department of Electrical and Electronic Engineering Page 2 of

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Machine Learning Foundations

Machine Learning Foundations Machine Learning Foundations ( 機器學習基石 ) Lecture 13: Hazard of Overfitting Hsuan-Tien Lin ( 林軒田 ) htlin@csie.ntu.edu.tw Department of Computer Science & Information Engineering National Taiwan Universit

More information

Feedforward Neural Networks

Feedforward Neural Networks Chapter 4 Feedforward Neural Networks 4. Motivation Let s start with our logistic regression model from before: P(k d) = softma k =k ( λ(k ) + w d λ(k, w) ). (4.) Recall that this model gives us a lot

More information

Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

More information

Learning From Data Lecture 7 Approximation Versus Generalization

Learning From Data Lecture 7 Approximation Versus Generalization Learning From Data Lecture 7 Approimation Versus Generalization The VC Dimension Approimation Versus Generalization Bias and Variance The Learning Curve M. Magdon-Ismail CSCI 4100/6100 recap: The Vapnik-Chervonenkis

More information

Algorithms for Playing and Solving games*

Algorithms for Playing and Solving games* Algorithms for Playing and Solving games* Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599 * Two Player Zero-sum Discrete

More information

INF Introduction to classifiction Anne Solberg

INF Introduction to classifiction Anne Solberg INF 4300 8.09.17 Introduction to classifiction Anne Solberg anne@ifi.uio.no Introduction to classification Based on handout from Pattern Recognition b Theodoridis, available after the lecture INF 4300

More information

Andrew W. Moore Professor School of Computer Science Carnegie Mellon University

Andrew W. Moore Professor School of Computer Science Carnegie Mellon University Spport Vector Machines Note to other teachers and sers of these slides. Andrew wold be delighted if yo fond this sorce material sefl in giving yor own lectres. Feel free to se these slides verbatim, or

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Linear Classifiers: predictions Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due Friday of next

More information

Lecture Slides for INTRODUCTION TO. Machine Learning. By: Postedited by: R.

Lecture Slides for INTRODUCTION TO. Machine Learning. By:  Postedited by: R. Lecture Slides for INTRODUCTION TO Machine Learning By: alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml Postedited by: R. Basili Learning a Class from Examples Class C of a family car Prediction:

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Midterm Exam Solutions, Spring 2007

Midterm Exam Solutions, Spring 2007 1-71 Midterm Exam Solutions, Spring 7 1. Personal info: Name: Andrew account: E-mail address:. There should be 16 numbered pages in this exam (including this cover sheet). 3. You can use any material you

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate

More information

INF Introduction to classifiction Anne Solberg Based on Chapter 2 ( ) in Duda and Hart: Pattern Classification

INF Introduction to classifiction Anne Solberg Based on Chapter 2 ( ) in Duda and Hart: Pattern Classification INF 4300 151014 Introduction to classifiction Anne Solberg anne@ifiuiono Based on Chapter 1-6 in Duda and Hart: Pattern Classification 151014 INF 4300 1 Introduction to classification One of the most challenging

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Information Gain. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University.

Information Gain. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University. te to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Introduction to Machine Learning. Regression. Computer Science, Tel-Aviv University,

Introduction to Machine Learning. Regression. Computer Science, Tel-Aviv University, 1 Introduction to Machine Learning Regression Computer Science, Tel-Aviv University, 2013-14 Classification Input: X Real valued, vectors over real. Discrete values (0,1,2,...) Other structures (e.g.,

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

PDEEC Machine Learning 2016/17

PDEEC Machine Learning 2016/17 PDEEC Machine Learning 2016/17 Lecture - Model assessment, selection and Ensemble Jaime S. Cardoso jaime.cardoso@inesctec.pt INESC TEC and Faculdade Engenharia, Universidade do Porto Nov. 07, 2017 1 /

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all

More information

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless

More information

CPSC 340: Machine Learning and Data Mining. Regularization Fall 2017

CPSC 340: Machine Learning and Data Mining. Regularization Fall 2017 CPSC 340: Machine Learning and Data Mining Regularization Fall 2017 Assignment 2 Admin 2 late days to hand in tonight, answers posted tomorrow morning. Extra office hours Thursday at 4pm (ICICS 246). Midterm

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Modeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop

Modeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop Modeling Data with Linear Combinations of Basis Functions Read Chapter 3 in the text by Bishop A Type of Supervised Learning Problem We want to model data (x 1, t 1 ),..., (x N, t N ), where x i is a vector

More information

Bayesian Networks Structure Learning (cont.)

Bayesian Networks Structure Learning (cont.) Koller & Friedman Chapters (handed out): Chapter 11 (short) Chapter 1: 1.1, 1., 1.3 (covered in the beginning of semester) 1.4 (Learning parameters for BNs) Chapter 13: 13.1, 13.3.1, 13.4.1, 13.4.3 (basic

More information

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS LAST TIME Intro to cudnn Deep neural nets using cublas and cudnn TODAY Building a better model for image classification Overfitting

More information

Machine Learning and Data Mining. Linear classification. Kalev Kask

Machine Learning and Data Mining. Linear classification. Kalev Kask Machine Learning and Data Mining Linear classification Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ = f(x ; q) Parameters q Program ( Learner ) Learning algorithm Change q

More information

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1

Decision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1 Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 5 th, 2007 2005-2007 Carlos Guestrin 1 Linear separability A dataset is linearly separable iff 9 a separating

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

12 Statistical Justifications; the Bias-Variance Decomposition

12 Statistical Justifications; the Bias-Variance Decomposition Statistical Justifications; the Bias-Variance Decomposition 65 12 Statistical Justifications; the Bias-Variance Decomposition STATISTICAL JUSTIFICATIONS FOR REGRESSION [So far, I ve talked about regression

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same

More information

Minimum Description Length (MDL)

Minimum Description Length (MDL) Minimum Description Length (MDL) Lyle Ungar AIC Akaike Information Criterion BIC Bayesian Information Criterion RIC Risk Inflation Criterion MDL u Sender and receiver both know X u Want to send y using

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

Machine Learning and Data Mining. Linear regression. Kalev Kask

Machine Learning and Data Mining. Linear regression. Kalev Kask Machine Learning and Data Mining Linear regression Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ Parameters q Learning algorithm Program ( Learner ) Change q Improve performance

More information

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods) Machine Learning InstanceBased Learning (aka nonparametric methods) Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Non parametric CSE 446 Machine Learning Daniel Weld March

More information

Probabilistic and Bayesian Analytics

Probabilistic and Bayesian Analytics Probabilistic and Bayesian Analytics Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Chapter 18 Quadratic Function 2

Chapter 18 Quadratic Function 2 Chapter 18 Quadratic Function Completed Square Form 1 Consider this special set of numbers - the square numbers or the set of perfect squares. 4 = = 9 = 3 = 16 = 4 = 5 = 5 = Numbers like 5, 11, 15 are

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Midterm, Fall 2003

Midterm, Fall 2003 5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are

More information

Machine Learning Basics: Maximum Likelihood Estimation

Machine Learning Basics: Maximum Likelihood Estimation Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning

More information

10-701/ Machine Learning, Fall

10-701/ Machine Learning, Fall 0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training

More information

CSE 546 Midterm Exam, Fall 2014

CSE 546 Midterm Exam, Fall 2014 CSE 546 Midterm Eam, Fall 2014 1. Personal info: Name: UW NetID: Student ID: 2. There should be 14 numbered pages in this eam (including this cover sheet). 3. You can use an material ou brought: an book,

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Linear Regression In God we trust, all others bring data. William Edwards Deming

Linear Regression In God we trust, all others bring data. William Edwards Deming Linear Regression ddebarr@uw.edu 2017-01-19 In God we trust, all others bring data. William Edwards Deming Course Outline 1. Introduction to Statistical Learning 2. Linear Regression 3. Classification

More information

Lecture 27: More on Rotational Kinematics

Lecture 27: More on Rotational Kinematics Lecture 27: More on Rotational Kinematics Let s work out the kinematics of rotational motion if α is constant: dω α = 1 2 α dω αt = ω ω ω = αt + ω ( t ) dφ α + ω = dφ t 2 α + ωo = φ φo = 1 2 = t o 2 φ

More information

Introduction to Machine Learning and Cross-Validation

Introduction to Machine Learning and Cross-Validation Introduction to Machine Learning and Cross-Validation Jonathan Hersh 1 February 27, 2019 J.Hersh (Chapman ) Intro & CV February 27, 2019 1 / 29 Plan 1 Introduction 2 Preliminary Terminology 3 Bias-Variance

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

PAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018

PAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University PAC Learning Matt Gormley Lecture 14 March 5, 2018 1 ML Big Picture Learning Paradigms:

More information

Artificial Neural Networks 2

Artificial Neural Networks 2 CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

More information

Support Vector Machines

Support Vector Machines Two SVM tutorials linked in class website (please, read both): High-level presentation with applications (Hearst 1998) Detailed tutorial (Burges 1998) Support Vector Machines Machine Learning 10701/15781

More information

INF Anne Solberg One of the most challenging topics in image analysis is recognizing a specific object in an image.

INF Anne Solberg One of the most challenging topics in image analysis is recognizing a specific object in an image. INF 4300 700 Introduction to classifiction Anne Solberg anne@ifiuiono Based on Chapter -6 6inDuda and Hart: attern Classification 303 INF 4300 Introduction to classification One of the most challenging

More information

Announcements - Homework

Announcements - Homework Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35

More information

INTRODUCTION TO DIOPHANTINE EQUATIONS

INTRODUCTION TO DIOPHANTINE EQUATIONS INTRODUCTION TO DIOPHANTINE EQUATIONS In the earl 20th centur, Thue made an important breakthrough in the stud of diophantine equations. His proof is one of the first eamples of the polnomial method. His

More information

Neural Networks in Structured Prediction. November 17, 2015

Neural Networks in Structured Prediction. November 17, 2015 Neural Networks in Structured Prediction November 17, 2015 HWs and Paper Last homework is going to be posted soon Neural net NER tagging model This is a new structured model Paper - Thursday after Thanksgiving

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

Machine Learning (CSE 446): Neural Networks

Machine Learning (CSE 446): Neural Networks Machine Learning (CSE 446): Neural Networks Noah Smith c 2017 University of Washington nasmith@cs.washington.edu November 6, 2017 1 / 22 Admin No Wednesday office hours for Noah; no lecture Friday. 2 /

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features Y target

More information

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Backpropagation Matt Gormley Lecture 12 Feb 23, 2018 1 Neural Networks Outline

More information

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017 CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

The Perceptron algorithm

The Perceptron algorithm The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Learning = improving with experience Improve over task T (e.g, Classification, control tasks) with respect

More information

Linear Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Linear Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com Linear Regression These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these

More information

Resampling techniques for statistical modeling

Resampling techniques for statistical modeling Resampling techniques for statistical modeling Gianluca Bontempi Département d Informatique Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di Resampling techniques p.1/33 Beyond the empirical error

More information

FIND A FUNCTION TO CLASSIFY HIGH VALUE CUSTOMERS

FIND A FUNCTION TO CLASSIFY HIGH VALUE CUSTOMERS LINEAR CLASSIFIER 1 FIND A FUNCTION TO CLASSIFY HIGH VALUE CUSTOMERS x f y High Value Customers Salary Task: Find Nb Orders 150 70 300 100 200 80 120 100 Low Value Customers Salary Nb Orders 40 80 220

More information

COMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma

COMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Support Vector Machines and Kernel Methods Geoff Gordon ggordon@cs.cmu.edu July 10, 2003 Overview Why do people care about SVMs? Classification problems SVMs often produce good results over a wide range

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

1 History of statistical/machine learning. 2 Supervised learning. 3 Two approaches to supervised learning. 4 The general learning procedure

1 History of statistical/machine learning. 2 Supervised learning. 3 Two approaches to supervised learning. 4 The general learning procedure Overview Breiman L (2001). Statistical modeling: The two cultures Statistical learning reading group Aleander L 1 Histor of statistical/machine learning 2 Supervised learning 3 Two approaches to supervised

More information

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017 Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper

More information

COMS 4771 Regression. Nakul Verma

COMS 4771 Regression. Nakul Verma COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the

More information

Figure 1: Visualising the input features. Figure 2: Visualising the input-output data pairs.

Figure 1: Visualising the input features. Figure 2: Visualising the input-output data pairs. Regression. Data visualisation (a) To plot the data in x trn and x tst, we could use MATLAB function scatter. For example, the code for plotting x trn is scatter(x_trn(:,), x_trn(:,));........... x (a)

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information