CLUe Training: An Introduction to Machine Learning in R, with an example from handwritten digit recognition


CLUe Training: An Introduction to Machine Learning in R, with an example from handwritten digit recognition
Ad Feelders, Universiteit Utrecht, Department of Information and Computing Sciences, Algorithmic Data Analysis Group

Terminology
1 Machine Learning
2 Statistical Learning
3 Data Mining
4 Pattern Recognition
5 ...
It's all about learning from data.

What is machine learning?
"The field of pattern recognition/machine learning is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories."
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer 2006, page 1.

Example: Handwritten Digit Recognition
28 × 28 pixel images.

Example: Handwritten Digit Recognition
[A digit displayed on its 28 × 28 pixel grid.]

Machine Learning Approach
Use training data D = {(x_1, y_1), ..., (x_n, y_n)} of n labeled examples, and fit a model to the training data. This model can subsequently be used to predict the class (digit) for new input vectors x. The ability to correctly categorize new examples is called generalization.

Types of Learning Problems
Supervised Learning
  Numeric target: regression.
  Discrete unordered target: classification.
  Discrete ordered target: ordinal classification/regression; ranking.
Unsupervised Learning
  Clustering.
  Density estimation.
  Frequent pattern mining.

Linear Regression Model
The central assumption of linear regression is
  E[y | x] = w_0 + w_1 x,
where E stands for expected value ("average"). Alternatively, we can write
  y = w_0 + w_1 x + ε,  with E[ε | x] = 0.
The observed y values are composed of a structural part, which is a (linear) function of x, and random noise.

Minimizing empirical error
Given training data D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, find the values of w_0 and w_1 such that the sum of squared errors is minimized:
  SSE(w_0, w_1) = sum_{i=1}^{n} ( y_i − (w_0 + w_1 x_i) )^2,
where w_0 + w_1 x_i is the prediction for y_i.

Example: the data generating process
[Scatter plot of the simulated data, with x on the horizontal axis and y on the vertical axis.]
  y = sin(2πx) + ε,  ε ~ N(µ = 0, σ = 0.3)
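This process is easy to simulate in R. The sketch below is not the original script: the sample size (10 points, so that a ninth-order polynomial can later fit the data exactly), the uniform design for x, the seed, and the name sim.df are assumptions.

# Simulate data from y = sin(2*pi*x) + eps with eps ~ N(0, 0.3).
# Sample size, seed and the uniform design for x are assumptions.
> set.seed(1)
> n <- 10
> x <- runif(n)
> y <- sin(2 * pi * x) + rnorm(n, mean = 0, sd = 0.3)
> sim.df <- data.frame(x = x, y = y)
> plot(y ~ x, data = sim.df)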

Fitting a linear model: large empirical error
[The simulated data with the fitted straight line.]
  ŷ = 0.8 − 0.68x

Fitting a third-order polynomial: just about right
[The simulated data with the fitted cubic curve.]
  ŷ = 0.9 + 8.82x − 28.23x^2 + 9.4x^3

Fitting a ninth-order polynomial: zero error, but overfitting
[The simulated data with the fitted ninth-order polynomial, which passes through every data point.]
  ŷ = 0.8 − 22.22x + 592.95x^2 − 4849.7x^3 + ... − 26547.6x^9
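The three fits can be reproduced along the following lines. This is a sketch using the simulated data frame sim.df from above; orthogonal polynomials are used for numerical stability, so the printed coefficients will not match the raw coefficients shown on the slides.

# Fit polynomials of degree 1, 3 and 9 and compare their empirical errors.
> fit1 <- lm(y ~ poly(x, 1), data = sim.df)
> fit3 <- lm(y ~ poly(x, 3), data = sim.df)
> fit9 <- lm(y ~ poly(x, 9), data = sim.df)
# Sum of squared errors on the training data; the degree-9 fit interpolates
# the 10 points, so its error is (numerically) zero.
> sapply(list(fit1, fit3, fit9), function(m) sum(residuals(m)^2))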

Lesson Learned
Minimizing empirical error may be a good way to fit the parameters of a single model, but it is not a good way to compare models of different complexities, as this would lead to overfitting and hence bad generalization. There are different ways to address this problem, for example: evaluate the predictive performance on data that was not used for training.

Cross-Validation: Training / Prediction
[A sequence of slides illustrating cross-validation fold by fold: in each round one part of the data is held out, the model is trained on the remaining parts, and predictions are made on the held-out part; this is repeated until every part has served as the held-out part once.]

K-fold cross-validation
1 Divide the data into K parts.
2 For each of the K parts:
    Use the remaining K − 1 parts to train the model.
    Predict on the part that was not used for training.
3 Compute the accuracy of the predictions.
All predictions are made on data that was not used for training!
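A minimal R sketch of this procedure, applied to the simulated regression data from before; the fold assignment, K = 5, and the use of mean squared error are assumptions, not taken from the original script.

# Sketch of K-fold cross-validation for the cubic regression model.
> K <- 5
> folds <- sample(rep(1:K, length.out = nrow(sim.df)))  # random fold labels
> cv.mse <- numeric(K)
> for (k in 1:K) {
+   train <- sim.df[folds != k, ]
+   test  <- sim.df[folds == k, ]
+   fit   <- lm(y ~ poly(x, 3), data = train)
+   cv.mse[k] <- mean((test$y - predict(fit, newdata = test))^2)
+ }
> mean(cv.mse)   # cross-validated estimate of the prediction error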

K-fold cross-validation: selecting a complexity parameter
C is a complexity parameter, for example the degree of the polynomial in the regression example.
1 Divide the data into K parts.
2 For each value c of C:
    For each of the K parts:
      Use the remaining K − 1 parts to train the model with C = c.
      Predict on the part that was not used for training.
    Compute the accuracy of the predictions with C = c.
3 Select c* as the value of C with the highest accuracy.
4 Train on the complete data with C = c*.
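Continuing the sketch above, the degree of the polynomial can be selected in the same way; the candidate degrees are an assumption (kept small because the simulated sample has only 10 points).

# Sketch: choose the polynomial degree c by K-fold cross-validation.
> degrees <- 1:5
> cv.error <- sapply(degrees, function(d) {
+   err <- numeric(K)
+   for (k in 1:K) {
+     train <- sim.df[folds != k, ]
+     test  <- sim.df[folds == k, ]
+     fit   <- lm(y ~ poly(x, d), data = train)
+     err[k] <- mean((test$y - predict(fit, newdata = test))^2)
+   }
+   mean(err)
+ })
> best.degree <- degrees[which.min(cv.error)]                # the selected c*
> final.fit <- lm(y ~ poly(x, best.degree), data = sim.df)   # train on all data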

Logistic regression for binary classification
Code the 2 classes as 0 and 1 (coding is arbitrary, but this coding is often convenient).
y ∈ {0, 1}: why not linear regression?
Logistic regression assumption:
  E[y | x] = P(y = 1 | x) = e^(w_0 + w_1 x) / (1 + e^(w_0 + w_1 x))
and therefore
  P(y = 0 | x) = 1 / (1 + e^(w_0 + w_1 x)),
since P(y = 0 | x) and P(y = 1 | x) should add up to one.
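For a feel of what this function looks like, the logistic transformation e^z / (1 + e^z) is available in base R as plogis(); the coefficient and ink values below are made up purely for illustration.

# The logistic function: plogis(z) = exp(z) / (1 + exp(z)).
# The coefficients and ink values are made-up numbers for illustration.
> w0 <- 3.5; w1 <- -0.001
> ink <- seq(0, 8000, by = 2000)
> p1 <- plogis(w0 + w1 * ink)    # P(y = 1 | x)
> cbind(ink, p1, p0 = 1 - p1)    # the two probabilities add up to one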

Logistic regression has a linear decision boundary
The log odds is a linear function of x:
  ln( P(y = 1 | x) / P(y = 0 | x) ) = ln( e^(w_0 + w_1 x) ) = w_0 + w_1 x
Both classes are equally probable when
  P(y = 1 | x) / P(y = 0 | x) = 1,
and therefore when
  ln( P(y = 1 | x) / P(y = 0 | x) ) = 0.
So the decision boundary is w_0 + w_1 x = 0.

Fitting the logistic regression function
The coefficients w_0 and w_1 are estimated by maximum likelihood. Except for some unlikely cases, there is a unique optimal solution. Plug in the estimates to get the fitted response function:
  P̂(y = 1 | x) = e^(ŵ_0 + ŵ_1 x) / (1 + e^(ŵ_0 + ŵ_1 x))

Analysis of the handwritten digit data
1 We have 42,000 examples of handwritten digits in the data frame mnist.dat.
2 The first column is the class label (digit), the remaining 784 columns are the pixel values.
3 Each class is approximately equally frequent.
We derive 2 features:
1 The amount of ink: the sum of the pixel values of a digit.
2 Horizontal symmetry: subtract the amount of ink in the right half from the amount of ink in the left half of the image.
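A sketch of how these two features could be computed; the data frame name mnist.df (used in the later code slides), the row-major pixel layout, and the feature names ink and horsym follow the slides, but the exact code is an assumption.

# Sketch: derive the ink and horizontal symmetry features from the pixels.
# Assumes the label is in column 1 and the 784 pixels follow in row-major order.
> pixels <- as.matrix(mnist.df[, 2:785])
> ink <- rowSums(pixels)                        # total amount of ink per image
> colnr <- rep(1:28, times = 28)                # image column of each pixel
> horsym <- rowSums(pixels[, colnr <= 14]) -    # ink in left half ...
+           rowSums(pixels[, colnr > 14])       # ... minus ink in right half
> mnist.df$ink <- ink
> mnist.df$horsym <- horsym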

Distribution of digits in the data
[Bar chart of the number of examples for each digit 0 through 9.]

Feature: amount of ink
[Box plots of the amount of ink for each digit 0 through 9.]

Feature: horizontal symmetry
[Box plots of horizontal symmetry for each digit 0 through 9.]

Scatter plot of the sample of zeroes and ones
[Scatter plot of amount of ink against horizontal symmetry for the sample of zeroes and ones.]

Fitting a logistic regression model

# Fit a logistic regression model to the sample of zeroes and ones.
> digits.logreg <- glm(digit ~ ink + horsym, data = mnist.df[index.s,], family = "binomial")
# Give some relevant information about the fitted model.
> summary(digits.logreg)

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    3.5927      3.4757    4.8     2.9e-5  ***
ink            -.657       .547    -4.248    2.6e-5  ***
horsym         -.7294      .352    -2.39      .69    *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Logistic Regression Decision Boundary
[The scatter plot of the zeroes and ones with the linear decision boundary of the fitted logistic regression model drawn in.]

Prediction with a logistic regression model

# Use the logistic regression model to make predictions on all zeroes and ones.
# The result is a vector of probabilities of digit 1.
> digits.logreg.pred <- predict(digits.logreg, newdata = mnist.df[index.test,], type = "response")
# Make a so-called "confusion matrix" of the true class against the predicted class.
# We predict the class with the highest fitted probability.
> digits.logreg.confmat <- table(as.numeric(digits.logreg.pred > 0.5), mnist.df[index.test, 1])[1:2, 1:2]
# Display the confusion matrix.
> digits.logreg.confmat
       0     1
  0  3858    27
  1    74   434
# Compute the percentage correctly classified.
> sum(diag(digits.logreg.confmat)) / sum(digits.logreg.confmat)
[1] 0.948468

Assignment 1: Logistic Regression
Go to: http://www.staff.science.uu.nl/ feeld/teaching.html, download the workspace, and load it into R. Open the script file on the webpage.
1 Reproduce my analysis by copying the relevant lines from the script file, and entering them into R.
2 Perform a similar analysis, but now for digits 8 and 9. Make appropriate changes to the relevant commands in the script file.

Crash course in classification trees
1 Growing the tree:
    1 Split the data into two subsets using a test on a single predictor (for example ink > c for some cutoff c).
    2 Try all possible such tests, and choose the most informative one (biggest reduction of error on the training data).
    3 Split the two resulting subsets in a similar manner.
    4 Continue until some stopping condition is met (e.g. a subset has become too small).
2 Pruning the tree: consider pruned subtrees of the tree grown, and pick the one with the smallest cross-validated error.
3 Prediction: pass a new case down the tree, and predict the majority class of the leaf node where it ends up.

Example: Loan Data

Record  age  married?  own house  income  gender  class
 1      22   no        no         28,000  male    bad
 2      46   no        yes        32,000  female  bad
 3      24   yes       yes        24,000  male    bad
 4      25   no        no         27,000  male    bad
 5      29   yes       yes        32,000  female  bad
 6      45   yes       yes        30,000  female  good
 7      63   yes       yes        58,000  male    good
 8      36   yes       no         52,000  male    good
 9      23   no        yes        40,000  female  good
10      50   yes       yes        28,000  female  good

Credit Scoring Tree
Root: 5 bad, 5 good.
  income > 36,000: 0 bad, 3 good (records 7, 8, 9).
  income ≤ 36,000: 5 bad, 2 good.
    age > 37: 1 bad, 2 good.
      married: 0 bad, 2 good (records 6, 10).
      not married: 1 bad, 0 good (record 2).
    age ≤ 37: 4 bad, 0 good (records 1, 3, 4, 5).

Why not split on gender in the top node?
Root: 5 bad, 5 good.
  gender = male: 3 bad, 2 good (records 1, 3, 4, 7, 8).
  gender = female: 2 bad, 3 good (records 2, 5, 6, 9, 10).
Predicting the majority class in each child still gives 2 + 2 = 4 errors on the training data, whereas the split on income > 36,000 leaves only 2 errors, so gender is the less informative split.
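For completeness, a small sketch of how this toy example could be entered and grown in R with rpart, using the loan data table above; the control settings are assumptions, and rpart's default Gini criterion means the resulting tree need not be identical to the slide's tree.

# Sketch: build the loan data and grow a classification tree on it.
> library(rpart)
> loan <- data.frame(
+   age     = c(22, 46, 24, 25, 29, 45, 63, 36, 23, 50),
+   married = c("no","no","yes","no","yes","yes","yes","yes","no","yes"),
+   house   = c("no","yes","yes","no","yes","yes","yes","no","yes","yes"),
+   income  = c(28000, 32000, 24000, 27000, 32000, 30000, 58000, 52000, 40000, 28000),
+   gender  = c("male","female","male","male","female","female","male","male","female","female"),
+   class   = c("bad","bad","bad","bad","bad","good","good","good","good","good"),
+   stringsAsFactors = TRUE)
> loan.rpart <- rpart(class ~ age + married + house + income + gender,
+                     data = loan, method = "class",
+                     minsplit = 2, minbucket = 1, cp = 0)
> loan.rpart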

Growing a classification tree

# Load the necessary libraries (packages).
> library(rpart)
> library(rpart.plot)
# Set the random seed for reproducibility.
> set.seed(2345)
# Grow a classification tree on the sample.
> digits.rpart <- rpart(digit ~ ink + horsym, data = mnist.df[index.s,], cp = 0, minsplit = 2, minbucket = 1)
# Show the cost-complexity pruning results.
> digits.rpart$cptable
[Cost-complexity pruning table: one row per candidate subtree, with columns CP, nsplit, rel error, xerror and xstd.]

Pruning sequence
[plotcp output: cross-validated relative error plotted against the complexity parameter cp and the corresponding tree size.]

The Big Tree
[rpart.plot of the unpruned tree: a deep tree with many splits on ink and horsym, most of whose leaves cover only a small percentage of the training sample.]

Pruning the Big Tree
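A sketch of how the big tree could be pruned back with rpart's prune() function, picking the cp value with the lowest cross-validated error from the cptable; the object names follow the earlier slides, but the exact call is an assumption.

# Prune back to the subtree with the lowest cross-validated error (xerror).
> best.row <- which.min(digits.rpart$cptable[, "xerror"])
> best.cp  <- digits.rpart$cptable[best.row, "CP"]
> digits.pruned <- prune(digits.rpart, cp = best.cp)
# Plot the pruned tree.
> rpart.plot(digits.pruned)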

The Pruned Tree
[rpart.plot of the pruned tree: a first split on ink, followed by a split on horsym, giving three leaf nodes.]

The Decision Boundary
[The scatter plot of the zeroes and ones with the axis-parallel decision boundaries of the pruned tree drawn in.]

Assignment 2: Classification Trees
1 Reproduce my analysis by copying the relevant lines from the script file, and entering them into R.
2 Perform a similar analysis, but now for digits 8 and 9. Make appropriate changes to the relevant commands in the script file.
3 In pruning, pick the subtree with the lowest cross-validation error.

Nearest neighbour classification
1 Intuition: examples tend to have the same class as examples that are close by in feature space.
2 So to classify a new example, find the nearest training example(s) and predict their majority class.
3 Note that we don't actually learn a model, we just have to memorize (store) the training set for future reference.
4 Scale the variables to have mean 0 and standard deviation 1:
    x'_i = (x_i − x̄) / s_x,   i = 1, ..., n
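A sketch of scaled k-nearest-neighbour classification with knn() from the class package (the same package that provides knn.cv, mentioned in Assignment 3); the choice k = 3 and the reuse of the earlier index vectors are assumptions.

# Sketch: scale the two features and classify with k-nearest neighbours.
> library(class)
> feat <- scale(mnist.df[, c("ink", "horsym")])       # mean 0, sd 1 per column
> knn.pred <- knn(train = feat[index.s, ],
+                 test  = feat[index.test, ],
+                 cl    = factor(mnist.df$digit[index.s]),
+                 k     = 3)
> table(knn.pred, mnist.df$digit[index.test])          # confusion matrix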

Nearest neighbour: example
[Scatter plot of the two classes with a new example marked "?".]
Prediction for k = 1? k = 3? k = 9?

The 3-NN Decision Boundary
[Two slides showing the decision boundary of the 3-nearest-neighbour classifier in the (ink, horizontal symmetry) plane.]

Assignment 3: Nearest Neighbour
1 Reproduce my analysis by copying the relevant lines from the script file, and entering them into R.
2 Perform a similar analysis, but now for digits 8 and 9. Make appropriate changes to the relevant commands in the script file.
3 Use cross-validation (knn.cv) on the training sample to estimate the accuracy of the knn classifier for different values of k.
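For point 3, one way it could look with knn.cv (leave-one-out cross-validation from the class package), reusing the scaled features from the sketch above; the candidate values of k are an assumption.

# Sketch: leave-one-out cross-validated accuracy of kNN for several k.
> ks <- c(1, 3, 5, 7, 9, 11)
> cv.acc <- sapply(ks, function(k) {
+   pred <- knn.cv(train = feat[index.s, ],
+                  cl    = factor(mnist.df$digit[index.s]),
+                  k     = k)
+   mean(pred == mnist.df$digit[index.s])
+ })
> data.frame(k = ks, accuracy = cv.acc)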