CLUe Training
An Introduction to Machine Learning in R, with an example from handwritten digit recognition
Ad Feelders
Universiteit Utrecht, Department of Information and Computing Sciences, Algorithmic Data Analysis Group
May 2, 2017
Terminology
1. Machine Learning
2. Statistical Learning
3. Data Mining
4. Pattern Recognition
5. ...
It's all about learning from data.
What is machine learning?
"The field of pattern recognition/machine learning is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories."
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer 2006, page 1.
Example: Handwritten Digit Recognition
28 × 28 pixel images

Example: Handwritten Digit Recognition
A digit on the 28 × 28 grid
Machine Learning Approach
Use training data $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of $n$ labeled examples, and fit a model to the training data. This model can subsequently be used to predict the class (digit) for new input vectors $x$. The ability to categorize correctly new examples is called generalization.
Types of Learning Problems
Supervised Learning:
- Numeric target: regression.
- Discrete unordered target: classification.
- Discrete ordered target: ordinal classification/regression; ranking.
Unsupervised Learning:
- Clustering.
- Density estimation.
- Frequent pattern mining.
Linear Regression Model
The central assumption of linear regression is
$$E[y \mid x] = w_0 + w_1 x,$$
where $E$ stands for expected value ("average"). Alternatively, we can write
$$y = w_0 + w_1 x + \varepsilon, \quad \text{with } E[\varepsilon \mid x] = 0.$$
The observed $y$ values are composed of a structural part, which is a (linear) function of $x$, and random noise.
Minimizing empirical error
Given training data $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, find the values of $w_0$ and $w_1$ such that the sum of squared errors
$$\mathrm{SSE}(w_0, w_1) = \sum_{i=1}^{n} \Big( y_i - \underbrace{(w_0 + w_1 x_i)}_{\text{prediction for } y_i} \Big)^2$$
is minimized.
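To make this concrete, here is a minimal sketch in R (not part of the lecture script; the data are simulated for illustration). lm() fits a linear model by minimizing exactly this sum of squared errors:

# Minimal sketch: least squares with lm() on simulated data.
> x <- runif(50)
> y <- 2 + 3*x + rnorm(50, sd = 0.5)   # true w0 = 2, w1 = 3, plus noise
> fit <- lm(y ~ x)                     # lm() minimizes SSE(w0, w1)
> coef(fit)                            # estimates of w0 (intercept) and w1 (slope)
> sum(residuals(fit)^2)                # the minimized SSE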
Example: the data generating process
[Figure: scatter plot of the training data around the true function.]
$$y = \sin(2\pi x) + \varepsilon, \quad \varepsilon \sim N(\mu = 0, \sigma = 0.3)$$
Fitting a linear model: large empirical error
[Figure: the fitted straight line through the data.]
$$\hat{y} = 0.8 - 0.68x$$
Fitting a third-order polynomial: just about right
[Figure: the fitted cubic through the data.]
$$\hat{y} = 0.9 + 8.82x - 28.23x^2 + 9.4x^3$$
Fitting a ninth-order polynomial: zero error, but overfitting
[Figure: the degree-9 fit passing through every data point.]
$$\hat{y} = 0.8 - 22.22x + 592.95x^2 - 4849.7x^3 + \ldots - 26547.6x^9$$
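The three fits above can be reproduced in outline with poly(); a sketch, assuming x and y hold the simulated sine data (variable names are illustrative, not from the lecture script):

# Sketch: compare training errors of polynomial fits of increasing degree.
> x <- runif(10)
> y <- sin(2*pi*x) + rnorm(10, sd = 0.3)
> for (d in c(1, 3, 9)) {
+   fit <- lm(y ~ poly(x, degree = d))
+   cat("degree", d, "training SSE:", sum(residuals(fit)^2), "\n")
+ }
# The degree-9 polynomial has 10 coefficients and interpolates the
# 10 training points, so its training SSE is (close to) zero.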
Lesson Learned
Minimizing empirical error may be a good way to fit the parameters of a single model, but it is not a good way to compare models of different complexity, as this would lead to overfitting and hence bad generalization. There are different ways to address this problem, for example: evaluate the predictive performance on data that was not used for training.
Cross-Validation: Training and Prediction
[Figures: five rounds of cross-validation. In each round the model is trained on all folds but one, and predictions are made on the held-out fold; a different fold is held out in every round.]
K-fold cross-validation
1. Divide the data into K parts.
2. For each of the K parts:
   - Use the remaining K − 1 parts to train the model.
   - Predict on the part that was not used for training.
3. Compute the accuracy of the predictions.
All predictions are made on data that was not used for training!
K-fold cross-validation: selecting a complexity parameter
C is a complexity parameter, for example the degree of the polynomial in the regression example (see the sketch after this list).
1. Divide the data into K parts.
2. For each value c of C:
   - For each of the K parts:
     - Use the remaining K − 1 parts to train the model with C = c.
     - Predict on the part that was not used for training.
   - Compute the accuracy of the predictions with C = c.
3. Select c as the value of C with the highest accuracy.
4. Train on the complete data with C = c.
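A sketch of this procedure in R, selecting the polynomial degree for the running example (the variable names are illustrative, not from the lecture script):

# Sketch: K-fold cross-validation to choose the polynomial degree.
> K <- 5
> fold <- sample(rep(1:K, length.out = length(y)))   # random fold labels
> cv.mse <- sapply(1:5, function(d) {
+   mean(sapply(1:K, function(k) {
+     fit <- lm(y ~ poly(x, degree = d), subset = (fold != k))
+     pred <- predict(fit, newdata = data.frame(x = x[fold == k]))
+     mean((y[fold == k] - pred)^2)                  # error on the held-out fold
+   }))
+ })
> best.d <- which.min(cv.mse)                 # degree with lowest CV error
> final <- lm(y ~ poly(x, degree = best.d))   # train on the complete data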
Logistic regression for binary classification
Code the 2 classes as 0 and 1 (the coding is arbitrary, but this coding is often convenient).
$y \in \{0, 1\}$: why not linear regression?
Logistic regression assumption:
$$E[y \mid x] = P(y = 1 \mid x) = \frac{e^{w_0 + w_1 x}}{1 + e^{w_0 + w_1 x}}$$
and therefore
$$P(y = 0 \mid x) = \frac{1}{1 + e^{w_0 + w_1 x}}$$
since $P(y = 1 \mid x)$ and $P(y = 0 \mid x)$ should add up to one.
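In R the logistic function is available as plogis(); a small sketch with illustrative values of w0 and w1:

# plogis(z) computes exp(z)/(1 + exp(z)), i.e. P(y = 1 | x) for z = w0 + w1*x.
> w0 <- -3; w1 <- 0.8
> x <- 2
> plogis(w0 + w1*x)       # P(y = 1 | x)
> 1 - plogis(w0 + w1*x)   # P(y = 0 | x); the two add up to one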
Logistic regression has a linear decision boundary
The log odds is a linear function of $x$:
$$\ln\left(\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}\right) = \ln\left(e^{w_0 + w_1 x}\right) = w_0 + w_1 x$$
Both classes are equally probable when
$$\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = 1, \quad \text{and therefore when} \quad \ln\left(\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}\right) = 0.$$
So the decision boundary is $w_0 + w_1 x = 0$.
Fitting the logistic regression function
The coefficients $w_0$ and $w_1$ are estimated by maximum likelihood. Except for some unlikely cases, there is a unique optimal solution. Plug in the estimates to get the fitted response function:
$$\hat{P}(y = 1 \mid x) = \frac{e^{\hat{w}_0 + \hat{w}_1 x}}{1 + e^{\hat{w}_0 + \hat{w}_1 x}}$$
Analysis of the handwritten digit data
1. We have 42,000 examples of handwritten digits in the data frame mnist.dat.
2. The first column is the class label (digit); the remaining 784 columns are the pixel values.
3. Each class is approximately equally frequent.
We derive 2 features (see the sketch after this list):
1. The amount of ink: the sum of the pixel values of a digit.
2. Horizontal symmetry: subtract the amount of ink in the right half from the amount of ink in the left half of the image.
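A sketch of how such features could be computed, assuming the 784 pixel columns (2:785) store each 28 × 28 image row by row; the column layout and the names below are assumptions, not taken from the lecture script:

# Hypothetical feature construction from the raw pixel columns.
> pixels <- as.matrix(mnist.dat[, 2:785])    # 784 pixel values per image
> ink <- rowSums(pixels)                     # total amount of ink
> left  <- as.vector(outer(1:14,  seq(0, 756, by = 28), "+"))   # left half indices
> right <- as.vector(outer(15:28, seq(0, 756, by = 28), "+"))   # right half indices
> horsym <- rowSums(pixels[, left]) - rowSums(pixels[, right])  # horizontal symmetry
> mnist.df <- data.frame(digit = mnist.dat[, 1], ink = ink, horsym = horsym)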
Distribution of digits in the data
[Figure: bar chart of the number of examples per digit (0-9); the classes are approximately equally frequent.]

Feature: amount of ink
[Figure: boxplots of the amount of ink for each digit (0-9).]

Feature: horizontal symmetry
[Figure: boxplots of horizontal symmetry for each digit (0-9).]

Scatter plot of the sample of zeroes and ones
[Figure: scatter plot of amount of ink against horizontal symmetry for the sample of zeroes and ones.]
Fitting a logistic regression model
# Fit a logistic regression model to the sample of zeroes and ones.
> digits.logreg <- glm(digit ~ ink+horsym, data=mnist.df[index.s,],
                       family="binomial")
# Give some relevant information about the fitted model.
> summary(digits.logreg)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    3.5927     3.4757   4.8    2.9e-05 ***
ink           -0.657      0.547   -4.248  2.6e-05 ***
horsym        -0.7294     0.352   -2.39   0.0169  *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Logistic Regression Decision Boundary
[Figure: the scatter plot of amount of ink against horizontal symmetry, with the fitted linear decision boundary separating the zeroes from the ones.]
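Since the boundary is the line w0 + w1·ink + w2·horsym = 0, it can be drawn by solving for horsym; a sketch, assuming digit is coded 0/1 as above (the plotting details are illustrative):

# Sketch: draw the fitted decision boundary on the feature scatter plot.
> w <- coef(digits.logreg)
> plot(mnist.df$ink[index.s], mnist.df$horsym[index.s],
+      col = mnist.df$digit[index.s] + 1,   # colour by class (digit coded 0/1)
+      xlab = "amount of ink", ylab = "horizontal symmetry")
> abline(a = -w["(Intercept)"]/w["horsym"], b = -w["ink"]/w["horsym"])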
Prediction with a logistic regression model
# Use the logistic regression model to make predictions on all zeroes and ones.
# The result is a vector of probabilities of digit 1.
> digits.logreg.pred <- predict(digits.logreg,
                                newdata=mnist.df[index.test,], type="response")
# Make a so-called "confusion matrix" of the true class against the predicted
# class. We predict the class with the highest fitted probability.
> digits.logreg.confmat <- table(as.numeric(digits.logreg.pred > 0.5),
                                 mnist.df[index.test, "digit"])[1:2, 1:2]
# Display the confusion matrix.
> digits.logreg.confmat
       0    1
  0 3858   27
  1   74  434
# Compute the percentage correctly classified.
> sum(diag(digits.logreg.confmat))/sum(digits.logreg.confmat)
[1] 0.948468
Assignment 1: Logistic Regression
1. Go to http://www.staff.science.uu.nl/~feeld/teaching.html, download the workspace, and load it into R. Open the script file on the webpage. Reproduce my analysis by copying the relevant lines from the script file, and entering them into R.
2. Perform a similar analysis, but now for the digits 8 and 9. Make the appropriate changes to the relevant commands in the script file.
Crash course in classification trees
1. Growing the tree:
   1. Split the data into two subsets using a test on a single predictor (for example, ink > 40,000).
   2. Try all possible such tests, and choose the most informative one (the biggest reduction of error on the training data).
   3. Split the two resulting subsets in a similar manner.
   4. Continue until some stopping condition is met (e.g., the subset has become too small).
2. Pruning the tree: consider pruned subtrees of the tree that was grown, and pick the one with the smallest cross-validated error.
3. Prediction: pass a new case down the tree, and predict the majority class of the leaf node where it ends up.
Example: Loan Data

Record  age  married?  own house  income  gender  class
  1      22    no        no       28,000  male    bad
  2      46    no        yes      32,000  female  bad
  3      24    yes       yes      24,000  male    bad
  4      25    no        no       27,000  male    bad
  5      29    yes       yes      32,000  female  bad
  6      45    yes       yes      30,000  female  good
  7      63    yes       yes      58,000  male    good
  8      36    yes       no       52,000  male    good
  9      23    no        yes      40,000  female  good
 10      50    yes       yes      28,000  female  good
Credit Scoring Tree
[Figure: classification tree. The root node (5 bad, 5 good) splits on income:
- income > 36,000: 0 bad, 3 good (records 7, 8, 9): predict good.
- income ≤ 36,000 (5 bad, 2 good): split on age:
  - age ≤ 37: 4 bad, 0 good (records 1, 3, 4, 5): predict bad.
  - age > 37 (1 bad, 2 good): split on married:
    - married: 0 bad, 2 good (records 6, 10): predict good.
    - not married: 1 bad, 0 good (record 2): predict bad.]
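The same tree can be grown in R; a sketch, assuming the table above is available as a data frame named loan with the column names below (both are illustrative assumptions):

# Sketch: grow the credit scoring tree on the toy loan data.
> library(rpart)
> loan.rpart <- rpart(class ~ age + married + house + income + gender,
+                     data = loan, method = "class",
+                     minsplit = 2, minbucket = 1, cp = 0)
> loan.rpart   # print the splits: income, then age, then married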
Why not split on gender in top node?
[Figure: splitting the root node (5 bad, 5 good) on gender gives:
- gender = male: 3 bad, 2 good (records 1, 3, 4, 7, 8).
- gender = female: 2 bad, 3 good (records 2, 5, 6, 9, 10).]
Both children have nearly the same class distribution as the parent, so the split hardly reduces the error on the training data: it is uninformative.
Growing a classification tree
# Load the necessary libraries (packages).
> library(rpart)
> library(rpart.plot)
# Set the random seed for reproducibility.
> set.seed(2345)
# Grow a classification tree on the sample.
> digits.rpart <- rpart(digit ~ ink+horsym, data=mnist.df[index.s,],
                        cp=0, minsplit=2, minbucket=1)
# Show the cost-complexity pruning results.
> digits.rpart$cptable
      CP nsplit rel error xerror       xstd
1  0.93       0      1.00   1.02   0.079972
2  0.02       1      0.07   0.10   0.038227
3  0.01       2      0.05   0.09  0.0293723
4  0.005      4      0.03   0.08   0.027728
5  0.000     10      0.01   0.08   0.027728
Pruning sequence
[Figure: plot of the cross-validated relative error (X-val Relative Error) against the complexity parameter cp, with the corresponding size of tree on the top axis.]
The Big Tree
[Figure: the full tree grown with cp = 0. The root splits on ink, followed by further splits on ink and horsym, ending in 11 leaves.]
Pruning the Big Tree
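In R, pruning back to the subtree with the smallest cross-validated error can be done with prune(), using the cp table shown earlier; a minimal sketch:

# Select the cp value with the lowest cross-validated error (xerror)
# and prune the big tree back to the corresponding subtree.
> best <- which.min(digits.rpart$cptable[, "xerror"])
> digits.pruned <- prune(digits.rpart, cp = digits.rpart$cptable[best, "CP"])
> rpart.plot(digits.pruned)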
The Pruned Tree
[Figure: the pruned tree, with a first split on ink and a second split on horsym, giving three leaves.]
The Decision Boundary
[Figure: scatter plot of amount of ink against horizontal symmetry, with the axis-parallel decision boundaries of the pruned tree.]
Assignment 2: Classification Trees
1. Reproduce my analysis by copying the relevant lines from the script file, and entering them into R.
2. Perform a similar analysis, but now for the digits 8 and 9. Make the appropriate changes to the relevant commands in the script file.
3. In pruning, pick the subtree with the lowest cross-validation error.
Nearest neighbour classification
1. Intuition: examples tend to have the same class as examples that are close by in feature space.
2. So to classify a new example, find the nearest training example(s) and predict their majority class.
3. Note that we don't actually learn a model; we just have to memorize (store) the training set for future reference.
4. Scale the variables to have mean 0 and standard deviation 1 (a sketch follows below):
$$x_i' = \frac{x_i - \bar{x}}{s_x}, \quad i = 1, \ldots, n$$
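A sketch of nearest neighbour prediction in R with the class package; the scaling and the index variables follow the earlier analysis, and the object names are illustrative:

# Sketch: 3-NN on the standardized ink and horsym features.
> library(class)
> X <- scale(mnist.df[, c("ink", "horsym")])   # mean 0, standard deviation 1
> digits.knn.pred <- knn(train = X[index.s, ], test = X[index.test, ],
+                        cl = mnist.df$digit[index.s], k = 3)
> mean(digits.knn.pred == mnist.df$digit[index.test])   # accuracy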
Nearest neighbour: example
[Figure: a query point surrounded by training examples of two classes.]
Prediction for k = 1? k = 3? k = 9?
The 3-NN Decision Boundary
[Figure: the 3-NN decision boundary in the (amount of ink, horizontal symmetry) plane.]
Assignment 3: Nearest Neighbour
1. Reproduce my analysis by copying the relevant lines from the script file, and entering them into R.
2. Perform a similar analysis, but now for the digits 8 and 9. Make the appropriate changes to the relevant commands in the script file.
3. Use cross-validation (knn.cv) on the training sample to estimate the accuracy of the knn classifier for different values of k (see the sketch below).
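For step 3, a sketch of how knn.cv() could be used, with X the standardized feature matrix from the earlier sketch (knn.cv performs leave-one-out cross-validation on the training sample):

# Sketch: estimate accuracy for several values of k with knn.cv().
> library(class)
> acc <- sapply(1:15, function(k) {
+   pred <- knn.cv(train = X[index.s, ], cl = mnist.df$digit[index.s], k = k)
+   mean(pred == mnist.df$digit[index.s])
+ })
> which.max(acc)   # the value of k with the highest estimated accuracy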