Manual for a computer class in ML


November 3, 2015

Abstract

This document describes a tour of Machine Learning (ML) techniques using tools in MATLAB. We point to the standard implementations, give example code and point out some pitfalls. Please work your way through the manual, try out the different implementations, answer the questions and discuss with your peers. 2 hours should be sufficient to work your way through the material. I don't ask for any report on this, but I expect some of the techniques to trickle through in the mini-projects.

Let us start with an overview of available ML toolboxes.

(MATLAB) Since 2006, MATLAB maintains and updates a Statistics and Machine Learning Toolbox, see http://se.mathworks.com/help/stats/index.html. Functionality is rather wide-ranging, covering classification, regression and clustering as well as inference (bootstrap, ANOVA, testing). SVMs and other modern tools have been implemented since 2014.

(R) Particularly complete functionality when it comes to ML, but the functions are scattered over many packages: see e.g. https://cran.r-project.org/web/views/machinelearning.html. Of particular note is the software supporting the book by Hastie et al., see https://cran.r-project.org/.

(Python) There is a recent trend to use Python as an alternative to the above. One such effort towards ML is SCIKIT-LEARN, see http://scikit-learn.org/stable/.

(Java) Weka is a continual effort to implement ML tools in packages which can be included in larger software projects. The focus is on data mining, and not so much on statistical ML. See http://www.cs.waikato.ac.nz/ml/weka/.

1 Hello world: Classification

1.1 LDA

The simplest way to do classification is Linear Discriminant Analysis (LDA), which basically fits two Gaussian distributions with equal covariance to the data, which is assumed to be a vector in dimension p. The optimal classifier corresponding to those two Gaussians is then a linear model, so that the predicted label ŷ associated to a new data point x ∈ R^p is given as

    \hat{y} = \mathrm{sign}(\hat{w}^\top x).    (1)

LDA gives this vector ŵ ∈ R^p. LDA is implemented in MATLAB as classify. Generate m = 100 training points and m' = 100 test points from two Gaussians in R^2, and classify all the points in a grid in R^2 (as in the help classify explanation).
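
As a starting point, here is a minimal sketch of this exercise, assuming the Statistics and Machine Learning Toolbox; the class means at ±(1, 1) and the grid limits are illustrative choices, not part of the exercise:

    % Two Gaussian classes in R^2: m = 100 training and m' = 100 test points.
    rng(1);                                            % for reproducibility
    m = 100;
    Xtr = [randn(m/2,2) + repmat([ 1  1], m/2, 1); ...
           randn(m/2,2) + repmat([-1 -1], m/2, 1)];
    ytr = [ones(m/2,1); -ones(m/2,1)];
    Xte = [randn(m/2,2) + repmat([ 1  1], m/2, 1); ...
           randn(m/2,2) + repmat([-1 -1], m/2, 1)];
    yte = ytr;

    % Classify all points of a grid in R^2, as in "help classify".
    [gx, gy] = meshgrid(-4:0.1:4, -4:0.1:4);
    ygrid = classify([gx(:) gy(:)], Xtr, ytr, 'linear');

    % Decision regions, training points, and the test error.
    gscatter(gx(:), gy(:), ygrid, 'rb', '.', 1); hold on;
    gscatter(Xtr(:,1), Xtr(:,2), ytr, 'kk', 'ox'); hold off;
    test_error = mean(classify(Xte, Xtr, ytr, 'linear') ~= yte)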

1.2 SVMs

SVMs are implemented in the MATLAB Stat&ML toolbox as fitcsvm; read the help on this function. Run the example on the Fisher dataset as detailed in the documentation, see http://se.mathworks.com/help/stats/fitcsvm.html. This function contains an option to compute the CV error. Run the example, and compare later to your own CV implementation.

In earlier versions of MATLAB, this function is not implemented. In that case, download LIBSVM at https://www.csie.ntu.edu.tw/~cjlin/libsvm/, and run the implementation on the breast-cancer UCI dataset instead. The page provides example code for doing this.

2 Generalities

2.1 Generalization error

The overall focus of the FOML course has been on relating the actual risk to the empirical risk. Let us say we have a measure of loss, relating the measured label y_i to the predicted one ŷ_i as ℓ(y_i, ŷ_i). For example, for classification this is ℓ(y_i, ŷ_i) = I(y_i ≠ ŷ_i), where I(z) = 1 if z holds true, and equals 0 otherwise (the zero-one loss). Then the actual risk is defined as

    R(h) = \mathbb{E}_{x \sim D}[\ell(h(x), f(x))],    (2)

and given an i.i.d. sample {(x_i, y_i = f(x_i))}_{i=1}^m, one can approximate this theoretical quantity by

    \hat{R}_m(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), f(x_i)),    (3)

which in general match as m → ∞. The first quantity is in general impossible to compute, but here we can generate different realisations of an independent test set, so that E[\hat{R}_m(h)] = R(h). If we only have few data points, one uses L-fold cross-validation instead. This is an estimate of the actual risk for the classifier h learned on a sample. It is defined as

    \hat{R}^L_m = \frac{1}{L} \sum_{l=1}^{L} \hat{R}_{m_l}(\hat{h}_{m/m_l}),    (4)

where we divide the set M = {1, ..., m} into L disjoint subsets {M_l}_{l=1}^L of size m_l respectively. Then

    \hat{R}_{m_l}(h) = \frac{1}{m_l} \sum_{i \in M_l} \ell(y_i, h(x_i)),    (5)

and ĥ_{m/m_l} is the classifier obtained by your learning algorithm when trained on all the m samples except for the m_l used for testing in (5). Note that the CV error (4) is not parametrized by different h's, and is as such qualitatively different from (3).

Task: The first task is to implement this CV error in MATLAB. Take a classification model implemented with an SVM (as in the previous section). Implement CV for the zero-one loss, with L = 10 and the dataset as input. The CV error is the output.

Now we can also generate a plot of the distribution (histogram, see hist) of the empirical error for various trained classifiers. So for different realisations of training sets and test sets, one can compute the training error \hat{R}_m(\hat{h}_m) and the test error \hat{R}_{m'}(\hat{h}_m) on a test set of size m'. Take for example m = m' = 100, and use data generated from two equally-sized Gaussians. Then one can plot the actual CV error on the original data on this histogram. How do they relate?
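
A minimal sketch of such a CV routine, assuming fitcsvm is available (R2014b or later); the function name cverror and the random fold assignment are my own choices:

    function cv_err = cverror(X, y, L)
    % CVERROR  L-fold cross-validation estimate of the zero-one loss of an SVM.
    %   X is the m-by-p data matrix, y the m-by-1 vector of labels (+1/-1).
        m     = size(X, 1);
        folds = mod(randperm(m), L) + 1;                % random split into L folds
        err   = zeros(L, 1);
        for l = 1:L
            test   = (folds == l);
            mdl    = fitcsvm(X(~test,:), y(~test));     % h trained without fold l
            err(l) = mean(predict(mdl, X(test,:)) ~= y(test));   % zero-one loss, as in (5)
        end
        cv_err = mean(err);                             % CV error, as in (4)
    end

With L = 10 the output can be compared against the toolbox estimate, e.g. kfoldLoss(crossval(fitcsvm(X, y), 'KFold', 10)).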

2.2 Rademacher complexity

A general theme in the book is to relate the actual to the empirical risk through the notion of the Rademacher complexity, defined as follows. Given a set of m samples {x_i}_{i=1}^m and a hypothesis set H = {h}, the empirical Rademacher complexity is defined as

    \hat{R}_m(H) = \mathbb{E}_\sigma\left[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i)\right],    (6)

where the expectation E_σ is over a random sample of the Rademacher variables σ_i, such that P(σ_i = +1) = P(σ_i = -1) = 1/2. If H is a singleton, then \hat{R}_m(H) = 0; if H contains all measurable functions with sup_x |h(x)| = 1 (and x_i ≠ x_j for i ≠ j), then \hat{R}_m(H) = 1. In general, one interpolates between those two extremes. This exercise illustrates this in a practical setting.

Task: We will replace the sup over h ∈ H by the learning approach that we study. Here, let us use an SVM with parameter C = 1. We approximate the expectation by an average over random draws of labels {σ_i}_i ∈ {-1, +1}^m. Then for each draw of these labels, we use the SVM implementation fitcsvm to return the h of interest, denoted as ĥ_j. Then

    \hat{R}_m(H) \approx \frac{1}{q} \sum_{j=1}^{q} \frac{1}{m} \sum_{i=1}^{m} \sigma_i^j \hat{h}_j(x_i),    (7)

where q = 1000. Plot the result. Also make a histogram of the individual terms of the outer sum in (7), i.e. one value per draw of {σ_i}. Now we can empirically relate the empirical risk (the training error) and the actual risk (the test error) via the estimated Rademacher term. While theory gives an upper bound for this relation, inspection of those three plots gives an empirical relation. What is your conclusion? Discuss with others.
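
A minimal sketch of the estimate (7), assuming fitcsvm; BoxConstraint is the C parameter of fitcsvm, and X stands for whatever m-by-p data matrix you use (an illustrative Gaussian sample here). With q = 1000 this takes a while, so reduce q for a first run:

    % Empirical Rademacher complexity estimate (7) for an SVM with C = 1.
    rng(1);
    m = 100; p = 2; q = 1000;                 % reduce q (e.g. to 100) for a quick test
    X = randn(m, p);                          % illustrative data; reuse your own X
    terms = zeros(q, 1);
    for j = 1:q
        sigma = 2*(rand(m,1) > 0.5) - 1;      % Rademacher labels, +1/-1 with prob. 1/2
        mdl   = fitcsvm(X, sigma, 'BoxConstraint', 1);   % h fitted to the random labels
        terms(j) = mean(sigma .* predict(mdl, X));       % (1/m) sum_i sigma_i h(x_i)
    end
    rademacher_estimate = mean(terms)         % the estimate (7)
    hist(terms, 30)                           % histogram of the individual terms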

3 Regression

Regression or approximation assumes that the labels y take values in R. We consider here the loss function ℓ(y, ŷ) = (y - ŷ)^2. We will consider two principles, and contrast them with the Ordinary Least Squares (OLS) approach. Again, plot the distribution of the training error (histogram over different realisations) and the distribution of the test-set error (histogram over different realisations), and indicate the CV error on those.

3.1 Regularization schemes

This section uses the following data. Let p = 100 and m = 200. Generate x_i from a standard normal distribution over R^p, and let the true parameter vector be w_0 = (1, 0, 0, ..., 0)^T ∈ {0,1}^p. Then y_i = x_i^T w_0 + e, where e is again standard normally distributed (Gaussian, zero mean and unit variance). Generate a training set of size m, and a test set of size m' = 10,000. First implement the OLS estimate, and calculate the training and test-set errors.

Task: Then implement a Ridge Regression (RR) estimate with regularisation constant λ = 1. Next, plot the training error and the test-set error as a function of λ taking values in the range 10^{-3}, ..., 10^{3}. Generate the same plots using the LASSO estimator instead (see help lasso in MATLAB).

3.2 Support Vector Regression (SVR)

Support Vector Regression is implemented in the LIBSVM toolbox. Download it at https://www.csie.ntu.edu.tw/~cjlin/libsvm/ and run the example. Run this implementation on the data as before, plot the training error and the test-set error, and let C vary over the same range as before. What does the resulting plot look like?

Task: Relate the estimates obtained by OLS, RR, LASSO and SVR qualitatively. What can you say about the goodness of fit in each case?
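
A minimal sketch of the data generation, the OLS fit and a ridge path, using the closed-form ridge solution rather than the toolbox ridge function; the call to lasso assumes the Statistics and Machine Learning Toolbox, and note that its λ is normalised slightly differently (the squared loss is divided by 2m):

    % Data: p = 100, m = 200 training and m' = 10,000 test points.
    rng(1);
    p = 100; m = 200; mtest = 10000;
    w0  = [1; zeros(p-1,1)];
    Xtr = randn(m, p);      ytr = Xtr*w0 + randn(m, 1);
    Xte = randn(mtest, p);  yte = Xte*w0 + randn(mtest, 1);
    err = @(w, X, y) mean((y - X*w).^2);      % squared loss

    % OLS estimate and its training and test-set errors.
    w_ols = Xtr \ ytr;
    [err(w_ols, Xtr, ytr), err(w_ols, Xte, yte)]

    % Ridge regression over lambda = 10^-3, ..., 10^3 (closed form, no intercept).
    lambdas = logspace(-3, 3, 25);
    tr_err = zeros(size(lambdas)); te_err = zeros(size(lambdas));
    for k = 1:numel(lambdas)
        w_rr      = (Xtr'*Xtr + lambdas(k)*eye(p)) \ (Xtr'*ytr);
        tr_err(k) = err(w_rr, Xtr, ytr);
        te_err(k) = err(w_rr, Xte, yte);
    end
    semilogx(lambdas, tr_err, 'b-', lambdas, te_err, 'r-');
    legend('training error', 'test-set error'); xlabel('\lambda');

    % LASSO over the same grid (see "help lasso"); the intercept is ignored here.
    B = lasso(Xtr, ytr, 'Lambda', lambdas);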

4 Online learning

Let us turn attention to the ONLINE LEARNING scenario, where data comes in sequentially. That is, we have a time index t running from 1 to T, and an initial parameter vector w_0, which is here taken to be the zero vector. The general protocol is then as follows.

Algorithm 1 Online learning
Require: w_0
for t = 1, ..., T do
    Receive x_t ∼ D.
    Predict ŷ_t = f(x_t, w_{t-1}).
    Receive y_t.
    Update w_t based on w_{t-1}, x_t and ℓ(y_t, ŷ_t).
end for

These algorithms are usually analysed in terms of the regret, defined as

    r_T = \sum_{t=1}^{T} \ell(y_t, w_{t-1}^\top x_t) - \min_{w} \sum_{t=1}^{T} \ell(y_t, w^\top x_t),    (8)

where here we focus on the loss function ℓ(y, ŷ) = (y - ŷ)^2.

4.1 Halving

The halving scheme is a particularly simple and elegant approach to online classification. That is, y_t ∈ {-1, 1}, and we consider the zero-one loss. It works in the expert setting and under the reliability assumption: the input is also binary, and there is a feature (expert) which always equals the output. The HALVING scheme then sniffs this feature out.

Task: Generate at each time point t a uniformly random binary vector x_t ∈ {-1, +1}^p with p = 1000, and let y_t = x_{t,111}. Then run the halving scheme and see if it converges to feature 111. What happens if you vary p to 100, 1000, 10000, ...? What happens if you add a tiny bit of noise as y_t = σ_t x_{t,111}, where σ_t is +1 with high probability, and -1 otherwise?

Task: See how the regret evolves when t = 1, 2, 3, ... grows. For the regret, replace the minimising w in the second sum of (8) by w_0 = (0, ..., 0, 1, 0, ..., 0)^T ∈ R^p with the 1 at position 111. What happens to the test error? What happens to the distribution (histogram) of the test errors during the iterations of the algorithm?
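
A minimal sketch of the halving scheme in this setting (noise-free case); the interpretation, following the standard expert setting, is that every feature is an expert, the prediction is the majority vote of the surviving experts, and every expert inconsistent with y_t is removed after each round. The horizon T = 200 is an illustrative choice:

    % Halving over p experts (the p binary features); expert 111 is perfect.
    rng(1);
    p = 1000; T = 200;
    alive    = true(p, 1);                 % experts still consistent with the data
    mistakes = 0;
    for t = 1:T
        x = 2*(rand(p,1) > 0.5) - 1;       % uniformly random vector in {-1,+1}^p
        y = x(111);                        % reliability assumption: expert 111 is exact
        yhat = sign(sum(x(alive)));        % majority vote of the surviving experts
        if yhat == 0, yhat = 1; end        % break ties arbitrarily
        mistakes = mistakes + (yhat ~= y);
        alive = alive & (x == y);          % drop every expert that was wrong this round
    end
    find(alive)'                           % ideally only expert 111 survives
    mistakes                               % at most log2(p) in the noise-free case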

4.2 Linear PERCEPTRON

First, we consider the simplest model, where we predict the binary label at time t as

    \hat{y}_t = \mathrm{sign}(w_{t-1}^\top x_t),    (9)

where w_t, x_t ∈ R^p. The perceptron algorithm is then further defined by the update rule

    w_t = w_{t-1} + I(y_t \neq \hat{y}_t)\, y_t x_t,    (10)

where I(z) equals one if z holds true, and equals zero otherwise. Implement the resulting algorithm in a MATLAB script.

Task: Make a simple implementation in MATLAB on the data as used in the halving scheme, but where new predictions are this time not made by the majority vote as in the HALVING scheme, but as in eq. (9). See how the regret evolves when t = 1, 2, 3, ... grows. For the regret, replace the minimising w in the second sum of (8) by w_0 = (0, ..., 0, 1, 0, ..., 0)^T ∈ R^p with the 1 at position 111.
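
A minimal sketch of this task; the regret is tracked with the squared loss on the linear output w_{t-1}^T x_t, as in (8), with the comparator fixed to the perfect feature as instructed (swap in the zero-one loss if you prefer). T = 500 is an illustrative choice:

    % Perceptron (9)-(10) on the halving data; regret (8) with the perfect-feature comparator.
    rng(1);
    p = 1000; T = 500;
    w    = zeros(p, 1);                     % initial parameter vector w_0 = 0
    wref = zeros(p, 1); wref(111) = 1;      % comparator: the 1 at position 111
    regret = zeros(T, 1); mistakes = 0; cum = 0;
    for t = 1:T
        x    = 2*(rand(p,1) > 0.5) - 1;     % x_t uniform in {-1,+1}^p
        y    = x(111);                      % y_t = x_{t,111}
        yhat = sign(w'*x);                  % prediction (9)
        if yhat == 0, yhat = -1; end        % map sign(0) to a default label
        mistakes  = mistakes + (yhat ~= y);
        cum       = cum + (y - w'*x)^2 - (y - wref'*x)^2;   % increment of (8)
        regret(t) = cum;
        w = w + (yhat ~= y) * y * x;        % update (10)
    end
    plot(1:T, regret); xlabel('t'); ylabel('regret r_t');
    mistakes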