GI01/M055: Supervised Learning

GI01/M055: Supervised Learning
1. Introduction to Supervised Learning
October 5, 2009
John Shawe-Taylor

Course information
1. When: Mondays, 14:00-17:00. Where: Room 1.20, Engineering Building, Malet Place
2. Course webpage: http://www.cs.ucl.ac.uk/staff/j.shawe-taylor/courses/index-gi01.html
3. Mailing list: gi01@cs.ucl.ac.uk, m055@cs.ucl.ac.uk
4. Office: 8.14, CS Building, Malet Place

Assessment
1. Homework (25%) and exam (75%)
2. 3 homework assignments (deliver them on time; penalty otherwise)
3. To pass the course, you must obtain an average of at least 50% when the homework and exam components are weighted together.

Material
Lecture notes: http://www.cs.ucl.ac.uk/staff/j.shawe-taylor/courses/index-gi01.html
Reference book: Hastie, Tibshirani & Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
Additional material (see webpage for more info):
- General: Duda, Hart & Stork; Mitchell; Bishop; ...
- Bayesian methods in ML: MacKay; ...
- Kernel methods: Shawe-Taylor & Cristianini; Schölkopf & Smola; ...
- Learning theory: Devroye, Györfi & Lugosi; Vapnik; ...

Prerequisites
- Calculus (real-valued functions, limits, derivatives, Taylor series, integrals, ...)
- Elements of probability theory (random variables, expectation, variance, conditional probabilities, Bayes rule, ...)
- Fundamentals of linear algebra (vectors, angles, matrices, eigenvectors/eigenvalues, ...)
- A bit of optimization theory (convex functions, Lagrange multipliers)

Provisional course outline
(Week 1) Key concepts (probabilistic formulation of learning from examples; error functionals; loss function; Bayes rule; learning algorithm; overfitting and underfitting; model selection; cross-validation); some basic learning algorithms (linear regression, k-NN)
(Week 2) Discriminative vs. generative approach
(Week 3) Optimization and learning algorithms
(Week 4) Regularisation, kernels

Provisional course outline (cont.)
(Week 6) Elements of learning theory (PAC learning, complexity of the hypothesis space, VC bounds)
(Week 7) Support vector machines / Bayesian interpretation
(Week 8) Tree-based learning algorithms and boosting

Acknowledgments: the lectures are based on earlier years' courses. Thanks to Massi Pontil for the course notes; he in turn inherited notes from Fernando Perez-Cruz, Iain Murray and Ed Snelson of the Gatsby Unit at UCL.

Today's plan
- Supervised learning problem
- Why and when learning?
- Regression and classification
- Two learning algorithms: least squares and k-NN
- Probabilistic model, error functional, optimal solutions
- Hypothesis space, overfitting and underfitting
- Choice of the learning algorithm (model selection)

Supervised learning problem
Given a set of input/output pairs (the training set), we wish to compute the functional relationship between the input and the output: x → f → y
Example 1 (people detection): given an image, we wish to say whether it depicts a person or not. The output is one of 2 possible categories.
Example 2 (pose estimation): we wish to predict the pose of a face image. The output is a continuous number (here a real number describing the face rotation angle).
In both problems the input is a high-dimensional vector x representing pixel intensity/colour.

Why and when learning?
The aim of learning is to develop software systems able to perform particular tasks, such as people detection.
The standard software engineering approach would be to specify the problem, develop an algorithm to compute the solution, and then implement it efficiently.
The problem with this approach is developing the algorithm:
- there is no known criterion for distinguishing the images;
- in many cases humans have no difficulty;
- typically the problem is to specify the problem in logical terms.

Learning approach
Learning attempts to infer the algorithm from a set of (labelled) examples, in much the same way that children learn by being shown a set of examples (e.g. sports / non-sports cars).
It attempts to isolate the underlying structure from a set of examples.
The approach should be:
- stable: it finds something that is not a chance property of the set of examples;
- efficient: it infers the solution in time polynomial in the size of the data;
- robust: it should not be too sensitive to mislabelled/noisy examples.

People detection example (figure slide)

People detection example, cont. (figure slide)

Notation
- $X$: input space (e.g. $X \subseteq \mathbb{R}^d$), with elements $x, x', x_i, \ldots$
- $Y$: output space, with elements $y, y', y_i, \ldots$
- Classification (qualitative output): $Y = \{c_1, \ldots, c_k\}$; binary classification: $k = 2$
- Regression (quantitative output): $Y \subseteq \mathbb{R}$
- $S = \{(x_i, y_i)\}_{i=1}^m$: training set (set of I/O examples)
- Supervised learning problem: compute a function which best describes the I/O relationship

Learning algorithm
- Training set: $S = \{(x_i, y_i)\}_{i=1}^m \subseteq X \times Y$
- A learning algorithm is a mapping $S \mapsto f_S$
- A new input $x$ is predicted as $f_S(x)$
- In the course we mainly deal with deterministic algorithms, but we'll also comment on some randomized ones
- Today: we describe two simple learning algorithms, linear regression and k nearest neighbours

Some important questions
- How is the data collected? (need assumptions!)
- How do we represent the inputs? (may require a preprocessing step)
- How accurate is $f_S$ on new data? / How do we evaluate the performance of the learning algorithm on unseen data? (study of the generalization error)
- How complex is a learning task? (computational complexity, sample complexity)
- Given two different learning algorithms, $f_S$ and $f'_S$, which one should we choose? (model selection problem)

Some difficulties/aspects of the learning process
- New inputs differ from the ones in the training set (look-up tables do not work!)
- Inputs are measured with noise
- The output is not deterministically obtained from the input
- The input is high dimensional, but some components/variables may be irrelevant
- Whenever prior knowledge is available, it should be used

More examples / applications
- Optical digit recognition, useful for identifying the numbers in a ZIP code from a digitized image (Computer Vision)
- Predicting house prices based on sq. feet, number of rooms, distance from central London, ... (Marketing)
- Estimating the amount of glucose in the blood of a diabetic person (Medicine)
- Detecting spam emails (Information retrieval)
- Predicting protein functions / structures (Bioinformatics)
- Speaker identification / sound recognition (Speech recognition)

Binary classification: an example
We describe two basic learning algorithms/models for classification, which can be easily adapted to regression as well.
We choose $X = \mathbb{R}^2$, $x = (x_1, x_2)$ and $Y = \{\text{green}, \text{red}\}$.
Our first learning algorithm computes a linear function (perceptron), $w \cdot x + b$, and classifies an input $x$ as
$$f(x) = \begin{cases} \text{red} & \text{if } w \cdot x + b > 0 \\ \text{green} & \text{if } w \cdot x + b \le 0 \end{cases}$$

Least squares
How do we compute the parameters $w$ and $b$?
We do not lose generality if we set $b = 0$ (we can change the data representation as $x \to (x, 1)$ and $w \to (w, b)$).
We find $w$ by minimizing the residual sum of squares on the data:
$$R(w) = \sum_{i=1}^m (y_i - w \cdot x_i)^2$$
To compute the minimum we need to solve the system of equations $\nabla R(w) = 0$ (recall that $\nabla = (\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2})$).

Normal equations
Recall that $R(w) = \sum_{i=1}^m (y_i - w \cdot x_i)^2$. Note that
$$\frac{\partial R}{\partial w_k} = 2 \sum_{i=1}^m (w \cdot x_i - y_i)\, \frac{\partial (w \cdot x_i)}{\partial w_k} = 2 \sum_{i=1}^m (w \cdot x_i - y_i)\, x_{ik}$$
Hence, to find $w = (w_1, w_2)$ we need to solve the linear system of equations
$$\sum_{i=1}^m (x_{ik} x_{i1} w_1 + x_{ik} x_{i2} w_2) = \sum_{i=1}^m x_{ik} y_i, \qquad k = 1, 2$$

Normal equations (cont.)
In vector notation:
$$\sum_{i=1}^m x_i x_i^\top w = \sum_{i=1}^m x_i y_i$$
In matrix notation:
$$X^\top X w = X^\top y$$
where
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & & \vdots \\ x_{m1} & \cdots & x_{md} \end{pmatrix} = [x_1, \ldots, x_m]^\top, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}$$
Note: here $d = 2$ but, clearly, the result holds for any $d \in \mathbb{N}$.

Least squares solution
$$X^\top X w = X^\top y$$
For the time being we will assume that the matrix $X^\top X$ is invertible, so we conclude that
$$w = (X^\top X)^{-1} X^\top y$$
Otherwise, the solution may not be unique... (we will deal with the case that $X^\top X$ is not invertible later)
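
To make the computation concrete, here is a minimal NumPy sketch (my addition, not part of the original notes) that absorbs the bias b via the augmented representation x → (x, 1) and solves the normal equations; the toy data values are arbitrary, and np.linalg.solve is used instead of forming the explicit inverse.

```python
import numpy as np

# Toy data: m = 5 points in d = 2 dimensions (values chosen arbitrarily).
X = np.array([[0.0, 1.0],
              [1.0, 0.5],
              [2.0, 1.5],
              [3.0, 2.0],
              [4.0, 3.5]])
y = np.array([1.0, 1.2, 2.8, 3.9, 5.1])

# Absorb the bias b by appending a constant feature: x -> (x, 1), w -> (w, b).
Xa = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equations: (Xa^T Xa) w = Xa^T y.  Assumes Xa^T Xa is invertible.
w = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)
w_lin, b = w[:-1], w[-1]
print("w =", w_lin, " b =", b)

# Prediction for a new input.
x_new = np.array([2.5, 1.0])
print("prediction:", w_lin @ x_new + b)
```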

Going back to b
Substituting $x$ by $(x, 1)$ and $w$ by $(w, b)$, the above system of equations can be expressed in matrix form as (exercise):
$$X^\top X w + X^\top \mathbf{1}\, b = X^\top y, \qquad \mathbf{1}^\top X w + m\, b = \mathbf{1}^\top y$$
that is
$$\begin{pmatrix} X^\top X & X^\top \mathbf{1} \\ \mathbf{1}^\top X & m \end{pmatrix} \begin{pmatrix} w \\ b \end{pmatrix} = \begin{pmatrix} X^\top y \\ \mathbf{1}^\top y \end{pmatrix}$$
where $\mathbf{1} = (1, 1, \ldots, 1)^\top$ is the $m \times 1$ vector of ones.

A different approach: k nearest neighbours
Let $N(x; k)$ be the set of the $k$ nearest training inputs to $x$ and $I_x = \{i : x_i \in N(x; k)\}$ the corresponding index set. Then
$$f(x) = \begin{cases} \text{red} & \text{if } \frac{1}{k} \sum_{i \in I_x} y_i > \frac{1}{2} \\ \text{green} & \text{if } \frac{1}{k} \sum_{i \in I_x} y_i \le \frac{1}{2} \end{cases}$$
- Closeness is measured using a metric (e.g. Euclidean distance)
- Local rule (compute a local majority vote)
- The decision boundary is non-linear
Note: for regression we set $f(x) = \frac{1}{k} \sum_{i \in I_x} y_i$ (a "local mean")
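
A minimal NumPy sketch of the k-NN rule above (my illustration, not from the notes); labels are coded as y ∈ {0, 1}, with 1 playing the role of "red", which is an assumption made here for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """k-NN majority vote: average the k nearest labels and threshold at 1/2.

    Labels are assumed to be 0/1; for regression, return local_mean instead.
    """
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    idx = np.argsort(dists)[:k]                   # indices of the k nearest inputs
    local_mean = y_train[idx].mean()
    return 1 if local_mean > 0.5 else 0

# Toy training set (arbitrary values).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.1, 0.0]), k=3))  # expected 0
print(knn_predict(X_train, y_train, np.array([1.0, 1.1]), k=3))  # expected 1
```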

k-NN: the effect of k
The smaller k, the more irregular the decision boundary.
(Figures: decision boundaries for k = 15 and k = 1.)
How to choose k? Later...

Linear regression vs. k-NN (informal)
- Global vs. local
- Linear vs. non-linear
- Bias / variance considerations: LR relies heavily on the linear assumption (it may have large bias); k-NN does not. LR is stable (the solution does not change much if the data are perturbed); 1-NN isn't!
- k-NN is sensitive to the input dimension d: if d is high, the inputs tend to be far away from each other!

Supervised learning model (back to Q1)
We assume that the data is obtained by sampling i.i.d. from a fixed but unknown probability density $P(x, y)$.
Expected error:
$$\mathcal{E}(f) := E\left[(y - f(x))^2\right] = \int (y - f(x))^2\, dP(x, y)$$
Our goal is to minimize $\mathcal{E}$. Optimal solution: $f^* := \arg\min_f \mathcal{E}(f)$.
But: in order to compute $f^*$ we need to know $P$!
Note: for binary classification with $Y = \{0, 1\}$ and $f : X \to Y$, $\mathcal{E}(f)$ counts the average number of mistakes of $f$ (aka the expected misclassification error).

Regression function...
Let us compute the optimal solution $f^*$ for regression ($Y = \mathbb{R}$) and binary classification ($Y = \{0, 1\}$).
Using the decomposition $P(y, x) = P(y \mid x) P(x)$ we have
$$\mathcal{E}(f) = \int_X \left\{ \int_Y (y - f(x))^2\, dP(y \mid x) \right\} dP(x)$$
so we see that, for regression, $f^*$ (called the regression function) is
$$f^*(x) = \arg\min_{c \in \mathbb{R}} \int_Y (y - c)^2\, dP(y \mid x) = \text{(exercise)} = \int_Y y\, dP(y \mid x)$$
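
For completeness, here is one way to work the "(exercise)" step (my addition, not in the original slides): minimize the inner integral over the constant $c$.
$$g(c) = \int_Y (y - c)^2\, dP(y \mid x), \qquad g'(c) = -2 \int_Y (y - c)\, dP(y \mid x) = -2\left( \int_Y y\, dP(y \mid x) - c \right)$$
Setting $g'(c) = 0$ gives $c = \int_Y y\, dP(y \mid x) = E[y \mid x]$, and $g''(c) = 2 > 0$ confirms this is a minimum, so $f^*(x) = E[y \mid x]$.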

... and Bayes classifier
For binary classification, $f^*$ (called the Bayes classifier) is
$$f^*(x) = \begin{cases} 1 & \text{if } P(1 \mid x) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$
To see this note, as before, that
$$f^*(x) = \arg\min_{c \in \{0, 1\}} \int_Y (y - c)^2\, dP(y \mid x) = \arg\min_{c \in \{0, 1\}} \left\{ (1 - c)^2 P(1 \mid x) + c^2 P(0 \mid x) \right\}$$
and use $P(0 \mid x) + P(1 \mid x) = 1$.

k-nn revised k-nn attempts to estimate/approximate P (1 x) as 1 k i I x y i Expectation is replaced by averaging over sample data Conditioning at x is relaxed to conditioning on some region close to x One can show that E(f S ) E(f ) as m provided that: k(m) and k(m) m 0 Weakness: the approximation (rate of convergence) depends critically on the input dimension... 31

Least squares revisited
$P(x, y)$ is unknown, so we cannot compute $f^* = \arg\min_f \mathcal{E}(f)$. We are only given a sample (training set) from $P$.
A natural approach: we approximate the expected error $\mathcal{E}(f)$ by the empirical error
$$\mathcal{E}_{\mathrm{emp}}(f) := \frac{1}{m} \sum_{i=1}^m (y_i - f(x_i))^2$$
If we minimize $\mathcal{E}_{\mathrm{emp}}$ over all possible functions, we can always find a function with zero empirical error!
Note: sometimes the (empirical) error is called the (empirical) risk.

Least squares revisited (cont.)
Proposed solution: we introduce a restricted class of functions $H$, called the hypothesis space, possibly with a prior distribution over the functions in $H$.
We minimize $\mathcal{E}_{\mathrm{emp}}$ within $H$. That is, our learning algorithm is:
$$f_S = \arg\min_{f \in H} \mathcal{E}_{\mathrm{emp}}(f)$$
This approach is usually called empirical error minimization.
Linear regression: $H = \{f(x) = w \cdot x + b : w \in \mathbb{R}^d, b \in \mathbb{R}\}$

Summary
- Data $S$ sampled i.i.d. from $P$ (fixed but unknown)
- $f^*$ is what we want, $f_S$ is what we get
- Different approaches to attempt to estimate/approximate $f^*$:
  - minimize $\mathcal{E}_{\mathrm{emp}}$ in some restricted space of functions (e.g. linear)
  - compute a local approximation of $f^*$ (k-NN)
  - estimate $P$ and then use Bayes rule... (we'll discuss this next week)

Perspectives
Theoretical / methodological aspects involved in supervised learning:
- function representation and approximation, to describe $H$
- optimization / numerical methods, to compute $f_S$
- probabilistic methods, to study the generalization error of $f_S$ or to infer the likelihood of $f_S$

Additive noise model
Consider the regression problem. Assume that the output is computed as
$$y = f(x) + \epsilon$$
where $\epsilon$ is a zero-mean random variable, $E[\epsilon] = 0$. Hence we can write
$$P(y, x) = P(y \mid x) P(x) = P_\epsilon(y - f(x))\, P(x)$$
Noise-free model: $y$ is deterministically computed from $x$ ($\epsilon \equiv 0$).

Additive noise model (cont.)
$$y = f(x) + \epsilon, \qquad P(y, x) = P(y \mid x) P(x) = P_\epsilon(y - f(x))\, P(x)$$
The training data is obtained, for $i = 1, \ldots, m$, as follows:
- sample $x_i$ from $P_x$
- sample $\epsilon_i$ from $P_\epsilon$
- set $y_i = f(x_i) + \epsilon_i$
Note: $P(x) \equiv P_x(x)$ (we just use different notation when needed)

Additive noise model (cont.)
$$P(y, x) = P(y \mid x) P_x(x) = P_\epsilon(y - f(x))\, P_x(x)$$
So, since $\epsilon$ has zero mean, we have that
$$f^*(x) := \int y\, dP(y \mid x) = \int y\, dP_\epsilon(y - f(x)) = \int (f(x) + \epsilon)\, dP_\epsilon(\epsilon) = f(x)$$
A common choice for the noise distribution $P_\epsilon$ is a Gaussian:
$$P_\epsilon(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\epsilon^2}{2\sigma^2} \right)$$
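
A small NumPy sketch of this data-generating process (my own illustration, not from the notes); the particular f, the uniform choice of P_x and the value of sigma are arbitrary assumptions. The same generating process is reused in the polynomial-fitting sketch further below.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # True underlying function (arbitrary choice for illustration).
    return np.sin(2 * np.pi * x)

m, sigma = 30, 0.1
x = rng.uniform(0.0, 1.0, size=m)        # sample x_i from P_x (here uniform on [0, 1])
eps = rng.normal(0.0, sigma, size=m)     # sample eps_i from a zero-mean Gaussian P_eps
y = f(x) + eps                           # set y_i = f(x_i) + eps_i

print(x[:3], y[:3])
```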

Maximum likelihood
What is the probability of the data $S$ given that the underlying function is $f$?
$$P(S \mid f) = \prod_{i=1}^m P(y_i, x_i \mid f) = \prod_{i=1}^m P(y_i \mid x_i, f)\, P(x_i) = A \prod_{i=1}^m P(y_i \mid x_i, f)$$
where
$$A = \prod_{i=1}^m P(x_i) = P(x_1, \ldots, x_m)$$
We define the likelihood of $f$ as
$$L(f; S) = P(S \mid f)$$

Maximum likelihood (cont.)
Maximum likelihood principle: compute $f$ by maximizing $L(f; S)$.
If we use linear functions and additive Gaussian noise, we have
$$L(w; S) = A \prod_{i=1}^m (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{ -\frac{(y_i - w \cdot x_i)^2}{2\sigma^2} \right\}$$
In particular the log-likelihood is (note: since the log function is strictly increasing, maximizing $L$ or $\log L$ is the same)
$$\log L(w; S) = -\frac{1}{2\sigma^2} \sum_{i=1}^m (y_i - w \cdot x_i)^2 + \text{const}$$
Hence maximizing the likelihood is equivalent to least squares!
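
To illustrate the equivalence numerically, here is a hedged sketch (my addition): it minimizes the Gaussian negative log-likelihood with scipy.optimize.minimize and compares the result with the ordinary least-squares solution. The synthetic data, the true weights and the noise level are arbitrary choices for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic linear data with additive Gaussian noise (arbitrary parameters).
m, d, sigma = 50, 2, 0.3
X = rng.normal(size=(m, d))
w_true = np.array([1.5, -2.0])
y = X @ w_true + rng.normal(0.0, sigma, size=m)

def neg_log_likelihood(w):
    # Up to an additive constant, -log L(w; S) = (1 / (2 sigma^2)) * sum_i (y_i - w.x_i)^2
    return np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print("MLE:          ", w_mle)
print("Least squares:", w_ls)   # the two estimates coincide up to numerical precision
```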

Supervised learning as function approximation
Given the training data $y_i = f^*(x_i) + \epsilon_i$, the goal is to compute an approximation of $f^*$ in some region of $\mathbb{R}^d$.
We look for an approximant of $f^*$ within a prescribed hypothesis space $H$.
Unless prior knowledge is available on $f^*$ (e.g. $f^*$ is linear) we cannot expect $f^* \in H$.
Choosing $H$ very large leads to overfitting! (we'll see an example of this in a moment)

Polynomial fitting
As an example of hypothesis spaces of increasing complexity, consider regression in one dimension:
$$H_0 = \{ f(x) = b : b \in \mathbb{R} \}$$
$$H_1 = \{ f(x) = ax + b : a, b \in \mathbb{R} \}$$
$$H_2 = \{ f(x) = a_1 x + a_2 x^2 + b : a_1, a_2, b \in \mathbb{R} \}$$
$$\vdots$$
$$H_n = \left\{ f(x) = \sum_{l=1}^n a_l x^l + b : a_1, \ldots, a_n, b \in \mathbb{R} \right\}$$
Let us minimize the empirical error in $H_r$ ($r$ = polynomial degree).

Polynomial fitting (simulation)
(Figure: polynomial fits of degree r = 0, 1, 2, 3, 4, 5 to the same one-dimensional data set, one panel per degree.)
As r increases the fit to the data improves (the empirical error decreases).
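
A sketch reproducing this kind of experiment with NumPy (my addition; the data-generating function and noise level are arbitrary choices, reusing the additive noise sketch above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Data from the additive noise model y = f*(x) + eps.
m, sigma = 15, 0.1
x = rng.uniform(0.0, 1.0, size=m)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=m)

for r in range(0, 6):
    coeffs = np.polyfit(x, y, deg=r)          # least-squares fit within H_r
    y_hat = np.polyval(coeffs, x)
    emp_error = np.mean((y - y_hat) ** 2)     # empirical error on the training data
    print(f"degree r = {r}: empirical error = {emp_error:.4f}")
# The empirical error decreases as r increases, even though large r
# typically generalizes worse (overfitting).
```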

Overfitting vs. underfitting
Compare the empirical error (solid line) with the expected error (dashed line):
- $r$ small: underfitting
- $r$ large: overfitting
The larger $r$, the lower the empirical error of $f_S$! We cannot rely on the training error!
(Figure: empirical and expected error as a function of the polynomial degree.)

k-NN: the effect of k
The smaller k, the more irregular the decision boundary and the smaller the empirical error.
(Figures: decision boundaries for k = 15 and k = 1.)

k-NN: the effect of k and m
- m/k large: overfitting
- m/k small: underfitting

Model selection
How to choose k in k-NN? How to choose the degree r for polynomial regression?
- The simplest approach is to use part of the training data (say 2/3) for training and the rest as a validation set.
- Another approach is K-fold cross-validation (see next slide).
In all cases, use the proxy for the test error to select the best k. We will come back to model selection later in the course.

Cross-validation
- Split the data into K parts (of roughly equal size)
- Repeatedly train on K - 1 parts and test on the part left out
- Average the errors of the K tests to give the so-called cross-validation error
- Smaller K is less expensive, but gives a poorer estimate, as the size of the training set is smaller and random fluctuations are larger
For a dataset of size m, m-fold cross-validation is referred to as leave-one-out (LOO) testing.
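
A minimal NumPy sketch of K-fold cross-validation for choosing the polynomial degree r (my addition; the data, the value K = 5 and the degree range are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Data from the additive noise model, as before (arbitrary choices).
m = 30
x = rng.uniform(0.0, 1.0, size=m)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=m)

def cv_error(x, y, degree, K=5):
    """Average validation error over K folds for a polynomial fit of the given degree."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[train], y[train], deg=degree)
        y_hat = np.polyval(coeffs, x[val])
        errors.append(np.mean((y[val] - y_hat) ** 2))
    return np.mean(errors)

for r in range(0, 8):
    print(f"degree r = {r}: CV error = {cv_error(x, y, r):.4f}")
# Pick the degree with the smallest cross-validation error.
```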

Other learning paradigms
Supervised learning is not the only learning setup!
- Semi-supervised learning: the learning environment may give us access to many input examples, but only a few of them are labelled.
- Active learning: we are given many inputs and we can choose which ones to ask the label for.
- Unsupervised learning: we have only input examples. Here we may want to find data clusters, estimate the probability density of the data, find important features/variables (dimensionality reduction), detect anomalies, etc.