GI01/M055: Supervised Learning

1. Introduction to Supervised Learning
October 5, 2009
John Shawe-Taylor

Course information
1. When: Mondays, 14:00-17:00. Where: Room 1.20, Engineering Building, Malet Place
2. Course webpage: http://www.cs.ucl.ac.uk/staff/j.shawe-taylor/courses/index-gi01.html
3. Mailing list: gi01@cs.ucl.ac.uk, m055@cs.ucl.ac.uk
4. Office: 8.14, CS Building, Malet Place

Assessment
1. Homework (25%) and exam (75%)
2. Three homework assignments (hand them in on time; late submissions are penalised)
3. To pass the course, you must obtain an average of at least 50% when the homework and exam components are weighted together.

Material
Lecture notes: http://www.cs.ucl.ac.uk/staff/j.shawe-taylor/courses/index-gi01.html
Reference book: Hastie, Tibshirani & Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001
Additional material (see webpage for more info):
- General: Duda, Hart & Stork; Mitchell; Bishop; ...
- Bayesian methods in ML: MacKay; ...
- Kernel methods: Shawe-Taylor & Cristianini; Schölkopf & Smola; ...
- Learning theory: Devroye, Lugosi & Györfi; Vapnik; ...

Prerequisites
- Calculus (real-valued functions, limits, derivatives, Taylor series, integrals, ...)
- Elements of probability theory (random variables, expectation, variance, conditional probabilities, Bayes rule, ...)
- Fundamentals of linear algebra (vectors, angles, matrices, eigenvectors/eigenvalues, ...)
- A bit of optimization theory (convex functions, Lagrange multipliers)

Provisional course outline
(Week 1) Key concepts (probabilistic formulation of learning from examples, error functionals, loss function, Bayes rule, learning algorithm, overfitting and underfitting, model selection, cross-validation); some basic learning algorithms (linear regression, k-NN)
(Week 2) Discriminative vs. generative approach
(Week 3) Optimization and learning algorithms
(Week 4) Regularisation, kernels

Provisional course outline
(Week 6) Elements of learning theory (PAC learning, complexity of hypothesis space, VC bounds)
(Week 7) Support Vector Machines / Bayesian interpretation
(Week 8) Tree-based learning algorithms and boosting
Acknowledgments: The lectures are based on earlier years' courses. Thanks to Massi Pontil for the course notes. He also inherited notes from Fernando Perez-Cruz, Iain Murray and Ed Snelson of the Gatsby Unit at UCL.

Today's plan
- Supervised learning problem
- Why and when learning?
- Regression and classification
- Two learning algorithms: least squares and k-NN
- Probabilistic model, error functional, optimal solutions
- Hypothesis space, overfitting and underfitting
- Choice of the learning algorithm (model selection)

Supervised learning problem
Given a set of input/output pairs (training set), we wish to compute the functional relationship between the input and the output: $\mathbf{x} \to f \to y$
Ex. 1 (people detection): given an image, we wish to say whether it depicts a person or not. The output is one of 2 possible categories
Ex. 2 (pose estimation): we wish to predict the pose of a face image. The output is a continuous number (here a real number describing the face rotation angle)
In both problems the input is a high-dimensional vector $\mathbf{x}$ representing pixel intensity/color

Why and when learning?
The aim of learning is to develop software systems able to perform particular tasks such as people detection.
The standard software engineering approach would be to specify the problem, develop an algorithm to compute the solution, and then implement it efficiently.
The problem with this approach is developing the algorithm:
- No known criterion for distinguishing the images;
- In many cases humans have no difficulty;
- Typically the difficulty is to specify the problem in logical terms.

Learning approach
Learning attempts to infer the algorithm from a set of (labelled) examples, in much the same way that children learn by being shown a set of examples (e.g., sports/non-sports car).
It attempts to isolate underlying structure from a set of examples. The approach should be:
- stable: finds something that is not a chance feature of the particular set of examples
- efficient: infers a solution in time polynomial in the size of the data
- robust: should not be too sensitive to mislabelled/noisy examples

People detection example

People detection example (cont.)

Notation
- $\mathcal{X}$: input space (e.g., $\mathcal{X} \subseteq \mathbb{R}^d$), with elements $\mathbf{x}, \mathbf{x}', \mathbf{x}_i, \dots$
- $\mathcal{Y}$: output space, with elements $y, y', y_i, \dots$
- Classification (qualitative output): $\mathcal{Y} = \{c_1, \dots, c_k\}$. Binary classification: $k = 2$
- Regression (quantitative output): $\mathcal{Y} \subseteq \mathbb{R}$
- $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^m$: training set (set of I/O examples)
Supervised learning problem: compute a function which best describes the I/O relationship

Learning algorithm
Training set: $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^m \subseteq \mathcal{X} \times \mathcal{Y}$
A learning algorithm is a mapping $S \mapsto f_S$
A new input $\mathbf{x}$ is predicted as $f_S(\mathbf{x})$
In the course we mainly deal with deterministic algorithms, but we will also comment on some randomized ones
Today: we describe two simple learning algorithms: linear regression and k nearest neighbours

Some important questions
- How is the data collected? (need assumptions!)
- How do we represent the inputs? (may require a preprocessing step)
- How accurate is $f_S$ on new data (study of generalization error)? How do we evaluate the performance of the learning algorithm on unseen data?
- How complex is a learning task? (computational complexity, sample complexity)
- Given two different learning algorithms, $f_S$ and $f'_S$, which one should we choose? (model selection problem)

Some difficulties/aspects of the learning process
- New inputs differ from the ones in the training set (look-up tables do not work!)
- Inputs are measured with noise
- The output is not deterministically obtained from the input
- The input is high dimensional but some components/variables may be irrelevant
- Whenever prior knowledge is available it should be used

More examples / applications
- Optical digit recognition (useful for identifying the numbers in a ZIP code from a digitized image) (Computer vision)
- Predicting house prices based on sq. feet, number of rooms, distance from central London, ... (Marketing)
- Estimating the amount of glucose in the blood of a diabetic person (Medicine)
- Detecting spam emails (Information retrieval)
- Predicting protein functions / structures (Bioinformatics)
- Speaker identification / sound recognition (Speech recognition)

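Binary classification: an example
We describe two basic learning algorithms/models for classification which can be easily adapted to regression as well.
We choose: $\mathcal{X} = \mathbb{R}^2$, $\mathbf{x} = (x_1, x_2)$ and $\mathcal{Y} = \{\text{green}, \text{red}\}$
Our first learning algorithm computes a linear function (perceptron), $\mathbf{w}^\top \mathbf{x} + b$, and classifies an input $\mathbf{x}$ as
$$f(\mathbf{x}) = \begin{cases} \text{red} & \text{if } \mathbf{w}^\top \mathbf{x} + b > 0 \\ \text{green} & \text{if } \mathbf{w}^\top \mathbf{x} + b \le 0 \end{cases}$$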
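The decision rule of the linear classifier can be sketched in a few lines of Python. The weight vector and offset below are hypothetical values chosen only for illustration; how to learn them from data is a separate question.

```python
import numpy as np

def linear_classify(x, w, b):
    """Return 'red' if w.x + b > 0, otherwise 'green'."""
    return "red" if np.dot(w, x) + b > 0 else "green"

# Hypothetical parameters (not taken from the slides).
w = np.array([1.0, -1.0])
b = 0.5

points = [(2.0, 0.0), (0.0, 2.0), (0.0, 0.5)]
labels = [linear_classify(np.array(p), w, b) for p in points]
print(labels)
```

The third point lies exactly on the line $\mathbf{w}^\top \mathbf{x} + b = 0$, so with the convention above it is labelled green.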

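Least squares
How do we compute the parameters $\mathbf{w}$ and $b$?
We do not lose generality if we set $b = 0$ (we can change the data representation as $\mathbf{x} \to (\mathbf{x}, 1)$ and $\mathbf{w} \to (\mathbf{w}, b)$)
We find $\mathbf{w}$ by minimizing the residual sum of squares on the data
$$R(\mathbf{w}) = \sum_{i=1}^m \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2$$
To compute the minimum we need to solve the system of equations
$$\nabla R(\mathbf{w}) = 0 \qquad \left(\text{recall that } \nabla = \left(\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}\right)\right)$$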

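Normal equations
$$R(\mathbf{w}) = \sum_{i=1}^m \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2$$
Note that:
$$\frac{\partial R}{\partial w_k} = 2 \sum_{i=1}^m \left(\mathbf{w}^\top \mathbf{x}_i - y_i\right) \frac{\partial (\mathbf{w}^\top \mathbf{x}_i)}{\partial w_k} = 2 \sum_{i=1}^m \left(\mathbf{w}^\top \mathbf{x}_i - y_i\right) x_{ik}$$
Hence, to find $\mathbf{w} = (w_1, w_2)$ we need to solve the linear system of equations
$$\sum_{i=1}^m \left(x_{ik} x_{i1} w_1 + x_{ik} x_{i2} w_2\right) = \sum_{i=1}^m x_{ik} y_i, \qquad k = 1, 2$$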

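Normal equations (cont.)
In vector notation:
$$\sum_{i=1}^m \mathbf{x}_i \mathbf{x}_i^\top \mathbf{w} = \sum_{i=1}^m \mathbf{x}_i y_i$$
In matrix notation:
$$X^\top X \mathbf{w} = X^\top \mathbf{y}$$
where
$$X = [\mathbf{x}_1, \dots, \mathbf{x}_m]^\top = \begin{bmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & & \vdots \\ x_{m1} & \cdots & x_{md} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}$$
Note: here $d = 2$ but, clearly, the result holds for any $d \in \mathbb{N}$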

Least squares solution
$$X^\top X \mathbf{w} = X^\top \mathbf{y}$$
For the time being we will assume that the matrix $X^\top X$ is invertible, so we conclude that
$$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$$
Otherwise, the solution may not be unique... (we will deal with the case that $X^\top X$ is not invertible later)
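As a minimal numerical sketch of the least squares solution, the snippet below builds a synthetic data set (the weights `w_true`, the noise level and the sample size are assumptions made for illustration), solves the normal equations directly, and cross-checks the result against NumPy's least squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 2
X = rng.normal(size=(m, d))            # row i is the input x_i
w_true = np.array([2.0, -1.0])         # hypothetical weights used to generate the data
y = X @ w_true + 0.1 * rng.normal(size=m)

# Solve the normal equations X^T X w = X^T y (assumes X^T X is invertible).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal, w_lstsq)
```

Solving the linear system with `np.linalg.solve` is generally preferable to forming the inverse of $X^\top X$ explicitly.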

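Going back to b
Substituting $\mathbf{x}$ by $(\mathbf{x}, 1)$ and $\mathbf{w}$ by $(\mathbf{w}, b)$, the above system of equations can be expressed in matrix form as (exercise):
$$(X^\top X)\mathbf{w} + X^\top \mathbf{1}\, b = X^\top \mathbf{y}, \qquad \mathbf{1}^\top X \mathbf{w} + m b = \mathbf{1}^\top \mathbf{y}$$
that is
$$\begin{bmatrix} X^\top X & X^\top \mathbf{1} \\ \mathbf{1}^\top X & m \end{bmatrix} \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix} = \begin{bmatrix} X^\top \mathbf{y} \\ \mathbf{1}^\top \mathbf{y} \end{bmatrix}$$
where $\mathbf{1} = (1, 1, \dots, 1)^\top$ is the $m \times 1$ vector of ones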
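The equivalence between augmenting the inputs with a constant 1 and solving the block system can be checked numerically. This is a sketch on synthetic data; the generating weights and offset are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 40, 2
X = rng.normal(size=(m, d))
# Hypothetical generating model: w = (1.5, -0.5), b = 3.0, small Gaussian noise.
y = X @ np.array([1.5, -0.5]) + 3.0 + 0.05 * rng.normal(size=m)

ones = np.ones(m)

# Route 1: append a constant feature to every input and solve for (w, b) jointly.
X_aug = np.column_stack([X, ones])
wb_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

# Route 2: assemble the block system [[X^T X, X^T 1], [1^T X, m]] explicitly.
A = np.block([[X.T @ X, (X.T @ ones)[:, None]],
              [(ones @ X)[None, :], np.array([[float(m)]])]])
rhs = np.concatenate([X.T @ y, [ones @ y]])
wb_block = np.linalg.solve(A, rhs)

print(wb_aug, wb_block)
```

Both routes give the same vector, whose last entry is the estimate of $b$.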

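A different approach: k nearest neighbours
Let $N(\mathbf{x}; k)$ be the set of $k$ nearest training inputs to $\mathbf{x}$ and $I_{\mathbf{x}} = \{i : \mathbf{x}_i \in N(\mathbf{x}; k)\}$ the corresponding index set. With the labels encoded as red $= 1$ and green $= 0$:
$$f(\mathbf{x}) = \begin{cases} \text{red} & \text{if } \frac{1}{k} \sum_{i \in I_{\mathbf{x}}} y_i > \frac{1}{2} \\ \text{green} & \text{if } \frac{1}{k} \sum_{i \in I_{\mathbf{x}}} y_i \le \frac{1}{2} \end{cases}$$
- Closeness is measured using a metric (e.g., Euclidean distance)
- Local rule (compute local majority vote)
- Decision boundary is non-linear
Note: for regression we set $f(\mathbf{x}) = \frac{1}{k} \sum_{i \in I_{\mathbf{x}}} y_i$ (a "local mean")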
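A minimal sketch of the k-NN rule, assuming the two classes are encoded as 0 and 1 (e.g., green = 0, red = 1) and closeness is measured by Euclidean distance. The tiny data set is hypothetical.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k, regression=False):
    """k-NN prediction at x with Euclidean distance; labels are 0/1 for classification."""
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k]              # index set I_x of the k nearest inputs
    local_mean = y_train[idx].mean()         # (1/k) * sum of y_i over I_x
    if regression:
        return local_mean
    return 1 if local_mean > 0.5 else 0      # local majority vote; ties go to class 0

# Hypothetical data: class 0 near the origin, class 1 near (3, 3).
X_train = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
                    [3.0, 3.0], [3.5, 3.0], [3.0, 3.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(np.array([0.2, 0.2]), X_train, y_train, k=3)
pred_b = knn_predict(np.array([2.8, 3.1]), X_train, y_train, k=3)
print(pred_a, pred_b)
```

Passing `regression=True` returns the local mean itself instead of thresholding it at 1/2.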

k-NN: the effect of k
The smaller k, the more irregular the decision boundary
[Figure: decision boundaries for k = 15 and k = 1]
How to choose k? Later...

Linear regression vs. k-NN (informal)
- Global vs. local
- Linear vs. non-linear
- Bias/variance considerations: LR relies heavily on the linear assumption (may have large bias); k-NN does not
- LR is stable (the solution does not change much if the data are perturbed); 1-NN isn't!
- k-NN is sensitive to the input dimension d: if d is high, the inputs tend to be far away from each other!

Supervised learning model (back to Q1)
We assume that the data is obtained by sampling i.i.d. from a fixed but unknown probability density $P(\mathbf{x}, y)$
Expected error:
$$\mathcal{E}(f) := \mathbb{E}\left[(y - f(\mathbf{x}))^2\right] = \int (y - f(\mathbf{x}))^2 \, dP(\mathbf{x}, y)$$
Our goal is to minimize $\mathcal{E}$
Optimal solution: $f^* := \operatorname{argmin}_f \mathcal{E}(f)$
But: in order to compute $f^*$ we need to know $P$!
Note: for binary classification with $\mathcal{Y} = \{0, 1\}$ and $f : \mathcal{X} \to \mathcal{Y}$, $\mathcal{E}(f)$ counts the average number of mistakes of $f$ (aka expected misclassification error)

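Regression function...
Let us compute the optimal solution $f^*$ for regression ($\mathcal{Y} = \mathbb{R}$) and binary classification ($\mathcal{Y} = \{0, 1\}$)
Using the decomposition $P(y, \mathbf{x}) = P(y \mid \mathbf{x}) P(\mathbf{x})$ we have
$$\mathcal{E}(f) = \int_{\mathcal{X}} \left\{ \int_{\mathcal{Y}} \left(y - f(\mathbf{x})\right)^2 dP(y \mid \mathbf{x}) \right\} dP(\mathbf{x})$$
so we see that for regression, $f^*$ (called the regression function) is
$$f^*(\mathbf{x}) = \operatorname{argmin}_{c \in \mathbb{R}} \int_{\mathcal{Y}} (y - c)^2 \, dP(y \mid \mathbf{x}) = \text{(exercise)} = \int_{\mathcal{Y}} y \, dP(y \mid \mathbf{x})$$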
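One possible way to carry out the exercise is sketched below. It assumes the conditional second moment of $y$ is finite, so that we may differentiate under the integral sign, and uses $\int_{\mathcal{Y}} dP(y \mid \mathbf{x}) = 1$.

```latex
\begin{align*}
\frac{d}{dc} \int_{\mathcal{Y}} (y - c)^2 \, dP(y \mid \mathbf{x})
  &= -2 \int_{\mathcal{Y}} (y - c) \, dP(y \mid \mathbf{x})
   = -2 \left( \int_{\mathcal{Y}} y \, dP(y \mid \mathbf{x}) - c \right) = 0 \\
\Longrightarrow \quad c &= \int_{\mathcal{Y}} y \, dP(y \mid \mathbf{x})
\end{align*}
```

The second derivative with respect to $c$ equals 2, which is positive, so this stationary point is the minimum: the optimal prediction at $\mathbf{x}$ is the conditional mean of $y$.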

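... and Bayes classifier
For binary classification, $f^*$ (called the Bayes classifier) is
$$f^*(\mathbf{x}) = \begin{cases} 1 & \text{if } P(1 \mid \mathbf{x}) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$
To see this note, as before, that
$$f^*(\mathbf{x}) = \operatorname{argmin}_{c \in \{0, 1\}} \left\{ \int_{\mathcal{Y}} (y - c)^2 \, dP(y \mid \mathbf{x}) = (1 - c)^2 P(1 \mid \mathbf{x}) + c^2 P(0 \mid \mathbf{x}) \right\}$$
and use $P(0 \mid \mathbf{x}) + P(1 \mid \mathbf{x}) = 1$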

k-NN revisited
k-NN attempts to estimate/approximate $P(1 \mid \mathbf{x})$ as $\frac{1}{k} \sum_{i \in I_{\mathbf{x}}} y_i$
- Expectation is replaced by averaging over sample data
- Conditioning at $\mathbf{x}$ is relaxed to conditioning on some region close to $\mathbf{x}$
One can show that $\mathcal{E}(f_S) \to \mathcal{E}(f^*)$ as $m \to \infty$ provided that $k(m) \to \infty$ and $k(m)/m \to 0$
Weakness: the approximation (rate of convergence) depends critically on the input dimension...

Least squares revisited
$P(\mathbf{x}, y)$ is unknown, so we cannot compute $f^* = \operatorname{argmin}_f \mathcal{E}(f)$
We are only given a sample (training set) from $P$
A natural approach: we approximate the expected error $\mathcal{E}(f)$ by the empirical error
$$\mathcal{E}_{\mathrm{emp}}(f) := \frac{1}{m} \sum_{i=1}^m \left(y_i - f(\mathbf{x}_i)\right)^2$$
If we minimize $\mathcal{E}_{\mathrm{emp}}$ over all possible functions, we can always find a function with zero empirical error!
Note: sometimes the (empirical) error is called the (empirical) risk
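To illustrate why unrestricted minimization of the empirical error is uninformative, here is a sketch of a function that simply memorizes the training set. The data-generating process (a noisy sine curve) is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    # Hypothetical data source: noisy sine curve on [0, 1].
    x = rng.uniform(size=m)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=m)
    return x, y

x_train, y_train = sample(20)

# Look-up table: returns the stored output on training inputs, 0 elsewhere.
table = dict(zip(x_train.tolist(), y_train.tolist()))

def f_memorize(x_new):
    return table.get(float(x_new), 0.0)

def mean_squared_error(f, x, y):
    return float(np.mean([(yi - f(xi)) ** 2 for xi, yi in zip(x, y)]))

emp_error = mean_squared_error(f_memorize, x_train, y_train)
x_new, y_new = sample(1000)
new_error = mean_squared_error(f_memorize, x_new, y_new)
print(emp_error, new_error)
```

The empirical error is exactly zero, yet the function is useless on fresh inputs, which is why the class of candidate functions has to be restricted.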

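Least squares revisited (cont.)
Proposed solution: we introduce a restricted class of functions $\mathcal{H}$, called the hypothesis space, possibly with a prior distribution over the functions in $\mathcal{H}$
We minimize $\mathcal{E}_{\mathrm{emp}}$ within $\mathcal{H}$. That is, our learning algorithm is:
$$f_S = \operatorname{argmin}_{f \in \mathcal{H}} \mathcal{E}_{\mathrm{emp}}(f)$$
This approach is usually called empirical error minimization
Linear regression: $\mathcal{H} = \{f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b : \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}$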

Summary
Data $S$ sampled i.i.d. from $P$ (fixed but unknown)
$f^*$ is what we want, $f_S$ is what we get
Different approaches to attempt to estimate/approximate $f^*$:
- Minimize $\mathcal{E}_{\mathrm{emp}}$ in some restricted space of functions (e.g., linear)
- Compute a local approximation of $f^*$ (k-NN)
- Estimate $P$ and then use Bayes rule... (we'll discuss this next week)

Perspectives
Theoretical / methodological aspects involved in supervised learning:
- Function representation and approximation to describe $\mathcal{H}$
- Optimization/numerical methods to compute $f_S$
- Probabilistic methods to study the generalization error of $f_S$ or to infer the likelihood of $f_S$

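Additive noise model
Consider the regression problem. Assume that the output is computed as
$$y = f(\mathbf{x}) + \epsilon$$
where $\epsilon$ is a zero mean r.v., i.e. $\mathbb{E}[\epsilon] = 0$. Hence we can write
$$P(y, \mathbf{x}) = P(y \mid \mathbf{x}) P(\mathbf{x}) = P_\epsilon\left(y - f(\mathbf{x})\right) P(\mathbf{x})$$
Noise-free model: $y$ deterministically computed from $\mathbf{x}$ ($\epsilon \equiv 0$)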

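Additive noise model (cont.)
$$y = f(\mathbf{x}) + \epsilon, \qquad P(y, \mathbf{x}) = P(y \mid \mathbf{x}) P(\mathbf{x}) = P_\epsilon\left(y - f(\mathbf{x})\right) P(\mathbf{x})$$
The training data is obtained, for $i = 1, \dots, m$, as follows:
- sample $\mathbf{x}_i$ from $P_{\mathbf{x}}$
- sample $\epsilon_i$ from $P_\epsilon$
- set $y_i = f(\mathbf{x}_i) + \epsilon_i$
Note: $P(\mathbf{x}) \equiv P_{\mathbf{x}}(\mathbf{x})$ (we just use different notation when needed)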
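The three sampling steps can be sketched as follows. The choices of input distribution (uniform on [0, 1]), noise distribution (zero-mean Gaussian) and underlying function (a sine) are assumptions made purely for illustration.

```python
import numpy as np

def sample_training_set(f, m, sigma, rng):
    """Draw m pairs (x_i, y_i) with y_i = f(x_i) + eps_i."""
    x = rng.uniform(0.0, 1.0, size=m)       # x_i ~ P_x (assumed uniform on [0, 1])
    eps = rng.normal(0.0, sigma, size=m)    # eps_i ~ P_eps (assumed zero-mean Gaussian)
    y = f(x) + eps
    return x, y

def f(x):
    # Hypothetical underlying function.
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x, y = sample_training_set(f, m=10000, sigma=0.1, rng=rng)

residual_mean = np.mean(y - f(x))   # should be close to E[eps] = 0
residual_std = np.std(y - f(x))     # should be close to sigma
print(residual_mean, residual_std)
```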

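Additive noise model (cont.)
$$P(y, \mathbf{x}) = P(y \mid \mathbf{x}) P_{\mathbf{x}}(\mathbf{x}) = P_\epsilon\left(y - f(\mathbf{x})\right) P_{\mathbf{x}}(\mathbf{x})$$
So, since $\epsilon$ has zero mean, we have that
$$f^*(\mathbf{x}) := \int y \, dP(y \mid \mathbf{x}) = \int y \, dP_\epsilon\left(y - f(\mathbf{x})\right) = \int \left(f(\mathbf{x}) + \epsilon\right) dP_\epsilon(\epsilon) = f(\mathbf{x})$$
A common choice for the noise distribution $P_\epsilon$ is a Gaussian:
$$P_\epsilon(\epsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$$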

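Maximum likelihood
What is the probability of the data $S$ given that the underlying function is $f$?
$$P(S \mid f) = \prod_{i=1}^m P(y_i, \mathbf{x}_i \mid f) = \prod_{i=1}^m P(y_i \mid \mathbf{x}_i, f) P(\mathbf{x}_i) = \prod_{i=1}^m P(\mathbf{x}_i) \prod_{i=1}^m P(y_i \mid \mathbf{x}_i, f) = A \prod_{i=1}^m P(y_i \mid \mathbf{x}_i, f)$$
where
$$A = \prod_{i=1}^m P(\mathbf{x}_i) = P(\mathbf{x}_1, \dots, \mathbf{x}_m)$$
We define the likelihood of $f$ as
$$L(f; S) = P(S \mid f)$$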

Maximum likelihood (cont.)
Maximum likelihood principle: compute $f$ by maximizing $L(f; S)$
If we use linear functions and additive Gaussian noise, we have
$$L(\mathbf{w}; S) = A \prod_{i=1}^m \left(2\pi\sigma^2\right)^{-\frac{1}{2}} \exp\left\{-\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2}\right\}$$
In particular the log likelihood is (note: since the log function is strictly increasing, maximizing $L$ or $\log L$ is the same)
$$\log L(\mathbf{w}; S) = -\frac{1}{2\sigma^2} \sum_{i=1}^m \left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \text{const}$$
Hence maximizing the likelihood is equivalent to least squares!
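The equivalence can be checked numerically. The sketch below, on synthetic data with hypothetical generating weights and a known noise level, evaluates the negative log likelihood (dropping the constant $-\log A$) and confirms that random perturbations of the least squares solution never lower it.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 100, 0.5
X = rng.normal(size=(m, 2))
y = X @ np.array([1.0, -2.0]) + sigma * rng.normal(size=m)   # hypothetical generating weights

def neg_log_likelihood(w):
    """-log L(w; S) under Gaussian noise, up to the additive constant -log A."""
    r = y - X @ w
    return 0.5 * m * np.log(2 * np.pi * sigma**2) + np.sum(r**2) / (2 * sigma**2)

# Least squares solution from the normal equations.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
nll_ls = neg_log_likelihood(w_ls)

# Perturbing w_ls in random directions should only increase the negative log likelihood.
perturbed = [neg_log_likelihood(w_ls + 0.1 * rng.normal(size=2)) for _ in range(100)]
print(nll_ls, min(perturbed))
```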

Supervised learning as function approximation
Given the training data $y_i = f^*(\mathbf{x}_i) + \epsilon_i$, the goal is to compute an approximation of $f^*$ in some region of $\mathbb{R}^d$.
We look for an approximant of $f^*$ within a prescribed hypothesis space $\mathcal{H}$
Unless prior knowledge is available on $f^*$ (e.g., $f^*$ is linear) we cannot expect $f^* \in \mathcal{H}$
Choosing $\mathcal{H}$ very large leads to overfitting! (we'll see an example of this in a moment)

Polynomial fitting
As an example of hypothesis spaces of increasing complexity consider regression in one dimension
$$\mathcal{H}_0 = \{f(x) = b : b \in \mathbb{R}\}$$
$$\mathcal{H}_1 = \{f(x) = ax + b : a, b \in \mathbb{R}\}$$
$$\mathcal{H}_2 = \{f(x) = a_1 x + a_2 x^2 + b : a_1, a_2, b \in \mathbb{R}\}$$
$$\vdots$$
$$\mathcal{H}_n = \left\{f(x) = \sum_{l=1}^n a_l x^l + b : a_1, \dots, a_n, b \in \mathbb{R}\right\}$$
Let us minimize the empirical error in $\mathcal{H}_r$ ($r$ = polynomial degree)

Polynomial fitting (simulation)
[Figure: six panels showing polynomial fits to the data (axes x and y) for r = 0, 1, 2, 3, 4, 5]
As r increases the fit to the data improves (empirical error decreases)

Overfitting vs. underfitting
Compare the empirical error (solid line) with the expected error (dashed line)
[Figure: empirical error and expected error plotted against the polynomial degree, for degrees 0 to 9]
- r small: underfitting
- r large: overfitting
The larger r, the lower the empirical error of $f_S$! We cannot rely on the training error!
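A small simulation in the spirit of the comparison above: fit polynomials of increasing degree by least squares and record the error on the training set and on a large fresh sample, which serves as a stand-in for the expected error. The target function, noise level and sample sizes are assumptions and are not the ones used for the original figures.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical target function.
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(size=15)
y_train = f(x_train) + 0.2 * rng.normal(size=15)
x_test = rng.uniform(size=1000)                     # large fresh sample
y_test = f(x_test) + 0.2 * rng.normal(size=1000)

def design(x, r):
    """Design matrix with columns 1, x, x^2, ..., x^r."""
    return np.vander(x, r + 1, increasing=True)

train_err, test_err = [], []
for r in range(10):
    w, *_ = np.linalg.lstsq(design(x_train, r), y_train, rcond=None)
    train_err.append(np.mean((y_train - design(x_train, r) @ w) ** 2))
    test_err.append(np.mean((y_test - design(x_test, r) @ w) ** 2))

for r in range(10):
    print(r, train_err[r], test_err[r])
```

Because the hypothesis spaces are nested, the training error can only go down as the degree grows, whereas the error on fresh data need not.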

k-NN: the effect of k
The smaller k, the more irregular the decision boundary and the smaller the empirical error
[Figure: decision boundaries for k = 15 and k = 1]

k-NN: the effect of k
- $m/k$ large: overfitting
- $m/k$ small: underfitting

Model selection
How to choose k in k-NN? How to choose the degree r for polynomial regression?
- The simplest approach is to use part of the training data (say 2/3) for training and the rest as a validation set
- Another approach is K-fold cross-validation (see next slide)
In all cases, use the resulting proxy for the test error to select the best k (or r)
We will come back to model selection later in the course
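A minimal sketch of the hold-out approach, applied to choosing the polynomial degree r. The synthetic data, the 2/3 split and the range of candidate degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 60
x = rng.uniform(size=m)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=m)   # hypothetical data

# Random split: 2/3 for training, the rest for validation.
perm = rng.permutation(m)
n_train = (2 * m) // 3
tr, va = perm[:n_train], perm[n_train:]

def fit_poly(x, y, r):
    """Least squares fit of a degree-r polynomial; returns coefficients of 1, x, ..., x^r."""
    w, *_ = np.linalg.lstsq(np.vander(x, r + 1, increasing=True), y, rcond=None)
    return w

def mse(w, x, y):
    return float(np.mean((y - np.vander(x, len(w), increasing=True) @ w) ** 2))

# Validation error for each candidate degree; pick the smallest.
val_err = {r: mse(fit_poly(x[tr], y[tr], r), x[va], y[va]) for r in range(8)}
best_r = min(val_err, key=val_err.get)
print(best_r, val_err[best_r])
```

The same pattern applies to choosing k in k-NN: replace the polynomial fit by the k-NN predictor and loop over candidate values of k.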

Cross-validation
- We split the data into K parts (of roughly equal size)
- We repeatedly train on K − 1 parts and test on the part left out
- We average the errors of the K tests to give the so-called cross-validation error
- Smaller K is less expensive, but gives a poorer estimate, as the training sets are smaller and random fluctuations are larger
For a dataset of size m, m-fold cross-validation is referred to as leave-one-out (LOO) testing
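The procedure can be sketched generically as below. The helper name `k_fold_cv_error`, the example learner (a straight-line fit) and the synthetic data are all assumptions introduced for illustration.

```python
import numpy as np

def k_fold_cv_error(x, y, K, fit, predict, rng):
    """Average squared error over K held-out folds (hypothetical helper)."""
    folds = np.array_split(rng.permutation(len(x)), K)   # K parts of roughly equal size
    errors = []
    for j in range(K):
        test_idx = folds[j]
        train_idx = np.concatenate([folds[i] for i in range(K) if i != j])
        model = fit(x[train_idx], y[train_idx])          # train on K - 1 parts
        errors.append(np.mean((y[test_idx] - predict(model, x[test_idx])) ** 2))
    return float(np.mean(errors))                        # average of the K test errors

# Example learner: 1-D linear regression y = a*x + b fitted by least squares.
def fit(x, y):
    return np.polyfit(x, y, 1)

def predict(coef, x):
    return np.polyval(coef, x)

rng = np.random.default_rng(0)
x = rng.uniform(size=30)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=30)            # hypothetical data

cv5 = k_fold_cv_error(x, y, K=5, fit=fit, predict=predict, rng=rng)
loo = k_fold_cv_error(x, y, K=len(x), fit=fit, predict=predict, rng=rng)  # leave-one-out
print(cv5, loo)
```

Setting `K=len(x)` gives leave-one-out testing, at the cost of one model fit per data point.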

Other learning paradigms
Supervised learning is not the only learning setup!
- Semi-supervised learning: the learning environment may give us access to many input examples, but only a few of them are labelled
- Active learning: we are given many inputs and we can choose which ones to ask the label for
- Unsupervised learning: we have only input examples. Here we may want to find data clusters, estimate the probability density of the data, find important features/variables (dimensionality reduction problem), detect anomalies, etc.