Discriminative Learning and Big Data

1 AIMS-CDT Michaelmas 2016. Discriminative Learning and Big Data. Lecture 2: Other loss functions and ANN. Andrew Zisserman, Visual Geometry Group, University of Oxford.

2 Lecture Outline.
Regression: ridge regression; L1 regularization (Lasso); epsilon-insensitive loss.
Multi-class classification: using binary classifiers.
Part II, Approximate Nearest Neighbours (ANN): locality sensitive hashing; randomized k-d trees; product quantization.

3 Regression. Suppose we are given a training set of $N$ observations $((x_1, y_1), \ldots, (x_N, y_N))$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$. The regression problem is to estimate $f(x)$ from this data such that $y_i = f(x_i)$.

4 Example: Real time head pose regression Fanelli et al., DAGM 2011

6 Learning by optimization. As in the case of classification, learning a regressor can be formulated as an optimization: minimize with respect to $f \in \mathcal{F}$

$$\sum_{i=1}^{N} l(f(x_i), y_i) + \lambda R(f)$$

where the first term is the loss function and the second is the regularization. There is a choice of regression function $f(\cdot)$, and also of loss functions and regularization, e.g. squared loss or SVM hinge-like loss; squared regularizer or lasso regularizer.

7 Reminder: Regression example: polynomial curve fitting (from Bishop). The green curve is the true function (which is not a polynomial). The data points are uniform in $x$ but have noise in $y$. We will use a loss function that measures the squared error in the prediction of $y(x)$ from $x$. The loss for the red polynomial is the sum of the squared vertical errors between the target values and the polynomial regression curve.

8 Choice of regression function: non-linear basis functions. The function for regression $f(x, w)$ is a non-linear function of $x$, but linear in $w$:

$$f(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = w^\top \Phi(x)$$

For example, for $x \in \mathbb{R}$, polynomial regression with $\phi_j(x) = x^j$:

$$f(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = \sum_{j=0}^{M} w_j x^j$$

e.g. for $M = 3$,

$$f(x, w) = (w_0, w_1, w_2, w_3) \begin{pmatrix} 1 \\ x \\ x^2 \\ x^3 \end{pmatrix} = w^\top \Phi(x), \qquad \Phi : x \to \Phi(x), \quad \mathbb{R}^1 \to \mathbb{R}^4$$

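9 or the basis functions can be Gaussians centred on the training data:

$$\phi_j(x) = \exp\left(-(x - x_j)^2 / 2\sigma^2\right)$$

e.g. for 3 points,

$$f(x, w) = (w_1, w_2, w_3) \begin{pmatrix} e^{-(x - x_1)^2 / 2\sigma^2} \\ e^{-(x - x_2)^2 / 2\sigma^2} \\ e^{-(x - x_3)^2 / 2\sigma^2} \end{pmatrix} = w^\top \Phi(x), \qquad \Phi : x \to \Phi(x), \quad \mathbb{R}^1 \to \mathbb{R}^3$$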
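The two basis choices on slides 8 and 9 can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function names, the weights and the centres are hypothetical, and the Gaussian map omits the bias term, as in the 3-point example.

```python
import math

def poly_features(x, M):
    """Phi(x) = (1, x, x^2, ..., x^M) for a scalar x."""
    return [x ** j for j in range(M + 1)]

def gaussian_features(x, centres, sigma):
    """Phi(x) with one Gaussian bump per centre x_j."""
    return [math.exp(-(x - c) ** 2 / (2.0 * sigma ** 2)) for c in centres]

def predict(w, phi):
    """f(x, w) = w^T Phi(x): linear in w whatever the basis."""
    return sum(wj * pj for wj, pj in zip(w, phi))

# Hypothetical weights for a cubic (M = 3): f(x) = 1 - 2 x^2 + 0.5 x^3
w_poly = [1.0, 0.0, -2.0, 0.5]
f_poly = predict(w_poly, poly_features(2.0, 3))

# Hypothetical centres standing in for three training points
centres = [0.0, 1.0, 2.0]
f_gauss = predict([1.0, 1.0, 1.0], gaussian_features(1.0, centres, sigma=0.5))
```

In both cases only the feature map changes; the prediction stays an inner product with $w$, which is what keeps the later optimization problems tractable.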

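10 Least squares ridge regression. Cost function with squared loss:

$$\sum_{i=1}^{N} (y_i - f(x_i, w))^2 + \lambda \|w\|^2$$

where $y_i$ is the target value for $x_i$, the first term is the loss function and the second is the regularization. Regression function for $x$ (1D):

$$f(x, w) = w_0 + w_1 \phi_1(x) + w_2 \phi_2(x) + \ldots + w_M \phi_M(x) = w^\top \Phi(x)$$

NB squared loss arises in Maximum Likelihood estimation for an error model $y_i = \tilde{y}_i + n_i$, $n_i \sim N(0, \sigma^2)$, where $y_i$ is the measured value and $\tilde{y}_i$ the true value. The cost is convex and has a closed form solution.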
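The slide states that the ridge cost has a closed form solution without writing it out. Setting the gradient of the cost to zero gives the standard normal equations $(\Phi^\top \Phi + \lambda I)\, w = \Phi^\top y$, where row $i$ of the matrix $\Phi$ is $\Phi(x_i)^\top$. The sketch below solves them in plain Python; the helper names and the toy data are assumptions for illustration, and the bias weight is regularized along with the others, as in $\lambda \|w\|^2$.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge_fit(Phi, y, lam):
    """Return w minimizing sum_i (y_i - w^T Phi_i)^2 + lam * ||w||^2."""
    d = len(Phi[0])
    # Normal equations: (Phi^T Phi + lam I) w = Phi^T y
    A = [[sum(row[a] * row[b] for row in Phi) + (lam if a == b else 0.0)
          for b in range(d)] for a in range(d)]
    rhs = [sum(row[a] * yi for row, yi in zip(Phi, y)) for a in range(d)]
    return solve(A, rhs)

# Toy data generated from y = 1 + 2x, with features Phi(x) = (1, x)
xs = [0.0, 1.0, 2.0, 3.0]
Phi = [[1.0, x] for x in xs]
y = [1.0 + 2.0 * x for x in xs]

w_ls = ridge_fit(Phi, y, 0.0)      # lam = 0: ordinary least squares
w_ridge = ridge_fit(Phi, y, 10.0)  # lam > 0: weights shrink towards zero
```

With $\lambda = 0$ the fit recovers the generating weights; increasing $\lambda$ shrinks the weight norm but does not set any weight exactly to zero.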

11 Regression cost functions. Minimize with respect to $w$

$$\sum_{i=1}^{N} l(f(x_i, w), y_i) + \lambda R(w)$$

where the first term is the loss function and the second is the regularization. There is a choice of both loss functions and regularization. So far we have seen ridge regression:

squared loss: $\sum_{i=1}^{N} (y_i - f(x_i, w))^2$

squared regularizer: $\lambda \|w\|^2$

Now, consider other losses and regularizers.

12 The Lasso or $L_1$ norm regularization. LASSO = Least Absolute Shrinkage and Selection Operator. Minimize with respect to $w \in \mathbb{R}^d$

$$\sum_{i=1}^{N} (y_i - f(x_i, w))^2 + \lambda \sum_{j}^{d} |w_j|$$

where the first term is the loss function and the second is the regularization. This is a quadratic optimization problem. There is a unique solution.

p-norm definition: $\|w\|_p = \left( \sum_{j=1}^{d} |w_j|^p \right)^{1/p}$

13 Sparsity property of the Lasso. Contour plots for $d = 2$ of the loss $\sum_{i=1}^{N} (y_i - f(x_i, w))^2$, with the ridge regression regularizer $\lambda \|w\|^2$ and the lasso regularizer $\lambda \sum_{j}^{d} |w_j|$. The minimum is where the loss contours are tangent to the regularizer's contours. For the lasso case, minima occur at corners; consequently one of the weights is zero. In high dimensions many weights can be zero.
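The sparsity property can be seen numerically on a toy problem. The sketch below minimizes the lasso cost with coordinate descent and soft-thresholding. That solver is a common choice but is an assumption here, not the method used in the lecture, which only notes that the problem is a quadratic optimization. The data and function names are hypothetical, and $f(x, w) = w^\top x$ with no bias term.

```python
def soft_threshold(rho, t):
    """Shrink rho towards zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_cd(X, y, lam, n_iters=100):
    """Minimize sum_i (y_i - w^T x_i)^2 + lam * sum_j |w_j| by coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        for j in range(d):
            # Correlation of feature j with the residual that excludes feature j
            rho = sum(
                X[i][j] * (y[i] - sum(w[k] * X[i][k] for k in range(d) if k != j))
                for i in range(n)
            )
            z = sum(X[i][j] ** 2 for i in range(n))
            # Exact 1D minimizer of z*w^2 - 2*rho*w + lam*|w|
            w[j] = soft_threshold(rho, lam / 2.0) / z
    return w

# Toy data: y depends only on the first feature; the second is irrelevant
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.2], [4.0, -0.4]]
y = [2.0, 4.0, 6.0, 8.0]
w_lasso = lasso_cd(X, y, lam=1.0)
```

On this data the weight of the irrelevant feature is exactly zero, while the relevant weight is shrunk slightly below its least squares value of 2.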

15 Comparison of learnt weight vectors. [Figure: weight vectors learnt by least squares, ridge regression and LASSO for linear regression $y = w \cdot x$, N = d = 10.]

16 Lasso in action. [Figure: coefficient values against the regularization parameter lambda (percent of lambdamax), for ridge regression and for L1-regularized least squares.]

17 Sparse weight vectors. Weights being zero is a method of feature selection: zeroing out the unimportant features. The SVM classifier also has this property (sparse $\alpha$ in the dual representation). Ridge regression does not.

18 Reminder: SVM Primal and dual formulations. $N$ is the number of training points, and $d$ is the dimension of the feature vector $x$.

Primal problem: for $w \in \mathbb{R}^d$

$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_{i}^{N} \max(0, 1 - y_i f(x_i))$$

Dual problem: for $\alpha \in \mathbb{R}^N$ (stated without proof):

$$\max_{\alpha_i \geq 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k (x_j^\top x_k) \quad \text{subject to } 0 \leq \alpha_i \leq C \text{ for } \forall i, \text{ and } \sum_i \alpha_i y_i = 0$$

Need to learn $d$ parameters for the primal, and $N$ for the dual. If $N \ll d$ then it is more efficient to solve for $\alpha$ than $w$. The dual form only involves $(x_j^\top x_k)$. We will return to why this is an advantage when we look at kernels.
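To make the primal cost concrete, the sketch below evaluates it for a given weight vector. It assumes a linear classifier $f(x) = w^\top x + b$, which this slide does not spell out; the function name and the toy data are hypothetical.

```python
def svm_primal_objective(w, b, X, y, C):
    """Evaluate ||w||^2 + C * sum_i max(0, 1 - y_i f(x_i)) with f(x) = w^T x + b."""
    reg = sum(wj * wj for wj in w)
    hinge = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge += max(0.0, 1.0 - yi * score)  # zero once the margin y_i f(x_i) >= 1
    return reg + C * hinge

# Toy data: the third point lies inside the margin and is the only one penalized
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1.0, -1.0, 1.0]
objective = svm_primal_objective([1.0, 0.0], 0.0, X, y, C=10.0)
```

Only points with $y_i f(x_i) < 1$ contribute to the hinge term, which is the primal counterpart of the sparse $\alpha$ in the dual.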

19 Other regularizers.

$$\sum_{i=1}^{N} (y_i - f(x_i, w))^2 + \lambda \sum_{j}^{d} |w_j|^q$$

For $q \geq 1$, the cost function is convex and has a unique minimum. The solution can be obtained by quadratic optimization. For $q < 1$, the problem is not convex, and obtaining the global minimum is more difficult.

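20 SVM-like loss function for regression. Use the ε-insensitive error measure

$$V_\varepsilon(r) = \begin{cases} 0 & \text{if } |r| \leq \varepsilon \\ |r| - \varepsilon & \text{otherwise.} \end{cases}$$

This can also be written as $V_\varepsilon(r) = (|r| - \varepsilon)_+$, where $(\cdot)_+$ indicates the positive part of $(\cdot)$, or equivalently as $V_\varepsilon(r) = \max((|r| - \varepsilon), 0)$. The cost is zero inside the epsilon tube. [Figure: $V_\varepsilon(r)$ and the square loss plotted against $r$.]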

21 Again, this is a quadratic programming problem. The cost is zero inside the epsilon tube. It can be dualized. Some of the data points will become support vectors. It can be kernelized.

22 Example: SV regression with Gaussian basis functions. The regression function uses Gaussians centred on the data points:

$$f(x, a) = \sum_{i=1}^{N} a_i e^{-(x - x_i)^2 / 2\sigma^2}, \qquad N = 50$$

Parameters are: C, epsilon, sigma.

23 [Figure: SV regression fit with C = 1000, epsilon = 0.2, sigma = 0.5, showing the estimated function, support vectors and margin vectors.] As epsilon increases: the fit becomes looser; fewer data points are support vectors.

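24 Summary: Loss functions for regression, with $r = y - f(x)$.

quadratic (square) loss: $\ell(y, f(x)) = \frac{1}{2}(y - f(x))^2$

ε-insensitive loss: $\ell(y, f(x)) = \max((|r| - \varepsilon), 0)$

Huber loss (mixed quadratic/linear), robust to outliers: $\ell(y, f(x)) = h(y - f(x))$ with

$$h(r) = \begin{cases} r^2 & \text{if } |r| \leq c \\ 2c|r| - c^2 & \text{otherwise.} \end{cases}$$

All of these are convex. [Figure: square, ε-insensitive and Huber losses plotted against $y - f(x)$.]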
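The three losses in the summary translate directly into code. The sketch below writes each as a function of the residual $r = y - f(x)$; the function names are hypothetical.

```python
def square_loss(r):
    """Quadratic loss 0.5 * r^2."""
    return 0.5 * r * r

def eps_insensitive_loss(r, eps):
    """Zero inside the epsilon tube, linear outside."""
    return max(abs(r) - eps, 0.0)

def huber_loss(r, c):
    """Quadratic for |r| <= c, linear beyond; the two pieces meet at |r| = c."""
    if abs(r) <= c:
        return r * r
    return 2.0 * c * abs(r) - c * c

# A large residual is penalized far less by the Huber loss than by r^2,
# which is the source of its robustness to outliers.
outlier = 10.0
quadratic_penalty = outlier * outlier       # 100.0
huber_penalty = huber_loss(outlier, c=1.0)  # 19.0
```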

25 Multi-class Classification

26 Multi-Class Classification: what we would like. Assign an input vector to one of $K$ classes. Goal: a decision rule that divides the input space into $K$ decision regions separated by decision boundaries.

27 Reminder: K Nearest Neighbour (K-NN) Classifier. Algorithm: for each test point, $x$, to be classified, find the $K$ nearest samples in the training data. Classify the point, $x$, according to the majority vote of their class labels, e.g. $K = 3$. The method is naturally applicable to the multi-class case.
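The K-NN algorithm above can be sketched as a brute-force search. Euclidean distance is an assumption (the slide does not name a metric), and the function name and toy data are hypothetical. Ties in the vote are broken in favour of the label seen first among the nearest neighbours.

```python
import math
from collections import Counter

def knn_classify(x, train_X, train_y, K=3):
    """Label x by majority vote among its K nearest training points."""
    # Sort training indices by Euclidean distance to x (brute force, O(N log N))
    order = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    votes = Counter(train_y[i] for i in order[:K])
    return votes.most_common(1)[0][0]

# Toy training set with three classes
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 0)]
train_y = ["a", "a", "a", "b", "b", "b", "c"]

label_1 = knn_classify((0.5, 0.5), train_X, train_y, K=3)
label_2 = knn_classify((5.2, 5.1), train_X, train_y, K=3)
```

The brute-force search costs a full pass over the training data per query, which is what motivates the approximate nearest neighbour methods in Part II.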

28 Build from binary classifiers. Learn: $K$ two-class 1-vs-the-rest classifiers $f_k(x)$. [Figure: classes $C_1$, $C_2$, $C_3$ with the boundaries 1 vs 2 & 3, 2 vs 1 & 3 and 3 vs 1 & 2; some regions are marked "?".]

29 Build from binary classifiers, continued. Learn: $K$ two-class 1-vs-the-rest classifiers $f_k(x)$. Classification: choose the class with the most positive score, $\max_k f_k(x)$. [Figure: classes $C_1$, $C_2$, $C_3$ with the boundaries 1 vs 2 & 3, 2 vs 1 & 3 and 3 vs 1 & 2.]

31 Application: handwritten digit recognition. Feature vectors: each image is 28 x 28 pixels, rearranged as a 784-vector $x$. Training: learn $K = 10$ two-class 1-vs-the-rest SVM classifiers $f_k(x)$. Classification: choose the class with the most positive score, $f(x) = \max_k f_k(x)$; the predicted class is the $k$ that attains the maximum.
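The 1-vs-the-rest decision rule can be sketched as follows. The sketch assumes linear scorers $f_k(x) = w_k^\top x + b_k$ that have already been trained; the weights below are hypothetical and chosen by hand for a 2D, three-class toy problem rather than learnt from digit images.

```python
def ovr_predict(x, W, b):
    """Return (predicted class index, scores) for 1-vs-the-rest scorers.

    W[k] and b[k] are the weights and bias of classifier k, assumed linear.
    """
    scores = [sum(wj * xj for wj, xj in zip(wk, x)) + bk for wk, bk in zip(W, b)]
    # Choose the class whose classifier gives the most positive score
    best = max(range(len(scores)), key=lambda k: scores[k])
    return best, scores

# Hypothetical hand-set weights for K = 3 classes in 2D
W = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0, -1.0]

class_a, scores_a = ovr_predict([2.0, 0.5], W, b)
class_b, _ = ovr_predict([0.0, 3.0], W, b)
class_c, _ = ovr_predict([-1.0, 0.0], W, b)
```

Taking the maximum over scores always assigns exactly one class to every input, even where several binary classifiers, or none of them, would individually claim it.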

32 Example hand drawn classification

33 Why not learn a multi-class SVM directly? For example, for three classes, learn $w = (w_1, w_2, w_3)^\top$ using the cost function

$$\min_w \|w\|^2 \quad \text{subject to}$$

$w_1^\top x_i \geq w_2^\top x_i$ and $w_1^\top x_i \geq w_3^\top x_i$ for $i \in$ class 1

$w_2^\top x_i \geq w_3^\top x_i$ and $w_2^\top x_i \geq w_1^\top x_i$ for $i \in$ class 2

$w_3^\top x_i \geq w_1^\top x_i$ and $w_3^\top x_i \geq w_2^\top x_i$ for $i \in$ class 3

This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum. Note, a margin can also be included in the constraints. In practice there is little or no improvement over the binary case.

34 There is more.
Other classifiers, e.g. AdaBoost.
Other regressors, e.g. Gaussian process regression.
Other losses, e.g. group lasso, ranking loss.
Multi-class classifiers, e.g. random forests, soft-max.

35 Part II Outline. Approximate nearest neighbours: Locality Sensitive Hashing (LSH); random k-d trees; Product Quantization (PQ).
