Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data

1 Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017

2 Binary classification problem given labelled training data Have labelled training examples. Given a test example, how do we decide its class?

3 High level solution Decision Boundary Learn a decision boundary from the labelled training data. Compare the test example to the decision boundary.

4 Technical description of the binary problem Have a set of labelled training examples $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with each $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$. Want to learn from $D$ a classification function
$g : \mathbb{R}^d \times \mathbb{R}^p \to \{-1, 1\}$ (input space $\times$ parameter space $\to$ labels)
Usually $g(x; \theta) = \operatorname{sign}(f(x; \theta))$ where $f : \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}$. Have to decide on
1. the form of $f$ (a hyperplane?) and
2. how to estimate $f$'s parameters $\hat{\theta}$ from $D$.

5 Learn decision boundary discriminatively Set up an optimization of the form (usually)
$\arg\min_\theta \; \sum_{(x,y)\in D} l(y, f(x;\theta)) \;+\; \lambda\, R(\theta)$
where the sum is the training error term and $\lambda R(\theta)$ is the regularization term, and
- $l(y, f(x;\theta))$ is the loss function and measures how well (and robustly) $f(x;\theta)$ predicts the label $y$.
- The training error term measures how well and robustly the function $f(\cdot;\theta)$ predicts the labels over all the training data.
- The regularization term measures the complexity of the function $f(\cdot;\theta)$. Usually want to learn simpler functions $\Rightarrow$ less risk of over-fitting.

6 Comment on Over- and Under-fitting

7 Example of over- and under-fitting [Figure: three decision boundaries on the same data, illustrating the Bayes optimal boundary, under-fitting and over-fitting.]

8 Overfitting [Figure (Hastie et al.): test and training error as a function of model complexity; training error keeps decreasing with complexity while test error eventually rises, moving from high bias / low variance to low bias / high variance.] Too much fitting $\Rightarrow$ adapt too closely to the training data $\Rightarrow$ have a high variance predictor. This scenario is termed overfitting. In such cases the predictor loses the ability to generalize.

9 Underfitting [Same figure: test and training error as a function of model complexity.] Low complexity model $\Rightarrow$ predictor may have large bias $\Rightarrow$ predictor has poor generalization.

10 Linear Decision Boundaries

11 Linear discriminant functions Linear function for the binary classification problem:
$f(x; w, b) = w^T x + b$
where the model parameters are $w$, the weight vector, and $b$, the bias.
[Figure: the hyperplane $w^T x + b = 0$ in 2D, with normal vector $w$, distance $|b|/\|w\|$ from the origin, and the half-spaces $w^T x + b > 0$ and $w^T x + b < 0$.]
$g(x; w, b) = \begin{cases} 1 & \text{if } f(x; w, b) > 0 \\ -1 & \text{if } f(x; w, b) < 0 \end{cases}$
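
As a concrete illustration of the linear discriminant above, here is a minimal NumPy sketch (weights and data are made up for illustration, not from the lecture) that evaluates $f(x; w, b) = w^T x + b$ and returns the sign as the class label.

```python
import numpy as np

def f(x, w, b):
    """Linear discriminant value f(x; w, b) = w^T x + b."""
    return w @ x + b

def g(x, w, b):
    """Binary classifier g(x; w, b) = sign(f(x; w, b)), labels in {-1, +1}."""
    return 1 if f(x, w, b) > 0 else -1

# Toy example: a hyperplane in 2D (illustrative values).
w = np.array([1.0, -2.0])
b = 0.5
x_test = np.array([0.3, 1.1])
print(f(x_test, w, b), g(x_test, w, b))
```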

12 Pros & Cons of Linear classifiers
Pros
- Low variance classifier.
- Easy to estimate: frequently can set up training so that we have an easy optimization problem.
- For high dimensional input data a linear decision boundary can sometimes be sufficient.
Cons
- High bias classifier.
- Often the decision boundary is not well-described by a linear classifier.

13 How do we choose & learn the linear classifier? Given labelled training data: how do we choose and learn the best hyperplane to separate the two classes? [Figure: two classes of 2D points with several candidate separating hyperplanes; from Cherkassky and from Gutierrez-Osuna's "Introduction to Pattern Analysis" slides, with the accompanying remark that the optimal separating hyperplane maximizes the margin, defined as the minimum distance of a training example to the decision surface, and gives better generalization.]

14 Supervised learning of my classifier Have a linear classifier, next need to decide: 1. How to measure the quality of the classifier w.r.t. the labelled training data? - Choose/define a loss function.

15 Most intuitive loss function: the 0-1 loss For a single example $(x, y)$ the 0-1 loss is defined as
$l(y, f(x;\theta)) = \begin{cases} 0 & \text{if } y = \operatorname{sgn}(f(x;\theta)) \\ 1 & \text{if } y \ne \operatorname{sgn}(f(x;\theta)) \end{cases} = \begin{cases} 0 & \text{if } y\, f(x;\theta) > 0 \\ 1 & \text{if } y\, f(x;\theta) < 0 \end{cases}$ (assuming $y \in \{-1, 1\}$)
[Plot: the 0-1 loss as a function of $y \cdot f(x)$.]
Applied to all the training data $\Rightarrow$ counts the number of misclassifications. Not really used in practice as it has lots of problems! What are some?
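
A short sketch (my own, illustrative) of the 0-1 loss applied to a whole training set, i.e. counting misclassifications of the linear classifier:

```python
import numpy as np

def zero_one_loss(X, y, w, b):
    """Number of misclassified examples for sign(Xw + b); X is n x d, y in {-1, +1}."""
    scores = X @ w + b
    return int(np.sum(y * scores <= 0))  # margin <= 0 counted as an error here

# Toy data (made up for illustration).
X = np.array([[1.0, 2.0], [-1.0, -0.5], [2.0, 1.0]])
y = np.array([1, -1, -1])
print(zero_one_loss(X, y, np.array([1.0, 0.0]), 0.0))
```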

16 Supervised learning of my classifier Have a linear classifier, next need to decide: 1. How to measure the quality of the classifier w.r.t. the labelled training data? - Choose/define a loss function. 2. How to measure the complexity of the classifier? - Choose/define a regularization term.

17 Most common regularization function: L2 regularization
$R(w) = \|w\|^2 = \sum_{i=1}^d w_i^2$
Adding this form of regularization encourages $w$ not to contain entries with large absolute values, i.e. we want small absolute values in all entries of $w$.

18 Supervised learning of my classifier Have a linear classifier, next need to decide: 1. How to measure the quality of the classifier w.r.t. the labelled training data? - Choose/define a loss function. 2. How to measure the complexity of the classifier? - Choose/define a regularization term. 3. How to estimate the classifier's parameters by optimizing relative to the above factors?

19 Example: Squared Error loss

20 Squared error loss & no regularization Learn $w, b$ from $D$. Find the $w, b$ that minimize:
$L(D, w, b) = \frac{1}{2}\sum_{(x,y)\in D} l_{sq}(y, f(x; w, b)) = \frac{1}{2}\sum_{(x,y)\in D} \big((w^T x + b) - y\big)^2$
(the summand is the squared error loss). $L$ is known as the sum-of-squares error function. The $w, b$ that minimize $L(D, w, b)$ are known as the Minimum Squared Error solution. This minimum is found as follows...

21 Technical interlude: Matrix Calculus

22 Matrix Calculus Have a function $f : \mathbb{R}^d \to \mathbb{R}$, that is $f(x) = b$ with $x \in \mathbb{R}^d$, $b \in \mathbb{R}$. We use the notation
$\frac{\partial f}{\partial x} = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_d} \right)^T$
Example: If $f(x) = a^T x = \sum_{i=1}^d a_i x_i$ then $\frac{\partial f}{\partial x_i} = a_i \;\Rightarrow\; \frac{\partial f}{\partial x} = (a_1, \ldots, a_d)^T = a$.

23 Matrix Calculus Have a function $f : \mathbb{R}^{d \times d} \to \mathbb{R}$, that is $f(X) = b$ with $X \in \mathbb{R}^{d \times d}$, $b \in \mathbb{R}$. We use the notation
$\frac{\partial f}{\partial X} = \begin{pmatrix} \frac{\partial f}{\partial x_{11}} & \cdots & \frac{\partial f}{\partial x_{1d}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{d1}} & \cdots & \frac{\partial f}{\partial x_{dd}} \end{pmatrix}$
Example: If $f(X) = a^T X b = \sum_{i=1}^d a_i \sum_{j=1}^d x_{ij} b_j$ then $\frac{\partial f}{\partial x_{ij}} = a_i b_j \;\Rightarrow\; \frac{\partial f}{\partial X} = \begin{pmatrix} a_1 b_1 & \cdots & a_1 b_d \\ \vdots & \ddots & \vdots \\ a_d b_1 & \cdots & a_d b_d \end{pmatrix} = a b^T$.

24 Matrix Calculus Derivative of a linear function:
$\frac{\partial\, x^T a}{\partial x} = a \quad (1) \qquad \frac{\partial\, a^T x}{\partial x} = a \quad (2) \qquad \frac{\partial\, a^T X b}{\partial X} = a b^T \quad (3) \qquad \frac{\partial\, a^T X^T b}{\partial X} = b a^T \quad (4)$

25 Matrix Calculus Derivative of a quadratic function:
$\frac{\partial\, x^T B x}{\partial x} = (B + B^T) x \quad (5)$
$\frac{\partial\, b^T X^T X c}{\partial X} = X(b c^T + c b^T) \quad (6)$
$\frac{\partial\, (Bx + b)^T C (Dx + d)}{\partial x} = B^T C (Dx + d) + D^T C^T (Bx + b) \quad (7)$
$\frac{\partial\, b^T X^T D X c}{\partial X} = D^T X b c^T + D X c b^T \quad (8)$
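
These identities are easy to sanity-check numerically with finite differences. Below is a small sketch (my own, not from the lecture) verifying identity (3), $\partial(a^T X b)/\partial X = a b^T$, on random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
a, b = rng.normal(size=d), rng.normal(size=d)
X = rng.normal(size=(d, d))

f = lambda X: a @ X @ b          # f(X) = a^T X b
analytic = np.outer(a, b)        # identity (3): ab^T

# Central finite-difference approximation of df/dX_{ij}
eps = 1e-6
numeric = np.zeros_like(X)
for i in range(d):
    for j in range(d):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(numeric - analytic)))  # difference is tiny (round-off only)
```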

26 End of Technical interlude

27 Pseudo-Inverse solution Can write the cost function as
$L(D, w, b) = \frac{1}{2}\sum_{(x,y)\in D} \big( w^T x + b - y \big)^2 = \frac{1}{2}\sum_{(x,y)\in D} \big( w_1^T x - y \big)^2$
where each $x$ is augmented as $x \leftarrow (x^T, 1)^T$ and $w_1 = (w^T, b)^T$. Writing this in matrix notation it becomes
$L(D, w_1) = \frac{1}{2}\|X w_1 - y\|^2 = \frac{1}{2}(X w_1 - y)^T (X w_1 - y) = \frac{1}{2}\big( w_1^T X^T X w_1 - 2 y^T X w_1 + y^T y \big)$
where $y = (y_1, \ldots, y_n)^T$, $w_1 \in \mathbb{R}^{d+1}$ and $X$ is the $n \times (d+1)$ matrix with rows $(x_i^T, 1)$.

28 Pseudo-Inverse solution The gradient of $L(D, w_1)$ w.r.t. $w_1$:
$\nabla_{w_1} L(D, w_1) = X^T X w_1 - X^T y$
Setting this equal to zero yields $X^T X w_1 = X^T y$ and $w_1 = X^{\dagger} y$, where $X^{\dagger} \equiv (X^T X)^{-1} X^T$. $X^{\dagger}$ is called the pseudo-inverse of $X$. Note that $X^{\dagger} X = I$ but in general $X X^{\dagger} \ne I$.
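
A minimal NumPy sketch of the minimum squared error solution: append a column of ones to the data and solve $X^T X w_1 = X^T y$ (here via np.linalg.lstsq, which handles the nearly singular case more gracefully than forming $(X^T X)^{-1}$ explicitly). The toy data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary problem in 2D with labels in {-1, +1} (illustrative).
X_pos = rng.normal(loc=[2, 2], size=(50, 2))
X_neg = rng.normal(loc=[-2, -2], size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(50), -np.ones(50)])

# Augment with a constant 1 so the bias is absorbed into w1 = (w^T, b)^T.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Minimum squared error solution w1 = X^+ y.
w1, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = w1[:-1], w1[-1]

preds = np.sign(X @ w + b)
print("training accuracy:", np.mean(preds == y))
```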

29 Simple 2D Example Decision boundary found by minimizing
$L_{\text{squared error}}(D, w, b) = \sum_{(x,y)\in D} \big( y - (w^T x + b) \big)^2$
[Figure: two classes of 2D points and the resulting linear decision boundary.]

30 Pseudo-Inverse solution The gradient of $L(D, w_1)$ w.r.t. $w_1$:
$\nabla_{w_1} L(D, w_1) = X^T X w_1 - X^T y$
Setting this equal to zero yields $X^T X w_1 = X^T y$ and $w_1 = X^{\dagger} y$, where $X^{\dagger} \equiv (X^T X)^{-1} X^T$. $X^{\dagger}$ is called the pseudo-inverse of $X$. Note that $X^{\dagger} X = I$ but in general $X X^{\dagger} \ne I$. If $X^T X$ is singular $\Rightarrow$ no unique solution to $X^T X w_1 = X^T y$.

31 Technical interlude: Iterative Optimization

32 Iterative Optimization
$x^* = \arg\min_x f(x)$
A common approach to solving such unconstrained optimization problems is iterative non-linear optimization. Start with an estimate $x^{(0)}$. Try to improve it by finding successive new estimates $x^{(1)}, x^{(2)}, x^{(3)}, \ldots$ s.t. $f(x^{(1)}) \ge f(x^{(2)}) \ge f(x^{(3)}) \ge \ldots$, until convergence. To find a better estimate at each iteration: perform the search locally around the current estimate. Such iterative approaches will find a local minimum.

33 Iterative Optimization Iterative optimization methods alternate between these two steps:
- Decide search direction: choose a search direction based on the local properties of the cost function.
- Line search: perform an intensive search to find the minimum along the chosen direction.

34 Choosing a search direction: The gradient The gradient is defined as:
$\nabla_x f(x) \equiv \frac{\partial f(x)}{\partial x} = \left( \frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_d} \right)^T$
The gradient points in the direction of the greatest increase of $f(x)$.

35 Gradient descent: Method for function minimization Gradient descent finds the minimum in an iterative fashion by moving in the direction of steepest descent. [Figure: a non-convex 1D cost function with the initial guess, a local minimum and the global minimum.]
Gradient Descent Minimization
1. Start with an arbitrary solution $x^{(0)}$.
2. Compute the gradient $\nabla_x f(x^{(k)})$.
3. Move in the direction of steepest descent: $x^{(k+1)} = x^{(k)} - \eta^{(k)} \nabla_x f(x^{(k)})$, where $\eta^{(k)}$ is the step size.
4. Go to 2 (until convergence).
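
A generic gradient-descent loop written as a sketch for an arbitrary differentiable function (the example function, step size and starting point are illustrative, not from the lecture):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, num_iters=100):
    """Minimise f by repeatedly stepping in the direction of steepest descent."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - eta * grad_f(x)   # x^(k+1) = x^(k) - eta * grad f(x^(k))
    return x

# Example: f(x) = (x1 - 3)^2 + (x2 + 1)^2, with gradient 2*(x - [3, -1]).
grad_f = lambda x: 2 * (x - np.array([3.0, -1.0]))
print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # converges towards [3, -1]
```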

36 Gradient descent: Method for function minimization Gradient descent finds the minimum in an iterative fashion by moving in the direction of steepest descent. [Figure: a non-convex 1D cost function with the initial guess, a local minimum and the global minimum.]
Properties
1. Will converge to a local minimum.
2. The local minimum found depends on the initialization $x^{(0)}$.
But this is okay:
1. For convex optimization problems: local minimum = global minimum.
2. For deep networks most parameter settings corresponding to a local minimum are fine.

37 Gradient descent: Method for function minimization Gradient descent finds the minimum in an iterative fashion by moving in the direction of steepest descent. [Figure: a non-convex 1D cost function with the initial guess, a local minimum and the global minimum.]
Properties
1. Will converge to a local minimum.
2. The local minimum found depends on the initialization $x^{(0)}$.
But this is okay:
1. For convex optimization problems: local minimum = global minimum.
2. For deep networks most parameter settings corresponding to a local minimum are fine.

38 End of Technical interlude

39 Gradient descent solution The error function $L(D, w_1)$ could also be minimized w.r.t. $w_1$ by using a gradient descent procedure. Why? This avoids the numerical problems that arise when $X^T X$ is (nearly) singular. It also avoids the need for working with large matrices. How?
1. Begin with an initial guess $w_1^{(0)}$ for $w_1$.
2. Update the weight vector by moving a small distance in the direction $-\nabla_{w_1} L$.

40 Gradient descent solution Solution:
$w_1^{(t+1)} = w_1^{(t)} - \eta^{(t)} X^T (X w_1^{(t)} - y)$
If $\eta^{(t)} = \eta_0 / t$, where $\eta_0 > 0$, then $w_1^{(0)}, w_1^{(1)}, w_1^{(2)}, \ldots$ converges to a solution of $X^T (X w_1 - y) = 0$, irrespective of whether $X^T X$ is singular or not.

41 Stochastic gradient descent solution Increase the number of updates per computation by considering each training sample sequentially:
$w_1^{(t+1)} = w_1^{(t)} - \eta^{(t)} (x_i^T w_1^{(t)} - y_i)\, x_i$
This is known as the Widrow-Hoff, least-mean-squares (LMS) or delta rule [Mitchell, 1997]. More generally this is an application of Stochastic Gradient Descent.
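
A sketch of the Widrow-Hoff / LMS rule: cycle through the training samples and apply the single-sample update $w_1 \leftarrow w_1 - \eta\,(x_i^T w_1 - y_i)\,x_i$ on the augmented data (toy data and step size are illustrative):

```python
import numpy as np

def lms(X_aug, y, eta=0.01, num_epochs=50, seed=0):
    """Widrow-Hoff / LMS: single-sample SGD on the squared error."""
    rng = np.random.default_rng(seed)
    n, d1 = X_aug.shape
    w1 = np.zeros(d1)
    for _ in range(num_epochs):
        for i in rng.permutation(n):          # shuffle the order each epoch
            err = X_aug[i] @ w1 - y[i]
            w1 = w1 - eta * err * X_aug[i]    # delta rule update
    return w1

# Toy 1D data with a bias column and labels in {-1, +1} (illustrative).
x = np.linspace(-1, 1, 20)
X_aug = np.column_stack([x, np.ones_like(x)])
y = np.sign(x)
print(lms(X_aug, y))
```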

42 Technical interlude: Stochastic Gradient Descent

43 Common Optimization Problem in Machine Learning Form of the optimization problem:
$J(D, \theta) = \frac{1}{|D|} \sum_{(x,y)\in D} l(y, f(x;\theta)) + \lambda R(\theta)$
$\theta^* = \arg\min_\theta J(D, \theta)$
Solution with gradient descent:
1. Start with a random guess $\theta^{(0)}$ for the parameters.
2. Then iterate until convergence: $\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)} \nabla_\theta J(D, \theta)\big|_{\theta^{(t)}}$

44 Given large scale data If $D$ is large $\Rightarrow$ computing $\nabla_\theta J(D, \theta)\big|_{\theta^{(t)}}$ is time consuming $\Rightarrow$ each update of $\theta^{(t)}$ takes lots of computations. Gradient descent needs lots of iterations to converge as $\eta$ is usually small $\Rightarrow$ GD takes an age to find a local optimum.

45 Work around: Stochastic Gradient Descent Start with a random solution $\theta^{(0)}$. Until convergence, for $t = 1, 2, \ldots$:
1. Randomly select $(x, y) \in D$.
2. Set $D^{(t)} = \{(x, y)\}$.
3. Update the parameter estimate with $\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)} \nabla_\theta J(D^{(t)}, \theta)\big|_{\theta^{(t)}}$

46 Comments about SGD When $|D^{(t)}| = 1$: $\nabla_\theta J(D^{(t)}, \theta)\big|_{\theta^{(t)}}$ is a noisy estimate of $\nabla_\theta J(D, \theta)\big|_{\theta^{(t)}}$. Therefore $|D|$ noisy update steps in SGD $\approx$ 1 correct update step in GD. In practice SGD converges a lot faster than GD. Given lots of labelled training data: quantity of updates is more important than quality of updates!

47 Best practices for SGD Preparing the data - Randomly shuffle the training examples and zip sequentially through D. - Use preconditioning techniques. Monitoring and debugging - Monitor both the training cost and the validation error. - Check the gradients using finite differences. - Experiment with learning rates η (t) using a small sample of the training set.

48 Mini-Batch Gradient Descent Start with a random guess $\theta^{(0)}$ for the parameters. Until convergence, for $t = 1, 2, \ldots$:
1. Randomly select a subset $D^{(t)} \subset D$ s.t. $|D^{(t)}| = n_b$ (typically $n_b \approx 150$).
2. Update the parameter estimate with $\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)} \nabla_\theta J(D^{(t)}, \theta)\big|_{\theta^{(t)}}$
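
A generic mini-batch gradient descent loop following the recipe above; grad_J is assumed to compute $\nabla_\theta J(D^{(t)}, \theta)$ for a given batch (all function names, data and hyper-parameters here are my own, illustrative):

```python
import numpy as np

def minibatch_gd(grad_J, theta0, X, y, eta=0.1, n_b=100, num_steps=1000, seed=0):
    """Mini-batch gradient descent: theta <- theta - eta * grad_J(batch, theta)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = X.shape[0]
    for _ in range(num_steps):
        idx = rng.choice(n, size=min(n_b, n), replace=False)  # random subset D^(t)
        theta = theta - eta * grad_J(X[idx], y[idx], theta)
    return theta

# Example grad_J: mean squared error gradient on an augmented data matrix (illustrative).
grad_sq = lambda Xb, yb, th: Xb.T @ (Xb @ th - yb) / len(yb)
X = np.column_stack([np.linspace(-1, 1, 200), np.ones(200)])
y = 2 * X[:, 0] + 0.5
print(minibatch_gd(grad_sq, np.zeros(2), X, y))  # approaches [2, 0.5]
```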

49 Benefits of mini-batch gradient descent Obtain a more accurate estimate of $\nabla_\theta J(D, \theta)\big|_{\theta^{(t)}}$ than in SGD. Still get lots of updates per epoch (one iteration through all the training data).

50 What learning rate? Issues with setting the learning rate $\eta^{(t)}$:
- Larger $\eta$'s $\Rightarrow$ potentially faster learning, but with the risk of less stable convergence.
- Smaller $\eta$'s $\Rightarrow$ slow learning but stable convergence.
Strategies
- Constant: $\eta^{(t)} = .01$
- Decreasing: $\eta^{(t)} = 1/\sqrt{t}$
Lots of recent algorithms deal with this issue. Will describe these algorithms in the near future.

51 End of Technical interlude

52 Squared Error loss + L2 regularization

53 Add an L2 regularization term (a.k.a. ridge regression) Add a regularization term to the loss function:
$J_{\text{ridge}}(D, w, b) = \frac{1}{2}\sum_{(x,y)\in D} l_{sq}(y, w^T x + b) + \lambda \|w\|^2 = \frac{1}{2}\|X w + b\mathbf{1} - y\|^2 + \lambda \|w\|^2$
where $\lambda > 0$ and small, and $X$ is the data matrix with rows $x_1^T, x_2^T, \ldots, x_n^T$.

54 Solving Ridge Regression: Centre the data to simplify Add a regularization term to the loss function:
$J_{\text{ridge}}(D, w, b) = \frac{1}{2}\|X w + b\mathbf{1} - y\|^2 + \lambda \|w\|^2$
Let's centre the input data $\Rightarrow X_c^T \mathbf{1} = 0$, where $X_c$ has rows $x_{c,i}^T$ with $x_{c,i} = x_i - \mu_x$. With centred input $X_c$,
$\frac{\partial J_{\text{ridge}}}{\partial b} = b\,\mathbf{1}^T\mathbf{1} + w^T X_c^T \mathbf{1} - \mathbf{1}^T y = b\,\mathbf{1}^T\mathbf{1} - \mathbf{1}^T y$
so the optimal bias (which does not depend on $w$) is $b^* = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}$.

55 Solving ridge regression: Optimal weight vector Add a regularization term to the loss function:
$J_{\text{ridge}}(D, w) = \frac{1}{2}\|X_c w + \bar{y}\mathbf{1} - y\|^2 + \lambda \|w\|^2$
Compute the gradient of $J_{\text{ridge}}$ w.r.t. $w$:
$\frac{\partial J_{\text{ridge}}}{\partial w} = (X_c^T X_c + \lambda I_d)\, w - X_c^T y$
Set it to zero to get
$w^* = (X_c^T X_c + \lambda I_d)^{-1} X_c^T y$
$(X_c^T X_c + \lambda I_d)$ has a unique inverse even if $X_c^T X_c$ is singular.
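
A sketch of the closed-form ridge solution following the slides: centre the inputs, set the bias to the label mean, and solve $(X_c^T X_c + \lambda I)\,w = X_c^T y$ (toy data for illustration).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression with centred inputs: returns (w, b) for f(x) = w^T x + b."""
    mu = X.mean(axis=0)
    Xc = X - mu                       # centred data matrix X_c
    b = y.mean()                      # optimal bias = mean label
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ y)
    return w, b - w @ mu              # fold the centring back into the bias

# Toy binary data with labels in {-1, +1} (illustrative).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([2, 2], size=(50, 2)), rng.normal([-2, -2], size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w, b = ridge_fit(X, y, lam=10.0)
print("accuracy:", np.mean(np.sign(X @ w + b) == y))
```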

56 Ridge Regression decision boundaries as $\lambda$ is varied [Figure: decision boundaries on the simple 2D example for $\lambda = 1$, $\lambda = 10$, $\lambda = 100$ and $\lambda = 1000$.]

57 Solving ridge regression: Optimal weight vector Add a regularization term to the loss function:
$J_{\text{ridge}}(D, w) = \frac{1}{2}\|X_c w + \bar{y}\mathbf{1} - y\|^2 + \lambda \|w\|^2$
Compute the gradient of $J_{\text{ridge}}$ w.r.t. $w$:
$\frac{\partial J_{\text{ridge}}}{\partial w} = (X_c^T X_c + \lambda I_d)\, w - X_c^T y$
Set it to zero to get
$w^* = (X_c^T X_c + \lambda I_d)^{-1} X_c^T y$
$(X_c^T X_c + \lambda I_d)$ has a unique inverse even if $X_c^T X_c$ is singular. If $d$ is large $\Rightarrow$ have to invert a very large matrix.

58 Solving ridge regression: Iteratively The gradient-descent update step is
$w^{(t+1)} = w^{(t)} - \eta \left[ (X_c^T X_c + \lambda I_d)\, w^{(t)} - X_c^T y \right]$
The SGD update step for sample $(x, y)$ is
$w^{(t+1)} = w^{(t)} - \eta \left[ \big( (x - \mu_x)(x - \mu_x)^T + \lambda I_d \big)\, w^{(t)} - (x - \mu_x)\, y \right]$

59 Hinge Loss

60 The Hinge loss
$l(x, y; w, b) = \max\{0, 1 - y(w^T x + b)\}$
[Plot: the hinge loss as a function of $y \cdot g(x;\theta)$.]
This loss is not differentiable but is convex. Correctly classified examples sufficiently far from the decision boundary have zero loss $\Rightarrow$ have a way of choosing between classifiers that correctly classify all the training examples.

61 Technical interlude: Sub-gradient

62 Subgradient of a function $g$ is a subgradient of $f$ (not necessarily convex) at $x$ if
$f(y) \ge f(x) + g^T (y - x)$ for all $y$
(i.e. $(g, -1)$ supports $\operatorname{epi} f$ at $(x, f(x))$).
[1D example: $g_2, g_3$ are subgradients at $x_2$; $g_1$ is a subgradient at $x_1$. From Prof. S. Boyd, EE392o, Stanford University.]

63 Subgradient of a function The set of all subgradients of $f$ at $x$ is called the subdifferential of $f$ at $x$, written $\partial f(x)$.
[1D example: $f(x) = |x|$.]
If $f$ is convex and differentiable: $\nabla f(x)$ is a subgradient of $f$ at $x$.

64 End of Technical interlude

65 The Hinge loss Find $w, b$ that minimize
$L_{\text{hinge}}(D, w, b) = \sum_{(x,y)\in D} \max\{0, 1 - y(w^T x + b)\}$
Can use stochastic gradient descent to do the optimization. The (sub-)gradients of the hinge loss are
$\nabla_w l(x, y; w, b) = \begin{cases} -y\, x & \text{if } y(w^T x + b) < 1 \\ 0 & \text{otherwise} \end{cases}$
$\frac{\partial l(x, y; w, b)}{\partial b} = \begin{cases} -y & \text{if } y(w^T x + b) < 1 \\ 0 & \text{otherwise} \end{cases}$

66 Example of decision boundary found Decision boundary found by minimizing with SGD
$L_{\text{hinge}}(D, w, b) = \sum_{(x,y)\in D} \max\{0, 1 - y(w^T x + b)\}$
[Figure: the resulting decision boundary on the simple 2D example.]

67 L2 Regularization + Hinge Loss

68 L2 regularization + Hinge loss Find $w, b$ that minimize
$J_{\text{svm}}(D, w, b) = \frac{\lambda}{2}\|w\|^2 + \sum_{(x,y)\in D} \max\{0, 1 - y(w^T x + b)\}$
Can use stochastic gradient descent to do the optimization. The sub-gradients of this cost function (per example) are
$\nabla_w l(x, y; w, b) = \begin{cases} \lambda w - y\, x & \text{if } y(w^T x + b) < 1 \\ \lambda w & \text{otherwise} \end{cases}$
$\frac{\partial l(x, y; w, b)}{\partial b} = \begin{cases} -y & \text{if } y(w^T x + b) < 1 \\ 0 & \text{otherwise} \end{cases}$
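
A sketch of training this L2-regularized hinge-loss classifier with plain SGD, applying the per-example sub-gradients above (hyper-parameters and toy data are illustrative):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, eta=0.01, num_epochs=100, seed=0):
    """SGD on  (lam/2)||w||^2 + sum_i max(0, 1 - y_i (w^T x_i + b))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(num_epochs):
        for i in rng.permutation(n):
            margin = y[i] * (w @ X[i] + b)
            if margin < 1:                     # hinge term is active
                w -= eta * (lam * w - y[i] * X[i])
                b -= eta * (-y[i])
            else:                              # only the regularizer contributes
                w -= eta * lam * w
    return w, b

# Toy separable 2D data with labels in {-1, +1} (illustrative).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([2, 2], size=(50, 2)), rng.normal([-2, -2], size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w, b = svm_sgd(X, y)
print("accuracy:", np.mean(np.sign(X @ w + b) == y))
```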

69 Example of decision boundary found Decision boundary found with SGD by minimizing ($\lambda = .01$)
$J_{\text{hinge}}(D, w, b) = \frac{\lambda}{2}\|w\|^2 + \sum_{(x,y)\in D} \max\{0, 1 - y(w^T x + b)\}$
[Figure: the resulting decision boundary on the simple 2D example.]

70 Regularization reduces the influence of outliers [Figure: decision boundaries found by minimizing the Hinge Loss alone vs. L2 Regularization + Hinge Loss, on data containing an outlier.]

71 L2 Regularization + Hinge Loss = SVM

72 SVM's constrained optimization problem SVM solves this constrained optimization problem:
$\min_{w,b} \left( \frac{1}{2} w^T w + C \sum_{i=1}^n \xi_i \right)$
subject to
$y_i(w^T x_i + b) \ge 1 - \xi_i$ for $i = 1, \ldots, n$ and $\xi_i \ge 0$ for $i = 1, \ldots, n$.
[Figure (Hastie et al., ch. 12 Flexible Discriminants): the separating hyperplane $x^T\beta + \beta_0 = 0$, the margin of width $M = 1/\|\beta\|$ on each side, and the slack variables $\xi_1, \ldots, \xi_5$.]

73 Alternative formulation of SVM optimization SVM solves this constrained optimization problem:
$\min_{w,b} \left( \frac{1}{2} w^T w + C \sum_{i=1}^n \xi_i \right)$
subject to $y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for $i = 1, \ldots, n$.
Let's look at the constraints: $y_i(w^T x_i + b) \ge 1 - \xi_i \Rightarrow \xi_i \ge 1 - y_i(w^T x_i + b)$. But $\xi_i \ge 0$ also, therefore $\xi_i \ge \max\{0, 1 - y_i(w^T x_i + b)\}$.

74 Alternative formulation of the SVM optimization Thus the original constrained optimization problem can be restated as an unconstrained optimization problem:
$\min_{w,b} \; \underbrace{\tfrac{1}{2}\|w\|^2}_{\text{Regularization term}} + C \sum_{i=1}^n \underbrace{\max\{0, 1 - y_i(w^T x_i + b)\}}_{\text{Hinge loss}}$
and this corresponds to the L2 regularization + Hinge loss formulation! $\Rightarrow$ can train SVMs with SGD/mini-batch gradient descent.

75 Alternative formulation of the SVM optimization Thus the original constrained optimization problem can be restated as an unconstrained optimization problem:
$\min_{w,b} \; \underbrace{\tfrac{1}{2}\|w\|^2}_{\text{Regularization term}} + C \sum_{i=1}^n \underbrace{\max\{0, 1 - y_i(w^T x_i + b)\}}_{\text{Hinge loss}}$
and this corresponds to the L2 regularization + Hinge loss formulation! $\Rightarrow$ can train SVMs with SGD/mini-batch gradient descent.

76 From binary to multi-class classification

77 Example dataset: CIFAR-10 [Image grid: example images from the 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.]

78 Example dataset: CIFAR-10 10 CIFAR classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), 50,000 training images, 10,000 test images. Each image has size 32 x 32 pixels with 3 colour channels.

79 Technical description of the multi-class problem Have a set of labelled training examples $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with each $x_i \in \mathbb{R}^d$, $y_i \in \{1, \ldots, C\}$. Want to learn from $D$ a classification function
$g : \mathbb{R}^d \times \mathbb{R}^P \to \{1, \ldots, C\}$ (input space $\times$ parameter space $\to$ labels)
Usually
$g(x; \Theta) = \arg\max_{1 \le j \le C} f_j(x; \theta_j)$
where for $j = 1, \ldots, C$: $f_j : \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}$, and $\Theta = (\theta_1, \theta_2, \ldots, \theta_C)$.

80 Multi-class linear classifier Let each $f_j$ be a linear function, that is
$f_j(x; \theta_j) = w_j^T x + b_j$
Define $f(x; \Theta) = (f_1(x), \ldots, f_C(x))^T$, then
$f(x; \Theta) = f(x; W, b) = W x + b$
where $W = \begin{pmatrix} w_1^T \\ \vdots \\ w_C^T \end{pmatrix}$ and $b = \begin{pmatrix} b_1 \\ \vdots \\ b_C \end{pmatrix}$. Note $W$ has size $C \times d$ and $b$ is $C \times 1$.

81 Apply a multi-class linear classifier to an image Have a 2D colour image, but can flatten it into a 1D vector $x$. Apply the classifier $W x + b$ to get a score for each class. [Figure: the flattened image $x$, and $W x + b$ = the vector of class scores for airplane, car, bird, cat, deer, dog, frog, horse, ship, truck.]
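
A small sketch of this step: a CIFAR-sized image flattens to a 3072-dimensional vector, $W$ is $10 \times 3072$ and $b$ is $10 \times 1$ (the parameters below are random, just to show the shapes):

```python
import numpy as np

C, H, W_img, CH = 10, 32, 32, 3           # CIFAR-10: 10 classes, 32x32 RGB images
d = H * W_img * CH                         # 3072-dimensional input after flattening

rng = np.random.default_rng(4)
W = 0.01 * rng.normal(size=(C, d))         # weight matrix, one row (template) per class
b = np.zeros(C)                            # bias vector, one entry per class

image = rng.integers(0, 256, size=(H, W_img, CH))   # stand-in for a CIFAR image
x = image.reshape(-1).astype(float)        # flatten to a 1D vector

scores = W @ x + b                         # one score per class
predicted_class = int(np.argmax(scores))   # g(x; W, b) = argmax_j f_j(x)
print(scores.shape, predicted_class)
```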

82 Interpreting a multi-class linear classifier Learn $W, b$ to classify the images in a dataset. Can interpret each row, $w_j$, of $W$ as a template for class $j$. [Figure: visualization of each learnt $w_j$ for CIFAR-10 (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck).]

83 Interpreting a multi-class linear classifier Each $w_j^T x + b_j = 0$ corresponds to a hyperplane, $H_j$, in $\mathbb{R}^d$. $\operatorname{sign}(w_j^T x + b_j)$ tells us on which side of $H_j$ the point $x$ lies. The score $w_j^T x + b_j$ is proportional to the distance of $x$ to $H_j$. [Figure: the car, airplane and deer classifier hyperplanes in the input space.]

84 How do we learn W and b? As before need to Specify a loss function (+ a regularization term). Set up the optimization problem. Perform the optimization.

85 How do we learn W and b? As before need to Specify a loss function - must quantify the quality of all the class scores across all the training data. Set up the optimization problem. Perform the optimization.

86 Multi-class loss functions

87 Multi-class SVM Loss Remember we have training data $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with each $x_i \in \mathbb{R}^d$, $y_i \in \{1, \ldots, C\}$. Let $s_j$ be the score of function $f_j$ applied to $x$:
$s_j = f_j(x; w_j, b_j) = w_j^T x + b_j$
The SVM loss for training example $x$ with label $y$ is
$l = \sum_{j=1, j \ne y}^{C} \max(0, s_j - s_y + 1)$

88 Multi-class SVM Loss $s_j$ is the score of function $f_j$ applied to $x$: $s_j = f_j(x; w_j, b_j) = w_j^T x + b_j$. [Plot: $\max(0, s_j - s_y + 1)$ as a function of $s_j - s_y$.] The SVM loss for training example $x$ with label $y$ is
$l = \sum_{j=1, j \ne y}^{C} \max(0, s_j - s_y + 1)$

89 Calculate the multi-class SVM loss for a CIFAR image Input: $x$; output: $s = W x + b$; label: $y = 8$ (horse); loss: $l = \sum_{j=1, j\ne y}^{10} \max(0, s_j - s_y + 1)$. [Figure: the input image and its vector of class scores $s = W x + b$ for airplane, car, bird, cat, deer, dog, frog, horse, ship, truck.]

90 Calculate the multi-class SVM loss for a CIFAR image Input: $x$; output: $s = W x + b$; label: $y = 8$ (horse); loss: $l = \sum_{j=1, j\ne y}^{10} \max(0, s_j - s_y + 1)$. Compare each score to the horse score: compute $s_j - s_8 + 1$. [Figure: the class scores and the shifted scores $s - s_8 + 1$.]

91 Calculate the multi-class SVM loss for a CIFAR image Input: $x$; output: $s = W x + b$; label: $y = 8$ (horse); loss: $l = \sum_{j=1, j\ne y}^{10} \max(0, s_j - s_y + 1)$. Compare each score to the horse score and keep the badly performing classes: $\max(0, s_j - s_8 + 1)$. The loss for $x$ is the sum of these terms over $j \ne 8$. [Figure: the class scores, the shifted scores $s - s_8 + 1$ and $\max(0, s - s_8 + 1)$.]
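
A short sketch of the multi-class SVM loss for a single example, directly implementing $\sum_{j \ne y}\max(0, s_j - s_y + 1)$ (the score values below are made up):

```python
import numpy as np

def multiclass_svm_loss(s, y):
    """Multi-class SVM loss for one example: sum_{j != y} max(0, s_j - s_y + 1)."""
    margins = np.maximum(0.0, s - s[y] + 1.0)
    margins[y] = 0.0                      # do not count the correct class
    return margins.sum()

# Illustrative scores for 10 classes; the true label is class 7 (0-indexed).
s = np.array([-1.2, 0.3, 2.1, 0.0, -0.5, 1.7, 0.9, 3.0, -2.0, 0.4])
print(multiclass_svm_loss(s, y=7))
```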

92 Problem with the SVM loss Given $W$ and $b$, the response for one training example is $f(x; W, b) = W x + b = s$ and the loss for $x$ is
$l(x, y, W, b) = \sum_{j=1, j\ne y}^{C} \max(0, s_j - s_y + 1)$
The loss over all the training data is
$L(D, W, b) = \frac{1}{|D|} \sum_{(x,y)\in D} l(x, y, W, b)$
Have found a $W$ s.t. $L = 0$. Is this $W$ unique?

93 No Let $W_1 = \alpha W$ and $b_1 = \alpha b$ where $\alpha > 1$. The response for one training example is
$f(x; W_1, b_1) = W_1 x + b_1 = s' = \alpha(W x + b)$
The loss for $(x, y)$ w.r.t. $W_1$ and $b_1$ is
$l(x, y, W_1, b_1) = \sum_{j\ne y} \max(0, s'_j - s'_y + 1) = \sum_{j\ne y} \max(0, \alpha(w_j^T x + b_j - w_y^T x - b_y) + 1) = \sum_{j\ne y} \max(0, \alpha(s_j - s_y) + 1) = 0$
as by definition $s_j - s_y \le -1$ and $\alpha > 1$. Thus the total loss $L(D, W_1, b_1)$ is 0.

94 Solution: Weight regularization
$L(D, W, b) = \frac{1}{|D|}\sum_{(x,y)\in D} \sum_{j=1, j\ne y}^{C} \max(0, f_j(x; W, b) - f_y(x; W, b) + 1) + \lambda R(W)$
Commonly used regularization functions:
- L2: $R(W) = \sum_k \sum_l W_{k,l}^2$
- L1: $R(W) = \sum_k \sum_l |W_{k,l}|$
- Elastic Net: $R(W) = \sum_k \sum_l \big( \beta W_{k,l}^2 + |W_{k,l}| \big)$

95 Cross-entropy Loss

96 Probabilistic interpretation of scores Let $p_j$ be the probability that input $x$ has label $j$: $P_{Y|X}(j \mid x) = p_j$. For $x$ our linear classifier outputs scores for each class: $s = W x + b$. Can interpret the scores $s$ as unnormalized log probabilities for each class, that is $s_j = \log \tilde{p}_j$ where $\tilde{p}_j = \alpha p_j$ and $\alpha = \sum_j \tilde{p}_j$.
$\Rightarrow P_{Y|X}(j \mid x) = p_j = \frac{\exp(s_j)}{\sum_k \exp(s_k)}$

97 Softmax operation This transformation is known as the softmax:
$\text{Softmax}(s)_j = \frac{\exp(s_j)}{\sum_k \exp(s_k)}$
[Figure: a score vector $x$ and the corresponding $\text{Softmax}(x)$.]

98 Softmax operation This transformation is known as the softmax:
$\text{Softmax}(s)_j = \frac{\exp(s_j)}{\sum_k \exp(s_k)}$
[Figure: a score vector $x$ and the corresponding $\text{Softmax}(x)$.]
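
A minimal softmax implementation; subtracting the maximum score first is a standard numerical-stability trick (not on the slide) to avoid overflow in exp:

```python
import numpy as np

def softmax(s):
    """Softmax(s)_j = exp(s_j) / sum_k exp(s_k), computed in a numerically stable way."""
    z = s - np.max(s)           # shifting by a constant does not change the result
    e = np.exp(z)
    return e / e.sum()

s = np.array([2.0, 1.0, 0.1])   # illustrative scores
p = softmax(s)
print(p, p.sum())               # probabilities, summing to 1
```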

99 Softmax classifier: Log likelihood of the training data Given a probabilistic model: estimate its parameters by maximizing the log-likelihood of the training data.
$\theta^* = \arg\max_\theta \frac{1}{|D|}\sum_{(x,y)\in D} \log P_{Y|X}(y \mid x; \theta) = \arg\min_\theta -\frac{1}{|D|}\sum_{(x,y)\in D} \log P_{Y|X}(y \mid x; \theta)$
Given the probabilistic interpretation of our classifier, the negative log-likelihood of the training data is
$-\frac{1}{|D|}\sum_{(x,y)\in D} \log\left( \frac{\exp(s_y)}{\sum_{k=1}^C \exp(s_k)} \right)$ where $s = W x + b$.

100 Softmax classifier: Log likelihood of the training data Given a probabilistic model: estimate its parameters by maximizing the log-likelihood of the training data.
$\theta^* = \arg\max_\theta \frac{1}{|D|}\sum_{(x,y)\in D} \log P_{Y|X}(y \mid x; \theta) = \arg\min_\theta -\frac{1}{|D|}\sum_{(x,y)\in D} \log P_{Y|X}(y \mid x; \theta)$
Given the probabilistic interpretation of our classifier, the negative log-likelihood of the training data is
$-\frac{1}{|D|}\sum_{(x,y)\in D} \log\left( \frac{\exp(s_y)}{\sum_{k=1}^C \exp(s_k)} \right)$ where $s = W x + b$.

101 Softmax classifier + cross-entropy loss Given the probabilistic interpretation of our classifier, the negative log-likelihood of the training data is
$L(D, W, b) = -\frac{1}{|D|}\sum_{(x,y)\in D} \log\left( \frac{\exp(s_y)}{\sum_{k=1}^C \exp(s_k)} \right)$ where $s = W x + b$.
Can also interpret this in terms of the cross-entropy loss:
$L(D, W, b) = \frac{1}{|D|}\sum_{(x,y)\in D} \underbrace{-\log\left( \frac{\exp(s_y)}{\sum_{k=1}^C \exp(s_k)} \right)}_{\text{cross-entropy loss for } (x,y)} = \frac{1}{|D|}\sum_{(x,y)\in D} l(x, y, W, b)$
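
And a sketch of the cross-entropy loss for one example and averaged over a small batch, computed via the log-sum-exp form of the softmax (the score vectors and labels below are illustrative):

```python
import numpy as np

def cross_entropy_loss(s, y):
    """-log softmax(s)_y for a single example, using the log-sum-exp trick."""
    z = s - np.max(s)                         # stabilise before exponentiating
    log_prob_y = z[y] - np.log(np.sum(np.exp(z)))
    return -log_prob_y

# Average loss over a toy batch of score vectors and labels (illustrative values).
S = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, -1.0]])
labels = np.array([0, 1])
print(np.mean([cross_entropy_loss(s, y) for s, y in zip(S, labels)]))
```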

102 Cross-entropy loss Let $p$ be the probability vector the network assigns to $x$ for each class: $p = \text{Softmax}(W x + b)$. The cross-entropy loss for training example $x$ with label $y$ is $l = -\log(p_y)$. [Plot: $-\log(p_y)$ as a function of $p_y$.]

103 Calculate the cross-entropy loss for a CIFAR image Input: $x$; output: $s = W x + b$; label: $y = 8$ (horse); loss: $l = -\log\left( \frac{\exp(s_y)}{\sum_k \exp(s_k)} \right)$. [Figure: the input image and the scores $s = W x + b$ for airplane, car, bird, cat, deer, dog, frog, horse, ship, truck.]

104 Calculate the cross-entropy loss for a CIFAR image Input: $x$; output: $s = W x + b$; label: $y = 8$ (horse); loss: $l = -\log\left( \frac{\exp(s_y)}{\sum_k \exp(s_k)} \right)$. Exponentiate the scores: $\exp(s)$. [Figure: the scores $s = W x + b$ and the exponentiated scores $\exp(s)$.]

105 Calculate the cross-entropy loss for a CIFAR image Input: $x$; output: $s = W x + b$; label: $y = 8$ (horse); loss: $l = -\log\left( \frac{\exp(s_y)}{\sum_k \exp(s_k)} \right)$. Exponentiate the scores and normalize: $\frac{\exp(s)}{\sum_k \exp(s_k)}$. The loss for $x$ is $-\log$ of the normalized score of the correct class. [Figure: the scores, $\exp(s)$ and the normalized scores.]

106 Cross-entropy loss
$l(x, y, W, b) = -\log\left( \frac{\exp(s_y)}{\sum_{k=1}^C \exp(s_k)} \right)$
Questions
- What is the minimum possible value of $l(x, y, W, b)$?
- What is the maximum possible value of $l(x, y, W, b)$?
- At initialization all the entries of $W$ are small $\Rightarrow$ all $s_k \approx 0$. What is the loss?
- A training point's input value is changed slightly. What happens to the loss?
- The log of zero is not defined. Could this be a problem?

107 Learning the parameters: $W, b$ Have training data $D$. Have scoring function: $s = f(x; W, b) = W x + b$. We have a choice of loss functions:
$l_{\text{softmax}}(x, y, W, b) = -\log\left( \frac{\exp(s_y)}{\sum_{k=1}^C \exp(s_k)} \right)$
$l_{\text{svm}}(x, y, W, b) = \sum_{j=1, j\ne y}^{C} \max(0, s_j - s_y + 1)$
Complete training loss:
$L(D, W, b) = \frac{1}{|D|}\sum_{(x,y)\in D} l_{\text{softmax(svm)}}(x, y, W, b) + \lambda R(W)$

108 Learning the parameters: $W, b$ Learning $W, b$ corresponds to solving the optimization problem
$W^*, b^* = \arg\min_{W,b} L(D, W, b)$
where $L(D, W, b) = \frac{1}{|D|}\sum_{(x,y)\in D} l_{\text{softmax(svm)}}(x, y, W, b) + \lambda R(W)$.
Know how to solve this! Mini-batch gradient descent. To implement mini-batch gradient descent we need
- to compute the gradients of the loss $l_{\text{softmax(svm)}}(x, y, W, b)$ and of $R(W)$,
- to set the hyper-parameters of the mini-batch gradient descent procedure.

109 Learning the parameters: $W, b$ Learning $W, b$ corresponds to solving the optimization problem
$W^*, b^* = \arg\min_{W,b} L(D, W, b)$
where $L(D, W, b) = \frac{1}{|D|}\sum_{(x,y)\in D} l_{\text{softmax(svm)}}(x, y, W, b) + \lambda R(W)$.
Know how to solve this! Mini-batch gradient descent. To implement mini-batch gradient descent we need
- to compute the gradients of the loss $l_{\text{softmax(svm)}}(x, y, W, b)$ and of $R(W)$,
- to set the hyper-parameters of the mini-batch gradient descent procedure.

110 Learning the parameters: $W, b$ Learning $W, b$ corresponds to solving the optimization problem
$W^*, b^* = \arg\min_{W,b} L(D, W, b)$
where $L(D, W, b) = \frac{1}{|D|}\sum_{(x,y)\in D} l_{\text{softmax(svm)}}(x, y, W, b) + \lambda R(W)$.
Know how to solve this! Mini-batch gradient descent. To implement mini-batch gradient descent we need
- to compute the gradients of the loss $l_{\text{softmax(svm)}}(x, y, W, b)$ and of $R(W)$,
- to set the hyper-parameters of the mini-batch gradient descent procedure.

111 Next Lecture We will cover how to compute these gradients using back-propagation.
