CSCI567 Machine Learning (Fall 2018), Prof. Haipeng Luo, University of Southern California, September 12, 2018
Administration
- GitHub repos are set up (ask TA Chi Zhang about any issues)
- HW 1 is due this Sunday (09/16) 11:59PM
- You need to submit a form if you use late days (see course website)
- Effort-based grade for written assignments: see the explanation on Piazza. Key: let us know what you have tried and how you thought. "I spent an hour and came up with nothing" = empty solution.
Outline
1 Review of Last Lecture
2 Multiclass Classification
3 Neural Nets
Review of Last Lecture: Summary
Linear models for binary classification:
Step 1. Model is the set of separating hyperplanes F = { f(x) = sgn(w^T x) | w ∈ R^D }
Step 2. Pick the surrogate loss:
- perceptron loss ℓ_perceptron(z) = max{0, −z} (used in Perceptron)
- hinge loss ℓ_hinge(z) = max{0, 1 − z} (used in SVM and many others)
- logistic loss ℓ_logistic(z) = log(1 + exp(−z)) (used in logistic regression)
Step 3. Find the empirical risk minimizer (ERM):
w* = argmin_{w ∈ R^D} F(w) = argmin_{w ∈ R^D} Σ_{n=1}^N ℓ(y_n w^T x_n)
using
- GD: w ← w − η ∇F(w)
- SGD: w ← w − η g, where g is a stochastic estimate of ∇F(w)
- Newton: w ← w − (∇²F(w))⁻¹ ∇F(w)
A probabilistic view of logistic regression
Minimizing logistic loss = MLE for the sigmoid model:
w* = argmin_w Σ_{n=1}^N ℓ_logistic(y_n w^T x_n) = argmax_w Π_{n=1}^N P(y_n | x_n; w)
where P(y | x; w) = σ(y w^T x) = 1 / (1 + e^{−y w^T x})
(figure: plot of the sigmoid function)
Outline
1 Review of Last Lecture
2 Multiclass Classification
- Multinomial logistic regression
- Reduction to binary classification
3 Neural Nets
Classification
Recall the setup:
- input (feature vector): x ∈ R^D
- output (label): y ∈ [C] = {1, 2, …, C}
- goal: learn a mapping f: R^D → [C]
Examples:
- recognizing digits (C = 10) or letters (C = 26 or 52)
- predicting weather: sunny, cloudy, rainy, etc.
- predicting image category: ImageNet dataset (C ≈ 20K)
The Nearest Neighbor Classifier naturally works for arbitrary C.
Multinomial logistic regression
Linear models: from binary to multiclass
What should a linear model look like for multiclass tasks?
Note: a linear model for binary tasks (switching from {−1, +1} to {1, 2})
f(x) = 1 if w^T x ≥ 0, and 2 if w^T x < 0
can be written as
f(x) = 1 if w_1^T x ≥ w_2^T x, and 2 if w_2^T x > w_1^T x
     = argmax_{k ∈ {1,2}} w_k^T x
for any w_1, w_2 s.t. w = w_1 − w_2.
Think of w_k^T x as a score for class k.
Example: w = (3/2, −1/6)
- Blue class: {x : w^T x ≥ 0}; orange class: {x : w^T x < 0}
The same boundary written with per-class scores, using w = w_1 − w_2 for w_1 = (1, 1/3) and w_2 = (−1/2, 1/2):
- Blue class: {x : 1 = argmax_k w_k^T x}; orange class: {x : 2 = argmax_k w_k^T x}
Adding a third weight vector w_3 = (0, 1) gives a three-class model:
- Blue class: {x : 1 = argmax_k w_k^T x}; orange class: {x : 2 = argmax_k w_k^T x}; green class: {x : 3 = argmax_k w_k^T x}
(figures omitted)
Linear models for multiclass classification
F = { f(x) = argmax_{k ∈ [C]} w_k^T x | w_1, …, w_C ∈ R^D }
  = { f(x) = argmax_{k ∈ [C]} (Wx)_k | W ∈ R^{C×D} }
How do we generalize the perceptron/hinge/logistic loss?
This lecture: focus on the more popular logistic loss.
Multinomial logistic regression: a probabilistic view
Observe: for binary logistic regression, with w = w_1 − w_2:
P(y = 1 | x; w) = σ(w^T x) = 1 / (1 + e^{−w^T x}) = e^{w_1^T x} / (e^{w_1^T x} + e^{w_2^T x})
Naturally, for multiclass:
P(y = k | x; W) = e^{w_k^T x} / Σ_{k′ ∈ [C]} e^{w_{k′}^T x}
This is called the softmax function.
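The softmax model above is a few lines of numpy. This is an illustrative sketch (not from the slides); subtracting the max score before exponentiating is a standard trick to avoid overflow and does not change the probabilities:

```python
import numpy as np

def softmax_probs(W, x):
    """P(y = k | x; W) for each class k: softmax of the scores Wx."""
    scores = W @ x                 # one score w_k^T x per class
    scores = scores - scores.max() # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()
```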
Applying MLE again
Maximize the probability of seeing labels y_1, …, y_N given x_1, …, x_N:
P(W) = Π_{n=1}^N P(y_n | x_n; W) = Π_{n=1}^N e^{w_{y_n}^T x_n} / Σ_{k ∈ [C]} e^{w_k^T x_n}
By taking the negative log, this is equivalent to minimizing
F(W) = Σ_{n=1}^N ln( Σ_{k ∈ [C]} e^{w_k^T x_n} / e^{w_{y_n}^T x_n} ) = Σ_{n=1}^N ln( 1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n} )
This is the multiclass logistic loss, a.k.a. the cross-entropy loss.
When C = 2, this is the same as the binary logistic loss.
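The loss F(W) can be computed directly via log-sum-exp, since ln(Σ_k e^{s_k} / e^{s_{y_n}}) = logsumexp(s) − s_{y_n}. A minimal sketch (not from the slides; labels are taken as 0-indexed for convenience):

```python
import numpy as np

def multiclass_logistic_loss(W, X, y):
    """F(W) = sum_n ln(1 + sum_{k != y_n} exp((w_k - w_{y_n})^T x_n)).

    W: (C, D) weights; X: (N, D) inputs; y: (N,) labels in {0, ..., C-1}.
    Uses the log-sum-exp trick for numerical stability."""
    scores = X @ W.T                               # (N, C): entry (n, k) = w_k^T x_n
    shift = scores.max(axis=1, keepdims=True)
    logsumexp = shift[:, 0] + np.log(np.exp(scores - shift).sum(axis=1))
    return float(np.sum(logsumexp - scores[np.arange(len(y)), y]))
```

With W = 0 every class gets probability 1/C, so the loss is N ln C, a useful sanity check.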
Optimization
Apply SGD: what is the gradient of
g(W) = ln( 1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n} ) ?
It's a C×D matrix. Let's focus on the k-th row.
If k ≠ y_n:
∇_{w_k} g(W) = ( e^{(w_k − w_{y_n})^T x_n} / (1 + Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n}) ) x_n^T = P(k | x_n; W) x_n^T
else (k = y_n):
∇_{w_k} g(W) = −( Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n} / (1 + Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n}) ) x_n^T = ( P(y_n | x_n; W) − 1 ) x_n^T
SGD for multinomial logistic regression
Initialize W = 0 (or randomly). Repeat:
1 pick n ∈ [N] uniformly at random
2 update the parameters:
W ← W − η [ P(y = 1 | x_n; W), …, P(y = y_n | x_n; W) − 1, …, P(y = C | x_n; W) ]^T x_n^T
i.e., the stochastic gradient stacks the rows P(y = k | x_n; W) x_n^T, with 1 subtracted from the row of the true label y_n.
Think about why the algorithm makes sense intuitively.
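The update above is the outer product of (softmax probabilities − one-hot label) with x_n. A minimal numpy sketch of one step (illustrative, not from the slides; labels 0-indexed):

```python
import numpy as np

def sgd_step(W, x, y, eta):
    """One SGD update for multinomial logistic regression.

    The stochastic gradient is (p - e_y) x^T, where p is the softmax
    probability vector and e_y is the one-hot vector of the true label y."""
    scores = W @ x
    p = np.exp(scores - scores.max())
    p /= p.sum()                     # p_k = P(y = k | x; W)
    p[y] -= 1.0                      # subtract 1 at the true label
    return W - eta * np.outer(p, x)
```

Intuitively, the scores of wrong classes are pushed down in proportion to their predicted probability, while the true class's score is pushed up by 1 − P(y | x; W).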
A note on prediction
Having learned W, we can either
- make a deterministic prediction: argmax_{k ∈ [C]} w_k^T x
- make a randomized prediction according to P(k | x; W) ∝ e^{w_k^T x}
In either case, the (expected) mistake is bounded by the logistic loss:
- deterministic: I[f(x) ≠ y] ≤ log_2( 1 + Σ_{k ≠ y} e^{(w_k − w_y)^T x} )
- randomized: E[ I[f(x) ≠ y] ] = 1 − P(y | x; W) ≤ −ln P(y | x; W)
Reduction to binary classification
Reduce multiclass to binary
Is there an even more general and simpler approach to deriving multiclass classification algorithms?
Given a binary classification algorithm (any one, not just linear methods), can we turn it into a multiclass algorithm in a black-box manner?
Yes, there are in fact many ways to do it:
- one-versus-all (one-versus-rest, one-against-all, etc.)
- one-versus-one (all-versus-all, etc.)
- Error-Correcting Output Codes (ECOC)
- tree-based reduction
One-versus-all (OvA)
Idea: train C binary classifiers to learn "is class k or not?" for each k.
Training: for each class k ∈ [C],
- relabel examples with class k as +1, and all others as −1
- train a binary classifier h_k using this new dataset
Prediction: for a new example x,
- ask each h_k: does this belong to class k? (i.e., h_k(x))
- randomly pick among all k's s.t. h_k(x) = +1
Issue: will (probably) make a mistake as long as one of the h_k errs.
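The OvA reduction treats the binary learner as a black box. A minimal sketch (my own helper names; `train_binary` stands for any binary learner mapping a dataset with ±1 labels to a predictor, and the fallback when no classifier fires is one arbitrary tie-breaking choice):

```python
import numpy as np

def ova_train(X, y, C, train_binary):
    """Train C binary classifiers, one per class: class k vs. the rest."""
    return [train_binary(X, np.where(y == k, 1, -1)) for k in range(C)]

def ova_predict(classifiers, x, rng=np.random.default_rng(0)):
    """Randomly pick among all classes whose classifier answers +1."""
    positives = [k for k, h in enumerate(classifiers) if h(x) == 1]
    if not positives:                       # no classifier claims x
        positives = list(range(len(classifiers)))
    return int(rng.choice(positives))
```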
One-versus-one (OvO)
Idea: train C(C−1)/2 binary classifiers to learn "is class k or k′?"
Training: for each pair (k, k′),
- relabel examples with class k as +1 and examples with class k′ as −1
- discard all other examples
- train a binary classifier h_(k,k′) using this new dataset
Prediction: for a new example x,
- ask each classifier h_(k,k′) to vote for either class k or k′
- predict the class with the most votes (break ties in some way)
More robust than one-versus-all, but slower in prediction.
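The pairwise training and voting can be sketched as follows (illustrative helper names, not from the slides; ties are broken toward the lowest class index, which is one of many valid choices):

```python
from itertools import combinations
import numpy as np

def ovo_train(X, y, C, train_binary):
    """Train C(C-1)/2 classifiers, one per pair of classes."""
    classifiers = {}
    for k, kp in combinations(range(C), 2):
        mask = (y == k) | (y == kp)              # keep only classes k and kp
        t = np.where(y[mask] == k, 1, -1)        # +1 for class k, -1 for class kp
        classifiers[(k, kp)] = train_binary(X[mask], t)
    return classifiers

def ovo_predict(classifiers, x, C):
    """Each pairwise classifier votes; predict the class with the most votes."""
    votes = np.zeros(C, dtype=int)
    for (k, kp), h in classifiers.items():
        votes[k if h(x) == 1 else kp] += 1
    return int(votes.argmax())
```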
Error-correcting output codes (ECOC)
Idea: based on a code M ∈ {−1, +1}^{C×L}, train L binary classifiers to learn "is bit b on or off?"
Training: for each bit b ∈ [L],
- relabel example x_n as M_{y_n, b}
- train a binary classifier h_b using this new dataset
Prediction: for a new example x,
- compute the predicted code c = (h_1(x), …, h_L(x))^T
- predict the class with the most similar code: k = argmax_k (Mc)_k
How to pick the code M?
- the more dissimilar the codes between different classes are, the better
- a random code is a good choice, but it might create hard training sets
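The decoding rule argmax_k (Mc)_k is exactly "most agreeing bits," since over ±1 entries each agreement contributes +1 and each disagreement −1 to the inner product. A one-line sketch (the example code matrix in the test is made up for illustration):

```python
import numpy as np

def ecoc_decode(M, c):
    """M: code matrix in {-1,+1}^{C x L}; c: predicted bits in {-1,+1}^L.
    Return the class whose code row agrees with c on the most bits."""
    return int(np.argmax(M @ c))
```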
Tree-based method
Idea: train binary classifiers, arranged in a tree over the classes, each learning "belongs to which half?"
Training: see pictures (split the classes into halves at each node and train a classifier for each split)
Prediction is also natural, and is very fast! (think ImageNet where C ≈ 20K)
Comparisons
In big-O notation:

Reduction   #training points   test time   remark
OvA         CN                 C           not robust
OvO         CN                 C^2         can achieve very small training error
ECOC        LN                 L           need diversity when designing code
Tree        (log2 C)N          log2 C      good for "extreme classification"
Neural Nets
Outline
1 Review of Last Lecture
2 Multiclass Classification
3 Neural Nets
- Definition
- Backpropagation
- Preventing overfitting
Linear models are not always adequate
We can use a nonlinear mapping as discussed: φ(x): x ∈ R^D → z ∈ R^M
But what kind of nonlinear mapping φ should be used? Can we actually learn this nonlinear mapping?
THE most popular nonlinear models nowadays: neural nets
Linear model as a one-layer neural net
h(a) = a recovers the linear model.
To create non-linearity, can use
- Rectified Linear Unit (ReLU): h(a) = max{0, a}
- sigmoid function: h(a) = 1 / (1 + e^{−a})
- TanH: h(a) = (e^a − e^{−a}) / (e^a + e^{−a})
- many more
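The activation functions listed above are one-liners in numpy (an illustrative sketch; all operate element-wise on arrays):

```python
import numpy as np

# Common activation functions, written as vectorized numpy functions.
relu     = lambda a: np.maximum(0.0, a)
sigmoid  = lambda a: 1.0 / (1.0 + np.exp(-a))
tanh     = lambda a: np.tanh(a)   # equals (e^a - e^-a) / (e^a + e^-a)
identity = lambda a: a            # h(a) = a recovers the linear model
```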
More output nodes
W ∈ R^{4×3}, h: R^4 → R^4, so h(a) = (h_1(a_1), h_2(a_2), h_3(a_3), h_4(a_4))
Can think of this as a nonlinear basis: Φ(x) = h(Wx)
More layers
Becomes a network:
- each node is called a neuron
- h is called the activation function
  - can use h(a) = 1 for one neuron in each layer to incorporate a bias term
  - output neurons can use h(a) = a
- #layers refers to #hidden layers (plus 1 or 2 for input/output layers)
- deep neural nets can have many layers and millions of parameters
- this is a feedforward, fully connected neural net; there are many variants
How powerful are neural nets?
Universal approximation theorem (Cybenko, 89; Hornik, 91): a feedforward neural net with a single hidden layer can approximate any continuous function.
It might need a huge number of neurons, though, and depth helps!
Designing the network architecture is important and very complicated:
- for a feedforward network, need to decide the number of hidden layers, the number of neurons at each layer, the activation functions, etc.
Math formulation
An L-layer neural net can be written as
f(x) = h_L( W_L h_{L−1}( W_{L−1} ⋯ h_1(W_1 x) ) )
To ease notation, for a given input x, define recursively
o_0 = x,  a_l = W_l o_{l−1},  o_l = h_l(a_l)  (l = 1, …, L)
where
- W_l ∈ R^{D_l × D_{l−1}} is the weight matrix for layer l
- D_0 = D, D_1, …, D_L are the numbers of neurons at each layer
- a_l ∈ R^{D_l} is the input to layer l
- o_l ∈ R^{D_l} is the output of layer l
- h_l: R^{D_l} → R^{D_l} is the activation function at layer l
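The recursion above is the forward pass. A minimal sketch (illustrative helper names, not from the slides); it returns all the intermediate a_l and o_l, since backpropagation will need them:

```python
import numpy as np

def forward(x, Ws, hs):
    """Forward pass: o_0 = x, a_l = W_l o_{l-1}, o_l = h_l(a_l).

    Ws: list of weight matrices W_1..W_L; hs: list of activations h_1..h_L.
    Returns (list of a_l, list of o_l with o_0 first)."""
    os, as_ = [x], []
    for W, h in zip(Ws, hs):
        as_.append(W @ os[-1])   # a_l = W_l o_{l-1}
        os.append(h(as_[-1]))    # o_l = h_l(a_l)
    return as_, os
```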
Learning the model
No matter how complicated the model is, our goal is the same: minimize
E(W_1, …, W_L) = Σ_{n=1}^N E_n(W_1, …, W_L)
where
E_n(W_1, …, W_L) = ‖f(x_n) − y_n‖²₂ for regression
E_n(W_1, …, W_L) = ln( 1 + Σ_{k ≠ y_n} e^{f(x_n)_k − f(x_n)_{y_n}} ) for classification
Backpropagation
How to optimize such a complicated function?
Same thing: apply SGD! Even if the model is nonconvex.
What is the gradient of this complicated function?
The chain rule is the only secret:
- for a composite function f(g(w)):  ∂f/∂w = (∂f/∂g)(∂g/∂w)
- for a composite function f(g_1(w), …, g_d(w)):  ∂f/∂w = Σ_{i=1}^d (∂f/∂g_i)(∂g_i/∂w)
- the simplest example: f(g_1(w), g_2(w)) = g_1(w) g_2(w)
Computing the derivative
Drop the layer subscript l for simplicity. Find the derivative of E_n w.r.t. w_ij:
∂E_n/∂w_ij = (∂E_n/∂a_i)(∂a_i/∂w_ij) = (∂E_n/∂a_i) ∂(w_ij o_j)/∂w_ij = (∂E_n/∂a_i) o_j
∂E_n/∂a_i = (∂E_n/∂o_i)(∂o_i/∂a_i) = ( Σ_k (∂E_n/∂a_k)(∂a_k/∂o_i) ) h′_i(a_i) = ( Σ_k (∂E_n/∂a_k) w_ki ) h′_i(a_i)
where the sum over k ranges over the neurons of the next layer.
Computing the derivative

Adding the subscript for the layer:

$\frac{\partial E_n}{\partial w_{l,ij}} = \frac{\partial E_n}{\partial a_{l,i}}\, o_{l-1,j}$

$\frac{\partial E_n}{\partial a_{l,i}} = \left( \sum_k \frac{\partial E_n}{\partial a_{l+1,k}}\, w_{l+1,ki} \right) h_{l,i}'(a_{l,i})$

For the last layer, with square loss:

$\frac{\partial E_n}{\partial a_{L,i}} = \frac{\partial (h_{L,i}(a_{L,i}) - y_{n,i})^2}{\partial a_{L,i}} = 2(h_{L,i}(a_{L,i}) - y_{n,i})\, h_{L,i}'(a_{L,i})$

Exercise: try to do it for the logistic loss yourself.
Computing the derivative

Using matrix notation greatly simplifies presentation and implementation:

$\frac{\partial E_n}{\partial W_l} = \frac{\partial E_n}{\partial a_l}\, o_{l-1}^{\mathrm{T}}$

$\frac{\partial E_n}{\partial a_l} = \begin{cases} \left( W_{l+1}^{\mathrm{T}} \frac{\partial E_n}{\partial a_{l+1}} \right) \circ h_l'(a_l) & \text{if } l < L \\ 2(h_L(a_L) - y_n) \circ h_L'(a_L) & \text{else} \end{cases}$

where $v_1 \circ v_2 = (v_{11} v_{21}, \ldots, v_{1D} v_{2D})$ is the element-wise product (a.k.a. Hadamard product). Verify yourself!
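One way to "verify yourself" is to code the matrix formulas for a tiny two-layer network and compare one entry of the gradient against a finite difference. This is an illustrative sketch (the network sizes, tanh activation, and variable names are choices made here, not fixed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# tiny net with L = 2 layers, tanh activations, square loss
def h(a): return np.tanh(a)
def h_prime(a): return 1 - np.tanh(a)**2

x = rng.standard_normal(3)            # input o_0
y = rng.standard_normal(2)            # target y_n
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))

def loss(W1, W2):
    return np.sum((h(W2 @ h(W1 @ x)) - y)**2)

# forward pass
a1 = W1 @ x; o1 = h(a1)
a2 = W2 @ o1; o2 = h(a2)
# backward pass, following the boxed formulas
d_a2 = 2 * (o2 - y) * h_prime(a2)     # last layer (l = L)
d_a1 = (W2.T @ d_a2) * h_prime(a1)    # l < L
g_W2 = np.outer(d_a2, o1)             # dE_n/dW_l = (dE_n/da_l) o_{l-1}^T
g_W1 = np.outer(d_a1, x)

# finite-difference check on a single weight entry
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (loss(W1p, W2) - loss(W1m, W2)) / (2 * eps)
```

The outer product `np.outer(d_a, o)` is exactly the matrix form $\frac{\partial E_n}{\partial a_l} o_{l-1}^{\mathrm{T}}$.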
Putting everything into SGD

The backpropagation algorithm (Backprop)

Initialize $W_1, \ldots, W_L$ (all zeros or randomly). Repeat:

1. randomly pick one data point $n \in [N]$
2. forward propagation: for each layer $l = 1, \ldots, L$, compute $a_l = W_l o_{l-1}$ and $o_l = h_l(a_l)$ (with $o_0 = x_n$)
3. backward propagation: for each $l = L, \ldots, 1$, compute

$\frac{\partial E_n}{\partial a_l} = \begin{cases} \left( W_{l+1}^{\mathrm{T}} \frac{\partial E_n}{\partial a_{l+1}} \right) \circ h_l'(a_l) & \text{if } l < L \\ 2(h_L(a_L) - y_n) \circ h_L'(a_L) & \text{else} \end{cases}$

and update the weights

$W_l \leftarrow W_l - \eta \frac{\partial E_n}{\partial W_l} = W_l - \eta \frac{\partial E_n}{\partial a_l} o_{l-1}^{\mathrm{T}}$

Think about how to do the last two steps properly!
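The full Backprop-plus-SGD loop above can be sketched end to end on toy data. Everything below (the toy regression targets, network width, learning rate, step count) is an illustrative choice, not the slides' setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy data: 50 inputs in R^3, scalar targets in [-1, 1]
X = rng.standard_normal((50, 3))
Y = np.sin(X @ np.array([1.0, -2.0, 0.5]))[:, None]

W = [rng.standard_normal((5, 3)) * 0.5,   # W_1
     rng.standard_normal((1, 5)) * 0.5]   # W_2
h = np.tanh
h_prime = lambda a: 1 - np.tanh(a)**2
eta = 0.05

def mean_loss():
    return np.mean([(h(W[1] @ h(W[0] @ x)) - y)**2 for x, y in zip(X, Y)])

loss_before = mean_loss()
for _ in range(2000):
    n = rng.integers(len(X))          # 1. pick one data point
    x, y = X[n], Y[n]
    a1 = W[0] @ x; o1 = h(a1)         # 2. forward propagation
    a2 = W[1] @ o1; o2 = h(a2)
    d2 = 2 * (o2 - y) * h_prime(a2)   # 3. backward propagation
    d1 = (W[1].T @ d2) * h_prime(a1)
    W[1] -= eta * np.outer(d2, o1)    #    weight updates
    W[0] -= eta * np.outer(d1, x)
loss_after = mean_loss()
```

After training, the mean square loss over the toy data should have dropped noticeably from its initial value.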
More tricks to optimize neural nets

Many variants based on backprop:
SGD with minibatch: randomly sample a batch of examples to form a stochastic gradient
SGD with momentum
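The minibatch variant just averages per-example gradients over a random batch. A minimal sketch, using plain linear least squares as a stand-in for $\partial E_n / \partial w$ (the data, batch size, and learning rate are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, 2.0, 3.0])   # noise-free linear targets
w = np.zeros(3)

def minibatch_grad(w, idx):
    # average gradient of the squared error over the sampled batch
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

for _ in range(500):
    batch = rng.choice(len(X), size=10, replace=False)  # random minibatch
    w -= 0.05 * minibatch_grad(w, batch)
```

With noise-free targets the minibatch noise vanishes at the optimum, so `w` converges to the true coefficients.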
SGD with momentum

Initialize $w_0$ and velocity $v = 0$. For $t = 1, 2, \ldots$:
form a stochastic gradient $g_t$
update velocity: $v \leftarrow \alpha v - \eta g_t$ for some discount factor $\alpha \in (0, 1)$
update weight: $w_t \leftarrow w_{t-1} + v$

Updates for the first few rounds:
$w_1 = w_0 - \eta g_1$
$w_2 = w_1 - \alpha \eta g_1 - \eta g_2$
$w_3 = w_2 - \alpha^2 \eta g_1 - \alpha \eta g_2 - \eta g_3$
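The unrolled updates can be confirmed by running the velocity recursion directly. A small sketch with made-up gradient vectors (the values of $\eta$, $\alpha$, and the $g_t$ are arbitrary):

```python
import numpy as np

eta, alpha = 0.1, 0.9
g = [np.array([1.0, -2.0]), np.array([0.5, 0.5]), np.array([-1.0, 3.0])]

w = np.zeros(2)   # w_0
v = np.zeros(2)   # velocity
ws = []
for gt in g:
    v = alpha * v - eta * gt   # velocity update
    w = w + v                  # weight update
    ws.append(w.copy())

# unrolled form from the slide: w_3 = w_2 - eta*(alpha^2 g_1 + alpha g_2 + g_3)
expected_w3 = ws[1] - eta * (alpha**2 * g[0] + alpha * g[1] + g[2])
```

This makes the "discount" interpretation concrete: each past gradient keeps influencing the update, damped geometrically by $\alpha$.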
Neural Nets Preventing overfitting

Overfitting

Overfitting is very likely since the models are so powerful. Methods to overcome overfitting:
data augmentation
regularization
dropout
early stopping
Data augmentation

Data: the more the better. How do we get more data? Exploit prior knowledge to add more training data.
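For images, "prior knowledge" often means label-preserving transforms: a flipped or slightly shifted digit depicts the same class. A small illustrative sketch (the particular transforms and the `augment` helper are choices made here):

```python
import numpy as np

def augment(image):
    """Return the original image plus simple label-preserving variants."""
    rng = np.random.default_rng(0)
    variants = [image,
                np.fliplr(image),                 # horizontal flip
                np.roll(image, 1, axis=1),        # 1-pixel horizontal shift
                image + 0.01 * rng.standard_normal(image.shape)]  # mild noise
    return variants

img = np.arange(16.0).reshape(4, 4)
batch = augment(img)   # one example becomes four
```

Each variant keeps the label of the original, so the training set grows for free.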
Regularization

L2 regularization: minimize

$E'(W_1, \ldots, W_L) = E(W_1, \ldots, W_L) + \lambda \sum_{l=1}^{L} \|W_l\|_2^2$

Simple change to the gradient:

$\frac{\partial E'}{\partial w_{ij}} = \frac{\partial E}{\partial w_{ij}} + 2\lambda w_{ij}$

This introduces a weight decay effect.
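The "weight decay" reading follows from one line of algebra: a gradient step on the regularized objective equals shrinking the weights by a factor $(1 - 2\eta\lambda)$ and then taking a plain gradient step. A sketch with made-up numbers:

```python
import numpy as np

lam, eta = 0.01, 0.1
w = np.array([1.0, -2.0])
plain_grad = np.array([0.3, 0.3])   # illustrative dE/dw at this point

# step on the regularized objective
reg_grad = plain_grad + 2 * lam * w
w_new = w - eta * reg_grad

# equivalent "weight decay" form: shrink w, then take a plain gradient step
w_decay = (1 - 2 * eta * lam) * w - eta * plain_grad
```

The two updates coincide exactly, which is why L2 regularization and weight decay are used interchangeably for plain SGD.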
Dropout

Randomly delete neurons during training. Very effective; makes training faster as well.
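A common way to implement "randomly delete neurons" is inverted dropout: zero each activation with probability $p$ during training and rescale the survivors by $1/(1-p)$ so the expected activation is unchanged, letting the network run as-is at test time. This sketch is one standard formulation, not necessarily the slides' exact variant:

```python
import numpy as np

def dropout(o, p, rng):
    """Zero each entry of activation vector o with prob. p; rescale survivors."""
    mask = (rng.random(o.shape) >= p).astype(o.dtype)
    return o * mask / (1 - p)

rng = np.random.default_rng(0)
o = np.ones(100000)
dropped = dropout(o, p=0.5, rng=rng)   # about half the units are deleted
```

With $p = 0.5$, surviving units are scaled to 2.0, so the mean activation stays near 1.0 despite half the units being zeroed.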
Early stopping

Stop training when the performance on the validation set stops improving.

[Figure: loss (negative log-likelihood) vs. time (epochs); the training set loss keeps decreasing while the validation set loss bottoms out and then rises.]
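A common concrete rule (one of several; the `patience` criterion and the loss values below are illustrative) is to stop once the validation loss has not improved for a fixed number of epochs and keep the weights from the best epoch:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch whose weights to keep under a patience-based rule."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs: stop training
    return best_epoch

# made-up validation losses: improve, bottom out at epoch 3, then degrade
val = [0.9, 0.5, 0.3, 0.25, 0.26, 0.27, 0.3, 0.31]
stop_at = early_stop_epoch(val, patience=3)
```

Here training stops after epoch 6 and the weights saved at epoch 3, where validation loss was lowest, are kept.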
Conclusions for neural nets

Deep neural networks are hugely popular, achieving the best performance on many problems, but they:
do need a lot of data to work well
take a lot of time to train (need GPUs for massively parallel computing)
take some work to select the architecture and hyperparameters
are still not well understood in theory