CSCI567 Machine Learning (Fall 2018)


CSCI567 Machine Learning (Fall 2018), Prof. Haipeng Luo, University of Southern California. September 12, 2018.

Administration
GitHub repos are set up (ask TA Chi Zhang about any issues).
HW 1 is due this Sunday (09/16) at 11:59 PM. You need to submit a form if you use late days (see the course website).
Effort-based grade for written assignments: see the explanation on Piazza. Key: let us know what you have tried and how you thought about it; "I spent an hour and came up with nothing" counts as an empty solution.

Outline
1 Review of Last Lecture
2 Multiclass Classification
3 Neural Nets

Review of Last Lecture / Summary
Linear models for binary classification:
Step 1. The model is the set of separating hyperplanes $\mathcal{F} = \{f(x) = \mathrm{sgn}(w^T x) \mid w \in \mathbb{R}^D\}$.

Review of Last Lecture
Step 2. Pick a surrogate loss:
perceptron loss $\ell_{\text{perceptron}}(z) = \max\{0, -z\}$ (used in the Perceptron)
hinge loss $\ell_{\text{hinge}}(z) = \max\{0, 1 - z\}$ (used in the SVM and many others)
logistic loss $\ell_{\text{logistic}}(z) = \log(1 + \exp(-z))$ (used in logistic regression)

Review of Last Lecture
Step 3. Find the empirical risk minimizer (ERM):
$w^* = \arg\min_{w \in \mathbb{R}^D} F(w) = \arg\min_{w \in \mathbb{R}^D} \sum_{n=1}^N \ell(y_n w^T x_n)$
using GD: $w \leftarrow w - \eta \nabla F(w)$
SGD: $w \leftarrow w - \eta \tilde{\nabla} F(w)$ (a stochastic gradient)
Newton: $w \leftarrow w - \left(\nabla^2 F(w)\right)^{-1} \nabla F(w)$

Review of Last Lecture / A probabilistic view of logistic regression
Minimizing the logistic loss = MLE for the sigmoid model:
$w^* = \arg\min_w \sum_{n=1}^N \ell_{\text{logistic}}(y_n w^T x_n) = \arg\max_w \prod_{n=1}^N P(y_n \mid x_n; w)$, where $P(y \mid x; w) = \sigma(y w^T x) = \frac{1}{1 + e^{-y w^T x}}$
[Figure: plot of the sigmoid function $\sigma(z)$ over $z \in [-6, 6]$.]
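For concreteness, here is a small sketch (mine, not from the slides) of the sigmoid and the corresponding logistic loss on labels $y_n \in \{-1, +1\}$:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    """Sum over n of log(1 + exp(-y_n w^T x_n)), with labels y_n in {-1, +1}."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins)))
```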

Outline
1 Review of Last Lecture
2 Multiclass Classification: multinomial logistic regression; reduction to binary classification
3 Neural Nets

Classification
Recall the setup:
input (feature vector): $x \in \mathbb{R}^D$
output (label): $y \in [C] = \{1, 2, \ldots, C\}$
goal: learn a mapping $f: \mathbb{R}^D \to [C]$
Examples: recognizing digits ($C = 10$) or letters ($C = 26$ or $52$); predicting the weather: sunny, cloudy, rainy, etc.; predicting the image category: ImageNet dataset ($C \approx 20$K).
The nearest neighbor classifier naturally works for arbitrary $C$.

Multinomial logistic regression / Linear models: from binary to multiclass
What should a linear model look like for multiclass tasks?
Note: a linear model for binary tasks (switching from $\{-1, +1\}$ to $\{1, 2\}$),
$f(x) = \begin{cases} 1 & \text{if } w^T x \ge 0 \\ 2 & \text{if } w^T x < 0 \end{cases}$,
can be written as
$f(x) = \begin{cases} 1 & \text{if } w_1^T x \ge w_2^T x \\ 2 & \text{if } w_2^T x > w_1^T x \end{cases} = \arg\max_{k \in \{1,2\}} w_k^T x$
for any $w_1, w_2$ such that $w = w_1 - w_2$.
Think of $w_k^T x$ as a score for class $k$.

Multinomial logistic regression / Linear models: from binary to multiclass
[Figure: a binary separator $w$ rewritten as a score comparison with $w = w_1 - w_2$; blue class $\{x : 1 = \arg\max_k w_k^T x\}$, orange class $\{x : 2 = \arg\max_k w_k^T x\}$.]
[Figure: adding a third weight vector $w_3$ partitions the plane into three regions; blue class $\{x : 1 = \arg\max_k w_k^T x\}$, orange class $\{x : 2 = \arg\max_k w_k^T x\}$, green class $\{x : 3 = \arg\max_k w_k^T x\}$.]

Multinomial logistic regression / Linear models for multiclass classification
$\mathcal{F} = \left\{f(x) = \arg\max_{k \in [C]} w_k^T x \mid w_1, \ldots, w_C \in \mathbb{R}^D\right\} = \left\{f(x) = \arg\max_{k \in [C]} (Wx)_k \mid W \in \mathbb{R}^{C \times D}\right\}$
How do we generalize the perceptron/hinge/logistic loss?
This lecture focuses on the more popular logistic loss.
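To make this model class concrete, here is a minimal prediction sketch with a stacked weight matrix $W \in \mathbb{R}^{C \times D}$; the function name is my own, and labels are 0-indexed for convenience:

```python
import numpy as np

def predict_multiclass(W, X):
    """Predict a label in {0, ..., C-1} for each row of X via argmax_k (W x)_k.

    W: (C, D) weight matrix, one row of weights per class.
    X: (N, D) feature matrix.
    """
    scores = X @ W.T               # (N, C): score of every class for every example
    return np.argmax(scores, axis=1)
```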

Multinomial logistic regression / A probabilistic view
Observe: for binary logistic regression, with $w = w_1 - w_2$,
$P(y = 1 \mid x; w) = \sigma(w^T x) = \frac{1}{1 + e^{-w^T x}} = \frac{e^{w_1^T x}}{e^{w_1^T x} + e^{w_2^T x}}$
Naturally, for multiclass:
$P(y = k \mid x; W) = \frac{e^{w_k^T x}}{\sum_{k' \in [C]} e^{w_{k'}^T x}}$
This is called the softmax function.
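A short sketch of the softmax function (my own code); subtracting the maximum score before exponentiating is a standard trick to avoid numerical overflow and does not change the result:

```python
import numpy as np

def softmax(scores):
    """Map a vector of class scores w_k^T x to probabilities P(y = k | x; W)."""
    shifted = scores - np.max(scores)    # for numerical stability only
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

print(softmax(np.array([2.0, 1.0, -1.0])))   # roughly [0.71, 0.26, 0.04]
```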

Multinomial logistic regression / Applying MLE again
Maximize the probability of seeing labels $y_1, \ldots, y_N$ given $x_1, \ldots, x_N$:
$P(W) = \prod_{n=1}^N P(y_n \mid x_n; W) = \prod_{n=1}^N \frac{e^{w_{y_n}^T x_n}}{\sum_{k \in [C]} e^{w_k^T x_n}}$
Taking the negative log, this is equivalent to minimizing
$F(W) = \sum_{n=1}^N \ln\left(\frac{\sum_{k \in [C]} e^{w_k^T x_n}}{e^{w_{y_n}^T x_n}}\right) = \sum_{n=1}^N \ln\left(1 + \sum_{k \neq y_n} e^{(w_k - w_{y_n})^T x_n}\right)$
This is the multiclass logistic loss, a.k.a. the cross-entropy loss. When $C = 2$, it coincides with the binary logistic loss.
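The loss $F(W)$ can be computed directly from the scores, since each term equals $\ln\sum_k e^{w_k^T x_n} - w_{y_n}^T x_n$. A sketch under my own naming conventions (0-indexed labels):

```python
import numpy as np

def cross_entropy_loss(W, X, y):
    """F(W) = sum_n [ ln(sum_k e^{w_k^T x_n}) - w_{y_n}^T x_n ].

    W: (C, D), X: (N, D), y: (N,) integer labels in {0, ..., C-1}.
    """
    scores = X @ W.T                                          # (N, C)
    max_s = scores.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(scores - max_s).sum(axis=1)) + max_s[:, 0]   # stable logsumexp
    return np.sum(log_norm - scores[np.arange(len(y)), y])
```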

Multinomial logistic regression / Optimization
Apply SGD: what is the gradient of $g(W) = \ln\left(1 + \sum_{k \neq y_n} e^{(w_k - w_{y_n})^T x_n}\right)$?
It is a $C \times D$ matrix. Focus on its $k$-th row:
if $k \neq y_n$: $\nabla_{w_k} g(W) = \frac{e^{(w_k - w_{y_n})^T x_n}}{1 + \sum_{k' \neq y_n} e^{(w_{k'} - w_{y_n})^T x_n}}\, x_n^T = P(k \mid x_n; W)\, x_n^T$
else: $\nabla_{w_{y_n}} g(W) = \frac{-\sum_{k' \neq y_n} e^{(w_{k'} - w_{y_n})^T x_n}}{1 + \sum_{k' \neq y_n} e^{(w_{k'} - w_{y_n})^T x_n}}\, x_n^T = \left(P(y_n \mid x_n; W) - 1\right) x_n^T$

Multinomial logistic regression / SGD for multinomial logistic regression
Initialize $W = 0$ (or randomly). Repeat:
1 pick $n \in [N]$ uniformly at random
2 update the parameters
$W \leftarrow W - \eta \begin{pmatrix} P(y = 1 \mid x_n; W) \\ \vdots \\ P(y = y_n \mid x_n; W) - 1 \\ \vdots \\ P(y = C \mid x_n; W) \end{pmatrix} x_n^T$
Think about why the algorithm makes sense intuitively.
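The update above fits in a few lines of NumPy. A minimal sketch (my own naming, 0-indexed labels), not meant as an efficient implementation:

```python
import numpy as np

def sgd_multinomial_logreg(X, y, C, eta=0.1, steps=10000, seed=0):
    """SGD for multinomial logistic regression: W <- W - eta * (p - e_{y_n}) x_n^T."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = np.zeros((C, D))
    for _ in range(steps):
        n = rng.integers(N)               # pick an example uniformly at random
        scores = W @ X[n]                 # (C,)
        p = np.exp(scores - scores.max())
        p /= p.sum()                      # p[k] = P(y = k | x_n; W)
        p[y[n]] -= 1.0                    # the y_n-th entry becomes P(y_n | x_n; W) - 1
        W -= eta * np.outer(p, X[n])      # gradient of the single-example loss
    return W
```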

Multinomial logistic regression / A note on prediction
Having learned $W$, we can either
make a deterministic prediction: $\arg\max_{k \in [C]} w_k^T x$, or
make a randomized prediction according to $P(k \mid x; W) \propto e^{w_k^T x}$.
In either case, the (expected) mistake is bounded by the logistic loss:
deterministic: $\mathbb{I}[f(x) \neq y] \le \log_2\left(1 + \sum_{k \neq y} e^{(w_k - w_y)^T x}\right)$
randomized: $\mathbb{E}\left[\mathbb{I}[f(x) \neq y]\right] = 1 - P(y \mid x; W) \le -\ln P(y \mid x; W)$

Reduction to binary classification / Reduce multiclass to binary
Is there an even more general and simpler approach to deriving multiclass classification algorithms?
Given a binary classification algorithm (any one, not just linear methods), can we turn it into a multiclass algorithm in a black-box manner?
Yes, there are in fact many ways to do it:
one-versus-all (one-versus-rest, one-against-all, etc.)
one-versus-one (all-versus-all, etc.)
error-correcting output codes (ECOC)
tree-based reduction

Reduction to binary classification / One-versus-all (OvA)
Idea: train $C$ binary classifiers to learn "is it class $k$ or not?" for each $k$.
Training: for each class $k \in [C]$,
relabel examples with class $k$ as $+1$ and all others as $-1$
train a binary classifier $h_k$ using this new dataset

Reduction to binary classification / One-versus-all (OvA)
Prediction: for a new example $x$,
ask each $h_k$: does this belong to class $k$? (i.e. compute $h_k(x)$)
randomly pick among all $k$'s such that $h_k(x) = +1$
Issue: OvA will (probably) make a mistake as soon as one of the $h_k$'s errs.
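A compact OvA sketch on top of a black-box binary learner; scikit-learn's LogisticRegression is used purely as a stand-in for "any binary classification algorithm", and the helper names are mine:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ova(X, y, C):
    """Train C binary classifiers h_k, each answering 'is it class k or not?'."""
    return [LogisticRegression().fit(X, np.where(y == k, 1, -1)) for k in range(C)]

def predict_ova(classifiers, x, rng):
    """Randomly pick among the classes whose classifier says +1 (else pick any class)."""
    positives = [k for k, h in enumerate(classifiers) if h.predict(x.reshape(1, -1))[0] == 1]
    return rng.choice(positives) if positives else rng.integers(len(classifiers))
```

Pass a NumPy generator, e.g. `np.random.default_rng(0)`, as `rng` so the random tie-breaking is reproducible.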

Reduction to binary classification / One-versus-one (OvO)
Idea: train $\binom{C}{2}$ binary classifiers to learn "is it class $k$ or $k'$?".
Training: for each pair $(k, k')$,
relabel examples with class $k$ as $+1$ and examples with class $k'$ as $-1$
discard all other examples
train a binary classifier $h_{(k,k')}$ using this new dataset

Reduction to binary classification / One-versus-one (OvO)
Prediction: for a new example $x$,
ask each classifier $h_{(k,k')}$ to vote for either class $k$ or $k'$
predict the class with the most votes (breaking ties in some way)
More robust than one-versus-all, but slower at prediction time.

Reduction to binary classification / Error-correcting output codes (ECOC)
Idea: based on a code matrix $M \in \{-1, +1\}^{C \times L}$, train $L$ binary classifiers to learn "is bit $b$ on or off?".
Training: for each bit $b \in [L]$,
relabel each example $x_n$ as $M_{y_n, b}$
train a binary classifier $h_b$ using this new dataset

Reduction to binary classification / Error-correcting output codes (ECOC)
Prediction: for a new example $x$,
compute the predicted code $c = (h_1(x), \ldots, h_L(x))^T$
predict the class with the most similar code: $k = \arg\max_k (Mc)_k$
How to pick the code $M$? The more dissimilar the codes of different classes are, the better. A random code is a good choice, but it might create hard training sets.
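A sketch of ECOC with a random code matrix, again treating the binary learner as a black box (the helper names are mine). One caveat: if a random column of $M$ happens to be constant, the corresponding binary problem has a single class, so a practical implementation would re-draw such columns:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ecoc(X, y, C, L, seed=0):
    """Draw a random code M in {-1, +1}^{C x L} and train one binary classifier per bit."""
    rng = np.random.default_rng(seed)
    M = rng.choice([-1, 1], size=(C, L))
    classifiers = [LogisticRegression().fit(X, M[y, b]) for b in range(L)]
    return M, classifiers

def predict_ecoc(M, classifiers, x):
    """Predict the class whose code row agrees most with the predicted code c."""
    c = np.array([h.predict(x.reshape(1, -1))[0] for h in classifiers])
    return np.argmax(M @ c)
```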

Reduction to binary classification / Tree-based method
Idea: train $C$ binary classifiers to learn "belongs to which half?".
Training: see the pictures in the original slides.
Prediction is also natural, and it is very fast (think ImageNet, where $C \approx 20$K).

Reduction to binary classification / Comparisons
In big-O notation:

Reduction | #training points | test time | remark
OvA       | CN               | C         | not robust
OvO       | CN               | C^2       | can achieve very small training error
ECOC      | LN               | L         | need diversity when designing the code
Tree      | (log2 C)N        | log2 C    | good for extreme classification

Outline
1 Review of Last Lecture
2 Multiclass Classification
3 Neural Nets: definition; backpropagation; preventing overfitting

Neural Nets / Definition / Linear models are not always adequate
We can use a nonlinear mapping as discussed: $\phi(x): x \in \mathbb{R}^D \to z \in \mathbb{R}^M$.
But what kind of nonlinear mapping $\phi$ should be used? Can we actually learn this nonlinear mapping?
THE most popular nonlinear models nowadays: neural nets.

Neural Nets / Definition / Linear model as a one-layer neural net
$h(a) = a$ gives the linear model.
To create non-linearity, we can use
Rectified Linear Unit (ReLU): $h(a) = \max\{0, a\}$
sigmoid function: $h(a) = \frac{1}{1 + e^{-a}}$
TanH: $h(a) = \frac{e^a - e^{-a}}{e^a + e^{-a}}$
and many more.
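The three activations listed above, written out in NumPy (an illustrative sketch; the function names are mine):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)           # max{0, a}, element-wise

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))     # 1 / (1 + e^{-a})

def tanh(a):
    return np.tanh(a)                   # (e^a - e^{-a}) / (e^a + e^{-a})
```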

Neural Nets / Definition / More output nodes
$W \in \mathbb{R}^{4 \times 3}$ and $h: \mathbb{R}^4 \to \mathbb{R}^4$, so $h(a) = (h_1(a_1), h_2(a_2), h_3(a_3), h_4(a_4))$.
We can think of this as a nonlinear basis: $\Phi(x) = h(Wx)$.

Neural Nets / Definition / More layers
Becomes a network:
each node is called a neuron
$h$ is called the activation function
one can use $h(a) = 1$ for one neuron in each layer to incorporate a bias term
the output neuron can use $h(a) = a$
#layers refers to the number of hidden layers (plus 1 or 2 for the input/output layers)
deep neural nets can have many layers and millions of parameters
this is a feedforward, fully connected neural net; there are many variants

Neural Nets / Definition / How powerful are neural nets?
Universal approximation theorem (Cybenko, 89; Hornik, 91): a feedforward neural net with a single hidden layer can approximate any continuous function.
It might need a huge number of neurons, though, and depth helps!
Designing the network architecture is important and very complicated. Even for a feedforward network, one needs to decide the number of hidden layers, the number of neurons at each layer, the activation functions, etc.

Neural Nets / Definition / Math formulation
An $L$-layer neural net can be written as
$f(x) = h_L(W_L\, h_{L-1}(W_{L-1} \cdots h_1(W_1 x)))$
To ease notation, for a given input $x$, define recursively
$o_0 = x, \quad a_l = W_l o_{l-1}, \quad o_l = h_l(a_l) \quad (l = 1, \ldots, L)$
where
$W_l \in \mathbb{R}^{D_l \times D_{l-1}}$ is the weight matrix for layer $l$
$D_0 = D, D_1, \ldots, D_L$ are the numbers of neurons at each layer
$a_l \in \mathbb{R}^{D_l}$ is the input to layer $l$
$o_l \in \mathbb{R}^{D_l}$ is the output of layer $l$
$h_l: \mathbb{R}^{D_l} \to \mathbb{R}^{D_l}$ is the activation function at layer $l$
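The recursion $o_0 = x$, $a_l = W_l o_{l-1}$, $o_l = h_l(a_l)$ translates directly into a forward pass. A minimal sketch (my own function names), usable with any of the activation functions above:

```python
import numpy as np

def forward(x, weights, activations):
    """Compute the network output o_L.

    weights: list [W_1, ..., W_L], where W_l has shape (D_l, D_{l-1}).
    activations: list [h_1, ..., h_L] of element-wise functions.
    """
    o = x                                # o_0 = x
    for W, h in zip(weights, activations):
        o = h(W @ o)                     # a_l = W_l o_{l-1}, o_l = h_l(a_l)
    return o
```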

Neural Nets / Definition / Learning the model
No matter how complicated the model is, our goal is the same: minimize
$E(W_1, \ldots, W_L) = \sum_{n=1}^N E_n(W_1, \ldots, W_L)$
where
$E_n(W_1, \ldots, W_L) = \begin{cases} \|f(x_n) - y_n\|_2^2 & \text{for regression} \\ \ln\left(1 + \sum_{k \neq y_n} e^{f(x_n)_k - f(x_n)_{y_n}}\right) & \text{for classification} \end{cases}$

Neural Nets / Backpropagation / How to optimize such a complicated function?
Same thing: apply SGD, even though the objective is nonconvex.
What is the gradient of this complicated function? The chain rule is the only secret:
for a composite function $f(g(w))$: $\frac{\partial f}{\partial w} = \frac{\partial f}{\partial g}\frac{\partial g}{\partial w}$
for a composite function $f(g_1(w), \ldots, g_d(w))$: $\frac{\partial f}{\partial w} = \sum_{i=1}^d \frac{\partial f}{\partial g_i}\frac{\partial g_i}{\partial w}$
the simplest example: $f(g_1(w), g_2(w)) = g_1(w)\, g_2(w)$

Neural Nets / Backpropagation / Computing the derivative
Drop the layer subscript $l$ for simplicity, and find the derivative of $E_n$ with respect to $w_{ij}$:
$\frac{\partial E_n}{\partial w_{ij}} = \frac{\partial E_n}{\partial a_i}\frac{\partial a_i}{\partial w_{ij}} = \frac{\partial E_n}{\partial a_i}\frac{\partial (w_{ij} o_j)}{\partial w_{ij}} = \frac{\partial E_n}{\partial a_i}\, o_j$
$\frac{\partial E_n}{\partial a_i} = \frac{\partial E_n}{\partial o_i}\frac{\partial o_i}{\partial a_i} = \left(\sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial o_i}\right) h_i'(a_i) = \left(\sum_k \frac{\partial E_n}{\partial a_k}\, w_{ki}\right) h_i'(a_i)$

Neural Nets / Backpropagation / Computing the derivative
Adding the layer subscript back:
$\frac{\partial E_n}{\partial w_{l,ij}} = \frac{\partial E_n}{\partial a_{l,i}}\, o_{l-1,j}, \qquad \frac{\partial E_n}{\partial a_{l,i}} = \left(\sum_k \frac{\partial E_n}{\partial a_{l+1,k}}\, w_{l+1,ki}\right) h_{l,i}'(a_{l,i})$
For the last layer, with the square loss,
$\frac{\partial E_n}{\partial a_{L,i}} = \frac{\partial\, (h_{L,i}(a_{L,i}) - y_{n,i})^2}{\partial a_{L,i}} = 2\,(h_{L,i}(a_{L,i}) - y_{n,i})\, h_{L,i}'(a_{L,i})$
Exercise: try to do it for the logistic loss yourself.

Neural Nets / Backpropagation / Computing the derivative
Using matrix notation greatly simplifies the presentation and implementation:
$\frac{\partial E_n}{\partial W_l} = \frac{\partial E_n}{\partial a_l}\, o_{l-1}^T, \qquad \frac{\partial E_n}{\partial a_l} = \begin{cases} \left(W_{l+1}^T \frac{\partial E_n}{\partial a_{l+1}}\right) \circ h_l'(a_l) & \text{if } l < L \\ 2\,(h_L(a_L) - y_n) \circ h_L'(a_L) & \text{else} \end{cases}$
where $v_1 \circ v_2 = (v_{11} v_{21}, \ldots, v_{1D} v_{2D})$ is the element-wise product (a.k.a. Hadamard product). Verify this yourself!

Neural Nets / Backpropagation / Putting everything into SGD
The backpropagation algorithm (Backprop): initialize $W_1, \ldots, W_L$ (all 0 or randomly), then repeat:
1 randomly pick one data point $n \in [N]$
2 forward propagation: for each layer $l = 1, \ldots, L$, compute $a_l = W_l o_{l-1}$ and $o_l = h_l(a_l)$ (with $o_0 = x_n$)
3 backward propagation: for each $l = L, \ldots, 1$, compute
$\frac{\partial E_n}{\partial a_l} = \begin{cases} \left(W_{l+1}^T \frac{\partial E_n}{\partial a_{l+1}}\right) \circ h_l'(a_l) & \text{if } l < L \\ 2\,(h_L(a_L) - y_n) \circ h_L'(a_L) & \text{else} \end{cases}$
and update the weights
$W_l \leftarrow W_l - \eta \frac{\partial E_n}{\partial W_l} = W_l - \eta \frac{\partial E_n}{\partial a_l}\, o_{l-1}^T$
Think about how to do the last two steps properly!
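Below is a compact sketch of one Backprop/SGD step for the square loss, following the matrix formulas above. It is my own illustrative code, and it assumes every layer uses the sigmoid activation, so that $h_l'(a_l) = o_l \circ (1 - o_l)$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(weights, x, y, eta=0.1):
    """One SGD step on E_n = ||f(x) - y||^2 for a fully connected sigmoid network.

    weights: list [W_1, ..., W_L] of float arrays; updated in place.
    """
    # Forward propagation: o_0 = x, a_l = W_l o_{l-1}, o_l = h(a_l).
    outputs = [x]
    for W in weights:
        outputs.append(sigmoid(W @ outputs[-1]))

    # Backward propagation: compute dE_n/da_l for l = L, ..., 1.
    L = len(weights)
    grads_a = [None] * (L + 1)
    for l in range(L, 0, -1):
        o_l = outputs[l]
        h_prime = o_l * (1.0 - o_l)                  # sigmoid derivative, in terms of o_l
        if l == L:
            grads_a[l] = 2.0 * (o_l - y) * h_prime   # last layer, square loss
        else:
            grads_a[l] = (weights[l].T @ grads_a[l + 1]) * h_prime

    # Update: W_l <- W_l - eta * (dE_n/da_l) o_{l-1}^T.
    for l in range(1, L + 1):
        weights[l - 1] -= eta * np.outer(grads_a[l], outputs[l - 1])
```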

Neural Nets / Backpropagation / More tricks to optimize neural nets
Many variants are based on backprop:
SGD with minibatch: randomly sample a batch of examples to form a stochastic gradient
SGD with momentum

Neural Nets / Backpropagation / SGD with momentum
Initialize $w_0$ and the velocity $v = 0$. For $t = 1, 2, \ldots$:
form a stochastic gradient $g_t$
update the velocity $v \leftarrow \alpha v - \eta g_t$ for some discount factor $\alpha \in (0, 1)$
update the weights $w_t \leftarrow w_{t-1} + v$
Updates for the first few rounds:
$w_1 = w_0 - \eta g_1$
$w_2 = w_1 - \alpha\eta g_1 - \eta g_2$
$w_3 = w_2 - \alpha^2\eta g_1 - \alpha\eta g_2 - \eta g_3$
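The momentum update in code (a sketch; `gradient_fn` is my own placeholder for any routine that returns a stochastic gradient at the current weights):

```python
import numpy as np

def sgd_momentum(w, gradient_fn, eta=0.01, alpha=0.9, steps=100):
    """v <- alpha * v - eta * g_t;  w <- w + v."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = gradient_fn(w)          # stochastic gradient g_t at the current weights
        v = alpha * v - eta * g
        w = w + v
    return w
```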

Neural Nets / Preventing overfitting / Overfitting
Overfitting is very likely since the models are so powerful. Methods to overcome overfitting:
data augmentation
regularization
dropout
early stopping

Neural Nets / Preventing overfitting / Data augmentation
Data: the more the better. How do we get more data? Exploit prior knowledge to add more training data.

Neural Nets / Preventing overfitting / Regularization
L2 regularization: minimize
$E'(W_1, \ldots, W_L) = E(W_1, \ldots, W_L) + \lambda \sum_{l=1}^L \|W_l\|_2^2$
This is a simple change to the gradient:
$\frac{\partial E'}{\partial w_{ij}} = \frac{\partial E}{\partial w_{ij}} + 2\lambda w_{ij}$
which introduces a weight-decay effect.
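In code, L2 regularization amounts to adding a $2\lambda w$ term to whatever gradient is being used, which is why it is often called weight decay. A one-function sketch (my own naming):

```python
import numpy as np

def sgd_step_l2(W, grad, eta=0.1, lam=1e-4):
    """SGD step with L2 regularization: the gradient gains a 2 * lambda * W term."""
    return W - eta * (grad + 2.0 * lam * W)
```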

Neural Nets / Preventing overfitting / Dropout
Randomly delete neurons during training. Very effective, and it makes training faster as well.

Neural Nets / Preventing overfitting / Early stopping
Stop training when the performance on the validation set stops improving.
[Figure: training-set loss and validation-set loss (negative log-likelihood) plotted against training time in epochs.]
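A minimal early-stopping loop (a sketch; `train_one_epoch` and `validation_loss` are placeholder callables standing for whatever training and evaluation routines are in use):

```python
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=10):
    """Stop once the validation loss has not improved for `patience` consecutive epochs."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()                # one pass over the training set
        loss = validation_loss()         # loss on the held-out validation set
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                        # validation performance stopped improving
    return best_epoch
```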

Neural Nets / Preventing overfitting / Conclusions for neural nets
Deep neural networks are hugely popular, achieving the best performance on many problems. However, they
do need a lot of data to work well
take a lot of time to train (need GPUs for massively parallel computing)
take some work to select the architecture and hyperparameters
are still not well understood in theory