Loss Functions and Optimization. Lecture 3-1


Lecture 3: Loss Functions and Optimization Lecture 3-1

Administrative: Assignment 1 is released: http://cs231n.github.io/assignments2017/assignment1/ Due Thursday April 20, 11:59pm on Canvas. (Extending the due date since it was released late.) Lecture 3-2

Administrative: Check out Project Ideas on Piazza. The schedule for office hours is on the course website. TA specialties are posted on Piazza. Lecture 3-3

Administrative: Details about redeeming Google Cloud Credits should go out today and will be posted on Piazza: $100 per student to use for homeworks and projects. Lecture 3-4

Recall from last time: Challenges of recognition: viewpoint, illumination, deformation, occlusion, clutter, intraclass variation. Lecture 3-5

Recall from last time: the data-driven approach and k-nearest neighbors (1-NN vs. 5-NN classifiers; train / validation / test splits). Lecture 3-6

Recall from last time: Linear Classifier f(x, W) = Wx + b Lecture 3-7

Recall from last time: Linear Classifier. TODO: 1. Define a loss function that quantifies our unhappiness with the scores across the training data. 2. Come up with a way of efficiently finding the parameters that minimize the loss function (optimization). Lecture 3-8

Suppose: 3 training examples, 3 classes. With some W the scores f(x, W) = Wx are:

           cat image   car image   frog image
  cat         3.2         1.3         2.2
  car         5.1         4.9         2.5
  frog       -1.7         2.0        -3.1

Lecture 3-9

A loss function tells how good our current classifier is. Given a dataset of examples {(x_i, y_i)}_{i=1}^N, where x_i is an image and y_i is an (integer) label, the loss over the dataset is an average of the per-example losses: L = (1/N) Σ_i L_i(f(x_i, W), y_i). Lecture 3-10

Multiclass SVM loss: Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1). (Scores as in the table above.) Lecture 3-11

Each term max(0, s_j - s_{y_i} + 1) is a hinge loss: it is zero once the correct-class score s_{y_i} exceeds s_j by at least the margin of 1, and grows linearly in s_j otherwise. Lecture 3-12 to 3-13

Computing the SVM loss for each example with these scores:

Cat image (correct class cat, scores 3.2, 5.1, -1.7):
  L_1 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
      = max(0, 2.9) + max(0, -3.9)
      = 2.9 + 0 = 2.9
Lecture 3-14

Car image (correct class car, scores 1.3, 4.9, 2.0):
  L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
      = max(0, -2.6) + max(0, -1.9)
      = 0 + 0 = 0
Lecture 3-15

Frog image (correct class frog, scores 2.2, 2.5, -3.1):
  L_3 = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
      = max(0, 6.3) + max(0, 6.6)
      = 6.3 + 6.6 = 12.9
Lecture 3-16

Losses: 2.9, 0, 12.9. The loss over the full dataset is the average: L = (2.9 + 0 + 12.9) / 3 = 5.27. Lecture 3-17

Questions about the multiclass SVM loss (same example as above):

Q: What happens to the loss if the car scores change a bit? Lecture 3-18

Q2: What are the min and max possible values of the loss? Lecture 3-19

Q3: At initialization W is small, so all scores s ≈ 0. What is the loss? Lecture 3-20

Q4: What if the sum were over all classes (including j = y_i)? Lecture 3-21

Q5: What if we used a mean instead of a sum? Lecture 3-22

Q6: What if we used the squared hinge, L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)^2? Lecture 3-23

Multiclass SVM Loss: Example code Lecture 3-24
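
The code on this slide is not reproduced in the transcription. A minimal numpy sketch of a vectorized per-example multiclass SVM loss, matching the formula above (the function name and argument layout are my own choices), might look like this:

    import numpy as np

    def svm_loss_vectorized(x, y, W):
        # scores for a single example: s = W x
        scores = W.dot(x)
        # margins for every class: max(0, s_j - s_{y} + 1)
        margins = np.maximum(0, scores - scores[y] + 1)
        # the correct class should not contribute to the loss
        margins[y] = 0
        return np.sum(margins)

Applied to the cat-image scores above (3.2, 5.1, -1.7 with the correct class first), the margins come out as [0, 2.9, 0] and the loss is 2.9, matching the hand computation on slide 14.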

E.g. Suppose that we found a W such that L = 0. Is this W unique? Lecture 3-25

E.g. Suppose that we found a W such that L = 0. Is this W unique? No! 2W also has L = 0! Lecture 3-26

With the same scores as before, consider the car image. Before:
  = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
  = max(0, -2.6) + max(0, -1.9)
  = 0 + 0 = 0
With W twice as large:
  = max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1)
  = max(0, -6.2) + max(0, -4.8)
  = 0 + 0 = 0
Lecture 3-27

Data loss: Model predictions should match training data: L = (1/N) Σ_i L_i(f(x_i, W), y_i). Lecture 3-28


Data loss: Model predictions should match training data. Regularization: Model should be simple, so it works on test data. Full loss: L(W) = (1/N) Σ_i L_i(f(x_i, W), y_i) + λR(W). Lecture 3-32

Data loss: Model predictions should match training data. Regularization: Model should be simple, so it works on test data. Occam's Razor: "Among competing hypotheses, the simplest is the best." (William of Ockham, 1285-1347) Lecture 3-33

Regularization: λ = regularization strength (a hyperparameter). In common use: L2 regularization R(W) = Σ_k Σ_l W_{k,l}^2; L1 regularization R(W) = Σ_k Σ_l |W_{k,l}|; Elastic net (L1 + L2); Max norm regularization (might see later); Dropout (will see later). Fancier: batch normalization, stochastic depth. Lecture 3-34
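
As a small illustration of the L2 and L1 penalties listed above, here is a numpy sketch; the function names and the way the penalty is combined with the data loss are my own, the slide itself only shows the formulas:

    import numpy as np

    def l2_regularization(W):
        # R(W) = sum of the squares of all entries of W
        return np.sum(W * W)

    def l1_regularization(W):
        # R(W) = sum of the absolute values of all entries of W
        return np.sum(np.abs(W))

    # Full loss (sketch): average data loss plus lambda * R(W), e.g.
    # L = data_loss + reg_strength * l2_regularization(W)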

L2 Regularization (Weight Decay) Lecture 3-35

L2 Regularization (Weight Decay). (If you are a Bayesian: L2 regularization also corresponds to MAP inference using a Gaussian prior on W.) Lecture 3-36

Softmax Classifier (Multinomial Logistic Regression). Scores for the cat image: cat 3.2, car 5.1, frog -1.7. Lecture 3-37

Scores are the unnormalized log probabilities of the classes:
  P(Y = k | X = x_i) = e^{s_k} / Σ_j e^{s_j},  where s = f(x_i; W).
This normalized exponential is the softmax function. Lecture 3-38 to 3-40

We want to maximize the log likelihood, or (for a loss function) to minimize the negative log likelihood of the correct class: L_i = -log P(Y = y_i | X = x_i). In summary: L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} ). Lecture 3-41 to 3-42

Softmax Classifier (Multinomial Logistic Regression), worked example for the cat image:
  unnormalized log probabilities (scores):  3.2    5.1    -1.7
  exp -> unnormalized probabilities:        24.5   164.0   0.18
  normalize -> probabilities:               0.13   0.87    0.00
  L_i = -log(0.13) = 0.89
Lecture 3-43 to 3-46

Q: What are the min and max possible values of the loss L_i? Lecture 3-47

Q2: Usually at initialization W is small, so all scores s ≈ 0. What is the loss? Lecture 3-48
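
A short numpy trace of the softmax computation above, for the cat-image scores with correct class index 0 (variable names are my own):

    import numpy as np

    scores = np.array([3.2, 5.1, -1.7])          # unnormalized log probabilities
    unnormalized = np.exp(scores)                # approx. [24.5, 164.0, 0.18]
    probs = unnormalized / np.sum(unnormalized)  # approx. [0.13, 0.87, 0.00]
    loss = -np.log(probs[0])                     # natural log gives approx. 2.04;
                                                 # the slide's 0.89 matches -log10(0.13)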


Softmax vs. SVM Lecture 3-50

Softmax vs. SVM: assume the scores are [10, -2, 3], [10, 9, 9], [10, -100, -100], and in each case the first class is the correct one (y_i = 0). Q: Suppose I take a datapoint and I jiggle it a bit (changing its score slightly). What happens to the loss in both cases? Lecture 3-51
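
To make the question concrete, here is a small numpy sketch (my own, not from the slides) that evaluates both losses on the three score vectors with the first class as correct: the SVM loss is 0 in all three cases because the margin is already satisfied, while the softmax loss is never exactly 0 and keeps shrinking as the correct score pulls further ahead.

    import numpy as np

    def svm_loss(s, y):
        margins = np.maximum(0, s - s[y] + 1)
        margins[y] = 0
        return np.sum(margins)

    def softmax_loss(s, y):
        probs = np.exp(s) / np.sum(np.exp(s))
        return -np.log(probs[y])

    for s in ([10, -2, 3], [10, 9, 9], [10, -100, -100]):
        s = np.array(s, dtype=float)
        print(svm_loss(s, 0), softmax_loss(s, 0))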

Recap: We have some dataset of (x, y). We have a score function s = f(x; W) = Wx. We have a loss function, e.g.
  Softmax: L_i = -log( e^{s_{y_i}} / Σ_j e^{s_j} )
  SVM: L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
  Full loss: L = (1/N) Σ_i L_i + λR(W)
Lecture 3-52

How do we find the best W? Lecture 3-53

Optimization Lecture 3-54

[Figure (image is CC0 1.0 public domain). Lecture 3-55]

[Figure: walking man (image is CC0 1.0 public domain). Lecture 3-56]

Strategy #1: A first very bad idea solution: Random search Lecture 3-57
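
The random-search code on this slide is not reproduced in the transcription; a self-contained sketch of the idea (the 10 x 3073 shape matches CIFAR-10 with a bias dimension, but the function signature and names are my own) could be:

    import numpy as np

    def random_search(loss_fn, num_tries=1000):
        # loss_fn(W) should return the training-set loss for weight matrix W
        best_loss, best_W = float('inf'), None
        for _ in range(num_tries):
            W = np.random.randn(10, 3073) * 0.0001  # guess random parameters
            loss = loss_fn(W)
            if loss < best_loss:                    # keep the best W seen so far
                best_loss, best_W = loss, W
        return best_W, best_loss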

Let's see how well this works on the test set... 15.5% accuracy! Not bad! (SOTA is ~95%.) Lecture 3-58

Strategy #2: Follow the slope Lecture 3-59

Strategy #2: Follow the slope. In one dimension, the derivative of a function is df(x)/dx = lim_{h→0} [f(x + h) - f(x)] / h. In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient. Lecture 3-60

Evaluating the gradient numerically. Current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...], loss 1.25347. Gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, ...]. Lecture 3-61

W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...], loss 1.25322. First entry of dW: (1.25322 - 1.25347) / 0.0001 = -2.5, so dW = [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, ...]. Lecture 3-62 to 3-63

W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...], loss 1.25353. Second entry of dW: (1.25353 - 1.25347) / 0.0001 = 0.6, so dW = [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, ...]. Lecture 3-64 to 3-65

W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...], loss 1.25347. Third entry of dW: (1.25347 - 1.25347) / 0.0001 = 0, so dW = [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, ...]. Lecture 3-66 to 3-67
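
A sketch of the finite-difference procedure being traced above: bump each dimension of W by h = 0.0001 and divide the change in loss by h (function and variable names are my own):

    import numpy as np

    def numerical_gradient(loss_fn, W, h=1e-4):
        # loss_fn(W) returns the scalar loss for the weight array W
        loss_at_W = loss_fn(W)
        grad = np.zeros_like(W)
        it = np.nditer(W, flags=['multi_index'])
        while not it.finished:
            idx = it.multi_index
            old_value = W[idx]
            W[idx] = old_value + h                    # perturb one dimension
            grad[idx] = (loss_fn(W) - loss_at_W) / h  # e.g. (1.25322 - 1.25347) / 0.0001 = -2.5
            W[idx] = old_value                        # restore the entry
            it.iternext()
        return grad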

This is silly. The loss is just a function of W; we want its gradient, ∇_W L. Lecture 3-68 to 3-69

Calculus! Use calculus to compute an analytic gradient. Lecture 3-70

Current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...], loss 1.25347. With the analytic gradient we get dW in one shot: dW = ... (some function of the data and W), e.g. [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, ...]. Lecture 3-71

In summary: - Numerical gradient: approximate, slow, easy to write - Analytic gradient: exact, fast, error-prone => In practice: Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check. Lecture 3-72
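
A gradient check of the kind described here just compares the two gradients, typically via an elementwise relative error (a sketch; the tolerance is an arbitrary choice of mine):

    import numpy as np

    def gradient_check(analytic_grad, numeric_grad, tol=1e-5):
        # relative error between analytic and numerical gradients, elementwise
        rel_error = (np.abs(analytic_grad - numeric_grad) /
                     np.maximum(1e-8, np.abs(analytic_grad) + np.abs(numeric_grad)))
        return np.max(rel_error) < tol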

Gradient Descent Lecture 3-73
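
The gradient-descent loop on this slide is not reproduced in the transcription; a minimal runnable sketch (argument names and the default step size are my own) is:

    def gradient_descent(W, grad_fn, step_size=1e-3, num_steps=1000):
        # grad_fn(W) returns dL/dW, the (analytic) gradient of the loss at W
        for _ in range(num_steps):
            W = W - step_size * grad_fn(W)  # step in the direction of the negative gradient
        return W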

[Figure: contour plot of the loss over weights W_1 and W_2; from the original W, the update steps in the negative gradient direction. Lecture 3-74]


Stochastic Gradient Descent (SGD). The full sum L(W) = (1/N) Σ_{i=1}^N L_i(x_i, y_i, W) + λR(W), and likewise its gradient, is expensive to compute when N is large! Instead, approximate the sum using a minibatch of examples; 32 / 64 / 128 is common. Lecture 3-76
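
A minibatch SGD sketch along the lines described on this slide (the sampling helper, names, and defaults are my own assumptions):

    import numpy as np

    def sgd(W, X, y, grad_fn, step_size=1e-3, batch_size=64, num_steps=1000):
        # grad_fn(X_batch, y_batch, W) returns the gradient of the minibatch loss w.r.t. W
        N = X.shape[0]
        for _ in range(num_steps):
            idx = np.random.choice(N, batch_size, replace=False)  # sample a minibatch (32/64/128 common)
            W = W - step_size * grad_fn(X[idx], y[idx], W)        # approximate gradient step
        return W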

Interactive Web Demo time... http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/ Lecture 3-77


Aside: Image Features Lecture 3-79

Image Features: Motivation. Consider the feature transform f(x, y) = (r(x, y), θ(x, y)) from Cartesian to polar coordinates. Red and blue points that cannot be separated by a linear classifier in the original (x, y) space can, after the feature transform, be separated by a linear classifier. Lecture 3-80

Example: Color Histogram. Count how many pixels fall into each bin of the color (hue) spectrum; the vector of counts is the feature. Lecture 3-81
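
A rough sketch of a color-histogram feature as described here, binning a per-pixel hue channel (the use of hue values in [0, 1) and the bin count are my assumptions):

    import numpy as np

    def color_histogram(hue, num_bins=12):
        # hue: array of per-pixel hue values in [0, 1); the feature is the count per bin
        hist, _ = np.histogram(hue, bins=num_bins, range=(0.0, 1.0))
        return hist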

Example: Histogram of Oriented Gradients (HoG). Divide the image into 8x8 pixel regions; within each region, quantize the edge direction into 9 bins. Example: a 320x240 image gets divided into 40x30 regions; each region contributes 9 numbers, so the feature vector has 40*30*9 = 10,800 numbers. Lowe, "Object recognition from local scale-invariant features," ICCV 1999. Dalal and Triggs, "Histograms of oriented gradients for human detection," CVPR 2005. Lecture 3-82

Example: Bag of Words. Step 1: Build a codebook: extract random patches and cluster them to form a codebook of "visual words." Step 2: Encode images. Fei-Fei and Perona, "A Bayesian hierarchical model for learning natural scene categories," CVPR 2005. Lecture 3-83

Image features vs. ConvNets. With hand-designed features, a fixed feature extractor f produces the representation, and only the classifier on top (10 numbers giving scores for classes) is trained. With ConvNets, the whole pipeline from pixels to the 10 class scores is trained. Lecture 3-84

Next time: Introduction to neural networks Backpropagation Lecture 3-85