Loss Functions and Optimization. Lecture 3-1


Lecture 3: Loss Functions and Optimization Lecture 3-1

Administrative: Live Questions We'll use Zoom to take questions from remote students live-streaming the lecture. Check Piazza for instructions and meeting ID: https://piazza.com/class/jdmurnqexkt47x?cid=108 Lecture 3-2

Administrative: Office Hours Office hours started this week, schedule is on the course website: http://cs231n.stanford.edu/office_hours.html Areas of expertise for all TAs are posted on Piazza: https://piazza.com/class/jdmurnqexkt47x?cid=155 Lecture 3-3

Administrative: Assignment 1 Assignment 1 is released: http://cs231n.github.io/assignments2018/assignment1/ Due Wednesday April 18, 11:59pm Lecture 3-4

Administrative: Google Cloud You should have received an email yesterday about claiming a coupon for Google Cloud; make a private post on Piazza if you didn't get it. There was a problem with @cs.stanford.edu emails; this is resolved. If you have problems with coupons: post on Piazza. DO NOT email me, DO NOT email Prof. Phil Levis. Lecture 3-5

Administrative: SCPD Tutors This year the SCPD office has hired tutors specifically for SCPD students taking CS231N; you should have received an email about this yesterday (4/9/2018) Lecture 3-6

Administrative: Poster Session Poster session will be Tuesday June 12 (our final exam slot). Attendance is mandatory for non-SCPD students; if you don't have a legitimate reason for skipping it then you forfeit the points for the poster presentation. Lecture 3-7

Recall from last time: Challenges of recognition: viewpoint, illumination, deformation, occlusion, clutter, intraclass variation. (Image credits: CC0 1.0 public domain images; Umberto Salvagnin, CC-BY 2.0; jonsson, CC-BY 2.0.) Lecture 3-8

Recall from last time: the data-driven approach and k-NN (1-NN and 5-NN classifiers; train / validation / test splits). Lecture 3-9

Recall from last time: Linear Classifier f(x, W) = Wx + b Lecture 3-10

Recall from last time: Linear Classifier TODO: 1. Define a loss function that quantifies our unhappiness with the scores across the training data. 2. Come up with a way of efficiently finding the parameters that minimize the loss function. (optimization) Cat image by Nikita is licensed under CC-BY 2.0; Car image is CC0 1.0 public domain; Frog image is in the public domain Lecture 3-11

Suppose: 3 training examples (a cat, a car, and a frog image), 3 classes. With some W the scores are (one column per training image): cat: 3.2, 1.3, 2.2; car: 5.1, 4.9, 2.5; frog: -1.7, 2.0, -3.1. Lecture 3-12

A loss function tells how good our current classifier is. Given a dataset of examples {(x_i, y_i)}, i = 1..N, where x_i is an image and y_i is its (integer) label, the loss over the dataset is the average of the per-example losses: L = (1/N) sum_i L_i(f(x_i, W), y_i). Lecture 3-13

Multiclass SVM loss: Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the vector of scores, the SVM loss has the form: L_i = sum over j ≠ y_i of max(0, s_j - s_{y_i} + 1). Lecture 3-14

Each term max(0, s_j - s_{y_i} + 1) is a hinge loss. Lecture 3-15

For the first image (cat; correct-class score 3.2, other scores 5.1 and -1.7): L_1 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = max(0, 2.9) + max(0, -3.9) = 2.9 + 0 = 2.9. Losses so far: 2.9 / - / -. Lecture 3-17

For the second image (car; correct-class score 4.9, other scores 1.3 and 2.0): L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0. Losses so far: 2.9 / 0 / -. Lecture 3-18

For the third image (frog; correct-class score -3.1, other scores 2.2 and 2.5): L_3 = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1) = max(0, 6.3) + max(0, 6.6) = 6.3 + 6.6 = 12.9. Losses: 2.9 / 0 / 12.9. Lecture 3-19

The loss over the full dataset is the average of the per-example losses: L = (2.9 + 0 + 12.9) / 3 = 5.27. Lecture 3-20

Q: What happens to the loss if the car image's scores change a bit? (Nothing, as long as the correct class stays more than the margin above the others: its hinge terms are already 0.) Lecture 3-21

Q2: What is the min/max possible loss? (Min 0, max infinity.) Lecture 3-22

Q3: At initialization W is small, so all s ≈ 0. What is the loss? (Number of classes minus 1; a useful sanity check when you start training.) Lecture 3-23

Q4: What if the sum were over all classes, including j = y_i? (Every per-example loss would increase by exactly 1.) Lecture 3-24

Q5: What if we used a mean over classes instead of a sum? (Nothing important changes; the loss is just rescaled by a constant.) Lecture 3-25

Q6: What if we used a squared hinge, max(0, s_j - s_{y_i} + 1)^2, instead? (That is a different loss: it penalizes large margin violations much more heavily.) Lecture 3-26

Multiclass SVM Loss: Example code Lecture 3-27
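The example code itself is not reproduced in this transcript. Here is a minimal numpy sketch of the per-example multiclass SVM loss described above; the function name and the convention of folding the bias into W are my assumptions, not from the slides:

```python
import numpy as np

def svm_loss_i(x, y, W):
    """Multiclass SVM (hinge) loss for one example.
    x: pixel vector, y: integer index of the correct class,
    W: weight matrix of shape (num_classes, num_pixels)."""
    scores = W.dot(x)                                  # class scores s = Wx
    margins = np.maximum(0, scores - scores[y] + 1.0)  # hinge terms with margin 1
    margins[y] = 0.0                                   # skip the j = y_i term
    return np.sum(margins)
```

Given weights that produce the scores above for the cat image (3.2, 5.1, -1.7) with y = 0, this returns 2.9.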

E.g. Suppose that we found a W such that L = 0. Is this W unique? Lecture 3-28

E.g. Suppose that we found a W such that L = 0. Is this W unique? No! 2W also has L = 0! Lecture 3-29

Before (car image): max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1) = max(0, -2.6) + max(0, -1.9) = 0 + 0 = 0. With W twice as large: max(0, 2.6 - 9.8 + 1) + max(0, 4.0 - 9.8 + 1) = max(0, -6.2) + max(0, -4.8) = 0 + 0 = 0. Lecture 3-30

E.g. Suppose that we found a W such that L = 0. Is this W unique? No! 2W also has L = 0! How do we choose between W and 2W? Lecture 3-31

Regularization Data loss: Model predictions should match training data Lecture 3-32

Regularization Data loss: Model predictions should match training data Regularization: Prevent the model from doing too well on training data Lecture 3-33

Regularization Data loss: model predictions should match the training data. Regularization: prevent the model from doing too well on the training data. The full objective becomes L(W) = (1/N) sum_i L_i(f(x_i, W), y_i) + λR(W), where λ = regularization strength (a hyperparameter). Lecture 3-34

Simple examples of R(W): L2 regularization: R(W) = sum_{k,l} W_{k,l}^2. L1 regularization: R(W) = sum_{k,l} |W_{k,l}|. Elastic net (L1 + L2): R(W) = sum_{k,l} (β W_{k,l}^2 + |W_{k,l}|). Lecture 3-35

More complex regularizers: dropout, batch normalization, stochastic depth, fractional pooling, etc. Lecture 3-36

Why regularize? - Express preferences over weights - Make the model simple so it works on test data - Improve optimization by adding curvature Lecture 3-37
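To make the regularized objective concrete, here is a hedged numpy sketch combining the average SVM data loss with an L2 penalty; the helper name full_loss and the layout of one image per column are my assumptions:

```python
import numpy as np

def full_loss(X, y, W, lam=0.1):
    """Average multiclass SVM data loss over the dataset plus L2 regularization.
    X: (num_pixels, N) array with one image per column, y: (N,) integer labels,
    W: (num_classes, num_pixels) weights, lam: regularization strength."""
    scores = W.dot(X)                              # (num_classes, N)
    correct = scores[y, np.arange(X.shape[1])]     # score of the true class per column
    margins = np.maximum(0, scores - correct + 1.0)
    margins[y, np.arange(X.shape[1])] = 0.0        # skip j = y_i
    data_loss = margins.sum() / X.shape[1]
    reg_loss = lam * np.sum(W * W)                 # L2 penalty: lambda * sum of squared weights
    return data_loss + reg_loss
```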

Regularization: Expressing Preferences L2 Regularization Lecture 3-38

Regularization: Expressing Preferences L2 Regularization L2 regularization likes to spread out the weights Lecture 3-39
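An illustrative example (the numbers are mine, chosen to make the point): with input x = [1, 1, 1, 1], the weight vectors w1 = [1, 0, 0, 0] and w2 = [0.25, 0.25, 0.25, 0.25] produce the same score w·x = 1, but the L2 penalties differ: R(w1) = 1 while R(w2) = 4 × 0.0625 = 0.25, so L2 regularization prefers the spread-out w2.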

Regularization: Prefer Simpler Models (figure: training points plotted as y vs. x). Lecture 3-40

Regularization: Prefer Simpler Models (figure: two candidate fits, f1 and f2, to the same training points). Lecture 3-41

Regularization: Prefer Simpler Models Regularization pushes against fitting the data too well, so we don't fit noise in the data. Lecture 3-42

Softmax Classifier (Multinomial Logistic Regression) Want to interpret raw classifier scores as probabilities. Scores for the cat image: cat 3.2, car 5.1, frog -1.7. Lecture 3-43

Softmax function: P(Y = k | X = x_i) = e^{s_k} / sum_j e^{s_j}, where s = f(x_i, W). Lecture 3-44

Probabilities must be >= 0, so exponentiate the scores: exp(3.2) = 24.5, exp(5.1) = 164.0, exp(-1.7) = 0.18 (unnormalized probabilities). Lecture 3-45

Probabilities must sum to 1, so normalize: 24.5, 164.0, 0.18 become 0.13, 0.87, 0.00 (probabilities). Lecture 3-46

The raw scores (3.2, 5.1, -1.7) are unnormalized log-probabilities / logits; exponentiating gives unnormalized probabilities (24.5, 164.0, 0.18), and normalizing gives probabilities (0.13, 0.87, 0.00). Lecture 3-47

The per-example loss is the negative log probability of the correct class: L_i = -log(0.13) = 2.04. Lecture 3-48

Maximum Likelihood Estimation: choose probabilities to maximize the likelihood of the observed data (see CS 229 for details); L_i = -log(0.13) = 2.04 is exactly this negative log-likelihood for the correct class. Lecture 3-49
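A minimal numpy sketch of this softmax / cross-entropy loss for one example; the max-shift for numerical stability is my addition and does not change the result:

```python
import numpy as np

def softmax_loss_i(scores, y):
    """Cross-entropy loss for one example, given raw class scores (logits)."""
    shifted = scores - np.max(scores)                  # shift for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))  # softmax probabilities
    return -np.log(probs[y])                           # -log P(correct class)

# The slide's example: scores (cat, car, frog) with the cat class (index 0) correct.
print(softmax_loss_i(np.array([3.2, 5.1, -1.7]), 0))   # approximately 2.04
```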

To turn this into a loss we compare the predicted probabilities (0.13, 0.87, 0.00) against the correct probabilities (1.00, 0.00, 0.00). Lecture 3-50

This comparison can be phrased as the Kullback-Leibler divergence between the correct distribution (1.00, 0.00, 0.00) and the predicted one (0.13, 0.87, 0.00). Lecture 3-51

Equivalently, it is the cross-entropy between the correct and predicted distributions. Lecture 3-52

Putting it all together (maximize the probability of the correct class): L_i = -log( e^{s_{y_i}} / sum_j e^{s_j} ). Lecture 3-53

Q: What is the min/max possible loss L_i? Lecture 3-54

A: min 0, max infinity. Lecture 3-55

Q2: At initialization all scores will be approximately equal; what is the loss? Lecture 3-56

A: log(C), e.g. log(10) ≈ 2.3 (another useful sanity check at initialization). Lecture 3-57

Softmax vs. SVM Lecture 3-58

Softmax vs. SVM: L_i = -log( e^{s_{y_i}} / sum_j e^{s_j} ) vs. L_i = sum over j ≠ y_i of max(0, s_j - s_{y_i} + 1). Lecture 3-59

Softmax vs. SVM: assume three examples with scores [10, -2, 3], [10, 9, 9], [10, -100, -100], where the first class is correct in each case. Q: Suppose I take a datapoint and jiggle it a bit (changing its scores slightly). What happens to the loss in both cases? Lecture 3-60
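The answer, with a quick numpy check (helper names are mine): the SVM loss is 0 for all three examples because every margin is satisfied, and it stays 0 under small jiggles; the softmax loss is positive in exact arithmetic (it can underflow to 0.0 numerically for very confident scores) and changes with every small change in the scores.

```python
import numpy as np

def svm_loss(scores, y):
    margins = np.maximum(0, scores - scores[y] + 1.0)   # hinge terms
    margins[y] = 0.0
    return margins.sum()

def softmax_loss(scores, y):
    shifted = scores - scores.max()                      # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[y])

for s in ([10, -2, 3], [10, 9, 9], [10, -100, -100]):
    s = np.array(s, dtype=float)
    # SVM: 0.0 for all three; softmax: about 0.0009, 0.55, and ~0 (underflows in float64)
    print(svm_loss(s, 0), softmax_loss(s, 0))
```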

Recap - We have some dataset of (x, y) - We have a score function: s = f(x; W) = Wx - We have a loss function, e.g. Softmax: L_i = -log( e^{s_{y_i}} / sum_j e^{s_j} ); SVM: L_i = sum over j ≠ y_i of max(0, s_j - s_{y_i} + 1); Full loss: L = (1/N) sum_i L_i + λR(W). Lecture 3-61

Recap (as above), and the question: how do we find the best W? Lecture 3-62

Optimization Lecture 3-63

(Slides 3-64 and 3-65 are full-slide images, CC0 1.0 public domain; the second is the walking-man image.)

Strategy #1: A first, very bad idea: random search. Lecture 3-66

Let's see how well this works on the test set... 15.5% accuracy! Not bad! (SOTA is ~95%.) Lecture 3-67
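A hedged sketch of what random search looks like, reusing the full_loss helper sketched in the regularization section; the CIFAR-10-like shapes (10 classes, 3072 pixels plus a bias dimension) and the random stand-in data are my assumptions:

```python
import numpy as np

# Stand-in data so the loop runs; in the lecture this would be CIFAR-10 training data.
X_train = np.random.randn(3073, 500)          # one 3073-dim column per image
y_train = np.random.randint(10, size=500)     # integer labels

best_loss, best_W = float("inf"), None
for _ in range(1000):
    W = np.random.randn(10, 3073) * 0.0001    # guess a random weight matrix
    loss = full_loss(X_train, y_train, W)     # evaluate it (see the sketch above)
    if loss < best_loss:                      # keep the best guess seen so far
        best_loss, best_W = loss, W
```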

Strategy #2: Follow the slope Lecture 3-68

In 1 dimension, the derivative of a function is df(x)/dx = lim_{h→0} [f(x + h) - f(x)] / h. In multiple dimensions, the gradient is the vector of partial derivatives along each dimension. The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient. Lecture 3-69

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...], loss 1.25347. gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, ...] Lecture 3-70

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...], loss 1.25347. W + h (first dim): [0.34 + 0.0001, -1.11, 0.78, ...], loss 1.25322. gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, ...] Lecture 3-71

The first entry of the gradient is (1.25322 - 1.25347) / 0.0001 = -2.5, so gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, ...] Lecture 3-72

current W: [0.34, -1.11, 0.78, ...], loss 1.25347. W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, ...], loss 1.25353. gradient dW: [-2.5, ?, ?, ?, ?, ?, ?, ?, ?, ...] Lecture 3-73

The second entry is (1.25353 - 1.25347) / 0.0001 = 0.6, so gradient dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, ...] Lecture 3-74

current W: [0.34, -1.11, 0.78, ...], loss 1.25347. W + h (third dim): [0.34, -1.11, 0.78 + 0.0001, ...], loss 1.25347. gradient dW: [-2.5, 0.6, ?, ?, ?, ?, ?, ?, ?, ...] Lecture 3-75

The third entry is (1.25347 - 1.25347) / 0.0001 = 0, so gradient dW: [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, ...] Lecture 3-76

This is the numeric gradient: you need to loop over all dimensions, so it is slow, and it is only approximate. gradient dW so far: [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, ...] Lecture 3-77
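A minimal numpy sketch of this finite-difference procedure; the one-sided difference and h = 1e-4 follow the numbers above, and the function name is mine:

```python
import numpy as np

def numerical_gradient(f, W, h=1e-4):
    """Finite-difference estimate of the gradient of f at W: slow and approximate."""
    grad = np.zeros_like(W)
    base = f(W)
    it = np.nditer(W, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h                   # bump one dimension by h
        grad[idx] = (f(W) - base) / h      # one-sided finite difference
        W[idx] = old                       # restore the original value
        it.iternext()
    return grad

# Example on a toy loss: the gradient of sum(W^2) is approximately 2W.
W = np.random.randn(3, 4)
print(numerical_gradient(lambda w: np.sum(w * w), W))
```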

This is silly. The loss is just a function of W: L = (1/N) sum_i L_i + λR(W). We want the gradient of L with respect to W. Lecture 3-78

This is silly. The loss is just a function of W; use calculus to compute an analytic gradient. (Hammer image and other images on this slide are in the public domain.) Lecture 3-79

current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...], loss 1.25347. With the analytic gradient we get the whole gradient dW at once: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, ...]; dW = ... (some function of the data and W). Lecture 3-80

In summary: - Numerical gradient: approximate, slow, easy to write - Analytic gradient: exact, fast, error-prone => In practice: Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check. Lecture 3-81

Gradient Descent Lecture 3-82

(Figure: a 2-D weight space with axes W_1 and W_2, showing the original W and the negative gradient direction.) Lecture 3-83
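A minimal sketch of vanilla gradient descent; the toy quadratic loss, the step size, and the iteration count are stand-ins I chose so the loop runs (in practice the gradient would come from the classification loss above):

```python
import numpy as np

def loss_and_grad(W):
    """Toy loss L(W) = sum(W^2) with analytic gradient 2W (stand-in for the real loss)."""
    return np.sum(W * W), 2 * W

W = np.random.randn(10, 3073) * 0.0001   # initial weights
step_size = 0.1                           # learning rate hyperparameter
for _ in range(100):
    loss, grad = loss_and_grad(W)
    W += -step_size * grad                # step in the direction of the negative gradient
```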

Stochastic Gradient Descent (SGD): the full sum over all N examples is expensive when N is large! Approximate it using a minibatch of examples; minibatch sizes of 32 / 64 / 128 are common. Lecture 3-85
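A hedged sketch of the minibatch SGD loop; the stand-in data, the placeholder gradient function, the batch size of 128, and the step size are my assumptions, not values from the slide:

```python
import numpy as np

# Stand-in data with CIFAR-10-like shapes so the loop runs.
X_train = np.random.randn(3073, 5000)        # one image per column (3072 pixels + bias)
y_train = np.random.randint(10, size=5000)

def grad_fn(W, X_batch, y_batch):
    """Placeholder for the analytic gradient of the loss on a minibatch."""
    return 2 * W                              # stand-in gradient (an L2-style pull toward 0)

W = np.random.randn(10, 3073) * 0.0001
step_size = 1e-2
for _ in range(1000):
    idx = np.random.choice(X_train.shape[1], 128, replace=False)   # sample a minibatch
    X_batch, y_batch = X_train[:, idx], y_train[idx]
    W += -step_size * grad_fn(W, X_batch, y_batch)                 # SGD parameter update
```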

Interactive Web Demo time... http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/ Lecture 3-86

Interactive Web Demo time... Lecture 3-87

Aside: Image Features. So far the linear classifier f(x) = Wx maps the image directly to class scores. Lecture 3-88

Aside: Image Features. Alternative pipeline: image → feature representation → f(x) = Wx → class scores. Lecture 3-89

Image Features: Motivation (figure: red and blue points in the x-y plane). Cannot separate the red and blue points with a linear classifier. Lecture 3-90

Image Features: Motivation. Apply the feature transform f(x, y) = (r(x, y), θ(x, y)), i.e. Cartesian to polar coordinates. The red and blue points cannot be separated by a linear classifier in the original (x, y) space, but after the feature transform they can be. Lecture 3-91
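A small numpy illustration of this transform; the concentric-circles data is my stand-in for the slide's red and blue points:

```python
import numpy as np

# Blue points on a small inner circle, red points on a larger outer ring:
# not linearly separable in (x, y), but separable by radius after the polar transform.
angles = np.random.uniform(0, 2 * np.pi, 200)
inner = np.stack([0.5 * np.cos(angles[:100]), 0.5 * np.sin(angles[:100])], axis=1)
outer = np.stack([2.0 * np.cos(angles[100:]), 2.0 * np.sin(angles[100:])], axis=1)
points = np.vstack([inner, outer])

r = np.hypot(points[:, 0], points[:, 1])        # radius
theta = np.arctan2(points[:, 1], points[:, 0])  # angle
features = np.stack([r, theta], axis=1)         # f(x, y) = (r, theta)

# In feature space a single threshold on r separates the two groups perfectly.
print(np.all(features[:100, 0] < 1.25), np.all(features[100:, 0] > 1.25))
```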

Example: Color Histogram. Each pixel votes +1 into the bin for its color; the histogram of these counts is the feature vector. Lecture 3-92

Example: Histogram of Oriented Gradients (HoG) Divide image into 8x8 pixel regions Within each region quantize edge direction into 9 bins Example: 320x240 image gets divided into 40x30 bins; in each bin there are 9 numbers so feature vector has 30*40*9 = 10,800 numbers Lowe, Object recognition from local scale-invariant features, ICCV 1999 Dalal and Triggs, "Histograms of oriented gradients for human detection," CVPR 2005 Lecture 3-93

Example: Bag of Words. Step 1: Build a codebook: extract random patches and cluster them to form a codebook of visual words. Step 2: Encode images using the codebook. Fei-Fei and Perona, A Bayesian hierarchical model for learning natural scene categories, CVPR 2005 Lecture 3-94

Aside: Image Features Lecture 3-95

Image features vs. ConvNets: in the feature-based pipeline, a fixed feature extractor feeds a classifier f that outputs 10 numbers giving scores for classes, and only the classifier is trained; with ConvNets, the whole network that produces the 10 class scores is trained. Lecture 3-96

Next time: Introduction to neural networks Backpropagation Lecture 3-97