CSC321 Lecture 5: Learning in a Single Neuron

CSC321 Lecture 5: Learning in a Single Neuron
Roger Grosse and Nitish Srivastava
January 21, 2015

Overview

So far, we've talked about:
- Predicting scalar targets using linear regression: y = w^T x
- Classifying between 2 classes using the perceptron: y = 1 if w^T x ≥ 0, and y = 0 otherwise
- Converting linear models into nonlinear models using basis functions (features): y = w^T x becomes y = w^T Φ(x)

But this raises some questions:
- What if the thing we're trying to predict isn't real-valued or binary-valued?
- What if we don't know the right features Φ?

This week, we cover a much more general learning framework which gets around both of these issues.

Overview

Linear regression and the perceptron algorithm were both one-off tricks.
- For linear regression, we derived a closed-form solution by setting the partial derivatives to 0. This only works for a handful of learning algorithms.
- For the perceptron, we gave a simple add/subtract algorithm which works under an unrealistic linear separability assumption.
- For most of this course, we will instead write down an objective function and optimize it using a technique called gradient descent.

Overview

Loss function: a measure of unhappiness. How unhappy should we be if the model predicts y when we want it to predict t?

Some examples of loss functions we'll learn about later:

    Setting                  | Example                                  | Loss function C(w)
    least-squares regression | predict stock prices                     | (y − t)^2
    robust regression        | predict stock prices                     | |y − t|
    classification           | predict object category from image       | −log p(t | x)
    generative modeling      | model distribution of English sentences  | −log p(x)

Synonyms: loss function / error function / cost function / objective function. Learning = minimizing unhappiness.
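Not part of the slides, but to make the table concrete, here is a minimal sketch of how the first three losses might be written as Python functions (the names and variables are illustrative only):

    import numpy as np

    # Squared error: the least-squares regression loss
    def squared_error(y, t):
        return (y - t) ** 2

    # Absolute error: the robust regression loss
    def absolute_error(y, t):
        return np.abs(y - t)

    # Negative log-likelihood of the target class, for classification;
    # p_t is the model's predicted probability of the correct class t
    def neg_log_likelihood(p_t):
        return -np.log(p_t)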

Optimization

Visualizing gradient descent in one dimension:

    w ← w − ε dC/dw
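A minimal sketch (not from the slides) of this update rule in Python, applied to an assumed example cost C(w) = (w − 3)^2 whose derivative is 2(w − 3):

    # Gradient descent in one dimension: w <- w - eps * dC/dw
    def dC_dw(w):
        return 2 * (w - 3)           # derivative of the example cost (w - 3)^2

    w = 0.0                          # initial guess
    eps = 0.1                        # learning rate (epsilon)
    for step in range(50):
        w = w - eps * dC_dw(w)       # the update from the slide
    print(w)                         # ends up close to the minimizer w = 3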

Optimization

[Figure slide: image not included in the transcription.]

Optimization

Visualizing it in two dimensions is a bit trickier.
- Level sets (or contours): sets of points on which C(w) is constant.
- Gradient: the vector of partial derivatives

      ∇C = (∂C/∂w_1, ∂C/∂w_2)

  It points in the direction of maximum increase and is orthogonal to the level set.
- The gradient descent updates are opposite the gradient direction.

Optimization

[Figure slide: image not included in the transcription.]

Question 1: Geometry of optimization

Suppose we have a linear regression problem with two training cases and no bias term:

    x^(1) = (1, 0),  t^(1) = 0.5
    x^(2) = (0, 3),  t^(2) = 0

Recall that the objective function is

    C(w) = 1/2 (w^T x^(1) − t^(1))^2 + 1/2 (w^T x^(2) − t^(2))^2.

1. In weight space, sketch the level set of this objective function corresponding to C(w) = 1/2. Draw both axes with the same scale. Hint: write the equation C(w) = 1/2 explicitly in the form

       (w_1 − c)^2 / a^2 + (w_2 − d)^2 / b^2 = 1.

   What geometric object is this, and what do (a, b, c, d) represent?
2. The point w = (1.25, 0.22) is on the level set for C(w) = 1/2. Sketch the gradient ∇_w C at this point.
3. Now sketch the gradient descent update if we use a learning rate of 1/2.
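Not part of the original slide, but as a numerical sanity check of the setup, here is a short NumPy sketch that evaluates the cost, the gradient, and one gradient descent step at the point given in the question:

    import numpy as np

    # Training cases from the question (inputs as rows, targets as a vector)
    X = np.array([[1.0, 0.0],
                  [0.0, 3.0]])
    t = np.array([0.5, 0.0])

    def C(w):
        # C(w) = 1/2 (w^T x1 - t1)^2 + 1/2 (w^T x2 - t2)^2
        r = X @ w - t
        return 0.5 * np.sum(r ** 2)

    def grad_C(w):
        # Gradient of the quadratic cost: sum_i (w^T xi - ti) xi
        return X.T @ (X @ w - t)

    w = np.array([1.25, 0.22])
    print(C(w))            # about 0.5, so w sits (essentially) on the C(w) = 1/2 level set
    g = grad_C(w)
    print(g)               # (0.75, 1.98): much steeper in the w_2 direction
    print(w - 0.5 * g)     # one gradient descent step with learning rate 1/2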

Question 1: Geometry of optimization

[Figure slide: image not included in the transcription.]

Question 1: Geometry of optimization

Note: we chose the numbers for this problem so that the ellipse would be axis-aligned. In general, the level sets for linear regression will be ellipses, but they won't be axis-aligned. But all ellipses are rotations of axis-aligned ellipses, so the axis-aligned case gives us all the intuition we need. You may have learned how to draw non-axis-aligned ellipses in a linear algebra class. If not, don't worry about it.

Question 1: Geometry of optimization

1. Suppose you perform gradient descent with a very small learning rate (e.g. 0.01). Will the objective function increase or decrease?
2. Now suppose you use a very large learning rate (e.g. 100). Will the objective function increase or decrease?
3. Which weight will change faster: w_1 or w_2? Which one do you want to change faster?

Note: a more representative picture would show it MUCH more elongated!
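A small numerical illustration (my own sketch, not from the slides) of how the learning rate plays out on the Question 1 objective; the loop below reuses the same cost and gradient as in the earlier sketch:

    import numpy as np

    X = np.array([[1.0, 0.0],
                  [0.0, 3.0]])
    t = np.array([0.5, 0.0])

    def C(w):
        r = X @ w - t
        return 0.5 * np.sum(r ** 2)

    def grad_C(w):
        return X.T @ (X @ w - t)

    for eps in [0.01, 100.0]:            # very small vs. very large learning rate
        w = np.array([1.25, 0.22])
        for _ in range(5):
            w = w - eps * grad_C(w)      # gradient descent update
        print(eps, C(w))                 # eps = 0.01: the cost shrinks; eps = 100: it blows up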

Question 2: Computing the gradient

Now let's consider a non-linear neuron whose activation is computed as follows:

    z = w^T x = Σ_j w_j x_j
    y = log(1 + z^2)

We will use the cost function C(w) = |y − t|.

Show how to compute the partial derivative ∂C/∂w_1 using the Chain Rule, as follows:
1. Compute the derivative dC/dy. (You may assume y ≠ t.)
2. Express the derivative dC/dz in terms of dC/dy.
3. Express the partial derivative ∂C/∂w_1 in terms of dC/dz.

Question 2: Computing the gradient

Solution:

    dC/dy = 1 if y > t, −1 if y < t
    dC/dz = (dy/dz)(dC/dy) = (2z / (1 + z^2)) dC/dy
    ∂C/∂w_1 = (∂z/∂w_1)(dC/dz) = x_1 dC/dz

Note: If this were a calculus class, you'd do the substitutions to get ∂C/∂w_1 explicitly. We won't do that here, since at this point you've already derived everything you need to implement the computation in Python.
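Since the slide points toward a Python implementation, here is a minimal sketch (my own, not the course's code) of the forward pass and the chain-rule gradient above, together with a finite-difference check on some made-up example values:

    import numpy as np

    def forward(w, x, t):
        z = np.dot(w, x)             # z = w^T x
        y = np.log(1 + z ** 2)       # nonlinearity
        C = np.abs(y - t)            # cost C = |y - t|
        return z, y, C

    def gradient(w, x, t):
        z, y, _ = forward(w, x, t)
        dC_dy = 1.0 if y > t else -1.0           # derivative of |y - t| (assumes y != t)
        dC_dz = (2 * z / (1 + z ** 2)) * dC_dy   # chain rule through y = log(1 + z^2)
        return x * dC_dz                         # dz/dw_j = x_j

    # Finite-difference check on assumed values of w, x, t
    w = np.array([0.3, -0.2])
    x = np.array([1.0, 2.0])
    t = 0.5
    g = gradient(w, x, t)
    eps = 1e-6
    for j in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[j] += eps
        wm[j] -= eps
        numeric = (forward(wp, x, t)[2] - forward(wm, x, t)[2]) / (2 * eps)
        print(g[j], numeric)         # the two columns should agree closely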

Question 2: Computing the gradient

Observe that we compute derivatives going backwards through the computation graph. This is true in general, not just for neural nets. This is how we get the term backpropagation.