Backpropagation and Neural Networks part 1. Lecture 4-1


Lecture 4: Backpropagation and Neural Networks part 1 Lecture 4-1

Administrative A1 is due Jan 20 (Wednesday). ~150 hours left. Warning: Jan 18 (Monday) is a holiday (no class/office hours). Also note: lectures are non-exhaustive; read the course notes for completeness. I'll hold make-up office hours on Wed Jan 20, 5pm @ Gates 259. Lecture 4-2

Where we are... scores function, SVM loss, data loss + regularization. Want: the gradient of the loss with respect to the weights W. Lecture 4-3

Optimization (image credits to Alec Radford) Lecture 4-4

Gradient Descent Numerical gradient: slow :(, approximate :(, easy to write :) Analytic gradient: fast :), exact :), error-prone :( In practice: Derive analytic gradient, check your implementation with numerical gradient Lecture 4-5
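In code, the numerical check is typically a centered-difference approximation compared against the analytic result; a minimal numpy sketch (the toy function f and the final comparison are illustrative, not part of the assignment):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference approximation of df/dx at x (x is a numpy array)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h; fxph = f(x)   # f(x + h)
        x[ix] = old - h; fxmh = f(x)   # f(x - h)
        x[ix] = old                    # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

# Example: check an analytic gradient of f(x) = sum(x**2), whose gradient is 2x.
f = lambda x: np.sum(x ** 2)
x = np.random.randn(3, 4)
analytic = 2 * x
numeric = numerical_gradient(f, x)
print(np.max(np.abs(analytic - numeric)))  # should be tiny, e.g. < 1e-7
```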

Computational Graph: inputs W and x feed a * node producing s (scores), followed by the hinge loss; a regularization term R(W) is added (+) to give the total loss L. Lecture 4-6

Convolutional Network (AlexNet) input image weights loss Lecture 4-7

Neural Turing Machine input tape loss Lecture 4-8

Neural Turing Machine Lecture 4-9

e.g. x = -2, y = 5, z = -4. Want: the gradients of the output with respect to x, y, and z. The slides fill in the computational graph one node at a time, applying the chain rule at each step. Lecture 4-10 through 4-21
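The transcript keeps only the input values; a minimal sketch of the node-by-node forward/backward pass the slides walk through, assuming the circuit is f(x, y, z) = (x + y) * z (the standard example in the CS231n notes):

```python
# Forward pass (assumed circuit: q = x + y, f = q * z)
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass: apply the chain rule node by node, starting from the output.
df_df = 1.0                 # gradient of f with respect to itself
df_dz = q * df_df           # d(q*z)/dz = q  -> 3
df_dq = z * df_df           # d(q*z)/dq = z  -> -4
df_dx = 1.0 * df_dq         # d(x+y)/dx = 1  -> -4
df_dy = 1.0 * df_dq         # d(x+y)/dy = 1  -> -4
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```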

activations flow forward through a gate f, which knows its local gradient; during the backward pass the gate multiplies that local gradient by the gradient flowing in from above and passes the result on to its inputs. Lecture 4-22 through 4-27

Another example, built up node by node across the computational graph: Lecture 4-28 through 4-36

Another example: (-1) * (-0.20) = 0.20 Lecture 4-37

Another example: [local gradient] x [its gradient]: [1] x [0.2] = 0.2, [1] x [0.2] = 0.2 (both inputs!) Lecture 4-39

Another example: [local gradient] x [its gradient]: x0: [2] x [0.2] = 0.4, w0: [-1] x [0.2] = -0.2 Lecture 4-41

sigmoid function sigma(x) = 1 / (1 + e^-x); its local gradient is (1 - sigma(x)) * sigma(x). sigmoid gate Lecture 4-42

sigmoid gate: the whole sigmoid can be backpropped through in one step: (0.73) * (1-0.73) = 0.2 Lecture 4-43
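A hedged reconstruction of this circuit in code: w0 = 2 and x0 = -1 match the local gradients shown above, while the remaining inputs (w1 = -3, x1 = -2, w2 = -3) are assumptions chosen to reproduce the 0.73 and 0.2 on the slides:

```python
import math

# Assumed inputs (not all in the transcript; chosen to reproduce the slide numbers)
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# Forward pass: f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
dot = w0 * x0 + w1 * x1 + w2          # = 1.0
out = 1.0 / (1.0 + math.exp(-dot))    # ~ 0.73

# Backward pass. The sigmoid gate's local gradient is out * (1 - out) ~ 0.2,
# so we backprop through the whole sigmoid in one step.
ddot = out * (1 - out)                # the (0.73) * (1 - 0.73) = 0.2 on the slide
dx0 = w0 * ddot                       # [2]  x [0.2] =  0.4
dw0 = x0 * ddot                       # [-1] x [0.2] = -0.2
dx1 = w1 * ddot
dw1 = x1 * ddot
dw2 = 1.0 * ddot
print(round(out, 2), round(dx0, 1), round(dw0, 1))  # 0.73 0.4 -0.2
```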

Patterns in backward flow add gate: gradient distributor max gate: gradient router mul gate: gradient switcher? Lecture 4-44

Gradients add at branches + Lecture 4-45

Implementation: forward/backward API Graph (or Net) object. (Rough pseudocode) Lecture 4-46
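A minimal Python sketch of that rough pseudocode (class and attribute names are illustrative; gates are assumed to be objects exposing forward(), backward(), and an output field, stored in topologically sorted order):

```python
class ComputationalGraph:
    """Sketch of the Graph (or Net) object from the slide."""

    def __init__(self, gates):
        self.gates = gates  # assumed topologically sorted

    def forward(self):
        # run every gate in topological order; each gate computes its output
        # and caches whatever intermediates it needs for the backward pass
        for gate in self.gates:
            gate.forward()
        return self.gates[-1].output   # the final gate's output is the loss

    def backward(self):
        # visit the gates in reverse order; each gate chains its local
        # gradient onto the gradient arriving from the gate above it
        for gate in reversed(self.gates):
            gate.backward()
```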

Implementation: forward/backward API x * z y (x, y, z are scalars) Lecture 4-47, 4-48
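For the scalar circuit z = x * y on the slide, a single gate might implement the same API like this (a sketch; the class name and cached attributes are illustrative):

```python
class MultiplyGate:
    def forward(self, x, y):
        # cache the inputs: they are exactly the local gradients needed in
        # the backward pass (dz/dx = y, dz/dy = x)
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: [local gradient] * [upstream gradient]
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

gate = MultiplyGate()
z = gate.forward(-2.0, 5.0)      # z = -10
dx, dy = gate.backward(1.0)      # dx = 5 (= y), dy = -2 (= x)
print(z, dx, dy)
```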

Example: Torch Layers Lecture 4-49, 4-50

Example: Torch MulConstant initialization forward() backward() Lecture 4-51

Example: Caffe Layers Lecture 4-52

Caffe Sigmoid Layer *top_diff (chain rule) Lecture 4-53

Gradients for vectorized code (x,y,z are now vectors) This is now the Jacobian matrix (derivative of each element of z w.r.t. each element of x) local gradient f gradients Lecture 4-54

Vectorized operations 4096-d input vector f(x) = max(0,x) (elementwise) 4096-d output vector Lecture 4-55

Vectorized operations Jacobian matrix 4096-d input vector f(x) = max(0,x) (elementwise) 4096-d output vector Q: what is the size of the Jacobian matrix? Lecture 4-56

Vectorized operations Jacobian matrix 4096-d input vector f(x) = max(0,x) (elementwise) 4096-d output vector Q: what is the size of the Jacobian matrix? [4096 x 4096!] Q2: what does it look like? Lecture 4-57

Vectorized operations In practice we process an entire minibatch (e.g. 100) of examples at one time: 100 4096-d input vectors, f(x) = max(0,x) (elementwise), 100 4096-d output vectors. i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\ Lecture 4-58
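Since each output element of max(0, x) depends only on the corresponding input element, the Jacobian is diagonal and is never formed explicitly; a minimal numpy sketch of the elementwise shortcut:

```python
import numpy as np

# a minibatch of 100 examples, each 4096-dimensional
x = np.random.randn(100, 4096)

# forward: elementwise ReLU
out = np.maximum(0, x)

# backward: the Jacobian of elementwise max(0, x) is diagonal, with 1 where
# x > 0 and 0 elsewhere, so multiplying by it is just an elementwise mask.
dout = np.random.randn(*out.shape)   # upstream gradient (stand-in)
dx = dout * (x > 0)                  # no [409,600 x 409,600] matrix needed
print(dx.shape)                      # (100, 4096)
```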

Assignment: Writing SVM/Softmax. Stage your forward/backward computation! E.g. for the SVM: compute the margins in the forward pass and reuse them in the backward pass. Lecture 4-59
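A hedged sketch of what that staging might look like for the multiclass SVM loss (variable names and the delta = 1.0 margin are illustrative, not the assignment's starter code):

```python
import numpy as np

def svm_loss_staged(W, X, y, delta=1.0):
    """Multiclass SVM loss with a staged forward/backward pass.
    W: (D, C) weights, X: (N, D) data, y: (N,) correct class indices."""
    N = X.shape[0]

    # --- forward: stage the computation and keep the intermediates ---
    scores = X.dot(W)                                  # (N, C)
    correct = scores[np.arange(N), y][:, None]         # (N, 1)
    margins = np.maximum(0, scores - correct + delta)  # (N, C)
    margins[np.arange(N), y] = 0
    loss = margins.sum() / N

    # --- backward: reuse the cached margins ---
    dscores = (margins > 0).astype(float)              # 1 where a margin is violated
    dscores[np.arange(N), y] -= dscores.sum(axis=1)    # correct-class column
    dW = X.T.dot(dscores) / N                          # (D, C)
    return loss, dW

# tiny usage example with random data
W = np.random.randn(3072, 10) * 0.01
X = np.random.randn(5, 3072)
y = np.array([0, 3, 1, 9, 2])
loss, dW = svm_loss_staged(W, X, y)
print(loss, dW.shape)
```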

Summary so far
- neural nets will be very large: no hope of writing down gradient formula by hand for all parameters
- backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
- implementations maintain a graph structure, where the nodes implement the forward() / backward() API
- forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
- backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
Lecture 4-60


Neural Network: without the brain stuff (Before) Linear score function: Lecture 4-62

Neural Network: without the brain stuff (Before) Linear score function: f = W x (Now) 2-layer Neural Network: f = W2 max(0, W1 x) Lecture 4-63

Neural Network: without the brain stuff (Before) Linear score function: (Now) 2-layer Neural Network x (3072) → W1 → h (100) → W2 → s (10) Lecture 4-64, 4-65

Neural Network: without the brain stuff (Before) Linear score function: (Now) 2-layer Neural Network, or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x)) Lecture 4-66

Full implementation of training a 2-layer Neural Network needs ~11 lines: from @iamtrask, http://iamtrask.github.io/2015/07/12/basic-python-network/ Lecture 4-67
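A sketch in the spirit of that post (reconstructed from memory, so details such as layer sizes and iteration count may differ from the original): sigmoid activations, a toy XOR-like dataset, and plain gradient-style updates:

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1 = 2 * np.random.random((3, 4)) - 1         # input -> hidden weights
W2 = 2 * np.random.random((4, 1)) - 1         # hidden -> output weights
for _ in range(60000):
    h = 1 / (1 + np.exp(-X.dot(W1)))          # forward: hidden activations
    out = 1 / (1 + np.exp(-h.dot(W2)))        # forward: predictions
    d_out = (y - out) * out * (1 - out)       # backward: output "delta"
    d_h = d_out.dot(W2.T) * h * (1 - h)       # backward: chain rule into hidden layer
    W2 += h.T.dot(d_out)                      # gradient-style updates
    W1 += X.T.dot(d_h)
print(out.round(2))                           # should approach [[0],[1],[1],[0]]
```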

Assignment: Writing a 2-layer Net. Stage your forward/backward computation! Lecture 4-68


sigmoid activation function Lecture 4-73


Be very careful with your brain analogies. Biological Neurons:
- Many different types
- Dendrites can perform complex nonlinear computations
- Synapses are not a single weight but a complex non-linear dynamical system
- Rate code may not be adequate
[Dendritic Computation. London and Häusser] Lecture 4-75

Activation Functions: Sigmoid, tanh(x), ReLU max(0,x), Leaky ReLU max(0.1x, x), Maxout, ELU Lecture 4-76
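The listed activations written out as numpy one-liners (the ELU's alpha = 1 is an assumption; Maxout is only noted because it needs extra parameters rather than being elementwise):

```python
import numpy as np

sigmoid    = lambda x: 1 / (1 + np.exp(-x))
tanh       = np.tanh                                   # tanh(x)
relu       = lambda x: np.maximum(0, x)                # max(0, x)
leaky_relu = lambda x: np.maximum(0.1 * x, x)          # max(0.1x, x)
elu        = lambda x, a=1.0: np.where(x > 0, x, a * (np.exp(x) - 1))  # alpha assumed 1
# Maxout is not elementwise: it takes the max over two (or more) separate
# linear transforms, max(W1 x + b1, W2 x + b2), so it needs extra parameters.

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x))
```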

Neural Networks: Architectures 2-layer Neural Net, or 1-hidden-layer Neural Net 3-layer Neural Net, or 2-hidden-layer Neural Net Fully-connected layers Lecture 4-77

Example Feed-forward computation of a Neural Network We can efficiently evaluate an entire layer of neurons. Lecture 4-78

Example Feed-forward computation of a Neural Network Lecture 4-79
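A hedged sketch of evaluating a small fully-connected net one layer at a time (the sigmoid nonlinearity and the 3-4-4-1 layer sizes are assumptions chosen for illustration):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))      # activation function (sigmoid assumed)

# assumed layer sizes: 3 inputs -> 4 hidden -> 4 hidden -> 1 output
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

x   = np.random.randn(3, 1)                 # one input column vector
h1  = f(W1.dot(x)  + b1)                    # first hidden layer: all 4 neurons at once
h2  = f(W2.dot(h1) + b2)                    # second hidden layer
out = W3.dot(h2) + b3                       # output scores (no nonlinearity on the output)
print(out.shape)                            # (1, 1)
```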

Setting the number of layers and their sizes more neurons = more capacity Lecture 4-80

Do not use size of neural network as a regularizer. Use stronger regularization instead. (You can play with this demo over at ConvNetJS: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html) Lecture 4-81

Summary
- we arrange neurons into fully-connected layers
- the abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
- neural networks are not really neural
- neural networks: bigger = better (but might have to regularize more strongly)
Lecture 4-82

Next Lecture: More than you ever wanted to know about Neural Networks and how to train them. Lecture 4-83

inputs x → complex graph → outputs y. Reverse-mode differentiation (if you want the effect of many things on one thing), for many different x. Forward-mode differentiation (if you want the effect of one thing on many things), for many different y. Lecture 4-84