CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

Similar documents
Deep Feedforward Networks

Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

CSC 411 Lecture 10: Neural Networks

Deep Learning & Artificial Intelligence WS 2018/2019

Artificial Intelligence

Introduction to Convolutional Neural Networks (CNNs)

Artificial Neural Networks. MGS Lecture 2

Deep Feedforward Networks. Lecture slides for Chapter 6 of Deep Learning Ian Goodfellow Last updated

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Introduction to Neural Networks

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

Deep Feedforward Networks

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Learning Deep Architectures for AI. Part II - Vijay Chakilam

4. Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Lecture 3 Feedforward Networks and Backpropagation

Neural Networks and Deep Learning

Neural Networks and the Back-propagation Algorithm

1 What a Neural Network Computes

ECE521 Lectures 9 Fully Connected Neural Networks

Introduction to Machine Learning Spring 2018 Note Neural Networks

Artificial Neural Networks

Data Mining Part 5. Prediction

Feedforward Neural Nets and Backpropagation

Neural Networks: Backpropagation

Data Mining & Machine Learning

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 12 Feb 23, 2018

Lecture 2: Learning with neural networks

NN V: The generalized delta learning rule

CS60010: Deep Learning

Lecture 3 Feedforward Networks and Backpropagation

Artificial Neuron (Perceptron)

How to do backpropagation in a brain

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Feed-forward Network Functions

Artificial Neural Networks

Advanced Machine Learning

Logistic Regression & Neural Networks

Deep Neural Networks (1) Hidden layers; Back-propagation

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Ch.6 Deep Feedforward Networks (2/3)

A summary of Deep Learning without Poor Local Minima

Unit III. A Survey of Neural Network Model

Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

Deep Neural Networks (1) Hidden layers; Back-propagation

CSE446: Neural Networks Spring Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer

CSC321 Lecture 6: Backpropagation

Neural networks. Chapter 20. Chapter 20 1

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Recurrent Neural Networks

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Lecture 5 Neural models for NLP

Cheng Soon Ong & Christian Walder. Canberra February June 2018

C4 Phenomenological Modeling - Regression & Neural Networks : Computational Modeling and Simulation Instructor: Linwei Wang

CSC242: Intro to AI. Lecture 21

Neural Networks Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav

Multilayer Perceptrons (MLPs)

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

CSCI567 Machine Learning (Fall 2018)

Notes on Back Propagation in 4 Lines

Neural Networks (Part 1) Goals for the lecture

ECE 471/571 - Lecture 17. Types of NN. History. Back Propagation. Recurrent (feedback during operation) Feedforward

Deep Feedforward Networks. Han Shao, Hou Pong Chan, and Hongyi Zhang

Lecture 4 Backpropagation

Course 395: Machine Learning - Lectures

Multilayer Perceptrons and Backpropagation

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

Intro to Neural Networks and Deep Learning

Introduction to Machine Learning

Introduction to Neural Networks

Jakub Hajic Artificial Intelligence Seminar I

Understanding Neural Networks : Part I

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

Multilayer Perceptron

Backpropagation Introduction to Machine Learning. Matt Gormley Lecture 13 Mar 1, 2018

100 inference steps doesn't seem like enough. Many neuron-like threshold switching units. Many weighted interconnections among units

Lecture 15: Exploding and Vanishing Gradients

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

CSC 578 Neural Networks and Deep Learning

BACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Back-Propagation Algorithm. Perceptron Gradient Descent Multilayered neural network Back-Propagation More on Back-Propagation Examples

Learning Deep Architectures for AI. Part I - Vijay Chakilam

Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Neural Networks. Mark van Rossum. January 15, School of Informatics, University of Edinburgh 1 / 28

Day 3 Lecture 3. Optimizing deep networks

Artifical Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Rapid Introduction to Machine Learning/ Deep Learning

EVERYTHING YOU NEED TO KNOW TO BUILD YOUR FIRST CONVOLUTIONAL NEURAL NETWORK (CNN)

Neural Networks and Deep Learning.

Transcription:

CS 6501: Deep Learning for Computer Graphics Basics of Neural Networks Connelly Barnes

Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders How to train neural networks Gradient descent Stochastic gradient descent Automatic differentiation Backpropagation

Perceptron (1957, Cornell) Inputs Output (Class) Weights: arbitrary, learned parameters Bias b: arbitrary, learned parameter

Perceptron (1957, Cornell) Binary classifier, can learn linearly separable patterns. Diagram from Wikipedia

Feedforward neural networks We could connect units (neurons) in any arbitrary graph Input Output

Feedforward neural networks We could connect units (neurons) in any arbitrary graph If no cycles in the graph we call it a feedforward neural network. Input Output

Recurrent neural networks (later) If cycles in the graph we call it a recurrent neural network. Input Output

Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders How to train neural networks Gradient descent Stochastic gradient descent Automatic differentiation Backpropagation

Multilayer Perceptron (1960s) In matrix notation: L = f (WL + b ) i i i 1 i i L " : Input layer (inputs) vector L $ : Hidden layer vector L % : Output layer vector, i 1 W + : Weight matrix for connections from layer i-1 tolayer i b + : Biases for neurons in layer i f i : Activation function for layer i

Activation Functions: Sigmoid / Logistic f x = 1 1 + e 23 df dx = f(x)(1 f x ) Problems: Gradients at tails are almost zero Outputs are not zero-centered

Activation Functions: Tanh f x = tanh x = 2 1 + e 2$3 1 df dx = 1 f x $ Problems: Gradients at tails are almost zero

Activation Functions: ReLU (Rectified Linear Unit) f x = max (x, 0) df dx = A1, if x > 0 0, if x < 0 Pros: Accelerates training stage by 6x over sigmoid/tanh [1] Simple to compute Sparser activation patterns Cons: Neurons can die by getting stuck in zero gradient region Summary: Currently preferred kind of neuron

Universal Approximation Theorem Multilayer perceptron with a single hidden layer and linear output layer can approximate any continuous function on a compact subset of R G to within any desired degree of accuracy. Assumes activation function is bounded, non-constant, monotonically increasing. Also applies for ReLU activation function.

Universal Approximation Theorem In the worst case, exponential number of hidden units may be required. Can informally show this for binary case: If we have n bits input to a binary function, how many possible inputs are there? How many possible binary functions are there? So how many weights do we need to represent a given binary function?

Why Use Deep Networks? Functions representable with a deep rectifier network can require an exponential number of hidden units with a shallow (one hidden layer) network (Goodfellow 6.4) Piecewise linear networks (e.g. using ReLU) can represent functions that have a number of regions exponential in depth of network. Can capture repeating / mirroring / symmetric patterns in data. Empirically, greater depth often results in better generalization.

Neural Network Architecture Architecture: refers to which parameters (e.g. weights) are used in the network and their topological connectivity. Fully connected: A common connectivity pattern for multilayer perceptrons. All possible connections made between layers i-1 and i. Is this network fully connected?

Neural Network Architecture Architecture: refers to which parameters (e.g. weights) are used in the network and their topological connectivity. Fully connected: A common connectivity pattern for multilayer perceptrons. All possible connections made between layers i-1 and i. Input Output Is this network fully connected?

How to Choose Network Architecture? Long discussion Summary: Rules of thumb do not work. Need 10x [or 30x] more training data than weights. Not true if very low noise Might need even more training data if high noise Try many networks with different numbers of units and layers Check generalization using validation dataset or cross-validation.

Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders How to train neural networks Gradient descent Stochastic gradient descent Automatic differentiation Backpropagation

Autoencoders Learn the identity function h I x = x Input Hidden Output Is this supervised or unsupervised learning? Encode Decode

Autoencoders Applications: Input Hidden Output Dimensionality reduction Learning manifolds Hashing for search problems Encode Decode

Overview Simple neural networks Perceptron Feedforward neural networks Multilayer perceptron and properties Autoencoders How to train neural networks Gradient descent Stochastic gradient descent Automatic differentiation Backpropagation

Gradient Descent Discuss amongst students near you: What are some problems that could be easily optimized with gradient descent? Problems where this is difficult? Should the learning rate be constant or change? Step size or Learning rate

Gradient Descent with Energy Functions that have Narrow Valleys Zig-zagging problem Source: "Banana-SteepDesc" by P.A. Simionescu Wikipedia English

Gradient Descent with Momentum x n+1 = x n + Δ n x GK" = x G G Δ n = γ n F(x n )+ mδ n 1 Momentum Could use small value e.g. m=0.5 at first Could use larger value e.g. m=0.9 near end of training when there are more oscillations.

Gradient Descent with Momentum Without Momentum With Momentum Figure from Genevieve B. Orr, Willamette.edu

Stochastic gradient descent Stochastic gradient descent (Wikipedia) Gradient of sum of n terms where n is large Sample rather than computing the full sum Sample size s is mini-batch size Could be 1 (very noisy gradient estimate) Could be 100 (collect photos 100 at a time to find each noisy next estimate for the gradient) Use same step as in gradient descent to the estimated gradient

Stochastic gradient descent Pseudocode: From Wikipedia

Problem Statement Take the gradient of an arbitrary program or model (e.g. a neural network) with respect to the parameters in the model (e.g. weights). If we can do this, we can use gradient descent!

Review: Chain Rule in One Dimension Suppose f: R R and g: R R Define h x = f(g x ) Then what is h O x = dh/dx? h x = f O g x g (x)

Chain Rule in Multiple Dimensions Suppose f: R R R Define and g: R G R R, and x R G h x = f(g " x,, g R x ) Then we can define partial derivatives using the multidimensional chain rule: R f = V f g W x + g W x + XY"

The Problem with Symbolic Derivatives What if our program takes 5 exponential operations: What is dy/dx? (blackboard) y = exp exp exp exp exp x How many exponential operations in the resulting expression? What if the program contained n exponential operations?

Solution: Automatic Differentiation (1960s, 1970s) Write an arbitrary program as consisting of basic operations f 1,..., f n (e.g. +, -, *, cos, sin, ) that we know how to differentiate. Label the inputs of the program as x ",, x G, and the output x ]. The computation: For i = n + 1,, N x + = f + (x a(+) ) π(i): Sequence of parent values (e.g. if π 3 = (1,2), and f 3 =+, then x 3 =x 1 +x 2 ) Reverse mode automatic differentiation: apply the chain rule from the end of the program x ] back towards the beginning. Explanation from Justin Domke

Solution for Simplified Chain of Dependencies Suppose π i = i 1 The computation: For i = n + 1,, N x + = f + (x +2" ) Input For example: x 1 x $ = f $ (x " ) Output x % = f % (x $ ) dx ] What is? dx ]

Solution for Simplified Chain of Dependencies Suppose π i = i 1 The computation: For i = n + 1,, N x + = f + (x +2" ) Input For example: x 1 x $ = f $ (x " ) Output x % = f % (x $ ) dx ] What is = dx ] 1

Solution for Simplified Chain of Dependencies Suppose π i = i 1 The computation: For i = n + 1,, N x + = f + (x +2" ) Input For example: x 1 x $ = f $ (x " ) Output x % = f % (x $ ) What is this? dx ] dx ] What is in terms of? dx + dx +K" dx ] dx + = dx ] dx +K" x +K" x + dx +K" = dx + x +K" x + dx +K" dx ] = dx + dx ] x +K" x +

Solution for Simplified Chain of Dependencies Suppose π i = i 1 The computation: For i = n + 1,, N x + = f + (x +2" ) Input For example: x 1 x $ = f $ (x " ) Output x % = f % (x $ ) dx ] dx ] dx ] What is in terms of? = dx ] dx + dx +K" dx + dx +K" x +K" x + dx ] dx ] = 1 Conclusion: run the computation forwards. Then initialize dx and work backwards through the computation to find ] for each i dx] dx + from. This gives us the gradient of the output (x ] ) with respect dx +K" to every expression in our compute graph!

What if the Dependency Graph is More Complex? x Input % = Output f % (x $ ) x x $ = x d = 1 f $ (x " ) f d (x %,x e ) x e = f e (x $ ) The computation: For i = n + 1,, N x + = f + (x a(+) ) π(i): Sequence of parent values (e.g. if π 3 = (1,2), and f 3 =+, then x 3 =x 1 +x 2 ) Solution: apply multi-dimensional chain rule.

Solution: Automatic Differentiation (1960s, 1970s) Computation: x + = f + (x a(+) ) Multidimensional chain rule: Result: R f = V f g W x + g W x + XY" f(g " x,, g R x ) dx ] = V dx ] x W dx + dx W x + W:+ a(w) Explanation from Justin Domke

Solution: Automatic Differentiation (1960s, 1970s) Back-propagation algorithm: initialize: dx ] dx ] = 1 For i = N 1, N 2,, 1, compute: Example on blackboard for a program with one addition. dx ] = V dx ] x W dx + dx W x + W:+ a(w) Now we have differentiated the output of the program x ] with respect to the inputs x ",, x G, as well as every computed expression x +. Explanation from Justin Domke

Backpropagation Algorithm (1960s- 1980s) Apply reverse mode automatic differentiation to a neural network s loss function. A special case of what we just derived. If we have one output neuron, squared error is: From Wikipedia

Backpropagation Algorithm (1960s- 1980s) From Wikipedia

Backpropagation Algorithm (1960s- 1980s) Apply the chain rule twice: Last term is easy: From Wikipedia

Backpropagation Algorithm (1960s- 1980s) Apply the chain rule twice: Second term is easy: o W net W = φ (net W ) From Wikipedia

Backpropagation Algorithm (1960s- 1980s) Apply the chain rule twice: If the neuron is in output layer, first term is easy: (Derivation on board) From Wikipedia

Backpropagation Algorithm (1960s- 1980s) Apply the chain rule twice: If the neuron is interior neuron, we use the chain rule from automatic differentiation. To do this, we need to know what expressions depend on the current neuron s output o j? Answer: other neurons input sums, i.e. net l for all neurons l receiving inputs from the current neuron. From Wikipedia

Backpropagation Algorithm (1960s- 1980s) Apply the chain rule twice: If the neuron is an interior neuron, chain rule: All neurons receiving input from the current neuron. From Wikipedia

Backpropagation Algorithm (1960s- 1980s) Partial derivative of error E with respect to weight w ij : E w +W = δ W o + δ W = E o W o (o W t W ) W = φ (o net W ) k W V δ X w WX X r if j is an output neuron if j is an interior neuron All neurons receiving input from the current neuron. From Wikipedia

Backpropagation Algorithm (1960s- 1980s) Forward direction Calculate network and error.

Backpropagation Algorithm (1960s- 1980s) Backward direction Backpropagate: from output to input, recursively compute E w +W = I E

Gradient Descent with Backpropagation Initialize weights at good starting point w 0 Repeatedly apply gradient descent step (1) Continue training until validation error hits a minimum. Step size or Learning rate w GK" = w G γ G I E(w G ) (1)

Stochastic Gradient Descent with Backpropagation Initialize weights at good starting point w 0 Repeat until validation error hits a minimum: Randomly shuffle dataset Loop through mini-batches of data, batch index is i Calculate stochastic gradient using backpropagation for each, and apply update rule (1) w GK" = w G γ G I E + (w G ) Step size or Learning rate (1)