Neural Networks. Bishop PRML Ch. 5. Alireza Ghane / Greg Mori. Topics: Feed-forward Networks, Network Training, Error Backpropagation, Applications.


Neural Networks
- Neural networks arise from attempts to model human/animal brains
- Many models, many claims of biological plausibility
- We will focus on multi-layer perceptrons: mathematical properties rather than biological plausibility

Outline
- Feed-forward Networks
- Network Training
- Error Backpropagation
- Applications


Feed-forward Networks
- We have looked at generalized linear models of the form
  $y(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) \right)$
  for fixed non-linear basis functions $\phi(\cdot)$
- We now extend this model by allowing adaptive basis functions and learning their parameters
- In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:
  $\phi_j(\mathbf{x}) = f\left( \sum \dots \right)$


Feed-forward Networks
- Starting with input $\mathbf{x} = (x_1, \dots, x_D)$, construct linear combinations:
  $a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$
- These $a_j$ are known as activations
- Pass through an activation function $h(\cdot)$ to get output $z_j = h(a_j)$
- [Figure: model of an individual neuron, from Russell and Norvig, AIMA2e]


Activation Functions
Can use a variety of activation functions:
- Sigmoidal (S-shaped): logistic sigmoid $1/(1 + \exp(-a))$ (useful for binary classification); hyperbolic tangent $\tanh$
- Radial basis function: $z_j = \sum_i (x_i - w_{ji})^2$
- Softmax: useful for multi-class classification
- Identity: useful for regression
- Threshold, ...
Needs to be differentiable for gradient-based learning (later). Can use different activation functions in each unit.
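As a small illustration (mine, not part of the original slides), a few of these activation functions written as vectorized NumPy helpers; the function names are my own choices:

```python
import numpy as np

def logistic_sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a)); maps activations to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    """Hyperbolic tangent; maps activations to (-1, 1)."""
    return np.tanh(a)

def softmax(a):
    """Softmax over the last axis; outputs a probability distribution."""
    e = np.exp(a - np.max(a, axis=-1, keepdims=True))  # shift for numerical stability
    return e / np.sum(e, axis=-1, keepdims=True)

def identity(a):
    """Identity activation, used for regression outputs."""
    return a
```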

Feed-forward Networks
[Figure: two-layer network with inputs $x_0, x_1, \dots, x_D$, hidden units $z_0, z_1, \dots, z_M$, outputs $y_1, \dots, y_K$, and weights such as $w_{MD}^{(1)}$, $w_{KM}^{(2)}$, $w_{10}^{(2)}$]
- Connect together a number of these units into a feed-forward network (DAG)
- Above shows a network with one layer of hidden units
- Implements the function:
  $y_k(\mathbf{x}, \mathbf{w}) = h\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)$
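A minimal NumPy sketch (not from the slides) of the forward pass of this one-hidden-layer network; the weight shapes, bias handling, and default activations are assumptions of my own:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, out=lambda a: a):
    """Forward pass of a one-hidden-layer network.

    x:  input vector, shape (D,)
    W1: first-layer weights, shape (M, D);  b1: biases, shape (M,)
    W2: second-layer weights, shape (K, M); b2: biases, shape (K,)
    h:  hidden activation; out: output activation (identity for regression).
    Returns the output y and the hidden quantities (a, z) for later use in backprop.
    """
    a = W1 @ x + b1          # hidden activations a_j
    z = h(a)                 # hidden unit outputs z_j = h(a_j)
    y = out(W2 @ z + b2)     # network outputs y_k
    return y, a, z

# Example with random weights: D=3 inputs, M=4 hidden units, K=2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
y, a, z = forward(rng.normal(size=3), W1, b1, W2, b2)
```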


Network Training
- Given a specified network structure, how do we set its parameters (weights)?
- As usual, we define a criterion to measure how well our network performs, and optimize against it
- For regression, training data are $(\mathbf{x}_n, t_n)$, $t_n \in \mathbb{R}$; squared error naturally arises:
  $E(\mathbf{w}) = \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2$
- For binary classification, this is another discriminative model; maximum likelihood gives
  $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} \{ 1 - y_n \}^{1 - t_n}$
  $E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$

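A small illustration (mine, not the slides') of these two error functions in NumPy, assuming y and t are arrays of network outputs and targets:

```python
import numpy as np

def sum_squared_error(y, t):
    """Regression criterion: E(w) = sum_n (y_n - t_n)^2."""
    return np.sum((y - t) ** 2)

def cross_entropy_error(y, t, eps=1e-12):
    """Binary classification criterion (negative log-likelihood):
    E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)].
    eps clips y away from 0 and 1 to avoid log(0)."""
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Example: three sigmoid outputs vs. binary targets.
print(cross_entropy_error(np.array([0.9, 0.2, 0.6]), np.array([1, 0, 1])))
```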

Parameter Optimization
[Figure: error surface $E(\mathbf{w})$ over weight space $(w_1, w_2)$, with stationary points $\mathbf{w}_A$, $\mathbf{w}_B$, $\mathbf{w}_C$]
- For either of these problems, the error function $E(\mathbf{w})$ is nasty
- Nasty = non-convex
- Non-convex = has local minima

Descent Methods
- The typical strategy for optimization problems of this sort is a descent method:
  $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)}$
- As we've seen before, these come in many flavours:
  - Gradient descent: step along $-\nabla E(\mathbf{w}^{(\tau)})$
  - Stochastic gradient descent: step along $-\nabla E_n(\mathbf{w}^{(\tau)})$
  - Newton-Raphson (second order): uses $\nabla^2 E$
- All of these can be used here; stochastic gradient descent is particularly effective
  - Redundancy in training data, escaping local minima

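A sketch (my own) of the stochastic gradient descent loop; the learning rate eta and the per-example gradient function grad_En are assumed, with grad_En supplied by backpropagation in the next section:

```python
import numpy as np

def sgd(w, grad_En, data, eta=0.1, epochs=10, seed=0):
    """Stochastic gradient descent: w <- w - eta * grad E_n(w),
    visiting one training example (x_n, t_n) at a time in random order.

    grad_En(w, x_n, t_n) must return the gradient of the per-example error.
    """
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(len(data)):
            x_n, t_n = data[n]
            w = w - eta * grad_En(w, x_n, t_n)
    return w
```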

Computing Gradients
- The function $y(\mathbf{x}_n, \mathbf{w})$ implemented by a network is complicated
- It isn't obvious how to compute error function derivatives with respect to weights
- Numerical method for calculating error derivatives: use finite differences:
  $\frac{\partial E_n}{\partial w_{ji}} \approx \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon}$
- How much computation would this take with $W$ weights in the network?
  $O(W)$ per derivative, $O(W^2)$ total per gradient descent step

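For illustration (not from the slides), a central-difference gradient estimate; E_n here is any scalar error function of a flat weight vector, and the $O(W^2)$ cost shows up as the loop over all W weights:

```python
import numpy as np

def numerical_gradient(E_n, w, eps=1e-6):
    """Estimate dE_n/dw_i for every weight by central differences.

    Each of the W weights needs two evaluations of E_n (each itself O(W)),
    so the full gradient costs O(W^2); this is the motivation for backprop.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E_n(w_plus) - E_n(w_minus)) / (2 * eps)
    return grad

# Example: E_n(w) = ||w||^2 has gradient 2w.
print(numerical_gradient(lambda w: np.sum(w ** 2), np.array([1.0, -2.0, 3.0])))
```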


Error Backpropagation
- Backprop is an efficient method for computing the error derivatives $\frac{\partial E_n}{\partial w_{ji}}$: $O(W)$ to compute derivatives wrt all weights
- First, feed training example $\mathbf{x}_n$ forward through the network, storing all activations $a_j$
- Calculating derivatives for weights connected to output nodes is easy, e.g. for linear output nodes $y_k = \sum_i w_{ki} z_i$:
  $\frac{\partial E_n}{\partial w_{ki}} = \frac{\partial}{\partial w_{ki}} \frac{1}{2}(y_n - t_n)^2 = (y_n - t_n) z_i$
- For hidden layers, propagate error backwards from the output nodes

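A tiny sketch (mine) of that output-layer derivative for a single linear output unit; y, t, and z are the forward-pass quantities from the earlier example:

```python
import numpy as np

def output_layer_grad(y, t, z):
    """Gradient of E_n = 0.5 * (y - t)^2 wrt the output weights,
    for a linear output y = sum_i w_i z_i: dE_n/dw_i = (y - t) * z_i."""
    return (y - t) * z

# Example: scalar output y, target t, hidden outputs z.
print(output_layer_grad(y=1.3, t=1.0, z=np.array([0.5, -0.2, 0.8])))
```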

Chain Rule for Partial Derivatives
A reminder: for $f(x, y)$, with $f$ differentiable wrt $x$ and $y$, and $x$ and $y$ differentiable wrt $u$ and $v$:
$\frac{\partial f}{\partial u} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u}$
and
$\frac{\partial f}{\partial v} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial v}$
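A quick worked instance (added here for concreteness, not part of the original slides): take $f(x, y) = x^2 y$ with $x = uv$ and $y = u + v$.

```latex
\frac{\partial f}{\partial u}
  = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u}
  + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u}
  = (2xy)\,v + (x^2)\,1
  = 2uv^2(u + v) + u^2 v^2 ,
```

which matches differentiating $f = u^3 v^2 + u^2 v^3$ directly.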

Error Backpropagation
- We can write
  $\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} E_n(a_{j_1}, a_{j_2}, \dots, a_{j_m})$
  where $\{j_i\}$ are the indices of the nodes in the same layer as node $j$
- Using the chain rule:
  $\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} + \sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ji}}$
  where $k$ runs over all other nodes in the same layer as node $j$
- Since $a_k$ does not depend on $w_{ji}$, all terms in the summation go to 0:
  $\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}}$


Error Backpropagation cont.
- Introduce the error $\delta_j \equiv \frac{\partial E_n}{\partial a_j}$, so that
  $\frac{\partial E_n}{\partial w_{ji}} = \delta_j \frac{\partial a_j}{\partial w_{ji}}$
- The other factor is:
  $\frac{\partial a_j}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_k w_{jk} z_k = z_i$

Error Backpropagation cont.
- The error $\delta_j$ can also be computed using the chain rule:
  $\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \underbrace{\frac{\partial E_n}{\partial a_k}}_{\delta_k} \frac{\partial a_k}{\partial a_j}$
  where $k$ runs over all nodes in the layer after node $j$
- Eventually:
  $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$
- A weighted sum of the later error caused by this weight

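Putting the pieces together, a compact sketch (my own, under the same one-hidden-layer setup as the forward-pass example, with tanh hidden units, a linear output, and squared error) of backpropagation for a single training example:

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    """Gradients of E_n = 0.5 * ||y - t||^2 for a tanh-hidden, linear-output net."""
    # Forward pass, storing activations.
    a = W1 @ x + b1                                 # hidden activations a_j
    z = np.tanh(a)                                  # hidden outputs z_j = h(a_j)
    y = W2 @ z + b2                                 # linear outputs y_k

    # Backward pass.
    delta_out = y - t                               # output errors: delta_k = y_k - t_k
    delta_hid = (1 - z ** 2) * (W2.T @ delta_out)   # delta_j = h'(a_j) sum_k w_kj delta_k

    # Gradients: dE_n/dw_ji = delta_j * z_i (or x_i for the first layer).
    grad_W2 = np.outer(delta_out, z)
    grad_b2 = delta_out
    grad_W1 = np.outer(delta_hid, x)
    grad_b1 = delta_hid
    return grad_W1, grad_b1, grad_W2, grad_b2
```

These gradients can be sanity-checked against the numerical_gradient sketch from the finite-differences slide.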


Applications of Neural Networks
Many success stories for neural networks:
- Credit card fraud detection
- Hand-written digit recognition
- Face detection
- Autonomous driving (CMU ALVINN)

Hand-written Digit Recognition
- MNIST: the standard dataset for hand-written digit recognition
- 60,000 training images, 10,000 test images

LeNet-5
[Figure: LeNet-5 architecture. INPUT 32x32 -> C1: 6@28x28 feature maps (convolutions) -> S2: 6@14x14 (subsampling) -> C3: 16@10x10 (convolutions) -> S4: 16@5x5 (subsampling) -> C5: 120 units (full connection) -> F6: 84 units (full connection) -> OUTPUT: 10 (Gaussian connections)]
- LeNet developed by Yann LeCun et al.
- Convolutional neural network
- Local receptive fields (5x5 connectivity)
- Subsampling (2x2)
- Shared weights (reuse the same 5x5 filter)
- Breaking symmetry
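For concreteness, a rough modern re-creation of this architecture (my own sketch in PyTorch, which is not used in the original slides; it substitutes ReLU and max-pooling for the original sigmoidal units, trainable subsampling, and Gaussian output connections):

```python
import torch
import torch.nn as nn

class LeNet5Sketch(nn.Module):
    """Approximation of LeNet-5 for 1x32x32 inputs (e.g. MNIST padded from 28x28)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # C1: 6@28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # S2: 6@14x14
            nn.Conv2d(6, 16, kernel_size=5),  # C3: 16@10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                  # S4: 16@5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),       # C5
            nn.ReLU(),
            nn.Linear(120, 84),               # F6
            nn.ReLU(),
            nn.Linear(84, num_classes),       # OUTPUT
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a batch of four 32x32 grayscale images.
print(LeNet5Sketch()(torch.zeros(4, 1, 32, 32)).shape)  # torch.Size([4, 10])
```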

[Figure: the 82 errors made by LeNet-5 (0.82% test error rate), each shown as true digit -> predicted digit]

Deep Learning
- A collection of important techniques to improve performance: multi-layer networks (as in LeNet-5), auto-encoders for unsupervised feature learning, sparsity, convolutional networks, parameter tying, probabilistic models for initializing parameters, hinge activation functions for steeper gradients
- ufldl.stanford.edu/eccv10-tutorial
- http://www.image-net.org/challenges/lsvrc/2012/supervision.pdf
- Scalability is key: can use lots of data, since stochastic gradient descent is memory-efficient

Conclusion
- Readings: Ch. 5.1, 5.2, 5.3
- Feed-forward networks can be used for regression or classification
  - Similar to linear models, except with adaptive non-linear basis functions
  - These allow us to do more than e.g. linear decision boundaries
  - Different error functions
- Learning is more difficult, since the error function is not convex
  - Use stochastic gradient descent; obtain a (good?) local minimum
- Backpropagation for efficient gradient computation