
BACKPROPAGATION

Neural network training optimization problem

$$\min_w J(w)$$

The application of gradient descent to this problem is called backpropagation. Backpropagation is gradient descent applied to $J(w)$ in a feed-forward network.

Deriving backpropagation

We have to evaluate the derivative $\nabla_w J(w)$. Since $J$ is additive over training points, $J(w) = \sum_n J_n(w)$, it suffices to derive $\nabla_w J_n(w)$.
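As a small illustration of this additivity, here is a toy NumPy sketch (assuming, purely for illustration, a linear model with squared-error cost $J_n(w) = \tfrac{1}{2}(\langle w, x_n\rangle - y_n)^2$ rather than a network): the gradient of $J$ is the sum of the per-point gradients $\nabla_w J_n(w)$.

```python
import numpy as np

# Toy setup (an assumption for illustration): a linear model with
# squared-error cost J_n(w) = 0.5 * (<w, x_n> - y_n)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # five training points x_n in R^3
y = rng.normal(size=5)        # targets y_n
w = rng.normal(size=3)        # current weights

def grad_Jn(w, x_n, y_n):
    """Per-point gradient of J_n(w)."""
    return (x_n @ w - y_n) * x_n

# Additivity: the gradient of J is the sum of the per-point gradients.
grad_J = sum(grad_Jn(w, X[n], y[n]) for n in range(len(y)))

# The same gradient computed in one shot agrees.
print(np.allclose(grad_J, X.T @ (X @ w - y)))   # True
```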

The next few slides were written for a different class, and you are not expected to know their content. I show them only to illustrate the interesting way in which gradient descent interleaves with the feed-forward architecture.


CHAIN RULE

Recall from calculus:

Chain rule
Consider a composition of functions $f \circ g(x) = f(g(x))$. Then

$$\frac{d(f \circ g)}{dx} = \frac{df}{dg} \frac{dg}{dx}$$

If the derivatives of $f$ and $g$ are $f'$ and $g'$, that means:

$$\frac{d(f \circ g)}{dx}(x) = f'(g(x))\, g'(x)$$

Application to feed-forward network

Let $w^{(k)}$ denote the weights in layer $k$. The function represented by the network is

$$f_w(x) = f^{(K)}_{w^{(K)}} \circ \cdots \circ f^{(1)}_{w^{(1)}}(x)$$

To solve the optimization problem, we have to compute derivatives of the form

$$\frac{d}{dw} D(f_w(x_n), y_n) = \frac{dD(\,\cdot\,, y_n)}{df_w} \frac{df_w}{dw}$$
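A quick numerical sanity check of the chain rule, with my own toy choice of $f$ and $g$ (not from the slides): compare $f'(g(x))\,g'(x)$ against a finite-difference estimate of $\frac{d(f \circ g)}{dx}$.

```python
import numpy as np

# Toy composition (illustrative choice): f(u) = sin(u), g(x) = x^2.
f, df = np.sin, np.cos
g  = lambda x: x**2
dg = lambda x: 2 * x

x = 1.3
analytic = df(g(x)) * dg(x)                       # chain rule: f'(g(x)) g'(x)
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)   # central finite difference
print(analytic, numeric)                          # the two values agree closely
```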

DECOMPOSING THE DERIVATIVES

The chain rule means we compute the derivatives layer by layer. Suppose we are only interested in the weights of layer $k$, and keep all other weights fixed. The function $f$ represented by the network is then

$$f_{w^{(k)}}(x) = f^{(K)} \circ \cdots \circ f^{(k+1)} \circ f^{(k)}_{w^{(k)}} \circ f^{(k-1)} \circ \cdots \circ f^{(1)}(x)$$

The first $k-1$ layers enter only as the function value of $x$, so we define

$$z^{(k)} := f^{(k-1)} \circ \cdots \circ f^{(1)}(x)$$

and get

$$f_{w^{(k)}}(x) = f^{(K)} \circ \cdots \circ f^{(k+1)} \circ f^{(k)}_{w^{(k)}}(z^{(k)})$$

If we differentiate with respect to $w^{(k)}$, the chain rule gives

$$\frac{d}{dw^{(k)}} f_{w^{(k)}}(x) = \frac{df^{(K)}}{df^{(K-1)}} \cdots \frac{df^{(k+1)}}{df^{(k)}} \cdot \frac{df^{(k)}_{w^{(k)}}}{dw^{(k)}}$$

WITHIN A SINGLE LAYER

Each $f^{(k)}$ is a vector-valued function $f^{(k)}: \mathbb{R}^{d_k} \to \mathbb{R}^{d_{k+1}}$. It is parametrized by the weights $w^{(k)}$ of the $k$th layer and takes an input vector $z \in \mathbb{R}^{d_k}$. We write $f^{(k)}(z, w^{(k)})$.

Layer-wise derivative

Since $f^{(k)}$ and $f^{(k+1)}$ are vector-valued, we get a Jacobian matrix

$$\frac{df^{(k+1)}}{df^{(k)}} = \begin{pmatrix} \frac{\partial f^{(k+1)}_1}{\partial f^{(k)}_1} & \cdots & \frac{\partial f^{(k+1)}_1}{\partial f^{(k)}_{d_k}} \\ \vdots & & \vdots \\ \frac{\partial f^{(k+1)}_{d_{k+1}}}{\partial f^{(k)}_1} & \cdots & \frac{\partial f^{(k+1)}_{d_{k+1}}}{\partial f^{(k)}_{d_k}} \end{pmatrix} =: \Delta^{(k)}(z, w^{(k+1)})$$

$\Delta^{(k)}$ is a matrix of size $d_{k+1} \times d_k$. The derivatives in the matrix quantify how $f^{(k+1)}$ reacts to changes in the argument of $f^{(k)}$ if the weights $w^{(k+1)}$ and $w^{(k)}$ of both functions are fixed.
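As a minimal sketch of such a layer-wise Jacobian, suppose (this is an assumption for illustration; the slide does not fix the layer form) that the layer computes $f(z, W) = \sigma(Wz)$ with an elementwise sigmoid. Then the Jacobian with respect to the layer input is $\mathrm{diag}\big(\sigma'(Wz)\big)\,W$, a $d_{k+1} \times d_k$ matrix:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer(z, W):
    """One layer f(z, W) = sigma(W z), mapping R^{d_k} -> R^{d_{k+1}}."""
    return sigmoid(W @ z)

def layer_jacobian(z, W):
    """Jacobian of the layer with respect to its input z: diag(sigma'(Wz)) W."""
    s = sigmoid(W @ z)
    return (s * (1 - s))[:, None] * W   # shape (d_{k+1}, d_k)

rng = np.random.default_rng(0)
d_k, d_k1 = 4, 3
W = rng.normal(size=(d_k1, d_k))       # weights of the layer
z = rng.normal(size=d_k)               # input vector
print(layer_jacobian(z, W).shape)      # (3, 4), i.e. d_{k+1} x d_k as stated above
```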

BACKPROPAGATION ALGORITHM

Let $w^{(1)}, \ldots, w^{(K)}$ be the current settings of the layer weights. These have either been computed in the previous iteration, or (in the first iteration) are initialized at random.

Step 1: Forward pass

We start with an input vector $x$ and compute

$$z^{(k)} := f^{(k)} \circ \cdots \circ f^{(1)}(x)$$

for all layers $k$.

Step 2: Backward pass

Start with the last layer. Update the weights $w^{(K)}$ by performing a gradient step on $D\big(f^{(K)}(z^{(K)}, w^{(K)}),\, y\big)$, regarded as a function of $w^{(K)}$ (so $z^{(K)}$ and $y$ are fixed). Denote the updated weights $\tilde{w}^{(K)}$.

Move backwards one layer at a time. At layer $k$, we have already computed the updates $\tilde{w}^{(K)}, \ldots, \tilde{w}^{(k+1)}$. Update $w^{(k)}$ by a gradient step, where the derivative is computed as

$$\Delta^{(K-1)}(z^{(K-1)}, \tilde{w}^{(K)}) \cdots \Delta^{(k)}(z^{(k)}, \tilde{w}^{(k+1)})\, \frac{df^{(k)}}{dw^{(k)}}(z^{(k)}, w^{(k)})$$

On reaching level 1, go back to Step 1 and recompute the $z^{(k)}$ using the updated weights.
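A compact NumPy sketch of the two passes, under my own illustrative assumptions (sigmoid layers $f^{(k)}(z) = \sigma(W^{(k)}z)$, squared-error cost $D$, plain gradient steps); the slide does not fix these choices. For simplicity, this sketch evaluates all Jacobians at the weights from the forward pass rather than at the per-layer updated weights $\tilde{w}$ described above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, Ws):
    """Step 1: compute z^(k) = f^(k) o ... o f^(1)(x) for all layers k."""
    zs = [x]
    for W in Ws:
        zs.append(sigmoid(W @ zs[-1]))
    return zs                            # zs[k] is the output of layer k, zs[0] = x

def backward(zs, Ws, y, lr=0.5):
    """Step 2: gradient step on each layer, moving backwards from the last."""
    new_Ws = [W.copy() for W in Ws]
    delta = zs[-1] - y                   # dD/df for D(f, y) = 0.5 * ||f - y||^2
    for k in reversed(range(len(Ws))):
        s = zs[k + 1]                    # sigma(W^(k) z^(k)), stored in the forward pass
        grad_pre = delta * s * (1 - s)   # derivative through the sigmoid
        new_Ws[k] -= lr * np.outer(grad_pre, zs[k])   # gradient step on w^(k)
        delta = Ws[k].T @ grad_pre       # chain the layer Jacobian backwards
    return new_Ws

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]   # a small 3-4-2 network
x, y = rng.normal(size=3), np.array([0.1, 0.9])
for _ in range(500):                     # alternate forward and backward passes
    zs = forward(x, Ws)
    Ws = backward(zs, Ws, y)
print(forward(x, Ws)[-1])                # close to y after training
```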

SUMMARY: BACKPROPAGATION

Backpropagation is a gradient descent method for the optimization problem

$$\min_w J(w) = \sum_{i=1}^N D(f_w(x_i), y_i)$$

$D$ must be chosen such that it is additive over data points.

It alternates between forward passes that update the layer-wise function values $z^{(k)}$ given the current weights, and backward passes that update the weights using the current $z^{(k)}$.

The layered architecture means that (1) we can compute each $z^{(k)}$ from $z^{(k-1)}$, and (2) we can use the weight updates computed in layers $K, \ldots, k+1$ to update the weights in layer $k$.

FEATURE EXTRACTION

Features

Raw measurement data is typically not used directly as input for a learning algorithm. Some form of preprocessing is applied first. We can think of this preprocessing as a function, e.g.

$$F: \text{raw data space} \to \mathbb{R}^d$$

($\mathbb{R}^d$ is only an example, but a very common one.) If the raw measurements are $m_1, \ldots, m_N$, the data points which are fed into the learning algorithm are the images $x_n := F(m_n)$.

Terminology
- $F$ is called a feature map.
- Its dimensions (the dimensions of its range space) are called features.
- The preprocessing step (= application of $F$ to the raw data) is called feature extraction.
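A minimal sketch of a feature map, with a made-up choice of $F$ (simple summary statistics of a raw signal); both the features and the data are purely illustrative assumptions.

```python
import numpy as np

def F(m):
    """A toy feature map: raw measurement (a 1-d signal) -> R^3."""
    m = np.asarray(m, dtype=float)
    return np.array([m.mean(), m.std(), m.max() - m.min()])

# Raw measurements m_1, ..., m_N (here: random signals of varying length).
rng = np.random.default_rng(0)
raw = [rng.normal(size=int(rng.integers(50, 100))) for _ in range(4)]

# Feature extraction: the points fed to the learning algorithm are x_n := F(m_n).
X = np.stack([F(m) for m in raw])
print(X.shape)   # (4, 3): N = 4 data points with d = 3 features each
```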

EXAMPLE PROCESSING PIPELINE

This is what a typical processing pipeline for a supervised learning problem might look like:

Raw data (measurements) → Feature extraction (preprocessing) → Working data → Mark patterns → Split into training data and test data (patterns marked) → Training (calibration) → Trained model → Apply on test data → Error estimate
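The same pipeline as a short sketch; the data, the feature map, and the nearest-centroid classifier are deliberately simple stand-ins of my own choosing, not methods from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Raw data (measurements) with marked patterns (labels): two toy classes of signals.
raw = [rng.normal(loc=c, size=80) for c in (0.0, 2.0) for _ in range(30)]
labels = np.array([0] * 30 + [1] * 30)

# Feature extraction (preprocessing) -> working data.
def F(m):
    return np.array([np.mean(m), np.std(m)])
X = np.stack([F(m) for m in raw])

# Split into training data and test data.
idx = rng.permutation(len(X))
train, test = idx[:40], idx[40:]

# Training (calibration): a nearest-centroid classifier as the trained model.
centroids = np.stack([X[train][labels[train] == c].mean(axis=0) for c in (0, 1)])

# Apply on test data -> error estimate.
dists = ((X[test][:, None, :] - centroids) ** 2).sum(axis=2)
pred = np.argmin(dists, axis=1)
print("estimated test error:", np.mean(pred != labels[test]))
```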

FEATURE EXTRACTION VS LEARNING

Where does learning start?

It is often a matter of definition where feature extraction stops and learning starts. If we have a perfect feature extractor, learning is trivial. For example: Consider a classification problem with two classes. Suppose the feature extractor maps all raw measurements of class 1 to a single point, and all data points of class 2 to a single, distinct point. Then classification is trivial. That is of course what the classifier is supposed to do in the end (e.g. map to the points 0 and 1).

Multi-layer networks and feature extraction

An interesting aspect of multi-layer neural networks is that their early layers can be interpreted as feature extraction. For certain types of problems (e.g. computer vision), features were long hand-tuned by humans. Features extracted by neural networks give much better results. Several important problems, such as object recognition and face recognition, have basically been solved in this way.

DEEP NETWORKS AS FEATURE EXTRACTORS

[Figure: a feed-forward network with inputs $x_1, x_2, \ldots, x_d$, hidden layers $f^{(1)}, f^{(2)}, \ldots$, and a final layer $f^{(K)}(\,\cdot\,) = \sigma(\langle w^{(K)}, \cdot\,\rangle)$.]

The network in the figure is a classifier $f: \mathbb{R}^d \to \{0, 1\}$. Suppose we subdivide the network into the first $K-1$ layers and the final layer, by defining

$$F(x) := f^{(K-1)} \circ \cdots \circ f^{(1)}(x)$$

The entire network is then

$$f(x) = f^{(K)} \circ F(x)$$

The function $f^{(K)}$ is a two-class logistic regression classifier. We can hence think of $f$ as a feature extraction $F$ followed by linear classification $f^{(K)}$.
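A small sketch of this decomposition with made-up weights and sigmoid layers (purely illustrative assumptions): $F$ collects the first $K-1$ layers, and the final layer is a logistic-regression classifier applied to $F(x)$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d = 5
W1, W2 = rng.normal(size=(8, d)), rng.normal(size=(4, 8))   # layers 1, ..., K-1
w_K = rng.normal(size=4)                                    # final-layer weights

def F(x):
    """Feature extraction: the first K-1 layers, F(x) = f^(K-1) o ... o f^(1)(x)."""
    return sigmoid(W2 @ sigmoid(W1 @ x))

def f(x):
    """The full network: two-class logistic regression on the extracted features."""
    return sigmoid(w_K @ F(x))          # f^(K)(.) = sigma(<w^(K), .>)

x = rng.normal(size=d)
print(F(x))          # the feature representation of x
print(f(x) > 0.5)    # the class decision
```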

A SIMPLE EXAMPLE

[Figure: a 64-2-3 feed-forward network with inputs $x_1, \ldots, x_{64}$, hidden units $h_1, h_2$ (with input-to-hidden weights $w_{1,1}, w_{1,2}, \ldots, w_{64,1}, w_{64,2}$), and outputs $f_1, f_2, f_3$.]

Problem: Classify characters into three classes (E, F and L). Each character is given as an 8 × 8 = 64 pixel image.

Neural network:
- 64 input units (= pixels)
- 2 hidden units
- 3 binary output units, where $f_i(x) = 1$ means the image is in class $i$.
- Each hidden unit has 64 input weights, one per pixel. The weight values can be plotted as 8 × 8 images.

A SIMPLE EXAMPLE

[Figure: training data for the three characters (with random noise), and the learned input-to-hidden weights of $h_1$ and $h_2$ plotted as 8 × 8 images, for a 64-2-3 sigmoidal network classifying three characters. Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001]

- Dark regions = large weight values.
- Note that the weights emphasize regions that distinguish the characters.
- We can think of each weight (= each pixel) as a feature.
- The features with large weights for $h_1$ distinguish {E, F} from L.
- The features for $h_2$ distinguish {E, L} from F.
- In large networks, such patterns are difficult to interpret in this way.
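If such a trained 64-2-3 network were at hand, the weight images could be produced roughly like this (a hypothetical sketch: `W_hidden` stands in for the learned 2 × 64 input-to-hidden weight matrix, which is not available here, so random values are used).

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder for the learned input-to-hidden weights, shape (2, 64);
# random values stand in for the result of training.
W_hidden = np.random.default_rng(0).normal(size=(2, 64))

fig, axes = plt.subplots(1, 2)
for i, ax in enumerate(axes):
    ax.imshow(W_hidden[i].reshape(8, 8), cmap="gray")   # one 8x8 weight image per hidden unit
    ax.set_title(f"weights of h_{i + 1}")
    ax.axis("off")
plt.show()
```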

EXAMPLE: AUTOENCODERS

Autoencoders are an example of the effect of layer sizes. An autoencoder is a neural network that is trained on its own input: If the network has weights $W$ and represents a function $f_W$, training solves the optimization problem

$$\min_W \| x - f_W(x) \|^2$$

or something similar for a different norm.

That seems pointless at first glance: The network tries to approximate the identity function using its (possibly nonlinear) component functions. However: If the layers in the middle have much fewer nodes than those at the top and bottom, the network learns to compress the input.

AUTOENCODERS

[Figure: two networks with layers $f^{(1)}, f^{(2)}, f^{(3)}$ mapping $x$ to $f(x) \approx x$. Left: all layers have the same width, no effect. Right: narrow middle layers, compression effect.]

Train the network on many images. Once trained:
- Input an image $x$.
- Store $x' := f^{(2)}(x)$. Note $x'$ has fewer dimensions than $x$: compression.
- To decompress $x'$: input it into $f^{(3)}$ and apply the remaining layers of the network: the reconstruction $f(x) \approx x$ of $x$.
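A minimal autoencoder sketch under my own illustrative assumptions (linear layers, squared-error loss, plain gradient descent, one weight matrix per half): the narrow middle layer produces the compressed code $x'$, and the remaining layer reconstructs $f(x) \approx x$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_code, N = 10, 3, 200

# Toy data that lies near a 3-dimensional subspace, so compression is possible.
X = rng.normal(size=(N, d_code)) @ rng.normal(size=(d_code, d))

# Linear autoencoder: encoder (wide -> narrow) and decoder (narrow -> wide).
W_enc = 0.1 * rng.normal(size=(d_code, d))
W_dec = 0.1 * rng.normal(size=(d, d_code))

lr = 1e-2
for _ in range(3000):
    code = X @ W_enc.T                     # middle-layer values x' for all inputs
    recon = code @ W_dec.T                 # reconstructions f_W(x)
    err = recon - X                        # from the loss 0.5 * ||x - f_W(x)||^2
    W_dec -= lr * err.T @ code / N         # gradient step on the decoder
    W_enc -= lr * (err @ W_dec).T @ X / N  # gradient step on the encoder
print("reconstruction MSE:", np.mean(err ** 2))

# Compression / decompression of a single input, as described above:
x = X[0]
x_code = W_enc @ x        # store the low-dimensional code x'
x_recon = W_dec @ x_code  # apply the remaining layer(s) to reconstruct x
```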

AUTOENCODERS

[Figure: a deep autoencoder. The encoder maps the input through layers of width 2000, 1000 and 500 (weights $W_1, W_2, W_3, W_4$) down to a 30-dimensional code layer; the decoder mirrors this with layers of width 500, 1000 and 2000, using the transposed weights $W_4^T, \ldots, W_1^T$. Illustration: K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press 2012]