Feedforward Neural Nets and Backpropagation

Similar documents
Lecture 4: Feed Forward Neural Networks

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA 1/ 21

Artificial Neural Networks

Data Mining Part 5. Prediction

Introduction to Artificial Neural Networks

CSC242: Intro to AI. Lecture 21

Neural networks. Chapter 20, Section 5 1

Neural networks. Chapter 19, Sections 1 5 1

Neural Networks Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav

Neural networks. Chapter 20. Chapter 20 1

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Lecture 4: Perceptrons and Multilayer Perceptrons

ARTIFICIAL INTELLIGENCE. Artificial Neural Networks

Artifical Neural Networks

Artificial Intelligence

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

Introduction Biologically Motivated Crude Model Backpropagation

EEE 241: Linear Systems

Introduction to Neural Networks

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

Neural Networks and Deep Learning

CS:4420 Artificial Intelligence

Revision: Neural Network

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau

Neural Networks: Introduction

COGS Q250 Fall Homework 7: Learning in Neural Networks Due: 9:00am, Friday 2nd November.

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Multilayer Neural Networks

Multilayer Perceptrons and Backpropagation

Lecture 5: Logistic Regression. Neural Networks

Sections 18.6 and 18.7 Artificial Neural Networks

Artificial Neural Networks. Q550: Models in Cognitive Science Lecture 5

Sections 18.6 and 18.7 Artificial Neural Networks

4. Multilayer Perceptrons

Neural Networks and the Back-propagation Algorithm

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011!

Supervised (BPL) verses Hybrid (RBF) Learning. By: Shahed Shahir

Artificial Neural Networks. Historical description

Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

18.6 Regression and Classification with Linear Models

Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

Neural Networks biological neuron artificial neuron 1

Artificial Neural Networks

Simple Neural Nets For Pattern Classification

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

Neural Networks and Fuzzy Logic Rajendra Dept.of CSE ASCET

COMP9444 Neural Networks and Deep Learning 2. Perceptrons. COMP9444 c Alan Blair, 2017

Lecture 7 Artificial neural networks: Supervised learning

Introduction To Artificial Neural Networks

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

Master Recherche IAC TC2: Apprentissage Statistique & Optimisation

SPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks

Artificial Neural Networks The Introduction

Neural Networks. Xiaojin Zhu Computer Sciences Department University of Wisconsin, Madison. slide 1

Unit III. A Survey of Neural Network Model

Multilayer Neural Networks

Artificial Neural Network

2015 Todd Neller. A.I.M.A. text figures 1995 Prentice Hall. Used by permission. Neural Networks. Todd W. Neller

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009

Neural Networks. Fundamentals of Neural Networks : Architectures, Algorithms and Applications. L, Fausett, 1994

CSC321 Lecture 5: Multilayer Perceptrons

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Artificial Neural Networks

Neural Nets Supervised learning

Chapter ML:VI. VI. Neural Networks. Perceptron Learning Gradient Descent Multilayer Perceptron Radial Basis Functions

Artificial Neural Networks. Edward Gatt

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

2018 EE448, Big Data Mining, Lecture 5. (Part II) Weinan Zhang Shanghai Jiao Tong University

Machine Learning (CSE 446): Neural Networks

Machine Learning. Neural Networks

Learning and Memory in Neural Networks

Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Unit 8: Introduction to neural networks. Perceptrons

y(x n, w) t n 2. (1)

AI Programming CS F-20 Neural Networks

Artificial Neural Networks. MGS Lecture 2

Logistic Regression & Neural Networks

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Artificial neural networks

Neural networks COMS 4771

Artificial Neural Networks (ANN) Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso

Artificial Neural Networks

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler

Neural networks (not in book)

Machine Learning Lecture 10

Artificial Neural Network

C4 Phenomenological Modeling - Regression & Neural Networks : Computational Modeling and Simulation Instructor: Linwei Wang

Course 395: Machine Learning - Lectures

Artificial Neural Network and Fuzzy Logic

Artificial Neuron (Perceptron)

Ch.8 Neural Networks

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Input layer. Weight matrix [ ] Output layer

Back-Propagation Algorithm. Perceptron Gradient Descent Multilayered neural network Back-Propagation More on Back-Propagation Examples

Artificial Neural Networks. Part 2

Lab 5: 16 th April Exercises on Neural Networks

CSC 411 Lecture 10: Neural Networks

Transcription:

Feedforward Neural Nets and Backpropagation Julie Nutini University of British Columbia MLRG September 28 th, 2016 1 / 23

Supervised Learning Roadmap Supervised Learning: Assume that we are given the features x i. Could also use basis functions or kernels. 2 / 23

Supervised Learning Roadmap Supervised Learning: Assume that we are given the features x i. Could also use basis functions or kernels. Unsupervised Learning: Learn a representation z i based on features x i. Also used for supervised learning: use z i as features. 2 / 23

Supervised Learning Roadmap Supervised Learning: Assume that we are given the features x i. Could also use basis functions or kernels. Unsupervised Learning: Learn a representation z i based on features x i. Also used for supervised learning: use z i as features. Supervised Learning: Learn features z i that are good for supervised learning. 2 / 23

Supervised Learning Roadmap Linear Model yi w1 w2 w3 wd xi1 xi2 xi3 xid 3 / 23

Supervised Learning Roadmap Change of Basis Linear Model yi w1 w2 w3 wk yi zi1 zi2 zi3 zik w1 w2 w3 wd xi1 xi2 xi3 xid Basis xi1 xi2 xi3 xid 3 / 23

Supervised Learning Roadmap Change of Basis Basis from Latent-Factor Model yi Linear Model yi w1 w2 w3 wk yi zi1 zi2 zi3 zik w1 w2 w3 wk zi1 zi2 zi3 zik w1 w2 w3 wd xi1 xi2 xi3 xid Basis w and W are trained separately zi1 zi2 zi3 zik W11 Wkd xi1 xi2 xi3 xid xi1 xi2 xi3 xid 3 / 23

Supervised Learning Roadmap Linear Model Change of Basis yi Basis from Latent-Factor Model yi Simultaneously Learn Features for Task and Regression Model w1 w2 w3 wk yi zi1 zi2 zi3 zik w1 w2 w3 wk zi1 zi2 zi3 zik yi w1 w2 w3 wk w1 w2 w3 wd xi1 xi2 xi3 xid Basis w and W are trained separately zi1 zi2 zi3 zik W11 Wkd zi1 zi2 zi3 zik xi1 xi2 xi3 xid W11 Wkd xi1 xi2 xi3 xid xi1 xi2 xi3 xid 3 / 23

Supervised Learning Roadmap Linear Model Change of Basis yi Basis from Latent-Factor Model yi Simultaneously Learn Features for Task and Regression Model w1 w2 w3 wk yi zi1 zi2 zi3 zik w1 w2 w3 wk zi1 zi2 zi3 zik yi w1 w2 w3 wk w1 w2 w3 wd xi1 xi2 xi3 xid Basis w and W are trained separately zi1 zi2 zi3 zik W11 Wkd zi1 zi2 zi3 zik xi1 xi2 xi3 xid W11 Wkd xi1 xi2 xi3 xid xi1 xi2 xi3 xid These are all examples of Feedforward Neural Networks. 3 / 23

Feed-forward Neural Network Information always moves one direction. No loops. Never goes backwards. Forms a directed acyclic graph. 4 / 23

Feed-forward Neural Network Information always moves one direction. No loops. Never goes backwards. Forms a directed acyclic graph. Each node receives input only from immediately preceding layer. Simplest type of artificial neural network. 4 / 23

Neural Networks - Historical Background 1943: McCulloch and Pitts proposed first computational model of neuron 1949: Hebb proposed the first learning rule 1958: Rosenblatt s work on perceptrons 1969: Minsky and Papert s paper exposed limits of theory 1970s: Decade of dormancy for neural networks 1980-90s: Neural network return (self-organization, back-propagation algorithms, etc.) 5 / 23

Model of Single Neuron McCulloch and Pitts (1943): integrate and fire model (no hidden layers) yi w1 w2 w3 wd xi1 xi2 xi3 xid Denote the d input values for sample i by x i1, x i2,..., x id. Each of the d inputs has a weight w 1, w 2,..., w d. 6 / 23

Model of Single Neuron McCulloch and Pitts (1943): integrate and fire model (no hidden layers) yi w1 w2 w3 wd xi1 xi2 xi3 xid Denote the d input values for sample i by x i1, x i2,..., x id. Each of the d inputs has a weight w 1, w 2,..., w d. Compute prediction as weighted sum, d ŷ i = w 1x i1 + w 2x i2 + + w d x id = w jx ij j=1 6 / 23

Model of Single Neuron McCulloch and Pitts (1943): integrate and fire model (no hidden layers) yi w1 w2 w3 wd xi1 xi2 xi3 xid Denote the d input values for sample i by x i1, x i2,..., x id. Each of the d inputs has a weight w 1, w 2,..., w d. Compute prediction as weighted sum, d ŷ i = w 1x i1 + w 2x i2 + + w d x id = w jx ij j=1 Use ŷ i in some loss function: 1 (yi ŷi)2 2 6 / 23

Incorporating Hidden Layers Consider more than one neuron: yi w1 w2 w3 wk zi1 zi2 zi3 zik W11 Wkd xi1 xi2 xi3 xid 7 / 23

Incorporating Hidden Layers Consider more than one neuron: yi w1 w2 w3 wk zi1 zi2 zi3 zik W11 Wkd xi1 xi2 xi3 xid Input to hidden layer: function h of features from latent-factor model: z i = h(w x i). Each neuron has directed connection to ALL neurons of a subsequent layer. 7 / 23

Incorporating Hidden Layers Consider more than one neuron: yi w1 w2 w3 wk zi1 zi2 zi3 zik W11 Wkd xi1 xi2 xi3 xid Input to hidden layer: function h of features from latent-factor model: z i = h(w x i). Each neuron has directed connection to ALL neurons of a subsequent layer. This function h is often called the activation function. Each unit/node applies an activation function. 7 / 23

Linear Activation Function A linear activation function has the form where a is called bias (intercept). h(w x i) = a + W x i, 8 / 23

Linear Activation Function A linear activation function has the form where a is called bias (intercept). h(w x i) = a + W x i, Example: linear regression with linear bias (linear-linear model) Representation: z i = h(w x i ) (from latent-factor model) Prediction: ŷ i = w T z i Loss: 1 2 (y i ŷ i ) 2 8 / 23

Linear Activation Function A linear activation function has the form where a is called bias (intercept). h(w x i) = a + W x i, Example: linear regression with linear bias (linear-linear model) Representation: z i = h(w x i ) (from latent-factor model) Prediction: ŷ i = w T z i Loss: 1 2 (y i ŷ i ) 2 To train this model, we solve: 1 n argmin (y i w T z i) 2 = 1 w IR k,w IR k d 2 2 i=1 }{{} linear regression with z i as features n (y i w T h(w x i)) 2 i=1 8 / 23

Linear Activation Function A linear activation function has the form where a is called bias (intercept). h(w x i) = a + W x i, Example: linear regression with linear bias (linear-linear model) Representation: z i = h(w x i ) (from latent-factor model) Prediction: ŷ i = w T z i Loss: 1 2 (y i ŷ i ) 2 To train this model, we solve: 1 n argmin (y i w T z i) 2 = 1 w IR k,w IR k d 2 2 i=1 }{{} linear regression with z i as features n (y i w T h(w x i)) 2 i=1 But this is just a linear model: w T (W x i) = (W T w) T x i = w T x i 8 / 23

Binary Activation Function To increase flexibility, something needs to be non-linear. 9 / 23

Binary Activation Function To increase flexibility, something needs to be non-linear. A Heaviside step function has the form 1 if v a h(v) = 0 otherwise where a is the threshold. 9 / 23

Binary Activation Function To increase flexibility, something needs to be non-linear. A Heaviside step function has the form 1 if v a h(v) = 0 otherwise where a is the threshold. Example: Let a = 0, This yields a binary z i = h(w x i). 9 / 23

Binary Activation Function To increase flexibility, something needs to be non-linear. A Heaviside step function has the form 1 if v a h(v) = 0 otherwise where a is the threshold. Example: Let a = 0, This yields a binary z i = h(w x i). W x i has a concept encoded by each of its 2 k possible signs. 9 / 23

Perceptrons Minsky and Papert (late 50s). Algorithm for supervised learning of binary classifiers. Decides whether input belongs to a specific class or not. 10 / 23

Perceptrons Minsky and Papert (late 50s). Algorithm for supervised learning of binary classifiers. Decides whether input belongs to a specific class or not. Uses binary activation function. Can learn AND, OR, NOT functions. 10 / 23

Perceptrons Minsky and Papert (late 50s). Algorithm for supervised learning of binary classifiers. Decides whether input belongs to a specific class or not. Uses binary activation function. Can learn AND, OR, NOT functions. Perceptrons only capable of learning linearly separable patterns. 10 / 23

Learning Algorithm for Perceptrons The perceptron learning rule is given as follows: 1 For each x i and desired output y i in training set, ( ( 1 Calculate output: ŷ (t) i = h w (t)) ) T xi. 11 / 23

Learning Algorithm for Perceptrons The perceptron learning rule is given as follows: 1 For each x i and desired output y i in training set, ( ( 1 Calculate output: ŷ (t) i = h w (t)) ) T xi. 2 Update the weights: w (t+1) j = w (t) j + (y i ŷ (t) i )x ji 1 j d. 11 / 23

Learning Algorithm for Perceptrons The perceptron learning rule is given as follows: 1 For each x i and desired output y i in training set, ( ( 1 Calculate output: ŷ (t) i = h w (t)) ) T xi. 2 Update the weights: w (t+1) j = w (t) j + (y i ŷ (t) i )x ji 1 j d. 2 Continue these steps until 1 n 2 i=1 y i ŷ (t) i < γ (threshold). 11 / 23

Learning Algorithm for Perceptrons The perceptron learning rule is given as follows: 1 For each x i and desired output y i in training set, ( ( 1 Calculate output: ŷ (t) i = h w (t)) ) T xi. 2 Update the weights: w (t+1) j = w (t) j + (y i ŷ (t) i )x ji 1 j d. 2 Continue these steps until 1 n 2 i=1 y i ŷ (t) i < γ (threshold). If training set is linearly separable, then guaranteed to converge. (Rosenblatt, 1962). 11 / 23

Learning Algorithm for Perceptrons The perceptron learning rule is given as follows: 1 For each x i and desired output y i in training set, ( ( 1 Calculate output: ŷ (t) i = h w (t)) ) T xi. 2 Update the weights: w (t+1) j = w (t) j + (y i ŷ (t) i )x ji 1 j d. 2 Continue these steps until 1 n 2 i=1 y i ŷ (t) i < γ (threshold). If training set is linearly separable, then guaranteed to converge. (Rosenblatt, 1962). An upper bound exists on number of times weights adjusted in training. 11 / 23

Learning Algorithm for Perceptrons The perceptron learning rule is given as follows: 1 For each x i and desired output y i in training set, ( ( 1 Calculate output: ŷ (t) i = h w (t)) ) T xi. 2 Update the weights: w (t+1) j = w (t) j + (y i ŷ (t) i )x ji 1 j d. 2 Continue these steps until 1 n 2 i=1 y i ŷ (t) i < γ (threshold). If training set is linearly separable, then guaranteed to converge. (Rosenblatt, 1962). An upper bound exists on number of times weights adjusted in training. Can also use gradient descent if function is differentiable. 11 / 23

Non-Linear Continuous Activation Function What about a smooth approximation to the step function? 12 / 23

Non-Linear Continuous Activation Function What about a smooth approximation to the step function? A sigmoid function has the form h(v) = 1 1 + e v 12 / 23

Non-Linear Continuous Activation Function What about a smooth approximation to the step function? A sigmoid function has the form h(v) = 1 1 + e v Applying the sigmoid function element-wise, 1 z ic =. 1 + e W c T x i This is called a multi-layer perceptron or neural network. 12 / 23

Why Neural Network? A typical neuron. Neuron has many dendrites. Each dendrite takes input. Neuron has a single axon. Axon sends output. 13 / 23

Why Neural Network? A typical neuron. Neuron has many dendrites. Each dendrite takes input. Neuron has a single axon. Axon sends output. With the right input to dendrites: Action potential along axon. 13 / 23

Why Neural Network? First neuron: Each dendrite: x i1, x i1,..., x id Nucleus: computes Wc T x i 1 Axon: sends binary signal 1+e W c T x i Axon terminal: neurotransmitter, synapse Second neuron: Each dendrite receives: z i1, z i2,..., z ik Nucleus: computes w T z i Axon: sends signal ŷ i 14 / 23

Why Neural Network? First neuron: Each dendrite: x i1, x i1,..., x id Nucleus: computes Wc T x i 1 Axon: sends binary signal 1+e W c T x i Axon terminal: neurotransmitter, synapse Second neuron: Each dendrite receives: z i1, z i2,..., z ik Nucleus: computes w T z i Axon: sends signal ŷ i Describes a neural network with one hidden layer (2 neurons). 14 / 23

Why Neural Network? First neuron: Each dendrite: x i1, x i1,..., x id Nucleus: computes Wc T x i 1 Axon: sends binary signal 1+e W c T x i Axon terminal: neurotransmitter, synapse Second neuron: Each dendrite receives: z i1, z i2,..., z ik Nucleus: computes w T z i Axon: sends signal ŷ i Describes a neural network with one hidden layer (2 neurons). Human brain has approximately 10 11 neurons. 14 / 23

Why Neural Network? First neuron: Each dendrite: x i1, x i1,..., x id Nucleus: computes Wc T x i 1 Axon: sends binary signal 1+e W c T x i Axon terminal: neurotransmitter, synapse Second neuron: Each dendrite receives: z i1, z i2,..., z ik Nucleus: computes w T z i Axon: sends signal ŷ i Describes a neural network with one hidden layer (2 neurons). Human brain has approximately 10 11 neurons. Each neuron connected to approximately 10 4 neurons. 14 / 23

Why Neural Network? First neuron: Each dendrite: x i1, x i1,..., x id Nucleus: computes Wc T x i 1 Axon: sends binary signal 1+e W c T x i Axon terminal: neurotransmitter, synapse Second neuron: Each dendrite receives: z i1, z i2,..., z ik Nucleus: computes w T z i Axon: sends signal ŷ i Describes a neural network with one hidden layer (2 neurons). Human brain has approximately 10 11 neurons. Each neuron connected to approximately 10 4 neurons.... 14 / 23

Artificial Neural Nets vs. Real Neural Nets yi Artificial neural network: x i is measuring of the world. z i is internal representation of the world. ŷ i as output of neuron for classification/regression. W11 w1 w2 w3 wk zi1 zi2 zi3 zik Wkd xi1 xi2 xi3 xid 15 / 23

Artificial Neural Nets vs. Real Neural Nets yi Artificial neural network: x i is measuring of the world. z i is internal representation of the world. ŷ i as output of neuron for classification/regression. W11 w1 w2 w3 wk zi1 zi2 zi3 zik Wkd xi1 xi2 xi3 xid Real neural networks are more complicated: Timing of action potentials seems to be important. Rate coding: frequency of action potentials simulates continuous output. 15 / 23

Artificial Neural Nets vs. Real Neural Nets yi Artificial neural network: x i is measuring of the world. z i is internal representation of the world. ŷ i as output of neuron for classification/regression. W11 w1 w2 w3 wk zi1 zi2 zi3 zik Wkd xi1 xi2 xi3 xid Real neural networks are more complicated: Timing of action potentials seems to be important. Rate coding: frequency of action potentials simulates continuous output. Neural networks don t reflect sparsity of action potentials. 15 / 23

Artificial Neural Nets vs. Real Neural Nets yi Artificial neural network: x i is measuring of the world. z i is internal representation of the world. ŷ i as output of neuron for classification/regression. W11 w1 w2 w3 wk zi1 zi2 zi3 zik Wkd xi1 xi2 xi3 xid Real neural networks are more complicated: Timing of action potentials seems to be important. Rate coding: frequency of action potentials simulates continuous output. Neural networks don t reflect sparsity of action potentials. How much computation is done inside neuron? 15 / 23

Artificial Neural Nets vs. Real Neural Nets yi Artificial neural network: x i is measuring of the world. z i is internal representation of the world. ŷ i as output of neuron for classification/regression. W11 w1 w2 w3 wk zi1 zi2 zi3 zik Wkd xi1 xi2 xi3 xid Real neural networks are more complicated: Timing of action potentials seems to be important. Rate coding: frequency of action potentials simulates continuous output. Neural networks don t reflect sparsity of action potentials. How much computation is done inside neuron? Brain is highly organized (e.g., substructures and cortical columns). Connection structure changes. 15 / 23

Artificial Neural Nets vs. Real Neural Nets yi Artificial neural network: x i is measuring of the world. z i is internal representation of the world. ŷ i as output of neuron for classification/regression. W11 w1 w2 w3 wk zi1 zi2 zi3 zik Wkd xi1 xi2 xi3 xid Real neural networks are more complicated: Timing of action potentials seems to be important. Rate coding: frequency of action potentials simulates continuous output. Neural networks don t reflect sparsity of action potentials. How much computation is done inside neuron? Brain is highly organized (e.g., substructures and cortical columns). Connection structure changes. Different types of neurotransmitters. 15 / 23

The Universal Approximation Theorem One of the first versions proven by George Cybenko (1989). 16 / 23

The Universal Approximation Theorem One of the first versions proven by George Cybenko (1989). For feedforward neural networks, the universal approximation theorem:... claims that every continuous function defined on a compact set can be arbitrarily well approximated by a neural network with one hidden layer. 16 / 23

The Universal Approximation Theorem One of the first versions proven by George Cybenko (1989). For feedforward neural networks, the universal approximation theorem:... claims that every continuous function defined on a compact set can be arbitrarily well approximated by a neural network with one hidden layer. Thus, a simple neural network capable of representing a wide variety of functions when given appropriate parameters. 16 / 23

The Universal Approximation Theorem One of the first versions proven by George Cybenko (1989). For feedforward neural networks, the universal approximation theorem:... claims that every continuous function defined on a compact set can be arbitrarily well approximated by a neural network with one hidden layer. Thus, a simple neural network capable of representing a wide variety of functions when given appropriate parameters. Algorithmic learnability of those parameters? 16 / 23

Artificial Neural Networks With squared loss, our objective function is: 1 argmin w IR k,w IR k d 2 n (y i w T h(w x i)) 2 i=1 17 / 23

Artificial Neural Networks With squared loss, our objective function is: 1 argmin w IR k,w IR k d 2 n (y i w T h(w x i)) 2 i=1 Usual training procedure: stochastic gradient. Compute gradient of random example i, update w and W. Computing the gradient is known as backpropagation. 17 / 23

Artificial Neural Networks With squared loss, our objective function is: 1 argmin w IR k,w IR k d 2 n (y i w T h(w x i)) 2 i=1 Usual training procedure: stochastic gradient. Compute gradient of random example i, update w and W. Computing the gradient is known as backpropagation. Adding regularization to w and/or W is known as weight decay. 17 / 23

Backpropagation Output values ŷ i compared to correct values y i to compute error. 18 / 23

Backpropagation Output values ŷ i compared to correct values y i to compute error. Error is fed back through the network (backpropagation of errors). 18 / 23

Backpropagation Output values ŷ i compared to correct values y i to compute error. Error is fed back through the network (backpropagation of errors). Weights are adjusted to reduce value of error function by small amount. 18 / 23

Backpropagation Output values ŷ i compared to correct values y i to compute error. Error is fed back through the network (backpropagation of errors). Weights are adjusted to reduce value of error function by small amount. After number of cycles, network converges to state where error is small. 18 / 23

Backpropagation Output values ŷ i compared to correct values y i to compute error. Error is fed back through the network (backpropagation of errors). Weights are adjusted to reduce value of error function by small amount. After number of cycles, network converges to state where error is small. To adjust weights, use non-linear optimization method gradient descent. Backpropagation can only be applied to differentiable activation functions. 18 / 23

Backpropagation Algorithm repeat until some stopping criterion is satisfied 19 / 23

Backpropagation Algorithm repeat for each weight w i,j in the network do w i,j a small random number for each example (x, y) do until some stopping criterion is satisfied 19 / 23

Backpropagation Algorithm repeat for each weight w i,j in the network do w i,j a small random number for each example (x, y) do /*Propagate the inputs forward to compute the outputs*/ /*Propagate the deltas backwards from output layer to input layer*/ /*Update every weight in network using deltas*/ until some stopping criterion is satisfied 19 / 23

Backpropagation Algorithm repeat for each weight w i,j in the network do w i,j a small random number for each example (x, y) do /*Propagate the inputs forward to compute the outputs*/ for each node i in the input layer do a i x i for l = 2 to L do for each node j in layer l do v j i wi,jai a j h(v j) /*Propagate the deltas backwards from output layer to input layer*/ /*Update every weight in network using deltas*/ until some stopping criterion is satisfied 19 / 23

Backpropagation Algorithm repeat for each weight w i,j in the network do w i,j a small random number for each example (x, y) do /*Propagate the inputs forward to compute the outputs*/ for each node i in the input layer do a i x i for l = 2 to L do for each node j in layer l do v j i wi,jai a j h(v j) /*Propagate the deltas backwards from output layer to input layer*/ for each node j in output layer do [j] h (v j)(y j a j) for l = L 1 to 1 do for each node i in layer l do [i] h (v i) j wi,j [j] /*Update every weight in network using deltas*/ until some stopping criterion is satisfied 19 / 23

Backpropagation Algorithm repeat for each weight w i,j in the network do w i,j a small random number for each example (x, y) do /*Propagate the inputs forward to compute the outputs*/ for each node i in the input layer do a i x i for l = 2 to L do for each node j in layer l do v j i wi,jai a j h(v j) /*Propagate the deltas backwards from output layer to input layer*/ for each node j in output layer do [j] h (v j)(y j a j) for l = L 1 to 1 do for each node i in layer l do [i] h (v i) j wi,j [j] /*Update every weight in network using deltas*/ for each weight w i,j in network do w i,j w i,j + αa i [j] until some stopping criterion is satisfied 19 / 23

Backpropagation Derivations of the back-propagation equations are very similar to the gradient calculation for logistic regression. Uses the multivariable chain rule more than once. 20 / 23

Backpropagation Derivations of the back-propagation equations are very similar to the gradient calculation for logistic regression. Uses the multivariable chain rule more than once. Consider the loss for a single node: f(w, W ) = 1 k 2 (yi w jh(w jx i)) 2. j=1 20 / 23

Backpropagation Derivations of the back-propagation equations are very similar to the gradient calculation for logistic regression. Uses the multivariable chain rule more than once. Consider the loss for a single node: f(w, W ) = 1 k 2 (yi w jh(w jx i)) 2. j=1 Derivatives with respect to w j (weights connect hidden to output layer): f(w, W ) w j = (y i k w jh(w jx i)) h(w jx i) j=1 20 / 23

Backpropagation Derivations of the back-propagation equations are very similar to the gradient calculation for logistic regression. Uses the multivariable chain rule more than once. Consider the loss for a single node: f(w, W ) = 1 k 2 (yi w jh(w jx i)) 2. j=1 Derivatives with respect to w j (weights connect hidden to output layer): f(w, W ) w j = (y i k w jh(w jx i)) h(w jx i) j=1 Derivative with respect to W ij (weights connect input to hidden layer): f(w, W ) W ij = (y i k w jh(w jx i)) w jh (W jx i)x ij j=1 20 / 23

Backpropagation Notice repeated calculations in gradients: f(w, W ) w j f(w, W ) W ij = (y i = (y i k w jh(w jx)) h(w jx) j=1 k w jh(w jx)) w jh (W jx)x i j=1 21 / 23

Backpropagation Notice repeated calculations in gradients: f(w, W ) w j f(w, W ) W ij = (y i = (y i k w jh(w jx)) h(w jx) j=1 k w jh(w jx)) w jh (W jx)x i j=1 Same value for f(w,w ) w j with all j and f(w,w ) W ij for all i and j. Same value for f(w,w ) W ij with all i. 21 / 23

Disadvantages of Neural Networks Cannot be used if very little available data. 22 / 23

Disadvantages of Neural Networks Cannot be used if very little available data. Many free parameters (e.g., number of hidden nodes, learning rate, minimal error, etc.) 22 / 23

Disadvantages of Neural Networks Cannot be used if very little available data. Many free parameters (e.g., number of hidden nodes, learning rate, minimal error, etc.) Not great for precision calculations. 22 / 23

Disadvantages of Neural Networks Cannot be used if very little available data. Many free parameters (e.g., number of hidden nodes, learning rate, minimal error, etc.) Not great for precision calculations. Do not provide explanations (unlike slopes in linear models that represent correlations). 22 / 23

Disadvantages of Neural Networks Cannot be used if very little available data. Many free parameters (e.g., number of hidden nodes, learning rate, minimal error, etc.) Not great for precision calculations. Do not provide explanations (unlike slopes in linear models that represent correlations). Network overfits training data, fails to capture true statistical process behind the data. Early stopping can be used to avoid this. 22 / 23

Disadvantages of Neural Networks Cannot be used if very little available data. Many free parameters (e.g., number of hidden nodes, learning rate, minimal error, etc.) Not great for precision calculations. Do not provide explanations (unlike slopes in linear models that represent correlations). Network overfits training data, fails to capture true statistical process behind the data. Early stopping can be used to avoid this. Speed of convergence. Local minimia. 22 / 23

Advantages of Neural Networks Easy to maintain. Versatile. Used on problems for which analytical methods do not yet exist. 23 / 23

Advantages of Neural Networks Easy to maintain. Versatile. Used on problems for which analytical methods do not yet exist. Models non-linear dependencies. Quickly work out patterns, even with noisy data. 23 / 23

Advantages of Neural Networks Easy to maintain. Versatile. Used on problems for which analytical methods do not yet exist. Models non-linear dependencies. Quickly work out patterns, even with noisy data. Always outputs answer, even when input is incomplete. 23 / 23

Advantages of Neural Networks Easy to maintain. Versatile. Used on problems for which analytical methods do not yet exist. Models non-linear dependencies. Quickly work out patterns, even with noisy data. Always outputs answer, even when input is incomplete. 23 / 23