Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks


Intelligent Systems: Reasoning and Recognition
James L. Crowley
MOSIG M1, Winter Semester 2018
Lesson 7, 1 March 2018

Artificial Neural Networks

Outline:
Notation
Introduction
Key Equations
Artificial Neural Networks
The Artificial Neuron
The Neural Network Model
Backpropagation
Summary of Backpropagation

Notation

$x_d$   A feature. An observed or measured value.
$\vec{X}$   A vector of $D$ features.
$D$   The number of dimensions of the vector $\vec{X}$.
$\{\vec{X}_m\}, \{y_m\}$   Training samples for learning.
$M$   The number of training samples.
$a_j^{(l)}$   The activation output of the $j$-th neuron of the $l$-th layer.
$w_{ij}^{(l)}$   The weight from unit $i$ of layer $l-1$ to unit $j$ of layer $l$.
$b_j^{(l)}$   The bias for unit $j$ of layer $l$.
$\eta$   A learning rate. Typically very small (around 0.01). Can be variable.
$L$   The number of layers in the network.
$\delta_m^{(L+1)} = a_m^{(L)} - y_m$   Output error of the network for the $m$-th training sample.
$\delta_{j,m}^{(l)}$   Error for the $j$-th neuron of layer $l$, for the $m$-th training sample.
$\Delta w_{ij,m}^{(l)} = a_i^{(l-1)} \delta_{j,m}^{(l)}$   Update for the weight from unit $i$ of layer $l-1$ to unit $j$ of layer $l$.
$\Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$   Update for the bias for unit $j$ of layer $l$.

Introduction

Key Equations

Feed forward from layer $i$ to $j$:
$a_j^{(l)} = f\left( \sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)} \right)$

Feed forward from layer $j$ to $k$:
$a_k^{(l+1)} = f\left( \sum_{j=1}^{N^{(l)}} w_{jk}^{(l+1)} a_j^{(l)} + b_k^{(l+1)} \right)$

Back propagation from layer $j$ to $i$:
$\delta_{i,m}^{(l-1)} = \frac{\partial f(z_i^{(l-1)})}{\partial z_i^{(l-1)}} \sum_{j=1}^{N^{(l)}} w_{ij}^{(l)} \delta_{j,m}^{(l)}$

Back propagation from layer $k$ to $j$:
$\delta_{j,m}^{(l)} = \frac{\partial f(z_j^{(l)})}{\partial z_j^{(l)}} \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)} \delta_{k,m}^{(l+1)}$

Weight and bias corrections for layer $j$:
$\Delta w_{ij,m}^{(l)} = a_i^{(l-1)} \delta_{j,m}^{(l)}$
$\Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$

Network update formulas:
$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta \, \Delta w_{ij,m}^{(l)}$
$b_j^{(l)} \leftarrow b_j^{(l)} - \eta \, \Delta b_{j,m}^{(l)}$

Artificial Neural Networks

Artificial Neural Networks, also referred to as Multi-layer Perceptrons, are computational structures composed of layers of neural units. Each neural unit computes a weighted sum of its inputs, followed by a non-linear decision function.

Note that the term "neural" is misleading. The computational mechanism of a neural network is only loosely inspired by neural biology. Neural networks do NOT implement the same learning and recognition algorithms as biological systems.

The approach was first proposed by Warren McCulloch and Walter Pitts in 1943 as a possible universal computational model. During the 1950s, Frank Rosenblatt developed the idea to provide a trainable machine for pattern recognition, called a Perceptron. In 1973, Stephen Grossberg showed that a two-layer perceptron could overcome the problems raised by Minsky and Papert, and solve many problems that plagued symbolic AI. In 1975, Paul Werbos developed an algorithm referred to as Back-Propagation, which uses gradient descent to learn the parameters for perceptrons from classification errors with training data.

During the 1980s, neural networks went through a period of popularity, with researchers showing that networks could be trained to provide simple solutions to problems such as recognizing handwritten characters, recognizing spoken words, and steering a car on a highway. However, these results were overtaken by more mathematically sound approaches for statistical pattern recognition, such as support vector machines and boosted learning.

In 1998, Yann LeCun showed that convolutional networks composed of many layers could outperform other approaches for recognition problems. Unfortunately, such networks required extremely large amounts of data and computation. Around 2010, with the emergence of cloud computing combined with planetary-scale data, training and using convolutional networks became practical. Since 2012, deep networks have outperformed other approaches for recognition tasks common to computer vision, speech and robotics. A rapidly growing research community currently seeks to extend their application beyond recognition to the generation of speech and robot actions. Notably, just about any algorithm can be used to train a network, often yielding a solution that executes faster.

The Artificial Neuron

The simplest possible neural network is composed of a single neuron. A neuron is a computational unit that integrates information from a vector of features, $\vec{X}$, to compute the likelihood of a hypothesis $h_{w,b}$:

$a = h_{w,b}(\vec{X})$

The neuron is composed of a weighted sum of input values

$z = w_1 x_1 + w_2 x_2 + \dots + w_D x_D + b$

followed by a non-linear activation function $f(z)$:

$a = h_{w,b}(\vec{X}) = f(\vec{w}^T \vec{X} + b)$

Many different activation functions may be used.

A popular choice of activation function is the sigmoid:
$f(z) = \dfrac{1}{1 + e^{-z}}$

This function is useful because its derivative is:
$\dfrac{df(z)}{dz} = f(z)\,(1 - f(z))$

This gives a decision function: if $h_{w,b}(\vec{X}) > 0.5$ then POSITIVE, else NEGATIVE.

Other popular decision functions include the hyperbolic tangent and the softmax.

The hyperbolic tangent:
$f(z) = \tanh(z) = \dfrac{e^z - e^{-z}}{e^z + e^{-z}}$

The hyperbolic tangent is a rescaled form of the sigmoid, ranging over $[-1, 1]$.

We could use the step function:
$f(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$

or the sgn function:
$f(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$

However, these are not differentiable, and we need a derivative for backpropagation.

(Figure: plot of the sigmoid (red), the hyperbolic tangent (blue) and the step function (green).)

The softmax function is often used for multi-class networks. For $K$ classes:
$f(z_k) = \dfrac{e^{z_k}}{\sum_{k'=1}^{K} e^{z_{k'}}}$

The rectified linear unit (ReLU) is popular for deep learning because of its trivial derivative:
ReLU: $f(z) = \max(0, z)$
While the derivative of ReLU is discontinuous at $z = 0$, for $z > 0$: $\dfrac{df(z)}{dz} = 1$

Note that the choice of decision function will determine the target variable $y$ for supervised learning.
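To make the single-neuron model concrete, here is a minimal sketch in Python/NumPy of the weighted sum and the activation functions listed above. The function names (`sigmoid`, `neuron_output`, ...) and the example values are illustrative choices, not part of the lecture notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """df/dz = f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    """Rectified linear unit: f(z) = max(0, z)."""
    return np.maximum(0.0, z)

def softmax(z):
    """Softmax over a vector of K scores (shifted for numerical stability)."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def neuron_output(x, w, b):
    """A single artificial neuron: a = f(w^T x + b), with a sigmoid activation."""
    return sigmoid(np.dot(w, x) + b)

# Example: a neuron with D = 3 input features and the 0.5 decision threshold.
x = np.array([1.0, 0.5, -0.2])
w = np.array([0.4, -0.6, 0.9])
b = 0.1
a = neuron_output(x, w, b)
print("activation:", a, "decision:", "POSITIVE" if a > 0.5 else "NEGATIVE")
```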

The Neural Network Model

A neural network is a multi-layer assembly of neurons. For example, this is a 2-layer network:

(Figure: a 2-layer network with three inputs $X_1, X_2, X_3$, a hidden layer of three units, a single output unit, and bias units labeled +1.)

The circles labeled +1 are the bias terms. The circles on the left are the input terms. Some authors, notably in the Stanford tutorials, refer to this as Level 1. We will NOT refer to this as a level or, if necessary, will call it level $L = 0$. The rightmost circle is the output layer, also called $L$. The circles in the middle are referred to as a hidden layer. In this example there is a single hidden layer and the total number of layers is $L = 2$. The parameters carry a superscript referring to their layer.

We will use the following notation:

$L$   The number of layers (layers of non-linear activations).
$l$   The layer index, ranging from 0 (the input layer) to $L$ (the output layer).
$N^{(l)}$   The number of units in layer $l$. $N^{(0)} = D$.
$a_j^{(l)}$   The activation output of the $j$-th neuron of the $l$-th layer.
$w_{ij}^{(l)}$   The weight from unit $i$ of layer $l-1$ to unit $j$ of layer $l$.
$b_j^{(l)}$   The bias term for the $j$-th unit of the $l$-th layer.
$f(z)$   A non-linear activation function, such as a sigmoid, tanh, or softmax.

For example, $a_1^{(2)}$ is the activation output of the first neuron of the second layer, and $w_{13}^{(2)}$ is the weight from neuron 1 of the first layer to neuron 3 of the second layer.

The above network would be described by:

$a_1^{(1)} = f(w_{11}^{(1)} X_1 + w_{21}^{(1)} X_2 + w_{31}^{(1)} X_3 + b_1^{(1)})$
$a_2^{(1)} = f(w_{12}^{(1)} X_1 + w_{22}^{(1)} X_2 + w_{32}^{(1)} X_3 + b_2^{(1)})$
$a_3^{(1)} = f(w_{13}^{(1)} X_1 + w_{23}^{(1)} X_2 + w_{33}^{(1)} X_3 + b_3^{(1)})$
$h_{w,b}(\vec{X}) = a_1^{(2)} = f(w_{11}^{(2)} a_1^{(1)} + w_{21}^{(2)} a_2^{(1)} + w_{31}^{(2)} a_3^{(1)} + b_1^{(2)})$
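As a concrete illustration of these equations, here is a minimal sketch of the forward pass of this 2-layer network, assuming NumPy (the name `two_layer_network` and the random example parameters are illustrative choices): three inputs, a hidden layer of three units, and a single sigmoid output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_network(X, W1, b1, W2, b2):
    """Forward pass of the example network: h(X) = f(W2 f(W1 X + b1) + b2).

    W1 is 3x3 (hidden units x inputs), b1 has 3 entries;
    W2 is 1x3 (output units x hidden units), b2 has 1 entry.
    """
    a1 = sigmoid(W1 @ X + b1)    # hidden layer activations a_j^(1)
    a2 = sigmoid(W2 @ a1 + b2)   # output activation a_1^(2) = h_{w,b}(X)
    return a2

# Example with small random parameters near 0.
rng = np.random.default_rng(0)
X = np.array([0.2, -0.5, 1.0])
W1, b1 = 0.01 * rng.standard_normal((3, 3)), np.zeros(3)
W2, b2 = 0.01 * rng.standard_normal((1, 3)), np.zeros(1)
print(two_layer_network(X, W1, b1, W2, b2))
```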

This can be generalized to multiple layers. For example:

(Figure: a multi-layer network. $\vec{h}(\vec{x})$ is the vector of network outputs, one for each class.)

Each unit is defined as follows:

(Figure: a single unit, showing its weighted inputs, bias, and activation output.)

The notation for a multi-layer network is:

$\vec{a}^{(0)} = \vec{X}$ is the input layer, with $a_i^{(0)} = x_i$.
$l$ is the current layer under discussion.
$N^{(l)}$ is the number of activation units in layer $l$, with $N^{(0)} = D$.
$i, j, k$ are unit indices for layers $l-1$, $l$ and $l+1$: $i \to j \to k$.
$w_{ij}^{(l)}$ is the weight from unit $i$ of layer $l-1$ feeding to unit $j$ of layer $l$.
$a_j^{(l)}$ is the activation output of the $j$-th unit of layer $l$.
$b_j^{(l)}$ is the bias term feeding to unit $j$ of layer $l$.
$z_j^{(l)} = \sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$ is the weighted input to the $j$-th unit of layer $l$.
$f(z)$ is a non-linear decision function, such as a sigmoid, tanh, or softmax.
$a_j^{(l)} = f(z_j^{(l)})$ is the activation output of the $j$-th unit of layer $l$.

For layer $l$ this gives:

$z_j^{(l)} = \sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$        $a_j^{(l)} = f\left( \sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)} \right)$

and then for layer $l+1$:

$z_k^{(l+1)} = \sum_{j=1}^{N^{(l)}} w_{jk}^{(l+1)} a_j^{(l)} + b_k^{(l+1)}$        $a_k^{(l+1)} = f\left( \sum_{j=1}^{N^{(l)}} w_{jk}^{(l+1)} a_j^{(l)} + b_k^{(l+1)} \right)$
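The per-unit sums above translate directly into nested loops. The following sketch (plain Python; the name `layer_forward_sums` and the example values are illustrative, not from the notes) computes $z_j^{(l)}$ and $a_j^{(l)}$ for one layer exactly as written, before any vectorization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward_sums(a_prev, w, b):
    """Compute z_j = sum_i w[i][j] * a_prev[i] + b[j] and a_j = f(z_j) for one layer.

    a_prev: activations of layer l-1 (length N^(l-1))
    w:      weights indexed w[i][j] (unit i of layer l-1 -> unit j of layer l)
    b:      biases of layer l (length N^(l))
    """
    n_prev, n_curr = len(a_prev), len(b)
    z = [0.0] * n_curr
    a = [0.0] * n_curr
    for j in range(n_curr):
        z[j] = b[j]
        for i in range(n_prev):
            z[j] += w[i][j] * a_prev[i]
        a[j] = sigmoid(z[j])
    return z, a

# Example: 3 inputs feeding a layer of 2 units.
z, a = layer_forward_sums([1.0, 0.5, -0.2],
                          [[0.1, -0.3], [0.2, 0.4], [-0.5, 0.6]],
                          [0.0, 0.1])
print(z, a)
```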

It can be more convenient to represent this using vectors:

$\vec{z}^{(l)} = \begin{pmatrix} z_1^{(l)} \\ \vdots \\ z_{N^{(l)}}^{(l)} \end{pmatrix}$        $\vec{a}^{(l)} = \begin{pmatrix} a_1^{(l)} \\ \vdots \\ a_{N^{(l)}}^{(l)} \end{pmatrix}$

and to write the weights and biases at each level $l$ as an $N^{(l)} \times N^{(l-1)}$ matrix and an $N^{(l)}$ vector:

$W^{(l)} = \begin{pmatrix} w_{11}^{(l)} & \cdots & w_{1N^{(l-1)}}^{(l)} \\ \vdots & \ddots & \vdots \\ w_{N^{(l)}1}^{(l)} & \cdots & w_{N^{(l)}N^{(l-1)}}^{(l)} \end{pmatrix}$        $\vec{b}^{(l)} = \begin{pmatrix} b_1^{(l)} \\ \vdots \\ b_{N^{(l)}}^{(l)} \end{pmatrix}$

(Note: to respect matrix notation, we have reversed the order of $i$ and $j$ in the subscripts.)

We can see that the weights form a 3rd-order tensor (a vector of matrices), with one matrix for each layer, and the biases form a matrix (a vector of vectors), with one vector for each layer. Then

$\vec{z}^{(l)} = W^{(l)} \vec{a}^{(l-1)} + \vec{b}^{(l)}$        and        $\vec{a}^{(l)} = f(\vec{z}^{(l)}) = f(W^{(l)} \vec{a}^{(l-1)} + \vec{b}^{(l)})$

We can assemble the set of matrices $W^{(l)}$ into a 3rd-order tensor (a vector of matrices) $\mathbb{W}$, and represent $\vec{a}^{(l)}$, $\vec{z}^{(l)}$ and $\vec{b}^{(l)}$ as matrices (vectors of vectors) $A$, $Z$, $B$.

So how do we learn the weights $\mathbb{W}$ and biases $B$? We could train a 2-class detector from a labeled training set $\{\vec{x}_m\}$, $\{y_m\}$ using gradient descent. For more than two layers, we will need to use the more general back-propagation algorithm.

Backpropagation

Back-propagation adjusts the network weights $w_{ij}^{(l)}$ and biases $b_j^{(l)}$ so as to minimize an error function between the network output $h(\vec{x}_m; \mathbb{W}, B) = \vec{a}_m^{(L)}$ and the target value $\vec{y}_m$ for the $M$ training samples $\{\vec{x}_m\}$, $\{y_m\}$.
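Before deriving the corrections, it helps to see the vectorized forward pass in code, since back-propagation will reuse the weighted inputs $\vec{z}^{(l)}$ and activations $\vec{a}^{(l)}$ it produces. Below is a minimal sketch assuming NumPy; the name `forward_pass` and the list-of-matrices representation are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights, biases):
    """Propagate input x through all layers.

    weights[l] is the N^(l) x N^(l-1) matrix W^(l); biases[l] is the vector b^(l).
    Returns the weighted inputs z^(l) and the activations a^(l) (with a[0] = x),
    which are exactly what back-propagation will need.
    """
    a = [x]
    zs = []
    for W, b in zip(weights, biases):
        z = W @ a[-1] + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        zs.append(z)
        a.append(sigmoid(z))       # a^(l) = f(z^(l))
    return zs, a

# Example: D = 4 inputs, a hidden layer of 3 units, a single output (L = 2).
rng = np.random.default_rng(1)
weights = [0.01 * rng.standard_normal((3, 4)), 0.01 * rng.standard_normal((1, 3))]
biases = [np.zeros(3), np.zeros(1)]
zs, a = forward_pass(rng.standard_normal(4), weights, biases)
print("network output h(x) =", a[-1])
```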

This is an iterative algorithm that propagates an error term back through the hidden layers and computes a correction for the weights at each layer so as to minimize the error term. This raises two questions:

1) How do we initialize the weights?
2) How do we compute the error term for hidden layers?

1) How do we initialize the weights?

A natural answer to the first question is to initialize the weights to 0. By experience, this causes problems: if the parameters all start with identical values, then the algorithm can end up learning the same value for all parameters. To avoid this, we initialize the parameters with small random values near 0, for example computed with a normal density with variance $\varepsilon$ (typically 0.01):

$\forall_{i,j,l}: \; w_{ij}^{(l)} = \mathcal{N}(X; 0, \varepsilon)$        and        $\forall_{j,l}: \; b_j^{(l)} = \mathcal{N}(X; 0, \varepsilon)$

where $\mathcal{N}$ is a sample from a normal density.

An even better solution is provided by Xavier Glorot's technique (see the course web site on Xavier normalization). However, that solution is too complex for today's lecture.

2) How do we compute the error term?

Back-propagation propagates the error term back through the layers, using the weights. We will present this for individual training samples. The algorithm can easily be generalized to learning from sets of training samples (batch mode).

Given a training sample $\vec{x}_m$, we first propagate $\vec{x}_m$ through the $L$ layers of the network (forward propagation) to obtain a hypothesis $h(\vec{x}_m; \mathbb{W}, B) = \vec{a}_m^{(L)}$. We then compute an error term. In the case of a multi-class network, this is a vector with $K$ components, one for each hypothesis, and the target $\vec{y}_m$ would be an indicator vector with one component for each possible class:

$\vec{\delta}_m^{(L+1)} = \vec{a}_m^{(L)} - \vec{y}_m$        or, for each class $k$:        $\delta_{k,m}^{(L+1)} = a_{k,m}^{(L)} - y_{k,m}$
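Here is a minimal sketch of this initialization, assuming NumPy (the function name `init_network` and the layer-size list are illustrative choices): each weight and bias is drawn from a normal density scaled by a small $\varepsilon$, as described above, so that no two units start with identical parameters.

```python
import numpy as np

def init_network(layer_sizes, eps=0.01, seed=0):
    """Initialize weights and biases with small random values near 0.

    layer_sizes = [D, N1, ..., NL]; returns one (N^(l) x N^(l-1)) weight matrix
    and one bias vector per layer, drawn from a normal density scaled by eps.
    """
    rng = np.random.default_rng(seed)
    weights, biases = [], []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights.append(eps * rng.standard_normal((n_out, n_in)))
        biases.append(eps * rng.standard_normal(n_out))
    return weights, biases

# Example: D = 4 inputs, a hidden layer of 3 units, 1 output.
weights, biases = init_network([4, 3, 1])
print([W.shape for W in weights], [b.shape for b in biases])
```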

The error term $\delta_m^{(L+1)}$ is the total error of the whole network for sample $m$. The index $L+1$ is used so that this term fits into the back-propagation formulae. To keep things simple, let us consider the case of a two-class network, so that $\delta_m^{(L+1)}$, $h(\vec{x}_m)$, $a_m^{(L)}$ and $y_m$ are scalars. The results are easily generalized to vectors for multi-class networks.

At the output layer, the error for each training sample is:

$\delta_m^{(L+1)} = a_m^{(L)} - y_m$

For the hidden units in layers $l \le L$, the error $\delta_j^{(l)}$ is based on a weighted average of the error terms $\delta_k^{(l+1)}$. We compute the error terms $\delta_{j,m}^{(l)}$ for each unit $j$ in layer $l$, back from $l = L$ to $l = 1$, using the sum of the errors times the corresponding weights times the derivative of the activation function:

$\delta_{j,m}^{(l)} = \frac{\partial f(z_j^{(l)})}{\partial z_j^{(l)}} \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)} \delta_{k,m}^{(l+1)}$

This error term tells how much the unit $j$ was responsible for differences between the activation of the network $h(\vec{x}_m; w_{jk}, b_k)$ and the target value $y_m$.

For the sigmoid activation function $f(z) = \frac{1}{1 + e^{-z}}$, the derivative is $\frac{df(z)}{dz} = f(z)(1 - f(z))$. For $a_j^{(l)} = f(z_j^{(l)})$ this gives:

$\delta_{j,m}^{(l)} = a_{j,m}^{(l)} (1 - a_{j,m}^{(l)}) \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)} \delta_{k,m}^{(l+1)}$

This error term can then be used to correct the weights and bias terms leading from unit $i$ of layer $l-1$ to unit $j$ of layer $l$:

$\Delta w_{ij,m}^{(l)} = a_i^{(l-1)} \delta_{j,m}^{(l)}$

$\Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$
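Here is a sketch of this backward sweep for a sigmoid network, assuming NumPy and the `forward_pass` sketch above (the name `backprop_deltas` is illustrative). One liberty taken: the output delta is seeded as $(a^{(L)} - y)\,f'(z^{(L)})$, which matches the notes' $\delta^{(L+1)} = a^{(L)} - y$ when the fictitious layer-$(L+1)$ weight is taken as 1.

```python
import numpy as np

def backprop_deltas(zs, a, y, weights):
    """Per-sample backpropagation for a sigmoid network.

    zs, a : weighted inputs and activations from forward_pass() (a[0] = x).
    y     : target value(s) for this training sample.
    Returns per-sample corrections dW[l][j, i] = a_i^(l-1) * delta_j^(l)
    and db[l][j] = delta_j^(l).
    """
    L = len(weights)
    # Output error a^(L) - y pushed through f'(z^(L)) = a(1 - a).
    delta = (a[-1] - y) * a[-1] * (1.0 - a[-1])
    dW = [None] * L
    db = [None] * L
    for l in reversed(range(L)):
        dW[l] = np.outer(delta, a[l])   # Δw_ij = a_i^(l-1) * δ_j^(l)
        db[l] = delta                   # Δb_j  = δ_j^(l)
        if l > 0:
            # δ^(l-1) = f'(z^(l-1)) * (W^(l))^T δ^(l), with f'(z) = a(1 - a).
            delta = (weights[l].T @ delta) * a[l] * (1.0 - a[l])
    return dW, db
```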
Note that the corrections $\Delta w_{ij,m}^{(l)}$ and $\Delta b_{j,m}^{(l)}$ are NOT applied until after the error has propagated all the way back to layer $l = 1$, and that when $l = 1$, $a_i^{(0)} = x_i$.

For batch learning, the correction terms $\Delta w_{ij,m}^{(l)}$ and $\Delta b_{j,m}^{(l)}$ are averaged over the $M$ samples of the training data, and only the average correction is applied to the weights:

$\Delta w_{ij}^{(l)} = \frac{1}{M} \sum_{m=1}^{M} \Delta w_{ij,m}^{(l)}$        $\Delta b_j^{(l)} = \frac{1}{M} \sum_{m=1}^{M} \Delta b_{j,m}^{(l)}$

then

$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta \, \Delta w_{ij}^{(l)}$        $b_j^{(l)} \leftarrow b_j^{(l)} - \eta \, \Delta b_j^{(l)}$

where $\eta$ is the learning rate.

Back-propagation is equivalent to computing the gradient of the loss function for each layer of the network. A common problem with gradient descent is that the loss function can have local minima. This problem can be minimized by regularization. A popular regularization technique for back-propagation is to use momentum:

$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta \, \Delta w_{ij}^{(l)} + \mu \, \Delta w_{ij}^{(l)}$        $b_j^{(l)} \leftarrow b_j^{(l)} - \eta \, \Delta b_j^{(l)} + \mu \, \Delta b_j^{(l)}$

where the terms $\mu \, \Delta w_{ij}^{(l)}$ and $\mu \, \Delta b_j^{(l)}$ serve to stabilize the estimation.

The back-propagation algorithm may be continued until all training data has been used. For batch training, the algorithm may be repeated until all error terms $\delta_{j,m}^{(l)}$ are less than a threshold.

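Here is a sketch of the batch update, assuming NumPy and the `forward_pass` / `backprop_deltas` sketches above (the names `train_batch`, `mu` and `vel` are illustrative choices). The momentum term follows the common formulation that carries over a fraction of the previous update, which is one concrete reading of the stabilizing $\mu$ term above.

```python
import numpy as np

def train_batch(weights, biases, X_batch, Y_batch, eta=0.1, mu=0.9, vel=None):
    """One batch update, relying on forward_pass() and backprop_deltas() above:
    average the per-sample corrections over the batch, then apply them with a
    momentum term (velocity = mu * velocity - eta * average correction)."""
    M = len(X_batch)
    sum_dW = [np.zeros_like(W) for W in weights]
    sum_db = [np.zeros_like(b) for b in biases]
    for x, y in zip(X_batch, Y_batch):
        zs, a = forward_pass(np.asarray(x, dtype=float), weights, biases)
        dW, db = backprop_deltas(zs, a, y, weights)
        for l in range(len(weights)):
            sum_dW[l] += dW[l]
            sum_db[l] += db[l]
    if vel is None:
        vel = ([np.zeros_like(W) for W in weights],
               [np.zeros_like(b) for b in biases])
    vW, vb = vel
    for l in range(len(weights)):
        vW[l] = mu * vW[l] - eta * (sum_dW[l] / M)   # momentum-smoothed correction
        vb[l] = mu * vb[l] - eta * (sum_db[l] / M)
        weights[l] += vW[l]                          # corrections applied only after
        biases[l] += vb[l]                           # the full backward pass
    return weights, biases, (vW, vb)
```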
Summary of Backpropagation

The back-propagation algorithm can be summarized as:

1) Initialize the network and a set of correction vectors:
$\forall_{i,j,l}: \; w_{ij}^{(l)} = \mathcal{N}(X; 0, \varepsilon)$
$\forall_{j,l}: \; b_j^{(l)} = \mathcal{N}(X; 0, \varepsilon)$
$\forall_{i,j,l}: \; \Delta w_{ij}^{(l)} = 0$
$\forall_{j,l}: \; \Delta b_j^{(l)} = 0$
where $\mathcal{N}$ is a sample from a normal density and $\varepsilon$ is a small value.

2) For each training sample $\vec{x}_m$, propagate $\vec{x}_m$ through the network (forward propagation) to obtain a hypothesis $h(\vec{x}_m; \mathbb{W}, B)$. Compute the error and propagate this back through the network:

a) Compute the error term:
$\delta_m^{(L+1)} = h(\vec{x}_m) - y_m = a_m^{(L)} - y_m$

b) Propagate the error back from $l = L$ to $l = 1$:
$\delta_{j,m}^{(l)} = \frac{\partial f(z_j^{(l)})}{\partial z_j^{(l)}} \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)} \delta_{k,m}^{(l+1)}$

c) Use the error to set a vector of correction weights:
$\Delta w_{ij,m}^{(l)} = a_i^{(l-1)} \delta_{j,m}^{(l)}$
$\Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$

3) For all layers, $l = 1$ to $L$, update the weights and biases using a learning rate $\eta$ and momentum $\mu$:
$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta \, \Delta w_{ij,m}^{(l)} + \mu \, \Delta w_{ij}^{(l)}$
$b_j^{(l)} \leftarrow b_j^{(l)} - \eta \, \Delta b_{j,m}^{(l)} + \mu \, \Delta b_j^{(l)}$

Note that this last step can be done with an average correction matrix obtained from many training samples (batch mode), providing a more efficient algorithm.
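Putting the pieces together, here is a compact end-to-end sketch of the summarized algorithm, assuming NumPy (all names are illustrative; updates are applied per sample rather than averaged in batch mode, and no momentum term is included, for brevity).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, layer_sizes, eta=0.5, eps=0.01, epochs=1000, seed=0):
    """Train a small sigmoid MLP with per-sample (on-line) backpropagation."""
    rng = np.random.default_rng(seed)
    # 1) Initialize weights and biases with small random values near 0.
    Ws = [eps * rng.standard_normal((n_out, n_in))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    bs = [eps * rng.standard_normal(n_out) for n_out in layer_sizes[1:]]
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # 2) Forward propagation: a[0] = x, a[l] = f(W a[l-1] + b).
            a = [np.asarray(x, dtype=float)]
            for W, b in zip(Ws, bs):
                a.append(sigmoid(W @ a[-1] + b))
            # 2a-2b) Output error, then backpropagate deltas (f'(z) = a(1-a)).
            delta = (a[-1] - y) * a[-1] * (1.0 - a[-1])
            for l in reversed(range(len(Ws))):
                dW, db = np.outer(delta, a[l]), delta          # 2c) corrections
                if l > 0:
                    delta = (Ws[l].T @ delta) * a[l] * (1.0 - a[l])
                Ws[l] -= eta * dW                              # 3) update
                bs[l] -= eta * db
    return Ws, bs

# Example: a 2-3-1 network on the XOR problem (not linearly separable).
# A larger eps is used here so the tiny network breaks symmetry quickly.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [np.array([0.0]), np.array([1.0]), np.array([1.0]), np.array([0.0])]
Ws, bs = train(X, Y, [2, 3, 1], eta=0.5, eps=0.5, epochs=5000)
```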