Pattern Recognition and Machine Learning. Artificial Neural Networks


Pattern Recognition and Machine Learning
James L. Crowley

ENSIMAG 3 - MMIS, Fall Semester 2017
Lesson 7, 20 Dec 2017

Outline: Artificial Neural Networks
  Notation
  Introduction
    Key Equations
    Artificial Neural Networks
  The Artificial Neuron
  The Neural Network Model
  Backpropagation
  Summary of Backpropagation

Using notation and figures from the Stanford Deep Learning tutorial at: http://ufldl.stanford.edu/tutorial/

Notation

$x_d$ : a feature; an observed or measured value.
$\vec{X}$ : a vector of D features.
$D$ : the number of dimensions for the vector $\vec{X}$.
$\{\vec{X}_m\}, \{y_m\}$ : training samples for learning.
$M$ : the number of training samples.
$a_j^{(l)}$ : the activation output of the j-th neuron of the l-th layer.
$w_{ij}^{(l)}$ : the weight from unit i of layer l-1 to unit j of layer l.
$b_j^{(l)}$ : the bias for unit j of layer l.
$\eta$ : a learning rate. Typically very small (0.01). Can be variable.
$\delta_m^{(L)} = a^{(L)} - y_m$ : error for the m-th training sample.
$\delta_{j,m}^{(l)}$ : error for the j-th neuron of layer l, for the m-th training sample.
$\Delta w_{ij,m}^{(l)} = a_{i,m}^{(l-1)}\,\delta_{j,m}^{(l)}$ : update for the weight from unit i of layer l-1 to unit j of layer l.
$\Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$ : update for the bias for unit j of layer l.

Introduction

Key Equations

Feed-forward from layer i to j:
$$a_j^{(l)} = f\left(\sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)}\,a_i^{(l-1)} + b_j^{(l)}\right)$$

Feed-forward from layer j to k:
$$a_k^{(l+1)} = f\left(\sum_{j=1}^{N^{(l)}} w_{jk}^{(l+1)}\,a_j^{(l)} + b_k^{(l+1)}\right)$$

Back-propagation from layer j to i:
$$\delta_{i,m}^{(l-1)} = \frac{\partial f(z_i^{(l-1)})}{\partial z_i^{(l-1)}} \sum_{j=1}^{N^{(l)}} w_{ij}^{(l)}\,\delta_{j,m}^{(l)}$$

Back-propagation from layer k to j:
$$\delta_{j,m}^{(l)} = \frac{\partial f(z_j^{(l)})}{\partial z_j^{(l)}} \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)}\,\delta_{k,m}^{(l+1)}$$

Weight and bias corrections for layer j:
$$\Delta w_{ij,m}^{(l)} = a_{i,m}^{(l-1)}\,\delta_{j,m}^{(l)}, \qquad \Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$$

Network update formulas:
$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\,\Delta w_{ij,m}^{(l)}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta\,\Delta b_{j,m}^{(l)}$$

Artificial Neural Networks

Artificial Neural Networks, also referred to as Multi-layer Perceptrons, are computational structures composed of weighted sums of neural units. Each neural unit is composed of a weighted sum of input units, followed by a non-linear decision function. Note that the term "neural" is misleading: the computational mechanism of a neural network is only loosely inspired by neural biology. Neural networks do NOT implement the same learning and recognition algorithms as biological systems.

The approach was first proposed by Warren McCulloch and Walter Pitts in 1943 as a possible universal computational model. During the 1950s, Frank Rosenblatt developed the idea to provide a trainable machine for pattern recognition, called a Perceptron. The perceptron is an incremental learning algorithm for linear classifiers. The first Perceptron, constructed in 1956, was a room-sized analog computer that learned recognition functions. However, both the learning algorithm and the resulting recognition algorithm are easily implemented as computer programs, and later perceptrons were implemented as programs.

In 1969, Marvin Minsky and Seymour Papert of MIT published a book entitled "Perceptrons" that claimed to document the fundamental limitations of the perceptron approach. Notably, they demonstrated that a one-level perceptron could not be constructed to perform an exclusive OR.

In the 1970s, frustration with the limits of Artificial Intelligence research based on symbolic logic led a small community of researchers to explore the perceptron-based approach. In 1973, Stephen Grossberg showed that a two-layered perceptron could overcome the problems raised by Minsky and Papert and solve many problems that plagued symbolic AI. In 1975, Paul Werbos developed an algorithm referred to as Back-Propagation that uses gradient descent to learn the parameters for perceptrons from classification errors with training data.

During the 1980s, Neural Networks went through a period of popularity, with researchers showing that networks could be trained to provide simple solutions to problems such as recognizing handwritten characters, recognizing spoken words, and steering a car on a highway. However, these results were overtaken by more mathematically sound approaches for statistical pattern recognition, such as support vector machines and boosted learning. In 1998, Yann LeCun showed that convolutional networks composed from many layers could outperform other approaches for recognition problems. Unfortunately, such networks required extremely large amounts of data and computation. Around 2010, with the emergence of cloud computing combined with planetary-scale data, training and using convolutional networks became practical. Since 2012, Deep Networks have outperformed other approaches for recognition tasks common to Computer Vision, Speech and Robotics. A rapidly growing research community currently seeks to extend the applications beyond recognition to the generation of speech and robot actions. Notably, just about any algorithm can be used to train a network, often yielding a solution that executes faster.

The Artificial Neuron

The simplest possible neural network is composed of a single neuron. A neuron is a computational unit that integrates information from a vector of features, $\vec{X}$, to compute the likelihood of a hypothesis:
$$a = h_{w,b}(\vec{X})$$

The neuron is composed of a weighted sum of input values
$$z = w_1 x_1 + w_2 x_2 + \dots + w_D x_D + b$$
followed by a non-linear activation function $f(z)$, sometimes written $\sigma(z)$:
$$a = h_{w,b}(\vec{X}) = f(\vec{w}^T \vec{X} + b)$$

Many different activation functions may be used. A popular choice is the sigmoid:
$$f(z) = \frac{1}{1 + e^{-z}}$$

This function is useful because its derivative is:
$$\frac{df(z)}{dz} = f(z)\,(1 - f(z))$$

This gives a decision function: if $h_{w,b}(\vec{X}) > 0.5$ then POSITIVE else NEGATIVE.

Other popular decision functions include the hyperbolic tangent and the softmax.
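As a concrete illustration (not part of the original notes), the following minimal Python/NumPy sketch computes a single sigmoid neuron and its decision; the feature values, weights and bias are made up.

```python
import numpy as np

def sigmoid(z):
    # f(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def neuron(X, w, b):
    # Weighted sum of the inputs followed by the non-linear activation.
    z = np.dot(w, X) + b
    return sigmoid(z)

# Hypothetical example: D = 3 features, arbitrary weights and bias.
X = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
b = 0.05

a = neuron(X, w, b)
print(a, "POSITIVE" if a > 0.5 else "NEGATIVE")
```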

The hyperbolic tangent,
$$f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}},$$
is a rescaled form of the sigmoid, ranging over [-1, 1].

We can also use the step function:
$$f(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$$

or the sgn function:
$$f(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$$

[Figure: plot of the sigmoid (red), the hyperbolic tangent (blue) and the step function (green).]

The softmax function is often used for multi-class networks. For K classes:
$$f(z_k) = \frac{e^{z_k}}{\sum_{k=1}^{K} e^{z_k}}$$

The rectified linear unit (ReLU) is popular for deep learning because of its trivial derivative:
$$\text{ReLU:} \quad f(z) = \max(0, z)$$

While the derivative of ReLU is discontinuous at z = 0, for z > 0:
$$\frac{df(z)}{dz} = 1$$

Note that the choice of decision function will determine the target variable y for supervised learning.
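For reference, here is a small sketch (not from the notes) of these activation functions and the derivatives quoted above, written with NumPy; the test values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # f(z)(1 - f(z))

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def sgn(z):
    return np.where(z >= 0, 1.0, -1.0)

def softmax(z):
    e = np.exp(z - np.max(z))       # subtracting the max improves numerical stability
    return e / e.sum()

def relu(z):
    return np.maximum(0.0, z)

def relu_deriv(z):
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-2.0, 0.0, 1.5])
print(sigmoid(z), np.tanh(z), step(z), softmax(z), relu(z))
```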

The Neural Network Model

A neural network is a multi-layer assembly of neurons. For example, this is a 2-layer network:

[Figure: a 2-layer network with three inputs, a hidden layer of three units, and a single output unit.]

The circles labeled +1 are the bias terms. The circles on the left are the input terms. Some authors, notably in the Stanford tutorials, refer to this as Level 1. We will NOT refer to this as a level, or, if necessary, call it level L = 0. The rightmost circle is the output layer, also called L. The circles in the middle are referred to as a hidden layer. In this example there is a single hidden layer, and the total number of layers is L = 2. The parameters carry a superscript referring to their layer.

We will use the following notation:

L : the number of layers (layers of non-linear activations).
l : the layer index; l ranges from 0 (input layer) to L (output layer).
$N^{(l)}$ : the number of units in layer l; $N^{(0)} = D$.
$a_j^{(l)}$ : the activation output of the j-th neuron of the l-th layer.
$w_{ij}^{(l)}$ : the weight from unit i of layer l-1 to unit j of layer l.
$b_j^{(l)}$ : the bias term for the j-th unit of the l-th layer.
$f(z)$ : a non-linear activation function, such as a sigmoid, tanh, or softmax.

For example, $a_1^{(2)}$ is the activation output of the first neuron of the second layer, and $w_{13}^{(2)}$ is the weight from neuron 1 of the first layer to neuron 3 of the second layer.

The above network would be described by:
$$a_1^{(1)} = f(w_{11}^{(1)} X_1 + w_{21}^{(1)} X_2 + w_{31}^{(1)} X_3 + b_1^{(1)})$$
$$a_2^{(1)} = f(w_{12}^{(1)} X_1 + w_{22}^{(1)} X_2 + w_{32}^{(1)} X_3 + b_2^{(1)})$$
$$a_3^{(1)} = f(w_{13}^{(1)} X_1 + w_{23}^{(1)} X_2 + w_{33}^{(1)} X_3 + b_3^{(1)})$$
$$h_{w,b}(\vec{X}) = a_1^{(2)} = f(w_{11}^{(2)} a_1^{(1)} + w_{21}^{(2)} a_2^{(1)} + w_{31}^{(2)} a_3^{(1)} + b_1^{(2)})$$
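To make the four equations above concrete, here is an illustrative sketch (not from the notes) of the forward pass for exactly this 3-input, 3-hidden-unit, 1-output network; the weight values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: W1[i, j] is w_ij^(1), the weight from input i to hidden unit j;
# W2[j, 0] is w_j1^(2), the weight from hidden unit j to the single output unit.
W1 = np.array([[ 0.2, -0.1,  0.4],
               [ 0.7,  0.3, -0.5],
               [-0.6,  0.1,  0.2]])
b1 = np.array([0.1, -0.2, 0.05])
W2 = np.array([[0.3], [-0.4], [0.6]])
b2 = np.array([0.1])

X = np.array([1.0, 0.5, -1.5])      # one input vector with D = 3 features

a1 = sigmoid(X @ W1 + b1)           # a_j^(1) = f(sum_i w_ij^(1) X_i + b_j^(1))
h  = sigmoid(a1 @ W2 + b2)          # h_{w,b}(X) = a_1^(2)
print(h)
```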

This can be generalized to multiple layers, where $\vec{h}(\vec{X})$ is the vector of network outputs, one for each class. Each unit is defined as follows. The notation for a multi-layer network is:

$\vec{a}^{(0)} = \vec{X}$ is the input layer, with $a_i^{(0)} = x_d$.
l is the current layer under discussion.
$N^{(l)}$ is the number of activation units in layer l, with $N^{(0)} = D$.
i, j, k are unit indices for layers l-1, l and l+1, respectively.
$w_{ij}^{(l)}$ is the weight for unit i of layer l-1 feeding to unit j of layer l.
$a_j^{(l)}$ is the activation output of the j-th unit of layer l.
$b_j^{(l)}$ is the bias term feeding to unit j of layer l.
$z_j^{(l)} = \sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)}\,a_i^{(l-1)} + b_j^{(l)}$ is the weighted input to the j-th unit of layer l.
$f(z)$ is a non-linear decision function, such as a sigmoid, tanh, or softmax.
$a_j^{(l)} = f(z_j^{(l)})$ is the activation output of the j-th unit of layer l.

For layer l this gives:
$$z_j^{(l)} = \sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)}\,a_i^{(l-1)} + b_j^{(l)}, \qquad a_j^{(l)} = f\left(\sum_{i=1}^{N^{(l-1)}} w_{ij}^{(l)}\,a_i^{(l-1)} + b_j^{(l)}\right)$$
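A direct, unit-by-unit transcription of these two formulas (an illustrative sketch, not from the notes) can be written as a pair of nested loops in plain Python:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(a_prev, w, b):
    """Compute a_j = f( sum_i w[i][j] * a_prev[i] + b[j] ) for each unit j of layer l.

    a_prev : activations of layer l-1 (length N^(l-1))
    w      : nested list where w[i][j] is the weight from unit i of layer l-1 to unit j of layer l
    b      : biases of layer l (length N^(l))
    """
    a = []
    for j in range(len(b)):
        z_j = b[j]
        for i in range(len(a_prev)):
            z_j += w[i][j] * a_prev[i]
        a.append(sigmoid(z_j))
    return a

# Hypothetical example: 3 inputs feeding a layer of 2 units.
a0 = [1.0, 0.5, -1.0]
w  = [[0.2, -0.3], [0.4, 0.1], [-0.5, 0.25]]
b  = [0.0, 0.1]
print(layer_forward(a0, w, b))
```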

Similarly, for layer l+1:
$$z_k^{(l+1)} = \sum_{j=1}^{N^{(l)}} w_{jk}^{(l+1)}\,a_j^{(l)} + b_k^{(l+1)}, \qquad a_k^{(l+1)} = f\left(\sum_{j=1}^{N^{(l)}} w_{jk}^{(l+1)}\,a_j^{(l)} + b_k^{(l+1)}\right)$$

It can be more convenient to represent this using vectors:
$$\vec{z}^{(l)} = \begin{pmatrix} z_1^{(l)} \\ z_2^{(l)} \\ \vdots \\ z_{N^{(l)}}^{(l)} \end{pmatrix}, \qquad \vec{a}^{(l)} = \begin{pmatrix} a_1^{(l)} \\ a_2^{(l)} \\ \vdots \\ a_{N^{(l)}}^{(l)} \end{pmatrix}$$

and to write the weights and bias at each level l as an $N^{(l)} \times N^{(l-1)}$ matrix and an $N^{(l)}$ vector:
$$W^{(l)} = \begin{pmatrix} w_{11}^{(l)} & \cdots & w_{1i}^{(l)} & \cdots & w_{1N^{(l-1)}}^{(l)} \\ \vdots & & \vdots & & \vdots \\ w_{j1}^{(l)} & \cdots & w_{ji}^{(l)} & \cdots & w_{jN^{(l-1)}}^{(l)} \\ \vdots & & \vdots & & \vdots \\ w_{N^{(l)}1}^{(l)} & \cdots & w_{N^{(l)}i}^{(l)} & \cdots & w_{N^{(l)}N^{(l-1)}}^{(l)} \end{pmatrix}, \qquad \vec{b}^{(l)} = \begin{pmatrix} b_1^{(l)} \\ \vdots \\ b_j^{(l)} \\ \vdots \\ b_{N^{(l)}}^{(l)} \end{pmatrix}$$

Note: to respect matrix notation, we have reversed the order of i and j in the subscripts.

We can see that the weights form a 3rd-order tensor (a vector of matrices), with one matrix for each level, and the biases form a matrix (a vector of vectors), with one vector for each level. Then:
$$\vec{z}^{(l)} = W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)} \qquad \text{and} \qquad \vec{a}^{(l)} = f(\vec{z}^{(l)}) = f(W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)})$$

We can assemble the set of matrices $W^{(l)}$ into a 3rd-order tensor (a vector of matrices) $\mathbf{W}$, and represent the $\vec{a}^{(l)}$, $\vec{z}^{(l)}$ and $\vec{b}^{(l)}$ as matrices (vectors of vectors): A, Z, B.

So how do we learn the weights W and biases B? We could train a 2-class detector from a labeled training set $\{\vec{X}_m\}, \{y_m\}$ using gradient descent. For more than two layers, we will need to use the more general backpropagation algorithm.
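Before turning to learning, here is an illustrative vectorized sketch (not from the notes) of the matrix form $\vec{a}^{(l)} = f(W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)})$, propagating one input through an arbitrary stack of layers; the parameter values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, b):
    """Forward propagation through all layers.

    X : input vector a^(0), shape (D,)
    W : list [W^(1), ..., W^(L)]; W^(l) has shape (N^(l), N^(l-1))
    b : list [b^(1), ..., b^(L)]; b^(l) has shape (N^(l),)
    Returns the list of activations [a^(0), a^(1), ..., a^(L)].
    """
    a = [X]
    for W_l, b_l in zip(W, b):
        a.append(sigmoid(W_l @ a[-1] + b_l))   # a^(l) = f(W^(l) a^(l-1) + b^(l))
    return a

# Hypothetical 3-4-2 network with random parameters.
rng = np.random.default_rng(0)
W = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
b = [np.zeros(4), np.zeros(2)]
print(forward(np.array([1.0, -0.5, 2.0]), W, b)[-1])
```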

Backpropagation

Back-propagation adjusts the network weights $w_{ij}^{(l)}$ and biases $b_j^{(l)}$ so as to minimize an error function between the network output $h(\vec{X}_m; W, B) = a^{(L)}$ and the target value $y_m$ for the M training samples $\{\vec{X}_m\}, \{y_m\}$. This is an iterative algorithm that propagates an error term back through the hidden layers and computes a correction for the weights at each layer so as to minimize the error term.

This raises two questions:
1) How do we initialize the weights?
2) How do we compute the error term for hidden layers?

1) How do we initialize the weights?

A natural answer is to initialize the weights to 0. Experience shows that this causes problems: if the parameters all start with identical values, then the algorithm can end up learning the same value for all parameters. To avoid this, we initialize the parameters with a small random value near 0, for example drawn from a normal density with variance ε (typically 0.01):
$$\forall i,j,l:\; w_{ij}^{(l)} = \mathcal{N}(X; 0, \varepsilon) \qquad \text{and} \qquad \forall j,l:\; b_j^{(l)} = \mathcal{N}(X; 0, \varepsilon)$$
where $\mathcal{N}$ is a sample from a normal density. An even better solution is provided by Xavier Glorot's technique (see the course web site on Xavier normalization). However, that solution is too complex for today's lecture.

2) How do we compute the error term?

Back-propagation propagates the error term back through the layers, using the weights. We will present this for individual training samples. The algorithm can easily be generalized to learning from sets of training samples (batch mode).

Given a training sample $\vec{X}_m$, we first propagate $\vec{X}_m$ through the L layers of the network (forward propagation) to obtain a hypothesis $h(\vec{X}_m; W, B) = a^{(L)}$.
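A minimal sketch of this random initialization (illustrative only; the layer sizes, seed and helper name are made up) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_network(layer_sizes, epsilon=0.01):
    """Initialize weights and biases with small zero-mean normal values.

    layer_sizes : [N^(0), N^(1), ..., N^(L)], where N^(0) = D is the input dimension.
    Returns lists W and b where W[l-1] is W^(l) of shape (N^(l), N^(l-1)) and b[l-1] is b^(l).
    """
    W, b = [], []
    for l in range(1, len(layer_sizes)):
        # w_ij^(l) and b_j^(l) drawn from a normal density with variance epsilon
        W.append(rng.normal(0.0, np.sqrt(epsilon), (layer_sizes[l], layer_sizes[l - 1])))
        b.append(rng.normal(0.0, np.sqrt(epsilon), layer_sizes[l]))
    return W, b

# Hypothetical network: D = 3 inputs, one hidden layer of 4 units, 1 output.
W, b = init_network([3, 4, 1])
print([w.shape for w in W], [v.shape for v in b])
```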

We then compute an error term. In the case of a multi-class network, this is a vector with K components, one output for each hypothesis. In this case the indicator vector would have one component for each possible class:
$$\vec{\delta}_m^{(L)} = \vec{a}^{(L)} - \vec{y}_m \qquad \text{or, for each class } k: \quad \delta_{k,m}^{(L)} = a_{k,m}^{(L)} - y_{k,m}$$

This error term tells how much the unit was responsible for differences between the activation of the network, $h(\vec{X}_m; w_k, b_k)$, and the target value $\vec{y}_m$.

To keep things simple, let us consider the case of a two-class network, so that $\delta_m^{(L)}$, $h(\vec{X}_m)$, $a^{(L)}$ and $y_m$ are scalars. The results are easily generalized to vectors for multi-class networks.

At the output layer, the error for each training sample is:
$$\delta_m^{(L)} = a^{(L)} - y_m$$

For the hidden units in layers l < L, the error $\delta_j^{(l)}$ is based on a weighted average of the error terms $\delta_k^{(l+1)}$. We compute error terms $\delta_{j,m}^{(l)}$ for each unit in layer l back to layer l-1 using the sum of errors times the corresponding weights times the derivative of the activation function:
$$\delta_{j,m}^{(l)} = \frac{\partial f(z_j^{(l)})}{\partial z_j^{(l)}} \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)}\,\delta_{k,m}^{(l+1)}, \qquad \delta_{i,m}^{(l-1)} = \frac{\partial f(z_i^{(l-1)})}{\partial z_i^{(l-1)}} \sum_{j=1}^{N^{(l)}} w_{ij}^{(l)}\,\delta_{j,m}^{(l)}$$

For the sigmoid activation function $f(z) = \frac{1}{1+e^{-z}}$, the derivative is:
$$\frac{df(z)}{dz} = f(z)\,(1 - f(z))$$

For $a_j^{(l)} = f(z_j^{(l)})$ this gives:
$$\delta_{j,m}^{(l)} = a_{j,m}^{(l)}\,(1 - a_{j,m}^{(l)}) \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)}\,\delta_{k,m}^{(l+1)}$$
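The backward pass can be sketched as follows (illustrative only, not from the notes); it assumes a sigmoid network whose activations and weight matrices are stored in Python lists.

```python
import numpy as np

def backprop_deltas(y, activations, W):
    """Compute the error terms delta^(l) for every layer of a sigmoid network.

    y           : target value(s) for one training sample
    activations : [a^(1), ..., a^(L)] from forward propagation (a^(L) is the network output)
    W           : [W^(1), ..., W^(L)], where W^(l) has shape (N^(l), N^(l-1))
    Returns [delta^(1), ..., delta^(L)].
    """
    L = len(W)
    deltas = [None] * L
    deltas[L - 1] = activations[L - 1] - y                     # delta^(L) = a^(L) - y
    for l in range(L - 2, -1, -1):
        a = activations[l]
        # delta_j^(l) = a_j (1 - a_j) * sum_k w_jk^(l+1) delta_k^(l+1)
        deltas[l] = a * (1.0 - a) * (W[l + 1].T @ deltas[l + 1])
    return deltas
```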

This error term can then be used to correct the weights and bias terms leading from unit i of layer l-1 to unit j of layer l:
$$\Delta w_{ij,m}^{(l)} = a_{i,m}^{(l-1)}\,\delta_{j,m}^{(l)}, \qquad \Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$$

Note that the corrections $\Delta w_{ij,m}^{(l)}$ and $\Delta b_{j,m}^{(l)}$ are NOT applied until after the error has propagated all the way back to layer l = 1, and that when l = 1, $a_i^{(0)} = x_i$.

For batch learning, the correction terms $\Delta w_{ij,m}^{(l)}$ and $\Delta b_{j,m}^{(l)}$ are averaged over the M samples of the training data, and only the average correction is applied to the weights:
$$\Delta w_{ij}^{(l)} = \frac{1}{M}\sum_{m=1}^{M} \Delta w_{ij,m}^{(l)}, \qquad \Delta b_j^{(l)} = \frac{1}{M}\sum_{m=1}^{M} \Delta b_{j,m}^{(l)}$$
then
$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\,\Delta w_{ij}^{(l)}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta\,\Delta b_j^{(l)}$$
where $\eta$ is the learning rate.

Back-propagation is equivalent to computing the gradient of the loss function for each layer of the network. A common problem with gradient descent is that the loss function can have local minima. This problem can be reduced by regularization. A popular regularization technique for back-propagation is to use momentum:
$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\,\Delta w_{ij}^{(l)} + \mu\,w_{ij}^{(l)}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta\,\Delta b_j^{(l)} + \mu\,b_j^{(l)}$$
where the terms $\mu\,w_{ij}^{(l)}$ and $\mu\,b_j^{(l)}$ serve to stabilize the estimation.

The back-propagation algorithm may be continued until all training data has been used. For batch training, the algorithm may be repeated until all error terms $\delta_{j,m}^{(l)}$ are less than a threshold.
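The batch update can be sketched as below (illustrative, not from the notes). Note that the momentum term here uses the common "fraction of the previous update" formulation rather than the $\mu\,w_{ij}^{(l)}$ term written above; all parameter names are made up.

```python
import numpy as np

def batch_update(W, b, dW_samples, db_samples, vW, vb, eta=0.01, mu=0.9):
    """Average per-sample corrections and apply one gradient step with momentum.

    W, b        : lists of weight matrices and bias vectors, one entry per layer
    dW_samples  : per-sample corrections; dW_samples[m][l] is Delta w^(l) for sample m
    db_samples  : per-sample corrections; db_samples[m][l] is Delta b^(l) for sample m
    vW, vb      : previous updates (the momentum "velocity"), same shapes as W and b
    """
    M = len(dW_samples)
    for l in range(len(W)):
        dW = sum(s[l] for s in dW_samples) / M     # (1/M) sum_m Delta w_{ij,m}^(l)
        db = sum(s[l] for s in db_samples) / M
        vW[l] = -eta * dW + mu * vW[l]             # keep a fraction of the previous step
        vb[l] = -eta * db + mu * vb[l]
        W[l] += vW[l]
        b[l] += vb[l]
    return W, b, vW, vb
```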

Summary of Backpropagation

The back-propagation algorithm can be summarized as:

1) Initialize the network and a set of correction vectors:
$$\forall i,j,l:\; w_{ij}^{(l)} = \mathcal{N}(X; 0, \varepsilon) \qquad \forall j,l:\; b_j^{(l)} = \mathcal{N}(X; 0, \varepsilon)$$
$$\forall i,j,l:\; \Delta w_{ij}^{(l)} = 0 \qquad \forall j,l:\; \Delta b_j^{(l)} = 0$$
where $\mathcal{N}$ is a sample from a normal density and $\varepsilon$ is a small value.

2) For each training sample $\vec{X}_m$, propagate $\vec{X}_m$ through the network (forward propagation) to obtain a hypothesis $h(\vec{X}_m; W, B)$. Compute the error and propagate it back through the network:

a) Compute the error term:
$$\delta_m^{(L)} = h(\vec{X}_m) - y_m = a^{(L)} - y_m$$

b) Propagate the error back from l = L-1 to l = 1:
$$\delta_{j,m}^{(l)} = \frac{\partial f(z_j^{(l)})}{\partial z_j^{(l)}} \sum_{k=1}^{N^{(l+1)}} w_{jk}^{(l+1)}\,\delta_{k,m}^{(l+1)}, \qquad \delta_{i,m}^{(l-1)} = \frac{\partial f(z_i^{(l-1)})}{\partial z_i^{(l-1)}} \sum_{j=1}^{N^{(l)}} w_{ij}^{(l)}\,\delta_{j,m}^{(l)}$$

c) Use the error to set a vector of correction weights:
$$\Delta w_{ij,m}^{(l)} = a_{i,m}^{(l-1)}\,\delta_{j,m}^{(l)}, \qquad \Delta b_{j,m}^{(l)} = \delta_{j,m}^{(l)}$$

3) For all layers, l = 1 to L, update the weights and biases using a learning rate $\eta$ and momentum $\mu$:
$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \eta\,\Delta w_{ij,m}^{(l)} + \mu\,w_{ij}^{(l)}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \eta\,\Delta b_{j,m}^{(l)} + \mu\,b_j^{(l)}$$

Note that this last step can be done with an average correction matrix obtained from many training samples (batch mode), providing a more efficient algorithm.
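Putting the steps together, a compact end-to-end sketch (illustrative only: it assumes a sigmoid network, plain batch gradient descent without the momentum term, and a made-up XOR data set) could look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train(X, y, layer_sizes, eta=0.5, epsilon=0.01, epochs=5000):
    # 1) Initialize weights and biases with small normal values.
    W = [rng.normal(0, np.sqrt(epsilon), (layer_sizes[l], layer_sizes[l - 1]))
         for l in range(1, len(layer_sizes))]
    b = [rng.normal(0, np.sqrt(epsilon), layer_sizes[l]) for l in range(1, len(layer_sizes))]
    L = len(W)
    for _ in range(epochs):
        dW = [np.zeros_like(w) for w in W]
        db = [np.zeros_like(v) for v in b]
        for Xm, ym in zip(X, y):
            # 2) Forward propagation: a^(l) = f(W^(l) a^(l-1) + b^(l))
            a = [Xm]
            for l in range(L):
                a.append(sigmoid(W[l] @ a[-1] + b[l]))
            # 2a-2c) Output error, back-propagated deltas and corrections
            delta = a[-1] - ym                        # delta^(L) = a^(L) - y
            for l in range(L - 1, -1, -1):
                dW[l] += np.outer(delta, a[l])        # Delta w^(l) = a^(l-1) delta^(l)
                db[l] += delta
                if l > 0:
                    delta = a[l] * (1 - a[l]) * (W[l].T @ delta)
        # 3) Batch update with the averaged corrections
        for l in range(L):
            W[l] -= eta * dW[l] / len(X)
            b[l] -= eta * db[l] / len(X)
    return W, b

# Hypothetical example: learn XOR with one hidden layer of 3 units.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])
W, b = train(X, y, [2, 3, 1])
for Xm in X:
    print(Xm, sigmoid(W[1] @ sigmoid(W[0] @ Xm + b[0]) + b[1]))
```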