Pattern Recognition and Machine Learning
James L. Crowley
ENSIMAG 3 - MMIS, Fall Semester 2016
Lesson 7, 14 Dec 2016

Artificial Neural Networks

Outline:
  Notation
  Introduction
    The Artificial Neuron
    The Neural Network Model
    Network Structures for Simple Feed-Forward Networks
  Regression Analysis
    Linear Models
    Estimation of a Hyperplane with Supervised Learning
    Gradient Descent
    Practical Considerations for Gradient Descent
  Regression for a Sigmoid Activation Function
    Gradient Descent
    Regularisation

Using notation and figures from the Stanford Deep Learning tutorial at: http://ufldl.stanford.edu/tutorial/

Notation

x_d            A feature. An observed or measured value.
X              A vector of D features.
D              The number of dimensions of the vector X.
{X_m}, {y_m}   Training samples for learning.
M              The number of training samples.
a_i^(l)        The activation output of the i-th neuron of the l-th layer.
w_ij^(l)       The weight for unit j of layer l and unit i of layer l+1.
b_i^(l)        The bias term for unit i of layer l+1.

Introduction

Artificial Neural Networks, also referred to as Multi-layer Perceptrons, are computational structures composed of weighted sums of neural units. Each neural unit is composed of a weighted sum of input units, followed by a non-linear decision function.

Note that the term "neural" is misleading: the computational mechanism of a neural network is only loosely inspired by neural biology. Neural networks do NOT implement the same learning and recognition algorithms as biological systems.

The approach was first proposed by Warren McCulloch and Walter Pitts in 1943 as a possible universal computational model. During the 1950s, Frank Rosenblatt developed the idea to provide a trainable machine for pattern recognition, called a Perceptron. The perceptron is an incremental learning algorithm for linear classifiers invented by Frank Rosenblatt in 1956. The first Perceptron was a room-sized analog computer that implemented Rosenblatt's learning and recognition functions. Both the learning algorithm and the resulting recognition algorithm are easily implemented as computer programs.

In 1969, Marvin Minsky and Seymour Papert of MIT published a book entitled "Perceptrons" that claimed to document the fundamental limitations of the perceptron approach. Notably, they claimed that a perceptron could not be constructed to perform an exclusive OR.

In the 1970s, frustration with the limits of Artificial Intelligence research based on Symbolic Logic led a small community of researchers to explore the perceptron-based approach. In 1973, Stephen Grossberg showed that a two-layered perceptron could overcome the problems raised by Minsky and Papert, and solve many problems that plagued symbolic AI. In 1975, Paul Werbos developed an algorithm referred to as Back-Propagation that uses gradient descent to learn the parameters for perceptrons from classification errors with training data.

During the 1980s, Neural Networks went through a period of popularity, with researchers showing that networks could be trained to provide simple solutions to problems such as recognizing handwritten characters, recognizing spoken words, and steering a car on a highway. However, these results were overtaken by more mathematically sound approaches for statistical pattern recognition such as support vector machines and boosted learning.

In 1998, Yann LeCun showed that convolutional networks composed from many layers could outperform other approaches for recognition problems. Unfortunately, such networks required extremely large amounts of data and computation. Around 2010, with the emergence of cloud computing combined with planetary-scale data, convolutional networks became practical. Since 2012, Deep Networks have outperformed other approaches for recognition tasks common to Computer Vision, Speech and Robotics. A rapidly growing research community currently seeks to extend the application beyond recognition to the generation of speech and robot actions.

The Artificial Neuron

The simplest possible neural network is composed of a single neuron. A neuron is a computational unit that integrates information from a vector of D features, X, to estimate the likelihood of a hypothesis:

   a = h_{w,b}(X)

The neuron is composed of a weighted sum of input values

   z = w_1 x_1 + w_2 x_2 + ... + w_D x_D + b

followed by a non-linear activation function f(z) (sometimes written σ(z)):

   a = h_{w,b}(X) = f(w^T X + b)

Many different activation functions are used. A popular choice for hidden layers is the sigmoid (logistic) function:

   f(z) = 1 / (1 + e^{-z})
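To make the single-neuron computation concrete, here is a minimal sketch in Python/NumPy. Only the formula a = f(w^T X + b) and the 0.5 decision threshold come from the notes; the feature values, weights and bias below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(X, w, b):
    """Single artificial neuron: weighted sum of inputs followed by a sigmoid."""
    z = np.dot(w, X) + b          # z = w^T X + b
    return sigmoid(z)             # a = h_{w,b}(X)

# Example with D = 3 features (values chosen arbitrarily)
X = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2

a = neuron(X, w, b)
print(a)                                        # activation in (0, 1)
print("POSITIVE" if a > 0.5 else "NEGATIVE")    # decision rule from the notes
```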

The sigmoid function is useful because its derivative is:

   df(z)/dz = f(z)(1 - f(z))

This gives a decision function:

   IF h_{w,b}(X) > 0.5 THEN POSITIVE ELSE NEGATIVE

Other popular activation functions include the hyperbolic tangent and the softmax.

The hyperbolic tangent:

   f(z) = tanh(z) = (e^z - e^{-z}) / (e^z + e^{-z})

The hyperbolic tangent is a rescaled form of the sigmoid, ranging over [-1, 1].

The rectified linear unit (ReLU) is also popular:

   f(z) = max(0, z)

While the derivative of ReLU is discontinuous at z = 0, for z > 0:

   df(z)/dz = 1

[Figure (from A. Ng): plot of the sigmoid, tanh and ReLU functions.]

Note that the choice of activation function determines the encoding of the target variable y for supervised learning (for example, y ∈ {0, 1} for the sigmoid and y ∈ {-1, 1} for tanh).
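The three activations and their derivatives are straightforward to implement; the sketch below (NumPy, function names are my own) mirrors the formulas above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    f = sigmoid(z)
    return f * (1.0 - f)            # df/dz = f(z)(1 - f(z))

def tanh(z):
    return np.tanh(z)               # (e^z - e^-z) / (e^z + e^-z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2    # df/dz = 1 - tanh(z)^2

def relu(z):
    return np.maximum(0.0, z)       # max(0, z)

def d_relu(z):
    return (z > 0).astype(float)    # 1 for z > 0, 0 otherwise

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z), d_sigmoid(z))
print(tanh(z), d_tanh(z))
print(relu(z), d_relu(z))
```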

The Neural Network Model

A neural network is a multi-layer assembly of neurons. For example, this is a 2-layer network:

[Figure: a 2-layer network with three inputs (plus a +1 bias node), a hidden layer of three neurons (plus a +1 bias node), and a single output neuron.]

The circles labeled +1 are the bias terms. The circles on the left are the input terms, also called L_1. This is called the input layer; note that many authors do not consider this to count as a layer. The rightmost circle is the output layer (in this case, only one node), also called L_3. The circles in the middle are referred to as a hidden layer, L_2.

Here we follow the notation used by Andrew Ng. The parameters carry a superscript referring to their layer. For example, a_1^(2) is the activation output of the first neuron of the second layer, and w_13^(2) is the weight connecting unit 3 of layer 2 to unit 1 of layer 3.

The above network would be described by:

   a_1^(2) = f(w_11^(1) X_1 + w_12^(1) X_2 + w_13^(1) X_3 + b_1^(1))
   a_2^(2) = f(w_21^(1) X_1 + w_22^(1) X_2 + w_23^(1) X_3 + b_2^(1))
   a_3^(2) = f(w_31^(1) X_1 + w_32^(1) X_2 + w_33^(1) X_3 + b_3^(1))

   h_{w,b}(X) = a_1^(3) = f(w_11^(2) a_1^(2) + w_12^(2) a_2^(2) + w_13^(2) a_3^(2) + b_1^(2))

This can be generalized to multiple layers. For example:

[Figure (taken from the Stanford UFLDL tutorial, Ng et al.): a feed-forward network with several hidden layers and multiple output units.]

Note that we can recognize multiple classes by learning multiple h_{w,b}(X) functions.

In general (following the notation of Ng), this can be described by:

   L_1 is the input layer.
   L_l is the l-th layer.
   L_N is the output layer.
   w_ij^(l) denotes the parameter (weight) for unit j of layer l and unit i of layer l+1.
   b_i^(l) is the bias term for unit i of layer l+1.
   a_i^(l) is the activation of unit i in layer l.

Note that many authors prefer to define the input layer as l = 0; this way l counts the number of hidden layers.

In deriving the regression algorithms for learning, we will use

   z_i^(l+1) = Σ_{j=1}^{N_l} w_ij^(l) a_j^(l) + b_i^(l)

It will be more convenient to represent this using vectors:

   z^(l) = (z_1^(l), z_2^(l), ..., z_{N_l}^(l))^T

As defined, the neural network computes a forward propagation of the form

   z^(l+1) = W^(l) a^(l) + b^(l)
   a^(l+1) = f(z^(l+1))
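As an illustration of this forward propagation, here is a small NumPy sketch for a fully-connected feed-forward network. Only the recurrence z^(l+1) = W^(l) a^(l) + b^(l), a^(l+1) = f(z^(l+1)) comes from the notes; the layer sizes, random initialization and input values are my own choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Forward propagation: a^(1) = X, then for each layer
    z^(l+1) = W^(l) a^(l) + b^(l) and a^(l+1) = f(z^(l+1))."""
    a = X
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
    return a

# Example: D = 3 inputs, one hidden layer of 3 units, 1 output
# (the 2-layer network described above), with random parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 3))]
biases  = [rng.normal(size=3),      rng.normal(size=1)]

X = np.array([0.5, -1.2, 3.0])
print(forward(X, weights, biases))   # h_{w,b}(X): a single activation in (0, 1)
```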

This is called a Feed-Forward Network. Note that feed-forward networks do not contain loops.

The weights for a Neural Network are commonly trained using a technique called back-propagation, or back-propagation of errors. Back-propagation is a form of regression analysis. The most common approach is to employ a form of Gradient Descent. Gradient descent computes a loss function for the weights of the network and then iteratively seeks to minimize this loss function. This is commonly performed as a form of supervised learning using labeled training data. Note that gradient descent requires that the activation function be differentiable.

Network Structures for Simple Feed-Forward Networks

The architecture of a neural network refers to the number of layers, the number of neurons in each layer, the kind of neurons, and the kinds of neural computation performed. For a simple feed-forward network:

The number of input neurons is determined by the number of input values for each observation (the size of the feature vector, D).

The number of output neurons is the number of classes to be recognized.

The number of hidden layers is determined by the complexity of the recognition problem and the amount of training data available. If your training data is linearly separable (if there exists a hyper-plane between the two classes), then you do not even need a hidden layer: use a simple linear classifier, for example one defined by a Support Vector Machine or Linear Discriminant Analysis. A single hidden layer is often sufficient for many problems (although more reliable solutions can be found using Support Vector Machines or other techniques).

The number of neurons in a hidden layer depends on the complexity of the problem and the amount of training data available. In theory, any problem can be solved with a single hidden layer, given enough training data; in practice, the quantity of training data required is not practical. Experience shows that some very difficult problems can be solved using less training data by adding additional layers. However, the spectacular progress of the last few years has been obtained by the introduction of new computational models such as Convolutional Neural Networks, Pooling, Auto-encoders and Long Short-Term Memory. Many of these are actually known pattern recognition techniques recycled into a Neural Network framework. But before we get to more advanced techniques, we need to look at the basics.

Regression Analysis

The parameters for a feed-forward network are commonly learned using regression analysis of labeled training data. Regression is the estimation of the parameters for a function that maps a set of independent variables onto a dependent variable:

   ŷ = f(X, w)

where X is a vector of D independent (unknown) variables, ŷ is an estimate for a variable y that depends on X, f() is a function that maps X onto ŷ, and w is a vector of parameters for the model.

Note: for ŷ, the hat indicates an estimated value for the target value y. X is upper case because it is a random (unknown) vector.

Regression analysis refers to a family of techniques for modeling and analyzing the mapping of one or more independent variables onto a dependent variable. For example, consider the following table of age, height and weight for 10 females:

   m    Age (years)   Height (cm)   Weight (kg)
   1        17            163            52
   2        32            169            68
   3        25            158            49
   4        55            158            73
   5        12            161            71
   6        41            172            99
   7        32            156            50
   8        56            161            82
   9        22            154            56
  10        16            145            46

We can use any two variables to estimate the third. We can use regression to estimate the parameters of a function to predict any feature ŷ from the two other features X. For example, we can predict weight from height and age as a function

   ŷ = f(X, w)

where ŷ = Weight, X = (Age, Height)^T, and w are the model parameters.

Linear Models

A linear model has the form

   ŷ = f(X, w) = w^T X + b = w_1 x_1 + w_2 x_2 + ... + w_D x_D + b

The vector w = (w_1, w_2, ..., w_D)^T contains the parameters of the model that relate X to ŷ. The equation w^T X + b = 0 is a hyper-plane in a D-dimensional space, where w is the normal to the hyper-plane and b is a constant term.
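As a small illustration of the linear model and its hyper-plane interpretation, the NumPy sketch below evaluates ŷ = w^T X + b and the signed distance to the plane w^T X + b = 0. The parameter values are arbitrary; only the formulas come from the notes.

```python
import numpy as np

# Arbitrary parameters for a weight-from-(age, height) model, for illustration only
w = np.array([0.3, 0.6])      # normal to the hyper-plane w^T X + b = 0
b = -40.0                     # constant term

X = np.array([25.0, 158.0])   # (Age, Height in cm)

y_hat = w @ X + b             # linear model: w_1 x_1 + w_2 x_2 + b
print("predicted weight:", y_hat)

# Signed perpendicular distance of the point X from the hyper-plane
d = (w @ X + b) / np.linalg.norm(w)
print("signed distance to the hyper-plane:", d)
```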

It is generally convenient to include the constant as part of the parameter vector and to add an extra constant term to the observed feature vector. This gives a linear model with D+1 parameters, where the vectors are:

   X = (1, x_1, ..., x_D)^T   and   w = (w_0, w_1, ..., w_D)^T

where w_0 represents b. This gives the homogeneous equation for the model:

   ŷ = f(X, w) = w^T X

Homogeneous coordinates provide a unified notation for geometric operations. With this notation, we can predict weight from height and age using a function learned from regression analysis:

   ŷ = f(X, w)

where ŷ = Weight, X = (1, Age, Height)^T and w = (w_0, w_1, w_2)^T are the model parameters, and the surface is a plane in the space (weight, age, height). In a D-dimensional space, a linear homogeneous equation is a hyper-plane. The perpendicular distance of an arbitrary point from the plane is proportional to

   d = w_0 + w_1 x_1 + w_2 x_2

(and equal to it when (w_1, w_2) has unit norm). This can be used as an error.

Estimation of a Hyperplane with Supervised Learning

In supervised learning, we learn the parameters of a model from a labeled set of training data. The training data are composed of M sets of independent variables, {X_m}, for which we know the value of the dependent variable {y_m}. The training data are the set {X_m}, {y_m}.

For a linear model, learning the parameters of the model from a training set is equivalent to estimating the parameters of a hyperplane using least squares. In the case of a linear model, there are many ways to estimate the parameters:

For example, matrix algebra provides a direct, closed-form solution. Assume a training set of M observations {X_m}, {y_m}, where the constant b is included as the 0th term of X and w:

   X_m = (1, x_1m, ..., x_Dm)^T   and   w = (w_0, w_1, ..., w_D)^T

We seek the parameters for a linear model:

   ŷ = f(X, w) = w^T X

These can be determined by minimizing a loss function defined as the square of the error:

   L(w) = Σ_{m=1}^{M} (w^T X_m - y_m)^2

To build our loss function, we use the M training samples to compose a matrix X and a vector Y:

   X = [X_1, X_2, ..., X_M]      (D+1 rows by M columns; column m is the training vector X_m)
   Y = (y_1, y_2, ..., y_M)^T    (M rows)

The vector of predictions for all samples is X^T w, so we can factor the loss function to obtain:

   L(w) = (X^T w - Y)^T (X^T w - Y)

To minimize the loss function, we calculate the derivative and solve for w where the derivative is 0:

   ∂L(w)/∂w = 2 X X^T w - 2 X Y = 0

which gives X Y = X X^T w, and thus

   w = (X X^T)^{-1} X Y

While this is an elegant solution for linear regression, it does not generalize to other models. A more general approach is to use Gradient Descent.
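As a check of the closed-form solution, the sketch below fits the weight-from-(age, height) model to the 10 samples from the table above. NumPy's lstsq is used here for numerical stability; it is equivalent to w = (X X^T)^{-1} X Y when X X^T is invertible.

```python
import numpy as np

# Training data from the table: (Age, Height in cm) -> Weight in kg
age    = np.array([17, 32, 25, 55, 12, 41, 32, 56, 22, 16], dtype=float)
height = np.array([163, 169, 158, 158, 161, 172, 156, 161, 154, 145], dtype=float)
weight = np.array([52, 68, 49, 73, 71, 99, 50, 82, 56, 46], dtype=float)

M = len(weight)
# X is (D+1) x M: a row of ones (the constant term) plus one row per feature
X = np.vstack([np.ones(M), age, height])
Y = weight

# Least squares solution of X^T w = Y, i.e. w = (X X^T)^{-1} X Y
w, *_ = np.linalg.lstsq(X.T, Y, rcond=None)
print("w =", w)                                        # (w_0, w_1, w_2)
print("prediction for sample 1:", w @ X[:, 0], "actual:", Y[0])
```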

Gradient Descent

Gradient descent is a popular algorithm for estimating the parameters of a large variety of models. Here we will illustrate the approach with the estimation of parameters for a linear model. As before, we seek to estimate the parameters w for a model

   ŷ = f(X, w) = w^T X

from a training set of M samples {X_m}, {y_m}. We will define our loss function as 1/2 the average squared error:

   L(w) = (1 / 2M) Σ_{m=1}^{M} (f(X_m, w) - y_m)^2

where we have included the factor 1/2 to simplify the algebra later.

The gradient is the vector of derivatives of the loss function with respect to each term w_d of w:

   ∇L(w) = (∂L(w)/∂w_0, ∂L(w)/∂w_1, ..., ∂L(w)/∂w_D)^T

where:

   ∂L(w)/∂w_d = (1/M) Σ_{m=1}^{M} (f(X_m, w) - y_m) x_dm

Here x_dm is the d-th coefficient of the m-th training vector. Of course, x_0m = 1 is the constant term.

We use the gradient to correct the estimate of the parameter vector at each iteration. The correction is weighted by a learning rate α. We can see

   (1/M) Σ_{m=1}^{M} (f(X_m, w^(i-1)) - y_m) x_dm

as the average error for parameter w_d^(i-1). Gradient descent corrects by subtracting the average error weighted by the learning rate.

Gradient Descent Algorithm

Initialization (i = 0): let w_d^(0) = 0 for all D+1 coefficients of w.

Repeat until |L(w^(i)) - L(w^(i-1))| < ε:

   w^(i) = w^(i-1) - α ∇L(w^(i-1))

where

   L(w) = (1 / 2M) Σ_{m=1}^{M} (f(X_m, w) - y_m)^2

That is:

   w_d^(i) = w_d^(i-1) - α (1/M) Σ_{m=1}^{M} (f(X_m, w^(i-1)) - y_m) x_dm

Note that all coefficients are updated in parallel. The algorithm halts when the change in L(w^(i)) becomes small: |L(w^(i)) - L(w^(i-1))| < ε, for some small constant ε.

Gradient descent can also be used to learn the parameters of a non-linear model. For example, when D = 2, a second-order model would be:

   X = (1, x_1, x_1^2, x_2, x_2^2, x_1 x_2)^T

   f(X, w) = w_0 + w_1 x_1 + w_2 x_1^2 + w_3 x_2 + w_4 x_2^2 + w_5 x_1 x_2
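A minimal batch gradient descent sketch for the linear model is given below (NumPy). The update rule and the halting test are from the notes; the learning rate, tolerance, and the reuse of the age/height/weight data (with standardized features, as discussed in the next section) are my own choices for illustration.

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.1, eps=1e-9, max_iter=100000):
    """Batch gradient descent for f(X, w) = w^T X.
    X: (D+1) x M matrix of homogeneous training vectors, Y: M target values."""
    D1, M = X.shape
    w = np.zeros(D1)                      # w^(0) = 0
    prev_loss = np.inf
    for _ in range(max_iter):
        err = X.T @ w - Y                 # f(X_m, w) - y_m for all m
        loss = (err @ err) / (2 * M)      # L(w) = (1/2M) * sum of squared errors
        if abs(prev_loss - loss) < eps:   # halt when the change in L(w) is small
            break
        grad = (X @ err) / M              # dL/dw_d = (1/M) sum (f(X_m,w) - y_m) x_dm
        w = w - alpha * grad              # all coefficients updated in parallel
        prev_loss = loss
    return w

# Example: predict weight from standardized age and height
age    = np.array([17, 32, 25, 55, 12, 41, 32, 56, 22, 16], dtype=float)
height = np.array([163, 169, 158, 158, 161, 172, 156, 161, 154, 145], dtype=float)
weight = np.array([52, 68, 49, 73, 71, 99, 50, 82, 56, 46], dtype=float)

features = np.vstack([(age - age.mean()) / age.std(),
                      (height - height.mean()) / height.std()])
X = np.vstack([np.ones(weight.size), features])
print(gradient_descent(X, weight))
```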

Practical Considerations for Gradient Descent

The following are some practical issues concerning gradient descent.

Feature Scaling

Make sure that features have similar scales (ranges of values). One way to assure this is to normalize the training data so that each feature has a range of 1. A simple technique is to divide by the range of sample values. For a training set {X_m} of M training samples with D values, the range of feature d is:

   r_d = Max_m(x_dm) - Min_m(x_dm)

Then, for m = 1 to M:

   x_dm := x_dm / r_d

Even better is to scale with the mean and standard deviation of each feature in the training data:

   µ_d = E{x_dm}
   σ_d^2 = E{(x_dm - µ_d)^2}

For m = 1 to M:

   x_dm := (x_dm - µ_d) / σ_d

Note that the value of the loss function should always decrease: verify that L(w^(i)) - L(w^(i-1)) < 0. If L(w^(i)) - L(w^(i-1)) > 0, then decrease the learning rate α.

You can use this to dynamically adjust the learning rate α. For example, one can start with a high learning rate, and any time that L(w^(i)) - L(w^(i-1)) > 0:
   1) reset α := α/2
   2) recalculate the i-th iteration.
Halt when α falls below a threshold.
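A short sketch of both scaling options (NumPy, with one feature per row as in the matrix X used above):

```python
import numpy as np

def scale_by_range(F):
    """Divide each feature (row) by its range: x_d := x_d / r_d."""
    r = F.max(axis=1, keepdims=True) - F.min(axis=1, keepdims=True)
    return F / r

def standardize(F):
    """Scale each feature (row) by its mean and standard deviation."""
    mu = F.mean(axis=1, keepdims=True)      # mu_d = E{x_d}
    sigma = F.std(axis=1, keepdims=True)    # sigma_d^2 = E{(x_d - mu_d)^2}
    return (F - mu) / sigma

# Age and height features from the earlier table, one feature per row
F = np.array([[17, 32, 25, 55, 12, 41, 32, 56, 22, 16],
              [163, 169, 158, 158, 161, 172, 156, 161, 154, 145]], dtype=float)
print(scale_by_range(F))
print(standardize(F))
```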

Regression for a Sigmoid Activation Function

Gradient descent is easily used to estimate the parameters of a sigmoid estimation function. Recall that the sigmoid function has the form

   h(X, w) = 1 / (1 + e^{-g(X, w)})

This function is differentiable. When g(X, w) = w^T X:

   h(X, w) = 1 / (1 + e^{-w^T X})

and the decision rule is: if h(X, w) > 0.5 then POSITIVE else NEGATIVE.

The cost function for this activation function is:

   Cost(h(X_m, w), y_m) =  -log(h(X_m, w))       if y_m = 1
                           -log(1 - h(X_m, w))   if y_m = 0

Thus the loss function would be:

   L(w) = (1/M) Σ_{m=1}^{M} Cost(h(X_m, w), y_m)
        = -(1/M) Σ_{m=1}^{M} [ y_m log(h(X_m, w)) + (1 - y_m) log(1 - h(X_m, w)) ]

and its gradient is:

   ∂L(w)/∂w_d = (1/M) Σ_{m=1}^{M} (h(X_m, w) - y_m) x_dm
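A sketch of this loss and gradient in NumPy (one column per sample, as before; labels are assumed to be coded 0/1, and the toy data are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss_and_grad(w, X, Y):
    """Cross-entropy loss and gradient for h(X, w) = sigmoid(w^T X).
    X: (D+1) x M homogeneous samples, Y: M labels in {0, 1}."""
    M = Y.size
    h = sigmoid(w @ X)                                        # h(X_m, w) for all m
    loss = -np.mean(Y * np.log(h) + (1 - Y) * np.log(1 - h))  # L(w)
    grad = (X @ (h - Y)) / M                                  # dL/dw_d = (1/M) sum (h - y) x_dm
    return loss, grad

# Tiny illustrative example: one feature plus the constant term
X = np.vstack([np.ones(4), [-2.0, -1.0, 1.0, 2.0]])
Y = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
print(logistic_loss_and_grad(w, X, Y))
```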

Gradient Descent

Repeat until |L(w^(i)) - L(w^(i-1))| < ε:

   w^(i) = w^(i-1) - α ∇L(w^(i-1))

where

   L(w) = -(1/M) Σ_{m=1}^{M} [ y_m log(h(X_m, w)) + (1 - y_m) log(1 - h(X_m, w)) ]

That is:

   w_d^(i) = w_d^(i-1) - α (1/M) Σ_{m=1}^{M} (h(X_m, w^(i-1)) - y_m) x_dm

Regularisation

Overfitting: tuning the function to the training data. Overfitting is a common problem in machine learning, particularly when there are many features and not enough training data. One solution is to regularise the function by adding an additional term: replace

   w_d^(i) = w_d^(i-1) - α (1/M) Σ_{m=1}^{M} (h(X_m, w^(i-1)) - y_m) x_dm

with

   w_d^(i) = w_d^(i-1) - α [ (1/M) Σ_{m=1}^{M} (h(X_m, w^(i-1)) - y_m) x_dm + λ w_d^(i-1) ]   for d ≥ 1

while keeping the unregularised update for the bias term w_0.
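Putting the pieces together, here is a sketch of regularised gradient descent for the sigmoid model (NumPy). The update rule and halting test are from the notes; α, λ, ε and the toy data are my own choices, and the bias w_0 is left unregularised as described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, Y, alpha=0.5, lam=0.01, eps=1e-9, max_iter=100000):
    """Gradient descent for regularised logistic regression.
    X: (D+1) x M homogeneous samples, Y: M labels in {0, 1}."""
    D1, M = X.shape
    w = np.zeros(D1)
    prev_loss = np.inf
    for _ in range(max_iter):
        h = sigmoid(w @ X)
        loss = (-np.mean(Y * np.log(h) + (1 - Y) * np.log(1 - h))
                + 0.5 * lam * np.sum(w[1:] ** 2))
        if abs(prev_loss - loss) < eps:
            break
        grad = (X @ (h - Y)) / M
        grad[1:] += lam * w[1:]        # regularise all weights except the bias w_0
        w = w - alpha * grad
        prev_loss = loss
    return w

# Toy 1-D example: negative examples at x < 0, positive at x > 0
X = np.vstack([np.ones(6), [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]])
Y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
print(train_logistic(X, Y))
```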