Supervised learning in single-stage feedforward networks

Bruno A. Olshausen
September 2014

Abstract: This handout describes supervised learning in single-stage feedforward networks composed of McCulloch-Pitts type neurons. These methods were originally worked out by Bernie Widrow at Stanford in the early 1960s; related work was done by Frank Rosenblatt at Cornell around the same time. There are numerous texts that describe these ideas, among them the text by Hertz, Krogh, & Palmer, as well as Widrow's book, Adaptive Signal Processing (Widrow & Stearns, Prentice-Hall, 1985). The idea of this handout is to describe the central ideas in a rather condensed and intuitive form.

The problem we consider is that of learning in single-stage networks like that shown in Figure 1. Information flows strictly upward in this network, from the inputs $x_i$ at the bottom to the outputs $y_i$ at the top; thus, it is termed a feedforward network.

[Figure 1: A single-stage network of model neurons, with inputs $x_1, x_2, \ldots, x_n$ at the bottom and outputs $y_1, y_2, \ldots, y_m$ at the top.]

For example, the input nodes may correspond to some sensory input and the output nodes may represent some appropriate motor response for a simple organism. The function performed by the network is essentially determined by the links connecting the input nodes to the output nodes, analogous to synapses in real neurons. Our challenge is to figure out how to change these links so as to achieve some desired input-output transformation. Importantly, we would like the system to learn the proper links from example, by training it with desired input-output pairs. This form of learning is called supervised learning because it relies on having a teacher to tell the system what the desired output is for each input example. In order to make

this work, we need to devise a learning rule for the system, which specifies how the links are automatically changed upon each trial of training. We shall first consider the problem of learning in a single linear neuron, and then we will add a non-linearity on the output. We then discuss what sorts of transformations are possible to learn with such a system.

A single linear neuron

Let us first consider how to learn with a single, linear neuron, illustrated in Figure 2. The inputs are denoted by the variables $x_1, x_2, \ldots, x_n$; the links between the input and output, or weights, are denoted by $w_1, w_2, \ldots, w_n$; and the bias to the neuron is denoted by $w_0$. The output, $y$, is a linear function of the inputs, $x_i$, as follows:

$$y = \sum_{i=1}^{n} w_i x_i + w_0 \qquad (1)$$

[Figure 2: A model linear neuron. The inputs $x_1, \ldots, x_n$ are weighted by $w_1, \ldots, w_n$ and summed ($\Sigma$) together with the bias $w_0$ to produce the output $y$.]

The input-output transformation, or the function performed by the neuron, depends entirely on the weights, $w_i$, in addition to the bias term $w_0$. It will generally be assumed that these parameters are initially set to random values. The goal of learning is to tweak the weights and bias over time so that a particular, desired input-output function is performed by the neuron.

For example, let us say that we wish to train the system to perform a pattern classification task, so that it yields a $+1$ response to a "T" pattern presented on the inputs, and a $-1$ response to an "L" pattern presented on the inputs, as shown in Figure 3a. With a little effort one can see that the weight values that would achieve this are those shown at right, in Figure 3b, with the bias $w_0$ set to zero. (Verify for yourself that this is correct.) But we could see the solution here only because we are smart. If you were a little neuron, and all you received were a bunch of inputs and their desired corresponding outputs as training examples, what rule would you use to adjust your weights to achieve the desired outcome?
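The suggested check can be done numerically. Here is a minimal sketch, assuming the 3×3 pixel layout of Figure 3a (flattened row by row) and the weight values of Figure 3b:

```python
import numpy as np

# Figure 3a patterns on a 3x3 grid, flattened row by row
# (black pixel = 1, blank pixel = 0)
T = np.array([1., 1., 1.,
              0., 1., 0.,
              0., 1., 0.])
L = np.array([1., 0., 0.,
              1., 0., 0.,
              1., 1., 1.])

# Figure 3b weight values, in the same ordering as the inputs
w = np.array([  0.,  1/3,  1/3,
              -1/3,  1/3,   0.,
              -1/3,   0., -1/3])
w0 = 0.0  # bias

def linear_neuron(x, w, w0):
    """Equation 1: y = sum_i w_i x_i + w_0."""
    return w @ x + w0

print(linear_neuron(T, w, w0), linear_neuron(L, w, w0))  # close to +1 and -1
```

Each black pixel of the "T" lands on a $+1/3$ weight (or a zero), and each black pixel of the "L" on a $-1/3$ weight (or a zero), so the weighted sums come out to $+1$ and $-1$ respectively.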
Here is an intuitive way to think about the problem. Let's say you are presented with a specific pattern on your inputs, and you compute the corresponding output, $y$,

according to Equation 1, and it does not match the desired output. The first question you might ask is whether the computed output, $y$, was too high or too low. If $y$ was too low, then for each input $x_i$ that is on (i.e., a black pixel) we should crank up $w_i$ on that trial; that way, the next time the pattern is presented, $y$ will be closer to the desired value. Vice versa, if $y$ is too high, then we should decrease $w_i$.

[Figure 3: A pattern discrimination task. a, The inputs are arrayed in two dimensions. We wish to train the neuron to yield a $+1$ response to a "T" and a $-1$ response to an "L". Black pixels denote $x_i = 1$, blank pixels denote $x_i = 0$. b, The weight values, $w_i$, that achieve the desired result, displayed with the same ordering as the $x_i$:
      0    +1/3   +1/3
    -1/3   +1/3     0
    -1/3     0    -1/3 ]

Let's see if we can put these intuitions on a firmer foundation. We will enumerate the training examples with the symbol $\alpha$, where each example consists of an input pattern along with its desired output. In the case above, we would have $\alpha = 1, 2$ for the training examples T/$+1$ and L/$-1$ respectively. We shall denote the input state for example $\alpha$ as $x^\alpha = \{x_1^\alpha, \ldots, x_n^\alpha\}$, and the desired output (or teacher signal) for example $\alpha$ as $t^\alpha$. (Note that the $\alpha$ are merely superscripts which denote a specific training example, not an exponent.)

The goal of learning is to find a set of $w$'s that minimize the error between the desired output, or teacher signal, $t^\alpha$, and the actual output computed via Equation 1, $y(x^\alpha)$, over all of the examples $\alpha$. If we take the measure of the error to be the square of the difference between the teacher and computed output, then our error function for example $\alpha$ is

$$E^\alpha = [t^\alpha - y(x^\alpha)]^2 \qquad (2)$$

and the total error function over all the examples is then

$$E = \sum_\alpha E^\alpha \qquad (3)$$

assuming each training example is weighted equally. $E$ is essentially a function of the weights and bias of the neuron, $w_0, \ldots, w_n$.

It turns out that since the neuron is linear in this case, we can compute the solution
for the $w$'s in closed form (we will show this in class). But for now we are not going to assume that our little neuron is this smart, and so we need to come up with an iterative method for determining the $w$'s, assuming that we have a teacher who is patient enough to make repeated

presentations of input-output pairs until the neuron finally gets it right. In order to do this, we need to find out how to tweak the $w$'s on each training example so as to reduce $E^\alpha$. For this, we shall turn to a technique known as gradient descent.

An aside: gradient descent

Gradient descent is a technique used for function minimization. It is based on a very simple principle: move downhill. Let's say we have a function of two variables, $f(w_1, w_2)$, as illustrated in Figure 4.

[Figure 4: A function of two variables, $f(w_1, w_2)$, which describes a surface. The gradient evaluated at any point tells you which way is downhill at that point.]

The question is, if we are currently sitting at the point $(w_1^0, w_2^0)$, in what direction should we move in order to go downhill? If you were a skier, you would simply inspect the slope where you are standing and point the skis downward. If you are a variable, you do the same thing. The only problem is that you need to figure out which way is down (because variables aren't equipped with visual perception). If you have a formula for the function, $f$, which describes the surface, then the gradient of $f$ tells you which way to go to increase the function, and so the negative gradient tells you which way is down. The slope along the $w_1$ direction is $\frac{\partial f}{\partial w_1}$, while the slope along the $w_2$ direction is $\frac{\partial f}{\partial w_2}$. So, if you are standing at $(w_1^0, w_2^0)$, you simply evaluate the slopes in each direction at that point and move in a direction proportional to the negative gradient:

$$\Delta w_1 \propto -\left.\frac{\partial f}{\partial w_1}\right|_{w_1^0, w_2^0} \qquad (4)$$

$$\Delta w_2 \propto -\left.\frac{\partial f}{\partial w_2}\right|_{w_1^0, w_2^0} \qquad (5)$$

where $|_{w_1^0, w_2^0}$ means "evaluated at $w_1^0, w_2^0$". The minus sign indicates that we are moving downhill. The exact amount you move depends on how bold you are, as well as the smoothness of the surface. The derivative is only a local measurement of the slope, so as you move away from $(w_1^0, w_2^0)$ the slope will generally change. If you step too far, you will end up bouncing all around the surface rather than going downhill

like you intended. Picking the right step size is one of the tricky empirical aspects of gradient descent. Usually, you will want to observe how $f$ decreases as you increase your step size, and pick a step size that yields the greatest average decrease in $f$ per step.

OK, now back to the problem of determining the learning rule for our linear neuron. $E^\alpha$ is a function of the $w$'s, just like $f$ above, so we can get a learning rule for the weights and bias simply by doing gradient descent:

$$\Delta w_i \propto -\frac{\partial E^\alpha}{\partial w_i} \qquad (6)$$

where $\Delta w_i$ denotes the incremental change we should make in weight $w_i$ on trial $\alpha$. Now we just have to do some math to figure out what $\frac{\partial E^\alpha}{\partial w_i}$ is. Using the definition of $E^\alpha$ in Equation 2 we get

$$\frac{\partial E^\alpha}{\partial w_i} = \frac{\partial}{\partial w_i} [t^\alpha - y(x^\alpha)]^2 \qquad (7)$$

$$= -2\,[t^\alpha - y(x^\alpha)]\,\frac{\partial y(x^\alpha)}{\partial w_i} \qquad (8)$$

Then, using the definition of $y(x^\alpha)$ in Equation 1,

$$y(x^\alpha) = \sum_j w_j x_j^\alpha + w_0 \qquad (9)$$

we get

$$\frac{\partial y(x^\alpha)}{\partial w_i} = \begin{cases} x_i^\alpha & \text{if } i \geq 1 \\ 1 & \text{if } i = 0 \end{cases} \qquad (10)$$

Thus, our learning rule is

$$\Delta w_i \propto [t^\alpha - y(x^\alpha)]\, x_i^\alpha \qquad (11)$$

$$\Delta w_0 \propto [t^\alpha - y(x^\alpha)] \qquad (12)$$

Equation 11 says that on each trial, $\alpha$, we change $w_i$ by an amount proportional to the output error times the current input, $x_i^\alpha$. For example, if the $i$th input on example $\alpha$ is on ($x_i^\alpha = 1$), and the computed output, $y(x^\alpha)$, is lower than the desired output $t^\alpha$, then $w_i$ should be increased. This is just as we intuited previously. Moreover, if we are off by a lot, then $w_i$ should change a lot. Equation 12 says that the bias, $w_0$, increases or decreases proportional to the amount of undershoot or overshoot, respectively, in $y(x^\alpha)$, independent of any of the input values.

A linear neuron with an output non-linearity

Now let us make a slight modification to the neuron so that there is a non-linearity on the output, as shown in Figure 5. The input-output relationship for this neuron

is then

$$y = \sigma(u) \qquad (13)$$

$$u = \sum_{i=1}^{n} w_i x_i + w_0 \qquad (14)$$

[Figure 5: A linear neuron with an output non-linearity. The weighted sum $u$ of the inputs is passed through the function $\sigma$ to produce the output $y$.]

The function $\sigma$ expresses the non-linearity on the output, which we take to be the so-called sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-\lambda x}} \qquad (15)$$

[Figure 6: The sigmoid function $\sigma(x) = \frac{1}{1+e^{-\lambda x}}$, plotted for several different values of $\lambda$, which controls the steepness.]

The effect of the output non-linearity is that it will cause $y$ to saturate as it gets too large, and to bottom out as it gets too small. In between, it has a rather linear relationship to the sum over the inputs, $u$.

Frank Rosenblatt considered the case of a neuron with a very sharp output non-linearity ($\lambda = \infty$), so that it essentially becomes a step function. In this case, you cannot derive a learning rule via gradient descent, because you cannot compute a gradient of the output with respect to the weights (since the step function is discontinuous). Rosenblatt instead devised an algorithm, presumably following his intuitions, and proved that it would converge on a solution (see Hertz, Krogh and

Palmer, chapter 5). But if the output non-linearity is somewhat gradual ($\lambda < \infty$), then you can compute a gradient and derive a learning rule for the $w_i$. This you will do in the homework.

Linear separability

Can a single neuron, with or without an output non-linearity, discriminate any arbitrary sets of patterns? For example, let's say we want to make the system give a $+1$ response to either a "T" or a "U", and a $-1$ response to either an "L" or an "O". Will this be possible?

[Figure 7: A geometric interpretation of the function performed by a model neuron. a, A linear neuron computes its output by simply taking the projection of a given input pattern $x^\alpha$ onto the weight vector $w$, and then adding the bias $w_0$ (which is for now set to zero). The dotted line through the origin denotes the separating line (or hyperplane, in higher dimensions) for separating positive outputs from negative outputs. b, Two pattern classes that may be separated by a line or plane are said to be linearly separable, and any such classes of patterns may be successfully discriminated by the model neuron. c, An example of two pattern classes that are not linearly separable, and which therefore could not be discriminated by a model neuron, no matter how hard it tried.]
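Before answering that, it is worth confirming that the learning rule of Equations 11 and 12 really does find a solution on its own when one exists. Here is a minimal sketch on the T/L task of Figure 3, starting from random weights; the step size and number of presentations are illustrative assumptions, not values from the handout:

```python
import numpy as np

# T/L task of Figure 3: T -> target +1, L -> target -1
T = np.array([1., 1., 1.,  0., 1., 0.,  0., 1., 0.])
L = np.array([1., 0., 0.,  1., 0., 0.,  1., 1., 1.])
examples = [(T, 1.0), (L, -1.0)]

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=9)   # weights, initially random
w0 = 0.0                            # bias
eta = 0.1                           # step size (illustrative assumption)

for _ in range(200):                # repeated presentations by a patient teacher
    for x, t in examples:
        y = w @ x + w0              # Equation 1
        w += eta * (t - y) * x      # Equation 11: error times input
        w0 += eta * (t - y)         # Equation 12: error alone

print(w @ T + w0, w @ L + w0)       # approaches +1 and -1
```

Note that the learned weights need not match Figure 3b exactly; with more unknowns than training examples, many weight vectors drive the error of Equation 3 to zero.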
Let's make a geometric picture of what the model neuron is doing, and then we will see what its limitations are. Figure 7 shows the input state-space for a neuron with two inputs, $x_1$ and $x_2$. A given input pattern, $x^\alpha = \{x_1^\alpha, x_2^\alpha\}$, constitutes a point in this space. We can consider the weights to form a vector $w$ in this space with components $w_1, w_2$. The output of the linear neuron is then given by projecting the point $x^\alpha$ onto the weight vector, $w$, as shown in Figure 7a. This can be seen by re-writing Equation 1 in vector terms,

$$y = w^T x \qquad (16)$$

(ignoring for now the term $w_0$). Any point $x$ lying along the dotted line, perpendicular to $w$, will yield an output of zero. Any point lying to the right of the dotted line will yield a positive output, and any point lying to the left of the dotted line will yield a negative output. If we pass the output of the neuron through the logistic function non-linearity, using a high value of $\lambda$ so as to make a steep transition, then those input patterns falling to the right of the dotted line will yield an output of 1, and

those falling to the left will yield an output of 0. Thus, the class of patterns that can be discriminated by such a model neuron are those that can be separated by a line in two dimensions, or by a hyperplane in higher-dimensional spaces, as shown in Figure 7b. This is known as linear separability, and it is an important limitation on the discrimination capabilities of linear neuron models (with or without an output non-linearity). Pattern classes that are not linearly separable, such as in Figure 7c, cannot be discriminated by linear neurons.

Multiple outputs

It is now easy to extend the above neuron models to a system consisting of multiple neurons, each with a distinct output, as shown originally in Figure 1. We just need to re-denote the outputs, weights, and biases for each of the neurons as follows:

$$y_i = \sigma(u_i) \qquad (17)$$

$$u_i = \sum_j w_{ij} x_j + w_{i0} \qquad (18)$$

where $w_{ij}$ denotes the weight from the $j$th input node to the $i$th output node. The learning rule for achieving a desired input-output transformation is the same as derived previously; it is just a matter of redoing the subscripts on the $w$'s and $y$'s.

The next challenge will be to have multiple layers of such neurons, so that one set of neurons receives its inputs from the outputs of a previous stage. This is how we will get around the limitation of linear separability. The problem then becomes how to learn the weights of the hidden units in such a multi-layer system.
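The multi-output model of Equations 17 and 18 amounts to a matrix-vector product followed by an elementwise sigmoid. A minimal sketch, where the layer sizes and random weight values are illustrative assumptions:

```python
import numpy as np

def sigmoid(u, lam=1.0):
    # Equation 15: sigma(u) = 1 / (1 + exp(-lambda * u))
    return 1.0 / (1.0 + np.exp(-lam * u))

def layer(W, b, x, lam=1.0):
    """Equations 17 and 18: y_i = sigma(sum_j w_ij x_j + w_i0).
    Row i of W holds the weights of output neuron i; b holds the biases w_i0."""
    return sigmoid(W @ x + b, lam)

# Illustrative sizes: n = 4 inputs, m = 3 outputs
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))   # w_ij: weight from input node j to output node i
b = np.zeros(3)               # biases w_i0
x = np.array([1.0, 0.0, 1.0, 0.0])

y = layer(W, b, x)
print(y.shape)                # one output per neuron: (3,)
```

Because the sigmoid is applied elementwise, each output neuron still behaves exactly like the single neuron above, just with its own row of weights, which is why the learning rule carries over with only a change of subscripts.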