ECE 471/571 - Lecture 17: Back Propagation. Types of NN. History.


Types of NN
- Recurrent (feedback during operation): Hopfield, Kohonen, associative memory
- Feedforward (no feedback during operation or testing; feedback is used only during determination of the weights, i.e., training): perceptron, backpropagation

History
- In the 1980s, neural networks became fashionable again after the dark age of the 1970s.
- One of the reasons was the paper by Rumelhart, Hinton, and McClelland, which made the BP algorithm famous.
- The algorithm itself was first discovered in the 1960s but did not draw much attention at the time.
- BP is most famous for its application to layered feedforward networks, or multilayer perceptrons.

Limitations of the Perceptron
- The output has only two values (1 or 0).
- Can only classify samples that are linearly separable (separable by a straight line or plane).
- A single layer can only learn AND, OR, NOT; it cannot learn functions like XOR.

XOR (3-layer NN)
- [Figure: a three-layer network with inputs x1, x2 feeding identity units 1 and 2, hidden units 3 and 4, an output unit 5, weights w13, w14, w23, w24, w35, w45, and bias -b.]
- Units 1 and 2 are identity functions; units 3, 4, 5 are sigmoids.
- w13 = 1.0, w14 = -1.0, w24 = 1.0, w23 = -1.0, w35 = 0.1, w45 = -0.1.
- The inputs take on only the values +1 and -1.

BP in a 3-Layer Network
- Error between the expected (target) output T and the actual output y:
  E = \frac{1}{2} (T - y)^2
- Choose a set of initial weights \omega_{st}, then update them by
  \omega_{st}(k+1) = \omega_{st}(k) - c(k) \frac{\partial E(k)}{\partial \omega_{st}(k)}
  where \omega_{st} is the weight connecting input s to neuron t.
- The problem is essentially how to choose the weights \omega to minimize the error between the expected output and the actual output.
- The basic idea behind BP is gradient descent. A minimal numerical sketch follows.
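
To make the update rule concrete, here is a small numerical sketch (my own illustration, not code from the lecture): a 2-2-1 feedforward network with sigmoid hidden and output units takes one gradient-descent step omega <- omega - c * dE/d omega, with the gradient estimated by finite differences, and the squared error E should decrease. The layer sizes, random initialization, and learning rate c are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, x):
    """2-2-1 network: identity input units, sigmoid hidden and output units."""
    W1, W2 = w                      # W1: 2x2 input-to-hidden, W2: 2x1 hidden-to-output
    return sigmoid(sigmoid(x @ W1) @ W2)

def error(w, X, T):
    """Squared error E = 1/2 * sum (T - y)^2 over all training patterns."""
    return 0.5 * np.sum((T - forward(w, X)) ** 2)

X = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])  # inputs take on only +1 and -1
T = np.array([[0.], [1.], [1.], [0.]])                      # XOR-style targets
rng = np.random.default_rng(0)
w = [rng.uniform(-0.5, 0.5, (2, 2)), rng.uniform(-0.5, 0.5, (2, 1))]

c, eps = 0.1, 1e-6                  # learning rate c(k) and finite-difference step
E0 = error(w, X, T)

# Estimate dE/d omega_st for every weight at the current point (finite differences).
grads = []
for W in w:
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        g[idx] = (error(w, X, T) - E0) / eps
        W[idx] = old
    grads.append(g)

# One gradient-descent step: omega_st(k+1) = omega_st(k) - c(k) * dE/d omega_st(k).
for W, g in zip(w, grads):
    W -= c * g

print(f"E before: {E0:.4f}   E after one step: {error(w, X, T):.4f}")  # E should decrease
```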

Exercise
- Network: hidden unit h = f(\sum_i \omega_i x_i), output y = f(\omega h), error E = \frac{1}{2}(T - y)^2.
- With these definitions,
  \frac{\partial y}{\partial \omega} = f'(\omega h)\, h  and  \frac{\partial h}{\partial \omega_i} = f'\!\big(\sum_i \omega_i x_i\big)\, x_i,
  so  \frac{\partial y}{\partial \omega_i} = f'(\omega h)\, \omega\, f'\!\big(\sum_i \omega_i x_i\big)\, x_i.

*The Derivative - Chain Rule
- \Delta\omega \propto -\frac{\partial E}{\partial \omega} = -\frac{\partial E}{\partial y}\frac{\partial y}{\partial \omega} = (T - y)\, f'(\omega h)\, h
- \Delta\omega_i \propto -\frac{\partial E}{\partial \omega_i} = -\frac{\partial E}{\partial y}\frac{\partial y}{\partial h}\frac{\partial h}{\partial \omega_i} = (T - y)\, f'(\omega h)\, \omega\, f'\!\big(\sum_i \omega_i x_i\big)\, x_i

Threshold Function
- The traditional threshold function, as proposed by McCulloch-Pitts, is a binary function.
- The importance of differentiability: BP needs the derivative of the activation function.
- A threshold-like but differentiable form was needed: the sigmoid,
  f(x) = \frac{1}{1 + \exp(-x)}
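
As a sanity check of the chain-rule expressions above, the sketch below (my own illustration; the single-hidden-unit network, the sigmoid f with f'(z) = f(z)(1 - f(z)), and the specific numbers are assumptions) evaluates the analytic gradients and compares them against finite-difference derivatives of E = (1/2)(T - y)^2.

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigmoid
df = lambda z: f(z) * (1.0 - f(z))       # its derivative: f'(z) = f(z)(1 - f(z))

x = np.array([0.5, -1.0, 2.0])           # inputs x_i (illustrative values)
w_i = np.array([0.3, -0.2, 0.1])         # input-to-hidden weights omega_i
w, T = 0.7, 1.0                          # hidden-to-output weight omega, target T

net_h = w_i @ x                          # net input to the hidden unit
h = f(net_h)
y = f(w * h)

# Analytic gradients from the chain rule (note Delta omega = -dE/d omega).
dE_dw = -(T - y) * df(w * h) * h
dE_dwi = -(T - y) * df(w * h) * w * df(net_h) * x

# Finite-difference check of the same derivatives.
def E(w_, wi_):
    return 0.5 * (T - f(w_ * f(wi_ @ x))) ** 2

eps = 1e-6
num_dw = (E(w + eps, w_i) - E(w, w_i)) / eps
num_dwi = np.array([(E(w, w_i + eps * np.eye(3)[i]) - E(w, w_i)) / eps for i in range(3)])

print("dE/dw   analytic:", dE_dw,  " numeric:", num_dw)
print("dE/dw_i analytic:", dE_dwi, " numeric:", num_dwi)
```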

*BP vs. MPP

  E(\omega) = \sum_x \big[ g_k(x;\omega) - T_k \big]^2
            = \sum_{x \in \omega_k} \big[ g_k(x;\omega) - 1 \big]^2 + \sum_{x \notin \omega_k} \big[ g_k(x;\omega) - 0 \big]^2
            = n \Big\{ \tfrac{n_k}{n} \tfrac{1}{n_k} \sum_{x \in \omega_k} \big[ g_k(x;\omega) - 1 \big]^2
                      + \tfrac{n - n_k}{n} \tfrac{1}{n - n_k} \sum_{x \notin \omega_k} \big[ g_k(x;\omega) \big]^2 \Big\}

  \lim_{n \to \infty} \tfrac{1}{n} E(\omega)
            = P(\omega_k) \int \big[ g_k(x;\omega) - 1 \big]^2 p(x \mid \omega_k)\, dx
              + P(\omega_{i \ne k}) \int \big[ g_k(x;\omega) \big]^2 p(x \mid \omega_{i \ne k})\, dx
            = \int \big[ g_k^2(x;\omega) - 2 g_k(x;\omega) + 1 \big]\, p(x, \omega_k)\, dx
              + \int g_k^2(x;\omega)\, p(x, \omega_{i \ne k})\, dx
            = \int g_k^2(x;\omega)\, p(x)\, dx - 2 \int g_k(x;\omega)\, p(x, \omega_k)\, dx + \int p(x, \omega_k)\, dx
            = \int \big[ g_k(x;\omega) - P(\omega_k \mid x) \big]^2 p(x)\, dx + C

- Minimizing the squared error therefore drives the network output g_k(x;\omega) toward the posterior probability P(\omega_k \mid x), i.e., toward the maximum posterior probability (MPP) decision.

Practical Improvements to Backpropagation

Activation (Threshold) Function
- The signum function:
  f(x) = \mathrm{signum}(x) = \begin{cases} +1 & x \ge 0 \\ -1 & x < 0 \end{cases}
- The sigmoid function:
  f(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}
  - Nonlinear
  - Saturates
  - Continuity and smoothness
  - Monotonicity (so f'(x) > 0)
- Improved (antisymmetric) sigmoid:
  f(x) = \frac{2a}{1 + \exp(-2bx)} - a = a \tanh(bx)
  - Centered at zero
  - Antisymmetric (odd), which leads to faster learning
  - With a = 1.716 and b = 2/3, f'(0) is roughly 1 and the linear range is roughly -1 < x < +1
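
A small sketch of the improved antisymmetric sigmoid above (my own illustration; the parameter values a = 1.716 and b = 2/3 follow the slide, and the tanh form is the standard algebraic equivalence): it prints a few values to show the function is odd, centered at zero, and saturates at +/-a, and estimates the slope at zero numerically.

```python
import numpy as np

def improved_sigmoid(x, a=1.716, b=2.0 / 3.0):
    """Antisymmetric sigmoid f(x) = 2a / (1 + exp(-2*b*x)) - a, i.e. a*tanh(b*x)."""
    return 2.0 * a / (1.0 + np.exp(-2.0 * b * x)) - a

xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(improved_sigmoid(xs))                      # odd: f(-x) = -f(x); f(0) = 0
print(improved_sigmoid(np.array([100.0])))       # saturates toward +a = +1.716

eps = 1e-6
slope_at_zero = (improved_sigmoid(eps) - improved_sigmoid(-eps)) / (2 * eps)
print(slope_at_zero)                             # a*b ~= 1.14, close to 1
```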

Data Standardization
- Problem with the units of the inputs: features in different units can differ greatly in magnitude, and even features in the same units can differ in magnitude.
- Standardization (scaling the input):
  - Shift the input patterns so that the average of each feature over the training set is zero.
  - Scale the full data set so that each feature component has the same variance (around 1.0).

Target Values (output)
- Instead of a one-of-c code (c is the number of classes), use +1 / -1:
  - +1 indicates the target category
  - -1 indicates a non-target category
- This gives faster convergence.
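
The two preprocessing steps above are easy to express directly. The sketch below (my own illustration with made-up data; the feature values and class labels are assumptions) standardizes each feature to zero mean and roughly unit variance and encodes class labels as +1/-1 targets instead of a 1/0 one-of-c code.

```python
import numpy as np

# Toy training set: rows are samples, columns are features on very different scales.
X = np.array([[1000.0, 0.2],
              [1500.0, 0.1],
              [ 800.0, 0.4],
              [1200.0, 0.3]])
labels = np.array([0, 2, 1, 0])          # class indices for c = 3 classes

# Standardization: shift each feature to zero mean, scale to unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))                # ~0 for every feature
print(X_std.std(axis=0))                 # ~1 for every feature

# Target encoding: +1 for the target category, -1 for every non-target category.
c = 3
T = -np.ones((len(labels), c))
T[np.arange(len(labels)), labels] = 1.0
print(T)
```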

Number of Hidden Layers
- The number of hidden layers governs the expressive power of the network, and also the complexity of the decision boundary.
- More hidden layers -> higher expressive power -> better tuned to the particular training set -> poorer performance on the testing set.
- Rule of thumb:
  - Choose the number of weights to be roughly n/10, where n is the total number of samples in the training set.
  - Start with a large number of hidden units, then decay, prune, or eliminate weights.

Initializing Weights
- Cannot start with all-zero weights.
- For fast and uniform learning (all weights reach their final equilibrium values at about the same time):
  - Choose weights randomly from a uniform distribution to help ensure uniform learning.
  - Make negative and positive weights equally likely.
  - Set the weights so that the net (integration) value at a hidden unit is in the range -1 to +1.
  - Input-to-hidden weights: uniform in (-1/\sqrt{d}, +1/\sqrt{d}), where d is the input dimension.
  - Hidden-to-output weights: uniform in (-1/\sqrt{n_H}, +1/\sqrt{n_H}), where n_H is the number of connected (hidden) units.
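
A minimal sketch of the initialization recipe above (my own illustration; the sizes d = 10, n_H = 5, and the output width are assumptions): weights are drawn from uniform distributions over (-1/sqrt(d), +1/sqrt(d)) and (-1/sqrt(n_H), +1/sqrt(n_H)), so positive and negative values are equally likely.

```python
import numpy as np

def init_weights(d, n_H, n_out, seed=0):
    """Uniform random initialization:
    input-to-hidden weights in (-1/sqrt(d), +1/sqrt(d)),
    hidden-to-output weights in (-1/sqrt(n_H), +1/sqrt(n_H))."""
    rng = np.random.default_rng(seed)
    w_ih = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), size=(d, n_H))
    w_ho = rng.uniform(-1 / np.sqrt(n_H), 1 / np.sqrt(n_H), size=(n_H, n_out))
    return w_ih, w_ho

w_ih, w_ho = init_weights(d=10, n_H=5, n_out=2)
print(w_ih.min(), w_ih.max())   # within +/- 1/sqrt(10) ~ +/- 0.316
print(w_ho.min(), w_ho.max())   # within +/- 1/sqrt(5)  ~ +/- 0.447
```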

Learning Rate
- The optimal learning rate:
  c_{opt} = \left( \frac{\partial^2 E}{\partial \omega^2} \right)^{-1}
- Calculate the 2nd derivative of the objective function with respect to each weight.
- Set the optimal learning rate separately for each weight.
- A learning rate of 0.1 is often adequate.
- The maximum learning rate is c_{max} = 2 c_{opt}; for c_{opt} < c < 2 c_{opt} the convergence is slow (oscillatory).

Plateaus (Flat Surfaces in the Error Surface)
- Plateaus are regions where the derivative \partial E / \partial \omega_{st} is very small, e.g., when the sigmoid saturates.
- Momentum allows the network to learn more quickly when plateaus exist in the error surface.
- Plain BP update:
  \omega_{st}(k+1) = \omega_{st}(k) - c(k) \frac{\partial E(k)}{\partial \omega_{st}(k)}
- With momentum:
  \omega_{st}(k+1) = \omega_{st}(k) + (1 - \alpha)\, \Delta\omega_{bp}(k) + \alpha \big( \omega_{st}(k) - \omega_{st}(k-1) \big)
  where \Delta\omega_{bp}(k) is the plain BP change and \alpha is the momentum term.
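
The momentum update above can be written as a small helper. The sketch below (my own illustration; the 1-D quadratic error, the learning rate c, and the momentum term alpha = 0.9 are assumptions) applies the rule omega(k+1) = omega(k) + (1 - alpha)*Delta_bp(k) + alpha*(omega(k) - omega(k-1)) for a few steps.

```python
def momentum_step(w, grad, prev_delta, c=0.1, alpha=0.9):
    """One update with momentum:
    w(k+1) = w(k) + (1 - alpha) * delta_bp(k) + alpha * (w(k) - w(k-1)),
    where delta_bp(k) = -c * dE/dw(k) is the plain back-propagation change
    and prev_delta = w(k) - w(k-1) is the previous change."""
    delta = (1 - alpha) * (-c * grad) + alpha * prev_delta
    return w + delta, delta

# Usage on a 1-D error surface E(w) = 0.5 * w^2, whose gradient is simply w.
w, prev_delta = 5.0, 0.0
for k in range(5):
    w, prev_delta = momentum_step(w, grad=w, prev_delta=prev_delta)
    print(f"step {k}: w = {w:.4f}")
```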

Weight Decay
- Should almost always lead to improved performance.
- \omega_{new} = \omega_{old}\,(1 - \epsilon)

Batch Training vs. On-line Training
- Batch training: add up the weight changes for all the training patterns and apply them in one go; this is true gradient descent (GD).
- On-line training: update all the weights immediately after processing each training pattern; not true GD, but learning is often faster.

Other Improvements
- Other error functions, e.g., the Minkowski error.
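
To contrast the two training modes and the decay rule, here is a small sketch (my own illustration; the linear unit, the synthetic data, the learning rate, and the decay factor eps are assumptions): a batch epoch sums the per-pattern weight changes and applies them once, an on-line epoch updates after every pattern, and weight decay shrinks the weights by (1 - eps) after each epoch.

```python
import numpy as np

def grad_single(w, x, t):
    """Gradient of the squared error 0.5*(t - w.x)^2 for one pattern (linear unit)."""
    return -(t - w @ x) * x

def batch_epoch(w, X, T, c=0.01):
    """Batch training: sum the weight changes over all patterns, apply in one go (true GD)."""
    g = sum(grad_single(w, x, t) for x, t in zip(X, T))
    return w - c * g

def online_epoch(w, X, T, c=0.01):
    """On-line training: update immediately after each pattern (not true GD)."""
    for x, t in zip(X, T):
        w = w - c * grad_single(w, x, t)
    return w

def weight_decay(w, eps=0.001):
    """Weight decay: w_new = w_old * (1 - eps)."""
    return w * (1.0 - eps)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
true_w = np.array([1.0, -2.0, 0.5])
T = X @ true_w
w_batch = np.zeros(3)
w_online = np.zeros(3)
for _ in range(200):
    w_batch = weight_decay(batch_epoch(w_batch, X, T))
    w_online = weight_decay(online_epoch(w_online, X, T))
print("batch  :", w_batch)    # both move toward the underlying weights,
print("on-line:", w_online)   # slightly shrunk by the decay
```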

Further Discussions
- How to draw the decision boundary of a BPNN?
- How to set the range of valid outputs:
  - 0-0.5 and 0.5-1?
  - 0-0.2 and 0.8-1?
  - 0.1-0.2 and 0.8-0.9?
- The importance of having symmetric initial input.