Artificial Neural Networks

Artificial Neural Networks
- Threshold units
- Gradient descent
- Multilayer networks
- Backpropagation
- Hidden layer representations
- Example: face recognition
- Advanced topics

Connectionist Models
Consider humans:
- Neuron switching time: ~0.001 second
- Number of neurons: ~10^10
- Connections per neuron: ~10^4 to 10^5
- Scene recognition time: ~0.1 second
- 100 inference steps doesn't seem like enough, so much of the computation must be parallel

Connectionist Models
Properties of artificial neural networks (ANNs):
- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed processing
- Emphasis on tuning weights automatically

When to Consider Neural Networks
- Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
- Output is discrete or real-valued
- Output is a vector of values
- Possibly noisy data
- Form of target function is unknown
- Human readability of result is unimportant

When to Consider Neural Networks
Examples:
- Speech phoneme recognition [Waibel]
- Image classification [Kanade, Baluja, Rowley]
- Financial prediction

ALVINN drives 70 mph on highways
[Figure: the ALVINN network, a 30x32 sensor input retina feeding 4 hidden units and 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right.]

Perceptron
[Figure: inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a constant input x_0 = 1 with weight w_0, feeding a summing unit that computes $\sum_{i=0}^{n} w_i x_i$ and a threshold.]
$$o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$
i.e.,
$$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \dots + w_n x_n > 0 \\ -1 & \text{otherwise.} \end{cases}$$
Sometimes we'll use simpler vector notation:
$$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise.} \end{cases}$$
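A minimal sketch of this unit in NumPy (illustrative, not from the slides), with x[0] = 1 as the constant bias input and one choice of weights that happens to implement AND(x_1, x_2):

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: +1 if w . x > 0, otherwise -1 (x[0] is the constant input 1)."""
    return 1 if np.dot(w, x) > 0 else -1

# one choice of weights that implements AND(x1, x2) for inputs in {0, 1}:
# the weighted sum w0 + x1 + x2 only exceeds 0 when both inputs are 1
w_and = np.array([-1.5, 1.0, 1.0])
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), perceptron_output(w_and, np.array([1.0, x1, x2])))
```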

Decision Surface of a Perceptron
[Figure: (a) a set of + and - points in the (x_1, x_2) plane that a single line separates; (b) an arrangement of + and - points that no single line can separate.]
- Represents some useful functions: what weights represent $g(x_1, x_2) = \mathrm{AND}(x_1, x_2)$?
- But some functions are not representable, e.g., those that are not linearly separable
- Will need more complicated models to represent these

Perceptron training rule
$$w_i \leftarrow w_i + \Delta w_i, \qquad \text{where } \Delta w_i = \eta (t - o) x_i$$
Where:
- $t = c(\vec{x})$ is the target value
- $o$ is the perceptron output
- $\eta$ is a small constant (e.g., 0.1) called the learning rate

Perceptron training rule
Can prove it will converge
- if the training data is linearly separable
- and $\eta$ is sufficiently small
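A minimal sketch of repeated sweeps with this rule, assuming targets in {-1, +1} and a bias input x[0] = 1 (the helper name train_perceptron is illustrative):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    """X: rows with x[0] = 1 as bias input; t: targets in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o = 1 if np.dot(w, x_d) > 0 else -1      # perceptron output
            w += eta * (t_d - o) * x_d               # w_i <- w_i + eta*(t - o)*x_i
    return w

# OR is linearly separable, so the rule converges to a separating weight vector
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([-1, 1, 1, 1])
print(train_perceptron(X, t))
```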

Gradient Descent
To understand, consider a simpler linear unit, where
$$o = w_0 + w_1 x_1 + \dots + w_n x_n$$
Let's learn $w_i$'s that minimize the squared error
$$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where $D$ is the set of training examples.
This defines the loss function that we want to minimize!

Gradient Descent
[Figure: the error surface E[w] plotted over the (w_0, w_1) plane: a parabolic bowl with a single global minimum.]
Gradient:
$$\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]$$

Training rule:
$$\Delta \vec{w} = -\eta \nabla E[\vec{w}]$$
i.e.,
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$
Update: $w_i \leftarrow w_i + \Delta w_i$

Gradient Descent
Derivation:
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) \\
&= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d) \\
\frac{\partial E}{\partial w_i} &= \sum_d (t_d - o_d)(-x_{i,d})
\end{aligned}$$
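A quick way to check the final expression is to compare it with a central finite-difference estimate of the partial derivatives on random data; this sketch assumes NumPy and is not part of the original slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                  # 5 examples, 3 input components each
t = rng.normal(size=5)                       # target values
w = rng.normal(size=3)                       # current weights

def E(w):                                    # E[w] = 1/2 sum_d (t_d - o_d)^2,  o_d = w . x_d
    return 0.5 * np.sum((t - X @ w) ** 2)

# analytic gradient from the derivation: sum_d (t_d - o_d)(-x_{i,d})
analytic = np.array([np.sum((t - X @ w) * (-X[:, i])) for i in range(3)])

eps = 1e-6                                   # central finite differences
numeric = np.array([(E(w + eps * np.eye(3)[i]) - E(w - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
print(np.allclose(analytic, numeric, atol=1e-5))   # expect True
```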

Gradient Descent
GRADIENT-DESCENT(training_examples, η)
  Each training example is a pair <x, t>, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
  - Initialize each $w_i$ to some small random value
  - Until the termination condition is met, Do
    - Initialize each $\Delta w_i$ to zero
    - For each <x, t> in training_examples, Do
      - Input the instance x to the unit and compute the output o
      - For each linear unit weight $w_i$, Do
        $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
    - For each linear unit weight $w_i$, Do
      $w_i \leftarrow w_i + \Delta w_i$
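The pseudocode translates almost line for line into NumPy for the linear unit $o = \vec{w} \cdot \vec{x}$; the function name, the termination test (a fixed number of epochs), and the toy dataset below are illustrative assumptions:

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, epochs=1000, seed=0):
    """training_examples: list of (x, t) pairs, where x includes the bias input x[0] = 1."""
    n = len(training_examples[0][0])
    w = np.random.default_rng(seed).uniform(-0.05, 0.05, size=n)   # small random values
    for _ in range(epochs):                       # "until the termination condition is met"
        delta_w = np.zeros(n)                     # initialize each delta_w_i to zero
        for x, t in training_examples:
            o = np.dot(w, x)                      # linear unit output
            delta_w += eta * (t - o) * x          # delta_w_i += eta*(t - o)*x_i
        w += delta_w                              # w_i <- w_i + delta_w_i
    return w

# fit t = 3 + 2x from noiseless samples; the learned weights approach [3, 2]
examples = [(np.array([1.0, x]), 3.0 + 2.0 * x) for x in np.linspace(-1, 1, 20)]
print(gradient_descent(examples))
```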

Summary
Perceptron training rule guaranteed to succeed if
- training examples are linearly separable
- learning rate η is sufficiently small
Linear unit training rule uses gradient descent
- guaranteed to converge to hypothesis with minimum squared error
- given sufficiently small learning rate η
- even when training data contains noise
- even when training data not separable by H

Incremental (Stochastic) Gradient Descent
Batch mode gradient descent: Do until satisfied
  1. Compute the gradient $\nabla E_D[\vec{w}]$
  2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$
Incremental mode gradient descent: Do until satisfied
  For each training example d in D
    1. Compute the gradient $\nabla E_d[\vec{w}]$
    2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$

$$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad\qquad E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$$
Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
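The incremental version drops the Δw accumulator and updates the weights immediately after each example; a sketch under the same assumptions as the batch version above:

```python
import numpy as np

def incremental_gradient_descent(training_examples, eta=0.01, epochs=1000):
    """Same setup as the batch version, but w is updated after every example."""
    n = len(training_examples[0][0])
    w = np.zeros(n)
    for _ in range(epochs):
        for x, t in training_examples:
            o = np.dot(w, x)
            w += eta * (t - o) * x        # step along -eta * grad E_d[w] for this example alone
    return w

examples = [(np.array([1.0, x]), 3.0 + 2.0 * x) for x in np.linspace(-1, 1, 20)]
print(incremental_gradient_descent(examples))   # also approaches [3, 2]
```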

Multilayer Networks of Sigmoid Units
[Figure: a two-layer network whose inputs are the formant frequencies F1 and F2 and whose output units correspond to the words "head", "hid", "who'd", and "hood".]

Sigmoid Unit
[Figure: inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a constant input x_0 = 1 with weight w_0, feeding a weighted sum and a sigmoid threshold.]
$$net = \sum_{i=0}^{n} w_i x_i \qquad\qquad o = \sigma(net) = \frac{1}{1 + e^{-net}}$$
$\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function.
Nice property: $\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$
We can derive gradient descent rules to train
- one sigmoid unit
- multilayer networks of sigmoid units: backpropagation
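A small sketch of σ and its derivative, including a finite-difference spot-check of the "nice property" (assumes NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # d sigma/dx = sigma(x) * (1 - sigma(x))

# spot-check the identity against a central finite difference at x = 0.7
x, eps = 0.7, 1e-6
print(sigmoid_derivative(x), (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps))
```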

Error Gradient for a Sigmoid Unit
$$\begin{aligned}
\frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\
&= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) \\
&= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right) \\
&= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}
\end{aligned}$$

But we know:
$$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d) \qquad\qquad \frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$$
So:
$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}$$
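Using this gradient directly gives a training rule for a single sigmoid unit; the sketch below assumes targets in (0, 1), a bias input x[0] = 1, and illustrative names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sigmoid_unit(X, t, eta=0.5, epochs=5000):
    """X: rows with x[0] = 1 as bias input; t: targets in (0, 1)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = sigmoid(X @ w)
        # dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}
        grad = -np.sum(((t - o) * o * (1 - o))[:, None] * X, axis=0)
        w -= eta * grad                # w <- w - eta * grad E
    return w

# learn OR with 0/1 targets
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 1.0])
print(np.round(sigmoid(X @ train_sigmoid_unit(X, t)), 2))   # close to 0, 1, 1, 1
```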

Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
  For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k:
       $$\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$$
    3. For each hidden unit h:
       $$\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k}\, \delta_k$$
    4. Update each network weight $w_{i,j}$:
       $$w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}, \quad \text{where } \Delta w_{i,j} = \eta\, \delta_j\, x_{i,j}$$
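A sketch of this algorithm for one hidden layer of sigmoid units, with per-example updates and bias inputs appended to x and to the hidden layer; the XOR demo and all names are illustrative, and convergence to a good minimum is not guaranteed for every initialization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W_ih, W_ho):
    h = sigmoid(np.append(x, 1.0) @ W_ih)      # hidden activations (bias input appended)
    o = sigmoid(np.append(h, 1.0) @ W_ho)      # network outputs (hidden bias appended)
    return h, o

def backprop(X, T, n_hidden=3, eta=0.5, epochs=10000, seed=0):
    rng = np.random.default_rng(seed)
    W_ih = rng.uniform(-0.3, 0.3, size=(X.shape[1] + 1, n_hidden))   # small random weights
    W_ho = rng.uniform(-0.3, 0.3, size=(n_hidden + 1, T.shape[1]))
    for _ in range(epochs):                                          # "until satisfied"
        for x, t in zip(X, T):
            h, o = forward(x, W_ih, W_ho)
            delta_o = o * (1 - o) * (t - o)                  # output deltas
            delta_h = h * (1 - h) * (W_ho[:-1] @ delta_o)    # hidden deltas
            W_ho += eta * np.outer(np.append(h, 1.0), delta_o)   # w_ij += eta * delta_j * x_ij
            W_ih += eta * np.outer(np.append(x, 1.0), delta_h)
    return W_ih, W_ho

# XOR is not linearly separable, but a one-hidden-layer network can learn it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W_ih, W_ho = backprop(X, T)
print(np.round([forward(x, W_ih, W_ho)[1] for x in X], 2))   # usually close to 0, 1, 1, 0
```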

More on Backpropagation
- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum
  - In practice, often works well (can run multiple times)
- Often include weight momentum α:
  $$\Delta w_{i,j}(n) = \eta\, \delta_j\, x_{i,j} + \alpha\, \Delta w_{i,j}(n-1)$$
- Minimizes error over training examples
  - Will it generalize well to subsequent examples?
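The momentum term only changes the weight update: Δw(n) adds α times the previous Δw(n−1). A small self-contained sketch of that update (names and the toy constant gradient step are illustrative):

```python
import numpy as np

def momentum_update(W, grad_step, prev_delta, alpha=0.9):
    """Apply delta_w(n) = grad_step + alpha * delta_w(n-1), where grad_step plays the
    role of the eta * delta_j * x_ij term from the backpropagation sketch above."""
    delta = grad_step + alpha * prev_delta
    return W + delta, delta

# toy illustration: repeated identical gradient steps build up speed under momentum
W = np.zeros((2, 2))
prev = np.zeros_like(W)
for _ in range(5):
    W, prev = momentum_update(W, grad_step=np.full((2, 2), 0.1), prev_delta=prev)
print(W)   # every entry exceeds 5 * 0.1 because earlier steps keep contributing
```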

- Training can take thousands of iterations, so it is slow!
- Using the network after training is very fast

Learning Hidden Layer Representations
[Figure: an 8 x 3 x 8 network; eight inputs feed three hidden units, which feed eight outputs.]
A target function (the identity on one-hot vectors):

  Input     →  Output
  10000000  →  10000000
  01000000  →  01000000
  00100000  →  00100000
  00010000  →  00010000
  00001000  →  00001000
  00000100  →  00000100
  00000010  →  00000010
  00000001  →  00000001

Can this be learned??

Learning Hidden Layer Representations
[Figure: the 8 x 3 x 8 network used for this experiment.]
Learned hidden layer:

  Input     →  Hidden values   →  Output
  10000000  →  .89  .04  .08   →  10000000
  01000000  →  .01  .11  .88   →  01000000
  00100000  →  .01  .97  .27   →  00100000
  00010000  →  .99  .97  .71   →  00010000
  00001000  →  .03  .05  .02   →  00001000
  00000100  →  .22  .99  .99   →  00000100
  00000010  →  .80  .01  .98   →  00000010
  00000001  →  .60  .94  .01   →  00000001
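The 8-3-8 experiment can be reproduced with the backpropagation sketch from earlier; in this illustrative run (learning rate, epoch count, and seed are assumptions) the three hidden activations usually drift toward a distinct code for each of the eight inputs, much like the table above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
X = np.eye(8)                                    # the eight one-hot input patterns
T = np.eye(8)                                    # identity target function
W_ih = rng.uniform(-0.3, 0.3, size=(9, 3))       # 8 inputs + bias -> 3 hidden units
W_ho = rng.uniform(-0.3, 0.3, size=(4, 8))       # 3 hidden + bias -> 8 output units
eta = 0.3

for _ in range(10000):
    for x, t in zip(X, T):
        h = sigmoid(np.append(x, 1.0) @ W_ih)            # hidden activations
        o = sigmoid(np.append(h, 1.0) @ W_ho)            # network outputs
        delta_o = o * (1 - o) * (t - o)                  # output deltas
        delta_h = h * (1 - h) * (W_ho[:-1] @ delta_o)    # hidden deltas
        W_ho += eta * np.outer(np.append(h, 1.0), delta_o)
        W_ih += eta * np.outer(np.append(x, 1.0), delta_h)

for x in X:                                      # learned hidden code for each input
    print(np.round(sigmoid(np.append(x, 1.0) @ W_ih), 2))
```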

Training
[Figure: sum of squared errors for each output unit, plotted over the course of training (x-axis 0 to 2500).]

Training
[Figure: the hidden unit encoding for input 01000000, plotted over the course of training (x-axis 0 to 2500).]

Training
[Figure: the weights from the inputs to one hidden unit, plotted over the course of training (x-axis 0 to 2500).]

Convergence of Backpropagation
Gradient descent to some local minimum
- Perhaps not the global minimum...
- Add momentum
- Stochastic gradient descent
- Train multiple nets with different initial weights
Nature of convergence
- Initialize weights near zero
- Therefore, initial networks are near-linear
- Increasingly non-linear functions become possible as training progresses

Expressive Capabilities of ANNs
Boolean functions:
- Every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs
Continuous functions:
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]

- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

Overfitting in ANNs
[Figure: error versus number of weight updates (example 1), comparing training set error with validation set error over about 20,000 updates.]

[Figure: error versus number of weight updates (example 2), again comparing training set error with validation set error, over about 6,000 updates.]

Neural Nets for Face Recognition
[Figure: a network with 30x32 image inputs and output units for head pose: left, straight, right, up.]

[Figure: typical input images.]
90% accurate at learning head pose and at recognizing 1-of-20 faces.

Learned Hidden Unit Weights
[Figure: the weights learned from the 30x32 inputs to the hidden units, visualized as images, for the network with left/straight/right/up outputs.]

Alternative Error Functions
Penalize large weights:
$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$$
Train on target slopes as well as values:
$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]$$
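The weight penalty only adds $2\gamma w_{ji}$ to each weight's gradient, so every step also shrinks the weights toward zero; a small sketch (γ, η, and the function name are illustrative):

```python
import numpy as np

def weight_decay_step(W, error_grad, eta=0.1, gamma=0.001):
    """One gradient step on E(w) = squared error + gamma * sum_ji w_ji^2:
    the penalty contributes 2*gamma*W to the gradient, shrinking every weight a little."""
    return W - eta * (error_grad + 2.0 * gamma * W)

# with a zero error gradient, the penalty alone decays the weights geometrically
W = np.array([[1.0, -2.0], [0.5, 3.0]])
for _ in range(3):
    W = weight_decay_step(W, error_grad=np.zeros_like(W))
print(W)   # each entry multiplied by (1 - 2*eta*gamma) per step
```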

Tie together weights: e.g., in a phoneme recognition network

Recurrent Networks
[Figure: (a) a feedforward network computing y(t+1) from x(t); (b) a recurrent network that also feeds context c(t) back into its input; (c) the recurrent network unfolded in time over x(t), c(t); x(t-1), c(t-1); x(t-2), c(t-2).]