
Neural Networks Xiaojin Zhu jerryzhu@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison slide 1

Terminator 2 (1991) JOHN: Can you learn? So you can be... you know. More human. Not such a dork all the time. TERMINATOR: My CPU is a neural-net processor... a learning computer. But Skynet presets the switch to "read-only" when we are sent out alone. TERMINATOR: Basically. (starting the engine, backing out) The Skynet funding bill is passed. The system goes on-line August 4th, 1997. Human decisions are removed from strategic defense. Skynet begins to learn, at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29. In a panic, they try to pull the plug. SARAH: And Skynet fights back. TERMINATOR: Yes. It launches its ICBMs against their targets in Russia. SARAH: Why attack Russia? TERMINATOR: Because Skynet knows the Russian counter-strike will remove its enemies here. We'll learn how to set the neural net. slide 2

Outline: A single neuron (linear perceptron, non-linear perceptron); learning of a single perceptron; the power of a single perceptron. Neural network: a network of neurons; layers, hidden units; learning of a neural network: backpropagation; the power of a neural network; issues. Everything revolves around gradient descent. slide 3

Biological neurons. Human brain: 100,000,000,000 neurons. Each neuron receives input from 1,000 others. Impulses arrive simultaneously and are added together; an impulse can either increase or decrease the possibility of nerve pulse firing. If sufficiently strong, a nerve pulse is generated. The pulse forms the input to other neurons. The interface of two neurons is called a synapse. http://www.bris.ac.uk/synaptic/public/brainbasic.html slide 4

Example: ALVINN steering direction [Pomerleau, 1995] slide 5

Linear perceptron. Perceptron = a math model for a single neuron. Input: x1, …, xd (signals from other neurons). Weights: w1, …, wd (dendrites, can be negative). We sneak in a constant (bias term) x0=1, with some weight w0. Activation function: linear (for the time being): a = w0*x0 + w1*x1 + … + wd*xd. This is the output of a linear perceptron. [Figure: a perceptron with inputs x0=1, x1, …, xd, weights w0, …, wd, and output a = Σ_{d=0..D} wd*xd] slide 6
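
As a concrete illustration (not from the slides), here is a minimal Python sketch of the linear perceptron's output computation; the function name and the example numbers are made up.

    # Minimal sketch of a linear perceptron: a = w0*x0 + w1*x1 + ... + wd*xd,
    # with the bias handled by a constant input x0 = 1.
    def linear_perceptron(w, x):
        """w has D+1 entries (w[0] is the bias weight w0); x has D entries."""
        a = w[0] * 1.0                     # bias term: x0 = 1
        for wd, xd in zip(w[1:], x):
            a += wd * xd
        return a

    # Example with made-up numbers: w0=0.5, w1=2.0, w2=-1.0, inputs x=(3.0, 1.0)
    print(linear_perceptron([0.5, 2.0, -1.0], [3.0, 1.0]))   # 0.5 + 6.0 - 1.0 = 5.5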

Learning in linear perceptron. Regression. Training data {(X1, y1), …, (XN, yN)}. X1 is a vector (x11, …, x1d), and so are X2, …, XN; y1 is a real-valued output. Goal: learn the weights w0, …, wd so that, given input Xi, the output ai of the perceptron is close to yi. Define "close": E = ½ Σ_{i=1..N} (ai − yi)². E is the error. Given the training set, E is a function of w0, …, wd. Minimize E: unconstrained optimization with variables w0, …, wd. slide 7

Learning in linear perceptron. Gradient descent: W ← W − η∇E(W), where η is a small constant (the learning rate, i.e. step size). The gradient descent rule: with E(W) = ½ Σ_{i=1..N} (ai − yi)², we have ∂E/∂wd = Σ_{i=1..N} (ai − yi) xid, so wd ← wd − η Σ_{i=1..N} (ai − yi) xid. Repeat until E converges. E is convex in W: there is a unique global minimum. slide 8
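
A minimal Python sketch of this batch gradient descent rule; the toy dataset, learning rate, and iteration count are made-up illustrations, not from the slides.

    # Batch gradient descent for a linear perceptron: wd <- wd - eta * sum_i (ai - yi) * xid
    def train_linear_perceptron(X, y, eta=0.01, iters=1000):
        X = [[1.0] + list(xi) for xi in X]        # sneak in the bias input x0 = 1
        w = [0.0] * len(X[0])
        for _ in range(iters):
            a = [sum(wd * xd for wd, xd in zip(w, xi)) for xi in X]    # outputs ai
            for d in range(len(w)):               # one gradient step per weight
                grad_d = sum((a[i] - y[i]) * X[i][d] for i in range(len(X)))
                w[d] -= eta * grad_d
        return w

    # Tiny made-up example: fit y = 2*x + 1; prints weights close to [1.0, 2.0]
    print(train_linear_perceptron([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0]))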

The (limited) power of linear perceptron. A linear perceptron is just a = w·X, where X is the input vector augmented by x0=1. It can represent any linear function in (D+1)-dimensional space, but that's it. In particular, it won't be a nice fit to binary classification (y=0 or y=1). slide 9

Non-linear perceptron. Change the activation function: use a step function. a = g(w0*x0 + w1*x1 + … + wd*xd), where g(h) = 0 if h < 0 and g(h) = 1 if h ≥ 0. [Figure: perceptron with inputs x0=1, x1, …, xd, weights w0, …, wd, and output a = g(Σ_{d=0..D} wd*xd)] Can you see how to make logic AND, OR, NOT with such a perceptron? slide 10

Linear Threshold Unit (LTU): our first non-linear perceptron. Change the activation function: use a step function. a = g(w0*x0 + w1*x1 + … + wd*xd), where g(h) = 0 if h < 0 and g(h) = 1 if h ≥ 0. [Figure: same perceptron diagram with output a = g(Σ_{d=0..D} wd*xd)] AND: w1=w2=1, w0=−1.5. OR: w1=w2=1, w0=−0.5. NOT: w1=−1, w0=0.5. Now we see the reason for bias terms. slide 11
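
A small Python sketch (assuming the step function above and the listed weights) that checks these gate constructions; the function names are illustrative.

    # LTU: output g(w0 + w1*x1 + w2*x2) with a step activation g.
    def step(h):
        return 1 if h >= 0 else 0

    def ltu(w0, w1, x1, w2=0.0, x2=0.0):
        return step(w0 + w1 * x1 + w2 * x2)

    # AND: w1=w2=1, w0=-1.5; OR: w1=w2=1, w0=-0.5; NOT: w1=-1, w0=0.5
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, "AND:", ltu(-1.5, 1, x1, 1, x2), "OR:", ltu(-0.5, 1, x1, 1, x2))
        print(x1, "NOT:", ltu(0.5, -1, x1))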

Sigmoid activation function: our second non-linear perceptron. The problem with the LTU: the step function is discontinuous, so we cannot use gradient descent. Change the activation function (again): use a sigmoid function, g(h) = 1 / (1 + exp(−h)). Exercise: g′(h) = ? slide 12

Sigmoid activation function: our second non-linear perceptron. The problem with the LTU: the step function is discontinuous, so we cannot use gradient descent. Change the activation function (again): use a sigmoid function, g(h) = 1 / (1 + exp(−h)). Exercise: g′(h) = g(h) (1 − g(h)). slide 13
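
A quick numeric check of this identity (a made-up illustration using a finite-difference approximation of the derivative):

    import math

    def g(h):
        return 1.0 / (1.0 + math.exp(-h))

    # Compare the closed form g(h)*(1 - g(h)) with a finite-difference estimate of g'(h).
    for h in (-2.0, 0.0, 3.0):
        closed_form = g(h) * (1.0 - g(h))
        finite_diff = (g(h + 1e-6) - g(h - 1e-6)) / 2e-6
        print(h, closed_form, finite_diff)   # the two estimates agree to ~6 decimal places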

Learning in non-linear perceptron. Again we will minimize the error, but now ai = g(Σ_d wd*xid). E(W) = ½ Σ_{i=1..N} (ai − yi)², so ∂E/∂wd = Σ_{i=1..N} (ai − yi) ai (1 − ai) xid. The sigmoid perceptron update rule: wd ← wd − η Σ_{i=1..N} (ai − yi) ai (1 − ai) xid, where η is a small constant (the learning rate, i.e. step size). Repeat until E converges. slide 14
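
A minimal Python sketch of this update rule; the training data (the OR truth table) and the hyperparameters are made up for illustration.

    import math

    # Sigmoid perceptron trained with wd <- wd - eta * sum_i (ai - yi) * ai * (1 - ai) * xid
    def train_sigmoid_perceptron(X, y, eta=0.5, iters=5000):
        X = [[1.0] + list(xi) for xi in X]                 # bias input x0 = 1
        w = [0.0] * len(X[0])
        g = lambda h: 1.0 / (1.0 + math.exp(-h))
        for _ in range(iters):
            a = [g(sum(wd * xd for wd, xd in zip(w, xi))) for xi in X]
            for d in range(len(w)):
                w[d] -= eta * sum((a[i] - y[i]) * a[i] * (1 - a[i]) * X[i][d]
                                  for i in range(len(X)))
        return w

    # Made-up example: train on the OR truth table; outputs drift toward the 0/1 targets.
    print(train_sigmoid_perceptron([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 1]))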

The (limited) power of non-linear perceptron. Even with a non-linear sigmoid function, the decision boundary a perceptron can produce is still linear. AND, OR, NOT revisited: how about XOR? slide 15

The (limited) power of non-linear perceptron. Even with a non-linear sigmoid function, the decision boundary a perceptron can produce is still linear. AND, OR, NOT revisited: how about XOR? This contributed to the first AI winter. slide 16

(Multi-layer) neural network. Given sigmoid perceptrons, can you produce output like this, which has a non-linear decision boundary? [Figure: inputs labeled 0 1 0 1 0 in a pattern that is not linearly separable] slide 17

Multi-layer neural network. There are many ways to connect perceptrons into a network. One standard way is multi-layer neural nets: one hidden layer (we can't see its output) and one output layer. Each hidden unit computes v_j = g(Σ_{k=1..N_INS} w_jk x_k), and the output unit computes Out = g(Σ_{k=1..N_HID} W_k v_k). [Figure: network with inputs x1, x2, hidden units v1, v2, v3, weights w11, …, w32, and output Out; from Andrew Moore] slide 18
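
A minimal sketch of this forward pass (one hidden layer, one sigmoid output unit) in Python; the layer sizes and weights below are made-up placeholders, and bias terms are omitted for brevity.

    import math

    def g(h):
        return 1.0 / (1.0 + math.exp(-h))

    # Forward pass: v_j = g(sum_k w[j][k] * x[k]),  Out = g(sum_j W[j] * v[j]).
    def forward(x, w_hidden, W_out):
        v = [g(sum(wjk * xk for wjk, xk in zip(wj, x))) for wj in w_hidden]
        return g(sum(Wj * vj for Wj, vj in zip(W_out, v)))

    # Made-up weights: 2 inputs, 3 hidden units, 1 output.
    w_hidden = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
    W_out = [1.0, -2.0, 0.5]
    print(forward([1.0, 2.0], w_hidden, W_out))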

The (unlimited) power of neural network. In theory we don't need too many layers: a 1-hidden-layer net with enough hidden units can represent any continuous function of the inputs with arbitrary accuracy; a 2-hidden-layer net can even represent discontinuous functions. slide 19

Neural net for K-way classification. Use K output units. During training, encode a label y by an indicator vector with K entries: class1 = (1,0,0,…,0), class2 = (0,1,0,…,0), etc. During test (decoding), choose the class corresponding to the largest output unit. [Figure: network with inputs x1, x2, hidden units v_j = g(Σ_{k=1..N_INS} w_jk x_k), and output units out_1, …, out_K, each out_c = g(Σ_{k=1..N_HID} W_ck v_k)] slide 20
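
A small illustration of this encoding/decoding scheme; the helper names and the example numbers are made up.

    # One-hot (indicator) encoding of a label, and argmax decoding of the K outputs.
    def encode(label, K):
        return [1.0 if c == label else 0.0 for c in range(K)]

    def decode(outputs):
        return max(range(len(outputs)), key=lambda c: outputs[c])

    print(encode(1, 4))                    # class index 1 of 4 -> [0.0, 1.0, 0.0, 0.0]
    print(decode([0.1, 0.7, 0.15, 0.05]))  # largest output unit -> class index 1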

Example Y encoding [Pomerleau, 1995] slide 21

Obtaining training data [Pomerleau, 1995] slide 22

Learning in neural network. Again we will minimize the error (now with K outputs): E(W) = ½ Σ_{i=1..N} Σ_{c=1..K} (o_ic − Y_ic)², where i indexes the i-th training point, o_ic is the c-th output for the i-th training point, and Y_ic is the c-th element of the i-th label indicator vector. Our variables are all the weights w on all the edges. Apparent difficulty: we don't know the correct output of the hidden units. It turns out to be OK: we can still do gradient descent. The trick you need is the chain rule. The algorithm is known as back-propagation. slide 23

Backpropagation algorithm (page 1). BACKPROPAGATION(training set, η, D, n_hidden, K). Training set: {(X1, Y1), …, (XN, YN)}, where Xi is a feature vector of size D and Yi is an output vector of size K; η is the learning rate (step size in gradient descent); n_hidden is the number of hidden units. Create a neural network with D inputs, n_hidden hidden units, and K outputs; connect each layer. Initialize all weights to small random numbers (e.g. between −0.05 and 0.05). Repeat the next page until the termination condition is met. slide 24

Backpropagation algorithm (page 2). For each training example (X, Y): Propagate the input forward through the network: input X to the network and compute the output o_u for every unit u in the network. Propagate the errors backward through the network: for each output unit c, compute its error term δ_c = (o_c − y_c) o_c (1 − o_c); for each hidden unit h, compute its error term δ_h = [Σ_{i∈succ(h)} w_ih δ_i] o_h (1 − o_h). Update each weight: w_ji ← w_ji − η δ_j x_ji, where x_ji is the input from unit i into unit j (o_i if i is a hidden unit; X_i if i is an input unit), and w_ji is the weight from unit i to unit j. slide 25
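
A minimal Python sketch of this algorithm (one hidden layer of sigmoid units, K sigmoid outputs, online updates); the dataset, hyperparameters, and function name are made up, and bias inputs are handled by appending a constant 1 to X and to the hidden activations.

    import math, random

    def g(h):
        return 1.0 / (1.0 + math.exp(-h))

    def backprop_train(data, D, n_hidden, K, eta=0.5, epochs=10000, seed=0):
        rnd = random.Random(seed)
        # w[h][i]: input i -> hidden h (incl. bias); W[c][h]: hidden h -> output c (incl. bias)
        w = [[rnd.uniform(-0.05, 0.05) for _ in range(D + 1)] for _ in range(n_hidden)]
        W = [[rnd.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)] for _ in range(K)]
        for _ in range(epochs):
            for X, Y in data:
                x = list(X) + [1.0]                                     # bias input
                # propagate the input forward
                oh = [g(sum(wi * xi for wi, xi in zip(wh, x))) for wh in w] + [1.0]
                oc = [g(sum(Wh * vh for Wh, vh in zip(Wc, oh))) for Wc in W]
                # propagate the errors backward (error terms delta)
                d_out = [(oc[c] - Y[c]) * oc[c] * (1 - oc[c]) for c in range(K)]
                d_hid = [sum(W[c][h] * d_out[c] for c in range(K)) * oh[h] * (1 - oh[h])
                         for h in range(n_hidden)]
                # weight updates: w_ji <- w_ji - eta * delta_j * x_ji
                for c in range(K):
                    for h in range(n_hidden + 1):
                        W[c][h] -= eta * d_out[c] * oh[h]
                for h in range(n_hidden):
                    for i in range(D + 1):
                        w[h][i] -= eta * d_hid[h] * x[i]
        return w, W

    # Made-up example: try to fit XOR with 2 hidden units and a single output.
    # Whether it succeeds depends on the random initialization (local minima, see slide 29).
    data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
    w, W = backprop_train(data, D=2, n_hidden=2, K=1)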

Derivation of backpropagation. For simplicity we assume online learning (as opposed to batch learning): one step of gradient descent after seeing each training example (X, Y). For each (X, Y), the error is E(W) = ½ Σ_{c=1..K} (o_c − Y_c)², where o_c is the c-th output unit (when the input is X) and Y_c is the c-th element of the label indicator vector. Use gradient descent to change all the weights w_ji to minimize the error. Separate two cases: Case 1: w_ji when j is an output unit; Case 2: w_ji when j is a hidden unit. slide 26

Case 1: weights of an output unit. For a weight w_ji into an output unit j: ∂Error/∂w_ji = ∂/∂w_ji ½ (o_j − y_j)² = ∂/∂w_ji ½ (g(Σ_m w_jm x_jm) − y_j)² = (o_j − y_j) o_j (1 − o_j) x_ji. Gradient descent: to minimize the error, run away from the partial derivative: w_ji ← w_ji − η ∂Error/∂w_ji = w_ji − η (o_j − y_j) o_j (1 − o_j) x_ji. slide 27

Case 2: weights of a hidden unit. For a weight w_ji into a hidden unit j, the error depends on w_ji only through the output units c ∈ succ(j) that j feeds into, so by the chain rule: ∂Error/∂w_ji = Σ_{c∈succ(j)} (o_c − y_c) ∂o_c/∂w_ji = Σ_{c∈succ(j)} (o_c − y_c) o_c (1 − o_c) w_cj ∂o_j/∂w_ji = [Σ_{c∈succ(j)} (o_c − y_c) o_c (1 − o_c) w_cj] o_j (1 − o_j) x_ji. Plugging this into gradient descent gives the hidden-unit update on slide 25. slide 28
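
To sanity-check this chain-rule result, here is a small finite-difference comparison in Python between the backprop gradient for one hidden-layer weight and a numerical estimate of ∂Error/∂w_ji; the network sizes, weights, and target values are made up.

    import math

    def g(h):
        return 1.0 / (1.0 + math.exp(-h))

    # Tiny 2-2-2 network without biases; squared error E = 0.5 * sum_c (o_c - y_c)^2.
    x = [0.3, -0.7]
    y = [1.0, 0.0]
    w = [[0.2, -0.4], [0.5, 0.1]]    # w[j][i]: input i -> hidden j
    W = [[0.3, -0.2], [0.7, 0.6]]    # W[c][j]: hidden j -> output c

    def error(w):
        oh = [g(sum(wji * xi for wji, xi in zip(wj, x))) for wj in w]
        oc = [g(sum(Wcj * vj for Wcj, vj in zip(Wc, oh))) for Wc in W]
        return 0.5 * sum((oc[c] - y[c]) ** 2 for c in range(len(y)))

    # Backprop gradient for hidden weight w[0][1] (hidden unit j=0, input i=1).
    oh = [g(sum(wji * xi for wji, xi in zip(wj, x))) for wj in w]
    oc = [g(sum(Wcj * vj for Wcj, vj in zip(Wc, oh))) for Wc in W]
    backprop = sum((oc[c] - y[c]) * oc[c] * (1 - oc[c]) * W[c][0] for c in range(2)) \
               * oh[0] * (1 - oh[0]) * x[1]

    # Finite-difference estimate of the same partial derivative.
    eps = 1e-6
    w[0][1] += eps; e_plus = error(w)
    w[0][1] -= 2 * eps; e_minus = error(w)
    w[0][1] += eps
    numeric = (e_plus - e_minus) / (2 * eps)
    print(backprop, numeric)   # the two numbers agree to several decimal places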

Neural network weight learning issues. When to terminate backpropagation (overfitting and early stopping)? After a fixed number of iterations (ok); when the training error is less than a threshold (wrong); when the holdout set error starts to go up (ok). Local optima: the weights will converge to a local minimum. Learning rate: convergence is sensitive to the learning rate, and weight learning can be rather slow. slide 29

Sensitivity to learning rate From J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1994. slide 30

Neural network weight learning issues. Use momentum (a heuristic?) to dampen gradient descent: let Δw(t−1) be last time's change to w; then Δw(t) = −η ∂E(W)/∂w + α Δw(t−1), where α is the momentum coefficient, and w ← w + Δw(t). Alternatives to gradient descent: Newton-Raphson, conjugate gradient. slide 31
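
A minimal sketch of the momentum update on a toy one-dimensional error function; the quadratic error and the values of η and α are made up for illustration.

    # Gradient descent with momentum on a toy error E(w) = (w - 3)^2, so dE/dw = 2*(w - 3).
    eta, alpha = 0.1, 0.9          # learning rate and momentum coefficient (made-up values)
    w, dw_prev = 0.0, 0.0
    for t in range(300):
        grad = 2.0 * (w - 3.0)
        dw = -eta * grad + alpha * dw_prev   # delta_w(t) = -eta * dE/dw + alpha * delta_w(t-1)
        w += dw
        dw_prev = dw
    print(w)                       # approaches the minimizer w = 3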

Neural network structure learning issues. How many hidden units? How many layers? How to connect units? Use cross validation. slide 32