CS:4420 Artificial Intelligence


CS:4420 Artificial Intelligence, Spring 2018. Neural Networks. Cesare Tinelli, The University of Iowa. Copyright 2004-18, Cesare Tinelli and Stuart Russell. These notes were originally developed by Stuart Russell and are used with permission. They are copyrighted material and may not be used in other course settings outside of the University of Iowa, in their current or modified form, without the express written consent of the copyright holders. CS:4420 Spring 2018 p.1/39

Readings: Chap. 18 of [Russell and Norvig, 2012]

Brains as Computational Devices

The brain's advantages with respect to digital computers:
- Massively parallel
- Fault-tolerant
- Reliable
- Graceful degradation

Brains and Neurons

About 10^11 neurons of more than 20 types, 10^14 synapses, 1 ms to 10 ms cycle time. Signals are noisy spike trains of electrical potential.

[Figure: anatomy of a neuron: cell body (soma), nucleus, dendrites, axon, axonal arborization, and synapses from other cells.]

Artificial Neural Networks

Artificial neural networks are inspired by brains and neurons. A neural network is a graph with nodes, or units, connected by links. Each link has an associated weight, a real number. Typically, each node i outputs a real number, which is fed as input to the nodes connected to i. The output of a node is a function of the weighted sum of the node's inputs.

A Neural Network Unit (McCulloch & Pitts model)

[Figure: a unit i with input links carrying activations a_j, each scaled by a weight W_{j,i}; a bias weight W_{0,i} on the fixed input a_0 = 1; an input function computing in_i = Σ_j W_{j,i} a_j; an activation function g; and the output a_i on the output links.]

The output is a squashed linear function of the inputs:

  a_i <- g(in_i) = g(Σ_j W_{j,i} a_j)

This is a gross oversimplification of real neurons, but it is meant to develop an understanding of what networks of simple units can do.

Possible Activation Functions

[Figure: (a) step function, (b) sign function, (c) sigmoid function.]

  step_t(x) = 1 if x >= t, 0 if x < t
  sign(x) = +1 if x >= 0, -1 if x < 0
  sigmoid(x) = 1 / (1 + e^(-x))
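These activation functions can be written down directly; a minimal Python sketch (the function names are my own):

```python
import math

def step(x, t=0.0):
    # step_t: outputs 1 once the input reaches the threshold t, else 0
    return 1.0 if x >= t else 0.0

def sign(x):
    # sign: +1 for non-negative inputs, -1 otherwise
    return 1.0 if x >= 0 else -1.0

def sigmoid(x):
    # sigmoid: smoothly squashes the reals into (0, 1); unlike step and sign
    # it is differentiable, which is what gradient-based learning needs
    return 1.0 / (1.0 + math.exp(-x))
```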

Normalizing Unit Thresholds

If t is the threshold value of the output unit, then

  step_t(Σ_{j=1..n} W_j I_j) = step_0(Σ_{j=0..n} W_j I_j)

where W_0 = -t and I_0 = 1. So we can always assume that the unit's threshold is 0. This allows thresholds to be learned like any other weight. Then we can even allow output values in [0,1] by replacing step_0 with the sigmoid function.
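The identity is easy to check numerically; a small sketch (the helper names are mine) comparing a unit with an explicit threshold to its normalized, zero-threshold form:

```python
def step(x, t=0.0):
    return 1 if x >= t else 0

def thresholded_unit(W, I, t):
    # Unit with an explicit threshold t
    return step(sum(w * i for w, i in zip(W, I)), t)

def normalized_unit(W, I, t):
    # Same unit with W_0 = -t and fixed extra input I_0 = 1; threshold is now 0
    return step(sum(w * i for w, i in zip([-t] + W, [1] + I)))
```

The two agree on every input, which is why the threshold can be treated as just another learnable weight.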

Units as Logic Gates

Activation function: step function (threshold 0, with bias weight W_0 on the fixed input 1):
- AND: W_0 = -1.5, W_1 = 1, W_2 = 1
- OR: W_0 = -0.5, W_1 = 1, W_2 = 1
- NOT: W_0 = 0.5, W_1 = -1

Since units can implement the AND, OR, and NOT boolean operators, networks of such units can compute any boolean function.
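Assuming the zero-threshold convention above (bias weight W_0 on a fixed extra input of 1), the three gates can be sketched as:

```python
def unit(W, inputs):
    # Step-activation unit; W[0] is the bias weight on the fixed input 1
    s = W[0] + sum(w * x for w, x in zip(W[1:], inputs))
    return 1 if s >= 0 else 0

def AND(a, b): return unit([-1.5, 1, 1], [a, b])
def OR(a, b):  return unit([-0.5, 1, 1], [a, b])
def NOT(a):    return unit([0.5, -1], [a])
```

Each gate is just a choice of weights for the same unit; no change to the unit's machinery is needed.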

Computing with NNs

Different functions are implemented by different network topologies and unit weights. The allure of NNs is that a network does not need to be explicitly programmed to compute a certain function f. Given enough nodes and links, a NN can learn the function by itself. It does so by looking at a training set of input/output pairs for f and modifying its topology and weights so that its own input/output behavior agrees with the training pairs. In other words, NNs, too, learn by induction.

Learning Network Structures

The structure of a NN is given by its nodes and links. The class of functions a network can represent depends on the network structure. Fixing the network structure in advance can make the task of learning a certain function impossible. On the other hand, using a large network is also potentially problematic: if a network has too many parameters (i.e., weights), it will simply learn the examples by memorizing them in its weights (overfitting).

Learning Network Structures

Two main ways to modify a network structure in accordance with the training set:
- Optimal brain damage: start with a large, fully-connected network and remove connections that do not seem to matter.
- Tiling: start with a very small network and incrementally add units to correctly cover more and more examples.

Neither technique is completely satisfactory in practice. Often, the network structure is established manually by trial and error (using cross-validation, etc.). Learning procedures are then used to learn the network weights only.

Network Structures

Feed-forward networks:
- single-layer perceptrons
- multi-layer perceptrons
Feed-forward networks implement functions and have no internal state.

Recurrent networks:
- Hopfield networks have symmetric weights (W_{i,j} = W_{j,i}), g(x) = sign(x), and a_i = ±1; they act as holographic associative memories.
- Boltzmann machines use stochastic activation functions.
Recurrent networks have directed cycles with delays, hence have internal state (like flip-flops), can oscillate, etc.

Feed-forward Network Example

[Figure: input units 1 and 2 feed hidden units 3 and 4 through weights W_{1,3}, W_{2,3}, W_{1,4}, W_{2,4}; units 3 and 4 feed output unit 5 through weights W_{3,5}, W_{4,5}.]

If we consider the weights as parameters, a network represents an entire family of nonlinear functions:

  a_5 = g(W_{3,5} a_3 + W_{4,5} a_4)
      = g(W_{3,5} g(W_{1,3} a_1 + W_{2,3} a_2) + W_{4,5} g(W_{1,4} a_1 + W_{2,4} a_2))

Changing the weights changes the function: learning is done this way!
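The nested expression for a_5 can be evaluated directly; a sketch with sigmoid activations (the weight-dictionary keys are my own naming):

```python
import math

def g(x):
    # Sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

def network(a1, a2, W):
    # Evaluate the 5-unit feed-forward net; W maps "j,i" link names to weights
    a3 = g(W["1,3"] * a1 + W["2,3"] * a2)
    a4 = g(W["1,4"] * a1 + W["2,4"] * a2)
    return g(W["3,5"] * a3 + W["4,5"] * a4)
```

Two different weight assignments to the same topology give two different functions of (a1, a2), which is exactly the "family of functions" point above.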

(Single-layer) Perceptrons

Single-layer, feed-forward networks whose units use a step/sigmoid function as activation function.

[Figure: a perceptron network with input units I_j connected by weights W_{j,i} to output units O_i, and a single perceptron with input units I_j, weights W_j, and a single output unit O.]

Perceptrons

[Figure: a perceptron network, and a plot of a sigmoid perceptron's output over two inputs, forming a soft threshold "cliff".]

Output units all operate separately: no shared weights. Adjusting the weights changes the location, orientation, and steepness of the cliff.

Perceptron Learning

Perceptrons caused a great stir when they were invented because it was shown that, if a function is representable by a perceptron, then it is learnable with 100% accuracy, given enough training examples.

Problem: perceptrons can only represent linearly-separable functions. It was soon shown that most of the functions we would like to compute are not linearly separable.

Linearly Separable Functions

2-dimensional space:

[Figure: plots of (a) I_1 and I_2, (b) I_1 or I_2, (c) I_1 xor I_2 over the four input combinations. A black dot corresponds to an output value of 1; an empty dot corresponds to an output value of 0.]

A perceptron can represent and, or, not, majority, etc., but not xor. It represents a linear separator in input space:

  Σ_j W_j I_j > 0, or W · I > 0

A Linearly Separable Function

3-dimensional space. The minority function: return 1 if the input vector contains fewer 1s than 0s; return 0 otherwise.

[Figure: (a) the separating plane; (b) weights and threshold: weight W = -1 on each of I_1, I_2, I_3, with threshold t = -1.5.]
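A quick sketch of the minority unit, assuming the weights and threshold above (all weights -1, threshold -1.5):

```python
def minority(inputs):
    # Threshold unit: every weight is -1, threshold t = -1.5, so it fires
    # exactly when the 3-bit input contains fewer 1s than 0s
    return 1 if sum(-x for x in inputs) >= -1.5 else 0
```

Checking all eight 3-bit inputs confirms that this single linear threshold separates the minority inputs from the rest.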

Learning with NNs

Most NN learning methods are current-best-hypothesis methods.

function NEURAL-NETWORK-LEARNING(examples) returns network
  network <- a network with randomly assigned weights
  repeat
    for each e in examples do
      O <- NEURAL-NETWORK-OUTPUT(network, e)
      T <- the observed output values from e
      update the weights in network based on e, O, and T
    end
  until all examples correctly predicted or stopping criterion is reached
  return network

Each cycle in the procedure above is called an epoch.

The Perceptron Learning Method

Weight updating in perceptrons is very simple because each output node is independent of the other output nodes.

[Figure: a perceptron network (inputs I_j, weights W_{j,i}, outputs O_i) and a single perceptron (inputs I_j, weights W_j, output O).]

So we can consider a perceptron with a single output node.

The Perceptron Learning Method

If O is the value returned by the output unit for a given example and T is the expected output, then the unit's error is

  E = T - O

If the error E is positive, we need to increase O; otherwise, we need to decrease it.

The Perceptron Learning Method

Since O = g(Σ_{j=0..n} W_j I_j), where g is the sigmoid function, we can change O by changing each W_j:
- to increase O, we should increase W_j if I_j is positive and decrease W_j if I_j is negative;
- to decrease O, we should decrease W_j if I_j is positive and increase W_j if I_j is negative.

This is done by updating each W_j in parallel as follows:

  W_j <- W_j + α I_j g'(Σ_{j=0..n} W_j I_j) (T - O)

where g'(x) = g(x)(1 - g(x)) is the first derivative of g and α is a positive constant, the learning rate.
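One parallel update of the whole weight vector for a single example can be sketched as follows (function names are mine):

```python
import math

def g(x):
    # Sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

def perceptron_update(W, I, T, alpha=0.1):
    # In-parallel update of every W_j for one example (I, T).
    # I[0] = 1 is the fixed bias input from the threshold-normalization trick.
    in_sum = sum(w * i for w, i in zip(W, I))
    O = g(in_sum)
    gprime = O * (1.0 - O)  # g'(x) = g(x)(1 - g(x))
    return [w + alpha * i * gprime * (T - O) for w, i in zip(W, I)]
```

Note that a weight whose input I_j is 0 is left unchanged, and that when T > O every weight with a positive input grows, exactly as the rule above prescribes.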

Perceptron Learning as Search

Provided that the learning rate constant is not too high, the perceptron will learn any linearly-separable function. Why? The perceptron learning procedure is a gradient descent search procedure whose search space has no local minima.

[Figure: bowl-shaped error surface Err over the weight space (W_1, W_2), with points a and b sliding toward the unique minimum.]

Each possible configuration of weights for the perceptron is a state in the search space.

Perceptron Learning contd.

The perceptron learning rule converges to a consistent function for any linearly separable data set.

[Figure: learning curves (proportion correct on test set vs. training set size) comparing the perceptron with decision-tree learning on (left) MAJORITY on 11 inputs and (right) the RESTAURANT data.]

The perceptron learns the majority function easily, while DTL is hopeless; DTL learns the restaurant function easily, while the perceptron cannot represent it.

Multilayer, Feed-forward Networks

A kind of neural network in which links are unidirectional and form no cycles (the net is a directed acyclic graph):
- the root nodes of the graph are input units, whose activation value is determined by the environment;
- the leaf nodes are output units;
- the remaining nodes are hidden units;
- units can be divided into layers: a unit in a layer is connected only to units in the next layer.

A Two-layer, Feed-forward Network

[Figure: output units O_i at the top, connected by weights W_{j,i} to hidden units a_j, which are connected by weights W_{k,j} to input units I_k at the bottom.]

Notes:
- The roots of the graph are at the bottom and the (only) leaf at the top.
- The layer of input units is generally not counted (which is why this is a two-layer net).
- Layers are usually fully connected; the number of hidden units is typically chosen by hand.

Multilayer, Feed-forward Networks

These are a powerful computational device:
- with just one hidden layer, they can approximate any continuous function;
- with just two hidden layers, they can approximate any computable function.

However, the number of needed units per layer may grow exponentially with the number of input units.

Expressiveness of MLNs

All continuous functions with 2 layers, all functions with 3 layers.

[Figure: surface plots of h_W(x_1, x_2) showing a ridge and a bump built from sigmoid units.]

- Combine two opposite-facing threshold functions to make a ridge.
- Combine two perpendicular ridges to make a bump.
- Add bumps of various sizes and locations to fit any surface.

The proof requires exponentially many hidden units (cf. the DTL proof).

Back-Propagation Learning

Extends the main idea of perceptron learning to multilayer networks: assess the blame for a unit's error and divide it among the contributing weights.
1. Start from the units in the output layer.
2. Propagate the error back through the previous layers, up to the input layer.

Weight updates:
- Output layer: as in the perceptron case.
- Hidden layers: by back-propagation.

Updating Weights: Output Layer

Exactly as in perceptrons. [Figure: unit i receives activation a_j along the link with weight W_{ji} and outputs O_i = g(in_i).]

  W_{ji} <- W_{ji} + α a_j Δ_i

where
- α > 0 is the learning rate,
- Δ_i = g'(in_i) (T_i - O_i) is the error of unit i,
- g is the sigmoid function and in_i = Σ_j W_{ji} a_j,
- T_i is the expected output.

Updating Weights: Hidden Layers

[Figure: hidden unit j receives activation a_k along the link with weight W_{kj} and outputs a_j = g(in_j).]

  W_{kj} <- W_{kj} + α a_k Δ_j

where Δ_j = g'(in_j) Σ_i W_{ji} Δ_i and each Δ_i is the error of a unit i in the next layer that is connected to unit j.

The Back-propagation Procedure

1. Choose a learning rate α.
2. Choose (small) random values for the weights.
3. Repeat until network performance is satisfactory. For each training example e:
   a. propagate e's inputs forward to compute the output O_i of each output node i;
   b. for each output node i, compute Δ_i := g'(in_i) (T_i - O_i);
   c. for each previous level l and node j in l, compute Δ_j := g'(in_j) Σ_i W_{ji} Δ_i;
   d. update each weight W_{rs} by W_{rs} <- W_{rs} + α a_r Δ_s.
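Steps (a)-(d) for a two-layer network with a single output unit can be sketched as follows (a minimal version; all names are mine):

```python
import math

def g(x):
    # Sigmoid activation
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(Wh, Wo, I, T, alpha=0.5):
    # One back-propagation step for a 2-layer net with one output unit.
    # Wh[j] holds the weights into hidden unit j; Wo the weights into the output.
    # Forward pass (step a)
    a_h = [g(sum(w * i for w, i in zip(row, I))) for row in Wh]
    O = g(sum(w * a for w, a in zip(Wo, a_h)))
    # Output delta (step b): Delta = g'(in)(T - O), with g'(in) = O(1 - O)
    d_o = O * (1.0 - O) * (T - O)
    # Hidden deltas (step c), using the pre-update output weights
    d_h = [a * (1.0 - a) * Wo[j] * d_o for j, a in enumerate(a_h)]
    # Weight updates (step d)
    Wo2 = [w + alpha * a * d_o for w, a in zip(Wo, a_h)]
    Wh2 = [[w + alpha * i * d_h[j] for w, i in zip(row, I)]
           for j, row in enumerate(Wh)]
    return Wh2, Wo2, O
```

Iterating this step on a training example drives the output toward the target, since each step moves every weight along the negative gradient of the error.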

Why Back-Propagation Works

Back-propagation learning, too, is a gradient descent search in the weight space over a certain error surface. If W is the vector of all the weights in the network, the error surface is given by

  E(W) := (1/2) Σ_i (T_i - O_i)^2

The update for each weight W_{ji} of a unit i is the opposite of the gradient (slope) of the error surface along the direction W_{ji}:

  a_j Δ_i = - ∂E(W)/∂W_{ji}

Why BP Doesn't Always Work

Producing a new vector W' by adding to each W_{ji} in W the opposite of E's slope along W_{ji} guarantees that E(W') <= E(W).

[Figure: error surface Err over weights W_1, W_2 with multiple minima; points a and b descend into different minima.]

In general, however, the error surface may contain local minima.

Back-propagation Learning contd.

Training curve for 100 restaurant examples: back-propagation finds an exact fit.

[Figure: total error on the training set vs. number of epochs, decreasing from about 14 to 0 over roughly 400 epochs.]

Typical problems: slow convergence, local minima.

Back-propagation Learning contd.

Learning curve for an MLP with 4 hidden units:

[Figure: proportion correct on the test set vs. training set size on the RESTAURANT data, comparing a decision tree with a multilayer network.]

MLNs are quite good for complex pattern recognition tasks, but the resulting hypotheses cannot be understood easily.

Evaluating Back-propagation

To assess the goodness of back-propagation learning for multilayer networks, one must consider several issues:
- Expressiveness
- Computational efficiency
- Generalization power
- Sensitivity to noise
- Transparency
- Background knowledge

Handwritten Digit Recognition

                  3-NN   FFN    LeNet   B LeNet   SVM    V SVM   Match
  Error rate      2.4    1.6    0.9     0.7       1.1    0.56    0.63
  Runtime         1K     10     30      50        2K     200
  Memory          12     0.49   0.12    0.21      11
  Training time   0      7      14      30        10
  % rejected      8.1    3.2    1.8     0.5       1.8

Error rate in %; runtime in ms/digit; memory in MB; training time in days; % rejected = percentage rejected to reach 0.5% error.