Supervised Learning Artificial Neural Networks

Lecture 11: Supervised Learning, Artificial Neural Networks
Marco Chiarandini, Department of Mathematics & Computer Science, University of Southern Denmark
Slides by Stuart Russell and Peter Norvig

Course Overview
- Introduction: Artificial Intelligence, Intelligent Agents
- Search: Uninformed Search, Heuristic Search
- Uncertain knowledge and Reasoning: Probability and Bayesian approach, Bayesian Networks, Hidden Markov Chains, Kalman Filters
- Learning: Supervised (Decision Trees, Neural Networks, Learning Bayesian Networks), Unsupervised (EM Algorithm), Reinforcement Learning
- Games and Adversarial Search: Minimax search and Alpha-beta pruning, Multiagent search
- Knowledge representation and Reasoning: Propositional logic, First order logic, Inference, Planning

Outline
1. Feedforward Networks: Single-layer perceptrons, Multi-layer perceptrons
2.

A neuron in a living biological system
[Figure: a neuron, with labels: axonal arborization, axon from another cell, synapse, dendrite, axon, nucleus, synapses, cell body or soma]
Signals are noisy "spike trains" of electrical potential

In the brain:
- more than 20 types of neurons with $10^{14}$ synapses (compare with world population = $7 \times 10^{9}$)
- additionally, the brain is parallel and reorganizing, while computers are serial and static
- the brain is fault tolerant: neurons can be destroyed

Observations of neuroscience
- Neuroscientists view brains as a web of clues to the biological mechanisms of cognition.
- Engineers: the brain is an example solution to the problem of cognitive computing.

Applications
- supervised learning: regression and classification
- associative memory
- optimization
- grammatical induction (aka grammatical inference), e.g. in natural language processing
- noise filtering
- simulation of biological brains

Artificial Neural Networks
"The" neural network does not exist. There are different paradigms for neural networks, how they are trained and where they are used.
Artificial Neuron
- Each input is multiplied by a weighting factor.
- Output is 1 if the sum of weighted inputs exceeds the threshold value; 0 otherwise.
The network is programmed by adjusting weights using feedback from examples.

McCulloch-Pitts "unit" (1943)
Output is a function of weighted inputs:
$$a_i = g(in_i) = g\Big(\sum_j W_{j,i}\, a_j\Big)$$
[Figure: a unit with input links carrying $a_j$ weighted by $W_{j,i}$, a bias weight $W_{0,i}$ on the fixed input $a_0 = -1$, the input function $\Sigma$ producing $in_i$, the activation function $g$, and the output $a_i = g(in_i)$ on the output links]
A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do

Activation functions
Non-linear activation functions
[Figure: two plots of $g(in_i)$ against $in_i$, each rising to $+1$: (a) step function, (b) sigmoid function]
- (a) is a step function or threshold function (mostly used in theoretical studies)
- (b) is a continuous activation function, e.g., the sigmoid function $1/(1 + e^{-x})$ (mostly used in practical applications)
Changing the bias weight $W_{0,i}$ moves the threshold location
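The two activation functions can be sketched in a few lines of R. This is only an illustration: the helper names `step_fn`, `sigmoid` and `unit_output`, the weights, and the convention of a fixed bias input $a_0 = -1$ are assumptions of the sketch, not part of the slides.

```r
# Step (threshold) activation: 1 if the weighted input is positive, else 0.
step_fn <- function(x) as.numeric(x > 0)

# Sigmoid (logistic) activation: 1 / (1 + e^(-x)).
sigmoid <- function(x) 1 / (1 + exp(-x))

# A unit with one input a1 (weight w1) and a fixed bias input a0 = -1
# (weight w0). The a0 = -1 convention is an assumption of this sketch.
unit_output <- function(a1, w0, w1, g) g(w0 * (-1) + w1 * a1)

a1 <- seq(-4, 4, by = 2)
step_out <- unit_output(a1, w0 = 1, w1 = 1, g = step_fn)
sig_out  <- unit_output(a1, w0 = 1, w1 = 1, g = sigmoid)
print(rbind(a1, step_out, sig_out))
```

With this convention the unit switches where $w_1 a_1 = w_0$, so raising or lowering `w0` shifts the threshold along the input axis.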

Implementing logical functions
Threshold units with step activation and bias input $a_0 = -1$:
- AND: $W_0 = 1.5$, $W_1 = 1$, $W_2 = 1$
- OR: $W_0 = 0.5$, $W_1 = 1$, $W_2 = 1$
- NOT: $W_0 = -0.5$, $W_1 = -1$
McCulloch and Pitts: every (basic) Boolean function can be implemented (eventually by connecting a large number of units in networks, possibly recurrent, of arbitrary depth)
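As a quick check of these weights, the following R sketch evaluates the three units on all Boolean inputs. It assumes a step activation that fires when the weighted sum is positive, a fixed bias input $a_0 = -1$, and negative weights for NOT ($W_0 = -0.5$, $W_1 = -1$), since minus signs appear to have been lost in the transcription. The function names are hypothetical.

```r
# Threshold unit: fires (1) when the weighted sum, including the bias term
# w0 * a0, is positive. The bias input a0 = -1 is an assumed convention.
threshold_unit <- function(inputs, w0, w) {
  as.numeric(sum(c(w0, w) * c(-1, inputs)) > 0)
}

and_gate <- function(x1, x2) threshold_unit(c(x1, x2), w0 = 1.5, w = c(1, 1))
or_gate  <- function(x1, x2) threshold_unit(c(x1, x2), w0 = 0.5, w = c(1, 1))
not_gate <- function(x1)     threshold_unit(x1, w0 = -0.5, w = -1)

# Truth table over all input pairs.
truth <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
truth$AND <- mapply(and_gate, truth$x1, truth$x2)
truth$OR  <- mapply(or_gate,  truth$x1, truth$x2)
print(truth)
print(c(not_0 = not_gate(0), not_1 = not_gate(1)))
```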

Network structures
Architecture: definition of the number of nodes, the interconnection structure and the activation functions $g$, but not the weights.
Feed-forward networks: no cycles in the connection graph
- single-layer perceptrons (no hidden layers)
- multi-layer perceptrons (one or more hidden layers)
Feed-forward networks implement functions and have no internal state.
Recurrent networks:
- Hopfield networks have symmetric weights ($W_{i,j} = W_{j,i}$), $g(x) = \mathrm{sign}(x)$, $a_i \in \{1, 0\}$; associative memory
- recurrent neural nets have directed cycles with delays, hence they have internal state (like flip-flops), can oscillate, etc.

Use
Neural networks are used in classification and regression
- Boolean classification:
  - value over 0.5: one class
  - value below 0.5: other class
- k-way classification:
  - divide single output into k portions, or
  - k separate output units
- continuous output:
  - identity activation function in output unit

Single-layer NN (perceptrons)
[Figure: a layer of input units connected directly to output units by weights $W_{j,i}$; beside it, a surface plot of the perceptron output (0 to 1) over $x_1$ and $x_2$, each ranging from -4 to 4]
- Output units all operate separately: no shared weights
- Adjusting weights moves the location, orientation, and steepness of the cliff

Expressiveness of perceptrons
Consider a perceptron with $g$ = step function (Rosenblatt, 1957, 1960)
The output is 1 when $\sum_j W_j x_j > 0$, or equivalently $W \cdot x > 0$.
Hence, it represents a linear separator in input space:
- hyperplane in multidimensional space
- line in 2 dimensions
Minsky & Papert (1969) pricked the neural network balloon

Perceptron learning
Learn by adjusting weights to reduce error on the training set.
The squared error for an example with input $x$ and true output $y$ is
$$E = \frac{1}{2}\, Err^2 \equiv \frac{1}{2}\,(y - h_W(x))^2$$
Find local optima for the minimization of the function $E(W)$ in the vector of variables $W$ by gradient methods.
Note: the function $E$ depends on constant values $x$ that are the inputs to the perceptron.
The function $E$ depends on $h$, which is non-convex, hence the optimization problem cannot be solved just by solving $\nabla E(W) = 0$.

Digression: Gradient methods
Gradient methods are iterative approaches:
- find a descent direction with respect to the objective function $E$
- move $W$ in that direction by a step size
The descent direction can be computed by various methods, such as gradient descent, the Newton-Raphson method and others. The step size can be computed either exactly or loosely by solving a line search problem.
Example: gradient descent
1. Set iteration counter $t = 0$, and make an initial guess $W_0$ for the minimum
2. Repeat:
3. Compute the gradient $p_t = \nabla E(W_t)$ (the descent direction is $-p_t$)
4. Choose $\alpha_t$ to minimize $f(\alpha) = E(W_t - \alpha p_t)$ over $\alpha \in \mathbb{R}_+$
5. Update $W_{t+1} = W_t - \alpha_t p_t$, and $t = t + 1$
6. Until $\|\nabla E(W_t)\| < \text{tolerance}$
Step 4 can be solved loosely by taking a fixed small enough value $\alpha > 0$
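The gradient descent loop can be sketched in R as follows. The quadratic objective `E`, its gradient `grad_E`, the starting point and the fixed step size `alpha` are hypothetical choices made only for illustration. The line search that chooses the step size is replaced by the "fixed small enough $\alpha$" shortcut mentioned above.

```r
# Objective (hypothetical, for illustration): E(W) = (W1 - 3)^2 + 2 * (W2 + 1)^2
E      <- function(W) (W[1] - 3)^2 + 2 * (W[2] + 1)^2
grad_E <- function(W) c(2 * (W[1] - 3), 4 * (W[2] + 1))

gradient_descent <- function(W0, alpha = 0.1, tol = 1e-6, max_iter = 10000) {
  W <- W0
  for (t in seq_len(max_iter)) {
    p <- grad_E(W)                     # gradient at the current point
    if (sqrt(sum(p^2)) < tol) break    # stop when the gradient norm is small
    W <- W - alpha * p                 # fixed step size instead of a line search
  }
  list(W = W, iterations = t)
}

res <- gradient_descent(c(0, 0))
print(res)
```

For this convex toy objective the iterates approach the unique minimum at $(3, -1)$. For the non-convex error of a neural network the same loop only finds a local optimum.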

Perceptron learning
In the specific case of the perceptron, the descent direction is computed by the gradient:
$$\frac{\partial E}{\partial W_j} = Err \cdot \frac{\partial Err}{\partial W_j} = Err \cdot \frac{\partial}{\partial W_j}\Big(y - g\Big(\sum_{j=0}^{n} W_j x_j\Big)\Big) = -Err \cdot g'(in) \cdot x_j$$
and the weight update rule (perceptron learning rule) in step 5 becomes:
$$W_j^{t+1} = W_j^{t} + \alpha \cdot Err \cdot g'(in) \cdot x_j$$
For a threshold perceptron, $g'(in)$ is undefined: the original perceptron learning rule (Rosenblatt, 1957) simply omits $g'(in)$.

Perceptron learning contd.
function Perceptron-Learning(examples, network) returns perceptron weights
  inputs: examples, a set of examples, each with input x = x_1, x_2, ..., x_n and output y
          network, a perceptron with weights W_j, j = 0, ..., n and activation function g
  repeat
    for each e in examples do
      in ← Σ_{j=0..n} W_j x_j[e]
      Err ← y[e] − g(in)
      W_j ← W_j + α × Err × g'(in) × x_j[e]
    end
  until all examples correctly predicted or stopping criterion is reached
  return network
Perceptron learning rule converges to a consistent function for any linearly separable data set
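A minimal R sketch of the pseudocode above, for a sigmoid perceptron, where $g'(in) = g(in)(1 - g(in))$. The toy data (the Boolean OR function), the learning rate, the fixed number of epochs used as stopping criterion, and the constant bias input $x_0 = 1$ are assumptions made for illustration.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# Train a sigmoid perceptron with the rule W_j <- W_j + alpha * Err * g'(in) * x_j.
# X has one example per row; a constant column of 1s is prepended as the bias
# input x_0 (an assumption of this sketch).
perceptron_learning <- function(X, y, alpha = 0.5, epochs = 5000) {
  X <- cbind(1, X)
  W <- rep(0, ncol(X))
  for (epoch in seq_len(epochs)) {
    for (e in seq_len(nrow(X))) {
      inp <- sum(W * X[e, ])
      out <- sigmoid(inp)
      err <- y[e] - out
      W <- W + alpha * err * out * (1 - out) * X[e, ]  # g'(in) = g(in) * (1 - g(in))
    }
  }
  W
}

# Toy data (hypothetical): the Boolean OR function, which is linearly separable.
X <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), ncol = 2, byrow = TRUE)
y <- c(0, 1, 1, 1)

W <- perceptron_learning(X, y)
pred <- as.numeric(sigmoid(cbind(1, X) %*% W) > 0.5)
print(W)
print(pred)
```

The stopping criterion here is simply a fixed number of passes. Checking whether all examples are predicted correctly after each pass would match the pseudocode more closely.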

Numerical Example
The (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width, respectively, for 50 flowers from each of 2 species of iris. The species are Iris setosa and versicolor.
> head(iris.data)
   Sepal.Length Sepal.Width    Species id
6           5.4         3.9     setosa -1
4           4.6         3.1     setosa -1
84          6.0         2.7 versicolor  1
31          4.8         3.1     setosa -1
77          6.8         2.8 versicolor  1
15          5.8         4.0     setosa -1
[Figure: "Petal Dimensions in Iris Blossoms", a scatter plot of Width against Length with legend "Setosa Petals" and "Versicolor Petals"]

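> # Threshold unit: append the constant bias input 1 and return the sign of
> # the weighted sum as a plain number (not a 1x1 matrix).
> sigma <- function(w, point) {
+     x <- c(point, 1)
+     as.numeric(sign(w %*% x))
+ }
> # Random initial weights: two input weights and one bias weight.
> w.0 <- c(runif(1), runif(1), runif(1))
> w.t <- w.0
> # 1000 updates with learning rate 0.2, cycling over the first 50 rows.
> # Columns 1 and 2 are the inputs, column 4 (id) is the target in {-1, 1}.
> for (j in 1:1000) {
+     i <- (j - 1)%%50 + 1
+     diff <- iris.data[i, 4] - sigma(w.t, c(iris.data[i, 1], iris.data[i, 2]))
+     w.t <- w.t + 0.2 * diff * c(iris.data[i, 1], iris.data[i, 2], 1)
+ }
[Figure: "Petal Dimensions in Iris Blossoms", a scatter plot of Width against Length with legend "Setosa Petals" and "Versicolor Petals"]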

Multilayer Feed-forward
[Figure: network with input units 1 and 2, hidden units 3 and 4, output unit 5, and weights $W_{1,3}$, $W_{1,4}$, $W_{2,3}$, $W_{2,4}$, $W_{3,5}$, $W_{4,5}$]
Feed-forward network = a parametrized family of nonlinear functions:
$$a_5 = g(W_{3,5}\, a_3 + W_{4,5}\, a_4) = g\big(W_{3,5}\, g(W_{1,3}\, a_1 + W_{2,3}\, a_2) + W_{4,5}\, g(W_{1,4}\, a_1 + W_{2,4}\, a_2)\big)$$
Adjusting weights changes the function: do learning this way!
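The nested expression for $a_5$ can be evaluated directly. In the R sketch below the weight values are arbitrary illustrative numbers (the slides give none), $g$ is assumed to be the sigmoid, and bias weights are left out, as in the formula above.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# Weights are arbitrary illustrative values, not taken from the slides.
W <- list(w13 = 0.5, w23 = -0.4, w14 = 0.3, w24 = 0.8, w35 = 1.2, w45 = -0.7)

forward <- function(a1, a2, W, g = sigmoid) {
  a3 <- g(W$w13 * a1 + W$w23 * a2)   # hidden unit 3
  a4 <- g(W$w14 * a1 + W$w24 * a2)   # hidden unit 4
  a5 <- g(W$w35 * a3 + W$w45 * a4)   # output unit 5
  c(a3 = a3, a4 = a4, a5 = a5)
}

out <- forward(1, 0, W)
print(out)

# Changing a weight changes the function computed by the network.
W2 <- W
W2$w35 <- -1.2
print(forward(1, 0, W2)["a5"])
```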

Neural Network with two layers
[Figure]

Expressiveness of MLPs
All continuous functions with 2 layers, all functions with 3 layers
[Figure: two surface plots of $h_W(x_1, x_2)$ (0 to 1) over $x_1$ and $x_2$, each ranging from -4 to 4: a ridge and a bump]
- Combine two opposite-facing threshold functions to make a ridge
- Combine two perpendicular ridges to make a bump
- Add bumps of various sizes and locations to fit any surface
- Proof requires exponentially many hidden units
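The ridge and bump construction can be tried numerically. The R sketch below uses soft (sigmoid) thresholds. The steepness `k`, the ridge half-width of 1, and the offset 1.5 used to combine the two ridges are illustrative assumptions, not values from the slides.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# Ridge around x = 0: the sum of two opposite-facing soft thresholds, shifted
# down by 1 so that it is close to 1 for |x| < 1 and close to 0 far away.
ridge <- function(x, k = 10) sigmoid(k * (x + 1)) + sigmoid(-k * (x - 1)) - 1

# Bump: add two perpendicular ridges and threshold the sum, so that the output
# is high only where both ridges are high.
bump <- function(x1, x2, k = 10) sigmoid(k * (ridge(x1, k) + ridge(x2, k) - 1.5))

print(round(c(center = bump(0, 0), on_one_ridge = bump(0, 4), far = bump(4, 4)), 3))
```

Each ridge costs two hidden units and the bump needs a further unit on top, which gives a feel for why fitting an arbitrary surface with many bumps needs many hidden units.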

Backpropagation Algorithm
- Supervised learning method to train multilayer feedforward NNs with differentiable transfer functions.
- Adjust weights along the negative of the gradient of the performance function.
- Forward-backward pass.
- Sequential or batch mode.
- Convergence time varies exponentially with the number of inputs.
- Avoid local minima by simulated annealing and other metaheuristics.

Multilayer perceptrons
Layers are usually fully connected; numbers of hidden units typically chosen by hand
[Figure: three layers, from top to bottom: output units $a_i$, weights $W_{j,i}$, hidden units $a_j$, weights $W_{k,j}$, input units $a_k$]

Back-propagation learning
Output layer: same as for the single-layer perceptron,
$$W_{j,i} \leftarrow W_{j,i} + \alpha \times a_j \times \Delta_i$$
where $\Delta_i = Err_i \times g'(in_i)$.
Note: the general case has multiple output units, hence $Err = (y - h_W(x))$ is a vector.
Hidden layer: back-propagate the error from the output layer:
$$\Delta_j = g'(in_j) \sum_i W_{j,i}\, \Delta_i$$
(sum over the multiple output units)
Update rule for weights in the hidden layer:
$$W_{k,j} \leftarrow W_{k,j} + \alpha \times a_k \times \Delta_j$$
(Most neuroscientists deny that back-propagation occurs in the brain)
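One back-propagation step for a tiny network can be written out in R as below. This is a sketch under stated assumptions: a 2-2-1 network without bias weights, sigmoid activations (so $g'(in) = g(in)(1 - g(in))$), arbitrary starting weights, a single training example, and hypothetical function names.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# Network: 2 inputs (k) -> 2 hidden units (j) -> 1 output unit (i), no bias
# weights, to keep the sketch short. W_kj[k, j] and W_ji[j, i] hold the
# weights; the numbers are arbitrary illustrative values.
W_kj <- matrix(c(0.5, -0.4,
                 0.3,  0.8), nrow = 2, byrow = TRUE)
W_ji <- matrix(c(1.2, -0.7), nrow = 2)

forward <- function(a_k, W_kj, W_ji) {
  a_j <- sigmoid(as.vector(a_k %*% W_kj))   # hidden activations
  a_i <- sigmoid(as.vector(a_j %*% W_ji))   # output activations
  list(a_j = a_j, a_i = a_i)
}

backprop_step <- function(a_k, y, W_kj, W_ji, alpha = 0.5) {
  f <- forward(a_k, W_kj, W_ji)
  # Output layer: Delta_i = Err_i * g'(in_i).
  delta_i <- (y - f$a_i) * f$a_i * (1 - f$a_i)
  # Hidden layer: Delta_j = g'(in_j) * sum_i W_ji * Delta_i.
  delta_j <- f$a_j * (1 - f$a_j) * as.vector(W_ji %*% delta_i)
  # Updates: W_ji <- W_ji + alpha * a_j * Delta_i,
  #          W_kj <- W_kj + alpha * a_k * Delta_j.
  list(W_kj = W_kj + alpha * outer(a_k, delta_j),
       W_ji = W_ji + alpha * outer(f$a_j, delta_i))
}

a_k <- c(1, 0); y <- 1
err_before <- 0.5 * sum((y - forward(a_k, W_kj, W_ji)$a_i)^2)
new_W <- backprop_step(a_k, y, W_kj, W_ji)
err_after <- 0.5 * sum((y - forward(a_k, new_W$W_kj, new_W$W_ji)$a_i)^2)
print(c(before = err_before, after = err_after))
```

Note that $\Delta_j$ is computed with the old output-layer weights, before either weight matrix is updated.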

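Back-propagation derivation
The squared error on a single example is defined as
$$E = \frac{1}{2} \sum_i (y_i - a_i)^2,$$
where the sum is over the nodes in the output layer.
$$\begin{aligned}
\frac{\partial E}{\partial W_{j,i}} &= -(y_i - a_i)\,\frac{\partial a_i}{\partial W_{j,i}} = -(y_i - a_i)\,\frac{\partial g(in_i)}{\partial W_{j,i}} \\
&= -(y_i - a_i)\, g'(in_i)\,\frac{\partial in_i}{\partial W_{j,i}} = -(y_i - a_i)\, g'(in_i)\,\frac{\partial}{\partial W_{j,i}}\Big(\sum_j W_{j,i}\, a_j\Big) \\
&= -(y_i - a_i)\, g'(in_i)\, a_j = -a_j\, \Delta_i
\end{aligned}$$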

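Back-propagation derivation contd.
For the hidden layer:
$$\begin{aligned}
\frac{\partial E}{\partial W_{k,j}} &= -\sum_i (y_i - a_i)\,\frac{\partial a_i}{\partial W_{k,j}} = -\sum_i (y_i - a_i)\,\frac{\partial g(in_i)}{\partial W_{k,j}} \\
&= -\sum_i (y_i - a_i)\, g'(in_i)\,\frac{\partial in_i}{\partial W_{k,j}} = -\sum_i \Delta_i\,\frac{\partial}{\partial W_{k,j}}\Big(\sum_j W_{j,i}\, a_j\Big) \\
&= -\sum_i \Delta_i\, W_{j,i}\,\frac{\partial a_j}{\partial W_{k,j}} = -\sum_i \Delta_i\, W_{j,i}\,\frac{\partial g(in_j)}{\partial W_{k,j}} \\
&= -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\,\frac{\partial in_j}{\partial W_{k,j}} = -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\,\frac{\partial}{\partial W_{k,j}}\Big(\sum_k W_{k,j}\, a_k\Big) \\
&= -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\, a_k = -a_k\, \Delta_j
\end{aligned}$$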
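The two results, $\partial E/\partial W_{j,i} = -a_j \Delta_i$ and $\partial E/\partial W_{k,j} = -a_k \Delta_j$, can be checked numerically against finite differences. The R sketch below assumes a 2-2-1 sigmoid network without bias weights. The weights, the input and the step `h` are arbitrary illustrative values.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# Tiny 2-2-1 network without bias weights; all values are arbitrary.
W_kj <- matrix(c(0.5, -0.4, 0.3, 0.8), nrow = 2, byrow = TRUE)  # W_kj[k, j]
W_ji <- c(1.2, -0.7)                                   # W_ji[j], single output i
a_k <- c(1, 0.5); y <- 1

error <- function(W_kj, W_ji) {
  a_j <- sigmoid(as.vector(a_k %*% W_kj))
  a_i <- sigmoid(sum(a_j * W_ji))
  0.5 * (y - a_i)^2
}

# Analytic gradients from the derivation.
a_j <- sigmoid(as.vector(a_k %*% W_kj))
a_i <- sigmoid(sum(a_j * W_ji))
delta_i <- (y - a_i) * a_i * (1 - a_i)
delta_j <- a_j * (1 - a_j) * W_ji * delta_i
grad_ji <- -a_j * delta_i            # dE/dW_ji = -a_j * Delta_i
grad_kj <- -outer(a_k, delta_j)      # dE/dW_kj = -a_k * Delta_j

# Central finite differences for comparison.
h <- 1e-6
num_ji <- sapply(1:2, function(j) {
  Wp <- W_ji; Wp[j] <- Wp[j] + h
  Wm <- W_ji; Wm[j] <- Wm[j] - h
  (error(W_kj, Wp) - error(W_kj, Wm)) / (2 * h)
})
num_kj <- matrix(0, 2, 2)
for (k in 1:2) for (j in 1:2) {
  Wp <- W_kj; Wp[k, j] <- Wp[k, j] + h
  Wm <- W_kj; Wm[k, j] <- Wm[k, j] - h
  num_kj[k, j] <- (error(Wp, W_ji) - error(Wm, W_ji)) / (2 * h)
}
print(max(abs(num_ji - grad_ji)))
print(max(abs(num_kj - grad_kj)))
```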

Numerical Example
The (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length, petal length and petal width for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor and virginica.
[Figure: scatter plot matrices of Petal.Length, Petal.Width and Sepal.Length, with panels for setosa, versicolor and virginica]

Numerical Example
> library(nnet)  # provides nnet() and class.ind()
> # Training set: 25 randomly chosen flowers from each of the three species.
> samp <- c(sample(1:50, 25), sample(51:100, 25), sample(101:150, 25))
> # One indicator (0/1) column per species.
> Target <- class.ind(iris$Species)
> ir.nn <- nnet(Target ~ Sepal.Length * Petal.Length * Petal.Width, data = iris, subset = samp,
+     size = 2, rang = 0.1, decay = 5e-04, maxit = 200, trace = FALSE)
> # Confusion table: true class against the class with the largest output.
> test.cl <- function(true, pred) {
+     true <- max.col(true)
+     cres <- max.col(pred)
+     table(true, cres)
+ }
> test.cl(Target[-samp, ], predict(ir.nn, iris[-samp, c(1, 3, 4)]))
    cres
true  1  2  3
   1 25  0  0
   2  0 22  3
   3  0  2 23

Learning structures
Besides the weights, the structure can also be learned:
- Optimal brain damage: iteratively remove single edges or units if performance does not worsen after weights are re-learned
- Tiling: iteratively add units and links and re-learn weights

Handwritten digit recognition
- 400-300-10 unit MLP = 1.6% error
- LeNet: 768-192-30-10 unit MLP = 0.9% error (http://yann.lecun.com/exdb/lenet/)
- Current best (kernel machines, vision algorithms): 0.6% error
- Humans are at 0.2% to 2.5% error

Another Practical Example

Directions of research in ANN
- Representational capability assuming unlimited number of neurons (no training)
- Numerical analysis or approximation theoretic: how many hidden units are necessary to achieve a certain approximation error? (no training) Results for single hidden layer and multiple hidden layers
- Sample complexity: how many samples are needed to characterize a certain unknown mapping?
- Efficient learning: backpropagation has the curse of dimensionality problem

Approximation properties
- NNs with 2 hidden layers and arbitrarily many nodes can approximate any real-valued function up to any desired accuracy, using continuous activation functions. E.g.: the required number of hidden units grows exponentially with the number of inputs; $2^n/n$ hidden units are needed to encode all Boolean functions of $n$ inputs.
- However, proofs are not constructive.
- More interest in efficiency issues: NNs with small size and depth
- Size-depth trade-off: more layers are more costly to simulate

Training and Assessment
Use different data for different tasks:
- Training and test data: holdout cross validation
- If little data: k-fold cross validation
Avoid peeking:
- Weights learned on training data
- Parameters such as learning rate $\alpha$ and net topology compared on validation data
- Final assessment on test data
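A k-fold split can be sketched in a few lines of R. The number of examples `n`, the number of folds `k` and the seed are hypothetical. The sketch only builds the index sets and leaves the actual training as a comment. A separate test set for the final assessment would be set aside before this split.

```r
set.seed(1)  # arbitrary seed, only to make the split reproducible

n <- 20      # hypothetical number of examples
k <- 5       # number of folds

# Assign each example to one of k folds at random, with equal fold sizes.
fold <- sample(rep(seq_len(k), length.out = n))

for (f in seq_len(k)) {
  validation_idx <- which(fold == f)
  training_idx   <- which(fold != f)
  # Here one would learn the weights on training_idx and compare settings
  # (learning rate, net topology) on validation_idx.
  cat("fold", f, ": train", length(training_idx),
      "validate", length(validation_idx), "\n")
}
```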

Ensemble Methods
- Use majority rule to predict among K hypotheses learned.
- If the hypotheses are independent, this yields a considerable reduction of misclassification.
- Boosting: weight the examples adaptively
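The effect of majority voting under the independence assumption can be quantified with the binomial distribution. In the R sketch below, the individual error rate `p` and the ensemble sizes are hypothetical, and K is taken to be odd so that no ties occur.

```r
# Probability that a majority of K independent hypotheses is wrong, when each
# one errs with probability p (K odd, so there are no ties).
majority_error <- function(K, p) {
  m <- (floor(K / 2) + 1):K
  sum(dbinom(m, size = K, prob = p))
}

p <- 0.1   # hypothetical individual error rate
print(sapply(c(1, 5, 11), majority_error, p = p))
```

In practice hypotheses learned from the same data are rarely fully independent, so the reduction is smaller than this idealized calculation suggests.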

Learning Theory
- Probably approximately correct (PAC) learning
- Vapnik-Chervonenkis (VC) dimensions provide information-theoretic bounds to sample complexities in continuous function classes

Summary
Supervised learning
- Decision trees
- Linear models
- Perceptron learning rule: an algorithm for learning weights in single-layered networks
- Perceptrons: linear separators, insufficiently expressive
- Multi-layer networks are sufficiently expressive
- Many applications: speech, driving, handwriting, fraud detection, etc.
- k nearest neighbor, non-parametric regression