Learning and Neural Networks

Similar documents
Learning from Observations. Chapter 18, Sections 1 3 1

CS 380: ARTIFICIAL INTELLIGENCE

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

Learning Decision Trees

From inductive inference to machine learning

Introduction to Artificial Intelligence. Learning from Oberservations

Statistical Learning. Philipp Koehn. 10 November 2015

1. Courses are either tough or boring. 2. Not all courses are boring. 3. Therefore there are tough courses. (Cx, Tx, Bx, )

Neural networks. Chapter 19, Sections 1 5 1

Learning from Examples

Neural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA 1/ 21

Bayesian learning Probably Approximately Correct Learning

Decision Trees. None Some Full > No Yes. No Yes. No Yes. No Yes. No Yes. No Yes. No Yes. Patrons? WaitEstimate? Hungry? Alternate?

Neural networks. Chapter 20. Chapter 20 1

Introduction To Artificial Neural Networks

22c145-Fall 01: Neural Networks. Neural Networks. Readings: Chapter 19 of Russell & Norvig. Cesare Tinelli 1

Ch.8 Neural Networks

Neural networks. Chapter 20, Section 5 1

Machine Learning (CS 419/519): M. Allen, 14 Sept. 18 made, in hopes that it will allow us to predict future decisions

Last update: October 26, Neural networks. CMSC 421: Section Dana Nau

CS:4420 Artificial Intelligence

Chapter 18. Decision Trees and Ensemble Learning. Recall: Learning Decision Trees

Decision Trees. Ruy Luiz Milidiú

EECS 349:Machine Learning Bryan Pardo

Incremental Stochastic Gradient Descent

Decision Trees. CS 341 Lectures 8/9 Dan Sheldon

2015 Todd Neller. A.I.M.A. text figures 1995 Prentice Hall. Used by permission. Neural Networks. Todd W. Neller

} It is non-zero, and maximized given a uniform distribution } Thus, for any distribution possible, we have:

Artificial neural networks

Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011!

Supervised Learning (contd) Decision Trees. Mausam (based on slides by UW-AI faculty)

Introduction to Machine Learning

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

Sections 18.6 and 18.7 Artificial Neural Networks

Data Mining Part 5. Prediction

CSC 411 Lecture 3: Decision Trees

Sections 18.6 and 18.7 Analysis of Artificial Neural Networks

Nonlinear Classification

Multilayer Perceptron

brainlinksystem.com $25+ / hr AI Decision Tree Learning Part I Outline Learning 11/9/2010 Carnegie Mellon

Neural Networks and the Back-propagation Algorithm

Sections 18.6 and 18.7 Artificial Neural Networks

Multilayer Neural Networks

Unit 8: Introduction to neural networks. Perceptrons

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009

Artifical Neural Networks

Artificial Intelligence

Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

CSE446: Neural Networks Spring Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer

16.4 Multiattribute Utility Functions

Multilayer Perceptrons (MLPs)

Multilayer Neural Networks

Classification Algorithms

CSC242: Intro to AI. Lecture 23

Introduction to Natural Computation. Lecture 9. Multilayer Perceptrons and Backpropagation. Peter Lewis

SPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks

Course 395: Machine Learning - Lectures

18.6 Regression and Classification with Linear Models

Artificial Neural Networks

Assignment 1: Probabilistic Reasoning, Maximum Likelihood, Classification

AI Programming CS F-20 Neural Networks

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

CS 4700: Foundations of Artificial Intelligence

CSC242: Intro to AI. Lecture 21

CMPT 310 Artificial Intelligence Survey. Simon Fraser University Summer Instructor: Oliver Schulte

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Artificial Intelligence Roman Barták

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Supervised Learning Artificial Neural Networks

ARTIFICIAL INTELLIGENCE. Artificial Neural Networks

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

Lab 5: 16 th April Exercises on Neural Networks

EEE 241: Linear Systems

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

Artificial Neural Networks Examination, June 2005

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Revision: Neural Network

CS 4700: Foundations of Artificial Intelligence

Introduction to Statistical Learning Theory. Material para Máster en Matemáticas y Computación

Pattern Classification

Data Mining. Preamble: Control Application. Industrial Researcher s Approach. Practitioner s Approach. Example. Example. Goal: Maintain T ~Td

ECLT 5810 Classification Neural Networks. Reference: Data Mining: Concepts and Techniques By J. Hand, M. Kamber, and J. Pei, Morgan Kaufmann

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Neural Nets Supervised learning

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. Tirgul 5

Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

Machine Learning. Neural Networks

Lecture 4: Perceptrons and Multilayer Perceptrons

<Special Topics in VLSI> Learning for Deep Neural Networks (Back-propagation)

Artificial Neural Networks. Edward Gatt

CMSC 421: Neural Computation. Applications of Neural Networks

Neural Networks Lecturer: J. Matas Authors: J. Matas, B. Flach, O. Drbohlav

) (d o f. For the previous layer in a neural network (just the rightmost layer if a single neuron), the required update equation is: 2.

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

CSE 352 (AI) LECTURE NOTES Professor Anita Wasilewska. NEURAL NETWORKS Learning

Multilayer Perceptrons and Backpropagation

Transcription:

Artificial Intelligence Learning and Neural Networks Readings: Chapter 19 & 20.5 of Russell & Norvig

Example: A Feed-forward Network w 13 I 1 H 3 w 35 w 14 O 5 I 2 w 23 w 24 H 4 w 45 a 5 = g 5 (W 3,5 a 3 + W 4,5 a 4 ) = g 5 (W 3,5 g 3 (W 1,3 a 1 + W 2,3 a 2 ) + W 4,5 g 4 (W 1,4 a 1 + W 2,4 a 2 )) where a i is the output and g i is the activation function of node i.

Computing with NNs Different functions are implemented by different network topologies and unit weights. The lure of NNs is that a network need not be explicitly programmed to compute a certain function f. Given enough nodes and links, a NN can learn the function by itself. It does so by looking at a training set of input/output pairs for f and modifying its topology and weights so that its own input/output behavior agrees with the training pairs. In other words, NNs learn by induction, too.

Learning = Training in Neural Networks Neural networks are trained using data referred to as a training set. The process is one of computing outputs, compare outputs with desired answers, adjust weights and repeat. The information of a Neural Network is in its structure, activation functions, weights, and Learning to use different structures and activation functions is very difficult. These weights are used to express the relative strength of an input value or from a connecting unit (i.e., in another layer). It is by adjusting these weights that a neural network learns.

Process for Developing Neural Networks 1. Collect data Ensure that application is amenable to a NN approach and pick data randomly. 2. Separate Data into Training Set and Test Set 3. Define a Network Structure Are perceptrons sufficient? 4. Select a Learning Algorithm Decided by available tools 5. Set Parameter Values They will affect the length of the training period. 6. Training Determine and revise weights 7. Test If not acceptable, go back to steps 1, 2,..., or 5. 8. Delivery of the product

The Perceptron Learning Method Weight updating in perceptrons is very simple because each output node is independent of the other output nodes. I j W j,i O i I j W j O Input Units Output Units Input Units Output Unit Perceptron Network Single Perceptron With no loss of generality then, we can consider a perceptron with a single output node.

Normalizing Unit Thresholds. Notice that, if t is the threshold value of the output unit, then step t ( n j=1 W j I j ) = step 0 ( n j=0 W j I j ) where W 0 = t and I 0 = 1. Therefore, we can always assume that the unit s threshold is 0 if we include the actual threshold as the weight of an extra link with a fixed input value. This allows thresholds to be learned like any other weight. Then, we can even allow output values in [0, 1] by replacing step 0 by the sigmoid function.

The Perceptron Learning Method If O is the value returned by the output unit for a given example and T is the expected output, then the unit s error is Err = T O If the error Err is positive we need to increase O; otherwise, we need to decrease O.

The Perceptron Learning Method Since O = g( n j=0 W ji j ), we can change O by changing each W j. Assuming g is monotonic, to increase O we should increase W j if I j is positive, decrease W j if I j is negative. Similarly, to decrease O we should decrease W j if I j is positive, increase W j if I j is negative. This is done by updating each W j as follows W j W j + α I j (T O) where α is a positive constant, the learning rate.

Theoretic Background Learn by adjusting weights to reduce error on training set The squared error for an example with input x and true output y is E = 1 2 Err 2 1 2 (y h W (x))2, Perform optimization search by gradient descent: E = Err Err = Err n y g( W j x j ) W j W j W j = Err g (in) x j j = 0

Weight update rule W j W j α E W j = W j + α Err g (in) x j E.g., positive error = increasing network output = increasing weights on positive inputs and decreasing on negative inputs Simple weight update rule (assuming g (in) constant): W j W j + α Err x j

Learning a 5-place Minority Function At first, collect the data (see below), then choose a structure (a perceptron with five inputs and one output) and the activation function (i.e., step 3 ). Finally, set up parameters (i.e., W i = 0) and start to learn: Assumping α = 1, we have Sum = 5 i=1 W ii i, Out = step 3 (Sum), Err = T Out, and W j W j + I j Err. I 1 I 2 I 3 I 4 I 5 T W 1 W 2 W 3 W 4 W 5 Sum Out Err e 1 1 0 0 0 1 1 0 0 0 0 0 e 2 1 1 0 1 0 0 e 3 0 0 0 1 1 1 e 4 1 1 1 1 0 0 e 5 0 1 0 1 0 1 e 6 0 1 1 1 1 0 e 7 0 1 0 1 0 1 e 8 1 0 1 0 0 1

Learning a 5-place Minority Function The same as the last example, except that α = 0.5 instead of α = 1. Sum = 5 i=1 W ii i, Out = step 3 (Sum), Err = T Out, and W j W j + I j Err. I 1 I 2 I 3 I 4 I 5 T W 1 W 2 W 3 W 4 W 5 Sum Out Err e 1 1 0 0 0 1 1 0 0 0 0 0 e 2 1 1 0 1 0 0 e 3 0 0 0 1 1 1 e 4 1 1 1 1 0 0 e 5 0 1 0 1 0 1 e 6 0 1 1 1 1 0 e 7 0 1 0 1 0 1 e 8 1 0 1 0 0 1

Multilayer perceptrons Layers are usually fully connected; numbers of hidden units typically chosen by hand Output units a i W j,i Hidden units a j W k,j Input units a k

Expressiveness of MLPs All continuous functions w/ 2 layers, all functions w/ 3 layers h W (x 1, x 2 ) 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0-4 -2 0 x 1 2 4-4 -2 0 2 4 x 2 h W (x 1, x 2 ) 0.9 1 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0-4 -2 0 x 1 2 4-4 -2 0 2 4 x 2

Back-propagation learning Output layer: same as for single-layer perceptron, W j,i W j,i + α a j i where i = Err i g (in i ) Hidden layer: back-propagate the error from the output layer: j = g (in j ) i W j,i i. Update rule for weights in hidden layer: W k,j W k,j + α a k j. (Most neuroscientists deny that back-propagation occurs in the brain)

Back-propagation derivation The squared error on a single example is defined as E = 1 2 (y i a i ) 2, where the sum is over the nodes in the output layer. i E W j,i = (y i a i ) a i W j,i = (y i a i ) g(in i) W j,i = (y i a i )g (in i ) in i W j,i = (y i a i )g (in i ) W j,i j W j,i a j = (y i a i )g (in i )a j = a j i

Back-propagation derivation E W k,j = i = i (y i a i ) a i W k,j = i (y i a i )g (in i ) in i W k,j = i (y i a i ) g(in i) W k,j i W k,j j W j,i = i a j i W j,i = W k,j i i W j,i g(in j ) W k,j = i = i i W j,i g (in j ) in j W k,j i W j,i g (in j ) W k,j ( k W k,j a k ) = i W j,i g (in j )a k = a k j

Decision Trees for Classification Examples described by attribute values (Boolean, discrete, continuous, etc.) E.g., situations where I will/won t wait for a table: Example Attributes Targe Bar F ri Hun P at P rice Rain Res T ype Est WillWa X 1 F F T Some $$$ F T French 0 10 T X 2 F F T Full $ F F Thai 30 60 F X 3 T F F Some $ F F Burger 0 10 T X 4 F T T Full $ F F Thai 10 30 T X 5 F T F Full $$$ F T French >60 F X 6 T F T Some $$ T T Italian 0 10 T X 7 T F F None $ T F Burger 0 10 F X 8 F F T Some $$ T T Thai 0 10 T X 9 T T F Full $ T F Burger >60 F X 10 T T T Full $$$ F T Italian 10 30 F X 11 F F F None $ F F Thai 0 10 F X 12 T T T Full $ F F Burger 30 60 T Classification of examples is positive (T) or negative (F)

Decision trees One possible representation for hypotheses. E.g., here is the true tree for deciding whether to wait: Patrons? None Some Full F T WaitEstimate? >60 30 60 10 30 0 10 F Alternate? Hungry? T No Yes No Yes Reservation? Fri/Sat? T Alternate? No Yes No Yes No Yes Bar? T F T T Raining? No Yes No Yes F T F T

Expressiveness Decision trees can express any function of the input attributes. E.g., for Boolean functions, each truth table row is a path from root to leaf: A B A xor B F F F F T T T F T T T F F F B F T A T T B T F T Trivially, a consistent decision tree for any training set w/ one path to leaf for each example (unless f nondeterministic in x) but it probably won t generalize to new examples Prefer to find more compact decision trees. F

Hypothesis spaces How many distinct decision trees with n Boolean attributes

Hypothesis spaces How many distinct decision trees with n Boolean attributes = number of Boolean functions

Hypothesis spaces How many distinct decision trees with n Boolean attributes = number of Boolean functions = number of distinct truth tables with 2 n rows = 2 2n

Hypothesis spaces How many distinct decision trees with n Boolean attributes = number of Boolean functions = number of distinct truth tables with 2 n rows = 2 2n E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees How many purely conjunctive hypotheses (e.g., Hungry Rain)

Hypothesis spaces How many distinct decision trees with n Boolean attributes = number of Boolean functions = number of distinct truth tables with 2 n rows = 2 2n E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees How many purely conjunctive hypotheses (e.g., Hungry Rain) Each attribute can be in (positive), in (negative), or out.

Hypothesis spaces How many distinct decision trees with n Boolean attributes = number of Boolean functions = number of distinct truth tables with 2 n rows = 2 2n E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees How many purely conjunctive hypotheses (e.g., Hungry Rain) Each attribute can be in (positive), in (negative), or out. 3 n distinct conjunctive hypotheses

Decision tree learning Aim: find a small tree consistent with the training examples Idea: (recursively) choose most significant attribute as root of (sub)tree Reason: A good attribute splits the examples into subsets that are (ideally) all positive or all negative Patrons? Type? None Some Full French Italian Thai Burger P atrons? is a better choice gives information about the classification

Information Information answers questions. The more clueless I am about the answer initially, the more information is contained in the answer. Scale: 1 = answer to Boolean question with prior 0.5, 0.5 0 = answer to Boolean question with known result. Information in an answer when prior is P 1,..., P n is I( P 1,..., P n ) = n i = 1 P i log 2 P i (also called entropy of the prior)

Information Suppose we have p positive and n negative examples at the root. Hence I( p/(p + n), n/(p + n) ) bits needed to classify a new example. E.g., for 12 restaurant examples and p = n = 6, we need 1 bit. Suppose an attribute A splits the examples E into subsets E i, each of which (we hope) needs less information to complete the classification. Let E i have p i positive and n i negative examples. Then I( p i /(p i + n i ), n i /(p i + n i ) ) bits needed to classify E i. The remaining bits after choosing A will be Remainder(A) = i p i + n i p + n I( p i/(p i + n i ), n i /(p i + n i ) ) Idea: Choose A such that Remainder(A) is minimal.

Information Patrons? Type? None Some Full French Italian Thai Burger Remainder(P atrons?) = 2 4 6 12I(0, 1) + 12I(1, 0) + 12 I( 2 6, 4 6 ) = 0.459 Remainder(T ype?) = 2 12 I( 1 2, 1 2 ) + 12 2 I( 1 2, 1 2 ) + 12 4 I( 2 4, 2 4 ) + 12 4 I( 2 4, 2 4 ) = 1. So we choose the attribute P atrons? that minimizes the remaining information.

Example contd. Decision tree learned from the 12 examples: Patrons? None Some Full F T Hungry? Yes No Type? F French Italian Thai Burger T F Fri/Sat? T No Yes F T Substantially simpler than true tree a more complex hypothesis isn t justified by small amount of data

Summary Learning needed for unknown environments, lazy designers Learning agent = performance element + learning element Learning method depends on type of performance element, available feedback, type of component to be improved, and its representation For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples Learning performance = prediction accuracy measured on test set Most brains have lots of neurons; each neuron linear threshold unit (?) Perceptrons (one-layer networks) insufficiently expressive Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation Many applications: speech, driving, handwriting, credit cards, etc.