Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur


Module 12 Machine Learning

Lesson 39 Neural Networks - III

12.4.4 Multi-Layer Perceptrons

In contrast to perceptrons, multilayer networks can learn not only multiple decision boundaries, but the boundaries may also be nonlinear. The typical architecture of a multi-layer perceptron (MLP) is shown below.

[Figure: an MLP with a layer of input nodes, a layer of internal (hidden) nodes, and a layer of output nodes]

To make nonlinear partitions of the space we need to define each unit as a nonlinear function (unlike the perceptron). One solution is to use the sigmoid unit. Another reason for using sigmoids is that, unlike linear threshold units, they are continuous and therefore differentiable at all points.

[Figure: a sigmoid unit. Inputs x1, ..., xn with weights w1, ..., wn, together with a fixed input x0 = 1 with weight w0, feed a summation node that computes net; the unit output is O = σ(net)]

O = σ(net) = 1 / (1 + e^(-net))

O(x1, x2, ..., xn) = σ(W·X), where σ(W·X) = 1 / (1 + e^(-W·X))

The function σ is called the sigmoid or logistic function. It has the following property:

dσ(y)/dy = σ(y) (1 - σ(y))
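For concreteness, here is a minimal Python sketch of the sigmoid unit just described; the array layout (the bias weight w0 stored first and paired with a fixed input x0 = 1) and the example numbers are assumptions made for illustration, not part of the lecture notes.

```python
import numpy as np

def sigmoid(net):
    """The sigmoid (logistic) function: sigma(net) = 1 / (1 + e^(-net))."""
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_derivative(net):
    """d sigma(net) / d net = sigma(net) * (1 - sigma(net))."""
    s = sigmoid(net)
    return s * (1.0 - s)

def sigmoid_unit(w, x):
    """Output O = sigma(W . X); w[0] is the bias weight w0, paired with x0 = 1."""
    x_b = np.concatenate(([1.0], x))
    return sigmoid(np.dot(w, x_b))

# Example: bias w0 = -1.0 and weights w1 = 0.5, w2 = 2.0 on inputs x1 = 0.3, x2 = 0.8
w = np.array([-1.0, 0.5, 2.0])
x = np.array([0.3, 0.8])
print(sigmoid_unit(w, x))
```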

12.4.4.1 Back-Propagation Algorithm

Multi-layered perceptrons can be trained using the back-propagation algorithm described next.

Goal: to learn the weights of all links in an interconnected multilayer network.

We begin by defining our measure of error:

E(W) = ½ Σ_d Σ_k (t_kd - o_kd)^2

where k ranges over the output nodes and d over the training examples. The idea is again to perform gradient descent over the space of weights to find a minimum of E (there is no guarantee of reaching the global minimum).

Algorithm:
1. Create a network with n_in input nodes, n_hidden internal (hidden) nodes, and n_out output nodes.
2. Initialize all weights to small random numbers.
3. Until the error is small, do:
   For each example X do
     Propagate example X forward through the network
     Propagate the errors backward through the network

Forward propagation: given example X, compute the output of every node, layer by layer from the input nodes through the internal nodes to the output nodes, applying the sigmoid function at each unit.
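To illustrate the forward-propagation step, the following sketch computes the outputs of every node in a network with one layer of internal nodes; the matrix names W_hidden and W_output and the convention of storing the bias weight in column 0 are assumptions made for this example.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, W_hidden, W_output):
    """Forward propagation through one hidden layer and one output layer.

    x        : input vector, shape (n_in,)
    W_hidden : shape (n_hidden, n_in + 1); column 0 holds the bias weights w0
    W_output : shape (n_out, n_hidden + 1); column 0 holds the bias weights w0
    Returns the hidden-unit outputs and the output-unit outputs (all sigmoid units).
    """
    x_b = np.concatenate(([1.0], x))        # fixed input x0 = 1 for the bias
    O_hidden = sigmoid(W_hidden @ x_b)      # outputs of the internal nodes
    h_b = np.concatenate(([1.0], O_hidden))
    O_out = sigmoid(W_output @ h_b)         # outputs of the output nodes
    return O_hidden, O_out
```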

Backward propagation:

A. For each output node k compute the error term:
   δ_k = O_k (1 - O_k)(t_k - O_k)

B. For each hidden unit h, compute the error term:
   δ_h = O_h (1 - O_h) Σ_k W_kh δ_k

C. Update each network weight:
   W_ji = W_ji + ΔW_ji, where ΔW_ji = η δ_j X_ji
   (X_ji is the input from node i to node j, and W_ji is the corresponding weight.)

A momentum term, depending on the weight update of the previous iteration, may also be added to the update rule. At iteration n we then have:

ΔW_ji(n) = η δ_j X_ji + α ΔW_ji(n-1)

where α (0 ≤ α < 1) is a constant called the momentum. Adding momentum:
1. can carry the search through small local minima, and
2. increases the speed of learning along flat regions of the error surface.

Remarks on back-propagation:
1. It implements a gradient descent search over the weight space.
2. It may become trapped in local minima.
3. In practice, it is very effective.
4. How to avoid local minima?
   a) Add momentum.
   b) Use stochastic gradient descent.
   c) Train several networks with different initial values for the weights.

Multi-layered perceptrons have high representational power. They can represent the following:
1. Boolean functions: every boolean function can be represented by a network with two layers of units.
2. Continuous functions: every bounded continuous function can be approximated (to arbitrary accuracy) by a network with two layers of units.
3. Arbitrary functions: any function can be approximated by a network with three layers of units.
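Putting steps A-C together, here is a sketch of one back-propagation update for a single training example, including the momentum term; the single-hidden-layer layout and the variable names follow the forward-propagation sketch above and are likewise illustrative.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_step(x, t, W_hidden, W_output, dW_hidden_prev, dW_output_prev,
                  eta=0.1, alpha=0.9):
    """One back-propagation update for a single example (x, t), with momentum."""
    # Forward propagation (fixed bias input x0 = 1 prepended to each layer's input).
    x_b = np.concatenate(([1.0], x))
    O_h = sigmoid(W_hidden @ x_b)                  # hidden-unit outputs
    h_b = np.concatenate(([1.0], O_h))
    O_k = sigmoid(W_output @ h_b)                  # output-unit outputs

    # A. Error terms for the output nodes: delta_k = O_k (1 - O_k)(t_k - O_k)
    delta_k = O_k * (1.0 - O_k) * (t - O_k)

    # B. Error terms for the hidden units: delta_h = O_h (1 - O_h) sum_k W_kh delta_k
    #    (column 0 of W_output holds the bias weights and is skipped)
    delta_h = O_h * (1.0 - O_h) * (W_output[:, 1:].T @ delta_k)

    # C. Weight updates with momentum: dW(n) = eta * delta_j * X_ji + alpha * dW(n-1)
    dW_output = eta * np.outer(delta_k, h_b) + alpha * dW_output_prev
    dW_hidden = eta * np.outer(delta_h, x_b) + alpha * dW_hidden_prev
    return W_hidden + dW_hidden, W_output + dW_output, dW_hidden, dW_output
```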

Generalization and overfitting: one obvious stopping criterion for back-propagation is to continue iterating until the error falls below some threshold; this, however, can lead to overfitting.

[Figure: training-set error keeps decreasing with the number of weight updates, while validation-set error eventually starts to increase]

Overfitting can be avoided using the following strategies (a short sketch of the validation-based stopping rule appears after the list of applications below):
- Use a validation set and stop training when the error on this set stops decreasing.
- Use 10-fold cross validation.
- Use weight decay: the weights are decreased slowly on each iteration.

Applications of Neural Networks

Neural networks have broad applicability to real-world business problems and have already been applied successfully in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction and forecasting needs, including:
- sales forecasting
- industrial process control
- customer research
- data validation
- risk management
- target marketing
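Here is the validation-based stopping sketch referred to above; train_epoch, validation_error and copy_weights are hypothetical placeholders standing in for the back-propagation code sketched earlier, and the patience-based rule is one common way to implement "stop when the validation error stops decreasing".

```python
def train_with_early_stopping(network, train_set, validation_set,
                              train_epoch, validation_error,
                              patience=10, max_epochs=1000):
    """Stop when the validation error has not improved for `patience` epochs.

    train_epoch(network, train_set) runs one pass of back-propagation;
    validation_error(network, validation_set) returns the current error.
    Both are hypothetical helpers, not functions defined in the notes.
    """
    best_error = float("inf")
    best_weights = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_epoch(network, train_set)
        err = validation_error(network, validation_set)
        if err < best_error:
            best_error = err
            best_weights = network.copy_weights()   # assumed method on `network`
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_weights, best_error
```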

Because of their adaptive and nonlinear nature, neural networks have also been used in a number of control-system application domains, such as:
- process control in chemical plants
- unmanned vehicles
- robotics
- consumer electronics

Neural networks are also used in a number of other applications that are too hard to model using classical techniques, including computer vision, path planning and user modeling.

Questions

1. Assume a learning problem where each example is represented by four attributes. Each attribute can take two values from the set {a, b}. Run the candidate elimination algorithm on the following examples and indicate the resulting version space. What is the size of the version space?

   ((a, b, b, a), +)
   ((b, b, b, b), +)
   ((a, b, b, b), +)
   ((a, b, a, a), -)

2. Decision trees: the following table contains training examples that help a robot janitor predict whether or not an office contains a recycling bin.

      STATUS    DEPT.   OFFICE SIZE   RECYCLING BIN?
   1. faculty   ee      large         no
   2. staff     ee      small         no
   3. faculty   cs      medium        yes
   4. student   ee      large         yes
   5. staff     cs      medium        no
   6. faculty   cs      large         yes
   7. student   ee      small         yes
   8. staff     cs      medium        no

   Use the entropy splitting function to construct a minimal decision tree that predicts whether or not an office contains a recycling bin. Show each step of the computation.

3. Translate the above decision tree into a collection of decision rules.

4. Neural networks: consider the following function of two variables:

   A   B   Desired Output
   0   0   1
   1   0   0
   0   1   0
   1   1   1

   Prove that this function cannot be learned by a single perceptron that uses the step function as its activation function.

5. Construct by hand a perceptron that can calculate the logic function implies (=>). Assume that 1 = true and 0 = false for all inputs and outputs.

Solution

1. The general and specific boundaries of the version space are as follows:
   S: {(?, b, b, ?)}
   G: {(?, ?, b, ?)}
   The size of the version space is 2.

2. The solution steps are as follows (a short Python check of the entropy values below is given after the decision rules).

   Selecting the attribute for the root node:

   Status:
     faculty (2 yes, 1 no):   (3/8)*[ -(2/3)*log_2(2/3) - (1/3)*log_2(1/3) ]
     staff   (0 yes, 3 no): + (3/8)*[ -(0/3)*log_2(0/3) - (3/3)*log_2(3/3) ]
     student (2 yes, 0 no): + (2/8)*[ -(2/2)*log_2(2/2) - (0/2)*log_2(0/2) ]
     TOTAL: 0.35 + 0 + 0 = 0.35

   Size:
     large  (2 yes, 1 no):   (3/8)*[ -(2/3)*log_2(2/3) - (1/3)*log_2(1/3) ]
     medium (1 yes, 2 no): + (3/8)*[ -(1/3)*log_2(1/3) - (2/3)*log_2(2/3) ]
     small  (1 yes, 1 no): + (2/8)*[ -(1/2)*log_2(1/2) - (1/2)*log_2(1/2) ]
     TOTAL: 0.35 + 0.35 + 0.25 = 0.95

   Dept:
     ee (2 yes, 2 no):   (4/8)*[ -(2/4)*log_2(2/4) - (2/4)*log_2(2/4) ]
     cs (2 yes, 2 no): + (4/8)*[ -(2/4)*log_2(2/4) - (2/4)*log_2(2/4) ]
     TOTAL: 0.5 + 0.5 = 1

   Since Status is the attribute with the lowest entropy, it is selected as the root node:

                         Status
               /            |           \
          faculty         staff        student
        1-, 3+, 6+      2-, 5-, 8-     4+, 7+
          BIN=?           BIN=NO       BIN=YES

   Only the branch corresponding to Status=faculty needs further processing.

   Selecting the attribute to split the branch Status=faculty:

   Dept:
     ee (0 yes, 1 no):   (1/3)*[ -(0/1)*log_2(0/1) - (1/1)*log_2(1/1) ]
     cs (2 yes, 0 no): + (2/3)*[ -(2/2)*log_2(2/2) - (0/2)*log_2(0/2) ]
     TOTAL: 0 + 0 = 0

   Since the minimum possible entropy is 0 and Dept has that minimum possible entropy, it is selected as the best attribute to split the branch Status=faculty. Note that we don't even have to calculate the entropy of the attribute Size, since it cannot possibly be better than 0.

                         Status
               /            |           \
          faculty         staff        student
        1-, 3+, 6+      2-, 5-, 8-     4+, 7+
           Dept           BIN=NO       BIN=YES
          /    \
         ee     cs
         1-    3+, 6+
       BIN=NO  BIN=YES

   Each branch now ends in a homogeneous set of examples, so the construction of the decision tree ends here.

3. The rules are:

   IF status=faculty AND dept=ee THEN recycling-bin=no
   IF status=faculty AND dept=cs THEN recycling-bin=yes
   IF status=staff THEN recycling-bin=no
   IF status=student THEN recycling-bin=yes
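As the quick check promised in solution 2, the following sketch recomputes the weighted entropy of the class label for each candidate root attribute; the function and variable names are chosen for this example.

```python
from math import log2

# Training examples: (status, dept, size, bin)
examples = [
    ("faculty", "ee", "large",  "no"),
    ("staff",   "ee", "small",  "no"),
    ("faculty", "cs", "medium", "yes"),
    ("student", "ee", "large",  "yes"),
    ("staff",   "cs", "medium", "no"),
    ("faculty", "cs", "large",  "yes"),
    ("student", "ee", "small",  "yes"),
    ("staff",   "cs", "medium", "no"),
]

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * log2(p)
    return result

def weighted_entropy(attr_index):
    """Entropy of the class label after splitting on the given attribute."""
    n = len(examples)
    total = 0.0
    for v in set(e[attr_index] for e in examples):
        labels = [e[3] for e in examples if e[attr_index] == v]
        total += len(labels) / n * entropy(labels)
    return total

for name, idx in [("Status", 0), ("Dept", 1), ("Size", 2)]:
    print(name, round(weighted_entropy(idx), 2))   # Status 0.34, Dept 1.0, Size 0.94
```

Up to the per-term rounding used in the hand computation (0.344 ≈ 0.35), these values match the results above: Status 0.35, Dept 1, Size 0.95.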

4. This function cannot be computed by a single perceptron because it is not linearly separable. Plotting the four examples in the A-B plane makes this clear:

   [Figure: the A-B unit square, with output 1 at (0,0) and (1,1) and output 0 at (1,0) and (0,1)]

   The two positive points lie on one diagonal of the square and the two negative points on the other, so no single straight line (and hence no single perceptron with a step activation function) can separate the positives from the negatives.

5. Choose g to be the threshold function (output 1 if its input is greater than 0, output 0 otherwise). There are two inputs to the implies function, so the perceptron computes

   output = g(W1*X1 + W2*X2 - W0)

   where W0 acts as the threshold. Arbitrarily choose W2 = 1 and substitute into the expression above for all values of X1 and X2. Whether each resulting expression must be greater than or less than 0 is decided by the threshold function g and the value of X1 => X2. For example, when X1 = 0 and X2 = 0, the resulting expression must be greater than 0, since X1 => X2 is true; similarly, when X1 = 1 and X2 = 0, the resulting expression must be less than 0, since X1 => X2 is false. Since W0 < 0, W0 > W1, and W0 - W1 < 1, choose W1 = -1 and substitute where possible.

Since W0 < 0 and W0 > -1, choose W0 = -0.5. Thus W0 = -0.5, W1 = -1, W2 = 1. Finally, the resulting perceptron computes g(-X1 + X2 + 0.5), which reproduces the truth table of X1 => X2.
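A short check of this construction (the function name is illustrative; the unit outputs 1 exactly when W1*X1 + W2*X2 > W0, matching the threshold convention used in the derivation above):

```python
def implies_perceptron(x1, x2, w0=-0.5, w1=-1.0, w2=1.0):
    """Threshold unit: output 1 if w1*x1 + w2*x2 > w0, else 0."""
    return 1 if w1 * x1 + w2 * x2 > w0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, implies_perceptron(x1, x2))
# Prints 1, 1, 0, 1 for (0,0), (0,1), (1,0), (1,1): the truth table of x1 => x2.
```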