Introduction to Machine Learning


Neural Networks
Varun Chandola
February 2, 2017

Contents
1 Extending Perceptrons
2 Multi Layered Perceptrons
   2.1 Generalizing to Multiple Labels
   2.2 An Example of a Multi-Layer Neural Network
   2.3 Properties of the Sigmoid Function
   2.4 Motivation for Using Non-linear Surfaces
3 Feed Forward Neural Networks
4 Backpropagation
   4.1 Derivation of the Backpropagation Rules
5 Final Algorithm
6 Wrapping up Neural Networks
7 Bias Variance Tradeoff

1 Extending Perceptrons

Questions:
- Why not work with the thresholded perceptron? It is not differentiable.
- How do we learn non-linear surfaces?
- How do we generalize to multiple outputs, or to a numeric output?

The reason we do not use the thresholded perceptron is that the resulting objective function is not differentiable. To see this, recall that to compute the gradient for perceptron learning we take the partial derivative of the objective function with respect to every component of the weight vector:

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_j (y_j - w \cdot x_j)^2

If we use the thresholded perceptron, we must replace w \cdot x_j with the thresholded output o in the above equation, where o is -1 if w \cdot x_j < 0 and 1 otherwise. Since o is a step function and therefore not smooth, the objective is not differentiable. Hence we work with the unthresholded perceptron unit.
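As a concrete illustration (not part of the original notes), here is a minimal numpy sketch of this gradient computation for the unthresholded unit; the data arrays and function names are hypothetical.

    import numpy as np

    # Hypothetical training data: 3 examples with 2 features each.
    X = np.array([[0.5, 1.2], [1.0, -0.3], [-0.7, 0.8]])   # rows are the x_j
    y = np.array([1.0, -1.0, 1.0])                          # targets y_j
    w = np.zeros(X.shape[1])

    # E = 1/2 * sum_j (y_j - w . x_j)^2 and its gradient
    # dE/dw_i = -sum_j (y_j - w . x_j) * x_ji.
    def gradient(w, X, y):
        residual = y - X @ w            # (y_j - w . x_j) for every example
        return -X.T @ residual          # vector of partial derivatives

    # One gradient-descent step with learning rate eta.
    eta = 0.1
    w = w - eta * gradient(w, X, y)

Replacing w . x_j with a step output would make E piecewise constant in w, which is exactly why the thresholded unit cannot be trained this way.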

2 Multi Layered Perceptrons

2.1 Generalizing to Multiple Labels

- Goal: distinguish between multiple categories.
- Solution: add another layer - Multi Layer Neural Networks.

Multi-class classification is more widely applicable than binary classification. Applications include handwritten digit recognition, robotics, etc.

[Figure: a multi-layer network with inputs x_0 (bias), x_1, ..., x_d, a hidden layer, and output nodes o_1, o_2, o_3, o_4; each unit computes net = w \cdot x and outputs o = \sigma(net).]

As mentioned earlier, the perceptron unit cannot be used here because it is not differentiable. The linear unit is differentiable, but it can only learn linear discriminating surfaces. To learn non-linear surfaces we need a non-linear unit such as the sigmoid.

2.2 An Example of a Multi-Layer Neural Network

One significant early application of neural networks to autonomous driving was ALVINN (Autonomous Land Vehicle In a Neural Network), developed at Carnegie Mellon University in the early 1990s. More information is available here: http://www.ri.cmu.edu/research_project_detail.html?project_id60&menu_id26. ALVINN is a perception system which learns to control the NAVLAB vehicles by watching a person drive. ALVINN's architecture consists of a single-hidden-layer back-propagation network. The input of the network is a 30x32-unit two-dimensional "retina" which receives input from the vehicle's video camera. Each input unit is fully connected to a layer of five hidden units, which are in turn fully connected to a layer of 30 output units. The output is a linear representation of the direction the vehicle should travel in order to keep the vehicle on the road.

[Figure: the linear unit, the perceptron unit, and the sigmoid unit.]

The sigmoid unit is a smooth, differentiable threshold function with a non-linear output:

\sigma(net) = \frac{1}{1 + e^{-net}}

2.3 Properties of the Sigmoid Function

[Figure: plots of the hard-threshold, linear, and sigmoid transfer functions.]

The threshold output of the sigmoid unit is continuous and smooth, as opposed to that of a perceptron unit or a linear unit. A useful property of the sigmoid is that its derivative can be expressed in terms of the function itself:

\frac{d\sigma(y)}{dy} = \sigma(y)(1 - \sigma(y))

One can also use e^{-ky} instead of e^{-y}, where k controls the steepness of the threshold curve.
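A small sketch (not from the notes) of the sigmoid, its derivative, and the optional steepness parameter k; the function names are illustrative.

    import numpy as np

    def sigmoid(y, k=1.0):
        # sigma(y) = 1 / (1 + exp(-k*y)); k controls the steepness of the curve.
        return 1.0 / (1.0 + np.exp(-k * y))

    def sigmoid_derivative(y, k=1.0):
        # d sigma / dy = k * sigma(y) * (1 - sigma(y)); with k = 1 this is
        # exactly the sigma(y)(1 - sigma(y)) form used in the derivation below.
        s = sigmoid(y, k)
        return k * s * (1.0 - s)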

2.4 Motivation for Using Non-linear Surfaces

As an example, consider the problem of recognizing 10 different vowel sounds from audio input, where the raw sound signal is compressed into two features using spectral analysis. The decision boundaries between the vowel classes in this two-dimensional feature space are highly non-linear, which is why we need a model that can learn non-linear surfaces.

3 Feed Forward Neural Networks

The network consists of:
- d + 1 input nodes (including the bias input),
- m hidden nodes,
- k output nodes.

At the hidden nodes we have weight vectors w_j, 1 \le j \le m, with w_j \in R^{d+1}. At the output nodes we have weight vectors w_l, 1 \le l \le k, with w_l \in R^{m+1}.

The multi-layer neural network shown above is used in a feed-forward mode, i.e., information only flows in one direction (forward). Each hidden node collects the inputs from all input nodes, computes a weighted sum of the inputs, and then applies the sigmoid function to the weighted sum. The output of each hidden node is forwarded to every output node. Each output node collects its inputs (the hidden-node outputs), computes a weighted sum, and applies the sigmoid function to obtain the final output. The class corresponding to the output node with the largest output value is assigned as the predicted class for the input. For implementation, one can represent the weights as two matrices, W^{(1)} of size m x (d+1) and W^{(2)} of size k x (m+1).
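The following numpy sketch (not part of the original notes) illustrates this forward pass with the two weight matrices; the bias handling, sizes, and variable names are assumptions made for the example.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, W2):
        # x: input of length d; W1: m x (d+1); W2: k x (m+1).
        x = np.append(1.0, x)           # prepend the bias input x_0 = 1
        h = sigmoid(W1 @ x)             # outputs of the m hidden nodes
        u = np.append(1.0, h)           # hidden outputs plus bias, fed to the output layer
        o = sigmoid(W2 @ u)             # outputs of the k output nodes
        return h, o

    # The predicted class is the output node with the largest output value.
    d, m, k = 4, 3, 2                   # illustrative sizes only
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(m, d + 1))
    W2 = rng.normal(scale=0.1, size=(k, m + 1))
    _, o = forward(rng.normal(size=d), W1, W2)
    predicted_class = int(np.argmax(o))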

4 Backpropagation

Assume that the network structure is predetermined (the number of hidden nodes and the interconnections are fixed). The objective function for N training examples is

J = \sum_{i=1}^{N} J_i, \quad J_i = \frac{1}{2} \sum_{l=1}^{k} (y_{il} - o_{il})^2

where:
- y_{il} is the target value associated with the l-th class for input x_i: y_{il} = 1 when l is the true class for x_i, and 0 otherwise;
- o_{il} is the predicted output value at the l-th output node for x_i.

What are we learning? The weight vectors for all output and hidden nodes that minimize J.

The first question that comes to mind is: why not use a standard gradient-descent-based minimization, like the one we saw for single perceptron unit learning? The difficulty is that the output at every output node (o_l) depends directly on the weights associated with the output nodes but not on the weights at the hidden nodes, while the input values are used by the hidden nodes and are not visible to the output nodes. To learn all the weights simultaneously, direct minimization is not possible, and methods such as backpropagation need to be employed:

1. Initialize all weights to small values.
2. For each training example \langle x, y \rangle:
   (a) Propagate the input forward through the network.
   (b) Propagate the errors backward through the network.

Gradient Descent: move in the opposite direction of the gradient of the objective function,

\Delta w = -\eta \nabla J, \quad \nabla J = \sum_{i=1}^{N} \nabla J_i

What is the gradient computed with respect to? The weight vectors: m at the hidden nodes, w_j (j = 1, \ldots, m), and k at the output nodes, w_l (l = 1, \ldots, k):

w_j \leftarrow w_j - \eta \frac{\partial J}{\partial w_j} = w_j - \eta \sum_{i=1}^{N} \frac{\partial J_i}{\partial w_j}, \quad
w_l \leftarrow w_l - \eta \frac{\partial J}{\partial w_l} = w_l - \eta \sum_{i=1}^{N} \frac{\partial J_i}{\partial w_l}

Each such gradient is itself a vector of partial derivatives:

\nabla J_i = \left( \frac{\partial J_i}{\partial w_1}, \frac{\partial J_i}{\partial w_2}, \ldots, \frac{\partial J_i}{\partial w_{m+k}} \right), \quad
\frac{\partial J_i}{\partial w_r} = \left( \frac{\partial J_i}{\partial w_{r1}}, \frac{\partial J_i}{\partial w_{r2}}, \ldots \right)

so we need to compute \partial J_i / \partial w_{rq}. The update rule for the q-th entry in the r-th weight vector is

w_{rq} \leftarrow w_{rq} - \eta \frac{\partial J}{\partial w_{rq}} = w_{rq} - \eta \sum_{i=1}^{N} \frac{\partial J_i}{\partial w_{rq}}
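Before deriving \partial J_i / \partial w_{rq} analytically, note that it can also be approximated numerically. The finite-difference check below is not part of the original notes; it is a sketch that assumes a loss_fn helper returning the scalar objective for a given weight matrix.

    import numpy as np

    def numerical_gradient(loss_fn, W, eps=1e-6):
        # Approximate dJ/dw_rq by central finite differences, one entry at a time.
        grad = np.zeros_like(W)
        for r in range(W.shape[0]):
            for q in range(W.shape[1]):
                W[r, q] += eps
                J_plus = loss_fn(W)
                W[r, q] -= 2 * eps
                J_minus = loss_fn(W)
                W[r, q] += eps                      # restore the original entry
                grad[r, q] = (J_plus - J_minus) / (2 * eps)
        return grad

Backpropagation computes the same quantities analytically and far more cheaply; a numerical check like this is mainly useful for verifying an implementation of the update rules derived next.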

4.1 Derivation of the Backpropagation Rules

Assume that we have only one training example, i.e., N = 1 and J = J_i; we drop the subscript i from here onwards. Consider any weight w_{rq}, and let u_{rq} be the q-th element of the input vector coming into the r-th unit.

Observation 1: The weight w_{rq} is connected to J only through net_r = \sum_q w_{rq} u_{rq}. Hence

\frac{\partial J}{\partial w_{rq}} = \frac{\partial J}{\partial net_r} \frac{\partial net_r}{\partial w_{rq}} = \frac{\partial J}{\partial net_r} u_{rq}

Observation 2: For an output node, net_l is connected to J only through the output value of the node, o_l:

\frac{\partial J}{\partial net_l} = \frac{\partial J}{\partial o_l} \frac{\partial o_l}{\partial net_l}

The first term can be computed as

\frac{\partial J}{\partial o_l} = \frac{\partial}{\partial o_l} \frac{1}{2} \sum_{l'} (y_{l'} - o_{l'})^2

The only term in the summation on the right-hand side with a non-zero derivative is the one with l' = l, which gives

\frac{\partial J}{\partial o_l} = \frac{\partial}{\partial o_l} \frac{1}{2} (y_l - o_l)^2 = -(y_l - o_l)

The second term in the chain rule can be computed as

\frac{\partial o_l}{\partial net_l} = \frac{\partial \sigma(net_l)}{\partial net_l} = o_l (1 - o_l)

where the last step uses the fact that o_l is the output of a sigmoid. Combining the two results,

\frac{\partial J}{\partial net_l} = -(y_l - o_l) o_l (1 - o_l)

Let \delta_l = (y_l - o_l) o_l (1 - o_l), so that \partial J / \partial net_l = -\delta_l. Finally, the partial derivative of the error with respect to the weight w_{lj} is

\frac{\partial J}{\partial w_{lj}} = -\delta_l u_{lj}

Update Rule for Output Units:

w_{lj} \leftarrow w_{lj} + \eta \delta_l u_{lj}, \quad \text{where } \delta_l = (y_l - o_l) o_l (1 - o_l)

Question: What is u_{lj} for the l-th output node? It is the j-th input to the l-th output node, which is the output coming from the j-th hidden node.

Observation 3: For a hidden node, net_j is connected to J through all of the output nodes. We have already computed, for the output nodes, \partial J / \partial net_l = -\delta_l with \delta_l = (y_l - o_l) o_l (1 - o_l). This gives us

\frac{\partial J}{\partial o_j} = \sum_l \frac{\partial J}{\partial net_l} \frac{\partial net_l}{\partial o_j} = \sum_l (-\delta_l) w_{lj} = -\sum_l \delta_l w_{lj}

Thus the gradient becomes

\frac{\partial J}{\partial net_j} = \frac{\partial J}{\partial o_j} \frac{\partial o_j}{\partial net_j} = -\left( \sum_l \delta_l w_{lj} \right) o_j (1 - o_j)

and

\frac{\partial J}{\partial w_{jp}} = \frac{\partial J}{\partial net_j} \frac{\partial net_j}{\partial w_{jp}} = -o_j (1 - o_j) \left( \sum_l \delta_l w_{lj} \right) u_{jp} = -\delta_j u_{jp}

Update Rule for Hidden Units:

w_{jp} \leftarrow w_{jp} + \eta \delta_j u_{jp}, \quad \text{where } \delta_j = o_j (1 - o_j) \sum_l \delta_l w_{lj} \text{ and } \delta_l = (y_l - o_l) o_l (1 - o_l)

Question: What is u_{jp} for the j-th hidden node? It is the p-th input to the j-th hidden node, which is the p-th attribute value of the input, i.e., x_p.
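The sketch below (not from the notes) applies these two update rules for a single training example, reusing the forward-pass conventions from the earlier example; the one-hot target encoding and all names are assumptions.

    import numpy as np

    def backprop_step(x, y, W1, W2, eta=0.1):
        # One update for a single example (x, y), where y is the one-hot
        # target vector of length k. W1: m x (d+1), W2: k x (m+1).
        x = np.append(1.0, x)                       # bias input x_0 = 1
        h = 1.0 / (1.0 + np.exp(-(W1 @ x)))         # hidden outputs o_j
        u = np.append(1.0, h)                       # inputs u_lj to the output layer
        o = 1.0 / (1.0 + np.exp(-(W2 @ u)))         # network outputs o_l

        # delta_l = (y_l - o_l) o_l (1 - o_l) at every output node
        delta_out = (y - o) * o * (1.0 - o)
        # delta_j = o_j (1 - o_j) sum_l delta_l w_lj  (skip the bias column of W2)
        delta_hid = h * (1.0 - h) * (W2[:, 1:].T @ delta_out)

        # w_lj <- w_lj + eta delta_l u_lj   and   w_jp <- w_jp + eta delta_j u_jp
        W2 += eta * np.outer(delta_out, u)
        W1 += eta * np.outer(delta_hid, x)
        return W1, W2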

5 Final Algorithm

While not converged:
- Move forward through the network to compute the outputs at the hidden and output nodes.
- Move backward to propagate the errors back:
  - compute the \delta errors at the output nodes (\delta_l),
  - compute the \delta errors at the hidden nodes (\delta_j).
- Update all weights according to the weight update equations.
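Putting the pieces together, here is a hedged sketch of the full training loop; the convergence test, the dataset layout, and the forward and backprop_step helpers from the previous examples are assumptions rather than part of the original notes.

    import numpy as np

    def train(X, Y, W1, W2, eta=0.1, max_epochs=1000, tol=1e-4):
        # X: N x d inputs, Y: N x k one-hot targets.
        prev_loss = np.inf
        for epoch in range(max_epochs):
            for x, y in zip(X, Y):                  # forward, backward, update
                W1, W2 = backprop_step(x, y, W1, W2, eta)
            # Convergence check on the squared-error objective J.
            loss = 0.0
            for x, y in zip(X, Y):
                _, o = forward(x, W1, W2)
                loss += 0.5 * np.sum((y - o) ** 2)
            if abs(prev_loss - loss) < tol:
                break
            prev_loss = loss
        return W1, W2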

6 Wrapping up Neural Networks

- The error function contains many local minima, so there is no guarantee of convergence to a global minimum. In practical deployments this is not a big issue.
- Improving backpropagation:
  - adding momentum,
  - using stochastic gradient descent,
  - training multiple times using different initializations.

Adding momentum to the learning process means adding an inertia term that keeps the current weight update similar to the update made in the previous round.

7 Bias Variance Tradeoff

Neural networks are universal function approximators: by making the model more complex (increasing the number of hidden layers, or the number of hidden nodes m), one can lower the training error. Is the model with the least training error the best model? The simple answer is no: there is a risk of overfitting ("chasing the data"), which leads to high generalization error.

High Variance - Low Bias:
- Chases the data: the model parameters change significantly when the training data is changed, hence the term "high variance".
- Very low training error.
- Poor performance on unseen data.

Low Variance - High Bias:
- Less sensitive to the training data.
- Higher training error.
- Better performance on unseen data.

General rule of thumb: if two models give similar training error, choose the simpler model. What counts as "simple" for a neural network? Small weights in the weight matrices. Why? Because if the weights in the weight vectors at each node are large, the resulting discriminating surface learnt by the neural network will be highly non-linear; if the weights are smaller, the surface will be smoother (and hence simpler). We therefore penalize solutions in which the weights are large, which can be done by introducing a penalty term in the objective function: this is regularization.

Regularization for Backpropagation:

\tilde{J} = J + \frac{\lambda}{2N} \left[ \sum_{j=1}^{m} \sum_{i=1}^{d+1} \left( w^{(1)}_{ji} \right)^2 + \sum_{l=1}^{k} \sum_{j=1}^{m+1} \left( w^{(2)}_{lj} \right)^2 \right]

Other Extensions?
- Use a different loss function (why?): quadratic (squared), cross-entropy, exponential, KL divergence, etc.
- Use a different activation function (why?):
  - Sigmoid: f(z) = \frac{1}{1 + \exp(-z)}
  - Tanh: f(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}
  - Rectified Linear Unit (ReLU): f(z) = \max(0, z)
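A closing sketch (not in the notes) of the alternative activation functions and of the L2-penalized objective; the helper names are illustrative, and the penalty below sums over all weights, matching the formula above.

    import numpy as np

    def tanh(z):
        return np.tanh(z)                       # (e^z - e^-z) / (e^z + e^-z)

    def relu(z):
        return np.maximum(0.0, z)               # max(0, z)

    def regularized_loss(J, W1, W2, lam, N):
        # J_tilde = J + lambda/(2N) * (sum of squared weights in both layers)
        penalty = np.sum(W1 ** 2) + np.sum(W2 ** 2)
        return J + lam / (2.0 * N) * penalty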