
Multilayer Perceptrons and Backpropagation
Informatics 1 CG: Lecture 6
Mirella Lapata
School of Informatics, University of Edinburgh
mlap@inf.ed.ac.uk
Reading: Kevin Gurney's Introduction to Neural Networks, Chapters 5 to 6.5
January 2016

Recap: Perceptrons

Connectionism is a computer modelling approach inspired by neural networks.
Anatomy of a connectionist model: units, connections.
The Perceptron as a linear classifier.
A learning algorithm for Perceptrons.
Key limitation: it only works for linearly separable data.

[Figure: a Perceptron with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, and output $y = \sum_{i=1}^{n} w_i x_i$]
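Not from the slides: a minimal sketch of the Perceptron as a linear classifier, trained with the perceptron learning rule that the lecture recalls later ($w_i \leftarrow w_i + \eta(t - o)x_i$). It assumes a step threshold at zero and no bias term, matching the diagram's $y = \sum_i w_i x_i$; the OR function, learning rate, and epoch count are illustrative choices.

```python
import numpy as np

def perceptron_train(X, t, eta=0.1, epochs=10):
    """Learn weights for y = step(sum_i w_i x_i) with the perceptron rule."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_p, t_p in zip(X, t):
            o = 1 if np.dot(w, x_p) > 0 else 0   # step activation
            w += eta * (t_p - o) * x_p           # w_i <- w_i + eta * (t - o) * x_i
    return w

# OR is linearly separable, so the perceptron rule converges on it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 1])
w = perceptron_train(X, t)
print(w, [1 if np.dot(w, x) > 0 else 0 for x in X])   # learned weights and predictions
```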

Multilayer Perceptrons (MLPs)

[Figure: an MLP with an input layer $x_1, \ldots, x_n$, a hidden layer, and an output layer producing $y$, with a weight on every connection]

MLPs are feed-forward neural networks, organized in layers.
One input layer, one or more hidden layers, one output layer.
Each node in a layer is connected to all the nodes in the next layer.
Each connection has a weight (which can be zero).

Activation Functions

Step function: outputs 0 or 1.
Sigmoid function: outputs a real value between 0 and 1.

[Figure: plots of the step and sigmoid activation functions]

Sigmoids

[Figure: an MLP built from sigmoid units, showing the input layer, hidden layer, and output layer]

Learning with MLPs

As with perceptrons, finding the right weights is very hard!
Solution technique: learning!
Learning: adjusting the weights based on training examples.
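The two activation functions just described can be written down directly. This short sketch is mine, not the lecture's; the sigmoid uses the form $1/(1 + e^{-au})$ with a slope constant a, matching the definition given later in the lecture.

```python
import numpy as np

def step(u):
    """Threshold activation: outputs 0 or 1 (not differentiable at 0)."""
    return np.where(u > 0, 1.0, 0.0)

def sigmoid(u, a=1.0):
    """Sigmoid activation: outputs a real value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-a * u))

u = np.linspace(-5, 5, 5)
print(step(u))     # [0. 0. 0. 1. 1.]
print(sigmoid(u))  # smooth values between 0 and 1
```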

Supervised Learning: General Idea

1. Send the MLP an input pattern, x, from the training set.
2. Get the output from the MLP, y.
3. Compare y with the right answer, or target t, to get the error quantity.
4. Use the error quantity to modify the weights, so next time y will be closer to t.
5. Repeat with another x from the training set.

When updating the weights after seeing x, the network doesn't just change the way it deals with x, but other inputs too... inputs it has not seen yet!
Generalization is the ability to deal accurately with unseen inputs.

Learning and Error Minimization

Recall: Perceptron Learning Rule
Minimize the difference between the actual and desired outputs:
$w_i \leftarrow w_i + \eta (t - o) x_i$

Error Function: Mean Squared Error (MSE)
An error function represents such a difference over a set of inputs:
$E(\vec{w}) = \frac{1}{2N} \sum_{p=1}^{N} (t_p - o_p)^2$
where N is the number of patterns, $t_p$ is the target output for pattern p, and $o_p$ is the output obtained for pattern p.
The factor of 1/2 makes little difference, but makes life easier later on!

Gradient Descent

One technique that can be used for minimizing functions is gradient descent. Can we use this on our error function E?
We would like a learning rule that tells us how to update the weights, like this:
$w_{ji} = w_{ji} + \Delta w_{ji}$
But what should $\Delta w_{ji}$ be?

Gradient and Derivatives: The Idea

The derivative is a measure of the rate of change of a function as its input changes.
For a function y = f(x), the derivative $\frac{dy}{dx}$ indicates how much y changes in response to changes in x.
If x and y are real numbers, and the graph of y is plotted against x, the derivative measures the slope or gradient of the line at each point, i.e., it describes the steepness or incline.
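The MSE defined above translates directly into a few lines of code. This is a small sketch of my own, not the lecture's, using the 1/(2N) factor; the target and output values are made up for illustration.

```python
import numpy as np

def mse(targets, outputs):
    """E(w) = 1/(2N) * sum_p (t_p - o_p)^2 over N patterns."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    n = len(targets)
    return np.sum((targets - outputs) ** 2) / (2 * n)

# Made-up targets and network outputs for four patterns.
print(mse([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.1]))  # 0.01875
```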

Gradient and Derivatives: The Idea

$\frac{dy}{dx} > 0$ implies that y increases as x increases. If we want to find the minimum of y, we should reduce x.
$\frac{dy}{dx} < 0$ implies that y decreases as x increases. If we want to find the minimum of y, we should increase x.
$\frac{dy}{dx} = 0$ implies that we are at a minimum, a maximum, or a plateau.
To get closer to the minimum: $x_{\text{new}} = x_{\text{old}} - \eta \frac{dy}{dx}$

So, we know how to use derivatives to adjust one input value. But we have several weights to adjust!
We need to use partial derivatives.
A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant.
Example: if $y = f(x_1, x_2)$, then we can have $\frac{\partial y}{\partial x_1}$ and $\frac{\partial y}{\partial x_2}$.
In our learning rule case, if we can work out the partial derivatives, we can use this rule to update the weights:
$w_{ji} = w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}}$.

Summary So Far

We learnt what a multilayer perceptron is.
We know a learning rule for updating weights in order to minimize the error:
$w_{ji} = w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}}$
$\Delta w_{ji}$ tells us in which direction and how much we should change each weight to roll down the slope (descend the gradient) of the error function E.

Using Gradient Descent to Minimize the Error

So, how do we calculate $\frac{\partial E}{\partial w_{ji}}$?
The mean squared error function E, which we want to minimize, is:
$E(\vec{w}) = \frac{1}{2N} \sum_{p=1}^{N} (t_p - o_p)^2$
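To make the one-variable update rule $x_{\text{new}} = x_{\text{old}} - \eta \frac{dy}{dx}$ concrete, here is a tiny sketch of my own that minimizes $y = (x - 3)^2$, whose derivative we can write by hand; the starting point, learning rate, and iteration count are arbitrary.

```python
def dy_dx(x):
    """Derivative of y = (x - 3)**2 with respect to x."""
    return 2.0 * (x - 3.0)

x = 0.0      # arbitrary starting point
eta = 0.1    # arbitrary learning rate
for _ in range(50):
    x = x - eta * dy_dx(x)   # x_new = x_old - eta * dy/dx
print(x)     # close to 3.0, the minimizer of y = (x - 3)**2
```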

Using Gradient Descent to Minimize the Error

If we use a sigmoid activation function f, then the output of neuron j for pattern p is:
$o_{pj} = f(u_j) = \frac{1}{1 + e^{-a u_j}}$
where a is a pre-defined constant and $u_j$ is the result of the input function in neuron j:
$u_j = \sum_i w_{ji} x_i$

For the pth pattern and the jth neuron, we use gradient descent on the error function:
$\Delta w_{ji} = -\eta \frac{\partial E_p}{\partial w_{ji}} = \eta (t_{pj} - o_{pj}) f'(u_j) x_i$
where $f'(u_j) = \frac{df}{du_j}$ is the derivative of f with respect to $u_j$.
If f is the sigmoid function, $f'(u_j) = a f(u_j)(1 - f(u_j))$.

We can update the weights after processing each pattern, using the rule:
$\Delta w_{ji} = \eta (t_{pj} - o_{pj}) f'(u_j) x_i = \eta \delta_{pj} x_i$
This is known as the generalized delta rule.
We need to use the derivative of the activation function f, so f must be differentiable!
The threshold activation function is not continuous, and thus not differentiable.
The sigmoid has a derivative which is easy to calculate.

Updating Output vs Hidden Neurons

We can update output neurons using the generalized delta rule:
$\Delta w_{ji} = \eta \delta_{pj} x_i$, with $\delta_{pj} = (t_{pj} - o_{pj}) f'(u_j)$
This $\delta_{pj}$ is only good for the output neurons, since it relies on the target outputs.
But we don't have target outputs for the hidden nodes! What can we use instead?
$\delta_{pj} = \Big(\sum_k w_{kj} \delta_{pk}\Big) f'(u_j)$
This rule propagates the error back from output nodes to hidden nodes. In effect, it blames hidden nodes according to how much influence they had.
So, now we have rules for updating both output and hidden neurons!
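One way to convince yourself of the gradient expression derived in this section is a quick numerical check. The sketch below is mine, not the lecture's: for a single sigmoid neuron with made-up weights and one pattern, and per-pattern error $E_p = \frac{1}{2}(t_p - o_p)^2$, it compares the analytic form $(t_p - o_p) f'(u) x_i$ against a finite-difference estimate of $-\partial E_p / \partial w_i$.

```python
import numpy as np

a = 1.0  # sigmoid slope constant (assumed value)

def f(u):
    return 1.0 / (1.0 + np.exp(-a * u))

def f_prime(u):
    fu = f(u)
    return a * fu * (1.0 - fu)          # f'(u) = a f(u) (1 - f(u))

def E_p(w, x, t):
    """Per-pattern squared error: E_p = 1/2 (t - o)^2 with o = f(sum_i w_i x_i)."""
    return 0.5 * (t - f(np.dot(w, x))) ** 2

w = np.array([0.3, -0.8, 0.5])          # made-up weights
x = np.array([1.0, 0.4, -0.6])          # made-up input pattern
t = 1.0                                 # made-up target

u = np.dot(w, x)
analytic = (t - f(u)) * f_prime(u) * x  # (t_p - o_p) f'(u) x_i  ==  -dE_p/dw_i

eps = 1e-6
numeric = np.zeros(3)
for i in range(3):
    d = np.zeros(3); d[i] = eps
    numeric[i] = -(E_p(w + d, x, t) - E_p(w - d, x, t)) / (2 * eps)

print(analytic)
print(numeric)   # matches the analytic gradient to several decimal places
```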

Backpropagation: Illustration

[Figure sequence: the steps below illustrated on an MLP diagram, one frame per step]

1. Present the pattern at the input layer.
2. Propagate forward activations.
3. Calculate the error for the output neurons and propagate backward the error.
4. Calculate $\frac{\partial E}{\partial w_{ji}}$.
5. Repeat for all patterns and sum up.

Online Backpropagation

1: Initialize all weights to small random values.
2: repeat
3:   for each training example do
4:     Forward propagate the input features of the example to determine the MLP's outputs.
5:     Back propagate the error to generate $\Delta w_{ji}$ for all weights $w_{ji}$.
6:     Update the weights using $\Delta w_{ji}$.
7:   end for
8: until the stopping criteria are reached.
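The online algorithm above maps onto a short program. Below is a minimal sketch, not the lecture's code, of online backpropagation for a single-hidden-layer MLP trained on XOR; the hidden-layer size, learning rate, sigmoid slope, bias terms, epoch budget, and random seed are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a, eta = 1.0, 0.5                 # sigmoid slope and learning rate (assumed values)

def f(u):                         # sigmoid activation
    return 1.0 / (1.0 + np.exp(-a * u))

def f_prime(u):                   # f'(u) = a f(u) (1 - f(u))
    fu = f(u)
    return a * fu * (1.0 - fu)

# Toy XOR data: not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# Initialize all weights to small random values.
W1 = rng.normal(scale=0.5, size=(8, 2)); b1 = np.zeros(8)   # hidden layer
W2 = rng.normal(scale=0.5, size=(1, 8)); b2 = np.zeros(1)   # output layer

for epoch in range(10000):            # stopping criterion: fixed epoch budget
    for x, t in zip(X, T):            # online: update after each example
        u1 = W1 @ x + b1; h = f(u1)   # forward propagate
        u2 = W2 @ h + b2; o = f(u2)
        d2 = (t - o) * f_prime(u2)            # output deltas
        d1 = (W2.T @ d2) * f_prime(u1)        # back-propagated hidden deltas
        W2 += eta * np.outer(d2, h); b2 += eta * d2   # update using delta-w
        W1 += eta * np.outer(d1, x); b1 += eta * d1

for x, t in zip(X, T):
    o = f(W2 @ f(W1 @ x + b1) + b2)
    print(x, t, np.round(o, 2))       # outputs should approach the XOR targets
```

A batch (offline) variant would instead accumulate the $\Delta w_{ji}$ over all patterns and sum up, as in the illustration above, applying one combined update per pass through the training set.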

Summary

We learnt what a multilayer perceptron is.
We have some intuition about using gradient descent on an error function.
We know a learning rule for updating weights in order to minimize the error:
$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}}$
If we use the squared error, we get the generalized delta rule: $\Delta w_{ji} = \eta \delta_{pj} x_i$.
We know how to calculate $\delta_{pj}$ for output and hidden layers.
We can use this rule to learn an MLP's weights using the backpropagation algorithm.
Next lecture: a neural network model of the past tense.