Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks MGS 2018 - Lecture 2

OVERVIEW
- Biological Neural Networks
- Cell Topology: Input, Output, and Hidden Layers
- Functional description
- Cost functions
- Training ANNs
- Back-Propagation

ARTIFICIAL NEURAL NETS
- The feed-forward neural network / multilayer perceptron is one of many ANNs
- Fixed number of basis functions, with each basis function adaptive (i.e. tuned via a parameter)
- We focus on the multilayer perceptron: really multiple layers of logistic regression models, with continuous nonlinearities (the perceptron's nonlinearity is non-continuous)
- Neural nets are good, but the likelihood function isn't a convex function of the model parameters

ORIGINS
- Originally developed as algorithms that mimic the brain (bio-inspired)
- Developed around the 1960s (Perceptron model, Frank Rosenblatt)
- Killed off for some time by Marvin Minsky, who led a personal vendetta to divert funding to AI research (book: Perceptrons)
- Resurgence in the 1980s and 1990s
- Recent resurgence due to increased computational power and the advent of Deep Learning

BIO-INSPIRATION
[Figure: a biological neuron, with input wires, signal processing in the cell, and an output wire]

LOGISTIC REGRESSION
Linear combinations of fixed nonlinear basis functions:
y(x, w) = f\left( \sum_{j=1}^{M} w_j \phi_j(x) \right)
where f(.) is a non-linear activation function for classification and the identity for regression.
ANNs replace these by parameterised basis functions:
a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}
The superscript (1) indicates layer 1 (the input layer).

ACTIVATION AND OUTPUT
[Figure: a -> f(a) -> y]
a = (net) activation
y = f(a) = output

ACTIVATION AND OUTPUT
[Figure: inputs x_1, ..., x_i, ..., x_k with weights w_{j1}, ..., w_{ji}, ..., w_{jk} feeding a unit that computes a and f(a), producing y]
a = \sum_{i=1}^{k} w_{ji} x_i
The output y functions as input x_i to units in the next layer.

PERCEPTRON
A perceptron maps a real-valued input x to a binary output:
f(x) = 1 if w \cdot x + b > 0, and 0 otherwise
We focus on logistic regression functions more than perceptrons; perceptrons have mostly historical value.
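As a minimal sketch of the decision rule above (the toy weights and input are my own assumptions, not from the lecture):

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron from the slide: output 1 if w.x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical example: a 2-D input with hand-picked weights.
x = np.array([0.5, -1.0])
w = np.array([2.0, 1.0])
b = -0.1
print(perceptron(x, w, b))  # prints 0, since 2*0.5 + 1*(-1.0) - 0.1 < 0
```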

SIMPLE NETWORK
The simplest ANNs consist of:
- A layer of D input nodes
- A layer of hidden nodes
- A layer of output nodes
with full connectivity between layers.

INPUT LAYER
One input node for every feature/dimension. The output of the input layer serves as a linear combinatory input to the hidden units:
a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}
[Figure: two-layer network with inputs x_0, x_1, ..., x_D, hidden units z_0, z_1, ..., z_M, and outputs y_1, ..., y_K]

CURSE OF DIMENSIONALITY
Fundamental machine learning concept: the Curse of Dimensionality. When D becomes large, learning problems can become very difficult. For example:
- when dividing a space in R^D into regular cells, the number of cells grows exponentially with D
- in linear regression, a polynomial model of order M has on the order of D^M coefficients
- a sphere in high dimension has most of its volume in an infinitesimally thin slice near the surface

CURSE OF DIM (EXAMPLE)
[Figures: regular sub-division of the feature space; growth of a general polynomial]
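To make the exponential cell growth from the previous slide concrete, a short sketch (the choice of 10 divisions per dimension is an assumed illustration, not taken from the slides):

```python
# Number of regular cells when each of D dimensions is split into k intervals: k**D.
k = 10
for D in (1, 2, 3, 10, 20):
    print(D, k ** D)
# D=2 gives 100 cells; D=10 already gives 10_000_000_000 cells.
```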

HIDDEN LAYER
Hidden layer(s) can:
- have an arbitrary number of nodes/units
- have an arbitrary number of links from input nodes and to output nodes (or to the next hidden layer)
- be stacked: there can be multiple hidden layers
The default is a fully interconnected graph, i.e. every input node is linked to every hidden node, and every hidden node to every output node.

HIDDEN UNIT ACTIVATION
z_j = h(a_j)
Common choices for h(.) are the sigmoid or tanh:
f(x) = \frac{1}{1 + e^{-x}}
f(x) = \tanh(x)
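A small sketch of the two hidden-unit nonlinearities named above (plain NumPy; the function names are mine):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: squashes activations into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh_act(a):
    """Hyperbolic tangent: squashes activations into (-1, 1)."""
    return np.tanh(a)

a = np.array([-2.0, 0.0, 2.0])
print(sigmoid(a))   # [0.119..., 0.5, 0.880...]
print(tanh_act(a))  # [-0.964..., 0.0, 0.964...]
```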

RELU
f(x) = \sum_{i=1}^{\infty} \sigma(x - i + 0.5) \approx \log(1 + e^x)
A new trend, responsible for a great deal of Deep Learning's success:
- No vanishing gradient problem
- Can model any positive real value
- Can stimulate sparseness
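A sketch of the rectifier and the smooth softplus form log(1 + e^x) that the approximation above converges to (function names are my own assumptions):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return np.maximum(0.0, x)

def softplus(x):
    """Smooth form log(1 + e^x); close to the ReLU for large |x|."""
    return np.log1p(np.exp(x))

x = np.linspace(-3, 3, 7)
print(relu(x))
print(softplus(x))  # at x = 3: log(1 + e^3) ~ 3.049, already close to relu(3) = 3
```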

OUTPUT LAYER
The output layer can be:
- a single node for binary classification
- a single node for regression
- n nodes for multi-class classification
One network can also cover multiple output variables, thus increasing the number of nodes.

OUTPUT UNIT ACTIVATION
The output unit activation transformation depends on the output type:
- Regression: y_k = a_k (identity)
- Binary classification: y_k = \sigma(a_k), with \sigma(a) = \frac{1}{1 + \exp(-a)}
- Multiclass classification: y_k = \frac{\exp(a_k)}{\sum_j \exp(a_j)} (softmax)
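The three output transformations above as a sketch (assuming the raw output activations a_k have already been computed):

```python
import numpy as np

def identity(a):             # regression
    return a

def sigmoid(a):              # binary classification
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):              # multiclass classification
    e = np.exp(a - np.max(a))    # subtract max for numerical stability
    return e / np.sum(e)

a = np.array([2.0, 1.0, 0.1])
print(softmax(a))  # probabilities summing to 1
```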

OVERALL NETWORK FUNCTION
The network can be represented as a single function of the input variables and the weights:
y_k(x, w) = \sum_{j=1}^{M} w^{(2)}_{kj} h\left( \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \right) + w^{(2)}_{k0}
[Figure: two-layer network with inputs x_0, x_1, ..., x_D, hidden units z_0, z_1, ..., z_M, and outputs y_1, ..., y_K]

SHORTER NETWORK FUNCTION
Biases can be incorporated as unity-valued units (x_0 = z_0 = 1):
y_k(x, w) = \sum_{j=0}^{M} w^{(2)}_{kj} h\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right)
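A sketch of this two-layer network function with the biases absorbed as x_0 = z_0 = 1, exactly as on this slide (the weight shapes, tanh for h and the identity output are my assumptions):

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh):
    """Two-layer network: y_k(x, w) = sum_j W2[k, j] * h(sum_i W1[j, i] * x_i)."""
    x = np.append(1.0, x)          # prepend bias unit x_0 = 1
    z = h(W1 @ x)                  # hidden activations z_j = h(a_j)
    z = np.append(1.0, z)          # prepend bias unit z_0 = 1
    return W2 @ z                  # output activations (identity output here)

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2                  # input, hidden, output sizes (hypothetical)
W1 = rng.normal(size=(M, D + 1))   # first-layer weights, incl. bias column
W2 = rng.normal(size=(K, M + 1))   # second-layer weights, incl. bias column
print(forward(rng.normal(size=D), W1, W2))
```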

TERMINOLOGY
Networks can easily be generalised to an arbitrary number of layers. This gives rise to confusion in naming conventions. The network shown is either a:
- 3-layer network (counting the number of layers of units) - our definition
- 1-hidden-layer network
- 2-layer network (counting the number of layers of adaptive weights)

NETWORK TOPOLOGY
Variations include:
- An arbitrary number of layers
- Fewer hidden units than input units (in effect causing dimensionality reduction, equivalent to PCA)
- Skip-layer connections (see below)
- Fully/sparsely interconnected networks
A large number of possible weight assignments leads to identical functionality: a factor of M! 2^M per hidden layer.

EXPRESSIVE POWER

TRAINING A NETWORK
[Figure: three-layer network with input x = (x_1, ..., x_d), hidden units y_1, ..., y_{nH} connected via weights w_{ji}, output units z_1, ..., z_c connected via weights w_{kj}, and targets t = (t_1, ..., t_c)]

ERROR FUNCTIONS
In order to optimise the performance of ANNs, an error function on the training set must be minimised. This is done by adjusting:
- Weights connecting nodes - intrinsic parameters (optimised during training)
- Network architecture - hyper-parameters (optimised by measuring and comparing the generalisation error on validation data)
- Parameters of the non-linear functions h(a)

BACKPROPAGATION
- Used to calculate derivatives of the error function efficiently
- Errors propagate backwards, layer by layer

W-DEPENDENT ERROR FUNCTION
[Figure: error surface E(w) over a two-dimensional weight space (w_1, w_2), with stationary points w_A, w_B and w_C]

MOVING IN ERROR SPACE
Making a small step in weight space: w -> w + \delta w
Results in a change in error: \delta E \simeq \delta w^T \nabla E(w)
\nabla E(w) points in the direction of greatest change.
Stop condition: \nabla E(w) = 0
[Figure: error surface E(w) over (w_1, w_2) with points w_A, w_B, w_C]

ERROR FUNCTIONS
Regression:
E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2
Binary classification (cross-entropy error):
E(w) = - \sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
Multiple independent binary classifications:
E(w) = - \sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}

ERROR FUNCTIONS
Multi-class classification (mutually exclusive classes):
E(w) = - \sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)
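A sketch of the error functions above (Y and T are assumed to be NumPy arrays of network outputs and targets, patterns along the first axis):

```python
import numpy as np

def sum_of_squares_error(Y, T):
    """Regression: E(w) = 1/2 * sum_n ||y(x_n, w) - t_n||^2."""
    return 0.5 * np.sum((Y - T) ** 2)

def cross_entropy_error(Y, T):
    """Binary / independent binary: -sum {t ln y + (1 - t) ln(1 - y)}."""
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

def multiclass_cross_entropy(Y, T):
    """Mutually exclusive classes, one-hot targets: -sum_n sum_k t_kn ln y_k(x_n, w)."""
    return -np.sum(T * np.log(Y))

Y = np.array([[0.8, 0.1, 0.1]])    # softmax outputs for one pattern (hypothetical)
T = np.array([[1.0, 0.0, 0.0]])    # one-hot target
print(multiclass_cross_entropy(Y, T))  # -ln 0.8 ~ 0.223
```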

NO ANALYTICAL HOPE
- The error function has a highly nonlinear dependence on the weights
- There are many points in weight space where the gradient vanishes
- Many inequivalent stationary points (local minima)
- No hope for an analytical solution
Use iterative numerical procedures (e.g. backprop):
w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}

GRADIENT DESCENT
Repeat until convergence:
{ \theta_j := \theta_j - \eta \frac{\partial}{\partial \theta_j} J(\theta_0, ..., \theta_k) }   for j = 1...k
The multi-dimensional case required here:
w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E(w^{(\tau)})
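A sketch of the multi-dimensional update w^{(tau+1)} = w^{(tau)} - eta * grad E(w^{(tau)}) (the learning rate, step count and quadratic toy error are assumptions for illustration):

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, n_steps=100):
    """Iterate w <- w - eta * grad E(w) for a fixed number of steps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad_E(w)
    return w

# Toy error E(w) = ||w||^2 with gradient 2w; minimum at w = 0.
print(gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0]))
```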

GRADIENT DESCENT VARIANTS
Gradient descent is a poor algorithm by itself. Better variants exist:
- Conjugate gradients
- Quasi-Newton methods
- Stochastic gradient descent
- Ballistic methods
New methods come out regularly due to their importance to the field.
- Fletcher, Practical Methods of Optimization (2nd ed.), Wiley, 1987
- Gill, Murray, and Wright, Practical Optimization, Academic Press, 1981
- Nocedal and Wright, Numerical Optimization, Springer, 1999

BACKPROPAGATION
- Used to calculate derivatives of the error function efficiently
- Errors propagate backwards, layer by layer
Iterative minimisation of the error function:
1. Calculate the derivative of the error function w.r.t. the weights
2. Use the derivatives to adjust the weights
Backpropagation refers to the calculation of the derivatives.

GENERAL FORMULATION
Backprop works for:
- An arbitrary feed-forward topology
- Differentiable nonlinear activation functions
- A broad class of error functions
General error function formulation:
E(w) = \frac{1}{N} \sum_{n=1}^{N} E_n(w)

SIMPLE LINEAR CASE
[Figure: linear unit with inputs x_1, ..., x_i, ..., x_m, weights w_{j1}, ..., w_{ji}, and outputs y_k]
For a single input pattern x_n:
E_n = \frac{1}{2} \sum_{k} (y_{nk} - t_{nk})^2
where the sum runs over all output nodes k.

TOPOLOGY & VARIABLES
Note the index order!
[Figure: two-layer network with inputs x_0, x_1, ..., x_D, hidden units z_0, z_1, ..., z_M, outputs y_1, ..., y_K, first-layer weights w^{(1)}_{ji} (e.g. w^{(1)}_{MD}) and second-layer weights w^{(2)}_{kj} (e.g. w^{(2)}_{KM}, w^{(2)}_{10})]

Error for a single training pattern and a single output:
E_{nk} = \frac{1}{2} (y_{nk} - t_{nk})^2
Gradient with respect to a single weight w_{ji}:
\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj}) x_{ni}
A local computation for a single weight, involving the product of an error signal and an input variable.

FORWARD PROPAGATION
a_j = \sum_i w_{ji} z_i
z_j = h(a_j)
[Figure: unit j receiving inputs z_1, ..., z_i via weights w_{j1}, ..., w_{ji}, computing a_j and h(a_j), and sending z_j on to units z_k in the next layer]
Each pattern in the training set results in particular values for z_i, a_j and z_j, calculated through forward propagation through the network.

PRODUCT RULE
E_n's dependency on w_{ji} is only through the summed input a_j, and therefore the chain rule can be applied:
\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}
With \delta_j \equiv \frac{\partial E_n}{\partial a_j} (called errors) and \frac{\partial a_j}{\partial w_{ji}} = z_i (the cell outputs of the previous layer), this gives:
\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i

NODE DELTAS
[Figure: forward signals z_i flowing through weights w_{ji} into unit j; backward errors \delta_k flowing through weights w_{kj} back to \delta_j]
Output nodes: \delta_k = y_k - t_k
Hidden nodes: \delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}, which gives
\delta_j = h'(a_j) \sum_k w_{kj} \delta_k   (the backpropagation formula)

ERROR BACKPROPAGATION
1. Apply an input vector to the network and propagate it forward
2. Evaluate \delta_k for all output units k
3. Backpropagate the \delta's to obtain \delta_j for all hidden units
4. Evaluate the error derivatives as \frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i (\delta_j is the result of backprop, z_i the result of the forward activation of the network)
Weight update: \Delta w_{kj} = -\eta \delta_k z_j
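Putting steps 1-4 together for the two-layer network of these slides, as a minimal sketch (sum-of-squares error, tanh hidden units with h'(a) = 1 - tanh^2(a); all names, shapes and the learning rate are assumptions):

```python
import numpy as np

def backprop(x, t, W1, W2, eta=0.1):
    """One pattern: forward pass, deltas, error derivatives, gradient-descent update."""
    x = np.append(1.0, x)                  # bias unit x_0 = 1
    a = W1 @ x                             # hidden activations a_j
    z = np.append(1.0, np.tanh(a))         # hidden outputs z_j = h(a_j), plus z_0 = 1
    y = W2 @ z                             # linear outputs y_k

    delta_k = y - t                        # output deltas: delta_k = y_k - t_k
    # hidden deltas: delta_j = h'(a_j) * sum_k w_kj delta_k  (skip the bias column of W2)
    delta_j = (1.0 - np.tanh(a) ** 2) * (W2[:, 1:].T @ delta_k)

    # error derivatives dE_n/dw_ji = delta_j * z_i, followed by the weight update
    W2 -= eta * np.outer(delta_k, z)
    W1 -= eta * np.outer(delta_j, x)
    return W1, W2

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 5))   # D=2, M=4, K=2 (hypothetical)
W1, W2 = backprop(rng.normal(size=2), np.array([0.0, 1.0]), W1, W2)
```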

DEEP LEARNING
- Basically a neural network with many hidden layers
- Major breakthrough in pre-training: treat each layer first as an unsupervised restricted Boltzmann machine to initialise the weights, then do standard supervised backpropagation
- Can be used for unsupervised learning and dimensionality reduction
[Figure: deep autoencoder - encoder layers W_1 (2000 units), W_2 (1000), W_3 (500), W_4 down to a 30-unit code layer; decoder layers W_4^T (500), W_3^T (1000), W_2^T (2000), W_1^T; unrolled for training]