Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science, Canberra, February – June 2018

Course outline: Overview; Introduction; Linear Algebra; Probability; Linear Regression 1; Linear Regression 2; Linear Classification 1; Linear Classification 2; Kernel Methods; Sparse Kernel Methods; Mixture Models and EM 1; Mixture Models and EM 2; Neural Networks 1; Neural Networks 2; Principal Component Analysis; Autoencoders; Graphical Models 1; Graphical Models 2; Graphical Models 3; Sampling; Sequential Data 1; Sequential Data 2.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)

Part XIX: Neural Networks 1

Basis Functions
The basis functions play a crucial role in the algorithms explored so far.
- Number and parameters of the basis functions are fixed before learning starts (e.g. linear regression and linear classification).
- Number of basis functions is fixed, but their parameters are adaptive (e.g. neural networks).
- Basis functions are centred on the data, and a subset of them is selected in the training phase (e.g. support vector machines, relevance vector machines).

Outline
- The functional form of the network model (including a special parametrisation of the basis functions).
- How to determine the network parameters within the maximum likelihood framework (the solution of a nonlinear optimisation problem).
- Error backpropagation: efficiently evaluating the derivatives of the log likelihood function with respect to the network parameters.
- Various approaches to regularising neural networks.

Feed-forward Network Functions
Same goal as before: e.g. for regression, decompose t(x) = y(x, w) + ε(x), where ε(x) is the residual error.
(Generalised) linear model:
    y(x, w) = f( Σ_{j=0}^{M} w_j φ_j(x) ),
where φ = (φ_0, ..., φ_M)^T is the fixed set of basis functions and w = (w_0, ..., w_M)^T are the model parameters.
For regression, f(·) is the identity function; for classification, f(·) is a nonlinear activation function.
Goal: let φ_j(x) depend on parameters, and then adjust these parameters together with w.
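To make this concrete, here is a minimal NumPy sketch of a fixed-basis-function model (the Gaussian basis, its centres and width, and all names here are illustrative assumptions, not from the slides): it evaluates y(x, w) = f(Σ_{j=0}^{M} w_j φ_j(x)) with f the identity, as in regression.

```python
import numpy as np

def gaussian_basis(x, centres, width=0.2):
    """phi_j(x) = exp(-(x - c_j)^2 / (2 width^2)), plus a constant phi_0(x) = 1."""
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * width ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])  # prepend the bias basis function

# M = 5 Gaussian basis functions with centres and widths fixed before learning.
centres = np.linspace(-1.0, 1.0, 5)
x = np.linspace(-1.0, 1.0, 50)            # one-dimensional inputs
w = np.random.randn(centres.size + 1)     # model parameters w_0, ..., w_M

Phi = gaussian_basis(x, centres)          # design matrix, shape (50, M + 1)
y = Phi @ w                               # y(x, w) = sum_j w_j phi_j(x); f is the identity
```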

Feed-forward Network Functions
Goal: let φ_j(x) depend on parameters, and then adjust these parameters together with w.
There are many ways to do this. Neural networks use basis functions which follow the same form as the (generalised) linear model: each basis function is itself a nonlinear function of an adaptive linear combination of the inputs.
[Figure: two-layer feed-forward network with inputs x_0, ..., x_D, hidden units z_0, ..., z_M, and outputs y_1, ..., y_K; first-layer weights w^{(1)} connect inputs to hidden units, second-layer weights w^{(2)} connect hidden units to outputs.]

Functional Transformations
Construct M linear combinations of the input variables x_1, ..., x_D in the form
    a_j = Σ_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0},    j = 1, ..., M,
where the a_j are called activations, the w^{(1)}_{ji} weights, and the w^{(1)}_{j0} biases.
Apply a differentiable, nonlinear activation function h(·) to get the output of the hidden units,
    z_j = h(a_j).
h(·) is typically the logistic sigmoid, tanh, or more recently ReLU(x) = max(x, 0).
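For reference, a minimal sketch of the three activation functions mentioned (the function names are my own):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    """Hyperbolic tangent; identical to np.tanh(a)."""
    return np.tanh(a)

def relu(a):
    """ReLU(a) = max(a, 0), applied elementwise."""
    return np.maximum(a, 0.0)
```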

Functional Transformations
The outputs of the hidden units are again linearly combined,
    a_k = Σ_{j=1}^{M} w^{(2)}_{kj} z_j + w^{(2)}_{k0},    k = 1, ..., K.
Apply again a differentiable activation function g(·) to get the network outputs y_k,
    y_k = g(a_k).

Functional Transformations
The activation function g(·) is determined by the nature of the data and the distribution of the target variables.
For standard regression, g(·) is the identity function, so that y_k = a_k.
For multiple binary classification problems, g(·) is the logistic sigmoid function,
    y_k = σ(a_k) = 1 / (1 + exp(−a_k)).

Functional Transformations
Combine all transformations into one formula,
    y_k(x, w) = g( Σ_{j=1}^{M} w^{(2)}_{kj} h( Σ_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} ) + w^{(2)}_{k0} ),
where w contains all weight and bias parameters.

Functional Transformations
As before, the biases can be absorbed into the weights by introducing an extra input x_0 = 1 and an extra hidden unit z_0 = 1:
    y_k(x, w) = g( Σ_{j=0}^{M} w^{(2)}_{kj} h( Σ_{i=0}^{D} w^{(1)}_{ji} x_i ) ),
where w now contains all weight and bias parameters.
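Putting the pieces together, here is a minimal NumPy sketch of the two-layer forward pass with the biases absorbed into the weight matrices; the shapes, names, and the choice of tanh/identity activations are illustrative assumptions.

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, g=lambda a: a):
    """Two-layer network y_k(x, w) = g( sum_j W2[k, j] * h( sum_i W1[j, i] * x_i ) ).

    x  : input vector of length D
    W1 : first-layer weights, shape (M, D + 1); column 0 holds the biases w_j0
    W2 : second-layer weights, shape (K, M + 1); column 0 holds the biases w_k0
    """
    x_tilde = np.concatenate(([1.0], x))      # absorb first-layer biases: x_0 = 1
    z = h(W1 @ x_tilde)                       # hidden-unit outputs z_j = h(a_j)
    z_tilde = np.concatenate(([1.0], z))      # absorb second-layer biases: z_0 = 1
    return g(W2 @ z_tilde)                    # network outputs y_k = g(a_k)

# Example: D = 3 inputs, M = 4 hidden units, K = 2 linear (regression) outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3 + 1))
W2 = rng.normal(size=(2, 4 + 1))
y = forward(rng.normal(size=3), W1, W2)
```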

Comparison to the Perceptron
A neural network looks like a multilayer perceptron, but the perceptron's nonlinear activation function was a step function, which is neither smooth nor differentiable:
    f(a) = +1 if a ≥ 0,   −1 if a < 0.
The activation functions h(·) and g(·) of a neural network are smooth and differentiable.

Linear Activation Functions
If all activation functions are linear, then there exists an equivalent network without hidden units (the composition of linear functions is itself a linear function).
But if the number of hidden units is smaller than the number of input or output units, the resulting linear functions are not the most general ones: this amounts to dimensionality reduction, as in principal component analysis (which comes later in the lecture).
Generally, most neural networks use nonlinear activation functions, as the goal is to approximate a nonlinear mapping from the input space to the outputs.
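A quick numerical check of this claim, as a sketch (bias-free, with an assumed hidden layer narrower than the inputs and outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 3, 2, 3              # hidden layer (M = 2) is narrower than inputs/outputs
W1 = rng.normal(size=(M, D))   # inputs -> hidden, linear activation
W2 = rng.normal(size=(K, M))   # hidden -> outputs, linear activation
x = rng.normal(size=D)

y_two_layer = W2 @ (W1 @ x)    # network with a linear hidden layer
y_one_layer = (W2 @ W1) @ x    # equivalent network without hidden units
assert np.allclose(y_two_layer, y_one_layer)

# Because M < D and M < K, the composed matrix W2 @ W1 has rank at most M = 2,
# so it is not the most general 3 x 3 linear map: a bottleneck, i.e. dimensionality reduction.
print(np.linalg.matrix_rank(W2 @ W1))   # prints 2 (almost surely)
```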

FYI Only: NN Extensions
Add more hidden layers (deep learning). To make this work we need many of the following tricks:
- Clever weight initialisation, to ensure the gradient flows through the entire network.
- Links that additively skip over one or several subsequent layers.
- Favouring ReLU over e.g. the sigmoid, to avoid vanishing gradients.
- Clever regularisation methods such as dropout.
Specific architectures, not further considered here:
- Parameters may be shared, notably as in convolutional neural networks for images.
- A state space model with neural network transitions is a recurrent neural network.
- Attention mechanisms learn to focus on specific parts of an input.

Neural Networks as Universal Function Approximators
Feed-forward neural networks are universal approximators.
Example: a two-layer neural network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy, provided it has enough hidden units.
This holds for a wide range of hidden unit activation functions, but NOT for polynomials.
The remaining big question: where do we get appropriate settings for the weights from? In other words, how do we learn the weights from training examples?

Approximation Capabilities of Neural Networks
Neural network approximating f(x) = x².
Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting network output. Dashed: outputs of the hidden units.
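A minimal sketch of this experiment, assuming plain batch gradient descent on a sum-of-squares error with hand-derived gradients (the slide does not specify the training procedure; the learning rate, iteration count and initialisation are my choices, and backpropagation is covered properly later):

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 data points from (-1, 1), targets t = x^2.
x = rng.uniform(-1.0, 1.0, size=50)
t = x ** 2

# Two-layer network: 1 input, M = 3 tanh hidden units, 1 linear output.
M = 3
w1 = rng.normal(scale=0.5, size=M)   # input -> hidden weights
b1 = np.zeros(M)                     # hidden biases
w2 = rng.normal(scale=0.5, size=M)   # hidden -> output weights
b2 = 0.0                             # output bias

eta = 1e-3                           # learning rate (untuned)
for _ in range(50_000):
    # Forward pass.
    a = np.outer(x, w1) + b1         # hidden activations, shape (N, M)
    z = np.tanh(a)                   # hidden-unit outputs
    y = z @ w2 + b2                  # linear network output, shape (N,)

    # Backward pass for the sum-of-squares error E = 0.5 * sum_n (y_n - t_n)^2.
    dy = y - t                                 # dE/dy_n
    grad_w2 = z.T @ dy
    grad_b2 = dy.sum()
    da = np.outer(dy, w2) * (1.0 - z ** 2)     # dE/da, using d tanh(a)/da = 1 - tanh(a)^2
    grad_w1 = da.T @ x
    grad_b1 = da.sum(axis=0)

    # Gradient descent update.
    w1 -= eta * grad_w1; b1 -= eta * grad_b1
    w2 -= eta * grad_w2; b2 -= eta * grad_b2

final_y = np.tanh(np.outer(x, w1) + b1) @ w2 + b2
print("final sum-of-squares error:", 0.5 * np.sum((final_y - t) ** 2))
```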

Approximation Capabilities of Neural Networks
Neural network approximating f(x) = sin(x).
Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting network output. Dashed: outputs of the hidden units.

Approximation Capabilities of Neural Networks
Neural network approximating f(x) = |x|.
Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting network output. Dashed: outputs of the hidden units.

Approximation Capabilities of Neural Networks
Neural network approximating the Heaviside step function, f(x) = 1 if x ≥ 0, 0 if x < 0.
Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (−1, 1). Red: resulting network output. Dashed: outputs of the hidden units.

Variable Basis Functions in a Neural Network
Hidden layer nodes represent parametrised basis functions z = σ(w_0 + w_1 x_1 + w_2 x_2).
[Figures: surface plots of z over x_1, x_2 ∈ (−10, 10) for (w_0, w_1, w_2) = (0.0, 1.0, 0.1), (0.0, 0.1, 1.0), (0.0, 0.5, 0.5), and (10.0, 0.5, 0.5), showing how the weights control the orientation, steepness, and offset of the sigmoidal ridge.]
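The surfaces can be tabulated in a few lines of NumPy; this sketch (plotting omitted, helper names my own) evaluates the hidden-unit basis function for the four parameter settings listed above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_unit_surface(w0, w1, w2, lim=10.0, n=101):
    """Evaluate z = sigma(w0 + w1*x1 + w2*x2) on an n x n grid over (-lim, lim)^2."""
    x1, x2 = np.meshgrid(np.linspace(-lim, lim, n), np.linspace(-lim, lim, n))
    return x1, x2, sigmoid(w0 + w1 * x1 + w2 * x2)

# The four parameter settings from the slides.
settings = [(0.0, 1.0, 0.1), (0.0, 0.1, 1.0), (0.0, 0.5, 0.5), (10.0, 0.5, 0.5)]
surfaces = [hidden_unit_surface(*w) for w in settings]
# Each surface is a sigmoidal ridge; w1 and w2 set its orientation and steepness,
# while w0 shifts it (the last setting pushes the transition towards a corner of the grid).
```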

Approximation Capabilities of Neural Networks
Neural network for two-class classification: 2 inputs, 2 hidden units with tanh activation functions, and 1 output with logistic sigmoid activation function.
[Figure: Red: y = 0.5 decision boundary of the network. Dashed blue: z = 0.5 contours of the hidden units. Green: optimal decision boundary computed from the known data distribution.]

Weight-space Symmetries
Given a set of weights w, this fixes a mapping from the input space to the output space. Does there exist another set of weights realising the same mapping?
Assume tanh activation functions for the hidden units. As tanh is an odd function, tanh(−a) = −tanh(a).
Change the sign of all weights (and the bias) feeding into a hidden unit, and also of all weights leading out of that hidden unit: the mapping stays the same.

Weight-space Symmetries
With M hidden units there are therefore 2^M equivalent weight vectors.
Furthermore, exchange all of the weights going into and out of a hidden unit with the corresponding weights of another hidden unit: the mapping again stays the same. This gives M! symmetries.
Overall weight-space symmetry factor: M! 2^M.

M        1   2   3    4     5      6       7
M! 2^M   2   8   48   384   3840   46080   645120
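A small numerical check of both symmetries, sign flips and hidden-unit permutations, for a randomly initialised tanh network (the forward-pass helper and all shapes are assumptions of this sketch):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """Two-layer tanh network with linear outputs; X has shape (N, D)."""
    Z = np.tanh(X @ W1.T + b1)
    return Z @ W2.T + b2

rng = np.random.default_rng(0)
D, M, K, N = 3, 4, 2, 10
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)
X = rng.normal(size=(N, D))
y = forward(X, W1, b1, W2, b2)

# 1) Flip the sign of all weights into and out of hidden unit j (tanh is odd).
j = 1
W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[j], b1f[j], W2f[:, j] = -W1f[j], -b1f[j], -W2f[:, j]
assert np.allclose(y, forward(X, W1f, b1f, W2f, b2))

# 2) Swap two hidden units together with all their incoming and outgoing weights.
perm = [1, 0, 2, 3]
assert np.allclose(y, forward(X, W1[perm], b1[perm], W2[:, perm], b2))
```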

Assume the error E(w) is a smooth function of the weights. Its smallest value will occur at a critical point, for which ∇E(w) = 0. This could be a minimum, a maximum, or a saddle point.
Furthermore, because of the symmetry in weight space, every critical point comes with a whole family of equivalent critical points (M! 2^M in total, counting the point itself) that share the same error value.

Definition (Global Minimum): a point w* for which the error E(w*) is smaller than any other error E(w).
Definition (Local Minimum): a point w* for which the error E(w*) is smaller than any error E(w) in some neighbourhood of w*.
[Figure: error surface E(w) over a two-dimensional weight space (w_1, w_2), showing a local minimum w_A, the global minimum w_B, and a point w_C at which the local gradient ∇E is indicated.]

Finding the global minimum is difficult in general (one would have to check everywhere), unless the error function comes from a special class (e.g. smooth convex functions have only one minimum).
Error functions for neural networks are not convex (symmetries!). But finding a local minimum might be sufficient.
Use iterative methods with a weight vector update Δw^(τ) to find a local minimum:
    w^(τ+1) = w^(τ) + Δw^(τ).

Local Quadratic Approximation
Around a stationary point w* we can approximate
    E(w) ≈ E(w*) + ½ (w − w*)^T H (w − w*),
where the Hessian H is evaluated at w*.
Using a set {u_i} of orthonormal eigenvectors of H, with H u_i = λ_i u_i, to expand
    w − w* = Σ_i α_i u_i,
we get
    E(w) = E(w*) + ½ Σ_i λ_i α_i².
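A numerical sketch of the eigenvector expansion for an exactly quadratic error, where the approximation is exact; the particular Hessian, stationary point and value E(w*) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 4                                        # number of "weights"
w_star = rng.normal(size=W)                  # the stationary point
A = rng.normal(size=(W, W))
H = A @ A.T + W * np.eye(W)                  # a symmetric positive-definite Hessian
E_star = 1.7                                 # E(w*)

def E(w):
    d = w - w_star
    return E_star + 0.5 * d @ H @ d

lam, U = np.linalg.eigh(H)                   # H u_i = lambda_i u_i, columns of U orthonormal
w = w_star + rng.normal(size=W)              # some nearby point
alpha = U.T @ (w - w_star)                   # coefficients of w - w* in the eigenbasis

lhs = E(w)
rhs = E_star + 0.5 * np.sum(lam * alpha ** 2)
assert np.isclose(lhs, rhs)                  # E(w) = E(w*) + 0.5 * sum_i lambda_i alpha_i^2
```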

Local Quadratic Approximation
Around a minimum w* we can approximate E(w) = E(w*) + ½ Σ_i λ_i α_i².
[Figure: in the neighbourhood of the minimum w*, contours of constant error are ellipses whose axes are aligned with the eigenvectors u_1, u_2 of the Hessian, with lengths proportional to λ_1^{−1/2} and λ_2^{−1/2}.]

Local Quadratic Approximation Around a minimum w, the Hessian H must be positive definite if evaluated at w. w 2 u 2 u 1 λ 1/2 2 w λ 1/2 1 w 1 706of 828

Gradient Information Improves Performance
The Hessian is symmetric and contains W(W + 1)/2 independent entries, where W is the total number of weights in the network.
If nothing else is known, we must gather these O(W²) pieces of information from function evaluations, each of which costs O(W) operations, giving a total of order O(W³).
The gradient ∇E provides W pieces of information at once. We still need O(W) gradient evaluations, but the total is now of order O(W²).

Batch processing: update the weight vector with a small step in the direction of the negative gradient,
    w^(τ+1) = w^(τ) − η ∇E(w^(τ)),
where η is the learning rate. After each step, re-evaluate the gradient ∇E at the new weight vector.
Gradient descent has problems in long, narrow valleys.

Gradient descent has problems in long, narrow valleys.
[Figure: example of the zig-zag behaviour of the gradient descent algorithm on an elongated error surface.]
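A minimal sketch of batch gradient descent on a deliberately elongated quadratic error, E(w) = ½(w_1² + 100 w_2²) (my choice of example): the long, narrow valley makes successive steps oscillate across the steep w_2 direction while progress along w_1 is slow.

```python
import numpy as np

# Elongated quadratic "valley": E(w) = 0.5 * (w1^2 + 100 * w2^2), minimum at the origin.
H = np.diag([1.0, 100.0])

def grad_E(w):
    return H @ w

w = np.array([10.0, 1.0])
eta = 0.018                       # close to the stability limit 2/100 for the steep direction
trajectory = [w.copy()]
for _ in range(50):
    w = w - eta * grad_E(w)       # w(tau+1) = w(tau) - eta * grad E(w(tau))
    trajectory.append(w.copy())

trajectory = np.array(trajectory)
# The w2 coordinate flips sign at almost every step (zig-zag across the valley),
# while w1 shrinks only by a factor (1 - eta) = 0.982 per step.
print(trajectory[:6])
```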

Nonlinear Optimisation
- Use conjugate gradients instead of gradient descent to avoid the zig-zag behaviour.
- Use Newton's method, which also calculates the inverse Hessian in each iteration (but inverting the Hessian is usually costly).
- Use quasi-Newton methods (e.g. BFGS), which build up an estimate of the inverse Hessian while iterating.
- Run the algorithm from a set of different starting points to find the smallest local minimum.
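As an illustration only (not from the slides), an off-the-shelf quasi-Newton optimiser can be used via scipy.optimize.minimize with method='BFGS', here combined with several random restarts to keep the best local minimum; the error function below is an arbitrary non-convex example.

```python
import numpy as np
from scipy.optimize import minimize

# A simple non-convex error function with several local minima (illustrative only).
def E(w):
    return np.sum(w ** 2) + 3.0 * np.sum(np.cos(3.0 * w))

def grad_E(w):
    return 2.0 * w - 9.0 * np.sin(3.0 * w)

# Quasi-Newton (BFGS) builds up an inverse-Hessian estimate from gradient evaluations.
# Run it from several starting points and keep the best local minimum found.
rng = np.random.default_rng(0)
results = [minimize(E, rng.uniform(-3, 3, size=2), jac=grad_E, method="BFGS")
           for _ in range(10)]
best = min(results, key=lambda r: r.fun)
print(best.x, best.fun)
```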

Nonlinear Optimisation
Remaining big problem: the error function is defined over the whole training set, so each calculation of the gradient ∇E(w^(τ)) requires processing the whole training set.
If the error function is a sum of errors over the data points,
    E(w) = Σ_{n=1}^{N} E_n(w),
we can use on-line gradient descent (also called sequential gradient descent or stochastic gradient descent), updating the weights using one data point at a time:
    w^(τ+1) = w^(τ) − η ∇E_n(w^(τ)).
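A minimal sketch of the on-line update w^(τ+1) = w^(τ) − η ∇E_n(w^(τ)); a linear model is used purely to keep the example short (for a neural network the per-point gradient would come from backpropagation instead), and all data and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: t_n = a^T x_n + noise; per-point error E_n(w) = 0.5 * (w^T x_n - t_n)^2.
N, D = 200, 3
X = rng.normal(size=(N, D))
a_true = np.array([1.0, -2.0, 0.5])
t = X @ a_true + 0.1 * rng.normal(size=N)

w = np.zeros(D)
eta = 0.01                                   # learning rate
for epoch in range(20):
    for n in rng.permutation(N):             # visit the data points in random order
        grad_En = (w @ X[n] - t[n]) * X[n]   # gradient of E_n(w) with respect to w
        w = w - eta * grad_En                # w(tau+1) = w(tau) - eta * grad E_n(w(tau))

print(w)   # approaches a_true
```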