
Need for Deep Networks

Perceptron
- Can only model linear functions

Kernel machines
- Non-linearity provided by kernels
- Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
- Solution is a linear combination of kernels

Multilayer Perceptron (MLP)
- Network of interconnected neurons
- Layered architecture: neurons from one layer send their outputs to the following layer
- Input layer at the bottom (input features)
- One or more hidden layers in the middle (learned features)
- Output layer on top (predictions)
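To make the layered architecture concrete, here is a minimal sketch (not from the original slides) of a forward pass through an MLP with one hidden layer in NumPy; the layer sizes and the sigmoid hidden activation are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    """Logistic activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass: input -> hidden layer (learned features) -> output."""
    a1 = W1 @ x + b1          # pre-activations of the hidden layer
    h1 = sigmoid(a1)          # hidden layer outputs
    a2 = W2 @ h1 + b2         # pre-activations of the output layer
    return a2                 # linear output (e.g. for regression)

# Illustrative sizes: 3 input features, 4 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(scale=0.1, size=(1, 4)), np.zeros(1)
x = np.array([0.5, -1.2, 3.0])
print(mlp_forward(x, W1, b1, W2, b2))
```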

Activation Function

Perceptron: threshold activation
- $f(x) = \mathrm{sign}(w^T x)$
- The derivative is zero everywhere apart from zero (where it is not differentiable)
- Impossible to run gradient-based optimization

[Figure: plot of the sigmoid $1/(1+\exp(-x))$ for $x$ in $[-10, 10]$]

Sigmoid: smooth version of the threshold
- Approximately linear around zero
- Saturates for large and small values
- $f(x) = \sigma(w^T x) = \dfrac{1}{1 + \exp(-w^T x)}$
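A small illustration (mine, not from the slides) of why the sigmoid is amenable to gradient-based training while the threshold is not: its derivative $\sigma(x)(1-\sigma(x))$ is non-zero everywhere, although it becomes tiny where the unit saturates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid (derived later in these notes): sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:6.1f}  sigma={sigmoid(x):.5f}  dsigma/dx={sigmoid_grad(x):.5f}")
# Around x=0 the gradient is ~0.25; at |x|=10 it is ~4.5e-05 (saturation).
```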

Representational power of MLP

Representable functions
- Boolean functions: any boolean function can be represented by some network with two layers of units
- Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with two layers of units (sigmoid hidden activation, linear output activation)
- Arbitrary functions: any function can be approximated to arbitrary accuracy by a network with three layers of units (sigmoid hidden activation, linear output activation)

Shallow vs deep architectures: boolean functions
- Conjunctive normal form (CNF): one neuron for each clause (OR gate), with negative weights for negated literals, and one neuron at the top (AND gate)
- Problem: number of gates. Some functions require an exponential number of gates (e.g. the parity function), but can be expressed with a linear number of gates by a deep network (e.g. a combination of XOR gates)

Training MLP

Stochastic gradient descent
- Training error for example $(x, y)$ (e.g. regression):
$$E(W) = \frac{1}{2}\,(y - f(x))^2$$
- Gradient update ($\eta$ is the learning rate):
$$w_{lj} = w_{lj} - \eta \frac{\partial E(W)}{\partial w_{lj}}$$

Backpropagation
- Use the chain rule for the derivative (here $a_l$ is the pre-activation of unit $l$ and $\phi_j$ the output of unit $j$ feeding into it):
$$\frac{\partial E(W)}{\partial w_{lj}} = \underbrace{\frac{\partial E(W)}{\partial a_l}}_{\delta_l}\, \frac{\partial a_l}{\partial w_{lj}} = \delta_l\, \phi_j$$

Training MLP: output units

The delta is easy to compute on output units. E.g. for regression with sigmoid outputs:
$$\delta_o = \frac{\partial E(W)}{\partial a_o} = \frac{\partial}{\partial a_o}\,\frac{1}{2}(y - f(x))^2 = \frac{\partial}{\partial a_o}\,\frac{1}{2}(y - \sigma(a_o))^2 = -(y - \sigma(a_o))\,\frac{\partial \sigma(a_o)}{\partial a_o} = -(y - \sigma(a_o))\,\sigma(a_o)(1 - \sigma(a_o))$$

Training MLP: derivative of the sigmoid
$$\frac{\partial \sigma(x)}{\partial x} = \frac{\partial}{\partial x}\,\frac{1}{1 + \exp(-x)} = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \frac{\exp(-x)}{1 + \exp(-x)}\cdot\frac{1}{1 + \exp(-x)} = \frac{1}{1 + \exp(-x)}\left(1 - \frac{1}{1 + \exp(-x)}\right) = \sigma(x)(1 - \sigma(x))$$

Training MLP: hidden units

Consider the contribution to the error through all outer connections (again with sigmoid activation):
$$\delta_l = \frac{\partial E(W)}{\partial a_l} = \sum_{k \in \mathrm{ch}[l]} \frac{\partial E(W)}{\partial a_k}\,\frac{\partial a_k}{\partial a_l} = \sum_{k \in \mathrm{ch}[l]} \delta_k\,\frac{\partial a_k}{\partial \phi_l}\,\frac{\partial \phi_l}{\partial a_l} = \sum_{k \in \mathrm{ch}[l]} \delta_k\, w_{kl}\,\frac{\partial \sigma(a_l)}{\partial a_l} = \sum_{k \in \mathrm{ch}[l]} \delta_k\, w_{kl}\,\sigma(a_l)(1 - \sigma(a_l))$$
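To tie the formulas together, here is a compact sketch (my own, under simplifying assumptions: one sigmoid hidden layer, a single sigmoid output unit trained with the squared error above, no biases) of one stochastic gradient step computed with these deltas.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(x, y, W1, W2, eta=0.1):
    """One backpropagation update for a 1-hidden-layer sigmoid network (no biases)."""
    # Forward pass
    a1 = W1 @ x               # hidden pre-activations a_l
    phi1 = sigmoid(a1)        # hidden outputs phi_l
    a2 = W2 @ phi1            # output pre-activation a_o
    out = sigmoid(a2)         # network output f(x)

    # Output delta: delta_o = -(y - sigma(a_o)) * sigma(a_o) * (1 - sigma(a_o))
    delta_o = -(y - out) * out * (1.0 - out)
    # Hidden deltas: delta_l = sum_k delta_k * w_kl * sigma(a_l) * (1 - sigma(a_l))
    delta_h = (W2.T @ delta_o) * phi1 * (1.0 - phi1)

    # Gradients dE/dw = delta * phi, followed by the gradient descent update
    W2 -= eta * np.outer(delta_o, phi1)
    W1 -= eta * np.outer(delta_h, x)
    return 0.5 * (y - out) ** 2   # training error for this example

# Illustrative usage: the error should decrease over repeated updates
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(4, 3))
W2 = rng.normal(scale=0.5, size=(1, 4))
x, y = np.array([0.2, -0.4, 0.7]), np.array([1.0])
for _ in range(5):
    print(float(sgd_step(x, y, W1, W2)))
```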

Deep architectures: modular structure

[Figure: a stack of modules $F_1(x, W_1)$, $F_2(\cdot, W_2)$, $F_3(\cdot, W_3)$ with a loss on top; outputs flow upward in the forward pass, gradients $\partial E/\partial(\cdot)$ flow downward in the backward pass]

Writing $h_j$ for the output of module $j$ (so $h_1 = F_1(x, W_1)$, $h_2 = F_2(h_1, W_2)$, $h_3 = F_3(h_2, W_3)$, and $E = \mathrm{Loss}(h_3, y)$):
$$h_j = F_j(h_{j-1}, W_j)$$
$$\frac{\partial E}{\partial W_j} = \frac{\partial E}{\partial h_j}\,\frac{\partial h_j}{\partial W_j} = \frac{\partial E}{\partial h_j}\,\frac{\partial F_j(h_{j-1}, W_j)}{\partial W_j}$$
$$\frac{\partial E}{\partial h_{j-1}} = \frac{\partial E}{\partial h_j}\,\frac{\partial h_j}{\partial h_{j-1}} = \frac{\partial E}{\partial h_j}\,\frac{\partial F_j(h_{j-1}, W_j)}{\partial h_{j-1}}$$

Remarks on backpropagation

Local minima
- The error surface of a multilayer neural network can contain several minima
- Backpropagation is only guaranteed to converge to a local minimum
- Heuristic attempts to address the problem: use stochastic instead of true gradient descent; train multiple networks with different random weights and average them or choose the best; many more

Note
- Training kernel machines requires solving quadratic optimization problems: a global optimum is guaranteed
- Deep networks are more expressive in principle, but harder to train
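Below is a small sketch (mine, not the lecture's code) of this modular view: each module implements a forward pass and a backward pass that returns $\partial E/\partial h_{j-1}$ and stores $\partial E/\partial W_j$, so modules can be stacked freely. A single linear-plus-sigmoid module is used as the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class SigmoidLayer:
    """Module F_j(h, W) = sigma(W h): forward caches what backward needs."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(scale=0.1, size=(n_out, n_in))

    def forward(self, h_prev):
        self.h_prev = h_prev
        self.h = sigmoid(self.W @ h_prev)
        return self.h

    def backward(self, dE_dh):
        # Chain rule through the sigmoid and the linear map
        dE_da = dE_dh * self.h * (1.0 - self.h)
        self.dE_dW = np.outer(dE_da, self.h_prev)   # dE/dW_j
        return self.W.T @ dE_da                     # dE/dh_{j-1}, passed downward

rng = np.random.default_rng(0)
layers = [SigmoidLayer(3, 4, rng), SigmoidLayer(4, 2, rng), SigmoidLayer(2, 1, rng)]
x, y = np.array([0.1, 0.5, -0.3]), np.array([1.0])

h = x
for layer in layers:                 # forward pass, bottom to top
    h = layer.forward(h)
grad = -(y - h)                      # dE/dh_3 for E = 0.5 * (y - h_3)^2
for layer in reversed(layers):       # backward pass, top to bottom
    grad = layer.backward(grad)
for layer in layers:                 # gradient descent update
    layer.W -= 0.1 * layer.dE_dW
```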

Stopping criterion and generalization

Stopping criterion
- How can we choose the training termination condition?
- Overtraining the network increases the possibility of overfitting the training data
- The network is initialized with small random weights: very simple decision surface
- Overfitting occurs at later iterations, when increasingly complex surfaces are being generated
- Use a separate validation set to estimate the performance of the network and choose when to stop training

[Figure: training, validation and test error as a function of training epochs]

Training deep architectures

Problem: vanishing gradient
- The error gradient is backpropagated through the layers
- At each step the gradient is multiplied by the derivative of the sigmoid: very small for saturated units
- The gradient vanishes in the lower layers: difficulty of training deep networks
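A minimal early-stopping loop, as a sketch under assumptions not in the slides: the helpers `train_epoch` and `validation_error` are hypothetical, and training stops once the validation error has not improved for `patience` epochs.

```python
import copy

def train_with_early_stopping(model, train_data, val_data,
                              train_epoch, validation_error,
                              max_epochs=100, patience=5):
    """Stop training when the validation error stops improving.

    `train_epoch(model, train_data)` and `validation_error(model, val_data)`
    are assumed helpers (hypothetical, not from the lecture).
    """
    best_err, best_model, bad_epochs = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_epoch(model, train_data)
        err = validation_error(model, val_data)
        if err < best_err:
            best_err, best_model, bad_epochs = err, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # no improvement for `patience` epochs
                break
    return best_model, best_err
```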

Tricks of the trade

A few simple suggestions
- Do not initialize weights to zero, but to small random values around zero
- Standardize inputs ($x \leftarrow (x - \mu_x)/\sigma_x$) to avoid saturating hidden units
- Randomly shuffle the training examples before each training epoch

Tricks of the trade: activation functions

Rectifier
- $f(x) = \max(0, w^T x)$
- Linearity is nice for learning
- Saturation (as in the sigmoid) is bad for learning (the gradient vanishes, so no weight update)
- A neuron employing the rectifier activation is called a rectified linear unit (ReLU)

Tricks of the trade: loss functions

Cross entropy
$$E(W) = -\sum_{(x,y) \in \mathcal{D}} \left[\, y \log f(x) + (1 - y) \log(1 - f(x)) \,\right]$$
- Minimize the cross entropy of the network output with respect to the targets
- Useful for binary classification tasks
- Model the target function as the probability that the output is one (use a sigmoid for the output layer)
- Corresponds to maximum likelihood learning
- The log removes the saturation effect of the sigmoid (helps optimization)
- Can be generalized to multiclass classification (use a softmax for the output layer)
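A short sketch (my own) of the two ingredients just listed: the ReLU activation and the binary cross-entropy loss for a sigmoid output. Clamping the predictions away from 0 and 1 is a numerical-stability assumption, not something stated in the slides.

```python
import numpy as np

def relu(a):
    """Rectifier: max(0, a), applied elementwise."""
    return np.maximum(0.0, a)

def binary_cross_entropy(y, p, eps=1e-12):
    """E = -sum[ y log p + (1-y) log(1-p) ] over a batch of examples."""
    p = np.clip(p, eps, 1.0 - eps)     # avoid log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.6])            # sigmoid outputs of the network
print(binary_cross_entropy(y, p))        # ~0.839
print(relu(np.array([-2.0, 0.5, 3.0])))  # [0.  0.5 3. ]
```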

Tricks of the trade: regularization

[Figure: contour plots of $E(W)$ in the $(W_1, W_2)$ plane, together with the 2-norm and 1-norm constraint regions]

2-norm regularization
$$J(W) = E(W) + \lambda \|W\|^2$$
- Penalizes weights by their Euclidean norm
- Weights with less influence on the error get smaller values

1-norm regularization
$$J(W) = E(W) + \lambda \|W\|_1$$
- Penalizes weights by the sum of their absolute values
- Encourages less relevant weights to be exactly zero (sparsity-inducing norm)

Tricks of the trade: initialization

Suggestions
- Randomly initialize the weights (to break symmetries between neurons)
- Carefully set the initialization range (to preserve forward and backward variance), with $n$ and $m$ the number of inputs and outputs:
$$W_{ij} \sim U\!\left(-\sqrt{\frac{6}{n+m}},\; \sqrt{\frac{6}{n+m}}\right)$$
- Sparse initialization: enforce a fraction of the weights to be non-zero (encourages diversity between neurons)

Tricks of the trade: gradient descent

Batch vs stochastic
- Batch gradient descent updates the parameters after seeing all examples: too slow for large datasets
- Fully stochastic gradient descent updates the parameters after seeing each example: the objective is too different from the true one
- Minibatch gradient descent: update the parameters after seeing a minibatch of $m$ examples ($m$ depends on many factors, e.g. the size of GPU memory)

Momentum ($0 \le \alpha < 1$ is called the momentum)
$$v_{ji} = \alpha v_{ji} - \eta \frac{\partial E(W)}{\partial w_{ji}} \qquad\qquad w_{ji} = w_{ji} + v_{ji}$$
- Tends to keep updating the weights in the same direction
- Think of a ball rolling on an error surface; possible effects: roll through small local minima without stopping, traverse flat surfaces instead of stopping there, increase the step size of the search in regions of constant gradient
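The sketch below (mine, with an assumed `grad_fn` that returns the gradient of the loss on a minibatch) combines three of these tricks: Glorot-style uniform initialization, minibatch sampling with per-epoch shuffling, and the momentum update, with an optional 2-norm penalty added to the gradient.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """W_ij ~ U(-sqrt(6/(n+m)), +sqrt(6/(n+m)))."""
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_out, n_in))

def minibatch_sgd_momentum(W, X, Y, grad_fn, eta=0.01, alpha=0.9,
                           lam=0.0, batch_size=32, epochs=10, rng=None):
    """grad_fn(W, X_batch, Y_batch) -> dE/dW is an assumed helper."""
    rng = rng or np.random.default_rng(0)
    v = np.zeros_like(W)                       # momentum "velocity"
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)             # shuffle before each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = grad_fn(W, X[idx], Y[idx]) + 2.0 * lam * W   # optional L2 term
            v = alpha * v - eta * g            # momentum update
            W = W + v
    return W
```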

Tricks of the trade: adaptive gradient

Decreasing learning rate
$$\eta_t = \begin{cases} \left(1 - \dfrac{t}{\tau}\right)\eta_0 + \dfrac{t}{\tau}\,\eta_\tau & \text{if } t < \tau \\[4pt] \eta_\tau & \text{otherwise} \end{cases}$$
- Larger learning rate at the beginning for faster convergence towards the attraction basin
- Smaller learning rate at the end to avoid oscillation close to the minimum

Adagrad
- Reduce the learning rate in steep directions, increase it in gentler directions:
$$r_{ji} = r_{ji} + \left(\frac{\partial E(W)}{\partial w_{ji}}\right)^2 \qquad\qquad w_{ji} = w_{ji} - \frac{\eta}{\sqrt{r_{ji}}}\,\frac{\partial E(W)}{\partial w_{ji}}$$
- Problem: the squared gradient is accumulated over all iterations; for non-convex problems the learning rate reduction can be excessive/premature

RMSProp
$$r_{ji} = \rho\, r_{ji} + (1 - \rho)\left(\frac{\partial E(W)}{\partial w_{ji}}\right)^2 \qquad\qquad w_{ji} = w_{ji} - \frac{\eta}{\sqrt{r_{ji}}}\,\frac{\partial E(W)}{\partial w_{ji}}$$
- Exponentially decaying accumulation of the squared gradient ($0 < \rho < 1$)
- Avoids the premature reduction of Adagrad
- Adagrad-like behaviour when reaching a convex bowl

Tricks of the trade: batch normalization

Covariate shift problem
- Covariate shift occurs when the input distribution to your model changes over time (and the model does not adapt to the change)
- In (very) deep networks, internal covariate shift takes place among layers when they get updated by backpropagation
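The two accumulation rules side by side, as a small sketch (mine): `grads` stands for successive gradients of $E(W)$ with respect to a single weight, and the small `eps` in the denominator is a common numerical-stability assumption not shown in the slides.

```python
import numpy as np

def adagrad_updates(grads, eta=0.1, eps=1e-8):
    """Accumulate squared gradients over all iterations."""
    r, w, trace = 0.0, 0.0, []
    for g in grads:
        r += g ** 2
        w -= eta / (np.sqrt(r) + eps) * g
        trace.append(w)
    return trace

def rmsprop_updates(grads, eta=0.1, rho=0.9, eps=1e-8):
    """Exponentially decaying accumulation of squared gradients."""
    r, w, trace = 0.0, 0.0, []
    for g in grads:
        r = rho * r + (1.0 - rho) * g ** 2
        w -= eta / (np.sqrt(r) + eps) * g
        trace.append(w)
    return trace

# With a constant gradient, Adagrad's effective step keeps shrinking,
# while RMSProp's settles to roughly eta.
grads = [1.0] * 20
print(adagrad_updates(grads)[-1], rmsprop_updates(grads)[-1])
```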

Tricks of the trade: batch normalization

Solution (sketch)
- Normalize each node activation (the input to the activation function) by its batch statistics:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sigma_B}$$
  where $x$ is the activation of an arbitrary node in an arbitrary layer, $B = \{x_1, \ldots, x_m\}$ is a batch of values for that activation, and $\mu_B$, $\sigma_B^2$ are the batch mean and variance
- Scale and shift each activation with adjustable parameters ($\gamma$ and $\beta$ become part of the network parameters):
$$y_i = \gamma \hat{x}_i + \beta$$

Advantages
- More robustness to parameter initialization
- Allows for faster learning rates without divergence
- Keeps activations in the non-saturated region even for saturating activation functions
- Regularizes the model
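A sketch (mine) of the normalization for one mini-batch of activations of a single node; the small `eps` inside the square root is a standard stability assumption not written on the slide.

```python
import numpy as np

def batch_norm_forward(x_batch, gamma, beta, eps=1e-5):
    """Normalize a batch of activations of one node, then scale and shift."""
    mu = x_batch.mean()                    # batch mean mu_B
    var = x_batch.var()                    # batch variance sigma_B^2
    x_hat = (x_batch - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # y_i = gamma * x_hat_i + beta

x_batch = np.array([2.0, 4.0, 6.0, 8.0])   # activations of one node over a minibatch
y = batch_norm_forward(x_batch, gamma=1.0, beta=0.0)
print(y.mean(), y.std())                   # ~0 and ~1 after normalization
```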

Tricks of the trade: layerwise pre-training

[Figure: an autoencoder mapping the input to a hidden code and back to a reconstruction of the input]

Autoencoder
- Train a shallow network to reproduce its input in the output
- Learns to map inputs into a sensible hidden representation (representation learning)
- Can be done with unlabelled examples (unsupervised learning)

[Figures: stacked autoencoder built layer by layer from the input upward]

Stacked autoencoder: repeat
1. discard the output layer
2. freeze the hidden layer weights
3. add another hidden + output layer
4. train the network to reproduce its input

Global refinement
- Discard the autoencoder output layer
- Add an appropriate output layer for the supervised task (e.g. one-hot encoding for multiclass classification)
- Learn the output layer weights and refine all internal weights with the backpropagation algorithm
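A compact sketch (my own, not the lecture's) of a single shallow autoencoder trained by stochastic gradient descent to reproduce its input; stacking would repeat this, feeding each trained hidden representation as the input of the next autoencoder.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, n_hidden, eta=0.1, epochs=200, rng=None):
    """Shallow autoencoder: sigmoid hidden code, linear reconstruction, MSE loss."""
    rng = rng or np.random.default_rng(0)
    n_in = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_hidden, n_in))
    W_dec = rng.normal(scale=0.1, size=(n_in, n_hidden))
    for _ in range(epochs):
        for x in X:
            code = sigmoid(W_enc @ x)          # hidden representation
            recon = W_dec @ code               # reconstruction of the input
            err = recon - x                    # gradient of 0.5*||recon - x||^2
            delta_h = (W_dec.T @ err) * code * (1.0 - code)
            W_dec -= eta * np.outer(err, code)
            W_enc -= eta * np.outer(delta_h, x)
    return W_enc, W_dec

X = np.random.default_rng(1).normal(size=(50, 5))
W_enc, W_dec = train_autoencoder(X, n_hidden=3)
x = X[0]
print(np.round(x, 2), np.round(W_dec @ sigmoid(W_enc @ x), 2))
```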

Tricks of the trade: layerwise pre-training

Modern pre-training
- Supervised pre-training: layerwise training with the actual labels
- Transfer learning: train the network on a similar task, discard the last layers and retrain on the target task
- Multi-level supervision: auxiliary output nodes at intermediate layers to speed up learning

Popular deep architectures

Many different architectures
- Convolutional networks for exploiting local correlations (e.g. for images)
- Recurrent and recursive networks for collective predictions (e.g. sequential labelling)
- Deep Boltzmann machines as probabilistic generative models (can also generate new instances of a certain class)
- Generative adversarial networks to generate new instances as a game between a discriminator and a generator

Convolutional networks (CNN)

Location invariance + compositionality
- Convolution filters extracting local features
- Pooling to provide invariance to local variations
- Hierarchy of filters to compose complex features from simpler ones (e.g. pixels to edges to shapes)
- Fully connected layers for the final classification
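To make the convolution and pooling operations concrete, here is a small sketch (mine) of a single 2D convolution filter followed by 2x2 max pooling on a toy image; real CNNs use many learned filters plus padding and stride options.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation) of a single filter over an image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def max_pool2d(x, size=2):
    """Non-overlapping max pooling: invariance to small local variations."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_filter = np.array([[1.0, -1.0],      # responds to horizontal intensity changes
                        [1.0, -1.0]])
features = conv2d(image, edge_filter)      # local features, shape (5, 5)
pooled = max_pool2d(features)              # shape (2, 2)
print(features.shape, pooled.shape)
```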

Long Short-Term Memory Networks (LSTM)

Recurrent computation with selective memory
- The cell state is propagated along the chain
- The forget gate selectively forgets parts of the cell state
- The input gate selectively chooses parts of the candidate for the cell update
- The output gate selectively chooses parts of the cell state for the output

References

Libraries
- Pylearn2 (http://deeplearning.net/software/pylearn2/)
- Caffe (http://caffe.berkeleyvision.org/)
- PyTorch (http://pytorch.org/)
- TensorFlow (https://www.tensorflow.org/)

Literature
- Yoshua Bengio, Learning Deep Architectures for AI, Foundations & Trends in Machine Learning, 2009.
- Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, Book in preparation for MIT Press, 2016 (http://www.deeplearningbook.org/)
- Christopher Olah, Understanding LSTM Networks (http://colah.github.io/posts/2015-08-understanding-l