Regularization in Neural Networks

Similar documents
Parameter Norm Penalties. Sargur N. Srihari

Neural Network Training

Reading Group on Deep Learning Session 1

Feed-forward Network Functions

Vasil Khalidov & Miles Hansard. C.M. Bishop s PRML: Chapter 5; Neural Networks

Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen

Reading Group on Deep Learning Session 2

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Artificial Neural Networks

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Convolution and Pooling as an Infinitely Strong Prior

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Bagging and Other Ensemble Methods

Pattern Recognition and Machine Learning

Machine Learning Lecture 5

CSC321 Lecture 5: Multilayer Perceptrons

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

Computational statistics

Multiclass Logistic Regression

y(x n, w) t n 2. (1)

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

Neural Networks: Optimization & Regularization

Introduction to Machine Learning

Neural networks and optimization

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Introduction to Neural Networks

Artificial Neural Networks. MGS Lecture 2

CSC 578 Neural Networks and Deep Learning

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

Neural Networks. Henrik I. Christensen. Computer Science and Engineering University of California, San Diego

Gradient-Based Learning. Sargur N. Srihari

Lecture 4: Perceptrons and Multilayer Perceptrons

OPTIMIZATION METHODS IN DEEP LEARNING

Neural networks. Chapter 20. Chapter 20 1

Artificial Neural Network

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Neural Networks (and Gradient Ascent Again)

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Lecture 5: Logistic Regression. Neural Networks

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

STA 4273H: Statistical Machine Learning

CSC 411 Lecture 10: Neural Networks

ECE521 week 3: 23/26 January 2017

Basic Principles of Unsupervised and Unsupervised

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Notion of Distance. Metric Distance Binary Vector Distances Tangent Distance

Gradient Descent. Sargur Srihari

4. Multilayer Perceptrons

Artificial Neural Networks

NEURAL NETWORKS

Introduction to Neural Networks

PATTERN CLASSIFICATION

1 What a Neural Network Computes

Linear Models for Regression. Sargur Srihari

Machine Learning Lecture 10

Final Examination CS 540-2: Introduction to Artificial Intelligence

Lecture 3: Pattern Classification

CSCI 315: Artificial Intelligence through Deep Learning

Neural Networks and Deep Learning

Artificial Neural Networks

Deep Learning Lab Course 2017 (Deep Learning Practical)

Linear Models for Classification

Neural networks. Chapter 19, Sections 1 5 1

Automatic Differentiation and Neural Networks

Deep Feedforward Networks. Sargur N. Srihari

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Tutorial on Tangent Propagation

Statistical Learning Theory. Part I 5. Deep Learning

Machine Learning (CSE 446): Neural Networks

Machine Learning Lecture 7

Machine Learning Basics

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Optmization Methods for Machine Learning Beyond Perceptron Feed Forward neural networks (FFN)

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I

Loss Functions and Optimization. Lecture 3-1

Logistic Regression & Neural Networks

Weight Quantization for Multi-layer Perceptrons Using Soft Weight Sharing

STA 414/2104: Lecture 8

Neural Networks and the Back-propagation Algorithm

Linear & nonlinear classifiers

Neural Networks. Intro to AI Bert Huang Virginia Tech

Deep Feedforward Networks

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

Artificial neural networks

Holdout and Cross-Validation Methods Overfitting Avoidance

Understanding Neural Networks : Part I

How to do backpropagation in a brain

Transcription:

Regularization in Neural Networks Sargur Srihari 1

Topics in Neural Network Regularization What is regularization? Methods 1. Determining optimal number of hidden units 2. Use of regularizer in error function Linear Transformations and Consistent Gaussian priors 3. Early stopping Invariances Tangent propagation Training with transformed data 2

What is Regularization? Central problem of ML is Generalization to design algorithms that will perform well not just on training data but on new inputs as well Regularization is: any modification we make to a learning algorithm to reduce its generalization error but not its training error Reduce test error even at the expense of increasing training error Some goals of regularization 1. Encode prior knowledge 2. Express preference for simpler model 3. Need to make underdetermined problem determined 3

Need for Regularization Generalization Prevent over-fitting Occam's razor Bayesian point of view Regularization corresponds to prior distributions on model parameters 4

What is the Best Model? Best fitting model obtained not by finding the right number of parameters Instead, best fitting model is a large model that has been regularized appropriately We review several strategies for how to create such a large, deep regularized model 5

Common Regularization Strategies 1. Parameter Norm Penalties 2. Early Stopping 3. Parameter tying and parameter sharing 4. Bagging and other ensemble methods 5. Dropout 6. Data Set Augmentation 7. Adversarial training 8. Tangent methods 6

Optimizing no. of hidden units Number of input and output units is determined by dimensionality of data set Number of hidden units M is a free parameter Adjusted to get best predictive performance Possible approach is to get maximum likelihood estimate of M for balance between under-fitting and over-fitting 7

Effect of Varying Number of Hidden Units Sinusoidal Regression Problem Two layer network trained on 10 data points M = 1, 3 and 10 hidden units Minimizing sum-of-squared error function Using conjugate gradient descent Generalization error is not a simple function of M due to presence of local minima in error function

Using Validation Set to determine no of hidden units Plot a graph choosing random starts and different numbers of hidden units M and choose the specific solution having smallest generalization error Sum of squares Test error 30 random starts for each M 30 points in each column of graph Overall best validation set performance happened at M=8 No. of hidden units, M There are other ways to control the complexity of a neural network in order to avoid overfitting Alternative approach is to choose a relatively large value of M and then control complexity by adding a regularization term 9

Regularization using Simple Weight Decay Generalization error is not a simple function of M Due to presence of local minima Need to control network complexity to avoid over-fitting Choose a relatively large M and control complexity by addition of regularization term Simplest regularizer is weight decay!e(w) = E(w)+ λ 2 w T w Effective model complexity determined by choice of regularization coefficient λ Regularizer is equivalent to a Gaussian prior over weight vector w Simple weight decay has certain shortcomings invariance to scaling

Consistent Gaussian priors Simple weight decay is inconsistent with certain scaling properties of network mappings To show this, consider a multi-layer perceptron network with two layers of weights and linear output units Set of input variables {x i } and output variables {y k } Activations of hidden units in first layer have the form z j = h w ji x i + w j 0 Activations of output units are i y k = w kj z j + w k 0 j 11

Linear Transformations of input/output Variables Suppose we perform a linear transformation of input data x i!x i = ax i +b Then we can arrange for mapping performed by network to be unchanged by making a corresponding linear transformation Of the weights and biases from the inputs to the hidden units as w ji!w ji = 1 a w ji and w j 0 = w j 0 b a Similar linear transformation of output variables of network of the form y k!y k = cy k +d Can be achieved by transformation of second layer weights and biases i w ji w k j!w kj = cw kj and w k 0 = cw k 0 +d 12

Desirable invariance property of regularizer Suppose we train two different networks First network trained using original data: {x i }, {y k } Second network for which input and/or target variables are transformed by one of the linear transformations and/or Then consistency requires that we should obtain equivalent networks that differ only by linear transformation of the weights For first layer: w ji 1 a w ji Regularizer should have this property Otherwise it arbitrarily favors one solution over another Simple weight decay And/or or second layer:!e(w) = E(w)+ λ 2 w T w Treats all weights and biases on an equal footing While resulting w ji and w kj should be treated differently x i!x i = ax i +b y k!y k = cy k +d w k j cw kj does not have this property Consequently networks will have different weights and violate invariance We therefore look for a regularizer invariant under the linear transformations 13

Regularizer invariant under linear transformation The regularizer should be invariant to re-scaling of weights and shifts of biases Such a regularizer is λ 1 2 w 2 + λ 2 w W 1 2 w W 2 where W 1 are weights of first layer and W 2 are the set of weights in the second layer w 2 This regularizer remains unchanged under the weight transformations provided the parameters are rescaled using λ 1 a 1/2 λ 1 and λ 2 c 1/2 λ 2 We have seen before that weight decay is equivalent to a Gaussian prior. So what is the equivalent prior to this?

Equivalent prior The regularizer invariant to re-scaling of weights/ biases is λ 1 2 w 2 + λ 2 w W 1 2 w W 2 where W 1 are weights of first layer and W 2 of second layer It corresponds to a prior of the form p(w α 1,α 2 ) α exp α 1 2 This is an improper prior which cannot be normalized w 2 w 2 α 2 w W 1 2 Leads to difficulties in selecting regularization coefficients and in model comparison within Bayesian framework Instead include separate priors for biases with their own hyper-parameters We can illustrate effect of the resulting four hyperparameters α 1b, precision of Gaussian distribution of first layer bias w W 2 α 1w, precision of Gaussian of first layer weights α 2b, precision of Gaussian distribution of second layer bias α 2w, precision of Gaussian distribution of second layer weights w 2 α 1 and α 2 are hyper-parameters

Effect of hyperparameters on input-output Network with single input (x value range: -1 to +1), single linear output (y value range: -60 to +40) Priors are governed by four hyper-parameters α 1b, precision of Gaussian distribution of first layer bias α 1w, precision of Gaussian of first layer weights α 2b,...of second layer bias α 2w,. of second layer weights 12 hidden units with tanh activation functions output output Observe: vertical scale controlled by α 2 w Draw five samples from prior over hyperparameters, and plot inputoutput (network functions) Five samples correspond to five colors For each setting function is learnt and plotted output input input Observe: horizontal scale controlled by α 1 w input 16

3. Early Stopping Alternative to regularization In controlling complexity Error measured with an independent validation set shows initial decrease in error and then an increase Training stopped at point of smallest error with validation data Effectively limits network complexity Training Set Error Iteration Step Validation Set Error Iteration Step 17

Interpreting the effect of Early Stopping Consider quadratic error function Axes in weight space are parallel to eigen vectors of Hessian In absence of weight decay, weight vector starts at origin and proceeds to w ML Stopping at w is similar to weight decay Contour of constant error w ML represents minimum!e(w) = E(w)+ λ 2 w T w Effective number of parameters in the network grows during course of training 18

Invariances Quite often in classification problems there is a need Predictions should be invariant under one or more transformations of input variable Example: handwritten digit should be assigned same classification irrespective of position in the image (translation) and size (scale) Such transformations produce significant changes in raw data, yet need to produce same output from classifier Examples: pixel intensities, in speech recognition, nonlinear time warping along time axis 19

Simple Approach for Invariance Large sample set where all transformations are present E.g., for translation invariance, examples of objects in may different positions Impractical Number of different samples grows exponentially with number of transformations Seek alternative approaches for adaptive model to exhibit required invariances 20

Approaches to Invariance 1. Training set augmented by transforming training patterns according to desired invariances E.g., shift each image into different positions 2. Add regularization term to error function that penalizes changes in model output when input is transformed. Leads to tangent propagation. 3. Invariance built into pre-processing by extracting features invariant to required transformations 4. Build invariance property into structure of neural network (convolutional networks) Local receptive fields and shared weights 21

Approach 1: Transform each input Synthetically warp each handwritten digit image before presentation to model Easy to implement but computationally costly Original Image Warped Images Random displacements Δx,Δy (0,1) at each pixel Then smooth by convolutions of width 0.01, 30 and 60 resply 22

Approach 2: Tangent Propagation Regularization can be used to encourage models to be invariant to transformations by techniques of tangent propagation Consider effect of a transformation on input vector x n A one-dimensional continuous transformation parameterized by ξ applied to x n sweeps a manifold M in D-dimensional input space Two-dimensional input space showing effect of continuous transformation with single parameter ξ Let the vector resulting from acting on x n by this transformation be denoted by s(x n,ξ) defined so that s(x,0)=x. Then the tangent to the curve M is given by the directional derivative τ = δs/δξ and the tangent vector at point x n is given by τ n = s(x n,ξ) ξ ξ = 0 23

Tangent Propagation as Regularization Under a transformation of input vector The network output vector will change Derivative of output k wrt ξ is given by y k ξ ξ=0 = D i=1 y k x i x i ξ ξ=0 = D i=1 J ki τ i where J ki is the (k,i) element of the Jacobian Matrix J Result is used to modify the standard error function So as to encourage local invariance in neighborhood of data point!e = E + λω where λ is a regularization coefficient and Ω= 1 y nk = 1 2 n k ξ ξ=0 2 2 n k D i=1 J nki τ ni 2 24

Tangent vector from finite differences In practical implementation tangent vector τ n is approximated using finite differences by subtracting original vector x n from the corresponding vector after transformations using a small value of ξ and then dividing by ξ A related technique Tangent Distance is used to build invariance properties into distance-based methods such as nearest-neighbor classifiers Original Image x Adding small contribution from tangent vector to image x+ετ Tangent vector τ corresponding to small clockwise rotation True image rotated for comparison 25

Equivalence of Approaches 1 and 2 Expanding the training set is closely related to tangent propagation Small transformations of the original input vectors together with Sum-of-squared error function can be shown to be equivalent to the tangent propagation regularizer 26