CSC321 Lecture 9: Generalization

Similar documents
CSC321 Lecture 9: Generalization

Lecture 9: Generalization

CSC321 Lecture 2: Linear Regression

CSC 411 Lecture 6: Linear Regression

CSC321 Lecture 15: Exploding and Vanishing Gradients

CSC321 Lecture 4: Learning a Classifier

CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 4: Learning a Classifier

CSC321 Lecture 8: Optimization

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC321 Lecture 7: Optimization

Lecture 2: Linear regression

CSC321 Lecture 6: Backpropagation

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Ways to make neural networks generalize better

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Neural Networks. David Rosenberg. July 26, New York University. David Rosenberg (New York University) DS-GA 1003 July 26, / 35

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

What Do Neural Networks Do? MLP Lecture 3 Multi-layer networks 1

CSC 411 Lecture 10: Neural Networks

Lecture 6: Backpropagation

Lecture 15: Exploding and Vanishing Gradients

Advanced computational methods X Selected Topics: SGD

Regularization in Neural Networks

CSC321 Lecture 16: ResNets and Attention

Deep Learning & Neural Networks Lecture 4

Overfitting, Bias / Variance Analysis

ECE521 lecture 4: 19 January Optimization, MLE, regularization

CSC321 Lecture 5 Learning in a Single Neuron

CSC321 Lecture 7 Neural language models

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

CSC321 Lecture 20: Autoencoders

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

CSC321 Lecture 18: Learning Probabilistic Models

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

CSC 411: Lecture 02: Linear Regression

Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Feedforward Networks

1 What a Neural Network Computes

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes

SGD and Deep Learning

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Week 5: Logistic Regression & Neural Networks

Introduction to Neural Networks

Vote. Vote on timing for night section: Option 1 (what we have now) Option 2. Lecture, 6:10-7:50 25 minute dinner break Tutorial, 8:15-9

Lecture 4: Training a Classifier

ECE521 Lecture 7/8. Logistic Regression

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

Neural networks and optimization

CSCI567 Machine Learning (Fall 2018)

Generative adversarial networks

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 3: Introduction to Deep Learning (continued)

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Machine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari

Machine Learning and Data Mining. Linear regression. Kalev Kask

Summary and discussion of: Dropout Training as Adaptive Regularization

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS16

OPTIMIZATION METHODS IN DEEP LEARNING

Lecture 6 Optimization for Deep Neural Networks

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

Recap from previous lecture

Linear Models for Regression

Lecture 15: Recurrent Neural Nets

Data Mining & Machine Learning

Lecture 5: Logistic Regression. Neural Networks

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 16

Linear classifiers: Overfitting and regularization

Neural Network Training

Lecture 4: Training a Classifier

Neural Networks with Applications to Vision and Language. Feedforward Networks. Marco Kuhlmann

Convolution and Pooling as an Infinitely Strong Prior

Neural Networks: Optimization & Regularization

Comparison of Modern Stochastic Optimization Algorithms

Lecture 17: Neural Networks and Deep Learning

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

Linear Models for Regression CS534

CSC321 Lecture 10 Training RNNs

Automatic Differentiation and Neural Networks

Lecture 18: Learning probabilistic models

How to do backpropagation in a brain

Introduction to Convolutional Neural Networks (CNNs)

Linear Regression (continued)

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

VBM683 Machine Learning

Machine Learning (CSE 446): Backpropagation

CSC 411 Lecture 7: Linear Classification

CSC 411 Lecture 12: Principal Component Analysis

Neural Networks 2. 2 Receptive fields and dealing with image inputs

Parameter Norm Penalties. Sargur N. Srihari

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Introduction to Machine Learning

Linear Regression. Aarti Singh. Machine Learning / Sept 27, 2010

Neural Networks and Deep Learning.

Transcription:

CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 26

Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions on the training set. How do we make sure they generalize to data they haven t seen before? Even though the topic is well studied, it s still poorly understood. Roger Grosse CSC321 Lecture 9: Generalization 2 / 26

Generalization Recall: overfitting and underfitting 1 M = 1 1 M = 3 1 M = 9 t t t 0 0 0 1 1 1 0 x 1 0 x 1 0 x 1 We d like to minimize the generalization error, i.e. error on novel examples. Roger Grosse CSC321 Lecture 9: Generalization 3 / 26

Generalization Training and test error as a function of # training examples and # parameters: Roger Grosse CSC321 Lecture 9: Generalization 4 / 26

Bias/Variance Decomposition There s an interesting decomposition of generalization error in the particular case of squared error loss. It s often convenient to suppose our training and test data are sampled from a data generating distribution p D(x, t). Suppose we re given an input x. We d like to minimize the expected loss: E pd [(y t) 2 x] The best possible prediction we can make is the conditional expectation Proof: y = E pd [t x]. E[(y t) 2 x] = E[y 2 2yt + t 2 x] = y 2 2yE[t x] + E[t 2 x] = y 2 2yE[t x] + E[t x] 2 + Var[t x] = (y y ) 2 + Var[t x] The term Var[t x], called the Bayes error, is the best risk we can hope to achieve. Roger Grosse CSC321 Lecture 9: Generalization 5 / 26

Bias/Variance Decomposition Now suppose we sample a training set from the data generating distribution, train a model on it, and use that model to make a prediction y on test example x. Here, y is a random variable, and we get a new value each time we sample a new training set. We d like to minimize the risk, or expected loss E[(y t) 2 ]. We can decompose this into bias, variance, and Bayes error. (We suppress the conditioning on x for clarity.) E[(y t) 2 ] = E[(y y ) 2 ] + Var(t) = E[y 2 2y y + y 2 ] + Var(t) = y 2 2y E[y] + E[y 2 ] + Var(t) = y 2 2y E[y] + E[y] 2 + Var(y) + Var(t) = (y E[y]) 2 }{{} bias + Var(y) }{{} variance + Var(t) }{{} Bayes error Roger Grosse CSC321 Lecture 9: Generalization 6 / 26

Bias/Variance Decomposition We can visualize this decomposition in output space, where the axes correspond to predictions on the test examples. If we have an overly simple model, it might have high bias (because it s too simplistic to capture the structure in the data) low variance (because there s enough data to estimate the parameters, so you get the same results for any sample of the training data) Roger Grosse CSC321 Lecture 9: Generalization 7 / 26

Bias/Variance Decomposition If you have an overly complex model, it might have low bias (since it learns all the relevant structure) high variance (it fits the idiosyncrasies of the data you happened to sample) The bias/variance decomposition holds only for squared error, but it provides a useful intuition even for other loss functions. Roger Grosse CSC321 Lecture 9: Generalization 8 / 26

Our Bag of Tricks How can we train a model that s complex enough to model the structure in the data, but prevent it from overfitting? I.e., how to achieve low bias and low variance? Our bag of tricks data augmentation reduce the number of paramters weight decay early stopping ensembles (combine predictions of different models) stochastic regularization (e.g. dropout) The best-performing models on most benchmarks use some or all of these tricks. Roger Grosse CSC321 Lecture 9: Generalization 9 / 26

Data Augmentation The best way to improve generalization is to collect more data! Suppose we already have all the data we re willing to collect. We can augment the training data by transforming the examples. This is called data augmentation. Examples (for visual recognition) translation horizontal or vertical flip rotation smooth warping noise (e.g. flip random pixels) Only warp the training, not the test, examples. The choice of transformations depends on the task. (E.g. horizontal flip for object recognition, but not handwritten digit recognition.) Roger Grosse CSC321 Lecture 9: Generalization 10 / 26

Reducing the Number of Parameters Can reduce the number of layers or the number of paramters per layer. Adding a linear bottleneck layer is another way to reduce the number of parameters: The first network is strictly more expressive than the second (i.e. it can represent a strictly larger class of functions). (Why?) Remember how linear layers don t make a network more expressive? They might still improve generalization. Roger Grosse CSC321 Lecture 9: Generalization 11 / 26

Weight Decay We ve already seen that we can regularize a network by penalizing large weight values, thereby encouraging the weights to be small in magnitude. E reg = E + λr = E + λ 2 We saw that the gradient descent update can be interpreted as weight decay: ( ) E w w α w + λ R w ( ) E = w α w + λw = (1 αλ)w α E w j w 2 j Roger Grosse CSC321 Lecture 9: Generalization 12 / 26

Weight Decay Why we want weights to be small: y = 0.1x 5 + 0.2x 4 + 0.75x 3 x 2 2x + 2 y = 7.2x 5 + 10.4x 4 + 24.5x 3 37.9x 2 3.6x + 12 The red polynomial overfits. Notice it has really large coefficients. Roger Grosse CSC321 Lecture 9: Generalization 13 / 26

Weight Decay Why we want weights to be small: Suppose inputs x 1 and x 2 are nearly identical. The following two networks make nearly the same predictions: But the second network might make weird predictions if the test distribution is slightly different (e.g. x 1 and x 2 match less closely). Roger Grosse CSC321 Lecture 9: Generalization 14 / 26

Weight Decay The geometric picture: Roger Grosse CSC321 Lecture 9: Generalization 15 / 26

Weight Decay There are other kinds of regularizers which encourage weights to be small, e.g. sum of the absolute values. These alternative penalties are commonly used in other areas of machine learning, but less commonly for neural nets. Regularizers differ by how strongly they prioritize making weights exactly zero, vs. not being very large. Hinton, Coursera lectures Bishop, Pattern Recognition and Machine Learning Roger Grosse CSC321 Lecture 9: Generalization 16 / 26

Early Stopping We don t always want to find a global (or even local) optimum of our cost function. It may be advantageous to stop training early. Early stopping: monitor performance on a validation set, stop training when the validtion error starts going up. Roger Grosse CSC321 Lecture 9: Generalization 17 / 26

Early Stopping A slight catch: validation error fluctuates because of stochasticity in the updates. Determining when the validation error has actually leveled off can be tricky. Roger Grosse CSC321 Lecture 9: Generalization 18 / 26

Early Stopping Why does early stopping work? Weights start out small, so it takes time for them to grow large. Therefore, it has a similar effect to weight decay. If you are using sigmoidal units, and the weights start out small, then the inputs to the activation functions take only a small range of values. Therefore, the network starts out approximately linear, and gradually becomes more nonlinear (and hence more powerful). Roger Grosse CSC321 Lecture 9: Generalization 19 / 26

Ensembles If a loss function is convex (with respect to the predictions), you have a bunch of predictions, and you don t know which one is best, you are always better off averaging them. L(λ 1y 1 + + λ N y N, t) λ 1L(y 1, t) + + λ N L(y N, t) for λ i 0, i λ i = 1 This is true no matter where they came from (trained neural net, random guessing, etc.). Note that only the loss function needs to be convex, not the optimization problem. Examples: squared error, cross-entropy, hinge loss If you have multiple candidate models and don t know which one is the best, maybe you should just average their predictions on the test data. The set of models is called an ensemble. Averaging often helps even when the loss is nonconvex (e.g. 0 1 loss). Roger Grosse CSC321 Lecture 9: Generalization 20 / 26

Ensembles Some examples of ensembles: Train networks starting from different random initializations. But this might not give enough diversity to be useful. Train networks on differnet subsets of the training data. This is called bagging. Train networks with different architectures or hyperparameters, or even use other algorithms which aren t neural nets. Ensembles can improve generalization quite a bit, and the winning systems for most machine learning benchmarks are ensembles. But they are expensive, and the predictions can be hard to interpret. Roger Grosse CSC321 Lecture 9: Generalization 21 / 26

Stochastic Regularization For a network to overfit, its computations need to be really precise. This suggests regularizing them by injecting noise into the computations, a strategy known as stochastic regularization. Dropout is a stochastic regularizer which randomly deactivates a subset of the units (i.e. sets their activations to zero). { φ(zi ) with probability 1 ρ h i = 0 with probability ρ, where ρ is a hyperparameter. Equivalently, h i = m i φ(z i ), where m i is a Bernoulli random variable, independent for each hidden unit. Backprop rule: z i = h i m i φ (z i ) Roger Grosse CSC321 Lecture 9: Generalization 22 / 26

Stochastic Regularization Dropout can be seen as training an ensemble of 2 D different architectures with shared weights (where D is the number of units): Goodfellow et al., Deep Learning Roger Grosse CSC321 Lecture 9: Generalization 23 / 26

Dropout Dropout at test time: Most principled thing to do: run the network lots of times independently with different dropout masks, and average the predictions. Individual predictions are stochastic and may have high variance, but the averaging fixes this. In practice: don t do dropout at test time, but multiply the weights by 1 ρ Since the weights are on 1 ρ fraction of the time, this matches their expectation. You ll derive an interesting interpretation of this in Homework 4. Roger Grosse CSC321 Lecture 9: Generalization 24 / 26

Stochastic Regularization Dropout can help performance quite a bit, even if you re already using weight decay. Lots of other stochastic regularizers have been proposed: DropConnect drops connections instead of activations. Batch normalization (mentioned last week for its optimization benefits) also introduces stochasticity, thereby acting as a regularizer. The stochasticity in SGD updates has been observed to act as a regularizer, helping generalization. Increasing the mini-batch size may improve training error at the expense of test error! Roger Grosse CSC321 Lecture 9: Generalization 25 / 26

Our Bag of Tricks Techniques we just covered: data augmentation reduce the number of paramters weight decay early stopping ensembles (combine predictions of different models) stochastic regularization (e.g. dropout) The best-performing models on most benchmarks use some or all of these tricks. Roger Grosse CSC321 Lecture 9: Generalization 26 / 26