Mollifying Networks. ICLR 2017. Presenter: Arshdeep Sekhon & Beilun Wang


Mollifying Networks. Caglar Gulcehre (1), Marcin Moczulski (2), Francesco Visin (3), Yoshua Bengio (1). (1) University of Montreal, (2) University of Oxford, (3) Politecnico di Milano. ICLR 2017. Presenter: Arshdeep Sekhon & Beilun Wang

Reviews https://openreview.net/forum?id=r1g4z8cge

Outline. 1. Introduction (Motivation, Previous Studies). 2. Method (Proposed Method).

Motivation. The loss functions of DNNs are highly non-convex, and saturating activations such as tanh and sigmoid make them difficult to optimize.

Previous Studies. A number of recently proposed methods aim to make optimization easier: 1. curriculum learning, 2. training RNNs with diffusion, 3. noise injection.

Simulated Annealing See the wiki: https://en.wikipedia.org/wiki/simulated_annealing

Key ideas. 1. Inject noise into the activation function during training. 2. Anneal the noise over the course of training. [Figure]

Strategy I: Feedforward Networks. [Figure]

Strategy II: Linearizing the Network. 1. Adding noise to the activation function may cause excessive random exploration when the noise is very large. 2. Solution: bound the element-wise activation function f(·) by its linear approximation when the variance of the noise is very large, after centering that approximation at the origin. 3. The resulting f is bounded and centered at the origin.

Linearizing the Network. [Figures] u(x) is the first-order Taylor approximation of the original activation function around zero, and u*(x) denotes the centered u(x), obtained by shifting u(x) towards the origin.
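As an illustration of this linearization (a minimal sketch; the blending rule and the numerical derivative are assumptions, not the authors' exact construction), the following NumPy code builds u(x) = f(0) + f'(0)x and its centered version u*(x) for a generic activation f, and falls back to u*(x) as the injected noise grows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def taylor_linearization(f, x, eps=1e-4):
    """u(x) = f(0) + f'(0) * x: first-order Taylor approximation of f around zero."""
    f0 = f(0.0)
    fprime0 = (f(eps) - f(-eps)) / (2 * eps)   # numerical derivative of f at 0
    return f0 + fprime0 * x

def centered_linearization(f, x, eps=1e-4):
    """u*(x) = u(x) - u(0): the linearization shifted so it passes through the origin."""
    return taylor_linearization(f, x, eps) - f(0.0)

def noisy_linearized_activation(f, x, noise_std, rng):
    """Hypothetical blend for illustration (not the paper's exact rule): with small
    noise keep the bounded nonlinearity f, with large noise fall back to the
    centered linear approximation u*(x)."""
    xi = rng.normal(0.0, noise_std, size=x.shape)
    alpha = np.exp(-noise_std)   # -> 1 as the noise vanishes, -> 0 for large noise
    return alpha * f(x + xi) + (1.0 - alpha) * centered_linearization(f, x)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 7)
print("small noise:", np.round(noisy_linearized_activation(sigmoid, x, 0.1, rng), 3))
print("large noise:", np.round(noisy_linearized_activation(sigmoid, x, 5.0, rng), 3))
```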

Algorithm. [Figure]

Annealing schedules for p. A different schedule is used for each layer of the network, so that the noise in the lower layers anneals faster.
1. Exponential decay: $p^l_t = 1 - e^{-\frac{k\, v_t\, l}{t L}}$
2. Square-root decay: $p^l_t = \min\!\left(p_{\min},\, 1 - \sqrt{\tfrac{t}{N_{\text{epochs}}}}\right)$
3. Linear decay: $p^l_t = \min\!\left(p_{\min},\, 1 - \tfrac{t}{N_{\text{epochs}}}\right)$
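A minimal NumPy sketch of these schedules (the meaning of $k$, $v_t$, $l$, and $L$, and the exact form of the exponential schedule, follow the reconstruction above and are assumptions):

```python
import numpy as np

def exponential_decay(t, l, L, v_t, k):
    """Per-layer schedule p_t^l = 1 - exp(-k * v_t * l / (t * L)), as reconstructed above."""
    return 1.0 - np.exp(-k * v_t * l / (t * L))

def sqrt_decay(t, n_epochs, p_min):
    """p_t = min(p_min, 1 - sqrt(t / N_epochs))."""
    return min(p_min, 1.0 - np.sqrt(t / n_epochs))

def linear_decay(t, n_epochs, p_min):
    """p_t = min(p_min, 1 - t / N_epochs)."""
    return min(p_min, 1.0 - t / n_epochs)

# Example: noise level over 10 epochs under the linear schedule.
print([round(linear_decay(t, n_epochs=10, p_min=0.95), 2) for t in range(1, 11)])
```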

Explanation of the two strategies: Mollification for Neural Networks. 1. A novel method for training neural networks. 2. A sequence of optimization problems of increasing complexity, where the first ones are easy to solve but only the last one corresponds to the actual problem of interest. [Figure] 3. The training procedure iterates over this sequence of objective functions, starting from the simpler ones (i.e., those with a smoother loss surface) and moving towards more complex ones until the last, original objective.

Mollifiers. 1. To smooth the loss function $L$, parametrized by $\theta \in \mathbb{R}^n$, convolve it with another function $K(\cdot)$ with stride $\tau \in \mathbb{R}^n$: $L_K(\theta) = \int L(\theta - \tau)\, K(\tau)\, d\tau$. 2. There are many choices for $K$, but it must be a mollifier.

Mollifier. 1. A mollifier is an infinitely differentiable function that behaves like an approximate identity in the group of convolutions of integrable functions. 2. If $K(\cdot)$ is an infinitely differentiable function that converges to the Dirac delta function when appropriately rescaled, then for any integrable function $L$ it is a mollifier:
$L_K(\theta) = (L * K)(\theta)$
$L(\theta) = \lim_{\epsilon \to 0} \epsilon^{-n} \int K\!\left(\tfrac{\tau}{\epsilon}\right) L(\theta - \tau)\, d\tau$
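To make the convolution above concrete, here is a small NumPy illustration (not from the paper) that mollifies a non-convex 1-D toy loss by Monte Carlo averaging of $L(\theta - \tau)$ with $\tau \sim \mathcal{N}(0, \epsilon^2)$; larger $\epsilon$ gives a smoother surrogate.

```python
import numpy as np

def loss(theta):
    """A toy non-convex 1-D loss with many local minima."""
    return np.sin(5.0 * theta) + 0.5 * theta**2

def mollified_loss(theta, eps, n_samples=100_000, seed=0):
    """Monte Carlo estimate of L_K(theta) = E_{tau ~ N(0, eps^2)}[L(theta - tau)]."""
    tau = np.random.default_rng(seed).normal(0.0, eps, size=n_samples)
    return loss(theta - tau).mean()

theta = 0.8
print(f"original loss:         {loss(theta):.3f}")
for eps in [0.1, 0.5, 1.0]:
    print(f"mollified (eps={eps}):  {mollified_loss(theta, eps):.3f}")
```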

Mollifiers: Gradients. 1. Gradient of the mollified loss: $\nabla_\theta L_K(\theta) = \nabla_\theta (L * K)(\theta) = (L * \nabla K)(\theta)$. 2. How does $\nabla_\theta L_K(\theta)$ relate to $\nabla_\theta L(\theta)$? 3. Use weak gradients.

Weak Gradients. For an integrable function $L \in \mathcal{L}([a,b])$, $g \in \mathcal{L}([a,b]^n)$ is an $n$-dimensional weak gradient of $L$ if it satisfies
$\int g(\tau)\, K(\tau)\, d\tau = -\int L(\tau)\, \nabla K(\tau)\, d\tau$,
where $K(\tau)$ is an infinitely differentiable function vanishing at infinity, $K \in C^\infty([a,b]^n)$, and $\tau \in \mathbb{R}^n$.

Mollified Gradients. Combining the weak-gradient identity with the gradient of the mollified loss:
$\int g(\tau)\, K(\tau)\, d\tau = -\int L(\tau)\, \nabla K(\tau)\, d\tau$
$\nabla_\theta L_K(\theta) = \nabla_\theta (L * K)(\theta) = (L * \nabla K)(\theta)$
$\nabla_\theta L_K(\theta) = \int L(\theta - \tau)\, \nabla K(\tau)\, d\tau$
$\nabla_\theta L_K(\theta) = \int g(\theta - \tau)\, K(\tau)\, d\tau$
For a function $L$ that is differentiable almost everywhere, the weak gradient $g(\theta)$ equals $\nabla_\theta L(\theta)$ almost everywhere, so
$\nabla_\theta L_K(\theta) = \int \nabla_\theta L(\theta - \tau)\, K(\tau)\, d\tau$
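As a sanity check on the last identity (a toy example, not from the paper): for a smooth 1-D loss, the finite-difference gradient of the Monte Carlo mollified loss should match the Monte Carlo average of $\nabla_\theta L(\theta - \tau)$ when the same noise samples are reused.

```python
import numpy as np

def loss(theta):
    return np.sin(5.0 * theta) + 0.5 * theta**2

def grad_loss(theta):
    return 5.0 * np.cos(5.0 * theta) + theta

theta, eps, n = 0.8, 0.5, 200_000
tau = np.random.default_rng(0).normal(0.0, eps, size=n)

# Right-hand side of the identity: E_tau[ grad L(theta - tau) ]
mc_grad = grad_loss(theta - tau).mean()

# Left-hand side: finite-difference gradient of L_K, reusing the same tau samples.
h = 1e-4
fd_grad = (loss(theta + h - tau).mean() - loss(theta - h - tau).mean()) / (2 * h)

print(f"E[grad L(theta - tau)]   = {mc_grad:.4f}")
print(f"finite-diff grad of L_K  = {fd_grad:.4f}")
```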

Weight noise methods: Gaussian Mollifiers. 1. Use a Gaussian mollifier $K(\cdot)$: it is infinitely differentiable, a sequence of properly rescaled Gaussian distributions converges to the Dirac delta function, and it vanishes at infinity. With $\tau \sim \mathcal{N}(0, I)$ and density $p$:
$\nabla_\theta L_{K=\mathcal{N}}(\theta) = \int \nabla_\theta L(\theta - \tau)\, p(\tau)\, d\tau = \mathbb{E}_\tau[\nabla_\theta L(\theta - \tau)]$
2. Sequence of mollifiers indexed by $\epsilon$, i.e. $\tau \sim \mathcal{N}(0, \epsilon^2 I)$:
$\nabla_\theta L_{K=\mathcal{N},\epsilon}(\theta) = \int \nabla_\theta L(\theta - \tau)\, \epsilon^{-1} p\!\left(\tfrac{\tau}{\epsilon}\right) d\tau$

Weight noise methods: Gaussian Mollifiers. With $\tau \sim \mathcal{N}(0, \epsilon^2 I)$, this satisfies the properties:
$\nabla_\theta L_{K=\mathcal{N},\epsilon}(\theta) = \mathbb{E}_\tau[\nabla_\theta L(\theta - \tau)]$
$\lim_{\epsilon \to 0} \nabla_\theta L_{K=\mathcal{N},\epsilon}(\theta) = \nabla_\theta L(\theta)$

Weight noise methods: Gaussian Mollifiers.
$L_K(\theta) = \int L(\theta - \xi)\, K(\xi)\, d\xi$
By Monte Carlo estimation:
$L_K(\theta) \approx \tfrac{1}{N} \sum_{i=1}^{N} L(\theta - \xi_i)$, and hence $\nabla_\theta L_K(\theta) \approx \tfrac{1}{N} \sum_{i=1}^{N} \nabla_\theta L(\theta - \xi_i)$
Therefore, introducing additive noise to the input of $L(\theta)$ is equivalent to mollification. Using this mollifier for neural networks, with $\xi^l \sim \mathcal{N}(\mu, \sigma^2)$, the layer $h^l = f(W^l h^{l-1})$ becomes $h^l = f\big((W^l - \xi^l)\, h^{l-1}\big)$.
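A minimal NumPy sketch of this weight-noise view (an illustration of the equivalence above, not the authors' implementation; the toy network, tanh activation, and squared-error loss are assumptions): perturbing each layer's weights with Gaussian noise and averaging over samples gives a Monte Carlo estimate of the mollified loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_forward(x, weights, sigma):
    """One forward pass with per-layer Gaussian weight noise: h_l = tanh((W_l - xi_l) h_{l-1})."""
    h = x
    for W in weights:
        xi = rng.normal(0.0, sigma, size=W.shape)
        h = np.tanh((W - xi) @ h)
    return h

def mollified_loss(x, y, weights, sigma, n_samples=32):
    """Monte Carlo estimate of the mollified squared-error loss."""
    losses = [np.mean((noisy_forward(x, weights, sigma) - y) ** 2)
              for _ in range(n_samples)]
    return float(np.mean(losses))

# Toy 2-layer network: 4 -> 8 -> 1
weights = [rng.normal(0, 0.5, size=(8, 4)), rng.normal(0, 0.5, size=(1, 8))]
x, y = rng.normal(size=4), np.array([0.3])
for sigma in [0.0, 0.1, 1.0]:
    print(f"sigma={sigma}: mollified loss = {mollified_loss(x, y, weights, sigma):.4f}")
```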

Generalized Mollifiers. A generalized mollifier is an operator $T_\sigma$, where $T_\sigma(f)$ defines a mapping between two functions, $T_\sigma : f \to f^*$, such that:
$\lim_{\sigma \to 0} T_\sigma f = f$
$f^0 = \lim_{\sigma \to \infty} T_\sigma f$ is an identity function
$\frac{\partial (T_\sigma f)(x)}{\partial x}$ exists for all $x$ and all $\sigma > 0$

Noisy Mollifier. A stochastic function $\phi(x, \xi_\sigma)$ with input $x$ and noise $\xi$ is a noisy mollifier if its expected value corresponds to the application of a generalized mollifier $T_\sigma$:
$(T_\sigma f)(x) = \mathbb{E}[\phi(x, \xi_\sigma)]$
1. When $\sigma = 0$, no noise is injected, so the original function is optimized. 2. If $\sigma \to \infty$ instead, the function becomes an identity function.
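A toy numerical check of these two limiting cases, using a hypothetical noisy mollifier $\phi$ (the blend below is an assumption for illustration, not the paper's construction): at $\sigma = 0$ the expectation recovers $f(x)$, and for very large $\sigma$ it collapses to the identity.

```python
import numpy as np

def phi(x, sigma, rng):
    """Hypothetical noisy mollifier: blends f(x + noise) with the identity, with a
    weight exp(-sigma) that favours f for small sigma and x for large sigma."""
    xi = rng.normal(0.0, sigma)
    alpha = np.exp(-sigma)
    return alpha * np.tanh(x + xi) + (1.0 - alpha) * x

rng = np.random.default_rng(0)
x = 1.5
for sigma in [0.0, 0.5, 10.0]:
    samples = [phi(x, sigma, rng) for _ in range(20_000)]
    print(f"sigma={sigma:>4}: E[phi(x, xi_sigma)] = {np.mean(samples):.3f} "
          f"(f(x)={np.tanh(x):.3f}, x={x})")
```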

Method: Mollify the cost of an NN. 1. During training, minimize a sequence of increasingly complex noisy objectives $\{L_1(\theta, \xi_{\sigma_1}), L_2(\theta, \xi_{\sigma_2}), \dots, L_k(\theta, \xi_{\sigma_k})\}$ by annealing the scale (variance) of the noise $\sigma_i$. 2. The algorithm satisfies the fundamental properties of the generalized and noisy mollifiers.

Algorithm. 1. Start by optimizing a convex objective, obtained by configuring all the layers between the input and the last cost layer to compute an identity function (by skipping both the affine transformations and the blocks followed by nonlinearities). 2. During training, the magnitude of the noise, which is proportional to p, is annealed, allowing a gradual evolution from identity transformations to linear transformations between the layers. 3. Simultaneously, as p decreases, the noisy mollification procedure allows the element-wise activation functions to gradually change from linear to nonlinear. 4. This changes both the shape of the cost and the model architecture.
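A NumPy sketch of this training procedure under stated assumptions (the per-unit gate, the tanh nonlinearity, tying the noise scale to p, and the simple linear anneal are all simplifications, not the paper's Algorithm 1): each layer stochastically mixes an identity path with a noisy nonlinear path, and annealing p gradually turns the network from a (near-)identity map into a noisy nonlinear net.

```python
import numpy as np

rng = np.random.default_rng(0)

def mollified_layer(h, W, p, sigma, rng):
    """Per-unit stochastic mix of an identity (skip) path and a noisy nonlinear path.
    Assumes square W so the skip path is the plain identity; with p close to 1 the
    layer mostly copies its input, and as p -> 0 it approaches tanh(W h)."""
    xi = rng.normal(0.0, sigma, size=W.shape)
    nonlinear = np.tanh((W - xi) @ h)        # noisy affine transform + nonlinearity
    gate = rng.random(h.shape[0]) < p        # True = take the identity path
    return np.where(gate, h, nonlinear)

n_epochs, width, depth = 10, 6, 3
weights = [rng.normal(0, 0.3, size=(width, width)) for _ in range(depth)]
x = rng.normal(size=width)

for t in range(n_epochs):
    p = max(0.0, 1.0 - t / n_epochs)   # simple linear anneal of p (illustrative)
    sigma = p                          # tie the noise scale to p (an assumption)
    h = x
    for W in weights:
        h = mollified_layer(h, W, p, sigma, rng)
    # ... compute the task loss on h and update `weights` with any optimizer ...
    print(f"epoch {t}: p = {p:.2f}")
```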

Experiments: Deep Parity, CIFAR. [Figure]

Different Annealing Schedules. [Figure]