CLOSE-TO-CLEAN REGULARIZATION RELATES

Similar documents
Fast Learning with Noise in Deep Neural Nets

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep Learning Made Easier by Linear Transformations in Perceptrons

ECE521 week 3: 23/26 January 2017

NON-LINEAR NOISE ADAPTIVE KALMAN FILTERING VIA VARIATIONAL BAYES

WHY ARE DEEP NETS REVERSIBLE: A SIMPLE THEORY,

Regularization in Neural Networks

Learning Gaussian Process Models from Uncertain Data

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Combine Monte Carlo with Exhaustive Search: Effective Variational Inference and Policy Gradient Reinforcement Learning

Latent Variable Models and EM algorithm

A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Nonparametric Bayesian Methods (Gaussian Processes)

CS 7140: Advanced Machine Learning

Least Squares Regression

Semi-Supervised Learning with the Deep Rendering Mixture Model

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Variational Autoencoders

Complexity Bounds of Radial Basis Functions and Multi-Objective Learning

RECURSIVE OUTLIER-ROBUST FILTERING AND SMOOTHING FOR NONLINEAR SYSTEMS USING THE MULTIVARIATE STUDENT-T DISTRIBUTION

Linear Regression (9/11/13)

A Least Squares Formulation for Canonical Correlation Analysis

Statistical Data Mining and Machine Learning Hilary Term 2016

Pattern Recognition and Machine Learning

Notes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces

arxiv: v2 [stat.ml] 15 Aug 2017

Least Squares Regression

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE

CS281 Section 4: Factor Analysis and PCA

Day 3 Lecture 3. Optimizing deep networks

Variational Autoencoder

Deep Feedforward Networks

PATTERN CLASSIFICATION

Probabilistic & Unsupervised Learning

Adversarial Examples

Statistical Machine Learning Hilary Term 2018

Probabilistic & Bayesian deep learning. Andreas Damianou

Gaussian Mixture Distance for Information Retrieval

Learning to Learn and Collaborative Filtering

Recursive Noise Adaptive Kalman Filtering by Variational Bayesian Approximations

Greedy Layer-Wise Training of Deep Networks

MTTTS16 Learning from Multiple Sources

Generative classifiers: The Gaussian classifier. Ata Kaban School of Computer Science University of Birmingham

Tutorial on: Optimization I. (from a deep learning perspective) Jimmy Ba

Short Term Memory Quantifications in Input-Driven Linear Dynamical Systems

Minimum Error Rate Classification

Discriminative Direction for Kernel Classifiers

Covariance function estimation in Gaussian process regression

Natural Gradients via the Variational Predictive Distribution

Measuring the Usefulness of Hidden Units in Boltzmann Machines with Mutual Information

Synthesis of Maximum Margin and Multiview Learning using Unlabeled Data

Learning Gaussian Process Kernels via Hierarchical Bayes

Deep unsupervised learning

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

GAUSSIAN PROCESS REGRESSION

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Bayesian Semi-supervised Learning with Deep Generative Models

Reading Group on Deep Learning Session 1

Auto-Encoders & Variants

Information geometry for bivariate distribution control

EM-algorithm for Training of State-space Models with Application to Time Series Prediction

Probability and Information Theory. Sargur N. Srihari

Unsupervised Learning with Permuted Data

The Learning Problem and Regularization

Active and Semi-supervised Kernel Classification

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Gaussian Process Regression

An Introduction to Expectation-Maximization

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

5. Discriminant analysis

STA 4273H: Statistical Machine Learning

The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1

Lecture 6. Regression

Density Estimation. Seungjin Choi

Worst-Case Bounds for Gaussian Process Models

Machine Learning Basics: Maximum Likelihood Estimation

Variational Principal Components

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *

Smart PCA. Yi Zhang Machine Learning Department Carnegie Mellon University

Denoising Criterion for Variational Auto-Encoding Framework

Summary and discussion of: Dropout Training as Adaptive Regularization

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

arxiv: v3 [cs.lg] 18 Mar 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Statistical mechanics of ensemble learning

Learning gradients: prescriptive models

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

Machine learning - HT Maximum Likelihood

Learning SVM Classifiers with Indefinite Kernels

18.6 Regression and Classification with Linear Models

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

Bayesian Machine Learning

A Comparison of Nonlinear Kalman Filtering Applied to Feed forward Neural Networks as Learning Algorithms

Statistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003

Porcupine Neural Networks: (Almost) All Local Optima are Global

Transcription:

Worshop trac - ICLR 016 CLOSE-TO-CLEAN REGULARIZATION RELATES VIRTUAL ADVERSARIAL TRAINING, LADDER NETWORKS AND OTHERS Mudassar Abbas, Jyri Kivinen, Tapani Raio Department of Computer Science, School of Science, Aalto University, Espoo, Finland. firstname.lastname@aalto.fi ABSTRACT We propose a regularization framewor where we feed an original clean data point and a nearby point through a mapping, which is then penalized by the Euclidian distance between the corresponding outputs. The nearby point may be chosen randomly or adversarially. We relate this framewor to many existing regularization methods: It is a stochastic estimate of penalizing the Frobenius norm of the Jacobian of the mapping as in Poggio & Girosi (1990), it generalizes noise regularization (Sietsma & Dow, 1991), and it is a simplification of the canonical regularization term by the ladder networs in Rasmus et al. (015). We also study the connection to virtual adversarial training (VAT) (Miyato et al., 016) and show how VAT can be interpreted as penalizing the largest eigenvalue of a Fisher information matrix. Our main contribution is discovering connections between the proposed and existing regularization methods. 1 INTRODUCTION Regularization is a commonly used meta-level technique in training neural networs. This paper proposes a novel regularization method and studies theoretical connections of the approach to three regularization techniques proposed in the literature. These regularization techniques are the canonical regularizations by penalizing the Jacobian (Poggio & Girosi, 1990), by noise injection (Sietsma & Dow, 1991), by the Ladder networ (Rasmus et al., 015), and by the virtual adversarial training (Miyato et al., 016). The structure of the rest of the paper is as follows: In the following section we describe the novel regularization criteria. In section 3 we study connections of these regularization techniques theoretically, providing novel theoretical results together connecting all of the regularization schemes. Section 4 concludes and discusses future wor. CLOSE-TO-CLEAN REGULARIZATION Given a set of (N) input examples X = { {x (n) } N n=1}, and mapping f(x; θ) parameterized by θ, training happens by minimizing a training criterion L (X, θ) = E p(x) ( ) iteratively, where the expectation E p(x) is taen over the training data set X. 1 Often a regularization term R is added to the training criterion in order to help generalize better to unseen data. The canonical regularization term by the proposed approach is given as follows: R cc (x, θ; α, ɛ) = αe p(x, x) f(x) f(x + x) = αe p(x, x) h h, (1) where h = f(x; θ) denotes the networ output given an example x, h = f(x + x; θ) denotes the networ output given a perturbed version of the example x + x. α and ɛ are positive scalar hyperparameters where ɛ is used to control the amount of perturbation. The authors declare equal contributions. 1 We leave open the particular tas and rest of the neural architecture. 1

Worshop trac - ICLR 016 By default, the perturbation is implemented as additive white Gaussian noise x N ( 0, ɛ I ), with I denoting the identity-matrix. We can also consider an adversarial variant where x = x adv. = arg x: x ɛ h h. 3 CONNECTIONS TO OTHER REGULARIZATION METHODS 3.1 PENALIZING THE JACOBIAN A central method connecting different regularizers here is the penalization of the Jacobian. Poggio & Girosi (1990) proposed to penalize a neural networ by the Frobenius norm of the Jacobian J: ] R Jacobian = E p(x, x) J F = E p(x, x) ( ) h. (),i In the special case of linear f, the Jacobian is constant and the Jacobian penalty can be seen as a generalization of weight decay. Note that the Frobenius norm of the Jacobian penalizes the mapping f directly, rather than through its parameters. Matsuoa (199) and Reed et al. (199) found a connection between noise regularization and penalizing the Jacobian, assuming a regression setting with a quadratic loss. The connection is applicable in our setting, too, as is demonstrated in the following: Assuming that a small perturbation of the form x N (0, ɛ I) is added to every example x, such that every h = f(x+ x; θ), approximating h near h using a component-wise Taylor series expansion yields that for every output dimension index h h + J,: x + 1 ( x) H () x, where J,i = h (x) is the Jacobian matrix of partial derivatives of h w.r.t. x i, J,: denotes its th row vector, and H () = h (x) x j is the Hessian matrix. Plugging this into eq. (1) yields that E p(x, x) ( h h ) ] E p(x, x) (h + J,: x + 1 ( x) H () x h ) ] = E p(x, x) (J,: x) ] + E p(x, x) J,: x( x) H () x ] 1 + E p(x, x) ( ( x) H () x) ] ( ) h ɛ, E p(x, x) (J,: x) ] = ɛ E p(x) J,: ] = i where we have used the fact that E p( x) x( x) ] = ɛ I and the assumption that the terms involving the Hessians are very small. Therefore we can obtain that R cc = αe p(x, x) h h ] αɛ R Jacobian, and thus conclude that the proposed regularization can be seen as a stochastic approximation of the Jacobian penalty in the limit of low noise. 3. NOISE REGULARIZATION Sietsma & Dow (1991) proposed to regularize neural networs by injecting noise to the inputs x while training. This method does not use a separate penalty function. It can be interpreted either as eeping the solution insensitive to random perturbations in x, or as a data augmentation method, where a number of noisy copies of each data point are used in training. In the semi-supervised setting, noise regularization alone does not mae unlabelled data useful. However, close-to-clean regularization is applicable even when the output is not available: We can compute the regularization penalty R cc based on the input alone. In noise regularization, the strength of regularization can only be adjusted by changing the noise level. This affects the locality of the regularization at the same time. Close-to-clean regularization gives two hyperparameters to adjust: Noise level ɛ and the regularization strength α. Using small noise level ɛ emphasizes the local (linear) properties of the mapping f, while a larger noise level emphasizes more global (nonlinear) properties.

Worshop trac - ICLR 016 3.3 VIRTUAL ADVERSARIAL TRAINING In this section we will show the following: (i) VAT penalizes the largest eigenvalue of a Fisher information matrix, (ii) in the special case of h parameterizing the mean of a Gaussian output, the VAT loss which the adversarial perturbation imization is operating on is equivalent to the proposed loss R cc, and (iii) the adversarial variant of the close-to-clean regularization penalizes the largest eigenvalue of J J, whereas the sum of its eigenvalues equals the Frobenius norm of J. The virtual adversarial training of Miyato et al. (016) regularizes training by a term called latent distribution smoothing LDS = {D KL (p(y x, Φ) p(y x + x, Φ))}, x: x ɛ the negative value of the imal Kullbac-Leibler divergence between encoding distributions for an example x and its perturbed version x + x, denoted p(y x, Φ) and p(y x + x, Φ), respectively; the distributions are assumed to be of the same parametric form, and that the perturbation affects the values of the parameters. Using a note in Kasi et al. (001, eq. (1)-()), and a result in Kullbac (1959, Sec. 6, page 8, eq. (6.4)), we can state: D KL (p(y x, Φ) p(y x + x, Φ)) 1 ( x) I(x) x, (3) ( )( ) with I(x) = E log p(y x,φ) log p(y x,φ), y p(y x,φ) x x a Fisher-information matrix. Let x =, a unit-length input-perturbation vector. Using the approximation above we can then write that x x LDS 1 = 1 x: x ɛ ( x) I(x) x = 1 x : x ɛ x }{{} ɛ x ( ) I(x) x x x x x : x ɛ ( ) I(x) x x x x = ɛ {λ } K = ɛ λ, (4) where λ denotes the imal eigenvalue of the (K) eigenvalues {λ } K of I(x), with the equality mared = suggested by a spectral ( theory ) for symmetric matrices printed in Råde & Westergren (1998, Sec. 4.5, page 96); diag {λ } K = C I(x)C, where C denotes the matrix of eigenvectors with C, denoting its th column and th eigenvector; the adversarial perturbation vector x adv. = ɛ C,arg {λ } K. Let us assume y = f(x; θ) + n, where n N (0, σ I). We can then write that p(y x, Φ) = N ( y; h, σ I ) (, p(y x + x, Φ) = N y; h, ) σ I, where h = f(x; θ), h = f(x + x; θ), Φ = {θ, σ}, and I denotes the identity-matrix. Using Kullbac (1959, Chapter 9, Sec. 1, page 190, eq. (1.4)) (and the derivation in the Appendix), we have that D KL (N ( y; h, σ I ) N ( )) y; h, σ I = 1 σ K (h h ) = 1 ( ) ( ) σ h h h h For small perturbations x the following linearization approximation is expected to be effective for non-linear functions f : It is exact for linear functions f. h h = f(x; θ) f(x + x; θ) J x,. 3

Worshop trac - ICLR 016 where J denotes a Jacobian-matrix, with element J,i = f (x), where 1, K], and i 1, I]. Then given the prior analysis, we have for the Gaussian models that LDS 1 σ x: x ɛ ( x) J J x = ɛ σ λ, where λ denotes the imal eigenvalue of the (K) eigenvalues {λ } K of J J; we have used the fact that the optimization problem is exactly the same as in eq. (4), with the replacement of I(x) with J J. Since K λ = Trace ( J J) = J,i = ( ) f (x), (5) x i i i where the first equality is suggested by a fact in Råde & Westergren (1998, Sec. 4.5, page 96), we find that the (canonical) regularization term of the contractive auto-encoder (Rifai et al., 011), i J,i = ( ) f (x), i is equivalent to the sum of the eigenvalues of J J; this is different to the canonical VAT-type regularization above, where it was shown to be equivalent (exactly or approximately) to a scalar multiple 3 of the imum of the eigenvalues. 3.4 LADDER NETWORKS The Γ variant of the ladder networ (Rasmus et al., 015) is closely related to the proposed approach. It uses an auxiliary denoised ĥ = g( h) and an auxiliary cost R Γ = E p(x, x) h ĥ to support another tas such as classification. It means that if we would train a Γ-ladder networ and further restrict the g-function to identity, we would recover the proposed method. 4 FUTURE WORK We would lie to study the proposed regularization method and the found connections empirically. Miyato et al. (016) used the D KL in equation (3) as a regularization criterion with adversarial perturbations, but it could also be used for regularization using random perturbations. ACKNOWLEDGMENTS We than Casper Kaae Sønderby, Jaao Peltonen, and Jelena Luetina for useful comments related to the manuscript/wor. A KL-DIVERGENCE BETWEEN TWO DIAGONAL-COVARIANCE MULTIVARIATE GAUSSIANS We first derive the KL-divergence between two diagonal-covariance multivariate Gaussians, with untied parameters. We then use this result for the case where the covariance matrices are tied and the elements in the diagonal have the same values. Starting with the first derivation: Assume that both N (y Θ 1 ) and N (y Θ ) denote (K-dimensional) multivariate Gaussians assuming parameters Θ 1 and Θ, respectively, and with independent dimensions as follows: N (y Θ 1 ) = K N (y ; µ, σ ), N (y Θ ) = K N (y ; m, τ ), where N (y ; µ, σ ) denotes a univariate Gaussian with mean µ and variance σ, and N (y ; m, τ ) denotes a univariate Gaussian with mean m and variance τ. We use a divided-and-then-joined calculation strategy to give the result, similar to as in the presentation of the result in Kingma & Welling (014, Appendix B): D KL (N (y Θ 1 ) N (y Θ )) = E y N (y Θ1) log N (y Θ 1 ) E y N (y Θ1) log N (y Θ ). }{{}}{{} A B 3 with the scalar sign set such that the two regularization terms have the same signs under the training optimization. 4

Worshop trac - ICLR 016 Let us now derive the forms of A and B above: { A = E y N (y Θ1) 1 K log π + log σ + (y µ ) /σ ] } = 1 K log π + log σ + 1 { σ E y N (y Θ 1) (y µ ) } }{{}, σ { B = E y N (y Θ1) 1 K log π + log τ + (y m ) /τ ] } = 1 K log π + log τ + 1 { τ E y N (y Θ 1) y m y + m } ] = 1 K log π + log τ + 1 { σ τ + µ m µ + m } ], { } where we have used the fact that E y N (y Θ 1) y = σ + µ ; E(y ) Var(y ) + E(y )]. Joining the results, we have that D KL (N (y Θ 1 ) N (y Θ )) = 1 K 1 + log σ 1 { σ τ τ + (µ m ) }]. (6) Let σ = τ = σ,, N (y Θ 1 ) = K N (y ; µ, σ ), N (y Θ ) = K N (y ; m, σ ). Then using the result in eq. (6), we have that D KL (N (y Θ 1 ) N (y Θ )) = 1 K σ (µ m ) = 1 σ (µ m) (µ m). (7) The result can also be obtained by ( using ( the result in Kullbac (1959, Chapter 9, Sec. 1, page 190, eq. (1.4)) which states that D ) KL N y; µ, Λ 1 N ( y; m, Λ 1)) = 1 (µ m) Λ (µ m): plugging in Λ 1 = ( σ I ) 1 = σ I yields the result in eq. (7). REFERENCES S. Kasi, J. Sinonen, and J. Peltonen. Banruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networs, 1(4):936 947, 001. D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arxiv:131.6114v10 stat.ml], 014. S. Kullbac. Information Theory and Statistics. John Wiley& Sons, Inc., 1959. K. Matsuoa. Noise injection into inputs in bac-propagation learning. IEEE Transactions on Systems, Man and Cybernetics, (3):436 440, 199. T. Miyato, S.-i Maeda, M. Koyama, K. Naae, and S. Ishii. Distributional smoothing with virtual adversarial training. arxiv:1507.00677v7 stat.ml], 016. T. Poggio and F. Girosi. Networs for approximation and learning. Proceedings of the IEEE, 78(9): 1481 1497, 1990. L. Råde and B. Westergren. Mathematics Handboo for Science and Engineering. Studentlitteratur, 4 th edition, 1998. A. Rasmus, M. Berglund, M. Honala, H. Valpola, and T. Raio. Semi-Supervised learning with Ladder Networs. In Proc., Advances in Neural Information Processing Systems (NIPS), pp. 353 3540. Curran Associates, Inc., 015. 5

Worshop trac - ICLR 016 R. Reed, S. Oh, and R. J. Mars II. Regularization using jittered training data. In Proc., International Joint Conference on Neural Networs (IJNN), volume 3, pp. 147 15. IEEE, 199. S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive Auto-Encoders: Explicit invariance during feature extraction. In Proc., International Conference on Machine Learning (ICML), 011. J. Sietsma and R. J. F. Dow. Creating artificial neural networs that generalize. Neural networs, 4 (1):67 79, 1991. 6