A. Exclusive KL View of the MLE

Let us assume a change-of-variable model $p_Z(z)$ on the random variable $Z \in \mathbb{R}^m$, such as the one used in Dinh et al. (2017): $z_0 \sim p_0(z_0)$ and $z = \psi(z_0)$, where $\psi$ is an invertible function and density evaluation of $z_0 \in \mathbb{R}^m$ is tractable under $p_0$. The resulting density function can be written

$$p_Z(z) = p_0(z_0)\left|\frac{\partial \psi(z_0)}{\partial z_0}\right|^{-1}$$

The maximum likelihood principle requires us to minimize the following metric:

$$\begin{aligned}
D_{\mathrm{KL}}\big(p_{\mathrm{data}}(z)\,\|\,p_Z(z)\big)
&= \mathbb{E}_{p_{\mathrm{data}}}\big[\log p_{\mathrm{data}}(z) - \log p_Z(z)\big] \\
&= \mathbb{E}_{p_{\mathrm{data}}}\left[\log p_{\mathrm{data}}(z) - \log\left(p_0(z_0)\left|\frac{\partial \psi^{-1}(z)}{\partial z}\right|\right)\right] \\
&= \mathbb{E}_{p_{\mathrm{data}}}\left[\log\left(p_{\mathrm{data}}(z)\left|\frac{\partial \psi^{-1}(z)}{\partial z}\right|^{-1}\right) - \log p_0(z_0)\right]
\end{aligned}$$

which coincides with the exclusive KL divergence; to see this, we take $X = Z$, $Y = Z_0$, $f = \psi^{-1}$, $p_X = p_{\mathrm{data}}$, and $p_{\mathrm{target}} = p_0$:

$$= \mathbb{E}_{p_X}\left[\log\left(p_X(x)\left|\frac{\partial f(x)}{\partial x}\right|^{-1}\right) - \log p_{\mathrm{target}}(y)\right]
= D_{\mathrm{KL}}\big(p_Y(y)\,\|\,p_{\mathrm{target}}(y)\big)$$

This means we want to transform the empirical distribution $p_{\mathrm{data}}$, or $p_X$, to fit the target density (the usually unstructured base distribution $p_0$), as explained in Section 2.

B. Monotonicity of NAF

Here we show that using strictly positive weights and strictly monotonically increasing activation functions is sufficient to ensure the strict monotonicity of NAF. For brevity, we write monotonic, or monotonicity, to refer to the strictly monotonically increasing behavior of a function. Also, note that a differentiable function whose derivative is everywhere greater than 0 is strictly monotonically increasing.

Proof. (Proposition 1) Suppose we have an MLP with $L+1$ layers, $h_0, h_1, h_2, \dots, h_L$, with $x = h_0$ and $y = h_L$, where $x$ and $y$ are scalar and $h_{l,j}$ denotes the $j$-th node of the $l$-th layer. For $1 \le l \le L$, we have

$$p_{l,j} = w_{l,j}^{\top} h_{l-1} + b_{l,j} \qquad (15)$$
$$h_{l,j} = A_l(p_{l,j}) \qquad (16)$$

for some monotonic activation function $A_l$, positive weight vector $w_{l,j}$, and bias $b_{l,j}$. Differentiating this yields

$$\frac{\partial h_{l,j}}{\partial h_{l-1,k}} = A_l'(p_{l,j})\, w_{l,j,k}, \qquad (17)$$

which is greater than 0, since the strictly increasing activation $A_l$ has positive derivative and $w_{l,j,k}$ is positive by construction. Thus any unit of layer $l$ is monotonic with respect to any unit of layer $l-1$, for $1 \le l \le L$.

Now, suppose we have $J$ monotonic functions $f_j$ of the input $x$. Then the weighted sum of these functions $\sum_{j=1}^{J} u_j f_j$ with $u_j > 0$ is also monotonic with respect to $x$, since

$$\frac{\partial}{\partial x} \sum_{j=1}^{J} u_j f_j = \sum_{j=1}^{J} u_j \frac{\partial f_j}{\partial x} > 0 \qquad (18)$$

Finally, we use induction to show that all $h_{l,j}$, including $y$, are monotonic with respect to $x$.

1. The base case ($l = 1$) is given by Equation 17.
2. Suppose the inductive hypothesis holds, which means $h_{l,j}$ is monotonic with respect to $x$ for all $j$ of layer $l$. Then by Equation 18, $h_{l+1,k}$ is also monotonic with respect to $x$ for all $k$.

Thus by mathematical induction, monotonicity of $h_{l,j}$ holds for all $l$ and $j$.
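As a sanity check of Proposition 1, the following minimal sketch (an illustration in PyTorch, not the paper's code; the layer sizes, random seed, and choice of sigmoid activations are arbitrary assumptions) builds an MLP with strictly positive weights and a strictly increasing activation, and verifies numerically that the resulting scalar map is strictly increasing.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical 3-layer MLP (scalar input, scalar output). Positive weights are
# obtained by exponentiating unconstrained parameters, and sigmoid is a strictly
# increasing activation, so Proposition 1 predicts a strictly increasing map.
dims = [(8, 1), (8, 8), (1, 8)]
Ws = [torch.randn(o, i, dtype=torch.float64) for o, i in dims]
bs = [torch.randn(o, dtype=torch.float64) for o, _ in dims]

def monotone_mlp(x):
    h = x
    for W, b in zip(Ws, bs):
        h = torch.sigmoid(F.linear(h, W.exp(), b))  # W.exp() > 0 elementwise
    return h

x = torch.linspace(-3.0, 3.0, 500, dtype=torch.float64).unsqueeze(1)
y = monotone_mlp(x).squeeze(1)
assert torch.all(y[1:] > y[:-1])  # strictly increasing along the input grid
```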

C. Log Determinant of Jacobian

As we mentioned at the end of Section 3.1, to compute the log-determinant of the Jacobian as part of the objective function, we need to handle numerical stability. We first derive the Jacobian of DDSF (note that DSF is a special case of DDSF), and then summarize the numerically stable operations that were utilized in this work.

C.1. Jacobian of DDSF

Again defining $x = h^0$ and $y = h^L$, the Jacobian of each DDSF transformation can be written as a sequence of dot products due to the chain rule:

$$\nabla_x y = \left[\nabla_{h^{L-1}} h^{L}\right]\left[\nabla_{h^{L-2}} h^{L-1}\right]\cdots\left[\nabla_{h^{0}} h^{1}\right] \qquad (19)$$

For notational convenience, we define a few more intermediate variables. For each layer of DDSF, we have

$$\begin{aligned}
C^{l+1} &= a^{l+1} \odot \left(u^{l+1} h^{l}\right) + b^{l+1} \\
D^{l+1} &= w^{l+1}\, \sigma\!\left(C^{l+1}\right) \\
h^{l+1} &= \sigma^{-1}\!\left(D^{l+1}\right)
\end{aligned}$$

The gradient can be expanded as

$$\nabla_{h^{l}} h^{l+1} =
\left[\nabla_{D^{l+1}} \sigma^{-1}\!\left(D^{l+1}\right)\right]_{[:,\bullet]} \odot
\left[\nabla_{\sigma(C^{l+1})} D^{l+1}\right] \odot
\left[\nabla_{C^{l+1}} \sigma\!\left(C^{l+1}\right)\right]_{[\bullet,:]} \odot
\left[\nabla_{u^{l+1}h^{l}} C^{l+1}\right]_{[\bullet,:]} \odot
\left[\nabla_{h^{l}}\!\left(u^{l+1} h^{l}\right)\right]_{[\bullet,:,:]} \cdot \mathbf{1}_{[:,:,\bullet]}$$

where the bullet in the subscript indicates the dimension is broadcast, $\odot$ denotes element-wise multiplication, and $\cdot\,\mathbf{1}$ denotes summation over the last dimension after the element-wise product:

$$\nabla_{h^{l}} h^{l+1} =
\left[\frac{1}{D^{l+1}\left(1 - D^{l+1}\right)}\right]_{[:,\bullet]} \odot
w^{l+1} \odot
\left[\sigma\!\left(C^{l+1}\right)\left(1 - \sigma\!\left(C^{l+1}\right)\right) \odot a^{l+1}\right]_{[\bullet,:]} \odot
\left[u^{l+1}\right]_{[\bullet,:,:]} \cdot \mathbf{1}_{[:,:,\bullet]} \qquad (20)$$

Written with explicit indices, Equation 20 reads

$$\left[\nabla_{h^{l}} h^{l+1}\right]_{ik} = \frac{1}{D^{l+1}_{i}\left(1 - D^{l+1}_{i}\right)} \sum_{j} w^{l+1}_{ij}\,\sigma\!\left(C^{l+1}_{j}\right)\left(1 - \sigma\!\left(C^{l+1}_{j}\right)\right) a^{l+1}_{j}\, u^{l+1}_{jk}$$

C.2. Numerically Stable Operations

Since the Jacobian of each DDSF transformation is a chain of dot products (Equation 19) with some nested multiplicative operations (Equation 20), we calculate everything in the log scale, where multiplication is replaced with addition, to avoid an unstable gradient signal.

C.2.1. LOG ACTIVATION

To ensure the summing-to-one and positivity constraints on $u$ and $w$, we let the autoregressive conditioner output the pre-activations $u'$ and $w'$, and apply softmax to them. We do the same for $a$ by having the conditioner output $a'$ and applying softplus to ensure positivity. In Equation 20, we have

$$\begin{aligned}
\log w &= \mathrm{logsoftmax}(w') \\
\log u &= \mathrm{logsoftmax}(u') \\
\log \sigma(C) &= \mathrm{logsigmoid}(C) \\
\log\left(1 - \sigma(C)\right) &= \mathrm{logsigmoid}(-C)
\end{aligned}$$

where

$$\begin{aligned}
\mathrm{logsoftmax}(x) &= x - \mathrm{logsumexp}(x) \\
\mathrm{logsigmoid}(x) &= -\mathrm{softplus}(-x) \\
\mathrm{logsumexp}(x) &= \log\Big(\sum_i \exp(x_i - \bar{x})\Big) + \bar{x} \\
\mathrm{softplus}(x) &= \log\left(1 + \exp(x)\right) + \delta
\end{aligned}$$

where $\bar{x} = \max_i\{x_i\}$ and $\delta$ is a small value such as $10^{-6}$.

C.2.2. LOGARITHMIC DOT PRODUCT

In both Equations 19 and 20 we encounter matrix/tensor products, which are achieved by summing over one dimension after doing element-wise multiplication. Let $\tilde{M}_1 = \log M_1$ and $\tilde{M}_2 = \log M_2$ be $d_0 \times d_1$ and $d_1 \times d_2$, respectively. The logarithmic matrix dot product can be written as:

$$\tilde{M}_1 \circledast \tilde{M}_2 = \mathrm{logsumexp}_{\dim=1}\left(\left[\tilde{M}_1\right]_{[:,:,\bullet]} + \left[\tilde{M}_2\right]_{[\bullet,:,:]}\right)$$

where the subscript of logsumexp indicates the dimension (with index starting from 0) over which the elements are to be summed up. Note that $\tilde{M}_1 \circledast \tilde{M}_2 = \log\left(M_1 M_2\right)$.
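The logarithmic dot product above maps directly onto a logsumexp reduction over the shared dimension. Below is a minimal PyTorch sketch (an illustration under assumed shapes, not the released implementation) that checks the identity $\tilde{M}_1 \circledast \tilde{M}_2 = \log(M_1 M_2)$.

```python
import torch

def log_matmul(log_m1, log_m2):
    """Log-domain matrix product: returns log(exp(log_m1) @ exp(log_m2)).

    log_m1 has shape (d0, d1) and log_m2 has shape (d1, d2); the shared
    dimension d1 is reduced with logsumexp instead of a plain sum, mirroring
    the logarithmic dot product of Section C.2.2.
    """
    # Broadcast to (d0, d1, d2), then reduce the middle (shared) dimension.
    return torch.logsumexp(log_m1.unsqueeze(-1) + log_m2.unsqueeze(0), dim=1)

# Sanity check of the identity log_matmul(log M1, log M2) == log(M1 @ M2).
m1 = torch.rand(3, 4, dtype=torch.float64) + 0.1
m2 = torch.rand(4, 5, dtype=torch.float64) + 0.1
assert torch.allclose(log_matmul(m1.log(), m2.log()), (m1 @ m2).log())
```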

D. Scalability and Parameter Sharing

As discussed in Section 3.3, a multi-layer NAF such as DDSF requires the autoregressive conditioner $c$ to output many pseudo-parameters, on the order of $O(L d^2)$, where $L$ is the number of layers of the transformer network $\tau$ and $d$ is the average number of hidden units per layer. In practice, we reduce the number of outputs (and thus the computation and memory requirements) of DDSF by instead endowing $\tau$ with some learned non-conditional statistical parameters. Specifically, we decompose $w$ and $u$ (the pre-activations of $\tau$'s weights; see Section C.2.1) into pseudo-parameters and statistical parameters. Take $u$ for example:

$$u^{l+1} = \mathrm{softmax}_{\dim=1}\left(v^{l+1} + \eta^{l+1}_{[\bullet,:]}\right)$$

where $v^{l+1}$ is a $d_{l+1} \times d_{l}$ matrix of statistical parameters and $\eta$ is output by $c$. See Figure 9 for a depiction.

Figure 9. Factorized weight of DDSF. The $v_{i,j}$ are learned parameters of the model; only the pseudo-parameters $\eta$ and $a$ are output by the conditioner. The activation function $f$ is softmax, so adding $\eta$ yields an element-wise rescaling of the inputs to this layer of the transformer by $\exp(\eta)$.

The linear transformation before applying the sigmoid resembles conditional weight normalization (CWN; Krueger et al., 2017). While CWN rescales weight vectors normalized to have unit L2 norm, here we rescale the weight vector normalized by softmax such that it sums to one and is positive. We call this conditional normalized weight exponentiation. This reduces the number of pseudo-parameters to $O(L d)$.

E. Identity Flow Initialization

In many cases, initializing the transformation to have a minimal effect is believed to help with training, as it can be thought of as a warm start with a simpler distribution. For instance, for variational inference, when we initialize the normalizing flow to be an identity flow, the approximate posterior is at least as good as the input distribution (usually a fully factorized Gaussian) before the transformation. To this end, for DSF and DDSF we initialize the pseudo-weights $a$ to be close to 1 and the pseudo-biases $b$ to be close to 0. This is achieved by initializing the conditioner (whose outputs are the pseudo-parameters) to have small weights and the appropriate output biases. Specifically, we initialize the output biases of the last layer of our MADE (Germain et al., 2015) conditioner to be zero, and add $\mathrm{softplus}^{-1}(1) \approx 0.5413$ to the outputs which correspond to $a$, before applying the softplus activation function. We initialize all of the conditioner's weights by sampling from $\mathrm{Unif}(-0.001, 0.001)$. We note that there might be better ways to initialize the weights to account for the different numbers of active incoming units.
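The initialization just described can be sketched as follows (a minimal illustration; the plain nn.Linear head, its size, and the packing of the outputs into pre-activations of $a$ and biases $b$ are assumptions standing in for the last layer of the MADE conditioner).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the identity-flow initialization of Section E, using a
# plain nn.Linear as a stand-in for the last (output) layer of the conditioner.
inv_softplus_one = math.log(math.e - 1.0)  # softplus^{-1}(1) ~= 0.5413

def init_identity_flow(output_layer: nn.Linear):
    nn.init.uniform_(output_layer.weight, -0.001, 0.001)  # small weights
    nn.init.zeros_(output_layer.bias)                      # zero output biases

d = 16
head = nn.Linear(64, 2 * d)  # emits (pre-activation of a, b) for d sigmoid units
init_identity_flow(head)

h = torch.randn(5, 64)
pre_a, b = head(h).chunk(2, dim=-1)
a = F.softplus(pre_a + inv_softplus_one)  # the +0.5413 offset puts a near 1
print(float(a.mean()), float(b.abs().max()))  # ~1 and ~0: a near-identity flow
```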

F. Lemmas: Uniform Convergence of DSF

We want to show the convergence result of Equation 11. To this end, we first show that DSF can be used to universally approximate any strictly monotonic function. This is the case where $x_{1:t-1}$ are fixed, which means $C(x_{1:t-1})$ are simply constants. We demonstrate it using the following two lemmas.

Lemma 1 (Step functions universally approximate monotonic functions). Define:

$$S^*_n(x) = \sum_{j=1}^{n} w_j\, s(x - b_j)$$

where $s(z)$ is defined as a step function that is 1 when $z \ge 0$ and 0 otherwise. For any continuous, strictly monotonically increasing $S: [r_0, r_1] \to [0, 1]$ with $S(r_0) = 0$, $S(r_1) = 1$ and $r_0, r_1 \in \mathbb{R}$, and given any $\epsilon > 0$, there exist a positive integer $n$ and real constants $w_j$ and $b_j$ for $j = 1, \dots, n$, where $\sum_{j=1}^{n} w_j = 1$, $w_j > 0$ and $b_j \in [r_0, r_1]$ for all $j$, such that $|S^*_n(x) - S(x)| < \epsilon$ for all $x \in [r_0, r_1]$.

Proof. (Lemma 1) For brevity, we write $s_j(x) = s(x - b_j)$. For any $\epsilon > 0$, we choose $n = \lceil 1/\epsilon \rceil$ and divide the range $(0, 1)$ into $n + 1$ evenly spaced intervals: $(0, y_1), (y_1, y_2), \dots, (y_n, 1)$. For each $y_j$ there is a corresponding inverse value, since $S$ is strictly monotonic: $x_j = S^{-1}(y_j)$. We want to set $S^*_n(x_j) = y_j$ for $1 \le j \le n - 1$ and $S^*_n(x_n) = 1$. To do so, we set the bias terms $b_j$ to be $x_j$. Then we just need to solve a system of $n$ linear equations $\sum_{j'=1}^{n} w_{j'}\, s_{j'}(x_j) = t_j$, where $t_j = y_j$ for $1 \le j < n$, $t_0 = 0$ and $t_n = 1$. We can express this system of equations in matrix form as $\mathbf{S}\mathbf{w} = \mathbf{t}$, where:

$$\mathbf{S}_{j,j'} = s_{j'}(x_j) = \delta_{x_j \ge b_{j'}} = \delta_{j \ge j'}, \qquad \mathbf{w}_j = w_j, \qquad \mathbf{t}_j = t_j$$

where $\delta_{\omega \ge \eta} = 1$ whenever $\omega \ge \eta$ and $\delta_{\omega \ge \eta} = 0$ otherwise. Then we have $\mathbf{w} = \mathbf{S}^{-1}\mathbf{t}$. Note that $\mathbf{S}$ is a lower triangular matrix, and its inverse takes the form of a Jordan matrix: $(\mathbf{S}^{-1})_{i,i} = 1$ and $(\mathbf{S}^{-1})_{i+1,i} = -1$. Additionally, $t_j - t_{j-1} = \frac{1}{n+1}$ for $j = 1, \dots, n-1$, and is equal to $\frac{2}{n+1}$ for $j = n$. We then have $S^*_n(x) = \mathbf{t}^{\top}\mathbf{S}^{-\top}\mathbf{s}(x)$, where $\mathbf{s}(x)_j = s_j(x)$; thus

$$\begin{aligned}
\left|S^*_n(x) - S(x)\right|
&= \Big|\sum_{j=1}^{n} s_j(x)\,(t_j - t_{j-1}) - S(x)\Big| \\
&= \Big|\frac{\sum_{j=1}^{n-1} s_j(x) + 2\,s_n(x)}{n+1} - S(x)\Big| \\
&= \Big|\frac{C_{y_{1:n-1}}(S(x)) + 2\,\delta_{S(x) \ge y_n}}{n+1} - S(x)\Big|
\le \frac{1}{n+1} < \frac{1}{\lceil 1/\epsilon \rceil} \le \epsilon
\end{aligned} \qquad (21)$$

where $C_v(z) = \sum_k \delta_{z \ge v_k}$ is the count of elements of a vector $v$ that $z$ is no smaller than. Note that the additional constraint that $w$ lies on an $(n-1)$-dimensional simplex is always satisfied, because

$$\sum_j w_j = \sum_j (t_j - t_{j-1}) = t_n - t_0 = 1$$

See Figure 10 for a visual illustration of the proof.

Figure 10. Visualization of how sigmoidal functions can universally approximate a monotonic function in $[0, 1]$. The red dotted curve is the target monotonic function $S$, and the blue solid curve is the intermediate superposition of step functions $S^*_n$ with the parameters chosen in the proof. In this case, $n$ is 6 and $\|S^*_n - S\|_\infty \le \frac{1}{7}$. The green dashed curve is the resulting superposition of sigmoids $S_n$, which intercepts each step of $S^*_n$ with the additionally chosen $\tau$.

Using this result, we now turn to the case of using sigmoid functions instead of step functions.

Lemma 2 (Superimposed sigmoids universally approximate monotonic functions). Define:

$$S_n(x) = \sum_{j=1}^{n} w_j\, \sigma\!\left(\frac{x - b_j}{\tau_j}\right)$$

With the same constraints and definitions as in Lemma 1, given any $\epsilon > 0$, there exist a positive integer $n$ and real constants $w_j$, $\tau_j$ and $b_j$ for $j = 1, \dots, n$, where additionally the $\tau_j$ are bounded and positive, such that $|S_n(x) - S(x)| < \epsilon$ for all $x \in (r_0, r_1)$.

Proof. (Lemma 2) Let $\epsilon_1 = \frac{1}{3}\epsilon$ and $\epsilon_2 = \frac{2}{3}\epsilon$. We know that for this $\epsilon_1$ there exists an $n$ such that $\|S^*_n - S\|_\infty < \epsilon_1$. We choose the same $w_j, b_j$ for $j = 1, \dots, n$ as the ones used in the proof of Lemma 1, and let $\tau_1, \dots, \tau_n$ all be the same value, denoted by $\tau$. Take $\kappa = \min_{j \ne j'} |b_j - b_{j'}|$ and $\tau = \frac{\kappa}{\sigma^{-1}(1 - \epsilon_0)}$ for some $\epsilon_0 > 0$. Take $\Gamma$ to be a lower triangular matrix with values of 0.5 on the diagonal and 1 below the diagonal. Then

$$\begin{aligned}
\max_{j=1,\dots,n} \left|S_n(b_j) - (\Gamma \mathbf{w})_j\right|
&= \max_j \Big|\sum_{j'} w_{j'}\, \sigma\!\left(\frac{b_j - b_{j'}}{\tau}\right) - \sum_{j'} \Gamma_{jj'}\, w_{j'}\Big| \\
&= \max_j \Big|\sum_{j'} \Big(\sigma\!\left(\frac{b_j - b_{j'}}{\tau}\right) - \Gamma_{jj'}\Big)\, w_{j'}\Big|
< \max_j \sum_{j'} w_{j'}\, \epsilon_0 = \epsilon_0
\end{aligned}$$

The inequality is due to

$$\sigma\!\left(\frac{b_j - b_{j'}}{\tau}\right)
= \sigma\!\left(\frac{(b_j - b_{j'})\,\sigma^{-1}(1 - \epsilon_0)}{\min_{k \ne k'} |b_k - b_{k'}|}\right)
\begin{cases}
= 0.5 & \text{if } j = j' \\
\ge 1 - \epsilon_0 & \text{if } j > j' \\
\le \epsilon_0 & \text{if } j < j'
\end{cases}$$

Since the product $\Gamma \mathbf{w}$ represents the half-step points of $S^*_n$ at $x = b_j$, the result above entails $|S_n(x) - S^*_n(x)| < \epsilon_2 = 2\epsilon_1$ for all $x$. To see this, we choose $\epsilon_0 = \frac{1}{2(n+1)}$. Then $S_n$ intercepts all segments of $S^*_n$ except for the ends. We choose $\epsilon_2 = 2\epsilon_1$ since the last step of $S^*_n$ is of size $\frac{2}{n+1}$, and thus the bound also holds true in the vicinity where $S_n$ intercepts the last step of $S^*_n$. Hence,

$$|S_n(x) - S(x)| \le |S_n(x) - S^*_n(x)| + |S^*_n(x) - S(x)| < \epsilon_1 + \epsilon_2 = \epsilon$$
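The constructions in Lemmas 1 and 2 are easy to check numerically. The sketch below (an independent illustration, not the paper's code; the particular target $S$ and grid are assumptions) places the biases at the quantiles $S^{-1}(y_j)$ as in the proofs and measures the sup-norm error of the step and sigmoid superpositions.

```python
import numpy as np

# Numerical illustration of Lemmas 1 and 2: approximate a strictly monotonic
# target S on [r0, r1] with a convex combination of steps (Lemma 1) and of
# sharp sigmoids (Lemma 2) whose biases are the quantiles chosen in the proofs.
r0, r1 = -3.0, 3.0

def S(x):  # continuous, strictly increasing, S(r0) = 0, S(r1) = 1
    return (np.tanh(x) - np.tanh(r0)) / (np.tanh(r1) - np.tanh(r0))

eps = 0.05
n = int(np.ceil(1.0 / eps))
y = np.arange(1, n + 1) / (n + 1.0)                            # levels y_1 .. y_n
b = np.arctanh(np.tanh(r0) + y * (np.tanh(r1) - np.tanh(r0)))  # b_j = S^{-1}(y_j)
t = np.append(y[:n - 1], 1.0)                                  # t_j = y_j (j < n), t_n = 1
w = np.diff(np.concatenate(([0.0], t)))                        # w_j = t_j - t_{j-1}, sums to 1

x = np.linspace(r0, r1, 2001)
step = (w * (x[:, None] >= b)).sum(axis=1)                     # S*_n of Lemma 1
tau = 1e-3                                                     # small, bounded, positive
z = np.clip((x[:, None] - b) / tau, -50.0, 50.0)               # avoid overflow in exp
sigm = (w / (1.0 + np.exp(-z))).sum(axis=1)                    # S_n of Lemma 2

print(np.abs(step - S(x)).max(), np.abs(sigm - S(x)).max())    # both below eps
```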

Now we show that the pre-logit DSF (Equation 11) can universally approximate monotonic functions. We do this by showing that the well-known universal function approximation properties of neural networks (Cybenko, 1989) allow us to produce parameters which are sufficiently close to those required by Lemma 2.

Lemma 3. Let $x_{1:m} \in [r_0, r_1]^m$ where $r_0, r_1 \in \mathbb{R}$. Given any $\epsilon > 0$ and any multivariate continuously differentiable function [11] $S(x_{1:m})_t = S_t(x_t, x_{1:t-1})$ for $t \in [1, m]$ that is strictly monotonic with respect to the first argument when the second argument is fixed, and whose boundary values are $S_t(r_0, x_{1:t-1}) = 0$ and $S_t(r_1, x_{1:t-1}) = 1$ for all $x_{1:t-1}$ and $t$, there exists a multivariate function $\hat{S}$ such that $\|\hat{S}(x_{1:m}) - S(x_{1:m})\|_\infty < \epsilon$ for all $x_{1:m}$, of the following form:

$$\hat{S}(x_{1:m})_t = \hat{S}_t\big(x_t, C_t(x_{1:t-1})\big) = \sum_{j=1}^{n} w_{tj}(x_{1:t-1})\, \sigma\!\left(\frac{x_t - b_{tj}(x_{1:t-1})}{\tau_{tj}(x_{1:t-1})}\right)$$

where $t \in [1, m]$ and $C_t = \{w_{tj}, b_{tj}, \tau_{tj}\}_{j=1}^{n}$ are functions of $x_{1:t-1}$ parameterized by neural networks, with $\tau_{tj}$ bounded and positive, $b_{tj} \in [r_0, r_1]$, $\sum_{j=1}^{n} w_{tj} = 1$, and $w_{tj} > 0$.

Proof. (Lemma 3) First we deal with the univariate case for any $t$ and drop the subscript $t$ of the functions. We write $S_n$ and $C_k$ to denote the sequences of univariate functions. We want to show that for any $\epsilon > 0$ there exist (1) a sequence of functions $S_n(x_t, C_k(x_{1:t-1}))$ in the given form, and (2) large enough $N$ and $K$ such that when $n \ge N$ and $k \ge K$, $|S_n(x_t, C_k(x_{1:t-1})) - S(x_t, x_{1:t-1})| \le \epsilon$ for all $x_{1:t} \in [r_0, r_1]^t$.

The idea is first to show that we can find a sequence of parameters, $C_n(x_{1:t-1})$, that yields a good approximation of the target function, $S(x_t, x_{1:t-1})$. We then show that these parameters can be arbitrarily well approximated by the outputs of a neural network, $C_k(x_{1:t-1})$, which in turn yields a good approximation of $S$.

From Lemma 2, we know that such a sequence $C_n(x_{1:t-1})$ exists, and furthermore that we can, for any $\epsilon$, and independently of $S$ and $x_{1:t-1}$, choose an $N$ large enough so that:

$$\left|S_n\big(x_t, C_n(x_{1:t-1})\big) - S(x_t, x_{1:t-1})\right| < \frac{\epsilon}{2} \qquad (22)$$

To see that we can further approximate a given $C_n(x_{1:t-1})$ well by $C_k(x_{1:t-1})$ [12], we apply the classic result of Cybenko (1989), which states that a multilayer perceptron can approximate any continuous function on a compact subset of $\mathbb{R}^m$. Note that the specification of $C_n(x_{1:t-1}) = \{w_{tj}, b_{tj}, \tau_{tj}\}_{j=1}^{n}$ in Lemma 2 depends on the quantiles of $S_t(\cdot, x_{1:t-1})$, as a function of $x_{1:t-1}$; since the quantiles are continuous functions of $x_{1:t-1}$, so is $C_n(x_{1:t-1})$, and the theorem applies.

Now, $S_n$ has bounded derivative with respect to $C$, and is thus uniformly continuous, so long as $\tau$ is greater than some positive constant, which is always the case for any fixed $C_n$, and thus can be guaranteed for $C_k$ as well for large enough $k$. Uniform continuity allows us to translate the convergence of $C_k \to C_n$ into convergence of $S_n(x_t, C_k(x_{1:t-1})) \to S_n(x_t, C_n(x_{1:t-1}))$, since for any $\epsilon$ there exists a $\delta > 0$ such that

$$\left\|C_k(x_{1:t-1}) - C_n(x_{1:t-1})\right\| < \delta \;\Longrightarrow\; \left|S_n\big(x_t, C_k(x_{1:t-1})\big) - S_n\big(x_t, C_n(x_{1:t-1})\big)\right| < \frac{\epsilon}{2} \qquad (23)$$

Combining this with Equation 22, we have, for all $x_t$ and $x_{1:t-1}$, and for all $n \ge N$ and $k \ge K$,

$$\begin{aligned}
\left|S_n\big(x_t, C_k(x_{1:t-1})\big) - S(x_t, x_{1:t-1})\right|
&\le \left|S_n\big(x_t, C_k(x_{1:t-1})\big) - S_n\big(x_t, C_n(x_{1:t-1})\big)\right| + \left|S_n\big(x_t, C_n(x_{1:t-1})\big) - S(x_t, x_{1:t-1})\right| \\
&< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon
\end{aligned}$$

Having proved the univariate case, we add back the subscript $t$ to denote the index of the function. From the above we know that, given any $\epsilon > 0$, for each $t$ there exist $N_t$ and a sequence of univariate functions $\hat{S}_{n,t}$ such that for all $n \ge N_t$, $|\hat{S}_{n,t} - S_t| < \epsilon$ for all $x_{1:t}$. Choosing $N = \max_{t \in [1,m]} N_t$, we have that there exists a sequence of multivariate functions $\hat{S}_n$ in the given form such that for all $n \ge N$, $\|\hat{S}_n - S\|_\infty < \epsilon$ for all $x_{1:m}$.

[11] $S: [r_0, r_1]^m \to [0, 1]^m$ is a multivariate-multivariable function, where $S_t$ is its $t$-th component, which is a univariate-multivariable function, written as $S_t(\cdot, \cdot): [r_0, r_1] \times [r_0, r_1]^{t-1} \to [0, 1]$.

[12] Note that $C_n$ is a chosen function (we are not assuming its parameterization, i.e. it is not necessarily a neural network) that we seek to approximate using $C_k$, which is the output of a neural network.
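To make the form in Lemma 3 concrete, here is a minimal sketch of one coordinate of the pre-logit DSF, with a small, hypothetical neural conditioner mapping $x_{1:t-1}$ to $(w, b, \tau)$ under the constraints of the lemma (softmax for $w$, a squashing into $[r_0, r_1]$ for $b$, and softplus plus a constant for $\tau$). The network, sizes, and parameterization are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

n, r0, r1 = 8, -3.0, 3.0  # number of sigmoid components and input range (illustrative)

class Conditioner(torch.nn.Module):
    """Hypothetical conditioner for a single coordinate t: maps x_{1:t-1} to
    (w, b, tau), enforcing the constraints of Lemma 3."""
    def __init__(self, t_minus_1):
        super().__init__()
        self.net = torch.nn.Linear(t_minus_1, 3 * n)

    def forward(self, x_prev):
        pre_w, pre_b, pre_tau = self.net(x_prev).chunk(3, dim=-1)
        w = F.softmax(pre_w, dim=-1)                 # positive, sums to one
        b = r0 + (r1 - r0) * torch.sigmoid(pre_b)    # b in [r0, r1]
        tau = F.softplus(pre_tau) + 1e-3             # bounded away from zero
        return w, b, tau

def dsf_t(x_t, w, b, tau):
    # S_t(x_t, C_t(x_{1:t-1})) = sum_j w_j * sigmoid((x_t - b_j) / tau_j)
    return (w * torch.sigmoid((x_t.unsqueeze(-1) - b) / tau)).sum(-1)

cond = Conditioner(t_minus_1=4)
x_prev, x_t = torch.randn(2, 4), torch.randn(2)
y_t = dsf_t(x_t, *cond(x_prev))  # values in (0, 1), strictly increasing in x_t
```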
G. Proof of Universal Approximation of DSF

Lemma 4. Let $X \in \mathcal{X}$ be a random variable, with $\mathcal{X} \subseteq \mathbb{R}^m$ and $\mathcal{Y} \subseteq \mathbb{R}^m$. Given any function $J: \mathcal{X} \to \mathcal{Y}$ and a sequence of functions $J_n$ that converges pointwise to $J$, the sequence of random variables induced by the transformations, $Y_n \doteq J_n(X)$, converges in distribution to $Y \doteq J(X)$.

Proof. (Lemma 4) Let $h$ be any bounded, continuous function on $\mathbb{R}^m$, so that $h \circ J_n$ converges pointwise to $h \circ J$ by continuity of $h$. Since $h$ is bounded, by the dominated convergence theorem $\mathbb{E}[h(Y_n)] = \mathbb{E}[h(J_n(X))]$ converges to $\mathbb{E}[h(J(X))] = \mathbb{E}[h(Y)]$. As this result holds for any bounded continuous function $h$, by the Portmanteau lemma we have $Y_n \xrightarrow{D} Y$.

Proof. (Proposition 2) Given an arbitrary ordering, let $F$ be the CDFs of $Y$, defined as $F_t(y_t, x_{1:t-1}) = \Pr(Y_t \le y_t \mid x_{1:t-1})$. According to Theorem 1 of Hyvärinen & Pajunen (1999), $F(Y)$ is uniformly distributed in the cube $[0, 1]^m$. $F$ has an upper triangular Jacobian matrix, whose diagonal entries are conditional densities, which are positive by assumption. Let $G$ be a multivariate and multivariable function where $G_t$ is the inverse of the CDF of $Y_t$: $G_t(F_t(y_t, x_{1:t-1}), x_{1:t-1}) = y_t$. According to Lemma 3, there exists a sequence of functions in the given form, $(\hat{S}_n)_{n \ge 1}$, that converges uniformly to $\sigma \circ G$. Since uniform convergence implies pointwise convergence, $G_n = \sigma^{-1} \circ \hat{S}_n$ converges pointwise to $G$, by continuity of $\sigma^{-1}$. Since $G_n$ converges pointwise to $G$ and $G(X) = Y$, by Lemma 4 we have $Y_n \xrightarrow{D} Y$.

Proof. (Proposition 3) Given an arbitrary ordering, let $H$ be the CDFs of $X$:

$$\begin{aligned}
y_1 &\doteq H_1(x_1) = F_1(x_1) = \Pr(X_1 \le x_1) \\
y_t &\doteq H_t(x_t, x_{1:t-1}) = F_t\big(x_t, \{H_{t'}(x_{t'}, x_{1:t'-1})\}_{t'=1}^{t-1}\big) = \Pr(X_t \le x_t \mid y_{1:t-1}) \quad \text{for } 2 \le t \le m
\end{aligned}$$

Due to Hyvärinen & Pajunen (1999), $y_1, \dots, y_m$ are independently and uniformly distributed in $(0, 1)^m$. According to Lemma 3, there exists a sequence of functions in the given form, $(\hat{S}_n)_{n \ge 1}$, that converges uniformly to $H$. Since $H_n = \hat{S}_n$ converges pointwise to $H$ and $H(X) = Y$, by Lemma 4 we have $Y_n \xrightarrow{D} Y$.

Proof. (Theorem 1) Given an arbitrary ordering, let $H$ be the CDFs of $X$, defined the same way as in the proof of Proposition 3, and let $G$ be the inverse of the CDFs of $Y$, defined the same way as in the proof of Proposition 2. Due to Hyvärinen & Pajunen (1999), $H(X)$ is uniformly distributed in $(0, 1)^m$, so $G(H(X)) = Y$. Since $H_t(x_t, x_{1:t-1})$ is monotonic with respect to $x_t$ given $x_{1:t-1}$, and $G_t(H_t, H_{1:t-1})$ is monotonic with respect to $H_t$ given $H_{1:t-1}$, $G_t$ is also monotonic with respect to $x_t$ given $x_{1:t-1}$, as

$$\frac{\partial\, G_t(H_t, H_{1:t-1})}{\partial x_t} = \frac{\partial\, G_t(H_t, H_{1:t-1})}{\partial H_t}\, \frac{\partial\, H_t(x_t, x_{1:t-1})}{\partial x_t}$$

is always positive. According to Lemma 3, there exists a sequence of functions in the given form, $(\hat{S}_n)_{n \ge 1}$, that converges uniformly to $\sigma \circ G \circ H$. Since uniform convergence implies pointwise convergence, $K_n = \sigma^{-1} \circ \hat{S}_n$ converges pointwise to $G \circ H$, by continuity of $\sigma^{-1}$. Since $K_n$ converges pointwise to $G \circ H$ and $G(H(X)) = Y$, by Lemma 4 we have $Y_n \xrightarrow{D} Y$.

H. Experimental Details

For the experiments on amortized variational inference, we implement the Variational Autoencoder (Kingma & Welling, 2014). Specifically, we follow the architecture used in Kingma et al. (2016): the encoder has three layers with [16, 32, 32] feature maps. We use resnet blocks (He et al., 2016) with 3×3 convolution filters and a stride of 2 to downsize the feature maps. The convolution layers are followed by a fully connected layer of size 450, which serves as a context for the flow layers that transform the noise sampled from a standard normal distribution of dimension 32. The decoder is symmetrical with the encoder, with the strided convolution replaced by a combination of bilinear upsampling and a regular resnet convolution to double the feature map size. We use the ELU activation function (Clevert et al., 2016) and weight normalization (Salimans & Kingma, 2016) in the encoder and decoder. In terms of optimization, Adam (Kingma & Ba, 2014) is used with the learning rate fine-tuned for each inference setting, and Polyak averaging (Polyak & Juditsky, 1992) is used for evaluation, with $\alpha = 0.998$, which stands for the proportion of the past at each time step. We also consider a variant of Adam known as Amsgrad (Reddi et al., 2018) as a hyperparameter.

For the vanilla VAE, we simply apply a resnet dot product with the context vector to output the mean and the pre-softplus standard deviation, and transform each dimension of the noise vector independently. We call this linear flow.
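The linear flow just described amounts to an element-wise affine transformation of the noise. The sketch below illustrates it with the standard deviation kept positive through softplus; the exact parameterization used in the experiments is an assumption here.

```python
import torch
import torch.nn.functional as F

def linear_flow(noise, mean, pre_std):
    # Element-wise affine transform of standard-normal noise; softplus keeps
    # the standard deviation positive. (A sketch of the parameterization, not
    # necessarily the exact one used in the experiments.)
    return mean + F.softplus(pre_std) * noise

# Illustrative sizes: a batch of 4 latent samples of dimension 32.
noise = torch.randn(4, 32)
mean, pre_std = torch.zeros(4, 32), torch.zeros(4, 32)
z = linear_flow(noise, mean, pre_std)
```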
For IAF-affine and IAF-DSF, we employ MADE (Germain et al., 2015) as the conditioner $c(x_{1:t-1})$, and we apply a dot product on the context vector to output a scale vector and a bias vector that conditionally rescale and shift the preactivation of each layer of the MADE. Each MADE has one hidden layer with 1920 hidden units. The IAF experiments all start with a linear flow layer, followed by IAF-affine or IAF-DSF transformations. For DSF, we choose $d = 16$.

For the experiment of density estimation with MAF, we followed the implementation of Papamakarios et al. (2017). Specifically, for each dataset we experimented with both 5 and 10 flow layers, followed by one linear flow layer. The following table specifies the number of hidden layers and the number of hidden units per hidden layer for MADE:

Table 3. Architecture specification of MADE in the MAF experiment: number of hidden layers and number of hidden units per hidden layer.

    Dataset      Hidden layers   Hidden units
    POWER        2               100
    GAS          2               100
    HEPMASS      2               512
    MINIBOONE    1               512
    BSDS300      2               1024
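For completeness, the context-conditioned rescaling and shifting of the MADE preactivations described above for IAF-affine/IAF-DSF could be sketched as follows (module names, the use of plain nn.Linear for the context projection, and the sizes are illustrative assumptions, not the released code).

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the modulation described above: the context vector is
# mapped by a plain dot product (nn.Linear) to a per-unit scale and bias that
# rescale and shift the preactivation of a MADE hidden layer.
context_dim, n_hidden = 450, 1920

to_scale = nn.Linear(context_dim, n_hidden)
to_bias = nn.Linear(context_dim, n_hidden)

def modulate(preactivation, context):
    # preactivation: (batch, n_hidden) output of the MADE masked linear layer
    # context:       (batch, context_dim) context vector from the encoder
    return to_scale(context) * preactivation + to_bias(context)

preact, ctx = torch.randn(4, n_hidden), torch.randn(4, context_dim)
h = modulate(preact, ctx)  # the MADE's own activation function would follow
```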