A. Exclusive KL View of the MLE

Let us assume a change-of-variable model $p_Z(z)$ on the random variable $Z \in \mathbb{R}^m$, such as the one used in Dinh et al. (2017): $z_0 \sim p_0(z_0)$ and $z = \psi(z_0)$, where $\psi$ is an invertible function and density evaluation of $z_0 \in \mathbb{R}^m$ is tractable under $p_0$. The resulting density function can be written

$$p_Z(z) = p_0(z_0)\left|\frac{\partial \psi(z_0)}{\partial z_0}\right|^{-1}$$

The maximum likelihood principle requires us to minimize the following metric:

$$\begin{aligned}
D_{\mathrm{KL}}\big(p_{\mathrm{data}}(z)\,\|\,p_Z(z)\big)
&= \mathbb{E}_{p_{\mathrm{data}}}\big[\log p_{\mathrm{data}}(z) - \log p_Z(z)\big] \\
&= \mathbb{E}_{p_{\mathrm{data}}}\left[\log p_{\mathrm{data}}(z) - \log\left(p_0(z_0)\left|\frac{\partial \psi^{-1}(z)}{\partial z}\right|\right)\right] \\
&= \mathbb{E}_{p_{\mathrm{data}}}\left[\log\left(p_{\mathrm{data}}(z)\left|\frac{\partial \psi^{-1}(z)}{\partial z}\right|^{-1}\right) - \log p_0(z_0)\right]
\end{aligned}$$

which coincides with the exclusive KL divergence; to see this, we take $X = Z$, $Y = Z_0$, $f = \psi^{-1}$, $p_X = p_{\mathrm{data}}$, and $p_{\mathrm{target}} = p_0$:

$$= \mathbb{E}_{p_X}\left[\log\left(p_X(x)\left|\frac{\partial f(x)}{\partial x}\right|^{-1}\right) - \log p_{\mathrm{target}}(y)\right]
= D_{\mathrm{KL}}\big(p_Y(y)\,\|\,p_{\mathrm{target}}(y)\big)$$

This means we want to transform the empirical distribution $p_{\mathrm{data}}$, or $p_X$, to fit the target density (the usually unstructured base distribution $p_0$), as explained in Section 2.

B. Monotonicity of NAF

Here we show that using strictly positive weights and strictly monotonically increasing activation functions is sufficient to ensure the strict monotonicity of NAF. For brevity, we write monotonic, or monotonicity, to refer to the strictly monotonically increasing behavior of a function. Also, note that a differentiable function whose derivative is everywhere greater than 0 is strictly monotonically increasing.

Proof. (Proposition 1) Suppose we have an MLP with $L+1$ layers, $h_0, h_1, h_2, \dots, h_L$, with $x = h_0$ and $y = h_L$, where $x$ and $y$ are scalar and $h_{l,j}$ denotes the $j$-th node of the $l$-th layer. For $1 \le l \le L$, we have

$$p_{l,j} = w_{l,j}^{\top} h_{l-1} + b_{l,j} \qquad (15)$$
$$h_{l,j} = A_l(p_{l,j}) \qquad (16)$$

for some monotonic activation function $A_l$, positive weight vector $w_{l,j}$, and bias $b_{l,j}$. Differentiating this yields

$$\frac{\partial h_{l,j}}{\partial h_{l-1,k}} = A_l'(p_{l,j})\, w_{l,j,k}, \qquad (17)$$

which is greater than 0, since the strictly increasing activation $A_l$ has positive derivative and $w_{l,j,k}$ is positive by construction. Thus any unit of layer $l$ is monotonic with respect to any unit of layer $l-1$, for $1 \le l \le L$.

Now, suppose we have $J$ monotonic functions $f_j$ of the input $x$. Then the weighted sum of these functions $\sum_{j=1}^{J} u_j f_j$ with $u_j > 0$ is also monotonic with respect to $x$, since

$$\frac{\partial}{\partial x} \sum_{j=1}^{J} u_j f_j = \sum_{j=1}^{J} u_j \frac{\partial f_j}{\partial x} > 0 \qquad (18)$$

Finally, we use induction to show that all $h_{l,j}$, including $y$, are monotonic with respect to $x$.

1. The base case ($l = 1$) is given by Equation 17.
2. Suppose the inductive hypothesis holds, which means $h_{l,j}$ is monotonic with respect to $x$ for all $j$ of layer $l$. Then by Equation 18, $h_{l+1,k}$ is also monotonic with respect to $x$ for all $k$.

Thus by mathematical induction, monotonicity of $h_{l,j}$ holds for all $l$ and $j$.
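As a sanity check of Proposition 1, the following minimal sketch (an illustration in PyTorch, not the paper's code; the layer sizes, random seed, and choice of sigmoid activations are arbitrary assumptions) builds an MLP with strictly positive weights and a strictly increasing activation, and verifies numerically that the resulting scalar map is strictly increasing.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical 3-layer MLP (scalar input, scalar output). Positive weights are
# obtained by exponentiating unconstrained parameters, and sigmoid is a strictly
# increasing activation, so Proposition 1 predicts a strictly increasing map.
dims = [(8, 1), (8, 8), (1, 8)]
Ws = [torch.randn(o, i, dtype=torch.float64) for o, i in dims]
bs = [torch.randn(o, dtype=torch.float64) for o, _ in dims]

def monotone_mlp(x):
    h = x
    for W, b in zip(Ws, bs):
        h = torch.sigmoid(F.linear(h, W.exp(), b))  # W.exp() > 0 elementwise
    return h

x = torch.linspace(-3.0, 3.0, 500, dtype=torch.float64).unsqueeze(1)
y = monotone_mlp(x).squeeze(1)
assert torch.all(y[1:] > y[:-1])  # strictly increasing along the input grid
```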

C. Log Determinant of Jacobian

As we mentioned at the end of Section 3.1, to compute the log-determinant of the Jacobian as part of the objective function, we need to handle numerical stability. We first derive the Jacobian of DDSF (note that DSF is a special case of DDSF), and then summarize the numerically stable operations that were utilized in this work.

C.1. Jacobian of DDSF

Again defining $x = h^0$ and $y = h^L$, the Jacobian of each DDSF transformation can be written as a sequence of dot products due to the chain rule:

$$\nabla_x y = \left[\nabla_{h^{L-1}} h^{L}\right]\left[\nabla_{h^{L-2}} h^{L-1}\right]\cdots\left[\nabla_{h^{0}} h^{1}\right] \qquad (19)$$

For notational convenience, we define a few more intermediate variables. For each layer of DDSF, we have

$$\begin{aligned}
C^{l+1} &= a^{l+1} \odot \left(u^{l+1} h^{l}\right) + b^{l+1} \\
D^{l+1} &= w^{l+1}\, \sigma\!\left(C^{l+1}\right) \\
h^{l+1} &= \sigma^{-1}\!\left(D^{l+1}\right)
\end{aligned}$$

The gradient can be expanded as

$$\nabla_{h^{l}} h^{l+1} =
\left[\nabla_{D^{l+1}} \sigma^{-1}\!\left(D^{l+1}\right)\right]_{[:,\bullet]} \odot
\left[\nabla_{\sigma(C^{l+1})} D^{l+1}\right] \odot
\left[\nabla_{C^{l+1}} \sigma\!\left(C^{l+1}\right)\right]_{[\bullet,:]} \odot
\left[\nabla_{u^{l+1}h^{l}} C^{l+1}\right]_{[\bullet,:]} \odot
\left[\nabla_{h^{l}}\!\left(u^{l+1} h^{l}\right)\right]_{[\bullet,:,:]} \cdot \mathbf{1}_{[:,:,\bullet]}$$

where the bullet in the subscript indicates the dimension is broadcast, $\odot$ denotes element-wise multiplication, and $\cdot\,\mathbf{1}$ denotes summation over the last dimension after the element-wise product:

$$\nabla_{h^{l}} h^{l+1} =
\left[\frac{1}{D^{l+1}\left(1 - D^{l+1}\right)}\right]_{[:,\bullet]} \odot
w^{l+1} \odot
\left[\sigma\!\left(C^{l+1}\right)\left(1 - \sigma\!\left(C^{l+1}\right)\right) \odot a^{l+1}\right]_{[\bullet,:]} \odot
\left[u^{l+1}\right]_{[\bullet,:,:]} \cdot \mathbf{1}_{[:,:,\bullet]} \qquad (20)$$

Written with explicit indices, Equation 20 reads

$$\left[\nabla_{h^{l}} h^{l+1}\right]_{ik} = \frac{1}{D^{l+1}_{i}\left(1 - D^{l+1}_{i}\right)} \sum_{j} w^{l+1}_{ij}\,\sigma\!\left(C^{l+1}_{j}\right)\left(1 - \sigma\!\left(C^{l+1}_{j}\right)\right) a^{l+1}_{j}\, u^{l+1}_{jk}$$

C.2. Numerically Stable Operations

Since the Jacobian of each DDSF transformation is a chain of dot products (Equation 19) with some nested multiplicative operations (Equation 20), we calculate everything in the log scale, where multiplication is replaced with addition, to avoid an unstable gradient signal.

C.2.1. LOG ACTIVATION

To ensure the summing-to-one and positivity constraints on $u$ and $w$, we let the autoregressive conditioner output the pre-activations $u'$ and $w'$, and apply softmax to them. We do the same for $a$ by having the conditioner output $a'$ and applying softplus to ensure positivity. In Equation 20, we have

$$\begin{aligned}
\log w &= \mathrm{logsoftmax}(w') \\
\log u &= \mathrm{logsoftmax}(u') \\
\log \sigma(C) &= \mathrm{logsigmoid}(C) \\
\log\left(1 - \sigma(C)\right) &= \mathrm{logsigmoid}(-C)
\end{aligned}$$

where

$$\begin{aligned}
\mathrm{logsoftmax}(x) &= x - \mathrm{logsumexp}(x) \\
\mathrm{logsigmoid}(x) &= -\mathrm{softplus}(-x) \\
\mathrm{logsumexp}(x) &= \log\Big(\sum_i \exp(x_i - \bar{x})\Big) + \bar{x} \\
\mathrm{softplus}(x) &= \log\left(1 + \exp(x)\right) + \delta
\end{aligned}$$

where $\bar{x} = \max_i\{x_i\}$ and $\delta$ is a small value such as $10^{-6}$.

C.2.2. LOGARITHMIC DOT PRODUCT

In both Equations 19 and 20 we encounter matrix/tensor products, which are achieved by summing over one dimension after doing element-wise multiplication. Let $\tilde{M}_1 = \log M_1$ and $\tilde{M}_2 = \log M_2$ be $d_0 \times d_1$ and $d_1 \times d_2$, respectively. The logarithmic matrix dot product can be written as:

$$\tilde{M}_1 \circledast \tilde{M}_2 = \mathrm{logsumexp}_{\dim=1}\left(\left[\tilde{M}_1\right]_{[:,:,\bullet]} + \left[\tilde{M}_2\right]_{[\bullet,:,:]}\right)$$

where the subscript of logsumexp indicates the dimension (with index starting from 0) over which the elements are to be summed up. Note that $\tilde{M}_1 \circledast \tilde{M}_2 = \log\left(M_1 M_2\right)$.
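The logarithmic dot product above maps directly onto a logsumexp reduction over the shared dimension. Below is a minimal PyTorch sketch (an illustration under assumed shapes, not the released implementation) that checks the identity $\tilde{M}_1 \circledast \tilde{M}_2 = \log(M_1 M_2)$.

```python
import torch

def log_matmul(log_m1, log_m2):
    """Log-domain matrix product: returns log(exp(log_m1) @ exp(log_m2)).

    log_m1 has shape (d0, d1) and log_m2 has shape (d1, d2); the shared
    dimension d1 is reduced with logsumexp instead of a plain sum, mirroring
    the logarithmic dot product of Section C.2.2.
    """
    # Broadcast to (d0, d1, d2), then reduce the middle (shared) dimension.
    return torch.logsumexp(log_m1.unsqueeze(-1) + log_m2.unsqueeze(0), dim=1)

# Sanity check of the identity log_matmul(log M1, log M2) == log(M1 @ M2).
m1 = torch.rand(3, 4, dtype=torch.float64) + 0.1
m2 = torch.rand(4, 5, dtype=torch.float64) + 0.1
assert torch.allclose(log_matmul(m1.log(), m2.log()), (m1 @ m2).log())
```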

D. Scalability and Parameter Sharing

As discussed in Section 3.3, a multi-layer NAF such as DDSF requires the autoregressive conditioner $c$ to output many pseudo-parameters, on the order of $O(L d^2)$, where $L$ is the number of layers of the transformer network $\tau$ and $d$ is the average number of hidden units per layer. In practice, we reduce the number of outputs (and thus the computation and memory requirements) of DDSF by instead endowing $\tau$ with some learned non-conditional statistical parameters. Specifically, we decompose $w$ and $u$ (the pre-activations of $\tau$'s weights; see Section C.2.1) into pseudo-parameters and statistical parameters. Take $u$ for example:

$$u^{l+1} = \mathrm{softmax}_{\dim=1}\left(v^{l+1} + \eta^{l+1}_{[\bullet,:]}\right)$$

where $v^{l+1}$ is a $d_{l+1} \times d_{l}$ matrix of statistical parameters and $\eta$ is output by $c$. See Figure 9 for a depiction.

Figure 9. Factorized weight of DDSF. The $v_{i,j}$ are learned parameters of the model; only the pseudo-parameters $\eta$ and $a$ are output by the conditioner. The activation function $f$ is softmax, so adding $\eta$ yields an element-wise rescaling of the inputs to this layer of the transformer by $\exp(\eta)$.

The linear transformation before applying the sigmoid resembles conditional weight normalization (CWN; Krueger et al., 2017). While CWN rescales weight vectors normalized to have unit L2 norm, here we rescale the weight vector normalized by softmax such that it sums to one and is positive. We call this conditional normalized weight exponentiation. This reduces the number of pseudo-parameters to $O(L d)$.

E. Identity Flow Initialization

In many cases, initializing the transformation to have a minimal effect is believed to help with training, as it can be thought of as a warm start with a simpler distribution. For instance, for variational inference, when we initialize the normalizing flow to be an identity flow, the approximate posterior is at least as good as the input distribution (usually a fully factorized Gaussian) before the transformation. To this end, for DSF and DDSF we initialize the pseudo-weights $a$ to be close to 1 and the pseudo-biases $b$ to be close to 0. This is achieved by initializing the conditioner (whose outputs are the pseudo-parameters) to have small weights and the appropriate output biases. Specifically, we initialize the output biases of the last layer of our MADE (Germain et al., 2015) conditioner to be zero, and add $\mathrm{softplus}^{-1}(1) \approx 0.5413$ to the outputs which correspond to $a$, before applying the softplus activation function. We initialize all of the conditioner's weights by sampling from $\mathrm{Unif}(-0.001, 0.001)$. We note that there might be better ways to initialize the weights to account for the different numbers of active incoming units.
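The initialization just described can be sketched as follows (a minimal illustration; the plain nn.Linear head, its size, and the packing of the outputs into pre-activations of $a$ and biases $b$ are assumptions standing in for the last layer of the MADE conditioner).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the identity-flow initialization of Section E, using a
# plain nn.Linear as a stand-in for the last (output) layer of the conditioner.
inv_softplus_one = math.log(math.e - 1.0)  # softplus^{-1}(1) ~= 0.5413

def init_identity_flow(output_layer: nn.Linear):
    nn.init.uniform_(output_layer.weight, -0.001, 0.001)  # small weights
    nn.init.zeros_(output_layer.bias)                      # zero output biases

d = 16
head = nn.Linear(64, 2 * d)  # emits (pre-activation of a, b) for d sigmoid units
init_identity_flow(head)

h = torch.randn(5, 64)
pre_a, b = head(h).chunk(2, dim=-1)
a = F.softplus(pre_a + inv_softplus_one)  # the +0.5413 offset puts a near 1
print(float(a.mean()), float(b.abs().max()))  # ~1 and ~0: a near-identity flow
```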

F. Lemmas: Uniform Convergence of DSF

We want to show the convergence result of Equation 11. To this end, we first show that DSF can be used to universally approximate any strictly monotonic function. This is the case where $x_{1:t-1}$ are fixed, which means $C(x_{1:t-1})$ are simply constants. We demonstrate it using the following two lemmas.

Lemma 1 (Step functions universally approximate monotonic functions). Define:

$$S^*_n(x) = \sum_{j=1}^{n} w_j\, s(x - b_j)$$

where $s(z)$ is defined as a step function that is 1 when $z \ge 0$ and 0 otherwise. For any continuous, strictly monotonically increasing $S: [r_0, r_1] \to [0, 1]$ with $S(r_0) = 0$, $S(r_1) = 1$ and $r_0, r_1 \in \mathbb{R}$, and given any $\epsilon > 0$, there exist a positive integer $n$ and real constants $w_j$ and $b_j$ for $j = 1, \dots, n$, where $\sum_{j=1}^{n} w_j = 1$, $w_j > 0$ and $b_j \in [r_0, r_1]$ for all $j$, such that $|S^*_n(x) - S(x)| < \epsilon$ for all $x \in [r_0, r_1]$.

Proof. (Lemma 1) For brevity, we write $s_j(x) = s(x - b_j)$. For any $\epsilon > 0$, we choose $n = \lceil 1/\epsilon \rceil$ and divide the range $(0, 1)$ into $n + 1$ evenly spaced intervals: $(0, y_1), (y_1, y_2), \dots, (y_n, 1)$. For each $y_j$ there is a corresponding inverse value, since $S$ is strictly monotonic: $x_j = S^{-1}(y_j)$. We want to set $S^*_n(x_j) = y_j$ for $1 \le j \le n - 1$ and $S^*_n(x_n) = 1$. To do so, we set the bias terms $b_j$ to be $x_j$. Then we just need to solve a system of $n$ linear equations $\sum_{j'=1}^{n} w_{j'}\, s_{j'}(x_j) = t_j$, where $t_j = y_j$ for $1 \le j < n$, $t_0 = 0$ and $t_n = 1$. We can express this system of equations in matrix form as $\mathbf{S}\mathbf{w} = \mathbf{t}$, where:

$$\mathbf{S}_{j,j'} = s_{j'}(x_j) = \delta_{x_j \ge b_{j'}} = \delta_{j \ge j'}, \qquad \mathbf{w}_j = w_j, \qquad \mathbf{t}_j = t_j$$

where $\delta_{\omega \ge \eta} = 1$ whenever $\omega \ge \eta$ and $\delta_{\omega \ge \eta} = 0$ otherwise. Then we have $\mathbf{w} = \mathbf{S}^{-1}\mathbf{t}$. Note that $\mathbf{S}$ is a lower triangular matrix, and its inverse takes the form of a Jordan matrix: $(\mathbf{S}^{-1})_{i,i} = 1$ and $(\mathbf{S}^{-1})_{i+1,i} = -1$. Additionally, $t_j - t_{j-1} = \frac{1}{n+1}$ for $j = 1, \dots, n-1$, and is equal to $\frac{2}{n+1}$ for $j = n$. We then have $S^*_n(x) = \mathbf{t}^{\top}\mathbf{S}^{-\top}\mathbf{s}(x)$, where $\mathbf{s}(x)_j = s_j(x)$; thus

$$\begin{aligned}
\left|S^*_n(x) - S(x)\right|
&= \Big|\sum_{j=1}^{n} s_j(x)\,(t_j - t_{j-1}) - S(x)\Big| \\
&= \Big|\frac{\sum_{j=1}^{n-1} s_j(x) + 2\,s_n(x)}{n+1} - S(x)\Big| \\
&= \Big|\frac{C_{y_{1:n-1}}(S(x)) + 2\,\delta_{S(x) \ge y_n}}{n+1} - S(x)\Big|
\le \frac{1}{n+1} < \frac{1}{\lceil 1/\epsilon \rceil} \le \epsilon
\end{aligned} \qquad (21)$$

where $C_v(z) = \sum_k \delta_{z \ge v_k}$ is the count of elements of a vector $v$ that $z$ is no smaller than. Note that the additional constraint that $w$ lies on an $(n-1)$-dimensional simplex is always satisfied, because

$$\sum_j w_j = \sum_j (t_j - t_{j-1}) = t_n - t_0 = 1$$

See Figure 10 for a visual illustration of the proof.

Figure 10. Visualization of how sigmoidal functions can universally approximate a monotonic function in $[0, 1]$. The red dotted curve is the target monotonic function $S$, and the blue solid curve is the intermediate superposition of step functions $S^*_n$ with the parameters chosen in the proof. In this case, $n$ is 6 and $\|S^*_n - S\|_\infty \le \frac{1}{7}$. The green dashed curve is the resulting superposition of sigmoids $S_n$, which intercepts each step of $S^*_n$ with the additionally chosen $\tau$.

Using this result, we now turn to the case of using sigmoid functions instead of step functions.

Lemma 2 (Superimposed sigmoids universally approximate monotonic functions). Define:

$$S_n(x) = \sum_{j=1}^{n} w_j\, \sigma\!\left(\frac{x - b_j}{\tau_j}\right)$$

With the same constraints and definitions as in Lemma 1, given any $\epsilon > 0$, there exist a positive integer $n$ and real constants $w_j$, $\tau_j$ and $b_j$ for $j = 1, \dots, n$, where additionally the $\tau_j$ are bounded and positive, such that $|S_n(x) - S(x)| < \epsilon$ for all $x \in (r_0, r_1)$.

Proof. (Lemma 2) Let $\epsilon_1 = \frac{1}{3}\epsilon$ and $\epsilon_2 = \frac{2}{3}\epsilon$. We know that for this $\epsilon_1$ there exists an $n$ such that $\|S^*_n - S\|_\infty < \epsilon_1$. We choose the same $w_j, b_j$ for $j = 1, \dots, n$ as the ones used in the proof of Lemma 1, and let $\tau_1, \dots, \tau_n$ all be the same value, denoted by $\tau$. Take $\kappa = \min_{j \ne j'} |b_j - b_{j'}|$ and $\tau = \frac{\kappa}{\sigma^{-1}(1 - \epsilon_0)}$ for some $\epsilon_0 > 0$. Take $\Gamma$ to be a lower triangular matrix with values of 0.5 on the diagonal and 1 below the diagonal. Then

$$\begin{aligned}
\max_{j=1,\dots,n} \left|S_n(b_j) - (\Gamma \mathbf{w})_j\right|
&= \max_j \Big|\sum_{j'} w_{j'}\, \sigma\!\left(\frac{b_j - b_{j'}}{\tau}\right) - \sum_{j'} \Gamma_{jj'}\, w_{j'}\Big| \\
&= \max_j \Big|\sum_{j'} \Big(\sigma\!\left(\frac{b_j - b_{j'}}{\tau}\right) - \Gamma_{jj'}\Big)\, w_{j'}\Big|
< \max_j \sum_{j'} w_{j'}\, \epsilon_0 = \epsilon_0
\end{aligned}$$

The inequality is due to

$$\sigma\!\left(\frac{b_j - b_{j'}}{\tau}\right)
= \sigma\!\left(\frac{(b_j - b_{j'})\,\sigma^{-1}(1 - \epsilon_0)}{\min_{k \ne k'} |b_k - b_{k'}|}\right)
\begin{cases}
= 0.5 & \text{if } j = j' \\
\ge 1 - \epsilon_0 & \text{if } j > j' \\
\le \epsilon_0 & \text{if } j < j'
\end{cases}$$

Since the product $\Gamma \mathbf{w}$ represents the half-step points of $S^*_n$ at $x = b_j$, the result above entails $|S_n(x) - S^*_n(x)| < \epsilon_2 = 2\epsilon_1$ for all $x$. To see this, we choose $\epsilon_0 = \frac{1}{2(n+1)}$. Then $S_n$ intercepts all segments of $S^*_n$ except for the ends. We choose $\epsilon_2 = 2\epsilon_1$ since the last step of $S^*_n$ is of size $\frac{2}{n+1}$, and thus the bound also holds true in the vicinity where $S_n$ intercepts the last step of $S^*_n$. Hence,

$$|S_n(x) - S(x)| \le |S_n(x) - S^*_n(x)| + |S^*_n(x) - S(x)| < \epsilon_1 + \epsilon_2 = \epsilon$$
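The constructions in Lemmas 1 and 2 are easy to check numerically. The sketch below (an independent illustration, not the paper's code; the particular target $S$ and grid are assumptions) places the biases at the quantiles $S^{-1}(y_j)$ as in the proofs and measures the sup-norm error of the step and sigmoid superpositions.

```python
import numpy as np

# Numerical illustration of Lemmas 1 and 2: approximate a strictly monotonic
# target S on [r0, r1] with a convex combination of steps (Lemma 1) and of
# sharp sigmoids (Lemma 2) whose biases are the quantiles chosen in the proofs.
r0, r1 = -3.0, 3.0

def S(x):  # continuous, strictly increasing, S(r0) = 0, S(r1) = 1
    return (np.tanh(x) - np.tanh(r0)) / (np.tanh(r1) - np.tanh(r0))

eps = 0.05
n = int(np.ceil(1.0 / eps))
y = np.arange(1, n + 1) / (n + 1.0)                            # levels y_1 .. y_n
b = np.arctanh(np.tanh(r0) + y * (np.tanh(r1) - np.tanh(r0)))  # b_j = S^{-1}(y_j)
t = np.append(y[:n - 1], 1.0)                                  # t_j = y_j (j < n), t_n = 1
w = np.diff(np.concatenate(([0.0], t)))                        # w_j = t_j - t_{j-1}, sums to 1

x = np.linspace(r0, r1, 2001)
step = (w * (x[:, None] >= b)).sum(axis=1)                     # S*_n of Lemma 1
tau = 1e-3                                                     # small, bounded, positive
z = np.clip((x[:, None] - b) / tau, -50.0, 50.0)               # avoid overflow in exp
sigm = (w / (1.0 + np.exp(-z))).sum(axis=1)                    # S_n of Lemma 2

print(np.abs(step - S(x)).max(), np.abs(sigm - S(x)).max())    # both below eps
```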

Now we show that the pre-logit DSF (Equation 11) can universally approximate monotonic functions. We do this by showing that the well-known universal function approximation properties of neural networks (Cybenko, 1989) allow us to produce parameters which are sufficiently close to those required by Lemma 2.

Lemma 3. Let $x_{1:m} \in [r_0, r_1]^m$ where $r_0, r_1 \in \mathbb{R}$. Given any $\epsilon > 0$ and any multivariate continuously differentiable function [11] $S(x_{1:m})_t = S_t(x_t, x_{1:t-1})$ for $t \in [1, m]$ that is strictly monotonic with respect to the first argument when the second argument is fixed, and whose boundary values are $S_t(r_0, x_{1:t-1}) = 0$ and $S_t(r_1, x_{1:t-1}) = 1$ for all $x_{1:t-1}$ and $t$, there exists a multivariate function $\hat{S}$ such that $\|\hat{S}(x_{1:m}) - S(x_{1:m})\|_\infty < \epsilon$ for all $x_{1:m}$, of the following form:

$$\hat{S}(x_{1:m})_t = \hat{S}_t\big(x_t, C_t(x_{1:t-1})\big) = \sum_{j=1}^{n} w_{tj}(x_{1:t-1})\, \sigma\!\left(\frac{x_t - b_{tj}(x_{1:t-1})}{\tau_{tj}(x_{1:t-1})}\right)$$

where $t \in [1, m]$ and $C_t = \{w_{tj}, b_{tj}, \tau_{tj}\}_{j=1}^{n}$ are functions of $x_{1:t-1}$ parameterized by neural networks, with $\tau_{tj}$ bounded and positive, $b_{tj} \in [r_0, r_1]$, $\sum_{j=1}^{n} w_{tj} = 1$, and $w_{tj} > 0$.

Proof. (Lemma 3) First we deal with the univariate case for any $t$ and drop the subscript $t$ of the functions. We write $S_n$ and $C_k$ to denote the sequences of univariate functions. We want to show that for any $\epsilon > 0$ there exist (1) a sequence of functions $S_n(x_t, C_k(x_{1:t-1}))$ in the given form, and (2) large enough $N$ and $K$ such that when $n \ge N$ and $k \ge K$, $|S_n(x_t, C_k(x_{1:t-1})) - S(x_t, x_{1:t-1})| \le \epsilon$ for all $x_{1:t} \in [r_0, r_1]^t$.

The idea is first to show that we can find a sequence of parameters, $C_n(x_{1:t-1})$, that yields a good approximation of the target function, $S(x_t, x_{1:t-1})$. We then show that these parameters can be arbitrarily well approximated by the outputs of a neural network, $C_k(x_{1:t-1})$, which in turn yields a good approximation of $S$.

From Lemma 2, we know that such a sequence $C_n(x_{1:t-1})$ exists, and furthermore that we can, for any $\epsilon$, and independently of $S$ and $x_{1:t-1}$, choose an $N$ large enough so that:

$$\left|S_n\big(x_t, C_n(x_{1:t-1})\big) - S(x_t, x_{1:t-1})\right| < \frac{\epsilon}{2} \qquad (22)$$

To see that we can further approximate a given $C_n(x_{1:t-1})$ well by $C_k(x_{1:t-1})$ [12], we apply the classic result of Cybenko (1989), which states that a multilayer perceptron can approximate any continuous function on a compact subset of $\mathbb{R}^m$. Note that the specification of $C_n(x_{1:t-1}) = \{w_{tj}, b_{tj}, \tau_{tj}\}_{j=1}^{n}$ in Lemma 2 depends on the quantiles of $S_t(\cdot, x_{1:t-1})$, as a function of $x_{1:t-1}$; since the quantiles are continuous functions of $x_{1:t-1}$, so is $C_n(x_{1:t-1})$, and the theorem applies.

Now, $S_n$ has bounded derivative with respect to $C$, and is thus uniformly continuous, so long as $\tau$ is greater than some positive constant, which is always the case for any fixed $C_n$, and thus can be guaranteed for $C_k$ as well for large enough $k$. Uniform continuity allows us to translate the convergence of $C_k \to C_n$ into convergence of $S_n(x_t, C_k(x_{1:t-1})) \to S_n(x_t, C_n(x_{1:t-1}))$, since for any $\epsilon$ there exists a $\delta > 0$ such that

$$\left\|C_k(x_{1:t-1}) - C_n(x_{1:t-1})\right\| < \delta \;\Longrightarrow\; \left|S_n\big(x_t, C_k(x_{1:t-1})\big) - S_n\big(x_t, C_n(x_{1:t-1})\big)\right| < \frac{\epsilon}{2} \qquad (23)$$

Combining this with Equation 22, we have, for all $x_t$ and $x_{1:t-1}$, and for all $n \ge N$ and $k \ge K$,

$$\begin{aligned}
\left|S_n\big(x_t, C_k(x_{1:t-1})\big) - S(x_t, x_{1:t-1})\right|
&\le \left|S_n\big(x_t, C_k(x_{1:t-1})\big) - S_n\big(x_t, C_n(x_{1:t-1})\big)\right| + \left|S_n\big(x_t, C_n(x_{1:t-1})\big) - S(x_t, x_{1:t-1})\right| \\
&< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon
\end{aligned}$$

Having proved the univariate case, we add back the subscript $t$ to denote the index of the function. From the above we know that, given any $\epsilon > 0$, for each $t$ there exist $N_t$ and a sequence of univariate functions $\hat{S}_{n,t}$ such that for all $n \ge N_t$, $|\hat{S}_{n,t} - S_t| < \epsilon$ for all $x_{1:t}$. Choosing $N = \max_{t \in [1,m]} N_t$, we have that there exists a sequence of multivariate functions $\hat{S}_n$ in the given form such that for all $n \ge N$, $\|\hat{S}_n - S\|_\infty < \epsilon$ for all $x_{1:m}$.

[11] $S: [r_0, r_1]^m \to [0, 1]^m$ is a multivariate-multivariable function, where $S_t$ is its $t$-th component, which is a univariate-multivariable function, written as $S_t(\cdot, \cdot): [r_0, r_1] \times [r_0, r_1]^{t-1} \to [0, 1]$.

[12] Note that $C_n$ is a chosen function (we are not assuming its parameterization, i.e. it is not necessarily a neural network) that we seek to approximate using $C_k$, which is the output of a neural network.
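To make the form in Lemma 3 concrete, here is a minimal sketch of one coordinate of the pre-logit DSF, with a small, hypothetical neural conditioner mapping $x_{1:t-1}$ to $(w, b, \tau)$ under the constraints of the lemma (softmax for $w$, a squashing into $[r_0, r_1]$ for $b$, and softplus plus a constant for $\tau$). The network, sizes, and parameterization are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

n, r0, r1 = 8, -3.0, 3.0  # number of sigmoid components and input range (illustrative)

class Conditioner(torch.nn.Module):
    """Hypothetical conditioner for a single coordinate t: maps x_{1:t-1} to
    (w, b, tau), enforcing the constraints of Lemma 3."""
    def __init__(self, t_minus_1):
        super().__init__()
        self.net = torch.nn.Linear(t_minus_1, 3 * n)

    def forward(self, x_prev):
        pre_w, pre_b, pre_tau = self.net(x_prev).chunk(3, dim=-1)
        w = F.softmax(pre_w, dim=-1)                 # positive, sums to one
        b = r0 + (r1 - r0) * torch.sigmoid(pre_b)    # b in [r0, r1]
        tau = F.softplus(pre_tau) + 1e-3             # bounded away from zero
        return w, b, tau

def dsf_t(x_t, w, b, tau):
    # S_t(x_t, C_t(x_{1:t-1})) = sum_j w_j * sigmoid((x_t - b_j) / tau_j)
    return (w * torch.sigmoid((x_t.unsqueeze(-1) - b) / tau)).sum(-1)

cond = Conditioner(t_minus_1=4)
x_prev, x_t = torch.randn(2, 4), torch.randn(2)
y_t = dsf_t(x_t, *cond(x_prev))  # values in (0, 1), strictly increasing in x_t
```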
G. Proof of Universal Approximation of DSF

Lemma 4. Let $X \in \mathcal{X}$ be a random variable, with $\mathcal{X} \subseteq \mathbb{R}^m$ and $\mathcal{Y} \subseteq \mathbb{R}^m$. Given any function $J: \mathcal{X} \to \mathcal{Y}$ and a sequence of functions $J_n$ that converges pointwise to $J$, the sequence of random variables induced by the transformations, $Y_n \doteq J_n(X)$, converges in distribution to $Y \doteq J(X)$.

Proof. (Lemma 4) Let $h$ be any bounded, continuous function on $\mathbb{R}^m$, so that $h \circ J_n$ converges pointwise to $h \circ J$ by continuity of $h$. Since $h$ is bounded, by the dominated convergence theorem $\mathbb{E}[h(Y_n)] = \mathbb{E}[h(J_n(X))]$ converges to $\mathbb{E}[h(J(X))] = \mathbb{E}[h(Y)]$. As this result holds for any bounded continuous function $h$, by the Portmanteau lemma we have $Y_n \xrightarrow{D} Y$.

Proof. (Proposition 2) Given an arbitrary ordering, let $F$ be the CDFs of $Y$, defined as $F_t(y_t, x_{1:t-1}) = \Pr(Y_t \le y_t \mid x_{1:t-1})$. According to Theorem 1 of Hyvärinen & Pajunen (1999), $F(Y)$ is uniformly distributed in the cube $[0, 1]^m$. $F$ has an upper triangular Jacobian matrix, whose diagonal entries are conditional densities, which are positive by assumption. Let $G$ be a multivariate and multivariable function where $G_t$ is the inverse of the CDF of $Y_t$: $G_t(F_t(y_t, x_{1:t-1}), x_{1:t-1}) = y_t$. According to Lemma 3, there exists a sequence of functions in the given form, $(\hat{S}_n)_{n \ge 1}$, that converges uniformly to $\sigma \circ G$. Since uniform convergence implies pointwise convergence, $G_n = \sigma^{-1} \circ \hat{S}_n$ converges pointwise to $G$, by continuity of $\sigma^{-1}$. Since $G_n$ converges pointwise to $G$ and $G(X) = Y$, by Lemma 4 we have $Y_n \xrightarrow{D} Y$.

Proof. (Proposition 3) Given an arbitrary ordering, let $H$ be the CDFs of $X$:

$$\begin{aligned}
y_1 &\doteq H_1(x_1) = F_1(x_1) = \Pr(X_1 \le x_1) \\
y_t &\doteq H_t(x_t, x_{1:t-1}) = F_t\big(x_t, \{H_{t'}(x_{t'}, x_{1:t'-1})\}_{t'=1}^{t-1}\big) = \Pr(X_t \le x_t \mid y_{1:t-1}) \quad \text{for } 2 \le t \le m
\end{aligned}$$

Due to Hyvärinen & Pajunen (1999), $y_1, \dots, y_m$ are independently and uniformly distributed in $(0, 1)^m$. According to Lemma 3, there exists a sequence of functions in the given form, $(\hat{S}_n)_{n \ge 1}$, that converges uniformly to $H$. Since $H_n = \hat{S}_n$ converges pointwise to $H$ and $H(X) = Y$, by Lemma 4 we have $Y_n \xrightarrow{D} Y$.

Proof. (Theorem 1) Given an arbitrary ordering, let $H$ be the CDFs of $X$, defined the same way as in the proof of Proposition 3, and let $G$ be the inverse of the CDFs of $Y$, defined the same way as in the proof of Proposition 2. Due to Hyvärinen & Pajunen (1999), $H(X)$ is uniformly distributed in $(0, 1)^m$, so $G(H(X)) = Y$. Since $H_t(x_t, x_{1:t-1})$ is monotonic with respect to $x_t$ given $x_{1:t-1}$, and $G_t(H_t, H_{1:t-1})$ is monotonic with respect to $H_t$ given $H_{1:t-1}$, $G_t$ is also monotonic with respect to $x_t$ given $x_{1:t-1}$, as

$$\frac{\partial\, G_t(H_t, H_{1:t-1})}{\partial x_t} = \frac{\partial\, G_t(H_t, H_{1:t-1})}{\partial H_t}\, \frac{\partial\, H_t(x_t, x_{1:t-1})}{\partial x_t}$$

is always positive. According to Lemma 3, there exists a sequence of functions in the given form, $(\hat{S}_n)_{n \ge 1}$, that converges uniformly to $\sigma \circ G \circ H$. Since uniform convergence implies pointwise convergence, $K_n = \sigma^{-1} \circ \hat{S}_n$ converges pointwise to $G \circ H$, by continuity of $\sigma^{-1}$. Since $K_n$ converges pointwise to $G \circ H$ and $G(H(X)) = Y$, by Lemma 4 we have $Y_n \xrightarrow{D} Y$.

H. Experimental Details

For the experiments on amortized variational inference, we implement the Variational Autoencoder (Kingma & Welling, 2014). Specifically, we follow the architecture used in Kingma et al. (2016): the encoder has three layers with [16, 32, 32] feature maps. We use resnet blocks (He et al., 2016) with 3×3 convolution filters and a stride of 2 to downsize the feature maps. The convolution layers are followed by a fully connected layer of size 450, which serves as a context for the flow layers that transform the noise sampled from a standard normal distribution of dimension 32. The decoder is symmetrical with the encoder, with the strided convolution replaced by a combination of bilinear upsampling and a regular resnet convolution to double the feature map size. We use the ELU activation function (Clevert et al., 2016) and weight normalization (Salimans & Kingma, 2016) in the encoder and decoder. In terms of optimization, Adam (Kingma & Ba, 2014) is used with the learning rate fine-tuned for each inference setting, and Polyak averaging (Polyak & Juditsky, 1992) is used for evaluation, with $\alpha = 0.998$, which stands for the proportion of the past at each time step. We also consider a variant of Adam known as Amsgrad (Reddi et al., 2018) as a hyperparameter.

For the vanilla VAE, we simply apply a resnet dot product with the context vector to output the mean and the pre-softplus standard deviation, and transform each dimension of the noise vector independently. We call this linear flow.
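The linear flow just described amounts to an element-wise affine transformation of the noise. The sketch below illustrates it with the standard deviation kept positive through softplus; the exact parameterization used in the experiments is an assumption here.

```python
import torch
import torch.nn.functional as F

def linear_flow(noise, mean, pre_std):
    # Element-wise affine transform of standard-normal noise; softplus keeps
    # the standard deviation positive. (A sketch of the parameterization, not
    # necessarily the exact one used in the experiments.)
    return mean + F.softplus(pre_std) * noise

# Illustrative sizes: a batch of 4 latent samples of dimension 32.
noise = torch.randn(4, 32)
mean, pre_std = torch.zeros(4, 32), torch.zeros(4, 32)
z = linear_flow(noise, mean, pre_std)
```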
For IAF-affine and IAF-DSF, we employ MADE (Germain et al., 2015) as the conditioner $c(x_{1:t-1})$, and we apply a dot product on the context vector to output a scale vector and a bias vector that conditionally rescale and shift the preactivation of each layer of the MADE. Each MADE has one hidden layer with 1920 hidden units. The IAF experiments all start with a linear flow layer, followed by IAF-affine or IAF-DSF transformations. For DSF, we choose $d = 16$.

For the experiment of density estimation with MAF, we followed the implementation of Papamakarios et al. (2017). Specifically, for each dataset we experimented with both 5 and 10 flow layers, followed by one linear flow layer. The following table specifies the number of hidden layers and the number of hidden units per hidden layer for MADE:

Table 3. Architecture specification of MADE in the MAF experiment: number of hidden layers and number of hidden units per hidden layer.

    Dataset      Hidden layers   Hidden units
    POWER        2               100
    GAS          2               100
    HEPMASS      2               512
    MINIBOONE    1               512
    BSDS300      2               1024
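For completeness, the context-conditioned rescaling and shifting of the MADE preactivations described above for IAF-affine/IAF-DSF could be sketched as follows (module names, the use of plain nn.Linear for the context projection, and the sizes are illustrative assumptions, not the released code).

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the modulation described above: the context vector is
# mapped by a plain dot product (nn.Linear) to a per-unit scale and bias that
# rescale and shift the preactivation of a MADE hidden layer.
context_dim, n_hidden = 450, 1920

to_scale = nn.Linear(context_dim, n_hidden)
to_bias = nn.Linear(context_dim, n_hidden)

def modulate(preactivation, context):
    # preactivation: (batch, n_hidden) output of the MADE masked linear layer
    # context:       (batch, context_dim) context vector from the encoder
    return to_scale(context) * preactivation + to_bias(context)

preact, ctx = torch.randn(4, n_hidden), torch.randn(4, context_dim)
h = modulate(preact, ctx)  # the MADE's own activation function would follow
```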