Gradient Estimation for Attractor Networks

Thomas Flynn
Department of Computer Science, Graduate Center of CUNY


Abstract

We study several types of neural networks that involve feedback connections. These models lead to interesting problems in optimization and may also be useful in applications. In all of these networks, which include deterministic and stochastic models on both continuous and discrete state spaces, gradient estimation is a dynamic process involving feedback, just like the underlying networks. First we consider optimization of deterministic attractor networks, based on the forward and adjoint sensitivity analysis methods. Then we consider derivative estimation in a stochastic variant of this network, and propose to port the forward sensitivity analysis method to this setting. Thirdly, we consider a stochastic network on a discrete space, known as the Little model. Gradient estimation in special cases of this model, such as the Boltzmann machine and sigmoid belief networks, has been studied, based on closed-form solutions for the resulting probability distributions. In the general case, one only has the Markov kernels that specify the short-term behavior of the model. To enable gradient-based optimization in this setting, we propose to calculate search directions by combining features of simultaneous perturbation analysis and measure valued differentiation. Our aim in studying these models is to expand the set of tools that are available for experimentation on problems of interest in machine learning. In addition to the theoretical results, we are also interested in applying these models to relevant problems in machine learning, such as image classification.

Contents

1 Introduction
  1.1 Notations
2 Deterministic attractor networks
  2.1 Background
  2.2 Research question
3 Stochastic attractor networks
  3.1 Research question
  3.2 Stochastic contraction
  3.3 Approach to the problem
4 Discrete models
  4.1 Background
  4.2 Stationary differentiability using MVD
  4.3 Research question
5 Conclusion and timeline
Bibliography

1 Introduction

Many machine learning problems are approached using gradient-based optimization. At each step of the algorithm, the optimization program calculates the derivative of an objective function with respect to the parameters of the model, in order to determine what direction to move in. In many cases these derivatives cannot be calculated exactly. Difficulties in gradient estimation arise when dealing with stochastic models, and also with deterministic models that involve feedback.

One approach that can be taken with models that involve feedback is to run the gradient estimation and optimization processes simultaneously. The idea is to use an iterative algorithm for computing the gradient, and carry over the state of the estimation procedure after each parameter update, to avoid starting from scratch each time. This type of gradient estimation can be used for sensitivity analysis in attractor networks. This type of neural network, which we formally define in Section 2, is in some ways similar to the usual feed-forward networks. They can be used as input-output maps, sending, for example, an input image to a vector of confidence scores for the different categories that the image could belong to. The main difference is that an attractor network has a dynamic internal state, enabled by allowing feed-back connections among units. Feed-forward networks, on the other hand, exhibit trivial dynamical behavior; they always reach a steady state in a finite number of steps. The capacity for interesting dynamical behavior may translate to information processing capacity. However, training and maintaining the stability of these networks seems difficult. The dynamical nature means that one cannot compute the derivatives of interest as easily as one can in feed-forward networks.

One can construct an optimization algorithm for attractor networks using a dynamic variant of back-propagation based algorithms. We propose to address the problems of how to tune the algorithm to guarantee a function decrease, and how to obtain a long-term guarantee on the optimization algorithm. Some other works (see Figure 1) also considered these models, but the results they obtained for gradient estimation and optimization were mostly heuristic, or were asymptotic. In this setting we seek results that concern finite step sizes, and we also aim to express the results in terms of available model information.

The attractor networks can be generalized by making them stochastic. For instance, one can have stochastic connections between the nodes of the network, so that a node does not always communicate with the same set of neighbors. This could model, for instance, large, distributed neural networks whose units cannot communicate perfectly due to communication delays or synchronization issues. Having stochastic connections has also been conjectured to help with overfitting. In Section 3, we propose to extend the gradient estimation algorithm from deterministic attractor networks to this setting. As shown in Figure 1, some other authors considered restricted cases of such models that have acyclic connectivity.

We then turn our attention in Section 4 to networks that operate on a discrete state space. Several stochastic neural networks on discrete state spaces have been studied, and their gradient estimation procedures are based on having closed form solutions for the resulting probability distributions.

[Figure 1 is a tree diagram; only its caption is reproduced here. Its leaves mark this work and the references [1]-[11].]

Figure 1: Gradient estimation and optimization in different dynamical settings for neural networks. The first level splits into networks that are stochastic or deterministic. At the next level the split is whether the state space of the network is discrete or continuous. At the third level we differentiate based on connectivity constraints: whether the network is acyclic, has symmetric connections, or allows general connectivity. At the fourth level the split is based on the order of updating the nodes: whether they are updated all at once (synchronous) or one at a time, in an asynchronous manner. This work considers gradient estimation for three types of networks. A question mark means the author is unaware of any works considering gradient estimation in models with those properties.

The works [1, 2, 3] (see Fig. 1) depend on constraints on network connectivity, for instance symmetry, or prohibiting cycles. In the case of networks where cycles are allowed, the work [4] considered gradient estimation in the finite horizon setting. Our interest is in the long-term average cost in networks that have general connectivity, where only knowledge of the transition probabilities is available. Methods such as forward sensitivity analysis cannot be used in this case, as they rely on the differential structure of the underlying state space. Instead, we propose an algorithm that computes descent directions based on simultaneous perturbation analysis and measure valued differentiation (MVD).

Along the way we recall a number of relevant mathematical tools. In the deterministic case, to analyze convergence of the forward sensitivity system we use the contraction mapping theorem and a theorem on hierarchies of contracting systems. For considering the long-term behavior of stochastic networks, we use a condition based on contraction in the Wasserstein distance. When considering sensitivity analysis for discrete networks we will use results on MVD for Markov chains to establish differentiability of stationary costs.

1.1 Notations

n - dimensionality of the state space of a model. In a network-based model, this will be the number of nodes in the network.
m - dimensionality of the parameter space of a model.
x - variable for the state of a neural network or other dynamic system.
w_{i,j} - weight from node j into node i.
b_i - bias at node i.
θ - parameter vector. In a neural network model, θ is a pair (w, b), consisting of a weight matrix and a bias vector.
∂g/∂y(y) - for a function g : R^n → R^m, this is the m × n matrix with entries [∂g/∂y(y)]_{i,j} = ∂g_i/∂y_j(y).
t - iteration number of an algorithm.
θ(1), θ(2), ..., θ(t), ... - sequence of parameters generated by an optimization algorithm.
µ - a probability measure representing an external noise source.
π_θ - probability measure depending on a parameter θ.
P(x, A) - the probability a Markov kernel P assigns to the set A from the state x. When the state space is discrete, we will often abuse notation and write P(x, y) instead of P(x, {y}).
ν(e) - expected value of a random variable e under the measure ν; ν(e) = ∫_X e(x) dν(x).
δ_x - point mass centered at x; the measure with δ_x(A) = 1 if x ∈ A and 0 otherwise.

2 Deterministic attractor networks

The work of [11] generated much interest in gradient-based training of feed-forward networks, and shortly after there appeared works studying a wide variety of neural networks. The type of network that we describe in this section, the attractor network, seems to have first been described in [9], [10]. Attractor networks form a class of neural networks that includes feed-forward networks, but also allows for feed-back connections. These are interesting because they can possibly support non-trivial dynamic behavior. It seems that allowing feed-back connections should

increase the computational power of neural networks, perhaps allowing a network with fewer nodes and connections to replace a very big feed-forward network. We review two known algorithms for computing derivatives in these networks, called adjoint sensitivity analysis and forward sensitivity analysis. Our proposed work is to analyze an optimization algorithm based on these gradient estimators. The theoretical analyses in previous works were very limited.

2.1 Background

We now define an attractor network. The state space is X = R^n, and there is a set of edges that determines the connectivity between nodes. We let x = (x_1, ..., x_n) denote a state of the neural network. The weights w ∈ R^{n×n} and biases b ∈ R^n are the parameters of the network. The number w_{i,j} is the weight into node i from node j; a value of w_{i,j} = 0 means there is no connection to node i from node j. We let Θ = R^{n×n} × R^n represent the joint parameter space. The following function f : X × Θ × X → X gives the next state of the network given the current state, parameters, and input:

    f_i(x, (w, b), u) = σ( ∑_{j=1}^n w_{i,j} x_j + b_i + u_i ),  i = 1, 2, ..., n    (1)

That is, the state of node i at time t+1 is determined by the input u_i at that node and the states of its neighbors at time t. Typical choices for the function σ are the logistic function σ(x) = 1/(1 + exp(−x)) or the hyperbolic tangent σ(x) = (1 − exp(−2x))/(1 + exp(−2x)).

By iterating (1) starting from an initial point x(0), one obtains a sequence of network states x(1), x(2), ... where x(t+1) = f(x(t), θ, u). Alternatively, we may write x(t+1) = f^{t+1}(x(0), θ, u). A network has a globally attractive fixed-point when there is a state x* that is a fixed-point for f, meaning f(x*, θ, u) = x*, and this fixed-point is globally attractive, meaning

    lim_{t→∞} f^t(x(0), θ, u) = x*

for any initial point x(0). We will also use the name fixed-point networks to refer to attractor networks. In general, this fixed-point will depend on the parameters θ = (w, b) and u, and to make this explicit we will write x*(θ, u).

The question of whether a network admits a globally attractive fixed-point for a given value of the parameters w, b seems to be difficult; for one type of attractor network known as the Hopfield network, various hardness results have been obtained for this question [12]. There are several known sufficient conditions, however. For instance, a globally attractive fixed-point exists when the weights of the network are small in some sense. Formally, for any norm ‖·‖ on R^n, there will be contraction when

    sup_{x ∈ X} ‖∂f/∂x(x)‖ < 1    (2)
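To make the preceding definitions concrete, the following is a minimal numpy sketch of the update rule (1) and of fixed-point iteration; the network size, weight scale, and stopping tolerance are illustrative choices, not taken from this proposal.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))  # logistic nonlinearity

def f(x, w, b, u):
    # One synchronous update of the network state, Equation (1).
    return sigma(w @ x + b + u)

def fixed_point(w, b, u, x0, tol=1e-10, max_iter=10000):
    # Iterate the map until successive states are close; under the
    # contraction condition (2) this converges to x*(theta, u).
    x = x0
    for _ in range(max_iter):
        x_next = f(x, w, b, u)
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x

rng = np.random.default_rng(0)
n = 5
w = 0.5 * rng.standard_normal((n, n))   # small weights encourage contraction
b = rng.standard_normal(n)
u = rng.standard_normal(n)
x_star = fixed_point(w, b, u, np.zeros(n))
print(np.linalg.norm(f(x_star, w, b, u) - x_star))  # ~0 at the fixed point
```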

If we specialize to the form of f in Equation (1), we can get a more specific condition. If the norm is a so-called absolute norm, which means that ‖(u_1, ..., u_n)‖ = ‖(|u_1|, ..., |u_n|)‖, then contraction will occur when

    ‖w‖ < 1/‖σ′‖_∞

Norms that are absolute include the norms ‖·‖_p for p ∈ {1, 2, ..., ∞}. When σ is the logistic function, ‖σ′‖_∞ = 1/4, and the condition becomes ‖w‖ < 4.

Let e : X → R be a loss function. The problem is to minimize the loss at the fixed-point:

    min_θ J(θ)    (3)

where J(θ) = e(x*(θ, u)). From now on we will consider the input u as fixed and will drop it from the notation.

For the class of feed-back networks that converge to a unique equilibrium, the basic methodology that is used in feed-forward networks can still be applied. For these networks the derivatives of interest can be computed using a protocol in which information is transmitted along the existing links of the network, and on backwards connections, with a dynamical variant of the backpropagation algorithm. Some early works that considered these networks were [13], [10], [14], [9], [15].

We now derive the optimization algorithm that we are interested in. The exposition is somewhat similar to that in the works [16], [17]. The differentiability of J follows by the implicit function theorem and the chain rule. That is, starting from the equation

    x*(θ) = f(x*(θ), θ)    (4)

and using the contractivity and differentiability properties of f, one can conclude that x*(θ) is differentiable. Then, as long as e is differentiable, we can apply the chain rule to get that J is differentiable. Using this and the formulas provided by the implicit function theorem, one obtains

    ∂J/∂θ(θ) = A(θ)B(θ)C(θ)    (5)

where

    A(θ) = ∂e/∂x(x*(θ)),  B(θ) = ( I − ∂f/∂x(x*(θ), θ) )^{−1},  C(θ) = ∂f/∂θ(x*(θ), θ)

From this formula, we can see two challenges to computing, or even approximating, the derivative. The first is that these terms involve x*(θ), which can only be approximated by iteration. Secondly, they involve the solution of linear systems (one can choose between either A(θ)B(θ) or B(θ)C(θ)). Below we describe two iterative algorithms that can address these problems.
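Before turning to the iterative schemes, formula (5) can be evaluated directly when the Jacobians are available in closed form. The sketch below does this for the network (1), under simplifying assumptions that are ours rather than the proposal's: only the biases b are treated as parameters, and the loss is e(x) = ‖x‖².

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    s = sigma(x)
    return s * (1.0 - s)

# Gradient of J(theta) = e(x*(theta)) via Equation (5), for the network (1),
# with the biases b taken as the parameters (weights held fixed, for brevity).
def gradient_at_fixed_point(w, b, u, e_grad, x0, iters=5000):
    x = x0
    for _ in range(iters):                 # approximate x*(theta) by iteration
        x = sigma(w @ x + b + u)
    a = w @ x + b + u
    dfdx = dsigma(a)[:, None] * w          # df/dx at the fixed point
    dfdb = np.diag(dsigma(a))              # df/db at the fixed point
    A = e_grad(x)                          # A = de/dx(x*)
    n = len(b)
    AB = np.linalg.solve((np.eye(n) - dfdx).T, A)  # AB = A (I - df/dx)^{-1}
    return AB @ dfdb                       # the product A B C

rng = np.random.default_rng(1)
n = 4
w = 0.4 * rng.standard_normal((n, n))
b, u = rng.standard_normal(n), rng.standard_normal(n)
e_grad = lambda x: 2 * x                   # gradient of e(x) = ||x||^2
print(gradient_at_fixed_point(w, b, u, e_grad, np.zeros(n)))
```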

We can calculate the term A(θ)B(θ)C(θ) by computing A(θ)B(θ) and then post-multiplying by C(θ), or we can compute B(θ)C(θ) and pre-multiply by A(θ). In each case, one can use an iterative solver to jointly solve the fixed-point equation (4) and deal with the matrix inverse. These approaches are referred to as adjoint sensitivity analysis and forward sensitivity analysis, respectively. The derivations are somewhat symmetric; we focus on the adjoint method in what follows. An early work to mention this method of sensitivity analysis, for the finite time case, was [18].

Let the space Z be R^n × R^n, and define the map T_Adj : Z × Θ → Z as

    T_Adj((x, y), θ) = ( f(x, θ),  y ∂f/∂x(x, θ) + ∂e/∂x(x) )    (6)

Assuming that T_Adj possesses a fixed-point z*(θ) = (x*(θ), y*(θ)), it is easy to verify that x*(θ) is a fixed-point for f and

    y*(θ) = A(θ)B(θ)

Therefore, if we could obtain the fixed-point z*(θ) we could easily compute the gradient (5), since ∂J/∂θ(θ) = G(z*(θ), θ), where

    G((x, y), θ) = y ∂f/∂θ(x, θ)    (7)

This map T_Adj is essentially doing the same type of gradient estimation as in the back-propagation procedure for neural networks. In the case that the network has no cycles, the gradient estimation converges in a finite number of steps. For example, if f describes a feed-forward network with k layers, then a fixed-point of T_Adj will be reached by iterating for k steps. If there are cycles, then under certain contraction assumptions on f, the operator T_Adj also satisfies a contraction property. This can be verified using the condition on the derivative in Inequality (2), for an appropriate choice of norm on the space Z. This is discussed in [14] and other works concerning attractor networks.

If T_Adj defines a globally attractive process on Z, this gives an iterative method to estimate the gradient: iterate T_Adj enough times starting from an arbitrary point (x_0, y_0) to obtain a point (x_M, y_M) close to (x*(θ), y*(θ)), and then form the estimate G((x_M, y_M), θ). By continuity properties of f, it should be that G((x_M, y_M), θ) ≈ G(z*(θ), θ), where the quantity on the right is the true gradient. The pseudocode for this procedure, termed adjoint sensitivity analysis, is given in Algorithm 1.

Algorithm 1: Deterministic adjoint sensitivity analysis

  Define T_Adj : R^n × R^n × Θ → R^n × R^n as
      T_Adj((x, y), θ) = ( f(x, θ),  y ∂f/∂x(x, θ) + ∂e/∂x(x) )
  for t = 0, 1, ..., M − 1 do
      (x(t+1), y(t+1)) = T_Adj((x(t), y(t)), θ)
  end
  Set Δ_Adj = y(M) ∂f/∂θ(x(M), θ)
  return Δ_Adj

The output of Algorithm 1 is the gradient estimate Δ_Adj. It has the property that Δ_Adj → ∂J/∂θ(θ) as M → ∞, as we have explained above. By a very similar argument, one justifies the forward sensitivity analysis procedure:

Algorithm 2: Deterministic forward sensitivity analysis

  Define T_Fwd : R^n × R^{n×m} × Θ → R^n × R^{n×m} as
      T_Fwd((x, u), θ) = ( f(x, θ),  ∂f/∂x(x, θ) u + ∂f/∂θ(x, θ) )
  for t = 0, 1, ..., M − 1 do
      (x(t+1), u(t+1)) = T_Fwd((x(t), u(t)), θ)
  end
  Set Δ_Fwd = ∂e/∂x(x(M)) u(M)
  return Δ_Fwd

Just like adjoint sensitivity analysis, the output of Algorithm 2 has the property that Δ_Fwd → ∂J/∂θ(θ) as M → ∞. The difference is that T_Fwd has as a fixed-point the pair (x*(θ), u*(θ)), where u*(θ) = B(θ)C(θ).
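The following is a minimal numpy sketch of Algorithm 1 for the network (1), again under our simplifying assumptions that the parameters are the biases and e(x) = ‖x‖². As M grows, its output should approach the closed-form gradient computed in the earlier sketch.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    s = sigma(x)
    return s * (1.0 - s)

# Algorithm 1 for the network (1), with biases b as parameters and e(x) = ||x||^2.
def adjoint_sensitivity(w, b, u, M):
    n = len(b)
    x, y = np.zeros(n), np.zeros(n)
    for _ in range(M):
        a = w @ x + b + u
        dfdx = dsigma(a)[:, None] * w       # df/dx(x(t), theta)
        dedx = 2 * x                        # de/dx(x(t))
        x, y = sigma(a), y @ dfdx + dedx    # one application of T_Adj
    a = w @ x + b + u
    dfdb = np.diag(dsigma(a))               # df/dtheta for theta = b
    return y @ dfdb                         # Delta_Adj = y(M) df/dtheta(x(M), theta)
```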

A very important point is that the map T_Adj operates in the space R^n × R^n, while the parameter θ is in R^m. This means that the computationally intensive dynamical part of gradient estimation happens on a state space of dimensionality 2n. The forward sensitivity analysis involves the state space R^n × R^{n×m}, which is of dimensionality n(m + 1). By a crude criterion that equates small dimensionality of the state space with algorithmic efficiency, adjoint sensitivity analysis has a clear advantage. In neural networks, for instance, the costs of computing an arbitrary entry of ∂f/∂x, ∂f/∂θ, or ∂e/∂x are roughly the same, and adjoint sensitivity analysis (which the back-propagation algorithm is based on) is the clear choice. In practice, the comparison might be more subtle. In any case, in this setting of deterministic contracting systems on a continuous state space, there is a choice between two alternatives. It is of great interest to know whether algorithms with these properties can be realized in other contexts, such as stochastic systems on continuous space, or systems on discrete spaces.

We turn these gradient estimation procedures into optimization algorithms by running the gradient estimation and parameter update processes in parallel. The first, presented in Algorithm 3, is based on forward sensitivity analysis:

Algorithm 3: Optimization using forward sensitivity analysis

  for t = 0, 1, ... do
      (x(t+1), u(t+1)) = T_Fwd((x(t), u(t)), θ(t))
      θ(t+1) = θ(t) − ε ∂e/∂x(x(t+1)) u(t+1)
  end

The second, Algorithm 4, is based on adjoint sensitivity analysis:

Algorithm 4: Optimization using adjoint sensitivity analysis

  for t = 0, 1, ... do
      (x(t+1), y(t+1)) = T_Adj((x(t), y(t)), θ(t))
      θ(t+1) = θ(t) − ε y(t+1) ∂f/∂θ(x(t+1), θ(t))
  end

These algorithms, which couple gradient estimation with optimization, are also well-known in design optimization, where they are used in aerospace applications [19], [20], [21]. In that case, the gradient is difficult to compute because it involves the numerical solution of a PDE, and T_Adj or T_Fwd represents some kind of numerical solver with a contraction property.
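Below is a minimal sketch of Algorithm 4 under the same assumptions as before (bias-only parameters, e(x) = ‖x‖²; the step size and iteration count are arbitrary). Note how a single T_Adj step is interleaved with each parameter update, so the gradient estimate is carried over rather than recomputed from scratch.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    s = sigma(x)
    return s * (1.0 - s)

# Algorithm 4 for the network (1), optimizing only the biases. Keeping the
# weights w fixed preserves the contraction rate as optimization proceeds.
def optimize_biases(w, b, u, eps=0.02, steps=5000):
    n = len(b)
    x, y = np.zeros(n), np.zeros(n)
    for _ in range(steps):
        a = w @ x + b + u
        x, y = sigma(a), y @ (dsigma(a)[:, None] * w) + 2 * x   # T_Adj step
        b = b - eps * y * dsigma(w @ x + b + u)                 # theta update
    return b
```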

2.2 Research question

The issue in analyzing either of the procedures is to choose the step size, and to determine what conditions, if any, we must place on the initial state. We will focus on Algorithm 4 in the rest of this section, but the discussion applies equally to Algorithm 3. In this case the initial state is (x(0), y(0), θ(0)). There are two considerations when choosing the step size ε. The first is that, given a descent direction, only values that are sufficiently small will guarantee a function decrease. This is an important consideration in any gradient-based algorithm. The second issue is related to the dynamic nature of the gradient estimation. If we start out with a good gradient estimate and ε is small, then the parameter hardly changes, and perhaps at the next iteration only a single step of gradient estimation will suffice to get a good search direction, since by continuity (of both the objective function and the gradient estimation process) we will be starting from a good approximation. The type of theorem we would like to prove about the algorithm is as follows:

Theorem 2.1. Under some reasonable assumptions, there is a choice of constants ε, δ, depending on available problem information, so that if z(0) = (x(0), y(0)) satisfies

    ‖z(0) − z*(θ(0))‖ < δ    (8)

then the sequence θ(t) generated by Algorithm 4, starting from the point (x(0), y(0), θ(0)) and using the step size ε, satisfies these properties:
i. J(θ(t+1)) < J(θ(t)),
ii. ‖∂J/∂θ(θ(t))‖ → 0 as t → ∞.

The criterion (8) means that the initial gradient estimate is accurate. This can be achieved by running the gradient estimation algorithm for a number of iterations before starting the parameter adaptation process. Property (i) is just that the objective function decreases at each step. Property (ii) states that the magnitude of the gradient tends to zero. Essentially, these two conditions state that the function values are decreasing and are not decreasing too slowly.

We give a few remarks on the challenges in proving Theorem 2.1. First, we will specify conditions that guarantee the optimization problem is well-defined. This involves contraction and differentiability conditions on the underlying system. Then, to show that the gradient estimation works, we should verify our claim that the procedure map T_Adj is a contraction mapping and converges to a unique fixed-point. When this estimation procedure is used inside an optimization algorithm (Algorithm 4), the exact gradients will not be used, so we will find a condition on the error in the gradient that guarantees convergence, and then determine how to tune the gradient procedure to maintain this condition during optimization.

There are a number of relevant works that attempt to analyze adjoint-based optimization procedures. The work of [22] analyzed a continuous-time version of the algorithm and obtained local convergence results using methods for singularly perturbed systems [23]. The results were local in the sense that they assume optimization begins near an attracting local minimum. They were also not constructive, meaning they did not quantify the requirements on the algorithm, such as time-scale parameters. In [24] the same authors considered the algorithm in the discrete-time setting, but did not pursue a convergence analysis.

A related set of works considers Hebbian learning in neural networks. This is a type of learning process for a neural network that is not explicitly gradient based, instead being motivated by the neuroscientific theory of Hebbian learning. The similarity is that both involve simultaneous processes of adaptation and underlying network dynamics, and the need to consider the relative rate of the two activities. Convergence of a continuous-time Hebbian learning process was discussed in [25]. The work of [26] also considered convergence of this type of algorithm, again using the singular perturbation methods of [23].

As mentioned above, algorithms that involve joint adjoint-based derivative estimation and optimization are used in other optimization fields as well. In PDE-constrained optimization this is called the one-shot approach [27, 21]. However, the results in these works also assume that the algorithm starts near a global minimum.

There are a number of applications that could be of interest. One idea is inspired by reservoir networks [28]. We can consider a hierarchical network, the first component being a random attractor network, and the second being a regular feed-forward layer. We would keep the random weights fixed and only optimize the output weights. The above results require a uniform contraction property (that is, the rate of convergence of the system should stay constant as optimization proceeds). This is accommodated by keeping the feed-back weights fixed.

3 Stochastic attractor networks

The neural networks we described in the previous section are deterministic: to each input they associate one output. There are a number of situations in which one may be interested in stochastic networks. For instance, if one is dealing with a very large network that has to be spread across multiple computers, this can cause random delays or synchronization issues. We can model this with a neural network that has noisy connections, meaning the set of neighbors of each node is random at each time step. Additionally, a noisy network can represent a one-to-many mapping, which could be useful when there are multiple interpretations of a given image, for instance.

In this section we describe a class of stochastic networks with feedback, and consider the corresponding optimization problem. When the network is well behaved, the process will be ergodic and the problem is to optimize the long-term behavior. An interesting question is whether the gradient estimation procedures we have described, the adjoint and forward sensitivity analyses (Algorithms 1 and 2), can be applied in this setting. We propose to study this question, and to derive a useful gradient estimator for this model.

As in the deterministic attractor networks, there are three issues that must be addressed in a comprehensive treatment of these models. Firstly, we specify how these networks determine an input/output mapping; this lets us define the objective function. Secondly, there is the issue of gradient estimation, meaning we need to find an algorithm for computing the derivative of this function. The third step is to define an optimization algorithm that combines the gradient estimation procedure with a parameter update step.

The output of the network we are interested in is the long-term average behavior. To guarantee that the network has a regular long-term behavior that is independent of its starting point, we will use a condition which is a stochastic analogue of the contraction condition we saw in the previous section.

The general form of our stochastic neural networks is as follows. Let ξ(1), ξ(2), ... be an infinite sequence of independent, identically distributed (i.i.d.) Ξ-valued random variables distributed according to a measure µ. Each ξ(t) could be, for instance, a vector of uniform

random variables in [0, 1]. Then the state at time t+1, denoted x(t+1), is obtained from the state x(t), the parameters θ, and the random input ξ(t+1) as

    x(t+1) = f(x(t), θ, ξ(t+1))    (9)

For instance, the ξ(t) could be a matrix of Bernoulli random variables that determine which connections are activated, in the random connection model we described above. We interpret ξ_{i,j} = 1 to mean the connection from j to i is activated, and ξ_{i,j} = 0 to mean the connection is disabled. In this case, f takes the form

    f_i(x, θ, ξ) = σ( ∑_{j=1}^n ξ_{i,j} w_{i,j} x_j + b_i )

We are interested in the problem of optimizing the long-term average behavior of recursions such as (9). This reduces to the deterministic attractor problem in case there is no noise (f does not depend on ξ).

To define the objective function, we need to put some restrictions on our process so that it does in fact have a regular long-term behavior. The condition will be that the Markov chain defined by (9) possesses a unique invariant measure π_θ, and that given any distribution of the initial state x(0), the distributions of the states x(1), x(2), ... converge to π_θ. We can formally state this as follows. Let P_θ be the Markov kernel associated to the stochastic system (9). Given that one is in state x, the quantity P_θ(x, A) is the probability that the random next state f(x, θ, ξ) is in the set A. Formally, letting µ be the noise measure,

    P_θ(x, A) = µ({ξ : f(x, θ, ξ) ∈ A})

For a function e : X → R we denote by P_θ e the function

    (P_θ e)(x) = ∫_X e(y) d(δ_x P_θ)(y) = ∫_Ξ e(f(x, θ, ξ)) dµ(ξ)

This number (P_θ e)(x) is the expectation of the value of e at the next state of the network, given that we start in state x. For a measure ν on X we denote by νP_θ the measure

    (νP_θ)(A) = ∫_X P_θ(x, A) dν(x)

Then the system (9) possesses a stationary measure π_θ when

    π_θ = π_θ P_θ    (10)

Instead of using the terminology globally attractive, we will say that the network is ergodic. This means that (10) holds and, for any initial measure ν,

    νP_θ^t → π_θ as t → ∞.    (11)
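As an illustration, the following sketch simulates the random-connection network and estimates the stationary cost π_θ(e) by a long-run average; the connection probability p and the burn-in length are illustrative assumptions, not values from the proposal.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

# One step of the random-connection network: xi is a Bernoulli mask over edges.
def f(x, w, b, xi):
    return sigma((xi * w) @ x + b)

# Monte Carlo estimate of the stationary cost pi_theta(e) from one long run,
# discarding an initial transient (the burn-in length is an arbitrary choice).
def stationary_cost(w, b, e, p=0.8, steps=50000, burn=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    total = 0.0
    for t in range(steps):
        xi = rng.random((n, n)) < p        # each connection active w.p. p
        x = f(x, w, b, xi)
        if t >= burn:
            total += e(x)
    return total / (steps - burn)
```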

The type of convergence in (11) is weak convergence. A sequence of measures µ_t converges weakly to a measure µ if µ_t(e) → µ(e) as t → ∞ for all bounded, Lipschitz functions e : X → R. The optimization problem we are interested in is, given a cost function e : X → R,

    min_{θ ∈ Θ} J(θ),  where J(θ) = ∫_X e(x) dπ_θ(x)

We approach this in a way that generalizes the deterministic approach described in the previous section. Our concern is with gradient estimation for this problem. The adjoint sensitivity analysis has already been described above, in the context of deterministic attractor networks. In Algorithm 5 we define essentially the same algorithm as described for deterministic networks, only now the algorithm operates in the stochastic environment:

Algorithm 5: Stochastic adjoint sensitivity analysis

  Define T_Adj : R^n × R^n × Θ × Ξ → R^n × R^n as
      T_Adj((x, y), θ, ξ) = ( f(x, θ, ξ),  y ∂f/∂x(x, θ, ξ) + ∂e/∂x(x) )
  for t = 0, 1, ..., M − 1 do
      (x(t+1), y(t+1)) = T_Adj((x(t), y(t)), θ, ξ(t+1))
  end
  Set Δ_Adj = y(M) ∂f/∂θ(x(M), θ, ξ(M))
  return Δ_Adj

Alternatively, the method of forward sensitivity analysis leads to the gradient estimation procedure shown in Algorithm 6.

Algorithm 6: Stochastic forward sensitivity analysis

  Define T_Fwd : R^n × R^{n×m} × Θ × Ξ → R^n × R^{n×m} as
      T_Fwd((x, u), θ, ξ) = ( f(x, θ, ξ),  ∂f/∂x(x, θ, ξ) u + ∂f/∂θ(x, θ, ξ) )
  for t = 0, 1, ..., M − 1 do
      (x(t+1), u(t+1)) = T_Fwd((x(t), u(t)), θ, ξ(t+1))
  end
  Set Δ_Fwd = ∂e/∂x(x(M)) u(M)
  return Δ_Fwd
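Below is a sketch of Algorithm 6 for the random-connection model, with the biases as parameters and e(x) = ‖x‖² (our simplifying assumptions). Averaging Δ_Fwd over independent runs gives an estimate of the sought derivative, in the sense of the research question stated in Section 3.1 below.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    s = sigma(x)
    return s * (1.0 - s)

# Algorithm 6 for the random-connection network, with the biases b as the
# parameters. The returned vector is one sample of Delta_Fwd.
def stochastic_forward_sensitivity(w, b, p, M, rng):
    n = len(b)
    x = np.zeros(n)
    u = np.zeros((n, n))                   # sensitivities dx/db
    for _ in range(M):
        xi = rng.random((n, n)) < p        # random connection mask
        a = (xi * w) @ x + b
        dfdx = dsigma(a)[:, None] * (xi * w)
        dfdb = np.diag(dsigma(a))
        x, u = sigma(a), dfdx @ u + dfdb   # one application of T_Fwd
    return 2 * x @ u                       # de/dx(x(M)) u(M), with e(x) = ||x||^2

rng = np.random.default_rng(2)
n = 4
w = 0.3 * rng.standard_normal((n, n))
b = rng.standard_normal(n)
est = np.mean([stochastic_forward_sensitivity(w, b, 0.8, 500, rng)
               for _ in range(200)], axis=0)
```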

3.1 Research question

In each case, we would like to know if the procedures do in fact help to calculate derivatives, as stated in the following:

Theorem 3.1. Let the network function f, the error function e, and the noise µ satisfy some reasonable assumptions. Then the derivative ∂/∂θ π_θ(e) can be computed using Algorithm 5 or Algorithm 6. That is, one of the following holds:
i. The process (x(t+1), u(t+1)) = T_Fwd(x(t), u(t), θ, ξ(t+1)) is ergodic and E[Δ_Fwd] → ∂J/∂θ(θ).
ii. The process (x(t+1), y(t+1)) = T_Adj(x(t), y(t), θ, ξ(t+1)) is ergodic and E[Δ_Adj] → ∂J/∂θ(θ).

3.2 Stochastic contraction

We will define a stochastic version of contraction. This will be useful for analyzing the stochastic sensitivity analysis procedures defined in Algorithms 5 and 6. Consider a stochastic algorithm of the form

    x(t+1) = f(x(t), ξ(t+1))    (12)

where the ξ(t) are i.i.d. µ-distributed random variables. Let P be the corresponding Markov kernel. It is possible to define a metric on the probability measures in such a way that contraction of the Markov kernel P on this metric space reduces to the deterministic notion of contraction in the case that the iterations (12) are actually deterministic, that is, not depending on ξ. This is done through the Wasserstein distance.

There are several equivalent definitions of the Wasserstein distance. A simple way to define it is as follows. For probability measures µ_1, µ_2 on a metric space,

    d(µ_1, µ_2) = sup_{‖e‖_Lip ≤ 1} |µ_1(e) − µ_2(e)| = sup_{‖e‖_Lip ≤ 1} | ∫_X e(x) dµ_1(x) − ∫_X e(x) dµ_2(x) |

To make sure this is well defined we need some requirements on the measures µ_1, µ_2. We define the Wasserstein space P(X) as follows:

    P(X) = { µ : ∫_X ‖x‖ dµ(x) < ∞ }

That is, P(X) consists of all the measures µ such that x ↦ ‖x‖ is µ-integrable. It can be shown that the metric space (P(X), d) is complete when X is complete [29]. That means the contraction mapping theorem can be used to test if a Markov kernel is ergodic: if the

Markov kernel P defines a contraction on this space, then µP^t converges to a unique invariant measure π for any starting measure µ. This is formalized below.

Proposition 3.2. Let P be a Markov kernel on a Polish space X, and let the following contraction condition hold:

    sup_{x_1 ≠ x_2} d(δ_{x_1} P, δ_{x_2} P) / d(x_1, x_2) =: ρ < 1    (13)

Then P has a unique stationary measure π, and for any initial measure µ and any Lipschitz function e,

    d(µP^t, π) ≤ ρ^t d(µ, π),
    |µP^t(e) − π(e)| ≤ ‖e‖_Lip ρ^t d(µ, π).

One can prove the following sufficient condition for contraction in terms of the derivative of f:

Proposition 3.3. Consider the system (12). Let ‖·‖ be a norm on X and suppose that
i. µP ∈ P(X) for all µ ∈ P(X),
ii. sup_x ∫_Ξ ‖∂f/∂x(x, ξ)‖ dµ(ξ) < 1.
Then the corresponding Markov kernel P is a contraction on the space P(X).

Processes that satisfy the conditions of Proposition 3.3 include iterated function systems [30, 31, 32, 33].

3.3 Approach to the problem

We outline our approach to showing, under certain conditions, that the forward sensitivity analysis procedure is valid for stochastic systems. We assume that P_θ is a contraction in the sense of (13). We also need to assume that the function f satisfies certain differentiability requirements, and that f and its derivatives satisfy certain integrability requirements relative to the noise measure µ.

First, we establish differentiability of π_θ. Define ∂/∂θ P_θ to be the linear map sending e to ∂/∂θ (P_θ e). We show that the equation

    l = l P_θ + π_θ ∂/∂θ P_θ    (14)

has a unique solution l*, and that this linear functional must be the derivative of π_θ, that is,

    l*(e) = ∂/∂θ π_θ(e)

Then we can show that, asymptotically, Algorithm 6 represents this linear functional. That is, assuming γ_θ is the stationary distribution of the recursion z(t+1) = T_Fwd(z(t), θ, ξ(t+1)), we can consider the linear functional

    l(e) = ∫_Z (∂e/∂x)(x) u dγ_θ(x, u)

If we can show that l also satisfies equation (14), then we are done, since the only solution is the stationary derivative.

4 Discrete models

In this section we discuss a type of stochastic network on a discrete space termed random threshold networks, also known as the Little model [7]. Gradient estimation has been studied for closely related models, such as the Boltzmann machine and sigmoid belief networks, as we discuss below. This model is somewhat more challenging since there is no known closed-form solution for the stationary distribution. Therefore one has to focus on gradient estimation methods that only use the Markov kernel associated to the process. We review some standard methods that can apply to this setting, including finite differences (FD), simultaneous perturbation analysis (SP), and measure valued differentiation (MVD). We then propose an estimator which combines features of SP and MVD. The algorithm generates a random direction, as in SP, and then uses measure valued differentiation to approximate this directional derivative.

4.1 Background

The earliest neural network models to be studied from the computational view were the deterministic threshold networks [34, 35]. In this model, each unit senses the states of its neighbors, takes a weighted sum of the values, and applies a threshold to determine its next state (either on or off). For single-layer versions of these networks, where the units are partitioned into input and output groups, with connections only from input to output nodes, the corresponding optimization problem can be solved by the perceptron algorithm.

Any iterative algorithm for optimizing threshold networks has to address the credit assignment problem. This means that during optimization, the algorithm must identify which internal components of the network are not working correctly, and adjust those units to improve the output. The difficulty in solving the credit assignment problem for threshold networks with multiple layers prevents simple deterministic threshold models from being used in complex problems like image recognition. There have been a number of well-known approaches to the problem. For instance, one can abandon the threshold units, and work with units that have a smooth, graded response, such as the sigmoid neural networks described above. In this case methods of calculus are available to determine unit sensitivities. These new networks are still deterministic but now operate on a continuous state space.

Another approach is to keep the space discrete but make the network probabilistic, and use the smoothing effects of the noise to obtain a model to which methods of calculus can be applied. One can interpret the sigmoid belief networks in this way. These networks were introduced in [8], and are so named because they combine features of sigmoid neural networks and Bayesian networks. In these networks, when a unit receives a large positive input it is very likely to turn on, while a large negative input means the unit is likely to remain off. In fact, these networks can be interpreted as threshold networks with random thresholds. The use of the sigmoid function, which is the cumulative distribution function (CDF) of the logistic distribution, leads to an interpretation as a network with thresholds drawn from the logistic distribution. In [8], the author derived formulas for the gradient in these networks, and showed how Markov chain Monte Carlo (MCMC) techniques can be used to implement gradient estimators. The networks studied in [8] had a feed-forward architecture, but one could also define variants that allow cycles among the connections. In that case, one would be interested in the long-term average behavior of the network. Such a generalization would resemble the random threshold networks that are our focus. It would be interesting to obtain a gradient estimator for these new networks.

Another motivation to study general random threshold networks comes from the Boltzmann machine [1]. This is a network of stochastic units that are connected symmetrically. This means there is feed-back in the network, and the problem in these networks is to optimize the long-term behavior. The symmetry in the network, and the use of the sigmoid function to calculate the probabilities, lead to a nice closed-form solution for the stationary measure in this model. Based on formulas for the stationary distribution, expressions for the gradient of long-term costs can be obtained, leading to MCMC-based gradient estimators. If one changes the model, by for instance using non-symmetric connections, or changing the type of nonlinearity, these formulas are no longer available. Instead, one winds up with a model like the random threshold networks. This provides another motivation for studying gradient estimation in the Little model.

The networks that we are studying were first defined in [7], and are sometimes referred to as the Little model. They can be interpreted as threshold networks where the thresholds are randomly chosen at each time step. For this reason we also refer to them as Random Threshold Networks (RTNs). We define a random threshold network as follows. Let the network have n nodes, and let ξ(1), ξ(2), ... be a sequence of noise vectors in R^n, with the entire collection {ξ_i(t) : i = 1, ..., n, t = 1, 2, ...} independent and distributed according to the logistic distribution. That is, the CDF of ξ_i(t) is P(ξ_i(t) < x) = σ(x) = 1/(1 + exp(−x)). Define f^RTN : {0, 1}^n × Θ × Ξ → {0, 1}^n as

    f_i^RTN(x, (w, b), ξ) = 1 if ∑_{j=1}^n w_{i,j} x_j + b_i > ξ_i, and 0 otherwise.    (15)

This function f^RTN and the noise ξ(1), ξ(2), ... determine the operation of the random threshold network; from the initial point x(0) one follows the recursion

    x(t+1) = f^RTN(x(t), θ, ξ(t+1))    (16)

to generate the next state. The transition probabilities for the RTN are as follows. We denote by P_θ(x^0, x^1) the probability of going to state x^1 ∈ {0, 1}^n from state x^0 ∈ {0, 1}^n. The function u_i(x, θ), which determines the input to node i at the state x, is defined as

    u_i(x, θ) = ∑_j w_{i,j} x_j + b_i

Then

    P_θ(x^0, x^1) = ∏_{i=1}^n σ(u_i(x^0, θ))^{x^1_i} (1 − σ(u_i(x^0, θ)))^{1 − x^1_i}

Alternatively, we can use the following notation of [8]: for x ∈ {0, 1}, define x̄ = 2x − 1. Using this together with the identity 1 − σ(x) = σ(−x), we get

    P_θ(x^0, x^1) = ∏_{i=1}^n σ( x̄^1_i u_i(x^0, θ) )    (17)

The RTN is related to the two models we discussed above, the sigmoid belief network and the Boltzmann machine. If the connectivity graph is acyclic, then one obtains a model resembling the sigmoid belief network. We can enforce this by requiring w_{i,j} = 0 if i < j. A model like the Boltzmann machine is obtained if the weights are symmetric, meaning w_{i,j} = w_{j,i}. Technically, if one puts a symmetry requirement on our threshold networks, one does not exactly recover the Boltzmann machine, but a variant known as the synchronous or parallel Boltzmann machine [2]. The synchronous Boltzmann machine also has a known, simple stationary distribution, as shown in [2]. We will not attempt to solve for the stationary distribution; rather, we are interested in constructing an optimization procedure using only knowledge of the transition function f^RTN.
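The following sketch simulates one synchronous step of the RTN by sampling logistic thresholds, and empirically checks the transition probability formula (17); the specific states and sample size are arbitrary.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

# One synchronous step of the random threshold network (15)-(16).
# A logistic threshold xi_i is drawn by inverse-CDF from a uniform; node i
# turns on when its input u_i(x, theta) exceeds the threshold, which
# happens with probability sigma(u_i), matching Equation (17).
def rtn_step(x, w, b, rng):
    u = w @ x + b
    v = rng.random(len(b))
    xi = np.log(v / (1.0 - v))             # logistic(0, 1) sample
    return (u > xi).astype(float)

# Empirical check of (17): the frequency of a transition x0 -> x1 should
# approach prod_i sigma((2 x1_i - 1) * u_i(x0, theta)).
rng = np.random.default_rng(3)
n = 3
w, b = rng.standard_normal((n, n)), rng.standard_normal(n)
x0 = np.array([1.0, 0.0, 1.0])
x1 = np.array([0.0, 0.0, 1.0])
hits = sum(np.array_equal(rtn_step(x0, w, b, rng), x1) for _ in range(200000))
u = w @ x0 + b
print(hits / 200000, np.prod(sigma((2 * x1 - 1) * u)))
```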

The differentiability of P_θ is easily established from properties of the sigmoid function. As we are dealing with a discrete space, the general methods available for computing the derivative include finite differences, the score-function method, and measure valued differentiation. We briefly review these now.

The finite difference estimator for the derivative of the stationary cost is shown in Algorithm 7. In the case of a single, real-valued parameter, the finite difference method uses two copies of the system to estimate the derivative at a particular point θ(0). One copy of the system runs with the setting θ(0) + λ, and one uses θ(0) − λ. After running for a long time, the error in both of these systems is sampled, and a difference quotient is formed as the gradient estimate. The extension to m parameters involves replicating the procedure m times, once for each coordinate direction. Typically, when dealing with an n-node neural network there are n² parameters. Therefore finite differences would require running 2n² copies of the network, which is infeasible. Furthermore, it has unfavorable variance properties.

Algorithm 7: Finite difference derivative estimation for Markov chains

  Let v_1, v_2, ..., v_m be the coordinate basis vectors in R^m.
  Define T_FD : X^m × X^m × R^m × Ξ → X^m × X^m as
      T_FD_i(x, y, θ, ξ) = ( f(x_i, θ + λv_i, ξ), f(y_i, θ − λv_i, ξ) ),  i = 1, 2, ..., m.
  for t = 0, 1, ..., M − 1 do
      (x(t+1), y(t+1)) = T_FD(x(t), y(t), θ, ξ(t+1))
  end
  Set Δ_FD = ∑_{i=1}^m [ (e(x_i(M)) − e(y_i(M))) / (2λ) ] v_i
  return Δ_FD

One interesting solution to the cost of estimating derivatives with finite differences is known as simultaneous perturbation [36]. In this scheme, one picks a random direction v, and then approximates the directional derivative along v using stochastic finite differences, as in Algorithm 7. In this case only two simulations are needed for a system with m parameters. The variance issues remain with this approach: in order to decrease the bias of the estimator, one has to accept a larger variance. For generating the directions, one possibility is to let v be a random point on the hypercube {−1/2, 1/2}^m, as suggested in [36]. For the theoretical analysis, it is important that the directions have mean 0, and that the random variable 1/v_i is integrable. The procedure is shown in Algorithm 8.

Algorithm 8: Simultaneous perturbation derivative estimation for Markov chains

  Sample a direction v from the measure
      P(v) = ∏_{i=1}^m [ (1/2) δ_{−1/2}(v_i) + (1/2) δ_{1/2}(v_i) ]    (18)
  Define T_SP : X × X × Ξ → X × X as
      T_SP(x, y, ξ) = ( f(x, θ + λv, ξ), f(y, θ − λv, ξ) )
  for t = 0, 1, ..., M − 1 do
      (x(t+1), y(t+1)) = T_SP(x(t), y(t), ξ(t+1))
  end
  Set Δ_SP = [ (e(x(M)) − e(y(M))) / (2λ) ] v
  return Δ_SP
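Below is a sketch of Algorithm 8 for a generic parametric Markov chain; the assumed interface f(x, θ, ξ), with ξ a vector of uniforms, and the use of common random numbers to drive both chains are our choices, not specified in the proposal.

```python
import numpy as np

# Simultaneous perturbation (Algorithm 8) for a Markov chain
# x(t+1) = f(x(t), theta, xi). Both copies see the same noise xi,
# which typically reduces the variance of the difference quotient.
def sp_estimate(f, theta, e, x0, lam, M, rng):
    m = len(theta)
    v = rng.choice([-0.5, 0.5], size=m)    # direction sampled from (18)
    x, y = x0.copy(), x0.copy()
    for _ in range(M):
        xi = rng.random(len(x0))           # shared noise for both chains
        x = f(x, theta + lam * v, xi)
        y = f(y, theta - lam * v, xi)
    return (e(x) - e(y)) / (2 * lam) * v
```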

The idea of measure valued differentiation is to express the derivative of an expectation as the difference of two expectations. Each of these expectations involves the same cost function of interest, but the underlying measures are different. If these measures are easy to sample from, this leads to a simple, unbiased derivative estimator. Formally, we consider a family of cost functions D and a measure µ_θ that depends on a real parameter θ. For simplicity we will consider the setting of a finite state space X in these definitions. Then µ_θ is said to be D-differentiable at θ if there is a triple (c_θ, µ̇_θ, µ̈_θ), consisting of a real number c_θ and two probability measures µ̇_θ, µ̈_θ on X, such that

    ∂/∂θ µ_θ(e) = c_θ ( µ̇_θ(e) − µ̈_θ(e) )

for any cost function e ∈ D. An MVD gradient estimator would consist of two parts: first, sample a random variable Ẏ distributed according to µ̇_θ; then sample a random variable Ÿ according to µ̈_θ; and finally form the estimate Δ_MVD = c_θ [e(Ẏ) − e(Ÿ)]. Compared to finite differences, the advantage is that there is no bias and there is no division by a small number. For some background see [37]. The following is a simple example.

Example 4.1. Let ν_1, ν_2 be two probability measures on a measurable space X, and define the measure

    µ_θ = e^{−θ²} ν_1 + (1 − e^{−θ²}) ν_2    (19)

that depends on a parameter θ ∈ R. The parameter determines which of the measures ν_i is more likely in this mixture. By simple calculus, it holds that for any bounded measurable function e : X → R,

    ∂/∂θ µ_θ(e) = c_θ [ ν_2(e) − ν_1(e) ],  where c_θ = 2θ e^{−θ²}

Therefore the triple (c_θ, ν_2, ν_1) is an MVD of the measure µ_θ.
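Example 4.1 can be checked numerically. In the sketch below we take, as an arbitrary illustration, ν_1 = N(0, 1), ν_2 = N(2, 1), and e(x) = x², so that ν_1(e) = 1 and ν_2(e) = 5.

```python
import numpy as np

# Numerical check of Example 4.1: the MVD estimator
# c_theta * (e(Y_dot) - e(Y_ddot)), with Y_dot ~ nu_2 and Y_ddot ~ nu_1,
# is an unbiased estimate of the derivative of mu_theta(e).
rng = np.random.default_rng(4)
theta = 0.7
c = 2 * theta * np.exp(-theta ** 2)

N = 200000
y_dot = rng.normal(2.0, 1.0, N)            # samples from nu_2
y_ddot = rng.normal(0.0, 1.0, N)           # samples from nu_1
mvd = c * (y_dot ** 2 - y_ddot ** 2)

exact = c * (5.0 - 1.0)                    # c_theta * (nu_2(e) - nu_1(e))
print(mvd.mean(), exact)                   # the two should agree closely
```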

The concept of MVD can be extended from measures to Markov kernels, and then applied to derivatives of stationary costs. A Markov kernel P_θ that is defined on a discrete space X and depends on a real parameter θ is said to be D-differentiable at θ if for each x ∈ X there is a triple (c_θ(x), Ṗ_θ(x, ·), P̈_θ(x, ·)) which is the measure valued derivative of the measure P_θ(x, ·) for the cost functions D at the parameter θ. If the Markov kernel P_θ is ergodic with stationary measure π_θ, then in certain cases we can use the triples (c_θ(x), Ṗ_θ(x, ·), P̈_θ(x, ·)) to compute the stationary derivative ∂/∂θ π_θ(e) [38]. We now describe this. Let the Markov kernels Ṗ and P̈ be represented by the functions ḟ and f̈, meaning

    Ṗ(x, A) = µ({ξ : ḟ(x, θ, ξ) ∈ A}),  P̈(x, A) = µ({ξ : f̈(x, θ, ξ) ∈ A})

The procedure is presented in Algorithm 9. A theorem about its correctness is given in Theorem 4.2. For measure valued differentiation, as for finite differences, it seems that m simulations are required for a system with m parameters, although the variance characteristics are much more favorable compared to finite differences (see [39], Section 4.3). In finite differences, one must trade off bias against variance, but for MVD the variance can be shown to be bounded independently of the parameters M_1, M_0 which determine the bias.

Algorithm 9: MVD gradient estimation for Markov chains

  for t = 0, 1, ..., M_0 − 1 do
      x(t+1) = f(x(t), θ, ξ(t))
  end
  Set ẋ(0) = ḟ(x(M_0), θ, ξ(M_0)) and ẍ(0) = f̈(x(M_0), θ, ξ(M_0))
  for t = 0, 1, ..., M_1 − 1 do
      (ẋ(t+1), ẍ(t+1)) = ( f(ẋ(t), θ, ξ(M_0 + t + 1)), f(ẍ(t), θ, ξ(M_0 + t + 1)) )
  end
  Set Δ_MVD = c_θ(x(M_0)) ∑_{t=1}^{M_1} [ e(ẋ(t)) − e(ẍ(t)) ]
  return Δ_MVD

4.2 Stationary differentiability using MVD

In this section we recall a theorem on measure valued differentiation for Markov chains. It gives a condition on a Markov kernel P_θ that guarantees the corresponding stationary costs π_θ(e) are differentiable. This result is from [38].

Theorem 4.2. Let (δ_x P_θ)(e) be differentiable for each bounded, Lipschitz continuous e. That is, for each x there is a triple (c(x), Ṗ_θ(x, ·), P̈_θ(x, ·)) such that

    ∂/∂θ (δ_x P_θ)(e) = c(x) [ (δ_x Ṗ_θ)(e) − (δ_x P̈_θ)(e) ]

for each bounded, Lipschitz e. Furthermore, suppose that P_θ is a contraction on the space P(X) in the sense of Inequality (13). Then the stationary cost π_θ(e) is differentiable, and Algorithm 9 can be used to estimate the derivative. Specifically, if we let Δ_MVD be the output of the algorithm, then

    lim_{M_0, M_1 → ∞} E[Δ_MVD] = ∂/∂θ π_θ(e)

More general results on MVD for stationary measures can be found in [40].
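A skeletal sketch of Algorithm 9 follows; the kernel interfaces f, f_dot, f_ddot and the scalar-noise convention are assumptions made for illustration, since the proposal specifies the algorithm only at the level of pseudocode.

```python
# A generic sketch of Algorithm 9. The caller supplies the nominal kernel f,
# the two derivative kernels f_dot and f_ddot, and the weight c(x) from the
# measure valued derivative triple; all signatures here are assumptions.
def mvd_estimate(f, f_dot, f_ddot, c, e, theta, x0, M0, M1, rng):
    x = x0
    for _ in range(M0):                    # run to approximate stationarity
        x = f(x, theta, rng.random())
    xi0 = rng.random()                     # split point: common noise xi0
    x_dot, x_ddot = f_dot(x, theta, xi0), f_ddot(x, theta, xi0)
    total = 0.0
    for _ in range(M1):                    # coupled chains, common noise
        xi = rng.random()
        x_dot, x_ddot = f(x_dot, theta, xi), f(x_ddot, theta, xi)
        total += e(x_dot) - e(x_ddot)
    return c(x) * total                    # c_theta(x(M0)) times the sum
```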

4.3 Research question

Using the above ideas, we propose a gradient estimator that works by picking a random direction, as in SP, and computing the directional derivative using measure valued differentiation. In this way one deals with a small number of simulations, as in SP, while avoiding the variance issues of finite differences. The method is termed simultaneous perturbation measure valued differentiation. The only requirement is that one can compute the measure valued derivative along arbitrary directions.

Let µ_θ be a measure depending on an m-dimensional vector parameter θ, and let v ∈ R^m be a direction. A triple (c_{θ,v}, µ̇_{θ,v}, µ̈_{θ,v}) is called a measure valued directional derivative in the direction v at θ if for all e : X → R,

    ∂/∂θ µ_θ(e) v = c_{θ,v} [ µ̇_{θ,v}(e) − µ̈_{θ,v}(e) ]

In practice, one can try to calculate the MVD in direction v as follows. Consider the measure µ̂_λ = µ_{θ+λv}, which depends on the real parameter λ. Then by basic calculus,

    ∂/∂θ µ_θ(e) v = ∂/∂λ µ̂_λ(e) |_{λ=0}

Therefore, to find the MVD of µ_θ in direction v it suffices to find the normal, scalar MVD of µ̂_λ at λ = 0. This is the approach in the following example.

Example 4.3. Let µ_1, µ_2, ..., µ_m be m mixture components, and for a vector parameter θ ∈ R^m define µ_θ as

    µ_θ = ∑_{i=1}^m (e^{θ_i} / Z_θ) µ_i,  where Z_θ = ∑_{i=1}^m e^{θ_i}

For any function e : X → R and direction v ∈ R^m, we then have

    µ_{θ+λv}(e) = ∑_{i=1}^m ( e^{θ_i + λv_i} / Z_{θ+λv} ) µ_i(e)    (20)

Introduce the notation γ⁺ = max{0, γ} and γ⁻ = −min{0, γ}, so that for any number γ, the identities γ = γ⁺ − γ⁻ and |γ| = γ⁺ + γ⁻ hold. Differentiating (20) at λ = 0 gives

    ∂/∂θ µ_θ(e) v = ∂/∂λ µ_{θ+λv}(e) |_{λ=0} = ∑_{i=1}^m γ_i µ_i(e),

where γ_i = (e^{θ_i}/Z_θ)( v_i − J_θ/Z_θ ) and J_θ = ∑_{j=1}^m e^{θ_j} v_j. Since ∑_i γ_i = 0, collecting the positive and negative parts of these coefficients and doing some algebra yields a representation of the required form:

    ∂/∂θ µ_θ(e) v = c_θ [ ∑_{i=1}^m α_i µ_i(e) − ∑_{i=1}^m β_i µ_i(e) ],

where c_θ = ∑_i γ_i⁺ = ∑_i γ_i⁻, α_i = γ_i⁺ / c_θ, and β_i = γ_i⁻ / c_θ, so that α and β are probability vectors. The triple (c_θ, ∑_i α_i µ_i, ∑_i β_i µ_i) is then a measure valued directional derivative of µ_θ in the direction v.
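The directional derivative derived in Example 4.3 can be checked numerically against a finite difference in λ; the dimensions and the values standing in for µ_i(e) below are arbitrary.

```python
import numpy as np

# Numerical check of Example 4.3: compare the directional derivative of
# mu_theta(e), computed from the gamma_i coefficients, against a central
# finite difference in lambda. Fixed numbers stand in for the mu_i(e).
rng = np.random.default_rng(5)
m = 4
theta, v = rng.standard_normal(m), rng.standard_normal(m)
mu_e = rng.standard_normal(m)              # the values mu_i(e)

def mix_cost(th):
    p = np.exp(th) / np.exp(th).sum()      # softmax mixture weights
    return p @ mu_e

Z = np.exp(theta).sum()
J = np.exp(theta) @ v
gamma = (np.exp(theta) / Z) * (v - J / Z)
analytic = gamma @ mu_e

lam = 1e-6
fd = (mix_cost(theta + lam * v) - mix_cost(theta - lam * v)) / (2 * lam)
print(analytic, fd)                        # the two should agree closely

# The MVD form: split gamma into its positive and negative parts.
c = np.maximum(gamma, 0).sum()
alpha = np.maximum(gamma, 0) / c
beta = np.maximum(-gamma, 0) / c
print(c * (alpha @ mu_e - beta @ mu_e))    # equals the analytic value
```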


Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Machine Learning Basics III

Machine Learning Basics III Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient

More information

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler + Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions

More information

Ch.6 Deep Feedforward Networks (2/3)

Ch.6 Deep Feedforward Networks (2/3) Ch.6 Deep Feedforward Networks (2/3) 16. 10. 17. (Mon.) System Software Lab., Dept. of Mechanical & Information Eng. Woonggy Kim 1 Contents 6.3. Hidden Units 6.3.1. Rectified Linear Units and Their Generalizations

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University

More information

Extreme Value Analysis and Spatial Extremes

Extreme Value Analysis and Spatial Extremes Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Introduction to Machine Learning Spring 2018 Note Neural Networks

Introduction to Machine Learning Spring 2018 Note Neural Networks CS 189 Introduction to Machine Learning Spring 2018 Note 14 1 Neural Networks Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. In this class,

More information

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Neural networks. Chapter 19, Sections 1 5 1

Neural networks. Chapter 19, Sections 1 5 1 Neural networks Chapter 19, Sections 1 5 Chapter 19, Sections 1 5 1 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 19, Sections 1 5 2 Brains 10

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Probabilistic Models in Theoretical Neuroscience

Probabilistic Models in Theoretical Neuroscience Probabilistic Models in Theoretical Neuroscience visible unit Boltzmann machine semi-restricted Boltzmann machine restricted Boltzmann machine hidden unit Neural models of probabilistic sampling: introduction

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Artificial Neural Networks 2

Artificial Neural Networks 2 CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Artificial Neural Networks

Artificial Neural Networks Introduction ANN in Action Final Observations Application: Poverty Detection Artificial Neural Networks Alvaro J. Riascos Villegas University of los Andes and Quantil July 6 2018 Artificial Neural Networks

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks. MGS Lecture 2 Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

More information

Artificial Neural Network

Artificial Neural Network Artificial Neural Network Eung Je Woo Department of Biomedical Engineering Impedance Imaging Research Center (IIRC) Kyung Hee University Korea ejwoo@khu.ac.kr Neuron and Neuron Model McCulloch and Pitts

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

CSC 411 Lecture 10: Neural Networks

CSC 411 Lecture 10: Neural Networks CSC 411 Lecture 10: Neural Networks Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 10-Neural Networks 1 / 35 Inspiration: The Brain Our brain has 10 11

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

CSC242: Intro to AI. Lecture 21

CSC242: Intro to AI. Lecture 21 CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

QUALIFYING EXAM IN SYSTEMS ENGINEERING

QUALIFYING EXAM IN SYSTEMS ENGINEERING QUALIFYING EXAM IN SYSTEMS ENGINEERING Written Exam: MAY 23, 2017, 9:00AM to 1:00PM, EMB 105 Oral Exam: May 25 or 26, 2017 Time/Location TBA (~1 hour per student) CLOSED BOOK, NO CHEAT SHEETS BASIC SCIENTIFIC

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

16 : Markov Chain Monte Carlo (MCMC)

16 : Markov Chain Monte Carlo (MCMC) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Gradient Descent Training Rule: The Details

Gradient Descent Training Rule: The Details Gradient Descent Training Rule: The Details 1 For Perceptrons The whole idea behind gradient descent is to gradually, but consistently, decrease the output error by adjusting the weights. The trick is

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Stanford Machine Learning - Week V

Stanford Machine Learning - Week V Stanford Machine Learning - Week V Eric N Johnson August 13, 2016 1 Neural Networks: Learning What learning algorithm is used by a neural network to produce parameters for a model? Suppose we have a neural

More information

Revision: Neural Network

Revision: Neural Network Revision: Neural Network Exercise 1 Tell whether each of the following statements is true or false by checking the appropriate box. Statement True False a) A perceptron is guaranteed to perfectly learn

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

Different Criteria for Active Learning in Neural Networks: A Comparative Study

Different Criteria for Active Learning in Neural Networks: A Comparative Study Different Criteria for Active Learning in Neural Networks: A Comparative Study Jan Poland and Andreas Zell University of Tübingen, WSI - RA Sand 1, 72076 Tübingen, Germany Abstract. The field of active

More information

Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Learning Lab Course 2017 (Deep Learning Practical) Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of

More information

Deep Feedforward Networks. Sargur N. Srihari

Deep Feedforward Networks. Sargur N. Srihari Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Organization. I MCMC discussion. I project talks. I Lecture.

Organization. I MCMC discussion. I project talks. I Lecture. Organization I MCMC discussion I project talks. I Lecture. Content I Uncertainty Propagation Overview I Forward-Backward with an Ensemble I Model Reduction (Intro) Uncertainty Propagation in Causal Systems

More information

Information theoretic perspectives on learning algorithms

Information theoretic perspectives on learning algorithms Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez

More information

Lecture: Local Spectral Methods (1 of 4)

Lecture: Local Spectral Methods (1 of 4) Stat260/CS294: Spectral Graph Methods Lecture 18-03/31/2015 Lecture: Local Spectral Methods (1 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide

More information

Artificial Neural Networks Examination, March 2004

Artificial Neural Networks Examination, March 2004 Artificial Neural Networks Examination, March 2004 Instructions There are SIXTY questions (worth up to 60 marks). The exam mark (maximum 60) will be added to the mark obtained in the laborations (maximum

More information

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information