The Calculus of Jacobian Adaptation

Gary William Flake
NEC Research Institute / 4 Independence Way / Princeton, NJ

Abstract

For many problems, the correct behavior of a model depends not only on its input-output mapping but also on properties of its Jacobian matrix, the matrix of partial derivatives of the model's outputs with respect to its inputs. This paper introduces the J-prop algorithm, an efficient general method for computing the exact partial derivatives of a variety of simple functions of the Jacobian of a model with respect to its free parameters. The algorithm applies to any parameterized, twice-differentiable, data-driven feedforward model, including nonlinear regression, multilayer perceptrons, and radial basis function networks. J-prop has as special cases some earlier algorithms, such as Tangent Prop and Double Backpropagation, but it can also be used to optimize the eigenvectors, eigenvalues, or determinant of the Jacobian, making it quite general with respect to the model types, applications, and objective functions that can be used. Applications of J-prop include forcing stability and sensitivity conditions, regularization, building low-complexity encoder-decoder networks, exploiting prior knowledge, designing controllers, and blind source separation.
Keywords: Jacobian, derivative, gradient, numerical optimization, calculus

CONTENTS

1 Introduction
2 Background
3 J-Prop Derivation
  3.1 Intuitive Derivation
  3.2 Generic Derivation
    3.2.1 Auxiliary Error Function
    3.2.2 Differential Operator
    3.2.3 Equivalence
    3.2.4 Method Applied to MLPs
    3.2.5 Complete Algorithm
4 Numerical Example
5 J-Prop Recipes
  5.1 Training with Explicit Target Derivatives
  5.2 Tangent Prop
  5.3 Double Backpropagation
  5.4 Enforcing Identical I/O Sensitivities
  5.5 Targeting Eigenvectors & Eigenvalues
  5.6 Targeting the Largest Eigenvalue
  5.7 Minimizing Eigenvalue Bounds
  5.8 Optimizing Lyapunov Exponents
  5.9 Optimizing the Determinant
  5.10 Optimizing Quadratic Forms
6 Conclusions

1 INTRODUCTION

This paper introduces a general method by which constraints on the sensitivity of a model can be enforced through a gradient-based method. While this is a useful thing to do in the general case (from the point of view of regularization), there are also many applications in which specific properties of these sensitivities are known in advance and can therefore be exploited.

Let f(x, w) be an n-input, m-output, twice-differentiable feedforward model parameterized by an input vector, x, and a weight vector, w. Its Jacobian is

    J = dy/dx = [ dy_1/dx_1 ... dy_1/dx_n ]
                [    ...    ...    ...    ]
                [ dy_m/dx_1 ... dy_m/dx_n ]

The algorithm introduced can be used to optimize functions of the form

    E_u = (1/2) || J^T u - a ||^2        (1)

or

    E_v = (1/2) || J v - b ||^2          (2)

where u, v, a, and b are user-defined constants. The proposed algorithm, J-prop, can be used to calculate the exact value of both dE_u/dw and dE_v/dw in O(1) times the time required to calculate the normal gradient. Thus, J-prop is suitable for training models to have specific first derivatives, or for implementing several other well-known algorithms such as Double Backpropagation (Drucker & Le Cun, 1992) and Tangent Prop (Simard, Victorri, Le Cun & Denker, 1992). Suitable choices of a, b, u, and v make the functions E_u and E_v surprisingly expressive. For example, with u set to

a unit vector, a can be interpreted as a target value for a particular row of the Jacobian matrix. As will be shown, other modifications can be used to optimize the eigenvalues, the eigenvectors, or even the determinant of J. Clearly, being able to optimize Equations 1 and 2 is useful; however, the formalism used to derive the algorithm may actually be more interesting because it allows J-prop to be easily modified to apply to a wide variety of model types and objective functions. As such, a good portion of this paper is devoted to describing the mathematical framework from which J-prop is built.

The remainder of this paper is divided into five sections. Section 2 contains a brief overview of prior research related to the accuracy of the first derivative of a model. Both descriptive and prescriptive works are covered. In Section 3, the J-prop algorithm is derived both in an intuitive manner and in a purely analytical form. The algorithm is also shown applied to a multilayer perceptron as an example of how to use the algorithm with a specific model class. Section 4 contains a brief numerical example illustrating the correctness of the derivation. In Section 5, several special cases of J-prop, recipes that can be used for a variety of applications, are described. In each case, an objective function is defined along with the modifications to the basic underlying algorithm that must be made to compute the gradient of that function. These modifications usually involve only a simple setting of user-defined parameters. Some simple applications of the modified objective functions are also described. Section 6 concludes the paper with a summary.

2 BACKGROUND

Previous work concerning the modeling of a function and its derivatives can be divided into works that are descriptive or prescriptive.
One well-known descriptive result is due to (White & Gallant, 1992) and (White, Hornik & Stinchcombe, 1990), who show that given noise-free data a multilayer perceptron (MLP) can approximate the higher derivatives of a function in the limit as the number of training points goes to infinity. The difficulty with applying this result is the strong requirements on the amount and integrity of the training data, requirements which are rarely met in practice. This problem was specifically demonstrated by (Principe, Rathie & Kuo, 1992) and (Deco & Schürmann, 1997), who showed that using noisy training data from chaotic systems can lead to models that are accurate in the input-output sense, but inaccurate in their estimates of quantities related to the Jacobian of the unknown system, such as the largest Lyapunov exponent and the correlation dimension. Deco and Schürmann proposed a modified objective function and learning method to specifically penalize inaccurate Lyapunov exponents through a forgetting factor related to the largest Lyapunov exponent. While this improved results, the solution is only applicable to the domain of MLPs and chaotic systems.

Early in their development of radial basis function networks (RBFNs), (Poggio & Girosi, 1990) showed that RBFNs have the desirable property that one can effectively place a bound on the errors of all of the higher derivatives of the model. The key idea here is that the choice of how to penalize the errors of successive derivatives determines an appropriate basis function class via regularization theory. While this is an elegant result, it is not applicable to situations in which the form of the model is determined in advance, nor does it directly address the simple problem of the derivatives being inaccurate at a specific point in the input space.
The issue of a model's estimate of an unknown function's derivative being inaccurate is particularly problematic in MLPs because large weights can lead to saturation at a particular sigmoidal neuron which, in turn, results in extremely large first derivatives at the neuron when evaluated near the center of the sigmoid transition. Several methods to combat this type of over-fitting have been proposed.

One of the earliest methods, weight decay (Hinton, 1986), uses a penalty term on the magnitude of the weights. Weight decay is arguably optimal for models in which the output is linear in the weights, because minimizing the magnitude of the weights is equivalent to minimizing the magnitude of the model's first derivatives. However, in the nonlinear case, weight decay can have suboptimal performance (Drucker & Le Cun, 1992) because large (or small) weights do not always correspond to having large (or small) first derivatives.

Training with noisy inputs (Sietsma & Dow, 1988) is another method often used to regularize the first derivatives of a model. The simplest methods add a fixed amount of noise to the training data input vectors. More complicated versions, such as cleaning (Weigend, Zimmermann & Neuneier, 1995), adaptively learn a simple noise model on the input space, which allows adaptation to be specific to the noise levels in the training data. When compared to weight decay, training with noise has the advantage that it explicitly penalizes large first derivatives of the model; however, it is strictly a numerical approximation to something that can be done analytically. For example, the Double Backpropagation algorithm (Drucker & Le Cun, 1992) adds to the quadratic error function, E(x), an additional penalty term equal to (1/2) || dE/dx ||^2.
Training on this function results in a form of regularization that is in many ways an elegant combination of weight decay and training with noise: it is strictly analytic (unlike training with noise) but it explicitly penalizes large first derivatives of the model (unlike weight decay). Double Backpropagation can be seen as a special case of J-prop, the algorithm derived in the next section.

As to the general problem of coercing the first derivatives of a model to specific values, (Simard, Victorri, Le Cun & Denker, 1992) introduced the Tangent Prop algorithm, which was used to train MLPs for optical character recognition to be insensitive to small affine transformations in the character space. Tangent Prop is very similar to the algorithm proposed

in this paper but differs in several respects. First, its derivation is very specific to the application for which it was developed. Second, the derivation itself is not easy to generalize to models other than MLPs. And, third, Tangent Prop uses an objective function that cannot be easily generalized. In short, J-prop is more general in the models, applications, and objective functions that it can be used for when compared to previous related work. As such, the proposed formalism for J-prop allows an entirely new class of learning algorithms to be explored.

3 J-PROP DERIVATION

This section describes two methods for calculating dE_u/dw and dE_v/dw for arbitrary choices of u, v, a, and b. An intuitive graphical derivation is first shown for a modified network similar to an MLP. Afterwards, a more detailed derivation is presented that uses a special differential operator.

Before we proceed, a brief explanation of the notation is required to avoid confusion. When referring to generic data-driven models, the model will always be expressed as a vector function y = f(x, w), where x refers to the model input and w refers to a vector of all of the tunable parameters of the model. The term ȳ should be understood as the target value of the model, while the residual is written as e = y - ȳ. In this way, we can talk about models while ignoring the mechanics of how the models work internally. Complementary to the generic vector notation, a multilayer perceptron is used as a specific example of how to use these results. The notation for an MLP uses only scalar symbols; however, these symbols must refer to internal variables of the model (e.g., neuron thresholds, net inputs, weights, etc.), which can lead to some ambiguity. To be clear, when using vector notation, the input and output of an MLP will always be denoted by x and y, respectively, and the collection of all of the weights (including biases) maps to the vector w. However, when using scalar arithmetic, the scalar MLP notation will apply.
In later portions of this paper, objective functions are often rewritten to resemble either Equation 1 or 2. The objective function is labelled as either E_u or E_v to denote this fact.

3.1 Intuitive Derivation

Consider an MLP with L layers of nodes defined by the equations

    y_i^l = g(x_i^l)                                              (3)
    x_i^l = Sum_{j=1}^{N_{l-1}} w_ij^l y_j^{l-1} + theta_i^l      (4)

In these equations, superscripts denote the layer number (starting at 0), subscripts index over terms in a particular layer, and N_l is the number of nodes in layer l. Thus, y_i^l is the output of neuron i in node layer l, and x_i^l is the net input coming into the same neuron. Moreover, y_i^L is an output of the entire MLP while y_i^0 is an input going into the MLP.

Let E be a function for which we would like to calculate dE/dx. At this point, we make no assumptions about the specific form of E; that is, it may be the normal quadratic error function or possibly something totally unrelated to a typical error measure. With these rules in place, we can write a portion of the feedback equations required to calculate dE/dx as:

    dE/dx_i^l = (dE/dy_i^l) g'(x_i^l)                                         (5)
    dE/dy_i^l = Sum_{j=1}^{N_{l+1}} w_ji^{l+1} (dE/dx_j^{l+1})   (for l < L)  (6)

Note that we are purposely omitting equations for dE/dw_ij^l and dE/dtheta_i^l because they are not needed at this time. In the normal case of performing gradient-based learning, Equation 6 is only applied to the node layers that are neither inputs nor outputs, that is, l is neither 0 nor L. Nevertheless, if one were to continue the feedback procedure all of the way to the MLP inputs, thus using Equation 6 with l equal to 0, the computed values will be equal to the partial derivatives of E with respect to the MLP inputs. This is significant because dE/dx = J^T (dE/dy); thus, by choosing E so that dE/dy is equal to a vector that we would like to multiply by the Jacobian matrix, we can easily calculate the product of the transpose of the Jacobian and an arbitrary vector. Of course, this is old news, as it is common practice to choose dE/dy to be equal to a unit vector so that one can extract the rows of the Jacobian matrix.
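The observation that a feedback pass seeded with dE/dy = u yields J^T u at the inputs can be checked numerically. The sketch below (a minimal one-hidden-layer tanh MLP with illustrative variable names, not the paper's code) backpropagates u to the input layer and compares the result against a finite-difference Jacobian:

```python
import numpy as np

# A tiny 2-input, 3-hidden, 2-output tanh MLP with fixed random weights.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

def forward(x):
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2, h

x = np.array([0.3, -0.7])
u = np.array([1.0, 0.0])        # unit vector: picks off the first row of J

# Feedback pass: inject u at the output and propagate it to the input layer.
y, h = forward(x)
dh = W2.T @ u * (1.0 - h**2)    # back through the tanh hidden layer
dx = W1.T @ dh                  # dE/dx = J^T u

# Compare against a finite-difference estimate of the Jacobian.
eps = 1e-6
J = np.array([[(forward(x + eps * np.eye(2)[j])[0][i] - y[i]) / eps
               for j in range(2)] for i in range(2)])
assert np.allclose(dx, J.T @ u, atol=1e-4)
```

Choosing u as each unit vector in turn extracts the rows of J one backward pass at a time, exactly as described above.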
What is interesting, however, is that all of Equations 3-6 can be implemented by a single, specialized, feedforward network that calculates the normal feedback pass of an MLP as well as the normal feedforward pass. This is illustrated in Figure 1, which shows two hidden MLP neurons that are functions of two earlier neurons. The first half of the network does the normal part of the feedforward pass (i.e., it performs Equations 3 and 4 through the single layers of weights). However, the second half of the network implements Equations 5 and 6. Note that the dashed lines leading from the neuron outputs to the dE/dy_i^l values are there to indicate that a particular partial derivative may or may not be a function of the neuron's output. Moreover, while the network contains two layers of weights, they are, in fact, the same weight layer, only transposed with respect to each other. Since this augmented network is strictly feedforward, in the sense that information always flows from the left to the right, it is possible to differentiate the final output of the network with respect to the weights. However, since the weights are shared in multiple locations, care must be taken to accumulate the gradient information in the shared weights.

Figure 1: A network illustrating the fact that J^T u can be computed by performing a backprop-like procedure on a special feedforward network that calculates the normal feedforward and feedback passes of standard backprop.

For any twice-differentiable model we could construct a similar specialized network that computes the partial derivatives of a user-defined function with respect to the model inputs. Obviously, such a network would be cumbersome to construct for an MLP with many hidden layers, and applying backprop to this new network would also be difficult. But the trick to differentiating the network comes down to doing it in a manner in which the chain rule does not require too much bookkeeping. We will see how to do this in the next subsection.

3.2 Generic Derivation

In this subsection a generic method is defined that can be used to calculate dE_u/dw and dE_v/dw for any model, f(x, w), that is twice differentiable. The method is very similar to a technique introduced by (Pearlmutter, 1994) for calculating the product of the Hessian of an MLP and an arbitrary vector; however, where Pearlmutter used differential operators applied to a model's weight space, this work uses a differential operator defined with respect to a model's input space.

The method is presented in five steps. First, we will define an auxiliary error function that has a few useful mathematical properties that simplify the derivation. Next, we will define a special differential operator that can be applied to both the auxiliary error function and its gradient with respect to the weights. We will then see that the result of applying the differential operator to the gradient of the auxiliary error function is equivalent to analytically calculating dE_u/dw or dE_v/dw. This is followed by an example of the technique applied to a multilayer perceptron.
Finally, in the last step, the complete algorithm is presented for calculating any part of dE_u/dw or dE_v/dw for any choice of a, b, u, and v, with the one caveat that we assume (only at this time) that u and v are independent of the model input and output.

3.2.1 Auxiliary Error Function

Our auxiliary error function, Ê, is defined as

    Ê(x, w) = u^T f(x, w)        (7)

Note that we never actually optimize with respect to Ê; we define it only because it has the property that dÊ/dx = J^T u, which will be useful to the derivation shortly. Note that dÊ/dx appears in the Taylor expansion of Ê about a point in input space:

    Ê(x + Dx, w) = Ê(x, w) + Dx^T (dÊ/dx) + O(||Dx||^2)        (8)

Thus, while holding the weights, w, fixed and letting Dx be a perturbation of the input, x, Equation 8 characterizes how small changes in the input of the model change the value of the auxiliary error function. By setting Dx = r v, with v being an arbitrary vector and r being a small value, we can rearrange Equation 8 into the form:

    u^T J v = lim_{r -> 0} [ Ê(x + r v, w) - Ê(x, w) ] / r

            = d/dr Ê(x + r v, w) |_{r=0}        (9)

This final expression will allow us to define the differential operator in the next subsection.

3.2.2 Differential Operator

Let h(x) be an arbitrary twice-differentiable function. Define the differential operator

    R_v{h(x)} = d/dr h(x + r v) |_{r=0}        (10)

which has the property that R_v{Ê(x, w)} = u^T J v. In fact, R_v is a directional derivative operator in the input space of the model. As a result, R_v generalizes partial derivatives into an alternate basis as specified by v. Being a differential operator, R_v obeys all of the standard rules for differentiation:

    R_v{c} = 0                                                    (11)
    R_v{c h(x)} = c R_v{h(x)}
    R_v{g(x) + h(x)} = R_v{g(x)} + R_v{h(x)}
    R_v{g(x) h(x)} = R_v{g(x)} h(x) + g(x) R_v{h(x)}
    R_v{g(h(x))} = g'(h(x)) R_v{h(x)}
    R_v{dh(x)/dt} = d(R_v{h(x)})/dt

The operator also has the property that R_v{x} = v.

3.2.3 Equivalence

We will now see that the result of calculating R_v{dÊ/dw} can be used to calculate both dE_u/dw and dE_v/dw. Note that Equations 7-9 all assume that both u and v are independent of x and w. To calculate dE_u/dw and dE_v/dw, we will actually set u or v to a value that depends on both x and w; however, the derivation still works because the directional derivative operator is not supposed to be applied to u or v in these cases. Intuitively, our choices for u and v are made in such a way that the chain rule of differentiation is not supposed to be applied to these terms. Hence, the correct analytical solution is obtained despite the dependence.

To optimize with respect to Equation 1, we use:

    dE_u/dw = (d(J^T u)/dw)^T (J^T u - a)

Now, assume that u is independent of x and w (thus, du/dx = 0 and du/dw = 0). We can substitute v = J^T u - a above and disregard the fact that v is a function of both x and w, because the differential operator is not supposed to be chained to the subsequent J^T u.
The substitution yields:

    dE_u/dw = (d(J^T u)/dw)^T v = R_v{dÊ/dw}

To optimize with respect to Equation 2, we use a similar trick, substituting u = J v - b, which yields:

    dE_v/dw = (d(J v)/dw)^T (J v - b) = R_v{dÊ/dw}

Thus, in both cases, applying R_v to the weight gradient of the model's auxiliary error function produces the desired result.

3.2.4 Method Applied to MLPs

We are now ready to see how this technique can be applied to a specific type of model. We use a multilayer perceptron as an example because of its generality and familiarity. Equations 3 and 4 suffice for the feedforward equations. However, we need to slightly modify the feedback Equations 5 and 6 so that they are applied to the specific form of the auxiliary error function, Ê:

    dÊ/dy_i^L = u_i                                                              (12)
    dÊ/dy_i^l = Sum_{j=1}^{N_{l+1}} w_ji^{l+1} (dÊ/dx_j^{l+1})     (for l < L)   (13)
    dÊ/dx_i^l = (dÊ/dy_i^l) g'(x_i^l)                                            (14)
    dÊ/dw_ij^l = (dÊ/dx_i^l) y_j^{l-1}                                           (15)
    dÊ/dtheta_i^l = dÊ/dx_i^l                                                    (16)

where the u_i term is a component of the vector u from Equation 1. Applying the R_v operator to the feedforward equations yields:

    R_v{y_i^0} = v_i                                     (17)
    R_v{y_i^l} = g'(x_i^l) R_v{x_i^l}     (for l > 0)    (18)

    R_v{x_i^l} = Sum_{j=1}^{N_{l-1}} w_ij^l R_v{y_j^{l-1}}        (19)

where the v_i term is a component of the vector v from Equation 2. Note that the equations above are nearly identical to how normal backprop is calculated, with the only exception being Equation 12; thus, these calculations require only a trivial change to most MLP implementations. As the final step, we apply the R_v operator to the feedback equations:

    R_v{dÊ/dy_i^L} = 0                                                                    (20)
    R_v{dÊ/dy_i^l} = Sum_{j=1}^{N_{l+1}} w_ji^{l+1} R_v{dÊ/dx_j^{l+1}}      (for l < L)   (21)
    R_v{dÊ/dx_i^l} = R_v{dÊ/dy_i^l} g'(x_i^l) + (dÊ/dy_i^l) g''(x_i^l) R_v{x_i^l}        (22)
    R_v{dÊ/dw_ij^l} = R_v{dÊ/dx_i^l} y_j^{l-1} + (dÊ/dx_i^l) R_v{y_j^{l-1}}              (23)
    R_v{dÊ/dtheta_i^l} = R_v{dÊ/dx_i^l}                                                  (24)

3.2.5 Complete Algorithm

Implementing this algorithm is nearly as simple as implementing normal gradient descent. For each type of variable used in an MLP (net input, neuron output, weights, thresholds, partial derivatives, etc.) we require that an extra variable be allocated to hold the result of applying the R_v operator to the original variable. With this change in place, the complete algorithm to compute dE_u/dw is as follows:

1. Set u and a to the user-specified vectors from Equation 1.
2. Set the MLP inputs to the value of x at which J is to be evaluated.
3. Perform a normal feedforward pass using Equations 3 and 4.
4. Set the dÊ/dy_i^L terms to u_i.
5. Perform the feedback pass with Equations 13-16. Note that the values in the dÊ/dy_i^0 terms are now equal to J^T u.
6. Set v to J^T u - a.
7. Perform an R_v forward pass with Equations 17-19.
8. Set the R_v{dÊ/dy_i^L} terms to 0.
9. Perform an R_v backward pass with Equations 20-24.

After the last step, the values in the R_v{dÊ/dw_ij^l} and R_v{dÊ/dtheta_i^l} terms contain the required result. It is important to note that the time complexity of the J-forward and J-backward calculations is nearly identical to that of the typical output and gradient evaluations (i.e., the forward and backward passes) of the models used.

A similar technique can be used for calculating dE_v/dw. The main difference is that the R_v forward pass is performed between the normal forward and backward passes, because u = J v - b can only be determined after the R_v{y_i^L} terms (which equal J v) have been calculated.
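The steps above can be sketched concretely for a one-hidden-layer tanh MLP. The following minimal illustration (the variable names and the restriction of the gradient to the hidden weight layer are choices made here, not the paper's) computes the gradient of E_u with respect to the first weight layer via the R_v passes, then checks it against finite differences of E_u:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 2)) * 0.5, rng.standard_normal(3) * 0.5
W2 = rng.standard_normal((2, 3)) * 0.5
x = np.array([0.2, -0.4])
u = np.array([1.0, -1.0])
a = np.array([0.5, 0.5])

def jprop_grad_W1(W1):
    # Steps 2-3: forward pass.
    s1 = W1 @ x + b1; h = np.tanh(s1)
    # Steps 4-5: feedback pass for Ehat = u^T y, down to the inputs.
    dh = W2.T @ u                  # dEhat/dh (independent of x)
    ds1 = dh * (1 - h**2)          # dEhat/ds1
    Jt_u = W1.T @ ds1              # J^T u appears at the input layer
    # Step 6: v is the residual of Equation 1.
    v = Jt_u - a
    # Step 7: R_v forward pass.
    Rs1 = W1 @ v; Rh = (1 - h**2) * Rs1
    # Steps 8-9: R_v backward pass; gradient of E_u w.r.t. W1.
    Rds1 = dh * (-2 * h * Rh)      # R_v applied to ds1 (tanh'' = -2*tanh*tanh')
    return np.outer(Rds1, x) + np.outer(ds1, v)

def E_u(W1):
    # E_u written as an explicit function of W1, using the analytic Jacobian
    # J = W2 diag(1 - h^2) W1 of this particular architecture.
    h = np.tanh(W1 @ x + b1)
    r = (W2 @ np.diag(1 - h**2) @ W1).T @ u - a
    return 0.5 * r @ r

g = jprop_grad_W1(W1)
eps = 1e-6
for i in range(3):
    for j in range(2):
        Wp = W1.copy(); Wp[i, j] += eps
        assert abs(g[i, j] - (E_u(Wp) - E_u(W1)) / eps) < 1e-4
```

The cost of the two extra passes is the same order as one ordinary forward/backward pass, which is the O(1)-overhead claim made in the introduction.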
4 NUMERICAL EXAMPLE

To demonstrate the effectiveness and generality of the J-prop algorithm, the method has been implemented on top of an existing neural network library (1999) in such a way that the algorithm can be used on a large number of architectures, including MLPs, radial basis function networks, and higher-order networks. An MLP with ten hidden tanh nodes was trained on 100 points with conjugate gradient. The training exemplars consisted of one-dimensional inputs and a target derivative from cos(x) + cos(10x). The unknown function is sin(x) + (1/10) sin(10x); however, the MLP never sees data from this function, but only its derivative. The model quickly converges to a solution in approximately 100 iterations.

Figure 2 shows the performance of the MLP. Having never seen data from the unknown function, the MLP yields a poor (but accurate to within an additive constant) approximation of the function; however, an excellent approximation of the function's derivative is obtained. The model could have been trained on both outputs and derivatives, but the goal was to illustrate that J-prop can target derivatives alone. To be clear, Figure 2a was produced by supplying the inputs to the model and performing the standard forward calculation to produce the outputs, i.e., no integration is required. Figure 2b was produced by differentiating the model with respect to the input space and plotting the resultant values against the target derivatives.

5 J-PROP RECIPES

In this section, we consider a number of special cases of the J-prop algorithm. In all of the following derivations, the problem is re-expressed so that the objective function matches one of the following canonical forms:

an error function similar to Equation 1 or 2;

Figure 2: Learning only the derivative: showing (a) poor approximation of the function with (b) excellent approximation of the derivative.

Figure 3: With a single target pair (x, y), there are an infinite number of solutions to fit tanh(wx + b). However, with a target derivative (e.g., dy/dx), there is only one solution. (a) Various fits to target data; (b) An error function that only considers the target output; (c) An error function that only considers the target derivative; (d) An error function that combines target output and target derivative information.

an auxiliary error function, Ê, to be used in place of Equation 7; or

a partial derivative of the form u^T J v.

For all three cases, J-prop can be used to calculate the required weight gradient efficiently and exactly, but with different values of the user-specified vectors a, b, u, and/or v, or the auxiliary error function, Ê. The dual goal for this section is to show that J-prop can be used to implement and generalize existing methods, and that J-prop may have many potential applications that have not been addressed by any other numerical means. Thus, some of the proposed J-prop recipes are presented in the spirit of being potential applications and not tried-and-tested methods.

5.1 Training with Explicit Target Derivatives

The first special case of J-prop is nearly identical to the numerical experiment from the previous section. However, here a more complete description is given of how and why one would want to train with explicit target derivatives; the discussion is also motivated with the simplest example of how training with target derivatives alters the underlying optimization problem.

Let the entire target matrix of first derivatives evaluated at a specific point in input space be denoted by B. The extra gradient information needed to optimize the new error function is calculated by:

    Sum_{i=1}^{m} (d(J^T u_i)/dw)^T (J^T u_i - B^T u_i)

where m is the number of outputs in the model and u_i denotes the i-th unit vector. This expression can be computed by using a choice of Ê equal to that of Equation 7, setting u to u_i, a to B^T u_i, and v to J^T u_i - B^T u_i.

Figure 3 shows a simple example of how including information related to the Jacobian in an error function alters the error surface. If only the target output is specified, then there are an infinite number of curves that pass through the single target I/O data point. If only the target derivative is optimized, then all curves with the same slope as the plotted curve in Figure 3a are valid fits.
As shown in Figures 3b, 3c, and 3d, when a mixture of the two targets is optimized, the error surface becomes better conditioned because it loses some of its flatness and becomes more localized in the region where both outputs and derivatives are correct.

If target derivatives are explicitly known, then they will usually come from one of two sources: either a data-driven model is being trained on a known analytical process that can be differentiated, or the data is from a process that has a well-behaved first difference. Even if explicit targets for the first derivatives are unknown, it is still possible to acquire estimates through a bootstrap process. For example, suppose we have noisy data but are not confident that first differences will give accurate estimates of the first derivatives. We can still train a model in the traditional manner and then use the first derivatives of the first model to augment the training of a second model, and so on. In this way, incremental learning can be performed in such a way that subsequent models are forced to have first derivatives that are consistent with earlier models.

5.2 Tangent Prop

The Tangent Prop algorithm (Simard, Victorri, Le Cun & Denker, 1992) is designed to train neural networks in such a way that they learn invariances relative to a classification task. For example, if a classifier is trained to recognize letters of the alphabet, then the optimal classifier would be invariant to small translations, rotations, and scalings of the input images. More specifically, let s(x, alpha) be an operator that rotates an image described in x by alpha degrees. To make a classifier invariant to this operator, we would like to minimize:

    E_v = (1/2) || J (ds(x, alpha)/dalpha |_{alpha=0}) ||^2

Tangent Prop maps directly into J-prop by setting:

    v = ds(x, alpha)/dalpha |_{alpha=0},    u = J v,    and    b = 0.

The invariance approach can be applied to any domain in which a classification task has known invariance properties.
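A rough sketch of the Tangent Prop setup follows, using a toy rotation acting on a point in the plane and a linear map standing in for the classifier's Jacobian (both are illustrative assumptions made here, not the paper's experiment). The tangent vector is estimated by a finite difference on the rotation operator, and the invariance penalty is ||J v||^2:

```python
import numpy as np

def rotate(x, alpha):
    # s(x, alpha): rotate a point in the plane by alpha radians
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

x = np.array([1.0, 0.0])
eps = 1e-6
# Tangent vector v = ds(x, alpha)/dalpha at alpha = 0, via finite differences.
t = (rotate(x, eps) - rotate(x, 0.0)) / eps   # analytically [0, 1] for this x

W = np.array([[2.0, 0.0]])                    # toy linear classifier, J = W
penalty = float(np.sum((W @ t)**2))           # invariance penalty ||J v||^2
```

Here the penalty is essentially zero because the toy classifier already ignores the direction in which the rotation initially moves x; a classifier sensitive to that direction would be penalized and pushed toward rotational invariance.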
However, the basic idea was also expanded by Thrun and Mitchell and applied to the tasks of lifelong learning (Thrun & Mitchell, 1995) and explanation-based neural network learning (Thrun & Mitchell, 1993), which both have the goal of training models to be consistent with earlier models (and are, therefore, similar in spirit to the bootstrap method described in the previous subsection).

5.3 Double Backpropagation

Like Tangent Prop, the Double Backpropagation algorithm (Drucker & Le Cun, 1992) uses a composite error function that is a linear combination of the usual quadratic error plus the norm of the first derivative of the quadratic error function with respect to the model inputs. Hence, it minimizes output errors while simultaneously minimizing the sensitivity of the error function to the model inputs. The extra gradient information needed to descend this error function is calculated by:

    d/dw (1/2) || J^T e ||^2

where e = y - ȳ, that is, the output residuals. Calculating this expression with J-prop requires that we modify the auxiliary error function to account for the fact that the vector that we multiply by the Jacobian matrix is no longer constant, but depends explicitly on the model output. The proper choice for Ê is:

    Ê = (1/2) || y - ȳ ||^2

because dÊ/dx = J^T e. We also need to set v = J^T e and a = 0. These changes imply that the algorithm must also account for dÊ/dy_i^L = e_i and R_v{dÊ/dy_i^L} = R_v{y_i^L}.

Interestingly, Double Backpropagation may also be beneficial for training encoder-decoder networks. More specifically, let f_e(x) be an encoder and f_d(x) be a decoder such that f(x) = f_d(f_e(x)) performs compression and decompression. For this example, the output dimensionality of the decoder must be identical to the input dimensionality of the encoder. Let J_e and J_d denote the Jacobian matrices of the encoder and decoder, respectively. In the linear case, i.e., principal component analysis (PCA), the optimal solution has the properties:

    J_d = J_e^T    and    J_e J_d = J_e J_e^T = I

Note that J_e J_e^T = I being true when summed over all data implies that the encoded components are statistically independent if the data points are Gaussian. An encoder-decoder network is said to be consistent (Rico-Martinez, Anderson & Kevrekidis, 1996) if for all inputs, x,

    f(f(x)) = f(x)

that is, when used as an input, the output of the network maps back to itself, which is always true in the case of PCA. One necessary (but not necessarily sufficient) condition for consistency is:

    d/dx (1/2) || f(x) - x ||^2 = (J - I)^T (f(x) - x) = 0

Thus, J-prop can be used to minimize || J^T (f(x) - x) ||^2 so as to encourage (but not necessarily guarantee) consistency in a nonlinear system. A related problem is found when attempting to train networks similar to encoder-decoder networks to have stable attractors and, therefore, act as continuous associative memories (Seung, 1998). Double Backpropagation (and J-prop) may be used to encourage these particular stability properties.
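The PCA properties cited above are easy to verify numerically. The sketch below (standard linear algebra with illustrative names, not the paper's code) builds a linear encoder with orthonormal rows, sets the decoder to its transpose, and checks both the consistency property and J_e J_d = I:

```python
import numpy as np

# Orthonormal 2x3 encoder (rows orthonormal), as in PCA; decoder is its transpose.
Q = np.linalg.qr(np.random.default_rng(2).standard_normal((3, 2)))[0]
Enc = Q.T            # J_e: 2x3, Enc @ Enc.T = I
Dec = Enc.T          # J_d = J_e^T: 3x2

def f(x):
    # composite encoder-decoder map f(x) = f_d(f_e(x))
    return Dec @ (Enc @ x)

x = np.array([1.0, 2.0, 3.0])
assert np.allclose(f(f(x)), f(x))            # consistency: f o f = f
assert np.allclose(Enc @ Dec, np.eye(2))     # J_e J_d = I
```

In the linear case these properties hold exactly by construction; the point of the recipe above is that a nonlinear encoder-decoder network has no such guarantee, so the J-prop penalty is used to encourage them during training.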
5.4 Enforcing Identical I/O Sensitivities

Related to the previous special case of J-prop, this next special case attempts to enforce the constraint that the sensitivity of the error function with respect to the inputs should be identical to the sensitivity of the error function with respect to the outputs. Such a goal may be reasonable in the case of training a neural network to emulate a dynamical system that behaves as an iterated map. An iterated map has equal input and output dimensionalities; the input represents a current state, and the output represents the next state.

This case for J-prop requires that we use a different auxiliary error function for the derivation:

    Ê = (1/2) || y - x ||^2

with u = y - x, a = 0, and v = J^T u - u. This choice of Ê results in the correct gradient calculation because dÊ/dx = (J - I)^T (y - x). Note that we must also make modifications for dÊ/dy_i^L = y_i^L - y_i^0 and R_v{dÊ/dy_i^L} = R_v{y_i^L} - v_i. Additionally, this example implicitly assumes that the input and output dimensionalities of the model are identical.

5.5 Targeting Eigenvectors & Eigenvalues

Given a square Jacobian matrix, it is often desirable to force a system to have specific eigenvector/eigenvalue pairs. This subsection illustrates how to solve the problem for the left-handed eigenvectors with a function similar to E_u. Handling right-handed eigenvectors only requires a switch to an error function similar to E_v.

Any n x n matrix, A, with n unique eigenvalues can be rewritten as

    A = Sum_{i=1}^{n} lambda_i e_i d_i^T        (25)

where lambda_i are the eigenvalues, e_i are the (right-handed) eigenvectors (which form a basis in R^n), and d_i are a set of left-handed eigenvectors that form a contravariant basis with respect to the eigenvectors. Note that this means that for all i and j, d_i^T e_j = 0 if and only if i != j, and d_i^T e_i = 1. These facts also imply that the d_i terms are actually the eigenvectors of A^T, which can be trivially seen by transposing Equation 25. If there exist one or more target right-handed eigenvector/eigenvalue pairs, then a set of left-handed eigenvectors can be numerically calculated.
Note that there may exist an infinite number of left-handed eigenvectors that meet all of the constraints, but this is simply a consequence of the problem being ill-defined from the start (with fewer than n unique eigenvectors). To optimize the Jacobian so that it has specific left-handed eigenvectors, we minimize the function E_u(w) = (1/2) ‖λ̃ l̃ − Jᵀ l̃‖²,

where l̃ and λ̃ are the target left-handed eigenvector and eigenvalue, respectively. 5.6 Targeting the Largest Eigenvalue While the preceding recipe can be used to target specific eigenvalue / eigenvector pairs, the method may produce a Jacobian whose remaining eigenvalue / eigenvector pairs are undesirable. Minimizing the largest eigenvalue of the Jacobian of a system has many applications in engineering, such as control and stability. We will now see how a method for computing the largest eigenvalue of a matrix can be modified with J-prop so that the maximum eigenvalue of the Jacobian is explicitly minimized. Finding the largest eigenvalue (and its corresponding eigenvector) of a matrix is relatively straightforward and fast via the power method (Wilkinson, 1965), which uses the iterations y_t = z_{t−1}/‖z_{t−1}‖ and z_t = A y_t. The initial choice of z_0 is usually made to be random or all ones. After several iterations, y_tᵀ z_t will converge to the largest eigenvalue, λ₁, and y_t will converge to the corresponding eigenvector. The errors of these approximations are proportional to (λ₂/λ₁)ᵗ, where λ₁ and λ₂ are the largest and second-largest eigenvalues, respectively. Thus, convergence is generally fast. Now, suppose we wished to alter the Jacobian matrix of a system so that its largest eigenvalue is equal to λ̃ and has the corresponding eigenvector φ̃. In this case, we desire that the iterated system v_t = z_{t−1}/‖z_{t−1}‖ (26), z_t = J v_t (27) converge to the fixed point z_p = λ̃ φ̃ after p power-method iterations, with the initial condition z_0. The desired goal corresponds to minimizing the function E(w) = (1/2) ‖λ̃ φ̃ − J v_p‖² (28). In the derivation that follows, differentiation is done with respect to a virtual scalar parameter, w; however, one could differentiate with respect to the entire vector of parameters by using tensor notation. Tensors are avoided here for the sake of clarity, but note that the tensor formulation is possible.
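Before the derivation, the setup can be made concrete with a small numerical sketch: the power iteration above, the resulting error function, and the backward recurrence derived in this subsection, checked against a finite difference. The 2×2 parametrization J(w), the target vector, and all names are illustrative assumptions, not the paper's implementation:

```python
# Sketch: p power-method steps v_t = z_{t-1}/||z_{t-1}||, z_t = J v_t,
# followed by the backward recurrence that accumulates dE/dw for
# E(w) = 0.5 * ||b - J(w) v_p||^2, with b = (target eigenvalue) * (target
# eigenvector).  The scalar parametrization J(w) is illustrative only.
def J(w):
    return [[2.0 + w, 1.0], [1.0, 3.0]]

dJ = [[1.0, 0.0], [0.0, 0.0]]        # dJ/dw for this parametrization
b = [1.2, 3.4]                       # target eigenvalue times eigenvector
p, z0 = 8, [1.0, 1.0]

def matvec(A, x):
    return [A[0][0]*x[0] + A[0][1]*x[1], A[1][0]*x[0] + A[1][1]*x[1]]

def norm(x):
    return (x[0]*x[0] + x[1]*x[1]) ** 0.5

def forward(w):                      # run p steps, saving all v_t and z_t
    vs, zs = [None], [z0]
    for t in range(1, p + 1):
        n = norm(zs[-1])
        vs.append([zs[-1][0] / n, zs[-1][1] / n])
        zs.append(matvec(J(w), vs[-1]))
    return vs, zs

def energy(w):
    zp = forward(w)[1][p]
    return 0.5 * ((b[0] - zp[0]) ** 2 + (b[1] - zp[1]) ** 2)

def grad(w):                         # backward recurrence for the u_t
    vs, zs = forward(w)
    Jw = J(w)
    JT = [[Jw[0][0], Jw[1][0]], [Jw[0][1], Jw[1][1]]]
    u = [zs[p][0] - b[0], zs[p][1] - b[1]]        # u_p = z_p - b
    g = 0.0
    for t in range(p, 0, -1):
        dv = matvec(dJ, vs[t])
        g += u[0]*dv[0] + u[1]*dv[1]              # add u_t . (dJ/dw) v_t
        a = matvec(JT, u)                         # J^T u_t
        d = vs[t][0]*a[0] + vs[t][1]*a[1]
        n = norm(zs[t - 1])
        u = [(a[0] - vs[t][0]*d) / n,             # project and rescale:
             (a[1] - vs[t][1]*d) / n]             # u_{t-1}
    return g

w0 = 0.2
g = grad(w0)
fd = (energy(w0 + 1e-6) - energy(w0 - 1e-6)) / 2e-6
```

The recursion is an exact derivative of the finite-p computation, so the analytic gradient and the finite difference agree to roundoff.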
Differentiating Equation 28 with respect to w yields ∂E/∂w = (z_p − λ̃ φ̃)ᵀ ( (∂J/∂w) v_p + J (∂v_p/∂w) ) (29). Equation 29 recursively depends on ∂v_t/∂w = (1/‖z_{t−1}‖)(I − v_t v_tᵀ)(∂z_{t−1}/∂w) (30). And Equation 30 is recursively a function of ∂z_t/∂w = (∂J/∂w) v_t + J (∂v_t/∂w) (31), with the understanding that ∂z_0/∂w = 0. Note that Equations 29 through 31 can be combined into a backward recurrent system that is more easily managed: ∂E/∂w = Σ_{t=1}^{p} u_tᵀ (∂J/∂w) v_t (32); u_p = z_p − λ̃ φ̃ (33); u_{t−1} = (1/‖z_{t−1}‖)(I − v_t v_tᵀ) Jᵀ u_t (34). The entire procedure for optimizing a system with respect to Equation 28 can be summarized as follows. First, pick the input vector, x, about which the Jacobian, J, should be optimized, and evaluate J. Second, perform p power-method iterations with Equations 26 and 27, saving all values of v_t and z_t. Third, choose the desired maximal eigenvector / eigenvalue pair, φ̃ and λ̃, which can be functions of the actual maximal pair found in the previous step. Fourth, calculate all values of u_t, for t equal to p down to 1, with Equations 33 and 34. Finally, perform the summation in Equation 32, where each term u_tᵀ (∂J/∂w) v_t is calculated with the R_v operator, with u and v set to u_t and v_t, respectively. The entire procedure calculates the entire gradient vector, not just the derivative for the single weight, w, used in the derivation. This procedure is similar to the backpropagation through time (BPTT) algorithm (Pearlmutter, 1995) in the way it unfolds the gradient recurrence. As with BPTT, fixed points may present numerical problems in the optimization procedure. 5.7 Minimizing Eigenvalue Bounds Similar to the previous recipe, the magnitudes of all of the eigenvalues of the Jacobian matrix can be indirectly optimized by exploiting Gershgorin's Theorem (Johnston, 1971). This alternate method has the advantages that it is less numerically demanding and that it deals with all eigenvalues simultaneously; however, because it minimizes an upper bound, it may be imprecise and yield undesirable results.
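The bound exploited here is easy to check numerically. The sketch below computes the Gershgorin disk centers and radii for an illustrative symmetric 2×2 matrix whose eigenvalues have a closed form, and confirms that both eigenvalues fall in the union of the disks:

```python
# Sketch of the Gershgorin bound: every eigenvalue of A lies in some disk
# centered at A[i][i] with radius sum_{j != i} |A[i][j]|.  Checked on a
# symmetric 2x2 matrix (illustrative) with closed-form eigenvalues.
A = [[2.0, 1.0], [1.0, 3.0]]

centers = [A[i][i] for i in range(2)]
radii = [sum(abs(A[i][j]) for j in range(2) if j != i) for i in range(2)]

# closed-form eigenvalues of a symmetric 2x2 matrix
tr = A[0][0] + A[1][1]
det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
disc = (tr * tr / 4.0 - det) ** 0.5
eigs = [tr / 2.0 - disc, tr / 2.0 + disc]

in_union = [any(abs(e - c) <= r + 1e-12 for c, r in zip(centers, radii))
            for e in eigs]
```

Shrinking a disk's radius (the off-diagonal row sums) therefore shrinks an upper bound on how far eigenvalues can stray from the diagonal entries, which is exactly what the recipe optimizes.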

Gershgorin's Theorem states that the set of eigenvalues of an n × n matrix, A, is contained in the union of n disks, Dᵢ, in the complex plane: Dᵢ = { z : |z − aᵢᵢ| ≤ Σ_{j≠i} |aᵢⱼ| }, for i = 1, …, n. This theorem gives two possibilities for optimizing the properties of the eigenvalues. First, we can optimize the center of a single disk by optimizing an element on the diagonal of the Jacobian. This can be easily accomplished, for disk i, with the function E_u(w) = (1/2) ‖a − Jᵀuᵢ‖², where uᵢ is, once again, the i-th unit vector, and a is set to zero except for its i-th component, which contains the target value for the center of disk i. Minimizing the radius of disk i requires that we minimize the sum of the magnitudes of the elements of the Jacobian in row i, not including the diagonal term. This can be accomplished by setting u to uᵢ, setting a to a vector that is all zero except for the i-th entry, which we set to Jᵢᵢ, and setting v equal to sgn(Jᵀu − a). The resulting function, E(w) = vᵀ(Jᵀu − a) = Σ_{j≠i} |Jᵢⱼ|, is equivalent to the Gershgorin upper bound for the radius, and it is trivially in a form that can be handled by the J-prop algorithm. This particular application of J-prop should be useful for forcing dynamical systems to be more stable, especially when the model being trained is acting as a controller and the Jacobian whose eigenvalues we are optimizing is that of the entire system (the system being controlled together with the controller). 5.8 Optimizing Lyapunov Exponents Lyapunov exponents are often used to characterize the types of behavior found in dynamical systems (Eckmann & Ruelle, 1985). It has been noted (Principe, Rathie & Kuo, 1992; Deco & Schürmann, 1997) that numerical models of dynamical systems can be very accurate in the input-output sense while being inaccurate in terms of their Lyapunov exponents. Often, one has a priori knowledge that a time series has a chaotic source, which implies that the largest Lyapunov exponent of the system must be positive.
In this case, one could use J-prop to coerce the model to be more accurate with respect to the true system's Lyapunov exponents. While the following method can be generalized to multidimensional systems, the procedure is derived only for a model that has a single input and a single output. Multidimensional systems are problematic because the analytical form of the Lyapunov exponent is not nearly as simple as in the one-dimensional case. Consider a model f(x; w) that approximates a one-dimensional iterated map. The Lyapunov exponent, λ, is defined as λ = lim_{N→∞} (1/N) Σ_{i=1}^{N} log |f′(xᵢ)| (35), where f′(x) denotes ∂f/∂x. While Equation 35 is not strictly differentiable, due to the absolute value function, we can analytically optimize the Lyapunov exponent if we use ∂|x|/∂x = sgn(x), which is algebraically consistent. Denoting the target Lyapunov exponent by λ̃, the function we wish to minimize is E(w) = (1/2)(λ̃ − λ)² (36). Differentiating Equation 36 with respect to w yields ∂E/∂w = −(λ̃ − λ) (1/N) Σ_{i=1}^{N} sgn(f′(xᵢ)) (1/|f′(xᵢ)|) ∂f′(xᵢ)/∂w (37). To optimize a model with Equation 37, one proceeds as follows. Provide an initial input, x₀. Iterate the system, using x_{t+1} = f(x_t; w), while saving f′(x_t) (the one-dimensional Jacobian) at every step. At each iteration, also calculate ∂f′(x_t)/∂w, which can be done with R-prop. Calculate λ with Equation 35. Finally, calculate the weight gradient with Equation 37. Care must be taken to iterate the system for a sufficiently large number of time steps. Moreover, since the Lyapunov exponent is usually only defined for iterates that already fall within a basin of attraction, one may wish to iterate the system for several time steps first and only then let an iterate be chosen as x₀ in the first step of the procedure.
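The procedure's first ingredient, estimating λ by averaging log|f′(x)| along an orbit, can be sketched for the logistic map f(x) = 4x(1−x), whose Lyapunov exponent is known to be ln 2. The burn-in and orbit length are illustrative choices:

```python
# Sketch of the Lyapunov-exponent estimate for the logistic map
# f(x) = 4x(1-x), with f'(x) = 4 - 8x.  The true exponent is ln 2; we
# average log|f'(x)| along an orbit.  Burn-in and N are illustrative.
import math

x = 0.2
for _ in range(1000):               # burn-in: settle onto the attractor
    x = 4.0 * x * (1.0 - x)

N, acc = 200000, 0.0
for _ in range(N):
    acc += math.log(abs(4.0 - 8.0 * x))   # log |f'(x)| at this iterate
    x = 4.0 * x * (1.0 - x)

lam = acc / N                        # estimate of the Lyapunov exponent
```

The weight gradient of the full recipe adds one more pass per iterate to obtain ∂f′(x_t)/∂w, which this sketch omits.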

5.9 Optimizing the Determinant Many applications require one to optimize the determinant of a system's n × n Jacobian matrix. Hence, for gradient-based optimization, this translates to calculating ∂ det J/∂w = Σ_{i=1}^{n} Σ_{j=1}^{n} (∂ det J/∂Jᵢⱼ)(∂Jᵢⱼ/∂w) (38). If the underlying system is linear, that is, if the Jacobian matrix is equal to the weights (reordered in matrix form), then these partial derivatives consist of all of the cofactors of J. However, for nonlinear systems, calculating Equation 38 can be very problematic. J-prop can be used to calculate the result of Equation 38 through a series of n applications of the J-prop rules. The total calculation required is described by the summation ∂ det J/∂w = Σ_{i=1}^{n} vᵢᵀ (∂Jᵀ/∂w) uᵢ (39), where uᵢ is the i-th unit vector and vᵢᵀ is the vector that consists of the cofactors from row i of J: vᵢᵀ = ( cof(Jᵢ₁), …, cof(Jᵢₙ) ). Each term in the summation of Equation 39 can be calculated with a single application of the J-prop rules. This may seem to be a time-consuming process, and it is, but surprisingly it is not significantly more time-consuming than computing all of the cofactors of J alone, since simply evaluating J requires n feedforward and feedback passes of the model, and calculating this partial derivative requires only n additional J-forward and J-backward passes. An alternate derivation could exploit the identity ∂/∂t ln det A = tr( A⁻¹ ∂A/∂t ); however, using this identity is not straightforward. Optimizing the determinant of the Jacobian of a nonlinear system (or, more accurately, its logarithm) is known to have many applications in the area of independent component analysis (ICA) and blind source separation (BSS) (Deco & Obradovic, 1996; Parra, Deco & Miesbach, 1995). The J-prop algorithm may be able to extend existing linear ICA and BSS algorithms into nonlinear versions. 5.10 Optimizing Quadratic Forms Other applications related to BSS and ICA require one to optimize with respect to quadratic forms of the Jacobian, zᵀ Jᵀ J z.
In this case, we desire the gradient ∂/∂w ( zᵀ Jᵀ J z ) (40). The J-prop algorithm can be used in such cases by simple application of the chain rule of differentiation. Tensor notation could be used to carry out the derivation; however, the end result can be expressed in standard vector-matrix form: ∂/∂w ( zᵀ Jᵀ J z ) = zᵀ (∂Jᵀ/∂w) J z + zᵀ Jᵀ (∂J/∂w) z (41). Equation 41 can be optimized by two applications of the R-prop algorithm, that is, one application for each additive term. Each additive term can be represented in the canonical form, uᵀ (∂J/∂w) v, with minimal linear algebra, by setting u and v to the left-hand and right-hand factors surrounding each ∂J/∂w. As a result, each of the two products maps easily onto the R-prop algorithm because, in the end, each product yields a scalar. 6 CONCLUSIONS This paper has introduced a general method for calculating the weight gradient of functions of the Jacobian matrix of twice-differentiable nonlinear systems. The method uses a differential operator that can be easily applied to most nonlinear models in common use today. The resulting algorithm, J-prop, can be easily specialized to minimize functions that are known to be useful in many application domains. While some special cases of the J-prop algorithm have already been studied, a great deal remains unknown about how optimization of the Jacobian changes the overall optimization problem. Some evidence from special cases of the algorithm suggests that optimization of the Jacobian can lead to better generalization and faster training. What remains to be seen is whether J-prop, used on nonlinear extensions of previous linear methods, will lead to truly stable and superior solutions. Three obvious areas to explore are control systems, blind source separation (or independent component analysis), and compression. In all of these cases, optimizing the Jacobian of the underlying linear system is straightforward.
Future work should attempt to identify whether optimizing a nonlinear extension is both numerically stable and worth the extra computational effort. ACKNOWLEDGEMENTS I am greatly indebted to many people for helpful discussions during the course of this work. I would specifically like to thank Frans Coetzee, Yannis Kevrekidis, Joe O Ruanaidh, Lucas Parra, Barak Pearlmutter, Scott Rickard, Justinian Rosca, and Patrice Simard. Eric Baum and Steve Lawrence of the NEC Research Institute graciously funded the time to write up these results.

REFERENCES Deco, G. & Obradovic, D. (1996). An Information Theoretic Approach to Neural Computing. Perspectives in Neural Computing. Springer. Deco, G. & Schürmann, B. (1997). Dynamic modeling of chaotic time series. In R. Greiner, T. Petsche, & S. J. Hanson (Eds.), Computational Learning Theory and Natural Learning Systems, volume IV of Making Learning Systems Practical, chapter 9. Cambridge, Mass.: The MIT Press. Drucker, H. & Le Cun, Y. (1992). Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6). Eckmann, J. P. & Ruelle, D. (1985). Ergodic theory of chaos and strange attractors. Rev. Mod. Phys., 57. Flake, G. W. (1999). An industrial strength neural optimization development engine. Technical report, NEC Research Institute. Hinton, G. E. (1986). Learning distributed representations of concepts. In Proc. Eighth Annual Conf. of the Cognitive Science Society. Hillsdale, NJ: Erlbaum. Johnston, R. L. (1971). Gershgorin theorems for partitioned matrices. Linear Algebra and its Applications, 4. Parra, L., Deco, G., & Miesbach, S. (1995). Redundancy reduction with information preserving nonlinear maps. Network: Computation in Neural Systems, 6. Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1). Pearlmutter, B. A. (1995). Gradient calculation for dynamic recurrent neural networks: a survey. IEEE Transactions on Neural Networks, 6(5). Poggio, T. & Girosi, F. (1990). Networks for approximation and learning. Proc. IEEE, 78(9). Principe, J., Rathie, A., & Kuo, J. (1992). Prediction of chaotic time series with neural networks and the issues of dynamic modeling. International Journal of Bifurcation and Chaos, 2(4). Rico-Martinez, R., Anderson, J. S., & Kevrekidis, I. G. (1996). Self-consistency in neural network-based NLPC analysis with applications to time-series processing. Comp. Chem. Eng., 20, Suppl. B. Seung, H. S. (1998). Learning continuous attractors in recurrent networks.
In Jordan, M. I., Kearns, M. J., & Solla, S. A. (Eds.), Advances in Neural Information Processing Systems, volume 10. MIT Press. Sietsma, J. & Dow, R. J. F. (1988). Neural network pruning: why and how. In Proc. IEEE Int. Conf. on Neural Networks, volume I. New York: IEEE. Simard, P., Victorri, B., Le Cun, Y., & Denker, J. (1992). Tangent prop: a formalism for specifying selected invariances in an adaptive network. In Moody, J. E., Hanson, S. J., & Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, Inc. Thrun, S. B. & Mitchell, T. M. (1993). Integrating inductive neural network learning and explanation-based learning. In Proc. of the Thirteenth Int. Joint Conf. on Artificial Intelligence. Morgan Kaufmann. Thrun, S. B. & Mitchell, T. M. (1995). Learning one more thing. In Proc. of the Fourteenth Int. Joint Conf. on Artificial Intelligence. Morgan Kaufmann. Weigend, A. S., Zimmermann, H. G., & Neuneier, R. (1995). Clearning. In Neural Networks in Financial Engineering (NNCM95). White, H. & Gallant, A. R. (1992). On learning the derivatives of an unknown mapping with multilayer feedforward networks. In H. White (Ed.), Artificial Neural Networks, chapter 1. Cambridge, Mass.: Blackwell. White, H., Hornik, K., & Stinchcombe, M. (1992). Universal approximation of an unknown mapping and its derivative. In H. White (Ed.), Artificial Neural Networks, chapter 6. Cambridge, Mass.: Blackwell. Wilkinson, J. H. (1965). The Algebraic Eigenvalue Problem. Oxford University Press.


More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Math 123, Week 2: Matrix Operations, Inverses

Math 123, Week 2: Matrix Operations, Inverses Math 23, Week 2: Matrix Operations, Inverses Section : Matrices We have introduced ourselves to the grid-like coefficient matrix when performing Gaussian elimination We now formally define general matrices

More information

Appendix A Taylor Approximations and Definite Matrices

Appendix A Taylor Approximations and Definite Matrices Appendix A Taylor Approximations and Definite Matrices Taylor approximations provide an easy way to approximate a function as a polynomial, using the derivatives of the function. We know, from elementary

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

Ways to make neural networks generalize better

Ways to make neural networks generalize better Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights

More information

Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Neural Networks CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Perceptrons x 0 = 1 x 1 x 2 z = h w T x Output: z x D A perceptron

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Neural Networks Lecture 4: Radial Bases Function Networks

Neural Networks Lecture 4: Radial Bases Function Networks Neural Networks Lecture 4: Radial Bases Function Networks H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi

More information

Multilayer Perceptron Learning Utilizing Singular Regions and Search Pruning

Multilayer Perceptron Learning Utilizing Singular Regions and Search Pruning Multilayer Perceptron Learning Utilizing Singular Regions and Search Pruning Seiya Satoh and Ryohei Nakano Abstract In a search space of a multilayer perceptron having hidden units, MLP(), there exist

More information

Keywords- Source coding, Huffman encoding, Artificial neural network, Multilayer perceptron, Backpropagation algorithm

Keywords- Source coding, Huffman encoding, Artificial neural network, Multilayer perceptron, Backpropagation algorithm Volume 4, Issue 5, May 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Huffman Encoding

More information

E.G., data mining, prediction. Function approximation problem: How >9 best estimate the function 0 from partial information (examples)

E.G., data mining, prediction. Function approximation problem: How >9 best estimate the function 0 from partial information (examples) Can Neural Network and Statistical Learning Theory be Formulated in terms of Continuous Complexity Theory? 1. Neural Networks and Statistical Learning Theory Many areas of mathematics, statistics, and

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Thore Graepel and Nicol N. Schraudolph Institute of Computational Science ETH Zürich, Switzerland {graepel,schraudo}@inf.ethz.ch

More information

BACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation

BACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation BACKPROPAGATION Neural network training optimization problem min J(w) w The application of gradient descent to this problem is called backpropagation. Backpropagation is gradient descent applied to J(w)

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Lecture 2: Linear regression

Lecture 2: Linear regression Lecture 2: Linear regression Roger Grosse 1 Introduction Let s ump right in and look at our first machine learning algorithm, linear regression. In regression, we are interested in predicting a scalar-valued

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate

More information

Supervised Learning. George Konidaris

Supervised Learning. George Konidaris Supervised Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom Mitchell,

More information

HOPFIELD neural networks (HNNs) are a class of nonlinear

HOPFIELD neural networks (HNNs) are a class of nonlinear IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 4, APRIL 2005 213 Stochastic Noise Process Enhancement of Hopfield Neural Networks Vladimir Pavlović, Member, IEEE, Dan Schonfeld,

More information

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations.

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations. POLI 7 - Mathematical and Statistical Foundations Prof S Saiegh Fall Lecture Notes - Class 4 October 4, Linear Algebra The analysis of many models in the social sciences reduces to the study of systems

More information

Fundamentals of Dynamical Systems / Discrete-Time Models. Dr. Dylan McNamara people.uncw.edu/ mcnamarad

Fundamentals of Dynamical Systems / Discrete-Time Models. Dr. Dylan McNamara people.uncw.edu/ mcnamarad Fundamentals of Dynamical Systems / Discrete-Time Models Dr. Dylan McNamara people.uncw.edu/ mcnamarad Dynamical systems theory Considers how systems autonomously change along time Ranges from Newtonian

More information

Linear Programming Redux

Linear Programming Redux Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains

More information

Lecture - 24 Radial Basis Function Networks: Cover s Theorem

Lecture - 24 Radial Basis Function Networks: Cover s Theorem Neural Network and Applications Prof. S. Sengupta Department of Electronic and Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 24 Radial Basis Function Networks:

More information

NEURAL NETWORKS

NEURAL NETWORKS 5 Neural Networks In Chapters 3 and 4 we considered models for regression and classification that comprised linear combinations of fixed basis functions. We saw that such models have useful analytical

More information

APPPHYS217 Tuesday 25 May 2010

APPPHYS217 Tuesday 25 May 2010 APPPHYS7 Tuesday 5 May Our aim today is to take a brief tour of some topics in nonlinear dynamics. Some good references include: [Perko] Lawrence Perko Differential Equations and Dynamical Systems (Springer-Verlag

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Conjugate gradient algorithm for training neural networks

Conjugate gradient algorithm for training neural networks . Introduction Recall that in the steepest-descent neural network training algorithm, consecutive line-search directions are orthogonal, such that, where, gwt [ ( + ) ] denotes E[ w( t + ) ], the gradient

More information

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Spring 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate

More information

Lecture 16: Modern Classification (I) - Separating Hyperplanes

Lecture 16: Modern Classification (I) - Separating Hyperplanes Lecture 16: Modern Classification (I) - Separating Hyperplanes Outline 1 2 Separating Hyperplane Binary SVM for Separable Case Bayes Rule for Binary Problems Consider the simplest case: two classes are

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps

Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps Lucas Parra, Gustavo Deco, Stefan Miesbach Siemens AG, Corporate Research and Development, ZFE ST SN 4 Otto-Hahn-Ring

More information

chapter 12 MORE MATRIX ALGEBRA 12.1 Systems of Linear Equations GOALS

chapter 12 MORE MATRIX ALGEBRA 12.1 Systems of Linear Equations GOALS chapter MORE MATRIX ALGEBRA GOALS In Chapter we studied matrix operations and the algebra of sets and logic. We also made note of the strong resemblance of matrix algebra to elementary algebra. The reader

More information

Artificial Neural Network

Artificial Neural Network Artificial Neural Network Contents 2 What is ANN? Biological Neuron Structure of Neuron Types of Neuron Models of Neuron Analogy with human NN Perceptron OCR Multilayer Neural Network Back propagation

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2 MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS SYSTEMS OF EQUATIONS AND MATRICES Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Margin Maximizing Loss Functions

Margin Maximizing Loss Functions Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing

More information

CHAPTER 3. Pattern Association. Neural Networks

CHAPTER 3. Pattern Association. Neural Networks CHAPTER 3 Pattern Association Neural Networks Pattern Association learning is the process of forming associations between related patterns. The patterns we associate together may be of the same type or

More information

Kernel expansions with unlabeled examples

Kernel expansions with unlabeled examples Kernel expansions with unlabeled examples Martin Szummer MIT AI Lab & CBCL Cambridge, MA szummer@ai.mit.edu Tommi Jaakkola MIT AI Lab Cambridge, MA tommi@ai.mit.edu Abstract Modern classification applications

More information

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Learning features by contrasting natural images with noise

Learning features by contrasting natural images with noise Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,

More information

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

More information

Confidence Estimation Methods for Neural Networks: A Practical Comparison

Confidence Estimation Methods for Neural Networks: A Practical Comparison , 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information