The Calculus of Jacobian Adaptation

Gary William Flake
NEC Research Institute / 4 Independence Way / Princeton, NJ

Abstract

For many problems, the correct behavior of a model depends not only on its input-output mapping but also on properties of its Jacobian matrix, the matrix of partial derivatives of the model's outputs with respect to its inputs. This paper introduces the J-prop algorithm, an efficient general method for computing the exact partial derivatives of a variety of simple functions of the Jacobian of a model with respect to its free parameters. The algorithm applies to any parameterized, twice-differentiable, data-driven feedforward model, including nonlinear regression, multilayer perceptrons, and radial basis function networks. J-prop has as special cases some earlier algorithms, such as Tangent Prop and Double Backpropagation, but it can also be used to optimize the eigenvectors, eigenvalues, or determinant of the Jacobian, making it quite general with respect to the model types, applications, and objective functions that can be used. Applications of J-prop include forcing stability and sensitivity conditions, regularization, building low-complexity encoder-decoder networks, exploiting prior knowledge, designing controllers, and blind source separation.
Keywords: Jacobian, derivative, gradient, numerical optimization, calculus

CONTENTS

1 Introduction
2 Background
3 J-Prop Derivation
  3.1 Intuitive Derivation
  3.2 Generic Derivation
    3.2.1 Auxiliary Error Function
    3.2.2 Differential Operator
    3.2.3 Equivalence
    3.2.4 Method Applied to MLPs
    3.2.5 Complete Algorithm
4 Numerical Example
5 J-Prop Recipes
  5.1 Training with Explicit Target Derivatives
  5.2 Tangent Prop
  5.3 Double Backpropagation
  5.4 Enforcing Identical I/O Sensitivities
  5.5 Targeting Eigenvectors & Eigenvalues
  5.6 Targeting the Largest Eigenvalue
  5.7 Minimizing Eigenvalue Bounds
  5.8 Optimizing Lyapunov Exponents
  5.9 Optimizing the Determinant
  5.10 Optimizing Quadratic Forms
6 Conclusions

1 INTRODUCTION

This paper introduces a general method by which constraints on the sensitivity of a model can be enforced through a gradient-based method. While this is a useful thing to do in the general case (from the point of view of regularization), there are also many applications in which specific properties of these sensitivities are known in advance and can therefore be exploited.

Let f(x, w) be an n-input, m-output, twice-differentiable feedforward model parameterized by an input vector, x, and a weight vector, w. Its Jacobian is

    J = dy/dx = [ dy_1/dx_1 ... dy_1/dx_n ]
                [    ...    ...    ...    ]
                [ dy_m/dx_1 ... dy_m/dx_n ]

The algorithm introduced can be used to optimize functions of the form

    E_u = (1/2) || J^T u - a ||^2        (1)

or

    E_v = (1/2) || J v - b ||^2          (2)

where u, v, a, and b are user-defined constants. The proposed algorithm, J-prop, can be used to calculate the exact value of both dE_u/dw and dE_v/dw in O(1) times the time required to calculate the normal gradient. Thus, J-prop is suitable for training models to have specific first derivatives, or for implementing several other well-known algorithms such as Double Backpropagation (Drucker & Le Cun, 1992) and Tangent Prop (Simard, Victorri, Le Cun & Denker, 1992). Suitable choices of a, b, u, and v make the functions E_u and E_v surprisingly expressive. For example, with u set to

a unit vector, a can be interpreted as a target value for a particular row of the Jacobian matrix. As will be shown, other modifications can be used to optimize the eigenvalues, the eigenvectors, or even the determinant of J. Clearly, being able to optimize Equations 1 and 2 is useful; however, the formalism used to derive the algorithm may actually be more interesting because it allows J-prop to be easily modified to apply to a wide variety of model types and objective functions. As such, a good portion of this paper is devoted to describing the mathematical framework from which J-prop is built.

The remainder of this paper is divided into five sections. Section 2 contains a brief overview of prior research related to the accuracy of the first derivative of a model. Both descriptive and prescriptive works are covered. In Section 3, the J-prop algorithm is derived both in an intuitive manner and in a purely analytical form. The algorithm is also shown applied to a multilayer perceptron as an example of how to use the algorithm with a specific model class. Section 4 contains a brief numerical example illustrating the correctness of the derivation. In Section 5, several special cases of J-prop, recipes that can be used for a variety of applications, are described. In each case, an objective function is defined along with the modifications to the basic underlying algorithm that must be made to compute the gradient of that function. These modifications usually involve only a simple setting of user-defined parameters. Some simple applications of the modified objective functions are also described. Section 6 concludes the paper with a summary.

2 BACKGROUND

Previous work concerning the modeling of a function and its derivatives can be divided into works that are descriptive or prescriptive.
One well-known descriptive result is due to (White & Gallant, 1992) and (White, Hornik & Stinchcombe, 1990), who show that given noise-free data a multilayer perceptron (MLP) can approximate the higher derivatives of a function in the limit as the number of training points goes to infinity. The difficulty with applying this result is the strong requirements on the amount and integrity of the training data, requirements which are rarely met in practice. This problem was specifically demonstrated by (Principe, Rathie & Kuo, 1992) and (Deco & Schürmann, 1997), who showed that using noisy training data from chaotic systems can lead to models that are accurate in the input-output sense, but inaccurate in their estimates of quantities related to the Jacobian of the unknown system, such as the largest Lyapunov exponent and the correlation dimension. Deco and Schürmann proposed a modified objective function and learning method to specifically penalize inaccurate Lyapunov exponents through a forgetting factor related to the largest Lyapunov exponent. While this improved results, the solution is only applicable to the domain of MLPs and chaotic systems.

Early in their development of radial basis function networks (RBFNs), (Poggio & Girosi, 1990) showed that RBFNs have the desirable property that one can effectively place a bound on the errors of all of the higher derivatives of the model. The key idea here is that the choice of how to penalize the errors of successive derivatives determines an appropriate basis function class via regularization theory. While this is an elegant result, it is not applicable to situations in which the form of the model is determined in advance, nor does it directly address the simple problem of the derivatives being inaccurate at a specific point in the input space.
The issue of a model's estimate of an unknown function's derivative being inaccurate is particularly problematic in MLPs because large weights can lead to saturation at a particular sigmoidal neuron which, in turn, results in extremely large first derivatives at the neuron when evaluated near the center of the sigmoid transition. Several methods to combat this type of over-fitting have been proposed.

One of the earliest methods, weight decay (Hinton, 1986), uses a penalty term on the magnitude of the weights. Weight decay is arguably optimal for models in which the output is linear in the weights, because minimizing the magnitude of the weights is equivalent to minimizing the magnitude of the model's first derivatives. However, in the nonlinear case, weight decay can have suboptimal performance (Drucker & Le Cun, 1992) because large (or small) weights do not always correspond to having large (or small) first derivatives.

Training with noisy inputs (Sietsma & Dow, 1988) is another method often used to regularize the first derivatives of a model. The simplest methods add a fixed amount of noise to the training data input vectors. More complicated versions, such as cleaning (Weigend, Zimmermann & Neuneier, 1995), adaptively learn a simple noise model on the input space, which allows adaptation to be specific to the noise levels in the training data. When compared to weight decay, training with noise has the advantage that it explicitly penalizes large first derivatives of the model; however, it is strictly a numerical approximation to something that can be done analytically. For example, the Double Backpropagation algorithm (Drucker & Le Cun, 1992) adds to the quadratic error function, E(x), an additional penalty term equal to (1/2) || dE/dx ||^2.
Training on this function results in a form of regularization that is in many ways an elegant combination of weight decay and training with noise: it is strictly analytic (unlike training with noise) but it explicitly penalizes large first derivatives of the model (unlike weight decay). Double Backpropagation can be seen as a special case of J-prop, the algorithm derived in the next section.

As to the general problem of coercing the first derivatives of a model to specific values, (Simard, Victorri, Le Cun & Denker, 1992) introduced the Tangent Prop algorithm, which was used to train MLPs for optical character recognition to be insensitive to small affine transformations in the character space. Tangent Prop is very similar to the algorithm proposed

in this paper but differs in several respects. First, its derivation is very specific to the application for which it was developed. Second, the derivation itself is not easy to generalize to models other than MLPs. And, third, Tangent Prop uses an objective function that cannot be easily generalized. In short, J-prop is more general in the models, applications, and objective functions that it can be used for when compared to previous related work. As such, the proposed formalism for J-prop allows an entirely new class of learning algorithms to be explored.

3 J-PROP DERIVATION

This section describes two methods for calculating dE_u/dw and dE_v/dw for arbitrary choices of u, v, a, and b. An intuitive graphical derivation is first shown for a modified network similar to an MLP. Afterwards, a more detailed derivation is presented that uses a special differential operator.

Before we proceed, a brief explanation of the notation is required to avoid confusion. When referring to generic data-driven models, the model will always be expressed as a vector function y = f(x, w), where x refers to the model input and w refers to a vector of all of the tunable parameters of the model. The term ȳ should be understood as the target value of the model, while the residual is written as e = y - ȳ. In this way, we can talk about models while ignoring the mechanics of how the models work internally. Complementary to the generic vector notation, a multilayer perceptron is used as a specific example of how to use these results. The notation for an MLP uses only scalar symbols; however, these symbols must refer to internal variables of the model (e.g., neuron thresholds, net inputs, weights, etc.), which can lead to some ambiguity. To be clear, when using vector notation, the input and output of an MLP will always be denoted by x and y, respectively, and the collection of all of the weights (including biases) maps to the vector w. However, when using scalar arithmetic, the scalar MLP notation will apply.
In later portions of this paper, objective functions are often rewritten to resemble either Equation 1 or 2. The objective function is labelled as either E_u or E_v to denote this fact.

3.1 Intuitive Derivation

Consider an MLP with L layers of nodes defined by the equations

    y_i^l = g(x_i^l)                                              (3)
    x_i^l = Sum_{j=1}^{N_{l-1}} w_ij^l y_j^{l-1} + theta_i^l      (4)

In these equations, superscripts denote the layer number (starting at 0), subscripts index over terms in a particular layer, and N_l is the number of nodes in layer l. Thus, y_i^l is the output of neuron i in node layer l, and x_i^l is the net input coming into the same neuron. Moreover, y_i^L is an output of the entire MLP while y_i^0 is an input going into the MLP.

Let E be a function for which we would like to calculate dE/dx. At this point, we make no assumptions about the specific form of E; that is, it may be the normal quadratic error function or possibly something totally unrelated to a typical error measure. With these rules in place, we can write a portion of the feedback equations required to calculate dE/dx as:

    dE/dx_i^l = (dE/dy_i^l) g'(x_i^l)                                         (5)
    dE/dy_i^l = Sum_{j=1}^{N_{l+1}} w_ji^{l+1} (dE/dx_j^{l+1})   (for l < L)  (6)

Note that we are purposely omitting equations for dE/dw_ij^l and dE/dtheta_i^l because they are not needed at this time. In the normal case of performing gradient-based learning, Equation 6 is only applied to the node layers that are neither inputs nor outputs, that is, l is neither 0 nor L. Nevertheless, if one were to continue the feedback procedure all of the way to the MLP inputs, thus using Equation 6 with l equal to 0, the computed values will be equal to the partial derivatives of E with respect to the MLP inputs. This is significant because dE/dx = J^T (dE/dy); thus, by choosing E so that dE/dy is equal to a vector that we would like to multiply by the Jacobian matrix, we can easily calculate the product of the transpose of the Jacobian and an arbitrary vector. Of course, this is old news, as it is common practice to choose dE/dy to be equal to a unit vector so that one can extract the rows of the Jacobian matrix.
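The observation that a feedback pass seeded with dE/dy = u yields J^T u at the inputs can be checked numerically. The sketch below (a minimal one-hidden-layer tanh MLP with illustrative variable names, not the paper's code) backpropagates u to the input layer and compares the result against a finite-difference Jacobian:

```python
import numpy as np

# A tiny 2-input, 3-hidden, 2-output tanh MLP with fixed random weights.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)

def forward(x):
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2, h

x = np.array([0.3, -0.7])
u = np.array([1.0, 0.0])        # unit vector: picks off the first row of J

# Feedback pass: inject u at the output and propagate it to the input layer.
y, h = forward(x)
dh = W2.T @ u * (1.0 - h**2)    # back through the tanh hidden layer
dx = W1.T @ dh                  # dE/dx = J^T u

# Compare against a finite-difference estimate of the Jacobian.
eps = 1e-6
J = np.array([[(forward(x + eps * np.eye(2)[j])[0][i] - y[i]) / eps
               for j in range(2)] for i in range(2)])
assert np.allclose(dx, J.T @ u, atol=1e-4)
```

Choosing u as each unit vector in turn extracts the rows of J one backward pass at a time, exactly as described above.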
What is interesting, however, is that all of Equations 3-6 can be implemented by a single, specialized, feedforward network that calculates the normal feedback pass of an MLP as well as the normal feedforward pass. This is illustrated in Figure 1, which shows two hidden MLP neurons that are functions of two earlier neurons. The first half of the network does the normal part of the feedforward pass (i.e., it performs Equations 3 and 4 through the single layers of weights). However, the second half of the network implements Equations 5 and 6. Note that the dashed lines leading from the neuron outputs to the dE/dy_i^l values are there to indicate that a particular partial derivative may or may not be a function of the neuron's output. Moreover, while the network contains two layers of weights, they are, in fact, the same weight layer, only transposed with respect to each other. Since this augmented network is strictly feedforward, in the sense that information always flows from the left to the right, it is possible to differentiate the final output of the network with respect to the weights. However, since the weights are shared in multiple locations, care must be taken to accumulate the gradient information in the shared weights.

Figure 1: A network illustrating the fact that J^T u can be computed by performing a backprop-like procedure on a special feedforward network that calculates the normal feedforward and feedback passes of standard backprop.

For any twice-differentiable model we could construct a similar specialized network that computes the partial derivatives of a user-defined function with respect to the model inputs. Obviously, such a network would be cumbersome to construct for an MLP with many hidden layers, and applying backprop to this new network would also be difficult. But the trick to differentiating the network comes down to doing it in a manner in which the chain rule does not require too much bookkeeping. We will see how to do this in the next subsection.

3.2 Generic Derivation

In this subsection a generic method is defined that can be used to calculate dE_u/dw and dE_v/dw for any model, f(x, w), that is twice differentiable. The method is very similar to a technique introduced by (Pearlmutter, 1994) for calculating the product of the Hessian of an MLP and an arbitrary vector; however, where Pearlmutter used differential operators applied to a model's weight space, this work uses a differential operator defined with respect to a model's input space.

The method is presented in five steps. First, we will define an auxiliary error function that has a few useful mathematical properties that simplify the derivation. Next, we will define a special differential operator that can be applied to both the auxiliary error function and its gradient with respect to the weights. We will then see that the result of applying the differential operator to the gradient of the auxiliary error function is equivalent to analytically calculating dE_u/dw or dE_v/dw. This is followed by an example of the technique applied to a multilayer perceptron.
Finally, in the last step, the complete algorithm is presented for calculating any part of dE_u/dw or dE_v/dw for any choice of a, b, u, and v, with the one caveat that we assume (only at this time) that u and v are independent of the model input and output.

3.2.1 Auxiliary Error Function

Our auxiliary error function, Ê, is defined as

    Ê(x, w) = u^T f(x, w)        (7)

Note that we never actually optimize with respect to Ê; we define it only because it has the property that dÊ/dx = J^T u, which will be useful to the derivation shortly. Note that dÊ/dx appears in the Taylor expansion of Ê about a point in input space:

    Ê(x + Dx, w) = Ê(x, w) + Dx^T (dÊ/dx) + O(||Dx||^2)        (8)

Thus, while holding the weights, w, fixed and letting Dx be a perturbation of the input, x, Equation 8 characterizes how small changes in the input of the model change the value of the auxiliary error function. By setting Dx = r v, with v being an arbitrary vector and r being a small value, we can rearrange Equation 8 into the form:

    u^T J v = lim_{r -> 0} [ Ê(x + r v, w) - Ê(x, w) ] / r

            = d/dr Ê(x + r v, w) |_{r=0}        (9)

This final expression will allow us to define the differential operator in the next subsection.

3.2.2 Differential Operator

Let h(x) be an arbitrary twice-differentiable function. Define the differential operator

    R_v{h(x)} = d/dr h(x + r v) |_{r=0}        (10)

which has the property that R_v{Ê(x, w)} = u^T J v. In fact, R_v is a directional derivative operator in the input space of the model. As a result, R_v generalizes partial derivatives into an alternate basis as specified by v. Being a differential operator, R_v obeys all of the standard rules for differentiation:

    R_v{c} = 0                                                    (11)
    R_v{c h(x)} = c R_v{h(x)}
    R_v{g(x) + h(x)} = R_v{g(x)} + R_v{h(x)}
    R_v{g(x) h(x)} = R_v{g(x)} h(x) + g(x) R_v{h(x)}
    R_v{g(h(x))} = g'(h(x)) R_v{h(x)}
    R_v{dh(x)/dt} = d(R_v{h(x)})/dt

The operator also has the property that R_v{x} = v.

3.2.3 Equivalence

We will now see that the result of calculating R_v{dÊ/dw} can be used to calculate both dE_u/dw and dE_v/dw. Note that Equations 7-9 all assume that both u and v are independent of x and w. To calculate dE_u/dw and dE_v/dw, we will actually set u or v to a value that depends on both x and w; however, the derivation still works because the directional derivative operator is not supposed to be applied to u or v in these cases. Intuitively, our choices for u and v are made in such a way that the chain rule of differentiation is not supposed to be applied to these terms. Hence, the correct analytical solution is obtained despite the dependence.

To optimize with respect to Equation 1, we use:

    dE_u/dw = (d(J^T u)/dw)^T (J^T u - a)

Now, assume that u is independent of x and w (thus, du/dx = 0 and du/dw = 0). We can substitute v = J^T u - a above and disregard the fact that v is a function of both x and w, because the differential operator is not supposed to be chained to the subsequent J^T u.
The substitution yields:

    dE_u/dw = (d(J^T u)/dw)^T v = R_v{dÊ/dw}

To optimize with respect to Equation 2, we use a similar trick, substituting u = J v - b, which yields:

    dE_v/dw = (d(J v)/dw)^T (J v - b) = R_v{dÊ/dw}

Thus, in both cases, applying R_v to the weight gradient of the model's auxiliary error function produces the desired result.

3.2.4 Method Applied to MLPs

We are now ready to see how this technique can be applied to a specific type of model. We use a multilayer perceptron as an example because of its generality and familiarity. Equations 3 and 4 suffice for the feedforward equations. However, we need to slightly modify the feedback Equations 5 and 6 so that they are applied to the specific form of the auxiliary error function, Ê:

    dÊ/dy_i^L = u_i                                                              (12)
    dÊ/dy_i^l = Sum_{j=1}^{N_{l+1}} w_ji^{l+1} (dÊ/dx_j^{l+1})     (for l < L)   (13)
    dÊ/dx_i^l = (dÊ/dy_i^l) g'(x_i^l)                                            (14)
    dÊ/dw_ij^l = (dÊ/dx_i^l) y_j^{l-1}                                           (15)
    dÊ/dtheta_i^l = dÊ/dx_i^l                                                    (16)

where the u_i term is a component of the vector u from Equation 1. Applying the R_v operator to the feedforward equations yields:

    R_v{y_i^0} = v_i                                     (17)
    R_v{y_i^l} = g'(x_i^l) R_v{x_i^l}     (for l > 0)    (18)

    R_v{x_i^l} = Sum_{j=1}^{N_{l-1}} w_ij^l R_v{y_j^{l-1}}        (19)

where the v_i term is a component of the vector v from Equation 2. Note that the equations above are nearly identical to how normal backprop is calculated, with the only exception being Equation 12; thus, these calculations require only a trivial change to most MLP implementations. As the final step, we apply the R_v operator to the feedback equations:

    R_v{dÊ/dy_i^L} = 0                                                                    (20)
    R_v{dÊ/dy_i^l} = Sum_{j=1}^{N_{l+1}} w_ji^{l+1} R_v{dÊ/dx_j^{l+1}}      (for l < L)   (21)
    R_v{dÊ/dx_i^l} = R_v{dÊ/dy_i^l} g'(x_i^l) + (dÊ/dy_i^l) g''(x_i^l) R_v{x_i^l}        (22)
    R_v{dÊ/dw_ij^l} = R_v{dÊ/dx_i^l} y_j^{l-1} + (dÊ/dx_i^l) R_v{y_j^{l-1}}              (23)
    R_v{dÊ/dtheta_i^l} = R_v{dÊ/dx_i^l}                                                  (24)

3.2.5 Complete Algorithm

Implementing this algorithm is nearly as simple as implementing normal gradient descent. For each type of variable used in an MLP (net input, neuron output, weights, thresholds, partial derivatives, etc.) we require that an extra variable be allocated to hold the result of applying the R_v operator to the original variable. With this change in place, the complete algorithm to compute dE_u/dw is as follows:

1. Set u and a to the user-specified vectors from Equation 1.
2. Set the MLP inputs to the value of x at which J is to be evaluated.
3. Perform a normal feedforward pass using Equations 3 and 4.
4. Set the dÊ/dy_i^L terms to u_i.
5. Perform the feedback pass with Equations 13-16. Note that the values in the dÊ/dy_i^0 terms are now equal to J^T u.
6. Set v to J^T u - a.
7. Perform an R_v forward pass with Equations 17-19.
8. Set the R_v{dÊ/dy_i^L} terms to 0.
9. Perform an R_v backward pass with Equations 20-24.

After the last step, the values in the R_v{dÊ/dw_ij^l} and R_v{dÊ/dtheta_i^l} terms contain the required result. It is important to note that the time complexity of the J-forward and J-backward calculations is nearly identical to that of the typical output and gradient evaluations (i.e., the forward and backward passes) of the models used.

A similar technique can be used for calculating dE_v/dw. The main difference is that the R_v forward pass is performed between the normal forward and backward passes, because u = J v - b can only be determined after the R_v{y_i^L} terms (which equal J v) have been calculated.
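The steps above can be sketched concretely for a one-hidden-layer tanh MLP. The following minimal illustration (the variable names and the restriction of the gradient to the hidden weight layer are choices made here, not the paper's) computes the gradient of E_u with respect to the first weight layer via the R_v passes, then checks it against finite differences of E_u:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 2)) * 0.5, rng.standard_normal(3) * 0.5
W2 = rng.standard_normal((2, 3)) * 0.5
x = np.array([0.2, -0.4])
u = np.array([1.0, -1.0])
a = np.array([0.5, 0.5])

def jprop_grad_W1(W1):
    # Steps 2-3: forward pass.
    s1 = W1 @ x + b1; h = np.tanh(s1)
    # Steps 4-5: feedback pass for Ehat = u^T y, down to the inputs.
    dh = W2.T @ u                  # dEhat/dh (independent of x)
    ds1 = dh * (1 - h**2)          # dEhat/ds1
    Jt_u = W1.T @ ds1              # J^T u appears at the input layer
    # Step 6: v is the residual of Equation 1.
    v = Jt_u - a
    # Step 7: R_v forward pass.
    Rs1 = W1 @ v; Rh = (1 - h**2) * Rs1
    # Steps 8-9: R_v backward pass; gradient of E_u w.r.t. W1.
    Rds1 = dh * (-2 * h * Rh)      # R_v applied to ds1 (tanh'' = -2*tanh*tanh')
    return np.outer(Rds1, x) + np.outer(ds1, v)

def E_u(W1):
    # E_u written as an explicit function of W1, using the analytic Jacobian
    # J = W2 diag(1 - h^2) W1 of this particular architecture.
    h = np.tanh(W1 @ x + b1)
    r = (W2 @ np.diag(1 - h**2) @ W1).T @ u - a
    return 0.5 * r @ r

g = jprop_grad_W1(W1)
eps = 1e-6
for i in range(3):
    for j in range(2):
        Wp = W1.copy(); Wp[i, j] += eps
        assert abs(g[i, j] - (E_u(Wp) - E_u(W1)) / eps) < 1e-4
```

The cost of the two extra passes is the same order as one ordinary forward/backward pass, which is the O(1)-overhead claim made in the introduction.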
4 NUMERICAL EXAMPLE

To demonstrate the effectiveness and generality of the J-prop algorithm, the method has been implemented on top of an existing neural network library (1999) in such a way that the algorithm can be used on a large number of architectures, including MLPs, radial basis function networks, and higher-order networks. An MLP with ten hidden tanh nodes was trained on 100 points with conjugate gradient. The training exemplars consisted of one-dimensional inputs and a target derivative from cos(x) + cos(10x). The unknown function is sin(x) + (1/10) sin(10x); however, the MLP never sees data from this function, but only its derivative. The model quickly converges to a solution in approximately 100 iterations.

Figure 2 shows the performance of the MLP. Having never seen data from the unknown function, the MLP yields a poor (but accurate to within an additive constant) approximation of the function; however, an excellent approximation of the function's derivative is obtained. The model could have been trained on both outputs and derivatives, but the goal was to illustrate that J-prop can target derivatives alone. To be clear, Figure 2a was produced by supplying the inputs to the model and performing the standard forward calculation to produce the outputs, i.e., no integration is required. Figure 2b was produced by differentiating the model with respect to the input space and plotting the resultant values against the target derivatives.

5 J-PROP RECIPES

In this section, we consider a number of special cases of the J-prop algorithm. In all of the following derivations, the problem is re-expressed so that the objective function matches one of the following canonical forms:

an error function similar to Equation 1 or 2;

Figure 2: Learning only the derivative: showing (a) poor approximation of the function with (b) excellent approximation of the derivative.

Figure 3: With a single target pair (x, y), there are an infinite number of solutions to fit tanh(wx + b). However, with a target derivative (e.g., dy/dx), there is only one solution. (a) Various fits to target data; (b) An error function that only considers the target output; (c) An error function that only considers the target derivative; (d) An error function that combines target output and target derivative information.

an auxiliary error function, Ê, to be used in place of Equation 7; or

a partial derivative of the form u^T J v.

For all three cases, J-prop can be used to calculate the required weight gradient efficiently and exactly, but with different values of the user-specified vectors a, b, u, and/or v, or the auxiliary error function, Ê. The dual goal for this section is to show that J-prop can be used to implement and generalize existing methods, and that J-prop may have many potential applications that have not been addressed by any other numerical means. Thus, some of the proposed J-prop recipes are presented in the spirit of being potential applications and not tried-and-tested methods.

5.1 Training with Explicit Target Derivatives

The first special case of J-prop is nearly identical to the numerical experiment from the previous section. However, here a more complete description is given of how and why one would want to train with explicit target derivatives; the discussion is also motivated with the simplest example of how training with target derivatives alters the underlying optimization problem.

Let the entire target matrix of first derivatives evaluated at a specific point in input space be denoted by B. The extra gradient information needed to optimize the new error function is calculated by:

    Sum_{i=1}^{m} (d(J^T u_i)/dw)^T (J^T u_i - B^T u_i)

where m is the number of outputs in the model and u_i denotes the i-th unit vector. This expression can be computed by using a choice of Ê equal to that of Equation 7, setting u to u_i, a to B^T u_i, and v to J^T u_i - B^T u_i.

Figure 3 shows a simple example of how including information related to the Jacobian in an error function alters the error surface. If only the target output is specified, then there are an infinite number of curves that pass through the single target I/O data point. If only the target derivative is optimized, then all curves with the same slope as the plotted curve in Figure 3a are valid fits.
As shown in Figures 3b, 3c, and 3d, when a mixture of the two targets is optimized, the error surface becomes better conditioned because it loses some of its flatness and becomes more localized in the region where both outputs and derivatives are correct.

If target derivatives are explicitly known, then they will usually come from one of two sources: either a data-driven model is being trained on a known analytical process that can be differentiated, or the data is from a process that has a well-behaved first difference. Even if explicit targets for the first derivatives are unknown, it is still possible to acquire estimates through a bootstrap process. For example, suppose we have noisy data but are not confident that first differences will give accurate estimates of the first derivatives. We can still train a model in the traditional manner and then use the first derivatives of the first model to augment the training of a second model, and so on. In this way, incremental learning can be performed in such a way that subsequent models are forced to have first derivatives that are consistent with earlier models.

5.2 Tangent Prop

The Tangent Prop algorithm (Simard, Victorri, Le Cun & Denker, 1992) is designed to train neural networks in such a way that they learn invariances relative to a classification task. For example, if a classifier is trained to recognize letters of the alphabet, then the optimal classifier would be invariant to small translations, rotations, and scalings of the input images. More specifically, let s(x, alpha) be an operator that rotates an image described in x by alpha degrees. To make a classifier invariant to this operator, we would like to minimize:

    E_v = (1/2) || J (ds(x, alpha)/dalpha |_{alpha=0}) ||^2

Tangent Prop maps directly into J-prop by setting:

    v = ds(x, alpha)/dalpha |_{alpha=0},    u = J v,    and    b = 0.

The invariance approach can be applied to any domain in which a classification task has known invariance properties.
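A rough sketch of the Tangent Prop setup follows, using a toy rotation acting on a point in the plane and a linear map standing in for the classifier's Jacobian (both are illustrative assumptions made here, not the paper's experiment). The tangent vector is estimated by a finite difference on the rotation operator, and the invariance penalty is ||J v||^2:

```python
import numpy as np

def rotate(x, alpha):
    # s(x, alpha): rotate a point in the plane by alpha radians
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

x = np.array([1.0, 0.0])
eps = 1e-6
# Tangent vector v = ds(x, alpha)/dalpha at alpha = 0, via finite differences.
t = (rotate(x, eps) - rotate(x, 0.0)) / eps   # analytically [0, 1] for this x

W = np.array([[2.0, 0.0]])                    # toy linear classifier, J = W
penalty = float(np.sum((W @ t)**2))           # invariance penalty ||J v||^2
```

Here the penalty is essentially zero because the toy classifier already ignores the direction in which the rotation initially moves x; a classifier sensitive to that direction would be penalized and pushed toward rotational invariance.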
However, the basic idea was also expanded by Thrun and Mitchell and applied to the tasks of lifelong learning (Thrun & Mitchell, 1995) and explanation-based neural network learning (Thrun & Mitchell, 1993), which both have the goal of training models to be consistent with earlier models (and are, therefore, similar in spirit to the bootstrap method described in the previous subsection).

5.3 Double Backpropagation

Like Tangent Prop, the Double Backpropagation algorithm (Drucker & Le Cun, 1992) uses a composite error function that is a linear combination of the usual quadratic error plus the norm of the first derivative of the quadratic error function with respect to the model inputs. Hence, it minimizes output errors while simultaneously minimizing the sensitivity of the error function to the model inputs. The extra gradient information needed to descend this error function is calculated by:

    d/dw (1/2) || J^T e ||^2

where e = y - ȳ, that is, the output residuals. Calculating this expression with J-prop requires that we modify the auxiliary error function to account for the fact that the vector that we multiply by the Jacobian matrix is no longer constant, but depends explicitly on the model output. The proper choice for Ê is:

    Ê = (1/2) || y - ȳ ||^2

because dÊ/dx = J^T e. We also need to set v = J^T e and a = 0. These changes imply that the algorithm must also account for dÊ/dy_i^L = e_i and R_v{dÊ/dy_i^L} = R_v{y_i^L}.

Interestingly, Double Backpropagation may also be beneficial for training encoder-decoder networks. More specifically, let f_e(x) be an encoder and f_d(x) be a decoder such that f(x) = f_d(f_e(x)) performs compression and decompression. For this example, the output dimensionality of the decoder must be identical to the input dimensionality of the encoder. Let J_e and J_d denote the Jacobian matrices of the encoder and decoder, respectively. In the linear case, i.e., principal component analysis (PCA), the optimal solution has the properties:

    J_d = J_e^T    and    J_e J_d = J_e J_e^T = I

Note that J_e J_e^T = I being true when summed over all data implies that the encoded components are statistically independent if the data points are Gaussian. An encoder-decoder network is said to be consistent (Rico-Martinez, Anderson & Kevrekidis, 1996) if for all inputs, x,

    f(f(x)) = f(x)

that is, when used as an input, the output of the network maps back to itself, which is always true in the case of PCA. One necessary (but not necessarily sufficient) condition for consistency is:

    d/dx (1/2) || f(x) - x ||^2 = (J - I)^T (f(x) - x) = 0

Thus, J-prop can be used to minimize || J^T (f(x) - x) ||^2 so as to encourage (but not necessarily guarantee) consistency in a nonlinear system. A related problem is found when attempting to train networks similar to encoder-decoder networks to have stable attractors and, therefore, act as continuous associative memories (Seung, 1998). Double Backpropagation (and J-prop) may be used to encourage these particular stability properties.
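The PCA properties cited above are easy to verify numerically. The sketch below (standard linear algebra with illustrative names, not the paper's code) builds a linear encoder with orthonormal rows, sets the decoder to its transpose, and checks both the consistency property and J_e J_d = I:

```python
import numpy as np

# Orthonormal 2x3 encoder (rows orthonormal), as in PCA; decoder is its transpose.
Q = np.linalg.qr(np.random.default_rng(2).standard_normal((3, 2)))[0]
Enc = Q.T            # J_e: 2x3, Enc @ Enc.T = I
Dec = Enc.T          # J_d = J_e^T: 3x2

def f(x):
    # composite encoder-decoder map f(x) = f_d(f_e(x))
    return Dec @ (Enc @ x)

x = np.array([1.0, 2.0, 3.0])
assert np.allclose(f(f(x)), f(x))            # consistency: f o f = f
assert np.allclose(Enc @ Dec, np.eye(2))     # J_e J_d = I
```

In the linear case these properties hold exactly by construction; the point of the recipe above is that a nonlinear encoder-decoder network has no such guarantee, so the J-prop penalty is used to encourage them during training.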
5.4 Enforcing Identical I/O Sensitivities

Related to the previous special case of J-prop, this next special case attempts to enforce the constraint that the sensitivity of the error function with respect to the inputs should be identical to the sensitivity of the error function with respect to the outputs. Such a goal may be reasonable in the case of training a neural network to emulate a dynamical system that behaves as an iterated map. An iterated map has equal input and output dimensionalities; the input represents a current state, and the output represents the next state.

This case for J-prop requires that we use a different auxiliary error function for the derivation:

    Ê = (1/2) || y - x ||^2

with u = y - x, a = 0, and v = J^T u - u. This choice of Ê results in the correct gradient calculation because dÊ/dx = (J - I)^T (y - x). Note that we must also make modifications for dÊ/dy_i^L = y_i^L - y_i^0 and R_v{dÊ/dy_i^L} = R_v{y_i^L} - v_i. Additionally, this example implicitly assumes that the input and output dimensionalities of the model are identical.

5.5 Targeting Eigenvectors & Eigenvalues

Given a square Jacobian matrix, it is often desirable to force a system to have specific eigenvector/eigenvalue pairs. This subsection illustrates how to solve the problem for the left-handed eigenvectors with a function similar to E_u. Handling right-handed eigenvectors only requires a switch to an error function similar to E_v.

Any n x n matrix, A, with n unique eigenvalues can be rewritten as

    A = Sum_{i=1}^{n} lambda_i e_i d_i^T        (25)

where lambda_i are the eigenvalues, e_i are the (right-handed) eigenvectors (which form a basis in R^n), and d_i are a set of left-handed eigenvectors that form a contravariant basis with respect to the eigenvectors. Note that this means that for all i and j, d_i^T e_j = 0 if and only if i != j, and d_i^T e_i = 1. These facts also imply that the d_i terms are actually the eigenvectors of A^T, which can be trivially seen by transposing Equation 25. If there exist one or more target right-handed eigenvector/eigenvalue pairs, then a set of left-handed eigenvectors can be numerically calculated.
Note that there may exist an infinite number of left-handed eigenvectors that meet all of the constraints, but this is simply a consequence of the problem being ill-defined from the start (with fewer than n unique eigenvectors). To optimize the Jacobian so that it has specific left-handed eigenvectors, we minimize the function E_u(w) = (1/2) ‖λ̃ l̃ − Jᵀ l̃‖²,

where l̃ and λ̃ are the target left-handed eigenvector and eigenvalue, respectively. 5.6 Targeting the Largest Eigenvalue While the preceding recipe can be used to target specific eigenvalue / eigenvector pairs, the method may produce a Jacobian whose remaining eigenvalue / eigenvector pairs are undesirable. Minimizing the largest eigenvalue of the Jacobian of a system has many applications in engineering, such as control and stability. We will now see how a method for computing the largest eigenvalue of a matrix can be modified with J-prop so that the maximum eigenvalue of the Jacobian is explicitly minimized. Finding the largest eigenvalue (and its corresponding eigenvector) of a matrix is relatively straightforward and fast via the power method (Wilkinson, 1965), which uses the iterations y_t = z_{t−1}/‖z_{t−1}‖ and z_t = A y_t. The initial choice of z_0 is usually made to be random or all ones. After several iterations, y_tᵀ z_t will converge to the largest eigenvalue, λ₁, and y_t will converge to the corresponding eigenvector. The errors of these approximations are proportional to (λ₂/λ₁)ᵗ, where λ₁ and λ₂ are the largest and second-largest eigenvalues, respectively. Thus, convergence is generally fast. Now, suppose we wished to alter the Jacobian matrix of a system so that its largest eigenvalue is equal to λ̃ and has the corresponding eigenvector φ̃. In this case, we desire that the iterated system v_t = z_{t−1}/‖z_{t−1}‖ (26), z_t = J v_t (27) converge to the fixed point z_p = λ̃ φ̃ after p power-method iterations, with the initial condition z_0. The desired goal corresponds to minimizing the function E(w) = (1/2) ‖λ̃ φ̃ − J v_p‖² (28). In the derivation that follows, differentiation is done with respect to a virtual scalar parameter, w; however, one could differentiate with respect to the entire vector of parameters by using tensor notation. Tensors are avoided here for the sake of clarity, but note that the tensor formulation is possible.
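Before the derivation, the setup can be made concrete with a small numerical sketch: the power iteration above, the resulting error function, and the backward recurrence derived in this subsection, checked against a finite difference. The 2×2 parametrization J(w), the target vector, and all names are illustrative assumptions, not the paper's implementation:

```python
# Sketch: p power-method steps v_t = z_{t-1}/||z_{t-1}||, z_t = J v_t,
# followed by the backward recurrence that accumulates dE/dw for
# E(w) = 0.5 * ||b - J(w) v_p||^2, with b = (target eigenvalue) * (target
# eigenvector).  The scalar parametrization J(w) is illustrative only.
def J(w):
    return [[2.0 + w, 1.0], [1.0, 3.0]]

dJ = [[1.0, 0.0], [0.0, 0.0]]        # dJ/dw for this parametrization
b = [1.2, 3.4]                       # target eigenvalue times eigenvector
p, z0 = 8, [1.0, 1.0]

def matvec(A, x):
    return [A[0][0]*x[0] + A[0][1]*x[1], A[1][0]*x[0] + A[1][1]*x[1]]

def norm(x):
    return (x[0]*x[0] + x[1]*x[1]) ** 0.5

def forward(w):                      # run p steps, saving all v_t and z_t
    vs, zs = [None], [z0]
    for t in range(1, p + 1):
        n = norm(zs[-1])
        vs.append([zs[-1][0] / n, zs[-1][1] / n])
        zs.append(matvec(J(w), vs[-1]))
    return vs, zs

def energy(w):
    zp = forward(w)[1][p]
    return 0.5 * ((b[0] - zp[0]) ** 2 + (b[1] - zp[1]) ** 2)

def grad(w):                         # backward recurrence for the u_t
    vs, zs = forward(w)
    Jw = J(w)
    JT = [[Jw[0][0], Jw[1][0]], [Jw[0][1], Jw[1][1]]]
    u = [zs[p][0] - b[0], zs[p][1] - b[1]]        # u_p = z_p - b
    g = 0.0
    for t in range(p, 0, -1):
        dv = matvec(dJ, vs[t])
        g += u[0]*dv[0] + u[1]*dv[1]              # add u_t . (dJ/dw) v_t
        a = matvec(JT, u)                         # J^T u_t
        d = vs[t][0]*a[0] + vs[t][1]*a[1]
        n = norm(zs[t - 1])
        u = [(a[0] - vs[t][0]*d) / n,             # project and rescale:
             (a[1] - vs[t][1]*d) / n]             # u_{t-1}
    return g

w0 = 0.2
g = grad(w0)
fd = (energy(w0 + 1e-6) - energy(w0 - 1e-6)) / 2e-6
```

The recursion is an exact derivative of the finite-p computation, so the analytic gradient and the finite difference agree to roundoff.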
Differentiating Equation 28 with respect to w yields ∂E/∂w = (z_p − λ̃ φ̃)ᵀ ( (∂J/∂w) v_p + J (∂v_p/∂w) ) (29). Equation 29 recursively depends on ∂v_t/∂w = (1/‖z_{t−1}‖)(I − v_t v_tᵀ)(∂z_{t−1}/∂w) (30). And Equation 30 is recursively a function of ∂z_t/∂w = (∂J/∂w) v_t + J (∂v_t/∂w) (31), with the understanding that ∂z_0/∂w = 0. Note that Equations 29 through 31 can be combined into a backward recurrent system that is more easily managed: ∂E/∂w = Σ_{t=1}^{p} u_tᵀ (∂J/∂w) v_t (32); u_p = z_p − λ̃ φ̃ (33); u_{t−1} = (1/‖z_{t−1}‖)(I − v_t v_tᵀ) Jᵀ u_t (34). The entire procedure for optimizing a system with respect to Equation 28 can be summarized as follows. First, pick the input vector, x, about which the Jacobian, J, should be optimized, and evaluate J. Second, perform p power-method iterations with Equations 26 and 27, saving all values of v_t and z_t. Third, choose the desired maximal eigenvector / eigenvalue pair, φ̃ and λ̃, which can be functions of the actual maximal pair found in the previous step. Fourth, calculate all values of u_t, for t equal to p down to 1, with Equations 33 and 34. Finally, perform the summation in Equation 32, where each term u_tᵀ (∂J/∂w) v_t is calculated with the R_v operator, with u and v set to u_t and v_t, respectively. The entire procedure calculates the entire gradient vector, not just the derivative for the single weight, w, used in the derivation. This procedure is similar to the backpropagation through time (BPTT) algorithm (Pearlmutter, 1995) in the way it unfolds the gradient recurrence. As with BPTT, fixed points may present numerical problems in the optimization procedure. 5.7 Minimizing Eigenvalue Bounds Similar to the previous recipe, the magnitudes of all of the eigenvalues of the Jacobian matrix can be indirectly optimized by exploiting Gershgorin's Theorem (Johnston, 1971). This alternate method has the advantages that it is less numerically demanding and that it deals with all eigenvalues simultaneously; however, because it minimizes an upper bound, it may be imprecise and yield undesirable results.
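The bound exploited here is easy to check numerically. The sketch below computes the Gershgorin disk centers and radii for an illustrative symmetric 2×2 matrix whose eigenvalues have a closed form, and confirms that both eigenvalues fall in the union of the disks:

```python
# Sketch of the Gershgorin bound: every eigenvalue of A lies in some disk
# centered at A[i][i] with radius sum_{j != i} |A[i][j]|.  Checked on a
# symmetric 2x2 matrix (illustrative) with closed-form eigenvalues.
A = [[2.0, 1.0], [1.0, 3.0]]

centers = [A[i][i] for i in range(2)]
radii = [sum(abs(A[i][j]) for j in range(2) if j != i) for i in range(2)]

# closed-form eigenvalues of a symmetric 2x2 matrix
tr = A[0][0] + A[1][1]
det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
disc = (tr * tr / 4.0 - det) ** 0.5
eigs = [tr / 2.0 - disc, tr / 2.0 + disc]

in_union = [any(abs(e - c) <= r + 1e-12 for c, r in zip(centers, radii))
            for e in eigs]
```

Shrinking a disk's radius (the off-diagonal row sums) therefore shrinks an upper bound on how far eigenvalues can stray from the diagonal entries, which is exactly what the recipe optimizes.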

Gershgorin's Theorem states that the set of eigenvalues of an n × n matrix, A, is contained in the union of n disks, Dᵢ, in the complex plane: Dᵢ = { z : |z − aᵢᵢ| ≤ Σ_{j≠i} |aᵢⱼ| }, for i = 1, …, n. This theorem gives two possibilities for optimizing the properties of the eigenvalues. First, we can optimize the center of a single disk by optimizing an element on the diagonal of the Jacobian. This can be easily accomplished, for disk i, with the function E_u(w) = (1/2) ‖a − Jᵀuᵢ‖², where uᵢ is, once again, the i-th unit vector, and a is set to zero except for its i-th component, which contains the target value for the center of disk i. Minimizing the radius of disk i requires that we minimize the sum of the magnitudes of the elements of the Jacobian in row i, not including the diagonal term. This can be accomplished by setting u to uᵢ, setting a to a vector that is all zero except for the i-th entry, which we set to Jᵢᵢ, and setting v equal to sgn(Jᵀu − a). The resulting function, E(w) = vᵀ(Jᵀu − a) = Σ_{j≠i} |Jᵢⱼ|, is equivalent to the Gershgorin upper bound for the radius, and it is trivially in a form that can be handled by the J-prop algorithm. This particular application of J-prop should be useful for forcing dynamical systems to be more stable, especially when the model being trained is acting as a controller and the Jacobian whose eigenvalues we are optimizing is that of the entire system (the system being controlled together with the controller). 5.8 Optimizing Lyapunov Exponents Lyapunov exponents are often used to characterize the types of behavior found in dynamical systems (Eckmann & Ruelle, 1985). It has been noted (Principe, Rathie & Kuo, 1992; Deco & Schürmann, 1997) that numerical models of dynamical systems can be very accurate in the input-output sense while being inaccurate in terms of their Lyapunov exponents. Often, one has a priori knowledge that a time series has a chaotic source, which implies that the largest Lyapunov exponent of the system must be positive.
In this case, one could use J-prop to coerce the model to be more accurate with respect to the true system's Lyapunov exponents. While the following method can be generalized to multidimensional systems, the procedure is derived only for a model that has a single input and a single output. Multidimensional systems are problematic because the analytical form of the Lyapunov exponent is not nearly as simple as in the one-dimensional case. Consider a model f(x; w) that approximates a one-dimensional iterated map. The Lyapunov exponent, λ, is defined as λ = lim_{N→∞} (1/N) Σ_{i=1}^{N} log |f′(xᵢ)| (35), where f′(x) denotes ∂f/∂x. While Equation 35 is not strictly differentiable, due to the absolute value function, we can analytically optimize the Lyapunov exponent if we use ∂|x|/∂x = sgn(x), which is algebraically consistent. Denoting the target Lyapunov exponent by λ̃, the function we wish to minimize is E(w) = (1/2)(λ̃ − λ)² (36). Differentiating Equation 36 with respect to w yields ∂E/∂w = −(λ̃ − λ) (1/N) Σ_{i=1}^{N} sgn(f′(xᵢ)) (1/|f′(xᵢ)|) ∂f′(xᵢ)/∂w (37). To optimize a model with Equation 37, one proceeds as follows. Provide an initial input, x₀. Iterate the system, using x_{t+1} = f(x_t; w), while saving f′(x_t) (the one-dimensional Jacobian) at every step. At each iteration, also calculate ∂f′(x_t)/∂w, which can be done with R-prop. Calculate λ with Equation 35. Finally, calculate the weight gradient with Equation 37. Care must be taken to iterate the system for a sufficiently large number of time steps. Moreover, since the Lyapunov exponent is usually only defined for iterates that already fall within a basin of attraction, one may wish to iterate the system for several time steps first and only then let an iterate be chosen as x₀ in the first step of the procedure.
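The procedure's first ingredient, estimating λ by averaging log|f′(x)| along an orbit, can be sketched for the logistic map f(x) = 4x(1−x), whose Lyapunov exponent is known to be ln 2. The burn-in and orbit length are illustrative choices:

```python
# Sketch of the Lyapunov-exponent estimate for the logistic map
# f(x) = 4x(1-x), with f'(x) = 4 - 8x.  The true exponent is ln 2; we
# average log|f'(x)| along an orbit.  Burn-in and N are illustrative.
import math

x = 0.2
for _ in range(1000):               # burn-in: settle onto the attractor
    x = 4.0 * x * (1.0 - x)

N, acc = 200000, 0.0
for _ in range(N):
    acc += math.log(abs(4.0 - 8.0 * x))   # log |f'(x)| at this iterate
    x = 4.0 * x * (1.0 - x)

lam = acc / N                        # estimate of the Lyapunov exponent
```

The weight gradient of the full recipe adds one more pass per iterate to obtain ∂f′(x_t)/∂w, which this sketch omits.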

5.9 Optimizing the Determinant Many applications require one to optimize the determinant of a system's n × n Jacobian matrix. Hence, for gradient-based optimization, this translates to calculating ∂ det J/∂w = Σ_{i=1}^{n} Σ_{j=1}^{n} (∂ det J/∂Jᵢⱼ)(∂Jᵢⱼ/∂w) (38). If the underlying system is linear, that is, if the Jacobian matrix is equal to the weights (reordered in matrix form), then these partial derivatives consist of all of the cofactors of J. However, for nonlinear systems, calculating Equation 38 can be very problematic. J-prop can be used to calculate the result of Equation 38 through a series of n applications of the J-prop rules. The total calculation required is described by the summation ∂ det J/∂w = Σ_{i=1}^{n} vᵢᵀ (∂Jᵀ/∂w) uᵢ (39), where uᵢ is the i-th unit vector and vᵢᵀ is the vector that consists of the cofactors from row i of J: vᵢᵀ = ( cof(Jᵢ₁), …, cof(Jᵢₙ) ). Each term in the summation of Equation 39 can be calculated with a single application of the J-prop rules. This may seem to be a time-consuming process, and it is, but surprisingly it is not significantly more time-consuming than computing all of the cofactors of J alone, since simply evaluating J requires n feedforward and feedback passes of the model, and calculating this partial derivative requires only n additional J-forward and J-backward passes. An alternate derivation could exploit the identity ∂/∂t ln det A = tr( A⁻¹ ∂A/∂t ); however, using this identity is not straightforward. Optimizing the determinant of the Jacobian of a nonlinear system (or, more accurately, its logarithm) is known to have many applications in the area of independent component analysis (ICA) and blind source separation (BSS) (Deco & Obradovic, 1996; Parra, Deco & Miesbach, 1995). The J-prop algorithm may be able to extend existing linear ICA and BSS algorithms into nonlinear versions. 5.10 Optimizing Quadratic Forms Other applications related to BSS and ICA require one to optimize with respect to quadratic forms of the Jacobian, zᵀ Jᵀ J z.
In this case, we desire the gradient ∂/∂w ( zᵀ Jᵀ J z ) (40). The J-prop algorithm can be used in such cases by simple application of the chain rule of differentiation. Tensor notation could be used to carry out the derivation; however, the end result can be expressed in standard vector-matrix form: ∂/∂w ( zᵀ Jᵀ J z ) = zᵀ (∂Jᵀ/∂w) J z + zᵀ Jᵀ (∂J/∂w) z (41). Equation 41 can be optimized by two applications of the R-prop algorithm, that is, one application for each additive term. Each additive term can be represented in the canonical form, uᵀ (∂J/∂w) v, with minimal linear algebra, by setting u and v to the left-hand and right-hand factors surrounding each ∂J/∂w. As a result, each of the two products maps easily onto the R-prop algorithm because, in the end, each product yields a scalar. 6 CONCLUSIONS This paper has introduced a general method for calculating the weight gradient of functions of the Jacobian matrix of twice-differentiable nonlinear systems. The method uses a differential operator that can be easily applied to most nonlinear models in common use today. The resulting algorithm, J-prop, can be easily specialized to minimize functions that are known to be useful in many application domains. While some special cases of the J-prop algorithm have already been studied, a great deal remains unknown about how optimization of the Jacobian changes the overall optimization problem. Some evidence from special cases of the algorithm suggests that optimization of the Jacobian can lead to better generalization and faster training. What remains to be seen is whether J-prop, used on nonlinear extensions of previous linear methods, will lead to truly stable and superior solutions. Three obvious areas to explore are control systems, blind source separation (or independent component analysis), and compression. In all of these cases, optimizing the Jacobian of the underlying linear system is straightforward.
Future work should attempt to identify whether optimizing a nonlinear extension is both numerically stable and worth the extra computational effort. ACKNOWLEDGEMENTS I am greatly indebted to many people for helpful discussions during the course of this work. I would specifically like to thank Frans Coetzee, Yannis Kevrekidis, Joe O Ruanaidh, Lucas Parra, Barak Pearlmutter, Scott Rickard, Justinian Rosca, and Patrice Simard. Eric Baum and Steve Lawrence of the NEC Research Institute graciously funded the time to write up these results.

REFERENCES Deco, G. & Obradovic, D. (1996). An Information Theoretic Approach to Neural Computing. Perspectives in Neural Computing. Springer. Deco, G. & Schürmann, B. (1997). Dynamic modeling of chaotic time series. In R. Greiner, T. Petsche, & S. J. Hanson (Eds.), Computational Learning Theory and Natural Learning Systems, volume IV of Making Learning Systems Practical, chapter 9. Cambridge, Mass.: The MIT Press. Drucker, H. & Le Cun, Y. (1992). Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6). Eckmann, J. P. & Ruelle, D. (1985). Ergodic theory of chaos and strange attractors. Rev. Mod. Phys., 57. Flake, G. W. (1999). An industrial strength neural optimization development engine. Technical report, NEC Research Institute. Hinton, G. E. (1986). Learning distributed representations of concepts. In Proc. Eighth Annual Conf. of the Cognitive Science Society. Hillsdale, NJ: Erlbaum. Johnston, R. L. (1971). Gershgorin theorems for partitioned matrices. Linear Algebra and its Applications, 4. Parra, L., Deco, G., & Miesbach, S. (1995). Redundancy reduction with information preserving nonlinear maps. Network: Computation in Neural Systems, 6. Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1). Pearlmutter, B. A. (1995). Gradient calculation for dynamic recurrent neural networks: a survey. IEEE Transactions on Neural Networks, 6(5). Poggio, T. & Girosi, F. (1990). Networks for approximation and learning. Proc. IEEE, 78(9). Principe, J., Rathie, A., & Kuo, J. (1992). Prediction of chaotic time series with neural networks and the issues of dynamic modeling. International Journal of Bifurcation and Chaos, 2(4). Rico-Martinez, R., Anderson, J. S., & Kevrekidis, I. G. (1996). Self-consistency in neural network-based NLPC analysis with applications to time-series processing. Comp. Chem. Eng., 20, Suppl. B. Seung, H. S. (1998). Learning continuous attractors in recurrent networks.
In Jordan, M. I., Kearns, M. J., & Solla, S. A. (Eds.), Advances in Neural Information Processing Systems, volume 10. MIT Press. Sietsma, J. & Dow, R. J. F. (1988). Neural network pruning: why and how. In Proc. IEEE Int. Conf. on Neural Networks, volume I. New York: IEEE. Simard, P., Victorri, B., Le Cun, Y., & Denker, J. (1992). Tangent prop: a formalism for specifying selected invariances in an adaptive network. In Moody, J. E., Hanson, S. J., & Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, Inc. Thrun, S. B. & Mitchell, T. M. (1993). Integrating inductive neural network learning and explanation-based learning. In Proc. of the Thirteenth Int. Joint Conf. on Artificial Intelligence. Morgan Kaufmann. Thrun, S. B. & Mitchell, T. M. (1995). Learning one more thing. In Proc. of the Fourteenth Int. Joint Conf. on Artificial Intelligence. Morgan Kaufmann. Weigend, A. S., Zimmermann, H. G., & Neuneier, R. (1995). Clearning. In Neural Networks in Financial Engineering (NNCM95). White, H. & Gallant, A. R. (1992). On learning the derivatives of an unknown mapping with multilayer feedforward networks. In H. White (Ed.), Artificial Neural Networks, chapter 1. Cambridge, Mass.: Blackwell. White, H., Hornik, K., & Stinchcombe, M. (1992). Universal approximation of an unknown mapping and its derivative. In H. White (Ed.), Artificial Neural Networks, chapter 6. Cambridge, Mass.: Blackwell. Wilkinson, J. H. (1965). The Algebraic Eigenvalue Problem. Oxford University Press.


More information

Machine Learning. Neural Networks

Machine Learning. Neural Networks Machine Learning Neural Networks Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 Biological Analogy Bryan Pardo, Northwestern University, Machine Learning EECS 349 Fall 2007 THE

More information

Math 123, Week 2: Matrix Operations, Inverses

Math 123, Week 2: Matrix Operations, Inverses Math 23, Week 2: Matrix Operations, Inverses Section : Matrices We have introduced ourselves to the grid-like coefficient matrix when performing Gaussian elimination We now formally define general matrices

More information

Appendix A Taylor Approximations and Definite Matrices

Appendix A Taylor Approximations and Definite Matrices Appendix A Taylor Approximations and Definite Matrices Taylor approximations provide an easy way to approximate a function as a polynomial, using the derivatives of the function. We know, from elementary

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

Ways to make neural networks generalize better

Ways to make neural networks generalize better Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights

More information

Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Neural Networks CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Perceptrons x 0 = 1 x 1 x 2 z = h w T x Output: z x D A perceptron

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Neural Networks Lecture 4: Radial Bases Function Networks

Neural Networks Lecture 4: Radial Bases Function Networks Neural Networks Lecture 4: Radial Bases Function Networks H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011. A. Talebi, Farzaneh Abdollahi

More information

Multilayer Perceptron Learning Utilizing Singular Regions and Search Pruning

Multilayer Perceptron Learning Utilizing Singular Regions and Search Pruning Multilayer Perceptron Learning Utilizing Singular Regions and Search Pruning Seiya Satoh and Ryohei Nakano Abstract In a search space of a multilayer perceptron having hidden units, MLP(), there exist

More information

Keywords- Source coding, Huffman encoding, Artificial neural network, Multilayer perceptron, Backpropagation algorithm

Keywords- Source coding, Huffman encoding, Artificial neural network, Multilayer perceptron, Backpropagation algorithm Volume 4, Issue 5, May 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Huffman Encoding

More information

E.G., data mining, prediction. Function approximation problem: How >9 best estimate the function 0 from partial information (examples)

E.G., data mining, prediction. Function approximation problem: How >9 best estimate the function 0 from partial information (examples) Can Neural Network and Statistical Learning Theory be Formulated in terms of Continuous Complexity Theory? 1. Neural Networks and Statistical Learning Theory Many areas of mathematics, statistics, and

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems

Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Stable Adaptive Momentum for Rapid Online Learning in Nonlinear Systems Thore Graepel and Nicol N. Schraudolph Institute of Computational Science ETH Zürich, Switzerland {graepel,schraudo}@inf.ethz.ch

More information

BACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation

BACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation BACKPROPAGATION Neural network training optimization problem min J(w) w The application of gradient descent to this problem is called backpropagation. Backpropagation is gradient descent applied to J(w)

More information

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści

Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?

More information

Lecture 2: Linear regression

Lecture 2: Linear regression Lecture 2: Linear regression Roger Grosse 1 Introduction Let s ump right in and look at our first machine learning algorithm, linear regression. In regression, we are interested in predicting a scalar-valued

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate

More information

Supervised Learning. George Konidaris

Supervised Learning. George Konidaris Supervised Learning George Konidaris gdk@cs.brown.edu Fall 2017 Machine Learning Subfield of AI concerned with learning from data. Broadly, using: Experience To Improve Performance On Some Task (Tom Mitchell,

More information

HOPFIELD neural networks (HNNs) are a class of nonlinear

HOPFIELD neural networks (HNNs) are a class of nonlinear IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 4, APRIL 2005 213 Stochastic Noise Process Enhancement of Hopfield Neural Networks Vladimir Pavlović, Member, IEEE, Dan Schonfeld,

More information

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations.

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations. POLI 7 - Mathematical and Statistical Foundations Prof S Saiegh Fall Lecture Notes - Class 4 October 4, Linear Algebra The analysis of many models in the social sciences reduces to the study of systems

More information

Fundamentals of Dynamical Systems / Discrete-Time Models. Dr. Dylan McNamara people.uncw.edu/ mcnamarad

Fundamentals of Dynamical Systems / Discrete-Time Models. Dr. Dylan McNamara people.uncw.edu/ mcnamarad Fundamentals of Dynamical Systems / Discrete-Time Models Dr. Dylan McNamara people.uncw.edu/ mcnamarad Dynamical systems theory Considers how systems autonomously change along time Ranges from Newtonian

More information

Linear Programming Redux

Linear Programming Redux Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains

More information

Lecture - 24 Radial Basis Function Networks: Cover s Theorem

Lecture - 24 Radial Basis Function Networks: Cover s Theorem Neural Network and Applications Prof. S. Sengupta Department of Electronic and Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 24 Radial Basis Function Networks:

More information

NEURAL NETWORKS

NEURAL NETWORKS 5 Neural Networks In Chapters 3 and 4 we considered models for regression and classification that comprised linear combinations of fixed basis functions. We saw that such models have useful analytical

More information

APPPHYS217 Tuesday 25 May 2010

APPPHYS217 Tuesday 25 May 2010 APPPHYS7 Tuesday 5 May Our aim today is to take a brief tour of some topics in nonlinear dynamics. Some good references include: [Perko] Lawrence Perko Differential Equations and Dynamical Systems (Springer-Verlag

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Conjugate gradient algorithm for training neural networks

Conjugate gradient algorithm for training neural networks . Introduction Recall that in the steepest-descent neural network training algorithm, consecutive line-search directions are orthogonal, such that, where, gwt [ ( + ) ] denotes E[ w( t + ) ], the gradient

More information

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Spring 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate

More information

Lecture 16: Modern Classification (I) - Separating Hyperplanes

Lecture 16: Modern Classification (I) - Separating Hyperplanes Lecture 16: Modern Classification (I) - Separating Hyperplanes Outline 1 2 Separating Hyperplane Binary SVM for Separable Case Bayes Rule for Binary Problems Consider the simplest case: two classes are

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Lecture 7 Artificial neural networks: Supervised learning

Lecture 7 Artificial neural networks: Supervised learning Lecture 7 Artificial neural networks: Supervised learning Introduction, or how the brain works The neuron as a simple computing element The perceptron Multilayer neural networks Accelerated learning in

More information

Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps

Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps Lucas Parra, Gustavo Deco, Stefan Miesbach Siemens AG, Corporate Research and Development, ZFE ST SN 4 Otto-Hahn-Ring

More information

chapter 12 MORE MATRIX ALGEBRA 12.1 Systems of Linear Equations GOALS

chapter 12 MORE MATRIX ALGEBRA 12.1 Systems of Linear Equations GOALS chapter MORE MATRIX ALGEBRA GOALS In Chapter we studied matrix operations and the algebra of sets and logic. We also made note of the strong resemblance of matrix algebra to elementary algebra. The reader

More information

Artificial Neural Network

Artificial Neural Network Artificial Neural Network Contents 2 What is ANN? Biological Neuron Structure of Neuron Types of Neuron Models of Neuron Analogy with human NN Perceptron OCR Multilayer Neural Network Back propagation

More information

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 1 x 2. x n 8 (4) 3 4 2 MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS SYSTEMS OF EQUATIONS AND MATRICES Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

More information

Margin Maximizing Loss Functions

Margin Maximizing Loss Functions Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing

More information

CHAPTER 3. Pattern Association. Neural Networks

CHAPTER 3. Pattern Association. Neural Networks CHAPTER 3 Pattern Association Neural Networks Pattern Association learning is the process of forming associations between related patterns. The patterns we associate together may be of the same type or

More information

Kernel expansions with unlabeled examples

Kernel expansions with unlabeled examples Kernel expansions with unlabeled examples Martin Szummer MIT AI Lab & CBCL Cambridge, MA szummer@ai.mit.edu Tommi Jaakkola MIT AI Lab Cambridge, MA tommi@ai.mit.edu Abstract Modern classification applications

More information

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Learning features by contrasting natural images with noise

Learning features by contrasting natural images with noise Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,

More information

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja

DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box

More information

Confidence Estimation Methods for Neural Networks: A Practical Comparison

Confidence Estimation Methods for Neural Networks: A Practical Comparison , 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information