Gradient Estimation for Attractor Networks

Thomas Flynn
Department of Computer Science, Graduate Center of CUNY


Abstract

We study several types of neural networks that involve feedback connections. These models lead to interesting problems in optimization and may also be useful in applications. In all of these networks, which include deterministic and stochastic models on both continuous and discrete state spaces, gradient estimation is a dynamic process involving feedback, just like the underlying networks. First we consider optimization of deterministic attractor networks, based on the forward and adjoint sensitivity analysis methods. Then we consider derivative estimation in a stochastic variant of this network, and propose to port the forward sensitivity analysis method to this setting. Thirdly, we consider a stochastic network on a discrete space, known as the Little model. Gradient estimation in special cases of this model, such as the Boltzmann machine and sigmoid belief networks, has been studied, based on closed-form solutions for the resulting probability distributions. In the general case, one only has the Markov kernels that specify the short-term behavior of the model. To enable gradient-based optimization in this setting, we propose to calculate search directions by combining features of simultaneous perturbation analysis and measure valued differentiation. Our aim in studying these models is to expand the set of tools that are available for experimentation on problems of interest in machine learning. In addition to the theoretical results, we are also interested in applying these models to relevant problems in machine learning, such as image classification.

Contents

1 Introduction
  1.1 Notations
2 Deterministic attractor networks
  2.1 Background
  2.2 Research question
3 Stochastic attractor networks
  3.1 Research question
  3.2 Stochastic contraction
  3.3 Approach to the problem
4 Discrete models
  4.1 Background
  4.2 Stationary differentiability using MVD
  4.3 Research question
5 Conclusion and timeline
Bibliography

1 Introduction

Many machine learning problems are approached using gradient-based optimization. At each step of the algorithm, the optimization program calculates the derivative of an objective function with respect to the parameters of the model, in order to determine what direction to move in. In many cases these derivatives cannot be calculated exactly. Difficulties in gradient estimation arise when dealing with stochastic models, and also with deterministic models that involve feedback.

One approach that can be taken with models that involve feedback is to run the gradient estimation and optimization processes simultaneously. The idea is to use an iterative algorithm for computing the gradient, and carry over the state of the estimation procedure after each parameter update, to avoid starting from scratch each time. This type of gradient estimation can be used for sensitivity analysis in attractor networks. This type of neural network, which we formally define in Section 2, is in some ways similar to the usual feed-forward networks. They can be used as input-output maps, sending, for example, an input image to a vector of confidence scores for the different categories that the image could belong to. The main difference is that an attractor network has a dynamic internal state, enabled by allowing feed-back connections among units. Feed-forward networks, on the other hand, exhibit trivial dynamical behavior; they always reach a steady state in a finite number of steps. The capacity for interesting dynamical behavior may translate to information processing capacity. However, training and maintaining the stability of these networks seems difficult. The dynamical nature means that one cannot compute the derivatives of interest as easily as one can in feed-forward networks.

One can construct an optimization algorithm for attractor networks using a dynamic variant of back-propagation based algorithms. We propose to address the problems of how to tune the algorithm to guarantee a function decrease, and how to obtain a long-term guarantee on the optimization algorithm. Some other works (see Figure 1) also considered these models, but the results they obtained for gradient estimation and optimization were mostly heuristic, or were asymptotic. In this setting we seek results that concern finite step sizes, and we also aim to express the results in terms of available model information.

The attractor networks can be generalized by making them stochastic. For instance, one can have stochastic connections between the nodes of the network, so that a node does not always communicate with the same set of neighbors. This could model, for instance, large, distributed neural networks whose units cannot communicate perfectly due to communication delays or synchronization issues. Having stochastic connections has also been conjectured to help with overfitting. In Section 3, we propose to extend the gradient estimation algorithm from deterministic attractor networks to this setting. As shown in Figure 1, some other authors considered restricted cases of such models that have acyclic connectivity.

We then turn our attention in Section 4 to networks that operate on a discrete state space. Several stochastic neural networks on discrete state spaces have been studied, and their gradient estimation procedures are based on having closed form solutions for the resulting probability distributions.

[Figure 1 is a tree diagram; only its caption is reproduced here. Its leaves mark this work and the references [1]-[11].]

Figure 1: Gradient estimation and optimization in different dynamical settings for neural networks. The first level splits into networks that are stochastic or deterministic. At the next level the split is whether the state space of the network is discrete or continuous. At the third level we differentiate based on connectivity constraints: whether the network is acyclic, has symmetric connections, or allows general connectivity. At the fourth level the split is based on the order of updating the nodes: whether they are updated all at once (synchronous) or one at a time, in an asynchronous manner. This work considers gradient estimation for three types of networks. A question mark means the author is unaware of any works considering gradient estimation in models with those properties.

The works [1, 2, 3] (see Fig. 1) depend on constraints on network connectivity, for instance symmetry, or prohibiting cycles. In the case of networks where cycles are allowed, the work [4] considered gradient estimation in the finite horizon setting. Our interest is in the long-term average cost in networks that have general connectivity, where only knowledge of the transition probabilities is available. Methods such as forward sensitivity analysis cannot be used in this case, as they rely on the differential structure of the underlying state space. Instead, we propose an algorithm that computes descent directions based on simultaneous perturbation analysis and measure valued differentiation (MVD).

Along the way we recall a number of relevant mathematical tools. In the deterministic case, to analyze convergence of the forward sensitivity system we use the contraction mapping theorem and a theorem on hierarchies of contracting systems. For considering the long-term behavior of stochastic networks, we use a condition based on contraction in the Wasserstein distance. When considering sensitivity analysis for discrete networks we will use results on MVD for Markov chains to establish differentiability of stationary costs.

1.1 Notations

n - dimensionality of the state space of a model. In a network-based model, this will be the number of nodes in the network.
m - dimensionality of the parameter space of a model.
x - variable for the state of a neural network or other dynamic system.
w_{i,j} - weight from node j into node i.
b_i - bias at node i.
θ - parameter vector. In a neural network model, θ is a pair (w, b), consisting of a weight matrix and a bias vector.
∂g/∂y(y) - for a function g : R^n → R^m, this is the m × n matrix with entries [∂g/∂y(y)]_{i,j} = ∂g_i/∂y_j(y).
t - iteration number of an algorithm.
θ(1), θ(2), ..., θ(t), ... - sequence of parameters generated by an optimization algorithm.
µ - a probability measure representing an external noise source.
π_θ - probability measure depending on a parameter θ.
P(x, A) - the probability a Markov kernel P assigns to the set A from the state x. When the state space is discrete, we will often abuse notation and write P(x, y) instead of P(x, {y}).
ν(e) - expected value of a random variable e under the measure ν; ν(e) = ∫_X e(x) dν(x).
δ_x - point mass centered at x; the measure with δ_x(A) = 1 if x ∈ A and 0 otherwise.

2 Deterministic attractor networks

The work of [11] generated much interest in gradient-based training of feed-forward networks, and shortly after there appeared works studying a wide variety of neural networks. The type of network that we describe in this section, the attractor network, seems to have first been described in [9], [10]. Attractor networks form a class of neural networks that includes feed-forward networks, but also allows for feed-back connections. These are interesting because they can possibly support non-trivial dynamic behavior. It seems that allowing feed-back connections should

increase the computational power of neural networks, perhaps allowing a network with fewer nodes and connections to replace a very big feed-forward network. We review two known algorithms for computing derivatives in these networks, called adjoint sensitivity analysis and forward sensitivity analysis. Our proposed work is to analyze an optimization algorithm based on these gradient estimators. The theoretical analyses in previous works were very limited.

2.1 Background

We now define an attractor network. The state space is X = R^n, and there is a set of edges that determines the connectivity between nodes. We let x = (x_1, ..., x_n) denote a state of the neural network. The weights w ∈ R^{n×n} and biases b ∈ R^n are the parameters of the network. The number w_{i,j} is the weight into node i from node j; a value of w_{i,j} = 0 means there is no connection to node i from node j. We let Θ = R^{n×n} × R^n represent the joint parameter space. The following function f : X × Θ × X → X gives the next state of the network given the current state, parameters, and input:

    f_i(x, (w, b), u) = σ( ∑_{j=1}^n w_{i,j} x_j + b_i + u_i ),  i = 1, 2, ..., n    (1)

That is, the state of node i at time t+1 is determined by the input u_i at that node and the states of its neighbors at time t. Typical choices for the function σ are the logistic function σ(x) = 1/(1 + exp(−x)) or the hyperbolic tangent σ(x) = (1 − exp(−2x))/(1 + exp(−2x)).

By iterating (1) starting from an initial point x(0), one obtains a sequence of network states x(1), x(2), ... where x(t+1) = f(x(t), θ, u). Alternatively, we may write x(t+1) = f^{t+1}(x(0), θ, u). A network has a globally attractive fixed-point when there is a state x* that is a fixed-point for f, meaning f(x*, θ, u) = x*, and this fixed-point is globally attractive, meaning

    lim_{t→∞} f^t(x(0), θ, u) = x*

for any initial point x(0). We will also use the name fixed-point networks to refer to attractor networks. In general, this fixed-point will depend on the parameters θ = (w, b) and u, and to make this explicit we will write x*(θ, u).

The question of whether a network admits a globally attractive fixed-point for a given value of the parameters w, b seems to be difficult; for one type of attractor network known as the Hopfield network, various hardness results have been obtained for this question [12]. There are several known sufficient conditions, however. For instance, a globally attractive fixed-point exists when the weights of the network are small in some sense. Formally, for any norm ‖·‖ on R^n, there will be contraction when

    sup_{x ∈ X} ‖∂f/∂x(x)‖ < 1    (2)
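To make the preceding definitions concrete, the following is a minimal numpy sketch of the update rule (1) and of fixed-point iteration; the network size, weight scale, and stopping tolerance are illustrative choices, not taken from this proposal.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))  # logistic nonlinearity

def f(x, w, b, u):
    # One synchronous update of the network state, Equation (1).
    return sigma(w @ x + b + u)

def fixed_point(w, b, u, x0, tol=1e-10, max_iter=10000):
    # Iterate the map until successive states are close; under the
    # contraction condition (2) this converges to x*(theta, u).
    x = x0
    for _ in range(max_iter):
        x_next = f(x, w, b, u)
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x

rng = np.random.default_rng(0)
n = 5
w = 0.5 * rng.standard_normal((n, n))   # small weights encourage contraction
b = rng.standard_normal(n)
u = rng.standard_normal(n)
x_star = fixed_point(w, b, u, np.zeros(n))
print(np.linalg.norm(f(x_star, w, b, u) - x_star))  # ~0 at the fixed point
```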

If we specialize to the form of f in Equation (1), we can get a more specific condition. If the norm is a so-called absolute norm, which means that ‖(u_1, ..., u_n)‖ = ‖(|u_1|, ..., |u_n|)‖, then contraction will occur when

    ‖w‖ < 1/‖σ′‖_∞

Norms that are absolute include the norms ‖·‖_p for p ∈ {1, 2, ..., ∞}. When σ is the logistic function, ‖σ′‖_∞ = 1/4, and the condition becomes ‖w‖ < 4.

Let e : X → R be a loss function. The problem is to minimize the loss at the fixed-point:

    min_θ J(θ)    (3)

where J(θ) = e(x*(θ, u)). From now on we will consider the input u as fixed and will drop it from the notation.

For the class of feed-back networks that converge to a unique equilibrium, the basic methodology that is used in feed-forward networks can still be applied. For these networks the derivatives of interest can be computed using a protocol in which information is transmitted along the existing links of the network, and on backwards connections, with a dynamical variant of the backpropagation algorithm. Some early works that considered these networks were [13], [10], [14], [9], [15].

We now derive the optimization algorithm that we are interested in. The exposition is somewhat similar to that in the works [16], [17]. The differentiability of J follows by the implicit function theorem and the chain rule. That is, starting from the equation

    x*(θ) = f(x*(θ), θ)    (4)

and using the contractivity and differentiability properties of f, one can conclude that x*(θ) is differentiable. Then, as long as e is differentiable, we can apply the chain rule to get that J is differentiable. Using this and the formulas provided by the implicit function theorem, one obtains

    ∂J/∂θ(θ) = A(θ)B(θ)C(θ)    (5)

where

    A(θ) = ∂e/∂x(x*(θ)),  B(θ) = ( I − ∂f/∂x(x*(θ), θ) )^{−1},  C(θ) = ∂f/∂θ(x*(θ), θ)

From this formula, we can see two challenges to computing, or even approximating, the derivative. The first is that these terms involve x*(θ), which can only be approximated by iteration. Secondly, they involve the solution of linear systems (one can choose between either A(θ)B(θ) or B(θ)C(θ)). Below we describe two iterative algorithms that can address these problems.
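Before turning to the iterative schemes, formula (5) can be evaluated directly when the Jacobians are available in closed form. The sketch below does this for the network (1), under simplifying assumptions that are ours rather than the proposal's: only the biases b are treated as parameters, and the loss is e(x) = ‖x‖².

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    s = sigma(x)
    return s * (1.0 - s)

# Gradient of J(theta) = e(x*(theta)) via Equation (5), for the network (1),
# with the biases b taken as the parameters (weights held fixed, for brevity).
def gradient_at_fixed_point(w, b, u, e_grad, x0, iters=5000):
    x = x0
    for _ in range(iters):                 # approximate x*(theta) by iteration
        x = sigma(w @ x + b + u)
    a = w @ x + b + u
    dfdx = dsigma(a)[:, None] * w          # df/dx at the fixed point
    dfdb = np.diag(dsigma(a))              # df/db at the fixed point
    A = e_grad(x)                          # A = de/dx(x*)
    n = len(b)
    AB = np.linalg.solve((np.eye(n) - dfdx).T, A)  # AB = A (I - df/dx)^{-1}
    return AB @ dfdb                       # the product A B C

rng = np.random.default_rng(1)
n = 4
w = 0.4 * rng.standard_normal((n, n))
b, u = rng.standard_normal(n), rng.standard_normal(n)
e_grad = lambda x: 2 * x                   # gradient of e(x) = ||x||^2
print(gradient_at_fixed_point(w, b, u, e_grad, np.zeros(n)))
```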

We can calculate the term A(θ)B(θ)C(θ) by computing A(θ)B(θ) and then post-multiplying by C(θ), or we can compute B(θ)C(θ) and pre-multiply by A(θ). In each case, one can use an iterative solver to jointly solve the fixed-point equation (4) and deal with the matrix inverse. These approaches are referred to as adjoint sensitivity analysis and forward sensitivity analysis, respectively. The derivations are somewhat symmetric; we focus on the adjoint method in what follows. An early work to mention this method of sensitivity analysis, for the finite time case, was [18].

Let the space Z be R^n × R^n, and define the map T_Adj : Z × Θ → Z as

    T_Adj((x, y), θ) = ( f(x, θ),  y ∂f/∂x(x, θ) + ∂e/∂x(x) )    (6)

Assuming that T_Adj possesses a fixed-point z*(θ) = (x*(θ), y*(θ)), it is easy to verify that x*(θ) is a fixed-point for f and

    y*(θ) = A(θ)B(θ)

Therefore, if we could obtain the fixed-point z*(θ) we could easily compute the gradient (5), since ∂J/∂θ(θ) = G(z*(θ), θ), where

    G((x, y), θ) = y ∂f/∂θ(x, θ)    (7)

This map T_Adj is essentially doing the same type of gradient estimation as in the back-propagation procedure for neural networks. In the case that the network has no cycles, the gradient estimation converges in a finite number of steps. For example, if f describes a feed-forward network with k layers, then a fixed-point of T_Adj will be reached by iterating for k steps. If there are cycles, then under certain contraction assumptions on f, the operator T_Adj also satisfies a contraction property. This can be verified using the condition on the derivative in Inequality (2), for an appropriate choice of norm on the space Z. This is discussed in [14] and other works concerning attractor networks.

If T_Adj defines a globally attractive process on Z, this gives an iterative method to estimate the gradient: iterate T_Adj enough times starting from an arbitrary point (x_0, y_0) to obtain a point (x_M, y_M) close to (x*(θ), y*(θ)), and then form the estimate G((x_M, y_M), θ). By continuity properties of f, it should be that G((x_M, y_M), θ) ≈ G(z*(θ), θ), where the quantity on the right is the true gradient. The pseudocode for this procedure, termed adjoint sensitivity analysis, is given in Algorithm 1.

Algorithm 1: Deterministic adjoint sensitivity analysis

  Define T_Adj : R^n × R^n × Θ → R^n × R^n as
      T_Adj((x, y), θ) = ( f(x, θ),  y ∂f/∂x(x, θ) + ∂e/∂x(x) )
  for t = 0, 1, ..., M − 1 do
      (x(t+1), y(t+1)) = T_Adj((x(t), y(t)), θ)
  end
  Set Δ_Adj = y(M) ∂f/∂θ(x(M), θ)
  return Δ_Adj

The output of Algorithm 1 is the gradient estimate Δ_Adj. It has the property that Δ_Adj → ∂J/∂θ(θ) as M → ∞, as we have explained above. By a very similar argument, one justifies the forward sensitivity analysis procedure:

Algorithm 2: Deterministic forward sensitivity analysis

  Define T_Fwd : R^n × R^{n×m} × Θ → R^n × R^{n×m} as
      T_Fwd((x, u), θ) = ( f(x, θ),  ∂f/∂x(x, θ) u + ∂f/∂θ(x, θ) )
  for t = 0, 1, ..., M − 1 do
      (x(t+1), u(t+1)) = T_Fwd((x(t), u(t)), θ)
  end
  Set Δ_Fwd = ∂e/∂x(x(M)) u(M)
  return Δ_Fwd

Just like adjoint sensitivity analysis, the output of Algorithm 2 has the property that Δ_Fwd → ∂J/∂θ(θ) as M → ∞. The difference is that T_Fwd has as a fixed-point the pair (x*(θ), u*(θ)), where u*(θ) = B(θ)C(θ).
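The following is a minimal numpy sketch of Algorithm 1 for the network (1), again under our simplifying assumptions that the parameters are the biases and e(x) = ‖x‖². As M grows, its output should approach the closed-form gradient computed in the earlier sketch.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    s = sigma(x)
    return s * (1.0 - s)

# Algorithm 1 for the network (1), with biases b as parameters and e(x) = ||x||^2.
def adjoint_sensitivity(w, b, u, M):
    n = len(b)
    x, y = np.zeros(n), np.zeros(n)
    for _ in range(M):
        a = w @ x + b + u
        dfdx = dsigma(a)[:, None] * w       # df/dx(x(t), theta)
        dedx = 2 * x                        # de/dx(x(t))
        x, y = sigma(a), y @ dfdx + dedx    # one application of T_Adj
    a = w @ x + b + u
    dfdb = np.diag(dsigma(a))               # df/dtheta for theta = b
    return y @ dfdb                         # Delta_Adj = y(M) df/dtheta(x(M), theta)
```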

A very important point is that the map T_Adj operates in the space R^n × R^n, while the parameter θ is in R^m. This means that the computationally intensive dynamical part of gradient estimation happens on a state space of dimensionality 2n. The forward sensitivity analysis involves the state space R^n × R^{n×m}, which is of dimensionality n(m + 1). By a crude criterion that equates small dimensionality of the state space with algorithmic efficiency, adjoint sensitivity analysis has a clear advantage. In neural networks, for instance, the costs of computing an arbitrary entry of ∂f/∂x, ∂f/∂θ, or ∂e/∂x are roughly the same, and adjoint sensitivity analysis (which the back-propagation algorithm is based on) is the clear choice. In practice, the comparison might be more subtle. In any case, in this setting of deterministic contracting systems on a continuous state space, there is a choice between two alternatives. It is of great interest to know whether algorithms with these properties can be realized in other contexts, such as stochastic systems on continuous space, or systems on discrete spaces.

We turn these gradient estimation procedures into optimization algorithms by running the gradient estimation and parameter update processes in parallel. The first, presented in Algorithm 3, is based on forward sensitivity analysis:

Algorithm 3: Optimization using forward sensitivity analysis

  for t = 0, 1, ... do
      (x(t+1), u(t+1)) = T_Fwd((x(t), u(t)), θ(t))
      θ(t+1) = θ(t) − ε ∂e/∂x(x(t+1)) u(t+1)
  end

The second, Algorithm 4, is based on adjoint sensitivity analysis:

Algorithm 4: Optimization using adjoint sensitivity analysis

  for t = 0, 1, ... do
      (x(t+1), y(t+1)) = T_Adj((x(t), y(t)), θ(t))
      θ(t+1) = θ(t) − ε y(t+1) ∂f/∂θ(x(t+1), θ(t))
  end

These algorithms, which couple gradient estimation with optimization, are also well-known in design optimization, where they are used in aerospace applications [19], [20], [21]. In that case, the gradient is difficult to compute because it involves the numerical solution of a PDE, and T_Adj or T_Fwd represents some kind of numerical solver with a contraction property.
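Below is a minimal sketch of Algorithm 4 under the same assumptions as before (bias-only parameters, e(x) = ‖x‖²; the step size and iteration count are arbitrary). Note how a single T_Adj step is interleaved with each parameter update, so the gradient estimate is carried over rather than recomputed from scratch.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    s = sigma(x)
    return s * (1.0 - s)

# Algorithm 4 for the network (1), optimizing only the biases. Keeping the
# weights w fixed preserves the contraction rate as optimization proceeds.
def optimize_biases(w, b, u, eps=0.02, steps=5000):
    n = len(b)
    x, y = np.zeros(n), np.zeros(n)
    for _ in range(steps):
        a = w @ x + b + u
        x, y = sigma(a), y @ (dsigma(a)[:, None] * w) + 2 * x   # T_Adj step
        b = b - eps * y * dsigma(w @ x + b + u)                 # theta update
    return b
```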

2.2 Research question

The issue in analyzing either of the procedures is to choose the step size, and to determine what conditions, if any, we must place on the initial state. We will focus on Algorithm 4 in the rest of this section, but the discussion applies equally to Algorithm 3. In this case the initial state is (x(0), y(0), θ(0)). There are two considerations when choosing the step size ε. The first is that, given a descent direction, only values that are sufficiently small will guarantee a function decrease. This is an important consideration in any gradient-based algorithm. The second issue is related to the dynamic nature of the gradient estimation. If we start out with a good gradient estimate and ε is small, then the parameter hardly changes, and perhaps at the next iteration only a single step of gradient estimation will suffice to get a good search direction, since by continuity (of both the objective function and the gradient estimation process) we will be starting from a good approximation. The type of theorem we would like to prove about the algorithm is as follows:

Theorem 2.1. Under some reasonable assumptions, there is a choice of constants ε, δ, depending on available problem information, so that if z(0) = (x(0), y(0)) satisfies

    ‖z(0) − z*(θ(0))‖ < δ    (8)

then the sequence θ(t) generated by Algorithm 4, starting from the point (x(0), y(0), θ(0)) and using the step size ε, satisfies these properties:
i. J(θ(t+1)) < J(θ(t)),
ii. ‖∂J/∂θ(θ(t))‖ → 0 as t → ∞.

The criterion (8) means that the initial gradient estimate is accurate. This can be achieved by running the gradient estimation algorithm for a number of iterations before starting the parameter adaptation process. Property (i) is just that the objective function decreases at each step. Property (ii) states that the magnitude of the gradient tends to zero. Essentially, these two conditions state that the function values are decreasing and are not decreasing too slowly.

We give a few remarks on the challenges in proving Theorem 2.1. First, we will specify conditions that guarantee the optimization problem is well-defined. This involves contraction and differentiability conditions on the underlying system. Then, to show that the gradient estimation works, we should verify our claim that the procedure map T_Adj is a contraction mapping and converges to a unique fixed-point. When this estimation procedure is used inside an optimization algorithm (Algorithm 4), the exact gradients will not be used, so we will find a condition on the error in the gradient that guarantees convergence, and then determine how to tune the gradient procedure to maintain this condition during optimization.

There are a number of relevant works that attempt to analyze adjoint-based optimization procedures. The work of [22] analyzed a continuous-time version of the algorithm and obtained local convergence results using methods for singularly perturbed systems [23]. The results were local in the sense that they assume optimization begins near an attracting local minimum. They were also not constructive, meaning they did not quantify the requirements on the algorithm, such as time-scale parameters. In [24] the same authors considered the algorithm in the discrete-time setting, but did not pursue a convergence analysis.

A related set of works considers Hebbian learning in neural networks. This is a type of learning process for a neural network that is not explicitly gradient based, instead being motivated by the neuroscientific theory of Hebbian learning. The similarity is that both involve simultaneous processes of adaptation and underlying network dynamics, and the need to consider the relative rate of the two activities. Convergence of a continuous-time Hebbian learning process was discussed in [25]. The work of [26] also considered convergence of this type of algorithm, again using the singular perturbation methods of [23].

As mentioned above, algorithms that involve joint adjoint-based derivative estimation and optimization are used in other optimization fields as well. In PDE-constrained optimization this is called the one-shot approach [27, 21]. However, the results in these works also assume that the algorithm starts near a global minimum.

There are a number of applications that could be of interest. One idea is inspired by reservoir networks [28]. We can consider a hierarchical network, the first component being a random attractor network, and the second being a regular feed-forward layer. We would keep the random weights fixed and only optimize the output weights. The above results require a uniform contraction property (that is, the rate of convergence of the system should stay constant as optimization proceeds). This is accommodated by keeping the feed-back weights fixed.

3 Stochastic attractor networks

The neural networks we described in the previous section are deterministic: to each input they associate one output. There are a number of situations in which one may be interested in stochastic networks. For instance, if one is dealing with a very large network that has to be spread across multiple computers, this can cause random delays or synchronization issues. We can model this with a neural network that has noisy connections, meaning the set of neighbors of each node is random at each time step. Additionally, a noisy network can represent a one-to-many mapping, which could be useful when there are multiple interpretations of a given image, for instance.

In this section we describe a class of stochastic networks with feedback, and consider the corresponding optimization problem. When the network is well behaved, the process will be ergodic and the problem is to optimize the long-term behavior. An interesting question is whether the gradient estimation procedures we have described, the adjoint and forward sensitivity analyses (Algorithms 1 and 2), can be applied in this setting. We propose to study this question, and to derive a useful gradient estimator for this model.

As in the deterministic attractor networks, there are three issues that must be addressed in a comprehensive treatment of these models. Firstly, we specify how these networks determine an input/output mapping; this lets us define the objective function. Secondly, there is the issue of gradient estimation, meaning we need to find an algorithm for computing the derivative of this function. The third step is to define an optimization algorithm that combines the gradient estimation procedure with a parameter update step.

The output of the network we are interested in is the long-term average behavior. To guarantee that the network has a regular long-term behavior that is independent of its starting point, we will use a condition which is a stochastic analogue of the contraction condition we saw in the previous section.

The general form of our stochastic neural networks is as follows. Let ξ(1), ξ(2), ... be an infinite sequence of independent, identically distributed (i.i.d.) Ξ-valued random variables distributed according to a measure µ. Each ξ(t) could be, for instance, a vector of uniform

random variables in [0, 1]. Then the state at time t+1, denoted x(t+1), is obtained from the state x(t), the parameters θ, and the random input ξ(t+1) as

    x(t+1) = f(x(t), θ, ξ(t+1))    (9)

For instance, the ξ(t) could be a matrix of Bernoulli random variables that determine which connections are activated, in the random connection model we described above. We interpret ξ_{i,j} = 1 to mean the connection from j to i is activated, and ξ_{i,j} = 0 to mean the connection is disabled. In this case, f takes the form

    f_i(x, θ, ξ) = σ( ∑_{j=1}^n ξ_{i,j} w_{i,j} x_j + b_i )

We are interested in the problem of optimizing the long-term average behavior of recursions such as (9). This reduces to the deterministic attractor problem in case there is no noise (f does not depend on ξ).

To define the objective function, we need to put some restrictions on our process so that it does in fact have a regular long-term behavior. The condition will be that the Markov chain defined by (9) possesses a unique invariant measure π_θ, and that given any distribution of the initial state x(0), the distributions of the states x(1), x(2), ... converge to π_θ. We can formally state this as follows. Let P_θ be the Markov kernel associated to the stochastic system (9). Given that one is in state x, the quantity P_θ(x, A) is the probability that the random next state f(x, θ, ξ) is in the set A. Formally, letting µ be the noise measure,

    P_θ(x, A) = µ({ξ : f(x, θ, ξ) ∈ A})

For a function e : X → R we denote by P_θ e the function

    (P_θ e)(x) = ∫_X e(y) d(δ_x P_θ)(y) = ∫_Ξ e(f(x, θ, ξ)) dµ(ξ)

This number (P_θ e)(x) is the expectation of the value of e at the next state of the network, given that we start in state x. For a measure ν on X we denote by νP_θ the measure

    (νP_θ)(A) = ∫_X P_θ(x, A) dν(x)

Then the system (9) possesses a stationary measure π_θ when

    π_θ = π_θ P_θ    (10)

Instead of using the terminology globally attractive, we will say that the network is ergodic. This means that (10) holds and, for any initial measure ν,

    νP_θ^t → π_θ as t → ∞.    (11)
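As an illustration, the following sketch simulates the random-connection network and estimates the stationary cost π_θ(e) by a long-run average; the connection probability p and the burn-in length are illustrative assumptions, not values from the proposal.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

# One step of the random-connection network: xi is a Bernoulli mask over edges.
def f(x, w, b, xi):
    return sigma((xi * w) @ x + b)

# Monte Carlo estimate of the stationary cost pi_theta(e) from one long run,
# discarding an initial transient (the burn-in length is an arbitrary choice).
def stationary_cost(w, b, e, p=0.8, steps=50000, burn=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    total = 0.0
    for t in range(steps):
        xi = rng.random((n, n)) < p        # each connection active w.p. p
        x = f(x, w, b, xi)
        if t >= burn:
            total += e(x)
    return total / (steps - burn)
```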

The type of convergence in (11) is weak convergence. A sequence of measures µ_t converges weakly to a measure µ if µ_t(e) → µ(e) as t → ∞ for all bounded, Lipschitz functions e : X → R. The optimization problem we are interested in is, given a cost function e : X → R,

    min_{θ ∈ Θ} J(θ),  where J(θ) = ∫_X e(x) dπ_θ(x)

We approach this in a way that generalizes the deterministic approach described in the previous section. Our concern is with gradient estimation for this problem. The adjoint sensitivity analysis has already been described above, in the context of deterministic attractor networks. In Algorithm 5 we define essentially the same algorithm as described for deterministic networks, only now the algorithm operates in the stochastic environment:

Algorithm 5: Stochastic adjoint sensitivity analysis

  Define T_Adj : R^n × R^n × Θ × Ξ → R^n × R^n as
      T_Adj((x, y), θ, ξ) = ( f(x, θ, ξ),  y ∂f/∂x(x, θ, ξ) + ∂e/∂x(x) )
  for t = 0, 1, ..., M − 1 do
      (x(t+1), y(t+1)) = T_Adj((x(t), y(t)), θ, ξ(t+1))
  end
  Set Δ_Adj = y(M) ∂f/∂θ(x(M), θ, ξ(M))
  return Δ_Adj

Alternatively, the method of forward sensitivity analysis leads to the gradient estimation procedure shown in Algorithm 6.

Algorithm 6: Stochastic forward sensitivity analysis

  Define T_Fwd : R^n × R^{n×m} × Θ × Ξ → R^n × R^{n×m} as
      T_Fwd((x, u), θ, ξ) = ( f(x, θ, ξ),  ∂f/∂x(x, θ, ξ) u + ∂f/∂θ(x, θ, ξ) )
  for t = 0, 1, ..., M − 1 do
      (x(t+1), u(t+1)) = T_Fwd((x(t), u(t)), θ, ξ(t+1))
  end
  Set Δ_Fwd = ∂e/∂x(x(M)) u(M)
  return Δ_Fwd
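Below is a sketch of Algorithm 6 for the random-connection model, with the biases as parameters and e(x) = ‖x‖² (our simplifying assumptions). Averaging Δ_Fwd over independent runs gives an estimate of the sought derivative, in the sense of the research question stated in Section 3.1 below.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigma(x):
    s = sigma(x)
    return s * (1.0 - s)

# Algorithm 6 for the random-connection network, with the biases b as the
# parameters. The returned vector is one sample of Delta_Fwd.
def stochastic_forward_sensitivity(w, b, p, M, rng):
    n = len(b)
    x = np.zeros(n)
    u = np.zeros((n, n))                   # sensitivities dx/db
    for _ in range(M):
        xi = rng.random((n, n)) < p        # random connection mask
        a = (xi * w) @ x + b
        dfdx = dsigma(a)[:, None] * (xi * w)
        dfdb = np.diag(dsigma(a))
        x, u = sigma(a), dfdx @ u + dfdb   # one application of T_Fwd
    return 2 * x @ u                       # de/dx(x(M)) u(M), with e(x) = ||x||^2

rng = np.random.default_rng(2)
n = 4
w = 0.3 * rng.standard_normal((n, n))
b = rng.standard_normal(n)
est = np.mean([stochastic_forward_sensitivity(w, b, 0.8, 500, rng)
               for _ in range(200)], axis=0)
```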

3.1 Research question

In each case, we would like to know if the procedures do in fact help to calculate derivatives, as stated in the following:

Theorem 3.1. Let the network function f, the error function e, and the noise µ satisfy some reasonable assumptions. Then the derivative ∂/∂θ π_θ(e) can be computed using Algorithm 5 or Algorithm 6. That is, one of the following holds:
i. The process (x(t+1), u(t+1)) = T_Fwd(x(t), u(t), θ, ξ(t+1)) is ergodic and E[Δ_Fwd] → ∂J/∂θ(θ).
ii. The process (x(t+1), y(t+1)) = T_Adj(x(t), y(t), θ, ξ(t+1)) is ergodic and E[Δ_Adj] → ∂J/∂θ(θ).

3.2 Stochastic contraction

We will define a stochastic version of contraction. This will be useful for analyzing the stochastic sensitivity analysis procedures defined in Algorithms 5 and 6. Consider a stochastic algorithm of the form

    x(t+1) = f(x(t), ξ(t+1))    (12)

where the ξ(t) are i.i.d. µ-distributed random variables. Let P be the corresponding Markov kernel. It is possible to define a metric on the probability measures in such a way that contraction of the Markov kernel P on this metric space reduces to the deterministic notion of contraction in the case that the iterations (12) are actually deterministic, that is, not depending on ξ. This is done through the Wasserstein distance.

There are several equivalent definitions of the Wasserstein distance. A simple way to define it is as follows. For probability measures µ_1, µ_2 on a metric space,

    d(µ_1, µ_2) = sup_{‖e‖_Lip ≤ 1} |µ_1(e) − µ_2(e)| = sup_{‖e‖_Lip ≤ 1} | ∫_X e(x) dµ_1(x) − ∫_X e(x) dµ_2(x) |

To make sure this is well defined we need some requirements on the measures µ_1, µ_2. We define the Wasserstein space P(X) as follows:

    P(X) = { µ : ∫_X ‖x‖ dµ(x) < ∞ }

That is, P(X) consists of all the measures µ such that x ↦ ‖x‖ is µ-integrable. It can be shown that the metric space (P(X), d) is complete when X is complete [29]. That means the contraction mapping theorem can be used to test if a Markov kernel is ergodic: if the

Markov kernel P defines a contraction on this space, then µP^t converges to a unique invariant measure π for any starting measure µ. This is formalized below.

Proposition 3.2. Let P be a Markov kernel on a Polish space X, and let the following contraction condition hold:

    sup_{x_1 ≠ x_2} d(δ_{x_1} P, δ_{x_2} P) / d(x_1, x_2) =: ρ < 1    (13)

Then P has a unique stationary measure π, and for any initial measure µ and any Lipschitz function e,

    d(µP^t, π) ≤ ρ^t d(µ, π),
    |µP^t(e) − π(e)| ≤ ‖e‖_Lip ρ^t d(µ, π).

One can prove the following sufficient condition for contraction in terms of the derivative of f:

Proposition 3.3. Consider the system (12). Let ‖·‖ be a norm on X and suppose that
i. µP ∈ P(X) for all µ ∈ P(X),
ii. sup_x ∫_Ξ ‖∂f/∂x(x, ξ)‖ dµ(ξ) < 1.
Then the corresponding Markov kernel P is a contraction on the space P(X).

Processes that satisfy the conditions of Proposition 3.3 include iterated function systems [30, 31, 32, 33].

3.3 Approach to the problem

We outline our approach to showing, under certain conditions, that the forward sensitivity analysis procedure is valid for stochastic systems. We assume that P_θ is a contraction in the sense of (13). We also need to assume that the function f satisfies certain differentiability requirements, and that f and its derivatives satisfy certain integrability requirements relative to the noise measure µ.

First, we establish differentiability of π_θ. Define ∂/∂θ P_θ to be the linear map sending e to ∂/∂θ (P_θ e). We show that the equation

    l = l P_θ + π_θ ∂/∂θ P_θ    (14)

has a unique solution l*, and that this linear functional must be the derivative of π_θ, that is,

    l*(e) = ∂/∂θ π_θ(e)

Then we can show that, asymptotically, Algorithm 6 represents this linear functional. That is, assuming γ_θ is the stationary distribution of the recursion z(t+1) = T_Fwd(z(t), θ, ξ(t+1)), we can consider the linear functional

    l(e) = ∫_Z (∂e/∂x)(x) u dγ_θ(x, u)

If we can show that l also satisfies equation (14), then we are done, since the only solution is the stationary derivative.

4 Discrete models

In this section we discuss a type of stochastic network on a discrete space termed random threshold networks, also known as the Little model [7]. Gradient estimation has been studied for closely related models, such as the Boltzmann machine and sigmoid belief networks, as we discuss below. This model is somewhat more challenging since there is no known closed-form solution for the stationary distribution. Therefore one has to focus on gradient estimation methods that only use the Markov kernel associated to the process. We review some standard methods that can apply to this setting, including finite differences (FD), simultaneous perturbation analysis (SP), and measure valued differentiation (MVD). We then propose an estimator which combines features of SP and MVD. The algorithm generates a random direction, as in SP, and then uses measure valued differentiation to approximate this directional derivative.

4.1 Background

The earliest neural network models to be studied from the computational view were the deterministic threshold networks [34, 35]. In this model, each unit senses the states of its neighbors, takes a weighted sum of the values, and applies a threshold to determine its next state (either on or off). For single-layer versions of these networks, where the units are partitioned into input and output groups, with connections only from input to output nodes, the corresponding optimization problem can be solved by the perceptron algorithm.

Any iterative algorithm for optimizing threshold networks has to address the credit assignment problem. This means that during optimization, the algorithm must identify which internal components of the network are not working correctly, and adjust those units to improve the output. The difficulty in solving the credit assignment problem for threshold networks with multiple layers prevents simple deterministic threshold models from being used in complex problems like image recognition. There have been a number of well-known approaches to the problem. For instance, one can abandon the threshold units, and work with units that have a smooth, graded response, such as the sigmoid neural networks described above. In this case methods of calculus are available to determine unit sensitivities. These new networks are still deterministic but now operate on a continuous state space.

Another approach is to keep the space discrete but make the network probabilistic, and use the smoothing effects of the noise to obtain a model to which methods of calculus can be applied. One can interpret the sigmoid belief networks in this way. These networks were introduced in [8], and are so named because they combine features of sigmoid neural networks and Bayesian networks. In these networks, when a unit receives a large positive input it is very likely to turn on, while a large negative input means the unit is likely to remain off. In fact, these networks can be interpreted as threshold networks with random thresholds. The use of the sigmoid function, which is the cumulative distribution function (CDF) of the logistic distribution, leads to an interpretation as a network with thresholds drawn from the logistic distribution. In [8], the author derived formulas for the gradient in these networks, and showed how Markov chain Monte Carlo (MCMC) techniques can be used to implement gradient estimators. The networks studied in [8] had a feed-forward architecture, but one could also define variants that allow cycles among the connections. In that case, one would be interested in the long-term average behavior of the network. Such a generalization would resemble the random threshold networks that are our focus. It would be interesting to obtain a gradient estimator for these new networks.

Another motivation to study general random threshold networks comes from the Boltzmann machine [1]. This is a network of stochastic units that are connected symmetrically. This means there is feed-back in the network, and the problem in these networks is to optimize the long-term behavior. The symmetry in the network, and the use of the sigmoid function to calculate the probabilities, lead to a nice closed-form solution for the stationary measure in this model. Based on formulas for the stationary distribution, expressions for the gradient of long-term costs can be obtained, leading to MCMC-based gradient estimators. If one changes the model, by for instance using non-symmetric connections, or changing the type of nonlinearity, these formulas are no longer available. Instead, one winds up with a model like the random threshold networks. This provides another motivation for studying gradient estimation in the Little model.

The networks that we are studying were first defined in [7], and are sometimes referred to as the Little model. They can be interpreted as threshold networks where the thresholds are randomly chosen at each time step. For this reason we also refer to them as Random Threshold Networks (RTNs). We define a random threshold network as follows. Let the network have n nodes, and let ξ(1), ξ(2), ... be a sequence of noise vectors in R^n, with the entire collection {ξ_i(t) : i = 1, ..., n, t = 1, 2, ...} independent and distributed according to the logistic distribution. That is, the CDF of ξ_i(t) is P(ξ_i(t) < x) = σ(x) = 1/(1 + exp(−x)). Define f^RTN : {0, 1}^n × Θ × Ξ → {0, 1}^n as

    f_i^RTN(x, (w, b), ξ) = 1 if ∑_{j=1}^n w_{i,j} x_j + b_i > ξ_i, and 0 otherwise.    (15)

This function f^RTN and the noise ξ(1), ξ(2), ... determine the operation of the random threshold network; from the initial point x(0) one follows the recursion

    x(t+1) = f^RTN(x(t), θ, ξ(t+1))    (16)

to generate the next state. The transition probabilities for the RTN are as follows. We denote by P_θ(x^0, x^1) the probability of going to state x^1 ∈ {0, 1}^n from state x^0 ∈ {0, 1}^n. The function u_i(x, θ), which determines the input to node i at the state x, is defined as

    u_i(x, θ) = ∑_j w_{i,j} x_j + b_i

Then

    P_θ(x^0, x^1) = ∏_{i=1}^n σ(u_i(x^0, θ))^{x^1_i} (1 − σ(u_i(x^0, θ)))^{1 − x^1_i}

Alternatively, we can use the following notation of [8]: for x ∈ {0, 1}, define x̄ = 2x − 1. Using this together with the identity 1 − σ(x) = σ(−x), we get

    P_θ(x^0, x^1) = ∏_{i=1}^n σ( x̄^1_i u_i(x^0, θ) )    (17)

The RTN is related to the two models we discussed above, the sigmoid belief network and the Boltzmann machine. If the connectivity graph is acyclic, then one obtains a model resembling the sigmoid belief network. We can enforce this by requiring w_{i,j} = 0 if i < j. A model like the Boltzmann machine is obtained if the weights are symmetric, meaning w_{i,j} = w_{j,i}. Technically, if one puts a symmetry requirement on our threshold networks, one does not exactly recover the Boltzmann machine, but a variant known as the synchronous or parallel Boltzmann machine [2]. The synchronous Boltzmann machine also has a known, simple stationary distribution, as shown in [2]. We will not attempt to solve for the stationary distribution; rather, we are interested in constructing an optimization procedure using only knowledge of the transition function f^RTN.
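The following sketch simulates one synchronous step of the RTN by sampling logistic thresholds, and empirically checks the transition probability formula (17); the specific states and sample size are arbitrary.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

# One synchronous step of the random threshold network (15)-(16).
# A logistic threshold xi_i is drawn by inverse-CDF from a uniform; node i
# turns on when its input u_i(x, theta) exceeds the threshold, which
# happens with probability sigma(u_i), matching Equation (17).
def rtn_step(x, w, b, rng):
    u = w @ x + b
    v = rng.random(len(b))
    xi = np.log(v / (1.0 - v))             # logistic(0, 1) sample
    return (u > xi).astype(float)

# Empirical check of (17): the frequency of a transition x0 -> x1 should
# approach prod_i sigma((2 x1_i - 1) * u_i(x0, theta)).
rng = np.random.default_rng(3)
n = 3
w, b = rng.standard_normal((n, n)), rng.standard_normal(n)
x0 = np.array([1.0, 0.0, 1.0])
x1 = np.array([0.0, 0.0, 1.0])
hits = sum(np.array_equal(rtn_step(x0, w, b, rng), x1) for _ in range(200000))
u = w @ x0 + b
print(hits / 200000, np.prod(sigma((2 * x1 - 1) * u)))
```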

The differentiability of P_θ is easily established from properties of the sigmoid function. As we are dealing with a discrete space, the general methods available for computing the derivative include finite differences, the score-function method, and measure valued differentiation. We briefly review these now.

The finite difference estimator for the derivative of the stationary cost is shown in Algorithm 7. In the case of a single, real-valued parameter, the finite difference method uses two copies of the system to estimate the derivative at a particular point θ(0). One copy of the system runs with the setting θ(0) + λ, and one uses θ(0) − λ. After running for a long time, the error in both of these systems is sampled, and a difference quotient is formed as the gradient estimate. The extension to m parameters involves replicating the procedure m times, once for each coordinate direction. Typically, when dealing with an n-node neural network there are n² parameters. Therefore finite differences would require running 2n² copies of the network, which is infeasible. Furthermore, it has unfavorable variance properties.

Algorithm 7: Finite difference derivative estimation for Markov chains

  Let v_1, v_2, ..., v_m be the coordinate basis vectors in R^m.
  Define T_FD : X^m × X^m × R^m × Ξ → X^m × X^m as
      T_FD_i(x, y, θ, ξ) = ( f(x_i, θ + λv_i, ξ), f(y_i, θ − λv_i, ξ) ),  i = 1, 2, ..., m.
  for t = 0, 1, ..., M − 1 do
      (x(t+1), y(t+1)) = T_FD(x(t), y(t), θ, ξ(t+1))
  end
  Set Δ_FD = ∑_{i=1}^m [ (e(x_i(M)) − e(y_i(M))) / (2λ) ] v_i
  return Δ_FD

One interesting solution to the cost of estimating derivatives with finite differences is known as simultaneous perturbation [36]. In this scheme, one picks a random direction v, and then approximates the directional derivative along v using stochastic finite differences, as in Algorithm 7. In this case only two simulations are needed for a system with m parameters. The variance issues remain with this approach: in order to decrease the bias of the estimator, one has to accept a larger variance. For generating the directions, one possibility is to let v be a random point on the hypercube {−1/2, 1/2}^m, as suggested in [36]. For the theoretical analysis, it is important that the directions have mean 0, and that the random variable 1/v_i is integrable. The procedure is shown in Algorithm 8.

Algorithm 8: Simultaneous perturbation derivative estimation for Markov chains

  Sample a direction v from the measure
      P(v) = ∏_{i=1}^m [ (1/2) δ_{−1/2}(v_i) + (1/2) δ_{1/2}(v_i) ]    (18)
  Define T_SP : X × X × Ξ → X × X as
      T_SP(x, y, ξ) = ( f(x, θ + λv, ξ), f(y, θ − λv, ξ) )
  for t = 0, 1, ..., M − 1 do
      (x(t+1), y(t+1)) = T_SP(x(t), y(t), ξ(t+1))
  end
  Set Δ_SP = [ (e(x(M)) − e(y(M))) / (2λ) ] v
  return Δ_SP
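Below is a sketch of Algorithm 8 for a generic parametric Markov chain; the assumed interface f(x, θ, ξ), with ξ a vector of uniforms, and the use of common random numbers to drive both chains are our choices, not specified in the proposal.

```python
import numpy as np

# Simultaneous perturbation (Algorithm 8) for a Markov chain
# x(t+1) = f(x(t), theta, xi). Both copies see the same noise xi,
# which typically reduces the variance of the difference quotient.
def sp_estimate(f, theta, e, x0, lam, M, rng):
    m = len(theta)
    v = rng.choice([-0.5, 0.5], size=m)    # direction sampled from (18)
    x, y = x0.copy(), x0.copy()
    for _ in range(M):
        xi = rng.random(len(x0))           # shared noise for both chains
        x = f(x, theta + lam * v, xi)
        y = f(y, theta - lam * v, xi)
    return (e(x) - e(y)) / (2 * lam) * v
```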

The idea of measure valued differentiation is to express the derivative of an expectation as the difference of two expectations. Each of these expectations involves the same cost function of interest, but the underlying measures are different. If these measures are easy to sample from, this leads to a simple, unbiased derivative estimator. Formally, we consider a family of cost functions D and a measure µ_θ that depends on a real parameter θ. For simplicity we will consider the setting of a finite state space X in these definitions. Then µ_θ is said to be D-differentiable at θ if there is a triple (c_θ, µ̇_θ, µ̈_θ), consisting of a real number c_θ and two probability measures µ̇_θ, µ̈_θ on X, such that

    ∂/∂θ µ_θ(e) = c_θ ( µ̇_θ(e) − µ̈_θ(e) )

for any cost function e ∈ D. An MVD gradient estimator would consist of two parts: first, sample a random variable Ẏ distributed according to µ̇_θ; then sample a random variable Ÿ according to µ̈_θ; and finally form the estimate Δ_MVD = c_θ [e(Ẏ) − e(Ÿ)]. Compared to finite differences, the advantage is that there is no bias and there is no division by a small number. For some background see [37]. The following is a simple example.

Example 4.1. Let ν_1, ν_2 be two probability measures on a measurable space X, and define the measure

    µ_θ = e^{−θ²} ν_1 + (1 − e^{−θ²}) ν_2    (19)

that depends on a parameter θ ∈ R. The parameter determines which of the measures ν_i is more likely in this mixture. By simple calculus, it holds that for any bounded measurable function e : X → R,

    ∂/∂θ µ_θ(e) = c_θ [ ν_2(e) − ν_1(e) ],  where c_θ = 2θ e^{−θ²}

Therefore the triple (c_θ, ν_2, ν_1) is an MVD of the measure µ_θ.
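Example 4.1 can be checked numerically. In the sketch below we take, as an arbitrary illustration, ν_1 = N(0, 1), ν_2 = N(2, 1), and e(x) = x², so that ν_1(e) = 1 and ν_2(e) = 5.

```python
import numpy as np

# Numerical check of Example 4.1: the MVD estimator
# c_theta * (e(Y_dot) - e(Y_ddot)), with Y_dot ~ nu_2 and Y_ddot ~ nu_1,
# is an unbiased estimate of the derivative of mu_theta(e).
rng = np.random.default_rng(4)
theta = 0.7
c = 2 * theta * np.exp(-theta ** 2)

N = 200000
y_dot = rng.normal(2.0, 1.0, N)            # samples from nu_2
y_ddot = rng.normal(0.0, 1.0, N)           # samples from nu_1
mvd = c * (y_dot ** 2 - y_ddot ** 2)

exact = c * (5.0 - 1.0)                    # c_theta * (nu_2(e) - nu_1(e))
print(mvd.mean(), exact)                   # the two should agree closely
```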

The concept of MVD can be extended from measures to Markov kernels, and then applied to derivatives of stationary costs. A Markov kernel P_θ that is defined on a discrete space X and depends on a real parameter θ is said to be D-differentiable at θ if for each x ∈ X there is a triple (c_θ(x), Ṗ_θ(x, ·), P̈_θ(x, ·)) which is the measure valued derivative of the measure P_θ(x, ·) for the cost functions D at the parameter θ. If the Markov kernel P_θ is ergodic with stationary measure π_θ, then in certain cases we can use the triples (c_θ(x), Ṗ_θ(x, ·), P̈_θ(x, ·)) to compute the stationary derivative ∂/∂θ π_θ(e) [38]. We now describe this. Let the Markov kernels Ṗ and P̈ be represented by the functions ḟ and f̈, meaning

    Ṗ(x, A) = µ({ξ : ḟ(x, θ, ξ) ∈ A}),  P̈(x, A) = µ({ξ : f̈(x, θ, ξ) ∈ A})

The procedure is presented in Algorithm 9. A theorem about its correctness is given in Theorem 4.2. For measure valued differentiation, as for finite differences, it seems that m simulations are required for a system with m parameters, although the variance characteristics are much more favorable compared to finite differences (see [39], Section 4.3). In finite differences, one must trade off bias against variance, but for MVD the variance can be shown to be bounded independently of the parameters M_1, M_0 which determine the bias.

Algorithm 9: MVD gradient estimation for Markov chains

  for t = 0, 1, ..., M_0 − 1 do
      x(t+1) = f(x(t), θ, ξ(t))
  end
  Set ẋ(0) = ḟ(x(M_0), θ, ξ(M_0)) and ẍ(0) = f̈(x(M_0), θ, ξ(M_0))
  for t = 0, 1, ..., M_1 − 1 do
      (ẋ(t+1), ẍ(t+1)) = ( f(ẋ(t), θ, ξ(M_0 + t + 1)), f(ẍ(t), θ, ξ(M_0 + t + 1)) )
  end
  Set Δ_MVD = c_θ(x(M_0)) ∑_{t=1}^{M_1} [ e(ẋ(t)) − e(ẍ(t)) ]
  return Δ_MVD

4.2 Stationary differentiability using MVD

In this section we recall a theorem on measure valued differentiation for Markov chains. It gives a condition on a Markov kernel P_θ that guarantees the corresponding stationary costs π_θ(e) are differentiable. This result is from [38].

Theorem 4.2. Let (δ_x P_θ)(e) be differentiable for each bounded, Lipschitz continuous e. That is, for each x there is a triple (c(x), Ṗ_θ(x, ·), P̈_θ(x, ·)) such that

    ∂/∂θ (δ_x P_θ)(e) = c(x) [ (δ_x Ṗ_θ)(e) − (δ_x P̈_θ)(e) ]

for each bounded, Lipschitz e. Furthermore, suppose that P_θ is a contraction on the space P(X) in the sense of Inequality (13). Then the stationary cost π_θ(e) is differentiable, and Algorithm 9 can be used to estimate the derivative. Specifically, if we let Δ_MVD be the output of the algorithm, then

    lim_{M_0, M_1 → ∞} E[Δ_MVD] = ∂/∂θ π_θ(e)

More general results on MVD for stationary measures can be found in [40].
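A skeletal sketch of Algorithm 9 follows; the kernel interfaces f, f_dot, f_ddot and the scalar-noise convention are assumptions made for illustration, since the proposal specifies the algorithm only at the level of pseudocode.

```python
# A generic sketch of Algorithm 9. The caller supplies the nominal kernel f,
# the two derivative kernels f_dot and f_ddot, and the weight c(x) from the
# measure valued derivative triple; all signatures here are assumptions.
def mvd_estimate(f, f_dot, f_ddot, c, e, theta, x0, M0, M1, rng):
    x = x0
    for _ in range(M0):                    # run to approximate stationarity
        x = f(x, theta, rng.random())
    xi0 = rng.random()                     # split point: common noise xi0
    x_dot, x_ddot = f_dot(x, theta, xi0), f_ddot(x, theta, xi0)
    total = 0.0
    for _ in range(M1):                    # coupled chains, common noise
        xi = rng.random()
        x_dot, x_ddot = f(x_dot, theta, xi), f(x_ddot, theta, xi)
        total += e(x_dot) - e(x_ddot)
    return c(x) * total                    # c_theta(x(M0)) times the sum
```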

4.3 Research question

Using the above ideas, we propose a gradient estimator that works by picking a random direction, as in SP, and computing the directional derivative using measure valued differentiation. In this way one deals with a small number of simulations, as in SP, while avoiding the variance issues of finite differences. The method is termed simultaneous perturbation measure valued differentiation. The only requirement is that one can compute the measure valued derivative along arbitrary directions.

Let µ_θ be a measure depending on an m-dimensional vector parameter θ, and let v ∈ R^m be a direction. A triple (c_{θ,v}, µ̇_{θ,v}, µ̈_{θ,v}) is called a measure valued directional derivative in the direction v at θ if for all e : X → R,

    ∂/∂θ µ_θ(e) v = c_{θ,v} [ µ̇_{θ,v}(e) − µ̈_{θ,v}(e) ]

In practice, one can try to calculate the MVD in direction v as follows. Consider the measure µ̂_λ = µ_{θ+λv}, which depends on the real parameter λ. Then by basic calculus,

    ∂/∂θ µ_θ(e) v = ∂/∂λ µ̂_λ(e) |_{λ=0}

Therefore, to find the MVD of µ_θ in direction v it suffices to find the normal, scalar MVD of µ̂_λ at λ = 0. This is the approach in the following example.

Example 4.3. Let µ_1, µ_2, ..., µ_m be m mixture components, and for a vector parameter θ ∈ R^m define µ_θ as

    µ_θ = ∑_{i=1}^m (e^{θ_i} / Z_θ) µ_i,  where Z_θ = ∑_{i=1}^m e^{θ_i}

For any function e : X → R and direction v ∈ R^m, we then have

    µ_{θ+λv}(e) = ∑_{i=1}^m ( e^{θ_i + λv_i} / Z_{θ+λv} ) µ_i(e)    (20)

Introduce the notation γ⁺ = max{0, γ} and γ⁻ = −min{0, γ}, so that for any number γ, the identities γ = γ⁺ − γ⁻ and |γ| = γ⁺ + γ⁻ hold. Differentiating (20) at λ = 0 gives

    ∂/∂θ µ_θ(e) v = ∂/∂λ µ_{θ+λv}(e) |_{λ=0} = ∑_{i=1}^m γ_i µ_i(e),

where γ_i = (e^{θ_i}/Z_θ)( v_i − J_θ/Z_θ ) and J_θ = ∑_{j=1}^m e^{θ_j} v_j. Since ∑_i γ_i = 0, collecting the positive and negative parts of these coefficients and doing some algebra yields a representation of the required form:

    ∂/∂θ µ_θ(e) v = c_θ [ ∑_{i=1}^m α_i µ_i(e) − ∑_{i=1}^m β_i µ_i(e) ],

where c_θ = ∑_i γ_i⁺ = ∑_i γ_i⁻, α_i = γ_i⁺ / c_θ, and β_i = γ_i⁻ / c_θ, so that α and β are probability vectors. The triple (c_θ, ∑_i α_i µ_i, ∑_i β_i µ_i) is then a measure valued directional derivative of µ_θ in the direction v.
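The directional derivative derived in Example 4.3 can be checked numerically against a finite difference in λ; the dimensions and the values standing in for µ_i(e) below are arbitrary.

```python
import numpy as np

# Numerical check of Example 4.3: compare the directional derivative of
# mu_theta(e), computed from the gamma_i coefficients, against a central
# finite difference in lambda. Fixed numbers stand in for the mu_i(e).
rng = np.random.default_rng(5)
m = 4
theta, v = rng.standard_normal(m), rng.standard_normal(m)
mu_e = rng.standard_normal(m)              # the values mu_i(e)

def mix_cost(th):
    p = np.exp(th) / np.exp(th).sum()      # softmax mixture weights
    return p @ mu_e

Z = np.exp(theta).sum()
J = np.exp(theta) @ v
gamma = (np.exp(theta) / Z) * (v - J / Z)
analytic = gamma @ mu_e

lam = 1e-6
fd = (mix_cost(theta + lam * v) - mix_cost(theta - lam * v)) / (2 * lam)
print(analytic, fd)                        # the two should agree closely

# The MVD form: split gamma into its positive and negative parts.
c = np.maximum(gamma, 0).sum()
alpha = np.maximum(gamma, 0) / c
beta = np.maximum(-gamma, 0) / c
print(c * (alpha @ mu_e - beta @ mu_e))    # equals the analytic value
```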


Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Machine Learning Basics III

Machine Learning Basics III Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient

More information

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler

Machine Learning and Data Mining. Multi-layer Perceptrons & Neural Networks: Basics. Prof. Alexander Ihler + Machine Learning and Data Mining Multi-layer Perceptrons & Neural Networks: Basics Prof. Alexander Ihler Linear Classifiers (Perceptrons) Linear Classifiers a linear classifier is a mapping which partitions

More information

Ch.6 Deep Feedforward Networks (2/3)

Ch.6 Deep Feedforward Networks (2/3) Ch.6 Deep Feedforward Networks (2/3) 16. 10. 17. (Mon.) System Software Lab., Dept. of Mechanical & Information Eng. Woonggy Kim 1 Contents 6.3. Hidden Units 6.3.1. Rectified Linear Units and Their Generalizations

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University

More information

Extreme Value Analysis and Spatial Extremes

Extreme Value Analysis and Spatial Extremes Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Introduction to Machine Learning Spring 2018 Note Neural Networks

Introduction to Machine Learning Spring 2018 Note Neural Networks CS 189 Introduction to Machine Learning Spring 2018 Note 14 1 Neural Networks Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. In this class,

More information

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Neural networks. Chapter 19, Sections 1 5 1

Neural networks. Chapter 19, Sections 1 5 1 Neural networks Chapter 19, Sections 1 5 Chapter 19, Sections 1 5 1 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 19, Sections 1 5 2 Brains 10

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Probabilistic Models in Theoretical Neuroscience

Probabilistic Models in Theoretical Neuroscience Probabilistic Models in Theoretical Neuroscience visible unit Boltzmann machine semi-restricted Boltzmann machine restricted Boltzmann machine hidden unit Neural models of probabilistic sampling: introduction

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Artificial Neural Networks 2

Artificial Neural Networks 2 CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Artificial Neural Networks

Artificial Neural Networks Introduction ANN in Action Final Observations Application: Poverty Detection Artificial Neural Networks Alvaro J. Riascos Villegas University of los Andes and Quantil July 6 2018 Artificial Neural Networks

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks. MGS Lecture 2 Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

More information

Artificial Neural Network

Artificial Neural Network Artificial Neural Network Eung Je Woo Department of Biomedical Engineering Impedance Imaging Research Center (IIRC) Kyung Hee University Korea ejwoo@khu.ac.kr Neuron and Neuron Model McCulloch and Pitts

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

CSC 411 Lecture 10: Neural Networks

CSC 411 Lecture 10: Neural Networks CSC 411 Lecture 10: Neural Networks Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 10-Neural Networks 1 / 35 Inspiration: The Brain Our brain has 10 11

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Neural Networks Varun Chandola x x 5 Input Outline Contents February 2, 207 Extending Perceptrons 2 Multi Layered Perceptrons 2 2. Generalizing to Multiple Labels.................

More information

CSC242: Intro to AI. Lecture 21

CSC242: Intro to AI. Lecture 21 CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

QUALIFYING EXAM IN SYSTEMS ENGINEERING

QUALIFYING EXAM IN SYSTEMS ENGINEERING QUALIFYING EXAM IN SYSTEMS ENGINEERING Written Exam: MAY 23, 2017, 9:00AM to 1:00PM, EMB 105 Oral Exam: May 25 or 26, 2017 Time/Location TBA (~1 hour per student) CLOSED BOOK, NO CHEAT SHEETS BASIC SCIENTIFIC

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

16 : Markov Chain Monte Carlo (MCMC)

16 : Markov Chain Monte Carlo (MCMC) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Gradient Descent Training Rule: The Details

Gradient Descent Training Rule: The Details Gradient Descent Training Rule: The Details 1 For Perceptrons The whole idea behind gradient descent is to gradually, but consistently, decrease the output error by adjusting the weights. The trick is

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Stanford Machine Learning - Week V

Stanford Machine Learning - Week V Stanford Machine Learning - Week V Eric N Johnson August 13, 2016 1 Neural Networks: Learning What learning algorithm is used by a neural network to produce parameters for a model? Suppose we have a neural

More information

Revision: Neural Network

Revision: Neural Network Revision: Neural Network Exercise 1 Tell whether each of the following statements is true or false by checking the appropriate box. Statement True False a) A perceptron is guaranteed to perfectly learn

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

Different Criteria for Active Learning in Neural Networks: A Comparative Study

Different Criteria for Active Learning in Neural Networks: A Comparative Study Different Criteria for Active Learning in Neural Networks: A Comparative Study Jan Poland and Andreas Zell University of Tübingen, WSI - RA Sand 1, 72076 Tübingen, Germany Abstract. The field of active

More information

Deep Learning Lab Course 2017 (Deep Learning Practical)

Deep Learning Lab Course 2017 (Deep Learning Practical) Deep Learning Lab Course 207 (Deep Learning Practical) Labs: (Computer Vision) Thomas Brox, (Robotics) Wolfram Burgard, (Machine Learning) Frank Hutter, (Neurorobotics) Joschka Boedecker University of

More information

Deep Feedforward Networks. Sargur N. Srihari

Deep Feedforward Networks. Sargur N. Srihari Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Organization. I MCMC discussion. I project talks. I Lecture.

Organization. I MCMC discussion. I project talks. I Lecture. Organization I MCMC discussion I project talks. I Lecture. Content I Uncertainty Propagation Overview I Forward-Backward with an Ensemble I Model Reduction (Intro) Uncertainty Propagation in Causal Systems

More information

Information theoretic perspectives on learning algorithms

Information theoretic perspectives on learning algorithms Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez

More information

Lecture: Local Spectral Methods (1 of 4)

Lecture: Local Spectral Methods (1 of 4) Stat260/CS294: Spectral Graph Methods Lecture 18-03/31/2015 Lecture: Local Spectral Methods (1 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide

More information

Artificial Neural Networks Examination, March 2004

Artificial Neural Networks Examination, March 2004 Artificial Neural Networks Examination, March 2004 Instructions There are SIXTY questions (worth up to 60 marks). The exam mark (maximum 60) will be added to the mark obtained in the laborations (maximum

More information

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92

ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 ARTIFICIAL NEURAL NETWORKS گروه مطالعاتي 17 بهار 92 BIOLOGICAL INSPIRATIONS Some numbers The human brain contains about 10 billion nerve cells (neurons) Each neuron is connected to the others through 10000

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information