Sparse coding via thresholding and local competition in neural circuits


Christopher J. Rozell, Don H. Johnson, Richard G. Baraniuk, Bruno A. Olshausen

Abstract

While evidence indicates that neural systems may be employing sparse approximations to represent sensed stimuli, the mechanisms underlying this ability are not understood. We describe a locally competitive algorithm (LCA) that solves a family of sparse coding problems by minimizing a weighted combination of reconstruction mean-squared error (MSE) and a coefficient cost function. LCAs are designed to be implemented in a dynamical system composed of many neuron-like elements operating in parallel. These algorithms use thresholding functions to induce local (usually one-way) inhibitory competitions between nodes to produce sparse representations. LCAs produce coefficients with sparsity levels comparable to the most popular centralized sparse coding algorithms while being readily suited for neural implementation. Additionally, LCA coefficients for video sequences demonstrate inertial properties that are both qualitatively and quantitatively more regular (i.e., smoother and more predictable) than the coefficients produced by greedy algorithms.

1 Introduction

Natural images can be well-approximated by a small subset of elements from an overcomplete dictionary (Field, 1994; Olshausen, 2003; Candès and Donoho, 2004). The process of choosing a good subset of dictionary elements, along with the corresponding coefficients, to represent a signal is known as sparse approximation. Recent theoretical and experimental evidence indicates that many sensory neural systems appear to employ similar sparse representations in their population codes (Vinje and Gallant, 2002; Lewicki, 2002; Olshausen and Field, 1996, 2004; Delgutte et al., 1998), encoding a stimulus in the activity of just a few neurons. While sparse coding in neural systems is an intriguing hypothesis, the challenge of collecting simultaneous data from large neural populations makes it difficult to evaluate its credibility without testing predictions from a specific proposed coding mechanism. Unfortunately, we currently lack a proposed sparse coding mechanism that is realistically implementable in the parallel architectures of neural systems and produces sparse coefficients that efficiently represent the time-varying stimuli important to biological systems.

Sparse approximation is a difficult non-convex optimization problem that is at the center of much research in mathematics and signal processing. Existing sparse approximation algorithms suffer from one or more of the following drawbacks: 1) they are not implementable in the parallel analog architectures used by neural systems; 2) they have difficulty producing exactly zero-valued coefficients in finite time; 3) they produce coefficients for time-varying stimuli that contain inefficient fluctuations, making the stimulus content more difficult to interpret; or 4) they only use a heuristic approximation to minimizing a desired objective function. We introduce and study a new neurally plausible algorithm based on the principles of thresholding and local competition that solves a family of sparse approximation problems corresponding to various sparsity metrics. In our Locally Competitive Algorithms (LCAs), neurons in a population continually compete with neighboring units using (usually one-way) lateral inhibition to calculate coefficients representing an input signal in an overcomplete dictionary.
Our continuous-time LCA is described by the dynamics of a system of nonlinear ordinary differential equations (ODEs) that govern the internal state (membrane potential) and external communication (short-term firing rate) of units in a neural population. These systems use computational primitives that correspond to simple analog elements (e.g., resistors, capacitors, amplifiers), making them realistic for parallel implementations.

We show that each LCA corresponds to an optimal sparse approximation problem that minimizes an energy function combining reconstruction mean-squared error (MSE) and a sparsity-inducing cost function.

This paper develops a neural architecture for LCAs, shows their correspondence to a broad class of sparse approximation problems, and demonstrates that LCAs possess three properties critical for a neurally plausible sparse coding algorithm. First, we show that the LCA dynamical system is stable, guaranteeing that a physical implementation is well-behaved. Next, we show that LCAs perform their primary task well, finding codes for fixed images with sparsity comparable to the most popular centralized algorithms. Finally, we demonstrate that LCAs display inertia, coding video sequences with coefficient time series that are significantly smoother in time than the coefficients produced by other algorithms. This increased coefficient regularity better reflects the smooth nature of natural input signals, making the coefficients much more predictable and making it easier for higher-level structures to identify and understand the changing content of the time-varying stimulus. Although still lacking in biophysical realism, the LCA methods presented here represent a first step toward a testable neurobiological model of sparse coding. As an added benefit, the parallel analog architecture described by our LCAs could greatly benefit the many modern signal processing applications that rely on sparse approximation. While the principles we describe apply to many signal modalities, we focus here on the visual system and the representation of video sequences.

2 Background and related work

2.1 Sparse approximation

Given an N-dimensional stimulus s ∈ R^N (e.g., an N-pixel image), we seek a representation in terms of a dictionary D composed of M vectors {φ_m} that span the space R^N. Define the l_p norm of the vector x to be ‖x‖_p = (Σ_m |x_m|^p)^{1/p} and the inner product between x and y to be ⟨x, y⟩ = Σ_m x_m y_m. Without loss of generality, assume the dictionary vectors are unit-norm, ‖φ_m‖_2 = 1. When the dictionary is overcomplete (M > N), there are an infinite number of ways to choose coefficients {a_m} such that s = Σ_{m=1}^{M} a_m φ_m. In optimal sparse approximation, we seek the coefficients having the fewest non-zero entries by solving the minimization problem

    min_a ‖a‖_0   subject to   s = Σ_{m=1}^{M} a_m φ_m,    (1)

where the l_0 "norm"¹ denotes the number of non-zero elements of a = [a_1, a_2, ..., a_M]. Unfortunately, this combinatorial optimization problem is NP-hard (Natarajan, 1995). In the signal processing community, two approaches are typically used to find acceptable suboptimal solutions to this intractable problem. The first general approach substitutes an alternate sparsity measure to convexify the l_0 norm. One well-known example is Basis Pursuit (BP) (Chen et al., 2001), which replaces the l_0 norm with the l_1 norm:

    min_a ‖a‖_1   subject to   s = Σ_{m=1}^{M} a_m φ_m.    (2)

Despite this substitution, BP has the same solution as the optimal sparse approximation problem (Donoho and Elad, 2003) if the signal is sparse compared to the most similar pair of dictionary elements (e.g., ‖a‖_0 < min_{m≠n} (1/2)[1 + 1/|⟨φ_m, φ_n⟩|]).
¹ While clearly not a norm in the mathematical sense, we will use this terminology because it is prevalent in the literature.
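To make the convex relaxation in (2) concrete, the following sketch solves the BP program for a small random dictionary by recasting it as a linear program (splitting a into positive and negative parts). This is our own illustrative reconstruction, not code from the paper; the dictionary, signal, and use of scipy.optimize.linprog are assumptions made here for demonstration.

```python
# Sketch: Basis Pursuit (2) as a linear program, min ||a||_1 s.t. Phi a = s.
# Assumed setup (not from the paper): random unit-norm dictionary, synthetic sparse signal.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, M, K = 20, 40, 3                      # signal dimension, dictionary size, true sparsity
Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0)       # unit-norm columns, as assumed in Section 2.1

a_true = np.zeros(M)
a_true[rng.choice(M, K, replace=False)] = rng.standard_normal(K)
s = Phi @ a_true

# Split a = a_plus - a_minus with a_plus, a_minus >= 0, so ||a||_1 = 1^T (a_plus + a_minus).
c = np.ones(2 * M)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=s, bounds=(0, None))
a_bp = res.x[:M] - res.x[M:]

print("nonzeros in BP solution:", int(np.sum(np.abs(a_bp) > 1e-6)))
print("max deviation from the true sparse coefficients:", float(np.max(np.abs(a_bp - a_true))))
```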

In practice, the presence of signal noise often leads to a modified approach called Basis Pursuit De-Noising (BPDN) (Chen et al., 2001) that makes a tradeoff between reconstruction mean-squared error (MSE) and sparsity in an unconstrained optimization problem:

    min_a ‖s − Σ_{m=1}^{M} a_m φ_m‖_2^2 + λ ‖a‖_1,    (3)

where λ is a tradeoff parameter. BPDN provides the l_1-sparsest approximation for a given reconstruction quality. Many algorithms can be used to solve the BPDN optimization problem, with interior-point-type methods being the most common choice.

The second general approach employed by signal processing researchers uses iterative greedy algorithms to constructively build up a signal representation (Tropp, 2004). The canonical example of a greedy algorithm² is known in the signal processing community as Matching Pursuit (MP) (Mallat and Zhang, 1993). The MP algorithm is initialized with a residual r_0 = s. At the k-th iteration, MP finds the index of the single dictionary element best approximating the current residual signal, θ_k = argmax_m |⟨r_{k−1}, φ_m⟩|. The coefficient d_k = ⟨r_{k−1}, φ_{θ_k}⟩ and index θ_k are recorded as part of the reconstruction, and the residual is updated, r_k = r_{k−1} − d_k φ_{θ_k}. After K iterations, the signal approximation using MP is given by ŝ = Σ_{k=1}^{K} d_k φ_{θ_k}. Though they may not be optimal in general, greedy algorithms often efficiently find good sparse signal representations in practice.
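The greedy update just described can be summarized in a few lines. The sketch below is our own minimal reconstruction of MP under assumed inputs (a generic unit-norm dictionary Phi and signal s as NumPy arrays); it is not code from the paper.

```python
# Sketch: Matching Pursuit as described above (greedy selection + residual update).
import numpy as np

def matching_pursuit(Phi, s, num_iters):
    """Return MP coefficients for signal s in dictionary Phi (unit-norm columns)."""
    a = np.zeros(Phi.shape[1])
    r = s.copy()                              # residual r_0 = s
    for _ in range(num_iters):
        theta = int(np.argmax(np.abs(Phi.T @ r)))  # best-matching dictionary element
        d = Phi[:, theta] @ r                 # coefficient d_k = <r_{k-1}, phi_theta>
        a[theta] += d                         # accumulate (an element may be re-selected)
        r -= d * Phi[:, theta]                # residual update r_k = r_{k-1} - d_k phi_theta
    return a

# Example usage with a random dictionary (our assumption, for illustration only):
rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 40))
Phi /= np.linalg.norm(Phi, axis=0)
s = Phi[:, 3] - 0.5 * Phi[:, 17]
a_mp = matching_pursuit(Phi, s, num_iters=10)
print("active MP coefficients:", np.flatnonzero(np.abs(a_mp) > 1e-6))
```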
2.2 Sparse coding in neural systems

Recent research in neuroscience suggests that V1 population responses to natural stimuli may be the result of a sparse approximation of images. For example, it has been shown that both the spatial and temporal properties of V1 receptive fields may be accounted for in terms of a dictionary that has been optimized for sparseness in response to natural images (Olshausen, 2003). Additionally, V1 recordings in response to natural scene stimuli show activity levels (corresponding to the coefficients {a_m}) becoming more sparse as neighboring units are also stimulated (Vinje and Gallant, 2002). These populations are typically very overcomplete (Olshausen and Field, 2004), allowing great flexibility in the representation of a stimulus. Using this flexibility to achieve sparse codes might offer many advantages to sensory neural systems, including enhancing the performance of subsequent processing stages, increasing the storage capacity of associative memories, and increasing the energy efficiency of the system (Olshausen and Field, 2004).

However, existing sparse approximation algorithms do not have implementations that correspond both naturally and efficiently to plausible neural architectures. For convex relaxation approaches, a network implementation of BPDN can be constructed (Olshausen and Field, 1996), following the common practice of using dynamical systems to implement direct gradient descent optimization (Cichocki and Unbehauen, 1993). Unfortunately, this implementation has two major drawbacks. First, it lacks a natural mathematical mechanism to make small coefficients identically zero. While the true BPDN solution would have many coefficients that are exactly zero, direct gradient methods that find an approximate solution in finite time produce coefficients that merely have small magnitudes. Ad hoc thresholding can be used on the results to produce zero-valued coefficients, but such methods lack theoretical justification and can be difficult to use without oracle knowledge of the best threshold value. Second, this implementation requires persistent (two-way) signaling between all units with overlapping receptive fields (e.g., even a node with a nearly zero value would have to continue sending inhibition signals to all similar nodes). In greedy algorithm approaches, spiking neural circuits can be constructed to implement MP (Perrinet, 2005). Unfortunately, this type of circuit implementation relies on a temporal code that requires tightly coupled and precise elements to both encode and decode.

Beyond implementation considerations, existing sparse approximation algorithms also do not consider the time-varying stimuli faced by neural systems. A time-varying input signal s(t) is represented with a set of time-varying coefficients {a_m(t)}.

² Many types of algorithms for convex and non-convex optimization could be considered greedy in some sense (including systems based on descending the gradient of an instantaneous energy function). Our use of this terminology will apply to the family of iterative greedy algorithms such as MP, to remain consistent with the majority of the signal processing literature.

While temporal coefficient changes are necessary to encode stimulus changes, the most useful encoding would use coefficient changes that reflect the character of the stimulus. In particular, sparse coefficients should have smooth temporal variations in response to smooth changes in the image. However, most sparse approximation schemes have a single goal: select the smallest number of coefficients to represent a fixed signal. This single-minded approach can produce coefficient sequences for time-varying stimuli that are erratic, with drastic changes not only in the values of the coefficients but also in the selection of which coefficients are used. These erratic temporal codes are inefficient because they introduce uncertainty about which coefficients are coding the most significant stimulus changes, thereby complicating the process of understanding the changing stimulus content. In Section 3 we develop our LCAs, in which dictionary elements continually fight for the right to represent the stimulus. These LCAs adapt their coefficients continually over time as the input changes, without having to build a new representation from scratch at each time step. This evolution induces inertia in the coefficients, regularizing the temporal variations for smoothly varying input signals. In contrast to the problems seen with current algorithms, our LCAs are easily implemented in analog circuits composed of neuron-like elements, and they encourage both sparsity and smooth temporal variations in the coefficients as the stimulus changes.

2.3 Other related work

There are several sparse approximation methods that do not fit into the two primary approaches of pure greedy algorithms or convex relaxation. Methods such as Sparse Bayesian Learning (Wipf and Rao, 2004; Tipping, 2001), FOCUSS (Rao and Kreutz-Delgado, 1999), modifications of greedy algorithms that select multiple coefficients on each iteration (Donoho et al., 2006; Pece and Petkov, 2000; Feichtinger et al., 1994), and MP extensions that perform an orthogonalization at each step (Davis et al., 1994; Rebollo-Neira and Lowe, 2002) involve computations that would be very difficult to implement in a parallel distributed architecture. While FOCUSS can implement both l_1 global optimization and l_0 local optimization (Rao and Kreutz-Delgado, 1999) in a dynamical system (Kreutz-Delgado et al., 2003) that uses parallel computation to implement a competition strategy among the nodes (strong nodes are encouraged to grow while weak nodes are penalized), it does not lend itself to forming smooth time-varying representations because coefficients cannot be reactivated once they go to zero. There are also several sparse approximation methods built on a parallel computational framework that are related to our LCAs (Fischer et al., 2004; Kingsbury and Reeves, 2003; Herrity et al., 2006; Rehn and Sommer, 2007; Hale et al., 2007; Figueiredo and Nowak, 2003; Daubechies et al., 2004; Blumensath and Davies, 2008). These algorithms typically start the first iteration with many super-threshold coefficients and iteratively try to prune the representation through a thresholding procedure, rather than charging up from zero as in our LCAs. Appendix A contains a detailed comparison.

3 Locally competitive algorithms for sparse coding

3.1 Architecture of locally competitive algorithms

Our LCAs associate each element of the dictionary, φ_m ∈ D, with a separate computing node, or neuron.
When the system is presented with an input image s(t), the population of neurons evolves according to fixed dynamics (described below) and settles on a collective output {a_m(t)}, corresponding to the short-term average firing rates of the neurons.³ The goal is to define the LCA system dynamics so that few coefficients have non-zero values while the input is approximately reconstructed, ŝ(t) = Σ_m a_m(t) φ_m ≈ s(t). The LCA dynamics are inspired by several properties observed in neural systems: inputs cause the membrane potential to charge up like a leaky integrator; membrane potentials exceeding a threshold produce action potentials for extracellular signaling; and these super-threshold responses inhibit neighboring units through horizontal connections. We represent each unit's sub-threshold value by a time-varying internal state u_m(t). The unit's excitatory input current is proportional to how well the image matches the node's receptive field, b_m(t) = ⟨φ_m, s(t)⟩.

³ Note that for analytical simplicity we allow positive and negative coefficients, but rectified systems could use two physical units to implement one LCA node.

Figure 1: (a) LCA nodes behave as leaky integrators, charging at a speed that depends on how well the input matches the associated dictionary element and on the inhibition received from other nodes. (b) A system diagram showing the inhibition signals sent between nodes; the input to node m is V_m(t) = ⟨φ_m, s(t)⟩ − Σ_{n≠m} T_λ(u_n(t)) ⟨φ_m, φ_n⟩. In this case, only node 2 is shown as being active (i.e., having a coefficient above threshold) and inhibiting its neighbors. Since the neighbors are inactive, the inhibition is one-way.

When the internal state u_m of a node becomes significantly large, the node becomes active and produces an output signal a_m used to represent the stimulus and to inhibit other nodes. This output coefficient is the result of an activation function applied to the membrane potential, a_m = T_λ(u_m), parameterized by the system threshold λ. Though similar activation functions have traditionally taken a sigmoidal form, we consider activation functions that operate as thresholding devices (i.e., essentially zero for values below λ and essentially linear for values above λ). The nodes best matching the stimulus will have internal state variables that charge at the fastest rates and become active soonest. To induce the competition that allows these nodes to suppress weaker nodes, we have active nodes inhibit other nodes with an inhibition signal proportional to both their activity level and the similarity of the nodes' receptive fields. Specifically, the inhibition signal from an active node m to any other node n is proportional to a_m G_{m,n}, where G_{m,n} = ⟨φ_m, φ_n⟩. The possibility of unidirectional inhibition gives strong nodes a chance to prevent weaker nodes from becoming active and initiating counter-inhibition, thus making the search for a sparse solution more energy efficient. Note that unlike the direct gradient descent methods described in Section 2, which require two-way inhibition signals between all nodes that overlap (i.e., have G_{m,n} ≠ 0), LCAs only require one-way inhibition from a small selection of nodes (i.e., only the active nodes). Putting all of the above components together, LCA node dynamics are expressed by the nonlinear ordinary differential equation (ODE)

    u̇_m(t) = (1/τ) [ b_m(t) − u_m(t) − Σ_{n≠m} G_{m,n} a_n(t) ].    (4)

This ODE has essentially the same form as the well-known continuous Hopfield network (Hopfield, 1984). Figure 1 shows an LCA node circuit schematic and a system diagram illustrating the lateral inhibition. To express the system of coupled nonlinear ODEs that governs the whole dynamical system, we collect the internal state variables in the vector u(t) = [u_1(t), ..., u_M(t)]^T, the active coefficients in the vector a(t) = [a_1(t), ..., a_M(t)]^T = T_λ(u(t)), the dictionary elements in the columns of the (N × M) matrix Φ = [φ_1, ..., φ_M], and the driving inputs in the vector b(t) = [b_1(t), ..., b_M(t)]^T = Φ^T s(t). The function T_λ(·) performs element-by-element thresholding on vector inputs. The stimulus approximation is ŝ(t) = Φ a(t).
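To make the local competition in (4) explicit, here is a minimal sketch of the node state update written as a discrete Euler step. The dictionary, threshold, step size, and time constant are illustrative assumptions, not values taken from the paper.

```python
# Sketch: one Euler step of the node dynamics in (4).
# Each node integrates its drive b_m, leaks its own state, and is inhibited by
# active neighbors in proportion to a_n * G_{m,n}, with G_{m,n} = <phi_m, phi_n>.
import numpy as np

def lca_node_step(u, s, Phi, lam, tau=0.01, dt=0.001):
    """Advance all internal states u by one Euler step of equation (4), hard threshold."""
    b = Phi.T @ s                                  # driving inputs b_m = <phi_m, s>
    a = np.where(np.abs(u) > lam, u, 0.0)          # hard-threshold activation (alpha = 0)
    G = Phi.T @ Phi                                # Gram matrix of receptive-field overlaps
    inhibition = G @ a - np.diag(G) * a            # sum over n != m of G_{m,n} a_n
    du = (b - u - inhibition) / tau
    return u + dt * du

# Illustrative usage (random dictionary and signal are our assumptions):
rng = np.random.default_rng(2)
Phi = rng.standard_normal((16, 32))
Phi /= np.linalg.norm(Phi, axis=0)
s = 1.5 * Phi[:, 5]
u = np.zeros(32)
for _ in range(200):
    u = lca_node_step(u, s, Phi, lam=0.1)
print("nodes above threshold:", np.flatnonzero(np.abs(u) > 0.1))
```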

The full dynamical system equation is

    u̇(t) = f(u(t)) = (1/τ) [ b(t) − u(t) − (Φ^T Φ − I) a(t) ],
    a(t) = T_λ(u(t)).    (5)

The LCA architecture described by (5) solves a family of sparse approximation problems with different sparsity measures. Specifically, LCAs descend an energy function combining the reconstruction MSE and a sparsity-inducing cost penalty C(·),

    E(t) = (1/2) ‖s(t) − ŝ(t)‖^2 + λ Σ_m C(a_m(t)).    (6)

The specific form of the cost function C(·) is determined by the form of the thresholding activation function T_λ(·). For a given threshold function, the cost function is specified (up to a constant) by

    λ dC(a_m)/da_m = u_m − a_m = u_m − T_λ(u_m).    (7)

This correspondence between the thresholding function and the cost function can be seen by computing the derivative of E with respect to the active coefficients {a_m} (see Appendix B). Using the relationship in (7) and letting the internal states {u_m} evolve according to u̇_m ∝ −∂E/∂a_m yields the equation for the internal state dynamics given in (4). Note that although the dynamics are specified through a gradient approach, the system is not performing direct gradient descent (i.e., u̇_m is not proportional to −∂E/∂u_m). As long as a_m and u_m are related by a monotonically increasing function, the {a_m} will also descend the energy function E. This method for showing the correspondence between a dynamical system and an energy function is essentially the same procedure used by Hopfield (Hopfield, 1984) to establish network dynamics in associative memory systems.

We focus specifically on the cost functions associated with thresholding activation functions. Thresholding functions limit the lateral inhibition by allowing only strong units to suppress other units, forcing most coefficients to be identically zero. For our purposes, thresholding functions T_λ(·) have two distinct behaviors over their range: they are essentially linear with unit slope above the threshold λ, and essentially zero below threshold. Among many reasonable choices for thresholding functions, we start with a smooth sigmoidal function,

    T_(α,γ,λ)(u_m) = (u_m − αλ) / (1 + e^{−γ(u_m − λ)}),    (8)

where γ is a parameter controlling the speed of the threshold transition and α ∈ [0, 1] indicates what fraction of an additive adjustment is made for values above threshold. An example sigmoidal thresholding function is shown in Figure 2a. We are particularly interested in the limit of this thresholding function as γ → ∞, a piecewise linear function we denote as the ideal thresholding function. In the signal processing literature, T_(0,∞,λ)(·) = lim_{γ→∞} T_(0,γ,λ)(·) is known as a hard thresholding function and T_(1,∞,λ)(·) = lim_{γ→∞} T_(1,γ,λ)(·) is known as a soft thresholding function (Donoho, 1995).

3.2 Sparse approximation by locally competitive algorithms

Combining (7) and (8), we can integrate numerically to determine the cost function corresponding to T_(α,γ,λ)(·), shown in Figure 2b. For the ideal threshold functions we derive a corresponding ideal cost function,

    C_(α,∞,λ)(a_m) = ((1 − α)^2 λ)/2 + α |a_m|.    (9)

Details of this derivation are in Appendix C. Note that unless α = 1, the ideal cost function has a gap because active coefficients cannot take all possible values, |a_m| ∉ (0, (1 − α)λ] (i.e., the ideal thresholding function is not technically invertible).
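The system in (5) is straightforward to simulate with a first-order Euler discretization, which is also how the experiments of Section 4.2 are described. The sketch below is our own illustration under assumed parameters (random dictionary, synthetic input, step size, time constant); it implements the thresholding family (8) and its ideal soft/hard limits and records the energy (6) along the trajectory.

```python
# Sketch: Euler simulation of the full LCA system (5) with the threshold family (8).
# All parameters (dictionary, input, tau, dt, lambda) are illustrative assumptions.
import numpy as np

def T(u, lam, alpha=0.0, gamma=None):
    """Threshold family (8); gamma=None gives the ideal (gamma -> infinity) limit."""
    if gamma is None:  # ideal limit: hard (alpha=0) or soft (alpha=1) thresholding
        return np.where(np.abs(u) > lam, u - alpha * lam * np.sign(u), 0.0)
    return (u - alpha * lam * np.sign(u)) / (1.0 + np.exp(-gamma * (np.abs(u) - lam)))

def lca(Phi, s, lam, alpha, tau=0.01, dt=0.001, steps=500):
    """Integrate (5): tau * du/dt = b - u - (Phi^T Phi - I) a, with a = T_lambda(u)."""
    M = Phi.shape[1]
    G = Phi.T @ Phi - np.eye(M)        # lateral inhibition weights
    b = Phi.T @ s                      # feedforward drive
    u = np.zeros(M)
    energies = []
    for _ in range(steps):
        a = T(u, lam, alpha)
        u = u + (dt / tau) * (b - u - G @ a)
        resid = s - Phi @ a
        # Energy (6) with the ideal cost (9): C(a) = (1-alpha)^2*lam/2 + alpha*|a| for a != 0.
        cost = np.sum((a != 0) * ((1 - alpha) ** 2 * lam / 2 + alpha * np.abs(a)))
        energies.append(0.5 * resid @ resid + lam * cost)
    return T(u, lam, alpha), np.array(energies)

rng = np.random.default_rng(3)
Phi = rng.standard_normal((64, 256))
Phi /= np.linalg.norm(Phi, axis=0)
s = Phi[:, [10, 50, 200]] @ np.array([1.0, -0.7, 0.4])

a_hard, E_hard = lca(Phi, s, lam=0.05, alpha=0.0)   # HLCA
a_soft, E_soft = lca(Phi, s, lam=0.05, alpha=1.0)   # SLCA
print("active coefficients, HLCA:", int(np.count_nonzero(a_hard)),
      "| SLCA:", int(np.count_nonzero(a_soft)))
print("HLCA energy start -> end: %.4f -> %.4f" % (E_hard[0], E_hard[-1]))
```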

Figure 2: Relationship between the threshold function T_(α,γ,λ)(·) and the sparsity cost function C(·). Only the positive halves of the symmetric threshold and cost functions are plotted: the top row shows a_m = T(u_m) and the bottom row shows C(a_m). (a) Sigmoidal threshold function and (b) its cost function for γ = 5, α = 0, and λ = 1. (c) The ideal hard thresholding function (γ = ∞, α = 0, λ = 1) and (d) the corresponding cost function. The dashed line shows the limit, but coefficients produced by the ideal thresholding function cannot take values in this range. (e) The ideal soft thresholding function (γ = ∞, α = 1, λ = 1) and (f) the corresponding cost function.
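The cost curves in Figure 2 can be reproduced by integrating the relationship (7) numerically: sweep u over a fine grid, map it through the threshold function, and accumulate dC = (u − a)/λ da. The sketch below is our own reconstruction of that calculation; the parameter values mirror the caption of Figure 2 but are otherwise illustrative.

```python
# Sketch: numerically integrate (7), lambda * dC/da = u - T(u), to recover the cost
# function induced by the sigmoidal threshold (8) on its positive half.
import numpy as np

lam, alpha, gamma = 1.0, 0.0, 5.0
u = np.linspace(0.0, 3.0, 20001)                             # fine grid of internal states
a = (u - alpha * lam) / (1.0 + np.exp(-gamma * (u - lam)))   # threshold (8), positive half

dC_da = (u - a) / lam                                        # relationship (7)
# Trapezoidal cumulative integral of dC/da with respect to a (a is monotone in u here).
C = np.concatenate([[0.0], np.cumsum(0.5 * (dC_da[1:] + dC_da[:-1]) * np.diff(a))])

# C is now the cost as a function of the coefficient value a (cf. Figure 2b).
print("C at a = 1.0 is approximately %.3f" % np.interp(1.0, a, C))
```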

Figure 3: (a) The largest coefficients from a BPDN solver, sorted by magnitude. (b) The same coefficients, sorted according to the magnitude ordering of the SLCA coefficients. While a gross decreasing trend is noticeable, the largest SLCA coefficients are not in the same locations as the largest BPDN coefficients. Although the solutions have equivalent energy function values, the two sets of coefficients differ significantly.

3.3 Special case: Soft-thresholding locally competitive algorithm (SLCA)

As we saw in Section 3.2, LCAs can optimize a variety of different sparsity measures depending on the choice of thresholding function. One special case is the soft thresholding function, corresponding to α = 1 and shown graphically in Figures 2e and 2f. The soft-thresholding locally competitive algorithm (SLCA) applies the l_1 norm as a cost function on the active coefficients, C_(1,∞,λ)(a_m) = |a_m|. Thus, the SLCA is simply another solution method for the general BPDN problem described in Section 2, and simulations confirm that the SLCA does indeed find solutions with values of E(t) equivalent to the solutions produced by standard interior-point based methods. Despite minimizing the same convex energy function, SLCA and BPDN solvers may find different sets of coefficients, as illustrated in Figure 3. This may be due either to the non-uniqueness of a BPDN solution,⁴ or to the numerical approximations not quite converging to a global optimum in the allotted computation time. The connection between soft-thresholding and BPDN is well known in the case of orthonormal dictionaries (Chen et al., 2001), and recent results (Elad, 2006) have given some justification for using soft-thresholding with overcomplete dictionaries. The SLCA provides another formal connection between the soft-thresholding function and the l_1 cost function.

Though BPDN uses the l_1 norm as its sparsity penalty, we often expect many of the resulting coefficients to be identically zero (especially when M > N). However, most numerical methods (including direct gradient descent and interior-point solvers) will drive coefficients toward zero but will never make them identically zero. While an ad hoc threshold could be applied to the results of a BPDN solver, the SLCA has the advantage of incorporating a natural thresholding function that keeps coefficients identically zero during the computation unless they become active. In other words, while BPDN solvers often start with many non-zero coefficients and try to force coefficients down, the SLCA starts with all coefficients equal to zero and only lets a few grow up. This advantage is especially important for neural systems that must expend energy for non-zero values throughout the entire computation.

⁴ With an overcomplete dictionary, it is technically possible to have a BPDN energy function that is convex (guaranteeing all local minima are global minima) but not strictly convex (guaranteeing that a global minimum is unique).
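The orthonormal-dictionary connection mentioned above is easy to verify numerically: when Φ is orthonormal, the objective separates per coefficient and soft-thresholding the analysis coefficients Φ^T s minimizes the energy (6) with C(a_m) = |a_m| exactly. The sketch below is our own check under assumed inputs (a random orthonormal matrix, a random signal, and a generic scipy minimizer for comparison); it is not the authors' experiment.

```python
# Sketch: for an orthonormal dictionary, soft-thresholding Phi^T s minimizes
# 0.5*||s - Phi a||^2 + lam*||a||_1, i.e., the energy (6) with C(a_m) = |a_m|.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, lam = 8, 0.3
Phi, _ = np.linalg.qr(rng.standard_normal((N, N)))       # orthonormal dictionary
s = rng.standard_normal(N)

def objective(a):
    return 0.5 * np.sum((s - Phi @ a) ** 2) + lam * np.sum(np.abs(a))

b = Phi.T @ s
a_soft = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)   # soft threshold T_(1,inf,lam)(b)

res = minimize(objective, np.zeros(N), method="Nelder-Mead",
               options={"maxiter": 50000, "maxfev": 50000,
                        "xatol": 1e-10, "fatol": 1e-12})

# The closed-form value should be no larger than what the generic solver finds.
print("objective via soft threshold: %.8f" % objective(a_soft))
print("objective via Nelder-Mead   : %.8f" % res.fun)
```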

Figure 4: (a) The dictionary in this example has one extra element consisting of a decaying combination of all other dictionary elements. (b) The input vector has a sparse representation in just a few dictionary elements. (c) MP initially chooses the extra dictionary element, preventing it from finding the optimally sparse representation (coefficients shown after 10 iterations). (d) In contrast, the HLCA system finds the optimally sparse coefficients. (e) The time dynamics of the HLCA system illustrate its advantage. The extra dictionary element is the first node to activate, followed shortly by the nodes corresponding to the optimal coefficients. The collective inhibition of the optimal nodes causes the extra node to die away.

3.4 Special case: Hard-thresholding locally competitive algorithm (HLCA)

Another important special case is the hard thresholding function, corresponding to α = 0 and shown graphically in Figures 2c and 2d. Using the relationship in (7), we see that this hard-thresholding locally competitive algorithm (HLCA) applies an l_0-like cost function by using a constant penalty regardless of magnitude, C_(0,∞,λ)(a_m) = (λ/2) I(|a_m| > λ), where I(·) is the indicator function evaluating to 1 if the argument is true and 0 if the argument is false. Unlike the SLCA, the HLCA energy function is not convex and the system will only find a local minimum of the energy function E(t).

As with the SLCA, the HLCA also has connections to known sparse approximation principles. If node m is fully charged, the inhibition signal it sends to other nodes is exactly the same as the update step when the m-th node is chosen in the MP algorithm. However, due to the continuous competition between nodes before they are fully charged, the HLCA is not equivalent to MP in general. As a demonstration of the power of competitive algorithms over greedy algorithms such as MP, consider a canonical example used to illustrate the shortcomings of iterative greedy algorithms (Chen et al., 2001; DeVore and Temlyakov, 1996). For this example, specify a positive integer K < N and construct a dictionary D with M = N + 1 elements of the following form:

    φ_m = e_m                                                  if m ≤ N,
    φ_m = Σ_{n=1}^{K} κ e_n + Σ_{n=K+1}^{N} (κ/(n − K)) e_n    if m = N + 1,

where e_m is the canonical basis element (i.e., it contains a single 1 in the m-th location) and κ is a constant making the vectors unit norm. In words, the dictionary includes the canonical basis along with one extra element that is a decaying combination of all other elements (illustrated in Figure 4, with N = 20 and K = 5). The input signal is sparsely represented in the first K dictionary elements, s = Σ_{m=1}^{K} (1/√K) e_m. The first MP iteration chooses φ_M, introducing a residual with decaying terms. Even though s has an exact representation in K elements, MP iterates forever trying to atone for this bad initial choice. In contrast, the HLCA initially activates the M-th node but uses the collective inhibition from nodes 1, ..., K to suppress this node and calculate the optimal set of coefficients. While this pathological example is unlikely to occur exactly in natural signals, it is often used as a criticism of greedy methods to demonstrate their shortsightedness. We mention it here to demonstrate the flexibility of LCAs and their differences from pure greedy algorithms.
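The sketch below is our own numerical illustration of this example: it builds the dictionary above (with the assumed values N = 20 and K = 5 used for the figure), runs MP, and shows that the first selection is the extra element and that the residual never reaches zero in a finite number of iterations, even though the signal has an exact K-term representation.

```python
# Sketch: the greedy-pursuit counterexample of Section 3.4 (our reconstruction).
# MP picks the extra "decaying" element first and then never terminates exactly.
import numpy as np

N, K = 20, 5
e = np.eye(N)
extra = np.concatenate([np.ones(K), 1.0 / np.arange(1, N - K + 1)])
extra /= np.linalg.norm(extra)                 # kappa chosen to make the column unit norm
Phi = np.hstack([e, extra[:, None]])           # M = N + 1 dictionary elements

s = e[:, :K].sum(axis=1) / np.sqrt(K)          # exact K-sparse signal

r = s.copy()
selections = []
for k in range(30):
    theta = int(np.argmax(np.abs(Phi.T @ r)))  # greedy selection
    d = Phi[:, theta] @ r
    r -= d * Phi[:, theta]                     # residual update
    selections.append(theta)

print("first MP selection (index N means the extra element):", selections[0])
print("residual norm after 30 MP iterations:", float(np.linalg.norm(r)))
```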

4 LCA system properties

To be a physically viable sparse coding mechanism, LCAs must exhibit several critical properties: the system must remain stable under normal operating conditions, the system must produce sparse coefficients that represent the stimulus with low error, and the coefficient sequences must exhibit regularity in response to time-varying inputs. In this section we show that LCAs exhibit good characteristics in each of these three areas. We focus our analysis on the HLCA, both because it yields the most interesting results and because it is notationally the cleanest to discuss. In general, the analysis principles we use apply to all LCAs through straightforward (though perhaps laborious) extensions.

4.1 Stability

Any proposed neural system must remain well-behaved under normal conditions. While linear systems theory has an intuitive notion of stability that is easily testable (Franklin et al., 1986), no such unifying concept of stability exists for nonlinear systems (Khalil, 2002). Instead, nonlinear systems are characterized in a variety of ways, including their behavior near an equilibrium point u* where f(u*) = 0 and their input-output relationship. The various stability analyses of Sections 4.1.1 and 4.1.2 depend on a common criterion. Define M_u(t) ⊆ {1, ..., M} as the set of nodes that are above threshold in the internal state vector u(t), M_u(t) = {m : |u_m(t)| ≥ λ}. We say that the LCA meets the stability criterion if, for all t, the set of active vectors {φ_m}_{m ∈ M_u(t)} is linearly independent. It makes intuitive sense that this condition is important: if a collection of linearly dependent nodes were active simultaneously, the nodes could have growing coefficients with no net effect on the reconstruction. The system is likely to satisfy the stability criterion eventually under normal operating conditions for two reasons. First, small subsets of dictionary elements are unlikely to be linearly dependent unless the dictionary is designed with this property (e.g., (Tropp, 2006)). Second, sparse coding systems are actively trying to select dictionary subsets so that they can use many fewer coefficients than the dimension of the signal space, |M_u(t)| ≪ N < M. While the LCA lateral inhibition signals discourage linearly dependent sets from activating, the stability criterion can be violated when a collection of nodes becomes active too quickly, before inhibition can take effect.
In simulation, we have observed this situation when the threshold is too low compared to the system time constant. As discussed below, the stability criterion amounts to a sufficient (but not necessary) condition for good behavior, and we have never empirically observed the simulated system becoming unstable, even when this condition is transiently violated.

4.1.1 Equilibrium points

In an LCA presented with a static input, we look to the steady-state response (where u̇(t) = 0) to determine the coefficients. The character of the equilibrium points u* (f(u*) = 0) and the system's behavior in a neighborhood around an equilibrium point provide one way to ensure that a system is well-behaved. Consider the ball around an equilibrium point, B_ε(u*) = {u : ‖u − u*‖ < ε}.

Nonlinear system analysis typically asks an intuitive question: if the system is perturbed within this ball, does it then run away, stay where it is, or get attracted back? Specifically, a system is said to be locally asymptotically stable (Bacciotti and Rosier, 2001) at an equilibrium point u* if one can specify ε > 0 such that u(0) ∈ B_ε(u*) implies lim_{t→∞} u(t) = u*. Previous research (Cohen and Grossberg, 1983; Yang and Dillon, 1994; Li et al., 1988) has used the tools of Lyapunov functions (Khalil, 2002) to study Hopfield networks (Hopfield, 1984) similar to the LCA architecture. However, all of these analyses make assumptions that do not encompass the ideal thresholding functions used in the LCAs (e.g., that the activation functions are continuously differentiable and/or monotone increasing). In Appendix D.1 we show that as long as the stability criterion is met, the HLCA: has a finite number of equilibrium points; has equilibrium points that are almost certainly isolated (no two equilibrium points are arbitrarily close together); and is almost certainly locally asymptotically stable at every equilibrium point. The conditions that hold "almost certainly" are true as long as none of the equilibria have components identically equal to the threshold (|u*_m| ≠ λ for all m), which holds with overwhelming probability. With a finite number of isolated equilibria, we can be confident that the HLCA steady-state response is a distinct set of coefficients representing the stimulus. Asymptotic stability also implies a notion of robustness, guaranteeing that the system will remain well-behaved even under perturbations (Theorems 2.8 and 2.9 in (Bacciotti and Rosier, 2001)).

4.1.2 Input-output stability

In physical systems it is important that the energy of both internal and external signals remain bounded for bounded inputs. One intuitive approach to ensuring output stability is to examine the energy function E. We show in Appendix D.2 that for non-decreasing threshold functions, the energy function is non-increasing (dE(t)/dt ≤ 0) for fixed inputs. While this is encouraging, it does not guarantee input-output stability. To appreciate this effect, note that the HLCA cost function is constant for nodes above threshold: nothing explicitly keeps a node from growing without bound once it is active. While there is no universal input-output stability test for general nonlinear systems, we observe that the LCA system equation is linear and fixed until a unit crosses threshold. A branch of control theory specifically addresses these switched systems (Decarlo et al., 2000). Results from this field indicate that input-output stability can be guaranteed if the individual linear subsystems are stable and the system does not switch too quickly between these subsystems (Hespanha and Morse, 1999). In Appendix D.2 we give a precise mathematical statement of this input-output stability and show that the HLCA linear subsystems are individually stable if and only if the stability criterion is met. Therefore, the HLCA is input-output stable as long as nodes are limited in how fast they can change states. We expect that infinitely fast switching is avoided in practice either by the physical principles of the system implementation or through an explicit hysteresis in the thresholding function.
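The link between the stability criterion and the switched linear subsystems can be illustrated numerically. For the HLCA with a fixed active set M, the active states evolve (up to the 1/τ factor) under the linear map −Φ_M^T Φ_M, which is stable exactly when the active dictionary elements are linearly independent. The sketch below is our own illustration with an assumed random dictionary; it is not an implementation of the paper's appendices.

```python
# Sketch: HLCA linear subsystem stability vs. the stability criterion.
# With a fixed active set M, the active states obey tau * du_M/dt = b_M - Phi_M^T Phi_M u_M,
# so the subsystem is stable iff the active columns of Phi are linearly independent
# (all eigenvalues of -Phi_M^T Phi_M strictly negative).
import numpy as np

rng = np.random.default_rng(5)
N, M = 32, 64
Phi = rng.standard_normal((N, M))
Phi[:, 0] = 0.5 * Phi[:, 1] + 0.5 * Phi[:, 2]     # deliberately dependent triple {0, 1, 2}
Phi /= np.linalg.norm(Phi, axis=0)

def subsystem_eigs(active):
    Phi_M = Phi[:, active]
    return np.linalg.eigvalsh(-(Phi_M.T @ Phi_M))

independent_set = [5, 11, 40, 57]                  # generic columns: almost surely independent
dependent_set = [0, 1, 2]                          # violates the stability criterion

print("max eigenvalue, independent active set:", subsystem_eigs(independent_set).max())
print("max eigenvalue, dependent active set  :", subsystem_eigs(dependent_set).max())
```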
4.2 Sparsity and representation error

Viewing the sparse approximation problem through the lens of rate-distortion theory (Berger, 1971), the most powerful algorithm produces the lowest reconstruction MSE for a given sparsity. When the sparsity measure is the l_1 norm, the problem is convex and the SLCA produces solutions with l_1 sparsity equivalent to interior-point BPDN solvers. Despite the analytic appeal of the l_1 norm as a sparsity measure, many systems concerned with energy minimization (including neural systems) likely have an interest in minimizing the l_0 norm of the coefficients. The HLCA is appealing because of its l_0-like sparsity penalty, but this objective function is not convex and the HLCA may find a local minimum.

We will show that while the HLCA cannot guarantee the l_0-sparsest solution, it produces coefficients with sparsity comparable to MP for natural images. Insight about the HLCA reconstruction fidelity comes from rewriting the LCA system equation as

    u̇(t) = (1/τ) [ Φ^T (s(t) − ŝ(t)) − u(t) + T_(α,∞,λ)(u(t)) ].    (10)

For a constant input, HLCA equilibrium points (u̇(t) = 0) occur when the residual error is orthogonal to the active nodes and balanced against the internal state variables of the inactive nodes:

    ⟨φ_m, s(t) − ŝ(t)⟩ = u_m(t)   if |u_m| ≤ λ,
    ⟨φ_m, s(t) − ŝ(t)⟩ = 0        if |u_m| > λ.

Therefore, when the HLCA converges, the coefficients perfectly reconstruct the component of the input signal that projects onto the subspace spanned by the final set of active nodes. Using standard results from frame theory (Christensen, 2002), we can bound the HLCA reconstruction MSE in terms of the set of inactive nodes:

    ‖s(t) − ŝ(t)‖^2 ≤ (1/η_min) Σ_{m ∉ M_u(t)} ⟨φ_m, s(t) − ŝ(t)⟩^2 ≤ (M − |M_u(t)|) λ^2 / η_min,

where η_min is the minimum eigenvalue of ΦΦ^T.

Though the HLCA is not guaranteed to find the globally optimal l_0-sparsest solution, we must ensure that it does not produce unreasonably non-sparse solutions. While the system nonlinearity makes it impossible to analytically determine the LCA steady-state coefficients, it is possible to rule out some coefficient sets as impossible. For example, let M ⊆ {1, ..., M} be an arbitrary set of active coefficients. Using linear systems theory we can calculate the steady-state response ũ^M = lim_{t→∞} u(t) assuming that M stays fixed. If |ũ^M_m| < λ for any m ∈ M, or if |ũ^M_m| > λ for any m ∉ M, then M cannot describe the set of active nodes in the steady-state response and we call it inconsistent. We show in Appendix E that when the stability criterion is met, the following statement is true for the HLCA: if s = φ_m, then any set of active coefficients M with m ∈ M and |M| > 1 is inconsistent. In other words, the HLCA may use the m-th node or a collection of other nodes to represent s, but it cannot use a combination of both. This result extends intuitively beyond one-sparse signals: each component in an optimal decomposition is represented by either the optimal node or another collection of nodes, but not both. While not necessarily finding the optimal representation, the system does not needlessly employ both the optimal and extraneous nodes.
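The consistency test just described is directly computable: for a candidate active set M, solve the HLCA steady state under the assumption that M stays fixed and then check the threshold conditions. The following sketch is our own reconstruction of that procedure for the hard-thresholding case (α = 0), using an assumed random dictionary and a one-sparse input s = φ_m.

```python
# Sketch: the steady-state consistency test of Section 4.2 for the HLCA (alpha = 0).
# Assuming the active set M stays fixed, the active states solve
#   Phi_M^T Phi_M u_M = Phi_M^T s,  and inactive states settle at u_m = <phi_m, s - Phi_M u_M>.
# M is "inconsistent" if any active state ends below threshold or any inactive one above it.
import numpy as np

def is_consistent(Phi, s, active, lam):
    active = np.asarray(active)
    Phi_M = Phi[:, active]
    u_M = np.linalg.solve(Phi_M.T @ Phi_M, Phi_M.T @ s)     # fixed-active-set steady state
    resid = s - Phi_M @ u_M
    u_inactive = Phi.T @ resid                              # steady state of inactive nodes
    u_inactive[active] = 0.0                                # ignore the active entries here
    return bool(np.all(np.abs(u_M) >= lam) and np.all(np.abs(u_inactive) <= lam))

rng = np.random.default_rng(6)
N, M, lam = 16, 32, 0.2
Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0)
m = 7
s = Phi[:, m]                                               # one-sparse input s = phi_m

print("M = {m} alone consistent?    ", is_consistent(Phi, s, [m], lam))
print("M = {m, another} consistent? ", is_consistent(Phi, s, [m, 3], lam))
```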
We have also verified numerically that the LCAs achieve a combination of error and sparsity comparable with known methods. We employed a dictionary consisting of the bandpass band of a steerable pyramid (Simoncelli and Freeman, 1995)⁵ with one level and four orientation bands (i.e., the dictionary is approximately four times overcomplete, with 4096 elements). Image patches (32 × 32) were selected at random from a standard set of test images. The selected image patches were decomposed using the steerable pyramid and reconstructed using just the bandpass band.⁶ The bandpass image patches were also normalized to have unit energy. Each image patch was used as the fixed input to the LCA system equation (5), using either a soft or hard thresholding function (with variable threshold values) and with a biologically plausible membrane time constant of τ = 10 ms (Dayan and Abbott, 2001). We simulated the system using a simple Euler's method approach (i.e., a first-order finite difference approximation) (Süli and Mayers, 2003) with a time step of 1 ms. Figure 5 shows the time evolution of the reconstruction MSE and l_0 sparsity for the SLCA and HLCA responding to an individual image, and Figure 6 shows the mean steady-state tradeoff between l_0 sparsity and MSE.

For comparison, we also plot the results obtained using MP,⁷ a standard BPDN interior-point solver⁸ followed by thresholding to enforce l_0 sparsity (denoted "BPDNthr"), and the SLCA with the same threshold applied (denoted "SLCAthr"). Most importantly, note that the HLCA and MP are almost identical in their sparsity-MSE tradeoff. Though the connections between the HLCA and MP were pointed out in Section 3.4, these are very different systems and there is no reason to expect them to produce the same coefficients. Additionally, note that the SLCA produces coefficients that are nearly as l_0-sparse as what can be achieved by oracle thresholding the results of a BPDN solver, even though the SLCA keeps most coefficients zero throughout the calculation.

⁵ We used a publicly available implementation of the steerable pyramid, with the circular boundary handling option.
⁶ We eliminate the lowpass band because it accounts for a large fraction of the total image energy in just a few dictionary elements, and it is unlikely that much gain could be achieved by sparsifying these coefficients.
⁷ We iterated MP until it had the same MSE as the LCA implementation in order to compare the sparsity levels of the results.
⁸ We used a publicly available implementation of a primal-dual BPDN solver.

Figure 5: The time response of the HLCA and SLCA (τ = 10 ms) for a single fixed (32 × 32) image patch with a dictionary that is four times overcomplete (4096 elements). (a) The MSE decay and (b) the l_0 sparsity for the HLCA. (c) The MSE decay and (d) the l_0 sparsity for the SLCA. The error converges within 1-2 time constants and the sparsity often approximately converges within 3-4 time constants. In some cases sparsity is reduced with a longer running time.

Figure 6: Mean tradeoff between MSE and l_0 sparsity for normalized (32 × 32) patches from a standard set of test images, comparing HLCA, SLCA, BPDNthr, SLCAthr, and MP. The dictionary was four times overcomplete with 4096 elements. For a given MSE range, we plot the mean (a) and standard deviation (b) of the l_0 sparsity.

4.3 Time-varying stimuli

4.3.1 Inertia

Biological sensory systems are faced with constantly changing stimuli due to both external movement and internal factors (e.g., organism movement, eye saccades, etc.). As discussed in Section 2.2, sparse codes whose temporal variations also reflect the smooth nature of the changing stimulus would be easier for higher-level systems to understand and interpret. However, approximation methods that only optimize sparsity at each time step (especially greedy algorithms) can produce brittle representations that change dramatically with relatively small stimulus changes. In contrast, LCAs naturally produce smoothly changing outputs in response to smoothly changing time-varying inputs. Assuming that the system time constant τ is faster than the temporal changes in the stimulus, the LCA will evolve to capture the stimulus change and converge to a new sparse representation. While local minima in an energy function are typically problematic, the LCAs can use these local minima to find coefficients that are close to their previous coefficients even if they are not optimally sparse. While permitting suboptimal coefficient sparsity, this property allows the LCA to exhibit inertia that smooths the coefficient sequences.

The inertia property exhibited by LCAs can be seen by focusing on a single node in the system equation (10):

    u̇_m(t) = (1/τ) [ ⟨φ_m, s(t) − ŝ(t)⟩ − u_m(t) ]   when |u_m(t)| < λ,
    u̇_m(t) = (1/τ) [ ⟨φ_m, s(t) − ŝ(t)⟩ − αλ ]        when u_m(t) ≥ λ.

A new residual signal drives the coefficient higher but suffers an additive penalty. Inactive coefficients suffer an increasing penalty as they get closer to threshold, while active coefficients only suffer a constant penalty αλ that can be very small (e.g., the HLCA has αλ = 0).

This property induces a "king of the hill" effect: when a new residual appears, active nodes move virtually unimpeded to represent it, while inactive nodes are penalized until they reach threshold. This inertia encourages inactive nodes to remain inactive unless the active nodes cannot adequately represent the new input.

To illustrate this inertia, we applied the LCAs to a sequence of bandpass filtered, normalized frames from the standard "foreman" test video sequence, with the same experimental setup described in Section 4.2. The LCA input is switched to the next video frame every (simulated) 1/30 seconds. The results are shown in Figure 7, along with comparisons to MP and BPDN applied independently to each frame. The changing coefficient locations are nodes that either became active or became inactive at each frame. Mathematically, the number of changing coefficients at frame n is |M_u(n−1) ⊕ M_u(n)|, where ⊕ is the exclusive OR (symmetric set difference) operator and u(n) denotes the internal state variables at the end of the simulation for frame n. This simulation highlights that the HLCA uses approximately the same number of active coefficients as MP but chooses coefficients that more efficiently represent the video sequence. The HLCA is significantly more likely to re-use active coefficient locations from the previous frame without making significant sacrifices in the sparsity of the solution. This difference is highlighted by the ratio of the number of changing coefficients to the number of active coefficients, |M_u(n−1) ⊕ M_u(n)| / |M_u(n)|. MP has a ratio of 1.7, meaning that MP finds almost an entirely new set of active coefficient locations for each frame. The HLCA has a ratio of 0.5, meaning that it changes approximately 25% of its coefficient locations at each frame. SLCA and BPDNthr have approximately the same performance, with regularity falling between HLCA and MP. Though the two systems can calculate different coefficients, the convexity of the energy function appears to limit the coefficient choices enough that the SLCA cannot smooth the coefficient time series substantially more than BPDNthr.

4.3.2 Markov state transitions

The simulation results indicate that the HLCA produces time-series coefficients that are much more regular than MP. This regularity is visualized in Figure 9 by looking at the time series of example HLCA and MP coefficients. Note that though the two coding schemes produce roughly the same number of non-zero entries, the HLCA does much better than MP at clustering the values into consecutive runs of positive or negative values. This type of smoothness better reflects the regularity of the natural video sequence input. We can quantify this increased regularity by examining the Markov state transitions. Specifically, each coefficient time series is modeled as a Markov chain (Norris, 1997) with three possible states at frame n:

    σ_m(n) = −   if u_m(n) < −λ,
    σ_m(n) = 0   if −λ ≤ u_m(n) ≤ λ,
    σ_m(n) = +   if u_m(n) > λ.

Figure 8 shows the marginal probabilities P(·) of the states and the conditional probabilities P(·|·) of moving to a state given the previous state. The HLCA and MP are equally likely to have non-zero states, but the HLCA is over five times more likely than MP to have a positive coefficient stay positive (P(+|+)). Also, though the absolute probabilities are small, MP is roughly two orders of magnitude more likely to have a coefficient swing from positive to negative (P(−|+)) and vice versa (P(+|−)).
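Estimating these state statistics from a coefficient time series is a short computation. The sketch below is our own illustration with synthetic coefficient trajectories (not the paper's video data); it classifies each frame into the three states above, estimates the marginal and conditional probabilities, and evaluates the conditional entropy used in (11) below.

```python
# Sketch: empirical three-state Markov statistics for coefficient time series.
# u has shape (num_frames, num_nodes); here it is synthetic data standing in for
# the internal states recorded at the end of each simulated video frame.
import numpy as np

rng = np.random.default_rng(7)
lam = 0.5
u = 0.1 * np.cumsum(0.3 * rng.standard_normal((200, 50)), axis=0)   # smooth synthetic states

states = np.zeros_like(u, dtype=int)          # 0 <-> state "0"
states[u > lam] = 1                           # 1 <-> state "+"
states[u < -lam] = -1                         # -1 <-> state "-"

prev, curr = states[:-1].ravel(), states[1:].ravel()
labels = (-1, 0, 1)
counts = np.array([[np.sum((prev == j) & (curr == i)) for i in labels] for j in labels])
P_joint = counts / counts.sum()
P_prev = P_joint.sum(axis=1)                             # marginal P(previous state)
P_cond = P_joint / np.maximum(P_prev[:, None], 1e-12)    # conditional P(current | previous)

with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(P_cond > 0, P_cond * np.log2(P_cond), 0.0)
H_cond = -np.sum(P_prev * plogp.sum(axis=1))             # conditional entropy, cf. (11)

print("P(+|+) =", P_cond[labels.index(1), labels.index(1)])
print("conditional entropy (bits):", H_cond)
```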
To quantify the regularity of the active coefficient locations, we calculate the entropy (Cover and Thomas, 1991) of the coefficient states at frame n conditioned on the coefficient states at frame (n − 1),

    H(σ_m(n) | σ_m(n−1)) = − Σ_{j ∈ {−,0,+}} P(j) Σ_{i ∈ {−,0,+}} P(i|j) log P(i|j),    (11)

plotted in Figure 9. This conditional entropy indicates how much uncertainty there is about the status of the current coefficients given the coefficients from the previous frame. Note that the conditional entropy for MP is almost double the entropy for the HLCA, while the SLCA is again similar to BPDNthr. The principal contributing factor to the conditional entropy appears to be the probability that a non-zero node remains in the same state.

Figure 7: The HLCA and SLCA systems simulated on frames of the "foreman" test video sequence. For comparison, MP coefficients and thresholded BPDN coefficients are also shown. Average values for each system are noted in the legend. (a) Per-frame MSE for each coding scheme, designed to be approximately equal. (b) The number of active coefficients in each frame. (c) The number of changing coefficient locations for each frame, including the number of inactive nodes becoming active and the number of active nodes becoming inactive. (d) The ratio of changing coefficients to active coefficients. A ratio near 2 (such as with MP) means that almost 100% of the coefficient locations are new at each frame. A ratio near 0.5 (such as with the HLCA) means that approximately 25% of the coefficients are new at each frame.


More information

Analysis of an Attractor Neural Network s Response to Conflicting External Inputs

Analysis of an Attractor Neural Network s Response to Conflicting External Inputs Journal of Mathematical Neuroscience (2018) 8:6 https://doi.org/10.1186/s13408-018-0061-0 RESEARCH OpenAccess Analysis of an Attractor Neural Network s Response to Conflicting External Inputs Kathryn Hedrick

More information

Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France

Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France remi.gribonval@inria.fr Structure of the tutorial Session 1: Introduction to inverse problems & sparse

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

c 2011 International Press Vol. 18, No. 1, pp , March DENNIS TREDE

c 2011 International Press Vol. 18, No. 1, pp , March DENNIS TREDE METHODS AND APPLICATIONS OF ANALYSIS. c 2011 International Press Vol. 18, No. 1, pp. 105 110, March 2011 007 EXACT SUPPORT RECOVERY FOR LINEAR INVERSE PROBLEMS WITH SPARSITY CONSTRAINTS DENNIS TREDE Abstract.

More information

l 0 -norm Minimization for Basis Selection

l 0 -norm Minimization for Basis Selection l 0 -norm Minimization for Basis Selection David Wipf and Bhaskar Rao Department of Electrical and Computer Engineering University of California, San Diego, CA 92092 dwipf@ucsd.edu, brao@ece.ucsd.edu Abstract

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Conditions for Robust Principal Component Analysis

Conditions for Robust Principal Component Analysis Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and

More information

DEVS Simulation of Spiking Neural Networks

DEVS Simulation of Spiking Neural Networks DEVS Simulation of Spiking Neural Networks Rene Mayrhofer, Michael Affenzeller, Herbert Prähofer, Gerhard Höfer, Alexander Fried Institute of Systems Science Systems Theory and Information Technology Johannes

More information

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Compressed sensing. Or: the equation Ax = b, revisited. Terence Tao. Mahler Lecture Series. University of California, Los Angeles

Compressed sensing. Or: the equation Ax = b, revisited. Terence Tao. Mahler Lecture Series. University of California, Los Angeles Or: the equation Ax = b, revisited University of California, Los Angeles Mahler Lecture Series Acquiring signals Many types of real-world signals (e.g. sound, images, video) can be viewed as an n-dimensional

More information

Algorithms for sparse analysis Lecture I: Background on sparse approximation

Algorithms for sparse analysis Lecture I: Background on sparse approximation Algorithms for sparse analysis Lecture I: Background on sparse approximation Anna C. Gilbert Department of Mathematics University of Michigan Tutorial on sparse approximations and algorithms Compress data

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Figure 1.1: Schematic symbols of an N-transistor and P-transistor

Figure 1.1: Schematic symbols of an N-transistor and P-transistor Chapter 1 The digital abstraction The term a digital circuit refers to a device that works in a binary world. In the binary world, the only values are zeros and ones. Hence, the inputs of a digital circuit

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Lecture 15: Exploding and Vanishing Gradients

Lecture 15: Exploding and Vanishing Gradients Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Algorithms for Learning Good Step Sizes

Algorithms for Learning Good Step Sizes 1 Algorithms for Learning Good Step Sizes Brian Zhang (bhz) and Manikant Tiwari (manikant) with the guidance of Prof. Tim Roughgarden I. MOTIVATION AND PREVIOUS WORK Many common algorithms in machine learning,

More information

Stability and Robustness of Weak Orthogonal Matching Pursuits

Stability and Robustness of Weak Orthogonal Matching Pursuits Stability and Robustness of Weak Orthogonal Matching Pursuits Simon Foucart, Drexel University Abstract A recent result establishing, under restricted isometry conditions, the success of sparse recovery

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Stability theory is a fundamental topic in mathematics and engineering, that include every

Stability theory is a fundamental topic in mathematics and engineering, that include every Stability Theory Stability theory is a fundamental topic in mathematics and engineering, that include every branches of control theory. For a control system, the least requirement is that the system is

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Bias-free Sparse Regression with Guaranteed Consistency

Bias-free Sparse Regression with Guaranteed Consistency Bias-free Sparse Regression with Guaranteed Consistency Wotao Yin (UCLA Math) joint with: Stanley Osher, Ming Yan (UCLA) Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U) UC Riverside, STATS Department March

More information

2 One-dimensional models in discrete time

2 One-dimensional models in discrete time 2 One-dimensional models in discrete time So far, we have assumed that demographic events happen continuously over time and can thus be written as rates. For many biological species with overlapping generations

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

Why Sparse Coding Works

Why Sparse Coding Works Why Sparse Coding Works Mathematical Challenges in Deep Learning Shaowei Lin (UC Berkeley) shaowei@math.berkeley.edu 10 Aug 2011 Deep Learning Kickoff Meeting What is Sparse Coding? There are many formulations

More information

SPARSE SIGNAL RESTORATION. 1. Introduction

SPARSE SIGNAL RESTORATION. 1. Introduction SPARSE SIGNAL RESTORATION IVAN W. SELESNICK 1. Introduction These notes describe an approach for the restoration of degraded signals using sparsity. This approach, which has become quite popular, is useful

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Lecture Notes 5: Multiresolution Analysis

Lecture Notes 5: Multiresolution Analysis Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 27 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks Thomas Natschläger Software Competence Center Hagenberg A-4232 Hagenberg, Austria Thomas.Natschlaeger@scch.at

More information

Lecture 9: Generalization

Lecture 9: Generalization Lecture 9: Generalization Roger Grosse 1 Introduction When we train a machine learning model, we don t just want it to learn to model the training data. We want it to generalize to data it hasn t seen

More information

Max$Sum(( Exact(Inference(

Max$Sum(( Exact(Inference( Probabilis7c( Graphical( Models( Inference( MAP( Max$Sum(( Exact(Inference( Product Summation a 1 b 1 8 a 1 b 2 1 a 2 b 1 0.5 a 2 b 2 2 a 1 b 1 3 a 1 b 2 0 a 2 b 1-1 a 2 b 2 1 Max-Sum Elimination in Chains

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

Neural Modeling and Computational Neuroscience. Claudio Gallicchio

Neural Modeling and Computational Neuroscience. Claudio Gallicchio Neural Modeling and Computational Neuroscience Claudio Gallicchio 1 Neuroscience modeling 2 Introduction to basic aspects of brain computation Introduction to neurophysiology Neural modeling: Elements

More information

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

EE 381V: Large Scale Optimization Fall Lecture 24 April 11 EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that

More information

Convergence in shape of Steiner symmetrized line segments. Arthur Korneychuk

Convergence in shape of Steiner symmetrized line segments. Arthur Korneychuk Convergence in shape of Steiner symmetrized line segments by Arthur Korneychuk A thesis submitted in conformity with the requirements for the degree of Master of Science Graduate Department of Mathematics

More information

Alternating Direction Method of Multipliers. Ryan Tibshirani Convex Optimization

Alternating Direction Method of Multipliers. Ryan Tibshirani Convex Optimization Alternating Direction Method of Multipliers Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last time: dual ascent min x f(x) subject to Ax = b where f is strictly convex and closed. Denote

More information

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing Linear recurrent networks Simpler, much more amenable to analytic treatment E.g. by choosing + ( + ) = Firing rates can be negative Approximates dynamics around fixed point Approximation often reasonable

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Sparse Solutions of Systems of Equations and Sparse Modelling of Signals and Images

Sparse Solutions of Systems of Equations and Sparse Modelling of Signals and Images Sparse Solutions of Systems of Equations and Sparse Modelling of Signals and Images Alfredo Nava-Tudela ant@umd.edu John J. Benedetto Department of Mathematics jjb@umd.edu Abstract In this project we are

More information

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient

More information

Estimation Error Bounds for Frame Denoising

Estimation Error Bounds for Frame Denoising Estimation Error Bounds for Frame Denoising Alyson K. Fletcher and Kannan Ramchandran {alyson,kannanr}@eecs.berkeley.edu Berkeley Audio-Visual Signal Processing and Communication Systems group Department

More information

SPARSE approximation decomposes a signal y R M in

SPARSE approximation decomposes a signal y R M in Convergence of a Neural Network for Sparse Approximation using the Nonsmooth Łojasiewicz Inequality Aurèle Balavoine, Christopher J. Rozell, Justin Romberg Abstract Sparse approximation is an optimization

More information

arxiv: v1 [math.na] 26 Nov 2009

arxiv: v1 [math.na] 26 Nov 2009 Non-convexly constrained linear inverse problems arxiv:0911.5098v1 [math.na] 26 Nov 2009 Thomas Blumensath Applied Mathematics, School of Mathematics, University of Southampton, University Road, Southampton,

More information

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We

More information

Fundamentals of Dynamical Systems / Discrete-Time Models. Dr. Dylan McNamara people.uncw.edu/ mcnamarad

Fundamentals of Dynamical Systems / Discrete-Time Models. Dr. Dylan McNamara people.uncw.edu/ mcnamarad Fundamentals of Dynamical Systems / Discrete-Time Models Dr. Dylan McNamara people.uncw.edu/ mcnamarad Dynamical systems theory Considers how systems autonomously change along time Ranges from Newtonian

More information

Appendix A.1 Derivation of Nesterov s Accelerated Gradient as a Momentum Method

Appendix A.1 Derivation of Nesterov s Accelerated Gradient as a Momentum Method for all t is su cient to obtain the same theoretical guarantees. This method for choosing the learning rate assumes that f is not noisy, and will result in too-large learning rates if the objective is

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Nonlinear dynamics & chaos BECS

Nonlinear dynamics & chaos BECS Nonlinear dynamics & chaos BECS-114.7151 Phase portraits Focus: nonlinear systems in two dimensions General form of a vector field on the phase plane: Vector notation: Phase portraits Solution x(t) describes

More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information Introduction Consider a linear system y = Φx where Φ can be taken as an m n matrix acting on Euclidean space or more generally, a linear operator on a Hilbert space. We call the vector x a signal or input,

More information

Sparse Optimization Lecture: Basic Sparse Optimization Models

Sparse Optimization Lecture: Basic Sparse Optimization Models Sparse Optimization Lecture: Basic Sparse Optimization Models Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know basic l 1, l 2,1, and nuclear-norm

More information

Adaptive Dual Control

Adaptive Dual Control Adaptive Dual Control Björn Wittenmark Department of Automatic Control, Lund Institute of Technology Box 118, S-221 00 Lund, Sweden email: bjorn@control.lth.se Keywords: Dual control, stochastic control,

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

Lecture 4: Feed Forward Neural Networks

Lecture 4: Feed Forward Neural Networks Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

Fast Sparse Representation Based on Smoothed

Fast Sparse Representation Based on Smoothed Fast Sparse Representation Based on Smoothed l 0 Norm G. Hosein Mohimani 1, Massoud Babaie-Zadeh 1,, and Christian Jutten 2 1 Electrical Engineering Department, Advanced Communications Research Institute

More information

Dual Augmented Lagrangian, Proximal Minimization, and MKL

Dual Augmented Lagrangian, Proximal Minimization, and MKL Dual Augmented Lagrangian, Proximal Minimization, and MKL Ryota Tomioka 1, Taiji Suzuki 1, and Masashi Sugiyama 2 1 University of Tokyo 2 Tokyo Institute of Technology 2009-09-15 @ TU Berlin (UT / Tokyo

More information