Sparse coding via thresholding and local competition in neural circuits


Christopher J. Rozell, Don H. Johnson, Richard G. Baraniuk, Bruno A. Olshausen

Abstract

While evidence indicates that neural systems may be employing sparse approximations to represent sensed stimuli, the mechanisms underlying this ability are not understood. We describe a locally competitive algorithm (LCA) that solves a family of sparse coding problems by minimizing a weighted combination of reconstruction mean-squared error (MSE) and a coefficient cost function. LCAs are designed to be implemented in a dynamical system composed of many neuron-like elements operating in parallel. These algorithms use thresholding functions to induce local (usually one-way) inhibitory competitions between nodes to produce sparse representations. LCAs produce coefficients with sparsity levels comparable to the most popular centralized sparse coding algorithms while being readily suited for neural implementation. Additionally, LCA coefficients for video sequences demonstrate inertial properties that are both qualitatively and quantitatively more regular (i.e., smoother and more predictable) than the coefficients produced by greedy algorithms.

1 Introduction

Natural images can be well-approximated by a small subset of elements from an overcomplete dictionary (Field, 1994; Olshausen, 2003; Candès and Donoho, 2004). The process of choosing a good subset of dictionary elements, along with the corresponding coefficients, to represent a signal is known as sparse approximation. Recent theoretical and experimental evidence indicates that many sensory neural systems appear to employ similar sparse representations in their population codes (Vinje and Gallant, 2002; Lewicki, 2002; Olshausen and Field, 1996, 2004; Delgutte et al., 1998), encoding a stimulus in the activity of just a few neurons. While sparse coding in neural systems is an intriguing hypothesis, the challenge of collecting simultaneous data from large neural populations makes it difficult to evaluate its credibility without testing predictions from a specific proposed coding mechanism. Unfortunately, we currently lack a proposed sparse coding mechanism that is realistically implementable in the parallel architectures of neural systems and produces sparse coefficients that efficiently represent the time-varying stimuli important to biological systems.

Sparse approximation is a difficult non-convex optimization problem that is at the center of much research in mathematics and signal processing. Existing sparse approximation algorithms suffer from one or more of the following drawbacks: 1) they are not implementable in the parallel analog architectures used by neural systems; 2) they have difficulty producing exactly zero-valued coefficients in finite time; 3) they produce coefficients for time-varying stimuli that contain inefficient fluctuations, making the stimulus content more difficult to interpret; or 4) they only use a heuristic approximation to minimizing a desired objective function. We introduce and study a new neurally plausible algorithm based on the principles of thresholding and local competition that solves a family of sparse approximation problems corresponding to various sparsity metrics. In our Locally Competitive Algorithms (LCAs), neurons in a population continually compete with neighboring units using (usually one-way) lateral inhibition to calculate coefficients representing an input signal in an overcomplete dictionary.
Our continuous-time LCA is described by the dynamics of a system of nonlinear ordinary differential equations (ODEs) that govern the internal state (membrane potential) and external communication (short-term firing rate) of units in a neural population. These systems use computational primitives that correspond to simple analog elements (e.g., resistors, capacitors, amplifiers), making them realistic for parallel implementations.

We show that each LCA corresponds to an optimal sparse approximation problem that minimizes an energy function combining reconstruction mean-squared error (MSE) and a sparsity-inducing cost function.

This paper develops a neural architecture for LCAs, shows their correspondence to a broad class of sparse approximation problems, and demonstrates that LCAs possess three properties critical for a neurally plausible sparse coding algorithm. First, we show that the LCA dynamical system is stable, guaranteeing that a physical implementation is well-behaved. Next, we show that LCAs perform their primary task well, finding codes for fixed images with sparsity comparable to the most popular centralized algorithms. Finally, we demonstrate that LCAs display inertia, coding video sequences with coefficient time series that are significantly smoother in time than the coefficients produced by other algorithms. This increased coefficient regularity better reflects the smooth nature of natural input signals, making the coefficients much more predictable and making it easier for higher-level structures to identify and understand the changing content of the time-varying stimulus. Although still lacking in biophysical realism, the LCA methods presented here represent a first step toward a testable neurobiological model of sparse coding. As an added benefit, the parallel analog architecture described by our LCAs could greatly benefit the many modern signal processing applications that rely on sparse approximation. While the principles we describe apply to many signal modalities, we focus here on the visual system and the representation of video sequences.

2 Background and related work

2.1 Sparse approximation

Given an N-dimensional stimulus s ∈ R^N (e.g., an N-pixel image), we seek a representation in terms of a dictionary D composed of M vectors {φ_m} that span the space R^N. Define the l_p norm of the vector x to be ‖x‖_p = (Σ_m |x_m|^p)^{1/p} and the inner product between x and y to be ⟨x, y⟩ = Σ_m x_m y_m. Without loss of generality, assume the dictionary vectors are unit-norm, ‖φ_m‖_2 = 1. When the dictionary is overcomplete (M > N), there are an infinite number of ways to choose coefficients {a_m} such that s = Σ_{m=1}^{M} a_m φ_m. In optimal sparse approximation, we seek the coefficients having the fewest non-zero entries by solving the minimization problem

    min_a ‖a‖_0   subject to   s = Σ_{m=1}^{M} a_m φ_m,    (1)

where the l_0 "norm"¹ denotes the number of non-zero elements of a = [a_1, a_2, ..., a_M]. Unfortunately, this combinatorial optimization problem is NP-hard (Natarajan, 1995). In the signal processing community, two approaches are typically used to find acceptable suboptimal solutions to this intractable problem. The first general approach substitutes an alternate sparsity measure to convexify the l_0 norm. One well-known example is Basis Pursuit (BP) (Chen et al., 2001), which replaces the l_0 norm with the l_1 norm:

    min_a ‖a‖_1   subject to   s = Σ_{m=1}^{M} a_m φ_m.    (2)

Despite this substitution, BP has the same solution as the optimal sparse approximation problem (Donoho and Elad, 2003) if the signal is sparse compared to the most similar pair of dictionary elements (e.g., ‖a‖_0 < min_{m≠n} (1/2)[1 + 1/|⟨φ_m, φ_n⟩|]).
¹ While clearly not a norm in the mathematical sense, we will use this terminology because it is prevalent in the literature.
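To make the convex relaxation in (2) concrete, the following sketch solves the BP program for a small random dictionary by recasting it as a linear program (splitting a into positive and negative parts). This is our own illustrative reconstruction, not code from the paper; the dictionary, signal, and use of scipy.optimize.linprog are assumptions made here for demonstration.

```python
# Sketch: Basis Pursuit (2) as a linear program, min ||a||_1 s.t. Phi a = s.
# Assumed setup (not from the paper): random unit-norm dictionary, synthetic sparse signal.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, M, K = 20, 40, 3                      # signal dimension, dictionary size, true sparsity
Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0)       # unit-norm columns, as assumed in Section 2.1

a_true = np.zeros(M)
a_true[rng.choice(M, K, replace=False)] = rng.standard_normal(K)
s = Phi @ a_true

# Split a = a_plus - a_minus with a_plus, a_minus >= 0, so ||a||_1 = 1^T (a_plus + a_minus).
c = np.ones(2 * M)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=s, bounds=(0, None))
a_bp = res.x[:M] - res.x[M:]

print("nonzeros in BP solution:", int(np.sum(np.abs(a_bp) > 1e-6)))
print("max deviation from the true sparse coefficients:", float(np.max(np.abs(a_bp - a_true))))
```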

In practice, the presence of signal noise often leads to a modified approach called Basis Pursuit De-Noising (BPDN) (Chen et al., 2001) that makes a tradeoff between reconstruction mean-squared error (MSE) and sparsity in an unconstrained optimization problem:

    min_a ‖s − Σ_{m=1}^{M} a_m φ_m‖_2^2 + λ ‖a‖_1,    (3)

where λ is a tradeoff parameter. BPDN provides the l_1-sparsest approximation for a given reconstruction quality. Many algorithms can be used to solve the BPDN optimization problem, with interior-point-type methods being the most common choice.

The second general approach employed by signal processing researchers uses iterative greedy algorithms to constructively build up a signal representation (Tropp, 2004). The canonical example of a greedy algorithm² is known in the signal processing community as Matching Pursuit (MP) (Mallat and Zhang, 1993). The MP algorithm is initialized with a residual r_0 = s. At the k-th iteration, MP finds the index of the single dictionary element best approximating the current residual signal, θ_k = argmax_m |⟨r_{k−1}, φ_m⟩|. The coefficient d_k = ⟨r_{k−1}, φ_{θ_k}⟩ and index θ_k are recorded as part of the reconstruction, and the residual is updated, r_k = r_{k−1} − d_k φ_{θ_k}. After K iterations, the signal approximation using MP is given by ŝ = Σ_{k=1}^{K} d_k φ_{θ_k}. Though they may not be optimal in general, greedy algorithms often efficiently find good sparse signal representations in practice.
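The greedy update just described can be summarized in a few lines. The sketch below is our own minimal reconstruction of MP under assumed inputs (a generic unit-norm dictionary Phi and signal s as NumPy arrays); it is not code from the paper.

```python
# Sketch: Matching Pursuit as described above (greedy selection + residual update).
import numpy as np

def matching_pursuit(Phi, s, num_iters):
    """Return MP coefficients for signal s in dictionary Phi (unit-norm columns)."""
    a = np.zeros(Phi.shape[1])
    r = s.copy()                              # residual r_0 = s
    for _ in range(num_iters):
        theta = int(np.argmax(np.abs(Phi.T @ r)))  # best-matching dictionary element
        d = Phi[:, theta] @ r                 # coefficient d_k = <r_{k-1}, phi_theta>
        a[theta] += d                         # accumulate (an element may be re-selected)
        r -= d * Phi[:, theta]                # residual update r_k = r_{k-1} - d_k phi_theta
    return a

# Example usage with a random dictionary (our assumption, for illustration only):
rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 40))
Phi /= np.linalg.norm(Phi, axis=0)
s = Phi[:, 3] - 0.5 * Phi[:, 17]
a_mp = matching_pursuit(Phi, s, num_iters=10)
print("active MP coefficients:", np.flatnonzero(np.abs(a_mp) > 1e-6))
```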
2.2 Sparse coding in neural systems

Recent research in neuroscience suggests that V1 population responses to natural stimuli may be the result of a sparse approximation of images. For example, it has been shown that both the spatial and temporal properties of V1 receptive fields may be accounted for in terms of a dictionary that has been optimized for sparseness in response to natural images (Olshausen, 2003). Additionally, V1 recordings in response to natural scene stimuli show activity levels (corresponding to the coefficients {a_m}) becoming more sparse as neighboring units are also stimulated (Vinje and Gallant, 2002). These populations are typically very overcomplete (Olshausen and Field, 2004), allowing great flexibility in the representation of a stimulus. Using this flexibility to achieve sparse codes might offer many advantages to sensory neural systems, including enhancing the performance of subsequent processing stages, increasing the storage capacity of associative memories, and increasing the energy efficiency of the system (Olshausen and Field, 2004).

However, existing sparse approximation algorithms do not have implementations that correspond both naturally and efficiently to plausible neural architectures. For convex relaxation approaches, a network implementation of BPDN can be constructed (Olshausen and Field, 1996), following the common practice of using dynamical systems to implement direct gradient descent optimization (Cichocki and Unbehauen, 1993). Unfortunately, this implementation has two major drawbacks. First, it lacks a natural mathematical mechanism to make small coefficients identically zero. While the true BPDN solution would have many coefficients that are exactly zero, direct gradient methods that find an approximate solution in finite time produce coefficients that merely have small magnitudes. Ad hoc thresholding can be used on the results to produce zero-valued coefficients, but such methods lack theoretical justification and can be difficult to use without oracle knowledge of the best threshold value. Second, this implementation requires persistent (two-way) signaling between all units with overlapping receptive fields (e.g., even a node with a nearly zero value would have to continue sending inhibition signals to all similar nodes). In greedy algorithm approaches, spiking neural circuits can be constructed to implement MP (Perrinet, 2005). Unfortunately, this type of circuit implementation relies on a temporal code that requires tightly coupled and precise elements to both encode and decode.

Beyond implementation considerations, existing sparse approximation algorithms also do not consider the time-varying stimuli faced by neural systems. A time-varying input signal s(t) is represented with a set of time-varying coefficients {a_m(t)}.

² Many types of algorithms for convex and non-convex optimization could be considered greedy in some sense (including systems based on descending the gradient of an instantaneous energy function). Our use of this terminology will apply to the family of iterative greedy algorithms such as MP, to remain consistent with the majority of the signal processing literature.

While temporal coefficient changes are necessary to encode stimulus changes, the most useful encoding would use coefficient changes that reflect the character of the stimulus. In particular, sparse coefficients should have smooth temporal variations in response to smooth changes in the image. However, most sparse approximation schemes have a single goal: select the smallest number of coefficients to represent a fixed signal. This single-minded approach can produce coefficient sequences for time-varying stimuli that are erratic, with drastic changes not only in the values of the coefficients but also in the selection of which coefficients are used. These erratic temporal codes are inefficient because they introduce uncertainty about which coefficients are coding the most significant stimulus changes, thereby complicating the process of understanding the changing stimulus content. In Section 3 we develop our LCAs, in which dictionary elements continually fight for the right to represent the stimulus. These LCAs adapt their coefficients continually over time as the input changes, without having to build a new representation from scratch at each time step. This evolution induces inertia in the coefficients, regularizing the temporal variations for smoothly varying input signals. In contrast to the problems seen with current algorithms, our LCAs are easily implemented in analog circuits composed of neuron-like elements, and they encourage both sparsity and smooth temporal variations in the coefficients as the stimulus changes.

2.3 Other related work

There are several sparse approximation methods that do not fit into the two primary approaches of pure greedy algorithms or convex relaxation. Methods such as Sparse Bayesian Learning (Wipf and Rao, 2004; Tipping, 2001), FOCUSS (Rao and Kreutz-Delgado, 1999), modifications of greedy algorithms that select multiple coefficients on each iteration (Donoho et al., 2006; Pece and Petkov, 2000; Feichtinger et al., 1994), and MP extensions that perform an orthogonalization at each step (Davis et al., 1994; Rebollo-Neira and Lowe, 2002) involve computations that would be very difficult to implement in a parallel distributed architecture. While FOCUSS can implement both l_1 global optimization and l_0 local optimization (Rao and Kreutz-Delgado, 1999) in a dynamical system (Kreutz-Delgado et al., 2003) that uses parallel computation to implement a competition strategy among the nodes (strong nodes are encouraged to grow while weak nodes are penalized), it does not lend itself to forming smooth time-varying representations because coefficients cannot be reactivated once they go to zero. There are also several sparse approximation methods built on a parallel computational framework that are related to our LCAs (Fischer et al., 2004; Kingsbury and Reeves, 2003; Herrity et al., 2006; Rehn and Sommer, 2007; Hale et al., 2007; Figueiredo and Nowak, 2003; Daubechies et al., 2004; Blumensath and Davies, 2008). These algorithms typically start the first iteration with many super-threshold coefficients and iteratively try to prune the representation through a thresholding procedure, rather than charging up from zero as in our LCAs. Appendix A contains a detailed comparison.

3 Locally competitive algorithms for sparse coding

3.1 Architecture of locally competitive algorithms

Our LCAs associate each element of the dictionary, φ_m ∈ D, with a separate computing node, or neuron.
When the system is presented with an input image s(t), the population of neurons evolves according to fixed dynamics (described below) and settles on a collective output {a_m(t)}, corresponding to the short-term average firing rates of the neurons.³ The goal is to define the LCA system dynamics so that few coefficients have non-zero values while the input is approximately reconstructed, ŝ(t) = Σ_m a_m(t) φ_m ≈ s(t). The LCA dynamics are inspired by several properties observed in neural systems: inputs cause the membrane potential to charge up like a leaky integrator; membrane potentials exceeding a threshold produce action potentials for extracellular signaling; and these super-threshold responses inhibit neighboring units through horizontal connections. We represent each unit's sub-threshold value by a time-varying internal state u_m(t). The unit's excitatory input current is proportional to how well the image matches the node's receptive field, b_m(t) = ⟨φ_m, s(t)⟩.

³ Note that for analytical simplicity we allow positive and negative coefficients, but rectified systems could use two physical units to implement one LCA node.

Figure 1: (a) LCA nodes behave as leaky integrators, charging at a speed that depends on how well the input matches the associated dictionary element and on the inhibition received from other nodes. (b) A system diagram showing the inhibition signals sent between nodes; the input to node m is V_m(t) = ⟨φ_m, s(t)⟩ − Σ_{n≠m} T_λ(u_n(t)) ⟨φ_m, φ_n⟩. In this case, only node 2 is shown as being active (i.e., having a coefficient above threshold) and inhibiting its neighbors. Since the neighbors are inactive, the inhibition is one-way.

When the internal state u_m of a node becomes significantly large, the node becomes active and produces an output signal a_m used to represent the stimulus and to inhibit other nodes. This output coefficient is the result of an activation function applied to the membrane potential, a_m = T_λ(u_m), parameterized by the system threshold λ. Though similar activation functions have traditionally taken a sigmoidal form, we consider activation functions that operate as thresholding devices (i.e., essentially zero for values below λ and essentially linear for values above λ). The nodes best matching the stimulus will have internal state variables that charge at the fastest rates and become active soonest. To induce the competition that allows these nodes to suppress weaker nodes, we have active nodes inhibit other nodes with an inhibition signal proportional to both their activity level and the similarity of the nodes' receptive fields. Specifically, the inhibition signal from an active node m to any other node n is proportional to a_m G_{m,n}, where G_{m,n} = ⟨φ_m, φ_n⟩. The possibility of unidirectional inhibition gives strong nodes a chance to prevent weaker nodes from becoming active and initiating counter-inhibition, thus making the search for a sparse solution more energy efficient. Note that unlike the direct gradient descent methods described in Section 2, which require two-way inhibition signals between all nodes that overlap (i.e., have G_{m,n} ≠ 0), LCAs only require one-way inhibition from a small selection of nodes (i.e., only the active nodes). Putting all of the above components together, LCA node dynamics are expressed by the nonlinear ordinary differential equation (ODE)

    u̇_m(t) = (1/τ) [ b_m(t) − u_m(t) − Σ_{n≠m} G_{m,n} a_n(t) ].    (4)

This ODE has essentially the same form as the well-known continuous Hopfield network (Hopfield, 1984). Figure 1 shows an LCA node circuit schematic and a system diagram illustrating the lateral inhibition. To express the system of coupled nonlinear ODEs that governs the whole dynamical system, we collect the internal state variables in the vector u(t) = [u_1(t), ..., u_M(t)]^T, the active coefficients in the vector a(t) = [a_1(t), ..., a_M(t)]^T = T_λ(u(t)), the dictionary elements in the columns of the (N × M) matrix Φ = [φ_1, ..., φ_M], and the driving inputs in the vector b(t) = [b_1(t), ..., b_M(t)]^T = Φ^T s(t). The function T_λ(·) performs element-by-element thresholding on vector inputs. The stimulus approximation is ŝ(t) = Φ a(t).
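To make the local competition in (4) explicit, here is a minimal sketch of the node state update written as a discrete Euler step. The dictionary, threshold, step size, and time constant are illustrative assumptions, not values taken from the paper.

```python
# Sketch: one Euler step of the node dynamics in (4).
# Each node integrates its drive b_m, leaks its own state, and is inhibited by
# active neighbors in proportion to a_n * G_{m,n}, with G_{m,n} = <phi_m, phi_n>.
import numpy as np

def lca_node_step(u, s, Phi, lam, tau=0.01, dt=0.001):
    """Advance all internal states u by one Euler step of equation (4), hard threshold."""
    b = Phi.T @ s                                  # driving inputs b_m = <phi_m, s>
    a = np.where(np.abs(u) > lam, u, 0.0)          # hard-threshold activation (alpha = 0)
    G = Phi.T @ Phi                                # Gram matrix of receptive-field overlaps
    inhibition = G @ a - np.diag(G) * a            # sum over n != m of G_{m,n} a_n
    du = (b - u - inhibition) / tau
    return u + dt * du

# Illustrative usage (random dictionary and signal are our assumptions):
rng = np.random.default_rng(2)
Phi = rng.standard_normal((16, 32))
Phi /= np.linalg.norm(Phi, axis=0)
s = 1.5 * Phi[:, 5]
u = np.zeros(32)
for _ in range(200):
    u = lca_node_step(u, s, Phi, lam=0.1)
print("nodes above threshold:", np.flatnonzero(np.abs(u) > 0.1))
```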

The full dynamical system equation is

    u̇(t) = f(u(t)) = (1/τ) [ b(t) − u(t) − (Φ^T Φ − I) a(t) ],
    a(t) = T_λ(u(t)).    (5)

The LCA architecture described by (5) solves a family of sparse approximation problems with different sparsity measures. Specifically, LCAs descend an energy function combining the reconstruction MSE and a sparsity-inducing cost penalty C(·),

    E(t) = (1/2) ‖s(t) − ŝ(t)‖^2 + λ Σ_m C(a_m(t)).    (6)

The specific form of the cost function C(·) is determined by the form of the thresholding activation function T_λ(·). For a given threshold function, the cost function is specified (up to a constant) by

    λ dC(a_m)/da_m = u_m − a_m = u_m − T_λ(u_m).    (7)

This correspondence between the thresholding function and the cost function can be seen by computing the derivative of E with respect to the active coefficients {a_m} (see Appendix B). Using the relationship in (7) and letting the internal states {u_m} evolve according to u̇_m ∝ −∂E/∂a_m yields the equation for the internal state dynamics given in (4). Note that although the dynamics are specified through a gradient approach, the system is not performing direct gradient descent (i.e., u̇_m is not proportional to −∂E/∂u_m). As long as a_m and u_m are related by a monotonically increasing function, the {a_m} will also descend the energy function E. This method for showing the correspondence between a dynamical system and an energy function is essentially the same procedure used by Hopfield (Hopfield, 1984) to establish network dynamics in associative memory systems.

We focus specifically on the cost functions associated with thresholding activation functions. Thresholding functions limit the lateral inhibition by allowing only strong units to suppress other units, forcing most coefficients to be identically zero. For our purposes, thresholding functions T_λ(·) have two distinct behaviors over their range: they are essentially linear with unit slope above the threshold λ, and essentially zero below threshold. Among many reasonable choices for thresholding functions, we start with a smooth sigmoidal function,

    T_(α,γ,λ)(u_m) = (u_m − αλ) / (1 + e^{−γ(u_m − λ)}),    (8)

where γ is a parameter controlling the speed of the threshold transition and α ∈ [0, 1] indicates what fraction of an additive adjustment is made for values above threshold. An example sigmoidal thresholding function is shown in Figure 2a. We are particularly interested in the limit of this thresholding function as γ → ∞, a piecewise linear function we denote as the ideal thresholding function. In the signal processing literature, T_(0,∞,λ)(·) = lim_{γ→∞} T_(0,γ,λ)(·) is known as a hard thresholding function and T_(1,∞,λ)(·) = lim_{γ→∞} T_(1,γ,λ)(·) is known as a soft thresholding function (Donoho, 1995).

3.2 Sparse approximation by locally competitive algorithms

Combining (7) and (8), we can integrate numerically to determine the cost function corresponding to T_(α,γ,λ)(·), shown in Figure 2b. For the ideal threshold functions we derive a corresponding ideal cost function,

    C_(α,∞,λ)(a_m) = ((1 − α)^2 λ)/2 + α |a_m|.    (9)

Details of this derivation are in Appendix C. Note that unless α = 1, the ideal cost function has a gap because active coefficients cannot take all possible values, |a_m| ∉ (0, (1 − α)λ] (i.e., the ideal thresholding function is not technically invertible).
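The system in (5) is straightforward to simulate with a first-order Euler discretization, which is also how the experiments of Section 4.2 are described. The sketch below is our own illustration under assumed parameters (random dictionary, synthetic input, step size, time constant); it implements the thresholding family (8) and its ideal soft/hard limits and records the energy (6) along the trajectory.

```python
# Sketch: Euler simulation of the full LCA system (5) with the threshold family (8).
# All parameters (dictionary, input, tau, dt, lambda) are illustrative assumptions.
import numpy as np

def T(u, lam, alpha=0.0, gamma=None):
    """Threshold family (8); gamma=None gives the ideal (gamma -> infinity) limit."""
    if gamma is None:  # ideal limit: hard (alpha=0) or soft (alpha=1) thresholding
        return np.where(np.abs(u) > lam, u - alpha * lam * np.sign(u), 0.0)
    return (u - alpha * lam * np.sign(u)) / (1.0 + np.exp(-gamma * (np.abs(u) - lam)))

def lca(Phi, s, lam, alpha, tau=0.01, dt=0.001, steps=500):
    """Integrate (5): tau * du/dt = b - u - (Phi^T Phi - I) a, with a = T_lambda(u)."""
    M = Phi.shape[1]
    G = Phi.T @ Phi - np.eye(M)        # lateral inhibition weights
    b = Phi.T @ s                      # feedforward drive
    u = np.zeros(M)
    energies = []
    for _ in range(steps):
        a = T(u, lam, alpha)
        u = u + (dt / tau) * (b - u - G @ a)
        resid = s - Phi @ a
        # Energy (6) with the ideal cost (9): C(a) = (1-alpha)^2*lam/2 + alpha*|a| for a != 0.
        cost = np.sum((a != 0) * ((1 - alpha) ** 2 * lam / 2 + alpha * np.abs(a)))
        energies.append(0.5 * resid @ resid + lam * cost)
    return T(u, lam, alpha), np.array(energies)

rng = np.random.default_rng(3)
Phi = rng.standard_normal((64, 256))
Phi /= np.linalg.norm(Phi, axis=0)
s = Phi[:, [10, 50, 200]] @ np.array([1.0, -0.7, 0.4])

a_hard, E_hard = lca(Phi, s, lam=0.05, alpha=0.0)   # HLCA
a_soft, E_soft = lca(Phi, s, lam=0.05, alpha=1.0)   # SLCA
print("active coefficients, HLCA:", int(np.count_nonzero(a_hard)),
      "| SLCA:", int(np.count_nonzero(a_soft)))
print("HLCA energy start -> end: %.4f -> %.4f" % (E_hard[0], E_hard[-1]))
```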

Figure 2: Relationship between the threshold function T_(α,γ,λ)(·) and the sparsity cost function C(·). Only the positive halves of the symmetric threshold and cost functions are plotted: the top row shows a_m = T(u_m) and the bottom row shows C(a_m). (a) Sigmoidal threshold function and (b) its cost function for γ = 5, α = 0, and λ = 1. (c) The ideal hard thresholding function (γ = ∞, α = 0, λ = 1) and (d) the corresponding cost function. The dashed line shows the limit, but coefficients produced by the ideal thresholding function cannot take values in this range. (e) The ideal soft thresholding function (γ = ∞, α = 1, λ = 1) and (f) the corresponding cost function.
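The cost curves in Figure 2 can be reproduced by integrating the relationship (7) numerically: sweep u over a fine grid, map it through the threshold function, and accumulate dC = (u − a)/λ da. The sketch below is our own reconstruction of that calculation; the parameter values mirror the caption of Figure 2 but are otherwise illustrative.

```python
# Sketch: numerically integrate (7), lambda * dC/da = u - T(u), to recover the cost
# function induced by the sigmoidal threshold (8) on its positive half.
import numpy as np

lam, alpha, gamma = 1.0, 0.0, 5.0
u = np.linspace(0.0, 3.0, 20001)                             # fine grid of internal states
a = (u - alpha * lam) / (1.0 + np.exp(-gamma * (u - lam)))   # threshold (8), positive half

dC_da = (u - a) / lam                                        # relationship (7)
# Trapezoidal cumulative integral of dC/da with respect to a (a is monotone in u here).
C = np.concatenate([[0.0], np.cumsum(0.5 * (dC_da[1:] + dC_da[:-1]) * np.diff(a))])

# C is now the cost as a function of the coefficient value a (cf. Figure 2b).
print("C at a = 1.0 is approximately %.3f" % np.interp(1.0, a, C))
```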

Figure 3: (a) The largest coefficients from a BPDN solver, sorted by magnitude. (b) The same coefficients, sorted according to the magnitude ordering of the SLCA coefficients. While a gross decreasing trend is noticeable, the largest SLCA coefficients are not in the same locations as the largest BPDN coefficients. Although the solutions have equivalent energy function values, the two sets of coefficients differ significantly.

3.3 Special case: Soft-thresholding locally competitive algorithm (SLCA)

As we saw in Section 3.2, LCAs can optimize a variety of different sparsity measures depending on the choice of thresholding function. One special case is the soft thresholding function, corresponding to α = 1 and shown graphically in Figures 2e and 2f. The soft-thresholding locally competitive algorithm (SLCA) applies the l_1 norm as a cost function on the active coefficients, C_(1,∞,λ)(a_m) = |a_m|. Thus, the SLCA is simply another solution method for the general BPDN problem described in Section 2, and simulations confirm that the SLCA does indeed find solutions with values of E(t) equivalent to the solutions produced by standard interior-point based methods. Despite minimizing the same convex energy function, SLCA and BPDN solvers may find different sets of coefficients, as illustrated in Figure 3. This may be due either to the non-uniqueness of a BPDN solution,⁴ or to the numerical approximations not quite converging to a global optimum in the allotted computation time. The connection between soft-thresholding and BPDN is well known in the case of orthonormal dictionaries (Chen et al., 2001), and recent results (Elad, 2006) have given some justification for using soft-thresholding with overcomplete dictionaries. The SLCA provides another formal connection between the soft-thresholding function and the l_1 cost function.

Though BPDN uses the l_1 norm as its sparsity penalty, we often expect many of the resulting coefficients to be identically zero (especially when M > N). However, most numerical methods (including direct gradient descent and interior-point solvers) will drive coefficients toward zero but will never make them identically zero. While an ad hoc threshold could be applied to the results of a BPDN solver, the SLCA has the advantage of incorporating a natural thresholding function that keeps coefficients identically zero during the computation unless they become active. In other words, while BPDN solvers often start with many non-zero coefficients and try to force coefficients down, the SLCA starts with all coefficients equal to zero and only lets a few grow up. This advantage is especially important for neural systems that must expend energy for non-zero values throughout the entire computation.

⁴ With an overcomplete dictionary, it is technically possible to have a BPDN energy function that is convex (guaranteeing all local minima are global minima) but not strictly convex (guaranteeing that a global minimum is unique).
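The orthonormal-dictionary connection mentioned above is easy to verify numerically: when Φ is orthonormal, the objective separates per coefficient and soft-thresholding the analysis coefficients Φ^T s minimizes the energy (6) with C(a_m) = |a_m| exactly. The sketch below is our own check under assumed inputs (a random orthonormal matrix, a random signal, and a generic scipy minimizer for comparison); it is not the authors' experiment.

```python
# Sketch: for an orthonormal dictionary, soft-thresholding Phi^T s minimizes
# 0.5*||s - Phi a||^2 + lam*||a||_1, i.e., the energy (6) with C(a_m) = |a_m|.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, lam = 8, 0.3
Phi, _ = np.linalg.qr(rng.standard_normal((N, N)))       # orthonormal dictionary
s = rng.standard_normal(N)

def objective(a):
    return 0.5 * np.sum((s - Phi @ a) ** 2) + lam * np.sum(np.abs(a))

b = Phi.T @ s
a_soft = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)   # soft threshold T_(1,inf,lam)(b)

res = minimize(objective, np.zeros(N), method="Nelder-Mead",
               options={"maxiter": 50000, "maxfev": 50000,
                        "xatol": 1e-10, "fatol": 1e-12})

# The closed-form value should be no larger than what the generic solver finds.
print("objective via soft threshold: %.8f" % objective(a_soft))
print("objective via Nelder-Mead   : %.8f" % res.fun)
```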

Figure 4: (a) The dictionary in this example has one extra element consisting of a decaying combination of all other dictionary elements. (b) The input vector has a sparse representation in just a few dictionary elements. (c) MP initially chooses the extra dictionary element, preventing it from finding the optimally sparse representation (coefficients shown after 10 iterations). (d) In contrast, the HLCA system finds the optimally sparse coefficients. (e) The time dynamics of the HLCA system illustrate its advantage. The extra dictionary element is the first node to activate, followed shortly by the nodes corresponding to the optimal coefficients. The collective inhibition of the optimal nodes causes the extra node to die away.

3.4 Special case: Hard-thresholding locally competitive algorithm (HLCA)

Another important special case is the hard thresholding function, corresponding to α = 0 and shown graphically in Figures 2c and 2d. Using the relationship in (7), we see that this hard-thresholding locally competitive algorithm (HLCA) applies an l_0-like cost function by using a constant penalty regardless of magnitude, C_(0,∞,λ)(a_m) = (λ/2) I(|a_m| > λ), where I(·) is the indicator function evaluating to 1 if the argument is true and 0 if the argument is false. Unlike the SLCA, the HLCA energy function is not convex and the system will only find a local minimum of the energy function E(t).

As with the SLCA, the HLCA also has connections to known sparse approximation principles. If node m is fully charged, the inhibition signal it sends to other nodes is exactly the same as the update step when the m-th node is chosen in the MP algorithm. However, due to the continuous competition between nodes before they are fully charged, the HLCA is not equivalent to MP in general. As a demonstration of the power of competitive algorithms over greedy algorithms such as MP, consider a canonical example used to illustrate the shortcomings of iterative greedy algorithms (Chen et al., 2001; DeVore and Temlyakov, 1996). For this example, specify a positive integer K < N and construct a dictionary D with M = N + 1 elements of the following form:

    φ_m = e_m                                                  if m ≤ N,
    φ_m = Σ_{n=1}^{K} κ e_n + Σ_{n=K+1}^{N} (κ/(n − K)) e_n    if m = N + 1,

where e_m is the canonical basis element (i.e., it contains a single 1 in the m-th location) and κ is a constant making the vectors unit norm. In words, the dictionary includes the canonical basis along with one extra element that is a decaying combination of all other elements (illustrated in Figure 4, with N = 20 and K = 5). The input signal is sparsely represented in the first K dictionary elements, s = Σ_{m=1}^{K} (1/√K) e_m. The first MP iteration chooses φ_M, introducing a residual with decaying terms. Even though s has an exact representation in K elements, MP iterates forever trying to atone for this bad initial choice. In contrast, the HLCA initially activates the M-th node but uses the collective inhibition from nodes 1, ..., K to suppress this node and calculate the optimal set of coefficients. While this pathological example is unlikely to occur exactly in natural signals, it is often used as a criticism of greedy methods to demonstrate their shortsightedness. We mention it here to demonstrate the flexibility of LCAs and their differences from pure greedy algorithms.
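The sketch below is our own numerical illustration of this example: it builds the dictionary above (with the assumed values N = 20 and K = 5 used for the figure), runs MP, and shows that the first selection is the extra element and that the residual never reaches zero in a finite number of iterations, even though the signal has an exact K-term representation.

```python
# Sketch: the greedy-pursuit counterexample of Section 3.4 (our reconstruction).
# MP picks the extra "decaying" element first and then never terminates exactly.
import numpy as np

N, K = 20, 5
e = np.eye(N)
extra = np.concatenate([np.ones(K), 1.0 / np.arange(1, N - K + 1)])
extra /= np.linalg.norm(extra)                 # kappa chosen to make the column unit norm
Phi = np.hstack([e, extra[:, None]])           # M = N + 1 dictionary elements

s = e[:, :K].sum(axis=1) / np.sqrt(K)          # exact K-sparse signal

r = s.copy()
selections = []
for k in range(30):
    theta = int(np.argmax(np.abs(Phi.T @ r)))  # greedy selection
    d = Phi[:, theta] @ r
    r -= d * Phi[:, theta]                     # residual update
    selections.append(theta)

print("first MP selection (index N means the extra element):", selections[0])
print("residual norm after 30 MP iterations:", float(np.linalg.norm(r)))
```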

4 LCA system properties

To be a physically viable sparse coding mechanism, LCAs must exhibit several critical properties: the system must remain stable under normal operating conditions, the system must produce sparse coefficients that represent the stimulus with low error, and the coefficient sequences must exhibit regularity in response to time-varying inputs. In this section we show that LCAs exhibit good characteristics in each of these three areas. We focus our analysis on the HLCA, both because it yields the most interesting results and because it is notationally the cleanest to discuss. In general, the analysis principles we use apply to all LCAs through straightforward (though perhaps laborious) extensions.

4.1 Stability

Any proposed neural system must remain well-behaved under normal conditions. While linear systems theory has an intuitive notion of stability that is easily testable (Franklin et al., 1986), no such unifying concept of stability exists for nonlinear systems (Khalil, 2002). Instead, nonlinear systems are characterized in a variety of ways, including their behavior near an equilibrium point u* where f(u*) = 0 and their input-output relationship. The various stability analyses of Sections 4.1.1 and 4.1.2 depend on a common criterion. Define M_u(t) ⊆ {1, ..., M} as the set of nodes that are above threshold in the internal state vector u(t), M_u(t) = {m : |u_m(t)| ≥ λ}. We say that the LCA meets the stability criterion if, for all t, the set of active vectors {φ_m}_{m ∈ M_u(t)} is linearly independent. It makes intuitive sense that this condition is important: if a collection of linearly dependent nodes were active simultaneously, the nodes could have growing coefficients with no net effect on the reconstruction. The system is likely to satisfy the stability criterion eventually under normal operating conditions for two reasons. First, small subsets of dictionary elements are unlikely to be linearly dependent unless the dictionary is designed with this property (e.g., (Tropp, 2006)). Second, sparse coding systems are actively trying to select dictionary subsets so that they can use many fewer coefficients than the dimension of the signal space, |M_u(t)| ≪ N < M. While the LCA lateral inhibition signals discourage linearly dependent sets from activating, the stability criterion can be violated when a collection of nodes becomes active too quickly, before inhibition can take effect.
In simulation, we have observed this situation when the threshold is too low compared to the system time constant. As discussed below, the stability criterion amounts to a sufficient (but not necessary) condition for good behavior, and we have never empirically observed the simulated system becoming unstable, even when this condition is transiently violated.

4.1.1 Equilibrium points

In an LCA presented with a static input, we look to the steady-state response (where u̇(t) = 0) to determine the coefficients. The character of the equilibrium points u* (f(u*) = 0) and the system's behavior in a neighborhood around an equilibrium point provide one way to ensure that a system is well-behaved. Consider the ball around an equilibrium point, B_ε(u*) = {u : ‖u − u*‖ < ε}.

Nonlinear system analysis typically asks an intuitive question: if the system is perturbed within this ball, does it then run away, stay where it is, or get attracted back? Specifically, a system is said to be locally asymptotically stable (Bacciotti and Rosier, 2001) at an equilibrium point u* if one can specify ε > 0 such that u(0) ∈ B_ε(u*) implies lim_{t→∞} u(t) = u*. Previous research (Cohen and Grossberg, 1983; Yang and Dillon, 1994; Li et al., 1988) has used the tools of Lyapunov functions (Khalil, 2002) to study Hopfield networks (Hopfield, 1984) similar to the LCA architecture. However, all of these analyses make assumptions that do not encompass the ideal thresholding functions used in the LCAs (e.g., that the activation functions are continuously differentiable and/or monotone increasing). In Appendix D.1 we show that as long as the stability criterion is met, the HLCA: has a finite number of equilibrium points; has equilibrium points that are almost certainly isolated (no two equilibrium points are arbitrarily close together); and is almost certainly locally asymptotically stable at every equilibrium point. The conditions that hold "almost certainly" are true as long as none of the equilibria have components identically equal to the threshold (|u*_m| ≠ λ for all m), which holds with overwhelming probability. With a finite number of isolated equilibria, we can be confident that the HLCA steady-state response is a distinct set of coefficients representing the stimulus. Asymptotic stability also implies a notion of robustness, guaranteeing that the system will remain well-behaved even under perturbations (Theorems 2.8 and 2.9 in (Bacciotti and Rosier, 2001)).

4.1.2 Input-output stability

In physical systems it is important that the energy of both internal and external signals remain bounded for bounded inputs. One intuitive approach to ensuring output stability is to examine the energy function E. We show in Appendix D.2 that for non-decreasing threshold functions, the energy function is non-increasing (dE(t)/dt ≤ 0) for fixed inputs. While this is encouraging, it does not guarantee input-output stability. To appreciate this effect, note that the HLCA cost function is constant for nodes above threshold: nothing explicitly keeps a node from growing without bound once it is active. While there is no universal input-output stability test for general nonlinear systems, we observe that the LCA system equation is linear and fixed until a unit crosses threshold. A branch of control theory specifically addresses these switched systems (Decarlo et al., 2000). Results from this field indicate that input-output stability can be guaranteed if the individual linear subsystems are stable and the system does not switch too quickly between these subsystems (Hespanha and Morse, 1999). In Appendix D.2 we give a precise mathematical statement of this input-output stability and show that the HLCA linear subsystems are individually stable if and only if the stability criterion is met. Therefore, the HLCA is input-output stable as long as nodes are limited in how fast they can change states. We expect that infinitely fast switching is avoided in practice either by the physical principles of the system implementation or through an explicit hysteresis in the thresholding function.
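The link between the stability criterion and the switched linear subsystems can be illustrated numerically. For the HLCA with a fixed active set M, the active states evolve (up to the 1/τ factor) under the linear map −Φ_M^T Φ_M, which is stable exactly when the active dictionary elements are linearly independent. The sketch below is our own illustration with an assumed random dictionary; it is not an implementation of the paper's appendices.

```python
# Sketch: HLCA linear subsystem stability vs. the stability criterion.
# With a fixed active set M, the active states obey tau * du_M/dt = b_M - Phi_M^T Phi_M u_M,
# so the subsystem is stable iff the active columns of Phi are linearly independent
# (all eigenvalues of -Phi_M^T Phi_M strictly negative).
import numpy as np

rng = np.random.default_rng(5)
N, M = 32, 64
Phi = rng.standard_normal((N, M))
Phi[:, 0] = 0.5 * Phi[:, 1] + 0.5 * Phi[:, 2]     # deliberately dependent triple {0, 1, 2}
Phi /= np.linalg.norm(Phi, axis=0)

def subsystem_eigs(active):
    Phi_M = Phi[:, active]
    return np.linalg.eigvalsh(-(Phi_M.T @ Phi_M))

independent_set = [5, 11, 40, 57]                  # generic columns: almost surely independent
dependent_set = [0, 1, 2]                          # violates the stability criterion

print("max eigenvalue, independent active set:", subsystem_eigs(independent_set).max())
print("max eigenvalue, dependent active set  :", subsystem_eigs(dependent_set).max())
```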
4.2 Sparsity and representation error

Viewing the sparse approximation problem through the lens of rate-distortion theory (Berger, 1971), the most powerful algorithm produces the lowest reconstruction MSE for a given sparsity. When the sparsity measure is the l_1 norm, the problem is convex and the SLCA produces solutions with l_1 sparsity equivalent to interior-point BPDN solvers. Despite the analytic appeal of the l_1 norm as a sparsity measure, many systems concerned with energy minimization (including neural systems) likely have an interest in minimizing the l_0 norm of the coefficients. The HLCA is appealing because of its l_0-like sparsity penalty, but this objective function is not convex and the HLCA may find a local minimum.

We will show that while the HLCA cannot guarantee the l_0-sparsest solution, it produces coefficients with sparsity comparable to MP for natural images. Insight about the HLCA reconstruction fidelity comes from rewriting the LCA system equation as

    u̇(t) = (1/τ) [ Φ^T (s(t) − ŝ(t)) − u(t) + T_(α,∞,λ)(u(t)) ].    (10)

For a constant input, HLCA equilibrium points (u̇(t) = 0) occur when the residual error is orthogonal to the active nodes and balanced against the internal state variables of the inactive nodes:

    ⟨φ_m, s(t) − ŝ(t)⟩ = u_m(t)   if |u_m| ≤ λ,
    ⟨φ_m, s(t) − ŝ(t)⟩ = 0        if |u_m| > λ.

Therefore, when the HLCA converges, the coefficients perfectly reconstruct the component of the input signal that projects onto the subspace spanned by the final set of active nodes. Using standard results from frame theory (Christensen, 2002), we can bound the HLCA reconstruction MSE in terms of the set of inactive nodes:

    ‖s(t) − ŝ(t)‖^2 ≤ (1/η_min) Σ_{m ∉ M_u(t)} ⟨φ_m, s(t) − ŝ(t)⟩^2 ≤ (M − |M_u(t)|) λ^2 / η_min,

where η_min is the minimum eigenvalue of ΦΦ^T.

Though the HLCA is not guaranteed to find the globally optimal l_0-sparsest solution, we must ensure that it does not produce unreasonably non-sparse solutions. While the system nonlinearity makes it impossible to analytically determine the LCA steady-state coefficients, it is possible to rule out some coefficient sets as impossible. For example, let M ⊆ {1, ..., M} be an arbitrary set of active coefficients. Using linear systems theory we can calculate the steady-state response ũ^M = lim_{t→∞} u(t) assuming that M stays fixed. If |ũ^M_m| < λ for any m ∈ M, or if |ũ^M_m| > λ for any m ∉ M, then M cannot describe the set of active nodes in the steady-state response and we call it inconsistent. We show in Appendix E that when the stability criterion is met, the following statement is true for the HLCA: if s = φ_m, then any set of active coefficients M with m ∈ M and |M| > 1 is inconsistent. In other words, the HLCA may use the m-th node or a collection of other nodes to represent s, but it cannot use a combination of both. This result extends intuitively beyond one-sparse signals: each component in an optimal decomposition is represented by either the optimal node or another collection of nodes, but not both. While not necessarily finding the optimal representation, the system does not needlessly employ both the optimal and extraneous nodes.
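The consistency test just described is directly computable: for a candidate active set M, solve the HLCA steady state under the assumption that M stays fixed and then check the threshold conditions. The following sketch is our own reconstruction of that procedure for the hard-thresholding case (α = 0), using an assumed random dictionary and a one-sparse input s = φ_m.

```python
# Sketch: the steady-state consistency test of Section 4.2 for the HLCA (alpha = 0).
# Assuming the active set M stays fixed, the active states solve
#   Phi_M^T Phi_M u_M = Phi_M^T s,  and inactive states settle at u_m = <phi_m, s - Phi_M u_M>.
# M is "inconsistent" if any active state ends below threshold or any inactive one above it.
import numpy as np

def is_consistent(Phi, s, active, lam):
    active = np.asarray(active)
    Phi_M = Phi[:, active]
    u_M = np.linalg.solve(Phi_M.T @ Phi_M, Phi_M.T @ s)     # fixed-active-set steady state
    resid = s - Phi_M @ u_M
    u_inactive = Phi.T @ resid                              # steady state of inactive nodes
    u_inactive[active] = 0.0                                # ignore the active entries here
    return bool(np.all(np.abs(u_M) >= lam) and np.all(np.abs(u_inactive) <= lam))

rng = np.random.default_rng(6)
N, M, lam = 16, 32, 0.2
Phi = rng.standard_normal((N, M))
Phi /= np.linalg.norm(Phi, axis=0)
m = 7
s = Phi[:, m]                                               # one-sparse input s = phi_m

print("M = {m} alone consistent?    ", is_consistent(Phi, s, [m], lam))
print("M = {m, another} consistent? ", is_consistent(Phi, s, [m, 3], lam))
```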
We have also verified numerically that the LCAs achieve a combination of error and sparsity comparable with known methods. We employed a dictionary consisting of the bandpass band of a steerable pyramid (Simoncelli and Freeman, 1995)⁵ with one level and four orientation bands (i.e., the dictionary is approximately four times overcomplete, with 4096 elements). Image patches (32 × 32) were selected at random from a standard set of test images. The selected image patches were decomposed using the steerable pyramid and reconstructed using just the bandpass band.⁶ The bandpass image patches were also normalized to have unit energy. Each image patch was used as the fixed input to the LCA system equation (5), using either a soft or hard thresholding function (with variable threshold values) and with a biologically plausible membrane time constant of τ = 10 ms (Dayan and Abbott, 2001). We simulated the system using a simple Euler's method approach (i.e., a first-order finite difference approximation) (Süli and Mayers, 2003) with a time step of 1 ms. Figure 5 shows the time evolution of the reconstruction MSE and l_0 sparsity for the SLCA and HLCA responding to an individual image, and Figure 6 shows the mean steady-state tradeoff between l_0 sparsity and MSE.

For comparison, we also plot the results obtained using MP,⁷ a standard BPDN interior-point solver⁸ followed by thresholding to enforce l_0 sparsity (denoted "BPDNthr"), and the SLCA with the same threshold applied (denoted "SLCAthr"). Most importantly, note that the HLCA and MP are almost identical in their sparsity-MSE tradeoff. Though the connections between the HLCA and MP were pointed out in Section 3.4, these are very different systems and there is no reason to expect them to produce the same coefficients. Additionally, note that the SLCA produces coefficients that are nearly as l_0-sparse as what can be achieved by oracle thresholding the results of a BPDN solver, even though the SLCA keeps most coefficients zero throughout the calculation.

⁵ We used a publicly available implementation of the steerable pyramid, with the circular boundary handling option.
⁶ We eliminate the lowpass band because it accounts for a large fraction of the total image energy in just a few dictionary elements, and it is unlikely that much gain could be achieved by sparsifying these coefficients.
⁷ We iterated MP until it had the same MSE as the LCA implementation in order to compare the sparsity levels of the results.
⁸ We used a publicly available implementation of a primal-dual BPDN solver.

Figure 5: The time response of the HLCA and SLCA (τ = 10 ms) for a single fixed (32 × 32) image patch with a dictionary that is four times overcomplete (4096 elements). (a) The MSE decay and (b) the l_0 sparsity for the HLCA. (c) The MSE decay and (d) the l_0 sparsity for the SLCA. The error converges within 1-2 time constants and the sparsity often approximately converges within 3-4 time constants. In some cases sparsity is reduced with a longer running time.

Figure 6: Mean tradeoff between MSE and l_0 sparsity for normalized (32 × 32) patches from a standard set of test images, comparing HLCA, SLCA, BPDNthr, SLCAthr, and MP. The dictionary was four times overcomplete with 4096 elements. For a given MSE range, we plot the mean (a) and standard deviation (b) of the l_0 sparsity.

4.3 Time-varying stimuli

4.3.1 Inertia

Biological sensory systems are faced with constantly changing stimuli due to both external movement and internal factors (e.g., organism movement, eye saccades, etc.). As discussed in Section 2.2, sparse codes whose temporal variations also reflect the smooth nature of the changing stimulus would be easier for higher-level systems to understand and interpret. However, approximation methods that only optimize sparsity at each time step (especially greedy algorithms) can produce brittle representations that change dramatically with relatively small stimulus changes. In contrast, LCAs naturally produce smoothly changing outputs in response to smoothly changing time-varying inputs. Assuming that the system time constant τ is faster than the temporal changes in the stimulus, the LCA will evolve to capture the stimulus change and converge to a new sparse representation. While local minima in an energy function are typically problematic, the LCAs can use these local minima to find coefficients that are close to their previous coefficients even if they are not optimally sparse. While permitting suboptimal coefficient sparsity, this property allows the LCA to exhibit inertia that smooths the coefficient sequences.

The inertia property exhibited by LCAs can be seen by focusing on a single node in the system equation (10):

    u̇_m(t) = (1/τ) [ ⟨φ_m, s(t) − ŝ(t)⟩ − u_m(t) ]   when |u_m(t)| < λ,
    u̇_m(t) = (1/τ) [ ⟨φ_m, s(t) − ŝ(t)⟩ − αλ ]        when u_m(t) ≥ λ.

A new residual signal drives the coefficient higher but suffers an additive penalty. Inactive coefficients suffer an increasing penalty as they get closer to threshold, while active coefficients only suffer a constant penalty αλ that can be very small (e.g., the HLCA has αλ = 0).

This property induces a "king of the hill" effect: when a new residual appears, active nodes move virtually unimpeded to represent it, while inactive nodes are penalized until they reach threshold. This inertia encourages inactive nodes to remain inactive unless the active nodes cannot adequately represent the new input.

To illustrate this inertia, we applied the LCAs to a sequence of bandpass filtered, normalized frames from the standard "foreman" test video sequence, with the same experimental setup described in Section 4.2. The LCA input is switched to the next video frame every (simulated) 1/30 seconds. The results are shown in Figure 7, along with comparisons to MP and BPDN applied independently to each frame. The changing coefficient locations are nodes that either became active or became inactive at each frame. Mathematically, the number of changing coefficients at frame n is |M_u(n−1) ⊕ M_u(n)|, where ⊕ is the exclusive OR (symmetric set difference) operator and u(n) denotes the internal state variables at the end of the simulation for frame n. This simulation highlights that the HLCA uses approximately the same number of active coefficients as MP but chooses coefficients that more efficiently represent the video sequence. The HLCA is significantly more likely to re-use active coefficient locations from the previous frame without making significant sacrifices in the sparsity of the solution. This difference is highlighted by the ratio of the number of changing coefficients to the number of active coefficients, |M_u(n−1) ⊕ M_u(n)| / |M_u(n)|. MP has a ratio of 1.7, meaning that MP finds almost an entirely new set of active coefficient locations for each frame. The HLCA has a ratio of 0.5, meaning that it changes approximately 25% of its coefficient locations at each frame. SLCA and BPDNthr have approximately the same performance, with regularity falling between HLCA and MP. Though the two systems can calculate different coefficients, the convexity of the energy function appears to limit the coefficient choices enough that the SLCA cannot smooth the coefficient time series substantially more than BPDNthr.

4.3.2 Markov state transitions

The simulation results indicate that the HLCA produces time-series coefficients that are much more regular than MP. This regularity is visualized in Figure 9 by looking at the time series of example HLCA and MP coefficients. Note that though the two coding schemes produce roughly the same number of non-zero entries, the HLCA does much better than MP at clustering the values into consecutive runs of positive or negative values. This type of smoothness better reflects the regularity of the natural video sequence input. We can quantify this increased regularity by examining the Markov state transitions. Specifically, each coefficient time series is modeled as a Markov chain (Norris, 1997) with three possible states at frame n:

    σ_m(n) = −   if u_m(n) < −λ,
    σ_m(n) = 0   if −λ ≤ u_m(n) ≤ λ,
    σ_m(n) = +   if u_m(n) > λ.

Figure 8 shows the marginal probabilities P(·) of the states and the conditional probabilities P(·|·) of moving to a state given the previous state. The HLCA and MP are equally likely to have non-zero states, but the HLCA is over five times more likely than MP to have a positive coefficient stay positive (P(+|+)). Also, though the absolute probabilities are small, MP is roughly two orders of magnitude more likely to have a coefficient swing from positive to negative (P(−|+)) and vice versa (P(+|−)).
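Estimating these state statistics from a coefficient time series is a short computation. The sketch below is our own illustration with synthetic coefficient trajectories (not the paper's video data); it classifies each frame into the three states above, estimates the marginal and conditional probabilities, and evaluates the conditional entropy used in (11) below.

```python
# Sketch: empirical three-state Markov statistics for coefficient time series.
# u has shape (num_frames, num_nodes); here it is synthetic data standing in for
# the internal states recorded at the end of each simulated video frame.
import numpy as np

rng = np.random.default_rng(7)
lam = 0.5
u = 0.1 * np.cumsum(0.3 * rng.standard_normal((200, 50)), axis=0)   # smooth synthetic states

states = np.zeros_like(u, dtype=int)          # 0 <-> state "0"
states[u > lam] = 1                           # 1 <-> state "+"
states[u < -lam] = -1                         # -1 <-> state "-"

prev, curr = states[:-1].ravel(), states[1:].ravel()
labels = (-1, 0, 1)
counts = np.array([[np.sum((prev == j) & (curr == i)) for i in labels] for j in labels])
P_joint = counts / counts.sum()
P_prev = P_joint.sum(axis=1)                             # marginal P(previous state)
P_cond = P_joint / np.maximum(P_prev[:, None], 1e-12)    # conditional P(current | previous)

with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(P_cond > 0, P_cond * np.log2(P_cond), 0.0)
H_cond = -np.sum(P_prev * plogp.sum(axis=1))             # conditional entropy, cf. (11)

print("P(+|+) =", P_cond[labels.index(1), labels.index(1)])
print("conditional entropy (bits):", H_cond)
```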
To quantify the regularity of the active coefficient locations, we calculate the entropy (Cover and Thomas, 1991) of the coefficient states at frame n conditioned on the coefficient states at frame (n − 1),

    H(σ_m(n) | σ_m(n−1)) = − Σ_{j ∈ {−,0,+}} P(j) Σ_{i ∈ {−,0,+}} P(i|j) log P(i|j),    (11)

plotted in Figure 9. This conditional entropy indicates how much uncertainty there is about the status of the current coefficients given the coefficients from the previous frame. Note that the conditional entropy for MP is almost double the entropy for the HLCA, while the SLCA is again similar to BPDNthr. The principal contributing factor to the conditional entropy appears to be the probability that a non-zero node remains in the same state.

Figure 7: The HLCA and SLCA systems simulated on frames of the "foreman" test video sequence. For comparison, MP coefficients and thresholded BPDN coefficients are also shown. Average values for each system are noted in the legend. (a) Per-frame MSE for each coding scheme, designed to be approximately equal. (b) The number of active coefficients in each frame. (c) The number of changing coefficient locations for each frame, including the number of inactive nodes becoming active and the number of active nodes becoming inactive. (d) The ratio of changing coefficients to active coefficients. A ratio near 2 (such as with MP) means that almost 100% of the coefficient locations are new at each frame. A ratio near 0.5 (such as with the HLCA) means that approximately 25% of the coefficients are new at each frame.


More information

Analysis of an Attractor Neural Network s Response to Conflicting External Inputs

Analysis of an Attractor Neural Network s Response to Conflicting External Inputs Journal of Mathematical Neuroscience (2018) 8:6 https://doi.org/10.1186/s13408-018-0061-0 RESEARCH OpenAccess Analysis of an Attractor Neural Network s Response to Conflicting External Inputs Kathryn Hedrick

More information

Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France

Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France remi.gribonval@inria.fr Structure of the tutorial Session 1: Introduction to inverse problems & sparse

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

DATA MINING AND MACHINE LEARNING

DATA MINING AND MACHINE LEARNING DATA MINING AND MACHINE LEARNING Lecture 5: Regularization and loss functions Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Loss functions Loss functions for regression problems

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

c 2011 International Press Vol. 18, No. 1, pp , March DENNIS TREDE

c 2011 International Press Vol. 18, No. 1, pp , March DENNIS TREDE METHODS AND APPLICATIONS OF ANALYSIS. c 2011 International Press Vol. 18, No. 1, pp. 105 110, March 2011 007 EXACT SUPPORT RECOVERY FOR LINEAR INVERSE PROBLEMS WITH SPARSITY CONSTRAINTS DENNIS TREDE Abstract.

More information

l 0 -norm Minimization for Basis Selection

l 0 -norm Minimization for Basis Selection l 0 -norm Minimization for Basis Selection David Wipf and Bhaskar Rao Department of Electrical and Computer Engineering University of California, San Diego, CA 92092 dwipf@ucsd.edu, brao@ece.ucsd.edu Abstract

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Conditions for Robust Principal Component Analysis

Conditions for Robust Principal Component Analysis Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and

More information

DEVS Simulation of Spiking Neural Networks

DEVS Simulation of Spiking Neural Networks DEVS Simulation of Spiking Neural Networks Rene Mayrhofer, Michael Affenzeller, Herbert Prähofer, Gerhard Höfer, Alexander Fried Institute of Systems Science Systems Theory and Information Technology Johannes

More information

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit

New Coherence and RIP Analysis for Weak. Orthogonal Matching Pursuit New Coherence and RIP Analysis for Wea 1 Orthogonal Matching Pursuit Mingrui Yang, Member, IEEE, and Fran de Hoog arxiv:1405.3354v1 [cs.it] 14 May 2014 Abstract In this paper we define a new coherence

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Compressed sensing. Or: the equation Ax = b, revisited. Terence Tao. Mahler Lecture Series. University of California, Los Angeles

Compressed sensing. Or: the equation Ax = b, revisited. Terence Tao. Mahler Lecture Series. University of California, Los Angeles Or: the equation Ax = b, revisited University of California, Los Angeles Mahler Lecture Series Acquiring signals Many types of real-world signals (e.g. sound, images, video) can be viewed as an n-dimensional

More information

Algorithms for sparse analysis Lecture I: Background on sparse approximation

Algorithms for sparse analysis Lecture I: Background on sparse approximation Algorithms for sparse analysis Lecture I: Background on sparse approximation Anna C. Gilbert Department of Mathematics University of Michigan Tutorial on sparse approximations and algorithms Compress data

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Figure 1.1: Schematic symbols of an N-transistor and P-transistor

Figure 1.1: Schematic symbols of an N-transistor and P-transistor Chapter 1 The digital abstraction The term a digital circuit refers to a device that works in a binary world. In the binary world, the only values are zeros and ones. Hence, the inputs of a digital circuit

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann (Feed-Forward) Neural Networks 2016-12-06 Dr. Hajira Jabeen, Prof. Jens Lehmann Outline In the previous lectures we have learned about tensors and factorization methods. RESCAL is a bilinear model for

More information

Lecture 3 Feedforward Networks and Backpropagation

Lecture 3 Feedforward Networks and Backpropagation Lecture 3 Feedforward Networks and Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 3, 2017 Things we will look at today Recap of Logistic Regression

More information

Lecture 15: Exploding and Vanishing Gradients

Lecture 15: Exploding and Vanishing Gradients Lecture 15: Exploding and Vanishing Gradients Roger Grosse 1 Introduction Last lecture, we introduced RNNs and saw how to derive the gradients using backprop through time. In principle, this lets us train

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Algorithms for Learning Good Step Sizes

Algorithms for Learning Good Step Sizes 1 Algorithms for Learning Good Step Sizes Brian Zhang (bhz) and Manikant Tiwari (manikant) with the guidance of Prof. Tim Roughgarden I. MOTIVATION AND PREVIOUS WORK Many common algorithms in machine learning,

More information

Stability and Robustness of Weak Orthogonal Matching Pursuits

Stability and Robustness of Weak Orthogonal Matching Pursuits Stability and Robustness of Weak Orthogonal Matching Pursuits Simon Foucart, Drexel University Abstract A recent result establishing, under restricted isometry conditions, the success of sparse recovery

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Stability theory is a fundamental topic in mathematics and engineering, that include every

Stability theory is a fundamental topic in mathematics and engineering, that include every Stability Theory Stability theory is a fundamental topic in mathematics and engineering, that include every branches of control theory. For a control system, the least requirement is that the system is

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Bias-free Sparse Regression with Guaranteed Consistency

Bias-free Sparse Regression with Guaranteed Consistency Bias-free Sparse Regression with Guaranteed Consistency Wotao Yin (UCLA Math) joint with: Stanley Osher, Ming Yan (UCLA) Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U) UC Riverside, STATS Department March

More information

2 One-dimensional models in discrete time

2 One-dimensional models in discrete time 2 One-dimensional models in discrete time So far, we have assumed that demographic events happen continuously over time and can thus be written as rates. For many biological species with overlapping generations

More information

18.6 Regression and Classification with Linear Models

18.6 Regression and Classification with Linear Models 18.6 Regression and Classification with Linear Models 352 The hypothesis space of linear functions of continuous-valued inputs has been used for hundreds of years A univariate linear function (a straight

More information

Why Sparse Coding Works

Why Sparse Coding Works Why Sparse Coding Works Mathematical Challenges in Deep Learning Shaowei Lin (UC Berkeley) shaowei@math.berkeley.edu 10 Aug 2011 Deep Learning Kickoff Meeting What is Sparse Coding? There are many formulations

More information

SPARSE SIGNAL RESTORATION. 1. Introduction

SPARSE SIGNAL RESTORATION. 1. Introduction SPARSE SIGNAL RESTORATION IVAN W. SELESNICK 1. Introduction These notes describe an approach for the restoration of degraded signals using sparsity. This approach, which has become quite popular, is useful

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Lecture Notes 5: Multiresolution Analysis

Lecture Notes 5: Multiresolution Analysis Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 27 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks Thomas Natschläger Software Competence Center Hagenberg A-4232 Hagenberg, Austria Thomas.Natschlaeger@scch.at

More information

Lecture 9: Generalization

Lecture 9: Generalization Lecture 9: Generalization Roger Grosse 1 Introduction When we train a machine learning model, we don t just want it to learn to model the training data. We want it to generalize to data it hasn t seen

More information

Max$Sum(( Exact(Inference(

Max$Sum(( Exact(Inference( Probabilis7c( Graphical( Models( Inference( MAP( Max$Sum(( Exact(Inference( Product Summation a 1 b 1 8 a 1 b 2 1 a 2 b 1 0.5 a 2 b 2 2 a 1 b 1 3 a 1 b 2 0 a 2 b 1-1 a 2 b 2 1 Max-Sum Elimination in Chains

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

Neural Modeling and Computational Neuroscience. Claudio Gallicchio

Neural Modeling and Computational Neuroscience. Claudio Gallicchio Neural Modeling and Computational Neuroscience Claudio Gallicchio 1 Neuroscience modeling 2 Introduction to basic aspects of brain computation Introduction to neurophysiology Neural modeling: Elements

More information

EE 381V: Large Scale Optimization Fall Lecture 24 April 11

EE 381V: Large Scale Optimization Fall Lecture 24 April 11 EE 381V: Large Scale Optimization Fall 2012 Lecture 24 April 11 Lecturer: Caramanis & Sanghavi Scribe: Tao Huang 24.1 Review In past classes, we studied the problem of sparsity. Sparsity problem is that

More information

Convergence in shape of Steiner symmetrized line segments. Arthur Korneychuk

Convergence in shape of Steiner symmetrized line segments. Arthur Korneychuk Convergence in shape of Steiner symmetrized line segments by Arthur Korneychuk A thesis submitted in conformity with the requirements for the degree of Master of Science Graduate Department of Mathematics

More information

Alternating Direction Method of Multipliers. Ryan Tibshirani Convex Optimization

Alternating Direction Method of Multipliers. Ryan Tibshirani Convex Optimization Alternating Direction Method of Multipliers Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last time: dual ascent min x f(x) subject to Ax = b where f is strictly convex and closed. Denote

More information

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing Linear recurrent networks Simpler, much more amenable to analytic treatment E.g. by choosing + ( + ) = Firing rates can be negative Approximates dynamics around fixed point Approximation often reasonable

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Sparse Solutions of Systems of Equations and Sparse Modelling of Signals and Images

Sparse Solutions of Systems of Equations and Sparse Modelling of Signals and Images Sparse Solutions of Systems of Equations and Sparse Modelling of Signals and Images Alfredo Nava-Tudela ant@umd.edu John J. Benedetto Department of Mathematics jjb@umd.edu Abstract In this project we are

More information

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari

Learning MN Parameters with Alternative Objective Functions. Sargur Srihari Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient

More information

Estimation Error Bounds for Frame Denoising

Estimation Error Bounds for Frame Denoising Estimation Error Bounds for Frame Denoising Alyson K. Fletcher and Kannan Ramchandran {alyson,kannanr}@eecs.berkeley.edu Berkeley Audio-Visual Signal Processing and Communication Systems group Department

More information

SPARSE approximation decomposes a signal y R M in

SPARSE approximation decomposes a signal y R M in Convergence of a Neural Network for Sparse Approximation using the Nonsmooth Łojasiewicz Inequality Aurèle Balavoine, Christopher J. Rozell, Justin Romberg Abstract Sparse approximation is an optimization

More information

arxiv: v1 [math.na] 26 Nov 2009

arxiv: v1 [math.na] 26 Nov 2009 Non-convexly constrained linear inverse problems arxiv:0911.5098v1 [math.na] 26 Nov 2009 Thomas Blumensath Applied Mathematics, School of Mathematics, University of Southampton, University Road, Southampton,

More information

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We

More information

Fundamentals of Dynamical Systems / Discrete-Time Models. Dr. Dylan McNamara people.uncw.edu/ mcnamarad

Fundamentals of Dynamical Systems / Discrete-Time Models. Dr. Dylan McNamara people.uncw.edu/ mcnamarad Fundamentals of Dynamical Systems / Discrete-Time Models Dr. Dylan McNamara people.uncw.edu/ mcnamarad Dynamical systems theory Considers how systems autonomously change along time Ranges from Newtonian

More information

Appendix A.1 Derivation of Nesterov s Accelerated Gradient as a Momentum Method

Appendix A.1 Derivation of Nesterov s Accelerated Gradient as a Momentum Method for all t is su cient to obtain the same theoretical guarantees. This method for choosing the learning rate assumes that f is not noisy, and will result in too-large learning rates if the objective is

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Nonlinear dynamics & chaos BECS

Nonlinear dynamics & chaos BECS Nonlinear dynamics & chaos BECS-114.7151 Phase portraits Focus: nonlinear systems in two dimensions General form of a vector field on the phase plane: Vector notation: Phase portraits Solution x(t) describes

More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information Introduction Consider a linear system y = Φx where Φ can be taken as an m n matrix acting on Euclidean space or more generally, a linear operator on a Hilbert space. We call the vector x a signal or input,

More information

Sparse Optimization Lecture: Basic Sparse Optimization Models

Sparse Optimization Lecture: Basic Sparse Optimization Models Sparse Optimization Lecture: Basic Sparse Optimization Models Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know basic l 1, l 2,1, and nuclear-norm

More information

Adaptive Dual Control

Adaptive Dual Control Adaptive Dual Control Björn Wittenmark Department of Automatic Control, Lund Institute of Technology Box 118, S-221 00 Lund, Sweden email: bjorn@control.lth.se Keywords: Dual control, stochastic control,

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline How the Brain Works Artificial Neural Networks Simple Computing Elements Feed-Forward Networks Perceptrons (Single-layer,

More information

Lecture 4: Feed Forward Neural Networks

Lecture 4: Feed Forward Neural Networks Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

Fast Sparse Representation Based on Smoothed

Fast Sparse Representation Based on Smoothed Fast Sparse Representation Based on Smoothed l 0 Norm G. Hosein Mohimani 1, Massoud Babaie-Zadeh 1,, and Christian Jutten 2 1 Electrical Engineering Department, Advanced Communications Research Institute

More information

Dual Augmented Lagrangian, Proximal Minimization, and MKL

Dual Augmented Lagrangian, Proximal Minimization, and MKL Dual Augmented Lagrangian, Proximal Minimization, and MKL Ryota Tomioka 1, Taiji Suzuki 1, and Masashi Sugiyama 2 1 University of Tokyo 2 Tokyo Institute of Technology 2009-09-15 @ TU Berlin (UT / Tokyo

More information