On The Equivalence of Hierarchical Temporal Memory and Neural Nets
Bedeho Mesghina Wolde Mender

December 7, 2009

Abstract

In this paper we present a rigorous definition of classification in a common family of Hierarchical Temporal Memory networks, and provide a formal analysis of their classification power. We determine the classification power of a single unit, and subsequently demonstrate how to reduce a network of arbitrary topology and size to such a single unit. This reduction is shown to require no worse than an exponential increase in memory, and finally we show how such a network, via a reduction to a single unit, is equivalent to a simple 3 layer feed-forward network of threshold units.

1 Introduction

With our maturing understanding of cortical computation in mammals, particularly in the visual cortex, a number of biologically inspired vision systems and models of cortex have been developed, e.g. by K. Fukushima [3], M. Riesenhuber and T. Poggio [2], T. S. Lee and D. Mumford [4], G. Wallis and E. T. Rolls [1], T. Dean [9] and finally D. George and J. Hawkins [5]. These models generally mimic the hierarchical organization and the alternating simple and complex cell structure of the visual cortex [6]. In the context of machine learning, this family of models may be called hierarchical classifiers, and it is from this perspective that they are studied in this paper. We consider the model of D. George and J. Hawkins, called Hierarchical Temporal Memory (HTM), first outlined in the 2004 book On Intelligence by J. Hawkins as a biological model of cortex. We chose this model not only because of the significant empirical performance it has demonstrated [5], or because it subsumes similar models [7], but also because it is being actively developed and enjoys a growing community of developers using a general purpose computing platform based on it.¹ The goal of this paper is to demonstrate the usefulness of formal analysis of biologically inspired models of cortex.
It is also the opinion of the author of this paper that if such models aspire to be not only devices of utility but also accurate descriptions of cortical computation, then foundational insight into their computational mechanics is a necessary step towards a full understanding of the cortex.

bmmender@ifi.uio.no, Institute of Informatics, University of Oslo, Norway.
¹ Numenta Inc. is a private company, founded by D. George and J. Hawkins, developing a general purpose computing platform, called NuPIC, based on the HTM model.

2 Hierarchical Temporal Memory

In this paper we consider fully trained HTM networks ready for classification tasks; for more information on how learning, prediction and other features work, please consult [7, 5, 8]. An HTM network is a tree of nodes where the input is fed into the leaf nodes and the result is output from the top node. The leaf nodes are given nonoverlapping sections of the total network input, and the internal nodes are given input from their children. A leaf node has a finite list of vectors from its input space, called spatial patterns, and these are partitioned into groups called temporal groups.² Given some input x, the output of the node is a vector with a score for each of its temporal groups, where the score of a group is the maximum score among all spatial patterns in that group. The score of each individual spatial pattern is a simple matching score between the pattern and the input x. An internal node also has a finite list of vectors, called coincidence vectors. These are as long as the number of children of the node, and each position in a coincidence vector references a temporal group in the corresponding child node. So for example if a node has a coincidence vector (2, 4, 6), then the node has three children and this specific coincidence references the second, fourth and sixth temporal group in the children respectively. As in the leaf nodes, these are partitioned into temporal groups and the score of each group is the maximum score among all coincidence vectors in that group. The score of each coincidence is calculated by taking the product of the scores of the temporal groups in the children referenced by the coincidence. With this in mind we give a formal definition, but first some preliminaries. Let x = (x_1, ..., x_n) ∈ R^n and y = (y_1, ..., y_m) ∈ R^m.
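Before the formal definitions, the informal scoring scheme just described can be sketched in code. This is a hedged illustration, not the reference implementation: all names are invented, and the Gaussian match score e^(−‖x−c‖²/σ) used for leaf patterns is the one formalized later in the text.

```python
import math

def leaf_output(x, temporal_groups, sigma=1.0):
    """One score per temporal group: the best Gaussian match among
    the spatial patterns in that group."""
    def pattern_score(c):
        return math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / sigma)
    return [max(pattern_score(c) for c in group) for group in temporal_groups]

def internal_output(child_outputs, temporal_groups):
    """Each coincidence references one temporal group per child; its
    score is the product of the referenced child group scores, and a
    group's score is the maximum over its coincidences."""
    def coincidence_score(coincidence):
        prod = 1.0
        for child, group_idx in enumerate(coincidence):
            prod *= child_outputs[child][group_idx]
        return prod
    return [max(coincidence_score(c) for c in group) for group in temporal_groups]

# Toy example: a leaf with two temporal groups of 1-D patterns.
groups = [[(0.0,), (0.2,)], [(1.0,)]]
out = leaf_output((0.1,), groups)
assert out[0] > out[1]  # input 0.1 is closer to the first group
```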
Vector concatenation of y onto x is denoted x ⊕ y, that is x ⊕ y = (x_1, ..., x_n, y_1, ..., y_m). We may project out a contiguous section, or just one position, of a vector by writing x[i, j] = (x_i, ..., x_j) or x[i] = x_i respectively. We use the standard norm, that is ‖x‖² = Σ_{i=1}^n (x_i)². Finally, a cover for some set T is a family of sets {T_1, ..., T_p} such that T = ∪_i T_i, and this cover is called a partitioning if T_i ∩ T_j = ∅ for i ≠ j.

Definition 1. A gaussian node is defined as G = (l, r, σ, S, T) where
- 1 ≤ l ≤ r, and (l, r) is called the receptive field,
- σ ∈ R is called the spread,
- S ⊆ R^{r−l+1} is finite and called the spatial set,
- T = {T_1, ..., T_p} is a partitioning of S, and the partitions are called temporal groups.

² These names (spatial pattern, temporal group, coincidence) are due to how this information is learned during training, and while this is not relevant in this presentation, we choose to use the original names for the sake of consistency.

We may indicate the receptive field of a node by writing G_{l,r}. We overload the notation and allow interpreting G as a function, so for any x ∈ R^d where d ≥ r we let

G(x) = (z_1(x[l, r]), ..., z_p(x[l, r])), where z_i(u) = max_{c ∈ T_i} e^{−‖u − c‖²/σ}.

Definition 2. A network with receptive field (l, r) is inductively defined as a gaussian node G_{l,r}, or as H = (H^1_{r_1,r_2}, ..., H^n_{r_{2n−1},r_{2n}}, C, T) where
- H^1_{r_1,r_2}, ..., H^n_{r_{2n−1},r_{2n}} are networks such that l = r_1 < ... < r_{2n} = r and r_3 − r_2 = ... = r_{2n−1} − r_{2n−2} = 1,
- C ⊆ {(c_1, ..., c_n) | 1 ≤ c_i ≤ d_i} is called the coincidence set, where d_i is the number of temporal groups in H^i,
- T = {T_1, ..., T_p} is a partitioning of C, and the partitions are called temporal groups.

As before we overload the notation and allow interpreting H as a function, so for any x ∈ R^d where d ≥ r we let

H(x) = (z_1(x), ..., z_p(x)), where z_i(x) = max_{c ∈ T_i} ∏_{j=1}^n H^j(x)[c[j]].

We first consider a gaussian node by itself. Recall that each spatial pattern is a point in the input space, and that groups of these points constitute temporal groups. Given two spatial patterns a, b ∈ R^n and some input x ∈ R^n, it is easily observable that the score for a dominates the score for b as long as x is closer to a than to b. You may observe

e^{−‖x − a‖²/σ} = e^{−‖x − b‖²/σ} ⟺ ‖x − a‖² = ‖x − b‖² ⟺ ‖x − a‖ = ‖x − b‖,

hence the score boundary, where they have equal score, is the hyperplane of inputs equidistant to a and b. This is the case for any pair of spatial patterns, so if we fix some node spatial pattern a and consider all other node patterns b_1, ..., b_n, we see that the region of the input space where a has the greatest score among all patterns is the convex polytope³ arising from the intersection of all score boundary hyperplanes between a and each of b_1, ..., b_n.

³ A polytope is an n-dimensional generalization of the two-dimensional polygon, and a convex polytope is one which defines a convex set. They are often defined as intersections of a finite number of half-spaces, which again are characterized by hyperplanes.

More generally, this means that the temporal group {a_1, ..., a_m} dominates all other temporal groups exactly for all points residing in some convex polytope around some a_i. Moreover we see that the input space of any gaussian node is covered by a finite number of convex polytopes. We emphasize that it is covered and not partitioned, since the polytopes overlap exactly at the score boundaries. A convenient way of representing a convex polytope is by keeping the coefficients of the bounding hyperplanes in matrix form; this allows us to efficiently represent each temporal group as a finite number of such matrices.

Definition 3. A cover {T_1, ..., T_p} for R^n is said to be a finite union of convex polytopes when for each i = 1, ..., p there exist m × n matrices A^i_1, ..., A^i_{k_i} and vectors b^i_1, ..., b^i_{k_i} ∈ R^m such that x ∈ T_i ⟺ A^i_j x ≤ b^i_j for some j = 1, ..., k_i.

In general, both for a full network and a single gaussian node, it is natural to consider how the network classifies the input space, that is how it gives category labels to different input points. As we already know, in the case of gaussian nodes the score boundaries for different categories assign each point in the input space some fixed label set. In practice, when some point x is assigned more than one label, the network has given the same maximum score to two or more categories.

Definition 4. For any n, m > 0, a function f : R^n → P({l_1, ..., l_m}) is called a classification in dimension n of size m when for each l_i there exists an x ∈ R^n such that l_i ∈ f(x). Given a classification f, we define R^n_{l_i} = {x ∈ R^n | l_i ∈ f(x)}, and R^n_f = {R^n_{l_1}, ..., R^n_{l_m}}, which is a cover for R^n. For any network H_{l,r} with n temporal groups we define the classification f_H : R^{r−l+1} → P({l_1, ..., l_n}) by

f_H(x) = { l_i | i ∈ argmax_{1 ≤ j ≤ n} H(x)[j] }.

Now we can precisely summarize our recent discussion in the form of the next theorem.

Theorem 5.
For any gaussian node G_{1,n}, R^n_{f_G} is a finite union of convex polytopes.

Proof. We demonstrate this by providing the matrices that characterize each partition. So let G_{1,n} = (l, r, σ, S, {T_1, ..., T_p}), let S = {s_1, ..., s_m} and let T_i = {s^i_1, ..., s^i_{k_i}} for any 1 ≤ i ≤ p. For all i = 1, ..., p and j = 1, ..., k_i let A^i_j be the m × n matrix whose u-th row is

(s_u[1] − s^i_j[1], ..., s_u[n] − s^i_j[n]),   u = 1, ..., m,

and let b^i_j ∈ R^m be the vector with entries

b^i_j[u] = (1/2) Σ_{v=1}^n (s_u[v]² − s^i_j[v]²),   u = 1, ..., m.

Now fix any 1 ≤ i ≤ p and observe that for any x = (x_1, ..., x_n) ∈ R^n

x ∈ R^n_{l_i}
⟺ l_i ∈ f_G(x)   (def. of R^n_{l_i})
⟺ i ∈ argmax_{1 ≤ u ≤ p} G(x)[u]   (def. of f_G)
⟺ ∀u: G(x)[i] ≥ G(x)[u]
⟺ ∀u: max_{c ∈ T_i} e^{−‖x − c‖²/σ} ≥ max_{c ∈ T_u} e^{−‖x − c‖²/σ}
⟺ ∃j ∀c ∈ S: e^{−‖x − s^i_j‖²/σ} ≥ e^{−‖x − c‖²/σ}
⟺ ∃j ∀c ∈ S: ‖x − s^i_j‖² ≤ ‖x − c‖²
⟺ ∃j ∀c ∈ S: Σ_{v=1}^n (x[v] − s^i_j[v])² ≤ Σ_{v=1}^n (x[v] − c[v])²
⟺ ∃j ∀c ∈ S: Σ_{v=1}^n (x[v]² − 2x[v]s^i_j[v] + s^i_j[v]²) ≤ Σ_{v=1}^n (x[v]² − 2x[v]c[v] + c[v]²)
⟺ ∃j ∀c ∈ S: Σ_{v=1}^n (−2x[v]s^i_j[v] + s^i_j[v]²) ≤ Σ_{v=1}^n (−2x[v]c[v] + c[v]²)
⟺ ∃j ∀c ∈ S: Σ_{v=1}^n x[v](c[v] − s^i_j[v]) ≤ (1/2) Σ_{v=1}^n (c[v]² − s^i_j[v]²)
⟺ A^i_j x ≤ b^i_j for some j = 1, ..., k_i.

This proves that the cover R^n_{f_G} is a finite union of convex polytopes.

With our understanding of the classification in a gaussian node we can try to relate this to a more complicated network, so consider a two level network with one top node and two gaussian children, both with the same spread σ. Let c ∈ T_i be some coincidence in the top node which references temporal groups T_L, T_R in the left and right child nodes respectively. Let's say that for some particular network input x, the score of T_L, T_R is the score of some a ∈ T_L, b ∈ T_R respectively, which also means that the score of c is the score of a times the score of b. In general, for any input x there exist such â ∈ T_L, b̂ ∈ T_R as mentioned above, and observe then that the score of c is

e^{−‖x[l, r_1] − â‖²/σ} · e^{−‖x[r_2, r] − b̂‖²/σ} = e^{−(‖x[l, r_1] − â‖² + ‖x[r_2, r] − b̂‖²)/σ} = e^{−‖x − â ⊕ b̂‖²/σ}.
We recognize the last term as the score of the spatial pattern â ⊕ b̂ in a gaussian node with spread σ. We also see that the score of c will always be this term, by way of the combination of â ∈ T_L, b̂ ∈ T_R which yields the maximum score for the particular x. This suggests a method for reducing a two layer network to a single node, and by repeatedly doing this from below we can reduce any network to a single gaussian node.

Theorem 6. For any network H_{l,r} where all gaussian nodes have the same spread σ, there exists a gaussian node G_{l,r} with spread σ such that for any x ∈ R^d where d ≥ r we have H(x) = G(x).

Proof. We prove this by induction on the structure of H. The base case is trivial, since then H itself is a gaussian node. Now let H = (H^1_{r_1,r_2}, ..., H^n_{r_{2n−1},r_{2n}}, C, T). For each H^i let G^i be a gaussian node with spread σ such that H^i(x) = G^i(x) for all x ∈ R^d. Such a gaussian node must exist for each i: if H^i itself is a gaussian node it has spread σ by assumption, and if it is not a gaussian node we apply the induction hypothesis to it and obtain the necessary gaussian node. We can now, based on H, G^1, ..., G^n, construct the desired gaussian node G. Let T^j_i and T_i denote the i-th temporal group of G^j and the i-th temporal group of H respectively. For each c ∈ C let

S_c = { t_1 ⊕ ... ⊕ t_n | t_1 ∈ T^1_{c[1]}, ..., t_n ∈ T^n_{c[n]} },

and for each i = 1, ..., p let

β_i = ∪_{c ∈ T_i} S_c,

letting us define our gaussian node

G = (l, r, σ, ∪_{c ∈ C} S_c, {β_1, ..., β_p}).

To confirm that H and G agree, we consult their i-th components and establish equality. So for any 1 ≤ i ≤ p observe that

H(x)[i] = max_{c ∈ T_i} ∏_{j=1}^n H^j(x)[c[j]]   (def. H)
= max_{c ∈ T_i} ∏_{j=1}^n G^j(x)[c[j]]   (H^j(x) = G^j(x))
= max_{c ∈ T_i} ∏_{j=1}^n max_{c_j ∈ T^j_{c[j]}} e^{−‖x[r_{2j−1}, r_{2j}] − c_j‖²/σ}   (def. G^j)
= max_{c ∈ T_i} max_{c_1 ∈ T^1_{c[1]}, ..., c_n ∈ T^n_{c[n]}} ∏_{j=1}^n e^{−‖x[r_{2j−1}, r_{2j}] − c_j‖²/σ}   (*)
= max_{c ∈ T_i} max_{c_1 ∈ T^1_{c[1]}, ..., c_n ∈ T^n_{c[n]}} e^{−‖x − c_1 ⊕ ... ⊕ c_n‖²/σ}   (**)
= max_{c ∈ T_i} max_{u ∈ S_c} e^{−‖x − u‖²/σ}   (def. S_c)
= max_{u ∈ β_i} e^{−‖x − u‖²/σ}   (def. β_i)
= G(x)[i].   (def. G)

For (*) observe that, the scores being nonnegative, taking the product of a sequence of function maximums is the same as taking the maximum over all ordered products of the functions, that is

max_{i_1} {F_1(i_1)} · ... · max_{i_n} {F_n(i_n)} = max_{i_1, ..., i_n} {F_1(i_1) · ... · F_n(i_n)}.
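The product-of-maxima identity invoked at (*) is easy to check numerically. A small sketch with invented score lists (any nonnegative values would do):

```python
from itertools import product

# Scores of two children's candidates; arbitrary nonnegative values.
F = [0.2, 0.9, 0.5]
G = [0.7, 0.3]

# Product of per-child maxima ...
lhs = max(F) * max(G)
# ... equals the maximum over all ordered combinations of products.
rhs = max(f * g for f, g in product(F, G))
assert abs(lhs - rhs) < 1e-12  # both pick 0.9 * 0.7
```

Nonnegativity matters here: with mixed signs, two large negative factors could yield a product exceeding the product of the maxima.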
For (**) observe that

∏_{j=1}^n e^{−‖x[r_{2j−1}, r_{2j}] − c_j‖²/σ}
= e^{−(1/σ) Σ_{j=1}^n ‖x[r_{2j−1}, r_{2j}] − c_j‖²}
= e^{−(1/σ) Σ_{j=1}^n Σ_{l=1}^{r_{2j} − r_{2j−1} + 1} (x[r_{2j−1}, r_{2j}][l] − c_j[l])²}
= e^{−(1/σ) Σ_{j=1}^n Σ_{l=r_{2j−1}}^{r_{2j}} (x[l] − c_j[l − r_{2j−1} + 1])²}
= e^{−(1/σ) Σ_{l=r_1}^{r_{2n}} (x[l] − (c_1 ⊕ ... ⊕ c_n)[l])²}
= e^{−‖x − c_1 ⊕ ... ⊕ c_n‖²/σ}.

Observe that the process suggested in the previous theorem can also be performed in reverse, giving a means of decomposing a single node into a network. The practical feasibility of such a reduction depends on the increase in the number of spatial patterns, since they must be saved in memory and iterated over when computing the score. Therefore we give a very loose upper bound on the size of G_H, the node built in the previous theorem, by way of an inductive function counting the number of spatial patterns in a network, so let

Spatial((l, r, σ, S, T)) = |S|,
Spatial((H^1, ..., H^n, C, T)) = |C| + Σ_i Spatial(H^i).

Theorem 7. For any network H we have that Spatial(G_H) ≤ 2^{Spatial(H)}.

Proof. We prove this by induction on the structure of H. The base case is trivial, since then H = G_H.
For the induction step we have

Spatial(G_H) = |∪_{c ∈ C} S_c|   (def. G_H)
≤ Σ_{c ∈ C} |S_c|
= Σ_{c ∈ C} |T^1_{c[1]}| · ... · |T^n_{c[n]}|   (def. S_c)
≤ Σ_{c ∈ C} Spatial(G_{H^1}) · ... · Spatial(G_{H^n})   (*)
≤ Σ_{c ∈ C} 2^{Spatial(H^1)} · ... · 2^{Spatial(H^n)}   (IH)
= Σ_{c ∈ C} 2^{Spatial(H^1) + ... + Spatial(H^n)}
= |C| · 2^{Spatial(H^1) + ... + Spatial(H^n)}
= 2^{lg|C| + Spatial(H^1) + ... + Spatial(H^n)}
≤ 2^{Spatial(H)}.

For (*) recall that the temporal groups of a node partition its spatial pattern set, so obviously |T^i_{c[i]}| ≤ Spatial(G_{H^i}). The last step holds since lg|C| ≤ |C| and Spatial(H) = |C| + Σ_i Spatial(H^i).

It is well known that a single linear threshold unit, or perceptron, can solve linearly separable classification tasks. This means we can use a collection of such units to capture the convex polytope surrounding each spatial pattern in a gaussian node, by having one perceptron for each hyperplane supporting the polytope. Further, we can combine many such collections of units to determine whether the input is in at least one of the polytopes in a temporal group, thereby deciding if that temporal group is the winning group for that particular input. This can be done for each temporal group, giving us a simple feed-forward network of perceptrons which fully captures the classification the node induces on the input space. This can also be done for full networks via Theorem 6.

Theorem 8. For any finite union of convex polytopes cover {T_1, ..., T_p} for R^n there exists a 3 layer feed-forward neural network classifying it. Moreover, such a network exists for any network H_{1,n} where all gaussian nodes have the same spread.

Proof. We provide a neural network as described which takes any x ∈ R^n and has p outputs, one for each element T_i in the cover. For any x ∈ R^n the network will spike on output i if and only if x ∈ T_i. We use simple neurons with a threshold activation function and an input bias. We construct one network N_i for each T_i which spikes if and only if x ∈ T_i, and each such network provides one of the p outputs of the total network.
Fix some i and recall that, by assumption on the cover, there exist m × n matrices A^i_1, ..., A^i_k and vectors b^i_1, ..., b^i_k ∈ R^m such that x ∈ T_i ⟺ A^i_j x ≤ b^i_j for some j = 1, ..., k. So we may decide x ∈ T_i by examining A^i_j x ≤ b^i_j for at most all 1 ≤ j ≤ k, and therefore we build a neural network A^i_j for each j which does exactly this. Observe that testing the condition amounts to testing on which side of m hyperplanes, one for each row in A^i_j combined with the corresponding cell from b^i_j, x lies. This is easily achieved by having m neurons V^{i,j}_1, ..., V^{i,j}_m, each deciding x for one hyperplane. The network A^i_j is then simply an AND-gate which spikes when all of V^{i,j}_1, ..., V^{i,j}_m spike. Finally, the network N_i is then simply an OR-gate which spikes when at least one of A^i_1, ..., A^i_k spikes. We now have a 3 layer feed-forward network where layer three is p OR nodes, layer two is AND nodes and layer one is a layer of hyperplane deciders.

3 Discussion

As we alluded to in the abstract, the flavor of networks considered here by no means represents the full range of networks that one can have in the HTM model. The model has matured significantly, although the fundamental concepts of hierarchical temporal and spatial pooling remain the same. The two primary alternative forms are letting the score of a temporal group be a weighted sum of pattern/coincidence scores instead of the maximum, and letting the score of a coincidence pattern be a weighted sum of child temporal group scores instead of a product. More exotic features include mixing node types, overlapping receptive fields and level skipping; these are however much less used in practice. A glaring limitation in our results is that we require all gaussian nodes to share the same spread; this is however not as severe a restriction as it may seem. In principle, if all gaussian nodes are receiving input from the same domain, for example as bottom nodes in a vision network, we expect that they should learn the same low level spatial patterns (e.g. horizontal, vertical and diagonal lines and shapes), and thus having the same spread is reasonable.
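Returning briefly to Theorem 8, the three-layer construction (hyperplane deciders, AND-gates per polytope, OR-gates per temporal group) can be sketched with threshold units. This is a hedged illustration with invented toy matrices, not values from a trained node:

```python
def threshold(v):
    # Simple threshold activation with the bias folded into v.
    return 1 if v >= 0 else 0

def decider(a_row, b_entry, x):
    # Layer one: spikes when the row constraint a_row . x <= b_entry holds.
    return threshold(b_entry - sum(ai * xi for ai, xi in zip(a_row, x)))

def and_gate(spikes):
    # Layer two: spikes only when every decider of one polytope spikes.
    return threshold(sum(spikes) - len(spikes))

def or_gate(spikes):
    # Layer three: spikes when at least one polytope of the group matches.
    return threshold(sum(spikes) - 1)

def in_union(polytopes, x):
    # polytopes: list of (A, b) pairs, each A given as a list of rows.
    return or_gate([and_gate([decider(row, bi, x)
                              for row, bi in zip(A, b)])
                    for A, b in polytopes])

# The unit square as one polytope: -x1 <= 0, x1 <= 1, -x2 <= 0, x2 <= 1.
square = ([(-1, 0), (1, 0), (0, -1), (0, 1)], [0, 1, 0, 1])
assert in_union([square], (0.5, 0.5)) == 1
assert in_union([square], (2.0, 0.5)) == 0
```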
At present the spread is a hand-tuned parameter which cannot be learned, so it is therefore often left the same across all nodes. Additionally, a common technique in training such networks is what is called node cloning, which means that you only train a single node per network level, and then just replicate that node when you do classification. Lastly, given that the upper bound in Theorem 7 cannot be significantly improved, this can be seen as evidence that the hierarchical organization of this and other models is a nontrivial construct in terms of computational feasibility. This may also, as mentioned, mean that large gaussian nodes may be decomposed to increase classification speed.

References

[1] G. Wallis and E. T. Rolls. A model of invariant object recognition in the visual system. Prog. Neurobiol., 51.
[2] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nat. Neurosci., 2.
[3] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cyb., 36.
[4] T. S. Lee and D. Mumford. Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A Opt. Image Sci. Vis., 20(7), July.
[5] D. George and J. Hawkins. Towards a mathematical theory of cortical microcircuits. PLoS Comput. Biol., 5(10), 2009.
[6] D. H. Hubel and T. N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. J. Phys., 195.
[7] D. George. How the Brain Might Work: A Hierarchical and Temporal Model for Learning and Recognition. PhD thesis, Stanford University, June 2008.
[8] Numenta Inc. Zeta1 Algorithms Reference, Version 1.5. Available upon request.
[9] T. Dean. A computational model of the cerebral cortex. In Proceedings of AAAI-05, Cambridge, Massachusetts. MIT Press.
More informationIntroduction: The Perceptron
Introduction: The Perceptron Haim Sompolinsy, MIT October 4, 203 Perceptron Architecture The simplest type of perceptron has a single layer of weights connecting the inputs and output. Formally, the perceptron
More information) (d o f. For the previous layer in a neural network (just the rightmost layer if a single neuron), the required update equation is: 2.
1 Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.034 Artificial Intelligence, Fall 2011 Recitation 8, November 3 Corrected Version & (most) solutions
More informationThe error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural
1 2 The error-backpropagation algorithm is one of the most important and widely used (and some would say wildly used) learning techniques for neural networks. First we will look at the algorithm itself
More informationLast update: October 26, Neural networks. CMSC 421: Section Dana Nau
Last update: October 26, 207 Neural networks CMSC 42: Section 8.7 Dana Nau Outline Applications of neural networks Brains Neural network units Perceptrons Multilayer perceptrons 2 Example Applications
More informationQualifying Exam in Machine Learning
Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts
More informationPattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore
Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal
More informationLecture 4: Feed Forward Neural Networks
Lecture 4: Feed Forward Neural Networks Dr. Roman V Belavkin Middlesex University BIS4435 Biological neurons and the brain A Model of A Single Neuron Neurons as data-driven models Neural Networks Training
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationNeural networks. Chapter 20, Section 5 1
Neural networks Chapter 20, Section 5 Chapter 20, Section 5 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 20, Section 5 2 Brains 0 neurons of
More informationCMSC 421: Neural Computation. Applications of Neural Networks
CMSC 42: Neural Computation definition synonyms neural networks artificial neural networks neural modeling connectionist models parallel distributed processing AI perspective Applications of Neural Networks
More informationChapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang
Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check
More informationNeural networks. Chapter 19, Sections 1 5 1
Neural networks Chapter 19, Sections 1 5 Chapter 19, Sections 1 5 1 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 19, Sections 1 5 2 Brains 10
More informationUnit 8: Introduction to neural networks. Perceptrons
Unit 8: Introduction to neural networks. Perceptrons D. Balbontín Noval F. J. Martín Mateos J. L. Ruiz Reina A. Riscos Núñez Departamento de Ciencias de la Computación e Inteligencia Artificial Universidad
More informationArtificial Neural Network and Fuzzy Logic
Artificial Neural Network and Fuzzy Logic 1 Syllabus 2 Syllabus 3 Books 1. Artificial Neural Networks by B. Yagnanarayan, PHI - (Cover Topologies part of unit 1 and All part of Unit 2) 2. Neural Networks
More informationName (NetID): (1 Point)
CS446: Machine Learning Fall 2016 October 25 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains four
More informationCSC Neural Networks. Perceptron Learning Rule
CSC 302 1.5 Neural Networks Perceptron Learning Rule 1 Objectives Determining the weight matrix and bias for perceptron networks with many inputs. Explaining what a learning rule is. Developing the perceptron
More informationMachine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /
Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical
More informationECE521 Lectures 9 Fully Connected Neural Networks
ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance
More informationBACKPROPAGATION. Neural network training optimization problem. Deriving backpropagation
BACKPROPAGATION Neural network training optimization problem min J(w) w The application of gradient descent to this problem is called backpropagation. Backpropagation is gradient descent applied to J(w)
More informationThe XOR problem. Machine learning for vision. The XOR problem. The XOR problem. x 1 x 2. x 2. x 1. Fall Roland Memisevic
The XOR problem Fall 2013 x 2 Lecture 9, February 25, 2015 x 1 The XOR problem The XOR problem x 1 x 2 x 2 x 1 (picture adapted from Bishop 2006) It s the features, stupid It s the features, stupid The
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationIntroduction Biologically Motivated Crude Model Backpropagation
Introduction Biologically Motivated Crude Model Backpropagation 1 McCulloch-Pitts Neurons In 1943 Warren S. McCulloch, a neuroscientist, and Walter Pitts, a logician, published A logical calculus of the
More informationLecture Support Vector Machine (SVM) Classifiers
Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in
More informationArtificial Neural Networks. Part 2
Artificial Neural Netorks Part Artificial Neuron Model Folloing simplified model of real neurons is also knon as a Threshold Logic Unit x McCullouch-Pitts neuron (943) x x n n Body of neuron f out Biological
More informationDeep Feedforward Networks. Sargur N. Srihari
Deep Feedforward Networks Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation
More informationCSC242: Intro to AI. Lecture 21
CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationNeural Networks for Machine Learning. Lecture 2a An overview of the main types of neural network architecture
Neural Networks for Machine Learning Lecture 2a An overview of the main types of neural network architecture Geoffrey Hinton with Nitish Srivastava Kevin Swersky Feed-forward neural networks These are
More informationLinear Classifiers: Expressiveness
Linear Classifiers: Expressiveness Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture outline Linear classifiers: Introduction What functions do linear classifiers express?
More informationAn Analytical Comparison between Bayes Point Machines and Support Vector Machines
An Analytical Comparison between Bayes Point Machines and Support Vector Machines Ashish Kapoor Massachusetts Institute of Technology Cambridge, MA 02139 kapoor@mit.edu Abstract This paper analyzes the
More informationVisual Motion Analysis by a Neural Network
Visual Motion Analysis by a Neural Network Kansai University Takatsuki, Osaka 569 1095, Japan E-mail: fukushima@m.ieice.org (Submitted on December 12, 2006) Abstract In the visual systems of mammals, visual
More informationDeep Learning. Convolutional Neural Network (CNNs) Ali Ghodsi. October 30, Slides are partially based on Book in preparation, Deep Learning
Convolutional Neural Network (CNNs) University of Waterloo October 30, 2015 Slides are partially based on Book in preparation, by Bengio, Goodfellow, and Aaron Courville, 2015 Convolutional Networks Convolutional
More informationHow to make computers work like the brain
How to make computers work like the brain (without really solving the brain) Dileep George a single special machine can be made to do the work of all. It could in fact be made to work as a model of any
More informationDeep Feedforward Networks
Deep Feedforward Networks Yongjin Park 1 Goal of Feedforward Networks Deep Feedforward Networks are also called as Feedforward neural networks or Multilayer Perceptrons Their Goal: approximate some function
More informationLinear discriminant functions
Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative
More informationHierarchical Boosting and Filter Generation
January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationMachine Learning, Fall 2009: Midterm
10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all
More informationINTRODUCTION TO DATA SCIENCE
INTRODUCTION TO DATA SCIENCE JOHN P DICKERSON Lecture #13 3/9/2017 CMSC320 Tuesdays & Thursdays 3:30pm 4:45pm ANNOUNCEMENTS Mini-Project #1 is due Saturday night (3/11): Seems like people are able to do
More informationNeural Networks and the Back-propagation Algorithm
Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely
More informationShort Term Memory and Pattern Matching with Simple Echo State Networks
Short Term Memory and Pattern Matching with Simple Echo State Networks Georg Fette (fette@in.tum.de), Julian Eggert (julian.eggert@honda-ri.de) Technische Universität München; Boltzmannstr. 3, 85748 Garching/München,
More informationTHE MOST IMPORTANT BIT
NEURAL NETWORKS THE MOST IMPORTANT BIT A neural network represents a function f : R d R d 2. Peter Orbanz Applied Data Mining 262 BUILDING BLOCKS Units The basic building block is a node or unit: φ The
More informationNeural Networks. Chapter 18, Section 7. TB Artificial Intelligence. Slides from AIMA 1/ 21
Neural Networks Chapter 8, Section 7 TB Artificial Intelligence Slides from AIMA http://aima.cs.berkeley.edu / 2 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest
More informationConvolutional neural networks
11-1: Convolutional neural networks Prof. J.C. Kao, UCLA Convolutional neural networks Motivation Biological inspiration Convolution operation Convolutional layer Padding and stride CNN architecture 11-2:
More informationCS 4700: Foundations of Artificial Intelligence
CS 4700: Foundations of Artificial Intelligence Prof. Bart Selman selman@cs.cornell.edu Machine Learning: Neural Networks R&N 18.7 Intro & perceptron learning 1 2 Neuron: How the brain works # neurons
More informationNatural Image Statistics
Natural Image Statistics A probabilistic approach to modelling early visual processing in the cortex Dept of Computer Science Early visual processing LGN V1 retina From the eye to the primary visual cortex
More information