Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations

Optimaity of Inference in Hierarchica Coding for Distributed Object-Based Representations Simon Brodeur, Jean Rouat NECOTIS, Département génie éectrique et génie informatique, Université de Sherbrooke, Sherbrooke, Canada Emai: simon.brodeur@usherbrooke.ca, jean.rouat@usherbrooke.ca Abstract Hierarchica approaches for representation earning have the abiity to encode reevant features at mutipe scaes or eves of abstraction. However, most hierarchica approaches expoit ony the ast eve in the hierarchy, or provide a mutiscae representation that hods a significant amount of redundancy. We argue that removing redundancy across the mutipe eves of abstraction is important for an efficient representation of compositionaity in object-based representations. With the perspective of feature earning as a data compression operation, we propose a new greedy inference agorithm for hierarchica sparse coding. Convoutiona matching pursuit with a L 0 -norm constraint was used to encode the input signa into compact and non-redundant codes distributed across eves of the hierarchy. Simpe and compex synthetic datasets of tempora signas were created to evauate the encoding efficiency and compare with the theoretica ower bounds on the information rate for those signas. Empirica evidence have shown that the agorithm is abe to infer near-optima codes for simpe signas. However, it faied for compex signas with strong overapping between objects. We expain the inefficiency of convoutiona matching pursuit that occurred in such case. This brings new insights about the NP-hard optimization probem reated to using L 0 -norm constraint in inferring optimay compact and distributed objectbased representations. I. INTRODUCTION Unsupervised earning is very attractive today because of the arge amount of unannotated data avaiabe from socia media and sensors in inteigent devices (e.g. smartphones, autonomous cars, home automation systems). Manifod earning (embedding) is currenty one of the most popuar forms of earning a atent, ow-dimensiona feature map that describes some tempora or spatia properties of the input signa (for review, see [1]). There is, however, very itte work on evauating the quaity of those representations based on compression criteria. The goa of the paper is to investigate how the notion of data compression from both theoretica and empirica perspectives can hep us design, compare and understand better the earning of embeddings that efficienty represent the input structures. This work is in ine with custering by compression (e.g. []) and minimum description ength (MDL) approaches (e.g. [3], [4]) to feature earning and coding, which contrasts with discriminative approaches. Object-based representations are very efficient to encode spatiay or temporay independent structures of a signa, and are in fact very common in the rea word (e.g. puse-resonance sounds in audio, visua composition in images, syntactic items in anguage). Sparse coding (SC) approaches are we adapted to mode such kind of signas (e.g. [14]). We suggest that dictionary-based feature earning and representation with sparse coding coud be comparabe to genera ossess data compression agorithms such as LZ77 [5]. Those agorithms achieve compression by finding redundant sequences in the data, storing a singe copy in a static or siding window dictionary and then use a reference to this copy to repace the mutipe occurrences in the data. This indexing scheme of the dictionary aows the representation to be sparse and compact, meaning that a singe index (or reference to a dictionary eement) describes a arger segment of the input sequence. Most prior work on dictionary earning and optima data compression considers string sequences with discrete symbos (e.g. text documents, DNA sequences), as they are primariy used for fie compression (e.g. [5]). We rather consider sequences of continuous vaues, and propose a new greedy inference agorithm for hierarchica sparse coding. In this work, we focus on the inference process, i.e. finding the optima sets of sparse coefficients given the signa and earned dictionaries. The goa is to evauate how the hierarchica and compression frameworks can improve object-based representations. II. SPARSE CODING Assume a simpe inear decomposition S = XD. The variabe S defines a set of M signas of dimensionaity N, i.e. S = [s 1,..., s M ] R M N. The variabe D defines a dictionary of K bases of dimensionaity N, i.e. D = [d 1,..., d K ] R K N. The resut is a set of coefficients X = [x 1,..., x M ] R M K that represents the input S. Sparse coding (SC) aims at providing a set of sparse coefficients by the penaization of the L 1 norm of X, as shown in Equation 1 (for review, see [6]). The constant α contros the tradeoff between the reconstruction error and the sparsity of the coefficients. (X, D ) = arg min X,D 1 S XD + α X 1 (1) with constraint d k = 1 for k {1,,..., K} Sparse coding aows to compact the energy of the representation into a few high-vaued coefficients whie etting the remaining coefficients sit near zero. Compression is achieved by imiting the amount of information that needs to be stored to represent those coefficients (e.g. by negecting most owvaued coefficients). Sparse coding can be appied to sequences of variabe engths (e.g. tempora signas) by changing the 978-1-5090-606-9/17/$30 017 IEEE

Passthrough feature maps Higher-eve feature maps Distributed representation no strong correation strong correation feature maps greedy eve-wise inference input Fig. 1. Greedy eve-wise inference scheme for distributed representations in hierarchica sparse coding. When the input S is convoved with the dictionary D 1 at eve and greedy inference (shown as white dots) is appied to seect sparse activations in the feature maps, the first-eve sparse representation X std 1 is created. X std 1 now becomes the input signa for the next eve. The second eve can choose during inference to combine severa ower-eve activations as modeed by the dictionary D and output information in the higher-eve feature maps X std, or forward particuar sparse activations as is with the passthrough feature maps X pass if sparse activations in X std 1 do not correate strongy with D. The same process is repeated for the next eves in the hierarchy. From the output of eve, both passthrough and higher-eve features maps describe the input signa in a distributed way using the non-redundant sets of coefficients {X 1, X, X 3 }. The passthrough feature maps aso aow earning over mutipe scaes. Note that the widths of the dictionaries increase over the hierarchy, so that the considered tempora context increases at each eve and thus aows to encode onger time dependencies between sparse activations. matrix mutipication operator into a discrete convoution 1, as shown in Equation. This adds time shift (transation) invariance properties to the dictionary D, and D R K W can now have tempora ength W < N to represent oca features in the signa. Moreover, the direct penaization of the L 0 -norm of X can be used instead. This is more adapted to compression, since ony the information about non-zero coefficients needs to be stored. The drawback with the L 0 -norm is that it makes the optimization probem NP-hard [7]. (X, D ) = arg min X,D 1 S X D + α X 0 () with constraint d k = 1 for k {1,,..., K} Learning representations and encoding the input at a too restricted tempora ength W does not aow to earn highereve objects that derive from the composition of mutipe oca features. This degrades the abiity to compress information. Thus, we propose to efficienty incorporate a hierarchica aspect into the convoutiona sparse coding framework. III. PROPOSED HIERARCHICAL SPARSE CODING Many of the successes in machine earning that ed to the sub-fied of deep earning is about hierarchica processing. This means stacking in a hierarchy severa eves of representations or processing ayers to buid an abstract representation of the input usefu for a particuar task (e.g. cassification, contro). However, many approaches to representation earning (e.g. [1], [8]) ony consider the output of the ast eve and thus ose crucia information about the compositionaity of objects. This means that fine information is mixed together with the higher-eve, coarse information about objects. It aso scaes inefficienty with the context, as the number of possibe objects typicay increases at an exponentia rate when the tempora or physica context is made arger. Some approaches 1 The definition of convoution from machine earning is used, where the fiter is not time-reversed. This convoution is equa to a cross-correation. (e.g. [9]) aggregate encodings at mutipe independent scaes prior to cassification, but disregard the redundancy across the scaes. An efficient hierarchica processing means encoding objects at the eve at which they naturay appear, rather than re-encode those to propagate the information up in the hierarchy. This is simiar to the mutiresoution decomposition scheme of the discrete waveet transform [10], except that hierarchica sparse coding is not constrained to the timefrequency domain anaysis. This avoids the tradeoff between time and frequency resoutions. Hierarchica sparse coding aso contrasts with other methods such as mutistage vector quantization (e.g. [11]), which encode information maximay at each eve and ony pass the residua information (i.e. the reconstruction error) to the next eve in the hierarchy. These methods do not have the abiity to earn high-eve objects, since the objects are encoded into their most basic constituents at the very first eves, and the residua energy decreases rapidy towards zero. This is because ow-eve constituents are typicay easier to earn and encode. The proposed method for hierarchica processing in a sparse coding framework is described in Equation 3. In this hierarchica decomposition scheme, the reconstruction is based on the sets of coefficients {X 1, X,..., X L } from L eves, with inear composition across eves based on their respective dictionaries {D 1, D,..., D L }. The sets of coefficients at a eves do not share redundant information, which makes the representation compact in terms of data compression. Note that the dictionaries are now mutidimensiona arrays (tensors) of rank 3, i.e. D R K W K 1. The dictionary at eve defines a set of K bases (or convoution fiters) of dimensionaity W K 1, where W is the tempora ength of the bases and K 1 is the number of features from the previous eve. The variabe S now defines a (tempora) sequence of ength N t, with features vectors of dimensionaity N s, i.e. S = [s 1,..., s Nt ] R Nt Ns. The set of coefficients X at eve is a mutidimensiona array of rank 3, i.e. X R Nt K 1 K, that are aso caed feature maps.

Θ = {X 1, X,..., X L, D 1, D,..., D L } ( ( Θ 1 L = arg min Θ S )) X D k =1 k=1 L + α X 0 (3) =1 with constraint d k, = 1 for k {1,,..., K}, {1,,..., L} and operator as a composition of convoutions: L D = D 1 D D L =1 A. Greedy Inference for Hierarchica Sparse Coding The hierarchica inference scheme is iustrated in figure 1. We extended convoutiona matching pursuit (e.g. [14]) and appied it to the hierarchy in a greedy eve-wise manner. At higher eves of the hierarchy (i.e. > 1), we introduce two types of feature maps defined by X pass and X std : passthrough features maps from the previous eves, and standard feature maps obtained directy at eve from the decomposition with dictionary D. The passthrough features maps aow each eve to forward to the next eve the sparse activations that do not correate we with the bases of dictionary D. This can be impemented by considering during matching pursuit a second dictionary D pass R Kpass W K pass containing a singe Kronecker deta function per dictionary basis so that S D pass = S. This corresponds to copying specific sparse activations from the input to output feature maps X pass. K pass = 1 i=1 K i is the number of aggregated features maps (both passthroughs and standard) from the previous eve. To aow earning features over mutipe scaes, the dictionary D at each eve can aso cover the passthrough feature maps of the previous eve, thus increasing its dimensionaity to D R K W (K pass 1 +K 1). The hierarchica sparse coding inference agorithm SOLVE-HSC is described beow. Note that more sophisticated matching pursuit methods can be integrated in SOLVE-MP, such as ow compexity orthogona matching pursuit (LoCOMP) [13]. The constant β aows to contro the aocation of coefficients in the passthrough feature maps. A. Dataset Generation IV. EXPERIMENT Mutipe synthetic datasets were created to evauate the hierarchica encoding performance in terms of compression. The first dataset named dataset-simpe has no overap between objects in the decomposition and the generated signa when sparse activations are composed in time. The decomposition depends ony on the previous eve, and thus the dictionaries do not use the passthrough feature maps. The second dataset named dataset-compex has no such constraints for overap, function SOLVE-HSC(S, {D 1, D,..., D L}, β) Sove for coefficients at first-eve (no passthrough) Note that X pass 1 wi be an empty coefficient set X pass 1, X std 1 SOLVE-MP(S, D 1, 0) Iterate for a higher eves (with passthrough) for = to L do Concatenate passthrough and standard feature maps, then sove with matching pursuit agorithm X pass, X std SOLVE-MP(X pass 1 Xstd 1, D, β) end for Extract output coefficients sets from eve L X 1, X, X L 1 SPLIT(X pass L ) X L X std L Cacuate the overa residua on the input R S ( ( L )) =1 X k=1 D k return {X 1, X,..., X L}, R end function function SOLVE-MP(S, D, β) Create a passthrough dictionary such that S D pass = S D pass Kronecker deta functions R 1 S Initiaize residua X pass, X std Empty coefficient sets n 1 Iterate over( the residua ) unti convergence at given SNR S whie 10 og F < SNR R n thres do F Cacuate convoution with dictionaries C pass, C std R n D pass, R n D Seect best activated basis for both dictionaries, at respective dictionary index k and center position t, d std arg max C pass, arg max C std Compare coefficients from both seected bases d pass if β C pass > C std then Add coefficient to passthrough set X pass X pass {C pass } Remove contribution ( of seected ) basis to residua R n+1 R n C pass d pass k ese Add coefficient to standard set X std X std {C std } Remove contribution of seected basis to residua R n+1 R n ( ) C std d std k end if n n + 1 end whie return {X pass, X std } end function and is usefu to assess the encoding of compositionaity at mutipe scaes with compexity coser to rea-word signas. The first-eve dictionary bases were generated using 1D Perin noise [1]. Hanning windowing was used to make the bases ocaized in time and avoid sharp discontinuities when overaps occur. Higher-eve dictionary bases were generated by uniformy samping a fixed number Nd of activated subbases from the previous eve. Nd corresponds to the decomposition size at eve. The reative random positions of the seected sub-bases were uniformy samped in time domain within the defined tempora ength. The reative random co-

TABLE I SPECIFICATIONS OF THE GENERATED DATASETS. Dataset name L4 Input-eve scaes: 16 64 18 56 dataset-simpe Dictionary sizes (K): 4 8 16 3 Decomposition sizes (N d ): - 3 Input-eve scaes: 3 64 18 56 dataset-compex Dictionary sizes (K): 4 8 16 3 Decomposition sizes (N d ): - 4 3 3 Ampitude 1.5 Ampitude 1.5 Input-eve scae 56 Input-eve scae 56 1.5 0 1000 000 3000 4000 5000 Sampes 1.5 0 1000 000 3000 4000 5000 Sampes Abstraction L4 18 64 16 (a) Dictionaries for dataset-simpe Abstraction L4 18 64 3 (b) Dictionaries for dataset-compex Fig.. Exampe of dictionary bases at each of the 4 eves in the hierarchy that was used to generate the datasets. The bod bars on the right indicate the tempora span of each basis (input-eve scae), in sampes. As the eve increases, the bases become more compex and abstract. efficients of the seected sub-bases were uniformy samped from the interva [, ]. Tabe I gives the specifications of the dictionaries for the two datasets, where the input-eve scaes, sizes of the dictionaries or decomposition sizes vary across eves. Figure shows some bases of the datasets. For dataset-simpe, the tempora signa was generated by concatenating the input-eve representations of randomy chosen bases across a eves (with equa probabiities), since there is no overapping. For dataset-compex, the tempora sequence was generated from the dictionaries using independent homogeneous Poisson processes to sampe sparse activations (events) in the sets of coefficients {X 1, X, X 3, X 4 }. Each dictionary basis was assigned a Poisson process with fixed rate λ = 9 10 4 (in units of events/sampe), chosen so that the generated signa was not temporay too compact or sparse. In both databases, random coefficients for the sparse activations were uniformy samped in the interva [, 4.0]. The ength of the generated signas was 100,000 sampes, with roughy 650 activations distributed across the various eves. Figure 3 shows segments of the datasets. We impemented the hierarchica sparse coding agorithm and dataset generation in Python using the Numpy and Scipy ibraries. B. Evauation Method and Resuts In the experiments, the optima (reference) dictionaries that generated the time series were given to the matching pursuit agorithm. This was to test the optimaity of inference ony, and not dictionary earning. To quantify the performance of the proposed inference agorithm, we evauated the information (a) Segment for dataset-simpe (b) Segment for dataset-compex Fig. 3. Exampe of segments from the datasets. When averaged over a onger time scae, the information rate needed to encode each signa remains constant. rate at the output of the encoder when it is imited to a given maximum eve, and compared with the ower bound I opt for that eve. Equation 4 defines how the optima information rate I opt can be cacuated for both datasets, based on the known coefficient sets that were used to generate the signas, and the known decomposition sizes. Equation 5 describes how the actua information rate I act max can be cacuated from the output coefficients of an encoder at evauation time. Note that there must be toerance on the reconstruction signa-to-noise ratio (SNR) during inference. We assumed 40 db SNR to be sufficient for many types of signas (e.g. images, audio). We used 3 bits foating point precision for the ampitudes of the coefficients (i.e. I fp = 3). I opt = =1 L =+1 X 0 N }{{ t } Activation ( rate at eve I activation at eve ) N X h 0 d N t h =+1 Activation rate transfered from higher-eve h to eve based on decomposition size N h d with I = (og L + og K + og N t ) Indexing bits for eve, basis and time I act max = =1 X 0 N t Activation rate at eve + (4) I max activation at eve I activation at eve + I fp Ampitude bits Figure 4 shows that the weighting factor β can indeed contro the distribution of non-zero coefficients amongst eves. Figure 5 shows that for a hierarchica inference task on dataset-simpe, the framework achieves performance cose to the ower bound of the information rate. It means that mutipe eves of the hierarchy are used effectivey assuming the proper choice of β. If β is too ow, a sparse activations wi be reconstructed from the higher-eve dictionaries, which may not be adapted and thus significanty degrade compression (5)

Distribution ratio 0.9 0.8 0.7 0.6 0.4 0.3 0. 0.1 0.75 0.90 0.00 β weight Leve 4 Leve 3 Leve Leve 1 Fig. 4. Distribution of non-zero coefficients inferred across eves for datasetsimpe, as a function of the weighting factor β. Lower β vaues (e.g. β = 0.75) increased the percentage of higher-eve activations. When β was too high (e.g. β = ), a was eventuay encoded by the first eve ony. Information rate [bit/sampe] 10 8 6 4 Maximum eve max L4 β = 0.75 β = 0.90 β = 0 β =.00 Lower bound 0 0 50 100 150 00 50 300 Maximum input-eve scae (a) dataset-simpe Information rate [bit/sampe] 180 160 140 10 100 80 60 40 0 Maximum eve max L4 β = 0.90 β = 0 β =.00 Lower bound 0 0 50 100 150 00 50 300 Maximum input-eve scae (b) dataset-compex Fig. 5. Optimaity of the hierarchica inference on the datasets as a function of the maximum eve considered. On dataset-simpe (shown on eft), using the weighting factor β = achieved the best I act performance, near the ower bound defined by I opt. Too ow weighting when β = 0.75 eads to ess efficient encoding with eves. Too high weighting when β =.0 eads to no decrease in information rate I act with eves. On dataset-compex (shown on right), the inference faied to find an efficient encoding for a vaues of β because of the imitations of convoutiona matching pursuit. performance. If β is too high, the input wi be encoded by the first eve ony and propagated up in the hierarchy ony by the passthrough feature maps. The same anaysis on dataset-compex ed to very different resuts when we tested the imits of convoutiona matching pursuit, indicating it can strugge on compex signas when overapping between objects occurs frequenty. For a vaues of β tested, we observed an increase in the information rate as more eves are considered at the output of the encoder. It was observed that using L 0 -norm reguarization during inference woud add noise in the residua that needed to be removed ater, and thus required a significant number of bits to encode. We aso observed simiar resuts (not shown) when using the LoCOMP [13] variant of matching pursuit. Aowing dictionary earning in the optimization process (see Equation 3) coud sove this probem, as the higher-eve dictionaries can be adapted to the noise introduced at inference due to matching pursuit. V. CONCLUSION In this work, we proposed a hierarchica sparse coding agorithm to efficienty encode an input signa into compact and non-redundant codes distributed across eves of the hierarchy. The main idea is to represent features at the abstraction eve at which they naturay appear, rather than re-encode those through the higher-eve features of the hierarchy. With the proposed method, it is possibe to empiricay achieve nearoptima representations in terms of signa compression, as indicated by the information rate needed to transmit the output sparse activations. The experiments highighted the important roe of the greedy inference agorithm at each eve. Poor inference such as aocating too much coefficients because of interference effects directy impacts the information rate, as seen with convoutiona matching pursuit. This effect was observed on compex signas, where the compression abiity vanished because of the interference-reated variabiity. Future work shoud aim at soving this probem that is primariy due to the usage of L 0 -norm reguarization. ACKNOWLEDGMENT The authors woud ike to thank the ERA-NET (CHIST- ERA) and FRQNT organizations for funding this research as part of the European IGLU project. REFERENCES [1] Y. Bengio, A. Courvie, and P. Vincent, Representation earning: A review and new perspectives, IEEE Transactions on Pattern Anaysis and Machine Inteigence, vo. 35, no. 8, pp. 1798 188, 013. [] R. Ciibrasi and P Vitányi, Custering by compression, IEEE Transactions on Information Theory, vo. 51, no. 4, pp. 153 1545, 005. [3] A. Barron, J. Rissanen, and B. Yu, The Minimum Description Length Principe in Coding and Modeing, IEEE Transactions on Information Theory, vo. 44, no. 6, pp. 743 760, 1998. [4] I. Ramirez and G. Sapiro, An MDL framework for sparse coding and dictionary earning, IEEE Transactions on Signa Processing, vo. 60, no. 6, pp. 913 97, 01. [5] J. Ziv and A. Lempe, A Universa Agorithm for Sequentia Data Compression, IEEE Transactions on Information Theory, vo. 3, no. 3, pp. 337 343, 1977. [6] B. R. Rubinstein, A. M. Bruckstein, and M. Ead, Dictionaries for Sparse Representation Modeing, Proceedings of the IEEE, vo. 98, no. 6, pp. 1045 1057, 010. [7] B. K. Natarajan, Sparse Approximate Soutions to Linear Systems, SIAM Journa on Computing, vo. 4, no., pp. 7 34, 1995. [8] L. Bo, X. Ren, and D. Fox, Mutipath Sparse Coding Using Hierarchica Matching Pursuit, in IEEE Conference on Computer Vision and Pattern Recognition, pp. 660 667, 013. [9] K. Yu, Y. Lin, and J. Lafferty, Learning image representations from the pixe eve via hierarchica sparse coding, in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1713 170, 011. [10] S. Maat, Mutiresoution Representations and Waveets, Doctora Dissertation, University of Pennsyvania, 1988. [11] B.-H. Juang and A. Gray, Mutipe Stage Vector Quantization for Speech Coding, in IEEE Internationa Conference on Acoustics, Speech, and Signa Processing, no. 3, pp. 597 600, 198. [1] K. Perin, Improving noise, ACM Transactions on Graphics, vo. 1, no. 3, pp. 3, 00. [13] B. Maihe, R. Gribonva, F. Bimbot, and P. Vandergheynst, A ow compexity Orthogona Matching Pursuit for sparse signa approximation with shift-invariant dictionaries, in IEEE Internationa Conference on Acoustics, Speech and Signa Processing, pp. 3445 3448, 009. [14] M. S. Lewicki and T. J. Sejnowski, Coding time-varying signas using sparse, shift-invariant representations, in Advances in Neura Information Processing, pp. 730 736, 1999.